RE: libhdfs install dep

2012-09-25 Thread Leo Leung
Rodrigo,
  Assuming you are asking for hadoop 1.x

  You are missing the hadoop-*libhdfs* rpm.
  Build it or get it from the vendor you got your hadoop from.

 

-Original Message-
From: Pastrana, Rodrigo (RIS-BCT) [mailto:rodrigo.pastr...@lexisnexis.com] 
Sent: Monday, September 24, 2012 8:20 PM
To: 'core-u...@hadoop.apache.org'
Subject: libhdfs install dep

Anybody know why libhdfs.so is not found by package managers on CentOS 64 and 
OpenSuse64? 

I have an rpm which declares Hadoop as a dependency, but the package managers 
(KPackageKit, zypper, etc.) report libhdfs.so as a missing dependency even though 
Hadoop has been installed via rpm package, and libhdfs.so is installed as well. 

Thanks, Rodrigo.



Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Hemanth Yamijala
By Java environment variables, do you mean the ones passed as
-Dkey=value? That's one way of passing them. I suppose another way is
to have a client-side site configuration (like mapred-site.xml) that
is on the classpath of the client app.
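
For reference, a minimal sketch of how such -Dkey=value options reach the job
configuration when the driver goes through ToolRunner (the class name MyDriver
and the property name my.job.nsamples below are made up for illustration):

  // Sketch only: ToolRunner/GenericOptionsParser copy -Dkey=value options
  // into the Configuration handed to run(). "my.job.nsamples" is hypothetical.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      Configuration conf = getConf();   // already contains the -D overrides
      int nSamples = conf.getInt("my.job.nsamples", 10);
      System.out.println("nsamples = " + nSamples);
      // ... build and submit the job using conf ...
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
  }

Invoked as, for example: hadoop jar myjob.jar MyDriver -Dmy.job.nsamples=20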

Thanks
Hemanth

On Tue, Sep 25, 2012 at 12:20 AM, Varad Meru meru.va...@gmail.com wrote:
 Thanks Hemanth,

 But in general, if we want to pass arguments to any job (not only
 PiEstimator from the examples jar) and submit the job to the job queue
 scheduler, it looks like we would always need to use the Java
 environment variables.

 Is my above assumption correct?

 Thanks,
 Varad

 On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala yhema...@gmail.comwrote:

 Varad,

 Looking at the code for the PiEstimator class which implements the
 'pi' example, the two arguments are mandatory and are used *before*
 the job is submitted for execution - i.e. on the client side. In
 particular, one of them (nSamples) is used not by the MapReduce job,
 but by the client code (i.e. PiEstimator) to generate some input.

 Hence, I believe all of this additional work that is being done by the
 PiEstimator class will be bypassed if we directly use the job -submit
 command. In other words, I don't think these two ways of running the
 job:

 - using the hadoop jar examples pi
 - using hadoop job -submit

 are equivalent.

 As a general answer to your question though, if additional parameters
 are used by the Mappers or reducers, then they will generally be set
 as additional job specific configuration items. So, one way of using
 them with the job -submit command will be to find out the specific
 names of the configuration items (from code, or some other
 documentation), and include them in the job.xml used when submitting
 the job.
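
 For example (the property name my.app.num.samples below is hypothetical; the
 real names have to come from the job's code or documentation), such a
 parameter would sit in job.xml next to the standard properties shown further
 down in this thread:

   <property><name>my.app.num.samples</name><value>10</value></property>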

 Thanks
 Hemanth

 On Sun, Sep 23, 2012 at 1:24 PM, Varad Meru meru.va...@gmail.com wrote:
  Hi,
 
  I want to run the PiEstimator example using the following command
 
  $hadoop job -submit pieestimatorconf.xml
 
  which contains all the info required by hadoop to run the job. E.g. the
  input file location, the output file location and other details.
 
 
 <property><name>mapred.jar</name><value>file:Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar</value></property>
  <property><name>mapred.map.tasks</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>2</value></property>
  ...
  <property><name>mapred.job.name</name><value>PiEstimator</value></property>

 <property><name>mapred.output.dir</name><value>file:Users/varadmeru/Work/out</value></property>
 
  Now, as we know, to run the PiEstimator we can also use the following command

  $hadoop jar hadoop-examples-1.0.3.jar pi 5 10

  where 5 and 10 are the arguments to the main class of the PiEstimator. How
  can I pass the same arguments (5 and 10) using the job -submit command,
  through the conf file or some other way, without changing the code of the
  examples to use environment variables?
 
  Thanks in advance,
  Varad
 
  -
  Varad Meru
  Software Engineer,
  Business Intelligence and Analytics,
  Persistent Systems and Solutions Ltd.,
  Pune, India.



Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Bertrand Dechoux
Building on Hemanth's answer: in the end your variables should be in the
job.xml (the second file, along with the jar, needed to run a job). This
job.xml can be built in various ways, but it inherits from your local
configuration and you can change it using the Java API; in the end it is
only an XML file, so your hands are not tied.
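
For instance, a minimal sketch (using the old mapred API; the jar path is the
one from this thread and my.app.num.samples is a hypothetical property) that
builds such a job.xml from the local configuration plus a few overrides:

  // Sketch: write out a job.xml that "hadoop job -submit job.xml" can take.
  // It inherits *-site.xml from the client classpath, then adds overrides.
  import java.io.FileOutputStream;
  import org.apache.hadoop.mapred.JobConf;

  public class WriteJobXml {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();     // picks up the local site configuration
      conf.setJobName("PiEstimator");
      conf.setJar("file:Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar");
      conf.set("my.app.num.samples", "10");   // hypothetical job-specific value
      FileOutputStream out = new FileOutputStream("job.xml");
      conf.writeXml(out);               // Configuration.writeXml dumps it as XML
      out.close();
    }
  }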

I know there is a job file that you can provide with the shell command :
http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#job

But I haven't used it yet, so I can't tell you more about this option.

Regards

Bertrand



-- 
Bertrand Dechoux


Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Mohit Anchlia
You could always write your own properties file and read it as a resource.
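
A minimal sketch of that approach (the file name job-args.properties and the
key nsamples are made up); the values would then be copied into the job
Configuration before submitting:

  // Sketch: read client-side job arguments from a properties file on the classpath.
  import java.io.InputStream;
  import java.util.Properties;

  public class JobArgs {
    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      InputStream in = JobArgs.class.getResourceAsStream("/job-args.properties");
      props.load(in);
      in.close();
      int nSamples = Integer.parseInt(props.getProperty("nsamples", "10"));
      System.out.println("nsamples = " + nSamples);
      // ... e.g. conf.set("my.app.num.samples", String.valueOf(nSamples)); ...
    }
  }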




Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)

2012-09-25 Thread Chen He
Hi Sudha

Good question.

First of all, you need to state clearly what your Hadoop environment is
(pseudo-distributed or a real cluster).

Secondly, you need to understand how Hadoop ships a job's jar file to the
worker nodes: it copies only the job jar itself, which does not contain
jcuda.jar. The MapReduce tasks may not find it even if you specify jcuda.jar
on your worker nodes' classpath.

I suggest you include jcuda.jar inside your wordcount.jar. Then, when Hadoop
copies wordcount.jar to each worker node's temporary working directory, you do
not need to worry about this issue. One way to do that is sketched below.
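
A rough sketch of one way to do that (file names and paths are assumptions):
jars placed under a lib/ directory inside the job jar are unpacked on the
worker nodes and added to the task classpath, so jcuda.jar can ride along
inside wordcount.jar:

  mkdir lib
  cp /path/to/jcuda.jar lib/
  jar uf wordcount.jar lib/jcuda.jar    # wordcount.jar now carries lib/jcuda.jar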

Let me know if you have further questions.

Chen

On Tue, Sep 25, 2012 at 12:38 AM, sudha sadhasivam 
sudhasadhasi...@yahoo.com wrote:

 Sir
 We tried to integrate hadoop and JCUDA.
 We tried a code from


 http://code.google.com/p/mrcl/source/browse/trunk/hama-mrcl/src/mrcl/mrcl/?r=76

 We were able to compile, but we are not able to execute: it does not recognise
 JCUBLAS.jar. We tried setting the classpath.
 We are attaching the procedure we followed, along with the errors.
 Kindly inform us how to proceed. It is our UG project.
 Thanking you
 Dr G sudha Sadasivam

 --- On *Mon, 9/24/12, Chen He airb...@gmail.com* wrote:


 From: Chen He airb...@gmail.com
 Subject: Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)
 To: common-user@hadoop.apache.org
 Date: Monday, September 24, 2012, 9:03 PM


 http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop

 On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets 
 oruchov...@gmail.com
 wrote:

  Hi
 
  I am going to process video analytics using hadoop
  I am very interested about CPU+GPU architercute espessially using CUDA (
  http://www.nvidia.com/object/cuda_home_new.html) and JCUDA (
  http://jcuda.org/)
  Does using HADOOP and CPU+GPU architecture bring significant performance
  improvement and does someone succeeded to implement it in production
  quality?
 
  I didn't find any projects / examples using such technology.
  If someone could give me a link to best practices and examples using
  CUDA/JCUDA + Hadoop, that would be great.
  Thanks in advance
  Oleg.
 




Re: libhdfs install dep

2012-09-25 Thread Brian Bockelman
Hi Rodrigo,

The hadoop RPMs are a bit deficient compared to those you would find from your 
Linux distribution.

For example, look at the Apache RPM you used:

[bbockelm@rcf-bockelman ~]$ rpm -qp 
http://mirrors.sonic.net/apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-1.x86_64.rpm
 --provides
hadoop  
hadoop = 1.0.3-1

Normally, you would expect to see something like this (using the CDH4 
distribution as an example) as it contains a shared library:

[bbockelm@brian-test ~]$ rpm -q --provides hadoop-libhdfs
libhdfs.so.0()(64bit)  
hadoop-libhdfs = 2.0.0+88-1.cdh4.0.0.p0.30.osg.el5
libhdfs.so.0  
hadoop-libhdfs = 2.0.0+88-1.cdh4.0.0.p0.30.osg.el5

Because the Apache RPM does not list itself as providing libhdfs.so.0()(64bit), 
it breaks your automatic RPM dependency detection.
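
Normally rpmbuild's automatic find-provides generates that entry on its own,
provided the packaged .so carries a proper SONAME and automatic provides are
not disabled. A rough, hypothetical spec fragment (not the actual Apache
hadoop.spec; paths are assumptions) for a sub-package carrying the library:

  # Hypothetical sub-package carrying the native library
  %package libhdfs
  Summary: Native C library for HDFS access

  %files libhdfs
  %defattr(-,root,root)
  /usr/lib64/libhdfs.so
  /usr/lib64/libhdfs.so.0*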

[Aside: I know from experience that building a high-quality (as in, follows the 
Fedora Packaging Guidelines) RPM for Java software is incredibly hard, as the 
packaging approaches of the Linux distributions and the Java community are 
incredibly divergent.  Not that the Java approach is inherently wrong; it's 
just different, and it does not translate naturally to RPM.  Accordingly, 
taking Hadoop and making a rule-abiding RPM in Fedora would be hundreds of hours 
of work.  It's one of those things that appear much easier than they are to 
accomplish.]

The Hadoop community is very friendly, and I'm sure they would accept any 
patches to fix this oversight in future releases.

Brian

On Sep 25, 2012, at 7:57 AM, Pastrana, Rodrigo (RIS-BCT) 
rodrigo.pastr...@lexisnexis.com wrote:

 Leo, yes I'm working with hadoop-1.0.1-1.amd64.rpm from Apache's download 
 site.
 The rpm installs libhdfs in /usr/lib64 so I'm not sure why I would need the 
 hadoop-*libhdfs* rpm.
 
 Any idea why the installed /usr/lib64/libhdfs.so is not detected by the 
 package managers?
 
 Thanks, Rodrigo.
 
 -
 The information contained in this e-mail message is intended only
 for the personal and confidential use of the recipient(s) named
 above. This message may be an attorney-client communication and/or
 work product and as such is privileged and confidential. If the
 reader of this message is not the intended recipient or an agent
 responsible for delivering it to the intended recipient, you are
 hereby notified that you have received this document in error and
 that any review, dissemination, distribution, or copying of this
 message is strictly prohibited. If you have received this
 communication in error, please notify us immediately by e-mail, and
 delete the original message.



Re: Python + hdfs written thrift sequence files: lots of moving parts!

2012-09-25 Thread Harsh J
Hi Jay,

This may be off-topic for you, but I feel it's related: use Avro
DataFiles. There's Python support already available, as well as for
several other languages. A sketch of the Java writing side is below.
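
For what it's worth, a minimal Java sketch of writing an Avro data file (the
schema and field names are made up); the standard avro Python package can then
read the same file back with its DataFileReader:

  // Sketch: write an Avro container file from Java; readable from Python too.
  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class WriteAvro {
    public static void main(String[] args) throws Exception {
      Schema schema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"key\",\"type\":\"string\"},"
          + "{\"name\":\"count\",\"type\":\"long\"}]}");
      DataFileWriter<GenericRecord> writer =
          new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, new File("events.avro"));
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("key", "hello");
      rec.put("count", 1L);
      writer.append(rec);
      writer.close();
    }
  }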

On Tue, Sep 25, 2012 at 10:57 PM, Jay Vyas jayunit...@gmail.com wrote:
 Hi guys!

 I'm trying to read some Hadoop-written thrift files in plain old Java
 (without using SequenceFile.Reader).  The reason for this is that I

 (1) want to understand the sequence file format better, and
 (2) would like to be able to port this code to a language which doesn't have
 robust Hadoop sequence file I/O / thrift support (Python). My code looks
 like this:

 So, before reading further, if anyone has:

 1) some general hints on how to create a sequence file with thrift-encoded
 key/values in Python, that would be very useful;
 2) some tips on the generic approach for reading a sequence file (the
 comments in the SequenceFile header seem to be a bit underspecified)

 I'd appreciate it!

 Now, here is my adventure into thrift/hdfs sequence file i/o :

 I've written a simple stub which, I think, should be the start of a
 sequence file reader (it just tries to skip the header and get straight to the
 data).

 But it doesn't handle compression.

 http://pastebin.com/vyfgjML9

 So, this code ^^ appears to fail with a cryptic error: don't know what
 type: 15.

 This error comes from a case statement, which attempts to determine what
 type of thrift record is being read in:
 fail 127 don't know what type: 15

   private byte getTType(byte type) throws TProtocolException {
     switch ((byte)(type & 0x0f)) {
       case TType.STOP:
         return TType.STOP;
       case Types.BOOLEAN_FALSE:
       case Types.BOOLEAN_TRUE:
         return TType.BOOL;
       ...
       case Types.STRUCT:
         return TType.STRUCT;
       default:
         throw new TProtocolException("don't know what type: " +
             (byte)(type & 0x0f));
     }
   }

 Upon further investigation, I have found that the Configuration
 object is (of course) heavily utilized by the SequenceFile reader, in
 particular to determine the codec.  That corroborates my hypothesis that the
 data needs to be decompressed or decoded before it can be deserialized by
 thrift.

 So... I guess what's missing here is that I don't know how to
 manually reproduce the codec/gzip, etc. logic inside of
 SequenceFile.Reader in plain old Java (i.e. without cheating and using the
 SequenceFile.Reader class that is configured in our mapreduce source
 code).

 With my end goal being to read the file in Python, I think it would be nice
 to be able to read the sequence file in Java, and use this as a template
 (since I know that my thrift objects and serialization are working
 correctly in my current Java codebase when read through the
 SequenceFile.Reader API).

 Any suggestions on how I can distill the logic of the SequenceFile.Reader
 class into a simplified version specific to my data, so that I can
 start porting it into a Python script capable of scanning a few real
 sequence files off of HDFS, would be much appreciated!

 In general... what are the core steps for doing I/O with sequence files
 that are compressed and/or serialized in different formats?  Do we
 decompress first and then deserialize, or do both at the same time?
 Thanks!
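
 For comparison, a minimal sketch of the "easy" path in Java, which shows the
 order of operations: SequenceFile.Reader applies the configured codec
 (decompression), and only afterwards are the raw value bytes handed to thrift
 for deserialization. MyThriftRecord and the BytesWritable value layout are
 assumptions about the data, not something taken from this thread:

   // Sketch: the reader decompresses; thrift then deserializes the value bytes.
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.BytesWritable;
   import org.apache.hadoop.io.SequenceFile;
   import org.apache.hadoop.io.Writable;
   import org.apache.hadoop.util.ReflectionUtils;
   import org.apache.thrift.TDeserializer;
   import org.apache.thrift.protocol.TBinaryProtocol;

   public class ReadThriftSeqFile {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       FileSystem fs = FileSystem.get(conf);
       SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
       Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
       BytesWritable value = new BytesWritable();
       TDeserializer deser = new TDeserializer(new TBinaryProtocol.Factory());
       while (reader.next(key, value)) {
         byte[] bytes = new byte[value.getLength()];
         System.arraycopy(value.getBytes(), 0, bytes, 0, value.getLength());
         MyThriftRecord rec = new MyThriftRecord();   // assumed thrift-generated class
         deser.deserialize(rec, bytes);               // decompression already done by the reader
         System.out.println(key + " => " + rec);
       }
       reader.close();
     }
   }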

 PS I've added an issue on GitHub here,
 https://github.com/matteobertozzi/Hadoop/issues/5, for a Python
 SequenceFile reader.  If I get some helpful hints on this thread, maybe I
 can directly implement an example on matteobertozzi's Python Hadoop trunk.

 --
 Jay Vyas
 MMSB/UCHC



-- 
Harsh J


Re: Python + hdfs written thrift sequence files: lots of moving parts!

2012-09-25 Thread Jay Vyas
Thanks, Harsh. In any case, I'm really curious about how sequence
file headers are formatted, as the documentation in the SequenceFile
javadocs seems to be very generic.

To make my questions more concrete:

1) I notice that the FileSplit class has a getStart() function.  It is
documented as returning the place to start processing.  Does that imply
that a FileSplit does, or does not include a header?

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html#getStart%28%29

2) Also, it's not clear to me how compression and serialization are
related. These are two intricately coupled aspects of HDFS file writing,
and I'm not sure what the idiom is for coordinating the compression of
records with their deserialization.


Re: libhdfs install dep

2012-09-25 Thread Harsh J
I'd recommend using the packages for Apache Hadoop from Apache Bigtop
(https://cwiki.apache.org/confluence/display/BIGTOP). The ones
upstream (here) aren't maintained as much these days.

 -
 The information contained in this e-mail message is intended only
 for the personal and confidential use of the recipient(s) named
 above. This message may be an attorney-client communication and/or
 work product and as such is privileged and confidential. If the
 reader of this message is not the intended recipient or an agent
 responsible for delivering it to the intended recipient, you are
 hereby notified that you have received this document in error and
 that any review, dissemination, distribution, or copying of this
 message is strictly prohibited. If you have received this
 communication in error, please notify us immediately by e-mail, and
 delete the original message.



-- 
Harsh J