Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Harsh J
Tom,

What I meant to say was that this is well supported by the existing
API/libraries themselves:

- The class MultipleOutputs supports providing a filename for an
output. See MultipleOutputs.addNamedOutput usage [1].
- The type 'NullWritable' is a special writable that doesn't do
anything. So if it's configured as the key type for the named output
above, and you pass NullWritable.get() as the key in every write
operation, you end up writing only the value part of (key, value).
- This way you do not have to write a custom OutputFormat for your
use-case; a sketch follows below.

[1] - 
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
(Also available for the new API, depending on which
version/distribution of Hadoop you are on)
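
Here is that sketch, assuming the old (mapred) API on 0.20.x. The
class name and the named output "mydata" are made up for illustration,
not from the thread; TextOutputFormat skips NullWritable keys, so only
the value bytes land in the file:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class ValueOnlyExample {

  public static class ValueOnlyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs mos;

    public void configure(JobConf conf) {
      mos = new MultipleOutputs(conf);
    }

    @SuppressWarnings("unchecked")
    public void map(LongWritable key, Text value,
        OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      // NullWritable key: only the value part of (key, value) is written.
      mos.getCollector("mydata", reporter).collect(NullWritable.get(), value);
    }

    public void close() throws IOException {
      mos.close(); // flushes and closes all named-output writers
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ValueOnlyExample.class);
    conf.setMapperClass(ValueOnlyMapper.class);
    conf.setNumReduceTasks(0); // map-only job
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    MultipleOutputs.addNamedOutput(conf, "mydata", TextOutputFormat.class,
        NullWritable.class, Text.class);
    JobClient.runJob(conf);
  }
}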

On Tue, Jul 26, 2011 at 3:36 AM, Tom Melendez  wrote:
> Hi Harsh,
>
> Thanks for the response.  Unfortunately, I'm not following your response.  :-)
>
> Could you elaborate a bit?
>
> Thanks,
>
> Tom
>
> On Mon, Jul 25, 2011 at 2:10 PM, Harsh J  wrote:
>> You can use MultipleOutputs (or MultipleTextOutputFormat for direct
>> key-to-file mapping, but I'd still prefer the stable MultipleOutputs).
>> Your sinking Key can be of NullWritable type, and you can keep passing
>> an instance of NullWritable.get() to it in every cycle. This would
>> write just the value, while the filenames are added/sourced from the
>> key inside the mapper code.
>>
>> That's the way to go if you'd rather not write your own code and
>> maintain it, I s'pose. Your approach is correct as well, if the
>> question was specifically about that.
>>
>> On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez  wrote:
>>> Hi Folks,
>>>
>>> Just doing a sanity check here.
>>>
>>> I have a map-only job, which produces a filename for a key and data as
>>> a value.  I want to write the value (data) into the key (filename) in
>>> the path specified when I run the job.
>>>
>>> The value (data) doesn't need any formatting, I can just write it to
>>> HDFS without modification.
>>>
>>> So, looking at this link (the Output Formats section):
>>>
>>> http://developer.yahoo.com/hadoop/tutorial/module5.html
>>>
>>> Looks like I want to:
>>> - create a new output format
>>> - override write so it doesn't write the key, as I don't want that written
>>> - a new getRecordWriter method that uses the key as the filename and
>>> calls my output format
>>>
>>> Sound reasonable?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>> --
>>> ===
>>> Skybox is hiring.
>>> http://www.skyboximaging.com/careers/jobs
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> ===
> Skybox is hiring.
> http://www.skyboximaging.com/careers/jobs
>



-- 
Harsh J


Submitting and running hadoop jobs Programmatically

2011-07-26 Thread madhu phatak
Hi,
  I am working on an open source project,
Nectar, where
I am trying to create hadoop jobs depending upon user input. I was
using the Java Process API to run the bin/hadoop shell script to submit
the jobs, but that seems a poor approach because the process creation
model is not consistent across different operating systems. Is there a
better way to submit jobs than invoking the shell script? I am using
hadoop-0.21.0, and I am running my program as the same user under which
hadoop is installed. Some older threads said that if I add the
configuration files to the classpath it will work fine, but I am not
able to run it that way. Has anyone tried this before? If so, can you
please give detailed instructions on how to achieve it? Thanks in
advance for your help.

Regards,
Madhukara Phatak


Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread Harsh J
A simple job.submit(…) or JobClient.runJob(jobConf) submits your job
right from the Java API. Does this not work for you? If not, what
error do you face?

Forking out and launching from a system process is a bad idea unless
there's absolutely no other way.
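
For reference, a minimal sketch of the new-API path (the names here
are placeholders, and the mapper/reducer setup is elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/mapred-site.xml etc. from the classpath.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "my-job");
    job.setJarByClass(SubmitExample.class);
    // ... set mapper/reducer/input/output here ...

    job.submit(); // asynchronous: returns once the job is handed off
    // or, instead of submit():
    // job.waitForCompletion(true); // blocks and prints progress
  }
}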

On Tue, Jul 26, 2011 at 3:28 PM, madhu phatak  wrote:
> Hi,
>  I am working on a open source project
> Nectar where
> i am trying to create the hadoop jobs depending upon the user input. I was
> using Java Process API to run the bin/hadoop shell script to submit the
> jobs. But it seems not good way because the process creation model is
> not consistent across different operating systems . Is there any better way
> to submit the jobs rather than invoking the shell script? I am using
> hadoop-0.21.0 version and i am running my program in the same user where
> hadoop is installed . Some of the older thread told if I add configuration
> files in path it will work fine . But i am not able to run in that way . So
> anyone tried this before? If So , please can you give detailed instruction
> how to achieve it . Advanced thanks for your help.
>
> Regards,
> Madhukara Phatak
>



-- 
Harsh J


RE: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread Devaraj K
Hi Madhu,

   You can submit jobs programmatically from any system using the Job
API. The job submission code can be written this way:

 // Create a new Job
 Job job = new Job(new Configuration());
 job.setJarByClass(MyJob.class);

 // Specify various job-specific parameters
 job.setJobName("myjob");

 // Input/output paths are set through the FileInputFormat and
 // FileOutputFormat helpers (Job itself has no setInputPath/setOutputPath).
 FileInputFormat.addInputPath(job, new Path("in"));
 FileOutputFormat.setOutputPath(job, new Path("out"));

 job.setMapperClass(MyJob.MyMapper.class);
 job.setReducerClass(MyJob.MyReducer.class);

 // Submit the job
 job.submit();



To submit this, you need to add the hadoop jar files and configuration
files to the classpath of the application from which you want to submit the
job.

You can refer to these docs for more info on the Job API:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html
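
If the *-site.xml files cannot be put on the classpath, a sketch of
loading them explicitly instead (the paths below are assumptions;
point them at your real conf directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfFromFiles {
  public static Configuration load() {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/home/hadoop/hadoop-0.21.0/conf/core-site.xml"));
    conf.addResource(new Path("/home/hadoop/hadoop-0.21.0/conf/mapred-site.xml"));
    conf.addResource(new Path("/home/hadoop/hadoop-0.21.0/conf/hdfs-site.xml"));
    // Without these resources, a bare Configuration falls back to the
    // local job runner instead of submitting to the cluster.
    return conf;
  }
}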



Devaraj K 

-Original Message-
From: madhu phatak [mailto:phatak@gmail.com] 
Sent: Tuesday, July 26, 2011 3:29 PM
To: common-user@hadoop.apache.org
Subject: Submitting and running hadoop jobs Programmatically

[...]



Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread madhu phatak
Hi
 I am using the same APIs, but I am not able to run the jobs by just adding
the configuration files and jars. It never creates a job in Hadoop; it just
shows "cleaning up staging area" and fails.

On Tue, Jul 26, 2011 at 3:46 PM, Devaraj K  wrote:

> [...]


Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread Harsh J
Madhu,

Do you get a specific error message / stack trace? Could you also
paste your JT logs?

On Tue, Jul 26, 2011 at 4:05 PM, madhu phatak  wrote:
> Hi
>  I am using the same APIs but i am not able to run the jobs by just adding
> the configuration files and jars . It never create a job in Hadoop , it just
> shows cleaning up staging area and fails.
>
> [...]



-- 
Harsh J


Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread madhu phatak
I am using JobControl.addJob() to add a job, running the JobControl in
a separate thread, and using JobControl.allFinished() to see whether all
jobs have completed. Does this work the same as Job.submit()?
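
For concreteness, a sketch of that pattern with the old (mapred)
jobcontrol API; the group name is arbitrary and the JobConf is assumed
to be fully configured:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlExample {
  public static void runAll(JobConf jobConf) throws Exception {
    JobControl control = new JobControl("my-group");
    control.addJob(new Job(jobConf, null)); // wrap the JobConf; no depending jobs

    Thread runner = new Thread(control); // JobControl implements Runnable
    runner.setDaemon(true);
    runner.start();

    while (!control.allFinished()) { // poll until every job is done
      Thread.sleep(1000);
    }
    control.stop();
  }
}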

On Tue, Jul 26, 2011 at 4:08 PM, Harsh J  wrote:

> Madhu,
>
> Do you get a specific error message / stack trace? Could you also
> paste your JT logs?
>
> [...]
>
>
>
> --
> Harsh J
>


Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread Harsh J
Yes. Internally, it calls regular submit APIs.

On Tue, Jul 26, 2011 at 4:32 PM, madhu phatak  wrote:
> I am using JobControl.add() to add a job and running job control in
> a separate thread and using JobControl.allFinished() to see all jobs
> completed or not . Is this work same as Job.submit()??
>
> [...]



-- 
Harsh J


RE: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread Devaraj K
Madhu,

 Can you check the client logs to see whether any error/exception occurs
while submitting the job?

Devaraj K 

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Tuesday, July 26, 2011 5:01 PM
To: common-user@hadoop.apache.org
Subject: Re: Submitting and running hadoop jobs Programmatically

Yes. Internally, it calls regular submit APIs.

On Tue, Jul 26, 2011 at 4:32 PM, madhu phatak  wrote:
> I am using JobControl.add() to add a job and running job control in
> a separate thread and using JobControl.allFinished() to see all jobs
> completed or not . Is this work same as Job.submit()??
>
> [...]



-- 
Harsh J



Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Tom Melendez
Hi Harsh,

Cool, thanks for the details.  For anyone interested, with your tip
and description I was able to find an example in the "Hadoop in
Action" book (Chapter 7, p. 168).

Another question, though: it doesn't look like MultipleOutputs will
let me control the filename in a per-key (per-map) manner. So,
basically, if my map receives a key of "mykey", I want my file to be
"mykey-someotherstuff.foo" (this is a binary file). Am I right about
this?

Thanks,

Tom

On Tue, Jul 26, 2011 at 1:34 AM, Harsh J  wrote:
> [...]



-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Multiple Output Formats

2011-07-26 Thread Roger Chen
Hi all,

I am attempting to implement MultipleOutputFormat to write data to multiple
files dependent on the output keys and values. Can somebody provide a
working example with how to implement this in Hadoop 0.20.2?

Thanks!

-- 
Roger Chen
UC Davis Genome Center


RE: Hadoop-streaming using binary executable c program

2011-07-26 Thread Daniel Yehdego

Good afternoon Bobby,

Thanks so much, it is working excellently now, and the speed is also
reasonable. Once again, thank you.

Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 25 Jul 2011 14:47:34 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> This is likely to be slow and it is not ideal.  The ideal would be to modify 
> pknotsRG to be able to read from stdin, but that may not be possible.
> 
> The shell script would probably look something like the following
> 
> #!/bin/sh
> # Buffer all of stdin into a temp file, then hand that file to pknotsRG.
> rm -f temp.txt;
> while read line
> do
>   echo $line >> temp.txt;
> done
> exec pknotsRG temp.txt;
> 
> Place it in a file say hadoopPknotsRG  Then you probably want to run
> 
> chmod +x hadoopPknotsRG
> 
> After that you want to test it with
> 
> hadoop fs -cat 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
> ./hadoopPknotsRG
> 
> If that works then you can try it with Hadoop streaming
> 
> HADOOP_HOME$ bin/hadoop jar 
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> /user/yehdego/RF-out -reducer NONE -verbose
> 
> --Bobby
> 
> On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:
> 
> 
> 
> Good afternoon Bobby,
> 
> Thanks, you gave me a great help in finding out what the problem was. After I 
> put the command line you suggested me, I found out that there was a 
> segmentation error.
> The binary executable program pknotsRG only reads a file with a sequence in 
> it. This means, there should be a shell script, as you have said, that will 
> take the data coming
> from stdin and write it to a temporary file. Any idea on how to do this job 
> in shell script. The thing is I am from a biology background and don't have 
> much experience in CS.
> looking forward to hear from you. Thanks so much.
> 
> Regards,
> 
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
> 
> > From: ev...@yahoo-inc.com
> > To: common-user@hadoop.apache.org
> > Date: Fri, 22 Jul 2011 12:39:08 -0700
> > Subject: Re: Hadoop-streaming using binary executable c program
> >
> > I would suggest that you do the following to help you debug.
> >
> > hadoop fs -cat 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 
> > | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
> >
> > This is simulating what hadoop streaming is doing.  Here we are taking the 
> > first 2 lines out of the input file and feeding them to the stdin of 
> > pknotsRG.  The first step is to make sure that you can get your program to 
> > run correctly with something like this.  You may need to change the command 
> > line to pknotsRG to get it to read the data it is processing from stdin, 
> > instead of from a file.  Alternatively you may need to write a shell script 
> > that will take the data coming from stdin.  Write it to a file and then 
> > call pknotsRG on that temporary file.  Once you have this working then you 
> > should try it again with streaming.
> >
> > --Bobby Evans
> >
> > On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:
> >
> >
> >
> > Hi Bobby, Thanks for the response.
> >
> > After I tried the following comannd:
> >
> > bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
> > /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
> > /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> > /user/yehdego/RF-out - verbose
> >
> > I got a stderr logs :
> >
> > java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> > failed with code 139
> > at 
> > org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> > at 
> > org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> > at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> > at 
> > org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> >
> >
> > syslog logs
> >
> > 2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> > Initializing JVM Metrics with processName=MAP, sessionId=
> > 2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: 
> > numReduceTasks: 0
> > 2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: 
> > PipeMapRed exec 
> >

Re: Multiple Output Formats

2011-07-26 Thread Ayon Sinha
package com.shopkick.util;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;


public class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {

@Override
protected String generateFileNameForKeyValue(Text key, Text value,
String name) {
// Route each record to a directory named after its key, keeping
// the default leaf name (e.g. part-00000).
return key.toString()+"/"+name;
}

}
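
To wire it into a job, a driver-side sketch assuming an old-API
JobConf (the driver class name is made up; MultiFileOutput is the
class above):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class MultiFileOutputDriver {
  public static void configure(JobConf conf) {
    conf.setOutputFormat(MultiFileOutput.class); // the custom format above
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
  }
}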


 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.




From: Roger Chen 
To: common-user@hadoop.apache.org
Sent: Tuesday, July 26, 2011 9:11 AM
Subject: Multiple Output Formats

[...]

Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread A Df
Dear All:

I am trying to run Hadoop on Windows 7 so as to test programs before moving to
Unix/Linux. I have downloaded Hadoop 0.20.2 and Eclipse 3.6 because I want to
use the plugin, and I am also using cygwin. I set the JAVA_HOME environment
variable and added c:\cygwin\bin;c:\cygwin\usr\bin to the PATH variable, but I
still get the error below when trying to start Hadoop. This is based on the
instructions to edit the file conf/hadoop-env.sh to define at least JAVA_HOME
as the root of your Java installation, which I changed to
"export JAVA_HOME=/cygdrive/c/Program\ Files\ \(x86\)/Java/jdk1.6.0_26" with no
success. I added the \ to escape special characters.


Error:
bin/hadoop: line 258: /cygdrive/c/Program: No such file or directory


I also wanted to find out which is the stable release of Hadoop, and which
version of Eclipse and the plugin I should use. So far almost every tutorial I
have seen from Googling shows different versions, like on:
http://developer.yahoo.com/hadoop/tutorial/index.html
OR
http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html

In Eclipse, the WordCount project has 42 errors because it will not recognize
the "import org.apache.." statements in the code.


I wanted to test on Windows first to get a feel for Hadoop, since I am new to
it and also a newbie Unix/Linux user. I have been trying to follow the
tutorials at the links above, but each time I run into errors with the plugin,
with the imports not being recognized, or with JAVA_HOME not being set. Please
can I get some help? Thanks

Cheers
A Df


Re: Multiple Output Formats

2011-07-26 Thread Harsh J
Roger,

Beyond Ayon's example answer, I'd like you to note that the newer API
will *not* carry a supported MultipleOutputFormat as it has been
obsoleted away in favor of MultipleOutputs, whose use is much easier,
is threadsafe, and also carries an example to look at, at [1].

[1] - 
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

On Tue, Jul 26, 2011 at 9:41 PM, Roger Chen  wrote:
> [...]



-- 
Harsh J


Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Harsh J
Tom,

You can theoretically add any number of named outputs from a single
task, even from within the map() calls (addNamedOutput and
addMultiNamedOutput check within themselves for dupes, so you don't have
to). So yes, you can keep adding outputs and using them per key, and
given your earlier details of how many that's going to be, I think MO
would behave just fine with its cache of record writers.

Regarding your other question, there are certain restrictions to the
names provided to MultipleOutputs as a named output. Specifically,
they accept only [A-Za-z0-9] and auto-include an "_" if you are using
multi-named outputs. These may be going away in the future (0.23+) to
allow for more flexible naming, however.
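
For instance, a sketch of the multi-named-output route under the old
(mapred) API; the base name "seg" and the helper shape are made up,
and both the base name and the per-key part must obey the [A-Za-z0-9]
restriction above:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class PerKeyNames {
  // Driver side: declare one multi-named output.
  public static void configure(JobConf conf) {
    MultipleOutputs.addMultiNamedOutput(conf, "seg", TextOutputFormat.class,
        NullWritable.class, Text.class);
  }

  // Mapper side (mos built from the JobConf in configure()): files come
  // out named like seg_mykey-m-00000.
  @SuppressWarnings("unchecked")
  public static void writeFor(MultipleOutputs mos, String keyPart, Text data,
      Reporter reporter) throws IOException {
    mos.getCollector("seg", keyPart, reporter).collect(NullWritable.get(), data);
  }
}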

On Tue, Jul 26, 2011 at 9:21 PM, Tom Melendez  wrote:
> Hi Harsh,
>
> Cool, thanks for the details.  For anyone interested, with your tip
> and description I was able to find an example inside the "Hadoop in
> Action" (Chapter 7, p168) book.
>
> Another question, though, it doesn't look like MultipleOutputs will
> let me control the filename in a per-key (per map) manner.  So,
> basically, if my map receives a key of "mykey", I want my file to be
> "mykey-someotherstuff.foo" (this is a binary file).  Am I right about
> this?
>
> Thanks,
>
> Tom
>
> [...]



-- 
Harsh J


Re: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread James Seigel
Try using VirtualBox/VMware and downloading either an image that has hadoop on
it, or a linux image and installing hadoop there.

Good luck
James.


On 2011-07-26, at 12:33 PM, A Df wrote:

> [...]



Re: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread Harsh J
A Df,

(Inlines)

On Wed, Jul 27, 2011 at 12:03 AM, A Df  wrote:
> Dear All:
>
> I am trying to run Hadoop on Windows 7 so as to test programs before moving 
> to Unix/Linux. I have downloaded the Hadoop 0.20.2 and Eclipse 3.6 because I 
> want to use the plugin. I am also using cygwin. However, I set the 
> environment variable for JAVA_HOME and added the 
> c:\cygwin\bin;c:\cygwin\usr\bin to the PATH variable but I still get the 
> error below when trying to start the Hadoop. This is based on the 
> instructions to edit the file conf/hadoop-env.sh to define at least JAVA_HOME 
> to be the root of your Java installation which I changed to "export 
> JAVA_HOME=/cygdrive/c/Program\ Files\ \(x86\)/Java/jdk1.6.0_26" with no 
> success. I added the \ to escape special characters.

This looks fine.

> Error:
> bin/hadoop: line 258: /cygdrive/c/Program: No such file or directory

I've encountered this before; it's harmless. Your problem lies
elsewhere. Is the Cygwin bin/ directory on your Windows path? What
errors do you get on submit?

> I also wanted to find out which is the stable release of Hadoop and which 
> version of Eclipse and the Plugin should I use? So far almost every tutorial 
> I have seen from Googling shows different versions like on:
> http://developer.yahoo.com/hadoop/tutorial/index.html
> OR
> http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
>
> In Eclipse the WordCount project has 42 errors because it will not recognize 
> the "import org.apache.." in the code.

The last version I'd heard of that shipped a no-complaints, fully-working
eclipse plugin was Hadoop 0.20.2 (although the stable release is
0.20.203; I've seen lots of eclipse plugin issues pop up from members
on the ML, but someone else can comment better on whether it's fixed for
204 or is a non-issue). I've used this one personally on Windows myself
and things work. I think there was just one issue one could encounter
somehow, and I'd covered it in a blog post some time ago, here:
http://www.harshj.com/2010/07/18/making-the-eclipse-plugin-work-for-hadoop/

Beyond that, the tutorial at v-lad.org is the one I'd recommend
following. It has worked well for me over time.

> I wanted to test on Windows first to get a feel of Hadoop since I am new to 
> it and also because I am newbie Unix/Linux user. I have been trying to follow 
> the tutorials shown at the link above but each time I run into errors with 
> the plugin or not recognizing the import or JAVA_HOME not set. Please can I 
> get some help. Thanks

I'd say use Linux when/where possible. A VM is a good choice as well,
as James pointed out above, if your hardware can handle it.

Also check out Karmasphere's community edition tools @
http://karmasphere.com/Download/register-for-community-edition.html --
They require registration but were good to start with when I tried
them ~1 year ago. They should be better now. Not sure if they are F/OSS,
but they are surely good tools.

-- 
Harsh J
Get CDH and more: http://www.cloudera.com/hadoop


RE: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread Eric Payne
Hi A Df,

I haven't set up Hadoop under cygwin, but I use cygwin a lot.

One thing I would suggest is to use the bash shell in cygwin and use the 
following format for the $PATH additions:
PATH=$PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin

My understanding is that the stable version of Hadoop is 0.20.203.

Thanks,
-Eric

> -Original Message-
> From: A Df [mailto:abbey_dragonfor...@yahoo.com]
> Sent: Tuesday, July 26, 2011 1:34 PM
> To: common-user@hadoop.apache.org
> Subject: Cygwin not working with Hadoop and Eclipse Plugin
> 
> [...]


Re: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread Eric Fiala
A Df,
Try reinstalling Java to a friendlier location (without spaces) - c:\java
rather than c:\Program Files. From the error message, it appears the script
is splitting on the space ~ I've encountered this very same problem.

JAVA_HOME to be the root of your Java installation which I changed to
> "export JAVA_HOME=/cygdrive/c/Program\ Files\ \(x86\)/Java/jdk1.6.0_26" with
> no success. I added the \ to escape special characters.


 EF



On Tue, Jul 26, 2011 at 12:33 PM, A Df  wrote:

> [...]
>



-- 
*Eric Fiala*
*Fiala Consulting*
T: 403.828.1117
E: e...@fiala.ca
http://www.fiala.ca


Re: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread A Df
Harsh:

See inline (at the **). I hope it's easy to follow; as for the other
responses, I was not sure how to reply so that everything ends up in one
message. Sorry for top posting!


Eric, where would I put the line below? Please explain in newbie terms,
thanks:
PATH=$PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin

EF:
I will try to reinstall Java, and yes, the spaces give problems. :(


Cheers,
A Df


>
>From: Harsh J 
>To: common-user@hadoop.apache.org; A Df 
>Sent: Tuesday, 26 July 2011, 20:25
>Subject: Re: Cygwin not working with Hadoop and Eclipse Plugin
>
>A Df,
>
>(Inlines)
>
>On Wed, Jul 27, 2011 at 12:03 AM, A Df  wrote:
>> Dear All:
>>
>> I am trying to run Hadoop on Windows 7 so as to test programs before moving 
>> to Unix/Linux. I have downloaded the Hadoop 0.20.2 and Eclipse 3.6 because I 
>> want to use the plugin. I am also using cygwin. However, I set the 
>> environment variable for JAVA_HOME and added the 
>> c:\cygwin\bin;c:\cygwin\usr\bin to the PATH variable but I still get the 
>> error below when trying to start the Hadoop. This is based on the 
>> instructions to edit the file conf/hadoop-env.sh to define at least 
>> JAVA_HOME to be the root of your Java installation which I changed to 
>> "export JAVA_HOME=/cygdrive/c/Program\ Files\ \(x86\)/Java/jdk1.6.0_26" with 
>> no success. I added the \ to escape special characters.
>
>This looks fine.
>
>> Error:
>> bin/hadoop: line 258: /cygdrive/c/Program: No such file or directory
>
>I've encountered this before, its harmless. Your problem lies
>elsewhere. Is the Cygwin's bin/ directory on your Windows path? What
>errors do you get on submit?
>
>> I also wanted to find out which is the stable release of Hadoop and which 
>> version of Eclipse and the Plugin should I use? So far almost every tutorial 
>> I have seen from Googling shows different versions like on:
>> http://developer.yahoo.com/hadoop/tutorial/index.html
>> OR
>> http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
>>
>> In Eclipse the WordCount project has 42 errors because it will not recognize 
>> the "import org.apache.." in the code.
>
>The last known version I'd heard had a no-complains, fully-working
>eclipse plugin along with it was Hadoop 0.20.2 (although stable is
>203, I've seen lots of issues pop up with eclipse plugin from members
>on the ML, but someone else can comment better on if its fixed for 204
>or is a non-issue). I've used this one personally on Windows myself
>and things work. I think there was just one issue one could encounter
>somehow and I'd covered it in a blog post some time ago, here:
>http://www.harshj.com/2010/07/18/making-the-eclipse-plugin-work-for-hadoop/
>
>** I tried to use the patch but my cygwin gives the error: "bash: patch: 
>command not found"
>
>Beyond that, the tutorial at v-lad.org is the one I'd recommend
>following. It has worked well for me over time.
>
>** yes, the screenshots and instructions are easy to follow just that I seem 
>to always have a problem with the plugin or cygwin
>
>> I wanted to test on Windows first to get a feel of Hadoop since I am new to 
>> it and also because I am newbie Unix/Linux user. I have been trying to 
>> follow the tutorials shown at the link above but each time I run into errors 
>> with the plugin or not recognizing the import or JAVA_HOME not set. Please 
>> can I get some help. Thanks
>
>I'd say use Linux when/where possible. A VM is a good choice as well,
>as James pointed out above, if your hardware can handle it.
>
>** Harsh and James, I tried the vmware from the Yahoo tutorial but I had 
>problems with the plugin too.
>
>Also checkout the Karmasphere's community edition tools @
>http://karmasphere.com/Download/register-for-community-edition.html --
>They require registration but were good to start with when I'd tried
>them ~1 year ago. Should be better now. Not sure if they are F/OSS but
>surely good tools.
>
>** looks good, thanks I will give it a try tomorrow
>-- 
>Harsh J
>Get CDH and more: http://www.cloudera.com/hadoop
>
>
>

Re: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread Harsh J
A Df,

On Wed, Jul 27, 2011 at 1:42 AM, A Df  wrote:
> Harsh:
>
> See (inline at the **) I hope its easy to follow and for the other responses, 
> I was not sure how to respond to get everything into one. Sorry for top 
> posting!

Np! I don't strongly enforce a style of reply so long as it is
visible, and readable :)

>
> Eric where would I put the line below and explain in newbie terms, thanks:
> PATH=$PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin

You'd set this in your Windows environment. A good guide (googled
link): 
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx

>>The last known version I'd heard had a no-complains, fully-working
>>eclipse plugin along with it was Hadoop 0.20.2 (although stable is
>>203, I've seen lots of issues pop up with eclipse plugin from members
>>on the ML, but someone else can comment better on if its fixed for 204
>>or is a non-issue). I've used this one personally on Windows myself
>>and things work. I think there was just one issue one could encounter
>>somehow and I'd covered it in a blog post some time ago, here:
>>http://www.harshj.com/2010/07/18/making-the-eclipse-plugin-work-for-hadoop/
>>
>>** I tried to use the patch but my cygwin gives the error: "bash: patch: 
>>command not found"

I feared you might face that. You need to install the patch program from
Cygwin's package manager/installer. I believe the package name is
(iirc): patchutils

>>Beyond that, the tutorial at v-lad.org is the one I'd recommend
>>following. It has worked well for me over time.
>>
>>** yes, the screenshots and instructions are easy to follow just that I seem 
>>to always have a problem with the plugin or cygwin

What specific error do you get when you load the plugin or start the
daemons via the cygwin shell, etc.? It's easier for folks to answer if they
see an error message, or a stacktrace.

>>> I wanted to test on Windows first to get a feel of Hadoop since I am new to 
>>> it and also because I am newbie Unix/Linux user. I have been trying to 
>>> follow the tutorials shown at the link above but each time I run into 
>>> errors with the plugin or not recognizing the import or JAVA_HOME not set. 
>>> Please can I get some help. Thanks
>>
>>I'd say use Linux when/where possible. A VM is a good choice as well,
>>as James pointed out above, if your hardware can handle it.
>>
>>** Harsh and James, I tried the vmware from the Yahoo tutorial but I had 
>>problems with the plugin too.

You can set up a raw Linux VM and install stuff atop it. I've had better
success with the VMs Cloudera offers:
https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM
(They come ready with the whole stack.) But basically it all boils
down to using a Linux VM, wherever you source it from.

-- 
Harsh J


Re: Multiple Output Formats

2011-07-26 Thread Roger Chen
The problem I'm facing right now is the configuration needed for
MultipleOutputs: JobConf is deprecated now, and I am unable to do the
equivalent with Configuration. I set the configuration of the job by:

 Job job = new Job(getConf());

but when I try to use this line in my config:

 MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
 LongWritable.class, Text.class);

I get an error about no suitable method being found.

Roger

On Tue, Jul 26, 2011 at 12:00 PM, Harsh J  wrote:

> Roger,
>
> Beyond Ayon's example answer, I'd like you to note that the newer API
> will *not* carry a supported MultipleOutputFormat as it has been
> obsoleted away in favor of MultipleOutputs, whose use is much easier,
> is threadsafe, and also carries an example to look at, at [1].
>
> [1] -
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>
> [...]
>
>
> --
> Harsh J
>



-- 
Roger Chen
UC Davis Genome Center


Re: Multiple Output Formats

2011-07-26 Thread Harsh J
Gotcha, my bad then. The hadoop distribution I use provides a
backported MO, so I overlooked this particular issue while replying.

Still, the warning holds as the versions roll ahead. But I
believe the refactor would not be that much of a pain, so perhaps it's
a no-worry.
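
For what it's worth, the old-API mapred.lib.MultipleOutputs.addNamedOutput
overload expects a JobConf rather than a plain Configuration, which would
explain the "no suitable method" error. A sketch that compiles against the
old API (the driver class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MyDriver {
  public static JobConf buildConf(Configuration base) {
    // Wrap the plain Configuration in a JobConf; the old-API
    // MultipleOutputs.addNamedOutput only accepts a JobConf.
    JobConf conf = new JobConf(base, MyDriver.class);
    MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
        LongWritable.class, Text.class);
    return conf;
  }
}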

On Wed, Jul 27, 2011 at 2:00 AM, Roger Chen  wrote:
> The problem I'm facing right now is with the configuration needed for
> MultipleOutputs, because JobConf is deprecated now and I am unable to do its
> equivalent with Configuration. I set the configuration of the job by:
>
>  Job job = new Job(getConf());
>
> but when I'm trying to use this line in my config:
>
>  MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
>  LongWritable.class, Text.class);
>
> I get an issue about no suitable method being found.
>
> Roger
>
> [...]



-- 
Harsh J


Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread madhu phatak
Hi
I am submitting the job as follows

java -cp
 
Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/*
com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv
kkk11fffrrw 1

I get the log in CLI as below

11/07/27 10:22:54 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/07/27 10:22:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/07/27 10:22:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
11/07/27 10:22:54 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
11/07/27 10:22:54 INFO mapreduce.JobSubmitter: Cleaning up the staging area
file:/tmp/hadoop-hadoop/mapred/staging/hadoop-1331241340/.staging/job_local_0001

It doesn't create any job in hadoop.

On Tue, Jul 26, 2011 at 5:11 PM, Devaraj K  wrote:

> Madhu,
>
>  Can you check the client logs, whether any error/exception is coming while
> submitting the job?
>
> Devaraj K
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Tuesday, July 26, 2011 5:01 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Submitting and running hadoop jobs Programmatically
>
> Yes. Internally, it calls regular submit APIs.
>
> On Tue, Jul 26, 2011 at 4:32 PM, madhu phatak 
> wrote:
> > I am using JobControl.add() to add a job and running job control in
> > a separate thread and using JobControl.allFinished() to see all jobs
> > completed or not . Is this work same as Job.submit()??
> >
> > On Tue, Jul 26, 2011 at 4:08 PM, Harsh J  wrote:
> >
> >> Madhu,
> >>
> >> Do you get a specific error message / stack trace? Could you also
> >> paste your JT logs?
> >>
> >> On Tue, Jul 26, 2011 at 4:05 PM, madhu phatak 
> >> wrote:
> >> > Hi
> >> >  I am using the same APIs but i am not able to run the jobs by just
> >> adding
> >> > the configuration files and jars . It never create a job in Hadoop ,
> it
> >> just
> >> > shows cleaning up staging area and fails.
> >> >
> >> > On Tue, Jul 26, 2011 at 3:46 PM, Devaraj K 
> wrote:
> >> >
> >> >> Hi Madhu,
> >> >>
> >> >>   You can submit the jobs using the Job API's programmatically from
> any
> >> >> system. The job submission code can be written this way.
> >> >>
> >> >> // Create a new Job
> >> >> Job job = new Job(new Configuration());
> >> >> job.setJarByClass(MyJob.class);
> >> >>
> >> >> // Specify various job-specific parameters
> >> >> job.setJobName("myjob");
> >> >>
> >> >> FileInputFormat.addInputPath(job, new Path("in"));
> >> >> FileOutputFormat.setOutputPath(job, new Path("out"));
> >> >>
> >> >> job.setMapperClass(MyJob.MyMapper.class);
> >> >> job.setReducerClass(MyJob.MyReducer.class);
> >> >>
> >> >> // Submit the job
> >> >> job.submit();
> >> >>
> >> >>
> >> >>
> >> >> For submitting this, need to add the hadoop jar files and
> configuration
> >> >> files in the class path of the application from where you want to
> submit
> >> >> the
> >> >> job.
> >> >>
> >> >> You can refer this docs for more info on Job API's.
> >> >>
> >> >>
> >>
>
> http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred
> >> >> uce/Job.html
> >> >>
> >> >>
> >> >>
> >> >> Devaraj K
> >> >>
> >> >> -Original Message-
> >> >> From: madhu phatak [mailto:phatak@gmail.com]
> >> >> Sent: Tuesday, July 26, 2011 3:29 PM
> >> >> To: common-user@hadoop.apache.org
> >> >> Subject: Submitting and running hadoop jobs Programmatically
> >> >>
> >> >> Hi,
> >> >>  I am working on a open source project
> >> >> Nectar where
> >> >> i am trying to create the hadoop jobs depending upon the user input.
> I
> >> was
> >> >> using Java Process API to run the bin/hadoop shell script to submit
> the
> >> >> jobs. But it seems not good way because the process creation model is
> >> >> not consistent across different operating systems . Is there any
> better
> >> way
> >> >> to submit the jobs rather than invoking the shell script? I am using
> >> >> hadoop-0.21.0 version and i am running my program in the same user
> where
> >> >> hadoop is installed . Some of the older thread told if I add
> >> configuration
> >> >> files in path it will work fine . But i am not able to run in that
> way
> .
> >> So
> >> >> anyone tried this before? If So , please can you give detailed
> >> instruction
> >> >> how to achieve it . Advanced thanks for your help.
> >> >>
> >> >> Regards,
> >> >> Madhukara Phatak
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >
>
>
>
> --
> Harsh J
>
>


Build Hadoop 0.20.2 from source

2011-07-26 Thread Vighnesh Avadhani
Hi,

I want to build Hadoop 0.20.2 from source using the Eclipse IDE. Can anyone
help me with this?

Regards,
Vighnesh


Re: Build Hadoop 0.20.2 from source

2011-07-26 Thread Uma Maheswara Rao G 72686
Hi Vighnesh,

Step 1) Download the code base from the Apache SVN repository.
Step 2) In the root folder you will find the build.xml file. From that
folder, execute: a) ant and b) ant eclipse

This will generate the Eclipse project settings files.

After this, you can import the project directly into Eclipse.

Regards,
Uma
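
A minimal sketch of those steps from a shell; the SVN URL and tag name
below are assumptions (check the Apache repository for your version):

  svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.2/ hadoop-0.20.2
  cd hadoop-0.20.2
  ant            # compiles the tree
  ant eclipse    # generates the .project/.classpath files for Eclipse

Then, in Eclipse, use File > Import > Existing Projects into Workspace
and point it at the checkout directory.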

**
 This email and its attachments contain confidential information from HUAWEI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained here in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
email in error, please notify the sender by phone or email immediately and 
delete it!
 
*

- Original Message -
From: Vighnesh Avadhani 
Date: Wednesday, July 27, 2011 11:08 am
Subject: Build Hadoop 0.20.2 from source
To: common-user@hadoop.apache.org

> Hi,
> 
> I want to build Hadoop 0.20.2 from source using the Eclipse IDE. 
> Can anyone
> help me with this?
> 
> Regards,
> Vighnesh
> 


Re: Submitting and running hadoop jobs Programmatically

2011-07-26 Thread Harsh J
Madhu,

Ditch the '*' in the classpath element that has the configuration
directory. The directory ought to be on the classpath, not the files
AFAIK.

Try and let us know if it then picks up the proper config (right now,
it's using the local mode).
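
For reference, the corrected command from the quoted message would look
like this (the conf directory itself on the classpath, without the
trailing /*):

  java -cp Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/* com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1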

On Wed, Jul 27, 2011 at 10:25 AM, madhu phatak  wrote:
> Hi
> I am submitting the job as follows
>
> java -cp
>  Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/*
> com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv
> kkk11fffrrw 1
>
> I get the log in CLI as below
>
> 11/07/27 10:22:54 INFO security.Groups: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=30
> 11/07/27 10:22:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 11/07/27 10:22:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
> processName=JobTracker, sessionId= - already initialized
> 11/07/27 10:22:54 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 11/07/27 10:22:54 INFO mapreduce.JobSubmitter: Cleaning up the staging area
> file:/tmp/hadoop-hadoop/mapred/staging/hadoop-1331241340/.staging/job_local_0001
>
> It doesn't create any job in hadoop.
>
> On Tue, Jul 26, 2011 at 5:11 PM, Devaraj K  wrote:
>
>> Madhu,
>>
>>  Can you check the client logs, whether any error/exception is coming while
>> submitting the job?
>>
>> Devaraj K
>>
>> -Original Message-
>> From: Harsh J [mailto:ha...@cloudera.com]
>> Sent: Tuesday, July 26, 2011 5:01 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Submitting and running hadoop jobs Programmatically
>>
>> Yes. Internally, it calls regular submit APIs.
>>
>> On Tue, Jul 26, 2011 at 4:32 PM, madhu phatak 
>> wrote:
>> > I am using JobControl.add() to add a job and running job control in
>> > a separate thread and using JobControl.allFinished() to see all jobs
>> > completed or not . Is this work same as Job.submit()??
>> >
>> > On Tue, Jul 26, 2011 at 4:08 PM, Harsh J  wrote:
>> >
>> >> Madhu,
>> >>
>> >> Do you get a specific error message / stack trace? Could you also
>> >> paste your JT logs?
>> >>
>> >> On Tue, Jul 26, 2011 at 4:05 PM, madhu phatak 
>> >> wrote:
>> >> > Hi
>> >> >  I am using the same APIs but i am not able to run the jobs by just
>> >> adding
>> >> > the configuration files and jars . It never create a job in Hadoop ,
>> it
>> >> just
>> >> > shows cleaning up staging area and fails.
>> >> >
>> >> > On Tue, Jul 26, 2011 at 3:46 PM, Devaraj K 
>> wrote:
>> >> >
>> >> >> Hi Madhu,
>> >> >>
>> >> >>   You can submit the jobs using the Job API's programmatically from
>> any
>> >> >> system. The job submission code can be written this way.
>> >> >>
>> >> >>     // Create a new Job
>> >> >>     Job job = new Job(new Configuration());
>> >> >>     job.setJarByClass(MyJob.class);
>> >> >>
>> >> >>     // Specify various job-specific parameters
>> >> >>     job.setJobName("myjob");
>> >> >>
>> >> >>     FileInputFormat.addInputPath(job, new Path("in"));
>> >> >>     FileOutputFormat.setOutputPath(job, new Path("out"));
>> >> >>
>> >> >>     job.setMapperClass(MyJob.MyMapper.class);
>> >> >>     job.setReducerClass(MyJob.MyReducer.class);
>> >> >>
>> >> >>     // Submit the job
>> >> >>     job.submit();
>> >> >>
>> >> >>
>> >> >>
>> >> >> For submitting this, need to add the hadoop jar files and
>> configuration
>> >> >> files in the class path of the application from where you want to
>> submit
>> >> >> the
>> >> >> job.
>> >> >>
>> >> >> You can refer this docs for more info on Job API's.
>> >> >>
>> >> >>
>> >>
>>
>> http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred
>> >> >> uce/Job.html
>> >> >>
>> >> >>
>> >> >>
>> >> >> Devaraj K
>> >> >>
>> >> >> -Original Message-
>> >> >> From: madhu phatak [mailto:phatak@gmail.com]
>> >> >> Sent: Tuesday, July 26, 2011 3:29 PM
>> >> >> To: common-user@hadoop.apache.org
>> >> >> Subject: Submitting and running hadoop jobs Programmatically
>> >> >>
>> >> >> Hi,
>> >> >>  I am working on a open source project
>> >> >> Nectar where
>> >> >> i am trying to create the hadoop jobs depending upon the user input.
>> I
>> >> was
>> >> >> using Java Process API to run the bin/hadoop shell script to submit
>> the
>> >> >> jobs. But it seems not good way because the process creation model is
>> >> >> not consistent across different operating systems . Is there any
>> better
>> >> way
>> >> >> to submit the jobs rather than invoking the shell script? I am using
>> >> >> hadoop-0.21.0 version and i am running my program in the same user
>> where
>> >> >> hadoop is installed . Some of the older thread told if I add
>> >> configuration
>> >> >> files in path it will work fine . But i am not able to run in that
>> way
>> .
>> >> So
>> >> >> anyone tried this before? If So , p

questions regarding data storage and inputformat

2011-07-26 Thread Tom Melendez
Hi Folks,

I have a bunch of binary files which I've stored in a sequencefile.
The name of the file is the key, the data is the value and I've stored
them sorted by key.  (I'm not tied to using a sequencefile for this).
The current test data is only 50MB, but the real data will be 500MB -
1GB.

My M/R job requires that its input be several of these records in the
sequence file, which is determined by the key.  The sorting mentioned
above keeps these all packed together.

1. Any reason not to use a sequence file for this? Perhaps a MapFile?
Since I've sorted it, I don't need "random" accesses, but I do need
to be aware of the keys, as I need to be sure that I get all of the
relevant keys sent to a given mapper.

2. Looks like I want a custom inputformat for this, extending
SequenceFileInputFormat.  Do you agree?  I'll gladly take some
opinions on this, as I ultimately want to split the based on what's in
the file, which might be a little unorthodox.

3. Another idea might be to create separate seq files for each chunk of
records and make them non-splittable, ensuring that each goes to a
single mapper (see the sketch below). Assuming I can get away with this,
do you see any pros/cons with that approach?
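
A minimal sketch of the non-splittable idea in point 3, assuming the
old mapred API and Text/BytesWritable as the key/value types:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;

  public class WholeSequenceFileInputFormat
      extends SequenceFileInputFormat<Text, BytesWritable> {
    // Returning false disables input splits for these files, so each
    // sequence file is consumed whole by a single mapper.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }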

Thanks,

Tom

-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: Multiple Output Formats

2011-07-26 Thread Luca Pireddu
On July 26, 2011 06:11:33 PM Roger Chen wrote:
> Hi all,
> 
> I am attempting to implement MultipleOutputFormat to write data to multiple
> files dependent on the output keys and values. Can somebody provide a
> working example with how to implement this in Hadoop 0.20.2?
> 
> Thanks!

Hello,

I have a working sample here:

http://biodoop-seal.bzr.sourceforge.net/bzr/biodoop-seal/trunk/annotate/head%3A/src/it/crs4/seal/demux/DemuxOutputFormat.java

It extends FileOutputFormat.
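
Along the same lines, if the bundled MultipleTextOutputFormat (old
mapred API in 0.20.2) fits the use case, a minimal sketch -- the class
name is illustrative -- would route each record to a file named after
its key:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
    // The returned string becomes the output file name for this record;
    // the default implementation just returns the passed-in leaf name.
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
        String name) {
      return key.toString();
    }
  }

  // In the driver: conf.setOutputFormat(KeyBasedOutput.class);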

-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452