Hadoop streaming - Subprocess failed

2012-08-29 Thread Periya.Data
Hi,
I am running a map-reduce job in Python and I get this error message. I
do not understand what it means. Output is not written to HDFS. I am using
CDH3u3. Any suggestion is appreciated.

MapAttempt TASK_TYPE="MAP" TASKID="task_201208232245_2812_m_00"
TASK_ATTEMPT_ID="attempt_201208232245_2812_m_00_0"
TASK_STATUS="FAILED" ERROR="java.lang.RuntimeException:
PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
"


Re: Issue with Hadoop Streaming

2012-08-03 Thread Subir S
In streaming, the contents of the file are streamed to the mapper through
STDIN, not the file names.

Fix the perl script accordingly.

Thanks, Subir
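For reference, here is a minimal sketch of the kind of mapper that works with the NLineInputFormat file-of-filenames setup described in Devi's message below (variable names are illustrative). Two details matter: the line read from STDIN still carries its trailing newline (and, under NLineInputFormat, a leading byte-offset key and a tab), and the path names a file in HDFS, so a plain open() fails with "No such file or directory". Since that errno is 2 and die passes it through as the exit status, this would also explain the "subprocess failed with code 2" in the quoted report.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Streaming feeds the mapper its input split on STDIN. With NLineInputFormat
    # over a file of HDFS paths, each map task gets one "<byte offset>\t<path>" line.
    while (my $line = <STDIN>) {
        chomp $line;                              # drop the trailing newline
        my ($offset, $path) = split /\t/, $line, 2;
        $path = $offset unless defined $path;     # no tab: the whole line is the path
        # The path is in HDFS, not on the task's local disk, so pipe it through
        # "hadoop fs -cat" instead of open()ing it directly.
        open(my $in, '-|', "hadoop fs -cat $path")
            or die "could not open the file $path\n";
        my $i = 0;
        while (my $rec = <$in>) {
            print $i++ . "\t" . $rec;             # emit key<TAB>value lines
        }
        close($in);
    }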



Re: Issue with Hadoop Streaming

2012-08-02 Thread Devi Kumarappan


After specifying the NLineInputFormat option, the streaming job fails with:

Error from attempt_201205171448_0092_m_00_0: java.lang.RuntimeException: 
PipeMapRed.waitOutputThreads(): subprocess failed with code 2

It spawns two mappers, but I am not sure whether the mapper runs with the file
names specified in the input option. I was expecting one mapper to run with
/user/devi/s_input/a.txt and one mapper to run with /user/devi/s_input/b.txt. I
dug into the task files, but could not find anything.

Here is the simple mapper perl script. All it does is read the file and print
it. (It needs to do much more, but I could not get the basic job itself to
run.)

$i = 0;
$userinput = <STDIN>;
open(INFILE, "$userinput") || die "could not open the file $userinput \n";
while (<INFILE>) {
    my $line = $_;
    print "$i" . $line;
    $i++;
}
close(INFILE);
exit;

My command is:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar -input /user/devi/file.txt -output /user/devi/s_output -mapper "/usr/bin/perl /home/devi/Perl/crash_parser.pl" -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat


Really appreciate your help.

Devi


 





Re: Issue with Hadoop Streaming

2012-08-02 Thread Robert Evans
http://www.mail-archive.com/core-user@hadoop.apache.org/msg07382.html





Re: Issue with Hadoop Streaming

2012-08-02 Thread Devi Kumarappan
My mapper is a perl script and it is not in Java. So how do I specify the
NLineInputFormat?






Re: Issue with Hadoop Streaming

2012-08-02 Thread Robert Evans
It depends on the input format you use.  You probably want to look at using
NLineInputFormat.
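(With streaming, the input format is chosen on the command line rather than in code; Devi's follow-up at the top of this thread shows the flag: -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat.)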


Issue with Hadoop Streaming

2012-08-01 Thread Devi Kumarappan
I am trying to run hadoop streaming using a perl script as the mapper and with
no reducer. My requirement is for the mapper to run on one file at a time,
since I have to do pattern processing on the entire contents of one file at a
time and the file size is small.

The Hadoop streaming manual suggests the following solution:
* Generate a file containing the full HDFS paths of the input files. Each map
task would get one file name as input.
* Create a mapper script which, given a filename, will get the file to local
disk, gzip the file and put it back in the desired output directory.

I am running the following command:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar -input /user/devi/file.txt -output /user/devi/s_output -mapper "/usr/bin/perl /home/devi/Perl/crash_parser.pl"

/user/devi/file.txt contains the following two lines:

/user/devi/s_input/a.txt
/user/devi/s_input/b.txt

When this runs, instead of spawning two mappers for a.txt and b.txt as per the
document, only one mapper is spawned and the perl script gets both
/user/devi/s_input/a.txt and /user/devi/s_input/b.txt as input.

How can I make the mapper perl script run on only one file at a time?

Appreciate your help. Thanks, Devi

Hadoop Streaming Example - Issue

2012-06-05 Thread karanveer.singh
Hi,

I am trying to run a java program as a mapper using Hadoop Streaming but 
getting the following error:

"Cannot run program "new_code.class": java.io.IOException: error=2, No such 
file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)"


The command being run is:

/usr/bin/hadoop jar 
/usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.2.jar -file 
/home/hadoop/zql/demo/api/new_code.class -mapper 
/home/hadoop/zql/demo/api/new_code.class -jobconf mapred.reduce.tasks=0 -input 
/rbbpoc/input/* -output /rbbpoc/output2001


The new_code.java program runs fine for me in standalone mode and accepts
stdin input. I have checked the permissions of the class file and they seem
fine.

Any input to help resolve the above issue would be appreciated.


Regards,

Karan
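A likely reading of the error: streaming exec()s the -mapper argument as an ordinary program, so a bare .class file cannot be launched directly ("error=2, No such file or directory" is the failed exec). The usual pattern, as in the "hadoop streaming using a java program as mapper" thread below, is to point -mapper at a small wrapper that invokes java on the shipped class. A sketch of such a wrapper in Perl follows; the class name is taken from Karan's command, and the classpath assumes -file placed the class in the task's working directory:

    #!/usr/bin/perl
    # wrapper.pl: shipped alongside new_code.class via -file and used as -mapper,
    # so that streaming launches a real executable, which in turn runs the class.
    exec("java", "-cp", ".", "new_code")
        or die "could not exec java: $!\n";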






Re: hadoop streaming using a java program as mapper

2012-05-02 Thread Robert Evans
Do you have the error message from running java?  You can use myMapper.sh to
help you debug what is happening by logging it.  Stderr of myMapper.sh is
logged and you can get to it.  You can run shell commands like find and ls,
and you can probably look at any error messages that java produced while
trying to run, such as class-not-found exceptions.

--Bobby Evans




Re: hadoop streaming using a java program as mapper

2012-05-01 Thread Boyu Zhang
Yes, I did; myMapper.sh is executed. The problem is inside myMapper.sh: it
calls a java program named myJava, and myJava did not get executed on the
slaves, even though I shipped myJava.class too.

Thanks,
Boyu



Re: hadoop streaming using a java program as mapper

2012-05-01 Thread 黄 山
have you shipped myMapper.sh to each node?

thuhuang...@gmail.com



On 2012-5-2, at 1:17 PM, Boyu Zhang wrote:

> Hi All,
> 
> I am in a little bit of a strange situation. I am using Hadoop streaming to
> run a bash shell script, myMapper.sh, and myMapper.sh calls a java program,
> then an R program, then outputs intermediate key/value pairs. I used the
> -file option to ship the java and R files, but the java program was not
> executed by streaming. The myMapper.sh has something like this:
> 
> java myJava arguments
> 
> And in the streaming command, I use something like this:
> 
> hadoop jar /opt/hadoop/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=0
> -input /user/input -output /user/output7 -mapper ./myMapper.sh -file
> myJava.class  -verbose
> 
> And the myJava program is not run when I execute like this. If I go to the
> actual slave node to check the files, myMapper.sh has been shipped to the
> slave node, but myJava.class has not; it is inside the job.jar file.
> 
> Can someone provide some insights on how to run a java program through
> hadoop streaming? Thanks!
> 
> Boyu



RE: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Devaraj k
Hi Sunil,

    Please check HarFileSystem (the Hadoop Archive FileSystem); it should be
useful for solving your problem.

Thanks
Devaraj

From: Sunil S Nandihalli [sunil.nandiha...@gmail.com]
Sent: Tuesday, April 24, 2012 7:12 PM
To: common-user@hadoop.apache.org
Subject: hadoop streaming and a directory containing large number of .tgz files

Hi Everybody,
 I am a newbie to hadoop. I have about 40K .tgz files, each approximately 3MB.
I would like to process them as if they were a single large file formed by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using hadoop-streaming or some other similar
library?


thanks,
Sunil.


Re: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Raj Vishwanathan
Sunil

You could use identity mappers, a single identity reducer, and no output
compression.

Raj
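If the file-of-HDFS-paths pattern from the streaming manual (see the "Issue with Hadoop Streaming" thread above) fits, here is a sketch of a mapper that reproduces the tar/sed pipeline one archive at a time; everything in it is assumed from Sunil's mail rather than tested against it:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each input line is expected to carry one HDFS path to a .tgz file,
    # optionally prefixed with "<byte offset>\t" under NLineInputFormat.
    while (my $line = <STDIN>) {
        chomp $line;
        my ($offset, $path) = split /\t/, $line, 2;
        $path = $offset unless defined $path;
        # Stream the archive out of HDFS and untar it to stdout, mirroring
        # "tar -Oxvf {} | sed 1d" from the original pipeline.
        open(my $in, '-|', "hadoop fs -cat $path | tar -Ozxf -")
            or die "could not stream $path: $!\n";
        my $first = 1;              # emulate "sed 1d": drop the first line per archive
        while (my $rec = <$in>) {
            print $rec unless $first;
            $first = 0;
        }
        close($in);
    }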




Re: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Sunil S Nandihalli
Sorry for reforwarding this email. I was not sure if it actually got
through since I just got the confirmation regarding my membership to the
mailing list.
Thanks,
Sunil.



New question: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2012-04-10 Thread Shi Yu

Hi,

I looked back at the old post trying to find a solution to my problem.  I am
using hadoop 0.20.203 streaming with a C++ program.  The program loads many
dictionaries stored in local folders. For example:


mainfolder - dir1 ->  dicfile 1
mainfolder - dir1 ->  dicfile 2
mainfolder - dir2 ->  dicfile 3
mainfolder - dir2 ->  dicfile 4

I didn't change those dictionary-loading functions in C++, on the assumption
that the whole directory at the mainfolder level could be passed to streaming.
However, it does not seem to work, as I observed the following error:


java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:435)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)


It seems the program failed to load the dictionaries. What is the most
efficient way to pass multiple files with directory dependencies to hadoop
streaming?  Can I leave the C++ code unchanged, or should I remove all the
directory dependencies from the dictionary loading?


Thanks!

Shi

On 6/29/2011 1:44 AM, Guang-Nan Cheng wrote:

Well, my bad. I made a simple test and confirmed that -files works that way
already.

On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote:

I'm keen on passing a whole ruby app to streaming, so I don't need to bother
with ruby file dependencies.

For example,

./streaming
...
-mapper 'ruby aaa/bbb/ccc'
-files aaa    <--- pass the folder

Is this supported already? If not, any tips on how to make this work? I'm
willing to add some code by myself and rebuild the streaming jar.

--
Nick Jones







Re: Hadoop streaming or pipes ..

2012-04-06 Thread Mark question
Thanks all. Charles, you guided me to the Baidu slides titled "Introduction
to Hadoop C++ Extension"
<http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5>, which describe
their experience; the sixth slide shows exactly what I was looking for. It is
still hard to manage memory with pipes, and there are no performance gains,
hence the development of HCE.

Thanks,
Mark


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Charles Earl
Also bear in mind that there is a kind of detour involved, in the sense that a
pipes map must send key/value data back to the Java process and then on to the
reduce (more or less).
I think that the Hadoop C++ Extension (HCE, there is a patch) is supposed to
be faster.
Would be interested to know if the community has any experience with HCE 
performance.
C



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
It is a regular process, unless you explicitly say you want it to be java, 
which would be a bit odd to do, but possible.

--Bobby




Re: Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Thanks for the response Robert ..  so the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
Both streaming and pipes do very similar things.  They will fork/exec a 
separate process that is running whatever you want it to run.  The JVM that is 
running hadoop then communicates with this process to send the data over and 
get the processing results back.  The difference between streaming and pipes is 
that streaming uses stdin/stdout for this communication so preexisting 
processing like grep, sed and awk can be used here.  Pipes uses a custom 
protocol with a C++ library to communicate.  The C++ library is tagged with 
SWIG compatible data so that it can be wrapped to have APIs in other languages 
like python or perl.

I am not sure what the performance difference is between the two, but in my own 
work I have seen a significant performance penalty from using either of them, 
because there is a somewhat large overhead of sending all of the data out to a 
separate process just to read it back in again.

--Bobby Evans
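To make the stdin/stdout protocol concrete, here is a minimal word-count pair in Perl; it is purely illustrative, not from the thread. Streaming writes each input record to the mapper's STDIN, treats everything up to the first tab of an output line as the key, sorts by key, and feeds the reducer its records already grouped:

    #!/usr/bin/perl
    # mapper.pl: emit one "word<TAB>1" line per whitespace-delimited token
    use strict;
    use warnings;
    while (<STDIN>) {
        print "$_\t1\n" for split;
    }

    #!/usr/bin/perl
    # reducer.pl: input arrives sorted by key, so a running count per key suffices
    use strict;
    use warnings;
    my ($prev, $count) = (undef, 0);
    while (<STDIN>) {
        chomp;
        my ($key, $val) = split /\t/;
        if (defined $prev && $key ne $prev) {
            print "$prev\t$count\n";    # key changed: flush the finished count
            $count = 0;
        }
        ($prev, $count) = ($key, $count + $val);
    }
    print "$prev\t$count\n" if defined $prev;

Because the pair is just two executables reading STDIN and writing STDOUT, the same files can be tested locally with "cat input | ./mapper.pl | sort | ./reducer.pl" before submitting them with -mapper and -reducer.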





Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark


Running a simple Hadoop Streaming job With MapReduce 2.0

2012-03-12 Thread stevens35

Hi All,

I've set up a cluster using Cloudera's CDH4 Beta 1 release of MapReduce 2.0,
and I'd like to test some hadoop streaming scripts I have.  Before
using the real scripts, I wanted to test the simplest streaming 
application: downloading a file from hdfs, zipping it, and then loading 
it back to hdfs.  In particular I want to test this out with large 
files, say 900 megs, since previous versions of hadoop streaming had an 
issue where the uploading of the file would claim to have finished 
before the file was fully uploaded.  This pretty much aligns with the 
streaming question in 
http://hadoop.apache.org/common/docs/current/streaming.html#How+do+I+process+files%2C+one+per+map%3F


My test script looks like:

  while read fileName; do
      echo "reporter:status:copying file" >&2
      echo "$fileName"
      hadoop fs -copyToLocal /path/to/file/"$fileName" .
      ls -l
      tar -czf smaller.tar.gz "$fileName"
      hadoop fs -copyFromLocal smaller.tar.gz /path/to/file/
  done

And I run this with

hadoop jar /usr/lib/hadoop/hadoop-streaming-0.23.0-cdh4b1.jar streamjob 
-input input -output output -mapper streamTest.sh  -file streamTest.sh.


The job looks good as it starts; however, I quickly get the following error:

  Container [pid=4181,containerID=container_1331576268297_0001_01_03] is
running beyond virtual memory limits. Current usage: 302.6mb of 1.0gb physical
memory used; 4.9gb of 2.1gb virtual memory used. Killing container.
  Dump of the process-tree for container_1331576268297_0001_01_03 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 4181 4070 4181 4181 (java) 322 20 581316608 52327 
/usr/java/default/bin/java -Djava.net.preferIPv4Stack=true 
-Dhadoop.metrics.log.level=WARN  -Xmx200m 
-Djava.io.tmpdir=/hdfs/dfs/yarn/usercache/stevens35/appcache/application_1331576268297_0001/container_1331576268297_0001_01_03/tmp 
-Dlog4j.configuration=container-log4j.properties 
-Dyarn.app.mapreduce.container.log.dir=/hdfs/logs/yarn/application_1331576268297_0001/container_1331576268297_0001_01_03 
-Dyarn.app.mapreduce.container.log.filesize=0 
-Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 
10.220.5.11 58072 attempt_1331576268297_0001_m_01_0 3
|- 4252 4246 4181 4181 (java) 334 90 4566265856 24808 
/usr/java/default/bin/java -Xmx4000m -Dhadoop.log.dir=/hdfs/logs  
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop 
-Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console 
-Djava.library.path=/usr /lib/hadoop/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Dlog4j.configuration=container-log4j.properties  
-Dyarn.app.mapreduce.container.log.dir=/hdfs/logs/yarn/application_1331576268297_0001/container_1331576268297_0001_01_03  
-Dyarn.app.mapreduce.container.log.filesize=0 
-Dhadoop.root.logger=INFO,CLA -Dhadoop.security.logger=INFO,NullAppender 
org.apache.hadoop.fs.FsShell  -copyToLocal 
/data/nyt/nyt03_NMF_500-ds.dat.transpose .
|- 4246 4181 4181 4181 (streamTest.sh) 0 0 65507328 325 
/bin/bash 
/hdfs/dfs/yarn/usercache/stevens35/appcache/application_1331576268297_0001/container_1331576268297_0001_01_03/./streamTest.sh 



The container is running out of virtual memory, but I'm not exactly sure why
this would be the case.  In version 0.20.2, this job worked just fine.  What
has changed that might cause this kind of streaming job to run out of memory?
Is it not possible to pull files from hdfs from within a streaming job?
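For what it's worth, the numbers in the dump line up with YARN's virtual-memory check: a container is killed when its vmem exceeds the physical limit multiplied by yarn.nodemanager.vmem-pmem-ratio (default 2.1, which matches the 2.1gb ceiling here), and the child "hadoop fs" JVM is launched with -Xmx4000m, which by itself reserves several gigabytes of virtual address space. Raising that ratio or shrinking the client JVM heap (HADOOP_CLIENT_OPTS) are the obvious knobs; later releases also expose yarn.nodemanager.vmem-check-enabled to turn the check off entirely. This is a reading of the log, not a confirmed diagnosis.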


Also, as a secondary question, is there any documentation on reporting 
the status of the job?  Previously, I'd update the task's status with 
"reporter:status:current status description" written to stderr.  
However, looking through the new application manager ui and job history 
ui, I don't see this status being reported anywhere.  Is there a new 
format for this?  Is it reported elsewhere?


Thanks!
--Keith



Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-06 Thread Russell Jurney
Rules of thumb IMO:

You should be using Pig in place of MR jobs at all times when performance
isn't absolutely crucial.  Writing unnecessary MR is needless technical
debt that you will regret as people are replaced and your organization
scales.  Pig gets it done in much less time.  If you need faster jobs, then
optimize your Pig, and if that doesn't work, put a single MAPREDUCE
<http://pig.apache.org/docs/r0.9.2/basic.html#mapreduce> job at the
bottleneck.  Also, realize that it can be hard to actually beat Pig's
performance without experience.  Check that your MR job is actually faster
than Pig at the same load before assuming you can do better than Pig.

Streaming is good if your data doesn't easily map to tuples, you really
like using the abstractions of your favorite language's MR library, or you
are doing something weird like simulations/pure batch jobs (no MR).

If you're doing a lot of joins and performance is a problem - consider
doing fewer joins.  I would strongly suggest that you prioritize
de-normalizing and duplicating data over switching to raw MR jobs because
HIVE joins are slow.  MapReduce is slow at joins.  Programmer time is more
valuable than machine time.  If you're having to write tons of raw MR, then
get more machines.




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-05 Thread Russell Jurney
Streaming is good for simulation: long-running, map-only processes where Pig
doesn't really help and it is simple to fire off a streaming process.  You do
have to set some options so they can take a long time to return and still
report counters.

Russell Jurney http://datasyndrome.com



Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-05 Thread Eli Finkelshteyn
I'm really interested in this as well. I have trouble seeing a really
good use case for streaming map-reduce. Is there something I can do in
streaming that I can't do in Pig? If I want to re-use previously written
Python functions from my code base, I can do that in Pig as much as in
Streaming, and from what I've experienced thus far, Python streaming seems
to run slower than, or at the same speed as, Pig. So why would I want to
write a whole lot of harder-to-read mappers and reducers when I can write
equally fast, shorter, and clearer code in Pig? Maybe it's obvious, but
currently I just can't think of the right use case.


Eli





Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-02 Thread Subir S
On Fri, Mar 2, 2012 at 12:38 PM, Harsh J  wrote:

> On Fri, Mar 2, 2012 at 10:18 AM, Subir S 
> wrote:
> > Hello Folks,
> >
> > Are there any pointers to such comparisons between Apache Pig and Hadoop
> > Streaming Map Reduce jobs?
>
> I do not see why you seek to compare these two. Pig offers a language
> that lets you write data-flow operations and runs these statements as
> a series of MR jobs for you automatically (Making it a great tool to
> use to get data processing done really quick, without bothering with
> code), while streaming is something you use to write non-Java, simple
> MR jobs. Both have their own purposes.
>

Basically we are comparing these two to see the benefits and how much they
help in improving the productive coding time, without jeopardizing the
performance of MR jobs.


> > Also there was a claim in our company that Pig performs better than Map
> > Reduce jobs? Is this true? Are there any such benchmarks available
>
> Pig _runs_ MR jobs. It does do job design (and some data)
> optimizations based on your queries, which is what may give it an edge
> over designing elaborate flows of plain MR jobs with tools like
> Oozie/JobControl (Which takes more time to do). But regardless, Pig
> only makes it easy doing the same thing with Pig Latin statements for
> you.
>

I knew that Pig runs MR jobs, just as Hive runs MR jobs. But Hive jobs become
pretty slow with a lot of joins, which we can do faster by writing raw MR
jobs. So with that context I was trying to see how Pig runs MR jobs: for
example, what kind of projects should consider Pig, say when we have a lot of
joins, which takes time to write with plain MR jobs. Thoughts?

Thank you Harsh for your comments. They are helpful!


>
> --
> Harsh J
>


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-02 Thread Subir S
Thank you Jie!

I have downloaded the "Pig experience" paper and will read it.



Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Harsh J
On Fri, Mar 2, 2012 at 10:18 AM, Subir S  wrote:
> Hello Folks,
>
> Are there any pointers to such comparisons between Apache Pig and Hadoop
> Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (Making it a great tool to
use to get data processing done really quick, without bothering with
code), while streaming is something you use to write non-Java, simple
MR jobs. Both have their own purposes.

> Also there was a claim in our company that Pig performs better than Map
> Reduce jobs? Is this true? Are there any such benchmarks available

Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (Which takes more time to do). But regardless, Pig
only makes it easy doing the same thing with Pig Latin statements for
you.

-- 
Harsh J


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Jie Li
Considering Pig essentially translates scripts into Map Reduce jobs, one
can always write Map Reduce jobs as good as Pig's. You can refer to the "Pig
experience" paper to see the overhead Pig introduces, but it's being
improved all the time.

Btw if you really care about the performance, how you configure Hadoop and
Pig can also play an important role.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish



Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Subir S
Hello Folks,

Are there any pointers to comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?

Also, there was a claim in our company that Pig performs better than Map
Reduce jobs. Is this true? Are there any such benchmarks available?

Thanks, Subir


Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Austin Chungath
Thanks Subir,

"-D stream.mapred.output.field.separator=*" is not an available option, my bad.
What I should have done is:

-D stream.map.output.field.separator=*
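Putting the corrected option back into the original command gives something
like this (the same job as quoted below; stating
stream.num.map.output.key.fields=1 explicitly is an assumption that the key
is the single field before the first '*'):

hadoop jar
$HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
-D stream.map.output.field.separator=*
-D stream.num.map.output.key.fields=1
-D mapred.reduce.tasks=2
-mapper ./map.py
-reducer ./reducer.py
-file ./map.py
-file ./reducer.py
-input /user/inputdata
-output /user/outputdata

With the map output split on '*' into key and value, the default hash
partitioner sees the real key again and each unique key goes to a single
reducer.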
On Tue, Feb 28, 2012 at 2:36 PM, Subir S  wrote:

>
> http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs
>
> Read this link; your options below are wrong.
>
>
>
> On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath 
> wrote:
>
> > When I am using more than one reducer in hadoop streaming, where I am
> > using my custom separator rather than the tab, it looks like the hadoop
> > shuffling process is not happening as it should.
> >
> > This is the reducer output when I am using '\t' to separate my key value
> > pair that is output from the mapper.
> >
> > *output from reducer 1:*
> > 10321,22
> > 23644,37
> > 41231,42
> > 23448,20
> > 12325,39
> > 71234,20
> > *output from reducer 2:*
> > 24123,43
> > 33213,46
> > 11321,29
> > 21232,32
> >
> > The above output is as expected: the first column is the key and the
> > second value is the count. There are 10 unique keys; 6 of them are in the
> > output of the first reducer and the remaining 4 in the second reducer's
> > output.
> >
> > But now I use a custom separator for the key value pair output from
> > my mapper. Here I am using '*' as the separator:
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> >
> > *output from reducer 1:*
> > 10321,5
> > 21232,19
> > 24123,16
> > 33213,28
> > 23644,21
> > 41231,12
> > 23448,18
> > 11321,29
> > 12325,24
> > 71234,9
> > * *
> > *output from reducer 2:*
> > 10321,17
> > 21232,13
> > 33213,18
> > 23644,16
> > 41231,30
> > 23448,2
> > 24123,27
> > 12325,15
> > 71234,11
> >
> > Now both reducers are getting all the keys, and part of the values go
> > to reducer 1 and part go to reducer 2.
> > Why is it behaving like this when I am using a custom separator;
> > shouldn't each reducer get a unique key after the shuffling?
> > I am using Hadoop 0.20.205.0 and below is the command that I am using to
> > run hadoop streaming. Are there some more options that I should specify
> > for hadoop streaming to work properly if I am using a custom separator?
> >
> > hadoop jar
> > $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> > -mapper ./map.py
> > -reducer ./reducer.py
> > -file ./map.py
> > -file ./reducer.py
> > -input /user/inputdata
> > -output /user/outputdata
> > -verbose
> >
> >
> > Any help is much appreciated,
> > Thanks,
> > Austin
> >
>


Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Subir S
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs

Read this link; your options below are wrong.



On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath  wrote:

> When I am using more than one reducer in hadoop streaming, where I am using
> my custom separator rather than the tab, it looks like the hadoop shuffling
> process is not happening as it should.
>
> This is the reducer output when I am using '\t' to separate my key value
> pair that is output from the mapper.
>
> *output from reducer 1:*
> 10321,22
> 23644,37
> 41231,42
> 23448,20
> 12325,39
> 71234,20
> *output from reducer 2:*
> 24123,43
> 33213,46
> 11321,29
> 21232,32
>
> The above output is as expected: the first column is the key and the second
> value is the count. There are 10 unique keys; 6 of them are in the output of
> the first reducer and the remaining 4 in the second reducer's output.
>
> But now I use a custom separator for the key value pair output from my
> mapper. Here I am using '*' as the separator:
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
>
> *output from reducer 1:*
> 10321,5
> 21232,19
> 24123,16
> 33213,28
> 23644,21
> 41231,12
> 23448,18
> 11321,29
> 12325,24
> 71234,9
> * *
> *output from reducer 2:*
> 10321,17
> 21232,13
> 33213,18
> 23644,16
> 41231,30
> 23448,2
> 24123,27
> 12325,15
> 71234,11
>
> Now both reducers are getting all the keys, and part of the values go to
> reducer 1 and part go to reducer 2.
> Why is it behaving like this when I am using a custom separator; shouldn't
> each reducer get a unique key after the shuffling?
> I am using Hadoop 0.20.205.0 and below is the command that I am using to
> run hadoop streaming. Are there some more options that I should specify for
> hadoop streaming to work properly if I am using a custom separator?
>
> hadoop jar
> $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
> -mapper ./map.py
> -reducer ./reducer.py
> -file ./map.py
> -file ./reducer.py
> -input /user/inputdata
> -output /user/outputdata
> -verbose
>
>
> Any help is much appreciated,
> Thanks,
> Austin
>


hadoop streaming : need help in using custom key value separator

2012-02-27 Thread Austin Chungath
When I am using more than one reducer in hadoop streaming, where I am using
my custom separator rather than the tab, it looks like the hadoop shuffling
process is not happening as it should.

This is the reducer output when I am using '\t' to separate my key value
pair that is output from the mapper.

*output from reducer 1:*
10321,22
23644,37
41231,42
23448,20
12325,39
71234,20
*output from reducer 2:*
24123,43
33213,46
11321,29
21232,32

The above output is as expected: the first column is the key and the second
value is the count. There are 10 unique keys; 6 of them are in the output of
the first reducer and the remaining 4 in the second reducer's output.

But now I use a custom separator for the key value pair output from my
mapper. Here I am using '*' as the separator:
-D stream.mapred.output.field.separator=*
-D mapred.reduce.tasks=2

*output from reducer 1:*
10321,5
21232,19
24123,16
33213,28
23644,21
41231,12
23448,18
11321,29
12325,24
71234,9
* *
*output from reducer 2:*
10321,17
21232,13
33213,18
23644,16
41231,30
23448,2
24123,27
12325,15
71234,11

Now both reducers are getting all the keys, and part of the values go to
reducer 1 and part go to reducer 2.
Why is it behaving like this when I am using a custom separator; shouldn't
each reducer get a unique key after the shuffling?
I am using Hadoop 0.20.205.0 and below is the command that I am using to
run hadoop streaming. Are there some more options that I should specify for
hadoop streaming to work properly if I am using a custom separator?

hadoop jar
$HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
-D stream.mapred.output.field.separator=*
-D mapred.reduce.tasks=2
-mapper ./map.py
-reducer ./reducer.py
-file ./map.py
-file ./reducer.py
-input /user/inputdata
-output /user/outputdata
-verbose


Any help is much appreciated,
Thanks,
Austin


Re: Question on Hadoop Streaming

2011-12-06 Thread Romeo Kienzler

Hi,

the following command works:

hadoop jar 
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar 
-input input -output output2 -mapper /root/bowtiestreaming.sh -reducer NONE


Best Regards,

Romeo

On 12/06/2011 10:49 AM, Brock Noland wrote:

Does your job end with an error?

I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

The first option says to use your script as a mapper and the second says to
ship your script as part of the job.

Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler  wrote:

Hi,

I've got the following setup for NGS read alignment:


A script accepting data from stdin/out:

cat /root/bowtiestreaming.sh
cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
/home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2> /root/bowtie.log



A file copied to HDFS:

hadoop fs -put
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1

A streaming job invoked with only the mapper:

hadoop jar
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
-output
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
-mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0

The file cannot be found even though it is displayed:

hadoop fs -cat
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
11/12/06 09:07:47 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
cat: File does not exist:
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned


The file looks like this (tab separated):
head
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
@SRR014475.1 :1:1:108:111 length=36 GAGACGTCGTCCTCAGTACATATA
I3I+I(%BH43%III7I(5III*<&II+
@SRR014475.2 :1:1:112:26 length=36  GNNTTCCCCAACTTCCAAATCACCTAAC
I!!II=IC@=III()+:+2&$
@SRR014475.6 :1:1:106:14 length=36  GNNNTNTAGCATTAAGTAATTGGT
I!!!I!I6I*+III:%IB0+I.%?
@SRR014475.7 :1:1:118:934 length=36 GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT
III0%%)&%I.I&I;III.(I@E&2>*'+1;;#;&'
@SRR014475.8 :1:1:123:8 length=36   GNNNTTNN
I!!!$(!!
@SRR014475.9 :1:1:118:88 length=36  GGAAACTGGCGCGCTACCAGGTAACGCGCCAC
IIIGIAA4;1+16*;*+)'$%#$%
@SRR014475.10 :1:1:92:122 length=36 ATTTGCTGCCAATGGCGAGATTACGAATAATA
IICII;CGIDI?%$I:%6)C*;#;


and the result looks like this:

cat
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
|./bowtiestreaming.sh |head
@SRR014475.3 :1:1:101:937 length=36 +
gi|110640213|ref|NC_008253.1|   3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA
IAIIAII%IC,27:G>T
@SRR014475.4 :1:1:124:64 length=36  +
gi|110640213|ref|NC_008253.1|   2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG
III>C
@SRR014475.5 :1:1:108:897 length=36 +
gi|110640213|ref|NC_008253.1|   4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
I0I:I'+IG3II46II0>C@=III()+:+2&$  0
5:C>A,28:G>T,29:C>G,30:A>T,34:C>T
@SRR014475.9 :1:1:118:88 length=36  -
gi|110640213|ref|NC_008253.1|   3598410 GTGGCGCGTTACCTGGTAGCGCGCCAGTTTCC
%$#%$')+*;*61+1;4AAIGIII  0
@SRR014475.15 :1:1:87:967 length=36 +
gi|110640213|ref|NC_008253.1|   4474247 GACTACACGATCGCCTGCCTTAATATTCTTTACACC
A27II7CIII*I5I+F?II'  0   6:G>A,26:G>T
@SRR014475.20 :1:1:108:121 length=36-
gi|110640213|ref|NC_008253.1|   37761   AATGCATATTGAGAGTGTGATTATTAGC
IT
@SRR014475.23 :1:1:75:54 length=36  +
gi|110640213|ref|NC_008253.1|   2465453 GGTTTCTTTCTGCGCAGATGCCAGACGGTCTTTATA
CIIT,21:G>T,30:C>T,31:T>G,34:A>T
@SRR014475.27 :1:1:74:887 length=36 -
gi|110640213|ref|NC_008253.1|   540567  AAACGTGGCGTTTCAGGGATCGTTTGCCTGCATTAC
*&(%9%0F3.@4;&?4I3I6%:9AI0HI  0   34:C>A,35:C>A
@SRR014475.30 :1:1:123:73 length=36 +
gi|110640213|ref|NC_008253.1|   3391697 GATTGCGACTGACGGCGCAAATGCCCTCCGTT
ICI:II3*<4.*'+%'&)&$;+;%;%;;  0   30:C>T,34:G>T


Any ideas?

best Regards,

Romeo


-
Romeo Kienzler
r o m e o @ o r m i u m . d e





Re: Question on Hadoop Streaming

2011-12-06 Thread Romeo Kienzler

Hi Brock,

I'm not getting any errors.

I'm issuing the following command now:

hadoop jar 
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar 
-input 
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 
-output 
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned 
-mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0 -file 
bowtiestreaming.sh


The only error I get using "cat hadoop-0.21.0/logs/* |grep Exception" is:
org.apache.hadoop.fs.ChecksumException: Checksum error: 
file:/root/hadoop-0.21.0/logs/history/job_201112060917_0002_root at 2620416
2011-12-06 11:14:34,515 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13816: No 
such process
2011-12-06 11:14:43,039 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13862: No 
such process
2011-12-06 11:14:46,282 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13891: No 
such process
2011-12-06 11:14:49,841 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13978: No 
such process



best Regards,

Romeo

On 12/06/2011 10:49 AM, Brock Noland wrote:

Does your job end with an error?

I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

The first option says to use your script as a mapper and the second says to
ship your script as part of the job.

Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler  wrote:

Hi,

I've got the following setup for NGS read alignment:


A script accepting data from stdin/out:

cat /root/bowtiestreaming.sh
cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
/home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2> /root/bowtie.log



A file copied to HDFS:

hadoop fs -put
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1

A streaming job invoked with only the mapper:

hadoop jar
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
-output
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
-mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0

The file cannot be found even though it is displayed:

hadoop fs -cat
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
11/12/06 09:07:47 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
cat: File does not exist:
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned


The file looks like this (tab separated):
head
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
@SRR014475.1 :1:1:108:111 length=36 GAGACGTCGTCCTCAGTACATATA
I3I+I(%BH43%III7I(5III*<&II+
@SRR014475.2 :1:1:112:26 length=36  GNNTTCCCCAACTTCCAAATCACCTAAC
I!!II=IC@=III()+:+2&$
@SRR014475.6 :1:1:106:14 length=36  GNNNTNTAGCATTAAGTAATTGGT
I!!!I!I6I*+III:%IB0+I.%?
@SRR014475.7 :1:1:118:934 length=36 GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT
III0%%)&%I.I&I;III.(I@E&2>*'+1;;#;&'
@SRR014475.8 :1:1:123:8 length=36   GNNNTTNN
I!!!$(!!
@SRR014475.9 :1:1:118:88 length=36  GGAAACTGGCGCGCTACCAGGTAACGCGCCAC
IIIGIAA4;1+16*;*+)'$%#$%
@SRR014475.10 :1:1:92:122 length=36 ATTTGCTGCCAATGGCGAGATTACGAATAATA
IICII;CGIDI?%$I:%6)C*;#;


and the result looks like this:

cat
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
|./bowtiestreaming.sh |head
@SRR014475.3 :1:1:101:937 length=36 +
gi|110640213|ref|NC_008253.1|   3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA
IAIIAII%IC,27:G>T
@SRR014475.4 :1:1:124:64 length=36  +
gi|110640213|ref|NC_008253.1|   2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG
III>C
@SRR014475.5 :1:1:108:897 length=36 +
gi|110640213|ref|NC_008253.1|   4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
I0I:I'+IG3II46II0>C

Re: Question on Hadoop Streaming

2011-12-06 Thread Brock Noland
Does your job end with an error?

I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

The first option says to use your script as a mapper and the second says to
ship your script as part of the job.

Brock
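Read together, the two options look like this on the command line (paths as
in this thread):

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

-file copies the script into each task's working directory, and -mapper then
refers to it by its base name, so the job no longer depends on
/root/bowtiestreaming.sh being present on every task node.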

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler  wrote:
> Hi,
>
> I've got the following setup for NGS read alignment:
>
>
> A script accepting data from stdin/out:
> 
> cat /root/bowtiestreaming.sh
> cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
> /home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2> /root/bowtie.log
>
>
>
> A file copied to HDFS:
> 
> hadoop fs -put
> SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
> SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
>
> A streaming job invoked with only the mapper:
> 
> hadoop jar
> hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input
> SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
> -output
> SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
> -mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0
>
> The file cannot be found even though it is displayed:
> 
> hadoop fs -cat
> /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
> 11/12/06 09:07:47 INFO security.Groups: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=30
> 11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
> Instead, use mapreduce.task.attempt.id
> cat: File does not exist:
> /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
>
>
> The file looks like this (tab separated):
> head
> SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
> @SRR014475.1 :1:1:108:111 length=36     GAGACGTCGTCCTCAGTACATATA
>    I3I+I(%BH43%III7I(5III*<&II+
> @SRR014475.2 :1:1:112:26 length=36      GNNTTCCCCAACTTCCAAATCACCTAAC
>    I!!II=I @SRR014475.3 :1:1:101:937 length=36     GAAGATCCGGTACAACCCTGATGTAAATGGTA
>    IAIIAII%I @SRR014475.4 :1:1:124:64 length=36      GAACACATAGAACAACAGGATTCGCCAGAACACCTG
>    III> @SRR014475.5 :1:1:108:897 length=36     GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
>    I0I:I'+IG3II46II0>C@=III()+:+2&$
> @SRR014475.6 :1:1:106:14 length=36      GNNNTNTAGCATTAAGTAATTGGT
>    I!!!I!I6I*+III:%IB0+I.%?
> @SRR014475.7 :1:1:118:934 length=36     GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT
>    III0%%)&%I.I&I;III.(I@E&2>*'+1;;#;&'
> @SRR014475.8 :1:1:123:8 length=36       GNNNTTNN
>    I!!!$(!!
> @SRR014475.9 :1:1:118:88 length=36      GGAAACTGGCGCGCTACCAGGTAACGCGCCAC
>    IIIGIAA4;1+16*;*+)'$%#$%
> @SRR014475.10 :1:1:92:122 length=36     ATTTGCTGCCAATGGCGAGATTACGAATAATA
>    IICII;CGIDI?%$I:%6)C*;#;
>
>
> and the result looks like this:
>
> cat
> SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
> |./bowtiestreaming.sh |head
> @SRR014475.3 :1:1:101:937 length=36     +
> gi|110640213|ref|NC_008253.1|   3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA
>    IAIIAII%IC,27:G>T
> @SRR014475.4 :1:1:124:64 length=36      +
> gi|110640213|ref|NC_008253.1|   2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG
>    III>C
> @SRR014475.5 :1:1:108:897 length=36     +
> gi|110640213|ref|NC_008253.1|   4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
>    I0I:I'+IG3II46II0>C@=III()+:+2&$  0
> 5:C>A,28:G>T,29:C>G,30:A>T,34:C>T
> @SRR014475.9 :1:1:118:88 length=36      -
> gi|110640213|ref|NC_008253.1|   3598410 GTGGCGCGTTACCTGGTAGCGCGCCAGTTTCC
>    %$#%$')+*;*61+1;4AAIGIII  0
> @SRR014475.15 :1:1:87:967 length=36     +
> gi|110640213|ref|NC_008253.1|   4474247 GACTACACGATCGCCTGCCTTAATATTCTTTACACC
>    A27II7CIII*I5I+F?II'  0       6:G>A,26:G>T
> @SRR014475.20 :1:1:108:121 length=36    -
> gi|110640213|ref|NC_008253.1|   37761   AATGCATATTGAGAGTGTGATTATTAGC
>    IT
> @SRR014475.23 :1:1:75:54 length=36      +
> gi|110640213|ref|NC_008253.1|   2465453 GGTTTCTTTCTGCGCAGATGCCAGACGGTCTTTATA
>    CII @SRR014475.24 :1:1:89:904 length=36     -
> gi|110640213|ref|NC_008253.1|   3216193 ATTAGTGTTAAGATTTCTATATTGTTGAGGCC
>    #%);%;$EI-;$%8%&I%I/+III  0
> 18:C>T,21:G>T,30:C>T,31:T>G,34:A>T
> @SRR014475.27 :1:1:74:887 length=36     -
> gi|110640213|ref|NC_008253.1|   540567  AAACGTGGCGTTTCAGGGATCGTTTGCCTGCATTAC
>    *&(%9%0F3.@4;&?4I3I6%:

Re: Hadoop Streaming

2011-12-03 Thread Tom Melendez
Oh, I see the line wrapped.  My bad.

Either way, I think the NLineInputFormat is what you need.  I'm
assuming you want one line of input to execute on one mapper.

Thanks,

Tom
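For the one-line-per-mapper case, the input format is usually paired with its
lines-per-map setting; a sketch with the 0.20-era property name (the -D
generic option must come before the streaming-specific options):

-D mapred.line.input.format.linespermap=1
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat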

On Sat, Dec 3, 2011 at 7:57 PM, Daniel Yehdego
 wrote:
>
> TOM,
> What the HADOOP script does is read each line from STDIN and execute the
> program pknotsRG; temp.txt is a temporary file.
> The script is like this:
>    #!/bin/sh
>    rm -f temp.txt;
>    while read line
>    do
>      echo $line >> temp.txt;
>    done
>    exec /data/yehdego/hadoop-0.20.2/PKNOTSRG/src/pknotsRG -k 0 -F temp.txt;
>
>> Date: Sat, 3 Dec 2011 19:49:46 -0800
>> Subject: Re: Hadoop Streaming
>> From: t...@supertom.com
>> To: common-user@hadoop.apache.org
>>
>> Hi Daniel,
>>
>> I see from your other thread that your HADOOP script has a line like:
>>
>> #!/bin/shrm -f temp.txt
>>
>> I'm not sure what that is, exactly.  I suspect the -f is reading from
>> some file, and the while loop you had listed seems to read from stdin.
>>
>> What does your input look like?  I think what's happening is that you
>> might be expecting lines of input and you're getting splits.
>>
>> You might want to try this:
>> -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>>
>> On Sat, Dec 3, 2011 at 7:22 PM, Daniel Yehdego
>>  wrote:
>> >
>> > Thanks Tom for your reply,
>> > I think my code is reading from stdin, because I tried it locally using
>> > the following command and it runs:
>> >  $ bin/hadoop fs -cat 
>> > /user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt 
>> > | head -2 | ./HADOOP
>> >
>> > But when I tried streaming, it failed and gave me error code 126.
>> >
>> >> Date: Sat, 3 Dec 2011 19:14:20 -0800
>> >> Subject: Re: Hadoop Streaming
>> >> From: t...@supertom.com
>> >> To: common-user@hadoop.apache.org
>> >>
>> >> So that code 126 should be kicked out by your program - do you know
>> >> what that means?
>> >>
>> >> Your code can read from stdin?
>> >>
>> >> Thanks,
>> >>
>> >> Tom
>> >>
>> >> On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
>> >>  wrote:
>> >> >
>> >> > I have the following error in running hadoop streaming,
>> >> > PipeMapRed\.waitOutputThreads(): subprocess failed with code 126        
>> >> > at 
>> >> > org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
>> >> >   at 
>> >> > org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
>> >> >      at 
>> >> > org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132) 
>> >> >      at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57) 
>> >> >      at 
>> >> > org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36)
>> >> >    at 
>> >> > org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358)   
>> >> >      at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
>> >> > org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
>> >> > I couldn't find any other error information.
>> >> > Any help?
>> >> >
>> >
>


RE: Hadoop Streaming

2011-12-03 Thread Daniel Yehdego

TOM, 
What the HADOOP script does is read each line from STDIN and execute the
program pknotsRG; temp.txt is a temporary file.
The script is like this:
#!/bin/sh
rm -f temp.txt;
while read line
do
  echo $line >> temp.txt;
done
exec /data/yehdego/hadoop-0.20.2/PKNOTSRG/src/pknotsRG -k 0 -F temp.txt;

> Date: Sat, 3 Dec 2011 19:49:46 -0800
> Subject: Re: Hadoop Streaming
> From: t...@supertom.com
> To: common-user@hadoop.apache.org
> 
> Hi Daniel,
> 
> I see from your other thread that your HADOOP script has a line like:
> 
> #!/bin/shrm -f temp.txt
> 
> I'm not sure what that is, exactly.  I suspect the -f is reading from
> some file, and the while loop you had listed seems to read from stdin.
> 
> What does your input look like?  I think what's happening is that you
> might be expecting lines of input and you're getting splits.
> 
> You might want to try this:
> -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
> 
> Thanks,
> 
> Tom
> 
> 
> 
> 
> On Sat, Dec 3, 2011 at 7:22 PM, Daniel Yehdego
>  wrote:
> >
> > Thanks Tom for your reply,
> > I think my code is reading from stdin, because I tried it locally using the
> > following command and it runs:
> >  $ bin/hadoop fs -cat 
> > /user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt 
> > | head -2 | ./HADOOP
> >
> > But when I tried streaming, it failed and gave me error code 126.
> >
> >> Date: Sat, 3 Dec 2011 19:14:20 -0800
> >> Subject: Re: Hadoop Streaming
> >> From: t...@supertom.com
> >> To: common-user@hadoop.apache.org
> >>
> >> So that code 126 should be kicked out by your program - do you know
> >> what that means?
> >>
> >> Your code can read from stdin?
> >>
> >> Thanks,
> >>
> >> Tom
> >>
> >> On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
> >>  wrote:
> >> >
> >> > I have the following error in running hadoop streaming,
> >> > PipeMapRed\.waitOutputThreads(): subprocess failed with code 126
> >> > at 
> >> > org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
> >> >   at 
> >> > org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
> >> >  at 
> >> > org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132)  
> >> > at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)   
> >> >at 
> >> > org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36)
> >> >at 
> >> > org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358)
> >> > at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
> >> > org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
> >> > I couldn't find any other error information.
> >> > Any help?
> >> >
> >
  

Re: Hadoop Streaming

2011-12-03 Thread Tom Melendez
Hi Daniel,

I see from your other thread that your HADOOP script has a line like:

#!/bin/shrm -f temp.txt

I'm not sure what that is, exactly.  I suspect the -f is reading from
some file, and the while loop you had listed seems to read from stdin.

What does your input look like?  I think what's happening is that you
might be expecting lines of input and you're getting splits.

You might want to try this:
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

Thanks,

Tom




On Sat, Dec 3, 2011 at 7:22 PM, Daniel Yehdego
 wrote:
>
> Thanks Tom for your reply,
> I think my code is reading from stdin, because I tried it locally using the
> following command and it runs:
>  $ bin/hadoop fs -cat 
> /user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt | 
> head -2 | ./HADOOP
>
> But when I tried streaming, it failed and gave me error code 126.
>
>> Date: Sat, 3 Dec 2011 19:14:20 -0800
>> Subject: Re: Hadoop Streaming
>> From: t...@supertom.com
>> To: common-user@hadoop.apache.org
>>
>> So that code 126 should be kicked out by your program - do you know
>> what that means?
>>
>> Your code can read from stdin?
>>
>> Thanks,
>>
>> Tom
>>
>> On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
>>  wrote:
>> >
>> > I have the following error in running hadoop streaming,
>> > PipeMapRed\.waitOutputThreads(): subprocess failed with code 126        at 
>> > org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
>> >   at 
>> > org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
>> >      at 
>> > org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132)    
>> >   at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)      
>> > at 
>> > org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36) 
>> >   at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358) 
>> >        at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
>> > org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
>> > I couldn't find any other error information.
>> > Any help?
>> >
>


RE: Hadoop Streaming

2011-12-03 Thread Daniel Yehdego

Thanks Tom for your reply, 
I think my code is reading from stdin, because I tried it locally using the
following command and it runs:
 $ bin/hadoop fs -cat 
/user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt | 
head -2 | ./HADOOP

But when I tried streaming, it failed and gave me error code 126.

> Date: Sat, 3 Dec 2011 19:14:20 -0800
> Subject: Re: Hadoop Streaming
> From: t...@supertom.com
> To: common-user@hadoop.apache.org
> 
> So that code 126 should be kicked out by your program - do you know
> what that means?
> 
> Your code can read from stdin?
> 
> Thanks,
> 
> Tom
> 
> On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
>  wrote:
> >
> > I have the following error in running hadoop streaming,
> > PipeMapRed\.waitOutputThreads(): subprocess failed with code 126at 
> > org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
> >   at 
> > org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
> >  at 
> > org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132) 
> >  at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)  at 
> > org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36)  
> >  at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358)   
> >  at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
> > org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
> > I couldn't find any other error information.
> > Any help?
> >
  

Re: Hadoop Streaming

2011-12-03 Thread Tom Melendez
So that code 126 should be kicked out by your program - do you know
what that means?

Your code can read from stdin?

Thanks,

Tom

On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
 wrote:
>
> I have the following error in running hadoop streaming,
> PipeMapRed\.waitOutputThreads(): subprocess failed with code 126        at 
> org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
>   at 
> org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
>      at 
> org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132)      
> at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)      at 
> org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36)   
> at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358)      
>   at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
> org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
> I couldn't find any other error information.
> Any help?
>


Hadoop Streaming

2011-12-03 Thread Daniel Yehdego

I have the following error in running hadoop streaming, 
PipeMapRed\.waitOutputThreads(): subprocess failed with code 126at 
org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
  at 
org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
 at org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132) 
 at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)  at 
org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36)   at 
org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358)at 
org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
I couldn't find any other error information.
Any help?
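For what the exit codes in these traces conventionally mean in POSIX shells:
126 is "found but not executable" (permissions or a bad interpreter line),
127 is "command not found", and small values like 1 or 2 are failures
reported by the program itself. A quick way to reproduce what the task runs,
assuming a small sample file (sample_input.txt here is a placeholder):

cat sample_input.txt | ./HADOOP
echo $?

If this prints 126 locally, a chmod +x on the script, or fixing its #!/bin/sh
line, is usually the cure.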
  

RE: Hadoop-streaming using binary executable c program

2011-12-02 Thread Daniel Yehdego





Hi.

I was trying to run hadoop streaming, and before that I checked with the
following:
bin/hadoop fs -cat 
/user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt | 
head -2 | ./HADOOP 
Where HADOOP is a shell script:
#!/bin/shrm -f temp.txt;while read line doecho $line >> temp.txt;doneexec 
/data/yehdego/hadoop-0.20.2/PKNOTSRG/src/bin/pknotsRG -k o -F temp.txt;
and it's working, but when I try running it with streaming using the following:
 bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
./HADOOP  -file /data/yehdego/hadoop-0.20.2/HADOOP -file 
/data/yehdego/hadoop-0.20.2/PKNOTSRG/src/bin/pknotsRG -reducer 
./ReduceLatest.py -file /data/yehdego/hadoop-0.20.2/ReduceLatest.py -input 
/user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt  
-output /user/yehdego/RF171_NEW/RF00171_A.bpseqL3G1_Optimized_Method40.txt 
-verbose 
it failed with the following error:
PipeMapRed\.waitOutputThreads(): subprocess failed with code 126at 
org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:311)
  at 
org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:545)
 at org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:132) 
 at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)  at 
org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:36)   at 
org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:358)at 
org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:307) at 
org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:170)
Any idea on this problem?
Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 25 Jul 2011 14:47:34 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> This is likely to be slow and it is not ideal.  The ideal would be to modify 
> pknotsRG to be able to read from stdin, but that may not be possible.
> 
> The shell script would probably look something like the following
> 
> #!/bin/sh
> rm -f temp.txt;
> while read line
> do
>   echo $line >> temp.txt;
> done
> exec pknotsRG temp.txt;
> 
> Place it in a file, say hadoopPknotsRG. Then you probably want to run
> 
> chmod +x hadoopPknotsRG
> 
> After that you want to test it with
> 
> hadoop fs -cat 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
> ./hadoopPknotsRG
> 
> If that works then you can try it with Hadoop streaming
> 
> HADOOP_HOME$ bin/hadoop jar 
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> /user/yehdego/RF-out -reducer NONE -verbose
> 
> --Bobby
> 
> On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:
> 
> 
> 
> Good afternoon Bobby,
> 
> Thanks, you were a great help in finding out what the problem was. After I
> ran the command line you suggested, I found out that there was a
> segmentation error.
> The binary executable program pknotsRG only reads a file with a sequence in
> it. This means there should be a shell script, as you have said, that will
> take the data coming from stdin and write it to a temporary file. Any idea
> on how to do this job in a shell script? The thing is, I am from a biology
> background and don't have much experience in CS.
> Looking forward to hearing from you. Thanks so much.
> 
> Regards,
> 
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
> 
> > From: ev...@yahoo-inc.com
> > To: common-user@hadoop.apache.org
> > Date: Fri, 22 Jul 2011 12:39:08 -0700
> > Subject: Re: Hadoop-streaming using binary executable c program
> >
> > I would suggest that you do the following to help you debug.
> >
> > hadoop fs -cat 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 
> > | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
> >
> > This is simulating what hadoop streaming is doing.  Here we are taking the 
> > first 2 lines out of the input file and feeding them to the stdin of 
> > pknotsRG.  The first step is to make sure that you can get your program to 
> > run correctly with something like this.  You may need to change the command 
> > line to pknotsRG to get it to read the data it is processing from stdin,
> > instead of from a file.  Alternatively you may need to write a script that
> > takes the data coming from stdin and writes it to a temporary file.
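A Python version of that wrapper idea, for anyone who prefers it to shell (a
sketch only: it assumes pknotsRG is shipped with -file so it sits in the
task's working directory, and it reuses the -k 0 -F flags quoted in this
thread):

#!/usr/bin/env python
# Collect the lines streaming feeds us on stdin into a temporary file,
# because pknotsRG reads sequences from a file rather than from stdin.
import os
import subprocess
import sys
import tempfile

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'w') as tmp:
        for line in sys.stdin:
            tmp.write(line)
    # Run the binary on the temp file; whatever it prints to stdout
    # becomes the map output.
    subprocess.check_call(['./pknotsRG', '-k', '0', '-F', path])
finally:
    os.remove(path)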

Hadoop Streaming

2011-10-01 Thread Daniel Yehdego

Hi all, 
I am using hadoop streaming and I want to use a secondary sort so that I will
output my values in order. Can I use stream.num.map.input.key.fields instead of
stream.num.map.output.key.fields? I am doing this because the output from the
mapper is just a string of letters and it's difficult to use the keys for
comparison.

Regards, 
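A stream.num.map.input.key.fields setting does not appear in the streaming
documentation I know of; the usual recipe for a secondary sort in 0.20-era
streaming instead combines the key-field comparator and partitioner. A
sketch, assuming the mapper emits two tab-separated key fields followed by
the value (mapper/reducer names here are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar
-D stream.num.map.output.key.fields=2
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-D mapred.text.key.comparator.options="-k1,1 -k2,2n"
-D mapred.text.key.partitioner.options=-k1,1
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-mapper ./map.py
-reducer ./reduce.py
-input in -output out

Partitioning on field 1 while sorting on fields 1 and 2 means each reducer
sees one primary key at a time with its values already ordered by the
secondary field.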


  

Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Mark Kerzner
Thank you, Prashant, it seems so. I already verified this by refactoring the
code to use 0.20 API as well as 0.21 API in two different packages, and
streaming happily works with 0.20.

Mark

On Wed, Sep 14, 2011 at 11:46 PM, Prashant  wrote:

> On 09/15/2011 08:18 AM, Mark Kerzner wrote:
>
>> Hi,
>>
>> I am using the latest Cloudera distribution, and with that I am able to
>> use
>> the latest Hadoop API, which I believe is 0.21, for such things as
>>
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> So I am using mapreduce, not mapred, and everything works fine.
>>
>> However, in a small streaming job, trying it out with Java classes first,
>> I
>> get this error
>>
>> Exception in thread "main" java.lang.RuntimeException: class mypackage.Map
>> not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a
>> mapreduce.Mapper.
>>
>> So it seems that Cloudera backports some of the advances but for streaming
>> it is still the old API.
>>
>> So is it me or the world?
>>
>> Thank you,
>> Mark
>>
>>  The world!
>


Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Prashant

On 09/15/2011 08:18 AM, Mark Kerzner wrote:

Hi,

I am using the latest Cloudera distribution, and with that I am able to use
the latest Hadoop API, which I believe is 0.21, for such things as

import org.apache.hadoop.mapreduce.Reducer;

So I am using mapreduce, not mapred, and everything works fine.

However, in a small streaming job, trying it out with Java classes first, I
get this error

Exception in thread "main" java.lang.RuntimeException: class mypackage.Map
not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a
mapreduce.Mapper.

So it seems that Cloudera backports some of the advances but for streaming
it is still the old API.

So is it me or the world?

Thank you,
Mark


The world!


Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Mark Kerzner
I am sorry, you are right.

mark

On Wed, Sep 14, 2011 at 9:52 PM, Konstantin Boudnik  wrote:

> I am sure that if you ask on the provider's specific list you'll get a better
> answer than from the common Hadoop list ;)
>
> Cos
>
> On Wed, Sep 14, 2011 at 09:48PM, Mark Kerzner wrote:
> > Hi,
> >
> > I am using the latest Cloudera distribution, and with that I am able to
> use
> > the latest Hadoop API, which I believe is 0.21, for such things as
> >
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > So I am using mapreduce, not mapred, and everything works fine.
> >
> > However, in a small streaming job, trying it out with Java classes first,
> I
> > get this error
> >
> > Exception in thread "main" java.lang.RuntimeException: class
> mypackage.Map
> > not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a
> > mapreduce.Mapper.
> >
> > So it seems that Cloudera backports some of the advances but for
> streaming
> > it is still the old API.
> >
> > So is it me or the world?
> >
> > Thank you,
> > Mark
>


Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Konstantin Boudnik
I am sure that if you ask on the provider's specific list you'll get a better
answer than from the common Hadoop list ;)

Cos

On Wed, Sep 14, 2011 at 09:48PM, Mark Kerzner wrote:
> Hi,
> 
> I am using the latest Cloudera distribution, and with that I am able to use
> the latest Hadoop API, which I believe is 0.21, for such things as
> 
> import org.apache.hadoop.mapreduce.Reducer;
> 
> So I am using mapreduce, not mapred, and everything works fine.
> 
> However, in a small streaming job, trying it out with Java classes first, I
> get this error
> 
> Exception in thread "main" java.lang.RuntimeException: class mypackage.Map
> not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a
> mapreduce.Mapper.
> 
> So it seems that Cloudera backports some of the advances but for streaming
> it is still the old API.
> 
> So is it me or the world?
> 
> Thank you,
> Mark




Am i crazy? - question about hadoop streaming

2011-09-14 Thread Mark Kerzner
Hi,

I am using the latest Cloudera distribution, and with that I am able to use
the latest Hadoop API, which I believe is 0.21, for such things as

import org.apache.hadoop.mapreduce.Reducer;

So I am using mapreduce, not mapred, and everything works fine.

However, in a small streaming job, trying it out with Java classes first, I
get this error

Exception in thread "main" java.lang.RuntimeException: class mypackage.Map
not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a
mapreduce.Mapper.

So it seems that Cloudera backports some of the advances but for streaming
it is still the old API.

So is it me or the world?

Thank you,
Mark


Re: Hadoop Streaming job Fails - Permission Denied error

2011-09-14 Thread Shi Yu
Just a quick question: have you tried switching off the dfs permission check
in the hdfs-site.xml file?

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>



Shi

On 9/14/2011 8:29 AM, Brock Noland wrote:

Hi,

This probably belongs on mapreduce-user as opposed to common-user. I
have BCC'ed the common-user group.

Generally it's a best practice to ship the scripts with the job. Like so:

hadoop  jar
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
-input /userdata/bejoy/apps/wc/input -output /userdata/bejoy/apps/wc/output
-mapper WcStreamMap.py  -reducer WcStreamReduce.py
-file /home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py
-file /home/cloudera/bejoy/apps/inputs/wc/WcStreamReduce.py

Brock

On Mon, Sep 12, 2011 at 4:18 AM, Bejoy KS  wrote:

Hi
  I wanted to try out hadoop streaming and got the sample python code for
the mapper and reducer. I copied both into my lfs and tried running the
streaming job as mentioned in the documentation.
Here is the command I used to run the job:

hadoop  jar
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
-input /userdata/bejoy/apps/wc/input -output /userdata/bejoy/apps/wc/output
-mapper /home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py  -reducer
/home/cloudera/bejoy/apps/inputs/wc/WcStreamReduce.py

Here, other than the input and output, everything else is on lfs locations.
However, the job is failing. The error log from the jobtracker URL is as
follows:

java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:386)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:230)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program
"/home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py": java.io.IOException:
error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission
denied
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

On seeing the error I checked the permissions of the mapper and reducer, and
issued a chmod 777 command as well. Still no luck.

The permission of the files are as follows
cloudera@cloudera-vm:~$ ls -l /home/cloudera/bejoy/apps/inputs/wc/
-rwxrwxrwx 1 cloudera cloudera  707 2011-09-11 23:42 WcStreamMap.py
-rwxrwxrwx 1 cloudera cloudera 1077 2011-09-11 23:42 WcStreamReduce.py

I'm testing this on the Cloudera Demo VM, so the hadoop setup is in
pseudo-distributed mode. Any help would be highly appreciated.

Thank You

Regards
Bejoy.K.S






Re: Hadoop Streaming job Fails - Permission Denied error

2011-09-14 Thread Brock Noland
Hi,

This probably belongs on mapreduce-user as opposed to common-user. I
have BCC'ed the common-user group.

Generally it's a best practice to ship the scripts with the job. Like so:

hadoop  jar
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
-input /userdata/bejoy/apps/wc/input -output /userdata/bejoy/apps/wc/output
-mapper WcStreamMap.py  -reducer WcStreamReduce.py
-file /home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py
-file /home/cloudera/bejoy/apps/inputs/wc/WcStreamReduce.py

Brock
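The WcStreamMap.py / WcStreamReduce.py scripts themselves are not shown in
the thread; a minimal word-count pair in the usual streaming style (an
illustration only, not the poster's actual code) would be:

#!/usr/bin/env python
# WcStreamMap.py: emit "word<TAB>1" for every word read from stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print '%s\t%s' % (word, 1)

#!/usr/bin/env python
# WcStreamReduce.py: input arrives sorted by key, so sum runs of equal words.
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip('\n').split('\t', 1)
    if word != current:
        if current is not None:
            print '%s\t%d' % (current, count)
        current, count = word, 0
    count += int(n)
if current is not None:
    print '%s\t%d' % (current, count)

Note the interpreter line at the top of each script: error=13 (Permission
denied) is raised when the task JVM cannot exec the mapper, so the scripts
need an executable bit (chmod +x) and a valid shebang where the task runs;
shipping them with -file and referring to them by bare name, as above,
sidesteps stale local paths on the task nodes.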

On Mon, Sep 12, 2011 at 4:18 AM, Bejoy KS  wrote:
> Hi
>      I wanted to try out hadoop streaming and got the sample python code for
> the mapper and reducer. I copied both into my lfs and tried running the
> streaming job as mentioned in the documentation.
> Here is the command I used to run the job:
>
> hadoop  jar
> /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
> -input /userdata/bejoy/apps/wc/input -output /userdata/bejoy/apps/wc/output
> -mapper /home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py  -reducer
> /home/cloudera/bejoy/apps/inputs/wc/WcStreamReduce.py
>
> Here, other than the input and output, everything else is on lfs locations.
> However, the job is failing. The error log from the jobtracker URL is as
> follows:
>
> java.lang.RuntimeException: Error in configuring object
>    at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>    at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>    at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:386)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:396)
>    at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>    at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.reflect.InvocationTargetException
>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>    at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    at java.lang.reflect.Method.invoke(Method.java:597)
>    at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>    ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
>    at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>    at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>    at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>    ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>    at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    at java.lang.reflect.Method.invoke(Method.java:597)
>    at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>    ... 17 more
> Caused by: java.lang.RuntimeException: configuration exception
>    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:230)
>    at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
>    ... 22 more
> Caused by: java.io.IOException: Cannot run program
> "/home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py": java.io.IOException:
> error=13, Permission denied
>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
>    ... 23 more
> Caused by: java.io.IOException: java.io.IOException: error=13, Permission
> denied
>    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>    ... 24 more
>
> On seeing the error I checked the permissions of the mapper and reducer, and
> issued a chmod 777 command as well. Still no luck.
>
> The permission of the files are as follows
> cloudera@cloudera-vm:~$ ls -l /home/cloudera/bejoy/apps/inputs/wc/
> -rwxrwxrwx 1 cloudera cloudera  707 2011-09-11 23:42 WcStreamMap.py
> -rwxrwxrwx 1 cloudera cloudera 1077 2011-09-11 23:42 WcStreamReduce.py
>
> I'm testing this on the Cloudera Demo VM, so the hadoop setup is in
> pseudo-distributed mode. Any help would be highly appreciated.
>
> Thank You
>
> Regards
> Bejoy.K.S
>


Hadoop Streaming job Fails - Permission Denied error

2011-09-12 Thread Bejoy KS
Hi
  I wanted to try out hadoop streaming and got the sample python code for
the mapper and reducer. I copied both into my lfs and tried running the
streaming job as mentioned in the documentation.
Here is the command I used to run the job:

hadoop  jar
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
-input /userdata/bejoy/apps/wc/input -output /userdata/bejoy/apps/wc/output
-mapper /home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py  -reducer
/home/cloudera/bejoy/apps/inputs/wc/WcStreamReduce.py

Here, other than the input and output, everything else is on lfs locations.
However, the job is failing. The error log from the jobtracker URL is as
follows:

java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:386)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:230)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program
"/home/cloudera/bejoy/apps/inputs/wc/WcStreamMap.py": java.io.IOException:
error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
... 23 more
Caused by: java.io.IOException: java.io.IOException: error=13, Permission
denied
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more

On seeing the error I checked the permissions of the mapper and reducer, and
issued a chmod 777 command as well. Still no luck.

The permission of the files are as follows
cloudera@cloudera-vm:~$ ls -l /home/cloudera/bejoy/apps/inputs/wc/
-rwxrwxrwx 1 cloudera cloudera  707 2011-09-11 23:42 WcStreamMap.py
-rwxrwxrwx 1 cloudera cloudera 1077 2011-09-11 23:42 WcStreamReduce.py

I'm testing this on the Cloudera Demo VM, so the hadoop setup is in
pseudo-distributed mode. Any help would be highly appreciated.

Thank You

Regards
Bejoy.K.S


RE: Map-reduce sample code not working for hadoop streaming

2011-08-29 Thread Stanislav.Seltser
Hadoop is not supported on Windows XP.
Use Linux.

-Original Message-
From: ws_dev2001 [mailto:ws_dev2...@yahoo.com] 
Sent: 29 August 2011, 10:13
To: core-u...@hadoop.apache.org
Subject: Map-reduce sample code not working for hadoop streaming


Hi,
I am trying to run the following code sample: 
http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-example/
I am on Windows XP, Cygwin and am using hadoop-0.20.2.

I am getting stuck on an error while running the hadoop streaming sample on
this page. Please let me know if I need to supply further information.

hadoop@ ~
$ /usr/local/hadoop-0.20.2/bin/hadoop jar
E:/Tools/cygwin/usr/local/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar
-mapper python multifetch-mapper.py -file /tmp/multifetch-mapper.py -reducer
python multifetch-reducer.py -file /tmp/multifetch-reducer.py -input urls/*
-output titles
packageJobJar: [/tmp/multifetch-mapper.py, /tmp/multifetch-reducer.py,
/E:/tmp/hadoop/hadoop-unjar9032604726399647405/] [] C:\Documents and
Settings\hadoop\streamjob1540735851960870565.jar tmpDir=null
11/08/29 16:22:34 INFO mapred.FileInputFormat: Total input paths to process
: 2
11/08/29 16:22:34 INFO streaming.StreamJob: getLocalDirs():
[/tmp/hadoop/mapred/local]
11/08/29 16:22:34 INFO streaming.StreamJob: Running job:
job_201108261641_0042
11/08/29 16:22:34 INFO streaming.StreamJob: To kill this job, run:
11/08/29 16:22:34 INFO streaming.StreamJob:
E:\Tools\cygwin\usr\local\hadoop-0.20.2\/bin/hadoop job 
-Dmapred.job.tracker=localhost:9002 -kill job_201108261641_0042
11/08/29 16:22:34 INFO streaming.StreamJob: Tracking URL:
http://localhost:50030
/jobdetails.jsp?jobid=job_201108261641_0042
11/08/29 16:22:35 INFO streaming.StreamJob:  map 0%  reduce 0%
11/08/29 16:22:53 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 16:23:04 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 16:49:17 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 16:49:23 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:48:28 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 18:48:52 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:48:55 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 18:48:58 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:49:01 INFO streaming.StreamJob:  map 100%  reduce 17%
11/08/29 18:49:07 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:49:16 INFO streaming.StreamJob:  map 100%  reduce 17%
11/08/29 18:49:22 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:52:43 INFO streaming.StreamJob:  map 100%  reduce 100%
11/08/29 18:52:43 INFO streaming.StreamJob: To kill this job, run:
11/08/29 18:52:43 INFO streaming.StreamJob:
E:\Tools\cygwin\usr\local\hadoop-0.20.2\/bin/hadoop job 
-Dmapred.job.tracker=localhost:9002 -kill job_201108261641_0042
11/08/29 18:52:43 INFO streaming.StreamJob: Tracking URL:
http://localhost:50030
/jobdetails.jsp?jobid=job_201108261641_0042
11/08/29 18:52:43 ERROR streaming.StreamJob: Job not Successful!
11/08/29 18:52:43 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Please help me out on this.
TIA,
R2D2.

-- 
View this message in context: 
http://old.nabble.com/Map-reduce-sample-code-not-working-for-hadoop-streaming-tp32357180p32357180.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Map-reduce sample code not working for hadoop streaming

2011-08-29 Thread ws_dev2001

Hi,
I am trying to run the following code sample: 
http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-example/
I am on Windows XP, Cygwin and am using hadoop-0.20.2.

I am getting stuck on an error while running the hadoop streaming sample on
this page. Please let me know if I need to supply further information.

hadoop@ ~
$ /usr/local/hadoop-0.20.2/bin/hadoop jar
E:/Tools/cygwin/usr/local/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar
-mapper python multifetch-mapper.py -file /tmp/multifetch-mapper.py -reducer
python multifetch-reducer.py -file /tmp/multifetch-reducer.py -input urls/*
-output titles
packageJobJar: [/tmp/multifetch-mapper.py, /tmp/multifetch-reducer.py,
/E:/tmp/hadoop/hadoop-unjar9032604726399647405/] [] C:\Documents and
Settings\hadoop\streamjob1540735851960870565.jar tmpDir=null
11/08/29 16:22:34 INFO mapred.FileInputFormat: Total input paths to process
: 2
11/08/29 16:22:34 INFO streaming.StreamJob: getLocalDirs():
[/tmp/hadoop/mapred/local]
11/08/29 16:22:34 INFO streaming.StreamJob: Running job:
job_201108261641_0042
11/08/29 16:22:34 INFO streaming.StreamJob: To kill this job, run:
11/08/29 16:22:34 INFO streaming.StreamJob:
E:\Tools\cygwin\usr\local\hadoop-0.20.2\/bin/hadoop job 
-Dmapred.job.tracker=localhost:9002 -kill job_201108261641_0042
11/08/29 16:22:34 INFO streaming.StreamJob: Tracking URL:
http://localhost:50030
/jobdetails.jsp?jobid=job_201108261641_0042
11/08/29 16:22:35 INFO streaming.StreamJob:  map 0%  reduce 0%
11/08/29 16:22:53 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 16:23:04 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 16:49:17 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 16:49:23 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:48:28 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 18:48:52 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:48:55 INFO streaming.StreamJob:  map 50%  reduce 0%
11/08/29 18:48:58 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:49:01 INFO streaming.StreamJob:  map 100%  reduce 17%
11/08/29 18:49:07 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:49:16 INFO streaming.StreamJob:  map 100%  reduce 17%
11/08/29 18:49:22 INFO streaming.StreamJob:  map 100%  reduce 0%
11/08/29 18:52:43 INFO streaming.StreamJob:  map 100%  reduce 100%
11/08/29 18:52:43 INFO streaming.StreamJob: To kill this job, run:
11/08/29 18:52:43 INFO streaming.StreamJob:
E:\Tools\cygwin\usr\local\hadoop-0.20.2\/bin/hadoop job 
-Dmapred.job.tracker=localhost:9002 -kill job_201108261641_0042
11/08/29 18:52:43 INFO streaming.StreamJob: Tracking URL:
http://localhost:50030
/jobdetails.jsp?jobid=job_201108261641_0042
11/08/29 18:52:43 ERROR streaming.StreamJob: Job not Successful!
11/08/29 18:52:43 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Please help me out on this.
TIA,
R2D2.

-- 
View this message in context: 
http://old.nabble.com/Map-reduce-sample-code-not-working-for-hadoop-streaming-tp32357180p32357180.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Hadoop Streaming 0.20.2 and how to specify number of reducers per node -- is it possible?

2011-08-21 Thread Allen Wittenauer

On Aug 17, 2011, at 12:36 AM, Steven Hafran wrote:
> 
> 
> after reviewing the hadoop docs, i've tried setting the following properties
> when starting my streaming job; however, they don't seem to have any impact.
> -jobconf mapred.tasktracker.reduce.tasks.maximum=1

"tasktracker" is the hint:  that's a server side setting.

You're looking for the mapred.reduce.tasks settings.
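
For example, a streaming invocation asking for 30 reduce tasks might look like this (a sketch, not from the thread; the generic -D option must come before the streaming-specific options, and the older -jobconf mapred.reduce.tasks=30 spelling still works but is deprecated):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2.jar \
  -D mapred.reduce.tasks=30 \
  -input <input> -output <output> \
  -mapper <mapper script> -reducer <reducer script>

Capping reducers per node, on the other hand, means setting mapred.tasktracker.reduce.tasks.maximum in each tasktracker's own mapred-site.xml and restarting the tasktrackers.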



Hadoop Streaming 0.20.2 and how to specify number of reducers per node -- is it possible?

2011-08-17 Thread Steven Hafran
hi everyone,

I have a few Hadoop streaming tasks that would benefit from having the
reduce phase execute one reducer per node instead of two per node, due to
high CPU and I/O load.  Currently, I have a 30-node cluster and specify 30
reducers.  When reviewing the job stats on the job tracker, I do see 30
reducers queued/executing; however, I have observed that those reducers are
distributed to only 15 nodes, resulting in only 50% use of my cluster.

After reviewing the Hadoop docs, I've tried setting the following property
when starting my streaming job; however, it doesn't seem to have any impact:
-jobconf mapred.tasktracker.reduce.tasks.maximum=1

How do I tell Hadoop to run one reducer per node with streaming?

thanks in advance for your assistance!

regards,
-steven


Hadoop Streaming Combiner Problem

2011-08-04 Thread Premal Shah
According to the hadoop streaming
docs<http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29>,
there is an inbuilt Aggregate Java class which can work both as a mapper and
a reducer.

Here is the command:
*shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py
-combiner aggregate -reducer NONE -input input_files -output output_path*

Executing this command fails the mapper with this error:
*java.io.IOException: Cannot run program "aggregate": java.io.IOException:
error=2, No such file or directory*

However, if you run this command using aggregate as the reducer and not the
combiner, the job works fine.
*shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py
-reducer aggregate -input input_files -output output_path*

What am I doing wrong? Is aggregate treated as a command and not a
JavaClassName? If yes, how do I use the JavaClassName instead?
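
A note for context: the streaming docs spell out the key format the aggregate reducer consumes, namely keys carrying an aggregator-name prefix such as LongValueSum. A minimal mapper sketch of that format (hedged, not from the thread), in shell:

#!/bin/sh
# word-count mapper for the Aggregate package: emit one
# LongValueSum-prefixed record per word; 'aggregate' then sums the 1s
while IFS= read -r line; do
  for word in $line; do
    printf 'LongValueSum:%s\t1\n' "$word"
  done
done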

-- 
Regards,
Premal Shah.


Re: Hadoop-streaming using binary executable c program

2011-08-02 Thread Robert Evans
What I usually do to debug streaming is to print things to STDERR.  STDERR
shows up in the logs for the attempt, and you should be able to see better what
is happening.  I am not an expert on perl, so I am not sure if you have to pass
in something special to get your perl script to read from STDIN.  I see you
opening handles to all of the files on the command line, but I am not sure how
that works with stdin, because whatever you run through streaming has to read
from stdin and write to stdout.

cat map1.txt map2.txt map3.txt | ./reducer.pl
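
A minimal identity wrapper illustrating the STDERR technique (a sketch, not from the thread): everything written to STDERR lands in the attempt's stderr log, while STDOUT stays the record stream.

#!/bin/sh
n=0
while IFS= read -r line; do
  n=$((n+1))
  echo "DEBUG record $n: $line" >&2   # visible in the task's stderr log
  printf '%s\n' "$line"               # pass the record through untouched
done
echo "DEBUG total records: $n" >&2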

--Bobby

On 8/1/11 5:13 PM, "Daniel Yehdego"  wrote:



Hi Bobby,

I have written a small Perl script which does the following job:

Assume we have an output from the mapper

MAP1
<output line from map 1>

MAP2
<output line from map 2>

MAP3
<output line from map 3>

and what the script does is reduce them in the following manner:
<output line from map 1> <output line from map 2> <output line from map 3>\n

and the script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my @handles = map { open my $h, '<', $_; $h } @ARGV;

while (@handles){
@handles = grep { ! eof $_ } @handles;
my @lines = map { my $v = <$_>; chomp $v; $v } @handles;
print join(' ', @lines), "\n";
}

close $_ for @handles;

This should work for any inputs from the  mapper. But after I use hadoop 
streaming and put the above code as my reducer, the job was successful
but the output files were empty. And I couldn't find out.

 bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
-mapper ./hadoopPknotsRG
-file /data/yehdego/hadoop-0.20.2/pknotsRG
-file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG
-reducer ./reducer.pl
-file /data/yehdego/hadoop-0.20.2/reducer.pl
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
-output /user/yehdego/RFR2-out - verbose

Any help or suggestion is really appreciatedI am just stuck here for the 
weekend.

Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Thu, 28 Jul 2011 07:12:11 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
>
> I am not completely sure what you are getting at.  It looks like the output
> of your c program is (And this is just a guess)  NOTE: \t stands for the tab
> character and in streaming it is used to separate the key from the value; \n
> stands for the newline and is used to separate individual records:
> <RNA>\t<structure>\n
> <RNA>\t<structure>\n
> <RNA>\t<structure>\n
> ...
>
>
> And you want the output to look like
> <RNA1 RNA2 ...>\t<structure1 structure2 ...>\n
>
> You could use a reduce to do this, but the issue here is with the shuffle in
> between the maps and the reduces.  The Shuffle will group by the key to send
> to the reducers and then sort by the key.  So in reality your map output
> looks something like
>
> FROM MAP 1:
> <RNA a>\t<structure a>\n
> <RNA b>\t<structure b>\n
>
> FROM MAP 2:
> <RNA c>\t<structure c>\n
> <RNA d>\t<structure d>\n
>
> FROM MAP 3:
> <RNA e>\t<structure e>\n
> <RNA f>\t<structure f>\n
>
> If you send it to a single reducer (The only way to get a single file) Then
> the input to the reducer will be sorted alphabetically by the RNA, and the
> order of the input will be lost.  You can work around this by giving each
> line a unique number that is in the order you want it to be output.  But
> doing this would require you to write some code.  I would suggest that you
> do it with a small shell script after all the maps have completed to splice
> them together.
>
> --
> Bobby
>
> On 7/27/11 2:55 PM, "Daniel Yehdego"  wrote:
>
>
>
> Hi Bobby,
>
> I just want to ask you if there is away of using a reducer or something like 
> concatenation to glue my outputs from the mapper and outputs
> them as a single file and segment of the predicted RNA 2D structure?
>
> FYI: I have used a reducer NONE before:
>
> HADOOP_HOME$ bin/hadoop jar
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
> /user/yehdego/RF-out -reducer NONE -verbose
>
> and a sample of my output using the mapper of two different slave nodes looks 
> like this :
>
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
> and
> [...(((...))).].
>   (-13.46)
>
> GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
> .(((.((......)..  (-11.00)
>
> and I want to concatenate and output them as a single predicated RNA sequence 
> structure:
>
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU

RE: Hadoop-streaming using binary executable c program

2011-08-01 Thread Daniel Yehdego

Hi Bobby, 

I have written a small Perl script which does the following job:

Assume we have an output from the mapper

MAP1
<output line from map 1>

MAP2
<output line from map 2>

MAP3
<output line from map 3>

and what the script does is reduce them in the following manner:
<output line from map 1> <output line from map 2> <output line from map 3>\n

and the script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# NOTE: this opens one handle per file name passed in @ARGV.  Under
# Hadoop streaming the reducer is handed records on STDIN and no file
# names on its command line, so @handles stays empty and the loop
# below never executes; that is why the job "succeeds" with empty output.
my @handles = map { open my $h, '<', $_; $h } @ARGV;

while (@handles) {
    # drop exhausted handles, then take one line from each remaining file
    @handles = grep { ! eof $_ } @handles;
    my @lines = map { my $v = <$_>; chomp $v; $v } @handles;
    print join(' ', @lines), "\n";
}

close $_ for @handles;

This should work for any inputs from the mapper. But after I used hadoop
streaming and put the above code as my reducer, the job was successful
but the output files were empty, and I couldn't find out why.

 bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar 
-mapper ./hadoopPknotsRG 
-file /data/yehdego/hadoop-0.20.2/pknotsRG 
-file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG 
-reducer ./reducer.pl 
-file /data/yehdego/hadoop-0.20.2/reducer.pl  
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt 
-output /user/yehdego/RFR2-out - verbose
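
The empty output has a mechanical explanation: a streaming reducer receives its records on STDIN, and no file names are passed on its command line, so @ARGV above is empty, @handles stays empty, and the loop never runs.  A minimal sketch of a reducer that reads STDIN instead (shown in shell; the same idea ports directly to the Perl script), joining every incoming line with single spaces the way the join(' ', ...) above intends:

#!/bin/sh
# hedged sketch: consume records from STDIN, as streaming requires
first=1
while IFS= read -r line; do
  if [ "$first" -eq 1 ]; then first=0; else printf ' '; fi
  printf '%s' "$line"
done
printf '\n'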

Any help or suggestion is really appreciated. I am just stuck here for the
weekend.
 
Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Thu, 28 Jul 2011 07:12:11 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> I am not completely sure what you are getting at.  It looks like the output
> of your c program is (And this is just a guess)  NOTE: \t stands for the tab
> character and in streaming it is used to separate the key from the value; \n
> stands for the newline and is used to separate individual records:
> <RNA>\t<structure>\n
> <RNA>\t<structure>\n
> <RNA>\t<structure>\n
> ...
>
>
> And you want the output to look like
> <RNA1 RNA2 ...>\t<structure1 structure2 ...>\n
>
> You could use a reduce to do this, but the issue here is with the shuffle in
> between the maps and the reduces.  The Shuffle will group by the key to send
> to the reducers and then sort by the key.  So in reality your map output
> looks something like
>
> FROM MAP 1:
> <RNA a>\t<structure a>\n
> <RNA b>\t<structure b>\n
>
> FROM MAP 2:
> <RNA c>\t<structure c>\n
> <RNA d>\t<structure d>\n
>
> FROM MAP 3:
> <RNA e>\t<structure e>\n
> <RNA f>\t<structure f>\n
>
> If you send it to a single reducer (The only way to get a single file) Then
> the input to the reducer will be sorted alphabetically by the RNA, and the
> order of the input will be lost.  You can work around this by giving each
> line a unique number that is in the order you want it to be output.  But
> doing this would require you to write some code.  I would suggest that you
> do it with a small shell script after all the maps have completed to splice
> them together.
> 
> --
> Bobby
> 
> On 7/27/11 2:55 PM, "Daniel Yehdego"  wrote:
> 
> 
> 
> Hi Bobby,
> 
> I just want to ask you if there is away of using a reducer or something like 
> concatenation to glue my outputs from the mapper and outputs
> them as a single file and segment of the predicted RNA 2D structure?
> 
> FYI: I have used a reducer NONE before:
> 
> HADOOP_HOME$ bin/hadoop jar
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
> /user/yehdego/RF-out -reducer NONE -verbose
> 
> and a sample of my output using the mapper of two different slave nodes looks 
> like this :
> 
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
> and
> [...(((...))).].
>   (-13.46)
> 
> GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
> .(((.((......)..  (-11.00)
> 
> and I want to concatenate and output them as a single predicated RNA sequence 
> structure:
> 
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
> 
> [...(((...))).].}}}}((((.(((.((......)..
> 
> 
> Regards,
> 
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
> 
> > From: dtyehd...@miners.utep.edu
> > To: common-user@hadoop.apache.org
> > Subject: RE: Hadoop-streaming using binary executable c program
> > Date: Tue, 26 Jul 2011 16:23:10 +
> >
> >
> > Good af

Re: Hadoop-streaming using binary executable c program

2011-07-28 Thread Robert Evans
I am not completely sure what you are getting at.  It looks like the output of
your c program is (And this is just a guess)  NOTE: \t stands for the tab
character and in streaming it is used to separate the key from the value; \n
stands for the newline and is used to separate individual records:

<RNA>\t<structure>\n
<RNA>\t<structure>\n
<RNA>\t<structure>\n
...

And you want the output to look like

<RNA1 RNA2 ...>\t<structure1 structure2 ...>\n

You could use a reduce to do this, but the issue here is with the shuffle in
between the maps and the reduces.  The Shuffle will group by the key to send to
the reducers and then sort by the key.  So in reality your map output looks
something like

FROM MAP 1:
<RNA a>\t<structure a>\n
<RNA b>\t<structure b>\n

FROM MAP 2:
<RNA c>\t<structure c>\n
<RNA d>\t<structure d>\n

FROM MAP 3:
<RNA e>\t<structure e>\n
<RNA f>\t<structure f>\n

If you send it to a single reducer (the only way to get a single file), then the
input to the reducer will be sorted alphabetically by the RNA, and the order of
the input will be lost.  You can work around this by giving each line a unique
number that is in the order you want it to be output.  But doing this would
require you to write some code.  I would suggest that you do it with a small
shell script after all the maps have completed to splice them together.
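
One way to do that splice with stock commands once the -reducer NONE job finishes (a sketch; the output path is the one used in this thread, and it assumes the part-file numbering matches the order you want):

# merge all part files of the map-only job into one local file
hadoop fs -getmerge /user/yehdego/RF-out RF-out-merged.txt

# equivalent by hand:
hadoop fs -cat /user/yehdego/RF-out/part-* > RF-out-merged.txt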

--
Bobby

On 7/27/11 2:55 PM, "Daniel Yehdego"  wrote:



Hi Bobby,

I just want to ask you if there is away of using a reducer or something like 
concatenation to glue my outputs from the mapper and outputs
them as a single file and segment of the predicted RNA 2D structure?

FYI: I have used a reducer NONE before:

HADOOP_HOME$ bin/hadoop jar
/data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
/data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
/user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
/user/yehdego/RF-out -reducer NONE -verbose

and a sample of my output using the mapper of two different slave nodes looks 
like this :

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
and
[...(((...))).].
  (-13.46)

GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
.(((.((......)..  (-11.00)

and I want to concatenate and output them as a single predicated RNA sequence 
structure:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU

[...(((...))).]..(((.((......)..


Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: dtyehd...@miners.utep.edu
> To: common-user@hadoop.apache.org
> Subject: RE: Hadoop-streaming using binary executable c program
> Date: Tue, 26 Jul 2011 16:23:10 +
>
>
> Good afternoon Bobby,
>
> Thanks so much, now its working excellent. And the speed is also reasonable. 
> Once again thanks u.
>
> Regards,
>
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
>
> > From: ev...@yahoo-inc.com
> > To: common-user@hadoop.apache.org
> > Date: Mon, 25 Jul 2011 14:47:34 -0700
> > Subject: Re: Hadoop-streaming using binary executable c program
> >
> > This is likely to be slow and it is not ideal.  The ideal would be to 
> > modify pknotsRG to be able to read from stdin, but that may not be possible.
> >
> > The shell script would probably look something like the following
> >
> > #!/bin/sh
> > rm -f temp.txt;
> > while read line
> > do
> >   echo $line >> temp.txt;
> > done
> > exec pknotsRG temp.txt;
> >
> > Place it in a file say hadoopPknotsRG  Then you probably want to run
> >
> > chmod +x hadoopPknotsRG
> >
> > After that you want to test it with
> >
> > hadoop fs -cat 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 
> > | ./hadoopPknotsRG
> >
> > If that works then you can try it with Hadoop streaming
> >
> > HADOOP_HOME$ bin/hadoop jar 
> > /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> > ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> > /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> > /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> > /user/yehdego/RF-out -reducer NONE -verbose
> >
> > --Bobby
> >
> > On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:
> >
> >
> >
> > Good afternoon Bobby,
> >
> > Thanks, you gave me a great help in finding out what the problem was. After 
> >

RE: Hadoop-streaming using binary executable c program

2011-07-27 Thread Daniel Yehdego

Hi Bobby, 

I just want to ask you if there is a way of using a reducer or something like
concatenation to glue my outputs from the mapper and output
them as a single file and segment of the predicted RNA 2D structure?

FYI: I have used a reducer NONE before:

HADOOP_HOME$ bin/hadoop jar
/data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
/data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
/user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
/user/yehdego/RF-out -reducer NONE -verbose

and a sample of my output using the mapper on two different slave nodes looks
like this:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
and
[...(((...))).].
  (-13.46)

GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
.(((.((......)..  (-11.00)

and I want to concatenate and output them as a single predicted RNA sequence
structure:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
   

[...(((...))).]..(((.((......)..
  


Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: dtyehd...@miners.utep.edu
> To: common-user@hadoop.apache.org
> Subject: RE: Hadoop-streaming using binary executable c program
> Date: Tue, 26 Jul 2011 16:23:10 +
> 
> 
> Good afternoon Bobby, 
> 
> Thanks so much, now its working excellent. And the speed is also reasonable. 
> Once again thanks u.  
> 
> Regards, 
> 
> Daniel T. Yehdego
> Computational Science Program 
> University of Texas at El Paso, UTEP 
> dtyehd...@miners.utep.edu
> 
> > From: ev...@yahoo-inc.com
> > To: common-user@hadoop.apache.org
> > Date: Mon, 25 Jul 2011 14:47:34 -0700
> > Subject: Re: Hadoop-streaming using binary executable c program
> > 
> > This is likely to be slow and it is not ideal.  The ideal would be to 
> > modify pknotsRG to be able to read from stdin, but that may not be possible.
> > 
> > The shell script would probably look something like the following
> > 
> > #!/bin/sh
> > rm -f temp.txt;
> > while read line
> > do
> >   echo $line >> temp.txt;
> > done
> > exec pknotsRG temp.txt;
> > 
> > Place it in a file say hadoopPknotsRG  Then you probably want to run
> > 
> > chmod +x hadoopPknotsRG
> > 
> > After that you want to test it with
> > 
> > hadoop fs -cat 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 
> > | ./hadoopPknotsRG
> > 
> > If that works then you can try it with Hadoop streaming
> > 
> > HADOOP_HOME$ bin/hadoop jar 
> > /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> > ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> > /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> > /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> > /user/yehdego/RF-out -reducer NONE -verbose
> > 
> > --Bobby
> > 
> > On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:
> > 
> > 
> > 
> > Good afternoon Bobby,
> > 
> > Thanks, you gave me a great help in finding out what the problem was. After 
> > I put the command line you suggested me, I found out that there was a 
> > segmentation error.
> > The binary executable program pknotsRG only reads a file with a sequence in 
> > it. This means, there should be a shell script, as you have said, that will 
> > take the data coming
> > from stdin and write it to a temporary file. Any idea on how to do this job 
> > in shell script. The thing is I am from a biology background and don't have 
> > much experience in CS.
> > looking forward to hear from you. Thanks so much.
> > 
> > Regards,
> > 
> > Daniel T. Yehdego
> > Computational Science Program
> > University of Texas at El Paso, UTEP
> > dtyehd...@miners.utep.edu
> > 
> > > From: ev...@yahoo-inc.com
> > > To: common-user@hadoop.apache.org
> > > Date: Fri, 22 Jul 2011 12:39:08 -0700
> > > Subject: Re: Hadoop-streaming using binary executable c program
> > >
> > > I would suggest that you do the following to help you debug.
> > >
> > > hadoop fs -cat 
> > > /user/yehdego/RNAData/RF00028

RE: Hadoop-streaming using binary executable c program

2011-07-26 Thread Daniel Yehdego

Good afternoon Bobby, 

Thanks so much, now it's working excellently, and the speed is also reasonable.
Once again, thank you.

Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 25 Jul 2011 14:47:34 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> This is likely to be slow and it is not ideal.  The ideal would be to modify 
> pknotsRG to be able to read from stdin, but that may not be possible.
> 
> The shell script would probably look something like the following
> 
> #!/bin/sh
> rm -f temp.txt;
> while read line
> do
>   echo $line >> temp.txt;
> done
> exec pknotsRG temp.txt;
> 
> Place it in a file say hadoopPknotsRG  Then you probably want to run
> 
> chmod +x hadoopPknotsRG
> 
> After that you want to test it with
> 
> hadoop fs -cat 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
> ./hadoopPknotsRG
> 
> If that works then you can try it with Hadoop streaming
> 
> HADOOP_HOME$ bin/hadoop jar 
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> /user/yehdego/RF-out -reducer NONE -verbose
> 
> --Bobby
> 
> On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:
> 
> 
> 
> Good afternoon Bobby,
> 
> Thanks, you gave me a great help in finding out what the problem was. After I 
> put the command line you suggested me, I found out that there was a 
> segmentation error.
> The binary executable program pknotsRG only reads a file with a sequence in 
> it. This means, there should be a shell script, as you have said, that will 
> take the data coming
> from stdin and write it to a temporary file. Any idea on how to do this job 
> in shell script. The thing is I am from a biology background and don't have 
> much experience in CS.
> looking forward to hear from you. Thanks so much.
> 
> Regards,
> 
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
> 
> > From: ev...@yahoo-inc.com
> > To: common-user@hadoop.apache.org
> > Date: Fri, 22 Jul 2011 12:39:08 -0700
> > Subject: Re: Hadoop-streaming using binary executable c program
> >
> > I would suggest that you do the following to help you debug.
> >
> > hadoop fs -cat 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 
> > | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
> >
> > This is simulating what hadoop streaming is doing.  Here we are taking the 
> > first 2 lines out of the input file and feeding them to the stdin of 
> > pknotsRG.  The first step is to make sure that you can get your program to 
> > run correctly with something like this.  You may need to change the command 
> > line to pknotsRG to get it to read the data it is processing from stdin, 
> > instead of from a file.  Alternatively you may need to write a shell script 
> > that will take the data coming from stdin.  Write it to a file and then 
> > call pknotsRG on that temporary file.  Once you have this working then you 
> > should try it again with streaming.
> >
> > --Bobby Evans
> >
> > On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:
> >
> >
> >
> > Hi Bobby, Thanks for the response.
> >
> > After I tried the following comannd:
> >
> > bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
> > /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
> > /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> > /user/yehdego/RF-out - verbose
> >
> > I got a stderr logs :
> >
> > java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> > failed with code 139
> > at 
> > org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> > at 
> > org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> > at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> > at 
> > org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> > at org.apache.hadoop.mapred.MapTask.r

Re: Hadoop-streaming using binary executable c program

2011-07-25 Thread Robert Evans
This is likely to be slow and it is not ideal.  The ideal would be to modify 
pknotsRG to be able to read from stdin, but that may not be possible.

The shell script would probably look something like the following

#!/bin/sh
# Spool the records arriving on STDIN into a temporary file, then hand
# that file to pknotsRG, which only reads input from a file.
rm -f temp.txt;
while read line
do
  echo "$line" >> temp.txt;   # quoted so whitespace in the record survives
done
exec pknotsRG temp.txt;

Place it in a file, say hadoopPknotsRG.  Then you probably want to run

chmod +x hadoopPknotsRG

After that you want to test it with

hadoop fs -cat 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
./hadoopPknotsRG

If that works then you can try it with Hadoop streaming

HADOOP_HOME$ bin/hadoop jar 
/data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
/data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
/user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
/user/yehdego/RF-out -reducer NONE -verbose

--Bobby

On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:



Good afternoon Bobby,

Thanks, you gave me a great help in finding out what the problem was. After I 
put the command line you suggested me, I found out that there was a 
segmentation error.
The binary executable program pknotsRG only reads a file with a sequence in it. 
This means, there should be a shell script, as you have said, that will take 
the data coming
from stdin and write it to a temporary file. Any idea on how to do this job in 
shell script. The thing is I am from a biology background and don't have much 
experience in CS.
looking forward to hear from you. Thanks so much.

Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Fri, 22 Jul 2011 12:39:08 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
>
> I would suggest that you do the following to help you debug.
>
> hadoop fs -cat 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
>
> This is simulating what hadoop streaming is doing.  Here we are taking the 
> first 2 lines out of the input file and feeding them to the stdin of 
> pknotsRG.  The first step is to make sure that you can get your program to 
> run correctly with something like this.  You may need to change the command 
> line to pknotsRG to get it to read the data it is processing from stdin, 
> instead of from a file.  Alternatively you may need to write a shell script 
> that will take the data coming from stdin.  Write it to a file and then call 
> pknotsRG on that temporary file.  Once you have this working then you should 
> try it again with streaming.
>
> --Bobby Evans
>
> On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:
>
>
>
> Hi Bobby, Thanks for the response.
>
> After I tried the following comannd:
>
> bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> /user/yehdego/RF-out - verbose
>
> I got a stderr logs :
>
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
> with code 139
> at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> at 
> org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
>
>
> syslog logs
>
> 2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 0
> 2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: 
> PipeMapRed exec 
> [/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
> 2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
> R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> MROutputThread done
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRe

RE: Hadoop-streaming using binary executable c program

2011-07-25 Thread Daniel Yehdego

Good afternoon Bobby, 

Thanks, you gave me great help in finding out what the problem was. After I
ran the command line you suggested, I found out that there was a
segmentation fault.
The binary executable program pknotsRG only reads a file with a sequence in it.
This means there should be a shell script, as you have said, that will take
the data coming
from stdin and write it to a temporary file. Any idea on how to do this job in
a shell script? The thing is, I am from a biology background and don't have much
experience in CS.
Looking forward to hearing from you. Thanks so much.

Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Fri, 22 Jul 2011 12:39:08 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> I would suggest that you do the following to help you debug.
> 
> hadoop fs -cat 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
> 
> This is simulating what hadoop streaming is doing.  Here we are taking the 
> first 2 lines out of the input file and feeding them to the stdin of 
> pknotsRG.  The first step is to make sure that you can get your program to 
> run correctly with something like this.  You may need to change the command 
> line to pknotsRG to get it to read the data it is processing from stdin, 
> instead of from a file.  Alternatively you may need to write a shell script 
> that will take the data coming from stdin.  Write it to a file and then call 
> pknotsRG on that temporary file.  Once you have this working then you should 
> try it again with streaming.
> 
> --Bobby Evans
> 
> On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:
> 
> 
> 
> Hi Bobby, Thanks for the response.
> 
> After I tried the following comannd:
> 
> bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> /user/yehdego/RF-out - verbose
> 
> I got a stderr logs :
> 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
> with code 139
> at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> at 
> org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> 
> 
> 
> syslog logs
> 
> 2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 0
> 2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: 
> PipeMapRed exec 
> [/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
> 2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
> R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> MROutputThread done
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> MRErrorThread done
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> PipeMapRed failed!
> 2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> running child
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
> with code 139
> at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> at 
> org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> 2011-07-22 13:

Re: Hadoop-streaming using binary executable c program

2011-07-22 Thread Robert Evans
I would suggest that you do the following to help you debug.

hadoop fs -cat 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -

This is simulating what hadoop streaming is doing.  Here we are taking the 
first 2 lines out of the input file and feeding them to the stdin of pknotsRG.  
The first step is to make sure that you can get your program to run correctly 
with something like this.  You may need to change the command line to pknotsRG 
to get it to read the data it is processing from stdin, instead of from a file. 
Alternatively you may need to write a shell script that will take the data
coming from stdin, write it to a file, and then call pknotsRG on that temporary
file.  Once you have this working then you should try it again with streaming.
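
A reading aid for the failure codes in this thread: streaming reports the subprocess's raw exit status, and a status above 128 means the program died on a signal (status minus 128).  So the "subprocess failed with code 139" reported elsewhere in this thread is 128 + 11, i.e. SIGSEGV, matching the segmentation fault Daniel found, while a plain "code 1" is an ordinary error exit.  The pipe test above surfaces the status directly:

hadoop fs -cat /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt \
  | head -2 \
  | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
echo "exit status: $?"   # 139 here would mean the binary segfaulted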

--Bobby Evans

On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:



Hi Bobby, Thanks for the response.

After I tried the following comannd:

bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
/user/yehdego/RF-out - verbose

I got a stderr logs :

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



syslog logs

2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
exec 
[/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MROutputThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MRErrorThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
failed!
2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-07-22 13:02:28,395 INFO org.apache.hadoop.mapred.TaskRunner: Runnning 
cleanup for the task



Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org; dtyehd...@miners.utep.edu
> Date: Fri, 22 Jul 2011 09:12:18 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
>
> It looks like it tried to run your program and the program exited with a 1 
> not a 0.  What are the stderr logs like for the mappers that were launched, 
> you should be able to access them through the Web GUI?  You might want to add 
> in some stderr log messages to you c program too. To be able to debug how far 
> along it is going before exiting.
>
> --Bobby Evans
>
> On 7/22/11 9:19 AM, "Daniel Yehdego"  wrote:
>
> I am trying to parallelize some very long RNA sequence for the sake of
> predicting their RNA 2D structures. I am using a binary executable c
> program called pknotsRG as my mapper. I tried the following bin/hadoop
> command:
>
> HADOOP_HOME$ bin/hadoop
> jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
> -mapper /data/yehdego/hadoop-0.20.2/pknotsRG
> -file /data/yehdego/hadoop-0.20.2/pknotsRG
> -input /u

RE: Hadoop-streaming using binary executable c program

2011-07-22 Thread Daniel Yehdego

Hi Bobby, Thanks for the response.

After I tried the following command:

bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
/user/yehdego/RF-out - verbose

I got a stderr logs :

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



syslog logs

2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
exec 
[/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MROutputThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MRErrorThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
failed!
2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-07-22 13:02:28,395 INFO org.apache.hadoop.mapred.TaskRunner: Runnning 
cleanup for the task



Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org; dtyehd...@miners.utep.edu
> Date: Fri, 22 Jul 2011 09:12:18 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> It looks like it tried to run your program and the program exited with a 1 
> not a 0.  What are the stderr logs like for the mappers that were launched, 
> you should be able to access them through the Web GUI?  You might want to add 
> in some stderr log messages to you c program too. To be able to debug how far 
> along it is going before exiting.
> 
> --Bobby Evans
> 
> On 7/22/11 9:19 AM, "Daniel Yehdego"  wrote:
> 
> I am trying to parallelize some very long RNA sequence for the sake of
> predicting their RNA 2D structures. I am using a binary executable c
> program called pknotsRG as my mapper. I tried the following bin/hadoop
> command:
> 
> HADOOP_HOME$ bin/hadoop
> jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
> -mapper /data/yehdego/hadoop-0.20.2/pknotsRG
> -file /data/yehdego/hadoop-0.20.2/pknotsRG
> -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
> -output /user/yehdego/RF-out -reducer NONE -verbose
> 
> but i keep getting the following error message:
> 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
> failed with code 1
> at
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> at
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> at 
> org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.C

RE: Hadoop-streaming with a c binary executable as a mapper

2011-07-22 Thread Daniel Yehdego

Thanks Joey for your quick response,

I have tried the suggestion you gave me and it's still not working. After I run:

bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
/user/yehdego/RF-out - verbose

I got the following task logs:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



syslog logs

2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
exec 
[/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MROutputThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MRErrorThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
failed!
2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-07-22 13:02:28,395 INFO org.apache.hadoop.mapred.TaskRunner: Runnning 
cleanup for the task


Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

> CC: common-user@hadoop.apache.org
> From: j...@cloudera.com
> Subject: Re: Hadoop-streaming with a c binary executable as a mapper
> Date: Fri, 22 Jul 2011 11:34:08 -0400
> To: common-user@hadoop.apache.org
> 
> Your executable needs to read lines from standard in. Try setting your mapper 
> like this:
> 
> > -mapper "/data/yehdego/hadoop-0.20.2/pknotsRG -"
> 
> If that doesn't work, you may need to execute your C program from a shell
> script. The "-" I added to the command line says read from STDIN.
> 
> -Joey
> 
> 
> On Jul 22, 2011, at 10:41, Daniel Yehdego  wrote:
> 
> > Hi, 
> > 
> > I using hadoop-streaming for parallelizing a big RNA data. I am using a
> > c binary executable program called pknotsRG as my mapper. My command to
> > run the job looks like:
> > 
> > HADOOP_HOME$  bin/hadoop
> > jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
> > -mapper /data/yehdego/hadoop-0.20.2/pknotsRG
> > -file /data/yehdego/hadoop-0.20.2/pknotsRG 
> > -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
> > -output /user/yehdego/RF-out 
> > -reducer NONE 
> > -verbose 
> > 
> > and I keep getting the following error messages:
> > 
> > java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
> > failed with code 1
> >at 
> > org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> >at 
> > org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> >at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> >at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> >at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> >at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

Re: Hadoop-streaming using binary executable c program

2011-07-22 Thread Robert Evans
It looks like it tried to run your program and the program exited with a 1, not
a 0.  What are the stderr logs like for the mappers that were launched? You
should be able to access them through the Web GUI.  You might want to add in
some stderr log messages to your c program too, to be able to debug how far
along it gets before exiting.

--Bobby Evans

On 7/22/11 9:19 AM, "Daniel Yehdego"  wrote:

I am trying to parallelize some very long RNA sequence for the sake of
predicting their RNA 2D structures. I am using a binary executable c
program called pknotsRG as my mapper. I tried the following bin/hadoop
command:

HADOOP_HOME$ bin/hadoop
jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
-mapper /data/yehdego/hadoop-0.20.2/pknotsRG
-file /data/yehdego/hadoop-0.20.2/pknotsRG
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
-output /user/yehdego/RF-out -reducer NONE -verbose

but i keep getting the following error message:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
at
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

FYI: my input file is RF00028_B.bpseqL3G5_seg_Centered_Method.txt which
is a chunk of RNA sequences and the mapper is expected to get the input
and execute the input file line by line and out put the predicted
structure for each line of sequence for a specified number of maps. Any
help on this problem is really appreciated. Thanks.




Re: Problem with Hadoop Streaming -file option for Java class files

2011-07-22 Thread Robert Evans
From a practical standpoint, if you just leave off the -mapper you will get an
IdentityMapper being run in streaming.  I don't believe that -mapper will
understand something.class as a class file that should be loaded and used as
the mapper.  I think you need to specify the class, including the package, to
get it to load, like you did with org.apache.hadoop.mapred.lib.IdentityMapper.
I am not sure what changes you made to IdentityMapper.java before recompiling,
but in order to get it on the classpath you probably need to ship it as a jar,
not as a single file.  I believe that you can use -libjars to ship it and add
it to the classpath of the JVM, but I am not positive of that.
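
(A sketch of that approach, with hypothetical package, class and jar names --
note that -libjars is a generic option, so it has to come before the
streaming-specific options:)

jar cf mymapper.jar my/pkg/MyIdentityMapper.class
bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar \
  -libjars mymapper.jar \
  -mapper my.pkg.MyIdentityMapper \
  -reducer /bin/wc -inputformat KeyValueTextInputFormat \
  -input gutenberg/* -output gutenberg-out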

--Bobby Evans

On 7/22/11 10:18 AM, "Shrish"  wrote:



I am struggling with an issue with hadoop streaming's "-file" option.

First I tried the very basic example in streaming:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper
org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstchk22

which worked absolutely fine.

Then I copied the IdentityMapper.java source code and compiled it. Then I placed
this class file in the /home/hadoop folder and executed the following in the
terminal.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -file ~/IdentityMapper.class
-mapper IdentityMapper.class \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch6

The execution failed with the following error in the stderr file:

java.io.IOException: Cannot run program "IdentityMapper.class":
java.io.IOException: error=2, No such file or directory

Then I tried again by copying the IdentityMapper.class file into the hadoop
installation directory and executed the following:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -file IdentityMapper.class
-mapper IdentityMapper.class \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch5

But unfortunately again I got the same error.

It would be great if you could help me with it, as I cannot move any further
without overcoming this.


***I am trying this after I tried hadoop-streaming for a different class file,
which failed, in order to identify whether the problem is with the class file
itself or with the way I am using it.


Thanking you in anticipation




Hadoop-streaming using binary executable c program

2011-07-22 Thread Daniel Yehdego
I am trying to parallelize some very long RNA sequences in order to
predict their RNA 2D structures. I am using a binary executable C
program called pknotsRG as my mapper. I tried the following bin/hadoop
command:

HADOOP_HOME$ bin/hadoop
jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
-mapper /data/yehdego/hadoop-0.20.2/pknotsRG
-file /data/yehdego/hadoop-0.20.2/pknotsRG
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
-output /user/yehdego/RF-out -reducer NONE -verbose 

but I keep getting the following error message:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
at
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

FYI: my input file is RF00028_B.bpseqL3G5_seg_Centered_Method.txt, which
is a chunk of RNA sequences; the mapper is expected to read the input
file line by line and output the predicted structure for each line of
sequence, for a specified number of maps. Any help on this problem is
really appreciated. Thanks.



Re: Hadoop-streaming with a c binary executable as a mapper

2011-07-22 Thread Joey Echeverria
Your executable needs to read lines from standard in. Try setting your mapper 
like this:

> -mapper "/data/yehdego/hadoop-0.20.2/pknotsRG -"

If that doesn't work, you may need to execute your C program from a shell
script. The trailing "-" I added to the command line tells it to read from STDIN.
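
(If a wrapper does turn out to be necessary, here is a minimal sketch in
Python -- it assumes, hypothetically, that pknotsRG only accepts a file name
argument rather than reading STDIN, so it spools the streaming input to a
temporary file first:)

#!/usr/bin/env python
# Sketch of a wrapper mapper: copy STDIN to a temp file, run the binary
# on that file, and let the binary's stdout become the mapper output.
import os, subprocess, sys, tempfile

with tempfile.NamedTemporaryFile(mode="w", delete=False) as tmp:
    tmp.write(sys.stdin.read())
    path = tmp.name
try:
    sys.exit(subprocess.call(["./pknotsRG", path]))
finally:
    os.remove(path)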

-Joey


On Jul 22, 2011, at 10:41, Daniel Yehdego  wrote:

> Hi, 
> 
> I am using hadoop-streaming to parallelize a big RNA data set. I am using
> a C binary executable called pknotsRG as my mapper. My command to
> run the job looks like:
> 
> HADOOP_HOME$  bin/hadoop
> jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
> -mapper /data/yehdego/hadoop-0.20.2/pknotsRG
> -file /data/yehdego/hadoop-0.20.2/pknotsRG 
> -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
> -output /user/yehdego/RF-out 
> -reducer NONE 
> -verbose 
> 
> and I keep getting the following error messages:
> 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
> failed with code 1
>at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
>at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
>at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
>at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
>at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>at org.apache.hadoop.mapred.Child.main(Child.java:170)
> 
> FYI: I am inputting a file with lines of sequences, and the mapper is
> expected to take each line and predict its 2D secondary structure. I tried
> the executable locally and it worked.
> 
> [yehdego@bulgaria hadoop-0.20.2]$ ./pknotsRG
> RF00028_B.bpseqL3G5_seg_Centered_Method.txt 
> 
> AUGACUCUCUAAAUUGCUUUACCUUUGGAGGGGUUAUCAGGCCUGCACCUGAUAGCUAGUCUUUAAACCAAUAGAUUGCAUCGGUUUAAUA
> (..)...((..))[.{{...].}}...
>   
> GCAAGACCGUCAAAUUGCGGGGGGU
> .....  
> CAACAGCCGUUCAGUACCAAGUCUCAA
> ..((.((.(()).)).)).  
> AACUUUGAGAUGGCCUUGCAAAGGAUAUGGUAAUAAGCUGACGGACAGGGUCCUAACCACGCAGCCAAGUCCUAAGUCAACAUUU
> ..[[[.]]](.......)).))..)....
>   
> CGGUGUUGAUAUGGAUGCAGUUCACAGACUAAAUGUCGGUCAAGAAUAGGUAUUCUUCUCAUAAGAUAUAGUCGGACCUCUCCUUAAUGGGAGCU
> .(((...(...)..(...())..))).()..
>   


Problem with Hadoop Streaming -file option for Java class files

2011-07-22 Thread Shrish

I am struggling with an issue with hadoop streaming's "-file" option.

First I tried the very basic example in streaming:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper
org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstchk22

which worked absolutely fine.

Then I copied the IdentityMapper.java source code and compiled it. Then I placed
this class file in the /home/hadoop folder and executed the following in the
terminal.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -file ~/IdentityMapper.class
-mapper IdentityMapper.class \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch6

The execution failed with the following error in the stderr file:

java.io.IOException: Cannot run program "IdentityMapper.class":
java.io.IOException: error=2, No such file or directory

Then I tried again by copying the IdentityMapper.class file into the hadoop
installation directory and executed the following:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -file IdentityMapper.class
-mapper IdentityMapper.class \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch5

But unfortunately again I got the same error.

It would be great if you could help me with it, as I cannot move any further
without overcoming this.


***I am trying this after I tried hadoop-streaming for a different class file,
which failed, in order to identify whether the problem is with the class file
itself or with the way I am using it.


Thanking you in anticipation



Hadoop-streaming with a c binary executable as a mapper

2011-07-22 Thread Daniel Yehdego
Hi, 

I am using hadoop-streaming to parallelize a big RNA data set. I am using
a C binary executable called pknotsRG as my mapper. My command to
run the job looks like:

HADOOP_HOME$  bin/hadoop
jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
-mapper /data/yehdego/hadoop-0.20.2/pknotsRG
-file /data/yehdego/hadoop-0.20.2/pknotsRG 
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
-output /user/yehdego/RF-out 
-reducer NONE 
-verbose 

and I keep getting the following error messages:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

FYI: I am inputting a file with lines of sequences, and the mapper is expected
to take each line and predict its 2D secondary structure. I tried the
executable locally and it worked.

[yehdego@bulgaria hadoop-0.20.2]$ ./pknotsRG
RF00028_B.bpseqL3G5_seg_Centered_Method.txt 

AUGACUCUCUAAAUUGCUUUACCUUUGGAGGGGUUAUCAGGCCUGCACCUGAUAGCUAGUCUUUAAACCAAUAGAUUGCAUCGGUUUAAUA
(..)...((..))[.{{...].}}...
  
GCAAGACCGUCAAAUUGCGGGGGGU
.....  
CAACAGCCGUUCAGUACCAAGUCUCAA
..((.((.(()).)).)).  
AACUUUGAGAUGGCCUUGCAAAGGAUAUGGUAAUAAGCUGACGGACAGGGUCCUAACCACGCAGCCAAGUCCUAAGUCAACAUUU
..[[[.]]](.......)).))..)....
  
CGGUGUUGAUAUGGAUGCAGUUCACAGACUAAAUGUCGGUCAAGAAUAGGUAUUCUUCUCAUAAGAUAUAGUCGGACCUCUCCUUAAUGGGAGCU
.(((...(...)..(...())..))).()..
  


Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-29 Thread Paul Ingles
Hi,

I'm not familiar with wukong, but Mandy has some scripts that wrap the hadoop
commands; the default behaviour, IIRC, is to package the folder the script is in.

This is then distributed so the app carries all its dependencies with it.

Happy to hear -files works for you.

Sent from my iPhone

On 29 Jun 2011, at 07:44, Guang-Nan Cheng  wrote:

> Well, my bad. I made a simple test and confirmed that  -files works that way
> already.
> 
> For the two guys that "answered" my question, sorry I asked the question
> unclearly... I don't see how those two projects relate to the question,
> but thank you. :D
> 
> 
> 
> 
> On Wed, Jun 29, 2011 at 12:35 AM, Abhinay Mehta 
> wrote:
> 
>> We use Mandy: https://github.com/forward/mandy for this.
>> 
>> 
>> On 28 June 2011 17:26, Nick Jones  wrote:
>> 
>>> Take a look at Wukong from the guys at Infochimps:
>>> https://github.com/mrflip/wukong
>>> 
>>> 
>>> On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote:
>>> 
 I'd like to pass a whole ruby app to streaming, so I don't need to
 bother with ruby file dependencies.
 
 For example,
 
 ./streaming
 
 ...
 -mapper 'ruby aaa/bbb/ccc'
-files  aaa  <--- pass the folder
 
 
 
 
 Is this supported already? If not, any tips on how to make this work?
>> I'm
 willing to add some code by myself and rebuild the streaming jar.
 
>>> 
>>> --
>>> Nick Jones
>>> 
>>> 
>>> 
>> 


Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Guang-Nan Cheng
Well, my bad. I made a simple test and confirmed that  -files works that way
already.

For the two guys that "answered" my question, sorry I asked the question
unclearly... I don't see how those two projects relate to the question,
but thank you. :D




On Wed, Jun 29, 2011 at 12:35 AM, Abhinay Mehta wrote:

> We use Mandy: https://github.com/forward/mandy for this.
>
>
> On 28 June 2011 17:26, Nick Jones  wrote:
>
> > Take a look at Wukong from the guys at Infochimps:
> > https://github.com/mrflip/wukong
> >
> >
> > On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote:
> >
> >> I'd like to pass a whole ruby app to streaming, so I don't need to
> >> bother with ruby file dependencies.
> >>
> >> For example,
> >>
> >> ./streaming
> >>
> >> ...
> >> -mapper 'ruby aaa/bbb/ccc'
> >> -files  aaa  <--- pass the folder
> >>
> >>
> >>
> >>
> >> Is this supported already? If not, any tips on how to make this work?
> I'm
> >> willing to add some code by myself and rebuild the streaming jar.
> >>
> >
> > --
> > Nick Jones
> >
> >
> >
>


Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Abhinay Mehta
We use Mandy: https://github.com/forward/mandy for this.


On 28 June 2011 17:26, Nick Jones  wrote:

> Take a look at Wukong from the guys at Infochimps:
> https://github.com/mrflip/wukong
>
>
> On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote:
>
>> I'd like to pass a whole ruby app to streaming, so I don't need to
>> bother with ruby file dependencies.
>>
>> For example,
>>
>> ./streaming
>>
>> ...
>> -mapper 'ruby aaa/bbb/ccc'
>> -files  aaa  <--- pass the folder
>>
>>
>>
>>
>> Is this supported already? If not, any tips on how to make this work? I'm
>> willing to add some code by myself and rebuild the streaming jar.
>>
>
> --
> Nick Jones
>
>
>


Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Nick Jones
Take a look at Wukong from the guys at Infochimps: 
https://github.com/mrflip/wukong


On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote:

I'd like to pass a whole ruby app to streaming, so I don't need to
bother with ruby file dependencies.

For example,

./streaming

...
-mapper 'ruby aaa/bbb/ccc'
-files  aaa  <--- pass the folder




Is this supported already? If not, any tips on how to make this work? I'm
willing to add some code by myself and rebuild the streaming jar.


--
Nick Jones




Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Guang-Nan Cheng
I'd like to pass a whole ruby app to streaming, so I don't need to
bother with ruby file dependencies.

For example,

./streaming

...
-mapper 'ruby aaa/bbb/ccc'
-files  aaa  <--- pass the folder




Is this supported already? If not, any tips on how to make this work? I'm
willing to add some code by myself and rebuild the streaming jar.


hadoop streaming random fails

2011-06-09 Thread Onur Cenk
Hello everyone,

I'm using hadoop streaming and getting the following error for 9 reducers
out of 32. The tasks are retried by hadoop and everything is fine then.

Do you have an idea why I'm getting this?

java.io.IOException: subprocess exited with error code 141
R/W/S=237891/234843/0 in:14868=237891/16 [rec/s] out:14677=234843/16 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hd2
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |68064006?838?12?1?+905303343108?9|
Date: Thu Jun 09 01:38:46 EEST 2011
Broken pipe
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:131)
at 
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:468)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:262)


Re: hadoop streaming and job conf settings

2011-04-13 Thread Amareshwari Sri Ramadasu
Looks like you are hitting https://issues.apache.org/jira/browse/MAPREDUCE-1621.

-Amareshwari
On 4/13/11 11:39 PM, "Shivani Rao"  wrote:

Hello,

I am facing trouble using hadoop streaming in order to  solve a simple
nearest neighbor problem.

Input data is in the following format:
<key> '\t' <value>

key is the imageid for which the nearest neighbor will be computed;
the value is a 100-dimensional vector of floating point values separated by
space or tab.

The mapper reads in the query (the query is a 100-dimensional vector) and
each line of the input, and outputs a <key2, value2> pair,
where key2 is a floating point value indicating the distance, and value2 is
the imageid.

The number of reducers is set to 1. And the reducer is set to be the
identity reducer.

I tried to use the following command

bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar
-Dmapreduce.job.output.key.class=org.apache.hadoop.io.DoubleWritable -files
/home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1
-input datain/comparedata -output dataout5 -mapper file1 -reducer
org.apache.hadoop.mapred.lib.IdentityReducer -verbose

The output stream is as below. The failure is in the mapper itself,
more specifically the TextOutputReader. I am not sure how to fix this. The
logs are attached below:


11/04/13 13:22:15 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/04/13 13:22:15 WARN conf.Configuration: mapred.used.genericoptionsparser
is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
STREAM: addTaskEnvironment=
STREAM: shippedCanonFiles_=[]
STREAM: shipped: false /usr/local/hadoop/file1
STREAM: cmd=file1
STREAM: cmd=null
STREAM: shipped: false
/usr/local/hadoop/org.apache.hadoop.mapred.lib.IdentityReducer
STREAM: cmd=org.apache.hadoop.mapred.lib.IdentityReducer
11/04/13 13:22:15 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
STREAM: Found runtime classes in:
/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/
packageJobJar: [/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/]
[] /tmp/streamjob2923554781371902680.jar tmpDir=null
JarBuilder.addNamedStream META-INF/MANIFEST.MF
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritable.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesInput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/StreamUtil$TaskId.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/RawBytesOutputReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TextInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TextOutputReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/IdentifierResolver.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/RawBytesInputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TypedBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/StreamXmlRecordReader.class
JarBuilde

hadoop streaming and job conf settings, error in textoutputreader

2011-04-13 Thread Shivani Rao
Hello,

I am facing trouble using hadoop streaming in order to  solve a simple
nearest neighbor problem.

Input data is in the following format:
<key> '\t' <value>

key is the imageid for which the nearest neighbor will be computed;
the value is a 100-dimensional vector of floating point values separated by
space or tab.

The mapper reads in the query (the query is a 100-dimensional vector) and
each line of the input, and outputs a <key2, value2> pair,
where key2 is a floating point value indicating the distance, and value2 is
the imageid.

The number of reducers is set to 1. And the reducer is set to be the
identity reducer.

I tried to use the following command

bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar
-Dmapreduce.job.output.key.class=org.apache.hadoop.io.DoubleWritable -files
/home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1
-input datain/comparedata -output dataout5 -mapper file1 -reducer
org.apache.hadoop.mapred.lib.IdentityReducer -verbose

The output stream is as below. The failure is in the mapper itself,
more specifically the TextOutputReader. I am not sure how to fix this. The
logs are attached below:


11/04/13 13:22:15 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/04/13 13:22:15 WARN conf.Configuration: mapred.used.genericoptionsparser
is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
STREAM: addTaskEnvironment=
STREAM: shippedCanonFiles_=[]
STREAM: shipped: false /usr/local/hadoop/file1
STREAM: cmd=file1
STREAM: cmd=null
STREAM: shipped: false
/usr/local/hadoop/org.apache.hadoop.mapred.lib.IdentityReducer
STREAM: cmd=org.apache.hadoop.mapred.lib.IdentityReducer
11/04/13 13:22:15 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
STREAM: Found runtime classes in:
/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/
packageJobJar: [/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/]
[] /tmp/streamjob2923554781371902680.jar tmpDir=null
JarBuilder.addNamedStream META-INF/MANIFEST.MF
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritable.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesInput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/StreamUtil$TaskId.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/RawBytesOutputReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TextInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TextOutputReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/IdentifierResolver.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/RawBytesInputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TypedBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/StreamXmlRecordReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/UTF8ByteArrayUtils.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/JarBui

Re: hadoop streaming and job conf settings

2011-04-13 Thread Mehmet Tepedelenlioglu
I am not sure what the problem is, but your approach seems incorrect unless
you always want to use 1 mapper. You need to make your queries available to
all mappers (cache them, although I am not sure how to do that with
streaming). Then you definitely want to use a combiner to reduce over each
mapper's output. At the reduce you would reduce over the queries to find the
min. The important point is the fact that all mappers need all queries.
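
(A minimal sketch of one way to do that with streaming, using the -files
mechanism discussed elsewhere in this archive to ship the queries to every
mapper -- the side-file name queries.txt and the distance metric are
hypothetical:)

#!/usr/bin/env python
# Sketch: each mapper loads the shipped query vectors at startup, then
# emits <distance, imageid> for every (query, input line) pair.
import sys

queries = []
with open("queries.txt") as f:   # shipped into the task's working dir
    for line in f:
        queries.append([float(x) for x in line.split()])

def distance(a, b):
    # Euclidean distance; swap in whatever metric the job actually needs
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

for line in sys.stdin:
    image_id, _, vec = line.rstrip("\n").partition("\t")
    v = [float(x) for x in vec.split()]
    for q in queries:
        print("%f\t%s" % (distance(q, v), image_id))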


On Apr 13, 2011, at 11:09 AM, Shivani Rao wrote:

> Hello,
> 
> I am facing trouble using hadoop streaming in order to  solve a simple
> nearest neighbor problem.
> 
> Input data is in the following format:
> <key> '\t' <value>
>
> key is the imageid for which the nearest neighbor will be computed;
> the value is a 100-dimensional vector of floating point values separated by
> space or tab.
>
> The mapper reads in the query (the query is a 100-dimensional vector) and
> each line of the input, and outputs a <key2, value2> pair,
> where key2 is a floating point value indicating the distance, and value2 is
> the imageid.
> 
> The number of reducers is set to 1. And the reducer is set to be the
> identity reducer.
> 
> I tried to use the following command
> 
> bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar
> -Dmapreduce.job.output.key.class=org.apache.hadoop.io.DoubleWritable -files
> /home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1
> -input datain/comparedata -output dataout5 -mapper file1 -reducer
> org.apache.hadoop.mapred.lib.IdentityReducer -verbose
> 
> The output stream is as below. The failure is in the mapper itself,
> more specifically the TextOutputReader. I am not sure how to fix this. The
> logs are attached below:
> 
> 
> 11/04/13 13:22:15 INFO security.Groups: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=30
> 11/04/13 13:22:15 WARN conf.Configuration: mapred.used.genericoptionsparser
> is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
> STREAM: addTaskEnvironment=
> STREAM: shippedCanonFiles_=[]
> STREAM: shipped: false /usr/local/hadoop/file1
> STREAM: cmd=file1
> STREAM: cmd=null
> STREAM: shipped: false
> /usr/local/hadoop/org.apache.hadoop.mapred.lib.IdentityReducer
> STREAM: cmd=org.apache.hadoop.mapred.lib.IdentityReducer
> 11/04/13 13:22:15 WARN conf.Configuration: mapred.task.id is deprecated.
> Instead, use mapreduce.task.attempt.id
> STREAM: Found runtime classes in:
> /usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/
> packageJobJar: [/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/]
> [] /tmp/streamjob2923554781371902680.jar tmpDir=null
> JarBuilder.addNamedStream META-INF/MANIFEST.MF
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesWritable.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesOutput.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesInput$1.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
> JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
> JarBuilder.addNamedStream
> org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
> JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
> JarBuilder.addNamedStream
> org/apache/hadoop/streaming/StreamUtil$TaskId.class
> JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
> JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
> JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
> JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
> JarBuilder.addNamedStream
> org/apache/hadoop/streaming/io/RawBytesOutputReader.class
> JarBuilder.addNamedStream
> org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
> JarBuilder.addNamedStream
> org/

hadoop streaming and job conf settings

2011-04-13 Thread Shivani Rao
Hello,

I am facing trouble using hadoop streaming in order to  solve a simple
nearest neighbor problem.

Input data is in the following format:
<key> '\t' <value>

key is the imageid for which the nearest neighbor will be computed;
the value is a 100-dimensional vector of floating point values separated by
space or tab.

The mapper reads in the query (the query is a 100-dimensional vector) and
each line of the input, and outputs a <key2, value2> pair,
where key2 is a floating point value indicating the distance, and value2 is
the imageid.

The number of reducers is set to 1. And the reducer is set to be the
identity reducer.

I tried to use the following command

bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar
-Dmapreduce.job.output.key.class=org.apache.hadoop.io.DoubleWritable -files
/home/shivani/research/toolkit/mathouttuts/nearestneighbor/code/IdentityMapper.R#file1
-input datain/comparedata -output dataout5 -mapper file1 -reducer
org.apache.hadoop.mapred.lib.IdentityReducer -verbose

The output stream is as below. The failure is in the mapper itself,
more specifically the TextOutputReader. I am not sure how to fix this. The
logs are attached below:


11/04/13 13:22:15 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/04/13 13:22:15 WARN conf.Configuration: mapred.used.genericoptionsparser
is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
STREAM: addTaskEnvironment=
STREAM: shippedCanonFiles_=[]
STREAM: shipped: false /usr/local/hadoop/file1
STREAM: cmd=file1
STREAM: cmd=null
STREAM: shipped: false
/usr/local/hadoop/org.apache.hadoop.mapred.lib.IdentityReducer
STREAM: cmd=org.apache.hadoop.mapred.lib.IdentityReducer
11/04/13 13:22:15 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
STREAM: Found runtime classes in:
/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/
packageJobJar: [/usr/local/hadoop-hadoop/hadoop-unjar7358684340334149267/]
[] /tmp/streamjob2923554781371902680.jar tmpDir=null
JarBuilder.addNamedStream META-INF/MANIFEST.MF
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritable.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesInput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
JarBuilder.addNamedStream
org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/StreamUtil$TaskId.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/RawBytesOutputReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TextInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TextOutputReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/IdentifierResolver.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/RawBytesInputWriter.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/io/TypedBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/StreamXmlRecordReader.class
JarBuilder.addNamedStream
org/apache/hadoop/streaming/UTF8ByteArrayUtils.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/JarBui

Re: sorting reducer input numerically in hadoop streaming

2011-04-13 Thread Dieter Plaetinck
Thank you Harsh,
that works fine!
(looks like the page I was looking at was the same, but for an older
version of hadoop)

Dieter

On Fri, 1 Apr 2011 13:07:38 +0530
Harsh J  wrote:

> You will need to supply your own Key-comparator Java class by setting
> an appropriate parameter for it, as noted in:
> http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Comparator+Class
> [The -D mapred.output.key.comparator.class=xyz part]
> 
> On Thu, Mar 31, 2011 at 6:26 PM, Dieter Plaetinck
>  wrote:
> > couldn't find how I should do that.
> 



Re: sorting reducer input numerically in hadoop streaming

2011-04-01 Thread Harsh J
You will need to supply your own Key-comparator Java class by setting
an appropriate parameter for it, as noted in:
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Comparator+Class
[The -D mapred.output.key.comparator.class=xyz part]
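
(A sketch of the full invocation for numeric sorting -- the mapper, reducer
and paths are placeholders; -n asks the comparator for numeric comparison,
in the style of Unix sort:)

bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-n \
  -mapper mapper.py -reducer reducer.py \
  -input in -output out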

On Thu, Mar 31, 2011 at 6:26 PM, Dieter Plaetinck
 wrote:
> couldn't find how I should do that.

-- 
Harsh J
http://harshj.com


sorting reducer input numerically in hadoop streaming

2011-03-31 Thread Dieter Plaetinck
hi,
I use hadoop 0.20.2, more specifically hadoop-streaming, on Debian 6.0
(squeeze) nodes.

My question is: how do I make sure input keys being fed to the reducer
are sorted numerically rather then alphabetically?

example:
- standard behavior:
#1 some-value1
#10 some-value10
#100 some-value100
#2 some-value2
#3 some-value3

- what I want:
#1 some-value1
#2 some-value2
#3 some-value3
#10 some-value10
#100 some-value100

I found
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/KeyFieldBasedComparator.html,
which supposedly supports GNU sort-like numeric sorting.
There are also some examples of jobconf parameters at
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html;
however, those seem to be meant for key-value configuration flags,
whereas I somehow need to instruct streaming that I want to use that specific
java class with that specific option for numeric sorting, and I
couldn't find how I should do that.

Thanks,
Dieter


hadoop streaming shebang line for python and mappers jumping to 100% completion right away

2011-03-31 Thread Dieter Plaetinck
Hi,
I use 0.20.2 on Debian 6.0 (squeeze) nodes.
I have 2 problems with my streaming jobs:
1) I start the job like so:
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
-file /proj/Search/wall/experiment/ \
-mapper './nolog.sh mapper' \
-reducer './nolog.sh reducer' \
-input sim-input -output sim-output

nolog.sh is just a simple wrapper for my python program:
it calls build-models.py with --mapper or --reducer, depending on which
argument it got, and it removes any bogus logging output using grep.
It looks like this:

#!/bin/sh
python $(dirname $0)/build-models.py --$1 | egrep -v 'INFO|DEBUG|WARN'

build-models.py is a python 2 program containing all mapper/reducer/etc logic, 
it has the executable flag set for owner/group/other.
(I even added `chmod +x` on it in nolog.sh to be really sure)

The problems:
When I use this shebang for build-models.py: "#!/usr/bin/python" or 
"#!/usr/bin/env python" (I would expect the last to work for sure?),
and 
$(dirname $0)/build-models.py in nolog.sh
I get this error: 
/tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././nolog.sh:
9: 
/tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././build-models.py:
Permission denied
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1


So, despite not understanding why it's needed (python is installed correctly,
executable flags set, etc.), I can "solve" this by using the invocation in
nolog.sh as shown above (`python `).
If you invoke a python program like that, you can just as well remove
the shebang, because it's not needed (I verified this manually).
However, when running it in hadoop that way, it tries to execute the python
file as a bash file and yields a bunch of "command not found" errors.
What is going on? Why can't I just execute the file and rely on the shebang?
And if I invoke the file as an argument to the python program, why is the
shebang still needed?


2) the second problem is somewhat related: I notice my mappers jump to "100% 
completion" right away - but they take about an hour to complete, so I see them 
running for an hour in 'RUNNING' with 100% completion, then they really finish.
this is probably an issue with the reading of stdin, as python uses
buffering by default (see
http://stackoverflow.com/questions/3670323/setting-smaller-buffer-size-for-sys-stdin
 )
In my code I iterate over stdin like this: `for line in sys.stdin:`, so I 
process line by line, but apparently python reads the entire stdin right away, 
my hdfs blocksize is 20KiB (which according to the thread above happens to be 
pretty much the size of the python buffer)
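
(One workaround I'm aware of -- an assumption on my part, not something from
this thread: in Python 2, iterating a file object uses an internal read-ahead
buffer, whereas iter(sys.stdin.readline, "") reads one line at a time:)

import sys

# Sketch: line-by-line reading that bypasses the file-iterator read-ahead
# buffer, so reported progress tracks the lines actually consumed.
for line in iter(sys.stdin.readline, ""):
    handle(line)   # hypothetical per-line handler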

Now, why is this related? -> Because I can invoke python in a different way to 
keep it from doing the buffering.
apparently using the -u flag should do the trick, or setting the environment 
variable PYTHONUNBUFFERED to a nonempty string.
However:
- putting `python -u` in nolog.sh doesn't do it, why?
- neither does putting `export PYTHONUNBUFFERED=true` in nolog.sh before the 
invocation, why?
- in build-models.py shebang:
  putting `/usr/bin/env python -u` or '/usr/bin/env 'python -u'` gives:
  /usr/bin/env: python -u: No such file or directory, why?
I did find a working variant, that is, I can use this shebang:
`#!/usr/bin/env PYTHONUNBUFFERED=true python2`, however since I use the same 
file for multiple things, this made i/o for a bunch of other things way too 
slow, so I tried solving this in the python code (as per the tip in the above 
link), but to no avail. (I know, my final question is a bit less related)

So I tried remapping sys.stdin (before iterating it) with these two attempts:
( see http://docs.python.org/library/os.html#os.fdopen )
newin = os.fdopen(sys.stdin.fileno(), 'r', 100) # should make buffersize +- 
100bytes
newin = os.fdopen(sys.stdin.fileno(), 'r', 1) # should make python buffer line 
by line

however, neither of those worked..

Any help/input is welcome.
I'm usually pretty good at figuring out these kinds of invocation issues,
but this one blows my mind :/

Dieter


Re: Regarding chaining multiple map-reduce jobs in Hadoop streaming

2011-01-07 Thread Harsh J
You can use Oozie from Yahoo! for building an elegant workflow out of
your streaming jobs; but you would still require output spots on your
HDFS as output records aren't really pipelined into the next job.
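
(For reference, the driver-script pattern the question mentions, sketched in
Python with hypothetical paths and script names -- each job writes to an HDFS
directory that feeds the next:)

#!/usr/bin/env python
# Sketch: chain two streaming jobs through an intermediate HDFS directory.
import subprocess

STREAMING_JAR = "contrib/streaming/hadoop-0.20.2-streaming.jar"

def run_streaming(mapper, reducer, input_, output):
    subprocess.check_call(["bin/hadoop", "jar", STREAMING_JAR,
                           "-mapper", mapper, "-reducer", reducer,
                           "-input", input_, "-output", output])

run_streaming("mapper1.py", "reducer1.py", "in", "tmp/stage1")
run_streaming("mapper2.py", "reducer2.py", "tmp/stage1", "out")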

-- 
Harsh J
www.harshj.com


Regarding chaining multiple map-reduce jobs in Hadoop streaming

2011-01-07 Thread Varadharajan Mukundan
Hi,

I need to chain a couple of mapreduce jobs in Hadoop streaming. I am
planning to use python to write the mapper and reducer scripts. Is
there any other way to chain these jobs other than using a shell
script and a temporary directory in HDFS?

-- 
Thanks,
M. Varadharajan



"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- www.thinkasgeek.wordpress.com


Re: Hadoop Streaming?

2010-09-08 Thread Amareshwari Sri Ramadasu
Some documentation on Hadoop streaming and pipes:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/pipes/package-summary.html

On 9/8/10 2:34 PM, "Rita Liu"  wrote:

Hi :)

May I have two simple (and general) questions regarding Hadoop Streaming?

1. What's the difference among hadoop streaming, hadoop pipes, and hadoop
online (hop), a pipelining version developed by UC Berkeley?

2. In the current hadoop trunk, where could we find hadoop-streaming.jar?
Further -- may I have an example which teaches me how to use
the hadoop-streaming feature?

Thanks a lot!
-Rita :)



Hadoop Streaming?

2010-09-08 Thread Rita Liu
Hi :)

May I have two simple (and general) questions regarding Hadoop Streaming?

1. What's the difference among hadoop streaming, hadoop pipes, and hadoop
online (hop), a pipelining version developed by UC Berkeley?

2. In the current hadoop trunk, where could we find hadoop-streaming.jar?
Further -- may I have an example which teaches me how to use
the hadoop-streaming feature?

Thanks a lot!
-Rita :)


Re: Hadoop Streaming (with Python) and Queues

2010-07-14 Thread Amareshwari Sri Ramadasu
-D options (which are generic options) must come before the command-specific
options.
The syntax is:
bin/hadoop jar streaming.jar <generic options> <command options>
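
(A sketch of Eric's command with the generic option moved to the front. Note
also -- an assumption on my part, not from this thread -- that the per-job
queue property is mapred.job.queue.name; mapred.queue.names is the
cluster-side list of queues:)

bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.job.queue.name=dev \
  -file /dev/mapper.py -mapper /dev/mapper.py \
  -file /dev/reducer.py -reducer /dev/reducer.py \
  -input DEV/input/* -output DEV/output/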

Thanks
Amareshwari

On 7/14/10 10:27 PM, "Moritz Krog"  wrote:

I second that observation, I c&p'ed most of the -D options directly from the
tutorial and found the same error message.

I'm sorry I can't help you, Eric

On Wed, Jul 14, 2010 at 6:25 PM, eric.brose  wrote:

>
> Hey all,
> We just added queues to our capacity scheduler, and now (we did not set a
> default, which it appears we might have to change)
> if I try to run a simple streaming job I get the following error.
> 10/07/14 11:03:02 ERROR streaming.StreamJob: Error Launching job :
> java.io.IOException: Queue "default" does not exist
>at
> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2998)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
>
> Streaming Job Failed!
>
> I've been playing around with adding my queue name (with the generic -D option)
> to the streaming command but have had no luck
> e.g.
> bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
> /dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
> /dev/reducer.py -input DEV/input/* -output DEV/output/ -D
> mapred.queue.names="dev"
>
> with this i get the following error
>
> 10/07/14 10:54:49 ERROR streaming.StreamJob: Unrecognized option: -D
>
>
> i've tried something similar to one of the examples in the streaming
> documentation
>
> bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
> /dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
> /dev/reducer.py -input DEV/input/* -output DEV/output/ -D
> mapred.reduce.tasks=2
>
> and still get the error
> ERROR streaming.StreamJob: Unrecognized option: -D
>
> Any assistance would be greatly appreciated! Thanks ahead of time!
> -eric
> ps using version 0.20.2 on RHEL servers
> --
> View this message in context:
> http://hadoop-common.472056.n3.nabble.com/Hadoop-Streaming-with-Python-and-Queue-s-tp966968p966968.html
> Sent from the Users mailing list archive at Nabble.com.
>



Re: Hadoop Streaming (with Python) and Queues

2010-07-14 Thread Ted Yu
If you're using capacity scheduler, see:
http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html#Setting+up+queues

The queues can be checked through the job tracker web UI under the Scheduling
Information section.

On Wed, Jul 14, 2010 at 9:57 AM, Moritz Krog wrote:

> I second that observation, I c&p'ed most of the -D options directly from
> the
> tutorial and found the same error message.
>
> I'm sorry I can't help you, Eric
>
> On Wed, Jul 14, 2010 at 6:25 PM, eric.brose  wrote:
>
> >
> > Hey all,
> > We just added queues to our capacity scheduler, and now (we did not set a
> > default, which it appears we might have to change)
> > if I try to run a simple streaming job I get the following error.
> > 10/07/14 11:03:02 ERROR streaming.StreamJob: Error Launching job :
> > java.io.IOException: Queue "default" does not exist
> >at
> > org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2998)
> >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >at java.lang.reflect.Method.invoke(Method.java:597)
> >at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
> >at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:396)
> >at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> >
> > Streaming Job Failed!
> >
> > I've been playing around with adding my queue name (with the generic -D
> > option)
> > to the streaming command but have had no luck
> > e.g.
> > bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
> > /dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
> > /dev/reducer.py -input DEV/input/* -output DEV/output/ -D
> > mapred.queue.names="dev"
> >
> > with this i get the following error
> >
> > 10/07/14 10:54:49 ERROR streaming.StreamJob: Unrecognized option: -D
> >
> >
> > i've tried something similar to one of the examples in the streaming
> > documentation
> >
> > bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
> > /dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
> > /dev/reducer.py -input DEV/input/* -output DEV/output/ -D
> > mapred.reduce.tasks=2
> >
> > and still get the error
> > ERROR streaming.StreamJob: Unrecognized option: -D
> >
> > Any assistance would be greatly appreciated! Thanks ahead of time!
> > -eric
> > ps using version 0.20.2 on RHEL servers
> > --
> > View this message in context:
> >
> http://hadoop-common.472056.n3.nabble.com/Hadoop-Streaming-with-Python-and-Queue-s-tp966968p966968.html
> > Sent from the Users mailing list archive at Nabble.com.
> >
>


Re: Hadoop Streaming (with Python) and Queues

2010-07-14 Thread Moritz Krog
I second that observation, I c&p'ed most of the -D options directly from the
tutorial and found the same error message.

I'm sorry I can't help you, Eric

On Wed, Jul 14, 2010 at 6:25 PM, eric.brose  wrote:

>
> Hey all,
> We just added queues to our capacity scheduler, and now (we did not set a
> default, which it appears we might have to change)
> if I try to run a simple streaming job I get the following error.
> 10/07/14 11:03:02 ERROR streaming.StreamJob: Error Launching job :
> java.io.IOException: Queue "default" does not exist
>at
> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2998)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
>
> Streaming Job Failed!
>
> I've been playing around with adding my queue name (with the generic -D option)
> to the streaming command but have had no luck
> e.g.
> bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
> /dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
> /dev/reducer.py -input DEV/input/* -output DEV/output/ -D
> mapred.queue.names="dev"
>
> with this i get the following error
>
> 10/07/14 10:54:49 ERROR streaming.StreamJob: Unrecognized option: -D
>
>
> i've tried something similar to one of the examples in the streaming
> documentation
>
> bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
> /dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
> /dev/reducer.py -input DEV/input/* -output DEV/output/ -D
> mapred.reduce.tasks=2
>
> and still get the error
> ERROR streaming.StreamJob: Unrecognized option: -D
>
> Any assistance would be greatly appreciated! Thanks ahead of time!
> -eric
> ps using version 0.20.2 on RHEL servers
> --
> View this message in context:
> http://hadoop-common.472056.n3.nabble.com/Hadoop-Streaming-with-Python-and-Queue-s-tp966968p966968.html
> Sent from the Users mailing list archive at Nabble.com.
>


Hadoop Streaming (with Python) and Queues

2010-07-14 Thread eric.brose

Hey all,
We just added queues to our capacity scheduler, and now (we did not set a
default, which it appears we might have to change)
if I try to run a simple streaming job I get the following error.
10/07/14 11:03:02 ERROR streaming.StreamJob: Error Launching job :
java.io.IOException: Queue "default" does not exist
at
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2998)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

Streaming Job Failed!

I've been playing around with adding my queue name (with the generic -D option)
to the streaming command but have had no luck
e.g.
bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
/dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
/dev/reducer.py -input DEV/input/* -output DEV/output/ -D
mapred.queue.names="dev"

with this i get the following error

10/07/14 10:54:49 ERROR streaming.StreamJob: Unrecognized option: -D


i've tried something similar to one of the examples in the streaming
documentation

bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar -file
/dev/mapper.py -mapper /dev/mapper.py -file /dev/reducer.py -reducer
/dev/reducer.py -input DEV/input/* -output DEV/output/ -D
mapred.reduce.tasks=2

and still get the error
ERROR streaming.StreamJob: Unrecognized option: -D

Any assistance would be greatly appreciated! Thanks ahead of time!
-eric
ps using version 0.20.2 on RHEL servers
-- 
View this message in context: 
http://hadoop-common.472056.n3.nabble.com/Hadoop-Streaming-with-Python-and-Queue-s-tp966968p966968.html
Sent from the Users mailing list archive at Nabble.com.


Re: Hadoop Streaming

2010-07-14 Thread Moritz Krog
Does that Perl script also work when I use multiple reducer tasks?

Anyway, this isn't really what I was looking for, because I intended to use
my own reducer. On top of that, I also need the intermediate data to run more
than once through the reducer. I was just hoping there is some way to
make streaming output the intermediate data as k -> list(v).
I could of course work in iterations, where I use the Perl reducer in the
first iteration and use the results from that in later iterations... but it
does sound like a lot of unnecessary work.

On Wed, Jul 14, 2010 at 10:51 AM, Alex Kozlov  wrote:

> You can use the following perl script as a reducer:
>
> ===
> #!/usr/bin/perl
> # Group consecutive lines that share a key (streaming reducer input is
> # sorted) into one "key<TAB>v1,v2,..." line.
>
> $,="\t";   # separator printed between list elements
> $\="\n";   # terminate each print with a newline
>
> while (<>) {
>chomp;   # strip the trailing newline so joined values stay on one line
>my ($key, $value) = split($,, $_, 2);
>if (defined($lastkey) and $lastkey eq $key) {
>  push @values, $value;
>} else {
>  print $lastkey, join(",", @values) if defined($lastkey);
>  $lastkey = $key;
>  @values = ($value);
>}
> }
>
> print $lastkey, join(",", @values) if defined($lastkey) and @values > 0;
> ===
>
> Alex K
>
>
> On Wed, Jul 14, 2010 at 1:17 AM, Moritz Krog  >wrote:
>
> > First of all thanks  for the quick answer :)
> >
> > is there any way to configure the job in such a way that I get the key ->
> > value list? I specifically need exactly this behavior.. it's crucial to
> > what I want to do with Hadoop..
> >
> >
> > On Wed, Jul 14, 2010 at 10:06 AM, Amareshwari Sri Ramadasu <
> > amar...@yahoo-inc.com> wrote:
> >
> > > In streaming, the combined values are given to the reducer as <key, value>
> > > pairs again, so you don't see a key and a list of values.
> > > I think it is done in that way to be symmetrical with the mapper, though I
> > > don't know the exact reason.
> > >
> > > Thanks
> > > Amareshwari
> > >
> > > On 7/14/10 1:05 PM, "Moritz Krog"  wrote:
> > >
> > > Hi everyone,
> > >
> > > I'm pretty new to Hadoop and generally avoiding Java everywhere I can,
> so
> > > I'm getting started with Hadoop streaming and python mapper and
> reducer.
> > > From what I read in the mapreduce tutorial, mapper and reducer can be
> > > plugged
> > > into Hadoop via the "-mapper" and "-reducer" options on job start. I
> was
> > > wondering what the input for the reducer would look like, so I ran a
> > Hadoop
> > > job using my own mapper but /bin/cat as reducer. As you can see, the
> > output
> > > of the job is ordered, but the keys haven't been combined:
> > >
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > > 'person'}   107488
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > > 'person'}   95560
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > > 'person'}   95562
> > >
> > > I would have expected something like:
> > >
> > > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > > 'person'}   95560, 95562, 107488
> > >
> > > my understanding from the tutorial was, that this reduction is a part
> of
> > > the
> > > shuffle and sort phase. Or do I need to use a combiner to get that
> done?
> > > Does Hadoop streaming even do this, or do I need to use a native Java
> > > class?
> > >
> > > Best,
> > > Moritz
> > >
> > >
> >
>


Re: Hadoop Streaming

2010-07-14 Thread Alex Kozlov
You can use the following perl script as a reducer:

===
#!/usr/bin/perl
# Regroups sorted "key<TAB>value" lines into one "key<TAB>v1,v2,..." line per key.

$, = "\t";   # output field separator: a tab between key and joined values
$\ = "\n";   # output record separator: a newline after every print

while (<>) {
   chomp;                                   # drop the trailing newline
   my ($key, $value) = split(/\t/, $_, 2);  # split on the first tab only
   if (defined($lastkey) and $lastkey eq $key) {
     push @values, $value;                  # same key: keep collecting
   } else {
     print $lastkey, join(",", @values) if defined($lastkey);
     $lastkey = $key;
     @values = ($value);
   }
}

# flush the last group
print $lastkey, join(",", @values) if defined($lastkey) and @values > 0;
===
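
Assuming the script is saved as group.pl and shipped with the job, a sketch
of plugging it in (paths illustrative only):

bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input myInput -output myOutput \
    -mapper /bin/cat -file group.pl -reducer group.pl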

Alex K


On Wed, Jul 14, 2010 at 1:17 AM, Moritz Krog wrote:

> First of all, thanks for the quick answer :)
>
> Is there any way to configure the job so that I get the key ->
> value list? I specifically need exactly this behavior; it's crucial to
> what I want to do with Hadoop.
>
>
> On Wed, Jul 14, 2010 at 10:06 AM, Amareshwari Sri Ramadasu <
> amar...@yahoo-inc.com> wrote:
>
> > In streaming, the combined values are given to the reducer as <key, value>
> > pairs again, so you don't see a key and a list of values.
> > I think it is done that way to be symmetrical with the mapper, though I
> > don't know the exact reason.
> >
> > Thanks
> > Amareshwari
> >
> > On 7/14/10 1:05 PM, "Moritz Krog"  wrote:
> >
> > Hi everyone,
> >
> > I'm pretty new to Hadoop and generally avoiding Java everywhere I can, so
> > I'm getting started with Hadoop streaming and python mapper and reducer.
> > From what I read in the mapreduce tutorial, mapper and reducer can be
> > plugged
> > into Hadoop via the "-mapper" and "-reducer" options on job start. I was
> > wondering what the input for the reducer would look like, so I ran a
> Hadoop
> > job using my own mapper but /bin/cat as reducer. As you can see, the
> output
> > of the job is ordered, but the keys haven't been combined:
> >
> > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > 'person'}   107488
> > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > 'person'}   95560
> > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > 'person'}   95562
> >
> > I would have expected something like:
> >
> > {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> > 'person'}   95560, 95562, 107488
> >
> > my understanding from the tutorial was, that this reduction is a part of
> > the
> > shuffle and sort phase. Or do I need to use a combiner to get that done?
> > Does Hadoop streaming even do this, or do I need to use a native Java
> > class?
> >
> > Best,
> > Moritz
> >
> >
>


Re: Hadoop Streaming

2010-07-14 Thread Moritz Krog
First of all, thanks for the quick answer :)

Is there any way to configure the job so that I get the key ->
value list? I specifically need exactly this behavior; it's crucial to what
I want to do with Hadoop.


On Wed, Jul 14, 2010 at 10:06 AM, Amareshwari Sri Ramadasu <
amar...@yahoo-inc.com> wrote:

> In streaming, the combined values are given to the reducer as <key, value>
> pairs again, so you don't see a key and a list of values.
> I think it is done that way to be symmetrical with the mapper, though I
> don't know the exact reason.
>
> Thanks
> Amareshwari
>
> On 7/14/10 1:05 PM, "Moritz Krog"  wrote:
>
> Hi everyone,
>
> I'm pretty new to Hadoop and generally avoiding Java everywhere I can, so
> I'm getting started with Hadoop streaming and python mapper and reducer.
> From what I read in the mapreduce tutorial, mapper and reducer can be
> plugged
> into Hadoop via the "-mapper" and "-reducer" options on job start. I was
> wondering what the input for the reducer would look like, so I ran a Hadoop
> job using my own mapper but /bin/cat as reducer. As you can see, the output
> of the job is ordered, but the keys haven't been combined:
>
> {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> 'person'}   107488
> {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> 'person'}   95560
> {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> 'person'}   95562
>
> I would have expected something like:
>
> {'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
> 'person'}   95560, 95562, 107488
>
> my understanding from the tutorial was, that this reduction is a part of
> the
> shuffle and sort phase. Or do I need to use a combiner to get that done?
> Does Hadoop streaming even do this, or do I need to use a native Java
> class?
>
> Best,
> Moritz
>
>


Re: Hadoop Streaming

2010-07-14 Thread Amareshwari Sri Ramadasu
In streaming, the combined values are given to the reducer as <key, value> pairs
again, so you don't see a key and a list of values.
I think it is done that way to be symmetrical with the mapper, though I don't
know the exact reason.
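
Concretely, with /bin/cat as the reducer you can see what the reduce side is
given: sorted, tab-separated lines, one <key, value> pair per line, along the
lines of (keys shortened for illustration):

===
keyA    95560
keyA    95562
keyA    107488
keyB    4711
===

Any regrouping into key -> list(values) therefore has to happen inside the
reducer itself, typically by comparing each key against the previous one.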

Thanks
Amareshwari

On 7/14/10 1:05 PM, "Moritz Krog"  wrote:

Hi everyone,

I'm pretty new to Hadoop and generally avoiding Java everywhere I can, so
I'm getting started with Hadoop streaming and python mapper and reducer.
From what I read in the mapreduce tutorial, mapper and reducer can be plugged
into Hadoop via the "-mapper" and "-reducer" options on job start. I was
wondering what the input for the reducer would look like, so I ran a Hadoop
job using my own mapper but /bin/cat as reducer. As you can see, the output
of the job is ordered, but the keys haven't been combined:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   107488
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   95560
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   95562

I would have expected something like:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   95560, 95562, 107488

my understanding from the tutorial was, that this reduction is a part of the
shuffle and sort phase. Or do I need to use a combiner to get that done?
Does Hadoop streaming even do this, or do I need to use a native Java class?

Best,
Moritz



Hadoop Streaming

2010-07-14 Thread Moritz Krog
Hi everyone,

I'm pretty new to Hadoop and generally avoiding Java everywhere I can, so
I'm getting started with Hadoop streaming and python mapper and reducer.
From what I read in the mapreduce tutorial, mapper and reducer can be plugged
into Hadoop via the "-mapper" and "-reducer" options on job start. I was
wondering what the input for the reducer would look like, so I ran a Hadoop
job using my own mapper but /bin/cat as reducer. As you can see, the output
of the job is ordered, but the keys haven't been combined:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   107488
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   95560
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   95562

I would have expected something like:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'}   95560, 95562, 107488

my understanding from the tutorial was, that this reduction is a part of the
shuffle and sort phase. Or do I need to use a combiner to get that done?
Does Hadoop streaming even do this, or do I need to use a native Java class?

Best,
Moritz


  1   2   >