Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Dan Starr
Hi, I've tried posting this to Cloudera's community support site, but
the community website getsatisfaction.com is returning various server
errors at the moment.  I believe the following issue is related to
my environment within Cloudera's Training virtual machine.

Despite having run Hadoop streaming successfully on other Hadoop
clusters and on Cloudera's Training VM in local mode, I'm currently
getting an error when attempting to run a simple Hadoop streaming job
in the normal queue-based mode on the Training VM.  I suspect the
error described below is caused by the worker node not recognizing
the Python interpreter referenced in the script's shebang line.
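
As a quick sanity check on that theory: the shebang target does exist
on the VM (Python lives at /usr/bin/python, as noted further below),
and commands along these lines verify the script itself (listed for
anyone reproducing this; outputs omitted):

head -1 blah.py         # shows the shebang line
ls -l /usr/bin/python   # confirms the interpreter path exists
file blah.py            # would report CRLF line terminators if the file weren't UNIX-formatted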

The hadoop command I am executing is:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
  -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input test_input/* -output output

Where the test_input directory contains three UNIX-formatted, single-line files:

training-vm: 3$ hadoop dfs -ls /user/training/test_input/
Found 3 items
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48 /user/training/test_input/file1
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48 /user/training/test_input/file2
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48 /user/training/test_input/file3

training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
test_line1
test_line2
test_line3

And where blah.py (UNIX-formatted) contains:

#!/usr/bin/python
import sys
for line in sys.stdin:
    print line
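
(As an aside: each line read from sys.stdin keeps its trailing
newline, so print line emits doubled newlines.  A tighter version of
the same identity mapper, still Python 2 as installed on the VM, is
sketched below; either version triggers the error, so this isn't the
cause.)

#!/usr/bin/python
import sys

for line in sys.stdin:
    # Pass each line through unchanged; sys.stdin already includes
    # the trailing newline, so skip print's extra one.
    sys.stdout.write(line)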

The resulting Hadoop streaming error is:

java.io.IOException: Cannot run program blah.py:
java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
   ...


I get the same error when placing the Python script on HDFS and then
using this in the hadoop command:

... -mapper hdfs:///user/training/blah.py ...
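
(I'm not sure -mapper accepts an HDFS URI at all; as far as I can
tell, 0.20-era streaming ships files from HDFS with the -cacheFile
option instead, along these lines, which I haven't tried:

-cacheFile hdfs:///user/training/blah.py#blah.py -mapper blah.py

where the #blah.py fragment names the symlink created in each task's
working directory.)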


One suggestion found online, which may not be relevant to Cloudera's
distribution, is that the first line of the streaming Python script
(the shebang line) may not name a path that exists on the worker's
system.  The workaround given is to use ... -mapper python blah.py ...
in the Hadoop streaming command.  That doesn't work correctly for me:
the lines from the input data files end up being parsed by the Python
interpreter itself.  It does at least show that python is available
on the worker node.  I have also tried the '-mapper blah.py' approach
with the shebang line #!/usr/bin/env python, without success, even
though on the Training VM Python is installed at /usr/bin/python.
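
(The input-lines-parsed-by-Python symptom is what you'd expect if the
unquoted arguments split, leaving bare python as the mapper command,
which then reads the job's input from stdin as if it were a script.
The form that suggestion presumably intends is quoted:

-mapper 'python blah.py'

so that streaming launches python with the script as an argument.)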

Maybe the issue is something else.  Any suggestions or insights would be appreciated.


Re: Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Todd Lipcon
Are you passing the python script to the cluster using the -file
option? e.g. -mapper foo.py -file foo.py

Thanks
-Todd




Re: Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Dan Starr
Yes, I have tried passing the script with the -file option.  Just now I tried:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
  -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input test_input/* -output output -file blah.py

And got this error for a map task:

java.io.IOException: Cannot run program blah.py:
java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
...

-Dan





Re: Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Dan Starr
Todd, Thanks!
This solved it.
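
For anyone else who hits this: rather than rebooting the whole VM,
the daemons can be restarted with the CDH init scripts.  Something
along these lines should work (script names assumed from the
hadoop-0.20 packaging, not verified):

sudo /etc/init.d/hadoop-0.20-jobtracker restart
sudo /etc/init.d/hadoop-0.20-tasktracker restart

A full VM restart also works, as Todd notes below.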

-Dan

On Wed, Feb 17, 2010 at 8:00 PM, Todd Lipcon t...@cloudera.com wrote:
 Hi Dan,

 This is actually a bug in the release you're using. Please run:

 $ sudo apt-get update
 $ sudo apt-get install hadoop-0.20

 Then restart the daemons (or the entire VM) and give it another go.

 Thanks
 -Todd
