Are you shipping the Python script to the cluster with the -file
option? E.g. -mapper foo.py -file foo.py
Without -file, the task nodes have no local copy of the script to run.
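
A minimal sketch of that suggestion, assuming the job is launched from a Python wrapper. The helper name build_streaming_command and the local path /home/training/blah.py are hypothetical; the jar path, reducer, input and output come from the command quoted below. It only assembles the argument list, so -mapper gets the bare script name while -file gets the local path to ship:

```python
import os
import shlex


def build_streaming_command(jar, mapper_path, input_glob, output_dir,
                            reducer="org.apache.hadoop.mapred.lib.IdentityReducer"):
    """Return the argv list for a streaming job that ships the mapper script.

    mapper_path is the script's path on the submitting machine (hypothetical
    here). -file copies it into each task's working directory, so -mapper
    refers to it by its base name only.
    """
    mapper_name = os.path.basename(mapper_path)
    return [
        "hadoop", "jar", jar,
        "-mapper", mapper_name,
        "-file", mapper_path,
        "-reducer", reducer,
        "-input", input_glob,
        "-output", output_dir,
    ]


if __name__ == "__main__":
    argv = build_streaming_command(
        "/usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar",
        "/home/training/blah.py",  # assumed local location of the script
        "test_input/*",
        "output",
    )
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually submit the job, the list could be passed to subprocess.call(argv) on a machine where the hadoop command is on the PATH.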

Thanks
-Todd

On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr <dsta...@gmail.com> wrote:
> Hi, I've tried posting this to Cloudera's community support site, but
> the community website getsatisfaction.com returns various server
> errors at the moment.  I believe the following is an issue related to
> my environment within Cloudera's Training virtual machine.
>
> Despite having success running Hadoop streaming on other Hadoop
> clusters and on Cloudera's Training VM in local mode, I'm currently
> getting an error when attempting to run a simple Hadoop streaming job
> in the normal queue-based mode on the Training VM.  I'm thinking the
> error described below is an issue related to the worker node not
> recognizing the python reference in the script's top shebang line.
>
> The hadoop command I am executing is:
>
> hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
>   -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
>   -input test_input/* -output output
>
> Where the test_input directory contains 3 UNIX-formatted, single-line files:
>
> training-vm: 3$ hadoop dfs -ls /user/training/test_input/
> Found 3 items
> -rw-r--r--   1 training supergroup         11 2010-02-17 10:48 /user/training/test_input/file1
> -rw-r--r--   1 training supergroup         11 2010-02-17 10:48 /user/training/test_input/file2
> -rw-r--r--   1 training supergroup         11 2010-02-17 10:48 /user/training/test_input/file3
>
> training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
> test_line1
> test_line2
> test_line3
>
> And where blah.py (UNIX-formatted) looks like this:
>
> #!/usr/bin/python
> import sys
> for line in sys.stdin:
>    print line
>
> The resulting Hadoop-Streaming error is:
>
> java.io.IOException: Cannot run program "blah.py":
> java.io.IOException: error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
> at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
>    ...
>
>
> I get the same error when I place the Python script on HDFS and
> then use this in the hadoop command:
>
> ... -mapper hdfs:///user/training/blah.py ...
>
>
> One suggestion found online, which may not apply to Cloudera's
> distribution, is that the shebang line at the top of the streaming
> script may name a path that does not exist on the system.  The
> suggested fix is to use ... -mapper "python blah.py" ... in the
> Hadoop streaming command.  This doesn't seem to work correctly for
> me, since the lines from the input data files also end up being
> parsed by the Python interpreter.  It does show, however, that Python
> is available on the worker node.  I have also tried '-mapper blah.py'
> with the shebang line "#!/usr/bin/env python", without success,
> although on the Training VM Python is installed at /usr/bin/python.
>
> Maybe the issue is something else.  Any suggestions or insights will be helpful.
>
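
To test the shebang theory from the quoted message separately from the question of whether the script reached the node at all, here is a rough diagnostic sketch. The function name check_script is hypothetical and the list of checks is an assumption about what commonly triggers "error=2, No such file or directory"; it is not taken from Hadoop itself. It would have to be run on the machine where the task executes:

```python
import os
import shutil


def check_script(path):
    """Return a list of problems that could stop the OS from running `path` directly."""
    if not os.path.isfile(path):
        # The most basic cause of error=2: the script is simply not there.
        return ["script not found: %s" % path]

    problems = []
    if not os.access(path, os.X_OK):
        problems.append("script is not executable (chmod +x may be needed)")

    with open(path, "rb") as handle:
        first_line = handle.readline()

    if first_line.endswith(b"\r\n"):
        problems.append("shebang line ends with CRLF (DOS line endings)")
    if not first_line.startswith(b"#!"):
        problems.append("no shebang line")
        return problems

    # Split "#!/usr/bin/env python" into ["/usr/bin/env", "python"].
    parts = first_line[2:].decode("utf-8", "replace").split()
    if not parts:
        problems.append("empty shebang line")
    elif not os.path.exists(parts[0]):
        problems.append("interpreter not found: %s" % parts[0])
    elif (os.path.basename(parts[0]) == "env" and len(parts) > 1
            and shutil.which(parts[1]) is None):
        problems.append("env cannot find %s on PATH" % parts[1])
    return problems


if __name__ == "__main__":
    for problem in check_script("blah.py"):
        print(problem)
```

An empty result for a script that is present and executable would suggest the shebang is not the culprit, which points back at how the script is shipped to the task.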
