Are you passing the Python script to the cluster using the -file option? e.g. -mapper foo.py -file foo.py
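Concretely, using the jar path and arguments quoted in the original post below, the submission would look something like this (a sketch, not tested against that VM):

```shell
# -file ships blah.py into each task's working directory, so the worker
# can actually find "blah.py" when it tries to exec the mapper.
# The script also needs to be executable (chmod +x blah.py) with a
# shebang line that resolves on the worker nodes.
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
  -mapper blah.py \
  -file blah.py \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input 'test_input/*' \
  -output output
```

Without -file, "blah.py" only exists on the submitting machine, which matches the "error=2, No such file or directory" in the stack trace below.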
Thanks
-Todd

On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr <dsta...@gmail.com> wrote:
> Hi, I've tried posting this to Cloudera's community support site, but
> the community website getsatisfaction.com returns various server
> errors at the moment. I believe the following is an issue related to
> my environment within Cloudera's Training virtual machine.
>
> Despite having success running Hadoop streaming on other Hadoop
> clusters and on Cloudera's Training VM in local mode, I'm currently
> getting an error when attempting to run a simple Hadoop streaming job
> in the normal queue-based mode on the Training VM. I'm thinking the
> error described below is an issue related to the worker node not
> recognizing the python reference in the script's top shebang line.
>
> The hadoop command I am executing is:
>
> hadoop jar
> /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
> -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
> -input test_input/* -output output
>
> Where the test_input directory contains 3 UNIX-formatted, single-line files:
>
> training-vm: 3$ hadoop dfs -ls /user/training/test_input/
> Found 3 items
> -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48
> /user/training/test_input/file1
> -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48
> /user/training/test_input/file2
> -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48
> /user/training/test_input/file3
>
> training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
> test_line1
> test_line2
> test_line3
>
> And where blah.py looks like (UNIX-formatted):
>
> #!/usr/bin/python
> import sys
> for line in sys.stdin:
>     print line
>
> The resulting Hadoop-Streaming error is:
>
> java.io.IOException: Cannot run program "blah.py":
> java.io.IOException: error=2, No such file or directory
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>     at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
> ...
>
> I get the same error when placing the python script on the HDFS, and
> then using this in the hadoop command:
>
> ... -mapper hdfs:///user/training/blah.py ...
>
> One suggestion found online, which may not be relevant to Cloudera's
> distribution, mentions that the first line of the hadoop-streaming
> python script (the shebang line) may not describe an applicable path
> for the system. The solution mentioned is to use: ... -mapper "python
> blah.py" ... in the Hadoop streaming command. This doesn't seem to
> work correctly for me, since I find that the lines from the input data
> files are also parsed by the Python interpreter. But this does reveal
> that python is available on the worker node when using this technique.
> I have also tried without success the '-mapper blah.py' technique
> using shebang lines: "#!/usr/bin/env python", although on the training
> VM Python is installed under /usr/bin/python.
>
> Maybe the issue is something else. Any suggestions or insights will be
> helpful.
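As a side note on the quoted mapper itself: `for line in sys.stdin` yields lines that still carry their trailing "\n", so Python 2's `print line` emits a doubled newline per record. A minimal sketch of the identity mapper that strips the newline first, written so it can be sanity-checked locally before submitting (the sample records are the three test lines from the post; `identity_map` is an illustrative name, not part of the thread's script):

```python
import sys
from io import StringIO


def identity_map(stream):
    """Yield each streaming input record unchanged.

    rstrip("\n") drops the newline already present on each line,
    avoiding the doubled blank lines that a bare "print line"
    produces in Python 2.
    """
    for line in stream:
        yield line.rstrip("\n")


# Local sanity check with the three records from the test_input files:
sample = StringIO("test_line1\ntest_line2\ntest_line3\n")
for rec in identity_map(sample):
    print(rec)

# In the real job, the loop would read sys.stdin instead of the sample:
#   for rec in identity_map(sys.stdin):
#       print(rec)
```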