Hadoop Streaming File-not-found error on Cloudera's training VM
Hi, I've tried posting this to Cloudera's community support site, but the community website getsatisfaction.com is returning various server errors at the moment. I believe the following is an issue related to my environment within Cloudera's Training virtual machine. Despite having successfully run Hadoop Streaming on other Hadoop clusters and on Cloudera's Training VM in local mode, I'm currently getting an error when attempting to run a simple Hadoop Streaming job in the normal queue-based mode on the Training VM. I suspect the error described below is caused by the worker node not recognizing the Python interpreter referenced in the script's shebang line.

The hadoop command I am executing is:

  hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
      -mapper blah.py \
      -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
      -input test_input/* -output output

The test_input directory contains 3 UNIX-formatted, single-line files:

  training-vm: 3$ hadoop dfs -ls /user/training/test_input/
  Found 3 items
  -rw-r--r--   1 training supergroup   11 2010-02-17 10:48 /user/training/test_input/file1
  -rw-r--r--   1 training supergroup   11 2010-02-17 10:48 /user/training/test_input/file2
  -rw-r--r--   1 training supergroup   11 2010-02-17 10:48 /user/training/test_input/file3

  training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
  test_line1
  test_line2
  test_line3

And blah.py looks like this (UNIX-formatted):

  #!/usr/bin/python
  import sys
  for line in sys.stdin:
      print line

The resulting Hadoop Streaming error is:

  java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
          at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
          ...

I get the same error when placing the Python script on HDFS and then using this in the hadoop command:

  ... -mapper hdfs:///user/training/blah.py ...
One suggestion found online, which may not be relevant to Cloudera's distribution, is that the first line of the streaming Python script (the shebang line) may not point to a valid interpreter path on the worker nodes. The suggested workaround is to use:

  ... -mapper python blah.py ...

in the Hadoop Streaming command. This doesn't work correctly for me: the lines from the input data files end up being parsed by the Python interpreter as well. It does at least show that python is available on the worker nodes when using this technique. I have also tried the '-mapper blah.py' technique with the shebang line #!/usr/bin/env python, without success, even though on the Training VM Python is installed at /usr/bin/python. Maybe the issue is something else. Any suggestions or insights would be helpful.
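An aside on the script itself, separate from the file-not-found error: as written, blah.py will double-space its output, because each line read from sys.stdin still carries its trailing newline and the print statement appends another. A minimal identity-mapper sketch that avoids this (the map_lines helper name is mine, not something from the thread):

```python
import sys

def map_lines(stream):
    # Drop the trailing newline from each input line so that printing
    # the result does not double-space the output.
    return [line.rstrip('\n') for line in stream]

if __name__ == '__main__':
    for out_line in map_lines(sys.stdin):
        print(out_line)
```

For large inputs a generator would avoid buffering the whole stream in memory; a list just keeps the sketch easy to test.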
Re: Hadoop Streaming File-not-found error on Cloudera's training VM
Are you passing the python script to the cluster using the -file option? e.g.:

  -mapper foo.py -file foo.py

Thanks,
-Todd

On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote: [...]
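For reference on the earlier '-mapper python blah.py' attempt: Hadoop Streaming takes the -mapper value as one command, so a two-word mapper would need quoting (e.g. -mapper 'python blah.py' together with -file blah.py); unquoted, the interpreter ends up launched without the script argument. The symptom Dan describes, input lines being parsed by the interpreter, matches that failure mode and can be reproduced locally with a sketch like this (sys.executable stands in for whatever python the worker would run):

```python
import subprocess
import sys

# Launch the interpreter with no script argument and feed it a data
# line on stdin: it treats the data as Python source and fails.
proc = subprocess.run(
    [sys.executable, '-'],
    input=b'test_line1\n',
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# Nonzero exit code: "test_line1" is not valid Python, mirroring what
# happens when the mapper command loses its script argument.
print(proc.returncode)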
Re: Hadoop Streaming File-not-found error on Cloudera's training VM
Yes, I have tried that when passing the script. Just now I tried:

  hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
      -mapper blah.py \
      -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
      -input test_input/* -output output \
      -file blah.py

And got this error for a map task:

  java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
          at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
          at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
          ...

-Dan

On Wed, Feb 17, 2010 at 7:47 PM, Todd Lipcon t...@cloudera.com wrote: [...]
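When streaming fails with "Cannot run program ... error=2" even though the script was shipped with -file, one classic cause (not the one in this thread, as the next reply shows, but cheap to rule out) is DOS line endings: a carriage return after the shebang makes the kernel look for an interpreter literally named "/usr/bin/python\r". A small check along these lines, demonstrated on a throwaway copy (the paths are hypothetical):

```python
# Detect DOS (CRLF) line endings, which silently break a script's
# shebang line on UNIX systems.
def has_dos_line_endings(path):
    with open(path, 'rb') as f:
        return b'\r\n' in f.read()

# Write a throwaway clean-format copy to demonstrate the check.
with open('/tmp/blah_check.py', 'wb') as f:
    f.write(b'#!/usr/bin/python\nimport sys\n')

print(has_dos_line_endings('/tmp/blah_check.py'))  # -> False for this clean copy
```

A True result would mean the script needs converting (e.g. with dos2unix) before Hadoop Streaming can execute it.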
Re: Hadoop Streaming File-not-found error on Cloudera's training VM
Todd, Thanks! This solved it.

-Dan

On Wed, Feb 17, 2010 at 8:00 PM, Todd Lipcon t...@cloudera.com wrote:

  Hi Dan,

  This is actually a bug in the release you're using. Please run:

    $ sudo apt-get update
    $ sudo apt-get install hadoop-0.20

  Then restart the daemons (or the entire VM) and give it another go.

  Thanks,
  -Todd

  On Wed, Feb 17, 2010 at 7:56 PM, Dan Starr dsta...@gmail.com wrote: [...]