Import path for hadoop streaming with python

2008-05-22 Thread Martin Blom
Hello all,

I'm trying to stream a little python script on my small hadoop
cluster, and it doesn't work like I thought it would.

The script looks something like

#!/usr/bin/env python
import mylib
dostuff

where mylib is a small python library that I want included, and I
launch the whole thing with something like

bin/hadoop jar contrib/streaming/hadoop-0.16.4-streaming.jar
-cacheFile hdfs://master:54310/user/hadoop/mylib.py#mylib.py -file
script.py -mapper script.py -input input -output output

so it seems to me like the library should be available to the script.
When I run the script locally on my machine everything works perfectly
fine. However, when I run it there, the script can't find the library.
Does hadoop do anything strange to default paths? Am I missing
something obvious? Any pointers or ideas on how to fix this would be
great.

Martin Blom


Re: Import path for hadoop streaming with python

2008-05-22 Thread Saptarshi Guha
I haven't done this using Hadoop, but before 0.16.4 I had written my
own distributed batch processor using HDFS as common file storage
with remote execution of Python scripts.
The scripts all required a custom module that was copied to the remote
temp folders (a primitive implementation of -cacheFile).


So this is what I did: just after #!/usr/bin/env python

import sys
sys.path.append('.')
import mylib
dostuff

so that your module can be found in the current directory.
It should work after that.
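For reference, here is a self-contained sketch of that technique. The module name and its contents below are hypothetical stand-ins for mylib (the real module would already be sitting in the task's working directory, placed there by -cacheFile):

```python
import os
import sys
import tempfile

# Simulate the streaming setup: the cached module file sits in the
# task's working directory alongside the mapper script.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "mylib.py"), "w") as f:
    f.write("def transform(line):\n    return line.upper()\n")
os.chdir(workdir)

# The fix: put the current directory on the import path, since the
# streaming task does not add it by default.
sys.path.append(".")

import mylib

def run_mapper(lines):
    # A minimal mapper in streaming style: transform each input line.
    return [mylib.transform(line) for line in lines]

print(run_mapper(["hello", "world"]))  # ['HELLO', 'WORLD']
```

Without the sys.path.append('.') line, the import fails even though mylib.py is right next to the script, which matches the behaviour Martin describes.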
Regards
Saptarshi

On May 22, 2008, at 7:39 PM, Martin Blom wrote:


Hello all,

I'm trying to stream a little python script on my small hadoop
cluster, and it doesn't work like I thought it would.

The script looks something like

#!/usr/bin/env python
import mylib
dostuff

where mylib is a small python library that I want included, and I
launch the whole thing with something like

bin/hadoop jar contrib/streaming/hadoop-0.16.4-streaming.jar
-cacheFile hdfs://master:54310/user/hadoop/mylib.py#mylib.py -file
script.py -mapper script.py -input input -output output

so it seems to me like the library should be available to the script.
When I run the script locally on my machine everything works perfectly
fine. However, when I run it there, the script can't find the library.
Does hadoop do anything strange to default paths? Am I missing
something obvious? Any pointers or ideas on how to fix this would be
great.

Martin Blom


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
You love your home and want it to be beautiful.


