Has anybody been able to ship a Python library to a Hadoop streaming
job using -cacheArchive? I can see my unjarred archive from my
mapper, but I'm not able to import Python modules from within it.
As a test, I'm jarring up a test directory and putting it on the HDFS:
[EMAIL PROTECTED] ~]# ls jar_test
__init__.py __init__.pyc bar.py foo.py foo.pyc
[EMAIL PROTECTED] ~]# jar cvf jar_test.jar -C jar_test .
[...]
[EMAIL PROTECTED] ~]# hadoop dfs -put jar_test.jar jar_test.jar
[...]
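(As a sanity check, the jar is there when I list it; output elided:)
[EMAIL PROTECTED] ~]# hadoop dfs -ls jar_test.jar
[...]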
My test module is importable locally:
[EMAIL PROTECTED] ~]# python
Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> import jar_test.foo
>>>
I include "-cacheArchive hdfs:///user/root/jar_test.jar#jar_test" in
my Hadoop streaming invocation.
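For context, the full invocation looks roughly like the following; the
streaming jar path and the -input/-output/-mapper values are stand-ins
for my setup, only the -cacheArchive argument is verbatim:

[EMAIL PROTECTED] ~]# hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input my_input -output my_output \
    -mapper mapper.py -file mapper.py \
    -cacheArchive hdfs:///user/root/jar_test.jar#jar_test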
My mapper is able to read the linked, extracted jar_test directory.
The mapper below prints "['foo.py', '.jar_test.jar.crc', 'jar_test.jar',
'__init__.py', 'META-INF', 'bar.py']" to the mapper output.
#!/usr/bin/env python
import sys
import os

#import jar_test.foo

if __name__ == "__main__":
    # Drain stdin so the streaming task doesn't fail on unread input.
    for line in sys.stdin:
        pass
    # Show what the jar_test symlink actually contains.
    print os.listdir('jar_test')
However, when I uncomment the import line, my mapper dies with
"ImportError: No module named jar_test.foo".
Any clues?
Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra