Has anybody been able to ship a Python library to a Hadoop streaming
job using -cacheArchive? I can see my unjarred archive from my
mapper, but I'm not able to import Python modules from within it.
As a test, I'm jarring up a test directory and putting it on the HDFS:
[EMAIL PROTECTED] ~]# ls jar_test
__init__.py __init__.pyc bar.py foo.py foo.pyc
[EMAIL PROTECTED] ~]# jar cvf jar_test.jar -C jar_test .
[...]
[EMAIL PROTECTED] ~]# hadoop dfs -put jar_test.jar jar_test.jar
[...]
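(As a sanity check, the jar is there when I list it; output elided:)
[EMAIL PROTECTED] ~]# hadoop dfs -ls jar_test.jar
[...]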
My test module is importable locally:
[EMAIL PROTECTED] ~]# python
Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> import jar_test.foo
>>>
I include "-cacheArchive hdfs:///user/root/jar_test.jar#jar_test" in
my Hadoop streaming invocation.
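For context, the full invocation looks roughly like the following; the
streaming jar path and the -input/-output/-mapper values are stand-ins
for my setup, only the -cacheArchive argument is verbatim:

[EMAIL PROTECTED] ~]# hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input my_input -output my_output \
    -mapper mapper.py -file mapper.py \
    -cacheArchive hdfs:///user/root/jar_test.jar#jar_test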
My mapper is able to read the linked, extracted jar_test directory.
The mapper below prints "['foo.py', '.jar_test.jar.crc', 'jar_test.jar',
'__init__.py', 'META-INF', 'bar.py']" to the mapper output.
#!/usr/bin/env python
import sys
import os

#import jar_test.foo

if __name__ == "__main__":
    # Drain stdin so the streaming task doesn't fail on unread input.
    for line in sys.stdin:
        pass
    # Show what the jar_test symlink actually contains.
    print os.listdir('jar_test')
However, when I uncomment the import line, my mapper dies with
"ImportError: No module named jar_test.foo".
Any clues?
Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra