Maybe you didn't specify a symlink name in your command line, so the symlink
name will be just lib.jar, and I am not sure how you import the lib module in
your main.py file then. Please try this: put main.py and lib.py in the same
archive, e.g. app.zip:

    -archives hdfs://hdfs-namenode/user/me/app.zip#app
    -mapper "app/main.py map" -reducer "app/main.py reduce"

and in main.py:

    import app.lib

or, using the relative form:

    from . import lib

(either way, add an empty __init__.py to the archive so Python treats app as
a package)
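For example, a minimal main.py could look roughly like this (just a sketch:
it assumes app.zip holds main.py and lib.py side by side and is symlinked as
app, and lib.process is a placeholder for whatever lib.py defines). Instead
of relying on package imports, it puts the real, resolved directory of the
script on sys.path:

    #!/usr/bin/env python
    # main.py -- sketch; assumes app.zip contains main.py and lib.py side by
    # side and is unpacked/symlinked as "app" in the task's working directory.
    import os
    import sys

    # Put the directory that really contains this script on sys.path.
    # realpath() follows the "app" symlink to the unpacked archive, so a
    # plain "import lib" then finds lib.py next to main.py.
    sys.path.insert(0, os.path.dirname(os.path.realpath(__file__)))

    import lib

    if __name__ == '__main__':
        mode = sys.argv[1]  # "map" or "reduce", as passed on the command line
        for line in sys.stdin:
            sys.stdout.write(lib.process(mode, line))  # hypothetical helper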
On Mon, Aug 12, 2013 at 6:01 PM, Andrei <faithlessfri...@gmail.com> wrote:

> Hi Binglin,
>
> thanks for your explanation, now it makes sense. However, I'm not sure how
> to implement the method you suggest.
>
> First of all, I found out that the `-cacheArchive` option is deprecated, so
> I had to use `-archives` instead. I put my `lib.py` into a directory `lib`
> and then zipped it to `lib.zip`. After that I uploaded the archive to HDFS
> and linked it in the call to the Streaming API as follows:
>
>     hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files
>     main.py *-archives hdfs://hdfs-namenode/user/me/lib.jar* -mapper
>     "./main.py map" -reducer "./main.py reduce" -combiner "./main.py combine"
>     -input input -output output
>
> But the script failed, and from the logs I see that lib.jar hasn't been
> unpacked. What am I missing?
>
>
> On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang <decst...@gmail.com> wrote:
>
>> Hi,
>>
>> The problem seems to be caused by symlinks. Hadoop uses a file cache, so
>> every file is in fact a symlink:
>>
>>     lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
>>     /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
>>     lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
>>     /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
>>     [root@master01 tmp]# ./main.py
>>     Traceback (most recent call last):
>>       File "./main.py", line 3, in ?
>>         import lib
>>     ImportError: No module named lib
>>
>> This looks like a Python bug: when importing, it can't handle the symlink.
>>
>> You can try using a directory containing lib.py together with
>> -cacheArchive, so the symlink actually links to a directory; Python may
>> handle that case well.
>>
>> Thanks,
>> Binglin
>>
>>
>> On Mon, Aug 12, 2013 at 2:50 PM, Andrei <faithlessfri...@gmail.com> wrote:
>>
>>> (cross-posted from StackOverflow
>>> <http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>)
>>>
>>> I have a MapReduce job defined in file *main.py*, which imports module
>>> lib from file *lib.py*. I use Hadoop Streaming to submit this job to the
>>> Hadoop cluster as follows:
>>>
>>>     hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>>         -files lib.py,main.py
>>>         -mapper "./main.py map" -reducer "./main.py reduce"
>>>         -input input -output output
>>>
>>> In my understanding, this should put both main.py and lib.py into the
>>> *distributed cache folder* on each computing machine and thus make module
>>> lib available to main. But that doesn't happen: from the log file I see
>>> that the files *are really copied* to the same directory, yet main can't
>>> import lib, throwing an *ImportError*.
>>>
>>> Adding the current directory to the path didn't work:
>>>
>>>     import sys, os
>>>     sys.path.append(os.path.realpath(__file__))
>>>     import lib  # ImportError
>>>
>>> though loading the module manually did the trick:
>>>
>>>     import imp
>>>     lib = imp.load_source('lib', 'lib.py')
>>>
>>> But that's not what I want. So why can the Python interpreter see other
>>> .py files in the same directory but not import them? Note that I have
>>> already tried adding an empty __init__.py file to the same directory,
>>> without effect.
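For reference, the manual-loading workaround from the quoted message can be
made explicit about the path it reads from (again just a sketch; it assumes
lib.py sits in the task's working directory, as with -files above). It works
because load_source() opens the file by path, and open() follows the
distributed-cache symlink where the import machinery apparently does not:

    import imp
    import os

    # Fallback sketch: load lib.py by explicit path rather than via "import".
    # open() follows the cache symlink, so this succeeds even when
    # "import lib" raises ImportError in the same directory.
    lib_path = os.path.join(os.getcwd(), 'lib.py')
    lib = imp.load_source('lib', lib_path)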