Re: Problem with Python + Hadoop: how to link .so outside Python?

2011-09-18 Thread Guang-Nan Cheng
You can do it.

If you understand how Hadoop works, you should realize that this is really
a Python question and a Linux question.

Pass the native files via -files and set up the environment variables
via mapred.child.env.

I've done a similar thing with Ruby. For Ruby, the environment
variables are PATH, GEM_HOME, GEM_PATH, LD_LIBRARY_PATH and RUBYLIB.


  -D mapred.child.env=PATH=ruby-1.9.2-p180/bin:'$PATH',GEM_HOME=ruby-1.9.2-p180,LD_LIBRARY_PATH=ruby-1.9.2-p180/lib,GEM_PATH=ruby-1.9.2-p180,RUBYLIB=ruby-1.9.2-p180/lib/ruby/site_ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/site_ruby/1.9.1/x86_64-linux:ruby-1.9.2-p180/lib/ruby/site_ruby:ruby-1.9.2-p180/lib/ruby/vendor_ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/vendor_ruby/1.9.1/x86_64-linux:ruby-1.9.2-p180/lib/ruby/vendor_ruby:ruby-1.9.2-p180/lib/ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/1.9.1/x86_64-linux \
  -files ruby-1.9.2-p180 \
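
For Python, the analogous variables would be PATH, PYTHONHOME (or PYTHONPATH)
and LD_LIBRARY_PATH. Something along these lines might work for your job
(only an untested sketch, assuming your archive unpacks as python27/ and that
you ship the extra native libraries, ATLAS, libgfortran and so on, in a
directory named extra-libs):

  -D mapred.child.env=PATH=python27/bin:'$PATH',PYTHONHOME=python27,LD_LIBRARY_PATH=python27/lib:extra-libs \
  -files extra-libs \
  -cacheArchive /share/python27.tar.gz#python27 \

The quoting around '$PATH' matters: it stops the variable from being expanded
on the submitting machine, so it gets resolved on each task node instead.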




On Thu, Sep 1, 2011 at 8:01 PM, Xiong Deng dbigb...@gmail.com wrote:
 Hi,

 I have successfully installed scipy for Python 2.7 on my local Linux machine,
 and I want to pack my Python 2.7 (with scipy) onto Hadoop and run my Python
 MapReduce scripts, like this:

 ${HADOOP_HOME}/bin/hadoop streaming \
     -input ${input} \
     -output ${output} \
     -mapper "python27/bin/python27.sh rp_extractMap.py" \
     -reducer "python27/bin/python27.sh rp_extractReduce.py" \
     -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
     -file rp_extractMap.py \
     -file rp_extractReduce.py \
     -file shitu_conf.py \
     -cacheArchive /share/python27.tar.gz#python27 \
     -outputformat org.apache.hadoop.mapred.TextOutputFormat \
     -inputformat org.apache.hadoop.mapred.CombineTextInputFormat \
     -jobconf mapred.max.split.size=51200 \
     -jobconf mapred.job.name=[reserve_price][rp_extract] \
     -jobconf mapred.job.priority=HIGH \
     -jobconf mapred.job.map.capacity=1000 \
     -jobconf mapred.job.reduce.capacity=200 \
     -jobconf mapred.reduce.tasks=200 \
     -jobconf num.key.fields.for.partition=2

 I have to do this because the Hadoop cluster has its own Python, which is too
 old to run some of my scripts, and I do not have the privilege to install the
 scipy library on that server. So I have to use the -cacheArchive option to
 include my own Python 2.7 with scipy.

 But I have found that some of the .so files in scipy are linked against other
 dynamic libs outside Python 2.7. For example:

 $ ldd ~/local/python-2.7.2/lib/python2.7/site-packages/scipy/linalg/flapack.so
        liblapack.so => /usr/local/atlas/lib/liblapack.so (0x002a956fd000)
        libatlas.so => /usr/local/atlas/lib/libatlas.so (0x002a95df3000)
        libgfortran.so.3 => /home/work/local/gcc-4.6.1/lib64/libgfortran.so.3 (0x002a9668d000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x002a968b6000)
        libgcc_s.so.1 => /home/work/local/gcc-4.6.1/lib64/libgcc_s.so.1 (0x002a96a3c000)
        libquadmath.so.0 => /home/work/local/gcc-4.6.1/lib64/libquadmath.so.0 (0x002a96b51000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x002a96c87000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a96ebb000)
        /lib64/ld-linux-x86-64.so.2 (0x00552000)


 So, my question is: how can I include these libs? Should I search for all the
 linked .so and .a files on my local Linux box and pack them together with
 Python 2.7? If so, how can I get a full list of the libs needed, and how can I
 make the packed Python 2.7 know where to find the new libs?
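
 For example, I imagine I could collect them with something roughly like this
 (just a sketch; the extra-libs directory name is made up and the paths are
 from my local install):

   # walk every scipy extension module, ask ldd what it links against,
   # and copy each reported library into one directory to ship with the job
   PY=~/local/python-2.7.2/lib/python2.7
   mkdir -p extra-libs
   find $PY/site-packages/scipy -name '*.so' \
     | xargs ldd 2>/dev/null \
     | awk '/=> \// {print $3}' | sort -u \
     | xargs -I{} cp {} extra-libs/

 (This would also pick up system libs like libc, which I would probably leave
 out.) Is that the right idea?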

 Thanks
 Xiong


