Hi all,

After the SPARK-5479 <https://issues.apache.org/jira/browse/SPARK-5479> fix (thanks to Marcelo Vanzin), pyspark now correctly adds several python files (or a zip folder with __init__.py) to PYTHONPATH in yarn-cluster mode.
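For context, the reason a zip of pure-Python files works at all is Python's built-in zipimport machinery: once the zip is on sys.path, packages inside it are importable directly. A minimal local sketch of that mechanism (the package name "mylib" and its contents are made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny pure-Python package inside a zip, standing in for what
# --py-files ships to the executors ("mylib" is a hypothetical package).
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "mylib.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mylib/__init__.py", "")
    zf.writestr("mylib/util.py", "def double(x):\n    return 2 * x\n")

# PySpark does essentially this on each executor: the zip is added to
# sys.path, and Python's zipimport loads pure-Python modules straight
# out of it -- no installation step required.
sys.path.insert(0, zip_path)
from mylib.util import double

print(double(21))  # prints 42
```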
But adding a python module as a zip folder still fails if that zip contains file types other than python files (compiled byte code or C code). Say you want to pass the numpy package to the --py-files flag, downloaded as numpy-1.9.2.zip from this link <https://pypi.python.org/pypi/numpy> — it does not work, complaining that the "import numpy" line has failed. The numpy module needs to be *installed* before it can be imported in a Spark Python script.

So does that mean you need to install the required python modules on all machines before using pyspark? Or what is the best pattern for using any 3rd party python module in a Spark Python job?

Thanks.

On Thu, Jun 25, 2015 at 12:55 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Please take a look at the pull request with the actual fix; that will
> explain why it's the same issue.
>
> On Thu, Jun 25, 2015 at 12:51 PM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>
>> Thanks Marcelo.
>>
>> But my case is different. My mypython/libs/numpy-1.9.2.zip is in a *local
>> directory* (I can also put it in HDFS), but it still fails.
>>
>> But SPARK-5479 <https://issues.apache.org/jira/browse/SPARK-5479> is:
>> PySpark on yarn mode needs to support *non-local* python files.
>>
>> The job fails only when I try to include a 3rd party dependency from the
>> local computer with --py-files (in Spark 1.4).
>>
>> Both of these commands succeed:
>>
>> ./bin/spark-submit --master yarn-cluster --verbose hdfs:///pi.py
>> ./bin/spark-submit --master yarn-cluster --deploy-mode cluster --verbose examples/src/main/python/pi.py
>>
>> But this particular example with the 3rd party numpy module fails:
>>
>> ./bin/spark-submit --verbose --master yarn-cluster --py-files mypython/libs/numpy-1.9.2.zip --deploy-mode cluster mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0
>>
>> All these files — mypython/libs/numpy-1.9.2.zip and mypython/scripts/kmeans.py — are
>> local files; kmeans_data.txt is in HDFS.
>>
>> Thanks.
>>
>> On Thu, Jun 25, 2015 at 12:22 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>>> That sounds like SPARK-5479, which is not in 1.4...
>>>
>>> On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>>
>>>> In addition to my previous emails: when I try to execute this command from the
>>>> command line:
>>>>
>>>> ./bin/spark-submit --verbose --master yarn-cluster --py-files mypython/libs/numpy-1.9.2.zip --deploy-mode cluster mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0
>>>>
>>>> - numpy-1.9.2.zip - the downloaded numpy package
>>>> - kmeans.py - the default example which comes with Spark 1.4
>>>> - kmeans_data.txt - the default data file which comes with Spark 1.4
>>>>
>>>> it fails, saying that it could not find numpy:
>>>>
>>>> File "kmeans.py", line 31, in <module>
>>>>   import numpy
>>>> ImportError: No module named numpy
>>>>
>>>> Has anyone run a Python Spark application in yarn-cluster mode with
>>>> 3rd party Python modules to be shipped along?
>>>>
>>>> What configurations or installations need to be done before running a
>>>> Python Spark job with 3rd party dependencies on yarn-cluster?
>>>>
>>>> Thanks in advance.
>>>>
>>> --
>>> Marcelo
>>
>> --
>> Best regards,
>> Elkhan Dadashov
>
> --
> Marcelo

--
Best regards,
Elkhan Dadashov
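On the question the thread ends with: the numpy-1.9.2.zip from PyPI is a source distribution that must be compiled and installed first, and even an installed numpy contains C-extension modules (.so/.pyd files) that Python's zipimport cannot load from a zip on sys.path — which is why --py-files works for pure-Python dependencies only. A minimal sketch of checking a zip for such modules (the helper name has_compiled_extensions and the synthetic zip contents are hypothetical):

```python
import os
import tempfile
import zipfile

def has_compiled_extensions(zip_path):
    """Return True if the zip carries C-extension modules (.so/.pyd)
    that zipimport cannot load, making it unusable with --py-files."""
    with zipfile.ZipFile(zip_path) as zf:
        return any(name.endswith((".so", ".pyd")) for name in zf.namelist())

# Demo with two synthetic zips (contents are stand-ins for illustration):
tmpdir = tempfile.mkdtemp()

pure_zip = os.path.join(tmpdir, "pure.zip")
with zipfile.ZipFile(pure_zip, "w") as zf:
    zf.writestr("pkg/__init__.py", "")

binary_zip = os.path.join(tmpdir, "binary.zip")
with zipfile.ZipFile(binary_zip, "w") as zf:
    zf.writestr("numpy/__init__.py", "from . import multiarray")
    zf.writestr("numpy/multiarray.so", b"\x7fELF stand-in bytes")

print(has_compiled_extensions(pure_zip))    # prints False
print(has_compiled_extensions(binary_zip))  # prints True
```

The common workarounds at the time were exactly what the thread suspects: install the compiled dependency on every NodeManager host beforehand (e.g. pip install numpy), and reserve --py-files for pure-Python code.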