You didn't say how you're zipping the dependencies, but I'm guessing you either included .egg files or zipped up a virtualenv. In either case, the compiled C extensions that scipy and pandas rely on either don't get included or can't be loaded from inside the zip.
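If you want to check what actually ended up in the archive, a rough sketch like the one below lists the compiled extension modules inside a zip. The helper name `find_compiled_extensions` is made up for illustration, and the suffix list is an assumption about what your platform produces. Keep in mind that even if the `.so` files show up, Python's zip importer cannot load compiled modules directly from an archive, so their presence alone won't make the imports work.

```python
import zipfile

# Assumed suffixes for compiled extension modules:
# .so on Linux/macOS, .pyd on Windows.
COMPILED_SUFFIXES = (".so", ".pyd")


def find_compiled_extensions(zip_path):
    """Return the sorted names of compiled extension modules in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith(COMPILED_SUFFIXES)
        )


# Hypothetical usage, run wherever dependencies.zip was built:
#   for name in find_compiled_extensions("dependencies.zip"):
#       print(name)
```

An empty result would mean the C extensions never made it into the zip; a non-empty one points at the zip-import limitation instead.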
An approach like this solved the last problem I had that seemed like this - https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html

On Thu, Sep 13, 2018 at 10:08 PM, Jonas Shomorony <js...@stanford.edu> wrote:

> Hey everyone,
>
> I am currently trying to run a Python Spark job (using YARN client mode)
> that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that,
> I create a dependencies.zip file that contains all of the
> dependencies/libraries (installed through pip) for the job to run
> successfully, such as pandas, scipy, tqdm, psycopg2, etc. The
> dependencies.zip file is contained within an outside directory (let’s call
> it “project”) that contains all the code to run my Spark job. I then zip up
> everything within project (including dependencies.zip) into project.zip.
> Then, I call spark-submit on the master node in my EMR cluster as follows:
>
> `spark-submit --packages … --py-files project.zip --jars ...
> run_command.py`
>
> Within “run_command.py” I add dependencies.zip as follows:
>
> `self.spark.sparkContext.addPyFile("dependencies.zip")`
>
> The run_command.py then uses other files within project.zip to complete
> the spark job, and within those files, I import various libraries (found in
> dependencies.zip).
>
> I am running into a strange issue where all of the libraries are imported
> correctly (with no problems) with the exception of scipy and pandas.
>
> For scipy I get the following error:
>
> `File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in
> <module>
>
> File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line
> 1, in <module>
>
> ImportError: cannot import name _ccallback_c`
>
> And for pandas I get this error message:
>
> `File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35,
> in <module>
>
> ImportError: C extension: No module named tslib not built. If you want to
> import pandas from the source directory, you may need to run 'python
> setup.py build_ext --inplace --force' to build the C extensions first.`
>
> When I comment out the imports for these two libraries (and their use from
> within the code) everything works fine.
>
> Surprisingly, when I run the application locally (on master node) without
> passing in dependencies.zip, it picks and resolves the libraries from
> site-packages correctly and successfully runs to completion.
> dependencies.zip is created by zipping the contents of site-packages.
>
> Does anyone have any ideas as to what may be happening here? I would
> really appreciate it.
>
> pip version: 18.0
>
> spark version: 2.3.1
>
> python version: 2.7
>
> Thank you,
>
> Jonas