Re: Python Dependencies Issue on EMR

2018-09-20 Thread Jonas Shomorony
Thanks Patrick. Using a conda virtual environment did help with the
libraries that require compiled C extensions.

Jonas

On Fri, Sep 14, 2018 at 8:02 AM Patrick McCarthy wrote:

> You didn't say how you're zipping the dependencies, but I'm guessing you
> either include .egg files or zipped up a virtualenv. In either case, the
> compiled C extensions that scipy and pandas rely on don't get included.
>
> An approach like this solved a similar problem for me:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html


Re: Python Dependencies Issue on EMR

2018-09-14 Thread Patrick McCarthy
You didn't say how you're zipping the dependencies, but I'm guessing you
either include .egg files or zipped up a virtualenv. In either case, the
compiled C extensions that scipy and pandas rely on don't get included.
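
One quick way to check is to look inside the archive for compiled extension
modules (and note that even when the .so files are present, CPython's
zipimport cannot load them straight out of a zip, which can produce exactly
this kind of ImportError):

    # list any compiled extension modules that made it into the archive
    unzip -l dependencies.zip | grep -E '\.(so|pyd)$'

    # look for the specific module scipy's traceback complains about
    unzip -l dependencies.zip | grep _ccallback_c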

An approach like this solved a similar problem for me:
https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
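
Roughly, the recipe there boils down to the following (a minimal sketch,
assuming a conda install on the master node; the env name deps_env and the
"environment" alias are placeholders, and since --archives only reaches the
YARN containers, cluster mode is the simplest fit):

    # build a conda env containing the compiled packages, then zip it up
    conda create -y -n deps_env python=2.7 numpy scipy pandas
    cd ~/miniconda2/envs && zip -r ~/deps_env.zip deps_env

    # ship the env with the job and point the Python workers at its interpreter
    spark-submit --master yarn --deploy-mode cluster \
      --archives ~/deps_env.zip#environment \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/deps_env/bin/python \
      run_command.py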



Python Dependencies Issue on EMR

2018-09-13 Thread Jonas Shomorony
Hey everyone,


I am currently trying to run a Python Spark job (in YARN client mode) that
uses multiple libraries on a Spark cluster on Amazon EMR. To do that, I
create a dependencies.zip file that contains all of the
dependencies/libraries the job needs to run (installed through pip), such
as pandas, scipy, tqdm, psycopg2, etc. The dependencies.zip file lives
inside an outer directory (let’s call it “project”) that contains all the
code for my Spark job. I then zip up everything within project (including
dependencies.zip) into project.zip. Then I call spark-submit on the master
node of my EMR cluster as follows:


`spark-submit --packages … --py-files project.zip --jars ... run_command.py`
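
(For concreteness, the packaging is two nested zips; roughly, with
placeholder paths, and with dependencies.zip built from site-packages as
noted below:)

    # zip the pip-installed packages (the contents of site-packages)
    cd /path/to/site-packages && zip -r ~/project/dependencies.zip .

    # zip everything under project/ (including dependencies.zip) into project.zip
    cd ~/project && zip -r ../project.zip .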


Within “run_command.py” I add dependencies.zip as follows:

`self.spark.sparkContext.addPyFile("dependencies.zip")`


run_command.py then uses other files within project.zip to complete the
Spark job, and within those files I import various libraries (found in
dependencies.zip).


I am running into a strange issue where all of the libraries import
correctly, with the exception of scipy and pandas.


For scipy I get the following error:


`File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in


  File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line
1, in 

ImportError: cannot import name _ccallback_c`


And for pandas I get this error message:


`File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35, in


ImportError: C extension: No module named tslib not built. If you want to
import pandas from the source directory, you may need to run 'python
setup.py build_ext --inplace --force' to build the C extensions first.`


When I comment out the imports for these two libraries (and their use from
within the code) everything works fine.


Surprisingly, when I run the application locally (on the master node)
without passing in dependencies.zip, it picks up and resolves the libraries
from site-packages correctly and runs to completion. (For reference,
dependencies.zip is created by zipping the contents of site-packages.)


Does anyone have any ideas as to what may be happening here? I would really
appreciate it.


pip version: 18.0

Spark version: 2.3.1

Python version: 2.7


Thank you,


Jonas