At the risk of repeating myself, this is what I was hoping to avoid when I suggested deploying a full, zipped conda venv.
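Concretely, the workflow I have in mind is something like this (a rough sketch, untested in your environment; the env name, the package list and the HDFS path are just placeholders for whatever you actually need):

conda create -y -n py37minimal python=3.7 pandas numpy   # adjust packages to what your job needs
conda pack -n py37minimal -o py37minimal.zip             # needs the conda-pack package installed
hdfs dfs -put py37minimal.zip /user/<your_user>/conda/

You build and validate that env once, on a machine whose OS matches the cluster, and every job afterwards just ships the known-good zip via spark.yarn.dist.archives and points PYSPARK_PYTHON at the relative path inside it, as in the snippet from my earlier mail quoted below. Since conda packages the compiled pieces as well, it should also sidestep the numpy C-extension error you're hitting.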
What is your motivation for running an install process on the nodes and risking the process failing, instead of pushing a validated environment artifact and not having that risk? In either case you move about the same number of bytes around.

On Fri, Dec 18, 2020 at 3:04 PM Sachit Murarka <connectsac...@gmail.com> wrote:

> Hi Patrick/Users,
>
> I am exploring packages in wheel file format for this, as this seems simple:
>
> https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
>
> However, I am facing another issue: I am using pandas, which needs numpy, and numpy is giving an error!
>
> ImportError: Unable to import required dependencies:
> numpy:
>
> IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
>
> Importing the numpy C-extensions failed. This error can happen for
> many reasons, often due to issues with your setup or how NumPy was
> installed.
>
> We have compiled some common reasons and troubleshooting tips at:
>
> https://numpy.org/devdocs/user/troubleshooting-importerror.html
>
> Please note and check the following:
>
> * The Python version is: Python3.7 from "/usr/bin/python3"
> * The NumPy version is: "1.19.4"
>
> and make sure that they are the versions you expect.
> Please carefully study the documentation linked above for further help.
>
> Original error was: No module named 'numpy.core._multiarray_umath'
>
> Kind Regards,
> Sachit Murarka
>
> On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>
>> I'm not very familiar with the environments on cloud clusters, but in general I'd be reluctant to lean on setuptools or other Python install mechanisms. In the worst case, you might find that /usr/bin/pip doesn't have permission to install new packages, or even if it does, a package might require something you can't change, like a libc dependency.
>>
>> Perhaps you can install the .whl and its dependencies into the virtualenv on a local machine, and then *after* the install process, package that venv?
>>
>> If possible, I like conda for this approach over a vanilla venv because it will contain all the non-Python dependencies (like libc) if they're needed.
>>
>> Another thing - I think there are several ways to do this, but I've had the most success including the .zip containing my environment in `spark.yarn.dist.archives` and then using a relative path:
>>
>> os.environ['PYSPARK_PYTHON'] = './py37minimal_env/py37minimal/bin/python'
>>
>> dist_archives = 'hdfs:///user/pmccarthy/conda/py37minimal.zip#py37minimal_env'
>>
>> SparkSession.builder.
>>     ...
>>     .config('spark.yarn.dist.archives', dist_archives)
>>
>> On Thu, Dec 17, 2020 at 10:32 AM Sachit Murarka <connectsac...@gmail.com> wrote:
>>
>>> Hi Users,
>>>
>>> I have a wheel file; while creating it I mentioned the dependencies in the setup.py file.
>>> Now I have 2 virtual envs: one was already there, and another one I created just now.
>>>
>>> I have switched to the new virtual env, and I want Spark to download the dependencies while doing spark-submit using the wheel.
>>>
>>> Could you please help me with this?
>>>
>>> It is not downloading the dependencies; instead it is pointing to the older virtual env and proceeding with the execution of the Spark job.
>>>
>>> Please note I have tried setting the env variables also.
>>> Also, I have tried the following options as well in spark-submit:
>>>
>>> --conf spark.pyspark.virtualenv.enabled=true
>>> --conf spark.pyspark.virtualenv.type=native
>>> --conf spark.pyspark.virtualenv.requirements=requirements.txt
>>> --conf spark.pyspark.python=/path/to/venv/bin/python3
>>> --conf spark.pyspark.driver.python=/path/to/venv/bin/python3
>>>
>>> This did not help either.
>>>
>>> Kind Regards,
>>> Sachit Murarka
>>
>>
>> --
>> *Patrick McCarthy*
>> Senior Data Scientist, Machine Learning Engineering
>> Dstillery
>> 470 Park Ave South, 17th Floor, NYC 10016

--
*Patrick McCarthy*
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016