Hi Patrick/Users,

I am exploring wheel files to package dependencies for this, as the approach described here seems simple:

https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
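
Roughly, what I am trying based on that article looks like the sketch below
(untested, and the wheel name and paths are just placeholders for my real
artifacts):

from pyspark.sql import SparkSession

# Placeholder path: in the real job this is the wheel built from my setup.py.
wheel_path = '/path/to/dist/myjob-0.1-py3-none-any.whl'

spark = (SparkSession.builder
         .appName('wheel-deps-test')
         .getOrCreate())

# Ship the wheel to the executors so its modules can be imported there.
spark.sparkContext.addPyFile(wheel_path)

# Importing pandas pulls in NumPy here.
import pandas as pd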

However, I am facing another issue: I am using pandas, which needs NumPy,
and NumPy is throwing an error:


ImportError: Unable to import required dependencies:
numpy:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.7 from "/usr/bin/python3"
  * The NumPy version is: "1.19.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: No module named 'numpy.core._multiarray_umath'
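
From what I have read, "No module named 'numpy.core._multiarray_umath'"
usually means NumPy's compiled C extensions could not be loaded. This
commonly happens when NumPy ends up on the PYTHONPATH inside a zip/wheel
(Python can import pure-Python modules from an archive, but not native .so
extensions), or when the job runs under a different interpreter/environment
than expected. To see which interpreter and archive paths the executors
actually use, I am planning to run a small diagnostic like this inside the
job (rough sketch; assumes "spark" is the active SparkSession):

import sys

def debug_env(_):
    # Report this executor's interpreter and any archive entries on sys.path.
    return (sys.executable,
            [p for p in sys.path if p.endswith(('.zip', '.whl', '.egg'))])

# Collect the view from a few executor tasks.
print(spark.sparkContext.parallelize(range(4), 4).map(debug_env).collect())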



Kind Regards,
Sachit Murarka


On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com>
wrote:

> I'm not very familiar with the environments on cloud clusters, but in
> general I'd be reluctant to lean on setuptools or other Python install
> mechanisms. In the worst case, you might find that /usr/bin/pip doesn't have
> permission to install new packages, or, even if it does, a package might
> require something you can't change, such as a libc dependency.
>
> Perhaps you can install the .whl and its dependencies to the virtualenv on
> a local machine, and then *after* the install process, package that venv?
>
> If possible, I like conda for this approach over a vanilla venv because it
> will contain all the non-python dependencies (like libc) if they're needed.
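>
> For packaging the conda env, something along these lines might work (an
> untested sketch using conda-pack; the env name is a placeholder):
>
> # Pack an existing conda env into a zip that can later be shipped via
> # spark.yarn.dist.archives. Assumes the conda-pack package is installed.
> import conda_pack
>
> conda_pack.pack(name='py37minimal', output='py37minimal.zip')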
>
>
> Another thing - I think there are several ways to do this, but I've had
> the most success including the .zip containing my environment in
> `spark.yarn.dist.archives` and then using a relative path:
>
> import os
> from pyspark.sql import SparkSession
>
> # Use the interpreter inside the unpacked archive (a path relative to each
> # YARN container's working directory).
> os.environ['PYSPARK_PYTHON'] = './py37minimal_env/py37minimal/bin/python'
>
> dist_archives = 'hdfs:///user/pmccarthy/conda/py37minimal.zip#py37minimal_env'
>
> spark = (SparkSession.builder
>          # ... other builder options ...
>          .config('spark.yarn.dist.archives', dist_archives)
>          .getOrCreate())
>
>
> On Thu, Dec 17, 2020 at 10:32 AM Sachit Murarka <connectsac...@gmail.com>
> wrote:
>
>> Hi Users
>>
>> I have a wheel file; while creating it, I declared its dependencies in the
>> setup.py file.
>> Now I have two virtual envs: one was already there, and the other I created
>> just now.
>>
>> I have switched to the new virtual env, and I want Spark to download the
>> dependencies when I do spark-submit with the wheel.
>>
>> Could you please help me on this?
>>
>> It is not downloading the dependencies; instead it is pointing to the older
>> virtual env and proceeding with the execution of the Spark job.
>>
>> Please note that I have also tried setting the env variables,
>> as well as the following options in spark-submit:
>>
>> --conf spark.pyspark.virtualenv.enabled=true
>> --conf spark.pyspark.virtualenv.type=native
>> --conf spark.pyspark.virtualenv.requirements=requirements.txt
>> --conf spark.pyspark.python=/path/to/venv/bin/python3
>> --conf spark.pyspark.driver.python=/path/to/venv/bin/python3
>>
>> This did not help either.
>>
>> Kind Regards,
>> Sachit Murarka
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>
