At the risk of repeating myself, this is what I was hoping to avoid when I suggested deploying a full, zipped conda venv.
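Concretely, the workflow I have in mind is something like this (a rough sketch, untested in your environment; the env name, the package list and the HDFS path are just placeholders for whatever you actually need):

conda create -y -n py37minimal python=3.7 pandas numpy   # adjust packages to what your job needs
conda pack -n py37minimal -o py37minimal.zip             # needs the conda-pack package installed
hdfs dfs -put py37minimal.zip /user/<your_user>/conda/

You build and validate that env once, on a machine whose OS matches the cluster, and every job afterwards just ships the known-good zip via spark.yarn.dist.archives and points PYSPARK_PYTHON at the relative path inside it, as in the snippet from my earlier mail quoted below. Since conda packages the compiled pieces as well, it should also sidestep the numpy C-extension error you're hitting.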
What is your motivation for running an install process on the nodes and risking the process failing, instead of pushing a validated environment artifact and not having that risk? In either case you move about the same number of bytes around.

On Fri, Dec 18, 2020 at 3:04 PM Sachit Murarka <connectsac...@gmail.com> wrote:

> Hi Patrick/Users,
>
> I am exploring packages in wheel file format for this, as this seems simple:
>
> https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
>
> However, I am facing another issue: I am using pandas, which needs numpy, and numpy is giving an error!
>
> ImportError: Unable to import required dependencies:
> numpy:
>
> IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
>
> Importing the numpy C-extensions failed. This error can happen for
> many reasons, often due to issues with your setup or how NumPy was
> installed.
>
> We have compiled some common reasons and troubleshooting tips at:
>
> https://numpy.org/devdocs/user/troubleshooting-importerror.html
>
> Please note and check the following:
>
> * The Python version is: Python3.7 from "/usr/bin/python3"
> * The NumPy version is: "1.19.4"
>
> and make sure that they are the versions you expect.
> Please carefully study the documentation linked above for further help.
>
> Original error was: No module named 'numpy.core._multiarray_umath'
>
> Kind Regards,
> Sachit Murarka
>
> On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>
>> I'm not very familiar with the environments on cloud clusters, but in general I'd be reluctant to lean on setuptools or other Python install mechanisms. In the worst case, you might find that /usr/bin/pip doesn't have permission to install new packages, or even if it does, a package might require something you can't change, like a libc dependency.
>>
>> Perhaps you can install the .whl and its dependencies into the virtualenv on a local machine, and then *after* the install process, package that venv?
>>
>> If possible, I like conda for this approach over a vanilla venv because it will contain all the non-Python dependencies (like libc) if they're needed.
>>
>> Another thing - I think there are several ways to do this, but I've had the most success including the .zip containing my environment in `spark.yarn.dist.archives` and then using a relative path:
>>
>> os.environ['PYSPARK_PYTHON'] = './py37minimal_env/py37minimal/bin/python'
>>
>> dist_archives = 'hdfs:///user/pmccarthy/conda/py37minimal.zip#py37minimal_env'
>>
>> SparkSession.builder.
>>     ...
>>     .config('spark.yarn.dist.archives', dist_archives)
>>
>> On Thu, Dec 17, 2020 at 10:32 AM Sachit Murarka <connectsac...@gmail.com> wrote:
>>
>>> Hi Users,
>>>
>>> I have a wheel file; while creating it I mentioned the dependencies in the setup.py file.
>>> Now I have 2 virtual envs: one was already there, and another one I created just now.
>>>
>>> I have switched to the new virtual env, and I want Spark to download the dependencies while doing spark-submit using the wheel.
>>>
>>> Could you please help me with this?
>>>
>>> It is not downloading the dependencies; instead it is pointing to the older virtual env and proceeding with the execution of the Spark job.
>>>
>>> Please note I have tried setting the env variables also.
>>> Also, I have tried the following options as well in spark-submit:
>>>
>>> --conf spark.pyspark.virtualenv.enabled=true
>>> --conf spark.pyspark.virtualenv.type=native
>>> --conf spark.pyspark.virtualenv.requirements=requirements.txt
>>> --conf spark.pyspark.python=/path/to/venv/bin/python3
>>> --conf spark.pyspark.driver.python=/path/to/venv/bin/python3
>>>
>>> This did not help either.
>>>
>>> Kind Regards,
>>> Sachit Murarka
>>
>>
>> --
>> *Patrick McCarthy*
>> Senior Data Scientist, Machine Learning Engineering
>> Dstillery
>> 470 Park Ave South, 17th Floor, NYC 10016

--
*Patrick McCarthy*
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016