I've found that Anaconda encapsulates modules and their dependencies nicely, and you can deploy all the needed .so files by shipping a whole conda environment.
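For reference, here is a minimal sketch of that kind of deployment, assuming a YARN cluster and the conda-pack tool; the environment name, archive name, and package list are illustrative, not taken from the article:

# On a gateway/driver node, first pack the environment (shell commands):
#   conda create -y -n my_env python=3.7 numpy pandas
#   conda pack -n my_env -o my_env.tar.gz
import os
from pyspark.sql import SparkSession

# Executors should use the interpreter from the unpacked archive, which
# Spark exposes under the alias given after the '#'.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # Ship the packed conda env to every node; on YARN the key is
    # spark.yarn.dist.archives (Spark 3.1+ also accepts spark.archives).
    .config("spark.yarn.dist.archives", "my_env.tar.gz#environment")
    .getOrCreate()
)

Since the whole environment travels with the job, compiled .so dependencies resolve normally on the executors.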
I've used this method with success:
https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551

On Sat, Jun 6, 2020 at 4:16 PM Anwar AliKhan <anwaralikhan...@gmail.com> wrote:

> "
> Have you looked into this article?
> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
> "
>
> This is weird!
> I was hanging out at https://machinelearningmastery.com/start-here/ when I
> came across this post.
>
> The weird part is that I was just wondering how I could take one of the
> projects (OpenAI Gym Taxi-v2, in Python), a project I want to develop
> further.
>
> I want to run it on Spark, using Spark's parallelism features and GPU
> capabilities when I am using bigger datasets, with the workers (slaves)
> doing the sliced-dataset computations on the new 8 GB RAM Raspberry Pi
> (Linux).
>
> Are there any other documents on the official website, or anywhere else,
> that show how to do that, preferably with full, self-contained examples?
>
> On Fri, 5 Jun 2020, 09:02 Dark Crusader, <relinquisheddra...@gmail.com> wrote:
>
>> Hi Stone,
>>
>> I haven't tried it with .so files, but I did use the approach he
>> recommends to install my other dependencies.
>> I hope it helps.
>>
>> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My PySpark app depends on some Python libraries. At first that was not
>>> a problem: I packed all the dependencies into a file, libs.zip, called
>>> sc.addPyFile("libs.zip"), and it worked well for a while.
>>>
>>> Then I encountered a problem: if any of my libraries has a binary
>>> dependency (such as a .so file), this approach does not work, mainly
>>> because when PYTHONPATH points to a zip file, Python does not load
>>> binary libraries (e.g. .so files) from inside the zip. This is a Python
>>> limitation. So I use a workaround:
>>>
>>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>>> directory.
>>> 2) When my Python code starts, manually call
>>> sys.path.insert(0, f"{os.getcwd()}/libs") to put it on the import path.
>>>
>>> This workaround works well for me. Then I hit another problem: what if
>>> code running in an executor needs a Python library that has binary
>>> code? Below is an example:
>>>
>>> def do_something(p):
>>>     ...
>>>
>>> rdd = sc.parallelize([
>>>     {"x": 1, "y": 2},
>>>     {"x": 2, "y": 3},
>>>     {"x": 3, "y": 4},
>>> ])
>>> a = rdd.map(do_something)
>>>
>>> What if the function do_something needs a Python library that has
>>> binary code? My current solution is to extract libs.zip onto an NFS
>>> share (or an SMB share) and manually call
>>> sys.path.insert(0, f"{share_mount_dir}/libs") in my do_something
>>> function (a sketch of this appears after the thread below), but adding
>>> such code to every function looks ugly. Is there a better, more elegant
>>> solution?
>>>
>>> Thanks,
>>> Stone

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
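For completeness, here is a minimal sketch of the NFS/SMB workaround Stone describes above. The mount point and the imported library are illustrative placeholders, not names from the thread, and sc is the SparkContext from his example:

import sys

SHARE_MOUNT_DIR = "/mnt/libs_share"  # illustrative NFS/SMB mount point

def do_something(p):
    # The import happens in the executor process, not on the driver, so
    # each task has to fix up sys.path before importing the binary-backed
    # library from the extracted libs.zip on the share.
    libs = f"{SHARE_MOUNT_DIR}/libs"
    if libs not in sys.path:
        sys.path.insert(0, libs)
    import some_native_lib  # hypothetical library that ships .so files
    return some_native_lib.transform(p["x"], p["y"])

a = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
]).map(do_something)

The conda-environment deployment sketched at the top avoids this per-function boilerplate, since every executor already has the full environment, binaries included, on its path.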