Hi Stone,

Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
I haven't tried it with .so files; however, I did use the approach he recommends to install my other dependencies.

I hope it helps.

On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:

> Hi,
>
> So my pyspark app depends on some Python libraries. That is not a problem: I
> pack all the dependencies into a file libs.zip and then call
> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
>
> Then I encountered a problem: if any of my libraries has a binary file
> dependency (like .so files), this approach does not work. Mainly because
> when you set PYTHONPATH to a zip file, Python does not look up the needed
> binary library (e.g. a .so file) inside the zip file; this is a Python
> *limitation*. So I got a workaround:
>
> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
> directory
> 2) When my Python code starts, manually call
> *sys.path.insert(0, f"{os.getcwd()}/libs")* to set PYTHONPATH
>
> This workaround works well for me. Then I got another problem: what if my
> code in the executor needs a Python library that has binary code? Below is an
> example:
>
> def do_something(p):
>     ...
>
> rdd = sc.parallelize([
>     {"x": 1, "y": 2},
>     {"x": 2, "y": 3},
>     {"x": 3, "y": 4},
> ])
> a = rdd.map(do_something)
>
> What if the function "do_something" needs a Python library that has
> binary code? My current solution is to extract libs.zip into an NFS share (or
> an SMB share) and manually do *sys.path.insert(0, f"share_mount_dir/libs")*
> in my "do_something" function, but adding such code in each function looks
> ugly. Is there any better/more elegant solution?
>
> Thanks,
> Stone
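
On the last question (repeating the sys.path line in every function), one option might be to move it into a small decorator. This is only a rough sketch and is not taken from the article: the names `with_libs` and `LIBS_DIR` are made up for illustration, and it assumes the extracted libs directory is reachable at the same path on every executor, as with the NFS share you describe.

```python
import functools
import sys

# Hypothetical location; assumes libs.zip was extracted here and that the
# directory is mounted at the same path on every executor.
LIBS_DIR = "share_mount_dir/libs"


def with_libs(path=LIBS_DIR):
    """Decorator that puts `path` on sys.path before the function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # This runs in the process that calls the function (the
            # executor's Python worker when used inside rdd.map), so the
            # path is set where the import actually happens.
            if path not in sys.path:
                sys.path.insert(0, path)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_libs()
def do_something(p):
    # Import any library with binary (.so) parts here, inside the
    # function, so the import happens after sys.path has been updated.
    return p["x"] + p["y"]
```

The call site stays the same as in your example, i.e. `a = rdd.map(do_something)`. The imports of the binary libraries would need to live inside the decorated function (or be triggered from it), since a module-level import would run before the path is added.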