Hi Stone,

Have you looked at this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987

I haven't tried it with .so files; however, I did use the approach he
recommends to install my other dependencies.
I hope it helps.

On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:

> Hi,
>
> My PySpark app depends on some Python libraries. That by itself is not
> a problem: I pack all the dependencies into a file libs.zip, then call
> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
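>
> For reference, the pattern is roughly the following (the library name
> in the final import is just an illustration):
>
> from pyspark import SparkContext
>
> sc = SparkContext(appName="my_app")
> # Ship the zipped pure-Python dependencies to driver and executors;
> # Spark makes the zip importable on PYTHONPATH automatically.
> sc.addPyFile("libs.zip")
>
> import some_pure_python_lib  # hypothetical, resolved from libs.zip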
>
> Then I hit a problem: if any of my libraries has a binary dependency
> (such as .so files), this approach does not work. The main reason is
> that when a zip file is on PYTHONPATH, Python cannot load a needed
> binary library (e.g. a .so file) from inside the zip; this is a Python
> *limitation*. So I use the following workaround:
>
> 1) Do not call sc.addPyFile; instead, extract libs.zip into the
> current directory.
> 2) When my Python code starts, manually call *sys.path.insert(0,
> f"{os.getcwd()}/libs")* to put the extracted directory on the module
> search path (see the sketch below).
>
> This workaround works well for me. Then I hit another problem: what if
> my code running on an executor needs a Python library that has binary
> code? Below is an example:
>
> def do_something(p):
>     # needs a library that has binary (.so) code
>     ...
>
> rdd = sc.parallelize([
>     {"x": 1, "y": 2},
>     {"x": 2, "y": 3},
>     {"x": 3, "y": 4},
> ])
> a = rdd.map(do_something)
>
> What if the function "do_something" needs a Python library that has
> binary code? My current solution is to extract libs.zip onto an NFS
> share (or an SMB share) and manually call *sys.path.insert(0,
> f"{share_mount_dir}/libs")* in my "do_something" function (sketched
> below), but adding such code to every function looks ugly. Is there a
> better, more elegant solution?
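>
> For concreteness, this is roughly what I have now (the mount point and
> library name are illustrative):
>
> SHARE_MOUNT_DIR = "/mnt/libs_share"  # illustrative NFS/SMB mount point
>
> def do_something(p):
>     import sys
>     # Repeated in every function that runs on an executor; this
>     # repetition is the ugly part.
>     libs_dir = f"{SHARE_MOUNT_DIR}/libs"
>     if libs_dir not in sys.path:
>         sys.path.insert(0, libs_dir)
>     import native_lib  # hypothetical library with compiled .so parts
>     return native_lib.process(p)  # hypothetical call into that library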
>
> Thanks,
> Stone
>
>
