Great, thank you Masood, will look into it.

Regards,
Stone
On Fri, Jun 5, 2020 at 7:47 PM Masood Krohy <masood.krohy@analytical.works> wrote:

> Not totally sure it's gonna help your use case, but I'd recommend that you
> consider these too:
>
> - pex <https://github.com/pantsbuild/pex>: a library and tool for
>   generating .pex (Python EXecutable) files
> - cluster-pack <https://github.com/criteo/cluster-pack>: a library on top
>   of either pex or conda-pack to make your Python code easily available
>   on a cluster.
>
> Masood
>
> __________________
>
> Masood Krohy, Ph.D.
> Data Science Advisor | Platform Architect
> https://www.analytical.works
>
> On 6/5/20 4:29 AM, Stone Zhong wrote:
>
> Thanks Dark. I looked at that article; I think it describes approach B.
> Let me summarize both approaches:
>
> A) Put the libraries on a network share, mount it on each node, and in
> your code manually set PYTHONPATH.
> B) In your code, manually install the necessary packages using "pip
> install -r <temp_dir>".
>
> I think approach B is very similar to approach A; both have pros and
> cons. With B, your cluster needs to have internet access (in my case, our
> cluster runs in an isolated environment for security reasons), but you
> can set up a private pip server and stage the needed packages there. With
> A, you need admin permission to be able to mount the network share, which
> is also a devops burden.
>
> I am wondering if Spark could add a new API to tackle this scenario
> instead of these workarounds, which I suppose would be cleaner and more
> elegant.
>
> Regards,
> Stone
>
> On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader <relinquisheddra...@gmail.com>
> wrote:
>
>> Hi Stone,
>>
>> Have you looked into this article?
>> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
>>
>> I haven't tried it with .so files; however, I did use the approach he
>> recommends to install my other dependencies. I hope it helps.
>>
>> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My pyspark app depends on some Python libraries. That was not a
>>> problem: I pack all the dependencies into a file libs.zip, then call
>>> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
>>>
>>> Then I encountered a problem: if any of my libraries has a binary file
>>> dependency (like a .so file), this approach does not work, mainly
>>> because when you set PYTHONPATH to a zip file, Python does not look up
>>> a needed binary library (e.g. a .so file) inside the zip file; this is
>>> a Python *limitation*. So I use a workaround:
>>>
>>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>>> directory.
>>> 2) When my Python code starts, manually call *sys.path.insert(0,
>>> f"{os.getcwd()}/libs")* to set PYTHONPATH.
>>>
>>> This workaround works well for me. Then I hit another problem: what if
>>> my code in the executor needs a Python library that has binary code?
>>> Below is an example:
>>>
>>> def do_something(p):
>>>     ...
>>>
>>> rdd = sc.parallelize([
>>>     {"x": 1, "y": 2},
>>>     {"x": 2, "y": 3},
>>>     {"x": 3, "y": 4},
>>> ])
>>> a = rdd.map(do_something)
>>>
>>> What if the function "do_something" needs a Python library that has
>>> binary code? My current solution is to extract libs.zip onto an NFS
>>> share (or an SMB share) and manually do *sys.path.insert(0,
>>> f"share_mount_dir/libs")* in my "do_something" function, but adding
>>> such code in each function looks ugly. Is there any better/elegant
>>> solution?
>>>
>>> Thanks,
>>> Stone
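
For reference, a minimal sketch of the driver-side workaround Stone
describes above (extract libs.zip, then put the extracted directory on
sys.path). The file name libs.zip comes from the thread; extracting into a
"libs" subdirectory is an assumption:

    import os
    import sys
    import zipfile

    # Extract the packed dependencies into a real directory instead of
    # calling sc.addPyFile("libs.zip"), so .so files can be loaded.
    with zipfile.ZipFile("libs.zip") as zf:
        zf.extractall("libs")  # assumed target directory

    # Prepend the extracted directory so imports (including modules backed
    # by binary .so files) resolve from the filesystem, not from the zip.
    sys.path.insert(0, os.path.join(os.getcwd(), "libs"))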
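
One way to address the closing question without repeating the sys.path
tweak inside every mapped function is to hoist it into a small decorator
that performs the path setup (idempotently) before each call on the
executor. This is only a sketch: /mnt/share/libs is a hypothetical stand-in
for the NFS/SMB mount point mentioned in the thread:

    import functools
    import sys

    SHARED_LIBS = "/mnt/share/libs"  # hypothetical mount, visible on every node

    def with_shared_libs(fn):
        """Ensure the shared libs directory is importable before fn runs."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if SHARED_LIBS not in sys.path:
                sys.path.insert(0, SHARED_LIBS)
            return fn(*args, **kwargs)
        return wrapper

    @with_shared_libs
    def do_something(p):
        ...  # safe to import binary-backed packages here

    # Used exactly like the example in the thread (sc as in Stone's mail):
    a = sc.parallelize([{"x": 1, "y": 2}, {"x": 2, "y": 3}]).map(do_something)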
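
Similarly, a hedged sketch of approach B from Stone's summary: installing
staged packages at application start from a private index, so the isolated
cluster never needs public internet access. The index URL and requirements
path below are placeholders, not details from the thread:

    import subprocess
    import sys

    # Install dependencies from an internal pip server; runs once at startup.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--index-url", "http://pip.internal.example/simple",  # hypothetical
        "-r", "/tmp/requirements.txt",                        # hypothetical
    ])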