[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747291#comment-15747291 ]
Prasanna Santhanam commented on SPARK-13587:
--------------------------------------------

[~nchammas] Sorry, this got buried under several other emails at my org. What you've implemented as a shell installer is exactly what I've done, except inside Spark code branched off the 2.0.1 release. I use the YARN archives mechanism to distribute the zip files and control the conda environment binaries. It should be straightforward to change my diff to work with virtualenv as well. As you've explained, the advantage of this approach is that Python doesn't need to be installed on the worker nodes at all.

I've also implemented the mechanism with {{--py-files}} so that standalone Spark can take advantage of it, but I haven't gotten around to testing that yet.

The downside of the zip-distribution approach, however, is that application start time increases significantly: nearly 4 minutes just to zip all the libraries. I tested this on a 16-core machine with 30 GB of memory, and zipping the binaries produces a 400 MB archive for just basic libraries like {{matplotlib}}, {{scipy}}, and {{numpy}}. What zip times did you experience? Much to my surprise, even compared with [~zjffdu]'s original proposal of downloading the dependencies on every worker, this eats up significant time in a Spark program that itself runs for no more than 2 seconds. That has kept me from pushing the implementation further. I'd like to hear your observations from testing your shell implementation of the same.

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming,
> and not easy when switching between environments)
>
> Python now has two different virtualenv implementations: one is the native
> virtualenv, the other is conda. This JIRA aims to bring these two tools to
> the distributed environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
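The multi-minute zip time discussed in the comment is often dominated by DEFLATE-compressing large binary files (shared objects, compiled wheels) that compress poorly anyway. One mitigation worth measuring is building the environment archive with {{ZIP_STORED}} (no compression), trading archive size for packing speed. A minimal sketch of that idea; the {{zip_env}} helper and the paths are hypothetical, not part of Spark or the patch described above:

```python
import os
import zipfile

def zip_env(src_dir, dest_zip, compression=zipfile.ZIP_STORED):
    """Archive a Python environment directory into dest_zip.

    ZIP_STORED skips compression entirely, which is typically much
    faster for directories full of .so files and compiled wheels;
    pass zipfile.ZIP_DEFLATED instead to compare a compressed build.
    dest_zip should live outside src_dir so the archive does not
    end up including itself.
    """
    with zipfile.ZipFile(dest_zip, "w", compression=compression) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the environment root so the
                # archive unpacks as bin/python, lib/..., etc.
                zf.write(path, os.path.relpath(path, src_dir))

# Example: zip_env("pyspark_env", "/tmp/pyspark_env.zip"), then ship
# the archive through YARN's archives mechanism, e.g.
#   spark-submit --archives pyspark_env.zip#environment ...
```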