[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747291#comment-15747291 ]

Prasanna Santhanam commented on SPARK-13587:
--------------------------------------------

[~nchammas] sorry, this got buried in several other emails at my org. 

What you've implemented as a shell installer is exactly what I've done, except 
within Spark code branched off the 2.0.1 release. I use the YARN archives 
mechanism to distribute the zipped conda environment and to control which 
Python binaries the workers run. It should be straightforward to change my 
diff to work with virtualenv as well. As you've explained, the advantage of 
this approach is that Python doesn't need to be installed on the worker nodes 
at all. I've also implemented the mechanism with {{--py-files}} so that 
standalone Spark can take advantage of it, but I haven't gotten around to 
testing that yet.
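
For concreteness, the flow I'm describing is roughly the following; the 
environment name, library set, and paths are illustrative, and the flags are 
the standard YARN ones from the Spark 2.x docs:

{code}
# Build a conda environment containing the required libraries.
conda create -y -p ./pyspark_env python=2.7 numpy scipy matplotlib

# Zip the environment so YARN can ship it to the workers.
(cd pyspark_env && zip -qr ../pyspark_env.zip .)

# Distribute the archive via the YARN archives mechanism; '#env' is the
# alias the archive is unpacked under in each container's working directory.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives pyspark_env.zip#env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./env/bin/python \
  my_app.py
{code}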

The downside of the zip-distribution solution, however, is that application 
start time increases significantly: it takes nearly 4 minutes just to zip all 
the libraries. I tested this on a 16-core machine with 30GB of memory, and 
zipping the binaries produces a roughly 400MB archive for just basic libraries 
like {{matplotlib}}, {{scipy}}, and {{numpy}}. What zip times did you 
experience?
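
For reference, the numbers above came from timing just the packaging step, 
roughly like this (same illustrative environment as the sketch above):

{code}
# Time only the zip step; on my 16-core / 30GB machine this took close to
# 4 minutes and produced a ~400MB archive for matplotlib, scipy, and numpy.
time (cd pyspark_env && zip -qr ../pyspark_env.zip .)

# Check the resulting archive size.
du -h pyspark_env.zip
{code}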

Much to my surprise, even compared with [~zjffdu]'s original proposal of 
downloading the dependencies on all the workers, this eats up significant time 
for a Spark program that itself runs for no more than 2s. That overhead is 
what kept me from pushing the implementation further. I'd like to hear your 
observations from testing your shell implementation of the same flow.

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and it makes it hard to switch between different environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to the distributed environment.


