[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174996#comment-15174996 ]

Mike Sukmanowsky commented on SPARK-13587:
------------------------------------------

Thanks for letting me know about this, [~jeffzhang].

In general, I'm +1 on the proposal.

virtualenvs are the way to go to install requirements and ensure isolation of 
dependencies between multiple driver scripts. As you noted, though, installing 
hefty requirements like pandas or numpy (assuming you aren't using Conda) 
would add a pretty significant overhead to startup, which could be amortized 
if the driver is assumed to run for a long enough period of time. Conda, of 
course, would pretty well eliminate that problem since it provides 
pre-compiled binaries for most OSes.
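
To make the startup cost concrete, here's a minimal sketch of the kind of 
per-executor bootstrap I have in mind ({{bootstrap_env}} and the plain 
virtualenv+pip flow are my assumptions, not the actual patch):

{code}
import subprocess

def bootstrap_env(env_dir, requirements_file):
    # Create an isolated environment for this application's Python workers.
    subprocess.check_call(["virtualenv", env_dir])
    # This pip step is where the startup overhead lives: packages like
    # numpy or pandas may compile C extensions from source. A conda-based
    # environment would install pre-compiled binaries and largely skip it.
    subprocess.check_call(
        [env_dir + "/bin/pip", "install", "-r", requirements_file])
    return env_dir + "/bin/python"
{code}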

I'd like to offer [PEX|https://pex.readthedocs.org/en/stable/] as an 
alternative, where spark-submit would build a self-contained virtualenv as a 
.pex file on the Spark master node and then distribute it to all other nodes. 
However, PEX doesn't support editable requirements, and it introduces the 
assumption that all nodes in a cluster are homogeneous, so that a Python 
package with C extensions compiled on the master node would run on worker 
nodes without issue. That latter assumption may be a leap too far for some 
Spark users.
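
For reference, the master-side build step is essentially a one-liner with the 
pex CLI; a sketch of how spark-submit might drive it (the helper and the 
requirement names are just illustrative):

{code}
import subprocess

def build_pex(requirements, out_path="deps.pex"):
    # pex resolves the requirements and bundles them into one
    # self-contained file. Any C extensions are compiled here, on the
    # master, which is exactly where the homogeneous-cluster assumption
    # creeps in.
    subprocess.check_call(["pex"] + list(requirements) + ["-o", out_path])
    return out_path

build_pex(["pandas", "numpy"])
{code}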

One thing I'm not entirely sure of is the need for the 
spark.pyspark.virtualenv.path property. If the virtualenv is temporary, why 
would this path ever need to be specified? Couldn't a temporary path simply 
be used and then removed after the Python worker completes?
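
Concretely, I'd have expected a worker-side lifecycle closer to this sketch 
(my assumption of how it could work, not the proposal's code):

{code}
import shutil, subprocess, tempfile

# No user-facing path needed: a throwaway directory per worker.
env_dir = tempfile.mkdtemp(prefix="pyspark-venv-")
try:
    subprocess.check_call(["virtualenv", env_dir])
    subprocess.check_call(
        [env_dir + "/bin/pip", "install", "-r", "requirements.txt"])
    # ... launch the Python worker with env_dir + "/bin/python" ...
finally:
    shutil.rmtree(env_dir)  # cleaned up once the worker completes
{code}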

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and it's not easy to switch between different environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment.


