[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747381#comment-15747381
 ] 

Prasanna Santhanam commented on SPARK-13587:
--------------------------------------------

[~zjffdu] In the case of Anaconda Python the environment is self-contained. The 
conda environment is managed using binaries that are hardlinked, unlike in 
virtualenv, so zipping the conda environment captures the entire Python 
installation. After that, Spark only needs to know that the Python binary it 
should use lives at a path relative to the distributed archives. Hence I was 
able to run my Spark application without installing Anaconda on any of my workers.
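
For concreteness, this is roughly how I launch the application (a minimal 
sketch, assuming YARN; the env name, zip path and the "conda_env" alias are 
placeholders I picked for illustration):

{code:python}
from pyspark import SparkConf, SparkContext

# Assumption: the conda env was zipped beforehand on the gateway node, e.g.
#   cd ~/anaconda3/envs && zip -r /tmp/myenv.zip myenv
# All names and paths below are placeholders.
conf = (SparkConf()
        .setAppName("conda-env-distribution-sketch")
        # Ship the zipped env to every container; the "#conda_env" suffix is
        # the directory name the archive is unpacked under on the workers.
        .set("spark.yarn.dist.archives", "/tmp/myenv.zip#conda_env")
        # Point the executors at the Python binary inside the unpacked archive,
        # i.e. a path relative to the container's working directory, so nothing
        # has to be installed on the workers themselves.
        .set("spark.executorEnv.PYSPARK_PYTHON", "./conda_env/myenv/bin/python"))

sc = SparkContext(conf=conf)
# Packages baked into the env (numpy, pandas, ...) are importable in executor code.
print(sc.parallelize(range(4), 4).map(lambda x: x * x).collect())
sc.stop()
{code}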

Let's give these mechanisms names to make them easier to compare. On the one 
hand we have the "Library Distribution Mechanism" and on the other the 
"Library Download Mechanism". Right now, the overhead is greater for the 
distribution mechanism because zipping the binaries eats up significant time 
before application startup. This was my observation on a 16-core machine; YMMV. 
The distribution mechanism can, however, be improved with some basic caching on 
the gateway node, so that subsequent zip operations for a different Spark 
application with similar library requirements are faster.
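
By caching I mean something along these lines on the gateway node (purely 
illustrative; the names and cache location are made up, and a real version 
would need locking and eviction):

{code:python}
import hashlib
import os
import subprocess

CACHE_DIR = os.path.expanduser("~/.cache/spark-conda-zips")  # hypothetical location

def cached_env_zip(env_name, env_root=os.path.expanduser("~/anaconda3/envs")):
    """Zip a conda env, reusing an earlier zip when the package list is unchanged."""
    # Key the cache on the exported package list, so two applications with the
    # same library requirements end up sharing one zip.
    spec = subprocess.check_output(["conda", "list", "--export", "-n", env_name])
    key = hashlib.sha256(spec).hexdigest()[:16]
    os.makedirs(CACHE_DIR, exist_ok=True)
    zip_path = os.path.join(CACHE_DIR, "{}-{}.zip".format(env_name, key))
    if not os.path.exists(zip_path):
        # The expensive zip step is only paid on a cache miss.
        subprocess.check_call(["zip", "-qr", zip_path, env_name], cwd=env_root)
    return zip_path
{code}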

On the other hand, with the library download mechanism the downloads are very 
fast and can even be cached locally or proxied through a locally managed 
egg/PyPI repository or conda channel. So the download mechanism is still 
superior, save for some complexity in the implementation and in the options 
that have to be specified. However, I'm not sure whether downloads will be 
throttled on publicly exposed repositories like PyPI when, say, a 1000-node 
Spark cluster simultaneously requests Python packages from all its workers.
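
Concretely, the download mechanism amounts to each worker doing something like 
the following at startup, ideally against an internal mirror rather than the 
public index (only a sketch of the idea, not the actual patch; the mirror URL 
and package pins are invented):

{code:python}
import os
import subprocess
import sys

def bootstrap_virtualenv(env_dir, requirements, index_url=None):
    """Create a virtualenv on the worker and pip-install the requirements.

    index_url can point at an internally managed PyPI mirror/proxy so that a
    large cluster neither hammers nor gets throttled by the public index.
    """
    subprocess.check_call([sys.executable, "-m", "virtualenv", env_dir])
    cmd = [os.path.join(env_dir, "bin", "pip"), "install"]
    if index_url:
        cmd += ["--index-url", index_url]  # e.g. an internal devpi or Artifactory mirror
    subprocess.check_call(cmd + list(requirements))
    return os.path.join(env_dir, "bin", "python")

# Hypothetical use on an executor, before any Python tasks run:
# bootstrap_virtualenv("/tmp/pyspark_env", ["numpy==1.11.2", "pandas"],
#                      index_url="http://pypi.internal.example/simple")
{code}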



> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it is not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy to switch between different environments)
> Python now has two different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these two 
> tools to the distributed environment.



