[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747381#comment-15747381 ]
Prasanna Santhanam commented on SPARK-13587:
--------------------------------------------

[~zjffdu] In the case of Anaconda Python, the environment is self-contained. The conda environment is managed using binaries that are hardlinked, unlike in virtualenv, so zipping the conda environment zips the entire Python system together. After that, Spark only needs to know that the Python binary it should use is at a path relative to the archives that were distributed. Hence I was able to run my Spark application without installing Anaconda on any of my workers.

Let's give these mechanisms names so they are easier to compare: on the one hand we have the "Library Distribution Mechanism" and on the other the "Library Download Mechanism".

Right now, the overhead is greater in the case of the distribution mechanism because zipping the binaries eats up significant time before application startup. This was my observation on a 16-core machine; YMMV. The library distribution mechanism can, however, be improved with some basic caching on the gateway node, so that subsequent zip operations for a different Spark application with similar library requirements are faster.

In the library download mechanism, on the other hand, the downloads are very fast and can even be cached locally or proxied to a locally managed egg/PyPI repository/conda channel. So the download mechanism is still superior, save for some complexity in the implementation and the options that must be specified. However, I'm not sure whether downloads will be throttled on publicly exposed repositories like PyPI when, say, a 1000-node Spark cluster simultaneously requests Python packages from all its workers.
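The gateway-node caching idea could look something like the minimal sketch below: key the zipped environment by a hash of the requirement set, so a later application with the same dependencies skips the expensive zip step. Function names and the cache layout are illustrative only, not part of Spark or conda.

```python
import hashlib
import os
import zipfile


def env_cache_key(requirements):
    """Hash the sorted requirement strings so identical dependency
    sets from different applications map to the same archive."""
    h = hashlib.sha256()
    for req in sorted(requirements):
        h.update(req.encode("utf-8"))
    return h.hexdigest()[:16]


def cached_env_archive(requirements, env_dir, cache_dir):
    """Zip env_dir once per unique requirement set; on later submits
    with the same requirements, reuse the cached archive.

    Returns (archive_path, cache_hit).
    """
    os.makedirs(cache_dir, exist_ok=True)
    archive = os.path.join(cache_dir, "env-%s.zip" % env_cache_key(requirements))
    if os.path.exists(archive):
        return archive, True  # cache hit: skip the expensive zip step
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(env_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, env_dir))
    return archive, False
```

The resulting archive would then be shipped as usual (e.g. via spark-submit's --archives option), with the workers' Python binary resolved relative to the unpacked archive.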
> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party python packages in
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not
> suitable for complicated dependencies, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations. One is native
> virtualenv, another is through conda. This jira is trying to bring these 2
> tools to the distributed environment

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)