[ 
https://issues.apache.org/jira/browse/SPARK-20001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20001:
-------------------------------
    Description: 
Similar to SPARK-13587, I'm trying to allow the user to configure a Conda 
environment that PythonRunner will run from. 
This change remembers the conda environment found on the driver and installs the 
same packages on the executor side, only once per PythonWorkerFactory. The list 
of requested conda packages is added to the PythonWorkerFactory cache, so two 
collects using the same environment (including packages) can reuse the same 
running executors.
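The caching behavior described above can be sketched roughly as follows. This is a minimal illustration only, not the actual PythonWorkerFactory code; the `FakeWorkerFactory` class and `get_worker_factory` helper are hypothetical stand-ins:

```python
# Sketch: cache worker factories keyed by the requested conda environment
# spec, so jobs requesting the same packages/channels reuse the same
# (already bootstrapped) workers instead of re-installing the environment.

class FakeWorkerFactory:
    """Stand-in for PythonWorkerFactory; just records its env spec."""
    def __init__(self, packages, channels):
        self.packages = frozenset(packages)
        self.channels = tuple(channels)

_factory_cache = {}

def get_worker_factory(packages, channels):
    # Key on the full environment spec: identical specs share a factory.
    key = (frozenset(packages), tuple(channels))
    if key not in _factory_cache:
        _factory_cache[key] = FakeWorkerFactory(packages, channels)
    return _factory_cache[key]
```

Under this scheme, two collects that request the same packages and channels resolve to the same cached factory, while any change to the spec produces a fresh one.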

You have to specify up front which packages and channels to "bootstrap" the 
environment with. 

However, SparkContext (as well as JavaSparkContext and the pyspark version) is 
extended to support addCondaPackage and addCondaChannel.
The rationale is:
* you might want to add more packages once you're already running in the driver
* you might want to add a channel which requires some token for authentication, 
which you don't have access to until the module is already running
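As a rough illustration of the intended addCondaPackage/addCondaChannel semantics (a sketch only; `CondaSupportingContext` is a toy stand-in, not the real SparkContext API):

```python
class CondaSupportingContext:
    """Toy stand-in for a SparkContext extended with conda support."""

    def __init__(self, bootstrap_packages=(), bootstrap_channels=()):
        # Packages/channels specified up front ("bootstrapping" the env).
        self.conda_packages = list(bootstrap_packages)
        self.conda_channels = list(bootstrap_channels)

    def addCondaPackage(self, package):
        # Add a package after the driver is already running.
        self.conda_packages.append(package)

    def addCondaChannel(self, channel):
        # e.g. a channel URL containing an auth token obtained at runtime.
        self.conda_channels.append(channel)
```

For example, a driver program could bootstrap with `python=3.6`, then call `addCondaChannel(...)` once it has acquired an authentication token for a private channel, and `addCondaPackage("pandas")` for a package decided at runtime.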

This issue requires that the conda binary is already available on the driver as 
well as the executors; you just have to specify where it can be found.

Please see the attached pull request on palantir/spark for additional details: 
https://github.com/palantir/spark/pull/115

As for tests, there is a local Python test, as well as YARN client- and 
cluster-mode tests, which ensure that a newly installed package is visible from 
both the driver and the executor.



> Support PythonRunner executing inside a Conda env
> -------------------------------------------------
>
>                 Key: SPARK-20001
>                 URL: https://issues.apache.org/jira/browse/SPARK-20001
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Dan Sanduleac
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
