[ 
https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459584#comment-17459584
 ] 

Apache Spark commented on SPARK-37650:
--------------------------------------

User 'PerilousApricot' has created a pull request for this issue:
https://github.com/apache/spark/pull/34903

> Tell spark-env.sh the python interpreter
> ----------------------------------------
>
>                 Key: SPARK-37650
>                 URL: https://issues.apache.org/jira/browse/SPARK-37650
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Andrew Malone Melo
>            Priority: Major
>
> When loading config defaults via spark-env.sh, it can be useful to know
> the current pyspark python interpreter so that the script can set
> configuration values properly. Pass this value to the environment script
> as _PYSPARK_DRIVER_SYS_EXECUTABLE.
> h3. What changes were proposed in this pull request?
> It's currently possible to set sensible site-wide spark configuration 
> defaults by using {{$SPARK_CONF_DIR/spark-env.sh}}. In the case where a 
> user is using pyspark, however, there are a number of things that aren't 
> discoverable by that script, due to the way that it's called. There is a 
> chain of calls (java_gateway.py -> shell script -> java -> shell script) that 
> ends up obliterating any bit of the python context.
> This change proposes to add an environment variable 
> {{_PYSPARK_DRIVER_SYS_EXECUTABLE}}, set during pyspark's {{java_gateway.py}} 
> bootstrapping process to the path of the top-level python executable. With 
> that, spark-env.sh will be able to infer enough 
> information about the python environment to set the appropriate configuration 
> variables.
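> To illustrate the shape of the change (a rough sketch only, not the literal 
> diff in the pull request; the spark-submit command shown is a placeholder), 
> the bootstrapping code would export the interpreter path before spawning the 
> JVM:
> {code:python}
> # Minimal sketch of the idea: record which python interpreter is driving
> # pyspark before the JVM is launched, so that spark-env.sh (invoked further
> # down the spark-submit call chain) can see it.
> import os
> import sys
> from subprocess import Popen
>
> env = dict(os.environ)
> env["_PYSPARK_DRIVER_SYS_EXECUTABLE"] = sys.executable
>
> # Placeholder command; in java_gateway.py this is the real spark-submit
> # invocation built during gateway launch.
> command = ["spark-submit", "pyspark-shell"]
> proc = Popen(command, env=env)
> proc.wait()
> {code}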
> h3. Why are the changes needed?
> Right now, there are a number of config options useful to pyspark that can't be 
> reliably set by {{spark-env.sh}} because it is unaware of the python context 
> that is spawning the executor. To give the most trivial example, it is currently 
> possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} 
> based on information readily available from the environment (e.g. the k8s 
> downward API). However, {{spark.pyspark.python}} and family cannot be set 
> because by the time {{spark-env.sh}} executes, all of the python context has 
> been lost. We can instruct users to add the appropriate config variables 
> themselves, but this form of cargo-culting is error-prone and not scalable. It 
> would be much better to expose the important python variables so that pyspark 
> does not have to be a second-class citizen.
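> As an illustration of what this enables (the fallback policy below is only an 
> example of what a site could choose to do, not something this change imposes), 
> a site-wide {{spark-env.sh}} could then contain:
> {code:bash}
> # Illustrative sketch only: default the driver and executor interpreters to
> # whatever python launched pyspark, unless the user already chose one.
> if [ -n "${_PYSPARK_DRIVER_SYS_EXECUTABLE}" ]; then
>   export PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-${_PYSPARK_DRIVER_SYS_EXECUTABLE}}"
>   export PYSPARK_PYTHON="${PYSPARK_PYTHON:-${_PYSPARK_DRIVER_SYS_EXECUTABLE}}"
> fi
> {code}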
> h3. Does this PR introduce _any_ user-facing change?
> Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will 
> receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing 
> to the driver's python executable.
> h3. How was this patch tested?
> To be perfectly honest, I don't know where this fits into the testing 
> infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to 
> java_gateway.py and that works, but in terms of adding this to the CI ... I'm 
> at a loss. I'm more than willing to add the additional info, if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
