[ https://issues.apache.org/jira/browse/SPARK-37650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459584#comment-17459584 ]
Apache Spark commented on SPARK-37650:
--------------------------------------

User 'PerilousApricot' has created a pull request for this issue:
https://github.com/apache/spark/pull/34903

> Tell spark-env.sh the python interpreter
> ----------------------------------------
>
>                 Key: SPARK-37650
>                 URL: https://issues.apache.org/jira/browse/SPARK-37650
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Andrew Malone Melo
>            Priority: Major
>
> When loading config defaults via spark-env.sh, it can be useful to know the current pyspark python interpreter so that the configuration can set values properly. This value is passed to the environment script as _PYSPARK_DRIVER_SYS_EXECUTABLE.
>
> h3. What changes were proposed in this pull request?
> It's currently possible to set sensible site-wide spark configuration defaults by using {{$SPARK_CONF_DIR/spark-env.sh}}. In the case where a user is using pyspark, however, there are a number of things that aren't discoverable by that script, due to the way it is called. There is a chain of calls (java_gateway.py -> shell script -> java -> shell script) that ends up obliterating any bit of the python context.
> This change proposes to add an environment variable, {{_PYSPARK_DRIVER_SYS_EXECUTABLE}}, which points to the filename of the top-level python executable, within pyspark's {{java_gateway.py}} bootstrapping process. With that, spark-env.sh will be able to infer enough information about the python environment to set the appropriate configuration variables.
>
> h3. Why are the changes needed?
> Right now, there are a number of config options useful to pyspark that can't be reliably set by {{spark-env.sh}}, because it is unaware of the python context that is spawning the JVM. To give the most trivial example, it is currently possible to set {{spark.kubernetes.container.image}} or {{spark.driver.host}} based on information readily available from the environment (e.g. the k8s downward API). However, {{spark.pyspark.python}} and family cannot be set, because by the time {{spark-env.sh}} executes, all of the python context has been lost. We can instruct users to add the appropriate config variables themselves, but this form of cargo-culting is error-prone and does not scale. It would be much better to expose the important python variables so that pyspark need not be a second-class citizen.
>
> h3. Does this PR introduce _any_ user-facing change?
> Yes. With this change, if python spawns the JVM, {{spark-env.sh}} will receive an environment variable {{_PYSPARK_DRIVER_SYS_EXECUTABLE}} pointing to the python executable.
>
> h3. How was this patch tested?
> To be perfectly honest, I don't know where this fits into the testing infrastructure. I monkey-patched a binary 3.2.0 install to add the lines to java_gateway.py and that works, but in terms of adding this to CI ... I'm at a loss. I'm more than willing to add the additional information, if needed.
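As an illustration of the intended use (not part of the patch itself), a site-wide {{spark-env.sh}} might consume the proposed variable roughly as sketched below. The variable name comes from this proposal; exporting {{PYSPARK_PYTHON}} is just one example of a setting a site could derive from it.

{code:bash}
# Illustrative sketch only: reuse the driver's interpreter for the executors
# when the JVM was launched from pyspark. _PYSPARK_DRIVER_SYS_EXECUTABLE is
# the variable proposed in this issue; mapping it to PYSPARK_PYTHON is one
# possible site policy, not part of the patch.
if [ -n "${_PYSPARK_DRIVER_SYS_EXECUTABLE}" ]; then
  export PYSPARK_PYTHON="${_PYSPARK_DRIVER_SYS_EXECUTABLE}"
fi
{code}

Other settings mentioned above, such as {{spark.pyspark.python}}, could presumably be derived the same way once the interpreter path is visible to the script.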