Willi Raschkowski created SPARK-44767:
-----------------------------------------
             Summary: Plugin API for PySpark and SparkR subprocesses
                 Key: SPARK-44767
                 URL: https://issues.apache.org/jira/browse/SPARK-44767
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.4.1
            Reporter: Willi Raschkowski

An API to customize the Python and R worker subprocesses allows for extensibility beyond what can be expressed via static configs and environment variables such as {{spark.pyspark.python}}.

A use case we had for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries, and to use those packages in Spark we need their binaries on the {{PATH}}. But we can't set the {{PATH}} via a static config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with a random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that is already configured on the hosts.

Other use cases unlocked by this include overriding the worker executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream.
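A minimal sketch of what such a hook could look like, only to illustrate the {{PATH}} use case: the {{WorkerProcessPlugin}} trait, its method signature, and any registration mechanism are hypothetical and not existing Spark APIs; the only real call used below is {{SparkFiles.getRootDirectory()}}, which returns the per-application directory (with a randomized name) into which files and archives are unpacked.

{code:scala}
import java.util.{Map => JMap}
import org.apache.spark.SparkFiles

// Hypothetical extension point (not an existing Spark API): called once per
// Python/R worker launch, letting a plugin rewrite the environment and the
// command line before the subprocess is forked.
trait WorkerProcessPlugin {
  def customize(command: Seq[String], env: JMap[String, String]): Seq[String]
}

// Example from this ticket: prepend the conda-pack environment's bin/ directory
// (unpacked from spark.archives at a location only known at runtime) to PATH,
// without clobbering the PATH pre-configured on the host.
class CondaPathPlugin extends WorkerProcessPlugin {
  override def customize(command: Seq[String], env: JMap[String, String]): Seq[String] = {
    // "environment" is the archive alias, e.g. spark.archives=conda_env.tar.gz#environment
    val condaBin = s"${SparkFiles.getRootDirectory()}/environment/bin"
    val hostPath = Option(env.get("PATH")).getOrElse("")
    env.put("PATH", s"$condaBin:$hostPath")
    command  // leave the worker command line unchanged
  }
}
{code}

How such a plugin would be registered (e.g., via a config analogous to {{spark.plugins}}) is left open here; the sketch only illustrates the kind of customization the API should enable.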