gwdgithubnom opened a new pull request, #55310:
URL: https://github.com/apache/spark/pull/55310

   
   ### What changes were proposed in this pull request?
   
   This PR improves the Python executable selection logic in `SparkContext` to 
resolve version mismatch issues, particularly in YARN client mode.
   
   Previously, the driver could fail to locate the correct Python interpreter 
when `PYSPARK_PYTHON` was not explicitly set in the shell environment, even if 
it was defined in `SparkConf`. This led to a `RuntimeError` caused by a minor 
version mismatch between the driver and executors (e.g., the driver using the 
system Python 3.10 while the executors use an archived Python 3.6).
   
   Key changes:
   
   1. **New method `_resolve_python_exec()`**: Resolves the Python executable 
by checking seven sources in priority order (highest first), so that the driver 
and executors agree on the interpreter:
      - `PYSPARK_DRIVER_PYTHON` (Env) > `PYSPARK_PYTHON` (Env) > 
        `spark.pyspark.driver.python` (Conf) > `spark.pyspark.python` (Conf) > 
        `spark.executorEnv.PYSPARK_DRIVER_PYTHON` (Conf) > 
        `spark.executorEnv.PYSPARK_PYTHON` (Conf) > Default (`python3`)
   2. **Environment variable sync**: When the Python path is resolved from a 
configuration key, it is synced back to `os.environ["PYSPARK_PYTHON"]` for 
downstream compatibility.
   3. **Improved client mode support**: Ensures the driver can correctly 
resolve the Python path from Spark configuration without requiring manual 
environment variable exports for every script execution.
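
   The priority sequence and the environment sync described above can be 
sketched as follows. This is a minimal illustration, not the code from this PR: 
the function name `resolve_python_exec`, its parameters, and the use of a plain 
mapping in place of `SparkConf` are assumptions made so that the example is 
self-contained.

```python
import os
from typing import Mapping, MutableMapping, Optional

# Sources in the priority order listed in the PR description (highest first).
ENV_KEYS = ("PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON")
CONF_KEYS = (
    "spark.pyspark.driver.python",
    "spark.pyspark.python",
    "spark.executorEnv.PYSPARK_DRIVER_PYTHON",
    "spark.executorEnv.PYSPARK_PYTHON",
)
DEFAULT_PYTHON = "python3"


def resolve_python_exec(
    conf: Mapping[str, str],
    environ: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Hypothetical stand-in for `_resolve_python_exec()` (assumed shape)."""
    env = os.environ if environ is None else environ

    # Levels 1-2: environment variables win over any configuration key.
    for key in ENV_KEYS:
        value = env.get(key)
        if value:
            return value

    # Levels 3-6: configuration keys, checked in order.
    for key in CONF_KEYS:
        value = conf.get(key)
        if value:
            # Sync back so code that only reads PYSPARK_PYTHON sees the value.
            env["PYSPARK_PYTHON"] = value
            return value

    # Level 7: default interpreter name.
    return DEFAULT_PYTHON
```

   Passing an explicit `environ` mapping (for example an empty `dict`) keeps 
the sketch free of side effects on the real process environment; when it is 
omitted, `os.environ` is read and updated.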
   
   Note: This PR is a revival and optimization of #51357.
   
   ### Why are the changes needed?
   
   PySpark requires the driver and executors to use consistent Python minor 
versions. In many production environments (especially when using conda-pack or 
virtualenvs), `PYSPARK_PYTHON` is passed via `SparkConf` rather than 
system-wide environment variables.
   
   Without this fix, the driver falls back to the system default Python when 
scripts are launched directly, causing a mismatch with the executor's archived 
Python environment. With this change, the driver resolves the interpreter from 
the Spark configuration automatically, so users no longer need to export 
environment variables manually for each session.
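
   To make the failure mode concrete, the sketch below compares only the 
`major.minor` part of the interpreter versions, which is the level at which the 
mismatch described above arises. The helper names and the error message wording 
are illustrative assumptions, not PySpark's actual implementation.

```python
import sys
from typing import Iterable, Tuple


def minor_version(version_info: Tuple[int, ...]) -> str:
    """Format a version tuple as 'major.minor', ignoring the patch level."""
    return "%d.%d" % (version_info[0], version_info[1])


def check_python_versions(
    driver: Tuple[int, ...],
    executors: Iterable[Tuple[int, ...]],
) -> None:
    """Raise RuntimeError if any executor differs from the driver in major.minor."""
    expected = minor_version(driver)
    for info in executors:
        found = minor_version(info)
        if found != expected:
            # Illustrative message; the real error text may differ.
            raise RuntimeError(
                "Python in worker has different version %s than that in driver %s"
                % (found, expected)
            )


# The current interpreter always matches itself.
check_python_versions(sys.version_info[:3], [sys.version_info[:3]])
```

   Under these assumptions, a driver on `(3, 10, 4)` and an executor on 
`(3, 6, 9)` would raise, while differing patch levels such as `(3, 10, 1)` 
would pass.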
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The driver now resolves the Python executable from Spark configuration 
keys (`spark.pyspark.python`, `spark.pyspark.driver.python`, and the related 
`spark.executorEnv.*` keys) as a fallback when the environment variables are 
not set. Previously, the driver checked only the `PYSPARK_PYTHON` environment 
variable and otherwise fell back to `python3`. This provides better support for 
YARN client mode with archived Python environments (e.g., conda-pack, 
virtualenv).
   
   ### How was this patch tested?
   
   1. **Unit Tests**: Added `test_resolve_python_exec()` in 
`pyspark/tests/test_context.py` with 5 test cases:
      - `spark.pyspark.driver.python` takes precedence over 
`spark.pyspark.python`
      - `spark.pyspark.python` is used as a fallback when 
`spark.pyspark.driver.python` is not set
      - `PYSPARK_DRIVER_PYTHON` env has highest priority
      - `PYSPARK_PYTHON` env overrides Spark config
      - Default fallback to `python3`
   2. **Manual Verification**: Ran a PySpark job in YARN client mode with a 
specific Python archive. Verified that `sys.version` and `sys.executable` match 
between the driver and executors using:
      ```python
      import sys
      spark.range(1).rdd.map(lambda x: (x, sys.version, sys.executable)).collect()
      ```
   3. **Linting**: Passed `dev/lint-python` checks.
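
   For readers who want to reproduce the priority checks locally, the pattern 
below shows one way to isolate `os.environ` in such tests using 
`unittest.mock.patch.dict`. It is a sketch only: `resolve` is a deliberately 
reduced, hypothetical two-level stand-in for the resolver, and the test class 
is not the one added in `pyspark/tests/test_context.py`.

```python
import os
import unittest
from unittest import mock


def resolve(conf):
    """Reduced, hypothetical resolver: env var, then conf key, then default."""
    return (
        os.environ.get("PYSPARK_PYTHON")
        or conf.get("spark.pyspark.python")
        or "python3"
    )


class ResolvePythonExecSketchTest(unittest.TestCase):
    def test_env_overrides_conf(self):
        # clear=True removes every other variable for the duration of the block.
        with mock.patch.dict(os.environ, {"PYSPARK_PYTHON": "/env/python"}, clear=True):
            self.assertEqual(
                resolve({"spark.pyspark.python": "/conf/python"}), "/env/python"
            )

    def test_conf_fallback(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(
                resolve({"spark.pyspark.python": "/conf/python"}), "/conf/python"
            )

    def test_default(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(resolve({}), "python3")
```

   Saved to a file, the class can be run with `python -m unittest`. 
`patch.dict` restores the original environment when each block exits, so the 
cases do not leak state into one another.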
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]