gwdgithubnom opened a new pull request, #55310:
URL: https://github.com/apache/spark/pull/55310
### What changes were proposed in this pull request?
This PR improves the Python executable selection logic in `SparkContext` to
resolve version mismatch issues, particularly in YARN client mode.
Previously, the driver could pick the wrong Python interpreter
when `PYSPARK_PYTHON` was not explicitly set in the shell environment, even if
it was defined in `SparkConf`. This led to a `RuntimeError` caused by a
minor-version mismatch between the driver and executors (e.g., the driver using
system Python 3.10 while the executors use an archived Python 3.6).
Key changes:
1. **New method `_resolve_python_exec()`**: Implemented a robust Python
executable resolution method that follows a 7-level priority sequence to ensure
consistency:
- `PYSPARK_DRIVER_PYTHON` (Env) > `PYSPARK_PYTHON` (Env) >
`spark.pyspark.driver.python` (Conf) > `spark.pyspark.python` (Conf) >
`spark.executorEnv.PYSPARK_DRIVER_PYTHON` > `spark.executorEnv.PYSPARK_PYTHON`
> Default (`python3`)
2. **Environment variable sync**: When the Python path is resolved from a
configuration key, it is synced back to `os.environ["PYSPARK_PYTHON"]` for
downstream compatibility.
3. **Improved client mode support**: Ensures the driver can correctly
resolve the Python path from Spark configuration without requiring manual
environment variable exports for every script execution.
Note: This PR is a revival and optimization of #51357.
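The resolution order and the environment sync described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `resolve_python_exec`, the constants, and the use of a plain mapping in place of `SparkConf` are assumptions made to keep the example self-contained.

```python
import os
from typing import Mapping, MutableMapping, Optional

# Lookup order as listed in the PR description. All names in this sketch
# are hypothetical; a plain dict stands in for SparkConf.
ENV_KEYS = ("PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON")
CONF_KEYS = (
    "spark.pyspark.driver.python",
    "spark.pyspark.python",
    "spark.executorEnv.PYSPARK_DRIVER_PYTHON",
    "spark.executorEnv.PYSPARK_PYTHON",
)
DEFAULT_PYTHON = "python3"


def resolve_python_exec(
    conf: Mapping[str, str],
    environ: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Return the first non-empty value found in the priority sequence."""
    env = os.environ if environ is None else environ

    # 1. Environment variables win over any configuration key.
    for key in ENV_KEYS:
        value = env.get(key)
        if value:
            return value

    # 2. Configuration keys; a hit is synced back to PYSPARK_PYTHON.
    for key in CONF_KEYS:
        value = conf.get(key)
        if value:
            env["PYSPARK_PYTHON"] = value
            return value

    # 3. Nothing set anywhere: fall back to the default interpreter.
    return DEFAULT_PYTHON
```

Passing the environment as a parameter is a choice made for this sketch so that the lookup can be exercised without touching the real process environment.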
### Why are the changes needed?
PySpark requires the driver and executors to use consistent Python minor
versions. In many production environments (especially when using conda-pack or
virtualenvs), `PYSPARK_PYTHON` is passed via `SparkConf` rather than
system-wide environment variables.
Without this fix, the driver falls back to the system default Python when
scripts are launched directly, causing a mismatch with the executor's archived
Python environment. This change automates the resolution, making the deployment
more robust and user-friendly by eliminating the need to manually export
environment variables for each session.
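To illustrate why a mismatch is fatal, the following simplified sketch compares `major.minor` version strings and raises when they differ. It is an illustration of the kind of check involved, not PySpark's actual worker code; the helper names `minor_version` and `check_versions` and the error message are assumptions.

```python
import sys


def minor_version(version_info=sys.version_info) -> str:
    """Format the first two version components as 'major.minor'."""
    return "%d.%d" % tuple(version_info[:2])


def check_versions(driver: str, worker: str) -> None:
    """Raise if driver and worker disagree on the major.minor version."""
    if driver != worker:
        raise RuntimeError(
            "Python in worker has version %s but the driver has %s"
            % (worker, driver)
        )


# Example from the description: driver on 3.10, executors on 3.6.
try:
    check_versions(minor_version((3, 10, 0)), minor_version((3, 6, 0)))
except RuntimeError as exc:
    print(exc)
```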
### Does this PR introduce _any_ user-facing change?
Yes. The driver now resolves the Python executable from Spark configuration
keys (`spark.pyspark.python`, `spark.pyspark.driver.python`, and related
`spark.executorEnv.*` keys) as a fallback when environment variables are not
set. Previously, the driver checked only the `PYSPARK_PYTHON` environment
variable and fell back to `python3`. This provides better support for YARN
client mode with archived Python environments (e.g., conda-pack, virtualenv).
### How was this patch tested?
1. **Unit Tests**: Added `test_resolve_python_exec()` in
`pyspark/tests/test_context.py` with 5 test cases:
- `spark.pyspark.driver.python` takes precedence over
`spark.pyspark.python`
- `spark.pyspark.python` is used as the fallback when
`spark.pyspark.driver.python` is not set
- The `PYSPARK_DRIVER_PYTHON` environment variable has the highest priority
- The `PYSPARK_PYTHON` environment variable overrides Spark config
- Default fallback to `python3`
2. **Manual Verification**: Ran a PySpark job in YARN client mode with a
specific Python archive. Verified that `sys.version` and `sys.executable` match
between the driver and executors using:
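```python
import sys
# Assumes an active SparkSession named `spark`.
print(sys.version, sys.executable)  # driver side
# The lambda runs in a Python worker, so this reports the executor side.
spark.range(1).rdd.map(lambda x: (x, sys.version, sys.executable)).collect()
```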
3. **Linting**: Passed `dev/lint-python` checks.
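The five unit-test cases listed above could be expressed along these lines. This is a sketch only: `resolve` is a hypothetical stand-in for the resolver under test, a plain dict replaces `SparkConf`, and the class and method names are illustrative rather than taken from `pyspark/tests/test_context.py`.

```python
import os
import unittest
from unittest import mock


def resolve(conf):
    """Hypothetical stand-in for the resolver under test."""
    for key in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON"):
        if os.environ.get(key):
            return os.environ[key]
    for key in (
        "spark.pyspark.driver.python",
        "spark.pyspark.python",
        "spark.executorEnv.PYSPARK_DRIVER_PYTHON",
        "spark.executorEnv.PYSPARK_PYTHON",
    ):
        if conf.get(key):
            return conf[key]
    return "python3"


class ResolvePythonExecSketch(unittest.TestCase):
    CONF = {
        "spark.pyspark.driver.python": "/conf/driver/python",
        "spark.pyspark.python": "/conf/python",
    }

    def check(self, env, conf, expected):
        # clear=True starts each case from an empty environment and
        # restores the real one when the block exits.
        with mock.patch.dict(os.environ, env, clear=True):
            self.assertEqual(resolve(conf), expected)

    def test_driver_conf_takes_precedence(self):
        self.check({}, self.CONF, "/conf/driver/python")

    def test_python_conf_fallback(self):
        self.check({}, {"spark.pyspark.python": "/conf/python"}, "/conf/python")

    def test_driver_env_has_highest_priority(self):
        env = {"PYSPARK_DRIVER_PYTHON": "/env/driver", "PYSPARK_PYTHON": "/env/py"}
        self.check(env, self.CONF, "/env/driver")

    def test_python_env_overrides_conf(self):
        self.check({"PYSPARK_PYTHON": "/env/py"}, self.CONF, "/env/py")

    def test_default_fallback(self):
        self.check({}, {}, "python3")
```

Saved to a file, the cases can be run with `python -m unittest <file>`.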
### Was this patch authored or co-authored using generative AI tooling?
Yes.