anblanco opened a new pull request, #55201: URL: https://github.com/apache/spark/pull/55201
### What changes were proposed in this pull request?

Add an explicit `sock_file.flush()` in a `finally` block wrapping the `main()` call in `worker.py`'s `__main__` block (sketched at the end of this description).

The simple-worker codepath (used on Windows and when `spark.python.use.daemon=false`) calls `main(sock_file, sock_file)` as its last statement. When `main()` returns, the process exits without flushing the `BufferedRWPair` write buffer. On Python 3.12+, changed GC finalization ordering ([cpython#97922](https://github.com/python/cpython/issues/97922)) causes the underlying socket to close before `BufferedRWPair.__del__` can flush, resulting in data loss and an `EOFException` on the JVM side.

This mirrors the existing pattern in `daemon.py`'s `worker()` function, which already has `outfile.flush()` in its `finally` block (~line 95). This PR also adds a regression test (`SimpleWorkerFlushTest`) that exercises the simple-worker path with `daemon=false`.

### Why are the changes needed?

PySpark is broken on Python 3.12+ when using the simple-worker path:

- **Windows**: always uses the simple worker (`os.fork()` is unavailable), so PySpark is completely unusable on Windows with Python 3.12+
- **Linux/macOS**: affected when `spark.python.use.daemon=false`

The bug is deterministic: every worker-dependent operation (`rdd.map()`, `createDataFrame()`, UDFs) crashes with:

```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
Caused by: java.io.EOFException
```

Verified broken on all currently supported PySpark releases (3.5.8, 4.0.2, 4.1.1) with Python 3.12 and 3.13.

**Root cause**: The daemon path (`daemon.py`) wraps `worker_main()` in `try/finally` with `outfile.flush()`. The simple-worker path (`worker.py` `__main__`) does not; it relies on `BufferedRWPair.__del__` running during interpreter shutdown, which [is not guaranteed](https://docs.python.org/3/reference/datamodel.html#object.__del__) and breaks on Python 3.12+ due to the changed GC finalization ordering.

### Does this PR introduce _any_ user-facing change?

Yes. It fixes a crash that makes PySpark unusable on Windows with Python 3.12+, and on Linux/macOS with `spark.python.use.daemon=false` on Python 3.12+.

### How was this patch tested?

- Added `SimpleWorkerFlushTest`, an integration test running `rdd.map().collect()` with `spark.python.use.daemon=false` (see the reproduction sketch below)
- Red/green verified against pip-installed PySpark 4.1.1 on Python 3.12.10 (Windows 11):
  - **Without the fix**: the test FAILS deterministically (`Python worker exited unexpectedly`)
  - **With the fix (pyspark.zip patched)**: the test PASSES
- Verification matrix from the standalone reproducer ([anblanco/spark53759-reproducer](https://github.com/anblanco/spark53759-reproducer)):

| Platform | Python | Unpatched | Patched |
|----------|--------|-----------|---------|
| Windows 11 | 3.11.9 | PASS | PASS (fix is harmless) |
| Windows 11 | 3.12.10 | **FAIL** | PASS |
| Windows 11 | 3.13.3 | **FAIL** | PASS |
| Linux (Ubuntu 24.04) | 3.12.3 | **FAIL** | PASS |

**Note on master**: SPARK-55665 refactored `worker.py` to use a `get_sock_file_to_executor()` context manager with `close()` in the `finally` block. Since `BufferedRWPair.close()` flushes internally, master is already covered by that structural change. This backport targets the released code structure, which lacks the refactoring. A defense-in-depth flush was also added to master's context manager for consistency with `daemon.py`. The fix cherry-picks cleanly to `branch-4.0` and `branch-3.5` (identical code).
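For reviewers, this is roughly the shape of the change, reconstructed from the description above. The setup line and the `java_port`/`auth_secret` names are approximations of the surrounding `worker.py` code, not a copy of the actual patch:

```python
# Sketch of worker.py's __main__ block with the proposed fix.
# The connection setup below approximates the existing module code.
if __name__ == "__main__":
    (sock_file, _) = local_connect_and_auth(java_port, auth_secret)  # names assumed
    try:
        main(sock_file, sock_file)
    finally:
        # Explicit flush: relying on BufferedRWPair.__del__ at interpreter
        # shutdown is not guaranteed, and on Python 3.12+ the underlying
        # socket can be finalized first, dropping any buffered output.
        sock_file.flush()
```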
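A minimal, Spark-independent sketch of the buffered-writer hazard described under the root cause. Here `socket.socketpair()` stands in for the worker's connection to the JVM; none of this is Spark code:

```python
import socket

# A connected socket pair stands in for the worker <-> JVM connection.
jvm_side, python_side = socket.socketpair()

# socket.makefile("rwb") returns an io.BufferedRWPair, the same wrapper
# type the worker uses for sock_file.
sock_file = python_side.makefile("rwb")
sock_file.write(b"serialized task results")

# At this point the bytes sit in the Python-side write buffer. If the
# process exits here and the raw socket is finalized before the
# BufferedRWPair (the Python 3.12+ ordering change), the peer reads EOF
# instead of the data.
sock_file.flush()  # the fix: push the buffer to the socket explicitly
print(jvm_side.recv(1024))  # b'serialized task results'

sock_file.close()
jvm_side.close()
python_side.close()
```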
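Finally, the scenario the regression test exercises can be reproduced with a short standalone script. This is a sketch under the configuration described above, not the test's actual code:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    .config("spark.python.use.daemon", "false")  # force the simple-worker path
    .getOrCreate()
)

# Any worker-dependent operation exercises the simple-worker codepath.
result = spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).collect()
assert result == [x * 2 for x in range(10)]

spark.stop()
```

On an unpatched installation with Python 3.12+, this fails with the `EOFException` shown above; with the fix it completes normally.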
### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6 (Anthropic)
