anblanco commented on code in PR #55201:
URL: https://github.com/apache/spark/pull/55201#discussion_r3040931134
##########
python/pyspark/worker.py:
##########
@@ -3420,4 +3420,16 @@ def process():
# TODO: Remove the following two lines and use `Process.pid()` when we drop JDK 8.
write_int(os.getpid(), sock_file)
sock_file.flush()
- main(sock_file, sock_file)
+ try:
+ main(sock_file, sock_file)
+ finally:
+ # SPARK-53759: Flush before close to ensure all buffered data reaches
+ # the socket. On Python 3.12+, changed GC finalization ordering
+ # (https://github.com/python/cpython/issues/97922) can cause the
+ # underlying socket to close before BufferedRWPair flushes its write
+ # buffer, resulting in data loss and EOFException on the JVM side.
+ # This mirrors the explicit flush in daemon.py's worker() finally block.
+ try:
+ sock_file.flush()
Review Comment:
I evaluated the feasibility of properly backporting PR #54458 (SPARK-55665).
At a high level, that unification was written as a refactor on top of
`master`; I don't think the original PR considered the possibility of
backporting to older branches.
Specifically, #54458 does fix SPARK-53759 on master via its context
manager's explicit close() in a finally block, but cherry-picking it to stable
branches ranges from tedious to impractical: branch-4.1 requires resolving 10
import conflicts across 14 files, branch-4.0 needs the context manager
rewritten for the older java_port connection pattern, and branch-3.5 is missing
10 of the 14 worker files, making it effectively a hand-written 4-file commit.
By contrast, this PR (#55201) is a more precise try/finally fix that applies
identically and cleanly to all three release branches, and can be superseded
by the worker unification changes in the 4.2.x branches.
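
For reference, the flush-before-close pattern in the hunk above can be
reproduced outside Spark with a minimal sketch. The names `run_worker`, `demo`,
and `fake_main` are hypothetical and are not part of `worker.py`. The
`except Exception: pass` clause is an assumption, since the quoted hunk ends at
the `flush()` call and does not show how the PR handles a failed flush.

```python
import socket


def run_worker(sock_file, main):
    """Run main(), then flush buffered output even if main() raises."""
    try:
        main(sock_file, sock_file)
    finally:
        try:
            # Push any bytes still held in the write buffer to the socket
            # explicitly, rather than relying on GC finalization order.
            sock_file.flush()
        except Exception:
            # Assumption: a failed flush (e.g. peer already gone) is ignored.
            pass


def demo():
    """Write through a buffered socket file and read the bytes on the peer."""
    a, b = socket.socketpair()
    # "rwb" with a positive buffer size yields an io.BufferedRWPair, so a
    # small write stays in the buffer until it is flushed.
    sock_file = a.makefile("rwb", 65536)

    def fake_main(infile, outfile):
        # Stand-in for the real worker main(); writes without flushing.
        outfile.write(b"result")

    run_worker(sock_file, fake_main)

    b.settimeout(5)
    data = b.recv(16)  # the peer sees the data only because of the flush

    sock_file.close()
    a.close()
    b.close()
    return data
```

Calling `demo()` should return the bytes written by `fake_main`. The read
happens before anything is closed, so the data reaches the peer only because
`run_worker` flushed it.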
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]