anblanco commented on code in PR #55201:
URL: https://github.com/apache/spark/pull/55201#discussion_r3040931134


##########
python/pyspark/worker.py:
##########
@@ -3420,4 +3420,16 @@ def process():
     # TODO: Remove the following two lines and use `Process.pid()` when we drop JDK 8.
     write_int(os.getpid(), sock_file)
     sock_file.flush()
-    main(sock_file, sock_file)
+    try:
+        main(sock_file, sock_file)
+    finally:
+        # SPARK-53759: Flush before close to ensure all buffered data reaches
+        # the socket. On Python 3.12+, changed GC finalization ordering
+        # (https://github.com/python/cpython/issues/97922) can cause the
+        # underlying socket to close before BufferedRWPair flushes its write
+        # buffer, resulting in data loss and EOFException on the JVM side.
+        # This mirrors the explicit flush in daemon.py's worker() finally block.
+        try:
+            sock_file.flush()

Review Comment:
   I evaluated whether PR #54458 (SPARK-55665) can be backported cleanly.
   
   At a high level, that unification was written as a refactor on top of
   `master`; I don't think the original PR considered the possibility of
   backporting to older branches.
   
   Specifically, #54458 does fix SPARK-53759 on master via its context
   manager's explicit close() in a finally block, but cherry-picking it to the
   stable branches ranges from tedious to impractical: branch-4.1 requires
   resolving 10 import conflicts across 14 files, branch-4.0 needs the context
   manager rewritten for the older java_port connection pattern, and branch-3.5
   is missing 10 of the 14 worker files, making it effectively a hand-written
   4-file commit.
   
   By contrast, this PR (#55201) is a narrower try/finally fix that applies
   identically and cleanly to all three release branches, and it can be
   superseded by the worker unification changes in the 4.2.x branches.
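
   For reference, the flush-before-close pattern this PR relies on can be
   sketched in isolation roughly as follows. This is a minimal illustration,
   not the worker.py code: `run_worker`, `fake_main`, and `demo` are
   hypothetical names, the socket pair stands in for the JVM connection, and
   the exceptions swallowed in the inner `except` are an assumption.

```python
import socket


def run_worker(sock_file, main):
    """Run main(), then flush explicitly so buffered bytes reach the socket."""
    try:
        main(sock_file, sock_file)
    finally:
        try:
            # Do not rely on GC finalization order to flush the write buffer.
            sock_file.flush()
        except (OSError, ValueError):
            # Assumption: if the peer is gone or the file is already closed,
            # there is nothing left to deliver, so the error is ignored.
            pass


def demo():
    # A connected socket pair stands in for the JVM <-> Python worker link.
    parent, child = socket.socketpair()
    # "rwb" with a buffer size yields an io.BufferedRWPair.
    sock_file = child.makefile("rwb", 65536)

    def fake_main(infile, outfile):
        # Small write: stays in the 64 KiB buffer until flushed.
        outfile.write(b"result")

    run_worker(sock_file, fake_main)
    data = parent.recv(16)

    sock_file.close()
    child.close()
    parent.close()
    return data


if __name__ == "__main__":
    print(demo())
```

   The point of the sketch is only that the write becomes visible to the peer
   because of the explicit flush in `finally`, independent of when the
   buffered file object is finalized.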



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

