[PR] [CONNECT][TESTS] Retry flaky python foreachBatch termination test [spark]

via GitHub Tue, 21 Apr 2026 22:27:23 -0700


zhengruifeng opened a new pull request, #55473:
URL: https://github.com/apache/spark/pull/55473


   ### What changes were proposed in this pull request?
   
   Wrap the body of `SparkConnectSessionHolderSuite`'s test `"python 
foreachBatch process: process terminates after query is stopped"` with 
`SparkFunSuite.retry(n = 2)` and `failAfter(1.minute)` to bound the impact when 
the test hangs. The test body is extracted into a private method so the diff 
against master stays minimal (no re-indentation of the existing body).
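   
   The wrapping described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `retrySketch` and `failAfterSketch` are hypothetical stand-ins for `SparkFunSuite.retry` and ScalaTest's `failAfter`, the `resetState` callback is an assumed placeholder for the `afterEach` / `beforeEach` calls, and the simulated hang is a plain `Thread.sleep`.

```scala
import java.util.concurrent.{Callable, ExecutionException, Executors, ThreadFactory, TimeUnit, TimeoutException}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.duration._
import scala.util.control.NonFatal

object RetryWithTimeoutSketch {
  // Hypothetical stand-in for a per-attempt time limit: run `body` on a
  // daemon worker thread and fail the attempt if it exceeds `limit`.
  def failAfterSketch[T](limit: FiniteDuration)(body: => T): T = {
    val executor = Executors.newSingleThreadExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r)
        t.setDaemon(true) // a stuck attempt must not keep the JVM alive
        t
      }
    })
    val future = executor.submit(new Callable[T] { def call(): T = body })
    try future.get(limit.toMillis, TimeUnit.MILLISECONDS)
    catch {
      case _: TimeoutException =>
        future.cancel(true)
        throw new RuntimeException(s"attempt did not finish within $limit")
      case e: ExecutionException => throw e.getCause
    } finally executor.shutdownNow()
  }

  // Hypothetical stand-in for a retry helper: `n` is the number of extra
  // attempts allowed after the first failure (an assumption of this sketch).
  def retrySketch[T](n: Int)(resetState: () => Unit)(body: => T): T = {
    try body
    catch {
      case NonFatal(_) if n > 0 =>
        resetState() // placeholder for teardown and setup between attempts
        retrySketch(n - 1)(resetState)(body)
    }
  }

  def main(args: Array[String]): Unit = {
    val attempts = new AtomicInteger(0)
    var resets = 0
    val result = retrySketch(n = 2)(() => resets += 1) {
      failAfterSketch(200.millis) {
        // The first attempt simulates a hang; later attempts return quickly.
        if (attempts.incrementAndGet() == 1) Thread.sleep(5000)
        "stopped"
      }
    }
    println(s"result=$result attempts=${attempts.get} resets=$resets")
  }
}
```

   In this sketch the first attempt times out, state is reset once, and the second attempt succeeds. The time limit sits inside the retry so that each attempt gets its own budget.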
   
   ### Why are the changes needed?
   
   The test has been observed to hang: `query1.stop()` waits forever inside 
`StreamExecution.stop()` → `queryExecutionThread.join(0)` because the stream 
execution thread is blocked at `StreamingForeachBatchHelper.scala:172` in 
`dataIn.readInt()` (reading from the Python foreachBatch worker socket). The 
default `spark.sql.streaming.stopTimeout = 0` means the join waits indefinitely, 
and `Thread.interrupt()` cannot unblock a blocking socket read.
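   
   The hang mechanism can be reproduced outside Spark. The following is a minimal sketch under stated assumptions: a loopback socket pair stands in for the Python worker connection, the peer never writes, and all names (`BlockedReadSketch`, `run`) are hypothetical. It uses a bounded `join` so that the demonstration itself terminates.

```scala
import java.io.{DataInputStream, IOException}
import java.net.{InetAddress, ServerSocket, Socket}

object BlockedReadSketch {
  // Returns (reader still alive after interrupt, reader still alive after close).
  def run(): (Boolean, Boolean) = {
    val loopback = InetAddress.getLoopbackAddress
    val server = new ServerSocket(0, 1, loopback)
    val client = new Socket(loopback, server.getLocalPort)
    val accepted = server.accept() // peer that never writes, like a stuck worker

    val reader = new Thread(new Runnable {
      def run(): Unit = {
        try {
          new DataInputStream(client.getInputStream).readInt() // blocks here
          ()
        } catch {
          case _: IOException => () // reached once the socket is closed
        }
      }
    })
    reader.setDaemon(true)
    reader.start()
    Thread.sleep(200) // give the reader time to block in readInt()

    reader.interrupt()
    reader.join(500) // bounded; join(0) would wait indefinitely here
    val aliveAfterInterrupt = reader.isAlive

    client.close() // closing the socket is what releases the blocked read
    reader.join(2000)
    val aliveAfterClose = reader.isAlive

    accepted.close()
    server.close()
    (aliveAfterInterrupt, aliveAfterClose)
  }

  def main(args: Array[String]): Unit = {
    val (afterInterrupt, afterClose) = run()
    println(s"alive after interrupt: $afterInterrupt, alive after close: $afterClose")
  }
}
```

   On a JVM using ordinary platform threads, the reader is expected to survive the interrupt and exit only after the socket is closed, which matches the behavior described above.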
   
   Example CI run that burned the entire 150-minute job budget on this exact 
test: https://github.com/apache/spark/actions/runs/24748010505/job/72404564812
   
   This PR does not fix the underlying I/O issue (there is a `TODO` at 
`StreamingForeachBatchHelper.scala:196` for a proper read timeout). It only 
limits the blast radius: a hung attempt trips the 1-minute `failAfter`, and 
`retry` gives the flake a chance to recover with fresh session state 
(`SparkFunSuite.retry` calls `afterEach` / `beforeEach` between attempts).
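   
   For context, one generic way to bound a blocking socket read on the JVM is `Socket.setSoTimeout`. The sketch below only illustrates that mechanism; it is not what the `TODO` prescribes and not a proposed Spark change. The helper `readIntWithTimeout` and the loopback setup are hypothetical.

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.{InetAddress, ServerSocket, Socket, SocketTimeoutException}

object ReadTimeoutSketch {
  // Hypothetical helper: Some(value) if an int arrives within `timeoutMs`,
  // None if the read times out.
  def readIntWithTimeout(socket: Socket, timeoutMs: Int): Option[Int] = {
    socket.setSoTimeout(timeoutMs) // reads now throw SocketTimeoutException
    try Some(new DataInputStream(socket.getInputStream).readInt())
    catch { case _: SocketTimeoutException => None }
  }

  def main(args: Array[String]): Unit = {
    val loopback = InetAddress.getLoopbackAddress
    val server = new ServerSocket(0, 1, loopback)
    val client = new Socket(loopback, server.getLocalPort)
    val accepted = server.accept()
    try {
      // The peer is silent, so the read gives up instead of hanging.
      println(readIntWithTimeout(client, 200))

      // Once the peer writes, the same helper returns the value.
      val out = new DataOutputStream(accepted.getOutputStream)
      out.writeInt(7)
      out.flush()
      println(readIntWithTimeout(client, 1000))
    } finally {
      client.close()
      accepted.close()
      server.close()
    }
  }
}
```

   A timeout that fires partway through a multi-byte value would leave the stream mid-message, so a real fix would also need to decide how to recover or abort in that case.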
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only change.
   
   ### How was this patch tested?
   
   - Compile + scalastyle clean.
   - Ran the target test 10 consecutive times locally in a single SBT session; 
10/10 passed with no retries triggered.
   - Successful reference CI run on a personal fork: 
https://github.com/zhengruifeng/spark/actions/runs/24756414638/job/72430787535 
(the target test completed in **5.57 s**, so the 1-minute per-attempt cap leaves 
roughly a 10× margin).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic Claude Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

