wombatu-kun opened a new pull request, #19019:
URL: https://github.com/apache/hudi/pull/19019

   ### Describe the issue this Pull Request addresses
   
   The Flink streaming-read integration test 
`ITTestHoodieDataSource#testStreamReadMorTableWithCompactionFromEarliest` is 
intermittently flaky in CI. It surfaced on an unrelated PR run with 
`java.lang.NullPointerException` at 
`ParquetColumnarRowSplitReader.readNextRowGroup`, reported as "Unexpected job 
failure". Originating failure: 
https://github.com/apache/hudi/actions/runs/27597833446/job/81591777608
   
   ### Summary and Changelog
   
   The test drives a continuous streaming read and terminates it by having 
`CollectSink` throw `SuccessException` once the expected row count is 
collected. During that cascading shutdown the source's `SplitFetcher` calls 
`close()` down to `ParquetColumnarRowSplitReader#close`, which nulls out its 
`reader` field, while the task-thread mailbox is still draining a 
`BatchRecords` batch queued earlier for the same split. Depending on exact 
timing the in-flight row-group read surfaces either as `IOException("Stream is 
closed!")` or, when `close()` fully completes first, as a 
`NullPointerException` from `reader.readNextRowGroup()` on the now-null 
`reader`. Both are the same benign teardown race: the sink has already 
collected all expected rows by then, so the functional outcome is unchanged and 
only the reported failure cause differs. Splits are read strictly sequentially 
and `reader` is nulled only by `close()` (reachable here only on job teardown), 
so this is not a steady-state correctnes
 s/data-loss issue.
   
   `fetchResultWithExpectedNum` already documents and tolerates the 
`IOException("Stream is closed!")` variant via `isAcceptableTerminalFailure`. 
This change extends that same test-only tolerance to the `NullPointerException` 
variant, scoped narrowly to an NPE whose stack trace originates from 
`ParquetColumnarRowSplitReader#readNextRowGroup` (robust to Flink's 
`SerializedThrowable` wrapping), so that genuine NPEs and the legitimate 
`IOException("expecting more rows...")` thrown from the same method still fail 
the test. The call-site rationale comment is extended to document the second 
manifestation. The flaky test was introduced by #18848.
   
   ### Impact
   
   Test-only change; no production code is modified. It reduces flakiness of 
the Flink streaming-read IT. The tolerance stays scoped to the 
`SuccessException`-based test termination pattern and is deliberately not 
mirrored in production code, where failing the job on a stream-closed-mid-read 
is the correct behavior for real I/O failures.
   
   ### Risk Level
   
   low
   
   Verified by building the `hudi-flink` module (`test-compile`) with 0 
checkstyle violations and running 
`ITTestHoodieDataSource#testStreamReadMorTableWithCompactionFromEarliest` (both 
`useSourceV2` parameterizations) to a clean pass. The change only adds a branch 
inside the existing teardown-failure catch path, so it cannot affect the happy 
path.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to