wombatu-kun opened a new pull request, #19019: URL: https://github.com/apache/hudi/pull/19019
### Describe the issue this Pull Request addresses The Flink streaming-read integration test `ITTestHoodieDataSource#testStreamReadMorTableWithCompactionFromEarliest` is intermittently flaky in CI. It surfaced on an unrelated PR run with `java.lang.NullPointerException` at `ParquetColumnarRowSplitReader.readNextRowGroup`, reported as "Unexpected job failure". Originating failure: https://github.com/apache/hudi/actions/runs/27597833446/job/81591777608 ### Summary and Changelog The test drives a continuous streaming read and terminates it by having `CollectSink` throw `SuccessException` once the expected row count is collected. During that cascading shutdown the source's `SplitFetcher` calls `close()` down to `ParquetColumnarRowSplitReader#close`, which nulls out its `reader` field, while the task-thread mailbox is still draining a `BatchRecords` batch queued earlier for the same split. Depending on exact timing the in-flight row-group read surfaces either as `IOException("Stream is closed!")` or, when `close()` fully completes first, as a `NullPointerException` from `reader.readNextRowGroup()` on the now-null `reader`. Both are the same benign teardown race: the sink has already collected all expected rows by then, so the functional outcome is unchanged and only the reported failure cause differs. Splits are read strictly sequentially and `reader` is nulled only by `close()` (reachable here only on job teardown), so this is not a steady-state correctnes s/data-loss issue. `fetchResultWithExpectedNum` already documents and tolerates the `IOException("Stream is closed!")` variant via `isAcceptableTerminalFailure`. This change extends that same test-only tolerance to the `NullPointerException` variant, scoped narrowly to an NPE whose stack trace originates from `ParquetColumnarRowSplitReader#readNextRowGroup` (robust to Flink's `SerializedThrowable` wrapping), so that genuine NPEs and the legitimate `IOException("expecting more rows...")` thrown from the same method still fail the test. The call-site rationale comment is extended to document the second manifestation. The flaky test was introduced by #18848. ### Impact Test-only change; no production code is modified. It reduces flakiness of the Flink streaming-read IT. The tolerance stays scoped to the `SuccessException`-based test termination pattern and is deliberately not mirrored in production code, where failing the job on a stream-closed-mid-read is the correct behavior for real I/O failures. ### Risk Level low Verified by building the `hudi-flink` module (`test-compile`) with 0 checkstyle violations and running `ITTestHoodieDataSource#testStreamReadMorTableWithCompactionFromEarliest` (both `useSourceV2` parameterizations) to a clean pass. The change only adds a branch inside the existing teardown-failure catch path, so it cannot affect the happy path. ### Documentation Update none ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Enough context is provided in the sections above - [x] Adequate tests were added if applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
