featzhang created FLINK-39481:
---------------------------------
Summary: [tests] Fix flaky
WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets
Key: FLINK-39481
URL: https://issues.apache.org/jira/browse/FLINK-39481
Project: Flink
Issue Type: Bug
Components: Table SQL / Planner
Reporter: featzhang
The test `WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets` is
flaky
and fails intermittently in CI due to a race condition in the test framework
itself.
1. Failure Evidence
Azure Build #74194 (2026-04-17):
- Run 1: PASS
- Run 2: PASS
- Run 3: FAIL ← flaky
- Run 4: PASS
Error message:
org.opentest4j.AssertionFailedError:
Expected 23 rows but got 14 rows.
Missing rows (all related to window [2020-10-10T00:00:30, ...]):
- 0,b,2020-10-10T00:00:30,2020-10-10T00:00:35,1,3.33,3.0,3.0,1
- 0,b,2020-10-10T00:00:30,2020-10-10T00:00:40,1,3.33,3.0,3.0,1
- 0,b,2020-10-10T00:00:30,2020-10-10T00:00:45,1,3.33,3.0,3.0,1
- 0,null,2020-10-10T00:00:30,2020-10-10T00:00:35,1,7.77,7.0,7.0,0
- 0,null,2020-10-10T00:00:30,2020-10-10T00:00:40,1,7.77,7.0,7.0,0
- 0,null,2020-10-10T00:00:30,2020-10-10T00:00:45,1,7.77,7.0,7.0,0
- 1,null,2020-10-10T00:00:30,2020-10-10T00:00:35,2,11.10,7.0,3.0,1
- 1,null,2020-10-10T00:00:30,2020-10-10T00:00:40,2,11.10,7.0,3.0,1
- 1,null,2020-10-10T00:00:30,2020-10-10T00:00:45,2,11.10,7.0,3.0,1
2. Root Cause Analysis
The test uses `FailingCollectionSource` with `'failing-source' = 'true'`,
which intentionally throws an artificial exception to trigger
checkpoint-restore path.
The source emits 11 rows with an event at timestamp `2020-10-10 00:00:32` and
`2020-10-10 00:00:34` at the end of the dataset.
The flakiness stems from a race condition in `FailingCollectionSource`:
1. The source uses a checkpoint-triggered failure mechanism. When the failure
occurs, only the rows emitted BEFORE the checkpoint are guaranteed to be
replayed after restore.
2. The last 2 rows (timestamps 00:00:32 and 00:00:34) advance the watermark
past
`00:00:31`, triggering the cumulate windows ending at 00:00:35/40/45.
If the failure and restore happen AFTER the source emits rows
at 00:00:32/00:00:34 but BEFORE the downstream operators flush their state,
those late-window results can be lost.
3. Specifically, the issue is that `FailingCollectionSource` uses a legacy
`SourceFunction` API with `ctx.getCheckpointLock()`. The timing of when
`numSuccessfulCheckpoints >= 1 && lastCheckpointedEmittedNum >= 1`
is satisfied relative to when the last 2 rows are emitted is non-deterministic,
meaning sometimes the last 2 rows are emitted before the artificial failure is
triggered (correct case), and sometimes after (data loss case).
3. Impact
- Flink CI blocked intermittently on the `test_ci table` job
- 3 out of 4 retry runs pass, demonstrating it's a classic flaky test pattern
- Affects: WindowDistinctAggregateITCase (CUMULATE window with GROUPING
SETS/CUBE/ROLLUP)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)