featzhang created FLINK-39481:
---------------------------------

             Summary: ⁠[tests] Fix flaky 
WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets⁠ 
                 Key: FLINK-39481
                 URL: https://issues.apache.org/jira/browse/FLINK-39481
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Planner
            Reporter: featzhang


The test `WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets` is 
flaky 
and fails intermittently in CI due to a race condition in the test framework 
itself.

1. Failure Evidence

Azure Build #74194 (2026-04-17):
 - Run 1: PASS
 - Run 2: PASS
 - Run 3: FAIL ← flaky
 - Run 4: PASS

Error message:
org.opentest4j.AssertionFailedError:
Expected 23 rows but got 14 rows.

Missing rows (all related to window [2020-10-10T00:00:30, ...]):
 - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:35,1,3.33,3.0,3.0,1
 - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:40,1,3.33,3.0,3.0,1
 - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:45,1,3.33,3.0,3.0,1
 - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:35,1,7.77,7.0,7.0,0
 - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:40,1,7.77,7.0,7.0,0
 - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:45,1,7.77,7.0,7.0,0
 - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:35,2,11.10,7.0,3.0,1
 - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:40,2,11.10,7.0,3.0,1
 - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:45,2,11.10,7.0,3.0,1

2. Root Cause Analysis

The test uses `FailingCollectionSource` with `'failing-source' = 'true'`, 
which intentionally throws an artificial exception to trigger 
checkpoint-restore path.
The source emits 11 rows with an event at timestamp `2020-10-10 00:00:32` and 
`2020-10-10 00:00:34` at the end of the dataset.

The flakiness stems from a race condition in `FailingCollectionSource`:

1. The source uses a checkpoint-triggered failure mechanism. When the failure 
occurs, only the rows emitted BEFORE the checkpoint are guaranteed to be 
replayed after restore.

2. The last 2 rows (timestamps 00:00:32 and 00:00:34) advance the watermark 
past 
`00:00:31`, triggering the cumulate windows ending at 00:00:35/40/45. 
If the failure and restore happen AFTER the source emits rows 
at 00:00:32/00:00:34 but BEFORE the downstream operators flush their state, 
those late-window results can be lost.

3. Specifically, the issue is that `FailingCollectionSource` uses a legacy 
`SourceFunction` API with `ctx.getCheckpointLock()`. The timing of when 
`numSuccessfulCheckpoints >= 1 && lastCheckpointedEmittedNum >= 1` 
is satisfied relative to when the last 2 rows are emitted is non-deterministic,
meaning sometimes the last 2 rows are emitted before the artificial failure is 
triggered (correct case), and sometimes after (data loss case).

 

3. Impact
 - Flink CI blocked intermittently on the `test_ci table` job
 - 3 out of 4 retry runs pass, demonstrating it's a classic flaky test pattern
 - Affects: WindowDistinctAggregateITCase (CUMULATE window with GROUPING 
SETS/CUBE/ROLLUP)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to