[
https://issues.apache.org/jira/browse/FLINK-39481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086922#comment-18086922
]
Martijn Visser commented on FLINK-39481:
----------------------------------------
Like I said, I don't think it's a test instability. In the failing runs the
assertion fails with fewer rows than expected, e.g. {{testCumulateWindow}} in
build 75662 produced 14 expected / 8 actual, with the entire trailing
{{[2020-10-10T00:00:30, *]}} cumulate window family missing. Those windows are
emitted only by the end-of-input {{MAX_WATERMARK}} after the checkpoint+restore
cycle. A non-transactional sink or a flaky harness can only ever produce
extra/duplicate rows or pass; it cannot make committed window results
disappear.
A list of failing runs:
Default scheduler:
- testHopWindow_GroupingSets:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75699&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96
- testCumulateWindow_Rollup, testTumbleWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75649&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96
- testCumulateWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75587&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96
- testCumulateWindow_Rollup (jdk21):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75582&view=logs&j=26b84117-e436-5720-913e-3e280ce55cae&t=5f4830b0-feb5-5f50-3884-17ef62db4a16
- testCumulateWindow_GroupingSets:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75573&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96
- testCumulateWindow_GroupingSets (PR/push CI):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75611&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96
- testCascadingTumbleWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75458&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96
- testCumulateWindow (hadoop313):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75390&view=logs&j=de826397-1924-5900-0034-51895f69d4b7&t=44427b29-c7bf-5161-0204-10ade0d21012
Adaptive scheduler:
- testCascadingTumbleWindow_Cube, testCumulateWindow, testCumulateWindow_Cube,
testHopWindow, testHopWindow_Rollup, testTumbleWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75760&view=logs&j=f2c100be-250b-5e85-7bbe-176f68fcddc5&t=4d8dc517-8416-58e1-9186-75f3b7856333
- testCascadingTumbleWindow_Cube, testCumulateWindow, testHopWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75662&view=logs&j=f2c100be-250b-5e85-7bbe-176f68fcddc5&t=4d8dc517-8416-58e1-9186-75f3b7856333
- testCascadingTumbleWindow_GroupingSets, testCumulateWindow_Rollup,
testHopWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75582&view=logs&j=f2c100be-250b-5e85-7bbe-176f68fcddc5&t=4d8dc517-8416-58e1-9186-75f3b7856333
- testCumulateWindow_GroupingSets, testHopWindow_Cube:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75458&view=logs&j=f2c100be-250b-5e85-7bbe-176f68fcddc5&t=4d8dc517-8416-58e1-9186-75f3b7856333
- testCumulateWindow_GroupingSets, testHopWindow_Cube, testTumbleWindow_Cube,
testTumbleWindow_Rollup:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75433&view=logs&j=f2c100be-250b-5e85-7bbe-176f68fcddc5&t=4d8dc517-8416-58e1-9186-75f3b7856333
- testCascadingTumbleWindow, testCascadingTumbleWindow_Cube, testHopWindow,
testHopWindow_Rollup, testTumbleWindow_GroupingSets, testTumbleWindow_Rollup:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75390&view=logs&j=f2c100be-250b-5e85-7bbe-176f68fcddc5&t=4d8dc517-8416-58e1-9186-75f3b7856333
> [tests] Fix flaky
> WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets
> ----------------------------------------------------------------------------------
>
> Key: FLINK-39481
> URL: https://issues.apache.org/jira/browse/FLINK-39481
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / Planner
> Reporter: featzhang
> Priority: Major
> Labels: pull-request-available
>
> The test `WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets` is
> flaky
> and fails intermittently in CI due to a race condition in the test framework
> itself.
> 1. Failure Evidence
> Azure Build #74194 (2026-04-17):
> - Run 1: PASS
> - Run 2: PASS
> - Run 3: FAIL ← flaky
> - Run 4: PASS
> Error message:
> org.opentest4j.AssertionFailedError:
> Expected 23 rows but got 14 rows.
> Missing rows (all related to window [2020-10-10T00:00:30, ...]):
> - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:35,1,3.33,3.0,3.0,1
> - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:40,1,3.33,3.0,3.0,1
> - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:45,1,3.33,3.0,3.0,1
> - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:35,1,7.77,7.0,7.0,0
> - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:40,1,7.77,7.0,7.0,0
> - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:45,1,7.77,7.0,7.0,0
> - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:35,2,11.10,7.0,3.0,1
> - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:40,2,11.10,7.0,3.0,1
> - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:45,2,11.10,7.0,3.0,1
> 2. Root Cause Analysis
> The test uses `FailingCollectionSource` with `'failing-source' = 'true'`,
> which intentionally throws an artificial exception to trigger
> checkpoint-restore path.
> The source emits 11 rows with an event at timestamp `2020-10-10 00:00:32` and
> `2020-10-10 00:00:34` at the end of the dataset.
> The flakiness stems from a race condition in `FailingCollectionSource`:
> 1. The source uses a checkpoint-triggered failure mechanism. When the failure
> occurs, only the rows emitted BEFORE the checkpoint are guaranteed to be
> replayed after restore.
> 2. The last 2 rows (timestamps 00:00:32 and 00:00:34) advance the watermark
> past
> `00:00:31`, triggering the cumulate windows ending at 00:00:35/40/45.
> If the failure and restore happen AFTER the source emits rows
> at 00:00:32/00:00:34 but BEFORE the downstream operators flush their state,
> those late-window results can be lost.
> 3. Specifically, the issue is that `FailingCollectionSource` uses a legacy
> `SourceFunction` API with `ctx.getCheckpointLock()`. The timing of when
> `numSuccessfulCheckpoints >= 1 && lastCheckpointedEmittedNum >= 1`
> is satisfied relative to when the last 2 rows are emitted is
> non-deterministic,
> meaning sometimes the last 2 rows are emitted before the artificial failure
> is
> triggered (correct case), and sometimes after (data loss case).
>
> 3. Impact
> - Flink CI blocked intermittently on the `test_ci table` job
> - 3 out of 4 retry runs pass, demonstrating it's a classic flaky test pattern
> - Affects: WindowDistinctAggregateITCase (CUMULATE window with GROUPING
> SETS/CUBE/ROLLUP)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)