[
https://issues.apache.org/jira/browse/FLINK-38613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059412#comment-18059412
]
Roman Khachatryan edited comment on FLINK-38613 at 2/18/26 6:26 PM:
--------------------------------------------------------------------
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=72469&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=9d734c8c-6253-55e6-3bce-47e7cdf68ac4&l=42325
I was debugging this issue and discovered that:
1. The test fails because some (source) tasks finish after recovery
2. They finish because there are no splits
3. There are no splits for them because the source range is partitioned into
fewer splits than there are subtasks (not always, depending on scheduling)
4. So the Source Coordinator sends NoMoreSplits to those subtasks, and if
waitForAllTaskRunning observes the finished tasks, the test fails
To make it fail reliably, we need to sleep 1-2s before waitForAllTaskRunning
after recovery.
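Points 1-4 can be illustrated with a small standalone simulation. This is only a sketch and not Flink code: the class and method names are hypothetical, the counts (4 splits, 7 subtasks) are made up, and subtasks are served in index order, whereas in the real job the order depends on scheduling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the split assignment described above; not Flink code. */
final class SplitAssignmentSketch {

    /**
     * Hands out one split per requesting subtask and returns the indices of the
     * subtasks that get nothing. Those are the ones that would be told
     * "no more splits" and finish.
     */
    static List<Integer> subtasksWithoutSplit(int numSplits, int numSubtasks) {
        List<Integer> idle = new ArrayList<>();
        int remaining = numSplits;
        // Assumption: subtasks request splits in index order (simplification).
        for (int subtask = 0; subtask < numSubtasks; subtask++) {
            if (remaining > 0) {
                remaining--; // this subtask receives a split and keeps running
            } else {
                idle.add(subtask); // no split left: this subtask would finish
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // Made-up numbers: 4 splits for 7 subtasks after rescaling.
        System.out.println(subtasksWithoutSplit(4, 7)); // prints [4, 5, 6]
    }
}
```

Any check that requires all subtasks to be running would trip over the three finished subtasks in this example.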
As a fix, I see the following options:
1. Don't fail if some tasks are finished - but then we lose some test coverage
(lower parallelism when taking a checkpoint)
2. Change the test to use some truly unbounded source (instead of Long MIN ...
MAX)
3. Force some minimum number of splits in NumberSequenceSource
Here's the PR to illustrate it: [https://github.com/apache/flink/pull/27635]
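Independently of the linked PR (whose contents are not reproduced here), the idea behind option 3 could be sketched roughly as follows. This is a standalone illustration, not the actual NumberSequenceSource code: the class, the method, and the minSplits parameter are all hypothetical, and BigInteger is used only so that the size of the full Long range does not overflow.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "force a minimum number of splits"; not Flink code. */
final class MinSplitsSketch {

    /** Splits [from, to] (inclusive) into max(parallelism, minSplits) contiguous ranges. */
    static List<long[]> split(long from, long to, int parallelism, int minSplits) {
        if (from > to) {
            throw new IllegalArgumentException("from must be <= to");
        }
        int numSplits = Math.max(Math.max(parallelism, minSplits), 1);
        // to - from + 1 overflows a long for the full Long range, so use BigInteger.
        BigInteger total =
                BigInteger.valueOf(to).subtract(BigInteger.valueOf(from)).add(BigInteger.ONE);
        // Never create more splits than there are elements.
        if (total.compareTo(BigInteger.valueOf(numSplits)) < 0) {
            numSplits = total.intValueExact();
        }
        BigInteger[] quotientAndRemainder = total.divideAndRemainder(BigInteger.valueOf(numSplits));
        BigInteger baseSize = quotientAndRemainder[0];
        int largerSplits = quotientAndRemainder[1].intValue(); // first splits get one extra element

        List<long[]> splits = new ArrayList<>();
        BigInteger start = BigInteger.valueOf(from);
        for (int i = 0; i < numSplits; i++) {
            BigInteger size = i < largerSplits ? baseSize.add(BigInteger.ONE) : baseSize;
            BigInteger end = start.add(size).subtract(BigInteger.ONE);
            splits.add(new long[] {start.longValueExact(), end.longValueExact()});
            start = end.add(BigInteger.ONE);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up numbers: parallelism 2, but at least 8 splits are forced.
        System.out.println(split(Long.MIN_VALUE, Long.MAX_VALUE, 2, 8).size()); // prints 8
    }
}
```

With a minimum chosen at least as large as the highest parallelism the test rescales to, every subtask could in principle receive a split. Whether that holds in the real job still depends on how the coordinator assigns them.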
[~fanrui] do you have any other ideas?
> UnalignedCheckpointRescaleWithMixedExchangesITCase.testRescaleFromUnalignedCheckpoint
> failed in test_cron_jdk11 tests
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-38613
> URL: https://issues.apache.org/jira/browse/FLINK-38613
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Priority: Major
> Labels: pull-request-available
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70660&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=d102aafb-3bbd-55e4-a35f-e8935afffc31