[ 
https://issues.apache.org/jira/browse/FLINK-38613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059412#comment-18059412
 ] 

Roman Khachatryan edited comment on FLINK-38613 at 2/18/26 6:26 PM:
--------------------------------------------------------------------

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=72469&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=9d734c8c-6253-55e6-3bce-47e7cdf68ac4&l=42325

While debugging this issue, I found that:
1. The test fails because some (source) tasks finish after recovery.
2. They finish because they receive no splits.
3. They receive no splits because the source range is partitioned into fewer 
splits than there are subtasks (not always; it depends on scheduling).
4. So the Source Coordinator sends NoMoreSplits to some subtasks, and if 
waitForAllTaskRunning observes those finished tasks, the test fails.
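
To make steps 2-4 concrete, here is a toy model in plain Java. It is not Flink code: the class and method names are hypothetical, and the fixed assignment order is an assumption made only for illustration (in the real job, which subtasks end up without a split depends on scheduling).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model (hypothetical, not Flink code) of subtasks asking a coordinator for splits. */
final class SplitStarvationSketch {

    /** Returns the indices of the subtasks that get no split and therefore finish at once. */
    static List<Integer> finishedSubtasks(int numSplits, int parallelism) {
        // The coordinator holds a fixed pool of splits.
        Deque<Integer> pendingSplits = new ArrayDeque<>();
        for (int i = 0; i < numSplits; i++) {
            pendingSplits.add(i);
        }

        List<Integer> finished = new ArrayList<>();
        // Assumption: subtasks request splits in index order; real ordering is not deterministic.
        for (int subtask = 0; subtask < parallelism; subtask++) {
            Integer split = pendingSplits.poll();
            if (split == null) {
                // Stands in for the NoMoreSplits signal: nothing to read, so the subtask finishes.
                finished.add(subtask);
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        // 3 splits for 5 subtasks: two subtasks are left without work.
        System.out.println(finishedSubtasks(3, 5)); // prints [3, 4]
    }
}
```

With as many splits as subtasks, the returned list is empty, which is the situation the test implicitly expects.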

To reproduce the failure reliably, add a 1-2s sleep before the 
waitForAllTaskRunning call that follows recovery.

As a fix, I see the following options:
1. Don't fail if some tasks are finished - but then we lose some test coverage 
(lower parallelism when taking a checkpoint)
2. Change the test to use a truly unbounded source (instead of Long MIN ... 
MAX)
3. Force a minimum number of splits in NumberSequenceSource
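
As a rough sketch of what option 3 could look like, the range is split into at least minSplits pieces regardless of the current parallelism. This is not the NumberSequenceSource implementation: the class, the method, and the minSplits parameter are hypothetical. The sketch also assumes the element count fits in a long, so the full Long MIN ... MAX range is out of scope here (it throws ArithmeticException).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of option 3; not the NumberSequenceSource implementation. */
final class MinSplitsSketch {

    /**
     * Splits the inclusive range [from, to] into max(parallelism, minSplits) contiguous
     * ranges, capped by the number of elements. Each result entry is {start, end}, inclusive.
     */
    static List<long[]> split(long from, long to, int parallelism, int minSplits) {
        // Throws ArithmeticException if the element count does not fit in a long.
        long count = Math.addExact(Math.subtractExact(to, from), 1L);
        if (count <= 0 || parallelism < 1) {
            throw new IllegalArgumentException("empty range or invalid parallelism");
        }

        int numSplits = (int) Math.min(count, Math.max(parallelism, minSplits));
        long base = count / numSplits;
        long remainder = count % numSplits;

        List<long[]> result = new ArrayList<>(numSplits);
        long start = from;
        for (int i = 0; i < numSplits; i++) {
            // The first `remainder` splits take one extra element each.
            long size = base + (i < remainder ? 1 : 0);
            result.add(new long[] {start, start + size - 1});
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // Parallelism 2 at split time, but at least 8 splits are produced.
        System.out.println(split(0, 99, 2, 8).size()); // prints 8
    }
}
```

The idea is that if minSplits is at least the highest parallelism the test rescales to, every subtask can still receive a split after recovery. Whether that holds in the actual test depends on how splits are checkpointed and reassigned, which this sketch does not model.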

Here's the PR to illustrate it: [https://github.com/apache/flink/pull/27635]

[~fanrui] do you have any other ideas?



> UnalignedCheckpointRescaleWithMixedExchangesITCase.testRescaleFromUnalignedCheckpoint
>  failed in test_cron_jdk11 tests
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-38613
>                 URL: https://issues.apache.org/jira/browse/FLINK-38613
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Priority: Major
>              Labels: pull-request-available
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70660&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=d102aafb-3bbd-55e4-a35f-e8935afffc31



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
