[
https://issues.apache.org/jira/browse/FLINK-38613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059412#comment-18059412
]
Roman Khachatryan commented on FLINK-38613:
-------------------------------------------
While debugging this issue, I found the following:
1. The test fails because some (source) tasks finish after recovery.
2. They finish because there are no splits left for them.
3. There are no splits because the source range is partitioned into fewer
splits than there are subtasks (not always; it depends on scheduling).
4. The Source Coordinator therefore sends NoMoreSplits to some subtasks, and if
waitForAllTaskRunning observes the finished tasks, the test fails.
To make it fail reliably, sleep for 1-2s before calling waitForAllTaskRunning
after recovery.
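To illustrate points 3 and 4, here is a toy model in plain Java. It is not Flink code: the class and method names are made up, and it assumes each subtask is handed at most one split, in subtask-index order. In the real job the order depends on scheduling, which is why the failure is intermittent.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Flink code: hands out at most one split per subtask. */
public class SplitStarvationSketch {

    /** Returns the indices of subtasks that are left without a split. */
    static List<Integer> subtasksWithoutSplits(int numSplits, int parallelism) {
        List<Integer> starved = new ArrayList<>();
        int remaining = numSplits;
        for (int subtask = 0; subtask < parallelism; subtask++) {
            if (remaining > 0) {
                remaining--; // this subtask receives a split and keeps running
            } else {
                starved.add(subtask); // nothing left: it is told NoMoreSplits and finishes
            }
        }
        return starved;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 3 splits available, 5 subtasks after recovery.
        System.out.println(subtasksWithoutSplits(3, 5)); // prints [3, 4]
    }
}
```

In this model, any subtask in the returned list would reach FINISHED shortly after recovery, and a check that expects all tasks to be RUNNING would then fail.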
As a fix, I see the following options:
1. Don't fail if some tasks are finished, but then we lose some test coverage
(lower parallelism when taking a checkpoint)
2. Change the test to use some truly unbounded source (instead of Long MIN ...
MAX)
3. Force some minimum number of splits in NumberSequenceSource
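Option 3 could look roughly like the sketch below. It is a standalone illustration, not the actual NumberSequenceSource code: the class name, the split method, and the minSplits parameter are hypothetical, and it assumes the range size fits in a long (it throws ArithmeticException for the full Long MIN ... MAX range instead of handling the overflow).

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch, not Flink code: range splitting with a lower bound on the split count. */
public class MinSplitsSketch {

    /**
     * Splits the inclusive range [from, to] into max(parallelism, minSplits)
     * contiguous sub-ranges, or fewer if the range has fewer elements.
     */
    static List<long[]> split(long from, long to, int parallelism, int minSplits) {
        if (to < from) {
            throw new IllegalArgumentException("empty range");
        }
        // Fails fast on overflow rather than silently wrapping.
        long total = Math.addExact(Math.subtractExact(to, from), 1);
        int numSplits = (int) Math.min(total, Math.max(parallelism, minSplits));
        long base = total / numSplits;
        long remainder = total % numSplits;

        List<long[]> splits = new ArrayList<>();
        long start = from;
        for (int i = 0; i < numSplits; i++) {
            // The first `remainder` splits get one extra element.
            long size = base + (i < remainder ? 1 : 0);
            splits.add(new long[] {start, start + size - 1});
            start += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Parallelism 2 at split time, but at least 4 splits are forced.
        for (long[] s : split(0, 9, 2, 4)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

If minSplits is at least the highest parallelism the test rescales to, every subtask can still get a split after recovery. Whether that bound belongs in the source itself or only in the test setup is left open here.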
[~fanrui] do you have any other ideas?
> UnalignedCheckpointRescaleWithMixedExchangesITCase.testRescaleFromUnalignedCheckpoint
> failed in test_cron_jdk11 tests
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-38613
> URL: https://issues.apache.org/jira/browse/FLINK-38613
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70660&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=d102aafb-3bbd-55e4-a35f-e8935afffc31