zentol commented on a change in pull request #8820: [FLINK-12916][tests] Retry
cancelWithSavepoint on cancellation barrier in AbstractOperatorRestoreTestBase
URL: https://github.com/apache/flink/pull/8820#discussion_r303423389
##########
File path:
flink-tests/src/test/java/org/apache/flink/test/state/operator/restore/AbstractOperatorRestoreTestBase.java
##########
@@ -66,12 +72,16 @@
private static final int NUM_TMS = 1;
private static final int NUM_SLOTS_PER_TM = 4;
private static final Duration TEST_TIMEOUT = Duration.ofSeconds(10000L);
- private static final Pattern PATTERN_CANCEL_WITH_SAVEPOINT_TOLERATED_EXCEPTIONS = Pattern
- .compile(
- "(was not running)" +
- "|(Not all required tasks are currently running)" +
- "|(Checkpoint was declined \\(tasks not ready\\))"
- );
+
+ private static final Pattern PATTERN_CANCEL_WITH_SAVEPOINT_TOLERATED_EXCEPTIONS = Pattern.compile(
+ Stream.of(
+ TRIGGER_SAVEPOINT_FAILURE.message(),
+ NOT_ALL_REQUIRED_TASKS_RUNNING.message(),
+ CHECKPOINT_DECLINED_TASK_NOT_READY.message(),
+ // If task already in state RUNNING while stream task not running, stream task would then broadcast barrier.
+ CHECKPOINT_DECLINED_ON_CANCELLATION_BARRIER.message())
Review comment:
Here's the thing: this case shouldn't be possible in the first place. For a
cancel-with-savepoint, we
* disable the checkpoint coordinator
* trigger a savepoint, and
* once the savepoint completes (successfully!), cancel all tasks.
Given that we only cancel tasks if the savepoint has completed, and the
savepoint can only complete if all tasks are running, I don't see how we can
ever be in a situation where we try to cancel while not all tasks are in a
running state.
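
The ordering argument above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Flink's actual code: `FakeJob`, its methods, the log entries, and the `"savepoint-path"` value are all hypothetical stand-ins. The one piece taken from the diff is the "Not all required tasks are currently running" message. The sketch chains the cancel step onto the savepoint future with `thenApply`, so the cancel runs only if the savepoint completed successfully.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class CancelWithSavepointSketch {

    /** Hypothetical stand-in for a job; it only records which steps ran, in order. */
    public static final class FakeJob {
        public final List<String> log = new ArrayList<>();
        private final boolean allTasksRunning;

        public FakeJob(boolean allTasksRunning) {
            this.allTasksRunning = allTasksRunning;
        }

        public void disableCheckpointCoordinator() {
            log.add("disable-coordinator");
        }

        /** Assumption: the savepoint fails unless every task is running. */
        public CompletableFuture<String> triggerSavepoint() {
            log.add("trigger-savepoint");
            if (!allTasksRunning) {
                CompletableFuture<String> failed = new CompletableFuture<>();
                failed.completeExceptionally(new IllegalStateException(
                        "Not all required tasks are currently running"));
                return failed;
            }
            return CompletableFuture.completedFuture("savepoint-path");
        }

        public void cancelAllTasks() {
            log.add("cancel-tasks");
        }
    }

    /** The three steps from the comment: disable, trigger, cancel only on success. */
    public static CompletableFuture<String> cancelWithSavepoint(FakeJob job) {
        job.disableCheckpointCoordinator();
        // thenApply is skipped when the savepoint future completes exceptionally,
        // so cancelAllTasks() is never reached after a failed savepoint.
        return job.triggerSavepoint().thenApply(path -> {
            job.cancelAllTasks();
            return path;
        });
    }
}
```

In this model a job whose tasks are not all running never reaches the cancel step, which is the invariant the comment relies on. Whether the real code path can violate it is exactly the open question.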