[
https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585320#comment-15585320
]
ASF GitHub Bot commented on FLINK-4715:
---------------------------------------
Github user StephanEwen commented on a diff in the pull request:
https://github.com/apache/flink/pull/2652#discussion_r83840545
--- Diff:
flink-runtime/src/test/java/org/apache/flink/runtime/taskmanager/TaskTest.java
---
@@ -565,6 +568,128 @@ public void testOnPartitionStateUpdate() throws
Exception {
verify(inputGate,
times(1)).retriggerPartitionRequest(eq(partitionId.getPartitionId()));
}
+ /**
+ * Tests that interrupt happens via watch dog if canceller is stuck in
cancel.
+ * Task cancellation blocks the task canceller. Interrupt after cancel
via
+ * cancellation watch dog.
+ */
+ @Test
+ public void testWatchDogInterruptsTask() throws Exception {
+ Configuration config = new Configuration();
+ config.setLong(TaskOptions.CANCELLATION_INTERVAL.key(), 5);
+ config.setLong(TaskOptions.CANCELLATION_TIMEOUT.key(), 50);
+
+ Task task = createTask(InvokableBlockingInCancel.class, config);
+ task.startTaskThread();
+
+ awaitLatch.await();
+
+ task.cancelExecution();
+
+ triggerLatch.await();
+
+ // No fatal error
+ for (Object msg : taskManagerMessages) {
+ assertEquals(false, msg instanceof
TaskManagerMessages.FatalError);
+ }
+ }
+
+ /**
+ * The invoke() method holds a lock (trigger awaitLatch after
acquisition)
+ * and cancel cannot complete because it also tries to acquire the same
lock.
+ * This is resolved by the watch dog, no fatal error.
+ */
+ @Test
+ public void testInterruptableSharedLockInInvokeAndCancel() throws
Exception {
+ Configuration config = new Configuration();
+ config.setLong(TaskOptions.CANCELLATION_INTERVAL.key(), 5);
+ config.setLong(TaskOptions.CANCELLATION_TIMEOUT.key(), 50);
+
+ Task task =
createTask(InvokableInterruptableSharedLockInInvokeAndCancel.class, config);
+ task.startTaskThread();
+
+ awaitLatch.await();
+
+ task.cancelExecution();
+
+ triggerLatch.await();
--- End diff --
how about adding `task.getExecutingThread().join()` instead of the using
the trigger latch? Seems more intuitive and safer.
> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>
> Key: FLINK-4715
> URL: https://issues.apache.org/jira/browse/FLINK-4715
> Project: Flink
> Issue Type: Improvement
> Components: TaskManager
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Assignee: Ufuk Celebi
> Fix For: 1.2.0
>
>
> In case of a failed cancellation, e.g. the task cannot be cancelled after a
> given time, the {{TaskManager}} should kill itself. That way we guarantee
> that there is no resource leak.
> This behaviour acts as a safety-net against faulty user code.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)