Maximilian Michels created FLINK-16511:
------------------------------------------
Summary: Task cancellation timeout is not effective on OOM errors
Key: FLINK-16511
URL: https://issues.apache.org/jira/browse/FLINK-16511
Project: Flink
Issue Type: Bug
Components: Runtime / Task
Reporter: Maximilian Michels
Under high memory pressure, the task manager shutdown on fatal errors is not
reliable:
If a task does not cooperate and cannot be canceled and there is an OOM when
starting the task cancellation watchdog thread, the exception is not propagated
correctly. The job manager retries the cancelTask() request multiple times, but
the operation is stateful: the task already switches to the CANCELING state
before the watchdog thread is started, so if starting the thread fails, a retry
won't attempt it again.
Such fatal errors should automatically shut down the task manager without a
retry from the job manager side.
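
The failure mode described above can be illustrated with a minimal,
self-contained sketch. This is not Flink's actual code: CancelSketch,
SketchTask, cancelTaskFailFast() and the fatalErrorReported flag are
hypothetical names, and the OOM is simulated by throwing OutOfMemoryError
from a Runnable that stands in for starting the watchdog thread. The
fail-fast variant only shows one possible reading of the proposed behavior.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; names are illustrative and not Flink's actual classes.
final class CancelSketch {

    enum State { RUNNING, CANCELING }

    static final class SketchTask {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        final Runnable watchdogStarter; // stands in for starting the watchdog thread
        int watchdogStartAttempts = 0;
        boolean fatalErrorReported = false;

        SketchTask(Runnable watchdogStarter) {
            this.watchdogStarter = watchdogStarter;
        }

        // Pattern described in the report: the state changes first, so a retry
        // after a failed watchdog start finds CANCELING and does nothing.
        void cancelTask() {
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                watchdogStartAttempts++;
                watchdogStarter.run(); // may throw OutOfMemoryError
            }
        }

        // Assumed remedy: treat a failed watchdog start as fatal right away
        // instead of relying on a retry.
        void cancelTaskFailFast() {
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                watchdogStartAttempts++;
                try {
                    watchdogStarter.run();
                } catch (Throwable t) {
                    // A real implementation would shut down the process here.
                    fatalErrorReported = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        Runnable failingStart = () -> { throw new OutOfMemoryError("simulated"); };

        // Three cancel requests, as if the job manager retried.
        SketchTask task = new SketchTask(failingStart);
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                task.cancelTask();
            } catch (OutOfMemoryError e) {
                // Only the first request reaches the watchdog start and fails.
            }
        }
        System.out.println(task.state.get() + " attempts=" + task.watchdogStartAttempts);

        SketchTask failFast = new SketchTask(failingStart);
        failFast.cancelTaskFailFast();
        System.out.println("fatalErrorReported=" + failFast.fatalErrorReported);
    }
}
```

In this sketch the three cancelTask() calls result in a single watchdog start
attempt, and the task stays in CANCELING with no watchdog running. The
fail-fast variant records the fatal error on the first request instead.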
--
This message was sent by Atlassian Jira
(v8.3.4#803005)