[
https://issues.apache.org/jira/browse/FLINK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann closed FLINK-4711.
--------------------------------
Resolution: Fixed
Fixed via 477d1c5d4ca6f469b3c87bc1f7962ece805cae1d
> TaskManager can crash due to failing onPartitionStateUpdate call
> ----------------------------------------------------------------
>
> Key: FLINK-4711
> URL: https://issues.apache.org/jira/browse/FLINK-4711
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Fix For: 1.2.0
>
>
> The {{TaskManager}} can crash because it calls
> {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}}
> message. The {{onPartitionStateUpdate}} method can throw an {{IOException}}
> or {{InterruptedException}} which are not handled on the {{TaskManager}}
> level.
> Another problem is that the initial partition state request is triggered
> within the {{SingleInputGate}}. The request causes the {{JobManager}} to send
> a {{PartitionState}} message to the {{TaskManager}} which forwards it to the
> {{Task}}. If the at any of these points a message gets lost, then it is not
> retried and the partition state remains unknown.
> In order to handle the exceptions, to make the data flow clearer and to add
> automatic retries, I propose to let the {{Task}} send the partition state
> check requests. Furthermore, the {{JobManager}} should directly answer to the
> {{Task}} by replying to an ask operation. That way the message does not have
> to be routed through the {{TaskManager}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)