[
https://issues.apache.org/jira/browse/FLINK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532964#comment-15532964
]
ASF GitHub Bot commented on FLINK-4711:
---------------------------------------
GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/2569
[FLINK-4711] Let the Task trigger partition state requests and handle their
responses
This PR changes the partition state check so that the Task, rather than
the SingleInputGate, is responsible for triggering it. Furthermore, the
operation returns a future containing the JobManager's answer. That way we
don't have to route the response through the TaskManager and can add
automatic retries in case of a timeout.
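
The future-plus-retry idea described above can be sketched roughly as follows. This is a minimal illustration using the JDK's CompletableFuture, not the code from the PR: the class name, the PartitionState enum values, and the requestWithRetries helper are all hypothetical, and the actual change may rely on Flink's own future and RPC utilities instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Flink code base.
final class PartitionStateRetrySketch {

    /** Assumed stand-in for the JobManager's answer. */
    enum PartitionState { RUNNING, FINISHED, UNKNOWN }

    /**
     * Sends a request and retries it when no answer arrives within the timeout.
     * Only timeouts are retried; any other failure is passed on to the caller.
     */
    static <T> CompletableFuture<T> requestWithRetries(
            Supplier<CompletableFuture<T>> request,
            int retriesLeft,
            long timeoutMillis,
            ScheduledExecutorService timer) {

        CompletableFuture<T> result = new CompletableFuture<>();
        CompletableFuture<T> attempt = request.get();

        // Fail this attempt after the timeout; this is a no-op if it already completed.
        timer.schedule(
                () -> attempt.completeExceptionally(
                        new TimeoutException("no answer within " + timeoutMillis + " ms")),
                timeoutMillis,
                TimeUnit.MILLISECONDS);

        attempt.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (error instanceof TimeoutException && retriesLeft > 0) {
                // Re-send the request and forward the outcome of the retry.
                requestWithRetries(request, retriesLeft - 1, timeoutMillis, timer)
                        .whenComplete((v, e) -> {
                            if (e == null) {
                                result.complete(v);
                            } else {
                                result.completeExceptionally(e);
                            }
                        });
            } else {
                result.completeExceptionally(error);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();

        // Simulated JobManager: the first request is lost, the second is answered.
        Supplier<CompletableFuture<PartitionState>> request = () -> {
            if (attempts.incrementAndGet() == 1) {
                return new CompletableFuture<PartitionState>();
            }
            return CompletableFuture.completedFuture(PartitionState.FINISHED);
        };

        PartitionState state = requestWithRetries(request, 3, 100, timer).join();
        System.out.println(state + " after " + attempts.get() + " attempts");
        timer.shutdownNow();
    }
}
```

Because the caller holds the future, a lost message shows up as a timeout on the requesting side, where it can be retried or turned into a task failure.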
The PR removes the JobManagerCommunicationFactory and gets rid of the
excessive PartitionStateChecker and ResultPartitionConsumableNotifier
creation. Instead of creating one PartitionStateChecker for each
SingleInputGate, we create one for the TaskManager, which is reused across
all SingleInputGates. The same applies to the
ResultPartitionConsumableNotifier.
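
To illustrate the sharing described above, here is a small sketch in which one checker instance is created once and handed to every gate. The interface is a simplified, hypothetical version with an assumed single-argument method; it is not Flink's actual PartitionStateChecker signature, and InputGate is only a stand-in for SingleInputGate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; signatures are simplified assumptions, not Flink's API.
final class SharedCheckerSketch {

    /** Simplified stand-in for the checker interface. */
    interface PartitionStateChecker {
        void triggerPartitionStateCheck(String partitionId);
    }

    /** Stand-in for a SingleInputGate that receives its checker from outside. */
    static final class InputGate {
        private final PartitionStateChecker checker;

        InputGate(PartitionStateChecker checker) {
            this.checker = checker;
        }

        void requestState(String partitionId) {
            checker.triggerPartitionStateCheck(partitionId);
        }

        PartitionStateChecker checker() {
            return checker;
        }
    }

    /** Builds all gates around one shared checker instead of one checker per gate. */
    static List<InputGate> createGates(int count, PartitionStateChecker shared) {
        List<InputGate> gates = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            gates.add(new InputGate(shared));
        }
        return gates;
    }

    public static void main(String[] args) {
        AtomicInteger requests = new AtomicInteger();
        // Created once, at what would be the TaskManager level.
        PartitionStateChecker shared = id -> requests.incrementAndGet();

        List<InputGate> gates = createGates(3, shared);
        for (InputGate gate : gates) {
            gate.requestState("partition-0");
        }
        System.out.println("requests sent through one checker: " + requests.get());
    }
}
```

Passing the checker in through the constructor keeps the gates unaware of how the JobManager is reached, which is what makes a single shared instance possible.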
This PR is also a simplification for the Flip-6 implementation.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink fixOnUpdatePartitionState
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2569.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2569
----
commit eefd4ee31633656d134078503a60f43e14806311
Author: Till Rohrmann <[email protected]>
Date: 2016-09-29T14:19:30Z
[FLINK-4711] Let the Task trigger partition state requests and handle their
responses
This PR changes the partition state check so that the Task, rather than
the SingleInputGate, is responsible for triggering it. Furthermore, the
operation returns a future containing the JobManager's answer. That way we
don't have to route the response through the TaskManager and can add
automatic retries in case of a timeout.
The PR removes the JobManagerCommunicationFactory and gets rid of the
excessive PartitionStateChecker and ResultPartitionConsumableNotifier
creation. Instead of creating one PartitionStateChecker for each
SingleInputGate, we create one for the TaskManager, which is reused across
all SingleInputGates. The same applies to the
ResultPartitionConsumableNotifier.
----
> TaskManager can crash due to failing onPartitionStateUpdate call
> ----------------------------------------------------------------
>
> Key: FLINK-4711
> URL: https://issues.apache.org/jira/browse/FLINK-4711
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Fix For: 1.2.0
>
>
> The {{TaskManager}} can crash because it calls
> {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}}
> message. The {{onPartitionStateUpdate}} method can throw an {{IOException}}
> or {{InterruptedException}}, neither of which is handled at the
> {{TaskManager}} level.
> Another problem is that the initial partition state request is triggered
> within the {{SingleInputGate}}. The request causes the {{JobManager}} to send
> a {{PartitionState}} message to the {{TaskManager}} which forwards it to the
> {{Task}}. If a message gets lost at any of these points, it is not
> retried and the partition state remains unknown.
> In order to handle the exceptions, to make the data flow clearer and to add
> automatic retries, I propose to let the {{Task}} send the partition state
> check requests. Furthermore, the {{JobManager}} should answer the
> {{Task}} directly by replying to an ask operation. That way the message
> does not have to be routed through the {{TaskManager}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)