[
https://issues.apache.org/jira/browse/FLINK-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126361#comment-16126361
]
Till Rohrmann commented on FLINK-7444:
--------------------------------------
It is problematic if the error handler tries to stop the failing
{{RpcEndpoint}} in a blocking fashion. The endpoint is then effectively
deadlocked, because the actor thread, which is blocked in the handler, never
terminates. We have seen this problem with the {{MiniCluster}}, where an
{{Exception}} thrown at shutdown blocks the actor's main thread while the
{{MiniCluster}} is shutting down and waiting for the {{ActorSystem}} to
terminate.
I think the underlying problem is that one does not know what happens outside
of the {{RpcEndpoint}}'s main thread, and the idea was to guard against this by
making the calls asynchronous. I see the point that one would want to react
quickly to fatal errors. Maybe the problem is that we are also abusing the
{{FatalErrorHandler}} for non-fatal errors (e.g. more like an uncaught
exception handler). Maybe we can introduce different failure cases, but then
one shouldn't perform any blocking operations that require the {{RpcEndpoint}}
to be terminated in the fatal error case.
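To illustrate the difference between calling the handler directly and handing
it off to an {{Executor}}, here is a minimal, self-contained sketch. It is not
Flink code: {{ErrorHandler}}, {{NonBlockingCallSketch}} and {{runScenario}} are
hypothetical names, a single-threaded executor stands in for the endpoint's
main thread, and a latch stands in for "the endpoint has terminated". The
handler uses a timeout so that the blocking case ends instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class NonBlockingCallSketch {

    /** Hypothetical stand-in for a fatal error handler; not Flink's interface. */
    interface ErrorHandler {
        void onFatalError(Throwable error);
    }

    /**
     * Simulates an endpoint whose main thread reports an error and then terminates.
     * Returns true if the handler observed the termination, false if it timed out.
     */
    static boolean runScenario(boolean handOff) {
        // Assumption: one thread models the endpoint's main (actor) thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Assumption: a separate pool models the executor for external calls.
        ExecutorService externalCalls = Executors.newCachedThreadPool();
        CountDownLatch endpointTerminated = new CountDownLatch(1);
        CompletableFuture<Boolean> handlerSawTermination = new CompletableFuture<>();

        // A handler that blocks until the endpoint has terminated (bounded by a timeout).
        ErrorHandler blockingHandler = error -> {
            try {
                handlerSawTermination.complete(
                        endpointTerminated.await(300, TimeUnit.MILLISECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                handlerSawTermination.complete(false);
            }
        };

        mainThread.execute(() -> {
            Throwable failure = new IllegalStateException("simulated failure");
            if (handOff) {
                // Non-blocking: the handler runs elsewhere, the main thread moves on.
                externalCalls.execute(() -> blockingHandler.onFatalError(failure));
            } else {
                // Blocking: the main thread is stuck inside the handler, which in
                // turn waits for the main thread to finish.
                blockingHandler.onFatalError(failure);
            }
            // Only reached once the main thread is free again.
            endpointTerminated.countDown();
        });

        try {
            return handlerSawTermination.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("scenario did not finish", e);
        } finally {
            mainThread.shutdownNow();
            externalCalls.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("hand-off to executor: " + runScenario(true));
        System.out.println("direct call:          " + runScenario(false));
    }
}
```

With the hand-off the handler sees the termination; with the direct call it
only returns because of its timeout. Without that timeout the direct call
would block forever, which is the deadlock described above.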
> Make external calls non-blocking
> --------------------------------
>
> Key: FLINK-7444
> URL: https://issues.apache.org/jira/browse/FLINK-7444
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.4.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Labels: flip-6
>
> All external calls from an {{RpcEndpoint}} can potentially block, e.g.
> calls to the {{FatalErrorHandler}}. Therefore, I propose to make all these
> calls coming from the {{RpcEndpoint}}'s main thread non-blocking by running
> them in an {{Executor}}. That way the main thread will never be blocked.