[
https://issues.apache.org/jira/browse/FLINK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405674#comment-17405674
]
Aitozi commented on FLINK-24021:
--------------------------------
cc [[email protected]]
> Potential job unrecoverable due to Network failure
> --------------------------------------------------
>
> Key: FLINK-24021
> URL: https://issues.apache.org/jira/browse/FLINK-24021
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Aitozi
> Priority: Critical
>
> We currently use ZooKeeper (zk) for leader election and retrieval in HA
> setups, and we register a fatalError handler in leaderElectionService and
> leaderRetrievalService so that the jobManager or taskManager process exits
> when an unexpected error occurs. But we don't do this when calling
> curatorFrameworkClient#start in ZookeeperUtils. This can lead to a sequence
> of failures like the following:
>
> # ZookeeperUtils starts the curator client, but the start fails because of a
> network loss. No exception is thrown, because we do not register an error
> handler.
> # The network recovers by the time the master begins leader election, so the
> election succeeds.
> # The leaderRetrieval begins to work by calling get_data periodically, but
> these calls are never executed, because the curator client failed to start in
> phase 1.
>
> So I think we should register an error handler in phase 1, so that we can
> fail fast.
>
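The fail-fast idea in the description can be sketched as below. This is a minimal, self-contained illustration and not Flink's or Curator's actual code: SketchClient, RecordingFatalErrorHandler, and startClient are hypothetical stand-ins invented for this example. With a real Curator client, the analogous hook would presumably be a listener added through CuratorFramework#getUnhandledErrorListenable() before start() is called, but that wiring is an assumption here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class FailFastStartSketch {

    /** Hypothetical client that reports start failures to listeners instead of throwing. */
    static final class SketchClient {
        private final List<BiConsumer<String, Throwable>> errorListeners = new ArrayList<>();
        private final boolean networkAvailable;
        private boolean started;

        SketchClient(boolean networkAvailable) {
            this.networkAvailable = networkAvailable;
        }

        void addErrorListener(BiConsumer<String, Throwable> listener) {
            errorListeners.add(listener);
        }

        void start() {
            if (!networkAvailable) {
                // The failure is not thrown; without a listener it would go unnoticed.
                Throwable cause = new IOException("connection lost during start");
                for (BiConsumer<String, Throwable> listener : errorListeners) {
                    listener.accept("client start failed", cause);
                }
                return;
            }
            started = true;
        }

        boolean isStarted() {
            return started;
        }
    }

    /** Hypothetical fatal error handler; a real one would terminate the process. */
    static final class RecordingFatalErrorHandler {
        private Throwable fatalError;

        void onFatalError(Throwable error) {
            fatalError = error;
        }

        boolean hasFailed() {
            return fatalError != null;
        }
    }

    /** Registers the error handler BEFORE starting the client, so a failed start surfaces. */
    static SketchClient startClient(boolean networkAvailable, RecordingFatalErrorHandler handler) {
        SketchClient client = new SketchClient(networkAvailable);
        client.addErrorListener(
                (message, cause) -> handler.onFatalError(new IllegalStateException(message, cause)));
        client.start();
        return client;
    }

    public static void main(String[] args) {
        RecordingFatalErrorHandler handler = new RecordingFatalErrorHandler();
        SketchClient client = startClient(false, handler);
        System.out.println(
                "started=" + client.isStarted() + ", fatalErrorReported=" + handler.hasFailed());
    }
}
```

In this sketch the handler only records the error so the behavior can be observed; in the scenario described above, the handler would let the jobManager or taskManager process exit instead of continuing with a client that never started.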
--
This message was sent by Atlassian Jira
(v8.3.4#803005)