Aitozi created FLINK-24021:
------------------------------
Summary: Potential job unrecoverable due to Network failure
Key: FLINK-24021
URL: https://issues.apache.org/jira/browse/FLINK-24021
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Reporter: Aitozi
Currently we use ZooKeeper for leader election and leader retrieval in HA
mode. We register a fatalError handler in leaderElectionService and
leaderRetrievalService so that the JobManager or TaskManager process exits
when an unexpected error occurs. However, we do not do this when calling
curatorFrameworkClient#start in ZookeeperUtils. This can lead to a sequence
like the following:
# ZookeeperUtils starts the curator client, but the start fails because of a
network loss. No exception is thrown, because we do not register an error
handler.
# The network recovers by the time the master begins leader election, so the
election succeeds.
# The leaderRetrieval starts working by issuing get_data periodically, but
these calls are never executed, because the curator client failed to start in
phase 1.
So I think we should register an error handler in phase 1, so that we can
fail fast.
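
The proposal can be illustrated with a minimal, self-contained sketch. It does
not use Flink or Curator: ToyClient, FatalErrorHandler and startClient are
hypothetical stand-ins, and the assumption that a failed start is reported only
to registered listeners (rather than thrown) is taken from the scenario
described above. With the real Curator client, the analogous step would
probably be adding a listener via getUnhandledErrorListenable() before calling
start(), but that mapping is an assumption, not a confirmed fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-ins for illustration; none of these are Flink or Curator classes.
final class FailFastStartSketch {

    /** Callback for errors that would otherwise go unnoticed by the caller. */
    interface FatalErrorHandler {
        void onFatalError(Throwable error);
    }

    /** Toy client whose start() never throws; failures reach only registered listeners. */
    static final class ToyClient {
        private final boolean networkUp;
        private final List<FatalErrorHandler> listeners = new ArrayList<>();
        private boolean started;

        ToyClient(boolean networkUp) {
            this.networkUp = networkUp;
        }

        void addErrorListener(FatalErrorHandler listener) {
            listeners.add(listener);
        }

        void start() {
            if (!networkUp) {
                // Assumed behaviour from the report: no exception, only a listener callback.
                IllegalStateException error =
                        new IllegalStateException("connection loss during start");
                for (FatalErrorHandler listener : listeners) {
                    listener.onFatalError(error);
                }
                return;
            }
            started = true;
        }

        boolean isStarted() {
            return started;
        }
    }

    /** Registers the handler BEFORE start(), so a failed start is surfaced immediately. */
    static ToyClient startClient(boolean networkUp, FatalErrorHandler handler) {
        ToyClient client = new ToyClient(networkUp);
        client.addErrorListener(handler);
        client.start();
        return client;
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        ToyClient client = startClient(false, seen::set);
        System.out.println("started=" + client.isStarted() + ", error=" + seen.get());
    }
}
```

In a real process the handler would terminate the JobManager or TaskManager
instead of recording the error, which is what "fail fast" means here. Without
the listener, the toy client simply stays in the not-started state and nothing
signals the problem, mirroring phase 1 of the scenario.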