[ 
https://issues.apache.org/jira/browse/FLINK-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932792#comment-16932792
 ] 

Till Rohrmann commented on FLINK-14112:
---------------------------------------

You are right, Stephan. For leader election it should not be a big problem. We 
use Curator's {{LeaderLatch}} internally, which should regenerate the leader 
latch ZNodes. Hence, the cluster should eventually be able to recover from the 
deletion (modulo the resulting log statements).

However, it is problematic for the pointers we store in ZooKeeper for 
checkpoints and the submitted {{JobGraphs}}. The cluster won't be able to find 
them if someone deletes the parent ZNode.
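
To illustrate the difference, here is a toy sketch in plain Java. It models the 
ZooKeeper namespace as an in-memory map and is not Flink or Curator code: the 
class name, the paths, the IDs and the pointer values are all made up. It only 
shows that a contender can re-create its own latch node from what it still 
holds in memory, whereas a deleted pointer has no second copy to restore from.

```java
import java.util.Optional;
import java.util.TreeMap;

/** Toy stand-in for a ZooKeeper namespace (hypothetical, not a real API). */
public class ZNodeLossSketch {
    // Maps a ZNode path to the data stored at that path.
    private final TreeMap<String, String> nodes = new TreeMap<>();

    public void put(String path, String data) {
        nodes.put(path, data);
    }

    public Optional<String> get(String path) {
        return Optional.ofNullable(nodes.get(path));
    }

    /** Removes every node below root, like deleting all children of /flink. */
    public void deleteChildren(String root) {
        nodes.keySet().removeIf(path -> path.startsWith(root + "/"));
    }

    /** A contender still knows its own ID, so it can register itself again. */
    public void reelect(String latchPath, String contenderId) {
        nodes.putIfAbsent(latchPath, contenderId);
    }

    /** Builds the state after a wipe followed by a re-election. */
    public static ZNodeLossSketch afterWipe() {
        ZNodeLossSketch zk = new ZNodeLossSketch();
        // Hypothetical layout below the configured root path.
        zk.put("/flink/leaderlatch/jm-1", "jm-1");
        zk.put("/flink/checkpoints/job-1", "pointer-to-checkpoint");
        zk.put("/flink/jobgraphs/job-1", "pointer-to-jobgraph");

        zk.deleteChildren("/flink");

        // The latch node comes back because the contender re-creates it.
        zk.reelect("/flink/leaderlatch/jm-1", "jm-1");
        // Nothing re-creates the pointers: ZooKeeper held the only copy.
        return zk;
    }

    public static void main(String[] args) {
        ZNodeLossSketch zk = afterWipe();
        System.out.println("latch:      " + zk.get("/flink/leaderlatch/jm-1"));
        System.out.println("checkpoint: " + zk.get("/flink/checkpoints/job-1"));
        System.out.println("job graph:  " + zk.get("/flink/jobgraphs/job-1"));
    }
}
```

In this model the latch entry is present again after the wipe, while the 
checkpoint and job graph lookups come back empty.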

> Removing zookeeper state should cause the task manager and job managers to 
> restart
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-14112
>                 URL: https://issues.apache.org/jira/browse/FLINK-14112
>             Project: Flink
>          Issue Type: Wish
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.1, 1.9.0
>            Reporter: Aaron Levin
>            Priority: Minor
>
> Suppose you have a flink application running on a cluster with the following 
> configuration:
> {noformat}
> high-availability.zookeeper.path.root: /flink
> {noformat}
> Now suppose you delete all the znodes within {{/flink}}. I experienced the 
> following:
>  * massive amount of logging
>  * application did not restart
>  * task manager did not crash or restart
>  * job manager did not crash or restart
> From this state I had to restart all the task managers and all the job 
> managers in order for the flink application to recover.
> It would be desirable for the Task Managers and Job Managers to crash if the 
> znode is not available (though perhaps you all have thought about this more 
> deeply than I!)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)