I don't know how to reproduce it, but I've observed three kinds of
termination when connectivity with ZooKeeper is disrupted. I don't think
it's an issue with ZooKeeper itself, as the same ZooKeeper has supported a
much bigger Kafka cluster for a few years.

1. The first kind is exactly this -
https://github.com/apache/flink/pull/11338. Basically, a temporary loss of
connectivity or a rolling upgrade of ZooKeeper causes the job to terminate.
It eventually restarts from where it left off.
2. The second kind is when the job terminates and restarts for the same
reason but is unable to recover from the checkpoint. I think it's similar to
this - https://issues.apache.org/jira/browse/FLINK-19154. If upgrading to
1.12.0 (from 1.11.2) fixes this second issue, then I'll upgrade.
3. The third kind is where the job restarts repeatedly because it's unable
to establish a session with ZooKeeper. I don't know if reducing the session
timeout will help here, but in this case I'm forced to disable ZooKeeper HA
entirely, as the job cannot even restart.
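
For reference, the session timeout mentioned in the third case is one of the
ZooKeeper client settings in flink-conf.yaml. The fragment below is only a
sketch of where those knobs live, not my actual configuration: the quorum
hosts and storage directory are placeholders, and the numeric values are what
I believe to be the documented defaults.

```yaml
# Sketch only: hosts and paths are hypothetical placeholders.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: hdfs:///flink/ha/

# ZooKeeper client tuning (values in milliseconds unless noted).
high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000
high-availability.zookeeper.client.retry-wait: 5000
# Number of connection retry attempts before the client gives up.
high-availability.zookeeper.client.max-retry-attempts: 3
```

Whether changing any of these would actually help with the restart loop is
exactly what I'm unsure about.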

I could create a JIRA ticket to discuss the ZooKeeper issue itself if you
suggest it, but the ZooKeeper and savepoint issues are related, as I'm not
sure what will happen in each of the cases above.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/