Hey, I recently expanded my ZooKeeper cluster from 3 nodes to 5 nodes and have run into some issues. The cluster is deployed on Kubernetes and provides HA for Flink.
The first symptom is that Flink can no longer perform leader election correctly. It connects to ZooKeeper successfully and retrieves the leader latch and the leader information, but it never becomes the new leader. The problem is that the leader information it retrieves is wrong.

I am unable to delete the latch or the zNode using zkCli. The CLI replies "Node does not exist: " even though I can query the node and see its data.

We are also seeing data integrity errors in the logs, and we aren't sure whether they are related: "Message: Digests are not matching. Value is Zxid. Last value:16617228477804"

My questions: How can I go about clearing the erroneous state? Why did this occur? How can I prevent it from happening in the future? And what additional information is needed to help debug this issue?
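
On the last question, one thing that may be worth collecting is each server's own view of the ensemble, since a node that exists on one server but not on another would be consistent with the zkCli behaviour above. Below is a minimal sketch in Python, not a definitive tool. It assumes the `srvr` four-letter-word command is allowed on every server (via `4lw.commands.whitelist`), and the host names in the usage note are hypothetical placeholders. It also assumes the number in the digest log line is a zxid printed in decimal, and splits it into its epoch (high 32 bits) and counter (low 32 bits).

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Turn the 'Key: value' lines of an `srvr` reply into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF


def collect(servers):
    """Query every (host, port) pair and return {host: parsed srvr reply}."""
    return {host: parse_srvr(four_letter_word(host, port)) for host, port in servers}
```

Calling `collect([("zk-0.example", 2181), ("zk-1.example", 2181)])` with the real server addresses returns one dict per server; comparing the `Mode`, `Zxid`, and `Node count` entries across all five servers shows whether they agree on the leader and on the data tree. Under the decimal-zxid assumption, `split_zxid(16617228477804)` puts the value from the log line in epoch 3869.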
