Michael Pedersen created KAFKA-4418:
---------------------------------------

             Summary: Broker Leadership Election Fails If Missing ZK Path Raises Exception
                 Key: KAFKA-4418
                 URL: https://issues.apache.org/jira/browse/KAFKA-4418
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.10.0.1, 0.10.0.0, 0.9.0.1
            Reporter: Michael Pedersen


Our Kafka cluster went down because a single node went down *and* a path in 
ZooKeeper was missing for one topic (/brokers/topics/<topicname>/partitions). 
When this occurred, leadership election could not run, and the preferred 
replica election command produced a stack trace that looked like this:

Failed to start preferred replica election
org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
        at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
        at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:537)
        at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:817)
        at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:816)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at kafka.utils.ZkUtils.getAllPartitions(ZkUtils.scala:816)
        at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:64)
        at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
        at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
        at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
        at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
        ... 16 more

I have looked through the code and found a place where a small fix would 
seem to allow the leadership election to continue. Specifically, the function 
at 
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L633
 does not handle the exception raised when the path is missing. Wrapping the 
call in a try/catch block would work, but could introduce other problems:

* If the code is used elsewhere, callers at a higher level might depend on 
the exception being raised in order to stop some other operation.
* Unless the exception is logged or reported somehow, no one will know that 
the path is missing, which makes debugging other problems harder.
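
To make the try/catch idea and the logging concern concrete, here is a 
minimal sketch. It is not the actual ZkUtils code: NoNodeException is a 
stand-in for the zkclient ZkNoNodeException, childrenOrEmpty and 
allPartitions are hypothetical helpers, and the warnings buffer stands in 
for whatever logger a real fix would use. The shape loosely follows the 
per-topic getChildren call that the stack trace shows inside 
getAllPartitions.

```scala
import scala.collection.mutable.ListBuffer

// Stand-in for org.I0Itec.zkclient.exception.ZkNoNodeException, so the
// sketch runs without zkclient on the classpath (assumption).
class NoNodeException(path: String)
  extends RuntimeException(s"NoNode for $path")

object GuardedPartitionLookup {
  // Collects warnings; a real fix would write to the broker's log instead.
  val warnings = ListBuffer.empty[String]

  // Hypothetical helper: returns the children of `path`, or an empty
  // sequence (plus a recorded warning) when the path does not exist.
  def childrenOrEmpty(path: String,
                      getChildren: String => Seq[String]): Seq[String] =
    try getChildren(path)
    catch {
      case e: NoNodeException =>
        warnings += s"Skipping $path: ${e.getMessage}"
        Seq.empty
    }

  // One lookup per topic; a topic whose partitions path is missing is
  // skipped rather than aborting the scan for every other topic.
  def allPartitions(topics: Seq[String],
                    getChildren: String => Seq[String]): Set[(String, Int)] =
    topics.flatMap { topic =>
      childrenOrEmpty(s"/brokers/topics/$topic/partitions", getChildren)
        .map(partition => (topic, partition.toInt))
    }.toSet
}
```

Only the missing-node case is caught here, so any other failure still 
propagates to callers. Whether skipping a topic silently is safe for every 
caller is exactly the first concern above, and the sketch does not settle it.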

I'm sure there are other issues I'm not aware of, but those two come to mind 
quickly. What would be the best route for getting this resolved quickly?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
