[ 
https://issues.apache.org/jira/browse/KAFKA-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated KAFKA-4418:
------------------------------
    Component/s: zkclient

> Broker Leadership Election Fails If Missing ZK Path Raises Exception
> --------------------------------------------------------------------
>
>                 Key: KAFKA-4418
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4418
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Michael Pedersen
>            Priority: Major
>              Labels: reliability
>
> Our Kafka cluster went down because a single node went down *and* a path in 
> Zookeeper was missing for one topic (/brokers/topics/<topicname>/partitions). 
> When this occurred, leadership election could not run, and produced a stack 
> trace that looked like this:
> Failed to start preferred replica election
> org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
>       at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
>       at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
>       at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
>       at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
>       at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:537)
>       at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:817)
>       at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:816)
>       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>       at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>       at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>       at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>       at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>       at kafka.utils.ZkUtils.getAllPartitions(ZkUtils.scala:816)
>       at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:64)
>       at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>       at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>       at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
>       at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
>       at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
>       at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
>       ... 16 more
> I have looked through the code and found a place where a quick fix would 
> seem to allow the leadership election to continue. Specifically, the 
> function at 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L633
> does not handle possible exceptions. Wrapping the call in a try/catch block 
> would work, but could introduce other problems:
> * If the function is used elsewhere, a caller at a higher level might need 
> the exception in order to stop some other operation.
> * Unless the exception is logged or reported somehow, no one will know the 
> problem exists, which makes debugging other problems harder.
> I'm sure there are other issues I'm not aware of, but those two come to mind 
> quickly. What would be the best route to getting this resolved quickly?
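
One possible shape for the try/catch idea described above, written as a self-contained Java sketch rather than as a patch to ZkUtils. Everything named here is hypothetical: `NoNodeException` is a local stand-in for zkclient's `ZkNoNodeException` so the example compiles without the zkclient jar, and `PartitionLister`, `childrenOrEmpty` and `partitionsByTopic` are illustrative names, not Kafka APIs. The sketch tries to address both concerns from the report: the missing path is logged rather than silently swallowed, and the lenient behaviour lives in a separate helper so that callers who need the exception can keep calling the strict lookup directly.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical stand-in for org.I0Itec.zkclient.exception.ZkNoNodeException,
// defined locally only so the sketch compiles on its own.
class NoNodeException extends RuntimeException {
    NoNodeException(String path) {
        super("NoNode for " + path);
    }
}

final class PartitionLister {
    private static final Logger LOG = Logger.getLogger("PartitionLister");

    private PartitionLister() {}

    // Lenient lookup: returns the children of `path`, or an empty list
    // (after logging a warning) when the path does not exist.
    // `getChildren` is the strict lookup, which is assumed to throw
    // NoNodeException for a missing path.
    static List<String> childrenOrEmpty(Function<String, List<String>> getChildren,
                                        String path) {
        try {
            return getChildren.apply(path);
        } catch (NoNodeException e) {
            LOG.warning("Skipping missing ZooKeeper path " + path + ": " + e.getMessage());
            return Collections.emptyList();
        }
    }

    // Collects partitions for every topic, so that one topic with a missing
    // partitions path no longer aborts the whole scan.
    static Map<String, List<String>> partitionsByTopic(
            Function<String, List<String>> getChildren, List<String> topics) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String topic : topics) {
            String path = "/brokers/topics/" + topic + "/partitions";
            result.put(topic, childrenOrEmpty(getChildren, path));
        }
        return result;
    }
}
```

Whether an empty partition list is the right result for a topic in this state is a judgment call; the sketch only shows that the scan can continue while still leaving a trace in the logs.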



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
