[ https://issues.apache.org/jira/browse/KAFKA-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Phil Hargett updated KAFKA-989:
-------------------------------

    Status: Patch Available  (was: Open)

When in doubt about how to fix a locking issue...add another lock. ;)

While the actual race here involves startConnections / stopConnections in ConsumerFetcherManager, the underlying trigger appears to be the lack of protection around the shutdown and rebalance operations on ZookeeperConsumerConnector. Nothing prevents a rebalance from running while a shutdown is in progress, and it appears that this can trigger the race in ConsumerFetcherManager.

The patch I'm attaching (see KAFKA-989-failed-to-find-leader-patch2.patch) adds a shutdown lock that is acquired first in both shutdown() and the run method of the ZKRebalancerListener. This should prevent a rebalance from happening on a consumer that has already shut down. It keeps the fetcher and the zkclient from being left in intermediate states, and thus should prevent the race.

> Race condition shutting down high-level consumer results in spinning background thread
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-989
>                 URL: https://issues.apache.org/jira/browse/KAFKA-989
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Ubuntu Linux x64
>            Reporter: Phil Hargett
>         Attachments: KAFKA-989-failed-to-find-leader.patch
>
>
> An application that uses the Kafka client under load can often hit this issue within a few hours.
> High-level consumers come and go over this application's lifecycle, but there are a variety of defenses that ensure each high-level consumer lasts several seconds before being shut down. Nevertheless, some race is causing this background thread to continue long after the ZKClient it is using has been disconnected. Since the thread was spawned by a consumer that has already been shut down, the application has no way to find this thread and stop it.
> Reported on the users-kafka mailing list 6/25 as "0.8 throwing exception 'Failed to find leader' and high-level consumer fails to make progress".
> The only remedy is to shut down the application and restart it. Externally detecting that this state has occurred is not pleasant: one needs to grep the log for repeated occurrences of the same exception.
> Stack trace:
> Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
> 	at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
> 	at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
> 	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> 	at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
> 	at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
> 	at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
> 	at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
> 	at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
> 	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
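
For readers without the patch at hand, the shutdown-lock idea described in the update comment above might look roughly like the sketch below. It is written in Java for brevity and is not the attached patch or Kafka's actual code: the class ConnectorSketch, its fields, and its methods are all hypothetical stand-ins. The only point it illustrates is that shutdown and rebalance take the same lock first, and that a rebalance arriving after shutdown is skipped.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical stand-in for a consumer connector. None of these names
 * come from Kafka; they only illustrate the locking pattern.
 */
class ConnectorSketch {
    // Single lock acquired first by both shutdown() and rebalance().
    private final Object shutdownLock = new Object();
    private final AtomicBoolean isShutDown = new AtomicBoolean(false);

    // Stand-in for "fetcher threads are running" (assumption for the sketch).
    private boolean fetchersRunning = false;
    private int rebalanceCount = 0;

    /**
     * Would be called from the rebalance listener thread.
     * Returns true if a rebalance actually ran, false if it was skipped.
     */
    boolean rebalance() {
        synchronized (shutdownLock) {
            if (isShutDown.get()) {
                // Consumer already shut down: do not restart any fetchers.
                return false;
            }
            fetchersRunning = false; // stop existing connections
            rebalanceCount++;        // reassign partitions (elided)
            fetchersRunning = true;  // start connections again
            return true;
        }
    }

    /** Idempotent shutdown; cannot interleave with rebalance(). */
    void shutdown() {
        synchronized (shutdownLock) {
            if (isShutDown.compareAndSet(false, true)) {
                fetchersRunning = false;
            }
        }
    }

    boolean fetchersRunning() {
        synchronized (shutdownLock) {
            return fetchersRunning;
        }
    }

    int rebalanceCount() {
        synchronized (shutdownLock) {
            return rebalanceCount;
        }
    }
}
```

Because both methods hold the same lock for their full duration, a rebalance either completes before shutdown begins or observes the shut-down flag and returns without restarting anything, so no fetcher should be left running against a closed client.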