[ https://issues.apache.org/jira/browse/KAFKA-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729682#comment-13729682 ]
Phil Hargett commented on KAFKA-989:
------------------------------------
Yes, but my working hypothesis is that because there are at least 2 sets of
races (in the consumer connector's syncedRebalance/shutdown, then in
ConsumerFetcherManager's startConnections/stopConnections), it is actually
possible to have a LeaderFinderThread still running that has not been shut
down, even though its consumer has, because a stopConnections call completed
before a startConnections call finished. So there's a started leader finder
thread, but its ZkClient has been closed.
The key, I think, is that there is no guarantee that, while the consumer
connector is shutting down, a rebalance event won't start up another
leader finder thread (by starting fetchers again).
I believe the race in ConsumerFetcherManager is unlikely to happen if the
race in ZookeeperConsumerConnector is fixed. Thus I avoid fixing the
harder race by fixing an easier one that may be its only trigger (at present).
:)
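To make the idea concrete, here is a minimal sketch of the kind of guard
described above. It is not the attached patch and not Kafka's actual code:
ConnectorSketch, rebalanceLock, shuttingDown and fetchersRunning are
hypothetical names, and the boolean field merely stands in for
startConnections/stopConnections. The assumption is that rebalance and
shutdown share one lock and that a rebalance refuses to start fetchers once
shutdown has begun.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the consumer connector; NOT Kafka's real class.
class ConnectorSketch {
    private final Object rebalanceLock = new Object();
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private boolean fetchersRunning = false; // guarded by rebalanceLock

    // Called on a rebalance event. Returns false if the rebalance was refused.
    boolean syncedRebalance() {
        synchronized (rebalanceLock) {
            if (shuttingDown.get()) {
                return false; // shutdown has begun: never restart fetchers
            }
            fetchersRunning = true; // stands in for startConnections()
            return true;
        }
    }

    // Idempotent shutdown: set the flag first, then stop fetchers under the lock.
    void shutdown() {
        if (!shuttingDown.compareAndSet(false, true)) {
            return; // another caller already started shutting down
        }
        synchronized (rebalanceLock) {
            fetchersRunning = false; // stands in for stopConnections()
            // The ZooKeeper client would be closed here, after fetchers stop.
        }
    }

    boolean fetchersRunning() {
        synchronized (rebalanceLock) {
            return fetchersRunning;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectorSketch c = new ConnectorSketch();
        // Simulate rebalance events racing with shutdown.
        Thread watcher = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                c.syncedRebalance();
            }
        });
        watcher.start();
        c.shutdown();
        watcher.join();
        System.out.println("fetchers running after shutdown: " + c.fetchersRunning());
    }
}
```

Whichever side wins the lock, the result is the same: a rebalance that got
in first has its fetchers stopped by shutdown, and a rebalance that arrives
later sees the flag and does nothing, so the demo prints
"fetchers running after shutdown: false".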
> Race condition shutting down high-level consumer results in spinning
> background thread
> --------------------------------------------------------------------------------------
>
> Key: KAFKA-989
> URL: https://issues.apache.org/jira/browse/KAFKA-989
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8
> Environment: Ubuntu Linux x64
> Reporter: Phil Hargett
> Attachments: KAFKA-989-failed-to-find-leader.patch,
> KAFKA-989-failed-to-find-leader-patch2.patch
>
>
> An application that uses the Kafka client under load can often hit
> this issue within a few hours.
> High-level consumers come and go over this application's lifecycle, but there
> are a variety of defenses that ensure each high-level consumer lasts several
> seconds before being shut down. Nevertheless, some race is causing this
> background thread to continue long after the ZkClient it is using has been
> disconnected. Since the thread was spawned by a consumer that has already
> been shut down, the application has no way to find this thread and stop it.
> Reported on the users-kafka mailing list 6/25 as "0.8 throwing exception
> 'Failed to find leader' and high-level consumer fails to make progress".
> The only remedy is to shut down the application and restart it. Externally
> detecting that this state has occurred is not pleasant: one needs to grep the
> log for repeated occurrences of the same exception.
> Stack trace:
> Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
> at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
> at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
> at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
> at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
> at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
> at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
> at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)