[jira] [Commented] (KAFKA-989) Race condition shutting down high-level consumer results in spinning background thread

Phil Hargett (JIRA) Mon, 05 Aug 2013 10:35:20 -0700

    [ https://issues.apache.org/jira/browse/KAFKA-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729682#comment-13729682 ]


Phil Hargett commented on KAFKA-989:
------------------------------------

Yes, but my working hypothesis is that because there are at least 2 sets of 
races (in the consumer connector's syncedRebalance/shutdown, then in 
ConsumerFetcherManager's startConnections/stopConnections), it is actually 
possible to have a LeaderFinderThread still running that has not been shut 
down, even though its consumer has, because a stopConnections call completed 
before a startConnections call finished.  So there's a started leader finder 
thread, but its ZkClient has been closed.

The key, I think, is that there is no guarantee that, while the consumer 
connector is shutting down, a rebalance event won't start up another leader 
finder thread (by starting fetchers again).

I believe the race in ConsumerFetcherManager is unlikely to happen if the 
race in ZookeeperConsumerConnector is fixed. Thus I avoid fixing the harder 
race by fixing an easier one that may be its only trigger (at present). 
:)
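
To make the idea concrete, here is a minimal sketch of the kind of guard I 
have in mind. It is not the attached patch and not Kafka's actual code: 
ConnectorSketch, rebalanceLock, isShuttingDown and fetchersRunning are all 
hypothetical names, and the fetchersRunning flag merely stands in for 
starting and stopping the fetchers (and with them the leader finder thread). 
The assumption is that rebalance and shutdown share one lock and that 
rebalance checks a shutdown flag while holding it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the consumer connector; not Kafka's real class.
class ConnectorSketch {
    // Shared by rebalance and shutdown so the two can never interleave.
    private final Object rebalanceLock = new Object();
    private final AtomicBoolean isShuttingDown = new AtomicBoolean(false);
    // Stands in for "fetchers and leader finder thread are started".
    // Guarded by rebalanceLock.
    private boolean fetchersRunning = false;

    // Returns true if the rebalance ran, false if it was skipped
    // because shutdown had already begun.
    boolean syncedRebalance() {
        synchronized (rebalanceLock) {
            if (isShuttingDown.get()) {
                return false; // never restart fetchers during/after shutdown
            }
            fetchersRunning = true; // stand-in for starting the fetchers
            return true;
        }
    }

    void shutdown() {
        // Set the flag first so any rebalance that acquires the lock
        // after this point becomes a no-op.
        if (isShuttingDown.compareAndSet(false, true)) {
            // Wait for an in-flight rebalance to finish, then stop
            // whatever it started.
            synchronized (rebalanceLock) {
                fetchersRunning = false; // stand-in for stopping the fetchers
            }
        }
    }

    boolean fetchersRunning() {
        synchronized (rebalanceLock) {
            return fetchersRunning;
        }
    }
}
```

With this shape, a rebalance that is already in flight finishes before 
shutdown stops the fetchers, and a rebalance that arrives after shutdown 
has begun does nothing, so no leader finder thread should outlive its 
ZkClient.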
                
> Race condition shutting down high-level consumer results in spinning 
> background thread
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-989
>                 URL: https://issues.apache.org/jira/browse/KAFKA-989
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Ubuntu Linux x64
>            Reporter: Phil Hargett
>         Attachments: KAFKA-989-failed-to-find-leader.patch, 
> KAFKA-989-failed-to-find-leader-patch2.patch
>
>
> An application that uses the Kafka client under load can often hit this 
> issue within a few hours.
> High-level consumers come and go over this application's lifecycle, but there 
> are a variety of defenses that ensure each high-level consumer lasts several 
> seconds before being shut down.  Nevertheless, some race is causing this 
> background thread to continue long after the ZkClient it is using has been 
> disconnected.  Since the thread was spawned by a consumer that has already 
> been shut down, the application has no way to find this thread and stop it.
> Reported on the users-kafka mailing list 6/25 as "0.8 throwing exception 
> 'Failed to find leader' and high-level consumer fails to make progress". 
> The only remedy is to shut down the application and restart it.  Externally 
> detecting that this state has occurred is not pleasant: one needs to grep the 
> log for repeated occurrences of the same exception.
> Stack trace:
> Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
>       at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
>       at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
>       at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
>       at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
>       at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
>       at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
>       at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
>       at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
>       at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
