[ 
https://issues.apache.org/jira/browse/KAFKA-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phil Hargett updated KAFKA-989:
-------------------------------

    Status: Patch Available  (was: Open)

When in doubt about how to fix a locking issue...add another lock. ;)

While the race itself involves startConnections / stopConnections in 
ConsumerFetcherManager, the trigger appears to be the lack of mutual exclusion 
between the shutdown and rebalance operations on ZookeeperConsumerConnector.  
Nothing prevents a rebalance from starting while a shutdown is in progress, and 
that overlap appears to be what triggers the race in ConsumerFetcherManager.

The patch I'm attaching (see KAFKA-989-failed-to-find-leader-patch2.patch) adds 
a shutdown lock that is acquired first in both shutdown() and the run method of 
the ZKRebalancerListener.  This should prevent a rebalance from happening on a 
consumer that has already shut down.  It keeps the fetcher and the zkclient 
from being left in intermediate states, and thus should prevent the race.
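
To make the idea concrete, here is a minimal sketch of the locking pattern 
described above. It is not the patch itself: ConnectorSketch, its fields, and 
its methods are hypothetical stand-ins written in Java (the real classes are 
Scala), and the isShuttingDown check inside rebalance() is an assumption about 
how a rebalance would be skipped after shutdown.

```java
// Hypothetical stand-in for ZookeeperConsumerConnector; all names are illustrative.
class ConnectorSketch {
    // Single lock acquired first by both shutdown() and rebalance().
    private final Object shutdownLock = new Object();

    // All fields below are guarded by shutdownLock.
    private boolean isShuttingDown = false;
    private boolean fetcherRunning = false; // stands in for fetcher/zkclient state
    private int rebalanceCount = 0;

    // Stand-in for the run method of the rebalance listener.
    // Returns true if the rebalance actually ran.
    boolean rebalance() {
        synchronized (shutdownLock) {
            if (isShuttingDown) {
                // Assumption: a rebalance on a shut-down consumer is skipped.
                return false;
            }
            fetcherRunning = true; // stands in for restarting fetcher connections
            rebalanceCount++;
            return true;
        }
    }

    // Stand-in for shutdown(): cannot interleave with a rebalance in progress.
    void shutdown() {
        synchronized (shutdownLock) {
            if (isShuttingDown) {
                return; // already shut down
            }
            isShuttingDown = true;
            fetcherRunning = false; // stands in for stopping fetchers and closing zkclient
        }
    }

    boolean isFetcherRunning() {
        synchronized (shutdownLock) {
            return fetcherRunning;
        }
    }

    int getRebalanceCount() {
        synchronized (shutdownLock) {
            return rebalanceCount;
        }
    }
}
```

Because both paths take the same lock before touching any state, a rebalance 
either completes before shutdown begins or sees the shutdown flag and does 
nothing, so the fetcher cannot be restarted after it has been stopped.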
                
> Race condition shutting down high-level consumer results in spinning 
> background thread
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-989
>                 URL: https://issues.apache.org/jira/browse/KAFKA-989
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Ubuntu Linux x64
>            Reporter: Phil Hargett
>         Attachments: KAFKA-989-failed-to-find-leader.patch
>
>
> An application that uses the Kafka client under load can often hit this 
> issue within a few hours.
> High-level consumers come and go over this application's lifecycle, but there 
> are a variety of defenses that ensure each high-level consumer lasts several 
> seconds before being shut down.  Nevertheless, some race is causing this 
> background thread to continue long after the ZKClient it is using has been 
> disconnected.  Since the thread was spawned by a consumer that has already 
> been shut down, the application has no way to find this thread and stop it.
> Reported on the users-kafka mailing list 6/25 as "0.8 throwing exception 
> 'Failed to find leader' and high-level consumer fails to make progress". 
> The only remedy is to shut down the application and restart it.  Externally 
> detecting that this state has occurred is not pleasant: one has to grep the 
> log for repeated occurrences of the same exception.
> Stack trace:
> Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
>       at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
>       at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
>       at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
>       at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
>       at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
>       at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
>       at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
>       at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
>       at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
