[jira] [Commented] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

Jason Rosenberg (JIRA) Sun, 23 Mar 2014 23:25:33 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944775#comment-13944775
 ]


Jason Rosenberg commented on KAFKA-1029:
----------------------------------------

Perhaps this should be re-opened as a separate ticket?

The issue seems to have started when we had a network outage.  Several 
high-level consumers could not communicate at all with zookeeper (or kafka) for 
several minutes.  Since the network was restored, "I wrote this conflicted 
ephemeral node...." log messages have been appearing steadily, e.g.:

2014-03-19 00:13:14,165  INFO [ZkClient-EventThread-51-myzkserver] 
utils.ZkUtils$ - conflict in 
/consumers/myapp/ids/myapp_myhost-1394905418548-e159fc25 data: { 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187970147", "version":1 } stored data: { 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187967170", "version":1 }
2014-03-19 00:13:14,166  INFO [ZkClient-EventThread-51-myzkserver] 
utils.ZkUtils$ - I wrote this conflicted ephemeral node [{ 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187970147", "version":1 }] at 
/consumers/myapp/ids/myapp_awa60.sjc1b.square-1394905418548-e159fc25 a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry

These are happening continuously.  We have tried doing a rolling restart of our 
consumer apps, and even a rolling restart of zk.  We are using zk 3.3.6, in 
this case, but will try upgrading to 3.4.5 shortly.

It seems that there are repeated consumer rebalances, as well.

> Zookeeper leader election stuck in ephemeral node retry loop
> ------------------------------------------------------------
>
>                 Key: KAFKA-1029
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1029
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.0
>            Reporter: Sam Meder
>            Assignee: Sam Meder
>            Priority: Blocker
>             Fix For: 0.8.0
>
>         Attachments: 
> 0002-KAFKA-1029-Use-brokerId-instead-of-leaderId-when-tri.patch
>
>
> We're seeing the following log statements (over and over):
> [2013-08-27 07:21:49,538] INFO conflict in /controller data: { "brokerid":3, 
> "timestamp":"1377587945206", "version":1 } stored data: { "brokerid":2, 
> "timestamp":"1377587460904", "version":1 } (kafka.utils.ZkUtils$)
> [2013-08-27 07:21:49,559] INFO I wrote this conflicted ephemeral node [{ 
> "brokerid":3, "timestamp":"1377587945206", "version":1 }] at /controller a 
> while back in a different session, hence I will backoff for this node to be 
> deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> where the broker is essentially stuck in the loop that is trying to deal with 
> left-over ephemeral nodes. The code looks a bit racy to me. In particular:
> ZookeeperLeaderElector:
>   def elect: Boolean = {
>     controllerContext.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
>     val timestamp = SystemTime.milliseconds.toString
>     val electString = ...
>     try {
>       createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, leaderId,
>         (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
>         controllerContext.zkSessionTimeout)
> leaderChangeListener is registered before the create call (by the way, it 
> looks like a new registration will be added on every elect call - shouldn't it 
> register in startup()?), so it can update leaderId to the current leader before 
> the call to create. If that happens then we will continuously get node exists 
> exceptions and the checker function will always return true, i.e. we will 
> never get out of the while(true) loop.
> I think the right fix here is to pass brokerId instead of leaderId when 
> calling create, i.e.
> createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, brokerId,
>         (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
>         controllerContext.zkSessionTimeout)
> The loop dealing with the ephemeral node bug is now only triggered for the 
> broker that owned the node previously, although I am still not 100% sure if 
> that is sufficient.
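
To make the race described in the quoted report concrete, here is a minimal, 
self-contained sketch of the retry decision.  It is not the Kafka code: the 
object ElectionRetrySketch, the Outcome values and createWithRetry are all 
hypothetical names, the stored node is reduced to the broker id of its owner, 
and the unbounded while(true) loop is replaced by a bounded attempt count so 
the example terminates.  The assumption modeled is that another live broker 
holds the node, so the node never goes away between attempts.

```scala
object ElectionRetrySketch {
  // Hypothetical outcomes of one create-with-retry call (not Kafka types).
  sealed trait Outcome
  case object Created extends Outcome       // node was absent, create succeeded
  case object OwnedByOther extends Outcome  // node belongs to someone else, give up
  case object StillRetrying extends Outcome // attempts exhausted while backing off

  // storedOwner:    broker id found in the existing node, None if the node is absent.
  // expectedCaller: the id handed to the checker (leaderId or brokerId in the report).
  // attemptsLeft:   bounded stand-in for the while(true) loop in the report.
  @annotation.tailrec
  def createWithRetry(storedOwner: Option[Int],
                      expectedCaller: Int,
                      attemptsLeft: Int): Outcome =
    storedOwner match {
      case None => Created
      // Checker returns false: the node was written by a different broker.
      case Some(owner) if owner != expectedCaller => OwnedByOther
      // Checker returns true: treated as our own stale node, so back off and retry.
      case Some(_) if attemptsLeft <= 1 => StillRetrying
      case Some(_) => createWithRetry(storedOwner, expectedCaller, attemptsLeft - 1)
    }

  def main(args: Array[String]): Unit = {
    val brokerId = 3
    val liveController = Some(2) // broker 2 currently holds /controller
    val leaderId = 2             // already updated by the listener before the create

    // Passing leaderId: the checker always matches, so the loop never exits.
    println(createWithRetry(liveController, leaderId, attemptsLeft = 5))
    // Passing brokerId: the checker does not match, so the call returns.
    println(createWithRetry(liveController, brokerId, attemptsLeft = 5))
  }
}
```

In this sketch, passing leaderId after the listener has set it to 2 keeps 
retrying until the attempts run out, while passing brokerId returns at once 
because the stored owner is a different broker.  Whether that is sufficient in 
every session-expiry case is the open question raised in the report itself.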



--
This message was sent by Atlassian JIRA
(v6.2#6252)
