[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323229#comment-14323229 ]

Tommy Becker commented on KAFKA-1387:
-------------------------------------

Can a project member comment on what it will take to get this patch accepted? We have been running 0.8.1 with it for months, and I guess we'll have to apply it to 0.8.2 as well, but it would be nice to get it into the official tree.

> Kafka getting stuck creating ephemeral node it has already created when two
> zookeeper sessions are established in a very short period of time
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Fedor Korotkiy
>              Labels: newbie, patch
>         Attachments: kafka-1387.patch
>
>
> The Kafka broker re-registers itself in ZooKeeper every time the
> handleNewSession() callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>
> Now imagine the following sequence of events.
> 1) The ZooKeeper session is re-established. The handleNewSession() callback is
> queued by the zkClient, but not invoked yet.
> 2) The ZooKeeper session is re-established again, queueing the callback a second time.
> 3) The first callback is invoked, creating the /broker/[id] ephemeral path.
> 4) The second callback is invoked and tries to create the /broker/[id] path using
> the createEphemeralPathExpectConflictHandleZKBug() function. But the path
> already exists, so createEphemeralPathExpectConflictHandleZKBug() gets
> stuck in an infinite loop.
> It seems the controller election code has the same issue.
> I am able to reproduce this issue on the 0.8.1 branch from GitHub using the
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start Kafka and ZooKeeper and then pause ZooKeeper several times using
> Ctrl-Z.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
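The four-step sequence in the issue description can be played out with a small in-memory model. This is only a sketch under stated assumptions: `FakeZk`, `register`, the timestamps, and the bounded retry count are hypothetical stand-ins for ZooKeeper, the broker registration call, and the unbounded backoff loop; none of it is Kafka's or zkclient's actual code.

```python
from collections import deque


class FakeZk:
    """Toy stand-in for ZooKeeper: path -> (data, owner session id)."""

    def __init__(self):
        self.session_id = 1
        self.nodes = {}

    def expire_and_reconnect(self):
        # Closing a session deletes its ephemeral nodes; then a new session starts.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != self.session_id}
        self.session_id += 1

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            return False  # models a "node exists" error
        self.nodes[path] = (data, self.session_id)
        return True


def register(zk, path, broker, stamp, max_retries=3):
    """Retry while the existing node looks like our own (bounded so the demo ends)."""
    for _ in range(max_retries):
        if zk.create_ephemeral(path, (broker, stamp)):
            return "created"
        written_by, _ = zk.nodes[path][0]
        if written_by != broker:
            return "conflict with another broker"
        # Same broker wrote it, so assume a stale session and back off.
        # The node belongs to the live session, so it is never deleted.
    return "stuck"


zk = FakeZk()
callbacks = deque()
path = "/brokers/ids/1"

# Steps 1 and 2: two re-established sessions queue two callbacks before either runs.
for stamp in (100, 101):
    zk.expire_and_reconnect()
    callbacks.append(lambda stamp=stamp: register(zk, path, "broker-1", stamp))

results = [cb() for cb in callbacks]  # steps 3 and 4
print(results)  # ['created', 'stuck']
```

The second callback waits for a node that is owned by the current, live session, so no amount of backing off can make it disappear.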
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516102#comment-14516102 ]

Thomas Omans commented on KAFKA-1387:
-------------------------------------

I am seeing similar behavior in my consumer, using Kafka 0.8.2.1 and ZooKeeper 3.4.6.

In an infinite loop:
{code}
15/04/27 17:44:31 INFO utils.ZkUtils$: conflict in /consumers/**
15/04/27 17:44:31 INFO utils.ZkUtils$: I wrote this conflicted ephemeral node ** a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry
15/04/27 17:45:01 INFO utils.ZkUtils$: conflict in /consumers/**
15/04/27 17:45:01 INFO utils.ZkUtils$: I wrote this conflicted ephemeral node ** a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry
{code}
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519189#comment-14519189 ]

Marcus Aidley commented on KAFKA-1387:
--------------------------------------

I've also encountered this issue running Kafka 0.8.2.0 and ZooKeeper 3.4.6 in a three-node cluster. The error occurred after two ZooKeeper nodes were restarted at the same time. The error below appeared repeatedly in the Kafka logs. I resolved the issue by restarting Kafka.

{panel}
[2015-04-27 03:47:03,292] INFO I wrote this conflicted ephemeral node ["jmx_port":-1,"timestamp":"1430038275477","host":"ams5mdppdmsbacmq01b.markit.partners","version":1,"port":9092] at /brokers/ids/2 a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
{panel}
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520867#comment-14520867 ]

Thomas Omans commented on KAFKA-1387:
-------------------------------------

It looks like this "infinite retry" behavior is only in Kafka to accommodate another strange issue where ZooKeeper was deleting ephemeral nodes out from under it:

https://github.com/apache/kafka/blob/0.8.2.1/core/src/main/scala/kafka/utils/ZkUtils.scala#L272
https://issues.apache.org/jira/browse/ZOOKEEPER-1740

It seems the simplest thing to do would be to just delete the conflicted node and write the truth about the process environment it knows.

My issue appeared in the consumer code, whereas this issue occurs in the Kafka brokers themselves, but the bug appears to be the same. I can see two exceptional cases for ephemeral nodes:

1) The ZOOKEEPER-1740 bug was hit, in which case our ephemeral node was mysteriously lost out from under us, but our session is still active and we can just create a new one.
2) The session is long gone, but the ephemeral node is still hanging around until the consumer process exits. I believe this is the bug we are seeing.

Currently the first case is handled, but the second case is not.
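The "delete the conflicted node and write the truth" suggestion might look roughly like the sketch below. It is a toy model, not the attached patch: `register_or_replace`, the `nodes` dictionary, and the example path are hypothetical. It assumes the owning session of an ephemeral node can be compared with the current session id, which ZooKeeper exposes through the `ephemeralOwner` field of a znode's stat.

```python
def register_or_replace(nodes, path, data, session_id):
    """nodes maps path -> (data, owner_session_id); returns what was done."""
    existing = nodes.get(path)
    if existing is None:
        nodes[path] = (data, session_id)
        return "created"
    _, owner = existing
    if owner == session_id:
        # Our live session already owns the node: just refresh the data.
        nodes[path] = (data, session_id)
        return "refreshed"
    # Owned by another (presumably dead) session of ours: delete, then recreate.
    del nodes[path]
    nodes[path] = (data, session_id)
    return "replaced"


# Illustrative path and session ids only.
nodes = {"/consumers/g/ids/c1": ("old", 7)}
first = register_or_replace(nodes, "/consumers/g/ids/c1", "new", 8)
second = register_or_replace(nodes, "/consumers/g/ids/c1", "new", 8)
print(first, second)  # replaced refreshed
```

Deleting is only reasonable if no other live process can legitimately register under the same id; that assumption is not checked here.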
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533402#comment-14533402 ]

Abhishek Nigam commented on KAFKA-1387:
---------------------------------------

I have seen the ephemeral node issue before and the fix made there was exactly what Thomas mentioned:
"It seems the simplest thing to do would be to just delete the conflicted node and write the truth about the process environment it knows."

Is there a reason why the approach outlined by Thomas does not work for Kafka?

> Kafka getting stuck creating ephemeral node it has already created when two
> zookeeper sessions are established in a very short period of time
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Fedor Korotkiy
>            Priority: Blocker
>              Labels: newbie, patch, zkclient-problems
>         Attachments: kafka-1387.patch
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648378#comment-14648378 ]

Clark Haskins commented on KAFKA-1387:
--------------------------------------

This patch is listed as a blocker. Can the existing patch be committed? Is anyone actively working on it?

This has been a problem for us recently and we would like to see this fixed soon.

Thanks,
-Clark
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682263#comment-14682263 ]

Mayuresh Gharat commented on KAFKA-1387:
----------------------------------------

Can the person who uploaded the patch submit a test case showing how to reproduce this? We are hitting this in production but are not able to reproduce it locally.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682269#comment-14682269 ]

Fedor Korotkiy commented on KAFKA-1387:
---------------------------------------

Have you tried the steps from the issue description?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682416#comment-14682416 ]

Guozhang Wang commented on KAFKA-1387:
--------------------------------------

[~fpj] Could you help take a look at this issue?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692595#comment-14692595 ]

James Lent commented on KAFKA-1387:
-----------------------------------

It has been a while since I investigated this issue. I will take another look at it tomorrow and get back to you.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694052#comment-14694052 ]

James Lent commented on KAFKA-1387:
-----------------------------------

After refreshing my memory of this issue I was unable to come up with any new ideas for how to create an automated test case for it. I was only able to reproduce this issue in my dev environment using the cumbersome manual process I outlined in my Sept 27 comment. My question posted to the zookeeper-user mailing list regarding the validity of the key assumption of the patch logic generated no feedback.

We have been using the patch I provided with Kafka 0.8.1.1 for almost a year now. We have not seen a recurrence of the hung ephemeral connection issue since then. Since the problem was intermittent and only triggered when the system was unstable, this may or may not be due to the presence of the patch.

There was one NPE issue found during testing in March, when our application code changed and in certain cases tried to close a Connector that had never been fully started. That was fixed as follows:

{noformat}
Index: core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala
===================================================================
--- core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala (revision 73668)
+++ core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala (revision 73669)
@@ -162,7 +162,9 @@
       if (canShutdown) {
         info("ZKConsumerConnector shutting down")
-        consumerNodeMonitor.close()
+        if (consumerNodeMonitor != null) {
+          consumerNodeMonitor.close()
+        }
         if (wildcardTopicWatcher != null)
           wildcardTopicWatcher.shutdown()
{noformat}

I am not sure any of this was of much help, but I would be happy to try to answer any questions regarding the patch logic and/or update it based on your comments.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697002#comment-14697002 ]

Flavio Junqueira commented on KAFKA-1387:
-----------------------------------------

I'm actually really sorry that this issue has been around for so long. I didn't realize it was going on and that I was even indirectly participating in it.

Let me start by giving a sort of general overview of what to expect. If a client has received a session expiration event, it means that the leader has expired the session and has broadcast the closeSession event to the followers. If the same client creates a new session successfully, then the server it connects to must have applied the previous closeSession, which deletes the ephemeral znodes, because ZK guarantees that txns are totally ordered. Consequently, the client shouldn't observe an ephemeral from an old session of its own. Note that another client could still observe the ephemeral znode after the session expiration if it is connected to a server that is a bit behind, but that's fine.

What I'm thinking is that one problem that could happen is that a client creates a new session before receiving the session expiration for an earlier session. In that case the ephemerals will still be there because the session still exists.

The bottom line is that if the client has seen the session expiration event, then it seems fine to go ahead and create new ephemerals without having to check whether ephemerals are stale or not. If the session creation isn't clean, then there are a few options, like waiting for the timeout period, or storing and recovering the session id.

I'll dig into the code to see how we can fix this, have a closer look at the patch, and will reopen the associated ZOOKEEPER-1740 issue until we sort this out. Let me know if the explanation above makes sense in the meanwhile.
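The ordering argument above can be illustrated with a toy replicated log. This is a simplified sketch, not ZooKeeper's implementation: the transaction names mirror the comment, while `LOG`, `apply_prefix`, and the path are hypothetical. A server's state is taken to be the result of applying a prefix of the leader's totally ordered log, so any server that can create the new session has necessarily already applied the earlier closeSession.

```python
# The leader's totally ordered transaction log.
LOG = [
    ("createSession", 1),
    ("createEphemeral", 1, "/brokers/ids/1"),
    ("closeSession", 1),   # session 1 expires
    ("createSession", 2),  # the same client reconnects
]


def apply_prefix(log, upto):
    """State of a server that has applied the first `upto` transactions, in order."""
    sessions, ephemerals = set(), {}
    for txn in log[:upto]:
        if txn[0] == "createSession":
            sessions.add(txn[1])
        elif txn[0] == "createEphemeral":
            ephemerals[txn[2]] = txn[1]
        elif txn[0] == "closeSession":
            sessions.discard(txn[1])
            # Closing a session removes every ephemeral node it owned.
            ephemerals = {p: s for p, s in ephemerals.items() if s != txn[1]}
    return sessions, ephemerals


up_to_date = apply_prefix(LOG, 4)  # the server that created session 2
lagging = apply_prefix(LOG, 2)     # a follower that is a bit behind

print(up_to_date)  # ({2}, {})
print(lagging)     # ({1}, {'/brokers/ids/1': 1})
```

Only a client connected to the lagging server can still see the old ephemeral node; the client that owns session 2 cannot.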
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697427#comment-14697427 ] Abhishek Nigam commented on KAFKA-1387:

Thanks a lot for digging into this. Not sure if it helps, but in the past when I saw this issue it went like this:
a) Say the session timeout is 30 seconds.
b) If we kill the instance which creates the zookeeper ephemeral node and bring it back up quickly (in less than 30 seconds), we would find that the previous session data (ephemeral node) still exists.

The solution was to assume the existing data was from an old session, and to delete and re-create it during startup. However, we were processing the zookeeper events on a single thread.

On Fri, Aug 14, 2015 at 6:34 AM, Flavio Junqueira (JIRA)
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697907#comment-14697907 ] Guozhang Wang commented on KAFKA-1387:

Thanks [~fpj], this is very helpful.

Just to add some more context regarding this issue: we have seen cases where ephemeral nodes were not deleted when brokers / consumers tried to re-register themselves in ZK upon a session timeout event (details can be found in KAFKA-992). We tried to fix it by adding a registration timestamp to the registration node's data, checking whether the timestamp is different upon seeing the node, and if not, backing off to wait for the node to be removed.

However, people have also reported a couple of times that the backing-off never ends, i.e. the node has a different timestamp but was never deleted. The suspicion was that there were multiple consecutive session creations within a very short period of time, and that the node with a different timestamp was created by a session that had not actually expired, and hence will never be gone. But no one has validated whether this is the case.

The logic of re-registration can be found in ZookeeperConsumerConnector.scala and KafkaHealthcheck.scala.
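To make the never-ending back-off concrete, here is a minimal sketch in Java. It assumes a hypothetical in-memory stand-in for ZooKeeper (`BackoffModel` and all of its methods are invented for illustration and are not Kafka or ZkClient code), it leaves out the timestamp comparison, and it bounds the loop with `maxAttempts` so that it terminates.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for ZooKeeper; not Kafka or ZkClient code.
class BackoffModel {
    // ephemeral path -> id of the session that owns it
    private final Map<String, Long> owners = new HashMap<>();

    // Creates the ephemeral node; returns false if the path already exists.
    boolean create(String path, long sessionId) {
        return owners.putIfAbsent(path, sessionId) == null;
    }

    // Expiring a session deletes every ephemeral node it owns.
    void expire(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }

    /**
     * Back-off loop: on conflict, wait and retry, hoping the node goes away.
     * Returns the attempt that succeeded, or -1 once maxAttempts is used up.
     * The bound is an assumption of this sketch so that it terminates.
     */
    int registerWithBackoff(String path, long sessionId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (create(path, sessionId)) {
                return attempt;
            }
            // A real client would sleep here before retrying.
        }
        return -1;
    }
}
```

If the conflicting node is owned by a session that is still alive, no number of retries helps: calling `registerWithBackoff("/broker/1", 7L, 5)` twice without an `expire(7L)` in between returns 1 and then -1.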
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699514#comment-14699514 ] Flavio Junqueira commented on KAFKA-1387:

There are two problems at a high level described here: zk losing ephemerals and ephemerals not going away. I haven't been able to reproduce the former, but I've been able to find one potential problem that could be causing it. I started by finding it suspicious that the ZK listeners were not dealing with session events at all:

{code}
def handleStateChanged(state: KeeperState) {
  // do nothing, since zkclient will do reconnect for us.
}
{code}

It is quite typical with ZK that you wait for the connected event before making progress. Looking at the ZkClient implementation, I realized that it retries operations in the case of connection loss or session expiration until they go through. There is a race here, though. Say you submit a create, but instead of getting OK as a response, you get connection loss. ZkClient in this case will say "well, need to retry" and will get a node exists exception, which the code currently treats as a znode from a previous session. This znode will never go away because it belongs to the current session!

Now let's say we get rid of such corner cases. It is still possible that when the client recovers it finds a znode from a previous session. It can happen because the lease (session) corresponding to the znode is still valid, so ZK can't get rid of it. Revoking leases in general is a bit complicated, but it sounds ok in this case if there is no risk of having multiple incarnations of the same element (a broker) running concurrently.
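The lost-response race described above can be sketched in a few lines of Java. This is a model under stated assumptions, not ZkClient code: `LostResponseRace`, its `Result` enum, and the `dropNextResponse` switch are all hypothetical names, and the "server" is just a map from path to owning session id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the lost-response race; names are illustrative only.
class LostResponseRace {
    enum Result { OK, CONNECTION_LOSS, NODE_EXISTS }

    final Map<String, Long> owners = new HashMap<>(); // path -> owning session id
    boolean dropNextResponse = false;                 // simulates a connection loss

    // The server applies the create first; only the reply can get lost.
    Result create(String path, long sessionId) {
        if (owners.containsKey(path)) {
            return Result.NODE_EXISTS;
        }
        owners.put(path, sessionId);
        if (dropNextResponse) {
            dropNextResponse = false;
            return Result.CONNECTION_LOSS;
        }
        return Result.OK;
    }

    // Transparent retry on connection loss, as a retrying client would do.
    Result createWithRetry(String path, long sessionId) {
        Result result = create(path, sessionId);
        while (result == Result.CONNECTION_LOSS) {
            result = create(path, sessionId);
        }
        return result;
    }

    // True when the existing node belongs to the given session.
    boolean ownedBy(String path, long sessionId) {
        Long owner = owners.get(path);
        return owner != null && owner == sessionId;
    }
}
```

With `dropNextResponse` set, `createWithRetry` reports `NODE_EXISTS` even though the node was created by the caller's own session, which `ownedBy` confirms. Against a real ensemble, comparing `Stat.getEphemeralOwner()` with `ZooKeeper.getSessionId()` would be one way to make the same check; whether that is reachable through ZkClient is not established in this thread.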
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700113#comment-14700113 ] Guozhang Wang commented on KAFKA-1387:

I thought that when the previous session has ended (e.g. expired), its ephemeral node will be "eventually" removed? Does ZooKeeper itself have a leasing mechanism?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700187#comment-14700187 ] Flavio Junqueira commented on KAFKA-1387:

bq. I thought that when the previous session has ended (e.g. expired), its ephemeral node will be "eventually" removed?

If the session ends cleanly, by the client submitting a closeSession request, then the session closes and the ephemerals are deleted with the request. But if the client crashes and the server simply stops hearing from the client, then the session has to time out and expire, so it takes some time.

bq. Does ZooKeeper itself have a leasing mechanism?

I'm referring to the fact that the ephemeral represents a lease that is revoked when the session times out.

I'm not sure if this is clear, but one of the problems I'm pointing out is that zkclient might end up creating the ephemeral znode in your *current* session. In this case, the znode won't go away.

Here is actually another problem I found along the same lines. The createEphemeral call in ZkClient ends up calling retryUntilConnected, which retries even when the session expires:

{code}
try {
    return callable.call();
} catch (ConnectionLossException e) {
    // we give the event thread some time to update the status to 'Disconnected'
    Thread.yield();
    waitForRetry();
} catch (SessionExpiredException e) {
    // we give the event thread some time to update the status to 'Expired'
    Thread.yield();
    waitForRetry();
}
{code}

In this case, say that one call to createEphemeral via handleNewSession happens during a given session, but the session expires before the operation goes through. The client will retry with the new session. When the consumer tries again, it will fail because the znode is there and won't go away. This is another case in which the znode won't go away because it has been created in the current session.
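The sequence just described, a create that is retried across a session expiry, can be walked through with a small Java model. Everything here is an assumption made for illustration: `RetryAcrossExpiry`, `expireDuringNextCall`, and the integer session ids are hypothetical, and the loop only imitates the retry-until-answered behaviour rather than reproducing ZkClient.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a retry loop that keeps going across session expiry.
// None of these names come from ZkClient.
class RetryAcrossExpiry {
    final Map<String, Long> owners = new HashMap<>(); // path -> owning session id
    long currentSession = 1L;
    boolean expireDuringNextCall = false;

    // Retries until the create is answered, even if the session changes meanwhile.
    // Returns false when the node already exists.
    boolean createEphemeral(String path) {
        while (true) {
            if (expireDuringNextCall) {
                // The session expired before the request went through: its
                // ephemerals are deleted and a new session replaces it.
                expireDuringNextCall = false;
                final long expired = currentSession;
                owners.values().removeIf(owner -> owner == expired);
                currentSession++;
                continue; // transparent retry, now in the new session
            }
            return owners.putIfAbsent(path, currentSession) == null;
        }
    }
}
```

In this model the first call, made on behalf of session 1 with `expireDuringNextCall` set, succeeds but leaves the node owned by session 2. The second call, made by the callback for session 2, then finds a node that belongs to its own live session, so waiting for it to disappear cannot succeed.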
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700483#comment-14700483 ] Guozhang Wang commented on KAFKA-1387:

[~fpj] That makes sense. So it seems the right resolution should be at the ZkClient layer, not on Kafka's layer?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701418#comment-14701418 ] Flavio Junqueira commented on KAFKA-1387:

It doesn't look like it'd be a small change to zkclient to fix this. We essentially need it to expose zk events as they occur. In the way it currently does it, the events are serialized and the operations are retried transparently, so I don't know if the znode already exists because of a connection loss or if the session actually expired and there is a new one now.

The simplest way around this seems to be to just re-register the consumer directly (delete and create) upon a node exists exception. This should work because of the following argument. There are three possibilities when we get a node exists exception:

# The znode exists from a previous session and hasn't been reclaimed yet
# The znode exists because of a connection loss event while the znode was being created, so the second time we get an exception (event)
# The previous session has expired, a new one was created, and the registration was occurring around this transition, so when we execute handleNewSession for the new session, we get a node exists exception.

In all three cases, deleting and recreating seems fine. It is clearly conservative and more expensive than necessary, but at least it doesn't require changes to zkclient. Does it sound reasonable? Do you see any problem?

CC [~guozhang] [~jwl...@gmail.com]
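The delete-and-create idea above can be sketched as follows. This is only a model of the proposal, not the patch: `ReRegister` and `register` are hypothetical names, and the map from path to owning session id stands in for ZooKeeper. Against a real ensemble the delete and the create would be two separate requests, so the sketch ignores anything that could happen between them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "delete and create on node exists"; not the actual patch.
class ReRegister {
    final Map<String, Long> owners = new HashMap<>(); // path -> owning session id

    /**
     * Registers the path for the given session. On a conflict the existing
     * node is deleted and created again, whoever owned it before.
     * Returns true when a conflicting node had to be deleted first.
     */
    boolean register(String path, long sessionId) {
        boolean conflict = owners.putIfAbsent(path, sessionId) != null;
        if (conflict) {
            owners.remove(path);         // delete the existing znode
            owners.put(path, sessionId); // create it again in this session
        }
        return conflict;
    }
}
```

The outcome is the same whether the conflicting node came from an older session or from the caller's own session: after `register` returns, the path is owned by the session passed in, which is why the one code path covers all three cases listed above.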
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701509#comment-14701509 ] Guozhang Wang commented on KAFKA-1387:

Thanks [~fpj], that makes sense to me. [~jwlent55] do you want to submit a new patch following this approach?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701571#comment-14701571 ] Flavio Junqueira commented on KAFKA-1387:

[~guozhang] it looks like [~jwl...@gmail.com] isn't in the list of contributors, could you add him so that we can assign the jira to him?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702992#comment-14702992 ] James Lent commented on KAFKA-1387:

Your approach sounds much simpler than mine (which I like). It is similar to what I proposed doing only at startup (the ensureNodeDoesNotExist method). I am, however, not sure I understand the exact change you propose. As I remember, createEphemeralPathExpectConflictHandleZKBug is called by three code paths:
- Register Broker
- Register Consumer
- Leadership election

In my change I specifically tried to avoid changing the Leadership election logic. Is your change basically to implement your new logic (delete if already exists) instead of calling createEphemeralPathExpectConflictHandleZKBug in the first two cases? If so, I agree it sounds reasonable. I suppose in a misconfiguration case two nodes might get into a registration war over the Broker node, but that could (perhaps) be handled at startup (the second one fails to start up).

If you propose replacing createEphemeralPathExpectConflictHandleZKBug for the Leadership election case too, then I am less comfortable making (and testing) that change. I have never really dug into that logic too much.

One other factor to consider is that I am a bit backed up at work right now, and this issue will not be my highest priority.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703532#comment-14703532 ] Guozhang Wang commented on KAFKA-1387:

[~jwlent55] I agree that this fix may be just for broker / consumer registration, i.e. ZK should not be used to detect the mis-configuration in which two brokers / clients use the same Id. Hence for that case, in the new approach they may end up in a delete-and-write war. We should consider fixing such mis-operation in a different manner, which is orthogonal to this JIRA.

For leader election, one should not simply delete the path upon conflict; we should leave it as is. In the future, we should either fix the root cause in ZkClient or move on to use a different client, as KIP-30 is currently discussing.

If you do not have time this week and feel it is OK, [~fpj] could you help by taking it over?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703767#comment-14703767 ] Flavio Junqueira commented on KAFKA-1387:

I'm indeed proposing to get rid of createEphemeralPathExpectConflictHandleZKBug. I can investigate the impact to leadership election.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711838#comment-14711838 ]

Guozhang Wang commented on KAFKA-1387:
--------------------------------------

Thanks for the patch, [~fpj]. Here are some high-level comments:

1. Will mixing direct ZK usage with ZkClient violate ordering? AFAIK ZkClient orders all events fired by watchers and hands them to the user callbacks one by one. If we use ZK's Watcher directly, could its callback be called out of order with respect to other events?

2. If we get a Code.OK in CreateCallback, do we still need to trigger a ZooKeeper.exists with ExistsCallback again?

3. For the consumer / server registration case in particular, we try to handle parent path creation in ZkUtils.makeSurePersistentPathExists, so I feel we should expose the problem that the parent path does not exist yet instead of trying to hide it in createRecursive.

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Fedor Korotkiy
> Assignee: Flavio Junqueira
> Priority: Blocker
> Labels: newbie, patch, zkclient-problems
> Attachments: KAFKA-1387.patch, kafka-1387.patch
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721137#comment-14721137 ]

ASF GitHub Bot commented on KAFKA-1387:
---------------------------------------

GitHub user fpj opened a pull request:

    https://github.com/apache/kafka/pull/178

    KAFKA-1387: Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time

    This is a patch to get around the problem discussed in the KAFKA-1387 jira. The tests are not passing on my box when I run them all, but they do pass when I run them individually, which indicates that something is leaking from one test to the next. I still need to work this out and also do further testing. I wanted to open this PR now so that it can start getting reviewed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fpj/kafka 1387

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/178.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #178

commit f8be8657e649d0490e9ed1f1ef52234b3c31435e
Author: flavio junqueira
Date:   2015-08-23T13:55:11Z

    KAFKA-1387: First cut, node dependency on curator

commit b8f901b6478d4ac9c961e899d702e6fc11cfee07
Author: flavio junqueira
Date:   2015-08-23T13:55:11Z

    KAFKA-1387: First cut, node dependency on curator

commit 2369e66921f88b2ee1b24ddeff2bf2d050015447
Author: flavio junqueira
Date:   2015-08-23T14:07:41Z

    Merge branch '1387' of https://github.com/fpj/kafka into 1387

commit f03c301d5d919d9c05c6837de508b4f383906fdb
Author: flavio junqueira
Date:   2015-08-23T13:55:11Z

    KAFKA-1387: First cut, node dependency on curator

commit d8eab9e0f569eaaecb4afda4d486d00600ad1e6f
Author: flavio junqueira
Date:   2015-08-24T14:56:01Z

    KAFKA-1387: Some polishing

commit b7cbe5dbecbc28a564b99209114f39db785c73dd
Author: flavio junqueira
Date:   2015-08-24T15:50:58Z

    KAFKA-1387: Style fixes

commit 336f67c641c44b73ac1dbb66cdde4ff97f2fcd9a
Author: flavio junqueira
Date:   2015-08-24T15:53:18Z

    KAFKA-1387: More style fixes

commit 201ab2dcc33ba10a19c51f7452ce40497d3fcf83
Author: flavio junqueira
Date:   2015-08-24T15:59:32Z

    Merge branch '1387' of https://github.com/fpj/kafka into 1387

commit 9961665230e04331f7767d8aa8aaac0a14f46cd8
Author: flavio junqueira
Date:   2015-08-23T13:55:11Z

    KAFKA-1387: First cut, node dependency on curator

commit b52c12422f7a831137d8659f14779eaad1972217
Author: flavio junqueira
Date:   2015-08-24T14:56:01Z

    KAFKA-1387: Some polishing

commit b2400a0a37555250d50b1f1abfdda2c4d00b03ac
Author: flavio junqueira
Date:   2015-08-24T15:50:58Z

    KAFKA-1387: Style fixes

commit 888f6e0cf17d6a3a8d6b8dd46f8099731ba36511
Author: flavio junqueira
Date:   2015-08-24T15:53:18Z

    KAFKA-1387: More style fixes

commit d675b024b0e8627c4c2c9c113c07527851e81f7a
Author: flavio junqueira
Date:   2015-08-29T15:00:07Z

    KAFKA-1387

commit 4c83ac2609ed29a0f1887bf5087dab50e3e93488
Author: flavio junqueira
Date:   2015-08-29T15:07:23Z

    KAFKA-1387: Removing whitespaces.

commit 240b51a77715c53db784d5932702318ff28468c2
Author: flavio junqueira
Date:   2015-08-29T15:11:30Z

    Merge branch '1387' of https://github.com/fpj/kafka into 1387

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Fedor Korotkiy
> Assignee: Flavio Junqueira
> Priority: Blocker
> Labels: newbie, patch, zkclient-problems
> Attachments: KAFKA-1387.patch, kafka-1387.patch
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903437#comment-14903437 ]

Flavio Junqueira commented on KAFKA-1387:
-----------------------------------------

hey [~guozhang]

bq. Will the mixing usage of ZK directly and ZkClient together violate ordering? AFAIK ZkClient orders all events fired by watchers and hand them to the user callbacks one-by-one, if we use ZK's Watcher directly will its callback be called out-of-order with other events?

ZkClient indeed hands the processing off to a separate thread: to avoid blocking the dispatcher thread, it uses a separate thread to deliver events. This can be a problem if the events here and the events handled directly by ZkClient are correlated. I tried to confine the ZK processing for this feature to a single class to avoid ordering issues. I don't see a concrete problem, but if you do, let me know. Right now it sounds like you're just speculating that it could be a problem, yes?

bq. If we get a Code.OK in CreateCallback, do we still need to trigger a ZooKeeper.exist with ExistsCallback again?

Right, that exists call is there to set a watch.

bq. For the consumer / server registration case particularly, we tries to handle parent path creation in ZkUtils.makeSurePersistentPathExists, so I feel we should expose the problem that parent path does not exist yet instead trying to hide it in createRecursive.

I've commented on the PR about this. What's your specific concern here? If you could elaborate a bit more, I'd appreciate it.

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Fedor Korotkiy
> Assignee: Flavio Junqueira
> Priority: Critical
> Labels: newbie, patch, zkclient-problems
> Fix For: 0.9.0.0
> Attachments: KAFKA-1387.patch, kafka-1387.patch
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906672#comment-14906672 ]

ASF GitHub Bot commented on KAFKA-1387:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/kafka/pull/178

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1.1
> Reporter: Fedor Korotkiy
> Assignee: Flavio Junqueira
> Priority: Critical
> Labels: newbie, patch, zkclient-problems
> Fix For: 0.9.0.0
> Attachments: KAFKA-1387.patch, kafka-1387.patch
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150883#comment-14150883 ]

James Lent commented on KAFKA-1387:
-----------------------------------

I have seen this issue in our QA environment (3 ZooKeeper, 3 Kafka and several application-specific nodes) several times now. The problem is triggered when the system is under stress (high I/O and CPU load) and the ZooKeeper connections become unstable. When this happens, Kafka threads can get stuck trying to register Broker nodes and application threads get stuck trying to register Consumer nodes. One way to recover is to restart the impacted nodes.

As an experiment I also tried deleting the blocking ZooKeeper nodes (hours later, when the system was under no stress). When I did so, createEphemeralPathExpectConflictHandleZKBug would process one expire, break out of its loop, but then immediately reenter it when it tried to process the next expire message. The few times I tested this approach I had to delete the node dozens of times before the problem would clear itself. In other words, there were dozens of Expire messages waiting to be processed.

Obviously I am looking into this issue from a configuration point of view (avoiding the unstable connections), but this Kafka error behavior concerns me. I have reproduced it (somewhat artificially) in a dev environment as follows:

1) Start one ZooKeeper and one Kafka node.
2) Set a thread breakpoint in KafkaHealthcheck.scala at the line marked with -->:

    def handleNewSession() {
      info("re-registering broker info in ZK for broker " + brokerId)
  --> register()
      info("done re-registering broker")
      info("Subscribing to %s path to watch for new topics".format(ZkUtils.BrokerTopicsPath))
    }

3) Pause Kafka.
4) Wait for ZooKeeper to expire the first session and drop the ephemeral node.
5) Unpause Kafka.
6) Kafka reconnects with ZooKeeper, receives an Expire, and establishes a second session.
7) The breakpoint is hit and the event thread is paused before handling the first Expire.
8) Pause Kafka again.
9) Wait for ZooKeeper to expire the second session and delete the ephemeral node (again).
10) Remove the breakpoint, unpause Kafka, and finally release the event thread.
11) Kafka reconnects with ZooKeeper, receives a second Expire, and establishes a third session.
12) Kafka registers an ephemeral node triggered by the first Expire (which triggered the second session), but ZooKeeper associates it with the third session.
13) Kafka tries to register an ephemeral node triggered by the second Expire, but ZooKeeper already has a stable node.
14) Kafka assumes this node will go away soon, sleeps, and then retries.
15) The node is associated with a valid session and therefore does not go away, so Kafka remains stuck in the retry loop.

I have tested this with the latest code in trunk and noted the same behavior (the code looks pretty similar).

I have coded up a potential 0.8.1.1 patch for this issue based on the following principles:

1) Ensure that stale nodes are removed in main when the node starts.
   - For Brokers this means removing nodes with the same host name and port, otherwise failing to start (the existing checker logic).
   - For Consumer nodes, don't worry about stale nodes. The way they are named should prevent this from ever happening.
2) In main, add the initial node, which should now always work with no looping required (a direct call to createEphemeralPath).
3) Create an EphemeralNodeMonitor class that contains:
   - IZkDataListener
   - IZkStateListener
4) The users of this class provide a path to monitor and a closure that defines what to do when the node is not found.
5) When the state listener is notified about a new session, it checks whether the node is already gone:
   - Yes: call the provided function.
   - No: ignore the event.
6) When the data listener is notified of a deletion, it does the same thing.
7) Both the Broker and Consumer registration use this new class in the same way they currently use their individual state listeners. The only change in behavior is to call createEphemeralPath directly (and avoid the looping code).

Since all this work should be done in the event thread, I don't think there are any race conditions, and no other nodes should be adding these nodes (or we have a serious configuration issue that should have been detected at startup).

One assumption is that we will always receive at least one more event (expire and/or delete) after the node is really deleted by ZooKeeper. I think that is a valid assumption (ZooKeeper can't send the delete until the node is gone). I wonder if we could get away with monitoring only node deletions, but that seems risky. The only change in behavior should be that if the expire is received before the node is actually deleted, then the event loop is not blocked and can process other messages while waiting for the delete event.

Note: I have not touched the leader election /
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151260#comment-14151260 ]

Jun Rao commented on KAFKA-1387:
--------------------------------

James,

Thanks for reporting this. Yes, what you discovered is a real problem. The fix is going to be tricky though. The issue is the following. When a client loses an ephemeral node in ZK due to session expiration, that ephemeral node is not removed exactly at expiration time, but a short time after (ZOOKEEPER-1740). When the client tries to recreate the ephemeral node and gets a NodeExistsException, one of two things could have happened: (1) the existing node is from the expired session and is on its way to being deleted, or (2) the node was actually created on the latest session (the reason is what you discovered: the client gets multiple handleNewSession() calls due to multiple session expiration events, but the node is created on the latest session). I am not sure if there is an easy way to distinguish the two cases though.

Overall, it seems to me that there are many corner cases to deal with during ZK session expiration. The simplest approach is probably to prevent session expiration from happening at all (e.g., by setting a larger session timeout).

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
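The two NodeExists cases Jun Rao describes above can be written down as a small decision function. This is only a sketch under a stated assumption: that the caller can learn which session owns the conflicting ephemeral node. ZooKeeper's Java client exposes such values through `Stat.getEphemeralOwner()` and `ZooKeeper.getSessionId()`, but nothing in this thread establishes that Kafka's ZkClient-based code can reach them easily. The names `ConflictCases`, `classify`, and `shouldKeepWaiting` are hypothetical.

```scala
object ConflictCases {
  sealed trait Conflict
  // Case (2): the node was created by the session we are using right now.
  case object OwnedByCurrentSession extends Conflict
  // Case (1), presumably: the node belongs to some other session, most likely
  // an expired one whose ephemeral nodes are still pending deletion.
  // Limitation: a node owned by a different *live* client also lands here.
  case object OwnedByOtherSession extends Conflict

  // ownerSessionId: session that created the conflicting node (assumed known).
  // currentSessionId: the client's current session.
  def classify(ownerSessionId: Long, currentSessionId: Long): Conflict =
    if (ownerSessionId == currentSessionId) OwnedByCurrentSession
    else OwnedByOtherSession

  // Sleeping and retrying only makes sense while the node may still disappear.
  def shouldKeepWaiting(conflict: Conflict): Boolean =
    conflict == OwnedByOtherSession
}
```

Under this model the stuck broker corresponds to `OwnedByCurrentSession`: the node will never go away on its own, so an unbounded wait cannot succeed.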
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151264#comment-14151264 ]

Gwen Shapira commented on KAFKA-1387:
-------------------------------------

AFAIK the ZK bug was never reproduced in newer versions of ZK. I'm wondering if at some point we can say that ZK 3.3 is no longer supported and remove the workaround (which is creating a few issues of its own).

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151266#comment-14151266 ]

Jun Rao commented on KAFKA-1387:
--------------------------------

Gwen,

From ZOOKEEPER-1809, it seems the design of not deleting ephemeral nodes immediately on session expiration still exists in ZK 3.4.x and beyond?

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151275#comment-14151275 ]

Gwen Shapira commented on KAFKA-1387:
-------------------------------------

ZOOKEEPER-1809 was closed because the reproduction of the issue was buggy (the test app was actually creating two sessions at the same time). I agree that Flavio indicated that ZNodes can hang around after expiration, but he also indicated the opposite in the email thread for ZOOKEEPER-1740. It's important to get this right, so I'll do more research on the expected ZooKeeper behavior here.

One thing I'm not sure about is why createEphemeralPathExpectConflictHandleZKBug loops indefinitely. If ZK indeed takes a bit of extra time to clean up, we can loop for a specific amount of time (or number of retries), like Curator typically does. After a few seconds, the probability that the ZNode belongs to an active session and not an expired one is very high.

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
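Gwen Shapira's suggestion above, looping for a bounded number of retries instead of indefinitely, might look roughly like the following. It is a minimal sketch, not the Kafka or Curator implementation: `BoundedRetry`, `createWithBoundedRetries`, and the `tryCreate` callback (which stands in for one attempt to create the ephemeral node and returns false on a conflict) are all hypothetical names.

```scala
object BoundedRetry {
  /** Calls `tryCreate` until it returns true or the retry budget is spent.
    * Makes at most `maxRetries + 1` attempts, sleeping `backoffMs` between them.
    * Returns true on success and false if every attempt hit a conflict. */
  def createWithBoundedRetries(tryCreate: () => Boolean,
                               maxRetries: Int,
                               backoffMs: Long): Boolean = {
    var attempt = 0
    while (attempt <= maxRetries) {
      if (tryCreate()) return true
      attempt += 1
      // Back off only if another attempt is still coming.
      if (attempt <= maxRetries) Thread.sleep(backoffMs)
    }
    false
  }
}
```

When the function returns false the caller still has to decide what the surviving node means, so a bound turns the hang into an explicit decision point rather than removing the ambiguity.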
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151663#comment-14151663 ]

James Lent commented on KAFKA-1387:
-----------------------------------

As background, we are using ZooKeeper 3.4.5.

When trying to come up with a fix for this I did consider limiting the loop to 2 to 3 tries. My concerns with this approach were:

# Slow recovery if there are lots of Expire messages to process, and each of these could trigger redundant rebalance events until you get to the last one.
# What happens if you don't loop quite long enough? You are again stuck in a bad state when the ephemeral node does go away.

I also considered trying to access the Session Id and storing that value instead of (or in addition to) the timestamp in the node's data. That approach looked difficult to implement, error prone, and had the application doing what I would consider ZooKeeper work.

I agree there are a lot of corner cases to consider, but I think we are going to pursue the approach I outlined above. I would be happy to post the proposed solution for your review, but again I am not sure about the protocol around patch submission. I would not want this to be mistaken by someone for any kind of official patch without a lot more review.

When working on this approach I looked at the Curator PersistentEphemeralNode for ideas:

https://github.com/bazaarvoice/curator-extensions/blob/master/recipes/src/main/java/com/bazaarvoice/curator/recipes/PersistentEphemeralNode.java

This is Curator based, so it does not directly apply to Kafka (yet), but it also keys off nodeDelete to restore the node.

In the end I went with the simple idea that: "If the node still exists when we process an Expire event, then ZooKeeper will inform us if that node later goes away." If we can't trust ZooKeeper/ZkClient to do that then ...

{noformat}
class StateListener() extends IZkStateListener {
  def handleStateChanged(state: KeeperState) {}

  def handleNewSession() {
    if (zkClient.exists(path)) {
      info("New session started, but, ephemeral %s already/still exists".format(path))
    } else {
      info("New session started, recreate ephemeral node %s".format(path))
      recreateNode()
    }
  }
}
{noformat}

> Key: KAFKA-1387
> URL: https://issues.apache.org/jira/browse/KAFKA-1387
> Project: Kafka
> Issue Type: Bug
> Reporter: Fedor Korotkiy
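To see why the exists check in James Lent's StateListener above copes with several queued new-session callbacks, the same rule can be exercised against an in-memory stand-in. This is a sketch only: `FakeZk` is a hypothetical toy model of the namespace, not ZkClient, and it ignores sessions, watches, and delayed deletion entirely.

```scala
import scala.collection.mutable

object NewSessionDedup {
  // Hypothetical in-memory stand-in for the ZooKeeper namespace (not a real client).
  final class FakeZk {
    private val paths = mutable.Set.empty[String]
    def exists(path: String): Boolean = paths.contains(path)
    def createEphemeral(path: String): Unit = {
      require(!paths.contains(path), s"node $path already exists")
      paths += path
    }
    def delete(path: String): Unit = { paths -= path }
  }

  // Same rule as the listener: recreate the node only when it is missing.
  // Returns true if this callback recreated the node, false if it was a no-op.
  def handleNewSession(zk: FakeZk, path: String): Boolean =
    if (zk.exists(path)) false
    else { zk.createEphemeral(path); true }
}
```

With two callbacks queued for the same path, the first one creates the node and the second returns without looping. Only a later deletion makes the next callback recreate it, which is the situation the data listener in the proposal is meant to catch.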
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151673#comment-14151673 ]

James Lent commented on KAFKA-1387:
---

In case anyone is interested in the complete code for the new class I am testing with:

{noformat}
// Imports needed to compile this class on its own.
import kafka.utils.Logging
import org.I0Itec.zkclient.{IZkDataListener, IZkStateListener, ZkClient}
import org.apache.zookeeper.Watcher.Event.KeeperState

class EphemeralNodeMonitor(zkClient: ZkClient, path: String, recreateNode: () => Unit) extends Logging {
  val dataListener = new DataListener
  val stateListener = new StateListener

  def start() {
    zkClient.subscribeStateChanges(stateListener)
    zkClient.subscribeDataChanges(path, dataListener)
  }

  def close() {
    zkClient.unsubscribeStateChanges(stateListener)
    zkClient.unsubscribeDataChanges(path, dataListener)
  }

  class DataListener extends IZkDataListener {
    var oldData: String = null

    // Log only when the node's data actually changes.
    def handleDataChange(dataPath: String, newData: scala.Any) {
      if (!newData.toString.equals(oldData)) {
        oldData = newData.toString
        info("Ephemeral node %s has new data [%s]".format(dataPath, newData))
      }
    }

    // Recreate the node unless something has already recreated it.
    def handleDataDeleted(dataPath: String) {
      if (zkClient.exists(path)) {
        info("Ephemeral node %s was deleted, but, has already been recreated".format(dataPath))
      } else {
        info("Ephemeral node %s was deleted, recreate it".format(dataPath))
        recreateNode()
      }
    }
  }

  class StateListener() extends IZkStateListener {
    def handleStateChanged(state: KeeperState) {}

    // Recreate the node only if it is not already/still present.
    def handleNewSession() {
      if (zkClient.exists(path)) {
        info("New session started, but, ephemeral %s already/still exists".format(path))
      } else {
        info("New session started, recreate ephemeral node %s".format(path))
        recreateNode()
      }
    }
  }
}
{noformat}
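The case the monitor class relies on (the node from the expired session is still present when the new session is handled, and is deleted only later) can be walked through with an in-memory model. This is a sketch, not a test of ZooKeeper: `FakeEphemeralStore` and `StaleNodeDemo` are hypothetical names, and the model simply assumes that a delete notification is delivered for a node owned by the old session. Whether ZooKeeper guarantees that delivery is the open question in this thread.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of ephemeral nodes; not the ZkClient API.
final class FakeEphemeralStore {
  private val owners = mutable.Map.empty[String, Long] // path -> owning session id
  private val deleteWatchers = mutable.Map.empty[String, List[String => Unit]]

  def exists(path: String): Boolean = owners.contains(path)

  def create(path: String, session: Long): Unit = { owners(path) = session }

  def onDelete(path: String)(callback: String => Unit): Unit = {
    deleteWatchers(path) = callback :: deleteWatchers.getOrElse(path, Nil)
  }

  // Models the server removing an expired session's ephemeral nodes.
  // Assumption: delete watchers still fire for nodes owned by the old session.
  def expire(session: Long): Unit = {
    val gone = owners.collect { case (p, s) if s == session => p }.toList
    gone.foreach { p =>
      owners.remove(p)
      deleteWatchers.getOrElse(p, Nil).foreach(cb => cb(p))
    }
  }
}

object StaleNodeDemo {
  def run(): List[String] = {
    val store = new FakeEphemeralStore
    val path = "/brokers/ids/0"
    val log = mutable.ListBuffer.empty[String]
    var session = 1L

    def note(msg: String): Unit = { log += msg }
    def recreateNode(): Unit = {
      store.create(path, session)
      note(s"recreated in session $session")
    }

    store.create(path, session) // original registration in session 1

    // Counterpart of handleDataDeleted() in the class above.
    store.onDelete(path) { _ =>
      if (store.exists(path)) note("deleted, but already recreated")
      else recreateNode()
    }

    session = 2L // the client has moved on to a new session
    // Counterpart of handleNewSession(): the stale node is still visible.
    if (store.exists(path)) note("new session: node still exists")
    else recreateNode()

    store.expire(1L) // the server finally removes session 1's node
    log.toList
  }
}
```

In this model the new-session handler does nothing, and the node is recreated by the delete handler once the stale node disappears. If the delete notification were not delivered, the broker would remain unregistered, which is why that guarantee needs confirming.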
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151873#comment-14151873 ]

Jun Rao commented on KAFKA-1387:

James,

Contributing code to Kafka is pretty simple. You just need to attach a patch to the jira. As for your solution, we probably need to verify the following: will a watcher fire if it is registered on a path that was created by an already expired session and that is about to be deleted?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152337#comment-14152337 ]

James Lent commented on KAFKA-1387:
---

I apologize in advance for my ignorance, but I have one newbie question. My starting point is the 0.8.1.1 tag (really the 0.8.1.1 source distribution). Would it be OK for me to submit a patch against that baseline, or would it be better for me to first merge the code to trunk and then create the patch?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152453#comment-14152453 ] James Lent commented on KAFKA-1387: --- As for your question (which I agree is one of the key questions) I have the following comments: * The ZooKeeper documentation states there is one case where a watch may be missed which I do not think applies to the situation I am trying to address: "Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected." * In my testing the node is normally gone by the time the New Session event is handled which recreates the node. 
In that case I do not see a Delete message (I log the arrival of a delete event even if the node is already gone):
{noformat}
[2014-09-29 18:23:43,071] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,071] INFO Unable to reconnect to ZooKeeper service, session 0x148c36a0a94000f has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,071] INFO Initiating client connection, connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@56404645 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:23:43,072] INFO Opening socket connection to server localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,073] INFO Socket connection established to localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,074] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,082] INFO Closing socket connection to /10.210.10.165. (kafka.network.Processor)
[2014-09-29 18:23:43,087] INFO Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x148c36a0a940010, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,087] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,099] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2014-09-29 18:23:43,143] INFO New session started, recreate ephemeral node /brokers/ids/0 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,144] INFO Start registering broker 0 in ZooKeeper (kafka.server.KafkaHealthcheck)
[2014-09-29 18:23:43,161] INFO Registered broker 0 at path /brokers/ids/0 with address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$)
[2014-09-29 18:23:43,218] INFO Ephemeral node /brokers/ids/0 has new data [{"jmx_port":10001,"timestamp":"1412029423148","host":"jlent.digitalsmiths.com","version":1,"port":9092}] (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,237] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
{noformat}
* I have seen cases where the node is still present when the New Session is handled, and in that case I do see a Delete event a short while later. I don't have the logs that document that (don't ask me why I don't have logs to document the most important scenario). I will try to recreate that situation.
* As an alternative I modified the New Session handling code to do nothing (except log the arrival of the new session event). In that case I do see the Delete event. This could perhaps be viewed as a more severe test. In this case we get notified of a Delete that actually occurred before we even handled the New Session event. That was actually how I did some of my original testing.
{noformat}
[2014-09-29 18:14:31,414] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:14:31,414] INFO Unable to reconnect to ZooKeeper service, session 0x148c36a0a94000c has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,414] INFO Initiating client connection, connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@15c58840 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:14:31,414] INFO Opening socket connection to server localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO Socket connection established to localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,420] INFO
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153370#comment-14153370 ]

James Lent commented on KAFKA-1387:
---

I have messed things up. I tried to use the Submit Patch option. I filled out the fields in the form, but it never asked me for a file. I also specified labels that I assumed were related to the patch, but instead they are associated with the issue itself. I then directly attached the file to the issue. That seemed to go OK. Now the Submit Patch option is gone and the Status is Patch Available. I don't think that is correct. I decided it is best if I stop messing with the issue for now. I have done enough damage. I apologize for my ignorance of the process.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156635#comment-14156635 ] Jun Rao commented on KAFKA-1387: James, For my question, could you ask the ZK mailing list and get your understanding confirmed by their developers? Thanks,
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156746#comment-14156746 ] James Lent commented on KAFKA-1387: --- Good idea and done: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201410.mbox/browser
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1394#comment-1394 ] Guozhang Wang commented on KAFKA-1387: -- Hi Fedor, do you think this is caused by the same issue described in https://issues.apache.org/jira/browse/KAFKA-1382 ?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967841#comment-13967841 ] Fedor Korotkiy commented on KAFKA-1387: --- I think it's a different issue.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967928#comment-13967928 ]

Guozhang Wang commented on KAFKA-1387:
--

I think the main issue here is that, when there is a ZooKeeper session timeout, the zkClient retries writing data that may already have been committed to ZK, and the retry fails. This is the same underlying issue as the one causing KAFKA-1382, but I think their fixes would be different.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085792#comment-14085792 ]

Joe Stein commented on KAFKA-1387:
--

Here is another way to reproduce this issue. I have seen it a few times now with folks getting going with their clusters.

Steps to reproduce: install a 3 node zk ensemble with a 3 broker cluster, e.g.

git clone https://github.com/stealthly/scala-kafka
git checkout -b zkbk3 origin/zkbk3
vagrant up provider=virtualbox

Now set up each node in the cluster as you would brokers 1, 2, 3 and the ensemble, e.g.

vagrant ssh zkbkOne
sudo su
cd /vagrant/vagrant/ && ./up.sh

vagrant ssh zkbkTwo
sudo su
cd /vagrant/vagrant/ && ./up.sh

vagrant ssh zkbkThree
sudo su
cd /vagrant/vagrant/ && ./up.sh

Start up zookeeper on all 3 nodes:

cd /opt/apache/kafka && bin/zookeeper-server-start.sh config/zookeeper.properties 1>>/tmp/zk.log 2>>/tmp/zk.log &

Now, start up the broker on node 2 only:

cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &

OK, now here is where it gets wonky: change the broker.id on server 3 to 2. Now you need to start up servers 1 and 3 (even though server 3 now has broker.id 2) at the same time:

cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &
cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &

(You can have two tabs; hitting enter in one, switching to the other tab and hitting enter is close enough to the same time.)

And you get this looping forever:

[2014-08-05 04:34:38,591] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:44,598] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data:
{"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$) [2014-08-05 04:34:44,601] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$) [2014-08-05 04:34:50,610] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: {"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$) [2014-08-05 04:34:50,614] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$) [2014-08-05 04:34:56,621] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: {"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$) the expected result that you get should be [2014-08-05 04:07:20,917] INFO conflict in /brokers/ids/2 data: {"jmx_port":-1,"timestamp":"1407211640900","host":"192.168.30.3","version":1,"port":9092} stored data: {"jmx_port":-1,"timestamp":"140721119 9464","host":"192.168.30.2","version":1,"port":9092} (kafka.utils.ZkUtils$) [2014-08-05 04:07:20,949] FATAL Fatal error during KafkaServerStable startup. Prepare to shutdown (kafka.server.KafkaServerStartable) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/2. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering. 
at kafka.utils.ZkUtils$.registerBrokerInZk(ZkUtils.scala:205) at kafka.server.KafkaHealthcheck.register(KafkaHealthcheck.scala:57) at kafka.server.KafkaHealthcheck.startup(KafkaHealthcheck.scala:44) at kafka.server.KafkaServer.startup(KafkaServer.scala:103) at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:34) at kafka.Kafka$.main(Kafka.scala:46) at kafka.Kafka.main(Kafka.scala) [2014-08-05 04:07:20,952] INFO [Kafka Server 2], shutting down (kafka.server.KafkaServer) [2014-08-05 04:07:20,954] INFO [Socket Server on Broker 2], Shutting down (kafka.network.SocketServer) [2014-08-05 04:07:20,959] INFO [Socket Server on Broker 2], Shutdown completed (kafka.network.SocketServer) [2014-08-05 04:07:20,960] INFO [Kafka Request Handler on Broker 2], shutting down (kafka.server.KafkaRequestHandlerPool) [2014-08-05 04:07:20,992] INFO [Kafka Request Handler on Broker 2], shut down completely (kafka.server.KafkaRequestHandlerPool) [2014-08-05 04:07:21,263] INFO [Replica Manager on Broker 2]: Shut down (kafka.server.ReplicaManager) [
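One way to read the looping log above: the conflict on /controller is classified as "written by me in an earlier session" because the stored broker id matches, even though the timestamps differ and the node belongs to another live broker. The sketch below models that reading only. `ControllerData`, `Decision` and `ConflictCheck` are hypothetical names, and the assumption that ownership is judged by broker id alone is inferred from the log messages, not confirmed against Kafka's source.

```scala
// Hypothetical, simplified model of the conflict decision; not Kafka's actual code.
final case class ControllerData(brokerId: Int, timestamp: Long)

sealed trait Decision
case object BackOffAndRetry extends Decision // treat the node as our own stale node
case object FailFast extends Decision        // treat the node as owned by someone else

object ConflictCheck {
  // Assumption: ownership is judged by broker id alone, ignoring the timestamp.
  def byBrokerIdOnly(mine: ControllerData, stored: ControllerData): Decision =
    if (stored.brokerId == mine.brokerId) BackOffAndRetry else FailFast
}

object ConflictCheckDemo {
  // Values taken from the "conflict in /controller" log lines above.
  val mine: ControllerData = ControllerData(brokerId = 2, timestamp = 1407212148186L)
  val stored: ControllerData = ControllerData(brokerId = 2, timestamp = 1407211911014L)

  val decision: Decision = ConflictCheck.byBrokerIdOnly(mine, stored)
}
```

Under that assumption `ConflictCheckDemo.decision` is `BackOffAndRetry`. Since the stored node belongs to a live session on the other broker, it is never deleted, so the retry never succeeds, which would match the repeated log lines.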
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086398#comment-14086398 ] Jun Rao commented on KAFKA-1387: Joe, The issue that you described is probably fixed in KAFKA-1451?
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087063#comment-14087063 ]

Joe Stein commented on KAFKA-1387:

[~junrao] I tested on trunk and it is much worse now. Instead of looping on the /controller node (like it was before), node 3 actually overwrote/stole /brokers/ids/2 (doing a get before had it as 192.168.30.1 and after it is 192.168.30.1). So now I have a situation where two broker servers are running with the same broker id: node 3 is the broker with all the topics being created on it and is failing requests for producing and consuming (because all the data is on node 1, but that is not advertised), and node 1 is still the controller.
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088474#comment-14088474 ]

Gwen Shapira commented on KAFKA-1387:

Attempted to reproduce with trunk as well. I'm not seeing the same behavior as [~joestein]. In my experiment the new broker 2 fails with the correct error message. The old broker 2, OTOH, goes into a loop, printing:

"[2014-08-06 16:37:01,884] INFO Partition [test1,0] on broker 2: Cached zkVersion [89] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)"

Not good behavior either.
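The log line in the comment above reads like a version-checked (compare-and-set) write being skipped. The following is a minimal sketch of that general pattern, assuming a writer that holds a cached version number; it is not Kafka's code, the thread does not establish why the version went stale in this experiment, and `VersionedNode` and `update_isr` are hypothetical names.

```python
class VersionedNode:
    """In-memory stand-in for a znode that carries a version counter."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        """Write only if the caller's version matches; return True on success."""
        if expected_version != self.version:
            return False
        self.data = data
        self.version += 1
        return True


def update_isr(node, new_isr, cached_version, attempts=3, refresh=False):
    """Try a conditional update; return (succeeded, number_of_skips)."""
    skipped = 0
    for _ in range(attempts):
        if node.conditional_set(new_isr, cached_version):
            return True, skipped
        skipped += 1  # analogous to one "skip updating ISR" log line
        if refresh:
            # Re-read the current version before trying again.
            cached_version = node.version
    return False, skipped


node = VersionedNode(data=[1, 2])
cached = node.version                # writer caches version 0
node.conditional_set([2], cached)    # some other writer bumps the version to 1

stale = update_isr(node, [2, 1], cached)                # never refreshes
fresh = update_isr(node, [2, 1], cached, refresh=True)  # refreshes after a skip
```

A writer that never refreshes its cached version skips on every attempt, which would look like the repeating message; one that re-reads the version succeeds on the next try.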
[jira] [Commented] (KAFKA-1387) Kafka getting stuck creating ephemeral node it has already created when two zookeeper sessions are established in a very short period of time
[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092255#comment-14092255 ]

Jun Rao commented on KAFKA-1387:

Hmm, this seems really weird. Not sure why starting two brokers at the same time will affect the ZK registration. Is this reproducible by running multiple brokers on the same machine?