[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085792#comment-14085792 ]
Joe Stein edited comment on KAFKA-1387 at 8/5/14 4:41 AM:
----------------------------------------------------------
Here is another way to reproduce this issue. I have seen it a few times now with folks getting their clusters going.

Steps to reproduce: install a 3-node ZooKeeper ensemble with a 3-broker cluster, e.g.

git clone https://github.com/stealthly/scala-kafka
git checkout -b zkbk3 origin/zkbk3
vagrant up provider=virtualbox

Now set up each node in the cluster as you would brokers 1, 2, 3 and the ensemble, e.g.

vagrant ssh zkbkOne
sudo su
cd /vagrant/vagrant/ && ./up.sh

vagrant ssh zkbkTwo
sudo su
cd /vagrant/vagrant/ && ./up.sh

vagrant ssh zkbkThree
sudo su
cd /vagrant/vagrant/ && ./up.sh

Start up ZooKeeper on all 3 nodes:

cd /opt/apache/kafka && bin/zookeeper-server-start.sh config/zookeeper.properties 1>>/tmp/zk.log 2>>/tmp/zk.log &

Now start up the broker on node 2 only:

cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &

OK, now here is where it gets wonky. On server 3, change broker.id=3 to broker.id=2. Now start up servers 1 and 3 (even though server 3 is configured as broker.id=2) at the same time:

cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &
cd /opt/apache/kafka && bin/kafka-server-start.sh config/server.properties 1>>/tmp/bk.log 2>>/tmp/bk.log &

(You can have two tabs; hitting enter in one, switching to the other tab, and hitting enter there is close enough to the same time.) And you get this, looping forever:

[2014-08-05 04:34:38,591] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:44,598] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: {"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$)
[2014-08-05 04:34:44,601] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:50,610] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: {"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$)
[2014-08-05 04:34:50,614] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":2,"timestamp":"1407212148186"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2014-08-05 04:34:56,621] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1407212148186"} stored data: {"version":1,"brokerid":2,"timestamp":"1407211911014"} (kafka.utils.ZkUtils$)

The expected result should instead be:

[2014-08-05 04:07:20,917] INFO conflict in /brokers/ids/2 data: {"jmx_port":-1,"timestamp":"1407211640900","host":"192.168.30.3","version":1,"port":9092} stored data: {"jmx_port":-1,"timestamp":"1407211199464","host":"192.168.30.2","version":1,"port":9092} (kafka.utils.ZkUtils$)
[2014-08-05 04:07:20,949] FATAL Fatal error during KafkaServerStable startup. Prepare to shutdown (kafka.server.KafkaServerStartable)
java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/2. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering.
        at kafka.utils.ZkUtils$.registerBrokerInZk(ZkUtils.scala:205)
        at kafka.server.KafkaHealthcheck.register(KafkaHealthcheck.scala:57)
        at kafka.server.KafkaHealthcheck.startup(KafkaHealthcheck.scala:44)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:103)
        at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:34)
        at kafka.Kafka$.main(Kafka.scala:46)
        at kafka.Kafka.main(Kafka.scala)
[2014-08-05 04:07:20,952] INFO [Kafka Server 2], shutting down (kafka.server.KafkaServer)
[2014-08-05 04:07:20,954] INFO [Socket Server on Broker 2], Shutting down (kafka.network.SocketServer)
[2014-08-05 04:07:20,959] INFO [Socket Server on Broker 2], Shutdown completed (kafka.network.SocketServer)
[2014-08-05 04:07:20,960] INFO [Kafka Request Handler on Broker 2], shutting down (kafka.server.KafkaRequestHandlerPool)
[2014-08-05 04:07:20,992] INFO [Kafka Request Handler on Broker 2], shut down completely (kafka.server.KafkaRequestHandlerPool)
[2014-08-05 04:07:21,263] INFO [Replica Manager on Broker 2]: Shut down (kafka.server.ReplicaManager)
[2014-08-05 04:07:21,263] INFO [ReplicaFetcherManager on broker 2] shutting down (kafka.server.ReplicaFetcherManager)

which is what you get if you just start server 3 on its own configured with broker.id=2.

I originally saw this on a 12-node AWS cluster, same Kafka 0.8.1.1 and ZooKeeper 3.3.4. I don't know how critical this is; someone just brought up something similar but with /brokers/ids/x. I don't know if they are related, but something wonky is going on in the ZkUtils.createEphemeralPathExpectConflictHandleZKBug code paths.
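The forever-looping INFO messages come out of the retry loop in ZkUtils.createEphemeralPathExpectConflictHandleZKBug. Below is a minimal, self-contained sketch of that loop's behavior, not Kafka's actual code: it is written in plain Java purely for illustration (Kafka's implementation is Scala), the in-memory map stands in for ZooKeeper, and the broker-id comparison is an assumed stand-in for the data-matching check the real code performs. It shows why the loop spins when the conflicting node belongs to another live session (the duplicate broker.id case), so the "wait for ZooKeeper to delete it" backoff never succeeds.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of the stuck retry loop; names and the brokerId
// checker are illustrative assumptions, not Kafka's real API.
public class EphemeralConflictSim {
    // Stands in for ZooKeeper: path -> ephemeral data. A node owned by a
    // live session is never deleted, which is what makes the loop spin.
    static final Map<String, String> store = new HashMap<>();

    // Create an ephemeral node; returns false on a conflict (node exists).
    static boolean createEphemeral(String path, String data) {
        if (store.containsKey(path)) return false;
        store.put(path, data);
        return true;
    }

    // Assumed checker: compare broker ids in the payload, ignoring timestamps
    // (the logged conflicts above differ only in their timestamps).
    static Optional<String> brokerId(String data) {
        Matcher m = Pattern.compile("\"brokerid\":(\\d+)").matcher(data);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // When the stored node "looks like ours" (same broker id), the 0.8.1 code
    // logs "I wrote this conflicted ephemeral node ... I will backoff for this
    // node to be deleted by Zookeeper and retry". If the node belongs to
    // another live session (a second broker with a duplicate broker.id), it is
    // never deleted and the loop runs forever. maxRetries caps the loop here
    // only so this sketch terminates.
    static String createWithRetry(String path, String data, int maxRetries) {
        for (int attempts = 0; attempts < maxRetries; attempts++) {
            if (createEphemeral(path, data)) return "created";
            if (brokerId(store.get(path)).equals(brokerId(data))) {
                continue; // back off, retry; the awaited delete never happens
            }
            return "fatal"; // "A broker is already registered on the path ..."
        }
        return "stuck"; // the real loop has no cap: this is the looping state
    }

    public static void main(String[] args) {
        // Server 2's live session already holds /controller for brokerid 2:
        store.put("/controller",
            "{\"version\":1,\"brokerid\":2,\"timestamp\":\"1407211911014\"}");
        // Server 3, misconfigured with the same broker.id=2, writes the same id:
        System.out.println(createWithRetry("/controller",
            "{\"version\":1,\"brokerid\":2,\"timestamp\":\"1407212148186\"}", 5));
        // prints "stuck"
    }
}
```

The same spin applies to the /brokers/ids/x reports: any path where the conflicting ephemeral node carries matching identity data but is owned by a session that stays alive.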
> Kafka getting stuck creating ephemeral node it has already created when two
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Fedor Korotkiy
>
> Kafka broker re-registers itself in zookeeper every time the handleNewSession()
> callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. The handleNewSession() callback is queued by
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing the callback a second time.
> 3) The first callback is invoked, creating the /broker/[id] ephemeral path.
> 4) The second callback is invoked and tries to create the /broker/[id] path using
> the createEphemeralPathExpectConflictHandleZKBug() function. But the path
> already exists, so createEphemeralPathExpectConflictHandleZKBug() gets stuck
> in an infinite loop.
> The controller election code seems to have the same issue.
> I'm able to reproduce this issue on the 0.8.1 branch from github using the
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using
> Ctrl-Z.

--
This message was sent by Atlassian JIRA
(v6.2#6252)