[ 
https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014006#comment-15014006
 ] 

Zach Cox commented on KAFKA-1451:
---------------------------------

[~fpj] Yes we saw the "I wrote this conflicted ephemeral node" error messages, 
we saw lots of partitions in/out of ISRs and a lot of this too:

{code}
[2015-11-19 01:05:51,685] INFO Opening socket connection to server 
ip-10-10-1-35.ec2.internal/10.10.1.35:2181. Will not attempt to authenticate 
using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,685] INFO Socket connection established to 
ip-10-10-1-35.ec2.internal/10.10.1.35:2181, initiating session 
(org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,687] INFO Unable to reconnect to ZooKeeper service, 
session 0x54a0e5799a8195d has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,687] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2015-11-19 01:05:51,687] INFO Initiating client connection, 
connectString=zookeeper1.production.redacted.com:2181,zookeeper2.production.redacted.com:2181,zookeeper3.production.redacted.com:2181/kafka
 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@ace1333 
(org.apache.zookeeper.ZooKeeper)
[2015-11-19 01:05:51,701] INFO EventThread shut down 
(org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,701] ERROR Error handling event ZkEvent[New session event 
sent to kafka.controller.KafkaController$SessionExpirationListener@2261adb8] 
(org.I0Itec.zkclient.ZkEventThread)
java.lang.IllegalStateException: Kafka scheduler has not been started
  at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114)
  at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86)
  at 
kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350)
  at 
kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply$mcZ$sp(KafkaController.scala:1108)
  at 
kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107)
  at 
kafka.controller.KafkaController$SessionExpirationListener$$anonfun$handleNewSession$1.apply(KafkaController.scala:1107)
  at kafka.utils.Utils$.inLock(Utils.scala:535)
  at 
kafka.controller.KafkaController$SessionExpirationListener.handleNewSession(KafkaController.scala:1107)
  at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472)
  at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
[2015-11-19 01:05:51,701] INFO re-registering broker info in ZK for broker 3 
(kafka.server.KafkaHealthcheck)
[2015-11-19 01:05:51,701] INFO Opening socket connection to server 
ip-10-10-1-104.ec2.internal/10.10.1.104:2181. Will not attempt to authenticate 
using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,702] INFO Socket connection established to 
ip-10-10-1-104.ec2.internal/10.10.1.104:2181, initiating session 
(org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,713] INFO Session establishment complete on server 
ip-10-10-1-104.ec2.internal/10.10.1.104:2181, sessionid = 0x64a0e57972a1a85, 
negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2015-11-19 01:05:51,713] INFO zookeeper state changed (SyncConnected) 
(org.I0Itec.zkclient.ZkClient)
[2015-11-19 01:05:51,718] INFO Registered broker 3 at path /brokers/ids/3 with 
address mesos-slave3.production.redacted.com:9092. (kafka.utils.ZkUtils$)
[2015-11-19 01:05:51,718] INFO done re-registering broker 
(kafka.server.KafkaHealthcheck)
[2015-11-19 01:05:51,718] INFO Subscribing to /brokers/topics path to watch for 
new topics (kafka.server.KafkaHealthcheck)
[2015-11-19 01:05:51,721] INFO New leader is 1 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
{code}

> Broker stuck due to leader election race 
> -----------------------------------------
>
>                 Key: KAFKA-1451
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1451
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1.1
>            Reporter: Maciek Makowski
>            Assignee: Manikumar Reddy
>            Priority: Minor
>              Labels: newbie
>             Fix For: 0.8.2.0
>
>         Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, 
> KAFKA-1451_2014-07-29_10:13:23.patch
>
>
> h3. Symptoms
> The broker does not become available due to being stuck in an infinite loop 
> while electing leader. This can be recognised by the following line being 
> repeatedly written to server.log:
> {code}
> [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node 
> [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a 
> while back in a different session, hence I will backoff for this node to be 
> deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> {code}
> h3. Steps to Reproduce
> In a single kafka 0.8.1.1 node, single zookeeper 3.4.6 (but will likely 
> behave the same with the ZK version included in Kafka distribution) node 
> setup:
> # start both zookeeper and kafka (in any order)
> # stop zookeeper
> # stop kafka
> # start kafka
> # start zookeeper
> h3. Likely Cause
> {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then 
> triggers an election. if the deletion of ephemeral {{/controller}} node 
> associated with previous zookeeper session of the broker happens after 
> subscription to changes in new session, election will be invoked twice, once 
> from {{startup}} and once from {{handleDataDeleted}}:
> * {{startup}}: acquire {{controllerLock}}
> * {{startup}}: subscribe to data changes
> * zookeeper: delete {{/controller}} since the session that created it timed 
> out
> * {{handleDataDeleted}}: {{/controller}} was deleted
> * {{handleDataDeleted}}: wait on {{controllerLock}}
> * {{startup}}: elect -- writes {{/controller}}
> * {{startup}}: release {{controllerLock}}
> * {{handleDataDeleted}}: acquire {{controllerLock}}
> * {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then 
> gets into infinite loop as a result of conflict
> {{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing 
> znode was written from different session, which is not true in this case; it 
> was written from the same session. That adds to the confusion.
> h3. Suggested Fix
> In {{ZookeeperLeaderElector.startup}} first run {{elect}} and then subscribe 
> to data changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to