[ https://issues.apache.org/jira/browse/KAFKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on KAFKA-4229 started by Pengwei.
--------------------------------------
> Controller can't start after several zk expired event
> -----------------------------------------------------
>
>                 Key: KAFKA-4229
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4229
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Pengwei
>            Assignee: Pengwei
>
> We found that the controller failed to start after several ZooKeeper session
> expiration events in our test environment. From the logs, the controller
> handles the ephemeral node's data-deleted event first and the session
> expiration event afterwards, which leaves the cluster without a controller.
>
> I can reproduce it in my development environment:
> 1. Set up an environment with one broker and one ZooKeeper node, and specify
> a very large ZooKeeper timeout (20s).
> 2. Stop the broker and remove the /brokers/ids/0 directory from ZooKeeper.
> 3. Restart the broker and set a breakpoint in the ZooKeeper client's event
> thread so that the delete event stays queued.
> 4. After the /controller node is gone, the breakpoint is hit.
> 5. Expire the current session (suspend the send thread) and create a new
> session s2.
> 6. Resume the event thread. The controller handles
> LeaderChangeListener.handleDataDeleted and becomes leader.
> 7. The controller then handles SessionExpirationListener.handleNewSession:
> it resigns as controller and runs an election, but during the election it
> finds that the /controller node already exists, so it does not become
> leader. Because the /controller node was created by the current session s2,
> it is never removed, so no controller remains.
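
The event ordering described in steps 6 and 7 of the report can be illustrated with a small toy model. This is only a sketch and not Kafka or ZooKeeper client code: `ToyZk`, `ToyController`, `replay`, and the session name `s1` are hypothetical names invented for this illustration, and the election logic is simplified to "create the /controller ephemeral node if it does not exist", as the report describes it.

```python
class ToyZk:
    """Hypothetical stand-in for ZooKeeper: tracks ephemeral nodes by owner session."""

    def __init__(self):
        self.ephemerals = {}  # path -> id of the session that created it

    def create_ephemeral(self, path, session):
        # Creation fails if the node already exists, whoever owns it.
        if path in self.ephemerals:
            return False
        self.ephemerals[path] = session
        return True

    def expire(self, session):
        # Expiring a session removes every ephemeral node it owns.
        self.ephemerals = {p: s for p, s in self.ephemerals.items() if s != session}


class ToyController:
    """Hypothetical, simplified controller with the two listeners named in the report."""

    def __init__(self, zk):
        self.zk = zk
        self.active = False

    def elect(self, session):
        # Become leader only if this call creates /controller.
        if self.zk.create_ephemeral("/controller", session):
            self.active = True
        return self.active

    def handle_data_deleted(self, session):
        # Models LeaderChangeListener.handleDataDeleted: try to take over.
        self.elect(session)

    def handle_new_session(self, session):
        # Models SessionExpirationListener.handleNewSession: resign, then re-elect.
        self.active = False
        self.elect(session)


def replay(events):
    """Run the listeners in the given order, all under the new session s2."""
    zk = ToyZk()
    controller = ToyController(zk)
    controller.elect("s1")   # initial election under the old session
    zk.expire("s1")          # old session expires, /controller disappears
    for event in events:
        getattr(controller, event)("s2")
    return controller.active, zk.ephemerals.get("/controller")


if __name__ == "__main__":
    # Ordering from the report: the queued delete event runs before the new-session event.
    print(replay(["handle_data_deleted", "handle_new_session"]))  # (False, 's2')
    # Without the stale delete event, the controller recovers.
    print(replay(["handle_new_session"]))                         # (True, 's2')
```

In the first replay the /controller node is owned by the live session s2, so nothing ever deletes it, yet the controller has resigned. This matches the stuck state described in step 7.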