Akihiro Suda created ZOOKEEPER-2162:
---------------------------------------
             Summary: infinite exception loop occurs when dataDir is lost
                 Key: ZOOKEEPER-2162
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.0
            Reporter: Akihiro Suda


The following sequence sends server.1 and server.2 into an infinite exception loop.

 * Start server.1 and server.2 with the initial ensemble server.1=participant, server.2=observer. At this point, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
 * Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] increase to 2.
 * Kill server.2.
 * Remove the dataDir of server.2, keeping only the myid file. (In real production environments, both confDir and dataDir can be lost due to reprovisioning.)
 * Start server.2.
 * server.1 and server.2 enter an infinite exception loop. The log size (with the threshold set to INFO in log4j.properties) can exceed 100MB in 30 seconds.

AFAIK, the bug can be reproduced with ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).

I made a Docker container so that anyone interested can reproduce the bug easily. (Sorry, no JUnit tests right now.)

{noformat}
$ docker run -i -t --rm akihirosuda/zookeeper-bug01
Reproducing the bug: infinite exception loop occurs when dataDir is lost
 * Resetting
 * Starting [1,2] with the initial ensemble [1]
 * Sleeping for 3 seconds
 * Invoking Reconfig [1]->[2]
 * Sleeping for 3 seconds
 * Killing server.2 (pid=10542)
 * Sleeping for 3 seconds
 * Resetting /zk02_data
 * Starting server.2
 * Sleeping for 30 seconds
/zk01_log: 81665114 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
/zk02_log: 23949367 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
 * Exiting
{noformat}
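
For anyone who wants to check their own log directories without the container, the size check reported at the end of the output above could look roughly like this. It is only a sketch and not the script shipped in the Docker image: the helper names dir_size_bytes and looks_reproduced and the 10 MB threshold are assumptions made for illustration. Only the directory names /zk01_log and /zk02_log and the printed message come from the output above.

```python
import os

# Hypothetical threshold: the report does not state what the container uses.
THRESHOLD_BYTES = 10 * 1024 * 1024


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path.

    A missing directory yields 0 because os.walk produces no entries for it.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Count only regular files; this skips broken symlinks.
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def looks_reproduced(size_bytes, threshold=THRESHOLD_BYTES):
    """Heuristic: an unusually large log dir suggests the exception loop."""
    return size_bytes > threshold


if __name__ == "__main__":
    for log_dir in ("/zk01_log", "/zk02_log"):
        size = dir_size_bytes(log_dir)
        print(f"{log_dir}: {size} bytes")
        if looks_reproduced(size):
            print("The log dir is extremely large. Perhaps the bug was REPRODUCED!")
```

Both sizes reported above (81665114 and 23949367 bytes after 30 seconds) are well over this assumed threshold, so the exact cutoff matters little as long as it sits far above the size of a normal startup log.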