[ https://issues.apache.org/jira/browse/ZOOKEEPER-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Akihiro Suda updated ZOOKEEPER-2162:
------------------------------------
    Description:
This sequence drives server.1 and server.2 into an infinite exception loop.
* Start server.1 and server.2 with the initial ensemble server.1=participant, server.2=observer.
  At this point, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
* Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] grow to 2.
* Kill server.2.
* Remove the dataDir of server.2, excluding the myid file.
  (In real production environments, both confDir and dataDir can be lost due to reprovisioning.)
* Start server.2.
* server.1 and server.2 enter an infinite exception loop.
  The log size (threshold set to INFO in log4j.properties) can exceed 100MB in 30 seconds.

AFAIK, the bug can be reproduced with ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).

I made a Docker container so that anyone interested can reproduce the bug easily. (Sorry, there is no JUnit test right now.)

{noformat}
$ docker run -i -t --rm akihirosuda/zookeeper-bug01
Reproducing the bug: infinite exception loop occurs when dataDir is lost
 * Resetting
 * Starting [1,2] with the initial ensemble [1]
 * Sleeping for 3 seconds
 * Invoking Reconfig [1]->[2]
 * Sleeping for 3 seconds
 * Killing server.2 (pid=10542)
 * Sleeping for 3 seconds
 * Resetting /zk02_data
 * Starting server.2
 * Sleeping for 30 seconds
/zk01_log: 81665114 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
/zk02_log: 23949367 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
 * Exiting
{noformat}

For details of the log, please refer to https://github.com/AkihiroSuda/suda-pub/blob/master/dockerfiles/zookeeper-bug01/README.md .

  was:
This sequence drives server.1 and server.2 into an infinite exception loop.
* Start server.1 and server.2 with the initial ensemble server.1=participant, server.2=observer.
  At this point, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
* Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] grow to 2.
* Kill server.2.
* Remove the dataDir of server.2, excluding the myid file.
  (In real production environments, both confDir and dataDir can be lost due to reprovisioning.)
* Start server.2.
* server.1 and server.2 enter an infinite exception loop.
  The log size (threshold set to INFO in log4j.properties) can exceed 100MB in 30 seconds.

AFAIK, the bug can be reproduced with ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).

I made a Docker container so that anyone interested can reproduce the bug easily. (Sorry, there are no JUnit tests right now.)

{noformat}
$ docker run -i -t --rm akihirosuda/zookeeper-bug01
Reproducing the bug: infinite exception loop occurs when dataDir is lost
 * Resetting
 * Starting [1,2] with the initial ensemble [1]
 * Sleeping for 3 seconds
 * Invoking Reconfig [1]->[2]
 * Sleeping for 3 seconds
 * Killing server.2 (pid=10542)
 * Sleeping for 3 seconds
 * Resetting /zk02_data
 * Starting server.2
 * Sleeping for 30 seconds
/zk01_log: 81665114 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
/zk02_log: 23949367 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
 * Exiting
{noformat}


> infinite exception loop occurs when dataDir is lost
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2162
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>            Reporter: Akihiro Suda
>         Attachments: ZOOKEEPER-2162.patch
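
The final step of the reproduction transcript reports each log directory's size in bytes and flags it when it is "extremely large". A minimal sketch of that kind of check follows, in Python. The function names and the 10 MiB threshold are assumptions made for illustration; the cutoff used by the actual reproduction script is not stated in this report.

```python
import os

# Assumed cutoff for "extremely large"; the real script's value is not given.
SUSPICIOUS_BYTES = 10 * 1024 * 1024


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so a link does not count its target twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def looks_reproduced(path, threshold=SUSPICIOUS_BYTES):
    """Return True when the log directory has grown past the threshold."""
    return dir_size_bytes(path) > threshold
```

After the 30-second sleep, calling `looks_reproduced("/zk01_log")` and `looks_reproduced("/zk02_log")` would give a rough signal of whether the exception loop occurred. A size check is only a heuristic, so the log contents should still be inspected to confirm the loop.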