[jira] [Updated] (ZOOKEEPER-2162) infinite exception loop occurs when dataDir is lost

Akihiro Suda (JIRA) Sun, 12 Apr 2015 21:41:39 -0700

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Akihiro Suda updated ZOOKEEPER-2162:
------------------------------------
    Description: 
The following sequence drives server.1 and server.2 into an infinite exception loop.

 * Start server.1 and server.2 with the initial ensemble server.1=participant, server.2=observer.
   At this point, acceptedEpoch[i] == currentEpoch[i] == 1 for i = 1, 2.
 * Invoke reconfig so that acceptedEpoch[i] and currentEpoch[i] grow to 2.
 * Kill server.2.
 * Remove the dataDir of server.2, excluding the myid file.
   (In real production environments, both confDir and dataDir can be lost due to reprovisioning.)
 * Start server.2.
 * server.1 and server.2 enter an infinite exception loop.
   The log size (with the threshold set to INFO in log4j.properties) can exceed 100MB in 30 seconds.
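
The "remove dataDir excluding the myid file" step can be sketched in Python as below. This is only an illustration, not part of the reproduction container: the helper name `wipe_data_dir` is hypothetical, and it assumes the myid file sits directly inside dataDir.

```python
import shutil
from pathlib import Path


def wipe_data_dir(data_dir, keep=("myid",)):
    """Delete everything directly under data_dir except the names in keep.

    Hypothetical helper: simulates losing a server's dataDir while the
    myid file survives. Returns the sorted names that were removed.
    """
    removed = []
    for entry in Path(data_dir).iterdir():
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # subdirectories holding snapshots, logs, epoch files
        else:
            entry.unlink()  # plain files and symlinks
        removed.append(entry.name)
    return sorted(removed)
```

Calling `wipe_data_dir("/zk02_data")` on a stopped server.2 would leave only myid behind, which corresponds to the state server.2 is restarted from in the steps above.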

AFAIK, the bug can be reproduced with 
ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).

I built a Docker image so that anyone interested can reproduce the bug easily. (Sorry, there is no JUnit test yet.)
{noformat}
$ docker run -i -t --rm akihirosuda/zookeeper-bug01
Reproducing the bug: infinite exception loop occurs when dataDir is lost
* Resetting
* Starting [1,2] with the initial ensemble [1]
* Sleeping for 3 seconds
* Invoking Reconfig [1]->[2]
* Sleeping for 3 seconds
* Killing server.2 (pid=10542)
* Sleeping for 3 seconds
* Resetting /zk02_data
* Starting server.2
* Sleeping for 30 seconds
/zk01_log: 81665114 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
/zk02_log: 23949367 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
* Exiting
{noformat}
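
The size check at the end of the output above can be approximated as follows. This is a minimal sketch, not the container's actual script: the names `dir_size_bytes` and `report` and the 10 MiB threshold are assumptions made for illustration.

```python
import os

# Assumed threshold; the value used by the container's script is not stated here.
THRESHOLD_BYTES = 10 * 1024 * 1024


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def report(path, threshold=THRESHOLD_BYTES):
    """Return report lines, in the style of the output above, for one log dir."""
    size = dir_size_bytes(path)
    lines = [f"{path}: {size} bytes"]
    if size > threshold:
        lines.append("The log dir is extremely large. Perhaps the bug was REPRODUCED!")
    return lines
```

Running `report` on each server's log directory after the 30-second sleep gives a quick pass/fail signal without reading the logs themselves.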

For details of the log, please refer to
https://github.com/AkihiroSuda/suda-pub/blob/master/dockerfiles/zookeeper-bug01/README.md


  was:
This sequence leads server.1 and server.2 to infinite exception loop.

 * Start server.1 and server.2 with the initial ensemble server.1=participant, 
server.2=observer.
   In this time, acceptedEpoch\[i\] == currentEpoch\[i\] == 1 for i = 1, 2.
 * Invoke reconfig so that acceptedEpoch\[i\] and currentEpoch\[i\] grows up to 
2.
 * Kill server.2
 * Remove dataDir of server.2 excluding the myid file.
   (In real production environments, both of confDir and dataDir can be lost 
due to reprovisioning)
 * Start server.2
 * server.1 and server.2 enters infinite exception loop.
   The log (threshold is set to INFO in log4j.properties) size can reach > 
100MB in 30 seconds.

AFAIK, the bug can be reproduced with 
ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).

I made a Docker container so that people who are interested can reproduce the 
bug easily. (Sorry for no JUnit tests right now)
{noformat}
$ docker run -i -t --rm akihirosuda/zookeeper-bug01
Reproducing the bug: infinite exception loop occurs when dataDir is lost
* Resetting
* Starting [1,2] with the initial ensemble [1]
* Sleeping for 3 seconds
* Invoking Reconfig [1]->[2]
* Sleeping for 3 seconds
* Killing server.2 (pid=10542)
* Sleeping for 3 seconds
* Resetting /zk02_data
* Starting server.2
* Sleeping for 30 seconds
/zk01_log: 81665114 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
/zk02_log: 23949367 bytes
The log dir is extremely large. Perhaps the bug was REPRODUCED!
* Exiting
{noformat}




> infinite exception loop occurs when dataDir is lost
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2162
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2162
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>            Reporter: Akihiro Suda
>         Attachments: ZOOKEEPER-2162.patch
>
>
> The following sequence drives server.1 and server.2 into an infinite exception loop.
>  * Start server.1 and server.2 with the initial ensemble server.1=participant, server.2=observer.
>    At this point, acceptedEpoch[i] == currentEpoch[i] == 1 for i = 1, 2.
>  * Invoke reconfig so that acceptedEpoch[i] and currentEpoch[i] grow to 2.
>  * Kill server.2.
>  * Remove the dataDir of server.2, excluding the myid file.
>    (In real production environments, both confDir and dataDir can be lost due to reprovisioning.)
>  * Start server.2.
>  * server.1 and server.2 enter an infinite exception loop.
>    The log size (with the threshold set to INFO in log4j.properties) can exceed 100MB in 30 seconds.
> AFAIK, the bug can be reproduced with ZooKeeper@f5fb50ed2591ba9a24685a227bb5374759516828 (Apr 7, 2015).
> I built a Docker image so that anyone interested can reproduce the bug easily. (Sorry, there is no JUnit test yet.)
> {noformat}
> $ docker run -i -t --rm akihirosuda/zookeeper-bug01
> Reproducing the bug: infinite exception loop occurs when dataDir is lost
> * Resetting
> * Starting [1,2] with the initial ensemble [1]
> * Sleeping for 3 seconds
> * Invoking Reconfig [1]->[2]
> * Sleeping for 3 seconds
> * Killing server.2 (pid=10542)
> * Sleeping for 3 seconds
> * Resetting /zk02_data
> * Starting server.2
> * Sleeping for 30 seconds
> /zk01_log: 81665114 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> /zk02_log: 23949367 bytes
> The log dir is extremely large. Perhaps the bug was REPRODUCED!
> * Exiting
> {noformat}
> For details of the log, please refer to
> https://github.com/AkihiroSuda/suda-pub/blob/master/dockerfiles/zookeeper-bug01/README.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
