[jira] [Created] (ZOOKEEPER-4306) CloseSessionTxn contains too many ephemal nodes cause cluster crash

Lin Changrui (Jira) Thu, 27 May 2021 20:10:05 -0700

Lin Changrui created ZOOKEEPER-4306:
---------------------------------------


             Summary: CloseSessionTxn contains too many ephemal nodes cause 
cluster crash
                 Key: ZOOKEEPER-4306
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4306
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.6.2
            Reporter: Lin Changrui
         Attachments: cs.jpg, f.jpg, l1.png, l2.jpg, r.jpg

We ran a test to see how many ephemeral nodes a client can create under one 
parent node with the default configuration. The test eventually crashed the 
cluster, with exception stack traces like the following.

follower:

!f.jpg!

leader:

!l1.png!

!l2.jpg!

It seems that the leader sent a txn packet that was too large to the 
followers. When a follower tried to deserialize the txn, it found that the txn 
length exceeded its buffer size (default 1MB+1MB, jute.maxbuffer + 
jute.maxbuffer.extrasize). That caused the followers to crash. The leader then 
found that there were not enough synced followers, so it shut down. During 
shutdown, the leader called zkDb.fastForwardDataBase(), found that the txn 
read from the txnlog exceeded its buffer size, and crashed too.
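
The failure described above comes down to a length-prefix check during 
deserialization. The following is a minimal Java sketch of that kind of check, 
not ZooKeeper's actual code: the class name, the method names, and the two 
1 MB constants are assumptions based on the defaults quoted in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: hypothetical names, not ZooKeeper classes.
class LengthCheckSketch {
    // Assumed defaults, as quoted in the report (about 1 MB each).
    static final int MAX_BUFFER = 1024 * 1024;
    static final int EXTRA_SIZE = 1024 * 1024;

    // Reads a 4-byte length prefix followed by that many bytes and
    // rejects records whose declared length exceeds the combined limit.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();
        if (len < 0 || len > MAX_BUFFER + EXTRA_SIZE) {
            throw new IOException("Declared length out of range: " + len);
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }

    // Writes a payload in the same length-prefixed layout.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return bos.toByteArray();
    }

    // Returns true if a framed payload of the given size is rejected on read.
    static boolean isRejected(int payloadSize) {
        try {
            byte[] framed = frame(new byte[payloadSize]);
            readRecord(new DataInputStream(new ByteArrayInputStream(framed)));
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("1 KB rejected: " + isRejected(1024));
        System.out.println("3 MB rejected: " + isRejected(3 * 1024 * 1024));
    }
}
```

In this sketch the writer never checks the size, only the reader does, which 
matches the asymmetry in the report: the oversized txn was accepted and 
logged, but could not be read back.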

After the servers crashed, they tried to restart the quorum, but they could 
not succeed because the last txn was too large. We lost the log from that 
moment, but the stack trace was the same as this one.

!r.jpg|width=1468,height=598!

 

*Root Cause*

We used org.apache.zookeeper.server.LogFormatter (with 
-Djute.maxbuffer=74827780) to visualize this log and found this.

!cs.jpg|width=1400,height=581!

So closeSessionTxn contains all ephemeral nodes with their absolute paths. We 
know that creating too many child nodes under one parent node produces a large 
getChildren response, which is limited by the client's jute.maxbuffer. If we 
create plenty of ephemeral nodes under different parent nodes within one 
session, the client may never exceed its buffer, but when the session closes 
without deleting these nodes first, it can crash the cluster.

Is this a bug or just an unspecified feature? If it is the latter, how should 
we determine the upper limit on the number of nodes to create?
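
One rough way to reason about the upper limit asked about above is to estimate 
the serialized size of the path list that the closeSession txn carries. The 
sketch below assumes the list is encoded as a 4-byte element count followed, 
for each path, by a 4-byte length and the UTF-8 bytes of the path. It ignores 
the txn header and any other fields, so it is a lower bound rather than an 
exact figure. The class name, method names, sample paths, and the 2 MB limit 
are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not ZooKeeper classes.
class CloseSessionSizeSketch {
    // Assumed read limit: 1 MB + 1 MB, as quoted in the report.
    static final long LIMIT_BYTES = 2L * 1024 * 1024;

    // Approximate serialized size of a list of absolute paths:
    // 4 bytes for the element count, then 4 bytes + UTF-8 bytes per path.
    static long estimateBytes(List<String> paths) {
        long total = 4;
        for (String p : paths) {
            total += 4 + p.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    // Rough number of paths of a given encoded length that fit in the limit.
    static long maxPaths(int pathBytes) {
        return (LIMIT_BYTES - 4) / (4 + pathBytes);
    }

    public static void main(String[] args) {
        // Hypothetical workload: one session, many parents, many ephemerals.
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            paths.add(String.format("/test/parent-%03d/node-%010d", i % 100, i));
        }
        long size = estimateBytes(paths);
        System.out.println("estimated bytes: " + size);
        System.out.println("exceeds limit:   " + (size > LIMIT_BYTES));
        System.out.println("paths that fit at this length: "
                + maxPaths(paths.get(0).getBytes(StandardCharsets.UTF_8).length));
    }
}
```

Under these assumptions the budget depends on the total length of all 
ephemeral paths owned by one session, not on how many children any single 
parent has, so spreading nodes across parents does not help.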

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
