[
https://issues.apache.org/jira/browse/ZOOKEEPER-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365277#comment-17365277
]
Lin Changrui edited comment on ZOOKEEPER-4306 at 6/18/21, 8:29 AM:
-------------------------------------------------------------------
Hi [~ztzg],
Thanks for your contribution. I have seen your commit. Adding a new
KeeperException and checking for it before creating an ephemeral node should
resolve the problem. It looks correct to me, and I did not find any loopholes.
I think it would be easier for others to notice this limitation if we added
a note to the JavaDoc of ZooKeeper.create. Would you agree? :D
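To illustrate the kind of check discussed here, the following is a minimal sketch and not the code from the commit. It assumes each ephemeral path costs its UTF-8 length plus a 4-byte length prefix in the close-session txn, plus a 4-byte element count for the whole list. The names EphemeralBudget, EphemeralBudgetExceededException, reserve and release are hypothetical and are not ZooKeeper APIs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical per-session tracker; none of these names exist in ZooKeeper.
class EphemeralBudget {
    // Hypothetical exception standing in for the new KeeperException subtype.
    static class EphemeralBudgetExceededException extends RuntimeException {
        EphemeralBudgetExceededException(String message) {
            super(message);
        }
    }

    private final long limitBytes;
    // Assumption: the path list starts with a 4-byte element count.
    private long usedBytes = 4;

    EphemeralBudget(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Assumption: each path is written as a 4-byte length plus its UTF-8 bytes.
    private static long cost(String path) {
        return 4L + path.getBytes(StandardCharsets.UTF_8).length;
    }

    // Call before creating an ephemeral node; rejects the create if the
    // estimated close-session txn would no longer fit within the limit.
    void reserve(String path) {
        long c = cost(path);
        if (usedBytes + c > limitBytes) {
            throw new EphemeralBudgetExceededException(
                "ephemeral paths would need " + (usedBytes + c)
                    + " bytes, limit is " + limitBytes);
        }
        usedBytes += c;
    }

    // Call after an ephemeral node of this session has been deleted.
    void release(String path) {
        usedBytes -= cost(path);
    }

    long usedBytes() {
        return usedBytes;
    }
}
```

A caller would invoke reserve(path) before the create is accepted and release(path) after the node is deleted, so that the running total reflects only the ephemeral nodes the session still owns.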
> CloseSessionTxn contains too many ephemal nodes cause cluster crash
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-4306
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4306
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.6.2
> Reporter: Lin Changrui
> Priority: Critical
> Attachments: cs.jpg, f.jpg, l1.png, l2.jpg, r.jpg
>
>
> We ran a test to find out how many ephemeral nodes a client can create under
> one parent node with the default configuration. The test eventually crashed
> the cluster, with exception stack traces like these.
> follower:
> !f.jpg!
> leader:
> !l1.png!
> !l2.jpg!
> It seems that the leader sent a txn packet that was too large to the
> followers. When a follower tried to deserialize the txn, it found that the
> txn length exceeded its buffer size (default 1MB+1MB, jute.maxbuffer +
> jute.maxbuffer.extrasize). That caused the followers to crash. The leader
> then found that there were not enough followers synced, so it shut down.
> During shutdown, it called zkDb.fastForwardDataBase(), found that the txn
> read from the txnlog exceeded its buffer size, and crashed too.
> After the servers crashed, they tried to restart the quorum, but they could
> not succeed because the last txn was too large. We lost the log from that
> moment, but the stack trace was the same as this one.
> !r.jpg|width=1468,height=598!
>
> *Root Cause*
> We used org.apache.zookeeper.server.LogFormatter (-Djute.maxbuffer=74827780)
> to visualize this log and found this. !cs.jpg|width=1400,height=581! So
> closeSessionTxn contains all ephemeral nodes with their absolute paths. We
> know we will get a large getChildren response if we create too many child
> nodes under one parent node; that is limited by the client's jute.maxbuffer.
> If we create plenty of ephemeral nodes under different parent nodes within
> one session, it may not overflow the client's buffer, but when the session
> closes without deleting these nodes first, it will probably crash the cluster.
> Is this a bug or just an unspecified feature? If it is the latter, how should
> we judge the upper limit on the number of nodes created?
>
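As a rough way to reason about the upper limit asked about above, the sketch below estimates how many ephemeral paths fit in one close-session txn. It assumes each path is serialized as a 4-byte length followed by its UTF-8 bytes, after a 4-byte element count, and it ignores the txn header and any other framing, so the result is an optimistic bound rather than a guarantee. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of ZooKeeper.
class EphemeralUpperBound {
    /**
     * Estimates how many ephemeral paths of a given encoded size fit in a
     * close-session txn of at most limitBytes.
     *
     * @param limitBytes buffer limit, e.g. jute.maxbuffer + jute.maxbuffer.extrasize
     * @param pathBytes  UTF-8 length of a typical absolute path
     */
    static long maxEphemeralNodes(long limitBytes, int pathBytes) {
        if (limitBytes < 4 || pathBytes < 0) {
            throw new IllegalArgumentException("limit or path size out of range");
        }
        // Assumption: 4 bytes for the element count, then 4 + pathBytes per path.
        return (limitBytes - 4) / (4L + pathBytes);
    }
}
```

With the 1MB+1MB default quoted in the report (taken here as 2 * 1024 * 1024 bytes) and 20-byte paths, maxEphemeralNodes(2L * 1024 * 1024, 20) evaluates to 87381 under these assumptions.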
--
This message was sent by Atlassian Jira
(v8.3.4#803005)