Hello,
We had a production outage caused by the issue reported in https://issues.apache.org/jira/browse/ZOOKEEPER-4306, and some other users have run into the same issue. I wonder if we can use this thread to discuss it and come to a consensus on how to fix it. :-)

Thanks to Damien Diederen <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ztzg> for the contribution and patch. Limiting the number of ephemeral nodes that can be created in a session looks like a simple and reasonable solution to me. Having a way to enforce it will protect the system from potential OOM issues.

I've also looked into the possibility of splitting CloseSessionTxn into smaller ones. Unfortunately, it didn't work, because currently in ZooKeeper one request can only have one txn. Even though we can split the paths to be deleted into multiple batches and define a sub-txn for each batch, we have to wrap all the sub-txns into a single wrapper txn and associate it with the request. In the end, when loading the zk database, we still have to deserialize the large wrapper txn, which can fail the length check (jute.maxBuffer + zookeeper.jute.maxbuffer.extrasize).

Changing ZK to allow multiple txns for a single request looks quite involved, and it may have other implications.

I wonder if anyone has any input or any better ideas?

Thanks,
Li
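
P.S. A rough sketch of why batching inside a single wrapper txn does not help with the length check. This is only an illustration in Python, not ZooKeeper code: the length-prefixed encoding, the function names, the znode paths, and the small MAX_BUFFER / EXTRA_SIZE values are all made up for the example and only stand in for jute serialization and the real settings.

```python
import struct

# Hypothetical stand-ins for jute.maxBuffer and zookeeper.jute.maxbuffer.extrasize.
# The values are deliberately tiny so the effect shows with few paths.
MAX_BUFFER = 4096
EXTRA_SIZE = 1024
LIMIT = MAX_BUFFER + EXTRA_SIZE


def encode_paths(paths):
    """Encode one (sub-)txn: 4-byte count, then 4-byte length + UTF-8 bytes per path."""
    out = [struct.pack(">i", len(paths))]
    for path in paths:
        data = path.encode("utf-8")
        out.append(struct.pack(">i", len(data)))
        out.append(data)
    return b"".join(out)


def split_into_batches(paths, batch_size):
    """Split the paths to delete into fixed-size batches."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def encode_wrapper(batches):
    """Encode a wrapper txn: 4-byte count, then each sub-txn as a length-prefixed blob."""
    out = [struct.pack(">i", len(batches))]
    for batch in batches:
        blob = encode_paths(batch)
        out.append(struct.pack(">i", len(blob)))
        out.append(blob)
    return b"".join(out)


# Ephemeral nodes owned by one session (made-up paths).
paths = [f"/app/workers/member-{i:06d}" for i in range(1000)]

single_txn = encode_paths(paths)
batches = split_into_batches(paths, 100)
wrapper_txn = encode_wrapper(batches)

print("limit:             ", LIMIT)
print("single txn size:   ", len(single_txn))
print("largest sub-txn:   ", max(len(encode_paths(b)) for b in batches))
print("wrapper txn size:  ", len(wrapper_txn))
```

Each sub-txn fits under the limit on its own, but the wrapper that has to be deserialized as one record is at least as large as the original unsplit txn, so the check still fails.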