Thanks for the comments - I'll incorporate them in a future fix. There is actually a flaw in this code as it's currently implemented: it does not match the original behavior, and I need to think it through more carefully.
Arshad, I think ZOOKEEPER-2570 is a somewhat different issue. The root
cause in both cases is that the ProcessRequestThread is overloaded, but
large multi-op transactions are probably a degenerate case.

On Thu, Oct 13, 2016 at 1:12 PM, Edward Ribeiro <edward.ribe...@gmail.com> wrote:
> Very interesting patch, Mike.
>
> I've left a couple of review comments (hope you don't mind) in the
> https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
> commit. :)
>
> Cheers,
> Eddie
>
>
> On Thu, Oct 13, 2016 at 4:06 PM, Arshad Mohammad <arshad.mohamma...@gmail.com> wrote:
>
>> Hi Mike,
>> I also faced the same issue. There is a test patch in ZOOKEEPER-2570
>> which can be used to quickly check the performance gains from each
>> modification. Hope it is useful.
>>
>> -Arshad
>>
>> On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <ms...@dropbox.com> wrote:
>>
>> > I've been performance testing 3.5.2 and hit an interesting
>> > unavailability issue.
>> >
>> > When the server is very busy (64k connections, 16k writes per
>> > second), the leader can get busy enough that connections get throttled.
>> > Enough throttling causes sessions to expire. As sessions expire, the
>> > CPU consumption rises and the quorum is effectively unavailable.
>> > Interestingly, if you shut down all the clients, the quorum won't heal
>> > for nearly 10 minutes.
>> >
>> > The issue is that the outstandingChanges queue has 250k items in it
>> > and the closeSession code scans this linearly under a lock. Replacing
>> > the linear scan with a hash table lookup improves this, but likely the
>> > real solution is some backpressure on clients as a result of an
>> > oversized outstandingChanges queue.
>> >
>> > Here is a sample fix:
>> > https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>> >
>> > This results in the quorum healing about 30 seconds after the clients
>> > disconnect.
>> >
>> > Is there a way to prevent runaway growth in this queue? I'm wondering
>> > if changing the definition of "throttling" to take into account the
>> > size of this queue might help mitigate this. The end goal is that some
>> > stable amount of traffic is reached asymptotically without suffering a
>> > collapse.
>> >
>> > Thanks,
>> > -Mike
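
For context on the fix discussed in the quoted thread: the idea is to pair
the FIFO outstandingChanges queue with a hash map keyed by path, so that
closeSession can resolve pending changes with O(1) lookups instead of
scanning the whole queue under a lock. Below is a minimal, self-contained
sketch of that technique. It is illustrative only, not the code in the
linked commit; the class and member names here (the ChangeRecord fields,
PendingChanges, byPath, shouldThrottle, the throttle limit) are
hypothetical simplifications.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified stand-in for a pending change: a transaction keyed by
    // znode path and owning session.
    final class ChangeRecord {
        final long zxid;
        final String path;
        final long sessionId;

        ChangeRecord(long zxid, String path, long sessionId) {
            this.zxid = zxid;
            this.path = path;
            this.sessionId = sessionId;
        }
    }

    // Pairs the FIFO queue (still needed for commit ordering) with a map
    // keyed by path, so the newest pending change for any path is found
    // in O(1) rather than by a linear scan under the lock.
    final class PendingChanges {
        private final Deque<ChangeRecord> queue = new ArrayDeque<>();
        private final Map<String, ChangeRecord> byPath = new HashMap<>();

        synchronized void add(ChangeRecord rec) {
            queue.addLast(rec);
            byPath.put(rec.path, rec); // newest change for a path wins
        }

        // closeSession can now resolve each of the session's known
        // ephemeral paths with one map lookup apiece, instead of one
        // full queue scan.
        synchronized List<ChangeRecord> changesFor(Iterable<String> paths) {
            List<ChangeRecord> found = new ArrayList<>();
            for (String p : paths) {
                ChangeRecord rec = byPath.get(p);
                if (rec != null) {
                    found.add(rec);
                }
            }
            return found;
        }

        // Called as changes commit; removes the mapping only if it still
        // points at the head record, so a newer change for the same path
        // is not dropped by mistake.
        synchronized void removeCommitted() {
            ChangeRecord head = queue.pollFirst();
            if (head != null && byPath.get(head.path) == head) {
                byPath.remove(head.path);
            }
        }

        // One possible backpressure hook, per the question at the end of
        // the thread: treat an oversized queue as a signal to throttle
        // incoming requests. The limit is an arbitrary illustration, not
        // a tuned value.
        synchronized boolean shouldThrottle(int limit) {
            return queue.size() > limit;
        }
    }

The map holds only the most recent record per path, which is why
removeCommitted checks identity before deleting the mapping; that keeps
lookups O(1) while leaving the queue as the single source of ordering. The
shouldThrottle check sketches the simplest form of the backpressure Mike
asks about at the end of the thread.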