Thanks for the comments - I'll incorporate them in a future fix. There
is actually a flaw in this code as it's currently implemented - it
does not match the original behavior and I need to think more
carefully.

Arshad, I think ZOOKEEPER-2570 is a somewhat different issue.  The
root cause in both cases is that the PrepRequestProcessor thread is
overloaded, but large multi-op transactions are probably a degenerate
case.

On Thu, Oct 13, 2016 at 1:12 PM, Edward Ribeiro
<edward.ribe...@gmail.com> wrote:
> Very interesting patch, Mike.
>
> I've left a couple of review comments (hope you don't mind) in the
> https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
> commit. :)
>
> Cheers,
> Eddie
>
>
> On Thu, Oct 13, 2016 at 4:06 PM, Arshad Mohammad <
> arshad.mohamma...@gmail.com> wrote:
>
>> Hi Mike
>> I also faced the same issue. There is a test patch in ZOOKEEPER-2570 which
>> can be used to quickly check performance gains from each modification. Hope
>> it is useful.
>>
>> -Arshad
>>
>> On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <ms...@dropbox.com> wrote:
>>
>> > I've been performance testing 3.5.2 and hit an interesting unavailability
>> > issue.
>> >
>> > When the server is very busy (64k connections, 16k writes per
>> > second) the leader can get busy enough that connections get throttled.
>> > Enough throttling causes sessions to expire. As sessions expire, the
>> > CPU consumption rises and the quorum is effectively unavailable.
>> > Interestingly, if you shut down all the clients, the quorum won't heal
>> > for nearly 10 minutes.
>> >
>> > The issue is that the outstandingChanges queue has 250k items in it
>> > and the closeSession code scans this linearly under a lock. Replacing
>> > the linear scan with a hash table lookup improves this, but the real
>> > solution is likely some backpressure on clients when the
>> > outstandingChanges queue grows oversized.
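>> >
>> > To make the shape of the fix concrete, here is a rough sketch of the
>> > idea in isolation. The names and structure are illustrative only, not
>> > the actual ZooKeeper internals; the real change is in the commit
>> > below:
>> >
>> > import java.util.*;
>> >
>> > // Illustrative stand-in for the per-request change records that
>> > // pile up in outstandingChanges while requests are in flight.
>> > class ChangeRecord {
>> >     final String path;
>> >     final long sessionId; // owning session, 0 if not session-bound
>> >
>> >     ChangeRecord(String path, long sessionId) {
>> >         this.path = path;
>> >         this.sessionId = sessionId;
>> >     }
>> > }
>> >
>> > class OutstandingChanges {
>> >     private final Deque<ChangeRecord> queue = new ArrayDeque<>();
>> >     // Secondary index: session id -> paths with pending changes.
>> >     // It must also be maintained when records are dequeued on
>> >     // commit (omitted here for brevity).
>> >     private final Map<Long, Set<String>> bySession = new HashMap<>();
>> >
>> >     synchronized void add(ChangeRecord c) {
>> >         queue.add(c);
>> >         if (c.sessionId != 0) {
>> >             bySession.computeIfAbsent(c.sessionId,
>> >                     k -> new HashSet<>()).add(c.path);
>> >         }
>> >     }
>> >
>> >     // Before: every closeSession scans the whole queue under the
>> >     // lock. With 250k entries and thousands of expiring sessions,
>> >     // this is where the CPU goes.
>> >     synchronized Set<String> pathsForSessionSlow(long sessionId) {
>> >         Set<String> paths = new HashSet<>();
>> >         for (ChangeRecord c : queue) {
>> >             if (c.sessionId == sessionId) {
>> >                 paths.add(c.path);
>> >             }
>> >         }
>> >         return paths;
>> >     }
>> >
>> >     // After: one hash lookup per closeSession.
>> >     synchronized Set<String> pathsForSessionFast(long sessionId) {
>> >         return bySession.getOrDefault(sessionId,
>> >                 Collections.emptySet());
>> >     }
>> > }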
>> >
>> > Here is a sample fix:
>> > https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>> >
>> > This results in the quorum healing about 30 seconds after the clients
>> > disconnect.
>> >
>> > Is there a way to prevent runaway growth in this queue? I'm wondering
>> > if changing the definition of "throttling" to take into account the
>> > size of this queue might help mitigate this. The end goal is that some
>> > stable amount of traffic is reached asymptotically without suffering a
>> > collapse.
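>> >
>> > Concretely, I am imagining a check of roughly this shape. The names
>> > and thresholds are made up; today's throttle is driven by the global
>> > outstanding-request limit (globalOutstandingLimit) alone:
>> >
>> > class ThrottleCheck {
>> >     static final int GLOBAL_OUTSTANDING_LIMIT = 1000;
>> >     // Hypothetical additional bound on the pending-change queue.
>> >     static final int OUTSTANDING_CHANGES_LIMIT = 10000;
>> >
>> >     // Stop reading from client sockets when either bound is hit,
>> >     // so backpressure reaches clients before the queue can grow
>> >     // to the 250k entries seen above.
>> >     static boolean shouldThrottle(int inFlightRequests,
>> >                                   int outstandingChangesSize) {
>> >         return inFlightRequests > GLOBAL_OUTSTANDING_LIMIT
>> >                 || outstandingChangesSize > OUTSTANDING_CHANGES_LIMIT;
>> >     }
>> > }
>> >
>> > Throttled connections would then recover as the queue drains,
>> > rather than collapsing into a wave of session expirations.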
>> >
>> > Thanks,
>> > -Mike
>> >
>>
