Re: Critical worker threads liveness checking drawbacks

Andrey Kuznetsov Fri, 07 Sep 2018 09:04:22 -0700

Yakov,

Thanks for the reply. Indeed, the initial design assumed node termination
when a hanging critical thread is detected. But sometimes that looks
inappropriate. Suppose, for example, that an fsync in the WAL writer thread
takes too long, and we terminate the node. Upon rebalancing, this may lead
to long fsyncs on other nodes due to increased per-node load, so we may
terminate the next node as well. Eventually the entire cluster can collapse.
Is this a possible scenario?
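
To make the scenario concrete, here is a toy model in Java. It is only an illustrative sketch under strong assumptions, not measured Ignite behavior: total load is spread evenly across live nodes, a node is terminated whenever its share exceeds a fixed limit (standing in for "fsync takes longer than the liveness timeout"), and the first node is lost to a transient slow fsync. The class and method names are hypothetical.

```java
public class CascadeSketch {
    /**
     * Returns how many nodes remain alive after one node is terminated
     * because of a transient slow fsync and its load is rebalanced.
     * Assumption: load is split evenly and a node is terminated as soon
     * as its share exceeds maxLoadPerNode.
     */
    public static int survivors(int nodes, double totalLoad, double maxLoadPerNode) {
        int alive = nodes - 1; // the first node is terminated by the liveness check
        while (alive > 0 && totalLoad / alive > maxLoadPerNode) {
            alive--; // each termination raises the per-node load on the rest
        }
        return alive;
    }

    public static void main(String[] args) {
        // Enough headroom: the remaining nodes absorb the extra load.
        System.out.println(survivors(10, 80.0, 10.0)); // prints 9
        // Little headroom: every termination triggers the next one.
        System.out.println(survivors(10, 95.0, 10.0)); // prints 0
    }
}
```

In this model the outcome is all-or-nothing: once the remaining nodes exceed the limit, each further termination only makes it worse.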


Fri, 7 Sep 2018 at 18:44, Yakov Zhdanov <yzhda...@apache.org>:

> Andrey,
>
> I don't understand your point. In my opinion, the idea of these changes is
> to make the cluster more stable and responsive by eliminating hung nodes. I
> would not draw much of a distinction between threads trapped in a deadlock
> and threads hanging on fsync calls for too long. Both situations lead to
> increasing latency in the cluster, up to its full unavailability.
>
> So, killing a node hanging on fsync may be reasonable. Agree?
>
> You may implement an approach where only warning messages are logged by
> default, but a termination option should also be available.
>
> Thanks!
>
> --Yakov
>
>
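
For reference, the warn-by-default approach with an optional termination mode suggested above could look roughly like the sketch below. Everything here is a hypothetical illustration: `LivenessSketch`, `Policy` and `Action` are made-up names, not an existing Ignite API, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LivenessSketch {
    /** Hypothetical policy: log a warning only (default) or terminate the node. */
    public enum Policy { WARN_ONLY, TERMINATE }

    /** What the checker decides for a given worker. */
    public enum Action { OK, WARN, TERMINATE }

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long timeoutMs;
    private final Policy policy;

    public LivenessSketch(long timeoutMs, Policy policy) {
        this.timeoutMs = timeoutMs;
        this.policy = policy;
    }

    /** Called by a critical worker each time it makes progress. */
    public void heartbeat(String worker, long nowMs) {
        lastHeartbeat.put(worker, nowMs);
    }

    /** Decides what to do about a worker at time nowMs. */
    public Action check(String worker, long nowMs) {
        Long last = lastHeartbeat.get(worker);
        // Unknown workers and workers seen within the timeout are fine.
        if (last == null || nowMs - last <= timeoutMs) {
            return Action.OK;
        }
        // The worker looks hung: the configured policy picks the reaction.
        return policy == Policy.TERMINATE ? Action.TERMINATE : Action.WARN;
    }
}
```

With `WARN_ONLY`, a WAL writer stuck in a long fsync only produces a warning, while `TERMINATE` keeps the original behavior for those who want it.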
