Re: [DISCUSS] Re: Replication resiliency

Enis Söztutar Thu, 26 Jan 2017 17:54:06 -0800

> Do we have worker threads that we can't safely continue without
indefinitely? Can we solve the general problem of "unhandled exception
in threads cause a RS Abort"?
We have this already in the code base. We are injecting an
UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
of the HRegionServer service threads (see HRS.startServiceThreads). But
I've also seen cases where some important thread got unmarked. I think it
is good idea to revisit that and make sure that all the threads are
injected with the UEH.


The replication source threads are started on demand, that is why the UEH
is not injected I think. But agreed that we should do the safe route here,
and abort the regionserver.

Enis

On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser <els...@apache.org> wrote:

> +1 If any "worker" thread can't safely/reasonably retry some unexpected
> exception without a reasonable expectation of self-healing, tank the RS.
>
> Having those threads die but not the RS could go uncaught for indefinite
> period of time.
>
>
> Sean Busbey wrote:
>
>> I've noticed a few other places where we can lose a worker thread and
>> the RegionServer happily continues. One notable example is the worker
>> threads that handle syncs for the WAL. I'm generally a
>> fail-fast-and-loud advocate, so I like aborting when things look
>> wonky. I've also had to deal with a lot of pain around silent and thus
>> hard to see replication failures, so strong signals that the
>> replication system is in a bad way sound good to me atm.
>>
>> Do we have worker threads that we can't safely continue without
>> indefinitely? Can we solve the general problem of "unhandled exception
>> in threads cause a RS Abort"?
>>
>> As mentioned on the jira, I do worry a bit about cluster stability and
>> cascading failures, given the ability to have user-provided endpoints
>> in the replication process. Ultimately, I don't see that as different
>> than all the other places coprocessors can put the cluster at risk.
>>
>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<bus...@apache.org>  wrote:
>>
>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>
>>>
>>>
>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelml...@gmail.com>
>>> wrote:
>>>
>>>> Over in HBASE-17381 there has been some discussion around whether an
>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger a
>>>> regionserver abort.
>>>>
>>>> The current behavior in the case of an unexpected exception in
>>>> ReplicationSourceWorkerThread.run() is to log a message and simply let
>>>> the
>>>> thread die, allowing replication for this source to back up.
>>>>
>>>> I've seen this happen in an OOME scenario, which seems like a clear case
>>>> where we would be better off aborting the regionserver.
>>>>
>>>> However, in the case of any other unexpected exceptions out of the run()
>>>> method, how do we want to handle this?
>>>>
>>>> I'm of the general opinion that where we would be better off aborting on
>>>> all unexpected exceptions, as it means that:
>>>> a) we have some missing error handling
>>>> b) failing fast raises visibility and makes it easier to add any error
>>>> handling that should be there
>>>> c) silently stopping up replication creates problems that are difficult
>>>> for
>>>> our users to identify operationally and hard to troubleshoot.
>>>>
>>>> However, the current behavior has been there for quite a while, and
>>>> maybe
>>>> there are other situations or concerns I'm not seeing which would
>>>> justify
>>>> having regionserver stability over replication stability.
>>>>
>>>> What are folks thoughts on this?  Should the regionserver abort on all
>>>> unexpected exceptions in the run method or should we more narrowly focus
>>>> this on OOME's?
>>>>
>>>

Re: [DISCUSS] Re: Replication resiliency

Reply via email to