Re: [DISCUSS] Re: Replication resiliency

Andrew Purtell Fri, 27 Jan 2017 13:14:01 -0800

There is an old JIRA somewhere to use Error Prone 
(https://github.com/google/error-prone) as framework for implementing static 
code analysis checks like that. FWIW



> On Jan 27, 2017, at 1:03 PM, Sean Busbey <[email protected]> wrote:
> 
> Josh, probably worth checking if a grep or something else we can do in
> precommit could catch this.
> 
> 
> 
>> On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser <[email protected]> wrote:
>> Cool.
>> 
>> Let me open an issue to scan the codebase to see if we can find any
>> instances where we are starting threads which aren't using the UEH.
>> 
>> 
>> Andrew Purtell wrote:
>>> 
>>> Agreed, let's abort with an abundance of caution. That is the _least_ that
>>> should be done when a thread dies unexpectedly. Maybe we can improve
>>> resiliency for specific cases later.
>>> 
>>> 
>>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar<[email protected]>  wrote:
>>> 
>>>>> Do we have worker threads that we can't safely continue without
>>>> 
>>>> indefinitely? Can we solve the general problem of "unhandled exception
>>>> in threads cause a RS Abort"?
>>>> We have this already in the code base. We are injecting an
>>>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
>>>> of the HRegionServer service threads (see HRS.startServiceThreads). But
>>>> I've also seen cases where some important thread got unmarked. I think it
>>>> is good idea to revisit that and make sure that all the threads are
>>>> injected with the UEH.
>>>> 
>>>> The replication source threads are started on demand, that is why the UEH
>>>> is not injected I think. But agreed that we should do the safe route
>>>> here,
>>>> and abort the regionserver.
>>>> 
>>>> Enis
>>>> 
>>>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser<[email protected]>  wrote:
>>>>> 
>>>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>>>>> exception without a reasonable expectation of self-healing, tank the RS.
>>>>> 
>>>>> Having those threads die but not the RS could go uncaught for indefinite
>>>>> period of time.
>>>>> 
>>>>> 
>>>>> Sean Busbey wrote:
>>>>> 
>>>>>> I've noticed a few other places where we can lose a worker thread and
>>>>>> the RegionServer happily continues. One notable example is the worker
>>>>>> threads that handle syncs for the WAL. I'm generally a
>>>>>> fail-fast-and-loud advocate, so I like aborting when things look
>>>>>> wonky. I've also had to deal with a lot of pain around silent and thus
>>>>>> hard to see replication failures, so strong signals that the
>>>>>> replication system is in a bad way sound good to me atm.
>>>>>> 
>>>>>> Do we have worker threads that we can't safely continue without
>>>>>> indefinitely? Can we solve the general problem of "unhandled exception
>>>>>> in threads cause a RS Abort"?
>>>>>> 
>>>>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>>>>> cascading failures, given the ability to have user-provided endpoints
>>>>>> in the replication process. Ultimately, I don't see that as different
>>>>>> than all the other places coprocessors can put the cluster at risk.
>>>>>> 
>>>>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Over in HBASE-17381 there has been some discussion around whether an
>>>>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger
>>>>>>>> a
>>>>>>>> regionserver abort.
>>>>>>>> 
>>>>>>>> The current behavior in the case of an unexpected exception in
>>>>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply
>>>>>>>> let
>>>>>>>> the
>>>>>>>> thread die, allowing replication for this source to back up.
>>>>>>>> 
>>>>>>>> I've seen this happen in an OOME scenario, which seems like a clear
>>>>>>>> case
>>>>>>>> where we would be better off aborting the regionserver.
>>>>>>>> 
>>>>>>>> However, in the case of any other unexpected exceptions out of the
>>>>>>>> run()
>>>>>>>> method, how do we want to handle this?
>>>>>>>> 
>>>>>>>> I'm of the general opinion that where we would be better off aborting
>>>>>>>> on
>>>>>>>> all unexpected exceptions, as it means that:
>>>>>>>> a) we have some missing error handling
>>>>>>>> b) failing fast raises visibility and makes it easier to add any
>>>>>>>> error
>>>>>>>> handling that should be there
>>>>>>>> c) silently stopping up replication creates problems that are
>>>>>>>> difficult
>>>>>>>> for
>>>>>>>> our users to identify operationally and hard to troubleshoot.
>>>>>>>> 
>>>>>>>> However, the current behavior has been there for quite a while, and
>>>>>>>> maybe
>>>>>>>> there are other situations or concerns I'm not seeing which would
>>>>>>>> justify
>>>>>>>> having regionserver stability over replication stability.
>>>>>>>> 
>>>>>>>> What are folks thoughts on this?  Should the regionserver abort on
>>>>>>>> all
>>>>>>>> unexpected exceptions in the run method or should we more narrowly
>>>>>>>> focus
>>>>>>>> this on OOME's?
>>>>>>>> 
>>

Re: [DISCUSS] Re: Replication resiliency

Reply via email to