Re: [DISCUSS] Re: Replication resiliency

2017-01-27 Thread Andrew Purtell
There is an old JIRA somewhere to use Error Prone 
(https://github.com/google/error-prone) as framework for implementing static 
code analysis checks like that. FWIW


> On Jan 27, 2017, at 1:03 PM, Sean Busbey  wrote:
> 
> Josh, probably worth checking if a grep or something else we can do in
> precommit could catch this.
> 
> 
> 
>> On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser  wrote:
>> Cool.
>> 
>> Let me open an issue to scan the codebase to see if we can find any
>> instances where we are starting threads which aren't using the UEH.
>> 
>> 
>> Andrew Purtell wrote:
>>> 
>>> Agreed, let's abort with an abundance of caution. That is the _least_ that
>>> should be done when a thread dies unexpectedly. Maybe we can improve
>>> resiliency for specific cases later.
>>> 
>>> 
>>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar  wrote:
>>> 
> Do we have worker threads that we can't safely continue without
 
 indefinitely? Can we solve the general problem of "unhandled exception
 in threads cause a RS Abort"?
 We have this already in the code base. We are injecting an
 UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
 of the HRegionServer service threads (see HRS.startServiceThreads). But
 I've also seen cases where some important thread got unmarked. I think it
 is good idea to revisit that and make sure that all the threads are
 injected with the UEH.
 
 The replication source threads are started on demand, that is why the UEH
 is not injected I think. But agreed that we should do the safe route
 here,
 and abort the regionserver.
 
 Enis
 
> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser  wrote:
> 
> +1 If any "worker" thread can't safely/reasonably retry some unexpected
> exception without a reasonable expectation of self-healing, tank the RS.
> 
> Having those threads die but not the RS could go uncaught for indefinite
> period of time.
> 
> 
> Sean Busbey wrote:
> 
>> I've noticed a few other places where we can lose a worker thread and
>> the RegionServer happily continues. One notable example is the worker
>> threads that handle syncs for the WAL. I'm generally a
>> fail-fast-and-loud advocate, so I like aborting when things look
>> wonky. I've also had to deal with a lot of pain around silent and thus
>> hard to see replication failures, so strong signals that the
>> replication system is in a bad way sound good to me atm.
>> 
>> Do we have worker threads that we can't safely continue without
>> indefinitely? Can we solve the general problem of "unhandled exception
>> in threads cause a RS Abort"?
>> 
>> As mentioned on the jira, I do worry a bit about cluster stability and
>> cascading failures, given the ability to have user-provided endpoints
>> in the replication process. Ultimately, I don't see that as different
>> than all the other places coprocessors can put the cluster at risk.
>> 
>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey
>>> wrote:
>>> 
>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>> 
>>> 
>>> 
>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling
>>> wrote:
>>> 
 Over in HBASE-17381 there has been some discussion around whether an
 unhandled exception in a ReplicationSourceWorkerThread should trigger
 a
 regionserver abort.
 
 The current behavior in the case of an unexpected exception in
 ReplicationSourceWorkerThread.run() is to log a message and simply
 let
 the
 thread die, allowing replication for this source to back up.
 
 I've seen this happen in an OOME scenario, which seems like a clear
 case
 where we would be better off aborting the regionserver.
 
 However, in the case of any other unexpected exceptions out of the
 run()
 method, how do we want to handle this?
 
 I'm of the general opinion that where we would be better off aborting
 on
 all unexpected exceptions, as it means that:
 a) we have some missing error handling
 b) failing fast raises visibility and makes it easier to add any
 error
 handling that should be there
 c) silently stopping up replication creates problems that are
 difficult
 for
 our users to identify operationally and hard to troubleshoot.
 
 However, the current behavior has been there for quite a while, and
 maybe
 there are other situations or concerns I'm not seeing which would
 justify
 having regionserver stability over replication stability.
 
 What are folks thoughts on 

Re: [DISCUSS] Re: Replication resiliency

2017-01-27 Thread Sean Busbey
Josh, probably worth checking if a grep or something else we can do in
precommit could catch this.



On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser  wrote:
> Cool.
>
> Let me open an issue to scan the codebase to see if we can find any
> instances where we are starting threads which aren't using the UEH.
>
>
> Andrew Purtell wrote:
>>
>> Agreed, let's abort with an abundance of caution. That is the _least_ that
>> should be done when a thread dies unexpectedly. Maybe we can improve
>> resiliency for specific cases later.
>>
>>
>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar  wrote:
>>
 Do we have worker threads that we can't safely continue without
>>>
>>> indefinitely? Can we solve the general problem of "unhandled exception
>>> in threads cause a RS Abort"?
>>> We have this already in the code base. We are injecting an
>>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
>>> of the HRegionServer service threads (see HRS.startServiceThreads). But
>>> I've also seen cases where some important thread got unmarked. I think it
>>> is good idea to revisit that and make sure that all the threads are
>>> injected with the UEH.
>>>
>>> The replication source threads are started on demand, that is why the UEH
>>> is not injected I think. But agreed that we should do the safe route
>>> here,
>>> and abort the regionserver.
>>>
>>> Enis
>>>
 On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser  wrote:

 +1 If any "worker" thread can't safely/reasonably retry some unexpected
 exception without a reasonable expectation of self-healing, tank the RS.

 Having those threads die but not the RS could go uncaught for indefinite
 period of time.


 Sean Busbey wrote:

> I've noticed a few other places where we can lose a worker thread and
> the RegionServer happily continues. One notable example is the worker
> threads that handle syncs for the WAL. I'm generally a
> fail-fast-and-loud advocate, so I like aborting when things look
> wonky. I've also had to deal with a lot of pain around silent and thus
> hard to see replication failures, so strong signals that the
> replication system is in a bad way sound good to me atm.
>
> Do we have worker threads that we can't safely continue without
> indefinitely? Can we solve the general problem of "unhandled exception
> in threads cause a RS Abort"?
>
> As mentioned on the jira, I do worry a bit about cluster stability and
> cascading failures, given the ability to have user-provided endpoints
> in the replication process. Ultimately, I don't see that as different
> than all the other places coprocessors can put the cluster at risk.
>
>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey
>> wrote:
>>
>> (edited subject to ensure folks filtering for DISCUSS see this)
>>
>>
>>
>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling
>> wrote:
>>
>>> Over in HBASE-17381 there has been some discussion around whether an
>>> unhandled exception in a ReplicationSourceWorkerThread should trigger
>>> a
>>> regionserver abort.
>>>
>>> The current behavior in the case of an unexpected exception in
>>> ReplicationSourceWorkerThread.run() is to log a message and simply
>>> let
>>> the
>>> thread die, allowing replication for this source to back up.
>>>
>>> I've seen this happen in an OOME scenario, which seems like a clear
>>> case
>>> where we would be better off aborting the regionserver.
>>>
>>> However, in the case of any other unexpected exceptions out of the
>>> run()
>>> method, how do we want to handle this?
>>>
>>> I'm of the general opinion that where we would be better off aborting
>>> on
>>> all unexpected exceptions, as it means that:
>>> a) we have some missing error handling
>>> b) failing fast raises visibility and makes it easier to add any
>>> error
>>> handling that should be there
>>> c) silently stopping up replication creates problems that are
>>> difficult
>>> for
>>> our users to identify operationally and hard to troubleshoot.
>>>
>>> However, the current behavior has been there for quite a while, and
>>> maybe
>>> there are other situations or concerns I'm not seeing which would
>>> justify
>>> having regionserver stability over replication stability.
>>>
>>> What are folks thoughts on this?  Should the regionserver abort on
>>> all
>>> unexpected exceptions in the run method or should we more narrowly
>>> focus
>>> this on OOME's?
>>>
>


Re: [DISCUSS] Re: Replication resiliency

2017-01-27 Thread Gary Helmling
Thanks for the feedback, all.  Makes sense to me.  I'll follow up in the
issue to use the same UEH to abort or at least same abort handling as other
cases (the current UEH used for replication source worker threads only
logs).


On Fri, Jan 27, 2017 at 11:27 AM Josh Elser  wrote:

Cool.

Let me open an issue to scan the codebase to see if we can find any
instances where we are starting threads which aren't using the UEH.

Andrew Purtell wrote:
> Agreed, let's abort with an abundance of caution. That is the _least_
that should be done when a thread dies unexpectedly. Maybe we can improve
resiliency for specific cases later.
>
>
> On Jan 26, 2017, at 5:53 PM, Enis Söztutar  wrote:
>
>>> Do we have worker threads that we can't safely continue without
>> indefinitely? Can we solve the general problem of "unhandled exception
>> in threads cause a RS Abort"?
>> We have this already in the code base. We are injecting an
>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
>> of the HRegionServer service threads (see HRS.startServiceThreads). But
>> I've also seen cases where some important thread got unmarked. I think it
>> is good idea to revisit that and make sure that all the threads are
>> injected with the UEH.
>>
>> The replication source threads are started on demand, that is why the UEH
>> is not injected I think. But agreed that we should do the safe route
here,
>> and abort the regionserver.
>>
>> Enis
>>
>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser  wrote:
>>>
>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>>> exception without a reasonable expectation of self-healing, tank the RS.
>>>
>>> Having those threads die but not the RS could go uncaught for indefinite
>>> period of time.
>>>
>>>
>>> Sean Busbey wrote:
>>>
 I've noticed a few other places where we can lose a worker thread and
 the RegionServer happily continues. One notable example is the worker
 threads that handle syncs for the WAL. I'm generally a
 fail-fast-and-loud advocate, so I like aborting when things look
 wonky. I've also had to deal with a lot of pain around silent and thus
 hard to see replication failures, so strong signals that the
 replication system is in a bad way sound good to me atm.

 Do we have worker threads that we can't safely continue without
 indefinitely? Can we solve the general problem of "unhandled exception
 in threads cause a RS Abort"?

 As mentioned on the jira, I do worry a bit about cluster stability and
 cascading failures, given the ability to have user-provided endpoints
 in the replication process. Ultimately, I don't see that as different
 than all the other places coprocessors can put the cluster at risk.

> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey
 wrote:
>
> (edited subject to ensure folks filtering for DISCUSS see this)
>
>
>
> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling
> wrote:
>
>> Over in HBASE-17381 there has been some discussion around whether an
>> unhandled exception in a ReplicationSourceWorkerThread should
trigger a
>> regionserver abort.
>>
>> The current behavior in the case of an unexpected exception in
>> ReplicationSourceWorkerThread.run() is to log a message and simply
let
>> the
>> thread die, allowing replication for this source to back up.
>>
>> I've seen this happen in an OOME scenario, which seems like a clear
case
>> where we would be better off aborting the regionserver.
>>
>> However, in the case of any other unexpected exceptions out of the
run()
>> method, how do we want to handle this?
>>
>> I'm of the general opinion that where we would be better off
aborting on
>> all unexpected exceptions, as it means that:
>> a) we have some missing error handling
>> b) failing fast raises visibility and makes it easier to add any
error
>> handling that should be there
>> c) silently stopping up replication creates problems that are
difficult
>> for
>> our users to identify operationally and hard to troubleshoot.
>>
>> However, the current behavior has been there for quite a while, and
>> maybe
>> there are other situations or concerns I'm not seeing which would
>> justify
>> having regionserver stability over replication stability.
>>
>> What are folks thoughts on this?  Should the regionserver abort on
all
>> unexpected exceptions in the run method or should we more narrowly
focus
>> this on OOME's?
>>


Re: [DISCUSS] Re: Replication resiliency

2017-01-27 Thread Josh Elser

Cool.

Let me open an issue to scan the codebase to see if we can find any 
instances where we are starting threads which aren't using the UEH.


Andrew Purtell wrote:

Agreed, let's abort with an abundance of caution. That is the _least_ that 
should be done when a thread dies unexpectedly. Maybe we can improve resiliency 
for specific cases later.


On Jan 26, 2017, at 5:53 PM, Enis Söztutar  wrote:


Do we have worker threads that we can't safely continue without

indefinitely? Can we solve the general problem of "unhandled exception
in threads cause a RS Abort"?
We have this already in the code base. We are injecting an
UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
of the HRegionServer service threads (see HRS.startServiceThreads). But
I've also seen cases where some important thread got unmarked. I think it
is good idea to revisit that and make sure that all the threads are
injected with the UEH.

The replication source threads are started on demand, that is why the UEH
is not injected I think. But agreed that we should do the safe route here,
and abort the regionserver.

Enis


On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser  wrote:

+1 If any "worker" thread can't safely/reasonably retry some unexpected
exception without a reasonable expectation of self-healing, tank the RS.

Having those threads die but not the RS could go uncaught for indefinite
period of time.


Sean Busbey wrote:


I've noticed a few other places where we can lose a worker thread and
the RegionServer happily continues. One notable example is the worker
threads that handle syncs for the WAL. I'm generally a
fail-fast-and-loud advocate, so I like aborting when things look
wonky. I've also had to deal with a lot of pain around silent and thus
hard to see replication failures, so strong signals that the
replication system is in a bad way sound good to me atm.

Do we have worker threads that we can't safely continue without
indefinitely? Can we solve the general problem of "unhandled exception
in threads cause a RS Abort"?

As mentioned on the jira, I do worry a bit about cluster stability and
cascading failures, given the ability to have user-provided endpoints
in the replication process. Ultimately, I don't see that as different
than all the other places coprocessors can put the cluster at risk.


On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey   wrote:

(edited subject to ensure folks filtering for DISCUSS see this)



On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling
wrote:


Over in HBASE-17381 there has been some discussion around whether an
unhandled exception in a ReplicationSourceWorkerThread should trigger a
regionserver abort.

The current behavior in the case of an unexpected exception in
ReplicationSourceWorkerThread.run() is to log a message and simply let
the
thread die, allowing replication for this source to back up.

I've seen this happen in an OOME scenario, which seems like a clear case
where we would be better off aborting the regionserver.

However, in the case of any other unexpected exceptions out of the run()
method, how do we want to handle this?

I'm of the general opinion that where we would be better off aborting on
all unexpected exceptions, as it means that:
a) we have some missing error handling
b) failing fast raises visibility and makes it easier to add any error
handling that should be there
c) silently stopping up replication creates problems that are difficult
for
our users to identify operationally and hard to troubleshoot.

However, the current behavior has been there for quite a while, and
maybe
there are other situations or concerns I'm not seeing which would
justify
having regionserver stability over replication stability.

What are folks thoughts on this?  Should the regionserver abort on all
unexpected exceptions in the run method or should we more narrowly focus
this on OOME's?



Re: [DISCUSS] Re: Replication resiliency

2017-01-26 Thread Andrew Purtell
Agreed, let's abort with an abundance of caution. That is the _least_ that 
should be done when a thread dies unexpectedly. Maybe we can improve resiliency 
for specific cases later. 


On Jan 26, 2017, at 5:53 PM, Enis Söztutar  wrote:

>> Do we have worker threads that we can't safely continue without
> indefinitely? Can we solve the general problem of "unhandled exception
> in threads cause a RS Abort"?
> We have this already in the code base. We are injecting an
> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
> of the HRegionServer service threads (see HRS.startServiceThreads). But
> I've also seen cases where some important thread got unmarked. I think it
> is good idea to revisit that and make sure that all the threads are
> injected with the UEH.
> 
> The replication source threads are started on demand, that is why the UEH
> is not injected I think. But agreed that we should do the safe route here,
> and abort the regionserver.
> 
> Enis
> 
>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser  wrote:
>> 
>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>> exception without a reasonable expectation of self-healing, tank the RS.
>> 
>> Having those threads die but not the RS could go uncaught for indefinite
>> period of time.
>> 
>> 
>> Sean Busbey wrote:
>> 
>>> I've noticed a few other places where we can lose a worker thread and
>>> the RegionServer happily continues. One notable example is the worker
>>> threads that handle syncs for the WAL. I'm generally a
>>> fail-fast-and-loud advocate, so I like aborting when things look
>>> wonky. I've also had to deal with a lot of pain around silent and thus
>>> hard to see replication failures, so strong signals that the
>>> replication system is in a bad way sound good to me atm.
>>> 
>>> Do we have worker threads that we can't safely continue without
>>> indefinitely? Can we solve the general problem of "unhandled exception
>>> in threads cause a RS Abort"?
>>> 
>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>> cascading failures, given the ability to have user-provided endpoints
>>> in the replication process. Ultimately, I don't see that as different
>>> than all the other places coprocessors can put the cluster at risk.
>>> 
 On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey  wrote:
 
 (edited subject to ensure folks filtering for DISCUSS see this)
 
 
 
 On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling
 wrote:
 
> Over in HBASE-17381 there has been some discussion around whether an
> unhandled exception in a ReplicationSourceWorkerThread should trigger a
> regionserver abort.
> 
> The current behavior in the case of an unexpected exception in
> ReplicationSourceWorkerThread.run() is to log a message and simply let
> the
> thread die, allowing replication for this source to back up.
> 
> I've seen this happen in an OOME scenario, which seems like a clear case
> where we would be better off aborting the regionserver.
> 
> However, in the case of any other unexpected exceptions out of the run()
> method, how do we want to handle this?
> 
> I'm of the general opinion that where we would be better off aborting on
> all unexpected exceptions, as it means that:
> a) we have some missing error handling
> b) failing fast raises visibility and makes it easier to add any error
> handling that should be there
> c) silently stopping up replication creates problems that are difficult
> for
> our users to identify operationally and hard to troubleshoot.
> 
> However, the current behavior has been there for quite a while, and
> maybe
> there are other situations or concerns I'm not seeing which would
> justify
> having regionserver stability over replication stability.
> 
> What are folks thoughts on this?  Should the regionserver abort on all
> unexpected exceptions in the run method or should we more narrowly focus
> this on OOME's?
> 
 


Re: [DISCUSS] Re: Replication resiliency

2017-01-26 Thread Enis Söztutar
> Do we have worker threads that we can't safely continue without
indefinitely? Can we solve the general problem of "unhandled exception
in threads cause a RS Abort"?
We have this already in the code base. We are injecting an
UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
of the HRegionServer service threads (see HRS.startServiceThreads). But
I've also seen cases where some important thread got unmarked. I think it
is good idea to revisit that and make sure that all the threads are
injected with the UEH.

The replication source threads are started on demand, that is why the UEH
is not injected I think. But agreed that we should do the safe route here,
and abort the regionserver.

Enis

On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser  wrote:

> +1 If any "worker" thread can't safely/reasonably retry some unexpected
> exception without a reasonable expectation of self-healing, tank the RS.
>
> Having those threads die but not the RS could go uncaught for indefinite
> period of time.
>
>
> Sean Busbey wrote:
>
>> I've noticed a few other places where we can lose a worker thread and
>> the RegionServer happily continues. One notable example is the worker
>> threads that handle syncs for the WAL. I'm generally a
>> fail-fast-and-loud advocate, so I like aborting when things look
>> wonky. I've also had to deal with a lot of pain around silent and thus
>> hard to see replication failures, so strong signals that the
>> replication system is in a bad way sound good to me atm.
>>
>> Do we have worker threads that we can't safely continue without
>> indefinitely? Can we solve the general problem of "unhandled exception
>> in threads cause a RS Abort"?
>>
>> As mentioned on the jira, I do worry a bit about cluster stability and
>> cascading failures, given the ability to have user-provided endpoints
>> in the replication process. Ultimately, I don't see that as different
>> than all the other places coprocessors can put the cluster at risk.
>>
>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey  wrote:
>>
>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>
>>>
>>>
>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling
>>> wrote:
>>>
 Over in HBASE-17381 there has been some discussion around whether an
 unhandled exception in a ReplicationSourceWorkerThread should trigger a
 regionserver abort.

 The current behavior in the case of an unexpected exception in
 ReplicationSourceWorkerThread.run() is to log a message and simply let
 the
 thread die, allowing replication for this source to back up.

 I've seen this happen in an OOME scenario, which seems like a clear case
 where we would be better off aborting the regionserver.

 However, in the case of any other unexpected exceptions out of the run()
 method, how do we want to handle this?

 I'm of the general opinion that where we would be better off aborting on
 all unexpected exceptions, as it means that:
 a) we have some missing error handling
 b) failing fast raises visibility and makes it easier to add any error
 handling that should be there
 c) silently stopping up replication creates problems that are difficult
 for
 our users to identify operationally and hard to troubleshoot.

 However, the current behavior has been there for quite a while, and
 maybe
 there are other situations or concerns I'm not seeing which would
 justify
 having regionserver stability over replication stability.

 What are folks thoughts on this?  Should the regionserver abort on all
 unexpected exceptions in the run method or should we more narrowly focus
 this on OOME's?

>>>


Re: [DISCUSS] Re: Replication resiliency

2017-01-26 Thread Josh Elser
+1 If any "worker" thread can't safely/reasonably retry some unexpected 
exception without a reasonable expectation of self-healing, tank the RS.


Having those threads die but not the RS could go uncaught for indefinite 
period of time.


Sean Busbey wrote:

I've noticed a few other places where we can lose a worker thread and
the RegionServer happily continues. One notable example is the worker
threads that handle syncs for the WAL. I'm generally a
fail-fast-and-loud advocate, so I like aborting when things look
wonky. I've also had to deal with a lot of pain around silent and thus
hard to see replication failures, so strong signals that the
replication system is in a bad way sound good to me atm.

Do we have worker threads that we can't safely continue without
indefinitely? Can we solve the general problem of "unhandled exception
in threads cause a RS Abort"?

As mentioned on the jira, I do worry a bit about cluster stability and
cascading failures, given the ability to have user-provided endpoints
in the replication process. Ultimately, I don't see that as different
than all the other places coprocessors can put the cluster at risk.

On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey  wrote:

(edited subject to ensure folks filtering for DISCUSS see this)



On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling  wrote:

Over in HBASE-17381 there has been some discussion around whether an
unhandled exception in a ReplicationSourceWorkerThread should trigger a
regionserver abort.

The current behavior in the case of an unexpected exception in
ReplicationSourceWorkerThread.run() is to log a message and simply let the
thread die, allowing replication for this source to back up.

I've seen this happen in an OOME scenario, which seems like a clear case
where we would be better off aborting the regionserver.

However, in the case of any other unexpected exceptions out of the run()
method, how do we want to handle this?

I'm of the general opinion that where we would be better off aborting on
all unexpected exceptions, as it means that:
a) we have some missing error handling
b) failing fast raises visibility and makes it easier to add any error
handling that should be there
c) silently stopping up replication creates problems that are difficult for
our users to identify operationally and hard to troubleshoot.

However, the current behavior has been there for quite a while, and maybe
there are other situations or concerns I'm not seeing which would justify
having regionserver stability over replication stability.

What are folks thoughts on this?  Should the regionserver abort on all
unexpected exceptions in the run method or should we more narrowly focus
this on OOME's?


Re: [DISCUSS] Re: Replication resiliency

2017-01-26 Thread Sean Busbey
I've noticed a few other places where we can lose a worker thread and
the RegionServer happily continues. One notable example is the worker
threads that handle syncs for the WAL. I'm generally a
fail-fast-and-loud advocate, so I like aborting when things look
wonky. I've also had to deal with a lot of pain around silent and thus
hard to see replication failures, so strong signals that the
replication system is in a bad way sound good to me atm.

Do we have worker threads that we can't safely continue without
indefinitely? Can we solve the general problem of "unhandled exception
in threads cause a RS Abort"?

As mentioned on the jira, I do worry a bit about cluster stability and
cascading failures, given the ability to have user-provided endpoints
in the replication process. Ultimately, I don't see that as different
than all the other places coprocessors can put the cluster at risk.

On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey  wrote:
> (edited subject to ensure folks filtering for DISCUSS see this)
>
>
>
> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling  wrote:
>> Over in HBASE-17381 there has been some discussion around whether an
>> unhandled exception in a ReplicationSourceWorkerThread should trigger a
>> regionserver abort.
>>
>> The current behavior in the case of an unexpected exception in
>> ReplicationSourceWorkerThread.run() is to log a message and simply let the
>> thread die, allowing replication for this source to back up.
>>
>> I've seen this happen in an OOME scenario, which seems like a clear case
>> where we would be better off aborting the regionserver.
>>
>> However, in the case of any other unexpected exceptions out of the run()
>> method, how do we want to handle this?
>>
>> I'm of the general opinion that where we would be better off aborting on
>> all unexpected exceptions, as it means that:
>> a) we have some missing error handling
>> b) failing fast raises visibility and makes it easier to add any error
>> handling that should be there
>> c) silently stopping up replication creates problems that are difficult for
>> our users to identify operationally and hard to troubleshoot.
>>
>> However, the current behavior has been there for quite a while, and maybe
>> there are other situations or concerns I'm not seeing which would justify
>> having regionserver stability over replication stability.
>>
>> What are folks thoughts on this?  Should the regionserver abort on all
>> unexpected exceptions in the run method or should we more narrowly focus
>> this on OOME's?


[DISCUSS] Re: Replication resiliency

2017-01-26 Thread Sean Busbey
(edited subject to ensure folks filtering for DISCUSS see this)



On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling  wrote:
> Over in HBASE-17381 there has been some discussion around whether an
> unhandled exception in a ReplicationSourceWorkerThread should trigger a
> regionserver abort.
>
> The current behavior in the case of an unexpected exception in
> ReplicationSourceWorkerThread.run() is to log a message and simply let the
> thread die, allowing replication for this source to back up.
>
> I've seen this happen in an OOME scenario, which seems like a clear case
> where we would be better off aborting the regionserver.
>
> However, in the case of any other unexpected exceptions out of the run()
> method, how do we want to handle this?
>
> I'm of the general opinion that where we would be better off aborting on
> all unexpected exceptions, as it means that:
> a) we have some missing error handling
> b) failing fast raises visibility and makes it easier to add any error
> handling that should be there
> c) silently stopping up replication creates problems that are difficult for
> our users to identify operationally and hard to troubleshoot.
>
> However, the current behavior has been there for quite a while, and maybe
> there are other situations or concerns I'm not seeing which would justify
> having regionserver stability over replication stability.
>
> What are folks thoughts on this?  Should the regionserver abort on all
> unexpected exceptions in the run method or should we more narrowly focus
> this on OOME's?