[DISCUSS] Re: Replication resiliency

Sean Busbey Thu, 26 Jan 2017 12:49:27 -0800

(edited subject to ensure folks filtering for DISCUSS see this)



On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling <[email protected]> wrote:
> Over in HBASE-17381 there has been some discussion around whether an
> unhandled exception in a ReplicationSourceWorkerThread should trigger a
> regionserver abort.
>
> The current behavior in the case of an unexpected exception in
> ReplicationSourceWorkerThread.run() is to log a message and simply let the
> thread die, allowing replication for this source to back up.
>
> I've seen this happen in an OOME scenario, which seems like a clear case
> where we would be better off aborting the regionserver.
>
> However, in the case of any other unexpected exceptions out of the run()
> method, how do we want to handle this?
>
> I'm of the general opinion that we would be better off aborting on
> all unexpected exceptions, as it means that:
> a) we have some missing error handling
> b) failing fast raises visibility and makes it easier to add any error
> handling that should be there
> c) silently stopping up replication creates problems that are difficult for
> our users to identify operationally and hard to troubleshoot.
>
> However, the current behavior has been there for quite a while, and maybe
> there are other situations or concerns I'm not seeing which would justify
> favoring regionserver stability over replication stability.
>
> What are folks' thoughts on this?  Should the regionserver abort on all
> unexpected exceptions in the run method or should we more narrowly focus
> this on OOMEs?
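
For anyone skimming the thread, here is a minimal sketch of the two options
in the question above: abort on every unexpected exception, or abort only on
OOME. It is not the actual ReplicationSourceWorkerThread code. The class
name, the AbortHook interface, and the oomeOnly flag are all hypothetical
stand-ins, and only plain JDK threading calls are used.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ReplicationWorkerSketch {

    /** Hypothetical stand-in for whatever the regionserver exposes to request an abort. */
    public interface AbortHook {
        void abort(String why, Throwable cause);
    }

    /**
     * Starts a worker thread and routes anything that escapes run() to the
     * abort hook instead of only logging it and letting the thread die.
     *
     * @param oomeOnly hypothetical policy switch: when true, only an
     *                 OutOfMemoryError triggers the abort hook; any other
     *                 throwable is logged and the thread simply ends.
     */
    public static Thread startWorker(Runnable work, AbortHook hook, boolean oomeOnly) {
        Thread worker = new Thread(work, "replication-worker-sketch");
        worker.setUncaughtExceptionHandler((thread, err) -> {
            if (!oomeOnly || err instanceof OutOfMemoryError) {
                // Fail fast: hand the decision to the server-level abort path.
                hook.abort("Unexpected exception in " + thread.getName(), err);
            } else {
                // Roughly the behavior described in the thread: log and let the thread die.
                System.err.println(thread.getName() + " died without abort: " + err);
            }
        });
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread t = startWorker(
            () -> { throw new IllegalStateException("boom"); },
            (why, cause) -> seen.set(cause),
            false);
        t.join();
        System.out.println("abort requested for: " + seen.get());
    }
}
```

With oomeOnly set to false the hook fires for any throwable that escapes the
worker; with it set to true the same IllegalStateException is only logged,
which is the case where replication for that source would quietly back up.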
