(edited subject to ensure folks filtering for DISCUSS see this)
On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling <[email protected]> wrote:
> Over in HBASE-17381 there has been some discussion around whether an
> unhandled exception in a ReplicationSourceWorkerThread should trigger a
> regionserver abort.
>
> The current behavior in the case of an unexpected exception in
> ReplicationSourceWorkerThread.run() is to log a message and simply let the
> thread die, allowing replication for this source to back up.
>
> I've seen this happen in an OOME scenario, which seems like a clear case
> where we would be better off aborting the regionserver.
>
> However, in the case of any other unexpected exceptions out of the run()
> method, how do we want to handle this?
>
> I'm of the general opinion that we would be better off aborting on
> all unexpected exceptions, as it means that:
> a) we have some missing error handling
> b) failing fast raises visibility and makes it easier to add any error
> handling that should be there
> c) silently stopping up replication creates problems that are difficult for
> our users to identify operationally and hard to troubleshoot.
>
> However, the current behavior has been there for quite a while, and maybe
> there are other situations or concerns I'm not seeing which would justify
> having regionserver stability over replication stability.
>
> What are folks' thoughts on this? Should the regionserver abort on all
> unexpected exceptions in the run method or should we more narrowly focus
> this on OOME's?
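
To make the two options in the question concrete, here is a minimal Java sketch. It is not the HBase code: `AbortPolicySketch`, `Abortable`, `Worker`, `runOnce` and the `abortOnAnyThrowable` flag are all hypothetical names made up for illustration, and the real worker thread does far more than this. With the flag set, any Throwable escaping the worker body requests an abort; with it unset, only an OutOfMemoryError does, and everything else is logged while the thread dies (which, for non-OOME failures, is roughly the current behavior described above).

```java
import java.util.concurrent.atomic.AtomicReference;

final class AbortPolicySketch {
    /** Hypothetical stand-in for whatever lets a worker ask its server to abort. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /** Hypothetical worker thread; only the error-handling shape matters here. */
    static final class Worker extends Thread {
        private final Runnable body;
        private final Abortable server;
        private final boolean abortOnAnyThrowable; // assumed policy switch

        Worker(Runnable body, Abortable server, boolean abortOnAnyThrowable) {
            super("worker-sketch");
            this.body = body;
            this.server = server;
            this.abortOnAnyThrowable = abortOnAnyThrowable;
        }

        @Override
        public void run() {
            try {
                body.run();
            } catch (Throwable t) {
                if (abortOnAnyThrowable || t instanceof OutOfMemoryError) {
                    // Fail fast: surface the problem instead of stalling quietly.
                    server.abort("Unexpected exception in " + getName(), t);
                } else {
                    // Log and let the thread die; work for this source backs up.
                    System.err.println(getName() + " died: " + t);
                }
            }
        }
    }

    /** Runs one worker to completion and reports whether an abort was requested. */
    static String runOnce(Runnable body, boolean abortOnAnyThrowable) {
        AtomicReference<String> outcome = new AtomicReference<>("no-abort");
        Worker w = new Worker(
            body,
            (why, cause) -> outcome.set("abort:" + cause.getClass().getSimpleName()),
            abortOnAnyThrowable);
        w.start();
        try {
            w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return outcome.get();
    }

    public static void main(String[] args) {
        Runnable failing = () -> { throw new IllegalStateException("boom"); };
        System.out.println(runOnce(failing, true));
        System.out.println(runOnce(failing, false));
    }
}
```

The only difference between the two policies is the condition in the catch block, so the choice is really about which failures we are willing to let pass with nothing but a log line.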
