There is an old JIRA somewhere to use Error Prone (https://github.com/google/error-prone) as framework for implementing static code analysis checks like that. FWIW
> On Jan 27, 2017, at 1:03 PM, Sean Busbey <[email protected]> wrote: > > Josh, probably worth checking if a grep or something else we can do in > precommit could catch this. > > > >> On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser <[email protected]> wrote: >> Cool. >> >> Let me open an issue to scan the codebase to see if we can find any >> instances where we are starting threads which aren't using the UEH. >> >> >> Andrew Purtell wrote: >>> >>> Agreed, let's abort with an abundance of caution. That is the _least_ that >>> should be done when a thread dies unexpectedly. Maybe we can improve >>> resiliency for specific cases later. >>> >>> >>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar<[email protected]> wrote: >>> >>>>> Do we have worker threads that we can't safely continue without >>>> >>>> indefinitely? Can we solve the general problem of "unhandled exception >>>> in threads cause a RS Abort"? >>>> We have this already in the code base. We are injecting an >>>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all >>>> of the HRegionServer service threads (see HRS.startServiceThreads). But >>>> I've also seen cases where some important thread got unmarked. I think it >>>> is good idea to revisit that and make sure that all the threads are >>>> injected with the UEH. >>>> >>>> The replication source threads are started on demand, that is why the UEH >>>> is not injected I think. But agreed that we should do the safe route >>>> here, >>>> and abort the regionserver. >>>> >>>> Enis >>>> >>>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser<[email protected]> wrote: >>>>> >>>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected >>>>> exception without a reasonable expectation of self-healing, tank the RS. >>>>> >>>>> Having those threads die but not the RS could go uncaught for indefinite >>>>> period of time. >>>>> >>>>> >>>>> Sean Busbey wrote: >>>>> >>>>>> I've noticed a few other places where we can lose a worker thread and >>>>>> the RegionServer happily continues. One notable example is the worker >>>>>> threads that handle syncs for the WAL. I'm generally a >>>>>> fail-fast-and-loud advocate, so I like aborting when things look >>>>>> wonky. I've also had to deal with a lot of pain around silent and thus >>>>>> hard to see replication failures, so strong signals that the >>>>>> replication system is in a bad way sound good to me atm. >>>>>> >>>>>> Do we have worker threads that we can't safely continue without >>>>>> indefinitely? Can we solve the general problem of "unhandled exception >>>>>> in threads cause a RS Abort"? >>>>>> >>>>>> As mentioned on the jira, I do worry a bit about cluster stability and >>>>>> cascading failures, given the ability to have user-provided endpoints >>>>>> in the replication process. Ultimately, I don't see that as different >>>>>> than all the other places coprocessors can put the cluster at risk. >>>>>> >>>>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>> (edited subject to ensure folks filtering for DISCUSS see this) >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Over in HBASE-17381 there has been some discussion around whether an >>>>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger >>>>>>>> a >>>>>>>> regionserver abort. >>>>>>>> >>>>>>>> The current behavior in the case of an unexpected exception in >>>>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply >>>>>>>> let >>>>>>>> the >>>>>>>> thread die, allowing replication for this source to back up. >>>>>>>> >>>>>>>> I've seen this happen in an OOME scenario, which seems like a clear >>>>>>>> case >>>>>>>> where we would be better off aborting the regionserver. >>>>>>>> >>>>>>>> However, in the case of any other unexpected exceptions out of the >>>>>>>> run() >>>>>>>> method, how do we want to handle this? >>>>>>>> >>>>>>>> I'm of the general opinion that where we would be better off aborting >>>>>>>> on >>>>>>>> all unexpected exceptions, as it means that: >>>>>>>> a) we have some missing error handling >>>>>>>> b) failing fast raises visibility and makes it easier to add any >>>>>>>> error >>>>>>>> handling that should be there >>>>>>>> c) silently stopping up replication creates problems that are >>>>>>>> difficult >>>>>>>> for >>>>>>>> our users to identify operationally and hard to troubleshoot. >>>>>>>> >>>>>>>> However, the current behavior has been there for quite a while, and >>>>>>>> maybe >>>>>>>> there are other situations or concerns I'm not seeing which would >>>>>>>> justify >>>>>>>> having regionserver stability over replication stability. >>>>>>>> >>>>>>>> What are folks thoughts on this? Should the regionserver abort on >>>>>>>> all >>>>>>>> unexpected exceptions in the run method or should we more narrowly >>>>>>>> focus >>>>>>>> this on OOME's? >>>>>>>> >>
