[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179910#comment-16179910 ]
Roger Hoover commented on KAFKA-5973:
-------------------------------------

I guess there's a 4th option too:

4) Restart failed threads - I think there would have to be a notion of FatalExceptions in this case so that unrecoverable failures can shut down the broker.

I'm in favor of #1, since it's the simplest way to expose critical issues.

> ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5973
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Tom Crayford
>            Priority: Minor
>             Fix For: 1.0.0, 0.11.0.2
>
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
> uncaught exception, the broker is left running in a bad state: some threads
> are no longer running, yet the broker may still be serving traffic to users
> without performing its usual operations.
>
> This is problematic, because monitoring may say that "the broker is up and
> fine", but in fact it is not healthy.
>
> At Heroku we've been mitigating this by monitoring all threads that "should"
> be running on a broker and alerting when a given thread isn't running for
> some reason.
>
> Things that use {{ShutdownableThread}} that can crash and leave a broker/the
> controller in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller to broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
>
> Some of these can have disastrous consequences, and nearly all of them
> crashing for any reason is a cause for alert.
>
> But users probably shouldn't have to know about all the internals of Kafka
> and run thread dumps periodically as part of normal operations.
>
> There are a few potential options here:
>
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker
> process
>
> We could crash the whole broker when an individual thread dies. I think this
> is pretty reasonable; it's better to have a very visible breakage than a very
> hard to detect one.
>
> 2. Add some healthcheck JMX bean to detect these thread crashes
>
> Users having to audit all of Kafka's source code on each new release and
> track a list of "threads that should be running" is... pretty silly. We could
> instead expose a JMX bean of some kind indicating threads that died due to
> uncaught exceptions.
>
> 3. Do nothing, but add documentation around monitoring/logging that exposes
> this error
>
> These thread deaths *do* emit log lines, but it's not that clear or obvious
> to users that they need to monitor and alert on them. The project could add
> documentation
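
For illustration only, options 1 and 2 could be sketched together roughly as below. This is a minimal sketch, not Kafka's actual {{ShutdownableThread}}: the class name `FailFastThreads`, the `start` and `deadThreads` helpers, and the `onFatal` callback are all hypothetical names made up for this example. The wrapper catches any `Throwable` escaping a thread's work, records the thread name (the kind of data a healthcheck bean might expose), and hands the error to a callback that a broker could use to stop the process.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical sketch; not Kafka's ShutdownableThread. */
public class FailFastThreads {

    // Names of threads whose work ended with an uncaught Throwable.
    // A healthcheck (option 2) could expose this set; that part is not shown.
    private static final Set<String> DEAD_THREADS = ConcurrentHashMap.newKeySet();

    /** Returns a sorted snapshot of the threads recorded as dead. */
    public static Set<String> deadThreads() {
        return new TreeSet<>(DEAD_THREADS);
    }

    /**
     * Starts a named thread running {@code work}. If the work throws, the
     * thread name is recorded and {@code onFatal} is called with the error.
     */
    public static Thread start(String name, Runnable work, Consumer<Throwable> onFatal) {
        Thread t = new Thread(() -> {
            try {
                work.run();
            } catch (Throwable error) {
                DEAD_THREADS.add(name);
                onFatal.accept(error);
            }
        }, name);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // For option 1, a real process might pass something like
        //   e -> Runtime.getRuntime().halt(1)
        // here; this demo only logs so that it can finish normally.
        Thread t = start("log-cleaner",
                () -> { throw new IllegalStateException("boom"); },
                e -> System.err.println("fatal error in thread: " + e));
        t.join();
        System.out.println("dead threads: " + deadThreads());
    }
}
```

Option 4 would presumably fit the same shape: the callback would decide, per exception type, whether to restart the thread or treat the failure as fatal.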