[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193905#comment-16193905 ]
Guozhang Wang commented on KAFKA-5973:
--------------------------------------

I'm in favor of both action items, i.e. 1) making a pass over the existing threads' exception handling logic and deciding which exceptions can be handled, which should kill only the thread itself, and which should kill the whole process, and 2) adding a metric for alive threads by category (handler, socket receiver / sender, replica fetcher, log cleaner) on brokers.

> ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5973
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Tom Crayford
>             Fix For: 0.11.0.2, 1.0.1
>
>         Attachments: 5973.v1.txt
>
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
> uncaught exception, the broker is left running in a very weird/bad state with
> some threads not running, but potentially the broker can still be serving
> traffic to users but not performing its usual operations.
> This is problematic, because monitoring may say that "the broker is up and
> fine", but in fact it is not healthy.
> At Heroku we've been mitigating this by monitoring all threads that "should"
> be running on a broker and alerting when a given thread isn't running for
> some reason.
> Things that use {{ShutdownableThread}} that can crash and leave a broker/the
> controller in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller to broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
> Some of these can have disastrous consequences, and nearly all of them
> crashing for any reason is a cause for alert.
> But users probably shouldn't have to know about all the internals of Kafka
> and run thread dumps periodically as part of normal operations.
> There are a few potential options here:
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker
> process
> We could crash the whole broker when an individual thread dies. I think this
> is pretty reasonable: it's better to have a very visible breakage than a very
> hard to detect one.
> 2. Add some healthcheck JMX bean to detect these thread crashes
> Users having to audit all of Kafka's source code on each new release and
> track a list of "threads that should be running" is... pretty silly. We could
> instead expose a JMX bean of some kind indicating threads that died due to
> uncaught exceptions.
> 3. Do nothing, but add documentation around monitoring/logging that exposes
> this error
> These thread deaths *do* emit log lines, but it's not that clear or obvious
> to users that they need to monitor and alert on them. The project could add
> documentation.
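
To make the two action items concrete, here is a rough Java sketch of how options 1 and 2 from the description might fit together. It is not Kafka code: `ThreadLiveness`, `wrap`, `aliveCount`, `deadThreads` and the `onDeath` callback are hypothetical names, and the category strings are only examples.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Kafka: tracks how many threads are alive
// per category and records threads that died from an uncaught throwable.
class ThreadLiveness {
    // Alive-thread count per category, e.g. "log-cleaner" or "replica-fetcher".
    private final Map<String, AtomicInteger> alive = new ConcurrentHashMap<>();
    // Thread name -> throwable that killed it.
    private final Map<String, Throwable> dead = new ConcurrentHashMap<>();
    // Escalation hook: decides what a thread death means for the process.
    private final BiConsumer<String, Throwable> onDeath;

    ThreadLiveness(BiConsumer<String, Throwable> onDeath) {
        this.onDeath = onDeath;
    }

    // Wraps a thread body so the registry sees it start, finish, or crash.
    Runnable wrap(String category, String name, Runnable body) {
        AtomicInteger counter =
            alive.computeIfAbsent(category, c -> new AtomicInteger());
        return () -> {
            counter.incrementAndGet();
            try {
                body.run();
            } catch (Throwable t) {
                dead.put(name, t);        // what a health check could report
                onDeath.accept(name, t);  // e.g. log only, or stop the process
            } finally {
                counter.decrementAndGet();
            }
        };
    }

    // Number of wrapped threads currently running in the given category.
    int aliveCount(String category) {
        AtomicInteger counter = alive.get(category);
        return counter == null ? 0 : counter.get();
    }

    // Read-only view of threads that died abnormally.
    Map<String, Throwable> deadThreads() {
        return Collections.unmodifiableMap(dead);
    }
}
```

A thread would be started as `new Thread(liveness.wrap("log-cleaner", "cleaner-0", task), "cleaner-0").start()`. Under option 1 the callback could be `(name, t) -> Runtime.getRuntime().halt(1)` for the thread types judged fatal; under option 2 `aliveCount` and `deadThreads` would be what gets exposed as a metric or JMX attribute. Which throwables deserve which treatment is exactly what the pass in item 1 would have to decide.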