[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181043#comment-16181043 ]
ASF GitHub Bot commented on KAFKA-5973:
---------------------------------------

GitHub user tedyu opened a pull request:

    https://github.com/apache/kafka/pull/3962

    KAFKA-5973 Exit when ShutdownableThread encounters uncaught exception

    This PR installs an UncaughtExceptionHandler that calls Exit.exit().

    According to the discussion on KAFKA-5973, exiting seems to be the
    consensus in this scenario.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tedyu/kafka trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3962.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3962

----
commit 9b0b7671c4a454c5dd2a9fa44ac7cd841c8f71ed
Author: tedyu <yuzhih...@gmail.com>
Date:   2017-09-26T16:14:02Z

    KAFKA-5973 Exit when ShutdownableThread encounters uncaught exception

----

> ShutdownableThread catching errors can lead to partial hard to diagnose
> broker failure
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5973
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Tom Crayford
>            Priority: Minor
>             Fix For: 1.0.0, 0.11.0.2
>
>         Attachments: 5973.v1.txt
>
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
> uncaught exception, the broker is left running in a very weird/bad state:
> some threads are no longer running, yet the broker may still be serving
> traffic to users without performing its usual operations.
> This is problematic, because monitoring may say that "the broker is up and
> fine", but in fact it is not healthy.
> At Heroku we've been mitigating this by monitoring all threads that "should"
> be running on a broker and alerting when a given thread isn't running for
> some reason.
> Things that use {{ShutdownableThread}} that can crash and leave a broker/the
> controller in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller to broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
> Some of these can have disastrous consequences, and nearly all of them
> crashing for any reason is a cause for alert.
> But users probably shouldn't have to know about all the internals of Kafka
> and run thread dumps periodically as part of normal operations.
> There are a few potential options here:
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker
> process
> We could crash the whole broker when an individual thread dies. I think this
> is pretty reasonable: it's better to have a very visible breakage than a very
> hard to detect one.
> 2. Add some healthcheck JMX bean to detect these thread crashes
> Users having to audit all of Kafka's source code on each new release and
> track a list of "threads that should be running" is... pretty silly. We could
> instead expose a JMX bean of some kind indicating threads that died due to
> uncaught exceptions.
> 3. Do nothing, but add documentation around monitoring/logging that exposes
> this error
> These thread deaths *do* emit log lines, but it's not that clear or obvious
> to users that they need to monitor and alert on them. The project could add
> documentation.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
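
For readers unfamiliar with the mechanism the pull request describes (installing an uncaught exception handler that exits the process, i.e. option 1 in the quoted issue), here is a minimal standalone sketch in plain Java. It is not the code from the pull request and does not use Kafka's Exit utility. The class FailFastThreads, the method newFailFastThread, and the injectable exitProcedure parameter are all hypothetical names introduced only for illustration.

```java
import java.util.function.IntConsumer;

/**
 * Hypothetical helper, not Kafka code: builds threads that invoke an
 * exit procedure when their body throws an uncaught exception.
 */
final class FailFastThreads {

    private FailFastThreads() {}

    /**
     * @param name          thread name, used in the log line
     * @param body          work to run on the thread
     * @param exitProcedure receives the exit status; a real process would
     *                      typically pass System::exit here (assumption)
     */
    static Thread newFailFastThread(String name, Runnable body, IntConsumer exitProcedure) {
        Thread thread = new Thread(body, name);
        thread.setUncaughtExceptionHandler((t, e) -> {
            // The handler runs on the dying thread, before it terminates.
            System.err.println("Thread " + t.getName()
                    + " died with uncaught exception: " + e);
            exitProcedure.accept(1);
        });
        return thread;
    }

    public static void main(String[] args) throws InterruptedException {
        // The demo only reports the status so that it can run to completion;
        // a long-running service would pass System::exit instead.
        Thread worker = newFailFastThread(
                "demo-worker",
                () -> { throw new IllegalStateException("simulated failure"); },
                status -> System.out.println("exit procedure called with status " + status));
        worker.start();
        worker.join();
    }
}
```

Making the exit procedure a parameter keeps the sketch testable without terminating the JVM. If System::exit is passed, shutdown hooks run before the process ends, so a hook that waits on the failing thread could block the exit. Whether that matters depends on the application.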