[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179702#comment-16179702 ] Ted Yu commented on KAFKA-5973:
---
Interesting. It looks like the following class is not used:
connect/runtime/src/main/java/org/apache/kafka/connect/util/ShutdownableThread.java
I would favor #1 above.

> ShutdownableThread catching errors can lead to partial, hard-to-diagnose broker failure
>
> Key: KAFKA-5973
> URL: https://issues.apache.org/jira/browse/KAFKA-5973
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.11.0.0, 0.11.0.1
> Reporter: Tom Crayford
> Priority: Minor
> Fix For: 1.0.0, 0.11.0.2
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an uncaught exception, the broker is left running in a very weird/bad state, with some threads not running; the broker can potentially still be serving traffic to users while not performing its usual operations.
> This is problematic because monitoring may say "the broker is up and fine" when in fact it is not healthy.
> At Heroku we've been mitigating this by monitoring all threads that "should" be running on a broker and alerting when a given thread isn't running for some reason.
> Things that use {{ShutdownableThread}} and can crash, leaving the broker/controller in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller-to-broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
> Some of these can have disastrous consequences, and nearly all of them crashing for any reason is cause for alert.
> But users probably shouldn't have to know about all the internals of Kafka and run thread dumps periodically as part of normal operations.
> There are a few potential options here:
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker process.
> We could crash the whole broker when an individual thread dies. I think this is pretty reasonable; it's better to have a very visible breakage than a very hard-to-detect one.
> 2. Add a healthcheck JMX bean to detect these thread crashes.
> Users having to audit all of Kafka's source code on each new release and track a list of "threads that should be running" is pretty silly. We could instead expose a JMX bean of some kind indicating threads that died due to uncaught exceptions.
> 3. Do nothing, but add documentation around monitoring/logging that exposes this error.
> These thread deaths *do* emit log lines, but it's not clear or obvious to users that they need to monitor and alert on them. The project could add documentation.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
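Option 1 above amounts to installing a JVM-wide uncaught-exception handler whose action is to terminate the process. A minimal sketch follows; this is not Kafka's actual code, and the class name and the injectable exit action are illustrative (in production the action would be something like `Runtime.getRuntime().halt(1)`):

```java
// Minimal sketch of option 1: any thread dying from an uncaught exception
// triggers a process-wide shutdown action. The exit action is injected so
// the behavior can be observed without actually killing the JVM.
public class ExitOnUncaughtException {
    public static void install(Runnable exitAction) {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Thread " + thread.getName()
                    + " died from uncaught exception: " + error);
            exitAction.run(); // production: () -> Runtime.getRuntime().halt(1)
        });
    }
}
```

The default handler applies to every thread that has no handler of its own, which is what makes a single installation cover all of the broker's background threads.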
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179910#comment-16179910 ] Roger Hoover commented on KAFKA-5973:
---
I guess there's a 4th option too:
4) Restart failed threads. I think there would have to be a notion of FatalExceptions in this case, so that unrecoverable failures can shut down the broker.
I'm in favor of #1, since it's the simplest way to expose critical issues.
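The restart-with-fatal-exceptions idea can be sketched as a supervisor loop. This is purely illustrative, not anything in Kafka; `FatalError`, `maxRestarts`, and `exitHook` are all hypothetical names:

```java
// Hypothetical sketch of option 4: rerun a thread's work after a transient
// failure, but treat a designated fatal exception as unrecoverable and shut
// the broker down. FatalError, maxRestarts and exitHook are illustrative.
public class RestartingRunner {
    public static class FatalError extends RuntimeException {
        public FatalError(String message) { super(message); }
    }

    /** Runs body, restarting it on non-fatal failure; returns the restart count. */
    public static int runWithRestarts(Runnable body, int maxRestarts, Runnable exitHook) {
        int restarts = 0;
        while (true) {
            try {
                body.run();
                return restarts;                // clean completion
            } catch (FatalError e) {
                exitHook.run();                 // unrecoverable: stop the broker
                return restarts;
            } catch (RuntimeException e) {
                if (++restarts > maxRestarts) { // too many failures: give up
                    exitHook.run();
                    return restarts;
                }
            }
        }
    }
}
```

The restart cap matters: without it, a deterministic bug would make the thread crash-loop silently, which is exactly the hard-to-diagnose state this ticket is about.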
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181043#comment-16181043 ] ASF GitHub Bot commented on KAFKA-5973:
---
GitHub user tedyu opened a pull request:

    https://github.com/apache/kafka/pull/3962

    KAFKA-5973 Exit when ShutdownableThread encounters uncaught exception

    This PR installs an UncaughtExceptionHandler which calls Exit.exit(). According to discussion on KAFKA-5973, exiting seems to be the consensus in this scenario.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tedyu/kafka trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3962.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3962

commit 9b0b7671c4a454c5dd2a9fa44ac7cd841c8f71ed
Author: tedyu
Date: 2017-09-26T16:14:02Z

    KAFKA-5973 Exit when ShutdownableThread encounters uncaught exception
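The PR's approach can be sketched as a per-thread handler installed by the thread class itself. This is an illustration in the spirit of the patch, not the actual diff; the class name is hypothetical, and the injectable exit procedure stands in for Kafka's Exit.exit():

```java
// Sketch in the spirit of the PR: the thread installs its own
// UncaughtExceptionHandler, which invokes an exit procedure when the
// thread's body throws. The exit procedure is injectable for testing.
public class HaltingThread extends Thread {
    public HaltingThread(String name, Runnable body, Runnable exitProcedure) {
        super(body, name);
        setUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Fatal error in " + thread.getName() + ": " + error);
            exitProcedure.run(); // production: something like Exit.exit(1)
        });
    }
}
```

Scoping the handler to the thread class (rather than setting the JVM-wide default) limits the blast radius to the broker's own worker threads.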
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181184#comment-16181184 ] Ted Yu commented on KAFKA-5973:
---
[~tcrayford-heroku] [~theduderog]: What do you think of the PR?
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182694#comment-16182694 ] Ted Yu commented on KAFKA-5973:
---
[~damianguy] [~guozhang]: Can you take a look? Thanks
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182705#comment-16182705 ] Ismael Juma commented on KAFKA-5973:
---
Because of the statefulness of Kafka brokers, you may not want to kill a broker if a thread dies. It may be better to trigger an alert via a metric and let the Ops team decide how they would like to handle it. In some cases, you may want to run some additional diagnostics while the broker is still running. Also, imagine a situation where a software bug causes one thread to die in multiple brokers. This could be a somewhat harmless situation, but if each of them immediately commits suicide, you may have a serious outage.
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182775#comment-16182775 ] Ted Yu commented on KAFKA-5973:
---
[~ijuma]: Among the threads identified by Tom, can you see if any thread doesn't have to exist for the broker to keep functioning?
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182794#comment-16182794 ] Roger Hoover commented on KAFKA-5973:
---
[~ijuma] You're right that failing more aggressively could sometimes make an outage worse. However, I think the right way to address that concern is with more tests (unit, system, fault-injection, etc.). Otherwise, as the software evolves, operators will be forever trying to detect and respond to a changing array of partially broken states.
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182839#comment-16182839 ] Ismael Juma commented on KAFKA-5973:
---
[~theduderog], hmm, I don't understand why. If there is a metric, operations can simply kill the broker themselves if that's what they want, right?
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182856#comment-16182856 ] Ismael Juma commented on KAFKA-5973:
---
A few things to think about before we devise a solution:
1. What causes a thread to die? Some of the mentioned threads catch Throwable and basically avoid death at all costs. It would be good to have a general policy on how unexpected exceptions are handled.
2. How do we handle exceptions/errors that we have little control over? A good example is OutOfMemoryError.
3. Should the broker kill itself, or should it inform a monitoring system (via metrics) that has a view of the cluster and can perhaps do better? For example, such a system could detect an OOM and restart one broker at a time (if multiple brokers are affected). It could also potentially increase the heap or tweak some config settings.
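The metric-based alternative could look roughly like the following: record threads that died from uncaught exceptions and expose them through a JMX bean that an external monitor can poll. All names here are illustrative, and Kafka's real metrics go through its own metrics layer rather than a hand-rolled MBean:

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.MBeanServer;
import javax.management.MXBean;
import javax.management.ObjectName;

// Illustrative sketch of the metric-based alternative: threads that die from
// uncaught exceptions are recorded and exposed through JMX, so an external
// monitor can alert, run diagnostics, or restart brokers one at a time.
public class DeadThreads implements DeadThreads.View {
    @MXBean
    public interface View {
        int getDeadThreadCount();
        String[] getDeadThreadNames();
    }

    private final List<String> dead = new CopyOnWriteArrayList<>();

    public void register() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(this, new ObjectName("kafka.example:type=DeadThreads"));
        // Record the name of any thread that dies from an uncaught exception.
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> dead.add(t.getName()));
    }

    @Override public int getDeadThreadCount() { return dead.size(); }
    @Override public String[] getDeadThreadNames() { return dead.toArray(new String[0]); }
}
```

A stable attribute like `DeadThreadCount` avoids the problem of monitoring a thread list that changes across releases: alerting on "count > 0" keeps working even as internal thread names come and go.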
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182867#comment-16182867 ] Roger Hoover commented on KAFKA-5973: - [~ijuma] Great points. A stable metric solves the issue of maintaining an evolving list of thread to monitor and allows an external system to take controlled action such as rolling restart. +1 for that approach > ShutdownableThread catching errors can lead to partial hard to diagnose > broker failure > -- > > Key: KAFKA-5973 > URL: https://issues.apache.org/jira/browse/KAFKA-5973 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tom Crayford > Fix For: 0.11.0.2, 1.0.1 > > Attachments: 5973.v1.txt > > > When any kafka broker {{ShutdownableThread}} subclasses crashes due to an > uncaught exception, the broker is left running in a very weird/bad state with > some > threads not running, but potentially the broker can still be serving traffic > to > users but not performing its usual operations. > This is problematic, because monitoring may say that "the broker is up and > fine", but in fact it is not healthy. > At Heroku we've been mitigating this by monitoring all threads that "should" > be > running on a broker and alerting when a given thread isn't running for some > reason. > Things that use {{ShutdownableThread}} that can crash and leave a broker/the > controller in a bad state: > - log cleaner > - replica fetcher threads > - controller to broker send threads > - controller topic deletion threads > - quota throttling reapers > - io threads > - network threads > - group metadata management threads > Some of these can have disasterous consequences, and nearly all of them > crashing for any reason is a cause for alert. > But, users probably shouldn't have to know about all the internals of Kafka > and run thread dumps periodically as part of normal operations. > There are a few potential options here: > 1. 
On the crash of any {{ShutdownableThread}}, shutdown the whole broker > process > We could crash the whole broker when an individual thread dies. I think this > is pretty reasonable, it's better to have a very visible breakage than a very > hard to detect one. > 2. Add some healthcheck JMX bean to detect these thread crashes > Users having to audit all of Kafka's source code on each new release and > track a list of "threads that should be running" is... pretty silly. We could > instead expose a JMX bean of some kind indicating threads that died due to > uncaught exceptions > 3. Do nothing, but add documentation around monitoring/logging that exposes > this error > These thread deaths *do* emit log lines, but it's not that clear or obvious > to users they need to monitor and alert on them. The project could add > documentation -- This message was sent by Atlassian JIRA (v6.4.14#64029)
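Option 1 above (crash the whole broker when a critical thread dies) could be sketched with a JVM uncaught-exception handler. This is a minimal, hypothetical Java illustration, not Kafka's actual implementation; the class name `FailFastThread` and the exit code are made up for the example.

```java
// Hypothetical sketch of option 1: a thread whose uncaught exceptions
// halt the whole process loudly instead of dying silently.
// FailFastThread and FATAL_EXIT_CODE are illustrative names, not Kafka's.
public class FailFastThread extends Thread {
    static final int FATAL_EXIT_CODE = 1;

    public FailFastThread(Runnable body, String name) {
        super(body, name);
        // Any throwable that escapes run() is logged and brings the
        // process down, so monitoring sees a dead broker, not a zombie.
        setUncaughtExceptionHandler((t, e) -> {
            System.err.println("Fatal error in thread " + t.getName() + ": " + e);
            // halt() skips shutdown hooks; a real broker might prefer an
            // orderly shutdown first, falling back to halt on a timeout.
            Runtime.getRuntime().halt(FATAL_EXIT_CODE);
        });
    }
}
```

The trade-off debated in this thread is exactly the one the handler makes explicit: a very visible breakage (process exit) versus a hard-to-detect partial failure.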
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182962#comment-16182962 ] ASF GitHub Bot commented on KAFKA-5973: --- Github user tedyu closed the pull request at: https://github.com/apache/kafka/pull/3962
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193905#comment-16193905 ] Guozhang Wang commented on KAFKA-5973: -- I'm in favor of both action items, i.e. 1) making a pass over the existing threads' exception-handling logic and deciding which exceptions can be handled, which should kill the thread itself, and which should kill the whole process, and 2) adding a metric for alive threads by category (handler, socket receiver/sender, replica fetcher, log cleaner) on brokers.
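The per-category alive-thread metric suggested above could be backed by a simple registry that each managed thread reports into, with the counts exposed as JMX gauges. This is a hedged sketch only; `ThreadLivenessRegistry` and its method names are invented for illustration and do not reflect Kafka's actual metrics API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of option 2 / the metric suggested above: a
// thread-safe registry counting live threads per category. A JMX gauge
// (or any external monitor) would read aliveCount() and alert when a
// category drops below its expected size.
public class ThreadLivenessRegistry {
    private final Map<String, LongAdder> alive = new ConcurrentHashMap<>();

    // Called when a managed thread (replica fetcher, log cleaner, ...) starts.
    public void onThreadStart(String category) {
        alive.computeIfAbsent(category, k -> new LongAdder()).increment();
    }

    // Called from a finally block or uncaught-exception handler on death.
    public void onThreadDeath(String category) {
        alive.computeIfAbsent(category, k -> new LongAdder()).decrement();
    }

    // The value the gauge reports for a category; 0 if never registered.
    public long aliveCount(String category) {
        LongAdder a = alive.get(category);
        return a == null ? 0 : a.sum();
    }
}
```

A stable metric like this is what the earlier comment favored: operators monitor one gauge per category rather than maintaining an evolving list of internal thread names across releases.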
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193918#comment-16193918 ] Ted Yu commented on KAFKA-5973: --- bq. which to kill itself, which to kill the whole process In the description, Tom listed the threads which use {{ShutdownableThread}}. Are any of them not in the above categories?
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240958#comment-16240958 ] Rajini Sivaram commented on KAFKA-5973: --- [~tedyu] Are you still working on this one?
[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240971#comment-16240971 ] Ted Yu commented on KAFKA-5973: --- Currently not - there is no consensus on the solution.