[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-25 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179702#comment-16179702 ]

Ted Yu commented on KAFKA-5973:
---

Interesting.
Looks like the following class is not used:
connect/runtime/src/main/java/org/apache/kafka/connect/util/ShutdownableThread.java

I would favor #1 above.

> ShutdownableThread catching errors can lead to partial hard to diagnose 
> broker failure
> --
>
> Key: KAFKA-5973
> URL: https://issues.apache.org/jira/browse/KAFKA-5973
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0, 0.11.0.1
>Reporter: Tom Crayford
>Priority: Minor
> Fix For: 1.0.0, 0.11.0.2
>
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
> uncaught exception, the broker is left running in a very weird/bad state:
> some threads are not running, but the broker can potentially still be serving
> traffic to users while not performing its usual operations.
> This is problematic, because monitoring may say that "the broker is up and
> fine", but in fact it is not healthy.
> At Heroku we've been mitigating this by monitoring all threads that "should"
> be running on a broker and alerting when a given thread isn't running for
> some reason.
> Things that use {{ShutdownableThread}} and can crash, leaving a broker/the
> controller in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller to broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
> Some of these can have disastrous consequences, and nearly all of them
> crashing for any reason is a cause for alert.
> But users probably shouldn't have to know about all the internals of Kafka
> and run thread dumps periodically as part of normal operations.
> There are a few potential options here:
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker
> process
> We could crash the whole broker when an individual thread dies. I think this
> is pretty reasonable; it's better to have a very visible breakage than a very
> hard-to-detect one.
> 2. Add some healthcheck JMX bean to detect these thread crashes
> Users having to audit all of Kafka's source code on each new release and
> track a list of "threads that should be running" is... pretty silly. We could
> instead expose a JMX bean of some kind indicating threads that died due to
> uncaught exceptions.
> 3. Do nothing, but add documentation around monitoring/logging that exposes
> this error
> These thread deaths *do* emit log lines, but it's not that clear or obvious
> to users that they need to monitor and alert on them. The project could add
> documentation





[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-25 Thread Roger Hoover (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179910#comment-16179910 ]

Roger Hoover commented on KAFKA-5973:
-

I guess there's a 4th option too:
4) Restart failed threads - I think there would have to be a notion of
FatalExceptions in this case so that unrecoverable failures can shut down the
broker.

I'm in favor of #1, since it's the simplest way to expose critical issues.
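
A rough sketch of what option 4 could look like - purely illustrative, with a
hypothetical FatalException marker type and doWork callback, not code from any
patch:

{code:java}
// Illustrative only: restart the work loop on recoverable failures, but take
// the broker down on fatal ones. FatalException is a hypothetical marker type.
public class RestartingThread extends Thread {

    public static class FatalException extends RuntimeException {
        public FatalException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final Runnable doWork;           // one iteration of the thread's work
    private volatile boolean running = true;

    public RestartingThread(String name, Runnable doWork) {
        super(name);
        this.doWork = doWork;
    }

    @Override
    public void run() {
        while (running) {
            try {
                doWork.run();
            } catch (FatalException | Error e) {
                // Unrecoverable: make the failure visible by exiting the process.
                System.err.println(getName() + " hit a fatal error, exiting: " + e);
                Runtime.getRuntime().halt(1);
            } catch (Exception e) {
                // Recoverable: log it and keep the loop (and thus the thread) alive.
                System.err.println(getName() + " failed, restarting its work: " + e);
            }
        }
    }

    public void shutdown() {
        running = false;
    }
}
{code}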



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-26 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181043#comment-16181043 ]

ASF GitHub Bot commented on KAFKA-5973:
---

GitHub user tedyu opened a pull request:

https://github.com/apache/kafka/pull/3962

KAFKA-5973 Exit when ShutdownableThread encounters uncaught exception

This PR installs an UncaughtExceptionHandler which calls Exit.exit().

According to discussion on KAFKA-5973, exiting seems to be the consensus in 
this scenario.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tedyu/kafka trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3962.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3962


commit 9b0b7671c4a454c5dd2a9fa44ac7cd841c8f71ed
Author: tedyu 
Date:   2017-09-26T16:14:02Z

KAFKA-5973 Exit when ShutdownableThread encounters uncaught exception
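
For reference, a minimal sketch of the approach the PR describes - not the
actual patch; Kafka's Exit.exit() is stood in for by System.exit() so the
snippet stays self-contained:

{code:java}
// Sketch only: a ShutdownableThread-like base class that installs an
// UncaughtExceptionHandler so an uncaught exception terminates the whole
// process instead of silently killing a single broker thread.
public abstract class ExitingShutdownableThread extends Thread {

    private volatile boolean running = true;

    public ExitingShutdownableThread(String name) {
        super(name);
        // The PR calls Kafka's Exit.exit(); System.exit() is used here only to
        // keep the sketch self-contained.
        setUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Uncaught exception in " + thread.getName() + ": " + throwable);
            System.exit(1);
        });
    }

    // Subclasses implement one iteration of work, as ShutdownableThread does.
    protected abstract void doWork();

    @Override
    public void run() {
        while (running)
            doWork();
    }

    public void shutdown() {
        running = false;
    }
}
{code}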






[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-26 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181184#comment-16181184 ]

Ted Yu commented on KAFKA-5973:
---

[~tcrayford-heroku] [~theduderog]:
What do you think of the PR?



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182694#comment-16182694 ]

Ted Yu commented on KAFKA-5973:
---

[~damianguy] [~guozhang]:
Can you take a look?

Thanks



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Ismael Juma (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182705#comment-16182705 ]

Ismael Juma commented on KAFKA-5973:


Because of the statefulness of Kafka brokers, you may not want to kill the
broker if a thread dies. It may be better to trigger an alert via a metric and
let the Ops team decide how they would like to handle it. In some cases, you
may want to run some additional diagnostics while the broker is still running.
Also, imagine a situation where a software bug causes one thread to die in
multiple brokers. This could be a somewhat harmless situation, but if each of
them immediately commits suicide, you may have a serious outage.



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182775#comment-16182775 ]

Ted Yu commented on KAFKA-5973:
---

[~ijuma]:
Among the threads identified by Tom, can you see if any thread doesn't have to
exist for the broker to keep functioning?



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Roger Hoover (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182794#comment-16182794 ]

Roger Hoover commented on KAFKA-5973:
-

[~ijuma] You're right that failing more aggressively could sometimes make an
outage worse. However, I think the right way to address that concern is with
more tests (unit, system, fault-injection, etc.). Otherwise, as the software
evolves, operators will be forever trying to detect and respond to a changing
array of partially broken states.



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Ismael Juma (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182839#comment-16182839 ]

Ismael Juma commented on KAFKA-5973:


[~theduderog], hmm, I don't understand why. If there is a metric, operations 
can simply kill the broker themselves if that's what they want, right?



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Ismael Juma (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182856#comment-16182856 ]

Ismael Juma commented on KAFKA-5973:


A few things to think about before we devise a solution:

1. What causes a thread to die? Some of the mentioned threads catch Throwable
and basically avoid death at all costs. It would be good to have a general
policy on how unexpected exceptions are handled.
2. How do we handle exceptions/errors that we have little control over? A good
example is OutOfMemoryError.
3. Should the broker kill itself, or should it inform a monitoring system (via
metrics) that has a view of the cluster and can perhaps do better? For example,
such a system could detect an OOM and restart one broker at a time (if multiple
brokers are affected). It could also potentially increase the heap or tweak some
config settings.
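
A minimal sketch of the metric route using only the standard javax.management
API - the bean name, attributes, and wiring below are hypothetical, not an
existing Kafka metric:

{code:java}
// --- DeadThreadMonitorMBean.java ---
// Standard MBean naming convention: the management interface is <ClassName>MBean.
public interface DeadThreadMonitorMBean {
    int getDeadThreadCount();
    String getDeadThreadNames();
}

// --- DeadThreadMonitor.java ---
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.ObjectName;

// Records threads that died from uncaught exceptions and exposes them over
// JMX, so monitoring can alert (or an operator/automation can decide whether
// to bounce the broker). The object name below is made up for the sketch.
public class DeadThreadMonitor implements DeadThreadMonitorMBean {

    private final Set<String> deadThreads = ConcurrentHashMap.newKeySet();

    @Override
    public int getDeadThreadCount() {
        return deadThreads.size();
    }

    @Override
    public String getDeadThreadNames() {
        return String.join(",", deadThreads);
    }

    // Call once at broker startup (sketch only); threads that already swallow
    // Throwable themselves would additionally need to report here explicitly.
    public void install() throws Exception {
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(this, new ObjectName("kafka.server:type=DeadThreadMonitor"));
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> deadThreads.add(t.getName()));
    }
}
{code}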



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread Roger Hoover (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182867#comment-16182867 ]

Roger Hoover commented on KAFKA-5973:
-

[~ijuma] Great points. A stable metric solves the issue of maintaining an
evolving list of threads to monitor and allows an external system to take
controlled action, such as a rolling restart.

+1 for that approach



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-09-27 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182962#comment-16182962 ]

ASF GitHub Bot commented on KAFKA-5973:
---

Github user tedyu closed the pull request at:

https://github.com/apache/kafka/pull/3962




[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-10-05 Thread Guozhang Wang (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193905#comment-16193905 ]

Guozhang Wang commented on KAFKA-5973:
--

I'm in favor of both action items, i.e. 1) making a pass over the existing
threads' exception handling logic and deciding which exceptions could be
handled, which to kill itself, which to kill the whole process; and 2) adding a
metric for alive threads in categories (handler, socket receiver / sender,
replica fetcher, log cleaner) on brokers.
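
A small sketch of what such an alive-thread metric could compute - the
category-to-thread-name-prefix mapping below is illustrative rather than an
exact match for the broker's actual thread names:

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: count currently-alive threads per category by name prefix,
// suitable for publishing as per-category gauges. Prefixes are approximate.
public class AliveThreadGauge {

    private static final Map<String, String> CATEGORY_PREFIXES = Map.of(
            "request-handler", "kafka-request-handler",
            "network", "kafka-network-thread",
            "replica-fetcher", "ReplicaFetcherThread",
            "log-cleaner", "kafka-log-cleaner-thread");

    public static Map<String, Long> aliveThreadsPerCategory() {
        Set<Thread> threads = Thread.getAllStackTraces().keySet();
        return CATEGORY_PREFIXES.entrySet().stream().collect(Collectors.toMap(
                Map.Entry::getKey,
                entry -> threads.stream()
                        .filter(t -> t.isAlive() && t.getName().startsWith(entry.getValue()))
                        .count()));
    }

    public static void main(String[] args) {
        // Example: print the per-category counts; a broker would publish them
        // as gauges instead so an external monitor can alert on drops.
        aliveThreadsPerCategory().forEach((category, count) ->
                System.out.println(category + ": " + count));
    }
}
{code}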



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-10-05 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193918#comment-16193918 ]

Ted Yu commented on KAFKA-5973:
---

bq. which to kill itself, which to kill the whole process

In the description, Tom listed the threads which use {{ShutdownableThread}}. Is
any of them not in the above categories?



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-11-06 Thread Rajini Sivaram (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240958#comment-16240958 ]

Rajini Sivaram commented on KAFKA-5973:
---

[~tedyu] Are you still working on this one?



[jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

2017-11-06 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240971#comment-16240971 ]

Ted Yu commented on KAFKA-5973:
---

Currently not - there is no consensus on the solution.
