Tom Crayford created KAFKA-5973:
-----------------------------------

             Summary: ShutdownableThread catching errors can lead to partial 
hard to diagnose broker failure
                 Key: KAFKA-5973
                 URL: https://issues.apache.org/jira/browse/KAFKA-5973
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.11.0.0, 0.11.0.1
            Reporter: Tom Crayford
            Priority: Minor
             Fix For: 1.0.0, 0.11.0.2


When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
uncaught exception, the broker is left running in a degraded state: some
threads are no longer running, but the broker may still be serving traffic to
users while not performing its usual operations.

This is problematic, because monitoring may say that "the broker is up and 
fine", but in fact it is not healthy.

At Heroku we've been mitigating this by monitoring all threads that "should" be
running on a broker and alerting when a given thread isn't running for some
reason.
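
As an illustration only (this is not Heroku's actual tooling), an in-process
version of such a check could look like the Java sketch below. The class name
{{ExpectedThreadCheck}} and the idea of matching on thread-name prefixes are
assumptions made for the example; the real prefixes would have to be taken
from a thread dump of the Kafka version being run.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Kafka: reports which expected thread-name
// prefixes have no live thread in the current JVM.
final class ExpectedThreadCheck {
    private ExpectedThreadCheck() {}

    static Set<String> missingPrefixes(List<String> expectedPrefixes) {
        // Start by assuming every expected prefix is missing.
        Set<String> missing = new LinkedHashSet<>(expectedPrefixes);
        // Thread.getAllStackTraces() returns a map keyed by all live threads.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive()) {
                // A live thread with this prefix means the prefix is covered.
                missing.removeIf(prefix -> t.getName().startsWith(prefix));
            }
        }
        return missing;
    }
}
```

A monitoring agent could call {{missingPrefixes}} periodically and alert when
the returned set is non-empty.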

Things that use {{ShutdownableThread}} that can crash and leave a broker/the 
controller in a bad state:
- log cleaner
- replica fetcher threads
- controller to broker send threads
- controller topic deletion threads
- quota throttling reapers
- io threads
- network threads
- group metadata management threads

Some of these can have disastrous consequences, and nearly any of them
crashing, for any reason, is cause for an alert.
But users probably shouldn't have to know about all the internals of Kafka and
run thread dumps periodically as part of normal operations.

There are a few potential options here:

1. On the crash of any {{ShutdownableThread}}, shut down the whole broker process

We could crash the whole broker when an individual thread dies. I think this is
pretty reasonable: it's better to have a very visible breakage than a very
hard-to-detect one.
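
A minimal Java sketch of this option follows. It is not Kafka's
{{ShutdownableThread}}; {{FailFastThread}}, {{doWork}}, {{initiateShutdown}}
and {{exitProcess}} are hypothetical names chosen for the example. The idea
is that the run loop hands any unexpected {{Throwable}} to a fatal handler
instead of only logging it, and the handler is injected so that it can be
replaced in tests.

```java
import java.util.function.Consumer;

// Hypothetical sketch, not Kafka's ShutdownableThread: a worker thread that
// escalates any unexpected Throwable to a process-wide fatal handler.
abstract class FailFastThread extends Thread {
    private final Consumer<Throwable> onFatal;
    private volatile boolean running = true;

    FailFastThread(String name, Consumer<Throwable> onFatal) {
        super(name);
        this.onFatal = onFatal;
    }

    // One unit of work; called repeatedly until shutdown is requested.
    protected abstract void doWork() throws Exception;

    void initiateShutdown() {
        running = false;
        interrupt();
    }

    @Override
    public void run() {
        try {
            while (running) {
                doWork();
            }
        } catch (Throwable t) {
            // Errors raised after a requested shutdown (e.g. an interrupt)
            // are expected; anything else is treated as fatal.
            if (running) {
                onFatal.accept(t);
            }
        }
    }

    // Example handler: log to stderr, then terminate the JVM with status 1.
    static Consumer<Throwable> exitProcess() {
        return t -> {
            System.err.println("Fatal error in "
                + Thread.currentThread().getName() + ": " + t);
            Runtime.getRuntime().halt(1);
        };
    }
}
```

The example handler uses {{Runtime.halt}}, which skips shutdown hooks. Whether
a broker should halt immediately or attempt an orderly shutdown first is a
design decision this sketch does not settle.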

2. Add some healthcheck JMX bean to detect these thread crashes

Users having to audit all of Kafka's source code on each new release and track
a list of "threads that should be running" is... pretty silly. We could instead
expose a JMX bean of some kind indicating threads that died due to uncaught
exceptions.
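
One possible shape for such a bean, sketched in Java with the standard
{{javax.management}} API. Everything Kafka-specific here is an assumption:
{{ThreadHealth}}, {{DeadThreads}}, {{recordDeath}} and the attribute names are
made up for the example, and the object name is supplied by the caller.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical sketch; none of these names exist in Kafka.
final class ThreadHealth {
    private ThreadHealth() {}

    // Standard MBean interface: JMX expects it to be public and to be
    // named after the implementing class plus the suffix "MBean".
    public interface DeadThreadsMBean {
        int getDeadThreadCount();
        String[] getDeadThreadNames();
    }

    public static final class DeadThreads implements DeadThreadsMBean {
        // Thread name -> description of the exception that killed it.
        private final Map<String, String> dead = new ConcurrentHashMap<>();

        // Intended to be called from the catch block of a worker's run loop.
        public void recordDeath(String threadName, Throwable cause) {
            dead.put(threadName, String.valueOf(cause));
        }

        @Override
        public int getDeadThreadCount() {
            return dead.size();
        }

        @Override
        public String[] getDeadThreadNames() {
            return dead.keySet().stream().sorted().toArray(String[]::new);
        }
    }

    // Registers a new bean on the platform MBean server under the given name.
    static DeadThreads register(String objectName) throws Exception {
        DeadThreads bean = new DeadThreads();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(bean, new ObjectName(objectName));
        return bean;
    }
}
```

With something like this in place, a monitoring system would only need to
alert when {{DeadThreadCount}} is greater than zero, without knowing which
threads a given Kafka release is supposed to run.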

3. Do nothing, but add documentation around monitoring/logging that exposes 
this error

These thread deaths *do* emit log lines, but it's not clear or obvious to
users that they need to monitor and alert on them. The project could add
documentation.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
