Hey Stanislav,

Sorry, I was probably looking at an older version (I had the tab open for
so long!).

I have been thinking about `max.uncleanable.partitions` and wondering if
it's what we really want. The main risk if the cleaner cannot clean a
partition is eventually running out of disk space. This is the most common
problem we have seen with cleaner failures and it can happen even if there
is just one uncleanable partition. We've actually seen cases in which a
single __consumer_offsets partition grew large enough to fill a significant
portion of the disk. The problem with letting the system run out of disk
space before failing is that it makes recovery difficult and time consuming.
Clean shutdown, for example, requires writing some state to disk. Without a
clean shutdown, it can take the broker significantly longer to start up
because it has to do more segment recovery.

For this problem, `max.uncleanable.partitions` does not really help. You
can set it to 1 and fail fast, but that is not much better than the
existing state. You had a suggestion previously in the KIP to use the size
of uncleanable disk space instead. What was the reason for rejecting that?
Intuitively, it seems like a better fit for a cleaner failure. It would
provide users some time to react to failures while still protecting them
from exhausting the disk.
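
To make that concrete, here is a rough sketch of what a size-based threshold
could look like (purely illustrative; the class, the byte budget, and the
markLogDirOffline helper are all hypothetical names, not something the KIP
proposes):

    // Hypothetical sketch of a size-based threshold for cleaner failures.
    // Rather than counting uncleanable partitions, track how many bytes in a
    // log directory can no longer be compacted, and only fail the directory
    // once that exceeds a budget. Operators get time to react before the
    // disk actually fills up.
    class CleanerFailureTracker(maxUncleanableBytes: Long,
                                markLogDirOffline: String => Unit) {
      private val uncleanableBytesPerDir =
        scala.collection.mutable.Map.empty[String, scala.collection.mutable.Map[String, Long]]

      def recordFailure(logDir: String, topicPartition: String, logSizeBytes: Long): Unit =
        synchronized {
          val perPartition =
            uncleanableBytesPerDir.getOrElseUpdate(logDir, scala.collection.mutable.Map.empty)
          perPartition(topicPartition) = logSizeBytes
          if (perPartition.values.sum > maxUncleanableBytes)
            markLogDirOffline(logDir)
        }
    }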

Thanks,
Jason




On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <stanis...@confluent.io>
wrote:

> Hey Jason,
>
> 1. *10* is the default value, it says so in the KIP
> 2. This is a good catch. As the current implementation stands, it's not a
> useful metric since the thread continues to run even if all log directories
> are offline (although I'm not sure what the broker's behavior is in that
> scenario). I'll make sure the thread stops if all log directories are
> offline.
>
> I don't know which "Needs Discussion" item you're referencing; there hasn't
> been any in the KIP since August 1, and that was for the metric only. KIP
> History:
> <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
>
> I've updated the KIP to mention the "time-since-last-run" metric.
>
> Thanks,
> Stanislav
>
> On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Stanislav,
> >
> > Just a couple quick questions:
> >
> > 1. I may have missed it, but what will be the default value for
> > `max.uncleanable.partitions`?
> > 2. It seems there will be some impact for users that are monitoring
> > "time-since-last-run-ms" in order to detect cleaner failures. Not sure
> > it's a major concern, but probably worth mentioning in the compatibility
> > section. Also, is this still a useful metric after this KIP?
> >
> > Also, maybe the "Needs Discussion" item can be moved to rejected
> > alternatives since you've moved to a vote? I think leaving this for
> > potential future work is reasonable.
> >
> > Thanks,
> > Jason
> >
> >
> > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rchi...@apache.org> wrote:
> >
> > > I'm okay with that.
> > >
> > > -Ray
> > >
> > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > >
> > >> Perhaps we could start with max.uncleanable.partitions and then
> > >> implement max.uncleanable.partitions.per.logdir in a follow-up change
> > >> if it seemed to be necessary?  What do you think?
> > >>
> > >> regards,
> > >> Colin
> > >>
> > >>
> > >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > >>
> > >>> Hey Ray,
> > >>>
> > >>> Thanks for the explanation. In regards to the configuration property -
> > >>> I'm not sure. As long as it has sufficient documentation, I find
> > >>> "max.uncleanable.partitions" to be okay. If we were to add the
> > >>> distinction explicitly, maybe it should be
> > >>> `max.uncleanable.partitions.per.logdir`?
> > >>>
> > >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:
> > >>>
> > >>> One more thing occurred to me.  Should the configuration property be
> > >>>> named "max.uncleanable.partitions.per.disk" instead?
> > >>>>
> > >>>> -Ray
> > >>>>
> > >>>>
> > >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > >>>>
> > >>>>> Yes, good catch. Thank you, James!
> > >>>>>
> > >>>>> Best,
> > >>>>> Stanislav
> > >>>>>
> > >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>> Can you update the KIP to say what the default is for
> > >>>>>> max.uncleanable.partitions?
> > >>>>>>
> > >>>>>> -James
> > >>>>>>
> > >>>>>> Sent from my iPhone
> > >>>>>>
> > >>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>>>>
> > >>>>>>> Hey group,
> > >>>>>>>
> > >>>>>>> I am planning on starting a voting thread tomorrow. Please do
> > >>>>>>> reply if you feel there is anything left to discuss.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Stanislav
> > >>>>>>>
> > >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>>>>>
> > >>>>>>> Hey, Ray
> > >>>>>>>>
> > >>>>>>>> Thanks for pointing that out, it's fixed now
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Stanislav
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> > >>>>
> > >>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table
> > >>>>>>>>> on the main KIP landing page
> > >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> > >>>>>>>>>
> > >>>>>>>>> I tried, but the Wiki won't let me.
> > >>>>>>>>>
> > >>>>>>>>> -Ray
> > >>>>>>>>>
> > >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > >>>>>>>>>> Hey guys,
> > >>>>>>>>>>
> > >>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
> > >>>>>>>>>> improvements in the introductory section.
> > >>>>>>>>>>
> > >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a
> > >>>>>>>>>> disk fails, you don't want to work with it again. As such, I've
> > >>>>>>>>>> changed my mind and believe that we should mark the LogDir
> > >>>>>>>>>> (assume it's a disk) as offline on the first `IOException`
> > >>>>>>>>>> encountered. This is the LogCleaner's current behavior. We
> > >>>>>>>>>> shouldn't change that.
> > >>>>>>>>>>
> > >>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a
> > >>>>>>>>>> thread. The correct approach in my mind is to either have it
> > >>>>>>>>>> stay dead or never let it die in the first place.
> > >>>>>>>>>>
> > >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this
> > >>>>>>>>>> metric is unneeded. Users can monitor the
> > >>>>>>>>>> `uncleanable-partitions-count` metric and inspect logs.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Hey Ray,
> > >>>>>>>>>>
> > >>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
> > >>>>>>>>>>> LogCleaner to skip over problematic partitions instead of dying.
> > >>>>>>>>>>
> > >>>>>>>>>> I think we can do this for every exception that isn't
> > >>>>>>>>>> `IOException`. This will future-proof us against bugs in the
> > >>>>>>>>>> system and potential other errors.
> > >>>>>>>>>>
> > >>>>>>>>>> Protecting yourself against unexpected failures is always a good
> > >>>>>>>>>> thing in my mind, but I also think that protecting yourself
> > >>>>>>>>>> against bugs in the software is sort of clunky. What does
> > >>>>>>>>>> everybody think about this?
> > >>>>>>>>>>
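
A minimal sketch of the behavior being discussed above, for illustration only
(the helpers passed in here are hypothetical, not actual LogCleaner methods):

    import java.io.IOException

    // Sketch: treat IOException as a disk-level problem (existing behavior),
    // but record any other exception and keep cleaning the remaining partitions.
    def cleanOnePartition(partition: String,
                          clean: String => Unit,
                          markLogDirOffline: (String, Throwable) => Unit,
                          markPartitionUncleanable: (String, Throwable) => Unit): Unit =
      try clean(partition)
      catch {
        case e: IOException => markLogDirOffline(partition, e)
        case e: Exception   => markPartitionUncleanable(partition, e)
      }
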
> > >>>>>>>>>>> 4) The only improvement I can think of is that if such an
> > >>>>>>>>>>> error occurs, then have the option (configuration setting?) to
> > >>>>>>>>>>> create a <log_segment>.skip file (or something similar).
> > >>>>>>>>>>
> > >>>>>>>>>> This is a good suggestion. Have others also seen corruption be
> > >>>>>>>>>> generally tied to the same segment?
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> > >>>>>>>>>
> > >>>>>>>>>>> For the cleaner thread specifically, I do not think respawning
> > >>>>>>>>>>> will help at all because we are more than likely to run into
> > >>>>>>>>>>> the same issue again which would end up crashing the cleaner.
> > >>>>>>>>>>> Retrying makes sense for transient errors or when you believe
> > >>>>>>>>>>> some part of the system could have healed itself, both of which
> > >>>>>>>>>>> I think are not true for the log cleaner.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>>>> <<<respawning threads is likely to make things worse, by
> > >>>>>>>>>>>> putting you in an infinite loop which consumes resources and
> > >>>>>>>>>>>> fires off continuous log messages.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
> > >>>>>>>>>>>> this effect is to implement a backoff mechanism (if a second
> > >>>>>>>>>>>> respawn is to occur then wait for 1 minute before doing it;
> > >>>>>>>>>>>> then if a third respawn is to occur wait for 2 minutes before
> > >>>>>>>>>>>> doing it; then 4 minutes, 8 minutes, etc. up to some max wait
> > >>>>>>>>>>>> time).
> > >>>>>>>>>>>>
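
For illustration, the capped exponential backoff described above could look
roughly like this (a sketch of the example numbers from this email, not
something proposed in the KIP; the respawn hook itself is hypothetical):

    // Sketch: wait 1 minute before the second respawn, 2 minutes before the
    // third, then 4, 8, ... minutes, up to some maximum wait time.
    def backoffBeforeRespawn(respawnNumber: Int,
                             baseWaitMs: Long = 60L * 1000,
                             maxWaitMs: Long = 30L * 60 * 1000): Unit =
      if (respawnNumber >= 2) {
        val exponent = math.min(respawnNumber - 2, 30) // cap the shift to avoid overflow
        Thread.sleep(math.min(maxWaitMs, baseWaitMs << exponent))
      }
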
> > >>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in
> > >>>>>>>>>>>> this context, but a mitigation like the increasing backoff
> > >>>>>>>>>>>> described above may be relevant in weighing the pros and cons.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Ron
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > >>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
> > >>>>>>>>>>>>>> tolerant of errors. Currently, as you said, once it dies, it
> > >>>>>>>>>>>>>> stays dead.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Things are better now than they used to be. We have the metric
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And as of
> > >>>>>>>>>>>>>> 1.1.0, we have KIP-226, which allows you to restart the log
> > >>>>>>>>>>>>>> cleaner thread, without requiring a broker restart.
> > >>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration>
> > >>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> > >>>>>>>>>>>>> probably add a sentence or two mentioning the KIP-226 changes
> > >>>>>>>>>>>>> somewhere in the KIP.  Maybe in the intro section?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I think it's clear that requiring the users to manually
> > >>>>>>>>>>>>> restart the log cleaner is not a very good solution.  But
> > >>>>>>>>>>>>> it's good to know that it's a possibility on some older
> > >>>>>>>>>>>>> releases.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Some comments:
> > >>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to
> > >>>>>>>>>>>>>> clean as many partitions as it can, skipping over the
> > >>>>>>>>>>>>>> problematic ones if possible.
> > >>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should
> > >>>>>>>>>>>>>> automatically be revived. Your KIP attempts to do that by
> > >>>>>>>>>>>>>> catching exceptions during execution, but I think we should
> > >>>>>>>>>>>>>> go all the way and make sure that a new one gets created, if
> > >>>>>>>>>>>>>> the thread ever dies.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.
> > >>>>>>>>>>>>> We don't automatically re-create other threads in the broker
> > >>>>>>>>>>>>> if they terminate.  In general, if there is a serious bug in
> > >>>>>>>>>>>>> the code, respawning threads is likely to make things worse,
> > >>>>>>>>>>>>> by putting you in an infinite loop which consumes resources
> > >>>>>>>>>>>>> and fires off continuous log messages.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> > >>>>>>>>>>>>>> partitions. I've seen cases where an uncleanable partition
> > >>>>>>>>>>>>>> later became cleanable. I unfortunately don't remember how
> > >>>>>>>>>>>>>> that happened, but I remember being surprised when I
> > >>>>>>>>>>>>>> discovered it. It might have been something like a follower
> > >>>>>>>>>>>>>> was uncleanable but after a leader election happened, the
> > >>>>>>>>>>>>>> log truncated and it was then cleanable again. I'm not sure.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
> > >>>>>>>>>>>>> Distributed File System (HDFS) and it was a constant source
> > >>>>>>>>>>>>> of user problems.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> What would happen is disks would just go bad over time.  The
> > >>>>>>>>>>>>> DataNode would notice this and take them offline.  But then,
> > >>>>>>>>>>>>> due to some "optimistic" code, the DataNode would periodically
> > >>>>>>>>>>>>> try to re-add them to the system.  Then one of two things
> > >>>>>>>>>>>>> would happen: the disk would just fail immediately again, or
> > >>>>>>>>>>>>> it would appear to work and then fail after a short amount of
> > >>>>>>>>>>>>> time.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The way the disk failed was normally having an I/O request
> > >>>>>>>>>>>>> take a really long time and time out.  So a bunch of request
> > >>>>>>>>>>>>> handler threads would basically slam into a brick wall when
> > >>>>>>>>>>>>> they tried to access the bad disk, slowing the DataNode to a
> > >>>>>>>>>>>>> crawl.  It was even worse in the second scenario, where the
> > >>>>>>>>>>>>> disk appeared to work for a while but then failed.  Any data
> > >>>>>>>>>>>>> that had been written on that DataNode to that disk would be
> > >>>>>>>>>>>>> lost, and we would need to re-replicate it.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over time.
> > >>>>>>>>>>>>> Once they're bad, they stay bad.  The log cleaner needs to be
> > >>>>>>>>>>>>> robust against cases where the disk really is failing, and
> > >>>>>>>>>>>>> really is returning bad data or timing out.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
> > >>>>>>>>>>>>>> JMX-style format, such as:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > >>>>>>>>>>>>>>     value=4
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
> > >>>>>>>>>>>>>> very long. I think the current max size is 210 characters (or
> > >>>>>>>>>>>>>> maybe 240-ish?). Having "uncleanable-partitions" be a list
> > >>>>>>>>>>>>>> could be a very large metric. Also, having the metric come
> > >>>>>>>>>>>>>> out as a csv might be difficult to work with for monitoring
> > >>>>>>>>>>>>>> systems. If we *did* want the topic names to be accessible,
> > >>>>>>>>>>>>>> what do you think of having the
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> > >>>>>>>>>>>>>> example was that the topic and partition can be tags in the
> > >>>>>>>>>>>>>> metric. That will allow monitoring systems to more easily
> > >>>>>>>>>>>>>> slice and dice the metric. I'm not sure what the attribute
> > >>>>>>>>>>>>>> for that metric would be. Maybe something like "uncleaned
> > >>>>>>>>>>>>>> bytes" for that topic-partition? Or time-since-last-clean?
> > >>>>>>>>>>>>>> Or maybe even just "Value=1".
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> I haven't thought about this that hard, but do we really need
> > >>>>>>>>>>>>> the uncleanable topic names to be accessible through a metric?
> > >>>>>>>>>>>>> It seems like the admin should notice that uncleanable
> > >>>>>>>>>>>>> partitions are present, and then check the logs?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this
> > >>>>>>>>>>>>>> likely indicates that the disk is having problems. I'm not
> > >>>>>>>>>>>>>> sure that is the case. For the 4 JIRAs that you mentioned
> > >>>>>>>>>>>>>> about log cleaner problems, all of them are partition-level
> > >>>>>>>>>>>>>> scenarios that happened during normal operation. None of
> > >>>>>>>>>>>>>> them were indicative of disk problems.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general,
> > >>>>>>>>>>>>> we don't accept JIRAs for hard disk problems that happen on a
> > >>>>>>>>>>>>> particular cluster.  If someone opened a JIRA that said "my
> > >>>>>>>>>>>>> hard disk is having problems" we could close that as "not a
> > >>>>>>>>>>>>> Kafka bug."  This doesn't prove that disk problems don't
> > >>>>>>>>>>>>> happen, but just that JIRA isn't the right place for them.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I do agree that the log cleaner has had a significant number
> > >>>>>>>>>>>>> of logic bugs, and that we need to be careful to limit their
> > >>>>>>>>>>>>> impact.  That's one reason why I think that a threshold of
> > >>>>>>>>>>>>> "number of uncleanable logs" is a good idea, rather than just
> > >>>>>>>>>>>>> failing after one IOException.  In all the cases I've seen
> > >>>>>>>>>>>>> where a user hit a logic bug in the log cleaner, it was just
> > >>>>>>>>>>>>> one partition that had the issue.  We also should increase
> > >>>>>>>>>>>>> test coverage for the log cleaner.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
> > >>>>>>>>>>>>>> threshold, that actually increases the blast radius of log
> > >>>>>>>>>>>>>> compaction failures. Currently, the uncleaned partitions are
> > >>>>>>>>>>>>>> still readable and writable. Taking the disks offline would
> > >>>>>>>>>>>>>> impact availability of the uncleanable partitions, as well
> > >>>>>>>>>>>>>> as impact all other partitions that are on the disk.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
> > >>>>>>>>>>>>> partition offline.  This is spelled out in KIP-112
> > >>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD) :
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - Broker assumes a log directory to be good after it starts,
> > >>>>>>>>>>>>>> and mark log directory as bad once there is IOException when
> > >>>>>>>>>>>>>> broker attempts to access (i.e. read or write) the log
> > >>>>>>>>>>>>>> directory.
> > >>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
> > >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log directory.
> > >>>>>>>>>>>>> New
> > >>>
> > >>>
> >
>
>
> --
> Best,
> Stanislav
>
