Hey group, I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
Best,
Stanislav

On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey, Ray
>
> Thanks for pointing that out, it's fixed now
>
> Best,
> Stanislav
>
> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
>
>> Thanks. Can you fix the link in the "KIPs under discussion" table on
>> the main KIP landing page
>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>>
>> I tried, but the Wiki won't let me.
>>
>> -Ray
>>
>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>> > Hey guys,
>> >
>> > @Colin - good point. I added some sentences mentioning recent
>> > improvements in the introductory section.
>> >
>> > *Disk Failure* - I tend to agree with what Colin said - once a disk
>> > fails, you don't want to work with it again. As such, I've changed my
>> > mind and believe that we should mark the LogDir (assuming it's a disk)
>> > as offline on the first `IOException` encountered. This is the
>> > LogCleaner's current behavior. We shouldn't change that.
>> >
>> > *Respawning Threads* - I believe we should never re-spawn a thread.
>> > The correct approach in my mind is to either have it stay dead or
>> > never let it die in the first place.
>> >
>> > *Uncleanable-partition-names metric* - Colin is right, this metric is
>> > unneeded. Users can monitor the `uncleanable-partitions-count` metric
>> > and inspect logs.
>> >
>> > Hey Ray,
>> >
>> >> 2) I'm 100% with James in agreement with setting up the LogCleaner to
>> >> skip over problematic partitions instead of dying.
>> >
>> > I think we can do this for every exception that isn't `IOException`.
>> > This will future-proof us against bugs in the system and potential
>> > other errors. Protecting yourself against unexpected failures is
>> > always a good thing in my mind, but I also think that protecting
>> > yourself against bugs in the software is sort of clunky. What does
>> > everybody think about this?
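To make the skip-on-unexpected-exception idea concrete, here is a rough sketch of what "quarantine the partition instead of killing the cleaner" could look like. All class, method, and attribute names below are illustrative only, not the actual LogCleaner internals:

```python
# Sketch: skip problematic partitions instead of dying.
# Hypothetical names -- not real Kafka code.

class CleanerSketch:
    def __init__(self):
        self.uncleanable_partitions = set()

    def clean_once(self, partitions, clean_fn):
        """Try to clean each partition; quarantine ones that throw
        unexpected (non-I/O) errors instead of killing the thread."""
        for tp in partitions:
            if tp in self.uncleanable_partitions:
                continue  # already quarantined, skip it
            try:
                clean_fn(tp)
            except IOError:
                # Disk-level problem: propagate so the log dir can be
                # marked offline (the current broker behavior).
                raise
            except Exception:
                # Unexpected logic error: quarantine just this partition
                # and keep the cleaner thread alive.
                self.uncleanable_partitions.add(tp)

    @property
    def uncleanable_partition_count(self):
        return len(self.uncleanable_partitions)
```

The key design point being debated: `IOError` re-raises (disks don't heal), while everything else only disables the single offending partition.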
>> >
>> >> 4) The only improvement I can think of is that if such an
>> >> error occurs, then have the option (configuration setting?) to
>> >> create a <log_segment>.skip file (or something similar).
>> >
>> > This is a good suggestion. Have others also seen corruption be
>> > generally tied to the same segment?
>> >
>> > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
>> >
>> >> For the cleaner thread specifically, I do not think respawning will
>> >> help at all because we are more than likely to run into the same
>> >> issue again which would end up crashing the cleaner. Retrying makes
>> >> sense for transient errors or when you believe some part of the
>> >> system could have healed itself, both of which I think are not true
>> >> for the log cleaner.
>> >>
>> >> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
>> >>
>> >>> <<<respawning threads is likely to make things worse, by putting
>> >>> you in an infinite loop which consumes resources and fires off
>> >>> continuous log messages.
>> >>>
>> >>> Hi Colin. In case it could be relevant, one way to mitigate this
>> >>> effect is to implement a backoff mechanism (if a second respawn is
>> >>> to occur then wait for 1 minute before doing it; then if a third
>> >>> respawn is to occur wait for 2 minutes before doing it; then 4
>> >>> minutes, 8 minutes, etc. up to some max wait time).
>> >>>
>> >>> I have no opinion on whether respawn is appropriate or not in this
>> >>> context, but a mitigation like the increasing backoff described
>> >>> above may be relevant in weighing the pros and cons.
>> >>>
>> >>> Ron
>> >>>
>> >>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
>> >>>
>> >>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>> >>>>> Hi Stanislav! Thanks for this KIP!
>> >>>>>
>> >>>>> I agree that it would be good if the LogCleaner were more
>> >>>>> tolerant of errors.
>> >>>>> Currently, as you said, once it dies, it stays dead.
>> >>>>>
>> >>>>> Things are better now than they used to be. We have the metric
>> >>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>> >>>>> which we can use to tell us if the threads are dead. And as of
>> >>>>> 1.1.0, we have KIP-226, which allows you to restart the log
>> >>>>> cleaner thread without requiring a broker restart:
>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>> >>>>> I've only read about this, I haven't personally tried it.
>> >>>>
>> >>>> Thanks for pointing this out, James! Stanislav, we should probably
>> >>>> add a sentence or two mentioning the KIP-226 changes somewhere in
>> >>>> the KIP. Maybe in the intro section?
>> >>>>
>> >>>> I think it's clear that requiring the users to manually restart
>> >>>> the log cleaner is not a very good solution. But it's good to know
>> >>>> that it's a possibility on some older releases.
>> >>>>
>> >>>>> Some comments:
>> >>>>> * I like the idea of having the log cleaner continue to clean as
>> >>>>> many partitions as it can, skipping over the problematic ones if
>> >>>>> possible.
>> >>>>>
>> >>>>> * If the log cleaner thread dies, I think it should automatically
>> >>>>> be revived. Your KIP attempts to do that by catching exceptions
>> >>>>> during execution, but I think we should go all the way and make
>> >>>>> sure that a new one gets created if the thread ever dies.
>> >>>>
>> >>>> This is inconsistent with the way the rest of Kafka works. We
>> >>>> don't automatically re-create other threads in the broker if they
>> >>>> terminate.
>> >>>> In general, if there is a serious bug in the code, respawning
>> >>>> threads is likely to make things worse, by putting you in an
>> >>>> infinite loop which consumes resources and fires off continuous
>> >>>> log messages.
>> >>>>
>> >>>>> * It might be worth trying to re-clean the uncleanable
>> >>>>> partitions. I've seen cases where an uncleanable partition later
>> >>>>> became cleanable. I unfortunately don't remember how that
>> >>>>> happened, but I remember being surprised when I discovered it. It
>> >>>>> might have been something like a follower was uncleanable but
>> >>>>> after a leader election happened, the log truncated and it was
>> >>>>> then cleanable again. I'm not sure.
>> >>>>
>> >>>> James, I disagree. We had this behavior in the Hadoop Distributed
>> >>>> File System (HDFS) and it was a constant source of user problems.
>> >>>>
>> >>>> What would happen is disks would just go bad over time. The
>> >>>> DataNode would notice this and take them offline. But then, due to
>> >>>> some "optimistic" code, the DataNode would periodically try to
>> >>>> re-add them to the system. Then one of two things would happen:
>> >>>> the disk would just fail immediately again, or it would appear to
>> >>>> work and then fail after a short amount of time.
>> >>>>
>> >>>> The way the disk failed was normally having an I/O request take a
>> >>>> really long time and time out. So a bunch of request handler
>> >>>> threads would basically slam into a brick wall when they tried to
>> >>>> access the bad disk, slowing the DataNode to a crawl. It was even
>> >>>> worse in the second scenario: if the disk appeared to work for a
>> >>>> while but then failed, any data that had been written on that
>> >>>> DataNode to that disk would be lost, and we would need to
>> >>>> re-replicate it.
>> >>>>
>> >>>> Disks aren't biological systems -- they don't heal over time. Once
>> >>>> they're bad, they stay bad.
>> >>>> The log cleaner needs to be robust against cases where the disk
>> >>>> really is failing, and really is returning bad data or timing out.
>> >>>>
>> >>>>> * For your metrics, can you spell out the full metric in
>> >>>>> JMX-style format, such as:
>> >>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>> >>>>> value=4
>> >>>>>
>> >>>>> * For "uncleanable-partitions": topic-partition names can be very
>> >>>>> long. I think the current max size is 210 characters (or maybe
>> >>>>> 240-ish?). Having "uncleanable-partitions" be a list could make
>> >>>>> for a very large metric. Also, having the metric come out as a
>> >>>>> CSV might be difficult to work with for monitoring systems. If we
>> >>>>> *did* want the topic names to be accessible, what do you think of
>> >>>>> having something like
>> >>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>> >>>>> I'm not sure if LogCleanerManager is the right type, but my
>> >>>>> example was that the topic and partition can be tags in the
>> >>>>> metric. That will allow monitoring systems to more easily slice
>> >>>>> and dice the metric. I'm not sure what the attribute for that
>> >>>>> metric would be. Maybe something like "uncleaned bytes" for that
>> >>>>> topic-partition? Or time-since-last-clean? Or maybe even just
>> >>>>> "Value=1".
>> >>>>
>> >>>> I haven't thought about this that hard, but do we really need the
>> >>>> uncleanable topic names to be accessible through a metric? It
>> >>>> seems like the admin should notice that uncleanable partitions are
>> >>>> present, and then check the logs?
>> >>>>
>> >>>>> * About `max.uncleanable.partitions`, you said that this likely
>> >>>>> indicates that the disk is having problems. I'm not sure that is
>> >>>>> the case. For the 4 JIRAs that you mentioned about log cleaner
>> >>>>> problems, all of them are partition-level scenarios that happened
>> >>>>> during normal operation.
>> >>>>> None of them were indicative of disk problems.
>> >>>>
>> >>>> I don't think this is a meaningful comparison. In general, we
>> >>>> don't accept JIRAs for hard disk problems that happen on a
>> >>>> particular cluster. If someone opened a JIRA that said "my hard
>> >>>> disk is having problems" we could close that as "not a Kafka bug."
>> >>>> This doesn't prove that disk problems don't happen, but just that
>> >>>> JIRA isn't the right place for them.
>> >>>>
>> >>>> I do agree that the log cleaner has had a significant number of
>> >>>> logic bugs, and that we need to be careful to limit their impact.
>> >>>> That's one reason why I think that a threshold of "number of
>> >>>> uncleanable logs" is a good idea, rather than just failing after
>> >>>> one IOException. In all the cases I've seen where a user hit a
>> >>>> logic bug in the log cleaner, it was just one partition that had
>> >>>> the issue. We also should increase test coverage for the log
>> >>>> cleaner.
>> >>>>
>> >>>>> * About marking disks as offline when exceeding a certain
>> >>>>> threshold, that actually increases the blast radius of log
>> >>>>> compaction failures. Currently, the uncleaned partitions are
>> >>>>> still readable and writable. Taking the disks offline would
>> >>>>> impact availability of the uncleanable partitions, as well as
>> >>>>> impact all other partitions that are on the disk.
>> >>>>
>> >>>> In general, when we encounter I/O errors, we take the disk
>> >>>> partition offline. This is spelled out in KIP-112
>> >>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>> >>>>
>> >>>>> - Broker assumes a log directory to be good after it starts, and
>> >>>>> mark log directory as bad once there is IOException when broker
>> >>>>> attempts to access (i.e. read or write) the log directory.
>> >>>>> - Broker will be offline if all log directories are bad.
>> >>>>> - Broker will stop serving replicas in any bad log directory.
>> >>>>> New replicas will only be created on good log directory.
>> >>>>
>> >>>> The behavior Stanislav is proposing for the log cleaner is
>> >>>> actually more optimistic than what we do for regular broker I/O,
>> >>>> since we will tolerate multiple IOExceptions, not just one. But
>> >>>> it's generally consistent. Ignoring errors is not. In any case, if
>> >>>> you want to tolerate an unlimited number of I/O errors, you can
>> >>>> just set the threshold to an infinite value (although I think that
>> >>>> would be a bad idea).
>> >>>>
>> >>>> best,
>> >>>> Colin
>> >>>>
>> >>>>> -James
>> >>>>>
>> >>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
>> >>>>>>
>> >>>>>> I renamed the KIP and that changed the link. Sorry about that.
>> >>>>>> Here is the new link:
>> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>> >>>>>>
>> >>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>> >>>>>>
>> >>>>>>> Hey group,
>> >>>>>>>
>> >>>>>>> I created a new KIP about making log compaction more
>> >>>>>>> fault-tolerant. Please give it a look here and please share
>> >>>>>>> what you think, especially in regards to the points in the
>> >>>>>>> "Needs Discussion" paragraph.
>> >>>>>>>
>> >>>>>>> KIP: KIP-346
>> >>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best,
>> >>>>>>> Stanislav
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best,
>> >>>>>> Stanislav
>
> --
> Best,
> Stanislav

--
Best,
Stanislav
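P.S. For concreteness, the capped exponential backoff Ron describes upthread (wait 1 minute before a second respawn, 2 before a third, then 4, 8, and so on, up to some max) could be sketched like this. The function name and the 60-minute cap are illustrative assumptions, not anything from the KIP:

```python
def respawn_backoff_minutes(attempt, max_wait_minutes=60):
    """Minutes to wait before respawn number `attempt` (1-based).

    The first spawn is immediate; the second waits 1 minute, and the
    delay doubles each time (1, 2, 4, 8, ...) up to `max_wait_minutes`.
    Illustrative sketch only -- not actual Kafka code.
    """
    if attempt <= 1:
        return 0
    return min(2 ** (attempt - 2), max_wait_minutes)
```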