Can you update the KIP to say what the default is for max.uncleanable.partitions?
-James

Sent from my iPhone

> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
>
> Hey group,
>
> I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
>
> Best,
> Stanislav
>
> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>
>> Hey, Ray
>>
>> Thanks for pointing that out, it's fixed now.
>>
>> Best,
>> Stanislav
>>
>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
>>>
>>> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page
>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>>>
>>> I tried, but the Wiki won't let me.
>>>
>>> -Ray
>>>
>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>> Hey guys,
>>>>
>>>> @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
>>>>
>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assuming it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
>>>>
>>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
>>>>
>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect the logs.
>>>>
>>>> Hey Ray,
>>>>
>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
>>>>
>>>> I think we can do this for every exception that isn't an `IOException`. This will future-proof us against bugs in the system and potential other errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
>>>>
>>>>> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
>>>>
>>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
>>>>
>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
>>>>
>>>>> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again, which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
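For illustration, here is a minimal Java sketch of the skip-and-continue behavior being discussed: quarantine a partition on an unexpected exception, but still propagate `IOException` so the log directory can be marked offline. All class and method names are hypothetical and not taken from the Kafka codebase.

import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CleanerLoopSketch {
    // Partitions quarantined after hitting a non-IO error; never retried.
    private final Set<String> uncleanable = ConcurrentHashMap.newKeySet();

    // One pass over all compactable partitions.
    void cleanOnePass(List<String> partitions) throws IOException {
        for (String topicPartition : partitions) {
            if (uncleanable.contains(topicPartition)) {
                continue; // skip partitions already marked uncleanable
            }
            try {
                cleanPartition(topicPartition);
            } catch (IOException e) {
                // Disk-level failure: rethrow so the log directory can be
                // marked offline, as the LogCleaner already does today.
                throw e;
            } catch (Exception e) {
                // Likely a logic bug or a corrupt segment: quarantine only
                // this partition instead of killing the cleaner thread.
                uncleanable.add(topicPartition);
            }
        }
    }

    private void cleanPartition(String topicPartition) throws IOException {
        // stand-in for the real compaction work
    }
}

Under a scheme like this, a single bad partition stops being compacted, while a genuinely failing disk still takes the whole directory offline.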
>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
>>>>>
>>>>>> <<< respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
>>>>>>
>>>>>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur, then wait for 1 minute before doing it; if a third respawn is to occur, wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc., up to some max wait time).
>>>>>>
>>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
>>>>>>
>>>>>> Ron
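A minimal sketch of the capped exponential backoff Ron describes, as a hypothetical helper class (not code from the KIP or the Kafka codebase): first respawn immediate, then 1, 2, 4, 8... minutes up to a cap.

import java.time.Duration;

class RespawnBackoffSketch {
    private static final Duration MAX_WAIT = Duration.ofMinutes(32); // arbitrary cap
    private int respawnCount = 0;

    // Delay to apply before the next respawn attempt.
    Duration nextDelay() {
        respawnCount++;
        if (respawnCount <= 1) {
            return Duration.ZERO; // first respawn happens immediately
        }
        // 1 minute before the 2nd respawn, 2 before the 3rd, then 4, 8, ...
        long minutes = 1L << Math.min(respawnCount - 2, 20); // clamp the shift
        Duration delay = Duration.ofMinutes(minutes);
        return delay.compareTo(MAX_WAIT) > 0 ? MAX_WAIT : delay;
    }
}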
>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
>>>>>>
>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>
>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>>
>>>>>>>> Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms, which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread without requiring a broker restart.
>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration>
>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>
>>>>>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
>>>>>>>
>>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
>>>>>>>
>>>>>>>> Some comments:
>>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
>>>>>>>>
>>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created if the thread ever dies.
>>>>>>>
>>>>>>> This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
>>>>>>>
>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable, but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
>>>>>>>
>>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
>>>>>>>
>>>>>>> What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
>>>>>>>
>>>>>>> The way the disk failed was normally having an I/O request take a really long time and time out. So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario: if the disk appeared to work for a while but then failed, any data that had been written to that disk on that DataNode would be lost, and we would need to re-replicate it.
>>>>>>>
>>>>>>> Disks aren't biological systems -- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
>>>>>>>
>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
>>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
>>>>>>>>
>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a CSV might be difficult for monitoring systems to work with. If we *did* want the topic names to be accessible, what do you think of having the
>>>>>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
>>>>>>>
>>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
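To make the two metric shapes James contrasts concrete, a sketch using plain javax.management (Kafka itself registers metrics through the Yammer metrics library; this only shows the resulting JMX object names, and the bean classes here are hypothetical):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class UncleanableMetricsSketch {
    // Standard MBean pattern: the interface name must be the implementing
    // class name plus the "MBean" suffix.
    public interface UncleanableCountMBean {
        int getValue();
    }

    public static class UncleanableCount implements UncleanableCountMBean {
        @Override
        public int getValue() {
            return 4; // stand-in for the real uncleanable-partition count
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Flat, count-only metric: one gauge for the whole cleaner.
        server.registerMBean(new UncleanableCount(), new ObjectName(
            "kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count"));

        // Tagged variant from James's suggestion: topic and partition become
        // ObjectName key properties rather than one long list attribute.
        server.registerMBean(new UncleanableCount(), new ObjectName(
            "kafka.log:type=LogCleanerManager,topic=topic1,partition=2"));
    }
}

The tagged form trades a single very long CSV value for many small per-partition beans, which is what would let monitoring systems slice and dice.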
>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
>>>>>>>
>>>>>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.
>>>>>>>
>>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold on the number of uncleanable logs is a good idea, rather than just failing after one IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
>>>>>>>
>>>>>>>> * About marking disks as offline when exceeding a certain threshold: that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as all other partitions that are on the disk.
>>>>>>>
>>>>>>> In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112
>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>>>>>>>
>>>>>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be created on good log directory.
>>>>>>>
>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one. But it's generally consistent. Ignoring errors is not. In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).
>>>>>>>
>>>>>>> best,
>>>>>>> Colin
>>>>>>>
>>>>>>>> -James
>>>>>>>>
>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>>>>>
>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hey group,
>>>>>>>>>>
>>>>>>>>>> I created a new KIP about making log compaction more fault-tolerant. Please give it a look and share what you think, especially in regards to the points in the "Needs Discussion" paragraph.
>>>>>>>>>>
>>>>>>>>>> KIP: KIP-346
>>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best,
>>>>>>>>>> Stanislav
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best,
>>>>>>>>> Stanislav
>>
>> --
>> Best,
>> Stanislav
>
> --
> Best,
> Stanislav
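To close, a rough sketch of the threshold behavior discussed in this thread: tolerate up to `max.uncleanable.partitions` quarantined partitions per log directory, then mark the whole directory offline, as KIP-112 already does for other I/O errors. All names and structure here are hypothetical, not from the Kafka codebase.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class UncleanableThresholdSketch {
    private final int maxUncleanablePartitions; // e.g. from max.uncleanable.partitions
    private final Map<String, AtomicInteger> uncleanableByDir = new ConcurrentHashMap<>();

    UncleanableThresholdSketch(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    // Called when compaction of a partition in logDir fails with a logic error.
    void recordUncleanablePartition(String logDir) {
        int count = uncleanableByDir
                .computeIfAbsent(logDir, dir -> new AtomicInteger())
                .incrementAndGet();
        if (count > maxUncleanablePartitions) {
            markLogDirOffline(logDir);
        }
    }

    private void markLogDirOffline(String logDir) {
        // Stand-in for the KIP-112 behavior: stop serving replicas in this
        // directory; new replicas are created only on good directories.
    }
}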