Hi Stanislav,

Just a couple quick questions:

1. I may have missed it, but what will be the default value for
`max.uncleanable.partitions`?
2. It seems there will be some impact for users that monitor
"time-since-last-run-ms" in order to detect cleaner failures (see the sketch
below). Not sure it's a major concern, but it's probably worth mentioning in
the compatibility section. Also, is this still a useful metric after this KIP?
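
To make that impact concrete, here is a minimal Java sketch of the kind of
check such users may be running today. The JMX port, the "Value" attribute
name, and the 15-minute threshold are assumptions for illustration only:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CleanerLivenessCheck {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on localhost:9999.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                ObjectName lastRun = new ObjectName(
                        "kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
                long millis = ((Number) mbsc.getAttribute(lastRun, "Value")).longValue();
                // Hypothetical alerting rule: complain if no cleaner run in 15 minutes.
                if (millis > 15 * 60 * 1000L) {
                    System.out.println("Log cleaner may be dead or stuck: " + millis + " ms");
                }
            } finally {
                connector.close();
            }
        }
    }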

Also, maybe the "Needs Discussion" item can be moved to rejected
alternatives since you've moved to a vote? I think leaving this for
potential future work is reasonable.

Thanks,
Jason


On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rchi...@apache.org> wrote:

> I'm okay with that.
>
> -Ray
>
> On 8/6/18 10:59 AM, Colin McCabe wrote:
>
>> Perhaps we could start with max.uncleanable.partitions and then implement
>> max.uncleanable.partitions.per.logdir in a follow-up change if it seemed
>> to be necessary?  What do you think?
>>
>> regards,
>> Colin
>>
>>
>> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
>>
>>> Hey Ray,
>>>
>>> Thanks for the explanation. In regards to the configuration property - I'm
>>> not sure. As long as it has sufficient documentation, I find
>>> "max.uncleanable.partitions" to be okay. If we were to add the distinction
>>> explicitly, maybe it should be `max.uncleanable.partitions.per.logdir` ?
>>>
>>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:
>>>
>>>> One more thing occurred to me.  Should the configuration property be
>>>> named "max.uncleanable.partitions.per.disk" instead?
>>>>
>>>> -Ray
>>>>
>>>>
>>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
>>>>
>>>>> Yes, good catch. Thank you, James!
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:
>>>>>
>>>>>> Can you update the KIP to say what the default is for
>>>>>> max.uncleanable.partitions?
>>>>>>
>>>>>> -James
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
>>>>>>
>>>>>>> Hey group,
>>>>>>>
>>>>>>> I am planning on starting a voting thread tomorrow. Please do reply if
>>>>>>> you feel there is anything left to discuss.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stanislav
>>>>>>>
>>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Hey, Ray
>>>>>>>>
>>>>>>>> Thanks for pointing that out, it's fixed now
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stanislav
>>>>>>>>
>>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
>>>>>>>>> the main KIP landing page
>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>>>>>>>>>
>>>>>>>>> I tried, but the Wiki won't let me.
>>>>>>>>>
>>>>>>>>> -Ray
>>>>>>>>>
>>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>>>>>>>> Hey guys,
>>>>>>>>>>
>>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
>>>>>>>>>> improvements in the introductory section.
>>>>>>>>>>
>>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>>>>>>>>>> fails, you don't want to work with it again. As such, I've changed my
>>>>>>>>>> mind and believe that we should mark the LogDir (assume it's a disk)
>>>>>>>>>> as offline on the first `IOException` encountered. This is the
>>>>>>>>>> LogCleaner's current behavior. We shouldn't change that.
>>>>>>>>>>
>>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
>>>>>>>>>> The correct approach in my mind is to either have it stay dead or
>>>>>>>>>> never let it die in the first place.
>>>>>>>>>>
>>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
>>>>>>>>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
>>>>>>>>>> and inspect logs.
>>>>>>>>>>
>>>>>>>>>> Hey Ray,
>>>>>>>>>>
>>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner
>>>>>>>>>>> to skip over problematic partitions instead of dying.
>>>>>>>>>>
>>>>>>>>>> I think we can do this for every exception that isn't `IOException`.
>>>>>>>>>> This will future-proof us against bugs in the system and potential
>>>>>>>>>> other errors. Protecting yourself against unexpected failures is
>>>>>>>>>> always a good thing in my mind, but I also think that protecting
>>>>>>>>>> yourself against bugs in the software is sort of clunky. What does
>>>>>>>>>> everybody think about this?
>>>>>>>>>>
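>>>>>>>>>> To illustrate the idea, here is a rough Java sketch (the class and
>>>>>>>>>> method names are made up for illustration; this is not the actual
>>>>>>>>>> LogCleaner code):
>>>>>>>>>>
>>>>>>>>>>     import java.io.IOException;
>>>>>>>>>>     import java.util.Arrays;
>>>>>>>>>>     import java.util.List;
>>>>>>>>>>
>>>>>>>>>>     // Rough sketch only: cleanPartition and the "mark" steps are hypothetical.
>>>>>>>>>>     public class CleanerPassSketch {
>>>>>>>>>>         static void cleanPartition(String topicPartition) throws IOException {
>>>>>>>>>>             // ... the actual cleaning work would go here ...
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         public static void main(String[] args) {
>>>>>>>>>>             List<String> partitionsToClean = Arrays.asList("topicA-0", "topicB-3");
>>>>>>>>>>             for (String tp : partitionsToClean) {
>>>>>>>>>>                 try {
>>>>>>>>>>                     cleanPartition(tp);
>>>>>>>>>>                 } catch (IOException e) {
>>>>>>>>>>                     // Disk-level problem: keep the current behavior and take
>>>>>>>>>>                     // the whole log dir offline rather than retrying.
>>>>>>>>>>                     System.err.println("Marking log dir offline after " + e + " on " + tp);
>>>>>>>>>>                     break;
>>>>>>>>>>                 } catch (Exception e) {
>>>>>>>>>>                     // Any other (likely logic) error: mark only this partition
>>>>>>>>>>                     // uncleanable and keep cleaning the rest instead of dying.
>>>>>>>>>>                     System.err.println("Marking " + tp + " uncleanable: " + e);
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>>>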
>>>>>>>>>>> 4) The only improvement I can think of is that if such an error
>>>>>>>>>>> occurs, then have the option (configuration setting?) to create a
>>>>>>>>>>> <log_segment>.skip file (or something similar).
>>>>>>>>>>
>>>>>>>>>> This is a good suggestion. Have others also seen corruption be
>>>>>>>>>> generally tied to the same segment?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> For the cleaner thread specifically, I do not think respawning will
>>>>>>>>>>> help at all because we are more than likely to run into the same
>>>>>>>>>>> issue again which would end up crashing the cleaner. Retrying makes
>>>>>>>>>>> sense for transient errors or when you believe some part of the
>>>>>>>>>>> system could have healed itself, both of which I think are not true
>>>>>>>>>>> for the log cleaner.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> <<<respawning threads is likely to make things worse, by putting
>>>>>>>>>>>> you in an infinite loop which consumes resources and fires off
>>>>>>>>>>>> continuous log messages.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
>>>>>>>>>>>> effect is to implement a backoff mechanism (if a second respawn is
>>>>>>>>>>>> to occur then wait for 1 minute before doing it; then if a third
>>>>>>>>>>>> respawn is to occur wait for 2 minutes before doing it; then 4
>>>>>>>>>>>> minutes, 8 minutes, etc. up to some max wait time).
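>>>>>>>>>>>>
>>>>>>>>>>>> Purely as an illustration, here is a minimal Java sketch of that
>>>>>>>>>>>> kind of increasing backoff (the thread body, the starting delay,
>>>>>>>>>>>> and the 32-minute cap are all made-up values):
>>>>>>>>>>>>
>>>>>>>>>>>>     import java.util.concurrent.TimeUnit;
>>>>>>>>>>>>
>>>>>>>>>>>>     public class RespawnWithBackoff {
>>>>>>>>>>>>         public static void main(String[] args) throws InterruptedException {
>>>>>>>>>>>>             long backoffMinutes = 1;           // wait before the next respawn
>>>>>>>>>>>>             final long maxBackoffMinutes = 32; // cap on the wait time
>>>>>>>>>>>>             while (true) {
>>>>>>>>>>>>                 // Placeholder for the real work the thread would do.
>>>>>>>>>>>>                 Thread worker = new Thread(() -> { /* cleaner work here */ });
>>>>>>>>>>>>                 worker.start();
>>>>>>>>>>>>                 worker.join();                 // returns once the thread has died
>>>>>>>>>>>>                 TimeUnit.MINUTES.sleep(backoffMinutes);
>>>>>>>>>>>>                 backoffMinutes = Math.min(backoffMinutes * 2, maxBackoffMinutes);
>>>>>>>>>>>>             }
>>>>>>>>>>>>         }
>>>>>>>>>>>>     }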
>>>>>>>>>>>>
>>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in this
>>>>>>>>>>>> context, but a mitigation like the increasing backoff described
>>>>>>>>>>>> above may be relevant in weighing the pros and cons.
>>>>>>>>>>>>
>>>>>>>>>>>> Ron
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
>>>>>>>>>>>>>> of errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>>>>>>>     kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And as of
>>>>>>>>>>>>>> 1.1.0, we have KIP-226, which allows you to restart the log
>>>>>>>>>>>>>> cleaner thread, without requiring a broker restart.
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
>>>>>>>>>>>>> add a sentence or two mentioning the KIP-226 changes somewhere in
>>>>>>>>>>>>> the KIP.  Maybe in the intro section?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's clear that requiring the users to manually restart the
>>>>>>>>>>>>> log cleaner is not a very good solution.  But it's good to know
>>>>>>>>>>>>> that it's a possibility on some older releases.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to clean as
>>>>>>>>>>>>>> many partitions as it can, skipping over the problematic ones if
>>>>>>>>>>>>>> possible.
>>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should automatically
>>>>>>>>>>>>>> be revived. Your KIP attempts to do that by catching exceptions
>>>>>>>>>>>>>> during execution, but I think we should go all the way and make
>>>>>>>>>>>>>> sure that a new one gets created, if the thread ever dies.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
>>>>>>>>>>>>> don't automatically re-create other threads in the broker if they
>>>>>>>>>>>>> terminate.  In general, if there is a serious bug in the code,
>>>>>>>>>>>>> respawning threads is likely to make things worse, by putting you
>>>>>>>>>>>>> in an infinite loop which consumes resources and fires off
>>>>>>>>>>>>> continuous log messages.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
>>>>>>>>>>>>>> I've seen cases where an uncleanable partition later became
>>>>>>>>>>>>>> cleanable.  I unfortunately don't remember how that happened, but
>>>>>>>>>>>>>> I remember being surprised when I discovered it. It might have
>>>>>>>>>>>>>> been something like a follower was uncleanable but after a leader
>>>>>>>>>>>>>> election happened, the log truncated and it was then cleanable
>>>>>>>>>>>>>> again. I'm not sure.
>>>>>>>>>>>>>
>>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
>>>>>>>>>>>>> File System (HDFS) and it was a constant source of user problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would happen is disks would just go bad over time.  The
>>>>>>>>>>>>> DataNode would notice this and take them offline.  But then, due to
>>>>>>>>>>>>> some "optimistic" code, the DataNode would periodically try to
>>>>>>>>>>>>> re-add them to the system.  Then one of two things would happen:
>>>>>>>>>>>>> the disk would just fail immediately again, or it would appear to
>>>>>>>>>>>>> work and then fail after a short amount of time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The way the disk failed was normally having an I/O request take a
>>>>>>>>>>>>> really long time and time out.  So a bunch of request handler
>>>>>>>>>>>>> threads would basically slam into a brick wall when they tried to
>>>>>>>>>>>>> access the bad disk, slowing the DataNode to a crawl.  It was even
>>>>>>>>>>>>> worse in the second scenario, if the disk appeared to work for a
>>>>>>>>>>>>> while, but then failed.  Any data that had been written on that
>>>>>>>>>>>>> DataNode to that disk would be lost, and we would need to
>>>>>>>>>>>>> re-replicate it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
>>>>>>>>>>>>> they're bad, they stay bad.  The log cleaner needs to be robust
>>>>>>>>>>>>> against cases where the disk really is failing, and really is
>>>>>>>>>>>>> returning bad data or timing out.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
>>>>>>>>>>>>>> format, such as:
>>>>>>>>>>>>>>     kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>>>>>>>>>>     value=4
>>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
>>>>>>>>>>>>>> long. I think the current max size is 210 characters (or maybe
>>>>>>>>>>>>>> 240-ish?). Having the "uncleanable-partitions" being a list could
>>>>>>>>>>>>>> be a very large metric. Also, having the metric come out as a csv
>>>>>>>>>>>>>> might be difficult to work with for monitoring systems. If we
>>>>>>>>>>>>>> *did* want the topic names to be accessible, what do you think of
>>>>>>>>>>>>>> having the
>>>>>>>>>>>>>>     kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
>>>>>>>>>>>>>> example was that the topic and partition can be tags in the
>>>>>>>>>>>>>> metric. That will allow monitoring systems to more easily slice
>>>>>>>>>>>>>> and dice the metric. I'm not sure what the attribute for that
>>>>>>>>>>>>>> metric would be. Maybe something like "uncleaned bytes" for that
>>>>>>>>>>>>>> topic-partition? Or time-since-last-clean? Or maybe even just
>>>>>>>>>>>>>> "Value=1".
>>>>>>>>>>>>>>
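>>>>>>>>>>>>>> As a rough Java sketch of why the tags help (the MBean name with
>>>>>>>>>>>>>> topic/partition tags is just the hypothetical one above, and the
>>>>>>>>>>>>>> JMX port is an assumption), a monitoring tool could discover all
>>>>>>>>>>>>>> uncleanable partitions with a single pattern query:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     import java.util.Set;
>>>>>>>>>>>>>>     import javax.management.MBeanServerConnection;
>>>>>>>>>>>>>>     import javax.management.ObjectName;
>>>>>>>>>>>>>>     import javax.management.remote.JMXConnector;
>>>>>>>>>>>>>>     import javax.management.remote.JMXConnectorFactory;
>>>>>>>>>>>>>>     import javax.management.remote.JMXServiceURL;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     public class UncleanablePartitionsQuery {
>>>>>>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>>>>>>             JMXServiceURL url = new JMXServiceURL(
>>>>>>>>>>>>>>                     "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
>>>>>>>>>>>>>>             JMXConnector connector = JMXConnectorFactory.connect(url);
>>>>>>>>>>>>>>             try {
>>>>>>>>>>>>>>                 MBeanServerConnection mbsc = connector.getMBeanServerConnection();
>>>>>>>>>>>>>>                 // One hypothetical MBean per uncleanable partition, tagged
>>>>>>>>>>>>>>                 // by topic and partition, as suggested above.
>>>>>>>>>>>>>>                 ObjectName pattern = new ObjectName(
>>>>>>>>>>>>>>                         "kafka.log:type=LogCleanerManager,topic=*,partition=*");
>>>>>>>>>>>>>>                 Set<ObjectName> names = mbsc.queryNames(pattern, null);
>>>>>>>>>>>>>>                 for (ObjectName name : names) {
>>>>>>>>>>>>>>                     System.out.println(name.getKeyProperty("topic") + "-"
>>>>>>>>>>>>>>                             + name.getKeyProperty("partition"));
>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>             } finally {
>>>>>>>>>>>>>>                 connector.close();
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>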
>>>>>>>>>>>>> I haven't thought about this that hard, but do we really need the
>>>>>>>>>>>>> uncleanable topic names to be accessible through a metric?  It
>>>>>>>>>>>>> seems like the admin should notice that uncleanable partitions are
>>>>>>>>>>>>> present, and then check the logs?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>>>>>>>>>>> indicates that the disk is having problems. I'm not sure that is
>>>>>>>>>>>>>> the case. For the 4 JIRAs that you mentioned about log cleaner
>>>>>>>>>>>>>> problems, all of them are partition-level scenarios that happened
>>>>>>>>>>>>>> during normal operation. None of them were indicative of disk
>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general, we
>>>>>>>>>>>>> don't accept JIRAs for hard disk problems that happen on a
>>>>>>>>>>>>> particular cluster.  If someone opened a JIRA that said "my hard
>>>>>>>>>>>>> disk is having problems" we could close that as "not a Kafka bug."
>>>>>>>>>>>>> This doesn't prove that disk problems don't happen, but just that
>>>>>>>>>>>>> JIRA isn't the right place for them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I do agree that the log cleaner has had a significant number of
>>>>>>>>>>>>> logic bugs, and that we need to be careful to limit their impact.
>>>>>>>>>>>>> That's one reason why I think that a threshold of "number of
>>>>>>>>>>>>> uncleanable logs" is a good idea, rather than just failing after
>>>>>>>>>>>>> one IOException.  In all the cases I've seen where a user hit a
>>>>>>>>>>>>> logic bug in the log cleaner, it was just one partition that had
>>>>>>>>>>>>> the issue.  We also should increase test coverage for the log
>>>>>>>>>>>>> cleaner.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
>>>>>>>>>>>>>> threshold, that actually increases the blast radius of log
>>>>>>>>>>>>>> compaction failures. Currently, the uncleaned partitions are still
>>>>>>>>>>>>>> readable and writable. Taking the disks offline would impact
>>>>>>>>>>>>>> availability of the uncleanable partitions, as well as impact all
>>>>>>>>>>>>>> other partitions that are on the disk.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
>>>>>>>>>>>>> partition offline.  This is spelled out in KIP-112
>>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Broker assumes a log directory to be good after it starts, and
>>>>>>>>>>>>>> mark log directory as bad once there is IOException when broker
>>>>>>>>>>>>>> attempts to access (i.e. read or write) the log directory.
>>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log directory.
>>>>>>>>>>>>>> New
