Hi Stanislav! Thanks for this KIP! I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms, which we can use to tell us if the threads are dead (there's a small sketch of reading it over JMX after my comments below). And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread without requiring a broker restart. https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration I've only read about this; I haven't personally tried it.

Some comments:

* I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.

* If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created if the thread ever dies (there's a rough sketch of that supervision idea below).

* It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like: a follower was uncleanable, but after a leader election the log was truncated and it became cleanable again. I'm not sure.

* For your metrics, can you spell out the full metric in JMX-style format? For example:

  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4

* For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?), so having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a CSV might be difficult for monitoring systems to work with. If we *did* want the topic names to be accessible, what do you think of something like:

  kafka.log:type=LogCleanerManager,topic=topic1,partition=2

  I'm not sure if LogCleanerManager is the right type, but my point is that the topic and partition can be tags in the metric. That would allow monitoring systems to more easily slice and dice the metric (see the pattern-query sketch below). I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".

* About `max.uncleanable.partitions`: you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.

* About marking disks as offline when exceeding a certain threshold: that actually increases the blast radius of log compaction failures. Currently, the uncleanable partitions are still readable and writable. Taking the disks offline would impact the availability of the uncleanable partitions, as well as all other partitions on that disk.
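To make the monitoring point concrete, here's a rough sketch of checking cleaner liveness externally by reading that existing gauge over JMX. The host, port, and class name are just placeholders for the example; the metric name is the one the broker already exposes, and Kafka's Yammer gauges surface their value through the "Value" attribute.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LogCleanerLivenessCheck {
    public static void main(String[] args) throws Exception {
        // Assumes remote JMX is enabled on the broker at port 9999 (an assumption for this example).
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // The gauge mentioned above; Yammer gauges expose their value via the "Value" attribute.
            ObjectName lastRun =
                new ObjectName("kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
            long msSinceLastRun = ((Number) mbs.getAttribute(lastRun, "Value")).longValue();
            System.out.println("time-since-last-run-ms = " + msSinceLastRun);
            // A monitoring system would alert when this value keeps growing,
            // since that is the signal that the cleaner threads have died.
        }
    }
}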
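For the revival idea, I'm picturing something along these lines. To be clear, this is just a generic illustration of the supervision pattern, not the actual cleaner thread code:

import java.util.concurrent.atomic.AtomicBoolean;

// Generic sketch: if the worker thread ever exits abnormally, the
// uncaught-exception handler creates and starts a replacement.
public class SelfHealingWorker {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final Runnable work;

    public SelfHealingWorker(Runnable work) {
        this.work = work;
    }

    public void start() {
        Thread t = new Thread(work, "log-cleaner-thread");
        t.setUncaughtExceptionHandler((thread, error) -> {
            if (!shuttingDown.get()) {
                // The old thread is gone; spin up a fresh one.
                start();
            }
        });
        t.start();
    }

    public void shutdown() {
        shuttingDown.set(true);
    }
}

In the KIP's case, the replacement thread would presumably skip the partitions already marked uncleanable, so it doesn't immediately die on the same error.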
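And on the tags idea: if such a per-partition gauge existed (again, the type, name, and attribute are all open questions, so this is purely hypothetical), a monitoring system could enumerate the uncleanable partitions with an ObjectName pattern instead of parsing one long CSV attribute:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class UncleanablePartitionsQuery {
    // Lists the hypothetical per-partition gauges, assuming the
    // topic/partition-tagged naming suggested above were adopted.
    static Set<ObjectName> uncleanablePartitions(MBeanServerConnection mbs) throws Exception {
        // Property-value wildcards match any topic and any partition tag.
        ObjectName pattern =
            new ObjectName("kafka.log:type=LogCleanerManager,topic=*,partition=*");
        return mbs.queryNames(pattern, null);
    }
}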
-James

> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
>
> I renamed the KIP and that changed the link. Sorry about that. Here is the
> new link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>
> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
>
>> Hey group,
>>
>> I created a new KIP about making log compaction more fault-tolerant.
>> Please give it a look here and please share what you think, especially in
>> regards to the points in the "Needs Discussion" paragraph.
>>
>> KIP: KIP-346
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> --
>> Best,
>> Stanislav
>
>
> --
> Best,
> Stanislav