Hi Stanislav! Thanks for this KIP!
I agree that it would be good if the LogCleaner were more tolerant of errors.
Currently, as you said, once it dies, it stays dead.
Things are better now than they used to be. We have the metric
kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
which we can use to tell us if the threads are dead. And as of 1.1.0, we have
KIP-226, which allows you to restart the log cleaner thread without requiring
a broker restart.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
I've only read about this; I haven't personally tried it.
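
For anyone who wants to alert on that metric, here is roughly how a small Java
program could read it over JMX. The host/port in the service URL is just a
placeholder, and I'm assuming the broker has remote JMX enabled and that the
gauge exposes its value through the usual "Value" attribute:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LogCleanerLivenessCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes the broker was started with remote JMX enabled.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName cleanerMetric =
                new ObjectName("kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
            // Kafka's Yammer gauges expose their value through the "Value" attribute.
            long msSinceLastRun =
                ((Number) conn.getAttribute(cleanerMetric, "Value")).longValue();
            if (msSinceLastRun > 15 * 60 * 1000L) {
                System.out.println("Log cleaner looks dead: last ran " + msSinceLastRun + " ms ago");
            } else {
                System.out.println("Log cleaner last ran " + msSinceLastRun + " ms ago");
            }
        } finally {
            connector.close();
        }
    }
}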
Some comments:
* I like the idea of having the log cleaner continue to clean as many
partitions as it can, skipping over the problematic ones if possible.
* If the log cleaner thread dies, I think it should automatically be revived.
Your KIP attempts to do that by catching exceptions during execution, but I
think we should go all the way and make sure that a new thread gets created if
the current one ever dies. (I've put a rough sketch of what I mean at the
bottom of this mail.)
* It might be worth trying to re-clean the uncleanable partitions. I've seen
cases where an uncleanable partition later became cleanable. I unfortunately
don't remember how that happened, but I remember being surprised when I
discovered it. It might have been something like a follower was uncleanable,
but after a leader election the log was truncated and it became cleanable
again. I'm not sure.
* For your metrics, can you spell out the full metric names in JMX-style
format, such as:
kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
value=4
* For "uncleanable-partitions": topic-partition names can be very long. I think
the current max size is 210 characters (or maybe 240-ish?). Having the
"uncleanable-partitions" being a list could be very large metric. Also, having
the metric come out as a csv might be difficult to work with for monitoring
systems. If we *did* want the topic names to be accessible, what do you think
of having the
kafka.log:type=LogCleanerManager,topic=topic1,partition=2
I'm not sure if LogCleanerManager is the right type, but my example was that
the topic and partition can be tags in the metric. That will allow monitoring
systems to more easily slice and dice the metric. I'm not sure what the
attribute for that metric would be. Maybe something like "uncleaned bytes" for
that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
* About `max.uncleanable.partitions`: you said that exceeding it likely
indicates that the disk is having problems. I'm not sure that is the case. All
four of the JIRAs you mentioned about log cleaner problems describe
partition-level scenarios that happened during normal operation. None of them
were indicative of disk problems.
* About marking disks as offline when a certain threshold is exceeded: that
actually increases the blast radius of log compaction failures. Currently, the
uncleanable partitions are still readable and writable. Taking the disks
offline would impact the availability of the uncleanable partitions, as well
as all of the other partitions on those disks.
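
To make the thread-revival comment above more concrete, here is the rough
shape of what I have in mind. This is only an illustrative sketch, not the
actual LogCleaner code; CleanerSupervisor, grabFilthiestLogs() and
cleanOnePass() are made-up names:

import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanerSupervisor {
    private final Set<String> uncleanablePartitions = ConcurrentHashMap.newKeySet();

    public void start() {
        spawnCleaner();
    }

    private void spawnCleaner() {
        Thread cleaner = new Thread(this::cleanLoop, "kafka-log-cleaner-thread");
        // If the thread dies for any reason, start a replacement instead of staying dead.
        cleaner.setUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Cleaner thread died (" + error + "); starting a new one");
            spawnCleaner();
        });
        cleaner.start();
    }

    private void cleanLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            for (String partition : grabFilthiestLogs()) {
                if (uncleanablePartitions.contains(partition))
                    continue; // skip partitions that previously failed
                try {
                    cleanOnePass(partition);
                } catch (Exception e) {
                    // A single bad partition should not kill the whole cleaner.
                    uncleanablePartitions.add(partition);
                }
            }
            try {
                Thread.sleep(15_000);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Made-up helpers standing in for the real cleaning logic.
    private List<String> grabFilthiestLogs() { return Collections.emptyList(); }
    private void cleanOnePass(String partition) { }
}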
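
And on the tags idea: the nice thing about topic/partition tags is that a
monitoring tool can pick up every per-partition cleaner metric with a single
wildcard query. Another rough sketch, where the "uncleanable-bytes" metric
name and the host/port are again just placeholders:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UncleanablePartitionScan {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Wildcard pattern over the hypothetical tagged per-partition metric.
            ObjectName pattern = new ObjectName(
                "kafka.log:type=LogCleanerManager,name=uncleanable-bytes,topic=*,partition=*");
            Set<ObjectName> names = conn.queryNames(pattern, null);
            for (ObjectName name : names) {
                Object value = conn.getAttribute(name, "Value");
                System.out.println(name.getKeyProperty("topic") + "-"
                    + name.getKeyProperty("partition") + ": " + value);
            }
        } finally {
            connector.close();
        }
    }
}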
-James
> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <[email protected]>
> wrote:
>
> I renamed the KIP and that changed the link. Sorry about that. Here is the
> new link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>
> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <[email protected]>
> wrote:
>
>> Hey group,
>>
>> I created a new KIP about making log compaction more fault-tolerant.
>> Please give it a look here and please share what you think, especially in
>> regards to the points in the "Needs Discussion" paragraph.
>>
>> KIP: KIP-346
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> --
>> Best,
>> Stanislav
>>
>
>
> --
> Best,
> Stanislav