Hi Stanislav! Thanks for this KIP! I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms, which we can use to tell us if the threads are dead (there's a small sketch of reading it over JMX after my comments below). And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread without requiring a broker restart. https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration I've only read about this; I haven't personally tried it.

Some comments:

* I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.

* If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created if the thread ever dies (there's a rough sketch of that supervision idea below).

* It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like: a follower was uncleanable, but after a leader election the log was truncated and it became cleanable again. I'm not sure.

* For your metrics, can you spell out the full metric in JMX-style format? For example:

  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4

* For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?), so having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a CSV might be difficult for monitoring systems to work with. If we *did* want the topic names to be accessible, what do you think of something like:

  kafka.log:type=LogCleanerManager,topic=topic1,partition=2

  I'm not sure if LogCleanerManager is the right type, but my point is that the topic and partition can be tags in the metric. That would allow monitoring systems to more easily slice and dice the metric (see the pattern-query sketch below). I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".

* About `max.uncleanable.partitions`: you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.

* About marking disks as offline when exceeding a certain threshold: that actually increases the blast radius of log compaction failures. Currently, the uncleanable partitions are still readable and writable. Taking the disks offline would impact the availability of the uncleanable partitions, as well as all other partitions on that disk.
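To make the monitoring point concrete, here's a rough sketch of checking cleaner liveness externally by reading that existing gauge over JMX. The host, port, and class name are just placeholders for the example; the metric name is the one the broker already exposes, and Kafka's Yammer gauges surface their value through the "Value" attribute.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LogCleanerLivenessCheck {
    public static void main(String[] args) throws Exception {
        // Assumes remote JMX is enabled on the broker at port 9999 (an assumption for this example).
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // The gauge mentioned above; Yammer gauges expose their value via the "Value" attribute.
            ObjectName lastRun =
                new ObjectName("kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
            long msSinceLastRun = ((Number) mbs.getAttribute(lastRun, "Value")).longValue();
            System.out.println("time-since-last-run-ms = " + msSinceLastRun);
            // A monitoring system would alert when this value keeps growing,
            // since that is the signal that the cleaner threads have died.
        }
    }
}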
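For the revival idea, I'm picturing something along these lines. To be clear, this is just a generic illustration of the supervision pattern, not the actual cleaner thread code:

import java.util.concurrent.atomic.AtomicBoolean;

// Generic sketch: if the worker thread ever exits abnormally, the
// uncaught-exception handler creates and starts a replacement.
public class SelfHealingWorker {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final Runnable work;

    public SelfHealingWorker(Runnable work) {
        this.work = work;
    }

    public void start() {
        Thread t = new Thread(work, "log-cleaner-thread");
        t.setUncaughtExceptionHandler((thread, error) -> {
            if (!shuttingDown.get()) {
                // The old thread is gone; spin up a fresh one.
                start();
            }
        });
        t.start();
    }

    public void shutdown() {
        shuttingDown.set(true);
    }
}

In the KIP's case, the replacement thread would presumably skip the partitions already marked uncleanable, so it doesn't immediately die on the same error.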
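And on the tags idea: if such a per-partition gauge existed (again, the type, name, and attribute are all open questions, so this is purely hypothetical), a monitoring system could enumerate the uncleanable partitions with an ObjectName pattern instead of parsing one long CSV attribute:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class UncleanablePartitionsQuery {
    // Lists the hypothetical per-partition gauges, assuming the
    // topic/partition-tagged naming suggested above were adopted.
    static Set<ObjectName> uncleanablePartitions(MBeanServerConnection mbs) throws Exception {
        // Property-value wildcards match any topic and any partition tag.
        ObjectName pattern =
            new ObjectName("kafka.log:type=LogCleanerManager,topic=*,partition=*");
        return mbs.queryNames(pattern, null);
    }
}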
-James

> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
>
> I renamed the KIP and that changed the link. Sorry about that. Here is the
> new link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>
> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
>
>> Hey group,
>>
>> I created a new KIP about making log compaction more fault-tolerant.
>> Please give it a look here and please share what you think, especially in
>> regards to the points in the "Needs Discussion" paragraph.
>>
>> KIP: KIP-346
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> --
>> Best,
>> Stanislav
>
>
> --
> Best,
> Stanislav