Hey Ray,

Thanks for pointing that out, it's fixed now.
Best,
Stanislav

On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:

> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>
> I tried, but the Wiki won't let me.
>
> -Ray
>
> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > Hey guys,
> >
> > @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
> >
> > *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
> >
> > *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
> >
> > *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect logs.
> >
> > Hey Ray,
> >
> >> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
> >
> > I think we can do this for every exception that isn't `IOException`. This will future-proof us against bugs in the system and other potential errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
> >
> >> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
> >
> > This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> >
> > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> >
> >> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again, which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
> >>
> >> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> >>
> >>> <<< respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> >>>
> >>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it; then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max wait time).
> >>>
> >>> I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
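For illustration, a minimal Java sketch of the capped exponential backoff Ron describes; the class name and the respawn hook are hypothetical, not taken from the Kafka codebase:

    import java.util.concurrent.TimeUnit;

    // Sketch of the backoff described above: no wait before the first respawn,
    // then 1 minute before the second, 2 before the third, 4, 8, ... capped at a max.
    public class RespawnBackoff {
        private static final long MAX_WAIT_MINUTES = 16;
        private int respawns = 0;

        public void respawnWithBackoff(Runnable respawn) throws InterruptedException {
            if (respawns > 0) {
                long waitMinutes = Math.min(1L << (respawns - 1), MAX_WAIT_MINUTES);
                TimeUnit.MINUTES.sleep(waitMinutes);
            }
            respawns++;
            respawn.run();  // re-create the dead thread, however the caller chooses to do that
        }
    }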
> >>>
> >>> Ron
> >>>
> >>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> >>>
> >>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>
> >>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
> >>>>>
> >>>>> Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart.
> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>
> >>>>> I've only read about this, I haven't personally tried it.
> >>>>
> >>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> >>>>
> >>>> I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
> >>>>
> >>>>> Some comments:
> >>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
> >>>>>
> >>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.
> >>>>
> >>>> This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> >>>>
> >>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
> >>>>
> >>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
> >>>>
> >>>> What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
> >>>>
> >>>> The way the disk failed was normally having an I/O request take a really long time and time out.
> >>>> So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario, if the disk appeared to work for a while but then failed. Any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.
> >>>>
> >>>> Disks aren't biological systems -- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
> >>>>
> >>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
> >>>>>
> >>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> >>>>>
> >>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having
> >>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>> I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
> >>>>
> >>>> I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
> >>>>
> >>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
> >>>>
> >>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, just that JIRA isn't the right place for them.
> >>>>
> >>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
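A rough Java sketch of the thresholding idea discussed above: skip a partition that fails cleaning, and only give up on the whole log directory once the number of uncleanable partitions in it crosses a configured limit (the proposed max.uncleanable.partitions). Which exceptions should take this path is exactly what is being debated here; all names below are illustrative, and the real LogCleaner is Scala code structured differently.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class UncleanablePartitionTracker {
        private final int maxUncleanablePartitions;
        private final Map<String, Set<String>> uncleanableByLogDir = new ConcurrentHashMap<>();

        public UncleanablePartitionTracker(int maxUncleanablePartitions) {
            this.maxUncleanablePartitions = maxUncleanablePartitions;
        }

        // Record a partition that failed to clean; returns true once the log dir
        // has accumulated enough failures that it should be marked offline.
        public boolean recordFailure(String logDir, String topicPartition) {
            Set<String> uncleanable =
                uncleanableByLogDir.computeIfAbsent(logDir, d -> ConcurrentHashMap.newKeySet());
            uncleanable.add(topicPartition);  // skipped on future cleaning passes
            return uncleanable.size() > maxUncleanablePartitions;
        }

        // Would back the proposed uncleanable-partitions-count metric.
        public int uncleanablePartitionsCount(String logDir) {
            return uncleanableByLogDir.getOrDefault(logDir, Collections.emptySet()).size();
        }
    }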
> >>>>
> >>>>> * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
> >>>>
> >>>> In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> >>>>
> >>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
> >>>>> - Broker will be offline if all log directories are bad.
> >>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be created on good log directory.
> >>>>
> >>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one. But it's generally consistent. Ignoring errors is not. In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).
> >>>>
> >>>> best,
> >>>> Colin
> >>>>
> >>>>> -James
> >>>>>
> >>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >>>>>>
> >>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>
> >>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >>>>>>
> >>>>>>> Hey group,
> >>>>>>>
> >>>>>>> I created a new KIP about making log compaction more fault-tolerant. Please give it a look here and please share what you think, especially in regards to the points in the "Needs Discussion" paragraph.
> >>>>>>>
> >>>>>>> KIP: KIP-346
> >>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best,
> >>>>>>> Stanislav
> >>>>>>
> >>>>>> --
> >>>>>> Best,
> >>>>>> Stanislav

--
Best,
Stanislav
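The metric naming debated earlier in the thread can be made concrete with a small JMX probe. Only the javax.management plumbing below is standard; the uncleanable-partitions-count gauge is the KIP's proposal and the tagged per-partition MBean pattern is James's suggestion, so neither is guaranteed to exist under these exact names.

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class LogCleanerMetricsProbe {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on localhost:9999.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();

                // Single-gauge form: one count for the whole broker.
                ObjectName count = new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count");
                System.out.println("uncleanable partitions: " + conn.getAttribute(count, "Value"));

                // Tagged form: one MBean per topic-partition, found with a wildcard query,
                // which is what lets monitoring systems slice and dice by topic/partition.
                Set<ObjectName> perPartition = conn.queryNames(
                    new ObjectName("kafka.log:type=LogCleanerManager,topic=*,partition=*"), null);
                for (ObjectName name : perPartition) {
                    System.out.println("uncleanable: " + name.getKeyProperty("topic")
                        + "-" + name.getKeyProperty("partition"));
                }
            }
        }
    }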