Hey group,

I just wanted to note that I have an implementation ready for review. Feel free to take a quick look and raise any concerns you might have. I plan on starting the voting thread tomorrow.
Best,
Stanislav

On Wed, Aug 1, 2018 at 10:01 AM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey Ray,
>
> * Yes, we'd have the logDir as a tag in the metric
> * The intention is to have Int.MaxValue as the maximum uncleanable
>   partitions count
> * My idea is to store the marked logs (actually partitions) in memory
>   instead of the ".skip" files to keep the change simple. I have also decided
>   to omit any retries from the implementation - once a partition is marked as
>   "uncleanable" it stays so until a broker restart
>
> Please do let me know if you are okay with this description. I should have
> the code available for review soon
>
> Thanks,
> Stanislav
>
> On Tue, Jul 31, 2018 at 6:30 PM Ray Chiang <rchi...@apache.org> wrote:
>
>> I had one question that I was trying to do some investigation on before I
>> asked, but I'm having some issues with my JMX browser right now.
>>
>> * For the uncleanable-partitions-count metric, is that going to be a
>>   per-logDir entry?
>> * For max.uncleanable.partitions, is the intention to have -1 be
>>   "infinite" or are we going to use Int.MaxValue as a practical
>>   equivalent?
>> * In this sentence: "When evaluating which logs to compact, skip the
>>   marked ones.", should we define what "marking" will be? If we're
>>   going with the ".skip" file or equivalent, can we also add how
>>   successful retries will behave?
>>
>> -Ray
>>
>> On 7/31/18 9:56 AM, Stanislav Kozlovski wrote:
>> > Hey group,
>> >
>> > I am planning on starting a voting thread tomorrow. Please do reply if you
>> > feel there is anything left to discuss.
>> >
>> > Best,
>> > Stanislav
>> >
>> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io>
>> > wrote:
>> >
>> >> Hey, Ray
>> >>
>> >> Thanks for pointing that out, it's fixed now
>> >>
>> >> Best,
>> >> Stanislav
>> >>
>> >> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
>> >>
>> >>> Thanks.
>> >>> Can you fix the link in the "KIPs under discussion" table on
>> >>> the main KIP landing page
>> >>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>> >>>
>> >>> I tried, but the Wiki won't let me.
>> >>>
>> >>> -Ray
>> >>>
>> >>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>> >>>> Hey guys,
>> >>>>
>> >>>> @Colin - good point. I added some sentences mentioning recent
>> >>>> improvements in the introductory section.
>> >>>>
>> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>> >>>> fails, you don't want to work with it again. As such, I've changed my
>> >>>> mind and believe that we should mark the LogDir (assume it's a disk) as
>> >>>> offline on the first `IOException` encountered. This is the LogCleaner's
>> >>>> current behavior. We shouldn't change that.
>> >>>>
>> >>>> *Respawning Threads* - I believe we should never re-spawn a thread. The
>> >>>> correct approach in my mind is to either have it stay dead or never let
>> >>>> it die in the first place.
>> >>>>
>> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
>> >>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
>> >>>> and inspect logs.
>> >>>>
>> >>>>
>> >>>> Hey Ray,
>> >>>>
>> >>>>> 2) I'm 100% in agreement with James on setting up the LogCleaner to
>> >>>>> skip over problematic partitions instead of dying.
>> >>>> I think we can do this for every exception that isn't `IOException`.
>> >>>> This will future-proof us against bugs in the system and potential other
>> >>>> errors. Protecting yourself against unexpected failures is always a good
>> >>>> thing in my mind, but I also think that protecting yourself against bugs
>> >>>> in the software is sort of clunky. What does everybody think about this?
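[The error-handling split being discussed here (the first `IOException` marks the whole LogDir offline, while any other exception only marks that one partition as uncleanable) could be sketched roughly as follows. All function names are hypothetical illustrations, not the actual patch:]

```python
def clean_partition_safely(partition, clean_fn, mark_dir_offline, mark_uncleanable):
    """Run one compaction pass with the error policy under discussion.

    - IOError: assume the disk is bad; take the whole logDir offline.
    - Any other exception: likely a logic bug confined to this partition,
      so mark just this partition uncleanable and keep the cleaner alive.
    """
    try:
        clean_fn(partition)
        return "cleaned"
    except IOError:
        mark_dir_offline(partition)
        return "dir-offline"
    except Exception:
        mark_uncleanable(partition)
        return "uncleanable"
```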
>> >>>>
>> >>>>> 4) The only improvement I can think of is that if such an
>> >>>>> error occurs, then have the option (configuration setting?) to create a
>> >>>>> <log_segment>.skip file (or something similar).
>> >>>> This is a good suggestion. Have others also seen corruption be generally
>> >>>> tied to the same segment?
>> >>>>
>> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io>
>> >>>> wrote:
>> >>>>> For the cleaner thread specifically, I do not think respawning will
>> >>>>> help at all because we are more than likely to run into the same issue
>> >>>>> again, which would end up crashing the cleaner. Retrying makes sense for
>> >>>>> transient errors or when you believe some part of the system could have
>> >>>>> healed itself, both of which I think are not true for the log cleaner.
>> >>>>>
>> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com>
>> >>>>> wrote:
>> >>>>>> <<<respawning threads is likely to make things worse, by putting you
>> >>>>>> in an infinite loop which consumes resources and fires off continuous
>> >>>>>> log messages.
>> >>>>>> Hi Colin. In case it could be relevant, one way to mitigate this
>> >>>>>> effect is to implement a backoff mechanism (if a second respawn is to
>> >>>>>> occur then wait for 1 minute before doing it; then if a third respawn
>> >>>>>> is to occur wait for 2 minutes before doing it; then 4 minutes,
>> >>>>>> 8 minutes, etc. up to some max wait time).
>> >>>>>>
>> >>>>>> I have no opinion on whether respawn is appropriate or not in this
>> >>>>>> context, but a mitigation like the increasing backoff described above
>> >>>>>> may be relevant in weighing the pros and cons.
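[Ron's doubling backoff (1, 2, 4, 8 minutes, up to some max) is a standard pattern; as a rough sketch of the schedule he describes, not tied to any actual Kafka code:]

```python
def respawn_delays(base_minutes=1, max_minutes=30):
    """Yield exponentially increasing respawn delays, capped at max_minutes.

    A second respawn waits base_minutes, a third waits twice that,
    and so on, doubling until the cap is reached, per the suggestion above.
    """
    delay = base_minutes
    while True:
        yield min(delay, max_minutes)
        delay *= 2
```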
>> >>>>>>
>> >>>>>> Ron
>> >>>>>>
>> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org>
>> >>>>>> wrote:
>> >>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>> >>>>>>>> Hi Stanislav! Thanks for this KIP!
>> >>>>>>>>
>> >>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
>> >>>>>>>> of errors. Currently, as you said, once it dies, it stays dead.
>> >>>>>>>>
>> >>>>>>>> Things are better now than they used to be. We have the metric
>> >>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>> >>>>>>>> which we can use to tell us if the threads are dead. And as of
>> >>>>>>>> 1.1.0, we have KIP-226, which allows you to restart the log cleaner
>> >>>>>>>> thread, without requiring a broker restart:
>> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>> >>>>>>>> I've only read about this, I haven't personally tried it.
>> >>>>>>> Thanks for pointing this out, James! Stanislav, we should probably
>> >>>>>>> add a sentence or two mentioning the KIP-226 changes somewhere in the
>> >>>>>>> KIP. Maybe in the intro section?
>> >>>>>>>
>> >>>>>>> I think it's clear that requiring the users to manually restart the
>> >>>>>>> log cleaner is not a very good solution. But it's good to know that
>> >>>>>>> it's a possibility on some older releases.
>> >>>>>>>
>> >>>>>>>> Some comments:
>> >>>>>>>> * I like the idea of having the log cleaner continue to clean as
>> >>>>>>>> many partitions as it can, skipping over the problematic ones if
>> >>>>>>>> possible.
>> >>>>>>>> * If the log cleaner thread dies, I think it should automatically
>> >>>>>>>> be revived.
>> >>>>>>>> Your KIP attempts to do that by catching exceptions during
>> >>>>>>>> execution, but I think we should go all the way and make sure that
>> >>>>>>>> a new one gets created, if the thread ever dies.
>> >>>>>>> This is inconsistent with the way the rest of Kafka works. We don't
>> >>>>>>> automatically re-create other threads in the broker if they
>> >>>>>>> terminate. In general, if there is a serious bug in the code,
>> >>>>>>> respawning threads is likely to make things worse, by putting you in
>> >>>>>>> an infinite loop which consumes resources and fires off continuous
>> >>>>>>> log messages.
>> >>>>>>>
>> >>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
>> >>>>>>>> I've seen cases where an uncleanable partition later became
>> >>>>>>>> cleanable. I unfortunately don't remember how that happened, but I
>> >>>>>>>> remember being surprised when I discovered it. It might have been
>> >>>>>>>> something like a follower was uncleanable but after a leader
>> >>>>>>>> election happened, the log truncated and it was then cleanable
>> >>>>>>>> again. I'm not sure.
>> >>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed
>> >>>>>>> File System (HDFS) and it was a constant source of user problems.
>> >>>>>>>
>> >>>>>>> What would happen is disks would just go bad over time. The DataNode
>> >>>>>>> would notice this and take them offline. But then, due to some
>> >>>>>>> "optimistic" code, the DataNode would periodically try to re-add
>> >>>>>>> them to the system. Then one of two things would happen: the disk
>> >>>>>>> would just fail immediately again, or it would appear to work and
>> >>>>>>> then fail after a short amount of time.
>> >>>>>>>
>> >>>>>>> The way the disk failed was normally having an I/O request take a
>> >>>>>>> really long time and time out.
>> >>>>>>> So a bunch of request handler threads would
>> >>>>>>> basically slam into a brick wall when they tried to access the bad
>> >>>>>>> disk, slowing the DataNode to a crawl. It was even worse in the
>> >>>>>>> second scenario, if the disk appeared to work for a while, but then
>> >>>>>>> failed. Any data that had been written on that DataNode to that disk
>> >>>>>>> would be lost, and we would need to re-replicate it.
>> >>>>>>>
>> >>>>>>> Disks aren't biological systems-- they don't heal over time. Once
>> >>>>>>> they're bad, they stay bad. The log cleaner needs to be robust
>> >>>>>>> against cases where the disk really is failing, and really is
>> >>>>>>> returning bad data or timing out.
>> >>>>>>>
>> >>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
>> >>>>>>>> format, such as:
>> >>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>> >>>>>>>> value=4
>> >>>>>>>>
>> >>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
>> >>>>>>>> long. I think the current max size is 210 characters (or maybe
>> >>>>>>>> 240-ish?). Having "uncleanable-partitions" be a list could make for
>> >>>>>>>> a very large metric. Also, having the metric come out as a csv might
>> >>>>>>>> be difficult to work with for monitoring systems. If we *did* want
>> >>>>>>>> the topic names to be accessible, what do you think of having
>> >>>>>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>> >>>>>>>> I'm not sure if LogCleanerManager is the right type, but my example
>> >>>>>>>> was that the topic and partition can be tags in the metric. That
>> >>>>>>>> will allow monitoring systems to more easily slice and dice the
>> >>>>>>>> metric. I'm not sure what the attribute for that metric would be.
>> >>>>>>>> Maybe something like
>> >>>>>>>> "uncleaned bytes" for that topic-partition? Or
>> >>>>>>>> time-since-last-clean? Or maybe even just "Value=1".
>> >>>>>>> I haven't thought about this that hard, but do we really need the
>> >>>>>>> uncleanable topic names to be accessible through a metric? It seems
>> >>>>>>> like the admin should notice that uncleanable partitions are present,
>> >>>>>>> and then check the logs?
>> >>>>>>>
>> >>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>> >>>>>>>> indicates that the disk is having problems. I'm not sure that is
>> >>>>>>>> the case. For the 4 JIRAs that you mentioned about log cleaner
>> >>>>>>>> problems, all of them are partition-level scenarios that happened
>> >>>>>>>> during normal operation. None of them were indicative of disk
>> >>>>>>>> problems.
>> >>>>>>> I don't think this is a meaningful comparison. In general, we don't
>> >>>>>>> accept JIRAs for hard disk problems that happen on a particular
>> >>>>>>> cluster. If someone opened a JIRA that said "my hard disk is having
>> >>>>>>> problems" we could close that as "not a Kafka bug." This doesn't
>> >>>>>>> prove that disk problems don't happen, but just that JIRA isn't the
>> >>>>>>> right place for them.
>> >>>>>>>
>> >>>>>>> I do agree that the log cleaner has had a significant number of
>> >>>>>>> logic bugs, and that we need to be careful to limit their impact.
>> >>>>>>> That's one reason why I think that a threshold of "number of
>> >>>>>>> uncleanable logs" is a good idea, rather than just failing after one
>> >>>>>>> IOException. In all the cases I've seen where a user hit a logic bug
>> >>>>>>> in the log cleaner, it was just one partition that had the issue. We
>> >>>>>>> also should increase test coverage for the log cleaner.
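[For reference, the tag-style naming James proposes above is just a JMX ObjectName with extra key properties; a minimal sketch of how such a name string is assembled (the tag set itself is illustrative, not a committed metric):]

```python
def jmx_object_name(domain, **tags):
    """Build a JMX-style ObjectName string: 'domain:key1=value1,key2=value2'.

    Python keeps keyword-argument order (3.7+), so tags appear as written.
    """
    props = ",".join(f"{key}={value}" for key, value in tags.items())
    return f"{domain}:{props}"
```

For example, `jmx_object_name("kafka.log", type="LogCleanerManager", topic="topic1", partition=2)` produces the per-partition name sketched in the email above.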
>> >>>>>>>
>> >>>>>>>> * About marking disks as offline when exceeding a certain
>> >>>>>>>> threshold, that actually increases the blast radius of log
>> >>>>>>>> compaction failures. Currently, the uncleaned partitions are still
>> >>>>>>>> readable and writable. Taking the disks offline would impact
>> >>>>>>>> availability of the uncleanable partitions, as well as impact all
>> >>>>>>>> other partitions that are on the disk.
>> >>>>>>> In general, when we encounter I/O errors, we take the disk partition
>> >>>>>>> offline. This is spelled out in KIP-112 (
>> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>> >>>>>>> ):
>> >>>>>>>
>> >>>>>>>> - Broker assumes a log directory to be good after it starts, and
>> >>>>>>>>   marks a log directory as bad once there is an IOException when the
>> >>>>>>>>   broker attempts to access (i.e. read or write) the log directory.
>> >>>>>>>> - Broker will be offline if all log directories are bad.
>> >>>>>>>> - Broker will stop serving replicas in any bad log directory. New
>> >>>>>>>>   replicas will only be created on a good log directory.
>> >>>>>>>
>> >>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
>> >>>>>>> more optimistic than what we do for regular broker I/O, since we
>> >>>>>>> will tolerate multiple IOExceptions, not just one. But it's
>> >>>>>>> generally consistent. Ignoring errors is not. In any case, if you
>> >>>>>>> want to tolerate an unlimited number of I/O errors, you can just set
>> >>>>>>> the threshold to an infinite value (although I think that would be a
>> >>>>>>> bad idea).
>> >>>>>>>
>> >>>>>>> best,
>> >>>>>>> Colin
>> >>>>>>>
>> >>>>>>>> -James
>> >>>>>>>>
>> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
>> >>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
>> >>>>>>>>> Here is the new link:
>> >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>> >>>>>>>>>
>> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hey group,
>> >>>>>>>>>>
>> >>>>>>>>>> I created a new KIP about making log compaction more
>> >>>>>>>>>> fault-tolerant. Please give it a look here and please share what
>> >>>>>>>>>> you think, especially in regards to the points in the "Needs
>> >>>>>>>>>> Discussion" paragraph.
>> >>>>>>>>>>
>> >>>>>>>>>> KIP: KIP-346
>> >>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> >>>>>>>>>>
>> >>>>>>>>>> --
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Stanislav
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best,
>> >>>>>>>>> Stanislav
>> >>
>> >> --
>> >> Best,
>> >> Stanislav
>>
>
> --
> Best,
> Stanislav

--
Best,
Stanislav
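[Pulling together the decisions that emerged in the thread above (uncleanable partitions tracked in memory per logDir, skipped when selecting logs to compact, counted by a per-logDir gauge, no retries until broker restart, Int.MaxValue as the default cap), a rough sketch of the bookkeeping, with hypothetical class and method names rather than the actual patch:]

```python
from collections import defaultdict

class UncleanablePartitions:
    """Hypothetical sketch: tracks partitions the log cleaner should skip.

    Entries live in memory only, so a broker restart clears them, matching
    the "no retries until restart" behavior described in the thread.
    """

    def __init__(self, max_uncleanable=2**31 - 1):  # Int.MaxValue default
        self.max_uncleanable = max_uncleanable
        self._by_log_dir = defaultdict(set)

    def mark(self, log_dir, topic_partition):
        """Mark a partition uncleanable; return True if the cap is exceeded.

        A cap like max.uncleanable.partitions could trip here, e.g. by
        marking the whole logDir offline.
        """
        self._by_log_dir[log_dir].add(topic_partition)
        return self.count(log_dir) > self.max_uncleanable

    def is_uncleanable(self, log_dir, topic_partition):
        return topic_partition in self._by_log_dir[log_dir]

    def count(self, log_dir):
        # Value a per-logDir uncleanable-partitions-count gauge would report.
        return len(self._by_log_dir[log_dir])

    def cleanable(self, log_dir, partitions):
        # "When evaluating which logs to compact, skip the marked ones."
        return [p for p in partitions if not self.is_uncleanable(log_dir, p)]
```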