Hey group, I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
Best,
Stanislav

On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey, Ray
>
> Thanks for pointing that out, it's fixed now
>
> Best,
> Stanislav
>
> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
>
>> Thanks. Can you fix the link in the "KIPs under discussion" table on
>> the main KIP landing page
>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>>
>> I tried, but the Wiki won't let me.
>>
>> -Ray
>>
>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>> > Hey guys,
>> >
>> > @Colin - good point. I added some sentences mentioning recent
>> > improvements in the introductory section.
>> >
>> > *Disk Failure* - I tend to agree with what Colin said - once a disk
>> > fails, you don't want to work with it again. As such, I've changed my
>> > mind and believe that we should mark the LogDir (assuming it's a disk)
>> > as offline on the first `IOException` encountered. This is the
>> > LogCleaner's current behavior. We shouldn't change that.
>> >
>> > *Respawning Threads* - I believe we should never re-spawn a thread.
>> > The correct approach in my mind is to either have it stay dead or
>> > never let it die in the first place.
>> >
>> > *Uncleanable-partition-names metric* - Colin is right, this metric is
>> > unneeded. Users can monitor the `uncleanable-partitions-count` metric
>> > and inspect logs.
>> >
>> > Hey Ray,
>> >
>> >> 2) I'm 100% with James in agreement with setting up the LogCleaner to
>> >> skip over problematic partitions instead of dying.
>> >
>> > I think we can do this for every exception that isn't `IOException`.
>> > This will future-proof us against bugs in the system and potential
>> > other errors. Protecting yourself against unexpected failures is
>> > always a good thing in my mind, but I also think that protecting
>> > yourself against bugs in the software is sort of clunky. What does
>> > everybody think about this?
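To make the skip-on-unexpected-exception idea concrete, here is a rough sketch of what "quarantine the partition instead of killing the cleaner" could look like. All class, method, and attribute names below are illustrative only, not the actual LogCleaner internals:

```python
# Sketch: skip problematic partitions instead of dying.
# Hypothetical names -- not real Kafka code.

class CleanerSketch:
    def __init__(self):
        self.uncleanable_partitions = set()

    def clean_once(self, partitions, clean_fn):
        """Try to clean each partition; quarantine ones that throw
        unexpected (non-I/O) errors instead of killing the thread."""
        for tp in partitions:
            if tp in self.uncleanable_partitions:
                continue  # already quarantined, skip it
            try:
                clean_fn(tp)
            except IOError:
                # Disk-level problem: propagate so the log dir can be
                # marked offline (the current broker behavior).
                raise
            except Exception:
                # Unexpected logic error: quarantine just this partition
                # and keep the cleaner thread alive.
                self.uncleanable_partitions.add(tp)

    @property
    def uncleanable_partition_count(self):
        return len(self.uncleanable_partitions)
```

The key design point being debated: `IOError` re-raises (disks don't heal), while everything else only disables the single offending partition.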
>> >
>> >> 4) The only improvement I can think of is that if such an
>> >> error occurs, then have the option (configuration setting?) to
>> >> create a <log_segment>.skip file (or something similar).
>> >
>> > This is a good suggestion. Have others also seen corruption be
>> > generally tied to the same segment?
>> >
>> > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
>> >
>> >> For the cleaner thread specifically, I do not think respawning will
>> >> help at all because we are more than likely to run into the same
>> >> issue again which would end up crashing the cleaner. Retrying makes
>> >> sense for transient errors or when you believe some part of the
>> >> system could have healed itself, both of which I think are not true
>> >> for the log cleaner.
>> >>
>> >> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
>> >>
>> >>> <<<respawning threads is likely to make things worse, by putting
>> >>> you in an infinite loop which consumes resources and fires off
>> >>> continuous log messages.
>> >>>
>> >>> Hi Colin. In case it could be relevant, one way to mitigate this
>> >>> effect is to implement a backoff mechanism (if a second respawn is
>> >>> to occur then wait for 1 minute before doing it; then if a third
>> >>> respawn is to occur wait for 2 minutes before doing it; then 4
>> >>> minutes, 8 minutes, etc. up to some max wait time).
>> >>>
>> >>> I have no opinion on whether respawn is appropriate or not in this
>> >>> context, but a mitigation like the increasing backoff described
>> >>> above may be relevant in weighing the pros and cons.
>> >>>
>> >>> Ron
>> >>>
>> >>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
>> >>>
>> >>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>> >>>>> Hi Stanislav! Thanks for this KIP!
>> >>>>>
>> >>>>> I agree that it would be good if the LogCleaner were more
>> >>>>> tolerant of errors.
>> >>>>> Currently, as you said, once it dies, it stays dead.
>> >>>>>
>> >>>>> Things are better now than they used to be. We have the metric
>> >>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>> >>>>> which we can use to tell us if the threads are dead. And as of
>> >>>>> 1.1.0, we have KIP-226, which allows you to restart the log
>> >>>>> cleaner thread without requiring a broker restart:
>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>> >>>>> I've only read about this, I haven't personally tried it.
>> >>>>
>> >>>> Thanks for pointing this out, James! Stanislav, we should probably
>> >>>> add a sentence or two mentioning the KIP-226 changes somewhere in
>> >>>> the KIP. Maybe in the intro section?
>> >>>>
>> >>>> I think it's clear that requiring the users to manually restart
>> >>>> the log cleaner is not a very good solution. But it's good to know
>> >>>> that it's a possibility on some older releases.
>> >>>>
>> >>>>> Some comments:
>> >>>>> * I like the idea of having the log cleaner continue to clean as
>> >>>>> many partitions as it can, skipping over the problematic ones if
>> >>>>> possible.
>> >>>>>
>> >>>>> * If the log cleaner thread dies, I think it should automatically
>> >>>>> be revived. Your KIP attempts to do that by catching exceptions
>> >>>>> during execution, but I think we should go all the way and make
>> >>>>> sure that a new one gets created if the thread ever dies.
>> >>>>
>> >>>> This is inconsistent with the way the rest of Kafka works. We
>> >>>> don't automatically re-create other threads in the broker if they
>> >>>> terminate.
>> >>>> In general, if there is a serious bug in the code, respawning
>> >>>> threads is likely to make things worse, by putting you in an
>> >>>> infinite loop which consumes resources and fires off continuous
>> >>>> log messages.
>> >>>>
>> >>>>> * It might be worth trying to re-clean the uncleanable
>> >>>>> partitions. I've seen cases where an uncleanable partition later
>> >>>>> became cleanable. I unfortunately don't remember how that
>> >>>>> happened, but I remember being surprised when I discovered it. It
>> >>>>> might have been something like a follower was uncleanable but
>> >>>>> after a leader election happened, the log truncated and it was
>> >>>>> then cleanable again. I'm not sure.
>> >>>>
>> >>>> James, I disagree. We had this behavior in the Hadoop Distributed
>> >>>> File System (HDFS) and it was a constant source of user problems.
>> >>>>
>> >>>> What would happen is disks would just go bad over time. The
>> >>>> DataNode would notice this and take them offline. But then, due to
>> >>>> some "optimistic" code, the DataNode would periodically try to
>> >>>> re-add them to the system. Then one of two things would happen:
>> >>>> the disk would just fail immediately again, or it would appear to
>> >>>> work and then fail after a short amount of time.
>> >>>>
>> >>>> The way the disk failed was normally having an I/O request take a
>> >>>> really long time and time out. So a bunch of request handler
>> >>>> threads would basically slam into a brick wall when they tried to
>> >>>> access the bad disk, slowing the DataNode to a crawl. It was even
>> >>>> worse in the second scenario: if the disk appeared to work for a
>> >>>> while but then failed, any data that had been written on that
>> >>>> DataNode to that disk would be lost, and we would need to
>> >>>> re-replicate it.
>> >>>>
>> >>>> Disks aren't biological systems -- they don't heal over time. Once
>> >>>> they're bad, they stay bad.
>> >>>> The log cleaner needs to be robust against cases where the disk
>> >>>> really is failing, and really is returning bad data or timing out.
>> >>>>
>> >>>>> * For your metrics, can you spell out the full metric in
>> >>>>> JMX-style format, such as:
>> >>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>> >>>>> value=4
>> >>>>>
>> >>>>> * For "uncleanable-partitions": topic-partition names can be very
>> >>>>> long. I think the current max size is 210 characters (or maybe
>> >>>>> 240-ish?). Having "uncleanable-partitions" be a list could make
>> >>>>> for a very large metric. Also, having the metric come out as a
>> >>>>> CSV might be difficult to work with for monitoring systems. If we
>> >>>>> *did* want the topic names to be accessible, what do you think of
>> >>>>> having something like
>> >>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>> >>>>> I'm not sure if LogCleanerManager is the right type, but my
>> >>>>> example was that the topic and partition can be tags in the
>> >>>>> metric. That will allow monitoring systems to more easily slice
>> >>>>> and dice the metric. I'm not sure what the attribute for that
>> >>>>> metric would be. Maybe something like "uncleaned bytes" for that
>> >>>>> topic-partition? Or time-since-last-clean? Or maybe even just
>> >>>>> "Value=1".
>> >>>>
>> >>>> I haven't thought about this that hard, but do we really need the
>> >>>> uncleanable topic names to be accessible through a metric? It
>> >>>> seems like the admin should notice that uncleanable partitions are
>> >>>> present, and then check the logs?
>> >>>>
>> >>>>> * About `max.uncleanable.partitions`, you said that this likely
>> >>>>> indicates that the disk is having problems. I'm not sure that is
>> >>>>> the case. For the 4 JIRAs that you mentioned about log cleaner
>> >>>>> problems, all of them are partition-level scenarios that happened
>> >>>>> during normal operation.
>> >>>>> None of them were indicative of disk problems.
>> >>>>
>> >>>> I don't think this is a meaningful comparison. In general, we
>> >>>> don't accept JIRAs for hard disk problems that happen on a
>> >>>> particular cluster. If someone opened a JIRA that said "my hard
>> >>>> disk is having problems" we could close that as "not a Kafka bug."
>> >>>> This doesn't prove that disk problems don't happen, but just that
>> >>>> JIRA isn't the right place for them.
>> >>>>
>> >>>> I do agree that the log cleaner has had a significant number of
>> >>>> logic bugs, and that we need to be careful to limit their impact.
>> >>>> That's one reason why I think that a threshold of "number of
>> >>>> uncleanable logs" is a good idea, rather than just failing after
>> >>>> one IOException. In all the cases I've seen where a user hit a
>> >>>> logic bug in the log cleaner, it was just one partition that had
>> >>>> the issue. We also should increase test coverage for the log
>> >>>> cleaner.
>> >>>>
>> >>>>> * About marking disks as offline when exceeding a certain
>> >>>>> threshold, that actually increases the blast radius of log
>> >>>>> compaction failures. Currently, the uncleaned partitions are
>> >>>>> still readable and writable. Taking the disks offline would
>> >>>>> impact availability of the uncleanable partitions, as well as
>> >>>>> impact all other partitions that are on the disk.
>> >>>>
>> >>>> In general, when we encounter I/O errors, we take the disk
>> >>>> partition offline. This is spelled out in KIP-112
>> >>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>> >>>>
>> >>>>> - Broker assumes a log directory to be good after it starts, and
>> >>>>> mark log directory as bad once there is IOException when broker
>> >>>>> attempts to access (i.e. read or write) the log directory.
>> >>>>> - Broker will be offline if all log directories are bad.
>> >>>>> - Broker will stop serving replicas in any bad log directory.
>> >>>>> New replicas will only be created on good log directory.
>> >>>>
>> >>>> The behavior Stanislav is proposing for the log cleaner is
>> >>>> actually more optimistic than what we do for regular broker I/O,
>> >>>> since we will tolerate multiple IOExceptions, not just one. But
>> >>>> it's generally consistent. Ignoring errors is not. In any case, if
>> >>>> you want to tolerate an unlimited number of I/O errors, you can
>> >>>> just set the threshold to an infinite value (although I think that
>> >>>> would be a bad idea).
>> >>>>
>> >>>> best,
>> >>>> Colin
>> >>>>
>> >>>>> -James
>> >>>>>
>> >>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
>> >>>>>>
>> >>>>>> I renamed the KIP and that changed the link. Sorry about that.
>> >>>>>> Here is the new link:
>> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>> >>>>>>
>> >>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>> >>>>>>
>> >>>>>>> Hey group,
>> >>>>>>>
>> >>>>>>> I created a new KIP about making log compaction more
>> >>>>>>> fault-tolerant. Please give it a look here and please share
>> >>>>>>> what you think, especially in regards to the points in the
>> >>>>>>> "Needs Discussion" paragraph.
>> >>>>>>>
>> >>>>>>> KIP: KIP-346
>> >>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best,
>> >>>>>>> Stanislav
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best,
>> >>>>>> Stanislav
>
> --
> Best,
> Stanislav

--
Best,
Stanislav
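P.S. For concreteness, the capped exponential backoff Ron describes upthread (wait 1 minute before a second respawn, 2 before a third, then 4, 8, and so on, up to some max) could be sketched like this. The function name and the 60-minute cap are illustrative assumptions, not anything from the KIP:

```python
def respawn_backoff_minutes(attempt, max_wait_minutes=60):
    """Minutes to wait before respawn number `attempt` (1-based).

    The first spawn is immediate; the second waits 1 minute, and the
    delay doubles each time (1, 2, 4, 8, ...) up to `max_wait_minutes`.
    Illustrative sketch only -- not actual Kafka code.
    """
    if attempt <= 1:
        return 0
    return min(2 ** (attempt - 2), max_wait_minutes)
```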