> there's a point at which a host limping along is better put down and replaced
I did a basic literature review and it looks like load (total program-erase cycles), disk age, and operating temperature all lead to BER increases. We don't need to build a whole model of disk failure; we could probably get a lot of mileage out of a warn / failure threshold for the number of automatic corruption repairs. Under this model, Cassandra could automatically repair X (3?) corruption events before warning a user ("time to replace this host"), and Y (10?) corruption events before forcing itself down. But it would be good to get a better sense of user expectations here. Bowen - how would you want Cassandra to handle frequent disk corruption events? -- Abe > On Mar 9, 2023, at 12:44 PM, Josh McKenzie <jmcken...@apache.org> wrote: > >> I'm not seeing any reasons why CEP-21 would make this more difficult to >> implement > I think I communicated poorly - I was just trying to point out that there's a > point at which a host limping along is better put down and replaced than > piecemeal flagging range after range dead and working around it, and there's > no immediately obvious "Correct" answer to where that point is regardless of > what mechanism we're using to hold a cluster-wide view of topology. > >> ...CEP-21 makes this sequencing safe... > For sure - I wouldn't advocate for any kind of "automated corrupt data > repair" in a pre-CEP-21 world. > > On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote: >> I'm not seeing any reasons why CEP-21 would make this more difficult to >> implement, besides the fact that it hasn't landed yet. >> >> There are two major potential pitfalls that CEP-21 would help us avoid: >> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a >> high frequency of corruption events >> 2. 
Avoid token ownership changes when attempting to stream a corrupted token >> >> I found some data supporting (1) - >> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf >> >> If we detect bit-errors and store them in system_distributed, then we need the >> capacity to throttle that load and ensure that consistency is maintained. >> >> When we attempt to rectify any bit-error by streaming data from peers, we >> implicitly take a lock on token ownership. A user needs to know that it is >> unsafe to change token ownership in a cluster that is currently in the >> process of repairing a corruption error on one of its instances' disks. >> CEP-21 makes this sequencing safe, and provides abstractions to better >> expose this information to operators. >> >> -- >> Abe >> >>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie <jmcken...@apache.org> wrote: >>> >>>> Personally, I'd like to see the fix for this issue come after CEP-21. It >>>> could be feasible to implement a fix before then that detects bit-errors >>>> on the read path and refuses to respond to the coordinator, implicitly >>>> having speculative execution handle the retry against another replica >>>> while repair of that range happens. But that feels suboptimal to me when a >>>> better framework is on the horizon. >>> I originally typed something in agreement with you but the more I think >>> about this, the more a node-local "reject queries for specific token >>> ranges" degradation profile seems like it _could_ work. I don't see an >>> obvious way to remove the need for a human-in-the-loop on fixing things in >>> a pre-CEP-21 world without opening Pandora's box (Gossip + TMD + >>> non-deterministic agreement on ownership state cluster-wide /cry). >>> >>> And even in a post-CEP-21 world you're definitely in the "at what point is >>> it better to declare a host dead and replace it" fuzzy territory where >>> there are no immediately correct answers. 
>>> >>> A system_distributed table of corrupt token ranges that are currently being >>> rejected by replicas, with a mechanism to kick off a repair of those ranges, >>> could be interesting. >>> >>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote: >>>> Thanks for proposing this discussion, Bowen. I see a few different issues >>>> here: >>>> >>>> 1. How do we safely handle corruption of a handful of tokens without >>>> taking an entire instance offline for re-bootstrap? This includes refusal >>>> to serve read requests for the corrupted token(s) and correct repair of >>>> the data. >>>> 2. How do we expose the corruption rate to operators, in a way that lets >>>> them decide whether a full disk replacement is worthwhile? >>>> 3. When CEP-21 lands, it should become feasible to support ownership >>>> draining, which would let us migrate read traffic for a given token range >>>> away from an instance where that range is corrupted. Is it worth planning >>>> a fix for this issue before CEP-21 lands? >>>> >>>> I'm also curious whether there's any existing literature on how different >>>> filesystems and storage media accommodate bit-errors (correctable and >>>> uncorrectable), so we can be consistent with those behaviors. >>>> >>>> Personally, I'd like to see the fix for this issue come after CEP-21. It >>>> could be feasible to implement a fix before then that detects bit-errors >>>> on the read path and refuses to respond to the coordinator, implicitly >>>> having speculative execution handle the retry against another replica >>>> while repair of that range happens. But that feels suboptimal to me when a >>>> better framework is on the horizon. >>>> >>>> -- >>>> Abe >>>> >>>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <dev@cassandra.apache.org> >>>>> wrote: >>>>> >>>>> Hi Jeremiah, >>>>> >>>>> I'm fully aware of that, which is why I said that deleting the affected >>>>> SSTable files is "less safe". 
>>>>> >>>>> If the "bad blocks" logic is implemented and the node aborts the current >>>>> read query when hitting a bad block, it should remain safe, as the data >>>>> in other SSTable files will not be used. The streamed data should contain >>>>> the unexpired tombstones, and that's enough to keep the data consistent >>>>> on the node. >>>>> >>>>> >>>>> Cheers, >>>>> Bowen >>>>> >>>>> >>>>> >>>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote: >>>>>> It is actually more complicated than just removing the sstable and >>>>>> running repair. >>>>>> >>>>>> In the face of expired tombstones that might be covering data in other >>>>>> sstables, the only safe way to deal with a bad sstable is to wipe the token >>>>>> range in the bad sstable and rebuild/bootstrap that range (or >>>>>> wipe/rebuild the whole node, which is usually the easier way). If there >>>>>> are expired tombstones in play, it means they could have already been >>>>>> compacted away on the other replicas, but may not have been compacted away on >>>>>> the current replica, meaning the data they cover could still be present >>>>>> in other sstables on this node. Removing the sstable will mean >>>>>> resurrecting that data. And pulling the range from other nodes does not >>>>>> help, because they may have already compacted away the tombstone, so you >>>>>> won’t get it back. >>>>>> >>>>>> TL;DR: you can’t just remove the one sstable; you have to remove all data >>>>>> in the token range covered by the sstable (aka all data that sstable may >>>>>> have had a tombstone covering). Then you can stream from the other >>>>>> nodes to get the data back. 
>>>>>> >>>>>> -Jeremiah >>>>>> >>>>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev >>>>>>> <dev@cassandra.apache.org> wrote: >>>>>>> >>>>>>> At the moment, when a read error, such as an unrecoverable bit error or >>>>>>> data corruption, occurs in the SSTable data files, regardless of the >>>>>>> disk_failure_policy configuration, manual (or to be precise, external) >>>>>>> intervention is required to recover from the error. >>>>>>> >>>>>>> Commonly, there are two approaches to recover from such an error: >>>>>>> >>>>>>> The safer, but slower recovery strategy: replace the entire node. >>>>>>> The less safe, but faster recovery strategy: shut down the node, delete >>>>>>> the affected SSTable file(s), and then bring the node back online and >>>>>>> run repair. >>>>>>> Based on my understanding of Cassandra, it should be possible to >>>>>>> recover from such an error by marking the affected token range in the >>>>>>> existing SSTable as "corrupted" and stopping reads from it (e.g. by >>>>>>> recording "bad blocks" in a file or in memory), and then streaming the >>>>>>> affected token range from the healthy replicas. The corrupted SSTable >>>>>>> file can then be removed upon the next successful compaction involving >>>>>>> it, or alternatively an anti-compaction can be performed on it to remove >>>>>>> the corrupted data. >>>>>>> >>>>>>> The advantages of this strategy are: >>>>>>> >>>>>>> Reduced node downtime - node restart or replacement is not needed >>>>>>> Less data streaming is required - only the affected token range >>>>>>> Faster recovery time - less streaming and delayed compaction or >>>>>>> anti-compaction >>>>>>> No less safe than replacing the entire node >>>>>>> This process can be automated internally, removing the need for >>>>>>> operator input >>>>>>> The disadvantages are the added complexity on the SSTable read path, and that >>>>>>> it may mask disk failures from operators who are not paying attention. 
>>>>>>> >>>>>>> What do you think about this? >>>>>>>
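The warn / force-down threshold model floated at the top of the thread, combined with the node-local "reject queries for specific token ranges" idea, can be sketched in a few lines. This is a minimal illustration only: the class and method names, the half-open token-range check, and the interpretation of the thresholds are all assumptions, and the X=3 / Y=10 values are just the placeholder numbers from the thread, not Cassandra internals.

```python
from enum import Enum

class Action(Enum):
    REPAIR = "repair"          # quarantine the range and stream it from peers
    WARN = "warn"              # keep repairing, but advise replacing the host
    FORCE_DOWN = "force_down"  # too many corruption events; stop serving

class DiskHealthTracker:
    """Counts automatic corruption repairs and quarantines bad token ranges.

    Hypothetical sketch: none of these names exist in Cassandra.
    """

    def __init__(self, warn_threshold=3, kill_threshold=10):
        self.warn_threshold = warn_threshold  # X in the thread
        self.kill_threshold = kill_threshold  # Y in the thread
        self.repairs = 0
        self.quarantined = []  # half-open (start, end) ranges being re-streamed

    def record_corruption(self, token_range):
        """Called when a bit-error is detected on the read path."""
        self.repairs += 1
        self.quarantined.append(token_range)
        if self.repairs >= self.kill_threshold:
            return Action.FORCE_DOWN
        if self.repairs >= self.warn_threshold:
            return Action.WARN
        return Action.REPAIR

    def should_serve(self, token):
        """Refuse reads for tokens inside a quarantined (still-repairing) range."""
        return not any(lo <= token < hi for lo, hi in self.quarantined)
```

A range would leave `quarantined` once streaming from healthy replicas completes; in the meantime the node keeps serving reads for every other range, which is the node-local degradation profile discussed above.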