Re: [DISCUSS] Enhanced Disk Error Handling

Bowen Song via dev Thu, 09 Mar 2023 17:11:53 -0800

From an operator's view, I think the most reliable indicator is not thetotal count of corruption events, but the frequency of the events. Letme try to explain that over some examples:


1. many corruption events in short period of time, then nothing after that
   The disk is probably still healthy.
   The spike in corruption events could be the result of reading some
   bad blocks that hasn't been accessed for a long time
   A warning in the log is preferred.
2. sparse corruption events over many years, the total number is high
   The disk is probably still healthy.
   As long as the frequency does not have an obvious increasing trend,
   it should be fine.
   A warning in the log is preferred.
3. clusters of corruption events started recently and continues to
   happen for days or weeks
   The disk is probably faulty.
   Unless the access pattern from the application side has changed,
   this is a fairly reliable indicator that the disk has failed or is
   about to.
   Initially, a warning in the log is preferred. If this persists for
   too long (configurable number of days?), raise the severity level to
   error, and depending on the disk_failure_policy, may stop or kill
   the node.
4. many corruption events happening continuously
   The disk is probably faulty.
   Other than faulty disk or damaged data (e.g. data getting
   overwritten by a rogue application, like a virus), nothing else
   could explain this situation.
   An error in the log is preferred, and depending on the
   disk_failure_policy, may stop or kill the node.

Internally, inside Cassandra, this could be implemented as a fixednumber of scaling sized time buckets, arranged in such way that theevent frequency over different sized time window can be calculated andcompared to other recent time windows of the same size.For example: 24x hourly buckets, 30x daily buckets and 24x monthlybuckets will only need to store 78 integers, but will show thedifference between the above 4 examples.Externally, exposing those time buckets via the MBeans should besufficient, maybe an additional cumulative counter can be added too.

Failing that, a cumulative counter exposed via MBeans is fine. As anoperator, I can always deal with that in other tools, such as Prometheus.


On 09/03/2023 20:57, Abe Ratnofsky wrote:

> there's a point at which a host limping along is better put down andreplaced
I did a basic literature review and it looks like load (totalprogram-erase cycles), disk age, and operating temperature all lead toBER increases. We don't need to build a whole model of disk failure,we could probably get a lot of mileage out of a warn / failurethreshold for number of automatic corruption repairs.
Under this model, Cassandra could automatically repair X (3?)corruption events before warning a user ("time to replace this host"),and Y (10?) corruption events before forcing itself down.
But it would be good to get a better sense of user expectations here.Bowen - how would you want Cassandra to handle frequent diskcorruption events?
--
Abe
On Mar 9, 2023, at 12:44 PM, Josh McKenzie <[email protected]> wrote:
I'm not seeing any reasons why CEP-21 would make this more difficultto implement
I think I communicated poorly - I was just trying to point out thatthere's a point at which a host limping along is better put down andreplaced than piecemeal flagging range after range dead and workingaround it, and there's no immediately obvious "Correct" answer towhere that point is regardless of what mechanism we're using to holda cluster-wide view of topology.
...CEP-21 makes this sequencing safe...
For sure - I wouldn't advocate for any kind of "automated corruptdata repair" in a pre-CEP-21 world.
On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficultto implement, besides the fact that it hasn't landed yet.
There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistantto a high frequency of corruption events2. Avoid token ownership changes when attempting to stream acorrupted token
I found some data supporting (1) -https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
If we detect bit-errors and store them in system_distributed, thenwe need a capacity to throttle that load and ensure that consistencyis maintained.
When we attempt to rectify any bit-error by streaming data frompeers, we implicitly take a lock on token ownership. A user needs toknow that it is unsafe to change token ownership in a cluster thatis currently in the process of repairing a corruption error on oneof its instances' disks. CEP-21 makes this sequencing safe, andprovides abstractions to better expose this information to operators.
--
Abe
On Mar 9, 2023, at 10:55 AM, Josh McKenzie <[email protected]>wrote:
Personally, I'd like to see the fix for this issue come afterCEP-21. It could be feasible to implement a fix before then, thatdetects bit-errors on the read path and refuses to respond to thecoordinator, implicitly having speculative execution handle theretry against another replica while repair of that range happens.But that feels suboptimal to me when a better framework is on thehorizon.
I originally typed something in agreement with you but the more Ithink about this, the more a node-local "reject queries forspecific token ranges" degradation profile seems like it _could_work. I don't see an obvious way to remove the need for ahuman-in-the-loop on fixing things in a pre-CEP-21 world withoutopening pandora's box (Gossip + TMD + non-deterministic agreementon ownership state cluster-wide /cry).
And even in a post CEP-21 world you're definitely in the "at whatpoint is it better to declare a host dead and replace it" fuzzyterritory where there's no immediately correct answers.
A system_distributed table of corrupt token ranges that arecurrently being rejected by replicas with a mechanism to kick off arepair of those ranges could be interesting.
On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
Thanks for proposing this discussion Bowen. I see a few differentissues here:
1. How do we safely handle corruption of a handful of tokenswithout taking an entire instance offline for re-bootstrap? Thisincludes refusal to serve read requests for the corruptedtoken(s), and correct repair of the data.2. How do we expose the corruption rate to operators, in a waythat lets them decide whether a full disk replacement is worthwhile?3. When CEP-21 lands it should become feasible to supportownership draining, which would let us migrate read traffic for agiven token range away from an instance where that range iscorrupted. Is it worth planning a fix for this issue before CEP-21lands?
I'm also curious whether there's any existing literature on howdifferent filesystems and storage media accommodate bit-errors(correctable and uncorrectable), so we can be consistent withthose behaviors.
Personally, I'd like to see the fix for this issue come afterCEP-21. It could be feasible to implement a fix before then, thatdetects bit-errors on the read path and refuses to respond to thecoordinator, implicitly having speculative execution handle theretry against another replica while repair of that range happens.But that feels suboptimal to me when a better framework is on thehorizon.
--
Abe
On Mar 9, 2023, at 8:23 AM, Bowen Song via dev<[email protected]> wrote:
Hi Jeremiah,
I'm fully aware of that, which is why I said that deleting theaffected SSTable files is "less safe".
If the "bad blocks" logic is implemented and the node abort thecurrent read query when hitting a bad block, it should remainsafe, as the data in other SSTable files will not be used. Thestreamed data should contain the unexpired tombstones, and that'senough to keep the data consistent on the node.
Cheers,
Bowen


On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstableand running repair.
In the face of expired tombstones that might be covering data inother sstables the only safe way to deal with a bad sstable iswipe the token range in the bad sstable and rebuild/bootstrapthat range (or wipe/rebuild the whole node which is usually theeasier way). If there are expired tombstones in play, it meansthey could have already been compacted away on the otherreplicas, but may not have compacted away on the currentreplica, meaning the data they cover could still be present inother sstables on this node. Removing the sstable will meanresurrecting that data. And pulling the range from other nodesdoes not help because they can have already compacted away thetombstone, so you won’t get it back.
Tl;DR you can’t just remove the one sstable you have to removeall data in the token range covered by the sstable (aka all datathat sstable may have had a tombstone covering). Then you canstream from the other nodes to get the data back.
-Jeremiah
On Mar 8, 2023, at 7:24 AM, Bowen Song viadev<[email protected]><mailto:[email protected]>wrote:
At the moment, when a read error, such as unrecoverable biterror or data corruption, occurs in the SSTable data files,regardless of the disk_failure_policy configuration, manual (orto be precise, external) intervention is required to recoverfrom the error.
Commonly, there's two approach to recover from such error:

 1. The safer, but slower recover strategy: replace the entire
    node.
 2. The less safe, but faster recover strategy: shut down the
    node, delete the affected SSTable file(s), and then bring
    the node back online and run repair.
Based on my understanding of Cassandra, it should be possibleto recover from such error by marking the affected token rangein the existing SSTable as "corrupted" and stop reading fromthem (e.g. creating a "bad block" file or in memory), and thenstreaming the affected token range from the healthy replicas.The corrupted SSTable file can then be removed upon the nextsuccessful compaction involving it, or alternatively ananti-compaction is performed on it to remove the corrupted data.
The advantage of this strategy is:

  * Reduced node down time - node restart or replacement is not
    needed
  * Less data streaming is required - only the affected token range
  * Faster recovery time - less streaming and delayed
    compaction or anti-compaction
  * No less safe than replacing the entire node
  * This process can be automated internally, removing the need
    for operator inputs
The disadvantage is added complexity on the SSTable read pathand it may mask disk failures from the operator who is notpaying attention to it.
What do you think about this?

Re: [DISCUSS] Enhanced Disk Error Handling

Reply via email to