On Wed, Mar 8, 2023 at 5:25 AM Bowen Song via dev <dev@cassandra.apache.org>
wrote:

> At the moment, when a read error, such as an unrecoverable bit error or data
> corruption, occurs in the SSTable data files, regardless of the
> disk_failure_policy configuration, manual (or to be precise, external)
> intervention is required to recover from the error.
>
> Commonly, there are two approaches to recovering from such an error:
>
>    1. The safer, but slower recovery strategy: replace the entire node.
>    2. The less safe, but faster recovery strategy: shut down the node,
>    delete the affected SSTable file(s), and then bring the node back online
>    and run repair.
>
> Based on my understanding of Cassandra, it should be possible to recover
> from such an error by marking the affected token range in the existing
> SSTable as "corrupted" and no longer reading from it (e.g. by recording it
> in a "bad block" file or in memory), and then streaming the affected token
> range from the healthy replicas. The corrupted SSTable file can then be
> removed upon the next successful compaction involving it, or alternatively
> an anti-compaction can be performed on it to remove the corrupted data.
>
> The advantages of this strategy are:
>
>    - Reduced node down time - node restart or replacement is not needed
>    - Less data streaming is required - only the affected token range
>    - Faster recovery time - less streaming and delayed compaction or
>    anti-compaction
>    - No less safe than replacing the entire node
>    - This process can be automated internally, removing the need for
>    operator inputs
>
> The disadvantages are added complexity on the SSTable read path, and that
> it may mask disk failures from an operator who is not paying attention.
>
> What do you think about this?
>

In a database with a shard manager rather than token ranges, you'd mark that
shard as dead and re-replicate it.

With the token range, if you COULD mark that token range as dead, and
re-bootstrap it (which today means a repair of the range followed by a
stream from a source that's not corrupt), you could do the same thing, but
there's no current way to mark that range as offline on a specific replica.
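
To make the idea concrete, here is a rough sketch of what "marking a range
as offline on a specific replica" could look like. Everything in it is
hypothetical: BadRangeRegistry, TokenRange and the method names are made up
for illustration and are not existing Cassandra APIs. Ranges use the
(start, end] convention and wraparound is ignored.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TokenRange:
    """Token range (start, end]: start exclusive, end inclusive. No wraparound."""
    start: int
    end: int

    def contains(self, token: int) -> bool:
        return self.start < token <= self.end


@dataclass
class BadRangeRegistry:
    """Hypothetical per-replica record of ranges that must not be read locally."""
    # SSTable name -> token ranges marked as corrupted in that SSTable
    bad: dict[str, list[TokenRange]] = field(default_factory=dict)

    def mark_corrupted(self, sstable: str, rng: TokenRange) -> None:
        """Record a corrupted range so the read path can skip it."""
        self.bad.setdefault(sstable, []).append(rng)

    def readable(self, sstable: str, token: int) -> bool:
        """True if the token does not fall in any range marked bad for this SSTable."""
        return not any(r.contains(token) for r in self.bad.get(sstable, []))

    def ranges_to_restream(self) -> list[TokenRange]:
        """Distinct ranges that would need streaming from healthy replicas."""
        unique = {r for ranges in self.bad.values() for r in ranges}
        return sorted(unique, key=lambda r: (r.start, r.end))

    def clear(self, sstable: str) -> None:
        """Forget an SSTable once it has been compacted away or anti-compacted."""
        self.bad.pop(sstable, None)
```

The read path would consult readable() before touching an SSTable, and a
background task would stream ranges_to_restream() from other replicas before
calling clear(). The hard part, which this sketch skips entirely, is making
the rest of the cluster aware that this replica is not authoritative for
those ranges in the meantime.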

Once transactional metadata (CEP-21) is done, a future version of it
should eventually allow us to move past the token ring to something that
approximates this for Cassandra (because doing so would also allow us to
expand/shrink faster, and handle other types of hot ranges better as
well).
