[ 
https://issues.apache.org/jira/browse/CASSANDRA-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187597#comment-14187597
 ] 

John Sumsion commented on CASSANDRA-8169:
-----------------------------------------

I don't care that much about marking-things-unrepaired-in-the-bitrot-case 
because I believe it's easier to just replace a bitrot-susceptible node than it 
is to repair around the bitrot.

My main motivation in submitting this ticket is to make sure there is as 
lightweight a mechanism as possible (read-only, low-throughput) for 
periodically verifying that ALL data can be read, and failing the node as early 
as possible to stay ahead of the replacement curve.

The 'scrub' tool is not good because it rewrites all the data.  The 'repair' 
tool is not good either because, with the move toward incremental-ness 
(awesome, btw), it does not aggressively read all the data.  If a 'validate' 
tool existed, and if it triggered the 'disk_failure_policy' properly in all 
cases of corrupt data files (sstables, etc), then that is what I want.  The 
likelihood of cascading bitrot across boxes is not something I thought needed 
any attention.
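
To make the read-only, low-throughput idea concrete, here is a minimal sketch 
in Python.  It is not Cassandra code and not a proposed implementation: the 
function names, the throughput cap, and the 'expected' map of known-good 
checksums are all hypothetical.  It only shows the shape of the check: read 
every byte, never write, stay under a rate limit, and report files that fail.

```python
import time
import zlib


def crc32_throttled(path, max_bytes_per_sec=1_000_000, chunk_size=64 * 1024):
    """Read the whole file (read-only) and return its CRC32.

    Throughput is capped at roughly max_bytes_per_sec so the scan does not
    compete with normal traffic. Both defaults are arbitrary assumptions.
    """
    crc = 0
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
            total += len(chunk)
            # Sleep whenever we are ahead of the allowed rate.
            delay = total / max_bytes_per_sec - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)
    return crc & 0xFFFFFFFF


def find_corrupt_files(expected, max_bytes_per_sec=1_000_000):
    """Return the paths whose contents no longer match the known-good CRC32.

    'expected' is a hypothetical dict of path -> CRC32 recorded when the
    file was written. Unreadable files are reported as corrupt too.
    """
    bad = []
    for path, want in expected.items():
        try:
            got = crc32_throttled(path, max_bytes_per_sec)
        except OSError:
            bad.append(path)
            continue
        if got != want:
            bad.append(path)
    return bad
```

In this sketch a non-empty result from find_corrupt_files() is the point at 
which a real tool would apply the 'disk_failure_policy' and fail the node.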

> Background bitrot detector to avoid client exposure
> ---------------------------------------------------
>
>                 Key: CASSANDRA-8169
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: John Sumsion
>
> With a lot of static data sitting in SSTables, and with only a relatively 
> small add/edit rate, incremental repair sounds very good.  However, there is 
> one significant cost to switching away from full repair.
> If/when bitrot corrupts an SSTable, there is nothing standing between a user 
> query and a corruption/failure-response event except for the other replicas.  
> This combined with a rolling restart or upgrade can make a token range 
> non-writable via quorum CL.
> While you could argue that full repairs should be scheduled on a longer-term 
> regular basis, I don't really care about all the repair overhead, I just want 
> something that can run ahead of user queries whose only responsibility is to 
> detect bitrot, so that I can replace nodes in an aggressive way instead of 
> having it be a failure-response situation.
> This bitrot detector need not incur the full cross-cluster cost of repair, 
> and so would be less of a burden to run periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
