[ 
https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-9491:
--------------------------------------
    Assignee: Yuki Morishita

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Yuki Morishita
>            Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential 
> repairs against that cluster.
> During such a sequential repair (just {{nodetool repair -pr}}), statistics 
> show:
> * a huge increase in live-sstable-count (approx. doubling),
> * a huge number of memtable-switches (approx. 1200 per node per minute),
> * a huge number of flushes (approx. 25 per node per minute),
> * memtable-data-size drops to (nearly) 0,
> * a huge number of compaction-completed-tasks (60k per minute) and 
> compacted-bytes (25GB per minute).
> These numbers do not match the tiny workload the cluster actually has.
> The reason for these (IMO crazy) numbers is the way sequential repairs 
> work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (from {{nodetool repair -pr}}), 
> a repair on the ranges from {{getLocalPrimaryRanges(keyspace)}} is initiated. 
> I'll express the flow in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>       foreach columnFamily
>       {
>               start async RepairJob
>               {
>                       if sequentialRepair:
>                       start SnapshotTask against each endpoint (including self)
>                               send tree requests if snapshot successful
>                       else // if parallel repair
>                               send tree requests
>               }
>       }
> }
> {code}
> This means that for each sequential repair, a snapshot (including all its 
> implications like flushes, tiny sstables and follow-up compactions) is taken 
> for every range. With the default of 256 vnodes, that means 256 snapshots 
> per column-family per repair on each (involved) endpoint. For about 20 
> tables, this could mean 5120 snapshots within a very short period of time. 
> You do not notice that amount on the file system, since the _tag_ for the 
> snapshot is always the same - so all snapshots end up in the same directory.
> IMO it would be sufficient to snapshot only once per column-family. Or am I 
> missing something?
> So basically changing the pseudo-code to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>       foreach columnFamily
>       {
>               if sequentialRepair:
>               start SnapshotTask against each endpoint (including self)
>               start async RepairJob
>               {
>                       send tree requests (if snapshot successful)
>               }
>       }
> }
> {code}
> NB: The code's similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, 
> trunk)
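> For illustration, here is a minimal sketch of the proposed ordering in Java: 
> snapshot each column-family once per endpoint, then send the per-range tree 
> requests. This is not actual Cassandra repair code; {{Endpoint}}, 
> {{TreeRequestSender}} and {{repair()}} are placeholder names for the purpose 
> of the sketch:
> {code}
> import java.util.List;
> 
> // Placeholder types standing in for the real repair machinery.
> public class SequentialRepairSketch
> {
>     interface Endpoint
>     {
>         // take a snapshot of the given column family under the given tag
>         void snapshot(String columnFamily, String tag);
>     }
> 
>     interface TreeRequestSender
>     {
>         // request merkle trees for one column family and one range
>         void requestTrees(String columnFamily, String range, List<Endpoint> endpoints);
>     }
> 
>     public static void repair(List<String> ranges,
>                               List<String> columnFamilies,
>                               List<Endpoint> endpoints,
>                               TreeRequestSender sender,
>                               boolean sequential)
>     {
>         // Snapshot once per column family (per endpoint), not once per range.
>         if (sequential)
>         {
>             String tag = "repair-" + System.currentTimeMillis();
>             for (String cf : columnFamilies)
>                 for (Endpoint ep : endpoints)
>                     ep.snapshot(cf, tag);
>         }
> 
>         // Tree (validation) requests still go out per range and per column family.
>         for (String range : ranges)
>             for (String cf : columnFamilies)
>                 sender.requestTrees(cf, range, endpoints);
>     }
> }
> {code}
> With that ordering, ~20 tables would need ~20 snapshots per endpoint per 
> repair instead of ~5120.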



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
