[ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-9491:
--------------------------------------
    Assignee: Yuki Morishita

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>            Key: CASSANDRA-9491
>            URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>        Project: Cassandra
>     Issue Type: Improvement
>       Reporter: Robert Stupp
>       Assignee: Yuki Morishita
>       Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool repair -pr}}), the statistics show:
> * a huge increase in live-sstable-count (approximately doubling it),
> * a huge number of memtable-switches (approx. 1200 per node per minute),
> * a huge number of flushes (approx. 25 per node per minute),
> * memtable-data-size dropping to (nearly) 0,
> * a huge number of compaction-completed-tasks (60k per minute) and compacted-bytes (25 GB per minute).
> These numbers do not match the tiny workload the cluster actually has.
> The reason for these (IMO crazy) numbers is the way sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (from {{nodetool repair -pr}}), a repair of the ranges returned by {{getLocalPrimaryRanges(keyspace)}} is initiated. I'll express the scheme in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>     foreach columnFamily
>     {
>         start async RepairJob
>         {
>             if sequentialRepair:
>                 start SnapshotTask against each endpoint (including self)
>                 send tree requests if snapshot successful
>             else: // parallel repair
>                 send tree requests
>         }
>     }
> }
> {code}
> This means that for each sequential repair, a snapshot (including all its implications such as flushes, tiny sstables and follow-up compactions) is taken for every range. That means 256 snapshots per column family per repair on each (involved) endpoint. For about 20 tables, this can mean 5120 snapshots within a very short period of time. You do not notice that amount on the file system, since the _tag_ for the snapshot is always the same - so all snapshots end up in the same directory.
> IMO it would be sufficient to snapshot only once per column family. Or am I missing something?
> So basically, change the pseudo-code to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach columnFamily
> {
>     if sequentialRepair:
>         start SnapshotTask against each endpoint (including self)
>     foreach range in ranges:
>     {
>         start async RepairJob
>         {
>             send tree requests (if snapshot successful)
>         }
>     }
> }
> {code}
> NB: The code is similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk).
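To make the proposed reordering concrete, here is a minimal Java sketch of the driver loop. {{SnapshotService}}, {{RepairJobRunner}} and {{Range}} are hypothetical placeholders standing in for the real {{SnapshotTask}}/{{RepairJob}} machinery; this only illustrates the once-per-column-family snapshot, it is not the actual Cassandra implementation.

{code}
import java.util.List;

// Minimal sketch of the proposed loop order. SnapshotService, RepairJobRunner
// and Range are illustrative placeholders, not Cassandra's real classes.
public class SequentialRepairSketch
{
    interface SnapshotService
    {
        // Take one snapshot of the column family on every involved endpoint.
        void snapshotOnAllEndpoints(String keyspace, String columnFamily, String tag);
    }

    interface RepairJobRunner
    {
        // Request Merkle trees for one (columnFamily, range) pair, reusing
        // the snapshot identified by 'tag'.
        void sendTreeRequests(String keyspace, String columnFamily, Range range, String tag);
    }

    record Range(long left, long right) {}

    private final SnapshotService snapshots;
    private final RepairJobRunner repairs;

    public SequentialRepairSketch(SnapshotService snapshots, RepairJobRunner repairs)
    {
        this.snapshots = snapshots;
        this.repairs = repairs;
    }

    public void repairPrimaryRanges(String keyspace,
                                    List<String> columnFamilies,
                                    List<Range> localPrimaryRanges,
                                    boolean sequential,
                                    String snapshotTag)
    {
        for (String cf : columnFamilies)
        {
            if (sequential)
            {
                // One snapshot per column family instead of one per vnode range:
                // with 256 vnodes this means 1 flush per table instead of 256.
                snapshots.snapshotOnAllEndpoints(keyspace, cf, snapshotTag);
            }
            for (Range range : localPrimaryRanges)
            {
                // Tree requests are still issued per range, as before.
                repairs.sendTreeRequests(keyspace, cf, range, snapshotTag);
            }
        }
    }
}
{code}

Under this scheme, with 256 vnodes and roughly 20 tables, each endpoint would take about 20 snapshots per repair instead of the ~5120 described above.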