[jira] [Created] (CASSANDRA-2498) Improve read performance in update-intensive workload

Jonathan Ellis (JIRA) Mon, 18 Apr 2011 13:00:47 -0700

Improve read performance in update-intensive workload
-----------------------------------------------------


                 Key: CASSANDRA-2498
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2498
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Jonathan Ellis
            Priority: Minor
             Fix For: 1.0


Read performance in an update-heavy environment relies heavily on compaction to 
maintain good throughput. (This is not the case for workloads where rows are 
only inserted once, because the bloom filter keeps us from having to check 
sstables unnecessarily.)

Very early versions of Cassandra attempted to mitigate this by checking 
sstables in descending generation order (mostly equivalent to descending 
mtime): once all the requested columns were found, it would not check any older 
sstables.

This was incorrect, because a column's data timestamp does not necessarily 
correspond to the age of the sstable that holds it: compaction has the side 
effect of "refreshing" old data into a newer sstable, and hinted handoff may 
send us data older than what we already have.
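
The failure mode can be reproduced with a toy model. The sketch below is 
illustrative only: Column, SSTable, read_by_generation and read_all are 
hypothetical names, not Cassandra classes, and an sstable is modelled as a 
plain dict of rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # client-specified write timestamp


@dataclass
class SSTable:
    generation: int  # higher = written to disk more recently
    rows: dict       # row key -> {column name: Column}


def read_by_generation(sstables, key, names):
    """Early-exit read in descending generation order (the flawed approach)."""
    found = {}
    for sst in sorted(sstables, key=lambda s: s.generation, reverse=True):
        for name, col in sst.rows.get(key, {}).items():
            if name in names and name not in found:
                found[name] = col
        if len(found) == len(names):
            break  # stops too early: an older sstable may hold newer data
    return found


def read_all(sstables, key, names):
    """Reference read: check every sstable, keep the newest timestamp."""
    found = {}
    for sst in sstables:
        for name, col in sst.rows.get(key, {}).items():
            if name in names and (
                name not in found or col.timestamp > found[name].timestamp
            ):
                found[name] = col
    return found


# Generation 2 holds a fresh write (timestamp 200). Generation 3 is a
# compaction output that only carries an older value (timestamp 100).
fresh = SSTable(2, {"k": {"c": Column("c", "new", 200)}})
compacted = SSTable(3, {"k": {"c": Column("c", "old", 100)}})
```

Here read_by_generation([fresh, compacted], "k", {"c"}) stops at generation 3 
and returns the stale value, while read_all returns the value written at 
timestamp 200.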

Instead, we could create a per-sstable piece of metadata containing the most 
recent (client-specified) timestamp for any column in the sstable.  We could 
then check sstables in descending order of this timestamp instead, and perform 
a similar optimization: once all the requested columns have been found, if the 
maximum client timestamps of the remaining sstables are all older than the 
oldest column in the result set so far, we don't need to look further. Since 
under almost every workload the client timestamps of data in a given sstable 
will tend to be similar, we expect this to cut the number of sstables checked 
in proportion to how frequently each column in the row is updated. (If each 
column is updated with each write, we only have to check a single sstable.)
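
A minimal sketch of the proposed read path follows, assuming each sstable 
exposes a max_timestamp computed when it is written. Column, SSTable, make 
and read_named_columns are hypothetical names chosen for illustration, not 
existing Cassandra APIs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # client-specified write timestamp


@dataclass
class SSTable:
    rows: dict  # row key -> {column name: Column}
    max_timestamp: float = field(init=False)

    def __post_init__(self):
        # Per-sstable metadata: newest client timestamp of any column.
        self.max_timestamp = max(
            (c.timestamp for cols in self.rows.values() for c in cols.values()),
            default=float("-inf"),
        )


def make(key, cols):
    """Build a one-row sstable from {name: (value, timestamp)}."""
    return SSTable({key: {n: Column(n, v, ts) for n, (v, ts) in cols.items()}})


def read_named_columns(sstables, key, names):
    """Return ({name: Column}, number of sstables actually checked)."""
    found, checked = {}, 0
    if not names:
        return found, checked
    for sst in sorted(sstables, key=lambda s: s.max_timestamp, reverse=True):
        if len(found) == len(names):
            oldest_found = min(c.timestamp for c in found.values())
            # This sstable and all later ones are strictly older than
            # everything in the result, so nothing can supersede it.
            if sst.max_timestamp < oldest_found:
                break
        checked += 1
        for name, col in sst.rows.get(key, {}).items():
            if name in names and (
                name not in found or col.timestamp > found[name].timestamp
            ):
                found[name] = col
    return found, checked


# Every write updates both columns: one sstable per flush.
full_updates = [make("k", {"c1": ("x", ts), "c2": ("y", ts)})
                for ts in (100, 200, 300)]

# Columns updated at different times.
partial = [
    make("k", {"c1": ("a", 300)}),
    make("k", {"c2": ("b", 200)}),
    make("k", {"c1": ("old", 100), "c2": ("old", 100)}),
]
```

With full_updates, a read of {"c1", "c2"} checks one sstable; with partial it 
checks two and skips the oldest. The comparison is strict so that sstables 
whose newest timestamp ties with the oldest found column are still checked.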

This may also be useful information when deciding which SSTables to compact.

(Note that this optimization is only appropriate for named-column queries, not 
slice queries, since we don't know what non-overlapping columns may exist in 
older sstables.)
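
A small counterexample makes the slice-query restriction concrete. This is a 
toy model under stated assumptions: an sstable is a hypothetical 
(rows, max_timestamp) pair, and slice_query is an illustrative helper, not a 
Cassandra API.

```python
def slice_query(sstables, key, start, end):
    """Merge every sstable: each may hold columns the others lack."""
    merged = {}
    for rows, _max_timestamp in sstables:
        for name, (value, ts) in rows.get(key, {}).items():
            if start <= name <= end and (
                name not in merged or ts > merged[name][1]
            ):
                merged[name] = (value, ts)
    return dict(sorted(merged.items()))


# The newest sstable overwrites "a" and adds "c", but never touches "b".
newest = ({"k": {"a": ("a2", 300), "c": ("c2", 300)}}, 300)
oldest = ({"k": {"a": ("a1", 100), "b": ("b1", 100)}}, 100)
```

Reading only newest for the slice "a".."c" would return "a" and "c" and 
silently miss "b", even though the sstable holding "b" is older than every 
column found so far.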

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
