[jira] [Comment Edited] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

Jonathan Ellis (JIRA) Wed, 02 Mar 2016 06:08:53 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175634#comment-15175634
 ]


Jonathan Ellis edited comment on CASSANDRA-11206 at 3/2/16 2:07 PM:
--------------------------------------------------------------------

bq. For partitions < 64k (partitions without an IndexInfo object) we could skip 
the indirection during reads via RowIndexEntry at all by extending the 
IndexSummary and directly store the offset into the data file

Since the idea here is to do something simple that we can be confident about 
shipping in 3.6 if CASSANDRA-9754 isn't ready, let's avoid making changes to 
the on disk layout.

To clarify for others following along,

bq. Remove IndexInfo from the key cache (not from the index file on disk, of 
course)

This sounds scary but it's core to the goal here: if we're going to support 
large partitions, we can't afford the overhead either of keeping the entire 
summary on heap, or of reading it from disk in the first place.  (If we're 
reading a 1KB row, then reading 2MB of summary first on a cache miss is a huge 
overhead.)  Moving the key cache off heap (CASSANDRA-9738) would have helped 
with the first but not the second.

So one approach is to go back to the old strategy of caching only the partition 
key location, and then doing the index bsearch through the offsets map every 
time.  For small partitions this will be trivial and, I hope, a negligible 
performance cost compared with the current cache.  (If not, we can look at a 
hybrid strategy, but I'd like to avoid that complexity if possible.)
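
To make the "bsearch using the offsets map" idea concrete, here is a minimal sketch. It is not Cassandra's actual code or serialization format: the class `OffsetMapSearch`, the entry layout `[int nameLength][firstName bytes][long dataOffset]`, and the use of strings as clustering names are all assumptions made for illustration. The point is only that a search touches roughly log2(N) serialized entries instead of deserializing all N.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not Cassandra's actual classes or on-disk format. */
final class OffsetMapSearch {
    /** Counts entry reads so the log(N) behaviour can be observed. */
    static int deserializations = 0;

    /**
     * Serializes entries as [int nameLength][firstName bytes][long dataOffset]
     * and records the start position of each entry in offsetsOut.
     */
    static ByteBuffer serialize(List<String> firstNames, int[] offsetsOut) {
        int size = 0;
        for (String n : firstNames) {
            size += 4 + n.getBytes(StandardCharsets.UTF_8).length + 8;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (int i = 0; i < firstNames.size(); i++) {
            offsetsOut[i] = buf.position();
            byte[] name = firstNames.get(i).getBytes(StandardCharsets.UTF_8);
            buf.putInt(name.length).put(name).putLong(i * 65536L);
        }
        return buf;
    }

    /** Reads only the firstName of the entry that starts at the given position. */
    static String firstNameAt(ByteBuffer buf, int position) {
        deserializations++;
        int len = buf.getInt(position);
        byte[] name = new byte[len];
        for (int i = 0; i < len; i++) {
            name[i] = buf.get(position + 4 + i);
        }
        return new String(name, StandardCharsets.UTF_8);
    }

    /** Returns the index of the last block whose firstName <= target, or -1 if none. */
    static int findBlock(ByteBuffer buf, int[] offsets, String target) {
        int lo = 0, hi = offsets.length - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstNameAt(buf, offsets[mid]).compareTo(target) <= 0) {
                result = mid;   // candidate block; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 1024; i++) {
            names.add(String.format("row%05d", i * 10));
        }
        int[] offsets = new int[names.size()];
        ByteBuffer buf = serialize(names, offsets);
        int block = findBlock(buf, offsets, "row00105");
        System.out.println("block=" + block + " entriesRead=" + deserializations);
    }
}
```

In this sketch only the offsets array has to be held in memory; the entries themselves stay serialized and are read on demand.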

bq. what I was thinking was that the key cache instead of storing a copy of the 
RIE it would store an offset into the index that is the location of the RIE. 
Then the RIE could be accessed off heap via a memory mapping without doing any 
allocations or copies

I was thinking that even the offsets alone for a 4GB partition are going to be 
256KB, so we don't want to cache the entire offsets map.  But the other side 
of that is that if you have a bunch of 4GB partitions you won't have very many 
of them.  16TB of data would be 1GB of offsets, which is within the bounds of 
reasonable when off heap.  And your approach may require fewer logic changes 
than the one above, since we're still "caching" the entire summary, sort of, 
and only adding an extra indirection to read the IndexInfo entries.  So that 
might well be simpler.
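
The figures above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: one offset per 64KB index block (the default mentioned in the issue description) and 4 bytes per offset, which is what the 256KB figure implies. The class and method names are hypothetical.

```java
/** Hypothetical helper for sizing an offsets map; names are illustrative only. */
final class OffsetMapSizing {
    /** Bytes of offsets needed for one partition, rounding up to whole blocks. */
    static long offsetsBytes(long partitionBytes, long blockBytes, int bytesPerOffset) {
        long blocks = (partitionBytes + blockBytes - 1) / blockBytes;
        return blocks * bytesPerOffset;
    }

    /** Bytes of offsets for a data set made up of equally sized partitions. */
    static long totalOffsetsBytes(long dataBytes, long partitionBytes,
                                  long blockBytes, int bytesPerOffset) {
        long partitions = dataBytes / partitionBytes;
        return partitions * offsetsBytes(partitionBytes, blockBytes, bytesPerOffset);
    }

    public static void main(String[] args) {
        long kb = 1L << 10, gb = 1L << 30, tb = 1L << 40;
        // 4GB / 64KB = 65,536 blocks; at 4 bytes each that is 256KB.
        System.out.println(offsetsBytes(4 * gb, 64 * kb, 4) / kb + " KB per 4GB partition");
        // 16TB / 4GB = 4,096 partitions; 4,096 * 256KB = 1GB.
        System.out.println(totalOffsetsBytes(16 * tb, 4 * gb, 64 * kb, 4) / gb + " GB per 16TB");
    }
}
```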

Edit: but switching to a per-row cache (from per-partition) would be a much 
bigger change and I don't see the performance implications as straightforward 
at all, so let's not do that.


> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which both creates a lot of GC 
> overhead (as noted in CASSANDRA-9754) and involves non-negligible i/o activity 
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)