[jira] [Comment Edited] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

Ariel Weisberg (JIRA) Tue, 01 Mar 2016 11:50:29 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174309#comment-15174309
 ]


Ariel Weisberg edited comment on CASSANDRA-11206 at 3/1/16 7:49 PM:
--------------------------------------------------------------------

I'm summarizing to make sure I remember correctly what the read path for a 
table looks like on a key cache miss.
1. Binary search the index summary to find the location of the partition index 
entry in the index
2. Look up the index entry, which may just be a pointer to the data file, or it 
may be a sampled index of rows in the partition
3. Look up the partition contents based on the index entry
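
To make the three steps concrete, here is a minimal Python sketch of them. It is 
a toy model, not Cassandra's code: the index is a sorted in-memory list, the 
summary samples every fourth key, and every name in it ({{build_summary}}, 
{{read_on_key_cache_miss}}, {{SAMPLE_INTERVAL}}) is made up for illustration.

```python
from bisect import bisect_right

# Toy model: the "index" is a sorted list of (partition_key, entry) pairs and
# the "summary" samples every SAMPLE_INTERVAL-th key. All names are hypothetical.
SAMPLE_INTERVAL = 4


def build_summary(index):
    """Return (sampled_keys, sampled_positions) for every SAMPLE_INTERVAL-th entry."""
    positions = list(range(0, len(index), SAMPLE_INTERVAL))
    return [index[p][0] for p in positions], positions


def read_on_key_cache_miss(key, summary, index, data):
    sampled_keys, positions = summary
    # Step 1: binary search the summary for the last sampled key <= key.
    slot = bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first entry in the index
    start = positions[slot]
    # Step 2: scan the index forward from the sampled position to find the entry.
    for k, entry in index[start:start + SAMPLE_INTERVAL]:
        if k == key:
            # Step 3: use the entry (here just a data-file offset) to read the partition.
            return data[entry]
        if k > key:
            break
    return None
```

The scan in step 2 is the part a key cache hit avoids in this model.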

The index summary is a sampling of the index, so most of the time we aren't 
going to get a hit into the data file, right? We have to scan the index to find 
the RIE ({{RowIndexEntry}}), and that entire process is what the key cache 
saves us from.

If I remember correctly, what I was thinking was that the key cache, instead of 
storing a copy of the RIE, would store an offset into the index that is the 
location of the RIE. Then the RIE could be accessed off heap via a memory 
mapping without doing any allocations or copies.
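
A rough Python sketch of that idea, under stated assumptions: the entry layout 
(a u64 data-file position plus a u32 length) and the names {{OffsetKeyCache}} 
and {{write_index}} are hypothetical, and a {{memoryview}} stands in for the 
memory-mapped index file. Python still allocates a small tuple on each read, so 
this only illustrates the indirection (cache holds an offset, not a copy), not 
the allocation-free access.

```python
import struct

# Hypothetical fixed layout for a serialized entry: data-file position (u64)
# and partition length (u32), big-endian. Cassandra's real format differs.
ENTRY = struct.Struct(">QI")


def write_index(entries):
    """Serialize (position, length) pairs; return the buffer and each entry's offset."""
    buf = bytearray()
    offsets = []
    for position, length in entries:
        offsets.append(len(buf))
        buf += ENTRY.pack(position, length)
    return bytes(buf), offsets


class OffsetKeyCache:
    """Maps partition key -> offset of its entry in the index, not a copy of the entry."""

    def __init__(self, index_buffer):
        # memoryview stands in for a memory-mapped index file.
        self._index = memoryview(index_buffer)
        self._offsets = {}

    def put(self, key, index_offset):
        self._offsets[key] = index_offset

    def get(self, key):
        offset = self._offsets.get(key)
        if offset is None:
            return None  # cache miss: caller falls back to summary + index scan
        # Decode the fields in place from the mapped buffer.
        return ENTRY.unpack_from(self._index, offset)
```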

I agree that for partitions that aren't indexed, the key cache could point 
straight to the data file and skip the index lookup, since there doesn't need 
to be additional data there. I don't follow the path you are describing to 
completely remove the key cache without restructuring index summaries and 
indexes into something that is either traversed differently or doesn't 
summarize/sample.

An aside: is {{RowIndexEntry}} named incorrectly? Should it be 
{{PartitionIndexEntry}}?



> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the appropriate range of the partition.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which both creates a lot of GC 
> overhead (as noted in CASSANDRA-9754) and is also non-negligible I/O activity 
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.
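
The offset-map binary search mentioned in the issue summary above can be 
sketched as follows. This is an illustration only: the serialized layout 
(length-prefixed first-row name followed by a u64 block offset) and the function 
names are assumptions, not the format or API from CASSANDRA-10314.

```python
import struct


# Hypothetical serialized IndexInfo: a length-prefixed first-row name followed
# by the block's offset within the partition. The real format is different.
def serialize(index_infos):
    """Return (blob, offset_map), where offset_map[i] is the start of entry i in blob."""
    blob = bytearray()
    offset_map = []
    for first_name, block_offset in index_infos:
        offset_map.append(len(blob))
        name = first_name.encode()
        blob += struct.pack(">H", len(name)) + name + struct.pack(">Q", block_offset)
    return bytes(blob), offset_map


def deserialize_at(blob, pos):
    """Decode the single entry that starts at byte position pos."""
    (n,) = struct.unpack_from(">H", blob, pos)
    name = blob[pos + 2:pos + 2 + n].decode()
    (block_offset,) = struct.unpack_from(">Q", blob, pos + 2 + n)
    return name, block_offset


def find_block(blob, offset_map, row_name):
    """Binary search for the last block whose first row <= row_name.

    Returns (block_offset, entries_deserialized); block_offset is None if
    row_name sorts before the first block.
    """
    lo, hi, found, decoded = 0, len(offset_map) - 1, None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        # Only the entry being compared against is deserialized.
        name, block_offset = deserialize_at(blob, offset_map[mid])
        decoded += 1
        if name <= row_name:
            found, lo = block_offset, mid + 1
        else:
            hi = mid - 1
    return found, decoded
```

With 1024 entries, a lookup in this sketch decodes at most 11 of them instead 
of all 1024, which is the log(N) behavior the summary describes.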


