[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

Jonathan Ellis (JIRA) Mon, 22 Feb 2016 09:21:42 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157318#comment-15157318 ]


Jonathan Ellis commented on CASSANDRA-11206:
--------------------------------------------

The offset map is written AFTER the serialized IndexInfo, but since we write 
out the size of both up front, we can still access the map without 
deserializing everything.  Here's the code from RowIndexEntry (RIE) that 
writes it out:

{code}
            out.writeUnsignedVInt(rie.position);
            out.writeUnsignedVInt(rie.promotedSize(idxSerializer));
            ...
            out.writeUnsignedVInt(rie.columnsIndex().size());
            ... [write out the IndexInfo and compute offsets map as we go] ...
            for (int off : offsets)
                out.writeInt(off);
{code}

(There is no code yet that reads it back in because CASSANDRA-9738 got put on 
the back burner.)

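Thus the offsets map starts at the total size (which covers both the 
IndexInfo and the map) minus count * sizeof(int), where count is the number 
of IndexInfo entries written above.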

So we can read the middle offset map entry, deserialize the IndexInfo it 
points to, compare with the row we're looking for, and repeat until the 
binary search is done.
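
To make the lookup concrete, here is a minimal sketch of that search over an 
in-memory buffer.  It is not the Cassandra reader (none exists yet, as noted 
above); the class name OffsetMapSearch, the write/find helpers, and the 
fixed-width IndexInfo stand-in (firstKey, lastKey, dataOffset as longs) are 
all assumptions made for illustration.  Real IndexInfo entries hold 
clustering prefixes and are variable-length, which is exactly why the 
trailing offsets are needed.  The sketch also assumes the offsets are 
relative to the start of the serialized IndexInfo block.

```java
import java.nio.ByteBuffer;

final class OffsetMapSearch {
    // Hypothetical stand-in for IndexInfo: [firstKey][lastKey][dataOffset], each a long.

    /** Lays out the entries first, then one 4-byte offset per entry (the "offsets map"). */
    static ByteBuffer write(long[][] entries) {
        int count = entries.length;
        ByteBuffer buf = ByteBuffer.allocate(count * 3 * Long.BYTES + count * Integer.BYTES);
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = buf.position();   // remember where this entry starts
            buf.putLong(entries[i][0]).putLong(entries[i][1]).putLong(entries[i][2]);
        }
        for (int off : offsets)
            buf.putInt(off);               // offsets map goes AFTER the entries
        buf.flip();
        return buf;
    }

    /** Returns the dataOffset of the entry whose [firstKey, lastKey] covers key, or -1. */
    static long find(ByteBuffer buf, int count, long key) {
        // The map starts at total size minus count * sizeof(int).
        int mapStart = buf.limit() - count * Integer.BYTES;
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            // Read only the middle offset, then only the entry it points to.
            int entryPos = buf.getInt(mapStart + mid * Integer.BYTES);
            long first = buf.getLong(entryPos);
            long last = buf.getLong(entryPos + Long.BYTES);
            if (key < first)
                hi = mid - 1;
            else if (key > last)
                lo = mid + 1;
            else
                return buf.getLong(entryPos + 2 * Long.BYTES);
        }
        return -1;
    }
}
```

Each loop iteration touches one offset and one entry, so a lookup reads about 
log2(count) entries instead of deserializing all of them.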

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>             Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the appropriate range of the partition.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which both creates a lot of GC 
> overhead (as noted in CASSANDRA-9754) and incurs non-negligible i/o activity 
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
