[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-06-20 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339248#comment-15339248
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

bq. RowIndexEntry$serializedSize used to return the size of the index for the 
entire row.
The meaning of this method changed, but it hasn't been renamed accordingly - my 
bad. It now returns only the serialized size of these fields, i.e. without the 
actual "index payload".

bq. Javadoc for IndexInfo
The only really new thing in the 3.0 index format is the table with the offsets 
to the IndexInfo objects. The rest has changed mostly by switching to vint 
encoding - "hidden" by the note for "ma" _store rows natively_.

bq. Pre_C_11206_RowIndexEntry
You can safely ignore (or even remove) the Pre-C-11206 stuff in 
RowIndexEntryTest. It just felt safer to have it initially as it was meant to 
ensure that the new implementation is binary compatible with the old one.
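
On the serializedSize point above, a minimal sketch of why the result no longer depends on the IndexInfo payload: only three header fields are sized. This is not the Cassandra code. The class and method names are made up for illustration, the vint size assumes 7 payload bits per byte with a 9-byte cap, and the deletion time size is passed in as a plain number.

```java
// Sketch only: class and method names are illustrative, not Cassandra's API.
final class RowIndexHeaderSizeSketch
{
    /** Bytes needed for an unsigned vint, assuming 7 payload bits per byte and a 9-byte cap. */
    static int unsignedVIntSize(long value)
    {
        int bits = 64 - Long.numberOfLeadingZeros(value | 1); // at least 1 bit, even for 0
        return bits > 56 ? 9 : (bits + 6) / 7;
    }

    /** Size of the header fields only; the IndexInfo payload is deliberately left out. */
    static int headerSize(long headerLength, int deletionTimeSize, int columnIndexCount)
    {
        return unsignedVIntSize(headerLength)
             + deletionTimeSize
             + unsignedVIntSize(columnIndexCount);
    }

    public static void main(String[] args)
    {
        // 12 is an assumed size for a serialized deletion time (4-byte int + 8-byte long).
        System.out.println(headerSize(300L, 12, 5000));
    }
}
```

However large the IndexInfo objects get, the result only grows with the number of digits in headerLength and columnIndexCount.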

> Support large partitions on the 3.0 sstable format
> --
>
> Key: CASSANDRA-11206
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
> Project: Cassandra
> Issue Type: Improvement
> Components: Local Write-Read Paths
> Reporter: Jonathan Ellis
> Assignee: Robert Stupp
> Labels: docs-impacting
> Fix For: 3.6
>
> Attachments: 11206-gc.png, trunk-gc.png
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which both creates a lot of GC 
> overhead (as noted in CASSANDRA-9754) but is also non-negligible i/o activity 
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.
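
The idea in the description can be sketched roughly as follows: keep the serialized entries as raw bytes plus a table of start offsets, and decode only the entries the binary search actually probes. This is a toy model, not the sstable format. Entries here hold just a last clustering name as a UTF-8 string, and all class, field and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy model: variable-length entries plus an offset table, decoded lazily.
final class LazyIndexSearchSketch
{
    private final ByteBuffer serialized; // variable-length entries, back to back
    private final int[] offsets;         // offsets[i] = start of entry i in 'serialized'
    int deserializations = 0;            // how many entries were actually decoded

    LazyIndexSearchSketch(String[] lastNames)
    {
        offsets = new int[lastNames.length];
        int size = 0;
        for (String name : lastNames)
            size += 2 + name.getBytes(StandardCharsets.UTF_8).length;
        serialized = ByteBuffer.allocate(size);
        for (int i = 0; i < lastNames.length; i++)
        {
            offsets[i] = serialized.position();
            byte[] bytes = lastNames[i].getBytes(StandardCharsets.UTF_8);
            serialized.putShort((short) bytes.length).put(bytes);
        }
    }

    /** Decodes only entry i, located through the offset table. */
    private String lastNameOf(int i)
    {
        deserializations++;
        int pos = offsets[i];
        int len = serialized.getShort(pos);
        byte[] bytes = new byte[len];
        for (int b = 0; b < len; b++)
            bytes[b] = serialized.get(pos + 2 + b);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Index of the first block whose last name is >= target, or the block count if none. */
    int findBlock(String target)
    {
        int lo = 0, hi = offsets.length;
        while (lo < hi)
        {
            int mid = (lo + hi) >>> 1;
            if (lastNameOf(mid).compareTo(target) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    public static void main(String[] args)
    {
        String[] names = new String[1024];
        for (int i = 0; i < names.length; i++)
            names[i] = String.format("row%05d", i * 10);
        LazyIndexSearchSketch index = new LazyIndexSearchSketch(names);
        System.out.println("block " + index.findBlock("row00055")
                           + " found after " + index.deserializations + " deserializations");
    }
}
```

With 1024 entries the search decodes at most 11 of them, where the pre-offset-map approach would have decoded all 1024.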



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-06-13 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328785#comment-15328785
 ] 

Michael Kjellman commented on CASSANDRA-11206:
--

Going through the changes and have some questions :)

# RowIndexEntry$serializedSize used to return the size of the index for the 
entire row. As the size of the IndexInfo elements is variable, I'm 
having trouble understanding how the new/current implementation does this:
{code}
private static int serializedSize(DeletionTime deletionTime, long headerLength, int columnIndexCount)
{
    return TypeSizes.sizeofUnsignedVInt(headerLength)
         + (int) DeletionTime.serializer.serializedSize(deletionTime)
         + TypeSizes.sizeofUnsignedVInt(columnIndexCount);
}
{code}
# In the class-level Javadoc for IndexInfo there are a lot of comments about 
serialization format changes, and even a comment "Serialization format changed 
in 3.0", yet I don't see any corresponding changes in BigFormat$BigVersion.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-18 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246027#comment-15246027
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

Thanks!
Rebased again and triggered CI for that before commit.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-18 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245659#comment-15245659
 ] 

T Jake Luciani commented on CASSANDRA-11206:


Looks like the offsets are now written every time in CI.close(), thanks: 
https://github.com/apache/cassandra/commit/aad9988701ca49bc905d1933c1f4b2ecb3ba84d8

Thanks for the clarifying comments etc. I think this patch is good to commit, 
barring CI results. +1



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-16 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244100#comment-15244100
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

bq. have ColumnIndex but it's been refactored into RowIndexWriter

Yea - it doesn't look the same any more. So I went ahead and moved it into BTW 
since it's the only class from which it's being used. Could move that to 
{{o.a.c.io.sstable.format.big}}, where BTW is.

bq. BTW.addIndexBlock() the indexOffsets\[0\] is always 0

Put some comments in the code for that.

bq. explain in RowIndexEntry.create why you are returning each of the types

Put some comments in the code for that.

bq. don't need indexOffsets once you reach column_index_cache_size_in_kb

It's needed for both cases (shallow and non-shallow RIEs). Put a comment in the 
code for that.

Also ran some cstar tests to compare a version with and without the metrics 
with column_index_cache_size_in_kb 0kB and 2kB on taylor and blade_11_b:
* [2kB on taylor|http://cstar.datastax.com/tests/id/b4c3dd12-033e-11e6-8db8-0256e416528f]
* [2kB on blade_11_b|http://cstar.datastax.com/tests/id/a9c828be-033e-11e6-8db8-0256e416528f]
* [0kB on taylor|http://cstar.datastax.com/tests/id/621f0886-034b-11e6-8db8-0256e416528f]
* [0kB on blade_11_b|http://cstar.datastax.com/tests/id/6f010ad6-034b-11e6-8db8-0256e416528f]

Commits pushed and CI triggered.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-15 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243640#comment-15243640
 ] 

T Jake Luciani commented on CASSANDRA-11206:



Looks like you still have ColumnIndex, but it's been refactored into 
RowIndexWriter.
I think RowIndexWriter should be moved to, and replace, ColumnIndex, since there 
is no need to move it.

In BTW.addIndexBlock() the indexOffsets[0] is always 0 since it's always skipped 
on the null case and columnIndexCount is incremented.
It looks like it was intentional but it's not easy to understand. I think it 
works out because indexSamplesSerializedSize is 0 anyway.
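
To illustrate why the first offset ends up as 0, here is a simplified sketch of recording an offset per index block as the blocks are appended. It is not how BTW.addIndexBlock() is written (the real code skips the first write, as noted above). The class and method names are hypothetical and the blocks are opaque byte arrays.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Simplified sketch: each block's offset is the number of bytes written before it.
final class IndexOffsetsWriterSketch
{
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int[] offsets = new int[4];
    private int count = 0;

    /** Appends one serialized index block and records where it starts. */
    void addIndexBlock(byte[] serializedBlock)
    {
        if (count == offsets.length)
            offsets = Arrays.copyOf(offsets, count * 2);
        offsets[count++] = buffer.size(); // 0 for the first block: nothing written yet
        buffer.write(serializedBlock, 0, serializedBlock.length);
    }

    /** Copy of the offsets recorded so far. */
    int[] offsets()
    {
        return Arrays.copyOf(offsets, count);
    }

    public static void main(String[] args)
    {
        IndexOffsetsWriterSketch writer = new IndexOffsetsWriterSketch();
        writer.addIndexBlock(new byte[5]);
        writer.addIndexBlock(new byte[3]);
        writer.addIndexBlock(new byte[7]);
        System.out.println(Arrays.toString(writer.offsets()));
    }
}
```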

Please explain in RowIndexEntry.create why you are returning each of the types. 
It's not clear why indexSamples == null && columnIndexRow > 1 is significant.

It seems like you don't need indexOffsets once you reach 
column_index_cache_size_in_kb; it's only used for the non-indexes. Does that 
mean the offsets aren't being written to the index properly? 
In the RIE example they are all appended to the end.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-15 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243356#comment-15243356
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

Pushed another commit for the metrics. The intention of the metrics is to find 
the _sweet spot_ of {{column_index_cache_size_in_kb}}. In order to find that 
_sweet spot_ you need to know the size of the entries. The metrics below 
{{org.apache.cassandra.metrics:type=Index,name=RowIndexEntry}} are updated on 
each call to {{openWithIndex}}. But again, configuring 
{{column_index_cache_size_in_kb}} too high would result in GC pressure and 
probably in a bad key cache hit ratio.
* {{IndexedEntrySize}}: histogram of the size of IndexedEntry (every type)
* {{IndexInfoCount}}: histogram of the number of IndexInfo objects per 
IndexedEntry (every type)
* {{IndexInfoGets}}: histogram of the number of gets of IndexInfo objects 
per IndexedEntry (every type), for example the number of gets for a binary 
search
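
One way such size data could be used when picking a threshold, as a rough sketch: given a sample of observed entry sizes, compute the share that would fit under a candidate {{column_index_cache_size_in_kb}}. This is an assumption about how one might read the histogram, not something the patch provides. The class and method names are hypothetical, and whether the real comparison is inclusive is not stated in this thread.

```java
import java.util.Arrays;

// Sketch: estimate what share of entries falls at or below a candidate threshold.
final class IndexCacheSizingSketch
{
    /** Fraction of observed entry sizes (in bytes) that are at most thresholdKb kilobytes. */
    static double fractionWithin(long[] entrySizesBytes, int thresholdKb)
    {
        if (entrySizesBytes.length == 0)
            return 0.0;
        long limit = thresholdKb * 1024L;
        long fitting = Arrays.stream(entrySizesBytes).filter(size -> size <= limit).count();
        return (double) fitting / entrySizesBytes.length;
    }

    public static void main(String[] args)
    {
        long[] observed = { 512, 1500, 2048, 4096, 100_000 }; // made-up sample values
        System.out.println(fractionWithin(observed, 2));
    }
}
```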




[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-15 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242969#comment-15242969
 ] 

T Jake Luciani commented on CASSANDRA-11206:


I think it would make sense to expose a metric of what kind of index cache hit 
we have: Shallow or Regular.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-14 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241975#comment-15241975
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

bq. need to change the version of sstable
The change does not change the index sstable format - just the format of the 
saved key cache.

bq. AutoSavingCache change require a step on the users part
No, all that happens is that you lose the contents of the old saved key cache. 
This is because the change requires some more information on shallow indexed 
entries (offset in index file).

bq. 0,1,2 magic bytes
Made these constants and pushed a commit for this.
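
For readers following along, replacing the magic bytes with constants could look roughly like the sketch below. The constant names and the meaning assigned to each value are assumptions for illustration only; the thread just says the values 0, 1 and 2 encode the type of index entry.

```java
// Sketch: named constants instead of bare 0/1/2 type bytes.
final class CachedEntryTypeSketch
{
    // Hypothetical names and meanings; only the values 0, 1 and 2 come from the thread.
    static final byte NOT_INDEXED = 0;
    static final byte INDEXED = 1;
    static final byte INDEXED_SHALLOW = 2;

    /** Human-readable label for a type byte; rejects anything outside 0..2. */
    static String describe(byte type)
    {
        switch (type)
        {
            case NOT_INDEXED:
                return "not indexed";
            case INDEXED:
                return "indexed";
            case INDEXED_SHALLOW:
                return "indexed, shallow";
            default:
                throw new IllegalArgumentException("Unknown entry type: " + type);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(describe(INDEXED_SHALLOW));
    }
}
```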

bq. dtests/unit test with column_index_cache_size_in_kb: 0
I've set up a new branch {{11206-large-part-0kb-trunk}} and triggered CI for 
this: 
[testall|http://cassci.datastax.com/view/Dev/view/snazy/job/snazy-11206-large-part-0kb-trunk-testall/lastBuild/]
[dtest|http://cassci.datastax.com/view/Dev/view/snazy/job/snazy-11206-large-part-0kb-trunk-dtest/lastBuild/]




[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-14 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241925#comment-15241925
 ] 

T Jake Luciani commented on CASSANDRA-11206:


* You need to change the version of sstable since this change alters the Index 
component.
* Please run dtests/unit tests with column_index_cache_size_in_kb: 0 
* Does the AutoSavingCache change require a step on the user's part, or will it 
naturally skip the saved cache on startup?
* The 0,1,2 magic bytes that encode what type of index entry this is should be 
made constants



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-11 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235091#comment-15235091
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

Pushed a commit that re-adds the generics.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-10 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234447#comment-15234447
 ] 

T Jake Luciani commented on CASSANDRA-11206:


I haven't dug into this much, but on the surface this effectively breaks 
CASSANDRA-7443 since you removed all generics from the IndexEntry. 
I don't see any reason you can't support a serializer implementation per format.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-20 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203372#comment-15203372
 ] 

Jonathan Ellis commented on CASSANDRA-11206:


Very promising!



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-20 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201135#comment-15201135
 ] 

Stefania commented on CASSANDRA-11206:
--

Thank you!



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-20 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203187#comment-15203187
 ] 

DOAN DuyHai commented on CASSANDRA-11206:
-

I have some questions related to the outcome of this JIRA.

 Since 2.1, incremental repair only repairs *chunks* of a partition (e.g. the 
chunks that are in the un-repaired SSTables set), so even in case of a mismatch 
we no longer stream the *entire* partition. And using paging we can read through 
very wide partitions. With the improvement brought by this JIRA, does it mean 
that we can now handle *virtually* unbounded partitions, or partitions exceeding 
2 x 10^9 physical columns?

 I'm asking because it will greatly impact the way we model data. There are 
still some points that can cause trouble with ultra-wide partitions:

 - bootstrapping/adding new nodes to the cluster --> streaming of ultra-wide 
partitions. What happens if the streaming fails in the middle? Do we restart 
the streaming of the whole partition or can we *resume* at the last clustering?
 - compaction. With LCS, ultra-wide partitions can create overly huge SSTables. 
In general, how will compacting ultra-wide partitions impact node stability? 
 - read path with STCS --> more SSTables to touch on disk



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-19 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199215#comment-15199215
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

Alright - opened CASSANDRA-11369 as a follow-up for 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}}



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-19 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198608#comment-15198608
 ] 

Stefania commented on CASSANDRA-11206:
--

bq. My understanding of 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}} is, that it 
uses the IndexInfo objects that are already in the key-cache and will go to 
disk if there is a key-cache miss.

Yes. Except previously it had to do this anyway because of the partition 
deletion, whereas now the partition deletion will be available but not the full 
IndexInfo objects.

bq. We could (in theory) add stuff to the partition summary or change the 
serialized index - but unfortunately not in 3.x.

I think it's reasonable to wait until the new major version to improve on the 
optimization of CASSANDRA-8180. So I'm happy with this compromise. Shall we 
open a ticket for this?



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-18 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198225#comment-15198225
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

Note: utests and dtests are fine now (did nothing other than a rebase and re-run).

bq. partition should be added to the key cache if not already present

Yes and no. This ticket will add a _shallow_ version of {{IndexedEntry}} to the 
key cache (without the IndexInfo objects as these cause a lot of heap 
pressure). So, when the {{IndexInfo}} objects are actually needed, these will 
be read from disk. My understanding of 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}} is, that it 
uses the IndexInfo objects that are already in the key-cache and will go to 
disk if there is a key-cache miss. If we re-read the IndexInfo objects in 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}}, this would 
add overhead. Or did I get it wrong and 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}} accesses 
the same partition as {{IndexState}} does? If that's the case, we can maybe 
pass the current, "fully accessible" {{IndexedEntry}} to 
{{UnfilteredRowIteratorWithLowerBound}} (not checked yet).

We could (in theory) add stuff to the partition summary or change the 
serialized index - but unfortunately not in 3.x.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-14 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193082#comment-15193082
 ] 

Stefania commented on CASSANDRA-11206:
--

bq. IndexInfo is also used from 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}} 
(CASSANDRA-8180) - not sure whether it's worth to deserialize the index for 
this functionality, *as it is currently restricted to the entries that are 
present in the key cache*. I tend to remove this access. 

If I am not mistaken when the sstable iterator is created, the partition should 
be added to the key cache if not already present. Please have a look at 
BigTableReader {{iterator()}} and {{getPosition()}} to confirm. The reason we 
need the index info is that the lower bounds in the sstable metadata do not 
work for tombstones. This is the only lower bound we have for tombstones. If 
it's removed then the optimization of CASSANDRA-8180 no longer works in the 
presence of tombstones (whether this is acceptable is up for discussion). 

Can't we add the partition bounds to the offset map? 

For completeness, I should add that we don't necessarily need a lower bound for 
the partition; it can be a lower bound for the entire sstable if that is easier. 
However, it should work for tombstones, that is, it should be an instance of 
{{ClusteringPrefix}} rather than an array of {{ByteBuffer}} as it is currently 
stored in the sstable metadata. 



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-14 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192977#comment-15192977
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

Quick progress status:
* refactored the code to be able to handle "flat byte structures" (i.e. a 
{{byte[]}} at the moment - as a pre-requisite to directly access the index file)
* IndexInfo is only used from {{AbstractSSTableIterator.IndexState}} - an 
instance to an open index-file is available, so removing the {{byte[]}} and 
accessing the index file directly is the next step.
* unit and dtests are mostly passing (i.e. there are some flakey ones on 
cassci, which are passing locally). Still need to identify what's going on with 
the failing paging dtests.
* cstar tests show similar results compared to current trunk
* IndexInfo is also used from 
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}} 
(CASSANDRA-8180) - not sure whether it's worth to deserialize the index for 
this functionality, as it is currently restricted to the entries that are 
present in the key cache. I tend to remove this access. (/cc [~Stefania])

Observations:
* accesses to IndexInfo objects are "random" during the binary search operation 
(as expected)
* accesses to IndexInfo objects are "nearly sequential" during scan operations 
- "nearly" means, it accesses index N, then index N-1, then index N+1 before it 
actually moves ahead - but does some random accesses to previously accessed 
IndexInfo instances afterwards. Therefore {{IndexState}} "caches" the already 
deserialised {{IndexInfo}} objects. These should stay in new-gen as these are 
only referenced during the lifetime of the actual read. Alternatively it is 
possible to use a plain & boring LRU like cache for the 10 last IndexInfo 
objects in IndexState.
* index-file writes (flushes/compactions) also used {{IndexInfo}} objects - 
replaced with a buffered write ({{DataOutputBuffer}})
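
The "plain & boring LRU like cache" alternative mentioned above could look roughly like the sketch below. This is only an illustration, not the patch: the class name {{RecentIndexInfoCache}}, the {{loader}} callback (standing in for "deserialize IndexInfo number i from the index file") and the {{loads()}} counter are all hypothetical. It relies on the access-order mode of {{java.util.LinkedHashMap}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: keeps the most recently used entries, keyed by index position.
final class RecentIndexInfoCache<V> {
    private final Map<Integer, V> entries;
    private final IntFunction<V> loader; // assumption: deserializes entry i on demand
    private int loads = 0;               // counts loader calls, for illustration only

    RecentIndexInfoCache(final int capacity, IntFunction<V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first
        this.entries = new LinkedHashMap<Integer, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
                return size() > capacity; // evict once we exceed the capacity
            }
        };
    }

    V get(int index) {
        V value = entries.get(index);
        if (value == null) {
            value = loader.apply(index); // cache miss: go to the (simulated) index file
            loads++;
            entries.put(index, value);
        }
        return value;
    }

    int loads() {
        return loads;
    }
}
```

With a capacity of 10, the N, N-1, N+1 pattern followed by a revisit of N would call the loader three times instead of four. Whether that is better than simply keeping every deserialized object for the lifetime of the read is a separate question.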

Assumptions:
* heap pressure due to the vast amount of {{IndexInfo}} objects is already 
handled by this patch (exchanged to one {{byte[]}} at the moment) both for 
reads and flushes/compactions
* after replacing the {{byte[]}} with index file access, we could lower the 
(default) key-cache size since we then no longer cache {{IndexInfo}} objects on 
heap

So the next step is to remove the {{byte[]}} from {{IndexedEntry}} and replace 
it with index-file access from {{IndexState}}.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175634#comment-15175634
 ] 

Jonathan Ellis commented on CASSANDRA-11206:


bq. For partitions < 64k (partitions without an IndexInfo object) we could skip 
the indirection during reads via RowIndexEntry at all by extending the 
IndexSummary and directly store the offset into the data file

Since the idea here is to do something simple that we can be confident about 
shipping in 3.6 if CASSANDRA-9754 isn't ready, let's avoid making changes to 
the on disk layout, i.e., your Plan B.

To clarify for others following along,

bq. Remove IndexInfo from the key cache (not from the index file on disk, of 
course)

This sounds scary but it's core to the goal here: if we're going to support 
large partitions, we can't afford the overhead either of keeping the entire 
summary on heap, or of reading it from disk in the first place.  (If we're 
reading a 1KB row, then reading 2MB of summary first on a cache miss is a huge 
overhead.)  Moving the key cache off heap (CASSANDRA-9738) would have helped 
with the first but not the second.

So one approach is to go back to the old strategy of only caching the partition 
key location, and then go through the index bsearch using the offsets map every 
time.  For small partitions this will be trivial and I hope negligible to the 
performance story vs the current cache.  (If not, we can look at a hybrid 
strategy, but I'd like to avoid that complexity if possible.)

bq. what I was thinking was that the key cache instead of storing a copy of the 
RIE it would store an offset into the index that is the location of the RIE. 
Then the RIE could be accessed off heap via a memory mapping without doing any 
allocations or copies

I was thinking that even the offsets alone for a 4GB partition are going to be 
256KB, so we don't want to cache the entire offsets map.  But the other side 
there is that if you have a bunch of 4GB partitions you won't have very many of 
them.  16TB of data would be 1GB of offsets which is within the bounds of 
reasonable when off heap.  And your approach may require fewer logic changes 
than the one above, since we're still "caching" the entire summary, sort of; 
only adding an extra indirection to read the IndexInfo entries.  So that might 
well be simpler.
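
The sizing numbers above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one 4-byte {{int}} offset per 64KB row range, as in the description; the class and method names are made up for the example.

```java
// Hypothetical helper for the back-of-the-envelope sizing; not Cassandra code.
final class OffsetMapSizing {
    static final long KB = 1024L;
    static final long GB = KB * KB * KB;
    static final long TB = GB * KB;

    // One int offset per indexed block, rounding the block count up.
    static long offsetMapBytes(long dataBytes, long blockBytes) {
        long entries = (dataBytes + blockBytes - 1) / blockBytes;
        return entries * Integer.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(offsetMapBytes(4 * GB, 64 * KB) / KB + " KB");  // prints "256 KB"
        System.out.println(offsetMapBytes(16 * TB, 64 * KB) / GB + " GB"); // prints "1 GB"
    }
}
```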



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174309#comment-15174309
 ] 

Ariel Weisberg commented on CASSANDRA-11206:


I'm summarizing to make sure I remember correctly what the key cache miss read 
path for a table looks like.
1. Binary search index summary to find location of partition index entry in 
index
2. Lookup index entry which may just be a pointer to the data file, or it may 
be a sampled index of rows in the partition
3. Look up the partition contents based on the index entry

The index summary is a sampling of the index so most of the time we aren't 
going to get a hit into the data file right? We have to scan the index to find 
the RIE and that entire process is what the key cache saves us from.

If I remember correctly what I was thinking was that the key cache instead of 
storing a copy of the RIE it would store an offset into the index that is the 
location of the RIE. Then the RIE could be accessed off heap via a memory 
mapping without doing any allocations or copies.

I agree that for partitions that aren't indexed the key cache could point 
straight to the data file and skip the index lookup since there doesn't need to 
be additional data there. I don't follow the path you are describing to 
completely removing the key cache without restructuring index summaries and 
indexes.

An aside. Is {{RowIndexEntry}} named incorrectly? Should it be 
{{PartitionIndexEntry}}?




[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-03-01 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174150#comment-15174150
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

A brief outline of what I am planning ("full version"):

For partitions < 64k (partitions without an IndexInfo object) we could skip the 
indirection during reads via RowIndexEntry at all by extending the IndexSummary 
and directly store the offset into the data file. (This also flattens the 
IndexedEntry vs. RowIndexEntry class hierarchy and removes some if-else 
constructs.) Maybe also use vint encoding in IndexSummary to save some space in 
memory and on disk (looks possible from a brief look). Eventually also add the 
partition deletion time to the summary, if it's worth to do that (not sure 
about this - it's in IndexedEntry but not in RowIndexEntry).

For other partitions we use the offset information in IndexedEntry and only 
read those IndexInfo entries that are really necessary during the binary 
search. It doesn't really matter whether we are reading cold or hot data as 
cold data has to be read from disk anyway and hot data should already be in the 
page cache.

Having the offset into the data file in the summary, we can remove the key 
cache.

Tests for CASSANDRA-9738 have shown that there is not much benefit keeping the 
full IndexedEntry + IndexInfo structure in memory (off heap). So this ticket 
would supersede CASSANDRA-9738 and CASSANDRA-10320.

Downside of this approach is that it changes the on-disk format of 
IndexSummary, which might be an issue in 3.x - so there's a "plan B version":

* Leave IndexSummary untouched
* Remove IndexInfo from the key cache (not from the index file on disk, of 
course)
* Change IndexSummary and remove the whole key cache in a follow-up ticket for 
4.x

/cc [~slebresne] [~aweisberg] [~iamaleksey] 



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-02-22 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157471#comment-15157471
 ] 

sankalp kohli commented on CASSANDRA-11206:
---

+1 for stop gap. 



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-02-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157390#comment-15157390
 ] 

Jonathan Ellis commented on CASSANDRA-11206:


The offset map is what allows us to deal with variable length index entries.  
So you only deserialize exactly as many IndexInfo as needed to locate the right 
64KB row block.  Then scanning for the row w/in the block is unchanged.



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-02-22 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157346#comment-15157346
 ] 

Michael Kjellman commented on CASSANDRA-11206:
--

The IndexEntry objects are currently variable length [~jbellis] which might 
make this a bit complicated on the read path. Also, how many elements would 
need to be deserialized at minimum? Whatever the bucket size used for the skip 
list implementation?



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-02-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157320#comment-15157320
 ] 

Jonathan Ellis commented on CASSANDRA-11206:


(Note that I am a big fan of the proposal in CASSANDRA-9754, this is intended 
as a simpler approach that we can ship quickly and replace when 9754 is ready. 
/cc [~mkjellman])



[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-02-22 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157318#comment-15157318
 ] 

Jonathan Ellis commented on CASSANDRA-11206:


The offset map is written AFTER the serialized IndexInfo, but since we write 
out the size of both up front, we can still access the map without 
deserializing everything.  Here's the code from RIE that writes it out:

{code}
out.writeUnsignedVInt(rie.position);
out.writeUnsignedVInt(rie.promotedSize(idxSerializer));
...
out.writeUnsignedVInt(rie.columnsIndex().size());
... [write out the IndexInfo and compute offsets map as we go] ...
for (int off : offsets)
    out.writeInt(off);
{code}

(There is no code yet that reads it back in because CASSANDRA-9738 got put on 
the back burner.)

Thus the offsets map starts at the total size, minus count * sizeof(int).

So we can read the middle offsetmap entry, deserialize the IndexInfo it points 
to, compare with the row we're looking for, and repeat until bsearch is done.
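
Since no read-side code exists yet, here is a rough, self-contained sketch of that bsearch over a serialized block. It does not use the real RIE format: the layout (length-prefixed string keys followed by one {{int}} offset per entry) and the names {{OffsetMapSearch}}, {{serialize}}, {{readEntry}} and {{search}} are assumptions made up for illustration. The only part taken from the comment above is that the offsets map starts at total size minus count * sizeof(int).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical layout: [entry 0]...[entry N-1][int offset 0]...[int offset N-1],
// where each variable-length entry is [short length][UTF-8 key bytes].
final class OffsetMapSearch {
    static int entriesRead = 0; // counts entry deserializations, for illustration

    static ByteBuffer serialize(List<String> sortedKeys) {
        int payload = 0;
        for (String key : sortedKeys)
            payload += Short.BYTES + key.getBytes(StandardCharsets.UTF_8).length;
        ByteBuffer buf = ByteBuffer.allocate(payload + Integer.BYTES * sortedKeys.size());
        int[] offsets = new int[sortedKeys.size()];
        int i = 0;
        for (String key : sortedKeys) {
            offsets[i++] = buf.position(); // remember where this entry starts
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            buf.putShort((short) bytes.length);
            buf.put(bytes);
        }
        for (int off : offsets) // the offsets map goes AFTER the entries
            buf.putInt(off);
        buf.flip();
        return buf;
    }

    static String readEntry(ByteBuffer buf, int offset) {
        int length = buf.getShort(offset);
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++)
            bytes[i] = buf.get(offset + Short.BYTES + i);
        entriesRead++;
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns the index of the last entry whose key is <= target, or -1 if none.
    static int search(ByteBuffer buf, int count, String target) {
        int mapStart = buf.limit() - count * Integer.BYTES; // total size - count * sizeof(int)
        int lo = 0, hi = count - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int entryOffset = buf.getInt(mapStart + mid * Integer.BYTES);
            String key = readEntry(buf, entryOffset); // only this entry is deserialized
            if (key.compareTo(target) <= 0) {
                result = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ByteBuffer buf = serialize(Arrays.asList("apple", "banana", "cherry", "kiwi"));
        System.out.println(search(buf, 4, "blueberry") + " after " + entriesRead + " reads");
    }
}
```

Each probe touches one offset and one entry, so a search over N entries reads at most floor(log2(N)) + 1 of them, which is the log(N) behaviour the description asks for.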
