[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880077#comment-15880077 ] Michael Kjellman commented on CASSANDRA-9754: - Just wanted to give a quick update:
# I'm really sorry for the delay getting this finished for trunk. I've started a trunk-based/post-8099 version 3 times now -- the holidays happened, more pressing things stole my attention, big commits like removing Thrift, CFMetadata, etc. kept getting committed before I was done, and well -- enough excuses from me...
# A belated thanks for your initial comments Branimir -- I did read them and I'll be addressing them with my trunk-rebased changes.
# I'm almost done with the refactoring to move all the current array-based index logic into an IndexEntry implementation. I have all unit tests passing (finally) with the exception of KeyCacheCqlTest (which I'm working on right now).
# Assuming I get the post-8099 indexed-iterator-based abstractions/changes correct, it should be a matter of just dropping in the Birch package/classes I had for 2.1 and switching the default serializer to use the Birch IndexedEntry implementation.
> Make index info heap friendly for large CQL partitions > -- > > Key: CASSANDRA-9754 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9754 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Michael Kjellman >Priority: Minor > Fix For: 4.x > > Attachments: 0f8e28c220fd5af6c7b5dd2d3dab6936c4aa4b6b.patch, > gc_collection_times_with_birch.png, gc_collection_times_without_birch.png, > gc_counts_with_birch.png, gc_counts_without_birch.png, > perf_cluster_1_with_birch_read_latency_and_counts.png, > perf_cluster_1_with_birch_write_latency_and_counts.png, > perf_cluster_2_with_birch_read_latency_and_counts.png, > perf_cluster_2_with_birch_write_latency_and_counts.png, > perf_cluster_3_without_birch_read_latency_and_counts.png, > perf_cluster_3_without_birch_write_latency_and_counts.png > > > Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects > are IndexInfo and its ByteBuffers. This is especially bad in endpoints with > large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K > IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for > GC. Can this be improved by not creating so many objects? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
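The arithmetic in the description can be sanity-checked with a quick sketch, assuming Cassandra's default {{column_index_size_in_kb}} of 64 (one IndexInfo entry per 64KB of indexed partition data; class and variable names here are illustrative only):

```java
// Back-of-envelope check of the numbers in the issue description: with 64KB
// index granularity, a ~6.4GB partition produces roughly 100K IndexInfo
// entries, each referencing two ByteBuffers (first and last clustering name).
final class IndexInfoEstimate {
    public static void main(String[] args) {
        long partitionBytes = 6_400_000_000L;   // ~6.4GB partition
        long indexChunk = 64 * 1024;            // 64KB column index granularity
        long indexInfoCount = partitionBytes / indexChunk;
        System.out.println(indexInfoCount);     // 97656, i.e. ~100K IndexInfo
        System.out.println(indexInfoCount * 2); // 195312, i.e. ~200K ByteBuffers
    }
}
```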
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795310#comment-15795310 ] Branimir Lambov commented on CASSANDRA-9754: As this is not a critical bug-fix, and hence the 2.1 version cannot go into the codebase, unfortunately I cannot justify investing the time to give it a full review until we have a trunk patch. I looked at the main {{BirchReader/Writer}} components again as they are likely to stay the same for a trunk patch. Here are my comments:
- I still think not storing the first key in intermediate nodes would save significant amounts of space and time and should be implemented.
- There is a pair of methods called {{binarySearch}} which return the floor (less than or equal) for a given key. I would prefer them to be named after what they produce, as {{binarySearch}} implies a certain kind of result (negative for non-equal), and the fact that it is implemented through binary search is largely an implementation detail.
- The [{{entries == 1}} check|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-8561257b0836a3403d14d5dac9f8b3d0R393] looks suspect, as there should be no need for one-entry nodes in the tree. Could you comment on why it is necessary?
- I think it is better to handle the special meaning of the empty key and the {{reversed}} flag first thing in [the {{search}} method|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-8561257b0836a3403d14d5dac9f8b3d0R432] rather than propagating it into the {{binarySearch}} calls, especially since you know the position of the first ({{0}}) and last ({{descriptor.getFirstNodeOffset() - descriptor.getAlignedPageSize()}}) leaf nodes in the tree. The iterator initialization already does that.
- The meaning of "matching" in the [{{search}} doc|https://github.com/mkjellman/cassandra/commit/b17f2c1317326fac7b6864a2fc61d7ee2580f740#diff-8561257b0836a3403d14d5dac9f8b3d0R429] is unclear. What happens when no equal element is present? If it returns the floor, please state so.
- {{BirchIterator}} could use a forward/reversed implementation split.
- There's a lot of potential for off-by-one mishaps in {{BirchIterator}}, and not only in the reverse case:
-- the first element returned in the forward case can be less than {{searchKey}} (if not equal);
-- the respective problem is also there in the reverse case;
-- the page we find the entry in during initialization ({{currentPage}}) is not the page we apply that index to during the first {{computeNext}} ({{currentPage - 1}});
-- the column iteration code probably masks the first two at some non-trivial efficiency cost, but the latter looks like something that can surface as missed data.
- The [level generation loop in {{BirchWriter}}|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-c5fa7e9cc1eac71a75b38caa716f64c3R260] is effectively a {{while (true)}} loop, as we always reset {{inProgressInnerNodeLevel}} before going into it.
- The list should never become empty, thus the [emptiness check|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-c5fa7e9cc1eac71a75b38caa716f64c3R282] is suspect -- if necessary, it would indicate an error in the logic; I'd replace it with a non-emptiness assertion.
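The floor-returning search discussed in the naming point above could be sketched as follows. This is an illustrative standalone example, not the actual {{BirchReader}} code: a binary search that remembers the last key less than or equal to the target, rather than signalling non-equal results with a negative index as {{Arrays.binarySearch}} does.

```java
// Hypothetical sketch of a floor ("less than or equal") lookup. Returns the
// index of the greatest key <= searchKey, or -1 if every key is greater --
// which is why naming it after the floor result, not "binarySearch", is clearer.
final class FloorSearch {
    static int floorIndex(long[] keys, long searchKey) {
        int lo = 0, hi = keys.length - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid] <= searchKey) {
                result = mid;    // candidate floor; a closer one may be to the right
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] keys = {10, 20, 30};
        System.out.println(floorIndex(keys, 25)); // prints 1 (key 20 is the floor)
        System.out.println(floorIndex(keys, 5));  // prints -1 (no floor exists)
    }
}
```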
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658206#comment-15658206 ] Michael Kjellman commented on CASSANDRA-9754: - [~llambiel] Yes, we ran out of disk space before the code fell over. We had some 250GB partitions when we finally ran out of disk space. Waiting on review and comments from [~barnie], and I'm working on the trunk version. I have most of the unit tests passing, although the new RangeTombstoneBounds etc. is proving pretty fragile and giving me a bit of pain. [~jjirsa] as we've discussed, I'm 99.9% sure we should go with your changes too.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657238#comment-15657238 ] Loic Lambiel commented on CASSANDRA-9754: - Any update on your ongoing tests [~mkjellman]?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592881#comment-15592881 ] Michael Kjellman commented on CASSANDRA-9754: - The large partitions hit 158GB today. Latencies are still stable and unchanged. Compaction backed up a little at one point but has fully caught up (with no change in actual write/read tps). I'm working very hard on a post 8099 version of the patch for trunk/3.0.x.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583396#comment-15583396 ] Michael Kjellman commented on CASSANDRA-9754: - Great. I'm working on a trunk based version now. 8099 is really fun! :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583373#comment-15583373 ] Pavel Yaskevich commented on CASSANDRA-9754: I'm planning to take a closer look at the code etc. soon, so if I see something or have any ideas I'll let you know!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583328#comment-15583328 ] Pavel Yaskevich commented on CASSANDRA-9754: [~mkjellman] Maybe "largeuuid1"? Looks like rows there were about ~300KB too, which is reasonable.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583336#comment-15583336 ] Michael Kjellman commented on CASSANDRA-9754: - Sure, let me change it now. Also, if you have any input on the overall way I'm testing and generating load, please let me know -- I really did try to make it as realistic as I could, and we discussed it internally over here, but I'm all ears if there's a different kind of load I'm missing that would make it a more accurate test for certain workloads.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583305#comment-15583305 ] Michael Kjellman commented on CASSANDRA-9754: - Sure, I can change the test right now. Which table specifically are you talking about adding more keys to? It's a single command-line parameter and a restart of the perf load. I'll need to bounce the cluster for the key cache change, obviously. The control cluster runs 2.1.16 without Birch, which I did on purpose to compare performance with Birch vs without and specifically make sure there isn't a regression at the low end, like you're rightfully concerned about (as I am/was too).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583294#comment-15583294 ] Michael Kjellman commented on CASSANDRA-9754: - One idea I've had for a while is that we could switch the current Summary implementation to just having it be a Birch tree itself with all keys (not sampled). You could then do a lookup into the row index to get the offset to the columns index in what we call the "primary index" today. Then you'd have a tree per row like we have today.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583289#comment-15583289 ] Pavel Yaskevich commented on CASSANDRA-9754: bq. I'm actually only using the key cache in the current implementation
I wanted to mention that, purely from the perspective of looking up a key in the key cache, I've assumed that the index is only going to have key offsets in it, so we are on the same page. [~barnie] Is there any way you can run this through an automated perf stress test? Since the size of the tree attached to the key is bigger than it was originally, I'm curious what the performance difference is in conditions where rows are just barely big enough to be indexed and there are a lot of keys. [~mkjellman] I understand that the test you are running is designed to check what the performance is like relative to the Birch tree itself, but is there any chance you can disable the key cache and generate some more keys (maybe ~100k?) to see how changes to the column index are affecting the read path top-down?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583250#comment-15583250 ] Michael Kjellman commented on CASSANDRA-9754: - In regards to your second point: I'm actually only using the key cache in the current implementation if a) it's a legacy index that hasn't been upgraded yet (to keep performance for indexed rows the same during upgrades), or b) the row is "non-indexed", i.e. < 64kb, so just the starting offset. Birch-indexed rows always come from the Birch impl on disk and don't get stored in the key cache at all. Ideally I think it would be great if we could get rid of the key cache altogether! There was some chat about this in the ticket earlier... There is the index summary, which has an offset for keys as they are sampled during compaction; it lets you skip to a given starting file offset inside the index for a key, which reduces the problem you're talking about. I don't think the performance of the small-to-medium sized case should be any different with the Birch implementation than with the current one, and I'm trying to test that with the workload going on for the test_keyspace.largeuuid1 table. The issue with the Birch implementation vs the current one, though, is going to be the size of the index file on disk, due to the segments being aligned on 4kb boundaries. I've talked a bunch about this and thrown some ideas around with people, and I think maybe the best approach would be to check if the previously added row was a non-indexed segment (so just a long for the start of the partition in the index and no tree being built) and then not align the file to a boundary for those cases. The real issue is that I don't know the length ahead of time, so I can't just encode the aligned segments at the end starting at some starting offset and encode relative offsets iteratively during compaction. Any thoughts on this would be really appreciated though...
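The on-disk cost of the 4kb alignment described above can be illustrated with a small sketch. The class and method names are hypothetical (not from the patch); the 4096-byte page size is taken from the boundary size mentioned in the comment:

```java
// Illustrative arithmetic for aligning index segments to 4KB boundaries.
// A non-indexed entry that is logically just an 8-byte offset still occupies
// a full page once aligned, which is why skipping alignment for non-indexed
// rows (as proposed above) would shrink the index file.
final class AlignmentOverhead {
    static final long PAGE = 4096;

    // Round a segment length up to the next page boundary.
    static long alignedSize(long length) {
        return ((length + PAGE - 1) / PAGE) * PAGE;
    }

    public static void main(String[] args) {
        System.out.println(alignedSize(8));    // 4096: 8 bytes inflated 512x
        System.out.println(alignedSize(4096)); // 4096: exact fit, no padding
        System.out.println(alignedSize(4097)); // 8192: one byte over costs a page
    }
}
```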
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583224#comment-15583224 ] Michael Kjellman commented on CASSANDRA-9754: - Here is cfstats from one of the instances.
{code}
Keyspace: test_keyspace
	Read Count: 114179492
	Read Latency: 1.6377607135701742 ms.
	Write Count: 662747473
	Write Latency: 0.030130128499184786 ms.
	Pending Flushes: 0
		Table: largetext1
		SSTable count: 26
		SSTables in each level: [0, 3, 7, 8, 8, 0, 0, 0, 0]
		Space used (live): 434883821719
		Space used (total): 434883821719
		Space used by snapshots (total): 0
		Off heap memory used (total): 67063584
		SSTable Compression Ratio: 0.7882047641965452
		Number of keys (estimate): 14
		Memtable cell count: 58930
		Memtable data size: 25518748
		Memtable off heap memory used: 0
		Memtable switch count: 3416
		Local read count: 71154231
		Local read latency: 2.468 ms
		Local write count: 410631676
		Local write latency: 0.030 ms
		Pending flushes: 0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.0
		Bloom filter space used: 496
		Bloom filter off heap memory used: 288
		Index summary off heap memory used: 1144
		Compression metadata off heap memory used: 67062152
		Compacted partition minimum bytes: 20924301
		Compacted partition maximum bytes: 91830775932
		Compacted partition mean bytes: 19348020195
		Average live cells per slice (last five minutes): 0.9998001524322566
		Maximum live cells per slice (last five minutes): 1.0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0.0

		Table: largeuuid1
		SSTable count: 59
		SSTables in each level: [1, 10, 48, 0, 0, 0, 0, 0, 0]
		Space used (live): 9597872057
		Space used (total): 9597872057
		Space used by snapshots (total): 0
		Off heap memory used (total): 3960428
		SSTable Compression Ratio: 0.2836031289299396
		Number of keys (estimate): 27603
		Memtable cell count: 228244
		Memtable data size: 7874514
		Memtable off heap memory used: 0
		Memtable switch count: 521
		Local read count: 18463741
		Local read latency: 0.271 ms
		Local write count: 108570121
		Local write latency: 0.031 ms
		Pending flushes: 0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.0
		Bloom filter space used: 22008
		Bloom filter off heap memory used: 21536
		Index summary off heap memory used: 11308
		Compression metadata off heap memory used: 3927584
		Compacted partition minimum bytes: 42511
		Compacted partition maximum bytes: 4866323
		Compacted partition mean bytes: 1290148
		Average live cells per slice (last five minutes): 0.9992537806937392
		Maximum live cells per slice (last five minutes): 1.0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0.0

		Table: timeuuid1
		SSTable count: 7
		SSTables in each level: [0, 1, 3, 3, 0, 0, 0, 0, 0]
		Space used (live): 103161816378
		Space used (total): 103161816378
		Space used by snapshots (total): 0
		Off heap memory used (total): 13820716
		SSTable Compression Ratio: 0.9105016396444802
		Number of keys (estimate): 6
		Memtable cell count: 150596
		Memtable data size: 41378801
		Memtable off heap memory used: 0
		Memtable switch count: 1117
		Local read count: 24561527
		Local read latency: 0.264 ms
		Local write count: 143545778
		Local write latency: 0.033 ms
		Pending flushes: 0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.0
		Bloom filter space used: 128
		Bloom filter off heap memory used: 72
		Index summary off heap memory used: 308
		Compression metadata off heap memory used: 13820336
		Compacted partition minimum
{code}
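For context on why compacted partitions of this size (mean ~19.3GB for largetext1) stress the old index format: with the default 64KB column index interval, each partition implies hundreds of thousands of on-heap IndexInfo objects, as the ticket description notes. A rough back-of-the-envelope sketch -- the 64KB interval is the assumed default, not taken from these clusters' config:

```java
// Back-of-the-envelope estimate of on-heap IndexInfo entries for a partition,
// assuming the default 64KB column index interval.
class IndexEntryEstimate {
    static final long INDEX_INTERVAL_BYTES = 64 * 1024;

    static long indexInfoCount(long partitionBytes) {
        return partitionBytes / INDEX_INTERVAL_BYTES;
    }

    public static void main(String[] args) {
        // Mean compacted partition size of largetext1 from the cfstats above.
        long meanPartitionBytes = 19_348_020_195L;
        // Each IndexInfo holds two ByteBuffers (first and last clustering name),
        // so roughly twice this many buffers end up on heap per partition.
        System.out.println(indexInfoCount(meanPartitionBytes)); // ~295K entries
    }
}
```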
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583210#comment-15583210 ] Pavel Yaskevich commented on CASSANDRA-9754: [~mkjellman] This looks great! Can you please post information regarding SSTable sizes and their estimated key counts as well? AFAIR there exists another problem related to how indexes are currently stored - if a key is not in the key cache, there is no way to jump to it directly in the index file; the index reader has to scan through the index segment to find the requested key. So I'm wondering what happens in the situation where there are many small-to-medium sized keys, e.g. 64-128 MB each, in a given SSTable (let's say the SSTable size is set to 1G or 2G) and stress readers are trying to read random keys - what would be the difference between current index read performance vs. index + birch tree?...
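The lookup path described above can be sketched roughly as follows. These are simplified, hypothetical structures (not the actual Cassandra index summary classes): the summary binary-searches its sampled keys to find a starting offset, and the reader must then scan forward through the index segment from that offset to the requested key.

```java
import java.util.Arrays;

// Simplified sketch: an index summary holds every Nth key with its offset into
// the index file; a lookup binary-searches the samples for the nearest sample
// at or before the requested token, giving the offset to start scanning from.
class IndexSummarySketch {
    final long[] sampledTokens; // sorted tokens of the sampled keys
    final long[] indexOffsets;  // matching offsets into the index file

    IndexSummarySketch(long[] sampledTokens, long[] indexOffsets) {
        this.sampledTokens = sampledTokens;
        this.indexOffsets = indexOffsets;
    }

    // Index-file offset to start the forward scan from for the given token.
    long startOffsetFor(long token) {
        int i = Arrays.binarySearch(sampledTokens, token);
        if (i < 0)
            i = -i - 2; // insertion point minus one: last sample <= token
        return (i < 0) ? 0 : indexOffsets[i];
    }

    public static void main(String[] args) {
        IndexSummarySketch s = new IndexSummarySketch(
                new long[] { 10, 20, 30 }, new long[] { 0, 100, 200 });
        System.out.println(s.startOffsetFor(25)); // 100: scan starts at the sample for 20
    }
}
```

The cost Pavel describes is the forward scan after `startOffsetFor` returns, which grows with the number of keys packed between consecutive samples.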
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583163#comment-15583163 ] Michael Kjellman commented on CASSANDRA-9754: - Not a "stupid" question at all! There is certainly a bit more overhead here than what we did before; however, I'm closely monitoring compaction in these tests and Pending Tasks isn't backing up, so at this read/write load the additional work seems negligible in real-world terms.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583160#comment-15583160 ] Michael Kjellman commented on CASSANDRA-9754: - Test clusters have crossed 110GB for the large CQL partitions!!! Latency is still stable :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578618#comment-15578618 ] DOAN DuyHai commented on CASSANDRA-9754: Stupid question: how do those improvements affect compaction? Did you also monitor compaction time during your benchmark tests and compare the time taken by each implementation?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576497#comment-15576497 ] Michael Kjellman commented on CASSANDRA-9754: - All of the threads responsible for generating load in the control cluster for the two large-partition read and write workloads had died because the cluster became so unstable. As soon as I restarted the stress load, 60% of the instances in the cluster OOMed within 2 minutes. At this point I don't think I can drive any more data into the partitions with the old code, so I'm going to declare defeat and say that 17GB is the absolute max partition size possible with the old/previous/current index implementation (given the JVM parameters I detailed in the test description above). I'm going to leave the load at the current read and write rates in the two Birch clusters until things explode, to see the theoretical max partition size possible with the Birch implementation today. After that I'll wipe the clusters and restart the same load at 2x the read and write rates to see how things go with that configuration.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576390#comment-15576390 ] Michael Kjellman commented on CASSANDRA-9754: - I have even more great news! My two test clusters just crossed 53GB for the 90th percentile of max row size. The 50th percentile of mean row size is ~8.5GB. Read and write latencies are still the same as the numbers I posted above from 3 days ago. So you could have an entire cluster of 10GB rows and still be stable :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569292#comment-15569292 ] Michael Kjellman commented on CASSANDRA-9754: - Morning update :) The stress load has continued against all partitions since the last update. The large partitions have grown to ~21GB. Latencies are still unchanged for both reads and writes in all percentiles!! Onwards to the next milestone, 50GB! I also doubled the read and write load around 10 hours ago, to 4k reads/sec and 10k writes/sec, to grow the partitions faster.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566748#comment-15566748 ] Michael Kjellman commented on CASSANDRA-9754: - Attaching an initial set of very rough graphs showing the last 12 hours of stress/performance testing. I apologize ahead of time for some of the graphs -- I wanted to include the average, p99.9th, and count for all key metrics, and in some cases the values overlapped and my graphing foo wasn't good enough to improve the readability. I'll take another pass when I get some time with the next round of performance testing.

The "large" CQL partitions in all 3 clusters are currently (and were for the duration of the test) between ~6GB and ~12.5GB, although I'm planning on running the stress/performance tests in all 3 clusters until the "large" CQL partitions hit ~50GB. The load was started in all 3 clusters (all totally empty at start) at the same time, from the same stress tool code that I wrote specifically to realistically test Birch -- after repeated attempts to generate a good workload with cassandra-stress I gave up. Some details of the stress tool and the load being generated for these graphs are below.

h3. There are three read-write workloads being run to generate the load during these tests.

I wrote the following two methods for the "simple-cassandra-stress" tool I threw together, to generate the keys that the worker threads operate on. I'll refer to them below when describing how the stress load is currently being generated.
{code:java}
public static List<HashCode> generateRandomKeys(int number)
{
    List<HashCode> keysToOperateOn = new ArrayList<>();
    HashFunction hf = Hashing.murmur3_128();
    for (int i = 0; i < number; i++)
    {
        HashCode hashedKey = hf.newHasher().putLong(RANDOM_THREAD_LOCAL.get().nextInt(30) + 1).hash();
        keysToOperateOn.add(hashedKey);
    }
    return keysToOperateOn;
}

public static List<HashCode> generateEvenlySpacedPredictableKeys(int number, int offset, String seed, Cluster cluster) throws InvalidParameterException
{
    Set<TokenRange> tokenRanges = cluster.getMetadata().getTokenRanges();
    int numberOfKeysToGenerate = (number < tokenRanges.size()) ? tokenRanges.size() : number;
    Long[] tokens = new Long[numberOfKeysToGenerate];
    int pos = 0;
    int numberOfSplits = (number <= tokenRanges.size()) ? 1 : (number / tokenRanges.size()) + 1;
    for (TokenRange tokenRange : tokenRanges)
    {
        for (TokenRange splitTokenRange : tokenRange.splitEvenly(numberOfSplits))
        {
            if (pos >= tokens.length)
                break;
            tokens[pos++] = (Long) splitTokenRange.getStart().getValue();
        }
        if (pos >= tokens.length)
            break;
    }
    HashCode[] randomKeys = new HashCode[tokens.length];
    int pendingRandomKeys = tokens.length;
    while (pendingRandomKeys > 0)
    {
        for (int i = offset; i < (offset + numberOfKeysToGenerate) * (number * 10); i++)
        {
            if (pendingRandomKeys <= 0)
                break;
            HashFunction hf = Hashing.murmur3_128();
            HashCode hashedKey = hf.newHasher().putString(seed, Charset.defaultCharset()).putInt(i).hash();
            for (int t = 0; t < tokens.length; t++)
            {
                if ((t + 1 == tokens.length && hashedKey.asLong() >= tokens[t])
                    || (hashedKey.asLong() >= tokens[t] && hashedKey.asLong() < tokens[t + 1]))
                {
                    if (randomKeys[t] == null)
                    {
                        randomKeys[t] = hashedKey;
                        pendingRandomKeys--;
                    }
                    break;
                }
            }
        }
    }
    return Arrays.asList(randomKeys);
}
{code}
There are 12 Cassandra instances in each performance/stress cluster, running JDK 1.8_u74 with the CMS collector and (obviously simplified) -Xms5G -Xmx5G -Xmn1G.
The test keyspace is created with RF=3:
{code:SQL}
CREATE KEYSPACE IF NOT EXISTS test_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
{code}
Operations for test_keyspace.largeuuid1 generate a new key to insert and read from at the top of every iteration with generateRandomKeys(1). Each worker then generates 10,000 random mutations, each with the current timeuuid and a random value blob of 30 bytes to 2KB. This is intended to put some more "normal" load on the cluster.
{code:SQL}
CREATE TABLE IF NOT EXISTS test_keyspace.timeuuid1 (name text, col1 timeuuid, value blob, primary key(name, col1)) WITH compaction = { 'class':'LeveledCompactionStrategy' }

"INSERT INTO test_keyspace.largeuuid1 (name, col1, value) VALUES (?, ?, ?)"
"SELECT * FROM test_keyspace.largeuuid1 WHERE name = ? and col1 = ?"
{code}
The second and third generated workloads attempt to stress the large row size element of this
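As a small illustration of the mutation payloads described above (a random value blob of 30 bytes to 2KB per mutation), here is a hypothetical sketch of how a worker might size its values; this is not the actual stress-tool code:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of per-mutation value generation: a random blob
// between 30 bytes and 2KB, as described in the workload above.
class ValueGen {
    static byte[] randomValue() {
        int size = ThreadLocalRandom.current().nextInt(30, 2048 + 1);
        byte[] value = new byte[size];
        ThreadLocalRandom.current().nextBytes(value);
        return value;
    }

    public static void main(String[] args) {
        System.out.println(randomValue().length); // somewhere in [30, 2048]
    }
}
```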
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566188#comment-15566188 ] Michael Kjellman commented on CASSANDRA-9754: - There were some issues with my cherry-pick to my public GitHub branch. I started from scratch, squashed all 182 individual commits, rebased up to 2.1.16, and pushed to a new branch: https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1-v2 The full squashed 2.1-based patch is https://github.com/mkjellman/cassandra/commit/b17f2c1317326fac7b6864a2fc61d7ee2580f740
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566183#comment-15566183 ] Michael Kjellman commented on CASSANDRA-9754: - Fixed: a single squashed commit for the 182 individual commits (wow, didn't realize it was that many) has been pushed to a new branch and rebased up to 2.1.16: https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1-v2
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566132#comment-15566132 ] Michael Kjellman commented on CASSANDRA-9754: - Yes, however something went wrong with the cherry-pick to the external github.com repo, as caught by Jeff. I'm squashing all the changes into a single commit now and pushing up a new branch. Give me a few more moments.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566048#comment-15566048 ] Branimir Lambov commented on CASSANDRA-9754: Is it now ready for review?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565876#comment-15565876 ] Michael Kjellman commented on CASSANDRA-9754: - Latest set of fixes pushed to https://github.com/mkjellman/cassandra/commit/5586be24f55a16887376cb244a7d1b1fa777927f
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565823#comment-15565823 ] Michael Kjellman commented on CASSANDRA-9754: - Stable all night! My large test partitions have grown to ~12.5GB. Just as stable -- latencies are unchanged. I'm so happy!!! ~7ms average p99.9th and ~925 microseconds average read latency. GC is basically non-existent -- and for what GC is happening, the instances are averaging 111 microsecond ParNew collections -- almost NO CMS! Compaction is keeping up. On the converse side, the control 2.1 cluster running the same load has instances OOMing left and right -- CMS is frequently running 250 ms collections, and ParNew is running 1.28 times a second on average with 75 ms average ParNew times. Horrible! And that's the average -- the upper percentiles are a mess, so I won't bore everyone. Read latencies are currently 380 ms on average, with many 15 *second* read latencies at the p99.9.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564840#comment-15564840 ] Michael Kjellman commented on CASSANDRA-9754: - I fixed the last (pretty nasty) bug today/tonight! The issue was in IndexedSliceReader#IndexedBlockFetcher, where I was failing to properly initialize a new iterator to the given start of the slice for the read query. This caused every read to iterate over all indexed entries every time. Fortunately, that bug had brought some performance concerns in the underlying read logic to my attention, which I also addressed thinking they were the root cause. I'm currently running my performance/stress load in three separate performance clusters: two with a build that has Birch and one as a control running 2.1.16. I'm currently performing 700 reads/sec per instance and 1.5k writes/sec. Read latencies in both Birch perf clusters are showing (at the storage proxy level) 838 microsecond latencies at the average percentile and only 7.4 milliseconds at the p99.9th! Write latencies in both Birch perf clusters are showing (at the storage proxy level) 138 microseconds at the average percentile and 775 microseconds at the p99.9th! There is basically no GC to speak of, and the latencies have been very stable for the past hour since I restarted the load with the fix for the iterator mentioned above. The best thing about all these stats is that many of the reads are occurring against (currently) 8.5GB rows! The control cluster has latencies 7-8x the Birch clusters so far, GC is out of control, and instances are starting to constantly OOM. It's hard to compare anything against the control cluster, as things start to fall apart very significantly after the test CQL partitions grow above ~4GB, eek. I'm going to let the load continue overnight to grow the partitions larger (I'm targeting 50GB for this first performance milestone).
It's pretty hard to not be happy when you see these numbers. This could end up being very very epic for our little project. I'm *pretty*, pretty, pretty (okay *really() happy tonight!! > Make index info heap friendly for large CQL partitions > -- > > Key: CASSANDRA-9754 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9754 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Michael Kjellman >Priority: Minor > Fix For: 4.x > > Attachments: 9754_part1-v1.diff, 9754_part2-v1.diff > > > Looking at a heap dump of 2.0 cluster, I found that majority of the objects > are IndexInfo and its ByteBuffers. This is specially bad in endpoints with > large CQL partitions. If a CQL partition is say 6,4GB, it will have 100K > IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for > GC. Can this be improved by not creating so many objects? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
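The iterator fix described above amounts to positioning the iterator at the block covering the slice start instead of scanning from the first indexed entry. A minimal sketch of the idea, using a NavigableMap as a stand-in for the actual index structures (all names here are hypothetical, not the real IndexedBlockFetcher code):

```java
import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SliceSeekDemo {
    public static void main(String[] args) {
        // Stand-in for the column index: start offset -> indexed block.
        NavigableMap<Integer, String> index = new TreeMap<>();
        for (int i = 0; i < 10; i++) index.put(i * 100, "block-" + i);

        int sliceStart = 450;
        // Buggy behavior: iterating from the first entry touches all 10 blocks.
        // Fixed behavior: first find the block whose range covers the slice start,
        // then iterate only from there.
        int firstBlock = index.floorKey(sliceStart); // 400
        Collection<String> scanned = index.tailMap(firstBlock, true).values();
        System.out.println(scanned.size());             // blocks actually visited
        System.out.println(scanned.iterator().next());  // first block of the slice
    }
}
```

With ten 100-unit blocks and a slice starting at 450, the fixed path visits 6 blocks starting at "block-4" instead of all 10.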
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553444#comment-15553444 ] Michael Kjellman commented on CASSANDRA-9754: - I just pushed up a squashed commit for the roughly 68 individual commits I've made while working on stability and performance over the past few weeks. https://github.com/mkjellman/cassandra/commit/41c6d43d0b020149a5564d4f7ab3c92e1bfcba64 I'm currently writing up the findings from the latest stress test I've been running for the last 24 hours across 3 performance clusters and will update the ticket with that in a bit.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550920#comment-15550920 ] Michael Kjellman commented on CASSANDRA-9754: - I'm a bit late on the Tuesday target I set for myself on Saturday, but for good reason :) I've been working almost non-stop since (went to bed at 3:30am and was up at 8:30am looking at graphs... and I've been looking at graphs ever since). I have a performance load running in 3 perf clusters -- I'd like to aggregate those objective findings tomorrow and then push up whatever the state of things is (it's *very* stable, so I'm pretty pumped about that) along with some benchmarks (the good and possibly bad/still-needs-improvement).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542919#comment-15542919 ] Branimir Lambov commented on CASSANDRA-9754:
bq. if we mmap a few times we'll still incur the very high and unpredictable costs from mmap
The {{MmappedRegions}} usage is to map the regions at sstable load, i.e. effectively only once in the table's lifecycle, which should completely avoid any mmap costs at read time.
bq. I'm wondering though if mmap'ing things even makes sense
Depends on whether we want to squeeze out the last bit of performance or not. Memory-mapped data (assuming it is already mapped as above) that resides in the page cache has no cost whatsoever to access, while reading it off a RAF or a channel still needs a system call plus some copying. The difference is felt most on workloads that fit entirely in the page cache. If you don't feel this is helpful, you can leave it out of the 2.1 version and rely on {{Rebufferer}} (or {{RandomAccessReader}}) to do memory mapping or caching for you in trunk.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542718#comment-15542718 ] Michael Kjellman commented on CASSANDRA-9754: - I saw that way back when I started implementing things -- I'm wondering though if mmap'ing things even makes sense. Given the chunks of work are aligned on 4k boundaries, even if we mmap a few times we'll still incur the very high and unpredictable costs from mmap (even if less, given 2GB chunks are obviously much bigger than 4kb)... thoughts? I'm trying to profile it now...
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542043#comment-15542043 ] Branimir Lambov commented on CASSANDRA-9754: > Originally, I was mmapping 4kb aligned chunks as necessary. Cassandra has some machinery to deal with the same problem in {{RandomAccessReader}}; the solution we have in place is to map the entire file in <2GB chunks and look the chunk up on a read. Take a look at {{MmappedRegions}} in trunk and its users.
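The map-once, look-up-on-read approach described here can be sketched as follows: map the whole file in fixed-size chunks at load time, so a read becomes an array lookup plus a division, with no mmap syscall on the read path. This is an illustrative sketch only (the class name is made up, and a small 1 MiB chunk is used for the demo; the real {{MmappedRegions}} uses chunks just under 2 GB):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMmapDemo {
    static final int CHUNK = 1 << 20; // 1 MiB for the demo; real code uses just under 2 GiB
    final MappedByteBuffer[] regions;

    // Map every chunk up front, once, when the file is opened.
    ChunkedMmapDemo(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + CHUNK - 1) / CHUNK);
            regions = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = (long) i * CHUNK;
                regions[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, Math.min(CHUNK, size - start));
            }
        }
    }

    // Read path: pure arithmetic, no syscalls.
    byte get(long offset) {
        return regions[(int) (offset / CHUNK)].get((int) (offset % CHUNK));
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("mmap", ".bin");
        byte[] data = new byte[3 * CHUNK / 2]; // spans two chunks
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(p, data);
        ChunkedMmapDemo d = new ChunkedMmapDemo(p);
        System.out.println(d.get(0) == data[0] && d.get(data.length - 1) == data[data.length - 1]);
    }
}
```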
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538960#comment-15538960 ] Michael Kjellman commented on CASSANDRA-9754: - Oh, another very important update. Originally, I was mmapping 4kb-aligned chunks as necessary. When I finally got things stable (after a few file descriptor leaks and some fun fighting Java over MappedByteBuffer objects), I ran the performance load from the stress tool I wrote and found the performance was randomly *terrible* (like 1.3 SECONDS in the 99.9th percentile). Upon investigation and a ton of instrumentation, I found mmap calls were taking *90+ms* in the 99th percentile and *70+ms* in the 90th percentile on the hardware I'm using for performance testing. I looked into the JDK source code to figure out if there were any synchronized blocks in the native code, but it's pretty sane and just calls the mmap syscall. I discussed it a bit with Norman Maurer and we both came away pretty shocked that mmap could be that slow! These boxes have 256GB of RAM and there was basically zero disk IO, as everything was in the page cache as expected. There were a lot of major page faults, but it's still very surprising that mmap can be so horrible in the upper percentiles. I ripped out all the mmap logic on the read path and switched to directly reading the aligned 4kb chunks from the RAF as needed, and everything looked amazing.
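The replacement read path described above -- reading the 4kb-aligned page containing a given offset directly from the RandomAccessFile -- looks roughly like this (a hypothetical sketch, not the actual PageAlignedReader code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class AlignedRafRead {
    static final int PAGE = 4096;

    // Reads the 4 KiB-aligned page containing `offset`: one positioned read
    // per page instead of an mmap call.
    static byte[] readAlignedPage(RandomAccessFile raf, long offset) throws IOException {
        long pageStart = offset & ~(long) (PAGE - 1); // round down to the 4 KiB boundary
        byte[] buf = new byte[(int) Math.min(PAGE, raf.length() - pageStart)];
        raf.seek(pageStart);
        raf.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("pages", ".bin");
        byte[] data = new byte[3 * PAGE];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        Files.write(p, data);
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r")) {
            byte[] page = readAlignedPage(raf, PAGE + 123); // lands in the second page
            System.out.println(page.length == PAGE && page[123] == data[PAGE + 123]);
        }
    }
}
```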
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538947#comment-15538947 ] Michael Kjellman commented on CASSANDRA-9754: - Wanted to post a quick update on the ticket. I've been working pretty much around the clock for the last two weeks on stabilizing, performance testing, validating, and bug fixing the code. I had an unfortunate, unexpected death in my family last week, so I lost the better part of this past week while tying up the last pieces I was finishing before I got the bad news. After attempting to work with a few people in the community to get cassandra-stress working in a way that actually stresses large partitions and validates the data written into them, I ended up needing to write a stress tool. I loaded up a few hundred 30GB+ partitions with column sizes of 300-2048 bytes while constantly reading data that was sampled during the inserts to make sure I'm not returning bad data or incorrect results. I ran the most recent load for ~2 days in a small performance cluster and there were no validation errors. Additionally, I'm running the exact same stress/perf load in another identical cluster with a 2.1 build that does *not* contain Birch. This is allowing me to make objective A/B comparisons between the two builds. The build is stable, there are no exceptions or errors in the logs even under pretty high load (the instances are doing 3x the load we generally run at in production), and most importantly GC is *very* stable. In contrast, GC starts off great without Birch, but around the time the large partitions generated by the stress tool reached ~250MB, GC shot up and then kept increasing as the rows grew (as expected). The cluster with the Birch build, meanwhile, had no change in GC as the size of the partitions increased.
I was a bit disappointed with some of the latencies I saw on reads in the upper percentiles, so I identified what I'm almost positive was the cause and just finished refactoring the logic for serializing/deserializing the aligned segments and subsegments in PageAlignedWriter/PageAlignedReader. I'm cleaning up the commit now and then going to get it into the perf cluster to start another load. If that looks good, I'm hoping to push all the stability and performance changes I've made up to my public Github branch, most likely Tuesday, as I'd like to let the performance load run for 2 days to build up large enough partitions to accurately stress and test things. :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474662#comment-15474662 ] Michael Kjellman commented on CASSANDRA-9754: - Over the past few days I've made some really great progress. I have the 2.1-based implementation (as found at https://github.com/mkjellman/cassandra/commits/CASSANDRA-9754-2.1) in a temporary performance cluster running stably against cassandra-stress. I found a few issues that I've been fixing as I find them while running the code under load:
* Fix reading of non-birch indexes from SSTableScanner
* Force un-mmapping of the current mmapped buffer from a PageAlignedReader before mmapping a new region
* Fix alignTo() issues when using anything other than 4096 padding for indexes (e.g. 2048)
* Make Birch/PageAligned format padding length configurable (sstable_index_segment_padding_in_kb)
* Fix signing issue when serializing and deserializing an unsigned short
* Use a reusable buffer in PageAlignedWriter
* Fix an issue where the index of the current subsegment was being used when the index of the current segment should have been used
* Other minor cleanup, spelling nits, etc
I've observed a bug where a java.nio.BufferUnderflowException is sometimes thrown under load from a ValidationExecutor thread while doing a repair. I've put some temporary logging in to dump the state of the reader when the exception happens, but I'm still not sure how it gets into that state. Wondering if there is some kind of concurrency problem somewhere? Also (although obvious in hindsight), the page alignment that keeps segments aligned on 4kb boundaries causes unacceptable write amplification in the size of the index file for workloads with small row keys and < 64kb of data in the row (a.k.a. no index). I've been discussing with a few people the various options we have and the tradeoffs for each one of them. Hoping to formalize those thoughts and implement something today or tomorrow.
So, all in all, the 2.1-based implementation is really stabilizing and initial performance tests are looking very encouraging!
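For the alignTo() fix mentioned in the list above, a typical power-of-two implementation (sketched here with hypothetical names) rounds an offset up to the next padding boundary and has to work for 2048 just as well as 4096:

```java
public class AlignTo {
    // Round `offset` up to the next multiple of `padding`.
    // Only valid when `padding` is a power of two, since it relies on
    // -padding being a mask of the high bits.
    static long alignTo(long offset, long padding) {
        return (offset + padding - 1) & -padding;
    }

    public static void main(String[] args) {
        System.out.println(alignTo(1, 4096));    // rounds up to the first 4 KiB boundary
        System.out.println(alignTo(4096, 4096)); // already aligned: unchanged
        System.out.println(alignTo(4097, 2048)); // works for 2048 padding too
    }
}
```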
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454525#comment-15454525 ] Michael Kjellman commented on CASSANDRA-9754: - I've discovered a performance regression caused by the original logic in PageAlignedReader. I always knew the original design wasn't ideal, but I felt the additional code complexity wasn't worth the performance improvement. Now that the code has stabilized and I've moved on to performance validation (and not just bugs and implementation), I found it was horribly inefficient. https://github.com/mkjellman/cassandra/commit/33d35272ae50803bac626ab60d5ecd3a36f5b283 I've updated the documentation in PageAlignedWriter to cover the new PageAligned file format. The new implementation allows lazy deserialization of segment metadata as required and enables binary search across segments via the fixed-length starting offsets. This means deserialization of the segments is no longer required ahead of time -- deserialization of the segment metadata only occurs when required to return a result. Initial benchmarking and profiling makes me a pretty happy guy. I think the new design is a massive improvement over the old one and looks pretty good so far.
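The fixed-length starting offsets described above make the cross-segment binary search possible without touching any segment metadata: find the segment whose start offset is the greatest one not exceeding the target position, then deserialize only that segment's metadata. A hypothetical sketch of the lookup (not the actual PageAlignedReader code):

```java
import java.util.Arrays;

public class SegmentOffsetIndex {
    // One fixed-width entry per segment, sorted by file position. Because the
    // entries are fixed length, they can be read directly without deserializing
    // the (variable-length) segment metadata they point at.
    final long[] segmentStarts;

    SegmentOffsetIndex(long[] segmentStarts) { this.segmentStarts = segmentStarts; }

    // Index of the segment whose range contains `pos`.
    int segmentFor(long pos) {
        int idx = Arrays.binarySearch(segmentStarts, pos);
        // On a miss, binarySearch returns (-(insertion point) - 1);
        // the covering segment is the one starting just before `pos`.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        SegmentOffsetIndex idx = new SegmentOffsetIndex(new long[]{0, 4096, 8192, 20480});
        System.out.println(idx.segmentFor(0));      // exact start of segment 0
        System.out.println(idx.segmentFor(5000));   // inside segment 1
        System.out.println(idx.segmentFor(20480));  // exact start of the last segment
    }
}
```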
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451066#comment-15451066 ] Michael Kjellman commented on CASSANDRA-9754: - I pushed a rebased commit that addresses many additional review comments by [~jasobrown], adds additional unit tests, and has many further improvements to documentation. This is still 2.1-based; however, the review and improvements made in the org.apache.cassandra.db.index.birch package are agnostic to a trunk- or 2.1-based patch. https://github.com/mkjellman/cassandra/commit/3d686799a0e79c23d86881bb041b5408dcfda014 https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1 Some highlights:
* Fix a bug in KeyIterator identified by [~jjirsa] that would cause the iterator to return nothing when the backing SegmentedFile contains exactly 1 key/segment.
* Add unit tests for KeyIterator
* Add SSTable version ka support to LegacySSTableTest. Actually test something in LegacySSTableTest.
* Add additional unit tests around PageAlignedReader, PageAlignedWriter, BirchWriter, and BirchReader
* Remove word lists and refactor all unit tests to use TimeUUIDTreeSerializableIterator instead
* Improve and fix documentation as required to properly parse and format during javadoc creation
* Remove reset() functionality from BirchReader.BirchIterator
* Fix many other nits
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434219#comment-15434219 ] Michael Kjellman commented on CASSANDRA-9754: - So, I'm mostly done with a trunk version of the patch; however, I'm currently focusing on finishing and polishing the 2.1-based version. Although the abstraction of the index is almost a total rewrite between 2.1 and trunk, the tree itself and the Birch implementation should remain the same, so this certainly isn't wasted time for anyone. :) I've cleaned up the implementation a bunch, taken care of a bunch of TODOs and low-hanging fruit, added more documentation, and pushed it to Github to make it a bit easier to make sure the changes apply cleanly. https://github.com/mkjellman/cassandra/commit/e5389378b19eb03de7dd4d50d6df110c68057985 The following 4 unit tests (out of 1184) are still failing (so close!):
* org.apache.cassandra.cql3.KeyCacheCqlTest (2 of 2). Need to talk to [~aweisberg] to understand exactly what these unit tests are testing.
* org.apache.cassandra.db.ColumnFamilyStoreTest (2 of 38, both related to secondary indexes)
Tomorrow, I hope to push a patch addressing the feedback from [~barnie] (see above comment) along with any changes that come out of the code review currently underway by [~jasobrown] and [~kohlisankalp]. I also want to get more comfortable with the upgrade/backwards-compatibility story and make sure there is a good unit test story around it. [~jjirsa] if you get a chance to take a look, any initial feedback would be awesome!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380306#comment-15380306 ] Michael Kjellman commented on CASSANDRA-9754: - Thanks Branimir for starting to review the code!! 1) Yes, good optimization. The first-most entry in the tree (that is, the first element in the root node) doesn't need to be repeated in the inner nodes, as you can assume that you always go to the left. 2) I think we can make the assumption that if the length of the bytes for a given entry's key is greater than or equal to the max length for a given node/leaf, the code can assume there will be an int encoded after the key bytes with the offset into the overflow page. If the key happens to be equal to the max length (but doesn't actually overflow), then we can encode 0 or something and have the code know that value means no overflow page. The downside here is adding more assumptions and complexity to the code vs. eating the single byte in the overflow page. 3) Could do this, but I think it increases the complexity, as the root node would need to be special-cased for max size. Currently the root node, inner nodes, and leaf nodes all use the exact same code, where only the value serializer is different (to either write the offset to the next node/leaf or the actual value of the entry you want to add to the tree). We'd need to subtract the size of the descriptor from the available size of the aligned root node and then know where to seek inside it. In regards to your final comment: while technically we could build the root/inner nodes to point to an offset of IndexInfo entries serialized in the current format, I think the tradeoffs make it non-ideal.
If we serialize the entries in the leaf like I'm currently doing, we (a) get the benefits of the leaf being aligned and padded, and (b) with the proposed leaf format, we can binary search inside each leaf node to get the exact entry, vs. having an inner node just point to an offset and then needing to linearly scan until we hit a match. I'm almost done with a refactor for trunk. We found some pretty serious regressions in 2.1 that required my immediate attention, but I hope to have a trunk-based patch with your initial suggestions from above incorporated into the tree implementation very soon.
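The encoding discussed in point 2 can be sketched as follows: only keys that fill the fixed per-node slot carry a trailing overflow offset, with 0 reserved for "exactly max length, no overflow". All names here (and the 8-byte max key length) are hypothetical, chosen for illustration:

```java
import java.nio.ByteBuffer;

public class OverflowKeyCodec {
    static final int MAX_KEY = 8; // hypothetical per-node max key length

    // Write the (possibly truncated) key. Only keys occupying the full slot
    // are followed by a 4-byte overflow-page offset; 0 means "exactly max
    // length, nothing spilled to the overflow page".
    static void encode(ByteBuffer out, byte[] key, int overflowOffset) {
        int len = Math.min(key.length, MAX_KEY);
        out.putShort((short) len);
        out.put(key, 0, len);
        if (len == MAX_KEY)
            out.putInt(key.length > MAX_KEY ? overflowOffset : 0);
    }

    // The reader only looks for the overflow int when the key length hits the max.
    static boolean hasOverflow(ByteBuffer in) {
        int len = in.getShort() & 0xFFFF;
        in.position(in.position() + len); // skip the key bytes
        return len == MAX_KEY && in.getInt() != 0;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        encode(buf, new byte[]{1, 2, 3}, 0);   // short key: no overflow int at all
        encode(buf, new byte[12], 4096);       // long key: spills to the overflow page
        encode(buf, new byte[MAX_KEY], 0);     // exactly max: int written, but zero
        buf.flip();
        System.out.println(hasOverflow(buf)); // false
        System.out.println(hasOverflow(buf)); // true
        System.out.println(hasOverflow(buf)); // false
    }
}
```

The tradeoff Michael describes is visible here: the common (short-key) case pays nothing, while a key that merely equals the max length costs four extra bytes for the zero sentinel.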
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351341#comment-15351341 ] Branimir Lambov commented on CASSANDRA-9754: I spent some time reading up on {{BirchReader}} to figure out the nuts and bolts of how the storage works. I think we can squeeze a little more efficiency into the structure:
- As far as I could see, your current implementation places a lot of copies on the lower side of each span in the non-leaf nodes (for example, the lowest key of the partition is present in the leaf node, its parent, and all parents leading all the way to the root). This should not be necessary: simply omitting the first key (but retaining the child pointer) from all intermediate nodes and adding 1 to what the binary search returns will achieve the same result.
- I find the overflow flag (and jumping back and forth to read it) less efficient than necessary. If we assume instead that a key length equal to the max always entails overflow data, we would use less space and be more efficient in the common case, while having a very low chance of taking a few bytes more in the uncommon situation of long keys.
- The root node could be in the same page as the descriptor (it is usually smaller, so there's a high chance it fits). Perhaps the overflow data is best placed elsewhere?
More generally (ignoring padding on the leaves, which is not necessarily always beneficial), the B+ structure you have built is practically a B-Tree index over a linear list of index entries. As we already have a linear list of {{IndexInfo}} structures in the current format, what are we gaining by not just building a B-Tree index over that? To me the latter would appear to be less complicated and much more generic, with immediate possible applications in other parts of the codebase.
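The first suggestion above -- dropping the lowest key from intermediate nodes and adjusting the binary search result -- works out to this: an inner node with n children stores only n-1 separator keys, and the child to descend into is the number of separators less than or equal to the search key. A hypothetical sketch (int keys stand in for the real serialized keys):

```java
import java.util.Arrays;

public class SeparatorLookup {
    // An inner node with separators.length + 1 children. The lowest key of the
    // node's span is omitted: anything below separators[0] goes to child 0.
    static int childFor(int[] separators, int key) {
        int idx = Arrays.binarySearch(separators, key);
        // Miss: binarySearch returns (-(insertion point) - 1), and the insertion
        // point is already the child index. Hit: add 1, i.e. descend to the right
        // of the matching separator.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        int[] seps = {10, 20, 30}; // 4 children; 3 separators, first key omitted
        System.out.println(childFor(seps, 5));   // below every separator
        System.out.println(childFor(seps, 10));  // exact hit descends right
        System.out.println(childFor(seps, 25));  // between 20 and 30
        System.out.println(childFor(seps, 99));  // above every separator
    }
}
```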
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323500#comment-15323500 ] Michael Kjellman commented on CASSANDRA-9754: - [~tjake] will do!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323406#comment-15323406 ] T Jake Luciani commented on CASSANDRA-9754: --- [~mkjellman] for a post-2.1 patch, take a look at CASSANDRA-7443, which added an abstraction for IndexEntry and serializers that should hopefully be similar to what you did for this 2.1 version.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323330#comment-15323330 ] Michael Kjellman commented on CASSANDRA-9754: - Some additional thoughts while I'm thinking about them: * PageAlignedReader currently deserializes all the segments in the constructor. It might be more efficient to lazily deserialize each segment as we need it. I'm sure perf testing will quickly make it clear whether the extra code complexity is worth the potential performance trade-off... * I picked 4KB for the page size based on an educated guess, but obviously other sizes need to be tested (less? more?)
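The lazy alternative floated above could look something like this minimal sketch. `LazySegments` and its `loader` callback are hypothetical stand-ins for illustration, not the actual PageAlignedReader API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: instead of deserializing every segment eagerly in the
// reader's constructor, materialize each segment on first access and cache it.
public class LazySegments<T> {
    private final Map<Integer, T> cache = new HashMap<>();
    private final IntFunction<T> loader; // deserializes segment i on demand

    public LazySegments(IntFunction<T> loader) {
        this.loader = loader;
    }

    public T segment(int i) {
        // first access pays the deserialization cost; later accesses hit the cache
        return cache.computeIfAbsent(i, loader::apply);
    }
}
```

The trade-off is exactly the one the comment raises: reads that only touch a few segments avoid the up-front cost, at the price of extra bookkeeping on the hot path.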
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321353#comment-15321353 ] Michael Kjellman commented on CASSANDRA-9754: - TIL: Attempting to upload to Jira via the slow and overpriced Gogo in-flight wifi doesn't work... "Cannot attach file 9754_part2-v1.diff: Unable to communicate with JIRA." Working on it... :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321338#comment-15321338 ] Jonathan Ellis commented on CASSANDRA-9754: --- Delighted to see this patch land, looking forward to getting it merged!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321332#comment-15321332 ] Michael Kjellman commented on CASSANDRA-9754: - Alright, happy to finally be able to write this. :) I'm attaching a v1 diff containing Birch! h4. Why is it named Birch? B+ Tree -> Trees that start with the letter B -> Birch... get it? haha... h4. Description Birch is a B+-inspired tree aimed at improving the performance of the SSTable index in Cassandra (especially with large partitions). The existing implementation scales poorly with the size of the index/row, as the entire index must be deserialized onto the heap even to look up a single element. This puts significant pressure on the heap, where one read to a large partition will cause, at minimum, a long painful CMS GC pause or -- in the worst case -- an OOM. The Birch implementation has a predictable fixed cost for reads, at the expense of additional on-disk overhead for the tree itself, with the same O(log(n)) complexity as the existing implementation. Every row added to the SSTable is also added to the primary index. If the size of the row is greater than 64KB we build an index (otherwise we just encode the position of that row in the sstable). All entries encoded into the index are page aligned and padded to the nearest boundary (4096 bytes by default). Every segment can be marked as either internally padded/aligned along a boundary or non-padded/aligned (up to 2GB). Birch indexes are aligned into 4096-byte nodes (both leaf and inner). Keys are encoded inside the node itself unless they exceed half the node size. In that case, the first half-node's worth of bytes is encoded into the node itself, along with the offset of the remaining bytes in the overflow page. This enables predictable fixed performance of the tree while accommodating variable-length keys/elements. h4.
Notes on v1 of the diff (in no particular order) * I broke the changes into two logical parts: the first abstracts out the existing index implementation and adds no new logic; the second includes an IndexedEntry implementation backed by a Birch tree. * The attached v1 patch is written for 2.1. I have already started rebasing the patch onto trunk and hope to finish that shortly and post the trunk-based patch * There's some high-level Javadoc documentation in BirchWriter and PageAlignedWriter on the layout of the tree on disk, the serialization and deserialization paths, and the higher-level goals of the classes * The next steps are to start getting feedback from reviewers and the community. I have profiled the tree itself, but profiling the tree integrated into the stack and optimizing non-performant code paths is next (after the immediate task of rebasing the change onto trunk) * There are still a few todos I've left in regards to handling backwards compatibility, parts of the code I expect might be non-performant, and things I'd like to discuss on the "correct" implementation/behavior, etc. * I have a few unit tests that still don't pass and still need to be root-caused... I've taken the approach this entire time that the unit tests shouldn't be touched to pass, so there are still a few behavioral regressions I've accidentally introduced.
The current failing tests are: ** AutoSavingCacheTest ** SecondaryIndexTest ** BatchlogManagerTest ** KeyCacheTest ** ScrubTest ** IndexSummaryManagerTest ** LegacySSTableTest ** MultiSliceTest * I need to write a unit test for reading the legacy/existing primary index implementation * By the nature of the index's role in the database, the unit test coverage is actually pretty extensive, as any read and write touches the index in some capacity I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high-level design I ended up with and the considerations I had to take into account once I actually got deep inside this part of the code. Looking forward to feedback!
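The key-encoding rule in the description above (inline up to half the node size, spill the rest to an overflow page) can be sketched roughly as follows. `NODE_SIZE`, `EncodedKey`, and `encode` are illustrative names under that stated rule, not the actual Birch classes:

```java
// Hedged sketch of the Birch key-encoding rule: a key fits fully inline if it
// is at most half the 4096-byte node; otherwise the first nodeSize/2 bytes
// stay in the node and the remainder spills to an overflow page, with the node
// recording how many bytes (and, in the real tree, at what offset) overflowed.
public class BirchKeyEncodingSketch {
    static final int NODE_SIZE = 4096; // default page/node size

    static class EncodedKey {
        final byte[] inline;      // bytes stored inside the node itself
        final int overflowLength; // bytes spilled to the overflow page
        EncodedKey(byte[] inline, int overflowLength) {
            this.inline = inline;
            this.overflowLength = overflowLength;
        }
    }

    static EncodedKey encode(byte[] key) {
        int limit = NODE_SIZE / 2;
        if (key.length <= limit)
            return new EncodedKey(key, 0); // small key: fully inline, no overflow
        byte[] head = new byte[limit];
        System.arraycopy(key, 0, head, 0, limit);
        return new EncodedKey(head, key.length - limit);
    }

    public static void main(String[] args) {
        EncodedKey small = encode(new byte[100]);
        EncodedKey big = encode(new byte[5000]);
        System.out.println(small.inline.length + " inline / " + small.overflowLength + " overflow");
        System.out.println(big.inline.length + " inline / " + big.overflowLength + " overflow");
    }
}
```

Capping the inline portion at half a node is what keeps node fan-out (and therefore lookup cost) predictable regardless of key length.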
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235141#comment-15235141 ] Jack Krupansky commented on CASSANDRA-9754: --- Any idea how a new wide partition will perform relative to the same amount of data and same number of clustering rows divided into bucketed partitions? For example, a single 1 GB wide partition vs. ten 100 MB partitions (same partition key plus a 0-9 bucket number) vs. a hundred 10 MB partitions (0-99 bucket number), for two access patterns: 1) random access to a row or short slice, and 2) a full bulk read of the 1 GB of data, one moderate slice at a time. Or maybe the question is equivalent to asking what the cost is to access the last row of the 1 GB partition vs. the last row of the tenth or hundredth bucket of the bucketed equivalent. No precision required. Just inquiring whether we can get rid of bucketing as a preferred data modeling strategy, at least for the common use cases where the sum of the buckets is roughly 2 GB or less. The bucketing approach does have the side effect of distributing the buckets around the cluster, which could be a good thing, or maybe not.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221050#comment-15221050 ] Jeff Jirsa commented on CASSANDRA-9754: --- I renew my offer to help test code (and / or port to other versions, especially if you don't yet have a 2.1 patch set)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221026#comment-15221026 ] Michael Kjellman commented on CASSANDRA-9754: - I have some insanely encouraging initial performance numbers! I'd like to do some more validation to make sure I didn't screw up any of the benchmarks before sharing, but the read story is better than I could have ever imagined!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220973#comment-15220973 ] Michael Kjellman commented on CASSANDRA-9754: - Alright, good news! My unit test that creates and reads from an index with 100,000,000 entries (!!) successfully passes! I came up with a pretty nice solution to the word-list issue (I was unable to find a word list with 100M+ entries): instead, I am generating n TimeUUID elements -- which are naturally free of duplicates, can be created in unlimited quantity, and come already sorted as they're generated! I'm currently profiling the code to come up with numbers...
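The property being relied on above -- keys that are unique and emitted in sorted order -- can be illustrated with a small sketch. Since the JDK has no built-in time-based UUID generator, this uses a hypothetical (timestamp, counter) string key as a stand-in for TimeUUIDs:

```java
// Hypothetical stand-in for TimeUUID-style test keys: a monotonically
// increasing (timestamp, counter) pair yields keys that are unique and
// already sorted, so millions of index entries can be generated on the fly
// without needing a word list. Assumes the clock does not move backwards.
public class SortedKeyGenerator {
    private long lastMillis = -1;
    private int counter = 0;

    public String next() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            counter++; // disambiguate keys created within the same millisecond
        } else {
            lastMillis = now;
            counter = 0;
        }
        // zero-padded so lexicographic order matches generation order
        return String.format("%020d-%08d", now, counter);
    }

    public static void main(String[] args) {
        SortedKeyGenerator gen = new SortedKeyGenerator();
        String prev = gen.next();
        for (int i = 0; i < 100_000; i++) {
            String next = gen.next();
            if (prev.compareTo(next) >= 0)
                throw new AssertionError("keys not strictly increasing");
            prev = next;
        }
        System.out.println("generated 100,000 sorted unique keys");
    }
}
```

Real TimeUUIDs embed the timestamp in their high bits, which is what gives them the same already-sorted, duplicate-free behavior when compared with a time-aware comparator.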
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190722#comment-15190722 ] Michael Kjellman commented on CASSANDRA-9754: - Also, does anyone know of any truly massive word lists that are totally free of legal concerns for testing? (I'm looking for >3-4 million words)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190718#comment-15190718 ] Michael Kjellman commented on CASSANDRA-9754: - I have the new FileSegment-friendly implementation working for the following conditions: 1) straight search for key -> get value 2) iterate efficiently both forwards and in reverse through all elements in the tree 3) binary search for a given key and then iterate through all remaining keys from the found offset 4) overflow page for handling variable-length tree elements that exceed the max size of a given individual page (up to 2GB) I have also successfully run some new unit tests I wrote that do 5000 consecutive iterations with randomly generated data (to "fuzz" the tree for edge conditions), building and validating trees that contain between 300,000-500,000 elements. I've also spent a good amount of time writing some pretty reasonable documentation of the binary format itself. Tomorrow, I'm planning on testing a 4.5GB individual tree against the new implementation and doing some profiling to see the exact memory impact, now that both the serialization and deserialization paths are basically complete. Will update with those findings tomorrow!
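Access patterns 1-3 above reduce to the same primitive: binary search within a node for a key (or its insertion point), then iterate from the found position. A minimal sketch, with a sorted in-memory array standing in for a serialized 4KB page:

```java
import java.util.Arrays;

// Hedged sketch of "binary search for a given key, then iterate from the
// found offset". A sorted String array stands in for the contents of one
// fixed-size Birch node; the real tree does this against serialized pages.
public class NodeSearchSketch {
    static int findInsertionPoint(String[] node, String key) {
        int idx = Arrays.binarySearch(node, key);
        // on a miss, Arrays.binarySearch returns -(insertionPoint) - 1
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args) {
        String[] node = { "apple", "cherry", "mango", "pear" };
        // "grape" is absent; the search lands on the first key >= it ("mango")
        int start = findInsertionPoint(node, "grape");
        StringBuilder forward = new StringBuilder();
        for (int i = start; i < node.length; i++) // forward iteration (pattern 3)
            forward.append(node[i]).append(' ');
        System.out.println(forward.toString().trim()); // prints "mango pear"
    }
}
```

Reverse iteration is the same loop run from `start` down to 0, which is why the tree can serve both scan directions from a single search.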
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189412#comment-15189412 ] Aleksey Yeschenko commented on CASSANDRA-9754: -- [~jkrupan] The priority of the JIRA ticket here is irrelevant now that it's being actively worked on -- see also CASSANDRA-11206.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189407#comment-15189407 ] Jack Krupansky commented on CASSANDRA-9754: --- Is this issue still considered a Minor priority? Seems like a bigger deal to me. +1 for making it a Major priority - unless there is a longer list of even bigger fish in the queue. Just today there is a user on the list struggling with time series data and really not wanting to have to split a partition that he needs to be able to scan. Of course, scanning a super-wide partition will still be a very bad idea anyway, but at least narrower scans would still be workable with this improvement in place. Is this a 3.x improvement or 4.x or beyond? +1 for 3.x (3.6? 3.8?).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188993#comment-15188993 ] Michael Kjellman commented on CASSANDRA-9754: - Had a very productive day... made a huge amount of progress on the deserialization side of things and the required serialization changes.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173981#comment-15173981 ] Michael Kjellman commented on CASSANDRA-9754: - Update: the serialization path is basically done with the new design, which is more friendly to the SegmentedFile issue. Will finish that up today and move on to deserialization.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149269#comment-15149269 ] Michael Kjellman commented on CASSANDRA-9754: - well - then you'd lose the cache-line aware logic I've implemented to make the B+ tree efficient...
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149265#comment-15149265 ] Jonathan Ellis commented on CASSANDRA-9754: --- Would it make sense to always use buffered I/O for the B+ tree instead of mmap?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149034#comment-15149034 ] Michael Kjellman commented on CASSANDRA-9754: - I was asked to update. Making good progress. A large number of tests pass. I'm basically just getting the math right for "Inception: The Cassandra Director's Cut" due to the need to make a B+ tree (disk-based by its very definition) work with SegmentedFile etc...
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139712#comment-15139712 ] Jack Krupansky commented on CASSANDRA-9754: --- bq. large CQL partitions (4GB,75GB,etc) What is the intended target/sweet spot for large partitions... 1GB, 2GB, 4GB, 8GB, 10GB, 15GB, 16GB, or... what? Will random access to larger partitions create any significant heap/off-heap memory demand, or will heap/memory simply become the total rows accessed regardless of how they might be bucketed into partitions? Will we be able to tell people that bucketing of partitions is now never needed, or will there now just be a larger bucket size, like 4GB/partition rather than the 10MB or 50MB or 100MB that some of us recommend today?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139130#comment-15139130 ] Jonathan Ellis commented on CASSANDRA-9754: --- [~mkjellman] I'd be very interested to see what impact this has on query performance as partition size grows.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139735#comment-15139735 ] Michael Kjellman commented on CASSANDRA-9754: - [~jkrupan] ~2GB is the max target I'd recommend from experience at the moment. The current implementation creates an IndexInfo entry for every 64KB (by default - and I highly doubt anyone actually changes this default) worth of data. Each IndexInfo object contains the offset into the sstable where the partition/row starts, the length to read, and the name. These IndexInfo objects are placed into a list and binary searched over to find the name closest to the query. Then, we go to that offset in the sstable and start reading the actual data. The issue that makes things so bad with large partitions is that the entire list of IndexInfo objects is serialized one after another into the index file on disk. To use it for an indexed read across a given partition, we have to read the entire thing off disk, deserializing every IndexInfo object, place them into a list, and then binary search across it. This creates a ton of small objects very quickly that are likely to be promoted and thus create a lot of GC pressure. If you take the average size of each column you have in a row, you can figure out how many index entry objects will be created (one for every 64KB of your data in that partition). I've found that once the IndexInfo array contains > 300K objects things get bad. The implementation I'm *almost* done with has the same big-O complexity (O(log(n))) as the current implementation, but the index is instead backed by page-cache-aligned mmap'ed segments (B+ tree-ish, with an overflow page implementation similar to that of SQLite). This means we can now walk the IndexEntry objects and only bring onto the heap the 4KB chunks that are involved in the binary search for the correct entry itself. The tree itself is finished and heavily tested.
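The one-IndexInfo-per-64KB relationship above also lets you sanity-check the numbers in the ticket description. A back-of-the-envelope sketch, assuming (per the description) roughly two ByteBuffers per IndexInfo entry:

```java
// Back-of-the-envelope check of the ticket's numbers: one IndexInfo per 64KB
// column-index block means a ~6.4GB partition yields ~100K IndexInfo objects,
// and at roughly two ByteBuffers per entry, ~200K ByteBuffers -- all of which
// the legacy path must deserialize onto the heap for a single indexed read.
public class IndexInfoCostSketch {
    static final long COLUMN_INDEX_BLOCK = 64 * 1024; // 64KB default block size

    static long indexInfoCount(long partitionBytes) {
        return partitionBytes / COLUMN_INDEX_BLOCK;
    }

    public static void main(String[] args) {
        long partition = 6_400L * 1024 * 1024;    // a ~6.4GB partition
        long entries = indexInfoCount(partition); // 102,400 IndexInfo objects
        System.out.println(entries + " IndexInfo, ~" + (2 * entries) + " ByteBuffers");
    }
}
```

That object count is why the comment's ">300K objects" threshold corresponds to partitions of roughly 20GB at the default block size, and why even much smaller partitions already generate painful GC churn.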
I've also already abstracted out the index implementation in Cassandra so that the current implementation and the new one I'll be proposing and contributing here can be dropped in easily, without special-casing the code all over the place to check the SSTable descriptor for which index implementation was used. All the unit tests and dtests pass after my abstraction work. The final thing I'm almost done with is refactoring my page-cache-aligned/aware file writer to be SegmentedFile aware (and making sure all the math works when the offset into the actual file differs depending on the segment, etc.).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135081#comment-15135081 ] Michael Kjellman commented on CASSANDRA-9754: - I spent the last 15 hours finishing up the last remaining pieces on the serialization... almost there..
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135073#comment-15135073 ] Brandon Williams commented on CASSANDRA-9754: - I am seeing an increasing number of people running into this problem. Is there any update here?
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135071#comment-15135071 ] Jeff Jirsa commented on CASSANDRA-9754: --- [~mkjellman] - I know you've put a lot of thought into this already. This is impacting us and I'd love to help. Is there anything I can do to assist? Are you working on a patch I can help you test (or can I volunteer to help write tests or similar)?
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723931#comment-14723931 ] Ariel Weisberg commented on CASSANDRA-9754: --- Maybe it makes sense to have an intermediate step. Leverage Robert's work in 9738 to transition to a simple off-heap representation that can be mapped, and then remove the key cache. It seems like this would effectively increase the upper bound on usable partition size in all cases compared to what we have today (is this a true statement?). After, or in parallel, work on another representation for partition indexes.
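The intermediate step suggested above - a mappable off-heap layout searched in place - could be sketched like this. This is a deliberate simplification assuming fixed-width, long-keyed entries; real IndexInfo entries have variable-length first/last names (which is the hard part discussed elsewhere in this thread), and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Sketch of a simple off-heap representation: entries are serialized into
// one (possibly memory-mapped) buffer and binary-searched in place, so a
// lookup touches O(log n) entries instead of materializing n IndexInfo
// objects on the heap.
final class OffHeapIndex {
    private static final int ENTRY = 8 + 8 + 8; // key, offset, width (fixed width)
    private final ByteBuffer buf;               // e.g. a mapped file segment

    OffHeapIndex(ByteBuffer buf) { this.buf = buf; }

    int entryCount() { return buf.capacity() / ENTRY; }

    // Returns the data-file offset of the last entry with key <= target,
    // or -1 when the target sorts before the first entry.
    long floorOffset(long target) {
        int lo = 0, hi = entryCount() - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long key = buf.getLong(mid * ENTRY); // absolute read, no allocation
            if (key <= target) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found < 0 ? -1 : buf.getLong(found * ENTRY + 8);
    }
}
```

Because every read is an absolute `getLong` against the buffer, nothing is promoted to the old gen and the kernel decides which pages stay resident.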
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721034#comment-14721034 ] Robert Stupp commented on CASSANDRA-9754: - I think we definitely need better data structures, since RowIndexEntry is a good fit for neither the key cache nor the index. That's the point of this ticket, CASSANDRA-8931, CASSANDRA-9843, and the pitfall in the current WIP in CASSANDRA-9738. I don't fully agree on relying on the page cache due to its granularity (4 kB, I think), which might be too coarse for keys. But that depends on the actual data structure - i.e. grouping hot keys per page, which conflicts with immutable sstables. Another point is the effort to move to a thread-per-core model, having distinct and independent data structures per thread without barriers/locks/whatever - and the page cache is a shared resource. The next thing is hot and cold data - i.e. we could use bigger intervals (column_index_size_in_kb in current terminology) for cold data. To be clear: I'm not against the page cache or so - I just want to note what I think may influence new stuff.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720928#comment-14720928 ] Michael Kjellman commented on CASSANDRA-9754: - Just a couple more thoughts: 1) One big question I have is how an on-disk B+ tree would play with the on-heap (or off-heap) cache. Ideally, I believe we would only want to cache on the heap columns that we actually got a read request for, to avoid polluting the heap with objects that were never requested by the user. However, as we store intervals in the index and not all the actual values, I'm not sure how to do efficient in-memory lookups if we only stored fragments of the overall index for a given key. For instance, if you put one matching leaf of index objects from the B+ tree into the cache and then got another request for the same key but a different index interval, we'd need to constantly keep rebalancing some kind of data structure on the heap, and I'm operating under the assumption that would be pretty inefficient and painful. Which brings me to... 2) Maybe the key cache isn't necessary if we implement an efficient on-disk B+ tree format. I'm doubtful that anything we implement from a cache perspective inside the application will be better than the kernel's page cache. Frequently accessed keys would be the pages most likely to be in the page cache as well, so we should still get the benefit of LRU eviction.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720893#comment-14720893 ] Jonathan Ellis commented on CASSANDRA-9754: --- 1. Learning time for us would be compaction. 2. ISTM this was not core to the algorithm, but it's been a while since I read the details. 3. We could store the offset in the ARF leaves; this was definitely not core. 4, 5. Yes, this is a key point. Like our existing index, ARF is designed to be memory-resident. As partitions grow larger, the ARF would degrade accuracy rather than spilling to disk (like a B-tree) or getting obscenely large (like our existing index). I would add: 6. Because of (5), ARF gives you Bloom-filter-like behavior for range queries and can quickly optimize away scans of sstables that don't contain the data in question. (A very good fit for DTCS; a smaller benefit for LCS.) So maybe we really want both: ARF for the quick reject, and an (on-disk) B+ tree for "where do I start scanning".
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720365#comment-14720365 ] Robert Stupp commented on CASSANDRA-9754: - Linking CASSANDRA-9738 as it runs into the same problem with {{IndexInfo}} objects (just off-heap instead of disk).
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720608#comment-14720608 ] Jonathan Ellis commented on CASSANDRA-9754: --- ARF may be a better fit than B+tree. CASSANDRA-9843
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720626#comment-14720626 ] Michael Kjellman commented on CASSANDRA-9754: - [~jbellis] I PoC'ed a B+ tree, but I'm certainly not tied to it in any way. Let me read this white paper today and see if I can get a PoC together.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720523#comment-14720523 ] sankalp kohli commented on CASSANDRA-9754: -- If we use a B+ tree, we can actually put a leaf node in the key cache. For most small partitions, leaf = root due to their small size.
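A back-of-the-envelope sketch of why leaf == root for small partitions. The assumptions here are mine, not from the ticket: one index block per 64 KB of partition data (the column_index_size_in_kb default) and roughly 100 entries per tree node; `BirchMath` and its methods are hypothetical.

```java
// Illustrative arithmetic only: with one index block per 64 KB of partition
// data and ~100 entries per node, any partition under ~6.4 MB fits in a
// single B+ tree node, so caching that one node caches the whole index.
final class BirchMath {
    // Number of index blocks for a partition (ceiling division, minimum 1).
    static int blocks(long partitionBytes, int indexBlockKb) {
        long blockBytes = indexBlockKb * 1024L;
        return (int) Math.max(1, (partitionBytes + blockBytes - 1) / blockBytes);
    }

    // Height of a B+ tree holding 'entries' keys with the given fan-out;
    // height 1 means a single node, i.e. leaf == root.
    static int treeHeight(int entries, int fanout) {
        int height = 1;
        long capacity = fanout;
        while (capacity < entries) { capacity *= fanout; height++; }
        return height;
    }
}
```

Under these assumptions a 64 KB partition has one index block (leaf == root), while the 6.4 GB partition from the ticket description has on the order of 100K blocks yet still needs only a three-level tree, so a lookup touches a handful of nodes instead of the whole index.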
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720314#comment-14720314 ] Michael Kjellman commented on CASSANDRA-9754: - We've had a bunch of discussions around this over the past few weeks, and I think I finally have a grasp of the entire issue. The issue is that large CQL partitions (4 GB, 75 GB, etc.) end up with large 200 MB+ serialized indexes. The current logic, when we don't get a cache hit, is to deserialize the entire thing and split it into IndexInfo objects, each of which contains two ByteBuffers (first and last key) and two longs (offset and width). This means we get a very large number of small, most likely short-lived objects creating garbage on the heap --- and with high probability they will be evicted from the cache anyway. On disk we just lay the objects out with the assumption that the entire thing will always be deserialized when it's needed, and never accessed from disk without deserializing the entire thing. I think the only option here is to change the actual way we lay things out on disk. Two options would be a skip list or a B+ tree, where we mmap the pages of the index and try to do something intelligent to avoid bringing objects onto the heap as much as possible. The downside of a B+ tree would be the overhead of creating it on flush, and its log(n) lookup (although the current code is log(n) too, as we binary search over the objects we deserialized into the list, just on the heap). The only references I could find to B+ trees in this project were CASSANDRA-6709 and CASSANDRA-7447. I don't think we need to reinvent the wheel here and entirely change the storage format, but if we use a targeted data structure *just* for the index, we might get something nice. The question would be what impact this will have on normal rows/partitions. Any input on other on-disk data structures we might want to consider would be great.
The other issue is that I'd love to be able to only cache the column that we got a cache hit on. Unfortunately that might be difficult. Today we binary search over the entire List<IndexInfo> to find hits. If you get a column that's in between the first and last name, you return the left node, then go check, and hopefully it's actually there. As we essentially have interval-ish objects here, along with non-fixed-length values, it does make things a bit more fun.
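The lookup described above - deserialize everything, binary search, and fall back to the nearest entry on the left - can be sketched like this. Names are simplified to Strings for illustration; the real code compares ByteBuffers with the table's comparator, and `IndexSearch`/`Info` are hypothetical names.

```java
import java.util.List;

// Sketch of today's lookup: after deserializing the whole index into a
// list, the reader binary-searches for the requested name and returns the
// nearest entry on the left, since each entry only records the first/last
// name of the interval it covers - the hit must then be verified by
// actually reading that block of the data file.
final class IndexSearch {
    record Info(String firstName, String lastName, long offset, long width) {}

    // Index of the entry whose interval may contain 'name', or -1 if the
    // name sorts before the first entry (a guaranteed miss).
    static int floorEntry(List<Info> index, String name) {
        int lo = 0, hi = index.size() - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).firstName().compareTo(name) <= 0) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }
}
```

The search itself is log(n); the GC problem is that all n Info objects must exist on the heap before the first comparison can run.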
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720852#comment-14720852 ] sankalp kohli commented on CASSANDRA-9754: -- How is this any different than an interval tree where we use a boolean instead of a max for leaf nodes? Disclaimer: I did not read the whole paper.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720874#comment-14720874 ] Michael Kjellman commented on CASSANDRA-9754: - [~jbellis] I read a bunch of the Adaptive Range Filters (ARF) paper from CASSANDRA-9843, and I'm not sure it's a good fit here, but please let me know if I misunderstood the paper. 1) The key optimization they make in ARF is the adaptive or learning component that resizes and optimizes the shape and layout of the B-tree dynamically as elements are added or removed. This was important to the original authors because Microsoft SQL Server appears to have hot and cold storage, so every time a record exceeds the threshold to go from hot to cold storage, they need to efficiently modify and update the tree. Given that all of our data structures in this case will be immutable, this has no benefit to us. 2) The paper uses fixed-length keys for the intervals. I'm unsure how we could modify this to accommodate the variable-length start and end values required for IndexInfo. 3) ARF (like a Bloom filter) only returns true or false to indicate the presence of a value in an interval. For IndexInfo we need the offset and width into the data to start reading to retrieve the actual data. I'm not sure how we can modify ARF to return values and not just a boolean. 4) If we dynamically adjust the size of the intervals and make them larger, it means the index will get even worse, requiring us to read a larger amount of actual data as our offset and width increase. If the query was a false positive, it would become even more expensive, given the possibility of reading a huge amount of data for the larger interval only to find no data. 5) It appears ARFs were always intended to stay in memory. I'm not sure how feasible it would be to implement ARF in an on-disk, cache-friendly way.
As we need to store the bytes for the first and last values of each indexed position, I think we should still use a data structure that we can use from disk without deserializing the entire structure onto the heap (which is the same problem we have with the current storage format today). Let me know if you have any more thoughts or if I've misunderstood the paper!
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619977#comment-14619977 ] Robert Stupp commented on CASSANDRA-9754: - Yes, the purpose of CASSANDRA-9738 is to move the (permanent part of the) key cache off-heap - basically eliminating a huge number of tiny objects in the old gen. bq. we can still optimize on reducing the objects created Absolutely agree! Although the first benchmark I did with CASSANDRA-9738 shows a huge reduction in GC effort of roughly 90% (total and avg GC time with 8u45+G1) for a read-only workload with actually more (short-lived) objects created - I suppose these never get promoted to the old gen.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619318#comment-14619318 ] Jonathan Ellis commented on CASSANDRA-9754: --- CASSANDRA-9738 might be the simplest way forward then.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619286#comment-14619286 ] sankalp kohli commented on CASSANDRA-9754: -- The strongly referenced IndexInfo objects are mostly from the key cache (we are using 100 MB). The problem is that, due to the key cache, all these objects will get promoted and cause fragmentation when evicted from the key cache. I also looked at a heap dump generated by an OOM; 93% of the retained size was coming from IndexInfo.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619368#comment-14619368 ] sankalp kohli commented on CASSANDRA-9754: -- If I am understanding it correctly, CASSANDRA-9738 will move the key cache off heap. This will help, as these IndexInfo objects won't be promoted since they won't be referenced by any on-heap map. However, we will still need to generate all these objects for each read, whether it is in the key cache or not. With CASSANDRA-9738, on a key cache hit, we will create new objects again when deserializing. I think CASSANDRA-9738 is good, but we can still optimize on reducing the objects created, which is currently proportional to the size of the CQL partition.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617679#comment-14617679 ] Jonathan Ellis commented on CASSANDRA-9754: --- But are most of those in the index summary (should be relatively stable once tenured) or the rowcache (high churn)?