[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880077#comment-15880077 ] Michael Kjellman commented on CASSANDRA-9754: - Just wanted to give a quick update:
# I'm really sorry for the delay getting this finished for trunk. I've started a trunk-based/post-8099 version 3 times now -- the holidays happened, more pressing things stole my attention, big commits like removing Thrift, CFMetadata, etc. kept getting committed before I was done, and well -- enough excuses from me...
# A belated thanks for your initial comments Branimir -- I did read them and I'll be addressing them with my trunk-rebased changes.
# I'm almost done with the refactoring to move all the current array-based index logic into an IndexEntry implementation. I have all unit tests passing (finally) with the exception of KeyCacheCqlTest (which I'm working on right now).
# Assuming I get the post-8099 indexed-iterator-based abstractions/changes correct, it should be a matter of just dropping in the Birch package/classes I had for 2.1 and switching the default serializer to use the Birch IndexedEntry implementation.
> Make index info heap friendly for large CQL partitions > -- > > Key: CASSANDRA-9754 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9754 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Michael Kjellman >Priority: Minor > Fix For: 4.x > > Attachments: 0f8e28c220fd5af6c7b5dd2d3dab6936c4aa4b6b.patch, > gc_collection_times_with_birch.png, gc_collection_times_without_birch.png, > gc_counts_with_birch.png, gc_counts_without_birch.png, > perf_cluster_1_with_birch_read_latency_and_counts.png, > perf_cluster_1_with_birch_write_latency_and_counts.png, > perf_cluster_2_with_birch_read_latency_and_counts.png, > perf_cluster_2_with_birch_write_latency_and_counts.png, > perf_cluster_3_without_birch_read_latency_and_counts.png, > perf_cluster_3_without_birch_write_latency_and_counts.png > > > Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects > are IndexInfo and its ByteBuffers. This is especially bad in endpoints with > large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K > IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for > GC. Can this be improved by not creating so many objects? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
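The arithmetic in the description can be sanity-checked with a quick sketch, assuming Cassandra's default {{column_index_size_in_kb}} of 64 (one IndexInfo entry per 64KB of indexed partition data; class and variable names here are illustrative only):

```java
// Back-of-envelope check of the numbers in the issue description: with 64KB
// index granularity, a ~6.4GB partition produces roughly 100K IndexInfo
// entries, each referencing two ByteBuffers (first and last clustering name).
final class IndexInfoEstimate {
    public static void main(String[] args) {
        long partitionBytes = 6_400_000_000L;   // ~6.4GB partition
        long indexChunk = 64 * 1024;            // 64KB column index granularity
        long indexInfoCount = partitionBytes / indexChunk;
        System.out.println(indexInfoCount);     // 97656, i.e. ~100K IndexInfo
        System.out.println(indexInfoCount * 2); // 195312, i.e. ~200K ByteBuffers
    }
}
```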
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795310#comment-15795310 ] Branimir Lambov commented on CASSANDRA-9754: As this is not a critical bug-fix, and hence the 2.1 version cannot go into the codebase, unfortunately I cannot justify investing the time to give it a full review until we have a trunk patch. I looked at the main {{BirchReader/Writer}} components again as they are likely to stay the same for a trunk patch. Here are my comments:
- I still think not storing the first key in intermediate nodes would save significant amounts of space and time and should be implemented.
- There is a pair of methods called {{binarySearch}} which return the floor (less than or equal) for a given key. I would prefer them to be named after what they produce, as {{binarySearch}} implies a certain kind of result (negative for non-equal), and the fact that it is implemented through binary search is largely an implementation detail.
- The [{{entries == 1}} check|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-8561257b0836a3403d14d5dac9f8b3d0R393] looks suspect, as there should be no need for one-entry nodes in the tree. Could you comment on why it is necessary?
- I think it is better to handle the special meaning of the empty key and the {{reversed}} flag first thing in [the {{search}} method|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-8561257b0836a3403d14d5dac9f8b3d0R432] rather than propagating it into the {{binarySearch}} calls, especially since you know the position of the first ({{0}}) and last ({{descriptor.getFirstNodeOffset() - descriptor.getAlignedPageSize()}}) leaf nodes in the tree. The iterator initialization already does that.
- The meaning of "matching" in the [{{search}} doc|https://github.com/mkjellman/cassandra/commit/b17f2c1317326fac7b6864a2fc61d7ee2580f740#diff-8561257b0836a3403d14d5dac9f8b3d0R429] is unclear. What happens when no equal element is present? If it returns the floor, please state so.
- {{BirchIterator}} could use a forward/reversed implementation split.
- There's a lot of potential for off-by-one mishaps in {{BirchIterator}}, and not only in the reverse case:
-- the first element returned in the forward case can be less than {{searchKey}} (if not equal);
-- the respective problem is also there in the reverse case;
-- the page we find the entry in during initialization ({{currentPage}}) is not the page we apply that index to during the first {{computeNext}} ({{currentPage - 1}});
-- the column iteration code probably masks the first two at some non-trivial efficiency cost, but the latter looks like something that can surface as missed data.
- The [level generation loop in {{BirchWriter}}|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-c5fa7e9cc1eac71a75b38caa716f64c3R260] is effectively a {{while (true)}} loop, as we always reset {{inProgressInnerNodeLevel}} before going into it.
- The list should never become empty, thus the [emptiness check|https://github.com/apache/cassandra/compare/trunk...mkjellman:CASSANDRA-9754-2.1-v2#diff-c5fa7e9cc1eac71a75b38caa716f64c3R282] is suspect -- if necessary, it would indicate an error in the logic; I'd replace it with a non-emptiness assertion.
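The floor-returning search discussed in the naming point above could be sketched as follows. This is an illustrative standalone example, not the actual {{BirchReader}} code: a binary search that remembers the last key less than or equal to the target, rather than signalling non-equal results with a negative index as {{Arrays.binarySearch}} does.

```java
// Hypothetical sketch of a floor ("less than or equal") lookup. Returns the
// index of the greatest key <= searchKey, or -1 if every key is greater --
// which is why naming it after the floor result, not "binarySearch", is clearer.
final class FloorSearch {
    static int floorIndex(long[] keys, long searchKey) {
        int lo = 0, hi = keys.length - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid] <= searchKey) {
                result = mid;    // candidate floor; a closer one may be to the right
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] keys = {10, 20, 30};
        System.out.println(floorIndex(keys, 25)); // prints 1 (key 20 is the floor)
        System.out.println(floorIndex(keys, 5));  // prints -1 (no floor exists)
    }
}
```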
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658206#comment-15658206 ] Michael Kjellman commented on CASSANDRA-9754: - [~llambiel] Yes, we ran out of disk space before the code fell over. We had some 250GB partitions when we finally ran out of disk space. Waiting on review and comments from [~barnie], and I'm working on the trunk version. I have most of the unit tests passing, although the new RangeTombstoneBounds etc. is proving pretty fragile and giving me a bit of pain. [~jjirsa] as we've discussed, I'm 99.9% sure we should go with your changes too.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657238#comment-15657238 ] Loic Lambiel commented on CASSANDRA-9754: - Any update on your ongoing tests [~mkjellman]?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592881#comment-15592881 ] Michael Kjellman commented on CASSANDRA-9754: - The large partitions hit 158GB today. Latencies are still stable and unchanged. Compaction backed up a little at one point but has fully caught up (with no change in actual write/read tps). I'm working very hard on a post 8099 version of the patch for trunk/3.0.x.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583396#comment-15583396 ] Michael Kjellman commented on CASSANDRA-9754: - Great. I'm working on a trunk based version now. 8099 is really fun! :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583373#comment-15583373 ] Pavel Yaskevich commented on CASSANDRA-9754: I'm planning to take a closer look at the code etc. soon, so if I see something or have any ideas I'll let you know!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583328#comment-15583328 ] Pavel Yaskevich commented on CASSANDRA-9754: [~mkjellman] Maybe "largeuuid1"? Looks like rows there were about ~300KB too, which is reasonable.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583336#comment-15583336 ] Michael Kjellman commented on CASSANDRA-9754: - Sure, let me change it now. Also, if you have any input on the overall way I'm testing and generating load, please let me know -- I really did try to make it as realistic as I could, and we discussed it internally over here, but I'm all ears if there's a different kind of load I'm missing that would make it a more accurate test for certain workloads.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583305#comment-15583305 ] Michael Kjellman commented on CASSANDRA-9754: - Sure, I can change the test right now. Which table specifically are you talking about adding more keys to? It's a single command-line parameter and a restart of the perf load. I'll need to bounce the cluster for the key cache change, obviously. The control cluster runs 2.1.16 without Birch, which I did on purpose to compare performance with Birch vs without and specifically make sure there isn't a regression at the low end, like you're rightfully concerned about (as I am/was too).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583294#comment-15583294 ] Michael Kjellman commented on CASSANDRA-9754: - One idea I've had for a while is that we could switch the current Summary implementation to just having it be a Birch tree itself with all keys (not sampled). You could then do a lookup into the row index to get the offset to the columns index in what we call the "primary index" today. Then you'd have a tree per row like we have today.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583289#comment-15583289 ] Pavel Yaskevich commented on CASSANDRA-9754: bq. I'm actually only using the key cache in the current implementation
I wanted to mention that, purely from the perspective of looking up a key in the key cache, I've assumed that the index is only going to have key offsets in it, so we are on the same page. [~barnie] Is there any way you can run this through an automated perf stress test? Since the size of the tree attached to the key is bigger than it was originally, I'm curious what the performance difference is in conditions where rows are just barely big enough to be indexed and there are a lot of keys. [~mkjellman] I understand that the test you are running is designed to check what the performance is like relative to the Birch tree itself, but is there any chance you can disable the key cache and generate some more keys (maybe ~100k?) to see how changes to the column index are affecting the read path top-down?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583250#comment-15583250 ] Michael Kjellman commented on CASSANDRA-9754: - In regards to your second point: I'm actually only using the key cache in the current implementation if a) it's a legacy index that hasn't been upgraded yet (to keep performance for indexed rows the same during upgrades), or b) the row is "non-indexed", i.e. < 64kb, so just the starting offset. Birch-indexed rows always come from the Birch impl on disk and don't get stored in the key cache at all. Ideally I think it would be great if we could get rid of the key cache altogether! There was some chat about this in the ticket earlier... There is the index summary, which has an offset for keys as they are sampled during compaction; it lets you skip to a given starting file offset inside the index for a key, which reduces the problem you're talking about. I don't think the performance of the small-to-medium sized case should be any different with the Birch implementation than with the current one, and I'm trying to test that with the workload going on for the test_keyspace.largeuuid1 table. The issue with the Birch implementation vs the current one, though, is going to be the size of the index file on disk, due to the segments being aligned on 4kb boundaries. I've talked a bunch about this and thrown some ideas around with people, and I think maybe the best approach would be to check if the previously added row was a non-indexed segment (so just a long for the start of the partition in the index and no tree being built) and then not align the file to a boundary for those cases. The real issue is that I don't know the length ahead of time, so I can't just encode the aligned segments at the end starting at some starting offset and encode relative offsets iteratively during compaction. Any thoughts on this would be really appreciated though...
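The on-disk cost of the 4kb alignment described above can be illustrated with a small sketch. The class and method names are hypothetical (not from the patch); the 4096-byte page size is taken from the boundary size mentioned in the comment:

```java
// Illustrative arithmetic for aligning index segments to 4KB boundaries.
// A non-indexed entry that is logically just an 8-byte offset still occupies
// a full page once aligned, which is why skipping alignment for non-indexed
// rows (as proposed above) would shrink the index file.
final class AlignmentOverhead {
    static final long PAGE = 4096;

    // Round a segment length up to the next page boundary.
    static long alignedSize(long length) {
        return ((length + PAGE - 1) / PAGE) * PAGE;
    }

    public static void main(String[] args) {
        System.out.println(alignedSize(8));    // 4096: 8 bytes inflated 512x
        System.out.println(alignedSize(4096)); // 4096: exact fit, no padding
        System.out.println(alignedSize(4097)); // 8192: one byte over costs a page
    }
}
```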
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583224#comment-15583224 ] Michael Kjellman commented on CASSANDRA-9754: - Here is cfstats from one of the instances.
{code}
Keyspace: test_keyspace
	Read Count: 114179492
	Read Latency: 1.6377607135701742 ms.
	Write Count: 662747473
	Write Latency: 0.030130128499184786 ms.
	Pending Flushes: 0
		Table: largetext1
		SSTable count: 26
		SSTables in each level: [0, 3, 7, 8, 8, 0, 0, 0, 0]
		Space used (live): 434883821719
		Space used (total): 434883821719
		Space used by snapshots (total): 0
		Off heap memory used (total): 67063584
		SSTable Compression Ratio: 0.7882047641965452
		Number of keys (estimate): 14
		Memtable cell count: 58930
		Memtable data size: 25518748
		Memtable off heap memory used: 0
		Memtable switch count: 3416
		Local read count: 71154231
		Local read latency: 2.468 ms
		Local write count: 410631676
		Local write latency: 0.030 ms
		Pending flushes: 0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.0
		Bloom filter space used: 496
		Bloom filter off heap memory used: 288
		Index summary off heap memory used: 1144
		Compression metadata off heap memory used: 67062152
		Compacted partition minimum bytes: 20924301
		Compacted partition maximum bytes: 91830775932
		Compacted partition mean bytes: 19348020195
		Average live cells per slice (last five minutes): 0.9998001524322566
		Maximum live cells per slice (last five minutes): 1.0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0.0

		Table: largeuuid1
		SSTable count: 59
		SSTables in each level: [1, 10, 48, 0, 0, 0, 0, 0, 0]
		Space used (live): 9597872057
		Space used (total): 9597872057
		Space used by snapshots (total): 0
		Off heap memory used (total): 3960428
		SSTable Compression Ratio: 0.2836031289299396
		Number of keys (estimate): 27603
		Memtable cell count: 228244
		Memtable data size: 7874514
		Memtable off heap memory used: 0
		Memtable switch count: 521
		Local read count: 18463741
		Local read latency: 0.271 ms
		Local write count: 108570121
		Local write latency: 0.031 ms
		Pending flushes: 0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.0
		Bloom filter space used: 22008
		Bloom filter off heap memory used: 21536
		Index summary off heap memory used: 11308
		Compression metadata off heap memory used: 3927584
		Compacted partition minimum bytes: 42511
		Compacted partition maximum bytes: 4866323
		Compacted partition mean bytes: 1290148
		Average live cells per slice (last five minutes): 0.9992537806937392
		Maximum live cells per slice (last five minutes): 1.0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0.0

		Table: timeuuid1
		SSTable count: 7
		SSTables in each level: [0, 1, 3, 3, 0, 0, 0, 0, 0]
		Space used (live): 103161816378
		Space used (total): 103161816378
		Space used by snapshots (total): 0
		Off heap memory used (total): 13820716
		SSTable Compression Ratio: 0.9105016396444802
		Number of keys (estimate): 6
		Memtable cell count: 150596
		Memtable data size: 41378801
		Memtable off heap memory used: 0
		Memtable switch count: 1117
		Local read count: 24561527
		Local read latency: 0.264 ms
		Local write count: 143545778
		Local write latency: 0.033 ms
		Pending flushes: 0
		Bloom filter false positives: 0
		Bloom filter false ratio: 0.0
		Bloom filter space used: 128
		Bloom filter off heap memory used: 72
		Index summary off heap memory used: 308
		Compression metadata off heap memory used: 13820336
		Compacted partition minimum
{code}
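For context on why compacted partitions of this size (mean ~19.3GB for largetext1) stress the old index format: with the default 64KB column index interval, each partition implies hundreds of thousands of on-heap IndexInfo objects, as the ticket description notes. A rough back-of-the-envelope sketch -- the 64KB interval is the assumed default, not taken from these clusters' config:

```java
// Back-of-the-envelope estimate of on-heap IndexInfo entries for a partition,
// assuming the default 64KB column index interval.
class IndexEntryEstimate {
    static final long INDEX_INTERVAL_BYTES = 64 * 1024;

    static long indexInfoCount(long partitionBytes) {
        return partitionBytes / INDEX_INTERVAL_BYTES;
    }

    public static void main(String[] args) {
        // Mean compacted partition size of largetext1 from the cfstats above.
        long meanPartitionBytes = 19_348_020_195L;
        // Each IndexInfo holds two ByteBuffers (first and last clustering name),
        // so roughly twice this many buffers end up on heap per partition.
        System.out.println(indexInfoCount(meanPartitionBytes)); // ~295K entries
    }
}
```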
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583210#comment-15583210 ] Pavel Yaskevich commented on CASSANDRA-9754: [~mkjellman] This looks great! Can you please post information regarding SSTable sizes and their estimated key counts as well? AFAIR there exists another problem related to how indexes are currently stored - if a key is not in the key cache, there is no way to jump to it directly in the index file; the index reader has to scan through the index segment to find the requested key. So I'm wondering what happens in the situation where there are many small-to-medium sized keys, e.g. 64-128 MB each, in a given SSTable (let's say the SSTable size is set to 1G or 2G) and stress readers are trying to read random keys - what would be the difference between current index read performance vs. index + birch tree?...
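The lookup path described above can be sketched roughly as follows. These are simplified, hypothetical structures (not the actual Cassandra index summary classes): the summary binary-searches its sampled keys to find a starting offset, and the reader must then scan forward through the index segment from that offset to the requested key.

```java
import java.util.Arrays;

// Simplified sketch: an index summary holds every Nth key with its offset into
// the index file; a lookup binary-searches the samples for the nearest sample
// at or before the requested token, giving the offset to start scanning from.
class IndexSummarySketch {
    final long[] sampledTokens; // sorted tokens of the sampled keys
    final long[] indexOffsets;  // matching offsets into the index file

    IndexSummarySketch(long[] sampledTokens, long[] indexOffsets) {
        this.sampledTokens = sampledTokens;
        this.indexOffsets = indexOffsets;
    }

    // Index-file offset to start the forward scan from for the given token.
    long startOffsetFor(long token) {
        int i = Arrays.binarySearch(sampledTokens, token);
        if (i < 0)
            i = -i - 2; // insertion point minus one: last sample <= token
        return (i < 0) ? 0 : indexOffsets[i];
    }

    public static void main(String[] args) {
        IndexSummarySketch s = new IndexSummarySketch(
                new long[] { 10, 20, 30 }, new long[] { 0, 100, 200 });
        System.out.println(s.startOffsetFor(25)); // 100: scan starts at the sample for 20
    }
}
```

The cost Pavel describes is the forward scan after `startOffsetFor` returns, which grows with the number of keys packed between consecutive samples.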
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583163#comment-15583163 ] Michael Kjellman commented on CASSANDRA-9754: - Not a "stupid" question at all! There is certainly a bit more overhead here than what we did before; however, I'm closely monitoring compaction in these tests and Pending Tasks isn't backing up, so at this read/write load the additional work seems negligible in real-world terms.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583160#comment-15583160 ] Michael Kjellman commented on CASSANDRA-9754: - Test clusters have crossed 110GB for the large CQL partitions!!! Latency is still stable :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578618#comment-15578618 ] DOAN DuyHai commented on CASSANDRA-9754: Stupid question: how do those improvements affect compaction? Did you also monitor compaction time during your benchmark tests and compare the time taken by each implementation?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576497#comment-15576497 ] Michael Kjellman commented on CASSANDRA-9754: - All of the threads responsible for generating load in the control cluster for the two large-partition read and write workloads had died because the cluster became so unstable. As soon as I restarted the stress load, 60% of the instances in the cluster OOMed within 2 minutes. At this point I don't think I can drive any more data into the partitions with the old code, so I'm going to declare defeat and say that 17GB is the absolute max partition size possible with the old/previous/current index implementation (given the JVM parameters I detailed in the test description above). I'm going to leave the load at the current read and write rates in the two Birch clusters until things explode, to see the theoretical max partition size possible with the Birch implementation today. After that I'll wipe the clusters and restart the same load at 2x the read and write rates to see how things go with that configuration.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576390#comment-15576390 ] Michael Kjellman commented on CASSANDRA-9754: - I have even more great news! My two test clusters just crossed 53GB for the 90th percentile of max row size. The 50th percentile of mean row size is ~8.5GB. Read and write latencies are still the same as the numbers I posted above from 3 days ago. So you could have an entire cluster of 10GB rows and still be stable :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569292#comment-15569292 ] Michael Kjellman commented on CASSANDRA-9754: - Morning update :) The stress load has continued against all partitions since the last update. The large partitions have grown to ~21GB. Latencies are still unchanged for both reads and writes in all percentiles!! Onwards to the next milestone, 50GB! I also doubled the read and write load around 10 hours ago, to 4k reads/sec and 10k writes/sec, to grow the partitions faster.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566748#comment-15566748 ] Michael Kjellman commented on CASSANDRA-9754: - Attaching an initial set of very rough graphs showing the last 12 hours of stress/performance testing. I apologize ahead of time for some of the graphs -- I wanted to include the average, p99.9th, and count for all key metrics, and in some cases the values overlapped and my graphing foo wasn't good enough to improve the readability. I'll take another pass when I get some time with the next round of performance testing.

The "large" CQL partitions in all 3 clusters are currently (and were for the duration of the test) between ~6GB and ~12.5GB, although I'm planning on running the stress/performance tests in all 3 clusters until the "large" CQL partitions hit ~50GB. The load was started in all 3 clusters (all totally empty at start) at the same time, from the same stress tool code that I wrote specifically to realistically test Birch -- after repeated attempts to generate a good workload with cassandra-stress I gave up. Some details of the stress tool and the load being generated for these graphs are below.

h3. There are three read-write workloads being run to generate the load during these tests.

I wrote the following two methods for the "simple-cassandra-stress" tool I threw together, to generate the keys that the worker threads operate on. I'll refer to them below when describing how the stress load is currently being generated.
{code:java}
public static List<HashCode> generateRandomKeys(int number)
{
    List<HashCode> keysToOperateOn = new ArrayList<>();
    HashFunction hf = Hashing.murmur3_128();
    for (int i = 0; i < number; i++)
    {
        HashCode hashedKey = hf.newHasher().putLong(RANDOM_THREAD_LOCAL.get().nextInt(30) + 1).hash();
        keysToOperateOn.add(hashedKey);
    }
    return keysToOperateOn;
}

public static List<HashCode> generateEvenlySpacedPredictableKeys(int number, int offset, String seed, Cluster cluster) throws InvalidParameterException
{
    Set<TokenRange> tokenRanges = cluster.getMetadata().getTokenRanges();
    int numberOfKeysToGenerate = (number < tokenRanges.size()) ? tokenRanges.size() : number;
    Long[] tokens = new Long[numberOfKeysToGenerate];
    int pos = 0;
    int numberOfSplits = (number <= tokenRanges.size()) ? 1 : (number / tokenRanges.size()) + 1;
    for (TokenRange tokenRange : tokenRanges)
    {
        for (TokenRange splitTokenRange : tokenRange.splitEvenly(numberOfSplits))
        {
            if (pos >= tokens.length)
                break;
            tokens[pos++] = (Long) splitTokenRange.getStart().getValue();
        }
        if (pos >= tokens.length)
            break;
    }
    HashCode[] randomKeys = new HashCode[tokens.length];
    int pendingRandomKeys = tokens.length;
    while (pendingRandomKeys > 0)
    {
        for (int i = offset; i < (offset + numberOfKeysToGenerate) * (number * 10); i++)
        {
            if (pendingRandomKeys <= 0)
                break;
            HashFunction hf = Hashing.murmur3_128();
            HashCode hashedKey = hf.newHasher().putString(seed, Charset.defaultCharset()).putInt(i).hash();
            for (int t = 0; t < tokens.length; t++)
            {
                if ((t + 1 == tokens.length && hashedKey.asLong() >= tokens[t])
                    || (hashedKey.asLong() >= tokens[t] && hashedKey.asLong() < tokens[t + 1]))
                {
                    if (randomKeys[t] == null)
                    {
                        randomKeys[t] = hashedKey;
                        pendingRandomKeys--;
                    }
                    break;
                }
            }
        }
    }
    return Arrays.asList(randomKeys);
}
{code}
There are 12 Cassandra instances in each performance/stress cluster, running JDK 1.8_u74 with the CMS collector and (obviously simplified) -Xms5G -Xmx5G -Xmn1G.
The test keyspace is created with RF=3:
{code:SQL}
CREATE KEYSPACE IF NOT EXISTS test_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
{code}
Operations for test_keyspace.largeuuid1 generate a new key to insert and read from at the top of every iteration with generateRandomKeys(1). Each worker then generates 10,000 random mutations, each with the current timeuuid and a random value blob of 30 bytes to 2KB. This is intended to put some more "normal" load on the cluster.
{code:SQL}
CREATE TABLE IF NOT EXISTS test_keyspace.timeuuid1 (name text, col1 timeuuid, value blob, primary key(name, col1)) WITH compaction = { 'class':'LeveledCompactionStrategy' }

"INSERT INTO test_keyspace.largeuuid1 (name, col1, value) VALUES (?, ?, ?)"
"SELECT * FROM test_keyspace.largeuuid1 WHERE name = ? and col1 = ?"
{code}
The second and third generated workloads attempt to stress the large row size element of this
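As a small illustration of the mutation payloads described above (a random value blob of 30 bytes to 2KB per mutation), here is a hypothetical sketch of how a worker might size its values; this is not the actual stress-tool code:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of per-mutation value generation: a random blob
// between 30 bytes and 2KB, as described in the workload above.
class ValueGen {
    static byte[] randomValue() {
        int size = ThreadLocalRandom.current().nextInt(30, 2048 + 1);
        byte[] value = new byte[size];
        ThreadLocalRandom.current().nextBytes(value);
        return value;
    }

    public static void main(String[] args) {
        System.out.println(randomValue().length); // somewhere in [30, 2048]
    }
}
```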
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566188#comment-15566188 ] Michael Kjellman commented on CASSANDRA-9754: - There were some issues with my cherry-pick to my public GitHub branch. I started from scratch, squashed all 182 individual commits, rebased up to 2.1.16, and pushed to a new branch: https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1-v2 The full squashed 2.1-based patch is https://github.com/mkjellman/cassandra/commit/b17f2c1317326fac7b6864a2fc61d7ee2580f740
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566183#comment-15566183 ] Michael Kjellman commented on CASSANDRA-9754: - Fixed: a single squashed commit for the 182 individual commits (wow, didn't realize it was that many) has been pushed to a new branch and rebased up to 2.1.16: https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1-v2
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566132#comment-15566132 ] Michael Kjellman commented on CASSANDRA-9754: - Yes, however something went wrong with the cherry-pick to the external github.com repo, as caught by Jeff. I'm squashing all the changes into a single commit now and pushing up a new branch. Give me a few more moments.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566048#comment-15566048 ] Branimir Lambov commented on CASSANDRA-9754: Is it now ready for review?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565876#comment-15565876 ] Michael Kjellman commented on CASSANDRA-9754: - Latest set of fixes pushed to https://github.com/mkjellman/cassandra/commit/5586be24f55a16887376cb244a7d1b1fa777927f
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565823#comment-15565823 ] Michael Kjellman commented on CASSANDRA-9754: - Stable all night! My large test partitions have grown to ~12.5GB. Just as stable -- latencies are unchanged. I'm so happy!!! ~7ms average p99.9th and ~925 microseconds average read latency. GC is basically non-existent -- and for what GC is happening, the instances are averaging 111 microsecond ParNew collections -- almost NO CMS! Compaction is keeping up. On the converse side, the control 2.1 cluster running the same load has instances OOMing left and right -- CMS is frequently running 250 ms collections, and ParNew is running 1.28 times a second on average with 75 ms average ParNew times. Horrible! And that's the average -- the upper percentiles are a mess, so I won't bore everyone. Read latencies are currently 380 ms on average, with many 15 *second* read latencies at the p99.9.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564840#comment-15564840 ] Michael Kjellman commented on CASSANDRA-9754: - I fixed the last (pretty nasty) bug today/tonight! The issue was in IndexedSliceReader#IndexedBlockFetcher, where I was failing to properly initialize a new iterator to the given start of the slice for the read query. This caused every read to iterate over all indexed entries every time. Fortunately, that bug had brought some performance concerns in the underlying read logic to my attention, which I also addressed thinking they were the root cause. I'm currently running my performance/stress load in three separate performance clusters: two with a build that has Birch and one as a control running 2.1.16. I'm currently performing 700 reads/sec per instance and 1.5k writes/sec. Read latencies in both Birch perf clusters are showing (at the storage proxy level) 838 microsecond latencies at the average percentile and only 7.4 milliseconds at the p99.9th! Write latencies in both Birch perf clusters are showing (at the storage proxy level) 138 microseconds at the average percentile and 775 microseconds at the p99.9th! There is basically no GC to speak of, and the latencies have been very stable for the past hour since I restarted the load with the fix for the iterator mentioned above. The best thing about all these stats is that many of the reads are occurring against (currently) 8.5GB rows! The control cluster has latencies 7-8x the Birch clusters so far, GC is out of control, and instances are starting to constantly OOM. It's hard to compare anything against the control cluster, as things start to fall apart very significantly after the test CQL partitions grow above ~4GB, eek. I'm going to let the load continue overnight to grow the partitions larger (I'm targeting 50GB for this first performance milestone).
It's pretty hard to not be happy when you see these numbers. This could end up being very very epic for our little project. I'm *pretty*, pretty, pretty (okay *really() happy tonight!! > Make index info heap friendly for large CQL partitions > -- > > Key: CASSANDRA-9754 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9754 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Michael Kjellman >Priority: Minor > Fix For: 4.x > > Attachments: 9754_part1-v1.diff, 9754_part2-v1.diff > > > Looking at a heap dump of 2.0 cluster, I found that majority of the objects > are IndexInfo and its ByteBuffers. This is specially bad in endpoints with > large CQL partitions. If a CQL partition is say 6,4GB, it will have 100K > IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for > GC. Can this be improved by not creating so many objects? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
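The iterator fix described above amounts to positioning the iterator at the block covering the slice start instead of scanning from the first indexed entry. A minimal sketch of the idea, using a NavigableMap as a stand-in for the actual index structures (all names here are hypothetical, not the real IndexedBlockFetcher code):

```java
import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SliceSeekDemo {
    public static void main(String[] args) {
        // Stand-in for the column index: start offset -> indexed block.
        NavigableMap<Integer, String> index = new TreeMap<>();
        for (int i = 0; i < 10; i++) index.put(i * 100, "block-" + i);

        int sliceStart = 450;
        // Buggy behavior: iterating from the first entry touches all 10 blocks.
        // Fixed behavior: first find the block whose range covers the slice start,
        // then iterate only from there.
        int firstBlock = index.floorKey(sliceStart); // 400
        Collection<String> scanned = index.tailMap(firstBlock, true).values();
        System.out.println(scanned.size());             // blocks actually visited
        System.out.println(scanned.iterator().next());  // first block of the slice
    }
}
```

With ten 100-unit blocks and a slice starting at 450, the fixed path visits 6 blocks starting at "block-4" instead of all 10.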
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553444#comment-15553444 ] Michael Kjellman commented on CASSANDRA-9754: - I just pushed up a squashed commit for the roughly 68 individual commits I've made while working on stability and performance over the past few weeks. https://github.com/mkjellman/cassandra/commit/41c6d43d0b020149a5564d4f7ab3c92e1bfcba64 I'm currently writing up the findings from the latest stress test I've been running for the last 24 hours across 3 performance clusters and will update the ticket with that in a bit.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550920#comment-15550920 ] Michael Kjellman commented on CASSANDRA-9754: - I'm a bit late on the Tuesday target I set for myself on Saturday, but for good reason :) I've been working almost non-stop since (went to bed at 3:30am and was up at 8:30am looking at graphs... and I've been looking at graphs ever since). I have a performance load running in 3 perf clusters -- I'd like to aggregate those objective findings tomorrow and then push up whatever the state of things is (it's *very* stable, so I'm pretty pumped about that) along with some benchmarks (the good and possibly bad/still-needs-improvement).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542919#comment-15542919 ] Branimir Lambov commented on CASSANDRA-9754:
bq. if we mmap a few times we'll still incur the very high and unpredictable costs from mmap
The {{MmappedRegions}} usage is to map the regions at sstable load, i.e. effectively only once in the table's lifecycle, which should completely avoid any mmap costs at read time.
bq. I'm wondering though if mmap'ing things even makes sense
Depends on whether we want to squeeze out the last bit of performance or not. Memory-mapped data (assuming it is already mapped as above) that resides in the page cache has no cost whatsoever to access, while reading it off a RAF or a channel still needs a system call plus some copying. The difference is felt most on workloads that fit entirely in the page cache. If you don't feel this is helpful, you can leave it out of the 2.1 version and rely on {{Rebufferer}} (or {{RandomAccessReader}}) to do memory mapping or caching for you in trunk.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542718#comment-15542718 ] Michael Kjellman commented on CASSANDRA-9754: - I saw that way back when I started implementing things -- I'm wondering though if mmap'ing things even makes sense. Given the chunks of work are aligned on 4k boundaries, even if we mmap a few times we'll still incur the very high and unpredictable costs from mmap (even if less, given 2GB chunks are obviously much bigger than 4kb)... thoughts? I'm trying to profile it now...
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542043#comment-15542043 ] Branimir Lambov commented on CASSANDRA-9754: > Originally, I was mmapping 4kb aligned chunks as necessary. Cassandra has some machinery to deal with the same problem in {{RandomAccessReader}}; the solution we have in place is to map the entire file in <2GB chunks and look the chunk up on a read. Take a look at {{MmappedRegions}} in trunk and its users.
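The map-once, look-up-on-read approach described here can be sketched as follows: map the whole file in fixed-size chunks at load time, so a read becomes an array lookup plus a division, with no mmap syscall on the read path. This is an illustrative sketch only (the class name is made up, and a small 1 MiB chunk is used for the demo; the real {{MmappedRegions}} uses chunks just under 2 GB):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMmapDemo {
    static final int CHUNK = 1 << 20; // 1 MiB for the demo; real code uses just under 2 GiB
    final MappedByteBuffer[] regions;

    // Map every chunk up front, once, when the file is opened.
    ChunkedMmapDemo(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + CHUNK - 1) / CHUNK);
            regions = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = (long) i * CHUNK;
                regions[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, Math.min(CHUNK, size - start));
            }
        }
    }

    // Read path: pure arithmetic, no syscalls.
    byte get(long offset) {
        return regions[(int) (offset / CHUNK)].get((int) (offset % CHUNK));
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("mmap", ".bin");
        byte[] data = new byte[3 * CHUNK / 2]; // spans two chunks
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(p, data);
        ChunkedMmapDemo d = new ChunkedMmapDemo(p);
        System.out.println(d.get(0) == data[0] && d.get(data.length - 1) == data[data.length - 1]);
    }
}
```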
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538960#comment-15538960 ] Michael Kjellman commented on CASSANDRA-9754: - Oh, another very important update. Originally, I was mmapping 4kb-aligned chunks as necessary. When I finally got things stable (after a few file descriptor leaks and some fun fighting Java over MappedByteBuffer objects), I ran the performance load from the stress tool I wrote and found the performance was randomly *terrible* (like 1.3 SECONDS in the 99.9th percentile). Upon investigation and a ton of instrumentation, I found mmap calls were taking *90+ms* in the 99th percentile and *70+ms* in the 90th percentile on the hardware I'm using for performance testing. I looked into the JDK source code to figure out if there were any synchronized blocks in the native code, but it's pretty sane and just calls the mmap syscall. I discussed it a bit with Norman Maurer and we both came away pretty shocked that mmap could be that slow! These boxes have 256GB of RAM and there was basically zero disk IO, as everything was in the page cache as expected. There were a lot of major page faults, but it's still very surprising that mmap can be so horrible in the upper percentiles. I ripped out all the mmap logic on the read path and switched to directly reading the aligned 4kb chunks from the RAF as needed, and everything looked amazing.
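The replacement read path described above -- reading the 4kb-aligned page containing a given offset directly from the RandomAccessFile -- looks roughly like this (a hypothetical sketch, not the actual PageAlignedReader code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class AlignedRafRead {
    static final int PAGE = 4096;

    // Reads the 4 KiB-aligned page containing `offset`: one positioned read
    // per page instead of an mmap call.
    static byte[] readAlignedPage(RandomAccessFile raf, long offset) throws IOException {
        long pageStart = offset & ~(long) (PAGE - 1); // round down to the 4 KiB boundary
        byte[] buf = new byte[(int) Math.min(PAGE, raf.length() - pageStart)];
        raf.seek(pageStart);
        raf.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("pages", ".bin");
        byte[] data = new byte[3 * PAGE];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        Files.write(p, data);
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r")) {
            byte[] page = readAlignedPage(raf, PAGE + 123); // lands in the second page
            System.out.println(page.length == PAGE && page[123] == data[PAGE + 123]);
        }
    }
}
```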
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538947#comment-15538947 ] Michael Kjellman commented on CASSANDRA-9754: - Wanted to post a quick update on the ticket. I've been working pretty much around the clock for the last two weeks on stabilizing, performance testing, validating, and bug fixing the code. I had an unfortunate, unexpected death in my family last week, so I lost the better part of this past week while tying up the last pieces I was finishing before I got the bad news. After attempting to work with a few people in the community to get cassandra-stress working in a way that actually stresses large partitions and validates the data written into them, I ended up needing to write a stress tool. I loaded up a few hundred 30GB+ partitions with column sizes of 300-2048 bytes while constantly reading data that was sampled during the inserts to make sure I'm not returning bad data or incorrect results. I ran the most recent load for ~2 days in a small performance cluster and there were no validation errors. Additionally, I'm running the exact same stress/perf load in another identical cluster with a 2.1 build that does *not* contain Birch. This is allowing me to make objective A/B comparisons between the two builds. The build is stable, there are no exceptions or errors in the logs even under pretty high load (the instances are doing 3x the load we generally run at in production), and most importantly GC is *very* stable. In contrast, GC starts off great without Birch, but around the time the large partitions generated by the stress tool reached ~250MB, GC shot up and then kept increasing as the rows grew (as expected). The cluster with the Birch build, meanwhile, had no change in GC as the size of the partitions increased.
I was a bit disappointed with some of the latencies I saw on reads in the upper percentiles, so I identified what I'm almost positive was the cause and just finished refactoring the logic for serializing/deserializing the aligned segments and subsegments in PageAlignedWriter/PageAlignedReader. I'm cleaning up the commit now and then going to get it into the perf cluster to start another load. If that looks good, I'm hoping to push all the stability and performance changes I've made up to my public Github branch, most likely Tuesday, as I'd like to let the performance load run for 2 days to build up large enough partitions to accurately stress and test things. :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474662#comment-15474662 ] Michael Kjellman commented on CASSANDRA-9754: - Over the past few days I've made some really great progress. I have the 2.1-based implementation (as found at https://github.com/mkjellman/cassandra/commits/CASSANDRA-9754-2.1) in a temporary performance cluster running stably against cassandra-stress. I found a few issues that I've been fixing as I find them while running the code under load:
* Fix reading of non-birch indexes from SSTableScanner
* Force un-mmapping of the current mmapped buffer from a PageAlignedReader before mmapping a new region
* Fix alignTo() issues when using anything other than 4096 padding for indexes (e.g. 2048)
* Make Birch/PageAligned format padding length configurable (sstable_index_segment_padding_in_kb)
* Fix signing issue when serializing and deserializing an unsigned short
* Use a reusable buffer in PageAlignedWriter
* Fix an issue where the index of the current subsegment was being used when the index of the current segment should have been used
* Other minor cleanup, spelling nits, etc
I've observed a bug where a java.nio.BufferUnderflowException is sometimes thrown under load from a ValidationExecutor thread while doing a repair. I've put some temporary logging in to dump the state of the reader when the exception happens, but I'm still not sure how it gets into that state. Wondering if there is some kind of concurrency problem somewhere? Also (although obvious in hindsight), the page alignment that keeps segments aligned on 4kb boundaries causes unacceptable write amplification in the size of the index file for workloads with small row keys and < 64kb of data in the row (a.k.a. no index). I've been discussing with a few people the various options we have and the tradeoffs for each one of them. Hoping to formalize those thoughts and implement something today or tomorrow.
So, all in all, the 2.1-based implementation is really stabilizing and initial performance tests are looking very encouraging!
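For the alignTo() fix mentioned in the list above, a typical power-of-two implementation (sketched here with hypothetical names) rounds an offset up to the next padding boundary and has to work for 2048 just as well as 4096:

```java
public class AlignTo {
    // Round `offset` up to the next multiple of `padding`.
    // Only valid when `padding` is a power of two, since it relies on
    // -padding being a mask of the high bits.
    static long alignTo(long offset, long padding) {
        return (offset + padding - 1) & -padding;
    }

    public static void main(String[] args) {
        System.out.println(alignTo(1, 4096));    // rounds up to the first 4 KiB boundary
        System.out.println(alignTo(4096, 4096)); // already aligned: unchanged
        System.out.println(alignTo(4097, 2048)); // works for 2048 padding too
    }
}
```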
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454525#comment-15454525 ] Michael Kjellman commented on CASSANDRA-9754: - I've discovered a performance regression caused by the original logic in PageAlignedReader. I always knew the original design wasn't ideal, but I felt the additional code complexity wasn't worth the performance improvement. Now that the code has stabilized and I've moved on to performance validation (and not just bugs and implementation), I found it was horribly inefficient. https://github.com/mkjellman/cassandra/commit/33d35272ae50803bac626ab60d5ecd3a36f5b283 I've updated the documentation in PageAlignedWriter to cover the new PageAligned file format. The new implementation allows lazy deserialization of segment metadata as required and enables binary search across segments via the fixed-length starting offsets. This means deserialization of the segments is no longer required ahead of time -- deserialization of the segment metadata only occurs when required to return a result. Initial benchmarking and profiling makes me a pretty happy guy. I think the new design is a massive improvement over the old one and looks pretty good so far.
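The fixed-length starting offsets described above make the cross-segment binary search possible without touching any segment metadata: find the segment whose start offset is the greatest one not exceeding the target position, then deserialize only that segment's metadata. A hypothetical sketch of the lookup (not the actual PageAlignedReader code):

```java
import java.util.Arrays;

public class SegmentOffsetIndex {
    // One fixed-width entry per segment, sorted by file position. Because the
    // entries are fixed length, they can be read directly without deserializing
    // the (variable-length) segment metadata they point at.
    final long[] segmentStarts;

    SegmentOffsetIndex(long[] segmentStarts) { this.segmentStarts = segmentStarts; }

    // Index of the segment whose range contains `pos`.
    int segmentFor(long pos) {
        int idx = Arrays.binarySearch(segmentStarts, pos);
        // On a miss, binarySearch returns (-(insertion point) - 1);
        // the covering segment is the one starting just before `pos`.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        SegmentOffsetIndex idx = new SegmentOffsetIndex(new long[]{0, 4096, 8192, 20480});
        System.out.println(idx.segmentFor(0));      // exact start of segment 0
        System.out.println(idx.segmentFor(5000));   // inside segment 1
        System.out.println(idx.segmentFor(20480));  // exact start of the last segment
    }
}
```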
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451066#comment-15451066 ] Michael Kjellman commented on CASSANDRA-9754: - I pushed a rebased commit that addresses many additional review comments by [~jasobrown], adds additional unit tests, and has many further improvements to documentation. This is still 2.1-based; however, the review and improvements made in the org.apache.cassandra.db.index.birch package are agnostic to a trunk- or 2.1-based patch. https://github.com/mkjellman/cassandra/commit/3d686799a0e79c23d86881bb041b5408dcfda014 https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1 Some highlights:
* Fix a bug in KeyIterator identified by [~jjirsa] that would cause the iterator to return nothing when the backing SegmentedFile contains exactly 1 key/segment.
* Add unit tests for KeyIterator
* Add SSTable version ka support to LegacySSTableTest. Actually test something in LegacySSTableTest.
* Add additional unit tests around PageAlignedReader, PageAlignedWriter, BirchWriter, and BirchReader
* Remove word lists and refactor all unit tests to use TimeUUIDTreeSerializableIterator instead
* Improve and fix documentation as required to properly parse and format during javadoc creation
* Remove reset() functionality from BirchReader.BirchIterator
* Fix many other nits
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434219#comment-15434219 ] Michael Kjellman commented on CASSANDRA-9754: - So, I'm mostly done with a trunk version of the patch; however, I'm currently focusing on finishing and polishing the 2.1-based version. Although the abstraction of the index is almost a total rewrite between 2.1 and trunk, the tree itself and the Birch implementation should remain the same, so this certainly isn't wasted time for anyone. :) I've cleaned up the implementation a bunch, taken care of a bunch of TODOs and low-hanging fruit, added more documentation, and pushed it to Github to make it a bit easier to make sure the changes apply cleanly. https://github.com/mkjellman/cassandra/commit/e5389378b19eb03de7dd4d50d6df110c68057985 The following 4 unit tests (out of 1184) are still failing (so close!):
* org.apache.cassandra.cql3.KeyCacheCqlTest (2 of 2). Need to talk to [~aweisberg] to understand exactly what these unit tests are testing.
* org.apache.cassandra.db.ColumnFamilyStoreTest (2 of 38, both related to secondary indexes)
Tomorrow, I hope to push a patch addressing the feedback from [~barnie] (see above comment) along with any changes that come out of the code review currently underway by [~jasobrown] and [~kohlisankalp]. I also want to get more comfortable with the upgrade/backwards-compatibility story and make sure there is a good unit test story around it. [~jjirsa] if you get a chance to take a look, any initial feedback would be awesome!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380306#comment-15380306 ] Michael Kjellman commented on CASSANDRA-9754: - Thanks Branimir for starting to review the code!! 1) Yes, good optimization. The first-most entry in the tree (that is, the first element in the root node) doesn't need to be repeated in the inner nodes, as you can assume that you always go to the left. 2) I think we can make the assumption that if the length of the bytes for a given entry's key is greater than or equal to the max length for a given node/leaf, the code can assume there will be an int encoded after the key bytes with the offset into the overflow page. If the key happens to be equal to the max length (but doesn't actually overflow), then we can encode 0 or something and have the code know that value means no overflow page. The downside here is adding more assumptions and complexity to the code vs. eating the single byte in the overflow page. 3) Could do this, but I think it increases the complexity, as the root node would need to be special-cased for max size. Currently the root node, inner nodes, and leaf nodes all use the exact same code, where only the value serializer is different (to either write the offset to the next node/leaf or the actual value of the entry you want to add to the tree). We'd need to subtract the size of the descriptor from the available size of the aligned root node and then know where to seek inside it. In regards to your final comment: while technically we could build the root/inner nodes to point to an offset of IndexInfo entries serialized in the current format, I think the tradeoffs make it non-ideal.
If we serialize the entries in the leaf like I'm currently doing, we (a) get the benefits of the leaf being aligned and padded, and (b) with the proposed leaf format, we can binary search inside each leaf node to get the exact entry, vs. having an inner node just point to an offset and then needing to linearly scan until we hit a match. I'm almost done with a refactor for trunk. We found some pretty serious regressions in 2.1 that required my immediate attention, but I hope to have a trunk-based patch with your initial suggestions from above incorporated into the tree implementation very soon.
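The encoding discussed in point 2 can be sketched as follows: only keys that fill the fixed per-node slot carry a trailing overflow offset, with 0 reserved for "exactly max length, no overflow". All names here (and the 8-byte max key length) are hypothetical, chosen for illustration:

```java
import java.nio.ByteBuffer;

public class OverflowKeyCodec {
    static final int MAX_KEY = 8; // hypothetical per-node max key length

    // Write the (possibly truncated) key. Only keys occupying the full slot
    // are followed by a 4-byte overflow-page offset; 0 means "exactly max
    // length, nothing spilled to the overflow page".
    static void encode(ByteBuffer out, byte[] key, int overflowOffset) {
        int len = Math.min(key.length, MAX_KEY);
        out.putShort((short) len);
        out.put(key, 0, len);
        if (len == MAX_KEY)
            out.putInt(key.length > MAX_KEY ? overflowOffset : 0);
    }

    // The reader only looks for the overflow int when the key length hits the max.
    static boolean hasOverflow(ByteBuffer in) {
        int len = in.getShort() & 0xFFFF;
        in.position(in.position() + len); // skip the key bytes
        return len == MAX_KEY && in.getInt() != 0;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        encode(buf, new byte[]{1, 2, 3}, 0);   // short key: no overflow int at all
        encode(buf, new byte[12], 4096);       // long key: spills to the overflow page
        encode(buf, new byte[MAX_KEY], 0);     // exactly max: int written, but zero
        buf.flip();
        System.out.println(hasOverflow(buf)); // false
        System.out.println(hasOverflow(buf)); // true
        System.out.println(hasOverflow(buf)); // false
    }
}
```

The tradeoff Michael describes is visible here: the common (short-key) case pays nothing, while a key that merely equals the max length costs four extra bytes for the zero sentinel.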
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351341#comment-15351341 ] Branimir Lambov commented on CASSANDRA-9754: I spent some time reading up on {{BirchReader}} to figure out the nuts and bolts of how the storage works. I think we can squeeze a little more efficiency into the structure:
- As far as I could see, your current implementation places a lot of copies on the lower side of each span in the non-leaf nodes (for example, the lowest key of the partition is present in the leaf node, its parent, and all parents leading all the way to the root). This should not be necessary: simply omitting the first key (but retaining the child pointer) from all intermediate nodes and adding 1 to what the binary search returns will achieve the same result.
- I find the overflow flag (and jumping back and forth to read it) less efficient than necessary. If we assume instead that a key length equal to the max always entails overflow data, we would use less space and be more efficient in the common case, while having a very low chance of taking a few bytes more in the uncommon situation of long keys.
- The root node could be in the same page as the descriptor (it is usually smaller, so there's a high chance it fits). Perhaps the overflow data is best placed elsewhere?
More generally (ignoring padding on the leaves, which is not necessarily always beneficial), the B+ structure you have built is practically a B-Tree index over a linear list of index entries. As we already have a linear list of {{IndexInfo}} structures in the current format, what are we gaining by not just building a B-Tree index over that? To me the latter would appear to be less complicated and much more generic, with immediate possible applications in other parts of the codebase.
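The first suggestion above -- dropping the lowest key from intermediate nodes and adjusting the binary search result -- works out to this: an inner node with n children stores only n-1 separator keys, and the child to descend into is the number of separators less than or equal to the search key. A hypothetical sketch (int keys stand in for the real serialized keys):

```java
import java.util.Arrays;

public class SeparatorLookup {
    // An inner node with separators.length + 1 children. The lowest key of the
    // node's span is omitted: anything below separators[0] goes to child 0.
    static int childFor(int[] separators, int key) {
        int idx = Arrays.binarySearch(separators, key);
        // Miss: binarySearch returns (-(insertion point) - 1), and the insertion
        // point is already the child index. Hit: add 1, i.e. descend to the right
        // of the matching separator.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        int[] seps = {10, 20, 30}; // 4 children; 3 separators, first key omitted
        System.out.println(childFor(seps, 5));   // below every separator
        System.out.println(childFor(seps, 10));  // exact hit descends right
        System.out.println(childFor(seps, 25));  // between 20 and 30
        System.out.println(childFor(seps, 99));  // above every separator
    }
}
```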
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323500#comment-15323500 ] Michael Kjellman commented on CASSANDRA-9754: - [~tjake] will do!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323406#comment-15323406 ] T Jake Luciani commented on CASSANDRA-9754: --- [~mkjellman] for a post-2.1 patch, take a look at CASSANDRA-7443, which added an abstraction for IndexEntry and serializers that should hopefully be similar to what you did for this 2.1 version.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323330#comment-15323330 ] Michael Kjellman commented on CASSANDRA-9754: - Some additional thoughts while I'm thinking about them: * PageAlignedReader currently deserializes all the segments in the constructor. It might be more efficient to lazily deserialize each segment as we need it. I'm sure perf testing will quickly make it clear whether the extra code complexity is worth the potential performance trade-off... * I picked 4KB for the page size based on an educated guess, but obviously other sizes need to be tested (less? more?)
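The lazy alternative floated above could look something like this minimal sketch. `LazySegments` and its `loader` callback are hypothetical stand-ins for illustration, not the actual PageAlignedReader API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: instead of deserializing every segment eagerly in the
// reader's constructor, materialize each segment on first access and cache it.
public class LazySegments<T> {
    private final Map<Integer, T> cache = new HashMap<>();
    private final IntFunction<T> loader; // deserializes segment i on demand

    public LazySegments(IntFunction<T> loader) {
        this.loader = loader;
    }

    public T segment(int i) {
        // first access pays the deserialization cost; later accesses hit the cache
        return cache.computeIfAbsent(i, loader::apply);
    }
}
```

The trade-off is exactly the one the comment raises: reads that only touch a few segments avoid the up-front cost, at the price of extra bookkeeping on the hot path.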
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321353#comment-15321353 ] Michael Kjellman commented on CASSANDRA-9754: - TIL: Attempting to upload to Jira via the slow and overpriced Gogo in-flight wifi doesn't work... "Cannot attach file 9754_part2-v1.diff: Unable to communicate with JIRA." Working on it... :)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321338#comment-15321338 ] Jonathan Ellis commented on CASSANDRA-9754: --- Delighted to see this patch land, looking forward to getting it merged!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321332#comment-15321332 ] Michael Kjellman commented on CASSANDRA-9754: - Alright, happy to finally be able to write this. :) I'm attaching a v1 diff containing Birch! h4. Why is it named Birch? B+ Tree -> Trees that start with the letter B -> Birch... get it? haha... h4. Description Birch is a B+-inspired tree aimed at improving the performance of the SSTable index in Cassandra (especially with large partitions). The existing implementation scales poorly with the size of the index/row, as the entire index must be deserialized onto the heap even to look up a single element. This puts significant pressure on the heap, where one read to a large partition will cause, at minimum, a long painful CMS GC pause or -- in the worst case -- an OOM. The Birch implementation has a predictable fixed cost for reads, at the expense of additional on-disk overhead for the tree itself, with the same O(log(n)) complexity as the existing implementation. Every row added to the SSTable is also added to the primary index. If the size of the row is greater than 64KB we build an index (otherwise we just encode the position of that row in the sstable). All entries encoded into the index are page aligned and padded to the nearest boundary (4096 bytes by default). Every segment can be marked as either internally padded/aligned along a boundary or non-padded/aligned (up to 2GB). Birch indexes are aligned into 4096-byte nodes (both leaf and inner). Keys are encoded inside the node itself unless they exceed half the node size. In that case, the first half-node's worth of bytes is encoded into the node itself, along with the offset of the remaining bytes in the overflow page. This enables predictable fixed performance of the tree while accommodating variable-length keys/elements. h4.
Notes on v1 of the diff (in no particular order) * I broke the changes into two logical parts: the first abstracts out the existing index implementation and adds no new logic; the second includes an IndexedEntry implementation backed by a Birch tree. * The attached v1 patch is written for 2.1. I have already started rebasing the patch onto trunk and hope to finish that shortly and post the trunk-based patch * There's some high-level Javadoc documentation in BirchWriter and PageAlignedWriter on the layout of the tree on disk, the serialization and deserialization paths, and the higher-level goals of the classes * The next steps are to start getting feedback from reviewers and the community. I have profiled the tree itself, but profiling the tree integrated into the stack and optimizing non-performant code paths is next (after the immediate task of rebasing the change onto trunk) * There are still a few todos I've left in regards to handling backwards compatibility, parts of the code I expect might be non-performant, and things I'd like to discuss on the "correct" implementation/behavior, etc. * I have a few unit tests that still don't pass and still need to be root-caused... I've taken the approach this entire time that the unit tests shouldn't be touched to pass, so there are still a few behavioral regressions I've accidentally introduced.
The current failing tests are: ** AutoSavingCacheTest ** SecondaryIndexTest ** BatchlogManagerTest ** KeyCacheTest ** ScrubTest ** IndexSummaryManagerTest ** LegacySSTableTest ** MultiSliceTest * I need to write a unit test for reading the legacy/existing primary index implementation * By the nature of the index's role in the database, the unit test coverage is actually pretty extensive, as any read and write touches the index in some capacity I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high-level design I ended up with and the considerations I had to take into account once I actually got deep inside this part of the code. Looking forward to feedback!
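The key-encoding rule in the description above (inline up to half the node size, spill the rest to an overflow page) can be sketched roughly as follows. `NODE_SIZE`, `EncodedKey`, and `encode` are illustrative names under that stated rule, not the actual Birch classes:

```java
// Hedged sketch of the Birch key-encoding rule: a key fits fully inline if it
// is at most half the 4096-byte node; otherwise the first nodeSize/2 bytes
// stay in the node and the remainder spills to an overflow page, with the node
// recording how many bytes (and, in the real tree, at what offset) overflowed.
public class BirchKeyEncodingSketch {
    static final int NODE_SIZE = 4096; // default page/node size

    static class EncodedKey {
        final byte[] inline;      // bytes stored inside the node itself
        final int overflowLength; // bytes spilled to the overflow page
        EncodedKey(byte[] inline, int overflowLength) {
            this.inline = inline;
            this.overflowLength = overflowLength;
        }
    }

    static EncodedKey encode(byte[] key) {
        int limit = NODE_SIZE / 2;
        if (key.length <= limit)
            return new EncodedKey(key, 0); // small key: fully inline, no overflow
        byte[] head = new byte[limit];
        System.arraycopy(key, 0, head, 0, limit);
        return new EncodedKey(head, key.length - limit);
    }

    public static void main(String[] args) {
        EncodedKey small = encode(new byte[100]);
        EncodedKey big = encode(new byte[5000]);
        System.out.println(small.inline.length + " inline / " + small.overflowLength + " overflow");
        System.out.println(big.inline.length + " inline / " + big.overflowLength + " overflow");
    }
}
```

Capping the inline portion at half a node is what keeps node fan-out (and therefore lookup cost) predictable regardless of key length.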
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235141#comment-15235141 ] Jack Krupansky commented on CASSANDRA-9754: --- Any idea how a new wide partition will perform relative to the same amount of data and same number of clustering rows divided into bucketed partitions? For example, a single 1 GB wide partition vs. ten 100 MB partitions (same partition key plus a 0-9 bucket number) vs. a hundred 10 MB partitions (0-99 bucket number), for two access patterns: 1) random access to a row or short slice, and 2) a full bulk read of the 1 GB of data, one moderate slice at a time. Or maybe the question is equivalent to asking what the cost is to access the last row of the 1 GB partition vs. the last row of the tenth or hundredth bucket of the bucketed equivalent. No precision required. Just inquiring whether we can get rid of bucketing as a preferred data modeling strategy, at least for the common use cases where the sum of the buckets is roughly 2 GB or less. The bucketing approach does have the side effect of distributing the buckets around the cluster, which could be a good thing, or maybe not.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221050#comment-15221050 ] Jeff Jirsa commented on CASSANDRA-9754: --- I renew my offer to help test code (and / or port to other versions, especially if you don't yet have a 2.1 patch set)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221026#comment-15221026 ] Michael Kjellman commented on CASSANDRA-9754: - I have some insanely encouraging initial performance numbers! I'd like to do some more validation to make sure I didn't screw up any of the benchmarks before sharing, but the read story is better than I could have ever imagined!
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220973#comment-15220973 ] Michael Kjellman commented on CASSANDRA-9754: - Alright, good news! My unit test that creates and reads from an index with 100,000,000 entries (!!) successfully passes! I came up with a pretty nice solution to the word-list issue (I was unable to find a word list with 100M+ entries): instead, I am generating n TimeUUID elements -- which are naturally free of duplicates, can be created in unlimited quantity, and come already sorted as they're generated! I'm currently profiling the code to come up with numbers...
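The property being relied on above -- keys that are unique and emitted in sorted order -- can be illustrated with a small sketch. Since the JDK has no built-in time-based UUID generator, this uses a hypothetical (timestamp, counter) string key as a stand-in for TimeUUIDs:

```java
// Hypothetical stand-in for TimeUUID-style test keys: a monotonically
// increasing (timestamp, counter) pair yields keys that are unique and
// already sorted, so millions of index entries can be generated on the fly
// without needing a word list. Assumes the clock does not move backwards.
public class SortedKeyGenerator {
    private long lastMillis = -1;
    private int counter = 0;

    public String next() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            counter++; // disambiguate keys created within the same millisecond
        } else {
            lastMillis = now;
            counter = 0;
        }
        // zero-padded so lexicographic order matches generation order
        return String.format("%020d-%08d", now, counter);
    }

    public static void main(String[] args) {
        SortedKeyGenerator gen = new SortedKeyGenerator();
        String prev = gen.next();
        for (int i = 0; i < 100_000; i++) {
            String next = gen.next();
            if (prev.compareTo(next) >= 0)
                throw new AssertionError("keys not strictly increasing");
            prev = next;
        }
        System.out.println("generated 100,000 sorted unique keys");
    }
}
```

Real TimeUUIDs embed the timestamp in their high bits, which is what gives them the same already-sorted, duplicate-free behavior when compared with a time-aware comparator.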
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190722#comment-15190722 ] Michael Kjellman commented on CASSANDRA-9754: - Also, does anyone know of any truly massive word lists that are totally free of legal concerns for testing? (I'm looking for >3-4 million words)
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190718#comment-15190718 ] Michael Kjellman commented on CASSANDRA-9754: - I have the new FileSegment-friendly implementation working for the following conditions: 1) straight search for key -> get value 2) iterate efficiently both forwards and in reverse through all elements in the tree 3) binary search for a given key and then iterate through all remaining keys from the found offset 4) overflow page for handling variable-length tree elements that exceed the max size of a given individual page (up to 2GB) I have also successfully run some new unit tests I wrote that do 5000 consecutive iterations with randomly generated data (to "fuzz" the tree for edge conditions), building and validating trees that contain between 300,000-500,000 elements. I've also spent a good amount of time writing some pretty reasonable documentation of the binary format itself. Tomorrow, I'm planning on testing a 4.5GB individual tree against the new implementation and doing some profiling to see the exact memory impact, now that both the serialization and deserialization paths are basically complete. Will update with those findings tomorrow!
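Access patterns 1-3 above reduce to the same primitive: binary search within a node for a key (or its insertion point), then iterate from the found position. A minimal sketch, with a sorted in-memory array standing in for a serialized 4KB page:

```java
import java.util.Arrays;

// Hedged sketch of "binary search for a given key, then iterate from the
// found offset". A sorted String array stands in for the contents of one
// fixed-size Birch node; the real tree does this against serialized pages.
public class NodeSearchSketch {
    static int findInsertionPoint(String[] node, String key) {
        int idx = Arrays.binarySearch(node, key);
        // on a miss, Arrays.binarySearch returns -(insertionPoint) - 1
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args) {
        String[] node = { "apple", "cherry", "mango", "pear" };
        // "grape" is absent; the search lands on the first key >= it ("mango")
        int start = findInsertionPoint(node, "grape");
        StringBuilder forward = new StringBuilder();
        for (int i = start; i < node.length; i++) // forward iteration (pattern 3)
            forward.append(node[i]).append(' ');
        System.out.println(forward.toString().trim()); // prints "mango pear"
    }
}
```

Reverse iteration is the same loop run from `start` down to 0, which is why the tree can serve both scan directions from a single search.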
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189412#comment-15189412 ] Aleksey Yeschenko commented on CASSANDRA-9754: -- [~jkrupan] The priority of the JIRA ticket here is irrelevant now that it's being actively worked on -- see also CASSANDRA-11206.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189407#comment-15189407 ] Jack Krupansky commented on CASSANDRA-9754: --- Is this issue still considered a Minor priority? Seems like a bigger deal to me. +1 for making it a Major priority - unless there is a longer list of even bigger fish in the queue. Just today there is a user on the list struggling with time series data and really not wanting to have to split a partition that he needs to be able to scan. Of course, scanning a super-wide partition will still be a very bad idea anyway, but at least narrower scans would still be workable with this improvement in place. Is this a 3.x improvement or 4.x or beyond? +1 for 3.x (3.6? 3.8?).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188993#comment-15188993 ] Michael Kjellman commented on CASSANDRA-9754: - Had a very productive day... made a huge amount of progress on the deserialization side of things and the required serialization changes.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173981#comment-15173981 ] Michael Kjellman commented on CASSANDRA-9754: - Update: the serialization path is basically done with the new design, which is more friendly to the SegmentedFile issue. Will finish that up today and move on to deserialization.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149269#comment-15149269 ] Michael Kjellman commented on CASSANDRA-9754: - well - then you'd lose the cache-line aware logic I've implemented to make the B+ tree efficient...
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149265#comment-15149265 ] Jonathan Ellis commented on CASSANDRA-9754: --- Would it make sense to always use buffered I/O for the B+ tree instead of mmap?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149034#comment-15149034 ] Michael Kjellman commented on CASSANDRA-9754: - I was asked to update. Making good progress. A large number of tests pass. I'm basically just getting the math right for "Inception: The Cassandra Director's Cut" due to the need to make a B+ tree (disk-based by its very definition) work with SegmentedFile etc...
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139712#comment-15139712 ] Jack Krupansky commented on CASSANDRA-9754: --- bq. large CQL partitions (4GB,75GB,etc) What is the intended target/sweet spot for large partitions... 1GB, 2GB, 4GB, 8GB, 10GB, 15GB, 16GB, or... what? Will random access to larger partitions create any significant heap/off-heap memory demand, or will heap/memory simply become the total rows accessed regardless of how they might be bucketed into partitions? Will we be able to tell people that bucketing of partitions is now never needed, or will there now just be a larger bucket size, like 4GB/partition rather than the 10MB or 50MB or 100MB that some of us recommend today?
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139130#comment-15139130 ] Jonathan Ellis commented on CASSANDRA-9754: --- [~mkjellman] I'd be very interested to see what impact this has on query performance as partition size grows.
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139735#comment-15139735 ] Michael Kjellman commented on CASSANDRA-9754: - [~jkrupan] ~2GB is the max target I'd recommend from experience at the moment. The current implementation creates an IndexInfo entry for every 64KB (by default - and I highly doubt anyone actually changes this default) worth of data. Each IndexInfo object contains the offset into the sstable where the partition/row starts, the length to read, and the name. These IndexInfo objects are placed into a list and binary searched over to find the name closest to the query. Then, we go to that offset in the sstable and start reading the actual data. The issue that makes things so bad with large partitions is that the entire list of IndexInfo objects is serialized one after another into the index file on disk. To use it for an indexed read across a given partition, we have to read the entire thing off disk, deserializing every IndexInfo object, place them into a list, and then binary search across it. This creates a ton of small objects very quickly that are likely to be promoted and thus create a lot of GC pressure. If you take the average size of each column you have in a row, you can figure out how many index entry objects will be created (one for every 64KB of your data in that partition). I've found that once the IndexInfo array contains > 300K objects things get bad. The implementation I'm *almost* done with has the same big-O complexity (O(log(n))) as the current implementation, but the index is instead backed by page-cache-aligned mmap'ed segments (B+ tree-ish, with an overflow page implementation similar to that of SQLite). This means we can now walk the IndexEntry objects and only bring onto the heap the 4KB chunks that are involved in the binary search for the correct entry itself. The tree itself is finished and heavily tested.
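The one-IndexInfo-per-64KB relationship above also lets you sanity-check the numbers in the ticket description. A back-of-the-envelope sketch, assuming (per the description) roughly two ByteBuffers per IndexInfo entry:

```java
// Back-of-the-envelope check of the ticket's numbers: one IndexInfo per 64KB
// column-index block means a ~6.4GB partition yields ~100K IndexInfo objects,
// and at roughly two ByteBuffers per entry, ~200K ByteBuffers -- all of which
// the legacy path must deserialize onto the heap for a single indexed read.
public class IndexInfoCostSketch {
    static final long COLUMN_INDEX_BLOCK = 64 * 1024; // 64KB default block size

    static long indexInfoCount(long partitionBytes) {
        return partitionBytes / COLUMN_INDEX_BLOCK;
    }

    public static void main(String[] args) {
        long partition = 6_400L * 1024 * 1024;    // a ~6.4GB partition
        long entries = indexInfoCount(partition); // 102,400 IndexInfo objects
        System.out.println(entries + " IndexInfo, ~" + (2 * entries) + " ByteBuffers");
    }
}
```

That object count is why the comment's ">300K objects" threshold corresponds to partitions of roughly 20GB at the default block size, and why even much smaller partitions already generate painful GC churn.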
I've also already abstracted out the index implementation in Cassandra so that the current implementation and the new one I'll be proposing and contributing here can be dropped in easily, without special-casing the code all over the place to check the SSTable descriptor for which index implementation was used. All the unit tests and dtests pass after my abstraction work. The final thing I'm almost done with is refactoring my page-cache-aligned/aware file writer to be SegmentedFile aware (and making sure all the math works when the offset into the actual file differs depending on the segment, etc.).
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135081#comment-15135081 ] Michael Kjellman commented on CASSANDRA-9754: - I spent the last 15 hours finishing up the last remaining pieces on the serialization... almost there..
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135073#comment-15135073 ] Brandon Williams commented on CASSANDRA-9754: - I am seeing an increasing number of people running into this problem. Is there any update here?
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135071#comment-15135071 ] Jeff Jirsa commented on CASSANDRA-9754: --- [~mkjellman] - I know you've put a lot of thought into this already. This is impacting us and I'd love to help. Is there anything I can do to assist? Are you working on a patch I can help you test (or can I volunteer to help write tests or similar)?
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723931#comment-14723931 ] Ariel Weisberg commented on CASSANDRA-9754: --- Maybe it makes sense to have an intermediate step. Leverage Robert's work in 9738 to transition to a simple off-heap representation that can be mapped, and then remove the key cache. It seems like this would effectively increase the upper bound on usable partition size in all cases compared to what we have today (is this a true statement?). After, or in parallel, work on another representation for partition indexes.
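The intermediate step suggested above - a mappable off-heap layout searched in place - could be sketched like this. This is a deliberate simplification assuming fixed-width, long-keyed entries; real IndexInfo entries have variable-length first/last names (which is the hard part discussed elsewhere in this thread), and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Sketch of a simple off-heap representation: entries are serialized into
// one (possibly memory-mapped) buffer and binary-searched in place, so a
// lookup touches O(log n) entries instead of materializing n IndexInfo
// objects on the heap.
final class OffHeapIndex {
    private static final int ENTRY = 8 + 8 + 8; // key, offset, width (fixed width)
    private final ByteBuffer buf;               // e.g. a mapped file segment

    OffHeapIndex(ByteBuffer buf) { this.buf = buf; }

    int entryCount() { return buf.capacity() / ENTRY; }

    // Returns the data-file offset of the last entry with key <= target,
    // or -1 when the target sorts before the first entry.
    long floorOffset(long target) {
        int lo = 0, hi = entryCount() - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long key = buf.getLong(mid * ENTRY); // absolute read, no allocation
            if (key <= target) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found < 0 ? -1 : buf.getLong(found * ENTRY + 8);
    }
}
```

Because every read is an absolute `getLong` against the buffer, nothing is promoted to the old gen and the kernel decides which pages stay resident.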
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721034#comment-14721034 ] Robert Stupp commented on CASSANDRA-9754: - I think we definitely need better data structures, since RowIndexEntry is a good fit for neither the key cache nor the index. That's the point of this ticket, CASSANDRA-8931, CASSANDRA-9843, and the pitfall in the current WIP in CASSANDRA-9738. I don't fully agree on relying on the page cache due to its granularity (4 kB, I think), which might be too coarse for keys. But that depends on the actual data structure - i.e. grouping hot keys per page, which conflicts with immutable sstables. Another point is the effort to move to a thread-per-core model, having distinct and independent data structures per thread without barriers/locks/whatever - and the page cache is a shared resource. The next thing is hot and cold data - i.e. we could use bigger intervals (column_index_size_in_kb in current terminology) for cold data. To be clear: I'm not against the page cache or so - I just want to note what I think may influence new stuff.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720928#comment-14720928 ] Michael Kjellman commented on CASSANDRA-9754: - Just a couple more thoughts: 1) One big question I have is how an on-disk B+ tree would play with the on-heap (or off-heap) cache. Ideally, I believe we would only want to cache on the heap columns that we actually got a read request for, to avoid polluting the heap with objects that were never requested by the user. However, as we store intervals in the index and not all the actual values, I'm not sure how to do efficient in-memory lookups if we only stored fragments of the overall index for a given key. For instance, if you put one matching leaf of index objects from the B+ tree into the cache and then got another request for the same key but a different index interval, we'd need to constantly keep rebalancing some kind of data structure on the heap, and I'm operating under the assumption that would be pretty inefficient and painful. Which brings me to... 2) Maybe the key cache isn't necessary if we implement an efficient on-disk B+ tree format. I'm doubtful that anything we implement from a cache perspective inside the application will be better than the kernel's page cache. Frequently accessed keys would be the pages most likely to be in the page cache as well, so we should still get the benefit of LRU eviction.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720893#comment-14720893 ] Jonathan Ellis commented on CASSANDRA-9754: --- 1. Learning time for us would be compaction. 2. ISTM this was not core to the algorithm, but it's been a while since I read the details. 3. We could store the offset in the ARF leaves; this was definitely not core. 4, 5. Yes, this is a key point. Like our existing index, ARF is designed to be memory-resident. As partitions grow larger, the ARF would degrade accuracy rather than spilling to disk (like a B-tree) or getting obscenely large (like our existing index). I would add: 6. Because of (5), ARF gives you Bloom-filter-like behavior for range queries and can quickly optimize away scans of sstables that don't contain the data in question. (A very good fit for DTCS; a smaller benefit for LCS.) So maybe we really want both: ARF for the quick reject, and an (on-disk) B+ tree for "where do I start scanning".
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720365#comment-14720365 ] Robert Stupp commented on CASSANDRA-9754: - Linking CASSANDRA-9738 as it runs into the same problem with {{IndexInfo}} objects (just off-heap instead of disk).
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720608#comment-14720608 ] Jonathan Ellis commented on CASSANDRA-9754: --- ARF may be a better fit than B+tree. CASSANDRA-9843
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720626#comment-14720626 ] Michael Kjellman commented on CASSANDRA-9754: - [~jbellis] I PoC'ed a B+ tree, but I'm certainly not tied to it in any way. Let me read this white paper today and see if I can get a PoC together.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720523#comment-14720523 ] sankalp kohli commented on CASSANDRA-9754: -- If we use a B+ tree, we can actually put a leaf node in the key cache. For most small partitions, leaf = root due to their small size.
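A back-of-the-envelope sketch of why leaf == root for small partitions. The assumptions here are mine, not from the ticket: one index block per 64 KB of partition data (the column_index_size_in_kb default) and roughly 100 entries per tree node; `BirchMath` and its methods are hypothetical.

```java
// Illustrative arithmetic only: with one index block per 64 KB of partition
// data and ~100 entries per node, any partition under ~6.4 MB fits in a
// single B+ tree node, so caching that one node caches the whole index.
final class BirchMath {
    // Number of index blocks for a partition (ceiling division, minimum 1).
    static int blocks(long partitionBytes, int indexBlockKb) {
        long blockBytes = indexBlockKb * 1024L;
        return (int) Math.max(1, (partitionBytes + blockBytes - 1) / blockBytes);
    }

    // Height of a B+ tree holding 'entries' keys with the given fan-out;
    // height 1 means a single node, i.e. leaf == root.
    static int treeHeight(int entries, int fanout) {
        int height = 1;
        long capacity = fanout;
        while (capacity < entries) { capacity *= fanout; height++; }
        return height;
    }
}
```

Under these assumptions a 64 KB partition has one index block (leaf == root), while the 6.4 GB partition from the ticket description has on the order of 100K blocks yet still needs only a three-level tree, so a lookup touches a handful of nodes instead of the whole index.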
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720314#comment-14720314 ] Michael Kjellman commented on CASSANDRA-9754: - We've had a bunch of discussions around this over the past few weeks, and I think I finally have a grasp of the entire issue. The issue is that large CQL partitions (4 GB, 75 GB, etc.) end up with large 200 MB+ serialized indexes. The current logic, when we don't get a cache hit, is to deserialize the entire thing and split it into IndexInfo objects, each of which contains two ByteBuffers (first and last key) and two longs (offset and width). This means we get a very large number of small, most likely short-lived objects creating garbage on the heap --- and with high probability they will be evicted from the cache anyway. On disk we just lay the objects out with the assumption that the entire thing will always be deserialized when it's needed, and never accessed from disk without deserializing the entire thing. I think the only option here is to change the actual way we lay things out on disk. Two options would be a skip list or a B+ tree, where we mmap the pages of the index and try to do something intelligent to avoid bringing objects onto the heap as much as possible. The downside of a B+ tree would be the overhead of creating it on flush, and its log(n) lookup (although the current code is log(n) too, as we binary search over the objects we deserialized into the list, just on the heap). The only references I could find to B+ trees in this project were CASSANDRA-6709 and CASSANDRA-7447. I don't think we need to reinvent the wheel here and entirely change the storage format, but if we use a targeted data structure *just* for the index, we might get something nice. The question would be what impact this will have on normal rows/partitions. Any input on other on-disk data structures we might want to consider would be great.
The other issue is that I'd love to be able to only cache the column that we got a cache hit on. Unfortunately that might be difficult. Today we binary search over the entire List<IndexInfo> to find hits. If you get a column that's in between the first and last name, you return the left node, then go check, and hopefully it's actually there. As we essentially have interval-ish objects here, along with non-fixed-length values, it does make things a bit more fun.
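The lookup described above - deserialize everything, binary search, and fall back to the nearest entry on the left - can be sketched like this. Names are simplified to Strings for illustration; the real code compares ByteBuffers with the table's comparator, and `IndexSearch`/`Info` are hypothetical names.

```java
import java.util.List;

// Sketch of today's lookup: after deserializing the whole index into a
// list, the reader binary-searches for the requested name and returns the
// nearest entry on the left, since each entry only records the first/last
// name of the interval it covers - the hit must then be verified by
// actually reading that block of the data file.
final class IndexSearch {
    record Info(String firstName, String lastName, long offset, long width) {}

    // Index of the entry whose interval may contain 'name', or -1 if the
    // name sorts before the first entry (a guaranteed miss).
    static int floorEntry(List<Info> index, String name) {
        int lo = 0, hi = index.size() - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).firstName().compareTo(name) <= 0) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }
}
```

The search itself is log(n); the GC problem is that all n Info objects must exist on the heap before the first comparison can run.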
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720852#comment-14720852 ] sankalp kohli commented on CASSANDRA-9754: -- How is this any different than an interval tree where we use a boolean instead of a max for leaf nodes? Disclaimer: I did not read the whole paper.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720874#comment-14720874 ] Michael Kjellman commented on CASSANDRA-9754: - [~jbellis] I read a bunch of the Adaptive Range Filters (ARF) paper from CASSANDRA-9843, and I'm not sure it's a good fit here, but please let me know if I misunderstood the paper. 1) The key optimization they make in ARF is the adaptive or learning component that resizes and optimizes the shape and layout of the B-tree dynamically as elements are added or removed. This was important to the original authors because Microsoft SQL Server appears to have hot and cold storage, so every time a record exceeds the threshold to go from hot to cold storage, they need to efficiently modify and update the tree. Given that all of our data structures in this case will be immutable, this has no benefit to us. 2) The paper uses fixed-length keys for the intervals. I'm unsure how we could modify this to accommodate the variable-length start and end values required for IndexInfo. 3) ARF (like a Bloom filter) only returns true or false to indicate the presence of a value in an interval. For IndexInfo we need the offset and width into the data to start reading to retrieve the actual data. I'm not sure how we can modify ARF to return values and not just a boolean. 4) If we dynamically adjust the size of the intervals and make them larger, it means the index will get even worse, requiring us to read a larger amount of actual data as our offset and width increase. If the query was a false positive, it would become even more expensive, given the possibility of reading a huge amount of data for the larger interval only to find no data. 5) It appears ARFs were always intended to stay in memory. I'm not sure how feasible it would be to implement ARF in an on-disk, cache-friendly way.
As we need to store the bytes for the first and last values of each indexed position, I think we should still use a data structure that we can use from disk without deserializing the entire structure onto the heap (which is the same problem we have with the current storage format today). Let me know if you have any more thoughts or if I've misunderstood the paper!
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619977#comment-14619977 ] Robert Stupp commented on CASSANDRA-9754: - Yes, the purpose of CASSANDRA-9738 is to move the (permanent part of the) key cache off-heap - basically eliminating a huge number of tiny objects in the old gen. bq. we can still optimize on reducing the objects created Absolutely agree! Although the first benchmark I did with CASSANDRA-9738 shows a huge reduction in GC effort of roughly 90% (total and avg GC time with 8u45+G1) for a read-only workload with actually more (short-lived) objects created - I suppose these never get promoted to the old gen.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619318#comment-14619318 ] Jonathan Ellis commented on CASSANDRA-9754: --- CASSANDRA-9738 might be the simplest way forward then.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619286#comment-14619286 ] sankalp kohli commented on CASSANDRA-9754: -- The strongly referenced IndexInfo objects are mostly from the key cache (we are using 100 MB). The problem is that, due to the key cache, all these objects will get promoted and cause fragmentation when evicted from the key cache. I also looked at a heap dump generated by an OOM; 93% of the retained size was coming from IndexInfo.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619368#comment-14619368 ] sankalp kohli commented on CASSANDRA-9754: -- If I am understanding it correctly, CASSANDRA-9738 will move the key cache off heap. This will help, as these IndexInfo objects won't be promoted since they won't be referenced by any on-heap map. However, we will still need to generate all these objects for each read, whether it is in the key cache or not. With CASSANDRA-9738, on a key cache hit, we will create new objects again when deserializing. I think CASSANDRA-9738 is good, but we can still optimize on reducing the objects created, which is currently proportional to the size of the CQL partition.
[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617679#comment-14617679 ] Jonathan Ellis commented on CASSANDRA-9754: --- But are most of those in the index summary (should be relatively stable once tenured) or the rowcache (high churn)?