[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139735#comment-15139735 ]
Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------

[~jkrupan] ~2GB is the maximum partition size I'd recommend at the moment, from experience.

The current implementation creates an IndexInfo entry for every 64kb of data (by default - and I highly doubt anyone actually changes this default). Each IndexInfo object contains the offset into the sstable where the partition/row starts, the length to read, and the name. These IndexInfo objects are placed into a list and binary searched over to find the name closest to the query. Then, we go to that offset in the sstable and start reading the actual data.

The issue that makes things so bad with large partitions is that the entire list of IndexInfo objects is currently just serialized one after another into the index file on disk. To do an indexed read across a given partition we have to read the entire thing off disk, deserialize every IndexInfo object, place them into a list, and then binary search across it. This creates a ton of small objects very quickly that are likely to be promoted and thus create a lot of GC pressure. If you take the average size of each column you have in a row you can figure out how many index entry objects will be created (one for every 64k of data in that partition). I've found that once the IndexInfo array contains > 300k objects, things get bad. (For scale: at the 64kb default, a 2GB partition works out to roughly 32k entries.)

The implementation I'm *almost* done with has the same big O complexity (O(log(n))) as the current implementation, but instead the index is backed by page cache aligned mmap'ed segments (B+ tree-ish, with an overflow page implementation similar to that of SQLite). This means we can now walk the IndexEntry objects and only bring onto the heap the 4k chunks that are involved in the binary search for the correct entry itself. (Rough sketches of both the current approach and the page-aligned approach are at the bottom of this message.)

The tree itself is finished and heavily tested. I've also already abstracted out the index implementation in Cassandra so that the current implementation and the new one I'll be proposing and contributing here can be dropped in easily, without special casing the code all over the place to check the SSTable descriptor for which index implementation was used. All the unit tests and dtests pass after my abstraction work. The final thing I'm almost done with is refactoring my Page Cache Aligned/Aware File Writer to be SegmentedFile aware (and making sure all the math works when the offset into the actual file differs depending on the segment, etc).

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
>                 Key: CASSANDRA-9754
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>
> Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects are IndexInfo and its ByteBuffers. This is especially bad on endpoints with large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for GC. Can this be improved by not creating so many objects?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
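
For anyone trying to follow along, here is a rough, hypothetical sketch of what the current on-heap approach boils down to. The class and field names below (IndexInfoSketch, firstName, offset, width) are illustrative only, not the actual Cassandra classes; the point is that every IndexInfo for the partition is deserialized onto the heap before the binary search can even start:

{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Cassandra's IndexInfo: one instance (plus a ByteBuffer
// for the name) per 64kb of partition data.
final class IndexInfoSketch
{
    final ByteBuffer firstName;   // first clustering name covered by this 64kb block
    final long offset;            // offset into the sstable data file
    final long width;             // number of bytes covered from that offset

    IndexInfoSketch(ByteBuffer firstName, long offset, long width)
    {
        this.firstName = firstName;
        this.offset = offset;
        this.width = width;
    }

    // The entire index entry for the partition is deserialized up front: for a large
    // partition this is hundreds of thousands of short-lived objects on the heap.
    static List<IndexInfoSketch> deserializeAll(DataInputStream in, int count) throws IOException
    {
        List<IndexInfoSketch> infos = new ArrayList<>(count);
        for (int i = 0; i < count; i++)
        {
            byte[] name = new byte[in.readUnsignedShort()];
            in.readFully(name);
            infos.add(new IndexInfoSketch(ByteBuffer.wrap(name), in.readLong(), in.readLong()));
        }
        return infos;
    }

    // Binary search the fully materialized list for the block whose first name is
    // closest to (at or before) the queried name.
    static int indexFor(List<IndexInfoSketch> infos, ByteBuffer name, Comparator<ByteBuffer> comparator)
    {
        IndexInfoSketch key = new IndexInfoSketch(name, -1, -1);
        int idx = Collections.binarySearch(infos, key, (a, b) -> comparator.compare(a.firstName, b.firstName));
        // a negative result is the insertion point; step back to the preceding block
        return idx < 0 ? -idx - 2 : idx;
    }
}
{code}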
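
And here is a minimal sketch of the page-cache-aligned idea, under the simplifying assumption of fixed-width entries with a single 8-byte name key. The real patch uses variable-length entries in a B+ tree-ish layout with overflow pages and splits the index across mmap'ed page-aligned segments rather than a single map; the sketch only illustrates that the search stays O(log(n)) while each probe touches just the ~4k page it lands on:

{code:java}
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical mmap-backed index reader with a fixed 32-byte entry layout:
// 8-byte name key, 8-byte data file offset, 8-byte width, 8 bytes reserved.
final class MappedIndexSketch implements AutoCloseable
{
    private static final int ENTRY_SIZE = 32;

    private final FileChannel channel;
    private final MappedByteBuffer map;   // a real implementation would use multiple page-aligned segments
    private final long entryCount;

    MappedIndexSketch(Path indexFile) throws IOException
    {
        channel = FileChannel.open(indexFile, StandardOpenOption.READ);
        map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        entryCount = channel.size() / ENTRY_SIZE;
    }

    // Same O(log(n)) search as before, but each probe reads directly from the mapped
    // file, so only the handful of 4k pages actually touched are faulted in; nothing
    // is deserialized onto the heap except the few values we compare.
    long offsetFor(long targetNameKey)
    {
        long lo = 0, hi = entryCount - 1, result = -1;
        while (lo <= hi)
        {
            long mid = (lo + hi) >>> 1;
            int base = (int) (mid * ENTRY_SIZE);   // a segmented index would avoid this int-sized limit
            long nameKey = map.getLong(base);
            if (nameKey <= targetNameKey)
            {
                result = map.getLong(base + 8);    // candidate offset into the data file
                lo = mid + 1;
            }
            else
            {
                hi = mid - 1;
            }
        }
        return result;
    }

    @Override
    public void close() throws IOException
    {
        channel.close();
    }
}
{code}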