[ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139735#comment-15139735 ]

Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------

[~jkrupan] ~2GB is the maximum partition size I'd recommend targeting at the moment, from experience.

The current implementation creates an IndexInfo entry for every 64KB worth of data 
(by default - though I highly doubt anyone actually changes this default). Each 
IndexInfo object contains the offset into the sstable where that indexed block of 
the partition/row starts, the length to read, and the name. These IndexInfo objects 
are placed into a list and binary searched over to find the name closest to the 
query. Then we go to that offset in the sstable and start reading the actual data.
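
To make that concrete, here's a rough sketch of the shape of that structure and the 
search over it (illustrative class and field names only, not the actual 
org.apache.cassandra.db.IndexInfo code):

{code:java}
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Simplified sketch of what each index entry carries; the real IndexInfo differs in details.
final class IndexEntrySketch
{
    final ByteBuffer firstName;   // first clustering name covered by this 64KB block
    final long offset;            // offset into the sstable data file
    final long width;             // number of bytes to read from that offset

    IndexEntrySketch(ByteBuffer firstName, long offset, long width)
    {
        this.firstName = firstName;
        this.offset = offset;
        this.width = width;
    }

    // The read path deserializes every entry for the partition into a List
    // and binary searches it for the entry closest to the queried name.
    static IndexEntrySketch search(List<IndexEntrySketch> entries, ByteBuffer name, Comparator<ByteBuffer> cmp)
    {
        int idx = Collections.binarySearch(
                entries,
                new IndexEntrySketch(name, -1, -1),
                (a, b) -> cmp.compare(a.firstName, b.firstName));
        if (idx < 0)
            idx = Math.max(0, -idx - 2); // block whose first name precedes the queried name
        return entries.get(idx);
    }
}
{code}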

What makes this so bad with large partitions is that, for an indexed read within a 
given partition, the entire list of IndexInfo objects is currently just serialized 
one after another into the index file on disk. To use it we have to read the entire 
thing off disk, deserialize every IndexInfo object, place it into a list, and then 
binary search across it. This creates a ton of small objects very quickly, objects 
that are likely to be promoted and thus create a lot of GC pressure.
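
In sketch form, the read path is forced into something like the following (reusing 
the IndexEntrySketch shape from above; this is illustrative, not the actual 
RowIndexEntry/IndexInfo serialization code):

{code:java}
import java.io.DataInput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class IndexBlockReaderSketch
{
    // The entries for a partition are serialized back to back, so we cannot seek to
    // the one we need -- every entry must be deserialized onto the heap first.
    static List<IndexEntrySketch> readAll(DataInput in, int entryCount) throws IOException
    {
        List<IndexEntrySketch> entries = new ArrayList<>(entryCount);
        for (int i = 0; i < entryCount; i++)
        {
            int nameLength = in.readUnsignedShort();
            byte[] name = new byte[nameLength];
            in.readFully(name);                  // allocates a byte[] + ByteBuffer per entry
            long offset = in.readLong();
            long width = in.readLong();
            entries.add(new IndexEntrySketch(ByteBuffer.wrap(name), offset, width));
        }
        return entries; // hundreds of thousands of small objects for a large partition
    }
}
{code}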

If you take the average size of each column in a row, you can figure out how many 
index entry objects will be created (one for every 64KB of data in that partition). 
I've found that once the IndexInfo array contains > 300K objects, things get bad.
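
As a rough back-of-the-envelope (assuming the default 64KB index interval):

{code:java}
// IndexInfo entries per partition at the default 64KB column index interval.
public final class IndexEntryCount
{
    public static void main(String[] args)
    {
        long indexInterval = 64L * 1024;       // default column index interval
        long[] partitionSizes = { 2L << 30, 6_871_947_674L /* ~6.4GB */, 20L << 30 };
        for (long size : partitionSizes)
            System.out.printf("%,d bytes -> ~%,d IndexInfo entries%n", size, size / indexInterval);
        // ~2GB   -> ~32K entries
        // ~6.4GB -> ~100K entries (the case in the issue description below)
        // ~20GB  -> ~328K entries, past the ~300K point where reads get painful
    }
}
{code}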

The implementation I'm *almost* done with has the same big-O complexity (O(log n)) 
as the current implementation, but instead the index is backed by page-cache-aligned 
mmap'ed segments (B+ tree-ish, with an overflow page implementation similar to that 
of SQLite). This means we can now walk the IndexEntry objects and only bring onto 
the heap the 4KB chunks that are actually involved in the binary search for the 
correct entry.
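
To illustrate the difference, here's a minimal sketch of searching page-aligned, 
mmap'ed index records (this assumes fixed-size records and a single < 2GB segment 
for brevity; the actual patch is a B+ tree with overflow pages and is more involved):

{code:java}
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapIndexSketch
{
    // Binary search over fixed-size index records laid out in a page-cache-aligned,
    // mmap'ed segment. Only the 4KB pages actually touched by the search are faulted
    // into the page cache; nothing is deserialized onto the heap until we land on
    // the winning record. Assumes one mmap'ed segment < 2GB (hence the int positions).
    static long findOffset(Path indexFile, long recordCount, int recordSize, long targetKeyHash) throws IOException
    {
        try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ))
        {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long lo = 0, hi = recordCount - 1;
            while (lo <= hi)
            {
                long mid = (lo + hi) >>> 1;
                int pos = (int) (mid * recordSize);   // record position within the segment
                long keyHash = map.getLong(pos);      // touches only the page(s) holding this record
                if (keyHash < targetKeyHash)
                    lo = mid + 1;
                else if (keyHash > targetKeyHash)
                    hi = mid - 1;
                else
                    return map.getLong(pos + Long.BYTES); // data-file offset stored next to the key
            }
            return -1; // not found
        }
    }
}
{code}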

The tree itself is finished and heavily tested. I've also already abstracted out the 
index implementation in Cassandra so that the current implementation and the new one 
I'll be proposing and contributing here can be dropped in easily, without 
special-casing code all over the place to check the SSTable descriptor for which 
index implementation was used. All the unit tests and dtests pass after my 
abstraction work. The final thing I'm almost done with is refactoring my Page Cache 
Aligned/Aware File Writer to be SegmentedFile aware (and making sure all the math 
works when the offset into the actual file differs depending on the segment, etc.).
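
For context, the seam I'm describing is roughly shaped like this (hypothetical 
interface and names, not the actual patch):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical shape of the seam: readers ask the descriptor-selected index
// implementation for the data-file position, without special-casing the index
// format at every call site. Names here are purely illustrative.
interface PartitionIndexSketch
{
    // Returns the offset/width in the data file covering the given clustering name.
    IndexedBlock findBlock(ByteBuffer clusteringName) throws IOException;

    final class IndexedBlock
    {
        public final long offset;
        public final long width;

        public IndexedBlock(long offset, long width)
        {
            this.offset = offset;
            this.width = width;
        }
    }
}
{code}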

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
>                 Key: CASSANDRA-9754
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>
>  Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects 
> are IndexInfo and its ByteBuffers. This is especially bad in endpoints with 
> large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K 
> IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for 
> GC. Can this be improved by not creating so many objects?



