[ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Stupp updated CASSANDRA-11206:
-------------------------------------
    Status: Patch Available  (was: In Progress)

Pushed the latest version to the [git branch|https://github.com/apache/cassandra/compare/trunk...snazy:11206-large-part-trunk?expand=1]. CI results ([testall|http://cassci.datastax.com/job/snazy-11206-large-part-trunk-testall/lastCompletedBuild/testReport/], [dtest|http://cassci.datastax.com/job/snazy-11206-large-part-trunk-dtest/lastCompletedBuild/testReport/]) and cstar results (see below) look good.

The initial approach was to “ban” all {{IndexInfo}} instances from the key cache. Although this is a great option for big partitions, “moderately” sized partitions suffer from that approach (see the “0 kB” cstar run below). As a compromise, a new {{cassandra.yaml}} option {{column_index_cache_size_in_kb}} has been introduced that defines when {{IndexInfo}} objects are no longer kept on heap. The new option defaults to 2 kB; it can be tuned down (to as low as 0) or up. Some thoughts about both directions:
* Setting the value to 0 or some other very low value will obviously reduce GC pressure at the cost of more I/O (see the sketch after this list).
* The cost of accessing index samples on disk is twofold: first, there is the obvious I/O cost via a {{RandomAccessReader}}; second, each {{RandomAccessReader}} instance has its own buffer (which can be off- or on-heap, but seems to default to off-heap), so there is some (quite small) overhead to borrow and release that buffer.
* The higher the value of {{column_index_cache_size_in_kb}}, the more objects stay on the heap - and therefore the more GC pressure.
* Note that the parameter refers to the _serialized_ size, not the _number_, of {{IndexInfo}} objects. This was chosen to establish a more obvious relation between the size of {{IndexInfo}} objects and the amount of consumed heap - the size of {{IndexInfo}} objects is mostly determined by the size of the clustering keys.
* Also note that some internal system/schema tables - especially those for LWTs - use clustering keys and therefore index samples.
* For workloads with a huge number of random reads against a large data set, small values for {{column_index_cache_size_in_kb}} (like the default value) are beneficial if the key cache is always full (i.e. it is evicting a lot).
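To make the trade-off more concrete, here is a minimal, self-contained Java sketch of the decision the new option drives: {{IndexInfo}} samples stay on heap only while their cumulative serialized size is below the threshold; otherwise the entry becomes “shallow” and readers go to the index file instead. All names and sizes in the sketch ({{IndexCacheDecisionSketch}}, {{Sample}}, the 24-byte overhead) are illustrative assumptions, not the patch's actual classes:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the decision column_index_cache_size_in_kb drives when a
// partition index is built: keep IndexInfo samples on heap only while their
// cumulative serialized size stays below the threshold; otherwise fall back to
// a "shallow" entry that records just the position in the index file. All
// names and sizes here are illustrative assumptions, not the patch's code.
public class IndexCacheDecisionSketch
{
    // Simplified stand-in for an IndexInfo sample; its serialized size is
    // dominated by the clustering-key bounds it stores.
    static final class Sample
    {
        final byte[] firstClustering;
        final byte[] lastClustering;

        Sample(byte[] first, byte[] last)
        {
            this.firstClustering = first;
            this.lastClustering = last;
        }

        long serializedSize()
        {
            // clustering bounds plus a made-up fixed overhead for offset/width/deletion
            return firstClustering.length + lastClustering.length + 24;
        }
    }

    public static void main(String[] args)
    {
        long thresholdBytes = 2 * 1024; // column_index_cache_size_in_kb: 2 (the default)

        List<Sample> onHeap = new ArrayList<>();
        long serializedSize = 0;
        boolean shallow = false;

        for (int i = 0; i < 100 && !shallow; i++)
        {
            Sample sample = new Sample(new byte[16], new byte[16]);
            serializedSize += sample.serializedSize();
            if (serializedSize > thresholdBytes)
            {
                // Threshold hit: stop collecting IndexInfo on heap; readers
                // will fetch individual samples from the index file instead.
                onHeap.clear();
                shallow = true;
            }
            else
            {
                onHeap.add(sample);
            }
        }

        System.out.println(shallow
                           ? "shallow entry: only the index-file position stays on heap"
                           : "indexed entry: " + onHeap.size() + " IndexInfo objects on heap");
    }
}
{code}

The actual patch switches to serializing into a byte buffer once the threshold is hit (see the change summary below), but the heap trade-off is the same.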
Some local tests with the new {{LargePartitionsTest}} on my MacBook (Time Machine + indexing turned off) indicate that caching works for shallow indexed entries. I’ve scheduled some cstar runs against the _trades_ workload. Only the run with {{column_index_cache_size_in_kb: 0}} (which means that no {{IndexInfo}} is kept on heap or in the key cache) shows a performance regression. The default value of 2 kB for {{column_index_cache_size_in_kb}} was chosen as a result of this experiment.
* {{column_index_cache_size_in_kb: 0}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=e96c871e-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2794.77&ymin=0&ymax=141912.1]
* {{column_index_cache_size_in_kb: 2}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=410592e2-f288-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2044.46&ymin=0&ymax=142732.7]
* {{column_index_cache_size_in_kb: 4}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=f8e36ec4-f275-11e5-a3d3-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2101.44&ymin=0&ymax=141696.5]
* {{column_index_cache_size_in_kb: 8}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=be3516a6-f275-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2057.88&ymin=0&ymax=142156.3]

Other cstar runs ([here|http://cstar.datastax.com/graph?command=one_job&stats=ce9de45a-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=write&smoothing=1&show_aggregates=true], [here|http://cstar.datastax.com/graph?command=one_job&stats=c97118bc-f275-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=259.82&ymin=0&ymax=73909] and [here|http://cstar.datastax.com/graph?command=one_job&stats=c4ece8d4-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=496.32&ymin=0&ymax=89063.7]) have shown that there is no change for some plain workloads. Daily regression tests show similar performance: [compaction|http://cstar.datastax.com/graph?command=one_job&stats=86d7cda8-f346-11e5-8ef0-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=54.56&ymin=0&ymax=275053.9], [repair|http://cstar.datastax.com/graph?command=one_job&stats=9c78fd1c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=55.88&ymin=0&ymax=279059], [STCS|http://cstar.datastax.com/graph?command=one_job&stats=ac43b886-f346-11e5-8ef0-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=170.39&ymin=0&ymax=98341.1], [DTCS|http://cstar.datastax.com/graph?command=one_job&stats=b8e0a11c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=172.15&ymin=0&ymax=96739.5], [LCS|http://cstar.datastax.com/graph?command=one_job&stats=f2b4530c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=171.6&ymin=0&ymax=94480.1], [1 MV|http://cstar.datastax.com/graph?command=one_job&stats=c9fcd4e8-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=1401.73&ymin=0&ymax=70372.5], [3 MV|http://cstar.datastax.com/tests/id/d80c88a8-f346-11e5-82b8-0256e416528f], [rolling upgrade|http://cstar.datastax.com/graph?command=one_job&stats=44be9a90-f347-11e5-b06a-0256e416528f&metric=op_rate&operation=5_read&smoothing=1&show_aggregates=true&xmin=0&xmax=201.41&ymin=0&ymax=171992.7]
Summary of the changes:
* Ungenerified {{RowIndexEntry}}.
* {{RowIndexEntry}} now has a method to create an accessor to the {{IndexInfo}} objects on disk - that accessor requires an instance of {{FileDataInput}}.
* {{RowIndexEntry}} now has three subclasses: {{ShallowIndexedEntry}}, which is basically the old {{IndexedEntry}} with the {{IndexInfo}} array-list removed and which is responsible only for index files with an offsets table; {{LegacyShallowIndexedEntry}}, which is responsible for index files without an offsets table (i.e. pre-3.0); and {{IndexedEntry}}, which keeps the {{IndexInfo}} objects in an array and is used when the serialized size of the RIE's payload is less than the new cassandra.yaml parameter {{column_index_cache_size_in_kb}}.
* {{RowIndexEntry.IndexInfoRetriever}} is the interface to access {{IndexInfo}} on disk using a {{FileDataInput}}. It has two concrete implementations: one for sstable versions with an offsets table and one for legacy sstable versions. It is used only from {{AbstractSSTableIterator.IndexState}} (see the sketch after this list).
* Added a “cache” of already deserialized {{IndexInfo}} instances to the base class of {{IndexInfoRetriever}} for “shallow” indexed entries. This is not necessary for the binary search, but it helps many other access patterns, which sometimes “jump around” in the {{IndexInfo}} objects. Since {{IndexState}} is a short-lived object, these cached {{IndexInfo}} instances get garbage collected early.
* Writing of index files has also changed: it now switches to serializing into a byte buffer, instead of collecting an array-list of {{IndexInfo}} objects, when {{column_index_cache_size_in_kb}} is hit.
* Bumped the version of the serialized key cache from {{d}} to {{e}}. The key cache and its serialized form no longer contain {{IndexInfo}} objects for indexed entries that exceed {{column_index_cache_size_in_kb}}; those entries only need the position in the index file. Therefore, the serialized format of the key cache has changed.
* {{Serializers}} (one instance per {{CFMetaData}}) keeps a “singleton” {{IndexInfo.Serializer}} instance for {{BigFormat.latestVersion}} and constructs and keeps instances for other versions. For “shallow” RIEs we need an instance of {{IndexInfo.Serializer}} to read {{IndexInfo}} from disk - a “singleton” further reduces the number of objects on heap. To be clear, we used to create a lot of these instances (roughly one per {{IndexInfo}} instance/operation). We could reduce the number of {{IndexSerializer}} instances further in the future, but that did not feel necessary for this ticket.
* Merged RIE's {{IndexSerializer}} interface and {{Serializer}} class (the interface had only a single implementation).
* Added methods to {{IndexSerializer}} to handle the special serialization for saved key caches.
* Added a specialized {{deserializePosition}} method to {{IndexSerializer}}, as some invocations just need the position in the data file.
* Moved the {{IndexInfo}} binary search into the {{AbstractSSTableIterator.IndexState}} class (the only place where it is used).
* Added some more {{skip}} methods in various places. These are required to calculate the offsets array for legacy sstable versions.
* Removed the classes {{ColumnIndex}} and {{IndexHelper}} (functionality moved); {{IndexInfo}} is now a top-level class.
* Added some {{Pre_11206_*}} classes to {{RowIndexEntryTest}} that are copies of the previous implementations.
* Added new {{PagingQueryTest}} to test paged queries.
* Added new {{LargePartitionsTest}} to test/compare various partition sizes (to be run explicitly, otherwise ignored).
* Added test methods to {{KeyCacheTest}} and {{KeyCacheCqlTest}} for shallow/non-shallow indexed entries.
* Also re-added the behavior of CASSANDRA-8180 for {{IndexedEntry}} (but not for {{ShallowIndexedEntry}}).
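To illustrate the “shallow” read path described above, here is a minimal, self-contained Java sketch: a retriever that binary-searches the serialized index samples through an offsets table, deserializing only the samples it actually compares against and caching those for its own (short) lifetime. Everything here ({{ShallowRetrieverSketch}}, samples encoded as single longs, the {{HashMap}} cache) is an illustrative assumption, not the patch's actual {{IndexInfoRetriever}} code:

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the "shallow" read path: binary-search the serialized
// index samples through an offsets table, deserializing only the samples that
// are actually compared against, and caching those for the lifetime of the
// (short-lived) retriever. All names are illustrative assumptions; this is
// not the patch's actual IndexInfoRetriever/IndexState code.
public class ShallowRetrieverSketch
{
    private final ByteBuffer indexFile; // stand-in for the index file region of one partition
    private final int[] offsets;        // offsets table: position of each serialized sample
    private final Map<Integer, Long> cache = new HashMap<>(); // deserialized-sample "cache"

    ShallowRetrieverSketch(ByteBuffer indexFile, int[] offsets)
    {
        this.indexFile = indexFile;
        this.offsets = offsets;
    }

    // Deserialize sample #i on demand; for the sketch, a serialized sample is
    // just its first clustering value encoded as a long.
    long sample(int i)
    {
        return cache.computeIfAbsent(i, idx -> indexFile.getLong(offsets[idx]));
    }

    // Find the right-most sample whose clustering value is <= key. This needs
    // only log(n) deserializations instead of materializing the whole list.
    int indexFor(long key)
    {
        int low = 0, high = offsets.length - 1, result = -1;
        while (low <= high)
        {
            int mid = (low + high) >>> 1;
            if (sample(mid) <= key)
            {
                result = mid;
                low = mid + 1;
            }
            else
            {
                high = mid - 1;
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Fake serialized index with 8 samples covering clustering values 0, 10, ..., 70.
        ByteBuffer buf = ByteBuffer.allocate(8 * Long.BYTES);
        int[] offsets = new int[8];
        for (int i = 0; i < 8; i++)
        {
            offsets[i] = i * Long.BYTES;
            buf.putLong(offsets[i], i * 10L);
        }
        ShallowRetrieverSketch retriever = new ShallowRetrieverSketch(buf, offsets);
        System.out.println("index block for key 42: " + retriever.indexFor(42)); // prints 4
    }
}
{code}

In the actual patch the retriever reads real {{IndexInfo}} serialization through a {{FileDataInput}}, but the log(n)-deserializations shape of the search is the same.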

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>         Attachments: 11206-gc.png, trunk-gc.png
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition of every 64KB (by default) range of rows. To find a row, we binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754) and causes non-negligible I/O activity (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo bsearch while only deserializing the IndexInfo that we need to compare against, i.e. log(N) deserializations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)