[ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Stupp updated CASSANDRA-11206:
-------------------------------------
    Status: Patch Available  (was: In Progress)

Pushed the latest version to the [git branch|https://github.com/apache/cassandra/compare/trunk...snazy:11206-large-part-trunk?expand=1].
CI results ([testall|http://cassci.datastax.com/job/snazy-11206-large-part-trunk-testall/lastCompletedBuild/testReport/], [dtest|http://cassci.datastax.com/job/snazy-11206-large-part-trunk-dtest/lastCompletedBuild/testReport/]) and cstar results (see below) look good.

The initial approach was to “ban” all {{IndexInfo}} instances from the key 
cache. Although this works well for big partitions, “moderately” sized 
partitions suffer from that approach (see the “0kB” cstar run below). So, as a 
compromise, a new {{cassandra.yaml}} option {{column_index_cache_size_in_kb}} 
has been introduced; it defines when {{IndexInfo}} objects are no longer kept 
on heap. The new option defaults to 2 kB. It is possible to tune it to lower 
values (down to 0) or higher values. Some thoughts about both directions:
* Setting the value to 0 or some other very low value will obviously reduce GC 
pressure at the cost of high I/O
* The cost of accessing index samples on disk is twofold: first, there’s the 
obvious I/O cost via a {{RandomAccessReader}}; second, each 
{{RandomAccessReader}} instance has its own buffer (which can be off- or 
on-heap, but seems to default to off-heap), so there is some (quite small) 
overhead for borrowing/releasing that buffer.
* The higher the value of {{column_index_cache_size_in_kb}}, the more objects 
will be kept on the heap - and therefore the more GC pressure.
* Note that the parameter refers to the _serialized_ size, not the _number_, 
of {{IndexInfo}} objects. This was chosen to give a more obvious relation 
between the size of the {{IndexInfo}} objects and the amount of consumed heap 
- the size of an {{IndexInfo}} object is mostly determined by the size of the 
clustering keys. (See the sketch after this list.)
* Also note that some internal system/schema tables - especially those for LWTs 
- use clustering keys and therefore index samples.
* For workloads with a huge amount of random reads against a large data set, 
small values for {{column_index_cache_size_in_kb}} (like the default value) are 
beneficial if the key cache is always full (i.e. it is evicting a lot).
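
To illustrate the “serialized size, not number of objects” point above, here is a small standalone Java sketch. It is not code from the patch - the type, the sample sizes and the helper names are invented for the example; only the 2 kB default and the idea of comparing the serialized payload size against the limit come from the description above.

{code:java}
// Standalone illustration only - not code from the patch; all names are invented.
// Shows the intent of column_index_cache_size_in_kb: the decision is based on
// the *serialized* size of the IndexInfo payload, not the number of objects.
import java.util.List;

public class IndexCacheThresholdSketch
{
    static final long LIMIT_BYTES = 2 * 1024; // default column_index_cache_size_in_kb: 2

    // Stand-in for an index sample; real IndexInfo sizes mostly depend on the clustering keys.
    record Sample(int serializedSize) {}

    static boolean keepOnHeap(List<Sample> samples)
    {
        long total = 0;
        for (Sample s : samples)
            total += s.serializedSize();
        return total <= LIMIT_BYTES; // small payload: keep on heap (and in the key cache)
    }

    public static void main(String[] args)
    {
        System.out.println(keepOnHeap(List.of(new Sample(800), new Sample(900))));   // true  -> kept on heap
        System.out.println(keepOnHeap(List.of(new Sample(1500), new Sample(1500)))); // false -> "shallow", read from disk
    }
}
{code}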

Some local tests with the new {{LargePartitionTest}} on my MacBook (Time 
Machine and indexing turned off) indicate that caching seems to work for 
shallow indexed entries.

I’ve scheduled some cstar runs against the _trades_ workload. Only the run 
with {{column_index_cache_size_in_kb: 0}} (which means that no {{IndexInfo}} 
is kept on heap or in the key cache) shows a performance regression. The 
default value of 2 kB for {{column_index_cache_size_in_kb}} was chosen as a 
result of this experiment.
* {{column_index_cache_size_in_kb: 0}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=e96c871e-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2794.77&ymin=0&ymax=141912.1]
* {{column_index_cache_size_in_kb: 2}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=410592e2-f288-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2044.46&ymin=0&ymax=142732.7]
* {{column_index_cache_size_in_kb: 4}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=f8e36ec4-f275-11e5-a3d3-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2101.44&ymin=0&ymax=141696.5]
* {{column_index_cache_size_in_kb: 8}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=be3516a6-f275-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2057.88&ymin=0&ymax=142156.3]

Other cstar runs 
([here|http://cstar.datastax.com/graph?command=one_job&stats=ce9de45a-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=write&smoothing=1&show_aggregates=true],
 
[here|http://cstar.datastax.com/graph?command=one_job&stats=c97118bc-f275-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=259.82&ymin=0&ymax=73909]
 and 
[here|http://cstar.datastax.com/graph?command=one_job&stats=c4ece8d4-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=496.32&ymin=0&ymax=89063.7])
 have shown that there’s no change for some plain workloads.

Daily regression tests show similar performance: 
[compaction|http://cstar.datastax.com/graph?command=one_job&stats=86d7cda8-f346-11e5-8ef0-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=54.56&ymin=0&ymax=275053.9],
 
[repair|http://cstar.datastax.com/graph?command=one_job&stats=9c78fd1c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=55.88&ymin=0&ymax=279059],
 
[STCS|http://cstar.datastax.com/graph?command=one_job&stats=ac43b886-f346-11e5-8ef0-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=170.39&ymin=0&ymax=98341.1],
 
[DTCS|http://cstar.datastax.com/graph?command=one_job&stats=b8e0a11c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=172.15&ymin=0&ymax=96739.5],
 
[LCS|http://cstar.datastax.com/graph?command=one_job&stats=f2b4530c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=171.6&ymin=0&ymax=94480.1],
[1 MV|http://cstar.datastax.com/graph?command=one_job&stats=c9fcd4e8-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=1401.73&ymin=0&ymax=70372.5], [3 MV|http://cstar.datastax.com/tests/id/d80c88a8-f346-11e5-82b8-0256e416528f], [rolling upgrade|http://cstar.datastax.com/graph?command=one_job&stats=44be9a90-f347-11e5-b06a-0256e416528f&metric=op_rate&operation=5_read&smoothing=1&show_aggregates=true&xmin=0&xmax=201.41&ymin=0&ymax=171992.7]

Summary of the changes:
* Ungenerified {{RowIndexEntry}}
* {{RowIndexEntry}} now has a method to create an accessor to the {{IndexInfo}} 
objects on disk - that accessor requires an instance of {{FileDataInput}}
* {{RowIndexEntry}} now has three subclasses: {{ShallowIndexedEntry}}, which 
is basically the old {{IndexedEntry}} with the {{IndexInfo}} array-list 
removed and is only responsible for index files with an offsets-table; 
{{LegacyShallowIndexedEntry}}, which is responsible for index files without 
an offsets-table (i.e. pre-3.0); and {{IndexedEntry}}, which keeps the 
{{IndexInfo}} objects in an array and is used if the serialized size of the 
RIE’s payload is less than the new cassandra.yaml parameter 
{{column_index_cache_size_in_kb}}.
* {{RowIndexEntry.IndexInfoRetriever}} is the interface to access {{IndexInfo}} 
on disk using a {{FileDataInput}}. It has concrete implementations: one for 
sstable versions with offsets and one for legacy sstable versions. It is only 
used from {{AbstractSSTableIterator.IndexState}} (a small illustrative search 
sketch follows this list).
* Added a “cache” of already deserialized {{IndexInfo}} instances in the base 
class of {{IndexInfoRetriever}} for “shallow” indexed entries. This is not 
necessary for the binary search, but it helps many other access patterns, 
which sometimes appear to “jump around” in the {{IndexInfo}} objects. Since 
{{IndexState}} is a short-lived object, these cached {{IndexInfo}} instances 
get garbage collected early.
* Writing of index files has also changed: it now switches from collecting an 
array-list of {{IndexInfo}} objects to serializing into a byte buffer once 
{{column_index_cache_size_in_kb}} is hit (a small illustrative writer sketch 
follows this list).
* Bumped the version of the serialized key cache from {{d}} to {{e}}. The key 
cache and its serialized form no longer contain {{IndexInfo}} objects for 
indexed entries that exceed {{column_index_cache_size_in_kb}}; such entries 
instead need the position in the index file. Therefore, the serialized format 
of the key cache has changed.
* {{Serializers}} (of which there is one instance per {{CFMetaData}}) keeps a 
“singleton” {{IndexInfo.Serializer}} instance for {{BigFormat.latestVersion}} 
and constructs and keeps instances for other versions. For “shallow” RIEs we 
need an instance of {{IndexInfo.Serializer}} to read {{IndexInfo}} from disk - 
a “singleton” further reduces the number of objects on heap. To be clear, we 
create(d) a lot of these instances (roughly one per {{IndexInfo}} 
instance/operation). We could further reduce the number of 
{{IndexSerializer}} instances in the future, but that did not seem necessary 
for this ticket.
* Merged RIE’s {{IndexSerializer}} interface and {{Serializer}} class (that 
interface had only a single implementation)
* Added methods to {{IndexSerializer}} to handle the special serialization for 
saved key caches
* Added specialized {{deserializePosition}} method to {{IndexSerializer}} as 
some invocations just need the position in the data file.
* Moved the {{IndexInfo}} binary search into the 
{{AbstractSSTableIterator.IndexState}} class (the only place where it’s used)
* Added some more {{skip}} methods in various places. These are required to 
calculate the offsets array for legacy sstable versions.
* Classes {{ColumnIndex}} and {{IndexHelper}} have been removed (their 
functionality moved elsewhere); {{IndexInfo}} is now a top-level class.
* Added some {{Pre_11206_*}} classes to {{RowIndexEntryTest}}; these are 
copies of the previous implementations
* Added new {{PagingQueryTest}} to test paged queries
* Added new {{LargePartitionsTest}} to test/compare various partition sizes (to 
be run explicitly, otherwise ignored)
* Added test methods in {{KeyCacheTest}} and {{KeyCacheCqlTest}} for 
shallow/non-shallow indexed entries.
* Also re-added behavior of CASSANDRA-8180 for {{IndexedEntry}} (but not for 
{{ShallowIndexedEntry}})
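
For the {{IndexInfoRetriever}} / {{IndexState}} items above, here is a standalone search sketch. It is not the patch code - the on-disk layout, types and numbers are invented; it only illustrates the general idea that an offsets table lets a binary search deserialize just the entries it actually compares against.

{code:java}
// Standalone illustration only - invented layout and names, not the patch code.
// Idea: with an offsets table, a binary search touches only log(N) serialized entries.
import java.nio.ByteBuffer;

public class OffsetsTableSearchSketch
{
    public static void main(String[] args)
    {
        long[] firstKeys = { 10, 20, 30, 40, 50, 60, 70, 80 }; // surrogate for the first clustering of each index block

        // "Serialize" the entries and record an offset per entry.
        ByteBuffer entries = ByteBuffer.allocate(firstKeys.length * Long.BYTES);
        int[] offsets = new int[firstKeys.length];
        for (int i = 0; i < firstKeys.length; i++)
        {
            offsets[i] = entries.position();
            entries.putLong(firstKeys[i]);
        }

        System.out.println(indexFor(35, entries, offsets)); // 2 -> the block starting at key 30
    }

    // Deserializes only the entries actually compared during the binary search.
    static int indexFor(long target, ByteBuffer entries, int[] offsets)
    {
        int lo = 0, hi = offsets.length - 1, result = 0;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            long firstKey = entries.getLong(offsets[mid]); // seek via the offsets table, read one entry
            if (firstKey <= target) { result = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return result;
    }
}
{code}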
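And a standalone writer sketch for the index-writing item above. Again, this is not the patch code - the 16-byte “samples” and all names are invented; it only illustrates switching from collecting deserialized samples to keeping just the serialized form once the configured limit is crossed.

{code:java}
// Standalone illustration only - invented names and layout, not the patch code.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class IndexWriteSwitchSketch
{
    static final int LIMIT_BYTES = 2 * 1024; // stand-in for column_index_cache_size_in_kb

    private final ByteArrayOutputStream serialized = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(serialized);
    private List<long[]> onHeapSamples = new ArrayList<>(); // dropped once the limit is crossed

    void add(long offset, long width) throws IOException
    {
        // Always serialize - this is what ends up in the index file anyway.
        out.writeLong(offset);
        out.writeLong(width);

        // Keep deserialized samples only while the serialized payload is small enough.
        if (onHeapSamples != null)
        {
            if (serialized.size() <= LIMIT_BYTES)
                onHeapSamples.add(new long[]{ offset, width });
            else
                onHeapSamples = null; // over the limit: keep only the serialized form ("shallow")
        }
    }

    boolean isShallow() { return onHeapSamples == null; }

    public static void main(String[] args) throws IOException
    {
        IndexWriteSwitchSketch writer = new IndexWriteSwitchSketch();
        for (int i = 0; i < 200; i++)
            writer.add(i * 65536L, 65536L); // 200 * 16 bytes = 3200 bytes > 2 kB
        System.out.println(writer.isShallow()); // true
    }
}
{code}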


> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>         Attachments: 11206-gc.png, trunk-gc.png
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which both creates a lot of GC 
> overhead (as noted in CASSANDRA-9754) and causes non-negligible i/o activity 
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.


