[ https://issues.apache.org/jira/browse/CASSANDRA-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377309#comment-17377309 ]
Benjamin Lerer commented on CASSANDRA-14415:
--------------------------------------------

Thanks a lot for the patch [~sklock]. Sorry it took so long to get it committed.

> Performance regression in queries for distinct keys
> ---------------------------------------------------
>
>                 Key: CASSANDRA-14415
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14415
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Samuel Klock
>            Assignee: Samuel Klock
>            Priority: Normal
>              Labels: performance
>             Fix For: 3.11.11, 4.0.1
>
>
> Running Cassandra 3.0.16, we observed a major performance regression
> affecting {{SELECT DISTINCT keys}}-style queries against certain tables.
> Based on some investigation (guided by helpful feedback from Benjamin on
> the dev list), we tracked the regression down to two problems.
> * One is that Cassandra was reading more data from disk than was necessary
> to satisfy the query. This was fixed under CASSANDRA-10657 in a later 3.x
> release.
> * The other, which remains even once the fix for CASSANDRA-10657 is
> incorporated, is this code snippet in {{RebufferingInputStream}}:
> {code:java}
>     @Override
>     public int skipBytes(int n) throws IOException
>     {
>         if (n < 0)
>             return 0;
>         int requested = n;
>         int position = buffer.position(), limit = buffer.limit(), remaining;
>         while ((remaining = limit - position) < n)
>         {
>             n -= remaining;
>             buffer.position(limit);
>             reBuffer(); // refills the buffer from the underlying source
>             position = buffer.position();
>             limit = buffer.limit();
>             if (position == limit)
>                 return requested - n;
>         }
>         buffer.position(position + n);
>         return requested;
>     }
> {code}
> The gist of it is that to skip bytes, the stream needs to read those bytes
> into memory and then throw them away. In our tests, we were spending a lot
> of time in this method, so it looked like the chief drag on performance.
>
> We noticed that the subclass of {{RebufferingInputStream}} in use for our
> queries, {{RandomAccessReader}} (over compressed sstables), implements a
> {{seek()}} method. Overriding {{skipBytes()}} in it to use {{seek()}}
> instead was sufficient to fix the performance regression (a sketch of the
> idea follows below).
>
> The performance difference is significant for tables with large values.
> It's straightforward to evaluate with very simple key-value tables, e.g.:
> {{CREATE TABLE testtable (key TEXT PRIMARY KEY, value BLOB);}}
> We did some basic experimentation with the following variations (all in a
> single-node 3.11.2 cluster with off-the-shelf settings running on a dev
> workstation):
> * small values (1 KB, 100,000 entries), somewhat larger values (25 KB,
> 10,000 entries), and much larger values (1 MB, 10,000 entries);
> * compressible data (a single byte repeated) and uncompressible data
> (output from {{openssl rand $bytes}}); and
> * with and without sstable compression. (With compression, we use
> Cassandra's defaults.)
> The difference is most conspicuous for tables with large, uncompressible
> data and sstable decompression (which happens to describe the use case
> that triggered our investigation). It is smaller but still readily
> apparent for tables with effective compression. For uncompressible data
> without compression enabled, there is no appreciable difference.
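> For illustration, the override amounts to something like the following
> minimal sketch (this is not the committed patch; it assumes
> {{RandomAccessReader}}'s existing {{seek()}}, {{getPosition()}}, and
> {{length()}} methods):
> {code:java}
>     // Illustrative sketch only -- not the committed patch.
>     @Override
>     public int skipBytes(int n) throws IOException
>     {
>         if (n <= 0)
>             return 0;
>         long current = getPosition();
>         // Clamp to the end of the file, then move the reader's position
>         // directly instead of reading the skipped bytes into the buffer.
>         long target = Math.min(current + n, length());
>         seek(target);
>         return (int) (target - current);
>     }
> {code}
> With sstable compression enabled, seeking this way should also let the
> reader avoid decompressing chunks that fall entirely within the skipped
> range, rather than decompressing them just to discard the output.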
> Here's what the performance looks like without our patch for the 1-MB
> entries (times in seconds, five consecutive runs for each data set, all
> exhausting the results from a {{SELECT DISTINCT key FROM ...}} query with
> a page size of 24):
> {noformat}
> working on compressible
> 5.21180510521
> 5.10270500183
> 5.22311806679
> 4.6732840538
> 4.84219098091
> working on uncompressible_uncompressed
> 55.0423607826
> 0.769015073776
> 0.850513935089
> 0.713396072388
> 0.62596988678
> working on uncompressible
> 413.292617083
> 231.345913887
> 449.524993896
> 425.135111094
> 243.469946861
> {noformat}
> and with the fix:
> {noformat}
> working on compressible
> 2.86733293533
> 1.24895811081
> 1.108907938
> 1.12742400169
> 1.04647302628
> working on uncompressible_uncompressed
> 56.4146180153
> 0.895509958267
> 0.922824144363
> 0.772884130478
> 0.731923818588
> working on uncompressible
> 64.4587619305
> 1.81325793266
> 1.52577018738
> 1.41769099236
> 1.60442209244
> {noformat}
> The long initial runs for the uncompressible data presumably come from
> repeatedly hitting the disk. In contrast to the runs without the fix, the
> initial runs seem to be effective at warming the page cache (since lots
> of data is skipped, the data that is actually read can fit in memory), so
> subsequent runs are faster.
>
> For smaller data sets, {{RandomAccessReader.seek()}} and
> {{RebufferingInputStream.skipBytes()}} are approximately equivalent in
> their behavior (reducing to changing the position pointer of an in-memory
> buffer most of the time), so there isn't much difference. Here's before
> the fix for the 1-KB entries:
> {noformat}
> working on small_compressible
> 8.34115099907
> 8.57280993462
> 8.3534219265
> 8.55130696297
> 8.17362189293
> working on small_uncompressible_uncompressed
> 7.85155582428
> 7.54075288773
> 7.50106596947
> 7.39202189445
> 7.95735621452
> working on small_uncompressible
> 7.89256501198
> 7.88875198364
> 7.9013261795
> 7.76551413536
> 7.84927678108
> {noformat}
> and after:
> {noformat}
> working on small_compressible
> 8.29225707054
> 7.57822394371
> 8.10092878342
> 8.21332192421
> 8.19347810745
> working on small_uncompressible_uncompressed
> 7.74823594093
> 7.81218004227
> 7.68660092354
> 7.95432114601
> 7.77612304688
> working on small_uncompressible
> 8.18260502815
> 8.21010804176
> 8.1233921051
> 7.31543707848
> 7.91079998016
> {noformat}
> The effect is similar for the 25-KB entries, which might enjoy a slight
> performance benefit from the patch (perhaps because they're larger than
> the default buffer size defined in {{RandomAccessReader}}). Before:
> {noformat}
> working on medium_compressible
> 0.988080978394
> 1.02464294434
> 0.977658033371
> 1.02553391457
> 0.769363880157
> working on medium_uncompressible_uncompressed
> 1.07718396187
> 1.08547902107
> 1.12398791313
> 1.10300898552
> 1.08757281303
> working on medium_uncompressible
> 0.940990209579
> 0.917474985123
> 0.768013954163
> 0.871683835983
> 0.814841985703
> {noformat}
> and after:
> {noformat}
> working on medium_compressible
> 0.829009056091
> 0.705173015594
> 0.603646993637
> 0.820069074631
> 0.873830080032
> working on medium_uncompressible_uncompressed
> 0.785156965256
> 0.808106184006
> 0.848286151886
> 0.857885837555
> 0.825689077377
> working on medium_uncompressible
> 0.845101118088
> 0.913790941238
> 0.824147939682
> 0.849114894867
> 0.85981798172
> {noformat}
> In short, this looks like a pretty straightforward performance win with
> negligible cost.
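> (For concreteness, the timings above came from a simple client-side loop
> that pages through the full result set with a fetch size of 24. A rough
> equivalent using the DataStax Java driver -- the contact point, keyspace
> name, and choice of driver are assumptions here, not details of our
> actual harness -- might look like:)
> {code:java}
> import com.datastax.driver.core.*;
>
> public class DistinctBench
> {
>     public static void main(String[] args)
>     {
>         // "127.0.0.1" and "testks" are hypothetical connection details.
>         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>              Session session = cluster.connect("testks"))
>         {
>             Statement query = new SimpleStatement("SELECT DISTINCT key FROM testtable")
>                               .setFetchSize(24);  // the page size used in the tests above
>             long start = System.nanoTime();
>             int rows = 0;
>             for (Row ignored : session.execute(query))
>                 rows++;                           // iterating exhausts all pages
>             System.out.printf("%d keys in %.3f s%n",
>                               rows, (System.nanoTime() - start) / 1e9);
>         }
>     }
> }
> {code}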
> (It's worth noting that for our use case, disabling sstable compression is
> clearly the _best_ solution, but there's still reasonably clear benefit
> from this minor fix for data sets with larger, compressible values, as
> well as, presumably, for data sets with a mix of compressible and
> uncompressible values in environments where storage is limited.)