[ https://issues.apache.org/jira/browse/CASSANDRA-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474190#comment-16474190 ]

Samuel Klock commented on CASSANDRA-14415:
------------------------------------------

Thanks for taking a look, [~KurtG]. Adding a test is certainly reasonable 
feedback; we'll make that change soon.
{quote}In the case that n < 0, delegating back to {{RIS.skipBytes()}} will 
return the same thing, and in the case that buffer == null, {{RIS.skipBytes()}} 
will NPE. If the extra check is necessary seems to me it should apply to both 
methods, but all you can do is return 0. Might be better to just leave the 
extra check out...?
{quote}
The intent of this check is to conservatively preserve the existing behavior 
for {{n}} < 0 or a null {{buffer}}. Regarding the NPE behavior in particular: 
having a {{skipBytes()}} implementation throw an NPE is at least surprising 
(and possibly a bug), but it's not straightforward to tell whether any logic 
in Cassandra depends on it, so we elected not to change that behavior. 
Delegating to the superclass seemed like a good way to keep the behavior of 
{{skipBytes()}} consistent in case the superclass implementation ever changes.

That said, we're happy not to delegate to the superclass (or to add an 
explanatory comment) if that'd be preferable. For an implementation here:
 * If {{n}} < 0, the obvious choice is to return 0 without mutating the 
reader's state.
 * If {{buffer}} is null and we just fall through to {{seek()}}, we'll throw 
an {{IllegalStateException}}, which seems undesirable. The contract for 
{{skipBytes()}} suggests returning 0 would be a reasonable choice, although it 
might be better to signal a problem to the caller by throwing an 
{{IOException}}. That would be a fairly disruptive change in behavior, however.
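To make the two bullets concrete, here's a minimal, self-contained sketch of the guard behavior being discussed. The class name, fields, and the {{seek()}} stand-in are hypothetical, not Cassandra's actual {{RandomAccessReader}}; the point is only that a negative {{n}} and a null {{buffer}} both return 0 rather than falling through to {{seek()}}:

```java
// Hypothetical stand-in for a seekable reader (not Cassandra's class).
class GuardedSkipSketch
{
    private byte[] buffer;   // becomes null once the reader is closed (assumption)
    private long position;

    GuardedSkipSketch(int size) { buffer = new byte[size]; }

    void close() { buffer = null; }

    private void seek(long pos)
    {
        if (buffer == null)
            throw new IllegalStateException("Attempted to seek in a closed reader");
        position = Math.min(pos, buffer.length);
    }

    int skipBytes(int n)
    {
        if (n < 0)
            return 0;        // negative request: skip nothing, don't move
        if (buffer == null)
            return 0;        // closed reader: returning 0 follows the skipBytes()
                             // contract; throwing IOException is the noisier
                             // alternative discussed above
        long target = Math.min(position + n, buffer.length);
        int skipped = (int) (target - position);
        seek(target);        // cheap position update, no read-and-discard
        return skipped;
    }

    long position() { return position; }

    public static void main(String[] args)
    {
        GuardedSkipSketch g = new GuardedSkipSketch(10);
        System.out.println(g.skipBytes(4));   // 4
        System.out.println(g.skipBytes(-5));  // 0: negative n is a no-op
        g.close();
        System.out.println(g.skipBytes(3));   // 0: closed reader, no NPE/ISE
    }
}
```

Swapping the {{return 0}} in the null-buffer branch for a thrown {{IOException}} would implement the noisier alternative; the rest of the method is unaffected.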

> Performance regression in queries for distinct keys
> ---------------------------------------------------
>
>                 Key: CASSANDRA-14415
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14415
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Samuel Klock
>            Assignee: Samuel Klock
>            Priority: Major
>              Labels: performance
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Running Cassandra 3.0.16, we observed a major performance regression 
> affecting {{SELECT DISTINCT keys}}-style queries against certain tables.  
> Based on some investigation (guided by helpful feedback from Benjamin on 
> the dev list), we tracked the regression down to two problems.
>  * One is that Cassandra was reading more data from disk than was necessary 
> to satisfy the query.  This was fixed under CASSANDRA-10657 in a later 3.x 
> release.
>  * If the fix for CASSANDRA-10657 is incorporated, the other is this code 
> snippet in {{RebufferingInputStream}}:
> {code:java}
>     @Override
>     public int skipBytes(int n) throws IOException
>     {
>         if (n < 0)
>             return 0;
>         int requested = n;
>         int position = buffer.position(), limit = buffer.limit(), remaining;
>         while ((remaining = limit - position) < n)
>         {
>             n -= remaining;
>             buffer.position(limit);
>             reBuffer();
>             position = buffer.position();
>             limit = buffer.limit();
>             if (position == limit)
>                 return requested - n;
>         }
>         buffer.position(position + n);
>         return requested;
>     }
> {code}
> The gist of it is that to skip bytes, the stream needs to read those bytes 
> into memory and then throw them away.  In our tests, we were spending a lot 
> of time in this method, so it looked like the chief drag on performance.
> We noticed that the subclass of {{RebufferingInputStream}} in use for our 
> queries, {{RandomAccessReader}} (over compressed sstables), implements a 
> {{seek()}} method.  Overriding {{skipBytes()}} in it to use {{seek()}} 
> instead was sufficient to fix the performance regression.
> The performance difference is significant for tables with large values.  It's 
> straightforward to evaluate with very simple key-value tables, e.g.:
> {{CREATE TABLE testtable (key TEXT PRIMARY KEY, value BLOB);}}
> We did some basic experimentation with the following variations (all in a 
> single-node 3.11.2 cluster with off-the-shelf settings running on a dev 
> workstation):
>  * small values (1 KB, 100,000 entries), somewhat larger values (25 KB, 
> 10,000 entries), and much larger values (1 MB, 10,000 entries);
>  * compressible data (a single byte repeated) and uncompressible data (output 
> from {{openssl rand $bytes}}); and
>  * with and without sstable compression.  (With compression, we use 
> Cassandra's defaults.)
> The difference is most conspicuous for tables with large, uncompressible data 
> and sstable compression enabled (which happens to describe the use case that 
> triggered our investigation).  It is smaller but still readily apparent for 
> tables with effective compression.  For uncompressible data without 
> compression enabled, there is no appreciable difference.
> Here's what the performance looks like without our patch for the 1-MB entries 
> (times in seconds, five consecutive runs for each data set, all exhausting 
> the results from a {{SELECT DISTINCT key FROM ...}} query with a page size of 
> 24):
> {noformat}
> working on compressible
> 5.21180510521
> 5.10270500183
> 5.22311806679
> 4.6732840538
> 4.84219098091
> working on uncompressible_uncompressed
> 55.0423607826
> 0.769015073776
> 0.850513935089
> 0.713396072388
> 0.62596988678
> working on uncompressible
> 413.292617083
> 231.345913887
> 449.524993896
> 425.135111094
> 243.469946861
> {noformat}
> and with the fix:
> {noformat}
> working on compressible
> 2.86733293533
> 1.24895811081
> 1.108907938
> 1.12742400169
> 1.04647302628
> working on uncompressible_uncompressed
> 56.4146180153
> 0.895509958267
> 0.922824144363
> 0.772884130478
> 0.731923818588
> working on uncompressible
> 64.4587619305
> 1.81325793266
> 1.52577018738
> 1.41769099236
> 1.60442209244
> {noformat}
> The long initial runs for the uncompressible data presumably come from 
> repeatedly hitting the disk.  In contrast to the runs without the fix, the 
> initial runs seem to be effective at warming the page cache (as lots of data 
> is skipped, so the data that's read can fit in memory), so subsequent runs 
> are faster.
> For smaller data sets, {{RandomAccessReader.seek()}} and 
> {{RebufferingInputStream.skipBytes()}} are approximately equivalent in their 
> behavior (reducing to changing the position pointer of an in-memory buffer 
> most of the time), so there isn't much difference.  Here's before the fix for 
> the 1-KB entries:
> {noformat}
> working on small_compressible
> 8.34115099907
> 8.57280993462
> 8.3534219265
> 8.55130696297
> 8.17362189293
> working on small_uncompressible_uncompressed
> 7.85155582428
> 7.54075288773
> 7.50106596947
> 7.39202189445
> 7.95735621452
> working on small_uncompressible
> 7.89256501198
> 7.88875198364
> 7.9013261795
> 7.76551413536
> 7.84927678108
> {noformat}
> and after:
> {noformat}
> working on small_compressible
> 8.29225707054
> 7.57822394371
> 8.10092878342
> 8.21332192421
> 8.19347810745
> working on small_uncompressible_uncompressed
> 7.74823594093
> 7.81218004227
> 7.68660092354
> 7.95432114601
> 7.77612304688
> working on small_uncompressible
> 8.18260502815
> 8.21010804176
> 8.1233921051
> 7.31543707848
> 7.91079998016
> {noformat}
> The effect is similar for the 25-KB entries, which might enjoy a slight 
> performance benefit from the patch (perhaps because they're larger than the 
> default buffer size defined in {{RandomAccessReader}}).  Before:
> {noformat}
> working on medium_compressible
> 0.988080978394
> 1.02464294434
> 0.977658033371
> 1.02553391457
> 0.769363880157
> working on medium_uncompressible_uncompressed
> 1.07718396187
> 1.08547902107
> 1.12398791313
> 1.10300898552
> 1.08757281303
> working on medium_uncompressible
> 0.940990209579
> 0.917474985123
> 0.768013954163
> 0.871683835983
> 0.814841985703
> {noformat}
> and after:
> {noformat}
> working on medium_compressible
> 0.829009056091
> 0.705173015594
> 0.603646993637
> 0.820069074631
> 0.873830080032
> working on medium_uncompressible_uncompressed
> 0.785156965256
> 0.808106184006
> 0.848286151886
> 0.857885837555
> 0.825689077377
> working on medium_uncompressible
> 0.845101118088
> 0.913790941238
> 0.824147939682
> 0.849114894867
> 0.85981798172
> {noformat}
> In short, this looks like a pretty straightforward performance win with 
> negligible cost.  (It's worth noting that for our use case, disabling sstable 
> compression is clearly the _best_ solution, but there's still a reasonably 
> clear benefit from this minor fix for data sets with larger, compressible 
> values, as well as, presumably, for data sets with a mix of compressible and 
> uncompressible values in environments where storage is limited.)
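The read-and-discard cost the description attributes to {{RebufferingInputStream.skipBytes()}} can be illustrated with a toy model. The classes below are hypothetical stand-ins, not Cassandra's, and the 4 KB buffer size is an assumption; the point is that a rebuffering skip pulls every skipped byte through its buffer, while a seek-based skip only moves the position:

```java
// Toy model (hypothetical classes) of the two skip strategies.
class SkipCostSketch
{
    static final int BUFFER_SIZE = 4096;   // assumed buffer size for the sketch

    // Rebuffering reader: skipping pulls the skipped bytes through the buffer.
    static class RebufferingReader
    {
        final long fileLength;
        long position;
        long bytesReadFromDisk;   // instrumentation: bytes read just to be discarded

        RebufferingReader(long fileLength) { this.fileLength = fileLength; }

        int skipBytes(int n)
        {
            if (n < 0)
                return 0;
            int requested = n;
            while (n > 0 && position < fileLength)
            {
                int chunk = (int) Math.min(Math.min(n, BUFFER_SIZE), fileLength - position);
                bytesReadFromDisk += chunk;   // reBuffer(): read, then throw away
                position += chunk;
                n -= chunk;
            }
            return requested - n;
        }
    }

    // Seek-based reader: skipping is a position update; nothing is read.
    static class SeekingReader
    {
        final long fileLength;
        long position;
        long bytesReadFromDisk;

        SeekingReader(long fileLength) { this.fileLength = fileLength; }

        int skipBytes(int n)
        {
            if (n < 0)
                return 0;
            long target = Math.min(position + n, fileLength);
            int skipped = (int) (target - position);
            position = target;    // seek(): no I/O at all
            return skipped;
        }
    }

    public static void main(String[] args)
    {
        long value = 1 << 20;     // skip one 1-MB value, as in the benchmarks above
        RebufferingReader r = new RebufferingReader(10L << 20);
        SeekingReader s = new SeekingReader(10L << 20);
        r.skipBytes((int) value);
        s.skipBytes((int) value);
        System.out.println("rebuffering read " + r.bytesReadFromDisk + " bytes");  // 1048576
        System.out.println("seek-based read  " + s.bytesReadFromDisk + " bytes");  // 0
    }
}
```

For a table of 1 MB values, each skipped row costs the rebuffering reader a full megabyte of (possibly decompressed) reads, consistent with the large gap in the 1-MB benchmarks above; for values smaller than the buffer, both strategies reduce to moving a position pointer, consistent with the 1-KB results.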



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
