[ https://issues.apache.org/jira/browse/CASSANDRA-19365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881390#comment-17881390 ]
Maxim Muzafarov edited comment on CASSANDRA-19365 at 9/12/24 8:20 PM:
----------------------------------------------------------------------

[https://github.com/apache/cassandra/pull/3543/files]

Changes are ready for review. I've added benchmarks and improved the consistency so we won't lose any updates as previously mentioned. The corresponding Javadoc has also been updated to reflect that no locking is used for the {{decayingBuckets}} reset. I'll prepare CI shortly.

Benchmarks:
{code:java}
cassandra-19365
Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14995,732 ±  815,913  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  14290,593 ±  669,975  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14648,427 ±  800,957  ops/ms

trunk
Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14236,466 ± 1203,280  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  13746,524 ± 1908,030  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14048,394 ±  676,323  ops/ms
{code}

was (Author: mmuzaf):
[https://github.com/apache/cassandra/pull/3543/files]

Changes are ready for review. I've added benchmarks and improved the consistency so we won't lose any updates as previously mentioned. The corresponding Javadoc has also been updated to reflect that no locking is used for the {{decayingBuckets}} reset. I'll prepare CI shortly.

> This lets us keep updates non-synchronized at the price of letting some
> updates be missed during rescale.

that's no longer relevant

Benchmarks:
{code:java}
cassandra-19365
Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14995,732 ±  815,913  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  14290,593 ±  669,975  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14648,427 ±  800,957  ops/ms

trunk
Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14236,466 ± 1203,280  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  13746,524 ± 1908,030  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14048,394 ±  676,323  ops/ms
{code}

> invalid EstimatedHistogramReservoirSnapshot::getValue values due to race condition in DecayingEstimatedHistogramReservoir
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19365
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19365
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Jakub Zytka
>            Assignee: Maxim Muzafarov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> `DecayingEstimatedHistogramReservoir` has a race condition between `update`
> and `rescaleIfNeeded`.
> A sample which ends up (`update`) in an already scaled decayingBucket
> (`rescaleIfNeeded`) may still use a non-scaled weight because `decayLandmark`
> has not been updated yet at the moment of `update`.
>
> The observed consequence was flooding of the cluster with speculative retries
> (we happened to hit low-percentile buckets with overweight samples, which
> drove p99 below true p50 for a long time).
> Please note that despite the manifestation being similar to CASSANDRA-19330,
> these are two distinct bugs in their own right.
> This bug affects versions 4.0+
> On 3.11 there's locking in DEHR.
> I did not check earlier versions.
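
The interleaving described in the issue can be replayed sequentially to show why a racing sample ends up overweight. The following is only a minimal sketch under stated assumptions, not the Cassandra implementation: the class name, the `replay` helper, the single-bucket model, and the 60-second mean lifetime are all hypothetical choices made for illustration. It assumes an exponential forward-decay weight measured from a landmark, and a rescale that first divides the bucket by the rescale factor and only afterwards moves the landmark.

```java
// Illustrative model only; names and constants are assumptions, not Cassandra code.
class ForwardDecayRaceSketch {
    // Assumed mean lifetime of a sample, in seconds (hypothetical value).
    static final double MEAN_LIFETIME_S = 60.0;

    /** Forward-decay weight of a sample taken at nowMs, relative to landmarkMs. */
    static double weight(long nowMs, long landmarkMs) {
        return Math.exp(((nowMs - landmarkMs) / 1000.0) / MEAN_LIFETIME_S);
    }

    /**
     * Replays one unlucky interleaving in a single thread.
     * Returns {bucket value with the race, bucket value without it}.
     */
    static double[] replay(long elapsedMs) {
        long landmark = 0L;
        double bucket = 1.0; // one earlier sample recorded at the landmark

        // Rescaler, step 1: the bucket is divided by the rescale factor.
        double rescaleFactor = weight(elapsedMs, landmark);
        bucket /= rescaleFactor;

        // Updater runs here: the landmark is still the old one, so the weight
        // is large, but the bucket it is added to has already been scaled down.
        double staleWeight = weight(elapsedMs, landmark);
        double racyBucket = bucket + staleWeight;

        // Rescaler, step 2: only now does the landmark move forward.
        landmark = elapsedMs;

        // Contribution of the same sample had it arrived after step 2.
        double correctBucket = bucket + weight(elapsedMs, landmark);
        return new double[] { racyBucket, correctBucket };
    }

    public static void main(String[] args) {
        double[] r = replay(30 * 60 * 1000L); // hypothetical 30-minute rescale interval
        System.out.printf("racy=%.3e correct=%.3e ratio=%.3e%n", r[0], r[1], r[0] / r[1]);
    }
}
```

In this model the racing sample is heavier than it should be by the full rescale factor, which grows exponentially with the time since the last landmark. That is consistent with the report that a few overweight samples in low-percentile buckets could pull p99 below the true p50.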