[
https://issues.apache.org/jira/browse/HBASE-29645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043527#comment-18043527
]
Hudson commented on HBASE-29645:
--------------------------------
Results for branch branch-2.6
[build #394 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/394/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/394/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- Something went wrong running this stage, please [check relevant console
output|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/394//console].
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/394/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- Something went wrong running this stage, please [check relevant console
output|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/394//console].
(x) {color:red}-1 jdk17 hadoop3 checks{color}
-- Something went wrong running this stage, please [check relevant console
output|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/394//console].
> AsyncBufferedMutatorImpl concurrency improvement
> ------------------------------------------------
>
> Key: HBASE-29645
> URL: https://issues.apache.org/jira/browse/HBASE-29645
> Project: HBase
> Issue Type: Improvement
> Components: Client
> Affects Versions: 3.0.0-beta-1, 2.6.3, 2.5.12
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5, 2.5.14
>
> Attachments: AsyncBufferedMutatorBenchmark.java
>
>
> This patch modifies the AsyncBufferedMutatorImpl class to improve its
> performance under concurrent usage.
> While AsyncTable#batch() is largely asynchronous in nature, it can exhibit
> blocking behavior during its preparation phase, for instance, while looking
> up region locations. In the original implementation of
> AsyncBufferedMutatorImpl, calls to AsyncTable#batch() occur within a
> synchronized block, potentially causing severe contention and stalling other
> threads trying to buffer their mutations. The original implementation relied
> on coarse-grained synchronized blocks for thread safety, so when one
> thread triggered a buffer flush (either because the buffer was full or a
> periodic timer fired), all other threads attempting to add mutations via the
> mutate method would be blocked until the table.batch() call completed, which
> could take a surprisingly long time.
> The new implementation replaces the broad synchronized blocks with a
> ReentrantLock. This lock is acquired only for the brief period needed to
> safely copy the current batch of mutations and futures into local variables
> and swap in a new internal buffer. Immediately after this quick operation,
> the lock is released. The batch() call is then executed outside of the locked
> section. This allows other threads to continue adding new mutations
> concurrently while the flushing of the previous batch proceeds independently.
> By definition, the client has already opted in to asynchronous and
> potentially interleaved commit of the mutations submitted to
> AsyncBufferedMutator. Minimizing the critical-section scope reduces thread
> contention and significantly boosts throughput under load. Other
> profiler-driven efficiency changes are also included, such as elimination of
> Stream API and array-resizing hotspots.
> To validate the performance improvement of these changes, a JMH benchmark,
> AsyncBufferedMutatorBenchmark (attached to this issue), was created to
> measure the performance of the mutate method under various conditions. It
> focuses specifically on the overhead and concurrency management of
> AsyncBufferedMutatorImpl itself, not the underlying network communication. To
> achieve this, it uses the Mockito framework to create a mock AsyncTable that
> instantly returns completed futures, isolating the mutator's buffering logic
> for measurement. It runs tests with 1, 10, and 100 threads to simulate no,
> medium, and high levels of concurrency. It uses a low value (100) for
> maxMutations to force frequent flushes based on the number of mutations, and
> a very high value (100,000) to ensure flushes are rare in that measurement
> case. The benchmark measures the average time per operation in microseconds,
> where a lower score indicates better performance and higher throughput.
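The isolation technique the benchmark uses, a table whose batch() instantly returns completed futures, can be illustrated without Mockito as a plain stub. The interface and class names below are hypothetical; they only model the behavior the benchmark's mock provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical AsyncTable-like interface, reduced to the one method the
// benchmark exercises.
interface BatchTarget<T> {
  List<CompletableFuture<Void>> batch(List<T> mutations);
}

// Stand-in for the Mockito mock: returns already-completed futures with no
// I/O and no latency, so a measurement against it isolates the mutator's
// buffering and locking overhead from network cost.
class InstantBatchTarget<T> implements BatchTarget<T> {
  @Override
  public List<CompletableFuture<Void>> batch(List<T> mutations) {
    List<CompletableFuture<Void>> out = new ArrayList<>(mutations.size());
    for (int i = 0; i < mutations.size(); i++) {
      out.add(CompletableFuture.completedFuture(null));
    }
    return out;
  }
}
```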
> With a single thread and no contention the performance of both
> implementations is nearly identical. The minor variations are negligible and
> show that the new locking mechanism does not introduce any performance
> regression in the non-concurrent case. For example, with a 10MB buffer and
> high maxMutations, the NEW implementation scored 0.167 µs/op while the OLD
> scored 0.169 µs/op, a statistically insignificant difference. When the test
> is run with 10 threads, a noticeable gap appears. In the scenario designed to
> cause frequent flushes (maxMutations = 100), the NEW implementation is
> approximately 12 times faster than the OLD one (14.250 µs/op for NEW vs.
> 172.463 µs/op for OLD). This is because the OLD implementation forces threads
> to wait while flushes occur, and flushes incur a synthetic thread sleep of
> 1ms to simulate occasional unexpected blocking behavior in
> AsyncTable#batch(), whereas the NEW implementation allows them to proceed
> without contention. The most significant results come from the 100-thread
> tests, which simulate high contention. In the frequent flush scenario
> (maxMutations = 100) the NEW implementation is 114 times faster in the
> synthetic benchmark scenario (16.123 µs/op for NEW vs. 1847.567 µs/op for
> OLD). Note that blocking I/O observed in a real client, e.g. for region
> location lookups, can produce a much more significant impact. With the OLD
> code, 100 threads are constantly competing for a lock that is held for a long
> duration, leading to a contention storm. The NEW code's reduced locking scope
> almost entirely eliminates this bottleneck.
> Hardware: Apple Silicon (aarch64) M1 Max / 64 GB
> JVM: openjdk version "17.0.11" 2024-04-16 LTS / OpenJDK 64-Bit Server VM
> Zulu17.50+19-CA (build 17.0.11+9-LTS, mixed mode, sharing)
> |Threads|Max Mutations|OLD Implementation (µs/op)|NEW Implementation (µs/op)|Performance Gain (OLD / NEW)|
> |1|100|14.091|16.313|0.86x (comparable)|
> |1|100,000|0.169|0.167|1.01x (comparable)|
> |10|100|172.463|14.250|12.10x|
> |10|100,000|2.465|1.072|2.30x|
> |100|100|1847.567|16.123|114.59x|
> |100|100,000|24.125|12.796|1.89x|
--
This message was sent by Atlassian Jira
(v8.20.10#820010)