Jaehui Lee created HBASE-29665:
----------------------------------
Summary: Bidirectional bulkload replication causes excessive
network traffic
Key: HBASE-29665
URL: https://issues.apache.org/jira/browse/HBASE-29665
Project: HBase
Issue Type: Bug
Reporter: Jaehui Lee
Assignee: Jaehui Lee
Attachments: image-2025-10-16-21-59-13-156.png
h2. Problem
When performing a bulkload on one of two clusters configured with bidirectional
replication, the cluster executing the bulk load experiences unexpectedly high
network usage.
h2. Root Cause
HBASE-22380 prevented circular bulkload replication by having
{{SecureBulkloadManager}} check whether the current cluster ID already exists in
{{clusterIds}}. If it is present, the manager assumes replication has already
occurred and stops further processing.
However, {{SecureBulkloadManager}} is invoked by the {{LoadIncrementalHFiles}}
tool, and that tool copies the target HFiles to a staging directory in the local
HDFS _before_ the check on whether replication should proceed. This premature
copying causes unnecessary network and disk usage.
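The ordering problem can be illustrated with a small, self-contained Java model. This is only a sketch: {{BulkloadOrderingSketch}}, {{copyToStaging}}, and the byte counter are hypothetical names, not HBase APIs, and the cluster ID check merely mimics the behavior described above.

```java
import java.util.List;

// Hypothetical model of the ordering issue; none of these names are HBase APIs.
class BulkloadOrderingSketch {
    // Total bytes "copied" to staging, standing in for network and disk usage.
    static long bytesCopied = 0;

    // Pretend to copy an HFile of the given size into the local staging directory.
    static void copyToStaging(long hfileSize) {
        bytesCopied += hfileSize;
    }

    // Mimics the check described above: the load is skipped when the local
    // cluster ID is already present in the clusterIds carried by the request.
    static boolean alreadyReplicated(List<String> clusterIds, String localClusterId) {
        return clusterIds.contains(localClusterId);
    }

    // Order described in the root cause: copy first, check afterwards.
    // Returns true if the bulkload would proceed.
    static boolean copyThenCheck(List<String> clusterIds, String localClusterId, long hfileSize) {
        copyToStaging(hfileSize);
        return !alreadyReplicated(clusterIds, localClusterId);
    }

    // Alternative order: check first, copy only when the load will proceed.
    static boolean checkThenCopy(List<String> clusterIds, String localClusterId, long hfileSize) {
        if (alreadyReplicated(clusterIds, localClusterId)) {
            return false;
        }
        copyToStaging(hfileSize);
        return true;
    }

    public static void main(String[] args) {
        // A bulkload that originated on cluster A comes back to cluster A.
        List<String> clusterIds = List.of("cluster-A");

        copyThenCheck(clusterIds, "cluster-A", 1_000_000L);
        System.out.println("copy-then-check bytes copied: " + bytesCopied);

        bytesCopied = 0;
        checkThenCopy(clusterIds, "cluster-A", 1_000_000L);
        System.out.println("check-then-copy bytes copied: " + bytesCopied);
    }
}
```

In this model both orderings reject the looped-back bulkload, but only the copy-first ordering pays for the HFile transfer before rejecting it.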
h2. Solution
Unlike the {{clusterIds}} used in regular mutation replication (which are included
in {{WALKey}}), the {{clusterIds}} for bulkload replication are managed in
a separate class called {{BulkloadDescriptor}}. As a result, they are not
filtered by {{ClusterMarkingEntryFilter}}, and the filtering logic runs only
after the bulkload request is received.
The solution is to include {{clusterIds}} in the {{WALKey}} for bulkload
operations, just like regular mutations. This allows filtering to occur before
the bulkload request is processed, preventing unnecessary file copying.
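The effect of moving the cluster IDs into the key can be sketched with a simplified model of source-side filtering. This is not the real {{ClusterMarkingEntryFilter}} or {{WALKey}}: {{WalKeyFilterSketch}}, {{Entry}}, and {{filterForPeer}} are hypothetical names, and the sketch assumes the filter drops any entry whose key-level cluster IDs already contain the peer's cluster ID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of source-side filtering; not the real HBase classes.
class WalKeyFilterSketch {
    // Simplified WAL entry: only the cluster IDs recorded in its key.
    static final class Entry {
        final List<String> keyClusterIds;

        Entry(List<String> keyClusterIds) {
            this.keyClusterIds = keyClusterIds;
        }
    }

    // Keep only entries whose key does not already list the peer cluster,
    // so entries that originated on the peer are never shipped back to it.
    static List<Entry> filterForPeer(List<Entry> entries, String peerClusterId) {
        List<Entry> shipped = new ArrayList<>();
        for (Entry entry : entries) {
            if (!entry.keyClusterIds.contains(peerClusterId)) {
                shipped.add(entry);
            }
        }
        return shipped;
    }

    public static void main(String[] args) {
        // Viewpoint: cluster B replicating to peer cluster A.
        // Without the change, the bulkload entry's key carries no cluster IDs.
        Entry keyWithoutIds = new Entry(List.of());
        // With the change, the key records that the bulkload came from cluster A.
        Entry keyWithIds = new Entry(List.of("cluster-A"));

        System.out.println("shipped without key IDs: "
                + filterForPeer(List.of(keyWithoutIds), "cluster-A").size());
        System.out.println("shipped with key IDs: "
                + filterForPeer(List.of(keyWithIds), "cluster-A").size());
    }
}
```

Under these assumptions, the entry with cluster IDs in its key is dropped before any bulkload request is sent back to the originating cluster, so no HFiles are copied there.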
h2. Test
Setup
* Two clusters (Cluster A and Cluster B) running HBase 2.6.3
* HBase and HDFS clusters are separated (compute-storage separation
architecture)
* Bulk load replication and bidirectional replication enabled
* Bulk load executed on Cluster A only
!image-2025-10-16-21-59-13-156.png|width=517,height=440!
Since the bulkload is executed only on Cluster A in both bidirectional and
one-way replication scenarios, resource usage should be identical between
scenarios 1 and 2. However, as shown in the metrics above, scenario 1 consumes
significantly more resources. This is due to the unnecessary copying of HFiles
to the staging directory, as explained in the root cause section.
After applying the patch, scenario 3 shows resource usage identical to scenario
2, confirming that the unnecessary file copying has been eliminated.