Jaehui Lee created HBASE-29665:
----------------------------------

             Summary: Bidirectional bulkload replication causes excessive 
network traffic
                 Key: HBASE-29665
                 URL: https://issues.apache.org/jira/browse/HBASE-29665
             Project: HBase
          Issue Type: Bug
            Reporter: Jaehui Lee
            Assignee: Jaehui Lee
         Attachments: image-2025-10-16-21-59-13-156.png

h2. Problem

When performing a bulkload on one of two clusters configured with bidirectional 
replication, the cluster executing the bulk load experiences unexpectedly high 
network usage.
h2. Root Cause

HBASE-22380 prevented circular bulkload replication by having 
{{SecureBulkloadManager}} check whether the current clusterId already exists in 
{{clusterIds}}. If it does, the manager assumes replication has already occurred 
and stops further processing.
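
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual HBase source: the class name {{BulkLoadLoopGuard}} and the method {{shouldApply}} are hypothetical, and cluster IDs are modeled as plain strings.

```java
import java.util.List;

// Hypothetical stand-in for the loop guard described above; not HBase code.
final class BulkLoadLoopGuard {
    private BulkLoadLoopGuard() {}

    /**
     * Returns true when the bulk load should be applied locally, that is,
     * when the local cluster ID is not already among the clusters that
     * have processed this bulk load.
     */
    static boolean shouldApply(String localClusterId, List<String> clusterIds) {
        return !clusterIds.contains(localClusterId);
    }
}
```

For a bulk load that originated on cluster A and comes back from cluster B, {{clusterIds}} already contains A, so the guard rejects it on A.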

However, {{SecureBulkloadManager}} is invoked by the {{LoadIncrementalHFiles}} 
tool, which copies the target HFiles to a staging directory in the local HDFS 
_before_ checking whether replication should proceed. This premature copying 
causes unnecessary network and disk usage.
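
The cost of this ordering can be illustrated with a toy model. The sketch below assumes a simplified sink that first copies every HFile to staging and only then runs the clusterId check; the class {{SinkOrderingSketch}}, its fields, and the method {{copyThenCheck}} are hypothetical names, and file copies are reduced to a byte counter.

```java
import java.util.List;

// Toy model of the sink-side ordering described above. All names are hypothetical.
final class SinkOrderingSketch {
    long bytesCopied = 0;  // bytes copied into the staging directory
    int filesLoaded = 0;   // HFiles actually loaded into regions

    /** Copy first, check afterwards: the copy cost is paid even for a rejected load. */
    void copyThenCheck(String localClusterId, List<String> clusterIds, long[] hfileSizes) {
        for (long size : hfileSizes) {
            bytesCopied += size;  // the staging copy happens unconditionally
        }
        if (clusterIds.contains(localClusterId)) {
            return;  // loop detected, but the copy has already been done
        }
        filesLoaded += hfileSizes.length;
    }
}
```

In this model a bulk load that is ultimately rejected still adds the full size of its HFiles to {{bytesCopied}}, which mirrors the wasted network and disk usage reported here.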
h2. Solution

Unlike the {{clusterIds}} used in regular mutation replication (which are included 
in {{WALKey}}), the {{clusterIds}} for bulkload replication are managed in 
a separate class called {{BulkloadDescriptor}}. As a result, they are not 
filtered by {{ClusterMarkingEntryFilter}}, and the filtering logic runs only 
after the bulkload request is received.

The solution is to include {{clusterIds}} in the {{WALKey}} for bulkload 
operations, just like regular mutations. This allows filtering to occur before 
the bulkload request is processed, preventing unnecessary file copying.
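
As a rough sketch of the idea, filtering on the cluster IDs carried in the entry key might look like the following. The types here are simplified, hypothetical stand-ins rather than HBase's WAL entry or {{ClusterMarkingEntryFilter}} classes, and only the "drop entries the peer has already seen" behavior is modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of filtering replication entries by cluster ID.
// Entry and filterForPeer are hypothetical names, not HBase APIs.
final class ClusterIdFilterSketch {
    private ClusterIdFilterSketch() {}

    static final class Entry {
        final String description;
        final List<String> keyClusterIds;  // cluster IDs carried in the entry's key

        Entry(String description, List<String> keyClusterIds) {
            this.description = description;
            this.keyClusterIds = keyClusterIds;
        }
    }

    /** Keeps only the entries whose key does not already list the peer cluster. */
    static List<Entry> filterForPeer(List<Entry> entries, String peerClusterId) {
        List<Entry> kept = new ArrayList<>();
        for (Entry entry : entries) {
            if (!entry.keyClusterIds.contains(peerClusterId)) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

With the cluster IDs in the key, a bulkload entry that has already passed through the peer is dropped before any bulkload request is sent, so no HFiles are copied for it.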
h2. Test

Setup
 * Two clusters (Cluster A and Cluster B) running HBase 2.6.3
 * HBase and HDFS clusters are separated (compute-storage separation 
architecture)
 * Bulk load replication and bidirectional replication enabled
 * Bulk load executed on Cluster A only

!image-2025-10-16-21-59-13-156.png|width=517,height=440!

Since the bulkload is executed only on Cluster A in both the bidirectional 
(scenario 1) and one-way (scenario 2) replication scenarios, resource usage 
should be identical between the two. However, as shown in the metrics above, 
scenario 1 consumes significantly more resources. This is due to the unnecessary 
copying of HFiles to the staging directory, as explained in the root cause section.

After applying the patch, scenario 3 shows resource usage identical to scenario 
2, confirming that the unnecessary file copying has been eliminated.


