[ https://issues.apache.org/jira/browse/KUDU-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xixu Wang updated KUDU-3447:
----------------------------
    Attachment: image-2023-02-13-17-22-25-368.png

> Limit the usage of network bandwidth of tablet copying
> -------------------------------------------------------
>
>                 Key: KUDU-3447
>                 URL: https://issues.apache.org/jira/browse/KUDU-3447
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Xixu Wang
>            Priority: Minor
>         Attachments: image-2023-02-09-10-38-50-512.png,
> image-2023-02-09-10-47-58-370.png, image-2023-02-13-17-08-37-256.png,
> image-2023-02-13-17-16-50-491.png, image-2023-02-13-17-22-25-368.png
>
>
> Copying tablets from an old cluster to a new cluster with the command
> kudu local_replica copy_from_remote is a resource-intensive operation.
> As the following picture shows, memory usage is as high as 75%, and the
> network is almost fully occupied (the overall network bandwidth is 2Gb/s).
> Disk reading is also very high (the overall disk bandwidth is 200MB/s).
> !image-2023-02-09-10-47-58-370.png|width=996,height=369!
> If the data size is very large, the copying process will last for a long
> time. Other services may be impacted and become unavailable. Therefore it
> is better to limit the tablet copying speed and make the system more stable.
> The goal is to balance the tablet copying speed against the impact on other
> services.
> As copy_from_remote mainly downloads data from the remote cluster and
> writes it to the local file system, limiting the download speed is the
> most direct way to control resource consumption. There are several
> algorithms for implementing a rate limiter. This patch will use the token
> bucket algorithm implemented by the Facebook Folly library:
> [https://github.com/facebook/folly/blob/main/folly/TokenBucket.h]
>
> *Performance Tests*
> 1. Data size:
> TABLE test_1
> on disk size: 13263880213
> live row count: 66433035
> 2. Test Case:
> case 1:
> kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_adddr:7050
> -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir
> -tablet_copy_download_threads_nums_per_session=4 -num_threads=4
> case 2:
> kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_adddr:7050
> -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir
> -tablet_copy_download_threads_nums_per_session=4 -num_threads=4
> -enable_network_speed_limit=true -limit_network_speed=25
> 3. Results:
> 3.1 The usage of CPU
> Left is test case 1, right is test case 2. As we can see, using the speed
> limit feature reduces CPU consumption.
> !image-2023-02-13-17-08-37-256.png|width=418,height=559!!image-2023-02-13-17-16-50-491.png|width=794,height=369!
> 3.2 Load of CPU



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
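
For readers unfamiliar with the token bucket algorithm named in the description, here is a minimal sketch of the idea. It is not the Folly TokenBucket API and not the code in the Kudu patch: the class SimpleTokenBucket and its methods are hypothetical names chosen for illustration, and time is passed in explicitly (in seconds) so that the behaviour is deterministic. Tokens stand for bytes; a downloader would ask for as many tokens as it is about to read and sleep for the returned wait time when the bucket is empty.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical minimal token bucket for illustration only.
// This is neither Folly's TokenBucket nor code from the Kudu patch.
class SimpleTokenBucket {
 public:
  // rate:  tokens (for example bytes) added per second.
  // burst: maximum number of tokens the bucket can hold.
  // The bucket starts full, with its clock at time 0.
  SimpleTokenBucket(double rate, double burst)
      : rate_(rate), burst_(burst), tokens_(burst), last_(0.0) {
    assert(rate > 0.0 && burst > 0.0);
  }

  // Try to take n tokens at time now (seconds). Returns true and deducts
  // the tokens if enough are available; otherwise takes nothing.
  bool TryConsume(double n, double now) {
    Refill(now);
    if (n > tokens_) return false;
    tokens_ -= n;
    return true;
  }

  // Seconds a caller would have to wait, starting at time now, before n
  // tokens become available (0 if they already are). Assumes n <= burst.
  double WaitTimeFor(double n, double now) {
    Refill(now);
    return n <= tokens_ ? 0.0 : (n - tokens_) / rate_;
  }

 private:
  // Add the tokens earned since the last update, capped at the burst size.
  void Refill(double now) {
    if (now > last_) {
      tokens_ = std::min(burst_, tokens_ + (now - last_) * rate_);
      last_ = now;
    }
  }

  double rate_;
  double burst_;
  double tokens_;
  double last_;
};
```

The burst size bounds how much data can be fetched at once after an idle period, while the rate bounds the long-run average, which is the property the description relies on to keep the copy from saturating the network.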