[ https://issues.apache.org/jira/browse/KUDU-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xixu Wang updated KUDU-3447:
----------------------------
    Description:
Copying tablets from an old cluster to a new cluster with the command kudu local_replica copy_from_remote is a resource-intensive operation. As the following picture shows, memory usage reaches 75%, the network is almost fully occupied (the overall network bandwidth is 2Gb/s), and disk read throughput is very high (the overall disk bandwidth is 200MB/s).

!image-2023-02-09-10-47-58-370.png|width=996,height=369!

If the data size is very large, the copy runs for a long time, and other services may be impacted and become unavailable. Therefore it is better to limit the tablet copying speed to keep the system stable. The goal is to balance the tablet copying speed against its impact on other services.

Because copy_from_remote mainly downloads data from the remote cluster and writes it to the local file system, limiting the download speed is a direct way to control resource consumption. There are several algorithms for implementing a rate limiter. This patch will use the token bucket algorithm implemented by the Facebook Folly library:
[https://github.com/facebook/folly/blob/main/folly/TokenBucket.h]

*Performance Tests*

1. Data size:
TABLE test_1
on disk size: 13263880213
live row count: 66433035

2. Test cases:
case 1:
kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_adddr:7050 -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir -tablet_copy_download_threads_nums_per_session=4 -num_threads=4
case 2:
kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_adddr:7050 -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir -tablet_copy_download_threads_nums_per_session=4 -num_threads=4 -enable_network_speed_limit=true -limit_network_speed=25

3. Results:
3.1 CPU usage
The left picture is test case 1 and the right picture is test case 2. As we can see, using the speed limit feature reduces CPU consumption.
!image-2023-02-13-17-08-37-256.png|width=418,height=559!!image-2023-02-13-17-16-50-491.png|width=794,height=369!

3.2 Load of CPU

> Limit the usage of network bandwidth of tablet copying
> -------------------------------------------------------
>
>                 Key: KUDU-3447
>                 URL: https://issues.apache.org/jira/browse/KUDU-3447
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Xixu Wang
>            Priority: Minor
>         Attachments: image-2023-02-09-10-38-50-512.png,
> image-2023-02-09-10-47-58-370.png, image-2023-02-13-17-08-37-256.png,
> image-2023-02-13-17-16-50-491.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
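
For readers unfamiliar with the token bucket algorithm named in the description, here is a minimal sketch of the idea. It is not folly::TokenBucket and not the code from the KUDU-3447 patch: the class name SimpleTokenBucket, its methods, and the explicit time argument (in seconds) are all hypothetical, chosen only to keep the example self-contained. Tokens accumulate at a fixed rate up to a burst capacity, and a caller may proceed only when enough tokens are available.

```cpp
#include <algorithm>

// Hypothetical, simplified token bucket for illustration only.
// NOT folly::TokenBucket and NOT the KUDU-3447 implementation.
// Time is passed in explicitly (seconds) instead of reading a clock.
class SimpleTokenBucket {
 public:
  // rate:  tokens (for example, bytes) added per second.
  // burst: maximum number of tokens the bucket can hold.
  // The bucket starts full at time 0.
  SimpleTokenBucket(double rate, double burst)
      : rate_(rate), burst_(burst), tokens_(burst), last_time_(0.0) {}

  // Try to take `n` tokens at time `now`. Returns true and deducts the
  // tokens if enough are available; otherwise leaves the bucket unchanged.
  bool TryConsume(double n, double now) {
    Refill(now);
    if (n > tokens_) return false;
    tokens_ -= n;
    return true;
  }

  // Seconds a caller would have to wait at time `now` before `n` tokens
  // are available (0 if they already are). Assumes n <= burst.
  double SecondsUntilAvailable(double n, double now) {
    Refill(now);
    return n <= tokens_ ? 0.0 : (n - tokens_) / rate_;
  }

 private:
  // Add the tokens earned since the last call, capped at the capacity.
  void Refill(double now) {
    if (now > last_time_) {
      tokens_ = std::min(burst_, tokens_ + (now - last_time_) * rate_);
      last_time_ = now;
    }
  }

  double rate_;
  double burst_;
  double tokens_;
  double last_time_;
};
```

Under this sketch, a downloader would ask for as many tokens as the bytes it is about to fetch and wait for SecondsUntilAvailable when TryConsume fails, so the long-run transfer rate cannot exceed the configured rate while short bursts up to the capacity are still allowed.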