From: Hyman Huang(黄勇) <huang...@chinatelecom.cn>

v2:
This version makes the following modifications compared with v1:

1. Fix the overflow issue reported by Peter Maydell.
2. Add a parameter check for the HMP "set_vcpu_dirty_limit" command.
3. Fix the race between the dirty ring reaper thread and the QEMU
   main thread.
4. Add migration parameter checks for x-vcpu-dirty-limit-period and
   vcpu-dirty-limit.
5. Forbid the HMP/QMP commands set_vcpu_dirty_limit and
   cancel_vcpu_dirty_limit while a dirty-limit live migration is in
   progress, as part of implementing the dirty-limit convergence
   algorithm.
6. Add a capability check to ensure auto-converge and dirty-limit are
   mutually exclusive.
7. Pre-check that the KVM dirty ring size is configured before the
   dirty-limit migration parameters can be set.
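For convenience of review and testing, below is a rough sketch of how
the new knobs fit together end to end. The spellings follow the patch
titles above; the exact QMP syntax and the units (milliseconds for the
period, MB/s for the per-vCPU limit) are assumptions to be checked
against the qapi/migration.json changes in this series:

  # The KVM dirty ring must be enabled up front, otherwise setting the
  # dirty-limit parameters is refused (item 7 above):
  $ qemu-system-x86_64 -accel kvm,dirty-ring-size=2048 ...

  # Enable the capability (mutually exclusive with auto-converge,
  # item 6) and tune the parameters before migrating:
  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
        { "capability": "dirty-limit", "state": true } ] } }
  { "execute": "migrate-set-parameters",
    "arguments": { "x-vcpu-dirty-limit-period": 1000,
                   "vcpu-dirty-limit": 1 } }

  # While such a migration is in flight, the manual set_vcpu_dirty_limit
  # and cancel_vcpu_dirty_limit commands are rejected (item 5).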
A more comprehensive test was done than in v1. The test environment:

-------------------------------------------------------------
a. Host hardware info:

CPU:
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63

Memory:
Hynix 503Gi

Interface:
Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
Speed: 1000Mb/s

b. Host software info:

OS: ctyunos release 2
Kernel: 4.19.90-2102.2.0.0066.ctl2.x86_64
Libvirt baseline version: libvirt-6.9.0
Qemu baseline version: qemu-5.0

c. VM scale
CPU: 4
Memory: 4G
-------------------------------------------------------------

All the supplementary test data below are based on the above
environment.

In v1, we posted UnixBench data gathered as follows:

$ taskset -c 8-15 ./Run -i 2 -c 8 {unixbench test item}

host cpu: Intel(R) Xeon(R) Platinum 8378A
host interface speed: 1000Mb/s

|---------------------+--------+------------+---------------|
| UnixBench test item | Normal | Dirtylimit | Auto-converge |
|---------------------+--------+------------+---------------|
| dhry2reg            | 32800  | 32786      | 25292         |
| whetstone-double    | 10326  | 10315      | 9847          |
| pipe                | 15442  | 15271      | 14506         |
| context1            | 7260   | 6235       | 4514          |
| spawn               | 3663   | 3317       | 3249          |
| syscall             | 4669   | 4667       | 3841          |
|---------------------+--------+------------+---------------|

For v2, we post supplementary data that does not use taskset, making
the scenario more general:

$ ./Run

Per-vCPU data:

|---------------------+--------+------------+---------------|
| UnixBench test item | Normal | Dirtylimit | Auto-converge |
|---------------------+--------+------------+---------------|
| dhry2reg            | 2991   | 2902       | 1722          |
| whetstone-double    | 1018   | 1006       | 627           |
| Execl Throughput    | 955    | 320        | 660           |
| File Copy - 1       | 2362   | 805        | 1325          |
| File Copy - 2       | 1500   | 1406       | 643           |
| File Copy - 3       | 4778   | 2160       | 1047          |
| Pipe Throughput     | 1181   | 1170       | 842           |
| Context Switching   | 192    | 224        | 198           |
| Process Creation    | 490    | 145        | 95            |
| Shell Scripts - 1   | 1284   | 565        | 610           |
| Shell Scripts - 2   | 2368   | 900        | 1040          |
| System Call Overhead| 983    | 948        | 698           |
| Index Score         | 1263   | 815        | 600           |
|---------------------+--------+------------+---------------|

Note:
File Copy - 1:     File Copy 1024 bufsize 2000 maxblocks
File Copy - 2:     File Copy 256 bufsize 500 maxblocks
File Copy - 3:     File Copy 4096 bufsize 8000 maxblocks
Shell Scripts - 1: Shell Scripts (1 concurrent)
Shell Scripts - 2: Shell Scripts (8 concurrent)

Based on the above data, dirty-limit outperforms auto-converge in
almost every respect during live migration; the "System Benchmarks
Index Score" shows about a 35% improvement (815 vs. 600).
4-vCPU parallel data (the test VM has the 4c4g scale above):

|---------------------+--------+------------+---------------|
| UnixBench test item | Normal | Dirtylimit | Auto-converge |
|---------------------+--------+------------+---------------|
| dhry2reg            | 7975   | 7146       | 5071          |
| whetstone-double    | 3982   | 3561       | 2124          |
| Execl Throughput    | 1882   | 1205       | 768           |
| File Copy - 1       | 1061   | 865        | 498           |
| File Copy - 2       | 676    | 491        | 519           |
| File Copy - 3       | 2260   | 923        | 1329          |
| Pipe Throughput     | 3026   | 3009       | 1616          |
| Context Switching   | 1219   | 1093       | 695           |
| Process Creation    | 947    | 307        | 446           |
| Shell Scripts - 1   | 2469   | 977        | 989           |
| Shell Scripts - 2   | 2667   | 1275       | 984           |
| System Call Overhead| 1592   | 1459       | 692           |
| Index Score         | 1976   | 1294       | 997           |
|---------------------+--------+------------+---------------|

For the parallel data, the "System Benchmarks Index Score" likewise
shows about a 29% improvement (1294 vs. 997).

In v1, total migration time was measured as follows:

host cpu: Intel(R) Xeon(R) Platinum 8378A
host interface speed: 1000Mb/s

|-----------------------+----------------+-------------------|
| dirty memory size(MB) | Dirtylimit(ms) | Auto-converge(ms) |
|-----------------------+----------------+-------------------|
| 60                    | 2014           | 2131              |
| 70                    | 5381           | 12590             |
| 90                    | 6037           | 33545             |
| 110                   | 7660           | [*]               |
|-----------------------+----------------+-------------------|
[*]: migration did not converge in this case.

For v2, we post more comprehensive total-migration-time data. The
workload updates N MB on 4 vCPUs, sleeping S us after every 1 MB of
data is updated (a sketch of the workload appears after the table);
each condition was tested twice:

|-----------+--------+--------+----------------+-------------------|
| ring size | N (MB) | S (us) | Dirtylimit(ms) | Auto-converge(ms) |
|-----------+--------+--------+----------------+-------------------|
| 1024      | 1024   | 1000   | 44951          | 191780            |
| 1024      | 1024   | 1000   | 44546          | 185341            |
| 1024      | 1024   | 500    | 46505          | 203545            |
| 1024      | 1024   | 500    | 45469          | 909945            |
| 1024      | 1024   | 0      | 61858          | [*]               |
| 1024      | 1024   | 0      | 57922          | [*]               |
| 1024      | 2048   | 0      | 91982          | [*]               |
| 1024      | 2048   | 0      | 90388          | [*]               |
| 2048      | 128    | 10000  | 14511          | 25971             |
| 2048      | 128    | 10000  | 13472          | 26294             |
| 2048      | 1024   | 10000  | 44244          | 26294             |
| 2048      | 1024   | 10000  | 45099          | 157701            |
| 2048      | 1024   | 500    | 51105          | [*]               |
| 2048      | 1024   | 500    | 49648          | [*]               |
| 2048      | 1024   | 0      | 229031         | [*]               |
| 2048      | 1024   | 0      | 154282         | [*]               |
|-----------+--------+--------+----------------+-------------------|
[*]: migration did not converge in this case.

Note that the larger the ring size, the less responsively dirty-limit
reacts, so an optimal ring size should be chosen based on test data
for each VM scale.
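The workload program itself is not part of the series; the following
is a minimal sketch of what it presumably looks like. The names are
illustrative, and whether N MB is dirtied per thread or in total is
not spelled out above, so per-thread is assumed here:

/* dirty_workload.c - hypothetical sketch of the guest-side workload:
 * 4 threads each dirty N MB in a loop, sleeping S us after every 1 MB.
 * Build with: gcc -O2 -pthread dirty_workload.c */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

static size_t n_mb = 1024;      /* N: MB dirtied per pass */
static useconds_t s_us = 1000;  /* S: sleep after each MB, in us */

static void *dirty_worker(void *arg)
{
    unsigned char *buf = malloc(n_mb * MB);
    unsigned char val = 0;

    if (!buf) {
        return NULL;
    }
    for (;;) {
        val++;  /* vary the pattern so every pass really dirties pages */
        for (size_t i = 0; i < n_mb; i++) {
            memset(buf + i * MB, val, MB);  /* dirty 1 MB */
            if (s_us) {
                usleep(s_us);
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[4];

    for (int i = 0; i < 4; i++) {
        pthread_create(&tid[i], NULL, dirty_worker, NULL);
    }
    pthread_join(tid[0], NULL);  /* workers loop forever; never returns */
    return 0;
}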
We also tested the effect of the "x-vcpu-dirty-limit-period" parameter
on total migration time. Each condition was tested twice:

|-----------+--------+--------+-------------+----------------------|
| ring size | N (MB) | S (us) | Period (ms) | total time (ms)      |
|-----------+--------+--------+-------------+----------------------|
| 2048      | 1024   | 10000  | 100         | [*]                  |
| 2048      | 1024   | 10000  | 100         | [*]                  |
| 2048      | 1024   | 10000  | 300         | 156795               |
| 2048      | 1024   | 10000  | 300         | 118179               |
| 2048      | 1024   | 10000  | 500         | 44244                |
| 2048      | 1024   | 10000  | 500         | 45099                |
| 2048      | 1024   | 10000  | 700         | 41871                |
| 2048      | 1024   | 10000  | 700         | 42582                |
| 2048      | 1024   | 10000  | 1000        | 41430                |
| 2048      | 1024   | 10000  | 1000        | 40383                |
| 2048      | 1024   | 10000  | 1500        | 42030                |
| 2048      | 1024   | 10000  | 1500        | 42598                |
| 2048      | 1024   | 10000  | 2000        | 41694                |
| 2048      | 1024   | 10000  | 2000        | 42403                |
| 2048      | 1024   | 10000  | 3000        | 43538                |
| 2048      | 1024   | 10000  | 3000        | 43010                |
|-----------+--------+--------+-------------+----------------------|
[*]: migration did not converge in this case.

The data suggest that x-vcpu-dirty-limit-period should be configured
to about 1000 ms under the above conditions.

Please review; any comments and suggestions are much appreciated.

Thanks,
Yong

Hyman Huang (11):
  dirtylimit: Fix overflow when computing MB
  softmmu/dirtylimit: Add parameter check for hmp "set_vcpu_dirty_limit"
  kvm-all: Do not allow reap vcpu dirty ring buffer if not ready
  qapi/migration: Introduce x-vcpu-dirty-limit-period parameter
  qapi/migration: Introduce vcpu-dirty-limit parameters
  migration: Introduce dirty-limit capability
  migration: Implement dirty-limit convergence algo
  migration: Export dirty-limit time info
  tests: Add migration dirty-limit capability test
  tests/migration: Introduce dirty-ring-size option into guestperf
  tests/migration: Introduce dirty-limit into guestperf

 accel/kvm/kvm-all.c                     |  36 ++++++++
 include/sysemu/dirtylimit.h             |   2 +
 migration/migration.c                   |  85 ++++++++++++++++++
 migration/migration.h                   |   1 +
 migration/ram.c                         |  62 ++++++++++---
 migration/trace-events                  |   1 +
 monitor/hmp-cmds.c                      |  26 ++++++
 qapi/migration.json                     |  60 +++++++++++--
 softmmu/dirtylimit.c                    |  75 +++++++++++++++-
 tests/migration/guestperf/comparison.py |  24 +++++
 tests/migration/guestperf/engine.py     |  24 ++++-
 tests/migration/guestperf/hardware.py   |   8 +-
 tests/migration/guestperf/progress.py   |  17 +++-
 tests/migration/guestperf/scenario.py   |  11 ++-
 tests/migration/guestperf/shell.py      |  25 +++++-
 tests/qtest/migration-test.c            | 154 ++++++++++++++++++++++++++++++++
 16 files changed, 577 insertions(+), 34 deletions(-)

--
1.8.3.1