* Peter Xu (pet...@redhat.com) wrote:
> Based-on: <20211224065000.97572-1-pet...@redhat.com>
> 
> Human version - This patchset is based on:
> https://lore.kernel.org/qemu-devel/20211224065000.97572-1-pet...@redhat.com/
> 
> This series can also be found here:
> https://github.com/xzpeter/qemu/tree/postcopy-preempt
> 
> Abstract
> ========
> 
> This series adds a new migration capability called "postcopy-preempt".  It
> can be enabled when postcopy is enabled, and it simply (but greatly) speeds
> up the handling of postcopy page requests.
> 
> Some quick tests below measuring postcopy page request latency:
> 
> - Guest config: 20G guest, 40 vcpus
> - Host config: 10Gbps host NIC attached between src/dst
> - Workload: one busy dirty thread, writing to 18G of memory (pre-faulted).
>   (refers to the "2M/4K huge page, 1 dirty thread" tests below)
> - Script: see [1]
> 
> |----------------+--------------+-----------------------|
> | Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
> |----------------+--------------+-----------------------|
> | 2M             |        10.58 |                  4.96 |
> | 4K             |        10.68 |                  0.57 |
> |----------------+--------------+-----------------------|
> 
> For 2M pages the latency roughly halves (~2x speedup); for 4K pages it's an
> ~18x speedup.
> 
> For more information on the testing, please refer to "Test Results" below.
> 
> Design
> ======
> 
> The postcopy-preempt feature contains two major reworks of the postcopy
> page fault handling:
> 
> (1) Postcopy page requests are now sent via a different socket from the
>     precopy background migration stream, so they are isolated from the
>     very high page request delays on that stream.
> 
> (2) For huge-page-enabled hosts: postcopy requests can now interrupt a
>     partial sending of a huge host page on the src QEMU.
> 
> The design is relatively straightforward; however, there are quite a few
> implementation details the patchset needs to address.  Many of them are
> handled as separate patches, and the rest are handled mostly in the big
> patch that enables the whole feature.
> 
> Postcopy recovery is not yet supported; it will be added after some initial
> review of the approach.
> 
> Patch layout
> ============
> 
> The first 10 (out of 15) patches are mostly suitable to be merged even
> without the new feature, so they can be reviewed earlier.
> 
> Patches 11-14 implement the new feature: patches 11-13 are still small,
> mostly preparatory patches, while the major change is in patch 14.
> 
> Patch 15 is a unit test.
> 
> Test Results
> ============
> 
> I measured the page request latency by trapping userfaultfd kernel faults
> with the bpf script [1].  KVM fast page faults are ignored, because when
> one happens no major/real page fault is needed, IOW no query to the src
> QEMU.
> 
> The numbers (and histograms) below were captured across whole postcopy
> migrations sampled with different configurations, from which the average
> page request latency was calculated.  I also captured the latency
> distributions, which are interesting to look at here as well.
> 
> One thing to mention is that I did not test 1G pages.  That does not mean
> this series won't help 1G - I believe it will help no less than what I've
> tested - it's just that with 1G huge pages the latency will be >1 sec on a
> 10Gbps NIC, so it's not really a usable scenario for any sensible customer.
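
As a rough illustration of the methodology for anyone who doesn't want to
read [1]: a minimal bpftrace sketch along the following lines produces a
@delay_us histogram of the same shape as below.  The probe point
(handle_userfault) is my assumption here, and unlike the real script it
doesn't filter anything out (e.g. the KVM fast faults mentioned above), so
treat it only as a sketch:

  # Rough per-fault latency: time a faulting thread spends blocked in the
  # kernel's userfaultfd fault handler, bucketed in microseconds.
  bpftrace -e '
    kprobe:handle_userfault { @start[tid] = nsecs; }
    kretprobe:handle_userfault /@start[tid]/ {
      @delay_us = hist((nsecs - @start[tid]) / 1000);
      delete(@start[tid]);
    }'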
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 2M huge page, 1 dirty thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With vanilla postcopy:
> 
> Average: 10582 (us)
> 
> @delay_us:
> [1K, 2K)         7 |                                                    |
> [2K, 4K)         1 |                                                    |
> [4K, 8K)         9 |                                                    |
> [8K, 16K)     1983 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> With postcopy-preempt:
> 
> Average: 4960 (us)
> 
> @delay_us:
> [1K, 2K)         5 |                                                    |
> [2K, 4K)        44 |                                                    |
> [4K, 8K)      3495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)      154 |@@                                                  |
> [16K, 32K)       1 |                                                    |
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 1 dirty thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With vanilla postcopy:
> 
> Average: 10676 (us)
> 
> @delay_us:
> [4, 8)           1 |                                                    |
> [8, 16)          3 |                                                    |
> [16, 32)         5 |                                                    |
> [32, 64)         3 |                                                    |
> [64, 128)       12 |                                                    |
> [128, 256)      10 |                                                    |
> [256, 512)      27 |                                                    |
> [512, 1K)        5 |                                                    |
> [1K, 2K)        11 |                                                    |
> [2K, 4K)        17 |                                                    |
> [4K, 8K)        10 |                                                    |
> [8K, 16K)     2681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)       6 |                                                    |
> 
> With postcopy preempt:
> 
> Average: 570 (us)
> 
> @delay_us:
> [16, 32)         5 |                                                    |
> [32, 64)         6 |                                                    |
> [64, 128)     8340 |@@@@@@@@@@@@@@@@@@                                  |
> [128, 256)   23052 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [256, 512)    8119 |@@@@@@@@@@@@@@@@@@                                  |
> [512, 1K)      148 |                                                    |
> [1K, 2K)       759 |@                                                   |
> [2K, 4K)      6729 |@@@@@@@@@@@@@@@                                     |
> [4K, 8K)        80 |                                                    |
> [8K, 16K)      115 |                                                    |
> [16K, 32K)      32 |                                                    |

Nice speedups.

> So one thing funny about 4K small pages is that with vanilla postcopy I
> didn't even get a speedup compared to 2M pages, probably because the major
> overhead is not sending the page itself but other things (e.g. waiting for
> precopy to flush the existing pages).
> 
> The other thing is that in the postcopy preempt test I can still see a
> bunch of 2ms-4ms latency page requests.  That's probably what we would
> like to dig into next.  One possibility is that, since we share the same
> sending thread on the src QEMU, we could have yielded because the precopy
> socket is full.  But that's TBD.

I guess those could be pages queued behind others; or maybe something like a
page that starts getting sent on the main socket but is then interrupted by
another request, and then the original page is wanted?

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 16 dirty threads
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> What I additionally tested was using 16 concurrent faulting threads, in
> which case the postcopy queue can get relatively longer.  It's done via:
> 
>   $ stress -m 16 --vm-bytes 1073741824 --vm-keep
> 
> With vanilla postcopy:
> 
> Average: 2244 (us)
> 
> @delay_us:
> [0]            556 |                                                    |
> [1]          11251 |@@@@@@@@@@@@                                        |
> [2, 4)       12094 |@@@@@@@@@@@@@                                       |
> [4, 8)       12234 |@@@@@@@@@@@@@                                       |
> [8, 16)      47144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)     42281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32, 64)     17676 |@@@@@@@@@@@@@@@@@@@                                 |
> [64, 128)      952 |@                                                   |
> [128, 256)     405 |                                                    |
> [256, 512)     779 |                                                    |
> [512, 1K)     1003 |@                                                   |
> [1K, 2K)      1976 |@@                                                  |
> [2K, 4K)      4865 |@@@@@                                               |
> [4K, 8K)      5892 |@@@@@@                                              |
> [8K, 16K)    26941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [16K, 32K)     844 |                                                    |
> [32K, 64K)      17 |                                                    |
> 
> With postcopy preempt:
> 
> Average: 1064 (us)
> 
> @delay_us:
> [0]           1341 |                                                    |
> [1]          30211 |@@@@@@@@@@@@                                        |
> [2, 4)       32934 |@@@@@@@@@@@@@                                       |
> [4, 8)       21295 |@@@@@@@@                                            |
> [8, 16)     130774 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)     95128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [32, 64)     49591 |@@@@@@@@@@@@@@@@@@@                                 |
> [64, 128)     3921 |@                                                   |
> [128, 256)    1066 |                                                    |
> [256, 512)    2730 |@                                                   |
> [512, 1K)     1849 |                                                    |
> [1K, 2K)       512 |                                                    |
> [2K, 4K)      2355 |                                                    |
> [4K, 8K)     48812 |@@@@@@@@@@@@@@@@@@@                                 |
> [8K, 16K)    10026 |@@@                                                 |
> [16K, 32K)     810 |                                                    |
> [32K, 64K)      68 |                                                    |
> 
> In this specific case, a funny thing is that when there are tons of
> postcopy requests, the vanilla postcopy page requests are handled even
> faster (2ms average) than with only 1 dirty thread.  That's probably
> because unqueue_page() will always hit anyway, so the precopy stream has
> less effect on postcopy.  However, that's still slower than having a
> standalone postcopy stream as the preempt version does (1ms).

Curious.

Dave

> Any comment welcomed.
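
Since comments are welcomed: it might be worth spelling out in the cover
letter how the feature is expected to be turned on.  I assume it's the usual
capability setup on both sides before starting the migration, roughly like
the following - the "postcopy-preempt" spelling is taken from the abstract,
so the exact QAPI name added in patch 13 may differ:

  {"execute": "migrate-set-capabilities",
   "arguments": {"capabilities": [
       {"capability": "postcopy-ram",     "state": true},
       {"capability": "postcopy-preempt", "state": true}]}}

followed by the normal migrate + migrate-start-postcopy flow.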
> 
> [1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf
> 
> Peter Xu (15):
>   migration: No off-by-one for pss->page update in host page size
>   migration: Allow pss->page jump over clean pages
>   migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
>   migration: Add postcopy_has_request()
>   migration: Simplify unqueue_page()
>   migration: Move temp page setup and cleanup into separate functions
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption on separate channel
>   tests: Add postcopy preempt test
> 
>  migration/migration.c        | 107 +++++++--
>  migration/migration.h        |  55 ++++-
>  migration/multifd.c          |  19 +-
>  migration/multifd.h          |   2 -
>  migration/postcopy-ram.c     | 192 ++++++++++++----
>  migration/postcopy-ram.h     |  14 ++
>  migration/ram.c              | 417 ++++++++++++++++++++++++++++-------
>  migration/ram.h              |   2 +
>  migration/savevm.c           |  12 +-
>  migration/socket.c           |  18 ++
>  migration/socket.h           |   1 +
>  migration/trace-events       |  12 +-
>  qapi/migration.json          |   8 +-
>  tests/qtest/migration-test.c |  21 ++
>  14 files changed, 716 insertions(+), 164 deletions(-)
> 
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK