* Peter Xu (pet...@redhat.com) wrote:
> Based-on: <20211224065000.97572-1-pet...@redhat.com>
> 
> Human version - This patchset is based on:
> https://lore.kernel.org/qemu-devel/20211224065000.97572-1-pet...@redhat.com/
> 
> This series can also be found here:
> https://github.com/xzpeter/qemu/tree/postcopy-preempt
> 
> Abstract
> ========
> 
> This series adds a new migration capability called "postcopy-preempt".  It
> can be enabled when postcopy is enabled, and it simply (but greatly) speeds
> up the handling of postcopy page requests.
> 
> Some quick tests below measuring postcopy page request latency:
> 
> - Guest config: 20G guest, 40 vcpus
> - Host config: 10Gbps host NIC attached between src/dst
> - Workload: one busy dirty thread, writing to 18G of memory (pre-faulted).
>   (refers to the "2M/4K huge page, 1 dirty thread" tests below)
> - Script: see [1]
> 
> |----------------+--------------+-----------------------|
> | Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
> |----------------+--------------+-----------------------|
> | 2M             |        10.58 |                  4.96 |
> | 4K             |        10.68 |                  0.57 |
> |----------------+--------------+-----------------------|
> 
> For 2M pages the latency roughly halves (~2x speedup); for 4K pages it's an
> ~18x speedup.
> 
> For more information on the testing, please refer to "Test Results" below.
> 
> Design
> ======
> 
> The postcopy-preempt feature contains two major reworks of the postcopy
> page fault handling:
> 
> (1) Postcopy page requests are now sent via a different socket from the
>     precopy background migration stream, so they are isolated from the
>     very high page request delays on that stream.
> 
> (2) For huge-page-enabled hosts: postcopy requests can now interrupt a
>     partial sending of a huge host page on the src QEMU.
> 
> The design is relatively straightforward; however, there are quite a few
> implementation details the patchset needs to address.  Many of them are
> handled as separate patches, and the rest are handled mostly in the big
> patch that enables the whole feature.
> 
> Postcopy recovery is not yet supported; it will be added after some initial
> review of the approach.
> 
> Patch layout
> ============
> 
> The first 10 (out of 15) patches are mostly suitable to be merged even
> without the new feature, so they can be reviewed earlier.
> 
> Patches 11-14 implement the new feature: patches 11-13 are still small,
> mostly preparatory patches, while the major change is in patch 14.
> 
> Patch 15 is a unit test.
> 
> Test Results
> ============
> 
> I measured the page request latency by trapping userfaultfd kernel faults
> with the bpf script [1].  KVM fast page faults are ignored, because when
> one happens no major/real page fault is needed, IOW no query to the src
> QEMU.
> 
> The numbers (and histograms) below were captured across whole postcopy
> migrations sampled with different configurations, from which the average
> page request latency was calculated.  I also captured the latency
> distributions, which are interesting to look at here as well.
> 
> One thing to mention is that I did not test 1G pages.  That does not mean
> this series won't help 1G - I believe it will help no less than what I've
> tested - it's just that with 1G huge pages the latency will be >1 sec on a
> 10Gbps NIC, so it's not really a usable scenario for any sensible customer.
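
As a rough illustration of the methodology for anyone who doesn't want to
read [1]: a minimal bpftrace sketch along the following lines produces a
@delay_us histogram of the same shape as below.  The probe point
(handle_userfault) is my assumption here, and unlike the real script it
doesn't filter anything out (e.g. the KVM fast faults mentioned above), so
treat it only as a sketch:

  # Rough per-fault latency: time a faulting thread spends blocked in the
  # kernel's userfaultfd fault handler, bucketed in microseconds.
  bpftrace -e '
    kprobe:handle_userfault { @start[tid] = nsecs; }
    kretprobe:handle_userfault /@start[tid]/ {
      @delay_us = hist((nsecs - @start[tid]) / 1000);
      delete(@start[tid]);
    }'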
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 2M huge page, 1 dirty thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With vanilla postcopy:
> 
> Average: 10582 (us)
> 
> @delay_us:
> [1K, 2K)         7 |                                                    |
> [2K, 4K)         1 |                                                    |
> [4K, 8K)         9 |                                                    |
> [8K, 16K)     1983 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> With postcopy-preempt:
> 
> Average: 4960 (us)
> 
> @delay_us:
> [1K, 2K)         5 |                                                    |
> [2K, 4K)        44 |                                                    |
> [4K, 8K)      3495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)      154 |@@                                                  |
> [16K, 32K)       1 |                                                    |
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 1 dirty thread
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With vanilla postcopy:
> 
> Average: 10676 (us)
> 
> @delay_us:
> [4, 8)           1 |                                                    |
> [8, 16)          3 |                                                    |
> [16, 32)         5 |                                                    |
> [32, 64)         3 |                                                    |
> [64, 128)       12 |                                                    |
> [128, 256)      10 |                                                    |
> [256, 512)      27 |                                                    |
> [512, 1K)        5 |                                                    |
> [1K, 2K)        11 |                                                    |
> [2K, 4K)        17 |                                                    |
> [4K, 8K)        10 |                                                    |
> [8K, 16K)     2681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)       6 |                                                    |
> 
> With postcopy preempt:
> 
> Average: 570 (us)
> 
> @delay_us:
> [16, 32)         5 |                                                    |
> [32, 64)         6 |                                                    |
> [64, 128)     8340 |@@@@@@@@@@@@@@@@@@                                  |
> [128, 256)   23052 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [256, 512)    8119 |@@@@@@@@@@@@@@@@@@                                  |
> [512, 1K)      148 |                                                    |
> [1K, 2K)       759 |@                                                   |
> [2K, 4K)      6729 |@@@@@@@@@@@@@@@                                     |
> [4K, 8K)        80 |                                                    |
> [8K, 16K)      115 |                                                    |
> [16K, 32K)      32 |                                                    |

Nice speedups.

> So one thing funny about 4K small pages is that with vanilla postcopy I
> didn't even get a speedup compared to 2M pages, probably because the major
> overhead is not sending the page itself but other things (e.g. waiting for
> precopy to flush the existing pages).
> 
> The other thing is that in the postcopy preempt test I can still see a
> bunch of 2ms-4ms latency page requests.  That's probably what we would
> like to dig into next.  One possibility is that, since we share the same
> sending thread on the src QEMU, we could have yielded because the precopy
> socket is full.  But that's TBD.

I guess those could be pages queued behind others; or maybe something like a
page that starts getting sent on the main socket but is then interrupted by
another request, and then the original page is wanted?

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 4K small page, 16 dirty threads
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> What I additionally tested was using 16 concurrent faulting threads, in
> which case the postcopy queue can get relatively longer.  It's done via:
> 
>   $ stress -m 16 --vm-bytes 1073741824 --vm-keep
> 
> With vanilla postcopy:
> 
> Average: 2244 (us)
> 
> @delay_us:
> [0]            556 |                                                    |
> [1]          11251 |@@@@@@@@@@@@                                        |
> [2, 4)       12094 |@@@@@@@@@@@@@                                       |
> [4, 8)       12234 |@@@@@@@@@@@@@                                       |
> [8, 16)      47144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)     42281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32, 64)     17676 |@@@@@@@@@@@@@@@@@@@                                 |
> [64, 128)      952 |@                                                   |
> [128, 256)     405 |                                                    |
> [256, 512)     779 |                                                    |
> [512, 1K)     1003 |@                                                   |
> [1K, 2K)      1976 |@@                                                  |
> [2K, 4K)      4865 |@@@@@                                               |
> [4K, 8K)      5892 |@@@@@@                                              |
> [8K, 16K)    26941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [16K, 32K)     844 |                                                    |
> [32K, 64K)      17 |                                                    |
> 
> With postcopy preempt:
> 
> Average: 1064 (us)
> 
> @delay_us:
> [0]           1341 |                                                    |
> [1]          30211 |@@@@@@@@@@@@                                        |
> [2, 4)       32934 |@@@@@@@@@@@@@                                       |
> [4, 8)       21295 |@@@@@@@@                                            |
> [8, 16)     130774 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)     95128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [32, 64)     49591 |@@@@@@@@@@@@@@@@@@@                                 |
> [64, 128)     3921 |@                                                   |
> [128, 256)    1066 |                                                    |
> [256, 512)    2730 |@                                                   |
> [512, 1K)     1849 |                                                    |
> [1K, 2K)       512 |                                                    |
> [2K, 4K)      2355 |                                                    |
> [4K, 8K)     48812 |@@@@@@@@@@@@@@@@@@@                                 |
> [8K, 16K)    10026 |@@@                                                 |
> [16K, 32K)     810 |                                                    |
> [32K, 64K)      68 |                                                    |
> 
> In this specific case, a funny thing is that when there are tons of
> postcopy requests, the vanilla postcopy page requests are handled even
> faster (2ms average) than with only 1 dirty thread.  That's probably
> because unqueue_page() will always hit anyway, so the precopy stream has
> less effect on postcopy.  However, that's still slower than having a
> standalone postcopy stream as the preempt version does (1ms).

Curious.

Dave

> Any comment welcomed.
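
Since comments are welcomed: it might be worth spelling out in the cover
letter how the feature is expected to be turned on.  I assume it's the usual
capability setup on both sides before starting the migration, roughly like
the following - the "postcopy-preempt" spelling is taken from the abstract,
so the exact QAPI name added in patch 13 may differ:

  {"execute": "migrate-set-capabilities",
   "arguments": {"capabilities": [
       {"capability": "postcopy-ram",     "state": true},
       {"capability": "postcopy-preempt", "state": true}]}}

followed by the normal migrate + migrate-start-postcopy flow.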
> 
> [1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf
> 
> Peter Xu (15):
>   migration: No off-by-one for pss->page update in host page size
>   migration: Allow pss->page jump over clean pages
>   migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
>   migration: Add postcopy_has_request()
>   migration: Simplify unqueue_page()
>   migration: Move temp page setup and cleanup into separate functions
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption on separate channel
>   tests: Add postcopy preempt test
> 
>  migration/migration.c        | 107 +++++++--
>  migration/migration.h        |  55 ++++-
>  migration/multifd.c          |  19 +-
>  migration/multifd.h          |   2 -
>  migration/postcopy-ram.c     | 192 ++++++++++++----
>  migration/postcopy-ram.h     |  14 ++
>  migration/ram.c              | 417 ++++++++++++++++++++++++++++-------
>  migration/ram.h              |   2 +
>  migration/savevm.c           |  12 +-
>  migration/socket.c           |  18 ++
>  migration/socket.h           |   1 +
>  migration/trace-events       |  12 +-
>  qapi/migration.json          |   8 +-
>  tests/qtest/migration-test.c |  21 ++
>  14 files changed, 716 insertions(+), 164 deletions(-)
> 
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK