On Mon, Dec 15, 2025 at 10:06:10PM +0800, Chuang Xu wrote:
> In this version:
> 
> - drop duplicate vhost_log_sync optimization
> - refactor physical_memory_test_and_clear_dirty
> - provide more detailed bitmap sync time for each part in this cover
> 
> 
> In our long-term experience at Bytedance, we've found that under the same
> load, live migration of larger VMs with more devices often has more trouble
> converging (requiring a larger downtime limit).
> 
> We've observed that the live migration bandwidth of large, multi-device VMs is
> severely distorted, a phenomenon likely similar to the problem described in 
> this link
> (https://wiki.qemu.org/ToDo/LiveMigration#Optimize_migration_bandwidth_calculation).
> 
> Through some testing and calculations, we conclude that bitmap sync time
> affects the calculation of live migration bandwidth.
> 
> Now, let me use some formulas to illustrate the relationship between the
> downtime limit required to meet the stop condition and the bitmap sync time.
> 
> Assume the actual live migration bandwidth is B, the dirty page rate is D,
> the bitmap sync time is x (ms), the transfer time per iteration is t (ms),
> and the downtime limit is y (ms).
> 
> To simplify the calculation, we assume none of the dirty pages are zero
> pages and only consider the case B > D.
> 
> When x + t > 100ms, the bandwidth calculated by qemu is R = B * t / (x + t).
> When x + t < 100ms, the bandwidth calculated by qemu is R = B * (100 - x) / 100.
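> 
> For example (using the numbers from the first case below, not a new
> measurement): with B = 15GBps, x = 73ms and t = 146ms, qemu reports
> R = 15 * 146 / (73 + 146) = 10GBps, only two thirds of the real bandwidth;
> with x = 12ms the second formula gives R = 15 * (100 - 12) / 100 = 13.2GBps.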
> 
> If the migration is in the critical convergence state (each iteration
> transfers exactly as much data as is dirtied during it), then we have:
>   (1) B * t = D * (x + t)
>   (2) t = D * x / (B - D)
> For the stop condition to be met, there are two cases (a small numeric
> check of the resulting bounds follows the derivation):
>   When:
>   (3) x + t > 100
>   (4) x + D * x / (B - D) > 100
>   (5) x > 100 - 100 * D / B
>   Then:
>   (6) R * y > D * (x + t)
>   (7) B * t * y / (x + t) > D * (x + t)
>   (8) (B * (D * x / (B - D)) * y) / (x + D * x / (B - D)) > D * (x + D * x / (B - D))
>   (9) D * y > D * (x + D * x / (B - D))
>   (10) y > x + D * x / (B - D)
>   (11) (B - D) * y > B * x
>   (12) y > B * x / (B - D)
> 
>   When:
>   (13) x + t < 100
>   (14) x + D * x / (B - D) < 100
>   (15) x < 100 - 100 * D / B
>   Then:
>   (16) R * y > D * (x + t)
>   (17) B * (100 - x) * y / 100 > D * (x + t)
>   (18) B * (100 - x) * y / 100 > D * (x + D * x / (B - D))
>   (19) y > 100 * D * x / ((B - D) * (100 - x))
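> 
> To double-check the two bounds numerically, here is a small standalone
> helper (illustration only, not part of the patch; the function name is made
> up) that evaluates formulas (2), (12) and (19):
> 
>     #include <stdio.h>
> 
>     /*
>      * Given the real bandwidth B, dirty page rate D (same unit) and the
>      * bitmap sync time x (ms), return the smallest downtime limit y (ms)
>      * that still meets the stop condition, per formulas (12) and (19).
>      */
>     static double min_downtime_ms(double B, double D, double x)
>     {
>         double t = D * x / (B - D);        /* (2): critical iteration time */
> 
>         if (x + t > 100.0) {               /* R = B * t / (x + t) */
>             return B * x / (B - D);        /* (12) */
>         }
>         return 100.0 * D * x / ((B - D) * (100.0 - x));   /* (19) */
>     }
> 
>     int main(void)
>     {
>         /* Numbers used below: B = 15GBps, D = 10GBps. */
>         printf("x = 73ms -> y > %.1fms\n", min_downtime_ms(15, 10, 73));
>         printf("x = 12ms -> y > %.1fms\n", min_downtime_ms(15, 10, 12));
>         return 0;
>     }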
> 
> After deriving the formulas, we can plug in some data for comparison.
> 
> For a 64C256G VM with 8 vhost-user-net devices (32 queues per NIC) and 16
> vhost-user-blk devices (4 queues per disk), the sync time is as high as
> *73ms* (tested with a 10GBps dirty page rate; the sync time increases as the
> dirty page rate increases). Here is the breakdown of the sync time:
> 
> - sync from kvm to ram_list: 2.5ms
> - vhost_log_sync: 3ms
> - sync aligned memory from ram_list to RAMBlock: 5ms
> - sync misaligned memory from ram_list to RAMBlock: 61ms
> 
> After applying this patch, syncing misaligned memory from ram_list to
> RAMBlock takes only about 1ms, and the total sync time is only *12ms*.
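> 
> To illustrate why the misaligned path dominates, here is a simplified,
> standalone sketch (not the actual qemu code; all names below are made up):
> an aligned range can be drained one 64-bit bitmap word (64 pages) per
> atomic op, while a misaligned range falls back to one atomic op per page:
> 
>     #include <stdint.h>
>     #include <stddef.h>
>     #include <stdio.h>
> 
>     #define BITS_PER_WORD 64
> 
>     /* Fast path: start and nbits are multiples of 64 dirty bits. */
>     static uint64_t sync_aligned(uint64_t *src, uint64_t *dst,
>                                  size_t start, size_t nbits)
>     {
>         uint64_t dirty = 0;
> 
>         for (size_t i = start / BITS_PER_WORD;
>              i < (start + nbits) / BITS_PER_WORD; i++) {
>             uint64_t bits = __atomic_exchange_n(&src[i], 0,
>                                                 __ATOMIC_RELAXED);
>             dst[i] |= bits;
>             dirty += __builtin_popcountll(bits);
>         }
>         return dirty;
>     }
> 
>     /* Slow path: one test-and-clear per page for a misaligned range. */
>     static uint64_t sync_misaligned(uint64_t *src, uint64_t *dst,
>                                     size_t start, size_t nbits)
>     {
>         uint64_t dirty = 0;
> 
>         for (size_t bit = start; bit < start + nbits; bit++) {
>             uint64_t mask = 1ULL << (bit % BITS_PER_WORD);
>             uint64_t old = __atomic_fetch_and(&src[bit / BITS_PER_WORD],
>                                               ~mask, __ATOMIC_RELAXED);
>             if (old & mask) {
>                 dst[bit / BITS_PER_WORD] |= mask;
>                 dirty++;
>             }
>         }
>         return dirty;
>     }
> 
>     int main(void)
>     {
>         uint64_t log_a[4] = { ~0ULL, ~0ULL, ~0ULL, ~0ULL };
>         uint64_t log_b[4] = { ~0ULL, ~0ULL, ~0ULL, ~0ULL };
>         uint64_t bmap[4] = { 0 };
> 
>         /* Aligned: 256 pages in 4 word ops; misaligned: 254 pages,
>          * one atomic op each. */
>         uint64_t n1 = sync_aligned(log_a, bmap, 0, 256);
>         uint64_t n2 = sync_misaligned(log_b, bmap, 1, 254);
>         printf("aligned: %llu pages, misaligned: %llu pages\n",
>                (unsigned long long)n1, (unsigned long long)n2);
>         return 0;
>     }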

These numbers are really helpful, thanks a lot.  Please put them into the
commit message of the patch.

OTOH, IMHO you can drop the formula and bw calculation complexities.  Your
numbers here already show how useful this patch is.

I could have amended the commit message myself when queuing, but there's a
code change I want to double check with you.  I'll reply there soon.

> 
> *First case: assume our maximum bandwidth can reach 15GBps and the dirty
> page rate is 10GBps.
> 
> If x = 73ms, in the critical convergence state,
> we use formula (2) to get t = D * x / (B - D) = 146ms;
> because x + t = 219ms > 100ms,
> we get y > B * x / (B - D) = 219ms.
> 
> If x = 12ms, in the critical convergence state,
> we use formula (2) to get t = D * x / (B - D) = 24ms;
> because x + t = 36ms < 100ms,
> we get y > 100 * D * x / ((B - D) * (100 - x)) = 27.2ms.
> 
> We can see that after optimization, under the same bandwidth and dirty rate
> scenario, the downtime limit required for dirty page convergence is
> significantly reduced.
> 
> *Second case: assume our maximum bandwidth can reach 15GBps and the downtime
> limit is set to 150ms.
> If x = 73ms:
> when x + t > 100ms,
> we use formula (12) to get D < B * (y - x) / y = 15 * (150 - 73) / 150 = 7.7GBps;
> when x + t < 100ms,
> we use formula (19) to get D < 5.35GBps.
> 
> If x = 12ms:
> when x + t > 100ms,
> we use formula (12) to get D < B * (y - x) / y = 15 * (150 - 12) / 150 = 13.8GBps;
> when x + t < 100ms,
> we use formula (19) to get D < 13.75GBps.
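> 
> (The formula (19) bounds come from solving it for D, i.e.
> D < y * B * (100 - x) / (100 * x + y * (100 - x)), which gives 5.35GBps for
> x = 73ms and 13.75GBps for x = 12ms with B = 15GBps and y = 150ms.)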
> 
> We can see that after optimization, under the same bandwidth and downtime
> limit scenario, the maximum convergent dirty page rate is significantly
> improved.
> 
> Through the above derivation, we have shown that reducing the bitmap sync
> time can significantly improve dirty page convergence.
> 
> This patch only optimizes the bitmap sync time for some scenarios.
> There may still be many scenarios where the bitmap sync time negatively
> impacts dirty page convergence, and we can try to optimize those with the
> same approach.
> 

-- 
Peter Xu

