On Mon, Dec 15, 2025 at 10:06:10PM +0800, Chuang Xu wrote:
> In this version:
>
> - drop duplicate vhost_log_sync optimization
> - refactor physical_memory_test_and_clear_dirty
> - provide a more detailed bitmap sync time breakdown in this cover letter
>
> In our long-term experience at Bytedance, we've found that under the
> same load, live migration of larger VMs with more devices is often more
> difficult to converge (requiring a larger downtime limit).
>
> We've observed that the live migration bandwidth of large, multi-device
> VMs is severely distorted, a phenomenon likely similar to the problem
> described here:
> https://wiki.qemu.org/ToDo/LiveMigration#Optimize_migration_bandwidth_calculation
>
> Through testing and calculation, we conclude that the bitmap sync time
> skews the live migration bandwidth estimate.
>
> Let me use formulaic reasoning to illustrate the relationship between
> the bitmap sync time and the downtime limit required to reach the stop
> condition.
>
> Assume the actual live migration bandwidth is B, the dirty page rate is
> D, the bitmap sync time is x (ms), the transfer time per iteration is
> t (ms), and the downtime limit is y (ms).
>
> To simplify the calculation, we assume none of the dirty pages are zero
> pages and only consider the case B > D.
>
> When x + t > 100ms, the bandwidth calculated by QEMU is
> R = B * t / (x + t); when x + t < 100ms, it is R = B * (100 - x) / 100
> (100ms is QEMU's bandwidth sampling interval, BUFFER_DELAY).
>
> At the critical convergence state, the data sent per iteration equals
> the data dirtied during that iteration (sync plus transfer), so:
>  (1) B * t = D * (x + t)
>  (2) t = D * x / (B - D)
>
> For the stop condition to be met, the remaining dirty data D * (x + t)
> must be transferable within the downtime limit at the estimated
> bandwidth R, i.e. R * y > D * (x + t). There are two cases.
>
> When:
>  (3) x + t > 100
>  (4) x + D * x / (B - D) > 100
>  (5) x > 100 - 100 * D / B
> Then:
>  (6) R * y > D * (x + t)
>  (7) B * t * y / (x + t) > D * (x + t)
>  (8) (B * (D * x / (B - D)) * y) / (x + D * x / (B - D)) > D * (x + D * x / (B - D))
>  (9) D * y > D * (x + D * x / (B - D))
> (10) y > x + D * x / (B - D)
> (11) (B - D) * y > B * x
> (12) y > B * x / (B - D)
>
> When:
> (13) x + t < 100
> (14) x + D * x / (B - D) < 100
> (15) x < 100 - 100 * D / B
> Then:
> (16) R * y > D * (x + t)
> (17) B * (100 - x) * y / 100 > D * (x + t)
> (18) B * (100 - x) * y / 100 > D * (x + D * x / (B - D))
> (19) y > 100 * D * x / ((B - D) * (100 - x))
>
> With the formulas derived, we can plug in some data for comparison.
>
> For a 64C256G VM with 8 vhost-user-net NICs (32 queues per NIC) and 16
> vhost-user-blk devices (4 queues per device), the sync time is as high
> as *73ms* (tested with a 10GBps dirty rate; the sync time increases as
> the dirty page rate increases). Here is each part of the sync time:
>
> - sync from kvm to ram_list: 2.5ms
> - vhost_log_sync: 3ms
> - sync aligned memory from ram_list to RAMBlock: 5ms
> - sync misaligned memory from ram_list to RAMBlock: 61ms
>
> After applying this patch, syncing misaligned memory from ram_list to
> RAMBlock takes only about 1ms, and the total sync time is only *12ms*.
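A quick way to sanity-check the derivation above is to plug formulas (2),
(12) and (19) into a few lines of Python (a throwaway sketch, nothing from
the QEMU tree; min_downtime_ms is a made-up name):

    # Minimum downtime limit y (ms) required for convergence.
    # B and D are in GBps; x is the bitmap sync time in ms.
    def min_downtime_ms(B, D, x):
        assert B > D
        t = D * x / (B - D)          # formula (2): critical-state iteration time
        if x + t > 100:
            return B * x / (B - D)                        # formula (12)
        return 100 * D * x / ((B - D) * (100 - x))        # formula (19)

    print(min_downtime_ms(15, 10, 73))   # 219.0 ms
    print(min_downtime_ms(15, 10, 12))   # ~27.27 ms

Both printed values match the first case quoted below.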
These numbers are greatly helpful, thanks a lot. Please put them into the
commit message of the patch. OTOH, IMHO you can drop the formula and
bandwidth calculation complexities; your numbers here already show how
useful this patch is. I could have amended the commit message myself when
queuing, but there's a code change I want to double check with you. I'll
reply there soon.

> *First case*: assume our maximum bandwidth can reach 15GBps and the
> dirty page rate is 10GBps.
>
> If x = 73ms, then at the critical convergence state formula (2) gives
> t = D * x / (B - D) = 146ms; because x + t = 219ms > 100ms, formula
> (12) gives y > B * x / (B - D) = 219ms.
>
> If x = 12ms, then at the critical convergence state formula (2) gives
> t = D * x / (B - D) = 24ms; because x + t = 36ms < 100ms, formula (19)
> gives y > 100 * D * x / ((B - D) * (100 - x)) = 27.2ms.
>
> We can see that after optimization, under the same bandwidth and dirty
> rate, the downtime limit required for dirty page convergence is
> significantly reduced.
>
> *Second case*: assume our maximum bandwidth can reach 15GBps and the
> downtime limit is set to 150ms.
>
> If x = 73ms:
> when x + t > 100ms, rearranging formula (12) for D gives
> D < B * (y - x) / y = 15 * (150 - 73) / 150 = 7.7GBps;
> when x + t < 100ms, rearranging formula (19) gives D < 5.35GBps.
>
> If x = 12ms:
> when x + t > 100ms, rearranging formula (12) for D gives
> D < B * (y - x) / y = 15 * (150 - 12) / 150 = 13.8GBps;
> when x + t < 100ms, rearranging formula (19) gives D < 13.75GBps.
>
> We can see that after optimization, under the same bandwidth and
> downtime limit, the convergent dirty page rate is significantly
> improved.
>
> The above derivation shows that reducing the bitmap sync time can
> significantly improve dirty page convergence.
>
> This patch only optimizes the bitmap sync time in some scenarios.
> There may still be many scenarios where the bitmap sync time hurts
> dirty page convergence, and we can try the same approach there too.
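The second case can be double-checked the same way, by inverting formulas
(12) and (19) for D (again just a sketch with a made-up helper name):

    # Maximum convergent dirty rate D (GBps) for bandwidth B (GBps),
    # downtime limit y (ms) and sync time x (ms).
    def max_dirty_rate(B, y, x):
        d12 = B * (y - x) / y                                 # invert (12), x + t > 100
        d19 = y * B * (100 - x) / (100 * x + y * (100 - x))   # invert (19), x + t < 100
        return d12, d19

    print(max_dirty_rate(15, 150, 73))   # (7.7, ~5.35) GBps
    print(max_dirty_rate(15, 150, 12))   # (13.8, 13.75) GBps

This reproduces the 7.7/5.35GBps and 13.8/13.75GBps figures quoted above.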
--
Peter Xu