On Wed, Dec 17, 2025 at 09:43:24PM +0800, Chuang Xu wrote:
> On 17/12/2025 21:21, Peter Xu wrote:
> > On Wed, Dec 17, 2025 at 02:46:58PM +0800, Chuang Xu wrote:
> >> On 17/12/2025 00:26, Peter Xu wrote:
> >>> On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
> >>>> "Chuang Xu" <[email protected]> writes:
> >>>>
> >>>>> From: xuchuangxclwt <[email protected]>
> >>>>>
> >>>>> In our long-term experience at Bytedance, we've found that under
> >>>>> the same load, live migration of larger VMs with more devices is
> >>>>> often harder to converge (requiring a larger downtime limit).
> >>>>>
> >>>>> Through some testing and calculations, we conclude that bitmap
> >>>>> sync time affects the calculation of live migration bandwidth.
> >>>
> >>> Side note:
> >>>
> >>> I forgot to mention this when replying to the old versions, but we
> >>> introduced avail-switchover-bandwidth to partially remedy this
> >>> problem when we hit it before - which may or may not be exactly the
> >>> same reason here on unaligned syncs, as we didn't investigate
> >>> further (we had VFIO-PCI devices when testing), but the overall
> >>> logic should be similar in that the bw was calculated too small.
> >>
> >> At Bytedance, we also migrate VMs with VFIO devices, which likewise
> >> suffer from long VFIO bitmap sync times on large VMs.
> >>
> >>> So even with this patch optimizing sync, the bw is still not as
> >>> accurate. I wonder if we can still fix it somehow, e.g. whether
> >>> 100ms is too short a period to take samples, or at least whether we
> >>> should remember more samples so the reported bw (even if we keep
> >>> sampling per 100ms) covers a longer period.
> >>>
> >>> Feel free to share your thoughts if you have any.
> >>
> >> FYI:
> >> Initially, when I encountered the problem of large VM migrations
> >> being hard to converge, I tried subtracting the bitmap sync time
> >> from the bandwidth calculation, which alleviated the problem
> >> somewhat. However, through formula calculation, I found that this
> >> did not completely solve the problem. Therefore, I
> >
> > If you ruled out sync time, why is the bw still not accurate? Have
> > you investigated that?
> >
> > Maybe there's something else happening besides the sync period you
> > excluded.
>
> Referring to the formula I wrote in the cover letter: after
> subtracting the sync time, we get the prerequisite that R = B.
> Substituting this condition into the subsequent derivation
> (B * t = D * (x + t) and R * y > D * (x + t)), we eventually get
> y > D * x / (B - D). This means that even if our bandwidth
> calculations are correct, the sync time can still affect our judgment
> of the downtime conditions.
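
(If I follow the algebra - assuming the cover letter's symbols mean
B = real bandwidth, R = reported bandwidth, D = dirty page rate,
x = bitmap sync time, t = iteration transfer time, y = downtime limit:
B * t = D * (x + t) solves to t = D * x / (B - D), and with R = B the
downtime condition R * y > D * (x + t) = B * t indeed reduces to
y > t = D * x / (B - D).)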
Right, it will, because any time used for sync has the vCPUs running,
so it contributes to the total dirtied pages and hence partly
increases D, as you pointed out.

But my point is, if you _really_ have R = B, then on e.g. a 10Gbps NIC
you should see R ~= 10Gbps. If R is not wire speed, it means R is not
being measured correctly.

I think it's likely impossible to measure R so correctly that it
equals B, but IMHO we can still look for something that brings R much
closer to B; then, with y a constant (default 300ms, for example),
migrations will start to converge where they previously could not.

E.g. QEMU can currently report R as low as 10Mbps even on a 10Gbps
NIC. IMHO it would be much better, and would start solving a lot of
such problems, if QEMU could report at least a few Gbps using all
kinds of methods (e.g. excluding sync, as you experimented with);
even if it doesn't report the full 10Gbps, that would already help.
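
To make that concrete, here is a rough sketch of what "remember more
samples and exclude sync time" could look like. This is hypothetical
illustration code, not QEMU's actual implementation: BwEstimator,
bw_add_sample(), bw_report() and the 16-sample window are all made up.

#include <stdint.h>
#include <stdio.h>

#define BW_SAMPLES 16  /* 16 x 100ms periods ~= 1.6s of history */

typedef struct {
    uint64_t bytes[BW_SAMPLES];  /* bytes sent in each 100ms period */
    uint64_t usecs[BW_SAMPLES];  /* period length minus sync time */
    int next;                    /* ring buffer index */
    int count;                   /* number of valid samples so far */
} BwEstimator;

/* Record one sampling period; sync_us is time spent in bitmap sync. */
static void bw_add_sample(BwEstimator *e, uint64_t bytes,
                          uint64_t period_us, uint64_t sync_us)
{
    /* Clamp to 1us so a fully-synced period cannot divide by zero. */
    uint64_t active_us = period_us > sync_us ? period_us - sync_us : 1;

    e->bytes[e->next] = bytes;
    e->usecs[e->next] = active_us;
    e->next = (e->next + 1) % BW_SAMPLES;
    if (e->count < BW_SAMPLES) {
        e->count++;
    }
}

/* Report bandwidth in bytes/sec over the whole remembered window. */
static double bw_report(const BwEstimator *e)
{
    uint64_t total_bytes = 0, total_us = 0;

    for (int i = 0; i < e->count; i++) {
        total_bytes += e->bytes[i];
        total_us += e->usecs[i];
    }
    return total_us ? total_bytes * 1e6 / total_us : 0.0;
}

int main(void)
{
    BwEstimator e = {0};

    /*
     * One 100ms period on a 10Gbps NIC with 30ms lost to bitmap sync:
     * ~87.5MB sent in the remaining 70ms of transfer time.  Naive
     * per-period math (bytes over the full 100ms) reports only 7Gbps;
     * excluding sync recovers the 10Gbps wire speed.
     */
    bw_add_sample(&e, 87500000, 100 * 1000, 30 * 1000);
    printf("reported bw: %.2f Gbps\n", bw_report(&e) * 8 / 1e9);
    return 0;
}

Averaging over the whole ring also addresses the "100ms may be too
short" concern: a single period dominated by sync no longer collapses
the reported bandwidth on its own.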

-- 
Peter Xu