On Thu, Dec 18, 2025 at 05:20:19PM +0800, Chuang Xu wrote:
> On 17/12/2025 22:59, Peter Xu wrote:
> > Right, it will, because any time used for sync has the vCPUs running, so
> > that will contribute to the total dirtied pages, hence partly increase D,
> > as you pointed out.
> >
> > But my point is, if you _really_ have R=B all right, you should e.g. on a
> > 10Gbps NIC be seeing R~=10Gbps.  If R is not wire speed, it means R is
> > not really being measured correctly..
>
> In my experience, the bandwidth of live migration usually doesn't reach
> the NIC's bandwidth limit (my test environment's NIC bandwidth limit is
> 200Gbps).  This could be due to various reasons: for example, the live
> migration main thread's ability to search for dirty pages may have
> reached a bottleneck; the NIC's interrupt binding range might limit the
> softirq processing capacity; there might be too few multifd threads; or
> there might be overhead in synchronizing between the live migration main
> thread and the multifd threads.

Exactly, especially when you have 200Gbps NICs.  I wish I had some of
those for testing too!  I don't, so I can't provide really useful input..
My vague memory (I once got a chance to use a 100Gbps NIC, if I recall
correctly) is that the main thread will already bottleneck there, with
(maybe?) 8 multifd threads.  I just never knew whether we need to scale
it out yet; so far, 100G/200G setups normally only happen with
direct-attached NICs, not a major use case for cluster setups?  Or maybe
I am outdated?  If that becomes a major use case at some point, and if
the main thread is the bottleneck distributing things, then we need to
scale it out.  I think it's doable.
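
Just to illustrate the measurement point above, here's a rough sketch
(the struct and function names are made up for illustration; this is not
the actual QEMU code) of how counting or excluding the sync time in the
sample window changes the reported rate R relative to the wire rate B:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustration only: report the transfer rate over one sample
     * window.  If the time spent in dirty bitmap sync is counted in
     * the window, R comes out below the wire rate B even though the
     * NIC was saturated while actually sending; subtracting the sync
     * time moves R closer to B.
     */
    typedef struct {
        uint64_t bytes_sent;   /* bytes pushed out in this window */
        int64_t window_ms;     /* wall-clock length of the window */
        int64_t sync_ms;       /* time spent syncing, not sending */
    } RateSample;

    static double rate_mbps(const RateSample *s, bool exclude_sync)
    {
        int64_t t = s->window_ms - (exclude_sync ? s->sync_ms : 0);

        if (t <= 0) {
            return 0.0;
        }
        /* bytes over milliseconds -> megabits per second */
        return (double)s->bytes_sent * 8.0 / 1e6 / ((double)t / 1000.0);
    }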

> >
> > I think it's likely impossible to measure the correct R so that it'll
> > equal B, however IMHO we can still think about something that makes R
> > get much closer to B; then, when y is normally a constant (default
> > 300ms, for example), it'll start to converge where it used to not be
> > able to.
>
> Yes, there are always various factors that can cause measurement errors.
> We can only try to make the calculated value as close as possible to the
> actual value.
>
> > E.g. QEMU can currently report R as low as 10Mbps even if on 10Gbps;
> > IMHO it'll be much better and start solving a lot of such problems if
> > it can start to report at least a few Gbps based on all kinds of
> > methods (e.g. excluding sync, as you experimented), then even if it's
> > not reporting 10Gbps it'll help.
>
> After I applied these optimizations, the bandwidth statistics from QEMU
> and the real-time NIC bandwidth monitored by atop are typically close.
>
> Those extremely low bandwidths (consistent with atop monitoring, though)
> are usually caused by zero pages or dirty pages with extremely high
> compression rates.  In these cases, QEMU uses very little NIC bandwidth
> to transmit a large number of dirty pages, but the bandwidth is only
> calculated based on the actual amount of data transmitted.

Yes.  That's a major issue in QEMU: zero pages / compressed pages / ...
not only affect how QEMU "measures" the mbps, but also how QEMU decides
when to converge.  Here I'm not talking about the bw difference making
"bw * downtime_limit" [A] too small; I'm talking about the other side of
the equation, where we compare [A] against "remain_dirty_pages * psize"
[B].  In reality, [B] isn't accurate either when zero pages / compressed
pages / ... are used..

Maybe.. the switchover decision shouldn't use MBps as the unit, but
"number of pages".  That would remove most of those effects at least,
but that needs some more consideration..
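
To make that concrete, here's a rough sketch of what a page-count based
check could look like (the names are invented for illustration; this is
not a patch against the real migration code, where the current check is
byte-based):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustration only.  Today the decision is roughly
     *   remain_dirty_pages * psize  <=  bw * downtime_limit
     * i.e. [B] <= [A], both sides in bytes, so zero pages and
     * compression skew both sides.  A page-based variant compares
     * page counts instead:
     */
    static bool switchover_ok_pages(uint64_t remain_dirty_pages,
                                    double pages_per_second,
                                    int64_t downtime_limit_ms)
    {
        /* pages we expect to fit into the allowed downtime */
        double page_budget = pages_per_second * downtime_limit_ms / 1000.0;

        return (double)remain_dirty_pages <= page_budget;
    }

Of course, pages_per_second would itself need to be measured in pages
actually sent rather than bytes, which is where your point below comes
in.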

>
> If we want to use the actual number of dirty pages transmitted to
> calculate the bandwidth, we face another risk: if the dirty pages
> transmitted before the downtime have a high compression ratio, and the
> dirty pages to be transmitted after the downtime have a low compression
> ratio, then the downtime will far exceed expectations.

... like what you mentioned here will also be an issue if we switch to
using n_pages to do the math. :)

>
> This may have strayed a bit, but just providing some potentially useful
> information from my perspective.

Not really; the patch alone is good, and I appreciate the discussion.

Thanks,

--
Peter Xu