Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Jonah Palmer writes:

[...]

>> I think I finally know enough to give you constructive feedback.
>>
>> Your commit messages should answer the questions I had. Specifically:
>>
>> * Why are we doing this? To shorten guest-visible downtime.
>>
>> * How are we doing this? We additionally pin memory before entering the
>>   main loop. This speeds up the pinning we still do in the main loop.
>>
>> * Drawback: slower startup. In particular, QMP becomes available later.
>>
>> * Secondary benefit: main loop responsiveness improves, in particular
>>   QMP.
>>
>> * What uses of QEMU are affected? Only with vhost-vDPA. Spell out all
>>   the ways to get vhost-vDPA, please.
>>
>> * There's a tradeoff. Show your numbers. Discuss whether this needs to
>>   be configurable.
>>
>> If you can make a case for pinning memory this way always, do so. If
>> you believe making it configurable would be a good idea, do so. If
>> you're not sure, say so in the cover letter, and add a suitable TODO
>> comment.
>>
>> Questions?
>
> No questions, understood.
>
> As I was writing the responses to your questions I was thinking to
> myself that this stuff should've been in the cover letter / commit
> messages in the first place.
>
> Definitely a learning moment for me. Thanks for your time on this Markus!

You're welcome!
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 7/8/25 4:17 AM, Markus Armbruster wrote: Jonah Palmer writes: On 7/4/25 11:00 AM, Markus Armbruster wrote: Jonah Palmer writes: [...] So, total time increases: early pinning (before main loop) takes more time than we save pinning (in the main loop). Correct? Correct. We only save ~0.07s from the pinning that happens in the main loop. But the extra 3s we now need to spend pinning before qemu_main_loop() overshadows it. Got it. We want this trade, because the time spent in the main loop is a problem: guest-visible downtime. Correct? [...] Correct. Though whether or not we want this trade I suppose is subjective. But the 50-60% reduction in guest-visible downtime is pretty nice if we can stomach the initial startup costs. I'll get back to this at the end. [...] Let me circle back to my question: Under what circumstances is QMP responsiveness affected? The answer seems to be "only when we're using a vhost-vDPA device". Correct? Correct, since using one of these guys causes us to do this memory pinning. If we're not using one, it's business as usual for Qemu. Got it. We're using one exactly when QEMU is running with one of its vhost-vdpa-device-pci* device models. Correct? Yea, or something like: -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \ -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \ I'll get back to this at the end. [...] Let me recap: * No change at all unless we're pinning memory early, and we're doing that only when we're using a vhost-vDPA device. Correct? * If we are using a vhost-vDPA device: - Total startup time (until we're done pinning) increases. Correct. - QMP becomes available later. Correct. - Main loop behavior improves: less guest-visible downtime, QMP more responsive (once it's available) Correct. Though the improvement is modest at best if we put aside the guest-visible downtime improvement. This is a tradeoff we want always. There is no need to let users pick "faster startup, worse main loop behavior." 
"Always" might be subjective here. For example, if there's no desire to perform live migration, then the user kinda just gets stuck with the cons. Whether or not we want to make this configurable though is another discussion. Correct? [...] I think I finally know enough to give you constructive feedback. Your commit messages should answer the questions I had. Specifically: * Why are we doing this? To shorten guest-visible downtime. * How are we doing this? We additionally pin memory before entering the main loop. This speeds up the pinning we still do in the main loop. * Drawback: slower startup. In particular, QMP becomes available later. * Secondary benefit: main loop responsiveness improves, in particular QMP. * What uses of QEMU are affected? Only with vhost-vDPA. Spell out all the ways to get vhost-vDPA, please. * There's a tradeoff. Show your numbers. Discuss whether this needs to be configurable. If you can make a case for pinning memory this way always, do so. If you believe making it configurable would be a good idea, do so. If you're not sure, say so in the cover letter, and add a suitable TODO comment. Questions? No questions, understood. As I was writing the responses to your questions I was thinking to myself that this stuff should've been in the cover letter / commit messages in the first place. Definitely a learning moment for me. Thanks for your time on this Markus! Jonah
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Jonah Palmer writes: > On 7/4/25 11:00 AM, Markus Armbruster wrote: >> Jonah Palmer writes: [...] >> So, total time increases: early pinning (before main loop) takes more >> time than we save pinning (in the main loop). Correct? > > Correct. We only save ~0.07s from the pinning that happens in the main loop. > But the extra 3s we now need to spend pinning before qemu_main_loop() > overshadows it. Got it. >> We want this trade, because the time spent in the main loop is a >> problem: guest-visible downtime. Correct? >> [...] > > Correct. Though whether or not we want this trade I suppose is subjective. > But the 50-60% reduction in guest-visible downtime is pretty nice if we can > stomach the initial startup costs. I'll get back to this at the end. [...] >> Let me circle back to my question: Under what circumstances is QMP >> responsiveness affected? >> >> The answer seems to be "only when we're using a vhost-vDPA device". >> Correct? > > Correct, since using one of these guys causes us to do this memory pinning. > If we're not using one, it's business as usual for Qemu. Got it. >> We're using one exactly when QEMU is running with one of its >> vhost-vdpa-device-pci* device models. Correct? > > Yea, or something like: > > -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \ > -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \ I'll get back to this at the end. [...] >> Let me recap: >> >> * No change at all unless we're pinning memory early, and we're doing >>that only when we're using a vhost-vDPA device. Correct? >> >> * If we are using a vhost-vDPA device: >>- Total startup time (until we're done pinning) increases. > > Correct. > >>- QMP becomes available later. > > Correct. > >>- Main loop behavior improves: less guest-visible downtime, QMP more >> responsive (once it's available) > > Correct. Though the improvement is modest at best if we put aside the > guest-visible downtime improvement. > >>This is a tradeoff we want always. 
There is no need to let users pick "faster startup, worse main loop behavior."

> "Always" might be subjective here. For example, if there's no desire to
> perform live migration, then the user kinda just gets stuck with the cons.
>
> Whether or not we want to make this configurable though is another
> discussion.

>> Correct?
>>
>> [...]

I think I finally know enough to give you constructive feedback.

Your commit messages should answer the questions I had. Specifically:

* Why are we doing this? To shorten guest-visible downtime.

* How are we doing this? We additionally pin memory before entering the
  main loop. This speeds up the pinning we still do in the main loop.

* Drawback: slower startup. In particular, QMP becomes available later.

* Secondary benefit: main loop responsiveness improves, in particular
  QMP.

* What uses of QEMU are affected? Only with vhost-vDPA. Spell out all
  the ways to get vhost-vDPA, please.

* There's a tradeoff. Show your numbers. Discuss whether this needs to
  be configurable.

If you can make a case for pinning memory this way always, do so. If
you believe making it configurable would be a good idea, do so. If
you're not sure, say so in the cover letter, and add a suitable TODO
comment.

Questions?
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 7/4/25 11:00 AM, Markus Armbruster wrote: Jonah Palmer writes: On 6/26/25 8:08 AM, Markus Armbruster wrote: [...] Apologies for the delay in getting back to you. I just wanted to be thorough and answer everything as accurately and clearly as possible. Before these patches, pinning started in vhost_vdpa_dev_start(), where the memory listener was registered, and began calling vhost_vdpa_listener_region_add() to invoke the actual memory pinning. This happens after entering qemu_main_loop(). After these patches, pinning started in vhost_dev_init() (specifically vhost_vdpa_set_owner()), where the memory listener registration was moved to. This happens *before* entering qemu_main_loop(). However, not all of the pinning happens pre qemu_main_loop(). The pinning that happens before we enter qemu_main_loop() is the full guest RAM pinning, which is the main, heavy-lifting work when it comes to pinning memory. The rest of the pinning work happens after entering qemu_main_loop() (at approximately the same point as when pinning started before these patches). But, since we already did the heavy lifting of the pinning work pre qemu_main_loop() (e.g. all pages were already allocated and pinned), we're just re-pinning here (i.e. the kernel just updates its IOTLB tables for pages that're already mapped and locked in RAM). This makes the pinning work we do after entering qemu_main_loop() much faster compared to the same pinning we had to do before these patches. However, we have to pay a cost for this. Because we do the heavy-lifting work earlier, pre qemu_main_loop(), we're pinning cold memory. That is, the guest hasn't touched its memory yet; all host pages are still anonymous and unallocated. This essentially means that doing the pinning earlier is more expensive time-wise, given that we also need to allocate physical pages for each chunk of memory. 
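The cold-memory cost described here (the first touch of anonymous memory has to allocate and fault in each page) can be illustrated outside QEMU with a small demand-paging experiment. This is a generic sketch of the mechanism, not QEMU code; the buffer size is arbitrary:

```python
# Touch every page of a fresh anonymous mapping twice and time both passes.
# The first pass pays a page fault plus allocation per page (the "cold"
# cost that early pinning now pays); the second pass finds the pages
# already resident ("warm").
import mmap
import time

PAGE = mmap.PAGESIZE
LENGTH = 64 * 1024 * 1024  # 64 MiB anonymous mapping

buf = mmap.mmap(-1, LENGTH)  # fresh, untouched ("cold") memory

def touch_every_page(m):
    t0 = time.monotonic()
    for off in range(0, LENGTH, PAGE):
        m[off] = 1  # one write per page is enough to fault it in
    return time.monotonic() - t0

cold = touch_every_page(buf)  # allocation + fault per page
warm = touch_every_page(buf)  # pages already resident
print(f"cold: {cold:.4f}s  warm: {warm:.4f}s")
```

On a typical Linux host the cold pass is noticeably slower than the warm one, which is the same asymmetry Jonah measures between early pinning and the later re-pinning.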
To (hopefully) show this more clearly, I ran some tests before and after these patches and averaged the results. I used a 50G guest with real vDPA hardware (Mellanox CX-6Dx):

0.) How many vhost_vdpa_listener_region_add() (pins) calls?

                   | Total | Before qemu_main_loop | After qemu_main_loop
    ---------------+-------+-----------------------+---------------------
    Before patches |   6   |           0           |          6
    After patches  |  11   |           5           |          6

- After the patches, this looks like we doubled the work we're doing (given the extra 5 calls), however, the 6 calls that happen after entering qemu_main_loop() are essentially replays of the first 5 we did.

  * In other words, after the patches, the 6 calls made after entering qemu_main_loop() are performed much faster than the same 6 calls before the patches.

  * From my measurements, these are the timings it took to perform those 6 calls after entering qemu_main_loop():
    > Before patches: 0.0770s
    > After patches: 0.0065s

---

1.) Time from starting the guest to entering qemu_main_loop():
  * Before patches: 0.112s
  * After patches: 3.900s

- This is due to the 5 early pins we're doing now with these patches, whereas before we never did any pinning work at all.

- From measuring the time between the first and last vhost_vdpa_listener_region_add() calls during this period, this comes out to ~3s for the early pinning.

So, total time increases: early pinning (before main loop) takes more time than we save pinning (in the main loop). Correct?

Correct. We only save ~0.07s from the pinning that happens in the main loop. But the extra 3s we now need to spend pinning before qemu_main_loop() overshadows it.

We want this trade, because the time spent in the main loop is a problem: guest-visible downtime. Correct? [...]

Correct. Though whether or not we want this trade I suppose is subjective. But the 50-60% reduction in guest-visible downtime is pretty nice if we can stomach the initial startup costs.

Let's see whether I understand... Please correct my mistakes.

Memory pinning takes several seconds for large guests. 
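A quick arithmetic sanity check on the measurements quoted in this message (the four input values are copied from the numbers above):

```python
# Compare what the patches cost at startup with what they save inside
# the main loop, using the figures from the 50G-guest measurements.
pre_loop_before = 0.112   # s to reach qemu_main_loop(), before patches
pre_loop_after = 3.900    # s to reach qemu_main_loop(), after patches
in_loop_before = 0.0770   # s of pinning inside the main loop, before
in_loop_after = 0.0065    # s of pinning inside the main loop, after

extra_startup = pre_loop_after - pre_loop_before   # ~3.79 s paid early
saved_in_loop = in_loop_before - in_loop_after     # ~0.07 s saved late

print(f"extra startup cost:       {extra_startup:.3f}s")
print(f"main-loop pinning saved:  {saved_in_loop:.4f}s")
# Total startup time clearly goes up; the point of the trade is the
# guest-visible downtime reduction, not total time.
```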
Your patch makes pinning much slower. You're theorizing this is because pinning cold memory is slower than pinning warm memory. I suppose the extra time is saved elsewhere, i.e. the entire startup time remains roughly the same. Have you verified this experimentally? Based on my measurements, we pay a ~3s increase in initialization time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for a vhost-vDPA device. This resulted in: * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s). * Shorter downtime phase during live migration (see below). * Slight increase in time for the device to be operational (e.g. guest sets DRIVER_OK). > This measured the time from guest start to the guest setting DRIVER_OK for the device: Before patch
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Jonah Palmer writes:

> On 6/26/25 8:08 AM, Markus Armbruster wrote:

[...]

> Apologies for the delay in getting back to you. I just wanted to be thorough
> and answer everything as accurately and clearly as possible.
>
> Before these patches, pinning started in vhost_vdpa_dev_start(), where the
> memory listener was registered, and began calling
> vhost_vdpa_listener_region_add() to invoke the actual memory pinning. This
> happens after entering qemu_main_loop().
>
> After these patches, pinning started in vhost_dev_init() (specifically
> vhost_vdpa_set_owner()), where the memory listener registration was moved to.
> This happens *before* entering qemu_main_loop().
>
> However, not all of the pinning happens pre qemu_main_loop(). The
> pinning that happens before we enter qemu_main_loop() is the full guest RAM
> pinning, which is the main, heavy-lifting work when it comes to pinning
> memory.
>
> The rest of the pinning work happens after entering qemu_main_loop()
> (at approximately the same point as when pinning started before these
> patches). But, since we already did the heavy lifting of the pinning work pre
> qemu_main_loop() (e.g. all pages were already allocated and pinned), we're
> just re-pinning here (i.e. the kernel just updates its IOTLB tables for pages
> that're already mapped and locked in RAM).
>
> This makes the pinning work we do after entering qemu_main_loop() much faster
> compared to the same pinning we had to do before these patches.
>
> However, we have to pay a cost for this. Because we do the heavy-lifting work
> earlier, pre qemu_main_loop(), we're pinning cold memory. That is, the
> guest hasn't touched its memory yet; all host pages are still anonymous
> and unallocated. This essentially means that doing the pinning earlier is
> more expensive time-wise, given that we also need to allocate physical pages
> for each chunk of memory. 
>
> To (hopefully) show this more clearly, I ran some tests before and after
> these patches and averaged the results. I used a 50G guest with real vDPA
> hardware (Mellanox CX-6Dx):
>
> 0.) How many vhost_vdpa_listener_region_add() (pins) calls?
>
>                  | Total | Before qemu_main_loop | After qemu_main_loop
>   ---------------+-------+-----------------------+---------------------
>   Before patches |   6   |           0           |          6
>   After patches  |  11   |           5           |          6
>
> - After the patches, this looks like we doubled the work we're doing (given
> the extra 5 calls), however, the 6 calls that happen after entering
> qemu_main_loop() are essentially replays of the first 5 we did.
>
> * In other words, after the patches, the 6 calls made after entering
> qemu_main_loop() are performed much faster than the same 6 calls before the
> patches.
>
> * From my measurements, these are the timings it took to perform those 6
> calls after entering qemu_main_loop():
>> Before patches: 0.0770s
>> After patches: 0.0065s
>
> ---
>
> 1.) Time from starting the guest to entering qemu_main_loop():
> * Before patches: 0.112s
> * After patches: 3.900s
>
> - This is due to the 5 early pins we're doing now with these patches, whereas
> before we never did any pinning work at all.
>
> - From measuring the time between the first and last
> vhost_vdpa_listener_region_add() calls during this period, this comes out to
> ~3s for the early pinning.

So, total time increases: early pinning (before main loop) takes more time than we save pinning (in the main loop). Correct?

We want this trade, because the time spent in the main loop is a problem: guest-visible downtime. Correct?

[...]

>> Let's see whether I understand... Please correct my mistakes.
>>
>> Memory pinning takes several seconds for large guests.
>>
>> Your patch makes pinning much slower. You're theorizing this is because
>> pinning cold memory is slower than pinning warm memory.
>>
>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>> time remains roughly the same. Have you verified this experimentally? 
>
> Based on my measurements, we pay a ~3s increase in initialization
> time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning
> earlier for a vhost-vDPA device. This resulted in:
>
> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>
> * Shorter downtime phase during live migration (see below).
>
> * Slight increase in time for the device to be operational (e.g. guest sets
> DRIVER_OK).
>
> This measured the time from guest start to the guest setting DRIVER_OK for
> the device:
>
> Before patches: 22.46s
> After patches: 23.40s
>
> The real timesaver here is the guest-visible downtime during live migration
> (when using a vhost-vDPA device). Since the heavy lifting of the memory
> pinning is done during the initialization phas
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 6/26/25 8:08 AM, Markus Armbruster wrote: Jonah Palmer writes: On 6/2/25 4:29 AM, Markus Armbruster wrote: Butterfingers... let's try this again. Markus Armbruster writes: Si-Wei Liu writes: On 5/26/2025 2:16 AM, Markus Armbruster wrote: Si-Wei Liu writes: On 5/15/2025 11:40 PM, Markus Armbruster wrote: Jason Wang writes: On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neither traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment that all devices are initialized. So moving that operation allows QEMU to communicate the maps to the kernel while the workload is still running in the source, so Linux can start mapping them. As a small drawback, there is a time in the initialization where QEMU cannot respond to QMP etc. By some testing, this time is about 0.2 seconds. Adding Markus to see if this is a real problem or not. I guess the answer is "depends", and to get a more useful one, we need more information. When all you care is time from executing qemu-system-FOO to guest finish booting, and the guest takes 10s to boot, then an extra 0.2s won't matter much. There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about). 
When a management application runs qemu-system-FOO several times to probe its capabilities via QMP, then even milliseconds can hurt. Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specific to VMs with VFIO or vDPA devices that have to pin memory for DMA. That said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's the same delay or QMP hiccup when VFIO is around in QEMU args. In what scenarios exactly is QMP delayed? That said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well. In what scenarios exactly is QMP delayed compared to before the patch? The page pinning process now runs in a pretty early phase at qemu_init() e.g. machine_run_board_init(), It runs within

    qemu_init()
      qmp_x_exit_preconfig()
        qemu_init_board()
          machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command x-exit-preconfig. Correct? before any QMP command can be serviced, the latter of which typically would be able to get run from qemu_main_loop() until the AIO gets a chance to be started to get polled and dispatched to bh. We create the QMP monitor within qemu_create_late_backends(), which runs before qmp_x_exit_preconfig(), but commands get processed only in the main loop, which we enter later. Correct? Technically it's not a real delay for specific QMP command, but rather an extended span of initialization process may take place before the very first QMP request, usually qmp_capabilities, will be serviced. 
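The delay being discussed — the span before the very first qmp_capabilities request is serviced — can be measured from the management side. A minimal probe sketch (the socket path is an assumption; it presumes QEMU was launched with something like -qmp unix:/tmp/qmp.sock,server=on,wait=off):

```python
# Time the span from connecting to a QMP socket until the
# qmp_capabilities handshake completes, i.e. until the main loop is
# actually servicing QMP commands.
import json
import socket
import time

def time_first_qmp_response(path, timeout=60.0):
    t0 = time.monotonic()
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.connect(path)
    f = sock.makefile("rw")
    greeting = json.loads(f.readline())   # QMP greeting: {"QMP": {...}}
    assert "QMP" in greeting
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    reply = json.loads(f.readline())      # answered once the main loop runs
    assert "return" in reply
    sock.close()
    return time.monotonic() - t0
```

Run right after spawning QEMU, the returned span is what a management application actually waits before its first command is answered — the quantity that grows when pinning moves before the main loop.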
It's natural for mgmt software to expect initialization delay for the first qmp_capabilities response if it has to immediately issue one after launching QEMU, especially when you have a large guest with hundreds of GBs of memory and with a passthrough device that has to pin memory for DMA, e.g. VFIO, the delayed effect from the QEMU initialization process is very visible too. The work clearly needs to be done. Whether it needs to be blocking other things is less clear. Even if it doesn't need to be blocking, we may choose not to avoid blocking for now. That should be an informed decision, though. All I'm trying to do here is understand the tradeoffs, so I can give useful advice. On the other hand, before the patch, if memory happens to be in the middle of being pinned, any ongoing QMP can't be serviced by the QEMU main loop, either. When exactly does this pinning happen before the patch? In which function? Before the patches, the memory listener was registered in vhost_vdpa_dev_start(), well after device initialization. And by device initialization here I mean the qemu_create_late_backends() function. With these patches, the memory lis
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Jonah Palmer writes: > On 6/2/25 4:29 AM, Markus Armbruster wrote: >> Butterfingers... let's try this again. >> >> Markus Armbruster writes: >> >>> Si-Wei Liu writes: >>> On 5/26/2025 2:16 AM, Markus Armbruster wrote: > Si-Wei Liu writes: > >> On 5/15/2025 11:40 PM, Markus Armbruster wrote: >>> Jason Wang writes: >>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: > Current memory operations like pinning may take a lot of time at the > destination. Currently they are done after the source of the > migration is > stopped, and before the workload is resumed at the destination. This > is a > period where neither traffic can flow, nor the VM workload can > continue > (downtime). > > We can do better as we know the memory layout of the guest RAM at the > destination from the moment that all devices are initialized. So > moving that operation allows QEMU to communicate the maps to the kernel > while the workload is still running in the source, so Linux can start > mapping them. > > As a small drawback, there is a time in the initialization where QEMU > cannot respond to QMP etc. By some testing, this time is about > 0.2 seconds. Adding Markus to see if this is a real problem or not. >>> I guess the answer is "depends", and to get a more useful one, we need >>> more information. >>> >>> When all you care is time from executing qemu-system-FOO to guest >>> finish booting, and the guest takes 10s to boot, then an extra 0.2s >>> won't matter much. >> There's no such delay of an extra 0.2s or higher per se, it's just >> shifting around the page pinning hiccup, no matter it is 0.2s or >> something else, from the time of guest booting up to before guest is >> booted. This saves back guest boot time or start up delay, but in turn >> the same delay effectively will be charged to VM launch time. We follow >> the same model with VFIO, which would see the same hiccup during launch >> (at an early stage where no real mgmt software would care about). 
>> >>> When a management application runs qemu-system-FOO several times to >>> probe its capabilities via QMP, then even milliseconds can hurt. >>> >> Not something like that, this page pinning hiccup is one time only that >> occurs in the very early stage when launching QEMU, i.e. there's no >> consistent delay every time when QMP is called. The delay in QMP >> response at that very point depends on how much memory the VM has, but >> this is just specific to VMs with VFIO or vDPA devices that have to pin >> memory for DMA. That said, there's no extra delay at all if QEMU args >> has no vDPA device assignment, on the other hand, there's the same delay or >> QMP hiccup when VFIO is around in QEMU args. >> >>> In what scenarios exactly is QMP delayed? >> That said, this is not a new problem to QEMU in particular, this QMP >> delay is not peculiar, it's existent on VFIO as well. > > In what scenarios exactly is QMP delayed compared to before the patch? The page pinning process now runs in a pretty early phase at qemu_init() e.g. machine_run_board_init(), >>> >>> It runs within >>> >>> qemu_init() >>> qmp_x_exit_preconfig() >>> qemu_init_board() >>> machine_run_board_init() >>> >>> Except when --preconfig is given, it instead runs within QMP command >>> x-exit-preconfig. >>> >>> Correct? >>> before any QMP command can be serviced, the latter of which typically would be able to get run from qemu_main_loop() until the AIO gets a chance to be started to get polled and dispatched to bh. >>> >>> We create the QMP monitor within qemu_create_late_backends(), which runs >>> before qmp_x_exit_preconfig(), but commands get processed only in the >>> main loop, which we enter later. >>> >>> Correct? >>> Technically it's not a real delay for specific QMP command, but rather an extended span of initialization process may take place before the very first QMP request, usually qmp_capabilities, will be serviced. 
It's natural for mgmt software to expect initialization delay for the first qmp_capabilities response if it has to immediately issue one after launching QEMU, especially when you have a large guest with hundreds of GBs of memory and with a passthrough device that has to pin memory for DMA, e.g. VFIO, the delayed effect from the QEMU initialization process is very visible too. >> >> The work clearly needs to be done. Whether it needs to be blocking >> other things is less clear. >> >> Even if it doesn't need to be blocking, we may choose not to avoid >> blocking for now. That should be an informed decision, though. >> >> All I'm trying to do here i
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 6/2/25 4:29 AM, Markus Armbruster wrote: Butterfingers... let's try this again. Markus Armbruster writes: Si-Wei Liu writes: On 5/26/2025 2:16 AM, Markus Armbruster wrote: Si-Wei Liu writes: On 5/15/2025 11:40 PM, Markus Armbruster wrote: Jason Wang writes: On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neither traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment that all devices are initialized. So moving that operation allows QEMU to communicate the maps to the kernel while the workload is still running in the source, so Linux can start mapping them. As a small drawback, there is a time in the initialization where QEMU cannot respond to QMP etc. By some testing, this time is about 0.2 seconds. Adding Markus to see if this is a real problem or not. I guess the answer is "depends", and to get a more useful one, we need more information. When all you care is time from executing qemu-system-FOO to guest finish booting, and the guest takes 10s to boot, then an extra 0.2s won't matter much. There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about). When a management application runs qemu-system-FOO several times to probe its capabilities via QMP, then even milliseconds can hurt. 
Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specific to VMs with VFIO or vDPA devices that have to pin memory for DMA. That said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's the same delay or QMP hiccup when VFIO is around in QEMU args. In what scenarios exactly is QMP delayed? That said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well. In what scenarios exactly is QMP delayed compared to before the patch? The page pinning process now runs in a pretty early phase at qemu_init() e.g. machine_run_board_init(), It runs within

    qemu_init()
      qmp_x_exit_preconfig()
        qemu_init_board()
          machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command x-exit-preconfig. Correct? before any QMP command can be serviced, the latter of which typically would be able to get run from qemu_main_loop() until the AIO gets a chance to be started to get polled and dispatched to bh. We create the QMP monitor within qemu_create_late_backends(), which runs before qmp_x_exit_preconfig(), but commands get processed only in the main loop, which we enter later. Correct? Technically it's not a real delay for specific QMP command, but rather an extended span of initialization process may take place before the very first QMP request, usually qmp_capabilities, will be serviced. It's natural for mgmt software to expect initialization delay for the first qmp_capabilities response if it has to immediately issue one after launching QEMU, especially when you have a large guest with hundreds of GBs of memory and with a passthrough device that has to pin memory for DMA e.g. 
VFIO, the delayed effect from the QEMU initialization process is very visible too. The work clearly needs to be done. Whether it needs to be blocking other things is less clear. Even if it doesn't need to be blocking, we may choose not to avoid blocking for now. That should be an informed decision, though. All I'm trying to do here is understand the tradeoffs, so I can give useful advice. On the other hand, before the patch, if memory happens to be in the middle of being pinned, any ongoing QMP can't be serviced by the QEMU main loop, either. When exactly does this pinning happen before the patch? In which function? Before the patches, the memory listener was registered in vhost_vdpa_dev_start(), well after device initialization. And by device initialization here I mean the qemu_create_late_backends() function. With these patches, the memory listener is now being registered in vhost_vdpa_set_owner(), called from vhost_dev_init
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Butterfingers... let's try this again. Markus Armbruster writes: > Si-Wei Liu writes: > >> On 5/26/2025 2:16 AM, Markus Armbruster wrote: >>> Si-Wei Liu writes: >>> On 5/15/2025 11:40 PM, Markus Armbruster wrote: > Jason Wang writes: > >> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer >> wrote: >>> Current memory operations like pinning may take a lot of time at the >>> destination. Currently they are done after the source of the migration >>> is >>> stopped, and before the workload is resumed at the destination. This >>> is a >>> period where neither traffic can flow, nor the VM workload can continue >>> (downtime). >>> >>> We can do better as we know the memory layout of the guest RAM at the >>> destination from the moment that all devices are initialized. So >>> moving that operation allows QEMU to communicate the maps to the kernel >>> while the workload is still running in the source, so Linux can start >>> mapping them. >>> >>> As a small drawback, there is a time in the initialization where QEMU >>> cannot respond to QMP etc. By some testing, this time is about >>> 0.2 seconds. >> Adding Markus to see if this is a real problem or not. > I guess the answer is "depends", and to get a more useful one, we need > more information. > > When all you care about is the time from executing qemu-system-FOO to the guest > finishing boot, and the guest takes 10s to boot, then an extra 0.2s > won't matter much. There's no such delay of an extra 0.2s or higher per se; it's just shifting the page pinning hiccup around, whether it is 0.2s or something else, from guest boot time to before the guest boots. This wins back guest boot time or startup delay, but in turn the same delay effectively gets charged to VM launch time. We follow the same model as VFIO, which sees the same hiccup during launch (at an early stage that no real mgmt software cares about).
> When a management application runs qemu-system-FOO several times to > probe its capabilities via QMP, then even milliseconds can hurt. > Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args. > In what scenarios exactly is QMP delayed? Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well. >>> >>> In what scenarios exactly is QMP delayed compared to before the patch? >> >> The page pinning process now runs in a pretty early phase at >> qemu_init() e.g. machine_run_board_init(), > > It runs within > > qemu_init() > qmp_x_exit_preconfig() > qemu_init_board() > machine_run_board_init() > > Except when --preconfig is given, it instead runs within QMP command > x-exit-preconfig. > > Correct? > >> before any QMP command can be serviced, the latter of which typically >> would be able to get run from qemu_main_loop() until the AIO gets >> chance to be started to get polled and dispatched to bh. > > We create the QMP monitor within qemu_create_late_backends(), which runs > before qmp_x_exit_preconfig(), but commands get processed only in the > main loop, which we enter later. > > Correct? > >> Technically it's not a real delay for specific QMP command, but rather >> an extended span of initialization process may take place before the >> very first QMP request, usually qmp_capabilities, will be >> serviced. 
It's natural for mgmt software to expect initialization >> delay for the first qmp_capabilities response if it has to immediately >> issue one after launching qemu, especially when you have a large guest >> with hundred GBs of memory and with passthrough device that has to pin >> memory for DMA e.g. VFIO, the delayed effect from the QEMU >> initialization process is very visible too. The work clearly needs to be done. Whether it needs to be blocking other things is less clear. Even if it doesn't need to be blocking, we may choose not to avoid blocking for now. That should be an informed decision, though. All I'm trying to do here is understand the tradeoffs, so I can give useful advice. >> On the other hand, before >> the patch, if memory happens to be in the middle of being pinned, any >> ongoing QMP can't be serviced by the QEMU main loop, either. When exactly does t
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Si-Wei Liu writes: > On 5/26/2025 2:16 AM, Markus Armbruster wrote: >> Si-Wei Liu writes: >> >>> On 5/15/2025 11:40 PM, Markus Armbruster wrote: Jason Wang writes: > On Thu, May 8, 2025 at 2:47 AM Jonah Palmer > wrote: >> Current memory operations like pinning may take a lot of time at the >> destination. Currently they are done after the source of the migration >> is >> stopped, and before the workload is resumed at the destination. This is >> a >> period where neigher traffic can flow, nor the VM workload can continue >> (downtime). >> >> We can do better as we know the memory layout of the guest RAM at the >> destination from the moment that all devices are initializaed. So >> moving that operation allows QEMU to communicate the kernel the maps >> while the workload is still running in the source, so Linux can start >> mapping them. >> >> As a small drawback, there is a time in the initialization where QEMU >> cannot respond to QMP etc. By some testing, this time is about >> 0.2seconds. > Adding Markus to see if this is a real problem or not. I guess the answer is "depends", and to get a more useful one, we need more information. When all you care is time from executing qemu-system-FOO to guest finish booting, and the guest takes 10s to boot, then an extra 0.2s won't matter much. >>> >>> There's no such delay of an extra 0.2s or higher per se, it's just shifting >>> around the page pinning hiccup, no matter it is 0.2s or something else, >>> from the time of guest booting up to before guest is booted. This saves >>> back guest boot time or start up delay, but in turn the same delay >>> effectively will be charged to VM launch time. We follow the same model >>> with VFIO, which would see the same hiccup during launch (at an early stage >>> where no real mgmt software would care about). >>> When a management application runs qemu-system-FOO several times to probe its capabilities via QMP, then even milliseconds can hurt. 
>>> Not something like that, this page pinning hiccup is one time only that >>> occurs in the very early stage when launching QEMU, i.e. there's no >>> consistent delay every time when QMP is called. The delay in QMP response >>> at that very point depends on how much memory the VM has, but this is just >>> specif to VM with VFIO or vDPA devices that have to pin memory for DMA. >>> Having said, there's no extra delay at all if QEMU args has no vDPA device >>> assignment, on the other hand, there's same delay or QMP hiccup when VFIO >>> is around in QEMU args. >>> In what scenarios exactly is QMP delayed? >>> >>> Having said, this is not a new problem to QEMU in particular, this QMP >>> delay is not peculiar, it's existent on VFIO as well. >> >> In what scenarios exactly is QMP delayed compared to before the patch? > > The page pinning process now runs in a pretty early phase at > qemu_init() e.g. machine_run_board_init(), It runs within qemu_init() qmp_x_exit_preconfig() qemu_init_board() machine_run_board_init() Except when --preconfig is given, it instead runs within QMP command x-exit-preconfig. Correct? > before any QMP command can be serviced, the latter of which typically > would be able to get run from qemu_main_loop() until the AIO gets > chance to be started to get polled and dispatched to bh. We create the QMP monitor within qemu_create_late_backends(), which runs before qmp_x_exit_preconfig(), but commands get processed only in the main loop, which we enter later. Correct? > Technically it's not a real delay for specific QMP command, but rather > an extended span of initialization process may take place before the > very first QMP request, usually qmp_capabilities, will be > serviced. 
It's natural for mgmt software to expect initialization > delay for the first qmp_capabilities response if it has to immediately > issue one after launching qemu, especially when you have a large guest > with hundred GBs of memory and with passthrough device that has to pin > memory for DMA e.g. VFIO, the delayed effect from the QEMU > initialization process is very visible too. > On the other hand, before > the patch, if memory happens to be in the middle of being pinned, any > ongoing QMP can't be serviced by the QEMU main loop, either. > > I'd also like to highlight that without this patch, the pretty high > delay due to page pinning is even visible to the guest in addition to > just QMP delay, which largely affected guest boot time with vDPA > device already. It is long standing, and every VM user with vDPA > device would like to avoid such high delay for the first boot, which > is not seen with similar device e.g. VFIO passthrough. > >> >>> Thanks, >>> -Siwei >>> You told us an absolute delay you observed. What's the relative delay
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 5/26/2025 2:16 AM, Markus Armbruster wrote: Si-Wei Liu writes: On 5/15/2025 11:40 PM, Markus Armbruster wrote: Jason Wang writes: On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neigher traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment that all devices are initializaed. So moving that operation allows QEMU to communicate the kernel the maps while the workload is still running in the source, so Linux can start mapping them. As a small drawback, there is a time in the initialization where QEMU cannot respond to QMP etc. By some testing, this time is about 0.2seconds. Adding Markus to see if this is a real problem or not. I guess the answer is "depends", and to get a more useful one, we need more information. When all you care is time from executing qemu-system-FOO to guest finish booting, and the guest takes 10s to boot, then an extra 0.2s won't matter much. There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about). When a management application runs qemu-system-FOO several times to probe its capabilities via QMP, then even milliseconds can hurt. Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. 
there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args. In what scenarios exactly is QMP delayed? Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well. In what scenarios exactly is QMP delayed compared to before the patch? The page pinning process now runs in a pretty early phase at qemu_init() e.g. machine_run_board_init(), before any QMP command can be serviced, the latter of which typically would be able to get run from qemu_main_loop() until the AIO gets chance to be started to get polled and dispatched to bh. Technically it's not a real delay for specific QMP command, but rather an extended span of initialization process may take place before the very first QMP request, usually qmp_capabilities, will be serviced. It's natural for mgmt software to expect initialization delay for the first qmp_capabilities response if it has to immediately issue one after launching qemu, especially when you have a large guest with hundred GBs of memory and with passthrough device that has to pin memory for DMA e.g. VFIO, the delayed effect from the QEMU initialization process is very visible too. On the other hand, before the patch, if memory happens to be in the middle of being pinned, any ongoing QMP can't be serviced by the QEMU main loop, either. I'd also like to highlight that without this patch, the pretty high delay due to page pinning is even visible to the guest in addition to just QMP delay, which largely affected guest boot time with vDPA device already. 
It is long standing, and every VM user with vDPA device would like to avoid such high delay for the first boot, which is not seen with similar device e.g. VFIO passthrough. Thanks, -Siwei You told us an absolute delay you observed. What's the relative delay, i.e. what's the delay with and without these patches? Can you answer this question? I thought I already got that answered in earlier reply. The relative delay is subject to the size of memory. Usually mgmt software won't be able to notice, unless the guest has more than 100GB of THP memory to pin, for DMA or whatever reason. We need QMP to become available earlier in the startup sequence for other reasons. Could we bypass the delay that way? Please understand that this would likely be quite difficult: we know from experience that messing with the startup sequence is prone to introduce subtle compatility breaks and even bugs. (I remember VFIO has some optimization in the speed of the pinning, could vDPA do the same?) That's well outside my bailiwick :) Please be understood that any p
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Si-Wei Liu writes: > On 5/15/2025 11:40 PM, Markus Armbruster wrote: >> Jason Wang writes: >> >>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neither traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment that all devices are initialized. So moving that operation allows QEMU to communicate the maps to the kernel while the workload is still running in the source, so Linux can start mapping them. As a small drawback, there is a time in the initialization where QEMU cannot respond to QMP etc. By some testing, this time is about 0.2 seconds. >>> >>> Adding Markus to see if this is a real problem or not. >> >> I guess the answer is "depends", and to get a more useful one, we need >> more information. >> >> When all you care about is the time from executing qemu-system-FOO to the guest >> finishing boot, and the guest takes 10s to boot, then an extra 0.2s >> won't matter much. > > There's no such delay of an extra 0.2s or higher per se; it's just shifting > the page pinning hiccup around, whether it is 0.2s or something else, from > guest boot time to before the guest boots. This wins back guest > boot time or startup delay, but in turn the same delay effectively gets > charged to VM launch time. We follow the same model as VFIO, which sees > the same hiccup during launch (at an early stage that no real mgmt > software cares about). > >> When a management application runs qemu-system-FOO several times to >> probe its capabilities via QMP, then even milliseconds can hurt. >> > Not something like that; this page pinning hiccup is one time only and > occurs in the very early stage when launching QEMU, i.e.
there's no > consistent delay every time a QMP command is called. The delay in QMP response at > that point depends on how much memory the VM has, but this is > specific to VMs with VFIO or vDPA devices that have to pin memory for DMA. > That said, there's no extra delay at all if the QEMU command line has no vDPA > device assignment; on the other hand, there's the same delay or QMP hiccup when VFIO is > on the QEMU command line. > >> In what scenarios exactly is QMP delayed? > > That said, this is not a new problem peculiar to QEMU; the same QMP delay > exists with VFIO as well. In what scenarios exactly is QMP delayed compared to before the patch? > Thanks, > -Siwei > >> >> You told us an absolute delay you observed. What's the relative delay, >> i.e. what's the delay with and without these patches? Can you answer this question? >> We need QMP to become available earlier in the startup sequence for >> other reasons. Could we bypass the delay that way? Please understand >> that this would likely be quite difficult: we know from experience that >> messing with the startup sequence is prone to introduce subtle >> compatibility breaks and even bugs. >> >>> (I remember VFIO has some optimization in the speed of the pinning, >>> could vDPA do the same?) >> >> That's well outside my bailiwick :) >> >> [...] >>
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 5/14/25 11:49 AM, Eugenio Perez Martin wrote: On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neither traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment that all devices are initialized. So moving that operation allows QEMU to communicate the maps to the kernel while the workload is still running in the source, so Linux can start mapping them. As a small drawback, there is a time in the initialization where QEMU cannot respond to QMP etc. By some testing, this time is about 0.2 seconds. This may be further reduced (or increased) depending on the vdpa driver and the platform hardware, and it is dominated by the cost of memory pinning. This matches the time that we move out of the so-called downtime window. The downtime is measured by checking the trace timestamps from the moment the source suspends the device to the moment the destination starts the eighth and last virtqueue pair. For a 39G guest, it goes from ~2.2526 secs to 2.0949. Hi Jonah, Could you update this benchmark? I don't think it changed a lot but just to be as updated as possible. Yes, will update this for the 39G guest and for 128G guests :) I think I cannot ack the series as I sent the first revision. Jason or Si-Wei, could you ack it? Thanks! Future directions on top of this series may include moving more things ahead of the migration time, like setting DRIVER_OK or performing actual iterative migration of virtio-net devices. Comments are welcome. This series is a different approach from series [1]. As the title does not reflect the changes anymore, please refer to the previous one to know the series history. This series is based on [2]; it must be applied after it.
[Jonah Palmer] This series was rebased after [3] was pulled in, as [3] was a prerequisite fix for this series. v4: --- * Add memory listener unregistration to vhost_vdpa_reset_device. * Remove memory listener unregistration from vhost_vdpa_reset_status. v3: --- * Rebase v2: --- * Move the memory listener registration to vhost_vdpa_set_owner function. * Move the iova_tree allocation to net_vhost_vdpa_init. v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html . [1] https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/ [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html [3] https://lore.kernel.org/qemu-devel/[email protected]/ Jonah - note: I'll be on vacation from May 10-19. Will respond to comments when I return. Eugenio Pérez (7): vdpa: check for iova tree initialized at net_client_start vdpa: reorder vhost_vdpa_set_backend_cap vdpa: set backend capabilities at vhost_vdpa_init vdpa: add listener_registered vdpa: reorder listener assignment vdpa: move iova_tree allocation to net_vhost_vdpa_init vdpa: move memory listener register to vhost_vdpa_init hw/virtio/vhost-vdpa.c | 107 + include/hw/virtio/vhost-vdpa.h | 22 ++- net/vhost-vdpa.c | 34 +-- 3 files changed, 93 insertions(+), 70 deletions(-) -- 2.43.5
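The v4 changelog above moves the listener unregistration from reset_status to reset_device, guarded by the new listener_registered flag (patch 4). A toy model of that guard pattern, in Python rather than the real C of hw/virtio/vhost-vdpa.c — the method names follow the patch titles, but the logic here is a guessed sketch of the pattern, not the actual implementation:

```python
class VhostVdpaToy:
    """Illustrative model of the listener_registered guard in the series."""

    def __init__(self):
        self.listener_registered = False
        self.init()  # vhost_vdpa_init analogue

    def init(self):
        # v4 registers the memory listener here, so pinning starts
        # at device init, before the main loop and device start.
        if not self.listener_registered:
            self.listener_registered = True

    def reset_device(self):
        # v4 change: unregister on device reset...
        if self.listener_registered:
            self.listener_registered = False

    def dev_start(self):
        # ...and re-register lazily if needed; before the series this
        # was the only place registration happened.
        if not self.listener_registered:
            self.listener_registered = True
```

The flag makes register/unregister idempotent, which matters once registration can happen from more than one call path.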
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 5/15/2025 11:40 PM, Markus Armbruster wrote: Jason Wang writes: On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped, and before the workload is resumed at the destination. This is a period where neither traffic can flow, nor the VM workload can continue (downtime). We can do better as we know the memory layout of the guest RAM at the destination from the moment that all devices are initialized. So moving that operation allows QEMU to communicate the maps to the kernel while the workload is still running in the source, so Linux can start mapping them. As a small drawback, there is a time in the initialization where QEMU cannot respond to QMP etc. By some testing, this time is about 0.2 seconds. Adding Markus to see if this is a real problem or not. I guess the answer is "depends", and to get a more useful one, we need more information. When all you care about is the time from executing qemu-system-FOO to the guest finishing boot, and the guest takes 10s to boot, then an extra 0.2s won't matter much. There's no such delay of an extra 0.2s or higher per se; it's just shifting the page pinning hiccup around, whether it is 0.2s or something else, from guest boot time to before the guest boots. This wins back guest boot time or startup delay, but in turn the same delay effectively gets charged to VM launch time. We follow the same model as VFIO, which sees the same hiccup during launch (at an early stage that no real mgmt software cares about). When a management application runs qemu-system-FOO several times to probe its capabilities via QMP, then even milliseconds can hurt. Not something like that; this page pinning hiccup is one time only and occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time a QMP command is called.
The delay in QMP response at that point depends on how much memory the VM has, but this is specific to VMs with VFIO or vDPA devices that have to pin memory for DMA. That said, there's no extra delay at all if the QEMU command line has no vDPA device assignment; on the other hand, there's the same delay or QMP hiccup when VFIO is on the QEMU command line. In what scenarios exactly is QMP delayed? That said, this is not a new problem peculiar to QEMU; the same QMP delay exists with VFIO as well. Thanks, -Siwei You told us an absolute delay you observed. What's the relative delay, i.e. what's the delay with and without these patches? We need QMP to become available earlier in the startup sequence for other reasons. Could we bypass the delay that way? Please understand that this would likely be quite difficult: we know from experience that messing with the startup sequence is prone to introduce subtle compatibility breaks and even bugs. (I remember VFIO has some optimization in the speed of the pinning, could vDPA do the same?) That's well outside my bailiwick :) [...]
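Why the hiccup "depends on how much memory the VM has": pinning has to touch every guest page once, so the cost scales roughly linearly with guest RAM. A back-of-the-envelope estimate (the per-page cost below is a hypothetical number chosen only to land in the same ballpark as the thread's figures, not a measurement):

```python
PAGE_SIZE = 4096        # bytes, assuming 4 KiB pages
COST_PER_PAGE_NS = 200  # hypothetical average pin cost per page

def pin_time_seconds(guest_mem_gib: float) -> float:
    """Rough linear estimate of one-time page-pinning cost."""
    pages = guest_mem_gib * (1 << 30) / PAGE_SIZE
    return pages * COST_PER_PAGE_NS / 1e9

small = pin_time_seconds(39)   # tens-of-GB guest: a couple of seconds
large = pin_time_seconds(128)  # proportionally longer for a bigger guest
```

The linear scaling is the point: a guest with hundreds of GBs makes the one-time hiccup very visible, while a small guest may never notice it.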
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Thu, May 15, 2025 at 10:41:45AM -0700, Si-Wei Liu wrote: > > > On 5/14/2025 10:43 PM, Michael S. Tsirkin wrote: > > On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote: > > > Hi Eugenio, > > > > > > On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote: > > > > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer > > > > wrote: > > > > > Current memory operations like pinning may take a lot of time at the > > > > > destination. Currently they are done after the source of the > > > > > migration is > > > > > stopped, and before the workload is resumed at the destination. This > > > > > is a > > > > > period where neigher traffic can flow, nor the VM workload can > > > > > continue > > > > > (downtime). > > > > > > > > > > We can do better as we know the memory layout of the guest RAM at the > > > > > destination from the moment that all devices are initializaed. So > > > > > moving that operation allows QEMU to communicate the kernel the maps > > > > > while the workload is still running in the source, so Linux can start > > > > > mapping them. > > > > > > > > > > As a small drawback, there is a time in the initialization where QEMU > > > > > cannot respond to QMP etc. By some testing, this time is about > > > > > 0.2seconds. This may be further reduced (or increased) depending on > > > > > the > > > > > vdpa driver and the platform hardware, and it is dominated by the cost > > > > > of memory pinning. > > > > > > > > > > This matches the time that we move out of the called downtime window. > > > > > The downtime is measured as checking the trace timestamp from the > > > > > moment > > > > > the source suspend the device to the moment the destination starts the > > > > > eight and last virtqueue pair. For a 39G guest, it goes from ~2.2526 > > > > > secs to 2.0949. > > > > > > > > > Hi Jonah, > > > > > > > > Could you update this benchmark? I don't think it changed a lot but > > > > just to be as updated as possible. 
> > > Jonah is off this week and will be back until next Tuesday, but I recall > > > he > > > indeed did some downtime test with VM with 128GB memory before taking off, > > > which shows obvious improvement from around 10 seconds to 5.8 seconds > > > after > > > applying this series. Since this is related to update on the cover letter, > > > would it be okay for you and Jason to ack now and then proceed to Michael > > > for upcoming merge? > > > > > > > I think I cannot ack the series as I sent the first revision. Jason or > > > > Si-Wei, could you ack it? > > > Sure, I just give my R-b, this series look good to me. Hopefully Jason can > > > ack on his own. > > > > > > Thanks! > > > -Siwei > > I just sent a pull, next one in a week or two, so - no rush. > All right, should be good to wait. In any case you have to repost a v2 PULL, > hope this series can be piggy-back'ed as we did extensive tests about it. > ;-) > > -Siwei You mean "in case"? > > > > > > > > Thanks! > > > > > > > > > Future directions on top of this series may include to move more > > > > > things ahead > > > > > of the migration time, like set DRIVER_OK or perform actual iterative > > > > > migration > > > > > of virtio-net devices. > > > > > > > > > > Comments are welcome. > > > > > > > > > > This series is a different approach of series [1]. As the title does > > > > > not > > > > > reflect the changes anymore, please refer to the previous one to know > > > > > the > > > > > series history. > > > > > > > > > > This series is based on [2], it must be applied after it. > > > > > > > > > > [Jonah Palmer] > > > > > This series was rebased after [3] was pulled in, as [3] was a > > > > > prerequisite > > > > > fix for this series. > > > > > > > > > > v4: > > > > > --- > > > > > * Add memory listener unregistration to vhost_vdpa_reset_device. > > > > > * Remove memory listener unregistration from vhost_vdpa_reset_status. 
> > > > > > > > > > v3: > > > > > --- > > > > > * Rebase > > > > > > > > > > v2: > > > > > --- > > > > > * Move the memory listener registration to vhost_vdpa_set_owner > > > > > function. > > > > > * Move the iova_tree allocation to net_vhost_vdpa_init. > > > > > > > > > > v1 at > > > > > https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html. > > > > > > > > > > [1] > > > > > https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/ > > > > > [2] > > > > > https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html > > > > > [3] > > > > > https://lore.kernel.org/qemu-devel/[email protected]/ > > > > > > > > > > Jonah - note: I'll be on vacation from May 10-19. Will respond to > > > > > comments when I return. > > > > > > > > > > Eugenio Pérez (7): > > > > > vdpa: check for iova tree initialized at net_client_start > > > > > vdpa: reorder vhost_vdpa_set_backend_cap > > > > > vdpa: set backend capabilities at vhost_vdpa_init > > > > > vdpa: add l
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Jason Wang writes: > On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: >> >> Current memory operations like pinning may take a lot of time at the >> destination. Currently they are done after the source of the migration is >> stopped, and before the workload is resumed at the destination. This is a >> period where neither traffic can flow, nor the VM workload can continue >> (downtime). >> >> We can do better as we know the memory layout of the guest RAM at the >> destination from the moment that all devices are initialized. So >> moving that operation allows QEMU to communicate the maps to the kernel >> while the workload is still running in the source, so Linux can start >> mapping them. >> >> As a small drawback, there is a time in the initialization where QEMU >> cannot respond to QMP etc. By some testing, this time is about >> 0.2 seconds. > > Adding Markus to see if this is a real problem or not. I guess the answer is "depends", and to get a more useful one, we need more information. When all you care about is the time from executing qemu-system-FOO to the guest finishing boot, and the guest takes 10s to boot, then an extra 0.2s won't matter much. When a management application runs qemu-system-FOO several times to probe its capabilities via QMP, then even milliseconds can hurt. In what scenarios exactly is QMP delayed? You told us an absolute delay you observed. What's the relative delay, i.e. what's the delay with and without these patches? We need QMP to become available earlier in the startup sequence for other reasons. Could we bypass the delay that way? Please understand that this would likely be quite difficult: we know from experience that messing with the startup sequence is prone to introduce subtle compatibility breaks and even bugs. > (I remember VFIO has some optimization in the speed of the pinning, > could vDPA do the same?) That's well outside my bailiwick :) [...]
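For context on the capability probing Markus mentions: a QMP session starts with the server sending a greeting, and no command is serviced until the client sends qmp_capabilities; any startup-time pinning delays exactly that first exchange. A minimal sketch of the message framing (the sample greeting below is illustrative, not captured from a real QEMU):

```python
import json

def build_cmd(name: str, **arguments) -> str:
    """Serialize one newline-delimited QMP command."""
    cmd = {"execute": name}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd) + "\n"

def parse_greeting(line: str) -> dict:
    """Extract version info from the greeting QEMU sends on connect."""
    return json.loads(line)["QMP"]["version"]

# A management app probing capabilities does, in order:
#   greeting = parse_greeting(sock.readline())   # server talks first
#   sock.write(build_cmd("qmp_capabilities"))    # must be the first command
#   sock.write(build_cmd("query-commands"))      # then the actual probe
# The thread's point: early memory pinning delays the greeting and the
# first response, which repeated probing runs would notice.
```

Run against a real socket (e.g. QEMU started with -qmp unix:...,server=on), the same three steps apply; only the transport changes.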
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: > > Current memory operations like pinning may take a lot of time at the > destination. Currently they are done after the source of the migration is > stopped, and before the workload is resumed at the destination. This is a > period where neither traffic can flow, nor the VM workload can continue > (downtime). > > We can do better as we know the memory layout of the guest RAM at the > destination from the moment that all devices are initialized. So > moving that operation allows QEMU to communicate the maps to the kernel > while the workload is still running in the source, so Linux can start > mapping them. > > As a small drawback, there is a time in the initialization where QEMU > cannot respond to QMP etc. By some testing, this time is about > 0.2 seconds. Adding Markus to see if this is a real problem or not. (I remember VFIO has some optimization in the speed of the pinning, could vDPA do the same?) Thanks > This may be further reduced (or increased) depending on the > vdpa driver and the platform hardware, and it is dominated by the cost > of memory pinning. > > This matches the time that we move out of the so-called downtime window. > The downtime is measured by checking the trace timestamps from the moment > the source suspends the device to the moment the destination starts the > eighth and last virtqueue pair. For a 39G guest, it goes from ~2.2526 > secs to 2.0949. > > Future directions on top of this series may include moving more things ahead > of the migration time, like setting DRIVER_OK or performing actual iterative > migration > of virtio-net devices. > > Comments are welcome. > > This series is a different approach from series [1]. As the title does not > reflect the changes anymore, please refer to the previous one to know the > series history. > > This series is based on [2]; it must be applied after it.
> > [Jonah Palmer] > This series was rebased after [3] was pulled in, as [3] was a prerequisite > fix for this series. > > v4: > --- > * Add memory listener unregistration to vhost_vdpa_reset_device. > * Remove memory listener unregistration from vhost_vdpa_reset_status. > > v3: > --- > * Rebase > > v2: > --- > * Move the memory listener registration to vhost_vdpa_set_owner function. > * Move the iova_tree allocation to net_vhost_vdpa_init. > > v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html. > > [1] > https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/ > [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html > [3] > https://lore.kernel.org/qemu-devel/[email protected]/ > > Jonah - note: I'll be on vacation from May 10-19. Will respond to > comments when I return. > > Eugenio Pérez (7): > vdpa: check for iova tree initialized at net_client_start > vdpa: reorder vhost_vdpa_set_backend_cap > vdpa: set backend capabilities at vhost_vdpa_init > vdpa: add listener_registered > vdpa: reorder listener assignment > vdpa: move iova_tree allocation to net_vhost_vdpa_init > vdpa: move memory listener register to vhost_vdpa_init > > hw/virtio/vhost-vdpa.c | 107 + > include/hw/virtio/vhost-vdpa.h | 22 ++- > net/vhost-vdpa.c | 34 +-- > 3 files changed, 93 insertions(+), 70 deletions(-) > > -- > 2.43.5 >
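Conceptually, the series moves the listener registration from device start to device init, and the v4 revision pairs it with unregistration on device reset. The toy model below is plain Python, not QEMU's actual API; the method names only mirror the functions named in the changelog, and the state flag is a stand-in for the real listener registration:

```python
class VhostVdpaModel:
    """Toy model (not QEMU's real vhost-vdpa code) of the listener
    lifecycle after this series: register at init, so pinning starts
    early, and unregister at device reset (the v4 change)."""

    def __init__(self):
        self.listener_registered = False
        self.vhost_vdpa_init()

    def vhost_vdpa_init(self):
        # After the series: the memory listener is registered here, so
        # the kernel can start mapping/pinning guest RAM while the
        # migration source is still running.
        self.listener_registered = True

    def vhost_vdpa_reset_device(self):
        # v4 change: resetting the device also unregisters the listener.
        self.listener_registered = False


dev = VhostVdpaModel()
assert dev.listener_registered       # registered early, at init time
dev.vhost_vdpa_reset_device()
assert not dev.listener_registered   # torn down on reset
print("lifecycle ok")
```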
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Thu, May 15, 2025 at 8:17 AM Si-Wei Liu wrote: > > Hi Eugenio, > > On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote: > > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: > >> Current memory operations like pinning may take a lot of time at the > >> destination. [...] For a 39G guest, it goes from ~2.2526 > >> secs to 2.0949. > >> > > Hi Jonah, > > > > Could you update this benchmark? I don't think it changed a lot, but > > just to be as updated as possible. > Jonah is off this week and will be back next Tuesday, but I recall > he indeed did some downtime tests with a VM with 128GB of memory before > taking off, which showed an obvious improvement, from around 10 seconds > to 5.8 seconds, after applying this series. Since this only affects the > cover letter, would it be okay for you and Jason to ack now and then > proceed to Michael for the upcoming merge? I will go through the series. Thanks
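Putting the two data points from the thread side by side — the saved-time and percentage figures below are derived here for illustration, not stated by anyone in the thread:

```python
# Downtime before/after the series, in seconds, as reported in the thread:
# ~2.2526s -> 2.0949s for a 39G guest (cover letter), and roughly
# 10s -> 5.8s for a 128GB guest (Si-Wei's report of Jonah's test).
measurements = {
    "39G guest (cover letter)": (2.2526, 2.0949),
    "128GB guest (Si-Wei's report)": (10.0, 5.8),
}

for name, (before, after) in measurements.items():
    saved = before - after
    pct = 100.0 * saved / before
    print(f"{name}: {before}s -> {after}s "
          f"(saved {saved:.4f}s, {pct:.1f}% less downtime)")
```

So the relative win grows with guest memory size, which is consistent with pinning cost dominating the moved work.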
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: > > Current memory operations like pinning may take a lot of time at the > destination. [...] > > This series is a different approach from series [1]. As the title does not > reflect the changes anymore, please refer to the previous one to know the > series history. > > This series is based on [2], it must be applied after it. Note that this has been merged. Thanks > > [Jonah Palmer] > This series was rebased after [3] was pulled in, as [3] was a prerequisite > fix for this series. [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On 5/14/2025 10:43 PM, Michael S. Tsirkin wrote: On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote: Hi Eugenio, On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote: On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. [...] For a 39G guest, it goes from ~2.2526 secs to 2.0949. Hi Jonah, Could you update this benchmark? I don't think it changed a lot, but just to be as updated as possible. Jonah is off this week and will be back next Tuesday, but I recall he indeed did some downtime tests with a VM with 128GB of memory before taking off, which showed an obvious improvement, from around 10 seconds to 5.8 seconds, after applying this series. Since this only affects the cover letter, would it be okay for you and Jason to ack now and then proceed to Michael for the upcoming merge? I think I cannot ack the series as I sent the first revision. Jason or Si-Wei, could you ack it? Sure, I just gave my R-b; this series looks good to me. Hopefully Jason can ack on his own. Thanks! -Siwei I just sent a pull, next one in a week or two, so - no rush. All right, should be good to wait. In any case you have to repost a v2 PULL; hope this series can be piggy-backed, as we did extensive tests on it. ;-) -Siwei Thanks! [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Thu, May 15, 2025 at 2:17 AM Si-Wei Liu wrote: > > Hi Eugenio, > > On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote: > > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: > >> Current memory operations like pinning may take a lot of time at the > >> destination. [...] > > Hi Jonah, > > > > Could you update this benchmark? I don't think it changed a lot, but > > just to be as updated as possible. > Jonah is off this week and will be back next Tuesday, but I recall > he indeed did some downtime tests with a VM with 128GB of memory before > taking off, which showed an obvious improvement, from around 10 seconds > to 5.8 seconds, after applying this series. Since this only affects the > cover letter, would it be okay for you and Jason to ack now and then > proceed to Michael for the upcoming merge? > Oh yes, that's what I meant, I should have been more explicit about that :). > > > > I think I cannot ack the series as I sent the first revision. Jason or > > Si-Wei, could you ack it? > Sure, I just gave my R-b; this series looks good to me. Hopefully Jason > can ack on his own. > > Thanks! > -Siwei [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote: > Hi Eugenio, > > On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote: > > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: > > > Current memory operations like pinning may take a lot of time at the > > > destination. [...] For a 39G guest, it goes from ~2.2526 > > > secs to 2.0949. > > > > > Hi Jonah, > > > > Could you update this benchmark? I don't think it changed a lot, but > > just to be as updated as possible. > Jonah is off this week and will be back next Tuesday, but I recall he > indeed did some downtime tests with a VM with 128GB of memory before taking > off, which showed an obvious improvement, from around 10 seconds to 5.8 > seconds, after applying this series. > Since this only affects the cover letter, > would it be okay for you and Jason to ack now and then proceed to Michael > for the upcoming merge? > > > > > I think I cannot ack the series as I sent the first revision. Jason or > > Si-Wei, could you ack it? > Sure, I just gave my R-b; this series looks good to me. Hopefully Jason can > ack on his own. > > Thanks! > -Siwei I just sent a pull, next one in a week or two, so - no rush. [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Hi Eugenio, On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote: On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. [...] For a 39G guest, it goes from ~2.2526 secs to 2.0949. Hi Jonah, Could you update this benchmark? I don't think it changed a lot, but just to be as updated as possible. Jonah is off this week and will be back next Tuesday, but I recall he indeed did some downtime tests with a VM with 128GB of memory before taking off, which showed an obvious improvement, from around 10 seconds to 5.8 seconds, after applying this series. Since this only affects the cover letter, would it be okay for you and Jason to ack now and then proceed to Michael for the upcoming merge? I think I cannot ack the series as I sent the first revision. Jason or Si-Wei, could you ack it? Sure, I just gave my R-b; this series looks good to me. Hopefully Jason can ack on his own. Thanks! -Siwei [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
For the series: Reviewed-by: Si-Wei Liu On 5/7/2025 11:46 AM, Jonah Palmer wrote: Current memory operations like pinning may take a lot of time at the destination. [...] For a 39G guest, it goes from ~2.2526 secs to 2.0949. [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote: > > Current memory operations like pinning may take a lot of time at the > destination. [...] > > This matches the time that we move out of the so-called downtime window. > The downtime is measured by checking the trace timestamp from the moment > the source suspends the device to the moment the destination starts the > eighth and last virtqueue pair. For a 39G guest, it goes from ~2.2526 > secs to 2.0949. > Hi Jonah, Could you update this benchmark? I don't think it changed a lot, but just to be as updated as possible. I think I cannot ack the series as I sent the first revision. Jason or Si-Wei, could you ack it? Thanks! [...]
Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
Tests passed with the vhost_vdpa device regression tests. Tested-by: Lei Yang On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote: > > Current memory operations like pinning may take a lot of time at the > destination. [...]
