Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-07-09 Thread Markus Armbruster
Jonah Palmer  writes:

[...]

>> I think I finally know enough to give you constructive feedback.
>> 
>> Your commit messages should answer the questions I had.  Specifically:
>> 
>> * Why are we doing this?  To shorten guest-visible downtime.
>> 
>> * How are we doing this?  We additionally pin memory before entering the
>>   main loop.  This speeds up the pinning we still do in the main loop.
>> 
>> * Drawback: slower startup.  In particular, QMP becomes
>>   available later.
>> 
>> * Secondary benefit: main loop responsiveness improves, in particular
>>   QMP.
>> 
>> * What uses of QEMU are affected?  Only with vhost-vDPA.  Spell out all
>>the ways to get vhost-vDPA, please.
>> 
>> * There's a tradeoff.  Show your numbers.  Discuss whether this needs to
>>   be configurable.
>> 
>> If you can make a case for pinning memory this way always, do so.  If
>> you believe making it configurable would be a good idea, do so.  If
>> you're not sure, say so in the cover letter, and add a suitable TODO
>> comment.
>> 
>> Questions?
>
> No questions, understood.
>
> As I was writing the responses to your questions I was thinking to 
> myself that this stuff should've been in the cover letter / commit 
> messages in the first place.
>
> Definitely a learning moment for me. Thanks for your time on this Markus!

You're welcome!




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-07-09 Thread Jonah Palmer




On 7/8/25 4:17 AM, Markus Armbruster wrote:

Jonah Palmer  writes:


On 7/4/25 11:00 AM, Markus Armbruster wrote:

Jonah Palmer  writes:


[...]


So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop).  Correct?


Correct. We only save ~0.07s from the pinning that happens in the main loop. 
But the extra 3s we now need to spend pinning before qemu_main_loop() 
overshadows it.


Got it.


We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime.  Correct?
[...]


Correct. Though whether or not we want this trade I suppose is subjective. But 
the 50-60% reduction in guest-visible downtime is pretty nice if we can stomach 
the initial startup costs.


I'll get back to this at the end.

[...]


Let me circle back to my question: Under what circumstances is QMP
responsiveness affected?

The answer seems to be "only when we're using a vhost-vDPA device".
Correct?


Correct, since using one of these guys causes us to do this memory pinning. If 
we're not using one, it's business as usual for QEMU.


Got it.


We're using one exactly when QEMU is running with one of its
vhost-vdpa-device-pci* device models.  Correct?


Yea, or something like:

-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
-device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \


I'll get back to this at the end.

[...]


Let me recap:

* No change at all unless we're pinning memory early, and we're doing
that only when we're using a vhost-vDPA device.  Correct?

* If we are using a vhost-vDPA device:
- Total startup time (until we're done pinning) increases.


Correct.


- QMP becomes available later.


Correct.


- Main loop behavior improves: less guest-visible downtime, QMP more
  responsive (once it's available)


Correct. Though the improvement is modest at best if we put aside the 
guest-visible downtime improvement.


This is a tradeoff we want always.  There is no need to let users pick
"faster startup, worse main loop behavior."



"Always" might be subjective here. For example, if there's no desire to perform 
live migration, then the user kinda just gets stuck with the cons.

Whether or not we want to make this configurable though is another discussion.


Correct?

[...]


I think I finally know enough to give you constructive feedback.

Your commit messages should answer the questions I had.  Specifically:

* Why are we doing this?  To shorten guest-visible downtime.

* How are we doing this?  We additionally pin memory before entering the
   main loop.  This speeds up the pinning we still do in the main loop.

* Drawback: slower startup.  In particular, QMP becomes
   available later.

* Secondary benefit: main loop responsiveness improves, in particular
   QMP.

* What uses of QEMU are affected?  Only with vhost-vDPA.  Spell out all
   the ways to get vhost-vDPA, please.

* There's a tradeoff.  Show your numbers.  Discuss whether this needs to
   be configurable.

If you can make a case for pinning memory this way always, do so.  If
you believe making it configurable would be a good idea, do so.  If
you're not sure, say so in the cover letter, and add a suitable TODO
comment.

Questions?



No questions, understood.

As I was writing the responses to your questions I was thinking to 
myself that this stuff should've been in the cover letter / commit 
messages in the first place.


Definitely a learning moment for me. Thanks for your time on this Markus!

Jonah




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-07-08 Thread Markus Armbruster
Jonah Palmer  writes:

> On 7/4/25 11:00 AM, Markus Armbruster wrote:
>> Jonah Palmer  writes:

[...]

>> So, total time increases: early pinning (before main loop) takes more
>> time than we save pinning (in the main loop).  Correct?
>
> Correct. We only save ~0.07s from the pinning that happens in the main loop. 
> But the extra 3s we now need to spend pinning before qemu_main_loop() 
> overshadows it.

Got it.

>> We want this trade, because the time spent in the main loop is a
>> problem: guest-visible downtime.  Correct?
>> [...]
>
> Correct. Though whether or not we want this trade I suppose is subjective. 
> But the 50-60% reduction in guest-visible downtime is pretty nice if we can 
> stomach the initial startup costs.

I'll get back to this at the end.

[...]

>> Let me circle back to my question: Under what circumstances is QMP
>> responsiveness affected?
>> 
>> The answer seems to be "only when we're using a vhost-vDPA device".
>> Correct?
>
> Correct, since using one of these guys causes us to do this memory pinning. 
> If we're not using one, it's business as usual for QEMU.

Got it.

>> We're using one exactly when QEMU is running with one of its
>> vhost-vdpa-device-pci* device models.  Correct?
>
> Yea, or something like:
>
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \

I'll get back to this at the end.

[...]

>> Let me recap:
>> 
>> * No change at all unless we're pinning memory early, and we're doing
>>that only when we're using a vhost-vDPA device.  Correct?
>> 
>> * If we are using a vhost-vDPA device:
>>- Total startup time (until we're done pinning) increases.
>
> Correct.
>
>>- QMP becomes available later.
>
> Correct.
>
>>- Main loop behavior improves: less guest-visible downtime, QMP more
>>  responsive (once it's available)
>
> Correct. Though the improvement is modest at best if we put aside the 
> guest-visible downtime improvement.
>
>>This is a tradeoff we want always.  There is no need to let users pick
>>"faster startup, worse main loop behavior."
>> 
>
> "Always" might be subjective here. For example, if there's no desire to 
> perform live migration, then the user kinda just gets stuck with the cons.
>
> Whether or not we want to make this configurable though is another discussion.
>
>> Correct?
>> 
>> [...]

I think I finally know enough to give you constructive feedback.

Your commit messages should answer the questions I had.  Specifically:

* Why are we doing this?  To shorten guest-visible downtime.

* How are we doing this?  We additionally pin memory before entering the
  main loop.  This speeds up the pinning we still do in the main loop.

* Drawback: slower startup.  In particular, QMP becomes
  available later.

* Secondary benefit: main loop responsiveness improves, in particular
  QMP.

* What uses of QEMU are affected?  Only with vhost-vDPA.  Spell out all
  the ways to get vhost-vDPA, please.

* There's a tradeoff.  Show your numbers.  Discuss whether this needs to
  be configurable.

If you can make a case for pinning memory this way always, do so.  If
you believe making it configurable would be a good idea, do so.  If
you're not sure, say so in the cover letter, and add a suitable TODO
comment.

Questions?




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-07-07 Thread Jonah Palmer




On 7/4/25 11:00 AM, Markus Armbruster wrote:

Jonah Palmer  writes:


On 6/26/25 8:08 AM, Markus Armbruster wrote:


[...]


Apologies for the delay in getting back to you. I just wanted to be thorough 
and answer everything as accurately and clearly as possible.



Before these patches, pinning started in vhost_vdpa_dev_start(), where the 
memory listener was registered, and began calling 
vhost_vdpa_listener_region_add() to invoke the actual memory pinning. This 
happens after entering qemu_main_loop().

After these patches, pinning started in vhost_dev_init() (specifically 
vhost_vdpa_set_owner()), where the memory listener registration was moved to. 
This happens *before* entering qemu_main_loop().

However, not all of the pinning happens pre qemu_main_loop(). What happens 
before we enter qemu_main_loop() is the full guest RAM pinning, which is the 
main, heavy-lifting part of the work.

The rest of the pinning work happens after entering qemu_main_loop() (at 
approximately the same point as when pinning started before these patches). 
But since we already did the heavy lifting pre qemu_main_loop() (i.e. all 
pages were already allocated and pinned), we're just re-pinning here: the 
kernel only updates its IOTLB tables for pages that are already mapped and 
locked in RAM.

This makes the pinning work we do after entering qemu_main_loop() much faster 
compared to the same pinning we had to do before these patches.

However, we have to pay a cost for this. Because we do the heavy lifting 
earlier, pre qemu_main_loop(), we're pinning cold memory. That is, the guest 
hasn't touched its memory yet; all host pages are still anonymous and 
unallocated. This means that doing the pinning earlier is more expensive 
time-wise, since we also need to allocate physical pages for each chunk of 
memory.
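The cold-vs-warm asymmetry described here can be illustrated outside QEMU with a small sketch (my own illustration, not code from the patches): the first write to each page of an anonymous mapping forces the kernel to allocate it, while re-touching resident pages is far cheaper, which is the same reason the early pin is slow and the later re-pin is fast.

```python
import mmap
import time

LEN = 256 * 1024 * 1024   # 256 MiB anonymous mapping
PAGE = mmap.PAGESIZE

buf = mmap.mmap(-1, LEN)  # "cold": no physical pages are allocated yet

def touch_every_page(m):
    """Write one byte per page; return elapsed seconds."""
    t0 = time.perf_counter()
    for off in range(0, LEN, PAGE):
        m[off] = 1
    return time.perf_counter() - t0

cold = touch_every_page(buf)  # first touch: kernel allocates every page
warm = touch_every_page(buf)  # pages already resident: no allocation needed

print(f"cold (first touch): {cold:.4f}s")
print(f"warm (re-touch):    {warm:.4f}s")
```

On typical hosts the first pass is several times slower; pinning cold guest RAM pays this allocation cost up front, which is where the extra initialization time goes.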

To (hopefully) show this more clearly, I ran some tests before and after these 
patches and averaged the results. I used a 50G guest with real vDPA hardware 
(Mellanox CX-6Dx):

0.) How many vhost_vdpa_listener_region_add() (pins) calls?

               | Total | Before qemu_main_loop | After qemu_main_loop
---------------|-------|-----------------------|---------------------
Before patches |   6   |           0           |          6
After patches  |  11   |           5           |          6

- After the patches, it looks like we doubled the work we're doing (given the 
extra 5 calls); however, the 6 calls that happen after entering 
qemu_main_loop() are essentially replays of the first 5 we did.

  * In other words, after the patches, the 6 calls made after entering 
qemu_main_loop() are performed much faster than the same 6 calls before the 
patches.

  * From my measurements, these are the timings it took to perform those 6 
calls after entering qemu_main_loop():
> Before patches: 0.0770s
> After patches:  0.0065s
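For scale, those two measurements work out to roughly a 12x speedup on the re-pin path (simple arithmetic on the numbers above):

```python
# Measured time for the 6 region_add calls made after entering qemu_main_loop()
before_patches = 0.0770  # seconds, before the patches
after_patches = 0.0065   # seconds, after the patches

speedup = before_patches / after_patches
saved = before_patches - after_patches
print(f"speedup: {speedup:.1f}x")  # ~11.8x
print(f"saved:   {saved:.4f}s")    # ~0.0705s
```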

---

1.) Time from starting the guest to entering qemu_main_loop():
  * Before patches: 0.112s
  * After patches:  3.900s

- This is due to the 5 early pins we now do with these patches, whereas before 
no pinning work happened at all during this phase.

- From measuring the time between the first and last 
vhost_vdpa_listener_region_add() calls during this period, this comes out to 
~3s for the early pinning.
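One way to reproduce this kind of measurement (a sketch; the trace line format below is an assumption, real output from QEMU's tracing will differ in its exact fields) is to take the span between the first and last vhost_vdpa_listener_region_add events in a timestamped trace log:

```python
import re

# Synthetic, hypothetical trace lines standing in for a real timestamped log.
trace_log = """\
12034.100000 vhost_vdpa_listener_region_add vdpa: 0x55d1c0e0 iova 0x0 llend 0x80000000
12034.900000 vhost_vdpa_listener_region_add vdpa: 0x55d1c0e0 iova 0x100000000 llend 0x1c0000000
12037.140000 vhost_vdpa_listener_region_add vdpa: 0x55d1c0e0 iova 0x1c0000000 llend 0x200000000
"""

# Collect the leading timestamps of matching events.
stamps = [float(m.group(1))
          for m in re.finditer(r"^(\d+\.\d+) vhost_vdpa_listener_region_add",
                               trace_log, re.MULTILINE)]
span = max(stamps) - min(stamps)
print(f"pinning span across {len(stamps)} region_add calls: {span:.2f}s")
```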


So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop).  Correct?



Correct. We only save ~0.07s from the pinning that happens in the main 
loop. But the extra 3s we now need to spend pinning before 
qemu_main_loop() overshadows it.



We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime.  Correct?

[...]



Correct. Though whether or not we want this trade I suppose is 
subjective. But the 50-60% reduction in guest-visible downtime is pretty 
nice if we can stomach the initial startup costs.



Let's see whether I understand...  Please correct my mistakes.

Memory pinning takes several seconds for large guests.

Your patch makes pinning much slower.  You're theorizing this is because
pinning cold memory is slower than pinning warm memory.

I suppose the extra time is saved elsewhere, i.e. the entire startup
time remains roughly the same.  Have you verified this experimentally?


Based on my measurements, we pay a ~3s increase in initialization time (pre 
qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier 
for a vhost-vDPA device. This resulted in:

* Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).

* Shorter downtime phase during live migration (see below).

* Slight increase in time for the device to be operational (e.g. guest sets 
DRIVER_OK).
   > This measured the start time of the guest to guest setting DRIVER_OK for 
the device:

 Before patch

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-07-04 Thread Markus Armbruster
Jonah Palmer  writes:

> On 6/26/25 8:08 AM, Markus Armbruster wrote:

[...]

> Apologies for the delay in getting back to you. I just wanted to be thorough 
> and answer everything as accurately and clearly as possible.
>
> 
>
> Before these patches, pinning started in vhost_vdpa_dev_start(), where the 
> memory listener was registered, and began calling 
> vhost_vdpa_listener_region_add() to invoke the actual memory pinning. This 
> happens after entering qemu_main_loop().
>
> After these patches, pinning started in vhost_dev_init() (specifically 
> vhost_vdpa_set_owner()), where the memory listener registration was moved to. 
> This happens *before* entering qemu_main_loop().
>
> However, the entirety of pinning doesn't all happen pre qemu_main_loop(). The 
> pinning that happens before we enter qemu_main_loop() is the full guest RAM 
> pinning, which is the main, heavy lifting work when it comes to pinning 
> memory.
>
> The rest of the pinning work happens after entering qemu_main_loop() 
> (approximately around the same timing as when pinning started before these 
> patches). But, since we already did the heavy lifting of the pinning work pre 
> qemu_main_loop() (e.g. all pages were already allocated and pinned), we're 
> just re-pinning here (i.e. kernel just updates its IOTLB tables for pages 
> that are already mapped and locked in RAM).
>
> This makes the pinning work we do after entering qemu_main_loop() much faster 
> compared to the same pinning we had to do before these patches.
>
> However, we have to pay a cost for this. Because we do the heavy lifting work 
> earlier pre qemu_main_loop(), we're pinning with cold memory. That is, the 
> guest hasn't touched its memory yet; all host pages are still anonymous 
> and unallocated. This essentially means that doing the pinning earlier is 
> more expensive time-wise given that we need to also allocate physical pages 
> for each chunk of memory.
>
> To (hopefully) show this more clearly, I ran some tests before and after 
> these patches and averaged the results. I used a 50G guest with real vDPA 
> hardware (Mellanox CX-6Dx):
>
> 0.) How many vhost_vdpa_listener_region_add() (pins) calls?
>
>                | Total | Before qemu_main_loop | After qemu_main_loop
> ---------------|-------|-----------------------|---------------------
> Before patches |   6   |           0           |          6
> After patches  |  11   |           5           |          6
>
> - After the patches, this looks like we doubled the work we're doing (given 
> the extra 5 calls), however, the 6 calls that happen after entering 
> qemu_main_loop() are essentially replays of the first 5 we did.
>
>  * In other words, after the patches, the 6 calls made after entering 
> qemu_main_loop() are performed much faster than the same 6 calls before the 
> patches.
>
>  * From my measurements, these are the timings it took to perform those 6 
> calls after entering qemu_main_loop():
>> Before patches: 0.0770s
>> After patches:  0.0065s
>
> ---
>
> 1.) Time from starting the guest to entering qemu_main_loop():
>  * Before patches: 0.112s
>  * After patches:  3.900s
>
> - This is due to the 5 early pins we're doing now with these patches, whereas 
> before we never did any pinning work at all.
>
> - From measuring the time between the first and last 
> vhost_vdpa_listener_region_add() calls during this period, this comes out to 
> ~3s for the early pinning.

So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop).  Correct?

We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime.  Correct?

[...]

>> Let's see whether I understand...  Please correct my mistakes.
>> 
>> Memory pinning takes several seconds for large guests.
>> 
>> Your patch makes pinning much slower.  You're theorizing this is because
>> pinning cold memory is slower than pinning warm memory.
>> 
>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>> time remains roughly the same.  Have you verified this experimentally?
>
> Based on my measurements that I did, we pay a ~3s increase in initialization 
> time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning 
> earlier for a vhost-vDPA device. This resulted in:
>
> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>
> * Shorter downtime phase during live migration (see below).
>
> * Slight increase in time for the device to be operational (e.g. guest sets 
> DRIVER_OK).
>   > This measured the start time of the guest to guest setting DRIVER_OK for 
> the device:
>
> Before patches: 22.46s
> After patches:  23.40s
>
> The real timesaver here is the guest-visible downtime during live migration 
> (when using a vhost-vDPA device). Since the heavy lifting of the memory 
> pinning is done during the initialization phas

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-07-02 Thread Jonah Palmer




On 6/26/25 8:08 AM, Markus Armbruster wrote:

Jonah Palmer  writes:


On 6/2/25 4:29 AM, Markus Armbruster wrote:

Butterfingers...  let's try this again.

Markus Armbruster writes:


Si-Wei Liu writes:


On 5/26/2025 2:16 AM, Markus Armbruster wrote:

Si-Wei Liu writes:


On 5/15/2025 11:40 PM, Markus Armbruster wrote:

Jason Wang writes:


On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So
moving that operation allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2seconds.

Adding Markus to see if this is a real problem or not.

I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care is time from executing qemu-system-FOO to guest
finish booting, and the guest takes 10s to boot, then an extra 0.2s
won't matter much.

There's no such delay of an extra 0.2s or higher per se, it's just shifting 
around the page pinning hiccup, no matter it is 0.2s or something else, from 
the time of guest booting up to before guest is booted. This saves back guest 
boot time or start up delay, but in turn the same delay effectively will be 
charged to VM launch time. We follow the same model with VFIO, which would see 
the same hiccup during launch (at an early stage where no real mgmt software 
would care about).


When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.


Not something like that, this page pinning hiccup is one time only that occurs 
in the very early stage when launching QEMU, i.e. there's no consistent delay 
every time when QMP is called. The delay in QMP response at that very point 
depends on how much memory the VM has, but this is just specific to VMs with VFIO 
or vDPA devices that have to pin memory for DMA. Having said, there's no extra 
delay at all if QEMU args has no vDPA device assignment, on the other hand, 
there's same delay or QMP hiccup when VFIO is around in QEMU args.


In what scenarios exactly is QMP delayed?

That said, this is not a new problem for QEMU in particular; this QMP delay is 
not peculiar to vDPA, as it exists with VFIO as well.


In what scenarios exactly is QMP delayed compared to before the patch?


The page pinning process now runs in a pretty early phase at
qemu_init() e.g. machine_run_board_init(),


It runs within

  qemu_init()
  qmp_x_exit_preconfig()
  qemu_init_board()
  machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command
x-exit-preconfig.

Correct?


before any QMP command can be serviced; the latter typically cannot run
until qemu_main_loop() starts, when the AIO context gets a chance to be
polled and its handlers dispatched to a bottom half (bh).


We create the QMP monitor within qemu_create_late_backends(), which runs
before qmp_x_exit_preconfig(), but commands get processed only in the
main loop, which we enter later.

Correct?
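The ordering described here (monitor created early, commands only drained once the main loop runs) can be mimicked with a generic sketch, nothing QEMU-specific: a request that arrives while initialization is still blocking simply waits until the loop starts.

```python
import queue
import time

requests = queue.Queue()

# The "monitor" exists early, so a client can already submit a command...
requests.put("qmp_capabilities")

t0 = time.perf_counter()
time.sleep(0.5)  # ...but blocking init work (e.g. early memory pinning) runs first

# Only now does the "main loop" start draining the queue.
cmd = requests.get()
latency = time.perf_counter() - t0
print(f"first command {cmd!r} serviced after {latency:.2f}s of init work")
```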


Technically it's not a real delay for a specific QMP command, but rather
an extended span of the initialization process that may take place before the
very first QMP request, usually qmp_capabilities, is
serviced. It's natural for mgmt software to expect initialization
delay for the first qmp_capabilities response if it has to immediately
issue one after launching qemu, especially when you have a large guest
with hundred GBs of memory and with passthrough device that has to pin
memory for DMA e.g. VFIO, the delayed effect from the QEMU
initialization process is very visible too.


The work clearly needs to be done.  Whether it needs to be blocking
other things is less clear.

Even if it doesn't need to be blocking, we may choose not to avoid
blocking for now.  That should be an informed decision, though.

All I'm trying to do here is understand the tradeoffs, so I can give
useful advice.


  On the other hand, before
the patch, if memory happens to be in the middle of being pinned, any
ongoing QMP can't be serviced by the QEMU main loop, either.


When exactly does this pinning happen before the patch?  In which
function?


Before the patches, the memory listener was registered in
vhost_vdpa_dev_start(), well after device initialization.

And by device initialization here I mean the
qemu_create_late_backends() function.

With these patches, the memory lis

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-06-26 Thread Markus Armbruster
Jonah Palmer  writes:

> On 6/2/25 4:29 AM, Markus Armbruster wrote:
>> Butterfingers...  let's try this again.
>>
>> Markus Armbruster writes:
>>
>>> Si-Wei Liu writes:
>>>
 On 5/26/2025 2:16 AM, Markus Armbruster wrote:
> Si-Wei Liu writes:
>
>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>> Jason Wang writes:
>>>
 On Thu, May 8, 2025 at 2:47 AM Jonah Palmer 
 wrote:
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the 
> migration is
> stopped, and before the workload is resumed at the destination.  This 
> is a
> period where neither traffic can flow nor the VM workload can 
> continue
> (downtime).
>
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initialized.  So
> moving that operation allows QEMU to communicate the maps to the kernel
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time in the initialization where QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2seconds.
 Adding Markus to see if this is a real problem or not.
>>> I guess the answer is "depends", and to get a more useful one, we need
>>> more information.
>>>
>>> When all you care is time from executing qemu-system-FOO to guest
>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>> won't matter much.
>> There's no such delay of an extra 0.2s or higher per se, it's just 
>> shifting around the page pinning hiccup, no matter it is 0.2s or 
>> something else, from the time of guest booting up to before guest is 
>> booted. This saves back guest boot time or start up delay, but in turn 
>> the same delay effectively will be charged to VM launch time. We follow 
>> the same model with VFIO, which would see the same hiccup during launch 
>> (at an early stage where no real mgmt software would care about).
>>
>>> When a management application runs qemu-system-FOO several times to
>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>
>> Not something like that, this page pinning hiccup is one time only that 
>> occurs in the very early stage when launching QEMU, i.e. there's no 
>> consistent delay every time when QMP is called. The delay in QMP 
>> response at that very point depends on how much memory the VM has, but 
>> this is just specific to VMs with VFIO or vDPA devices that have to pin 
>> memory for DMA. Having said, there's no extra delay at all if QEMU args 
>> has no vDPA device assignment, on the other hand, there's same delay or 
>> QMP hiccup when VFIO is around in QEMU args.
>>
>>> In what scenarios exactly is QMP delayed?
>> Having said, this is not a new problem to QEMU in particular, this QMP 
>> delay is not peculiar, as it exists with VFIO as well.
>
> In what scenarios exactly is QMP delayed compared to before the patch?

 The page pinning process now runs in a pretty early phase at
 qemu_init() e.g. machine_run_board_init(),
>>>
>>> It runs within
>>>
>>>  qemu_init()
>>>  qmp_x_exit_preconfig()
>>>  qemu_init_board()
>>>  machine_run_board_init()
>>>
>>> Except when --preconfig is given, it instead runs within QMP command
>>> x-exit-preconfig.
>>>
>>> Correct?
>>>
 before any QMP command can be serviced, the latter of which typically
 would be able to get run from qemu_main_loop() until the AIO gets
 chance to be started to get polled and dispatched to bh.
>>>
>>> We create the QMP monitor within qemu_create_late_backends(), which runs
>>> before qmp_x_exit_preconfig(), but commands get processed only in the
>>> main loop, which we enter later.
>>>
>>> Correct?
>>>
 Technically it's not a real delay for specific QMP command, but rather
 an extended span of initialization process may take place before the
 very first QMP request, usually qmp_capabilities, will be
 serviced. It's natural for mgmt software to expect initialization
 delay for the first qmp_capabilities response if it has to immediately
 issue one after launching qemu, especially when you have a large guest
 with hundred GBs of memory and with passthrough device that has to pin
 memory for DMA e.g. VFIO, the delayed effect from the QEMU
 initialization process is very visible too.
>>
>> The work clearly needs to be done.  Whether it needs to be blocking
>> other things is less clear.
>>
>> Even if it doesn't need to be blocking, we may choose not to avoid
>> blocking for now.  That should be an informed decision, though.
>>
>> All I'm trying to do here i

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-06-06 Thread Jonah Palmer


On 6/2/25 4:29 AM, Markus Armbruster wrote:

Butterfingers...  let's try this again.

Markus Armbruster writes:


Si-Wei Liu writes:


On 5/26/2025 2:16 AM, Markus Armbruster wrote:

Si-Wei Liu writes:


On 5/15/2025 11:40 PM, Markus Armbruster wrote:

Jason Wang writes:


On Thu, May 8, 2025 at 2:47 AM Jonah Palmer wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So
moving that operation allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2seconds.

Adding Markus to see if this is a real problem or not.

I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care is time from executing qemu-system-FOO to guest
finish booting, and the guest takes 10s to boot, then an extra 0.2s
won't matter much.

There's no such delay of an extra 0.2s or higher per se, it's just shifting 
around the page pinning hiccup, no matter it is 0.2s or something else, from 
the time of guest booting up to before guest is booted. This saves back guest 
boot time or start up delay, but in turn the same delay effectively will be 
charged to VM launch time. We follow the same model with VFIO, which would see 
the same hiccup during launch (at an early stage where no real mgmt software 
would care about).


When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.


Not something like that, this page pinning hiccup is one time only that occurs 
in the very early stage when launching QEMU, i.e. there's no consistent delay 
every time when QMP is called. The delay in QMP response at that very point 
depends on how much memory the VM has, but this is just specific to VMs with VFIO 
or vDPA devices that have to pin memory for DMA. Having said, there's no extra 
delay at all if QEMU args has no vDPA device assignment, on the other hand, 
there's same delay or QMP hiccup when VFIO is around in QEMU args.


In what scenarios exactly is QMP delayed?

Having said, this is not a new problem to QEMU in particular, this QMP delay is 
not peculiar, as it exists with VFIO as well.

In what scenarios exactly is QMP delayed compared to before the patch?

The page pinning process now runs in a pretty early phase at
qemu_init() e.g. machine_run_board_init(),

It runs within

 qemu_init()
 qmp_x_exit_preconfig()
 qemu_init_board()
 machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command
x-exit-preconfig.

Correct?


before any QMP command can be serviced, the latter of which typically
would be able to get run from qemu_main_loop() until the AIO gets
chance to be started to get polled and dispatched to bh.

We create the QMP monitor within qemu_create_late_backends(), which runs
before qmp_x_exit_preconfig(), but commands get processed only in the
main loop, which we enter later.

Correct?


Technically it's not a real delay for a specific QMP command, but rather
an extended span of the initialization process that may take place before
the very first QMP request, usually qmp_capabilities, is serviced. It's
natural for mgmt software to expect an initialization delay for the first
qmp_capabilities response if it has to issue one immediately after
launching QEMU; especially when you have a large guest with hundreds of
GBs of memory and a passthrough device that has to pin memory for DMA,
e.g. VFIO, the delayed effect of the QEMU initialization process is very
visible too.

The work clearly needs to be done.  Whether it needs to be blocking
other things is less clear.

Even if it doesn't need to be blocking, we may choose not to avoid
blocking for now.  That should be an informed decision, though.

All I'm trying to do here is understand the tradeoffs, so I can give
useful advice.


 On the other hand, before
the patch, if memory happens to be in the middle of being pinned, any
ongoing QMP can't be serviced by the QEMU main loop, either.

When exactly does this pinning happen before the patch?  In which
function?


Before the patches, the memory listener was registered in
vhost_vdpa_dev_start(), well after device initialization.

And by device initialization here I mean the
qemu_create_late_backends() function.

With these patches, the memory listener is now being
registered in vhost_vdpa_set_owner(), called from
vhost_dev_init().

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-06-02 Thread Markus Armbruster
Butterfingers...  let's try this again.

Markus Armbruster  writes:

> Si-Wei Liu  writes:
>
>> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>>> Si-Wei Liu  writes:
>>>
 On 5/15/2025 11:40 PM, Markus Armbruster wrote:
> Jason Wang  writes:
>
>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  
>> wrote:
>>> Current memory operations like pinning may take a lot of time at the
>>> destination.  Currently they are done after the source of the migration
>>> is stopped, and before the workload is resumed at the destination.  This
>>> is a period where neither traffic can flow, nor the VM workload can
>>> continue (downtime).
>>>
>>> We can do better, as we know the memory layout of the guest RAM at the
>>> destination from the moment that all devices are initialized.  So moving
>>> that operation allows QEMU to communicate the maps to the kernel while
>>> the workload is still running in the source, so Linux can start mapping
>>> them.
>>>
>>> As a small drawback, there is a time in the initialization where QEMU
>>> cannot respond to QMP etc.  By some testing, this time is about
>>> 0.2 seconds.
>> Adding Markus to see if this is a real problem or not.
> I guess the answer is "depends", and to get a more useful one, we need
> more information.
>
> When all you care is time from executing qemu-system-FOO to guest
> finish booting, and the guest takes 10s to boot, then an extra 0.2s
> won't matter much.

 There's no such delay of an extra 0.2s or higher per se; it's just
 shifting the page-pinning hiccup around, whether it is 0.2s or something
 else, from the time the guest boots up to before the guest is booted.
 This wins back guest boot time or start-up delay, but in turn the same
 delay is effectively charged to VM launch time. We follow the same model
 as VFIO, which sees the same hiccup during launch (at an early stage
 where no real mgmt software would care about it).

> When a management application runs qemu-system-FOO several times to
> probe its capabilities via QMP, then even milliseconds can hurt.
>
 Not quite. This page-pinning hiccup is a one-time event that occurs at a
 very early stage when launching QEMU, i.e. there's no consistent delay
 every time a QMP command is called. The delay in the QMP response at that
 point depends on how much memory the VM has, but this is specific to VMs
 with VFIO or vDPA devices that have to pin memory for DMA. That said,
 there's no extra delay at all if the QEMU args have no vDPA device
 assignment; on the other hand, there's the same delay or QMP hiccup when
 VFIO is present in the QEMU args.

> In what scenarios exactly is QMP delayed?

 That said, this is not a new problem for QEMU in particular; this QMP
 delay is not peculiar to vDPA, it exists with VFIO as well.
>>>
>>> In what scenarios exactly is QMP delayed compared to before the patch?
>>
>> The page pinning process now runs in a pretty early phase at
>> qemu_init() e.g. machine_run_board_init(),
>
> It runs within
>
> qemu_init()
> qmp_x_exit_preconfig()
> qemu_init_board()
> machine_run_board_init()
>
> Except when --preconfig is given, it instead runs within QMP command
> x-exit-preconfig.
>
> Correct?
>
>> before any QMP command can be serviced; QMP commands typically don't get
>> run until qemu_main_loop() starts and the AIO context gets a chance to be
>> polled and dispatched to a bottom half (bh).
>
> We create the QMP monitor within qemu_create_late_backends(), which runs
> before qmp_x_exit_preconfig(), but commands get processed only in the
> main loop, which we enter later.
>
> Correct?
>
>> Technically it's not a real delay for a specific QMP command, but rather
>> an extended span of the initialization process that may take place before
>> the very first QMP request, usually qmp_capabilities, is serviced. It's
>> natural for mgmt software to expect an initialization delay for the first
>> qmp_capabilities response if it has to issue one immediately after
>> launching QEMU; especially when you have a large guest with hundreds of
>> GBs of memory and a passthrough device that has to pin memory for DMA,
>> e.g. VFIO, the delayed effect of the QEMU initialization process is very
>> visible too.

The work clearly needs to be done.  Whether it needs to be blocking
other things is less clear.

Even if it doesn't need to be blocking, we may choose not to avoid
blocking for now.  That should be an informed decision, though.

All I'm trying to do here is understand the tradeoffs, so I can give
useful advice.

>> On the other hand, before
>> the patch, if memory happens to be in the middle of being pinned, any
>> ongoing QMP can't be serviced by the QEMU main loop, either.

When exactly does this pinning happen before the patch?  In which
function?

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-06-02 Thread Markus Armbruster
Si-Wei Liu  writes:

> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>> Si-Wei Liu  writes:
>>
>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
 Jason Wang  writes:

> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  
> wrote:
>> Current memory operations like pinning may take a lot of time at the
>> destination.  Currently they are done after the source of the migration
>> is stopped, and before the workload is resumed at the destination.  This
>> is a period where neither traffic can flow, nor the VM workload can
>> continue (downtime).
>>
>> We can do better, as we know the memory layout of the guest RAM at the
>> destination from the moment that all devices are initialized.  So moving
>> that operation allows QEMU to communicate the maps to the kernel while
>> the workload is still running in the source, so Linux can start mapping
>> them.
>>
>> As a small drawback, there is a time in the initialization where QEMU
>> cannot respond to QMP etc.  By some testing, this time is about
>> 0.2 seconds.
> Adding Markus to see if this is a real problem or not.
 I guess the answer is "depends", and to get a more useful one, we need
 more information.

 When all you care is time from executing qemu-system-FOO to guest
 finish booting, and the guest takes 10s to boot, then an extra 0.2s
 won't matter much.
>>>
>>> There's no such delay of an extra 0.2s or higher per se; it's just
>>> shifting the page-pinning hiccup around, whether it is 0.2s or something
>>> else, from the time the guest boots up to before the guest is booted.
>>> This wins back guest boot time or start-up delay, but in turn the same
>>> delay is effectively charged to VM launch time. We follow the same model
>>> as VFIO, which sees the same hiccup during launch (at an early stage
>>> where no real mgmt software would care about it).
>>>
 When a management application runs qemu-system-FOO several times to
 probe its capabilities via QMP, then even milliseconds can hurt.

>>> Not quite. This page-pinning hiccup is a one-time event that occurs at a
>>> very early stage when launching QEMU, i.e. there's no consistent delay
>>> every time a QMP command is called. The delay in the QMP response at that
>>> point depends on how much memory the VM has, but this is specific to VMs
>>> with VFIO or vDPA devices that have to pin memory for DMA. That said,
>>> there's no extra delay at all if the QEMU args have no vDPA device
>>> assignment; on the other hand, there's the same delay or QMP hiccup when
>>> VFIO is present in the QEMU args.
>>>
 In what scenarios exactly is QMP delayed?
>>>
>>> That said, this is not a new problem for QEMU in particular; this QMP
>>> delay is not peculiar to vDPA, it exists with VFIO as well.
>>
>> In what scenarios exactly is QMP delayed compared to before the patch?
>
> The page pinning process now runs in a pretty early phase at
> qemu_init() e.g. machine_run_board_init(),

It runs within

qemu_init()
qmp_x_exit_preconfig()
qemu_init_board()
machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command
x-exit-preconfig.

Correct?

> before any QMP command can be serviced; QMP commands typically don't get
> run until qemu_main_loop() starts and the AIO context gets a chance to be
> polled and dispatched to a bottom half (bh).

We create the QMP monitor within qemu_create_late_backends(), which runs
before qmp_x_exit_preconfig(), but commands get processed only in the
main loop, which we enter later.

Correct?

> Technically it's not a real delay for a specific QMP command, but rather
> an extended span of the initialization process that may take place before
> the very first QMP request, usually qmp_capabilities, is serviced. It's
> natural for mgmt software to expect an initialization delay for the first
> qmp_capabilities response if it has to issue one immediately after
> launching QEMU; especially when you have a large guest with hundreds of
> GBs of memory and a passthrough device that has to pin memory for DMA,
> e.g. VFIO, the delayed effect of the QEMU initialization process is very
> visible too.



> On the other hand, before
> the patch, if memory happens to be in the middle of being pinned, any
> ongoing QMP can't be serviced by the QEMU main loop, either.
>
> I'd also like to highlight that without this patch, the pretty high
> delay due to page pinning is even visible to the guest in addition to
> the QMP delay, which already largely affected guest boot time with a
> vDPA device. It is long-standing, and every VM user with a vDPA device
> would like to avoid such a high delay on the first boot, which is not
> seen with a similar device, e.g. VFIO passthrough.
>
>>
>>> Thanks,
>>> -Siwei
>>>
 You told us an absolute delay you observed.  What's the relative delay,
 i.e. what's the delay with and without these patches?

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-29 Thread Si-Wei Liu




On 5/26/2025 2:16 AM, Markus Armbruster wrote:

Si-Wei Liu  writes:


On 5/15/2025 11:40 PM, Markus Armbruster wrote:

Jason Wang  writes:


On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow, nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So
moving that operation allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.

Adding Markus to see if this is a real problem or not.

I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care is time from executing qemu-system-FOO to guest
finish booting, and the guest takes 10s to boot, then an extra 0.2s
won't matter much.

There's no such delay of an extra 0.2s or higher per se; it's just shifting
the page-pinning hiccup around, whether it is 0.2s or something else, from
the time the guest boots up to before the guest is booted. This wins back
guest boot time or start-up delay, but in turn the same delay is effectively
charged to VM launch time. We follow the same model as VFIO, which sees the
same hiccup during launch (at an early stage where no real mgmt software
would care about it).


When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.


Not quite. This page-pinning hiccup is a one-time event that occurs at a
very early stage when launching QEMU, i.e. there's no consistent delay
every time a QMP command is called. The delay in the QMP response at that
point depends on how much memory the VM has, but this is specific to VMs
with VFIO or vDPA devices that have to pin memory for DMA. That said,
there's no extra delay at all if the QEMU args have no vDPA device
assignment; on the other hand, there's the same delay or QMP hiccup when
VFIO is present in the QEMU args.


In what scenarios exactly is QMP delayed?

That said, this is not a new problem for QEMU in particular; this QMP delay
is not peculiar to vDPA, it exists with VFIO as well.

In what scenarios exactly is QMP delayed compared to before the patch?
The page pinning process now runs in a pretty early phase in qemu_init(),
e.g. machine_run_board_init(), before any QMP command can be serviced;
QMP commands typically don't get run until qemu_main_loop() starts and
the AIO context gets a chance to be polled and dispatched to a bottom
half (bh). Technically it's not a real delay for a specific QMP command,
but rather an extended span of the initialization process that may take
place before the very first QMP request, usually qmp_capabilities, is
serviced. It's natural for mgmt software to expect an initialization
delay for the first qmp_capabilities response if it has to issue one
immediately after launching QEMU; especially when you have a large guest
with hundreds of GBs of memory and a passthrough device that has to pin
memory for DMA, e.g. VFIO, the delayed effect of the QEMU initialization
process is very visible too. On the other hand, before the patch, if
memory happens to be in the middle of being pinned, any ongoing QMP
can't be serviced by the QEMU main loop, either.


I'd also like to highlight that without this patch, the pretty high
delay due to page pinning is even visible to the guest in addition to
the QMP delay, which already largely affected guest boot time with a
vDPA device. It is long-standing, and every VM user with a vDPA device
would like to avoid such a high delay on the first boot, which is not
seen with a similar device, e.g. VFIO passthrough.





Thanks,
-Siwei


You told us an absolute delay you observed.  What's the relative delay,
i.e. what's the delay with and without these patches?

Can you answer this question?
I thought I already got that answered in an earlier reply. The relative
delay depends on the size of memory. Usually mgmt software won't be
able to notice, unless the guest has more than 100GB of THP memory to
pin, for DMA or whatever reason.






We need QMP to become available earlier in the startup sequence for
other reasons.  Could we bypass the delay that way?  Please understand
that this would likely be quite difficult: we know from experience that
messing with the startup sequence is prone to introduce subtle
compatibility breaks and even bugs.


(I remember VFIO has some optimization in the speed of the pinning,
could vDPA do the same?)

That's well outside my bailiwick :)


Please be understood that any p

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-26 Thread Markus Armbruster
Si-Wei Liu  writes:

> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>> Jason Wang  writes:
>>
>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:
 Current memory operations like pinning may take a lot of time at the
 destination.  Currently they are done after the source of the migration is
 stopped, and before the workload is resumed at the destination.  This is a
 period where neither traffic can flow, nor the VM workload can continue
 (downtime).

 We can do better as we know the memory layout of the guest RAM at the
 destination from the moment that all devices are initialized.  So
 moving that operation allows QEMU to communicate the maps to the kernel
 while the workload is still running in the source, so Linux can start
 mapping them.

 As a small drawback, there is a time in the initialization where QEMU
 cannot respond to QMP etc.  By some testing, this time is about
 0.2 seconds.
>>>
>>> Adding Markus to see if this is a real problem or not.
>>
>> I guess the answer is "depends", and to get a more useful one, we need
>> more information.
>>
>> When all you care is time from executing qemu-system-FOO to guest
>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>> won't matter much.
>
> There's no such delay of an extra 0.2s or higher per se; it's just
> shifting the page-pinning hiccup around, whether it is 0.2s or something
> else, from the time the guest boots up to before the guest is booted.
> This wins back guest boot time or start-up delay, but in turn the same
> delay is effectively charged to VM launch time. We follow the same model
> as VFIO, which sees the same hiccup during launch (at an early stage
> where no real mgmt software would care about it).
>
>> When a management application runs qemu-system-FOO several times to
>> probe its capabilities via QMP, then even milliseconds can hurt.
>>
> Not quite. This page-pinning hiccup is a one-time event that occurs at a
> very early stage when launching QEMU, i.e. there's no consistent delay
> every time a QMP command is called. The delay in the QMP response at that
> point depends on how much memory the VM has, but this is specific to VMs
> with VFIO or vDPA devices that have to pin memory for DMA. That said,
> there's no extra delay at all if the QEMU args have no vDPA device
> assignment; on the other hand, there's the same delay or QMP hiccup when
> VFIO is present in the QEMU args.
>
>> In what scenarios exactly is QMP delayed?
>
> That said, this is not a new problem for QEMU in particular; this QMP
> delay is not peculiar to vDPA, it exists with VFIO as well.

In what scenarios exactly is QMP delayed compared to before the patch?

> Thanks,
> -Siwei
>
>>
>> You told us an absolute delay you observed.  What's the relative delay,
>> i.e. what's the delay with and without these patches?

Can you answer this question?

>> We need QMP to become available earlier in the startup sequence for
>> other reasons.  Could we bypass the delay that way?  Please understand
>> that this would likely be quite difficult: we know from experience that
>> messing with the startup sequence is prone to introduce subtle
>> compatibility breaks and even bugs.
>>
>>> (I remember VFIO has some optimization in the speed of the pinning,
>>> could vDPA do the same?)
>>
>> That's well outside my bailiwick :)
>>
>> [...]
>>




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-20 Thread Jonah Palmer


On 5/14/25 11:49 AM, Eugenio Perez Martin wrote:

On Wed, May 7, 2025 at 8:47 PM Jonah Palmer wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow, nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So
moving that operation allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.  This may be further reduced (or increased) depending on the
vdpa driver and the platform hardware, and it is dominated by the cost
of memory pinning.

This matches the time that we move out of the so-called downtime window.
The downtime is measured by checking the trace timestamps from the moment
the source suspends the device to the moment the destination starts the
eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
secs to 2.0949.


Hi Jonah,

Could you update this benchmark? I don't think it changed a lot but
just to be as updated as possible.


Yes, will update this for 39G guest and for 128G guests :)



I think I cannot ack the series as I sent the first revision. Jason or
Si-Wei, could you ack it?

Thanks!


Future directions on top of this series may include to move more things ahead
of the migration time, like set DRIVER_OK or perform actual iterative migration
of virtio-net devices.

Comments are welcome.

This series is a different approach of series [1]. As the title does not
reflect the changes anymore, please refer to the previous one to know the
series history.

This series is based on [2], it must be applied after it.

[Jonah Palmer]
This series was rebased after [3] was pulled in, as [3] was a prerequisite
fix for this series.

v4:
---
* Add memory listener unregistration to vhost_vdpa_reset_device.
* Remove memory listener unregistration from vhost_vdpa_reset_status.

v3:
---
* Rebase

v2:
---
* Move the memory listener registration to vhost_vdpa_set_owner function.
* Move the iova_tree allocation to net_vhost_vdpa_init.

v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html

[1] https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
[3] https://lore.kernel.org/qemu-devel/[email protected]/


Jonah - note: I'll be on vacation from May 10-19. Will respond to
   comments when I return.

Eugenio Pérez (7):
   vdpa: check for iova tree initialized at net_client_start
   vdpa: reorder vhost_vdpa_set_backend_cap
   vdpa: set backend capabilities at vhost_vdpa_init
   vdpa: add listener_registered
   vdpa: reorder listener assignment
   vdpa: move iova_tree allocation to net_vhost_vdpa_init
   vdpa: move memory listener register to vhost_vdpa_init

  hw/virtio/vhost-vdpa.c | 107 +
  include/hw/virtio/vhost-vdpa.h |  22 ++-
  net/vhost-vdpa.c   |  34 +--
  3 files changed, 93 insertions(+), 70 deletions(-)

--
2.43.5


Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-16 Thread Si-Wei Liu




On 5/15/2025 11:40 PM, Markus Armbruster wrote:

Jason Wang  writes:


On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow, nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So
moving that operation allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.

Adding Markus to see if this is a real problem or not.

I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care is time from executing qemu-system-FOO to guest
finish booting, and the guest takes 10s to boot, then an extra 0.2s
won't matter much.
There's no such delay of an extra 0.2s or higher per se; it's just
shifting the page-pinning hiccup around, whether it is 0.2s or
something else, from the time the guest boots up to before the guest is
booted. This wins back guest boot time or start-up delay, but in turn
the same delay is effectively charged to VM launch time. We follow
the same model as VFIO, which sees the same hiccup during launch
(at an early stage where no real mgmt software would care about it).




When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.
Not quite. This page-pinning hiccup is a one-time event that occurs
at a very early stage when launching QEMU, i.e. there's no consistent
delay every time a QMP command is called. The delay in the QMP
response at that point depends on how much memory the VM has, but
this is specific to VMs with VFIO or vDPA devices that have to pin
memory for DMA. That said, there's no extra delay at all if the QEMU
args have no vDPA device assignment; on the other hand, there's the
same delay or QMP hiccup when VFIO is present in the QEMU args.

In what scenarios exactly is QMP delayed?
That said, this is not a new problem for QEMU in particular; this QMP
delay is not peculiar to vDPA, it exists with VFIO as well.


Thanks,
-Siwei



You told us an absolute delay you observed.  What's the relative delay,
i.e. what's the delay with and without these patches?

We need QMP to become available earlier in the startup sequence for
other reasons.  Could we bypass the delay that way?  Please understand
that this would likely be quite difficult: we know from experience that
messing with the startup sequence is prone to introduce subtle
compatility breaks and even bugs.


(I remember VFIO has some optimization in the speed of the pinning,
could vDPA do the same?)

That's well outside my bailiwick :)

[...]






Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-16 Thread Michael S. Tsirkin
On Thu, May 15, 2025 at 10:41:45AM -0700, Si-Wei Liu wrote:
> 
> 
> On 5/14/2025 10:43 PM, Michael S. Tsirkin wrote:
> > On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote:
> > > Hi Eugenio,
> > > 
> > > On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > > > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  
> > > > wrote:
> > > > > Current memory operations like pinning may take a lot of time at the
> > > > > destination.  Currently they are done after the source of the
> > > > > migration is stopped, and before the workload is resumed at the
> > > > > destination.  This is a period where neither traffic can flow, nor
> > > > > the VM workload can continue (downtime).
> > > > > 
> > > > > We can do better, as we know the memory layout of the guest RAM at the
> > > > > destination from the moment that all devices are initialized.  So
> > > > > moving that operation allows QEMU to communicate the maps to the kernel
> > > > > while the workload is still running in the source, so Linux can start
> > > > > mapping them.
> > > > > 
> > > > > As a small drawback, there is a time in the initialization where QEMU
> > > > > cannot respond to QMP etc.  By some testing, this time is about
> > > > > 0.2 seconds.  This may be further reduced (or increased) depending on
> > > > > the vdpa driver and the platform hardware, and it is dominated by the
> > > > > cost of memory pinning.
> > > > > 
> > > > > This matches the time that we move out of the so-called downtime
> > > > > window.  The downtime is measured by checking the trace timestamps
> > > > > from the moment the source suspends the device to the moment the
> > > > > destination starts the eighth and last virtqueue pair.  For a 39G
> > > > > guest, it goes from ~2.2526 secs to 2.0949.
> > > > > 
> > > > Hi Jonah,
> > > > 
> > > > Could you update this benchmark? I don't think it changed a lot but
> > > > just to be as updated as possible.
> > > Jonah is off this week and won't be back until next Tuesday, but I
> > > recall he indeed did some downtime tests with a VM with 128GB memory
> > > before taking off, which show an obvious improvement from around 10
> > > seconds to 5.8 seconds after applying this series. Since this is
> > > related to an update on the cover letter, would it be okay for you and
> > > Jason to ack now and then proceed to Michael for the upcoming merge?
> > > 
> > > > I think I cannot ack the series as I sent the first revision. Jason or
> > > > Si-Wei, could you ack it?
> > > Sure, I just gave my R-b; this series looks good to me. Hopefully Jason
> > > can ack on his own.
> > > 
> > > Thanks!
> > > -Siwei
> > I just sent a pull, next one in a week or two, so - no rush.
> All right, should be good to wait. In any case you have to repost a v2 PULL,
> hope this series can be piggy-back'ed as we did extensive tests about it.
> ;-)
> 
> -Siwei

You mean "in case"?

> > 
> > 
> > > > Thanks!
> > > > 
> > > > > Future directions on top of this series may include to move more 
> > > > > things ahead
> > > > > of the migration time, like set DRIVER_OK or perform actual iterative 
> > > > > migration
> > > > > of virtio-net devices.
> > > > > 
> > > > > Comments are welcome.
> > > > > 
> > > > > This series is a different approach of series [1]. As the title does 
> > > > > not
> > > > > reflect the changes anymore, please refer to the previous one to know 
> > > > > the
> > > > > series history.
> > > > > 
> > > > > This series is based on [2], it must be applied after it.
> > > > > 
> > > > > [Jonah Palmer]
> > > > > This series was rebased after [3] was pulled in, as [3] was a 
> > > > > prerequisite
> > > > > fix for this series.
> > > > > 
> > > > > v4:
> > > > > ---
> > > > > * Add memory listener unregistration to vhost_vdpa_reset_device.
> > > > > * Remove memory listener unregistration from vhost_vdpa_reset_status.
> > > > > 
> > > > > v3:
> > > > > ---
> > > > > * Rebase
> > > > > 
> > > > > v2:
> > > > > ---
> > > > > * Move the memory listener registration to vhost_vdpa_set_owner 
> > > > > function.
> > > > > * Move the iova_tree allocation to net_vhost_vdpa_init.
> > > > > 
> > > > > v1 at 
> > > > > https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> > > > > 
> > > > > [1] 
> > > > > https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> > > > > [2] 
> > > > > https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> > > > > [3] 
> > > > > https://lore.kernel.org/qemu-devel/[email protected]/
> > > > > 
> > > > > Jonah - note: I'll be on vacation from May 10-19. Will respond to
> > > > > comments when I return.
> > > > > 
> > > > > Eugenio Pérez (7):
> > > > > vdpa: check for iova tree initialized at net_client_start
> > > > > vdpa: reorder vhost_vdpa_set_backend_cap
> > > > > vdpa: set backend capabilities at vhost_vdpa_init
> > > > > vdpa: add listener_registered

Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-15 Thread Markus Armbruster
Jason Wang  writes:

> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:
>>
>> Current memory operations like pinning may take a lot of time at the
>> destination.  Currently they are done after the source of the migration is
>> stopped, and before the workload is resumed at the destination.  This is a
>> period where neither traffic can flow, nor the VM workload can continue
>> (downtime).
>>
>> We can do better as we know the memory layout of the guest RAM at the
>> destination from the moment that all devices are initializaed.  So
>> moving that operation allows QEMU to communicate the kernel the maps
>> while the workload is still running in the source, so Linux can start
>> mapping them.
>>
>> As a small drawback, there is a time in the initialization where QEMU
>> cannot respond to QMP etc.  By some testing, this time is about
>> 0.2seconds.
>
> Adding Markus to see if this is a real problem or not.

I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care about is the time from executing qemu-system-FOO to the
guest finishing booting, and the guest takes 10s to boot, then an extra
0.2s won't matter much.

When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.

In what scenarios exactly is QMP delayed?

You told us an absolute delay you observed.  What's the relative delay,
i.e. what's the delay with and without these patches?

We need QMP to become available earlier in the startup sequence for
other reasons.  Could we bypass the delay that way?  Please understand
that this would likely be quite difficult: we know from experience that
messing with the startup sequence is prone to introduce subtle
compatibility breaks and even bugs.

> (I remember VFIO has some optimization in the speed of the pinning,
> could vDPA do the same?)

That's well outside my bailiwick :)

[...]




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-15 Thread Jason Wang
On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neither traffic can flow nor the VM workload can continue
> (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initialized.  So moving
> that operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time in the initialization where QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2 seconds.

Adding Markus to see if this is a real problem or not.

(I remember VFIO has some optimization in the speed of the pinning,
could vDPA do the same?)

Thanks

> This may be further reduced (or increased) depending on the
> vdpa driver and the platform hardware, and it is dominated by the cost
> of memory pinning.
>
> This matches the time that we move out of the so-called downtime window.
> The downtime is measured by checking the trace timestamps from the moment
> the source suspends the device to the moment the destination starts the
> eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> secs to 2.0949.
>
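The downtime arithmetic quoted above can be sanity-checked with a small sketch (illustrative only; the timestamps are the figures reported in this thread, not a QEMU API):

```python
# Illustrative arithmetic for the downtime window described above.
# Downtime = time from the source suspending the device to the
# destination starting the eighth (last) virtqueue pair.

def downtime(t_suspend: float, t_last_vq_start: float) -> float:
    """Downtime in seconds between two trace timestamps."""
    return t_last_vq_start - t_suspend

# Numbers quoted in the cover letter for a 39G guest:
before = 2.2526   # seconds, without this series
after = 2.0949    # seconds, with memory pinned at device init

saving = before - after
print(f"downtime saved: {saving:.4f}s ({100 * saving / before:.1f}%)")
# -> downtime saved: 0.1577s (7.0%)
```

So the series moves roughly 0.16 seconds of pinning work out of the blackout window for this guest size; the absolute saving grows with guest RAM.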
> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative 
> migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach from series [1]. As the title no longer
> reflects the changes, please refer to the previous one for the series
> history.
>
> This series is based on [2], it must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
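The v4 change above can be pictured with a toy model (the names below are illustrative, not the actual QEMU code): the listener is registered once at init time, and is now torn down in the device-reset path rather than in the status-reset path:

```python
# Toy model of the memory-listener lifecycle after this series (v4).
# Registration happens at init; unregistration moved to reset_device.
class ToyVhostVdpa:
    def __init__(self):
        self.listener_registered = False
        # v4: the memory listener is registered at init time, so
        # pinning can start before the workload is resumed.
        self.register_listener()

    def register_listener(self):
        self.listener_registered = True

    def reset_device(self):
        # v4: unregister the listener here ...
        self.listener_registered = False

    def reset_status(self):
        # ... and no longer here.
        pass

dev = ToyVhostVdpa()
assert dev.listener_registered          # registered from init
dev.reset_status()
assert dev.listener_registered          # status reset keeps the listener
dev.reset_device()
assert not dev.listener_registered      # device reset tears it down
```

The point of keeping the listener across status resets is that the guest memory maps (and hence the pinning work) survive events that do not destroy the device.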
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] 
> https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] 
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>   comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c | 107 +
>  include/hw/virtio/vhost-vdpa.h |  22 ++-
>  net/vhost-vdpa.c   |  34 +--
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-15 Thread Jason Wang
On Thu, May 15, 2025 at 8:17 AM Si-Wei Liu  wrote:
>
> Hi Eugenio,
>
> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  wrote:
> >> Current memory operations like pinning may take a lot of time at the
> >> destination.  Currently they are done after the source of the migration is
> >> stopped, and before the workload is resumed at the destination.  This is a
> >> period where neither traffic can flow nor the VM workload can continue
> >> (downtime).
> >>
> >> We can do better, as we know the memory layout of the guest RAM at the
> >> destination from the moment that all devices are initialized.  So moving
> >> that operation earlier allows QEMU to communicate the maps to the kernel
> >> while the workload is still running in the source, so Linux can start
> >> mapping them.
> >>
> >> As a small drawback, there is a time in the initialization where QEMU
> >> cannot respond to QMP etc.  By some testing, this time is about
> >> 0.2 seconds.  This may be further reduced (or increased) depending on the
> >> vdpa driver and the platform hardware, and it is dominated by the cost
> >> of memory pinning.
> >>
> >> This matches the time that we move out of the so-called downtime window.
> >> The downtime is measured by checking the trace timestamps from the moment
> >> the source suspends the device to the moment the destination starts the
> >> eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> >> secs to 2.0949.
> >>
> > Hi Jonah,
> >
> > Could you update this benchmark? I don't think it changed a lot but
> > just to be as updated as possible.
> Jonah is off this week and will be back next Tuesday, but I recall
> he did some downtime tests with a VM with 128GB of memory before taking
> off, which showed an obvious improvement from around 10 seconds to 5.8
> seconds after applying this series. Since this only concerns an update to
> the cover letter, would it be okay for you and Jason to ack now and then
> proceed to Michael for the upcoming merge?
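For scale, the two measurements mentioned in this thread (the 39G and 128GB guests) can be put side by side. This is a rough sketch using only the figures quoted here; actual pinning time depends on the vdpa driver and platform hardware:

```python
# Rough comparison of the downtime improvements quoted in this thread.
# Format: guest RAM -> (downtime before, downtime after), in seconds.
measurements = {
    "39G guest": (2.2526, 2.0949),
    "128GB guest": (10.0, 5.8),
}

for guest, (before, after) in measurements.items():
    reduction = 100 * (before - after) / before
    print(f"{guest}: {before}s -> {after}s ({reduction:.0f}% less downtime)")
```

The relative win grows sharply with guest size (roughly 7% for 39G vs. 42% for 128GB), which is consistent with pinning cost dominating the downtime for large guests.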

I will go through the series.

Thanks




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-15 Thread Jason Wang
On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neither traffic can flow nor the VM workload can continue
> (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initialized.  So moving
> that operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time in the initialization where QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2 seconds.  This may be further reduced (or increased) depending on the
> vdpa driver and the platform hardware, and it is dominated by the cost
> of memory pinning.
>
> This matches the time that we move out of the so-called downtime window.
> The downtime is measured by checking the trace timestamps from the moment
> the source suspends the device to the moment the destination starts the
> eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> secs to 2.0949.
>
> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative 
> migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach from series [1]. As the title no longer
> reflects the changes, please refer to the previous one for the series
> history.
>
> This series is based on [2], it must be applied after it.

Note that this has been merged.

Thanks

>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] 
> https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] 
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>   comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c | 107 +
>  include/hw/virtio/vhost-vdpa.h |  22 ++-
>  net/vhost-vdpa.c   |  34 +--
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-15 Thread Si-Wei Liu




On 5/14/2025 10:43 PM, Michael S. Tsirkin wrote:

On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote:

Hi Eugenio,

On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:

On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So moving
that operation earlier allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.  This may be further reduced (or increased) depending on the
vdpa driver and the platform hardware, and it is dominated by the cost
of memory pinning.

This matches the time that we move out of the so-called downtime window.
The downtime is measured by checking the trace timestamps from the moment
the source suspends the device to the moment the destination starts the
eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
secs to 2.0949.


Hi Jonah,

Could you update this benchmark? I don't think it changed a lot but
just to be as updated as possible.

Jonah is off this week and will be back next Tuesday, but I recall he
did some downtime tests with a VM with 128GB of memory before taking off,
which showed an obvious improvement from around 10 seconds to 5.8 seconds
after applying this series. Since this only concerns an update to the cover
letter, would it be okay for you and Jason to ack now and then proceed to
Michael for the upcoming merge?


I think I cannot ack the series as I sent the first revision. Jason or
Si-Wei, could you ack it?

Sure, I just gave my R-b; this series looks good to me. Hopefully Jason can
ack on his own.

Thanks!
-Siwei

I just sent a pull, next one in a week or two, so - no rush.
All right, it should be fine to wait. In any case you'll have to repost a v2
PULL; hope this series can be piggy-backed on it, as we did extensive
testing on it. ;-)


-Siwei





Thanks!


Future directions on top of this series may include to move more things ahead
of the migration time, like set DRIVER_OK or perform actual iterative migration
of virtio-net devices.

Comments are welcome.

This series is a different approach from series [1]. As the title no longer
reflects the changes, please refer to the previous one for the series
history.

This series is based on [2], it must be applied after it.

[Jonah Palmer]
This series was rebased after [3] was pulled in, as [3] was a prerequisite
fix for this series.

v4:
---
* Add memory listener unregistration to vhost_vdpa_reset_device.
* Remove memory listener unregistration from vhost_vdpa_reset_status.

v3:
---
* Rebase

v2:
---
* Move the memory listener registration to vhost_vdpa_set_owner function.
* Move the iova_tree allocation to net_vhost_vdpa_init.

v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.

[1] 
https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
[3] 
https://lore.kernel.org/qemu-devel/[email protected]/

Jonah - note: I'll be on vacation from May 10-19. Will respond to
comments when I return.

Eugenio Pérez (7):
vdpa: check for iova tree initialized at net_client_start
vdpa: reorder vhost_vdpa_set_backend_cap
vdpa: set backend capabilities at vhost_vdpa_init
vdpa: add listener_registered
vdpa: reorder listener assignment
vdpa: move iova_tree allocation to net_vhost_vdpa_init
vdpa: move memory listener register to vhost_vdpa_init

   hw/virtio/vhost-vdpa.c | 107 +
   include/hw/virtio/vhost-vdpa.h |  22 ++-
   net/vhost-vdpa.c   |  34 +--
   3 files changed, 93 insertions(+), 70 deletions(-)

--
2.43.5






Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-15 Thread Eugenio Perez Martin
On Thu, May 15, 2025 at 2:17 AM Si-Wei Liu  wrote:
>
> Hi Eugenio,
>
> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  wrote:
> >> Current memory operations like pinning may take a lot of time at the
> >> destination.  Currently they are done after the source of the migration is
> >> stopped, and before the workload is resumed at the destination.  This is a
> >> period where neither traffic can flow nor the VM workload can continue
> >> (downtime).
> >>
> >> We can do better, as we know the memory layout of the guest RAM at the
> >> destination from the moment that all devices are initialized.  So moving
> >> that operation earlier allows QEMU to communicate the maps to the kernel
> >> while the workload is still running in the source, so Linux can start
> >> mapping them.
> >>
> >> As a small drawback, there is a time in the initialization where QEMU
> >> cannot respond to QMP etc.  By some testing, this time is about
> >> 0.2 seconds.  This may be further reduced (or increased) depending on the
> >> vdpa driver and the platform hardware, and it is dominated by the cost
> >> of memory pinning.
> >>
> >> This matches the time that we move out of the so-called downtime window.
> >> The downtime is measured by checking the trace timestamps from the moment
> >> the source suspends the device to the moment the destination starts the
> >> eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> >> secs to 2.0949.
> >>
> > Hi Jonah,
> >
> > Could you update this benchmark? I don't think it changed a lot but
> > just to be as updated as possible.
> Jonah is off this week and will be back next Tuesday, but I recall
> he did some downtime tests with a VM with 128GB of memory before taking
> off, which showed an obvious improvement from around 10 seconds to 5.8
> seconds after applying this series. Since this only concerns an update to
> the cover letter, would it be okay for you and Jason to ack now and then
> proceed to Michael for the upcoming merge?
>

Oh yes that's what I meant, I should have been more explicit about that :).


> >
> > I think I cannot ack the series as I sent the first revision. Jason or
> > Si-Wei, could you ack it?
> Sure, I just gave my R-b; this series looks good to me. Hopefully Jason
> can ack on his own.
>
> Thanks!
> -Siwei
>
> >
> > Thanks!
> >
> >> Future directions on top of this series may include to move more things 
> >> ahead
> >> of the migration time, like set DRIVER_OK or perform actual iterative 
> >> migration
> >> of virtio-net devices.
> >>
> >> Comments are welcome.
> >>
> >> This series is a different approach from series [1]. As the title no
> >> longer reflects the changes, please refer to the previous one for the
> >> series history.
> >>
> >> This series is based on [2], it must be applied after it.
> >>
> >> [Jonah Palmer]
> >> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> >> fix for this series.
> >>
> >> v4:
> >> ---
> >> * Add memory listener unregistration to vhost_vdpa_reset_device.
> >> * Remove memory listener unregistration from vhost_vdpa_reset_status.
> >>
> >> v3:
> >> ---
> >> * Rebase
> >>
> >> v2:
> >> ---
> >> * Move the memory listener registration to vhost_vdpa_set_owner function.
> >> * Move the iova_tree allocation to net_vhost_vdpa_init.
> >>
> >> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> >>
> >> [1] 
> >> https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> >> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> >> [3] 
> >> https://lore.kernel.org/qemu-devel/[email protected]/
> >>
> >> Jonah - note: I'll be on vacation from May 10-19. Will respond to
> >>comments when I return.
> >>
> >> Eugenio Pérez (7):
> >>vdpa: check for iova tree initialized at net_client_start
> >>vdpa: reorder vhost_vdpa_set_backend_cap
> >>vdpa: set backend capabilities at vhost_vdpa_init
> >>vdpa: add listener_registered
> >>vdpa: reorder listener assignment
> >>vdpa: move iova_tree allocation to net_vhost_vdpa_init
> >>vdpa: move memory listener register to vhost_vdpa_init
> >>
> >>   hw/virtio/vhost-vdpa.c | 107 +
> >>   include/hw/virtio/vhost-vdpa.h |  22 ++-
> >>   net/vhost-vdpa.c   |  34 +--
> >>   3 files changed, 93 insertions(+), 70 deletions(-)
> >>
> >> --
> >> 2.43.5
> >>
>




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-14 Thread Michael S. Tsirkin
On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote:
> Hi Eugenio,
> 
> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  wrote:
> > > Current memory operations like pinning may take a lot of time at the
> > > destination.  Currently they are done after the source of the migration is
> > > stopped, and before the workload is resumed at the destination.  This is a
> > > period where neither traffic can flow nor the VM workload can continue
> > > (downtime).
> > >
> > > We can do better, as we know the memory layout of the guest RAM at the
> > > destination from the moment that all devices are initialized.  So moving
> > > that operation earlier allows QEMU to communicate the maps to the kernel
> > > while the workload is still running in the source, so Linux can start
> > > mapping them.
> > >
> > > As a small drawback, there is a time in the initialization where QEMU
> > > cannot respond to QMP etc.  By some testing, this time is about
> > > 0.2 seconds.  This may be further reduced (or increased) depending on the
> > > vdpa driver and the platform hardware, and it is dominated by the cost
> > > of memory pinning.
> > >
> > > This matches the time that we move out of the so-called downtime window.
> > > The downtime is measured by checking the trace timestamps from the moment
> > > the source suspends the device to the moment the destination starts the
> > > eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> > > secs to 2.0949.
> > > 
> > Hi Jonah,
> > 
> > Could you update this benchmark? I don't think it changed a lot but
> > just to be as updated as possible.
> Jonah is off this week and will be back next Tuesday, but I recall he
> did some downtime tests with a VM with 128GB of memory before taking off,
> which showed an obvious improvement from around 10 seconds to 5.8 seconds
> after applying this series. Since this only concerns an update to the cover
> letter, would it be okay for you and Jason to ack now and then proceed to
> Michael for the upcoming merge?
> 
> > 
> > I think I cannot ack the series as I sent the first revision. Jason or
> > Si-Wei, could you ack it?
> Sure, I just gave my R-b; this series looks good to me. Hopefully Jason can
> ack on his own.
> 
> Thanks!
> -Siwei

I just sent a pull, next one in a week or two, so - no rush.


> > 
> > Thanks!
> > 
> > > Future directions on top of this series may include to move more things 
> > > ahead
> > > of the migration time, like set DRIVER_OK or perform actual iterative 
> > > migration
> > > of virtio-net devices.
> > > 
> > > Comments are welcome.
> > > 
> > > This series is a different approach from series [1]. As the title no
> > > longer reflects the changes, please refer to the previous one for the
> > > series history.
> > > 
> > > This series is based on [2], it must be applied after it.
> > > 
> > > [Jonah Palmer]
> > > This series was rebased after [3] was pulled in, as [3] was a prerequisite
> > > fix for this series.
> > > 
> > > v4:
> > > ---
> > > * Add memory listener unregistration to vhost_vdpa_reset_device.
> > > * Remove memory listener unregistration from vhost_vdpa_reset_status.
> > > 
> > > v3:
> > > ---
> > > * Rebase
> > > 
> > > v2:
> > > ---
> > > * Move the memory listener registration to vhost_vdpa_set_owner function.
> > > * Move the iova_tree allocation to net_vhost_vdpa_init.
> > > 
> > > v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> > > 
> > > [1] 
> > > https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> > > [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> > > [3] 
> > > https://lore.kernel.org/qemu-devel/[email protected]/
> > > 
> > > Jonah - note: I'll be on vacation from May 10-19. Will respond to
> > >comments when I return.
> > > 
> > > Eugenio Pérez (7):
> > >vdpa: check for iova tree initialized at net_client_start
> > >vdpa: reorder vhost_vdpa_set_backend_cap
> > >vdpa: set backend capabilities at vhost_vdpa_init
> > >vdpa: add listener_registered
> > >vdpa: reorder listener assignment
> > >vdpa: move iova_tree allocation to net_vhost_vdpa_init
> > >vdpa: move memory listener register to vhost_vdpa_init
> > > 
> > >   hw/virtio/vhost-vdpa.c | 107 +
> > >   include/hw/virtio/vhost-vdpa.h |  22 ++-
> > >   net/vhost-vdpa.c   |  34 +--
> > >   3 files changed, 93 insertions(+), 70 deletions(-)
> > > 
> > > --
> > > 2.43.5
> > > 




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-14 Thread Si-Wei Liu

Hi Eugenio,

On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:

On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So moving
that operation earlier allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.  This may be further reduced (or increased) depending on the
vdpa driver and the platform hardware, and it is dominated by the cost
of memory pinning.

This matches the time that we move out of the so-called downtime window.
The downtime is measured by checking the trace timestamps from the moment
the source suspends the device to the moment the destination starts the
eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
secs to 2.0949.


Hi Jonah,

Could you update this benchmark? I don't think it changed a lot but
just to be as updated as possible.
Jonah is off this week and will be back next Tuesday, but I recall
he did some downtime tests with a VM with 128GB of memory before taking
off, which showed an obvious improvement from around 10 seconds to 5.8
seconds after applying this series. Since this only concerns an update to
the cover letter, would it be okay for you and Jason to ack now and then
proceed to Michael for the upcoming merge?




I think I cannot ack the series as I sent the first revision. Jason or
Si-Wei, could you ack it?
Sure, I just gave my R-b; this series looks good to me. Hopefully Jason
can ack on his own.


Thanks!
-Siwei



Thanks!


Future directions on top of this series may include to move more things ahead
of the migration time, like set DRIVER_OK or perform actual iterative migration
of virtio-net devices.

Comments are welcome.

This series is a different approach from series [1]. As the title no longer
reflects the changes, please refer to the previous one for the series
history.

This series is based on [2], it must be applied after it.

[Jonah Palmer]
This series was rebased after [3] was pulled in, as [3] was a prerequisite
fix for this series.

v4:
---
* Add memory listener unregistration to vhost_vdpa_reset_device.
* Remove memory listener unregistration from vhost_vdpa_reset_status.

v3:
---
* Rebase

v2:
---
* Move the memory listener registration to vhost_vdpa_set_owner function.
* Move the iova_tree allocation to net_vhost_vdpa_init.

v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.

[1] 
https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
[3] 
https://lore.kernel.org/qemu-devel/[email protected]/

Jonah - note: I'll be on vacation from May 10-19. Will respond to
   comments when I return.

Eugenio Pérez (7):
   vdpa: check for iova tree initialized at net_client_start
   vdpa: reorder vhost_vdpa_set_backend_cap
   vdpa: set backend capabilities at vhost_vdpa_init
   vdpa: add listener_registered
   vdpa: reorder listener assignment
   vdpa: move iova_tree allocation to net_vhost_vdpa_init
   vdpa: move memory listener register to vhost_vdpa_init

  hw/virtio/vhost-vdpa.c | 107 +
  include/hw/virtio/vhost-vdpa.h |  22 ++-
  net/vhost-vdpa.c   |  34 +--
  3 files changed, 93 insertions(+), 70 deletions(-)

--
2.43.5






Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-14 Thread Si-Wei Liu

For the series:

Reviewed-by: Si-Wei Liu 

On 5/7/2025 11:46 AM, Jonah Palmer wrote:

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So moving
that operation earlier allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.  This may be further reduced (or increased) depending on the
vdpa driver and the platform hardware, and it is dominated by the cost
of memory pinning.

This matches the time that we move out of the so-called downtime window.
The downtime is measured by checking the trace timestamps from the moment
the source suspends the device to the moment the destination starts the
eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
secs to 2.0949.

Future directions on top of this series may include to move more things ahead
of the migration time, like set DRIVER_OK or perform actual iterative migration
of virtio-net devices.

Comments are welcome.

This series is a different approach from series [1]. As the title no longer
reflects the changes, please refer to the previous one for the series
history.

This series is based on [2], it must be applied after it.

[Jonah Palmer]
This series was rebased after [3] was pulled in, as [3] was a prerequisite
fix for this series.

v4:
---
* Add memory listener unregistration to vhost_vdpa_reset_device.
* Remove memory listener unregistration from vhost_vdpa_reset_status.

v3:
---
* Rebase

v2:
---
* Move the memory listener registration to vhost_vdpa_set_owner function.
* Move the iova_tree allocation to net_vhost_vdpa_init.

v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.

[1] 
https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
[3] 
https://lore.kernel.org/qemu-devel/[email protected]/

Jonah - note: I'll be on vacation from May 10-19. Will respond to
   comments when I return.

Eugenio Pérez (7):
   vdpa: check for iova tree initialized at net_client_start
   vdpa: reorder vhost_vdpa_set_backend_cap
   vdpa: set backend capabilities at vhost_vdpa_init
   vdpa: add listener_registered
   vdpa: reorder listener assignment
   vdpa: move iova_tree allocation to net_vhost_vdpa_init
   vdpa: move memory listener register to vhost_vdpa_init

  hw/virtio/vhost-vdpa.c | 107 +
  include/hw/virtio/vhost-vdpa.h |  22 ++-
  net/vhost-vdpa.c   |  34 +--
  3 files changed, 93 insertions(+), 70 deletions(-)






Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-14 Thread Eugenio Perez Martin
On Wed, May 7, 2025 at 8:47 PM Jonah Palmer  wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neither traffic can flow nor the VM workload can continue
> (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initialized.  So moving
> that operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time in the initialization where QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2 seconds.  This may be further reduced (or increased) depending on the
> vdpa driver and the platform hardware, and it is dominated by the cost
> of memory pinning.
>
> This matches the time that we move out of the so-called downtime window.
> The downtime is measured by checking the trace timestamps from the moment
> the source suspends the device to the moment the destination starts the
> eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> secs to 2.0949.
>

Hi Jonah,

Could you update this benchmark? I don't think it changed a lot but
just to be as updated as possible.

I think I cannot ack the series as I sent the first revision. Jason or
Si-Wei, could you ack it?

Thanks!

> Future directions on top of this series may include moving more things
> ahead of the migration time, like setting DRIVER_OK or performing actual
> iterative migration of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach from series [1]. As the title no
> longer reflects the changes, please refer to the previous one for the
> series history.
> 
> This series is based on [2] and must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] 
> https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] 
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>   comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c | 107 +
>  include/hw/virtio/vhost-vdpa.h |  22 ++-
>  net/vhost-vdpa.c   |  34 +--
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>




Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init

2025-05-13 Thread Lei Yang
Tests pass with vhost_vdpa device regression tests.

Tested-by: Lei Yang 

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer  wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neither traffic can flow nor the VM workload can continue
> (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initialized.  So
> moving that operation allows QEMU to communicate the maps to the kernel
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time during initialization when QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2 seconds.  This may be further reduced (or increased) depending on the
> vdpa driver and the platform hardware, and it is dominated by the cost
> of memory pinning.
>
> This matches the time that we move out of the so-called downtime window.
> The downtime is measured by checking the trace timestamps from the moment
> the source suspends the device to the moment the destination starts the
> eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> secs to 2.0949.
>
> Future directions on top of this series may include moving more things
> ahead of the migration time, like setting DRIVER_OK or performing actual
> iterative migration of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach from series [1]. As the title no
> longer reflects the changes, please refer to the previous one for the
> series history.
> 
> This series is based on [2] and must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] 
> https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] 
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>   comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c | 107 +
>  include/hw/virtio/vhost-vdpa.h |  22 ++-
>  net/vhost-vdpa.c   |  34 +--
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>