Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2024-01-03 Thread Peter Xu
On Wed, Jan 03, 2024 at 12:11:19PM +0100, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 7:16 AM Peter Xu  wrote:
> >
> > On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu  wrote:
> > > >
> > > > Jason, Eugenio,
> > > >
> > > > Apologies for a late reply; just back from the long holiday.
> > > >
> > > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > > correct me if I'm wrong.
> > > >
> > > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > > take a lot of time if it's similar to what VFIO does.
> > > >
> > > > >
> > > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > > previous testings, this delayed the VM initialization by a lot, as
> > > > > we're moving that 20s of blocking to every VM start.
> > > > >
> > > > > Investigating a way to do it only in the case of being the destination
> > > > > of a live migration, I think the right place is .load_setup migration
> > > > > handler. But I'm ok to move it for sure.
> > > >
> > > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > > when VM starts, rather than anytime afterwards.
> > > >
> > >
> > > Just for completeness, it is not 100% certain that the driver will start
> > > the device, but it is very likely.
> >
> > My understanding is that vDPA is still a quite special device, assuming
> > only targeting advanced users, and should not appear in a default config
> > for anyone.  It means the user should hopefully remove the device if the
> > guest is not using it, instead of worrying about a slow boot.
> >
> > >
> > > > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > > > init just like what VFIO does?
> > > >
> > >
> > > The main problem was the delay of VM start. In the master branch, the
> > > pinning is done when the driver starts the device. While it takes the
> > > BQL, the rest of the vCPUs can move work forward while the host is
> > > pinning. So the impact of it is not so evident.
> > >
> > > Moving it to initialization time made it very noticeable. To make
> > > things worse, QEMU did not respond to QMP commands and similar. That's
> > > why it was done only if the VM was the destination of a LM.
> >
> > Is that a major issue for us?
> 
> To me it is a regression but I'm ok with it for sure.
> 
> >  IIUC then VFIO shares the same condition.
> > If it's a real problem, do we want to have a solution that works for both
> > (or, is it possible)?
> >
> 
> I would not consider it a regression for VFIO since I think it has
> behaved that way from the beginning. But yes, I'm all in to find a
> common solution.
> 
> > >
> > > However, we've added the memory map thread in this version, so this
> > > might not be a problem anymore. We could move the spawn of the thread
> > > to initialization time.
> > >
> > > But how to undo this pinning in the case the guest does not start the
> > > device? In this series, this is done at the destination with
> > > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > > long as QEMU has the vDPA device?
> >
> > I think even if vDPA decides to use a thread, we should keep the same
> > behavior before/after the migration.  Having asymmetric behavior over DMA
> > from the assigned HWs might have unpredictable implications.
> >
> > What I worry about is that we may over-optimize / over-engineer the case where the
> > user will specify the vDPA device but not use it, as I mentioned above.
> >
> 
> I agree with all of the above. If it is ok to keep memory mapped while
> the guest has not started I think we can move the spawn of the thread,
> or even just the map write itself, to the vdpa init.
> 
> > For the long term, maybe there's chance to optimize DMA pinning for both
> > vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> > that issue only exists for large VMs, while they should normally be good
> > candidates for huge pages already.  Then, it means maybe one folio/page can
> > cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> > also provides possibility of IOMMU large page mappings.  I didn't check at
> > which stage we are for VFIO on this, Alex may know better.
> 
> Sounds interesting, and I think it should be implemented. Thanks for
> the pointer!

I didn't have an exact pointer previously, but to provide a pointer, I
think it can be something like this:

  physr discussion - Jason Gunthorpe
  
https://www.youtube.com/watch?v=QftOTtks-pI&list=PLbzoR-pLrL6rlmdpJ3-oMgU_zxc1wAhjS&index=36

Since I have zero knowledge of the vDPA side, I can only provide an example
from VFIO, and even that may not be fully accurate.  Basically

Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2024-01-03 Thread Eugenio Perez Martin
On Wed, Jan 3, 2024 at 7:16 AM Peter Xu  wrote:
>
> On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu  wrote:
> > >
> > > Jason, Eugenio,
> > >
> > > Apologies for a late reply; just back from the long holiday.
> > >
> > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > correct me if I'm wrong.
> > >
> > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > take a lot of time if it's similar to what VFIO does.
> > >
> > > >
> > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > previous testings, this delayed the VM initialization by a lot, as
> > > > we're moving that 20s of blocking to every VM start.
> > > >
> > > > Investigating a way to do it only in the case of being the destination
> > > > of a live migration, I think the right place is .load_setup migration
> > > > handler. But I'm ok to move it for sure.
> > >
> > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > when VM starts, rather than anytime afterwards.
> > >
> >
> > Just for completeness, it is not 100% certain that the driver will start
> > the device, but it is very likely.
>
> My understanding is that vDPA is still a quite special device, assuming
> only targeting advanced users, and should not appear in a default config
> for anyone.  It means the user should hopefully remove the device if the
> guest is not using it, instead of worrying about a slow boot.
>
> >
> > > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > > init just like what VFIO does?
> > >
> >
> > The main problem was the delay of VM start. In the master branch, the
> > pinning is done when the driver starts the device. While it takes the
> > BQL, the rest of the vCPUs can move work forward while the host is
> > pinning. So the impact of it is not so evident.
> >
> > Moving it to initialization time made it very noticeable. To make
> > things worse, QEMU did not respond to QMP commands and similar. That's
> > why it was done only if the VM was the destination of a LM.
>
> Is that a major issue for us?

To me it is a regression but I'm ok with it for sure.

>  IIUC then VFIO shares the same condition.
> If it's a real problem, do we want to have a solution that works for both
> (or, is it possible)?
>

I would not consider it a regression for VFIO since I think it has
behaved that way from the beginning. But yes, I'm all in to find a
common solution.

> >
> > However, we've added the memory map thread in this version, so this
> > might not be a problem anymore. We could move the spawn of the thread
> > to initialization time.
> >
> > But how to undo this pinning in the case the guest does not start the
> > device? In this series, this is done at the destination with
> > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > long as QEMU has the vDPA device?
>
> I think even if vDPA decides to use a thread, we should keep the same
> behavior before/after the migration.  Having asymmetric behavior over DMA
> from the assigned HWs might have unpredictable implications.
>
> What I worry about is that we may over-optimize / over-engineer the case where the
> user will specify the vDPA device but not use it, as I mentioned above.
>

I agree with all of the above. If it is ok to keep memory mapped while
the guest has not started I think we can move the spawn of the thread,
or even just the map write itself, to the vdpa init.

> For the long term, maybe there's chance to optimize DMA pinning for both
> vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> that issue only exists for large VMs, while they should normally be good
> candidates for huge pages already.  Then, it means maybe one folio/page can
> cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> also provides possibility of IOMMU large page mappings.  I didn't check at
> which stage we are for VFIO on this, Alex may know better.

Sounds interesting, and I think it should be implemented. Thanks for
the pointer!

> I'm copying Alex
> anyway since the problem seems to be a common one already, so maybe he has
> some thoughts.
>

Appreciated :).

Thanks!




Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2024-01-02 Thread Peter Xu
On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> On Tue, Jan 2, 2024 at 6:33 AM Peter Xu  wrote:
> >
> > Jason, Eugenio,
> >
> > Apologies for a late reply; just back from the long holiday.
> >
> > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > correct me if I'm wrong.
> >
> > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > take a lot of time if it's similar to what VFIO does.
> >
> > >
> > > I didn't check VFIO, but I think it just maps at realize phase with
> > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > previous testings, this delayed the VM initialization by a lot, as
> > > we're moving that 20s of blocking to every VM start.
> > >
> > > Investigating a way to do it only in the case of being the destination
> > > of a live migration, I think the right place is .load_setup migration
> > > handler. But I'm ok to move it for sure.
> >
> > If it's destined to map the 128G, it does sound sensible to me to do it
> > when VM starts, rather than anytime afterwards.
> >
> 
> Just for completeness, it is not 100% certain that the driver will start
> the device, but it is very likely.

My understanding is that vDPA is still a quite special device, assuming
only targeting advanced users, and should not appear in a default config
for anyone.  It means the user should hopefully remove the device if the
guest is not using it, instead of worrying about a slow boot.

> 
> > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > init just like what VFIO does?
> >
> 
> The main problem was the delay of VM start. In the master branch, the
> pinning is done when the driver starts the device. While it takes the
> BQL, the rest of the vCPUs can move work forward while the host is
> pinning. So the impact of it is not so evident.
> 
> Moving it to initialization time made it very noticeable. To make
> things worse, QEMU did not respond to QMP commands and similar. That's
> why it was done only if the VM was the destination of a LM.

Is that a major issue for us?  IIUC then VFIO shares the same condition.
If it's a real problem, do we want to have a solution that works for both
(or, is it possible)?

> 
> However, we've added the memory map thread in this version, so this
> might not be a problem anymore. We could move the spawn of the thread
> to initialization time.
> 
> But how to undo this pinning in the case the guest does not start the
> device? In this series, this is done at the destination with
> vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> long as QEMU has the vDPA device?

I think even if vDPA decides to use a thread, we should keep the same
behavior before/after the migration.  Having asymmetric behavior over DMA
from the assigned HWs might have unpredictable implications.

What I worry about is that we may over-optimize / over-engineer the case where the
user will specify the vDPA device but not use it, as I mentioned above.

For the long term, maybe there's chance to optimize DMA pinning for both
vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
that issue only exists for large VMs, while they should normally be good
candidates for huge pages already.  Then, it means maybe one folio/page can
cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
also provides possibility of IOMMU large page mappings.  I didn't check at
which stage we are for VFIO on this, Alex may know better. I'm copying Alex
anyway since the problem seems to be a common one already, so maybe he has
some thoughts.
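
To put rough numbers on the huge page idea above, here is a small
back-of-the-envelope sketch. It assumes that pinning cost scales with the
number of page-sized units that must be pinned, which is an assumption for
illustration and not something measured in this thread. The helper name
pin_units() is hypothetical.

```python
GIB = 1 << 30


def pin_units(guest_bytes: int, page_bytes: int) -> int:
    """Number of page-sized units needed to cover guest_bytes (rounded up)."""
    return -(-guest_bytes // page_bytes)


guest = 128 * GIB  # the guest size discussed in this thread
for name, size in (("4K", 4 << 10), ("2M", 2 << 20), ("1G", 1 << 30)):
    # One line per page size: how many units a 128G guest would need.
    print(f"{name:>2} pages: {pin_units(guest, size):>10,} units")
```

With 4K pages that is tens of millions of units for a 128G guest, while 1G
pages bring it down to 128, which is why large VMs backed by huge pages look
like good candidates for pinning at VM start.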

Thanks,

-- 
Peter Xu




Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2024-01-02 Thread Eugenio Perez Martin
On Tue, Jan 2, 2024 at 6:33 AM Peter Xu  wrote:
>
> Jason, Eugenio,
>
> Apologies for a late reply; just back from the long holiday.
>
> On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > but most of the time was spent in the memory pinning. Si-Wei, please
> > correct me if I'm wrong.
>
> IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> take a lot of time if it's similar to what VFIO does.
>
> >
> > I didn't check VFIO, but I think it just maps at realize phase with
> > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > previous testings, this delayed the VM initialization by a lot, as
> > we're moving that 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination
> > of a live migration, I think the right place is .load_setup migration
> > handler. But I'm ok to move it for sure.
>
> If it's destined to map the 128G, it does sound sensible to me to do it
> when VM starts, rather than anytime afterwards.
>

Just for completeness, it is not 100% certain that the driver will start
the device, but it is very likely.

> Could anyone help to explain what's the problem if vDPA maps 128G at VM
> init just like what VFIO does?
>

The main problem was the delay of VM start. In the master branch, the
pinning is done when the driver starts the device. While it takes the
BQL, the rest of the vCPUs can move work forward while the host is
pinning. So the impact of it is not so evident.

Moving it to initialization time made it very noticeable. To make
things worse, QEMU did not respond to QMP commands and similar. That's
why it was done only if the VM was the destination of a LM.

However, we've added the memory map thread in this version, so this
might not be a problem anymore. We could move the spawn of the thread
to initialization time.

But how to undo this pinning in the case the guest does not start the
device? In this series, this is done at the destination with
vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
long as QEMU has the vDPA device?
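
As a toy model of the flow described above (spawn a mapper thread early, join
it at device start, undo the maps at cleanup if the device never starts), a
sketch could look like the following. This is not the QEMU code: the class and
its fields are hypothetical, the method names only mirror the load_setup and
load_cleanup handlers named in this series, and appending to a list stands in
for the slow pin + map.

```python
import threading


class MapWorker:
    """Toy model of a background mapper; all names here are hypothetical."""

    def __init__(self, regions):
        self.regions = list(regions)      # (iova, size) pairs to map
        self.mapped = []                  # stands in for the device IOTLB
        self._thread = None

    def _map_all(self):
        for region in self.regions:
            self.mapped.append(region)    # placeholder for the slow pin + map

    def load_setup(self):
        """Start mapping early, off the main thread."""
        self._thread = threading.Thread(target=self._map_all)
        self._thread.start()

    def _wait(self):
        # Join the mapper thread if one was started; safe to call repeatedly.
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def dev_start(self):
        """The device cannot start until every region is mapped."""
        self._wait()
        return len(self.mapped)

    def load_cleanup(self):
        """Undo the maps if the guest never started the device."""
        self._wait()
        self.mapped.clear()


worker = MapWorker([(0x0, 0xC0000), (0x100000, 0x7FF00000)])
worker.load_setup()
started_with = worker.dev_start()   # joins the mapper thread first
```

The open question in the text maps onto load_cleanup(): if it is fine to keep
the memory mapped for as long as the device exists, that step goes away.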

Thanks!




Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2024-01-01 Thread Peter Xu
Jason, Eugenio,

Apologies for a late reply; just back from the long holiday.

On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> Si-Wei did the actual profiling as he is the one with the 128G guests,
> but most of the time was spent in the memory pinning. Si-Wei, please
> correct me if I'm wrong.

IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
take a lot of time if it's similar to what VFIO does.

>
> I didn't check VFIO, but I think it just maps at realize phase with
> vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> previous testings, this delayed the VM initialization by a lot, as
> we're moving that 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination
> of a live migration, I think the right place is .load_setup migration
> handler. But I'm ok to move it for sure.

If it's destined to map the 128G, it does sound sensible to me to do it
when VM starts, rather than anytime afterwards.

Could anyone help to explain what's the problem if vDPA maps 128G at VM
init just like what VFIO does?

Thanks,

-- 
Peter Xu




Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2023-12-21 Thread Eugenio Perez Martin
On Thu, Dec 21, 2023 at 3:17 AM Jason Wang  wrote:
>
> On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin
>  wrote:
> >
> > On Wed, Dec 20, 2023 at 6:22 AM Jason Wang  wrote:
> > >
> > > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez  wrote:
> > > >
> > > > Callers can use this function to setup the incoming migration thread.
> > > >
> > > > This thread is able to map the guest memory while the migration is
> > > > ongoing, without blocking QMP or other important tasks. While this
> > > > allows the destination QEMU not to block, it expands the mapping time
> > > > during migration instead of making it pre-migration.
> > >
> > > If it's just QMP, can we simply use bh with a quota here?
> > >
> >
> > Because QEMU cannot guarantee the quota at write(fd,
> > VHOST_IOTLB_UPDATE, ...).
>
> So you mean the delay may be caused by a single syscall?
>

Mostly yes, the iotlb write() that maps all of the guest memory.

> > Also, synchronization with
> > vhost_vdpa_dev_start would complicate as it would need to be
> > re-scheduled too.
>
> Just a flush of the bh, or not?
>

Let me put it differently: to map the guest memory, vhost_vdpa_dma_map
is called because the guest starts the device by a PCI write to the
device status:
#0  vhost_vdpa_dma_map (s=0x570e0e60, asid=0, iova=0, size=786432, vaddr=0x7fff4000, readonly=false)
    at ../hw/virtio/vhost-vdpa.c:93
#1  0x55979451 in vhost_vdpa_listener_region_add (listener=0x570e0e68, section=0x7fffee5bc0d0)
    at ../hw/virtio/vhost-vdpa.c:415
#2  0x55b3c543 in listener_add_address_space (listener=0x570e0e68, as=0x56db72e0)
    at ../system/memory.c:3011
#3  0x55b3c996 in memory_listener_register (listener=0x570e0e68, as=0x56db72e0)
    at ../system/memory.c:3081
#4  0x5597be03 in vhost_vdpa_dev_start (dev=0x570e1310, started=true)
    at ../hw/virtio/vhost-vdpa.c:1460
#5  0x559734c2 in vhost_dev_start (hdev=0x570e1310, vdev=0x584b2c80, vrings=false)
    at ../hw/virtio/vhost.c:2058
#6  0x55854ec8 in vhost_net_start_one (net=0x570e1310, dev=0x584b2c80)
    at ../hw/net/vhost_net.c:274
#7  0x558554ca in vhost_net_start (dev=0x584b2c80, ncs=0x584c8278, data_queue_pairs=1, cvq=1)
    at ../hw/net/vhost_net.c:415
#8  0x55ace7a5 in virtio_net_vhost_status (n=0x584b2c80, status=15 '\017')
    at ../hw/net/virtio-net.c:310
#9  0x55acea50 in virtio_net_set_status (vdev=0x584b2c80, status=15 '\017')
    at ../hw/net/virtio-net.c:391
#10 0x55b06fee in virtio_set_status (vdev=0x584b2c80, val=15 '\017')
    at ../hw/virtio/virtio.c:2048
#11 0x5595d667 in virtio_pci_common_write (opaque=0x584aa8b0, addr=20, val=15, size=1)
    at ../hw/virtio/virtio-pci.c:1580
#12 0x55b351c1 in memory_region_write_accessor (mr=0x584ab3f0, addr=20, value=0x7fffee5bc4c8, size=1, shift=0, mask=255, attrs=...)
    at ../system/memory.c:497
#13 0x55b354c5 in access_with_adjusted_size (addr=20, value=0x7fffee5bc4c8, size=1, access_size_min=1, access_size_max=4, access_fn=0x55b350cf, mr=0x584ab3f0, attrs=...)
    at ../system/memory.c:573
#14 0x55b3856f in memory_region_dispatch_write (mr=0x584ab3f0, addr=20, data=15, op=MO_8, attrs=...)
    at ../system/memory.c:1521
#15 0x55b45885 in flatview_write_continue (fv=0x7fffd8122b80, addr=4227858452, attrs=..., ptr=0x77ff0028, len=1, addr1=20, l=1, mr=0x584ab3f0)
    at ../system/physmem.c:2714
#16 0x55b459e8 in flatview_write (fv=0x7fffd8122b80, addr=4227858452, attrs=..., buf=0x77ff0028, len=1)
    at ../system/physmem.c:2756
#17 0x55b45d9a in address_space_write (as=0x56db72e0, addr=4227858452, attrs=..., buf=0x77ff0028, len=1)
    at ../system/physmem.c:2863
#18 0x55b45e07 in address_space_rw (as=0x56db72e0, addr=4227858452, attrs=..., buf=0x77ff0028, len=1, is_write=true)
    at ../system/physmem.c:2873
#19 0x55b5eb30 in kvm_cpu_exec (cpu=0x571258f0)
    at ../accel/kvm/kvm-all.c:2915
#20 0x55b61798 in kvm_vcpu_thread_fn (arg=0x571258f0)
    at ../accel/kvm/kvm-accel-ops.c:51
#21 0x55d384b7 in qemu_thread_start (args=0x5712c390)
    at ../util/qemu-thread-posix.c:541
#22 0x7580814a in start_thread () from /lib64/libpthread.so.0
#23 0x754fcf23 in clone () from /lib64/libc.so.6

Can we reschedule that map to a bh without returning the control to the vCPU?
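
For readers decoding the backtrace above: the status=15 value in frames #8 to
#11 is the virtio device status byte, and addr=20 matches the offset of
device_status in the virtio PCI common configuration structure. A small sketch
to decode it, using the status bit values from the virtio specification
(decode_status() itself is a hypothetical helper):

```python
# Device status bits as defined by the virtio specification.
VIRTIO_STATUS_BITS = {
    1: "ACKNOWLEDGE",
    2: "DRIVER",
    4: "DRIVER_OK",
    8: "FEATURES_OK",
    64: "DEVICE_NEEDS_RESET",
    128: "FAILED",
}


def decode_status(status: int) -> list:
    """Return the names of the status bits set in `status`, lowest bit first."""
    return [name for bit, name in sorted(VIRTIO_STATUS_BITS.items())
            if status & bit]


print(decode_status(15))
# ['ACKNOWLEDGE', 'DRIVER', 'DRIVER_OK', 'FEATURES_OK']
```

So this is the write where the guest driver sets DRIVER_OK, which is what
triggers vhost_vdpa_dev_start and, through the memory listener, the map.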

> But another question. How to synchronize with the memory API in this
> case. Currently the updating (without vIOMMU) is done under the
> listener callback.
>
> Usually after the commit, QEMU may think the memory topology has been
> updated. If it is done asynchronously, would we have any problem?
>

The function vhost_vdpa_process_iotlb_msg in the kernel has its own
lock. So two QEMU threads can map memory independently and they get
serialized.

For the write() caller, it is like the call takes more time, but there
are no deadlocks or similar.
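
A toy model of that serialization, only to illustrate the argument: two
threads call the same update path, one lock makes them take turns, and each
caller just sees its call take longer. ToyIotlb and map_range() are
hypothetical names, and the real lock lives in the kernel, not in QEMU.

```python
import threading


class ToyIotlb:
    """Stand-in for a mapping table guarded by one lock (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.max_active = 0     # highest number of writers seen inside the lock
        self.entries = {}

    def update(self, iova, size):
        with self._lock:        # concurrent callers queue up here
            self._active += 1
            self.max_active = max(self.max_active, self._active)
            self.entries[iova] = size
            self._active -= 1


def map_range(iotlb, base, count, page=4096):
    """Map `count` consecutive pages starting at `base`, one update per page."""
    for i in range(count):
        iotlb.update(base + i * page, page)


iotlb = ToyIotlb()
threads = [
    threading.Thread(target=map_range, args=(iotlb, base, 100))
    for base in (0x0, 0x10000000)   # two non-overlapping ranges
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After both threads finish, all 200 entries are present and max_active never
exceeds 1, i.e. the two mappers were serialized rather than interleaved.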


Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2023-12-20 Thread Jason Wang
On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin
 wrote:
>
> On Wed, Dec 20, 2023 at 6:22 AM Jason Wang  wrote:
> >
> > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez  wrote:
> > >
> > > Callers can use this function to setup the incoming migration thread.
> > >
> > > This thread is able to map the guest memory while the migration is
> > > ongoing, without blocking QMP or other important tasks. While this
> > > allows the destination QEMU not to block, it expands the mapping time
> > > during migration instead of making it pre-migration.
> >
> > If it's just QMP, can we simply use bh with a quota here?
> >
>
> Because QEMU cannot guarantee the quota at write(fd,
> VHOST_IOTLB_UPDATE, ...).

So you mean the delay may be caused by a single syscall?

> Also, synchronization with
> vhost_vdpa_dev_start would complicate as it would need to be
> re-scheduled too.

Just a flush of the bh, or not?

But another question. How to synchronize with the memory API in this
case. Currently the updating (without vIOMMU) is done under the
listener callback.

Usually after the commit, QEMU may think the memory topology has been
updated. If it is done asynchronously, would we have any problem?

>
> As a half-baked idea, we can split the mapping chunks in manageable
> sizes, but I don't like that idea a lot.
>
> > Btw, have you measured the hotspot that causes such slowness? Is it
> > pinning or vendor specific mapping that slows down the progress? Or if
> > VFIO has a similar issue?
> >
>
> Si-Wei did the actual profiling as he is the one with the 128G guests,
> but most of the time was spent in the memory pinning. Si-Wei, please
> correct me if I'm wrong.
>
> I didn't check VFIO, but I think it just maps at realize phase with
> vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> previous testings, this delayed the VM initialization by a lot, as
> we're moving that 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination
> of a live migration, I think the right place is .load_setup migration
> handler. But I'm ok to move it for sure.

Adding Peter for more ideas.

>
> > >
> > > This thread joins at vdpa backend device start, so it could happen that
> > > the guest memory is so large that we still have guest memory to map
> > > before this time.
> >
> > So we would still hit the QMP stall in this case?
> >
>
> This paragraph is kind of outdated, sorry. I can only cause this if I
> don't enable the switchover_ack migration capability and if I make
> memory pinning in the kernel artificially slow. But I didn't
> check QMP to be honest, so I can try to test it, yes.
>
> If QMP is not responsive, that means QMP is not responsive in QEMU
> master in that period actually. So we're only improving anyway.
>
> Thanks!
>

Thanks




Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2023-12-19 Thread Eugenio Perez Martin
On Wed, Dec 20, 2023 at 6:22 AM Jason Wang  wrote:
>
> On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez  wrote:
> >
> > Callers can use this function to setup the incoming migration thread.
> >
> > This thread is able to map the guest memory while the migration is
> > ongoing, without blocking QMP or other important tasks. While this
> > allows the destination QEMU not to block, it expands the mapping time
> > during migration instead of making it pre-migration.
>
> If it's just QMP, can we simply use bh with a quota here?
>

Because QEMU cannot guarantee the quota at write(fd,
VHOST_IOTLB_UPDATE, ...). Also, synchronization with
vhost_vdpa_dev_start would complicate as it would need to be
re-scheduled too.

As a half-baked idea, we can split the mapping chunks in manageable
sizes, but I don't like that idea a lot.
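
For illustration only, the chunking idea could be sketched like this: split
one large range into bounded pieces so that each piece would correspond to one
bounded map request. The function name and the 1G chunk size are hypothetical,
and the sketch does not address the synchronization with vhost_vdpa_dev_start
mentioned above.

```python
def split_mapping(iova, size, chunk):
    """Yield (iova, length) pieces of at most `chunk` bytes covering the range."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    end = iova + size
    while iova < end:
        length = min(chunk, end - iova)   # the last piece may be shorter
        yield iova, length
        iova += length


GIB = 1 << 30
# A 128G range split into 1G pieces gives 128 separate requests.
pieces = list(split_mapping(0, 128 * GIB, 1 * GIB))
```

A caller could then issue one piece per iteration and yield in between, which
is what a quota-based bh would need.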

> Btw, have you measured the hotspot that causes such slowness? Is it
> pinning or vendor specific mapping that slows down the progress? Or if
> VFIO has a similar issue?
>

Si-Wei did the actual profiling as he is the one with the 128G guests,
but most of the time was spent in the memory pinning. Si-Wei, please
correct me if I'm wrong.

I didn't check VFIO, but I think it just maps at realize phase with
vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
previous testings, this delayed the VM initialization by a lot, as
we're moving that 20s of blocking to every VM start.

Investigating a way to do it only in the case of being the destination
of a live migration, I think the right place is .load_setup migration
handler. But I'm ok to move it for sure.

> >
> > This thread joins at vdpa backend device start, so it could happen that
> > the guest memory is so large that we still have guest memory to map
> > before this time.
>
> So we would still hit the QMP stall in this case?
>

This paragraph is kind of outdated, sorry. I can only cause this if I
don't enable the switchover_ack migration capability and if I make
memory pinning in the kernel artificially slow. But I didn't
check QMP to be honest, so I can try to test it, yes.
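
To make the switchover_ack point concrete, here is a toy model of the
handshake as it is described in this thread: the source keeps running until
the destination signals that all guest memory is mapped. It is a sketch under
that assumption, not the migration code; destination(), source_switchover()
and the timeout are hypothetical.

```python
import threading


def destination(regions, mapped, ack):
    """Map every region, then signal that switchover is safe."""
    for region in regions:
        mapped.append(region)      # placeholder for the slow pin + map
    ack.set()


def source_switchover(ack, timeout=5.0):
    """Keep the source running until the destination acks (or we time out)."""
    return ack.wait(timeout)


ack = threading.Event()
mapped = []
regions = [(i * 4096, 4096) for i in range(8)]
dest = threading.Thread(target=destination, args=(regions, mapped, ack))
dest.start()
ok = source_switchover(ack)        # True once the destination has acked
dest.join()
```

Without the ack, the source could stop while the destination is still mapping,
which is the case I can only reproduce artificially.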

If QMP is not responsive, that means QMP is not responsive in QEMU
master in that period actually. So we're only improving anyway.

Thanks!

> > This can be improved in later iterations, when the
> > destination device can inform QEMU that it is not ready to complete the
> > migration.
> >
> > If the device is not started, the cleanup of the mapped memory is done at
> > .load_cleanup.  This is far from ideal, as the destination machine has
> > mapped all the guest ram for nothing, and now it needs to unmap it.
> > However, we don't have information about the state of the device so it's
> > the best we can do.  Once iterative migration is supported, this will be
> > improved as we know the virtio state of the device.
> >
> > If the VM migrates before finishing all the maps, the source will stop
> > but the destination is still not ready to continue, and it will wait
> > until all guest RAM is mapped.  It is still an improvement over doing
> > all the mapping when the migration finishes, but the next patches use
> > the switchover_ack method to keep the source from stopping until all the
> > memory is
> > mapped at the destination.
> >
> > The memory unmapping if the device is not started is weird
> > too, as ideally nothing would be mapped.  This can be fixed when we
> > migrate the device state iteratively, and we know for sure if the device
> > is started or not.  At this moment we don't have such information so
> > there is no better alternative.
> >
> > Signed-off-by: Eugenio Pérez 
> >
> > ---
>
> Thanks
>




Re: [PATCH for 9.0 08/12] vdpa: add vhost_vdpa_load_setup

2023-12-19 Thread Jason Wang
On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez  wrote:
>
> Callers can use this function to setup the incoming migration thread.
>
> This thread is able to map the guest memory while the migration is
> ongoing, without blocking QMP or other important tasks. While this
> allows the destination QEMU not to block, it expands the mapping time
> during migration instead of making it pre-migration.

If it's just QMP, can we simply use bh with a quota here?

Btw, have you measured the hotspot that causes such slowness? Is it
pinning or vendor specific mapping that slows down the progress? Or if
VFIO has a similar issue?

>
> This thread joins at vdpa backend device start, so it could happen that
> the guest memory is so large that we still have guest memory to map
> before this time.

So we would still hit the QMP stall in this case?

> This can be improved in later iterations, when the
> destination device can inform QEMU that it is not ready to complete the
> migration.
>
> If the device is not started, the cleanup of the mapped memory is done at
> .load_cleanup.  This is far from ideal, as the destination machine has
> mapped all the guest ram for nothing, and now it needs to unmap it.
> However, we don't have information about the state of the device so it's
> the best we can do.  Once iterative migration is supported, this will be
> improved as we know the virtio state of the device.
>
> If the VM migrates before finishing all the maps, the source will stop
> but the destination is still not ready to continue, and it will wait
> until all guest RAM is mapped.  It is still an improvement over doing
> all the mapping when the migration finishes, but the next patches use
> the switchover_ack method to keep the source from stopping until all the
> memory is
> mapped at the destination.
>
> The memory unmapping if the device is not started is weird
> too, as ideally nothing would be mapped.  This can be fixed when we
> migrate the device state iteratively, and we know for sure if the device
> is started or not.  At this moment we don't have such information so
> there is no better alternative.
>
> Signed-off-by: Eugenio Pérez 
>
> ---

Thanks