On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <pet...@redhat.com> wrote:
>
> On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <pet...@redhat.com> wrote:
> > >
> > > Jason, Eugenio,
> > >
> > > Apologies for a late reply; just back from the long holiday.
> > >
> > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > Si-Wei did the actual profiling as he is the one with the 128G
> > > > guests, but most of the time was spent in the memory pinning.
> > > > Si-Wei, please correct me if I'm wrong.
> > >
> > > IIUC we're talking about the no-vIOMMU use case. The pinning should
> > > indeed take a lot of time if it's similar to what VFIO does.
> > >
> > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > previous testing, this delayed the VM initialization by a lot, as
> > > > we're moving those 20s of blocking to every VM start.
> > > >
> > > > Investigating a way to do it only in the case of being the
> > > > destination of a live migration, I think the right place is the
> > > > .load_setup migration handler. But I'm ok to move it, for sure.
> > >
> > > If it's destined to map the 128G, it does sound sensible to me to do
> > > it when the VM starts, rather than anytime afterwards.
> > >
> >
> > Just for completeness, it is not 100% sure the driver will start the
> > device. But it is very likely, for sure.
>
> My understanding is that vDPA is still a quite special device, assumed
> to only target advanced users, and it should not appear in a default
> config for anyone. That means the user should hopefully remove the
> device if the guest is not using it, instead of worrying about a slow
> boot.
>
> > > Could anyone help to explain what's the problem if vDPA maps 128G at
> > > VM init just like what VFIO does?
> > >
> >
> > The main problem was the delay of VM start. In the master branch, the
> > pinning is done when the driver starts the device. While it takes the
> > BQL, the rest of the vCPUs can move work forward while the host is
> > pinning, so the impact of it is not so evident.
> >
> > Moving it to initialization time made it very noticeable. To make
> > things worse, QEMU did not respond to QMP commands and the like.
> > That's why it was done only if the VM was the destination of a live
> > migration.
>
> Is that a major issue for us?

To me it is a regression, but I'm ok with it for sure.

> IIUC then VFIO shares the same condition.
> If it's a real problem, do we want to have a solution that works for
> both (or, is it possible)?
>

I would not consider it a regression for VFIO, since I think it has
behaved that way from the beginning. But yes, I'm all for finding a
common solution.

> > However, we've added the memory map thread in this version, so this
> > might not be a problem anymore. We could move the spawn of the thread
> > to initialization time.
> >
> > But how do we undo this pinning in case the guest does not start the
> > device? In this series, this is done at the destination with
> > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > long as QEMU has the vDPA device?
>
> I think even if vDPA decides to use a thread, we should keep the same
> behavior before/after the migration. Having asymmetric behavior for DMA
> from the assigned HWs might have unpredictable implications.
>
> What I worry about is that we may over-optimize / over-engineer the
> case where the user will specify the vDPA device but not use it, as I
> mentioned above.
>

I agree with all of the above. If it is ok to keep the memory mapped
while the guest has not started the device, I think we can move the
spawn of the thread, or even just the map write itself, to the vDPA
init.
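
To make that idea concrete, here is a minimal sketch (all names below
are made up for illustration, not the actual code of this series; only
qemu_thread_create() / qemu_thread_join() and QEMU_THREAD_JOINABLE are
the real QEMU thread API):

#include "qemu/osdep.h"
#include "qemu/thread.h"

/* Hypothetical routine that walks the memory sections, pinning and
 * DMA-mapping each one. */
void vdpa_map_all_sections(void *dev);

/* Illustrative state; in the real code this would live in the
 * vhost-vdpa device state. */
typedef struct VdpaMapState {
    QemuThread map_thread;
    void *dev;                  /* the vhost-vdpa device to map for */
} VdpaMapState;

static void *vdpa_map_worker(void *arg)
{
    VdpaMapState *s = arg;

    /* Pin/map off the main loop thread, so QMP and the vCPUs stay
     * responsive while the host pins guest memory. */
    vdpa_map_all_sections(s->dev);
    return NULL;
}

/* Called from device init, instead of only from the .load_setup
 * migration handler, so the cold boot path is covered too. */
static void vdpa_spawn_map_thread(VdpaMapState *s)
{
    qemu_thread_create(&s->map_thread, "vdpa-map", vdpa_map_worker,
                       s, QEMU_THREAD_JOINABLE);
}

/* The device start path would then call
 * qemu_thread_join(&s->map_thread) before enabling the datapath, to
 * make sure all the maps are in place. */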
> For the long term, maybe there's a chance to optimize DMA pinning for
> both vdpa/vfio use cases, then we can always pin them during VM start?
> Assuming that issue only exists for large VMs, they should normally be
> good candidates for huge pages already. Then, it means maybe one
> folio/page can cover a large range (e.g. 1G on x86_64) in one pin, and
> physical contiguity also provides the possibility of IOMMU large page
> mappings. I didn't check at which stage we are for VFIO on this; Alex
> may know better.

Sounds interesting, and I think it should be implemented. Thanks for the
pointer!

> I'm copying Alex anyway since the problem seems to be a common one
> already, so maybe he has some thoughts.
>

Appreciated :). Thanks!
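
P.S. A quick back-of-the-envelope on why huge pages should help here,
assuming one pin operation per page (a simplification, since pinning can
be batched):

#include <stdio.h>

int main(void)
{
    unsigned long long guest_mem = 128ULL << 30;      /* 128G guest */

    printf("4k pages: %llu pins\n", guest_mem >> 12); /* 33554432 */
    printf("2M pages: %llu pins\n", guest_mem >> 21); /* 65536 */
    printf("1G pages: %llu pins\n", guest_mem >> 30); /* 128 */
    return 0;
}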