Re: [Qemu-block] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Alex Williamson
On Tue, 19 Apr 2016 12:13:29 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Apr 18, 2016 at 02:29:33PM -0400, David Woodhouse wrote:
> > On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:  
> > > I balk at adding more hacks to a broken system. My goals are
> > > merely to
> > > - make things work correctly with an IOMMU and new guests,
> > >   so people can use userspace drivers with virtio devices
> > > - prevent security risks when guest kernel mistakenly thinks
> > >   it's protected by an IOMMU, but in fact isn't
> > > - avoid breaking any working configurations  
> > 
> > AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
> > That's just an optimisation, for telling an OS "you don't really need
> > to bother with the IOMMU, even though it works".
> > 
> > There are two main reasons why an operating system might want to use
> > the IOMMU via the DMA API for native drivers: 
> >  - To protect against driver bugs triggering rogue DMA.
> >  - To protect against hardware (or firmware) bugs.
> > 
> > With virtio, the first reason still exists. But the second is moot
> > because the device is part of the hypervisor and if the hypervisor is
> > untrustworthy then you're screwed anyway... but then again, in SoC
> > devices you could replace 'hypervisor' with 'chip' and the same is
> > true, isn't it? Is there *really* anything virtio-specific here?
> >
> > Sure, I want my *external* network device on a PCIe card with software-
> > loadable firmware to be behind an IOMMU because I don't trust it as far
> > as I can throw it. But for on-SoC devices surely the situation is
> > *just* the same as devices provided by a hypervisor?  
> 
> Depends on how the SoC is designed, I guess.  At the moment QEMU specifically
> runs everything in a single memory space so an IOMMU table lookup does
> not offer any extra protection. That's not a must, one could come
> up with modular hypervisor designs - it's just what we have ATM.
> 
> 
> > And some people want that external network device to use passthrough
> > anyway, for performance reasons.  
> 
> That's a policy decision though.
> 
> > On the whole, there are *plenty* of reasons why we might want to have a
> > passthrough mapping on a per-device basis,  
> 
> That's true. And driver security also might differ, for example maybe I
> trust a distro-supplied driver more than an out of tree one.  Or maybe I
> trust a distro-supplied userspace driver more than a closed-source one.
> And maybe I trust devices from same vendor as my chip more than a 3rd
> party one.  So one can generalize this even further, think about device
> and driver security/trust level as an integer and platform protection as an
> integer.
> 
> If platform IOMMU offers you extra protection over trusting the device
> (trust < protection) it improves your security to use the platform to limit
> the device. If trust >= protection it just adds overhead without
> increasing the security.
> 
> > and I really struggle to
> > find justification for having this 'hint' in a virtio-specific way.  
> 
> It's a way. No system seems to expose this information in a more generic
> way at the moment, and it's portable. Would you like to push for some
> kind of standardization of such a hint? I would be interested
> to hear about that.
> 
> 
> > And it's complicating the discussion of the *actual* fix we're looking
> > at.  
> 
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".
> 
> And I thought it's a nice idea to use PASSTHROUGH && PLATFORM
> as a hint since it seemed to be unused.
> But maybe the best thing to do for now is to say
> - hosts should not set PASSTHROUGH && PLATFORM
> - guests should ignore PASSTHROUGH if PLATFORM is set
> 
> and then we can come back to this optimization idea later
> if it's appropriate.
> 
> So yes I think we need the two bits but no we don't need to
> mix the hint discussion in here.
> 
> > > Looking at guest code, it looks like virtio was always
> > > bypassing the IOMMU even if configured, but no other
> > > guest driver did.
> > > 
> > > This makes me think the problem where guest drivers
> > > ignore the IOMMU is virtio specific
> > > and so a virtio specific solution seems cleaner.
> > > 
> > > The problem for assigned devices is IMHO different: they bypass
> > > the guest IOMMU too but no guest driver knows about this,
> > > so guests do not work. Seems cleaner to fix QEMU to make
> > > existing guests work.  
> > 
> > I certainly agree that it's better to fix QEMU. Whether devices are
> > behind an IOMMU or not, the DMAR tables we expose to a guest should
> > tell the truth.
> > 
> > Part of the issue here is virtio-specific; part isn't.
> > 
> > Basically, we have a conjunction of two separate bugs which happened to
> > work (for virtio) — the IOMMU support in QEMU wasn't working for virtio

Re: [Qemu-block] [Qemu-devel] [PATCH v4 2/9] util: Introduce vfio helpers

2018-01-10 Thread Alex Williamson
On Wed, 10 Jan 2018 17:18:39 +0800
Fam Zheng  wrote:

> This is a library to manage the host vfio interface, which could be used
> to implement userspace device driver code in QEMU such as NVMe or net
> controllers.
> 
> Signed-off-by: Fam Zheng 
> ---
>  include/qemu/vfio-helpers.h |  30 ++
>  util/Makefile.objs  |   1 +
>  util/trace-events   |  11 +
>  util/vfio-helpers.c | 723 
> 
>  4 files changed, 765 insertions(+)
>  create mode 100644 include/qemu/vfio-helpers.h
>  create mode 100644 util/vfio-helpers.c
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> new file mode 100644
> index 00..6bdba3b66e
> --- /dev/null
> +++ b/include/qemu/vfio-helpers.h
...
> +/**
> + * Map a PCI bar area.
> + */
> +void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index, Error **errp)
> +{
> +void *p;
> +assert_bar_index_valid(s, index);
> +p = mmap(NULL, MIN(8192, s->bar_region_info[index].size),
> + PROT_READ | PROT_WRITE, MAP_SHARED,
> + s->device, s->bar_region_info[index].offset);
> +if (p == MAP_FAILED) {
> +error_setg_errno(errp, errno, "Failed to map BAR region");
> +p = NULL;
> +}
> +return p;
> +}
> +
> +/**
> + * Unmap a PCI bar area.
> + */
> +void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar)
> +{
> +if (bar) {
> +munmap(bar, MIN(8192, s->bar_region_info[index].size));
> +}
> +}

What's up with this 8KB thing?  Is it perhaps a hack to avoid
un-mmap'able MSI-X sections of the BAR?  That would make this
general-purpose library very specific to devices which only operate in
the lower 8KB of their MMIO space.  Maybe the interface should take an
offset and size so that the NVMe driver could implement that dependency
itself.  We could also be testing whether the region supports mmap, but
I suppose trying and failing is just as good.  Thanks,

Alex
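
As a sketch of the offset/size interface suggested above (later revisions
of the helper quoted further down in this archive do carry offset and size
parameters in their prototypes), the mapping could be driven by the caller.
Field and helper names follow the quoted patch; this is untested
illustration, not part of the posted series:

    /*
     * Sketch only: map a caller-chosen sub-range of a PCI BAR instead of
     * a fixed 8KB window, so e.g. an MSI-X table elsewhere in the BAR can
     * simply be left out of the mapping.
     */
    void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
                                uint64_t offset, uint64_t size, Error **errp)
    {
        void *p;

        assert_bar_index_valid(s, index);
        p = mmap(NULL, MIN(size, s->bar_region_info[index].size - offset),
                 PROT_READ | PROT_WRITE, MAP_SHARED,
                 s->device, s->bar_region_info[index].offset + offset);
        if (p == MAP_FAILED) {
            error_setg_errno(errp, errno, "Failed to map BAR region");
            p = NULL;
        }
        return p;
    }

The NVMe driver can then map exactly the registers and doorbells it needs
and keep its 8KB assumption to itself.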



Re: [Qemu-block] [PATCH 2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

2018-09-13 Thread Alex Williamson
On Thu, 13 Sep 2018 12:04:34 +0200
Paolo Bonzini  wrote:

> On 13/09/2018 11:11, Paolo Bonzini wrote:
> > On 13/09/2018 08:03, Fam Zheng wrote:  
> >> On Wed, 09/12 14:42, Paolo Bonzini wrote:  
> >>> On 12/09/2018 13:50, Fam Zheng wrote:  
> > I think it's okay if it is invoked.  The sequence is first you stop the
> > vq, then you drain the BlockBackends, then you switch AioContext.  All
> > that matters is the outcome when virtio_scsi_dataplane_stop returns.  
>  Yes, but together with vIOMMU, it also effectively leads to a 
>  virtio_error(),
>  which is not clean. QEMU stderr when this call happens (with patch 1 but 
>  not
>  this patch):
> 
>  2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: 
>  detected translation failure (dev=02:00:00, iova=0x0)
>  2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not 
>  recorded due to compression of faults
>  2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized 
>  buffers are not allowed  
> >>>
> >>> But with iothread, virtio_scsi_dataplane_stop runs in another thread
> >>> than the iothread; in that case you still have a race where the iothread
> >>> can process the vq before aio_disable_external and print the error.
> >>>
> >>> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
> >>> DRIVER_OK bit in the status field.  Could this be a guest bug?  
> >>
> >> I'm not sure if it is a bug or not. I think what happens is the device is 
> >> left
> >> enabled by Seabios, and then reset by kernel.  
> > 
> > That makes sense, though I'm not sure why QEMU needs to process a
> > request long after SeaBIOS has left control to Linux.  Maybe it's just
> > that the messages should not go to QEMU stderr, and rather a trace point
> > should be enough.  
> 
> Aha, it's not that QEMU needs to poll, it's just that polling mode is
> enabled, and it decides to do one last iteration.  In general the virtio
> spec allows the hardware to poll whenever it wants, hence:
> 
> 1) I'm not sure that translation failures should mark the device as
> broken---definitely not when doing polling, possibly not even in
> response to the guest "kicking" the virtqueue.  Alex, does the PCI spec
> say anything about this?

AFAIK the PCI spec doesn't define anything about the IOMMU or response
to translation failures.  Depending on whether it's a read or write,
the device might see an unsupported request or not even be aware of the
error.  It's really a platform RAS question whether to have any more
significant response; most don't, but at least one tends to consider
IOMMU faults to be a data integrity issue worth bringing the system down.
We've struggled with handling ongoing DMA generating IOMMU faults
during kexec for a long time, so any sort of marking a device broken
for a fault should be thoroughly considered, especially when a device
could be assigned to a user who can trivially trigger a fault.
 
> 2) translation failures should definitely not print messages to stderr.

Yep, easy DoS vector for a malicious guest, or malicious userspace
driver within the guest.  Thanks,

Alex



Re: [RFC PATCH v3 2/5] util/vfio-helpers: Report error on unsupported host architectures

2020-08-18 Thread Alex Williamson
On Tue, 18 Aug 2020 18:45:06 +0200
Philippe Mathieu-Daudé  wrote:

> The vfio-helpers implementation expects a TYPEv1 IOMMU, see
> qemu_vfio_init_pci:
> 
>   263 if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
>   264 error_setg_errno(errp, errno, "VFIO IOMMU check failed");
> 
> Thus POWER SPAPR IOMMU is obviously not supported.
> 
> The implementation only cares about host page size alignment
> (usually 4KB on X86), not the IOMMU one, which can be problematic
> on Aarch64 when a 64kB page size is used. So Aarch64 is not
> supported either.
> 
> Report an error when the host architecture is different than X86:
> 
>  $ qemu-system-aarch64 \
> -drive file=nvme://0001:01:00.0/1,if=none,id=drive0 \
> -device virtio-blk-pci,drive=drive0
>   qemu-system-aarch64: -drive file=nvme://0001:01:00.0/1,if=none,id=drive0: 
> QEMU VFIO utility is not supported on this architecture
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Cc: Eric Auger 
> Cc: Drew Jones 
> Cc: Laurent Vivier 
> Cc: David Gibson 
> ---
>  util/vfio-helpers.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index e399e330e26..60017936e3e 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -420,14 +420,38 @@ static void qemu_vfio_open_common(QEMUVFIOState *s)
>  qemu_ram_foreach_block(qemu_vfio_init_ramblock, s);
>  }
>  
> +/**
> + * Return if the host architecture is supported.
> + *
> + * aarch64: IOMMU page alignment not respected
> + * ppc64:   SPAPR IOMMU window not configured
> + * x86-64:  Only architecture validated
> + * other:   Untested
> + */
> +static bool qemu_vfio_arch_supported(void)
> +{
> +bool supported = false;
> +
> +#if defined(HOST_X86_64)
> +supported = true;
> +#endif
> +
> +return supported;
> +}

Why does this need to be hard coded to specific architectures rather
than probing for type1 IOMMU support and looking at the iova_pgsizes
from VFIO_IOMMU_GET_INFO to see if there's a compatible size?  It
requires us to get a bit deeper into the device initialization, but we
should still be able to unwind out of the device realize.  Otherwise
we're throwing out aarch64 running with 4KB pages for no reason, right?  Thanks,

Alex
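
A rough sketch of that runtime probe, assuming the container has already
been opened and switched to the Type1 IOMMU as in the quoted
qemu_vfio_init_pci(), and using QEMU's qemu_real_host_page_size for the
host side of the comparison:

    /*
     * Sketch only: instead of hard-coding host architectures, ask the
     * IOMMU which page sizes it supports and check that the host page
     * size is among them.
     */
    static bool qemu_vfio_check_iommu_pgsize(int container, Error **errp)
    {
        struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };

        if (ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
            error_setg_errno(errp, errno, "Failed to get VFIO IOMMU info");
            return false;
        }
        /* iova_pgsizes is a bitmap of the IOMMU's supported page sizes */
        if (!(iommu_info.flags & VFIO_IOMMU_INFO_PGSIZES) ||
            !(iommu_info.iova_pgsizes & qemu_real_host_page_size)) {
            error_setg(errp, "No IOMMU page size compatible with the host");
            return false;
        }
        return true;
    }

With a check like this, an aarch64 host running with 4KB pages on a
4KB-capable IOMMU would not be rejected merely for not being x86-64.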


>  /**
>   * Open a PCI device, e.g. ":00:01.0".
>   */
>  QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp)
>  {
>  int r;
> -QEMUVFIOState *s = g_new0(QEMUVFIOState, 1);
> +QEMUVFIOState *s;
>  
> +if (!qemu_vfio_arch_supported()) {
> +error_setg(errp,
> +   "QEMU VFIO utility is not supported on this 
> architecture");
> +return NULL;
> +}
> +s = g_new0(QEMUVFIOState, 1);
>  r = qemu_vfio_init_pci(s, device, errp);
>  if (r) {
>  g_free(s);




Re: [RFC PATCH v3 4/5] util/vfio-helpers: Introduce qemu_vfio_pci_init_msix_irqs()

2020-08-18 Thread Alex Williamson
On Tue, 18 Aug 2020 18:45:08 +0200
Philippe Mathieu-Daudé  wrote:

> qemu_vfio_pci_init_irq() allows us to initialize any type of IRQ,
> but only one. Introduce qemu_vfio_pci_init_msix_irqs() which is
> specific to MSIX IRQ type, and allow us to use multiple IRQs
> (thus passing multiple eventfd notifiers).
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 ++
>  util/vfio-helpers.c | 53 +
>  2 files changed, 55 insertions(+)
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e4..63108ebc8da 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -28,5 +28,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, 
> void *bar,
>   uint64_t offset, uint64_t size);
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> int irq_type, Error **errp);
> +int qemu_vfio_pci_init_msix_irqs(QEMUVFIOState *s, EventNotifier *e,
> + unsigned irq_count, Error **errp);
>  
>  #endif
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 696f2d51712..fb3a79a5bcb 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -215,6 +215,59 @@ int qemu_vfio_pci_init_irq(QEMUVFIOState *s, 
> EventNotifier *e,
>  return 0;
>  }
>  
> +/**
> + * Initialize device MSIX IRQs and register event notifiers.
> + * @irq_count: number of MSIX IRQs to initialize
> + * @e: Array of @irq_count notifiers (each corresponding to a MSIX IRQ)
> + */
> +int qemu_vfio_pci_init_msix_irqs(QEMUVFIOState *s, EventNotifier *e,
> + unsigned irq_count, Error **errp)
> +{
> +int r;
> +struct vfio_irq_set *irq_set;
> +size_t irq_set_size;
> +struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
> +
> +irq_info.index = VFIO_PCI_MSIX_IRQ_INDEX;

Nit, this could be initialized in the declaration with argsz.

> +if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
> +error_setg_errno(errp, errno, "Failed to get device interrupt info");
> +return -errno;
> +}
> +if (irq_info.count <= irq_count) {


Shouldn't this only test strictly less than?  The API seems to leave
the problem of determining how many vectors might be available as an
exercise for the caller.  Thanks,

Alex
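
Concretely, the guard would become the following, which is also what the
v4 revision quoted later in this archive ends up doing:

    /* Fail only when the device exposes strictly fewer vectors than
     * requested; asking for exactly irq_info.count vectors is fine. */
    if (irq_info.count < irq_count) {
        error_setg(errp,
                   "Not enough device interrupts available (only %" PRIu32 ")",
                   irq_info.count);
        return -EINVAL;
    }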


> +error_setg(errp,
> +   "Not enough device interrupts available (only %" PRIu32 
> ")",
> +   irq_info.count);
> +return -EINVAL;
> +}
> +if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
> +error_setg(errp, "Device interrupt doesn't support eventfd");
> +return -EINVAL;
> +}
> +
> +irq_set_size = sizeof(*irq_set) + irq_count * sizeof(int32_t);
> +irq_set = g_malloc0(irq_set_size);
> +
> +/* Get to a known IRQ state */
> +*irq_set = (struct vfio_irq_set) {
> +.argsz = irq_set_size,
> +.flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
> +.index = irq_info.index,
> +.start = 0,
> +.count = irq_count,
> +};
> +
> +for (unsigned i = 0; i < irq_count; i++) {
> +((int32_t *)&irq_set->data)[i] = event_notifier_get_fd(&e[i]);
> +}
> +r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
> +g_free(irq_set);
> +if (r) {
> +error_setg_errno(errp, errno, "Failed to setup device interrupts");
> +return -errno;
> +}
> +return 0;
> +}
> +
>  static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf,
>   int size, int ofs)
>  {




Re: [RFC PATCH v4 4/5] util/vfio-helpers: Introduce qemu_vfio_pci_init_msix_irqs()

2020-08-19 Thread Alex Williamson
On Wed, 19 Aug 2020 18:03:17 +0200
Philippe Mathieu-Daudé  wrote:

> qemu_vfio_pci_init_irq() allows us to initialize any type of IRQ,
> but only one. Introduce qemu_vfio_pci_init_msix_irqs() which is
> specific to MSIX IRQ type, and allow us to use multiple IRQs
> (thus passing multiple eventfd notifiers).
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 ++
>  util/vfio-helpers.c | 57 +
>  2 files changed, 59 insertions(+)
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e4..8e6bd83ea41 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -28,5 +28,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, 
> void *bar,
>   uint64_t offset, uint64_t size);
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> int irq_type, Error **errp);
> +int qemu_vfio_pci_init_msix_irqs(QEMUVFIOState *s, EventNotifier *e,
> + unsigned *irq_count, Error **errp);
>  
>  #endif
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 8f4a3d452ed..6f833972587 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -216,6 +216,63 @@ int qemu_vfio_pci_init_irq(QEMUVFIOState *s, 
> EventNotifier *e,
>  return 0;
>  }
>  
> +/**
> + * Initialize device MSIX IRQs and register event notifiers.
> + * @irq_count: pointer to number of MSIX IRQs to initialize
> + * @notifier: Array of @irq_count notifiers (each corresponding to a MSIX 
> IRQ)
> +
> + * If the number of IRQs requested exceeds the available on the device,
> + * store the number of available IRQs in @irq_count and return -EOVERFLOW.
> + */
> +int qemu_vfio_pci_init_msix_irqs(QEMUVFIOState *s, EventNotifier *notifier,
> + unsigned *irq_count, Error **errp)
> +{
> +int r;
> +size_t irq_set_size;
> +struct vfio_irq_set *irq_set;
> +struct vfio_irq_info irq_info = {
> +.argsz = sizeof(irq_info),
> +.index = VFIO_PCI_MSIX_IRQ_INDEX
> +};
> +
> +if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
> +error_setg_errno(errp, errno, "Failed to get device interrupt info");
> +return -errno;
> +}
> +if (irq_info.count < *irq_count) {
> +error_setg(errp, "Not enough device interrupts available");
> +*irq_count = irq_info.count;
> +return -EOVERFLOW;
> +}
> +if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
> +error_setg(errp, "Device interrupt doesn't support eventfd");
> +return -EINVAL;
> +}
> +
> +irq_set_size = sizeof(*irq_set) + *irq_count * sizeof(int32_t);
> +irq_set = g_malloc0(irq_set_size);
> +
> +/* Get to a known IRQ state */
> +*irq_set = (struct vfio_irq_set) {
> +.argsz = irq_set_size,
> +.flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
> +.index = irq_info.index,
> +.start = 0,
> +.count = *irq_count,
> +};
> +
> +for (unsigned i = 0; i < *irq_count; i++) {
> +((int32_t *)&irq_set->data)[i] = event_notifier_get_fd(&notifier[i]);
> +}
> +r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
> +g_free(irq_set);
> +if (r) {
> +error_setg_errno(errp, errno, "Failed to setup device interrupts");
> +return -errno;

FWIW, the former irq_info.count check returns what the device is
capable of, but the platform might only have limited vector space
available, so this ioctl can also return a value indicating the
number of vectors *actually* available.  So if r > 0 you could return
it in *irq_count (which also makes me wonder if errno would be set in
that case).  Thanks,

Alex

> +}
> +return 0;
> +}
> +
>  static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf,
>   int size, int ofs)
>  {




Re: [PATCH v6 3/4] util/vfio-helpers: Introduce qemu_vfio_pci_init_msix_irqs()

2020-09-10 Thread Alex Williamson
On Thu, 10 Sep 2020 17:29:25 +0200
Philippe Mathieu-Daudé  wrote:

> Hi Stefan, Alex.
> 
> On 9/10/20 12:44 PM, Stefan Hajnoczi wrote:
> > On Wed, Sep 09, 2020 at 04:23:53PM +0200, Philippe Mathieu-Daudé wrote:  
> >> +/**
> >> + * Initialize device MSIX IRQs and register event notifiers.
> >> + * @irq_count: pointer to number of MSIX IRQs to initialize
> >> + * @notifier: Array of @irq_count notifiers (each corresponding to a MSIX 
> >> IRQ)
> >> +
> >> + * If the number of IRQs requested exceeds the available on the device,
> >> + * store the number of available IRQs in @irq_count and return -EOVERFLOW.
> >> + */
> >> +int qemu_vfio_pci_init_msix_irqs(QEMUVFIOState *s, EventNotifier 
> >> *notifier,
> >> + unsigned *irq_count, Error **errp)
> >> +{
> >> +int r;
> >> +size_t irq_set_size;
> >> +struct vfio_irq_set *irq_set;
> >> +struct vfio_irq_info irq_info = {
> >> +.argsz = sizeof(irq_info),
> >> +.index = VFIO_PCI_MSIX_IRQ_INDEX
> >> +};
> >> +
> >> +if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
> >> +error_setg_errno(errp, errno, "Failed to get device interrupt 
> >> info");
> >> +return -errno;
> >> +}
> >> +if (irq_info.count < *irq_count) {
> >> +error_setg(errp, "Not enough device interrupts available");
> >> +*irq_count = irq_info.count;
> >> +return -EOVERFLOW;
> >> +}
> >> +if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
> >> +error_setg(errp, "Device interrupt doesn't support eventfd");
> >> +return -EINVAL;
> >> +}
> >> +
> >> +irq_set_size = sizeof(*irq_set) + *irq_count * sizeof(int32_t);
> >> +irq_set = g_malloc0(irq_set_size);
> >> +
> >> +/* Get to a known IRQ state */
> >> +*irq_set = (struct vfio_irq_set) {
> >> +.argsz = irq_set_size,
> >> +.flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
> >> +.index = irq_info.index,
> >> +.start = 0,
> >> +.count = *irq_count,
> >> +};
> >> +
> >> +for (unsigned i = 0; i < *irq_count; i++) {
> >> +((int32_t *)&irq_set->data)[i] = 
> >> event_notifier_get_fd(&notifier[i]);
> >> +}
> >> +r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +g_free(irq_set);
> >> +if (r <= 0) {
> >> +error_setg_errno(errp, errno, "Failed to setup device 
> >> interrupts");
> >> +return -errno;
> >> +} else if (r < *irq_count) {
> >> +error_setg(errp, "Not enough device interrupts available");
> >> +*irq_count = r;
> >> +return -EOVERFLOW;
> >> +}  
> > 
> > EOVERFLOW can occur in two cases: VFIO_DEVICE_GET_IRQ_INFO and
> > VFIO_DEVICE_SET_IRQS.  
> 
> Yes.
> 
> > 
> > If it happens in the second case the notifier[] array has been
> > registered successfully.  
> 
> No, I don't think so:
> 
> vfio_pci_set_msi_trigger() registers the notifier only if
> vfio_msi_enable() succeeded (returned 0). If vfio_msi_enable()
> failed, it returns the number of vectors available but does
> not register the notifiers.
> 
> Alex, do you confirm?

Yes, if we can't setup what the user requested we don't setup anything.
However, I think we return zero on success, which seems to fall into
your error condition.  Has this been tested?  Thanks,

Alex
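
Putting the two observations together (the ioctl returns zero on success,
returns a negative value with errno set on plain failure, and can return a
positive number of actually-available vectors without registering
anything), a sketch of the corrected tail of the quoted
qemu_vfio_pci_init_msix_irqs() would be:

    r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
    g_free(irq_set);
    if (r < 0) {
        /* plain failure, errno describes it */
        error_setg_errno(errp, errno, "Failed to setup device interrupts");
        return -errno;
    } else if (r > 0) {
        /* nothing was registered; r reports how many vectors are possible */
        error_setg(errp, "Not enough device interrupts available");
        *irq_count = r;
        return -EOVERFLOW;
    }
    return 0;   /* r == 0: all notifiers registered */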

> > The caller has no way of distinguishing the two cases. Therefore the
> > caller doesn't know if the eventfds will be used by the kernel after
> > EOVERFLOW.
> > 
> > If the second case can ever happen then this function should probably
> > call VFIO_DEVICE_SET_IRQS again with VFIO_IRQ_SET_DATA_NONE to
> > unregister the eventfds before returning EOVERFLOW.
> > 
> > Stefan
> >   
> 




Re: [PATCH] MAINTAINERS: add Stefan Hajnoczi as block/nvme.c maintainer

2020-09-14 Thread Alex Williamson
On Mon, 14 Sep 2020 11:37:03 +0100
Stefan Hajnoczi  wrote:

> On Wed, Sep 09, 2020 at 02:51:50PM +0200, Philippe Mathieu-Daudé wrote:
> > On 9/7/20 1:16 PM, Stefan Hajnoczi wrote:  
> > > Development of the userspace NVMe block driver picked up again recently.
> > > After talking with Fam I am stepping up as block/nvme.c maintainer.
> > > Patches will be merged through my 'block' tree.
> > > 
> > > Cc: Kevin Wolf 
> > > Cc: Klaus Jensen 
> > > Cc: Fam Zheng 
> > > Signed-off-by: Stefan Hajnoczi 
> > > ---
> > >  MAINTAINERS | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index b233da2a73..a143941551 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -2895,10 +2895,12 @@ S: Supported
> > >  F: block/null.c
> > >  
> > >  NVMe Block Driver
> > > -M: Fam Zheng 
> > > +M: Stefan Hajnoczi 
> > > +R: Fam Zheng 
> > >  L: qemu-block@nongnu.org
> > >  S: Supported
> > >  F: block/nvme*
> > > +T: git https://github.com/stefanha/qemu.git block  
> > 
> > As this driver is the only consumer of the 'VFIO helpers',
> > maybe we can:
> > 
> > - maintains them in the same bucket
> > - add another entry (eventually with R: tag for Alex)
> > 
> > The 'VFIO helpers' files are:
> > - util/vfio-helpers.c
> > - include/qemu/vfio-helpers.h
> > 
> > What do you think?  
> 
> I'm happy to include vfio-helpers with the goal of eventually switching
> to vfio-common.
> 
> Alex: does this sound good?

Yep, ok by me.  Thanks,

Alex




Re: [PATCH] vfio: fix incorrect print type

2020-10-19 Thread Alex Williamson
On Mon, 19 Oct 2020 13:32:17 +
Zhengui li  wrote:

> fix incorrect print type.

Why is it incorrect?  Describe your change.  Patches must include a
Signed-off-by to adhere to the developer's certificate of origin.
Thanks,

Alex

> ---
>  hw/vfio/common.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 13471ae..acc3356 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -203,7 +203,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  buf.qword = cpu_to_le64(data);
>  break;
>  default:
> -hw_error("vfio: unsupported write size, %d bytes", size);
> +hw_error("vfio: unsupported write size, %u bytes", size);
>  break;
>  }
>  
> @@ -260,7 +260,7 @@ uint64_t vfio_region_read(void *opaque,
>  data = le64_to_cpu(buf.qword);
>  break;
>  default:
> -hw_error("vfio: unsupported read size, %d bytes", size);
> +hw_error("vfio: unsupported read size, %u bytes", size);
>  break;
>  }
>  




Re: [PATCH 12/12] hw/vfio/pci-quirks: Replace the word 'blacklist'

2021-02-02 Thread Alex Williamson
On Tue,  2 Feb 2021 21:58:24 +0100
Philippe Mathieu-Daudé  wrote:

> Follow the inclusive terminology from the "Conscious Language in your
> Open Source Projects" guidelines [*] and replace the word "blacklist"
> appropriately.
> 
> [*] https://github.com/conscious-lang/conscious-lang-docs/blob/main/faq.md
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/vfio/pci.h|  2 +-
>  hw/vfio/pci-quirks.c | 14 +++---
>  hw/vfio/pci.c|  4 ++--
>  hw/vfio/trace-events |  2 +-
>  4 files changed, 11 insertions(+), 11 deletions(-)

Thanks!

Reviewed-by: Alex Williamson 
Acked-by: Alex Williamson 

> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 1574ef983f8..64777516d16 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -197,7 +197,7 @@ void vfio_pci_write_config(PCIDevice *pdev,
>  uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
>  void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
>  
> -bool vfio_blacklist_opt_rom(VFIOPCIDevice *vdev);
> +bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev);
>  void vfio_vga_quirk_setup(VFIOPCIDevice *vdev);
>  void vfio_vga_quirk_exit(VFIOPCIDevice *vdev);
>  void vfio_vga_quirk_finalize(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index fc8d63c8504..81c3e30df77 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -43,19 +43,19 @@
>  static const struct {
>  uint32_t vendor;
>  uint32_t device;
> -} romblacklist[] = {
> +} rom_denylist[] = {
>  { 0x14e4, 0x168e }, /* Broadcom BCM 57810 */
>  };
>  
> -bool vfio_blacklist_opt_rom(VFIOPCIDevice *vdev)
> +bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev)
>  {
>  int i;
>  
> -for (i = 0 ; i < ARRAY_SIZE(romblacklist); i++) {
> -if (vfio_pci_is(vdev, romblacklist[i].vendor, 
> romblacklist[i].device)) {
> -trace_vfio_quirk_rom_blacklisted(vdev->vbasedev.name,
> - romblacklist[i].vendor,
> - romblacklist[i].device);
> +for (i = 0 ; i < ARRAY_SIZE(rom_denylist); i++) {
> +if (vfio_pci_is(vdev, rom_denylist[i].vendor, 
> rom_denylist[i].device)) {
> +trace_vfio_quirk_rom_in_denylist(vdev->vbasedev.name,
> + rom_denylist[i].vendor,
> + rom_denylist[i].device);
>  return true;
>  }
>  }
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index f74be782091..759a3b1abf4 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -900,7 +900,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
>  
>  if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
>  /* Since pci handles romfile, just print a message and return */
> -if (vfio_blacklist_opt_rom(vdev) && vdev->pdev.romfile) {
> +if (vfio_opt_rom_in_denylist(vdev) && vdev->pdev.romfile) {
>  warn_report("Device at %s is known to cause system instability"
>  " issues during option rom execution",
>  vdev->vbasedev.name);
> @@ -927,7 +927,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
>  return;
>  }
>  
> -if (vfio_blacklist_opt_rom(vdev)) {
> +if (vfio_opt_rom_in_denylist(vdev)) {
>  if (dev->opts && qemu_opt_get(dev->opts, "rombar")) {
>  warn_report("Device at %s is known to cause system instability"
>  " issues during option rom execution",
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index c0e75f24b76..079f53acf28 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -49,7 +49,7 @@ vfio_pci_emulated_sub_vendor_id(const char *name, uint16_t 
> val) "%s 0x%04x"
>  vfio_pci_emulated_sub_device_id(const char *name, uint16_t val) "%s 0x%04x"
>  
>  # pci-quirks.c
> -vfio_quirk_rom_blacklisted(const char *name, uint16_t vid, uint16_t did) "%s 
> %04x:%04x"
> +vfio_quirk_rom_in_denylist(const char *name, uint16_t vid, uint16_t did) "%s 
> %04x:%04x"
>  vfio_quirk_generic_window_address_write(const char *name, const char * 
> region_name, uint64_t data) "%s %s 0x%"PRIx64
>  vfio_quirk_generic_window_data_read(const char *name, const char * 
> region_name, uint64_t data) "%s %s 0x%"PRIx64
>  vfio_quirk_generic_window_data_write(const char *name, const char * 
> region_name, uint64_t data) "%s %s 0x%"PRIx64




Re: [PATCH v2 8/8] hw/vfio/pci-quirks: Replace the word 'blacklist'

2021-03-02 Thread Alex Williamson
On Fri,  5 Feb 2021 18:18:17 +0100
Philippe Mathieu-Daudé  wrote:

> Follow the inclusive terminology from the "Conscious Language in your
> Open Source Projects" guidelines [*] and replace the word "blacklist"
> appropriately.
> 
> [*] https://github.com/conscious-lang/conscious-lang-docs/blob/main/faq.md
> 
> Reviewed-by: Alex Williamson 
> Acked-by: Alex Williamson 
> Reviewed-by: Daniel P. Berrangé 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/vfio/pci.h|  2 +-
>  hw/vfio/pci-quirks.c | 14 +++---
>  hw/vfio/pci.c|  4 ++--
>  hw/vfio/trace-events |  2 +-
>  4 files changed, 11 insertions(+), 11 deletions(-)

I thought someone might grab the whole series, but since that hasn't
happened yet, I've queued this one.  Thanks,

Alex

> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 1574ef983f8..64777516d16 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -197,7 +197,7 @@ void vfio_pci_write_config(PCIDevice *pdev,
>  uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
>  void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
>  
> -bool vfio_blacklist_opt_rom(VFIOPCIDevice *vdev);
> +bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev);
>  void vfio_vga_quirk_setup(VFIOPCIDevice *vdev);
>  void vfio_vga_quirk_exit(VFIOPCIDevice *vdev);
>  void vfio_vga_quirk_finalize(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index fc8d63c8504..81c3e30df77 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -43,19 +43,19 @@
>  static const struct {
>  uint32_t vendor;
>  uint32_t device;
> -} romblacklist[] = {
> +} rom_denylist[] = {
>  { 0x14e4, 0x168e }, /* Broadcom BCM 57810 */
>  };
>  
> -bool vfio_blacklist_opt_rom(VFIOPCIDevice *vdev)
> +bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev)
>  {
>  int i;
>  
> -for (i = 0 ; i < ARRAY_SIZE(romblacklist); i++) {
> -if (vfio_pci_is(vdev, romblacklist[i].vendor, 
> romblacklist[i].device)) {
> -trace_vfio_quirk_rom_blacklisted(vdev->vbasedev.name,
> - romblacklist[i].vendor,
> - romblacklist[i].device);
> +for (i = 0 ; i < ARRAY_SIZE(rom_denylist); i++) {
> +if (vfio_pci_is(vdev, rom_denylist[i].vendor, 
> rom_denylist[i].device)) {
> +trace_vfio_quirk_rom_in_denylist(vdev->vbasedev.name,
> + rom_denylist[i].vendor,
> + rom_denylist[i].device);
>  return true;
>  }
>  }
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index f74be782091..759a3b1abf4 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -900,7 +900,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
>  
>  if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
>  /* Since pci handles romfile, just print a message and return */
> -if (vfio_blacklist_opt_rom(vdev) && vdev->pdev.romfile) {
> +if (vfio_opt_rom_in_denylist(vdev) && vdev->pdev.romfile) {
>  warn_report("Device at %s is known to cause system instability"
>  " issues during option rom execution",
>  vdev->vbasedev.name);
> @@ -927,7 +927,7 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
>  return;
>  }
>  
> -if (vfio_blacklist_opt_rom(vdev)) {
> +if (vfio_opt_rom_in_denylist(vdev)) {
>  if (dev->opts && qemu_opt_get(dev->opts, "rombar")) {
>  warn_report("Device at %s is known to cause system instability"
>  " issues during option rom execution",
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index c0e75f24b76..079f53acf28 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -49,7 +49,7 @@ vfio_pci_emulated_sub_vendor_id(const char *name, uint16_t 
> val) "%s 0x%04x"
>  vfio_pci_emulated_sub_device_id(const char *name, uint16_t val) "%s 0x%04x"
>  
>  # pci-quirks.c
> -vfio_quirk_rom_blacklisted(const char *name, uint16_t vid, uint16_t did) "%s 
> %04x:%04x"
> +vfio_quirk_rom_in_denylist(const char *name, uint16_t vid, uint16_t did) "%s 
> %04x:%04x"
>  vfio_quirk_generic_window_address_write(const char *name, const char * 
> region_name, uint64_t data) "%s %s 0x%"PRIx64
>  vfio_quirk_generic_window_data_read(const char *name, const char * 
> region_name, uint64_t data) "%s %s 0x%"PRIx64
>  vfio_quirk_generic_window_data_write(const char *name, const char * 
> region_name, uint64_t data) "%s %s 0x%"PRIx64




Re: [RFC PATCH 2/3] util/vfio-helpers: Add trace event to display device IRQs available

2020-08-11 Thread Alex Williamson
On Tue, 11 Aug 2020 19:28:44 +0200
Philippe Mathieu-Daudé  wrote:

> Add a trace event to display the amount of IRQs available
> on the device.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 1 +
>  util/trace-events   | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 6defefcc01..3ad7e6be52 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -192,6 +192,7 @@ int qemu_vfio_pci_init_irq(QEMUVFIOState *s, 
> EventNotifier *e,
>  error_setg(errp, "Device interrupt doesn't support eventfd");
>  return -EINVAL;
>  }
> +trace_qemu_vfio_pci_init_irq(irq_info.count);
>  
>  irq_set_size = sizeof(*irq_set) + sizeof(int32_t);
>  irq_set = g_malloc0(irq_set_size);
> diff --git a/util/trace-events b/util/trace-events
> index 0ce42822eb..351dbdbe3c 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -83,3 +83,4 @@ qemu_vfio_new_mapping(void *s, void *host, size_t size, int 
> index, uint64_t iova
>  qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p 
> host %p size %zu iova 0x%"PRIx64
>  qemu_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t 
> *iova) "s %p host %p size %zu temporary %d iova %p"
>  qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
> +qemu_vfio_pci_init_irq(uint32_t count) "device interrupt count: %"PRIu32

Knowing the count independent of which index we're looking at doesn't
seem very useful.  Thanks,

Alex




Re: [RFC PATCH 3/3] util/vfio-helpers: Let qemu_vfio_pci_init_irq take IRQ index argument

2020-08-11 Thread Alex Williamson
On Tue, 11 Aug 2020 19:28:45 +0200
Philippe Mathieu-Daudé  wrote:

> Add a new 'index' argument to qemu_vfio_pci_init_irq() to be able
> to initialize other IRQs than IRQ #0. Adapt the single user of this
> API in nvme_init().

This is actually addressing what the vfio uAPI refers to as the
subindex; the index is the interrupt type, e.g. MSI-X.

> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 +-
>  block/nvme.c|  2 +-
>  util/vfio-helpers.c | 12 +---
>  3 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e..ff63e75096 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -27,6 +27,6 @@ void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
>  void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>   uint64_t offset, uint64_t size);
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> -   int irq_type, Error **errp);
> +   int irq_type, unsigned index, Error **errp);
>  
>  #endif
> diff --git a/block/nvme.c b/block/nvme.c
> index 374e268915..2b3986b66d 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -757,7 +757,7 @@ static int nvme_init(BlockDriverState *bs, const char 
> *device, int namespace,
>  }
>  
>  ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier,
> - VFIO_PCI_MSIX_IRQ_INDEX, errp);
> + VFIO_PCI_MSIX_IRQ_INDEX, 0, errp);
>  if (ret) {
>  goto out;
>  }
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 3ad7e6be52..ba9a869364 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -173,10 +173,11 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int 
> index, void *bar,
>  }
>  
>  /**
> - * Initialize device IRQ with @irq_type and register an event notifier.
> + * Initialize device IRQ @index with @irq_type
> + * and register an event notifier.
>   */
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> -   int irq_type, Error **errp)
> +   int irq_type, unsigned index, Error **errp)
>  {


But MSI-X will expose the VFIO_IRQ_INFO_NORESIZE flag, which means that
we cannot incrementally enable additional subindexes, aka vectors,
without first disabling and re-enabling the entire index/irq_type.
Therefore this is exposing an API that isn't actually supported.
Thanks,

Alex
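
A caller can detect that constraint at runtime; a minimal sketch, reusing
the ioctls the helper already issues (fragment, not a complete function):

    struct vfio_irq_info irq_info = {
        .argsz = sizeof(irq_info),
        .index = VFIO_PCI_MSIX_IRQ_INDEX,
    };

    if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
        error_setg_errno(errp, errno, "Failed to get device interrupt info");
        return -errno;
    }
    if (irq_info.flags & VFIO_IRQ_INFO_NORESIZE) {
        /*
         * Vectors cannot be enabled one by one for this index: enabling an
         * additional subindex later requires disabling the whole index and
         * re-enabling it with the full set of eventfds in a single
         * VFIO_DEVICE_SET_IRQS call.
         */
    }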

>  int r;
>  struct vfio_irq_set *irq_set;
> @@ -193,6 +194,11 @@ int qemu_vfio_pci_init_irq(QEMUVFIOState *s, 
> EventNotifier *e,
>  return -EINVAL;
>  }
>  trace_qemu_vfio_pci_init_irq(irq_info.count);
> +if (index >= irq_info.count) {
> +error_setg(errp, "Device has %"PRIu32" interrupts (requested index 
> %u)",
> +   irq_info.count, index);
> +return -EINVAL;
> +}
>  
>  irq_set_size = sizeof(*irq_set) + sizeof(int32_t);
>  irq_set = g_malloc0(irq_set_size);
> @@ -202,7 +208,7 @@ int qemu_vfio_pci_init_irq(QEMUVFIOState *s, 
> EventNotifier *e,
>  .argsz = irq_set_size,
>  .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
>  .index = irq_info.index,
> -.start = 0,
> +.start = index,
>  .count = 1,
>  };
>  




Re: [RFC PATCH v2 2/7] util/vfio-helpers: Move IRQ 'type' from pci_init_irq() to open_pci()

2020-08-13 Thread Alex Williamson
On Thu, 13 Aug 2020 19:29:52 +0200
Philippe Mathieu-Daudé  wrote:

> Once opened, we will use the same IRQ type for all our event
> notifiers, so pass the argument when we open the PCI device,
> store the IRQ type in the driver state, and directly use the
> value saved in the state each time we call qemu_vfio_pci_init_irq.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---

This feels quite a bit strange to me, a PCI device can operate in one
of several interrupt modes, or without interrupts at all.  Why would we
force a user of this interface to define the interrupt type they'll use
in advance and then not even verify if the device supports that type?
A driver might want to fall back to a different interrupt type if the
one they want is not supported.  If we want to abstract this from the
driver then we should at least have an interface separate from the
initial open function that tells us to preconfigure some specified
number of vectors.  We could then have a preference policy that would
attempt to use MSI-X, followed by MSI, followed by INTx (assuming
request is for a single vector), based on what the device supports.
Then a driver could fallback to fewer interrupts if the device does not
support, or the host system cannot provide, the desired number of
interrupts.  Thanks,

Alex
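
A sketch of what such a separate configuration step with a preference
policy could look like; qemu_vfio_pci_init_irqs() at the end is a
hypothetical name standing in for whatever per-type setup helper the
series eventually provides:

    static int qemu_vfio_pci_setup_irqs(QEMUVFIOState *s,
                                        EventNotifier *notifiers,
                                        unsigned nvec, Error **errp)
    {
        /* Preference order: MSI-X, then MSI, then INTx (single vector only) */
        static const int preference[] = {
            VFIO_PCI_MSIX_IRQ_INDEX,
            VFIO_PCI_MSI_IRQ_INDEX,
            VFIO_PCI_INTX_IRQ_INDEX,
        };

        for (size_t i = 0; i < ARRAY_SIZE(preference); i++) {
            struct vfio_irq_info info = {
                .argsz = sizeof(info),
                .index = preference[i],
            };

            if (preference[i] == VFIO_PCI_INTX_IRQ_INDEX && nvec > 1) {
                continue;   /* INTx can only ever provide one vector */
            }
            if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &info) ||
                info.count < nvec || !(info.flags & VFIO_IRQ_INFO_EVENTFD)) {
                continue;   /* type unsupported or too few vectors, try next */
            }
            /* hypothetical helper: wire nvec eventfds to the chosen index */
            return qemu_vfio_pci_init_irqs(s, notifiers, preference[i],
                                           nvec, errp);
        }
        error_setg(errp, "No suitable interrupt type for %u vectors", nvec);
        return -ENOTSUP;
    }

A driver that can live with fewer vectors could retry with a smaller nvec
when this fails.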


>  include/qemu/vfio-helpers.h |  5 +++--
>  block/nvme.c|  6 +++---
>  util/vfio-helpers.c | 13 +
>  3 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e..728f40922b 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -15,7 +15,8 @@
>  
>  typedef struct QEMUVFIOState QEMUVFIOState;
>  
> -QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
> +QEMUVFIOState *qemu_vfio_open_pci(const char *device, int irq_type,
> +  Error **errp);
>  void qemu_vfio_close(QEMUVFIOState *s);
>  int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
>bool temporary, uint64_t *iova_list);
> @@ -27,6 +28,6 @@ void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
>  void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>   uint64_t offset, uint64_t size);
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> -   int irq_type, Error **errp);
> +   Error **errp);
>  
>  #endif
> diff --git a/block/nvme.c b/block/nvme.c
> index a61e86a83e..21b0770c02 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -711,7 +711,8 @@ static int nvme_init(BlockDriverState *bs, const char 
> *device, int namespace,
>  return ret;
>  }
>  
> -s->vfio = qemu_vfio_open_pci(device, errp);
> +s->vfio = qemu_vfio_open_pci(device, VFIO_PCI_MSIX_IRQ_INDEX,
> + errp);
>  if (!s->vfio) {
>  ret = -EINVAL;
>  goto out;
> @@ -784,8 +785,7 @@ static int nvme_init(BlockDriverState *bs, const char 
> *device, int namespace,
>  }
>  }
>  
> -ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier,
> - VFIO_PCI_MSIX_IRQ_INDEX, errp);
> +ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier, errp);
>  if (ret) {
>  goto out;
>  }
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 9cb9b553a5..f1196e43dc 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -43,6 +43,8 @@ typedef struct {
>  struct QEMUVFIOState {
>  QemuMutex lock;
>  
> +int irq_type; /* vfio index */
> +
>  /* These fields are protected by BQL */
>  int container;
>  int group;
> @@ -176,14 +178,14 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int 
> index, void *bar,
>   * Initialize device IRQ with @irq_type and and register an event notifier.
>   */
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> -   int irq_type, Error **errp)
> +   Error **errp)
>  {
>  int r;
>  struct vfio_irq_set *irq_set;
>  size_t irq_set_size;
>  struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
>  
> -irq_info.index = irq_type;
> +irq_info.index = s->irq_type;
>  if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
>  error_setg_errno(errp, errno, "Failed to get device interrupt info");
>  return -errno;
> @@ -237,6 +239,7 @@ static int qemu_vfio_pci_write_config(QEMUVFIOState *s, 
> void *buf, int size, int
>  }
>  
>  static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
> +  int irq_type,
>Error **errp)
>  {
>  int ret;
> @@ -331,6 +334,7 @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const 
> char *device,
>  ret = -errno;
>  goto fail;
>  }
> +s->irq_type = irq_type;

Re: [RFC PATCH v2 4/7] util/vfio-helpers: Check the device allow up to 'irq_count' IRQs

2020-08-13 Thread Alex Williamson
On Thu, 13 Aug 2020 19:29:54 +0200
Philippe Mathieu-Daudé  wrote:

> As we want to use more than one single IRQ, add a check that
> the device accept our request to use multiple IRQs.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 6 ++
>  util/trace-events   | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index bad60076f3..b81d4c70c2 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -335,6 +335,12 @@ static int qemu_vfio_init_pci(QEMUVFIOState *s, const 
> char *device,
>  ret = -errno;
>  goto fail;
>  }
> +trace_qemu_vfio_init_pci(device_info.num_irqs);
> +if (device_info.num_irqs < irq_count) {
> +error_setg(errp, "Invalid device IRQ count");
> +ret = -EINVAL;
> +goto fail;
> +}

This is confusing the number of IRQ indexes (i.e. IRQ types -
INTx/MSI/MSI-X plus virtual interrupts like error reporting and device
request) with the number of sub-indexes available for a given type
again.  You actually need to look at VFIO_DEVICE_GET_IRQ_INFO for the
specified irq_type to see if it supports irq_count sub-indexes.

Maybe think of interrupts as a 2-dimensional array, we have:

INDEX     \  SUBINDEX
           \  0 1 2 3 4 ... N
==============================
INTx  [0]  |
MSI   [1]  |
MSI-X [2]  |
...   [M]  |

VFIO_DEVICE_GET_INFO only tells us essentially the last INDEX that the
device supports.  In order to learn about the number of SUBINDEXes, or
vectors, if any, that each INDEX provides, we need to look at
VFIO_DEVICE_GET_IRQ_INFO.  When we're wanting to probe support for some
number of concurrent device interrupt vectors, we need to look at the
vfio_irq_info.count value for the desired index, ie. the extent of the
entries in the row associated with our column index type.  Thanks,

Alex
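
In code, the probe described above amounts to asking about one row of that
table rather than counting the rows; a sketch against the quoted
qemu_vfio_init_pci() (fragment, reusing its irq_type/irq_count arguments
and its fail label):

    struct vfio_irq_info irq_info = {
        .argsz = sizeof(irq_info),
        .index = irq_type,          /* e.g. VFIO_PCI_MSIX_IRQ_INDEX */
    };

    if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
        error_setg_errno(errp, errno, "Failed to get device interrupt info");
        ret = -errno;
        goto fail;
    }
    if (irq_info.count < irq_count) {
        /* this index (row) does not offer enough subindexes (columns) */
        error_setg(errp, "Invalid device IRQ count");
        ret = -EINVAL;
        goto fail;
    }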

>  s->irq_type = irq_type;
>  s->irq_count = irq_count;
>  
> diff --git a/util/trace-events b/util/trace-events
> index 0ce42822eb..2e8be3 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -83,3 +83,4 @@ qemu_vfio_new_mapping(void *s, void *host, size_t size, int 
> index, uint64_t iova
>  qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p 
> host %p size %zu iova 0x%"PRIx64
>  qemu_vfio_dma_map(void *s, void *host, size_t size, bool temporary, uint64_t 
> *iova) "s %p host %p size %zu temporary %d iova %p"
>  qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
> +qemu_vfio_init_pci(uint32_t count) "device interrupt count: %"PRIu32




Re: [RFC PATCH v2 6/7] util/vfio-helpers: Allow to set EventNotifier to particular IRQ

2020-08-13 Thread Alex Williamson
On Thu, 13 Aug 2020 19:29:56 +0200
Philippe Mathieu-Daudé  wrote:

> Let qemu_vfio_pci_init_irq() take an 'index' argument, so we can
> set the EventNotifier to a specific IRQ.
> Add a safety check. Since our helper is limited to one single IRQ
> we are safe.
> 
> Our only user is the NVMe block driver, update it (also safe because
> it only uses the first IRQ).
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  include/qemu/vfio-helpers.h |  2 +-
>  block/nvme.c|  2 +-
>  util/vfio-helpers.c | 11 +--
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 728f40922b..5c2d8ee5b3 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -28,6 +28,6 @@ void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
>  void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>   uint64_t offset, uint64_t size);
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> -   Error **errp);
> +   int irq_index, Error **errp);
>  
>  #endif
> diff --git a/block/nvme.c b/block/nvme.c
> index 21b0770c02..a5ef571492 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -785,7 +785,7 @@ static int nvme_init(BlockDriverState *bs, const char 
> *device, int namespace,
>  }
>  }
>  
> -ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier, errp);
> +ret = qemu_vfio_pci_init_irq(s->vfio, &s->irq_notifier, INDEX_ADMIN, 
> errp);
>  if (ret) {
>  goto out;
>  }
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 5781e4f066..7a934d1a1b 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -180,13 +180,20 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int 
> index, void *bar,
>   * Initialize device IRQ with @irq_type and and register an event notifier.
>   */
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> -   Error **errp)
> +   int irq_index, Error **errp)
>  {
>  int r;
>  struct vfio_irq_set *irq_set;
>  size_t irq_set_size;
>  struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
>  
> +if (irq_index >= s->irq_count) {
> +error_setg(errp,
> +   "Illegal interrupt %d (device initialized for %zu in 
> total)",
> +   irq_index, s->irq_count);
> +return -EINVAL;
> +}
> +
>  irq_info.index = s->irq_type;
>  if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
>  error_setg_errno(errp, errno, "Failed to get device interrupt info");
> @@ -196,7 +203,7 @@ int qemu_vfio_pci_init_irq(QEMUVFIOState *s, 
> EventNotifier *e,
>  error_setg(errp, "Device interrupt doesn't support eventfd");
>  return -EINVAL;
>  }
> -s->eventfd[0] = event_notifier_get_fd(e);
> +s->eventfd[irq_index] = event_notifier_get_fd(e);

This can't work.  For each fd in the array provided, the kernel is going
to try to get that fd and configure it as an eventfd.  So until we have
set eventfds for all indexes {0..irq_count}, this SET_IRQS ioctl will
fail.  I would probably make that pre-configure function I referred to
earlier and create a single spurious interrupt eventfd and configure
all of the vectors to signal that one eventfd.  You could then have
this per vector callback swap the eventfd with the caller provided one
for the given vector.

NB, I don't know if you're going to run into trouble with this scheme
with the fact that devices can behave differently based on the number
of vectors they have enabled.  You're creating an interface for a
driver, so presumably that driver knows, for example, that as soon as
vector N is enabled, signaling for event foo moves from vector 0 to
vector N.  Thanks,

Alex
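
A sketch of the pre-configuration scheme described above; s->spurious_fd
is a hypothetical field added only for this illustration (eventfd() needs
<sys/eventfd.h>), and the code is untested:

    static int qemu_vfio_pci_preconfigure_irqs(QEMUVFIOState *s, unsigned nvec,
                                               Error **errp)
    {
        size_t irq_set_size = sizeof(struct vfio_irq_set) + nvec * sizeof(int32_t);
        struct vfio_irq_set *irq_set = g_malloc0(irq_set_size);
        int32_t *fds = (int32_t *)&irq_set->data;
        int r;

        /* one eventfd that merely swallows spurious interrupts */
        s->spurious_fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
        if (s->spurious_fd < 0) {
            error_setg_errno(errp, errno, "Failed to create spurious eventfd");
            g_free(irq_set);
            return -errno;
        }

        *irq_set = (struct vfio_irq_set) {
            .argsz = irq_set_size,
            .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
            .index = s->irq_type,
            .start = 0,
            .count = nvec,
        };
        for (unsigned i = 0; i < nvec; i++) {
            fds[i] = s->spurious_fd;    /* every vector signals the same fd */
        }
        r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
        g_free(irq_set);
        if (r) {
            error_setg_errno(errp, errno, "Failed to preconfigure interrupts");
            return -errno;
        }
        return 0;
    }

qemu_vfio_pci_init_irq() could then issue a second VFIO_DEVICE_SET_IRQS
with .start = irq_index and .count = 1 to swap a single vector over to the
caller's notifier, avoiding the all-or-nothing failure described above.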

>  
>  irq_set_size = sizeof(*irq_set) + s->irq_count * sizeof(int32_t);
>  irq_set = g_malloc0(irq_set_size);




Re: [RFC PATCH v2 7/7] util/vfio-helpers: Allow opening device requesting for multiple IRQs

2020-08-13 Thread Alex Williamson
On Thu, 13 Aug 2020 19:29:57 +0200
Philippe Mathieu-Daudé  wrote:

> Now that our helper is ready for handling multiple IRQs, let
> qemu_vfio_open_pci() take an 'irq_count' argument.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---

As with patch 2/7, tying IRQ setup to the opening of a device seems
wrong.  Get the device open, then create an interface to configure the
interrupt.  Thanks,

Alex


>  include/qemu/vfio-helpers.h | 2 +-
>  block/nvme.c| 5 -
>  util/vfio-helpers.c | 4 ++--
>  3 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 5c2d8ee5b3..4773b116df 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -16,7 +16,7 @@
>  typedef struct QEMUVFIOState QEMUVFIOState;
>  
>  QEMUVFIOState *qemu_vfio_open_pci(const char *device, int irq_type,
> -  Error **errp);
> +  unsigned irq_count, Error **errp);
>  void qemu_vfio_close(QEMUVFIOState *s);
>  int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
>bool temporary, uint64_t *iova_list);
> diff --git a/block/nvme.c b/block/nvme.c
> index a5ef571492..2d7aac3903 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -106,6 +106,9 @@ QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 
> 0x1000);
>  #define INDEX_ADMIN 0
>  #define INDEX_IO(n) (1 + n)
>  
> +/* This driver shares a single MSIX IRQ for the admin and I/O queues */
> +#define MSIX_IRQ_COUNT  1
> +
>  struct BDRVNVMeState {
>  AioContext *aio_context;
>  QEMUVFIOState *vfio;
> @@ -712,7 +715,7 @@ static int nvme_init(BlockDriverState *bs, const char 
> *device, int namespace,
>  }
>  
>  s->vfio = qemu_vfio_open_pci(device, VFIO_PCI_MSIX_IRQ_INDEX,
> - errp);
> + MSIX_IRQ_COUNT, errp);
>  if (!s->vfio) {
>  ret = -EINVAL;
>  goto out;
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 7a934d1a1b..36fafef0d3 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -450,12 +450,12 @@ static void qemu_vfio_open_common(QEMUVFIOState *s)
>   * Open a PCI device, e.g. ":00:01.0".
>   */
>  QEMUVFIOState *qemu_vfio_open_pci(const char *device, int irq_type,
> -  Error **errp)
> +  unsigned irq_count, Error **errp)
>  {
>  int r;
>  QEMUVFIOState *s = g_new0(QEMUVFIOState, 1);
>  
> -r = qemu_vfio_init_pci(s, device, irq_type, 1, errp);
> +r = qemu_vfio_init_pci(s, device, irq_type, irq_count, errp);
>  if (r) {
>  g_free(s);
>  return NULL;




Re: [PATCH] Report stringified errno in VFIO related errors

2020-02-14 Thread Alex Williamson
On Fri, 14 Feb 2020 10:19:50 +0100
Michal Privoznik  wrote:

> On 2/14/20 9:47 AM, Michal Privoznik wrote:
> > In a few places we report errno formatted as a negative integer.
> > This is not as user friendly as it can be. Use strerror() and/or
> > error_setg_errno() instead.
> > 
> > Signed-off-by: Michal Privoznik 
> > ---
> >   hw/vfio/common.c| 2 +-
> >   util/vfio-helpers.c | 6 +++---
> >   2 files changed, 4 insertions(+), 4 deletions(-)
> >   
> 
> BTW the reason I've noticed these is because I'm seeing some errors when 
> assigning my NVMe disk to qemu. This is the full command line:
> 
> 
> /home/zippy/work/qemu/qemu.git/x86_64-softmmu/qemu-system-x86_64 \
> -name guest=fedora,debug-threads=on \
> -S \
> -object 
> secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-fedora/master-key.aes
>  
> \
> -machine pc-i440fx-4.1,accel=kvm,usb=off,dump-guest-core=off \
> -cpu host \
> -m size=4194304k,slots=16,maxmem=1099511627776k \
> -overcommit mem-lock=off \
> -smp 4,sockets=1,dies=1,cores=2,threads=2 \
> -object iothread,id=iothread1 \
> -object iothread,id=iothread2 \
> -object iothread,id=iothread3 \
> -object iothread,id=iothread4 \
> -mem-prealloc \
> -mem-path /hugepages2M/libvirt/qemu/2-fedora \
> -numa node,nodeid=0,cpus=0,mem=4096 \
> -uuid 63840878-0deb-4095-97e6-fc444d9bc9fa \
> -no-user-config \
> -nodefaults \
> -chardev socket,id=charmonitor,fd=31,server,nowait \
> -mon chardev=charmonitor,id=monitor,mode=control \
> -rtc base=utc \
> -no-shutdown \
> -global PIIX4_PM.disable_s3=0 \
> -global PIIX4_PM.disable_s4=0 \
> -boot menu=on,strict=on \
> -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
> -blockdev 
> '{"driver":"file","filename":"/var/lib/libvirt/images/fedora.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"}'
>  
> \
> -blockdev 
> '{"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-2-storage","backing":null}'
>  
> \
> -device 
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-2-format,id=scsi0-0-0-0,bootindex=1
>  
> \
> -blockdev 
> '{"driver":"nvme","device":":02:00.0","namespace":1,"node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}'
>  
> \
> -blockdev 
> '{"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"}'
>  
> \
> -device 
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=libvirt-1-format,id=virtio-disk0
>  
> \
> -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=34 \
> -device 
> virtio-net-pci,host_mtu=9000,netdev=hostnet0,id=net0,mac=52:54:00:a4:6f:91,bus=pci.0,addr=0x3
>  
> \
> -chardev pty,id=charserial0 \
> -device isa-serial,chardev=charserial0,id=serial0 \
> -chardev socket,id=charchannel0,fd=35,server,nowait \
> -device 
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
>  
> \
> -spice port=5900,addr=0.0.0.0,disable-ticketing,seamless-migration=on \
> -device virtio-vga,id=video0,virgl=on,max_outputs=1,bus=pci.0,addr=0x2 \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \
> -sandbox 
> on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> -msg timestamp=on
> 
> And these are the errors I see:
> 
> 2020-02-14T09:06:18.183167Z qemu-system-x86_64: VFIO_MAP_DMA failed: 
> Invalid argument
> 2020-02-14T09:10:49.753767Z qemu-system-x86_64: VFIO_MAP_DMA failed: 
> Cannot allocate memory
> 2020-02-14T09:11:04.530344Z qemu-system-x86_64: VFIO_MAP_DMA failed: No 
> space left on device
> 2020-02-14T09:11:04.531087Z qemu-system-x86_64: VFIO_MAP_DMA failed: No 
> space left on device
> 2020-02-14T09:11:04.531230Z qemu-system-x86_64: VFIO_MAP_DMA failed: No 
> space left on device

I feel obligated to reply given the VFIO issues, but TBH I've never
used the nvme-vfio driver and don't consider myself the maintainer of
the vfio-helpers.  I'd guess the latter ENOSPC errors indicate we've
exhausted the number of separate DMA mappings that a vfio user is
allowed to create.  We had to put a cap on this to prevent malicious
users and it was set to 64K, which seemed like orders of magnitude more
than we use for device assignment, but perhaps not enough for this
driver.  There's a dma_entry_limit module option on the
vfio_iommu_type1 module that can be bumped up to see if it resolves
this issue.  The initial EINVAL and ENOMEM errors, however, might
indicate the driver has already gone into the weeds before we get to
the latter problem.
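
For example, something like the following (untested; assumes the type1
backend is built as the vfio_iommu_type1 module, and 131072 is just an
arbitrary doubling of the default):

    # modprobe vfio_iommu_type1 dma_entry_limit=131072

or the equivalent "options vfio_iommu_type1 dma_entry_limit=131072" line
in /etc/modprobe.d/.  The current value can be read back from
/sys/module/vfio_iommu_type1/parameters/dma_entry_limit.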

> I'm doing nothing with the disk inside the guest, but:
> 
># dd if=/dev/vda of=/dev/null status=progress
> 
> (the disk appears as /dev/vda in the guest). Surprisingly, I do not see 
> these errors when I use the traditional PCI assignment (-device 
> vfio-pci).

Not really surprising, entirely different models of using the d

Re: [PATCH] pci: Abort if pci_add_capability fails

2022-08-30 Thread Alex Williamson
On Tue, 30 Aug 2022 13:37:35 +0200
Markus Armbruster  wrote:
>    if (!offset) {
>        offset = pci_find_space(pdev, size);
>        /* out of PCI config space is programming error */
>        assert(offset);
>    } else {
>        /* Verify that capabilities don't overlap.  Note: device assignment
>         * depends on this check to verify that the device is not broken.
>         * Should never trigger for emulated devices, but it's helpful
>         * for debugging these. */
> 
> The comment makes me suspect that device assignment of a broken device
> could trigger the error.  It goes back to
> 
> commit c9abe111209abca1b910e35c6ca9888aced5f183
> Author: Jan Kiszka 
> Date:   Wed Aug 24 14:29:30 2011 +0200
> 
> pci: Error on PCI capability collisions
> 
> Nothing good can happen when we overlap capabilities. This may happen
> when plugging in assigned devices or when devices models contain bugs.
> Detect the overlap and report it.
> 
> Based on qemu-kvm commit by Alex Williamson.
> 
> Signed-off-by: Jan Kiszka 
> Acked-by: Don Dutile 
> Signed-off-by: Michael S. Tsirkin 
> 
> If this is still correct, then your patch is a regression: QEMU is no
> longer able to gracefully handle assignment of a broken device.  Does
> this matter?  Alex, maybe?

Ok, that was a long time ago.  I have some vague memories of hitting
something like this with a Broadcom NIC, but a google search for the
error string doesn't turn up anything recently.  So there's a fair
chance this wouldn't break anyone initially.

Even back when the above patch was proposed, there were some
suggestions to turn the error path into an abort, which I pushed back
on since clearly enumerating capabilities of a device can occur due to
a hot-plug and we don't necessarily have control of the device being
added.  This is only more true with the possibility of soft-devices out
of tree, through things like vfio-user.

Personally I think the right approach is to support an error path such
that we can abort when triggered by a cold-plug device, while simply
rejecting a broken hot-plug device, but that seems to be the minority
opinion among QEMU developers afaict.  Thanks,

Alex




Re: [PATCH v3 14/16] hw/vfio/pci: Omit errp for pci_add_capability

2022-10-26 Thread Alex Williamson
On Thu, 27 Oct 2022 05:15:25 +0900
Akihiko Odaki  wrote:

> The code generating errors in pci_add_capability has a comment which
> says:
> > Verify that capabilities don't overlap.  Note: device assignment
> > depends on this check to verify that the device is not broken.
> > Should never trigger for emulated devices, but it's helpful for
> > debugging these.  
> 
> Indeed vfio has some code that passes capability offsets and sizes from
> a physical device, but it explicitly pays attention so that the
> capabilities never overlap. Therefore, in pci_add_capability(), we can
> always assert that capabilities never overlap, and that is what happens
> when omitting errp.
> 
> Signed-off-by: Akihiko Odaki 
> ---
>  hw/vfio/pci-quirks.c | 15 +++
>  hw/vfio/pci.c| 14 +-
>  2 files changed, 8 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index f0147a050a..e94fd273ea 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -1530,7 +1530,7 @@ const PropertyInfo qdev_prop_nv_gpudirect_clique = {
>  static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
>  {
>  PCIDevice *pdev = &vdev->pdev;
> -int ret, pos = 0xC8;
> +int pos = 0xC8;
>  
>  if (vdev->nv_gpudirect_clique == 0xFF) {
>  return 0;
> @@ -1547,11 +1547,7 @@ static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice 
> *vdev, Error **errp)
>  return -EINVAL;
>  }
>  
> -ret = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, 8, errp);
> -if (ret < 0) {
> -error_prepend(errp, "Failed to add NVIDIA GPUDirect cap: ");
> -return ret;
> -}
> +pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, 8);
>  
>  memset(vdev->emulated_config_bits + pos, 0xFF, 8);
>  pos += PCI_CAP_FLAGS;
> @@ -1718,12 +1714,7 @@ static int vfio_add_vmd_shadow_cap(VFIOPCIDevice 
> *vdev, Error **errp)
>  return -EFAULT;
>  }
>  
> -ret = pci_add_capability(&vdev->pdev, PCI_CAP_ID_VNDR, pos,
> - VMD_SHADOW_CAP_LEN, errp);
> -if (ret < 0) {
> -error_prepend(errp, "Failed to add VMD MEMBAR Shadow cap: ");
> -return ret;
> -}
> +pci_add_capability(&vdev->pdev, PCI_CAP_ID_VNDR, pos, 
> VMD_SHADOW_CAP_LEN);


For these, I guess we're assuming that since they're added first via
vfio_add_virt_caps() we cannot create an overlap?

>  
>  memset(vdev->emulated_config_bits + pos, 0xFF, VMD_SHADOW_CAP_LEN);
>  pos += PCI_CAP_FLAGS;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 939dcc3d4a..2b653d01e3 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1826,7 +1826,7 @@ static void vfio_add_emulated_long(VFIOPCIDevice *vdev, 
> int pos,
>  vfio_set_long_bits(vdev->emulated_config_bits + pos, mask, mask);
>  }
>  
> -static int vfio_setup_pcie_cap(VFIOPCIDevice *vdev, int pos, uint8_t size,
> +static int vfio_setup_pcie_cap(VFIOPCIDevice *vdev, uint8_t pos, uint8_t 
> size,
> Error **errp)
>  {
>  uint16_t flags;
> @@ -1943,11 +1943,7 @@ static int vfio_setup_pcie_cap(VFIOPCIDevice *vdev, 
> int pos, uint8_t size,
> 1, PCI_EXP_FLAGS_VERS);
>  }
>  
> -pos = pci_add_capability(&vdev->pdev, PCI_CAP_ID_EXP, pos, size,
> - errp);
> -if (pos < 0) {
> -return pos;
> -}
> +pos = pci_add_capability(&vdev->pdev, PCI_CAP_ID_EXP, pos, size);
>  
>  vdev->pdev.exp.exp_cap = pos;
>  
> @@ -2045,14 +2041,14 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, 
> uint8_t pos, Error **errp)
>  case PCI_CAP_ID_PM:
>  vfio_check_pm_reset(vdev, pos);
>  vdev->pm_cap = pos;
> -ret = pci_add_capability(pdev, cap_id, pos, size, errp);
> +pci_add_capability(pdev, cap_id, pos, size);
>  break;
>  case PCI_CAP_ID_AF:
>  vfio_check_af_flr(vdev, pos);
> -ret = pci_add_capability(pdev, cap_id, pos, size, errp);
> +pci_add_capability(pdev, cap_id, pos, size);
>  break;
>  default:
> -ret = pci_add_capability(pdev, cap_id, pos, size, errp);
> +pci_add_capability(pdev, cap_id, pos, size);
>  break;
>  }

And for these we've calculated a size to make sure that we don't bump
into the next capability, but how do you account for the MSI and MSI-X
cases above this chunk where vfio's overlap prevention can't be used?
Thanks,

Alex




Re: [PATCH v4 01/17] hw/vfio/pci: Ensure MSI and MSI-X do not overlap

2022-10-27 Thread Alex Williamson
On Thu, 27 Oct 2022 15:36:49 +0900
Akihiko Odaki  wrote:

> vfio_add_std_cap() is designed to ensure that capabilities do not
> overlap, but it failed to do so for MSI and MSI-X capabilities.
> 
> Ensure MSI and MSI-X capabilities do not overlap with others by omitting
> other overlapping capabilities.
> 
> Signed-off-by: Akihiko Odaki 
> ---
>  hw/vfio/pci.c | 55 +--
>  hw/vfio/pci.h |  3 +++
>  2 files changed, 48 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 939dcc3d4a..8a4995cd68 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1278,23 +1278,42 @@ static void vfio_disable_interrupts(VFIOPCIDevice 
> *vdev)
>  }
>  }
>  
> -static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> +static void vfio_msi_early_setup(VFIOPCIDevice *vdev, Error **errp)
>  {
>  uint16_t ctrl;
> -bool msi_64bit, msi_maskbit;
> -int ret, entries;
> -Error *err = NULL;
> +uint8_t pos;
> +
> +pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX);


PCI_CAP_ID_MSIX???  Is this tested with MSI?


> +if (!pos) {
> +return;
> +}
>  
>  if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
>vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
>  error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
> -return -errno;
> +return;
>  }
> -ctrl = le16_to_cpu(ctrl);
> +vdev->msi_pos = pos;
> +vdev->msi_ctrl = le16_to_cpu(ctrl);
> +
> +vdev->msi_cap_size = 0xa;
> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT)) {
> +vdev->msi_cap_size += 0xa;
> +}
> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT)) {
> +vdev->msi_cap_size += 0x4;
> +}
> +}
>  
> -msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
> -msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
> -entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
> +static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> +{
> +bool msi_64bit, msi_maskbit;
> +int ret, entries;
> +Error *err = NULL;
> +
> +msi_64bit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT);
> +msi_maskbit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT);
> +entries = 1 << ((vdev->msi_ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
>  
>  trace_vfio_msi_setup(vdev->vbasedev.name, pos);
>  
> @@ -1306,7 +1325,6 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, 
> Error **errp)
>  error_propagate_prepend(errp, err, "msi_init failed: ");
>  return ret;
>  }
> -vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 
> 0);
>  
>  return 0;
>  }
> @@ -1524,6 +1542,7 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, 
> Error **errp)
>  pba = le32_to_cpu(pba);
>  
>  msix = g_malloc0(sizeof(*msix));
> +msix->pos = pos;
>  msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
>  msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
>  msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
> @@ -2025,6 +2044,16 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, 
> uint8_t pos, Error **errp)
>  }
>  }
>  
> +if (cap_id != PCI_CAP_ID_MSI &&
> +pos >= vdev->msi_pos && pos < vdev->msi_pos + vdev->msi_cap_size) {
> +return 0;
> +}
> +
> +if (cap_id != PCI_CAP_ID_MSIX && vdev->msix &&
> +pos >= vdev->msix->pos && pos < vdev->msix->pos + MSIX_CAP_LENGTH) {
> +return 0;
> +}
> +

These only test a specific kind of overlap; why not use
ranges_overlap()?
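E.g., as an untested sketch (assuming the length of the capability being
placed is available here as "size", and leaving the error reporting
question below aside):

    if (cap_id != PCI_CAP_ID_MSI && vdev->msi_cap_size &&
        ranges_overlap(pos, size, vdev->msi_pos, vdev->msi_cap_size)) {
        return 0;
    }

    if (cap_id != PCI_CAP_ID_MSIX && vdev->msix &&
        ranges_overlap(pos, size, vdev->msix->pos, MSIX_CAP_LENGTH)) {
        return 0;
    }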

We also need to sanity test vdev->msi_pos, or are you just letting
msi_pos = 0, msi_cap_size = 0 fall through since we cannot overlap?

Shouldn't this also jump to reporting the error rather than silently
dropping a capability?  Thanks,

Alex

>  /* Scale down size, esp in case virt caps were added above */
>  size = MIN(size, vfio_std_cap_max_size(pdev, pos));
>  
> @@ -3037,6 +3066,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>  vfio_bars_prepare(vdev);
>  
> +vfio_msi_early_setup(vdev, &err);
> +if (err) {
> +error_propagate(errp, err);
> +goto error;
> +}
> +
>  vfio_msix_early_setup(vdev, &err);
>  if (err) {
>  error_propagate(errp, err);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 7c236a52f4..9ae0278058 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -107,6 +107,7 @@ enum {
>  
>  /* Cache of MSI-X setup */
>  typedef struct VFIOMSIXInfo {
> +uint8_t pos;
>  uint8_t table_bar;
>  uint8_t pba_bar;
>  uint16_t entries;
> @@ -128,6 +129,8 @@ struct VFIOPCIDevice {
>  unsigned int rom_size;
>  off_t rom_offset; /* Offset of ROM region within device fd */
>  void *rom;
> +uint8_t msi_pos;
> +uint16_t msi_ctrl;
>  int msi_cap_size;
>  VFIOMSIVector *msi_vectors;
>  VFIOMSIXInfo *msix;




Re: [PATCH v5 01/17] hw/vfio/pci: Ensure MSI and MSI-X do not overlap

2022-10-28 Thread Alex Williamson
On Fri, 28 Oct 2022 21:26:13 +0900
Akihiko Odaki  wrote:

> vfio_add_std_cap() is designed to ensure that capabilities do not
> overlap, but it failed to do so for MSI and MSI-X capabilities.
> 
> Ensure MSI and MSI-X capabilities do not overlap with others by omitting
> other overlapping capabilities.
> 
> Signed-off-by: Akihiko Odaki 
> ---
>  hw/vfio/pci.c | 63 +++
>  hw/vfio/pci.h |  3 +++
>  2 files changed, 56 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 939dcc3d4a..36c8f3dc85 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1278,23 +1278,42 @@ static void vfio_disable_interrupts(VFIOPCIDevice 
> *vdev)
>  }
>  }
>  
> -static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> +static void vfio_msi_early_setup(VFIOPCIDevice *vdev, Error **errp)
>  {
>  uint16_t ctrl;
> -bool msi_64bit, msi_maskbit;
> -int ret, entries;
> -Error *err = NULL;
> +uint8_t pos;
> +
> +pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
> +if (!pos) {
> +return;
> +}
>  
>  if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
>vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
>  error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
> -return -errno;
> +return;
>  }
> -ctrl = le16_to_cpu(ctrl);
> +vdev->msi_pos = pos;
> +vdev->msi_ctrl = le16_to_cpu(ctrl);
>  
> -msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
> -msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
> -entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
> +vdev->msi_cap_size = 0xa;
> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT)) {
> +vdev->msi_cap_size += 0xa;
> +}
> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT)) {
> +vdev->msi_cap_size += 0x4;
> +}
> +}
> +
> +static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> +{
> +bool msi_64bit, msi_maskbit;
> +int ret, entries;
> +Error *err = NULL;
> +
> +msi_64bit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT);
> +msi_maskbit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT);
> +entries = 1 << ((vdev->msi_ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
>  
>  trace_vfio_msi_setup(vdev->vbasedev.name, pos);
>  
> @@ -1306,7 +1325,6 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, 
> Error **errp)
>  error_propagate_prepend(errp, err, "msi_init failed: ");
>  return ret;
>  }
> -vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 
> 0);
>  
>  return 0;
>  }
> @@ -1524,6 +1542,7 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, 
> Error **errp)
>  pba = le32_to_cpu(pba);
>  
>  msix = g_malloc0(sizeof(*msix));
> +msix->pos = pos;
>  msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
>  msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
>  msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
> @@ -2025,6 +2044,24 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, 
> uint8_t pos, Error **errp)
>  }
>  }
>  
> +if (cap_id != PCI_CAP_ID_MSI &&
> +range_covers_byte(vdev->msi_pos, vdev->msi_cap_size, pos)) {
> +warn_report(VFIO_MSG_PREFIX
> +"A capability overlaps with MSI, ignoring (%" PRIu8 " @ 
> %" PRIu8 " in [%" PRIu8 ", %" PRIu8 "))",
> +vdev->vbasedev.name, cap_id, pos,
> +vdev->msi_pos, vdev->msi_pos + vdev->msi_cap_size);
> +return 0;
> +}
> +
> +if (cap_id != PCI_CAP_ID_MSIX && vdev->msix &&
> +range_covers_byte(vdev->msix->pos, MSIX_CAP_LENGTH, pos)) {
> +warn_report(VFIO_MSG_PREFIX
> +"A capability overlaps with MSI-X, ignoring (%" PRIu8 " 
> @ %" PRIu8 " in [%" PRIu8 ", %" PRIu8 "))",
> +vdev->vbasedev.name, cap_id, pos,
> +vdev->msix->pos, vdev->msix->pos + MSIX_CAP_LENGTH);
> +return 0;
> +}

Capabilities are not a single byte; the fact that a capability doesn't
start within the MSI or MSI-X capability is not a sufficient test.
We're also choosing to prioritize the MSI and MSI-X capabilities by
protecting their ranges, rather than keeping the existing behavior where
we'd drop those capabilities if they overlap with another capability
that has already been placed.  There are merits to both approaches, but
I don't see any justification here to change the current behavior.

Isn't the behavior most similar to the existing one to pass the
available size to vfio_msi[x]_setup() and return an errno if the size
would be exceeded?  Something like below (untested, and requires exporting
msi_cap_sizeof()).  Thanks,

Alex

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 939dcc3d4a9e..485f9bc5102d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1278,11 +1278,13 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev)
 }
 }
 
-static int vfio_msi_setup(VFIOPCIDevice *v

Re: [PATCH v5 01/17] hw/vfio/pci: Ensure MSI and MSI-X do not overlap

2022-10-28 Thread Alex Williamson
On Sat, 29 Oct 2022 01:12:11 +0900
Akihiko Odaki  wrote:

> On 2022/10/28 23:16, Alex Williamson wrote:
> > On Fri, 28 Oct 2022 21:26:13 +0900
> > Akihiko Odaki  wrote:
> >   
> >> vfio_add_std_cap() is designed to ensure that capabilities do not
> >> overlap, but it failed to do so for MSI and MSI-X capabilities.
> >>
> >> Ensure MSI and MSI-X capabilities do not overlap with others by omitting
> >> other overlapping capabilities.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> ---
> >>   hw/vfio/pci.c | 63 +++
> >>   hw/vfio/pci.h |  3 +++
> >>   2 files changed, 56 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 939dcc3d4a..36c8f3dc85 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -1278,23 +1278,42 @@ static void vfio_disable_interrupts(VFIOPCIDevice 
> >> *vdev)
> >>   }
> >>   }
> >>   
> >> -static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >> +static void vfio_msi_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >>   {
> >>   uint16_t ctrl;
> >> -bool msi_64bit, msi_maskbit;
> >> -int ret, entries;
> >> -Error *err = NULL;
> >> +uint8_t pos;
> >> +
> >> +pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
> >> +if (!pos) {
> >> +return;
> >> +}
> >>   
> >>   if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
> >> vdev->config_offset + pos + PCI_CAP_FLAGS) != 
> >> sizeof(ctrl)) {
> >>   error_setg_errno(errp, errno, "failed reading MSI 
> >> PCI_CAP_FLAGS");
> >> -return -errno;
> >> +return;
> >>   }
> >> -ctrl = le16_to_cpu(ctrl);
> >> +vdev->msi_pos = pos;
> >> +vdev->msi_ctrl = le16_to_cpu(ctrl);
> >>   
> >> -msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
> >> -msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
> >> -entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
> >> +vdev->msi_cap_size = 0xa;
> >> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT)) {
> >> +vdev->msi_cap_size += 0xa;
> >> +}
> >> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT)) {
> >> +vdev->msi_cap_size += 0x4;
> >> +}
> >> +}
> >> +
> >> +static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >> +{
> >> +bool msi_64bit, msi_maskbit;
> >> +int ret, entries;
> >> +Error *err = NULL;
> >> +
> >> +msi_64bit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT);
> >> +msi_maskbit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT);
> >> +entries = 1 << ((vdev->msi_ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
> >>   
> >>   trace_vfio_msi_setup(vdev->vbasedev.name, pos);
> >>   
> >> @@ -1306,7 +1325,6 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int 
> >> pos, Error **errp)
> >>   error_propagate_prepend(errp, err, "msi_init failed: ");
> >>   return ret;
> >>   }
> >> -vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 
> >> : 0);
> >>   
> >>   return 0;
> >>   }
> >> @@ -1524,6 +1542,7 @@ static void vfio_msix_early_setup(VFIOPCIDevice 
> >> *vdev, Error **errp)
> >>   pba = le32_to_cpu(pba);
> >>   
> >>   msix = g_malloc0(sizeof(*msix));
> >> +msix->pos = pos;
> >>   msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
> >>   msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
> >>   msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
> >> @@ -2025,6 +2044,24 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, 
> >> uint8_t pos, Error **errp)
> >>   }
> >>   }
> >>   
> >> +if (cap_id != PCI_CAP_ID_MSI &&
> >> +range_covers_byte(vdev->msi_pos, vdev->msi_cap_size, pos)) {
> >> +warn_report(VFIO_MSG_PREFIX
> >> +"A capability overlaps with MSI, ignoring (%" PRIu8 " 
> >> @ %" PRIu8 " in [%" PRIu8 ", %" PRIu8

Re: [PATCH v6 01/17] hw/vfio/pci: Ensure MSI and MSI-X do not overlap

2022-10-31 Thread Alex Williamson
On Mon, 31 Oct 2022 21:33:03 +0900
Akihiko Odaki  wrote:

> pci_add_capability() checks whether capabilities overlap, and notifies
> its caller so that it can properly handle the case. However, in the
> most cases, the capabilities actually never overlap, and the interface
> incurred extra error handling code, which is often incorrect or
> suboptimal. For such cases, pci_add_capability() can simply abort the
> execution if the capabilities actually overlap since it should be a
> programming error.
> 
> This change handles the other cases: hw/vfio/pci depends on the check to
> decide MSI and MSI-X capabilities overlap with another. As they are
> quite an exceptional and hw/vfio/pci knows much about PCI capabilities,
> adding code specific to the cases to hw/vfio/pci still results in less
> code than having error handling code everywhere in total.
> 
> Signed-off-by: Akihiko Odaki 
> ---
>  hw/vfio/pci.c | 61 ++-
>  hw/vfio/pci.h |  3 +++
>  2 files changed, 54 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 939dcc3d4a..c7e3ef95a7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1278,23 +1278,42 @@ static void vfio_disable_interrupts(VFIOPCIDevice 
> *vdev)
>  }
>  }
>  
> -static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> +static void vfio_msi_early_setup(VFIOPCIDevice *vdev, Error **errp)
>  {
>  uint16_t ctrl;
> -bool msi_64bit, msi_maskbit;
> -int ret, entries;
> -Error *err = NULL;
> +uint8_t pos;
> +
> +pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
> +if (!pos) {
> +return;
> +}
>  
>  if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
>vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
>  error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
> -return -errno;
> +return;
>  }
> -ctrl = le16_to_cpu(ctrl);
> +vdev->msi_pos = pos;
> +vdev->msi_ctrl = le16_to_cpu(ctrl);
>  
> -msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
> -msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
> -entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
> +vdev->msi_cap_size = 0xa;
> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT)) {
> +vdev->msi_cap_size += 0xa;
> +}
> +if ((vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT)) {
> +vdev->msi_cap_size += 0x4;
> +}
> +}
> +
> +static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> +{
> +bool msi_64bit, msi_maskbit;
> +int ret, entries;
> +Error *err = NULL;
> +
> +msi_64bit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_64BIT);
> +msi_maskbit = !!(vdev->msi_ctrl & PCI_MSI_FLAGS_MASKBIT);
> +entries = 1 << ((vdev->msi_ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
>  
>  trace_vfio_msi_setup(vdev->vbasedev.name, pos);
>  
> @@ -1306,7 +1325,6 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, 
> Error **errp)
>  error_propagate_prepend(errp, err, "msi_init failed: ");
>  return ret;
>  }
> -vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 
> 0);
>  
>  return 0;
>  }
> @@ -1524,6 +1542,7 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, 
> Error **errp)
>  pba = le32_to_cpu(pba);
>  
>  msix = g_malloc0(sizeof(*msix));
> +msix->pos = pos;
>  msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
>  msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
>  msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
> @@ -2025,6 +2044,22 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, 
> uint8_t pos, Error **errp)
>  }
>  }
>  
> +if (cap_id != PCI_CAP_ID_MSI &&
> +range_covers_byte(vdev->msi_pos, vdev->msi_cap_size, pos)) {
> +error_setg(errp,
> +"A capability overlaps with MSI (%" PRIu8 " in [%" PRIu8 ", %" 
> PRIu8 "))",
> +pos, vdev->msi_pos, vdev->msi_pos + vdev->msi_cap_size);
> +return -EINVAL;
> +}
> +
> +if (cap_id != PCI_CAP_ID_MSIX && vdev->msix &&
> +range_covers_byte(vdev->msix->pos, MSIX_CAP_LENGTH, pos)) {
> +error_setg(errp,
> +"A capability overlaps with MSI-X (%" PRIu8 " in [%" PRIu8 ", %" 
> PRIu8 "))",
> +pos, vdev->msix->pos, vdev->msix->pos + MSIX_CAP_LENGTH);
> +return -EINVAL;
> +}

I provided an example of how the existing vfio_msi[x]_setup() can be
trivially extended to perform the necessary size checking in place of
pci_add_capability() without special cases and additional error paths
as done here.  Please adopt the approach I suggested.  Thanks,

Alex

> +
>  /* Scale down size, esp in case virt caps were added above */
>  size = MIN(size, vfio_std_cap_max_size(pdev, pos));
>  
> @@ -3037,6 +3072,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>  vfio_bars_prepare(vdev);
>  
> +vfio_msi_early_setup(vdev, &err);
> +if (err) {
> 

Re: [PATCH 1/2] Allow returning EventNotifier's wfd

2022-03-02 Thread Alex Williamson
On Wed,  2 Mar 2022 12:36:43 +0100
Sergio Lopez  wrote:

> event_notifier_get_fd(const EventNotifier *e) always returns
> EventNotifier's read file descriptor (rfd). This is not a problem when
> the EventNotifier is backed by an eventfd, as a single file
> descriptor is used both for reading and triggering events (rfd ==
> wfd).
> 
> But, when EventNotifier is backed by a pipefd, we have two file
> descriptors, one that can only be used for reads (rfd), and the other
> only for writes (wfd).
> 
> There's, at least, one known situation in which we need to obtain wfd
> instead of rfd, which is when setting up the file that's going to be
> sent to the peer in vhost's SET_VRING_CALL.
> 
> Extend event_notifier_get_fd() to receive an argument which indicates
> whether the caller wants to obtain rfd (false) or wfd (true).

There are about 50 places where we'd add the false arg here and only
one where we'd use true.  It seems it would save a lot of churn to hide
this internally: event_notifier_get_fd() keeps returning the rfd, and a
new event_notifier_get_wfd() returns the wfd.
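
A minimal sketch of the sort of thing I mean (untested; this is just the
existing rfd/wfd fields exposed through a second accessor):

    int event_notifier_get_fd(const EventNotifier *e)
    {
        return e->rfd;
    }

    int event_notifier_get_wfd(const EventNotifier *e)
    {
        return e->wfd;
    }

Thanks,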

Alex




Re: [PATCH 1/2] Allow returning EventNotifier's wfd

2022-03-02 Thread Alex Williamson
On Wed, 2 Mar 2022 16:23:42 +0100
Sergio Lopez  wrote:

> On Wed, Mar 02, 2022 at 08:12:34AM -0700, Alex Williamson wrote:
> > On Wed,  2 Mar 2022 12:36:43 +0100
> > Sergio Lopez  wrote:
> >   
> > > event_notifier_get_fd(const EventNotifier *e) always returns
> > > EventNotifier's read file descriptor (rfd). This is not a problem when
> > > the EventNotifier is backed by an eventfd, as a single file
> > > descriptor is used both for reading and triggering events (rfd ==
> > > wfd).
> > > 
> > > But, when EventNotifier is backed by a pipefd, we have two file
> > > descriptors, one that can only be used for reads (rfd), and the other
> > > only for writes (wfd).
> > > 
> > > There's, at least, one known situation in which we need to obtain wfd
> > > instead of rfd, which is when setting up the file that's going to be
> > > sent to the peer in vhost's SET_VRING_CALL.
> > > 
> > > Extend event_notifier_get_fd() to receive an argument which indicates
> > > whether the caller wants to obtain rfd (false) or wfd (true).  
> > 
> > There are about 50 places where we add the false arg here and 1 where
> > we use true.  Seems it would save a lot of churn to hide this
> > internally, event_notifier_get_fd() returns an rfd, a new
> > event_notifier_get_wfd() returns the wfd.  Thanks,  
> 
> I agree. In fact, that's what I implemented in the first place. I
> changed to this version in which event_notifier_get_fd() is extended
> because it feels more "correct". But yes, the pragmatic option would
> be adding a new event_notifier_get_wfd().
> 
> I'll wait for more reviews, and unless someone voices against it, I'll
> respin the patches with that strategy (I already have it around here).

I'd argue that adding a bool argument to a function to change what it
returns is sufficiently non-intuitive to program against that the
wrapper method is actually more correct.  event_notifier_get_fd()
essentially becomes a shorthand for event_notifier_get_rfd().  Thanks,

Alex




Re: [PATCH 02/12] hw/vfio/pci: Replace sprintf() by g_strdup_printf()

2024-04-12 Thread Alex Williamson
On Wed, 10 Apr 2024 18:06:03 +0200
Philippe Mathieu-Daudé  wrote:

> sprintf() is deprecated on Darwin since macOS 13.0 / Xcode 14.1,
> resulting in a painful developer experience. Use g_strdup_printf()
> instead.

Isn't this code only compiled for Linux hosts?  Maybe still a valid
change, but the rationale seems irrelevant.  Thanks,

Alex

> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/vfio/pci.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 64780d1b79..cc3cc89122 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2442,10 +2442,9 @@ void vfio_pci_post_reset(VFIOPCIDevice *vdev)
>  
>  bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
>  {
> -char tmp[13];
> -
> -sprintf(tmp, "%04x:%02x:%02x.%1x", addr->domain,
> -addr->bus, addr->slot, addr->function);
> +g_autofree char *tmp = g_strdup_printf("%04x:%02x:%02x.%1x",
> +   addr->domain, addr->bus,
> +   addr->slot, addr->function);
>  
>  return (strcmp(tmp, name) == 0);
>  }




Re: [PATCH v10 11/12] hw/pci: Convert rom_bar into OnOffAuto

2024-07-03 Thread Alex Williamson
On Wed, 3 Jul 2024 13:00:21 +0200 (CEST)
BALATON Zoltan  wrote:

> On Wed, 3 Jul 2024, Michael S. Tsirkin wrote:
> > On Wed, Jul 03, 2024 at 04:15:23AM +0200, BALATON Zoltan wrote:  
> >> On Tue, 2 Jul 2024, Michael S. Tsirkin wrote:  
> >>> On Thu, Jun 27, 2024 at 03:08:00PM +0900, Akihiko Odaki wrote:  
>  rom_bar is tristate but was defined as uint32_t so convert it into
>  OnOffAuto.
> 
>  Signed-off-by: Akihiko Odaki   
> >>>
> >>> Commit log should explain why this is an improvement,
> >>> not just what's done.
> >>>
> >>>  
>  diff --git a/docs/igd-assign.txt b/docs/igd-assign.txt
>  index e17bb50789ad..35c6c8e28493 100644
>  --- a/docs/igd-assign.txt
>  +++ b/docs/igd-assign.txt
>  @@ -35,7 +35,7 @@ IGD has two different modes for assignment using 
>  vfio-pci:
> ISA/LPC bridge device (vfio-pci-igd-lpc-bridge) on the root bus at
> PCI address 1f.0.
>   * The IGD device must have a VGA ROM, either provided via the 
>  romfile
>  -  option or loaded automatically through vfio (standard).  rombar=0
>  +  option or loaded automatically through vfio (standard).  
>  rombar=off
> will disable legacy mode support.
>   * Hotplug of the IGD device is not supported.
>   * The IGD device must be a SandyBridge or newer model device.  
> >>>
> >>> ...
> >>>  
>  diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>  index 39dae72497e0..0e920ed0691a 100644
>  --- a/hw/vfio/pci-quirks.c
>  +++ b/hw/vfio/pci-quirks.c
>  @@ -33,7 +33,7 @@
>    * execution as noticed with the BCM 57810 card for lack of a
>    * more better way to handle such issues.
>    * The  user can still override by specifying a romfile or
>  - * rombar=1.
>  + * rombar=on.
>    * Please see https://bugs.launchpad.net/qemu/+bug/1284874
>    * for an analysis of the 57810 card hang. When adding
>    * a new vendor id/device id combination below, please also add  
> >>>
> >>>
> >>> So we are apparently breaking a bunch of users who followed
> >>> documentation to the dot. Why is this a good idea?  
> >>
> >> On/off is clearer than 1/0. But isn't 1/0 a synonym for on/off so previous
> >> command lines would still work?
> >>
> >> Regards,
> >> BALATON Zoltan  
> >
> > I see nothing in code that would make it so:
> >
> >
> > const QEnumLookup OnOffAuto_lookup = {
> >.array = (const char *const[]) {
> >[ON_OFF_AUTO_AUTO] = "auto",
> >[ON_OFF_AUTO_ON] = "on",
> >[ON_OFF_AUTO_OFF] = "off",
> >},
> >.size = ON_OFF_AUTO__MAX
> > };
> >
> > I also tried with an existing property:
> >
> > $ ./qemu-system-x86_64 -device intel-hda,msi=0
> > qemu-system-x86_64: -device intel-hda,msi=0: Parameter 'msi' does not 
> > accept value '0'  
> 
> Then it was probably bit properties that also accept 0/1, on/off, 
> true/false. Maybe similar aliases could be added to on/off/auto?
> 
> In any case, when I first saw rombar I thought it would set the BAR of 
> the ROM, so I wondered why it's 1 and not 5 or 6 or an offset. So on/off 
> is clearer in this case.

There's only one PCI-spec-defined offset for the ROM BAR.  Yes, the
option could be clearer, but relocating the ROM to a different regular
BAR offset is invalid.  Thanks,

Alex




Re: [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2022-11-15 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:10 +0200
Avihai Horon  wrote:

> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked. This is because a DMA-able VFIO device
> can dirty RAM pages without updating QEMU about it, thus breaking the
> migration.
> 
> However, this doesn't mean that migration can't be done at all.
> In such case, allow migration and let QEMU VFIO code mark the entire
> bitmap dirty.
> 
> This guarantees that all pages that might have gotten dirty are reported
> back, and thus guarantees a valid migration even without VFIO IOMMU
> dirty tracking support.
> 
> The motivation for this patch is the future introduction of iommufd [1].
> iommufd will directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into its internal ops, allowing the usage of these IOCTLs
> over iommufd. However, VFIO IOMMU dirty tracking will not be supported
> by this VFIO compatibility API.
> 
> This patch will allow migration by hosts that use the VFIO compatibility
> API and prevent migration regressions caused by the lack of VFIO IOMMU
> dirty tracking support.
> 
> [1] https://lore.kernel.org/kvm/0-v2-f9436d0bde78+4bb-iommufd_...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c| 84 +
>  hw/vfio/migration.c |  3 +-
>  2 files changed, 70 insertions(+), 17 deletions(-)

This duplicates quite a bit of code; I think we can integrate it into a
common flow quite a bit more.  See below, only compile tested.  Thanks,

Alex

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b5d8c0bf694..4117b40fd9b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -397,17 +397,33 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
  IOMMUTLBEntry *iotlb)
 {
 struct vfio_iommu_type1_dma_unmap *unmap;
-struct vfio_bitmap *bitmap;
+struct vfio_bitmap *vbitmap;
+unsigned long *bitmap;
+uint64_t bitmap_size;
 uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
 int ret;
 
-unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+unmap = g_malloc0(sizeof(*unmap) + sizeof(*vbitmap));
 
-unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+unmap->argsz = sizeof(*unmap);
 unmap->iova = iova;
 unmap->size = size;
-unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-bitmap = (struct vfio_bitmap *)&unmap->data;
+
+bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+  BITS_PER_BYTE;
+bitmap = g_try_malloc0(bitmap_size);
+if (!bitmap) {
+ret = -ENOMEM;
+goto unmap_exit;
+}
+
+if (!container->dirty_pages_supported) {
+bitmap_set(bitmap, 0, pages);
+goto do_unmap;
+}
+
+unmap->argsz += sizeof(*vbitmap);
+unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
 
 /*
  * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
@@ -415,33 +431,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
  * to qemu_real_host_page_size.
  */
 
-bitmap->pgsize = qemu_real_host_page_size();
-bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-   BITS_PER_BYTE;
+vbitmap = (struct vfio_bitmap *)&unmap->data;
+vbitmap->data = (__u64 *)bitmap;
+vbitmap->pgsize = qemu_real_host_page_size();
+vbitmap->size = bitmap_size;
 
-if (bitmap->size > container->max_dirty_bitmap_size) {
-error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
- (uint64_t)bitmap->size);
+if (bitmap_size > container->max_dirty_bitmap_size) {
+error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, bitmap_size);
 ret = -E2BIG;
 goto unmap_exit;
 }
 
-bitmap->data = g_try_malloc0(bitmap->size);
-if (!bitmap->data) {
-ret = -ENOMEM;
-goto unmap_exit;
-}
-
+do_unmap:
 ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
 if (!ret) {
-cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-iotlb->translated_addr, pages);
+cpu_physical_memory_set_dirty_lebitmap(bitmap, iotlb->translated_addr,
+   pages);
 } else {
 error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
 }
 
-g_free(bitmap->data);
 unmap_exit:
+g_free(bitmap);
 g_free(unmap);
 return ret;
 }
@@ -460,8 +471,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .size = size,
 };
 
-if (iotlb && container->dirty_pages_supported &&
-vfio_devices_all_running_and_saving(container)) {
+if (iotlb && vfio_devices_all_running_and_saving(container)) {
 return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
 }
 
@@ -1257,6 +1267,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer 
*container, bool start)
 .argsz = sizeof(dirty),
 }

Re: [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init()

2022-11-15 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:13 +0200
Avihai Horon  wrote:

> Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
> vfio_migration_init(). This logic is specific to v1 protocol and moving
> it will make it easier to add the v2 protocol implementation later.
> No functional changes intended.
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/migration.c  | 30 +++---
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 99ffb75782..0e3a950746 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -785,14 +785,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>  vbasedev->migration = NULL;
>  }
>  
> -static int vfio_migration_init(VFIODevice *vbasedev,
> -   struct vfio_region_info *info)
> +static int vfio_migration_init(VFIODevice *vbasedev)
>  {
>  int ret;
>  Object *obj;
>  VFIOMigration *migration;
>  char id[256] = "";
>  g_autofree char *path = NULL, *oid = NULL;
> +struct vfio_region_info *info = NULL;

Nit, I'm not spotting any cases where we need this initialization.  The
same is not true in the code the info handling was extracted from.
Thanks,

Alex

>  
>  if (!vbasedev->ops->vfio_get_object) {
>  return -EINVAL;
> @@ -803,6 +803,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  return -EINVAL;
>  }
>  
> +ret = vfio_get_dev_region_info(vbasedev,
> +   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> +   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> +   &info);
> +if (ret) {
> +return ret;
> +}
> +
>  vbasedev->migration = g_new0(VFIOMigration, 1);
>  vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
>  vbasedev->migration->vm_running = runstate_is_running();
> @@ -822,6 +830,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  goto err;
>  }
>  
> +g_free(info);
> +
>  migration = vbasedev->migration;
>  migration->vbasedev = vbasedev;
>  
> @@ -844,6 +854,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  return 0;
>  
>  err:
> +g_free(info);
>  vfio_migration_exit(vbasedev);
>  return ret;
>  }
> @@ -857,34 +868,23 @@ int64_t vfio_mig_bytes_transferred(void)
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
> -struct vfio_region_info *info = NULL;
>  int ret = -ENOTSUP;
>  
>  if (!vbasedev->enable_migration) {
>  goto add_blocker;
>  }
>  
> -ret = vfio_get_dev_region_info(vbasedev,
> -   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> -   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> -   &info);
> +ret = vfio_migration_init(vbasedev);
>  if (ret) {
>  goto add_blocker;
>  }
>  
> -ret = vfio_migration_init(vbasedev, info);
> -if (ret) {
> -goto add_blocker;
> -}
> -
> -trace_vfio_migration_probe(vbasedev->name, info->index);
> -g_free(info);
> +trace_vfio_migration_probe(vbasedev->name);
>  return 0;
>  
>  add_blocker:
>  error_setg(&vbasedev->migration_blocker,
> "VFIO device doesn't support migration");
> -g_free(info);
>  
>  ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
>  if (ret < 0) {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index a21cbd2a56..27c059f96e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) 
> "%ux%u"
>  vfio_display_edid_write_error(void) ""
>  
>  # migration.c
> -vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_probe(const char *name) " (%s)"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, 
> uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) 
> state %s"




Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-16 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:15 +0200
Avihai Horon  wrote:

> Add implementation of VFIO migration protocol v2. The two protocols, v1
> and v2, will co-exist and in next patch v1 protocol will be removed.
> 
> There are several main differences between v1 and v2 protocols:
> - VFIO device state is now represented as a finite state machine instead
>   of a bitmap.
> 
> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>   ioctl and normal read() and write() instead of the migration region.
> 
> - VFIO migration protocol v2 currently doesn't support the pre-copy
>   phase of migration.
> 
> Detailed information about VFIO migration protocol v2 and difference
> compared to v1 can be found here [1].
> 
> [1]
> https://lore.kernel.org/all/20220224142024.147653-10-yish...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c  |  19 +-
>  hw/vfio/migration.c   | 386 ++
>  hw/vfio/trace-events  |   4 +
>  include/hw/vfio/vfio-common.h |   5 +
>  4 files changed, 375 insertions(+), 39 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 617e6cd901..0bdbd1586b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,10 +355,18 @@ static bool 
> vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  return false;
>  }
>  
> -if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
> +if (!migration->v2 &&
> +(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
>  (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) 
> {
>  return false;
>  }
> +
> +if (migration->v2 &&
> +(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> +return false;
> +}
>  }
>  }
>  return true;
> @@ -385,7 +393,14 @@ static bool 
> vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>  return false;
>  }
>  
> -if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +if (!migration->v2 &&
> +migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +continue;
> +}
> +
> +if (migration->v2 &&
> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>  continue;
>  } else {
>  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e784374453..62afc23a8c 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -44,8 +44,84 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
>  
> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)

Add comment explaining heuristic of this size.

> +
>  static int64_t bytes_transferred;
>  
> +static const char *mig_state_to_str(enum vfio_device_mig_state state)
> +{
> +switch (state) {
> +case VFIO_DEVICE_STATE_ERROR:
> +return "ERROR";
> +case VFIO_DEVICE_STATE_STOP:
> +return "STOP";
> +case VFIO_DEVICE_STATE_RUNNING:
> +return "RUNNING";
> +case VFIO_DEVICE_STATE_STOP_COPY:
> +return "STOP_COPY";
> +case VFIO_DEVICE_STATE_RESUMING:
> +return "RESUMING";
> +case VFIO_DEVICE_STATE_RUNNING_P2P:
> +return "RUNNING_P2P";
> +default:
> +return "UNKNOWN STATE";
> +}
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev,
> +enum vfio_device_mig_state new_state,
> +enum vfio_device_mig_state recover_state)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> +  sizeof(struct vfio_device_feature_mig_state),
> +  sizeof(uint64_t))] = {};
> +struct vfio_device_feature *feature = (void *)buf;
> +struct vfio_device_feature_mig_state *mig_state = (void *)feature->data;

We can cast to the actual types rather than void* here.
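For example, keeping the same buffer layout as above:

    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
    struct vfio_device_feature_mig_state *mig_state =
        (struct vfio_device_feature_mig_state *)feature->data;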

> +
> +feature->argsz = sizeof(buf);
> +feature->flags =
> +VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
> +mig_state->device_state = new_state;
> +if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +/* Try to set the device in some good state */
> +error_report(
> +"%s: Failed setting device state to %s, err: %s. Setting device 
> in recover state %s",
> + vbasedev->name, mig_state_t

Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails

2022-11-16 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:17 +0200
Avihai Horon  wrote:

> If vfio_migration_set_state() fails to set the device in the requested
> state it tries to put it in a recover state. If setting the device in
> the recover state fails as well, hw_error is triggered and the VM is
> aborted.
> 
> To improve user experience and avoid VM data loss, reset the device with
> VFIO_RESET_DEVICE instead of aborting the VM.
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/migration.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f8c3228314..e8068b9147 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>  
>  mig_state->device_state = recover_state;
>  if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -hw_error("%s: Failed setting device in recover state, err: %s",
> - vbasedev->name, strerror(errno));
> +error_report(
> +"%s: Failed setting device in recover state, err: %s. 
> Resetting device",
> + vbasedev->name, strerror(errno));
> +
> +if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> +hw_error("%s: Failed resetting device, err: %s", 
> vbasedev->name,
> + strerror(errno));
> +}
> +
> +migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +
> +return -1;
>  }
>  
>  migration->device_state = recover_state;

This addresses one of my comments on 12/ and should probably be rolled
in there.  Thanks,

Alex




Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-17 Thread Alex Williamson
On Thu, 17 Nov 2022 19:07:10 +0200
Avihai Horon  wrote:
> On 16/11/2022 20:29, Alex Williamson wrote:
> > On Thu, 3 Nov 2022 18:16:15 +0200
> > Avihai Horon  wrote:
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index e784374453..62afc23a8c 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -44,8 +44,84 @@
> >>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
> >>   #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
> >>
> >> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)  
> > Add comment explaining heuristic of this size.  
> 
> This is an arbitrary size we picked with mlx5 state size in mind.
> Increasing this size to higher values (128M, 1G) didn't improve 
> performance in our testing.
> 
> How about this comment:
> This is an initial value that doesn't consume much memory and provides 
> good performance.
> 
> Do you have other suggestion?

I'd lean more towards your description above, ex:

/*
 * This is an arbitrary size based on migration of mlx5 devices, where
 * the worst case total device migration size is on the order of 100s
 * of MB.  Testing with larger values, ex. 128MB and 1GB, did not show
 * a performance improvement.
 */

I think that provides sufficient information for someone who might come
later to have an understanding of the basis if they want to try to
optimize further.

> >> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> >>   return -EINVAL;
> >>   }
> >>
> >> -ret = vfio_get_dev_region_info(vbasedev,
> >> -   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >> -   
> >> VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> >> -   &info);
> >> -if (ret) {
> >> -return ret;
> >> -}
> >> +ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> >> +if (!ret) {
> >> +/* Migration v2 */
> >> +/* Basic migration functionality must be supported */
> >> +if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> >> +return -EOPNOTSUPP;
> >> +}
> >> +vbasedev->migration = g_new0(VFIOMigration, 1);
> >> +vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> >> +vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
> >> +vbasedev->migration->data_buffer =
> >> +g_malloc0(vbasedev->migration->data_buffer_size);  
> > So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
> > later addition of estimated device data size make any changes here?
> > I'd think we'd want to scale the buffer to the minimum of the reported
> > data size and some well documented heuristic for an upper bound.  
> 
> As I wrote above, increasing this size to higher values (128M, 1G) 
> didn't improve performance in our testing.
> We can always change it later on if some other heuristics are proven to 
> improve performance.

Note that I'm asking about a minimum buffer size: for example, if
hisi_acc reports only 10s of KB for an estimated device size, why would
we still allocate VFIO_MIG_DATA_BUFFER_SIZE here?  Thanks,

Alex




Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails

2022-11-17 Thread Alex Williamson
On Thu, 17 Nov 2022 19:11:47 +0200
Avihai Horon  wrote:

> On 16/11/2022 20:36, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Thu, 3 Nov 2022 18:16:17 +0200
> > Avihai Horon  wrote:
> >  
> >> If vfio_migration_set_state() fails to set the device in the requested
> >> state it tries to put it in a recover state. If setting the device in
> >> the recover state fails as well, hw_error is triggered and the VM is
> >> aborted.
> >>
> >> To improve user experience and avoid VM data loss, reset the device with
> >> VFIO_RESET_DEVICE instead of aborting the VM.
> >>
> >> Signed-off-by: Avihai Horon 
> >> ---
> >>   hw/vfio/migration.c | 14 --
> >>   1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index f8c3228314..e8068b9147 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice 
> >> *vbasedev,
> >>
> >>   mig_state->device_state = recover_state;
> >>   if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> >> -hw_error("%s: Failed setting device in recover state, err: 
> >> %s",
> >> - vbasedev->name, strerror(errno));
> >> +error_report(
> >> +"%s: Failed setting device in recover state, err: %s. 
> >> Resetting device",
> >> + vbasedev->name, strerror(errno));
> >> +
> >> +if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> >> +hw_error("%s: Failed resetting device, err: %s", 
> >> vbasedev->name,
> >> + strerror(errno));
> >> +}
> >> +
> >> +migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> >> +
> >> +return -1;
> >>   }
> >>
> >>   migration->device_state = recover_state;  
> > This addresses one of my comments on 12/ and should probably be rolled
> > in there.  
> 
> Not sure to which comment you refer to. Could you elaborate?

Hmm, I guess I thought this was in the section immediately following
where I questioned going to recovery state.  I'm still not sure why
this is a separate patch from the initial implementation of the
function in 12/ though.  Thanks,
Alex




Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-28 Thread Alex Williamson
On Thu, 24 Nov 2022 14:41:00 +0200
Avihai Horon  wrote:

> On 20/11/2022 11:34, Avihai Horon wrote:
> >
> > On 17/11/2022 19:38, Alex Williamson wrote:  
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On Thu, 17 Nov 2022 19:07:10 +0200
> >> Avihai Horon  wrote:  
> >>> On 16/11/2022 20:29, Alex Williamson wrote:  
> >>>> On Thu, 3 Nov 2022 18:16:15 +0200
> >>>> Avihai Horon  wrote:  
> >>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>>>> index e784374453..62afc23a8c 100644
> >>>>> --- a/hw/vfio/migration.c
> >>>>> +++ b/hw/vfio/migration.c
> >>>>> @@ -44,8 +44,84 @@
> >>>>>    #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL)
> >>>>>    #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xef14ULL)
> >>>>>
> >>>>> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)  
> >>>> Add comment explaining heuristic of this size.  
> >>> This is an arbitrary size we picked with mlx5 state size in mind.
> >>> Increasing this size to higher values (128M, 1G) didn't improve
> >>> performance in our testing.
> >>>
> >>> How about this comment:
> >>> This is an initial value that doesn't consume much memory and provides
> >>> good performance.
> >>>
> >>> Do you have other suggestion?  
> >> I'd lean more towards your description above, ex:
> >>
> >> /*
> >>   * This is an arbitrary size based on migration of mlx5 devices, where
> >>   * the worst case total device migration size is on the order of 100s
> >>   * of MB.  Testing with larger values, ex. 128MB and 1GB, did not show
> >>   * a performance improvement.
> >>   */
> >>
> >> I think that provides sufficient information for someone who might come
> >> later to have an understanding of the basis if they want to try to
> >> optimize further.  
> >
> > OK, sounds good, I will add a comment like this.
> >  
> >>>>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice 
> >>>>> *vbasedev)
> >>>>>    return -EINVAL;
> >>>>>    }
> >>>>>
> >>>>> -    ret = vfio_get_dev_region_info(vbasedev,
> >>>>> - VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >>>>> - VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> >>>>> -   &info);
> >>>>> -    if (ret) {
> >>>>> -    return ret;
> >>>>> -    }
> >>>>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> >>>>> +    if (!ret) {
> >>>>> +    /* Migration v2 */
> >>>>> +    /* Basic migration functionality must be supported */
> >>>>> +    if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> >>>>> +    return -EOPNOTSUPP;
> >>>>> +    }
> >>>>> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> >>>>> +    vbasedev->migration->device_state = 
> >>>>> VFIO_DEVICE_STATE_RUNNING;
> >>>>> +    vbasedev->migration->data_buffer_size = 
> >>>>> VFIO_MIG_DATA_BUFFER_SIZE;
> >>>>> +    vbasedev->migration->data_buffer =
> >>>>> + g_malloc0(vbasedev->migration->data_buffer_size);  
> >>>> So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
> >>>> later addition of estimated device data size make any changes here?
> >>>> I'd think we'd want to scale the buffer to the minimum of the reported
> >>>> data size and some well documented heuristic for an upper bound.  
> >>> As I wrote above, increasing this size to higher values (128M, 1G)
> >>> didn't improve performance in our testing.
> >>> We can always change it later on if some other heuristics are proven to
> >>> improve performance.  
> >> Note that I'm asking about a minimum buffer size, for example if
> >> hisi_acc reports only 10s of KB for an estimated device size, why would
> >> we still allocate VFIO_MIG_DATA_BUFFER_SIZE here?  Thanks,  
> >
> > This buffer is rather small and has little memory footprint.
> > Do you think it is worth the extra complexity of resizing the buffer?
> >  
> Alex, WDYT?
> Note that the reported estimated size is dynamic and might change from 
> one query to the next, potentially leaving us with a smaller buffer size.
> 
> Also, as part of v4 I moved this allocation to vfio_save_setup(), so it 
> will be allocated only during migration (when it's actually used) and 
> only on the src side.

There's a claim here about added complexity that I'm not really seeing.
It looks like we simply make an ioctl call here and scale our buffer
based on the minimum of the returned device estimate or our upper
bound.

The previous comments that exceptionally large buffers don't
significantly affect migration performance also suggest that even if the
device estimate later changes, we'll likely be ok with the initial
estimate anyway.  Periodically re-checking the device estimate and
re-allocating up to a high-water mark could potentially be future work.
Thanks,

Alex
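
For what it's worth, a minimal sketch of the scaling described above, as it
might sit in vfio_save_setup().  vfio_query_stop_copy_size() is a
hypothetical helper wrapping the MIG_DATA_SIZE query (see the sketch further
down this thread), not code from the series:

    /*
     * Sketch only: size the transfer buffer from the device's stop-copy
     * estimate, capped by the existing 1MB default.  stop_copy_size and
     * vfio_query_stop_copy_size() are illustrative names.
     */
    uint64_t stop_copy_size = VFIO_MIG_DATA_BUFFER_SIZE;

    /* Ignore query failures and fall back to the default cap. */
    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);

    migration->data_buffer_size = MIN(VFIO_MIG_DATA_BUFFER_SIZE,
                                      stop_copy_size);
    migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
    if (!migration->data_buffer) {
        return -ENOMEM;
    }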




Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-28 Thread Alex Williamson
On Mon, 28 Nov 2022 15:40:23 -0400
Jason Gunthorpe  wrote:

> On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
> 
> > There's a claim here about added complexity that I'm not really seeing.
> > It looks like we simply make an ioctl call here and scale our buffer
> > based on the minimum of the returned device estimate or our upper
> > bound.  
> 
> I'm not keen on this; for something like mlx5, which has a small precopy
> size and a large post-copy size, it risks running with an under-allocated
> buffer, which is harmful to performance.

I'm trying to weed out whether there are device assumptions in the
implementation; it seems like maybe we found one.  MIG_DATA_SIZE specifies
that it's an estimated data size for stop-copy, so shouldn't that
provide the buffer size you're looking for?  Thanks,

Alex




Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2

2022-11-28 Thread Alex Williamson
On Mon, 28 Nov 2022 16:56:39 -0400
Jason Gunthorpe  wrote:

> On Mon, Nov 28, 2022 at 01:36:30PM -0700, Alex Williamson wrote:
> > On Mon, 28 Nov 2022 15:40:23 -0400
> > Jason Gunthorpe  wrote:
> >   
> > > On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
> > >   
> > > > There's a claim here about added complexity that I'm not really seeing.
> > > > It looks like we simply make an ioctl call here and scale our buffer
> > > > based on the minimum of the returned device estimate or our upper
> > > > bound.
> > > 
> > > I'm not keen on this; for something like mlx5, which has a small precopy
> > > size and a large post-copy size, it risks running with an under-allocated
> > > buffer, which is harmful to performance.  
> > 
> > I'm trying to weed out whether there are device assumptions in the
> > implementation; it seems like maybe we found one.
> 
> I don't think there are assumptions. Any correct kernel driver should
> be able to do this transfer out of the FD byte-at-a-time.
> 
> This buffer size is just a random selection for now until we get
> multi-fd and can sit down, benchmark and optimize this properly.

We can certainly still do that, but I'm still failing to see how
buffer_size = min(MIG_DATA_SIZE, 1MB) is such an imposition on the
complexity or over-eager optimization.  Thanks,

Alex
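
For reference, a hedged sketch of what such a query could look like, using
the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE feature from the kernel v6.2 UAPI
(structure and flag names as I recall them from <linux/vfio.h>; verify
against the linux-headers update in this series):

    /* Illustrative only: fetch the device's stop-copy size estimate. */
    static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
                                         uint64_t *stop_copy_size)
    {
        uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
                                  sizeof(struct vfio_device_feature_mig_data_size),
                                  sizeof(uint64_t))] = {};
        struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
        struct vfio_device_feature_mig_data_size *mig_data_size =
            (struct vfio_device_feature_mig_data_size *)feature->data;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET |
                         VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;

        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
            return -errno;
        }

        *stop_copy_size = mig_data_size->stop_copy_length;
        return 0;
    }

The result can then feed the MIN() clamp sketched earlier in the thread.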




Re: [PATCH v5 03/14] migration: Simplify migration_iteration_run()

2023-01-06 Thread Alex Williamson
On Thu, 29 Dec 2022 13:03:34 +0200
Avihai Horon  wrote:

> From: Juan Quintela 

IMHO, there should always be a commit log description.  Why is this a
simplification?  What does it allow us to do?  Nothing later obviously
depends on this, so why is it part of this series?  Thanks,

Alex

> Signed-off-by: Juan Quintela 
> Signed-off-by: Avihai Horon 
> ---
>  migration/migration.c | 25 +
>  1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 9795d0ec5c..61b9ce0fe8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3758,23 +3758,24 @@ static MigIterateState 
> migration_iteration_run(MigrationState *s)
>  trace_migrate_pending(pending_size, s->threshold_size,
>pend_pre, pend_compat, pend_post);
>  
> -if (pending_size && pending_size >= s->threshold_size) {
> -/* Still a significant amount to transfer */
> -if (!in_postcopy && pend_pre <= s->threshold_size &&
> -qatomic_read(&s->start_postcopy)) {
> -if (postcopy_start(s)) {
> -error_report("%s: postcopy failed to start", __func__);
> -}
> -return MIG_ITERATE_SKIP;
> -}
> -/* Just another iteration step */
> -qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
> -} else {
> +
> +if (!pending_size || pending_size < s->threshold_size) {
>  trace_migration_thread_low_pending(pending_size);
>  migration_completion(s);
>  return MIG_ITERATE_BREAK;
>  }
>  
> +/* Still a significant amount to transfer */
> +if (!in_postcopy && pend_pre <= s->threshold_size &&
> +qatomic_read(&s->start_postcopy)) {
> +if (postcopy_start(s)) {
> +error_report("%s: postcopy failed to start", __func__);
> +}
> +return MIG_ITERATE_SKIP;
> +}
> +
> +/* Just another iteration step */
> +qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
>  return MIG_ITERATE_RESUME;
>  }
>  




Re: [PATCH v5 04/14] vfio/migration: Fix NULL pointer dereference bug

2023-01-06 Thread Alex Williamson
On Tue, 3 Jan 2023 11:13:21 +
"Dr. David Alan Gilbert"  wrote:

> * Avihai Horon (avih...@nvidia.com) wrote:
> > As part of its error flow, vfio_vmstate_change() accesses
> > MigrationState->to_dst_file without any checks. This can cause a NULL
> > pointer dereference if the error flow is taken and
> > MigrationState->to_dst_file is not set.
> > 
> > For example, this can happen if VM is started or stopped not during
> > migration and vfio_vmstate_change() error flow is taken, as
> > MigrationState->to_dst_file is not set at that time.
> > 
> > Fix it by checking that MigrationState->to_dst_file is set before using
> > it.
> > 
> > Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of 
> > VM")
> > Signed-off-by: Avihai Horon 
> > Reviewed-by: Juan Quintela 
> > Reviewed-by: Vladimir Sementsov-Ogievskiy   
> 
> It might be worth posting this patch separately since it's a simple fix
> and should go in sooner.

There's no upstream implementation of a v1 migration driver, it's
deprecated in the kernel, and only enabled in QEMU via an experimental
option.  It seems sufficient as done here to make it a separate fix for
downstreams that may want to backport separately, but given the
remaining v1 status, I don't think I'd bother hurrying it otherwise.
Thanks,

Alex

> > ---
> >  hw/vfio/migration.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index e1413ac90c..09fe7c1de2 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -743,7 +743,9 @@ static void vfio_vmstate_change(void *opaque, bool 
> > running, RunState state)
> >   */
> >  error_report("%s: Failed to set device state 0x%x", vbasedev->name,
> >   (migration->device_state & mask) | value);
> > -qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> > +if (migrate_get_current()->to_dst_file) {
> > +qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> > +}
> >  }
> >  vbasedev->migration->vm_running = running;
> >  trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> > -- 
> > 2.26.3
> >   




Re: [PATCH v5 05/14] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2023-01-06 Thread Alex Williamson
On Thu, 29 Dec 2022 13:03:36 +0200
Avihai Horon  wrote:

> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked. This is because a DMA-able VFIO device
> can dirty RAM pages without updating QEMU about it, thus breaking the
> migration.
> 
> However, this doesn't mean that migration can't be done at all.
> In such case, allow migration and let QEMU VFIO code mark the entire
> bitmap dirty.
> 
> This guarantees that all pages that might have gotten dirty are reported
> back, and thus guarantees a valid migration even without VFIO IOMMU
> dirty tracking support.
> 
> The motivation for this patch is the introduction of iommufd [1].
> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into its internal ops, allowing the usage of these IOCTLs
> over iommufd. However, VFIO IOMMU dirty tracking is not supported by
> this VFIO compatibility API.
> 
> This patch will allow migration by hosts that use the VFIO compatibility
> API and prevent migration regressions caused by the lack of VFIO IOMMU
> dirty tracking support.
> 
> [1]
> https://lore.kernel.org/kvm/0-v6-a196d26f289e+11787-iommufd_...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c| 100 ++--
>  hw/vfio/migration.c |   3 +-
>  2 files changed, 61 insertions(+), 42 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 130e5d1dc7..67104e2fc2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -397,51 +397,61 @@ static int vfio_dma_unmap_bitmap(VFIOContainer 
> *container,
>   IOMMUTLBEntry *iotlb)
>  {
>  struct vfio_iommu_type1_dma_unmap *unmap;
> -struct vfio_bitmap *bitmap;
> +struct vfio_bitmap *vbitmap;
> +unsigned long *bitmap;
> +uint64_t bitmap_size;
>  uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
>  int ret;
>  
> -unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> +unmap = g_malloc0(sizeof(*unmap) + sizeof(*vbitmap));
>  
> -unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> +unmap->argsz = sizeof(*unmap);
>  unmap->iova = iova;
>  unmap->size = size;
> -unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> -bitmap = (struct vfio_bitmap *)&unmap->data;
>  
> +bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +  BITS_PER_BYTE;
> +bitmap = g_try_malloc0(bitmap_size);
> +if (!bitmap) {
> +ret = -ENOMEM;
> +goto unmap_exit;
> +}
> +
> +if (!container->dirty_pages_supported) {
> +bitmap_set(bitmap, 0, pages);
> +goto do_unmap;
> +}
> +
> +unmap->argsz += sizeof(*vbitmap);
> +unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> +
> +vbitmap = (struct vfio_bitmap *)&unmap->data;
> +vbitmap->data = (__u64 *)bitmap;
>  /*
>   * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
>   * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
>   * to qemu_real_host_page_size.
>   */
> +vbitmap->pgsize = qemu_real_host_page_size();
> +vbitmap->size = bitmap_size;
>  
> -bitmap->pgsize = qemu_real_host_page_size();
> -bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> -   BITS_PER_BYTE;
> -
> -if (bitmap->size > container->max_dirty_bitmap_size) {
> -error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
> - (uint64_t)bitmap->size);
> +if (bitmap_size > container->max_dirty_bitmap_size) {
> +error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, bitmap_size);
>  ret = -E2BIG;
>  goto unmap_exit;
>  }
>  
> -bitmap->data = g_try_malloc0(bitmap->size);
> -if (!bitmap->data) {
> -ret = -ENOMEM;
> -goto unmap_exit;
> -}
> -
> +do_unmap:
>  ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
>  if (!ret) {
> -cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
> -iotlb->translated_addr, pages);
> +cpu_physical_memory_set_dirty_lebitmap(bitmap, 
> iotlb->translated_addr,
> +   pages);
>  } else {
>  error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
>  }
>  
> -g_free(bitmap->data);
>  unmap_exit:
> +g_free(bitmap);
>  g_free(unmap);
>  return ret;
>  }
> @@ -460,8 +470,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
>  .size = size,
>  };
>  
> -if (iotlb && container->dirty_pages_supported &&
> -vfio_devices_all_running_and_saving(container)) {
> +if (iotlb && vfio_devices_all_running_and_saving(container)) {
>  return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>  }

Seems like it would be simpler to follow the non-dirty_pages_supported
path here and follow up with a conditional marking of the whole range as
dirty.  Thanks,

Alex

Re: [PATCH v5 00/14] vfio/migration: Implement VFIO migration protocol v2

2023-01-06 Thread Alex Williamson
On Thu, 29 Dec 2022 13:03:31 +0200
Avihai Horon  wrote:

> Hello,
> 
> Now that QEMU 8.0 development cycle has started and MIG_DATA_SIZE ioctl
> is in kernel v6.2-rc1, I am sending v5 of this series with linux headers
> update and with the preview patches in v4 merged into this series.
> 
> 
> 
> Following VFIO migration protocol v2 acceptance in kernel, this series
> implements VFIO migration according to the new v2 protocol and replaces
> the now deprecated v1 implementation.
> 
> The main differences between v1 and v2 migration protocols are:
> 1. VFIO device state is represented as a finite state machine instead of
>a bitmap.
> 
> 2. The migration interface with kernel is done using VFIO_DEVICE_FEATURE
>ioctl and normal read() and write() instead of the migration region
>used in v1.
> 
> 3. Pre-copy is made optional in v2 protocol. Support for pre-copy will
>be added later on.
> 
> Full description of the v2 protocol and the differences from v1 can be
> found here [1].
> 
> 
> 
> Patch list:
> 
> Patch 1 updates linux headers so we will have the MIG_DATA_SIZE ioctl.
> 
> Patches 2-3 are patches taken from Juan's RFC [2].
> As discussed in the KVM call, since we have a new ioctl to get device
> data size while it's RUNNING, we don't need the stop and resume VM
> functionality from the RFC.
> 
> Patches 4-9 are prep patches fixing bugs, adding QEMUFile function
> that will be used later and refactoring v1 protocol code to make it
> easier to add v2 protocol.
> 
> Patches 10-14 implement v2 protocol and remove v1 protocol.

Missing from the series is the all-important question of what happens
to "x-enable-migration" now.  We have two in-kernel drivers supporting
v2 migration, so while hardware and firmware may still be difficult to
bring together, it does seem possible for the upstream community to
test and maintain this.

To declare this supported and not to impose any additional requirements
on management tools, I think migration needs to be enabled by default
for devices that support it.  Is there any utility to keeping around
some sort of device option to force it ON/OFF?  My interpretation of ON
seems rather redundant with the -only-migratable option, i.e. fail the
device if migration is not supported, and I can't think of any
production use cases for OFF.  So maybe we simply drop the option as an
implicit AUTO feature and we can consider an experimental or supported
explicit feature later for the more esoteric use cases as they develop?
Thanks,

Alex
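
To make the implicit-AUTO idea concrete, here is a hedged sketch of what a
later, explicit option could look like if it were modelled as an OnOffAuto
property defaulting to AUTO.  vbasedev->enable_migration as an OnOffAuto,
vfio_block_migration() and the exact vfio_migration_init() error handling
are assumptions for illustration, not the current code:

    switch (vbasedev->enable_migration) {
    case ON_OFF_AUTO_OFF:
        /* Explicitly disabled: register a migration blocker, keep the device. */
        return vfio_block_migration(vbasedev, "VFIO migration disabled", errp);
    case ON_OFF_AUTO_AUTO:
        ret = vfio_migration_init(vbasedev);
        if (ret) {
            /* Unsupported: block migration rather than failing the device. */
            return vfio_block_migration(vbasedev,
                                        "device lacks migration support", errp);
        }
        return 0;
    case ON_OFF_AUTO_ON:
        ret = vfio_migration_init(vbasedev);
        if (ret) {
            error_setg(errp, "migration forced on but not supported by device");
        }
        return ret;
    default:
        g_assert_not_reached();
    }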




Re: [PATCH v8 04/13] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2023-01-20 Thread Alex Williamson
On Mon, 16 Jan 2023 16:11:26 +0200
Avihai Horon  wrote:

> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked. This is because a DMA-able VFIO device
> can dirty RAM pages without updating QEMU about it, thus breaking the
> migration.
> 
> However, this doesn't mean that migration can't be done at all.
> In such case, allow migration and let QEMU VFIO code mark all pages
> dirty.
> 
> This guarantees that all pages that might have gotten dirty are reported
> back, and thus guarantees a valid migration even without VFIO IOMMU
> dirty tracking support.
> 
> The motivation for this patch is the introduction of iommufd [1].
> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into its internal ops, allowing the usage of these IOCTLs
> over iommufd. However, VFIO IOMMU dirty tracking is not supported by
> this VFIO compatibility API.
> 
> This patch will allow migration by hosts that use the VFIO compatibility
> API and prevent migration regressions caused by the lack of VFIO IOMMU
> dirty tracking support.
> 
> [1]
> https://lore.kernel.org/kvm/0-v6-a196d26f289e+11787-iommufd_...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c| 20 ++--
>  hw/vfio/migration.c |  3 +--
>  2 files changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 130e5d1dc7..f6dd571549 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -488,6 +488,12 @@ static int vfio_dma_unmap(VFIOContainer *container,
>  return -errno;
>  }
>  
> +if (iotlb && vfio_devices_all_running_and_saving(container)) {
> +cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
> +tcg_enabled() ? 
> DIRTY_CLIENTS_ALL :
> +DIRTY_CLIENTS_NOCODE);

I take it this is an attempt to decipher the mask arg based on its use
in cpu_physical_memory_set_dirty_lebitmap().  I'm attempting to do the
same.  It seems like it must logically be the case that
global_dirty_tracking is set to pass the running-and-saving test, but I
can't connect the pieces.  Is this your understanding as well and the
reason we don't also need to optionally exclude DIRTY_MEMORY_MIGRATION?
Thanks,

Alex

> +}
> +
>  return 0;
>  }
>  
> @@ -1201,6 +1207,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer 
> *container, bool start)
>  .argsz = sizeof(dirty),
>  };
>  
> +if (!container->dirty_pages_supported) {
> +return;
> +}
> +
>  if (start) {
>  dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>  } else {
> @@ -1236,6 +1246,13 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
> *container, uint64_t iova,
>  uint64_t pages;
>  int ret;
>  
> +if (!container->dirty_pages_supported) {
> +cpu_physical_memory_set_dirty_range(ram_addr, size,
> +tcg_enabled() ? 
> DIRTY_CLIENTS_ALL :
> +DIRTY_CLIENTS_NOCODE);
> +return 0;
> +}
> +
>  dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
>  
>  dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> @@ -1409,8 +1426,7 @@ static void vfio_listener_log_sync(MemoryListener 
> *listener,
>  {
>  VFIOContainer *container = container_of(listener, VFIOContainer, 
> listener);
>  
> -if (vfio_listener_skipped_section(section) ||
> -!container->dirty_pages_supported) {
> +if (vfio_listener_skipped_section(section)) {
>  return;
>  }
>  
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 09fe7c1de2..552c2313b2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -860,11 +860,10 @@ int64_t vfio_mig_bytes_transferred(void)
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
> -VFIOContainer *container = vbasedev->group->container;
>  struct vfio_region_info *info = NULL;
>  int ret = -ENOTSUP;
>  
> -if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +if (!vbasedev->enable_migration) {
>  goto add_blocker;
>  }
>  




Re: [PATCH v8 05/13] migration/qemu-file: Add qemu_file_get_to_fd()

2023-01-20 Thread Alex Williamson
On Mon, 16 Jan 2023 16:11:27 +0200
Avihai Horon  wrote:

> Add new function qemu_file_get_to_fd() that allows reading data from
> QEMUFile and writing it straight into a given fd.
> 
> This will be used later in VFIO migration code.
> 
> Signed-off-by: Avihai Horon 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Cédric Le Goater 
> ---
>  migration/qemu-file.h |  1 +
>  migration/qemu-file.c | 34 ++
>  2 files changed, 35 insertions(+)

quintela, dgilbert, Ping for ack.  Thanks,

Alex

 
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index fa13d04d78..9d0155a2a1 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -148,6 +148,7 @@ int qemu_file_shutdown(QEMUFile *f);
>  QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>  void qemu_fflush(QEMUFile *f);
>  void qemu_file_set_blocking(QEMUFile *f, bool block);
> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
>  
>  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 2d5f74ffc2..102ab3b439 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -940,3 +940,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>  {
>  return file->ioc;
>  }
> +
> +/*
> + * Read size bytes from QEMUFile f and write them to fd.
> + */
> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
> +{
> +while (size) {
> +size_t pending = f->buf_size - f->buf_index;
> +ssize_t rc;
> +
> +if (!pending) {
> +rc = qemu_fill_buffer(f);
> +if (rc < 0) {
> +return rc;
> +}
> +if (rc == 0) {
> +return -EIO;
> +}
> +continue;
> +}
> +
> +rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
> +if (rc < 0) {
> +return -errno;
> +}
> +if (rc == 0) {
> +return -EIO;
> +}
> +f->buf_index += rc;
> +size -= rc;
> +}
> +
> +return 0;
> +}
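
For context, a hedged sketch of the kind of consumer this helper is meant
for on the VFIO load side: streaming one device-data chunk from the
migration stream straight into the device's migration data_fd.  The
function name and framing details are illustrative rather than the exact
code in this series:

    static int vfio_load_device_buffer(QEMUFile *f, VFIODevice *vbasedev)
    {
        VFIOMigration *migration = vbasedev->migration;
        /* Chunk size written by the source ahead of the raw device data. */
        uint64_t data_size = qemu_get_be64(f);

        return qemu_file_get_to_fd(f, migration->data_fd, data_size);
    }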




Re: [PATCH v8 09/13] vfio/migration: Implement VFIO migration protocol v2

2023-01-20 Thread Alex Williamson
On Mon, 16 Jan 2023 16:11:31 +0200
Avihai Horon  wrote:

> Implement the basic mandatory part of VFIO migration protocol v2.
> This includes all functionality that is necessary to support
> VFIO_MIGRATION_STOP_COPY part of the v2 protocol.
> 
> The two protocols, v1 and v2, will co-exist and in the following patches
> v1 protocol code will be removed.
> 
> There are several main differences between v1 and v2 protocols:
> - VFIO device state is now represented as a finite state machine instead
>   of a bitmap.
> 
> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>   ioctl and normal read() and write() instead of the migration region.
> 
> - Pre-copy is made optional in v2 protocol. Support for pre-copy will be
>   added later on.
> 
> Detailed information about VFIO migration protocol v2 and its difference
> compared to v1 protocol can be found here [1].
> 
> [1]
> https://lore.kernel.org/all/20220224142024.147653-10-yish...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> Reviewed-by: Cédric Le Goater 
> ---
>  include/hw/vfio/vfio-common.h |   5 +
>  hw/vfio/common.c  |  19 +-
>  hw/vfio/migration.c   | 455 +++---
>  hw/vfio/trace-events  |   7 +
>  4 files changed, 447 insertions(+), 39 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bbaf72ba00..6d7d850bfe 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
>  int vm_running;
>  Notifier migration_state;
>  uint64_t pending_bytes;
> +uint32_t device_state;
> +int data_fd;
> +void *data_buffer;
> +size_t data_buffer_size;
> +bool v2;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 550b2d7ded..dcaa77d2a8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,10 +355,18 @@ static bool 
> vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  return false;
>  }
>  
> -if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
> +if (!migration->v2 &&
> +(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
>  (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) 
> {
>  return false;
>  }
> +
> +if (migration->v2 &&
> +(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) 
> &&
> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> +return false;
> +}
>  }
>  }
>  return true;
> @@ -385,7 +393,14 @@ static bool 
> vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>  return false;
>  }
>  
> -if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +if (!migration->v2 &&
> +migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +continue;
> +}
> +
> +if (migration->v2 &&
> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>  continue;
>  } else {
>  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 9df859f4d3..f19ada0f4f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -10,6 +10,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/cutils.h"
> +#include "qemu/units.h"
>  #include 
>  #include 
>  
> @@ -44,8 +45,103 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
>  
> +/*
> + * This is an arbitrary size based on migration of mlx5 devices, where 
> typically
> + * total device migration size is on the order of 100s of MB. Testing with
> + * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
> + */
> +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
> +
>  static int64_t bytes_transferred;
>  
> +static const char *mig_state_to_str(enum vfio_device_mig_state state)
> +{
> +switch (state) {
> +case VFIO_DEVICE_STATE_ERROR:
> +return "ERROR";
> +case VFIO_DEVICE_STATE_STOP:
> +return "STOP";
> +case VFIO_DEVICE_STATE_RUNNING:
> +return "RUNNING";
> +case VFIO_DEVICE_STATE_STOP_COPY:
> +return "STOP_COPY";
> +case VFIO_DEVICE_STATE_RESUMING:
> +return "RESUMING";
> +case VFIO_DEVICE_STATE_RUNNING_P2P:
> +return "RUNNING_P2P";
> +default:
> +return "UNKNOWN STATE";
> +}
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev,
> +   

Re: [PATCH v8 09/13] vfio/migration: Implement VFIO migration protocol v2

2023-01-25 Thread Alex Williamson
On Sun, 22 Jan 2023 12:31:33 +0200
Avihai Horon  wrote:

> On 21/01/2023 1:07, Alex Williamson wrote:
> >
> > On Mon, 16 Jan 2023 16:11:31 +0200
> > Avihai Horon  wrote:
> >  
> >> Implement the basic mandatory part of VFIO migration protocol v2.
> >> This includes all functionality that is necessary to support
> >> VFIO_MIGRATION_STOP_COPY part of the v2 protocol.
> >>
> >> The two protocols, v1 and v2, will co-exist and in the following patches
> >> v1 protocol code will be removed.
> >>
> >> There are several main differences between v1 and v2 protocols:
> >> - VFIO device state is now represented as a finite state machine instead
> >>of a bitmap.
> >>
> >> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
> >>ioctl and normal read() and write() instead of the migration region.
> >>
> >> - Pre-copy is made optional in v2 protocol. Support for pre-copy will be
> >>added later on.
> >>
> >> Detailed information about VFIO migration protocol v2 and its difference
> >> compared to v1 protocol can be found here [1].
> >>
> >> [1]
> >> https://lore.kernel.org/all/20220224142024.147653-10-yish...@nvidia.com/
> >>
> >> Signed-off-by: Avihai Horon 
> >> Reviewed-by: Cédric Le Goater 
> >> ---
> >>   include/hw/vfio/vfio-common.h |   5 +
> >>   hw/vfio/common.c  |  19 +-
> >>   hw/vfio/migration.c   | 455 +++---
> >>   hw/vfio/trace-events  |   7 +
> >>   4 files changed, 447 insertions(+), 39 deletions(-)
> >>
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index bbaf72ba00..6d7d850bfe 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
> >>   int vm_running;
> >>   Notifier migration_state;
> >>   uint64_t pending_bytes;
> >> +uint32_t device_state;
> >> +int data_fd;
> >> +void *data_buffer;
> >> +size_t data_buffer_size;
> >> +bool v2;
> >>   } VFIOMigration;
> >>
> >>   typedef struct VFIOAddressSpace {
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 550b2d7ded..dcaa77d2a8 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -355,10 +355,18 @@ static bool 
> >> vfio_devices_all_dirty_tracking(VFIOContainer *container)
> >>   return false;
> >>   }
> >>
> >> -if ((vbasedev->pre_copy_dirty_page_tracking == 
> >> ON_OFF_AUTO_OFF) &&
> >> +if (!migration->v2 &&
> >> +(vbasedev->pre_copy_dirty_page_tracking == 
> >> ON_OFF_AUTO_OFF) &&
> >>   (migration->device_state_v1 & 
> >> VFIO_DEVICE_STATE_V1_RUNNING)) {
> >>   return false;
> >>   }
> >> +
> >> +if (migration->v2 &&
> >> +(vbasedev->pre_copy_dirty_page_tracking == 
> >> ON_OFF_AUTO_OFF) &&
> >> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> >> + migration->device_state == 
> >> VFIO_DEVICE_STATE_RUNNING_P2P)) {
> >> +return false;
> >> +}
> >>   }
> >>   }
> >>   return true;
> >> @@ -385,7 +393,14 @@ static bool 
> >> vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> >>   return false;
> >>   }
> >>
> >> -if (migration->device_state_v1 & 
> >> VFIO_DEVICE_STATE_V1_RUNNING) {
> >> +if (!migration->v2 &&
> >> +migration->device_state_v1 & 
> >> VFIO_DEVICE_STATE_V1_RUNNING) {
> >> +continue;
> >> +}
> >> +
> >> +if (migration->v2 &&
> >> +(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> >> + migration->device_state == 
> >> VFIO_DEVICE_STATE_RUNNING_P2P)) {
> >>   continue;
> >>   } else {

Re: [PATCH v9 07/14] vfio/migration: Block multiple devices migration

2023-02-07 Thread Alex Williamson
On Mon, 6 Feb 2023 14:31:30 +0200
Avihai Horon  wrote:

> Currently VFIO migration doesn't implement some kind of intermediate
> quiescent state in which P2P DMAs are quiesced before stopping or
> running the device. This can cause problems in multi-device migration
> where the devices are doing P2P DMAs, since the devices are not stopped
> together at the same time.
> 
> Until such support is added, block migration of multiple devices.
> 
> Signed-off-by: Avihai Horon 
> ---
>  include/hw/vfio/vfio-common.h |  2 ++
>  hw/vfio/common.c  | 51 +++
>  hw/vfio/migration.c   |  6 +
>  3 files changed, 59 insertions(+)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index e573f5a9f1..56b1683824 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -218,6 +218,8 @@ typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) 
> VFIOGroupList;
>  extern VFIOGroupList vfio_group_list;
>  
>  bool vfio_mig_active(void);
> +int vfio_block_multiple_devices_migration(Error **errp);
> +void vfio_unblock_multiple_devices_migration(void);
>  int64_t vfio_mig_bytes_transferred(void);
>  
>  #ifdef CONFIG_LINUX
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3a35f4afad..01db41b735 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -41,6 +41,7 @@
>  #include "qapi/error.h"
>  #include "migration/migration.h"
>  #include "migration/misc.h"
> +#include "migration/blocker.h"
>  #include "sysemu/tpm.h"
>  
>  VFIOGroupList vfio_group_list =
> @@ -337,6 +338,56 @@ bool vfio_mig_active(void)
>  return true;
>  }
>  
> +Error *multiple_devices_migration_blocker;
> +
> +static unsigned int vfio_migratable_device_num(void)
> +{
> +VFIOGroup *group;
> +VFIODevice *vbasedev;
> +unsigned int device_num = 0;
> +
> +QLIST_FOREACH(group, &vfio_group_list, next) {
> +QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +if (vbasedev->migration) {
> +device_num++;
> +}
> +}
> +}
> +
> +return device_num;
> +}
> +
> +int vfio_block_multiple_devices_migration(Error **errp)
> +{
> +int ret;
> +
> +if (vfio_migratable_device_num() != 2) {
> +return 0;
> +}
> +
> +error_setg(&multiple_devices_migration_blocker,
> +   "Migration is currently not supported with multiple "
> +   "VFIO devices");
> +ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
> +if (ret < 0) {
> +error_free(multiple_devices_migration_blocker);
> +multiple_devices_migration_blocker = NULL;
> +}
> +
> +return ret;
> +}
> +
> +void vfio_unblock_multiple_devices_migration(void)
> +{
> +if (vfio_migratable_device_num() != 2) {
> +return;
> +}
> +
> +migrate_del_blocker(multiple_devices_migration_blocker);
> +error_free(multiple_devices_migration_blocker);
> +multiple_devices_migration_blocker = NULL;
> +}

A couple awkward things here.  First I wish we could do something
cleaner or more intuitive than the != 2 test.  I get that we're trying
to do this on the addition of the 2nd device supporting migration, or
the removal of the next to last device independent of all other devices,
but I wonder if it wouldn't be better to remove the multiple-device
blocker after migration is torn down for the device so we can test
device >1 or ==1 in combination with whether
multiple_devices_migration_blocker is NULL.

Which brings us to the second awkwardness: if we fail to add the blocker
we free and clear it, but when we tear down the device due to that
failure we'll try to remove a blocker that doesn't exist, free NULL, and
clear it again.  Thanks to the glib slist the migration blocker is
using, I think that all works, but I'd rather not be dependent on that
implementation to avoid a segfault here.  Incorporating a test of
multiple_devices_migration_blocker as above would avoid this too.  Thanks,

Alex
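
A hedged sketch of the alternative described above, keying both paths off
whether the blocker already exists rather than an exact device count, and
assuming the unblock call is made after the device's migration state has
been torn down.  It reuses the names from the quoted patch but is
illustrative, not a drop-in replacement:

    int vfio_block_multiple_devices_migration(Error **errp)
    {
        int ret;

        if (multiple_devices_migration_blocker ||
            vfio_migratable_device_num() <= 1) {
            return 0;
        }

        error_setg(&multiple_devices_migration_blocker,
                   "Migration is currently not supported with multiple "
                   "VFIO devices");
        ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
        if (ret < 0) {
            error_free(multiple_devices_migration_blocker);
            multiple_devices_migration_blocker = NULL;
        }

        return ret;
    }

    void vfio_unblock_multiple_devices_migration(void)
    {
        if (!multiple_devices_migration_blocker ||
            vfio_migratable_device_num() > 1) {
            return;
        }

        migrate_del_blocker(multiple_devices_migration_blocker);
        error_free(multiple_devices_migration_blocker);
        multiple_devices_migration_blocker = NULL;
    }

With the NULL check in place, the teardown path mentioned above can call
the unblock helper unconditionally without risking a free of NULL.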

> +
>  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  {
>  VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 552c2313b2..8085a4ff44 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -880,6 +880,11 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error 
> **errp)
>  goto add_blocker;
>  }
>  
> +ret = vfio_block_multiple_devices_migration(errp);
> +if (ret) {
> +return ret;
> +}
> +
>  trace_vfio_migration_probe(vbasedev->name, info->index);
>  g_free(info);
>  return 0;
> @@ -902,6 +907,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
>  if (vbasedev->migration) {
>  VFIOMigration *migration = vbasedev->migration;
>  
> +vfio_unblock_multiple_devices_migration();
>  remove_migration_state_change_notifier(&migration->migration_state);
>  qemu_del_vm_change_state_handler(migration->vm_state);
>  

Re: [PATCH v9 10/14] vfio/migration: Implement VFIO migration protocol v2

2023-02-07 Thread Alex Williamson
On Mon, 6 Feb 2023 14:31:33 +0200
Avihai Horon  wrote:
> @@ -523,6 +745,41 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>  return 0;
>  }
>  
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +enum vfio_device_mig_state recover_state;
> +int ret;
> +
> +/* We reach here with device state STOP only */
> +recover_state = VFIO_DEVICE_STATE_STOP;

Why do we need to put this in a local variable?

> +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +   recover_state);
> +if (ret) {
> +return ret;
> +}
> +
> +do {
> +ret = vfio_save_block(f, vbasedev->migration);
> +if (ret < 0) {
> +return ret;
> +}
> +} while (!ret);
> +
> +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +ret = qemu_file_get_error(f);
> +if (ret) {
> +return ret;
> +}
> +
> +recover_state = VFIO_DEVICE_STATE_ERROR;

IIRC, the ERROR state is not reachable as a user-directed state.  I
suppose passing it as the recovery state guarantees a device reset when
the transition fails, but if that's the intention it should be documented
with a comment explaining so (and vfio_migration_set_state() should not
bother actually trying to set ERROR as a recovery state).

> +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> +   recover_state);
> +trace_vfio_save_complete_precopy(vbasedev->name, ret);
> +
> +return ret;
> +}
> +
>  static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
>  {
>  VFIODevice *vbasedev = opaque;
...
> @@ -769,12 +1087,17 @@ static void vfio_migration_state_notifier(Notifier 
> *notifier, void *data)
>  case MIGRATION_STATUS_CANCELLED:
>  case MIGRATION_STATUS_FAILED:
>  bytes_transferred = 0;
> -ret = vfio_migration_v1_set_state(vbasedev,
> -  ~(VFIO_DEVICE_STATE_V1_SAVING |
> -VFIO_DEVICE_STATE_V1_RESUMING),
> -  VFIO_DEVICE_STATE_V1_RUNNING);
> -if (ret) {
> -error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +if (migration->v2) {
> +vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
> + VFIO_DEVICE_STATE_ERROR);

Same here.  Thanks,

Alex
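
For both call sites, a hedged sketch of the sort of comment being asked
for; the wording is mine, not the series':

    /*
     * If setting the device to STOP fails there is no reasonable state
     * left to recover to from userspace, since ERROR cannot be entered
     * via a user-directed transition.  Passing ERROR as the recover
     * state therefore just means "don't attempt a further transition";
     * the expectation is that the device needs a reset before reuse.
     */
    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
                                   VFIO_DEVICE_STATE_ERROR);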

> +} else {
> +ret = vfio_migration_v1_set_state(vbasedev,
> +  ~(VFIO_DEVICE_STATE_V1_SAVING |
> +
> VFIO_DEVICE_STATE_V1_RESUMING),
> +  VFIO_DEVICE_STATE_V1_RUNNING);
> +if (ret) {
> +error_report("%s: Failed to set state RUNNING", 
> vbasedev->name);
> +}
>  }
>  }
>  }




Re: [PATCH v9 14/14] docs/devel: Align VFIO migration docs to v2 protocol

2023-02-07 Thread Alex Williamson
On Mon, 6 Feb 2023 14:31:37 +0200
Avihai Horon  wrote:

> Now that VFIO migration protocol v2 has been implemented and v1 protocol
> has been removed, update the documentation according to v2 protocol.
> 
> Signed-off-by: Avihai Horon 
> Reviewed-by: Cédric Le Goater 
> ---
>  docs/devel/vfio-migration.rst | 68 ---
>  1 file changed, 30 insertions(+), 38 deletions(-)

A note about the single device limitation should be added here.  Thanks,

Alex




Re: [PATCH v9 07/14] vfio/migration: Block multiple devices migration

2023-02-08 Thread Alex Williamson
On Wed, 8 Feb 2023 15:08:15 +0200
Avihai Horon  wrote:

> On 08/02/2023 0:34, Alex Williamson wrote:
> >
> > On Mon, 6 Feb 2023 14:31:30 +0200
> > Avihai Horon  wrote:
> >  
> >> Currently VFIO migration doesn't implement some kind of intermediate
> >> quiescent state in which P2P DMAs are quiesced before stopping or
> >> running the device. This can cause problems in multi-device migration
> >> where the devices are doing P2P DMAs, since the devices are not stopped
> >> together at the same time.
> >>
> >> Until such support is added, block migration of multiple devices.
> >>
> >> Signed-off-by: Avihai Horon 
> >> ---
> >>   include/hw/vfio/vfio-common.h |  2 ++
> >>   hw/vfio/common.c  | 51 +++
> >>   hw/vfio/migration.c   |  6 +
> >>   3 files changed, 59 insertions(+)
> >>
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index e573f5a9f1..56b1683824 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -218,6 +218,8 @@ typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) 
> >> VFIOGroupList;
> >>   extern VFIOGroupList vfio_group_list;
> >>
> >>   bool vfio_mig_active(void);
> >> +int vfio_block_multiple_devices_migration(Error **errp);
> >> +void vfio_unblock_multiple_devices_migration(void);
> >>   int64_t vfio_mig_bytes_transferred(void);
> >>
> >>   #ifdef CONFIG_LINUX
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 3a35f4afad..01db41b735 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -41,6 +41,7 @@
> >>   #include "qapi/error.h"
> >>   #include "migration/migration.h"
> >>   #include "migration/misc.h"
> >> +#include "migration/blocker.h"
> >>   #include "sysemu/tpm.h"
> >>
> >>   VFIOGroupList vfio_group_list =
> >> @@ -337,6 +338,56 @@ bool vfio_mig_active(void)
> >>   return true;
> >>   }
> >>
> >> +Error *multiple_devices_migration_blocker;
> >> +
> >> +static unsigned int vfio_migratable_device_num(void)
> >> +{
> >> +VFIOGroup *group;
> >> +VFIODevice *vbasedev;
> >> +unsigned int device_num = 0;
> >> +
> >> +QLIST_FOREACH(group, &vfio_group_list, next) {
> >> +QLIST_FOREACH(vbasedev, &group->device_list, next) {
> >> +if (vbasedev->migration) {
> >> +device_num++;
> >> +}
> >> +}
> >> +}
> >> +
> >> +return device_num;
> >> +}
> >> +
> >> +int vfio_block_multiple_devices_migration(Error **errp)
> >> +{
> >> +int ret;
> >> +
> >> +if (vfio_migratable_device_num() != 2) {
> >> +return 0;
> >> +}
> >> +
> >> +error_setg(&multiple_devices_migration_blocker,
> >> +   "Migration is currently not supported with multiple "
> >> +   "VFIO devices");
> >> +ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
> >> +if (ret < 0) {
> >> +error_free(multiple_devices_migration_blocker);
> >> +multiple_devices_migration_blocker = NULL;
> >> +}
> >> +
> >> +return ret;
> >> +}
> >> +
> >> +void vfio_unblock_multiple_devices_migration(void)
> >> +{
> >> +if (vfio_migratable_device_num() != 2) {
> >> +return;
> >> +}
> >> +
> >> +migrate_del_blocker(multiple_devices_migration_blocker);
> >> +error_free(multiple_devices_migration_blocker);
> >> +multiple_devices_migration_blocker = NULL;
> >> +}  
> > A couple awkward things here.  First I wish we could do something
> > cleaner or more intuitive than the != 2 test.  I get that we're trying
> > to do this on the addition of the 2nd device supporting migration, or
> > the removal of the next to last device independent of all other devices,
> > but I wonder if it wouldn't be better to remove the multiple-device
> > blocker after migration is torn down for the device so we can test
> > device >1 or ==1 in comb