RE: [PATCH v2 0/4] Support dynamic MSI-X allocation
Hi Alex,

> On Sat, 9/23/2023 4:57 AM, Alex Williamson wrote:
> On Mon, 18 Sep 2023 05:45:03 -0400 Jing Liu wrote:
> >
> > Changes since v1:
> > - v1: https://www.mail-archive.com/qemu-devel@nongnu.org/msg982842.html
> > - Revise Qemu to QEMU. (Cédric)
> > - Add g_free when getting MSI-X irq info fails. (Cédric)
> > - Apply Cédric's Reviewed-by. (Cédric)
> > - Use g_autofree for automatic release. (Cédric)
> > - Remove the failure message in vfio_enable_msix_no_vec(). (Cédric)
> >
> > Changes since RFC v1:
> > - RFC v1: https://www.mail-archive.com/qemu-devel@nongnu.org/msg978637.html
> > - Revise the comments. (Alex)
> > - Report the error of getting irq info and remove the trace of the
> >   failure case. (Alex, Cédric)
> > - Only store the dynamic allocation flag as a bool type and test it
> >   accordingly. (Alex)
> > - Move dynamic allocation detection to vfio_msix_early_setup(). (Alex)
> > - Change the condition logic in vfio_msix_vector_do_use(): move the
> >   defer_kvm_irq_routing test out and create a common place to update
> >   nr_vectors. (Alex)
> > - Consolidate the MSI-X enabling done during device initialization and
> >   interrupt restoring, using the fd = -1 trick. Create a function doing
> >   that. (Alex)
> >
> > Before kernel v6.5, dynamic allocation of MSI-X interrupts was not
> > supported. QEMU therefore, when allocating a new interrupt, had to
> > first release all previously allocated interrupts (including disabling
> > MSI-X) and re-allocate all interrupts, including the new one.
> >
> > The kernel series [1] adds support for dynamic MSI-X allocation to
> > vfio-pci and uses the existing flag VFIO_IRQ_INFO_NORESIZE to guide
> > user space: when dynamic MSI-X is supported, the flag is cleared.
> >
> > This series implements the new behavior for VFIO PCI devices when
> > dynamic MSI-X allocation is supported. When the guest unmasks an
> > interrupt, QEMU can directly allocate an interrupt for it on the host
> > without touching the previously allocated ones. The host therefore
> > only allocates interrupts for the vectors that are unmasked (enabled)
> > inside the guest, when dynamic MSI-X allocation is supported by the
> > device.
> >
> > When a guest enables MSI-X with all of the vectors masked, QEMU needs
> > to match that state by enabling MSI-X with no vector enabled. During
> > migration restore, QEMU also needs to enable MSI-X first in dynamic
> > allocation mode, to avoid vectors unused by the guest being allocated
> > on the host. To consolidate both cases, we use vector 0 with an
> > invalid fd to get MSI-X enabled and create a common function for this.
> > This is cleaner than setting up userspace triggering and immediately
> > releasing it.
> >
> > Any feedback is appreciated.
> >
> > Jing
> >
> > [1] https://lwn.net/Articles/931679/
> >
> > Jing Liu (4):
> >   vfio/pci: detect the support of dynamic MSI-X allocation
> >   vfio/pci: enable vector on dynamic MSI-X allocation
> >   vfio/pci: use an invalid fd to enable MSI-X
> >   vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation
> >
> >  hw/vfio/pci.c        | 121 +--
> >  hw/vfio/pci.h        | 1 +
> >  hw/vfio/trace-events | 2 +-
> >  3 files changed, 96 insertions(+), 28 deletions(-)
>
> Some minor comments on 2/ but otherwise:
>
> Reviewed-by: Alex Williamson

Thank you very much for the feedback. Will apply in v3 with a fix for 2/4.

Jing
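The fd = -1 trick the cover letter consolidates boils down to a single VFIO_DEVICE_SET_IRQS call. A minimal userspace sketch against the Linux VFIO uAPI, mirroring what patch 3/4's vfio_enable_msix_no_vec() does (device_fd is assumed to be an open VFIO device fd):

    /*
     * Enable MSI-X with zero vectors active by setting vector 0 with an
     * invalid (-1) eventfd.  Sketch only; QEMU's version lives in
     * hw/vfio/pci.c.
     */
    #include <linux/vfio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int enable_msix_no_vectors(int device_fd)
    {
        char buf[sizeof(struct vfio_irq_set) + sizeof(int32_t)];
        struct vfio_irq_set *irq_set = (struct vfio_irq_set *)buf;
        int32_t *fd = (int32_t *)&irq_set->data;

        memset(buf, 0, sizeof(buf));
        irq_set->argsz = sizeof(buf);
        irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
                         VFIO_IRQ_SET_ACTION_TRIGGER;
        irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        irq_set->start = 0;
        irq_set->count = 1;
        *fd = -1;               /* invalid fd: enables MSI-X, no trigger */

        return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
    }

MSI-X is then enabled on the device with no vector actually signaling, and later SET_IRQS calls can populate vectors sparsely as the guest unmasks them.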
RE: [PATCH v2 2/4] vfio/pci: enable vector on dynamic MSI-X allocation
Hi Alex, > On Sat, 9/23/2023 4:55 AM, Alex Williamson wrote: > On Mon, 18 Sep 2023 05:45:05 -0400 > Jing Liu wrote: > > > The vector_use callback is used to enable vector that is unmasked in > > guest. The kernel used to only support static MSI-X allocation. When > > allocating a new interrupt using "static MSI-X allocation" kernels, > > QEMU first disables all previously allocated vectors and then > > re-allocates all including the new one. The nr_vectors of > > VFIOPCIDevice indicates that all vectors from 0 to nr_vectors are > > allocated (and may be enabled), which is used to to loop all the > > possibly used vectors > ^^ ^^ > > s/to to/to/ Will change. > > > When, e.g., disabling MSI-X interrupts. > > > > Extend the vector_use function to support dynamic MSI-X allocation > > when host supports the capability. QEMU therefore can individually > > allocate and enable a new interrupt without affecting others or > > causing interrupts lost during runtime. > > > > Utilize nr_vectors to calculate the upper bound of enabled vectors in > > dynamic MSI-X allocation mode since looping all msix_entries_nr is not > > efficient and unnecessary. > > > > Signed-off-by: Jing Liu > > Tested-by: Reinette Chatre > > --- > > Changes since v1: > > - Revise Qemu to QEMU. > > > > Changes since RFC v1: > > - Test vdev->msix->noresize to identify the allocation mode. (Alex) > > - Move defer_kvm_irq_routing test out and update nr_vectors in a > > common place before vfio_enable_vectors(). (Alex) > > - Revise the comments. (Alex) > > --- > > hw/vfio/pci.c | 44 +++- > > 1 file changed, 27 insertions(+), 17 deletions(-) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > 60654ca28ab8..84987e46fd7a 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -470,6 +470,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, > unsigned int nr, > > VFIOPCIDevice *vdev = VFIO_PCI(pdev); > > VFIOMSIVector *vector; > > int ret; > > +int old_nr_vecs = vdev->nr_vectors; > > Minor suggestion, it reads slightly better below if this were something > like: > > bool resizing = !!(vdev->nr_vectors < nr + 1); > > Then use the bool in place of the nr+1 tests below. Thanks, > Got it. This change makes it nice to read. Thanks for the advice. Will send v3 later. Thanks, Jing > Alex > > > > > trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr); > > > > @@ -512,33 +513,42 @@ static int vfio_msix_vector_do_use(PCIDevice > *pdev, unsigned int nr, > > } > > > > /* > > - * We don't want to have the host allocate all possible MSI vectors > > - * for a device if they're not in use, so we shutdown and incrementally > > - * increase them as needed. > > + * When dynamic allocation is not supported, we don't want to have the > > + * host allocate all possible MSI vectors for a device if they're not > > + * in use, so we shutdown and incrementally increase them as needed. > > + * nr_vectors represents the total number of vectors allocated. > > + * > > + * When dynamic allocation is supported, let the host only allocate > > + * and enable a vector when it is in use in guest. nr_vectors > > represents > > + * the upper bound of vectors being enabled (but not all of the ranges > > + * is allocated or enabled). 
> > */ > > if (vdev->nr_vectors < nr + 1) { > > vdev->nr_vectors = nr + 1; > > -if (!vdev->defer_kvm_irq_routing) { > > +} > > + > > +if (!vdev->defer_kvm_irq_routing) { > > +if (vdev->msix->noresize && (old_nr_vecs < nr + 1)) { > > vfio_disable_irqindex(&vdev->vbasedev, > > VFIO_PCI_MSIX_IRQ_INDEX); > > ret = vfio_enable_vectors(vdev, true); > > if (ret) { > > error_report("vfio: failed to enable vectors, %d", ret); > > } > > -} > > -} else { > > -Error *err = NULL; > > -int32_t fd; > > - > > -if (vector->virq >= 0) { > > -fd = event_notifier_get_fd(&vector->kvm_interrupt); > > } else { > > -fd = event_notifier_get_fd(&vector->interrupt); > > -} > > +Error *err = NULL; > > +int32_t fd; > > > > -if (vfio_set_irq_signaling(&vdev->vbasedev, > > - VFIO_PCI_MSIX_IRQ_INDEX, nr, > > - VFIO_IRQ_SET_ACTION_TRIGGER, fd, > > &err)) { > > -error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name); > > +if (vector->virq >= 0) { > > +fd = event_notifier_get_fd(&vector->kvm_interrupt); > > +} else { > > +fd = event_notifier_get_fd(&vector->interrupt); > > +} > > + > > +if (vfio_set_irq_signaling(&vdev->vbasedev, > > + VFIO_PCI_MSIX_IRQ_INDEX, nr, > > +
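The hunk above is cut off in the archive; for reference, here is roughly how it reads once Alex's 'resizing' suggestion is folded in. A sketch of the expected v3 shape within vfio_msix_vector_do_use(), not the actual v3 patch (all names come from the quoted QEMU code):

    bool resizing = !!(vdev->nr_vectors < nr + 1);

    /* (trace call and vector/KVM routing setup above unchanged) */

    if (resizing) {
        vdev->nr_vectors = nr + 1;
    }

    if (!vdev->defer_kvm_irq_routing) {
        if (vdev->msix->noresize && resizing) {
            /* Static allocation: tear everything down, re-enable all. */
            vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
            ret = vfio_enable_vectors(vdev, true);
            if (ret) {
                error_report("vfio: failed to enable vectors, %d", ret);
            }
        } else {
            /* Dynamic allocation: signal only the new vector. */
            Error *err = NULL;
            int32_t fd;

            if (vector->virq >= 0) {
                fd = event_notifier_get_fd(&vector->kvm_interrupt);
            } else {
                fd = event_notifier_get_fd(&vector->interrupt);
            }

            if (vfio_set_irq_signaling(&vdev->vbasedev,
                                       VFIO_PCI_MSIX_IRQ_INDEX, nr,
                                       VFIO_IRQ_SET_ACTION_TRIGGER, fd,
                                       &err)) {
                error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
            }
        }
    }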
RE: [PATCH v2 4/4] vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation
Hi Cédric, > On 9/19/2023 11:21 PM, Cédric Le Goater wrote: > > On 9/18/23 11:45, Jing Liu wrote: > > During migration restoring, vfio_enable_vectors() is called to restore > > enabling MSI-X interrupts for assigned devices. It sets the range from > > 0 to nr_vectors to kernel to enable MSI-X and the vectors unmasked in > > guest. During the MSI-X enabling, all the vectors within the range are > > allocated according to the VFIO_DEVICE_SET_IRQS ioctl. > > > > When dynamic MSI-X allocation is supported, we only want the guest > > unmasked vectors being allocated and enabled. Use vector 0 with an > > invalid fd to get MSI-X enabled, after that, all the vectors can be > > allocated in need. > > > > Signed-off-by: Jing Liu > > > Reviewed-by: Cédric Le Goater Thanks very much for your feedback. Jing > > Thanks, > > C. > > > > --- > > Changes since v1: > > - No change. > > > > Changes since RFC v1: > > - Revise the comments. (Alex) > > - Call the new helper function in previous patch to enable MSI-X. > > (Alex) > > --- > > hw/vfio/pci.c | 17 + > > 1 file changed, 17 insertions(+) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > 0117f230e934..f5f891dc0792 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -402,6 +402,23 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, > bool msix) > > int ret = 0, i, argsz; > > int32_t *fds; > > > > +/* > > + * If dynamic MSI-X allocation is supported, the vectors to be > > allocated > > + * and enabled can be scattered. Before kernel enabling MSI-X, setting > > + * nr_vectors causes all these vectors to be allocated on host. > > + * > > + * To keep allocation as needed, use vector 0 with an invalid fd to get > > + * MSI-X enabled first, then set vectors with a potentially sparse set > > of > > + * eventfds to enable interrupts only when enabled in guest. > > + */ > > +if (msix && !vdev->msix->noresize) { > > +ret = vfio_enable_msix_no_vec(vdev); > > + > > +if (ret) { > > +return ret; > > +} > > +} > > + > > argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds)); > > > > irq_set = g_malloc0(argsz);
RE: [PATCH v1 0/4] Support dynamic MSI-X allocation
Hi Cédric, Thanks for this information. I'll send v2 later. Jing > On 9/15/2023 3:42 PM, Cédric Le Goater wrote: > > On 9/15/23 09:40, Liu, Jing2 wrote: > > Friendly ping to have your valuable inputs and comments. > > Thanks very much. > > I think that was done. We are waiting for the v2. > > Thanks, > > C. > > > > > > BRs, > > Jing > > > >> On 8/22/2023 3:29 PM, Jing Liu wrote: > >> Changes since RFC v1: > >> - RFC v1: https://www.mail-archive.com/qemu- > >> de...@nongnu.org/msg978637.html > >> - Revise the comments. (Alex) > >> - Report error of getting irq info and remove the trace of failure > >>case. (Alex, Cédric) > >> - Only store dynamic allocation flag as a bool type and test > >>accordingly. (Alex) > >> - Move dynamic allocation detection to vfio_msix_early_setup(). > >> (Alex) > >> - Change the condition logic in vfio_msix_vector_do_use() that moving > >>the defer_kvm_irq_routing test out and create a common place to update > >>nr_vectors. (Alex) > >> - Consolidate the way of MSI-X enabling during device initialization and > >>interrupt restoring that uses fd = -1 trick. Create a function doing > >>that. (Alex) > >> > >> Before kernel v6.5, dynamic allocation of MSI-X interrupts was not > >> supported. > >> Qemu therefore when allocating a new interrupt, should first release > >> all previously allocated interrupts (including disable of MSI-X) and > >> re-allocate all interrupts that includes the new one. > >> > >> The kernel series [1] adds the support of dynamic MSI-X allocation to > >> vfio-pci and uses the existing flag VFIO_IRQ_INFO_NORESIZE to guide > >> user space, that when dynamic MSI-X is supported the flag is cleared. > >> > >> This series makes the behavior for VFIO PCI devices when dynamic > >> MSI-X allocation is supported. When guest unmasks an interrupt, Qemu > >> can directly allocate an interrupt on host for this and has nothing > >> to do with the previously allocated ones. Therefore, host only > >> allocates interrupts for those unmasked > >> (enabled) interrupts inside guest when dynamic MSI-X allocation is > >> supported by device. > >> > >> When guests enable MSI-X with all of the vectors masked, Qemu need > >> match the state to enable MSI-X with no vector enabled. During > >> migration restore, Qemu also need enable MSI-X first in dynamic > >> allocation mode, to avoid the guest unused vectors being allocated on > >> host. To consolidate them, we use vector 0 with an invalid fd to get MSI-X > enabled and create a common function for this. > >> This is cleaner than setting userspace triggering and immediately release. > >> > >> Any feedback is appreciated. > >> > >> Jing > >> > >> [1] https://lwn.net/Articles/931679/ > >> > >> Jing Liu (4): > >>vfio/pci: detect the support of dynamic MSI-X allocation > >>vfio/pci: enable vector on dynamic MSI-X allocation > >>vfio/pci: use an invalid fd to enable MSI-X > >>vfio/pci: enable MSI-X in interrupt restoring on dynamic > >> allocation > >> > >> hw/vfio/pci.c| 126 +-- > >> hw/vfio/pci.h| 1 + > >> hw/vfio/trace-events | 2 +- > >> 3 files changed, 101 insertions(+), 28 deletions(-) > >> > >> -- > >> 2.27.0 > >
RE: [PATCH v1 0/4] Support dynamic MSI-X allocation
Friendly ping to have your valuable inputs and comments. Thanks very much. BRs, Jing > On 8/22/2023 3:29 PM, Jing Liu wrote: > Changes since RFC v1: > - RFC v1: https://www.mail-archive.com/qemu- > de...@nongnu.org/msg978637.html > - Revise the comments. (Alex) > - Report error of getting irq info and remove the trace of failure > case. (Alex, Cédric) > - Only store dynamic allocation flag as a bool type and test > accordingly. (Alex) > - Move dynamic allocation detection to vfio_msix_early_setup(). (Alex) > - Change the condition logic in vfio_msix_vector_do_use() that moving > the defer_kvm_irq_routing test out and create a common place to update > nr_vectors. (Alex) > - Consolidate the way of MSI-X enabling during device initialization and > interrupt restoring that uses fd = -1 trick. Create a function doing > that. (Alex) > > Before kernel v6.5, dynamic allocation of MSI-X interrupts was not supported. > Qemu therefore when allocating a new interrupt, should first release all > previously allocated interrupts (including disable of MSI-X) and re-allocate > all > interrupts that includes the new one. > > The kernel series [1] adds the support of dynamic MSI-X allocation to vfio-pci > and uses the existing flag VFIO_IRQ_INFO_NORESIZE to guide user space, that > when dynamic MSI-X is supported the flag is cleared. > > This series makes the behavior for VFIO PCI devices when dynamic MSI-X > allocation is supported. When guest unmasks an interrupt, Qemu can directly > allocate an interrupt on host for this and has nothing to do with the > previously > allocated ones. Therefore, host only allocates interrupts for those unmasked > (enabled) interrupts inside guest when dynamic MSI-X allocation is supported > by > device. > > When guests enable MSI-X with all of the vectors masked, Qemu need match the > state to enable MSI-X with no vector enabled. During migration restore, Qemu > also need enable MSI-X first in dynamic allocation mode, to avoid the guest > unused vectors being allocated on host. To consolidate them, we use vector 0 > with an invalid fd to get MSI-X enabled and create a common function for this. > This is cleaner than setting userspace triggering and immediately release. > > Any feedback is appreciated. > > Jing > > [1] https://lwn.net/Articles/931679/ > > Jing Liu (4): > vfio/pci: detect the support of dynamic MSI-X allocation > vfio/pci: enable vector on dynamic MSI-X allocation > vfio/pci: use an invalid fd to enable MSI-X > vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation > > hw/vfio/pci.c| 126 +-- > hw/vfio/pci.h| 1 + > hw/vfio/trace-events | 2 +- > 3 files changed, 101 insertions(+), 28 deletions(-) > > -- > 2.27.0
RE: [PATCH v1 3/4] vfio/pci: use an invalid fd to enable MSI-X
Hi Igor,

On Wed, August 30, 2023 6:49 PM, Igor Mammedov wrote:
>
> On Wed, 30 Aug 2023 10:03:33 +0000
> "Liu, Jing2" wrote:
> ...
> > > > +/*
> > > > + * Get MSI-X enabled, but no vector enabled, by setting vector 0
> > > > + * with an invalid fd to kernel.
> > > > + */
> > > > +static int vfio_enable_msix_no_vec(VFIOPCIDevice *vdev)
> > > > +{
> > > > +    struct vfio_irq_set *irq_set;
> > >
> > > This could be a 'g_autofree' variable.
> >
> > Thanks for pointing this to me. I just realized QEMU doc recommends
> > g_autofree/g_autoptr to do automatic memory deallocation.
> >
> > I will replace to g_autofree style in next version.
> >
> > I have a question for a specific example (not related to this patch),
> > and I failed to find an example in current QEMU code to understand it.
> > If one g_autofree structure includes a pointer that we also allocate
> > space for the inside pointer, could g_autofree release the inside space?
>
> it might be too cumbersome for 1-off use
> see following for how 'auto' works
> https://docs.gtk.org/glib/auto-cleanup.html

Thank you for the guide. It's now clear to me that for such a type one
needs to define a dedicated free function via the macros, and g_auto* can
then call it automatically.

Jing

> > struct dummy1 {
> >     int data;
> >     struct *p;
> > }
> > struct p {
> >     char member[];
> > }
> > void func() {
> >     g_autofree struct *dummy1 = NULL;
> >
> >     dummy1 = g_malloc();
> >     dummy1->p = g_malloc();
> >     ...
> >     return; // is this correct or need g_free(p)?
> > }
> >
> > > > +    int ret = 0, argsz;
> > > > +    int32_t *fd;
> > > > +
> > > > +    argsz = sizeof(*irq_set) + sizeof(*fd);
> > > > +
> > > > +    irq_set = g_malloc0(argsz);
> > > > +    irq_set->argsz = argsz;
> > > > +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> > > > +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> > > > +    irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
> > > > +    irq_set->start = 0;
> > > > +    irq_set->count = 1;
> > > > +    fd = (int32_t *)&irq_set->data;
> > > > +    *fd = -1;
> > > > +
> > > > +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> > > > +    if (ret) {
> > > > +        error_report("vfio: failed to enable MSI-X with vector 0 trick, %d",
> > > > +                     ret);
> > >
> > > The above message seems redundant. I would simply return 'ret' and
> > > let the caller report the error. Same as vfio_enable_vectors().
> >
> > Understood. Will change.
> >
> > Thanks,
> > Jing
>
> > > Thanks,
> > > C.
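The pattern the linked GLib page describes, as a small self-contained sketch (type and function names here are illustrative, not from the patch): plain g_autofree only g_free()s the outer allocation, so a type holding an inner pointer needs its own cleanup function registered via G_DEFINE_AUTOPTR_CLEANUP_FUNC so that g_autoptr releases both.

    #include <glib.h>

    typedef struct {
        int data;
        char *member;          /* inner allocation */
    } Dummy;

    static void dummy_free(Dummy *d)
    {
        if (d) {
            g_free(d->member); /* would leak under plain g_autofree */
            g_free(d);
        }
    }

    G_DEFINE_AUTOPTR_CLEANUP_FUNC(Dummy, dummy_free)

    static void func(void)
    {
        g_autoptr(Dummy) dummy = g_new0(Dummy, 1);

        dummy->member = g_strdup("payload");
        /* no explicit g_free() needed: dummy_free() runs on scope exit */
    }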
RE: [PATCH v1 3/4] vfio/pci: use an invalid fd to enable MSI-X
Hi Cédric, On 8/29/2023 10:04 PM, Cédric Le Goater wrote: > On 8/22/23 09:29, Jing Liu wrote: > > Guests typically enable MSI-X with all of the vectors masked in the > > MSI-X vector table. To match the guest state of device, Qemu enables > > MSI-X by > > QEMU is preferred to Qemu. Got it. > > > enabling vector 0 with userspace triggering and immediately release. > > However the release function actually does not release it due to > > already using userspace mode. > > > > It is no need to enable triggering on host and rely on the mask bit to > > avoid spurious interrupts. Use an invalid fd (i.e. fd = -1) is enough > > to get MSI-X enabled. > > > > After dynamic MSI-X allocation is supported, the interrupt restoring > > also need use such way to enable MSI-X, therefore, create a function > > for that. > > > > Suggested-by: Alex Williamson > > Signed-off-by: Jing Liu > > --- > > Changes since RFC v1: > > - A new patch. Use an invalid fd to get MSI-X enabled instead of using > >userspace triggering. (Alex) > > --- > > hw/vfio/pci.c | 50 ++ > > 1 file changed, 42 insertions(+), 8 deletions(-) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > 31f36d68bb19..e24c21241a0c 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -369,6 +369,39 @@ static void vfio_msi_interrupt(void *opaque) > > notify(&vdev->pdev, nr); > > } > > > > +/* > > + * Get MSI-X enabled, but no vector enabled, by setting vector 0 with > > +an invalid > > + * fd to kernel. > > + */ > > +static int vfio_enable_msix_no_vec(VFIOPCIDevice *vdev)> +{ > > +struct vfio_irq_set *irq_set; > > This could be a 'g_autofree' variable. Thanks for pointing this to me. I just realized QEMU doc recommends g_autofree/g_autoptr to do automatic memory deallocation. I will replace to g_autofree style in next version. I have a question for a specific example (not related to this patch), and I failed to find an example in current QEMU code to understand it. If one g_autofree structure includes a pointer that we also allocate space for the inside pointer, could g_autofree release the inside space? struct dummy1 { int data; struct *p; } struct p { char member[]; } void func() { g_autofree struct *dummy1 = NULL; dummy1 = g_malloc(); dummy1->p = g_malloc(); ... return; // is this correct or need g_free(p)? } > > > +int ret = 0, argsz; > > +int32_t *fd; > > + > > +argsz = sizeof(*irq_set) + sizeof(*fd); > > + > > +irq_set = g_malloc0(argsz); > > +irq_set->argsz = argsz; > > +irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | > > + VFIO_IRQ_SET_ACTION_TRIGGER; > > +irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX; > > +irq_set->start = 0; > > +irq_set->count = 1; > > +fd = (int32_t *)&irq_set->data; > > +*fd = -1; > > + > > +ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); > > +if (ret) { > > +error_report("vfio: failed to enable MSI-X with vector 0 trick, > > %d", > > + ret); > > The above message seems redundant. I would simply return 'ret' and let the > caller report the error. Same as vfio_enable_vectors(). Understood. Will change. Thanks, Jing > Thanks, > > C. 
> > > > +} > > + > > +g_free(irq_set); > > + > > +return ret; > > +} > > + > > static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix) > > { > > struct vfio_irq_set *irq_set; > > @@ -618,6 +651,8 @@ static void > > vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev) > > > > static void vfio_msix_enable(VFIOPCIDevice *vdev) > > { > > +int ret; > > + > > vfio_disable_interrupts(vdev); > > > > vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->msix->entries); > > @@ -640,8 +675,6 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev) > > vfio_commit_kvm_msi_virq_batch(vdev); > > > > if (vdev->nr_vectors) { > > -int ret; > > - > > ret = vfio_enable_vectors(vdev, true); > > if (ret) { > > error_report("vfio: failed to enable vectors, %d", ret); > > @@ -655,13 +688,14 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev) > >* MSI-X capability, but leaves the vector table masked. We > > therefore > >* can't rely on a vector_use callback (from request_irq() in the > > guest) > >* to switch the physical device into MSI-X mode because that may > > come > a > > - * long time after pci_enable_msix(). This code enables vector 0 > > with > > - * triggering to userspace, then immediately release the vector, > > leaving > > - * the physical device with no vectors enabled, but MSI-X enabled, > > just > > - * like the guest view. > > + * long time after pci_enable_msix(). This code sets vector 0 > > with an > > + * invalid fd to make the physical device MSI-X enabled, but with > > no > > +
RE: [PATCH v1 1/4] vfio/pci: detect the support of dynamic MSI-X allocation
Hi Cédric, Thank you for your reviewing. On 8/29/2023 9:33 PM, Cédric Le Goater wrote: > Hello Jing, > > On 8/22/23 09:29, Jing Liu wrote: > > Kernel provides the guidance of dynamic MSI-X allocation support of > > passthrough device, by clearing the VFIO_IRQ_INFO_NORESIZE flag to > > guide user space. > > > > Fetch the flags from host to determine if dynamic MSI-X allocation is > > supported. > > > > Originally-by: Reinette Chatre > > Signed-off-by: Jing Liu > > --- > > Changes since RFC v1: > > - Filter the dynamic MSI-X allocation flag and store as a bool type. > >(Alex) > > - Move the detection to vfio_msix_early_setup(). (Alex) > > - Report error of getting irq info and remove the trace of failure > >case. (Alex, Cédric) > > --- > > hw/vfio/pci.c| 15 +-- > > hw/vfio/pci.h| 1 + > > hw/vfio/trace-events | 2 +- > > 3 files changed, 15 insertions(+), 3 deletions(-) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > a205c6b1130f..8a3b34f3c196 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -1493,7 +1493,9 @@ static void vfio_msix_early_setup(VFIOPCIDevice > *vdev, Error **errp) > > uint8_t pos; > > uint16_t ctrl; > > uint32_t table, pba; > > -int fd = vdev->vbasedev.fd; > > +int ret, fd = vdev->vbasedev.fd; > > +struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info), > > + .index = > > + VFIO_PCI_MSIX_IRQ_INDEX }; > > VFIOMSIXInfo *msix; > > > > pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX); @@ > > -1530,6 +1532,14 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, > Error **errp) > > msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK; > > msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; > > > > +ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); > > +if (ret < 0) { > > +error_setg_errno(errp, -ret, "failed to get MSI-X irq info"); > > Missing : > g_free(msix); > Oh, yes. > > With this fixed, > > Reviewed-by: Cédric Le Goater > Thank you. I'll apply it in next version with the fix. Jing > Thanks, > > C. > > > > > +return; > > +} > > + > > +msix->noresize = !!(irq_info.flags & VFIO_IRQ_INFO_NORESIZE); > > + > > /* > >* Test the size of the pba_offset variable and catch if it extends > > outside > >* of the specified BAR. 
If it is the case, we need to apply a > > hardware @@ -1562,7 +1572,8 @@ static void > vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) > > } > > > > trace_vfio_msix_early_setup(vdev->vbasedev.name, pos, msix->table_bar, > > -msix->table_offset, msix->entries); > > +msix->table_offset, msix->entries, > > +msix->noresize); > > vdev->msix = msix; > > > > vfio_pci_fixup_msix_region(vdev); diff --git a/hw/vfio/pci.h > > b/hw/vfio/pci.h index a2771b9ff3cc..0717574d79e9 100644 > > --- a/hw/vfio/pci.h > > +++ b/hw/vfio/pci.h > > @@ -113,6 +113,7 @@ typedef struct VFIOMSIXInfo { > > uint32_t table_offset; > > uint32_t pba_offset; > > unsigned long *pending; > > +bool noresize; > > } VFIOMSIXInfo; > > > > #define TYPE_VFIO_PCI "vfio-pci" > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index > > ee7509e68e4f..6de5d9ba8e46 100644 > > --- a/hw/vfio/trace-events > > +++ b/hw/vfio/trace-events > > @@ -27,7 +27,7 @@ vfio_vga_read(uint64_t addr, int size, uint64_t data) " > (0x%"PRIx64", %d) = 0x%" > > vfio_pci_read_config(const char *name, int addr, int len, int val) " (%s, > @0x%x, len=0x%x) 0x%x" > > vfio_pci_write_config(const char *name, int addr, int val, int len) " (%s, > @0x%x, 0x%x, len=0x%x)" > > vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x" > > -vfio_msix_early_setup(const char *name, int pos, int table_bar, int > > offset, int > entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d" > > +vfio_msix_early_setup(const char *name, int pos, int table_bar, int > > offset, int > entries, bool noresize) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, > entries %d, noresize %d" > > vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap" > > vfio_check_pm_reset(const char *name) "%s Supports PM reset" > > vfio_check_af_flr(const char *name) "%s Supports FLR via AF cap"
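With the missing g_free() applied, the error path above presumably ends up as follows in v2; a sketch assembled from the quoted hunks, not the literal v2 patch:

    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
    if (ret < 0) {
        error_setg_errno(errp, -ret, "failed to get MSI-X irq info");
        g_free(msix);   /* don't leak the VFIOMSIXInfo allocated above */
        return;
    }

    msix->noresize = !!(irq_info.flags & VFIO_IRQ_INFO_NORESIZE);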
RE: [PATCH RFC v1 3/3] vfio/pci: dynamic MSI-X allocation in interrupt restoring
Hi Alex,

> On July 28, 2023 1:25 AM, Alex Williamson wrote:
> On Thu, 27 Jul 2023 03:24:10 -0400 Jing Liu wrote:
> >
> > During migration restoring, vfio_enable_vectors() is called to restore
> > enabling MSI-X interrupts for assigned devices. It sets the range from
> > 0 to nr_vectors to kernel to enable MSI-X and the vectors unmasked in
> > guest. During the MSI-X enabling, all the vectors within the range are
> > allocated according to the ioctl().
> >
> > When dynamic MSI-X allocation is supported, we only want the guest
> > unmasked vectors being allocated and enabled. Therefore, Qemu can
> > first set vector 0 to enable MSI-X and after that, all the vectors can
> > be allocated in need.
> >
> > Signed-off-by: Jing Liu
> > ---
> >  hw/vfio/pci.c | 32 
> >  1 file changed, 32 insertions(+)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 8c485636445c..43ffacd5b36a 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -375,6 +375,38 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
> >      int ret = 0, i, argsz;
> >      int32_t *fds;
> >
> > +    /*
> > +     * If dynamic MSI-X allocation is supported, the vectors to be allocated
> > +     * and enabled can be scattered. Before kernel enabling MSI-X, setting
> > +     * nr_vectors causes all these vectors being allocated on host.
>
> s/being/to be/

Will change.

> > +     *
> > +     * To keep allocation as needed, first setup vector 0 with an invalid
> > +     * fd to make MSI-X enabled, then enable vectors by setting all so that
> > +     * kernel allocates and enables interrupts only when enabled in guest.
> > +     */
> > +    if (msix && !(vdev->msix->irq_info_flags &
> > +        VFIO_IRQ_INFO_NORESIZE)) {
>
> !vdev->msix->noresize again seems cleaner.

Sure, will change.

> > +        argsz = sizeof(*irq_set) + sizeof(*fds);
> > +
> > +        irq_set = g_malloc0(argsz);
> > +        irq_set->argsz = argsz;
> > +        irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> > +                         VFIO_IRQ_SET_ACTION_TRIGGER;
> > +        irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
> > +                         VFIO_PCI_MSI_IRQ_INDEX;
>
> Why are we testing msix again within a branch that requires msix?

Ah, yes. Will remove the test.

> > +        irq_set->start = 0;
> > +        irq_set->count = 1;
> > +        fds = (int32_t *)&irq_set->data;
> > +        fds[0] = -1;
> > +
> > +        ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> > +
> > +        g_free(irq_set);
> > +
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +    }
>
> So your goal here is simply to get the kernel to call vfio_msi_enable()
> with nvec = 1 to get MSI-X enabled on the device, which then allows the
> kernel to use the dynamic expansion when we call SET_IRQS again with a
> potentially sparse set of eventfds to vector mappings.

Yes, that's the way I could think of to get MSI-X enabled first.

The only question is that, when getting the kernel to call
vfio_msi_enable() with nvec=1, the kernel will allocate one interrupt
along with enabling MSI-X, which cannot be avoided. Therefore, if we set
vector 0 for example, the irq for vector 0 will be allocated in the
kernel. Later, if vector 0 is unmasked in the guest, it is enabled as
normal; but if vector 0 stays masked in the guest, we leave an allocated
(though unenabled) irq there until MSI-X is disabled. I'm not sure if
this is okay, but I cannot think of a cleaner way. I also wonder whether
that case is even possible, or whether vector 0 always ends up enabled?

> This seems very similar to the nr_vectors == 0 branch of
> vfio_msix_enable() where it uses a do_use and release call to accomplish
> getting MSI-X enabled.

They are similar. Using do_use to set up userspace triggering also leaves
one allocated irq in the kernel. And my understanding is that the
subsequent release function actually won't release it, because it is a
userspace trigger.

static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
{
    /*
     * There are still old guests that mask and unmask vectors on every
     * interrupt. If we're using QEMU bypass with a KVM irqfd, leave all of
     * the KVM setup in place, simply switch VFIO to use the non-bypass
     * eventfd. We'll then fire the interrupt through QEMU and the MSI-X
     * core will mask the interrupt and set pending bits, allowing it to
     * be re-asserted on unmask. Nothing to do if already using QEMU mode.
     */
    ...
}

> We should consolidate, probably by pulling this out into a function since
> it seems cleaner to use the fd = -1 trick than to setup userspace
> triggering and immediately release. Thanks,

Oh, yes, I agree that using the fd = -1 trick is cleaner, and we don't
need to depend on the maskable bit in QEMU. Following your suggestion, I
will create a function, e.g. vfio_enable_msix_no_vec(vdev), which only
sets vector 0 with fd = -1 to the kernel, and returns the r
RE: [PATCH RFC v1 1/3] vfio/pci: detect the support of dynamic MSI-X allocation
Hi C. > On July 31, 2023 3:26 PM, Cédric Le Goater wrote: > > On 7/31/23 05:57, Liu, Jing2 wrote: > > Hi C. > > > >> On July 28, 2023 4:44 PM, Cédric Le Goater wrote: > >> > >> [ ... ] > >> > >>> Sorry I didn't quite understand "info.flags be tested against > >> VFIO_IRQ_INFO_NORESIZE". > >>> I saw kernel < 6.4 simply added NORESIZE to info.flags and latest > >>> kernel adds > >> if has_dyn_msix. > >>> Would you please kindly describe more on your point? > >> > >> I was trying to find the conditions to detect safely that the kernel > >> didn't have dynamic MSI-X support. Testing VFIO_IRQ_INFO_NORESIZE > seems enough. > >> > > Oh, I see. > > > >>>> In that case, QEMU should report an error and the trace event is > >>>> not > >> needed. > >>> > >>> I replied an email with new error handling draft code based on my > >>> understanding, which reports the error and need no trace. Could you > >>> please > >> help review if that is what we want? > >> > >> yes. It looked good. Please send a v1 ! > > > > Thanks for reviewing that. I guess you mean v2 for next version 😊 > > Well, if you remove the RFC status, I think you should, this would still be a > v1. > Oh, got it. Thanks for telling this. Jing > Thanks, > > C.
RE: [PATCH RFC v1 2/3] vfio/pci: enable vector on dynamic MSI-X allocation
Hi Alex, > On July 28, 2023 1:25 AM, Alex Williamson wrote: > > On Thu, 27 Jul 2023 03:24:09 -0400 > Jing Liu wrote: > > > The vector_use callback is used to enable vector that is unmasked in > > guest. The kernel used to only support static MSI-X allocation. When > > allocating a new interrupt using "static MSI-X allocation" kernels, > > Qemu first disables all previously allocated vectors and then > > re-allocates all including the new one. The nr_vectors of > > VFIOPCIDevice indicates that all vectors from 0 to nr_vectors are > > allocated (and may be enabled), which is used to to loop all the > > possibly used vectors When, e.g., disabling MSI-X interrupts. > > > > Extend the vector_use function to support dynamic MSI-X allocation > > when host supports the capability. Qemu therefore can individually > > allocate and enable a new interrupt without affecting others or > > causing interrupts lost during runtime. > > > > Utilize nr_vectors to calculate the upper bound of enabled vectors in > > dynamic MSI-X allocation mode since looping all msix_entries_nr is not > > efficient and unnecessary. > > > > Signed-off-by: Jing Liu > > Tested-by: Reinette Chatre > > --- > > hw/vfio/pci.c | 40 +++- > > 1 file changed, 27 insertions(+), 13 deletions(-) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > 0c4ac0873d40..8c485636445c 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -512,12 +512,20 @@ static int vfio_msix_vector_do_use(PCIDevice > *pdev, unsigned int nr, > > } > > > > /* > > - * We don't want to have the host allocate all possible MSI vectors > > - * for a device if they're not in use, so we shutdown and incrementally > > - * increase them as needed. > > + * When dynamic allocation is not supported, we don't want to have the > > + * host allocate all possible MSI vectors for a device if they're not > > + * in use, so we shutdown and incrementally increase them as needed. > > + * And nr_vectors stands for the number of vectors being allocated. > > "nr_vectors represents the total number of vectors allocated." Will change. > > > + * > > + * When dynamic allocation is supported, let the host only allocate > > + * and enable a vector when it is in use in guest. nr_vectors stands > > + * for the upper bound of vectors being enabled (but not all of the > > + * ranges is allocated or enabled). > > s/stands for/represents/ Will change. > > > */ > > -if (vdev->nr_vectors < nr + 1) { > > +if ((vdev->msix->irq_info_flags & VFIO_IRQ_INFO_NORESIZE) && > > Testing vdev->msix->noresize would be cleaner. 
> > > +(vdev->nr_vectors < nr + 1)) { > > vdev->nr_vectors = nr + 1; > > + > > if (!vdev->defer_kvm_irq_routing) { > > vfio_disable_irqindex(&vdev->vbasedev, > > VFIO_PCI_MSIX_IRQ_INDEX); > > ret = vfio_enable_vectors(vdev, true); @@ -529,16 +537,22 > > @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, > > Error *err = NULL; > > int32_t fd; > > > > -if (vector->virq >= 0) { > > -fd = event_notifier_get_fd(&vector->kvm_interrupt); > > -} else { > > -fd = event_notifier_get_fd(&vector->interrupt); > > -} > > +if (!vdev->defer_kvm_irq_routing) { > > +if (vector->virq >= 0) { > > +fd = event_notifier_get_fd(&vector->kvm_interrupt); > > +} else { > > +fd = event_notifier_get_fd(&vector->interrupt); > > +} > > > > -if (vfio_set_irq_signaling(&vdev->vbasedev, > > - VFIO_PCI_MSIX_IRQ_INDEX, nr, > > - VFIO_IRQ_SET_ACTION_TRIGGER, fd, > > &err)) { > > -error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name); > > +if (vfio_set_irq_signaling(&vdev->vbasedev, > > + VFIO_PCI_MSIX_IRQ_INDEX, nr, > > + VFIO_IRQ_SET_ACTION_TRIGGER, fd, > > &err)) { > > +error_reportf_err(err, VFIO_MSG_PREFIX, > > vdev->vbasedev.name); > > +} > > +} > > +/* Increase for dynamic allocation case. */ > > +if (vdev->nr_vectors < nr + 1) { > > +vdev->nr_vectors = nr + 1; > > } > > We now have two branches where the bulk of the code is skipped when > defer_kvm_irq_routing is enabled and doing effectively the same update to > nr_vectors otherwise. This suggests we should move the > defer_kvm_irq_routing test out and create a common place to update > nr_vectors. Thanks, I make a new logic as follows that moves the defer_kvm_irq_routing test out. Since the vfio_enable_vectors() function need an updated nr_vectors value so need first update and test the different conditions using old value, e.g. old_nr_vec. int old_nr_vec = vdev->nr_vectors; .
RE: [PATCH RFC v1 1/3] vfio/pci: detect the support of dynamic MSI-X allocation
Hi C. > On July 28, 2023 4:44 PM, Cédric Le Goater wrote: > > [ ... ] > > > Sorry I didn't quite understand "info.flags be tested against > VFIO_IRQ_INFO_NORESIZE". > > I saw kernel < 6.4 simply added NORESIZE to info.flags and latest kernel > > adds > if has_dyn_msix. > > Would you please kindly describe more on your point? > > I was trying to find the conditions to detect safely that the kernel didn't > have > dynamic MSI-X support. Testing VFIO_IRQ_INFO_NORESIZE seems enough. > Oh, I see. > >> In that case, QEMU should report an error and the trace event is not > needed. > > > > I replied an email with new error handling draft code based on my > > understanding, which reports the error and need no trace. Could you please > help review if that is what we want? > > yes. It looked good. Please send a v1 ! Thanks for reviewing that. I guess you mean v2 for next version 😊 Jing > > Thanks, > > Cédric.
RE: [PATCH RFC v1 1/3] vfio/pci: detect the support of dynamic MSI-X allocation
Hi Alex, > On July 28, 2023 11:41 PM, Alex Williamson wrote: > > On Fri, 28 Jul 2023 10:27:17 +0200 > Cédric Le Goater wrote: > > > On 7/28/23 10:09, Liu, Jing2 wrote: > > > Hi Alex, > > > > > > Thanks very much for reviewing the patches. > > > > > >> On July 28, 2023 1:25 AM, Alex Williamson > wrote: > > >> > > >> On Thu, 27 Jul 2023 03:24:08 -0400 > > >> Jing Liu wrote: > > >> > > >>> From: Reinette Chatre > > >>> > > >>> Kernel provides the guidance of dynamic MSI-X allocation support > > >>> of passthrough device, by clearing the VFIO_IRQ_INFO_NORESIZE flag > > >>> to guide user space. > > >>> > > >>> Fetch and store the flags from host for later use to determine if > > >>> specific flags are set. > > >>> > > >>> Signed-off-by: Reinette Chatre > > >>> Signed-off-by: Jing Liu > > >>> --- > > >>> hw/vfio/pci.c| 12 > > >>> hw/vfio/pci.h| 1 + > > >>> hw/vfio/trace-events | 2 ++ > > >>> 3 files changed, 15 insertions(+) > > >>> > > >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > >>> a205c6b1130f..0c4ac0873d40 100644 > > >>> --- a/hw/vfio/pci.c > > >>> +++ b/hw/vfio/pci.c > > >>> @@ -1572,6 +1572,7 @@ static void > > >>> vfio_msix_early_setup(VFIOPCIDevice > > >>> *vdev, Error **errp) > > >>> > > >>> static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error > > >>> **errp) { > > >>> +struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) > > >>> + }; > > >>> int ret; > > >>> Error *err = NULL; > > >>> > > >>> @@ -1624,6 +1625,17 @@ static int vfio_msix_setup(VFIOPCIDevice > > >>> *vdev, int > > >> pos, Error **errp) > > >>> memory_region_set_enabled(&vdev->pdev.msix_table_mmio, > false); > > >>> } > > >>> > > >>> +irq_info.index = VFIO_PCI_MSIX_IRQ_INDEX; > > >>> +ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, > &irq_info); > > >>> +if (ret) { > > >>> +/* This can fail for an old kernel or legacy PCI dev */ > > >>> + > > >>> + trace_vfio_msix_setup_get_irq_info_failure(strerror(errno)); > > >> > > >> We only call vfio_msix_setup() if the device has an MSI-X > > >> capability, so the "legacy PCI" portion of this comment seems > > >> unjustified. > > >> Otherwise the GET_IRQ_INFO ioctl has always existed, so I'd also > > >> question the "old kernel" part of this comment. > > > > > > Oh, yes, I just realize that only VFIO_PCI_ERR_IRQ_INDEX and > > > VFIO_PCI_REQ_IRQ_INDEX were added later in > > > include/uapi/linux/vfio.h. Thus, this ioctl() with MSIX index would not > > > fail by > the old-kernel or legacy-PCI reason. > > > Thanks for pointing out this to me. > > > > > > We don't currently sanity test the device > > >> exposed MSI-X info versus that reported by GET_IRQ_INFO, but it > > >> seems valid to do so. > > > > > > Do we want to keep the check of possible failure from kernel (e.g., > > > -EFAULT) and report the error code back to caller? Maybe like this, > > > > > > static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) > > > { > > > > > > msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; > > > > > > ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); > > > if (ret < 0) { > > > error_setg_errno(errp, -ret, "failed to get MSI-X IRQ INFO"); > > > return; > > Yes. > > > > } else { > > > vdev->msix->noresize = !!(irq_info.flags & > > > VFIO_IRQ_INFO_NORESIZE); > > > } > > No 'else' required here since the error branch returns. Oh, yes. Will remove the "else" and just set the ”noresize“ value in next version. > > > > ... > > > trace_vfio_msix_early_setup(vdev->vbasedev.name, pos, msix- > >table_bar, > > >
RE: [PATCH RFC v1 1/3] vfio/pci: detect the support of dynamic MSI-X allocation
Hi Alex, Thanks very much for reviewing the patches. > On July 28, 2023 1:25 AM, Alex Williamson wrote: > > On Thu, 27 Jul 2023 03:24:08 -0400 > Jing Liu wrote: > > > From: Reinette Chatre > > > > Kernel provides the guidance of dynamic MSI-X allocation support of > > passthrough device, by clearing the VFIO_IRQ_INFO_NORESIZE flag to > > guide user space. > > > > Fetch and store the flags from host for later use to determine if > > specific flags are set. > > > > Signed-off-by: Reinette Chatre > > Signed-off-by: Jing Liu > > --- > > hw/vfio/pci.c| 12 > > hw/vfio/pci.h| 1 + > > hw/vfio/trace-events | 2 ++ > > 3 files changed, 15 insertions(+) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index > > a205c6b1130f..0c4ac0873d40 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -1572,6 +1572,7 @@ static void vfio_msix_early_setup(VFIOPCIDevice > > *vdev, Error **errp) > > > > static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error > > **errp) { > > +struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) }; > > int ret; > > Error *err = NULL; > > > > @@ -1624,6 +1625,17 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int > pos, Error **errp) > > memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false); > > } > > > > +irq_info.index = VFIO_PCI_MSIX_IRQ_INDEX; > > +ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); > > +if (ret) { > > +/* This can fail for an old kernel or legacy PCI dev */ > > +trace_vfio_msix_setup_get_irq_info_failure(strerror(errno)); > > We only call vfio_msix_setup() if the device has an MSI-X capability, so the > "legacy PCI" portion of this comment seems unjustified. > Otherwise the GET_IRQ_INFO ioctl has always existed, so I'd also question the > "old kernel" part of this comment. Oh, yes, I just realize that only VFIO_PCI_ERR_IRQ_INDEX and VFIO_PCI_REQ_IRQ_INDEX were added later in include/uapi/linux/vfio.h. Thus, this ioctl() with MSIX index would not fail by the old-kernel or legacy-PCI reason. Thanks for pointing out this to me. We don't currently sanity test the device > exposed MSI-X info versus that reported by GET_IRQ_INFO, but it seems valid to > do so. Do we want to keep the check of possible failure from kernel (e.g., -EFAULT) and report the error code back to caller? Maybe like this, static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) { msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); if (ret < 0) { error_setg_errno(errp, -ret, "failed to get MSI-X IRQ INFO"); return; } else { vdev->msix->noresize = !!(irq_info.flags & VFIO_IRQ_INFO_NORESIZE); } ... trace_vfio_msix_early_setup(vdev->vbasedev.name, pos, msix->table_bar, msix->table_offset, msix->entries, vdev->msix->noresize); } > I'd expect this to happen in vfio_msix_early_setup() though, especially > since that's where the remainder of VFIOMSIXInfo is setup. > > > +} else { > > +vdev->msix->irq_info_flags = irq_info.flags; > > +} > > +trace_vfio_msix_setup_irq_info_flags(vdev->vbasedev.name, > > + vdev->msix->irq_info_flags); > > + > > return 0; > > } > > > > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index > > a2771b9ff3cc..ad34ec56d0ae 100644 > > --- a/hw/vfio/pci.h > > +++ b/hw/vfio/pci.h > > @@ -113,6 +113,7 @@ typedef struct VFIOMSIXInfo { > > uint32_t table_offset; > > uint32_t pba_offset; > > unsigned long *pending; > > +uint32_t irq_info_flags; > > Why not simply pull out a "noresize" bool? Thanks, > Will change to a bool type. 
Thanks, Jing > Alex > > > } VFIOMSIXInfo; > > > > #define TYPE_VFIO_PCI "vfio-pci" > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index > > ee7509e68e4f..7d4a398f044d 100644 > > --- a/hw/vfio/trace-events > > +++ b/hw/vfio/trace-events > > @@ -28,6 +28,8 @@ vfio_pci_read_config(const char *name, int addr, int > > len, int val) " (%s, @0x%x, vfio_pci_write_config(const char *name, int > > addr, int > val, int len) " (%s, @0x%x, 0x%x, len=0x%x)" > > vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x" > > vfio_msix_early_setup(const char *name, int pos, int table_bar, int > > offset, int > entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d" > > +vfio_msix_setup_get_irq_info_failure(const char *errstr) > "VFIO_DEVICE_GET_IRQ_INFO failure: %s" > > +vfio_msix_setup_irq_info_flags(const char *name, uint32_t flags) " (%s) > > MSI-X > irq info flags 0x%x" > > vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap" > > vfio_check_pm_reset(const char *name) "%s Supports PM reset" > > vfio_check_af_flr(const char *name) "%s Supports FLR via AF cap"
RE: [PATCH RFC v1 1/3] vfio/pci: detect the support of dynamic MSI-X allocation
Hi C.,

Thanks very much for reviewing the patches.

> On July 28, 2023 12:58 AM, Cédric Le Goater wrote:
>
> Hello Jing,
>
> On 7/27/23 09:24, Jing Liu wrote:
> > From: Reinette Chatre
> >
> > Kernel provides the guidance of dynamic MSI-X allocation support of
> > passthrough device, by clearing the VFIO_IRQ_INFO_NORESIZE flag to
> > guide user space.
> >
> > Fetch and store the flags from host for later use to determine if
> > specific flags are set.
> >
> > Signed-off-by: Reinette Chatre
> > Signed-off-by: Jing Liu
> > ---
> >  hw/vfio/pci.c        | 12 
> >  hw/vfio/pci.h        | 1 +
> >  hw/vfio/trace-events | 2 ++
> >  3 files changed, 15 insertions(+)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index a205c6b1130f..0c4ac0873d40 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1572,6 +1572,7 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >
> >  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >  {
> > +    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
> >      int ret;
> >      Error *err = NULL;
> >
> > @@ -1624,6 +1625,17 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >          memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> >      }
> >
> > +    irq_info.index = VFIO_PCI_MSIX_IRQ_INDEX;
> > +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> > +    if (ret) {
> > +        /* This can fail for an old kernel or legacy PCI dev */
> > +        trace_vfio_msix_setup_get_irq_info_failure(strerror(errno));
>
> Is it possible to detect the error reported by a kernel (< 6.4) when
> dynamic MSI-X are not supported.

I just realized that ioctl(VFIO_DEVICE_GET_IRQ_INFO) with
VFIO_PCI_MSIX_IRQ_INDEX should always exist and would not fail for the
older-kernel reason, after reading Alex's comment on this patch. (If my
understanding is correct.) So a possible failure would be for some reason
other than an old kernel or a legacy PCI device.

> Looking at vfio_pci_ioctl_get_irq_info() in the kernel, could info.flags
> be tested against VFIO_IRQ_INFO_NORESIZE ?

Sorry, I didn't quite understand "info.flags be tested against
VFIO_IRQ_INFO_NORESIZE". I saw that kernels < 6.4 simply add NORESIZE to
info.flags, while the latest kernel sets it conditionally on
has_dyn_msix. Would you please kindly describe your point in more detail?

> In that case, QEMU should report an error and the trace event is not
> needed.

I replied in an email with new error-handling draft code based on my
understanding, which reports the error and needs no trace. Could you
please help review whether that is what we want?

Thanks,
Jing

> Thanks,
> C.

> > +    } else {
> > +        vdev->msix->irq_info_flags = irq_info.flags;
> > +    }
> > +    trace_vfio_msix_setup_irq_info_flags(vdev->vbasedev.name,
> > +                                         vdev->msix->irq_info_flags);
> > +
> >      return 0;
> >  }
> >
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index a2771b9ff3cc..ad34ec56d0ae 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -113,6 +113,7 @@ typedef struct VFIOMSIXInfo {
> >      uint32_t table_offset;
> >      uint32_t pba_offset;
> >      unsigned long *pending;
> > +    uint32_t irq_info_flags;
> >  } VFIOMSIXInfo;
> >
> >  #define TYPE_VFIO_PCI "vfio-pci"
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index ee7509e68e4f..7d4a398f044d 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -28,6 +28,8 @@ vfio_pci_read_config(const char *name, int addr, int len, int val) " (%s, @0x%x,
> >  vfio_pci_write_config(const char *name, int addr, int val, int len) " (%s, @0x%x, 0x%x, len=0x%x)"
> >  vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x"
> >  vfio_msix_early_setup(const char *name, int pos, int table_bar, int offset, int entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d"
> > +vfio_msix_setup_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
> > +vfio_msix_setup_irq_info_flags(const char *name, uint32_t flags) " (%s) MSI-X irq info flags 0x%x"
> >  vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap"
> >  vfio_check_pm_reset(const char *name) "%s Supports PM reset"
> >  vfio_check_af_flr(const char *name) "%s Supports FLR via AF cap"
Re: [virtio-dev] Re: [PATCH v2 4/5] virtio-mmio: add MSI interrupt feature support
On 2/11/2020 3:40 PM, Jason Wang wrote:

On 2020/2/11 2:02 PM, Liu, Jing2 wrote:

On 2/11/2020 12:02 PM, Jason Wang wrote:

On 2020/2/11 11:35 AM, Liu, Jing2 wrote:

On 2/11/2020 11:17 AM, Jason Wang wrote:

On 2020/2/10 5:05 PM, Zha Bin wrote:

From: Liu Jiang

Userspace VMMs (e.g. Qemu microvm, Firecracker) take advantage of using virtio over mmio devices as a lightweight machine model for modern cloud. The standard virtio over MMIO transport layer only supports one legacy interrupt, which is much heavier than virtio over PCI transport layer using MSI. Legacy interrupt has long work path and causes specific VMExits in following cases, which would considerably slow down the performance: 1) read interrupt status register 2) update interrupt status register 3) write IOAPIC EOI register

We proposed to add MSI support for virtio over MMIO via new feature bit VIRTIO_F_MMIO_MSI[1] which increases the interrupt performance.

With the VIRTIO_F_MMIO_MSI feature bit supported, the virtio-mmio MSI uses msi_sharing[1] to indicate the event and vector mapping. Bit 1 is 0: device uses non-sharing and fixed vector per event mapping. Bit 1 is 1: device uses sharing mode and dynamic mapping.

I believe dynamic mapping should cover the case of fixed vector?

Actually this bit *aims* for msi sharing or msi non-sharing. It means, when msi sharing bit is 1, device doesn't want vector per queue (it wants msi vector sharing as name) and doesn't want a high interrupt rate. So driver turns to !per_vq_vectors and has to do dynamical mapping. So they are opposite not superset.

Thanks!
Jing

I think you need add more comments on the command. E.g if I want to map vector 0 to queue 1, how do I need to do?

write(1, queue_sel);
write(0, vector_sel);

That's true. Besides, two commands are used for msi sharing mode, VIRTIO_MMIO_MSI_CMD_MAP_CONFIG and VIRTIO_MMIO_MSI_CMD_MAP_QUEUE. "To set up the event and vector mapping for MSI sharing mode, driver SHOULD write a valid MsiVecSel followed by VIRTIO_MMIO_MSI_CMD_MAP_CONFIG/VIRTIO_MMIO_MSI_CMD_MAP_QUEUE command to map the configuration change/selected queue events respectively." (See spec patch 5/5)

So if driver detects the msi sharing mode, when it does setup vq, writes the queue_sel (this already exists in setup vq), vector sel and then MAP_QUEUE command to do the queue event mapping.

So actually the per vq msix could be done through this.

Right, per vq msix can also be mapped by the 2 commands if we want. The current design benefits devices requesting per-vq msi: the driver doesn't have to map each queue during setup, since we define the relationship by default.

I don't get why you need to introduce MSI_SHARING_MASK which is the charge of driver instead of device.

MSI_SHARING_MASK is just for identifying the msi_sharing bit in readl(MsiState) (0x0c4). The device tells whether it wants msi_sharing.

MsiState register: R le32 {
    msi_enabled : 1;
    msi_sharing : 1;
    reserved    : 30;
};

The interrupt rate should have no direct relationship with whether it has been shared or not. Btw, you introduce mask/unmask without pending, how to deal with the lost interrupt during the masking then?

For msi non-sharing mode, no special action is needed because we make the rule of per_vq_vector and fixed relationship. Correct me if this is not that clear for spec/code comments.

The ABI is not as straightforward as PCI did. Why not just reuse the PCI layout? E.g having

queue_sel
queue_msix_vector
msix_config

for configuring map between msi vector and queues/config

Thanks for the advice. :) Actually when looking into pci, the queue_msix_vector/msix_config is the msi vector index, which is the same as the mmio register MsiVecSel (0x0d0). So we don't introduce two extra registers for mapping even in sharing mode. What do you think?

Then

vector_sel
address
data
pending
mask
unmask

for configuring msi table?

A PCI-like msix table is not introduced to the device; instead, commands are simply used to tell it the mask/configure/enable.

Thanks!
Jing
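A driver-side sketch of the detection discussed above. The register offset (0x0c4) and the bit layout come from the thread itself; the macro and function names are illustrative, since this is an RFC rather than an accepted spec:

    #include <linux/io.h>
    #include <linux/types.h>

    #define VIRTIO_MMIO_MSI_STATE   0x0c4       /* per the thread above */
    #define MSI_ENABLED_MASK        (1u << 0)
    #define MSI_SHARING_MASK        (1u << 1)

    /*
     * Bit 1 set: shared vectors, driver must map events dynamically via
     * the MAP_* commands.  Bit 1 clear: fixed per-event vectors, no
     * mapping commands needed.
     */
    static bool vm_msi_is_sharing(void __iomem *base)
    {
        u32 state = readl(base + VIRTIO_MMIO_MSI_STATE);

        return !!(state & MSI_SHARING_MASK);
    }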
Re: [virtio-dev] Re: [PATCH v2 2/5] virtio-mmio: refactor common functionality
On 2/11/2020 7:19 PM, Michael S. Tsirkin wrote:

On Mon, Feb 10, 2020 at 05:05:18PM +0800, Zha Bin wrote:

From: Liu Jiang

Common functionality is refactored into virtio_mmio_common.h in order to add MSI support in a later patch set.

Signed-off-by: Liu Jiang
Co-developed-by: Zha Bin
Signed-off-by: Zha Bin
Co-developed-by: Jing Liu
Signed-off-by: Jing Liu
Co-developed-by: Chao Peng
Signed-off-by: Chao Peng

What does this proliferation of header files achieve? common with what?

We think the virtio-mmio structures are also useful for the virtio-mmio MSI file, so we refactored them out, e.g. to get the base virtio_mmio_device from a struct msi_desc *desc.

Jing

---
 drivers/virtio/virtio_mmio.c        | 21 +
 drivers/virtio/virtio_mmio_common.h | 31 +++
 2 files changed, 32 insertions(+), 20 deletions(-)
 create mode 100644 drivers/virtio/virtio_mmio_common.h

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 1733ab97..41e1c93 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -61,13 +61,12 @@
 #include
 #include
 #include
-#include
 #include
 #include
-#include
 #include
 #include
 #include
+#include "virtio_mmio_common.h"

@@ -77,24 +76,6 @@

-#define to_virtio_mmio_device(_plat_dev) \
-	container_of(_plat_dev, struct virtio_mmio_device, vdev)
-
-struct virtio_mmio_device {
-	struct virtio_device vdev;
-	struct platform_device *pdev;
-
-	void __iomem *base;
-	unsigned long version;
-
-	/* a list of queues so we can dispatch IRQs */
-	spinlock_t lock;
-	struct list_head virtqueues;
-
-	unsigned short notify_base;
-	unsigned short notify_multiplier;
-};
-
 struct virtio_mmio_vq_info {
 	/* the actual virtqueue */
 	struct virtqueue *vq;

diff --git a/drivers/virtio/virtio_mmio_common.h b/drivers/virtio/virtio_mmio_common.h
new file mode 100644
index 000..90cb304
--- /dev/null
+++ b/drivers/virtio/virtio_mmio_common.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _DRIVERS_VIRTIO_VIRTIO_MMIO_COMMON_H
+#define _DRIVERS_VIRTIO_VIRTIO_MMIO_COMMON_H
+/*
+ * Virtio MMIO driver - common functionality for all device versions
+ *
+ * This module allows virtio devices to be used over a memory-mapped device.
+ */
+
+#include
+#include
+
+#define to_virtio_mmio_device(_plat_dev) \
+	container_of(_plat_dev, struct virtio_mmio_device, vdev)
+
+struct virtio_mmio_device {
+	struct virtio_device vdev;
+	struct platform_device *pdev;
+
+	void __iomem *base;
+	unsigned long version;
+
+	/* a list of queues so we can dispatch IRQs */
+	spinlock_t lock;
+	struct list_head virtqueues;
+
+	unsigned short notify_base;
+	unsigned short notify_multiplier;
+};
+
+#endif
--
1.8.3.1
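A hedged sketch of the kind of user this header anticipates: code in a separate virtio_mmio_msi file getting from a struct msi_desc back to the virtio_mmio_device. This assumes the probe path stored the device via platform_set_drvdata(), as the upstream virtio_mmio.c probe does; the helper name is illustrative, not from the RFC series:

    #include <linux/msi.h>
    #include <linux/platform_device.h>
    #include "virtio_mmio_common.h"

    static struct virtio_mmio_device *desc_to_vm_dev(struct msi_desc *desc)
    {
        /*
         * msi_desc_to_dev() returns the struct device the MSI descriptor
         * belongs to; its drvdata is assumed to have been set to the
         * virtio_mmio_device at probe time.
         */
        return dev_get_drvdata(msi_desc_to_dev(desc));
    }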
Re: [virtio-dev] Re: [PATCH v2 4/5] virtio-mmio: add MSI interrupt feature support
On 2/11/2020 12:02 PM, Jason Wang wrote: On 2020/2/11 11:35 AM, Liu, Jing2 wrote: On 2/11/2020 11:17 AM, Jason Wang wrote: On 2020/2/10 5:05 PM, Zha Bin wrote: From: Liu Jiang

Userspace VMMs (e.g. QEMU microvm, Firecracker) take advantage of virtio over MMIO devices as a lightweight machine model for the modern cloud. The standard virtio-over-MMIO transport layer only supports one legacy interrupt, which is much heavier than the virtio-over-PCI transport layer using MSI. A legacy interrupt has a long work path and causes extra VMExits in the following cases, which considerably slow down performance: 1) read interrupt status register 2) update interrupt status register 3) write IOAPIC EOI register. We proposed adding MSI support for virtio over MMIO via the new feature bit VIRTIO_F_MMIO_MSI[1], which improves interrupt performance. With the VIRTIO_F_MMIO_MSI feature bit supported, virtio-mmio MSI uses msi_sharing[1] to indicate the event and vector mapping. Bit 1 is 0: the device uses non-sharing mode with a fixed vector-per-event mapping. Bit 1 is 1: the device uses sharing mode with dynamic mapping.

I believe dynamic mapping should cover the case of fixed vectors?

Actually this bit *aims* at MSI sharing vs. MSI non-sharing. It means that when the msi_sharing bit is 1, the device doesn't want a vector per queue (it wants MSI vector sharing, as the name says) and doesn't want a high interrupt rate. So the driver turns to !per_vq_vectors and has to do dynamic mapping. So they are opposites, not a superset. Thanks! Jing

I think you need to add more comments on the command. E.g. if I want to map vector 0 to queue 1, what do I need to do? write(1, queue_sel); write(0, vector_sel);

That's true. Besides, two commands are used for MSI sharing mode, VIRTIO_MMIO_MSI_CMD_MAP_CONFIG and VIRTIO_MMIO_MSI_CMD_MAP_QUEUE. "To set up the event and vector mapping for MSI sharing mode, the driver SHOULD write a valid MsiVecSel followed by the VIRTIO_MMIO_MSI_CMD_MAP_CONFIG/VIRTIO_MMIO_MSI_CMD_MAP_QUEUE command to map the configuration change/selected queue events respectively." (See spec patch 5/5.) So if the driver detects MSI sharing mode, then when it sets up a vq it writes queue_sel (this write already exists in vq setup), then the vector selector, and then the MAP_QUEUE command to map the queue event. For MSI non-sharing mode, no special action is needed, because we establish the rule of per-VQ vectors with a fixed relationship. Correct me if this is not clear enough in the spec/code comments. Thanks! Jing
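For reference, a small sketch of the driver-side check being described: read MsiState (0x0c4, per the thread) and branch on the sharing bit. The register offset follows the proposal; the mask names are taken from the discussion and should be treated as illustrative.

#include <linux/io.h>
#include <linux/bits.h>
#include <linux/types.h>

#define VIRTIO_MMIO_MSI_STATE  0x0c4    /* MsiState, per the proposal */
#define MSI_ENABLED_MASK       BIT(0)
#define MSI_SHARING_MASK       BIT(1)

/* Returns true when the device asks for shared vectors, i.e. the
 * driver should use !per_vq_vectors and the MAP_* commands. */
static bool vm_msi_is_sharing(void __iomem *base)
{
	u32 state = readl(base + VIRTIO_MMIO_MSI_STATE);

	return state & MSI_SHARING_MASK;
}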
Re: [virtio-dev] Re: [PATCH v2 4/5] virtio-mmio: add MSI interrupt feature support
On 2/11/2020 11:17 AM, Jason Wang wrote: On 2020/2/10 5:05 PM, Zha Bin wrote: From: Liu Jiang

Userspace VMMs (e.g. QEMU microvm, Firecracker) take advantage of virtio over MMIO devices as a lightweight machine model for the modern cloud. The standard virtio-over-MMIO transport layer only supports one legacy interrupt, which is much heavier than the virtio-over-PCI transport layer using MSI. A legacy interrupt has a long work path and causes extra VMExits in the following cases, which considerably slow down performance: 1) read interrupt status register 2) update interrupt status register 3) write IOAPIC EOI register. We proposed adding MSI support for virtio over MMIO via the new feature bit VIRTIO_F_MMIO_MSI[1], which improves interrupt performance. With the VIRTIO_F_MMIO_MSI feature bit supported, virtio-mmio MSI uses msi_sharing[1] to indicate the event and vector mapping. Bit 1 is 0: the device uses non-sharing mode with a fixed vector-per-event mapping. Bit 1 is 1: the device uses sharing mode with dynamic mapping.

I believe dynamic mapping should cover the case of fixed vectors?

Actually this bit *aims* at MSI sharing vs. MSI non-sharing. It means that when the msi_sharing bit is 1, the device doesn't want a vector per queue (it wants MSI vector sharing, as the name says) and doesn't want a high interrupt rate. So the driver turns to !per_vq_vectors and has to do dynamic mapping. So they are opposites, not a superset. Thanks! Jing
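To ground the "long work path" claim: the legacy virtio-mmio interrupt handler (reproduced here from memory of the upstream driver and lightly annotated; see drivers/virtio/virtio_mmio.c for the authoritative version) touches two device registers per interrupt, each a trap to the VMM, before the IOAPIC EOI write the list above also counts.

static irqreturn_t vm_interrupt(int irq, void *opaque)
{
	struct virtio_mmio_device *vm_dev = opaque;
	struct virtio_mmio_vq_info *info;
	unsigned long status;
	irqreturn_t ret = IRQ_NONE;

	/* VMExit 1: read the interrupt status register */
	status = readl(vm_dev->base + VIRTIO_MMIO_INTERRUPT_STATUS);
	/* VMExit 2: update (acknowledge) the interrupt status register */
	writel(status, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);

	if (unlikely(status & VIRTIO_MMIO_INT_CONFIG)) {
		virtio_config_changed(&vm_dev->vdev);
		ret = IRQ_HANDLED;
	}

	if (likely(status & VIRTIO_MMIO_INT_VRING)) {
		spin_lock(&vm_dev->lock);
		list_for_each_entry(info, &vm_dev->virtqueues, node)
			ret |= vring_interrupt(irq, info->vq);
		spin_unlock(&vm_dev->lock);
	}

	return ret;
}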
Re: [virtio-dev] [PATCH v2 5/5] virtio-mmio: MSI vector and event mapping
On 1/29/2020 6:14 PM, Michael S. Tsirkin wrote: On Tue, Jan 21, 2020 at 09:54:33PM +0800, Jing Liu wrote: Bit 1 (msi_sharing), reported in the MsiState register, indicates the mapping mode the device uses. Bit 1 is 0 - the device uses MSI non-sharing mode. This indicates a vector per event with a fixed, static vectors-to-events relationship. It suits devices with a high interrupt rate and gives the best performance. Bit 1 is 1 - the device uses MSI sharing mode. This indicates dynamic vectors-to-events mapping and suits devices not requiring a high interrupt rate.

Thanks for reviewing! Sorry for the late reply. For MSI sharing/non-sharing mode, we are trying to define a rule that uses a single bit to indicate what the device wants and what the driver should do. Let me try to explain the details. If it's sharing mode (bit=1), it means the device doesn't want a vector per queue or a high interrupt rate. The driver should meet such a request. So we define that the device should support at least 2 MSI vectors, and the driver should use the VIRTIO_MMIO_MSI_CMD_MAP_CONFIG/VIRTIO_MMIO_MSI_CMD_MAP_QUEUE commands to map the configuration change/selected queue events. In the implementation, the driver will detect bit=1, turn to !per_vq_vectors, and map 1 vector for config and 1 for all queues to meet the request. (Maybe we need to restrict this in the spec too.)

It seems that sharing mode is a superset of non-sharing mode.

I think sharing mode is not a superset, because the driver should not map 1:1; that would go against the device's request that it doesn't want 1:1 (non-sharing mode). In other words, the device chooses the way it wants: a high interrupt rate with 1:1, or not. Isn't that right?

E.g. with sharing mode drivers can still avoid sharing if they like.

Drivers should not avoid MSI sharing if that bit is 1.

Maybe it should just be a hint to drivers whether to share interrupts,

The sharing mode bit is the hint to indicate whether to share interrupts. :)

instead of a completely different layout?

Apart from this bit, there is no other register layout difference for this feature. Thanks! Jing

diff --git a/msi-state.c b/msi-state.c
index b1fa0c1..d470be4 100644
--- a/msi-state.c
+++ b/msi-state.c
@@ -1,4 +1,5 @@
 le32 {
 msi_enabled : 1;
-reserved : 31;
+msi_sharing: 1;
+reserved : 30;
 };
--
2.7.4
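A sketch of the sharing-mode policy described above (one vector for config changes, one vector shared by every queue). It reuses MsiVecSel and QUEUE_SEL from the thread plus the same assumed command register and encodings as the earlier sketch; none of this is final spec text, and the helper name is hypothetical.

#define VIRTIO_MMIO_MSI_CMD_MAP_CONFIG 0x2  /* assumed command encoding */

static void vm_setup_shared_mapping(void __iomem *base, unsigned int nvqs)
{
	unsigned int q;

	/* vector 0 -> configuration change event */
	writel(0, base + VIRTIO_MMIO_MSI_VEC_SEL);
	writel(VIRTIO_MMIO_MSI_CMD_MAP_CONFIG, base + VIRTIO_MMIO_MSI_COMMAND);

	/* vector 1 -> every queue event, i.e. the shared vector */
	for (q = 0; q < nvqs; q++) {
		writel(q, base + VIRTIO_MMIO_QUEUE_SEL);
		writel(1, base + VIRTIO_MMIO_MSI_VEC_SEL);
		writel(VIRTIO_MMIO_MSI_CMD_MAP_QUEUE, base + VIRTIO_MMIO_MSI_COMMAND);
	}
}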
Re: [Qemu-devel] [Issues] PCI hotplug does not work well on pc platform?
Hi Marcel and Igor, Thanks very much for your help!

On 2/14/2019 10:42 PM, Marcel Apfelbaum wrote: [...]

I have two questions. 1. PCI hotplug on pci.0 requires a manual rescan in the guest. The ACPI hotplug handler sends the GPE event to the guest, but it seems the guest doesn't receive it? I tried setting the ACPI debug level/layer to 0x, in order to see whether there is any message after device_add in the monitor, but no message comes out until I manually rescan. Also tried printk in acpi_ev_gpe_xrupt_handler() and acpi_ev_sci_xrupt_handler(). No output in dmesg. (I'm sure that CONFIG_HOTPLUG_PCI_PCIE=y, CONFIG_HOTPLUG_PCI_CPCI=y, CONFIG_HOTPLUG_PCI=y, CONFIG_HOTPLUG_PCI_ACPI=y.)

What about CONFIG_HOTPLUG_PCI_SHPC=y?

Thanks, and I finally found it was because I had set hardware-reduced ACPI before; now it is correct.

Is this by design, or a known issue? Does the guest receive the request, and where can I find the

does it work with known-to-work kernels (RHEL7)? Also, sharing the QEMU version and command line used could help.

Is there any key kernel config in the guest, besides those I listed above?

Maybe Marcel knows something about it (CCed). May I ask why you need SHPC hotplug and not the ACPI-based hotplug?

Oh, it is a customer's request. And the method of setting the PIIX4_PM property works. Thanks! Jing
Re: [Qemu-devel] [Issues] PCI hotplug does not work well on pc platform?
Hi Igor, Thanks for your reply!

On 2/5/2019 11:47 PM, Igor Mammedov wrote: On Wed, 30 Jan 2019 21:02:10 +0800 "Liu, Jing2" wrote: Hi everyone, I have two questions. 1. PCI hotplug on pci.0 requires a manual rescan in the guest. The ACPI hotplug handler sends the GPE event to the guest, but it seems the guest doesn't receive it? I tried setting the ACPI debug level/layer to 0x, in order to see whether there is any message after device_add in the monitor, but no message comes out until I manually rescan. Also tried printk in acpi_ev_gpe_xrupt_handler() and acpi_ev_sci_xrupt_handler(). No output in dmesg. (I'm sure that CONFIG_HOTPLUG_PCI_PCIE=y, CONFIG_HOTPLUG_PCI_CPCI=y, CONFIG_HOTPLUG_PCI=y, CONFIG_HOTPLUG_PCI_ACPI=y.) Is this by design, or a known issue? Does the guest receive the request, and where can I find the

does it work with known-to-work kernels (RHEL7)? Also, sharing the QEMU version and command line used could help.

Is there any key kernel config in the guest, besides those I listed above?

I used guest kernel v4.18 and QEMU upstream v3.1.0. Command line:

sudo /home/xxx/qemu/x86_64-softmmu/qemu-system-x86_64 \
 -machine pc,accel=kvm,kernel_irqchip -cpu host -m 1G,slots=2,maxmem=10G \
 -nographic -no-user-config -nodefaults -vga none \
 -drive file=/home/xxx/image/clear-24690-kvm.img,if=virtio,format=raw \
 -smp sockets=1,cpus=4,cores=2,maxcpus=8 \
 -device virtio-serial-pci,id=virtio-serial0,disable-modern,addr=0x5 \
 -monitor tcp:0.0.0.0:5000,nowait,server \
 -chardev stdio,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 \
 -kernel /home/xxx/linux-stable/arch/x86/boot/bzImage \
 -append 'root=/dev/vda3 rw rootfstype=ext4 data=ordered rcupdate.rcu_expedited=1 pci=lastbus=0 pci=realloc=on tsc=reliable no_timer_check reboot=t noapictimer console=hvc0 iommu=off panic=1 initcall_debug acpi.debug_layer=0x4 acpi.debug_level=4 ' \
 -device pci-bridge,bus=pci.0,id=br1,chassis_nr=1,shpc=on,addr=6 \

2. I want to try hotplugging on a pci-bridge on the pc platform, using SHPC. I set shpc=on, but when I do device_add, QEMU still calls acpi_pcihp_device_plug_cb? Why does it not call pci_bridge_dev_hotplug_cb? (CONFIG_HOTPLUG_PCI_SHPC=y)

try to disable ACPI hotplug for bridges: -global PIIX4_PM.acpi-pci-hotplug-with-bridge-support=off

I'll try it! Thanks! Jing
[Qemu-devel] [Issues] PCI hotplug does not work well on pc platform?
Hi everyone, I have two questions.

1. PCI hotplug on pci.0 requires a manual rescan in the guest. The ACPI hotplug handler sends the GPE event to the guest, but it seems the guest doesn't receive it? I tried setting the ACPI debug level/layer to 0x, in order to see whether there is any message after device_add in the monitor, but no message comes out until I manually rescan. Also tried printk in acpi_ev_gpe_xrupt_handler() and acpi_ev_sci_xrupt_handler(). No output in dmesg. (I'm sure that CONFIG_HOTPLUG_PCI_PCIE=y, CONFIG_HOTPLUG_PCI_CPCI=y, CONFIG_HOTPLUG_PCI=y, CONFIG_HOTPLUG_PCI_ACPI=y.) Is this by design, or a known issue? Does the guest receive the request, and where can I find the

2. I want to try hotplugging on a pci-bridge on the pc platform, using SHPC. I set shpc=on, but when I do device_add, QEMU still calls acpi_pcihp_device_plug_cb? Why does it not call pci_bridge_dev_hotplug_cb? (CONFIG_HOTPLUG_PCI_SHPC=y)

Thanks very much! Jing
[Qemu-devel] [Question] Hotplug rescan/remove by manual
Hi guys, When I tested hotplug on pci.0 on the pc platform, it seems that we have to "echo 1 > /sys/bus/pci/rescan" manually in the guest? Is this an incomplete feature, perhaps due to something like a GPE interrupt issue? For hot-unplug, I used "echo 1 > /sys/bus/pci//remove", but this seems to only remove the device in the guest; we can still see it in the QEMU monitor via "info pci". I guess this is not the right way to unplug. I'm not sure whether I missed something? Looking forward to your response. Thanks! Jing
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Hi Marcel and Michael, Re-pinging in case this is not in the PR list yet :) Thanks, Jing

On 9/6/2018 10:16 AM, Liu, Jing2 wrote: Hi Marcel, On 9/6/2018 12:36 AM, Marcel Apfelbaum wrote: On 09/05/2018 05:08 AM, Liu, Jing2 wrote: Hi Marcel and Michael, Got no response, so I would like to ask whether I need to do something more for this series? :)

Hi Jing, Maybe Michael is on PTO, let's wait a few more days.

Thank you, I got it :) BRs, Jing

Michael, I can send a pull request for this series if you are busy. Thanks, Marcel

Thanks, Jing

On 8/30/2018 10:58 AM, Liu, Jing2 wrote: Ping Michael :) Thanks, Jing

On 8/25/2018 12:51 AM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push?

You just need to wait until Michael adds the series to his next pull request. Nothing more to do.

OK, got it! Thanks!

Please ping Michael if your code is not merged in a week or so.

BTW, do you have any suggestions on the SeaBIOS counterpart patches? https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

I plan to have a look this weekend. Thanks, Marcel
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Hi Marcel, On 9/6/2018 12:36 AM, Marcel Apfelbaum wrote: On 09/05/2018 05:08 AM, Liu, Jing2 wrote: Hi Marcel and Michael, Got no response, so I would like to ask whether I need to do something more for this series? :)

Hi Jing, Maybe Michael is on PTO, let's wait a few more days.

Thank you, I got it :) BRs, Jing

Michael, I can send a pull request for this series if you are busy. Thanks, Marcel

Thanks, Jing

On 8/30/2018 10:58 AM, Liu, Jing2 wrote: Ping Michael :) Thanks, Jing

On 8/25/2018 12:51 AM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push?

You just need to wait until Michael adds the series to his next pull request. Nothing more to do.

OK, got it! Thanks!

Please ping Michael if your code is not merged in a week or so.

BTW, do you have any suggestions on the SeaBIOS counterpart patches? https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

I plan to have a look this weekend. Thanks, Marcel
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Hi Marcel and Michael, Got no response, so I would like to ask whether I need to do something more for this series? :) Thanks, Jing

On 8/30/2018 10:58 AM, Liu, Jing2 wrote: Ping Michael :) Thanks, Jing

On 8/25/2018 12:51 AM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push?

You just need to wait until Michael adds the series to his next pull request. Nothing more to do.

OK, got it! Thanks!

Please ping Michael if your code is not merged in a week or so.

BTW, do you have any suggestions on the SeaBIOS counterpart patches? https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

I plan to have a look this weekend. Thanks, Marcel
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Ping Michael :) Thanks, Jing

On 8/25/2018 12:51 AM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push?

You just need to wait until Michael adds the series to his next pull request. Nothing more to do.

OK, got it! Thanks!

Please ping Michael if your code is not merged in a week or so.

BTW, do you have any suggestions on the SeaBIOS counterpart patches? https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

I plan to have a look this weekend. Thanks, Marcel
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Hi Marcel, On 8/25/2018 12:51 AM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push?

You just need to wait until Michael adds the series to his next pull request. Nothing more to do.

OK, got it! Thanks!

Please ping Michael if your code is not merged in a week or so.

Got it, I will check with "git pull" to see whether the master branch has it.

BTW, do you have any suggestions on the SeaBIOS counterpart patches? https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

I plan to have a look this weekend.

Glad to see your comments! Thanks very much, Jing

Thanks, Marcel
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Hi Marcel, On 8/22/2018 2:58 PM, Marcel Apfelbaum wrote: Hi Jing, On 08/22/2018 04:53 AM, Liu, Jing2 wrote: Hi Marcel, On 8/21/2018 5:59 PM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push?

You just need to wait until Michael adds the series to his next pull request. Nothing more to do.

OK, got it! Thanks!

BTW, do you have any suggestions on the SeaBIOS counterpart patches? https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

Jing

Thanks, Marcel
Re: [Qemu-devel] [PATCH v3 2/2] hw/pci: add PCI resource reserve capability to legacy PCI bridge
Hi Marcel, On 8/21/2018 5:59 PM, Marcel Apfelbaum wrote: On 08/21/2018 06:18 AM, Jing Liu wrote: Add a hint to firmware (e.g. SeaBIOS) to reserve additional BUS/IO/MEM/PREF resources for the legacy pci-pci bridge. Add deletion of the resource reserve capability in pci_bridge_dev_exitfn. Signed-off-by: Jing Liu --- hw/pci-bridge/pci_bridge_dev.c | 24 1 file changed, 24 insertions(+) [...]

Reviewed-by: Marcel Apfelbaum

Thanks for the quick review and feedback. So could I ask what I should do now: update a new version with your R-b, or just wait for the push? Thanks, Jing

Thanks, Marcel
Re: [Qemu-devel] [PATCH v2 2/3] hw/pci: add teardown function for PCI resource reserve capability
Hi Marcel, On 8/20/2018 9:38 PM, Marcel Apfelbaum wrote: Hi Jing, On 08/20/2018 05:58 AM, Liu, Jing2 wrote: Hi Marcel, On 8/18/2018 12:10 AM, Marcel Apfelbaum wrote: Hi Jing, On 08/16/2018 12:28 PM, Jing Liu wrote: Clean up the PCI config space of the resource reserve capability. Signed-off-by: Jing Liu --- hw/pci/pci_bridge.c | 9 + include/hw/pci/pci_bridge.h | 1 + 2 files changed, 10 insertions(+)

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 15b055e..dbcee90 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -465,6 +465,15 @@ int pci_bridge_qemu_reserve_cap_init(PCIDevice *dev, int cap_offset,
     return 0;
 }
+void pci_bridge_qemu_reserve_cap_uninit(PCIDevice *dev)
+{
+    uint8_t pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+
+    pci_del_capability(dev, PCI_CAP_ID_VNDR, sizeof(PCIBridgeQemuCap));

I think that you only need to call pci_del_capability,

+    memset(dev->config + pos + PCI_CAP_FLAGS, 0,
+           sizeof(PCIBridgeQemuCap) - PCI_CAP_FLAGS);
+}

... no need for the above line. The reason is that pci_del_capability will "unlink" the capability, and even if the data remains in the configuration space array, it will not be used.

I think I got it: pci_del_capability "unlinks" by clearing the flag pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST; so that the stale config bytes will not be used, right?

If it is the last capability in the list, yes. Otherwise it will simply link 'prev' with 'next' using config array offsets.

I got it! Thanks very much for the details! Jing

Thanks, Marcel

Do you agree? If yes, just call pci_del_capability and you don't need this patch.

Yup, I agree with you. Let me remove this patch in the next version. Thanks, Jing

Thanks, Marcel [...]
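For reference, a paraphrased sketch of the unlinking described above (the real code is in QEMU's hw/pci/pci.c; pci_find_capability_list() is internal to that file, and the details here, such as the extra bookkeeping the real function performs, are simplified for illustration):

static void del_capability_sketch(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
{
    uint8_t prev;
    uint8_t offset = pci_find_capability_list(pdev, cap_id, &prev);

    if (!offset) {
        return;
    }

    /* link 'prev' directly to 'next', skipping this capability; any
     * stale bytes left in config space are now unreachable */
    pdev->config[prev] = pdev->config[offset + PCI_CAP_LIST_NEXT];

    /* if the list became empty, drop the capability-list status bit */
    if (!pdev->config[PCI_CAPABILITY_LIST]) {
        pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
    }
}

This is why the memset in the quoted patch is unnecessary: once the capability is unlinked, nothing walks to those bytes again.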
Re: [Qemu-devel] [PATCH v2 1/3] hw/pci: factor PCI reserve resources to a separate structure
Hi Marcel, On 8/17/2018 11:49 PM, Marcel Apfelbaum wrote: Hi Jing, [...]

+/*
+ * additional resources to reserve on firmware init
+ */
+typedef struct PCIResReserve {
+    uint32_t bus_reserve;
+    uint64_t io_reserve;
+    uint64_t mem_reserve;

The patch looks good to me. I noticed you renamed 'mem_no_pref_reserve' to 'mem_reserve'. I remember we had a lot of discussions about the naming, so that it would be clear and consistent with the firmware counterpart.

OK, will change 'mem_no_pref_reserve' to 'mem_no_pref', and likewise for the others.

Please add at least a comment in PCIResReserve.

Will add a comment to the structure definition, and where it's called.

Also, since you encapsulated the fields into a new struct, you could remove the "_reserve" suffix so we remain with clear "bus", "io", "mem" ...

Got it. Thanks, Jing

Thanks, Marcel

+    uint64_t pref32_reserve;
+    uint64_t pref64_reserve;
+} PCIResReserve;
+
 int pci_bridge_qemu_reserve_cap_init(PCIDevice *dev, int cap_offset,
-    uint32_t bus_reserve, uint64_t io_reserve,
-    uint64_t mem_non_pref_reserve,
-    uint64_t mem_pref_32_reserve,
-    uint64_t mem_pref_64_reserve,
-    Error **errp);
+    PCIResReserve res_reserve, Error **errp);
 #endif /* QEMU_PCI_BRIDGE_H */
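After the renaming agreed above, the structure ends up roughly like the following (this matches include/hw/pci/pci_bridge.h as it eventually landed upstream, to the best of my recollection; treat the exact field names as a sketch):

/* additional resources to reserve on firmware init */
typedef struct PCIResReserve {
    uint32_t bus;
    uint64_t io;
    uint64_t mem_non_pref;
    uint64_t mem_pref_32;
    uint64_t mem_pref_64;
} PCIResReserve;

int pci_bridge_qemu_reserve_cap_init(PCIDevice *dev, int cap_offset,
                                     PCIResReserve res_reserve, Error **errp);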
Re: [Qemu-devel] [PATCH v2 2/3] hw/pci: add teardown function for PCI resource reserve capability
Hi Marcel, On 8/18/2018 12:10 AM, Marcel Apfelbaum wrote: Hi Jing, On 08/16/2018 12:28 PM, Jing Liu wrote: Clean up the PCI config space of the resource reserve capability. Signed-off-by: Jing Liu --- hw/pci/pci_bridge.c | 9 + include/hw/pci/pci_bridge.h | 1 + 2 files changed, 10 insertions(+)

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 15b055e..dbcee90 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -465,6 +465,15 @@ int pci_bridge_qemu_reserve_cap_init(PCIDevice *dev, int cap_offset,
     return 0;
 }
+void pci_bridge_qemu_reserve_cap_uninit(PCIDevice *dev)
+{
+    uint8_t pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+
+    pci_del_capability(dev, PCI_CAP_ID_VNDR, sizeof(PCIBridgeQemuCap));

I think that you only need to call pci_del_capability,

+    memset(dev->config + pos + PCI_CAP_FLAGS, 0,
+           sizeof(PCIBridgeQemuCap) - PCI_CAP_FLAGS);
+}

... no need for the above line. The reason is that pci_del_capability will "unlink" the capability, and even if the data remains in the configuration space array, it will not be used.

I think I got it: pci_del_capability "unlinks" by clearing the flag pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST; so that the stale config bytes will not be used, right?

Do you agree? If yes, just call pci_del_capability and you don't need this patch.

Yup, I agree with you. Let me remove this patch in the next version. Thanks, Jing

Thanks, Marcel [...]
Re: [Qemu-devel] [PATCH v2 0/3] hw/pci: PCI resource reserve capability
Hi Marcel, On 8/18/2018 12:18 AM, Marcel Apfelbaum wrote: Hi Jing, On 08/16/2018 12:28 PM, Jing Liu wrote: This patch series is about the PCI resource reserve capability. The first patch refactors the resource reserve fields of the GenPCIERootPort structure out into a new structure called "PCIResReserve", and modifies the parameter list of pci_bridge_qemu_reserve_cap_init(). Then we add the teardown function pci_bridge_qemu_reserve_cap_uninit(). Finally, we enable the resource reserve capability for the legacy PCI bridge so that firmware can reserve additional resources for the bridge.

The series looks good to me; please see some minor comments in the patches.

Thanks very much for your review. I will improve the code and send a new version then.

Can you please point me to the SeaBIOS / OVMF counterpart?

Sure. The SeaBIOS patch series is here: https://patchew.org/Seabios/1534386737-8131-1-git-send-email-jing2@linux.intel.com/

Thanks, Jing

Thanks, Marcel

Change log: v1 -> v2
* add refactoring patch
* add teardown function
* some other fixes

Jing Liu (3): hw/pci: factor PCI reserve resources to a separate structure hw/pci: add teardown function for PCI resource reserve capability hw/pci: add PCI resource reserve capability to legacy PCI bridge

 hw/pci-bridge/gen_pcie_root_port.c | 32 +-
 hw/pci-bridge/pci_bridge_dev.c | 25
 hw/pci/pci_bridge.c | 47 +-
 include/hw/pci/pci_bridge.h | 18 +++
 4 files changed, 80 insertions(+), 42 deletions(-)
Re: [Qemu-devel] [PATCH v2 0/3] hw/pci: PCI resource reserve capability
Hi Laszlo, Thanks very much for your reminder. Looking forward to comments from all! Thanks, Jing

On 8/17/2018 12:17 AM, Laszlo Ersek wrote: Hi, On 08/16/18 11:28, Jing Liu wrote: This patch series is about the PCI resource reserve capability. The first patch refactors the resource reserve fields of the GenPCIERootPort structure out into a new structure called "PCIResReserve", and modifies the parameter list of pci_bridge_qemu_reserve_cap_init(). Then we add the teardown function pci_bridge_qemu_reserve_cap_uninit(). Finally, we enable the resource reserve capability for the legacy PCI bridge so that firmware can reserve additional resources for the bridge.

Change log: v1 -> v2
* add refactoring patch
* add teardown function
* some other fixes

Jing Liu (3): hw/pci: factor PCI reserve resources to a separate structure hw/pci: add teardown function for PCI resource reserve capability hw/pci: add PCI resource reserve capability to legacy PCI bridge

 hw/pci-bridge/gen_pcie_root_port.c | 32 +-
 hw/pci-bridge/pci_bridge_dev.c | 25
 hw/pci/pci_bridge.c | 47 +-
 include/hw/pci/pci_bridge.h | 18 +++
 4 files changed, 80 insertions(+), 42 deletions(-)

just some meta comments for now:

- I've added Marcel; please keep him CC'd on this set (and the SeaBIOS counterpart, [SeaBIOS] [PATCH v2 0/3] pci: resource reserve capability, which I found)

- my task queue has blown up and I'm unsure when I'll get to reviewing this set. The same applies to the SeaBIOS counterpart. This is just to say that you should feel free to go ahead and work with the (sub)maintainers; I'll try to get back to this as time allows, but don't wait for me. If further versions are necessary, I'd appreciate being CC'd on those, just so I know what to look at when I find the time again.

Thanks! Laszlo
Re: [Qemu-devel] [PATCH] hw/pci: add pci capability to pci-pci bridge
On 8/14/2018 9:27 PM, Laszlo Ersek wrote: [...]

+cap_error:
+    msi_uninit(dev);

(4) This error handler doesn't look entirely correct; we can reach it without having initialized MSI. (MSI setup is conditional; and even if we attempt it, it is permitted to fail with "msi=auto".) Therefore, msi_uninit() should be wrapped into "if (msi_present())", the same as you see in pci_bridge_dev_exitfn().

Sure, I can add a pre-check. But I don't understand why we need that; doesn't msi_uninit() check msi_present() itself?

Thanks for pointing that out. So, I grepped the source code for msi_uninit(). It seems that only one location gates it with msi_present() -- in pci_bridge_dev_exitfn() -- while 13 other locations just call it unconditionally. I don't know what the consistent approach is here. I'll have to defer to the PCI maintainers; their taste will decide. Technically your code looks correct, then.

Yes, I also grepped for that. I guess the places that call msi_uninit() directly do so because their msi_init() is unconditional too. Anyway, I added the check in v2. BTW, the other comments are also addressed in v2. Thanks, Jing

Thanks Laszlo
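A sketch of the corrected error path that review comment (4) asks for: gate msi_uninit() on msi_present(), since MSI setup is conditional and may have been skipped or have failed with msi=auto. Labels and calls follow the quoted patch; the elided setup steps and the exact shape of the function are illustrative, not committed code.

static void pci_bridge_dev_realize(PCIDevice *dev, Error **errp)
{
    PCIBridgeDev *bridge_dev = PCI_BRIDGE_DEV(dev);
    int err;

    /* ... slotid, SHPC and (conditional) MSI setup elided ... */

    err = pci_bridge_qemu_reserve_cap_init(dev, 0, 0,
            bridge_dev->io_reserve, bridge_dev->mem_reserve,
            bridge_dev->pref32_reserve, bridge_dev->pref64_reserve, errp);
    if (err) {
        goto cap_error;
    }
    return;

cap_error:
    if (msi_present(dev)) {   /* may be reached without MSI initialized */
        msi_uninit(dev);
    }
    slotid_cap_cleanup(dev);
}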
Re: [Qemu-devel] [PATCH] hw/pci: add pci capability to pci-pci bridge
On 8/12/2018 3:11 PM, Marcel Apfelbaum wrote: [...]

I don't know what to suggest for the new capability's teardown, in pci_bridge_dev_exitfn() -- should we just ignore it (as suggested by this patch)?

No, we should remove it properly. I think it is not considered a "big" issue, since adding/removing PCI capabilities is not an operation that deals with resources; we only edit an array (the config space) that will not be used anyway if the device init sequence failed. That does not mean the code should not be clean.

I searched for PCI_CAP_ID_* uses to see what the exitfn/uninit functions do. It's odd that some other capabilities aren't being deleted either, as in pci_ich9_uninit(). Thanks, Jing

Thanks, Marcel

Thanks Laszlo
Re: [Qemu-devel] [PATCH] hw/pci: add pci capability to pci-pci bridge
Hi Marcel, On 8/12/2018 3:11 PM, Marcel Apfelbaum wrote: Hi Laszlo, [...] hw/pci-bridge/pci_bridge_dev.c | 20 1 file changed, 20 insertions(+)

+cap_error:
+    msi_uninit(dev);

(4) This error handler doesn't look entirely correct; we can reach it without having initialized MSI. (MSI setup is conditional; and even if we attempt it, it is permitted to fail with "msi=auto".) Therefore, msi_uninit() should be wrapped into "if (msi_present())", the same as you see in pci_bridge_dev_exitfn().

You are right, msi_present should be checked. I looked at the code calling this function; the check needs to be added to make it robust. But could I ask why we need to check twice? msi_uninit() would check it again.

msi_error:
    slotid_cap_cleanup(dev);
slotid_error:

I tried to understand the error handling a bit better. I'm confused. [...] Second, msi_uninit() and shpc_cleanup() are internally inconsistent with each other. The former removes the respective capability from config space -- with pci_del_capability() --,

Right

but the latter only has a comment, "/* TODO: cleanup config space changes? */".

But it also disables the QEMU_PCI_CAP_SLOTID flag (which no code checks anyway). I agree it should call pci_del_capability to delete the SHPC capability; maybe it is some "debt" from early development stages.

The same comment is present in the slotid_cap_cleanup() function.

Given this inconsistency,

Here we also miss a call to pci_del_capability.

I don't know what to suggest for the new capability's teardown, in pci_bridge_dev_exitfn()

Aha, yes, the teardown needs to be added here. Will add that in the next version.

-- should we just ignore it (as suggested by this patch)?

No, we should remove it properly. I think it is not considered a "big" issue, since adding/removing PCI capabilities is not an operation that deals with resources; we only edit an array (the config space) that will not be used anyway if the device init sequence failed. That does not mean the code should not be clean.

Do we need to set up another patch series (separate from this one) to add pci_del_capability() calls to slotid_cap_cleanup() and shpc_cleanup()?

Thanks, Jing [...]
Re: [Qemu-devel] [PATCH] hw/pci: add pci capability to pci-pci bridge
Hi Laszlo, Sorry for the late reply. I missed these mails because of a wrong mail filter. And thanks very much for the comments. My replies are below.

On 8/7/2018 8:19 PM, Laszlo Ersek wrote: On 08/07/18 09:04, Jing Liu wrote: [...]

@@ -46,6 +46,12 @@ struct PCIBridgeDev {
     uint32_t flags;
     OnOffAuto msi;
+
+    /* additional resources to reserve on firmware init */
+    uint64_t io_reserve;
+    uint64_t mem_reserve;
+    uint64_t pref32_reserve;
+    uint64_t pref64_reserve;
 };
 typedef struct PCIBridgeDev PCIBridgeDev;

(1) Please evaluate the following idea:

This is really good; I'm only not sure whether a DEFINE_PROP_ like

DEFINE_PROP_UINT32("bus-reserve", GenPCIERootPort, res_reserve.bus_reserve, -1),

is the right/good way? I mean the third parameter "_f". [...]

- gen_rp_props should be updated accordingly. The property names should stay the same, but they should reference fields of the "res_reserve" field.

(2) Then, in a separate patch, you can add another "res_reserve" field to "PCIBridgeDev". (As a consequence, "bus_reserve" will also be added, which I think is the right thing to do.)

I am considering whether we need to limit bus_reserve for the pci-pci bridge. With the old code, we could add an "unlimited" number of devices onto the pci-pci bridge (well, less than the max).

(3) There is a class that is derived from TYPE_PCI_BRIDGE_DEV: TYPE_PCI_BRIDGE_SEAT_DEV. Is it correct to add the new properties / fields to the latter as well?

Umm, I looked into the code; that one doesn't have the hotplug property? So do we need to add resource reservation for it?

@@ -95,6 +101,13 @@ static void pci_bridge_dev_realize(PCIDevice *dev, Error **errp)
     error_free(local_err);
 }
+    err = pci_bridge_qemu_reserve_cap_init(dev, 0, 0,
+        bridge_dev->io_reserve, bridge_dev->mem_reserve,
+        bridge_dev->pref32_reserve, bridge_dev->pref64_reserve, errp);
+    if (err) {
+        goto cap_error;
+    }
+
     if (shpc_present(dev)) {
         /* TODO: spec recommends using 64 bit prefetcheable BAR.
          * Check whether that works well. */
@@ -103,6 +116,8 @@ static void pci_bridge_dev_realize(PCIDevice *dev, Error **errp)
     }
     return;
+cap_error:
+    msi_uninit(dev);

(4) This error handler doesn't look entirely correct; we can reach it without having initialized MSI. (MSI setup is conditional; and even if we attempt it, it is permitted to fail with "msi=auto".) Therefore, msi_uninit() should be wrapped into "if (msi_present())", the same as you see in pci_bridge_dev_exitfn().

Sure, I can add a pre-check. But I don't understand why we need that; doesn't msi_uninit() check msi_present() itself?

msi_error:
    slotid_cap_cleanup(dev);
slotid_error:

@@ -162,6 +177,11 @@ static Property pci_bridge_dev_properties[] = {
     ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BIT(PCI_BRIDGE_DEV_PROP_SHPC, PCIBridgeDev, flags,
                     PCI_BRIDGE_DEV_F_SHPC_REQ, true),
+    DEFINE_PROP_SIZE("io-reserve", PCIBridgeDev, io_reserve, -1),
+    DEFINE_PROP_SIZE("mem-reserve", PCIBridgeDev, mem_reserve, -1),
+    DEFINE_PROP_SIZE("pref32-reserve", PCIBridgeDev, pref32_reserve, -1),
+    DEFINE_PROP_SIZE("pref64-reserve", PCIBridgeDev, pref64_reserve, -1),
+
     DEFINE_PROP_END_OF_LIST(),
 };

(5) Similarly to (2), this property list should cover "bus-reserve" too. If we are exposing the same capability in PCI config space, then I think we should let the user control all fields of it as well.

(6) Please clarify the capability ID in the subject line of the patch.

OK, will add the details. Thanks very much. Jing

For example: hw/pci: support resource reservation capability on "pci-bridge"

Thanks Laszlo
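On the DEFINE_PROP_ question above: the third macro argument ("_f") is just a field path inside the state struct, so it can reference a member of an embedded struct directly. A sketch of how gen_rp_props could look after the res_reserve refactor (field names per the eventual rename; this mirrors what landed upstream to the best of my knowledge, but treat it as illustrative rather than the committed property list):

static Property gen_rp_props[] = {
    DEFINE_PROP_UINT32("bus-reserve", GenPCIERootPort,
                       res_reserve.bus, -1),
    DEFINE_PROP_SIZE("io-reserve", GenPCIERootPort,
                     res_reserve.io, -1),
    DEFINE_PROP_SIZE("mem-reserve", GenPCIERootPort,
                     res_reserve.mem_non_pref, -1),
    DEFINE_PROP_SIZE("pref32-reserve", GenPCIERootPort,
                     res_reserve.mem_pref_32, -1),
    DEFINE_PROP_SIZE("pref64-reserve", GenPCIERootPort,
                     res_reserve.mem_pref_64, -1),
    DEFINE_PROP_END_OF_LIST(),
};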
Re: [Qemu-devel] [Consult] nfs-vsocks support
On 4/4/2018 5:45 PM, Stefan Hajnoczi wrote: On Wed, Mar 28, 2018 at 07:25:21PM +0800, Liu, Jing2 wrote: On 3/26/2018 6:12 PM, Liu, Jing2 wrote: Hi Stefan, Thank you very much for the response! It was truly a big help.

On 3/24/2018 12:02 AM, Stefan Hajnoczi wrote: [...]

Could you give me some help? Thanks in advance!

Here is a script that launches nfsd and runs a guest: https://github.com/stefanha/linux/commit/38cbc15661a6dd44b69c4f318091f2047d707035#diff-05f3fe8941076453942a8c059b409009

Using the script, I met some problems making the initramfs.gz. It confused me a lot... I don't have busybox under /usr/bin or /usr/sbin. Where do I get it?

It may be easier to use a full guest disk image instead of the disk image generation in my go.sh script. The initramfs is very specific to the development machine I was using at the time. The main thing to look at in the go.sh script is how the NFS services are launched on the host. Busybox is a package that is available on most Linux distributions.

I realized rhel-7.x doesn't have busybox anymore. And yes, I used some other image to continue.

Before you spend more time on this, NFS over AF_VSOCK is only suitable for developers right now. It is not ready for production and I am not supporting it for end users. So if you want to hack on NFS, please go ahead, but otherwise it's probably not what you want.

Yes, so far we don't plan for production. And thank you very much for your kind response. Jing

Stefan

(host)$ ./go.sh nfs_tcp
File /usr/sbin/busybox could not be opened for reading line 55
File /lib64/libtirpc.so.3 could not be opened for reading line 306
File nc-vsock could not be opened for reading line 321
File ../netperf-2.7.0/src/netserver could not be opened for reading line 325

It results in a guest kernel panic when gdb launches the guest:

[0.425738] Failed to execute /init (error -2)
[0.426074] Kernel panic - not syncing: No working init found. Try passing .
[0.427077] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0+ #3
[0.427507] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-4
[0.428354] Call Trace:
[0.428538]  dump_stack+0x63/0x89
[0.428774]  ? rest_init+0x60/0xc0
[0.429017]  panic+0xeb/0x245
[0.429225]  ? putname+0x53/0x60
[0.429455]  ? rest_init+0xc0/0xc0
[0.429722]  kernel_init+0xf1/0x104
[0.430017]  ret_from_fork+0x25/0x30

Looking forward to your kind reply. BTW, for the following questions that I sent two days ago, I have some ideas in mind, so they can be ignored now. Jing

I have updated both host and guest kernels from the vsock-nfsd repo and installed nfs-utils on both host and guest successfully. There are two different manuals, in https://www.spinics.net/lists/linux-nfs/msg64563.html (Quickstart 3~7) and in nfs-utils (README step 3 DAEMON STARTUP ORDER). Which one is the recommended usage?

For the first quickstart manual's step 7 (mount the export from the guest), two questions:
1. Why is the hypervisor's CID 2? I didn't notice where it is specified.
2. Though I did steps 3~7 successfully, I can't see changes propagating between the host folder /export and the guest /mnt. So how do I test the communication between host and guest on that folder?

Thanks again! Jing [...]

Stefan
Re: [Qemu-devel] [Consult] nfs-vsocks support
On 3/26/2018 6:12 PM, Liu, Jing2 wrote: Hi Stefan, Thank you very much for the response! It was truly a big help.

On 3/24/2018 12:02 AM, Stefan Hajnoczi wrote: [...]

Could you give me some help? Thanks in advance!

Here is a script that launches nfsd and runs a guest: https://github.com/stefanha/linux/commit/38cbc15661a6dd44b69c4f318091f2047d707035#diff-05f3fe8941076453942a8c059b409009

Using the script, I met some problems making the initramfs.gz. It confused me a lot... I don't have busybox under /usr/bin or /usr/sbin. Where do I get it?

(host)$ ./go.sh nfs_tcp
File /usr/sbin/busybox could not be opened for reading line 55
File /lib64/libtirpc.so.3 could not be opened for reading line 306
File nc-vsock could not be opened for reading line 321
File ../netperf-2.7.0/src/netserver could not be opened for reading line 325

It results in a guest kernel panic when gdb launches the guest:

[0.425738] Failed to execute /init (error -2)
[0.426074] Kernel panic - not syncing: No working init found. Try passing .
[0.427077] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0+ #3
[0.427507] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-4
[0.428354] Call Trace:
[0.428538]  dump_stack+0x63/0x89
[0.428774]  ? rest_init+0x60/0xc0
[0.429017]  panic+0xeb/0x245
[0.429225]  ? putname+0x53/0x60
[0.429455]  ? rest_init+0xc0/0xc0
[0.429722]  kernel_init+0xf1/0x104
[0.430017]  ret_from_fork+0x25/0x30

Looking forward to your kind reply. BTW, for the following questions that I sent two days ago, I have some ideas in mind, so they can be ignored now. Jing

I have updated both host and guest kernels from the vsock-nfs repo and installed nfs-utils on both host and guest successfully. There are two different manuals, in https://www.spinics.net/lists/linux-nfs/msg64563.html (Quickstart 3~7) and in nfs-utils (README step 3 DAEMON STARTUP ORDER). Which one is the recommended usage?

For the first quickstart manual's step 7 (mount the export from the guest), two questions:
1. Why is the hypervisor's CID 2? I didn't notice where it is specified.
2. Though I did steps 3~7 successfully, I can't see changes propagating between the host folder /export and the guest /mnt. So how do I test the communication between host and guest on that folder?

Thanks again! Jing [...]

Stefan
Re: [Qemu-devel] [Consult] nfs-vsocks support
Hi Stefan, Thank you very much for the response! It was truly a big help.

On 3/24/2018 12:02 AM, Stefan Hajnoczi wrote: On Fri, Mar 23, 2018 at 05:54:53PM +0800, Liu, Jing2 wrote: I am currently trying to use nfs-vsock on x86 for the virtual machine filesystem, following the manuals at https://www.spinics.net/lists/linux-nfs/msg64563.html and https://lwn.net/Articles/647516/ [...]

3. For the 5th step (Start nfsd), I'd like to know which code enables rpc.nfsd --vsock. Is it the CONFIG_VHOST_VSOCK kernel module on the host? I built the kernel from the vsock-nfs repo with the specified config options but failed to run rpc.nfsd --vsock.

Could you give me some help? Thanks in advance!

Here is a script that launches nfsd and runs a guest: https://github.com/stefanha/linux/commit/38cbc15661a6dd44b69c4f318091f2047d707035#diff-05f3fe8941076453942a8c059b409009

I have updated both host and guest kernels from the vsock-nfsd repo and installed nfs-utils on both host and guest successfully. There are two different manuals, in https://www.spinics.net/lists/linux-nfs/msg64563.html (Quickstart 3~7) and in nfs-utils (README step 3 DAEMON STARTUP ORDER). Which one is the recommended usage?

For the first quickstart manual's step 7 (mount the export from the guest), two questions:
1. Why is the hypervisor's CID 2? I didn't notice where it is specified.
2. Though I did steps 3~7 successfully, I can't see changes propagating between the host folder /export and the guest /mnt. So how do I test the communication between host and guest on that folder?

Thanks again! Jing

Regarding rpc.nfsd: on my Fedora 27 system rpc.nfsd is launched by /usr/lib/systemd/system/nfs.service. When developing NFS over AF_VSOCK I disable the NFS service and launch nfsd manually from the script above.

Stefan
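On question 1 above: AF_VSOCK reserves well-known context IDs in <linux/vm_sockets.h>, and VMADDR_CID_HOST is fixed at 2, so a guest always reaches the hypervisor/host at CID 2 without any configuration. A minimal connect sketch showing where the 2 comes from (the port is the conventional NFS port, picked here for illustration):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    struct sockaddr_vm addr;
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid = VMADDR_CID_HOST;   /* always 2: the hypervisor/host */
    addr.svm_port = 2049;             /* conventional NFS port */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("connect");

    close(fd);
    return 0;
}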
[Qemu-devel] [Consult] nfs-vsocks support
Hello, I am currently trying to use nfs-vsock on x86 for the virtual machine filesystem, following the manuals at https://www.spinics.net/lists/linux-nfs/msg64563.html and https://lwn.net/Articles/647516/

They describe the quickstart steps using the following code, but I ran into the problems listed below.

* Linux kernel: https://github.com/stefanha/linux.git vsock-nfs
* QEMU virtio-vsock device: https://github.com/stefanha/qemu.git vsock
* nfs-utils vsock: https://github.com/stefanha/nfs-utils.git vsock

1. I would like to ask what the current status of this code is. Are there any updates? :)

2. The nfs-utils code does not compile for me... I referred to the README but failed both ways. There seems to be no configure script and no autoheader/automake... So I don't know how to install it.

3. For the 5th step (Start nfsd), I'd like to know which code enables rpc.nfsd --vsock. Is it the CONFIG_VHOST_VSOCK kernel module on the host? I built the kernel from the vsock-nfs repo with the specified config options but failed to run rpc.nfsd --vsock.

Could you give me some help? Thanks in advance!

BRs, Jing Liu