Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-29 Thread Alex Williamson
On Thu, 29 Aug 2019 13:40:59 +0800
Ben Luo  wrote:

> 在 2019/8/29 上午1:23, Alex Williamson 写道:
> > On Wed, 28 Aug 2019 18:08:02 +0800
> > Ben Luo  wrote:
> >  
> >> 在 2019/8/28 上午4:33, Alex Williamson 写道:  
> >>> On Thu, 22 Aug 2019 23:34:43 +0800
> >>> Ben Luo  wrote:
> >>> 
>  When userspace (e.g. qemu) triggers a switch between KVM
>  irqfd and userspace eventfd, only dev_id of irq action
>  (i.e. the "trigger" in this patch's context) will be
>  changed, but a free-then-request-irq action is taken in
>  current code. And, irq affinity setting in VM will also
>  trigger a free-then-request-irq action, which actually
>  changes nothing, but only fires a producer re-registration
>  to update irte in case that posted-interrupt is in use.
> 
>  This patch makes use of irq_update_devid() and optimize
>  both cases above, which reduces the risk of losing interrupt
>  and also cuts some overhead.
> 
>  Signed-off-by: Ben Luo 
>  ---
> drivers/vfio/pci/vfio_pci_intrs.c | 112 
>  +-
> 1 file changed, 74 insertions(+), 38 deletions(-)
> 
>  diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
>  b/drivers/vfio/pci/vfio_pci_intrs.c
>  index 3fa3f72..60d3023 100644
>  --- a/drivers/vfio/pci/vfio_pci_intrs.c
>  +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>  @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device 
>  *vdev, int nvec, bool msix)
> static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> int vector, int fd, bool msix)
> {
>  +struct eventfd_ctx *trigger = NULL;
>   struct pci_dev *pdev = vdev->pdev;
>  -struct eventfd_ctx *trigger;
>   int irq, ret;
> 
>   if (vector < 0 || vector >= vdev->num_ctx)
>   return -EINVAL;
> 
>  +if (fd >= 0) {
>  +trigger = eventfd_ctx_fdget(fd);
>  +if (IS_ERR(trigger))
>  +return PTR_ERR(trigger);
>  +}  
> >>> I think this is a user visible change.  Previously the vector is
> >>> disabled first, then if an error occurs re-enabling, we return an errno
> >>> with the vector disabled.  Here we instead fail the ioctl and leave the
> >>> state as if it had never happened.  For instance with QEMU, if they
> >>> were trying to change from KVM to userspace signaling and entered this
> >>> condition, previously the interrupt would signal to neither eventfd, now
> >>> it would continue signaling to KVM. If QEMU's intent was to emulate
> >>> vector masking, this could induce unhandled interrupts in the guest.
> >>> Maybe we need a tear-down on fault here to maintain that behavior, or
> >>> do you see some justification for the change?  
> >> Thanks for your comments, this reminds me to think more about the
> >> effects to users.
> >>
> >> After I reviewed the related code in Qemu and VFIO, I think maybe there
> >> is a problem in current behavior
> >> for the signal path changing case. Qemu has neither recovery nor retry
> >> code in case that ioctl with
> >> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been
> >> disabled on fault of setting
> >> up new path, the corresponding vector may be disabled forever. Following
> >> is an example from qemu's
> >> vfio_msix_vector_do_use():
> >>
> >>       ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >>       g_free(irq_set);
> >>       if (ret) {
> >>       error_report("vfio: failed to modify vector, %d", ret);
> >>       }
> >>
> >> I think the singal path before changing should be still working at this
> >> moment and the caller should keep it
> >> working if the changing fails, so that at least we still have the old
> >> path instead of no path.
> >>
> >> For masking vector case, the 'fd' should be -1, and the interrupt will
> >> be freed as before this patch.  
> > QEMU doesn't really have an opportunity to signal an error to the
> > guest, we're emulating the hardware masking of MSI and MSI-X.  The
> > guest is simply trying to write a mask bit in the vector, there's no
> > provision in the PCI spec that setting this bit can fail.  The current
> > behavior is that the vector is disabled on error.  We can argue whether
> > that's the optimal behavior, but it's the existing behavior and
> > changing it would require and evaluation of all existing users.  
> 
> I totally agree with you that masking of MSI and MSI-X should follow 
> current behavior, and my code does follow this behavior when 'fd' == -1, 
> the interrupt will surely be disabled.
> 
> There is another case that 'fd' is not -1, means userspace just want to 
> change the singal path, this time we do have a chance of encountering 
> error on eventfd_ctx_fdget(fd). So, I think it is better to keep t

Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-28 Thread Ben Luo



在 2019/8/29 上午1:23, Alex Williamson 写道:

On Wed, 28 Aug 2019 18:08:02 +0800
Ben Luo  wrote:


在 2019/8/28 上午4:33, Alex Williamson 写道:

On Thu, 22 Aug 2019 23:34:43 +0800
Ben Luo  wrote:
  

When userspace (e.g. qemu) triggers a switch between KVM
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actually
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.

This patch makes use of irq_update_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.

Signed-off-by: Ben Luo 
---
   drivers/vfio/pci/vfio_pci_intrs.c | 112 
+-
   1 file changed, 74 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..60d3023 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, 
int nvec, bool msix)
   static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
  int vector, int fd, bool msix)
   {
+   struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;
-   struct eventfd_ctx *trigger;
int irq, ret;
   
   	if (vector < 0 || vector >= vdev->num_ctx)

return -EINVAL;
   
+	if (fd >= 0) {

+   trigger = eventfd_ctx_fdget(fd);
+   if (IS_ERR(trigger))
+   return PTR_ERR(trigger);
+   }

I think this is a user visible change.  Previously the vector is
disabled first, then if an error occurs re-enabling, we return an errno
with the vector disabled.  Here we instead fail the ioctl and leave the
state as if it had never happened.  For instance with QEMU, if they
were trying to change from KVM to userspace signaling and entered this
condition, previously the interrupt would signal to neither eventfd, now
it would continue signaling to KVM. If QEMU's intent was to emulate
vector masking, this could induce unhandled interrupts in the guest.
Maybe we need a tear-down on fault here to maintain that behavior, or
do you see some justification for the change?

Thanks for your comments, this reminds me to think more about the
effects to users.

After I reviewed the related code in Qemu and VFIO, I think maybe there
is a problem in current behavior
for the signal path changing case. Qemu has neither recovery nor retry
code in case that ioctl with
'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been
disabled on fault of setting
up new path, the corresponding vector may be disabled forever. Following
is an example from qemu's
vfio_msix_vector_do_use():

      ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
      g_free(irq_set);
      if (ret) {
      error_report("vfio: failed to modify vector, %d", ret);
      }

I think the singal path before changing should be still working at this
moment and the caller should keep it
working if the changing fails, so that at least we still have the old
path instead of no path.

For masking vector case, the 'fd' should be -1, and the interrupt will
be freed as before this patch.

QEMU doesn't really have an opportunity to signal an error to the
guest, we're emulating the hardware masking of MSI and MSI-X.  The
guest is simply trying to write a mask bit in the vector, there's no
provision in the PCI spec that setting this bit can fail.  The current
behavior is that the vector is disabled on error.  We can argue whether
that's the optimal behavior, but it's the existing behavior and
changing it would require and evaluation of all existing users.


I totally agree with you that masking of MSI and MSI-X should follow 
current behavior, and my code does follow this behavior when 'fd' == -1, 
the interrupt will surely be disabled.


There is another case that 'fd' is not -1, means userspace just want to 
change the singal path, this time we do have a chance of encountering 
error on eventfd_ctx_fdget(fd). So, I think it is better to keep the old 
path working in this case.


But, as you said this may break the expectation of existing users, I 
make it tear-down on fault in next version.





+
irq = pci_irq_vector(pdev, vector);
   
+	/*

+* For KVM-VFIO case, interrupt from passthrough device will be directly
+* delivered to VM after producer and consumer connected successfully.
+* If producer and consumer are disconnected, this interrupt process
+* will fall back to remap mode, where interrupt handler uses 'trigger'
+* to find the right way to deliver the interrupt to VM. So, it is safe
+* to do irq_u

Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-28 Thread Alex Williamson
On Wed, 28 Aug 2019 18:08:02 +0800
Ben Luo  wrote:

> 在 2019/8/28 上午4:33, Alex Williamson 写道:
> > On Thu, 22 Aug 2019 23:34:43 +0800
> > Ben Luo  wrote:
> >  
> >> When userspace (e.g. qemu) triggers a switch between KVM
> >> irqfd and userspace eventfd, only dev_id of irq action
> >> (i.e. the "trigger" in this patch's context) will be
> >> changed, but a free-then-request-irq action is taken in
> >> current code. And, irq affinity setting in VM will also
> >> trigger a free-then-request-irq action, which actually
> >> changes nothing, but only fires a producer re-registration
> >> to update irte in case that posted-interrupt is in use.
> >>
> >> This patch makes use of irq_update_devid() and optimize
> >> both cases above, which reduces the risk of losing interrupt
> >> and also cuts some overhead.
> >>
> >> Signed-off-by: Ben Luo 
> >> ---
> >>   drivers/vfio/pci/vfio_pci_intrs.c | 112 
> >> +-
> >>   1 file changed, 74 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> >> b/drivers/vfio/pci/vfio_pci_intrs.c
> >> index 3fa3f72..60d3023 100644
> >> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device 
> >> *vdev, int nvec, bool msix)
> >>   static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> >>  int vector, int fd, bool msix)
> >>   {
> >> +  struct eventfd_ctx *trigger = NULL;
> >>struct pci_dev *pdev = vdev->pdev;
> >> -  struct eventfd_ctx *trigger;
> >>int irq, ret;
> >>   
> >>if (vector < 0 || vector >= vdev->num_ctx)
> >>return -EINVAL;
> >>   
> >> +  if (fd >= 0) {
> >> +  trigger = eventfd_ctx_fdget(fd);
> >> +  if (IS_ERR(trigger))
> >> +  return PTR_ERR(trigger);
> >> +  }  
> > I think this is a user visible change.  Previously the vector is
> > disabled first, then if an error occurs re-enabling, we return an errno
> > with the vector disabled.  Here we instead fail the ioctl and leave the
> > state as if it had never happened.  For instance with QEMU, if they
> > were trying to change from KVM to userspace signaling and entered this
> > condition, previously the interrupt would signal to neither eventfd, now
> > it would continue signaling to KVM. If QEMU's intent was to emulate
> > vector masking, this could induce unhandled interrupts in the guest.
> > Maybe we need a tear-down on fault here to maintain that behavior, or
> > do you see some justification for the change?  
> 
> Thanks for your comments, this reminds me to think more about the 
> effects to users.
> 
> After I reviewed the related code in Qemu and VFIO, I think maybe there 
> is a problem in current behavior
> for the signal path changing case. Qemu has neither recovery nor retry 
> code in case that ioctl with
> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been 
> disabled on fault of setting
> up new path, the corresponding vector may be disabled forever. Following 
> is an example from qemu's
> vfio_msix_vector_do_use():
> 
>      ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>      g_free(irq_set);
>      if (ret) {
>      error_report("vfio: failed to modify vector, %d", ret);
>      }
> 
> I think the singal path before changing should be still working at this 
> moment and the caller should keep it
> working if the changing fails, so that at least we still have the old 
> path instead of no path.
> 
> For masking vector case, the 'fd' should be -1, and the interrupt will 
> be freed as before this patch.

QEMU doesn't really have an opportunity to signal an error to the
guest, we're emulating the hardware masking of MSI and MSI-X.  The
guest is simply trying to write a mask bit in the vector, there's no
provision in the PCI spec that setting this bit can fail.  The current
behavior is that the vector is disabled on error.  We can argue whether
that's the optimal behavior, but it's the existing behavior and
changing it would require and evaluation of all existing users.

> >> +
> >>irq = pci_irq_vector(pdev, vector);
> >>   
> >> +  /*
> >> +   * For KVM-VFIO case, interrupt from passthrough device will be directly
> >> +   * delivered to VM after producer and consumer connected successfully.
> >> +   * If producer and consumer are disconnected, this interrupt process
> >> +   * will fall back to remap mode, where interrupt handler uses 'trigger'
> >> +   * to find the right way to deliver the interrupt to VM. So, it is safe
> >> +   * to do irq_update_devid() before irq_bypass_unregister_producer() 
> >> which
> >> +   * switches interrupt process to remap mode. To producer and consumer,
> >> +   * 'trigger' is only a token used for pairing them togather.
> >> +   */
> >>if (vdev->ctx[vector].trigger) {
> >> -  free_irq(irq, vdev->ctx[vector].trigger)

Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-28 Thread Ben Luo



在 2019/8/28 上午4:33, Alex Williamson 写道:

On Thu, 22 Aug 2019 23:34:43 +0800
Ben Luo  wrote:


When userspace (e.g. qemu) triggers a switch between KVM
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actually
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.

This patch makes use of irq_update_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.

Signed-off-by: Ben Luo 
---
  drivers/vfio/pci/vfio_pci_intrs.c | 112 +-
  1 file changed, 74 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..60d3023 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, 
int nvec, bool msix)
  static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
  int vector, int fd, bool msix)
  {
+   struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;
-   struct eventfd_ctx *trigger;
int irq, ret;
  
  	if (vector < 0 || vector >= vdev->num_ctx)

return -EINVAL;
  
+	if (fd >= 0) {

+   trigger = eventfd_ctx_fdget(fd);
+   if (IS_ERR(trigger))
+   return PTR_ERR(trigger);
+   }

I think this is a user visible change.  Previously the vector is
disabled first, then if an error occurs re-enabling, we return an errno
with the vector disabled.  Here we instead fail the ioctl and leave the
state as if it had never happened.  For instance with QEMU, if they
were trying to change from KVM to userspace signaling and entered this
condition, previously the interrupt would signal to neither eventfd, now
it would continue signaling to KVM. If QEMU's intent was to emulate
vector masking, this could induce unhandled interrupts in the guest.
Maybe we need a tear-down on fault here to maintain that behavior, or
do you see some justification for the change?


Thanks for your comments, this reminds me to think more about the 
effects to users.


After I reviewed the related code in Qemu and VFIO, I think maybe there 
is a problem in current behavior
for the signal path changing case. Qemu has neither recovery nor retry 
code in case that ioctl with
'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been 
disabled on fault of setting
up new path, the corresponding vector may be disabled forever. Following 
is an example from qemu's

vfio_msix_vector_do_use():

    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
    g_free(irq_set);
    if (ret) {
    error_report("vfio: failed to modify vector, %d", ret);
    }

I think the singal path before changing should be still working at this 
moment and the caller should keep it
working if the changing fails, so that at least we still have the old 
path instead of no path.


For masking vector case, the 'fd' should be -1, and the interrupt will 
be freed as before this patch.





+
irq = pci_irq_vector(pdev, vector);
  
+	/*

+* For KVM-VFIO case, interrupt from passthrough device will be directly
+* delivered to VM after producer and consumer connected successfully.
+* If producer and consumer are disconnected, this interrupt process
+* will fall back to remap mode, where interrupt handler uses 'trigger'
+* to find the right way to deliver the interrupt to VM. So, it is safe
+* to do irq_update_devid() before irq_bypass_unregister_producer() 
which
+* switches interrupt process to remap mode. To producer and consumer,
+* 'trigger' is only a token used for pairing them togather.
+*/
if (vdev->ctx[vector].trigger) {
-   free_irq(irq, vdev->ctx[vector].trigger);
-   irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-   kfree(vdev->ctx[vector].name);
-   eventfd_ctx_put(vdev->ctx[vector].trigger);
-   vdev->ctx[vector].trigger = NULL;
+   if (vdev->ctx[vector].trigger == trigger) {
+   /* switch back to remap mode */
+   
irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

I think we leak the fd context we acquired above in this case.

Thanks for pointing it out, I will fix this in next version.


Why do we do anything in this case, couldn't we just 'put' the extra ctx
and return 0 here?


Sorry for confusing and I do this for a reason,  I will add some more 
comments like this:


Unregistration here is for re-resigtraion later, which

Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-27 Thread Alex Williamson
On Thu, 22 Aug 2019 23:34:43 +0800
Ben Luo  wrote:

> When userspace (e.g. qemu) triggers a switch between KVM
> irqfd and userspace eventfd, only dev_id of irq action
> (i.e. the "trigger" in this patch's context) will be
> changed, but a free-then-request-irq action is taken in
> current code. And, irq affinity setting in VM will also
> trigger a free-then-request-irq action, which actually
> changes nothing, but only fires a producer re-registration
> to update irte in case that posted-interrupt is in use.
> 
> This patch makes use of irq_update_devid() and optimize
> both cases above, which reduces the risk of losing interrupt
> and also cuts some overhead.
> 
> Signed-off-by: Ben Luo 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 112 
> +-
>  1 file changed, 74 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3fa3f72..60d3023 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device 
> *vdev, int nvec, bool msix)
>  static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> int vector, int fd, bool msix)
>  {
> + struct eventfd_ctx *trigger = NULL;
>   struct pci_dev *pdev = vdev->pdev;
> - struct eventfd_ctx *trigger;
>   int irq, ret;
>  
>   if (vector < 0 || vector >= vdev->num_ctx)
>   return -EINVAL;
>  
> + if (fd >= 0) {
> + trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(trigger))
> + return PTR_ERR(trigger);
> + }

I think this is a user visible change.  Previously the vector is
disabled first, then if an error occurs re-enabling, we return an errno
with the vector disabled.  Here we instead fail the ioctl and leave the
state as if it had never happened.  For instance with QEMU, if they
were trying to change from KVM to userspace signaling and entered this
condition, previously the interrupt would signal to neither eventfd, now
it would continue signaling to KVM.  If QEMU's intent was to emulate
vector masking, this could induce unhandled interrupts in the guest.
Maybe we need a tear-down on fault here to maintain that behavior, or
do you see some justification for the change?

> +
>   irq = pci_irq_vector(pdev, vector);
>  
> + /*
> +  * For KVM-VFIO case, interrupt from passthrough device will be directly
> +  * delivered to VM after producer and consumer connected successfully.
> +  * If producer and consumer are disconnected, this interrupt process
> +  * will fall back to remap mode, where interrupt handler uses 'trigger'
> +  * to find the right way to deliver the interrupt to VM. So, it is safe
> +  * to do irq_update_devid() before irq_bypass_unregister_producer() 
> which
> +  * switches interrupt process to remap mode. To producer and consumer,
> +  * 'trigger' is only a token used for pairing them togather.
> +  */
>   if (vdev->ctx[vector].trigger) {
> - free_irq(irq, vdev->ctx[vector].trigger);
> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> - kfree(vdev->ctx[vector].name);
> - eventfd_ctx_put(vdev->ctx[vector].trigger);
> - vdev->ctx[vector].trigger = NULL;
> + if (vdev->ctx[vector].trigger == trigger) {
> + /* switch back to remap mode */
> + 
> irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

I think we leak the fd context we acquired above in this case.

Why do we do anything in this case, couldn't we just 'put' the extra ctx
and return 0 here?

> + } else if (trigger) {
> + ret = irq_update_devid(irq,
> +vdev->ctx[vector].trigger, 
> trigger);
> + if (unlikely(ret)) {
> + dev_info(&pdev->dev,
> +  "update devid of %d (token %p) failed: 
> %d\n",
> +  irq, vdev->ctx[vector].trigger, ret);
> + eventfd_ctx_put(trigger);
> + return ret;
> + }
> + 
> irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

Can you explain this ordering, I would have expected that we'd
unregister the bypass before we updated the devid.  Thanks,

Alex

> + eventfd_ctx_put(vdev->ctx[vector].trigger);
> + vdev->ctx[vector].producer.token = trigger;
> + vdev->ctx[vector].trigger = trigger;
> + } else {
> + free_irq(irq, vdev->ctx[vector].trigger);
> + 
> irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> + kfree(vdev->ctx[vector].n

[PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops

2019-08-22 Thread Ben Luo
When userspace (e.g. qemu) triggers a switch between KVM
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actually
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.

This patch makes use of irq_update_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.

Signed-off-by: Ben Luo 
---
 drivers/vfio/pci/vfio_pci_intrs.c | 112 +-
 1 file changed, 74 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..60d3023 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, 
int nvec, bool msix)
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
  int vector, int fd, bool msix)
 {
+   struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;
-   struct eventfd_ctx *trigger;
int irq, ret;
 
if (vector < 0 || vector >= vdev->num_ctx)
return -EINVAL;
 
+   if (fd >= 0) {
+   trigger = eventfd_ctx_fdget(fd);
+   if (IS_ERR(trigger))
+   return PTR_ERR(trigger);
+   }
+
irq = pci_irq_vector(pdev, vector);
 
+   /*
+* For KVM-VFIO case, interrupt from passthrough device will be directly
+* delivered to VM after producer and consumer connected successfully.
+* If producer and consumer are disconnected, this interrupt process
+* will fall back to remap mode, where interrupt handler uses 'trigger'
+* to find the right way to deliver the interrupt to VM. So, it is safe
+* to do irq_update_devid() before irq_bypass_unregister_producer() 
which
+* switches interrupt process to remap mode. To producer and consumer,
+* 'trigger' is only a token used for pairing them togather.
+*/
if (vdev->ctx[vector].trigger) {
-   free_irq(irq, vdev->ctx[vector].trigger);
-   irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-   kfree(vdev->ctx[vector].name);
-   eventfd_ctx_put(vdev->ctx[vector].trigger);
-   vdev->ctx[vector].trigger = NULL;
+   if (vdev->ctx[vector].trigger == trigger) {
+   /* switch back to remap mode */
+   
irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+   } else if (trigger) {
+   ret = irq_update_devid(irq,
+  vdev->ctx[vector].trigger, 
trigger);
+   if (unlikely(ret)) {
+   dev_info(&pdev->dev,
+"update devid of %d (token %p) failed: 
%d\n",
+irq, vdev->ctx[vector].trigger, ret);
+   eventfd_ctx_put(trigger);
+   return ret;
+   }
+   
irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+   eventfd_ctx_put(vdev->ctx[vector].trigger);
+   vdev->ctx[vector].producer.token = trigger;
+   vdev->ctx[vector].trigger = trigger;
+   } else {
+   free_irq(irq, vdev->ctx[vector].trigger);
+   
irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+   kfree(vdev->ctx[vector].name);
+   eventfd_ctx_put(vdev->ctx[vector].trigger);
+   vdev->ctx[vector].trigger = NULL;
+   }
}
 
if (fd < 0)
return 0;
 
-   vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
-  msix ? "x" : "", vector,
-  pci_name(pdev));
-   if (!vdev->ctx[vector].name)
-   return -ENOMEM;
+   if (!vdev->ctx[vector].trigger) {
+   vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
+  "vfio-msi%s[%d](%s)",
+  msix ? "x" : "", vector,
+  pci_name(pdev));
+   if (!vdev->ctx[vector].name) {
+   eventfd_ctx_put(trigger);
+   return -ENOMEM;
+   }
 
-   trigger = eventfd_ctx_fdget(fd);
-   if (IS_ERR(trigger)) {
-   kfree(vdev->ctx[vector].name);
-