Re: [Qemu-devel] Questions about VFIO enabling MSIX vector

Alex Williamson Thu, 17 Jan 2019 07:21:29 -0800

On Thu, 17 Jan 2019 20:55:05 +0800
Heyi Guo <guoh...@huawei.com> wrote:


> On 2019/1/16 0:18, Alex Williamson wrote:
> > On Tue, 15 Jan 2019 11:21:51 +0800
> > Heyi Guo <guoh...@huawei.com> wrote:
> >  
> >> Hi Alex,
> >>
> >> Really appreciate your comments. I have some more questions below.
> >>
> >>
> >> On 2019/1/15 0:07, Alex Williamson wrote:  
> >>> On Sat, 12 Jan 2019 10:30:40 +0800
> >>> Heyi Guo <guoh...@huawei.com> wrote:
> >>>     
> >>>> Hi folks,
> >>>>
> >>>> I have some questions about vfio_msix_vector_do_use() in
> >>>> hw/vfio/pci.c, could you help to explain?
> >>>>
> >>>> We can see that when guest tries to enable one specific MSIX vector
> >>>> by unmasking MSIX Vector Control, the access will be trapped and then
> >>>> into function vfio_msix_vector_do_use(). And we may go to the below
> >>>> branch in line 525:
> >>>>
> >>>>
> >>>> 520 /*
> >>>> 521  * We don't want to have the host allocate all possible MSI vectors
> >>>> 522  * for a device if they're not in use, so we shutdown and 
> >>>> incrementally
> >>>> 523  * increase them as needed.
> >>>> 524  */
> >>>> 525  if (vdev->nr_vectors < nr + 1) {
> >>>> 526      vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
> >>>> 527      vdev->nr_vectors = nr + 1;
> >>>> 528      ret = vfio_enable_vectors(vdev, true);
> >>>> 529      if (ret) {
> >>>> 530          error_report("vfio: failed to enable vectors, %d", ret);
> >>>> 531      }
> >>>>
> >>>> Here all MSIX vectors will be disabled first and then enabled, with
> >>>> one more MSIX. The comment is there but I still don't quite
> >>>> understand. It makes sense for not allocating all possible MSI
> >>>> vectors, but why shall we shutdown the whole MSI when being requested
> >>>> to enable one specific vector? Can't we just enable the user
> >>>> specified vector indexed by "nr"?  
> >>> What internal kernel API would vfio-pci make use of to achieve that?
> >>> We can't know the intentions of the guest and we have a limited set of
> >>> tools to manage the MSI-X state in the host kernel.  It's generally the
> >>> case that MSI-X vectors are only manipulated during driver setup and
> >>> shutdown, so while the current behavior isn't necessarily efficient, it
> >>> works within the constraints and generally doesn't have a runtime impact
> >>> on performance.  We also recognize that this behavior may change in the
> >>> future depending on more dynamic internal host kernel interfaces, thus
> >>> we export the VFIO_IRQ_INFO_NORESIZE flag to indicate to userspace
> >>> whether this procedure is required.  
> >> I tried to enable the specified MSIX vector only, and finally
> >> understood the original QEMU code after always getting EINVAL from
> >> ioctl->VFIO_DEVICE_SET_IRQS. Then I tried to allocate all MSIX
> >> vectors in vfio_msix_enable(), not only MSIX vector 0. The change
> >> works, but it definitely doesn't follow the comment in the code "We
> >> don't want to have the host allocate all possible MSI vectors for a
> >> device if they're not in use". May I ask what the side effect is if
> >> we really allocate all possible vectors at the beginning?  
> > Host IRQ vector exhaustion.  Very few devices seem to make use of the
> > their full compliment of available vectors and some devices support a
> > very, very large number of vectors.  The current approach only consumes
> > the host resources that the guest is actually making use of.
> >  
> >>>> What's more, on ARM64 systems with GIC ITS, the kernel will issue
> >>>> an ITS discard command when disabling a MSI vector, which will drop
> >>>> currently pending MSI interrupt. If device driver in guest system
> >>>> enables some MSIs first and interrupts may come at any time, and
> >>>> then it tries to enable other MSIs, is it possible for the above
> >>>> code to cause interrupts missing?  
> >>> Interrupt reconfiguration is generally during driver setup or
> >>> teardown when the device is considered idle or lost interrupts are
> >>> not a concern.  If you have a device which manipulates interrupt
> >>> configuration runtime, you can expect that lost interrupts won't be
> >>> the only problem,  
> >> On physical machine, if the driver enables vector 0 first, and then
> >> vector 1 later, will the vector 0 interrupt be lost for some
> >> possibility? In my opinion the manipulation of vector 1 should not
> >> affect the interrupt of vector 0, isn't it? But if this is possible
> >> in virtual world, what do we consider it as? A bug to be fixed in
> >> future, or a known limitation of virtualization that won't be fixed
> >> and all driver developers should pay attention to it?  
> > Are you experiencing an actual problem with lost interrupts or is this
> > a theoretical question?  There is a difference in programming of MSI-X
> > for an assigned device because we cannot know the intentions of the
> > guest with respect to how many vectors they intend to use.  We can only
> > know as individual vectors are unmasked that a vector is now used.
> > Allocating the maximum compliment of interrupts supported on the device
> > is wasteful of host resources.  Given this, combined with the fact that
> > drivers generally configure vectors only at device setup time, the
> > existing code generally works well.  Ideally drivers should not be
> > required to modify their behavior to support device assignment, but
> > certain actions, like high frequency masking/unmasking of interrupts
> > through the vector table impose unavoidable virtualization overhead
> > and should be avoided.  I believe older Linux guests (RHEL5 time frame)
> > had such behavior.  
>
> Yes we are experiencing occasional interrupts missing for Huawei
> Hi1822 NIC VF used on Guest. Once I suspected the workflow in qemu
> could cause this issue, but after more investigation these two days,
> we found it should only be the trigger of the issue. The root cause
> may be on kernel or ARM GICR in the chip, for kernel always maps vPE
> to the first CPU (normally CPU 0), and then moves the map to the
> scheduled CPU. But there is a time window between these two actions,
> and interrupts may get lost when triggered in this short time window.
> However we have not come to a final conclusion yet.
> 
> >
> > It's possible that current IRQ domains alleviate some of the concern
> > for host IRQ exhaustion and that vfio-pci could be made to allocate
> > the maximum supported vectors per device, removing the use of the
> > NORESIZE flag for the index.  Userspace would need to continue to
> > support both modes of operation and the kernel would need to
> > continue to be compatible with that.  Thanks,  
>
> How about we set a threshold (maybe 16, 32) that only allocating
> vectors of number = min (maximum supported vectors, threshold), and
> go thru the original work flow if the specified vector is larger than
> the initially allocated?

I would not support such a solution, it's not only still wasteful of
interrupt vectors but it introduces an inflection point in the behavior
which means that "high-end" configurations exercise code paths that
"low-end" configurations do not.  Besides, it seems you haven't yet
even proven that this is an issue.  If we wanted to remove the NORESIZE
flag for this vector index, I think the host would need to allocate all
vectors supported by the MSI-X capability in advance, which means we
need to verify that vector exhaustion is no longer a concern on the
host.  Thanks,

Alex

Re: [Qemu-devel] Questions about VFIO enabling MSIX vector

Reply via email to