On 2019/1/16 0:18, Alex Williamson wrote:
On Tue, 15 Jan 2019 11:21:51 +0800
Heyi Guo <guoh...@huawei.com> wrote:
Hi Alex,
Really appreciate your comments. I have some more questions below.
On 2019/1/15 0:07, Alex Williamson wrote:
On Sat, 12 Jan 2019 10:30:40 +0800
Heyi Guo <guoh...@huawei.com> wrote:
Hi folks,
I have some questions about vfio_msix_vector_do_use() in
hw/vfio/pci.c; could you help explain?
We can see that when the guest tries to enable one specific MSI-X
vector by unmasking the MSI-X Vector Control field, the access is
trapped and handled in vfio_msix_vector_do_use(), where we may take
the branch at line 525:
520     /*
521      * We don't want to have the host allocate all possible MSI vectors
522      * for a device if they're not in use, so we shutdown and incrementally
523      * increase them as needed.
524      */
525     if (vdev->nr_vectors < nr + 1) {
526         vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
527         vdev->nr_vectors = nr + 1;
528         ret = vfio_enable_vectors(vdev, true);
529         if (ret) {
530             error_report("vfio: failed to enable vectors, %d", ret);
531         }
Here all MSI-X vectors are disabled first and then re-enabled with
one more vector. The comment is there, but I still don't quite
understand it. It makes sense not to allocate all possible MSI
vectors up front, but why do we shut down the whole MSI-X index when
asked to enable one specific vector? Can't we just enable the single
vector indexed by "nr"?
What internal kernel API would vfio-pci make use of to achieve that?
We can't know the intentions of the guest and we have a limited set of
tools to manage the MSI-X state in the host kernel. It's generally the
case that MSI-X vectors are only manipulated during driver setup and
shutdown, so while the current behavior isn't necessarily efficient, it
works within the constraints and generally doesn't have a runtime impact
on performance. We also recognize that this behavior may change in the
future depending on more dynamic internal host kernel interfaces, thus
we export the VFIO_IRQ_INFO_NORESIZE flag to indicate to userspace
whether this procedure is required.
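[Aside: a minimal sketch of how userspace can test for this flag; the
ioctl and flag names are the real VFIO UAPI from <linux/vfio.h>, while
"device_fd" is assumed to be an already-open VFIO device file
descriptor:]

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Returns 1 if the MSI-X index requires the full disable/re-enable
     * cycle described above, 0 if vectors can be resized piecemeal,
     * -1 on error. */
    static int msix_needs_resize_cycle(int device_fd)
    {
        struct vfio_irq_info info = {
            .argsz = sizeof(info),
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info) < 0) {
            return -1;
        }

        return !!(info.flags & VFIO_IRQ_INFO_NORESIZE);
    }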
I tried to enable only the specified MSI-X vector, and finally
understood the original QEMU code after repeatedly getting EINVAL
from ioctl(VFIO_DEVICE_SET_IRQS). Then I tried allocating all MSI-X
vectors in vfio_msix_enable(), not only MSI-X vector 0. The change
works, but it clearly goes against the comment in the code: "We
don't want to have the host allocate all possible MSI vectors for a
device if they're not in use". May I ask what the side effect would
be if we really allocated all possible vectors at the beginning?
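[Aside: a minimal sketch of the shutdown-and-regrow sequence that the
NORESIZE contract requires, using the VFIO UAPI directly rather than
QEMU's helpers; "efds" (one eventfd per vector) and "nr_vectors" are
assumptions for illustration:]

    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int msix_resize(int device_fd, const int *efds,
                           unsigned int nr_vectors)
    {
        struct vfio_irq_set *set;
        size_t argsz = sizeof(*set) + nr_vectors * sizeof(int);
        int ret;

        /* Tear down the whole MSI-X index first (DATA_NONE, count 0);
         * a NORESIZE index cannot be grown in place. */
        struct vfio_irq_set off = {
            .argsz = sizeof(off),
            .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
        };
        if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &off) < 0) {
            return -1;
        }

        /* Re-enable the full range in one call, one eventfd per vector. */
        set = calloc(1, argsz);
        if (!set) {
            return -1;
        }
        set->argsz = argsz;
        set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = 0;
        set->count = nr_vectors;
        memcpy(set->data, efds, nr_vectors * sizeof(int));

        ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
        free(set);
        return ret;
    }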
Host IRQ vector exhaustion. Very few devices seem to make use of
their full complement of available vectors, and some devices support
a very, very large number of vectors. The current approach only
consumes the host resources that the guest is actually making use of.
What's more, on ARM64 systems with a GIC ITS, the kernel issues an
ITS DISCARD command when disabling an MSI vector, which drops any
currently pending MSI interrupt. If the device driver in the guest
enables some MSIs first, so that interrupts may arrive at any time,
and then tries to enable other MSIs, is it possible for the above
code to cause interrupts to be lost?
Interrupt reconfiguration generally happens during driver setup or
teardown, when the device is considered idle or lost interrupts are
not a concern. If you have a device which manipulates its interrupt
configuration at runtime, you can expect that lost interrupts won't
be the only problem.
On a physical machine, if the driver enables vector 0 first and
vector 1 later, is there any possibility that a vector 0 interrupt
is lost? In my opinion the manipulation of vector 1 should not
affect interrupts on vector 0, should it? But if this is possible in
the virtual world, what do we consider it: a bug to be fixed in the
future, or a known limitation of virtualization that won't be fixed
and that all driver developers should pay attention to?
Are you experiencing an actual problem with lost interrupts or is this
a theoretical question? There is a difference in programming of MSI-X
for an assigned device because we cannot know the intentions of the
guest with respect to how many vectors they intend to use. We can only
know as individual vectors are unmasked that a vector is now used.
Allocating the maximum complement of interrupts supported on the device
is wasteful of host resources. Given this, combined with the fact that
drivers generally configure vectors only at device setup time, the
existing code generally works well. Ideally drivers should not be
required to modify their behavior to support device assignment, but
certain actions, like high frequency masking/unmasking of interrupts
through the vector table impose unavoidable virtualization overhead
and should be avoided. I believe older Linux guests (RHEL5 time frame)
had such behavior.
Yes, we are experiencing occasional missing interrupts with a Huawei
Hi1822 NIC VF used in the guest. At first I suspected the workflow in
QEMU could cause this issue, but after more investigation over the
past two days, we found it should only be the trigger. The root cause
may be in the kernel or in the ARM GICR on the chip: the kernel always
maps the vPE to the first CPU (normally CPU 0), and then moves the
mapping to the scheduled CPU. There is a time window between these two
actions, and interrupts triggered within this short window may be
lost. However, we have not come to a final conclusion yet.
It's possible that current IRQ domains alleviate some of the concern
for host IRQ exhaustion and that vfio-pci could be made to allocate the
maximum supported vectors per device, removing the use of the NORESIZE
flag for the index. Userspace would need to continue to support both
modes of operation and the kernel would need to continue to be
compatible with that. Thanks,
How about we set a threshold (maybe 16 or 32), initially allocate
min(maximum supported vectors, threshold) vectors, and only go through
the original workflow if the requested vector is beyond the initially
allocated range?
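[Aside: a hypothetical sketch of that policy; the names and the
threshold value are illustrative only, not existing QEMU code:]

    #define MSIX_PREALLOC_LIMIT 32  /* hypothetical threshold */

    /* Vectors to enable when MSI-X is first turned on:
     * min(maximum supported vectors, threshold). */
    static unsigned int msix_initial_vectors(unsigned int msix_entries)
    {
        return msix_entries < MSIX_PREALLOC_LIMIT ?
               msix_entries : MSIX_PREALLOC_LIMIT;
    }

    /* The existing shutdown-and-regrow path would then only run when
     * the guest unmasks a vector beyond the pre-allocated range, i.e.
     * when (nr + 1 > vdev->nr_vectors). */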
Thanks,
Heyi
Alex