pci: Fixing s390 vfio-pci ISM support

Matthew Rosato Wed, 20 Jan 2021 12:30:54 -0800

On 1/20/21 2:18 PM, Pierre Morel wrote:

On 1/20/21 4:59 PM, Matthew Rosato wrote:
On 1/20/21 9:45 AM, Pierre Morel wrote:
On 1/20/21 3:03 PM, Matthew Rosato wrote:
On 1/20/21 4:12 AM, Pierre Morel wrote:
On 1/19/21 9:44 PM, Matthew Rosato wrote:
Today, ISM devices are completely disallowed for vfio-pci passthrough as
QEMU rejects the device due to an (inappropriate) MSI-X check. Removing
this fence, however, reveals additional deficiencies in the s390x PCI
interception layer that prevent ISM devices from working correctly.
Namely, ISM block write operations have particular requirements in regards
to the alignment, size and order of writes performed that cannot be
guaranteed when breaking up write operations through the typical
vfio_pci_bar_rw paths. Furthermore, ISM requires that legacy/non-MIO
s390 PCI instructions are used, which is also not guaranteed when the I/O
is passed through the typical userspace channels.

This patchset provides a set of fixes related to enabling ISM device
passthrough and includes patches to enable use of a new vfio region that
will allow s390x PCI pass-through devices to perform s390 PCI instructions
in such a way that the same instruction issued on the guest is re-issued
on the host.

Associated kernel patchset:
https://lkml.org/lkml/2021/1/19/874

Changes from RFC -> v1:
- Refresh the header sync (built using Eric's 'update-linux-headers:
Include const.h' + manually removed pvrdma_ring.h again)
- Remove s390x/pci: fix pcistb length (already merged)
- Remove s390x/pci: Fix memory_region_access_valid call (already merged)
- Fix bug: s390_pci_vfio_pcistb should use the pre-allocated PCISTB
buffer pcistb_buf rather than allocating/freeing its own.
- New patch: track the PFT (PCI Function Type) separately from guest CLP
response data -- we tell the guest '0' for now due to limitations in
measurement block support, but we can still use the real value provided via
the vfio CLP capabilities to make decisions.
- Use the PFT (pci function type) to determine when to use the region
for PCISTB/PCILG (only for ISM), rather than using the relaxed alignment
bit.
- As a result, the pcistb_default is now updated to also handle the
possibility of relaxed alignment via 2 new functions, pcistb_validate_write
and pcistb_write, which serve as wrappers to the memory_region calls.
- New patch, which partially restores the MSI-X fence for passthrough
devices... Could potentially be squashed with 's390x/pci: MSI-X isn't
strictly required for passthrough' but left separately for now as I felt it
needed a clear commit description of why we should still fence this case.
Hi,
The choice of using the new VFIO region is made on the ISM PCI function
type (PFT), which makes the patch ISM specific, why don't we use here the
MIO bit common to any zPCI function and present in kernel to make the
choice?
As discussed during the RFC (and see my reply also to the kernel set), the
use of this region only works for devices that do not rely on MSI-X
interrupts. If we did as you suggest, other device types like mlx would
not receive MSI-X interrupts in the guest (And I did indeed try variations
where I used the special VFIO region for all PCISTG/PCILG/PCISTB for
various device types)
So the idea for now was to solve the specific problem at hand (getting ISM
devices working).
Sorry, if I missed or forgot some discussions, but I understood that we
are using this region to handle PCISTB instructions when the device does
not support MIO.
Don't we?
Sure thing - It's probably good to refresh the issue/rationale anyway as
we've had the holidays in between.
You are correct, a primary reason we need to resort to a separate VFIO
region for PCISTB (and PCILG) instructions for ISM devices is that they do
not support the MIO instruction set, yet the host kernel will translate
everything coming through the PCI I/O layer to MIO instructions whenever
that facility is available to the host (and not purposely disabled). This
issue is unique to vfio-pci/passthrough - in the host, the ISM driver
directly invokes functions in s390 pci code to ensure that MIO
instructions are not used.
QEMU intercepts and differentiates PCISTG and PCISTB.
The new hardware supports both MIO and legacy PCISTB/PCISTG.

QEMU does not support MIO

My first question is why should we translate legacy to MIO?

This is existing behavior of the s390 kernel PCI layer, not something I'm
introducing here.

So, as you say, QEMU does not support MIO, nor does it attempt to
translate anything to MIO. The thing is, to the host kernel, PCI I/O
coming from QEMU through the standard vfio-pci codepath just looks like
any other userspace PCI I/O coming in to the kernel. So the memory_region
operations against the vfio virtual address space bars in QEMU then turn
into iowrite/ioread operations in vfio-pci in the kernel, which then
subsequently end up in the s390 PCI kernel layer, and it's there that
everything is turned into an MIO operation if MIO is available. That's
not limited to vfio-pci, that's all PCI I/O in the host (except for the
ISM driver, which bypasses this behavior by invoking s390 PCI kernel
interfaces directly).

In an early (internal) version of the kernel component to this I floated
the idea of trying to determine whether or not MIO instructions could be
used for a given I/O operation, but the ideas I floated had various
flaws -- I'd invite Niklas to chime in with why this got squashed.

But anyway... If we were to solve that somehow, this would still leave us
with write operations capped at 8B and odd write pattern requirements for
ISM passthrough. Using a VFIO region to pass the operation directly to
the host kernel via a pinned page to overcome those limitations was your
idea actually :)

But OK, say we do need this for some obscure reason.
But this is not the only reason. There are additional reasons for using
this VFIO region:
1) ISM devices also don't support PCISTG instructions to certain address
spaces and PCISTB must be used regardless of operation length. However
the standard s390 PCI I/O path always uses PCISTG for anything <=8B.
Trying to determine whether a given I/O is intended for an ISM device at
that point in kernel code so as to use PCISTB instead of PCISTG is the
OK, this is clear.
same problem as attempting to decide whether to use MIO vs non-MIO
instructions at that point.
humm, this is not exactly the same problem for me, but OK to choose to
handle it the same way.

The problem isn't the same BUT the information needed to solve the problem
is the same (for a given memory operation, knowing what device/type it is
intended for, which can therefore be used to determine also whether it
supports MIO or not).

2) It allows for much larger PCISTB operations (4K) than allowed via the
memory regions (loop of 8B operations).
OK
3) The above also has the added benefit of eliminating certain write
pattern requirements that are unique to ISM that would be introduced if we
split up the I/O into 8B chunks (if we can't write the whole PCISTB in one
go, ISM requires data written in a certain order for some address spaces,
or with certain bits on/off on the PCISTB instruction to signify the state
of the larger operation)
Yes, I suppose that the driver in the guest does it right and we need to
do the same.
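
To put numbers on points 2) and 3), here is a small Python model (not QEMU
or kernel code) of how many separate stores one 4K block write turns into
on each path. The 8B and 4K limits are the ones quoted in this thread;
split_writes is a hypothetical helper, and the model deliberately ignores
alignment rules and the ISM-specific ordering requirements.

```python
PCISTG_MAX = 8      # bytes per store on the standard memory-region path (from the thread)
PCISTB_MAX = 4096   # bytes per block store through the new region (4K, from the thread)


def split_writes(offset, length, chunk):
    """Return the (offset, size) pieces needed to cover `length` bytes.

    Hypothetical helper for illustration only.
    """
    pieces = []
    end = offset + length
    while offset < end:
        size = min(chunk, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


via_bar = split_writes(0, 4096, PCISTG_MAX)     # loop of 8B operations
via_region = split_writes(0, 4096, PCISTB_MAX)  # one pass-through block store
print(len(via_bar), len(via_region))            # prints: 512 1
```

Each of the 512 pieces on the first path is a point where the ISM ordering
and state-bit requirements would have to be honored; the single store on
the second path has no such intermediate state.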
I do not understand the relation between MSI-X and MIO.
Can you please explain?
There is not a relation between MSI-X and MIO really. Rather, this is a
case of the solution that is being offered here ONLY works for devices
that use MSI -- and ISM is a device that only supports MSI. If you try to
use this new VFIO region to pass I/O for an MSI-X enabled device, the
notifiers set up via vfio_msix_setup won't be triggered because we are
writing to the new VFIO region, not the virtual bar regions that may have
had notifiers set up as part of vfio_msix_setup. This results in missing
interrupts on MSI-X-enabled vfio-pci devices.
These notifiers aren't a factor when the device is using MSI.
I find this strange but we do not need to discuss it.
So we have:
devices supporting MIO and MSIX
devices not supporting MIO nor MSIX
devices not supporting the use of PCISTG to emulate PCISTB
The first two are two different things indicated by two different entries
in the clp query PCI function response.
The last one, we do not have an indicator as if the relaxed alignment and
length is set, PCISTB can not be emulated with PCISTG
What I mean with this is that considering the proposed implementation and
considering:
MIO MSIX RELAX

 0   0    1    -> must use the new region (ISM)
 1   1    0    -> must use the standard VFIO region (MLX)

we can discuss other 6 possibilities

 0   0    0    -> must use the new region
 0   1    0    -> NOOP
 0   1    1    -> NOOP
 1   0    0    -> can use any region
 1   0    1    -> can use any region
 1   1    1    -> NOOP
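
For readers who prefer it spelled out, the eight rows above can be
transcribed as a lookup table. This is only a Python transcription of the
table as posted, not code from the patchset; the names REGION_CHOICE and
region_for are hypothetical, and "NOOP" is kept verbatim because the
thread does not define it further.

```python
NEW_REGION = "must use the new region"
STANDARD_REGION = "must use the standard VFIO region"
ANY_REGION = "can use any region"
NOOP = "NOOP"  # kept exactly as written in the table

# Key is (MIO, MSIX, RELAX), in the same column order as the table.
REGION_CHOICE = {
    (0, 0, 1): NEW_REGION,       # ISM
    (1, 1, 0): STANDARD_REGION,  # MLX
    (0, 0, 0): NEW_REGION,
    (0, 1, 0): NOOP,
    (0, 1, 1): NOOP,
    (1, 0, 0): ANY_REGION,
    (1, 0, 1): ANY_REGION,
    (1, 1, 1): NOOP,
}


def region_for(mio, msix, relax):
    """Look up the table row for one combination of the three indicators."""
    return REGION_CHOICE[(int(bool(mio)), int(bool(msix)), int(bool(relax)))]


print(region_for(0, 0, 1))  # the ISM row
print(region_for(1, 1, 0))  # the MLX row
```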
In my opinion the test for using one region or another should be done on
these indicators instead of using the PFT. This may offer us more
compatibility with other hardware we may not be aware of as today.

This gets a little shaky, and goes both ways -- Using your list, a device
that supports MIO, does not have MSI-X capability and doesn't support
relaxed alignment (1 0 0 from above) can use any region -- but that may
not always be true. What if "other hardware we may not be aware of as
today" includes future hardware that ONLY supports the MIO instruction
set? Then that device really can't use this region either.

But forgetting that possibility... I think we can really simplify the
above matrix down to a statement of "if device doesn't support MSI-X but
DOES support non-MIO instructions, it can use the region." I believe the
latter half of that statement is implicit in the architecture today, so
it's really then "if device doesn't support MSI-X, it can use the region".
There's just the caveat of, if the device is ISM, it changes from 'can use
the region' to 'must use the region'.
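
As a rough illustration of that simplified statement, the rule could be
written as two predicates. This is a Python sketch, not the QEMU
implementation: ZpciDevice, has_msix, is_ism, can_use_region and
must_use_region are all hypothetical names, and the sketch assumes (as
stated above) that non-MIO instruction support is implicit for every
device.

```python
from dataclasses import dataclass


@dataclass
class ZpciDevice:
    """Hypothetical stand-in for the per-device state being discussed."""
    has_msix: bool  # device has the MSI-X capability
    is_ism: bool    # PFT identifies the device as ISM


def can_use_region(dev):
    # "if device doesn't support MSI-X, it can use the region"
    return not dev.has_msix


def must_use_region(dev):
    # For ISM, 'can use the region' becomes 'must use the region'.
    return dev.is_ism and can_use_region(dev)


ism = ZpciDevice(has_msix=False, is_ism=True)
mlx = ZpciDevice(has_msix=True, is_ism=False)
print(can_use_region(ism), must_use_region(ism))  # prints: True True
print(can_use_region(mlx), must_use_region(mlx))  # prints: False False
```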

So, I mean I can change the code to be more permissive in that way (allow
any device that doesn't have MSI-X capability to at least attempt to use
the region). But the reality is that ISM specifically needs the region
for successful pass through, so I don't see a reason to create a different
bit for that vs just checking for the PFT in QEMU and using that value to
decide whether or not region availability is a requirement for allowing
the device to pass through.

Re: [PATCH 0/8] s390x/pci: Fixing s390 vfio-pci ISM support
