On Wed, 2015-04-15 at 18:30 +0800, Chen Fan wrote: > On 04/08/2015 11:36 PM, Alex Williamson wrote: > > On Wed, 2015-04-08 at 16:59 +0800, Chen Fan wrote: > >> On 04/01/2015 11:46 PM, Alex Williamson wrote: > >>> On Wed, 2015-04-01 at 12:12 +0800, Chen Fan wrote: > >>>> On 03/25/2015 10:41 AM, Alex Williamson wrote: > >>>>> On Wed, 2015-03-25 at 09:53 +0800, Chen Fan wrote: > >>>>>> On 03/16/2015 10:09 PM, Alex Williamson wrote: > >>>>>>> On Mon, 2015-03-16 at 15:35 +0800, Chen Fan wrote: > >>>>>>>> On 03/16/2015 11:52 AM, Alex Williamson wrote: > >>>>>>>>> On Mon, 2015-03-16 at 11:05 +0800, Chen Fan wrote: > >>>>>>>>>> On 03/14/2015 06:34 AM, Alex Williamson wrote: > >>>>>>>>>>> On Thu, 2015-03-12 at 18:23 +0800, Chen Fan wrote: > >>>>>>>>>>>> when the vfio device encounters an uncorrectable error in host, > >>>>>>>>>>>> the vfio_pci driver will signal the eventfd registered by this > >>>>>>>>>>>> vfio device, the results in the qemu eventfd handler getting > >>>>>>>>>>>> invoked. > >>>>>>>>>>>> > >>>>>>>>>>>> this patch is to pass the error to guest and have the guest > >>>>>>>>>>>> driver > >>>>>>>>>>>> recover from the error. > >>>>>>>>>>> What is going to be the typical recovery mechanism for the guest? > >>>>>>>>>>> I'm > >>>>>>>>>>> concerned that the topology of the device in the guest doesn't > >>>>>>>>>>> necessarily match the topology of the device in the host, so if > >>>>>>>>>>> the > >>>>>>>>>>> guest were to attempt a bus reset to recover a device, for > >>>>>>>>>>> instance, > >>>>>>>>>>> what happens? > >>>>>>>>>> the recovery mechanism is that when guest got an aer error from a > >>>>>>>>>> device, > >>>>>>>>>> guest will clean the corresponding status bit in device register. > >>>>>>>>>> and for > >>>>>>>>>> need reset device, the guest aer driver would reset all devices > >>>>>>>>>> under bus. > >>>>>>>>> Sorry, I'm still confused, how does the guest aer driver reset all > >>>>>>>>> devices under a bus? Are we talking about function-level, device > >>>>>>>>> specific reset mechanisms or secondary bus resets? If the guest is > >>>>>>>>> performing secondary bus resets, what guarantee do they have that it > >>>>>>>>> will translate to a physical secondary bus reset? vfio may only do > >>>>>>>>> an > >>>>>>>>> FLR when the bus is reset or it may not be able to do anything > >>>>>>>>> depending > >>>>>>>>> on the available function-level resets and physical and virtual > >>>>>>>>> topology > >>>>>>>>> of the device. Thanks, > >>>>>>>> in general, functions depends on the corresponding device driver > >>>>>>>> behaviors > >>>>>>>> to do the recovery. e.g: implemented the error_detect, slot_reset > >>>>>>>> callbacks. > >>>>>>>> and for link reset, it usually do secondary bus reset. > >>>>>>>> > >>>>>>>> and do we must require to the physical secondary bus reset for vfio > >>>>>>>> device > >>>>>>>> as bus reset? > >>>>>>> That depends on how the guest driver attempts recovery, doesn't it? > >>>>>>> There are only a very limited number of cases where a secondary bus > >>>>>>> reset initiated by the guest will translate to a secondary bus reset > >>>>>>> of > >>>>>>> the physical device (iirc, single function device without FLR). In > >>>>>>> most > >>>>>>> cases, it will at best be translated to an FLR. VFIO really only does > >>>>>>> bus resets on VM reset because that's the only time we know that it's > >>>>>>> ok > >>>>>>> to reset multiple devices. If the guest driver is depending on a > >>>>>>> secondary bus reset to put the device into a recoverable state and > >>>>>>> we're > >>>>>>> not able to provide that, then we're actually reducing containment of > >>>>>>> the error by exposing AER to the guest and allowing it to attempt > >>>>>>> recovery. So in practice, I'm afraid we're risking the integrity of > >>>>>>> the > >>>>>>> VM by exposing AER to the guest and making it think that it can > >>>>>>> perform > >>>>>>> recovery operations that are not effective. Thanks, > >>>>>> I also have seen that if device without FLR, it seems can do hot reset > >>>>>> by ioctl VFIO_DEVICE_PCI_HOT_RESET to reset the physical slot or bus > >>>>>> in vfio_pci_reset. does it satisfy the recovery issues that you said? > >>>>> The hot reset interface can only be used when a) the user (QEMU) owns > >>>>> all of the devices on the bus and b) we know we're resetting all of the > >>>>> devices. That mostly limits its use to VM reset. I think that on a > >>>>> secondary bus reset, we don't know the scope of the reset at the QEMU > >>>>> vfio driver, so we only make use of reset methods with a function-level > >>>>> scope. That would only result in a secondary bus reset if that's the > >>>>> reset mechanism used by the host kernel's PCI code (pci_reset_function), > >>>>> which is limited to single function devices on a secondary bus, with no > >>>>> other reset mechanisms. The host reset is also only available in some > >>>>> configurations, for instance if we have a dual-port NIC where each > >>>>> function is a separate IOMMU group, then we clearly cannot do a hot > >>>>> reset unless both functions are assigned to the same VM _and_ appear to > >>>>> the guest on the same virtual bus. So even if we could know the scope > >>>>> of the reset in the QEMU vfio driver, we can only make use of it under > >>>>> very strict guest configurations. Thanks, > >>>> Hi Alex, > >>>> > >>>> have you some idea or scenario to fix/escape this issue? > >>> Hi Chen, > >>> > >>> I expect there are two major components to this. The first is that > >>> QEMU/vfio-pci needs to enforce that a bus reset is possible for the host > >>> and guest topology when guest AER handling is specified for a device. > >>> That means that everything affected by the bus reset needs to be exposed > >>> to the guest in a compatible way. For instance, if a bus reset affects > >>> devices from multiple groups, the guest needs to not only own all of > >>> those groups, but they also need to be exposed to the guest such that > >>> the virtual bus layout reflects the extent of the reset for the physical > >>> bus. This also implies that guest AER handling cannot be the default > >>> since it will impose significant configuration restrictions on device > >>> assignment. > >>> > >>> This seems like a difficult configuration enforcement to make, but maybe > >>> there are simplifying assumptions that can help. For instance the > >>> devices need to be exposed as PCIe therefore we won't have multiple > >>> slots in use on a bus and I think we can therefore mostly ignore hotplug > >>> since we can only hotplug at a slot granularity. That may also imply > >>> that we should simply enforce a 1:1 mapping of physical functions to > >>> virtual functions. At least one function from each group affected by a > >>> reset must be exposed to the guest. > >>> > >>> The second issue is that individual QEMU PCI devices have no callback > >>> for a bus reset. QEMU/vfio-pci currently has the DeviceClass.reset > >>> callback, which we assume to be a function-level reset. We also > >>> register with qemu_register_reset() for a VM reset, which is the only > >>> point currently that we know we can do a reset affecting multiple > >>> devices. Infrastructure will need to be added to QEMU/PCI to expose the > >>> link down/RST signal to devices on a bus to trigger a multi-device reset > >>> in vfio-pci. > >>> > >>> Hopefully I'm not missing something, but I think both of those changes > >>> are going to be required before we can have anything remotely > >>> supportable for guest-based AER error handle. This pretty complicated > >>> for the user and also for libvirt to figure out. At a minimum libvirt > >>> would need to support a new guest-based AER handling flag for devices. > >>> We probably need to determine whether this is unique to vfio-pci or a > >>> generic PCIDevice option. Thanks, > >> Hi Alex, > >> Solving the two issues seem like a big workload. do we have a simple > >> way to support qemu AER ? > > Hi Chen, > > > > The simpler way is the existing, containment-only solution where QEMU > > stops the guest on an uncorrected error. Do you have any other > > suggestions? I don't see how we can rely on guest involvement in > > recovery unless the guest has the same abilities to reset the device as > > it would on bare metal. Thanks, > Hi Alex, > > for the first issue, I think the functions affected by a bus reset need > to assign to > guest are too restricted.
Why? If the guest thinks that it's doing a bus reset to recover the device, I don't think we can ignore that or do a lesser function-level reset. If the guest thought it could recover using a function-level reset, it probably would have used that instead. > I suppose if we enable support the aer feature, only need to do > is check the pass through device's host bus whether have other endpoint, > if no other pci device, we can support the host bus reset in qemu vfio-pci. I don't think that restricting the problem to single-function endpoints changes the requirements at all. vfio-pci in QEMU would still need to restrict that AER forwarding to the guest can only be enabled in supported configurations and the QEMU PCI-core code would need to differentiate a PCI bus reset from a regular single device scope reset. In fact, restricting the configuration to single function endpoints appears to be the same amount of work, only reducing the usefulness. Thanks, Alex