On Mon, Feb 26, 2024 at 3:42 PM Jason Wang <jasow...@redhat.com> wrote: > > On Fri, Feb 16, 2024 at 4:25 PM David Stevens <steve...@chromium.org> wrote: > > > > Add a virtio power management PCI capability to allow drivers to suspend > > virtio PCI devices. This allows drivers to suspend devices at the virtio > > level before suspending them at the PCI transport layer. This allows > > drivers to do a two phase suspend, which prevents notifications from > > being ignored or lost if interrupts are reconfigured at the PCI > > transport layer immediately before or after the device is put into the > > PCI PM D3 low power state. > > --- > > transport-pci.tex | 51 +++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 51 insertions(+) > > > > I would like to know how the two phase suspend can prevent the loss of > notification or interrupt. Or for example, if it can be workaround at > the driver level.
I'll preface this by saying I am not an expert in how Linux handles interrupts or in PCI, so I might be missing some important information. Actually, after thinking about this some more, I'm not sure that the two phase suspend is necessary per-se. Rather, I think relying only on PCI power management is simply insufficient. According to the kernel's system suspend and device interrupts documentation [1], "... after the 'late' phase of device suspend there is no legitimate reason why any interrupts from suspended devices should trigger and if any devices have not been suspended properly yet, it is better to block interrupts from them anyway." PCI power management doesn't occur until the noirq phase of suspend, so that's too late to be useful here. Other than that, as of right now, there's nothing the driver can do to tell devices they need to stop sending interrupts. Adding suspend support to the virtio spec would address this deficiency. More concretely, I ran into two problems when trying to get shallow suspend working with only PCI power management. If there are workarounds for this at the driver level, then that would be great. But I can't think of any, at least. First, there is a window between when Linux suspends interrupts with suspend_device_irqs and when PCI power management occurs. If the device triggers an interrupt in that window, then it won't be lost since IRQS_PENDING gets added to the descriptor state. However, if the originating event should have woken the guest up from sleep, then that doesn't happen, since there is no way for the device to know it actually needed to generate a PCI PME. I might be missing something, but this seems like something that can't be worked around at the driver level. Second, the core interrupt code can migrate interrupts when CPUs are offline'ed entering shallow suspend. Since PCI devices are already in D3 at this point, __pci_write_msi_msg does not update the MSI routing. When bringing the device back up, there is a window between when the device is put into D0 and when the updated routing tables are written by __pci_restore_msix_state. If the device sends an interrupt in this window, then it will be sent to the completely wrong handler in the guest (or dropped if there is no handler installed). I guess this could be worked around in the core PCI code by masking MSIX interrupts before suspending the device. However, the fact that Linux doesn't already do that today suggests that we're wrong, rather than everybody else happens to be wrong in just the right way. [1] https://www.kernel.org/doc/Documentation/power/suspend-and-interrupts.rst -David --------------------------------------------------------------------- To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org