Hi, Alex
I was really hoping to hear your opinion, or at least some further discussion of pros and cons rather than simply parroting back my idea.
I understand.
My current thinking is that a resume notifier to userspace is poorly defined: it's not clear what the user can and cannot do between an error notification and the resume notification.
Yes, it is better for the user to do nothing during that interval.
One approach to solve that might be that the kernel internally handles the resume notifications. Maybe that means blocking the ioctl (interruptible timeout) until the internal resume occurs, or maybe that means returning -EAGAIN.
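To make the two alternatives above concrete, here is a minimal userspace model of them in C. It is only a sketch under stated assumptions: `struct model_dev`, `model_tick`, `model_ioctl_eagain` and `model_ioctl_blocking` are hypothetical names, not vfio or kernel APIs, and a tick counter stands in for real time and for the kernel's internal resume.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a device between an error and its internal resume. */
struct model_dev {
    bool recovering;         /* true from the error until the internal resume */
    int  ticks_until_resume; /* stand-in for how long the recovery takes */
};

/* Advance the model's clock by one tick; the resume happens when it hits 0. */
static void model_tick(struct model_dev *d)
{
    assert(d);
    if (d->recovering && d->ticks_until_resume > 0 &&
        --d->ticks_until_resume == 0)
        d->recovering = false;
}

/* Alternative 1: fail fast, the caller is expected to retry later. */
static int model_ioctl_eagain(struct model_dev *d)
{
    assert(d);
    return d->recovering ? -EAGAIN : 0;
}

/* Alternative 2: block until the resume occurs or the timeout expires. */
static int model_ioctl_blocking(struct model_dev *d, int timeout_ticks)
{
    assert(d);
    while (d->recovering) {
        if (timeout_ticks-- <= 0)
            return -ETIMEDOUT;
        model_tick(d); /* a real kernel would sleep interruptibly here */
    }
    return 0;
}
```

In this model, the -EAGAIN variant pushes the retry policy to the caller, while the blocking variant hides the recovery window at the cost of a timeout the kernel has to choose.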
I don't think that is a good idea. It is enough for the kernel to deliver the error and resume notifications; how to use them is up to the user.
Probably implementations of each need to be worked through to determine which is better. We don't want to add complexity to the kernel simply to make things easier for userspace, but we also don't want a poorly specified interface that is difficult for userspace to use correctly. Thanks,
In QEMU, the AER recovery process would be:
1. Detect support for the resume notification.
   If the host vfio driver does not support the resume notification, fail to boot the VM outright when aer is enabled.
2. Notify the VM immediately when an error is detected.
3. Disable the device.
   Unmap the config space and BAR regions.
4. Delay the guest-directed bus reset.
5. Wait for the resume notification.
   If we don't get the resume notification from the host within some timeout, abort the guest-directed bus reset altogether and unplug the device to prevent it from further interacting with the VM.
6. After getting the resume notification, reset the bus and enable the device.

I think we only need to make sure that the disabled device cannot interact with the VM.

Sincerely,
Zhou Jie
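The six recovery steps above can be sketched as a small state machine. This is an illustration only, not QEMU code: every type and function name below is hypothetical, and it assumes that the delayed bus reset is performed on resume only if the guest actually requested one while the device was disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states and events; none of these names come from QEMU. */
enum aer_state { AER_BOOT_FAILED, AER_RUNNING, AER_WAIT_RESUME, AER_UNPLUGGED };
enum aer_event { AER_EV_ERROR, AER_EV_GUEST_RESET, AER_EV_RESUME, AER_EV_TIMEOUT };

struct aer_dev {
    enum aer_state state;
    bool enabled;        /* config space and BARs are mapped */
    bool guest_notified; /* the error was forwarded to the VM */
    bool reset_pending;  /* a guest-directed bus reset is being delayed */
    int  bus_resets;     /* bus resets actually performed */
};

/* Step 1: refuse to start if aer is enabled but the host cannot signal resume. */
static void aer_init(struct aer_dev *d, bool aer_enabled, bool host_has_resume)
{
    assert(d);
    *d = (struct aer_dev){ 0 };
    if (aer_enabled && !host_has_resume) {
        d->state = AER_BOOT_FAILED;
        return;
    }
    d->state = AER_RUNNING;
    d->enabled = true;
}

static void aer_handle(struct aer_dev *d, enum aer_event ev)
{
    assert(d);
    switch (d->state) {
    case AER_RUNNING:
        if (ev == AER_EV_ERROR) {
            d->guest_notified = true;   /* step 2: notify the VM */
            d->enabled = false;         /* step 3: disable, unmap */
            d->state = AER_WAIT_RESUME; /* step 5: wait for resume */
        } else if (ev == AER_EV_GUEST_RESET) {
            d->bus_resets++;            /* no recovery in progress */
        }
        break;
    case AER_WAIT_RESUME:
        if (ev == AER_EV_GUEST_RESET) {
            d->reset_pending = true;    /* step 4: delay the reset */
        } else if (ev == AER_EV_RESUME) {
            if (d->reset_pending) {     /* step 6: reset, then enable */
                d->bus_resets++;
                d->reset_pending = false;
            }
            d->enabled = true;
            d->guest_notified = false;
            d->state = AER_RUNNING;
        } else if (ev == AER_EV_TIMEOUT) {
            d->reset_pending = false;   /* step 5: abort the delayed reset */
            d->state = AER_UNPLUGGED;   /* and unplug the device */
        }
        break;
    default:
        break;                          /* terminal states ignore events */
    }
}
```

The point of the sketch is that the device stays disabled for the whole time it is in the waiting state, so whatever the guest does in that window cannot reach the hardware.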
Alex .
-- ------------------------------------------------ 周潔 Dept 1 No. 6 Wenzhu Road, Nanjing, 210012, China TEL:+86+25-86630566-8557 FUJITSU INTERNAL:7998-8557 E-Mail:zhoujie2...@cn.fujitsu.com ------------------------------------------------