On Tue, Sep 17, 2019 at 1:36 PM Oliver O'Halloran <ooh...@gmail.com> wrote:
>
> On Tue, Sep 17, 2019 at 1:16 PM Sam Bobroff <sbobr...@linux.ibm.com> wrote:
> >
> > On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
> > > Detecting a frozen EEH PE usually occurs when an MMIO load returns a
> > > 0xFFs response. When performing EEH testing using the EEH error
> > > injection feature available on some platforms there is no simple way
> > > to kick-off the kernel's recovery process since any accesses from
> > > userspace (usually /dev/mem) will bypass the MMIO helpers in the
> > > kernel which check if a 0xFF response is due to an EEH freeze or not.
> > >
> > > If a device contains a 0xFF byte in its config space it's possible to
> > > trigger the recovery process via config space read from userspace,
> > > but this is not a reliable method. If a driver is bound to the device
> > > and in use it will frequently trigger the MMIO check, but this is
> > > also inconsistent.
> > >
> > > To solve these problems this patch adds a debugfs file called
> > > "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and
> > > runs eeh_dev_check_failure() on it. This is the same check that's
> > > done when the kernel gets a 0xFF result from a config or MMIO read
> > > with the added benefit that it can be reliably triggered from
> > > userspace.
> > >
> > > Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> >
> > Looks good, and I tested it with the next patch and it seems to work.
> >
> > But I think you should make it clear that this does not work with
> > the hardware "EEH error injection" facility accessible via debugfs in
> > err_injct (that doesn't seem clear to me from the commit message).
>
> It's not intended to be a separate mechanism in the long term. I'm
> planning on converting this interface to make use of the platform
> defined error injection mechanism once I can find how to use the PAPR
> ones reliably. The idea is to use this as a generic "cause an EEH to
> happen on this device" interface for userspace which we can use in
> test scripts and the like.

Urgh, I'm tired and thought this was the eeh_debugfs_break patch.

This (the _check) debugfs interface does work with the HW error
injection facilities. After the HW injects an error the PE is frozen,
but the kernel doesn't notice until something runs
eeh_dev_check_failure(). This interface gives userspace a reliable way
to do that rather than relying on drivers doing MMIO, or somewhere in
config space containing a 0xFF.

Oliver
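
For test scripts, the userspace side could look roughly like the sketch
below. It only illustrates the flow described above (write a
<domain>:<bus>:<dev>.<fn> string to the "eeh_dev_check" file so the kernel
runs eeh_dev_check_failure() on that device). The debugfs path, the helper
names and the strictness of the address check are assumptions made for
illustration and are not taken from the patch.

```python
import re
from pathlib import Path

# ASSUMPTION: debugfs is mounted at /sys/kernel/debug and the file sits in
# the powerpc directory; the thread itself only names "eeh_dev_check".
DEFAULT_CHECK_FILE = Path("/sys/kernel/debug/powerpc/eeh_dev_check")

# <domain>:<bus>:<dev>.<fn>, all fields in hex (hypothetical validation).
_BDFN_RE = re.compile(
    r"^[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,2}:[0-9a-fA-F]{1,2}\.[0-7]$"
)


def format_bdfn(domain, bus, dev, fn):
    """Build a <domain>:<bus>:<dev>.<fn> string from integer fields."""
    return f"{domain:04x}:{bus:02x}:{dev:02x}.{fn:x}"


def request_eeh_check(addr, check_file=DEFAULT_CHECK_FILE):
    """Write a PCI address to the check file (hypothetical helper).

    On a real system this needs root and a kernel carrying the patch.
    """
    if not _BDFN_RE.match(addr):
        raise ValueError(f"not a <domain>:<bus>:<dev>.<fn> address: {addr!r}")
    with open(check_file, "w") as f:
        f.write(addr)
```

A test script would first freeze the PE using the HW error injection
facility and then call something like request_eeh_check("0001:01:00.0")
so the kernel notices the freeze and starts recovery.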