On Mon, May 1, 2017 at 8:34 AM, Kani, Toshimitsu <toshi.k...@hpe.com> wrote:
> On Sun, 2017-04-30 at 05:39 -0700, Dan Williams wrote:
>> Toshi noticed that the new support for a region-level badblocks
>> missed the case where errors are cleared due to BTT I/O.
>>
>> An initial attempt to fix this ran into a "sleeping while atomic"
>> warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
>> satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
>> However, that lock is not needed since we are not acting any data
>> that is subject to change due to a change of state of the bus /
>> region. The badblocks instance has its own internal lock to handle
>> mutations of the error list.
>>
>> So, to make it clear that we are just acting on region devices and
>> don't need the lock rename __nvdimm_bus_badblocks_clear() to
>> nvdimm_clear_badblocks_regions(). Eliminate the lock and consolidate
>> all routines in drivers/nvdimm/bus.c. Also, make some cleanups to
>> remove unnecessary casts, make the calling convention of
>> nvdimm_clear_badblocks_regions() clearer by replacing struct resource
>> with the minimal struct clear_badblocks_context, and use the
>> DEVICE_ATTR macro.
>
> Hi Dan,
>
> I was testing the change with CONFIG_DEBUG_ATOMIC_SLEEP set this time,
> and hit the following BUG with BTT. This is a separate issue (not
> introduced by this patch), but it shows that we have an issue with the
> DSM call path as well.

Ah, great find, thanks! We don't see this in the unit tests because the nfit_test infrastructure takes no sleeping actions in its simulated DSM path. Outside of converting btt to use sleeping locks, I'm not sure I see a path forward. I wonder how bad the performance impact of that would be? Perhaps with opportunistic spinning it won't be so bad, but I don't see another choice.
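
To make the "sleeping lock with opportunistic spinning" idea concrete, here is a rough userspace sketch using pthreads. It is not the btt code and not how the kernel mutex is implemented; the names lane, lane_acquire(), run_workers() and the SPIN_LIMIT value are all made up for illustration. The acquire path retries a non-blocking trylock a bounded number of times and only then falls back to a blocking lock, so short critical sections rarely pay for a sleep, while a holder that itself sleeps (as a DSM call can) no longer forces waiters to spin indefinitely.

```c
#include <assert.h>
#include <pthread.h>

#define SPIN_LIMIT  1000   /* hypothetical spin budget, not a tuned value */
#define MAX_WORKERS 16     /* hypothetical cap for this demo */

/* Stand-in for a per-lane structure protected by a sleeping lock. */
struct lane {
	pthread_mutex_t lock;
	long counter;
};

struct worker_arg {
	struct lane *lane;
	long iters;
};

/* Spin briefly with trylock, then fall back to a blocking acquire. */
static void lane_acquire(struct lane *lane)
{
	for (int i = 0; i < SPIN_LIMIT; i++) {
		if (pthread_mutex_trylock(&lane->lock) == 0)
			return;	/* got the lock without sleeping */
	}
	pthread_mutex_lock(&lane->lock);	/* may sleep */
}

static void lane_release(struct lane *lane)
{
	pthread_mutex_unlock(&lane->lock);
}

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	for (long i = 0; i < arg->iters; i++) {
		lane_acquire(arg->lane);
		arg->lane->counter++;	/* the protected "I/O" step */
		lane_release(arg->lane);
	}
	return NULL;
}

/* Run nthreads workers against one lane and return the final counter. */
long run_workers(int nthreads, long iters)
{
	struct lane lane = { .counter = 0 };
	struct worker_arg arg = { &lane, iters };
	pthread_t tids[MAX_WORKERS];

	assert(nthreads > 0 && nthreads <= MAX_WORKERS);
	pthread_mutex_init(&lane.lock, NULL);

	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tids[i], NULL, worker, &arg);
		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	pthread_mutex_destroy(&lane.lock);
	return lane.counter;
}
```

Calling run_workers(4, 10000) should return 40000 if the lock provides mutual exclusion. The kernel's own mutex is smarter than this (it spins only while the lock owner is actually running on a CPU), so the sketch says nothing about the real performance cost, which would still need to be measured.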