Re: [PATCH v3 2/8] cxl/acpi: Add root device lockdep validation

2022-04-23 Thread Dan Williams
On Fri, Apr 22, 2022 at 5:08 PM Dan Williams  wrote:
>
> On Fri, Apr 22, 2022 at 4:58 PM Ira Weiny  wrote:
> >
> > On Thu, Apr 21, 2022 at 08:33:18AM -0700, Dan Williams wrote:
> > > The CXL "root" device, ACPI0017, is an attach point for coordinating
> > > platform level CXL resources and is the parent device for a CXL port
> > > topology tree. As such it has distinct locking rules relative to other
> > > CXL subsystem objects, but because it is an ACPI device the lock class
> > > is established well before it is given to the cxl_acpi driver.
> >
> > This final sentence gave me pause because it implied that the device lock 
> > class
> > was set to something other than no validate.  But I don't see that anywhere 
> > in
> > the acpi code.  So given that it looks to me like ACPI is just using the
> > default no validate class...
>
> Oh, good observation. *If* ACPI had set a custom lock class then
> cxl_acpi would need to be careful to restore that ACPI-specific class
> and not reset it to "no validate" on exit, or skip setting its own
> custom class. However, I think for generic buses like ACPI that feed
> devices into other subsystems it likely has little reason to set its
> own class. For safety, since device_lock_set_class() is general
> purpose, I'll have it emit a debug message and fail if the class is
> not "no validate" on entry.
>

So this turned out way uglier than I expected:

 drivers/cxl/acpi.c |4 +++-
 include/linux/device.h |   33 +
 2 files changed, 28 insertions(+), 9 deletions(-)

...so I'm going to drop it and just add a comment about the
expectations. As Peter said there's already a multitude of ways to
cause false positive / negative results with lockdep so this is just
one more area where one needs to be careful and understand the lock
context they might be overriding.



Re: [PATCH v3 8/8] nvdimm: Fix firmware activation deadlock scenarios

2022-04-23 Thread Dan Williams
On Fri, Apr 22, 2022 at 9:28 PM Ira Weiny  wrote:
>
> On Thu, Apr 21, 2022 at 08:33:51AM -0700, Dan Williams wrote:
> > Lockdep reports the following deadlock scenarios for CXL root device
> > power-management, device_prepare(), operations, and device_shutdown()
> > operations for 'nd_region' devices:
> >
> > ---
> >  Chain exists of:
> >&nvdimm_region_key --> &nvdimm_bus->reconfig_mutex --> 
> > system_transition_mutex
> >
> >   Possible unsafe locking scenario:
> >
> > CPU0CPU1
> > 
> >lock(system_transition_mutex);
> > lock(&nvdimm_bus->reconfig_mutex);
> > lock(system_transition_mutex);
> >lock(&nvdimm_region_key);
> >
> > --
> >
> >  Chain exists of:
> >&cxl_nvdimm_bridge_key --> acpi_scan_lock --> &cxl_root_key
> >
> >   Possible unsafe locking scenario:
> >
> > CPU0CPU1
> > 
> >lock(&cxl_root_key);
> > lock(acpi_scan_lock);
> > lock(&cxl_root_key);
> >lock(&cxl_nvdimm_bridge_key);
> >
> > ---
> >
> > These stem from holding nvdimm_bus_lock() over hibernate_quiet_exec()
> > which walks the entire system device topology taking device_lock() along
> > the way. The nvdimm_bus_lock() is protecting against unregistration,
> > multiple simultaneous ops callers, and preventing activate_show() from
> > racing activate_store(). For the first 2, the lock is redundant.
> > Unregistration already flushes all ops users, and sysfs already prevents
> > multiple threads to be active in an ops handler at the same time. For
> > the last userspace should already be waiting for its last
> > activate_store() to complete, and does not need activate_show() to flush
> > the write side, so this lock usage can be deleted in these attributes.
> >
>
> I'm sorry if this is obvious but why can't the locking be removed from
> capability_show() and nvdimm_bus_firmware_visible() as well?

It can, that's a good catch, thanks.



Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-23 Thread Dan Williams
On Fri, Apr 22, 2022 at 5:02 PM Dave Chinner  wrote:
>
> On Fri, Apr 22, 2022 at 02:27:32PM -0700, Dan Williams wrote:
> > On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner  wrote:
> > >
> > > On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > > > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > > > a single release. However, being unable to cleanly merge code we
> > > > > need integrated into our local subsystem tree for integration
> > > > > testing because a patch dependency with another subsystem won't gain
> > > > > a stable commit ID until the next merge window is  distinctly
> > > > > suboptimal.
> > > >
> > > > Yes.  Which is why we've taken a lot of mm patchs through other trees,
> > > > sometimes specilly crafted for that.  So I guess in this case we'll
> > > > just need to take non-trivial dependencies into the XFS tree, and just
> > > > deal with small merge conflicts for the trivial ones.
> > >
> > > OK. As Naoyo has pointed out, the first dependency/conflict Ruan has
> > > listed looks trivial to resolve.
> > >
> > > The second dependency, OTOH, is on a new function added in the patch
> > > pointed to. That said, at first glance it looks to be independent of
> > > the first two patches in that series so I might just be able to pull
> > > that one patch in and have that leave us with a working
> > > fsdax+reflink tree.
> > >
> > > Regardless, I'll wait to see how much work the updated XFS/DAX
> > > reflink enablement patchset still requires when Ruan posts it before
> > > deciding what to do here.  If it isn't going to be a merge
> > > candidate, what to do with this patchset is moot because there's
> > > little to test without reflink enabled...
> >
> > I do have a use case for this work absent the reflink work.  Recall we
> > had a conversation about how to communicate "dax-device has been
> > ripped away from the fs" events and we ended up on the idea of reusing
> > ->notify_failure(), but with the device's entire logical address range
> > as the notification span. That will let me unwind and delete the
> > PTE_DEVMAP infrastructure for taking extra device references to hold
> > off device-removal. Instead ->notify_failure() arranges for all active
> > DAX mappings to be invalidated and allow the removal to proceed
> > especially since physical removal does not care about software pins.
>
> Sure. My point is that if the reflink enablement isn't ready to go,
> then from an XFS POV none of this matters in this cycle and we can
> just leave the dependencies to commit via Andrew's tree. Hence by
> the time we get to the reflink enablement all the prior dependencies
> will have been merged and have stable commit IDs, and we can just
> stage this series and the reflink enablement as we normally would in
> the next cycle.
>
> However, if we don't get the XFS reflink dax enablement sorted out
> in the next week or two, then we don't need this patchset in this
> cycle. Hence if you still need this patchset for other code you need
> to merge in this cycle, then you're the poor schmuck that has to run
> the mm-tree conflict guantlet to get a stable commit ID for the
> dependent patches in this cycle, not me

Yup. Let's give it another week or so to see if the reflink rebase
materializes and go from there.



[PATCH] tools/testing/nvdimm: remove unneeded flush_workqueue

2022-04-23 Thread cgel . zte
From: ran jianping 

All work currently pending will be done first by calling destroy_workqueue,
so there is no need to flush it explicitly.

Reported-by: Zeal Robot 
Signed-off-by: ran jianping 
---
 tools/testing/nvdimm/test/nfit.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 1da76ccde448..e7e1a640e482 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -3375,7 +3375,6 @@ static __exit void nfit_test_exit(void)
 {
int i;
 
-   flush_workqueue(nfit_wq);
destroy_workqueue(nfit_wq);
for (i = 0; i < NUM_NFITS; i++)
platform_device_unregister(&instances[i]->pdev);
-- 
2.25.1