Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
On 21/03/2023 10:38, Dan Williams wrote: > lizhij...@fujitsu.com wrote: > [..] >>> Now I do think it would be a good idea to fail device_add() if the bus >>> is not registered, >> >> BTW, below line 369: device_add() didn't fail in practical. >> > > I think that's ok because the device gets added, but never probed due to > this part of that commit I referenced: Very thanks for your explanation. > > @@ -503,20 +517,21 @@ int bus_add_device(struct device *dev) >*/ > void bus_probe_device(struct device *dev) > { > - struct bus_type *bus = dev->bus; > + struct subsys_private *sp = bus_to_subsys(dev->bus); > struct subsys_interface *sif; > > - if (!bus) > + if (!sp) > return; > > > ...so it does what you want which is disable the libnvdimm subsystem > from binding to any devices without crashing. Yes, it's fine enough to me.
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
On 21/03/2023 01:30, Dan Williams wrote: > lizhij...@fujitsu.com wrote: > [..] >>>> Dan, >>>> >>>> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra >>>> kernel booting parameter >>>> 'initcall_blacklist=libnvdimm_init'. Then kernel panic! >>> >>> That's expected though, >> >> Do you mean we just keep it as it is. > > Yes. > >> >> >>> you can't block libnvdimm_init and then expect >>> modules that link to libnvdimm to work. >> Ah, we would rather see it *unable to work* than panic, isn't it. > > That part is true, but consider the implications of adding error > handling to all code that can no longer depend on initcall ordering, not > just libnvdimm. This would be a large paradigm shift. > > Now I do think it would be a good idea to fail device_add() if the bus > is not registered, but I *think* that happens now as a result of: > > 5221b82d46f2 driver core: bus: bus_add/probe/remove_device() cleanups > > ...can you double check if you have that commit in your tests? Great, panic is gone after i upgraded to the upstream! > Now I do think it would be a good idea to fail device_add() if the bus > is not registered, BTW, below line 369: device_add() didn't fail in practical. 354 mutex_init(_bus->reconfig_mutex); 355 badrange_init(_bus->badrange); 356 nvdimm_bus->nd_desc = nd_desc; 357 nvdimm_bus->dev.parent = parent; 358 nvdimm_bus->dev.type = _bus_dev_type; 359 nvdimm_bus->dev.groups = nd_desc->attr_groups; 360 nvdimm_bus->dev.bus = _bus_type; 361 nvdimm_bus->dev.of_node = nd_desc->of_node; 362 device_initialize(_bus->dev); 363 lockdep_set_class(_bus->dev.mutex, _bus_key); 364 device_set_pm_not_required(_bus->dev); 365 rc = dev_set_name(_bus->dev, "ndbus%d", nvdimm_bus->id); 366 if (rc) 367 goto err; 368 369 rc = device_add(_bus->dev); 370 dev_err(_bus->dev, "registration failed: %d\n", rc); Thanks Zhijian
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
On 17/03/2023 13:59, Dan Williams wrote: > lizhij...@fujitsu.com wrote: >> >> >> On 16/03/2023 23:54, Dan Williams wrote: >>> Li Zhijian wrote: >>>> nvdimm_bus_register() could be called from other modules, such as nfit, >>>> but it can only be called after the nvdimm_bus_type is registered. >>>> >>>>BUG: kernel NULL pointer dereference, address: 0098 >>>>#PF: supervisor read access in kernel mode >>>>#PF: error_code(0x) - not-present page >>>>PGD 0 P4D 0 >>>>Oops: [#1] PREEMPT SMP PTI >>>>CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97 >>>>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 >>>>RIP: 0010:bus_add_device+0x58/0x150 >>>>Call Trace: >>>> >>>> device_add+0x3ac/0x980 >>>> nvdimm_bus_register+0x16d/0x1d0 >>>> acpi_nfit_init+0xb72/0x1f90 [nfit] >>>> acpi_nfit_add+0x1d5/0x200 [nfit] >>>> acpi_device_probe+0x45/0x160 >>> >>> Can you explain a bit more how to hit this crash? This has not been a >>> problem historically and the explanation above makes it sound like this >>> is a theoretical issue. >>> >> >> Dan, >> >> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra >> kernel booting parameter >> 'initcall_blacklist=libnvdimm_init'. Then kernel panic! > > That's expected though, Do you mean we just keep it as it is. > you can't block libnvdimm_init and then expect > modules that link to libnvdimm to work. Ah, we would rather see it *unable to work* than panic, isn't it. Thanks Zhijian > You would also need to block all > modules / initcalls that depend on libnvdimm_init having runI'll > respond to the other thread with some ideas.
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
On 17/03/2023 10:26, Dan Williams wrote: > lizhij...@fujitsu.com wrote: >> >> >> On 16/03/2023 23:54, Dan Williams wrote: >>> Li Zhijian wrote: >>>> nvdimm_bus_register() could be called from other modules, such as nfit, >>>> but it can only be called after the nvdimm_bus_type is registered. >>>> >>>>BUG: kernel NULL pointer dereference, address: 0098 >>>>#PF: supervisor read access in kernel mode >>>>#PF: error_code(0x) - not-present page >>>>PGD 0 P4D 0 >>>>Oops: [#1] PREEMPT SMP PTI >>>>CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97 >>>>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 >>>>RIP: 0010:bus_add_device+0x58/0x150 >>>>Call Trace: >>>> >>>> device_add+0x3ac/0x980 >>>> nvdimm_bus_register+0x16d/0x1d0 >>>> acpi_nfit_init+0xb72/0x1f90 [nfit] >>>> acpi_nfit_add+0x1d5/0x200 [nfit] >>>> acpi_device_probe+0x45/0x160 >>> >>> Can you explain a bit more how to hit this crash? This has not been a >>> problem historically and the explanation above makes it sound like this >>> is a theoretical issue. >>> >> >> Dan, >> >> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra >> kernel booting parameter >> 'initcall_blacklist=libnvdimm_init'. Then kernel panic! >> Theoretically, it will also happen if nvdimm_bus_register() failed. >> >> >> For kdump purpose[1], we need to disable libnvdimm driver to ensure metadata >> in pmem will not be updated again in kdump kernel >> [1]https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com/T/ > > Ah, great write up! Let me give that some thought. That would be great. > Apologies for missing it earlier. Never mind :) > > This would have been a good use for: > > Link: > https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com > > ...in the above changelog. sounds good, i will update it in next version. Thanks Zhijian
Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus
On 16/03/2023 23:54, Dan Williams wrote: > Li Zhijian wrote: >> nvdimm_bus_register() could be called from other modules, such as nfit, >> but it can only be called after the nvdimm_bus_type is registered. >> >> BUG: kernel NULL pointer dereference, address: 0098 >> #PF: supervisor read access in kernel mode >> #PF: error_code(0x) - not-present page >> PGD 0 P4D 0 >> Oops: [#1] PREEMPT SMP PTI >> CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 >> RIP: 0010:bus_add_device+0x58/0x150 >> Call Trace: >> >>device_add+0x3ac/0x980 >>nvdimm_bus_register+0x16d/0x1d0 >>acpi_nfit_init+0xb72/0x1f90 [nfit] >>acpi_nfit_add+0x1d5/0x200 [nfit] >>acpi_device_probe+0x45/0x160 > > Can you explain a bit more how to hit this crash? This has not been a > problem historically and the explanation above makes it sound like this > is a theoretical issue. > Dan, Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra kernel booting parameter 'initcall_blacklist=libnvdimm_init'. Then kernel panic! Theoretically, it will also happen if nvdimm_bus_register() failed. For kdump purpose[1], we need to disable libnvdimm driver to ensure metadata in pmem will not be updated again in kdump kernel [1]https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com/T/ Thanks Zhijian > libnvdimm_init() *should* be done before the nfit driver can attempt > nvdimm_bus_register(). So, something else is broken if > nvdimm_bus_register() can be called before libnvdimm_init(), or after > libnvdimm_exit().
Re: [RFC PATCH v2 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
On 11/11/2022 17:22, Daisuke Matsuda wrote: > ib_umem_odp_map_dma_single_page(), which has been used only by the mlx5 > driver, holds umem_mutex on success and releases on failure. This > behavior is not convenient for other drivers to use it, so change it to > always retain mutex on return. > > Signed-off-by: Daisuke Matsuda > --- > drivers/infiniband/core/umem_odp.c | 8 +++- > drivers/infiniband/hw/mlx5/odp.c | 4 +++- > 2 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/drivers/infiniband/core/umem_odp.c > b/drivers/infiniband/core/umem_odp.c > index e9fa22d31c23..49da6735f7c8 100644 > --- a/drivers/infiniband/core/umem_odp.c > +++ b/drivers/infiniband/core/umem_odp.c > @@ -328,8 +328,8 @@ static int ib_umem_odp_map_dma_single_page( >* >* Maps the range passed in the argument to DMA addresses. >* The DMA addresses of the mapped pages is updated in umem_odp->dma_list. > - * Upon success the ODP MR will be locked to let caller complete its device > - * page table update. > + * The umem mutex is locked in this function. Callers are responsible for > + * releasing the lock. >* >* Returns the number of pages mapped in success, negative error code >* for failure. > @@ -453,11 +453,9 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp > *umem_odp, u64 user_virt, > break; > } > } > - /* upon success lock should stay on hold for the callee */ > + > if (!ret) > ret = dma_index - start_idx; > - else > - mutex_unlock(_odp->umem_mutex); > > out_put_mm: > mmput_async(owning_mm); > diff --git a/drivers/infiniband/hw/mlx5/odp.c > b/drivers/infiniband/hw/mlx5/odp.c > index bc97958818bb..a0de27651586 100644 > --- a/drivers/infiniband/hw/mlx5/odp.c > +++ b/drivers/infiniband/hw/mlx5/odp.c > @@ -572,8 +572,10 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, > struct ib_umem_odp *odp, > access_mask |= ODP_WRITE_ALLOWED_BIT; > > np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, > fault); > - if (np < 0) > + if (np < 0) { > + mutex_unlock(>umem_mutex); > return np; > + } refer to the comments of ib_umem_odp_map_dma_and_lock: 334 * Returns the number of pages mapped in success, negative error code 335 * for failure. I don't think it's correct to release the lock in all failure case, for example when it reaches below error path. 346 int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt, 347 u64 bcnt, u64 access_mask, bool fault) 348 __acquires(_odp->umem_mutex) 349 { 350 struct task_struct *owning_process = NULL; 351 struct mm_struct *owning_mm = umem_odp->umem.owning_mm; 352 int pfn_index, dma_index, ret = 0, start_idx; 353 unsigned int page_shift, hmm_order, pfn_start_idx; 354 unsigned long num_pfns, current_seq; 355 struct hmm_range range = {}; 356 unsigned long timeout; 357 358 if (access_mask == 0) 359 return -EINVAL; < no lock is hold yet 360 361 if (user_virt < ib_umem_start(umem_odp) || 362 user_virt + bcnt > ib_umem_end(umem_odp)) 363 return -EFAULT; < no lock is hold yet Further more, you changed public API's the behavior, do it matter for other out-of-tree drivers which is using it, i'm not familiar with this, maybe kernel has no restriction on it ? > > /* >* No need to check whether the MTTs really belong to this MR, since
Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d
Hi all I have verified that below changes can solve this problem diff --git a/mm/hmm.c b/mm/hmm.c index 24f9ff95f3ae..2c9a3e3eefce 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -294,7 +294,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, * Since each architecture defines a struct page for the zero page, just * fall through and treat it like a normal page. */ - if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) { + if (!pte_devmap(pte) && pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) { if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) { pte_unmap(ptep); return -EFAULT; i looked over the change-log of hmm_vma_handle_pte(), and found that before 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling") hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true. when we reached "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {" the pte have already presented and its pte's flag already fulfilled the request flags. My question is that Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits, pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, right ? if so, what's the expected code path for fsdax case ? commit 4055062749229101365e5f9e87cb1c5a93e292f8 Author: Jason Gunthorpe Date: Thu Mar 5 14:27:20 2020 -0400 mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses). EFAULT should only be returned after testing with hmm_pte_need_fault(). Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED, pte_special() is stubbed out to return false on unsupported architectures. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Reviewed-by: Ralph Campbell Reviewed-by: Christoph Hellwig Signed-off-by: Jason Gunthorpe diff --git a/mm/hmm.c b/mm/hmm.c index 3a03fcf..9c82ea9 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -339,16 +339,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, pte_unmap(ptep); return -EBUSY; } - } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) { - if (!is_zero_pfn(pte_pfn(pte))) { + } + + /* + * Since each architecture defines a struct page for the zero page, just + * fall through and treat it like a normal page. + */ + if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) { + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, , + _fault); + if (fault || write_fault) { pte_unmap(ptep); - *pfn = range->values[HMM_PFN_SPECIAL]; return -EFAULT; } - /* - * Since each architecture defines a struct page for the zero - * page, just fall through and treat it like a normal page. - */ + *pfn = range->values[HMM_PFN_SPECIAL]; + return 0; } *pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags; Thanks Zhijian On 06/08/2021 11:20, Li, Zhijian wrote: > Hi Jason > > thank you for your advice. > > CCing nvdimm > > both ext4 and xfs are impacted. > > Thanks > > > at 2021/8/6 9:45, Jason Gunthorpe wrote: >> On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote: >>> convert to text and send again >>> >>> 2021/8/4 15:55, Li, Zhijian wrote: Hey all: Recently, i reported a issue to rpmahttps://github.com/pmem/rpma/issues/1142 where we found that the native rpma + fsdax example failed in recent kernel. Below is the bisect log [lizhijian@yl linux]$ git bisect log git bisect start # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9 git bisect good bbf5c979011a099af5dc76498918ed7df445635b # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10 git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442 # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add failure injection to usercopy functions git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/... git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8 # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 'wireless-drivers-next-2020-10-02' of