Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-20 Thread lizhij...@fujitsu.com


On 21/03/2023 10:38, Dan Williams wrote:
> lizhij...@fujitsu.com wrote:
> [..]
>>> Now I do think it would be a good idea to fail device_add() if the bus
>>> is not registered,
>>
>> BTW, below line 369: device_add() didn't fail in practical.
>>
> 
> I think that's ok because the device gets added, but never probed due to
> this part of that commit I referenced:

Very thanks for your explanation.

> 
> @@ -503,20 +517,21 @@ int bus_add_device(struct device *dev)
>*/
>   void bus_probe_device(struct device *dev)
>   {
> -   struct bus_type *bus = dev->bus;
> +   struct subsys_private *sp = bus_to_subsys(dev->bus);
>  struct subsys_interface *sif;
>   
> -   if (!bus)
> +   if (!sp)
>  return;
>   
> 
> ...so it does what you want which is disable the libnvdimm subsystem
> from binding to any devices without crashing.

Yes, it's fine enough to me.


Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-20 Thread lizhij...@fujitsu.com


On 21/03/2023 01:30, Dan Williams wrote:
> lizhij...@fujitsu.com wrote:
> [..]
>>>> Dan,
>>>>
>>>> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
>>>> kernel booting parameter
>>>> 'initcall_blacklist=libnvdimm_init'. Then kernel panic!
>>>
>>> That's expected though,
>>
>> Do you mean we just keep it as it is.
> 
> Yes.
> 
>>
>>
>>> you can't block libnvdimm_init and then expect
>>> modules that link to libnvdimm to work.
>> Ah, we would rather see it *unable to work* than panic, isn't it.
> 
> That part is true, but consider the implications of adding error
> handling to all code that can no longer depend on initcall ordering, not
> just libnvdimm. This would be a large paradigm shift.
> 
> Now I do think it would be a good idea to fail device_add() if the bus
> is not registered, but I *think* that happens now as a result of:
> 
> 5221b82d46f2 driver core: bus: bus_add/probe/remove_device() cleanups
> 
> ...can you double check if you have that commit in your tests?

Great, panic is gone after i upgraded to the upstream!


> Now I do think it would be a good idea to fail device_add() if the bus
> is not registered,

BTW, below line 369: device_add() didn't fail in practical.

  354 mutex_init(_bus->reconfig_mutex);
  355 badrange_init(_bus->badrange);
  356 nvdimm_bus->nd_desc = nd_desc;
  357 nvdimm_bus->dev.parent = parent;
  358 nvdimm_bus->dev.type = _bus_dev_type;
  359 nvdimm_bus->dev.groups = nd_desc->attr_groups;
  360 nvdimm_bus->dev.bus = _bus_type;
  361 nvdimm_bus->dev.of_node = nd_desc->of_node;
  362 device_initialize(_bus->dev);
  363 lockdep_set_class(_bus->dev.mutex, _bus_key);
  364 device_set_pm_not_required(_bus->dev);
  365 rc = dev_set_name(_bus->dev, "ndbus%d", nvdimm_bus->id);
  366 if (rc)
  367 goto err;
  368
  369 rc = device_add(_bus->dev);
  370 dev_err(_bus->dev, "registration failed: %d\n", rc);

Thanks
Zhijian

Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-19 Thread lizhij...@fujitsu.com


On 17/03/2023 13:59, Dan Williams wrote:
> lizhij...@fujitsu.com wrote:
>>
>>
>> On 16/03/2023 23:54, Dan Williams wrote:
>>> Li Zhijian wrote:
>>>> nvdimm_bus_register() could be called from other modules, such as nfit,
>>>> but it can only be called after the nvdimm_bus_type is registered.
>>>>
>>>>BUG: kernel NULL pointer dereference, address: 0098
>>>>#PF: supervisor read access in kernel mode
>>>>#PF: error_code(0x) - not-present page
>>>>PGD 0 P4D 0
>>>>Oops:  [#1] PREEMPT SMP PTI
>>>>CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97
>>>>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>>>>RIP: 0010:bus_add_device+0x58/0x150
>>>>Call Trace:
>>>> 
>>>> device_add+0x3ac/0x980
>>>> nvdimm_bus_register+0x16d/0x1d0
>>>> acpi_nfit_init+0xb72/0x1f90 [nfit]
>>>> acpi_nfit_add+0x1d5/0x200 [nfit]
>>>> acpi_device_probe+0x45/0x160
>>>
>>> Can you explain a bit more how to hit this crash? This has not been a
>>> problem historically and the explanation above makes it sound like this
>>> is a theoretical issue.
>>>
>>
>> Dan,
>>
>> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
>> kernel booting parameter
>> 'initcall_blacklist=libnvdimm_init'. Then kernel panic!
> 
> That's expected though,

Do you mean we just keep it as it is.


> you can't block libnvdimm_init and then expect
> modules that link to libnvdimm to work.
Ah, we would rather see it *unable to work* than panic, isn't it.



Thanks
Zhijian


> You would also need to block all
> modules / initcalls that depend on libnvdimm_init having runI'll
> respond to the other thread with some ideas.

Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-16 Thread lizhij...@fujitsu.com


On 17/03/2023 10:26, Dan Williams wrote:
> lizhij...@fujitsu.com wrote:
>>
>>
>> On 16/03/2023 23:54, Dan Williams wrote:
>>> Li Zhijian wrote:
>>>> nvdimm_bus_register() could be called from other modules, such as nfit,
>>>> but it can only be called after the nvdimm_bus_type is registered.
>>>>
>>>>BUG: kernel NULL pointer dereference, address: 0098
>>>>#PF: supervisor read access in kernel mode
>>>>#PF: error_code(0x) - not-present page
>>>>PGD 0 P4D 0
>>>>Oops:  [#1] PREEMPT SMP PTI
>>>>CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97
>>>>Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>>>>RIP: 0010:bus_add_device+0x58/0x150
>>>>Call Trace:
>>>> 
>>>> device_add+0x3ac/0x980
>>>> nvdimm_bus_register+0x16d/0x1d0
>>>> acpi_nfit_init+0xb72/0x1f90 [nfit]
>>>> acpi_nfit_add+0x1d5/0x200 [nfit]
>>>> acpi_device_probe+0x45/0x160
>>>
>>> Can you explain a bit more how to hit this crash? This has not been a
>>> problem historically and the explanation above makes it sound like this
>>> is a theoretical issue.
>>>
>>
>> Dan,
>>
>> Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
>> kernel booting parameter
>> 'initcall_blacklist=libnvdimm_init'. Then kernel panic!
>> Theoretically, it will also happen if nvdimm_bus_register() failed.
>>
>>
>> For kdump purpose[1], we need to disable libnvdimm driver to ensure metadata 
>> in pmem will not be updated again in kdump kernel
>> [1]https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com/T/
> 
> Ah, great write up! Let me give that some thought. 

That would be great.


> Apologies for missing it earlier.

Never mind :)



> 
> This would have been a good use for:
> 
> Link: 
> https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com
> 
> ...in the above changelog.


sounds good, i will update it in next version.


Thanks
Zhijian


Re: [PATCH] nvdimm: nvdimm_bus_register: Avoid adding device to the unregistered bus

2023-03-16 Thread lizhij...@fujitsu.com


On 16/03/2023 23:54, Dan Williams wrote:
> Li Zhijian wrote:
>> nvdimm_bus_register() could be called from other modules, such as nfit,
>> but it can only be called after the nvdimm_bus_type is registered.
>>
>>   BUG: kernel NULL pointer dereference, address: 0098
>>   #PF: supervisor read access in kernel mode
>>   #PF: error_code(0x) - not-present page
>>   PGD 0 P4D 0
>>   Oops:  [#1] PREEMPT SMP PTI
>>   CPU: 0 PID: 117 Comm: systemd-udevd Not tainted 6.2.0-rc6-pmem+ #97
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>>   RIP: 0010:bus_add_device+0x58/0x150
>>   Call Trace:
>>
>>device_add+0x3ac/0x980
>>nvdimm_bus_register+0x16d/0x1d0
>>acpi_nfit_init+0xb72/0x1f90 [nfit]
>>acpi_nfit_add+0x1d5/0x200 [nfit]
>>acpi_device_probe+0x45/0x160
> 
> Can you explain a bit more how to hit this crash? This has not been a
> problem historically and the explanation above makes it sound like this
> is a theoretical issue.
> 

Dan,

Configure the kconfig with ACPI_NFIT [=m] && LIBNVDIMM [=y], and add extra 
kernel booting parameter
'initcall_blacklist=libnvdimm_init'. Then kernel panic!
Theoretically, it will also happen if nvdimm_bus_register() failed.


For kdump purpose[1], we need to disable libnvdimm driver to ensure metadata in 
pmem will not be updated again in kdump kernel
[1]https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4...@fujitsu.com/T/

Thanks
Zhijian

> libnvdimm_init() *should* be done before the nfit driver can attempt
> nvdimm_bus_register(). So, something else is broken if
> nvdimm_bus_register() can be called before libnvdimm_init(), or after
> libnvdimm_exit().

Re: [RFC PATCH v2 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex

2022-11-17 Thread lizhij...@fujitsu.com


On 11/11/2022 17:22, Daisuke Matsuda wrote:
> ib_umem_odp_map_dma_single_page(), which has been used only by the mlx5
> driver, holds umem_mutex on success and releases on failure. This
> behavior is not convenient for other drivers to use it, so change it to
> always retain mutex on return.
> 
> Signed-off-by: Daisuke Matsuda 
> ---
>   drivers/infiniband/core/umem_odp.c | 8 +++-
>   drivers/infiniband/hw/mlx5/odp.c   | 4 +++-
>   2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem_odp.c 
> b/drivers/infiniband/core/umem_odp.c
> index e9fa22d31c23..49da6735f7c8 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -328,8 +328,8 @@ static int ib_umem_odp_map_dma_single_page(
>*
>* Maps the range passed in the argument to DMA addresses.
>* The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
> - * Upon success the ODP MR will be locked to let caller complete its device
> - * page table update.
> + * The umem mutex is locked in this function. Callers are responsible for
> + * releasing the lock.
>*


>* Returns the number of pages mapped in success, negative error code
>* for failure.
> @@ -453,11 +453,9 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp 
> *umem_odp, u64 user_virt,
>   break;
>   }
>   }
> - /* upon success lock should stay on hold for the callee */
> +
>   if (!ret)
>   ret = dma_index - start_idx;
> - else
> - mutex_unlock(_odp->umem_mutex);
>   
>   out_put_mm:
>   mmput_async(owning_mm);
> diff --git a/drivers/infiniband/hw/mlx5/odp.c 
> b/drivers/infiniband/hw/mlx5/odp.c
> index bc97958818bb..a0de27651586 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -572,8 +572,10 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, 
> struct ib_umem_odp *odp,
>   access_mask |= ODP_WRITE_ALLOWED_BIT;
>   
>   np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, 
> fault);
> - if (np < 0)
> + if (np < 0) {
> + mutex_unlock(>umem_mutex);
>   return np;
> + }

refer to the comments of ib_umem_odp_map_dma_and_lock:
334  * Returns the number of pages mapped in success, negative error 
code
335  * for failure.

I don't think it's correct to release the lock in all failure case, for 
example when it reaches below error path.

346 int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 
user_virt,
347  u64 bcnt, u64 access_mask, bool 
fault)
348 __acquires(_odp->umem_mutex) 

349 { 

350 struct task_struct *owning_process  = NULL; 

351 struct mm_struct *owning_mm = umem_odp->umem.owning_mm; 

352 int pfn_index, dma_index, ret = 0, start_idx; 

353 unsigned int page_shift, hmm_order, pfn_start_idx; 

354 unsigned long num_pfns, current_seq; 

355 struct hmm_range range = {}; 

356 unsigned long timeout; 

357 

358 if (access_mask == 0) 

359 return -EINVAL;   <   no lock is hold yet 

360 

361 if (user_virt < ib_umem_start(umem_odp) || 

362 user_virt + bcnt > ib_umem_end(umem_odp)) 

363 return -EFAULT;   <   no lock is hold yet


Further more, you changed public API's the behavior, do it matter for 
other out-of-tree drivers which is using it, i'm not familiar with this, 
maybe kernel has no restriction on it ?


>   
>   /*
>* No need to check whether the MTTs really belong to this MR, since

Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread lizhij...@fujitsu.com
Hi all

I have verified that below changes can solve this problem

diff --git a/mm/hmm.c b/mm/hmm.c
index 24f9ff95f3ae..2c9a3e3eefce 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -294,7 +294,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
unsigned long addr,
  * Since each architecture defines a struct page for the zero page, 
just
  * fall through and treat it like a normal page.
  */
-   if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+   if (!pte_devmap(pte) && pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) 
{
     if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
     pte_unmap(ptep);
     return -EFAULT;


i looked over the change-log of hmm_vma_handle_pte(), and found that before
4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL 
handling")

hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is true.

when we reached
"if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
the pte have already presented and its pte's flag already fulfilled the request 
flags.


My question is that
Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, 
right ?

if so, what's the expected code path for fsdax case ?


commit 4055062749229101365e5f9e87cb1c5a93e292f8
Author: Jason Gunthorpe 
Date:   Thu Mar 5 14:27:20 2020 -0400

     mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling

     Currently if a special PTE is encountered hmm_range_fault() immediately
     returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
     uses).

     EFAULT should only be returned after testing with hmm_pte_need_fault().

     Also pte_devmap() and pte_special() are exclusive, and there is no need to
     check IS_ENABLED, pte_special() is stubbed out to return false on
     unsupported architectures.

     Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX 
backed filesystem")
     Reviewed-by: Ralph Campbell 
     Reviewed-by: Christoph Hellwig 
     Signed-off-by: Jason Gunthorpe 

diff --git a/mm/hmm.c b/mm/hmm.c
index 3a03fcf..9c82ea9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -339,16 +339,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
unsigned long addr,
     pte_unmap(ptep);
     return -EBUSY;
     }
-   } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) 
{
-   if (!is_zero_pfn(pte_pfn(pte))) {
+   }
+
+   /*
+    * Since each architecture defines a struct page for the zero page, just
+    * fall through and treat it like a normal page.
+    */
+   if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+   hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, ,
+  _fault);
+   if (fault || write_fault) {
     pte_unmap(ptep);
-   *pfn = range->values[HMM_PFN_SPECIAL];
     return -EFAULT;
     }
-   /*
-    * Since each architecture defines a struct page for the zero
-    * page, just fall through and treat it like a normal page.
-    */
+   *pfn = range->values[HMM_PFN_SPECIAL];
+   return 0;
     }

     *pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;


Thanks
Zhijian



On 06/08/2021 11:20, Li, Zhijian wrote:
> Hi Jason
>
> thank you for your advice.
>
> CCing nvdimm
>
> both ext4 and xfs are impacted.
>
> Thanks
>
>
> at 2021/8/6 9:45, Jason Gunthorpe wrote:
>> On Wed, Aug 04, 2021 at 04:06:53PM +0800, Li, Zhijian/李 智坚 wrote:
>>> convert to text and send again
>>>
>>> 2021/8/4 15:55, Li, Zhijian wrote:
 Hey all:

 Recently, i reported a issue to 
 rpmahttps://github.com/pmem/rpma/issues/1142
 where we found that the native rpma + fsdax example failed in recent 
 kernel.

 Below is the bisect log

 [lizhijian@yl linux]$ git bisect log
 git bisect start
 # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
 git bisect good bbf5c979011a099af5dc76498918ed7df445635b
 # bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
 git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
 # good: [4d0e9df5e43dba52d38b251e3b909df8fa1110be] lib, uaccess: add 
 failure injection to usercopy functions
 git bisect good 4d0e9df5e43dba52d38b251e3b909df8fa1110be
 # bad: [6694875ef8045cdb1e6712ee9b68fe08763507d8] ext4: indicate that 
 fast_commit is available via /sys/fs/ext4/feature/...
 git bisect bad 6694875ef8045cdb1e6712ee9b68fe08763507d8
 # good: [14c914fcb515c424177bb6848cc2858ebfe717a8] Merge tag 
 'wireless-drivers-next-2020-10-02' of