Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Shiyang Ruan




在 2022/9/15 2:15, Darrick J. Wong 写道:

On Wed, Sep 14, 2022 at 11:09:23AM -0700, Darrick J. Wong wrote:

On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:

ping

在 2022/9/2 18:35, Shiyang Ruan 写道:

Changes since v7:
1. Add P1 to fix calculation mistake
2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
3. P3: Add invalidate all mappings after sync.
4. P3: Set offset to be start of device when it is to be removed.
5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].

Changes since v6:
1. Rebase on 6.0-rc2 and Darrick's patch[1].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: 
https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/


Just out of curiosity, is it your (or djbw's) intent to send all these
as bugfixes for 6.0 via akpm like all the other dax fixen?


Aha, this is 6.1 stuff, please ignore this question.


Actually I hope these patches can be merged ASAP. (But it seems a bit 
late for 6.0 now.)


And do you know which/whose branch has picked up your patch[1]?  I 
cannot find it.



--
Thanks,
Ruan.



--D


--D



Shiyang Ruan (3):
xfs: fix the calculation of length and end
fs: move drop_pagecache_sb() for others to use
mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

   drivers/dax/super.c |  3 ++-
   fs/drop_caches.c| 33 -
   fs/super.c  | 34 ++
   fs/xfs/xfs_notify_failure.c | 31 +++
   include/linux/fs.h  |  1 +
   include/linux/mm.h  |  1 +
   6 files changed, 65 insertions(+), 38 deletions(-)





Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Shiyang Ruan




在 2022/9/15 2:15, Darrick J. Wong 写道:

On Fri, Sep 02, 2022 at 10:36:01AM +, Shiyang Ruan wrote:

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
  -> unbind_store()
   -> ... (skip)
-> devres_release_all()   # was pmem driver ->remove() in v1
 -> kill_dax()
  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
   -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event.  So do not shutdown filesystem directly if something not
supported, or if failure range includes metadata area.  Make sure all
files and processes are handled correctly.

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
  drivers/dax/super.c |  3 ++-
  fs/xfs/xfs_notify_failure.c | 23 +++
  include/linux/mm.h  |  1 +
  3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
  
  	if (dax_dev->holder_data != NULL)

-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
  
  	clear_bit(DAXDEV_ALIVE, _dev->flags);

synchronize_srcu(_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5e04ba7fa403 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
  
  #include 

  #include 
+#include 
  
  struct xfs_failure_info {

xfs_agblock_t   startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
  
  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||

(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* The device is about to be removed.  Not a really failure. */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
notify->want_shutdown = true;
return 0;
}
@@ -182,12 +186,23 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
  
  	if (!(mp->m_super->s_flags & SB_BORN)) {

xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
  
+	if (mf_flags & MF_MEM_PRE_REMOVE) {

+   xfs_info(mp, "device is about to be removed!");
+   down_write(>m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   drop_pagecache_sb(mp->m_super, NULL);
+   up_write(>m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_debug(mp,
 "notify_failure() not supported on realtime device!");
@@ -196,6 +211,8 @@ xfs_dax_notify_failure(
  
  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&

mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +226,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
  
+	/* Notify failure on the whole device */

+   if (offset == 0 && len == U64_MAX) {
+   offset = ddev_start;
+   len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+   }


I wonder, won't the trimming code below take care of this?


The len is U64_MAX, so 'offset + len - 1' will overflow.  That can't be 
handled correctly by the trimming code below.



--
Thanks,
Ruan.



The rest of the patch looks ok to me.

--D


+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..9122a1c57dd2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3183,6 +3183,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+   MF_MEM_PRE_REMOVE = 1 << 7,
  };
  int 

Re: [PATCH 2/3] fs: move drop_pagecache_sb() for others to use

2022-09-14 Thread Darrick J. Wong
On Fri, Sep 02, 2022 at 10:36:00AM +, Shiyang Ruan wrote:
> xfs_notify_failure requires a method to invalidate all mappings.
> drop_pagecache_sb() can do this but it is a static function and only
> build with CONFIG_SYSCTL.  Now, move it to super.c and make it available
> for others.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/drop_caches.c   | 33 -
>  fs/super.c | 34 ++
>  include/linux/fs.h |  1 +
>  3 files changed, 35 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index e619c31b6bd9..5c8406076f9b 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -3,7 +3,6 @@
>   * Implement the manual drop-all-pagecache function
>   */
>  
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -15,38 +14,6 @@
>  /* A global variable is a bit ugly, but it keeps the code simple */
>  int sysctl_drop_caches;
>  
> -static void drop_pagecache_sb(struct super_block *sb, void *unused)
> -{
> - struct inode *inode, *toput_inode = NULL;
> -
> - spin_lock(>s_inode_list_lock);
> - list_for_each_entry(inode, >s_inodes, i_sb_list) {
> - spin_lock(>i_lock);
> - /*
> -  * We must skip inodes in unusual state. We may also skip
> -  * inodes without pages but we deliberately won't in case
> -  * we need to reschedule to avoid softlockups.
> -  */
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping_empty(inode->i_mapping) && !need_resched())) {
> - spin_unlock(>i_lock);
> - continue;
> - }
> - __iget(inode);
> - spin_unlock(>i_lock);
> - spin_unlock(>s_inode_list_lock);
> -
> - invalidate_mapping_pages(inode->i_mapping, 0, -1);
> - iput(toput_inode);
> - toput_inode = inode;
> -
> - cond_resched();
> - spin_lock(>s_inode_list_lock);
> - }
> - spin_unlock(>s_inode_list_lock);
> - iput(toput_inode);
> -}
> -
>  int drop_caches_sysctl_handler(struct ctl_table *table, int write,
>   void *buffer, size_t *length, loff_t *ppos)
>  {
> diff --git a/fs/super.c b/fs/super.c
> index 734ed584a946..bdf53dbe834c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "internal.h"
>  
> @@ -677,6 +678,39 @@ void drop_super_exclusive(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(drop_super_exclusive);
>  
> +void drop_pagecache_sb(struct super_block *sb, void *unused)
> +{
> + struct inode *inode, *toput_inode = NULL;
> +
> + spin_lock(>s_inode_list_lock);
> + list_for_each_entry(inode, >s_inodes, i_sb_list) {
> + spin_lock(>i_lock);
> + /*
> +  * We must skip inodes in unusual state. We may also skip
> +  * inodes without pages but we deliberately won't in case
> +  * we need to reschedule to avoid softlockups.
> +  */
> + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> + (mapping_empty(inode->i_mapping) && !need_resched())) {
> + spin_unlock(>i_lock);
> + continue;
> + }
> + __iget(inode);
> + spin_unlock(>i_lock);
> + spin_unlock(>s_inode_list_lock);
> +
> + invalidate_mapping_pages(inode->i_mapping, 0, -1);
> + iput(toput_inode);
> + toput_inode = inode;
> +
> + cond_resched();
> + spin_lock(>s_inode_list_lock);
> + }
> + spin_unlock(>s_inode_list_lock);
> + iput(toput_inode);
> +}
> +EXPORT_SYMBOL(drop_pagecache_sb);

You might want to rename this "super_drop_pagecache" to fit with the
other functions that all have "super" in the name somewhere.

--D

> +
>  static void __iterate_supers(void (*f)(struct super_block *))
>  {
>   struct super_block *sb, *p = NULL;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9eced4cc286e..5ded28c0d2c9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3292,6 +3292,7 @@ extern struct super_block *get_super(struct 
> block_device *);
>  extern struct super_block *get_active_super(struct block_device *bdev);
>  extern void drop_super(struct super_block *sb);
>  extern void drop_super_exclusive(struct super_block *sb);
> +void drop_pagecache_sb(struct super_block *sb, void *unused);
>  extern void iterate_supers(void (*)(struct super_block *, void *), void *);
>  extern void iterate_supers_type(struct file_system_type *,
>   void (*)(struct super_block *, void *), void *);
> -- 
> 2.37.2
> 



Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Darrick J. Wong
On Wed, Sep 14, 2022 at 11:09:23AM -0700, Darrick J. Wong wrote:
> On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:
> > ping
> > 
> > 在 2022/9/2 18:35, Shiyang Ruan 写道:
> > > Changes since v7:
> > >1. Add P1 to fix calculation mistake
> > >2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
> > >3. P3: Add invalidate all mappings after sync.
> > >4. P3: Set offset to be start of device when it is to be 
> > > removed.
> > >5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].
> > > 
> > > Changes since v6:
> > >1. Rebase on 6.0-rc2 and Darrick's patch[1].
> > > 
> > > [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> > > [2]: 
> > > https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/
> 
> Just out of curiosity, is it your (or djbw's) intent to send all these
> as bugfixes for 6.0 via akpm like all the other dax fixen?

Aha, this is 6.1 stuff, please ignore this question.

--D

> --D
> 
> > > 
> > > Shiyang Ruan (3):
> > >xfs: fix the calculation of length and end
> > >fs: move drop_pagecache_sb() for others to use
> > >mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> > > 
> > >   drivers/dax/super.c |  3 ++-
> > >   fs/drop_caches.c| 33 -
> > >   fs/super.c  | 34 ++
> > >   fs/xfs/xfs_notify_failure.c | 31 +++
> > >   include/linux/fs.h  |  1 +
> > >   include/linux/mm.h  |  1 +
> > >   6 files changed, 65 insertions(+), 38 deletions(-)
> > > 



Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Darrick J. Wong
On Fri, Sep 02, 2022 at 10:36:01AM +, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 23 +++
>  include/linux/mm.h  |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, _dev->flags);
>   synchronize_srcu(_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 3830f908e215..5e04ba7fa403 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  struct xfs_failure_info {
>   xfs_agblock_t   startblock;
> @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* The device is about to be removed.  Not a really failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   notify->want_shutdown = true;
>   return 0;
>   }
> @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_super->s_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(>m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + drop_pagecache_sb(mp->m_super, NULL);
> + up_write(>m_super->s_umount);
> + if (error)
> + return error;
> + }
> +
>   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>   xfs_debug(mp,
>"notify_failure() not supported on realtime device!");
> @@ -196,6 +211,8 @@ xfs_dax_notify_failure(
>  
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> @@ -209,6 +226,12 @@ xfs_dax_notify_failure(
>   ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>   ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> + /* Notify failure on the whole device */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }

I wonder, won't the trimming code below take care of this?

The rest of the patch looks ok to me.

--D

> +
>   /* Ignore the range out of filesystem area */
>   if (offset + len - 1 < ddev_start)
>   return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..9122a1c57dd2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3183,6 +3183,7 @@ enum mf_flags {
>   MF_UNPOISON = 1 << 4,
>   MF_SW_SIMULATED = 1 << 5,
>   MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t 

Re: [PATCH 1/3] xfs: fix the calculation of length and end

2022-09-14 Thread Darrick J. Wong
On Fri, Sep 02, 2022 at 10:35:59AM +, Shiyang Ruan wrote:
> The end should be start + length - 1.  Also fix the calculation of the
> length when seeking for intersection of notify range and device.
> 
> Signed-off-by: Shiyang Ruan 

Looks correct to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_notify_failure.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index c4078d0ec108..3830f908e215 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
>   int error = 0;
>   xfs_fsblock_t   fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
> - xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
> + xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen 
> - 1);
>   xfs_agnumber_t  end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>  
>   error = xfs_trans_alloc_empty(mp, );
> @@ -210,7 +210,7 @@ xfs_dax_notify_failure(
>   ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
>   /* Ignore the range out of filesystem area */
> - if (offset + len < ddev_start)
> + if (offset + len - 1 < ddev_start)
>   return -ENXIO;
>   if (offset > ddev_end)
>   return -ENXIO;
> @@ -222,8 +222,8 @@ xfs_dax_notify_failure(
>   len -= ddev_start - offset;
>   offset = 0;
>   }
> - if (offset + len > ddev_end)
> - len -= ddev_end - offset;
> + if (offset + len - 1 > ddev_end)
> + len -= offset + len - 1 - ddev_end;
>  
>   return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>   mf_flags);
> -- 
> 2.37.2
> 



Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Darrick J. Wong
On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:
> ping
> 
> 在 2022/9/2 18:35, Shiyang Ruan 写道:
> > Changes since v7:
> >1. Add P1 to fix calculation mistake
> >2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
> >3. P3: Add invalidate all mappings after sync.
> >4. P3: Set offset to be start of device when it is to be 
> > removed.
> >5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].
> > 
> > Changes since v6:
> >1. Rebase on 6.0-rc2 and Darrick's patch[1].
> > 
> > [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> > [2]: 
> > https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Just out of curiosity, is it your (or djbw's) intent to send all these
as bugfixes for 6.0 via akpm like all the other dax fixen?

--D

> > 
> > Shiyang Ruan (3):
> >xfs: fix the calculation of length and end
> >fs: move drop_pagecache_sb() for others to use
> >mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> > 
> >   drivers/dax/super.c |  3 ++-
> >   fs/drop_caches.c| 33 -
> >   fs/super.c  | 34 ++
> >   fs/xfs/xfs_notify_failure.c | 31 +++
> >   include/linux/fs.h  |  1 +
> >   include/linux/mm.h  |  1 +
> >   6 files changed, 65 insertions(+), 38 deletions(-)
> > 



Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-14 Thread Darrick J. Wong
On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > > On 2022/9/9 21:01, Brian Foster wrote:
> > > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > > it relied on discard to perform zeroing between checkpoints or some such
> > > > and avoid spurious failures. The purpose of running on dm-thin was
> > > > merely to provide reliable discard zeroing behavior on the target device
> > > > and thus to allow the test to run reliably.
> > > Hi Brian,
> > > 
> > > As far as I know, generic/470 was original designed to verify
> > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > > reason, we need to ensure that all underlying devices under
> > > dm-log-writes device support DAX. However dm-thin device never supports
> > > DAX so
> > > running generic/470 with dm-thin device always returns "not run".
> > > 
> > > Please see the difference between old and new logic:
> > > 
> > >old logic  new logic
> > > ---
> > > log-writes device(DAX) log-writes device(DAX)
> > >  |   |
> > > PMEM0(DAX) + PMEM1(DAX)   Thin device(non-DAX) + PMEM1(DAX)
> > >|
> > >  PMEM0(DAX)
> > > ---
> > > 
> > > We think dm-thin device is not a good solution for generic/470, is there
> > > any other solution to support both discard zero and DAX?
> > 
> > Hi Brian,
> > 
> > I have sent a patch[1] to revert your fix because I think it's not good for
> > generic/470 to use thin volume as my revert patch[1] describes:
> > [1] 
> > https://lore.kernel.org/fstests/20220914090625.32207-1-yangx...@fujitsu.com/T/#u
> > 
> 
> I think the history here is that generic/482 was changed over first in
> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
> then sometime later we realized generic/455,457,470 had the same general
> flaw and were switched over. The dm/dax compatibility thing was probably
> just an oversight, but I am a little curious about that because it should

It's not an oversight -- it used to work (albeit with EXPERIMENTAL
tags), and now we've broken it on fsdax as the pmem/blockdev divorce
progresses.

> have been obvious that the change caused the test to no longer run. Did
> something change after that to trigger that change in behavior?
> 
> > With the revert, generic/470 can always run successfully on my environment
> > so I wonder how to reproduce the out-of-order replay issue on XFS v5
> > filesystem?
> > 
> 
> I don't quite recall the characteristics of the failures beyond that we
> were seeing spurious test failures with generic/482 that were due to
> essentially putting the fs/log back in time in a way that wasn't quite
> accurate due to the clearing by the logwrites tool not taking place. If
> you wanted to reproduce in order to revisit that, perhaps start with
> generic/482 and let it run in a loop for a while and see if it
> eventually triggers a failure/corruption..?
> 
> > PS: I want to reproduce the issue and try to find a better solution to fix
> > it.
> > 
> 
> It's been a while since I looked at any of this tooling to semi-grok how
> it works.

I /think/ this was the crux of the problem, back in 2019?
https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/

> Perhaps it could learn to rely on something more explicit like
> zero range (instead of discard?) or fall back to manual zeroing?

AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
ought to be adapted to call BLKZEROOUT and (b) in the worst case it
writes zeroes to the entire device, which is/can be slow.

For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
and for cost optimization purposes we're only "paying" for the cheapest
tier.  Weirdly that maps to an upper limit of 6500 write iops and
48MB/s(!) but that would take about 20 minutes to zero the entire
device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
support discard or write-zeroes.

> If the
> eventual solution is simple and low enough overhead, it might make some
> sense to replace the dmthin hack across the set of tests mentioned
> above.

That said, for a *pmem* test you'd expect it to be faster than that...

--D

> Brian
> 
> > Best Regards,
> > Xiao Yang
> > 
> > > 
> > > BTW, only log-writes, stripe and linear support DAX for now.
> > 
> 



Re: [PATCH v2] libnvdimm/region: Allow setting align attribute on regions without mappings

2022-09-14 Thread Tyler Hicks
On 2022-08-30 00:45:05, Tyler Hicks wrote:
> The alignment constraint for namespace creation in a region was
> increased, from 2M to 16M, for non-PowerPC architectures in v5.7 with
> commit 2522afb86a8c ("libnvdimm/region: Introduce an 'align'
> attribute"). The thought behind the change was that region alignment
> should be uniform across all architectures and, since PowerPC had the
> largest alignment constraint of 16M, all architectures should conform to
> that alignment.
> 
> The change regressed namespace creation in pre-defined regions that
> relied on 2M alignment but a workaround was provided in the form of a
> sysfs attribute, named 'align', that could be adjusted to a non-default
> alignment value.
> 
> However, the sysfs attribute's store function returned an error (-ENXIO)
> when userspace attempted to change the alignment of a region that had no
> mappings. This affected 2M aligned regions of volatile memory that were
> defined in a device tree using "pmem-region" and created by the
> of_pmem_region_driver, since those regions do not contain mappings
> (ndr_mappings is 0).
> 
> Allow userspace to set the align attribute on pre-existing regions that
> do not have mappings so that namespaces can still be within those
> regions, despite not being aligned to 16M.
> 
> Link: 
> https://lore.kernel.org/lkml/CA+CK2bDJ3hrWoE91L2wpAk+Yu0_=GtYw=4glddd7mxs321b...@mail.gmail.com
> Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute")
> Signed-off-by: Tyler Hicks 
> ---

Friendly ping. I'm hoping this fix can be considered for v6.1. Thanks!

Tyler

> 
> While testing with a recent kernel release (6.0-rc3), I rediscovered
> this bug and eventually realized that I never followed through with
> fixing it upstream. After a year later, here's the v2 that Aneesh
> requested. Sorry about that!
> 
> v2:
> - Included Aneesh's feedback to ensure the val is a power of 2 and
>   greater than PAGE_SIZE even for regions without mappings
> - Reused the max_t() trick from default_align() to avoid special
>   casing, with an if-else, when regions have mappings and when they
>   don't
>   + Didn't include Pavel's Reviewed-by since this is a slightly
> different approach than what he reviewed in v1
> - Added a Link commit tag to Pavel's initial problem description
> v1: 
> https://lore.kernel.org/lkml/20210326152645.85225-1-tyhi...@linux.microsoft.com/
> 
>  drivers/nvdimm/region_devs.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 473a71bbd9c9..550ea0bd6c53 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -509,16 +509,13 @@ static ssize_t align_store(struct device *dev,
>  {
>   struct nd_region *nd_region = to_nd_region(dev);
>   unsigned long val, dpa;
> - u32 remainder;
> + u32 mappings, remainder;
>   int rc;
>  
>   rc = kstrtoul(buf, 0, );
>   if (rc)
>   return rc;
>  
> - if (!nd_region->ndr_mappings)
> - return -ENXIO;
> -
>   /*
>* Ensure space-align is evenly divisible by the region
>* interleave-width because the kernel typically has no facility
> @@ -526,7 +523,8 @@ static ssize_t align_store(struct device *dev,
>* contribute to the tail capacity in system-physical-address
>* space for the namespace.
>*/
> - dpa = div_u64_rem(val, nd_region->ndr_mappings, );
> + mappings = max_t(u32, 1, nd_region->ndr_mappings);
> + dpa = div_u64_rem(val, mappings, );
>   if (!is_power_of_2(dpa) || dpa < PAGE_SIZE
>   || val > region_size(nd_region) || remainder)
>   return -EINVAL;
> -- 
> 2.25.1
> 



Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-14 Thread Brian Foster
On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/9 21:01, Brian Foster wrote:
> > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > it relied on discard to perform zeroing between checkpoints or some such
> > > and avoid spurious failures. The purpose of running on dm-thin was
> > > merely to provide reliable discard zeroing behavior on the target device
> > > and thus to allow the test to run reliably.
> > Hi Brian,
> > 
> > As far as I know, generic/470 was original designed to verify
> > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > reason, we need to ensure that all underlying devices under
> > dm-log-writes device support DAX. However dm-thin device never supports
> > DAX so
> > running generic/470 with dm-thin device always returns "not run".
> > 
> > Please see the difference between old and new logic:
> > 
> >old logic  new logic
> > ---
> > log-writes device(DAX) log-writes device(DAX)
> >  |   |
> > PMEM0(DAX) + PMEM1(DAX)   Thin device(non-DAX) + PMEM1(DAX)
> >|
> >  PMEM0(DAX)
> > ---
> > 
> > We think dm-thin device is not a good solution for generic/470, is there
> > any other solution to support both discard zero and DAX?
> 
> Hi Brian,
> 
> I have sent a patch[1] to revert your fix because I think it's not good for
> generic/470 to use thin volume as my revert patch[1] describes:
> [1] 
> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx...@fujitsu.com/T/#u
> 

I think the history here is that generic/482 was changed over first in
commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
then sometime later we realized generic/455,457,470 had the same general
flaw and were switched over. The dm/dax compatibility thing was probably
just an oversight, but I am a little curious about that because it should
have been obvious that the change caused the test to no longer run. Did
something change after that to trigger that change in behavior?

> With the revert, generic/470 can always run successfully on my environment
> so I wonder how to reproduce the out-of-order replay issue on XFS v5
> filesystem?
> 

I don't quite recall the characteristics of the failures beyond that we
were seeing spurious test failures with generic/482 that were due to
essentially putting the fs/log back in time in a way that wasn't quite
accurate due to the clearing by the logwrites tool not taking place. If
you wanted to reproduce in order to revisit that, perhaps start with
generic/482 and let it run in a loop for a while and see if it
eventually triggers a failure/corruption..?

> PS: I want to reproduce the issue and try to find a better solution to fix
> it.
> 

It's been a while since I looked at any of this tooling to semi-grok how
it works. Perhaps it could learn to rely on something more explicit like
zero range (instead of discard?) or fall back to manual zeroing? If the
eventual solution is simple and low enough overhead, it might make some
sense to replace the dmthin hack across the set of tests mentioned
above.

Brian

> Best Regards,
> Xiao Yang
> 
> > 
> > BTW, only log-writes, stripe and linear support DAX for now.
> 




Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-14 Thread Yang , Xiao/杨 晓

On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:

On 2022/9/9 21:01, Brian Foster wrote:

Yes.. I don't recall all the internals of the tools and test, but IIRC
it relied on discard to perform zeroing between checkpoints or some such
and avoid spurious failures. The purpose of running on dm-thin was
merely to provide reliable discard zeroing behavior on the target device
and thus to allow the test to run reliably.

Hi Brian,

As far as I know, generic/470 was original designed to verify
mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
reason, we need to ensure that all underlying devices under
dm-log-writes device support DAX. However dm-thin device never supports
DAX so
running generic/470 with dm-thin device always returns "not run".

Please see the difference between old and new logic:

   old logic  new logic
---
log-writes device(DAX) log-writes device(DAX)
 |   |
PMEM0(DAX) + PMEM1(DAX)   Thin device(non-DAX) + PMEM1(DAX)
   |
 PMEM0(DAX)
---

We think dm-thin device is not a good solution for generic/470, is there
any other solution to support both discard zero and DAX?


Hi Brian,

I have sent a patch[1] to revert your fix because I think it's not good 
for generic/470 to use thin volume as my revert patch[1] describes:
[1] 
https://lore.kernel.org/fstests/20220914090625.32207-1-yangx...@fujitsu.com/T/#u


With the revert, generic/470 can always run successfully on my 
environment so I wonder how to reproduce the out-of-order replay issue 
on XFS v5 filesystem?


PS: I want to reproduce the issue and try to find a better solution to 
fix it.


Best Regards,
Xiao Yang



BTW, only log-writes, stripe and linear support DAX for now.




Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-14 Thread Yang , Xiao/杨 晓

On 2022/9/9 21:01, Brian Foster wrote:

Yes.. I don't recall all the internals of the tools and test, but IIRC
it relied on discard to perform zeroing between checkpoints or some such
and avoid spurious failures. The purpose of running on dm-thin was
merely to provide reliable discard zeroing behavior on the target device
and thus to allow the test to run reliably.


Hi Brian,

As far as I know, generic/470 was original designed to verify 
mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the 
reason, we need to ensure that all underlying devices under 
dm-log-writes device support DAX. However dm-thin device never supports 
DAX so

running generic/470 with dm-thin device always returns "not run".

Please see the difference between old and new logic:

 old logic  new logic
---
log-writes device(DAX) log-writes device(DAX)
   |   |
PMEM0(DAX) + PMEM1(DAX)   Thin device(non-DAX) + PMEM1(DAX)
 |
   PMEM0(DAX)
---

We think dm-thin device is not a good solution for generic/470, is there 
any other solution to support both discard zero and DAX?


BTW, only log-writes, stripe and linear support DAX for now.

Best Regards,
Xiao Yang



[PATCH] nvdimm: make __nvdimm_security_overwrite_query static

2022-09-14 Thread Jiapeng Chong
This symbol is not used outside of security.c, so marks it static.

drivers/nvdimm/security.c:411:6: warning: no previous prototype for function 
'__nvdimm_security_overwrite_query'.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2148
Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 drivers/nvdimm/security.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index b5aa55c61461..8aefb60c42ff 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -408,7 +408,7 @@ static int security_overwrite(struct nvdimm *nvdimm, 
unsigned int keyid)
return rc;
 }
 
-void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
+static void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
 {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(>dev);
int rc;
-- 
2.20.1.7.g153144c