Re: [PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-05-17 Thread Darrick J. Wong
On Thu, May 16, 2024 at 01:34:54PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> [
>   This is a treewide change. I will likely re-create this patch again in
>   the second week of the merge window of v6.10 and submit it then. Hoping
>   to keep the conflicts that it will cause to a minimum.
> ]
> 
> With the rework of how __string() handles dynamic strings, where it
> saves off the source string in a field in the helper structure[1], the
> value assigned to the trace event field is already stored in the helper
> structure and does not need to be passed in again.
> 
> This means that with:
> 
>   __string(field, mystring)
> 
> Which used to be assigned with __assign_str(field, mystring), no longer
> needs the second parameter, which is now unused. With this, __assign_str()
> will now only get a single parameter.
> 
> There are over 700 users of __assign_str() and because coccinelle does not
> handle the TRACE_EVENT() macro I ended up using the following sed script:
> 
>   git grep -l __assign_str | while read a ; do
>   sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
>   mv /tmp/test-file $a;
>   done
> 
> I then searched for __assign_str() that did not end with ';' as those
> were multi line assignments that the sed script above would fail to catch.
> 
> Note, the same updates will need to be done for:
> 
>   __assign_str_len()
>   __assign_rel_str()
>   __assign_rel_str_len()
> 
> I tested this with both an allmodconfig and an allyesconfig (build only for 
> both).
> 
> [1] 
> https://lore.kernel.org/linux-trace-kernel/2024011442.634192...@goodmis.org/
> 
> Cc: Masami Hiramatsu 
> Cc: Mathieu Desnoyers 
> Cc: Linus Torvalds 
> Cc: Julia Lawall 
> Signed-off-by: Steven Rostedt (Google) 

/me finds this pretty magical, but such is the way of macros.
Thanks for being much smarter about them than me. :)

Acked-by: Darrick J. Wong # xfs

--D



Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-22 Thread Darrick J. Wong
On Mon, Jan 22, 2024 at 02:02:28PM -0800, Linus Torvalds wrote:
> On Mon, 22 Jan 2024 at 13:59, Darrick J. Wong  wrote:
> >
> >  though I don't think
> > leaking raw kernel pointers is an awesome idea.
> 
> Yeah, I wasn't all that comfortable even with trying to hash it
> (because I think the number of source bits is small enough that even
> with a crypto hash, it's trivially brute-forceable).
> 
> See
> 
> https://lore.kernel.org/all/20240122152748.46897...@gandalf.local.home/
> 
> for the current patch under discussion (and it contains a link _to_
> said discussion).

Ah, cool, thank you!

--D

> Linus



Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-22 Thread Darrick J. Wong
On Tue, Jan 16, 2024 at 05:55:32PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The dentries and inodes are created in the readdir for the sole purpose of
> getting a consistent inode number. Linus stated that is unnecessary, and
> that all inodes can have the same inode number. For a virtual file system
> they are pretty meaningless.
> 
> Instead use a single unique inode number for all files and one for all
> directories.
> 
> Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/
> Link: 
> https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org
> 
> Cc: Masami Hiramatsu 
> Cc: Mark Rutland 
> Cc: Mathieu Desnoyers 
> Cc: Christian Brauner 
> Cc: Al  Viro 
> Cc: Ajay Kaher 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index fdff53d5a1f8..5edf0b96758b 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -32,6 +32,10 @@
>   */
>  static DEFINE_MUTEX(eventfs_mutex);
>  
> +/* Choose something "unique" ;-) */
> +#define EVENTFS_FILE_INODE_INO   0x12c4e37
> +#define EVENTFS_DIR_INODE_INO    0x134b2f5
> +
>  /*
>   * The eventfs_inode (ei) itself is protected by SRCU. It is released from
>   * its parent's list and will have is_freed set (under eventfs_mutex).
> @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
>   inode->i_fop = fop;
>   inode->i_private = data;
>  
> + /* All files will have the same inode number */
> + inode->i_ino = EVENTFS_FILE_INODE_INO;
> +
>   ti = get_tracefs(inode);
>   ti->flags |= TRACEFS_EVENT_INODE;
>   d_instantiate(dentry, inode);
> @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
>   inode->i_op = _root_dir_inode_operations;
>   inode->i_fop = _file_operations;
>  
> + /* All directories will have the same inode number */
> + inode->i_ino = EVENTFS_DIR_INODE_INO;

Regrettably, this leads to find failing on 6.8-rc1 (see xfs/55[89] in
fstests):

# find /sys/kernel/debug/tracing/ >/dev/null
find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’.
find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’.
find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’.

There were no such reports on 6.7.0; AFAICT find(1) is tripping over
parent and child subdirectory having the same dev/i_ino.  Changing this
line to the following:

/* All directories will NOT have the same inode number */
inode->i_ino = (unsigned long)inode;

makes the messages about filesystem loops go away, though I don't think
leaking raw kernel pointers is an awesome idea.
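
For anyone wondering why identical inode numbers upset find(1): tree
walkers commonly detect directory loops by checking whether a directory's
(st_dev, st_ino) pair matches one of its ancestors.  The sketch below is a
simplified user-space model of that kind of check, not find's actual code;
struct dir_id and is_loop() are names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical identity of a directory as seen through stat(2). */
struct dir_id {
	dev_t dev;
	ino_t ino;
};

/*
 * Return true if @child has the same (dev, ino) pair as any of the
 * @depth ancestors on the current path, i.e. descending into it would
 * look like walking in a circle.
 */
bool is_loop(const struct dir_id *ancestors, size_t depth,
	     struct dir_id child)
{
	assert(ancestors != NULL || depth == 0);

	for (size_t i = 0; i < depth; i++) {
		if (ancestors[i].dev == child.dev &&
		    ancestors[i].ino == child.ino)
			return true;
	}
	return false;
}
```

Under a check like this, a parent and child directory that both report
EVENTFS_DIR_INODE_INO on the same device compare equal, which would
explain the messages above.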

--D

> +
>   ti = get_tracefs(inode);
>   ti->flags |= TRACEFS_EVENT_INODE;
>  
> -- 
> 2.43.0
> 
> 
> 



Re: [RFC PATCH] xfs: check shared state of when CoW, update reflink flag when io ends

2023-03-21 Thread Darrick J. Wong
On Mon, Mar 20, 2023 at 06:02:05PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2023/3/18 4:35, Darrick J. Wong wrote:
> > On Fri, Mar 17, 2023 at 03:59:48AM +, Shiyang Ruan wrote:
> > > As mentioned before[1], generic/388 randomly fails with a dmesg
> > > warning.  This case uses fsstress with a lot of random operations.  It is
> > > hard to reproduce.  Finally I found a 100% reproducible condition, which is
> > > setting the seed to 1677104360.  So I changed the generic/388 code: removed
> > > the loop and used the code below instead:
> > > ```
> > > ($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> $seqres.full) > /dev/null 2>&1
> > > ($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> $seqres.full) > /dev/null 2>&1
> > > _check_dmesg_for dax_insert_entry
> > > ```
> > > 
> > > According to the operations log and the kernel debug log I added, I found
> > > that the reflink flag of one inode won't be unset even if there are no
> > > more shared extents.
> > >   Then write to this file again.  Because of the reflink flag, xfs thinks
> > >   it needs cow, and an extent (call it extA) will be CoWed to a new
> > >   extent (call it extB) incorrectly.  And extA is not used any more,
> > >   but wasn't unmapped (didn't do dax_disassociate_entry()).
> > 
> > IOWs, dax_iomap_copy_around (or something very near it) should be
> > calling dax_disassociate_entry on the source range after copying extA's
> > contents to extB to drop its page->shared count?
> 
> If extA is a shared extent, its pages will be disassociated correctly by
> invalidate_inode_pages2_range() in dax_iomap_iter().
> 
> But the problem is that extA is not shared but is now being CoWed,

Aha!  Ok, I hadn't realized that extA is not shared...

> invalidate_inode_pages2_range() is also called but it can't disassociate the
> old page (because the page is marked dirty, can't be invalidated)

...so what marked the old page dirty?   Was it the case that the
unshared extA got marked dirty, then later someone created a cow
reservation (extB, I guess) that covered the already dirty extA?

Should we be transferring the dirty state from A to B here before the
invalidate_inode_pages2_range ?

> Is the behavior to do CoW on a non-shared extent allowed?

In general, yes, XFS allows COW on non-shared extents.  The (cow) extent
size hint provides for cowing the unshared blocks adjacent to a shared
block to try to combat fragmentation.
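
As a rough illustration of why unshared blocks can end up under a COW
reservation: if the reservation is rounded out to the cow extent size
hint, it also covers the neighbours of the shared block.  This is a
simplified model with made-up names (struct blk_range,
round_to_cowextsize()), not the XFS allocator.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open range of file blocks [start, end). */
struct blk_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Round a range of blocks out to multiples of @hint blocks, the way an
 * extent size hint widens an allocation.  Blocks inside the result but
 * outside @r are the adjacent blocks that get cowed along with the
 * shared one.
 */
struct blk_range round_to_cowextsize(struct blk_range r, uint64_t hint)
{
	struct blk_range out;

	assert(hint > 0 && r.start < r.end);

	out.start = r.start - (r.start % hint);
	out.end = ((r.end + hint - 1) / hint) * hint;
	return out;
}
```

With a hint of 32 blocks, a write to the single shared block 37 would be
widened to blocks 32 through 63 in this model, whether or not the other
31 blocks are shared.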

> > 
> > >   The next time we mapwrite to another file, xfs will allocate extA for
> > >   it, and the page fault handler does dax_associate_entry().  BUT because
> > >   extA wasn't unmapped, it still stores the old file's info in
> > >   page->mapping,->index.  Then it reports a dmesg warning when it tries
> > >   to store the new file's info.
> > > 
> > > So, I think:
> > >   1. The reflink flag should be updated after CoW operations.
> > >   2. xfs_reflink_allocate_cow() should add an "if extent is shared" check to
> > >      determine whether xfs does CoW or not.
> > > 
> > > I made the fix patch; it resolves the failure of generic/388.  But it
> > > causes other cases to fail: generic/127, generic/263, generic/616, xfs/315,
> > > xfs/421.  I'm not sure if the fix is right, or if I have missed something
> > > somewhere.  Please give me some advice.
> > > 
> > > Thank you very much!!
> > > 
> > > [1]: 
> > > https://lore.kernel.org/linux-xfs/1669908538-55-1-git-send-email-ruansy.f...@fujitsu.com/
> > > 
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >   fs/xfs/xfs_reflink.c | 44 
> > >   fs/xfs/xfs_reflink.h |  2 ++
> > >   2 files changed, 46 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index f5dc46ce9803..a6b07f5c1db2 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -154,6 +154,40 @@ xfs_reflink_find_shared(
> > >   return error;
> > >   }
> > > +int xfs_reflink_extent_is_shared(
> > > + struct xfs_inode*ip,
> > > + struct xfs_bmbt_irec*irec,
> > > + bool*shar

Re: [RFC PATCH] xfs: check shared state of when CoW, update reflink flag when io ends

2023-03-17 Thread Darrick J. Wong
On Fri, Mar 17, 2023 at 03:59:48AM +, Shiyang Ruan wrote:
> As mentioned before[1], generic/388 randomly fails with a dmesg
> warning.  This case uses fsstress with a lot of random operations.  It is hard
> to reproduce.  Finally I found a 100% reproducible condition, which is setting
> the seed to 1677104360.  So I changed the generic/388 code: removed the loop
> and used the code below instead:
> ```
> ($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> $seqres.full) > /dev/null 2>&1
> ($FSSTRESS_PROG $FSSTRESS_AVOID -d $SCRATCH_MNT -v -s 1677104360 -n 221 -p 1 >> $seqres.full) > /dev/null 2>&1
> _check_dmesg_for dax_insert_entry
> ```
> 
> According to the operations log and the kernel debug log I added, I found that
> the reflink flag of one inode won't be unset even if there are no more shared
> extents.
>   Then write to this file again.  Because of the reflink flag, xfs thinks it
>   needs cow, and an extent (call it extA) will be CoWed to a new
>   extent (call it extB) incorrectly.  And extA is not used any more,
>   but wasn't unmapped (didn't do dax_disassociate_entry()).

IOWs, dax_iomap_copy_around (or something very near it) should be
calling dax_disassociate_entry on the source range after copying extA's
contents to extB to drop its page->shared count?

>   The next time we mapwrite to another file, xfs will allocate extA for it,
>   and the page fault handler does dax_associate_entry().  BUT because extA
>   wasn't unmapped, it still stores the old file's info in page->mapping,->index.
>   Then it reports a dmesg warning when it tries to store the new file's info.
> 
> So, I think:
>   1. The reflink flag should be updated after CoW operations.
>   2. xfs_reflink_allocate_cow() should add an "if extent is shared" check to
>      determine whether xfs does CoW or not.
> 
> I made the fix patch; it resolves the failure of generic/388.  But it causes
> other cases to fail: generic/127, generic/263, generic/616, xfs/315, xfs/421.
> I'm not sure if the fix is right, or if I have missed something somewhere.
> Please give me some advice.
> 
> Thank you very much!!
> 
> [1]: 
> https://lore.kernel.org/linux-xfs/1669908538-55-1-git-send-email-ruansy.f...@fujitsu.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_reflink.c | 44 
>  fs/xfs/xfs_reflink.h |  2 ++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index f5dc46ce9803..a6b07f5c1db2 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -154,6 +154,40 @@ xfs_reflink_find_shared(
>   return error;
>  }
>  
> +int xfs_reflink_extent_is_shared(
> + struct xfs_inode *ip,
> + struct xfs_bmbt_irec *irec,
> + bool *shared)
> +{
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_perag *pag;
> + xfs_agblock_t agbno;
> + xfs_extlen_t aglen;
> + xfs_agblock_t fbno;
> + xfs_extlen_t flen;
> + int error = 0;
> +
> + *shared = false;
> +
> + /* Holes, unwritten, and delalloc extents cannot be shared */
> + if (!xfs_bmap_is_written_extent(irec))
> + return 0;
> +
> + pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> + agbno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
> + aglen = irec->br_blockcount;
> + error = xfs_reflink_find_shared(pag, NULL, agbno, aglen, , ,
> + true);
> + xfs_perag_put(pag);
> + if (error)
> + return error;
> +
> + if (fbno != NULLAGBLOCK)
> + *shared = true;
> +
> + return 0;
> +}
> +
>  /*
>   * Trim the mapping to the next block where there's a change in the
>   * shared/unshared status.  More specifically, this means that we
> @@ -533,6 +567,12 @@ xfs_reflink_allocate_cow(
>   xfs_ifork_init_cow(ip);
>   }
>  
> + error = xfs_reflink_extent_is_shared(ip, imap, shared);
> + if (error)
> + return error;
> + if (!*shared)
> + return 0;
> +
>   error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, );
>   if (error || !*shared)
>   return error;
> @@ -834,6 +874,10 @@ xfs_reflink_end_cow_extent(
>   /* Remove the mapping from the CoW fork. */
>   xfs_bmap_del_extent_cow(ip, , , );
>  
> + error = xfs_reflink_clear_inode_flag(ip, );

This will disable COW on /all/ blocks in the entire file, including the
shared ones.  At a bare minimum you'd have to scan the entire data fork
to ensure there are no shared extents.  That's probably why doing this
causes so many new regressions.
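
To make the objection concrete, here is a toy model (assumed types, not
XFS code) in which the per-file flag may only be cleared after every
extent in the data fork has been checked.  struct toy_extent, struct
toy_inode and toy_try_clear_reflink() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extent record: refcount > 1 means the blocks are shared. */
struct toy_extent {
	unsigned int refcount;
};

/* Hypothetical inode with a per-file "may contain shared extents" flag. */
struct toy_inode {
	bool reflink_flag;
	const struct toy_extent *extents;
	size_t nr_extents;
};

/*
 * Clear the flag only if no extent in the whole file is still shared.
 * Returns true if no shared extent was found and the flag is now clear.
 */
bool toy_try_clear_reflink(struct toy_inode *ip)
{
	assert(ip != NULL);

	for (size_t i = 0; i < ip->nr_extents; i++) {
		if (ip->extents[i].refcount > 1)
			return false;	/* still needs COW somewhere */
	}
	ip->reflink_flag = false;
	return true;
}
```

Clearing the flag after a single extent finishes its COW skips the loop
entirely, which is the case the review comment warns about.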

--D

> + if (error)
> + goto out_cancel;
> +
>   error = xfs_trans_commit(tp);
>   xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   if (error)
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 65c5dfe17ecf..d5835814bce6 100644
> --- 

Re: [PATCH v2.2 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-07 Thread Darrick J. Wong
On Wed, Dec 07, 2022 at 02:49:19AM +, Shiyang Ruan wrote:
> fsdax page is used not only when CoW, but also mapread. To make it
> easily understood, use 'share' to indicate that the dax page is shared
> by more than one extent.  And add helper functions to use it.
> 
> Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Allison Henderson 

Looks fine to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c   | 38 ++
>  include/linux/mm_types.h   |  5 -
>  include/linux/page-flags.h |  2 +-
>  3 files changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..84fadea08705 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
>   for (pfn = dax_to_pfn(entry); \
>   pfn < dax_end_pfn(entry); pfn++)
>  
> -static inline bool dax_mapping_is_cow(struct address_space *mapping)
> +static inline bool dax_page_is_shared(struct page *page)
>  {
> - return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
> + return page->mapping == PAGE_MAPPING_DAX_SHARED;
>  }
>  
>  /*
> - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
> + * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> + * refcount.
>   */
> -static inline void dax_mapping_set_cow(struct page *page)
> +static inline void dax_page_share_get(struct page *page)
>  {
> - if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
> + if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
>   /*
>* Reset the index if the page was already mapped
>* regularly before.
>*/
>   if (page->mapping)
> - page->index = 1;
> - page->mapping = (void *)PAGE_MAPPING_DAX_COW;
> + page->share = 1;
> + page->mapping = PAGE_MAPPING_DAX_SHARED;
>   }
> - page->index++;
> + page->share++;
> +}
> +
> +static inline unsigned long dax_page_share_put(struct page *page)
> +{
> + return --page->share;
>  }
>  
>  /*
> - * When it is called in dax_insert_entry(), the cow flag will indicate that
> + * When it is called in dax_insert_entry(), the shared flag will indicate that
>   * whether this entry is shared by multiple files.  If so, set the page->mapping
> - * FS_DAX_MAPPING_COW, and use page->index as refcount.
> + * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>   */
>  static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool cow)
> + struct vm_area_struct *vma, unsigned long address, bool shared)
>  {
>   unsigned long size = dax_entry_size(entry), pfn, index;
>   int i = 0;
> @@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (cow) {
> - dax_mapping_set_cow(page);
> + if (shared) {
> + dax_page_share_get(page);
>   } else {
>   WARN_ON_ONCE(page->mapping);
>   page->mapping = mapping;
> @@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>   struct page *page = pfn_to_page(pfn);
>  
>   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - if (dax_mapping_is_cow(page->mapping)) {
> - /* keep the CoW flag if this page is still shared */
> - if (page->index-- > 0)
> + if (dax_page_is_shared(page)) {
> + /* keep the shared flag if this page is still shared */
> + if (dax_page_share_put(page) > 0)
>   continue;
>   } else
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..f46cac3657ad 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -103,7 +103,10 @@ struct page {
>   };
>   /* See page-flags.h for PAGE_MAPPING_FLAGS */
>   struct address_space *mapping;
> - pgoff_t index;  /* 

Re: [PATCH v2.1 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-04 Thread Darrick J. Wong
On Mon, Dec 05, 2022 at 01:56:24PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/12/3 10:07, Dan Williams wrote:
> > Shiyang Ruan wrote:
> > > fsdax page is used not only when CoW, but also mapread. To make it
> > > easily understood, use 'share' to indicate that the dax page is shared
> > > by more than one extent.  And add helper functions to use it.
> > > 
> > > Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
> > > 
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >   fs/dax.c   | 38 ++
> > >   include/linux/mm_types.h   |  5 -
> > >   include/linux/page-flags.h |  2 +-
> > >   3 files changed, 27 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 1c6867810cbd..edbacb273ab5 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
> > >   for (pfn = dax_to_pfn(entry); \
> > >   pfn < dax_end_pfn(entry); pfn++)
> > > -static inline bool dax_mapping_is_cow(struct address_space *mapping)
> > > +static inline bool dax_page_is_shared(struct page *page)
> > >   {
> > > - return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
> > > + return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
> > >   }
> > >   /*
> > > - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
> > > + * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> > > + * refcount.
> > >*/
> > > -static inline void dax_mapping_set_cow(struct page *page)
> > > +static inline void dax_page_bump_sharing(struct page *page)
> > 
> > Similar to page_ref naming I would call this page_share_get() and the
> > corresponding function page_share_put().
> > 
> > >   {
> > > - if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
> > > + if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
> > >   /*
> > >* Reset the index if the page was already mapped
> > >* regularly before.
> > >*/
> > >   if (page->mapping)
> > > - page->index = 1;
> > > - page->mapping = (void *)PAGE_MAPPING_DAX_COW;
> > > + page->share = 1;
> > > + page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
> > 
> > Small nit, You could save a cast here by defining
> > PAGE_MAPPING_DAX_SHARED as "((void *) 1)".
> 
> Ok.

It's sort of a pity you can't pass around a pointer to a privately
defined const struct in dax.c.  But yeah, you might as well include the
cast in the macro definition.
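
A minimal sketch of what folding the cast into the macro buys you, using
stand-in names (TOY_MAPPING_DAX_SHARED, struct toy_page) rather than the
real kernel definitions: every comparison and assignment below is
cast-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sentinel with the cast folded into the definition, as suggested above. */
#define TOY_MAPPING_DAX_SHARED	((void *)1)

/* Hypothetical stand-in for the two struct page fields involved. */
struct toy_page {
	void *mapping;
	unsigned long share;
};

bool toy_page_is_shared(const struct toy_page *page)
{
	return page->mapping == TOY_MAPPING_DAX_SHARED;
}

/* Mark the page shared if it is not already, then take a share reference. */
void toy_page_share_get(struct toy_page *page)
{
	assert(page != NULL);

	if (page->mapping != TOY_MAPPING_DAX_SHARED) {
		/* A page that was mapped regularly already has one user. */
		if (page->mapping)
			page->share = 1;
		page->mapping = TOY_MAPPING_DAX_SHARED;
	}
	page->share++;
}
```

The logic mirrors the helper in the quoted patch; only the names and the
two-field struct are invented for the example.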

> > 
> > >   }
> > > - page->index++;
> > > + page->share++;
> > > +}
> > > +
> > > +static inline unsigned long dax_page_drop_sharing(struct page *page)
> > > +{
> > > + return --page->share;
> > >   }
> > >   /*
> > > - * When it is called in dax_insert_entry(), the cow flag will indicate that
> > > + * When it is called in dax_insert_entry(), the shared flag will indicate that
> > >   * whether this entry is shared by multiple files.  If so, set the page->mapping
> > > - * FS_DAX_MAPPING_COW, and use page->index as refcount.
> > > + * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
> > >*/
> > >   static void dax_associate_entry(void *entry, struct address_space *mapping,
> > > - struct vm_area_struct *vma, unsigned long address, bool cow)
> > > + struct vm_area_struct *vma, unsigned long address, bool shared)
> > >   {
> > >   unsigned long size = dax_entry_size(entry), pfn, index;
> > >   int i = 0;
> > > @@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
> > >   for_each_mapped_pfn(entry, pfn) {
> > >   struct page *page = pfn_to_page(pfn);
> > > - if (cow) {
> > > - dax_mapping_set_cow(page);
> > > + if (shared) {
> > > + dax_page_bump_sharing(page);
> > >   } else {
> > >   WARN_ON_ONCE(page->mapping);
> > >   page->mapping = mapping;
> > > @@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> > >   struct page *page = pfn_to_page(pfn);
> > >   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> > > - if (dax_mapping_is_cow(page->mapping)) {
> > > - /* keep the CoW flag if this page is still shared */
> > > - if (page->index-- > 0)
> > > + if (dax_page_is_shared(page)) {
> > > + /* keep the shared flag if this page is still shared */
> > > + if (dax_page_drop_sharing(page) > 0)
> > >   continue;
> > 
> > I think part of what makes this hard to read is trying to preserve the
> > same code paths for shared pages and typical pages.
> > 
> > 

Re: [PATCH v2 8/8] xfs: remove restrictions for fsdax and reflink

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:32:53PM +, Shiyang Ruan wrote:
> Since the basic functionality for fsdax and reflink has been implemented,
> remove the restrictions on combining them so they can be widely tested.
> 
> Signed-off-by: Shiyang Ruan 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_ioctl.c | 4 
>  fs/xfs/xfs_iops.c  | 4 
>  2 files changed, 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 1f783e979629..13f1b2add390 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1138,10 +1138,6 @@ xfs_ioctl_setattr_xflags(
>   if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
>   ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
>  
> - /* Don't allow us to set DAX mode for a reflinked file for now. */
> - if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
> - return -EINVAL;
> -
>   /* diflags2 only valid for v3 inodes. */
>   i_flags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);
>   if (i_flags2 && !xfs_has_v3inodes(mp))
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 2e10e1c66ad6..bf0495f7a5e1 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1185,10 +1185,6 @@ xfs_inode_supports_dax(
>   if (!S_ISREG(VFS_I(ip)->i_mode))
>   return false;
>  
> - /* Only supported on non-reflinked files. */
> - if (xfs_is_reflink_inode(ip))
> - return false;
> -
>   /* Block size must match page size */
>   if (mp->m_sb.sb_blocksize != PAGE_SIZE)
>   return false;
> -- 
> 2.38.1
> 



Re: [PATCH v2 6/8] xfs: use dax ops for zero and truncate in fsdax mode

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:32:10PM +, Shiyang Ruan wrote:
> Zero and truncate on a dax file may execute CoW.  So use the dax ops, which
> contain the end work for CoW.
> 
> Signed-off-by: Shiyang Ruan 

LGTM
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_iomap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 881de99766ca..d9401d0300ad 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1370,7 +1370,7 @@ xfs_zero_range(
>  
>   if (IS_DAX(inode))
>   return dax_zero_range(inode, pos, len, did_zero,
> -   _direct_write_iomap_ops);
> +   _dax_write_iomap_ops);
>   return iomap_zero_range(inode, pos, len, did_zero,
>   _buffered_write_iomap_ops);
>  }
> @@ -1385,7 +1385,7 @@ xfs_truncate_page(
>  
>   if (IS_DAX(inode))
>   return dax_truncate_page(inode, pos, did_zero,
> - _direct_write_iomap_ops);
> + _dax_write_iomap_ops);
>   return iomap_truncate_page(inode, pos, did_zero,
>  _buffered_write_iomap_ops);
>  }
> -- 
> 2.38.1
> 



Re: [PATCH v2 5/8] fsdax: dedupe: iter two files at the same time

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:31:41PM +, Shiyang Ruan wrote:
> The iomap_iter() on a range of one file may loop more than once.  In
> this case, the inner dst_iter can update its iomap but the outer
> src_iter can't.  This may cause wrong remapping in the filesystem.  Have
> them called at the same time.
> 
> Signed-off-by: Shiyang Ruan 

Thank you for adding that explanation, it makes the problem much more
obvious. :)
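
For readers following along, the lockstep idea can be modelled in user
space roughly like this.  It is only a sketch under assumed types:
struct toy_iter and toy_compare_lockstep() are hypothetical, and each
"mapping" is reduced to a length.  Each step covers the largest chunk
that fits inside the current mapping of both files, and both cursors
move by that same amount, so neither side can run ahead of the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over a file's mappings, given as extent lengths. */
struct toy_iter {
	const uint64_t *ext_len;	/* length of each mapping */
	size_t nr;			/* number of mappings */
	size_t cur;			/* index of the current mapping */
	uint64_t used;			/* bytes consumed of the current mapping */
};

static uint64_t toy_remaining(const struct toy_iter *it)
{
	return it->ext_len[it->cur] - it->used;
}

static void toy_advance(struct toy_iter *it, uint64_t bytes)
{
	it->used += bytes;
	if (it->used == it->ext_len[it->cur]) {
		it->cur++;
		it->used = 0;
	}
}

/*
 * Walk @len bytes of both files in lockstep and return the number of
 * steps taken.  The caller must supply mappings covering at least @len
 * bytes on both sides.
 */
unsigned int toy_compare_lockstep(struct toy_iter *src, struct toy_iter *dst,
				  uint64_t len)
{
	unsigned int steps = 0;

	while (len > 0) {
		assert(src->cur < src->nr && dst->cur < dst->nr);

		uint64_t chunk = toy_remaining(src);

		if (toy_remaining(dst) < chunk)
			chunk = toy_remaining(dst);
		if (len < chunk)
			chunk = len;

		/* a real implementation would compare the data here */
		toy_advance(src, chunk);
		toy_advance(dst, chunk);
		len -= chunk;
		steps++;
	}
	return steps;
}
```

With one 8-byte source mapping and two 4-byte destination mappings, the
walk takes two steps and both cursors finish together.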

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index f1eb59bee0b5..354be56750c2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1964,15 +1964,15 @@ int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   .len= len,
>   .flags  = IOMAP_DAX,
>   };
> - int ret;
> + int ret, compared = 0;
>  
> - while ((ret = iomap_iter(_iter, ops)) > 0) {
> - while ((ret = iomap_iter(_iter, ops)) > 0) {
> - dst_iter.processed = dax_range_compare_iter(_iter,
> - _iter, len, same);
> - }
> - if (ret <= 0)
> - src_iter.processed = ret;
> + while ((ret = iomap_iter(_iter, ops)) > 0 &&
> +(ret = iomap_iter(_iter, ops)) > 0) {
> + compared = dax_range_compare_iter(_iter, _iter, len,
> +   same);
> + if (compared < 0)
> + return ret;
> + src_iter.processed = dst_iter.processed = compared;
>   }
>   return ret;
>  }
> -- 
> 2.38.1
> 



Re: [PATCH v2 4/8] fsdax,xfs: set the shared flag when file extent is shared

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:28:54PM +, Shiyang Ruan wrote:
> If a dax page is shared, mapread at different offsets can also trigger
> a page fault on the same dax page.  So, change the flag from "cow" to
> "shared".  And get the shared flag from the filesystem on read.
> 
> Signed-off-by: Shiyang Ruan 

Makes sense.
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c   | 19 +++
>  fs/xfs/xfs_iomap.c |  2 +-
>  2 files changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 6b6e07ad8d80..f1eb59bee0b5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -846,12 +846,6 @@ static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
>   (iter->iomap.flags & IOMAP_F_DIRTY);
>  }
>  
> -static bool dax_fault_is_cow(const struct iomap_iter *iter)
> -{
> - return (iter->flags & IOMAP_WRITE) &&
> - (iter->iomap.flags & IOMAP_F_SHARED);
> -}
> -
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -865,13 +859,14 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
>  {
>   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   void *new_entry = dax_make_entry(pfn, flags);
> - bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
> - bool cow = dax_fault_is_cow(iter);
> + bool write = iter->flags & IOMAP_WRITE;
> + bool dirty = write && !dax_fault_is_synchronous(iter, vmf->vma);
> + bool shared = iter->iomap.flags & IOMAP_F_SHARED;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
> + if (shared || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -883,12 +878,12 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>   void *old;
>  
>   dax_disassociate_entry(entry, mapping, false);
>   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> - cow);
> + shared);
>   /*
>* Only swap our new entry into the page cache if the current
>* entry is a zero page or an empty entry.  If a normal PTE or
> @@ -908,7 +903,7 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
>   if (dirty)
>   xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> - if (cow)
> + if (write && shared)
>   xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
>  
>   xas_unlock_irq(xas);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 07da03976ec1..881de99766ca 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1215,7 +1215,7 @@ xfs_read_iomap_begin(
>   return error;
>   error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, ,
>  , 0);
> - if (!error && (flags & IOMAP_REPORT))
> + if (!error && ((flags & IOMAP_REPORT) || IS_DAX(inode)))
>   error = xfs_reflink_trim_around_shared(ip, , );
>   xfs_iunlock(ip, lockmode);
>  
> -- 
> 2.38.1
> 



Re: [PATCH v2 3/8] fsdax: zero the edges if source is HOLE or UNWRITTEN

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:28:53PM +, Shiyang Ruan wrote:
> If srcmap contains invalid data, such as HOLE and UNWRITTEN, the dest
> page should be zeroed.  Otherwise, since it's pmem, old data may
> remain on the dest page, and the result of CoW will be incorrect.
> 
> The function name is also not easy to understand, so rename it to
> "dax_iomap_copy_around()", which means it copies data around the range.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 78 ++--
>  1 file changed, 48 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 482dda85ccaf..6b6e07ad8d80 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1092,7 +1092,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
>  }
>  
>  /**
> - * dax_iomap_cow_copy - Copy the data from source to destination before write
> + * dax_iomap_copy_around - Copy the data from source to destination before 
> write

 * dax_iomap_copy_around - Prepare for an unaligned write to a
 * shared/cow page by copying the data before and after the range to be
 * written.

Other than that, this makes sense,
Reviewed-by: Darrick J. Wong 

--D
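
For reference, a minimal userspace sketch of the head/tail arithmetic in the
function quoted below, which is what the suggested comment wording tries to
describe. The struct and function names here are made up for illustration
and are not part of fs/dax.c; align_size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for this sketch; not a kernel structure. */
struct edge_ranges {
	uint64_t head_len;	/* bytes before the write, from the aligned start */
	uint64_t tail_off;	/* start of the tail, relative to the aligned start */
	uint64_t tail_len;	/* bytes after the write, up to the aligned end */
};

/* Mirrors the head_off/end/pg_end arithmetic from the quoted patch. */
static struct edge_ranges copy_around_edges(uint64_t pos, uint64_t length,
					    uint64_t align_size)
{
	uint64_t head_off = pos & (align_size - 1);
	uint64_t end = pos + length;
	/* round_up(end, align_size) for a power-of-two alignment */
	uint64_t pg_end = (end + align_size - 1) & ~(align_size - 1);
	struct edge_ranges r = {
		.head_len = head_off,
		.tail_off = head_off + length,
		.tail_len = pg_end - end,
	};

	return r;
}
```

With 4096-byte alignment, a 200-byte write at pos 4196 leaves 100 bytes to
copy (or zero) before the write and 3796 bytes after it; a fully aligned
write leaves nothing on either side, which is the copy_all case.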

>   * @pos: address to do copy from.
>   * @length:  size of copy operation.
>   * @align_size:  aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
> @@ -1101,35 +1101,50 @@ static int dax_iomap_direct_access(const struct iomap 
> *iomap, loff_t pos,
>   *
>   * This can be called from two places. Either during DAX write fault (page
>   * aligned), to copy the length size data to daddr. Or, while doing normal 
> DAX
> - * write operation, dax_iomap_actor() might call this to do the copy of 
> either
> + * write operation, dax_iomap_iter() might call this to do the copy of either
>   * start or end unaligned address. In the latter case the rest of the copy of
> - * aligned ranges is taken care by dax_iomap_actor() itself.
> + * aligned ranges is taken care by dax_iomap_iter() itself.
> + * If the srcmap contains invalid data, such as HOLE and UNWRITTEN, zero the
> + * area to make sure no old data remains.
>   */
> -static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
> +static int dax_iomap_copy_around(loff_t pos, uint64_t length, size_t 
> align_size,
>   const struct iomap *srcmap, void *daddr)
>  {
>   loff_t head_off = pos & (align_size - 1);
>   size_t size = ALIGN(head_off + length, align_size);
>   loff_t end = pos + length;
>   loff_t pg_end = round_up(end, align_size);
> + /* copy_all is usually in page fault case */
>   bool copy_all = head_off == 0 && end == pg_end;
> + /* zero the edges if srcmap is a HOLE or IOMAP_UNWRITTEN */
> + bool zero_edge = srcmap->flags & IOMAP_F_SHARED ||
> +  srcmap->type == IOMAP_UNWRITTEN;
>   void *saddr = 0;
>   int ret = 0;
>  
> - ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> - if (ret)
> - return ret;
> + if (!zero_edge) {
> + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> + if (ret)
> + return ret;
> + }
>  
>   if (copy_all) {
> - ret = copy_mc_to_kernel(daddr, saddr, length);
> - return ret ? -EIO : 0;
> + if (zero_edge)
> + memset(daddr, 0, size);
> + else
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + goto out;
>   }
>  
>   /* Copy the head part of the range */
>   if (head_off) {
> - ret = copy_mc_to_kernel(daddr, saddr, head_off);
> - if (ret)
> - return -EIO;
> + if (zero_edge)
> + memset(daddr, 0, head_off);
> + else {
> + ret = copy_mc_to_kernel(daddr, saddr, head_off);
> + if (ret)
> + return -EIO;
> + }
>   }
>  
>   /* Copy the tail part of the range */
> @@ -1137,12 +1152,19 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t 
> length, size_t align_size,
>   loff_t tail_off = head_off + length;
>   loff_t tail_len = pg_end - end;
>  
> - ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
> - tail_len);
> - if (ret)
> - return -EIO;
> + if (zero_edge)
> + memset(daddr + tail_off, 0, tail_len);
> + else {
> + ret = copy_mc_to_kernel(daddr + tail_off

Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 11:39:12PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/12/1 5:08, Darrick J. Wong wrote:
> > On Tue, Nov 29, 2022 at 11:05:30PM -0800, Dan Williams wrote:
> > > Darrick J. Wong wrote:
> > > > On Tue, Nov 29, 2022 at 07:59:14PM -0800, Dan Williams wrote:
> > > > > [ add Andrew ]
> > > > > 
> > > > > Shiyang Ruan wrote:
> > > > > > Many testcases failed in dax+reflink mode with warning message in 
> > > > > > dmesg.
> > > > > > This also effects dax+noreflink mode if we run the test after a
> > > > > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > > > > messages.
> > > > > > 
> > > > > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > > > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > > > > Patch 2 adds the implementation of unshare for fsdax.
> > > > > > 
> > > > > > With these fixes, most warning messages in dax_associate_entry() are
> > > > > > gone.  But honestly, generic/388 will randomly failed with the 
> > > > > > warning.
> > > > > > The case shutdown the xfs when fsstress is running, and do it for 
> > > > > > many
> > > > > > times.  I think the reason is that dax pages in use are not able to 
> > > > > > be
> > > > > > invalidated in time when fs is shutdown.  The next time dax page to 
> > > > > > be
> > > > > > associated, it still remains the mapping value set last time.  I'll 
> > > > > > keep
> > > > > > on solving it.
> > > > > > 
> > > > > > The warning message in dax_writeback_one() can also be fixed 
> > > > > > because of
> > > > > > the dax unshare.
> > > > > 
> > > > > Thank you for digging in on this, I had been pinned down on CXL tasks
> > > > > and worried that we would need to mark FS_DAX broken for a cycle, so
> > > > > this is timely.
> > > > > 
> > > > > My only concern is that these patches look to have significant 
> > > > > collisions with
> > > > > the fsdax page reference counting reworks pending in linux-next. 
> > > > > Although,
> > > > > those are still sitting in mm-unstable:
> > > > > 
> > > > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> > > > > 
> > > > > My preference would be to move ahead with both in which case I can 
> > > > > help
> > > > > rebase these fixes on top. In that scenario everything would go 
> > > > > through
> > > > > Andrew.
> > > > > 
> > > > > However, if we are getting too late in the cycle for that path I think
> > > > > these dax-fixes take precedence, and one more cycle to let the page
> > > > > reference count reworks sit is ok.
> > > > 
> > > > Well now that raises some interesting questions -- dax and reflink are
> > > > totally broken on 6.1.  I was thinking about cramming them into 6.2 as a
> > > > data corruption fix on the grounds that is not an acceptable state of
> > > > affairs.
> > > 
> > > I agree it's not an acceptable state of affairs, but for 6.1 the answer
> > > may be to just revert to dax+reflink being forbidden again. The fact
> > > that no end user has noticed is probably a good sign that we can disable
> > > that without any one screaming. That may be the easy answer for 6.2 as
> > > well given how late this all is.
> > > 
> > > > OTOH we're past -rc7, which is **really late** to be changing core code.
> > > > Then again, there aren't so many fsdax users and nobody's complained
> > > > about 6.0/6.1 being busted, so perhaps the risk of regression isn't so
> > > > bad?  Then again, that could be a sign that this could wait, if you and
> > > > Andrew are really eager to merge the reworks.
> > > 
> > > The page reference counting has also been languishing for a long time. A
> > > 6.2 merge would be nice, it relieves maintenance burden, but they do not
> > > start to have real end user implications until CXL memory hotplug
> > > platforms arrive and the warts in the reference counting start to show
> >

Re: [PATCH v2 2/8] fsdax: invalidate pages when CoW

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:28:52PM +, Shiyang Ruan wrote:
> CoW changes the share state of a dax page, but the share count of the
> page isn't updated.  The next time access this page, it should have been
> a newly accessed, but old association exists.  So, we need to clear the
> share state when CoW happens, in both dax_iomap_rw() and
> dax_zero_iter().
> 
> Signed-off-by: Shiyang Ruan 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D
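
As a quick sanity check on the range passed to
invalidate_inode_pages2_range() in the hunks below, here is a small
userspace sketch of the byte-range to page-index conversion. It assumes
4 KiB pages; the macro and function names are invented for this example.
The end index is inclusive, hence the "- 1".

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Index of the first page touched by a range starting at pos. */
static uint64_t first_page_index(uint64_t pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/* Index of the last page touched by [pos, pos + length), inclusive. */
static uint64_t last_page_index(uint64_t pos, uint64_t length)
{
	return (pos + length - 1) >> SKETCH_PAGE_SHIFT;
}
```

A 200-byte write at offset 4000 straddles pages 0 and 1, so both get
invalidated; a page-aligned 4096-byte write at offset 4096 touches page 1
only.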

> ---
>  fs/dax.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85b81963ea31..482dda85ccaf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1264,6 +1264,15 @@ static s64 dax_zero_iter(struct iomap_iter *iter, bool 
> *did_zero)
>   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
>   return length;
>  
> + /*
> +  * invalidate the pages whose sharing state is to be changed
> +  * because of CoW.
> +  */
> + if (iomap->flags & IOMAP_F_SHARED)
> + invalidate_inode_pages2_range(iter->inode->i_mapping,
> +   pos >> PAGE_SHIFT,
> +   (pos + length - 1) >> PAGE_SHIFT);
> +
>   do {
>   unsigned offset = offset_in_page(pos);
>   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
> @@ -1324,12 +1333,13 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>   struct iov_iter *iter)
>  {
>   const struct iomap *iomap = &iomi->iomap;
> - const struct iomap *srcmap = &iomi->srcmap;
> + const struct iomap *srcmap = iomap_iter_srcmap(iomi);
>   loff_t length = iomap_length(iomi);
>   loff_t pos = iomi->pos;
>   struct dax_device *dax_dev = iomap->dax_dev;
>   loff_t end = pos + length, done = 0;
>   bool write = iov_iter_rw(iter) == WRITE;
> + bool cow = write && iomap->flags & IOMAP_F_SHARED;
>   ssize_t ret = 0;
>   size_t xfer;
>   int id;
> @@ -1356,7 +1366,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>* into page tables. We have to tear down these mappings so that data
>* written by write(2) is visible in mmap.
>*/
> - if (iomap->flags & IOMAP_F_NEW) {
> + if (iomap->flags & IOMAP_F_NEW || cow) {
>   invalidate_inode_pages2_range(iomi->inode->i_mapping,
> pos >> PAGE_SHIFT,
> (end - 1) >> PAGE_SHIFT);
> @@ -1390,8 +1400,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
>   break;
>   }
>  
> - if (write &&
> - srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
> + if (cow) {
>   ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
>kaddr);
>   if (ret)
> -- 
> 2.38.1
> 



Re: [PATCH v2 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-01 Thread Darrick J. Wong
On Thu, Dec 01, 2022 at 03:28:51PM +, Shiyang Ruan wrote:
> fsdax page is used not only when CoW, but also mapread. To make the it
> easily understood, use 'share' to indicate that the dax page is shared
> by more than one extent.  And add helper functions to use it.
> 
> Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c   | 38 ++
>  include/linux/mm_types.h   |  5 -
>  include/linux/page-flags.h |  2 +-
>  3 files changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..85b81963ea31 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -334,35 +334,41 @@ static unsigned long dax_end_pfn(void *entry)
>   for (pfn = dax_to_pfn(entry); \
>   pfn < dax_end_pfn(entry); pfn++)
>  
> -static inline bool dax_mapping_is_cow(struct address_space *mapping)
> +static inline bool dax_mapping_is_shared(struct page *page)

dax_page_is_shared?

>  {
> - return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
> + return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
>  }
>  
>  /*
> - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
> + * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> + * refcount.
>   */
> -static inline void dax_mapping_set_cow(struct page *page)
> +static inline void dax_mapping_set_shared(struct page *page)

It's odd that a function of a struct page still has 'mapping' in the
name.

dax_page_increase_shared?

or perhaps simply

dax_page_bump_sharing and dax_page_drop_sharing?

Otherwise this mechanical change looks pretty straightforward.

--D
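
To show what the suggested names would read like, here is a rough userspace
model of the two helpers. It is only a sketch: struct sketch_page and
SKETCH_MAPPING_SHARED are stand-ins invented for this example (the real code
uses struct page and PAGE_MAPPING_DAX_SHARED), and the bodies just follow
the logic of the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in marker for PAGE_MAPPING_DAX_SHARED; the value is arbitrary. */
#define SKETCH_MAPPING_SHARED ((void *)0x1)

/* Hypothetical, simplified stand-in for struct page. */
struct sketch_page {
	void *mapping;
	unsigned long share;
};

/* Mark the page shared and count one more user. */
static void dax_page_bump_sharing(struct sketch_page *page)
{
	if (page->mapping != SKETCH_MAPPING_SHARED) {
		/* Already mapped by one file: account for that user first. */
		if (page->mapping)
			page->share = 1;
		page->mapping = SKETCH_MAPPING_SHARED;
	}
	page->share++;
}

/* Drop one user and return how many remain. */
static unsigned long dax_page_drop_sharing(struct sketch_page *page)
{
	return --page->share;
}
```

With these names the caller in dax_disassociate_entry() would read roughly
as "if (dax_page_drop_sharing(page) > 0) continue;".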

>  {
> - if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
> + if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
>   /*
>* Reset the index if the page was already mapped
>* regularly before.
>*/
>   if (page->mapping)
> - page->index = 1;
> - page->mapping = (void *)PAGE_MAPPING_DAX_COW;
> + page->share = 1;
> + page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
>   }
> - page->index++;
> + page->share++;
> +}
> +
> +static inline unsigned long dax_mapping_decrease_shared(struct page *page)
> +{
> + return --page->share;
>  }
>  
>  /*
> - * When it is called in dax_insert_entry(), the cow flag will indicate that
> + * When it is called in dax_insert_entry(), the shared flag will indicate 
> that
>   * whether this entry is shared by multiple files.  If so, set the 
> page->mapping
> - * FS_DAX_MAPPING_COW, and use page->index as refcount.
> + * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>   */
>  static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool cow)
> + struct vm_area_struct *vma, unsigned long address, bool shared)
>  {
>   unsigned long size = dax_entry_size(entry), pfn, index;
>   int i = 0;
> @@ -374,8 +380,8 @@ static void dax_associate_entry(void *entry, struct 
> address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (cow) {
> - dax_mapping_set_cow(page);
> + if (shared) {
> + dax_mapping_set_shared(page);
>   } else {
>   WARN_ON_ONCE(page->mapping);
>   page->mapping = mapping;
> @@ -396,9 +402,9 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   struct page *page = pfn_to_page(pfn);
>  
>   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - if (dax_mapping_is_cow(page->mapping)) {
> - /* keep the CoW flag if this page is still shared */
> - if (page->index-- > 0)
> + if (dax_mapping_is_shared(page)) {
> + /* keep the shared flag if this page is still shared */
> + if (dax_mapping_decrease_shared(page) > 0)
>   continue;
>   } else
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..f46cac3657ad 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -103,7 +103,10 @@ struct page {
>   };
>   /* See page-flags.h for PAGE_MAPPING_FLAGS */
>   struct address_space *mapping;
> - pgoff_t index;  /* Our offset within mapping. */
> + union {
> + pgoff_t index;  /* Our offset within 
> mapping. */
> + 

Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-30 Thread Darrick J. Wong
On Wed, Nov 30, 2022 at 01:48:59PM -0800, Dan Williams wrote:
> Andrew Morton wrote:
> > On Tue, 29 Nov 2022 19:59:14 -0800 Dan Williams  
> > wrote:
> > 
> > > [ add Andrew ]
> > > 
> > > Shiyang Ruan wrote:
> > > > Many testcases failed in dax+reflink mode with warning message in dmesg.
> > > > This also effects dax+noreflink mode if we run the test after a
> > > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > > messages.
> > > > 
> > > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > > Patch 2 adds the implementation of unshare for fsdax.
> > > > 
> > > > With these fixes, most warning messages in dax_associate_entry() are
> > > > gone.  But honestly, generic/388 will randomly failed with the warning.
> > > > The case shutdown the xfs when fsstress is running, and do it for many
> > > > times.  I think the reason is that dax pages in use are not able to be
> > > > invalidated in time when fs is shutdown.  The next time dax page to be
> > > > associated, it still remains the mapping value set last time.  I'll keep
> > > > on solving it.
> > > > 
> > > > The warning message in dax_writeback_one() can also be fixed because of
> > > > the dax unshare.
> > > 
> > > Thank you for digging in on this, I had been pinned down on CXL tasks
> > > and worried that we would need to mark FS_DAX broken for a cycle, so
> > > this is timely.
> > > 
> > > My only concern is that these patches look to have significant collisions 
> > > with
> > > the fsdax page reference counting reworks pending in linux-next. Although,
> > > those are still sitting in mm-unstable:
> > > 
> > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> > 
> > As far as I know, Dan's "Fix the DAX-gup mistake" series is somewhat
> > stuck.  Jan pointed out:
> > 
> > https://lore.kernel.org/all/20221109113849.p7pwob533ijgrytu@quack3/T/#u
> > 
> > or have Jason's issues since been addressed?
> 
> No, they have not. I do think the current series is a step forward, but
> given the urgency remains low for the time being (CXL hotplug use case
> further out, no known collisions with ongoing folio work, and no
> MEMORY_DEVICE_PRIVATE users looking to build any conversions on top for
> 6.2) I am ok to circle back for 6.3 for that follow on work to be
> integrated.
> 
> > > My preference would be to move ahead with both in which case I can help
> > > rebase these fixes on top. In that scenario everything would go through
> > > Andrew.
> > > 
> > > However, if we are getting too late in the cycle for that path I think
> > > these dax-fixes take precedence, and one more cycle to let the page
> > > reference count reworks sit is ok.
> > 
> > That sounds a decent approach.  So we go with this series ("fsdax,xfs:
> > fix warning messages") and aim at 6.3-rc1 with "Fix the DAX-gup
> > mistake"?
> > 
> 
> Yeah, that's the path of least hassle.

Sounds good.  I still want to see patch 1 of this series broken up into
smaller pieces though.  Once the series goes through review, do you want
me to push the fixes to Linus, seeing as xfs is the only user of this
functionality?

--D



Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-30 Thread Darrick J. Wong
On Tue, Nov 29, 2022 at 11:05:30PM -0800, Dan Williams wrote:
> Darrick J. Wong wrote:
> > On Tue, Nov 29, 2022 at 07:59:14PM -0800, Dan Williams wrote:
> > > [ add Andrew ]
> > > 
> > > Shiyang Ruan wrote:
> > > > Many testcases failed in dax+reflink mode with warning message in dmesg.
> > > > This also effects dax+noreflink mode if we run the test after a
> > > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > > messages.
> > > > 
> > > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > > Patch 2 adds the implementation of unshare for fsdax.
> > > > 
> > > > With these fixes, most warning messages in dax_associate_entry() are
> > > > gone.  But honestly, generic/388 will randomly failed with the warning.
> > > > The case shutdown the xfs when fsstress is running, and do it for many
> > > > times.  I think the reason is that dax pages in use are not able to be
> > > > invalidated in time when fs is shutdown.  The next time dax page to be
> > > > associated, it still remains the mapping value set last time.  I'll keep
> > > > on solving it.
> > > > 
> > > > The warning message in dax_writeback_one() can also be fixed because of
> > > > the dax unshare.
> > > 
> > > Thank you for digging in on this, I had been pinned down on CXL tasks
> > > and worried that we would need to mark FS_DAX broken for a cycle, so
> > > this is timely.
> > > 
> > > My only concern is that these patches look to have significant collisions 
> > > with
> > > the fsdax page reference counting reworks pending in linux-next. Although,
> > > those are still sitting in mm-unstable:
> > > 
> > > http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> > > 
> > > My preference would be to move ahead with both in which case I can help
> > > rebase these fixes on top. In that scenario everything would go through
> > > Andrew.
> > > 
> > > However, if we are getting too late in the cycle for that path I think
> > > these dax-fixes take precedence, and one more cycle to let the page
> > > reference count reworks sit is ok.
> > 
> > Well now that raises some interesting questions -- dax and reflink are
> > totally broken on 6.1.  I was thinking about cramming them into 6.2 as a
> > data corruption fix on the grounds that is not an acceptable state of
> > affairs.
> 
> I agree it's not an acceptable state of affairs, but for 6.1 the answer
> may be to just revert to dax+reflink being forbidden again. The fact
> that no end user has noticed is probably a good sign that we can disable
> that without any one screaming. That may be the easy answer for 6.2 as
> well given how late this all is.
> 
> > OTOH we're past -rc7, which is **really late** to be changing core code.
> > Then again, there aren't so many fsdax users and nobody's complained
> > about 6.0/6.1 being busted, so perhaps the risk of regression isn't so
> > bad?  Then again, that could be a sign that this could wait, if you and
> > Andrew are really eager to merge the reworks.
> 
> The page reference counting has also been languishing for a long time. A
> 6.2 merge would be nice, it relieves maintenance burden, but they do not
> start to have real end user implications until CXL memory hotplug
> platforms arrive and the warts in the reference counting start to show
> real problems in production.

Hm.  How bad *would* it be to rebase that patchset atop this one?

After overnight testing on -rc7 it looks like Ruan's patchset fixes all
the problems AFAICT.  Most of the remaining regressions are to mask off
fragmentation testing because fsdax cow (like the directio write paths)
doesn't make much use of extent size hints.

> > Just looking at the stuff that's still broken with dax+reflink -- I
> > noticed that xfs/550-552 (aka the dax poison tests) are still regressing
> > on reflink filesystems.
> 
> That's worrying because the whole point of reworking dax, xfs, and
> mm/memory-failure all at once was to handle the collision of poison and
> reflink'd dax files.

I just tried out -rc7 and all three pass, so disregard this please.

> > So, uh, what would this patchset need to change if the "fsdax page
> > reference counting reworks" were applied?  Would it be changing the page
> > refcount instead of stashing that in page->index?
> 
> Nah, it's things like swit

Re: [PATCH 1/2] fsdax,xfs: fix warning messages at dax_[dis]associate_entry()

2022-11-30 Thread Darrick J. Wong
On Wed, Nov 30, 2022 at 04:58:32PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/11/30 12:08, Darrick J. Wong wrote:
> > On Thu, Nov 24, 2022 at 02:54:53PM +, Shiyang Ruan wrote:
> > > This patch fixes the warning message reported in dax_associate_entry()
> > > and dax_disassociate_entry().
> > 
> > Hmm, that's quite a bit to put in a single patch, but I'll try to get
> > through this...
> 
> Oh sorry...

Well you have to start somewhere. :)

I often start with a megapatch for testing and later break it into
smaller pieces once I've validated that the megapatch creates a solid
improvement.

> > 
> > > 1. reset page->mapping and ->index when refcount counting down to 0.
> > > 2. set IOMAP_F_SHARED flag when iomap read to allow one dax page to be
> > > associated more than once for not only write but also read.
> > 
> > That makes sense, I think.
> > 
> > > 3. should zero the edge (when not aligned) if srcmap is HOLE or
> > 
> > When is IOMAP_F_SHARED set on the /source/ mapping?
> 
> In fs/xfs/xfs_iomap.c: xfs_direct_write_iomap_begin(): goto out_found_cow
> tag, srcmap is *not set* when the source extent is HOLE, then only iomap is
> set with IOMAP_F_SHARED flag.
> 
> Now we come to iomap iter, when we get the srcmap by calling
> iomap_iter_srcmap(iter), the iomap will be returned (because srcmap isn't
> set).  So, in this case, srcmap == iomap, we can think the source extent is
> a HOLE if srcmap->flag & IOMAP_F_SHARED != 0

Aha, got it.  IOWs, this handles things like alwayscow and cowing over a
hole, where we don't have a source mapping.  Thanks for refreshing my
memory.
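
Spelling that out as a userspace sketch, mostly for my own benefit. The
types and names below are simplified stand-ins invented for this example,
and it assumes an unset srcmap shows up as a zero-initialized HOLE type, so
treat it as a model of the fallback rather than the real iomap code.

```c
#include <assert.h>

/* Simplified stand-ins; SK_HOLE == 0 models an srcmap that was never set. */
enum sketch_type { SK_HOLE = 0, SK_MAPPED, SK_UNWRITTEN };
#define SK_F_SHARED 0x4u	/* arbitrary flag value for the sketch */

struct sketch_iomap {
	enum sketch_type type;
	unsigned int flags;
};

struct sketch_iter {
	struct sketch_iomap iomap;
	struct sketch_iomap srcmap;
};

/* Fall back to the destination mapping when no source mapping was set. */
static const struct sketch_iomap *sketch_iter_srcmap(const struct sketch_iter *it)
{
	if (it->srcmap.type != SK_HOLE)
		return &it->srcmap;
	return &it->iomap;
}

/* The edges get zeroed when there is no valid source data to copy. */
static int sketch_zero_edge(const struct sketch_iomap *srcmap)
{
	return (srcmap->flags & SK_F_SHARED) || srcmap->type == SK_UNWRITTEN;
}
```

So when the write path only fills in iomap with the shared flag, the
"source" we get back is iomap itself, and seeing the shared flag on it is
what tells us there was nothing to copy from.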

> > > UNWRITTEN.
> > > 4. iterator of two files in dedupe should be executed side by side, not
> > > nested.
> > 
> > Why?  Also, this seems like a separate change?
> 
> Explain below.
> 
> > 
> > > 5. use xfs_dax_write_iomap_ops for xfs zero and truncate.
> > 
> > Makes sense.
> > 
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >   fs/dax.c   | 114 ++---
> > >   fs/xfs/xfs_iomap.c |   6 +--
> > >   2 files changed, 69 insertions(+), 51 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 1c6867810cbd..5ea7c0926b7f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -398,7 +398,7 @@ static void dax_disassociate_entry(void *entry, 
> > > struct address_space *mapping,
> > >   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> > >   if (dax_mapping_is_cow(page->mapping)) {
> > >   /* keep the CoW flag if this page is still 
> > > shared */
> > > - if (page->index-- > 0)
> > > + if (page->index-- > 1)
> > 
> > Hmm.  So if the fsdax "page" sharing factor drops from 2 to 1, we'll now
> > null out the mapping and index?  Before, we only did that when it
> > dropped from 1 to 0.
> > 
> > Does this leave the page with no mapping?  And I guess a subsequent
> > access will now take a fault to map it back in?
> 
> I confused it with --page->index, the result of "page->index--" is
> page->index itself.

Yeah, postfix operators in comparisons are not great for readability the
later one gets into the night.

> So, assume:
> this time, refcount is 2, >1, minus 1 to 1, then continue;
> next time, refcount is 1, not >1, minus 1 to 0, then clear the
> page->mapping.
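
To convince myself, here are the three spellings side by side as a tiny
userspace sketch. The function names are invented for this example, and a
plain unsigned long stands in for the counter kept in page->index.

```c
#include <assert.h>

/* Old code: compares the value BEFORE the decrement against 0. */
static int still_shared_old(unsigned long *share)
{
	return (*share)-- > 0;
}

/* Patched code: compares the value BEFORE the decrement against 1. */
static int still_shared_fixed(unsigned long *share)
{
	return (*share)-- > 1;
}

/* Equivalent prefix form: compares the value AFTER the decrement. */
static int still_shared_prefix(unsigned long *share)
{
	return --(*share) > 0;
}
```

Starting from a count of 1, the old test still reports "shared" even though
the count has just dropped to 0, so the mapping is never cleared on that
call. The patched postfix form and the prefix form agree with each other;
the prefix form is just easier to read.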

> 
> > 
> > >   continue;
> > >   } else
> > >   WARN_ON_ONCE(page->mapping && page->mapping != 
> > > mapping);
> > > @@ -840,12 +840,6 @@ static bool dax_fault_is_synchronous(const struct 
> > > iomap_iter *iter,
> > >   (iter->iomap.flags & IOMAP_F_DIRTY);
> > >   }
> > > -static bool dax_fault_is_cow(const struct iomap_iter *iter)
> > > -{
> > > - return (iter->flags & IOMAP_WRITE) &&
> > > - (iter->iomap.flags & IOMAP_F_SHARED);
> > > -}
> > > -
> > >   /*
> > >* By this point grab_mapping_entry() has ensured that we have a locked 
> > > entry
> > >* of the appropriate size so we don't have to worry about downgrading 
> > > PMDs to
> > > @@ -859,13 +853,14 @@ static void *dax_insert_entry(struct xa_state *xas, 
> > > struct vm_fault *vmf,
> &

Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-29 Thread Darrick J. Wong
On Tue, Nov 29, 2022 at 07:59:14PM -0800, Dan Williams wrote:
> [ add Andrew ]
> 
> Shiyang Ruan wrote:
> > Many testcases failed in dax+reflink mode with warning message in dmesg.
> > This also effects dax+noreflink mode if we run the test after a
> > dax+reflink test.  So, the most urgent thing is solving the warning
> > messages.
> > 
> > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > previously considered (srcmap is HOLE or UNWRITTEN).
> > Patch 2 adds the implementation of unshare for fsdax.
> > 
> > With these fixes, most warning messages in dax_associate_entry() are
> > gone.  But honestly, generic/388 will randomly failed with the warning.
> > The case shutdown the xfs when fsstress is running, and do it for many
> > times.  I think the reason is that dax pages in use are not able to be
> > invalidated in time when fs is shutdown.  The next time dax page to be
> > associated, it still remains the mapping value set last time.  I'll keep
> > on solving it.
> > 
> > The warning message in dax_writeback_one() can also be fixed because of
> > the dax unshare.
> 
> Thank you for digging in on this, I had been pinned down on CXL tasks
> and worried that we would need to mark FS_DAX broken for a cycle, so
> this is timely.
> 
> My only concern is that these patches look to have significant collisions with
> the fsdax page reference counting reworks pending in linux-next. Although,
> those are still sitting in mm-unstable:
> 
> http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org
> 
> My preference would be to move ahead with both in which case I can help
> rebase these fixes on top. In that scenario everything would go through
> Andrew.
> 
> However, if we are getting too late in the cycle for that path I think
> these dax-fixes take precedence, and one more cycle to let the page
> reference count reworks sit is ok.

Well now that raises some interesting questions -- dax and reflink are
totally broken on 6.1.  I was thinking about cramming them into 6.2 as a
data corruption fix on the grounds that is not an acceptable state of
affairs.

OTOH we're past -rc7, which is **really late** to be changing core code.
Then again, there aren't so many fsdax users and nobody's complained
about 6.0/6.1 being busted, so perhaps the risk of regression isn't so
bad?  Then again, that could be a sign that this could wait, if you and
Andrew are really eager to merge the reworks.

Just looking at the stuff that's still broken with dax+reflink -- I
noticed that xfs/550-552 (aka the dax poison tests) are still regressing
on reflink filesystems.

So, uh, what would this patchset need to change if the "fsdax page
reference counting reworks" were applied?  Would it be changing the page
refcount instead of stashing that in page->index?

--D

> > Shiyang Ruan (2):
> >   fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
> >   fsdax,xfs: port unshare to fsdax
> > 
> >  fs/dax.c | 166 ++-
> >  fs/xfs/xfs_iomap.c   |   6 +-
> >  fs/xfs/xfs_reflink.c |   8 ++-
> >  include/linux/dax.h  |   2 +
> >  4 files changed, 129 insertions(+), 53 deletions(-)
> > 
> > -- 
> > 2.38.1



Re: [PATCH 1/2] fsdax,xfs: fix warning messages at dax_[dis]associate_entry()

2022-11-29 Thread Darrick J. Wong
On Thu, Nov 24, 2022 at 02:54:53PM +, Shiyang Ruan wrote:
> This patch fixes the warning message reported in dax_associate_entry()
> and dax_disassociate_entry().

Hmm, that's quite a bit to put in a single patch, but I'll try to get
through this...

> 1. reset page->mapping and ->index when refcount counting down to 0.
> 2. set IOMAP_F_SHARED flag when iomap read to allow one dax page to be
> associated more than once for not only write but also read.

That makes sense, I think.

> 3. should zero the edge (when not aligned) if srcmap is HOLE or

When is IOMAP_F_SHARED set on the /source/ mapping?

> UNWRITTEN.
> 4. iterator of two files in dedupe should be executed side by side, not
> nested.

Why?  Also, this seems like a separate change?

> 5. use xfs_dax_write_iomap_ops for xfs zero and truncate. 

Makes sense.

> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c   | 114 ++---
>  fs/xfs/xfs_iomap.c |   6 +--
>  2 files changed, 69 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1c6867810cbd..5ea7c0926b7f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -398,7 +398,7 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>   if (dax_mapping_is_cow(page->mapping)) {
>   /* keep the CoW flag if this page is still shared */
> - if (page->index-- > 0)
> + if (page->index-- > 1)

Hmm.  So if the fsdax "page" sharing factor drops from 2 to 1, we'll now
null out the mapping and index?  Before, we only did that when it
dropped from 1 to 0.

Does this leave the page with no mapping?  And I guess a subsequent
access will now take a fault to map it back in?

>   continue;
>   } else
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> @@ -840,12 +840,6 @@ static bool dax_fault_is_synchronous(const struct 
> iomap_iter *iter,
>   (iter->iomap.flags & IOMAP_F_DIRTY);
>  }
>  
> -static bool dax_fault_is_cow(const struct iomap_iter *iter)
> -{
> - return (iter->flags & IOMAP_WRITE) &&
> - (iter->iomap.flags & IOMAP_F_SHARED);
> -}
> -
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -859,13 +853,14 @@ static void *dax_insert_entry(struct xa_state *xas, 
> struct vm_fault *vmf,
>  {
>   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   void *new_entry = dax_make_entry(pfn, flags);
> - bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
> - bool cow = dax_fault_is_cow(iter);
> + bool write = iter->flags & IOMAP_WRITE;
> + bool dirty = write && !dax_fault_is_synchronous(iter, vmf->vma);
> + bool shared = iter->iomap.flags & IOMAP_F_SHARED;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
> + if (shared || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {

Ah, ok, so now we're yanking the mapping if the extent is shared,
presumably so that...

>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -877,12 +872,12 @@ static void *dax_insert_entry(struct xa_state *xas, 
> struct vm_fault *vmf,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>   void *old;
>  
>   dax_disassociate_entry(entry, mapping, false);
>   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> - cow);
> + shared);

...down here we can rebuild the association, but this time we'll set the
page->mapping to PAGE_MAPPING_DAX_COW?  I see a lot of similar changes,
so I'm guessing this is how you fixed the failures that were a result of
read file A -> reflink A to B -> read file B sequences?
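
Spelling the new flag logic out as a truth table helped me here.  This
is a user-space sketch, not the kernel definitions: the FAKE_* bit
values and insert_flags() are invented for illustration, and "sync"
stands for the result of dax_fault_is_synchronous().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values; the kernel's IOMAP_* constants differ. */
#define FAKE_IOMAP_WRITE    (1u << 0)
#define FAKE_IOMAP_F_SHARED (1u << 1)

struct insert_decision {
	bool dirty;       /* mark the inode and the entry dirty */
	bool reassociate; /* shared alone forces dis/associate; zero and
			   * empty entries also qualify in the real code */
	bool towrite;     /* set PAGECACHE_TAG_TOWRITE */
};

static struct insert_decision insert_flags(unsigned int iter_flags,
					   unsigned int iomap_flags, bool sync)
{
	bool write = iter_flags & FAKE_IOMAP_WRITE;
	bool shared = iomap_flags & FAKE_IOMAP_F_SHARED;
	struct insert_decision d = {
		.dirty = write && !sync,
		.reassociate = shared,
		.towrite = write && shared,
	};

	return d;
}
```

A read fault on a shared extent now gives reassociate without dirty or
towrite.  Before the patch, dax_fault_is_cow() also required
IOMAP_WRITE, so the same read fault took the non-CoW association path.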

>   /*
>* Only swap our new entry into the page cache if the current
>* entry is a zero page or an empty entry.  If a normal PTE or
> @@ -902,7 +897,7 @@ static void *dax_insert_entry(struct xa_state *xas, 
> struct vm_fault *vmf,
>   if (dirty)
>   xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> - if (cow)
> + if (write && shared)
>   xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
>  
>   xas_unlock_irq(xas);
> @@ -1107,23 +1102,35 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t 
> length, size_t align_size,

I think this function isn't well named.  It's copying into the 

Re: [PATCH 2/2] fsdax,xfs: port unshare to fsdax

2022-11-29 Thread Darrick J. Wong
On Thu, Nov 24, 2022 at 02:54:54PM +, Shiyang Ruan wrote:
> Implement unshare in fsdax mode: copy data from srcmap to iomap.
> 
> Signed-off-by: Shiyang Ruan 

Heh, I had a version nearly like this in my tree.  Makes reviewing
easier:
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 52 
>  fs/xfs/xfs_reflink.c |  8 +--
>  include/linux/dax.h  |  2 ++
>  3 files changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5ea7c0926b7f..3d0bf68ab6b0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1235,6 +1235,58 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>  
> +static s64 dax_unshare_iter(struct iomap_iter *iter)
> +{
> + struct iomap *iomap = &iter->iomap;
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
> + loff_t pos = iter->pos;
> + loff_t length = iomap_length(iter);
> + int id = 0;
> + s64 ret = 0;
> + void *daddr = NULL, *saddr = NULL;
> +
> + /* don't bother with blocks that are not shared to start with */
> + if (!(iomap->flags & IOMAP_F_SHARED))
> + return length;
> + /* don't bother with holes or unwritten extents */
> + if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> + return length;
> +
> + id = dax_read_lock();
> + ret = dax_iomap_direct_access(iomap, pos, length, &daddr, NULL);
> + if (ret < 0)
> + goto out_unlock;
> +
> + ret = dax_iomap_direct_access(srcmap, pos, length, &saddr, NULL);
> + if (ret < 0)
> + goto out_unlock;
> +
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + if (ret)
> + ret = -EIO;
> +
> +out_unlock:
> + dax_read_unlock(id);
> + return ret;
> +}
> +
> +int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> + const struct iomap_ops *ops)
> +{
> + struct iomap_iter iter = {
> + .inode  = inode,
> + .pos= pos,
> + .len= len,
> + .flags  = IOMAP_WRITE | IOMAP_UNSHARE | IOMAP_DAX,
> + };
> + int ret;
> +
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> + iter.processed = dax_unshare_iter(&iter);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_file_unshare);
> +
>  static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
>  {
>   const struct iomap *iomap = &iter->iomap;
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 93bdd25680bc..fe46bce8cae6 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1693,8 +1693,12 @@ xfs_reflink_unshare(
>  
>   inode_dio_wait(inode);
>  
> - error = iomap_file_unshare(inode, offset, len,
> - &xfs_buffered_write_iomap_ops);
> + if (IS_DAX(inode))
> + error = dax_file_unshare(inode, offset, len,
> + &xfs_dax_write_iomap_ops);
> + else
> + error = iomap_file_unshare(inode, offset, len,
> + &xfs_buffered_write_iomap_ops);
>   if (error)
>   goto out;
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index ba985333e26b..2b5ecb591059 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -205,6 +205,8 @@ static inline void dax_unlock_mapping_entry(struct 
> address_space *mapping,
>  }
>  #endif
>  
> +int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> + const struct iomap_ops *ops);
>  int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool 
> *did_zero,
>   const struct iomap_ops *ops);
>  int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -- 
> 2.38.1
> 
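
For anyone skimming the patch, the control flow of dax_unshare_iter()
reduces to two early exits and a copy.  A user-space sketch of that
flow follows; struct fake_extent, enum fake_type and unshare_one() are
made-up names, and memcpy() stands in for copy_mc_to_kernel() (so the
-EIO path is not modelled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum fake_type { FAKE_MAPPED, FAKE_HOLE, FAKE_UNWRITTEN };

/* Hypothetical stand-in for the iomap/srcmap pair of one iteration. */
struct fake_extent {
	bool shared;              /* models IOMAP_F_SHARED on iomap */
	enum fake_type src_type;  /* models srcmap->type */
	const char *src;          /* models saddr */
	char *dst;                /* models daddr */
};

/* Mirrors the return values of dax_unshare_iter() as posted. */
static long unshare_one(const struct fake_extent *e, size_t length)
{
	/* Not shared to start with: nothing to do. */
	if (!e->shared)
		return (long)length;
	/* Holes and unwritten extents carry no data to copy. */
	if (e->src_type == FAKE_HOLE || e->src_type == FAKE_UNWRITTEN)
		return (long)length;

	memcpy(e->dst, e->src, length);
	/* In the patch, ret holds the copy routine's result here: 0. */
	return 0;
}
```

Note that, as posted, the two skip paths return length while the
successful copy path returns 0, because ret is reused for the result of
copy_mc_to_kernel().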



Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-28 Thread Darrick J. Wong
On Mon, Nov 28, 2022 at 10:16:23AM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/11/28 2:38, Darrick J. Wong wrote:
> > On Thu, Nov 24, 2022 at 02:54:52PM +, Shiyang Ruan wrote:
> > > > Many testcases failed in dax+reflink mode with warning messages in dmesg.
> > > > This also affects dax+noreflink mode if we run the test after a
> > > > dax+reflink test.  So, the most urgent thing is solving the warning
> > > > messages.
> > > 
> > > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > > previously considered (srcmap is HOLE or UNWRITTEN).
> > > Patch 2 adds the implementation of unshare for fsdax.
> > > 
> > > > With these fixes, most warning messages in dax_associate_entry() are
> > > > gone.  But honestly, generic/388 will still randomly fail with the warning.
> > > > The case shuts down the xfs while fsstress is running, and does so many
> > > > times.  I think the reason is that dax pages in use cannot be
> > > > invalidated in time when the fs is shut down.  The next time a dax page is
> > > > associated, it still holds the mapping value set last time.  I'll keep
> > > > working on it.
> > > 
> > > The warning message in dax_writeback_one() can also be fixed because of
> > > the dax unshare.
> > 
> > This cuts down the amount of test failures quite a bit, but I think
> > you're still missing a piece or two -- namely the part that refuses to
> > enable S_DAX mode on a reflinked file when the inode is being loaded
> > from disk.  However, thank you for fixing dax.c, because that was the
> > part I couldn't figure out at all. :)
> 
> I didn't include it[1] in this patchset...
> 
> [1] 
> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.f...@fujitsu.com/

Oh, ok.  I'll pull that one in.  All the remaining test failures seem to
be related to inode flag states or tests that trip over the lack of
delalloc on dax+reflink files.

--D

> 
> --
> Thanks,
> Ruan.
> 
> > 
> > --D
> > 
> > > 
> > > Shiyang Ruan (2):
> > >fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
> > >fsdax,xfs: port unshare to fsdax
> > > 
> > >   fs/dax.c | 166 ++-
> > >   fs/xfs/xfs_iomap.c   |   6 +-
> > >   fs/xfs/xfs_reflink.c |   8 ++-
> > >   include/linux/dax.h  |   2 +
> > >   4 files changed, 129 insertions(+), 53 deletions(-)
> > > 
> > > -- 
> > > 2.38.1
> > > 



Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-27 Thread Darrick J. Wong
On Thu, Nov 24, 2022 at 02:54:52PM +, Shiyang Ruan wrote:
> Many testcases failed in dax+reflink mode with warning messages in dmesg.
> This also affects dax+noreflink mode if we run the test after a
> dax+reflink test.  So, the most urgent thing is solving the warning
> messages.
> 
> Patch 1 fixes some mistakes and adds handling of CoW cases not
> previously considered (srcmap is HOLE or UNWRITTEN).
> Patch 2 adds the implementation of unshare for fsdax.
> 
> With these fixes, most warning messages in dax_associate_entry() are
> gone.  But honestly, generic/388 will still randomly fail with the warning.
> The case shuts down the xfs while fsstress is running, and does so many
> times.  I think the reason is that dax pages in use cannot be
> invalidated in time when the fs is shut down.  The next time a dax page is
> associated, it still holds the mapping value set last time.  I'll keep
> working on it.
> 
> The warning message in dax_writeback_one() can also be fixed because of
> the dax unshare.

This cuts down the amount of test failures quite a bit, but I think
you're still missing a piece or two -- namely the part that refuses to
enable S_DAX mode on a reflinked file when the inode is being loaded
from disk.  However, thank you for fixing dax.c, because that was the
part I couldn't figure out at all. :)

--D

> 
> Shiyang Ruan (2):
>   fsdax,xfs: fix warning messages at dax_[dis]associate_entry()
>   fsdax,xfs: port unshare to fsdax
> 
>  fs/dax.c | 166 ++-
>  fs/xfs/xfs_iomap.c   |   6 +-
>  fs/xfs/xfs_reflink.c |   8 ++-
>  include/linux/dax.h  |   2 +
>  4 files changed, 129 insertions(+), 53 deletions(-)
> 
> -- 
> 2.38.1
> 



Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-10-19 Thread Darrick J. Wong
On Sun, Oct 16, 2022 at 10:05:17PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/10/14 23:50, Darrick J. Wong wrote:
> > On Fri, Oct 14, 2022 at 10:24:29AM +0800, Shiyang Ruan wrote:
> > > 
> > > 
> > > On 2022/10/14 2:30, Darrick J. Wong wrote:
> > > > On Thu, Sep 29, 2022 at 12:05:14PM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:
> > > > > > 
> > > ...
> > > > > > > 
> > > > > > > > FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 
> > > > > > > > 6.0-rc5,
> > > > > > > > and I haven't even turned on reflink yet:
> > > > > > > > 
> > > > > > > > run fstests xfs/517 at 2022-09-26 19:53:34
> > > > > > > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. 
> > > > > > > > Use at your own risk!
> > > > > > > > XFS (pmem1): Mounting V5 Filesystem
> > > > > > > > XFS (pmem1): Ending clean mount
> > > > > > > > XFS (pmem1): Quotacheck needed: Please wait.
> > > > > > > > XFS (pmem1): Quotacheck: Done.
> > > > > > > > XFS (pmem1): Unmounting Filesystem
> > > > > > > > XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at 
> > > > > > > > your own risk!
> > > > > > > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. 
> > > > > > > > Use at your own risk!
> > > > > > > > XFS (pmem1): Mounting V5 Filesystem
> > > > > > > > XFS (pmem1): Ending clean mount
> > > > > > > > XFS (pmem1): Quotacheck needed: Please wait.
> > > > > > > > XFS (pmem1): Quotacheck: Done.
> > > > > > > > [ cut here ]
> > > > > > > > WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 
> > > > > > > > dax_insert_entry+0x22d/0x320
> > > > 
> > > > Ping?
> > > > 
> > > > This time around I replaced the WARN_ON with this:
> > > > 
> > > > if (page->mapping)
> > > > printk(KERN_ERR "%s:%d ino 0x%lx index 0x%lx page 
> > > > 0x%llx mapping 0x%llx <- 0x%llx\n", __func__, __LINE__, 
> > > > mapping->host->i_ino, index + i, (unsigned long long)page, (unsigned 
> > > > long long)page->mapping, (unsigned long long)mapping);
> > > > 
> > > > and promptly started seeing scary things like this:
> > > > 
> > > > [   37.576598] dax_associate_entry:381 ino 0x1807870 index 0x370 page 
> > > > 0xea00133f1480 mapping 0x1 <- 0x888042fbb528
> > > > [   37.577570] dax_associate_entry:381 ino 0x1807870 index 0x371 page 
> > > > 0xea00133f1500 mapping 0x1 <- 0x888042fbb528
> > > > [   37.698657] dax_associate_entry:381 ino 0x180044a index 0x5f8 page 
> > > > 0xea0013244900 mapping 0x888042eaf128 <- 0x888042dda128
> > > > [   37.699349] dax_associate_entry:381 ino 0x800808 index 0x136 page 
> > > > 0xea0013245640 mapping 0x888042eaf128 <- 0x888042d3ce28
> > > > [   37.699680] dax_associate_entry:381 ino 0x180044a index 0x5f9 page 
> > > > 0xea0013245680 mapping 0x888042eaf128 <- 0x888042dda128
> > > > [   37.700684] dax_associate_entry:381 ino 0x800808 index 0x137 page 
> > > > 0xea00132456c0 mapping 0x888042eaf128 <- 0x888042d3ce28
> > > > [   37.701611] dax_associate_entry:381 ino 0x180044a index 0x5fa page 
> > > > 0xea0013245700 mapping 0x888042eaf128 <- 0x888042dda128
> > > > [   37.764126] dax_associate_entry:381 ino 0x103c52c index 0x28a page 
> > > > 0xea001345afc0 mapping 0x1 <- 0x888019c14928
> > > > [   37.765078] dax_associate_entry:381 ino 0x103c52c index 0x28b page 
> > > > 0xea001345b000 mapping 0x1 <- 0x888019c14928
> > > > [   39.193523] dax_associate_entry:381 ino 0x184657f index 0x124 page 
> > > > 0xea000e2a4440 mapping 0x8880120d7628 <- 0x888019ca3528
> > > > [   39.194692] dax_associate_entry:381 ino 0x184657f index 0x125 page 
> > > > 0xea000e2a4480 mapping 0x8880120d7628 <- 0x888019ca3528
> > > > [   39.195716] dax_associate_entry:381 ino 0x184657f

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-10-14 Thread Darrick J. Wong
On Fri, Oct 14, 2022 at 10:24:29AM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/10/14 2:30, Darrick J. Wong wrote:
> > On Thu, Sep 29, 2022 at 12:05:14PM -0700, Darrick J. Wong wrote:
> > > On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:
> > > > 
> ...
> > > > > 
> > > > > > FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 
> > > > > > 6.0-rc5,
> > > > > > and I haven't even turned on reflink yet:
> > > > > > 
> > > > > > run fstests xfs/517 at 2022-09-26 19:53:34
> > > > > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use 
> > > > > > at your own risk!
> > > > > > XFS (pmem1): Mounting V5 Filesystem
> > > > > > XFS (pmem1): Ending clean mount
> > > > > > XFS (pmem1): Quotacheck needed: Please wait.
> > > > > > XFS (pmem1): Quotacheck: Done.
> > > > > > XFS (pmem1): Unmounting Filesystem
> > > > > > XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your 
> > > > > > own risk!
> > > > > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use 
> > > > > > at your own risk!
> > > > > > XFS (pmem1): Mounting V5 Filesystem
> > > > > > XFS (pmem1): Ending clean mount
> > > > > > XFS (pmem1): Quotacheck needed: Please wait.
> > > > > > XFS (pmem1): Quotacheck: Done.
> > > > > > [ cut here ]
> > > > > > WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 
> > > > > > dax_insert_entry+0x22d/0x320
> > 
> > Ping?
> > 
> > This time around I replaced the WARN_ON with this:
> > 
> > if (page->mapping)
> > printk(KERN_ERR "%s:%d ino 0x%lx index 0x%lx page 0x%llx 
> > mapping 0x%llx <- 0x%llx\n", __func__, __LINE__, mapping->host->i_ino, 
> > index + i, (unsigned long long)page, (unsigned long long)page->mapping, 
> > (unsigned long long)mapping);
> > 
> > and promptly started seeing scary things like this:
> > 
> > [   37.576598] dax_associate_entry:381 ino 0x1807870 index 0x370 page 
> > 0xea00133f1480 mapping 0x1 <- 0x888042fbb528
> > [   37.577570] dax_associate_entry:381 ino 0x1807870 index 0x371 page 
> > 0xea00133f1500 mapping 0x1 <- 0x888042fbb528
> > [   37.698657] dax_associate_entry:381 ino 0x180044a index 0x5f8 page 
> > 0xea0013244900 mapping 0x888042eaf128 <- 0x888042dda128
> > [   37.699349] dax_associate_entry:381 ino 0x800808 index 0x136 page 
> > 0xea0013245640 mapping 0x888042eaf128 <- 0x888042d3ce28
> > [   37.699680] dax_associate_entry:381 ino 0x180044a index 0x5f9 page 
> > 0xea0013245680 mapping 0x888042eaf128 <- 0x888042dda128
> > [   37.700684] dax_associate_entry:381 ino 0x800808 index 0x137 page 
> > 0xea00132456c0 mapping 0x888042eaf128 <- 0x888042d3ce28
> > [   37.701611] dax_associate_entry:381 ino 0x180044a index 0x5fa page 
> > 0xea0013245700 mapping 0x888042eaf128 <- 0x888042dda128
> > [   37.764126] dax_associate_entry:381 ino 0x103c52c index 0x28a page 
> > 0xea001345afc0 mapping 0x1 <- 0x888019c14928
> > [   37.765078] dax_associate_entry:381 ino 0x103c52c index 0x28b page 
> > 0xea001345b000 mapping 0x1 <- 0x888019c14928
> > [   39.193523] dax_associate_entry:381 ino 0x184657f index 0x124 page 
> > 0xea000e2a4440 mapping 0x8880120d7628 <- 0x888019ca3528
> > [   39.194692] dax_associate_entry:381 ino 0x184657f index 0x125 page 
> > 0xea000e2a4480 mapping 0x8880120d7628 <- 0x888019ca3528
> > [   39.195716] dax_associate_entry:381 ino 0x184657f index 0x126 page 
> > 0xea000e2a44c0 mapping 0x8880120d7628 <- 0x888019ca3528
> > [   39.196736] dax_associate_entry:381 ino 0x184657f index 0x127 page 
> > 0xea000e2a4500 mapping 0x8880120d7628 <- 0x888019ca3528
> > [   39.197906] dax_associate_entry:381 ino 0x184657f index 0x128 page 
> > 0xea000e2a5040 mapping 0x8880120d7628 <- 0x888019ca3528
> > [   39.198924] dax_associate_entry:381 ino 0x184657f index 0x129 page 
> > 0xea000e2a5080 mapping 0x8880120d7628 <- 0x888019ca3528
> > [   39.247053] dax_associate_entry:381 ino 0x5dd1e index 0x2d page 
> > 0xea0015a0e640 mapping 0x1 <- 0x88804af88828
> > [   39.248006] dax_associate_entry:381 ino 0x5dd1e index 0x2e page 
>

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-10-13 Thread Darrick J. Wong
On Thu, Sep 29, 2022 at 12:05:14PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:
> > 
> > 
> > On 2022/9/28 7:51, Dave Chinner wrote:
> > > On Tue, Sep 27, 2022 at 09:02:48AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Sep 27, 2022 at 02:53:14PM +0800, Shiyang Ruan wrote:
> > ...
> > > > > 
> > > > > > I have tested these two modes many times:
> > > > > > 
> > > > > > xfs_dax mode did fail many cases.  (If you tested with this "drop"
> > > > > > patch, some warnings around "dax_dedupe_file_range_compare()" won't
> > > > > > occur any more.)  I think the warning around "dax_disassociate_entry()"
> > > > > > is a concurrency problem.  Still looking into it.
> > > > > > 
> > > > > > But xfs_dax_noreflink didn't have so many failures, just 3 in my
> > > > > > environment:
> > > > > > Failures: generic/471 generic/519 xfs/148.  I wonder whether you
> > > > > > forgot to reformat the TEST_DEV to be non-reflink before running the
> > > > > > test?  If so, that would make sense.
> > > 
> > > No, I did not forget to turn off reflink for the test device:
> > > 
> > > # ./run_check.sh --mkfs-opts "-m reflink=0,rmapbt=1" --run-opts "-s 
> > > xfs_dax_noreflink -g auto"
> > > umount: /mnt/test: not mounted.
> > > umount: /mnt/scratch: not mounted.
> > > wrote 8589934592/8589934592 bytes at offset 0
> > > 8.000 GiB, 8192 ops; 0:00:03.99 (2.001 GiB/sec and 2049.0850 ops/sec)
> > > wrote 8589934592/8589934592 bytes at offset 0
> > > 8.000 GiB, 8192 ops; 0:00:04.13 (1.936 GiB/sec and 1982.5453 ops/sec)
> > > > meta-data=/dev/pmem0             isize=512    agcount=4, agsize=524288 blks
> > > >          =                       sectsz=4096  attr=2, projid32bit=1
> > > >          =                       crc=1        finobt=1, sparse=1, rmapbt=1
> > > >          =                       reflink=0    bigtime=1 inobtcount=1 nrext64=0
> > > > data     =                       bsize=4096   blocks=2097152, imaxpct=25
> > > >          =                       sunit=0      swidth=0 blks
> > > > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > > > log      =internal log           bsize=4096   blocks=16384, version=2
> > > >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > > .
> > > Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_dax_noreflink -g auto
> > > SECTION   -- xfs_dax_noreflink
> > > FSTYP -- xfs (debug)
> > > PLATFORM  -- Linux/x86_64 test3 6.0.0-rc6-dgc+ #1543 SMP 
> > > PREEMPT_DYNAMIC Mon Sep 19 07:46:37 AEST 2022
> > > MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 
> > > /dev/pmem1 /mnt/scratch
> > > 
> > > So, yeah, reflink was turned off on both test and scratch devices,
> > > and dax=always on both the test and scratch devices was used to
> > > ensure that DAX was always in use.
> > > 
> > > 
> > > > FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 6.0-rc5,
> > > > and I haven't even turned on reflink yet:
> > > > 
> > > > run fstests xfs/517 at 2022-09-26 19:53:34
> > > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at 
> > > > your own risk!
> > > > XFS (pmem1): Mounting V5 Filesystem
> > > > XFS (pmem1): Ending clean mount
> > > > XFS (pmem1): Quotacheck needed: Please wait.
> > > > XFS (pmem1): Quotacheck: Done.
> > > > XFS (pmem1): Unmounting Filesystem
> > > > XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your own 
> > > > risk!
> > > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at 
> > > > your own risk!
> > > > XFS (pmem1): Mounting V5 Filesystem
> > > > XFS (pmem1): Ending clean mount
> > > > XFS (pmem1): Quotacheck needed: Please wait.
> > > > XFS (pmem1): Quotacheck: Done.
> > > > [ cut here ]
> > > > WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 dax_insert_entry+0x22d/0x3

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-29 Thread Darrick J. Wong
On Wed, Sep 28, 2022 at 10:46:17PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/9/28 7:51, Dave Chinner wrote:
> > On Tue, Sep 27, 2022 at 09:02:48AM -0700, Darrick J. Wong wrote:
> > > On Tue, Sep 27, 2022 at 02:53:14PM +0800, Shiyang Ruan wrote:
> ...
> > > > 
> > > > > I have tested these two modes many times:
> > > > > 
> > > > > xfs_dax mode did fail many cases.  (If you tested with this "drop"
> > > > > patch, some warnings around "dax_dedupe_file_range_compare()" won't
> > > > > occur any more.)  I think the warning around "dax_disassociate_entry()"
> > > > > is a concurrency problem.  Still looking into it.
> > > > > 
> > > > > But xfs_dax_noreflink didn't have so many failures, just 3 in my
> > > > > environment:
> > > > > Failures: generic/471 generic/519 xfs/148.  I wonder whether you
> > > > > forgot to reformat the TEST_DEV to be non-reflink before running the
> > > > > test?  If so, that would make sense.
> > 
> > No, I did not forget to turn off reflink for the test device:
> > 
> > # ./run_check.sh --mkfs-opts "-m reflink=0,rmapbt=1" --run-opts "-s 
> > xfs_dax_noreflink -g auto"
> > umount: /mnt/test: not mounted.
> > umount: /mnt/scratch: not mounted.
> > wrote 8589934592/8589934592 bytes at offset 0
> > 8.000 GiB, 8192 ops; 0:00:03.99 (2.001 GiB/sec and 2049.0850 ops/sec)
> > wrote 8589934592/8589934592 bytes at offset 0
> > 8.000 GiB, 8192 ops; 0:00:04.13 (1.936 GiB/sec and 1982.5453 ops/sec)
> > > meta-data=/dev/pmem0             isize=512    agcount=4, agsize=524288 blks
> > >          =                       sectsz=4096  attr=2, projid32bit=1
> > >          =                       crc=1        finobt=1, sparse=1, rmapbt=1
> > >          =                       reflink=0    bigtime=1 inobtcount=1 nrext64=0
> > > data     =                       bsize=4096   blocks=2097152, imaxpct=25
> > >          =                       sunit=0      swidth=0 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > > log      =internal log           bsize=4096   blocks=16384, version=2
> > >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > .
> > Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_dax_noreflink -g auto
> > SECTION   -- xfs_dax_noreflink
> > FSTYP -- xfs (debug)
> > PLATFORM  -- Linux/x86_64 test3 6.0.0-rc6-dgc+ #1543 SMP 
> > PREEMPT_DYNAMIC Mon Sep 19 07:46:37 AEST 2022
> > MKFS_OPTIONS  -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 
> > /dev/pmem1 /mnt/scratch
> > 
> > So, yeah, reflink was turned off on both test and scratch devices,
> > and dax=always on both the test and scratch devices was used to
> > ensure that DAX was always in use.
> > 
> > 
> > > FWIW I saw dmesg failures in xfs/517 and xfs/013 starting with 6.0-rc5,
> > > and I haven't even turned on reflink yet:
> > > 
> > > run fstests xfs/517 at 2022-09-26 19:53:34
> > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your 
> > > own risk!
> > > XFS (pmem1): Mounting V5 Filesystem
> > > XFS (pmem1): Ending clean mount
> > > XFS (pmem1): Quotacheck needed: Please wait.
> > > XFS (pmem1): Quotacheck: Done.
> > > XFS (pmem1): Unmounting Filesystem
> > > XFS (pmem0): EXPERIMENTAL online scrub feature in use. Use at your own 
> > > risk!
> > > XFS (pmem1): EXPERIMENTAL Large extent counts feature in use. Use at your 
> > > own risk!
> > > XFS (pmem1): Mounting V5 Filesystem
> > > XFS (pmem1): Ending clean mount
> > > XFS (pmem1): Quotacheck needed: Please wait.
> > > XFS (pmem1): Quotacheck: Done.
> > > [ cut here ]
> > > WARNING: CPU: 1 PID: 415317 at fs/dax.c:380 dax_insert_entry+0x22d/0x320
> > > Modules linked in: xfs nft_chain_nat xt_REDIRECT nf_nat nf_conntrack 
> > > nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT 
> > > nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat 
> > > ip_set_hash_mac ip_set nf_tables libcrc32c bfq nfnetlink pvpanic_mmio 
> > > pvpanic nd_pmem dax_pmem nd_btt sch_fq_codel fuse configfs ip_tables 
> > > x_tables overlay nfsv4 af_packet [last unloaded: scsi_d
> > >

Re: [RFC PATCH] xfs: drop experimental warning for fsdax

2022-09-27 Thread Darrick J. Wong
On Tue, Sep 27, 2022 at 02:53:14PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/9/20 5:15, Dave Chinner wrote:
> > On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> > > On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:
> > > > Since reflink and fsdax can work together now, the last obstacle has been
> > > > resolved.  It's time to remove restrictions and drop this warning.
> > > > 
> > > > Signed-off-by: Shiyang Ruan 
> > > 
> > > I haven't looked at reflink+DAX for some time, and I haven't tested
> > > it for even longer. So I'm currently running a v6.0-rc6 kernel with
> > > "-o dax=always" fstests run with reflink enabled and it's not
> > > looking very promising.
> > > 
> > > All of the fsx tests are failing with data corruption, several
> > > reflink/clone tests are failing with -EINVAL (e.g. g/16[45]) and
> > > *lots* of tests are leaving stack traces from WARN() conditions in
> > > DAX operations such as dax_insert_entry(), dax_disassociate_entry(),
> > > dax_writeback_mapping_range(), iomap_iter() (called from
> > > dax_dedupe_file_range_compare()), and so on.
> > > 
> > > At this point - the tests are still running - I'd guess that there's
> > > going to be at least 50 test failures by the time it completes -
> > > in comparison using "-o dax=never" results in just a single test
> > > failure and a lot more tests actually being run.
> > 
> > The end results with dax+reflink were:
> > 
> > SECTION   -- xfs_dax
> > =
> > 
> > Failures: generic/051 generic/068 generic/074 generic/075
> > generic/083 generic/091 generic/112 generic/127 generic/164
> > generic/165 generic/175 generic/231 generic/232 generic/247
> > generic/269 generic/270 generic/327 generic/340 generic/388
> > generic/390 generic/413 generic/447 generic/461 generic/471
> > generic/476 generic/517 generic/519 generic/560 generic/561
> > generic/605 generic/617 generic/619 generic/630 generic/649
> > generic/650 generic/656 generic/670 generic/672 xfs/011 xfs/013
> > xfs/017 xfs/068 xfs/073 xfs/104 xfs/127 xfs/137 xfs/141 xfs/158
> > xfs/168 xfs/179 xfs/243 xfs/297 xfs/305 xfs/328 xfs/440 xfs/442
> > xfs/517 xfs/535 xfs/538 xfs/551 xfs/552
> > Failed 61 of 1071 tests
> > 
> > Ok, so I did a new no-reflink run as a baseline, because it is a
> > while since I've tested DAX at all:
> > 
> > SECTION   -- xfs_dax_noreflink
> > =
> > Failures: generic/051 generic/068 generic/074 generic/075
> > generic/083 generic/112 generic/231 generic/232 generic/269
> > generic/270 generic/340 generic/388 generic/461 generic/471
> > generic/476 generic/519 generic/560 generic/561 generic/617
> > generic/650 generic/656 xfs/011 xfs/013 xfs/017 xfs/073 xfs/297
> > xfs/305 xfs/517 xfs/538
> > Failed 29 of 1071 tests
> > 
> > Yeah, there's still lots of warnings from dax_insert_entry() and
> > friends like:
> > 
> > [43262.025815] WARNING: CPU: 9 PID: 1309428 at fs/dax.c:380 
> > dax_insert_entry+0x2ab/0x320
> > [43262.028355] Modules linked in:
> > [43262.029386] CPU: 9 PID: 1309428 Comm: fsstress Tainted: G W  
> > 6.0.0-rc6-dgc+ #1543
> > [43262.032168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.15.0-1 04/01/2014
> > [43262.034840] RIP: 0010:dax_insert_entry+0x2ab/0x320
> > [43262.036358] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 
> > 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff 
> > <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 84 b1 5a 00 eb a4 48 81 e6
> > [43262.042255] RSP: 0018:c9000a0cbb78 EFLAGS: 00010002
> > [43262.043946] RAX: ea0018cd1fc0 RBX: 0001 RCX: 
> > 0001
> > [43262.046233] RDX: ea00 RSI: 0221 RDI: 
> > ea0018cd2000
> > [43262.048518] RBP: 0011 R08:  R09: 
> > 
> > [43262.050762] R10: 888241a6d318 R11: 0001 R12: 
> > c9000a0cbc58
> > [43262.053020] R13: 888241a6d318 R14: c9000a0cbe20 R15: 
> > 
> > [43262.055309] FS:  7f8ce25e2b80() GS:8885fec8() 
> > knlGS:
> > [43262.057859] CS:  0010 DS:  ES:  CR0: 80050033
> > [43262.059713] CR2: 7f8ce25e1000 CR3: 000152141001 CR4: 
> > 00060ee0
> > [43262.061993] Call Trace:
> > [43262.062836]  
> > [43262.063557]  dax_fault_iter+0x243/0x600
> > [43262.064802]  dax_iomap_pte_fault+0x199/0x360
> > [43262.066197]  __xfs_filemap_fault+0x1e3/0x2c0
> > [43262.067602]  __do_fault+0x31/0x1d0
> > [43262.068719]  __handle_mm_fault+0xd6d/0x1650
> > [43262.070083]  ? do_mmap+0x348/0x540
> > [43262.071200]  handle_mm_fault+0x7a/0x1d0
> > [43262.072449]  ? __kvm_handle_async_pf+0x12/0xb0
> > [43262.073908]  exc_page_fault+0x1d9/0x810
> > [43262.075123]  asm_exc_page_fault+0x22/0x30
> > [43262.076413] RIP: 0033:0x7f8ce268bc23
> > 
> > So it looks to me like DAX is well and truly broken in 6.0-rc6. And,
> > yes, I'm running the fixes in mm-hotfixes-stable branch that 

Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-20 Thread Darrick J. Wong
On Tue, Sep 20, 2022 at 12:45:19PM +1000, Dave Chinner wrote:
> On Fri, Sep 02, 2022 at 10:36:01AM +, Shiyang Ruan wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask the filesystem
> > (or mapped device) on it to unmap all files in use and notify the processes
> > that are using those files.
> > 
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >-> devres_release_all()   # was pmem driver ->remove() in v1
> > -> kill_dax()
> >  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >   -> xfs_dax_notify_failure()
> > 
> > Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> > event, so that it does not shut down directly if something is not
> > supported or if the failure range includes a metadata area.  Make sure all
> > files and processes are handled correctly.
> > 
> > [1]: 
> > https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> > 
> > Signed-off-by: Shiyang Ruan 
> > ---
> >  drivers/dax/super.c |  3 ++-
> >  fs/xfs/xfs_notify_failure.c | 23 +++
> >  include/linux/mm.h  |  1 +
> >  3 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 9b5e2a5eb0ae..cf9a64563fbe 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > return;
> >  
> > if (dax_dev->holder_data != NULL)
> > -   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > +   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > +   MF_MEM_PRE_REMOVE);
> >  
> > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 3830f908e215..5e04ba7fa403 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> >  
> >  #include 
> >  #include 
> > +#include 
> >  
> >  struct xfs_failure_info {
> > xfs_agblock_t   startblock;
> > @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
> >  
> > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > +   /* The device is about to be removed.  Not a really failure. */
> > +   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > +   return 0;
> > notify->want_shutdown = true;
> > return 0;
> > }
> > @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
> > > struct xfs_mount *mp = dax_holder(dax_dev);
> > u64 ddev_start;
> > u64 ddev_end;
> > +   int error;
> >  
> > if (!(mp->m_super->s_flags & SB_BORN)) {
> > xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> > return -EIO;
> > }
> >  
> > +   if (mf_flags & MF_MEM_PRE_REMOVE) {
> > +   xfs_info(mp, "device is about to be removed!");
> > > +   down_write(&mp->m_super->s_umount);
> > > +   error = sync_filesystem(mp->m_super);
> > > +   drop_pagecache_sb(mp->m_super, NULL);
> > > +   up_write(&mp->m_super->s_umount);
> > +   if (error)
> > +   return error;
> 
> If the device is about to go away unexpectedly, shouldn't this shut
> down the filesystem after syncing it here?  If the filesystem has
> been shut down, then everything will fail before removal finally
> triggers, and the act of unmounting the filesystem post device
> removal will clean up the page cache and all the other caches.

IIRC they want to kill all the processes with MAP_SYNC mappings sooner
than whenever the admin gets around to unmounting the filesystem, which
is why PRE_REMOVE will then go walk the rmapbt to find processes to
shoot down.  I'm not sure, though, if drop_pagecache_sb only touches
DRAM page cache or if it'll shoot down fsdax mappings too?

> IOWs, I don't understand why the page cache is considered special
> here (as opposed to, say, the inode or dentry caches), nor why we
> aren't shutting down the filesystem directly after syncing it to
> disk to ensure that we don't end up with applications losing data as
> a result of racing with the removal

But yeah, we might as well shut down the fs at the end of PRE_REMOVE
handling, if the rmap walk hasn't already done that.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com



Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-16 Thread Darrick J. Wong
On Thu, Sep 15, 2022 at 10:56:09AM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/9/15 2:15, Darrick J. Wong wrote:
> > On Wed, Sep 14, 2022 at 11:09:23AM -0700, Darrick J. Wong wrote:
> > > On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:
> > > > ping
> > > > 
> > > > On 2022/9/2 18:35, Shiyang Ruan wrote:
> > > > > Changes since v7:
> > > > > 1. Add P1 to fix calculation mistake
> > > > > 2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
> > > > > 3. P3: Add invalidate all mappings after sync.
> > > > > 4. P3: Set offset to be start of device when it is to 
> > > > > be removed.
> > > > > 5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].
> > > > > 
> > > > > Changes since v6:
> > > > > 1. Rebase on 6.0-rc2 and Darrick's patch[1].
> > > > > 
> > > > > [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> > > > > [2]: 
> > > > > https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/
> > > 
> > > Just out of curiosity, is it your (or djbw's) intent to send all these
> > > as bugfixes for 6.0 via akpm like all the other dax fixen?
> > 
> > Aha, this is 6.1 stuff, please ignore this question.
> 
> Actually I hope these patches can be merged ASAP. (But it seems a bit late
> for 6.0 now.)
> 
> And do you know which/whose branch has picked up your patch[1]?  I cannot
> find it.

It's not upstream, though the maintainer (Dave currently) reviewed it.
I don't know if he hasn't had time to put together a fixes branch or if
he's simply punting all the queued up stuff to 6.1.

(Dave?)

--D

> 
> --
> Thanks,
> Ruan.
> 
> > 
> > --D
> > 
> > > --D
> > > 
> > > > > 
> > > > > Shiyang Ruan (3):
> > > > > xfs: fix the calculation of length and end
> > > > > fs: move drop_pagecache_sb() for others to use
> > > > > mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> > > > > 
> > > > >drivers/dax/super.c |  3 ++-
> > > > >fs/drop_caches.c| 33 -
> > > > >fs/super.c  | 34 ++
> > > > >fs/xfs/xfs_notify_failure.c | 31 +++
> > > > >include/linux/fs.h  |  1 +
> > > > >include/linux/mm.h  |  1 +
> > > > >6 files changed, 65 insertions(+), 38 deletions(-)
> > > > > 



Re: [PATCH 2/3] fs: move drop_pagecache_sb() for others to use

2022-09-14 Thread Darrick J. Wong
On Fri, Sep 02, 2022 at 10:36:00AM +, Shiyang Ruan wrote:
> xfs_notify_failure requires a method to invalidate all mappings.
> drop_pagecache_sb() can do this but it is a static function and only
> build with CONFIG_SYSCTL.  Now, move it to super.c and make it available
> for others.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/drop_caches.c   | 33 -
>  fs/super.c | 34 ++
>  include/linux/fs.h |  1 +
>  3 files changed, 35 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index e619c31b6bd9..5c8406076f9b 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -3,7 +3,6 @@
>   * Implement the manual drop-all-pagecache function
>   */
>  
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -15,38 +14,6 @@
>  /* A global variable is a bit ugly, but it keeps the code simple */
>  int sysctl_drop_caches;
>  
> -static void drop_pagecache_sb(struct super_block *sb, void *unused)
> -{
> - struct inode *inode, *toput_inode = NULL;
> -
> - spin_lock(&sb->s_inode_list_lock);
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> - spin_lock(&inode->i_lock);
> - /*
> -  * We must skip inodes in unusual state. We may also skip
> -  * inodes without pages but we deliberately won't in case
> -  * we need to reschedule to avoid softlockups.
> -  */
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping_empty(inode->i_mapping) && !need_resched())) {
> - spin_unlock(&inode->i_lock);
> - continue;
> - }
> - __iget(inode);
> - spin_unlock(&inode->i_lock);
> - spin_unlock(&sb->s_inode_list_lock);
> -
> - invalidate_mapping_pages(inode->i_mapping, 0, -1);
> - iput(toput_inode);
> - toput_inode = inode;
> -
> - cond_resched();
> - spin_lock(&sb->s_inode_list_lock);
> - }
> - spin_unlock(&sb->s_inode_list_lock);
> - iput(toput_inode);
> -}
> -
>  int drop_caches_sysctl_handler(struct ctl_table *table, int write,
>   void *buffer, size_t *length, loff_t *ppos)
>  {
> diff --git a/fs/super.c b/fs/super.c
> index 734ed584a946..bdf53dbe834c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "internal.h"
>  
> @@ -677,6 +678,39 @@ void drop_super_exclusive(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(drop_super_exclusive);
>  
> +void drop_pagecache_sb(struct super_block *sb, void *unused)
> +{
> + struct inode *inode, *toput_inode = NULL;
> +
> + spin_lock(&sb->s_inode_list_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + spin_lock(&inode->i_lock);
> + /*
> +  * We must skip inodes in unusual state. We may also skip
> +  * inodes without pages but we deliberately won't in case
> +  * we need to reschedule to avoid softlockups.
> +  */
> + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> + (mapping_empty(inode->i_mapping) && !need_resched())) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + __iget(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&sb->s_inode_list_lock);
> +
> + invalidate_mapping_pages(inode->i_mapping, 0, -1);
> + iput(toput_inode);
> + toput_inode = inode;
> +
> + cond_resched();
> + spin_lock(&sb->s_inode_list_lock);
> + }
> + spin_unlock(&sb->s_inode_list_lock);
> + iput(toput_inode);
> +}
> +EXPORT_SYMBOL(drop_pagecache_sb);

You might want to rename this "super_drop_pagecache" to fit with the
other functions that all have "super" in the name somewhere.

--D

> +
>  static void __iterate_supers(void (*f)(struct super_block *))
>  {
>   struct super_block *sb, *p = NULL;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9eced4cc286e..5ded28c0d2c9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3292,6 +3292,7 @@ extern struct super_block *get_super(struct 
> block_device *);
>  extern struct super_block *get_active_super(struct block_device *bdev);
>  extern void drop_super(struct super_block *sb);
>  extern void drop_super_exclusive(struct super_block *sb);
> +void drop_pagecache_sb(struct super_block *sb, void *unused);
>  extern void iterate_supers(void (*)(struct super_block *, void *), void *);
>  extern void iterate_supers_type(struct file_system_type *,
>   void (*)(struct super_block *, void *), void *);
> -- 
> 2.37.2
> 



Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Darrick J. Wong
On Wed, Sep 14, 2022 at 11:09:23AM -0700, Darrick J. Wong wrote:
> On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:
> > ping
> > 
> > On 2022/9/2 18:35, Shiyang Ruan wrote:
> > > Changes since v7:
> > >1. Add P1 to fix calculation mistake
> > >2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
> > >3. P3: Add invalidate all mappings after sync.
> > >4. P3: Set offset to be start of device when it is to be 
> > > removed.
> > >5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].
> > > 
> > > Changes since v6:
> > >1. Rebase on 6.0-rc2 and Darrick's patch[1].
> > > 
> > > [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> > > [2]: 
> > > https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/
> 
> Just out of curiosity, is it your (or djbw's) intent to send all these
> as bugfixes for 6.0 via akpm like all the other dax fixen?

Aha, this is 6.1 stuff, please ignore this question.

--D

> --D
> 
> > > 
> > > Shiyang Ruan (3):
> > >xfs: fix the calculation of length and end
> > >fs: move drop_pagecache_sb() for others to use
> > >mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> > > 
> > >   drivers/dax/super.c |  3 ++-
> > >   fs/drop_caches.c| 33 -
> > >   fs/super.c  | 34 ++
> > >   fs/xfs/xfs_notify_failure.c | 31 +++
> > >   include/linux/fs.h  |  1 +
> > >   include/linux/mm.h  |  1 +
> > >   6 files changed, 65 insertions(+), 38 deletions(-)
> > > 



Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Darrick J. Wong
On Fri, Sep 02, 2022 at 10:36:01AM +, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 23 +++
>  include/linux/mm.h  |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 3830f908e215..5e04ba7fa403 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  struct xfs_failure_info {
>   xfs_agblock_t   startblock;
> @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* The device is about to be removed.  Not a really failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   notify->want_shutdown = true;
>   return 0;
>   }
> @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
>   struct xfs_mount *mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_super->s_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + drop_pagecache_sb(mp->m_super, NULL);
> + up_write(&mp->m_super->s_umount);
> + if (error)
> + return error;
> + }
> +
>   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>   xfs_debug(mp,
>"notify_failure() not supported on realtime device!");
> @@ -196,6 +211,8 @@ xfs_dax_notify_failure(
>  
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> @@ -209,6 +226,12 @@ xfs_dax_notify_failure(
>   ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>   ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> + /* Notify failure on the whole device */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }

I wonder, won't the trimming code below take care of this?

The rest of the patch looks ok to me.
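
For reference, the question above can be checked against a simplified
model.  The sketch below is not the code from the patch: it is a
hypothetical userspace helper (clamp_to_device() and all of its
parameter names are made up for illustration) that intersects an
inclusive byte range with the device range in absolute coordinates.
Under that assumption, a whole-device notification of (0, U64_MAX) is
reduced to exactly the device range without any special case.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): intersect the inclusive byte
 * range [offset, offset + len - 1] with a device that occupies
 * [dev_start, dev_start + dev_len - 1].  Assumes dev_start + dev_len does
 * not overflow.  Returns false if the two ranges do not overlap.
 */
static bool clamp_to_device(uint64_t offset, uint64_t len,
			    uint64_t dev_start, uint64_t dev_len,
			    uint64_t *out_offset, uint64_t *out_len)
{
	uint64_t end, dev_end;

	if (len == 0 || dev_len == 0)
		return false;

	/* Inclusive end, saturating instead of wrapping past UINT64_MAX. */
	end = (len - 1 > UINT64_MAX - offset) ? UINT64_MAX : offset + len - 1;
	dev_end = dev_start + dev_len - 1;

	if (end < dev_start || offset > dev_end)
		return false;

	/* Trim both sides to the device boundaries. */
	if (offset < dev_start)
		offset = dev_start;
	if (end > dev_end)
		end = dev_end;

	*out_offset = offset;
	*out_len = end - offset + 1;
	return true;
}
```

Whether the trimming code in the patch itself behaves the same way for a
nonzero device start is a separate matter that this model does not
settle.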

--D

> +
>   /* Ignore the range out of filesystem area */
>   if (offset + len - 1 < ddev_start)
>   return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..9122a1c57dd2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3183,6 +3183,7 @@ enum mf_flags {
>   MF_UNPOISON = 1 << 4,
>   MF_SW_SIMULATED = 1 << 5,
>   MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t 

Re: [PATCH 1/3] xfs: fix the calculation of length and end

2022-09-14 Thread Darrick J. Wong
On Fri, Sep 02, 2022 at 10:35:59AM +, Shiyang Ruan wrote:
> The end should be start + length - 1.  Also fix the calculation of the
> length when seeking for intersection of notify range and device.
> 
> Signed-off-by: Shiyang Ruan 

Looks correct to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_notify_failure.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index c4078d0ec108..3830f908e215 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
>   int error = 0;
>   xfs_fsblock_t   fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>   xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
> - xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
> + xfs_fsblock_t   end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen 
> - 1);
>   xfs_agnumber_t  end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>  
>   error = xfs_trans_alloc_empty(mp, &tp);
> @@ -210,7 +210,7 @@ xfs_dax_notify_failure(
>   ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
>   /* Ignore the range out of filesystem area */
> - if (offset + len < ddev_start)
> + if (offset + len - 1 < ddev_start)
>   return -ENXIO;
>   if (offset > ddev_end)
>   return -ENXIO;
> @@ -222,8 +222,8 @@ xfs_dax_notify_failure(
>   len -= ddev_start - offset;
>   offset = 0;
>   }
> - if (offset + len > ddev_end)
> - len -= ddev_end - offset;
> + if (offset + len - 1 > ddev_end)
> + len -= offset + len - 1 - ddev_end;
>  
>   return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>   mf_flags);
> -- 
> 2.37.2
> 
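
The off-by-one that this patch addresses ("the end should be start +
length - 1") can be illustrated with a small userspace sketch.  Nothing
below is XFS code: sector_to_block() is a hypothetical round-down
conversion, and the 8-sectors-per-block geometry is an assumption chosen
only to make the numbers easy to follow.

```c
#include <stdint.h>

#define SECTORS_PER_BLOCK 8u	/* assumed: 4096-byte blocks, 512-byte sectors */

/* Hypothetical stand-in for a sector-to-block conversion (rounds down). */
static uint64_t sector_to_block(uint64_t sector)
{
	return sector / SECTORS_PER_BLOCK;
}

/* Block holding the first sector *after* the range (exclusive end). */
static uint64_t end_block_exclusive(uint64_t start, uint64_t count)
{
	return sector_to_block(start + count);
}

/* Block holding the last sector *of* the range; count must be nonzero. */
static uint64_t end_block_inclusive(uint64_t start, uint64_t count)
{
	return sector_to_block(start + count - 1);
}
```

For a range of 8 sectors starting at sector 0, the exclusive form yields
block 1 even though the range lies entirely in block 0; the inclusive
form yields block 0.  The two agree only when the range does not end on
a block boundary.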



Re: [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-09-14 Thread Darrick J. Wong
On Wed, Sep 07, 2022 at 05:46:00PM +0800, Shiyang Ruan wrote:
> ping
> 
> On 2022/9/2 18:35, Shiyang Ruan wrote:
> > Changes since v7:
> >1. Add P1 to fix calculation mistake
> >2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use
> >3. P3: Add invalidate all mappings after sync.
> >4. P3: Set offset to be start of device when it is to be 
> > removed.
> >5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2].
> > 
> > Changes since v6:
> >1. Rebase on 6.0-rc2 and Darrick's patch[1].
> > 
> > [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> > [2]: 
> > https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.st...@dwillia2-xfh.jf.intel.com/

Just out of curiosity, is it your (or djbw's) intent to send all these
as bugfixes for 6.0 via akpm like all the other dax fixen?

--D

> > 
> > Shiyang Ruan (3):
> >xfs: fix the calculation of length and end
> >fs: move drop_pagecache_sb() for others to use
> >mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> > 
> >   drivers/dax/super.c |  3 ++-
> >   fs/drop_caches.c| 33 -
> >   fs/super.c  | 34 ++
> >   fs/xfs/xfs_notify_failure.c | 31 +++
> >   include/linux/fs.h  |  1 +
> >   include/linux/mm.h  |  1 +
> >   6 files changed, 65 insertions(+), 38 deletions(-)
> > 



Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-14 Thread Darrick J. Wong
On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
> On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > > On 2022/9/9 21:01, Brian Foster wrote:
> > > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > > it relied on discard to perform zeroing between checkpoints or some such
> > > > and avoid spurious failures. The purpose of running on dm-thin was
> > > > merely to provide reliable discard zeroing behavior on the target device
> > > > and thus to allow the test to run reliably.
> > > Hi Brian,
> > > 
> > > As far as I know, generic/470 was original designed to verify
> > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > > reason, we need to ensure that all underlying devices under
> > > dm-log-writes device support DAX. However dm-thin device never supports
> > > DAX so
> > > running generic/470 with dm-thin device always returns "not run".
> > > 
> > > Please see the difference between old and new logic:
> > > 
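> > > >        old logic                               new logic
> > > > ---------------------------------------------------------------
> > > > log-writes device(DAX)                 log-writes device(DAX)
> > > >            |                                       |
> > > > PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
> > > >                                         |
> > > >                                    PMEM0(DAX)
> > > > ---------------------------------------------------------------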
> > > We think dm-thin device is not a good solution for generic/470, is there
> > > any other solution to support both discard zero and DAX?
> > 
> > Hi Brian,
> > 
> > I have sent a patch[1] to revert your fix because I think it's not good for
> > generic/470 to use thin volume as my revert patch[1] describes:
> > [1] 
> > https://lore.kernel.org/fstests/20220914090625.32207-1-yangx...@fujitsu.com/T/#u
> > 
> 
> I think the history here is that generic/482 was changed over first in
> commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
> then sometime later we realized generic/455,457,470 had the same general
> flaw and were switched over. The dm/dax compatibility thing was probably
> just an oversight, but I am a little curious about that because it should

It's not an oversight -- it used to work (albeit with EXPERIMENTAL
tags), and now we've broken it on fsdax as the pmem/blockdev divorce
progresses.

> have been obvious that the change caused the test to no longer run. Did
> something change after that to trigger that change in behavior?
> 
> > With the revert, generic/470 can always run successfully on my environment
> > so I wonder how to reproduce the out-of-order replay issue on XFS v5
> > filesystem?
> > 
> 
> I don't quite recall the characteristics of the failures beyond that we
> were seeing spurious test failures with generic/482 that were due to
> essentially putting the fs/log back in time in a way that wasn't quite
> accurate due to the clearing by the logwrites tool not taking place. If
> you wanted to reproduce in order to revisit that, perhaps start with
> generic/482 and let it run in a loop for a while and see if it
> eventually triggers a failure/corruption..?
> 
> > PS: I want to reproduce the issue and try to find a better solution to fix
> > it.
> > 
> 
> It's been a while since I looked at any of this tooling to semi-grok how
> it works.

I /think/ this was the crux of the problem, back in 2019?
https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/

> Perhaps it could learn to rely on something more explicit like
> zero range (instead of discard?) or fall back to manual zeroing?

AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
ought to be adapted to call BLKZEROOUT and (b) in the worst case it
writes zeroes to the entire device, which is/can be slow.

For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
and for cost optimization purposes we're only "paying" for the cheapest
tier.  Weirdly that maps to an upper limit of 6500 write iops and
48MB/s(!) but that would take about 20 minutes to zero the entire
device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
support discard or write-zeroes.
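
A rough sketch of what "call BLKZEROOUT, else fall back to writing
zeroes" could look like in userspace follows.  This is not the
src/log-writes/ code: zero_range() is a hypothetical helper written for
illustration only.  BLKZEROOUT is the block-device ioctl from
<linux/fs.h> that takes a {start, length} pair in bytes; the sketch
assumes the caller passes suitably aligned values when fd is a block
device.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKZEROOUT */

/*
 * Hypothetical helper: zero the byte range [start, start + len) on fd.
 * Try BLKZEROOUT first; if the kernel rejects it (for example because fd
 * is not a block device), fall back to writing zeroes by hand.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int zero_range(int fd, uint64_t start, uint64_t len)
{
	static const char zeroes[65536];
	uint64_t range[2] = { start, len };

	if (ioctl(fd, BLKZEROOUT, range) == 0)
		return 0;

	/* Slow path: this is the case that can take minutes on slow devices. */
	while (len > 0) {
		size_t chunk = len < sizeof(zeroes) ? (size_t)len : sizeof(zeroes);
		ssize_t n = pwrite(fd, zeroes, chunk, (off_t)start);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			errno = EIO;
			return -1;
		}
		start += (uint64_t)n;
		len -= (uint64_t)n;
	}
	return 0;
}
```

The fallback keeps the worst case (writing zeroes to the whole device),
so it does not by itself remove the cost described above; it only lets
the device do the zeroing when it can.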

> If the
> eventual solution is simple and low enough overhead, it might make some
> sense to replace the dmthin hack across the set of tests mentioned
> above.

That said, for a *pmem* test you'd expect it to be faster than that...

--D

> Brian
> 
> > Best Regards,
> > Xiao Yang
> > 
> > > 
> > > BTW, only log-writes, stripe and linear support DAX for now.
> > 
> 



Re: [PATCH v7] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-29 Thread Darrick J. Wong
On Mon, Aug 29, 2022 at 06:02:11PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/8/27 5:35, Dan Williams wrote:
> > Shiyang Ruan wrote:
> > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > (or mapped device) on it to unmap all files in use and notify processes
> > > who are using those files.
> > > 
> > > Call trace:
> > > trigger unbind
> > >-> unbind_store()
> > > -> ... (skip)
> > >  -> devres_release_all()
> > >   -> kill_dax()
> > >-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, 
> > > MF_MEM_PRE_REMOVE)
> > > -> xfs_dax_notify_failure()
> > > 
> > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > event.  So do not shutdown filesystem directly if something not
> > > supported, or if failure range includes metadata area.  Make sure all
> > > files and processes are handled correctly.
> > > 
> > > ==
> > > Changes since v6:
> > > 1. Rebase on 6.0-rc2 and Darrick's patch[2].
> > > 
> > > Changes since v5:
> > > 1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
> > > 2. hold s_umount before sync_filesystem()
> > > 3. do sync_filesystem() after SB_BORN check
> > > 4. Rebased on next-20220714
> > > 
> > > [1]:
> > > https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> > > [2]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> > > 
> > > Signed-off-by: Shiyang Ruan 
> > > Reviewed-by: Darrick J. Wong 
> > > ---
> > >drivers/dax/super.c |  3 ++-
> > >fs/xfs/xfs_notify_failure.c | 15 +++
> > >include/linux/mm.h  |  1 +
> > >3 files changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index 9b5e2a5eb0ae..cf9a64563fbe 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > >   return;
> > >   if (dax_dev->holder_data != NULL)
> > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > + MF_MEM_PRE_REMOVE);
> > > >   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > >   synchronize_srcu(&dax_srcu);
> > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > index 65d5eb20878e..a9769f17e998 100644
> > > --- a/fs/xfs/xfs_notify_failure.c
> > > +++ b/fs/xfs/xfs_notify_failure.c
> > > @@ -77,6 +77,9 @@ xfs_dax_failure_fn(
> > >   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > >   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | 
> > > XFS_RMAP_BMBT_BLOCK))) {
> > > + /* Do not shutdown so early when device is to be removed */
> > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > >   notify->want_shutdown = true;
> > >   return 0;
> > >   }
> > > @@ -182,12 +185,22 @@ xfs_dax_notify_failure(
> > > >   struct xfs_mount *mp = dax_holder(dax_dev);
> > >   u64 ddev_start;
> > >   u64 ddev_end;
> > > + int error;
> > >   if (!(mp->m_sb.sb_flags & SB_BORN)) {
> > 
> > How are you testing the SB_BORN interactions? I have a fix for this
> > pending here:
> > 
> > https://lore.kernel.org/nvdimm/166153428094.2758201.7936572520826540019.st...@dwillia2-xfh.jf.intel.com/
> 
> That was my mistake.  Yes, it should be mp->m_super->s_flags.
> 
> (I remember my testcase did pass in my dev version, but now that seems
> impossible.  I think something was wrong when I did the test.)
> 
> > 
> > >   xfs_warn(mp, "filesystem is not ready for 
> > > notify_failure()!");
> > >   return -EIO;
> > >   }
> > >+  if (mf_flags & MF_MEM_PRE_REMOVE) {
> > 
> > It appears this patch is corr

Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-18 Thread Darrick J. Wong
On Thu, Aug 18, 2022 at 07:19:28PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/8/3 12:33, Darrick J. Wong wrote:
> > On Wed, Aug 03, 2022 at 02:43:20AM +, ruansy.f...@fujitsu.com wrote:
> > > 
> > > On 2022/7/19 6:56, Dan Williams wrote:
> > > > Darrick J. Wong wrote:
> > > > > On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote:
> > > > > > ruansy.f...@fujitsu.com wrote:
> > > > > > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > > > > > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > > > > > > ->notify_failure() mechanism, the pmem driver is able to ask 
> > > > > > > filesystem
> > > > > > > (or mapped device) on it to unmap all files in use and notify 
> > > > > > > processes
> > > > > > > who are using those files.
> > > > > > > 
> > > > > > > Call trace:
> > > > > > > trigger unbind
> > > > > > >-> unbind_store()
> > > > > > > -> ... (skip)
> > > > > > >  -> devres_release_all()   # was pmem driver ->remove() in v1
> > > > > > >   -> kill_dax()
> > > > > > >-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, 
> > > > > > > MF_MEM_PRE_REMOVE)
> > > > > > > -> xfs_dax_notify_failure()
> > > > > > > 
> > > > > > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a 
> > > > > > > remove
> > > > > > > event.  So do not shutdown filesystem directly if something not
> > > > > > > supported, or if failure range includes metadata area.  Make sure 
> > > > > > > all
> > > > > > > files and processes are handled correctly.
> > > > > > > 
> > > > > > > ==
> > > > > > > Changes since v5:
> > > > > > > 1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
> > > > > > > 2. hold s_umount before sync_filesystem()
> > > > > > > 3. move sync_filesystem() after SB_BORN check
> > > > > > > 4. Rebased on next-20220714
> > > > > > > 
> > > > > > > Changes since v4:
> > > > > > > 1. sync_filesystem() at the beginning when MF_MEM_REMOVE
> > > > > > > 2. Rebased on next-20220706
> > > > > > > 
> > > > > > > [1]: 
> > > > > > > https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> > > > > > > 
> > > > > > > Signed-off-by: Shiyang Ruan 
> > > > > > > ---
> > > > > > >drivers/dax/super.c |  3 ++-
> > > > > > >fs/xfs/xfs_notify_failure.c | 15 +++
> > > > > > >include/linux/mm.h  |  1 +
> > > > > > >3 files changed, 18 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > > > index 9b5e2a5eb0ae..cf9a64563fbe 100644
> > > > > > > --- a/drivers/dax/super.c
> > > > > > > +++ b/drivers/dax/super.c
> > > > > > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > > > > >   return;
> > > > > > >   if (dax_dev->holder_data != NULL)
> > > > > > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > > > > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > > > > > + MF_MEM_PRE_REMOVE);
> > > > > > > >   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > > > > > >   synchronize_srcu(&dax_srcu);
> > > > > > > diff --git a/fs/xfs/xfs_notify_failure.c 
> > > > > > > b/fs/xfs/xfs_notify_failure.c
> > > > > > > index 69d9c83ea4b2..6da6747435eb 100644
> > > > > > > --- a/fs/xfs/xfs_notify_failure.c
> > > > > > > +++ b/fs/xfs/xfs_notify_failure.c
> > > > > > > @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
> > > > > > >   if (XFS_R

[PATCH] xfs: on memory failure, only shut down fs after scanning all mappings

2022-08-18 Thread Darrick J. Wong
From: Darrick J. Wong 

xfs_dax_failure_fn is used to scan the filesystem during a memory
failure event to look for memory mappings to revoke.  Unfortunately, if
it encounters an rmap record for filesystem metadata, it will shut down
the filesystem and the scan immediately.  This means that we don't
complete the mapping revocation scan and instead leave live mappings to
failed memory.  Fix the function to defer the shutdown until after we've
finished culling mappings.

While we're at it, add the usual "xfs_" prefix to struct failure_info,
and actually initialize mf_flags.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_notify_failure.c |   26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 69d9c83ea4b2..65d5eb20878e 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -23,17 +23,18 @@
 #include 
 #include 
 
-struct failure_info {
+struct xfs_failure_info {
xfs_agblock_t   startblock;
xfs_extlen_t    blockcount;
int mf_flags;
+   bool            want_shutdown;
 };
 
 static pgoff_t
 xfs_failure_pgoff(
struct xfs_mount*mp,
const struct xfs_rmap_irec  *rec,
-   const struct failure_info   *notify)
+   const struct xfs_failure_info   *notify)
 {
loff_t  pos = XFS_FSB_TO_B(mp, rec->rm_offset);
 
@@ -47,7 +48,7 @@ static unsigned long
 xfs_failure_pgcnt(
struct xfs_mount*mp,
const struct xfs_rmap_irec  *rec,
-   const struct failure_info   *notify)
+   const struct xfs_failure_info   *notify)
 {
xfs_agblock_t   end_rec;
xfs_agblock_t   end_notify;
@@ -71,13 +72,13 @@ xfs_dax_failure_fn(
 {
struct xfs_mount        *mp = cur->bc_mp;
struct xfs_inode        *ip;
-   struct failure_info *notify = data;
+   struct xfs_failure_info *notify = data;
int error = 0;
 
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
-   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
-   return -EFSCORRUPTED;
+   notify->want_shutdown = true;
+   return 0;
}
 
/* Get files that incore, filter out others that are not in use. */
@@ -86,8 +87,10 @@ xfs_dax_failure_fn(
/* Continue the rmap query if the inode isn't incore */
if (error == -ENODATA)
return 0;
-   if (error)
-   return error;
+   if (error) {
+   notify->want_shutdown = true;
+   return 0;
+   }
 
error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
  xfs_failure_pgoff(mp, rec, notify),
@@ -104,6 +107,7 @@ xfs_dax_notify_ddev_failure(
xfs_daddr_t bblen,
int mf_flags)
 {
+   struct xfs_failure_info notify = { .mf_flags = mf_flags };
struct xfs_trans*tp = NULL;
struct xfs_btree_cur*cur = NULL;
struct xfs_buf  *agf_bp = NULL;
@@ -120,7 +124,6 @@ xfs_dax_notify_ddev_failure(
for (; agno <= end_agno; agno++) {
struct xfs_rmap_irecri_low = { };
struct xfs_rmap_irecri_high;
-   struct failure_info notify;
struct xfs_agf  *agf;
xfs_agblock_t   agend;
struct xfs_perag*pag;
@@ -161,6 +164,11 @@ xfs_dax_notify_ddev_failure(
}
 
xfs_trans_cancel(tp);
+   if (error || notify.want_shutdown) {
+   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
+   if (!error)
+   error = -EFSCORRUPTED;
+   }
return error;
 }
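
For readers less used to designated initializers: the "{ .mf_flags = mf_flags }"
form relies on the C rule that members not named in the initializer are
zero-initialized, so want_shutdown starts out false without being mentioned.
A minimal user-space sketch of that rule follows; demo_failure_info and
demo_make_notify() are hypothetical stand-ins with simplified field types,
not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the failure-info structure; types simplified. */
struct demo_failure_info {
	unsigned int	startblock;
	unsigned int	blockcount;
	int		mf_flags;
	bool		want_shutdown;
};

/* Name one member in the initializer and let C zero the rest. */
static struct demo_failure_info
demo_make_notify(int mf_flags)
{
	struct demo_failure_info notify = { .mf_flags = mf_flags };

	/* Guaranteed by the initialization rules, not by luck. */
	assert(!notify.want_shutdown);
	return notify;
}
```

Calling demo_make_notify(flags) therefore yields a structure whose only
non-zero member is mf_flags, which is why hoisting the declaration out of the
loop with an initializer is enough to start with a clean flag.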
 



Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-08-03 Thread Darrick J. Wong
On Wed, Aug 03, 2022 at 06:47:24AM +, ruansy.f...@fujitsu.com wrote:
> 
> 
> On 2022/7/29 12:54, Darrick J. Wong wrote:
> > On Fri, Jul 29, 2022 at 03:55:24AM +, ruansy.f...@fujitsu.com wrote:
> >>
> >>
> >> On 2022/7/22 0:16, Darrick J. Wong wrote:
> >>> On Thu, Jul 21, 2022 at 02:06:10PM +, ruansy.f...@fujitsu.com wrote:
> >>>> On 2022/7/1 8:31, Darrick J. Wong wrote:
> >>>>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> >>>>>> Failure notification is not supported on partitions.  So, when we mount
> >>>>>> a reflink enabled xfs on a partition with dax option, let it fail with
> >>>>>> -EINVAL code.
> >>>>>>
> >>>>>> Signed-off-by: Shiyang Ruan 
> >>>>>
> >>>>> Looks good to me, though I think this patch applies to ... wherever all
> >>>>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> >>>>>
> >>>>> Ideally this would go in through there to keep the pieces together, but
> >>>>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> >>>>> is unwilling.
> >>>>
> >>>> BTW, since these patches (dax + THIS + pmem-unbind) are
> >>>> waiting to be merged, is it time to think about "removing the
> >>>> experimental tag" again?  :)
> >>>
> >>> It's probably time to take up that question again.
> >>>
> >>> Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> >>> didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> >>> and at least one of those dm layers no longer allows fsdax pass-through,
> >>> so XFS silently turned mount -o dax into -o dax=never. :(
> >>
> >> Hi Darrick,
> >>
> >> I tried generic/470 but it didn't run:
> >> [not run] Cannot use thin-pool devices on DAX capable block devices.
> >>
> >> Did you modify the _require_dm_target() in common/rc?  I added thin-pool
> >> so that it skips the dax capability check:
> >>
> >>   case $target in
> >>   stripe|linear|log-writes|thin-pool)  # add thin-pool here
> >>   ;;
> >>
> >> then the case finally ran and it silently turned off dax as you said.
> >>
> >> Are the steps for reproduction correct? If so, I will continue to
> >> investigate this problem.
> > 
> > Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
> > mention that.  I suspect that the removal of dm support for pmem is
> > going to force us to completely redesign this test.  I can't really
> > think of how, though, since there's no good way that I know of to gain a
> > point-in-time snapshot of a pmem device.
> 
> Hi Darrick,
> 
>  > removal of dm support for pmem
> I think we are talking about xfstests removing the support here, not the
> kernel?
> 
> I found some xfstests commits:
> fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the restriction.
> fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
> 
> So, this case was never able to run since the second commit?  (I didn't 
> notice the not run case.  I thought it was expected to be not run.)
> 
> And according to the first commit, the restriction was added because
> some dm devices don't support dax.  So my understanding is: we should
> redesign the case to make it work, and first we should add dax
> support for dm devices in the kernel.

dm devices used to have fsdax support; I think Christoph is actively
removing (or already has removed) all that support.

> In addition, are there any other test cases with the same problem, so that
> we can deal with them together?

The last I checked, there aren't any that require MAP_SYNC or pmem aside
from g/470 and the three poison notification tests that you sent a few
days ago.

--D

> 
> --
> Thanks,
> Ruan
> 
> 
> > 
> > --D
> > 
> >>
> >> --
> >> Thanks,
> >> Ruan.
> >>
> >>
> >>
> >>>
> >>> I'm not sure how to fix that...
> >>>
> >>> --D
> >>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Ruan.
> >>>>
> >>>>>
> >>>>> Reviewed-by: Darrick J. Wong 
> >>>>>
> >>>>> --D
> >>>>>
> >>>>>> ---
> >>>>>> fs/xfs/xfs_super.c | 6 --
> >>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> >>>>>> index 8495ef076ffc..a3c221841fa6 100644
> >>>>>> --- a/fs/xfs/xfs_super.c
> >>>>>> +++ b/fs/xfs/xfs_super.c
> >>>>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
> >>>>>>goto disable_dax;
> >>>>>>}
> >>>>>> 
> >>>>>> -  if (xfs_has_reflink(mp)) {
> >>>>>> -  xfs_alert(mp, "DAX and reflink cannot be used 
> >>>>>> together!");
> >>>>>> +  if (xfs_has_reflink(mp) &&
> >>>>>> +  bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> >>>>>> +  xfs_alert(mp,
> >>>>>> +  "DAX and reflink cannot work with 
> >>>>>> multi-partitions!");
> >>>>>>return -EINVAL;
> >>>>>>}
> >>>>>> 
> >>>>>> -- 
> >>>>>> 2.36.1
> >>>>>>
> >>>>>>
> >>>>>>



Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-02 Thread Darrick J. Wong
On Wed, Aug 03, 2022 at 02:43:20AM +, ruansy.f...@fujitsu.com wrote:
> 
> On 2022/7/19 6:56, Dan Williams wrote:
> > Darrick J. Wong wrote:
> >> On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote:
> >>> ruansy.f...@fujitsu.com wrote:
> >>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> >>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
> >>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> >>>> (or mapped device) on it to unmap all files in use and notify processes
> >>>> who are using those files.
> >>>>
> >>>> Call trace:
> >>>> trigger unbind
> >>>>   -> unbind_store()
> >>>>-> ... (skip)
> >>>> -> devres_release_all()   # was pmem driver ->remove() in v1
> >>>>  -> kill_dax()
> >>>>   -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, 
> >>>> MF_MEM_PRE_REMOVE)
> >>>>-> xfs_dax_notify_failure()
> >>>>
> >>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> >>>> event.  So do not shutdown filesystem directly if something not
> >>>> supported, or if failure range includes metadata area.  Make sure all
> >>>> files and processes are handled correctly.
> >>>>
> >>>> ==
> >>>> Changes since v5:
> >>>>1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
> >>>>2. hold s_umount before sync_filesystem()
> >>>>3. move sync_filesystem() after SB_BORN check
> >>>>4. Rebased on next-20220714
> >>>>
> >>>> Changes since v4:
> >>>>1. sync_filesystem() at the beginning when MF_MEM_REMOVE
> >>>>2. Rebased on next-20220706
> >>>>
> >>>> [1]: 
> >>>> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> >>>>
> >>>> Signed-off-by: Shiyang Ruan 
> >>>> ---
> >>>>   drivers/dax/super.c |  3 ++-
> >>>>   fs/xfs/xfs_notify_failure.c | 15 +++
> >>>>   include/linux/mm.h  |  1 +
> >>>>   3 files changed, 18 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> >>>> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> >>>> --- a/drivers/dax/super.c
> >>>> +++ b/drivers/dax/super.c
> >>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> >>>>  return;
> >>>>   
> >>>>  if (dax_dev->holder_data != NULL)
> >>>> -dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> >>>> +dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> >>>> +MF_MEM_PRE_REMOVE);
> >>>>   
> >>>>  clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> >>>>  synchronize_srcu(&dax_srcu);
> >>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> >>>> index 69d9c83ea4b2..6da6747435eb 100644
> >>>> --- a/fs/xfs/xfs_notify_failure.c
> >>>> +++ b/fs/xfs/xfs_notify_failure.c
> >>>> @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
> >>>>   
> >>>>  if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> >>>>  (rec->rm_flags & (XFS_RMAP_ATTR_FORK | 
> >>>> XFS_RMAP_BMBT_BLOCK))) {
> >>>> +/* Do not shutdown so early when device is to be 
> >>>> removed */
> >>>> +if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> >>>> +return 0;
> >>>>  xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> >>>>  return -EFSCORRUPTED;
> >>>>  }
> >>>> @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
> >>>>  struct xfs_mount	*mp = dax_holder(dax_dev);
> >>>>  u64 ddev_start;
> >>>>  u64 ddev_end;
> >>>> +int error;
> >>>>   
> >>>>  if (!(mp->m_sb.sb_flags & SB_BORN)) {
> >>

Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-07-28 Thread Darrick J. Wong
On Fri, Jul 29, 2022 at 03:55:24AM +, ruansy.f...@fujitsu.com wrote:
> 
> 
> On 2022/7/22 0:16, Darrick J. Wong wrote:
> > On Thu, Jul 21, 2022 at 02:06:10PM +, ruansy.f...@fujitsu.com wrote:
> >> On 2022/7/1 8:31, Darrick J. Wong wrote:
> >>> On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> >>>> Failure notification is not supported on partitions.  So, when we mount
> >>>> a reflink enabled xfs on a partition with dax option, let it fail with
> >>>> -EINVAL code.
> >>>>
> >>>> Signed-off-by: Shiyang Ruan 
> >>>
> >>> Looks good to me, though I think this patch applies to ... wherever all
> >>> those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> >>>
> >>> Ideally this would go in through there to keep the pieces together, but
> >>> I don't mind tossing this in at the end of the 5.20 merge window if akpm
> >>> is unwilling.
> >>
> >> BTW, since these patches (dax + THIS + pmem-unbind) are
> >> waiting to be merged, is it time to think about "removing the
> >> experimental tag" again?  :)
> > 
> > It's probably time to take up that question again.
> > 
> > Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> > didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> > and at least one of those dm layers no longer allows fsdax pass-through,
> > so XFS silently turned mount -o dax into -o dax=never. :(
> 
> Hi Darrick,
> 
> I tried generic/470 but it didn't run:
>[not run] Cannot use thin-pool devices on DAX capable block devices.
> 
> Did you modify the _require_dm_target() in common/rc?  I added thin-pool
> so that it skips the dax capability check:
> 
>  case $target in
>  stripe|linear|log-writes|thin-pool)  # add thin-pool here
>  ;;
> 
> then the case finally ran and it silently turned off dax as you said.
> 
> Are the steps for reproduction correct? If so, I will continue to 
> investigate this problem.

Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
mention that.  I suspect that the removal of dm support for pmem is
going to force us to completely redesign this test.  I can't really
think of how, though, since there's no good way that I know of to gain a
point-in-time snapshot of a pmem device.

--D

> 
> --
> Thanks,
> Ruan.
> 
> 
> 
> > 
> > I'm not sure how to fix that...
> > 
> > --D
> > 
> >>
> >> --
> >> Thanks,
> >> Ruan.
> >>
> >>>
> >>> Reviewed-by: Darrick J. Wong 
> >>>
> >>> --D
> >>>
> >>>> ---
> >>>>fs/xfs/xfs_super.c | 6 --
> >>>>1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> >>>> index 8495ef076ffc..a3c221841fa6 100644
> >>>> --- a/fs/xfs/xfs_super.c
> >>>> +++ b/fs/xfs/xfs_super.c
> >>>> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
> >>>>  goto disable_dax;
> >>>>  }
> >>>>
> >>>> -if (xfs_has_reflink(mp)) {
> >>>> -xfs_alert(mp, "DAX and reflink cannot be used 
> >>>> together!");
> >>>> +if (xfs_has_reflink(mp) &&
> >>>> +bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> >>>> +xfs_alert(mp,
> >>>> +"DAX and reflink cannot work with 
> >>>> multi-partitions!");
> >>>>  return -EINVAL;
> >>>>  }
> >>>>
> >>>> -- 
> >>>> 2.36.1
> >>>>
> >>>>
> >>>>



Re: [PATCH] fsdax: Fix infinite loop in dax_iomap_rw()

2022-07-25 Thread Darrick J. Wong
On Mon, Jul 25, 2022 at 11:20:50AM +0800, Li Jinlin wrote:
> I got an infinite loop and a WARNING report when executing a tail command
> in virtiofs.
> 
>   WARNING: CPU: 10 PID: 964 at fs/iomap/iter.c:34 iomap_iter+0x3a2/0x3d0
>   Modules linked in:
>   CPU: 10 PID: 964 Comm: tail Not tainted 5.19.0-rc7
>   Call Trace:
>   
>   dax_iomap_rw+0xea/0x620
>   ? __this_cpu_preempt_check+0x13/0x20
>   fuse_dax_read_iter+0x47/0x80
>   fuse_file_read_iter+0xae/0xd0
>   new_sync_read+0xfe/0x180
>   ? 0x8100
>   vfs_read+0x14d/0x1a0
>   ksys_read+0x6d/0xf0
>   __x64_sys_read+0x1a/0x20
>   do_syscall_64+0x3b/0x90
>   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> The tail command will call read() with a count of 0. In this case,
> iomap_iter() will report this WARNING and always return 1, which causes
> the infinite loop in dax_iomap_rw().
> 
> Fix this by checking whether count is 0 in dax_iomap_rw().
> 
> Fixes: ca289e0b95af ("fsdax: switch dax_iomap_rw to use iomap_iter")
> Signed-off-by: Li Jinlin 

Huh, I didn't know FUSE supports DAX and iomap now...

> ---
>  fs/dax.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4155a6107fa1..7ab248ed21aa 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1241,6 +1241,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>   loff_t done = 0;
>   int ret;
>  
> + if (!iomi.len)
> + return 0;

Hmm, most of the callers of dax_iomap_rw skip the whole call if
iov_iter_count(to)==0, so I wonder if fuse_dax_read_iter should do the
same?

That said, iomap_dio_rw bails early if you pass it iomi.len == 0, so I
don't have any real objections to this.
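
The early-return guard under discussion can be shown outside the kernel.
The sketch below is a hypothetical user-space loop, not the dax or iomap
code; demo_copy() and DEMO_CHUNK are made up for illustration.  It bails
out before the loop when asked for zero bytes, so the loop body never has
to cope with an empty request.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CHUNK 4096u	/* hypothetical per-iteration limit */

/* Process len bytes in chunks; returns the number of bytes handled. */
static size_t
demo_copy(size_t len)
{
	size_t done = 0;

	/* Zero-length request: nothing to do, so never enter the loop. */
	if (!len)
		return 0;

	while (done < len) {
		size_t step = len - done;

		if (step > DEMO_CHUNK)
			step = DEMO_CHUNK;
		done += step;
	}
	assert(done == len);
	return done;
}
```

Whether the check lives in the callee (as in this patch) or in each caller
(as most dax_iomap_rw callers already do) is a placement choice; the callee
version protects callers that forget.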

Reviewed-by: Darrick J. Wong 

--D


> +
>   if (iov_iter_rw(iter) == WRITE) {
>   lockdep_assert_held_write(&iomi.inode->i_rwsem);
>   iomi.flags |= IOMAP_WRITE;
> -- 
> 2.30.2
> 



Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-07-21 Thread Darrick J. Wong
On Thu, Jul 21, 2022 at 02:06:10PM +, ruansy.f...@fujitsu.com wrote:
> On 2022/7/1 8:31, Darrick J. Wong wrote:
> > On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> >> Failure notification is not supported on partitions.  So, when we mount
> >> a reflink enabled xfs on a partition with dax option, let it fail with
> >> -EINVAL code.
> >>
> >> Signed-off-by: Shiyang Ruan 
> > 
> > Looks good to me, though I think this patch applies to ... wherever all
> > those rmap+reflink+dax patches went.  I think that's akpm's tree, right?
> > 
> > Ideally this would go in through there to keep the pieces together, but
> > I don't mind tossing this in at the end of the 5.20 merge window if akpm
> > is unwilling.
> 
> BTW, since these patches (dax + THIS + pmem-unbind) are 
> waiting to be merged, is it time to think about "removing the 
> experimental tag" again?  :)

It's probably time to take up that question again.

Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
and at least one of those dm layers no longer allows fsdax pass-through,
so XFS silently turned mount -o dax into -o dax=never. :(

I'm not sure how to fix that...

--D

> 
> --
> Thanks,
> Ruan.
> 
> > 
> > Reviewed-by: Darrick J. Wong 
> > 
> > --D
> > 
> >> ---
> >>   fs/xfs/xfs_super.c | 6 --
> >>   1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> >> index 8495ef076ffc..a3c221841fa6 100644
> >> --- a/fs/xfs/xfs_super.c
> >> +++ b/fs/xfs/xfs_super.c
> >> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
> >>goto disable_dax;
> >>}
> >>   
> >> -  if (xfs_has_reflink(mp)) {
> >> -  xfs_alert(mp, "DAX and reflink cannot be used together!");
> >> +  if (xfs_has_reflink(mp) &&
> >> +  bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> >> +  xfs_alert(mp,
> >> +  "DAX and reflink cannot work with multi-partitions!");
> >>return -EINVAL;
> >>}
> >>   
> >> -- 
> >> 2.36.1
> >>
> >>
> >>



Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-18 Thread Darrick J. Wong
On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote:
> ruansy.f...@fujitsu.com wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > (or mapped device) on it to unmap all files in use and notify processes
> > who are using those files.
> > 
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >-> devres_release_all()   # was pmem driver ->remove() in v1
> > -> kill_dax()
> >  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >   -> xfs_dax_notify_failure()
> > 
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event.  So do not shutdown filesystem directly if something not
> > supported, or if failure range includes metadata area.  Make sure all
> > files and processes are handled correctly.
> > 
> > ==
> > Changes since v5:
> >   1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
> >   2. hold s_umount before sync_filesystem()
> >   3. move sync_filesystem() after SB_BORN check
> >   4. Rebased on next-20220714
> > 
> > Changes since v4:
> >   1. sync_filesystem() at the beginning when MF_MEM_REMOVE
> >   2. Rebased on next-20220706
> > 
> > [1]: 
> > https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> > 
> > Signed-off-by: Shiyang Ruan 
> > ---
> >  drivers/dax/super.c |  3 ++-
> >  fs/xfs/xfs_notify_failure.c | 15 +++
> >  include/linux/mm.h  |  1 +
> >  3 files changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 9b5e2a5eb0ae..cf9a64563fbe 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > return;
> >  
> > if (dax_dev->holder_data != NULL)
> > -   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > +   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > +   MF_MEM_PRE_REMOVE);
> >  
> > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 69d9c83ea4b2..6da6747435eb 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
> >  
> > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > +   /* Do not shutdown so early when device is to be removed */
> > +   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > +   return 0;
> > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > return -EFSCORRUPTED;
> > }
> > @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
> > struct xfs_mount	*mp = dax_holder(dax_dev);
> > u64 ddev_start;
> > u64 ddev_end;
> > +   int error;
> >  
> > if (!(mp->m_sb.sb_flags & SB_BORN)) {
> > xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> > return -EIO;
> > }
> >  
> > +   if (mf_flags & MF_MEM_PRE_REMOVE) {
> > +   xfs_info(mp, "device is about to be removed!");
> > +   down_write(&mp->m_super->s_umount);
> > +   error = sync_filesystem(mp->m_super);
> > +   up_write(&mp->m_super->s_umount);
> 
> Are all mappings invalidated after this point?

No; all this step does is pushes dirty filesystem [meta]data to pmem
before we lose DAXDEV_ALIVE...

> The goal of the removal notification is to invalidate all DAX mappings
> that are now pointing to pfns that do not exist anymore, so just syncing
> does not seem like enough, and the shutdown is skipped above. What am I
> missing?

...however, the shutdown above only applies to filesystem metadata.  In
effect, we avoid the fs shutdown in MF_MEM_PRE_REMOVE mode, which
enables the mf_dax_kill_procs calls to proceed against mapped file data.
I have a nagging suspicion that in non-PREREMOVE mode, we can end up
shutting down the filesystem on an xattr block and the 'return
-EFSCORRUPTED' actually prevents us from reaching all the remaining file
data mappings.

IOWs, I think that clause above really ought to have returned zero so
that we keep the filesystem up while we're tearing down mappings, and
only call xfs_force_shutdown() after we've had a chance to let
xfs_dax_notify_ddev_failure() tear down all the mappings.
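
The ordering proposed here can be sketched in plain C.  This is only an
illustration under made-up names (demo_rec, demo_scan, demo_scan_fn,
demo_scan_all), not the XFS rmap walk: the per-record callback records
that a shutdown is wanted and returns 0 so the walk keeps going, and the
caller decides what to do once every record has been visited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record: is_metadata marks a mapping the scan cannot unmap. */
struct demo_rec {
	bool	is_metadata;
};

struct demo_scan {
	bool	want_shutdown;	/* set during the scan, acted on afterwards */
	size_t	unmapped;	/* file-data mappings torn down */
};

/* Per-record callback: note the problem and return 0 so the walk continues. */
static int
demo_scan_fn(const struct demo_rec *rec, struct demo_scan *scan)
{
	if (rec->is_metadata) {
		scan->want_shutdown = true;
		return 0;
	}
	scan->unmapped++;
	return 0;
}

/* Walk every record; returns true if the caller should shut down afterwards. */
static bool
demo_scan_all(const struct demo_rec *recs, size_t nr, struct demo_scan *scan)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (demo_scan_fn(&recs[i], scan))
			break;
	}
	return scan->want_shutdown;
}
```

With a metadata record in the middle of the array, the records after it are
still processed; had the callback returned an error at that point, the loop
would have stopped and left them mapped.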

I missed that subtlety in the initial ~30 rounds of review, but I figure
at this point let's just land it in 5.20 and clean up that quirk for
-rc1.

> Notice that kill_dev_dax() does unmap_mapping_range() after invalidating
> the dax device and that ensures that all existing mappings are gone and
> cannot be re-established. As far as I can see a 

Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-14 Thread Darrick J. Wong
On Thu, Jul 14, 2022 at 10:34:29AM +, ruansy.f...@fujitsu.com wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> ==
> Changes since v5:
>   1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
>   2. hold s_umount before sync_filesystem()
>   3. move sync_filesystem() after SB_BORN check
>   4. Rebased on next-20220714
> 
> Changes since v4:
>   1. sync_filesystem() at the beginning when MF_MEM_REMOVE
>   2. Rebased on next-20220706
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 

Looks reasonable to me now,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  drivers/dax/super.c |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 15 +++
>  include/linux/mm.h  |  1 +
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 69d9c83ea4b2..6da6747435eb 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Do not shutdown so early when device is to be removed */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
>   }
> @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
>   struct xfs_mount	*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + up_write(&mp->m_super->s_umount);
> + if (error)
> + return error;
> + }
> +
>   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>   xfs_warn(mp,
>"notify_failure() not supported on realtime device!");
> @@ -188,6 +201,8 @@ xfs_dax_notify_failure(
>  
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4287bec50c28..2ddfb76c8a83 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3188,6 +3188,7 @@ enum mf_flags {
>   MF_SOFT_OFFLINE = 1 << 3,
>   MF_UNPOISON = 1 << 4,
>   MF_SW_SIMULATED = 1 << 5,
> + MF_MEM_PRE_REMOVE = 1 << 6,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> -- 
> 2.37.0



Re: [RFC PATCH v5] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-08 Thread Darrick J. Wong
On Fri, Jul 08, 2022 at 05:42:22AM +, ruansy.f...@fujitsu.com wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_REMOVE to let filesystem know this is a remove event.
> So do not shutdown filesystem directly if something not supported, or if
> failure range includes metadata area.  Make sure all files and processes
> are handled correctly.
> 
> ==
> Changes since v4:
>   1. sync_filesystem() at the beginning when MF_MEM_REMOVE
>   2. Rebased on next-20220706
> 
> Changes since v3:
>   1. Flush dirty files and logs when pmem is about to be removed.
>   2. Rebased on next-20220701
> 
> Changes since v2:
>   1. Rebased on next-20220615
> 
> Changes since v1:
>   1. Drop the needless change of moving {kill,put}_dax()
>   2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> [2]: 
> https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  2 +-
>  fs/xfs/xfs_notify_failure.c | 16 
>  include/linux/mm.h  |  1 +
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..d4bc83159d46 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,7 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index aa8dc27c599c..728b0c1d0ddf 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -18,6 +18,7 @@
>  #include "xfs_rmap_btree.h"
>  #include "xfs_rtalloc.h"
>  #include "xfs_trans.h"
> +#include "xfs_log.h"
>  
>  #include 
>  #include 
> @@ -75,6 +76,10 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Do not shutdown so early when device is to be removed */
> + if (notify->mf_flags & MF_MEM_REMOVE) {
> + return 0;
> + }

Nit: no curly braces needed here.

>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
>   }
> @@ -168,6 +173,14 @@ xfs_dax_notify_failure(
>   struct xfs_mount	*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
> +
> + if (mf_flags & MF_MEM_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + error = sync_filesystem(mp->m_super);

sync_filesystem requires callers to hold s_umount.  Does the dax media
failure code take that lock for us, or is this missing a lock?

Also, I'm not sure it's a good idea to sync_filesystem() before checking
if SB_BORN has been set.
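
To make the two ordering concerns concrete, here is a small user-space
mock.  Everything in it is hypothetical (demo_sb, demo_sync,
demo_pre_remove); the boolean umount_held merely stands in for holding
s_umount, and the mock sync asserts that its caller holds it.  The sketch
checks readiness first, then takes the "lock", syncs, and drops it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical superblock stand-in; umount_held models holding s_umount. */
struct demo_sb {
	bool	born;
	bool	umount_held;
	int	syncs;
};

/* Mock sync routine that requires the caller to hold the lock. */
static int
demo_sync(struct demo_sb *sb)
{
	assert(sb->umount_held);
	sb->syncs++;
	return 0;
}

static int
demo_pre_remove(struct demo_sb *sb)
{
	int error;

	/* Readiness check first: do not sync a filesystem that is not set up. */
	if (!sb->born)
		return -EIO;

	sb->umount_held = true;		/* stands in for down_write() */
	error = demo_sync(sb);
	sb->umount_held = false;	/* stands in for up_write() */
	return error;
}
```

A not-yet-born superblock is rejected without any sync, and a born one is
synced exactly once with the lock held for the duration of the call.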

> + if (error)
> + return error;
> + }
>  
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> @@ -182,6 +195,9 @@ xfs_dax_notify_failure(
>  
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_REMOVE) {
> + return 0;
> + }

Same nit about not needing curly braces.

>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 794ad19b57f8..3eab2d7ba884 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3240,6 +3240,7 @@ enum mf_flags {
>   MF_UNPOISON = 1 << 4,
>   MF_SW_SIMULATED = 1 << 5,
>   MF_NO_RETRY = 1 << 6,
> + MF_MEM_REMOVE = 1 << 7,

This is more of a pre-removal notification, right?  I think the flag
value ought to be named that way too (MF_MEM_PRE_REMOVE).

--D

>  };
>  int mf_dax_kill_procs(struct address_space 

Re: [RFC PATCH v4] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-05 Thread Darrick J. Wong
On Sun, Jul 03, 2022 at 09:08:38PM +0800, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_REMOVE to let filesystem know this is a remove event.
> So do not shutdown filesystem directly if something not supported, or if
> failure range includes metadata area.  Make sure all files and processes
> are handled correctly.
> 
> ==
> Changes since v3:
>   1. Flush dirty files and logs when pmem is about to be removed.
>   2. Rebased on next-20220701
> 
> Changes since v2:
>   1. Rebased on next-20220615
> 
> Changes since v1:
>   1. Drop the needless change of moving {kill,put}_dax()
>   2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> [2]: 
> https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.f...@fujitsu.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  2 +-
>  fs/xfs/xfs_notify_failure.c | 23 ++-
>  include/linux/mm.h  |  1 +
>  3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..d4bc83159d46 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,7 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
> 
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);
> 
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index aa8dc27c599c..269e21b3341c 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -18,6 +18,7 @@
>  #include "xfs_rmap_btree.h"
>  #include "xfs_rtalloc.h"
>  #include "xfs_trans.h"
> +#include "xfs_log.h"
> 
>  #include 
>  #include 
> @@ -75,6 +76,10 @@ xfs_dax_failure_fn(
> 
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Do not shutdown so early when device is to be removed */
> + if (notify->mf_flags & MF_MEM_REMOVE) {
> + return 0;
> + }
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
>   }
> @@ -168,6 +173,7 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
> 
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> @@ -182,6 +188,13 @@ xfs_dax_notify_failure(
> 
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_REMOVE) {
> + /* Flush the log since device is about to be removed. */

If MF_MEM_REMOVE means "storage is about to go away" then perhaps the
only thing we need to do in xfs_dax_notify_failure is log a message
about the pending failure and then call sync_filesystem()?  This I think
could come before we even start looking at which device -- if any of the
filesystem blockdevs are about to be removed, the best we can do is
flush all the dirty data to disk.
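
A rough sketch of that ordering, written as a userspace model rather than
kernel code.  fake_mount, fake_sync_filesystem() and
fake_notify_device_failure() are hypothetical stand-ins (the real
sync_filesystem() takes a struct super_block), and the MF_MEM_REMOVE value
is the one from the v3 posting of this patch.  It only illustrates the
control flow: on a remove event, log and flush before any per-device checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Value as defined in the v3 posting's include/linux/mm.h hunk. */
#define MF_MEM_REMOVE (1 << 6)

/* Hypothetical stand-in for struct xfs_mount; it only records calls. */
struct fake_mount {
	int synced;      /* how many times the flush stub ran */
	int rmap_walks;  /* how many times the per-device path ran */
	int sync_error;  /* error the flush stub should report */
};

/* Stub modelling sync_filesystem(); not the kernel signature. */
static int fake_sync_filesystem(struct fake_mount *mp)
{
	mp->synced++;
	return mp->sync_error;
}

/* Stub for the existing "which device, which range" handling. */
static int fake_notify_device_failure(struct fake_mount *mp)
{
	mp->rmap_walks++;
	return 0;
}

/* Model of the suggested ordering inside the notify_failure handler. */
int model_notify_failure(struct fake_mount *mp, int mf_flags)
{
	assert(mp != NULL);

	if (mf_flags & MF_MEM_REMOVE) {
		/* Storage is about to go away: warn, then flush everything. */
		fprintf(stderr, "device is about to be removed, flushing\n");
		return fake_sync_filesystem(mp);
	}

	/* Ordinary media error: fall through to the per-device handling. */
	return fake_notify_device_failure(mp);
}
```

Whether the handler should return right after the flush or carry on into
the per-device handling is left open here; the model simply returns early.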

--D

> + error = xfs_log_force(mp, XFS_LOG_SYNC);
> + if (error)
> + return error;
> + return -EOPNOTSUPP;
> + }
>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> @@ -211,8 +224,16 @@ xfs_dax_notify_failure(
>   if (offset + len > ddev_end)
>   len -= ddev_end - offset;
> 
> - return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> + error = xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>   mf_flags);
> + if (error)
> + return error;
> +
> + if (mf_flags & MF_MEM_REMOVE) {
> + xfs_flush_inodes(mp);
> + error = xfs_log_force(mp, XFS_LOG_SYNC);
> + }
> + return error;
>  }
> 
>  const 

Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-06-30 Thread Darrick J. Wong
On Thu, Jun 09, 2022 at 10:34:35PM +0800, Shiyang Ruan wrote:
> Failure notification is not supported on partitions.  So, when we mount
> a reflink enabled xfs on a partition with dax option, let it fail with
> -EINVAL code.
> 
> Signed-off-by: Shiyang Ruan 

Looks good to me, though I think this patch applies to ... wherever all
those rmap+reflink+dax patches went.  I think that's akpm's tree, right?

Ideally this would go in through there to keep the pieces together, but
I don't mind tossing this in at the end of the 5.20 merge window if akpm
is unwilling.

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_super.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 8495ef076ffc..a3c221841fa6 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -348,8 +348,10 @@ xfs_setup_dax_always(
>   goto disable_dax;
>   }
>  
> - if (xfs_has_reflink(mp)) {
> - xfs_alert(mp, "DAX and reflink cannot be used together!");
> + if (xfs_has_reflink(mp) &&
> + bdev_is_partition(mp->m_ddev_targp->bt_bdev)) {
> + xfs_alert(mp,
> + "DAX and reflink cannot work with multi-partitions!");
>   return -EINVAL;
>   }
>  
> -- 
> 2.36.1
> 
> 
> 



Re: [RFC PATCH v3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-06-22 Thread Darrick J. Wong
On Wed, Jun 15, 2022 at 08:54:00PM +0800, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()   # was pmem driver ->remove() in v1
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE)
>       -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_REMOVE to let filesystem know this is a remove event.
> So do not shutdown filesystem directly if something not supported, or if
> failure range includes metadata area.  Make sure all files and processes
> are handled correctly.
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 
> 
> ==
> Changes since v2:
>   1. Rebased on next-20220615
> 
> Changes since v1:
>   1. Drop the needless change of moving {kill,put}_dax()
>   2. Rebased on '[PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink'[2]
> 
> ---
>  drivers/dax/super.c | 2 +-
>  fs/xfs/xfs_notify_failure.c | 6 +-
>  include/linux/mm.h  | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..d4bc83159d46 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,7 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_REMOVE);

At the point we're initiating a MEM_REMOVE call, is the pmem already
gone, or is it about to be gone?

>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index aa8dc27c599c..91d3f05d4241 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -73,7 +73,9 @@ xfs_dax_failure_fn(
>   struct failure_info *notify = data;
>   int error = 0;
>  
> - if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> + /* Do not shutdown so early when device is to be removed */
> + if (!(notify->mf_flags & MF_MEM_REMOVE) ||
> + XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> @@ -182,6 +184,8 @@ xfs_dax_notify_failure(
>  
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_REMOVE)
> + return -EOPNOTSUPP;

The reason I ask is that if the pmem is *about to be* but not yet
removed from the system, shouldn't we at least try to flush all dirty
files and the log to reduce data loss and minimize recovery time?

If it's already gone, then you might as well shut down immediately,
unless there's a chance the pmem will come back(?)

--D

>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 623c2ee8330a..bbeb31883362 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3249,6 +3249,7 @@ enum mf_flags {
>   MF_SOFT_OFFLINE = 1 << 3,
>   MF_UNPOISON = 1 << 4,
>   MF_NO_RETRY = 1 << 5,
> + MF_MEM_REMOVE = 1 << 6,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> -- 
> 2.36.1
> 
> 
> 



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Darrick J. Wong
On Thu, Jun 02, 2022 at 05:42:13PM +0800, Shiyang Ruan wrote:
> Hi,
> 
> Is there any other work I should do with these two patchsets?  I think they
> are good for now.  So... since the 5.19-rc1 is coming, could the
> notify_failure() part be merged as your plan?

Hmm.  I don't see any of the patches 1-5,7-13 in current upstream, so
I'm guessing this means Andrew isn't taking it for 5.19?

--D

> 
> 
> --
> Thanks,
> Ruan.
> 
> 
> On 2022/5/12 20:27, Shiyang Ruan wrote:
> > 
> > 
> > On 2022/5/11 23:46, Dan Williams wrote:
> > > On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong 
> > > wrote:
> > > > 
> > > > > On Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> > > > > On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong"
> > > > >  wrote:
> > > > > 
> > > > > > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > > > > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams
> > > > > > >  wrote:
> > > > > > > 
> > > > > > > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > > > > > > really matters where as long as it's merged into the xfs for-next
> > > > > > > > > > tree so it gets filesystem test coverage...
> > > > > > > > > 
> > > > > > > > > So how about let the notify_failure() bits go through -mm this cycle,
> > > > > > > > > if Andrew will have it, and then the reflink work has a clean v5.19-rc1
> > > > > > > > > baseline to build from?
> > > > > > > 
> > > > > > > What are we referring to here?  I think a minimal thing would be 
> > > > > > > the
> > > > > > > memremap.h and memory-failure.c changes from
> > > > > > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com
> > > > > > > ?
> > > > > > > 
> > > > > > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > > > > > would probably be straining things to slip it into 5.19.
> > > > > > > 
> > > > > > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > > > > > right thing, but it's a networking errno.  I suppose
> > > > > > > livable with if it
> > > > > > > never escapes the kernel, but if it can get back to userspace 
> > > > > > > then a
> > > > > > > user would be justified in wondering how the heck a filesystem
> > > > > > > operation generated a networking errno?
> > > > > > 
> > > > > >  most filesystems return EOPNOTSUPP rather
> > > > > > enthusiastically when
> > > > > > they don't know how to do something...
> > > > > 
> > > > > Can it propagate back to userspace?
> > > > 
> > > > AFAICT, the new code falls back to the current (mf_generic_kill_procs)
> > > > failure code if the filesystem doesn't provide a ->memory_failure
> > > > > function or if it returns -EOPNOTSUPP.  mf_generic_kill_procs can also
> > > > return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
> > > > convert that to 0 before returning it to userspace.
> > > > 
> > > > I suppose the weirder question is going to be what happens when madvise
> > > > starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
> > > > loses half its brains and even the fs can't deal with it.
> > > 
> > > Even then that notification is not in a system call context so it
> > > would still result in a SIGBUS notification not a EOPNOTSUPP return
> > > code. The only potential gap I see are what are the possible error
> > > codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
> > > offline failure codes. Shiyang, that's something to check / update if
> > > necessary.
> > 
> > According to the code around MADV_SOFT_OFFLINE, it will return -EIO when
> > the backend is NVDIMM.
> > 
> > Here is the logic:
> >   madvise_inject_error() {
> >   ...
> >   if (MADV_SOFT_OFFLINE) {
> >   ret = soft_offline_page() {
> >   ...
> >   /* Only online pages can be soft-offlined (esp., not
> > ZONE_DEVICE). */
> >   page = pfn_to_online_page(pfn);
> >   if (!page) {
> >   put_ref_page(ref_page);
> >   return -EIO;
> >   }
> >   ...
> >   }
> >   } else {
> >   ret = memory_failure()
> >   }
> >   return ret
> >   }
> > 
> > 
> > -- 
> > Thanks,
> > Ruan.
> > 
> > 
> 
> 



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-11 Thread Darrick J. Wong
On Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong"  
> wrote:
> 
> > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams 
> > >  wrote:
> > > 
> > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > really matters where as long as it's merged into the xfs for-next
> > > > > tree so it gets filesystem test coverage...
> > > > 
> > > > So how about let the notify_failure() bits go through -mm this cycle,
> > > > if Andrew will have it, and then the reflink work has a clean v5.19-rc1
> > > > baseline to build from?
> > > 
> > > What are we referring to here?  I think a minimal thing would be the
> > > memremap.h and memory-failure.c changes from
> > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com
> > >  ?
> > > 
> > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > would probably be straining things to slip it into 5.19.
> > > 
> > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > right thing, but it's a networking errno.  I suppose livable with if it
> > > never escapes the kernel, but if it can get back to userspace then a
> > > user would be justified in wondering how the heck a filesystem
> > > operation generated a networking errno?
> > 
> >  most filesystems return EOPNOTSUPP rather enthusiastically when
> > they don't know how to do something...
> 
> Can it propagate back to userspace?

AFAICT, the new code falls back to the current (mf_generic_kill_procs)
failure code if the filesystem doesn't provide a ->memory_failure
function or if it returns -EOPNOTSUPP.  mf_generic_kill_procs can also
return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
convert that to 0 before returning it to userspace.
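
To make the fallback concrete, here is a small userspace model of the
behaviour described above.  Every name in it is hypothetical
(mf_handler_t, model_dev_pagemap_failure(), model_syscall_result() and the
example handlers); it mirrors the dispatch shown in the v13 3/7 hunk for
memory_failure_dev_pagemap() and the "convert to 0" step as I understand
it, not the actual kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver callback; NULL models "->memory_failure not set". */
typedef int (*mf_handler_t)(unsigned long pfn);

static int generic_calls;  /* counts how often the fallback ran */

/* Stand-in for mf_generic_kill_procs(). */
static int model_generic_kill_procs(unsigned long pfn)
{
	(void)pfn;
	generic_calls++;
	return 0;
}

/* Example handlers: unsupported, handled, and a hard filesystem error. */
int handler_unsupported(unsigned long pfn) { (void)pfn; return -EOPNOTSUPP; }
int handler_handled(unsigned long pfn)     { (void)pfn; return 0; }
int handler_eio(unsigned long pfn)         { (void)pfn; return -EIO; }

/* Try the driver/filesystem first; fall back only on -EOPNOTSUPP. */
int model_dev_pagemap_failure(mf_handler_t handler, unsigned long pfn)
{
	if (handler) {
		int rc = handler(pfn);

		if (rc != -EOPNOTSUPP)
			return rc;  /* handled, or a real error */
	}
	return model_generic_kill_procs(pfn);
}

/* Models a caller hiding -EOPNOTSUPP before it can reach userspace. */
int model_syscall_result(int rc)
{
	assert(rc <= 0);  /* kernel style: 0 or a negative errno */
	return rc == -EOPNOTSUPP ? 0 : rc;
}
```

Note that an error such as -EIO passes straight through both layers, which
is the case the next paragraph worries about.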

I suppose the weirder question is going to be what happens when madvise
starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
loses half its brains and even the fs can't deal with it.

--D



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Darrick J. Wong
On Tue, May 10, 2022 at 09:20:57PM -0700, Dan Williams wrote:
> On Tue, May 10, 2022 at 7:29 PM Andrew Morton  
> wrote:
> >
> > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams  
> > wrote:
> >
> > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > really matters where as long as it's merged into the xfs for-next
> > > > tree so it gets filesystem test coverage...
> > >
> > > So how about let the notify_failure() bits go through -mm this cycle,
> > > if Andrew will have it, and then the reflink work has a clean v5.19-rc1
> > > baseline to build from?
> >
> > What are we referring to here?  I think a minimal thing would be the
> > memremap.h and memory-failure.c changes from
> > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com ?
> 
> Latest is here:
> https://lore.kernel.org/all/20220508143620.1775214-1-ruansy.f...@fujitsu.com/
> 
> > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > would probably be straining things to slip it into 5.19.
> 
> Hmm, if it's straining things and XFS will also target v5.20 I think
> the best course for all involved is just wait. Let some of the current
> conflicts in -mm land in v5.19 and then I can merge the DAX baseline
> and publish a stable branch for XFS and BTRFS to build upon for v5.20.

Sounds good to /me...

--D

> > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > right thing, but it's a networking errno.  I suppose livable with if it
> > never escapes the kernel, but if it can get back to userspace then a
> > user would be justified in wondering how the heck a filesystem
> > operation generated a networking errno?



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Darrick J. Wong
On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> On Tue, 10 May 2022 18:55:50 -0700 Dan Williams  
> wrote:
> 
> > > It'll need to be a stable branch somewhere, but I don't think it
> > > really matters where as long as it's merged into the xfs for-next
> > > tree so it gets filesystem test coverage...
> > 
> > So how about let the notify_failure() bits go through -mm this cycle,
> > if Andrew will have it, and then the reflink work has a clean v5.19-rc1
> > baseline to build from?
> 
> What are we referring to here?  I think a minimal thing would be the
> memremap.h and memory-failure.c changes from
> https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com ?
> 
> Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> would probably be straining things to slip it into 5.19.
> 
> The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> right thing, but it's a networking errno.  I suppose livable with if it
> never escapes the kernel, but if it can get back to userspace then a
> user would be justified in wondering how the heck a filesystem
> operation generated a networking errno?

Most filesystems return EOPNOTSUPP rather enthusiastically when
they don't know how to do something...

--D



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Darrick J. Wong
On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> This is a combination of two patchsets:
>  1.fsdax-rmap: 
> https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
>  2.fsdax-reflink: 
> https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/
> 
>  Changes since v13 of fsdax-rmap:
>   1. Fixed mistakes during rebasing code to latest next-
>   2. Rebased to next-20220504
> 
>  Changes since v10 of fsdax-reflink:
>   1. Rebased to next-20220504 and fsdax-rmap
>   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
>   iter model'
>   3. Fixed many conflicts during rebasing
>   4. Fixed a dedupe bug in Patch 05: the actual length to compare could be
>   shorter than smap->length or dmap->length.
>   PS: There are many changes during rebasing.  I think it's better to
>   review again.
> 
> ==
> Shiyang Ruan (14):
>   fsdax-rmap:
> dax: Introduce holder for dax_device
> mm: factor helpers for memory_failure_dev_pagemap
> pagemap,pmem: Introduce ->memory_failure()
> fsdax: Introduce dax_lock_mapping_entry()
> mm: Introduce mf_dax_kill_procs() for fsdax case

Hmm.  This patchset touches at least the dax, pagecache, and xfs
subsystems.  Assuming it's too late for 5.19, how should we stage this
for 5.20?

I could just add the entire series to iomap-5.20-merge and base the
xfs-5.20-merge off of that?  But I'm not sure what else might be landing
in the other subsystems, so I'm open to input.

--D

> xfs: Implement ->notify_failure() for XFS
> fsdax: set a CoW flag when associate reflink mappings
>   fsdax-reflink:
> fsdax: Output address in dax_iomap_pfn() and rename it
> fsdax: Introduce dax_iomap_cow_copy()
> fsdax: Replace mmap entry in case of CoW
> fsdax: Add dax_iomap_cow_copy() for dax zero
> fsdax: Dedup file range to use a compare function
> xfs: support CoW in fsdax mode
> xfs: Add dax dedupe support
> 
>  drivers/dax/super.c |  67 +-
>  drivers/md/dm.c |   2 +-
>  drivers/nvdimm/pmem.c   |  17 ++
>  fs/dax.c| 398 ++--
>  fs/erofs/super.c|  13 +-
>  fs/ext2/super.c |   7 +-
>  fs/ext4/super.c |   9 +-
>  fs/remap_range.c|  31 ++-
>  fs/xfs/Makefile |   5 +
>  fs/xfs/xfs_buf.c|  10 +-
>  fs/xfs/xfs_file.c   |   9 +-
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_inode.c  |  69 ++-
>  fs/xfs/xfs_inode.h  |   1 +
>  fs/xfs/xfs_iomap.c  |  46 -
>  fs/xfs/xfs_iomap.h  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 220 
>  fs/xfs/xfs_reflink.c|  12 +-
>  fs/xfs/xfs_super.h  |   1 +
>  include/linux/dax.h |  56 -
>  include/linux/fs.h  |  12 +-
>  include/linux/memremap.h|  12 ++
>  include/linux/mm.h  |   2 +
>  include/linux/page-flags.h  |   6 +
>  mm/memory-failure.c | 257 ---
>  26 files changed, 1087 insertions(+), 182 deletions(-)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
> 
> -- 
> 2.35.1
> 
> 
> 



Re: [PATCH v13 5/7] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-04-20 Thread Darrick J. Wong
On Tue, Apr 19, 2022 at 12:50:43PM +0800, Shiyang Ruan wrote:
> This new function is a variant of mf_generic_kill_procs that accepts a
> file, offset pair instead of a struct to support multiple files sharing
> a DAX mapping.  It is intended to be called by the file systems as part
> of the memory_failure handler after the file system performed a reverse
> mapping from the storage address to the file and file offset.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Dan Williams 
> Reviewed-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  include/linux/mm.h  |  2 +
>  mm/memory-failure.c | 96 -
>  2 files changed, 88 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad4b6c15c814..52208d743546 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3233,6 +3233,8 @@ enum mf_flags {
>   MF_SOFT_OFFLINE = 1 << 3,
>   MF_UNPOISON = 1 << 4,
>  };
> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> +   unsigned long count, int mf_flags);
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a40e79e634a4..dc47c5f83d85 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -295,10 +295,9 @@ void shake_page(struct page *p)
>  }
>  EXPORT_SYMBOL_GPL(shake_page);
>  
> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
> - struct vm_area_struct *vma)
> +static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
> + unsigned long address)
>  {
> - unsigned long address = vma_address(page, vma);
>   unsigned long ret = 0;
>   pgd_t *pgd;
>   p4d_t *p4d;
> @@ -338,10 +337,14 @@ static unsigned long dev_pagemap_mapping_shift(struct 
> page *page,
>  /*
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> + *
> + * Notice: @fsdax_pgoff is used only when @p is a fsdax page.
> + *   In other cases, such as anonymous and file-backend page, the address to 
> be
> + *   killed can be caculated by @p itself.
>   */
>  static void add_to_kill(struct task_struct *tsk, struct page *p,
> -struct vm_area_struct *vma,
> -struct list_head *to_kill)
> + pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
> + struct list_head *to_kill)
>  {
>   struct to_kill *tk;
>  
> @@ -352,9 +355,15 @@ static void add_to_kill(struct task_struct *tsk, struct 
> page *p,
>   }
>  
>   tk->addr = page_address_in_vma(p, vma);
> - if (is_zone_device_page(p))
> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> - else
> + if (is_zone_device_page(p)) {
> + /*
> +  * Since page->mapping is not used for fsdax, we need
> +  * calculate the address based on the vma.
> +  */
> + if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
> + tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
> + tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> + } else
>   tk->size_shift = page_shift(compound_head(p));
>  
>   /*
> @@ -503,7 +512,7 @@ static void collect_procs_anon(struct page *page, struct 
> list_head *to_kill,
>   if (!page_mapped_in_vma(page, vma))
>   continue;
>   if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, 0, vma, to_kill);
>   }
>   }
>   read_unlock(&tasklist_lock);
> @@ -539,13 +548,41 @@ static void collect_procs_file(struct page *page, 
> struct list_head *to_kill,
>* to be informed of all such data corruptions.
>*/
>   if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, 0, vma, to_kill);
>   }
>   }
>   read_unlock(&tasklist_lock);
>   i_mmap_unlock_read(mapping);
>  }
>  
> +#if IS_ENABLED(CONFIG_FS_DAX)
> +/*
> + * Collect processes when the error hit a fsdax page.
> + */
> +static void collect_procs_fsdax(struct page *page,
> + struct address_space *mapping, pgoff_t pgoff

Re: [PATCH v13 4/7] fsdax: Introduce dax_lock_mapping_entry()

2022-04-20 Thread Darrick J. Wong
On Tue, Apr 19, 2022 at 12:50:42PM +0800, Shiyang Ruan wrote:
> The current dax_lock_page() locks dax entry by obtaining mapping and
> index in page.  To support 1-to-N RMAP in NVDIMM, we need a new function
> to lock a specific dax entry corresponding to this file's mapping,index.
> And output the page corresponding to the specific dax entry for caller
> use.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/dax.c| 63 +
>  include/linux/dax.h | 15 +++
>  2 files changed, 78 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1ac12e877f4f..57efd3f73655 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t 
> cookie)
>   dax_unlock_entry(&xas, (void *)cookie);
>  }
>  
> +/*
> + * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
> + * @mapping: the file's mapping whose entry we want to lock
> + * @index: the offset within this file
> + * @page: output the dax page corresponding to this dax entry
> + *
> + * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
> + * could not be locked.
> + */
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t 
> index,
> + struct page **page)
> +{
> + XA_STATE(xas, NULL, 0);
> + void *entry;
> +
> + rcu_read_lock();
> + for (;;) {
> + entry = NULL;
> + if (!dax_mapping(mapping))
> + break;
> +
> + xas.xa = &mapping->i_pages;
> + xas_lock_irq(&xas);
> + xas_set(&xas, index);
> + entry = xas_load(&xas);
> + if (dax_is_locked(entry)) {
> + rcu_read_unlock();
> + wait_entry_unlocked(&xas, entry);
> + rcu_read_lock();
> + continue;
> + }
> + if (!entry ||
> + dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + /*
> +  * Because we are looking for entry from file's mapping
> +  * and index, so the entry may not be inserted for now,
> +  * or even a zero/empty entry.  We don't think this is
> +  * an error case.  So, return a special value and do
> +  * not output @page.
> +  */
> + entry = (void *)~0UL;

In this case we exit to the caller with the magic return value, having
not set *page.  Either the comment for this function should note that
the caller must set *page to a known value (NULL?) before the call, or
we should set *page = NULL here.

AFAICT the callers in this series initialize page to NULL before passing
in &page, so I think the comment update would be fine.
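
For illustration, the calling convention I have in mind looks roughly like
this.  It is a userspace model: struct page, model_page and both functions
are hypothetical stand-ins, and only the ~0UL "nothing mapped here" cookie
and the 0 "could not lock" return are taken from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

struct page { int id; };             /* hypothetical stand-in */
typedef unsigned long dax_entry_t;   /* models the kernel typedef */
#define MODEL_NO_ENTRY (~0UL)        /* models the (void *)~0UL cookie */

static struct page model_page = { 42 };

/*
 * Model of the lookup.  When nothing is mapped at the index it returns
 * the magic cookie and leaves *page untouched, like the quoted code.
 */
dax_entry_t model_lock_mapping_entry(int has_entry, struct page **page)
{
	assert(page != NULL);

	if (!has_entry)
		return MODEL_NO_ENTRY;

	*page = &model_page;
	return (dax_entry_t)1;
}

/* A caller that honours the contract: set page before the call. */
int model_caller(int has_entry)
{
	struct page *page = NULL;  /* required: callee may not write it */
	dax_entry_t cookie = model_lock_mapping_entry(has_entry, &page);

	if (cookie == 0)
		return -1;   /* entry could not be locked */
	if (page == NULL)
		return 0;    /* nothing mapped at this index; not an error */
	return page->id;
}
```

Without the NULL initialisation, the "nothing mapped" path would leave the
caller testing an indeterminate pointer.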

With the **page requirement documented,
Reviewed-by: Darrick J. Wong 

--D


> + } else {
> + *page = pfn_to_page(dax_to_pfn(entry));
> + dax_lock_entry(&xas, entry);
> + }
> + xas_unlock_irq(&xas);
> + break;
> + }
> + rcu_read_unlock();
> + return (dax_entry_t)entry;
> +}
> +
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
> + dax_entry_t cookie)
> +{
> + XA_STATE(xas, &mapping->i_pages, index);
> +
> + if (cookie == ~0UL)
> + return;
> +
> + dax_unlock_entry(&xas, (void *)cookie);
> +}
> +
>  /*
>   * Find page cache entry at given index. If it is a DAX entry, return it
>   * with the entry locked. If the page cache doesn't contain an entry at
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9c426a207ba8..c152f315d1c9 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -143,6 +143,10 @@ struct page *dax_layout_busy_page(struct address_space 
> *mapping);
>  struct page *dax_layout_busy_page_range(struct address_space *mapping, 
> loff_t start, loff_t end);
>  dax_entry_t dax_lock_page(struct page *page);
>  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
> + unsigned long index, struct page **page);
> +void dax_unlock_mapping_entry(struct address_space *mapping,
> + unsigned long index, dax_entry_t cookie);
>  #else
>  static inline struct page *dax_layout_busy_page(struct address_space 
> *mapping)
>  {
> @@ -170,6 +174,17 @@ static inline dax_entry_t dax_lock_page(struct page 
> *page)
>  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  {
>  }

Re: [PATCH v13 3/7] pagemap,pmem: Introduce ->memory_failure()

2022-04-20 Thread Darrick J. Wong
On Tue, Apr 19, 2022 at 12:50:41PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the filesystem in which
> the corrupted page located in.
> 
> With dax_holder notify support, we are able to notify the memory failure
> from pmem driver to upper layers.  If there is something not supported in
> the notify routine, memory_failure will fall back to the generic handler.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Dan Williams 

Looks good to me now that we've ironed out the earlier unit questions,
Reviewed-by: Darrick J. Wong 
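
For anyone following along, the callback quoted below takes a pfn and a
page count and hands a byte offset and byte length to
dax_holder_notify_failure().  A small userspace sketch of that arithmetic,
assuming 4 KiB pages; struct model_pmem and both helpers are hypothetical
names, only the formula comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical subset of struct pmem_device, all values in bytes. */
struct model_pmem {
	uint64_t phys_addr;    /* physical base address of the namespace */
	uint64_t data_offset;  /* space reserved ahead of the data area */
};

/* Byte offset of a failing pfn within the dax device's data area. */
uint64_t model_failure_offset(const struct model_pmem *pmem, uint64_t pfn)
{
	uint64_t phys = pfn << MODEL_PAGE_SHIFT;  /* what PFN_PHYS() does */

	/* The pfn must lie inside the data area for this to make sense. */
	assert(phys >= pmem->phys_addr + pmem->data_offset);
	return phys - pmem->phys_addr - pmem->data_offset;
}

/* Length in bytes covered by nr_pages pages. */
uint64_t model_failure_len(uint64_t nr_pages)
{
	return nr_pages << MODEL_PAGE_SHIFT;
}
```

With a base of 0x100000000, a data offset of 0x200000 and pfn 0x100300,
the offset works out to 0x100000 bytes and one page is 4096 bytes long.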

--D

> ---
>  drivers/nvdimm/pmem.c| 17 +
>  include/linux/memremap.h | 12 
>  mm/memory-failure.c  | 14 ++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 58d95242a836..bd502957cfdf 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -366,6 +366,21 @@ static void pmem_release_disk(void *__pmem)
>   blk_cleanup_disk(pmem->disk);
>  }
>  
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + unsigned long pfn, unsigned long nr_pages, int mf_flags)
> +{
> + struct pmem_device *pmem =
> + container_of(pgmap, struct pmem_device, pgmap);
> + u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
> + u64 len = nr_pages << PAGE_SHIFT;
> +
> + return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
> +}
> +
> +static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> + .memory_failure = pmem_pagemap_memory_failure,
> +};
> +
>  static int pmem_attach_disk(struct device *dev,
>   struct nd_namespace_common *ndns)
>  {
> @@ -427,6 +442,7 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->pfn_flags = PFN_DEV;
>   if (is_nd_pfn(dev)) {
>   pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> + pmem->pgmap.ops = &fsdax_pagemap_ops;
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
>   pfn_sb = nd_pfn->pfn_sb;
>   pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
> @@ -440,6 +456,7 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->pgmap.range.end = res->end;
>   pmem->pgmap.nr_range = 1;
>   pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> + pmem->pgmap.ops = &fsdax_pagemap_ops;
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
>   pmem->pfn_flags |= PFN_MAP;
>   bb_range = pmem->pgmap.range;
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index ad6062d736cd..bcfb6bf4ce5a 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -79,6 +79,18 @@ struct dev_pagemap_ops {
>* the page back to a CPU accessible page.
>*/
>   vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> + /*
> +  * Handle the memory failure happens on a range of pfns.  Notify the
> +  * processes who are using these pfns, and try to recover the data on
> +  * them if necessary.  The mf_flags is finally passed to the recover
> +  * function through the whole notify routine.
> +  *
> +  * When this is not implemented, or it returns -EOPNOTSUPP, the caller
> +  * will fall back to a common handler called mf_generic_kill_procs().
> +  */
> + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> +   unsigned long nr_pages, int mf_flags);
>  };
>  
>  #define PGMAP_ALTMAP_VALID   (1 << 0)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7c8c047bfdc8..a40e79e634a4 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1741,6 +1741,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>   if (!pgmap_pfn_valid(pgmap, pfn))
>   goto out;
>  
> + /*
> +  * Call driver's implementation to handle the memory failure, otherwise
> +  * fall back to generic handler.
> +  */
> + if (pgmap->ops->memory_failure) {
> + rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
> + /*
> +  * Fall back to generic handler too if operation is not
> +  * supported inside the driver/device/filesystem.
> +  */
> + if (rc != -EOPNOTSUPP)
> + goto out;
> + }
> +
>   rc = mf_generic_kill_procs(pfn, flags, pgmap);
>  out:
>   /* drop pgmap ref acquired in caller */
> -- 
> 2.35.1
> 
> 
> 
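
The fallback rule in the mm/memory-failure.c hunk above can be modelled in
userspace. The following is only a sketch: struct pagemap, struct pagemap_ops,
generic_kill_procs() and the two sample handlers are hypothetical stand-ins,
not the kernel definitions. It also guards against a NULL ops pointer, which
the quoted hunk does not do; whether the kernel needs that check is not
something this sketch settles.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures in the patch. */
struct pagemap;

struct pagemap_ops {
	/* Optional: NULL when the driver provides no handler. */
	int (*memory_failure)(struct pagemap *pgmap, unsigned long pfn,
			      unsigned long nr_pages, int mf_flags);
};

struct pagemap {
	const struct pagemap_ops *ops;
};

/* Marker returned by the stand-in for mf_generic_kill_procs(). */
enum { GENERIC_HANDLED = 1 };

static int generic_kill_procs(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	return GENERIC_HANDLED;
}

/*
 * Try the driver's handler first.  Use the generic path when there is no
 * handler, or when the handler reports -EOPNOTSUPP.
 */
static int handle_failure(struct pagemap *pgmap, unsigned long pfn, int flags)
{
	assert(pgmap != NULL);

	if (pgmap->ops && pgmap->ops->memory_failure) {
		int rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);

		if (rc != -EOPNOTSUPP)
			return rc;
	}
	return generic_kill_procs(pfn, flags);
}

/* Sample handler that claims the failure. */
static int handler_ok(struct pagemap *pgmap, unsigned long pfn,
		      unsigned long nr_pages, int mf_flags)
{
	(void)pgmap; (void)pfn; (void)nr_pages; (void)mf_flags;
	return 0;
}

/* Sample handler that declines, which triggers the fallback. */
static int handler_unsupported(struct pagemap *pgmap, unsigned long pfn,
			       unsigned long nr_pages, int mf_flags)
{
	(void)pgmap; (void)pfn; (void)nr_pages; (void)mf_flags;
	return -EOPNOTSUPP;
}
```

Any return value other than -EOPNOTSUPP, including other errors, ends the
handling in the driver; only "not supported" reaches the generic path.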



Re: [PATCH v13 1/7] dax: Introduce holder for dax_device

2022-04-20 Thread Darrick J. Wong
On Tue, Apr 19, 2022 at 12:50:39PM +0800, Shiyang Ruan wrote:
> To easily track filesystem from a pmem device, we introduce a holder for
> dax_device structure, and also its operation.  This holder is used to
> remember who is using this dax_device:
>  - When it is the backend of a filesystem, the holder will be the
>instance of this filesystem.
>  - When this pmem device is one of the targets in a mapped device, the
>holder will be this mapped device.  In this case, the mapped device
>has its own dax_device and it will follow the first rule.  So that we
>can finally track to the filesystem we needed.
> 
> The holder and holder_ops will be set when a filesystem is being mounted,
> or a target device is being activated.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Dan Williams 

LGTM
Reviewed-by: Darrick J. Wong 

--D

> ---
>  drivers/dax/super.c | 67 -
>  drivers/md/dm.c |  2 +-
>  fs/erofs/super.c| 10 ---
>  fs/ext2/super.c |  7 +++--
>  fs/ext4/super.c |  9 +++---
>  fs/xfs/xfs_buf.c|  5 ++--
>  include/linux/dax.h | 33 --
>  7 files changed, 110 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0211e6f7b47a..5ddb159c4653 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -22,6 +22,8 @@
>   * @private: dax driver private data
>   * @flags: state and boolean properties
>   * @ops: operations for this device
> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
> + * @holder_ops: operations for the inner holder
>   */
>  struct dax_device {
>   struct inode inode;
> @@ -29,6 +31,8 @@ struct dax_device {
>   void *private;
>   unsigned long flags;
>   const struct dax_operations *ops;
> + void *holder_data;
> + const struct dax_holder_operations *holder_ops;
>  };
>  
>  static dev_t dax_devt;
> @@ -71,8 +75,11 @@ EXPORT_SYMBOL_GPL(dax_remove_host);
>   * fs_dax_get_by_bdev() - temporary lookup mechanism for filesystem-dax
>   * @bdev: block device to find a dax_device for
>   * @start_off: returns the byte offset into the dax_device that @bdev starts
> + * @holder: filesystem or mapped device inside the dax_device
> + * @ops: operations for the inner holder
>   */
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off)
> +struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off,
> + void *holder, const struct dax_holder_operations *ops)
>  {
>   struct dax_device *dax_dev;
>   u64 part_size;
> @@ -92,11 +99,26 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off)
>   dax_dev = xa_load(&dax_hosts, (unsigned long)bdev->bd_disk);
>   if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
>   dax_dev = NULL;
> + else if (holder) {
> + if (!cmpxchg(&dax_dev->holder_data, NULL, holder))
> + dax_dev->holder_ops = ops;
> + else
> + dax_dev = NULL;
> + }
>   dax_read_unlock(id);
>  
>   return dax_dev;
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> +
> +void fs_put_dax(struct dax_device *dax_dev, void *holder)
> +{
> + if (dax_dev && holder &&
> + cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
> + dax_dev->holder_ops = NULL;
> + put_dax(dax_dev);
> +}
> +EXPORT_SYMBOL_GPL(fs_put_dax);
>  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>  
>  enum dax_device_flags {
> @@ -194,6 +216,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_zero_page_range);
>  
> +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> +   u64 len, int mf_flags)
> +{
> + int rc, id;
> +
> + id = dax_read_lock();
> + if (!dax_alive(dax_dev)) {
> + rc = -ENXIO;
> + goto out;
> + }
> +
> + if (!dax_dev->holder_ops) {
> + rc = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +out:
> + dax_read_unlock(id);
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> +
>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>  void arch_wb_cache_pmem(void *addr, size_t size);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> @@ -267,8 +312,15 @@ void kill_dax(struct dax_device *dax_dev)
>   
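
The holder claim and release in the hunks above rely on compare-and-exchange
so that only one holder can own a dax_device at a time. A minimal userspace
model follows, using C11 atomics instead of the kernel's cmpxchg(); struct
dax_dev_model, claim_holder() and release_holder() are hypothetical names
chosen for this sketch. Like the quoted patch, it sets holder_ops only after
the claim succeeds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct holder_ops;	/* opaque in this sketch */

/* Hypothetical model of the two holder fields added to struct dax_device. */
struct dax_dev_model {
	_Atomic(void *) holder_data;
	const struct holder_ops *holder_ops;
};

/* Claim the device for @holder; fails if another holder got there first. */
static bool claim_holder(struct dax_dev_model *dev, void *holder,
			 const struct holder_ops *ops)
{
	void *expected = NULL;

	assert(holder != NULL);
	if (!atomic_compare_exchange_strong(&dev->holder_data, &expected, holder))
		return false;
	dev->holder_ops = ops;
	return true;
}

/* Release the device, but only if @holder is the current owner. */
static bool release_holder(struct dax_dev_model *dev, void *holder)
{
	void *expected = holder;

	if (!atomic_compare_exchange_strong(&dev->holder_data, &expected, NULL))
		return false;
	dev->holder_ops = NULL;
	return true;
}
```

With this shape, two filesystems racing to claim the same device cannot both
succeed: the second caller sees a non-NULL holder_data and backs off.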

Re: [PATCH v13 7/7] fsdax: set a CoW flag when associate reflink mappings

2022-04-20 Thread Darrick J. Wong
vmf->address);
> + dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> + false);
>   /*
>* Only swap our new entry into the page cache if the current
>* entry is a zero page or an empty entry.  If a normal PTE or
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index d725a2d17806..5b601e375773 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -650,6 +650,12 @@ __PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
>  #define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
>  #define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
>  
> +/*
> + * Different with flags above, this flag is used only for fsdax mode.  It
> + * indicates that this page->mapping is now under reflink case.
> + */
> +#define PAGE_MAPPING_DAX_COW 0x1

The logic looks sound enough, I guess.

Though I do wonder -- if this were defined like this:

#define PAGE_MAPPING_DAX_COW	((struct address_space *)0x1)

Could you then avoid all uintptr_t/unsigned long casts above?

It's probably not worth holding up the whole patchset though, so
Reviewed-by: Darrick J. Wong 

--D

> +
>  static __always_inline int PageMappingFlags(struct page *page)
>  {
>   return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0;
> -- 
> 2.35.1
> 
> 
> 
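
The suggestion in the reply above, giving the flag a pointer type so callers
need no integer casts, can be illustrated outside the kernel. This is only a
sketch: struct page_model, MAPPING_DAX_COW, mapping_is_cow() and mark_cow()
are hypothetical names, and it assumes the flag is only assigned and compared
for equality, which the hunks quoted here do not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct address_space;	/* opaque; only pointers to it are used */

/* Hypothetical model of the one struct page field that matters here. */
struct page_model {
	struct address_space *mapping;
};

/* Pointer-typed sentinel: assignments and comparisons need no casts. */
#define MAPPING_DAX_COW	((struct address_space *)0x1)

static bool mapping_is_cow(const struct page_model *page)
{
	return page->mapping == MAPPING_DAX_COW;
}

static void mark_cow(struct page_model *page)
{
	assert(page != NULL);
	page->mapping = MAPPING_DAX_COW;
}
```

Code that tests flag bits with a mask, as PageMappingFlags() does, would still
need a cast to an integer type, so the saving applies to the equality and
assignment sites only.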



Re: [PATCH v13.1 6/7] xfs: Implement ->notify_failure() for XFS

2022-04-20 Thread Darrick J. Wong
On Wed, Apr 20, 2022 at 03:33:42PM +0800, Shiyang Ruan wrote:
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS is enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 

Looks good now, thank you for your persistence!
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/Makefile |   5 +
>  fs/xfs/xfs_buf.c|  11 +-
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 220 
>  fs/xfs/xfs_super.h  |   1 +
>  6 files changed, 238 insertions(+), 3 deletions(-)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..09f5560e29f2 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -128,6 +128,11 @@ xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o
>  
> +# notify failure
> +ifeq ($(CONFIG_MEMORY_FAILURE),y)
> +xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o
> +endif
> +
>  # online scrub/repair
>  ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
>  
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f9ca08398d32..084455f7e2ff 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -5,6 +5,7 @@
>   */
>  #include "xfs.h"
>  #include 
> +#include 
>  
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
> @@ -1911,7 +1912,7 @@ xfs_free_buftarg(
>   list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> - fs_put_dax(btp->bt_daxdev, NULL);
> + fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  
>   kmem_free(btp);
>  }
> @@ -1958,14 +1959,18 @@ xfs_alloc_buftarg(
>   struct block_device *bdev)
>  {
>   xfs_buftarg_t   *btp;
> + const struct dax_holder_operations *ops = NULL;
>  
> +#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
> + ops = &xfs_dax_holder_operations;
> +#endif
>   btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
>  
>   btp->bt_mount = mp;
>   btp->bt_dev =  bdev->bd_dev;
>   btp->bt_bdev = bdev;
> - btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
> - NULL);
> + btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
> + mp, ops);
>  
>   /*
>* Buffer IO error rate limiting. Limit it to no more than 10 messages
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 68f74549fa22..56530900bb86 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -536,6 +536,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index f6dc19de8322..9237cc159542 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -435,6 +435,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR  0x0002  /* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT  0x0004  /* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_ONDISK  0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> new file mode 100644
> index ..aa8dc27c599c
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -0,0 +1,220 @@
> +// SPDX-License-Identif

Re: [PATCH v13 6/7] xfs: Implement ->notify_failure() for XFS

2022-04-19 Thread Darrick J. Wong
On Tue, Apr 19, 2022 at 12:50:44PM +0800, Shiyang Ruan wrote:
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS is enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/xfs/Makefile |   5 +
>  fs/xfs/xfs_buf.c|  11 +-
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 220 
>  fs/xfs/xfs_super.h  |   1 +
>  6 files changed, 238 insertions(+), 3 deletions(-)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..09f5560e29f2 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -128,6 +128,11 @@ xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o
>  
> +# notify failure
> +ifeq ($(CONFIG_MEMORY_FAILURE),y)
> +xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o
> +endif
> +
>  # online scrub/repair
>  ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
>  
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f9ca08398d32..084455f7e2ff 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -5,6 +5,7 @@
>   */
>  #include "xfs.h"
>  #include 
> +#include 
>  
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
> @@ -1911,7 +1912,7 @@ xfs_free_buftarg(
>   list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> - fs_put_dax(btp->bt_daxdev, NULL);
> + fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  
>   kmem_free(btp);
>  }
> @@ -1958,14 +1959,18 @@ xfs_alloc_buftarg(
>   struct block_device *bdev)
>  {
>   xfs_buftarg_t   *btp;
> + const struct dax_holder_operations *ops = NULL;
>  
> +#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
> + ops = &xfs_dax_holder_operations;
> +#endif
>   btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
>  
>   btp->bt_mount = mp;
>   btp->bt_dev =  bdev->bd_dev;
>   btp->bt_bdev = bdev;
> - btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
> - NULL);
> + btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
> + mp, ops);
>  
>   /*
>* Buffer IO error rate limiting. Limit it to no more than 10 messages
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 68f74549fa22..56530900bb86 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -536,6 +536,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index f6dc19de8322..9237cc159542 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -435,6 +435,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR  0x0002  /* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT  0x0004  /* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_ONDISK  0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> new file mode 100644
> index ..0702a402688a
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -0,0 +1,220 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
> + */
> +
> +#include "xfs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_alloc.h"
> +#include "xfs_bit.h"
> +#include "xfs_btree.h"
> +#include "xfs_inode.h"
> +#include "xfs_icache.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_trans.h"
> +
> +#include 
> +#include 
> +
> +struct failure_info {
> + xfs_agblock_t

Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-04-06 Thread Darrick J. Wong
On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> On Tue, Apr 5, 2022 at 5:55 PM Jane Chu  wrote:
> >
> > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > >>> As the code I pasted before, pmem driver will subtract its 
> > >>> ->data_offset,
> > >>> which is byte-based. And the filesystem who implements 
> > >>> ->notify_failure()
> > >>> will calculate the offset in unit of byte again.
> > >>>
> > >>> So, leave its function signature byte-based, to avoid repeated 
> > >>> conversions.
> > >>
> > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > >
> > > FWIW I'd convinced myself that the reason for using byte units is to
> > > make it possible to reduce the pmem failure blast radius to subpage
> > > units... but then I've also been distracted for months. :/
> > >
> >
> > Yes, thanks Darrick!  I recall that.
> > Maybe just add a comment about why byte unit is used?
> 
> I think we start with page failure notification and then figure out
> how to get finer grained through the dax interface in follow-on
> changes. Otherwise, for finer grained error handling support,
> memory_failure() would also need to be converted to stop upcasting
> cache-line granularity to page granularity failures. The native MCE
> notification communicates a 'struct mce' that can be in terms of
> sub-page bytes, but the memory management implications are all page
> based. I assume the FS implications are all FS-block-size based?

I wouldn't necessarily make that assumption -- for regular files, the
user program is in a better position to figure out how to reset the file
contents.

For fs metadata, it really depends.  In principle, if (say) we could get
byte granularity poison info, we could look up the space usage within
the block to decide if the poisoned part was actually free space, in
which case we can correct the problem by (re)zeroing the affected bytes
to clear the poison.

Obviously, if the blast radius hits the internal space info or something
that was storing useful data, then you'd have to rebuild the whole block
(or the whole data structure), but that's not necessarily a given.

--D
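
The free-space idea in the reply above can be sketched as a range check:
given the used byte ranges inside one block and the poisoned byte range,
decide whether the damage touches any used bytes. This is only an
illustration; struct used_range and poison_in_free_space() are hypothetical,
and a real filesystem would derive the used ranges from its own metadata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A used byte range [start, start + len) inside one block (hypothetical). */
struct used_range {
	uint32_t start;
	uint32_t len;
};

/*
 * Return true when the poisoned bytes [off, off + len) touch none of the
 * used ranges, i.e. the damage is confined to free space and could be
 * repaired by rewriting zeroes.
 */
static bool poison_in_free_space(const struct used_range *used, size_t nr,
				 uint32_t off, uint32_t len)
{
	uint64_t end = (uint64_t)off + len;

	assert(len > 0);
	for (size_t i = 0; i < nr; i++) {
		uint64_t u_start = used[i].start;
		uint64_t u_end = u_start + used[i].len;

		/* Half-open ranges overlap when each starts before the other ends. */
		if (off < u_end && u_start < end)
			return false;
	}
	return true;
}
```

With page-granularity reports the poisoned range would cover every used byte
in the page, so a check like this only helps once sub-page information is
available.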




Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

2022-03-30 Thread Darrick J. Wong
On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > As the code I pasted before, pmem driver will subtract its ->data_offset,
> > which is byte-based. And the filesystem who implements ->notify_failure()
> > will calculate the offset in unit of byte again.
> > 
> > So, leave its function signature byte-based, to avoid repeated conversions.
> 
> I'm actually fine either way, so I'll wait for Dan to comment.

FWIW I'd convinced myself that the reason for using byte units is to
make it possible to reduce the pmem failure blast radius to subpage
units... but then I've also been distracted for months. :/

--D
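
For readers following the units discussion, here is a rough sketch of the
conversion being described: a page frame number becomes a byte offset into
the dax device once the driver's data offset is subtracted. The names
struct pmem_model, pfn_to_dax_offset() and pages_to_bytes() are hypothetical,
4 KiB pages are assumed, and subtracting the namespace's physical start
address is an assumption of this sketch rather than something stated in the
thread.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical subset of the pmem device description. */
struct pmem_model {
	uint64_t phys_addr;	/* physical start of the namespace */
	uint64_t data_offset;	/* bytes reserved ahead of the data area */
};

/* Convert a failing pfn into a byte offset within the dax device. */
static uint64_t pfn_to_dax_offset(const struct pmem_model *pmem, uint64_t pfn)
{
	uint64_t phys = pfn << SKETCH_PAGE_SHIFT;

	/* The pfn must lie inside the data area for the result to make sense. */
	assert(phys >= pmem->phys_addr + pmem->data_offset);
	return phys - pmem->phys_addr - pmem->data_offset;
}

/* Convert a page count into a byte length. */
static uint64_t pages_to_bytes(uint64_t nr_pages)
{
	return nr_pages << SKETCH_PAGE_SHIFT;
}
```

Because data_offset is a byte quantity, doing this subtraction in bytes once
and passing bytes onward avoids converting back and forth between pages and
bytes at each layer.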



Re: [PATCH v10.1 8/9] xfs: Implement ->notify_failure() for XFS

2022-02-14 Thread Darrick J. Wong
On Sun, Feb 13, 2022 at 09:02:24PM +0800, Shiyang Ruan wrote:
> v10.1 update:
>  - Handle the error code returns by dax_register_holder()
>  - In v10.1, dax_register_holder() will hold a write lock so XFS
>  doesn't need to hold a lock
>  - Fix the mistake in failure notification over two AGs
>  - Fix the year in copyright message
> 
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS is enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/Makefile |   1 +
>  fs/xfs/xfs_buf.c|  12 ++
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 225 
>  fs/xfs/xfs_notify_failure.h |  10 ++
>  6 files changed, 252 insertions(+)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
>  create mode 100644 fs/xfs/xfs_notify_failure.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..389970b3e13b 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -84,6 +84,7 @@ xfs-y   += xfs_aops.o \
>  xfs_message.o \
>  xfs_mount.o \
>  xfs_mru_cache.o \
> +xfs_notify_failure.o \
>  xfs_pwork.o \
>  xfs_reflink.o \
>  xfs_stats.o \
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b45e0d50a405..941e8825cee6 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -19,6 +19,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_ag.h"
> +#include "xfs_notify_failure.h"
>  
>  static struct kmem_cache *xfs_buf_cache;
>  
> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
>   list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> + if (btp->bt_daxdev)
> + dax_unregister_holder(btp->bt_daxdev);
>   fs_put_dax(btp->bt_daxdev);
>  
>   kmem_free(btp);
> @@ -1939,6 +1942,7 @@ xfs_alloc_buftarg(
>   struct block_device *bdev)
>  {
>   xfs_buftarg_t   *btp;
> + int error;
>  
>   btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
>  
> @@ -1946,6 +1950,14 @@ xfs_alloc_buftarg(
>   btp->bt_dev =  bdev->bd_dev;
>   btp->bt_bdev = bdev;
>   btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
> + if (btp->bt_daxdev) {
> + error = dax_register_holder(btp->bt_daxdev, mp,
> + &xfs_dax_holder_operations);
> + if (error) {
> + xfs_err(mp, "DAX device already in use?!");
> + goto error_free;
> + }
> + }
>  
>   /*
>* Buffer IO error rate limiting. Limit it to no more than 10 messages
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..d4d36c5bef11 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 00720a02e761..47ff4ac53c4c 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -435,6 +435,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR  0x0002  /* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT  0x0004  /* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_ONDISK  0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> new file mode 100644
> index ..aa67662210a1
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -0,0 +1,225 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
> + */
> +
> +#include "xfs.h"
> +#include "xfs_shared.h"
> 

Re: [PATCH v10 8/9] xfs: Implement ->notify_failure() for XFS

2022-02-01 Thread Darrick J. Wong
On Thu, Jan 27, 2022 at 08:40:57PM +0800, Shiyang Ruan wrote:
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS is enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/Makefile |   1 +
>  fs/xfs/xfs_buf.c|  12 ++
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 222 
>  fs/xfs/xfs_notify_failure.h |  10 ++
>  6 files changed, 249 insertions(+)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
>  create mode 100644 fs/xfs/xfs_notify_failure.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..389970b3e13b 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -84,6 +84,7 @@ xfs-y   += xfs_aops.o \
>  xfs_message.o \
>  xfs_mount.o \
>  xfs_mru_cache.o \
> +xfs_notify_failure.o \
>  xfs_pwork.o \
>  xfs_reflink.o \
>  xfs_stats.o \
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b45e0d50a405..017010b3d601 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -19,6 +19,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_ag.h"
> +#include "xfs_notify_failure.h"
>  
>  static struct kmem_cache *xfs_buf_cache;
>  
> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
>   list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> + if (btp->bt_daxdev)
> + dax_unregister_holder(btp->bt_daxdev);
>   fs_put_dax(btp->bt_daxdev);
>  
>   kmem_free(btp);
> @@ -1946,6 +1949,15 @@ xfs_alloc_buftarg(
>   btp->bt_dev =  bdev->bd_dev;
>   btp->bt_bdev = bdev;
>   btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
> + if (btp->bt_daxdev) {
> + if (dax_get_holder(btp->bt_daxdev)) {
> + xfs_err(mp, "DAX device already in use?!");
> + goto error_free;
> + }
> +
> + dax_register_holder(btp->bt_daxdev, mp,
> + &xfs_dax_holder_operations);

Um... is XFS required to take a lock here?  How do we prevent parallel
mounts of filesystems on two partitions from breaking each other?

> + }
>  
>   /*
>* Buffer IO error rate limiting. Limit it to no more than 10 messages
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..d4d36c5bef11 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 00720a02e761..47ff4ac53c4c 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -435,6 +435,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR  0x0002  /* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT  0x0004  /* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_ONDISK  0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> new file mode 100644
> index ..6abaa043f4bc
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -0,0 +1,222 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2021 Fujitsu.  All Rights Reserved.

2022?

> + */
> +
> +#include "xfs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_alloc.h"
> +#include "xfs_bit.h"
> +#include "xfs_btree.h"
> +#include "xfs_inode.h"
> +#include "xfs_icache.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> 

Re: [PATCH v9 02/10] dax: Introduce holder for dax_device

2022-01-20 Thread Darrick J. Wong
On Fri, Jan 21, 2022 at 09:26:52AM +0800, Shiyang Ruan wrote:
> 
> 
> On 2022/1/20 16:46, Christoph Hellwig wrote:
> > On Wed, Jan 05, 2022 at 04:12:04PM -0800, Dan Williams wrote:
> > > We ended up with explicit callbacks after hch balked at a notifier
> > > call-chain, but I think we're back to that now. The partition mistake
> > > might be unfixable, but at least bdev_dax_pgoff() is dead. Notifier
> > > call chains have their own locking so, Ruan, this still does not need
> > > to touch dax_read_lock().
> > 
> > I think we have a few options here:
> > 
> >   (1) don't allow error notifications on partitions.  An error return from
> >   the holder registration with proper error handling in the file
> >   system would give us that

Hm, so that means XFS can only support dax+pmem when there aren't
partitions in use?  Ew.

> >   (2) extend the holder mechanism to cover a range

I don't think I was around for the part where "hch balked at a notifier
call chain" -- what were the objections there, specifically?  I would
hope that pmem problems would be infrequent enough that the locking
contention (or rcu expiration) wouldn't be an issue...?

> >   (3) bite the bullet and create a new stacked dax_device for each
> >   partition
> > 
> > I think (1) is the best option for now.  If people really do need
> > partitions we'll have to go for (3)
> 
> Yes, I agree.  I'm doing it the first way right now.
> 
> I think that since we can use namespace to divide a big NVDIMM into multiple
> pmems, partition on a pmem seems not so meaningful.

I'll try to find out what will happen if pmem suddenly stops supporting
partitions...

--D

> 
> --
> Thanks,
> Ruan.
> 
> 



Re: [PATCH v9 02/10] dax: Introduce holder for dax_device

2022-01-05 Thread Darrick J. Wong
On Wed, Jan 05, 2022 at 03:01:22PM -0800, Dan Williams wrote:
> On Wed, Jan 5, 2022 at 2:47 PM Darrick J. Wong  wrote:
> >
> > On Wed, Jan 05, 2022 at 11:20:12AM -0800, Dan Williams wrote:
> > > On Wed, Jan 5, 2022 at 10:56 AM Darrick J. Wong  wrote:
> > > >
> > > > On Wed, Jan 05, 2022 at 10:23:08AM -0800, Dan Williams wrote:
> > > > > On Wed, Jan 5, 2022 at 10:12 AM Darrick J. Wong  
> > > > > wrote:
> > > > > >
> > > > > > On Sun, Dec 26, 2021 at 10:34:31PM +0800, Shiyang Ruan wrote:
> > > > > > > To easily track filesystem from a pmem device, we introduce a 
> > > > > > > holder for
> > > > > > > dax_device structure, and also its operation.  This holder is 
> > > > > > > used to
> > > > > > > remember who is using this dax_device:
> > > > > > >  - When it is the backend of a filesystem, the holder will be the
> > > > > > >instance of this filesystem.
> > > > > > >  - When this pmem device is one of the targets in a mapped 
> > > > > > > device, the
> > > > > > >holder will be this mapped device.  In this case, the mapped 
> > > > > > > device
> > > > > > >has its own dax_device and it will follow the first rule.  So 
> > > > > > > that we
> > > > > > >can finally track to the filesystem we needed.
> > > > > > >
> > > > > > > The holder and holder_ops will be set when filesystem is being 
> > > > > > > mounted,
> > > > > > > or an target device is being activated.
> > > > > > >
> > > > > > > Signed-off-by: Shiyang Ruan 
> > > > > > > ---
> > > > > > >  drivers/dax/super.c | 62 
> > > > > > > +
> > > > > > >  include/linux/dax.h | 29 +
> > > > > > >  2 files changed, 91 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > > > index c46f56e33d40..94c51f2ee133 100644
> > > > > > > --- a/drivers/dax/super.c
> > > > > > > +++ b/drivers/dax/super.c
> > > > > > > @@ -20,15 +20,20 @@
> > > > > > >   * @inode: core vfs
> > > > > > >   * @cdev: optional character interface for "device dax"
> > > > > > >   * @private: dax driver private data
> > > > > > > + * @holder_data: holder of a dax_device: could be filesystem or 
> > > > > > > mapped device
> > > > > > >   * @flags: state and boolean properties
> > > > > > > + * @ops: operations for dax_device
> > > > > > > + * @holder_ops: operations for the inner holder
> > > > > > >   */
> > > > > > >  struct dax_device {
> > > > > > >   struct inode inode;
> > > > > > >   struct cdev cdev;
> > > > > > >   void *private;
> > > > > > >   struct percpu_rw_semaphore rwsem;
> > > > > > > + void *holder_data;
> > > > > > >   unsigned long flags;
> > > > > > >   const struct dax_operations *ops;
> > > > > > > + const struct dax_holder_operations *holder_ops;
> > > > > > >  };
> > > > > > >
> > > > > > >  static dev_t dax_devt;
> > > > > > > @@ -192,6 +197,29 @@ int dax_zero_page_range(struct dax_device 
> > > > > > > *dax_dev, pgoff_t pgoff,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(dax_zero_page_range);
> > > > > > >
> > > > > > > +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 
> > > > > > > off,
> > > > > > > +   u64 len, int mf_flags)
> > > > > > > +{
> > > > > > > + int rc;
> > > > > > > +
> > > > > > > + dax_read_lock(dax_dev);
> > > > > > > + if (!dax_alive(dax_dev)) {
> > > > > > > + rc = -ENXIO;
> > > > > > > +  

Re: [PATCH v9 02/10] dax: Introduce holder for dax_device

2022-01-05 Thread Darrick J. Wong
On Wed, Jan 05, 2022 at 11:20:12AM -0800, Dan Williams wrote:
> On Wed, Jan 5, 2022 at 10:56 AM Darrick J. Wong  wrote:
> >
> > On Wed, Jan 05, 2022 at 10:23:08AM -0800, Dan Williams wrote:
> > > On Wed, Jan 5, 2022 at 10:12 AM Darrick J. Wong  wrote:
> > > >
> > > > On Sun, Dec 26, 2021 at 10:34:31PM +0800, Shiyang Ruan wrote:
> > > > > To easily track filesystem from a pmem device, we introduce a holder 
> > > > > for
> > > > > dax_device structure, and also its operation.  This holder is used to
> > > > > remember who is using this dax_device:
> > > > >  - When it is the backend of a filesystem, the holder will be the
> > > > >instance of this filesystem.
> > > > >  - When this pmem device is one of the targets in a mapped device, the
> > > > >holder will be this mapped device.  In this case, the mapped device
> > > > >has its own dax_device and it will follow the first rule.  So that 
> > > > > we
> > > > >can finally track to the filesystem we needed.
> > > > >
> > > > > The holder and holder_ops will be set when filesystem is being 
> > > > > mounted,
> > > > > or an target device is being activated.
> > > > >
> > > > > Signed-off-by: Shiyang Ruan 
> > > > > ---
> > > > >  drivers/dax/super.c | 62 
> > > > > +
> > > > >  include/linux/dax.h | 29 +
> > > > >  2 files changed, 91 insertions(+)
> > > > >
> > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > index c46f56e33d40..94c51f2ee133 100644
> > > > > --- a/drivers/dax/super.c
> > > > > +++ b/drivers/dax/super.c
> > > > > @@ -20,15 +20,20 @@
> > > > >   * @inode: core vfs
> > > > >   * @cdev: optional character interface for "device dax"
> > > > >   * @private: dax driver private data
> > > > > + * @holder_data: holder of a dax_device: could be filesystem or 
> > > > > mapped device
> > > > >   * @flags: state and boolean properties
> > > > > + * @ops: operations for dax_device
> > > > > + * @holder_ops: operations for the inner holder
> > > > >   */
> > > > >  struct dax_device {
> > > > >   struct inode inode;
> > > > >   struct cdev cdev;
> > > > >   void *private;
> > > > >   struct percpu_rw_semaphore rwsem;
> > > > > + void *holder_data;
> > > > >   unsigned long flags;
> > > > >   const struct dax_operations *ops;
> > > > > + const struct dax_holder_operations *holder_ops;
> > > > >  };
> > > > >
> > > > >  static dev_t dax_devt;
> > > > > @@ -192,6 +197,29 @@ int dax_zero_page_range(struct dax_device 
> > > > > *dax_dev, pgoff_t pgoff,
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(dax_zero_page_range);
> > > > >
> > > > > +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> > > > > +   u64 len, int mf_flags)
> > > > > +{
> > > > > + int rc;
> > > > > +
> > > > > + dax_read_lock(dax_dev);
> > > > > + if (!dax_alive(dax_dev)) {
> > > > > + rc = -ENXIO;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + if (!dax_dev->holder_ops) {
> > > > > + rc = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, 
> > > > > mf_flags);
> > > > > +out:
> > > > > + dax_read_unlock(dax_dev);
> > > > > + return rc;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> > > > > +
> > > > >  #ifdef CONFIG_ARCH_HAS_PMEM_API
> > > > >  void arch_wb_cache_pmem(void *addr, size_t size);
> > > > >  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> > > > > @@ -254,6 +282,10 @@ void kill_dax(struct dax_device *dax_dev)
>

Re: [PATCH v9 04/10] pagemap,pmem: Introduce ->memory_failure()

2022-01-05 Thread Darrick J. Wong
On Sun, Dec 26, 2021 at 10:34:33PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the filesystem in which
> the corrupted page located in.
> 
> With dax_holder notify support, we are able to notify the memory failure
> from pmem driver to upper layers.  If something is not supported in
> the notify routine, memory_failure will fall back to the generic handler.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> ---
>  drivers/nvdimm/pmem.c| 16 
>  include/linux/memremap.h |  9 +
>  mm/memory-failure.c  | 14 ++
>  3 files changed, 39 insertions(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4190c8c46ca8..2114554358eb 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -386,6 +386,20 @@ static void pmem_release_disk(void *__pmem)
>   blk_cleanup_disk(pmem->disk);
>  }
>  
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + unsigned long pfn, u64 len, int mf_flags)
> +{
> + struct pmem_device *pmem =
> + container_of(pgmap, struct pmem_device, pgmap);
> + loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;

Use u64 here ^^^ because this isn't a file offset, this is a physical
offset.  Also, loff_t is signed, which you probably don't want.

> +
> + return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
> +}
> +
> +static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> + .memory_failure = pmem_pagemap_memory_failure,
> +};
> +
>  static int pmem_attach_disk(struct device *dev,
>   struct nd_namespace_common *ndns)
>  {
> @@ -448,6 +462,7 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->pfn_flags = PFN_DEV;
>   if (is_nd_pfn(dev)) {
>   pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> + pmem->pgmap.ops = &fsdax_pagemap_ops;
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
>   pfn_sb = nd_pfn->pfn_sb;
>   pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
> @@ -461,6 +476,7 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->pgmap.range.end = res->end;
>   pmem->pgmap.nr_range = 1;
>   pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> + pmem->pgmap.ops = &fsdax_pagemap_ops;
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
>   pmem->pfn_flags |= PFN_MAP;
>   bb_range = pmem->pgmap.range;
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c0e9d35889e8..820c2f33b163 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -87,6 +87,15 @@ struct dev_pagemap_ops {
>* the page back to a CPU accessible page.
>*/
>   vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> + /*
> +  * Handle the memory failure happens on a range of pfns.  Notify the
> +  * processes who are using these pfns, and try to recover the data on
> +  * them if necessary.  The mf_flags is finally passed to the recover
> +  * function through the whole notify routine.


Might want to state here that the generic implementation will be used if
->memory_failure is NULL or calling the function returns -EOPNOTSUPP.

--D
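
For reference, the fallback rule described above can be sketched in plain
userspace C.  This is only an illustration of the dispatch logic in the
memory-failure hunk further down; every name prefixed with sketch_ is
hypothetical, and the real callback takes a struct dev_pagemap and a
length as well.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified ops table with a single optional callback. */
struct sketch_ops {
	int (*memory_failure)(unsigned long pfn, int flags);
};

/* Stand-in for the generic handler; always succeeds in this sketch. */
static int sketch_generic_handler(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	return 0;
}

/* Example driver callback that cannot handle the failure itself. */
static int sketch_driver_unsupported(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	return -EOPNOTSUPP;
}

/* Example driver callback that handles the event but reports an error. */
static int sketch_driver_fails(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	return -EIO;
}

/*
 * Use the driver's callback if there is one.  Fall back to the generic
 * handler when the callback is NULL or returns -EOPNOTSUPP; any other
 * return value (success or error) is final.
 */
static int sketch_handle_failure(const struct sketch_ops *ops,
				 unsigned long pfn, int flags,
				 int *used_generic)
{
	assert(used_generic != NULL);

	*used_generic = 0;
	if (ops && ops->memory_failure) {
		int rc = ops->memory_failure(pfn, flags);

		if (rc != -EOPNOTSUPP)
			return rc;
	}

	*used_generic = 1;
	return sketch_generic_handler(pfn, flags);
}
```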

> +  */
> + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> +   u64 len, int mf_flags);
>  };
>  
>  #define PGMAP_ALTMAP_VALID   (1 << 0)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1ee7d626fed7..3cc612b29f89 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1625,6 +1625,20 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>   if (!pgmap_pfn_valid(pgmap, pfn))
>   goto out;
>  
> + /*
> +  * Call driver's implementation to handle the memory failure, otherwise
> +  * fall back to generic handler.
> +  */
> + if (pgmap->ops->memory_failure) {
> + rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
> + /*
> +  * Fall back to generic handler too if operation is not
> +  * supported inside the driver/device/filesystem.
> +  */
> + if (rc != -EOPNOTSUPP)
> + goto out;
> + }
> +
>   rc = mf_generic_kill_procs(pfn, flags, pgmap);
>  out:
>   /* drop pgmap ref acquired in caller */
> -- 
> 2.34.1
> 
> 
> 



Re: [PATCH v9 02/10] dax: Introduce holder for dax_device

2022-01-05 Thread Darrick J. Wong
On Wed, Jan 05, 2022 at 10:23:08AM -0800, Dan Williams wrote:
> On Wed, Jan 5, 2022 at 10:12 AM Darrick J. Wong  wrote:
> >
> > On Sun, Dec 26, 2021 at 10:34:31PM +0800, Shiyang Ruan wrote:
> > > To easily track filesystem from a pmem device, we introduce a holder for
> > > dax_device structure, and also its operation.  This holder is used to
> > > remember who is using this dax_device:
> > >  - When it is the backend of a filesystem, the holder will be the
> > >instance of this filesystem.
> > >  - When this pmem device is one of the targets in a mapped device, the
> > >holder will be this mapped device.  In this case, the mapped device
> > >has its own dax_device and it will follow the first rule.  So that we
> > >can finally track to the filesystem we needed.
> > >
> > > The holder and holder_ops will be set when filesystem is being mounted,
> > > or an target device is being activated.
> > >
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >  drivers/dax/super.c | 62 +
> > >  include/linux/dax.h | 29 +
> > >  2 files changed, 91 insertions(+)
> > >
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index c46f56e33d40..94c51f2ee133 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -20,15 +20,20 @@
> > >   * @inode: core vfs
> > >   * @cdev: optional character interface for "device dax"
> > >   * @private: dax driver private data
> > > + * @holder_data: holder of a dax_device: could be filesystem or mapped 
> > > device
> > >   * @flags: state and boolean properties
> > > + * @ops: operations for dax_device
> > > + * @holder_ops: operations for the inner holder
> > >   */
> > >  struct dax_device {
> > >   struct inode inode;
> > >   struct cdev cdev;
> > >   void *private;
> > >   struct percpu_rw_semaphore rwsem;
> > > + void *holder_data;
> > >   unsigned long flags;
> > >   const struct dax_operations *ops;
> > > + const struct dax_holder_operations *holder_ops;
> > >  };
> > >
> > >  static dev_t dax_devt;
> > > @@ -192,6 +197,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
> > > pgoff_t pgoff,
> > >  }
> > >  EXPORT_SYMBOL_GPL(dax_zero_page_range);
> > >
> > > +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> > > +   u64 len, int mf_flags)
> > > +{
> > > + int rc;
> > > +
> > > + dax_read_lock(dax_dev);
> > > + if (!dax_alive(dax_dev)) {
> > > + rc = -ENXIO;
> > > + goto out;
> > > + }
> > > +
> > > + if (!dax_dev->holder_ops) {
> > > + rc = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > +
> > > + rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, 
> > > mf_flags);
> > > +out:
> > > + dax_read_unlock(dax_dev);
> > > + return rc;
> > > +}
> > > +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> > > +
> > >  #ifdef CONFIG_ARCH_HAS_PMEM_API
> > >  void arch_wb_cache_pmem(void *addr, size_t size);
> > >  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> > > @@ -254,6 +282,10 @@ void kill_dax(struct dax_device *dax_dev)
> > >   return;
> > >   dax_write_lock(dax_dev);
> > > >   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > +
> > > + /* clear holder data */
> > > + dax_dev->holder_ops = NULL;
> > > + dax_dev->holder_data = NULL;
> > >   dax_write_unlock(dax_dev);
> > >  }
> > >  EXPORT_SYMBOL_GPL(kill_dax);
> > > @@ -401,6 +433,36 @@ void put_dax(struct dax_device *dax_dev)
> > >  }
> > >  EXPORT_SYMBOL_GPL(put_dax);
> > >
> > > +void dax_register_holder(struct dax_device *dax_dev, void *holder,
> > > + const struct dax_holder_operations *ops)
> > > +{
> > > + if (!dax_alive(dax_dev))
> > > + return;
> > > +
> > > + dax_dev->holder_data = holder;
> > > + dax_dev->holder_ops = ops;
> >
> > Shouldn't this return an error code if the dax device is dead or if
> > someone already registered a holder?  I'm pretty sure XFS should not
> > bind to a dax device if someone else already registered for it...
> 
> Agree, yes.
> 
> >
> > ...unless you want to use a notifier chain for failure events so that
> > there can be multiple consumers of dax failure events?
> 
> No, I would hope not. It should be 1:1 holders to dax-devices. Similar
> ownership semantics like bd_prepare_to_claim().

Does each partition on a pmem device still have its own dax_device?

--D



Re: [PATCH v9 09/10] xfs: Implement ->notify_failure() for XFS

2022-01-05 Thread Darrick J. Wong
On Sun, Dec 26, 2021 at 10:34:38PM +0800, Shiyang Ruan wrote:
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/Makefile |   1 +
>  fs/xfs/xfs_buf.c|  15 +++
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 189 
>  fs/xfs/xfs_notify_failure.h |  10 ++
>  6 files changed, 219 insertions(+)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
>  create mode 100644 fs/xfs/xfs_notify_failure.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..389970b3e13b 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -84,6 +84,7 @@ xfs-y   += xfs_aops.o \
>  xfs_message.o \
>  xfs_mount.o \
>  xfs_mru_cache.o \
> +xfs_notify_failure.o \
>  xfs_pwork.o \
>  xfs_reflink.o \
>  xfs_stats.o \
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index bbb0fbd34e64..d0df7604fa9e 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -19,6 +19,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_ag.h"
> +#include "xfs_notify_failure.h"
>  
>  static struct kmem_cache *xfs_buf_cache;
>  
> @@ -1892,6 +1893,8 @@ xfs_free_buftarg(
>   list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> + if (btp->bt_daxdev)
> + dax_unregister_holder(btp->bt_daxdev);
>   fs_put_dax(btp->bt_daxdev);
>  
>   kmem_free(btp);
> @@ -1946,6 +1949,18 @@ xfs_alloc_buftarg(
>   btp->bt_dev =  bdev->bd_dev;
>   btp->bt_bdev = bdev;
>   btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
> + if (btp->bt_daxdev) {
> + dax_write_lock(btp->bt_daxdev);
> + if (dax_get_holder(btp->bt_daxdev)) {
> + dax_write_unlock(btp->bt_daxdev);
> + xfs_err(mp, "DAX device already in use?!");
> + goto error_free;
> + }
> +
> + dax_register_holder(btp->bt_daxdev, mp,
> + &xfs_dax_holder_operations);
> + dax_write_unlock(btp->bt_daxdev);
> + }
>  
>   /*
>* Buffer IO error rate limiting. Limit it to no more than 10 messages
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..d4d36c5bef11 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 00720a02e761..47ff4ac53c4c 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -435,6 +435,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int 
> flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR0x0002  /* write attempt to the log 
> failed */
>  #define SHUTDOWN_FORCE_UMOUNT0x0004  /* shutdown from a forced 
> unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data 
> structures */
> +#define SHUTDOWN_CORRUPT_ONDISK  0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> new file mode 100644
> index ..a87bd08365f4
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2021 Fujitsu.  All Rights Reserved.
> + */
> +
> +#include "xfs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_alloc.h"
> +#include "xfs_bit.h"
> +#include "xfs_btree.h"
> +#include "xfs_inode.h"
> +#include "xfs_icache.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include 

Re: [PATCH v9 02/10] dax: Introduce holder for dax_device

2022-01-05 Thread Darrick J. Wong
On Sun, Dec 26, 2021 at 10:34:31PM +0800, Shiyang Ruan wrote:
> To easily track filesystem from a pmem device, we introduce a holder for
> dax_device structure, and also its operation.  This holder is used to
> remember who is using this dax_device:
>  - When it is the backend of a filesystem, the holder will be the
>instance of this filesystem.
>  - When this pmem device is one of the targets in a mapped device, the
>holder will be this mapped device.  In this case, the mapped device
>has its own dax_device and it will follow the first rule.  So that we
>can finally track to the filesystem we needed.
> 
> The holder and holder_ops will be set when filesystem is being mounted,
> or an target device is being activated.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c | 62 +
>  include/linux/dax.h | 29 +
>  2 files changed, 91 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c46f56e33d40..94c51f2ee133 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -20,15 +20,20 @@
>   * @inode: core vfs
>   * @cdev: optional character interface for "device dax"
>   * @private: dax driver private data
> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
>   * @flags: state and boolean properties
> + * @ops: operations for dax_device
> + * @holder_ops: operations for the inner holder
>   */
>  struct dax_device {
>   struct inode inode;
>   struct cdev cdev;
>   void *private;
>   struct percpu_rw_semaphore rwsem;
> + void *holder_data;
>   unsigned long flags;
>   const struct dax_operations *ops;
> + const struct dax_holder_operations *holder_ops;
>  };
>  
>  static dev_t dax_devt;
> @@ -192,6 +197,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
> pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_zero_page_range);
>  
> +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> +   u64 len, int mf_flags)
> +{
> + int rc;
> +
> + dax_read_lock(dax_dev);
> + if (!dax_alive(dax_dev)) {
> + rc = -ENXIO;
> + goto out;
> + }
> +
> + if (!dax_dev->holder_ops) {
> + rc = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +out:
> + dax_read_unlock(dax_dev);
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> +
>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>  void arch_wb_cache_pmem(void *addr, size_t size);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> @@ -254,6 +282,10 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>   dax_write_lock(dax_dev);
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> +
> + /* clear holder data */
> + dax_dev->holder_ops = NULL;
> + dax_dev->holder_data = NULL;
>   dax_write_unlock(dax_dev);
>  }
>  EXPORT_SYMBOL_GPL(kill_dax);
> @@ -401,6 +433,36 @@ void put_dax(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(put_dax);
>  
> +void dax_register_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> + if (!dax_alive(dax_dev))
> + return;
> +
> + dax_dev->holder_data = holder;
> + dax_dev->holder_ops = ops;

Shouldn't this return an error code if the dax device is dead or if
someone already registered a holder?  I'm pretty sure XFS should not
bind to a dax device if someone else already registered for it...

...unless you want to use a notifier chain for failure events so that
there can be multiple consumers of dax failure events?

--D
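
To make the request above concrete, here is a rough userspace sketch of a
registration helper that reports failure instead of returning void.  It is
an assumption about one possible shape, not the actual fix: the struct, the
sketch_ names, and the specific errno choices are hypothetical, and locking
is left out on the assumption that the caller serializes registration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down model of the holder fields in dax_device. */
struct sketch_dax_device {
	bool alive;
	void *holder_data;
	const void *holder_ops;
};

/*
 * Register a holder.  Returns 0 on success, -EINVAL for a NULL holder,
 * -ENXIO if the device is dead, and -EBUSY if someone else already holds
 * it.  Caller is assumed to provide serialization.
 */
static int sketch_register_holder(struct sketch_dax_device *d,
				  void *holder, const void *ops)
{
	assert(d != NULL);

	if (!holder)
		return -EINVAL;
	if (!d->alive)
		return -ENXIO;
	if (d->holder_data)
		return -EBUSY;

	d->holder_data = holder;
	d->holder_ops = ops;
	return 0;
}

/* Drop the registration, but only if @holder is the current owner. */
static int sketch_unregister_holder(struct sketch_dax_device *d,
				    void *holder)
{
	assert(d != NULL);

	if (d->holder_data != holder)
		return -EINVAL;

	d->holder_data = NULL;
	d->holder_ops = NULL;
	return 0;
}
```

A filesystem calling this at mount time could then refuse to bind when it
sees -EBUSY, rather than silently overwriting the existing holder.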

> +}
> +EXPORT_SYMBOL_GPL(dax_register_holder);
> +
> +void dax_unregister_holder(struct dax_device *dax_dev)
> +{
> + if (!dax_alive(dax_dev))
> + return;
> +
> + dax_dev->holder_data = NULL;
> + dax_dev->holder_ops = NULL;
> +}
> +EXPORT_SYMBOL_GPL(dax_unregister_holder);
> +
> +void *dax_get_holder(struct dax_device *dax_dev)
> +{
> + if (!dax_alive(dax_dev))
> + return NULL;
> +
> + return dax_dev->holder_data;
> +}
> +EXPORT_SYMBOL_GPL(dax_get_holder);
> +
>  /**
>   * inode_dax: convert a public inode into its dax_dev
>   * @inode: An inode with i_cdev pointing to a dax_dev
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a146bfb80804..e16a9e0ee857 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -44,6 +44,22 @@ struct dax_operations {
>  #if IS_ENABLED(CONFIG_DAX)
>  struct dax_device *alloc_dax(void *private, const struct dax_operations *ops,
>   unsigned long flags);
> +struct dax_holder_operations {
> + /*
> +  * notify_failure - notify memory failure into inner holder device
> +  * @dax_dev: the dax device which contains the holder
> +  * @offset: offset on this dax device where memory failure occurs
> +  * 

Re: [PATCH v9 05/10] fsdax: fix function description

2022-01-05 Thread Darrick J. Wong
On Sun, Dec 26, 2021 at 10:34:34PM +0800, Shiyang Ruan wrote:
> The function name has been changed, so the description should be updated
> too.
> 
> Signed-off-by: Shiyang Ruan 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1f46810d4b68..2ee2d5a525ee 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -390,7 +390,7 @@ static struct page *dax_busy_page(void *entry)
>  }
>  
>  /*
> - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
> + * dax_lock_page - Lock the DAX entry corresponding to a page
>   * @page: The page whose entry we want to lock
>   *
>   * Context: Process context.
> -- 
> 2.34.1
> 
> 
> 



Re: [PATCH v9 01/10] dax: Use percpu rwsem for dax_{read,write}_lock()

2022-01-05 Thread Darrick J. Wong
On Tue, Jan 04, 2022 at 02:44:08PM -0800, Dan Williams wrote:
> On Sun, Dec 26, 2021 at 6:35 AM Shiyang Ruan  wrote:
> >
> > In order to introduce dax holder registration, we need a write lock for
> > dax.
> 
> As far as I can see, no, a write lock is not needed while the holder
> is being registered.
> 
> The synchronization that is needed is to make sure that the device
> stays live over the registration event, and that any in-flight holder
> operations are flushed before the device transitions from live to
> dead, and that in turn relates to the live state of the pgmap.
> 
> The dax device cannot switch from live to dead without first flushing
> all readers, so holding dax_read_lock() over the register holder event
> should be sufficient.

...and perhaps add a comment describing that this is what the
synchronization primitive is really protecting against?  The first time
I read through this patchset, I assumed the rwsem was protecting
_hosts and was confused when I saw the one use of dax_write_lock.

--D

> If you are worried about 2 or more potential
> holders colliding at registration time, I would expect that's already
> prevented by block device exclusive holder synchronization, but you
> could also use cmpxchg and a single pointer to a 'struct dax_holder {
> void *holder_data, struct dax_holder_operations *holder_ops }'. If you
> are worried about memory_failure triggering while the filesystem is
> shutting down it can do a synchronize_srcu(&dax_srcu) if it really
> needs to ensure that the notify path is idle after removing the holder
> registration.
> 
> ...are there any cases remaining not covered by the above suggestions?



Re: [PATCH v8 8/9] xfs: Implement ->notify_failure() for XFS

2021-12-14 Thread Darrick J. Wong
On Thu, Dec 02, 2021 at 04:48:55PM +0800, Shiyang Ruan wrote:
> Introduce xfs_notify_failure.c to handle failure related works, such as
> implement ->notify_failure(), register/unregister dax holder in xfs, and
> so on.
> 
> If the rmap feature of XFS enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/Makefile |   1 +
>  fs/xfs/xfs_buf.c|   4 +
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_notify_failure.c | 224 
>  fs/xfs/xfs_notify_failure.h |  15 +++
>  6 files changed, 248 insertions(+)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
>  create mode 100644 fs/xfs/xfs_notify_failure.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..389970b3e13b 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -84,6 +84,7 @@ xfs-y   += xfs_aops.o \
>  xfs_message.o \
>  xfs_mount.o \
>  xfs_mru_cache.o \
> +xfs_notify_failure.o \
>  xfs_pwork.o \
>  xfs_reflink.o \
>  xfs_stats.o \
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index bbb0fbd34e64..40a8916cbbcb 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -19,6 +19,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_ag.h"
> +#include "xfs_notify_failure.h"
>  
>  static struct kmem_cache *xfs_buf_cache;
>  
> @@ -1892,6 +1893,7 @@ xfs_free_buftarg(
>   list_lru_destroy(&btp->bt_lru);
>  
>   blkdev_issue_flush(btp->bt_bdev);
> + xfs_notify_failure_unregister(btp->bt_daxdev);
>   fs_put_dax(btp->bt_daxdev);
>  
>   kmem_free(btp);
> @@ -1947,6 +1949,8 @@ xfs_alloc_buftarg(
>   btp->bt_bdev = bdev;
>   btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
>  
> + xfs_notify_failure_register(mp, btp->bt_daxdev);

There's no _unregister call if the buftarg allocation fails.
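
Roughly what I mean, as a userspace sketch with made-up names (the sketch_
types and the fail_late knob are hypothetical, and the holder count only
stands in for the real registration state): once registration has
happened, every later failure path has to undo it before freeing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical device that just counts registered holders. */
struct sketch_daxdev {
	int holders;
};

struct sketch_buftarg {
	struct sketch_daxdev *daxdev;
};

static void sketch_register(struct sketch_daxdev *d)
{
	if (d)
		d->holders++;
}

static void sketch_unregister(struct sketch_daxdev *d)
{
	if (d)
		d->holders--;
}

/*
 * Allocate a buftarg and register with the dax device.  @fail_late
 * simulates a failure in a later setup step; the error path must
 * unregister before freeing, or the device keeps a stale holder.
 */
static struct sketch_buftarg *
sketch_alloc_buftarg(struct sketch_daxdev *d, int fail_late)
{
	struct sketch_buftarg *btp = calloc(1, sizeof(*btp));

	if (!btp)
		return NULL;

	btp->daxdev = d;
	sketch_register(btp->daxdev);

	if (fail_late)
		goto error_unregister;

	return btp;

error_unregister:
	sketch_unregister(btp->daxdev);
	free(btp);
	return NULL;
}

/* Normal teardown mirrors the setup order in reverse. */
static void sketch_free_buftarg(struct sketch_buftarg *btp)
{
	assert(btp != NULL);

	sketch_unregister(btp->daxdev);
	free(btp);
}
```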

> +
>   /*
>* Buffer IO error rate limiting. Limit it to no more than 10 messages
>* per 30 seconds so as to not spam logs too much on repeated errors.
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..4c2d3d4ca5a5 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_META) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 00720a02e761..7812de2c00a7 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -435,6 +435,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int 
> flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR0x0002  /* write attempt to the log 
> failed */
>  #define SHUTDOWN_FORCE_UMOUNT0x0004  /* shutdown from a forced 
> unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data 
> structures */
> +#define SHUTDOWN_CORRUPT_META0x0010  /* corrupt metadata on device */

SHUTDOWN_CORRUPT_ONDISK?

>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> new file mode 100644
> index ..0c868f89ca3e
> --- /dev/null
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -0,0 +1,224 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2021 Fujitsu.  All Rights Reserved.
> + */
> +
> +#include "xfs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_alloc.h"
> +#include "xfs_bit.h"
> +#include "xfs_btree.h"
> +#include "xfs_inode.h"
> +#include "xfs_icache.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_trans.h"
> +
> +#include 
> +#include 
> +
> +struct failure_info {
> + xfs_agblock_t   startblock;
> + xfs_filblks_t   blockcount;
> + int mf_flags;
> +};
> +
> +static pgoff_t
> +xfs_failure_pgoff(
> + struct xfs_mount*mp,
> + const struct 

Re: [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:57PM +0800, Shiyang Ruan wrote:
> This function is called at the end of RMAP routine, i.e. filesystem
> recovery function, to collect and kill processes using a shared page of
> DAX file.  The difference between mf_generic_kill_procs() is,
> it accepts file's mapping,offset instead of struct page.  Because
> different file's mappings and offsets may share the same page in fsdax
> mode.  So, it is called when filesystem RMAP results are found.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c| 10 --
>  include/linux/dax.h |  9 +
>  include/linux/mm.h  |  2 ++
>  mm/memory-failure.c | 83 -
>  4 files changed, 86 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 509b65e60478..2536c105ec7f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
>   return entry;
>  }
>  
> -static inline
> -unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> -{
> - unsigned long address;
> -
> - address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> - return address;
> -}
> -
>  /* Walk all mappings of a given index of a file and writeprotect them */
>  static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
>   unsigned long pfn)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 65411bee4312..3d90becbd160 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space 
> *mapping)
>  {
>   return mapping->host && IS_DAX(mapping->host);
>  }
> +static inline unsigned long pgoff_address(pgoff_t pgoff,
> + struct vm_area_struct *vma)
> +{
> + unsigned long address;
> +
> + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> + return address;
> +}
>  
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>  void hmem_register_device(int target_nid, struct resource *r);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..d06af0051e53 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3114,6 +3114,8 @@ enum mf_flags {
>   MF_MUST_KILL = 1 << 2,
>   MF_SOFT_OFFLINE = 1 << 3,
>  };
> +extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> +  size_t size, int flags);
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 85eab206b68f..a9d0d487d205 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -302,10 +302,9 @@ void shake_page(struct page *p)
>  }
>  EXPORT_SYMBOL_GPL(shake_page);
>  
> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
> +static unsigned long dev_pagemap_mapping_shift(unsigned long address,
>   struct vm_area_struct *vma)
>  {
> - unsigned long address = vma_address(page, vma);
>   pgd_t *pgd;
>   p4d_t *p4d;
>   pud_t *pud;
> @@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct 
> page *page,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   */
> -static void add_to_kill(struct task_struct *tsk, struct page *p,
> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t 
> pgoff,

Hm, so I guess you're passing the page and the pgoff now because
page->index is meaningless for shared dax pages?  Ok.
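
A minimal userspace sketch of the arithmetic that pgoff_address() performs, assuming 4 KiB pages.  The sketch_* names are made up for illustration and are not kernel API; the struct only mirrors the vm_area_struct fields the helper reads.

```c
#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vm_area_struct fields used here. */
struct sketch_vma {
	unsigned long vm_start;	/* first user address of the mapping */
	unsigned long vm_end;	/* one past the last user address */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Same arithmetic as pgoff_address(): file page offset -> user address. */
static unsigned long sketch_pgoff_address(unsigned long pgoff,
					  const struct sketch_vma *vma)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << SKETCH_PAGE_SHIFT);
}
```

The point is that the address depends only on the file offset and the vma, so it stays well defined even when page->index cannot describe every file that shares the page.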

>  struct vm_area_struct *vma,
>  struct list_head *to_kill)
>  {
> @@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct 
> page *p,
>   }
>  
>   tk->addr = page_address_in_vma(p, vma);
> - if (is_zone_device_page(p))
> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> - else
> + if (is_zone_device_page(p)) {
> + /*
> +  * Since page->mapping is no more used for fsdax, we should
> +  * calculate the address in a fsdax way.
> +  */
> + if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
> + tk->addr = pgoff_address(pgoff, vma);
> + tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
> + } else
>   tk->size_shift = page_shift(compound_head(p));
>  
>   /*
> @@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct 
> list_head *to_kill,
>   if (!page_mapped_in_vma(page, vma))
>   continue;
>   if (vma->vm_mm == t->mm)
> - 

Re: [PATCH v7 8/8] fsdax: add exception for reflinked files

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:59PM +0800, Shiyang Ruan wrote:
> For reflinked files, one dax page may be associated more than once with
> different file mapping and index.  It will report warning.  Now, since
> we have introduced dax-RMAP for this case and also have to keep its
> functionality for other filesystems who are not support rmap, I add this
> exception here.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 2536c105ec7f..1a57211b1bc9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -352,9 +352,10 @@ static void dax_associate_entry(void *entry, struct 
> address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - WARN_ON_ONCE(page->mapping);
> - page->mapping = mapping;
> - page->index = index + i++;
> + if (!page->mapping) {
> + page->mapping = mapping;
> + page->index = index + i++;

It feels a little dangerous to have page->mapping for shared storage
point to an actual address_space when there are really multiple
potential address_spaces out there.  If the mm or dax folks are ok with
doing this this way then I'll live with it, but it seems like you'd want
to leave /some/ kind of marker once you know that the page has multiple
owners and therefore regular mm rmap via page->mapping won't work.
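
Roughly what I have in mind, as a userspace sketch only.  The sentinel value and every sketch_* name are hypothetical, and whether the mm or dax folks would want the marker stored in page->mapping at all is an open question.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_mapping { int id; };	/* stand-in for struct address_space */

struct sketch_page {
	struct sketch_mapping *mapping;
	unsigned long index;
};

/* Hypothetical sentinel meaning "this page has more than one owner". */
#define SKETCH_MAPPING_SHARED ((struct sketch_mapping *)0x1)

static bool sketch_page_is_shared(const struct sketch_page *page)
{
	return page->mapping == SKETCH_MAPPING_SHARED;
}

static void sketch_associate(struct sketch_page *page,
			     struct sketch_mapping *mapping,
			     unsigned long index)
{
	if (!page->mapping) {
		/* First owner: record it as before. */
		page->mapping = mapping;
		page->index = index;
	} else if (page->mapping != mapping) {
		/* Another owner showed up: leave a marker instead of a stale owner. */
		page->mapping = SKETCH_MAPPING_SHARED;
		page->index = 0;
	}
}
```

With a marker like this, code that still relies on page->mapping can at least detect the shared case and bail out instead of walking the wrong address_space.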

--D

> + }
>   }
>  }
>  
> @@ -370,9 +371,10 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   struct page *page = pfn_to_page(pfn);
>  
>   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> - page->mapping = NULL;
> - page->index = 0;
> + if (page->mapping == mapping) {
> + page->mapping = NULL;
> + page->index = 0;
> + }
>   }
>  }
>  
> -- 
> 2.33.0
> 
> 
> 



Re: [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:58PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data lost in
> filesystem.  Such as memory failure in fsdax mode.
> 
> If the rmap feature of XFS enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  19 +
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_super.c  | 188 
>  include/linux/dax.h |  18 +
>  5 files changed, 229 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 7d4a11dcba90..22091e7fb0ef 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -135,6 +135,25 @@ struct dax_device *fs_dax_get_by_bdev(struct 
> block_device *bdev)
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> + dax_set_holder(dax_dev, holder, ops);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_register_holder);
> +
> +void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> + dax_set_holder(dax_dev, NULL, NULL);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
> +
> +void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> + return dax_get_holder(dax_dev);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get_holder);
> +
>  bool generic_fsdax_supported(struct dax_device *dax_dev,
>   struct block_device *bdev, int blocksize, sector_t start,
>   sector_t sectors)
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..4c2d3d4ca5a5 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
>   } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>   tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>   why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_META) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
>   } else {
>   tag = XFS_PTAG_SHUTDOWN_IOERROR;
>   why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index e091f3b3fa15..d0f6da23e3df 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -434,6 +434,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int 
> lags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR0x0002  /* write attempt to the log 
> failed */
>  #define SHUTDOWN_FORCE_UMOUNT0x0004  /* shutdown from a forced 
> unmount */
>  #define SHUTDOWN_CORRUPT_INCORE  0x0008  /* corrupt in-memory data 
> structures */
> +#define SHUTDOWN_CORRUPT_META0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>   { SHUTDOWN_META_IO_ERROR,   "metadata_io" }, \
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index c4e0cd1c1c8c..46fdf44b5ec2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -37,11 +37,19 @@
>  #include "xfs_reflink.h"
>  #include "xfs_pwork.h"
>  #include "xfs_ag.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_bit.h"
>  
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
> +static const struct dax_holder_operations xfs_dax_holder_operations;
>  static const struct super_operations xfs_super_operations;
>  
>  static struct kset *xfs_kset;/* top-level xfs sysfs dir */
> @@ -377,6 +385,8 @@ xfs_close_devices(
>  
>   xfs_free_buftarg(mp->m_logdev_targp);
>   xfs_blkdev_put(logdev);
> + if (dax_logdev)
> + fs_dax_unregister_holder(dax_logdev);
>   fs_put_dax(dax_logdev);
>   }
>   if (mp->m_rtdev_targp) {
> @@ -385,9 +395,13 @@ xfs_close_devices(
>  
>   xfs_free_buftarg(mp->m_rtdev_targp);
>   xfs_blkdev_put(rtdev);
> + if (dax_rtdev)
> + fs_dax_unregister_holder(dax_rtdev);
>   fs_put_dax(dax_rtdev);
>   }
>   xfs_free_buftarg(mp->m_ddev_targp);
> + if (dax_ddev)
> + fs_dax_unregister_holder(dax_ddev);
>   fs_put_dax(dax_ddev);
>  }
>  
> @@ -411,6 +425,9 @@ xfs_open_devices(
>   struct block_device *logdev = NULL, *rtdev = NULL;
>   int error;
>  
> + if (dax_ddev)
> + fs_dax_register_holder(dax_ddev, mp,
> + &xfs_dax_holder_operations);
>   /*
>* Open real time and log devices - order is important.
>*/
> @@ -419,6 

Re: [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry()

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:56PM +0800, Shiyang Ruan wrote:
> The current dax_lock_page() locks dax entry by obtaining mapping and
> index in page.  To support 1-to-N RMAP in NVDIMM, we need a new function
> to lock a specific dax entry corresponding to this file's mapping,index.
> And BTW, output the page corresponding to the specific dax entry for
> caller use.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c| 65 -
>  include/linux/dax.h | 15 +++
>  2 files changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 798c43f09eee..509b65e60478 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -390,7 +390,7 @@ static struct page *dax_busy_page(void *entry)
>  }
>  
>  /*
> - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
> + * dax_lock_page - Lock the DAX entry corresponding to a page
>   * @page: The page whose entry we want to lock
>   *
>   * Context: Process context.
> @@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t 
> cookie)
>   dax_unlock_entry(&xas, (void *)cookie);
>  }
>  
> +/*
> + * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
> + * @mapping: the file's mapping whose entry we want to lock
> + * @index: the offset within this file
> + * @page: output the dax page corresponding to this dax entry
> + *
> + * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
> + * could not be locked.
> + */
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t 
> index,
> + struct page **page)
> +{
> + XA_STATE(xas, NULL, 0);
> + void *entry;
> +
> + rcu_read_lock();
> + for (;;) {
> + entry = NULL;
> + if (!dax_mapping(mapping))
> + break;
> +
> + xas.xa = &mapping->i_pages;
> + xas_lock_irq(&xas);
> + xas_set(&xas, index);
> + entry = xas_load(&xas);
> + if (dax_is_locked(entry)) {
> + rcu_read_unlock();
> + wait_entry_unlocked(&xas, entry);
> + rcu_read_lock();
> + continue;
> + }
> + if (!entry ||
> + dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + /*
> +  * Because we are looking for entry from file's mapping
> +  * and index, so the entry may not be inserted for now,
> +  * or even a zero/empty entry.  We don't think this is
> +  * an error case.  So, return a special value and do
> +  * not output @page.
> +  */
> + entry = (void *)~0UL;

I kinda wonder if these open-coded magic values ~0UL (no entry) and 0
(cannot lock) should be #defines that force-cast the magic value to
dax_entry_t...

...but then I'm not really an expert in the design behind fs/dax.c --
this part looks reasonable enough to me, but I think Dan or Matthew
ought to look this over.
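
Something like the following is what I mean by #defines, as a sketch only.  The macro names are hypothetical and the typedef is a local stand-in for the kernel's cookie type.

```c
/* Local stand-in for the kernel's cookie type. */
typedef unsigned long sketch_dax_entry_t;

/* Hypothetical names for the two open-coded magic values. */
#define SKETCH_DAX_LOCK_FAILED	((sketch_dax_entry_t)0)
#define SKETCH_DAX_NO_ENTRY	((sketch_dax_entry_t)~0UL)

/* Returns 1 only if the cookie refers to a real locked entry. */
static int sketch_cookie_needs_unlock(sketch_dax_entry_t cookie)
{
	return cookie != SKETCH_DAX_LOCK_FAILED &&
	       cookie != SKETCH_DAX_NO_ENTRY;
}
```

Callers would then compare against a name rather than ~0UL, and the lock and unlock sides cannot drift apart on what the special values are.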

--D

> + } else {
> + *page = pfn_to_page(dax_to_pfn(entry));
> + dax_lock_entry(&xas, entry);
> + }
> + xas_unlock_irq(&xas);
> + break;
> + }
> + rcu_read_unlock();
> + return (dax_entry_t)entry;
> +}
> +
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
> + dax_entry_t cookie)
> +{
> + XA_STATE(xas, &mapping->i_pages, index);
> +
> + if (cookie == ~0UL)
> + return;
> +
> + dax_unlock_entry(&xas, (void *)cookie);
> +}
> +
>  /*
>   * Find page cache entry at given index. If it is a DAX entry, return it
>   * with the entry locked. If the page cache doesn't contain an entry at
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d273d59723cd..65411bee4312 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -156,6 +156,10 @@ struct page *dax_layout_busy_page(struct address_space 
> *mapping);
>  struct page *dax_layout_busy_page_range(struct address_space *mapping, 
> loff_t start, loff_t end);
>  dax_entry_t dax_lock_page(struct page *page);
>  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
> + unsigned long index, struct page **page);
> +void dax_unlock_mapping_entry(struct address_space *mapping,
> + unsigned long index, dax_entry_t cookie);
>  #else
>  #define generic_fsdax_supported  NULL
>  
> @@ -201,6 +205,17 @@ static inline dax_entry_t dax_lock_page(struct page 
> *page)
>  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  {
>  }
> +
> +static inline dax_entry_t dax_lock_mapping_entry(struct address_space 
> *mapping,
> + unsigned long index, struct page **page)
> +{
> + return 0;
> +}
> +
> +static inline void 

Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:55PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the filesystem in which
> the corrupted page located in.
> 
> With dax_holder notify support, we are able to notify the memory failure
> from pmem driver to upper layers.  If there is something not support in
> the notify routine, memory_failure will fall back to the generic handler.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/nvdimm/pmem.c| 11 +++
>  include/linux/memremap.h |  9 +
>  mm/memory-failure.c  | 14 ++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 72de88ff0d30..0dfafad8fcc5 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
>   del_gendisk(pmem->disk);
>  }
>  
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + unsigned long pfn, size_t size, int flags)
> +{
> + struct pmem_device *pmem =
> + container_of(pgmap, struct pmem_device, pgmap);
> + loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
> +
> + return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
> +}
> +
>  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
>   .kill   = pmem_pagemap_kill,
>   .cleanup= pmem_pagemap_cleanup,
> + .memory_failure = pmem_pagemap_memory_failure,
>  };
>  
>  static int pmem_attach_disk(struct device *dev,
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c0e9d35889e8..36d47bacd46d 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -87,6 +87,15 @@ struct dev_pagemap_ops {
>* the page back to a CPU accessible page.
>*/
>   vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> + /*
> +  * Handle the memory failure happens on a range of pfns.  Notify the
> +  * processes who are using these pfns, and try to recover the data on
> +  * them if necessary.  The flag is finally passed to the recover
> +  * function through the whole notify routine.
> +  */
> + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> +   size_t size, int flags);
>  };
>  
>  #define PGMAP_ALTMAP_VALID   (1 << 0)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 8ff9b52823c0..85eab206b68f 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>   if (!pgmap_pfn_valid(pgmap, pfn))
>   goto out;
>  
> + /*
> +  * Call driver's implementation to handle the memory failure, otherwise
> +  * fall back to generic handler.
> +  */
> + if (pgmap->ops->memory_failure) {
> + rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
> + /*
> +  * Fall back to generic handler too if operation is not
> +  * supported inside the driver/device/filesystem.
> +  */
> + if (rc != EOPNOTSUPP)

-EOPNOTSUPP?  (negative errno)
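
To spell out why the sign matters, here is a small userspace sketch.  The sketch_* functions are hypothetical; only the errno constant is real.

```c
#include <errno.h>

/* Hypothetical driver hook that reports "not supported" the kernel way. */
static int sketch_driver_memory_failure(void)
{
	return -EOPNOTSUPP;
}

/* Intended check: fall back when the driver returns the negative errno. */
static int sketch_should_fall_back(int rc)
{
	return rc == -EOPNOTSUPP;
}

/* Check with the sign as posted: a positive errno never matches. */
static int sketch_should_fall_back_as_posted(int rc)
{
	return rc == EOPNOTSUPP;
}
```

With the positive constant the comparison can never match a kernel-style return value, so the generic handler would never run as a fallback.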

--D

> + goto out;
> + }
> +
>   rc = mf_generic_kill_procs(pfn, flags, pgmap);
>  out:
>   /* drop pgmap ref acquired in caller */
> -- 
> 2.33.0
> 
> 
> 



Re: [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:54PM +0800, Shiyang Ruan wrote:
> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
> feature for fsdax.  So it is needed to factor some helper functions to
> simplify these code.
> 
> Signed-off-by: Shiyang Ruan 

This looks like a reasonable hoist...
Reviewed-by: Darrick J. Wong 

--D

> ---
>  mm/memory-failure.c | 140 
>  1 file changed, 76 insertions(+), 64 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..8ff9b52823c0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, 
> const char *msg)
>   return 0;
>  }
>  
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> + struct address_space *mapping, pgoff_t index, int flags)
> +{
> + struct to_kill *tk;
> + unsigned long size = 0;
> +
> + list_for_each_entry(tk, to_kill, nd)
> + if (tk->size_shift)
> + size = max(size, 1UL << tk->size_shift);
> + if (size) {
> + /*
> +  * Unmap the largest mapping to avoid breaking up device-dax
> +  * mappings which are constant size. The actual size of the
> +  * mapping being torn down is communicated in siginfo, see
> +  * kill_proc()
> +  */
> + loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
> +
> + unmap_mapping_range(mapping, start, size, 0);
> + }
> +
> + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> +}
> +
> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> + struct dev_pagemap *pgmap)
> +{
> + struct page *page = pfn_to_page(pfn);
> + LIST_HEAD(to_kill);
> + dax_entry_t cookie;
> +
> + /*
> +  * Prevent the inode from being freed while we are interrogating
> +  * the address_space, typically this would be handled by
> +  * lock_page(), but dax pages do not use the page lock. This
> +  * also prevents changes to the mapping of this pfn until
> +  * poison signaling is complete.
> +  */
> + cookie = dax_lock_page(page);
> + if (!cookie)
> + return -EBUSY;
> +
> + if (hwpoison_filter(page))
> + return 0;
> +
> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> + /*
> +  * TODO: Handle HMM pages which may need coordination
> +  * with device-side memory.
> +  */
> + return -EBUSY;
> + }
> +
> + /*
> +  * Use this flag as an indication that the dax page has been
> +  * remapped UC to prevent speculative consumption of poison.
> +  */
> + SetPageHWPoison(page);
> +
> + /*
> +  * Unlike System-RAM there is no possibility to swap in a
> +  * different physical page at a given virtual address, so all
> +  * userspace consumption of ZONE_DEVICE memory necessitates
> +  * SIGBUS (i.e. MF_MUST_KILL)
> +  */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &to_kill, true);
> +
> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> + dax_unlock_page(page, cookie);
> + return 0;
> +}
> +
>  static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  {
>   struct page *p = pfn_to_page(pfn);
> @@ -1519,12 +1592,8 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>   struct dev_pagemap *pgmap)
>  {
>   struct page *page = pfn_to_page(pfn);
> - unsigned long size = 0;
> - struct to_kill *tk;
>   LIST_HEAD(tokill);
> - int rc = -EBUSY;
> - loff_t start;
> - dax_entry_t cookie;
> + int rc = -ENXIO;
>  
>   if (flags & MF_COUNT_INCREASED)
>   /*
> @@ -1533,67 +1602,10 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>   put_page(page);
>  
>   /* device metadata space is not recoverable */
> - if (!pgmap_pfn_valid(pgmap, pfn)) {
> - rc = -ENXIO;
> - goto out;
> - }
> -
> - /*
> -  * Prevent the inode from being freed while we are interrogating
> -  * the address_space, typically this would be handled by
> -  * lock_page(), but dax pages do not use the page lock. This
> -  * also prevents changes to the mapping of this pfn until
> -  * poison signaling is complete.
> -  */
> - cookie = dax_lock_page(page);
> -   

Re: [PATCH v7 2/8] dax: Introduce holder for dax_device

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:53PM +0800, Shiyang Ruan wrote:
> To easily track filesystem from a pmem device, we introduce a holder for
> dax_device structure, and also its operation.  This holder is used to
> remember who is using this dax_device:
>  - When it is the backend of a filesystem, the holder will be the
>superblock of this filesystem.
>  - When this pmem device is one of the targets in a mapped device, the
>holder will be this mapped device.  In this case, the mapped device
>has its own dax_device and it will follow the first rule.  So that we
>can finally track to the filesystem we needed.
> 
> The holder and holder_ops will be set when filesystem is being mounted,
> or an target device is being activated.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c | 59 +
>  include/linux/dax.h | 29 ++
>  2 files changed, 88 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 48ce86501d93..7d4a11dcba90 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -23,7 +23,10 @@
>   * @cdev: optional character interface for "device dax"
>   * @host: optional name for lookups where the device path is not available
>   * @private: dax driver private data
> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
>   * @flags: state and boolean properties
> + * @ops: operations for dax_device
> + * @holder_ops: operations for the inner holder
>   */
>  struct dax_device {
>   struct hlist_node list;
> @@ -31,8 +34,10 @@ struct dax_device {
>   struct cdev cdev;
>   const char *host;
>   void *private;
> + void *holder_data;
>   unsigned long flags;
>   const struct dax_operations *ops;
> + const struct dax_holder_operations *holder_ops;
>  };
>  
>  static dev_t dax_devt;
> @@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, 
> pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_zero_page_range);
>  
> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
> +   size_t size, int flags)
> +{
> + int rc;
> +
> + dax_read_lock();
> + if (!dax_alive(dax_dev)) {
> + rc = -ENXIO;
> + goto out;
> + }
> +
> + if (!dax_dev->holder_data) {
> + rc = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);

Shouldn't this check if dax_dev->holder_ops != NULL before dereferencing
it for the function call?  Imagine an implementation that wants to
attach a ->notify_failure function to a dax_device, maintains its own
lookup table, and decides that it doesn't need to set holder_data.

(Or, imagine someone who writes a garbage into holder_data and *boom*)
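
A sketch of the check I would expect, in plain userspace C.  The sketch_* types are hypothetical stand-ins for dax_device and dax_holder_operations, and sketch_always_ok exists only to exercise the example.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_dax_device;

struct sketch_holder_ops {
	int (*notify_failure)(struct sketch_dax_device *dev, long long offset,
			      size_t size, int flags);
};

struct sketch_dax_device {
	void *holder_data;
	const struct sketch_holder_ops *holder_ops;
};

static int sketch_notify_failure(struct sketch_dax_device *dev,
				 long long offset, size_t size, int flags)
{
	/* Key the decision on the pointer that is about to be called. */
	if (!dev->holder_ops || !dev->holder_ops->notify_failure)
		return -EOPNOTSUPP;

	return dev->holder_ops->notify_failure(dev, offset, size, flags);
}

/* Example holder callback, present only to exercise the sketch. */
static int sketch_always_ok(struct sketch_dax_device *dev, long long offset,
			    size_t size, int flags)
{
	(void)dev; (void)offset; (void)size; (void)flags;
	return 0;
}
```

Checking holder_ops rather than holder_data means a holder that keeps its own lookup table and leaves holder_data NULL still gets notified, and a device with no ops never gets dereferenced.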

How does the locking work here?  If there's a media failure, we'll take
dax_rwsem and call ->notify_failure.  If the ->notify_failure function
wants to access the pmem to handle the error by calling back into the
dax code, will that cause nested locking on dax_rwsem?

Jumping ahead a bit, I think the rmap btree accesses that the xfs
implementation performs can cause xfs_buf(fer) cache IO, which would
trigger that if the buffers aren't already in memory, if I'm reading
this correctly?

> +out:
> + dax_read_unlock();
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> +
>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>  void arch_wb_cache_pmem(void *addr, size_t size);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> @@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(put_dax);
>  
> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> + dax_write_lock();
> + if (!dax_alive(dax_dev)) {
> + dax_write_unlock();
> + return;
> + }
> +
> + dax_dev->holder_data = holder;
> + dax_dev->holder_ops = ops;
> + dax_write_unlock();

I guess this means that the holder has to detach itself before anyone
calls kill_dax, or else a dead dax device ends up with a dangling
reference to the holder?

> +}
> +EXPORT_SYMBOL_GPL(dax_set_holder);
> +
> +void *dax_get_holder(struct dax_device *dax_dev)
> +{
> + void *holder;
> +
> + dax_read_lock();
> + if (!dax_alive(dax_dev)) {
> + dax_read_unlock();
> + return NULL;
> + }
> +
> + holder = dax_dev->holder_data;
> + dax_read_unlock();
> + return holder;
> +}
> +EXPORT_SYMBOL_GPL(dax_get_holder);
> +
>  /**
>   * inode_dax: convert a public inode into its dax_dev
>   * @inode: An inode with i_cdev pointing to a dax_dev
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 097b3304f9b9..d273d59723cd 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -38,9 +38,24 @@ struct dax_operations {
>   int 

Re: [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock()

2021-10-14 Thread Darrick J. Wong
On Fri, Sep 24, 2021 at 09:09:52PM +0800, Shiyang Ruan wrote:
> In order to introduce dax holder registration, we need a write lock for
> dax.  Because of the rarity of notification failures and the infrequency
> of registration events, it would be better to be a global lock rather
> than per-device.  So, change the current lock to rwsem and introduce a
> write lock for registration.

Urgh, I totally thought dax_read_lock was a global lock on something
relating to the global dax_device state until I noticed this comment
above kill_dax():

/*
 * Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
 * that any fault handlers or operations that might have seen
 * dax_alive(), have completed.  Any operations that start after
 * synchronize_srcu() has run will abort upon seeing !dax_alive().
 */

So dax_srcu ensures stability in the dax_device's ALIVE state while any
code that relies on that aliveness runs.  As a side effect, it'll block
kill_dax (and I guess run_dax) while those functions run.  It doesn't
protect any global state at all... but this isn't made obvious in the
API by (for example) passing the dax_device into dax_read_lock.

IOWs, it's not protecting against the dax_device getting freed or
anything resembling global state.  So that's probably why you note above
that this /could/ be a per-device synchronization primitive, right?

If that's the case, then why shouldn't this be a per-device item?  As
written here, any code that takes dax_write_lock() will block every dax
device in the system while it does some work on a single dax device.
Being an rwsem, it will also have to wait for every other dax device
access to complete before it can begin.  That seems excessive,
particularly if in the future we start hooking up lots of pmem to a
single host.

I have more to say around kill_dax() below.

> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/device.c   | 11 +-
>  drivers/dax/super.c| 43 ++
>  drivers/md/dm-writecache.c |  7 +++
>  fs/dax.c   | 26 +++
>  include/linux/dax.h|  9 
>  5 files changed, 49 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index dd8222a42808..cc7b835509f9 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -198,7 +198,6 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>   struct file *filp = vmf->vma->vm_file;
>   unsigned long fault_size;
>   vm_fault_t rc = VM_FAULT_SIGBUS;
> - int id;
>   pfn_t pfn;
>   struct dev_dax *dev_dax = filp->private_data;
>  
> @@ -206,7 +205,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>   (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
>   vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
>  
> - id = dax_read_lock();
> + dax_read_lock();
>   switch (pe_size) {
>   case PE_SIZE_PTE:
>   fault_size = PAGE_SIZE;
> @@ -246,7 +245,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>   page->index = pgoff + i;
>   }
>   }
> - dax_read_unlock(id);
> + dax_read_unlock();
>  
>   return rc;
>  }
> @@ -284,7 +283,7 @@ static const struct vm_operations_struct dax_vm_ops = {
>  static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
>  {
>   struct dev_dax *dev_dax = filp->private_data;
> - int rc, id;
> + int rc;
>  
>   dev_dbg(&dev_dax->dev, "trace\n");
>  
> @@ -292,9 +291,9 @@ static int dax_mmap(struct file *filp, struct 
> vm_area_struct *vma)
>* We lock to check dax_dev liveness and will re-check at
>* fault time.
>*/
> - id = dax_read_lock();
> + dax_read_lock();
>   rc = check_vma(dev_dax, vma, __func__);
> - dax_read_unlock(id);
> + dax_read_unlock();
>   if (rc)
>   return rc;
>  
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index fc89e91beea7..48ce86501d93 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -36,7 +36,7 @@ struct dax_device {
>  };
>  
>  static dev_t dax_devt;
> -DEFINE_STATIC_SRCU(dax_srcu);
> +static DECLARE_RWSEM(dax_rwsem);
>  static struct vfsmount *dax_mnt;
>  static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
> @@ -46,18 +46,28 @@ static struct super_block *dax_superblock __read_mostly;
>  static struct hlist_head dax_host_list[DAX_HASH_SIZE];
>  static DEFINE_SPINLOCK(dax_host_lock);
>  
> -int dax_read_lock(void)
> +void dax_read_lock(void)
>  {
> - return srcu_read_lock(&dax_srcu);
> + down_read(&dax_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(dax_read_lock);
>  
> -void dax_read_unlock(int id)
> +void dax_read_unlock(void)
>  {
> - srcu_read_unlock(&dax_srcu, id);
> + up_read(&dax_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(dax_read_unlock);
>  
> +void dax_write_lock(void)
> +{
> + down_write(&dax_rwsem);
> +}

Re: [PATCH v10 7/8] xfs: support CoW in fsdax mode

2021-10-14 Thread Darrick J. Wong
On Tue, Sep 28, 2021 at 02:23:10PM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
> After that, new allocated extents needs to be remapped to the file.
> So, add a CoW identification in ->iomap_begin(), and implement
> ->iomap_end() to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan 

I think this patch looks good, so:
Reviewed-by: Darrick J. Wong 

A big thank you to Shiyang for persisting in getting this series
finished! :)

Judging from the conversation Christoph and I had the last time this
patchset was submitted, I gather the last big remaining issue is the use
of page->mapping for hw poison.  So I'll go take a look at "fsdax:
introduce FS query interface to support reflink" now.

--D

> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c  |  7 ++-
>  fs/xfs/xfs_iomap.c | 30 +++-
>  fs/xfs/xfs_iomap.h | 44 ++
>  fs/xfs/xfs_iops.c  |  7 +++
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 80 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 73a36b7be3bd..0681250e0a5d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1009,8 +1009,7 @@ xfs_free_file_space(
>   return 0;
>   if (offset + len > XFS_ISIZE(ip))
>   len = XFS_ISIZE(ip) - offset;
> - error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - &xfs_buffered_write_iomap_ops);
> + error = xfs_iomap_zero_range(ip, offset, len, NULL);
>   if (error)
>   return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7aa943edfc02..afde4fbefb6f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -704,7 +704,7 @@ xfs_file_dax_write(
>   pos = iocb->ki_pos;
>  
>   trace_xfs_file_dax_write(iocb, from);
> - ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> + ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
>   if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
>   i_size_write(inode, iocb->ki_pos);
>   error = xfs_setfilesize(ip, pos, ret);
> @@ -1327,10 +1327,7 @@ __xfs_filemap_fault(
>   pfn_t pfn;
>  
>   xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> - ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> - (write_fault && !vmf->cow_page) ?
> -  &xfs_direct_write_iomap_ops :
> -  &xfs_read_iomap_ops);
> + ret = xfs_dax_iomap_fault(vmf, pe_size, write_fault, &pfn);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
>   xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 093758440ad5..51cb5b713521 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
>  
>   /* may drop and re-acquire the ilock */
>   error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> - &lockmode, flags & IOMAP_DIRECT);
> + &lockmode,
> + (flags & IOMAP_DIRECT) || IS_DAX(inode));
>   if (error)
>   goto out_unlock;
>   if (shared)
> @@ -854,6 +855,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>   .iomap_begin= xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> + struct inode*inode,
> + loff_t  pos,
> + loff_t  length,
> + ssize_t written,
> + unsignedflags,
> + struct iomap*iomap)
> +{
> + struct xfs_inode*ip = XFS_I(inode);
> +
> + if (!xfs_is_cow_inode(ip))
> + return 0;
> +
> + if (!written) {
> + xfs_reflink_cancel_cow_range(ip, pos, length, true);
> + return 0;
> + }
> +
> + return xfs_reflink_end_cow(ip, pos, written);
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> + .iomap_begin= xfs_direct_write_iomap_begin,
> + .iomap_end  = xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(
>   struct inode*inode,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 7d3703556d0e..81726bfbf890 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -7

Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-17 Thread Darrick J. Wong
On Fri, Sep 17, 2021 at 01:21:25PM -0700, Dan Williams wrote:
> On Fri, Sep 17, 2021 at 8:27 AM Darrick J. Wong  wrote:
> >
> > On Fri, Sep 17, 2021 at 01:53:33PM +0100, Christoph Hellwig wrote:
> > > On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > > > That was my gut feeling.  If everyone feels 100% comfortable with
> > > > > > zeroing as the mechanism to clear poisoning I'll cave in.  The most
> > > > > important bit is that we do that through a dedicated DAX path instead
> > > > > of abusing the block layer even more.
> > > >
> > > > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > > > Where reset == "zero + clear-poison"?
> > >
> > > I'd say that naming is more confusing than overloading zero.
> >
> > How about dax_zeroinit_range() ?
> 
> Works for me.
> 
> >
> > To go with its fallocate flag (yeah I've been too busy sorting out -rc1
> > regressions to repost this) FALLOC_FL_ZEROINIT_RANGE that will reset the
> > hardware (whatever that means) and set the contents to the known value
> > zero.
> >
> > Userspace usage model:
> >
> > void handle_media_error(int fd, loff_t pos, size_t len)
> > {
> > /* yell about this for posterior's sake */
> >
> > ret = fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, pos, len);
> >
> > /* yay our disk drive / pmem / stone table engraver is online */
> 
> The fallocate mode can still be error-aware though, right? When the FS
> has knowledge of the error locations the fallocate mode could be
> fallocate(fd, FALLOC_FL_OVERWRITE_ERRORS, pos, len) with the semantics
> of attempting to zero out any known poison extents in the given file
> range? At the risk of going overboard on new fallocate modes there
> could also (or instead of) be FALLOC_FL_PUNCH_ERRORS to skip trying to
> clear them and just ask the FS to throw error extents away.

It /could/ be, but for now I've stuck to what you see is what you get --
if you tell it to 'zero initialize' 1MB of pmem, it'll write zeroes and
clear the poison on all 1MB, regardless of the old contents.

IOWs, you can use it from a poison handler on just the range that it
told you about, or you could use it to bulk-clear a lot of space all at
once.

A dorky thing here is that the dax_zero_page_range function returns EIO
if you tell it to do more than one page...
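
To illustrate that last point, here is a rough userspace sketch of what a
caller ends up doing when the primitive only accepts one page per call:
walk the range a page at a time.  Everything named sketch_* or
SKETCH_* below is made up for the example, the 4k page size is an
assumption, and the callback merely stands in for a single-page zeroing
call; none of this is the kernel code.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4k pages */

/* Hypothetical stand-in for a zeroing primitive that takes one page. */
typedef int (*sketch_zero_page_fn)(uint64_t pgoff, void *priv);

/*
 * Zero [pos, pos + len) by issuing one call per page.  The sketch only
 * accepts page-aligned ranges so that it never zeroes bytes outside the
 * range it was asked to handle.  Stops at the first error.
 */
static int sketch_zeroinit_range(uint64_t pos, uint64_t len,
				 sketch_zero_page_fn zero_one, void *priv)
{
	uint64_t done;
	int error;

	if ((pos % SKETCH_PAGE_SIZE) || (len % SKETCH_PAGE_SIZE))
		return -EINVAL;

	for (done = 0; done < len; done += SKETCH_PAGE_SIZE) {
		error = zero_one((pos + done) / SKETCH_PAGE_SIZE, priv);
		if (error)
			return error;
	}
	return 0;
}

/* Example callback: only counts how many pages it was asked to zero. */
static int sketch_count_pages(uint64_t pgoff, void *priv)
{
	(void)pgoff;
	(*(unsigned int *)priv)++;
	return 0;
}
```

With that shape, "zero initialize 1MB" turns into 256 single-page calls,
which is the dorky part.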


> 
> > }
> >
> > > > > > I'm really worried about both partitions on DAX and DM passing through
> > > > > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > > > > idea.  I think we also need to sort that whole story out before removing
> > > > > > the EXPERIMENTAL tags.
> > > >
> > > > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > > > block-device.
> > > >
> > > > DAX-reflink support may be the opportunity to start deprecating that
> > > > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > > > without partitions (later add dax-device direct mounting),
> > >
> > > I think we need to fully or almost fully sort this out.
> > >
> > > > Here are my bold suggestions:
> > > >
> > > >  1) drop no drop the EXPERIMENTAL on the current block layer overload
> > > at all
> >
> > I don't understand this.
> >
> > >  2) add direct mounting of the nvdimm namespaces ASAP.  Because all
> > > the filesystem currently also need the /dev/pmem0 device add a way
> > > to open the block device by the dax_device instead of our current
> > > way of doing the reverse
> > >  3) deprecate DAX support through block layer mounts with a say 2 year
> > > deprecation period
> > >  4) add DAX remapping devices as needed
> >
> > What devices are needed?  linear for lvm, and maybe error so we can
> > actually test all this stuff?
> 
> The proposal would be zero lvm support. The nvdimm namespace
> definition would need to grow support for concatenation + striping.

Ah, ok.

> Soft error injection could be achieved by writing to the badblocks
> interface.



I'll send out an RFC of what I have currently.

--D



Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode

2021-09-17 Thread Darrick J. Wong
On Thu, Sep 16, 2021 at 08:32:51AM +0200, Christoph Hellwig wrote:
> On Wed, Sep 15, 2021 at 05:22:27PM -0700, Darrick J. Wong wrote:
> > >   xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > > >   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> > > >   (write_fault && !vmf->cow_page) ?
> > > > -  &xfs_direct_write_iomap_ops :
> > > > -  &xfs_read_iomap_ops);
> > > > + &xfs_dax_write_iomap_ops :
> > > > + &xfs_read_iomap_ops);
> > 
> > Hmm... I wonder if this should get hoisted to a "xfs_dax_iomap_fault"
> > wrapper like you did for xfs_iomap_zero_range?
> 
> This has just a single user, so the classic argument won't apply.  That
> being said __xfs_filemap_fault is a complete mess due to the calling
> conventions of the various VFS methods multiplexed into it.  So yes,
> splitting out a xfs_dax_iomap_fault to wrap the above plus the
> dax_finish_sync_fault call might not actually be a bad idea nevertheless.

Agree.

> > > + struct xfs_inode*ip = XFS_I(inode);
> > > + /*
> > > +  * Usually we use @written to indicate whether the operation was
> > > +  * successful.  But it is always positive or zero.  The CoW needs the
> > > +  * actual error code from actor().  So, get it from
> > > +  * iomap_iter->processed.
> > 
> > Hm.  All six arguments are derived from the struct iomap_iter, so maybe
> > it makes more sense to pass that in?  I'll poke around with this more
> > tomorrow.
> 
> I'd argue against just changing the calling conventions for ->iomap_end
> now.  The original iter patches from willy allowed passing a single
> next callback combining iomap_begin and iomap_end in a way that with
> a little magic we can avoid the indirect calls entirely.  I think we'll
> need to experiment with that a bit and see if it is worth the effort
> first.  I plan to do that but I might not get to it immediately.  If someone
> else wants to take over I'm fine with that.

Ah, I forgot that.  Yay Etch-a-Sketch brain.  -ENODATA ;)

> > >  static int
> > >  xfs_buffered_write_iomap_begin(
> > 
> > > Also, we have a related request to drop the EXPERIMENTAL tag for
> > non-DAX reflink.  Whichever patch enables dax+reflink for xfs needs to
> > make it clear that reflink + any possibility of DAX emits an
> > EXPERIMENTAL warning.
> 
> More importantly before we can merge this series we also need the VM
> level support for reflink-aware reverse mapping.  So while this series
> here is now in a good enough shape I don't see how we could merge it
> without that other series as we'd have to disallow mmap for reflink+dax
> files otherwise.

I've forgotten why we need mm level reverse mapping again?  The pmem
poison stuff can use ->media_failure (or whatever it was called,
memory_failure?) to find all the owners and notify them.  Was there
some other accounting reason that fell out of my brain?

I'm more afraid of 'sharing pages between files needs mm support'
sparking another multi-year folioesque fight with the mm people.

--D



Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-17 Thread Darrick J. Wong
On Fri, Sep 17, 2021 at 01:53:33PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > That was my gut feeling.  If everyone feels 100% comfortable with
> > > > zeroing as the mechanism to clear poisoning I'll cave in.  The most
> > > important bit is that we do that through a dedicated DAX path instead
> > > of abusing the block layer even more.
> > 
> > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > Where reset == "zero + clear-poison"?
> 
> I'd say that naming is more confusing than overloading zero.

How about dax_zeroinit_range() ?

To go with its fallocate flag (yeah I've been too busy sorting out -rc1
regressions to repost this) FALLOC_FL_ZEROINIT_RANGE that will reset the
hardware (whatever that means) and set the contents to the known value
zero.

Userspace usage model:

void handle_media_error(int fd, loff_t pos, size_t len)
{
/* yell about this for posterior's sake */

ret = fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, pos, len);

/* yay our disk drive / pmem / stone table engraver is online */
}

> > > I'm really worried about both partitions on DAX and DM passing through
> > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > idea.  I think we also need to sort that whole story out before removing
> > > the EXPERIMENTAL tags.
> > 
> > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > block-device.
> > 
> > DAX-reflink support may be the opportunity to start deprecating that
> > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > without partitions (later add dax-device direct mounting),
> 
> I think we need to fully or almost fully sort this out.
> 
> Here are my bold suggestions:
> 
>  1) drop no drop the EXPERIMENTAL on the current block layer overload
> at all

I don't understand this.

>  2) add direct mounting of the nvdimm namespaces ASAP.  Because all
> the filesystem currently also need the /dev/pmem0 device add a way
> to open the block device by the dax_device instead of our current
> way of doing the reverse
>  3) deprecate DAX support through block layer mounts with a say 2 year
> deprecation period
>  4) add DAX remapping devices as needed

What devices are needed?  linear for lvm, and maybe error so we can
actually test all this stuff?

> I'll volunteer to write the initial code for 2).  And I think we should
> not allow DAX+reflink on the block device shim at all.

/me has other questions about daxreflink, but I'll ask them on shiyang's
thread.

--D



Re: [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

2021-09-16 Thread Darrick J. Wong
On Thu, Sep 16, 2021 at 04:49:19PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2021/9/16 14:16, Christoph Hellwig wrote:
> > On Wed, Sep 15, 2021 at 06:44:58PM +0800, Shiyang Ruan wrote:
> > > > + rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
> > > + if (rc < 0)
> > > + goto out;
> > > + memset(kaddr + offset, 0, size);
> > > + if (srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
> > 
> > Should we also check that ->dax_dev for iomap and srcmap are different
> > first to deal with case of file system with multiple devices?
> 
> I have not thought of this case.  Isn't it possible to CoW between different
> devices?

There's nothing in the iomap API that prevents a filesystem from doing
that, though there are no filesystems today that do such a thing.

That said, if btrfs ever joins the fold (and adds DAX support) then they
could totally COW to a different device.
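
For what it's worth, the extra check Christoph is asking about could look
roughly like the userspace sketch below.  The struct layouts, the
sketch_* names and the SKETCH_NO_SOURCE sentinel are all invented for the
example (the real struct iomap has many more fields); the only point is
that equal addresses on two different devices are still two different
locations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "the source mapping has no backing data". */
#define SKETCH_NO_SOURCE ((uint64_t)-1)

struct sketch_dax_dev {
	int id;
};

/* Cut-down stand-in for a mapping: where the data lives, and on what. */
struct sketch_iomap {
	uint64_t addr;
	const struct sketch_dax_dev *dax_dev;
};

/*
 * Decide whether data has to be copied from srcmap before writing to
 * iomap.  A differing device always means a copy; only when both
 * mappings sit on the same device does comparing addresses say anything.
 */
static bool sketch_needs_cow_copy(const struct sketch_iomap *srcmap,
				  const struct sketch_iomap *iomap)
{
	if (srcmap->addr == SKETCH_NO_SOURCE)
		return false;
	if (srcmap->dax_dev != iomap->dax_dev)
		return true;
	return srcmap->addr != iomap->addr;
}
```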

--D

> 
> 
> --
> Thanks,
> Ruan
> 
> > 
> > Otherwise looks good:
> > 
> > Reviewed-by: Christoph Hellwig 
> > 
> 
> 



Re: [PATCH v9 8/8] xfs: Add dax dedupe support

2021-09-15 Thread Darrick J. Wong
On Thu, Sep 16, 2021 at 12:01:18PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2021/9/16 8:30, Darrick J. Wong wrote:
> > On Wed, Sep 15, 2021 at 06:45:01PM +0800, Shiyang Ruan wrote:
> > > Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
> > > who are going to be deduped.  After that, call compare range function
> > > only when files are both DAX or not.
> > > 
> > > Signed-off-by: Shiyang Ruan 
> > > Reviewed-by: Darrick J. Wong 
> > > Reviewed-by: Christoph Hellwig 
> > > ---
> > >   fs/xfs/xfs_file.c|  2 +-
> > >   fs/xfs/xfs_inode.c   | 80 +---
> > >   fs/xfs/xfs_inode.h   |  1 +
> > >   fs/xfs/xfs_reflink.c |  4 +--
> > >   4 files changed, 80 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 2ef1930374d2..c3061723613c 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -846,7 +846,7 @@ xfs_wait_dax_page(
> > >   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > >   }
> > > -static int
> > > +int
> > >   xfs_break_dax_layouts(
> > >   struct inode*inode,
> > >   bool*retry)
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index a4f6f034fb81..bdc084cdbf46 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -3790,6 +3790,61 @@ xfs_iolock_two_inodes_and_break_layout(
> > >   return 0;
> > >   }
> > > +static int
> > > +xfs_mmaplock_two_inodes_and_break_dax_layout(
> > > + struct xfs_inode*ip1,
> > > + struct xfs_inode*ip2)
> > > +{
> > > + int error, attempts = 0;
> > > + boolretry;
> > > + struct page *page;
> > > + struct xfs_log_item *lp;
> > > +
> > > + if (ip1->i_ino > ip2->i_ino)
> > > + swap(ip1, ip2);
> > > +
> > > +again:
> > > + retry = false;
> > > + /* Lock the first inode */
> > > + xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> > > > + error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> > > + if (error || retry) {
> > > + xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > + if (error == 0 && retry)
> > > + goto again;
> > > + return error;
> > > + }
> > > +
> > > + if (ip1 == ip2)
> > > + return 0;
> > > +
> > > + /* Nested lock the second inode */
> > > + lp = >i_itemp->ili_item;
> > > + if (lp && test_bit(XFS_LI_IN_AIL, >li_flags)) {
> > > + if (!xfs_ilock_nowait(ip2,
> > > + xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
> > > + xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > + if ((++attempts % 5) == 0)
> > > + delay(1); /* Don't just spin the CPU */
> > > + goto again;
> > > + }
> > 
> > I suspect we don't need this part for grabbing the MMAPLOCK^W pagecache
> > invalidatelock.  The AIL only grabs the ILOCK, never the IOLOCK or the
> > MMAPLOCK.
> 
> Maybe I have misunderstood this part.
> 
> What I want is to lock the two inode nestedly.  This code is copied from
> xfs_lock_two_inodes(), which checks this AIL during locking two inode with
> each of the three kinds of locks.

 It's totally reasonable to copy-paste the function you want and
change it as needed...

> But I also found the recent merged function: filemap_invalidate_lock_two()
> just locks two inode directly without checking AIL.  So, I am not if the AIL
> check is needed in this case.

...especially when even the maintainer is only 99% sure that the AIL
checking chunk here can be removed.  Anyone else have an opinion?

--D

> > 
> > > + } else
> > > + xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
> > > + /*
> > > +  * We cannot use xfs_break_dax_layouts() directly here because it may
> > > +  * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
> > > +  * for this nested lock case.
> > > +  */
> > > + page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> > > + if (page && page_ref_count(page) != 1) {
> > 
> > Do you think the patch "ext4/xfs: add page refcount helper" would be a
> > good cleanup to head th

Re: [PATCH v9 8/8] xfs: Add dax dedupe support

2021-09-15 Thread Darrick J. Wong
On Wed, Sep 15, 2021 at 06:45:01PM +0800, Shiyang Ruan wrote:
> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
> who are going to be deduped.  After that, call compare range function
> only when files are both DAX or not.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Darrick J. Wong 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_file.c|  2 +-
>  fs/xfs/xfs_inode.c   | 80 +---
>  fs/xfs/xfs_inode.h   |  1 +
>  fs/xfs/xfs_reflink.c |  4 +--
>  4 files changed, 80 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2ef1930374d2..c3061723613c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -846,7 +846,7 @@ xfs_wait_dax_page(
>   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  }
>  
> -static int
> +int
>  xfs_break_dax_layouts(
>   struct inode*inode,
>   bool*retry)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a4f6f034fb81..bdc084cdbf46 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3790,6 +3790,61 @@ xfs_iolock_two_inodes_and_break_layout(
>   return 0;
>  }
>  
> +static int
> +xfs_mmaplock_two_inodes_and_break_dax_layout(
> + struct xfs_inode*ip1,
> + struct xfs_inode*ip2)
> +{
> + int error, attempts = 0;
> + boolretry;
> + struct page *page;
> + struct xfs_log_item *lp;
> +
> + if (ip1->i_ino > ip2->i_ino)
> + swap(ip1, ip2);
> +
> +again:
> + retry = false;
> + /* Lock the first inode */
> + xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> + error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> + if (error || retry) {
> + xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> + if (error == 0 && retry)
> + goto again;
> + return error;
> + }
> +
> + if (ip1 == ip2)
> + return 0;
> +
> + /* Nested lock the second inode */
> + lp = >i_itemp->ili_item;
> + if (lp && test_bit(XFS_LI_IN_AIL, >li_flags)) {
> + if (!xfs_ilock_nowait(ip2,
> + xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
> + xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> + if ((++attempts % 5) == 0)
> + delay(1); /* Don't just spin the CPU */
> + goto again;
> + }

I suspect we don't need this part for grabbing the MMAPLOCK^W pagecache
invalidatelock.  The AIL only grabs the ILOCK, never the IOLOCK or the
MMAPLOCK.
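
If the AIL check really can go (I'm not certain of that), the nested
acquisition boils down to "sort by inode number, then take both locks
with plain blocking calls".  Here's a toy userspace model of just that
ordering.  The sketch_* types and the lock log are invented for the
example, and the "lock" is only a flag, not a real rwsem.

```c
#include <stdint.h>

/* Toy inode: an inode number plus a flag standing in for the MMAPLOCK. */
struct sketch_inode {
	uint64_t ino;
	int mmaplock_held;
};

/* Records the order in which locks were taken, for demonstration only. */
struct sketch_lock_log {
	uint64_t order[2];
	int nr;
};

static void sketch_lock(struct sketch_inode *ip, struct sketch_lock_log *log)
{
	ip->mmaplock_held = 1;
	log->order[log->nr++] = ip->ino;
}

/*
 * Take both locks in ascending inode-number order, with no trylock and
 * no backoff loop.  Locking the same inode twice is avoided.
 */
static void sketch_mmaplock_two(struct sketch_inode *ip1,
				struct sketch_inode *ip2,
				struct sketch_lock_log *log)
{
	struct sketch_inode *tmp;

	if (ip1->ino > ip2->ino) {
		tmp = ip1;
		ip1 = ip2;
		ip2 = tmp;
	}

	sketch_lock(ip1, log);
	if (ip1 != ip2)
		sketch_lock(ip2, log);
}
```

The layout-breaking retry in the patch still has to drop both locks and
start over; this only models the lock ordering.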

> + } else
> + xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
> + /*
> +  * We cannot use xfs_break_dax_layouts() directly here because it may
> +  * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
> +  * for this nested lock case.
> +  */
> + page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> + if (page && page_ref_count(page) != 1) {

Do you think the patch "ext4/xfs: add page refcount helper" would be a
good cleanup to head this series?

https://lore.kernel.org/linux-xfs/20210913161604.31981-1-alex.sie...@amd.com/T/#m59cf7cd5c0d521ad487fa3a15d31c3865db88bdf

The rest of the logic looks ok.

--D

> + xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> + xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> + goto again;
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
>   * mmap activity.
> @@ -3804,8 +3859,19 @@ xfs_ilock2_io_mmap(
>   ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
>   if (ret)
>   return ret;
> - filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
> - VFS_I(ip2)->i_mapping);
> +
> + if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
> + ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
> + if (ret) {
> + inode_unlock(VFS_I(ip2));
> + if (ip1 != ip2)
> + inode_unlock(VFS_I(ip1));
> + return ret;
> + }
> + } else
> + filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
> + VFS_I(ip2)->i_mapping);
> +
>   return 0;
>  }
>  
> @@ -3815,8 +3881,14 @@ xfs_iunlock2_io_mmap(
>   struct xfs_inode*ip1,
>   struct xf

Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode

2021-09-15 Thread Darrick J. Wong
On Wed, Sep 15, 2021 at 06:45:00PM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
> After that, newly allocated extents need to be remapped to the file.
> So, add a CoW identification in ->iomap_begin(), and implement
> ->iomap_end() to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c  |  6 +++---
>  fs/xfs/xfs_iomap.c | 38 +-
>  fs/xfs/xfs_iomap.h | 30 ++
>  fs/xfs/xfs_iops.c  |  7 +++
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 75 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 73a36b7be3bd..0681250e0a5d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1009,8 +1009,7 @@ xfs_free_file_space(
>   return 0;
>   if (offset + len > XFS_ISIZE(ip))
>   len = XFS_ISIZE(ip) - offset;
> - error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - &xfs_buffered_write_iomap_ops);
> + error = xfs_iomap_zero_range(ip, offset, len, NULL);
>   if (error)
>   return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7aa943edfc02..2ef1930374d2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -704,7 +704,7 @@ xfs_file_dax_write(
>   pos = iocb->ki_pos;
>  
>   trace_xfs_file_dax_write(iocb, from);
> - ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> + ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
>   if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
>   i_size_write(inode, iocb->ki_pos);
>   error = xfs_setfilesize(ip, pos, ret);
> @@ -1329,8 +1329,8 @@ __xfs_filemap_fault(
>   xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
> -  &xfs_direct_write_iomap_ops :
> -  &xfs_read_iomap_ops);
> + &xfs_dax_write_iomap_ops :
> + &xfs_read_iomap_ops);

Hmm... I wonder if this should get hoisted to a "xfs_dax_iomap_fault"
wrapper like you did for xfs_iomap_zero_range?

>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
>   xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 093758440ad5..6fa3b377cb81 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
>  
>   /* may drop and re-acquire the ilock */
>   error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> - &lockmode, flags & IOMAP_DIRECT);
> + &lockmode,
> + (flags & IOMAP_DIRECT) || IS_DAX(inode));
>   if (error)
>   goto out_unlock;
>   if (shared)
> @@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>   .iomap_begin= xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> + struct inode*inode,
> + loff_t  pos,
> + loff_t  length,
> + ssize_t written,
> + unsignedflags,
> + struct iomap*iomap)

Whitespace nit: ^ space before a tab.

> +{
> + struct xfs_inode*ip = XFS_I(inode);
> + /*
> +  * Usually we use @written to indicate whether the operation was
> +  * successful.  But it is always positive or zero.  The CoW needs the
> +  * actual error code from actor().  So, get it from
> +  * iomap_iter->processed.

Hm.  All six arguments are derived from the struct iomap_iter, so maybe
it makes more sense to pass that in?  I'll poke around with this more
tomorrow.
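
For anyone following along, the trick the patch uses to get at
->processed is the usual container_of pattern: given a pointer to a
member, step back to the enclosing structure.  A standalone userspace
sketch follows.  The sketch_* names and the trimmed-down struct layouts
are made up for the example and are not the real iomap definitions.

```c
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(), minus type checks. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_iomap {
	unsigned long long addr;
};

/* Cut-down iterator that embeds the mapping, as the comment describes. */
struct sketch_iter {
	long long pos;
	long long processed;	/* bytes handled, or a negative error code */
	struct sketch_iomap iomap;
};

/*
 * Recover the iterator from a pointer to its embedded mapping and return
 * the processed count.  This is only valid if the pointer really does
 * point inside a struct sketch_iter.
 */
static long long sketch_processed_from_iomap(struct sketch_iomap *iomap)
{
	struct sketch_iter *iter =
		sketch_container_of(iomap, struct sketch_iter, iomap);

	return iter->processed;
}
```

That validity caveat is the fragile part: the callee has to trust that
every caller hands it a mapping embedded in an iterator, which is one
reason passing the iterator itself might be cleaner.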

> +  */
> + const struct iomap_iter *iter =
> + container_of(iomap, typeof(*iter), iomap);
> +
> + if (!xfs_is_cow_inode(ip))
> + return 0;
> +
> + if (iter->processed <= 0) {
> + xfs_reflink_cancel_cow_range(ip, pos, length, true);
> + return 0;
> + }
> +
> + return xfs_reflink_end_cow(ip, pos, iter->processed);
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> + .iomap_begin= xfs_direct_write_iomap_begin,

Space before tab^

> + .iomap_end  = xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(

Also, we have a related request to drop the EXPERIMENTAL tag for
non-DAX reflink.  Whichever patch enables dax+reflink for xfs needs to
make it clear that reflink + any possibility of DAX emits an
EXPERIMENTAL warning.

--D

>   struct inode  

Re: [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model

2021-09-15 Thread Darrick J. Wong
On Wed, Sep 15, 2021 at 06:44:57PM +0800, Shiyang Ruan wrote:
> Let dax_iomap_zero() support iter model.
> 
> Signed-off-by: Shiyang Ruan 

Oops, I guess we forgot this one when we did the iter conversion last
cycle. :(

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c   | 3 ++-
>  fs/iomap/buffered-io.c | 3 +--
>  include/linux/dax.h| 3 ++-
>  3 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 41c93929f20b..4f346e25e488 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1209,8 +1209,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>  
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length)
>  {
> + const struct iomap *iomap = &iter->iomap;
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
>   long rc, id;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9cc5798423d1..84a861d3b3e0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -889,7 +889,6 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, 
> loff_t pos, u64 length)
>  
>  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
> - struct iomap *iomap = &iter->iomap;
>   const struct iomap *srcmap = iomap_iter_srcmap(iter);
>   loff_t pos = iter->pos;
>   loff_t length = iomap_length(iter);
> @@ -903,7 +902,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, 
> bool *did_zero)
>   s64 bytes;
>  
>   if (IS_DAX(iter->inode))
> - bytes = dax_iomap_zero(pos, length, iomap);
> + bytes = dax_iomap_zero(iter, pos, length);
>   else
>   bytes = __iomap_zero_iter(iter, pos, length);
>   if (bytes < 0)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 2619d94c308d..642de7ef1a10 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -13,6 +13,7 @@ typedef unsigned long dax_entry_t;
>  
>  struct iomap_ops;
>  struct iomap;
> +struct iomap_iter;
>  struct dax_device;
>  struct dax_operations {
>   /*
> @@ -210,7 +211,7 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> pgoff_t index);
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
> +s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length);
>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>   return mapping->host && IS_DAX(mapping->host);
> -- 
> 2.33.0
> 
> 
> 



Re: [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it

2021-09-15 Thread Darrick J. Wong
On Wed, Sep 15, 2021 at 06:44:54PM +0800, Shiyang Ruan wrote:
> Add address output in dax_iomap_pfn() in order to perform a memcpy() in
> CoW case.  Since this function both output address and pfn, rename it to
> dax_iomap_direct_access().
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> Reviewed-by: Dan Williams 

Could've sworn I reviewed this a few revisions ago...
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4e3e5a283a91..8b482a58acae 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1010,8 +1010,8 @@ static sector_t dax_iomap_sector(const struct iomap 
> *iomap, loff_t pos)
>   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
>  }
>  
> -static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
> -  pfn_t *pfnp)
> +static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
> + size_t size, void **kaddr, pfn_t *pfnp)
>  {
>   const sector_t sector = dax_iomap_sector(iomap, pos);
>   pgoff_t pgoff;
> @@ -1023,11 +1023,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, 
> loff_t pos, size_t size,
>   return rc;
>   id = dax_read_lock();
>   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> -NULL, pfnp);
> +kaddr, pfnp);
>   if (length < 0) {
>   rc = length;
>   goto out;
>   }
> + if (!pfnp)
> + goto out_check_addr;
>   rc = -EINVAL;
>   if (PFN_PHYS(length) < size)
>   goto out;
> @@ -1037,6 +1039,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, 
> loff_t pos, size_t size,
>   if (length > 1 && !pfn_t_devmap(*pfnp))
>   goto out;
>   rc = 0;
> +
> +out_check_addr:
> + if (!kaddr)
> + goto out;
> + if (!*kaddr)
> + rc = -EFAULT;
>  out:
>   dax_read_unlock(id);
>   return rc;
> @@ -1401,7 +1409,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>   return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
>   }
>  
> - err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
> + err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
>   if (err)
>   return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
>  
> -- 
> 2.33.0
> 
> 
> 



Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-15 Thread Darrick J. Wong
On Wed, Sep 15, 2021 at 01:27:47PM -0700, Dan Williams wrote:
> On Wed, Sep 15, 2021 at 9:15 AM Darrick J. Wong  wrote:
> >
> > On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:
> > > Hi, Dan,
> > >
> > > On 9/14/2021 9:44 PM, Dan Williams wrote:
> > > > On Tue, Sep 14, 2021 at 4:32 PM Jane Chu  wrote:
> > > > >
> > > > > If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> > > > > > This is unnecessary if hardware is capable of clearing the poison.
> > > > >
> > > > > Though not all dax backend hardware has the capability of clearing
> > > > > poison on the fly, but dax backed by Intel DCPMEM has such capability,
> > > > > and it's desirable to, first, speed up repairing by means of it;
> > > > > second, maintain backend continuity instead of fragmenting it in
> > > > > search for clean blocks.
> > > > >
> > > > > Jane Chu (3):
> > > > >dax: introduce dax_operation dax_clear_poison
> > > >
> > > > The problem with new dax operations is that they need to be plumbed
> > > > not only through fsdax and pmem, but also through device-mapper.
> > > >
> > > > In this case I think we're already covered by dax_zero_page_range().
> > > > That will ultimately trigger pmem_clear_poison() and it is routed
> > > > through device-mapper properly.
> > > >
> > > > Can you clarify why the existing dax_zero_page_range() is not 
> > > > sufficient?
> > >
> > > fallocate ZERO_RANGE is in itself a functionality that applied to dax
> > > should lead to zero out the media range.  So one may argue it is part
> > > of a block operations, and not something explicitly aimed at clearing
> > > poison.
> >
> > Yeah, Christoph suggested that we make the clearing operation explicit
> > in a related thread a few weeks ago:
> > https://lore.kernel.org/linux-fsdevel/yrtnlperhfmz2...@infradead.org/
> 
> That seemed to be tied to a proposal to plumb it all the way out to an
> explicit fallocate() mode, not make it a silent side effect of
> pwrite(). That said pwrite() does clear errors in hard drives in
> not-DAX mode, but I like the change in direction to make it explicit
> going forward.
> 
> > I like Jane's patchset far better than the one that I sent, because it
> > doesn't require a block device wrapper for the pmem, and it enables us
> > to tell application writers that they can handle media errors by
> > pwrite()ing the bad region, just like they do for nvme and spinners.
> 
> pwrite(), hmm, so you're not onboard with the explicit clearing API
> proposal, or...?

I don't really care either way.  I was going to send a reworked version
of that earlier patchset which would add an explicit fallocate mode and
make it work on regular block storage too, but then Jane sent this. :)

Hmm, maybe I should rework my patchset to call dax_zero_page_range
directly...?

> > > I'm also thinking about the MOVDIR64B instruction and how it
> > > might be used to clear poison on the fly with a single 'store'.
> > > Of course, that means we need to figure out how to narrow down the
> > > error blast radius first.
> 
> It turns out the MOVDIR64B error clearing idea runs into a problem with
> the device poison tracking. Without the explicit notification that
> software wanted the error cleared the device may ghost report errors
> that are not there anymore. I think we should continue explicit error
> clearing and notification of the device that the error has been
> cleared (by asking the device to clear it).

If the poison clearing is entirely OOB (i.e. you have to call ACPI
methods) and can't be made part of the memory controller, then I guess
you can't use movdir64b at all, right?

> > That was one of the advantages of Shiyang Ruan's NAKed patchset to
> > enable byte-granularity media errors
> 
> ...the method of triggering reverse mapping had review feedback, I
> apologize if that came across as a NAK of the whole proposal. As I
> clarified to Eric this morning, I think the solution is iterating
> towards upstream inclusion.
> 
> > to pass upwards through the stack
> > back to the filesystem, which could then tell applications exactly what
> > they lost.
> >
> > I want to get back to that, though if Dan won't withdraw the NAK then I
> > don't know how to move forward...
> 
> No NAK in place. Let's go!

Ok, thanks.  I'll start looking through Shiyang's patches tomorrow.

> 
> >
> > > With respect to plumbing through devi

Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-15 Thread Darrick J. Wong
On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:
> Hi, Dan,
> 
> On 9/14/2021 9:44 PM, Dan Williams wrote:
> > On Tue, Sep 14, 2021 at 4:32 PM Jane Chu  wrote:
> > > 
> > > If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> > > This is unnecessary if hardware is capable of clearing the poison.
> > > 
> > > Not all dax backend hardware can clear poison on the fly, but dax
> > > backed by Intel DCPMEM can. Using that capability is desirable for
> > > two reasons: first, it speeds up repair; second, it maintains backend
> > > continuity instead of fragmenting it in search of clean blocks.
> > > 
> > > Jane Chu (3):
> > >dax: introduce dax_operation dax_clear_poison
> > 
> > The problem with new dax operations is that they need to be plumbed
> > not only through fsdax and pmem, but also through device-mapper.
> > 
> > In this case I think we're already covered by dax_zero_page_range().
> > That will ultimately trigger pmem_clear_poison() and it is routed
> > through device-mapper properly.
> > 
> > Can you clarify why the existing dax_zero_page_range() is not sufficient?
> 
> fallocate ZERO_RANGE is a functionality in its own right: applied to
> dax, it should zero out the media range.  So one may argue it is part
> of block operations, and not something explicitly aimed at clearing
> poison.

Yeah, Christoph suggested that we make the clearing operation explicit
in a related thread a few weeks ago:
https://lore.kernel.org/linux-fsdevel/yrtnlperhfmz2...@infradead.org/

I like Jane's patchset far better than the one that I sent, because it
doesn't require a block device wrapper for the pmem, and it enables us
to tell application writers that they can handle media errors by
pwrite()ing the bad region, just like they do for nvme and spinners.
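
For illustration, the application-side pattern could look roughly like
the sketch below.  rewrite_bad_region() is a hypothetical helper, not
part of either patchset, and it assumes the caller already has
replacement data for the damaged range:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite [off, off + len) of fd with replacement
 * data, retrying short writes, then flush the file.
 * Returns 0 on success or a negative errno value.
 */
static int rewrite_bad_region(int fd, off_t off, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -errno;		/* e.g. -EIO if the error persists */
		}
		if (n == 0)
			return -EIO;		/* no progress; avoid looping forever */
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return fsync(fd) < 0 ? -errno : 0;
}
```

Whether the overwrite really clears a media error depends on the kernel
and hardware behaviour being discussed in this thread; the helper only
shows the call pattern an application would use.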

> I'm also thinking about the MOVDIR64B instruction and how it
> might be used to clear poison on the fly with a single 'store'.
> Of course, that means we need to figure out how to narrow down the
> error blast radius first.

That was one of the advantages of Shiyang Ruan's NAKed patchset to
enable byte-granularity media errors to pass upwards through the stack
back to the filesystem, which could then tell applications exactly what
they lost.

I want to get back to that, though if Dan won't withdraw the NAK then I
don't know how to move forward...

> With respect to plumbing through device-mapper, I thought about that,
> and wasn't sure. I mean the clear-poison work will eventually fall on
> the pmem driver, and thru the DM layers, how does that play out thru
> DM?

Each of the dm drivers has to add their own ->clear_poison operation
that remaps the incoming (sector, len) parameters as appropriate for
that device and then calls the lower device's ->clear_poison with the
translated parameters.
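
A rough userspace model of that remapping, assuming a simple linear
target.  struct toy_dax_dev and the toy_* functions are made up for
illustration and are not the kernel's dax or device-mapper interfaces:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stackable dax device. */
struct toy_dax_dev {
	const char *name;
	/* hypothetical hook modeled on the ->clear_poison op discussed above */
	int (*clear_poison)(struct toy_dax_dev *dev, uint64_t sector,
			    uint64_t len);
	struct toy_dax_dev *lower;	/* NULL for the bottom (pmem-like) device */
	uint64_t start;			/* linear target: offset into the lower device */
	uint64_t last_sector;		/* bottom device records what it was asked to clear */
	uint64_t last_len;
};

/* Bottom of the stack: pretend to clear the range and remember it. */
static int toy_pmem_clear_poison(struct toy_dax_dev *dev, uint64_t sector,
				 uint64_t len)
{
	dev->last_sector = sector;
	dev->last_len = len;
	return 0;
}

/* Linear target: translate the sector, then call the lower device's op. */
static int toy_linear_clear_poison(struct toy_dax_dev *dev, uint64_t sector,
				   uint64_t len)
{
	struct toy_dax_dev *lower = dev->lower;

	if (!lower || !lower->clear_poison)
		return -EOPNOTSUPP;
	return lower->clear_poison(lower, dev->start + sector, len);
}
```

A real target would also have to validate the range against its own
size, and a striped or concatenated target might need to split one
request across several lower devices.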

This (AFAICT) has already been done for dax_zero_page_range, so I sense
that Dan is trying to save you a bunch of code plumbing work by nudging
you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
and then you only need patches 2-3.

> BTW, our customer doesn't care about creating dax volume thru DM, so.

They might not care, but anything going upstream should work in the
general case.

--D

> thanks!
> -jane
> 
> 
> > 
> > >dax: introduce dax_clear_poison to dax pwrite operation
> > >libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
> > > 
> > >   drivers/dax/super.c   | 13 +
> > >   drivers/nvdimm/pmem.c | 17 +
> > >   fs/dax.c  |  9 +
> > >   include/linux/dax.h   |  6 ++
> > >   4 files changed, 45 insertions(+)
> > > 
> > > --
> > > 2.18.4
> > > 



Re: [PATCH v8 6/7] xfs: support CoW in fsdax mode

2021-09-02 Thread Darrick J. Wong
On Thu, Sep 02, 2021 at 09:43:08AM +0200, Christoph Hellwig wrote:
> On Sun, Aug 29, 2021 at 08:25:16PM +0800, Shiyang Ruan wrote:
> > In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
> > After that, newly allocated extents need to be remapped to the file.  Add
> > an implementation of ->iomap_end() for dax write ops to do the remapping
> > work.
> 
> Please split the new dax infrastructure from the XFS changes.
> 
> >  static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > -  int *iomap_errp, const struct iomap_ops *ops)
> > +   int *iomap_errp, const struct iomap_ops *ops)
> >  {
> > struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
> > @@ -1631,7 +1664,7 @@ static bool dax_fault_check_fallback(struct vm_fault 
> > *vmf, struct xa_state *xas,
> >  }
> >  
> >  static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > -  const struct iomap_ops *ops)
> > +   const struct iomap_ops *ops)
> 
> These look like unrelated whitespace changes.
> 
> > -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > +loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  {
> > const struct iomap *iomap = &iter->iomap;
> > const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > @@ -918,6 +918,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, 
> > bool *did_zero)
> >  
> > return written;
> >  }
> > +EXPORT_SYMBOL_GPL(iomap_zero_iter);
> 
> I don't see why this would have to be exported.
> 
> > +   unsigned            flags,
> > +   struct iomap        *iomap)
> > +{
> > +   int                 error = 0;
> > +   struct xfs_inode    *ip = XFS_I(inode);
> > +   bool                cow = xfs_is_cow_inode(ip);
> 
> The cow variable is only used once, so I think we can drop it.
> 
> > +   const struct iomap_iter *iter =
> > +   container_of(iomap, typeof(*iter), iomap);
> 
> Please comment this as it is a little unusual.
> 
> > +
> > +   if (cow) {
> > +   if (iter->processed <= 0)
> > +   xfs_reflink_cancel_cow_range(ip, pos, length, true);
> > +   else
> > +   error = xfs_reflink_end_cow(ip, pos, iter->processed);
> > +   }
> > +   return error ?: iter->processed;
> 
> The ->iomap_end convention is to return 0 or a negative error code.
> Also I'd much prefer to just spell this out in a normal sequential way:
> 
>   if (!xfs_is_cow_inode(ip))
>           return 0;
> 
>   if (iter->processed <= 0) {
>           xfs_reflink_cancel_cow_range(ip, pos, length, true);
>           return 0;
>   }
> 
>   return xfs_reflink_end_cow(ip, pos, iter->processed);

Seeing as written either contains iter->processed if it's positive, or
zero if nothing got written or there were errors, I wonder why this
isn't just:

	if (!xfs_is_cow_inode(ip))
		return 0;

	if (!written) {
		xfs_reflink_cancel_cow_range(ip, pos, length, true);
		return 0;
	}

	return xfs_reflink_end_cow(ip, pos, written);

? (He says while cleaning up trying to leave for vacation, pardon me
if this comment is totally boneheaded...)
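
To make the return convention concrete, here is a small userspace model
of that control flow.  The toy_* names are hypothetical stand-ins for
xfs_is_cow_inode(), xfs_reflink_cancel_cow_range() and
xfs_reflink_end_cow(); the sketch only models "return 0 or a negative
error", not real XFS behaviour:

```c
#include <stdbool.h>

/* Hypothetical inode model: just enough state to observe the calls. */
struct toy_inode {
	bool is_cow;		/* stand-in for xfs_is_cow_inode() */
	int cancelled;		/* how many times the CoW range was cancelled */
	int ended;		/* how many times the CoW range was completed */
	int end_error;		/* value the completion stub should return */
};

/* Stand-in for xfs_reflink_cancel_cow_range(). */
static void toy_cancel_cow_range(struct toy_inode *ip)
{
	ip->cancelled++;
}

/* Stand-in for xfs_reflink_end_cow(): 0 or a negative error code. */
static int toy_end_cow(struct toy_inode *ip, long long pos, long long count)
{
	(void)pos;
	(void)count;
	ip->ended++;
	return ip->end_error;
}

/* Models the proposed ->iomap_end body: never returns a byte count. */
static int toy_dax_iomap_end(struct toy_inode *ip, long long pos,
			     long long length, long long written)
{
	(void)length;

	if (!ip->is_cow)
		return 0;

	if (!written) {
		/* nothing was written (or the write failed): drop the CoW blocks */
		toy_cancel_cow_range(ip);
		return 0;
	}

	return toy_end_cow(ip, pos, written);
}
```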

--D

> > +static inline int
> > +xfs_iomap_zero_range(
> > +   struct xfs_inode    *ip,
> > +   loff_t              pos,
> > +   loff_t              len,
> > +   bool                *did_zero)
> > +{
> > +   struct inode        *inode = VFS_I(ip);
> > +
> > +   return IS_DAX(inode)
> > +           ? dax_iomap_zero_range(inode, pos, len, did_zero,
> > +                                  &xfs_dax_write_iomap_ops)
> > +           : iomap_zero_range(inode, pos, len, did_zero,
> > +                              &xfs_buffered_write_iomap_ops);
> > +}
> 
>   if (IS_DAX(inode))
>           return dax_iomap_zero_range(inode, pos, len, did_zero,
>                                       &xfs_dax_write_iomap_ops);
>   return iomap_zero_range(inode, pos, len, did_zero,
>                           &xfs_buffered_write_iomap_ops);
> 
> > +static inline int
> > +xfs_iomap_truncate_page(
> > +   struct xfs_inode    *ip,
> > +   loff_t              pos,
> > +   bool                *did_zero)
> > +{
> > +   struct inode        *inode = VFS_I(ip);
> > +
> > +   return IS_DAX(inode)
> > +           ? dax_iomap_truncate_page(inode, pos, did_zero,
> > +                                     &xfs_dax_write_iomap_ops)
> > +           : iomap_truncate_page(inode, pos, did_zero,
> > +                                 &xfs_buffered_write_iomap_ops);
> > +}
> 
> Same here.



Re: [PATCH][next] xfs: Fix fall-through warnings for Clang

2021-04-20 Thread Darrick J. Wong
On Tue, Apr 20, 2021 at 06:06:52PM -0500, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix
> the following warnings by replacing /* fall through */ comments,
> and its variants, with the new pseudo-keyword macro fallthrough:
> 
> fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between 
> switch labels [-Wimplicit-fallthrough]
> fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between 
> switch labels [-Wimplicit-fallthrough]
> fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between 
> switch labels [-Wimplicit-fallthrough]
> fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between 
> switch labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between 
> switch labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between 
> switch labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch 
> labels [-Wimplicit-fallthrough]
> 
> Notice that Clang doesn't recognize /* fall through */ comments as
> implicit fall-through markings, so in order to globally enable
> -Wimplicit-fallthrough for Clang, these comments need to be
> replaced with fallthrough; in the whole codebase.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 
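
For readers outside the kernel tree, the conversion described in the
commit message can be approximated as below.  The macro mirrors the
attribute-based definition the kernel uses; blocks_to_charge() and its
case values are invented purely to show the annotation in use:

```c
/*
 * Sketch only: approximate the kernel's fallthrough pseudo-keyword in a
 * standalone program.  Compilers without the attribute get a no-op.
 */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough	__attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough	do {} while (0)	/* fallthrough */
#endif

/* Hypothetical example function, not taken from fs/xfs. */
static int blocks_to_charge(int resv_type, int len)
{
	int charge = 0;

	switch (resv_type) {
	case 2:
		charge += len;
		fallthrough;	/* deliberate: case 2 also pays the case 1 cost */
	case 1:
		charge += 1;
		break;
	default:
		break;
	}
	return charge;
}
```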

I've already NAKd this twice, so I guess I'll NAK it a third time.

--D

> ---
>  fs/xfs/libxfs/xfs_ag_resv.c  | 4 ++--
>  fs/xfs/libxfs/xfs_alloc.c| 2 +-
>  fs/xfs/libxfs/xfs_da_btree.c | 2 +-
>  fs/xfs/scrub/bmap.c  | 2 +-
>  fs/xfs/scrub/btree.c | 2 +-
>  fs/xfs/scrub/common.c| 6 +++---
>  fs/xfs/scrub/dabtree.c   | 2 +-
>  fs/xfs/scrub/repair.c| 2 +-
>  fs/xfs/xfs_bmap_util.c   | 2 +-
>  fs/xfs/xfs_export.c  | 4 ++--
>  fs/xfs/xfs_file.c| 2 +-
>  fs/xfs/xfs_inode.c   | 2 +-
>  fs/xfs/xfs_ioctl.c   | 4 ++--
>  fs/xfs/xfs_iomap.c   | 2 +-
>  fs/xfs/xfs_trans_buf.c   | 2 +-
>  15 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 6c5f8d10589c..8c3c99a9bf83 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -342,7 +342,7 @@ xfs_ag_resv_alloc_extent(
>   break;
>   default:
>   ASSERT(0);
> - /* fall through */
> + fallthrough;
>   case XFS_AG_RESV_NONE:
>   field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS :
>  XFS_TRANS_SB_FDBLOCKS;
> @@ -384,7 +384,7 @@ xfs_ag_resv_free_extent(
>   break;
>   default:
>   ASSERT(0);
> - /* fall through */
> + fallthrough;
>   case XFS_AG_RESV_NONE:
>   xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
>   return;
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index aaa19101bb2a..9eabdeeec492 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -3163,7 +3163,7 @@ xfs_alloc_vextent(
>   }
>   args->agbno = XFS_FSB_TO_AGBNO(mp, args->fsbno);
>   args->type = XFS_ALLOCTYPE_NEAR_BNO;
> - /* FALLTHROUGH */
> + 

Re: linux-next: manual merge of the vfs tree with the xfs tree

2021-04-20 Thread Darrick J. Wong
On Mon, Apr 19, 2021 at 10:49:48AM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the vfs tree got a conflict in:
> 
>   fs/xfs/xfs_ioctl.c
> 
> between commit:
> 
>   b2197a36c0ef ("xfs: remove XFS_IFEXTENTS")
> 
> from the xfs tree and commit:
> 
>   9fefd5db08ce ("xfs: convert to fileattr")
> 
> from the vfs tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

This looks like a good resolution to the merge conflict, thank you!

--D

> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc fs/xfs/xfs_ioctl.c
> index bf490bfae6cb,bbda105a2ce5..
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@@ -1056,77 -1057,17 +1057,19 @@@ xfs_ioc_ag_geometry
>   static void
>   xfs_fill_fsxattr(
>   struct xfs_inode        *ip,
> - bool                    attr,
> - struct fsxattr          *fa)
> + int                     whichfork,
> + struct fileattr         *fa)
>   {
>  +struct xfs_mount        *mp = ip->i_mount;
> - struct xfs_ifork        *ifp = attr ? ip->i_afp : &ip->i_df;
> + struct xfs_ifork        *ifp = XFS_IFORK_PTR(ip, whichfork);
>   
> - simple_fill_fsxattr(fa, xfs_ip2xflags(ip));
> + fileattr_fill_xflags(fa, xfs_ip2xflags(ip));
>  -fa->fsx_extsize = ip->i_d.di_extsize << ip->i_mount->m_sb.sb_blocklog;
>  -fa->fsx_cowextsize = ip->i_d.di_cowextsize <<
>  -ip->i_mount->m_sb.sb_blocklog;
>  -fa->fsx_projid = ip->i_d.di_projid;
>  -if (ifp && (ifp->if_flags & XFS_IFEXTENTS))
>  +
>  +fa->fsx_extsize = XFS_FSB_TO_B(mp, ip->i_extsize);
>  +if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  +fa->fsx_cowextsize = XFS_FSB_TO_B(mp, ip->i_cowextsize);
>  +fa->fsx_projid = ip->i_projid;
>  +if (ifp && !xfs_need_iread_extents(ifp))
>   fa->fsx_nextents = xfs_iext_count(ifp);
>   else
>   fa->fsx_nextents = xfs_ifork_nextents(ifp);
> @@@ -1212,10 -1167,10 +1169,10 @@@ static in
>   xfs_ioctl_setattr_xflags(
>   struct xfs_trans        *tp,
>   struct xfs_inode        *ip,
> - struct fsxattr          *fa)
> + struct fileattr         *fa)
>   {
>   struct xfs_mount        *mp = ip->i_mount;
>  -uint64_t                di_flags2;
>  +uint64_t                i_flags2;
>   
>   /* Can't change realtime flag if any extents are allocated. */
>   if ((ip->i_df.if_nextents || ip->i_delayed_blks) &&
> @@@ -1348,8 -1289,11 +1291,11 @@@ xfs_ioctl_setattr_check_extsize
>   xfs_extlen_t            size;
>   xfs_fsblock_t   extsize_fsb;
>   
> + if (!fa->fsx_valid)
> + return 0;
> + 
>   if (S_ISREG(VFS_I(ip)->i_mode) && ip->i_df.if_nextents &&
>  -((ip->i_d.di_extsize << mp->m_sb.sb_blocklog) != fa->fsx_extsize))
>  +((ip->i_extsize << mp->m_sb.sb_blocklog) != fa->fsx_extsize))
>   return -EINVAL;
>   
>   if (fa->fsx_extsize == 0)
> @@@ -1520,18 -1476,18 +1478,19 @@@ xfs_fileattr_set
>* extent size hint should be set on the inode. If no extent size flags
>* are set on the inode then unconditionally clear the extent size hint.
>*/
>  -if (ip->i_d.di_flags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT))
>  -ip->i_d.di_extsize = fa->fsx_extsize >> mp->m_sb.sb_blocklog;
>  -else
>  -ip->i_d.di_extsize = 0;
>  -if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
>  -(ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
>  -ip->i_d.di_cowextsize = fa->fsx_cowextsize >>
>  -mp->m_sb.sb_blocklog;
>  +if (ip->i_diflags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT))
>  +ip->i_extsize = XFS_B_TO_FSB(mp, fa->fsx_extsize);
>   else
>  -ip->i_d.di_cowextsize = 0;
>  +ip->i_extsize = 0;
>  +
>  +if (xfs_sb_version_has_v3inode(&mp->m_sb)) {
>  +if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  +ip->i_cowextsize = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
>  +else
>  +ip->i_cowextsize = 0;
>  +}
>   
> + skip_xattr:
>   error = xfs_trans_commit(tp);
>   
>   /*




Re: linux-next: manual merge of the vfs tree with the xfs tree

2021-04-13 Thread Darrick J. Wong
On Mon, Apr 12, 2021 at 12:22:11PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the vfs tree got a conflict in:
> 
>   fs/xfs/xfs_ioctl.c
> 
> between commits:
> 
>   ceaf603c7024 ("xfs: move the di_projid field to struct xfs_inode")
>   031474c28a3a ("xfs: move the di_extsize field to struct xfs_inode")
>   b33ce57d3e61 ("xfs: move the di_cowextsize field to struct xfs_inode")
>   4800887b4574 ("xfs: cleanup xfs_fill_fsxattr")
>   ee7b83fd365e ("xfs: use a union for i_cowextsize and i_flushiter")
>   db07349da2f5 ("xfs: move the di_flags field to struct xfs_inode")
>   3e09ab8fdc4d ("xfs: move the di_flags2 field to struct xfs_inode")
> 
> from the xfs tree and commit:
> 
>   280cad4ac884 ("xfs: convert to fileattr")
> 
> from the vfs tree.
> 
> I fixed it up (I think - see below) and can carry the fix as
> necessary. This is now fixed as far as linux-next is concerned, but any
> non trivial conflicts should be mentioned to your upstream maintainer
> when your tree is submitted for merging.  You may also want to consider
> cooperating with the maintainer of the conflicting tree to minimise any
> particularly complex conflicts.

This looks correct to me; thanks for pointing out the merge conflict! :)

--D

> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc fs/xfs/xfs_ioctl.c
> index 708b77341a70,bbda105a2ce5..
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@@ -1056,76 -1057,16 +1057,18 @@@ xfs_ioc_ag_geometry
>   static void
>   xfs_fill_fsxattr(
>   struct xfs_inode        *ip,
> - bool                    attr,
> - struct fsxattr          *fa)
> + int                     whichfork,
> + struct fileattr         *fa)
>   {
>  +struct xfs_mount        *mp = ip->i_mount;
> - struct xfs_ifork        *ifp = attr ? ip->i_afp : &ip->i_df;
> + struct xfs_ifork        *ifp = XFS_IFORK_PTR(ip, whichfork);
>   
> - simple_fill_fsxattr(fa, xfs_ip2xflags(ip));
> + fileattr_fill_xflags(fa, xfs_ip2xflags(ip));
>  -fa->fsx_extsize = ip->i_d.di_extsize << ip->i_mount->m_sb.sb_blocklog;
>  -fa->fsx_cowextsize = ip->i_d.di_cowextsize <<
>  -ip->i_mount->m_sb.sb_blocklog;
>  -fa->fsx_projid = ip->i_d.di_projid;
>  +
>  +fa->fsx_extsize = XFS_FSB_TO_B(mp, ip->i_extsize);
>  +if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  +fa->fsx_cowextsize = XFS_FSB_TO_B(mp, ip->i_cowextsize);
>  +fa->fsx_projid = ip->i_projid;
>   if (ifp && (ifp->if_flags & XFS_IFEXTENTS))
>   fa->fsx_nextents = xfs_iext_count(ifp);
>   else
> @@@ -1212,10 -1167,10 +1169,10 @@@ static in
>   xfs_ioctl_setattr_xflags(
>   struct xfs_trans        *tp,
>   struct xfs_inode        *ip,
> - struct fsxattr          *fa)
> + struct fileattr         *fa)
>   {
>   struct xfs_mount        *mp = ip->i_mount;
>  -uint64_t                di_flags2;
>  +uint64_t                i_flags2;
>   
>   /* Can't change realtime flag if any extents are allocated. */
>   if ((ip->i_df.if_nextents || ip->i_delayed_blks) &&
> @@@ -1348,8 -1289,11 +1291,11 @@@ xfs_ioctl_setattr_check_extsize
>   xfs_extlen_t            size;
>   xfs_fsblock_t   extsize_fsb;
>   
> + if (!fa->fsx_valid)
> + return 0;
> + 
>   if (S_ISREG(VFS_I(ip)->i_mode) && ip->i_df.if_nextents &&
>  -((ip->i_d.di_extsize << mp->m_sb.sb_blocklog) != fa->fsx_extsize))
>  +((ip->i_extsize << mp->m_sb.sb_blocklog) != fa->fsx_extsize))
>   return -EINVAL;
>   
>   if (fa->fsx_extsize == 0)
> @@@ -1520,18 -1476,18 +1478,19 @@@ xfs_fileattr_set
>* extent size hint should be set on the inode. If no extent size flags
>* are set on the inode then unconditionally clear the extent size hint.
>*/
>  -if (ip->i_d.di_flags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT))
>  -ip->i_d.di_extsize = fa->fsx_extsize >> mp->m_sb.sb_blocklog;
>  -else
>  -ip->i_d.di_extsize = 0;
>  -if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
>  -(ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
>  -ip->i_d.di_cowextsize = fa->fsx_cowextsize >>
>  -mp->m_sb.sb_blocklog;
>  +if (ip->i_diflags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT))
>  +ip->i_extsize = XFS_B_TO_FSB(mp, fa->fsx_extsize);
>   else
>  -ip->i_d.di_cowextsize = 0;
>  +ip->i_extsize = 0;
>  +
>  +if (xfs_sb_version_has_v3inode(&mp->m_sb)) {
>  +if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  +ip->i_cowextsize = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
>  +else
>  +ip->i_cowextsize = 0;
>  +}
>   
> + skip_xattr:
>   error = xfs_trans_commit(tp);
>   
>   /*



