Re: [PATCH 07/13] btrfs: basic direct read operation

2019-08-12 Thread Ritesh Harjani



On 8/3/19 3:30 AM, Goldwyn Rodrigues wrote:

From: Goldwyn Rodrigues 

Add btrfs_dio_iomap_ops with an .iomap_begin() function. In order to
accommodate dio reads, add a new function btrfs_file_read_iter()
which calls btrfs_dio_iomap_read() for DIO reads and
falls back to generic_file_read_iter() otherwise.

Signed-off-by: Goldwyn Rodrigues 
---
  fs/btrfs/ctree.h |  2 ++
  fs/btrfs/file.c  | 10 +-
  fs/btrfs/iomap.c | 20 
  3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7a4ff524dc77..9eca2d576dd1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3247,7 +3247,9 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
  loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
  loff_t len, unsigned int remap_flags);
+/* iomap.c */
  size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
+ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to);
  
  /* tree-defrag.c */

  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f7087e28ac08..997eb152a35a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2839,9 +2839,17 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
return generic_file_open(inode, filp);
  }
  
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)

+{
+   if (iocb->ki_flags & IOCB_DIRECT)
+   return btrfs_dio_iomap_read(iocb, to);


No provision to fall back to a buffered read? I am not sure from the btrfs
perspective, but previously generic_file_read_iter() could fall through to
a buffered read, e.g. when the direct I/O could not be completed (returned
0, or less than the requested number of bytes).
Is that fallback not required anymore for btrfs once we move to the iomap
infrastructure? (See the rough sketch at the end of this mail for the kind
of fallback I mean.)

Correct me if I am missing anything here.


+
+   return generic_file_read_iter(iocb, to);
+}
+
  const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
-   .read_iter  = generic_file_read_iter,
+   .read_iter  = btrfs_file_read_iter,
.splice_read= generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap   = btrfs_file_mmap,
diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
index 879038e2f1a0..36df606fc028 100644
--- a/fs/btrfs/iomap.c
+++ b/fs/btrfs/iomap.c
@@ -420,3 +420,23 @@ size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
return written;
  }
  
+static int btrfs_dio_iomap_begin(struct inode *inode, loff_t pos,

+   loff_t length, unsigned flags, struct iomap *iomap,
+   struct iomap *srcmap)
+{
+   return get_iomap(inode, pos, length, iomap);
+}
+
+static const struct iomap_ops btrfs_dio_iomap_ops = {
+   .iomap_begin= btrfs_dio_iomap_begin,
+};
+
+ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct inode *inode = file_inode(iocb->ki_filp);
+   ssize_t ret;
+   inode_lock_shared(inode);
+   ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, NULL);
+   inode_unlock_shared(inode);
+   return ret;
+}
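
For reference, a rough and untested sketch of the kind of buffered-read
fallback I am asking about above; it assumes generic_file_buffered_read()
(or some equivalent helper) is exported and callable from btrfs, which may
not be the case today:

static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	ssize_t ret = 0;

	if (iocb->ki_flags & IOCB_DIRECT) {
		ret = btrfs_dio_iomap_read(iocb, to);
		/*
		 * A DIO error, a fully satisfied request, or a read
		 * at/after EOF is final ...
		 */
		if (ret < 0 || !iov_iter_count(to) ||
		    iocb->ki_pos >= i_size_read(file_inode(iocb->ki_filp)))
			return ret;
	}

	/*
	 * ... otherwise fall back to a buffered read for the remainder,
	 * the way generic_file_read_iter() does for short direct reads.
	 */
	return generic_file_buffered_read(iocb, to, ret);
}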




Re: Odd locking pattern introduced as part of "nowait aio support"

2019-09-11 Thread Ritesh Harjani

Hi,

On 9/11/19 3:09 PM, Andres Freund wrote:

Hi,

On 2019-09-11 14:04:20 +1000, Dave Chinner wrote:

On Tue, Sep 10, 2019 at 03:33:27PM -0700, Andres Freund wrote:

Hi,

Especially with buffered io it's fairly easy to hit contention on the
inode lock, during writes. With something like io_uring, it's even
easier, because it currently (but see [1]) farms out buffered writes to
workers, which then can easily contend on the inode lock, even if only
one process submits writes.  But I've seen it in plenty of other cases too.

Looking at the code I noticed that several parts of the "nowait aio
support" (cf 728fbc0e10b7f3) series introduced code like:

static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
...
	if (!inode_trylock(inode)) {
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;
		inode_lock(inode);
	}


The ext4 code is just buggy here - we don't support RWF_NOWAIT on
buffered writes.

But both buffered and non-buffered writes go through
ext4_file_write_iter(). And there's a preceding check, at least these
days, preventing IOCB_NOWAIT from applying to buffered writes:

	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
		return -EOPNOTSUPP;



-EOPNOTSUPP is now taken care of in the ext4 iomap patch series as well.




I do really wish buffered NOWAIT was supported... The overhead of having
to do async buffered writes through the workqueue in io_uring, even if
an already existing page is targeted, is quite noticeable. Even if it
failed with EAGAIN as soon as the buffered write's target isn't in the
page cache, it'd be worthwhile.



isn't trylocking and then locking in a blocking fashion an inefficient
pattern? I.e. I think this should be

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock(inode))
			return -EAGAIN;
	} else
		inode_lock(inode);


Yes, you are right.

History: commit 91f9943e1c7b ("fs: support RWF_NOWAIT
for buffered reads") introduced the first locking pattern
you describe, in XFS.


It seems that, as part of the nowait work, the pattern was introduced in
ext4, xfs and btrfs, and fixed in xfs.



That was followed soon after by:

commit 942491c9e6d631c012f3c4ea8eb0b02edeab
Author: Christoph Hellwig 
Date:   Mon Oct 23 18:31:50 2017 -0700

 xfs: fix AIM7 regression
 
 Apparently our current rwsem code doesn't like doing the trylock, then

 lock for real scheme.  So change our read/write methods to just do the
 trylock for the RWF_NOWAIT case.  This fixes a ~25% regression in
 AIM7.
 
 Fixes: 91f9943e ("fs: support RWF_NOWAIT for buffered reads")

 Reported-by: kernel test robot 
 Signed-off-by: Christoph Hellwig 
 Reviewed-by: Darrick J. Wong 
 Signed-off-by: Darrick J. Wong 

Which changed all the trylock/eagain/lock patterns to the second
form you quote. None of the other filesystems had AIM7 regressions
reported against them, so nobody changed them.


:(



Obviously this isn't going to improve scalability to a very significant
degree. But not unnecessarily doing two atomic ops on a contended lock
can't hurt scalability either. Also, the current code just seems
confusing.

Am I missing something?


Just that the sort of performance regression testing that uncovers
this sort of thing isn't widely done, and most filesystems are
concurrency-limited in some way before they hit inode lock
scalability issues. Hence filesystem-concurrency-focussed
benchmarks that could uncover it (like AIM7) won't, because the inode
locks don't end up stressed enough to make a difference to
benchmark performance.


Ok.  Goldwyn, do you want to write a patch, or do you want me to write
one up?


I am anyway looking into an ext4 performance issue with a mixed parallel
DIO workload. That work will require some new APIs for inode locking,
similar to what XFS has (see the sketch below for the kind of helper I
have in mind).
As part of that I can take care of the semantics you report here (which
XFS already handles in the patch above) for ext4.
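
Purely as an illustration (the helper name and its exact shape are
hypothetical, not an existing ext4 API), something along these lines:

static inline int ext4_ilock_iocb(struct kiocb *iocb, struct inode *inode,
				  bool shared)
{
	if (iocb->ki_flags & IOCB_NOWAIT) {
		/* non-blocking callers only ever try the lock once */
		if (shared ? !inode_trylock_shared(inode)
			   : !inode_trylock(inode))
			return -EAGAIN;
	} else {
		if (shared)
			inode_lock_shared(inode);
		else
			inode_lock(inode);
	}
	return 0;
}

so that the read/write/DAX paths can all share one helper instead of
open-coding the trylock-then-lock pattern.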



Thanks
-ritesh




Greetings,

Andres Freund





Re: [PATCH 2/3] ext4: fix inode rwsem regression

2019-09-12 Thread Ritesh Harjani

cc'd Matthew as well.

On 9/11/19 10:15 PM, Goldwyn Rodrigues wrote:

From: Goldwyn Rodrigues 

This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression")
Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme.  So change our read/write methods to just do the
trylock for the RWF_NOWAIT case.

Fixes: 728fbc0e10b7 ("ext4: nowait aio support")
Signed-off-by: Goldwyn Rodrigues 


This patch will conflict with the recent iomap patch series.
So if this one gets queued up first, the iomap patch series will
need to rebase and factor these changes into the new APIs.

Otherwise looks good to me!

Reviewed-by: Ritesh Harjani 



---
  fs/ext4/file.c | 17 +++--
  1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 70b0438dbc94..d5b2d0cc325d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -40,11 +40,13 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
  
-	if (!inode_trylock_shared(inode)) {

-   if (iocb->ki_flags & IOCB_NOWAIT)
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock_shared(inode))
return -EAGAIN;
+   } else {
inode_lock_shared(inode);
}
+
/*
 * Recheck under inode lock - at this point we are sure it cannot
 * change anymore
@@ -190,11 +192,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
  
-	if (!inode_trylock(inode)) {

-   if (iocb->ki_flags & IOCB_NOWAIT)
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
return -EAGAIN;
+   } else {
inode_lock(inode);
}
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -233,9 +237,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
return -EOPNOTSUPP;
  
-	if (!inode_trylock(inode)) {

-   if (iocb->ki_flags & IOCB_NOWAIT)
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
return -EAGAIN;
+   } else {
inode_lock(inode);
}
  





Re: [PATCH v3 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

From: Shiyang Ruan 

This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.



Thanks for the patchset. I have tried to review the series for logical
correctness and, to some extent, functional correctness.
Since I am not well versed with the core CoW functionality, I may have
asked for clarification in places where I could not follow 100% of what
the code is doing.




(Rebased on v5.11)
==


Thanks. Yes, I see some conflicts when I tried to apply it on the latest kernel.

-ritesh


Re: [PATCH v3 01/10] fsdax: Factor helpers to simplify dax fault code

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

The dax page fault code is too long and a bit difficult to read, and it
is hard to follow when we try to add new features. Some of the
PTE/PMD code paths have similar logic. So, factor them out as helper
functions to simplify the code.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
---
  fs/dax.c | 152 ++-
  1 file changed, 84 insertions(+), 68 deletions(-)



The refactoring and the changes look good to me.
Feel free to add.

Reviewed-by: Ritesh Harjani 


Re: [PATCH v3 03/10] fsdax: Output address in dax_iomap_pfn() and rename it

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

Add an address output to dax_iomap_pfn() in order to perform a memcpy() in
the CoW case.  Since this function now outputs both the address and the pfn,
rename it to dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
---
  fs/dax.c | 16 
  1 file changed, 12 insertions(+), 4 deletions(-)



I like the naming convention; it is consistent with the
dax_direct_access() function.

The changes look good to me. Feel free to add.

Reviewed-by: Ritesh Harjani 


Re: [PATCH v3 02/10] fsdax: Factor helper: dax_fault_actor()

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

The core logic in the two dax page fault functions is similar. So, move
the logic into a common helper function. Also, to facilitate the
addition of new features, such as CoW, switch-case is no longer used to
handle different iomap types.

Signed-off-by: Shiyang Ruan 
---
  fs/dax.c | 291 +++
  1 file changed, 145 insertions(+), 146 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7031e4302b13..33ddad0f3091 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1053,6 +1053,66 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
return ret;
  }
  
+#ifdef CONFIG_FS_DAX_PMD

+static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+   struct iomap *iomap, void **entry)
+{
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   unsigned long pmd_addr = vmf->address & PMD_MASK;
+   struct vm_area_struct *vma = vmf->vma;
+   struct inode *inode = mapping->host;
+   pgtable_t pgtable = NULL;
+   struct page *zero_page;
+   spinlock_t *ptl;
+   pmd_t pmd_entry;
+   pfn_t pfn;
+
+   zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
+
+   if (unlikely(!zero_page))
+   goto fallback;
+
+   pfn = page_to_pfn_t(zero_page);
+   *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
+   DAX_PMD | DAX_ZERO_PAGE, false);
+
+   if (arch_needs_pgtable_deposit()) {
+   pgtable = pte_alloc_one(vma->vm_mm);
+   if (!pgtable)
+   return VM_FAULT_OOM;
+   }
+
+   ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
+   if (!pmd_none(*(vmf->pmd))) {
+   spin_unlock(ptl);
+   goto fallback;
+   }
+
+   if (pgtable) {
+   pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+   mm_inc_nr_ptes(vma->vm_mm);
+   }
+   pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
+   pmd_entry = pmd_mkhuge(pmd_entry);
+   set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
+   spin_unlock(ptl);
+   trace_dax_pmd_load_hole(inode, vmf, zero_page, *entry);
+   return VM_FAULT_NOPAGE;
+
+fallback:
+   if (pgtable)
+   pte_free(vma->vm_mm, pgtable);
+   trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
+   return VM_FAULT_FALLBACK;
+}
+#else
+static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+   struct iomap *iomap, void **entry)
+{
+   return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_FS_DAX_PMD */
+
  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
  {
sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
@@ -1289,6 +1349,61 @@ static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap,
return 0;
  }
  
+/**

+ * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
+ * @vmf:   vm fault instance
+ * @pfnp:  pfn to be returned
+ * @xas:   the dax mapping tree of a file
+ * @entry: an unlocked dax entry to be inserted
+ * @pmd:   distinguish whether it is a pmd fault
+ * @flags: iomap flags
+ * @iomap: from iomap_begin()
+ * @srcmap:from iomap_begin(), not equal to iomap if it is a CoW
+ */
+static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
+   struct xa_state *xas, void *entry, bool pmd, unsigned int flags,
+   struct iomap *iomap, struct iomap *srcmap)
+{
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
+   loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;


Shouldn't we use xa_index here for pos? i.e.:
(loff_t)xas->xa_index << PAGE_SHIFT;


+   bool write = vmf->flags & FAULT_FLAG_WRITE;
+   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
+   int err = 0;
+   pfn_t pfn;
+
+   /* if we are reading UNWRITTEN and HOLE, return a hole. */
+   if (!write &&
+   (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
+   if (!pmd)
+   return dax_load_hole(xas, mapping, &entry, vmf);
+   else
+   return dax_pmd_load_hole(xas, vmf, iomap, &entry);
+   }
+
+   if (iomap->type != IOMAP_MAPPED) {
+   WARN_ON_ONCE(1);
+   return VM_FAULT_SIGBUS;
+   }


So now, if the mapping type is not IOMAP_MAPPED, we always return
VM_FAULT_SIGBUS, whereas earlier we were only doing WARN_ON_ONCE(1).
Can you please explain the change in behavior?





+
+   err = dax_iomap_pfn(iomap, pos, size, &pfn);
+   if (err)
+   return dax_fault_return(err);


Same here: this can now return SIGBUS, while earlier (I am not sure why)
we were only returning FALLBACK.




+
+   entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
+

Re: [PATCH v3 04/10] fsdax: Introduce dax_iomap_cow_copy()

2021-03-23 Thread Ritesh Harjani




On 3/19/21 7:22 AM, Shiyang Ruan wrote:

In the case where the operation is a write and iomap is not equal
to srcmap after iomap_begin(), we consider it a CoW operation.

The destination extent which iomap indicates is a newly allocated extent,
so the data needs to be copied from srcmap to that new extent.
In theory, it is better to copy only the head and tail ranges outside of
the written area instead of copying the whole aligned range. But in a dax
page fault the range is always aligned, so we have to copy the whole
range in this case.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
---
  fs/dax.c | 71 
  1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a70e6aa285bb..181aad97136a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1037,6 +1037,51 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
return rc;
  }
  
+/*

+ * Copy the head and tail part of the pages not included in the write but
+ * required for CoW, because pos/pos+length are not page aligned.  But in dax
+ * page fault case, the range is page aligned, we need to copy the whole range
+ * of data.  Use copy_edge to distinguish these cases.
+ */



Is this version better? Feel free to update/change it.

dax_iomap_cow_copy(): This can be called from two places.
Either during a DAX write fault, to copy 'length' bytes of data to daddr;
or during a normal DAX write, where dax_iomap_actor() may call this to
copy either the unaligned start or the unaligned end of the range. In the
latter case the copy of the aligned ranges is taken care of by
dax_iomap_actor() itself.

Also note that a DAX fault will always result in an aligned pos and
pos + length.

* @pos: address to copy from.
* @length: size of the copy operation.
* @align_size: alignment w.r.t. align_size (either PMD_SIZE or PAGE_SIZE)
* @srcmap: iomap srcmap
* @daddr: destination address to copy to.


+static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
+   struct iomap *srcmap, void *daddr, bool copy_edge)


Do we need the bool copy_edge here?
We can detect the non-aligned case directly, if head_off != 0 or
pg_end != end, no? (See the sketch further below.)




+{
+   loff_t head_off = pos & (align_size - 1);
+   size_t size = ALIGN(head_off + length, align_size);
+   loff_t end = pos + length;
+   loff_t pg_end = round_up(end, align_size);
+   void *saddr = 0;
+   int ret = 0;
+
+   ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+   if (ret)
+   return ret;
+
+   if (!copy_edge)
+   return copy_mc_to_kernel(daddr, saddr, length);
+
+   /* Copy the head part of the range.  Note: we pass offset as length. */
+   if (head_off) {
+   if (saddr)
+   ret = copy_mc_to_kernel(daddr, saddr, head_off);
+   else
+   memset(daddr, 0, head_off);
+   }
+   /* Copy the tail part of the range */
+   if (end < pg_end) {
+   loff_t tail_off = head_off + length;
+   loff_t tail_len = pg_end - end;
+
+   if (saddr)
+   ret = copy_mc_to_kernel(daddr + tail_off,
+   saddr + tail_off, tail_len);
+   else
+   memset(daddr + tail_off, 0, tail_len);
+   }


Can you please help me understand in which case saddr can be 0, so that
we fall back to the memset() path?
I was also thinking: shouldn't such a check live inside
copy_mc_to_kernel() itself, in general?
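
Coming back to the copy_edge point above, here is a rough, untested sketch
of what I mean: derive the "copy the whole range" case from the alignment
of pos/length itself instead of passing a flag. The page fault callers
always pass an aligned range, so they would take the copy-everything
branch automatically. Note, though, that a fully aligned range in the
dax_iomap_actor() path would then do a redundant (although harmless, since
the new data is copied in afterwards) full copy, so the flag may still be
the better trade-off:

static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
		struct iomap *srcmap, void *daddr)
{
	loff_t head_off = pos & (align_size - 1);
	size_t size = ALIGN(head_off + length, align_size);
	loff_t end = pos + length;
	loff_t pg_end = round_up(end, align_size);
	void *saddr = NULL;
	int ret;

	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
	if (ret)
		return ret;

	/* fully aligned: copy the whole range (the page fault case) */
	if (!head_off && end == pg_end)
		return copy_mc_to_kernel(daddr, saddr, length);

	/* copy (or zero) the head part of the range */
	if (head_off) {
		if (saddr)
			ret = copy_mc_to_kernel(daddr, saddr, head_off);
		else
			memset(daddr, 0, head_off);
	}
	/* copy (or zero) the tail part of the range */
	if (end < pg_end) {
		loff_t tail_off = head_off + length;
		loff_t tail_len = pg_end - end;

		if (saddr)
			ret = copy_mc_to_kernel(daddr + tail_off,
					saddr + tail_off, tail_len);
		else
			memset(daddr + tail_off, 0, tail_len);
	}
	return ret;
}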



-ritesh


Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-03-28 Thread Ritesh Harjani
On 21/03/25 09:16PM, Qu Wenruo wrote:
>
>
> > On 2021/3/25 8:20 PM, Neal Gompa wrote:
> > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:
> > >
> > > This patchset can be fetched from the following github repo, along with
> > > the full subpage RW support:
> > > https://github.com/adam900710/linux/tree/subpage
> > >
> > > This patchset is for metadata read write support.
> > >
> > > [FULL RW TEST]
> > > Since the data write path is not included in this patchset, we can't
> > > really test the patchset itself, but anyone can grab the patch from
> > > github repo and do fstests/generic tests.
> > >
> > > But at least the full RW patchset can pass -g generic/quick -x defrag
> > > for now.
> > >
> > > There are some known issues:
> > >
> > > - Defrag behavior change
> > >Since current defrag is doing per-page defrag, to support subpage
> > >defrag, we need some change in the loop.
> > >E.g. if a page has both hole and regular extents in it, then defrag
> > >will rewrite the full 64K page.
> > >
> > >Thus for now, defrag related failure is expected.
> > >But this should only cause behavior difference, no crash nor hang is
> > >expected.
> > >
> > > - No compression support yet
> > >There are at least 2 known bugs if forcing compression for subpage
> > >* Some hard coded PAGE_SIZE screwing up space rsv
> > >* Subpage ASSERT() triggered
> > >  This is because some compression code is unlocking locked_page by
> > >  calling extent_clear_unlock_delalloc() with locked_page == NULL.
> > >So for now compression is also disabled.
> > >
> > > - Inode nbytes mismatch
> > >Still debugging.
> > >The fastest way to trigger is fsx using the following parameters:
> > >
> > >  fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
> > >
> > >Which would cause inode nbytes differs from expected value and
> > >triggers btrfs check error.
> > >
> > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > The metadata part in fact has more new code than data part, as it has
> > > some different behaviors compared to the regular sector size handling:
> > >
> > > - No more page locking
> > >Now metadata read/write relies on extent io tree locking, other than
> > >page locking.
> > >This is to allow behaviors like read lock one eb while also try to
> > >read lock another eb in the same page.
> > >We can't rely on page lock as now we have multiple extent buffers in
> > >the same page.
> > >
> > > - Page status update
> > >Now we use subpage wrappers to handle page status update.
> > >
> > > - How to submit dirty extent buffers
> > >Instead of just grabbing extent buffer from page::private, we need to
> > >iterate all dirty extent buffers in the page and submit them.
> > >
> > > [CHANGELOG]
> > > v2:
> > > - Rebased to latest misc-next
> > >No conflicts at all.
> > >
> > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > >This will allow mkfs.btrfs to detect unmountable fs better.
> > >
> > > - Use newer naming schema for each patch
> > >No more "extent_io:" or "inode:" schema anymore.
> > >
> > > - Move two pure cleanups to the series
> > >Patch 2~3, originally in RW part.
> > >
> > > - Fix one uninitialized variable
> > >Patch 6.
> > >
> > > v3:
> > > - Rename the sysfs to supported_sectorsizes
> > >
> > > - Rebased to latest misc-next branch
> > >This removes 2 cleanup patches.
> > >
> > > - Add new overview comment for subpage metadata
> > >
> > > Qu Wenruo (13):
> > >btrfs: add sysfs interface for supported sectorsize
> > >btrfs: use min() to replace open-code in btrfs_invalidatepage()
> > >btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> > >btrfs: refactor how we iterate ordered extent in
> > >  btrfs_invalidatepage()
> > >btrfs: introduce helpers for subpage dirty status
> > >btrfs: introduce helpers for subpage writeback status
> > >btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> > >  metadata
> > >btrfs: support subpage metadata csum calculation at write time
> > >btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > >btrfs: make the page uptodate assert to be subpage compatible
> > >btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> > >btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> > >  compatible
> > >btrfs: add subpage overview comments
> > >
> > >   fs/btrfs/disk-io.c   | 143 ++-
> > >   fs/btrfs/extent_io.c | 127 --
> > >   fs/btrfs/inode.c | 128 ++
> > >   fs/btrfs/subpage.c   | 127 ++
> > >   fs/btrfs/subpage.h   |  17 +
> > >   fs/btrfs/sysfs.c |  15 +
> > >   6 files changed, 441 insertions(+), 116 deletions(-)
> > >
> > > --
> > > 2.30.1
> > >
>

Re: [PATCH v3 05/10] fsdax: Replace mmap entry in case of CoW

2021-03-31 Thread Ritesh Harjani
x_fault_actor(struct vm_fault *vmf, 
> pfn_t *pfnp,
>   loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
>   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
> + unsigned int insert_flags = 0;
>   int err = 0;
>   pfn_t pfn;
>   void *kaddr;
> @@ -1453,8 +1462,14 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>   if (err)
>   return dax_fault_return(err);
>
> - entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> -  write && !sync);
> + if (write) {
> + if (!sync)
> +     insert_flags |= DAX_IF_DIRTY;
> + if (iomap->flags & IOMAP_F_SHARED)
> + insert_flags |= DAX_IF_COW;
> + }
> +
> + entry = dax_insert_entry(xas, vmf, entry, pfn, 0, insert_flags);
>
>   if (write && srcmap->addr != iomap->addr) {
>   err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr, false);
>

Rest looks good to me. Please feel free to add
Reviewed-by: Ritesh Harjani 

Sorry about changing my email address in the middle of this code review.
I am planning to use the above gmail id as my primary account for all
upstream work from now on.

> --
> 2.30.1
>
>
>


Re: [PATCH v3 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

2021-03-31 Thread Ritesh Harjani
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> Punching a hole in a reflinked file needs dax_copy_edge() too.  Otherwise,
> data in the non-aligned area will not be correct.  So, add the srcmap to
> dax_iomap_zero() and replace memset() with dax_copy_edge().
>
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c   | 9 +++--
>  fs/iomap/buffered-io.c | 2 +-
>  include/linux/dax.h| 3 ++-
>  3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index cfe513eb111e..348297b38f76 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1174,7 +1174,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> + struct iomap *srcmap)

Do we know why dax_iomap_zero() operates on a PAGE_SIZE range at a time?
IIUC, dax_zero_page_range() can take nr_pages as a parameter, but we still
always use one page at a time. Why is that? (See below.)

>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
> @@ -1204,7 +1205,11 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>   }
>
>   if (!page_aligned) {
> - memset(kaddr + offset, 0, size);
> + if (iomap->addr != srcmap->addr)
> + dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
> +kaddr, true);
> + else
> + memset(kaddr + offset, 0, size);
>   dax_flush(iomap->dax_dev, kaddr + offset, size);
>   }
>   dax_read_unlock(id);
>

Maybe the above could be simplified to this?

if (page_aligned) {
rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
} else {
rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
if (iomap->addr != srcmap->addr)
dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
   kaddr, true);
else
memset(kaddr + offset, 0, size);
dax_flush(iomap->dax_dev, kaddr + offset, size);
}

dax_read_unlock(id);
return rc < 0 ? rc : size;
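
As for the nr_pages question above, I was wondering whether the
page-aligned case could zero more than one page per call, roughly like the
following (illustrative only; the caller loop in iomap_zero_range_actor()
would then have to cope with a larger return value, and I have not checked
that every zero_page_range implementation is happy with a large nr_pages):

	if (page_aligned) {
		/* zero as many whole pages of this request as possible */
		size_t nr_pages = length >> PAGE_SHIFT;

		rc = dax_zero_page_range(iomap->dax_dev, pgoff, nr_pages);
		dax_read_unlock(id);
		return rc < 0 ? rc : (s64)nr_pages << PAGE_SHIFT;
	}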

Other than that looks good.
Feel free to add.
Reviewed-by: Ritesh Harjani 




Re: [PATCH v3 07/10] iomap: Introduce iomap_apply2() for operations on two files

2021-04-01 Thread Ritesh Harjani
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
> Thus, we introduce iomap_apply2() to accept arguments for two files and
> iomap_actor2_t for actions on two files.
>
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/iomap/apply.c  | 56 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..fbc38ce3d5b6 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,59 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>
>   return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret, ret2 = 0;
> + loff_t len1 = length, len2, min_len;
> +
> + ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
> + if (ret)
> + goto out_src;

If the above ->iomap_begin() call fails, we should not call ->iomap_end()
on smap (see the sketch at the end of this mail).

> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> + len2 = min_t(loff_t, len1, smap.length);
> +
> + ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
> + if (ret)
> + goto out_dest;

ditto

> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + min_len = min_t(loff_t, len2, dmap.length);
> +
> + written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret2 = ops->iomap_end(ino2, pos2, len2,
> +   written > 0 ? written : 0, flags, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, len1,
> +  written > 0 ? written : 0, flags, &smap);
> +

I guess this may not be a problem, but I still think we should be
consistent w.r.t. the len argument we pass to ->iomap_end() across the
iomap_apply* family of functions.
IIUC, we used to call ->iomap_end() with the length argument as filled in
by the filesystem from the ->iomap_begin() call, whereas the above breaks
that behavior. I don't think this is fatal, but it is still better to be
consistent with the APIs.
Thoughts?


> + if (ret)
> + return written ? written : ret;
> +
> + if (ret2)
> + return written ? written : ret2;
> +
> + return written;
> +}

if (written)
return written;

return ret ? ret : ret2;

Is the above a simpler version?
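
Putting the error-path point and the return-value simplification together,
the whole function might then look something like this (untested sketch,
only to show the structure I have in mind):

loff_t
iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
		loff_t length, unsigned int flags, const struct iomap_ops *ops,
		void *data, iomap_actor2_t actor)
{
	struct iomap smap = { .type = IOMAP_HOLE };
	struct iomap dmap = { .type = IOMAP_HOLE };
	loff_t written = 0, ret, ret2 = 0;
	loff_t len1 = length, len2, min_len;

	ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
	if (ret)
		return ret;		/* smap was never set up, nothing to end */
	if (WARN_ON(smap.offset > pos1) || WARN_ON(smap.length == 0)) {
		written = -EIO;
		goto out_src;
	}
	len2 = min_t(loff_t, len1, smap.length);

	ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
	if (ret)
		goto out_src;		/* dmap was never set up, only unwind smap */
	if (WARN_ON(dmap.offset > pos2) || WARN_ON(dmap.length == 0)) {
		written = -EIO;
		goto out_dest;
	}
	min_len = min_t(loff_t, len2, dmap.length);

	written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);

out_dest:
	if (ops->iomap_end)
		ret2 = ops->iomap_end(ino2, pos2, len2,
				      written > 0 ? written : 0, flags, &dmap);
out_src:
	if (ops->iomap_end) {
		loff_t ret3 = ops->iomap_end(ino1, pos1, len1,
					     written > 0 ? written : 0,
					     flags, &smap);
		if (!ret)
			ret = ret3;
	}

	if (written)
		return written;
	return ret ? ret : ret2;
}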

-ritesh


Re: [PATCH v3 08/10] fsdax: Dedup file range to use a compare function

2021-04-01 Thread Ritesh Harjani
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc.  So, we create a dax
> comparison function which is similar to vfs_dedupe_file_range_compare(),
> and introduce dax_remap_file_range_prep() for filesystems to use.
>
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 56 
>  fs/remap_range.c | 45 ---
>  fs/xfs/xfs_reflink.c |  9 +--
>  include/linux/dax.h  |  4 
>  include/linux/fs.h   | 15 
>  5 files changed, 115 insertions(+), 14 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 348297b38f76..76f81f1d76ec 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1833,3 +1833,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap)
> +{
> + void *saddr, *daddr;
> + bool *same = data;
> + int ret;
> +
> + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> + *same = true;
> + return len;
> + }
> +
> + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> + *same = false;
> + return 0;
> + }
> +
> + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +   &saddr, NULL);

Shouldn't the size argument here just be len? (ALIGN(pos1 + len, PAGE_SIZE)
looks like an end offset rather than a length.)

> + if (ret < 0)
> + return -EIO;
> +
> + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +   &daddr, NULL);

ditto.
> + if (ret < 0)
> + return -EIO;
> +
> + *same = !memcmp(saddr, daddr, len);
> + return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> + const struct iomap_ops *ops)
> +{
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +is_same, dax_range_compare_actor);
> + if (ret < 0 || !*is_same)
> + goto out;
> +
> + len -= ret;
> + srcoff += ret;
> + destoff += ret;
> + }
> + ret = 0;
> +out:
> + dax_read_unlock(id);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index 77dba3a49e65..9079390edaf3 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>
>  #include 
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -  struct inode *dest, loff_t destoff,
> -  loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +   struct inode *dest, loff_t destoff,
> +   loff_t len, bool *is_same)
>  {
>   loff_t src_poff;
>   loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  out_error:
>   return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -   struct file *file_out, loff_t pos_out,
> -   loff_t *len, unsigned int remap_flags)
> +static int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + loff_t *len, unsigned int remap_flags,
> + const struct iomap_ops *ops)
>  {
>   struct inode *inode_in = file_inode(file_in);
>   struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>   if (remap_flags & REMAP_FILE_DEDUP) {

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-02 Thread Ritesh Harjani
On 21/03/29 10:01AM, Qu Wenruo wrote:
>
>
> On 2021/3/29 4:02 AM, Ritesh Harjani wrote:
> > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/3/25 8:20 PM, Neal Gompa wrote:
> > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:
> > > > >
> > > > > This patchset can be fetched from the following github repo, along 
> > > > > with
> > > > > the full subpage RW support:
> > > > > https://github.com/adam900710/linux/tree/subpage
> > > > >
> > > > > This patchset is for metadata read write support.
> > > > >
> > > > > [FULL RW TEST]
> > > > > Since the data write path is not included in this patchset, we can't
> > > > > really test the patchset itself, but anyone can grab the patch from
> > > > > github repo and do fstests/generic tests.
> > > > >
> > > > > But at least the full RW patchset can pass -g generic/quick -x defrag
> > > > > for now.
> > > > >
> > > > > There are some known issues:
> > > > >
> > > > > - Defrag behavior change
> > > > > Since current defrag is doing per-page defrag, to support subpage
> > > > > defrag, we need some change in the loop.
> > > > > E.g. if a page has both hole and regular extents in it, then 
> > > > > defrag
> > > > > will rewrite the full 64K page.
> > > > >
> > > > > Thus for now, defrag related failure is expected.
> > > > > But this should only cause behavior difference, no crash nor hang 
> > > > > is
> > > > > expected.
> > > > >
> > > > > - No compression support yet
> > > > > There are at least 2 known bugs if forcing compression for subpage
> > > > > * Some hard coded PAGE_SIZE screwing up space rsv
> > > > > * Subpage ASSERT() triggered
> > > > >   This is because some compression code is unlocking locked_page 
> > > > > by
> > > > >   calling extent_clear_unlock_delalloc() with locked_page == NULL.
> > > > > So for now compression is also disabled.
> > > > >
> > > > > - Inode nbytes mismatch
> > > > > Still debugging.
> > > > > The fastest way to trigger is fsx using the following parameters:
> > > > >
> > > > >   fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > 
> > > > > /tmp/fsx
> > > > >
> > > > > Which would cause inode nbytes differs from expected value and
> > > > > triggers btrfs check error.
> > > > >
> > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > The metadata part in fact has more new code than data part, as it has
> > > > > some different behaviors compared to the regular sector size handling:
> > > > >
> > > > > - No more page locking
> > > > > Now metadata read/write relies on extent io tree locking, other 
> > > > > than
> > > > > page locking.
> > > > > This is to allow behaviors like read lock one eb while also try to
> > > > > read lock another eb in the same page.
> > > > > We can't rely on page lock as now we have multiple extent buffers 
> > > > > in
> > > > > the same page.
> > > > >
> > > > > - Page status update
> > > > > Now we use subpage wrappers to handle page status update.
> > > > >
> > > > > - How to submit dirty extent buffers
> > > > > Instead of just grabbing extent buffer from page::private, we 
> > > > > need to
> > > > > iterate all dirty extent buffers in the page and submit them.
> > > > >
> > > > > [CHANGELOG]
> > > > > v2:
> > > > > - Rebased to latest misc-next
> > > > > No conflicts at all.
> > > > >
> > > > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > > > > This will allow mkfs.btrfs to detect unmountable fs better.
> > > > >
> > > > > - Use newer naming schema for each patch
> > > > > No more "extent_io:" or "inode:" schema anymore.
> > > >

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-02 Thread Ritesh Harjani
On 21/04/02 04:36PM, Qu Wenruo wrote:
>
>
> On 2021/4/2 4:33 PM, Ritesh Harjani wrote:
> > On 21/03/29 10:01AM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/3/29 4:02 AM, Ritesh Harjani wrote:
> > > > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > > > >
> > > > >
> > > > > On 2021/3/25 8:20 PM, Neal Gompa wrote:
> > > > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:
> > > > > > >
> > > > > > > This patchset can be fetched from the following github repo, 
> > > > > > > along with
> > > > > > > the full subpage RW support:
> > > > > > > https://github.com/adam900710/linux/tree/subpage
> > > > > > >
> > > > > > > This patchset is for metadata read write support.
> > > > > > >
> > > > > > > [FULL RW TEST]
> > > > > > > Since the data write path is not included in this patchset, we 
> > > > > > > can't
> > > > > > > really test the patchset itself, but anyone can grab the patch 
> > > > > > > from
> > > > > > > github repo and do fstests/generic tests.
> > > > > > >
> > > > > > > But at least the full RW patchset can pass -g generic/quick -x 
> > > > > > > defrag
> > > > > > > for now.
> > > > > > >
> > > > > > > There are some known issues:
> > > > > > >
> > > > > > > - Defrag behavior change
> > > > > > >  Since current defrag is doing per-page defrag, to support 
> > > > > > > subpage
> > > > > > >  defrag, we need some change in the loop.
> > > > > > >  E.g. if a page has both hole and regular extents in it, then 
> > > > > > > defrag
> > > > > > >  will rewrite the full 64K page.
> > > > > > >
> > > > > > >  Thus for now, defrag related failure is expected.
> > > > > > >  But this should only cause behavior difference, no crash nor 
> > > > > > > hang is
> > > > > > >  expected.
> > > > > > >
> > > > > > > - No compression support yet
> > > > > > >  There are at least 2 known bugs if forcing compression for 
> > > > > > > subpage
> > > > > > >  * Some hard coded PAGE_SIZE screwing up space rsv
> > > > > > >  * Subpage ASSERT() triggered
> > > > > > >This is because some compression code is unlocking 
> > > > > > > locked_page by
> > > > > > >calling extent_clear_unlock_delalloc() with locked_page == 
> > > > > > > NULL.
> > > > > > >  So for now compression is also disabled.
> > > > > > >
> > > > > > > - Inode nbytes mismatch
> > > > > > >  Still debugging.
> > > > > > >  The fastest way to trigger is fsx using the following 
> > > > > > > parameters:
> > > > > > >
> > > > > > >fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > 
> > > > > > > /tmp/fsx
> > > > > > >
> > > > > > >  Which would cause inode nbytes differs from expected value 
> > > > > > > and
> > > > > > >  triggers btrfs check error.
> > > > > > >
> > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > > > The metadata part in fact has more new code than data part, as it 
> > > > > > > has
> > > > > > > some different behaviors compared to the regular sector size 
> > > > > > > handling:
> > > > > > >
> > > > > > > - No more page locking
> > > > > > >  Now metadata read/write relies on extent io tree locking, 
> > > > > > > other than
> > > > > > >  page locking.
> > > > > > >  This is to allow behaviors like read lock one eb while also 
> > > > > > > try to
> > > > > > >  read lock another eb in the same page.
> > > > > >