Re: [PATCH 07/13] btrfs: basic direct read operation
On 8/3/19 3:30 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Add btrfs_dio_iomap_ops for the iomap_begin() function. In order to
> accommodate dio reads, add a new function btrfs_file_read_iter() which
> would call btrfs_dio_iomap_read() for DIO reads and fall back to
> generic_file_read_iter() otherwise.
>
> Signed-off-by: Goldwyn Rodrigues
> ---
>  fs/btrfs/ctree.h |  2 ++
>  fs/btrfs/file.c  | 10 +-
>  fs/btrfs/iomap.c | 20
>  3 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7a4ff524dc77..9eca2d576dd1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3247,7 +3247,9 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
>  loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
>  			      struct file *file_out, loff_t pos_out,
>  			      loff_t len, unsigned int remap_flags);
> +/* iomap.c */
>  size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
> +ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to);
>
>  /* tree-defrag.c */
>  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f7087e28ac08..997eb152a35a 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2839,9 +2839,17 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
>  	return generic_file_open(inode, filp);
>  }
>
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return btrfs_dio_iomap_read(iocb, to);

No provision to fall back to a buffered read? Not sure from the btrfs
perspective, but earlier generic_file_read_iter() may fall through to a
buffered read, say in the case where direct I/O could not be completed
(returned 0 or less than the requested read bytes). Is it not required
anymore for btrfs, when we move to the iomap infrastructure, to still fall
back to a buffered read? Correct me if I am missing anything here.

> +
> +	return generic_file_read_iter(iocb, to);
> +}
> +
>  const struct file_operations btrfs_file_operations = {
>  	.llseek		= btrfs_file_llseek,
> -	.read_iter	= generic_file_read_iter,
> +	.read_iter	= btrfs_file_read_iter,
>  	.splice_read	= generic_file_splice_read,
>  	.write_iter	= btrfs_file_write_iter,
>  	.mmap		= btrfs_file_mmap,
> diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
> index 879038e2f1a0..36df606fc028 100644
> --- a/fs/btrfs/iomap.c
> +++ b/fs/btrfs/iomap.c
> @@ -420,3 +420,23 @@ size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
>
>  	return written;
>  }
> +
> +static int btrfs_dio_iomap_begin(struct inode *inode, loff_t pos,
> +		loff_t length, unsigned flags, struct iomap *iomap,
> +		struct iomap *srcmap)
> +{
> +	return get_iomap(inode, pos, length, iomap);
> +}
> +
> +static const struct iomap_ops btrfs_dio_iomap_ops = {
> +	.iomap_begin	= btrfs_dio_iomap_begin,
> +};
> +
> +ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	ssize_t ret;
> +
> +	inode_lock_shared(inode);
> +	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, NULL);
> +	inode_unlock_shared(inode);
> +	return ret;
> +}
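For what it's worth, a minimal sketch of the kind of buffered fallback the
question above is about. This is hypothetical: it keeps this patch's
btrfs_dio_iomap_read() name and assumes filemap_read() (the modern name for
generic_file_buffered_read()) as the buffered helper; it is not the posted
code.

	static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
	{
		ssize_t ret = 0;

		if (iocb->ki_flags & IOCB_DIRECT) {
			ret = btrfs_dio_iomap_read(iocb, to);
			/*
			 * Direct I/O failed outright, consumed the whole request,
			 * or hit EOF: return as-is.  A short or zero return with
			 * bytes still requested falls through to the buffered path.
			 */
			if (ret < 0 || !iov_iter_count(to) ||
			    iocb->ki_pos >= i_size_read(file_inode(iocb->ki_filp)))
				return ret;
		}

		/* finish the remainder of the read through the page cache */
		return filemap_read(iocb, to, ret);
	}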
Re: Odd locking pattern introduced as part of "nowait aio support"
Hi,

On 9/11/19 3:09 PM, Andres Freund wrote:
> Hi,
>
> On 2019-09-11 14:04:20 +1000, Dave Chinner wrote:
> > On Tue, Sep 10, 2019 at 03:33:27PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > Especially with buffered io it's fairly easy to hit contention on the
> > > inode lock during writes. With something like io_uring, it's even
> > > easier, because it currently (but see [1]) farms out buffered writes
> > > to workers, which then can easily contend on the inode lock, even if
> > > only one process submits writes. But I've seen it in plenty of other
> > > cases too.
> > >
> > > Looking at the code I noticed that several parts of the "nowait aio
> > > support" (cf 728fbc0e10b7f3) series introduced code like:
> > >
> > > static ssize_t
> > > ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > {
> > > ...
> > > 	if (!inode_trylock(inode)) {
> > > 		if (iocb->ki_flags & IOCB_NOWAIT)
> > > 			return -EAGAIN;
> > > 		inode_lock(inode);
> > > 	}
> >
> > The ext4 code is just buggy here - we don't support RWF_NOWAIT on
> > buffered writes.
>
> But both buffered and non-buffered writes go through
> ext4_file_write_iter(). And there's a preceding check, at least these
> days, preventing IOCB_NOWAIT from applying to buffered writes:
>
> 	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> 		return -EOPNOTSUPP;

-EOPNOTSUPP is now taken care of in the ext4 iomap patch series as well.

> I do really wish buffered NOWAIT was supported... The overhead of having
> to do async buffered writes through the workqueue in io_uring, even if an
> already existing page is targeted, is quite noticeable. Even if it failed
> with EAGAIN as soon as the buffered write's target isn't in the page
> cache, it'd be worthwhile.
>
> > > isn't trylocking and then locking in a blocking fashion an inefficient
> > > pattern? I.e. I think this should be
> > >
> > > 	if (iocb->ki_flags & IOCB_NOWAIT) {
> > > 		if (!inode_trylock(inode))
> > > 			return -EAGAIN;
> > > 	} else
> > > 		inode_lock(inode);
> >
> > Yes, you are right.
> >
> > History: commit 91f9943e1c7b ("fs: support RWF_NOWAIT for buffered
> > reads") which introduced the first locking pattern you describe in XFS.
>
> Seems, as part of the nowait work, the pattern was introduced in ext4,
> xfs and btrfs. And fixed in xfs.
>
> > That was followed soon after by:
> >
> > commit 942491c9e6d631c012f3c4ea8eb0b02edeab
> > Author: Christoph Hellwig
> > Date:   Mon Oct 23 18:31:50 2017 -0700
> >
> >     xfs: fix AIM7 regression
> >
> >     Apparently our current rwsem code doesn't like doing the trylock,
> >     then lock for real scheme.  So change our read/write methods to just
> >     do the trylock for the RWF_NOWAIT case.
> >
> >     This fixes a ~25% regression in AIM7.
> >
> >     Fixes: 91f9943e ("fs: support RWF_NOWAIT for buffered reads")
> >     Reported-by: kernel test robot
> >     Signed-off-by: Christoph Hellwig
> >     Reviewed-by: Darrick J. Wong
> >     Signed-off-by: Darrick J. Wong
> >
> > Which changed all the trylock/eagain/lock patterns to the second form
> > you quote. None of the other filesystems had AIM7 regressions reported
> > against them, so nobody changed them :(
> >
> > > Obviously this isn't going to improve scalability to a very
> > > significant degree. But not unnecessarily doing two atomic ops on a
> > > contended lock can't hurt scalability either. Also, the current code
> > > just seems confusing.
> > >
> > > Am I missing something?
> >
> > Just that the sort of performance regression testing that uncovers this
> > sort of thing isn't widely done, and most filesystems are concurrency
> > limited in some way before they hit inode lock scalability issues.
> > Hence filesystem-concurrency-focussed benchmarks that could uncover it
> > (like aim7) won't, because the inode locks don't end up stressed enough
> > to make a difference to benchmark performance.
>
> Ok.
>
> Goldwyn, do you want to write a patch, or do you want me to write one up?

I am anyway looking into an ext4 performance issue with a mixed parallel DIO
workload. That will require some new APIs for inode locking, similar to those
of XFS. As part of that I can take care of the semantics you report here
(which XFS already handles in the above patch) for ext4.

Thanks
-ritesh

> Greetings,
>
> Andres Freund
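For reference, a rough sketch of the kind of NOWAIT-aware locking helper
alluded to above. The name inode_lock_iocb() is made up purely for
illustration; the body just encodes the pattern the AIM7 fix settled on,
using the standard inode lock primitives.

	/*
	 * Sketch only: take the inode lock for an iocb, honouring IOCB_NOWAIT.
	 * Returns 0 on success, -EAGAIN if the lock could not be taken without
	 * blocking and the caller asked for non-blocking behaviour.
	 */
	static int inode_lock_iocb(struct kiocb *iocb, bool shared)
	{
		struct inode *inode = file_inode(iocb->ki_filp);

		if (iocb->ki_flags & IOCB_NOWAIT) {
			/* NOWAIT: a single trylock, no blocking fallback */
			if (shared ? !inode_trylock_shared(inode)
				   : !inode_trylock(inode))
				return -EAGAIN;
		} else {
			/* blocking case: plain lock, no wasted trylock first */
			if (shared)
				inode_lock_shared(inode);
			else
				inode_lock(inode);
		}
		return 0;
	}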
Re: [PATCH 2/3] ext4: fix inode rwsem regression
cc'd Matthew as well.

On 9/11/19 10:15 PM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression").
> Apparently our current rwsem code doesn't like doing the trylock, then
> lock for real scheme. So change our read/write methods to just do the
> trylock for the RWF_NOWAIT case.
>
> Fixes: 728fbc0e10b7 ("ext4: nowait aio support")
> Signed-off-by: Goldwyn Rodrigues

This patch will conflict with the recent iomap patch series. So if this gets
queued up first, the iomap patch series will need to be rebased to factor
these changes into the new APIs.

Otherwise looks good to me!

Reviewed-by: Ritesh Harjani

> ---
>  fs/ext4/file.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 70b0438dbc94..d5b2d0cc325d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -40,11 +40,13 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>
> -	if (!inode_trylock_shared(inode)) {
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock_shared(inode))
>  			return -EAGAIN;
> +	} else {
>  		inode_lock_shared(inode);
>  	}
> +
>  	/*
>  	 * Recheck under inode lock - at this point we are sure it cannot
>  	 * change anymore
> @@ -190,11 +192,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>
> -	if (!inode_trylock(inode)) {
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock(inode))
>  			return -EAGAIN;
> +	} else {
>  		inode_lock(inode);
>  	}
> +
>  	ret = ext4_write_checks(iocb, from);
>  	if (ret <= 0)
>  		goto out;
> @@ -233,9 +237,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
>  		return -EOPNOTSUPP;
>
> -	if (!inode_trylock(inode)) {
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock(inode))
>  			return -EAGAIN;
> +	} else {
>  		inode_lock(inode);
>  	}
Re: [PATCH v3 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax
On 3/19/21 7:22 AM, Shiyang Ruan wrote:
> From: Shiyang Ruan
>
> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.

Thanks for the patchset. I have tried reviewing the series for logical
correctness and, to some extent, functional correctness. Since I am not well
versed in the core functionality of the CoW operation, I may have asked for
clarification where I could not follow 100% of what the code is doing.

> (Rebased on v5.11)
> ==

Thanks. Yes, I see some conflicts when trying to apply it on the latest
kernel.

-ritesh
Re: [PATCH v3 01/10] fsdax: Factor helpers to simplify dax fault code
On 3/19/21 7:22 AM, Shiyang Ruan wrote:
> The dax page fault code is too long and a bit difficult to read. And it
> is hard to understand when we are trying to add new features. Some of
> the PTE/PMD code paths have similar logic. So, factor them into helper
> functions to simplify the code.
>
> Signed-off-by: Shiyang Ruan
> Reviewed-by: Christoph Hellwig
> ---
>  fs/dax.c | 152 ++-
>  1 file changed, 84 insertions(+), 68 deletions(-)

The refactoring and the changes look good to me. Feel free to add:

Reviewed-by: Ritesh Harjani
Re: [PATCH v3 03/10] fsdax: Output address in dax_iomap_pfn() and rename it
On 3/19/21 7:22 AM, Shiyang Ruan wrote:
> Add address output in dax_iomap_pfn() in order to perform a memcpy() in
> the CoW case. Since this function now outputs both the address and the
> pfn, rename it to dax_iomap_direct_access().
>
> Signed-off-by: Shiyang Ruan
> Reviewed-by: Christoph Hellwig
> ---
>  fs/dax.c | 16
>  1 file changed, 12 insertions(+), 4 deletions(-)

I like the naming convention; it is consistent with the dax_direct_access()
function. The changes look good to me. Feel free to add:

Reviewed-by: Ritesh Harjani
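As a quick illustration of the renamed helper's contract — a sketch based on
how later patches in this series call it, not code taken from this patch:

	void *kaddr;
	pfn_t pfn;
	int ret;

	/* one call now yields both the kernel mapping and the pfn */
	ret = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
	if (ret)
		return ret;
	/* kaddr feeds the CoW memcpy(), pfn feeds the page-table insert */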
Re: [PATCH v3 02/10] fsdax: Factor helper: dax_fault_actor()
On 3/19/21 7:22 AM, Shiyang Ruan wrote:
> The core logic in the two dax page fault functions is similar. So, move
> the logic into a common helper function. Also, to facilitate the
> addition of new features, such as CoW, switch-case is no longer used to
> handle different iomap types.
>
> Signed-off-by: Shiyang Ruan
> ---
>  fs/dax.c | 291 +++
>  1 file changed, 145 insertions(+), 146 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7031e4302b13..33ddad0f3091 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1053,6 +1053,66 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>  	return ret;
>  }
>
> +#ifdef CONFIG_FS_DAX_PMD
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> +		struct iomap *iomap, void **entry)
> +{
> +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = vmf->address & PMD_MASK;
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct inode *inode = mapping->host;
> +	pgtable_t pgtable = NULL;
> +	struct page *zero_page;
> +	spinlock_t *ptl;
> +	pmd_t pmd_entry;
> +	pfn_t pfn;
> +
> +	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
> +
> +	if (unlikely(!zero_page))
> +		goto fallback;
> +
> +	pfn = page_to_pfn_t(zero_page);
> +	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> +			DAX_PMD | DAX_ZERO_PAGE, false);
> +
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (!pgtable)
> +			return VM_FAULT_OOM;
> +	}
> +
> +	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> +	if (!pmd_none(*(vmf->pmd))) {
> +		spin_unlock(ptl);
> +		goto fallback;
> +	}
> +
> +	if (pgtable) {
> +		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> +		mm_inc_nr_ptes(vma->vm_mm);
> +	}
> +	pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
> +	pmd_entry = pmd_mkhuge(pmd_entry);
> +	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> +	spin_unlock(ptl);
> +	trace_dax_pmd_load_hole(inode, vmf, zero_page, *entry);
> +	return VM_FAULT_NOPAGE;
> +
> +fallback:
> +	if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
> +	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
> +	return VM_FAULT_FALLBACK;
> +}
> +#else
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> +		struct iomap *iomap, void **entry)
> +{
> +	return VM_FAULT_FALLBACK;
> +}
> +#endif /* CONFIG_FS_DAX_PMD */
> +
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
>  	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
> @@ -1289,6 +1349,61 @@ static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap,
>  	return 0;
>  }
>
> +/**
> + * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
> + * @vmf:	vm fault instance
> + * @pfnp:	pfn to be returned
> + * @xas:	the dax mapping tree of a file
> + * @entry:	an unlocked dax entry to be inserted
> + * @pmd:	distinguish whether it is a pmd fault
> + * @flags:	iomap flags
> + * @iomap:	from iomap_begin()
> + * @srcmap:	from iomap_begin(), not equal to iomap if it is a CoW
> + */
> +static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
> +		struct xa_state *xas, void *entry, bool pmd, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> +	size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
> +	loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;

Shouldn't we use xa_index here for pos?

	(loff_t)xas->xa_index << PAGE_SHIFT;

> +	bool write = vmf->flags & FAULT_FLAG_WRITE;
> +	bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
> +	int err = 0;
> +	pfn_t pfn;
> +
> +	/* if we are reading UNWRITTEN and HOLE, return a hole. */
> +	if (!write &&
> +	    (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
> +		if (!pmd)
> +			return dax_load_hole(xas, mapping, &entry, vmf);
> +		else
> +			return dax_pmd_load_hole(xas, vmf, iomap, &entry);
> +	}
> +
> +	if (iomap->type != IOMAP_MAPPED) {
> +		WARN_ON_ONCE(1);
> +		return VM_FAULT_SIGBUS;
> +	}

So now, in case the extent is not mapped, we always cause VM_FAULT_SIGBUS,
but earlier we were only doing WARN_ON_ONCE(1). Can you please help explain
why the change in behavior?

> +
> +	err = dax_iomap_pfn(iomap, pos, size, &pfn);
> +	if (err)
> +		return dax_fault_return(err);

Same case here as well. This could return SIGBUS, while earlier (and I am not
sure why) we were only returning FALLBACK.

> +
> +	entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> +
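To spell out the xa_offset/xa_index question above, a small illustration
(dax_fault_pos() is a made-up name, not part of the patch):

	/*
	 * xas->xa_index is the page-cache index the fault is operating on, so
	 * the file position is derived from it; xas->xa_offset is only the
	 * slot offset inside the current xarray node, not a file offset.
	 */
	static loff_t dax_fault_pos(const struct xa_state *xas)
	{
		return (loff_t)xas->xa_index << PAGE_SHIFT;
	}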
Re: [PATCH v3 04/10] fsdax: Introduce dax_iomap_cow_copy()
On 3/19/21 7:22 AM, Shiyang Ruan wrote:
> In the case where the iomap is a write operation and iomap is not equal
> to srcmap after iomap_begin, we consider it a CoW operation. The
> destination extent which iomap indicates is a newly allocated extent.
> So, it is needed to copy the data from srcmap to the newly allocated
> extent. In theory, it is better to copy only the head and tail ranges
> which are outside of the aligned area, instead of copying the whole
> aligned range. But in the dax page fault case, it will always be an
> aligned range. So, we have to copy the whole range in this case.
>
> Signed-off-by: Shiyang Ruan
> Reviewed-by: Christoph Hellwig
> ---
>  fs/dax.c | 71
>  1 file changed, 66 insertions(+), 5 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index a70e6aa285bb..181aad97136a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1037,6 +1037,51 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
>  	return rc;
>  }
>
> +/*
> + * Copy the head and tail part of the pages not included in the write but
> + * required for CoW, because pos/pos+length are not page aligned. But in dax
> + * page fault case, the range is page aligned, we need to copy the whole range
> + * of data. Use copy_edge to distinguish these cases.
> + */

Is this version better? Feel free to update/change.

dax_iomap_cow_copy(): This can be called from two places. Either during a DAX
write fault, to copy length bytes of data to daddr; or, while doing a normal
DAX write operation, dax_iomap_actor() might call this to copy either the
start or the end unaligned address. In that case the rest of the copy of the
aligned ranges is taken care of by dax_iomap_actor() itself. Also note that a
DAX fault will always result in aligned pos and pos + length.

 * @pos:	address to do copy from.
 * @length:	size of copy operation.
 * @align_size:	aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
 * @srcmap:	iomap srcmap
 * @daddr:	destination address to copy to.

> +static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
> +		struct iomap *srcmap, void *daddr, bool copy_edge)

Do we need bool copy_edge here? We can detect the non-aligned case directly,
if head_off != pos or pg_end != end, no?

> +{
> +	loff_t head_off = pos & (align_size - 1);
> +	size_t size = ALIGN(head_off + length, align_size);
> +	loff_t end = pos + length;
> +	loff_t pg_end = round_up(end, align_size);
> +	void *saddr = 0;
> +	int ret = 0;
> +
> +	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> +	if (ret)
> +		return ret;
> +
> +	if (!copy_edge)
> +		return copy_mc_to_kernel(daddr, saddr, length);
> +
> +	/* Copy the head part of the range. Note: we pass offset as length. */
> +	if (head_off) {
> +		if (saddr)
> +			ret = copy_mc_to_kernel(daddr, saddr, head_off);
> +		else
> +			memset(daddr, 0, head_off);
> +	}
> +
> +	/* Copy the tail part of the range */
> +	if (end < pg_end) {
> +		loff_t tail_off = head_off + length;
> +		loff_t tail_len = pg_end - end;
> +
> +		if (saddr)
> +			ret = copy_mc_to_kernel(daddr + tail_off,
> +					saddr + tail_off, tail_len);
> +		else
> +			memset(daddr + tail_off, 0, tail_len);
> +	}

Can you please help me understand in which case saddr can be 0, so that we
fall back to the memset() path? I was thinking shouldn't such restrictions be
coded inside the copy_mc_to_kernel() function in general?

-ritesh
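A hedged sketch of the alternative raised above: drop the copy_edge argument
and infer the whole-range case from pos/length themselves. The argument list
is an assumption for illustration, not the posted code, and the saddr == NULL
handling questioned above is left aside.

	static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
			struct iomap *srcmap, void *daddr)
	{
		loff_t head_off = pos & (align_size - 1);
		loff_t end = pos + length;
		loff_t pg_end = round_up(end, align_size);
		/*
		 * The page-fault callers always pass an aligned range, so when
		 * there is no unaligned head and no unaligned tail, copy
		 * everything; an explicit copy_edge flag becomes unnecessary.
		 */
		bool copy_all = (head_off == 0 && end == pg_end);
		void *saddr = NULL;
		int ret;

		ret = dax_iomap_direct_access(srcmap, pos,
				ALIGN(head_off + length, align_size), &saddr, NULL);
		if (ret)
			return ret;

		if (copy_all)
			return copy_mc_to_kernel(daddr, saddr, length);

		/* head/tail copies would proceed exactly as in the posted patch */
		return ret;
	}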
Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
On 21/03/25 09:16PM, Qu Wenruo wrote: > > > On 2021/3/25 下午8:20, Neal Gompa wrote: > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo wrote: > > > > > > This patchset can be fetched from the following github repo, along with > > > the full subpage RW support: > > > https://github.com/adam900710/linux/tree/subpage > > > > > > This patchset is for metadata read write support. > > > > > > [FULL RW TEST] > > > Since the data write path is not included in this patchset, we can't > > > really test the patchset itself, but anyone can grab the patch from > > > github repo and do fstests/generic tests. > > > > > > But at least the full RW patchset can pass -g generic/quick -x defrag > > > for now. > > > > > > There are some known issues: > > > > > > - Defrag behavior change > > >Since current defrag is doing per-page defrag, to support subpage > > >defrag, we need some change in the loop. > > >E.g. if a page has both hole and regular extents in it, then defrag > > >will rewrite the full 64K page. > > > > > >Thus for now, defrag related failure is expected. > > >But this should only cause behavior difference, no crash nor hang is > > >expected. > > > > > > - No compression support yet > > >There are at least 2 known bugs if forcing compression for subpage > > >* Some hard coded PAGE_SIZE screwing up space rsv > > >* Subpage ASSERT() triggered > > > This is because some compression code is unlocking locked_page by > > > calling extent_clear_unlock_delalloc() with locked_page == NULL. > > >So for now compression is also disabled. > > > > > > - Inode nbytes mismatch > > >Still debugging. > > >The fastest way to trigger is fsx using the following parameters: > > > > > > fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx > > > > > >Which would cause inode nbytes differs from expected value and > > >triggers btrfs check error. > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE] > > > The metadata part in fact has more new code than data part, as it has > > > some different behaviors compared to the regular sector size handling: > > > > > > - No more page locking > > >Now metadata read/write relies on extent io tree locking, other than > > >page locking. > > >This is to allow behaviors like read lock one eb while also try to > > >read lock another eb in the same page. > > >We can't rely on page lock as now we have multiple extent buffers in > > >the same page. > > > > > > - Page status update > > >Now we use subpage wrappers to handle page status update. > > > > > > - How to submit dirty extent buffers > > >Instead of just grabbing extent buffer from page::private, we need to > > >iterate all dirty extent buffers in the page and submit them. > > > > > > [CHANGELOG] > > > v2: > > > - Rebased to latest misc-next > > >No conflicts at all. > > > > > > - Add new sysfs interface to grab supported RO/RW sectorsize > > >This will allow mkfs.btrfs to detect unmountable fs better. > > > > > > - Use newer naming schema for each patch > > >No more "extent_io:" or "inode:" schema anymore. > > > > > > - Move two pure cleanups to the series > > >Patch 2~3, originally in RW part. > > > > > > - Fix one uninitialized variable > > >Patch 6. > > > > > > v3: > > > - Rename the sysfs to supported_sectorsizes > > > > > > - Rebased to latest misc-next branch > > >This removes 2 cleanup patches. 
> > > > > > - Add new overview comment for subpage metadata > > > > > > Qu Wenruo (13): > > >btrfs: add sysfs interface for supported sectorsize > > >btrfs: use min() to replace open-code in btrfs_invalidatepage() > > >btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage() > > >btrfs: refactor how we iterate ordered extent in > > > btrfs_invalidatepage() > > >btrfs: introduce helpers for subpage dirty status > > >btrfs: introduce helpers for subpage writeback status > > >btrfs: allow btree_set_page_dirty() to do more sanity check on subpage > > > metadata > > >btrfs: support subpage metadata csum calculation at write time > > >btrfs: make alloc_extent_buffer() check subpage dirty bitmap > > >btrfs: make the page uptodate assert to be subpage compatible > > >btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible > > >btrfs: make set_btree_ioerr() accept extent buffer and to be subpage > > > compatible > > >btrfs: add subpage overview comments > > > > > > fs/btrfs/disk-io.c | 143 ++- > > > fs/btrfs/extent_io.c | 127 -- > > > fs/btrfs/inode.c | 128 ++ > > > fs/btrfs/subpage.c | 127 ++ > > > fs/btrfs/subpage.h | 17 + > > > fs/btrfs/sysfs.c | 15 + > > > 6 files changed, 441 insertions(+), 116 deletions(-) > > > > > > -- > > > 2.30.1 > > > >
Re: [PATCH v3 05/10] fsdax: Replace mmap entry in case of CoW
> x_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>  	loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT;
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  	bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
> +	unsigned int insert_flags = 0;
>  	int err = 0;
>  	pfn_t pfn;
>  	void *kaddr;
> @@ -1453,8 +1462,14 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>  	if (err)
>  		return dax_fault_return(err);
>
> -	entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> -				 write && !sync);
> +	if (write) {
> +		if (!sync)
> +			insert_flags |= DAX_IF_DIRTY;
> +		if (iomap->flags & IOMAP_F_SHARED)
> +			insert_flags |= DAX_IF_COW;
> +	}
> +
> +	entry = dax_insert_entry(xas, vmf, entry, pfn, 0, insert_flags);
>
>  	if (write && srcmap->addr != iomap->addr) {
>  		err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr, false);

Rest looks good to me. Please feel free to add:

Reviewed-by: Ritesh Harjani

Sorry about changing my email address in the middle of this code review. I am
planning to use the above gmail id as the primary account for all upstream
work from now on.

> --
> 2.30.1
Re: [PATCH v3 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> Punch hole on a reflinked file needs dax_copy_edge() too. Otherwise,
> data in the non-aligned area will not be correct. So, add the srcmap to
> dax_iomap_zero() and replace memset() with dax_copy_edge().
>
> Signed-off-by: Shiyang Ruan
> ---
>  fs/dax.c               | 9 +++--
>  fs/iomap/buffered-io.c | 2 +-
>  include/linux/dax.h    | 3 ++-
>  3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index cfe513eb111e..348297b38f76 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1174,7 +1174,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> +		struct iomap *srcmap)

Do we know why dax_iomap_zero() operates on a PAGE_SIZE range? IIUC,
dax_zero_page_range() can take nr_pages as a parameter, but we still always
use one page at a time. Why is that?

>  {
>  	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>  	pgoff_t pgoff;
> @@ -1204,7 +1205,11 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  	}
>
>  	if (!page_aligned) {
> -		memset(kaddr + offset, 0, size);
> +		if (iomap->addr != srcmap->addr)
> +			dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
> +					   kaddr, true);
> +		else
> +			memset(kaddr + offset, 0, size);
>  		dax_flush(iomap->dax_dev, kaddr + offset, size);
>  	}
>  	dax_read_unlock(id);

Maybe the above could be simplified to this?

	if (page_aligned) {
		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
	} else {
		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
		if (iomap->addr != srcmap->addr)
			dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
					   kaddr, true);
		else
			memset(kaddr + offset, 0, size);
		dax_flush(iomap->dax_dev, kaddr + offset, size);
	}
	dax_read_unlock(id);
	return rc < 0 ? rc : size;

Other than that looks good. Feel free to add:

Reviewed-by: Ritesh Harjani
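A hedged illustration of the nr_pages point above (not the posted code; the
surrounding variables are assumed to be those of dax_iomap_zero(), and the
return-value handling is only a guess): if the helper were allowed to consume
more than one page per call, the aligned branch could zero the whole aligned
run at once.

	/* hypothetical: zero every whole page covered by 'length' in one
	 * dax_zero_page_range() call, instead of one page per iteration */
	if (IS_ALIGNED(pos, PAGE_SIZE) && length >= PAGE_SIZE) {
		size_t nr_pages = length >> PAGE_SHIFT;

		rc = dax_zero_page_range(iomap->dax_dev, pgoff, nr_pages);
		return rc < 0 ? rc : (s64)nr_pages << PAGE_SHIFT;
	}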
Re: [PATCH v3 07/10] iomap: Introduce iomap_apply2() for operations on two files
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_open()/iomap_end() on two files. Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
>
> Signed-off-by: Shiyang Ruan
> ---
>  fs/iomap/apply.c      | 56 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..fbc38ce3d5b6 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,59 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>
>  	return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
> +		loff_t length, unsigned int flags, const struct iomap_ops *ops,
> +		void *data, iomap_actor2_t actor)
> +{
> +	struct iomap smap = { .type = IOMAP_HOLE };
> +	struct iomap dmap = { .type = IOMAP_HOLE };
> +	loff_t written = 0, ret, ret2 = 0;
> +	loff_t len1 = length, len2, min_len;
> +
> +	ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
> +	if (ret)
> +		goto out_src;

If the above call fails, we need not call ->iomap_end() on smap.

> +	if (WARN_ON(smap.offset > pos1)) {
> +		written = -EIO;
> +		goto out_src;
> +	}
> +	if (WARN_ON(smap.length == 0)) {
> +		written = -EIO;
> +		goto out_src;
> +	}
> +	len2 = min_t(loff_t, len1, smap.length);
> +
> +	ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
> +	if (ret)
> +		goto out_dest;

Ditto.

> +	if (WARN_ON(dmap.offset > pos2)) {
> +		written = -EIO;
> +		goto out_dest;
> +	}
> +	if (WARN_ON(dmap.length == 0)) {
> +		written = -EIO;
> +		goto out_dest;
> +	}
> +	min_len = min_t(loff_t, len2, dmap.length);
> +
> +	written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
> +
> +out_dest:
> +	if (ops->iomap_end)
> +		ret2 = ops->iomap_end(ino2, pos2, len2,
> +				written > 0 ? written : 0, flags, &dmap);
> +out_src:
> +	if (ops->iomap_end)
> +		ret = ops->iomap_end(ino1, pos1, len1,
> +				written > 0 ? written : 0, flags, &smap);

I guess this may not be a problem, but I still think we should be consistent
w.r.t. the len argument we pass to ->iomap_end() in both types of
iomap_apply* family functions. IIUC, we used to call ->iomap_end() with the
length argument filled in by the filesystem from the ->iomap_begin() call,
whereas the above breaks that behavior. Although I don't think this is fatal,
it is still better to be consistent with the APIs. Thoughts?

> +
> +	if (ret)
> +		return written ? written : ret;
> +
> +	if (ret2)
> +		return written ? written : ret2;
> +
> +	return written;
> +}

	if (written)
		return written;

	return ret ? ret : ret2;

Is the above a simpler version?

-ritesh
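Putting the comments above together, a rough sketch of how the begin/end
pairing and the consolidated return could look. This is illustrative only,
based on the posted iomap_apply2() (same names and types), not a tested
patch; the len arguments passed to ->iomap_end() are left as posted.

	loff_t
	iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
		     loff_t length, unsigned int flags, const struct iomap_ops *ops,
		     void *data, iomap_actor2_t actor)
	{
		struct iomap smap = { .type = IOMAP_HOLE };
		struct iomap dmap = { .type = IOMAP_HOLE };
		loff_t written = 0, ret, ret2 = 0;
		loff_t len1 = length, len2, min_len;

		ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
		if (ret)
			return ret;	/* begin failed: no ->iomap_end() owed */
		if (WARN_ON(smap.offset > pos1) || WARN_ON(smap.length == 0)) {
			written = -EIO;
			goto out_src;
		}
		len2 = min_t(loff_t, len1, smap.length);

		ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
		if (ret) {
			written = ret;	/* only the first mapping needs ending */
			goto out_src;
		}
		if (WARN_ON(dmap.offset > pos2) || WARN_ON(dmap.length == 0)) {
			written = -EIO;
			goto out_dest;
		}
		min_len = min_t(loff_t, len2, dmap.length);

		written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);

	out_dest:
		if (ops->iomap_end)
			ret2 = ops->iomap_end(ino2, pos2, len2,
					written > 0 ? written : 0, flags, &dmap);
	out_src:
		if (ops->iomap_end)
			ret = ops->iomap_end(ino1, pos1, len1,
					written > 0 ? written : 0, flags, &smap);

		/* single consolidated return */
		if (written)
			return written;
		return ret ? ret : ret2;
	}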
Re: [PATCH v3 08/10] fsdax: Dedup file range to use a compare function
On 21/03/19 09:52AM, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison function which is similar to vfs_dedupe_file_range_compare(),
> and introduce dax_remap_file_range_prep() for filesystem use.
>
> Signed-off-by: Goldwyn Rodrigues
> Signed-off-by: Shiyang Ruan
> ---
>  fs/dax.c             | 56
>  fs/remap_range.c     | 45 ---
>  fs/xfs/xfs_reflink.c |  9 +--
>  include/linux/dax.h  |  4
>  include/linux/fs.h   | 15
>  5 files changed, 115 insertions(+), 14 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 348297b38f76..76f81f1d76ec 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1833,3 +1833,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> +		struct inode *ino2, loff_t pos2, loff_t len, void *data,
> +		struct iomap *smap, struct iomap *dmap)
> +{
> +	void *saddr, *daddr;
> +	bool *same = data;
> +	int ret;
> +
> +	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> +		*same = true;
> +		return len;
> +	}
> +
> +	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> +		*same = false;
> +		return 0;
> +	}
> +
> +	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +				      &saddr, NULL);

Shouldn't it take len as the size argument?

> +	if (ret < 0)
> +		return -EIO;
> +
> +	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +				      &daddr, NULL);

Ditto.

> +	if (ret < 0)
> +		return -EIO;
> +
> +	*same = !memcmp(saddr, daddr, len);
> +	return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> +		const struct iomap_ops *ops)
> +{
> +	int id, ret = 0;
> +
> +	id = dax_read_lock();
> +	while (len) {
> +		ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +				   is_same, dax_range_compare_actor);
> +		if (ret < 0 || !*is_same)
> +			goto out;
> +
> +		len -= ret;
> +		srcoff += ret;
> +		destoff += ret;
> +	}
> +	ret = 0;
> +out:
> +	dax_read_unlock(id);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index 77dba3a49e65..9079390edaf3 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include "internal.h"
>
>  #include
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -					 struct inode *dest, loff_t destoff,
> -					 loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +				  struct inode *dest, loff_t destoff,
> +				  loff_t len, bool *is_same)
>  {
>  	loff_t src_poff;
>  	loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  out_error:
>  	return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -				  struct file *file_out, loff_t pos_out,
> -				  loff_t *len, unsigned int remap_flags)
> +static int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +				struct file *file_out, loff_t pos_out,
> +				loff_t *len, unsigned int remap_flags,
> +				const struct iomap_ops *ops)
>  {
>  	struct inode *inode_in = file_inode(file_in);
>  	struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	if (remap_flags & REMAP_FILE_DEDUP) {
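To make the two inline size-argument comments above concrete, a hedged sketch
of the adjusted calls (illustrative only; whether the size should additionally
be rounded up to PAGE_SIZE is left open):

	/* map 'len' bytes starting at each position; the third argument of
	 * dax_iomap_direct_access() is a size, so passing ALIGN(pos + len,
	 * PAGE_SIZE) mixes an absolute file offset into it */
	ret = dax_iomap_direct_access(smap, pos1, len, &saddr, NULL);
	if (ret < 0)
		return -EIO;

	ret = dax_iomap_direct_access(dmap, pos2, len, &daddr, NULL);
	if (ret < 0)
		return -EIO;

	*same = !memcmp(saddr, daddr, len);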
Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
On 21/03/29 10:01AM, Qu Wenruo wrote: > > > On 2021/3/29 上午4:02, Ritesh Harjani wrote: > > On 21/03/25 09:16PM, Qu Wenruo wrote: > > > > > > > > > On 2021/3/25 下午8:20, Neal Gompa wrote: > > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo wrote: > > > > > > > > > > This patchset can be fetched from the following github repo, along > > > > > with > > > > > the full subpage RW support: > > > > > https://github.com/adam900710/linux/tree/subpage > > > > > > > > > > This patchset is for metadata read write support. > > > > > > > > > > [FULL RW TEST] > > > > > Since the data write path is not included in this patchset, we can't > > > > > really test the patchset itself, but anyone can grab the patch from > > > > > github repo and do fstests/generic tests. > > > > > > > > > > But at least the full RW patchset can pass -g generic/quick -x defrag > > > > > for now. > > > > > > > > > > There are some known issues: > > > > > > > > > > - Defrag behavior change > > > > > Since current defrag is doing per-page defrag, to support subpage > > > > > defrag, we need some change in the loop. > > > > > E.g. if a page has both hole and regular extents in it, then > > > > > defrag > > > > > will rewrite the full 64K page. > > > > > > > > > > Thus for now, defrag related failure is expected. > > > > > But this should only cause behavior difference, no crash nor hang > > > > > is > > > > > expected. > > > > > > > > > > - No compression support yet > > > > > There are at least 2 known bugs if forcing compression for subpage > > > > > * Some hard coded PAGE_SIZE screwing up space rsv > > > > > * Subpage ASSERT() triggered > > > > > This is because some compression code is unlocking locked_page > > > > > by > > > > > calling extent_clear_unlock_delalloc() with locked_page == NULL. > > > > > So for now compression is also disabled. > > > > > > > > > > - Inode nbytes mismatch > > > > > Still debugging. > > > > > The fastest way to trigger is fsx using the following parameters: > > > > > > > > > > fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > > > > > > /tmp/fsx > > > > > > > > > > Which would cause inode nbytes differs from expected value and > > > > > triggers btrfs check error. > > > > > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE] > > > > > The metadata part in fact has more new code than data part, as it has > > > > > some different behaviors compared to the regular sector size handling: > > > > > > > > > > - No more page locking > > > > > Now metadata read/write relies on extent io tree locking, other > > > > > than > > > > > page locking. > > > > > This is to allow behaviors like read lock one eb while also try to > > > > > read lock another eb in the same page. > > > > > We can't rely on page lock as now we have multiple extent buffers > > > > > in > > > > > the same page. > > > > > > > > > > - Page status update > > > > > Now we use subpage wrappers to handle page status update. > > > > > > > > > > - How to submit dirty extent buffers > > > > > Instead of just grabbing extent buffer from page::private, we > > > > > need to > > > > > iterate all dirty extent buffers in the page and submit them. > > > > > > > > > > [CHANGELOG] > > > > > v2: > > > > > - Rebased to latest misc-next > > > > > No conflicts at all. > > > > > > > > > > - Add new sysfs interface to grab supported RO/RW sectorsize > > > > > This will allow mkfs.btrfs to detect unmountable fs better. > > > > > > > > > > - Use newer naming schema for each patch > > > > > No more "extent_io:" or "inode:" schema anymore. > > > >
Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
On 21/04/02 04:36PM, Qu Wenruo wrote: > > > On 2021/4/2 下午4:33, Ritesh Harjani wrote: > > On 21/03/29 10:01AM, Qu Wenruo wrote: > > > > > > > > > On 2021/3/29 上午4:02, Ritesh Harjani wrote: > > > > On 21/03/25 09:16PM, Qu Wenruo wrote: > > > > > > > > > > > > > > > On 2021/3/25 下午8:20, Neal Gompa wrote: > > > > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo wrote: > > > > > > > > > > > > > > This patchset can be fetched from the following github repo, > > > > > > > along with > > > > > > > the full subpage RW support: > > > > > > > https://github.com/adam900710/linux/tree/subpage > > > > > > > > > > > > > > This patchset is for metadata read write support. > > > > > > > > > > > > > > [FULL RW TEST] > > > > > > > Since the data write path is not included in this patchset, we > > > > > > > can't > > > > > > > really test the patchset itself, but anyone can grab the patch > > > > > > > from > > > > > > > github repo and do fstests/generic tests. > > > > > > > > > > > > > > But at least the full RW patchset can pass -g generic/quick -x > > > > > > > defrag > > > > > > > for now. > > > > > > > > > > > > > > There are some known issues: > > > > > > > > > > > > > > - Defrag behavior change > > > > > > > Since current defrag is doing per-page defrag, to support > > > > > > > subpage > > > > > > > defrag, we need some change in the loop. > > > > > > > E.g. if a page has both hole and regular extents in it, then > > > > > > > defrag > > > > > > > will rewrite the full 64K page. > > > > > > > > > > > > > > Thus for now, defrag related failure is expected. > > > > > > > But this should only cause behavior difference, no crash nor > > > > > > > hang is > > > > > > > expected. > > > > > > > > > > > > > > - No compression support yet > > > > > > > There are at least 2 known bugs if forcing compression for > > > > > > > subpage > > > > > > > * Some hard coded PAGE_SIZE screwing up space rsv > > > > > > > * Subpage ASSERT() triggered > > > > > > >This is because some compression code is unlocking > > > > > > > locked_page by > > > > > > >calling extent_clear_unlock_delalloc() with locked_page == > > > > > > > NULL. > > > > > > > So for now compression is also disabled. > > > > > > > > > > > > > > - Inode nbytes mismatch > > > > > > > Still debugging. > > > > > > > The fastest way to trigger is fsx using the following > > > > > > > parameters: > > > > > > > > > > > > > >fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > > > > > > > > /tmp/fsx > > > > > > > > > > > > > > Which would cause inode nbytes differs from expected value > > > > > > > and > > > > > > > triggers btrfs check error. > > > > > > > > > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE] > > > > > > > The metadata part in fact has more new code than data part, as it > > > > > > > has > > > > > > > some different behaviors compared to the regular sector size > > > > > > > handling: > > > > > > > > > > > > > > - No more page locking > > > > > > > Now metadata read/write relies on extent io tree locking, > > > > > > > other than > > > > > > > page locking. > > > > > > > This is to allow behaviors like read lock one eb while also > > > > > > > try to > > > > > > > read lock another eb in the same page. > > > > > >