[f2fs-dev] 答复: [PATCH v3] f2fs: compress: add nocompress extensions support

2021-05-13 Thread changfengnan
Hi chao & jaegeuk:

Any new comments on this patch ?

Thanks

-邮件原件-
发件人: Fengnan Chang 
发送时间: 2021年5月7日 11:06
收件人: jaeg...@kernel.org; c...@kernel.org;
linux-f2fs-devel@lists.sourceforge.net
抄送: Fengnan Chang 
主题: [PATCH v3] f2fs: compress: add nocompress extensions support

When we create a directory with enable compression, all file write into
directory will try to compress.But sometimes we may know, new file cannot
meet compression ratio requirements.
We need a nocompress extension to skip those files to avoid unnecessary
compress page test.

Signed-off-by: Fengnan Chang 
---
 Documentation/filesystems/f2fs.rst |  8 +++
 fs/f2fs/f2fs.h |  2 +
 fs/f2fs/namei.c| 18 +--
 fs/f2fs/super.c| 79 +-
 4 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/f2fs.rst
b/Documentation/filesystems/f2fs.rst
index 63c0c49b726d..f9248a36cd53 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -281,6 +281,14 @@ compress_extension=%s   Support adding specified
extension, so that f2fs can enab
 For other files, we can still enable compression
via ioctl.
 Note that, there is one reserved special
extension '*', it
 can be set to enable compression for all files.
+nocompress_extension=%s   Support adding specified extension, so
that f2fs can disable
+compression on those corresponding files, just
contrary to compression extension.
+If you know exactly which files cannot be
compressed, you can use this.
+The same extension name can't appear in both
compress and nocompress
+extension at the same time.
+If the compress extension specifies all files,
the types specified by the
+nocompress extension will be treated as special
cases and will not be compressed.
+Don't allow use '*' to specifie all file in
nocompress extension.
 compress_chksum Support verifying chksum of raw data in
compressed cluster.
 compress_mode=%sControl file compression mode. This supports "fs"
and "user"
 modes. In "fs" mode (default), f2fs does
automatic compression diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index
87d734f5589d..3d5d28a2568f 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -150,8 +150,10 @@ struct f2fs_mount_info {
unsigned char compress_level;   /* compress level */
bool compress_chksum;   /* compressed data chksum
*/
unsigned char compress_ext_cnt; /* extension count */
+   unsigned char nocompress_ext_cnt;   /* nocompress
extension count */
int compress_mode;  /* compression mode */
unsigned char extensions[COMPRESS_EXT_NUM][F2FS_EXTENSION_LEN]; /*
extensions */
+   unsigned char noextensions[COMPRESS_EXT_NUM][F2FS_EXTENSION_LEN];
/*
+extensions */
 };

 #define F2FS_FEATURE_ENCRYPT   0x0001
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c index
405d85dbf9f1..84ca322a22ee 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -279,14 +279,16 @@ static void set_compress_inode(struct f2fs_sb_info
*sbi, struct inode *inode,
const unsigned char *name)
 {
__u8 (*extlist)[F2FS_EXTENSION_LEN] =
sbi->raw_super->extension_list;
+   unsigned char (*noext)[F2FS_EXTENSION_LEN] =
+F2FS_OPTION(sbi).noextensions;
unsigned char (*ext)[F2FS_EXTENSION_LEN];
-   unsigned int ext_cnt = F2FS_OPTION(sbi).compress_ext_cnt;
+   unsigned char ext_cnt = F2FS_OPTION(sbi).compress_ext_cnt;
+   unsigned char noext_cnt = F2FS_OPTION(sbi).nocompress_ext_cnt;
int i, cold_count, hot_count;

if (!f2fs_sb_has_compression(sbi) ||
-   is_inode_flag_set(inode, FI_COMPRESSED_FILE) ||
F2FS_I(inode)->i_flags & F2FS_NOCOMP_FL ||
-   !f2fs_may_compress(inode))
+   !f2fs_may_compress(inode) ||
+   (!ext_cnt && !noext_cnt))
return;

down_read(>sb_lock);
@@ -303,6 +305,16 @@ static void set_compress_inode(struct f2fs_sb_info
*sbi, struct inode *inode,

up_read(>sb_lock);

+   for (i = 0; i < noext_cnt; i++) {
+   if (is_extension_exist(name, noext[i])) {
+   f2fs_disable_compressed_file(inode);
+   return;
+   }
+   }
+
+   if (is_inode_flag_set(inode, FI_COMPRESSED_FILE))
+   return;
+
ext = F2FS_OPTION(sbi).extensions;

for (i = 0; i < ext_cnt; i++) {
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index
5020152aa8fc..865191339625 100644
--- a/fs/f2fs/super.c
+++ 

Re: [f2fs-dev] 答复: [PATCH v4] f2fs: compress: avoid unnecessary check in f2fs_prepare_compress_overwrite

2021-05-13 Thread Chao Yu

On 2021/5/14 5:17, Eric Biggers wrote:

On Wed, May 12, 2021 at 09:52:19AM +0800, Chao Yu wrote:

On 2021/5/12 5:50, Jaegeuk Kim wrote:

On 05/11, changfeng...@vivo.com wrote:

Hi Jaegeuk:

If there're existing clusters beyond i_size, may cause data corruption, but
will this happen in normal? maybe some error can cause this, if i_size is
error the data beyond size still can't handle properly.  Is there normal
case can casue existing clusters beyond i_size?


We don't have a rule to sync between i_size and i_blocks.


I can't image a case that compressed cluster may cross filesize, it looks it's
a bug if that happened, but I'm not sure I have considered all cases. So, I
prefer to add a check condition as below, then testing w/ xfstest/por_fsstress
for a while.

Subject: [PATCH] f2fs: compress: compressed cluster should not cross i_size

---
  fs/f2fs/data.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 06d1e58d3882..9acca358d578 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -3325,6 +3325,8 @@ static int f2fs_write_begin(struct file *file, struct 
address_space *mapping,
err = ret;
goto fail;
} else if (ret) {
+   f2fs_bug_on(sbi, index >=
+   DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
return 0;
}
}


If a file has both fs-verity and compression enabled, it can have compressed
clusters past i_size.


Correct, any other case we missed for a writable file? let us know.

Thanks,



- Eric
.




___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Dave Chinner
On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > +->fallocate implementation must be really careful to maintain page 
> > > > cache
> > > > +consistency when punching holes or performing other operations that 
> > > > invalidate
> > > > +page cache contents. Usually the filesystem needs to call
> > > > +truncate_inode_pages_range() to invalidate relevant range of the page 
> > > > cache.
> > > > +However the filesystem usually also needs to update its internal (and 
> > > > on disk)
> > > > +view of file offset -> disk block mapping. Until this update is 
> > > > finished, the
> > > > +filesystem needs to block page faults and reads from reloading 
> > > > now-stale page
> > > > +cache contents from the disk. VFS provides mapping->invalidate_lock 
> > > > for this
> > > > +and acquires it in shared mode in paths loading pages from disk
> > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > +responsible for taking this lock in its fallocate implementation and 
> > > > generally
> > > > +whenever the page cache contents needs to be invalidated because a 
> > > > block is
> > > > +moving from under a page.
> > > > +
> > > > +->copy_file_range and ->remap_file_range implementations need to 
> > > > serialize
> > > > +against modifications of file data while the operation is running. For 
> > > > blocking
> > > > +changes through write(2) and similar operations inode->i_rwsem can be 
> > > > used. For
> > > > +blocking changes through memory mapping, the filesystem can use
> > > > +mapping->invalidate_lock provided it also acquires it in its 
> > > > ->page_mkwrite
> > > > +implementation.
> > > 
> > > Question: What is the locking order when acquiring the invalidate_lock
> > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > being hoisted here (increasing order of i_ino)?
> > > 
> > > The reason I ask is that remap_file_range has to do that, but I don't
> > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > calls in xfs_ilock2_io_mmap in this series.
> > 
> > Good question. Technically, I don't think there's real need to establish a
> > single ordering because locks among different filesystems are never going
> > to be acquired together (effectively each lock type is local per sb and we
> > are free to define an ordering for each lock type differently). But to
> > maintain some sanity I guess having the same locking order for doublelock
> > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > by-ino ordering? So that we don't have to consider two different orders in
> > xfs_lock_two_inodes()...
> 
> I imagine Dave will chime in on this, but I suspect the reason is
> hysterical raisins^Wreasons.

It's the locking rules that XFS has used pretty much forever.
Locking by inode number always guarantees the same locking order of
two inodes in the same filesystem, regardless of the specific
in-memory instances of the two inodes.

e.g. if we lock based on the inode structure address, in one
instancex, we could get A -> B, then B gets recycled and
reallocated, then we get B -> A as the locking order for the same
two inodes.

That, IMNSHO, is utterly crazy because with non-deterministic inode
lock ordered like this you can't make consistent locking rules for
locking the physical inode cluster buffers underlying the inodes in
the situation where they also need to be locked.

We've been down this path before more than a decade ago when the
powers that be decreed that inode locking order is to be "by
structure address" rather than inode number, because "inode number
is not unique across multiple superblocks".

I'm not sure that there is anywhere that locks multiple inodes
across different superblocks, but here we are again

> It might simply be time to convert all
> three XFS inode locks to use the same ordering rules.

Careful, there lie dragons along that path because of things like
how the inode cluster buffer operations work - they all assume
ascending inode number traversal within and across inode cluster
buffers and hence we do have locking order constraints based on
inode number...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] 答复: [PATCH v4] f2fs: compress: avoid unnecessary check in f2fs_prepare_compress_overwrite

2021-05-13 Thread Eric Biggers
On Wed, May 12, 2021 at 09:52:19AM +0800, Chao Yu wrote:
> On 2021/5/12 5:50, Jaegeuk Kim wrote:
> > On 05/11, changfeng...@vivo.com wrote:
> > > Hi Jaegeuk:
> > > 
> > > If there're existing clusters beyond i_size, may cause data corruption, 
> > > but
> > > will this happen in normal? maybe some error can cause this, if i_size is
> > > error the data beyond size still can't handle properly.  Is there normal
> > > case can casue existing clusters beyond i_size?
> > 
> > We don't have a rule to sync between i_size and i_blocks.
> 
> I can't image a case that compressed cluster may cross filesize, it looks it's
> a bug if that happened, but I'm not sure I have considered all cases. So, I
> prefer to add a check condition as below, then testing w/ xfstest/por_fsstress
> for a while.
> 
> Subject: [PATCH] f2fs: compress: compressed cluster should not cross i_size
> 
> ---
>  fs/f2fs/data.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 06d1e58d3882..9acca358d578 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -3325,6 +3325,8 @@ static int f2fs_write_begin(struct file *file, struct 
> address_space *mapping,
>   err = ret;
>   goto fail;
>   } else if (ret) {
> + f2fs_bug_on(sbi, index >=
> + DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
>   return 0;
>   }
>   }

If a file has both fs-verity and compression enabled, it can have compressed
clusters past i_size.

- Eric


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Matthew Wilcox
On Thu, May 13, 2021 at 09:01:14PM +0200, Jan Kara wrote:
> On Wed 12-05-21 15:40:21, Matthew Wilcox wrote:
> > Remind me (or, rather, add to the documentation) why we have to hold the
> > invalidate_lock during the call to readpage / readahead, and we don't just
> > hold it around the call to add_to_page_cache / add_to_page_cache_locked
> > / add_to_page_cache_lru ?  I appreciate that ->readpages is still going
> > to suck, but we're down to just three implementations of ->readpages now
> > (9p, cifs & nfs).
> 
> There's a comment in filemap_create_page() trying to explain this. We need
> to protect against cases like: Filesystem with 1k blocksize, file F has
> page at index 0 with uptodate buffer at 0-1k, rest not uptodate. All blocks
> underlying page are allocated. Now let read at offset 1k race with hole
> punch at offset 1k, length 1k.
> 
> read()hole punch
> ...
>   filemap_read()
> filemap_get_pages()
>   - page found in the page cache but !Uptodate
>   filemap_update_page()
> locks everything
> truncate_inode_pages_range()
>   lock_page(page)
>   do_invalidatepage()
>   unlock_page(page)
> locks page
>   filemap_read_page()

Ah, this is the partial_start case, which means that page->mapping
is still valid.  But that means that do_invalidatepage() was called
with (offset 1024, length 1024), immediately after we called
zero_user_segment().  So isn't this a bug in the fs do_invalidatepage()?
The range from 1k-2k _is_ uptodate.  It's been zeroed in memory,
and if we were to run after the "free block" below, we'd get that
memory zeroed again.

> ->readpage()
>   block underlying offset 1k
> still allocated -> map buffer
> free block under offset 1k
> submit IO -> corrupted data
> 
> If you think I should expand it to explain more details, please tell.
> Or maybe I can put more detailed discussion like above into the changelog?

> > Why not:
> > 
> > __init_rwsem(>invalidate_lock, "mapping.invalidate_lock",
> > >s_type->invalidate_lock_key);
> 
> I replicated what we do for i_rwsem but you're right, this is better.
> Updated.

Hmm, there's a few places we should use __init_rwsem() ... something
for my "when bored" pile of work.


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Jan Kara
On Wed 12-05-21 15:40:21, Matthew Wilcox wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > Currently, serializing operations such as page fault, read, or readahead
> > against hole punching is rather difficult. The basic race scheme is
> > like:
> > 
> > fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / ..
> >   truncate_inode_pages_range()
> >>cache here>
> >   
> > 
> > Now the problem is in this way read / page fault / readahead can
> > instantiate pages in page cache with potentially stale data (if blocks
> > get quickly reused). Avoiding this race is not simple - page locks do
> > not work because we want to make sure there are *no* pages in given
> > range. inode->i_rwsem does not work because page fault happens under
> > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> > the performance for mixed read-write workloads suffer.
> > 
> > So create a new rw_semaphore in the address_space - invalidate_lock -
> > that protects adding of pages to page cache for page faults / reads /
> > readahead.
> 
> Remind me (or, rather, add to the documentation) why we have to hold the
> invalidate_lock during the call to readpage / readahead, and we don't just
> hold it around the call to add_to_page_cache / add_to_page_cache_locked
> / add_to_page_cache_lru ?  I appreciate that ->readpages is still going
> to suck, but we're down to just three implementations of ->readpages now
> (9p, cifs & nfs).

There's a comment in filemap_create_page() trying to explain this. We need
to protect against cases like: Filesystem with 1k blocksize, file F has
page at index 0 with uptodate buffer at 0-1k, rest not uptodate. All blocks
underlying page are allocated. Now let read at offset 1k race with hole
punch at offset 1k, length 1k.

read()  hole punch
...
  filemap_read()
filemap_get_pages()
  - page found in the page cache but !Uptodate
  filemap_update_page()
  locks everything
  truncate_inode_pages_range()
lock_page(page)
do_invalidatepage()
unlock_page(page)
locks page
  filemap_read_page()
->readpage()
  block underlying offset 1k
  still allocated -> map buffer
  free block under offset 1k
  submit IO -> corrupted data

If you think I should expand it to explain more details, please tell.
Or maybe I can put more detailed discussion like above into the changelog?

> Also, could I trouble you to run the comments through 'fmt' (or
> equivalent)?  It's easier to read if you're not kissing right up on 80
> columns.

Sure, will do.

> > +++ b/fs/inode.c
> > @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct 
> > inode *inode)
> > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> > mapping->private_data = NULL;
> > mapping->writeback_index = 0;
> > +   init_rwsem(>invalidate_lock);
> > +   lockdep_set_class(>invalidate_lock,
> > + >s_type->invalidate_lock_key);
> 
> Why not:
> 
>   __init_rwsem(>invalidate_lock, "mapping.invalidate_lock",
>   >s_type->invalidate_lock_key);

I replicated what we do for i_rwsem but you're right, this is better.
Updated.

Honza
-- 
Jan Kara 
SUSE Labs, CR


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Darrick J. Wong
On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > +->fallocate implementation must be really careful to maintain page cache
> > > +consistency when punching holes or performing other operations that 
> > > invalidate
> > > +page cache contents. Usually the filesystem needs to call
> > > +truncate_inode_pages_range() to invalidate relevant range of the page 
> > > cache.
> > > +However the filesystem usually also needs to update its internal (and on 
> > > disk)
> > > +view of file offset -> disk block mapping. Until this update is 
> > > finished, the
> > > +filesystem needs to block page faults and reads from reloading now-stale 
> > > page
> > > +cache contents from the disk. VFS provides mapping->invalidate_lock for 
> > > this
> > > +and acquires it in shared mode in paths loading pages from disk
> > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > +responsible for taking this lock in its fallocate implementation and 
> > > generally
> > > +whenever the page cache contents needs to be invalidated because a block 
> > > is
> > > +moving from under a page.
> > > +
> > > +->copy_file_range and ->remap_file_range implementations need to 
> > > serialize
> > > +against modifications of file data while the operation is running. For 
> > > blocking
> > > +changes through write(2) and similar operations inode->i_rwsem can be 
> > > used. For
> > > +blocking changes through memory mapping, the filesystem can use
> > > +mapping->invalidate_lock provided it also acquires it in its 
> > > ->page_mkwrite
> > > +implementation.
> > 
> > Question: What is the locking order when acquiring the invalidate_lock
> > of two different files?  Is it the same as i_rwsem (increasing order of
> > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > being hoisted here (increasing order of i_ino)?
> > 
> > The reason I ask is that remap_file_range has to do that, but I don't
> > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > calls in xfs_ilock2_io_mmap in this series.
> 
> Good question. Technically, I don't think there's real need to establish a
> single ordering because locks among different filesystems are never going
> to be acquired together (effectively each lock type is local per sb and we
> are free to define an ordering for each lock type differently). But to
> maintain some sanity I guess having the same locking order for doublelock
> of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> by-ino ordering? So that we don't have to consider two different orders in
> xfs_lock_two_inodes()...

I imagine Dave will chime in on this, but I suspect the reason is
hysterical raisins^Wreasons.  It might simply be time to convert all
three XFS inode locks to use the same ordering rules.

--D

> 
>   Honza
> 
> > > +
> > >  dquot_operations
> > >  
> > >  
> > > @@ -634,9 +658,9 @@ access:   yes
> > >  to be faulted in. The filesystem must find and return the page associated
> > >  with the passed in "pgoff" in the vm_fault structure. If it is possible 
> > > that
> > >  the page may be truncated and/or invalidated, then the filesystem must 
> > > lock
> > > -the page, then ensure it is not already truncated (the page lock will 
> > > block
> > > -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> > > -locked. The VM will unlock the page.
> > > +invalidate_lock, then ensure the page is not already truncated 
> > > (invalidate_lock
> > > +will block subsequent truncate), and then return with VM_FAULT_LOCKED, 
> > > and the
> > > +page locked. The VM will unlock the page.
> > >  
> > >  ->map_pages() is called when VM asks to map easy accessible pages.
> > >  Filesystem should find and map pages associated with offsets from 
> > > "start_pgoff"
> > > @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with 
> > > the page is passed in
> > >  "pte" field in vm_fault structure. Pointers to entries for other offsets
> > >  should be calculated relative to "pte".
> > >  
> > > -->page_mkwrite() is called when a previously read-only pte is
> > > -about to become writeable. The filesystem again must ensure that there 
> > > are
> > > -no truncate/invalidate races, and then return with the page locked. If
> > > -the page has been truncated, the filesystem should not look up a new page
> > > -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> > > -will cause the VM to retry the fault.
> > > +->page_mkwrite() is called when a previously read-only pte is about to 
> > > become
> > > +writeable. The filesystem again must ensure that there are no
> > > +truncate/invalidate races or races with operations such as 
> > > ->remap_file_range
> > > +or ->copy_file_range, and 

Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Jan Kara
On Wed 12-05-21 15:20:44, Matthew Wilcox wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> 
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index 57a618c4a0d6..93bde2741e0e 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
> >   * @mapping: mapping to truncate
> >   * @lstart: offset from which to truncate
> >   *
> > - * Called under (and serialised by) inode->i_rwsem.
> > + * Called under (and serialised by) inode->i_rwsem and 
> > inode->i_mapping_rwsem.
> 
> mapping->invalidate_lock, surely?

Right, thanks for noticing. 

> And could we ask lockdep to assert this for us instead of just a comment?

That's the plan but currently it would trip for filesystems unaware of
invalidate_lock. Once all filesystems are converted I plan to transform the
comments into actual asserts. In this series I aim at fixing the data
corruption issues, I plan the cleanups for later...

Honza

-- 
Jan Kara 
SUSE Labs, CR


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Jan Kara
On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > +->fallocate implementation must be really careful to maintain page cache
> > +consistency when punching holes or performing other operations that 
> > invalidate
> > +page cache contents. Usually the filesystem needs to call
> > +truncate_inode_pages_range() to invalidate relevant range of the page 
> > cache.
> > +However the filesystem usually also needs to update its internal (and on 
> > disk)
> > +view of file offset -> disk block mapping. Until this update is finished, 
> > the
> > +filesystem needs to block page faults and reads from reloading now-stale 
> > page
> > +cache contents from the disk. VFS provides mapping->invalidate_lock for 
> > this
> > +and acquires it in shared mode in paths loading pages from disk
> > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > +responsible for taking this lock in its fallocate implementation and 
> > generally
> > +whenever the page cache contents needs to be invalidated because a block is
> > +moving from under a page.
> > +
> > +->copy_file_range and ->remap_file_range implementations need to serialize
> > +against modifications of file data while the operation is running. For 
> > blocking
> > +changes through write(2) and similar operations inode->i_rwsem can be 
> > used. For
> > +blocking changes through memory mapping, the filesystem can use
> > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > +implementation.
> 
> Question: What is the locking order when acquiring the invalidate_lock
> of two different files?  Is it the same as i_rwsem (increasing order of
> the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> being hoisted here (increasing order of i_ino)?
> 
> The reason I ask is that remap_file_range has to do that, but I don't
> see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> calls in xfs_ilock2_io_mmap in this series.

Good question. Technically, I don't think there's real need to establish a
single ordering because locks among different filesystems are never going
to be acquired together (effectively each lock type is local per sb and we
are free to define an ordering for each lock type differently). But to
maintain some sanity I guess having the same locking order for doublelock
of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
by-ino ordering? So that we don't have to consider two different orders in
xfs_lock_two_inodes()...

Honza

> > +
> >  dquot_operations
> >  
> >  
> > @@ -634,9 +658,9 @@ access: yes
> >  to be faulted in. The filesystem must find and return the page associated
> >  with the passed in "pgoff" in the vm_fault structure. If it is possible 
> > that
> >  the page may be truncated and/or invalidated, then the filesystem must lock
> > -the page, then ensure it is not already truncated (the page lock will block
> > -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> > -locked. The VM will unlock the page.
> > +invalidate_lock, then ensure the page is not already truncated 
> > (invalidate_lock
> > +will block subsequent truncate), and then return with VM_FAULT_LOCKED, and 
> > the
> > +page locked. The VM will unlock the page.
> >  
> >  ->map_pages() is called when VM asks to map easy accessible pages.
> >  Filesystem should find and map pages associated with offsets from 
> > "start_pgoff"
> > @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with 
> > the page is passed in
> >  "pte" field in vm_fault structure. Pointers to entries for other offsets
> >  should be calculated relative to "pte".
> >  
> > -->page_mkwrite() is called when a previously read-only pte is
> > -about to become writeable. The filesystem again must ensure that there are
> > -no truncate/invalidate races, and then return with the page locked. If
> > -the page has been truncated, the filesystem should not look up a new page
> > -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> > -will cause the VM to retry the fault.
> > +->page_mkwrite() is called when a previously read-only pte is about to 
> > become
> > +writeable. The filesystem again must ensure that there are no
> > +truncate/invalidate races or races with operations such as 
> > ->remap_file_range
> > +or ->copy_file_range, and then return with the page locked. Usually
> > +mapping->invalidate_lock is suitable for proper serialization. If the page 
> > has
> > +been truncated, the filesystem should not look up a new page like the 
> > ->fault()
> > +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
> > +retry the fault.
> >  
> >  ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> >  VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> > diff