Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
On Wed, Aug 12, 2020 at 05:10:12PM -0400, Vivek Goyal wrote: > On Wed, Aug 12, 2020 at 11:23:45AM +1000, Dave Chinner wrote: > > On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote: > > > On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote: > > > > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote: > > > > > We need some kind of locking mechanism here. Normal file systems like > > > > > ext4 and xfs seems to take their own semaphore to protect agains > > > > > truncate while fault is going on. > > > > > > > > > > We have additional requirement to protect against fuse dax memory > > > > > range > > > > > reclaim. When a range has been selected for reclaim, we need to make > > > > > sure > > > > > no other read/write/fault can try to access that memory range while > > > > > reclaim is in progress. Once reclaim is complete, lock will be > > > > > released > > > > > and read/write/fault will trigger allocation of fresh dax range. > > > > > > > > > > Taking inode_lock() is not an option in fault path as lockdep > > > > > complains > > > > > about circular dependencies. So define a new fuse_inode->i_mmap_sem. > > > > > > > > That's precisely why filesystems like ext4 and XFS define their own > > > > rwsem. > > > > > > > > Note that this isn't a DAX requirement - the page fault > > > > serialisation is actually a requirement of hole punching... > > > > > > Hi Dave, > > > > > > I noticed that fuse code currently does not seem to have a rwsem which > > > can provide mutual exclusion between truncation/hole_punch path > > > and page fault path. I am wondering does that mean there are issues > > > with existing code or something else makes it unnecessary to provide > > > this mutual exlusion. > > > > I don't know enough about the fuse implementation to say. What I'm > > saying is that nothing in the core mm/ or VFS serilises page cache > > access to the data against direct filesystem manipulations of the > > underlying filesystem structures. > > Hi Dave, > > Got it. I was checking nfs and they also seem to be calling filemap_fault > and not taking any locks to block faults. fallocate() (nfs42_fallocate) > seems to block read/write/aio/dio but does not seem to do anything > about blocking faults. I am wondering if remote filesystem are > little different in this aspect. Especially fuse does not maintain > any filesystem block/extent data. It is file server which is doing > all that. I suspect they have all the same problems, and worse, the behaviour will largely be dependent on the server side behaviour that is out of the user's control. Essentially, nobody except us XFS folks seem to regard hole punching corrupting data or exposing stale data as being a problem that needs to be avoided or fixed. The only reason ext4 has the i_mmap_sem is because ext4 wanted to support DAX, and us XFS developers said "DAX absolutely requires that the filesystem can lock out physical access to the storage" and so they had no choice in the matter. Other than that, nobody really seems to understand or care about all these nasty little mmap() corner cases that we've seen corrupt user data or expose stale data to users over many years. > > i.e. nothing in the VFS or page fault IO path prevents this race > > condition: > > > > P0 P1 > > fallocate > > page cache invalidation > > page fault > > read data > > punch out data extents > > > > > backing store allocated> > > > > > > That's where the ext4 and XFS internal rwsem come into play: > > > > fallocate > > down_write(mmaplock) > > page cache invalidation > > page fault > > down_read(mmaplock) > > > > punch out data > > up_write(mmaplock) > > > > > > > > > > And there's not stale data exposure to userspace. > > Got it. I noticed that both fuse/nfs seem to have reversed the > order of operation. They call server to punch out data first > and then truncate page cache. And that should mean that even > if mmap reader will not see stale data after fallocate(punch_hole) > has finished. Yes, but that doesn't prevent page fault races from occuring, it just changes the nature of them.. Such as. > > There is nothing stopping, say, memory reclaim from reclaiming pages > > over the range while the hole is being punched, then having the > > application refault them while the backing store is being freed. > > While the page fault read IO is in progress, there's nothing > > stopping the filesystem from freeing those blocks, nor reallocating > > them and writing something else to them (e.g. metadata). So they > > could read someone elses data. > > > > Even worse: the page fault is a write fault, it lands in a
Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
On Wed, Aug 12, 2020 at 11:23:45AM +1000, Dave Chinner wrote: > On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote: > > On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote: > > > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote: > > > > We need some kind of locking mechanism here. Normal file systems like > > > > ext4 and xfs seems to take their own semaphore to protect agains > > > > truncate while fault is going on. > > > > > > > > We have additional requirement to protect against fuse dax memory range > > > > reclaim. When a range has been selected for reclaim, we need to make > > > > sure > > > > no other read/write/fault can try to access that memory range while > > > > reclaim is in progress. Once reclaim is complete, lock will be released > > > > and read/write/fault will trigger allocation of fresh dax range. > > > > > > > > Taking inode_lock() is not an option in fault path as lockdep complains > > > > about circular dependencies. So define a new fuse_inode->i_mmap_sem. > > > > > > That's precisely why filesystems like ext4 and XFS define their own > > > rwsem. > > > > > > Note that this isn't a DAX requirement - the page fault > > > serialisation is actually a requirement of hole punching... > > > > Hi Dave, > > > > I noticed that fuse code currently does not seem to have a rwsem which > > can provide mutual exclusion between truncation/hole_punch path > > and page fault path. I am wondering does that mean there are issues > > with existing code or something else makes it unnecessary to provide > > this mutual exlusion. > > I don't know enough about the fuse implementation to say. What I'm > saying is that nothing in the core mm/ or VFS serilises page cache > access to the data against direct filesystem manipulations of the > underlying filesystem structures. Hi Dave, Got it. I was checking nfs and they also seem to be calling filemap_fault and not taking any locks to block faults. fallocate() (nfs42_fallocate) seems to block read/write/aio/dio but does not seem to do anything about blocking faults. I am wondering if remote filesystem are little different in this aspect. Especially fuse does not maintain any filesystem block/extent data. It is file server which is doing all that. > > i.e. nothing in the VFS or page fault IO path prevents this race > condition: > > P0P1 > fallocate > page cache invalidation > page fault > read data > punch out data extents > > backing store allocated> > > > That's where the ext4 and XFS internal rwsem come into play: > > fallocate > down_write(mmaplock) > page cache invalidation > page fault > down_read(mmaplock) > > punch out data > up_write(mmaplock) > > > > > And there's not stale data exposure to userspace. Got it. I noticed that both fuse/nfs seem to have reversed the order of operation. They call server to punch out data first and then truncate page cache. And that should mean that even if mmap reader will not see stale data after fallocate(punch_hole) has finished. > > It's the same reason that we use the i_rwsem to prevent concurrent > IO while a truncate or hole punch is in progress. The IO could map > the extent, then block in the IO path, while the filesytsem > re-allocates and writes new data or metadata to those blocks. That's > another potential non-owner data exposure problem. > > And if you don't drain AIO+DIO before truncate/hole punch, the > i_rwsem does not protect you against concurrent IO as that gets > dropped after the AIO is submitted and returns EIOCBQUEUED to the > AIO layer. Hence there's IO in flight that isn't tracked by the > i_rwsem or the MMAPLOCK, and if you punch out the blocks and > reallocate them while the IO is in flight > > > > > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file > > > > *file, int mode, loff_t offset, > > > > file_update_time(file); > > > > } > > > > > > > > - if (mode & FALLOC_FL_PUNCH_HOLE) > > > > + if (mode & FALLOC_FL_PUNCH_HOLE) { > > > > + down_write(>i_mmap_sem); > > > > truncate_pagecache_range(inode, offset, offset + length > > > > - 1); > > > > - > > > > + up_write(>i_mmap_sem); > > > > + } > > > > fuse_invalidate_attr(inode); > > > > > > > > > I'm not sure this is sufficient. You have to lock page faults out > > > for the entire time the hole punch is being performed, not just while > > > the mapping is being invalidated. > > > > > > That is, once you've taken the inode lock and written back the dirty > > > data over the range being punched, you can then take a
Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote: > On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote: > > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote: > > > We need some kind of locking mechanism here. Normal file systems like > > > ext4 and xfs seems to take their own semaphore to protect agains > > > truncate while fault is going on. > > > > > > We have additional requirement to protect against fuse dax memory range > > > reclaim. When a range has been selected for reclaim, we need to make sure > > > no other read/write/fault can try to access that memory range while > > > reclaim is in progress. Once reclaim is complete, lock will be released > > > and read/write/fault will trigger allocation of fresh dax range. > > > > > > Taking inode_lock() is not an option in fault path as lockdep complains > > > about circular dependencies. So define a new fuse_inode->i_mmap_sem. > > > > That's precisely why filesystems like ext4 and XFS define their own > > rwsem. > > > > Note that this isn't a DAX requirement - the page fault > > serialisation is actually a requirement of hole punching... > > Hi Dave, > > I noticed that fuse code currently does not seem to have a rwsem which > can provide mutual exclusion between truncation/hole_punch path > and page fault path. I am wondering does that mean there are issues > with existing code or something else makes it unnecessary to provide > this mutual exlusion. I don't know enough about the fuse implementation to say. What I'm saying is that nothing in the core mm/ or VFS serilises page cache access to the data against direct filesystem manipulations of the underlying filesystem structures. i.e. nothing in the VFS or page fault IO path prevents this race condition: P0 P1 fallocate page cache invalidation page fault read data punch out data extents That's where the ext4 and XFS internal rwsem come into play: fallocate down_write(mmaplock) page cache invalidation page fault down_read(mmaplock) punch out data up_write(mmaplock) And there's not stale data exposure to userspace. It's the same reason that we use the i_rwsem to prevent concurrent IO while a truncate or hole punch is in progress. The IO could map the extent, then block in the IO path, while the filesytsem re-allocates and writes new data or metadata to those blocks. That's another potential non-owner data exposure problem. And if you don't drain AIO+DIO before truncate/hole punch, the i_rwsem does not protect you against concurrent IO as that gets dropped after the AIO is submitted and returns EIOCBQUEUED to the AIO layer. Hence there's IO in flight that isn't tracked by the i_rwsem or the MMAPLOCK, and if you punch out the blocks and reallocate them while the IO is in flight > > > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, > > > int mode, loff_t offset, > > > file_update_time(file); > > > } > > > > > > - if (mode & FALLOC_FL_PUNCH_HOLE) > > > + if (mode & FALLOC_FL_PUNCH_HOLE) { > > > + down_write(>i_mmap_sem); > > > truncate_pagecache_range(inode, offset, offset + length - 1); > > > - > > > + up_write(>i_mmap_sem); > > > + } > > > fuse_invalidate_attr(inode); > > > > > > I'm not sure this is sufficient. You have to lock page faults out > > for the entire time the hole punch is being performed, not just while > > the mapping is being invalidated. > > > > That is, once you've taken the inode lock and written back the dirty > > data over the range being punched, you can then take a page fault > > and dirty the page again. Then after you punch the hole out, > > you have a dirty page with non-zero data in it, and that can get > > written out before the page cache is truncated. > > Just for my better udnerstanding of the issue, I am wondering what > problem will it lead to. > If one process is doing punch_hole and other is writing in the > range being punched, end result could be anything. Either we will > read zeroes from punched_hole pages or we will read the data > written by process writing to mmaped page, depending on in what > order it got executed. > > If that's the case, then holding fi->i_mmap_sem for the whole > duration might not matter. What am I missing? That it is safe to invalidate the page cache after the hole has been punched. There is nothing stopping, say, memory reclaim from reclaiming pages over the range while the hole is being punched, then having the application refault them while the backing store is being freed. While the page fault read IO is in progress, there's nothing stopping the filesystem
Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote: > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote: > > We need some kind of locking mechanism here. Normal file systems like > > ext4 and xfs seems to take their own semaphore to protect agains > > truncate while fault is going on. > > > > We have additional requirement to protect against fuse dax memory range > > reclaim. When a range has been selected for reclaim, we need to make sure > > no other read/write/fault can try to access that memory range while > > reclaim is in progress. Once reclaim is complete, lock will be released > > and read/write/fault will trigger allocation of fresh dax range. > > > > Taking inode_lock() is not an option in fault path as lockdep complains > > about circular dependencies. So define a new fuse_inode->i_mmap_sem. > > That's precisely why filesystems like ext4 and XFS define their own > rwsem. > > Note that this isn't a DAX requirement - the page fault > serialisation is actually a requirement of hole punching... Hi Dave, I noticed that fuse code currently does not seem to have a rwsem which can provide mutual exclusion between truncation/hole_punch path and page fault path. I am wondering does that mean there are issues with existing code or something else makes it unnecessary to provide this mutual exlusion. > > > Signed-off-by: Vivek Goyal > > --- > > fs/fuse/dir.c| 2 ++ > > fs/fuse/file.c | 15 --- > > fs/fuse/fuse_i.h | 7 +++ > > fs/fuse/inode.c | 1 + > > 4 files changed, 22 insertions(+), 3 deletions(-) > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > index 26f028bc760b..f40766c0693b 100644 > > --- a/fs/fuse/dir.c > > +++ b/fs/fuse/dir.c > > @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct > > iattr *attr, > > */ > > if ((is_truncate || !is_wb) && > > S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { > > + down_write(>i_mmap_sem); > > truncate_pagecache(inode, outarg.attr.size); > > invalidate_inode_pages2(inode->i_mapping); > > + up_write(>i_mmap_sem); > > } > > > > clear_bit(FUSE_I_SIZE_UNSTABLE, >state); > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > > index be7d90eb5b41..00ad27216cc3 100644 > > --- a/fs/fuse/file.c > > +++ b/fs/fuse/file.c > > @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault > > *vmf, > > > > if (write) > > sb_start_pagefault(sb); > > - > > + /* > > +* We need to serialize against not only truncate but also against > > +* fuse dax memory range reclaim. While a range is being reclaimed, > > +* we do not want any read/write/mmap to make progress and try > > +* to populate page cache or access memory we are trying to free. > > +*/ > > + down_read(_fuse_inode(inode)->i_mmap_sem); > > ret = dax_iomap_fault(vmf, pe_size, , NULL, _iomap_ops); > > > > if (ret & VM_FAULT_NEEDDSYNC) > > ret = dax_finish_sync_fault(vmf, pe_size, pfn); > > + up_read(_fuse_inode(inode)->i_mmap_sem); > > > > if (write) > > sb_end_pagefault(sb); > > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, > > int mode, loff_t offset, > > file_update_time(file); > > } > > > > - if (mode & FALLOC_FL_PUNCH_HOLE) > > + if (mode & FALLOC_FL_PUNCH_HOLE) { > > + down_write(>i_mmap_sem); > > truncate_pagecache_range(inode, offset, offset + length - 1); > > - > > + up_write(>i_mmap_sem); > > + } > > fuse_invalidate_attr(inode); > > > I'm not sure this is sufficient. You have to lock page faults out > for the entire time the hole punch is being performed, not just while > the mapping is being invalidated. > > That is, once you've taken the inode lock and written back the dirty > data over the range being punched, you can then take a page fault > and dirty the page again. Then after you punch the hole out, > you have a dirty page with non-zero data in it, and that can get > written out before the page cache is truncated. Just for my better udnerstanding of the issue, I am wondering what problem will it lead to. If one process is doing punch_hole and other is writing in the range being punched, end result could be anything. Either we will read zeroes from punched_hole pages or we will read the data written by process writing to mmaped page, depending on in what order it got executed. If that's the case, then holding fi->i_mmap_sem for the whole duration might not matter. What am I missing? Thanks Vivek > > IOWs, to do a hole punch safely, you have to both lock the inode > and lock out page faults *before* you write back dirty data. Then > you can invalidate the page cache so you know there is no cached > data over the range about to be punched. Once the punch is done, > then you can drop all locks > > The same goes for any other operation that manipulates
Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote: > We need some kind of locking mechanism here. Normal file systems like > ext4 and xfs seems to take their own semaphore to protect agains > truncate while fault is going on. > > We have additional requirement to protect against fuse dax memory range > reclaim. When a range has been selected for reclaim, we need to make sure > no other read/write/fault can try to access that memory range while > reclaim is in progress. Once reclaim is complete, lock will be released > and read/write/fault will trigger allocation of fresh dax range. > > Taking inode_lock() is not an option in fault path as lockdep complains > about circular dependencies. So define a new fuse_inode->i_mmap_sem. That's precisely why filesystems like ext4 and XFS define their own rwsem. Note that this isn't a DAX requirement - the page fault serialisation is actually a requirement of hole punching... > Signed-off-by: Vivek Goyal > --- > fs/fuse/dir.c| 2 ++ > fs/fuse/file.c | 15 --- > fs/fuse/fuse_i.h | 7 +++ > fs/fuse/inode.c | 1 + > 4 files changed, 22 insertions(+), 3 deletions(-) > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 26f028bc760b..f40766c0693b 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct > iattr *attr, >*/ > if ((is_truncate || !is_wb) && > S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { > + down_write(>i_mmap_sem); > truncate_pagecache(inode, outarg.attr.size); > invalidate_inode_pages2(inode->i_mapping); > + up_write(>i_mmap_sem); > } > > clear_bit(FUSE_I_SIZE_UNSTABLE, >state); > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index be7d90eb5b41..00ad27216cc3 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault > *vmf, > > if (write) > sb_start_pagefault(sb); > - > + /* > + * We need to serialize against not only truncate but also against > + * fuse dax memory range reclaim. While a range is being reclaimed, > + * we do not want any read/write/mmap to make progress and try > + * to populate page cache or access memory we are trying to free. > + */ > + down_read(_fuse_inode(inode)->i_mmap_sem); > ret = dax_iomap_fault(vmf, pe_size, , NULL, _iomap_ops); > > if (ret & VM_FAULT_NEEDDSYNC) > ret = dax_finish_sync_fault(vmf, pe_size, pfn); > + up_read(_fuse_inode(inode)->i_mmap_sem); > > if (write) > sb_end_pagefault(sb); > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int > mode, loff_t offset, > file_update_time(file); > } > > - if (mode & FALLOC_FL_PUNCH_HOLE) > + if (mode & FALLOC_FL_PUNCH_HOLE) { > + down_write(>i_mmap_sem); > truncate_pagecache_range(inode, offset, offset + length - 1); > - > + up_write(>i_mmap_sem); > + } > fuse_invalidate_attr(inode); I'm not sure this is sufficient. You have to lock page faults out for the entire time the hole punch is being performed, not just while the mapping is being invalidated. That is, once you've taken the inode lock and written back the dirty data over the range being punched, you can then take a page fault and dirty the page again. Then after you punch the hole out, you have a dirty page with non-zero data in it, and that can get written out before the page cache is truncated. IOWs, to do a hole punch safely, you have to both lock the inode and lock out page faults *before* you write back dirty data. Then you can invalidate the page cache so you know there is no cached data over the range about to be punched. Once the punch is done, then you can drop all locks The same goes for any other operation that manipulates extents directly (other fallocate ops, truncate, etc). /me also wonders if there can be racing AIO+DIO in progress over the range that is being punched and whether fuse needs to call inode_dio_wait() before punching holes, running truncates, etc... Cheers, Dave. -- Dave Chinner da...@fromorbit.com