Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 23:24 +0100, Nick Piggin wrote: > mmap()s can be different from read in that the syscall may have little > relation to when the data gets used. But I guess it's still a best > effort thing. Fair enough. Agreed that mmap() is special and very problematic on NFS. However I can't see how we can improve on the existing models short of some significant protocol modifications, and so far, nobody has presented the IETF with a good case for why they need this level of cache consistency. Cheers Trond - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote: > Right, but I guess what Nick asked is, if pages could be stale to start > with, how is that avoided in the future. > > The way I understand it, this re-validate is just a best effort at > getting a coherent image. The normal convention for NFS is to use a close-to-open cache consistency model. In that model, applications must agree never to open the file for reading or writing if an application on a different NFS client already holds it open for writing. However there is no standard locking model for _enforcing_ such an agreement, so some setups do violate it. One obvious model that we try to support is that where the applications are using POSIX locking in order to ensure exclusive access to the data when requires. Another model is to rely rather on synchronous writes and heavy attribute revalidation to detect when a competing application has written to the file (the 'noac' mount option). While such a model is obviously deficient in that it can never guarantee cache coherency, we do attempt to ensure that it works on a per-operation basis (IOW: we check cache coherency before each call to read(), to mmap(), etc) since it is by far the easiest model to apply if you have applications that cannot be rewritten and that satisfy the requirement that they rarely conflict. Cheers Trond - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 05:18:50PM -0500, Trond Myklebust wrote: > > On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote: > > Right, but I guess what Nick asked is, if pages could be stale to start > > with, how is that avoided in the future. > > > > The way I understand it, this re-validate is just a best effort at > > getting a coherent image. > > The normal convention for NFS is to use a close-to-open cache > consistency model. In that model, applications must agree never to open > the file for reading or writing if an application on a different NFS > client already holds it open for writing. > > However there is no standard locking model for _enforcing_ such an > agreement, so some setups do violate it. One obvious model that we try > to support is that where the applications are using POSIX locking in > order to ensure exclusive access to the data when requires. > > Another model is to rely rather on synchronous writes and heavy > attribute revalidation to detect when a competing application has > written to the file (the 'noac' mount option). While such a model is > obviously deficient in that it can never guarantee cache coherency, we > do attempt to ensure that it works on a per-operation basis (IOW: we > check cache coherency before each call to read(), to mmap(), etc) since > it is by far the easiest model to apply if you have applications that > cannot be rewritten and that satisfy the requirement that they rarely > conflict. mmap()s can be different from read in that the syscall may have little relation to when the data gets used. But I guess it's still a best effort thing. Fair enough. Thanks, Nick - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 16:41 -0500, Trond Myklebust wrote: > On Wed, 2007-11-14 at 22:31 +0100, Peter Zijlstra wrote: > > On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote: > > > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote: > > > > Normal locking order is: > > > > > > > > i_mutex > > > > mmap_sem > > > > > > > > However NFS's ->mmap hook, which is called under mmap_sem, can take > > > > i_mutex. > > > > Avoid this potential deadlock by doing the work that requires i_mutex > > > > from > > > > the new ->mmap_prepare(). > > > > > > > > [ Is this sufficient, or does it introduce a race? ] > > > > > > Seems like an OK patchset in my opinion. I don't know much about NFS > > > unfortunately, but I wonder what prevents the condition fixed by > > > nfs_revalidate_mapping from happening again while the mmap is active...? > > > > As the changelog might have suggested, I'm not overly sure of the nfs > > requirements myself. I think it just does a best effort at getting the > > pages coherent with other clients, and then hopes for the best. > > > > I'll let Trond enlighten us further before I make an utter fool of > > myself :-) > > The NFS client needs to check the validity of already cached data before > it allows those pages to be mmapped. If it finds out that the cache is > stale, then we need to call invalidate_inode_pages2() to clear out the > cache and refresh it from the server. The inode->i_mutex is necessary in > order to prevent races between the new writes and the cache invalidation > code. Right, but I guess what Nick asked is, if pages could be stale to start with, how is that avoided in the future. The way I understand it, this re-validate is just a best effort at getting a coherent image. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 22:31 +0100, Peter Zijlstra wrote: > On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote: > > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote: > > > Normal locking order is: > > > > > > i_mutex > > > mmap_sem > > > > > > However NFS's ->mmap hook, which is called under mmap_sem, can take > > > i_mutex. > > > Avoid this potential deadlock by doing the work that requires i_mutex from > > > the new ->mmap_prepare(). > > > > > > [ Is this sufficient, or does it introduce a race? ] > > > > Seems like an OK patchset in my opinion. I don't know much about NFS > > unfortunately, but I wonder what prevents the condition fixed by > > nfs_revalidate_mapping from happening again while the mmap is active...? > > As the changelog might have suggested, I'm not overly sure of the nfs > requirements myself. I think it just does a best effort at getting the > pages coherent with other clients, and then hopes for the best. > > I'll let Trond enlighten us further before I make an utter fool of > myself :-) The NFS client needs to check the validity of already cached data before it allows those pages to be mmapped. If it finds out that the cache is stale, then we need to call invalidate_inode_pages2() to clear out the cache and refresh it from the server. The inode->i_mutex is necessary in order to prevent races between the new writes and the cache invalidation code. Cheers Trond - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote: > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote: > > Normal locking order is: > > > > i_mutex > > mmap_sem > > > > However NFS's ->mmap hook, which is called under mmap_sem, can take i_mutex. > > Avoid this potential deadlock by doing the work that requires i_mutex from > > the new ->mmap_prepare(). > > > > [ Is this sufficient, or does it introduce a race? ] > > Seems like an OK patchset in my opinion. I don't know much about NFS > unfortunately, but I wonder what prevents the condition fixed by > nfs_revalidate_mapping from happening again while the mmap is active...? As the changelog might have suggested, I'm not overly sure of the nfs requirements myself. I think it just does a best effort at getting the pages coherent with other clients, and then hopes for the best. I'll let Trond enlighten us further before I make an utter fool of myself :-) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote: > Normal locking order is: > > i_mutex > mmap_sem > > However NFS's ->mmap hook, which is called under mmap_sem, can take i_mutex. > Avoid this potential deadlock by doing the work that requires i_mutex from > the new ->mmap_prepare(). > > [ Is this sufficient, or does it introduce a race? ] Seems like an OK patchset in my opinion. I don't know much about NFS unfortunately, but I wonder what prevents the condition fixed by nfs_revalidate_mapping from happening again while the mmap is active...? > > Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> > --- > fs/nfs/file.c | 25 + > 1 file changed, 17 insertions(+), 8 deletions(-) > > Index: linux-2.6/fs/nfs/file.c > === > --- linux-2.6.orig/fs/nfs/file.c > +++ linux-2.6/fs/nfs/file.c > @@ -41,6 +41,9 @@ > static int nfs_file_open(struct inode *, struct file *); > static int nfs_file_release(struct inode *, struct file *); > static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); > +static int > +nfs_file_mmap_prepare(struct file * file, unsigned long len, > + unsigned long prot, unsigned long flags, unsigned long pgoff); > static int nfs_file_mmap(struct file *, struct vm_area_struct *); > static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos, > struct pipe_inode_info *pipe, > @@ -64,6 +67,7 @@ const struct file_operations nfs_file_op > .write = do_sync_write, > .aio_read = nfs_file_read, > .aio_write = nfs_file_write, > + .mmap_prepare = nfs_file_mmap_prepare, > .mmap = nfs_file_mmap, > .open = nfs_file_open, > .flush = nfs_file_flush, > @@ -270,7 +274,8 @@ nfs_file_splice_read(struct file *filp, > } > > static int > -nfs_file_mmap(struct file * file, struct vm_area_struct * vma) > +nfs_file_mmap_prepare(struct file * file, unsigned long len, > + unsigned long prot, unsigned long flags, unsigned long pgoff) > { > struct dentry *dentry = file->f_path.dentry; > struct inode *inode = dentry->d_inode; > @@ -279,13 +284,17 @@ nfs_file_mmap(struct file * file, struct > dfprintk(VFS, "nfs: mmap(%s/%s)\n", > dentry->d_parent->d_name.name, dentry->d_name.name); > > - status = nfs_revalidate_mapping(inode, file->f_mapping); > - if (!status) { > - vma->vm_ops = &nfs_file_vm_ops; > - vma->vm_flags |= VM_CAN_NONLINEAR; > - file_accessed(file); > - } > - return status; > + return nfs_revalidate_mapping(inode, file->f_mapping); > +} > + > +static int > +nfs_file_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + vma->vm_ops = &nfs_file_vm_ops; > + vma->vm_flags |= VM_CAN_NONLINEAR; > + file_accessed(file); > + > + return 0; > } > > /* > > -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
Normal locking order is: i_mutex mmap_sem However NFS's ->mmap hook, which is called under mmap_sem, can take i_mutex. Avoid this potential deadlock by doing the work that requires i_mutex from the new ->mmap_prepare(). [ Is this sufficient, or does it introduce a race? ] Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> --- fs/nfs/file.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -41,6 +41,9 @@ static int nfs_file_open(struct inode *, struct file *); static int nfs_file_release(struct inode *, struct file *); static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); +static int +nfs_file_mmap_prepare(struct file * file, unsigned long len, + unsigned long prot, unsigned long flags, unsigned long pgoff); static int nfs_file_mmap(struct file *, struct vm_area_struct *); static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos, struct pipe_inode_info *pipe, @@ -64,6 +67,7 @@ const struct file_operations nfs_file_op .write = do_sync_write, .aio_read = nfs_file_read, .aio_write = nfs_file_write, + .mmap_prepare = nfs_file_mmap_prepare, .mmap = nfs_file_mmap, .open = nfs_file_open, .flush = nfs_file_flush, @@ -270,7 +274,8 @@ nfs_file_splice_read(struct file *filp, } static int -nfs_file_mmap(struct file * file, struct vm_area_struct * vma) +nfs_file_mmap_prepare(struct file * file, unsigned long len, + unsigned long prot, unsigned long flags, unsigned long pgoff) { struct dentry *dentry = file->f_path.dentry; struct inode *inode = dentry->d_inode; @@ -279,13 +284,17 @@ nfs_file_mmap(struct file * file, struct dfprintk(VFS, "nfs: mmap(%s/%s)\n", dentry->d_parent->d_name.name, dentry->d_name.name); - status = nfs_revalidate_mapping(inode, file->f_mapping); - if (!status) { - vma->vm_ops = &nfs_file_vm_ops; - vma->vm_flags |= VM_CAN_NONLINEAR; - file_accessed(file); - } - return status; + return nfs_revalidate_mapping(inode, file->f_mapping); +} + +static int +nfs_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma->vm_ops = &nfs_file_vm_ops; + vma->vm_flags |= VM_CAN_NONLINEAR; + file_accessed(file); + + return 0; } /* -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html