Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies
Nicolas Boichat writes: > On Wed, Feb 24, 2021 at 6:44 PM Nicolas Boichat wrote: >> >> On Wed, Feb 24, 2021 at 6:22 PM Luis Henriques wrote: >> > >> > On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote: >> > > On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques >> > > wrote: >> > > > >> > > > A regression has been reported by Nicolas Boichat, found while using >> > > > the >> > > > copy_file_range syscall to copy a tracefs file. Before commit >> > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the >> > > > kernel would return -EXDEV to userspace when trying to copy a file >> > > > across >> > > > different filesystems. After this commit, the syscall doesn't fail >> > > > anymore >> > > > and instead returns zero (zero bytes copied), as this file's content is >> > > > generated on-the-fly and thus reports a size of zero. >> > > > >> > > > This patch restores some cross-filesystem copy restrictions that >> > > > existed >> > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy >> > > > across >> > > > devices"). Filesystems are still allowed to fall-back to the VFS >> > > > generic_copy_file_range() implementation, but that has now to be done >> > > > explicitly. >> > > > >> > > > nfsd is also modified to fall-back into generic_copy_file_range() in >> > > > case >> > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. 
>> > > > >> > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across >> > > > devices") >> > > > Link: >> > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ >> > > > Link: >> > > > https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/ >> > > > Link: >> > > > https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ >> > > > Reported-by: Nicolas Boichat >> > > > Signed-off-by: Luis Henriques >> > > >> > > I tested v8 and I believe it works for NFS. >> > >> > Thanks a lot for the testing. And to everyone else for reviews, >> > feedback,... and patience. >> >> Thanks so much to you!!! >> >> Works here, you can add my >> Tested-by: Nicolas Boichat > > What happened to this patch? It does not seem to have been picked up > yet? Any reason why? Hmm... good question. I'm not actually sure who would be picking it. Al, maybe...? Cheers, -- Luis > >> > >> > I'll now go look into the manpage and see what needs to be changed. >> > >> > Cheers, >> > -- >> > Luís
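Side note for userspace callers: the regression described above means a zero return from copy_file_range() on the first call cannot always be read as "empty source file". The sketch below is one way a caller might guard against that, assuming a glibc that exposes the copy_file_range() wrapper. The helper names copy_all() and copy_fallback() are made up for illustration and do not come from the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop, used when copy_file_range() cannot be relied on. */
static ssize_t copy_fallback(int fd_in, int fd_out)
{
    char buf[65536];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd_in, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return total;           /* end of file */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}

/*
 * Hypothetical helper: copy from the current offset of fd_in to fd_out.
 * Falls back to read()/write() if the very first copy_file_range() call
 * either fails with an "unsupported" style error or reports zero bytes,
 * since a file whose content is generated on the fly may report size 0.
 */
ssize_t copy_all(int fd_in, int fd_out)
{
    ssize_t total = 0;

    for (;;) {
        ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, 1 << 20, 0);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0 && total > 0)
            return total;           /* genuine end of file */
        if (n == 0)
            return copy_fallback(fd_in, fd_out);
        if (total == 0 && (errno == EXDEV || errno == EOPNOTSUPP ||
                           errno == ENOSYS || errno == EINVAL))
            return copy_fallback(fd_in, fd_out);
        return -1;
    }
}
```

If the source really is empty, the fallback simply reads zero bytes and returns 0, so the extra pass costs one read() call.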
Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2
Vivek Goyal writes: > On Mon, Mar 29, 2021 at 03:54:03PM +0100, Luis Henriques wrote: >> On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote: >> > Fuse client needs to send additional information to file server when >> > it calls SETXATTR(system.posix_acl_access). Right now there is no extra >> > space in fuse_setxattr_in. So introduce a v2 of the structure which has >> > more space in it and can be used to send extra flags. >> > >> > "struct fuse_setxattr_in_v2" is only used if file server opts-in for it >> > using >> > flag FUSE_SETXATTR_V2 during feature negotiations. >> > >> > Signed-off-by: Vivek Goyal >> > --- >> > fs/fuse/acl.c | 2 +- >> > fs/fuse/fuse_i.h | 5 - >> > fs/fuse/inode.c | 4 +++- >> > fs/fuse/xattr.c | 21 +++-- >> > include/uapi/linux/fuse.h | 10 ++ >> > 5 files changed, 33 insertions(+), 9 deletions(-) >> > >> > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c >> > index e9c0f916349d..d31260a139d4 100644 >> > --- a/fs/fuse/acl.c >> > +++ b/fs/fuse/acl.c >> > @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, >> > struct inode *inode, >> >return ret; >> >} >> > >> > - ret = fuse_setxattr(inode, name, value, size, 0); >> > + ret = fuse_setxattr(inode, name, value, size, 0, 0); >> >kfree(value); >> >} else { >> >ret = fuse_removexattr(inode, name); >> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h >> > index 63d97a15ffde..d00bf0b9a38c 100644 >> > --- a/fs/fuse/fuse_i.h >> > +++ b/fs/fuse/fuse_i.h >> > @@ -668,6 +668,9 @@ struct fuse_conn { >> >/** Is setxattr not implemented by fs? */ >> >unsigned no_setxattr:1; >> > >> > + /** Does file server support setxattr_v2 */ >> > + unsigned setxattr_v2:1; >> > + >> >/** Is getxattr not implemented by fs? 
*/ >> >unsigned no_getxattr:1; >> > >> > @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool >> > locked); >> > bool fuse_lock_inode(struct inode *inode); >> > >> > int fuse_setxattr(struct inode *inode, const char *name, const void >> > *value, >> > -size_t size, int flags); >> > +size_t size, int flags, unsigned extra_flags); >> > ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value, >> > size_t size); >> > ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size); >> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c >> > index b0e18b470e91..1c726df13f80 100644 >> > --- a/fs/fuse/inode.c >> > +++ b/fs/fuse/inode.c >> > @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount >> > *fm, struct fuse_args *args, >> >fc->handle_killpriv_v2 = 1; >> >fm->sb->s_flags |= SB_NOSEC; >> >} >> > + if (arg->flags & FUSE_SETXATTR_V2) >> > + fc->setxattr_v2 = 1; >> >} else { >> >ra_pages = fc->max_read / PAGE_SIZE; >> >fc->no_lock = 1; >> > @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm) >> >FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL | >> >FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS | >> >FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA | >> > - FUSE_HANDLE_KILLPRIV_V2; >> > + FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2; >> > #ifdef CONFIG_FUSE_DAX >> >if (fm->fc->dax) >> >ia->in.flags |= FUSE_MAP_ALIGNMENT; >> > diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c >> > index 1a7d7ace54e1..f2aae72653dc 100644 >> > --- a/fs/fuse/xattr.c >> > +++ b/fs/fuse/xattr.c >> > @@ -12,24 +12,33 @@ >> > #include >> > >> > int fuse_setxattr(struct inode *inode, const char *name, const void >> > *value, >> > -size_t size, int flags) >> > +size_t size, int flags, unsigned extra_flags) >> > { >> >struct fuse_mount *fm = get_fuse_mount(inode); >> >FUSE_A
Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2
On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote: > Fuse client needs to send additional information to file server when > it calls SETXATTR(system.posix_acl_access). Right now there is no extra > space in fuse_setxattr_in. So introduce a v2 of the structure which has > more space in it and can be used to send extra flags. > > "struct fuse_setxattr_in_v2" is only used if file server opts-in for it using > flag FUSE_SETXATTR_V2 during feature negotiations. > > Signed-off-by: Vivek Goyal > --- > fs/fuse/acl.c | 2 +- > fs/fuse/fuse_i.h | 5 - > fs/fuse/inode.c | 4 +++- > fs/fuse/xattr.c | 21 +++-- > include/uapi/linux/fuse.h | 10 ++ > 5 files changed, 33 insertions(+), 9 deletions(-) > > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > index e9c0f916349d..d31260a139d4 100644 > --- a/fs/fuse/acl.c > +++ b/fs/fuse/acl.c > @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct > inode *inode, > return ret; > } > > - ret = fuse_setxattr(inode, name, value, size, 0); > + ret = fuse_setxattr(inode, name, value, size, 0, 0); > kfree(value); > } else { > ret = fuse_removexattr(inode, name); > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 63d97a15ffde..d00bf0b9a38c 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -668,6 +668,9 @@ struct fuse_conn { > /** Is setxattr not implemented by fs? */ > unsigned no_setxattr:1; > > + /** Does file server support setxattr_v2 */ > + unsigned setxattr_v2:1; > + > /** Is getxattr not implemented by fs? 
*/ > unsigned no_getxattr:1; > > @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool > locked); > bool fuse_lock_inode(struct inode *inode); > > int fuse_setxattr(struct inode *inode, const char *name, const void *value, > - size_t size, int flags); > + size_t size, int flags, unsigned extra_flags); > ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value, > size_t size); > ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size); > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index b0e18b470e91..1c726df13f80 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, > struct fuse_args *args, > fc->handle_killpriv_v2 = 1; > fm->sb->s_flags |= SB_NOSEC; > } > + if (arg->flags & FUSE_SETXATTR_V2) > + fc->setxattr_v2 = 1; > } else { > ra_pages = fc->max_read / PAGE_SIZE; > fc->no_lock = 1; > @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm) > FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL | > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS | > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA | > - FUSE_HANDLE_KILLPRIV_V2; > + FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2; > #ifdef CONFIG_FUSE_DAX > if (fm->fc->dax) > ia->in.flags |= FUSE_MAP_ALIGNMENT; > diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c > index 1a7d7ace54e1..f2aae72653dc 100644 > --- a/fs/fuse/xattr.c > +++ b/fs/fuse/xattr.c > @@ -12,24 +12,33 @@ > #include > > int fuse_setxattr(struct inode *inode, const char *name, const void *value, > - size_t size, int flags) > + size_t size, int flags, unsigned extra_flags) > { > struct fuse_mount *fm = get_fuse_mount(inode); > FUSE_ARGS(args); > struct fuse_setxattr_in inarg; > + struct fuse_setxattr_in_v2 inarg_v2; > + bool setxattr_v2 = fm->fc->setxattr_v2; > int err; > > if (fm->fc->no_setxattr) > return -EOPNOTSUPP; > > memset(, 0, sizeof(inarg)); > - inarg.size = size; > - inarg.flags = flags; > + 
memset(&inarg_v2, 0, sizeof(inarg_v2));
> +       if (setxattr_v2) {
> +               inarg_v2.size = size;
> +               inarg_v2.flags = flags;
> +               inarg_v2.setxattr_flags = extra_flags;
> +       } else {
> +               inarg.size = size;
> +               inarg.flags = flags;
> +       }
>         args.opcode = FUSE_SETXATTR;
>         args.nodeid = get_node_id(inode);
>         args.in_numargs = 3;
> -       args.in_args[0].size = sizeof(inarg);
> -       args.in_args[0].value = &inarg;
> +       args.in_args[0].size = setxattr_v2 ? sizeof(inarg_v2) : sizeof(inarg);
> +       args.in_args[0].value = setxattr_v2 ? &inarg_v2 : (void *)&inarg;

And yet another minor: it's a bit awkward to have to cast '&inarg' to
'void *' just because you're using the ternary operator.  Why not use an
'if' statement instead for initializing .size and .value?

Cheers,
--
Luís
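To illustrate the review comment above, here is a minimal userspace sketch of the 'if' variant. The struct layouts and the helper name fill_setxattr_arg() are stand-ins invented for this example, not the kernel definitions. The point is only that assigning inside the two branches needs no cast, because each branch assigns a pointer of one known type to a 'const void *'.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for the FUSE structures (layouts are assumptions). */
struct setxattr_in {
    uint32_t size;
    uint32_t flags;
};

struct setxattr_in_v2 {
    uint32_t size;
    uint32_t flags;
    uint64_t setxattr_flags;
};

/* Stand-in for one entry of args.in_args[]. */
struct in_arg {
    size_t size;
    const void *value;
};

/* Hypothetical helper: pick the v1 or v2 request layout with a plain 'if'. */
void fill_setxattr_arg(struct in_arg *arg, int use_v2,
                       struct setxattr_in *v1, struct setxattr_in_v2 *v2,
                       uint32_t size, uint32_t flags, uint64_t extra_flags)
{
    memset(v1, 0, sizeof(*v1));
    memset(v2, 0, sizeof(*v2));

    if (use_v2) {
        v2->size = size;
        v2->flags = flags;
        v2->setxattr_flags = extra_flags;
        arg->size = sizeof(*v2);
        arg->value = v2;        /* no cast needed */
    } else {
        v1->size = size;
        v1->flags = flags;
        arg->size = sizeof(*v1);
        arg->value = v1;        /* no cast needed */
    }
}
```

With the ternary form, both result operands must have compatible pointer types, which is why the original hunk had to cast one side to 'void *'.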
Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2
On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote: > Fuse client needs to send additional information to file server when > it calls SETXATTR(system.posix_acl_access). Right now there is no extra > space in fuse_setxattr_in. So introduce a v2 of the structure which has > more space in it and can be used to send extra flags. > > "struct fuse_setxattr_in_v2" is only used if file server opts-in for it using > flag FUSE_SETXATTR_V2 during feature negotiations. > > Signed-off-by: Vivek Goyal > --- > fs/fuse/acl.c | 2 +- > fs/fuse/fuse_i.h | 5 - > fs/fuse/inode.c | 4 +++- > fs/fuse/xattr.c | 21 +++-- > include/uapi/linux/fuse.h | 10 ++ > 5 files changed, 33 insertions(+), 9 deletions(-) > > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c > index e9c0f916349d..d31260a139d4 100644 > --- a/fs/fuse/acl.c > +++ b/fs/fuse/acl.c > @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct > inode *inode, > return ret; > } > > - ret = fuse_setxattr(inode, name, value, size, 0); > + ret = fuse_setxattr(inode, name, value, size, 0, 0); > kfree(value); > } else { > ret = fuse_removexattr(inode, name); > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 63d97a15ffde..d00bf0b9a38c 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -668,6 +668,9 @@ struct fuse_conn { > /** Is setxattr not implemented by fs? */ > unsigned no_setxattr:1; > > + /** Does file server support setxattr_v2 */ > + unsigned setxattr_v2:1; > + Minor (pedantic!) comment: most of the fields here start with 'no_*', so maybe it's worth setting the logic to use 'no_setxattr_v2' instead? Cheers, -- Luís > /** Is getxattr not implemented by fs? 
*/ > unsigned no_getxattr:1; > > @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool > locked); > bool fuse_lock_inode(struct inode *inode); > > int fuse_setxattr(struct inode *inode, const char *name, const void *value, > - size_t size, int flags); > + size_t size, int flags, unsigned extra_flags); > ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value, > size_t size); > ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size); > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index b0e18b470e91..1c726df13f80 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, > struct fuse_args *args, > fc->handle_killpriv_v2 = 1; > fm->sb->s_flags |= SB_NOSEC; > } > + if (arg->flags & FUSE_SETXATTR_V2) > + fc->setxattr_v2 = 1; > } else { > ra_pages = fc->max_read / PAGE_SIZE; > fc->no_lock = 1; > @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm) > FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL | > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS | > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA | > - FUSE_HANDLE_KILLPRIV_V2; > + FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2; > #ifdef CONFIG_FUSE_DAX > if (fm->fc->dax) > ia->in.flags |= FUSE_MAP_ALIGNMENT; > diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c > index 1a7d7ace54e1..f2aae72653dc 100644 > --- a/fs/fuse/xattr.c > +++ b/fs/fuse/xattr.c > @@ -12,24 +12,33 @@ > #include > > int fuse_setxattr(struct inode *inode, const char *name, const void *value, > - size_t size, int flags) > + size_t size, int flags, unsigned extra_flags) > { > struct fuse_mount *fm = get_fuse_mount(inode); > FUSE_ARGS(args); > struct fuse_setxattr_in inarg; > + struct fuse_setxattr_in_v2 inarg_v2; > + bool setxattr_v2 = fm->fc->setxattr_v2; > int err; > > if (fm->fc->no_setxattr) > return -EOPNOTSUPP; > > memset(, 0, sizeof(inarg)); > - inarg.size = size; > - inarg.flags = flags; > + 
memset(&inarg_v2, 0, sizeof(inarg_v2));
> +       if (setxattr_v2) {
> +               inarg_v2.size = size;
> +               inarg_v2.flags = flags;
> +               inarg_v2.setxattr_flags = extra_flags;
> +       } else {
> +               inarg.size = size;
> +               inarg.flags = flags;
> +       }
>         args.opcode = FUSE_SETXATTR;
>         args.nodeid = get_node_id(inode);
>         args.in_numargs = 3;
> -       args.in_args[0].size = sizeof(inarg);
> -       args.in_args[0].value = &inarg;
> +       args.in_args[0].size = setxattr_v2 ? sizeof(inarg_v2) : sizeof(inarg);
> +       args.in_args[0].value = setxattr_v2 ? &inarg_v2 : (void *)&inarg;
>         args.in_args[1].size = strlen(name) + 1;
>
Re: fuse: kernel BUG at mm/truncate.c:763!
On Fri, Mar 19, 2021 at 09:02:33AM +0000, Luis Henriques wrote:
> On Thu, Mar 18, 2021 at 11:55:43AM +0000, Matthew Wilcox wrote:
> > On Thu, Mar 18, 2021 at 11:29:28AM +0000, Luis Henriques wrote:
> > > On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0
> > > > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0
> > > > > > compound_pincount:0
> > > > >
> > > > > This is a compound page alright.  Have no idea how it got into fuse's
> > > > > pagecache.
> > > >
> > > >
> > > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?
> > >
> > > Yes, it looks like Tumbleweed kernels have that config option enabled by
> > > default.  And this feature was introduced in 5.4 (the bug doesn't seem
> > > to be reproducible in 5.3).
> >
> > Can you try adding this patch?
> >
> > https://git.infradead.org/users/willy/pagecache.git/commitdiff/369a4fcd78369b7a026bdef465af9669bde98ef4
>
> Good news, looks like this patch fixes the issue[1].  Thanks a lot
> everyone.  Is this already queued somewhere for 5.12?  Also, it would be
> nice to have it Cc'ed for stable kernels >= 5.4.

Ping.  Are you planning to push this for 5.12, or is it queued for the
5.13 merge window?  Or "none of the above"? :)

Cheers,
--
Luís
Re: fuse: kernel BUG at mm/truncate.c:763!
On Thu, Mar 18, 2021 at 11:55:43AM +0000, Matthew Wilcox wrote:
> On Thu, Mar 18, 2021 at 11:29:28AM +0000, Luis Henriques wrote:
> > On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0
> > > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0
> > > > > compound_pincount:0
> > > >
> > > > This is a compound page alright.  Have no idea how it got into fuse's
> > > > pagecache.
> > >
> > >
> > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?
> >
> > Yes, it looks like Tumbleweed kernels have that config option enabled by
> > default.  And this feature was introduced in 5.4 (the bug doesn't seem
> > to be reproducible in 5.3).
>
> Can you try adding this patch?
>
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/369a4fcd78369b7a026bdef465af9669bde98ef4

Good news, looks like this patch fixes the issue[1].  Thanks a lot
everyone.  Is this already queued somewhere for 5.12?  Also, it would be
nice to have it Cc'ed for stable kernels >= 5.4.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1182929#c24

Cheers,
--
Luís
Re: fuse: kernel BUG at mm/truncate.c:763!
On Thu, Mar 18, 2021 at 11:55:43AM +0000, Matthew Wilcox wrote:
> On Thu, Mar 18, 2021 at 11:29:28AM +0000, Luis Henriques wrote:
> > On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0
> > > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0
> > > > > compound_pincount:0
> > > >
> > > > This is a compound page alright.  Have no idea how it got into fuse's
> > > > pagecache.
> > >
> > >
> > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?
> >
> > Yes, it looks like Tumbleweed kernels have that config option enabled by
> > default.  And this feature was introduced in 5.4 (the bug doesn't seem
> > to be reproducible in 5.3).
>
> Can you try adding this patch?
>
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/369a4fcd78369b7a026bdef465af9669bde98ef4

Yep, sure.  Unfortunately, the testing round-trip can take a while.  I'll
push a new kernel build and ask the reporter to give it a try.

[ I'll add this patch on top of the s/BUG_ON/VM_BUG_ON_PAGE change. ]

Cheers,
--
Luís
Re: fuse: kernel BUG at mm/truncate.c:763!
On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote: > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote: > > [CC linux-mm] > > > > On Thu, Mar 18, 2021 at 10:25 AM Luis Henriques wrote: > > > > > > (I thought Vlastimil was already on CC...) > > > > > > On Mon, Mar 15, 2021 at 11:06:59AM +, Matthew Wilcox wrote: > > > > On Mon, Mar 15, 2021 at 09:47:45AM +, Luis Henriques wrote: > > > > > On Fri, Mar 12, 2021 at 01:11:23PM +0000, Matthew Wilcox wrote: > > > > > > On Fri, Mar 12, 2021 at 12:21:59PM +, Luis Henriques wrote: > > > > > > > > > I've seen a bug report (5.10.16 kernel splat below) that > > > > > > > > > seems to be > > > > > > > > > reproducible in kernels as early as 5.4. > > > > > > > > > > > > If this is reproducible, can you turn this BUG_ON into a > > > > > > VM_BUG_ON_PAGE() > > > > > > so we know what kind of problem we're dealing with? Assuming the > > > > > > SUSE > > > > > > tumbleweed kernels enable CONFIG_DEBUG_VM, which I'm sure they do. > > > > > > > > > > Just to make sure I got this right, you want to test something like > > > > > this: > > > > > > > > > > } > > > > > } > > > > > - BUG_ON(page_mapped(page)); > > > > > + VM_BUG_ON_PAGE(page_mapped(page), page); > > > > > ret2 = do_launder_page(mapping, page); > > > > > if (ret2 == 0) { > > > > > if (!invalidate_complete_page2(mapping, > > > > > page)) > > > > > > > > Yes, exactly. > > > > > > Ok, finally I got some feedback from the bug reporter. Please see bellow > > > the kernel log with the VM_BUG_ON_PAGE() in place. Also note that this is > > > on a 5.12-rc3, vanilla. > > > > > > Cheers, > > > -- > > > Luís > > > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0 > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00 > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 > > > compound_pincount:0 > > > > This is a compound page alright. Have no idea how it got into fuse's > > pagecache. > > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled? 
Yes, it looks like Tumbleweed kernels have that config option enabled by default. And it this feature was introduced in 5.4 (the bug doesn't seem to be reproducible in 5.3). Cheers, -- Luís > > > [16247.536361] memcg:8e730012b000 > > > [16247.536364] aops:fuse_file_aops [fuse] ino:8b8 dentry name:"cc1plus" > > > [16247.536379] flags: > > > 0xa800010037(locked|referenced|uptodate|lru|active|head) > > > [16247.536385] raw: 00a800010037 d6519ed9c448 d651abea5b08 > > > 8eb2f9a02ef8 > > > [16247.536388] raw: 1400 02a1 > > > 8e730012b000 > > > [16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) > > > [16247.536399] [ cut here ] > > > [16247.536400] kernel BUG at mm/truncate.c:678! > > > [16247.536406] invalid opcode: [#1] SMP PTI > > > [16247.536416] CPU: 42 PID: 2063761 Comm: g++ Not tainted > > > 5.12.0-rc3-1.g008d601-default #1 openSUSE Tumbleweed (unreleased) > > > [16247.536423] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a > > > 10/16/2019 > > > [16247.536427] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550 > > > [16247.536436] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 > > > 01 00 84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 > > > 00 <0f> 0b 48 8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff > > > [16247.536444] RSP: :a18cb0af7a40 EFLAGS: 00010246 > > > [16247.536450] RAX: 0036 RBX: 000d RCX: > > > 8ef13fc9a748 > > > [16247.536455] RDX: RSI: 0027 RDI: > > > 8ef13fc9a740 > > > [16247.536460] RBP: 8eb2f9a02ef8 R08: 8ef23ffb48a8 R09: > > > 0004fffb > > > [16247.536464] R10: R11: 3fff R12: > > > 1400 > > > [16247
Re: fuse: kernel BUG at mm/truncate.c:763!
(I thought Vlastimil was already on CC...) On Mon, Mar 15, 2021 at 11:06:59AM +, Matthew Wilcox wrote: > On Mon, Mar 15, 2021 at 09:47:45AM +0000, Luis Henriques wrote: > > On Fri, Mar 12, 2021 at 01:11:23PM +, Matthew Wilcox wrote: > > > On Fri, Mar 12, 2021 at 12:21:59PM +0000, Luis Henriques wrote: > > > > > > I've seen a bug report (5.10.16 kernel splat below) that seems to be > > > > > > reproducible in kernels as early as 5.4. > > > > > > If this is reproducible, can you turn this BUG_ON into a VM_BUG_ON_PAGE() > > > so we know what kind of problem we're dealing with? Assuming the SUSE > > > tumbleweed kernels enable CONFIG_DEBUG_VM, which I'm sure they do. > > > > Just to make sure I got this right, you want to test something like this: > > > > } > > } > > - BUG_ON(page_mapped(page)); > > + VM_BUG_ON_PAGE(page_mapped(page), page); > > ret2 = do_launder_page(mapping, page); > > if (ret2 == 0) { > > if (!invalidate_complete_page2(mapping, page)) > > Yes, exactly. Ok, finally I got some feedback from the bug reporter. Please see bellow the kernel log with the VM_BUG_ON_PAGE() in place. Also note that this is on a 5.12-rc3, vanilla. Cheers, -- Luís [16247.536348] page:dfe36ab1 refcount:673 mapcount:0 mapping:f982a7f8 index:0x1400 pfn:0x4c65e00 [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 compound_pincount:0 [16247.536361] memcg:8e730012b000 [16247.536364] aops:fuse_file_aops [fuse] ino:8b8 dentry name:"cc1plus" [16247.536379] flags: 0xa800010037(locked|referenced|uptodate|lru|active|head) [16247.536385] raw: 00a800010037 d6519ed9c448 d651abea5b08 8eb2f9a02ef8 [16247.536388] raw: 1400 02a1 8e730012b000 [16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) [16247.536399] [ cut here ] [16247.536400] kernel BUG at mm/truncate.c:678! 
[16247.536406] invalid opcode: [#1] SMP PTI [16247.536416] CPU: 42 PID: 2063761 Comm: g++ Not tainted 5.12.0-rc3-1.g008d601-default #1 openSUSE Tumbleweed (unreleased) [16247.536423] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a 10/16/2019 [16247.536427] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550 [16247.536436] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 01 00 84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 00 <0f> 0b 48 8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff [16247.536444] RSP: :a18cb0af7a40 EFLAGS: 00010246 [16247.536450] RAX: 0036 RBX: 000d RCX: 8ef13fc9a748 [16247.536455] RDX: RSI: 0027 RDI: 8ef13fc9a740 [16247.536460] RBP: 8eb2f9a02ef8 R08: 8ef23ffb48a8 R09: 0004fffb [16247.536464] R10: R11: 3fff R12: 1400 [16247.536468] R13: 8eb2f9a02f00 R14: R15: d651b1978000 [16247.536473] FS: 7f97c1717740() GS:8ef13fc8() knlGS: [16247.536478] CS: 0010 DS: ES: CR0: 80050033 [16247.536483] CR2: 7fd48a25a7c0 CR3: 0040aa3ac006 CR4: 007706e0 [16247.536487] DR0: DR1: DR2: [16247.536491] DR3: DR6: fffe0ff0 DR7: 0400 [16247.536495] PKRU: 5554 [16247.536498] Call Trace: [16247.536506] fuse_finish_open+0x82/0x150 [fuse] [16247.536520] fuse_open_common+0x1a8/0x1b0 [fuse] [16247.536530] ? fuse_open_common+0x1b0/0x1b0 [fuse] [16247.536540] do_dentry_open+0x14e/0x380 [16247.536547] path_openat+0xaf6/0x10a0 [16247.536555] do_filp_open+0x88/0x130 [16247.536560] ? security_prepare_creds+0x6d/0x90 [16247.536566] ? __kmalloc+0x157/0x2e0 [16247.536575] do_open_execat+0x6d/0x1a0 [16247.536581] bprm_execve+0x128/0x660 [16247.536587] do_execveat_common+0x192/0x1c0 [16247.536593] __x64_sys_execve+0x39/0x50 [16247.536599] do_syscall_64+0x33/0x80 [16247.536606] entry_SYSCALL_64_after_hwframe+0x44/0xae [16247.536614] RIP: 0033:0x7f97c0efec37 [16247.536621] Code: Unable to access opcode bytes at RIP 0x7f97c0efec0d. 
[16247.536625] RSP: 002b:7ffdc2fdea68 EFLAGS: 0202 ORIG_RAX: 003b [16247.536631] RAX: ffda RBX: 7f97c17176a0 RCX: 7f97c0efec37 [16247.536635] RDX: 00ea42c0 RSI: 00ea5848 RDI: 00ea5d00 [16247.536639] RBP: 0001 R08: R09: [16247.536643] R10: 7ffdc2fdde60 R11: 0202 R12: [16247.536647] R13: 0001 R14: 00ea5d00 R15: [16247.536653] Modules l
Re: [PATCH mm] kfence: make compatible with kmemleak
On Wed, Mar 17, 2021 at 09:47:40AM +0100, Marco Elver wrote:
> Because memblock allocations are registered with kmemleak, the KFENCE
> pool was seen by kmemleak as one large object. Later allocations through
> kfence_alloc() that were registered with kmemleak via
> slab_post_alloc_hook() would then overlap and trigger a warning.
> Therefore, once the pool is initialized, we can remove (free) it from
> kmemleak again, since it should be treated as allocator-internal and be
> seen as "free memory".
>
> The second problem is that kmemleak is passed the rounded size, and not
> the originally requested size, which is also the size of KFENCE objects.
> To avoid kmemleak scanning past the end of an object and trigger a
> KFENCE out-of-bounds error, fix the size if it is a KFENCE object.
>
> For simplicity, to avoid a call to kfence_ksize() in
> slab_post_alloc_hook() (and avoid new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
> guard), just call kfence_ksize() in mm/kmemleak.c:create_object().
>
> Reported-by: Luis Henriques
> Cc: Catalin Marinas
> Signed-off-by: Marco Elver

Tested-by: Luis Henriques

> ---
>  mm/kfence/core.c | 9 +
>  mm/kmemleak.c    | 3 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index f7106f28443d..768dbd58170d 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -12,6 +12,7 @@
>  #include
>  #include
>  #include
> +#include <linux/kmemleak.h>
>  #include
>  #include
>  #include
> @@ -481,6 +482,14 @@ static bool __init kfence_init_pool(void)
>                 addr += 2 * PAGE_SIZE;
>         }
>
> +       /*
> +        * The pool is live and will never be deallocated from this point on.
> +        * Remove the pool object from the kmemleak object tree, as it would
> +        * otherwise overlap with allocations returned by kfence_alloc(), which
> +        * are registered with kmemleak through the slab post-alloc hook.
> +        */
> +       kmemleak_free(__kfence_pool);
> +
>         return true;
>
>  err:
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index c0014d3b91c1..fe6e3ae8e8c6 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -97,6 +97,7 @@
>  #include
>
>  #include
> +#include <linux/kfence.h>
>  #include
>  #include
>
> @@ -589,7 +590,7 @@ static struct kmemleak_object *create_object(unsigned
> long ptr, size_t size,
>         atomic_set(&object->use_count, 1);
>         object->flags = OBJECT_ALLOCATED;
>         object->pointer = ptr;
> -       object->size = size;
> +       object->size = kfence_ksize((void *)ptr) ?: size;
>         object->excess_ref = 0;
>         object->min_count = min_count;
>         object->count = 0;                      /* white color initially */
> --
> 2.31.0.rc2.261.g7f71774620-goog
>
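For readers unfamiliar with the 'x ?: y' form used in the create_object() hunk: it is the GNU C conditional with an omitted middle operand, which yields x if x is non-zero and y otherwise, evaluating x only once. The sketch below models the size selection in userspace. The names demo_pool, demo_ksize() and tracked_size() are invented here, and the fixed object size of 16 is an arbitrary assumption standing in for whatever kfence_ksize() would report.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the KFENCE pool: any pointer inside it counts as "special". */
static char demo_pool[64];

/*
 * Hypothetical stand-in for kfence_ksize(): returns the exact object size
 * for pointers inside demo_pool, or 0 for any other pointer.
 */
size_t demo_ksize(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t start = (uintptr_t)demo_pool;

    return (addr >= start && addr < start + sizeof(demo_pool)) ? 16 : 0;
}

/* Prefer the exact size when known, else keep the rounded size. */
size_t tracked_size(const void *p, size_t rounded)
{
    /* GNU extension: same as demo_ksize(p) ? demo_ksize(p) : rounded,
     * but with demo_ksize(p) evaluated a single time. */
    return demo_ksize(p) ?: rounded;
}
```

This builds with GCC and Clang in their default modes. A strictly conforming ISO C compiler would need the long form with an explicit middle operand.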
[PATCH v2] virtiofs: fix memory leak in virtio_fs_probe()
When accidentally passing the same tag twice to qemu, kmemleak ended up
reporting a memory leak in virtiofs.  Also, looking at the log I saw the
following error (that's when I realised the duplicated tag):

  virtiofs: probe of virtio5 failed with error -17

Here's the kmemleak log for reference:

unreferenced object 0x888103d47800 (size 1024):
  comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
  hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
    ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff
  backtrace:
    [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
    [<f8aca419>] virtio_dev_probe+0x15f/0x210
    [<4d6baf3c>] really_probe+0xea/0x430
    [<a6ceeac8>] device_driver_attach+0xa8/0xb0
    [<196f47a7>] __driver_attach+0x98/0x140
    [<0b20601d>] bus_for_each_dev+0x7b/0xc0
    [<399c7b7f>] bus_add_driver+0x11b/0x1f0
    [<32b09ba7>] driver_register+0x8f/0xe0
    [<cdd55998>] 0xa002c013
    [<0ea196a2>] do_one_initcall+0x64/0x2e0
    [<08f727ce>] do_init_module+0x5c/0x260
    [<3cdedab6>] __do_sys_finit_module+0xb5/0x120
    [<ad2f48c6>] do_syscall_64+0x33/0x40
    [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: sta...@vger.kernel.org
Signed-off-by: Luis Henriques
---
Changes since v1:
- Use kfree() to free fs->vqs instead of calling virtio_fs_put()

 fs/fuse/virtio_fs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8868ac31a3c0..989ef4f88636 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -896,6 +896,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 out_vqs:
        vdev->config->reset(vdev);
        virtio_fs_cleanup_vqs(vdev, fs);
+       kfree(fs->vqs);

 out:
        vdev->priv = NULL;
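The pattern behind this one-line fix can be modelled in userspace. The sketch below is not the virtiofs code: demo_probe(), struct fs_dev and the allocation counter are invented for illustration. It shows a probe-style function with goto-based unwinding where an inner allocation (the analogue of fs->vqs) has to be released on the late error path before the outer object is freed.

```c
#include <errno.h>
#include <stdlib.h>

struct vq {
    int id;
};

struct fs_dev {
    struct vq *vqs;     /* inner allocation, analogous to fs->vqs */
};

/* Crude live-allocation counter so the demo can detect a leak. */
int live_allocs;

static void *demo_alloc(size_t n)
{
    void *p = calloc(1, n);

    if (p)
        live_allocs++;
    return p;
}

static void demo_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/*
 * Hypothetical probe routine.  'fail_late' simulates an error that occurs
 * after the inner array exists, such as a duplicate tag (-EEXIST, i.e. -17).
 */
int demo_probe(int fail_late, struct fs_dev **out)
{
    struct fs_dev *fs = demo_alloc(sizeof(*fs));
    int ret;

    if (!fs)
        return -ENOMEM;

    fs->vqs = demo_alloc(4 * sizeof(*fs->vqs));
    if (!fs->vqs) {
        ret = -ENOMEM;
        goto out;
    }

    if (fail_late) {
        ret = -EEXIST;
        goto out_vqs;
    }

    *out = fs;
    return 0;

out_vqs:
    demo_free(fs->vqs);     /* without this step the inner array leaks */
out:
    demo_free(fs);
    return ret;
}
```

Dropping the demo_free(fs->vqs) line reproduces the shape of the leak: the late failure returns -EEXIST and the inner array is never released.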
Re: Issue with kfence and kmemleak
On Tue, Mar 16, 2021 at 07:47:00PM +0100, Marco Elver wrote: > On Tue, Mar 16, 2021 at 06:19PM +, Catalin Marinas wrote: > > On Tue, Mar 16, 2021 at 06:30:00PM +0100, Marco Elver wrote: > > > On Tue, Mar 16, 2021 at 04:42PM +0000, Luis Henriques wrote: > > > > This is probably a known issue, but just in case: looks like it's not > > > > possible to use kmemleak when kfence is enabled: > > > > > > > > [0.272136] kmemleak: Cannot insert 0x888236e02f00 into the > > > > object search tree (overlaps existing) > > > > [0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92 > > > > [0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > > BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 > > > > [0.272136] Call Trace: > > > > [0.272136] dump_stack+0x6d/0x89 > > > > [0.272136] create_object.isra.0.cold+0x40/0x62 > > > > [0.272136] ? process_one_work+0x5a0/0x5a0 > > > > [0.272136] ? process_one_work+0x5a0/0x5a0 > > > > [0.272136] kmem_cache_alloc_trace+0x110/0x2f0 > > > > [0.272136] ? process_one_work+0x5a0/0x5a0 > > > > [0.272136] kthread+0x3f/0x150 > > > > [0.272136] ? lockdep_hardirqs_on_prepare+0xd4/0x170 > > > > [0.272136] ? __kthread_bind_mask+0x60/0x60 > > > > [0.272136] ret_from_fork+0x22/0x30 > > > > [0.272136] kmemleak: Kernel memory leak detector disabled > > > > [0.272136] kmemleak: Object 0x888236e0 (size 2097152): > > > > [0.272136] kmemleak: comm "swapper", pid 0, jiffies 4294892296 > > > > [0.272136] kmemleak: min_count = 0 > > > > [0.272136] kmemleak: count = 0 > > > > [0.272136] kmemleak: flags = 0x1 > > > > [0.272136] kmemleak: checksum = 0 > > > > [0.272136] kmemleak: backtrace: > > > > [0.272136] memblock_alloc_internal+0x6d/0xb0 > > > > [0.272136] memblock_alloc_try_nid+0x6c/0x8a > > > > [0.272136] kfence_alloc_pool+0x26/0x3f > > > > [0.272136] start_kernel+0x242/0x548 > > > > [0.272136] secondary_startup_64_no_verify+0xb0/0xbb > > > > > > > > I've tried the hack below but it didn't really helped. 
Obviously I > > > > don't > > > > really understand what's going on ;-) But I think the reason for this > > > > patch not working as (I) expected is because kfence is initialised > > > > *before* kmemleak. > > > > > > > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > > > > index 3b8ec938470a..b4ffd7695268 100644 > > > > --- a/mm/kfence/core.c > > > > +++ b/mm/kfence/core.c > > > > @@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void) > > > > > > > > if (!__kfence_pool) > > > > pr_err("failed to allocate pool\n"); > > > > + kmemleak_no_scan(__kfence_pool); > > > > } > > > > > > Can you try the below patch? > > > > > > Thanks, > > > -- Marco > > > > > > -- >8 -- > > > > > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > > > index f7106f28443d..5891019721f6 100644 > > > --- a/mm/kfence/core.c > > > +++ b/mm/kfence/core.c > > > @@ -12,6 +12,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -481,6 +482,13 @@ static bool __init kfence_init_pool(void) > > > addr += 2 * PAGE_SIZE; > > > } > > > > > > + /* > > > + * The pool is live and will never be deallocated from this point on; > > > + * tell kmemleak this is now free memory, so that later allocations can > > > + * correctly be tracked. > > > + */ > > > + kmemleak_free_part_phys(__pa(__kfence_pool), KFENCE_POOL_SIZE); > > > > I presume this pool does not refer any objects that are only tracked > > through pool pointers. > > No, at this point this memory should not have been touched by anything. > > > kmemleak_free() (or *_free_part) should work, no need for the _phys > > variant (which converts it back with __va). > > Will fix. > > > Since we normally use kmemleak_ignore() (or no_scan) for objects we > > don't care about, I'd exp
Re: [PATCH] virtiofs: fix memory leak in virtio_fs_probe()
Vivek Goyal writes: > On Tue, Mar 16, 2021 at 05:02:34PM +0000, Luis Henriques wrote: >> When accidentally passing twice the same tag to qemu, kmemleak ended up >> reporting a memory leak in virtiofs. Also, looking at the log I saw the >> following error (that's when I realised the duplicated tag): >> >> virtiofs: probe of virtio5 failed with error -17 >> >> Here's the kmemleak log for reference: >> >> unreferenced object 0x888103d47800 (size 1024): >> comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s) >> hex dump (first 32 bytes): >> 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .N.. >> ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff >> backtrace: >> [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs] >> [<f8aca419>] virtio_dev_probe+0x15f/0x210 >> [<4d6baf3c>] really_probe+0xea/0x430 >> [<a6ceeac8>] device_driver_attach+0xa8/0xb0 >> [<196f47a7>] __driver_attach+0x98/0x140 >> [<0b20601d>] bus_for_each_dev+0x7b/0xc0 >> [<399c7b7f>] bus_add_driver+0x11b/0x1f0 >> [<32b09ba7>] driver_register+0x8f/0xe0 >> [<cdd55998>] 0xa002c013 >> [<0ea196a2>] do_one_initcall+0x64/0x2e0 >> [<08f727ce>] do_init_module+0x5c/0x260 >> [<00003cdedab6>] __do_sys_finit_module+0xb5/0x120 >> [<ad2f48c6>] do_syscall_64+0x33/0x40 >> [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae >> >> Cc: sta...@vger.kernel.org >> Signed-off-by: Luis Henriques > > Hi Luis, > > Thanks for the report and the fix. So looks like leak is happening > because we are not doing kfree(fs->vqs) in error path. Yep! >> --- >> fs/fuse/virtio_fs.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c >> index 8868ac31a3c0..4e6ef9f24e84 100644 >> --- a/fs/fuse/virtio_fs.c >> +++ b/fs/fuse/virtio_fs.c >> @@ -899,7 +899,7 @@ static int virtio_fs_probe(struct virtio_device *vdev) >> >> out: >> vdev->priv = NULL; >> -kfree(fs); >> +virtio_fs_put(fs); > > [ CC virtio-fs list ] Oops, forgot to include it. 
Maybe it should be added to the MAINTAINERS file (although IIRC it's not an open list). > fs object is not fully formed. So calling virtio_fs_put() is little odd. > I will expect it to be called if somebody takes a reference using _get() > or in the final virtio_fs_remove() when creation reference should go > away. > > How about open coding it and free fs->vqs explicitly. Something like > as follows. Ok, I'll send v2 later (I'm currently away from my devel workstation). To be honest, my initial version was doing exactly what you're suggesting. I decided to change to virtio_fs_put() because the refcount was already initialised early in the function. Bad decision. Cheers, -- Luis > > @@ -896,7 +896,7 @@ static int virtio_fs_probe(struct virtio > out_vqs: > vdev->config->reset(vdev); > virtio_fs_cleanup_vqs(vdev, fs); > - > + kfree(fs->vqs); > out: > vdev->priv = NULL; > kfree(fs); > > Thanks > Vivek >
Re: Issue with kfence and kmemleak
On Tue, Mar 16, 2021 at 06:30:00PM +0100, Marco Elver wrote: > On Tue, Mar 16, 2021 at 04:42PM +0000, Luis Henriques wrote: > > Hi! > > > > This is probably a known issue, but just in case: looks like it's not > > possible to use kmemleak when kfence is enabled: > > Thanks for spotting this. > > > [0.272136] kmemleak: Cannot insert 0x888236e02f00 into the object > > search tree (overlaps existing) > > [0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92 > > [0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 > > [0.272136] Call Trace: > > [0.272136] dump_stack+0x6d/0x89 > > [0.272136] create_object.isra.0.cold+0x40/0x62 > > [0.272136] ? process_one_work+0x5a0/0x5a0 > > [0.272136] ? process_one_work+0x5a0/0x5a0 > > [0.272136] kmem_cache_alloc_trace+0x110/0x2f0 > > [0.272136] ? process_one_work+0x5a0/0x5a0 > > [0.272136] kthread+0x3f/0x150 > > [0.272136] ? lockdep_hardirqs_on_prepare+0xd4/0x170 > > [0.272136] ? __kthread_bind_mask+0x60/0x60 > > [0.272136] ret_from_fork+0x22/0x30 > > [0.272136] kmemleak: Kernel memory leak detector disabled > > [0.272136] kmemleak: Object 0x888236e0 (size 2097152): > > [0.272136] kmemleak: comm "swapper", pid 0, jiffies 4294892296 > > [0.272136] kmemleak: min_count = 0 > > [0.272136] kmemleak: count = 0 > > [0.272136] kmemleak: flags = 0x1 > > [0.272136] kmemleak: checksum = 0 > > [0.272136] kmemleak: backtrace: > > [0.272136] memblock_alloc_internal+0x6d/0xb0 > > [0.272136] memblock_alloc_try_nid+0x6c/0x8a > > [0.272136] kfence_alloc_pool+0x26/0x3f > > [0.272136] start_kernel+0x242/0x548 > > [0.272136] secondary_startup_64_no_verify+0xb0/0xbb > > > > I've tried the hack below but it didn't really helped. Obviously I don't > > really understand what's going on ;-) But I think the reason for this > > patch not working as (I) expected is because kfence is initialised > > *before* kmemleak. 
> > > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > > index 3b8ec938470a..b4ffd7695268 100644 > > --- a/mm/kfence/core.c > > +++ b/mm/kfence/core.c > > @@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void) > > > > if (!__kfence_pool) > > pr_err("failed to allocate pool\n"); > > + kmemleak_no_scan(__kfence_pool); > > } > > Can you try the below patch? Yep, that seems to fix the issue. Feel free to add my Tested-by. Thanks! Cheers, -- Luís > > Thanks, > -- Marco > > -- >8 -- > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > index f7106f28443d..5891019721f6 100644 > --- a/mm/kfence/core.c > +++ b/mm/kfence/core.c > @@ -12,6 +12,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -481,6 +482,13 @@ static bool __init kfence_init_pool(void) > addr += 2 * PAGE_SIZE; > } > > + /* > + * The pool is live and will never be deallocated from this point on; > + * tell kmemleak this is now free memory, so that later allocations can > + * correctly be tracked. > + */ > + kmemleak_free_part_phys(__pa(__kfence_pool), KFENCE_POOL_SIZE); > + > return true; > > err:
[PATCH] virtiofs: fix memory leak in virtio_fs_probe()
When accidentally passing twice the same tag to qemu, kmemleak ended up reporting a memory leak in virtiofs. Also, looking at the log I saw the following error (that's when I realised the duplicated tag): virtiofs: probe of virtio5 failed with error -17 Here's the kmemleak log for reference: unreferenced object 0x888103d47800 (size 1024): comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .N.. ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff backtrace: [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs] [<f8aca419>] virtio_dev_probe+0x15f/0x210 [<4d6baf3c>] really_probe+0xea/0x430 [<a6ceeac8>] device_driver_attach+0xa8/0xb0 [<196f47a7>] __driver_attach+0x98/0x140 [<0b20601d>] bus_for_each_dev+0x7b/0xc0 [<399c7b7f>] bus_add_driver+0x11b/0x1f0 [<32b09ba7>] driver_register+0x8f/0xe0 [<cdd55998>] 0xa002c013 [<0ea196a2>] do_one_initcall+0x64/0x2e0 [<08f727ce>] do_init_module+0x5c/0x260 [<3cdedab6>] __do_sys_finit_module+0xb5/0x120 [<ad2f48c6>] do_syscall_64+0x33/0x40 [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae Cc: sta...@vger.kernel.org Signed-off-by: Luis Henriques --- fs/fuse/virtio_fs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 8868ac31a3c0..4e6ef9f24e84 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -899,7 +899,7 @@ static int virtio_fs_probe(struct virtio_device *vdev) out: vdev->priv = NULL; - kfree(fs); + virtio_fs_put(fs); return ret; }
Issue with kfence and kmemleak
Hi! This is probably a known issue, but just in case: looks like it's not possible to use kmemleak when kfence is enabled: [0.272136] kmemleak: Cannot insert 0x888236e02f00 into the object search tree (overlaps existing) [0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92 [0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 [0.272136] Call Trace: [0.272136] dump_stack+0x6d/0x89 [0.272136] create_object.isra.0.cold+0x40/0x62 [0.272136] ? process_one_work+0x5a0/0x5a0 [0.272136] ? process_one_work+0x5a0/0x5a0 [0.272136] kmem_cache_alloc_trace+0x110/0x2f0 [0.272136] ? process_one_work+0x5a0/0x5a0 [0.272136] kthread+0x3f/0x150 [0.272136] ? lockdep_hardirqs_on_prepare+0xd4/0x170 [0.272136] ? __kthread_bind_mask+0x60/0x60 [0.272136] ret_from_fork+0x22/0x30 [0.272136] kmemleak: Kernel memory leak detector disabled [0.272136] kmemleak: Object 0x888236e0 (size 2097152): [0.272136] kmemleak: comm "swapper", pid 0, jiffies 4294892296 [0.272136] kmemleak: min_count = 0 [0.272136] kmemleak: count = 0 [0.272136] kmemleak: flags = 0x1 [0.272136] kmemleak: checksum = 0 [0.272136] kmemleak: backtrace: [0.272136] memblock_alloc_internal+0x6d/0xb0 [0.272136] memblock_alloc_try_nid+0x6c/0x8a [0.272136] kfence_alloc_pool+0x26/0x3f [0.272136] start_kernel+0x242/0x548 [0.272136] secondary_startup_64_no_verify+0xb0/0xbb I've tried the hack below but it didn't really help. Obviously I don't really understand what's going on ;-) But I think the reason this patch isn't working as (I) expected is that kfence is initialised *before* kmemleak. diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 3b8ec938470a..b4ffd7695268 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void) if (!__kfence_pool) pr_err("failed to allocate pool\n"); + kmemleak_no_scan(__kfence_pool); } Cheers, -- Luís
Re: fuse: kernel BUG at mm/truncate.c:763!
On Fri, Mar 12, 2021 at 01:11:23PM +, Matthew Wilcox wrote: > On Fri, Mar 12, 2021 at 12:21:59PM +0000, Luis Henriques wrote: > > > > I've seen a bug report (5.10.16 kernel splat below) that seems to be > > > > reproducible in kernels as early as 5.4. > > If this is reproducible, can you turn this BUG_ON into a VM_BUG_ON_PAGE() > so we know what kind of problem we're dealing with? Assuming the SUSE > tumbleweed kernels enable CONFIG_DEBUG_VM, which I'm sure they do. Just to make sure I got this right, you want to test something like this: } } - BUG_ON(page_mapped(page)); + VM_BUG_ON_PAGE(page_mapped(page), page); ret2 = do_launder_page(mapping, page); if (ret2 == 0) { if (!invalidate_complete_page2(mapping, page)) Cheers, -- Luís > > > > Page fault locks the page before installing a new pte, at least > > > AFAICS, so the BUG looks impossible. The referenced commits only > > > touch very high level control of writeback, so they may well increase > > > the chance of a bug triggering, but very unlikely to be the actual > > > cause of the bug. I'm guessing this to be an MM issue. > > > > Ok, thank you for having a look at it. > > > > Interestingly, there's a single commit to mm/truncate.c in 5.4: > > ef18a1ca847b ("mm/thp: allow dropping THP from page cache"). I'm Cc'ing > > Andrew and Kirill, maybe they have some ideas. > > That's probably not it; unless FUSE has developed the ability to insert > compound pages into the page cache without me noticing. > > (if it had, that would absolutely explain it -- i have a fix in my thp > tree for this case, but it doesn't affect any existing filesystem > because only shmem uses compound pages and it doesn't call > invalidate_inode_pages2_range)
Re: fuse: kernel BUG at mm/truncate.c:763!
On Fri, Mar 12, 2021 at 10:48:40AM +0100, Miklos Szeredi wrote: > On Fri, Mar 12, 2021 at 9:51 AM Luis Henriques wrote: > > > > Hi Miklos, > > > > I've seen a bug report (5.10.16 kernel splat below) that seems to be > > reproducible in kernels as early as 5.4. > > > > The commit that caught my attention when looking at what was merged in 5.4 > > was e4648309b85a ("fuse: truncate pending writes on O_TRUNC") but I didn't > > went too deeper on that -- I was wondering if you have seen something > > similar before. > > Don't remember seeing this. > > Excerpt from invalidate_inode_pages2_range(): > > lock_page(page); > [...] > if (page_mapped(page)) { > [...] > unmap_mapping_pages(mapping, index, > 1, false); > } > } > BUG_ON(page_mapped(page)); > > Page fault locks the page before installing a new pte, at least > AFAICS, so the BUG looks impossible. The referenced commits only > touch very high level control of writeback, so they may well increase > the chance of a bug triggering, but very unlikely to be the actual > cause of the bug. I'm guessing this to be an MM issue. Ok, thank you for having a look at it. Interestingly, there's a single commit to mm/truncate.c in 5.4: ef18a1ca847b ("mm/thp: allow dropping THP from page cache"). I'm Cc'ing Andrew and Kirill, maybe they have some ideas. > Is this reproducible on vanilla, or just openSUSE kernels? Well, this is on a Tumbleweed kernel, which is pretty much the stable kernel with a few patches that AFAIK touch mostly drivers. But I'll see if I can get the reporter trying to reproduce on a vanilla kernel. Cheers, -- Luís > > Thanks, > Miklos > > > > > > > > > There's another splat in the bug report[1] for a 5.4.14 kernel (which may > > be for a different bug, but the traces don't look as reliable as the one > > bellow). > > > > [1] https://bugzilla.opensuse.org/show_bug.cgi?id=1182929 > > > > [97604.721590] kernel BUG at mm/truncate.c:763! 
> > [97604.721601] invalid opcode: [#1] SMP PTI > > [97604.721613] CPU: 18 PID: 1584438 Comm: g++ Tainted: P O > > 5.10.16-1-default #1 openSUSE Tumbleweed > > [97604.721618] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a > > 10/16/2019 > > [97604.721631] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0 > > [97604.721637] Code: 0f 48 f0 e9 19 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00 > > 48 89 ee e8 1a c5 02 00 4c 89 ff e8 02 1b 01 00 84 c0 0f 84 ca fe ff ff <0f> > > 0b 49 8b 57 18 49 39 d4 0f 85 e2 fe ff ff 49 f7 07 00 60 00 00 > > [97604.721645] RSP: 0018:a613aa54ba40 EFLAGS: 00010202 > > [97604.721651] RAX: 0001 RBX: 000a RCX: > > 0200 > > [97604.721656] RDX: 0090 RSI: 00a800010037 RDI: > > d880718e > > [97604.721660] RBP: 1400 R08: 1400 R09: > > 1a73 > > [97604.721664] R10: R11: 04a684da R12: > > 8a28d4549d78 > > [97604.721669] R13: R14: R15: > > d880718e > > [97604.721674] FS: 7f9cdd7fb740() GS:8a5c7f98() > > knlGS: > > [97604.721679] CS: 0010 DS: ES: CR0: 80050033 > > [97604.721683] CR2: 7f89d3d78d80 CR3: 004d8a14e005 CR4: > > 007706e0 > > [97604.721688] DR0: DR1: DR2: > > > > [97604.721692] DR3: DR6: fffe0ff0 DR7: > > 0400 > > 97604.721696] PKRU: 5554 > > [97604.721699] Call Trace: > > [97604.721719] ? request_wait_answer+0x11a/0x210 [fuse] > > [97604.721729] ? fuse_dentry_delete+0xb/0x20 [fuse] > > [97604.721740] fuse_finish_open+0x85/0x150 [fuse] > > [97604.721750] fuse_open_common+0x1a8/0x1b0 [fuse] > > [97604.721759] ? fuse_open_common+0x1b0/0x1b0 [fuse] > > [97604.721766] do_dentry_open+0x14e/0x380 > > [97604.721775] path_openat+0x600/0x10d0 > > [97604.721782] ? handle_mm_fault+0x103c/0x1a00 > > [97604.721791] ? follow_page_pte+0x314/0x5f0 > > [97604.721795] do_filp_open+0x88/0x130 > > [97604.721803] ? security_prepare_creds+0x6d/0x90 > > [97604.721808] ? 
__kmalloc+0x11d/0x2a0 > > [97604.721814] do_open_execat+0x6d/0x1a0 > > [97604.721819] bprm_execve+0x190/0x6b0 > > [97604.721825] do_execveat_common+0x192/0x1c0 > > [97604.721830] __x64_sys_execve+0x39/0x50 > > [97604.721836] do_s
fuse: kernel BUG at mm/truncate.c:763!
Hi Miklos, I've seen a bug report (5.10.16 kernel splat below) that seems to be reproducible in kernels as early as 5.4. The commit that caught my attention when looking at what was merged in 5.4 was e4648309b85a ("fuse: truncate pending writes on O_TRUNC") but I didn't dig too deep into that -- I was wondering if you have seen something similar before. There's another splat in the bug report[1] for a 5.4.14 kernel (which may be for a different bug, but the traces don't look as reliable as the one below). [1] https://bugzilla.opensuse.org/show_bug.cgi?id=1182929 [97604.721590] kernel BUG at mm/truncate.c:763! [97604.721601] invalid opcode: [#1] SMP PTI [97604.721613] CPU: 18 PID: 1584438 Comm: g++ Tainted: P O 5.10.16-1-default #1 openSUSE Tumbleweed [97604.721618] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a 10/16/2019 [97604.721631] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0 [97604.721637] Code: 0f 48 f0 e9 19 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00 48 89 ee e8 1a c5 02 00 4c 89 ff e8 02 1b 01 00 84 c0 0f 84 ca fe ff ff <0f> 0b 49 8b 57 18 49 39 d4 0f 85 e2 fe ff ff 49 f7 07 00 60 00 00 [97604.721645] RSP: 0018:a613aa54ba40 EFLAGS: 00010202 [97604.721651] RAX: 0001 RBX: 000a RCX: 0200 [97604.721656] RDX: 0090 RSI: 00a800010037 RDI: d880718e [97604.721660] RBP: 1400 R08: 1400 R09: 1a73 [97604.721664] R10: R11: 04a684da R12: 8a28d4549d78 [97604.721669] R13: R14: R15: d880718e [97604.721674] FS: 7f9cdd7fb740() GS:8a5c7f98() knlGS: [97604.721679] CS: 0010 DS: ES: CR0: 80050033 [97604.721683] CR2: 7f89d3d78d80 CR3: 004d8a14e005 CR4: 007706e0 [97604.721688] DR0: DR1: DR2: [97604.721692] DR3: DR6: fffe0ff0 DR7: 0400 [97604.721696] PKRU: 5554 [97604.721699] Call Trace: [97604.721719] ? request_wait_answer+0x11a/0x210 [fuse] [97604.721729] ? fuse_dentry_delete+0xb/0x20 [fuse] [97604.721740] fuse_finish_open+0x85/0x150 [fuse] [97604.721750] fuse_open_common+0x1a8/0x1b0 [fuse] [97604.721759] ? 
fuse_open_common+0x1b0/0x1b0 [fuse] [97604.721766] do_dentry_open+0x14e/0x380 [97604.721775] path_openat+0x600/0x10d0 [97604.721782] ? handle_mm_fault+0x103c/0x1a00 [97604.721791] ? follow_page_pte+0x314/0x5f0 [97604.721795] do_filp_open+0x88/0x130 [97604.721803] ? security_prepare_creds+0x6d/0x90 [97604.721808] ? __kmalloc+0x11d/0x2a0 [97604.721814] do_open_execat+0x6d/0x1a0 [97604.721819] bprm_execve+0x190/0x6b0 [97604.721825] do_execveat_common+0x192/0x1c0 [97604.721830] __x64_sys_execve+0x39/0x50 [97604.721836] do_syscall_64+0x33/0x80 [97604.721843] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [97604.721848] RIP: 0033:0x7f9cdcfe2c37 [97604.721853] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 08 12 30 00 f7 d8 64 89 02 [97604.721862] RSP: 002b:7ffe444f5758 EFLAGS: 0202 ORIG_RAX: 003b [97604.721867] RAX: ffda RBX: 7f9cdd7fb6a0 RCX: 7f9cdcfe2c37 [97604.721872] RDX: 020f5300 RSI: 020f3bf8 RDI: 020f36a0 [97604.721876] RBP: 0001 R08: R09: [97604.721880] R10: 7ffe444f4b60 R11: 0202 R12: [97604.721884] R13: 0001 R14: 020f36a0 R15: [97604.721890] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache libafs(PO) iscsi_ibft iscsi_boot_sysfs rfkill vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dmi_sysfs intel_rapl_msr intel_rapl_common isst_if_common joydev ipmi_ssif i40iw ib_uverbs iTCO_wdt intel_pmc_bxt ib_core hid_generic iTCO_vendor_support skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel acpi_ipmi usbhid kvm i40e ipmi_si ioatdma mei_me i2c_i801 irqbypass ipmi_devintf mei i2c_smbus lpc_ich dca efi_pstore pcspkr ipmi_msghandler tiny_power_button acpi_pad button nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl lockd auth_rpcgss grace sunrpc fuse configfs nfs_ssc ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper xhci_pci ttm 
xhci_pci_renesas xhci_hcd crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel drm glue_helper crypto_simd cryptd usbcore wmi sg br_netfilter bridge stp llc [97604.721991] dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs [97604.722031] ---[ end trace edcabaccd35272e2 ]--- [97604.727773] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0 Cheers, -- Luís
Re: [RFC PATCH] fuse: Clear SGID bit when setting mode in setacl
On Mon, Mar 01, 2021 at 11:33:24AM -0500, Vivek Goyal wrote: > On Fri, Feb 26, 2021 at 06:33:57PM +0000, Luis Henriques wrote: > > Setting file permissions with POSIX ACLs (setxattr) isn't clearing the > > setgid bit. This seems to be CVE-2016-7097, detected by running fstest > > generic/375 in virtiofs. Unfortunately, when the fix for this CVE landed > > in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when > > setting file permissions"), FUSE didn't had ACLs support yet. > > Hi Luis, > > Interesting. I did not know that "chmod" can lead to clearing of SGID > as well. Recently we implemented FUSE_HANDLE_KILLPRIV_V2 flag which > means that file server is responsible for clearing of SUID/SGID/caps > as per following rules. > > - caps are always cleared on chown/write/truncate > - suid is always cleared on chown, while for truncate/write it is cleared > only if caller does not have CAP_FSETID. > - sgid is always cleared on chown, while for truncate/write it is cleared > only if caller does not have CAP_FSETID as well as file has group > execute > permission. > > And we don't have anything about "chmod" in this list. Well, I will test > this and come back to this little later. > > I see following comment in fuse_set_acl(). > > /* > * Fuse userspace is responsible for updating access > * permissions in the inode, if needed. fuse_setxattr > * invalidates the inode attributes, which will force > * them to be refreshed the next time they are used, > * and it also updates i_ctime. > */ > > So looks like that original code has been written with intent that > file server is responsible for updating inode permissions. I am > assuming this will include clearing of S_ISGID if needed. > > But question is, does file server has enough information to be able > to handle proper clearing of S_ISGID info. IIUC, file server will need > two pieces of information atleast. > > - gid of the caller. > - Whether caller has CAP_FSETID or not. 
> > I think we have first piece of information but not the second one. May > be we need to send this in fuse_setxattr_in->flags. And file server > can drop CAP_FSETID while doing setxattr(). > > What about "gid" info. We don't change to caller's uid/gid while doing > setxattr(). So host might not clear S_ISGID or clear it when it should > not. I am wondering that can we switch to caller's uid/gid in setxattr(), > atleast while setting acls. Thanks for looking into this. To be honest, initially I thought that the fix should be done in the server too, but when I looked into the code I couldn't find an easy way to get that done (without modifying the data being passed from the kernel in setxattr). So, what I did was look at what other filesystems were doing in the ACL code, and that's where I found out about this CVE. The CVE fix for the other filesystems looked easy enough to be included in FUSE too. Cheers, -- Luís
[RFC PATCH] fuse: Clear SGID bit when setting mode in setacl
Setting file permissions with POSIX ACLs (setxattr) isn't clearing the setgid bit. This seems to be CVE-2016-7097, detected by running fstest generic/375 in virtiofs. Unfortunately, when the fix for this CVE landed in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when setting file permissions"), FUSE didn't have ACL support yet. Signed-off-by: Luis Henriques --- fs/fuse/acl.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index f529075a2ce8..1b273277c1c9 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -54,7 +54,9 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) { struct fuse_conn *fc = get_fuse_conn(inode); const char *name; + umode_t mode = inode->i_mode; int ret; + bool update_mode = false; if (fuse_is_bad(inode)) return -EIO; @@ -62,11 +64,18 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) if (!fc->posix_acl || fc->no_setxattr) return -EOPNOTSUPP; - if (type == ACL_TYPE_ACCESS) + if (type == ACL_TYPE_ACCESS) { name = XATTR_NAME_POSIX_ACL_ACCESS; - else if (type == ACL_TYPE_DEFAULT) + if (acl) { + ret = posix_acl_update_mode(inode, &mode, &acl); + if (ret) + return ret; + if (inode->i_mode != mode) + update_mode = true; + } + } else if (type == ACL_TYPE_DEFAULT) { name = XATTR_NAME_POSIX_ACL_DEFAULT; - else + } else return -EINVAL; if (acl) { @@ -98,6 +107,20 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) } else { ret = fuse_removexattr(inode, name); } + if (!ret && update_mode) { + struct dentry *entry; + struct iattr attr; + + entry = d_find_alias(inode); + if (entry) { + memset(&attr, 0, sizeof(attr)); + attr.ia_valid = ATTR_MODE | ATTR_CTIME; + attr.ia_mode = mode; + attr.ia_ctime = current_time(inode); + ret = fuse_do_setattr(entry, &attr, NULL); + dput(entry); + } + } forget_all_cached_acls(inode); fuse_invalidate_attr(inode);
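For readers unfamiliar with the rule behind CVE-2016-7097, the setgid handling that the patch above relies on can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions, not kernel code: the function name acl_chmod_mode() is hypothetical, and the two booleans stand in for the kernel's in_group_p() and CAP_FSETID checks in posix_acl_update_mode().

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical userspace model of the setgid rule applied when an
 * access ACL changes a file's mode: the setgid bit survives only if
 * the caller is a member of the file's group or holds CAP_FSETID.
 *
 * new_mode:           mode derived from the ACL being set
 * caller_in_group:    stand-in for in_group_p(inode->i_gid)
 * caller_has_fsetid:  stand-in for the CAP_FSETID capability check
 */
mode_t acl_chmod_mode(mode_t new_mode, bool caller_in_group,
		      bool caller_has_fsetid)
{
	if (!caller_in_group && !caller_has_fsetid)
		new_mode &= ~(mode_t)S_ISGID;
	return new_mode;
}
```

With this model, a caller outside the file's group and without CAP_FSETID who sets an ACL on a 02750 file ends up with 0750, which is the behaviour generic/375 checks for.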
Re: [PATCH] copy_file_range.2: Kernel v5.12 updates
On Wed, Feb 24, 2021 at 06:10:45PM +0200, Amir Goldstein wrote: > On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques wrote: > > > > Update man-page with recent changes to this syscall. > > > > Signed-off-by: Luis Henriques > > --- > > Hi! > > > > Here's a suggestion for fixing the manpage for copy_file_range(). Note that > > I've assumed the fix will hit 5.12. > > > > man2/copy_file_range.2 | 10 +- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2 > > index 611a39b8026b..b0fd85e2631e 100644 > > --- a/man2/copy_file_range.2 > > +++ b/man2/copy_file_range.2 > > @@ -169,6 +169,9 @@ Out of memory. > > .B ENOSPC > > There is not enough space on the target filesystem to complete the copy. > > .TP > > +.B EOPNOTSUPP > > +The filesystem does not support this operation. > > +.TP > > .B EOVERFLOW > > The requested source or destination range is too large to represent in the > > specified data types. > > @@ -187,7 +190,7 @@ refers to an active swap file. > > .B EXDEV > > The files referred to by > > .IR fd_in " and " fd_out > > -are not on the same mounted filesystem (pre Linux 5.3). > > +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12). > > I think you need to drop the (Linux range) altogether. > What's missing here is the NFS cross server copy use case. > Maybe: > > ...are not on the same mounted filesystem and the source and target > filesystems > do not support cross-filesystem copy. > > You may refer the reader to VERSIONS section where it will say which > filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and > nfs). > > > .SH VERSIONS > > The > > .BR copy_file_range () > > @@ -202,6 +205,11 @@ Applications should target the behaviour and > > requirements of 5.3 kernels. > > .PP > > First support for cross-filesystem copies was introduced in Linux 5.3. > > Older kernels will return -EXDEV when cross-filesystem copies are > > attempted. 
> > +.PP > > +After Linux 5.12, support for copies between different filesystems was > > dropped. > > +However, individual filesystems may still provide > > +.BR copy_file_range () > > +implementations that allow copies across different devices. > > Again, this is not likely to stay uptodate for very long. > The stable kernels are expected to apply your patch (because it fixes > a regression) > so this should be phrased differently. > If it were me, I would provide all the details of the situation to > Michael and ask him > to write the best description for this section. Thanks, Amir. Yeah, it's tricky. Support was added and then dropped. Since stable kernels will be picking up this patch, maybe the best thing to do is to not mention the generic cross-filesystem support at all...? Or simply say that 5.3 temporarily supported it but that support was later dropped. Michael (or Alejandro), would you be OK handling this yourself as Amir suggested? Cheers, -- Luís
[PATCH] copy_file_range.2: Kernel v5.12 updates
Update man-page with recent changes to this syscall. Signed-off-by: Luis Henriques --- Hi! Here's a suggestion for fixing the manpage for copy_file_range(). Note that I've assumed the fix will hit 5.12. man2/copy_file_range.2 | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2 index 611a39b8026b..b0fd85e2631e 100644 --- a/man2/copy_file_range.2 +++ b/man2/copy_file_range.2 @@ -169,6 +169,9 @@ Out of memory. .B ENOSPC There is not enough space on the target filesystem to complete the copy. .TP +.B EOPNOTSUPP +The filesystem does not support this operation. +.TP .B EOVERFLOW The requested source or destination range is too large to represent in the specified data types. @@ -187,7 +190,7 @@ refers to an active swap file. .B EXDEV The files referred to by .IR fd_in " and " fd_out -are not on the same mounted filesystem (pre Linux 5.3). +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12). .SH VERSIONS The .BR copy_file_range () @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels. .PP First support for cross-filesystem copies was introduced in Linux 5.3. Older kernels will return -EXDEV when cross-filesystem copies are attempted. +.PP +After Linux 5.12, support for copies between different filesystems was dropped. +However, individual filesystems may still provide +.BR copy_file_range () +implementations that allow copies across different devices. .SH CONFORMING TO The .BR copy_file_range ()
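From the userspace side, code that has to cope with both behaviours discussed here (kernels that fail cross-filesystem copies with EXDEV or EOPNOTSUPP, and kernels that attempt them) might look roughly like the sketch below. It is an illustration only and not part of the patch: the helper name copy_range_or_fallback() is made up, and treating EINVAL and ENOSYS as reasons to fall back is an assumption about what a caller would want.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from fd_in to fd_out at
 * their current file offsets. Tries copy_file_range() first and falls
 * back to a single read()/write() round when the kernel or filesystem
 * refuses the operation. Returns the number of bytes copied, 0 at end
 * of input, or -1 with errno set.
 */
ssize_t copy_range_or_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	ssize_t n, done;

	n = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
	if (n >= 0)
		return n;

	/* Assumption: these errors mean "not supported here", so fall back. */
	if (errno != EXDEV && errno != EOPNOTSUPP &&
	    errno != EINVAL && errno != ENOSYS)
		return -1;

	n = read(fd_in, buf, len < sizeof(buf) ? len : sizeof(buf));
	if (n <= 0)
		return n;

	/* Write out everything that was read, retrying short writes. */
	for (done = 0; done < n; ) {
		ssize_t w = write(fd_out, buf + done, (size_t)(n - done));

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		done += w;
	}
	return n;
}
```

A caller would invoke this in a loop until it returns 0 or -1. Note that on kernels affected by the regression, a return of 0 from copy_file_range() is ambiguous for files whose content is generated on the fly (such as tracefs), which is the case that prompted the fix.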
Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies
On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote: > On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques wrote: > > > > A regression has been reported by Nicolas Boichat, found while using the > > copy_file_range syscall to copy a tracefs file. Before commit > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the > > kernel would return -EXDEV to userspace when trying to copy a file across > > different filesystems. After this commit, the syscall doesn't fail anymore > > and instead returns zero (zero bytes copied), as this file's content is > > generated on-the-fly and thus reports a size of zero. > > > > This patch restores some cross-filesystem copy restrictions that existed > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across > > devices"). Filesystems are still allowed to fall-back to the VFS > > generic_copy_file_range() implementation, but that has now to be done > > explicitly. > > > > nfsd is also modified to fall-back into generic_copy_file_range() in case > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") > > Link: > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ > > Link: > > https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/ > > Link: > > https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ > > Reported-by: Nicolas Boichat > > Signed-off-by: Luis Henriques > > I tested v8 and I believe it works for NFS. Thanks a lot for the testing. And to everyone else for reviews, feedback,... and patience. I'll now go look into the manpage and see what needs to be changed. Cheers, -- Luís
Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies
On Tue, Feb 23, 2021 at 08:57:38AM -0800, dai@oracle.com wrote: > > On 2/23/21 8:47 AM, Amir Goldstein wrote: > > On Tue, Feb 23, 2021 at 6:02 PM wrote: > > > > > > On 2/23/21 7:29 AM, dai@oracle.com wrote: > > > > On 2/23/21 2:32 AM, Luis Henriques wrote: > > > > > On Mon, Feb 22, 2021 at 08:25:27AM -0800, dai....@oracle.com wrote: > > > > > > On 2/22/21 2:24 AM, Luis Henriques wrote: > > > > > > > A regression has been reported by Nicolas Boichat, found while > > > > > > > using the > > > > > > > copy_file_range syscall to copy a tracefs file. Before commit > > > > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across > > > > > > > devices") the > > > > > > > kernel would return -EXDEV to userspace when trying to copy a file > > > > > > > across > > > > > > > different filesystems. After this commit, the syscall doesn't > > > > > > > fail > > > > > > > anymore > > > > > > > and instead returns zero (zero bytes copied), as this file's > > > > > > > content is > > > > > > > generated on-the-fly and thus reports a size of zero. > > > > > > > > > > > > > > This patch restores some cross-filesystem copy restrictions that > > > > > > > existed > > > > > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy > > > > > > > across > > > > > > > devices"). Filesystems are still allowed to fall-back to the VFS > > > > > > > generic_copy_file_range() implementation, but that has now to be > > > > > > > done > > > > > > > explicitly. > > > > > > > > > > > > > > nfsd is also modified to fall-back into generic_copy_file_range() > > > > > > > in case > > > > > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. 
> > > > > > > > > > > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across > > > > > > > devices") > > > > > > > Link: > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$ > > > > > > > Link: > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$ > > > > > > > Link: > > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$ > > > > > > > Reported-by: Nicolas Boichat > > > > > > > Signed-off-by: Luis Henriques > > > > > > > --- > > > > > > > Changes since v7 > > > > > > > - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so > > > > > > > that the > > > > > > > error returned is always related to the 'copy' operation > > > > > > > Changes since v6 > > > > > > > - restored i_sb checks for the clone operation > > > > > > > Changes since v5 > > > > > > > - check if ->copy_file_range is NULL before calling it > > > > > > > Changes since v4 > > > > > > > - nfsd falls-back to generic_copy_file_range() only *if* it gets > > > > > > > -EOPNOTSUPP > > > > > > > or -EXDEV. > > > > > > > Changes since v3 > > > > > > > - dropped the COPY_FILE_SPLICE flag > > > > > > > - kept the f_op's checks early in generic_copy_file_checks, > > > > > > > implementing > > > > > > > Amir's suggestions > > > > > > > - modified nfsd to use generic_copy_file_range() > > > > > > > Changes since v2 > > > > > > > - do all the required checks earlier, in > > > > > > > generic_copy_file_checks(), > > > > > > >
Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies
On Mon, Feb 22, 2021 at 08:25:27AM -0800, dai@oracle.com wrote: > > On 2/22/21 2:24 AM, Luis Henriques wrote: > > A regression has been reported by Nicolas Boichat, found while using the > > copy_file_range syscall to copy a tracefs file. Before commit > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the > > kernel would return -EXDEV to userspace when trying to copy a file across > > different filesystems. After this commit, the syscall doesn't fail anymore > > and instead returns zero (zero bytes copied), as this file's content is > > generated on-the-fly and thus reports a size of zero. > > > > This patch restores some cross-filesystem copy restrictions that existed > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across > > devices"). Filesystems are still allowed to fall-back to the VFS > > generic_copy_file_range() implementation, but that has now to be done > > explicitly. > > > > nfsd is also modified to fall-back into generic_copy_file_range() in case > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. 
> > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") > > Link: > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$ > > Link: > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$ > > Link: > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$ > > Reported-by: Nicolas Boichat > > Signed-off-by: Luis Henriques > > --- > > Changes since v7 > > - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the > >error returned is always related to the 'copy' operation > > Changes since v6 > > - restored i_sb checks for the clone operation > > Changes since v5 > > - check if ->copy_file_range is NULL before calling it > > Changes since v4 > > - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP > >or -EXDEV. 
> > Changes since v3 > > - dropped the COPY_FILE_SPLICE flag > > - kept the f_op's checks early in generic_copy_file_checks, implementing > >Amir's suggestions > > - modified nfsd to use generic_copy_file_range() > > Changes since v2 > > - do all the required checks earlier, in generic_copy_file_checks(), > >adding new checks for ->remap_file_range > > - new COPY_FILE_SPLICE flag > > - don't remove filesystem's fallback to generic_copy_file_range() > > - updated commit changelog (and subject) > > Changes since v1 (after Amir review) > > - restored do_copy_file_range() helper > > - return -EOPNOTSUPP if fs doesn't implement CFR > > - updated commit description > > > > fs/nfsd/vfs.c | 8 +++- > > fs/read_write.c | 49 - > > 2 files changed, 31 insertions(+), 26 deletions(-) > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > > index 04937e51de56..23dab0fa9087 100644 > > --- a/fs/nfsd/vfs.c > > +++ b/fs/nfsd/vfs.c > > @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, > > u64 src_pos, > > ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file > > *dst, > > u64 dst_pos, u64 count) > > { > > + ssize_t ret; > > /* > > * Limit copy to 4MB to prevent indefinitely blocking an nfsd > > @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 > > src_pos, struct file *dst, > > * limit like this and pipeline multiple COPY requests. > > */ > > count = min_t(u64, count, 1 << 22); > > - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); > > + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); > > + > > + if (ret == -EOPNOTSUPP || ret == -EXDEV) > > + ret = generic_copy_file_range(src, src_pos, dst, dst_pos, > > + count, 0); > > + return ret; > > } > > __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp, > > diff --git a/fs/read_write.c b/fs/read_write.c > > index 75f76
[PATCH v8] vfs: fix copy_file_range regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file. Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems. After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices"). Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat
Signed-off-by: Luis Henriques
---
Changes since v7
- set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
  error returned is always related to the 'copy' operation
Changes since v6
- restored i_sb checks for the clone operation
Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
  or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
  Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

 fs/nfsd/vfs.c   |  8 +++-
 fs/read_write.c | 49 -
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
 ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 			     u64 dst_pos, u64 count)
 {
+	ssize_t ret;
 
 	/*
 	 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 	 * limit like this and pipeline multiple COPY requests.
 	 */
 	count = min_t(u64, count, 1 << 22);
-	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+	ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
+		ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+					      count, 0);
+	return ret;
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..5a26297fd410 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
-				  struct file *file_out, loff_t pos_out,
-				  size_t len, unsigned int flags)
-{
-	/*
-	 * Although we now allow filesystems to handle cross sb copy, passing
-	 * a file of the wrong filesystem type to filesystem driver can result
-	 * in an attempt to dereference the wrong type of ->private_data, so
-	 * avoid doing that until we really have a good reason. NFS defines
-	 * several different file_system_type structures, but they all end up
-	 * using the same ->copy_file_range() function pointer.
-	 */
-	if (file_out->f_op->copy_file_range &&
-	    file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
-		return file_out->f_op->copy_file_range(file_in, pos_in,
-						       file_out, pos_out,
-
[PATCH v7] vfs: fix copy_file_range regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the copy_file_range syscall to copy a tracefs file. Before commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the kernel would return -EXDEV to userspace when trying to copy a file across different filesystems. After this commit, the syscall doesn't fail anymore and instead returns zero (zero bytes copied), as this file's content is generated on-the-fly and thus reports a size of zero. This patch restores some cross-filesystem copy restrictions that existed prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices"). Filesystems are still allowed to fall-back to the VFS generic_copy_file_range() implementation, but that has now to be done explicitly. nfsd is also modified to fall-back into generic_copy_file_range() in case vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/ Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ Reported-by: Nicolas Boichat Signed-off-by: Luis Henriques --- Changes since v6 - restored i_sb checks for the clone operation Changes since v5 - check if ->copy_file_range is NULL before calling it Changes since v4 - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP or -EXDEV. 
Changes since v3 - dropped the COPY_FILE_SPLICE flag - kept the f_op's checks early in generic_copy_file_checks, implementing Amir's suggestions - modified nfsd to use generic_copy_file_range() Changes since v2 - do all the required checks earlier, in generic_copy_file_checks(), adding new checks for ->remap_file_range - new COPY_FILE_SPLICE flag - don't remove filesystem's fallback to generic_copy_file_range() - updated commit changelog (and subject) Changes since v1 (after Amir review) - restored do_copy_file_range() helper - return -EOPNOTSUPP if fs doesn't implement CFR - updated commit description fs/nfsd/vfs.c | 8 +++- fs/read_write.c | 50 - 2 files changed, 32 insertions(+), 26 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 04937e51de56..23dab0fa9087 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos, ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, u64 dst_pos, u64 count) { + ssize_t ret; /* * Limit copy to 4MB to prevent indefinitely blocking an nfsd @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, * limit like this and pipeline multiple COPY requests. 
*/ count = min_t(u64, count, 1 << 22); - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + + if (ret == -EOPNOTSUPP || ret == -EXDEV) + ret = generic_copy_file_range(src, src_pos, dst, dst_pos, + count, 0); + return ret; } __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp, diff --git a/fs/read_write.c b/fs/read_write.c index 75f764b43418..463345c0ee30 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, } EXPORT_SYMBOL(generic_copy_file_range); -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - size_t len, unsigned int flags) -{ - /* -* Although we now allow filesystems to handle cross sb copy, passing -* a file of the wrong filesystem type to filesystem driver can result -* in an attempt to dereference the wrong type of ->private_data, so -* avoid doing that until we really have a good reason. NFS defines -* several different file_system_type structures, but they all end up -* using the same ->copy_file_range() function pointer. -*/ - if (file_out->f_op->copy_file_range && - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) - return file_out->f_op->copy_file_range(file_in, pos_in, - file_out, pos_out, - len, flags); - - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); -} - /
[PATCH v6] vfs: fix copy_file_range regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the copy_file_range syscall to copy a tracefs file. Before commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the kernel would return -EXDEV to userspace when trying to copy a file across different filesystems. After this commit, the syscall doesn't fail anymore and instead returns zero (zero bytes copied), as this file's content is generated on-the-fly and thus reports a size of zero. This patch restores some cross-filesystem copy restrictions that existed prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices"). Filesystems are still allowed to fall-back to the VFS generic_copy_file_range() implementation, but that has now to be done explicitly. nfsd is also modified to fall-back into generic_copy_file_range() in case vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/ Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ Reported-by: Nicolas Boichat Signed-off-by: Luis Henriques --- And v6 is upon us. Behold! Changes since v5 - check if ->copy_file_range is NULL before calling it Changes since v4 - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP or -EXDEV. 
Changes since v3 - dropped the COPY_FILE_SPLICE flag - kept the f_op's checks early in generic_copy_file_checks, implementing Amir's suggestions - modified nfsd to use generic_copy_file_range() Changes since v2 - do all the required checks earlier, in generic_copy_file_checks(), adding new checks for ->remap_file_range - new COPY_FILE_SPLICE flag - don't remove filesystem's fallback to generic_copy_file_range() - updated commit changelog (and subject) Changes since v1 (after Amir review) - restored do_copy_file_range() helper - return -EOPNOTSUPP if fs doesn't implement CFR - updated commit description fs/nfsd/vfs.c | 8 +++- fs/read_write.c | 53 - 2 files changed, 33 insertions(+), 28 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 04937e51de56..23dab0fa9087 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos, ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, u64 dst_pos, u64 count) { + ssize_t ret; /* * Limit copy to 4MB to prevent indefinitely blocking an nfsd @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, * limit like this and pipeline multiple COPY requests. 
*/ count = min_t(u64, count, 1 << 22); - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + + if (ret == -EOPNOTSUPP || ret == -EXDEV) + ret = generic_copy_file_range(src, src_pos, dst, dst_pos, + count, 0); + return ret; } __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp, diff --git a/fs/read_write.c b/fs/read_write.c index 75f764b43418..0348aaa9e237 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, } EXPORT_SYMBOL(generic_copy_file_range); -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - size_t len, unsigned int flags) -{ - /* -* Although we now allow filesystems to handle cross sb copy, passing -* a file of the wrong filesystem type to filesystem driver can result -* in an attempt to dereference the wrong type of ->private_data, so -* avoid doing that until we really have a good reason. NFS defines -* several different file_system_type structures, but they all end up -* using the same ->copy_file_range() function pointer. -*/ - if (file_out->f_op->copy_file_range && - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) - return file_out->f_op->copy_file_range(file_in, pos_in, - file_out, pos_out, - len, flags); - - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); -} - /* * Performs
Re: [PATCH v5] vfs: fix copy_file_range regression in cross-fs copies
Amir Goldstein writes:

> On Thu, Feb 18, 2021 at 5:16 PM Luis Henriques wrote:
>>
>> A regression has been reported by Nicolas Boichat, found while using the
>> copy_file_range syscall to copy a tracefs file. Before commit
>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>> kernel would return -EXDEV to userspace when trying to copy a file across
>> different filesystems. After this commit, the syscall doesn't fail anymore
>> and instead returns zero (zero bytes copied), as this file's content is
>> generated on-the-fly and thus reports a size of zero.
>>
>> This patch restores some cross-filesystem copy restrictions that existed
>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> devices"). Filesystems are still allowed to fall-back to the VFS
>> generic_copy_file_range() implementation, but that has now to be done
>> explicitly.
>>
>> nfsd is also modified to fall-back into generic_copy_file_range() in case
>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>
>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
>> Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
>> Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
>> Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
>> Reported-by: Nicolas Boichat
>> Signed-off-by: Luis Henriques
>> ---
>> And v5! Sorry. Sure, it makes sense to go through all the vfs_cfr()
>> checks first.
>
> You missed my other comment on v4...
>
> not checking NULL copy_file_range case.

Ah, yeah, I did miss it. I'll follow up with yet another revision.

Cheers,
--
Luis
[PATCH v5] vfs: fix copy_file_range regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the copy_file_range syscall to copy a tracefs file. Before commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the kernel would return -EXDEV to userspace when trying to copy a file across different filesystems. After this commit, the syscall doesn't fail anymore and instead returns zero (zero bytes copied), as this file's content is generated on-the-fly and thus reports a size of zero. This patch restores some cross-filesystem copy restrictions that existed prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices"). Filesystems are still allowed to fall-back to the VFS generic_copy_file_range() implementation, but that has now to be done explicitly. nfsd is also modified to fall-back into generic_copy_file_range() in case vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV. Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/ Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ Reported-by: Nicolas Boichat Signed-off-by: Luis Henriques --- And v5! Sorry. Sure, it makes sense to go through the all the vfs_cfr() checks first. Again, here's my request for testing. Changes since v4 - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP or -EXDEV. 
Changes since v3 - dropped the COPY_FILE_SPLICE flag - kept the f_op's checks early in generic_copy_file_checks, implementing Amir's suggestions - modified nfsd to use generic_copy_file_range() Changes since v2 - do all the required checks earlier, in generic_copy_file_checks(), adding new checks for ->remap_file_range - new COPY_FILE_SPLICE flag - don't remove filesystem's fallback to generic_copy_file_range() - updated commit changelog (and subject) Changes since v1 (after Amir review) - restored do_copy_file_range() helper - return -EOPNOTSUPP if fs doesn't implement CFR - updated commit description fs/nfsd/vfs.c | 8 +++- fs/read_write.c | 50 +++-- 2 files changed, 30 insertions(+), 28 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 04937e51de56..23dab0fa9087 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos, ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, u64 dst_pos, u64 count) { + ssize_t ret; /* * Limit copy to 4MB to prevent indefinitely blocking an nfsd @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, * limit like this and pipeline multiple COPY requests. 
*/ count = min_t(u64, count, 1 << 22); - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + + if (ret == -EOPNOTSUPP || ret == -EXDEV) + ret = generic_copy_file_range(src, src_pos, dst, dst_pos, + count, 0); + return ret; } __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp, diff --git a/fs/read_write.c b/fs/read_write.c index 75f764b43418..214d44f7cbfa 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, } EXPORT_SYMBOL(generic_copy_file_range); -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - size_t len, unsigned int flags) -{ - /* -* Although we now allow filesystems to handle cross sb copy, passing -* a file of the wrong filesystem type to filesystem driver can result -* in an attempt to dereference the wrong type of ->private_data, so -* avoid doing that until we really have a good reason. NFS defines -* several different file_system_type structures, but they all end up -* using the same ->copy_file_range() function pointer. -*/ - if (file_out->f_op->copy_file_range && - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) - return file_out->f_op->copy_file_range(file_in, pos_in, - file_out, pos_out, - len, flags); - - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); -} - /* * Perform
[PATCH v4] vfs: fix copy_file_range regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the copy_file_range syscall to copy a tracefs file. Before commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the kernel would return -EXDEV to userspace when trying to copy a file across different filesystems. After this commit, the syscall doesn't fail anymore and instead returns zero (zero bytes copied), as this file's content is generated on-the-fly and thus reports a size of zero. This patch restores some cross-filesystem copy restrictions that existed prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices"). Filesystems are still allowed to fall-back to the VFS generic_copy_file_range() implementation, but that has now to be done explicitly. nfsd is also modified to use generic_copy_file_range() instead of vfs_copy_file_range() so that it can still fall-back to splice without going through all the checks. Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/ Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ Reported-by: Nicolas Boichat Signed-off-by: Luis Henriques --- And here's v4. I'd like to request help for testing. I know Nicolas is doing that (thanks! and thanks for the reviews). But it would be great to get at least the nfs code tested. Olga, can you help here? 
Changes since v3 - dropped the COPY_FILE_SPLICE flag - kept the f_op's checks early in generic_copy_file_checks, implementing Amir's suggestions - modified nfsd to use generic_copy_file_range() Changes since v2 - do all the required checks earlier, in generic_copy_file_checks(), adding new checks for ->remap_file_range - new COPY_FILE_SPLICE flag - don't remove filesystem's fallback to generic_copy_file_range() - updated commit changelog (and subject) Changes since v1 (after Amir review) - restored do_copy_file_range() helper - return -EOPNOTSUPP if fs doesn't implement CFR - updated commit description fs/nfsd/vfs.c | 2 +- fs/read_write.c | 50 +++-- 2 files changed, 24 insertions(+), 28 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 04937e51de56..49dd28ee2602 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -578,7 +578,7 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, * limit like this and pipeline multiple COPY requests. */ count = min_t(u64, count, 1 << 22); - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + return generic_copy_file_range(src, src_pos, dst, dst_pos, count, 0); } __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp, diff --git a/fs/read_write.c b/fs/read_write.c index 75f764b43418..214d44f7cbfa 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, } EXPORT_SYMBOL(generic_copy_file_range); -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - size_t len, unsigned int flags) -{ - /* -* Although we now allow filesystems to handle cross sb copy, passing -* a file of the wrong filesystem type to filesystem driver can result -* in an attempt to dereference the wrong type of ->private_data, so -* avoid doing that until we really have a good reason. 
NFS defines -* several different file_system_type structures, but they all end up -* using the same ->copy_file_range() function pointer. -*/ - if (file_out->f_op->copy_file_range && - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) - return file_out->f_op->copy_file_range(file_in, pos_in, - file_out, pos_out, - len, flags); - - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); -} - /* * Performs necessary checks before doing a file copy * @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in, loff_t size_in; int ret; + /* +* Although we now allow filesystems to handle cross sb copy, passing +* a file of the wrong filesystem type to filesystem driver can result +* in an attempt to dereference the wrong type of ->private_data, so +* avoid doing that until we really have a good reason. NFS defin
Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices
Luis Henriques writes:

> Amir Goldstein writes:
>
>> On Thu, Feb 18, 2021 at 9:42 AM Christoph Hellwig wrote:
>>>
>>> Looks good:
>>>
>>> Reviewed-by: Christoph Hellwig
>>>
>>> This whole idea of cross-device copies has always been a horrible idea,
>>> and I've been arguing against it since the patches were posted.
>>
>> Ok. I'm good with this v2 as well, but need to add the fallback to
>> do_splice_direct() in nfsd_copy_file_range(), because this patch
>> breaks it.
>>
>> And the commit message of v3 is better in describing the reported issue.
>
> Except that, as I said in a previous email, v2 doesn't really fix the
> issue: all the checks need to be done earlier in generic_copy_file_checks().
>
> I'll work on getting v4, based on v2 but moving the checks and
> implementing your review suggestions to v3 (plus this nfs change).

There's something else:

The filesystems (nfs, ceph, cifs, fuse) rely on the fallback to
generic_copy_file_range() if something's wrong.  And this "something's
wrong" is fs specific.  For example: in ceph it is possible to offload the
file copy to the OSDs even if the files are in different filesystems, as
long as these filesystems are on the *same* ceph cluster.  If the copy
being done is across two different clusters, then the copy reverts to
splice.  This means that the boilerplate code being removed in v2 of this
patch needs to be restored and replaced by:

	ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
				     len, flags);

	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = do_splice_direct(src_file, &src_off, dst_file, &dst_off,
				       len > MAX_RW_COUNT ? MAX_RW_COUNT : len,
				       flags);
	return ret;

A quick look at the other filesystems' code indicates similar patterns.

Since at this point we've gone through all the syscall checks already,
calling do_splice_direct() shouldn't be a huge change.  But I may be
missing something.  Again.  Which is quite likely :-)

Cheers,
-- 
Luis
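For readers following along, the per-filesystem fallback described in this message can be modelled in plain userspace C. This is only a sketch under stated assumptions: `copy_fn`, `copy_with_fallback()` and the stand-in functions are hypothetical names invented for the illustration; none of them exist in the kernel, and the real code operates on `struct file` and offsets rather than a bare length.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical signature shared by the "offload" and "generic" copy paths. */
typedef ssize_t (*copy_fn)(size_t len);

/*
 * Try the filesystem-specific copy first.  Only when it reports that it
 * cannot handle this pair of files (-EOPNOTSUPP or -EXDEV) do we retry
 * with the generic path.  Any other error, or a byte count, is returned
 * unchanged.
 */
static ssize_t copy_with_fallback(copy_fn fs_copy, copy_fn generic_copy,
				  size_t len)
{
	ssize_t ret = fs_copy(len);

	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = generic_copy(len);
	return ret;
}

/* Stand-ins used only for this example. */
static ssize_t offload_ok(size_t len)   { return (ssize_t)len; }
static ssize_t offload_xdev(size_t len) { (void)len; return -EXDEV; }
static ssize_t offload_eio(size_t len)  { (void)len; return -EIO; }
/* Pretend the generic path manages a short copy of half the request. */
static ssize_t generic_half(size_t len) { return (ssize_t)(len / 2); }
```

The point of the shape is that "something's wrong" stays fs specific: each filesystem decides which of its own failures map to -EOPNOTSUPP or -EXDEV, and everything else (here -EIO) is reported to the caller rather than silently retried.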
Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices
Amir Goldstein writes:

> On Thu, Feb 18, 2021 at 9:42 AM Christoph Hellwig wrote:
>>
>> Looks good:
>>
>> Reviewed-by: Christoph Hellwig
>>
>> This whole idea of cross-device copies has always been a horrible idea,
>> and I've been arguing against it since the patches were posted.
>
> Ok. I'm good with this v2 as well, but need to add the fallback to
> do_splice_direct() in nfsd_copy_file_range(), because this patch
> breaks it.
>
> And the commit message of v3 is better in describing the reported issue.

Except that, as I said in a previous email, v2 doesn't really fix the
issue: all the checks need to be done earlier in generic_copy_file_checks().

I'll work on getting v4, based on v2 but moving the checks and
implementing your review suggestions to v3 (plus this nfs change).

Cheers,
-- 
Luis
[PATCH v3] vfs: fix copy_file_range regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystems copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  It also introduces a flag (COPY_FILE_SPLICE) that can be used
by filesystems calling directly into the vfs copy_file_range to override
these restrictions.  Right now, only NFS needs to set this flag.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat
Signed-off-by: Luis Henriques
---
Ok, I've tried to address all the issues and comments.  Hopefully this v3
is a bit closer to the final fix.
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

 fs/nfsd/vfs.c      |  3 ++-
 fs/read_write.c    | 44 +---
 include/linux/fs.h |  7 +++
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..14e55822c223 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -578,7 +578,8 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 	 * limit like this and pipeline multiple COPY requests.
 	 */
 	count = min_t(u64, count, 1 << 22);
-	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count,
+				   COPY_FILE_SPLICE);
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..40a16003fb05 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1410,6 +1410,33 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
 				       flags);
 }
 
+/*
+ * This helper function checks whether copy_file_range can actually be used,
+ * depending on the source and destination filesystems being the same.
+ *
+ * In-kernel callers may set COPY_FILE_SPLICE to override these checks.
+ */
+static int fops_copy_file_checks(struct file *file_in, struct file *file_out,
+				 unsigned int flags)
+{
+	if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
+		return -EINVAL;
+
+	if (flags & COPY_FILE_SPLICE)
+		return 0;
+	/*
+	 * We got here from userspace, so forbid copies if copy_file_range isn't
+	 * implemented or if we're doing a cross-fs copy.
+	 */
+	if (!file_out->f_op->copy_file_range)
+		return -EOPNOTSUPP;
+	else if (file_out->f_op->copy_file_range !=
+		 file_in->f_op->copy_file_range)
+		return -EXDEV;
+
+	return 0;
+}
+
 /*
  * Performs necessary checks before doing a file copy
  *
@@ -1427,6 +1454,14 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	loff_t size_in;
 	int ret;
 
+	/* Only check f_ops if we're not trying to clone */
+	if (!file_in->f_op->remap_file_range ||
+	    (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb)) {
+		ret = fops_copy_file_checks(file_in, file_out, flags);
+		if (ret)
+			return ret;
+	}
+
 	ret = generic_file_rw_checks(file_in, file_out);
 	if (ret)
 		return ret;
@@ -1474,9 +1509,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret;
 
-	if (flags != 0)
-		return -EINVAL;
-
 	ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
 				       flags);
 	if (unlikely(ret))
@@ -1511,6 +1543,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 		ret = clo
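As a reading aid, the decision made by fops_copy_file_checks() in this v3 patch can be mirrored in a small userspace model. It is a sketch, not kernel code: `struct model_file`, `model_fops_copy_file_checks()` and the sample objects are hypothetical, and the value given to COPY_FILE_SPLICE here is an assumption, since the include/linux/fs.h hunk that defines it is not shown above.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed value for illustration only; the real definition is not shown. */
#define COPY_FILE_SPLICE 0x1u

typedef long (*cfr_fn)(void);

/* Stand-in for the one f_op member the check looks at. */
struct model_file {
	cfr_fn copy_file_range;	/* NULL: no ->copy_file_range implemented */
};

/* Same ordering of checks as the helper added by the v3 patch. */
static int model_fops_copy_file_checks(const struct model_file *in,
				       const struct model_file *out,
				       unsigned int flags)
{
	if (flags & ~COPY_FILE_SPLICE)
		return -EINVAL;		/* unknown flag bits */
	if (flags & COPY_FILE_SPLICE)
		return 0;		/* in-kernel caller overrides the checks */
	if (!out->copy_file_range)
		return -EOPNOTSUPP;	/* destination cannot do CFR at all */
	if (out->copy_file_range != in->copy_file_range)
		return -EXDEV;		/* different implementations: cross-fs */
	return 0;
}

/* Two distinct "implementations" and a few sample files. */
static long fs_a_cfr(void) { return 1; }
static long fs_b_cfr(void) { return 2; }

static const struct model_file no_cfr_file = { NULL };
static const struct model_file fs_a_file1  = { fs_a_cfr };
static const struct model_file fs_a_file2  = { fs_a_cfr };
static const struct model_file fs_b_file   = { fs_b_cfr };
```

Note that the check compares function pointers rather than superblocks, so two files on different mounts of the same filesystem type still pass, which matches the comment about NFS in the removed do_copy_file_range() helper.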
Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices
Amir Goldstein writes: > On Tue, Feb 16, 2021 at 6:41 PM Luis Henriques wrote: >> >> Amir Goldstein writes: >> >> >> Ugh. And I guess overlayfs may have a similar problem. >> > >> > Not exactly. >> > Generally speaking, overlayfs should call vfs_copy_file_range() >> > with the flags it got from layer above, so if called from nfsd it >> > will allow cross fs copy and when called from syscall it won't. >> > >> > There are some corner cases where overlayfs could benefit from >> > COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but >> > let's leave those for now. Just leave overlayfs code as is. >> >> Got it, thanks for clarifying. >> >> >> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that >> >> > is internal to kernel users. >> >> > >> >> > FWIW, you may want to look at the loop in ovl_copy_up_data() >> >> > for improvements to nfsd_copy_file_range(). >> >> > >> >> > We can move the check out to copy_file_range syscall: >> >> > >> >> > if (flags != 0) >> >> > return -EINVAL; >> >> > >> >> > Leave the fallback from all filesystems and check for the >> >> > COPY_FILE_SPLICE flag inside generic_copy_file_range(). >> >> >> >> Ok, the diff bellow is just to make sure I understood your suggestion. >> >> >> >> The patch will also need to: >> >> >> >> - change nfs and overlayfs calls to vfs_copy_file_range() so that they >> >>use the new flag. >> >> >> >> - check flags in generic_copy_file_checks() to make sure only valid flags >> >>are used (COPY_FILE_SPLICE at the moment). >> >> >> >> Also, where should this flag be defined? include/uapi/linux/fs.h? >> > >> > Grep for REMAP_FILE_ >> > Same header file, same Documentation rst file. 
>> > >> >> >> >> Cheers, >> >> -- >> >> Luis >> >> >> >> diff --git a/fs/read_write.c b/fs/read_write.c >> >> index 75f764b43418..341d315d2a96 100644 >> >> --- a/fs/read_write.c >> >> +++ b/fs/read_write.c >> >> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file >> >> *file_in, loff_t pos_in, >> >> struct file *file_out, loff_t pos_out, >> >> size_t len, unsigned int flags) >> >> { >> >> + if (!(flags & COPY_FILE_SPLICE)) { >> >> + if (!file_out->f_op->copy_file_range) >> >> + return -EOPNOTSUPP; >> >> + else if (file_out->f_op->copy_file_range != >> >> +file_in->f_op->copy_file_range) >> >> + return -EXDEV; >> >> + } >> > >> > That looks strange, because you are duplicating the logic in >> > do_copy_file_range(). Maybe better: >> > >> > if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE)) >> > return -EINVAL; >> > if (flags & COPY_FILE_SPLICE) >> >return do_splice_direct(file_in, _in, file_out, _out, >> > len > MAX_RW_COUNT ? MAX_RW_COUNT : len, >> > 0); >> >> My initial reasoning for duplicating the logic in do_copy_file_range() was >> to allow the generic_copy_file_range() callers to be left unmodified and >> allow the filesystems to default to this implementation. >> >> With this change, I guess that the calls to generic_copy_file_range() from >> the different filesystems can be dropped, as in my initial patch, as they >> will always get -EINVAL. The other option would be to set the >> COPY_FILE_SPLICE flag in those calls, but that would get us back to the >> problem we're trying to solve. > > I don't understand the problem. > > What exactly is wrong with the code I suggested? > Why should any filesystem be changed? > > Maybe I am missing something. Ok, I have to do a full brain reboot and start all over. Before that, I picked the code you suggested and tested it. I've mounted a cephfs filesystem and used xfs_io to execute a 'copy_range' command using /sys/kernel/debug/sched_features as source. The result was a 0-sized file in cephfs. 
And the reason is the vfs_copy_file_range() early exit in:

	if (len == 0)
		return 0;

'len' is set in generic_copy_file_checks().  This means that we're not
solving the original problem anymore (probably since v1 of this patch,
haven't checked).

Also, re-reading Trond's emails, I read: "... also disallowing the copy
from, say, an XFS formatted partition to an ext4 partition".  Isn't that
*exactly* what we're trying to do here?  I.e. _prevent_ these copies from
happening so that tracefs files can't be CFR'ed?

/me stops now and waits to see if the morning brings some sun :-)

Cheers,
-- 
Luis
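The effect of that early exit is easier to see with numbers. The sketch below assumes that generic_copy_file_checks() clamps the requested length to whatever the reported source size leaves after the source offset; the message only says that 'len' is set there, so the exact rule is an assumption, and `model_clamp_copy_len()` is a hypothetical name used for illustration.

```c
#include <stdint.h>

/*
 * Model of the length adjustment: never copy past the size the source
 * file reports.  A file whose content is generated on the fly reports
 * size 0, so the result is 0 no matter how much was requested.
 */
static uint64_t model_clamp_copy_len(uint64_t size_in, uint64_t pos_in,
				     uint64_t count)
{
	uint64_t remaining;

	if (pos_in >= size_in)
		return 0;
	remaining = size_in - pos_in;
	return count < remaining ? count : remaining;
}
```

With a reported size of zero the clamped length is zero, vfs_copy_file_range() returns 0 before any filesystem method is called, and the caller sees a successful copy of no data, which is consistent with the 0-sized file observed in the cephfs test.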
Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices
Amir Goldstein writes: >> Ugh. And I guess overlayfs may have a similar problem. > > Not exactly. > Generally speaking, overlayfs should call vfs_copy_file_range() > with the flags it got from layer above, so if called from nfsd it > will allow cross fs copy and when called from syscall it won't. > > There are some corner cases where overlayfs could benefit from > COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but > let's leave those for now. Just leave overlayfs code as is. Got it, thanks for clarifying. >> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that >> > is internal to kernel users. >> > >> > FWIW, you may want to look at the loop in ovl_copy_up_data() >> > for improvements to nfsd_copy_file_range(). >> > >> > We can move the check out to copy_file_range syscall: >> > >> > if (flags != 0) >> > return -EINVAL; >> > >> > Leave the fallback from all filesystems and check for the >> > COPY_FILE_SPLICE flag inside generic_copy_file_range(). >> >> Ok, the diff bellow is just to make sure I understood your suggestion. >> >> The patch will also need to: >> >> - change nfs and overlayfs calls to vfs_copy_file_range() so that they >>use the new flag. >> >> - check flags in generic_copy_file_checks() to make sure only valid flags >>are used (COPY_FILE_SPLICE at the moment). >> >> Also, where should this flag be defined? include/uapi/linux/fs.h? > > Grep for REMAP_FILE_ > Same header file, same Documentation rst file. 
> >> >> Cheers, >> -- >> Luis >> >> diff --git a/fs/read_write.c b/fs/read_write.c >> index 75f764b43418..341d315d2a96 100644 >> --- a/fs/read_write.c >> +++ b/fs/read_write.c >> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, >> loff_t pos_in, >> struct file *file_out, loff_t pos_out, >> size_t len, unsigned int flags) >> { >> + if (!(flags & COPY_FILE_SPLICE)) { >> + if (!file_out->f_op->copy_file_range) >> + return -EOPNOTSUPP; >> + else if (file_out->f_op->copy_file_range != >> +file_in->f_op->copy_file_range) >> + return -EXDEV; >> + } > > That looks strange, because you are duplicating the logic in > do_copy_file_range(). Maybe better: > > if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE)) > return -EINVAL; > if (flags & COPY_FILE_SPLICE) >return do_splice_direct(file_in, _in, file_out, _out, > len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0); My initial reasoning for duplicating the logic in do_copy_file_range() was to allow the generic_copy_file_range() callers to be left unmodified and allow the filesystems to default to this implementation. With this change, I guess that the calls to generic_copy_file_range() from the different filesystems can be dropped, as in my initial patch, as they will always get -EINVAL. The other option would be to set the COPY_FILE_SPLICE flag in those calls, but that would get us back to the problem we're trying to solve. > if (!file_out->f_op->copy_file_range) > return -EOPNOTSUPP; > return -EXDEV; > >> } >> @@ -1474,9 +1481,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, >> loff_t pos_in, >> { >> ssize_t ret; >> >> - if (flags != 0) >> - return -EINVAL; >> - > > This needs to move to the beginning of SYSCALL_DEFINE6(copy_file_range,... Yep, I didn't included that change in my diff as I wasn't sure if you'd like to have the flag visible in userspace. Anyway, thanks for your patience! Cheers, -- Luis
Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices
"gre...@linuxfoundation.org" writes: > On Tue, Feb 16, 2021 at 11:17:34AM +, Luis Henriques wrote: >> Amir Goldstein writes: >> >> > On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust >> > wrote: >> >> >> >> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote: >> >> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust < >> >> > tron...@hammerspace.com> wrote: >> >> > > >> >> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote: >> >> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques < >> >> > > > lhenriq...@suse.de> >> >> > > > wrote: >> >> > > > > >> >> > > > > Nicolas Boichat reported an issue when trying to use the >> >> > > > > copy_file_range >> >> > > > > syscall on a tracefs file. It failed silently because the file >> >> > > > > content is >> >> > > > > generated on-the-fly (reporting a size of zero) and >> >> > > > > copy_file_range >> >> > > > > needs >> >> > > > > to know in advance how much data is present. >> >> > > > > >> >> > > > > This commit restores the cross-fs restrictions that existed >> >> > > > > prior >> >> > > > > to >> >> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across >> >> > > > > devices") >> >> > > > > and >> >> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse, >> >> > > > > and >> >> > > > > nfs. >> >> > > > > >> >> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across >> >> > > > > devices") >> >> > > > > Link: >> >> > > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ >> >> > > > > Cc: Nicolas Boichat >> >> > > > > Signed-off-by: Luis Henriques >> >> > > > >> >> > > > Code looks ok. >> >> > > > You may add: >> >> > > > >> >> > > > Reviewed-by: Amir Goldstein >> >> > > > >> >> > > > I agree with Trond that the first paragraph of the commit message >> >> > > > could >> >> > > > be improved. >> >> > > > The purpose of this change is to fix the change of behavior that >> >> > > > caused the regression. 
>> >> > > > >> >> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to >> >> > > > read. >> >> > > > After v5.3, behavior is zero size copy. >> >> > > > >> >> > > > It does not matter so much what makes sense for CFR to do in this >> >> > > > case (generic cross-fs copy). What matters is that nobody asked >> >> > > > for >> >> > > > this change and that it caused problems. >> >> > > > >> >> > > >> >> > > No. I'm saying that this patch should be NACKed unless there is a >> >> > > real >> >> > > explanation for why we give crap about this tracefs corner case and >> >> > > why >> >> > > it can't be fixed. >> >> > > >> >> > > There are plenty of reasons why copy offload across filesystems >> >> > > makes >> >> > > sense, and particularly when you're doing NAS. Clone just doesn't >> >> > > cut >> >> > > it when it comes to disaster recovery (whereas backup to a >> >> > > different >> >> > > storage unit does). If the client has to do the copy, then you're >> >> > > effectively doubling the load on the server, and you're adding >> >> > > potentially unnecessary network traffic (or at the very least you >> >> > > are >> >> > > doubling that traffic). >> >> > > >> >> > >> >> > I don't understand the use case you are describing. >> >> > >> >> > Which filesystem types are you tal
Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices
Amir Goldstein writes: > On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust > wrote: >> >> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote: >> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust < >> > tron...@hammerspace.com> wrote: >> > > >> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote: >> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques < >> > > > lhenriq...@suse.de> >> > > > wrote: >> > > > > >> > > > > Nicolas Boichat reported an issue when trying to use the >> > > > > copy_file_range >> > > > > syscall on a tracefs file. It failed silently because the file >> > > > > content is >> > > > > generated on-the-fly (reporting a size of zero) and >> > > > > copy_file_range >> > > > > needs >> > > > > to know in advance how much data is present. >> > > > > >> > > > > This commit restores the cross-fs restrictions that existed >> > > > > prior >> > > > > to >> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across >> > > > > devices") >> > > > > and >> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse, >> > > > > and >> > > > > nfs. >> > > > > >> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across >> > > > > devices") >> > > > > Link: >> > > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ >> > > > > Cc: Nicolas Boichat >> > > > > Signed-off-by: Luis Henriques >> > > > >> > > > Code looks ok. >> > > > You may add: >> > > > >> > > > Reviewed-by: Amir Goldstein >> > > > >> > > > I agree with Trond that the first paragraph of the commit message >> > > > could >> > > > be improved. >> > > > The purpose of this change is to fix the change of behavior that >> > > > caused the regression. >> > > > >> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to >> > > > read. >> > > > After v5.3, behavior is zero size copy. >> > > > >> > > > It does not matter so much what makes sense for CFR to do in this >> > > > case (generic cross-fs copy). 
What matters is that nobody asked >> > > > for >> > > > this change and that it caused problems. >> > > > >> > > >> > > No. I'm saying that this patch should be NACKed unless there is a >> > > real >> > > explanation for why we give crap about this tracefs corner case and >> > > why >> > > it can't be fixed. >> > > >> > > There are plenty of reasons why copy offload across filesystems >> > > makes >> > > sense, and particularly when you're doing NAS. Clone just doesn't >> > > cut >> > > it when it comes to disaster recovery (whereas backup to a >> > > different >> > > storage unit does). If the client has to do the copy, then you're >> > > effectively doubling the load on the server, and you're adding >> > > potentially unnecessary network traffic (or at the very least you >> > > are >> > > doubling that traffic). >> > > >> > >> > I don't understand the use case you are describing. >> > >> > Which filesystem types are you talking about for source and target >> > of copy_file_range()? >> > >> > To be clear, the original change was done to support NFS/CIFS server- >> > side >> > copy and those should not be affected by this change. >> > >> >> That is incorrect: >> >> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file >> *dst, >> u64 dst_pos, u64 count) >> { >> >> /* >> * Limit copy to 4MB to prevent indefinitely blocking an nfsd >> * thread and client rpc slot. The choice of 4MB is somewhat >> * arbitrary. We might instead base this on r/wsize, or make it >> * tunable, or use a time instead of a byte limit, or implement >> * asynchronous copy. In theory a client could also recognize a >> * limit like this an
[PATCH v2] vfs: prevent copy_file_range to copy across devices
Nicolas Boichat reported an issue when trying to use the copy_file_range syscall on a tracefs file. It failed silently because the file content is generated on-the-fly (reporting a size of zero) and copy_file_range needs to know in advance how much data is present. This commit restores the cross-fs restrictions that existed prior to 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") and removes generic_copy_file_range() calls from ceph, cifs, fuse, and nfs. Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/ Cc: Nicolas Boichat Signed-off-by: Luis Henriques --- Changes since v1 (after Amir review) - restored do_copy_file_range() helper - return -EOPNOTSUPP if fs doesn't implement CFR - updated commit description fs/ceph/file.c | 21 +++- fs/cifs/cifsfs.c | 3 --- fs/fuse/file.c | 21 +++- fs/nfs/nfs4file.c | 20 +++ fs/read_write.c| 49 ++ include/linux/fs.h | 3 --- 6 files changed, 19 insertions(+), 98 deletions(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 209535d5b8d3..639bd7bfaea9 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off return bytes; } -static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, - size_t len, unsigned int flags) +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, + struct file *dst_file, loff_t dst_off, + size_t len, unsigned int flags) { struct inode *src_inode = file_inode(src_file); struct inode *dst_inode = file_inode(dst_file); @@ -2456,21 +2456,6 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, return ret; } -static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, - size_t len, unsigned int flags) -{ - ssize_t ret; - - ret = 
__ceph_copy_file_range(src_file, src_off, dst_file, dst_off, -len, flags); - - if (ret == -EOPNOTSUPP || ret == -EXDEV) - ret = generic_copy_file_range(src_file, src_off, dst_file, - dst_off, len, flags); - return ret; -} - const struct file_operations ceph_file_fops = { .open = ceph_open, .release = ceph_release, diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index ab883e84e116..7aa3d20f21c0 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -1229,9 +1229,6 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off, len, flags); free_xid(xid); - if (rc == -EOPNOTSUPP || rc == -EXDEV) - rc = generic_copy_file_range(src_file, off, dst_file, -destoff, len, flags); return rc; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 8cccecb55fb8..0dd703278e49 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3330,9 +3330,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, return err; } -static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - size_t len, unsigned int flags) +static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t len, unsigned int flags) { struct fuse_file *ff_in = file_in->private_data; struct fuse_file *ff_out = file_out->private_data; @@ -3439,21 +3439,6 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, return err; } -static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, - size_t len, unsigned int flags) -{ - ssize_t ret; - - ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off, -len, flags); - - if (ret == -EOPNOTSUPP || ret == -EXDEV) - ret = generic_copy_file_range(src_file, src_off, dst_file, - dst_off, len, flags); - return ret; -} - static const struct file_operations fuse_file_operations = { .llseek = fuse_file_
Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated
Amir Goldstein writes: > On Mon, Feb 15, 2021 at 2:21 PM Luis Henriques wrote: >> >> Luis Henriques writes: >> >> > Amir Goldstein writes: >> > >> >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques wrote: >> > ... >> >>> Sure, I just wanted to point out that *maybe* there are other options >> >>> than >> >>> simply reverting that commit :-) >> >>> >> >>> Something like the patch below (completely untested!) should revert to >> >>> the >> >>> old behaviour in filesystems that don't implement the CFR syscall. >> >>> >> >>> Cheers, >> >>> -- >> >>> Luis >> >>> >> >>> diff --git a/fs/read_write.c b/fs/read_write.c >> >>> index 75f764b43418..bf5dccc43cc9 100644 >> >>> --- a/fs/read_write.c >> >>> +++ b/fs/read_write.c >> >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file >> >>> *file_in, loff_t pos_in, >> >>>file_out, pos_out, >> >>>len, flags); >> >>> >> >>> - return generic_copy_file_range(file_in, pos_in, file_out, >> >>> pos_out, len, >> >>> - flags); >> >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> >>> + return -EXDEV; >> >>> + else >> >>> + generic_copy_file_range(file_in, pos_in, file_out, >> >>> pos_out, len, >> >>> + flags); >> >>> } >> >>> >> >> >> >> Which kernel is this patch based on? >> > >> > It was v5.11-rc7. >> > >> >> At this point, I am with Dave and Darrick on not falling back to >> >> generic_copy_file_range() at all. >> >> >> >> We do not have proof of any workload that benefits from it and the >> >> above patch does not protect from a wierd use case of trying to copy a >> >> file >> >> from sysfs to sysfs. >> >> >> > >> > Ok, cool. I can post a new patch doing just that. I guess that function >> > do_copy_file_range() can be dropped in that case. >> > >> >> I am indecisive about what should be done with generic_copy_file_range() >> >> called as fallback from within filesystems. 
>> >> >> >> I think the wise choice is to not do the fallback in any case, but this >> >> is up >> >> to the specific filesystem maintainers to decide. >> > >> > I see what you mean. You're suggesting to have userspace handle all the >> > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also >> > removes all the calls to generic_copy_file_range() function? And that >> > function can also be deleted too, of course. >> >> Here's a first stab at this patch. Hopefully I didn't forgot anything >> here. Let me know if you prefer the more conservative approach, i.e. not >> touching any of the filesystems and let them use generic_copy_file_range. >> > > I'm fine with this one (modulu my comment below). > CC'ing fuse/cifs/nfs maintainers. > Though I don't think the FS maintainers should mind removing the fallback. > It was added by "us" (64bf5ff58dff "vfs: no fallback for ->copy_file_range()") Thanks for your review, Amir. I'll be posting v2 shortly. Cheers, -- Luis >> Once everyone agrees on the final solution, I can follow-up with the >> manpages update. >> >> Cheers, >> -- >> Luis >> >> From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001 >> From: Luis Henriques >> Date: Fri, 12 Feb 2021 18:03:23 + >> Subject: [PATCH] vfs: prevent copy_file_range to copy across devices >> >> Nicolas Boichat reported an issue when trying to use the copy_file_range >> syscall on a tracefs file. It failed silently because the file content is >> generated on-the-fly (reporting a size of zero) and copy_file_range needs >> to know in advance how much data is present. >> >> This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to >> copy across devices"). Now the copy is done only if the filesystems for >> source >> and destination files are the sa
Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated
Luis Henriques writes: > Amir Goldstein writes: > >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques wrote: > ... >>> Sure, I just wanted to point out that *maybe* there are other options than >>> simply reverting that commit :-) >>> >>> Something like the patch below (completely untested!) should revert to the >>> old behaviour in filesystems that don't implement the CFR syscall. >>> >>> Cheers, >>> -- >>> Luis >>> >>> diff --git a/fs/read_write.c b/fs/read_write.c >>> index 75f764b43418..bf5dccc43cc9 100644 >>> --- a/fs/read_write.c >>> +++ b/fs/read_write.c >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file >>> *file_in, loff_t pos_in, >>>file_out, pos_out, >>>len, flags); >>> >>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, >>> len, >>> - flags); >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >>> + return -EXDEV; >>> + else >>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, >>> len, >>> + flags); >>> } >>> >> >> Which kernel is this patch based on? > > It was v5.11-rc7. > >> At this point, I am with Dave and Darrick on not falling back to >> generic_copy_file_range() at all. >> >> We do not have proof of any workload that benefits from it and the >> above patch does not protect from a wierd use case of trying to copy a file >> from sysfs to sysfs. >> > > Ok, cool. I can post a new patch doing just that. I guess that function > do_copy_file_range() can be dropped in that case. > >> I am indecisive about what should be done with generic_copy_file_range() >> called as fallback from within filesystems. >> >> I think the wise choice is to not do the fallback in any case, but this is up >> to the specific filesystem maintainers to decide. > > I see what you mean. You're suggesting to have userspace handle all the > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also > removes all the calls to generic_copy_file_range() function? 
And that > function can also be deleted too, of course. Here's a first stab at this patch. Hopefully I didn't forgot anything here. Let me know if you prefer the more conservative approach, i.e. not touching any of the filesystems and let them use generic_copy_file_range. Once everyone agrees on the final solution, I can follow-up with the manpages update. Cheers, -- Luis >From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001 From: Luis Henriques Date: Fri, 12 Feb 2021 18:03:23 + Subject: [PATCH] vfs: prevent copy_file_range to copy across devices Nicolas Boichat reported an issue when trying to use the copy_file_range syscall on a tracefs file. It failed silently because the file content is generated on-the-fly (reporting a size of zero) and copy_file_range needs to know in advance how much data is present. This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices"). Now the copy is done only if the filesystems for source and destination files are the same and they implement this syscall. 
Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Cc: Nicolas Boichat Signed-off-by: Luis Henriques --- fs/ceph/file.c | 21 +++ fs/cifs/cifsfs.c | 3 --- fs/fuse/file.c | 21 +++ fs/nfs/nfs4file.c | 20 +++--- fs/read_write.c| 65 -- include/linux/fs.h | 3 --- 6 files changed, 20 insertions(+), 113 deletions(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 209535d5b8d3..639bd7bfaea9 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off return bytes; } -static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, - size_t len, unsigned int flags) +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, + struct file *dst_file, loff_t dst_off, + size_t len, unsigned int flags) { struct inode *src_inode = file_inode(src_file); struct inode *dst_inode = file_inode(dst_file); @@ -2456,2
Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated
Amir Goldstein writes:

> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques wrote:
...
>> Sure, I just wanted to point out that *maybe* there are other options than
>> simply reverting that commit :-)
>>
>> Something like the patch below (completely untested!) should revert to the
>> old behaviour in filesystems that don't implement the CFR syscall.
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..bf5dccc43cc9 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>>                                    file_out, pos_out,
>>                                    len, flags);
>>
>> -       return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> -                                      flags);
>> +       if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> +               return -EXDEV;
>> +       else
>> +               return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> +                                              flags);
>>  }
>>
>
> Which kernel is this patch based on?

It was v5.11-rc7.

> At this point, I am with Dave and Darrick on not falling back to
> generic_copy_file_range() at all.
>
> We do not have proof of any workload that benefits from it and the
> above patch does not protect from a weird use case of trying to copy a file
> from sysfs to sysfs.
>

Ok, cool.  I can post a new patch doing just that.  I guess that function
do_copy_file_range() can be dropped in that case.

> I am indecisive about what should be done with generic_copy_file_range()
> called as fallback from within filesystems.
>
> I think the wise choice is to not do the fallback in any case, but this is up
> to the specific filesystem maintainers to decide.

I see what you mean.  You're suggesting to have userspace handle all the
-EOPNOTSUPP and -EXDEV errors.  Would you rather have a patch that also
removes all the calls to generic_copy_file_range() function?  And that
function can also be deleted too, of course.

Cheers,
--
Luis
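For anyone wondering what "have userspace handle all the -EOPNOTSUPP and -EXDEV errors" could look like in practice, here is a minimal sketch. It is not taken from any patch in this thread. The names copy_fd() and copy_by_read_write() are made up for illustration, and treating ENOSYS and EINVAL as "fall back" conditions is an assumption on my part. It also finishes with a plain read()/write() pass after copy_file_range() returns 0, so that a source whose content is generated on the fly (and which reports a size of zero) still gets copied.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop from the current offsets of both fds.
 * Returns the number of bytes copied, or -1 with errno set. */
static ssize_t copy_by_read_write(int fd_in, int fd_out)
{
	char buf[65536];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(fd_in, buf, sizeof(buf));

		if (n == 0)
			return total;		/* real end of file */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}

/* Hypothetical helper: try copy_file_range() first, then fall back. */
ssize_t copy_fd(int fd_in, int fd_out)
{
	ssize_t total = 0;
	ssize_t rest;

	for (;;) {
		/* NULL offsets: the kernel advances both file offsets. */
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    (size_t)1 << 30, 0);

		if (n > 0) {
			total += n;
			continue;
		}
		if (n == 0)
			break;	/* EOF, or a file that reports size zero */
		if (errno == EINTR)
			continue;
		if (errno == EXDEV || errno == EOPNOTSUPP ||
		    errno == ENOSYS || errno == EINVAL)
			break;	/* assumed "not supported here" cases */
		return -1;
	}

	/* Copies whatever is left; copies nothing at a true EOF. */
	rest = copy_by_read_write(fd_in, fd_out);
	return rest < 0 ? -1 : total + rest;
}
```

Because the offsets are left to the kernel, the fallback loop picks up exactly where copy_file_range() stopped, so a partial in-kernel copy followed by an error does not duplicate data.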
Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated
Greg KH writes:

> On Fri, Feb 12, 2021 at 12:41:48PM +0000, Luis Henriques wrote:
>> Greg KH writes:
...
>> >> >> Our options now are:
>> >> >> - Restore the cross-fs restriction into generic_copy_file_range()
>> >> >
>> >> > Yes.
>> >> >
>> >>
>> >> Restoring this restriction will actually change the current cephfs CFR
>> >> behaviour.  Since that commit we have allowed doing remote copies between
>> >> different filesystems within the same ceph cluster.  See commit
>> >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
>> >> the same cluster").
>> >>
>> >> Although I'm not aware of any current users for this scenario, the
>> >> performance impact can actually be huge as it's the difference between
>> >> asking the OSDs for copying a file and doing a full read+write on the
>> >> client side.
>> >
>> > Regression in performance is ok if it fixes a regression for things that
>> > used to work just fine in the past :)
>> >
>> > First rule, make it work.
>>
>> Sure, I just wanted to point out that *maybe* there are other options than
>> simply reverting that commit :-)
>>
>> Something like the patch below (completely untested!) should revert to the
>> old behaviour in filesystems that don't implement the CFR syscall.
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..bf5dccc43cc9 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>>                                    file_out, pos_out,
>>                                    len, flags);
>>
>> -       return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> -                                      flags);
>> +       if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> +               return -EXDEV;
>> +       else
>> +               return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> +                                              flags);
>>  }
>>
>>  /*
>
> That would make much more sense to me.

Great.  I can send a proper patch with changelog, if this is really what
we want.  But I would rather hear from others first.  I guess that at
least the NFS devs have something to say here.

Cheers,
--
Luis
Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated
Greg KH writes: > On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote: >> Greg KH writes: >> >> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: >> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH >> >> wrote: >> >> > >> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: >> >> > > Filesystems such as procfs and sysfs generate their content at >> >> > > runtime. This implies the file sizes do not usually match the >> >> > > amount of data that can be read from the file, and that seeking >> >> > > may not work as intended. >> >> > > >> >> > > This will be useful to disallow copy_file_range with input files >> >> > > from such filesystems. >> >> > > >> >> > > Signed-off-by: Nicolas Boichat >> >> > > --- >> >> > > I first thought of adding a new field to struct file_operations, >> >> > > but that doesn't quite scale as every single file creation >> >> > > operation would need to be modified. >> >> > >> >> > Even so, you missed a load of filesystems in the kernel with this patch >> >> > series, what makes the ones you did mark here different from the >> >> > "internal" filesystems that you did not? >> >> > >> >> > This feels wrong, why is userspace suddenly breaking? What changed in >> >> > the kernel that caused this? Procfs has been around for a _very_ long >> >> > time :) >> >> >> >> That would be because of (v5.3): >> >> >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices >> >> >> >> The intention of this change (series) was to allow server side copy >> >> for nfs and cifs via copy_file_range(). >> >> This is mostly work by Dave Chinner that I picked up following requests >> >> from the NFS folks. 
>> >> >> >> But the above change also includes this generic change: >> >> >> >> - /* this could be relaxed once a method supports cross-fs copies */ >> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> >> - return -EXDEV; >> >> - >> >> >> >> The change of behavior was documented in the commit message. >> >> It was also documented in: >> >> >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates >> >> >> >> I think our rationale for the generic change was: >> >> "Why not? What could go wrong? (TM)" >> >> I am not sure if any workload really gained something from this >> >> kernel cross-fs CFR. >> > >> > Why not put that check back? >> > >> >> In retrospect, I think it would have been safer to allow cross-fs CFR >> >> only to the filesystems that implement ->{copy,remap}_file_range()... >> > >> > Why not make this change? That seems easier and should fix this for >> > everyone, right? >> > >> >> Our option now are: >> >> - Restore the cross-fs restriction into generic_copy_file_range() >> > >> > Yes. >> > >> >> Restoring this restriction will actually change the current cephfs CFR >> behaviour. Since that commit we have allowed doing remote copies between >> different filesystems within the same ceph cluster. See commit >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in >> the same cluster"). >> >> Although I'm not aware of any current users for this scenario, the >> performance impact can actually be huge as it's the difference between >> asking the OSDs for copying a file and doing a full read+write on the >> client side. > > Regression in performance is ok if it fixes a regression for things that > used to work just fine in the past :) > > First rule, make it work. Sure, I just wanted to point out that *maybe* there are other options than simply reverting that commit :-) Something like the patch below (completely untested!) should revert to the old behaviour in filesystems that don't implement the CFR syscall. 

Cheers,
--
Luis

diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..bf5dccc43cc9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
 				       file_out, pos_out,
 				       len, flags);
 
-	return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
-				       flags);
+	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+		return -EXDEV;
+	else
+		return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+					       flags);
 }
 
 /*
Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated
Greg KH writes: > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH wrote: >> > >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: >> > > Filesystems such as procfs and sysfs generate their content at >> > > runtime. This implies the file sizes do not usually match the >> > > amount of data that can be read from the file, and that seeking >> > > may not work as intended. >> > > >> > > This will be useful to disallow copy_file_range with input files >> > > from such filesystems. >> > > >> > > Signed-off-by: Nicolas Boichat >> > > --- >> > > I first thought of adding a new field to struct file_operations, >> > > but that doesn't quite scale as every single file creation >> > > operation would need to be modified. >> > >> > Even so, you missed a load of filesystems in the kernel with this patch >> > series, what makes the ones you did mark here different from the >> > "internal" filesystems that you did not? >> > >> > This feels wrong, why is userspace suddenly breaking? What changed in >> > the kernel that caused this? Procfs has been around for a _very_ long >> > time :) >> >> That would be because of (v5.3): >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices >> >> The intention of this change (series) was to allow server side copy >> for nfs and cifs via copy_file_range(). >> This is mostly work by Dave Chinner that I picked up following requests >> from the NFS folks. >> >> But the above change also includes this generic change: >> >> - /* this could be relaxed once a method supports cross-fs copies */ >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> - return -EXDEV; >> - >> >> The change of behavior was documented in the commit message. >> It was also documented in: >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates >> >> I think our rationale for the generic change was: >> "Why not? What could go wrong? 
(TM)"
>> I am not sure if any workload really gained something from this
>> kernel cross-fs CFR.
>
> Why not put that check back?
>
>> In retrospect, I think it would have been safer to allow cross-fs CFR
>> only to the filesystems that implement ->{copy,remap}_file_range()...
>
> Why not make this change?  That seems easier and should fix this for
> everyone, right?
>
>> Our options now are:
>> - Restore the cross-fs restriction into generic_copy_file_range()
>
> Yes.
>

Restoring this restriction will actually change the current cephfs CFR
behaviour.  Since that commit we have allowed doing remote copies between
different filesystems within the same ceph cluster.  See commit
6fd4e6348352 ("ceph: allow object copies across different filesystems in
the same cluster").

Although I'm not aware of any current users for this scenario, the
performance impact can actually be huge as it's the difference between
asking the OSDs for copying a file and doing a full read+write on the
client side.

Cheers,
--
Luis

>> - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does
>
> No.  That way lies constant auditing and someone being "vigilant" for
> the next 30+ years.  Which will not happen.
>
> thanks,
>
> greg k-h
Re: [PATCH v2] ceph: add ceph.caps vxattr
Jeff Layton writes:

> On Mon, 2020-11-23 at 17:38 +0000, Luis Henriques wrote:
>> Add a new vxattr that allows userspace to list the caps for a specific
>> directory or file.
>>
>> Signed-off-by: Luis Henriques
>> ---
>> Hi!
>>
>> Here's a version that also shows the caps in hexadecimal format, as
>> suggested by Jeff.  IMO the parentheses and the '0x' prefix help the
>> readability, but they may make it a bit harder for scripts to parse the
>> output.  I'm OK dropping those.
>>
>> Cheers,
>
> Looks good, merged into "testing".

Awesome, thanks!

> I did make a slight change to the format -- instead of putting the hex
> value in parenthesis, I separated the two fields with a /, which I think
> should make things easier for scripts to parse.
>
> You should be able to do something like this to get at the hex value for
> testing:
>
>     $ getfattr -n ceph.caps foo | cut -d / -f2
>
> Let me know if you see issues with that and we can revisit the format.

Sure, I'm OK with that.  Or even simply dropping any separator, having
only a space/tab between the string and the hex value.

Another option I saw was to have two vxattrs: ceph.caps.string and
ceph.caps.int.  But that's probably overkill.

Cheers,
--
Luis
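As a rough illustration of how a test program (rather than a shell pipeline) might pull the hex field out of the "string/hex" format described above, here is a small sketch. The function name parse_caps_hex() is made up, and the sample values in the usage note are invented for illustration; the only thing taken from the thread is that the two fields are separated by a '/'.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the numeric caps value from a
 * "<cap string>/<hex>" xattr value.  Returns 0 on success and -1 if
 * the value does not have the expected shape.
 */
int parse_caps_hex(const char *value, unsigned long *caps)
{
	const char *sep = strrchr(value, '/');	/* last '/' in the value */
	char *end;
	unsigned long v;

	if (!sep || sep[1] == '\0')
		return -1;

	errno = 0;
	v = strtoul(sep + 1, &end, 16);		/* base 16 accepts a "0x" prefix */
	if (errno || end == sep + 1 || (*end != '\0' && *end != '\n'))
		return -1;

	*caps = v;
	return 0;
}
```

With an invented value such as "pAsLsXs/0x55" this stores 0x55 in *caps; a value in the older "pAsLsXs (0x55)" layout has no '/' and is rejected.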
[PATCH v2] ceph: add ceph.caps vxattr
Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

Signed-off-by: Luis Henriques
---
Hi!

Here's a version that also shows the caps in hexadecimal format, as
suggested by Jeff.  IMO the parentheses and the '0x' prefix help the
readability, but they may make it a bit harder for scripts to parse the
output.  I'm OK dropping those.

Cheers,
--
Luis

 fs/ceph/xattr.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 197cb1234341..aec9bd5c8e77 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -303,6 +303,19 @@ static ssize_t ceph_vxattrcb_snap_btime(struct ceph_inode_info *ci, char *val,
 				ci->i_snap_btime.tv_nsec);
 }
 
+static ssize_t ceph_vxattrcb_caps(struct ceph_inode_info *ci, char *val,
+				  size_t size)
+{
+	int issued;
+
+	spin_lock(&ci->i_ceph_lock);
+	issued = __ceph_caps_issued(ci, NULL);
+	spin_unlock(&ci->i_ceph_lock);
+
+	return ceph_fmt_xattr(val, size, "%s (0x%x)",
+			      ceph_cap_string(issued), issued);
+}
+
 #define CEPH_XATTR_NAME(_type, _name)	XATTR_CEPH_PREFIX #_type "." #_name
 #define CEPH_XATTR_NAME2(_type, _name, _name2)	\
 	XATTR_CEPH_PREFIX #_type "." #_name "." #_name2
@@ -378,6 +391,13 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = {
 		.exists_cb = ceph_vxattrcb_snap_btime_exists,
 		.flags = VXATTR_FLAG_READONLY,
 	},
+	{
+		.name = "ceph.caps",
+		.name_size = sizeof("ceph.caps"),
+		.getxattr_cb = ceph_vxattrcb_caps,
+		.exists_cb = NULL,
+		.flags = VXATTR_FLAG_HIDDEN,
+	},
 	{ .name = NULL, 0 }	/* Required table terminator */
 };
 
@@ -403,6 +423,13 @@ static struct ceph_vxattr ceph_file_vxattrs[] = {
 		.exists_cb = ceph_vxattrcb_snap_btime_exists,
 		.flags = VXATTR_FLAG_READONLY,
 	},
+	{
+		.name = "ceph.caps",
+		.name_size = sizeof("ceph.caps"),
+		.getxattr_cb = ceph_vxattrcb_caps,
+		.exists_cb = NULL,
+		.flags = VXATTR_FLAG_HIDDEN,
+	},
 	{ .name = NULL, 0 }	/* Required table terminator */
 };
[RFC PATCH] ceph: add ceph.caps vxattr
Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

Signed-off-by: Luis Henriques
---
 fs/ceph/xattr.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 197cb1234341..996512e05513 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -303,6 +303,18 @@ static ssize_t ceph_vxattrcb_snap_btime(struct ceph_inode_info *ci, char *val,
 				ci->i_snap_btime.tv_nsec);
 }
 
+static ssize_t ceph_vxattrcb_caps(struct ceph_inode_info *ci, char *val,
+				  size_t size)
+{
+	int issued;
+
+	spin_lock(&ci->i_ceph_lock);
+	issued = __ceph_caps_issued(ci, NULL);
+	spin_unlock(&ci->i_ceph_lock);
+
+	return ceph_fmt_xattr(val, size, "%s", ceph_cap_string(issued));
+}
+
 #define CEPH_XATTR_NAME(_type, _name)	XATTR_CEPH_PREFIX #_type "." #_name
 #define CEPH_XATTR_NAME2(_type, _name, _name2)	\
 	XATTR_CEPH_PREFIX #_type "." #_name "." #_name2
@@ -378,6 +390,13 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = {
 		.exists_cb = ceph_vxattrcb_snap_btime_exists,
 		.flags = VXATTR_FLAG_READONLY,
 	},
+	{
+		.name = "ceph.caps",
+		.name_size = sizeof("ceph.caps"),
+		.getxattr_cb = ceph_vxattrcb_caps,
+		.exists_cb = NULL,
+		.flags = VXATTR_FLAG_HIDDEN,
+	},
 	{ .name = NULL, 0 }	/* Required table terminator */
 };
 
@@ -403,6 +422,13 @@ static struct ceph_vxattr ceph_file_vxattrs[] = {
 		.exists_cb = ceph_vxattrcb_snap_btime_exists,
 		.flags = VXATTR_FLAG_READONLY,
 	},
+	{
+		.name = "ceph.caps",
+		.name_size = sizeof("ceph.caps"),
+		.getxattr_cb = ceph_vxattrcb_caps,
+		.exists_cb = NULL,
+		.flags = VXATTR_FLAG_HIDDEN,
+	},
 	{ .name = NULL, 0 }	/* Required table terminator */
 };
[PATCH] Revert "ceph: allow rename operation under different quota realms"
This reverts commit dffdcd71458e699e839f0bf47c3d42d64210b939.

When doing a rename across quota realms, there's a corner case that isn't
handled correctly.  Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 100
  mv files limit/

The above will succeed because ftruncate(2) won't immediately notify the
MDSs with the new file size, and thus the quota realms stats won't be
updated.

Since the possible fixes for this issue would have a huge performance impact,
the solution for now is to simply revert to returning -EXDEV when doing a cross
quota realms rename.

URL: https://tracker.ceph.com/issues/48203
Signed-off-by: Luis Henriques
---
 fs/ceph/dir.c   |  9 
 fs/ceph/quota.c | 58 +
 fs/ceph/super.h |  3 +--
 3 files changed, 6 insertions(+), 64 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index a4d48370b2b3..858ee7362ff5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1202,12 +1202,11 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
 			op = CEPH_MDS_OP_RENAMESNAP;
 		else
 			return -EROFS;
-	} else if (old_dir != new_dir) {
-		err = ceph_quota_check_rename(mdsc, d_inode(old_dentry),
-					      new_dir);
-		if (err)
-			return err;
 	}
+	/* don't allow cross-quota renames */
+	if ((old_dir != new_dir) &&
+	    (!ceph_quota_is_same_realm(old_dir, new_dir)))
+		return -EXDEV;
 
 	dout("rename dir %p dentry %p to dir %p dentry %p\n",
 	     old_dir, old_dentry, new_dir, new_dentry);
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index 9b785f11e95a..4e32c9600ecc 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -264,7 +264,7 @@ static struct ceph_snap_realm *get_quota_realm(struct ceph_mds_client *mdsc,
 	return NULL;
 }
 
-static bool ceph_quota_is_same_realm(struct inode *old, struct inode *new)
+bool ceph_quota_is_same_realm(struct inode *old, struct inode *new)
 {
 	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(old->i_sb);
 	struct ceph_snap_realm *old_realm, *new_realm;
@@ -516,59 +516,3 @@ bool ceph_quota_update_statfs(struct 
ceph_fs_client *fsc, struct kstatfs *buf) return is_updated; } -/* - * ceph_quota_check_rename - check if a rename can be executed - * @mdsc: MDS client instance - * @old: inode to be copied - * @new: destination inode (directory) - * - * This function verifies if a rename (e.g. moving a file or directory) can be - * executed. It forces an rstat update in the @new target directory (and in the - * source @old as well, if it's a directory). The actual check is done both for - * max_files and max_bytes. - * - * This function returns 0 if it's OK to do the rename, or, if quotas are - * exceeded, -EXDEV (if @old is a directory) or -EDQUOT. - */ -int ceph_quota_check_rename(struct ceph_mds_client *mdsc, - struct inode *old, struct inode *new) -{ - struct ceph_inode_info *ci_old = ceph_inode(old); - int ret = 0; - - if (ceph_quota_is_same_realm(old, new)) - return 0; - - /* -* Get the latest rstat for target directory (and for source, if a -* directory) -*/ - ret = ceph_do_getattr(new, CEPH_STAT_RSTAT, false); - if (ret) - return ret; - - if (S_ISDIR(old->i_mode)) { - ret = ceph_do_getattr(old, CEPH_STAT_RSTAT, false); - if (ret) - return ret; - ret = check_quota_exceeded(new, QUOTA_CHECK_MAX_BYTES_OP, - ci_old->i_rbytes); - if (!ret) - ret = check_quota_exceeded(new, - QUOTA_CHECK_MAX_FILES_OP, - ci_old->i_rfiles + - ci_old->i_rsubdirs); - if (ret) - ret = -EXDEV; - } else { - ret = check_quota_exceeded(new, QUOTA_CHECK_MAX_BYTES_OP, - i_size_read(old)); - if (!ret) - ret = check_quota_exceeded(new, - QUOTA_CHECK_MAX_FILES_OP, 1); - if (ret) - ret = -EDQUOT; - } - - return ret; -} diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 482473e4cce1..8dbb0babddea 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -1222,14 +1222,13 @@ extern void ceph_handle_quota(struct ceph_mds_client *mdsc, struct ceph_mds_session *session, struct ceph_msg *msg); extern bool ceph_q
Re: [RFC PATCH] ceph: fix cross quota realms renames with new truncated files
Jeff Layton writes: > On Thu, 2020-11-12 at 10:40 +0000, Luis Henriques wrote: >> Jeff Layton writes: >> >> > On Wed, 2020-11-11 at 18:28 +, Luis Henriques wrote: >> > > Jeff Layton writes: >> > > >> > > > On Wed, 2020-11-11 at 15:39 +, Luis Henriques wrote: >> > > > > When doing a rename across quota realms, there's a corner case that >> > > > > isn't >> > > > > handled correctly. Here's a testcase: >> > > > > >> > > > > mkdir files limit >> > > > > truncate files/file -s 10G >> > > > > setfattr limit -n ceph.quota.max_bytes -v 100 >> > > > > mv files limit/ >> > > > > >> > > > > The above will succeed because ftruncate(2) won't result in an >> > > > > immediate >> > > > > notification of the MDSs with the new file size, and thus the quota >> > > > > realms >> > > > > stats won't be updated. >> > > > > >> > > > > This patch forces a sync with the MDS every time there's an >> > > > > ATTR_SIZE that >> > > > > sets a new i_size, even if we have Fx caps. >> > > > > >> > > > > Cc: sta...@vger.kernel.org >> > > > > Fixes: dffdcd71458e ("ceph: allow rename operation under different >> > > > > quota realms") >> > > > > URL: https://tracker.ceph.com/issues/36593 >> > > > > Signed-off-by: Luis Henriques >> > > > > --- >> > > > > fs/ceph/inode.c | 11 ++- >> > > > > 1 file changed, 2 insertions(+), 9 deletions(-) >> > > > > >> > > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c >> > > > > index 526faf4778ce..30e3f240ac96 100644 >> > > > > --- a/fs/ceph/inode.c >> > > > > +++ b/fs/ceph/inode.c >> > > > > @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, >> > > > > struct iattr *attr) >> > > > > if (ia_valid & ATTR_SIZE) { >> > > > > dout("setattr %p size %lld -> %lld\n", inode, >> > > > > inode->i_size, attr->ia_size); >> > > > > -if ((issued & CEPH_CAP_FILE_EXCL) && >> > > > > -attr->ia_size > inode->i_size) { >> > > > > -i_size_write(inode, attr->ia_size); >> > > > > -inode->i_blocks = >> > > > > calc_inode_blocks(attr->ia_size); >> > > > > 
-ci->i_reported_size = attr->ia_size; >> > > > > -dirtied |= CEPH_CAP_FILE_EXCL; >> > > > > -ia_valid |= ATTR_MTIME; >> > > > > -} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 || >> > > > > - attr->ia_size != inode->i_size) { >> > > > > +if ((issued & >> > > > > (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) || >> > > > > +(attr->ia_size != inode->i_size)) { >> > > > > req->r_args.setattr.size = >> > > > > cpu_to_le64(attr->ia_size); >> > > > > req->r_args.setattr.old_size = >> > > > > cpu_to_le64(inode->i_size); >> > > > >> > > > Hmm...this makes truncates more expensive when we have caps. I'd rather >> > > > not do that if we can help it. >> > > >> > > Yeah, as I mentioned in the tracker, there's indeed a performance impact >> > > with this fix. That's what made me add the RFC in the subject ;-) >> > > >> > > > What about instead having the client mimic a fsync when there is a >> > > > rename across quota realms? If we can't tell that reliably then we >> > > > could >> > > > also just do an effective fsync ahead of any cross-directory rename? >> > > >> > > Ok, thanks for the suggestion. That may actually work, although it will >> > > make the rename more expensive of course. I'll test that tomorrow and >> > > eventually follow-up with a patch. >> > > >> > >> > Patrick po
Re: [PATCH] ceph: fix race in concurrent __ceph_remove_cap invocations
Jeff Layton writes:

> On Thu, 2020-11-12 at 20:43 +0800, Yan, Zheng wrote:
>> On Thu, Nov 12, 2020 at 6:48 PM Luis Henriques wrote:
>> >
>> > A NULL pointer dereference may occur in __ceph_remove_cap with some of the
>> > callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
>> > remove_session_caps_cb.  These aren't protected against the concurrent
>> > execution of __ceph_remove_cap.
>> >
>>
>> they are protected by session mutex, never get executed concurrently
>>
>
> Maybe not concurrently with one another, but the s_mutex is _not_ held
> when __ceph_remove_cap is called from ceph_evict_inode.  We can't rely
> on it to protect this.

Hmm, yeah.  I guess the changelog could mention that.  Thanks, Jeff.

Cheers,
--
Luis

>> > Since the callers of this function hold the i_ceph_lock, the fix is simply
>> > a matter of returning immediately if cap->ci is NULL.
>> >
>> > Based on a patch from Jeff Layton.
>> >
>> > Cc: sta...@vger.kernel.org
>> > URL: https://tracker.ceph.com/issues/43272
>> > Link: https://www.spinics.net/lists/ceph-devel/msg47064.html
>> > Signed-off-by: Luis Henriques
>> > ---
>> >  fs/ceph/caps.c | 11 +--
>> >  1 file changed, 9 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> > index ded4229c314a..443f164760d5 100644
>> > --- a/fs/ceph/caps.c
>> > +++ b/fs/ceph/caps.c
>> > @@ -1140,12 +1140,19 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
>> >  {
>> >         struct ceph_mds_session *session = cap->session;
>> >         struct ceph_inode_info *ci = cap->ci;
>> > -       struct ceph_mds_client *mdsc =
>> > -               ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
>> > +       struct ceph_mds_client *mdsc;
>> >         int removed = 0;
>> >
>> > +       /* 'ci' being NULL means the remove has already occurred */
>> > +       if (!ci) {
>> > +               dout("%s: cap inode is NULL\n", __func__);
>> > +               return;
>> > +       }
>> > +
>> >         dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
>> >
>> > +       mdsc = ceph_inode_to_client(&ci->vfs_inode)->mdsc;
>> > +
>> >         /* remove from inode's cap rbtree, and clear auth cap */
>> >         rb_erase(&cap->ci_node, &ci->i_caps);
>> >         if (ci->i_auth_cap == cap) {
>
> --
> Jeff Layton
>
[PATCH] ceph: downgrade warning from mdsmap decode to debug
While the MDS cluster is unstable and changing state the client may get
mdsmap updates that will trigger warnings:

  [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay)

This patch downgrades these warnings to debug, as they may flood the logs
if the cluster is unstable for a while.

Signed-off-by: Luis Henriques
---
Hi!

This patch follows from my other patch "ceph: fix race in concurrent
__ceph_remove_cap invocations", where I see a *lot* of warnings before the
NULL pointer.  Maybe this could be a pr_warn_once instead, but not sure
that would be useful.

Note that before commit 4d7ace02ba5c ("ceph: fix mdsmap cluster available
check based on laggy number") this was simply ignored without any pr_warn
or dout.

Cheers,
--
Luis

 fs/ceph/mdsmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c
index e4aba6c6d3b5..1096d1d3a84c 100644
--- a/fs/ceph/mdsmap.c
+++ b/fs/ceph/mdsmap.c
@@ -243,8 +243,8 @@ struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end)
 		}
 
 		if (state <= 0) {
-			pr_warn("mdsmap_decode got incorrect state(%s)\n",
-				ceph_mds_state_name(state));
+			dout("mdsmap_decode got incorrect state(%s)\n",
+			     ceph_mds_state_name(state));
 			continue;
 		}
 
[PATCH] ceph: fix race in concurrent __ceph_remove_cap invocations
A NULL pointer dereference may occur in __ceph_remove_cap with some of the
callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
remove_session_caps_cb.  These aren't protected against the concurrent
execution of __ceph_remove_cap.

Since the callers of this function hold the i_ceph_lock, the fix is simply
a matter of returning immediately if cap->ci is NULL.

Based on a patch from Jeff Layton.

Cc: sta...@vger.kernel.org
URL: https://tracker.ceph.com/issues/43272
Link: https://www.spinics.net/lists/ceph-devel/msg47064.html
Signed-off-by: Luis Henriques
---
 fs/ceph/caps.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index ded4229c314a..443f164760d5 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1140,12 +1140,19 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 {
 	struct ceph_mds_session *session = cap->session;
 	struct ceph_inode_info *ci = cap->ci;
-	struct ceph_mds_client *mdsc =
-		ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
+	struct ceph_mds_client *mdsc;
 	int removed = 0;
 
+	/* 'ci' being NULL means the remove has already occurred */
+	if (!ci) {
+		dout("%s: cap inode is NULL\n", __func__);
+		return;
+	}
+
 	dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
 
+	mdsc = ceph_inode_to_client(&ci->vfs_inode)->mdsc;
+
 	/* remove from inode's cap rbtree, and clear auth cap */
 	rb_erase(&cap->ci_node, &ci->i_caps);
 	if (ci->i_auth_cap == cap) {
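The shape of the fix (treat a NULL back-pointer as "already removed" and bail out before touching it) can be shown outside the kernel with a toy model. This is only a sketch: toy_inode, toy_cap and toy_remove_cap() are invented names, there is no locking here at all (the real function relies on its callers holding i_ceph_lock), and the counter stands in for the rbtree bookkeeping.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; not the real layouts. */
struct toy_inode {
	int ncaps;		/* how many caps still point at this inode */
};

struct toy_cap {
	struct toy_inode *ci;	/* NULL once the cap has been removed */
};

/*
 * Idempotent removal: the first call detaches the cap from its inode,
 * any later call sees ci == NULL and returns without dereferencing it.
 * Returns 1 if this call did the removal, 0 if it was already done.
 */
int toy_remove_cap(struct toy_cap *cap)
{
	struct toy_inode *ci = cap->ci;

	if (!ci)
		return 0;	/* a previous call already removed it */

	ci->ncaps--;
	cap->ci = NULL;
	return 1;
}
```

Calling toy_remove_cap() twice on the same cap decrements the inode's counter only once, which is the property the early return in the patch is after.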
Re: [RFC PATCH] ceph: fix cross quota realms renames with new truncated files
Jeff Layton writes:

> On Wed, 2020-11-11 at 18:28 +0000, Luis Henriques wrote:
>> Jeff Layton writes:
>>
>> > On Wed, 2020-11-11 at 15:39 +0000, Luis Henriques wrote:
>> > > When doing a rename across quota realms, there's a corner case that isn't
>> > > handled correctly. Here's a testcase:
>> > >
>> > >   mkdir files limit
>> > >   truncate files/file -s 10G
>> > >   setfattr limit -n ceph.quota.max_bytes -v 100
>> > >   mv files limit/
>> > >
>> > > The above will succeed because ftruncate(2) won't result in an immediate
>> > > notification of the MDSs with the new file size, and thus the quota realms
>> > > stats won't be updated.
>> > >
>> > > This patch forces a sync with the MDS every time there's an ATTR_SIZE that
>> > > sets a new i_size, even if we have Fx caps.
>> > >
>> > > Cc: sta...@vger.kernel.org
>> > > Fixes: dffdcd71458e ("ceph: allow rename operation under different quota realms")
>> > > URL: https://tracker.ceph.com/issues/36593
>> > > Signed-off-by: Luis Henriques
>> > > ---
>> > >  fs/ceph/inode.c | 11 ++---------
>> > >  1 file changed, 2 insertions(+), 9 deletions(-)
>> > >
>> > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> > > index 526faf4778ce..30e3f240ac96 100644
>> > > --- a/fs/ceph/inode.c
>> > > +++ b/fs/ceph/inode.c
>> > > @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr)
>> > >  	if (ia_valid & ATTR_SIZE) {
>> > >  		dout("setattr %p size %lld -> %lld\n", inode,
>> > >  		     inode->i_size, attr->ia_size);
>> > > -		if ((issued & CEPH_CAP_FILE_EXCL) &&
>> > > -		    attr->ia_size > inode->i_size) {
>> > > -			i_size_write(inode, attr->ia_size);
>> > > -			inode->i_blocks = calc_inode_blocks(attr->ia_size);
>> > > -			ci->i_reported_size = attr->ia_size;
>> > > -			dirtied |= CEPH_CAP_FILE_EXCL;
>> > > -			ia_valid |= ATTR_MTIME;
>> > > -		} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>> > > -			   attr->ia_size != inode->i_size) {
>> > > +		if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
>> > > +		    (attr->ia_size != inode->i_size)) {
>> > >  			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>> > >  			req->r_args.setattr.old_size =
>> > >  				cpu_to_le64(inode->i_size);
>> >
>> > Hmm...this makes truncates more expensive when we have caps. I'd rather
>> > not do that if we can help it.
>>
>> Yeah, as I mentioned in the tracker, there's indeed a performance impact
>> with this fix. That's what made me add the RFC in the subject ;-)
>>
>> > What about instead having the client mimic a fsync when there is a
>> > rename across quota realms? If we can't tell that reliably then we could
>> > also just do an effective fsync ahead of any cross-directory rename?
>>
>> Ok, thanks for the suggestion. That may actually work, although it will
>> make the rename more expensive of course. I'll test that tomorrow and
>> eventually follow up with a patch.
>>
>
> Patrick pointed out to me on IRC that since you're moving the parent
> directory of the truncated file, flushing the caps on the directory
> won't really help. You'd need to walk the entire subtree and try to
> flush every dirty inode, or basically do a syncfs() prior to renaming
> the directory across quotarealms.
>
> I think we probably will need to revert the change to allow
> cross-quotarealm renames of directories and make those return EXDEV again.
> Anything else sounds like it's probably going to be too expensive.

Hmm... that sounds a bit drastic and it would make the kernel client
behave differently from the fuse client -- from what I could understand
the fuse client does the sync ATTR_SIZE and thus doesn't have this issue.

Obviously, I agree with you that the performance penalty is too high for
such a common operation. But maybe renames across quotarealms aren't that
common and paying the penalty of doing a full ceph_flush_dirty_caps() is
acceptable for such cases?

Cheers,
--
Luis
Re: [RFC PATCH] ceph: fix cross quota realms renames with new truncated files
Jeff Layton writes:

> On Wed, 2020-11-11 at 15:39 +0000, Luis Henriques wrote:
>> When doing a rename across quota realms, there's a corner case that isn't
>> handled correctly. Here's a testcase:
>>
>>   mkdir files limit
>>   truncate files/file -s 10G
>>   setfattr limit -n ceph.quota.max_bytes -v 100
>>   mv files limit/
>>
>> The above will succeed because ftruncate(2) won't result in an immediate
>> notification of the MDSs with the new file size, and thus the quota realms
>> stats won't be updated.
>>
>> This patch forces a sync with the MDS every time there's an ATTR_SIZE that
>> sets a new i_size, even if we have Fx caps.
>>
>> Cc: sta...@vger.kernel.org
>> Fixes: dffdcd71458e ("ceph: allow rename operation under different quota realms")
>> URL: https://tracker.ceph.com/issues/36593
>> Signed-off-by: Luis Henriques
>> ---
>>  fs/ceph/inode.c | 11 ++---------
>>  1 file changed, 2 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 526faf4778ce..30e3f240ac96 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr)
>>  	if (ia_valid & ATTR_SIZE) {
>>  		dout("setattr %p size %lld -> %lld\n", inode,
>>  		     inode->i_size, attr->ia_size);
>> -		if ((issued & CEPH_CAP_FILE_EXCL) &&
>> -		    attr->ia_size > inode->i_size) {
>> -			i_size_write(inode, attr->ia_size);
>> -			inode->i_blocks = calc_inode_blocks(attr->ia_size);
>> -			ci->i_reported_size = attr->ia_size;
>> -			dirtied |= CEPH_CAP_FILE_EXCL;
>> -			ia_valid |= ATTR_MTIME;
>> -		} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>> -			   attr->ia_size != inode->i_size) {
>> +		if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
>> +		    (attr->ia_size != inode->i_size)) {
>>  			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>  			req->r_args.setattr.old_size =
>>  				cpu_to_le64(inode->i_size);
>
> Hmm...this makes truncates more expensive when we have caps. I'd rather
> not do that if we can help it.

Yeah, as I mentioned in the tracker, there's indeed a performance impact
with this fix. That's what made me add the RFC in the subject ;-)

> What about instead having the client mimic a fsync when there is a
> rename across quota realms? If we can't tell that reliably then we could
> also just do an effective fsync ahead of any cross-directory rename?

Ok, thanks for the suggestion. That may actually work, although it will
make the rename more expensive of course. I'll test that tomorrow and
eventually follow up with a patch.

Cheers,
--
Luis
[RFC PATCH] ceph: fix cross quota realms renames with new truncated files
When doing a rename across quota realms, there's a corner case that isn't
handled correctly. Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 100
  mv files limit/

The above will succeed because ftruncate(2) won't result in an immediate
notification of the MDSs with the new file size, and thus the quota realms
stats won't be updated.

This patch forces a sync with the MDS every time there's an ATTR_SIZE that
sets a new i_size, even if we have Fx caps.

Cc: sta...@vger.kernel.org
Fixes: dffdcd71458e ("ceph: allow rename operation under different quota realms")
URL: https://tracker.ceph.com/issues/36593
Signed-off-by: Luis Henriques
---
 fs/ceph/inode.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 526faf4778ce..30e3f240ac96 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr)
 	if (ia_valid & ATTR_SIZE) {
 		dout("setattr %p size %lld -> %lld\n", inode,
 		     inode->i_size, attr->ia_size);
-		if ((issued & CEPH_CAP_FILE_EXCL) &&
-		    attr->ia_size > inode->i_size) {
-			i_size_write(inode, attr->ia_size);
-			inode->i_blocks = calc_inode_blocks(attr->ia_size);
-			ci->i_reported_size = attr->ia_size;
-			dirtied |= CEPH_CAP_FILE_EXCL;
-			ia_valid |= ATTR_MTIME;
-		} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
-			   attr->ia_size != inode->i_size) {
+		if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
+		    (attr->ia_size != inode->i_size)) {
 			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
 			req->r_args.setattr.old_size =
 				cpu_to_le64(inode->i_size);
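For readers following the thread, the change to the ATTR_SIZE branch can be
modelled in plain user-space C. This is only a sketch: the flag values, the
enum and the function names below are made up for illustration and are not
the kernel's definitions; only the two conditions are taken from the diff
above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for CEPH_CAP_FILE_SHARED / CEPH_CAP_FILE_EXCL. */
#define CAP_FILE_SHARED 0x1u
#define CAP_FILE_EXCL   0x2u

enum size_action {
	SIZE_NOOP,       /* size branch does nothing */
	SIZE_LOCAL_ONLY, /* i_size updated locally, MDS is told later */
	SIZE_SYNC_MDS,   /* setattr request is sent to the MDS */
};

/* Condition as it was before the RFC patch. */
static enum size_action size_action_old(unsigned int issued,
					int64_t old_size, int64_t new_size)
{
	if ((issued & CAP_FILE_EXCL) && new_size > old_size)
		return SIZE_LOCAL_ONLY;
	if ((issued & CAP_FILE_SHARED) == 0 || new_size != old_size)
		return SIZE_SYNC_MDS;
	return SIZE_NOOP;
}

/* Condition with the RFC patch applied. */
static enum size_action size_action_rfc(unsigned int issued,
					int64_t old_size, int64_t new_size)
{
	if ((issued & (CAP_FILE_EXCL | CAP_FILE_SHARED)) ||
	    new_size != old_size)
		return SIZE_SYNC_MDS;
	return SIZE_NOOP;
}

/* The testcase from the commit message: extend an empty file to 10G
 * while holding the exclusive file cap. */
static void check_truncate_testcase(void)
{
	const int64_t ten_gib = 10LL << 30;

	assert(size_action_old(CAP_FILE_EXCL, 0, ten_gib) == SIZE_LOCAL_ONLY);
	assert(size_action_rfc(CAP_FILE_EXCL, 0, ten_gib) == SIZE_SYNC_MDS);
}
```

In this model the old code keeps the size extension local when the exclusive
cap is held, which is the window the testcase exploits, while the RFC version
always goes to the MDS in that situation. That is also where the extra cost
mentioned in the review comes from.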
Re: [PATCH] ceph: remove unnecessary return in switch statement
David Laight writes:

> From: Luis Henriques
>> Sent: 14 August 2020 10:38
>>
>> Since there's a return immediately after the 'break', there's no need for
>> this extra 'return' in the S_IFDIR case.
>>
>> Signed-off-by: Luis Henriques
>> ---
>>  fs/ceph/file.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index d51c3f2fdca0..04ab99c0223a 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -256,8 +256,6 @@ static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
>>  	case S_IFDIR:
>>  		ret = ceph_init_file_info(inode, file, fmode,
>>  						S_ISDIR(inode->i_mode));
>> -		if (ret)
>> -			return ret;
>>  		break;
>>
>>  	case S_IFLNK:
>
> I'd move the other way and just do:
> 	return ceph_init_file_info(...);

Sure, that would work too, although my preference would be to have a
single function exit point. But I'll leave that decision to Jeff :-)

Cheers,
--
Luis

>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
[PATCH] ceph: remove unnecessary return in switch statement
Since there's a return immediately after the 'break', there's no need for
this extra 'return' in the S_IFDIR case.

Signed-off-by: Luis Henriques
---
 fs/ceph/file.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d51c3f2fdca0..04ab99c0223a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -256,8 +256,6 @@ static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
 	case S_IFDIR:
 		ret = ceph_init_file_info(inode, file, fmode,
 						S_ISDIR(inode->i_mode));
-		if (ret)
-			return ret;
 		break;

 	case S_IFLNK:
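The two styles discussed for this cleanup (a single exit point versus
returning straight from the case) behave the same. A small user-space sketch,
assuming a made-up init_info() helper in place of ceph_init_file_info() and an
invented enum in place of the S_IF* mode bits:

```c
#include <assert.h>
#include <errno.h>

enum kind { KIND_REG, KIND_DIR, KIND_LNK };

/* Hypothetical helper standing in for ceph_init_file_info(). */
static int init_info(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Single exit point: the case only sets ret and breaks. */
static int init_single_exit(enum kind k, int fail)
{
	int ret = 0;

	switch (k) {
	case KIND_DIR:
		ret = init_info(fail);
		break;
	case KIND_REG:
	case KIND_LNK:
		break;
	}
	return ret;
}

/* Alternative from the review: return directly from the case. */
static int init_early_return(enum kind k, int fail)
{
	switch (k) {
	case KIND_DIR:
		return init_info(fail);
	case KIND_REG:
	case KIND_LNK:
		break;
	}
	return 0;
}

/* Both variants must agree for every input of this small model. */
static void check_equivalence(void)
{
	for (int k = KIND_REG; k <= KIND_LNK; k++)
		for (int fail = 0; fail <= 1; fail++)
			assert(init_single_exit((enum kind)k, fail) ==
			       init_early_return((enum kind)k, fail));
}
```

Since the results are identical, the choice between the two is a matter of
style, which is how the thread treats it.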
drm: list_add corruption followed by BUG (stack guard page was hit)
Hi! I've just got the following WARNING followed by a BUG on rc7. Maybe it's already a known issue, but here it is anyway. Cheers, -- Luis [ 38.001304] [ cut here ] [ 38.001312] list_add corruption. prev->next should be next (8fe713397b88), but was . (prev=8fe715fb9140). [ 38.001337] WARNING: CPU: 3 PID: 501 at lib/list_debug.c:26 __list_add_valid+0x4d/0x70 [ 38.001340] Modules linked in: cdc_ether(E) usbnet(E) r8152(E) mii(E) hid_generic(E) usbhid(E) snd_hda_codec_hdmi(E) iwlmvm(E) dell_rbtn(E) mac80211(E) libarc4(E) snd_hda_codec_realtek(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_hda_codec_generic(E) coretemp(E) mei_wdt(E) dell_laptop(E) kvm_intel(E) ledtrig_audio(E) intel_rapl_msr(E) dell_smm_hwmon(E) snd_hda_intel(E) kvm(E) uvcvideo(E) snd_intel_dspcfg(E) videobuf2_vmalloc(E) irqbypass(E) iwlwifi(E) videobuf2_memops(E) rapl(E) snd_hda_codec(E) videobuf2_v4l2(E) intel_cstate(E) dell_wmi(E) joydev(E) snd_hwdep(E) pcspkr(E) serio_raw(E) intel_uncore(E) dell_smbios(E) videobuf2_common(E) dcdbas(E) snd_hda_core(E) iTCO_wdt(E) snd_pcm(E) videodev(E) wmi_bmof(E) snd_timer(E) dell_wmi_descriptor(E) intel_wmi_thunderbolt(E) iTCO_vendor_support(E) mei_me(E) snd(E) cfg80211(E) soundcore(E) tpm_crb(E) mc(E) rfkill(E) mei(E) intel_pch_thermal(E) sg(E) processor_thermal_device(E) intel_rapl_common(E) intel_soc_dts_iosf(E) battery(E) tpm_tis(E) [ 38.001397] int3403_thermal(E) tpm_tis_core(E) tpm(E) dell_smo8800(E) intel_hid(E) evdev(E) rng_core(E) int3400_thermal(E) int3402_thermal(E) acpi_thermal_rel(E) int340x_thermal_zone(E) sparse_keymap(E) acpi_pad(E) ac(E) nft_counter(E) nft_ct(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) i2c_dev(E) nf_tables(E) parport_pc(E) ppdev(E) nfnetlink(E) lp(E) parport(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) dm_crypt(E) cbc(E) encrypted_keys(E) dm_mod(E) sd_mod(E) t10_pi(E) rtsx_pci_sdmmc(E) crct10dif_pclmul(E) 
crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) mmc_core(E) aesni_intel(E) crypto_simd(E) cryptd(E) glue_helper(E) ahci(E) nouveau(E) i915(E) mxm_wmi(E) i2c_i801(E) i2c_smbus(E) libahci(E) ttm(E) i2c_algo_bit(E) rtsx_pci(E) xhci_pci(E) drm_kms_helper(E) intel_lpss_pci(E) libata(E) syscopyarea(E) intel_lpss(E) xhci_hcd(E) idma64(E) sysfillrect(E) virt_dma(E) [ 38.001461] sysimgblt(E) scsi_mod(E) fb_sys_fops(E) mfd_core(E) usbcore(E) usb_common(E) drm(E) fan(E) thermal(E) i2c_hid(E) hid(E) wmi(E) video(E) button(E) [ 38.001482] CPU: 3 PID: 501 Comm: kworker/3:4 Tainted: GE 5.8.0-rc7 #43 [ 38.001485] Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.14.2 05/25/2020 [ 38.001513] Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper] [ 38.001521] RIP: 0010:__list_add_valid+0x4d/0x70 [ 38.001527] Code: c3 4c 89 c1 48 c7 c7 98 34 ed af e8 7f 16 c9 ff 0f 0b 31 c0 c3 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 e8 34 ed af e8 65 16 c9 ff <0f> 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 38 35 ed af e8 [ 38.001530] RSP: 0018:a47680417ca0 EFLAGS: 00010286 [ 38.001535] RAX: RBX: 8fe713397b88 RCX: 0027 [ 38.001538] RDX: 0027 RSI: 0096 RDI: 8fe71e197b68 [ 38.001541] RBP: 8fe7133977e8 R08: 8fe71e197b60 R09: 0084 [ 38.001544] R10: a47680417b48 R11: a47680417b4d R12: 8fe71869f800 [ 38.001547] R13: 8fe715fb9140 R14: 8fe71869f940 R15: 8fe713397b68 [ 38.001551] FS: () GS:8fe71e18() knlGS: [ 38.001555] CS: 0010 DS: ES: CR0: 80050033 [ 38.001558] CR2: 7f9e3f7159d0 CR3: 00040e60a004 CR4: 003606e0 [ 38.001561] DR0: DR1: DR2: [ 38.001563] DR3: DR6: fffe0ff0 DR7: 0400 [ 38.001566] Call Trace: [ 38.001593] drm_dp_queue_down_tx+0x5c/0x110 [drm_kms_helper] [ 38.001604] ? i2c_register_adapter+0x1d0/0x390 [ 38.001627] drm_dp_send_enum_path_resources+0x54/0x120 [drm_kms_helper] [ 38.001650] drm_dp_send_link_address+0x682/0x990 [drm_kms_helper] [ 38.001657] ? prepare_to_wait_event+0x7e/0x150 [ 38.001661] ? 
finish_wait+0x3f/0x80 [ 38.001684] drm_dp_check_and_send_link_address+0xad/0xd0 [drm_kms_helper] [ 38.001707] drm_dp_mst_link_probe_work+0x94/0x180 [drm_kms_helper] [ 38.001714] process_one_work+0x1ae/0x370 [ 38.001720] worker_thread+0x50/0x3a0 [ 38.001725] ? process_one_work+0x370/0x370 [ 38.001729] kthread+0x11b/0x140 [ 38.001734] ? kthread_associate_blkcg+0x90/0x90 [ 38.001741] ret_from_fork+0x22/0x30 [ 38.001747] ---[ end trace ca03f107384f1adc ]--- [ 38.001759] BUG: stack guard page was hit at 62c9d455 (stack is e3f6f298..86ee600f) [ 38.001766] kernel stack
Re: Warning triggered in drm_dp_delayed_destroy_work workqueue
On Fri, Jun 26, 2020 at 05:06:00PM +0300, Ville Syrjälä wrote: > On Fri, Jun 26, 2020 at 03:40:20PM +0200, Daniel Vetter wrote: > > Adding Lyude, she's been revamping all the lifetime refcouting in the > > dp code last few kernel releases. At a glance I don't even have an > > idea what's going wrong here ... > > Already fixed by Imre I believe. > > 7d11507605a7 ("drm/dp_mst: Fix the DDC I2C device unregistration of an MST > port") > Ah! It does seems to be the same issue indeed! Thanks a lot for pointing me at this commit. Hopefully this fix can be included in 5.8. Not that I'm seeing this WARNING frequently, but frequent enough to annoy me :-) Cheers, -- Luis > > -Daniel > > > > On Thu, Jun 25, 2020 at 12:22 PM Luis Henriques wrote: > > > > > > Hi! > > > > > > I've been seeing this warning occasionally, not sure if it has been > > > reported yet. It's not a regression as I remember seeing it in, at least, > > > 5.7. > > > > > > Anyway, here it is: > > > > > > [ cut here ] > > > sysfs group 'power' not found for kobject 'i2c-7' > > > WARNING: CPU: 1 PID: 17996 at fs/sysfs/group.c:279 > > > sysfs_remove_group+0x74/0x80 > > > Modules linked in: ccm(E) dell_rbtn(E) iwlmvm(E) mei_wdt(E) mac80211(E) > > > libarc4(E) uvcvideo(E) dell_laptop(E) videobuf2_vmalloc(E) intel_rapl_> > > > soundcore(E) intel_soc_dts_iosf(E) rng_core(E) battery(E) acpi_pad(E) > > > sparse_keymap(E) acpi_thermal_rel(E) intel_pch_thermal(E) int3402_therm> > > > sysfillrect(E) intel_lpss(E) sysimgblt(E) fb_sys_fops(E) idma64(E) > > > scsi_mod(E) virt_dma(E) mfd_core(E) drm(E) fan(E) thermal(E) i2c_hid(E) > > > hi> > > > CPU: 1 PID: 17996 Comm: kworker/1:1 Tainted: GE > > > 5.8.0-rc2+ #36 > > > Hardware name: Dell Inc. 
Precision 5510/0N8J4R, BIOS 1.14.2 05/25/2020 > > > Workqueue: events drm_dp_delayed_destroy_work [drm_kms_helper] > > > RIP: 0010:sysfs_remove_group+0x74/0x80 > > > Code: ff 5b 48 89 ef 5d 41 5c e9 79 bc ff ff 48 89 ef e8 01 b8 ff ff eb > > > cc 49 8b 14 24 48 8b 33 48 c7 c7 90 ac 8b 93 e8 de b1 d4 ff <0f> 0b 5b> > > > RSP: :b12d40c13c38 EFLAGS: 00010282 > > > RAX: RBX: 936e6a60 RCX: 0027 > > > RDX: 0027 RSI: 0086 RDI: 8e37de097b68 > > > RBP: R08: 8e37de097b60 R09: 93fb4624 > > > R10: 0904 R11: 0001002c R12: 8e37d3081c18 > > > R13: 8e375f1450a8 R14: R15: 8e375f145410 > > > FS: () GS:8e37de08() > > > knlGS: > > > CS: 0010 DS: ES: CR0: 80050033 > > > CR2: CR3: 0004ab20a001 CR4: 003606e0 > > > DR0: DR1: DR2: > > > DR3: DR6: fffe0ff0 DR7: 0400 > > > Call Trace: > > > device_del+0x97/0x3f0 > > > cdev_device_del+0x15/0x30 > > > put_i2c_dev+0x7b/0x90 [i2c_dev] > > > i2cdev_detach_adapter+0x33/0x60 [i2c_dev] > > > notifier_call_chain+0x47/0x70 > > > blocking_notifier_call_chain+0x3d/0x60 > > > device_del+0x8f/0x3f0 > > > device_unregister+0x16/0x60 > > > i2c_del_adapter+0x247/0x300 > > > drm_dp_port_set_pdt+0x90/0x2c0 [drm_kms_helper] > > > drm_dp_delayed_destroy_work+0x2be/0x340 [drm_kms_helper] > > > process_one_work+0x1ae/0x370 > > > worker_thread+0x50/0x3a0 > > > ? process_one_work+0x370/0x370 > > > kthread+0x11b/0x140 > > > ? kthread_associate_blkcg+0x90/0x90 > > > ret_from_fork+0x22/0x30 > > > ---[ end trace 16486ad3c2627482 ]--- > > > [ cut here ] > > > > > > Cheers, > > > -- > > > Luis > > > > > > > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch > > ___ > > dri-devel mailing list > > dri-de...@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/dri-devel > > -- > Ville Syrjälä > Intel
Warning triggered in drm_dp_delayed_destroy_work workqueue
Hi! I've been seeing this warning occasionally, not sure if it has been reported yet. It's not a regression as I remember seeing it in, at least, 5.7. Anyway, here it is: [ cut here ] sysfs group 'power' not found for kobject 'i2c-7' WARNING: CPU: 1 PID: 17996 at fs/sysfs/group.c:279 sysfs_remove_group+0x74/0x80 Modules linked in: ccm(E) dell_rbtn(E) iwlmvm(E) mei_wdt(E) mac80211(E) libarc4(E) uvcvideo(E) dell_laptop(E) videobuf2_vmalloc(E) intel_rapl_> soundcore(E) intel_soc_dts_iosf(E) rng_core(E) battery(E) acpi_pad(E) sparse_keymap(E) acpi_thermal_rel(E) intel_pch_thermal(E) int3402_therm> sysfillrect(E) intel_lpss(E) sysimgblt(E) fb_sys_fops(E) idma64(E) scsi_mod(E) virt_dma(E) mfd_core(E) drm(E) fan(E) thermal(E) i2c_hid(E) hi> CPU: 1 PID: 17996 Comm: kworker/1:1 Tainted: GE 5.8.0-rc2+ #36 Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.14.2 05/25/2020 Workqueue: events drm_dp_delayed_destroy_work [drm_kms_helper] RIP: 0010:sysfs_remove_group+0x74/0x80 Code: ff 5b 48 89 ef 5d 41 5c e9 79 bc ff ff 48 89 ef e8 01 b8 ff ff eb cc 49 8b 14 24 48 8b 33 48 c7 c7 90 ac 8b 93 e8 de b1 d4 ff <0f> 0b 5b> RSP: :b12d40c13c38 EFLAGS: 00010282 RAX: RBX: 936e6a60 RCX: 0027 RDX: 0027 RSI: 0086 RDI: 8e37de097b68 RBP: R08: 8e37de097b60 R09: 93fb4624 R10: 0904 R11: 0001002c R12: 8e37d3081c18 R13: 8e375f1450a8 R14: R15: 8e375f145410 FS: () GS:8e37de08() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0004ab20a001 CR4: 003606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: device_del+0x97/0x3f0 cdev_device_del+0x15/0x30 put_i2c_dev+0x7b/0x90 [i2c_dev] i2cdev_detach_adapter+0x33/0x60 [i2c_dev] notifier_call_chain+0x47/0x70 blocking_notifier_call_chain+0x3d/0x60 device_del+0x8f/0x3f0 device_unregister+0x16/0x60 i2c_del_adapter+0x247/0x300 drm_dp_port_set_pdt+0x90/0x2c0 [drm_kms_helper] drm_dp_delayed_destroy_work+0x2be/0x340 [drm_kms_helper] process_one_work+0x1ae/0x370 worker_thread+0x50/0x3a0 ? process_one_work+0x370/0x370 kthread+0x11b/0x140 ? 
kthread_associate_blkcg+0x90/0x90 ret_from_fork+0x22/0x30 ---[ end trace 16486ad3c2627482 ]--- [ cut here ] Cheers, -- Luis
[PATCH v2] ceph: don't return -ESTALE if there's still an open file
Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at. The issue has been detected by
xfstest generic/467, when doing:

 - name_to_handle_at("/cephfs/myfile")
 - open("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - sync; sync;
 - drop caches
 - open_by_handle_at()

The call to open_by_handle_at should not fail because the file hasn't been
deleted yet (only unlinked) and we do have a valid handle to it. -ESTALE
shall be returned only if i_nlink is 0 *and* i_count is 1.

This patch also makes sure we have LINK caps before checking i_nlink.

Signed-off-by: Luis Henriques
---
Hi!

(and sorry for the delay in my reply!)

So, from the discussion thread and some IRC chat with Jeff, I'm sending
v2. What changed? Everything! :-)

- Use i_count instead of __ceph_is_file_opened to check for open files
- Add call to ceph_do_getattr to make sure we have LINK caps before
  accessing i_nlink

Cheers,
--
Luis

 fs/ceph/export.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index 79dc06881e78..e088843a7734 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -172,9 +172,16 @@ struct inode *ceph_lookup_inode(struct super_block *sb, u64 ino)
 static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
 {
 	struct inode *inode = __lookup_inode(sb, ino);
+	int err;
+
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
-	if (inode->i_nlink == 0) {
+	/* We need LINK caps to reliably check i_nlink */
+	err = ceph_do_getattr(inode, CEPH_CAP_LINK_SHARED, false);
+	if (err)
+		return ERR_PTR(err);
+	/* -ESTALE if inode has been unlinked and no file is open */
+	if ((inode->i_nlink == 0) && (atomic_read(&inode->i_count) == 1)) {
 		iput(inode);
 		return ERR_PTR(-ESTALE);
 	}
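The rule stated in the commit message (-ESTALE only if i_nlink is 0 *and*
i_count is 1) can be written down as a tiny user-space check. This is a
sketch, not the kernel code: fh_check() and its parameters are invented
names, and the reference count is passed in as a plain integer.

```c
#include <assert.h>
#include <errno.h>

/*
 * Returns 0 when the handle may still be turned into a dentry and
 * -ESTALE otherwise.  nlink models i_nlink, refcount models i_count.
 */
static int fh_check(unsigned int nlink, int refcount)
{
	/* The lookup itself holds one reference, so 1 means "nobody else". */
	assert(refcount >= 1);

	if (nlink == 0 && refcount == 1)
		return -ESTALE;
	return 0;
}
```

With this model, an unlinked file that is still open elsewhere
(fh_check(0, 2)) remains reachable through its handle, while an unlinked file
with no other user (fh_check(0, 1)) yields -ESTALE, matching the two
scenarios discussed in the thread.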
Re: [PATCH] ceph: don't return -ESTALE if there's still an open file
On Fri, May 15, 2020 at 09:42:24AM +0300, Amir Goldstein wrote:
> +CC: fstests
>
> On Thu, May 14, 2020 at 4:15 PM Jeff Layton wrote:
> >
> > On Thu, 2020-05-14 at 13:48 +0100, Luis Henriques wrote:
> > > On Thu, May 14, 2020 at 08:10:09AM -0400, Jeff Layton wrote:
> > > > On Thu, 2020-05-14 at 12:14 +0100, Luis Henriques wrote:
> > > > > Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
> > > > > a file handle to dentry"), this fixes another corner case with
> > > > > name_to_handle_at/open_by_handle_at. The issue has been detected by
> > > > > xfstest generic/467, when doing:
> > > > >
> > > > >  - name_to_handle_at("/cephfs/myfile")
> > > > >  - open("/cephfs/myfile")
> > > > >  - unlink("/cephfs/myfile")
> > > > >  - open_by_handle_at()
> > > > >
> > > > > The call to open_by_handle_at should not fail because the file still
> > > > > exists and we do have a valid handle to it.
> > > > >
> > > > > Signed-off-by: Luis Henriques
> > > > > ---
> > > > >  fs/ceph/export.c | 13 +++++++++++--
> > > > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> > > > > index 79dc06881e78..8556df9d94d0 100644
> > > > > --- a/fs/ceph/export.c
> > > > > +++ b/fs/ceph/export.c
> > > > > @@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct super_block *sb, u64 ino)
> > > > >
> > > > >  static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
> > > > >  {
> > > > > +	struct ceph_inode_info *ci;
> > > > >  	struct inode *inode = __lookup_inode(sb, ino);
> > > > > +
> > > > >  	if (IS_ERR(inode))
> > > > >  		return ERR_CAST(inode);
> > > > >  	if (inode->i_nlink == 0) {
> > > > > -		iput(inode);
> > > > > -		return ERR_PTR(-ESTALE);
> > > > > +		bool is_open;
> > > > > +		ci = ceph_inode(inode);
> > > > > +		spin_lock(&ci->i_ceph_lock);
> > > > > +		is_open = __ceph_is_file_opened(ci);
> > > > > +		spin_unlock(&ci->i_ceph_lock);
> > > > > +		if (!is_open) {
> > > > > +			iput(inode);
> > > > > +			return ERR_PTR(-ESTALE);
> > > > > +		}
> > > > >  	}
> > > > >  	return d_obtain_alias(inode);
> > > > >  }
> > > >
> > > > Thanks Luis. Out of curiosity, is there any reason we shouldn't ignore
> > > > the i_nlink value here? Does anything obviously break if we do?
> > >
> > > Yes, the scenario described in commit 03f219041fdb is still valid, which
> > > is basically the same but without the extra open(2):
> > >
> > >  - name_to_handle_at("/cephfs/myfile")
> > >  - unlink("/cephfs/myfile")
> > >  - open_by_handle_at()
> > >
> >
> > Ok, I guess we end up doing some delayed cleanup, and that allows the
> > inode to be found in that situation.
> >
> > > The open_by_handle_at man page isn't really clear about these 2 scenarios,
> > > but generic/426 will fail if -ESTALE isn't returned. Want me to add a
> > > comment to the code, describing these 2 scenarios?
> > >
> >
> > (cc'ing Amir since he added this test)
> >
> > I don't think there is any hard requirement that open_by_handle_at
> > should fail in that situation. It generally does for most filesystems
> > due to the way they handle cleaning up unlinked inodes, but I don't
> > think it's technically illegal to allow the inode to still be found. If
> > the caller cares about whether it has been unlinked it can always test
> > i_nlink itself.
> >
> > Amir, is this required for some reason that I'm not aware of?
>
> Hi Jeff,
>
> The origin of this test is in fstests commit:
> 794798fa xfsqa: test open_by_handle() on unlinked and freed inode clusters
>
> It was introduced to catch an xfs bug, so this behavior is the expectation
> of xfs filesystem, but note that it is not a general expecta
Re: [PATCH] ceph: don't return -ESTALE if there's still an open file
On Thu, May 14, 2020 at 08:10:09AM -0400, Jeff Layton wrote:
> On Thu, 2020-05-14 at 12:14 +0100, Luis Henriques wrote:
> > Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
> > a file handle to dentry"), this fixes another corner case with
> > name_to_handle_at/open_by_handle_at. The issue has been detected by
> > xfstest generic/467, when doing:
> >
> >  - name_to_handle_at("/cephfs/myfile")
> >  - open("/cephfs/myfile")
> >  - unlink("/cephfs/myfile")
> >  - open_by_handle_at()
> >
> > The call to open_by_handle_at should not fail because the file still
> > exists and we do have a valid handle to it.
> >
> > Signed-off-by: Luis Henriques
> > ---
> >  fs/ceph/export.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> > index 79dc06881e78..8556df9d94d0 100644
> > --- a/fs/ceph/export.c
> > +++ b/fs/ceph/export.c
> > @@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct super_block *sb, u64 ino)
> >
> >  static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
> >  {
> > +	struct ceph_inode_info *ci;
> >  	struct inode *inode = __lookup_inode(sb, ino);
> > +
> >  	if (IS_ERR(inode))
> >  		return ERR_CAST(inode);
> >  	if (inode->i_nlink == 0) {
> > -		iput(inode);
> > -		return ERR_PTR(-ESTALE);
> > +		bool is_open;
> > +		ci = ceph_inode(inode);
> > +		spin_lock(&ci->i_ceph_lock);
> > +		is_open = __ceph_is_file_opened(ci);
> > +		spin_unlock(&ci->i_ceph_lock);
> > +		if (!is_open) {
> > +			iput(inode);
> > +			return ERR_PTR(-ESTALE);
> > +		}
> >  	}
> >  	return d_obtain_alias(inode);
> >  }
>
> Thanks Luis. Out of curiosity, is there any reason we shouldn't ignore
> the i_nlink value here? Does anything obviously break if we do?

Yes, the scenario described in commit 03f219041fdb is still valid, which
is basically the same but without the extra open(2):

 - name_to_handle_at("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - open_by_handle_at()

The open_by_handle_at man page isn't really clear about these 2 scenarios,
but generic/426 will fail if -ESTALE isn't returned. Want me to add a
comment to the code, describing these 2 scenarios?

Cheers,
--
Luis
[PATCH] ceph: don't return -ESTALE if there's still an open file
Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at. The issue has been detected by
xfstest generic/467, when doing:

 - name_to_handle_at("/cephfs/myfile")
 - open("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - open_by_handle_at()

The call to open_by_handle_at should not fail because the file still
exists and we do have a valid handle to it.

Signed-off-by: Luis Henriques
---
 fs/ceph/export.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index 79dc06881e78..8556df9d94d0 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct super_block *sb, u64 ino)

 static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
 {
+	struct ceph_inode_info *ci;
 	struct inode *inode = __lookup_inode(sb, ino);
+
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
 	if (inode->i_nlink == 0) {
-		iput(inode);
-		return ERR_PTR(-ESTALE);
+		bool is_open;
+		ci = ceph_inode(inode);
+		spin_lock(&ci->i_ceph_lock);
+		is_open = __ceph_is_file_opened(ci);
+		spin_unlock(&ci->i_ceph_lock);
+		if (!is_open) {
+			iput(inode);
+			return ERR_PTR(-ESTALE);
+		}
 	}
 	return d_obtain_alias(inode);
 }
[PATCH] ceph: demote quotarealm lookup warning to a debug message
A misconfigured cephx can easily result in having the kernel client
flooding the logs with:

  ceph: Can't lookup inode 1 (err: -13)

Change this message to debug level.

Link: https://tracker.ceph.com/issues/44546
Signed-off-by: Luis Henriques
---
Hi!

This patch should fix some harmless warnings when using cephx to restrict
users' access to certain filesystem paths. I've added a comment to the
tracker noting that removing this warning could (unlikely, IMHO!) cause an
admin to miss not-so-harmless bogus configurations.

Cheers,
--
Luís

 fs/ceph/quota.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index de56dee60540..19507e2fdb57 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -159,8 +159,8 @@ static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc,
 	}

 	if (IS_ERR(in)) {
-		pr_warn("Can't lookup inode %llx (err: %ld)\n",
-			realm->ino, PTR_ERR(in));
+		dout("Can't lookup inode %llx (err: %ld)\n",
+		     realm->ino, PTR_ERR(in));
 		qri->timeout = jiffies + msecs_to_jiffies(60 * 1000); /* XXX */
 	} else {
 		qri->timeout = 0;
Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap
Luis Henriques writes: > On Tue, Oct 22, 2019 at 08:48:56PM +0800, Yan, Zheng wrote: >> On Mon, Oct 21, 2019 at 10:55 PM Luis Henriques wrote: >> >> > >> > Jeff Layton writes: >> > >> > > On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote: >> > >> KASAN reports a use-after-free when running xfstest generic/531, with >> > the >> > >> following trace: >> > >> >> > >> [ 293.903362] kasan_report+0xe/0x20 >> > >> [ 293.903365] rb_erase+0x1f/0x790 >> > >> [ 293.903370] __ceph_remove_cap+0x201/0x370 >> > >> [ 293.903375] __ceph_remove_caps+0x4b/0x70 >> > >> [ 293.903380] ceph_evict_inode+0x4e/0x360 >> > >> [ 293.903386] evict+0x169/0x290 >> > >> [ 293.903390] __dentry_kill+0x16f/0x250 >> > >> [ 293.903394] dput+0x1c6/0x440 >> > >> [ 293.903398] __fput+0x184/0x330 >> > >> [ 293.903404] task_work_run+0xb9/0xe0 >> > >> [ 293.903410] exit_to_usermode_loop+0xd3/0xe0 >> > >> [ 293.903413] do_syscall_64+0x1a0/0x1c0 >> > >> [ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> > >> >> > >> This happens because __ceph_remove_cap() may queue a cap release >> > >> (__ceph_queue_cap_release) which can be scheduled before that cap is >> > >> removed from the inode list with >> > >> >> > >> rb_erase(>ci_node, >i_caps); >> > >> >> > >> And, when this finally happens, the use-after-free will occur. >> > >> >> > >> This can be fixed by protecting the rb_erase with the s_cap_lock >> > spinlock, >> > >> which is used by ceph_send_cap_releases(), before the cap is freed. 
>> > >> >> > >> Signed-off-by: Luis Henriques >> > >> --- >> > >> fs/ceph/caps.c | 4 ++-- >> > >> 1 file changed, 2 insertions(+), 2 deletions(-) >> > >> >> > >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c >> > >> index d3b9c9d5c1bd..21ee38cabe98 100644 >> > >> --- a/fs/ceph/caps.c >> > >> +++ b/fs/ceph/caps.c >> > >> @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, >> > bool queue_release) >> > >> } >> > >> cap->cap_ino = ci->i_vino.ino; >> > >> >> > >> -spin_unlock(>s_cap_lock); >> > >> - >> > >> /* remove from inode list */ >> > >> rb_erase(>ci_node, >i_caps); >> > >> if (ci->i_auth_cap == cap) >> > >> ci->i_auth_cap = NULL; >> > >> >> > >> +spin_unlock(>s_cap_lock); >> > >> + >> > >> if (removed) >> > >> ceph_put_cap(mdsc, cap); >> > >> >> > > >> > > Is there any reason we need to wait until this point to remove it from >> > > the rbtree? ISTM that we ought to just do that at the beginning of the >> > > function, before we take the s_cap_lock. >> > >> > That sounds good to me, at least at a first glace. I spent some time >> > looking for any possible issues in the code, and even run a few tests. >> > >> > However, looking at git log I found commit f818a73674c5 ("ceph: fix cap >> > removal races"), which moved that rb_erase from the beginning of the >> > function to it's current position. So, unless the race mentioned in >> > this commit has disappeared in the meantime (which is possible, this >> > commit is from 2010!), this rbtree operation shouldn't be changed. >> > >> > And I now wonder if my patch isn't introducing a race too... >> > __ceph_remove_cap() is supposed to always be called with the session >> > mutex held, except for the ceph_evict_inode() path. Which is where I'm >> > seeing the UAF. So, maybe what's missing here is the s_mutex. Hmm... >> > >> > >> we can't lock s_mutex here, because i_ceph_lock is locked > > Well, my idea wasn't to get s_mutex here but earlier in the stack. 
> Maybe in ceph_evict_inode, protecting the call to __ceph_remove_caps.
> But I didn't really look into that yet, so I'm not really sure if
> that's feasible (or even if that would fix this UAF). I suspect that's
> not possible anyway, due to the comment above __ceph_remove_cap:
>
>   caller will not hold session s_mutex if called from destroy_inode.

Ok, I looked into that now and obviously that's not possible. So, I guess my original patch is still the best option.

Cheers,
--
Luis
Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap
On Tue, Oct 22, 2019 at 08:48:56PM +0800, Yan, Zheng wrote: > On Mon, Oct 21, 2019 at 10:55 PM Luis Henriques wrote: > > > > > Jeff Layton writes: > > > > > On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote: > > >> KASAN reports a use-after-free when running xfstest generic/531, with > > the > > >> following trace: > > >> > > >> [ 293.903362] kasan_report+0xe/0x20 > > >> [ 293.903365] rb_erase+0x1f/0x790 > > >> [ 293.903370] __ceph_remove_cap+0x201/0x370 > > >> [ 293.903375] __ceph_remove_caps+0x4b/0x70 > > >> [ 293.903380] ceph_evict_inode+0x4e/0x360 > > >> [ 293.903386] evict+0x169/0x290 > > >> [ 293.903390] __dentry_kill+0x16f/0x250 > > >> [ 293.903394] dput+0x1c6/0x440 > > >> [ 293.903398] __fput+0x184/0x330 > > >> [ 293.903404] task_work_run+0xb9/0xe0 > > >> [ 293.903410] exit_to_usermode_loop+0xd3/0xe0 > > >> [ 293.903413] do_syscall_64+0x1a0/0x1c0 > > >> [ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > >> > > >> This happens because __ceph_remove_cap() may queue a cap release > > >> (__ceph_queue_cap_release) which can be scheduled before that cap is > > >> removed from the inode list with > > >> > > >> rb_erase(>ci_node, >i_caps); > > >> > > >> And, when this finally happens, the use-after-free will occur. > > >> > > >> This can be fixed by protecting the rb_erase with the s_cap_lock > > spinlock, > > >> which is used by ceph_send_cap_releases(), before the cap is freed. 
> > >> > > >> Signed-off-by: Luis Henriques > > >> --- > > >> fs/ceph/caps.c | 4 ++-- > > >> 1 file changed, 2 insertions(+), 2 deletions(-) > > >> > > >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c > > >> index d3b9c9d5c1bd..21ee38cabe98 100644 > > >> --- a/fs/ceph/caps.c > > >> +++ b/fs/ceph/caps.c > > >> @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, > > bool queue_release) > > >> } > > >> cap->cap_ino = ci->i_vino.ino; > > >> > > >> -spin_unlock(>s_cap_lock); > > >> - > > >> /* remove from inode list */ > > >> rb_erase(>ci_node, >i_caps); > > >> if (ci->i_auth_cap == cap) > > >> ci->i_auth_cap = NULL; > > >> > > >> +spin_unlock(>s_cap_lock); > > >> + > > >> if (removed) > > >> ceph_put_cap(mdsc, cap); > > >> > > > > > > Is there any reason we need to wait until this point to remove it from > > > the rbtree? ISTM that we ought to just do that at the beginning of the > > > function, before we take the s_cap_lock. > > > > That sounds good to me, at least at a first glace. I spent some time > > looking for any possible issues in the code, and even run a few tests. > > > > However, looking at git log I found commit f818a73674c5 ("ceph: fix cap > > removal races"), which moved that rb_erase from the beginning of the > > function to it's current position. So, unless the race mentioned in > > this commit has disappeared in the meantime (which is possible, this > > commit is from 2010!), this rbtree operation shouldn't be changed. > > > > And I now wonder if my patch isn't introducing a race too... > > __ceph_remove_cap() is supposed to always be called with the session > > mutex held, except for the ceph_evict_inode() path. Which is where I'm > > seeing the UAF. So, maybe what's missing here is the s_mutex. Hmm... > > > > > we can't lock s_mutex here, because i_ceph_lock is locked Well, my idea wasn't to get s_mutex here but earlier in the stack. Maybe in ceph_evict_inode, protecting the call to __ceph_remove_caps. 
But I didn't really look into that yet, so I'm not really sure if that's feasible (or even if that would fix this UAF). I suspect that's not possible anyway, due to the comment above __ceph_remove_cap:

  caller will not hold session s_mutex if called from destroy_inode.

Cheers,
--
Luís
Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap
Jeff Layton writes: > On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote: >> KASAN reports a use-after-free when running xfstest generic/531, with the >> following trace: >> >> [ 293.903362] kasan_report+0xe/0x20 >> [ 293.903365] rb_erase+0x1f/0x790 >> [ 293.903370] __ceph_remove_cap+0x201/0x370 >> [ 293.903375] __ceph_remove_caps+0x4b/0x70 >> [ 293.903380] ceph_evict_inode+0x4e/0x360 >> [ 293.903386] evict+0x169/0x290 >> [ 293.903390] __dentry_kill+0x16f/0x250 >> [ 293.903394] dput+0x1c6/0x440 >> [ 293.903398] __fput+0x184/0x330 >> [ 293.903404] task_work_run+0xb9/0xe0 >> [ 293.903410] exit_to_usermode_loop+0xd3/0xe0 >> [ 293.903413] do_syscall_64+0x1a0/0x1c0 >> [ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> This happens because __ceph_remove_cap() may queue a cap release >> (__ceph_queue_cap_release) which can be scheduled before that cap is >> removed from the inode list with >> >> rb_erase(>ci_node, >i_caps); >> >> And, when this finally happens, the use-after-free will occur. >> >> This can be fixed by protecting the rb_erase with the s_cap_lock spinlock, >> which is used by ceph_send_cap_releases(), before the cap is freed. >> >> Signed-off-by: Luis Henriques >> --- >> fs/ceph/caps.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c >> index d3b9c9d5c1bd..21ee38cabe98 100644 >> --- a/fs/ceph/caps.c >> +++ b/fs/ceph/caps.c >> @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool >> queue_release) >> } >> cap->cap_ino = ci->i_vino.ino; >> >> -spin_unlock(>s_cap_lock); >> - >> /* remove from inode list */ >> rb_erase(>ci_node, >i_caps); >> if (ci->i_auth_cap == cap) >> ci->i_auth_cap = NULL; >> >> +spin_unlock(>s_cap_lock); >> + >> if (removed) >> ceph_put_cap(mdsc, cap); >> > > Is there any reason we need to wait until this point to remove it from > the rbtree? ISTM that we ought to just do that at the beginning of the > function, before we take the s_cap_lock. 
That sounds good to me, at least at a first glance. I spent some time looking for any possible issues in the code, and even ran a few tests.

However, looking at git log I found commit f818a73674c5 ("ceph: fix cap removal races"), which moved that rb_erase from the beginning of the function to its current position. So, unless the race mentioned in this commit has disappeared in the meantime (which is possible, this commit is from 2010!), this rbtree operation shouldn't be changed.

And I now wonder if my patch isn't introducing a race too... __ceph_remove_cap() is supposed to always be called with the session mutex held, except for the ceph_evict_inode() path. Which is where I'm seeing the UAF. So, maybe what's missing here is the s_mutex. Hmm...

Cheers,
--
Luis
[PATCH] ceph: Fix use-after-free in __ceph_remove_cap
KASAN reports a use-after-free when running xfstest generic/531, with the following trace:

[ 293.903362] kasan_report+0xe/0x20
[ 293.903365] rb_erase+0x1f/0x790
[ 293.903370] __ceph_remove_cap+0x201/0x370
[ 293.903375] __ceph_remove_caps+0x4b/0x70
[ 293.903380] ceph_evict_inode+0x4e/0x360
[ 293.903386] evict+0x169/0x290
[ 293.903390] __dentry_kill+0x16f/0x250
[ 293.903394] dput+0x1c6/0x440
[ 293.903398] __fput+0x184/0x330
[ 293.903404] task_work_run+0xb9/0xe0
[ 293.903410] exit_to_usermode_loop+0xd3/0xe0
[ 293.903413] do_syscall_64+0x1a0/0x1c0
[ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because __ceph_remove_cap() may queue a cap release (__ceph_queue_cap_release) which can be scheduled before that cap is removed from the inode list with

	rb_erase(&cap->ci_node, &ci->i_caps);

And, when this finally happens, the use-after-free will occur.

This can be fixed by protecting the rb_erase with the s_cap_lock spinlock, which is used by ceph_send_cap_releases(), before the cap is freed.

Signed-off-by: Luis Henriques
---
 fs/ceph/caps.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d3b9c9d5c1bd..21ee38cabe98 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 	}
 	cap->cap_ino = ci->i_vino.ino;
 
-	spin_unlock(&session->s_cap_lock);
-
 	/* remove from inode list */
 	rb_erase(&cap->ci_node, &ci->i_caps);
 	if (ci->i_auth_cap == cap)
 		ci->i_auth_cap = NULL;
 
+	spin_unlock(&session->s_cap_lock);
+
 	if (removed)
 		ceph_put_cap(mdsc, cap);
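For readers who want to see the ordering argument outside the kernel, here is a minimal userspace sketch. It is only an analogy under stated assumptions: a pthread mutex stands in for s_cap_lock, a singly linked list stands in for the i_caps rbtree, and every type and function name below is hypothetical rather than taken from fs/ceph. The point it tries to show is that the unlink happens before the lock is dropped, so the side that drains the release queue under the same lock can never free a cap that is still linked.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct cap {
	struct cap *next_in_inode;   /* link in the per-inode list */
	struct cap *next_release;    /* link in the session release queue */
	int linked;                  /* non-zero while on the inode list */
};

struct session {
	pthread_mutex_t cap_lock;    /* plays the role of s_cap_lock */
	struct cap *release_queue;
};

struct inode_caps {
	struct cap *head;            /* plays the role of the i_caps rbtree */
};

/* Unlink cap from the inode list (the rb_erase() analogue). */
static void inode_unlink(struct inode_caps *ic, struct cap *cap)
{
	struct cap **pp = &ic->head;

	while (*pp && *pp != cap)
		pp = &(*pp)->next_in_inode;
	if (*pp)
		*pp = cap->next_in_inode;
	cap->next_in_inode = NULL;
	cap->linked = 0;
}

/* Queue the cap for release and unlink it before the lock is dropped. */
static void remove_cap(struct session *s, struct inode_caps *ic,
		       struct cap *cap)
{
	pthread_mutex_lock(&s->cap_lock);
	cap->next_release = s->release_queue;
	s->release_queue = cap;
	inode_unlink(ic, cap);       /* still under cap_lock: not freed yet */
	pthread_mutex_unlock(&s->cap_lock);
}

/* Drain the queue under the same lock and free each cap; returns count. */
static int send_cap_releases(struct session *s)
{
	int freed = 0;

	pthread_mutex_lock(&s->cap_lock);
	while (s->release_queue) {
		struct cap *cap = s->release_queue;

		s->release_queue = cap->next_release;
		assert(!cap->linked);    /* never free a cap that is still linked */
		free(cap);
		freed++;
	}
	pthread_mutex_unlock(&s->cap_lock);
	return freed;
}
```

If the unlink were moved after the unlock in remove_cap(), a concurrent send_cap_releases() could free the cap first, which is the userspace equivalent of the rb_erase() on freed memory that KASAN flagged.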
'unable to handle page fault' in pstore
Hi,

Maybe this is a known issue with pstore, I didn't investigate, but it's pretty easy to reproduce:

I've efi-pstore loaded, with a bunch of files in /sys/fs/pstore. If I unload my backend driver (efi-pstore) and try to remove a file from /sys/fs/pstore I'll see the following splat:

BUG: unable to handle page fault for address: c0bcf090
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 64a60c067 P4D 64a60c067 PUD 64a60e067 PMD 892200067 PTE 0
Oops: [#1] SMP PTI
CPU: 0 PID: 3154 Comm: mv Tainted: GE 5.4.0-rc2 #19
Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.10.0 02/25/2019
RIP: 0010:pstore_unlink+0x1e/0x70
Code: 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 46 30 48 8b 80 38 02 00 00 48 8b 58 10 48 8b 3b <48> 83 bf 90 00 00 00 00 74 39 48 83 c7 0c 81 42 00
RSP: 0018:b2eb40587e70 EFLAGS: 00010246
RAX: 8920cea3e8a0 RBX: 8920cd55b400 RCX:
RDX: 0001 RSI: 8920cf76f300 RDI: c0bcf000
RBP: 8920cf76f300 R08: 8920cf76f300 R09: 007a2e636e652e32
R10: 0007 R11: 7fff R12: 8920d4c91d80
R13: 8920d4c91d80 R14: 8920d80c3480 R15: 8920d80c3520
FS: 7f3afe720640() GS:8920dda0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: c0bcf090 CR3: 00088e260001 CR4: 003606f0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 vfs_unlink+0x10f/0x1f0
 do_unlinkat+0x1af/0x2f0
 do_syscall_64+0x4c/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3afe8ef2f7
Code: 73 01 c3 48 8b 0d 89 db 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 64 89 01 48
RSP: 002b:7ffd60fea168 EFLAGS: 0202 ORIG_RAX: 0107
RAX: ffda RBX: 5629c4e3ef70 RCX: 7f3afe8ef2f7
RDX: RSI: 5629c4e3dd40 RDI: ff9c
RBP: 5629c4e3dcb0 R08: 0003 R09:
R10: f24b R11: 0202 R12:
R13: 7ffd60fea260 R14: 5629c4e3ef70 R15: 0002

My understanding is that pstore_unlink() is exploding when running:

	if (!record->psi->erase)

because that address (psi) was in the efi-pstore module, which is gone by then.

I'm not sure what the best way to fix this is. Probably an

	if (psinfo && record->psi->erase)

would be enough, but ugly (and still racy?).

Cheers,
--
Luis
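To make the suggested guard concrete, here is a small userspace sketch. It is not pstore code: the structures, the active_backend pointer (standing in for psinfo), the register/unregister helpers and the -EPERM return value are all assumptions made for illustration. It only shows the short-circuit idea: check the global registration first so the record's backend pointer is never dereferenced once the backend is gone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the pstore structures. */
struct backend {
	int (*erase)(int id);        /* optional, may be NULL */
};

struct record {
	struct backend *psi;         /* backend that produced this record */
	int id;
};

/* Plays the role of the global psinfo pointer; NULL once unregistered. */
static struct backend *active_backend;

static void register_backend(struct backend *b)
{
	assert(b != NULL);
	active_backend = b;
}

static void unregister_backend(void)
{
	active_backend = NULL;
}

/*
 * Refuse the unlink instead of dereferencing a backend that is gone.
 * The -EPERM value is an assumption chosen for this sketch.
 */
static int unlink_record(const struct record *rec)
{
	if (!active_backend || !rec->psi->erase)
		return -EPERM;
	return rec->psi->erase(rec->id);
}

/* Example erase callback that only counts how often it was called. */
static int erase_calls;

static int demo_erase(int id)
{
	(void)id;
	erase_calls++;
	return 0;
}
```

As noted above, a check like this is still racy in a concurrent setting: nothing here stops the backend from being unregistered between the test and the erase call, so a real fix would presumably need a lock or a reference on the backend.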
Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
Gregory Farnum writes: > On Mon, Sep 9, 2019 at 4:15 AM Luis Henriques wrote: >> >> "Jeff Layton" writes: >> >> > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote: >> >> OSDs are able to perform object copies across different pools. Thus, >> >> there's no need to prevent copy_file_range from doing remote copies if the >> >> source and destination superblocks are different. Only return -EXDEV if >> >> they have different fsid (the cluster ID). >> >> >> >> Signed-off-by: Luis Henriques >> >> --- >> >> fs/ceph/file.c | 18 ++ >> >> 1 file changed, 14 insertions(+), 4 deletions(-) >> >> >> >> Hi, >> >> >> >> Here's the patch changelog since initial submittion: >> >> >> >> - Dropped have_fsid checks on client structs >> >> - Use %pU to print the fsid instead of raw hex strings (%*ph) >> >> - Fixed 'To:' field in email so that this time the patch hits vger >> >> >> >> Cheers, >> >> -- >> >> Luis >> >> >> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c >> >> index 685a03cc4b77..4a624a1dd0bb 100644 >> >> --- a/fs/ceph/file.c >> >> +++ b/fs/ceph/file.c >> >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file >> >> *src_file, loff_t src_off, >> >> struct ceph_inode_info *src_ci = ceph_inode(src_inode); >> >> struct ceph_inode_info *dst_ci = ceph_inode(dst_inode); >> >> struct ceph_cap_flush *prealloc_cf; >> >> +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode); >> >> struct ceph_object_locator src_oloc, dst_oloc; >> >> struct ceph_object_id src_oid, dst_oid; >> >> loff_t endoff = 0, size; >> >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file >> >> *src_file, loff_t src_off, >> >> >> >> if (src_inode == dst_inode) >> >> return -EINVAL; >> >> -if (src_inode->i_sb != dst_inode->i_sb) >> >> -return -EXDEV; >> >> +if (src_inode->i_sb != dst_inode->i_sb) { >> >> +struct ceph_fs_client *dst_fsc = >> >> ceph_inode_to_client(dst_inode); >> >> + >> >> +if (ceph_fsid_compare(_fsc->client->fsid, >> >> + 
_fsc->client->fsid)) { >> >> +dout("Copying object across different clusters:"); >> >> +dout(" src fsid: %pU dst fsid: %pU\n", >> >> + _fsc->client->fsid, _fsc->client->fsid); >> >> +return -EXDEV; >> >> +} >> >> +} >> > >> > Just to be clear: what happens here if I mount two entirely separate >> > clusters, and their OSDs don't have any access to one another? Will this >> > fail at some later point with an error that we can catch so that we can >> > fall back? >> >> This is exactly what this check prevents: if we have two CephFS from two >> unrelated clusters mounted and we try to copy a file across them, the >> operation will fail with -EXDEV[1] because the FSIDs for these two >> ceph_fs_client will be different. OTOH, if these two filesystems are >> within the same cluster (and thus with the same FSID), then the OSDs are >> able to do 'copy-from' operations between them. >> >> I've tested all these scenarios and they seem to be handled correctly. >> Now, I'm assuming that *all* OSDs within the same ceph cluster can >> communicate between themselves; if this assumption is false, then this >> patch is broken. But again, I'm not aware of any mechanism that >> prevents 2 OSDs from communicating between them. > > Your assumption is correct: all OSDs in a Ceph cluster can communicate > with each other. I'm not aware of any plans to change this. > > I spent a bit of time trying to figure out how this could break > security models and things and didn't come up with anything, so I > think functionally it's fine even though I find it a bit scary. > > Also, yes, cluster FSIDs are UUIDs so they shouldn't collide. Awesome, thanks for clarifying these points! Cheers, -- Luis > -Greg > >> >> [1] Actually, the files will still be copied because we'll fallback into >> the defau
[PATCH v3] ceph: allow object copies across different filesystems in the same cluster
OSDs are able to perform object copies across different pools. Thus, there's no need to prevent copy_file_range from doing remote copies if the source and destination superblocks are different. Only return -EXDEV if they have different fsid (the cluster ID).

Signed-off-by: Luis Henriques
---
 fs/ceph/file.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

Hi,

Here's the changelog:

* since v2
  - single dout() in error path

* since v1:
  - Dropped have_fsid checks on client structs
  - Use %pU to print the fsid instead of raw hex strings (%*ph)
  - Fixed 'To:' field in email so that this time the patch hits vger

Cheers,
--
Luis

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..846cf5aea85e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
 	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
 	struct ceph_cap_flush *prealloc_cf;
+	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
 	struct ceph_object_locator src_oloc, dst_oloc;
 	struct ceph_object_id src_oid, dst_oid;
 	loff_t endoff = 0, size;
@@ -1915,8 +1916,16 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 
 	if (src_inode == dst_inode)
 		return -EINVAL;
-	if (src_inode->i_sb != dst_inode->i_sb)
-		return -EXDEV;
+	if (src_inode->i_sb != dst_inode->i_sb) {
+		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
+
+		if (ceph_fsid_compare(&src_fsc->client->fsid,
+				      &dst_fsc->client->fsid)) {
+			dout("Copying files across clusters: src: %pU dst: %pU\n",
+			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
+			return -EXDEV;
+		}
+	}
 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
 		return -EROFS;
 
@@ -1928,7 +1937,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	 * efficient).
 	 */
 
-	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
+	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
 		return -EOPNOTSUPP;
 
 	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
@@ -2044,7 +2053,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			dst_ci->i_vino.ino, dst_objnum);
 		/* Do an object remote copy */
 		err = ceph_osdc_copy_from(
-			&ceph_inode_to_client(src_inode)->client->osdc,
+			&src_fsc->client->osdc,
 			src_ci->i_vino.snap, 0,
 			&src_oid, &src_oloc,
 			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
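The decision this patch introduces can be modelled in a few lines of userspace C. This is a sketch under assumptions, not the kernel code: the 16-byte struct fsid, the sb_id field (standing in for the superblock pointer) and the helper names are hypothetical, and the comparison is assumed to be a plain byte-wise compare of the two IDs. It shows the rule as described in the commit message: different superblocks are fine as long as the cluster fsid matches, and only a differing fsid yields -EXDEV.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of a 16-byte cluster fsid. */
struct fsid {
	unsigned char bytes[16];
};

/* Hypothetical model of a mounted filesystem client. */
struct fs_client {
	struct fsid fsid;            /* cluster ID */
	int sb_id;                   /* stands in for the superblock pointer */
};

/* Returns 0 when both IDs are identical, non-zero otherwise. */
static int fsid_compare(const struct fsid *a, const struct fsid *b)
{
	return memcmp(a, b, sizeof(*a));
}

/* Returns 0 if a remote object copy may be attempted, -EXDEV otherwise. */
static int check_cross_fs_copy(const struct fs_client *src,
			       const struct fs_client *dst)
{
	assert(src != NULL && dst != NULL);
	if (src->sb_id != dst->sb_id &&
	    fsid_compare(&src->fsid, &dst->fsid))
		return -EXDEV;
	return 0;
}
```

Two mounts of different filesystems in the same cluster share the fsid and pass the check; two mounts from unrelated clusters do not.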
Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
"Jeff Layton" writes: > On Mon, 2019-09-09 at 06:35 -0400, Jeff Layton wrote: >> On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote: >> > OSDs are able to perform object copies across different pools. Thus, >> > there's no need to prevent copy_file_range from doing remote copies if the >> > source and destination superblocks are different. Only return -EXDEV if >> > they have different fsid (the cluster ID). >> > >> > Signed-off-by: Luis Henriques >> > --- >> > fs/ceph/file.c | 18 ++ >> > 1 file changed, 14 insertions(+), 4 deletions(-) >> > >> > Hi, >> > >> > Here's the patch changelog since initial submittion: >> > >> > - Dropped have_fsid checks on client structs >> > - Use %pU to print the fsid instead of raw hex strings (%*ph) >> > - Fixed 'To:' field in email so that this time the patch hits vger >> > >> > Cheers, >> > -- >> > Luis >> > >> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c >> > index 685a03cc4b77..4a624a1dd0bb 100644 >> > --- a/fs/ceph/file.c >> > +++ b/fs/ceph/file.c >> > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file >> > *src_file, loff_t src_off, >> >struct ceph_inode_info *src_ci = ceph_inode(src_inode); >> >struct ceph_inode_info *dst_ci = ceph_inode(dst_inode); >> >struct ceph_cap_flush *prealloc_cf; >> > + struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode); >> >struct ceph_object_locator src_oloc, dst_oloc; >> >struct ceph_object_id src_oid, dst_oid; >> >loff_t endoff = 0, size; >> > @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file >> > *src_file, loff_t src_off, >> > >> >if (src_inode == dst_inode) >> >return -EINVAL; >> > - if (src_inode->i_sb != dst_inode->i_sb) >> > - return -EXDEV; >> > + if (src_inode->i_sb != dst_inode->i_sb) { >> > + struct ceph_fs_client *dst_fsc = >> > ceph_inode_to_client(dst_inode); >> > + >> > + if (ceph_fsid_compare(_fsc->client->fsid, >> > +_fsc->client->fsid)) { >> > + dout("Copying object across different clusters:"); >> > + dout(" src 
fsid: %pU dst fsid: %pU\n",
>> > +		     &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> > +		return -EXDEV;
>> > +	}
>> > +	}
>>
>> Just to be clear: what happens here if I mount two entirely separate
>> clusters, and their OSDs don't have any access to one another? Will this
>> fail at some later point with an error that we can catch so that we can
>> fall back?
>>
>
> Duh, sorry I asked before I had a cup of coffee this morning. The whole
> point is to skip that case.
>
> That said...I wonder if it's possible to have an fsid collision across
> two separate clusters and this fail to catch that case? Aren't these
> things just allocated via a simple counter increment?

My understanding is that this is some sort of UUID. Looking at doc/install/manual-deployment.rst it says that the fsid is a unique ID that should be generated using uuidgen (I believe that's what vstart.sh clusters use).

That said, it's obviously possible to reuse an fsid in two clusters. And mounting both filesystems with the same fsid on the same client may already cause some trouble without even trying to copy_file_range files across them (for ex., fscache code seems to assume unique fsids). But I have never tested this sort of thing (probably no one did) and I really don't know what the consequences are. In this specific case, I would expect the 'copy-from' operation to fail with some error from the OSDs.

> Probably not worth worrying about overmuch, but might be good to
> understand what would happen in that case if only to field mailing list
> reports.

If there are concerns regarding this, I'm OK simply dropping the patch for now and continuing to forbid object copies when superblocks are different. I just thought this was low-hanging fruit, and didn't realize that it's not very easy to ensure that 2 cephfs instances actually belong to the same cluster. Maybe there are other checks that could be done...?

Cheers,
--
Luis

> Other than that, this looks fine, modulo Ilya's comment about the two
> dout messages.
>
>>
>> > 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> > 		return -EROFS;
>> >
Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
"Jeff Layton" writes: > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote: >> OSDs are able to perform object copies across different pools. Thus, >> there's no need to prevent copy_file_range from doing remote copies if the >> source and destination superblocks are different. Only return -EXDEV if >> they have different fsid (the cluster ID). >> >> Signed-off-by: Luis Henriques >> --- >> fs/ceph/file.c | 18 ++ >> 1 file changed, 14 insertions(+), 4 deletions(-) >> >> Hi, >> >> Here's the patch changelog since initial submittion: >> >> - Dropped have_fsid checks on client structs >> - Use %pU to print the fsid instead of raw hex strings (%*ph) >> - Fixed 'To:' field in email so that this time the patch hits vger >> >> Cheers, >> -- >> Luis >> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c >> index 685a03cc4b77..4a624a1dd0bb 100644 >> --- a/fs/ceph/file.c >> +++ b/fs/ceph/file.c >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file >> *src_file, loff_t src_off, >> struct ceph_inode_info *src_ci = ceph_inode(src_inode); >> struct ceph_inode_info *dst_ci = ceph_inode(dst_inode); >> struct ceph_cap_flush *prealloc_cf; >> +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode); >> struct ceph_object_locator src_oloc, dst_oloc; >> struct ceph_object_id src_oid, dst_oid; >> loff_t endoff = 0, size; >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file >> *src_file, loff_t src_off, >> >> if (src_inode == dst_inode) >> return -EINVAL; >> -if (src_inode->i_sb != dst_inode->i_sb) >> -return -EXDEV; >> +if (src_inode->i_sb != dst_inode->i_sb) { >> +struct ceph_fs_client *dst_fsc = >> ceph_inode_to_client(dst_inode); >> + >> +if (ceph_fsid_compare(_fsc->client->fsid, >> + _fsc->client->fsid)) { >> +dout("Copying object across different clusters:"); >> +dout(" src fsid: %pU dst fsid: %pU\n", >> + _fsc->client->fsid, _fsc->client->fsid); >> +return -EXDEV; >> +} >> +} > > Just to be clear: what happens here if I mount 
two entirely separate
> clusters, and their OSDs don't have any access to one another? Will this
> fail at some later point with an error that we can catch so that we can
> fall back?

This is exactly what this check prevents: if we have two CephFS filesystems from two unrelated clusters mounted and we try to copy a file across them, the operation will fail with -EXDEV[1] because the FSIDs for these two ceph_fs_client will be different. OTOH, if these two filesystems are within the same cluster (and thus with the same FSID), then the OSDs are able to do 'copy-from' operations between them.

I've tested all these scenarios and they seem to be handled correctly. Now, I'm assuming that *all* OSDs within the same ceph cluster can communicate with each other; if this assumption is false, then this patch is broken. But again, I'm not aware of any mechanism that prevents 2 OSDs from communicating with each other.

[1] Actually, the files will still be copied because we'll fall back to the default VFS generic_copy_file_range behaviour, which is to do read+write operations.

Cheers,
--
Luis

>
>
>> 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> 		return -EROFS;
>>
>> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> 	 * efficient).
>> 	 */
>>
>> -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>> 		return -EOPNOTSUPP;
>>
>> 	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> 			dst_ci->i_vino.ino, dst_objnum);
>> 		/* Do an object remote copy */
>> 		err = ceph_osdc_copy_from(
>> -			&ceph_inode_to_client(src_inode)->client->osdc,
>> +			&src_fsc->client->osdc,
>> 			src_ci->i_vino.snap, 0,
>> 			&src_oid, &src_oloc,
>> 			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
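The fallback mentioned in footnote [1] can be sketched in userspace as follows. This is a model under assumptions, not the VFS code: memory buffers stand in for files, the fast_copy_fn type and all function names are hypothetical, and the set of error codes that trigger the fallback (-EXDEV and -EOPNOTSUPP) is taken from the discussion in this thread. The idea is simply that a filesystem-specific copy is tried first, and a refusal with one of those codes leads to a plain read+write style copy instead of an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature for a filesystem-specific fast copy. */
typedef long (*fast_copy_fn)(const char *src, char *dst, size_t len);

/* Stand-in for a remote copy that refuses cross-cluster requests. */
static long fast_copy_refuses(const char *src, char *dst, size_t len)
{
	(void)src;
	(void)dst;
	(void)len;
	return -EXDEV;
}

/* Plain read+write style copy, the analogue of the generic path. */
static long generic_copy(const char *src, char *dst, size_t len)
{
	memcpy(dst, src, len);
	return (long)len;
}

/* Try the fast path first; fall back only on -EXDEV or -EOPNOTSUPP. */
static long do_copy(fast_copy_fn fast, const char *src, char *dst,
		    size_t len)
{
	assert(src != NULL && dst != NULL);
	if (fast) {
		long ret = fast(src, dst, len);

		if (ret != -EXDEV && ret != -EOPNOTSUPP)
			return ret;      /* success, or a real error */
	}
	return generic_copy(src, dst, len);
}
```

With this shape the caller still gets its data copied when the two mounts belong to different clusters; only the offloaded object copy is skipped.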
[PATCH v2] ceph: allow object copies across different filesystems in the same cluster
OSDs are able to perform object copies across different pools. Thus, there's no need to prevent copy_file_range from doing remote copies if the source and destination superblocks are different. Only return -EXDEV if they have different fsid (the cluster ID).

Signed-off-by: Luis Henriques
---
 fs/ceph/file.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

Hi,

Here's the patch changelog since initial submission:

- Dropped have_fsid checks on client structs
- Use %pU to print the fsid instead of raw hex strings (%*ph)
- Fixed 'To:' field in email so that this time the patch hits vger

Cheers,
--
Luis

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..4a624a1dd0bb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
 	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
 	struct ceph_cap_flush *prealloc_cf;
+	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
 	struct ceph_object_locator src_oloc, dst_oloc;
 	struct ceph_object_id src_oid, dst_oid;
 	loff_t endoff = 0, size;
@@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 
 	if (src_inode == dst_inode)
 		return -EINVAL;
-	if (src_inode->i_sb != dst_inode->i_sb)
-		return -EXDEV;
+	if (src_inode->i_sb != dst_inode->i_sb) {
+		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
+
+		if (ceph_fsid_compare(&src_fsc->client->fsid,
+				      &dst_fsc->client->fsid)) {
+			dout("Copying object across different clusters:");
+			dout(" src fsid: %pU dst fsid: %pU\n",
+			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
+			return -EXDEV;
+		}
+	}
 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
 		return -EROFS;
 
@@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	 * efficient).
 	 */
 
-	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
+	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
 		return -EOPNOTSUPP;
 
 	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
@@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			dst_ci->i_vino.ino, dst_objnum);
 		/* Do an object remote copy */
 		err = ceph_osdc_copy_from(
-			&ceph_inode_to_client(src_inode)->client->osdc,
+			&src_fsc->client->osdc,
 			src_ci->i_vino.snap, 0,
 			&src_oid, &src_oloc,
 			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster
"Jeff Layton" writes: > On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote: >> "Jeff Layton" writes: >> >> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote: >> > > OSDs are able to perform object copies across different pools. Thus, >> > > there's no need to prevent copy_file_range from doing remote copies if >> > > the >> > > source and destination superblocks are different. Only return -EXDEV if >> > > they have different fsid (the cluster ID). >> > > >> > > Signed-off-by: Luis Henriques >> > > --- >> > > fs/ceph/file.c | 23 +++ >> > > 1 file changed, 19 insertions(+), 4 deletions(-) >> > > >> > > Hi! >> > > >> > > I've finally managed to run some tests using multiple filesystems, both >> > > within a single cluster and also using two different clusters. The >> > > behaviour of copy_file_range (with this patch, of course) was what I >> > > expected: >> > > >> > > - Object copies work fine across different filesystems within the same >> > > cluster (even with pools in different PGs); >> > > - -EXDEV is returned if the fsid is different >> > > >> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons? >> > > Because this is actually what's in ceph.conf fsid in "[global]" >> > > section. Anyway...) >> > > >> > > So, what's missing right now is (I always mention this when I have the >> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-) >> > > And add the corresponding support for the new flag to the kernel >> > > client, of course. 
>> > > >> > > Cheers, >> > > -- >> > > Luis >> > > >> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c >> > > index 685a03cc4b77..88d116893c2b 100644 >> > > --- a/fs/ceph/file.c >> > > +++ b/fs/ceph/file.c >> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file >> > > *src_file, loff_t src_off, >> > > struct ceph_inode_info *src_ci = ceph_inode(src_inode); >> > > struct ceph_inode_info *dst_ci = ceph_inode(dst_inode); >> > > struct ceph_cap_flush *prealloc_cf; >> > > +struct ceph_fs_client *src_fsc = >> > > ceph_inode_to_client(src_inode); >> > > struct ceph_object_locator src_oloc, dst_oloc; >> > > struct ceph_object_id src_oid, dst_oid; >> > > loff_t endoff = 0, size; >> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file >> > > *src_file, loff_t src_off, >> > > >> > > if (src_inode == dst_inode) >> > > return -EINVAL; >> > > -if (src_inode->i_sb != dst_inode->i_sb) >> > > -return -EXDEV; >> > > +if (src_inode->i_sb != dst_inode->i_sb) { >> > > +struct ceph_fs_client *dst_fsc = >> > > ceph_inode_to_client(dst_inode); >> > > + >> > > +if (!src_fsc->client->have_fsid || >> > > !dst_fsc->client->have_fsid) { >> > > +dout("No fsid in a fs client\n"); >> > > +return -EXDEV; >> > > +} >> > >> > In what situation is there no fsid? Old cluster version? >> > >> > If there is no fsid, can we take that to indicate that there is only a >> > single filesystem possible in the cluster and that we should attempt the >> > copy anyway? >> >> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call. It is >> set to 'true' when handling the monmap, and it's never changed back to >> 'false'. Since I don't think copy_file_range will be invoked *before* >> we get the monmap, it should be safe to drop this check. Maybe it could >> be replaced it by a WARN_ON()? >> > > Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg > in ceph_check_fsid when the client is initially created. 
Maybe there is
> some better way to achieve that?

I guess the struct ceph_fsid embedded in the client(s) could be changed into a pointer initialized to NULL (and later dynamically allocated). Then, the have_fsid check could be replaced by a NULL check. Not sure if it would bring any real benefit, though. Want me to give that a try? Or maybe I misunderstood your question.

> In any case, I'd just drop that condition here.

Ok, I'll send v2 in a second, without this check.

[ BTW, looks like my initial post didn't make it to vger.kernel.org. It was probably dropped because I screwed up the 'To:' field in my email (no idea how I did that, TBH). ]

Cheers,
--
Luis
Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster
"Jeff Layton" writes: > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote: >> OSDs are able to perform object copies across different pools. Thus, >> there's no need to prevent copy_file_range from doing remote copies if the >> source and destination superblocks are different. Only return -EXDEV if >> they have different fsid (the cluster ID). >> >> Signed-off-by: Luis Henriques >> --- >> fs/ceph/file.c | 23 +++ >> 1 file changed, 19 insertions(+), 4 deletions(-) >> >> Hi! >> >> I've finally managed to run some tests using multiple filesystems, both >> within a single cluster and also using two different clusters. The >> behaviour of copy_file_range (with this patch, of course) was what I >> expected: >> >> - Object copies work fine across different filesystems within the same >> cluster (even with pools in different PGs); >> - -EXDEV is returned if the fsid is different >> >> (OT: I wonder why the cluster ID is named 'fsid'; historical reasons? >> Because this is actually what's in ceph.conf fsid in "[global]" >> section. Anyway...) >> >> So, what's missing right now is (I always mention this when I have the >> opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-) >> And add the corresponding support for the new flag to the kernel >> client, of course. 
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 685a03cc4b77..88d116893c2b 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>>  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>>  	struct ceph_cap_flush *prealloc_cf;
>> +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>>  	struct ceph_object_locator src_oloc, dst_oloc;
>>  	struct ceph_object_id src_oid, dst_oid;
>>  	loff_t endoff = 0, size;
>> @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>
>>  	if (src_inode == dst_inode)
>>  		return -EINVAL;
>> -	if (src_inode->i_sb != dst_inode->i_sb)
>> -		return -EXDEV;
>> +	if (src_inode->i_sb != dst_inode->i_sb) {
>> +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> +
>> +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> +			dout("No fsid in a fs client\n");
>> +			return -EXDEV;
>> +		}
>
> In what situation is there no fsid? Old cluster version?
>
> If there is no fsid, can we take that to indicate that there is only a
> single filesystem possible in the cluster and that we should attempt the
> copy anyway?

TBH I'm not sure if 'have_fsid' can ever be 'false' in this call. It is
set to 'true' when handling the monmap, and it's never changed back to
'false'. Since I don't think copy_file_range will be invoked *before*
we get the monmap, it should be safe to drop this check. Maybe it could
be replaced by a WARN_ON()?

Cheers,
--
Luis

>
>> +		if (ceph_fsid_compare(&src_fsc->client->fsid,
>> +				      &dst_fsc->client->fsid)) {
>> +			dout("Copying object across different clusters:");
>> +			dout(" src fsid: %*ph\n dst fsid: %*ph\n",
>> +			     16, &src_fsc->client->fsid,
>> +			     16, &dst_fsc->client->fsid);
>> +			return -EXDEV;
>> +		}
>> +	}
>>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>>  		return -EROFS;
>>
>> @@ -1928,7 +1943,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	 * efficient).
>>  	 */
>>
>> -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>>  		return -EOPNOTSUPP;
>>
>>  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2059,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  			dst_ci->i_vino.ino, dst_objnum);
>>  		/* Do an object remote copy */
>>  		err = ceph_osdc_copy_from(
>> -			&ceph_inode_to_client(src_inode)->client->osdc,
>> +			&src_fsc->client->osdc,
>>  			src_ci->i_vino.snap, 0,
>>  			&src_oid, &src_oloc,
>>  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
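The decision the patch makes (different superblocks are fine, different cluster fsids are not) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct fsid`, `struct mount`, `fsid_differs()` and `remote_copy_allowed()` are hypothetical stand-ins, not the kernel's structures or helpers, and the have_fsid check is left out since the thread agrees to drop it.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are NOT the kernel structures. */
struct fsid {
	unsigned char bytes[16];	/* cluster ID, 16 bytes as printed by %*ph above */
};

struct mount {
	int sb_id;			/* stands in for the superblock identity */
	struct fsid fsid;		/* cluster this mount belongs to */
};

/* Non-zero when the two cluster IDs differ (byte-wise comparison). */
static int fsid_differs(const struct fsid *a, const struct fsid *b)
{
	return memcmp(a->bytes, b->bytes, sizeof(a->bytes)) != 0;
}

/*
 * Returns 0 when a remote object copy may be attempted, -EXDEV when the
 * two mounts belong to different clusters.
 */
static int remote_copy_allowed(const struct mount *src, const struct mount *dst)
{
	assert(src && dst);

	/* Same superblock: nothing to check. */
	if (src->sb_id == dst->sb_id)
		return 0;

	/* Different superblocks are acceptable only within one cluster. */
	if (fsid_differs(&src->fsid, &dst->fsid))
		return -EXDEV;

	return 0;
}
```

With two mounts that share an fsid, `remote_copy_allowed()` returns 0 even though their `sb_id` values differ; it returns `-EXDEV` only when the fsids differ.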
Re: [RFC PATCH] ceph: fix directories inode i_blkbits initialization
Luis Henriques writes: > "Jeff Layton" writes: > >> On Tue, 2019-07-23 at 16:50 +0100, Luis Henriques wrote: >>> When filling an inode with info from the MDS, i_blkbits is being >>> initialized using fl_stripe_unit, which contains the stripe unit in >>> bytes. Unfortunately, this doesn't make sense for directories as they >>> have fl_stripe_unit set to '0'. This means that i_blkbits will be set >>> to 0xff, causing an UBSAN undefined behaviour in i_blocksize(): >>> >>> UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12 >>> shift exponent 255 is too large for 32-bit type 'int' >>> >>> Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit >>> is zero. >>> >>> Signed-off-by: Luis Henriques >>> --- >>> fs/ceph/inode.c | 7 ++- >>> 1 file changed, 6 insertions(+), 1 deletion(-) >>> >>> Hi Jeff, >>> >>> To be honest, I'm not sure CEPH_BLOCK_SHIFT is the right value to use >>> here, but for sure the one currently being used isn't correct if the >>> inode is a directory. Using stripe units seems to be a bug that has >>> been there since the beginning, but it definitely became bigger problem >>> after commit 69448867abcb ("fs: shave 8 bytes off of struct inode"). >>> >>> This fix could also be moved into the 'switch' statement later in that >>> function, in the S_IFDIR case, similar to commit 5ba72e607cdb ("ceph: >>> set special inode's blocksize to page size"). Let me know which version >>> you would prefer. >>> >> >> What happens with (e.g.) named pipes or symlinks? Do those inodes also >> get this bogus value? Assuming that they do, I'd probably prefer this >> patch since it'd fix things for all inode types, not just directories. > > I tested symlinks and they seem to be handled correctly (i.e. the stripe > units seems to be the same as the target file). Regarding pipes, I > didn't test them, but from the code it should be set to PAGE_SHIFT (see > the above mentioned commit 5ba72e607cdb). 
Ok, after looking closer at the other inode types and running a few tests with extra debug code, it all seems to be sane -- only directories (root dir is an exception) will cause problems with i_blkbits being set to a bogus value. So, I'm sticking with my original RFC patch approach, which should be easy to apply to stable kernels. Cheers, -- Luis > > Anyway, I can change the code to do *all* the i_blkbits initialization > inside the switch statement. Something like: > > switch (inode->i_mode & S_IFMT) { > case S_IFIFO: > case S_IFBLK: > case S_IFCHR: > case S_IFSOCK: > inode->i_blkbits = PAGE_SHIFT; > ... > case S_IFREG: > inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1; > ... > case S_IFLNK: > inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1; > ... > case S_IFDIR: > inode->i_blkbits = CEPH_BLOCK_SHIFT; > ... > default: > pr_err(); > ... > } > > This would add some code duplication (S_IFREG and S_IFLNK cases), but > maybe it's a bit more clear. The other option would be obviously to > leave the initialization outside the switch and only change the > i_blkbits value in the S_IF{IFO,BLK,CHR,SOCK,DIR} cases. > > Cheers,
Re: [RFC PATCH] ceph: fix directories inode i_blkbits initialization
"Jeff Layton" writes: > On Tue, 2019-07-23 at 16:50 +0100, Luis Henriques wrote: >> When filling an inode with info from the MDS, i_blkbits is being >> initialized using fl_stripe_unit, which contains the stripe unit in >> bytes. Unfortunately, this doesn't make sense for directories as they >> have fl_stripe_unit set to '0'. This means that i_blkbits will be set >> to 0xff, causing an UBSAN undefined behaviour in i_blocksize(): >> >> UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12 >> shift exponent 255 is too large for 32-bit type 'int' >> >> Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit >> is zero. >> >> Signed-off-by: Luis Henriques >> --- >> fs/ceph/inode.c | 7 ++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> Hi Jeff, >> >> To be honest, I'm not sure CEPH_BLOCK_SHIFT is the right value to use >> here, but for sure the one currently being used isn't correct if the >> inode is a directory. Using stripe units seems to be a bug that has >> been there since the beginning, but it definitely became bigger problem >> after commit 69448867abcb ("fs: shave 8 bytes off of struct inode"). >> >> This fix could also be moved into the 'switch' statement later in that >> function, in the S_IFDIR case, similar to commit 5ba72e607cdb ("ceph: >> set special inode's blocksize to page size"). Let me know which version >> you would prefer. >> > > What happens with (e.g.) named pipes or symlinks? Do those inodes also > get this bogus value? Assuming that they do, I'd probably prefer this > patch since it'd fix things for all inode types, not just directories. I tested symlinks and they seem to be handled correctly (i.e. the stripe units seems to be the same as the target file). Regarding pipes, I didn't test them, but from the code it should be set to PAGE_SHIFT (see the above mentioned commit 5ba72e607cdb). Anyway, I can change the code to do *all* the i_blkbits initialization inside the switch statement. 
Something like:

	switch (inode->i_mode & S_IFMT) {
	case S_IFIFO:
	case S_IFBLK:
	case S_IFCHR:
	case S_IFSOCK:
		inode->i_blkbits = PAGE_SHIFT;
		...
	case S_IFREG:
		inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
		...
	case S_IFLNK:
		inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
		...
	case S_IFDIR:
		inode->i_blkbits = CEPH_BLOCK_SHIFT;
		...
	default:
		pr_err();
		...
	}

This would add some code duplication (S_IFREG and S_IFLNK cases), but
maybe it's a bit clearer. The other option would obviously be to
leave the initialization outside the switch and only change the
i_blkbits value in the S_IF{IFO,BLK,CHR,SOCK,DIR} cases.

Cheers,
--
Luis

>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 791f84a13bb8..0e6d6db848b7 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -800,7 +800,12 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
>>
>>  	/* update inode */
>>  	inode->i_rdev = le32_to_cpu(info->rdev);
>> -	inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
>> +	/* directories have fl_stripe_unit set to zero */
>> +	if (le32_to_cpu(info->layout.fl_stripe_unit))
>> +		inode->i_blkbits =
>> +			fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
>> +	else
>> +		inode->i_blkbits = CEPH_BLOCK_SHIFT;
>>
>>  	__ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
>>
[RFC PATCH] ceph: fix directories inode i_blkbits initialization
When filling an inode with info from the MDS, i_blkbits is being
initialized using fl_stripe_unit, which contains the stripe unit in
bytes. Unfortunately, this doesn't make sense for directories as they
have fl_stripe_unit set to '0'. This means that i_blkbits will be set
to 0xff, causing a UBSAN undefined behaviour report in i_blocksize():

  UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12
  shift exponent 255 is too large for 32-bit type 'int'

Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit
is zero.

Signed-off-by: Luis Henriques
---
 fs/ceph/inode.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Hi Jeff,

To be honest, I'm not sure CEPH_BLOCK_SHIFT is the right value to use
here, but for sure the one currently being used isn't correct if the
inode is a directory. Using stripe units seems to be a bug that has
been there since the beginning, but it definitely became a bigger problem
after commit 69448867abcb ("fs: shave 8 bytes off of struct inode").

This fix could also be moved into the 'switch' statement later in that
function, in the S_IFDIR case, similar to commit 5ba72e607cdb ("ceph:
set special inode's blocksize to page size"). Let me know which version
you would prefer.

Cheers,
--
Luis

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 791f84a13bb8..0e6d6db848b7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -800,7 +800,12 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 
 	/* update inode */
 	inode->i_rdev = le32_to_cpu(info->rdev);
-	inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
+	/* directories have fl_stripe_unit set to zero */
+	if (le32_to_cpu(info->layout.fl_stripe_unit))
+		inode->i_blkbits =
+			fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
+	else
+		inode->i_blkbits = CEPH_BLOCK_SHIFT;
 
 	__ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
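The arithmetic behind the 0xff value can be reproduced in userspace. The sketch below is not kernel code: `fls_u32()` is a portable stand-in for the kernel's fls(), the 8-bit return type assumes i_blkbits is an 8-bit field (which is what the 0xff in the description implies), and `FALLBACK_BLOCK_SHIFT` is set to 22 purely as an assumed illustrative value for CEPH_BLOCK_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 22 (4 MiB blocks) is used only as an illustrative fallback. */
#define FALLBACK_BLOCK_SHIFT 22

/*
 * Portable stand-in for fls(): 1-based index of the most significant
 * set bit, and 0 when no bit is set.
 */
static int fls_u32(uint32_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Old behaviour: fls(0) - 1 == -1, which wraps to 0xff in an 8-bit field. */
static uint8_t blkbits_unchecked(uint32_t stripe_unit)
{
	return (uint8_t)(fls_u32(stripe_unit) - 1);
}

/* Behaviour after the fix: fall back to a sane shift for a zero stripe unit. */
static uint8_t blkbits_checked(uint32_t stripe_unit)
{
	if (stripe_unit)
		return (uint8_t)(fls_u32(stripe_unit) - 1);
	return FALLBACK_BLOCK_SHIFT;
}

/* Mirrors what i_blocksize() does with the field: 1 << blkbits. */
static uint32_t blocksize(uint8_t blkbits)
{
	/* A shift count of 32 or more is undefined for a 32-bit type. */
	assert(blkbits < 32);
	return UINT32_C(1) << blkbits;
}
```

Here `blkbits_unchecked(0)` yields 0xff, which `blocksize()` would reject, while `blkbits_checked(0)` yields the fallback shift and a well-defined block size.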
Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
Waiman Long writes: > On 7/20/19 4:41 AM, Luis Henriques wrote: >> "Linus Torvalds" writes: >> >>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long wrote: >>>> This patch shouldn't change the behavior of the rwsem code. The code >>>> only access data within the rw_semaphore structures. I don't know why it >>>> will cause a KASAN error. I will have to reproduce it and figure out >>>> exactly which statement is doing the invalid access. >>> The stack traces should show line numbers if you run them through >>> scripts/decode_stacktrace.sh. >>> >>> You need to have debug info enabled for that, though. >>> >>> Luis? >>> >>> Linus >> Yep, sure. And I should have done this in the initial report. It's a >> different trace, I had to recompile the kernel. >> >> (I'm also adding Jeff to the CC list.) >> >> Cheers, > > Thanks for the information. I think I know where the problem is. Would > you mind applying the attached patch to see if it can fix the KASAN error. Yep, that seems to work -- I can't reproduce the error anymore (and sorry for the delay). Thanks! And feel free to add my Tested-by. Cheers, -- Luis
Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
Luis Henriques writes: > Luis Henriques writes: > >> "Linus Torvalds" writes: >> >>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long wrote: >>>> >>>> This patch shouldn't change the behavior of the rwsem code. The code >>>> only access data within the rw_semaphore structures. I don't know why it >>>> will cause a KASAN error. I will have to reproduce it and figure out >>>> exactly which statement is doing the invalid access. >>> >>> The stack traces should show line numbers if you run them through >>> scripts/decode_stacktrace.sh. >>> >>> You need to have debug info enabled for that, though. >>> >>> Luis? >>> >>> Linus >> >> Yep, sure. And I should have done this in the initial report. It's a >> different trace, I had to recompile the kernel. >> >> (I'm also adding Jeff to the CC list.) >> > > Ah, and I also managed to reproduce this on btrfs so I guess this rules > out a bug in the filesystem code. Just another detail (before I go completely offline until tomorrow evening): in the btrfs case I'm seeing the bug on the rwsem_down_read_slowpath path, not on rwsem_down_write_slowpath. But it seems to be on the same place (i.e. rwsem_can_spin_on_owner). Cheers, -- Luis
Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
Luis Henriques writes: > "Linus Torvalds" writes: > >> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long wrote: >>> >>> This patch shouldn't change the behavior of the rwsem code. The code >>> only access data within the rw_semaphore structures. I don't know why it >>> will cause a KASAN error. I will have to reproduce it and figure out >>> exactly which statement is doing the invalid access. >> >> The stack traces should show line numbers if you run them through >> scripts/decode_stacktrace.sh. >> >> You need to have debug info enabled for that, though. >> >> Luis? >> >> Linus > > Yep, sure. And I should have done this in the initial report. It's a > different trace, I had to recompile the kernel. > > (I'm also adding Jeff to the CC list.) > Ah, and I also managed to reproduce this on btrfs so I guess this rules out a bug in the filesystem code. Cheers, -- Luis
Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
"Linus Torvalds" writes: > On Fri, Jul 19, 2019 at 12:32 PM Waiman Long wrote: >> >> This patch shouldn't change the behavior of the rwsem code. The code >> only access data within the rw_semaphore structures. I don't know why it >> will cause a KASAN error. I will have to reproduce it and figure out >> exactly which statement is doing the invalid access. > > The stack traces should show line numbers if you run them through > scripts/decode_stacktrace.sh. > > You need to have debug info enabled for that, though. > > Luis? > > Linus Yep, sure. And I should have done this in the initial report. It's a different trace, I had to recompile the kernel. (I'm also adding Jeff to the CC list.) Cheers, -- Luis [ 39.801179] == [ 39.801973] BUG: KASAN: use-after-free in rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) [ 39.802733] Read of size 4 at addr 8881f1f65138 by task xfs_io/2145 [ 39.803598] CPU: 0 PID: 2145 Comm: xfs_io Not tainted 5.2.0+ #460 [ 39.803600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 [ 39.803602] Call Trace: [ 39.803609] dump_stack (/home/miguel/kernel/linux/lib/dump_stack.c:115) [ 39.803615] print_address_description (/home/miguel/kernel/linux/mm/kasan/report.c:352) [ 39.803618] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) [ 39.803621] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) [ 39.803624] __kasan_report.cold (/home/miguel/kernel/linux/mm/kasan/report.c:483) [ 39.803629] ? 
rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) [ 39.803633] kasan_report (/home/miguel/kernel/linux/./arch/x86/include/asm/smap.h:69 /home/miguel/kernel/linux/mm/kasan/common.c:613) [ 39.803636] rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) [ 39.803641] ? __ceph_caps_issued_mask (/home/miguel/kernel/linux/fs/ceph/caps.c:914) [ 39.803644] ? find_held_lock (/home/miguel/kernel/linux/kernel/locking/lockdep.c:4004) [ 39.803649] ? __ceph_do_getattr (/home/miguel/kernel/linux/fs/ceph/inode.c:2246) [ 39.803653] ? down_read_non_owner (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1116) [ 39.803658] ? do_raw_spin_unlock (/home/miguel/kernel/linux/./include/linux/compiler.h:218 /home/miguel/kernel/linux/./include/asm-generic/qspinlock.h:94 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:139) [ 39.803663] ? _raw_spin_unlock (/home/miguel/kernel/linux/kernel/locking/spinlock.c:184) [ 39.803667] ? __lock_acquire.isra.0 (/home/miguel/kernel/linux/kernel/locking/lockdep.c:3884) [ 39.803674] ? path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 /home/miguel/kernel/linux/fs/namei.c:3533) [ 39.803680] ? down_write (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486) [ 39.803683] down_write (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486) [ 39.803687] ? down_read_killable (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1482) [ 39.803690] ? __sb_start_write (/home/miguel/kernel/linux/./include/linux/compiler.h:194 /home/miguel/kernel/linux/./include/linux/rcu_sync.h:38 /home/miguel/kernel/linux/./include/linux/percpu-rwsem.h:52 /home/miguel/kernel/linux/fs/super.c:1608) [ 39.803694] ? 
__mnt_want_write (/home/miguel/kernel/linux/fs/namespace.c:253 /home/miguel/kernel/linux/fs/namespace.c:297 /home/miguel/kernel/linux/fs/namespace.c:337) [ 39.803699] path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 /home/miguel/kernel/linux/fs/namei.c:3533) [ 39.803706] ? path_mountpoint (/home/miguel/kernel/linux/fs/namei.c:3518) [ 39.803711] ? __is_insn_slot_addr (/home/miguel/kernel/linux/kernel/kprobes.c:291) [ 39.803716] ? kernel_text_address (/home/miguel/kernel/linux/kernel/extable.c:113) [ 39.803719] ? __kernel_text_address (/home/miguel/kernel/linux/kernel/extable.c:95) [ 39.803724] ? unwind_get_return_address (/home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:311 /home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:306) [ 39.803727] ? swiotlb_map.cold (/home/miguel/kernel/linux/kernel/stacktrace.c:83) [ 39.803730] ? arch_stack_walk (/home/miguel/kernel/linux/arch/x86/kernel/stacktrace.c:26) [ 39.803735] do_filp_open (/home/miguel/kernel/linux/fs/namei.c:3563) [ 39.803739] ? may_open_dev (/home/miguel/kernel/linux/fs/namei.c:3557) [ 39.803746] ? __alloc_fd (/home/miguel/kernel/linux/fs/file.c:536) [ 39.803749] ? lock_downgrade
Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
Waiman Long writes:

> On 7/19/19 2:45 PM, Luis Henriques wrote:
>> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>>> The rwsem->owner contains not just the task structure pointer, it also
>>> holds some flags for storing the current state of the rwsem. Some of
>>> the flags may have to be atomically updated. To reflect the new reality,
>>> the owner is now changed to an atomic_long_t type.
>>>
>>> New helper functions are added to properly separate out the task
>>> structure pointer and the embedded flags.
>> I started seeing KASAN use-after-free with current master, and a bisect
>> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
>> rwsem->owner an atomic_long_t") was the problem. Does it ring any
>> bells? I can easily reproduce it with xfstests (generic/464).
>>
>> Cheers,
>> --
>> Luís
>
> This patch shouldn't change the behavior of the rwsem code. The code
> only access data within the rw_semaphore structures. I don't know why it
> will cause a KASAN error. I will have to reproduce it and figure out
> exactly which statement is doing the invalid access.

Yeah, screwing up a bisection is something I've done in the past, so I
may have got the wrong commit. Another detail is that I was running
xfstests against CephFS; I didn't try any other filesystem. I can
try to reproduce with btrfs or xfs next week.

Cheers,
--
Luis
Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t
On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote: > The rwsem->owner contains not just the task structure pointer, it also > holds some flags for storing the current state of the rwsem. Some of > the flags may have to be atomically updated. To reflect the new reality, > the owner is now changed to an atomic_long_t type. > > New helper functions are added to properly separate out the task > structure pointer and the embedded flags. I started seeing KASAN use-after-free with current master, and a bisect showed me that this commit 94a9717b3c40 ("locking/rwsem: Make rwsem->owner an atomic_long_t") was the problem. Does it ring any bells? I can easily reproduce it with xfstests (generic/464). Cheers, -- Luís [ 6380.820179] run fstests generic/464 at 2019-07-19 12:04:05 [ 6381.504693] libceph: mon0 (1)192.168.155.1:40786 session established [ 6381.506790] libceph: client4572 fsid 86b39301-7192-4052-8427-a241af35a591 [ 6381.618830] libceph: mon0 (1)192.168.155.1:40786 session established [ 6381.619993] libceph: client4573 fsid 86b39301-7192-4052-8427-a241af35a591 [ 6384.464561] == [ 6384.466165] BUG: KASAN: use-after-free in rwsem_down_write_slowpath+0x67d/0x8a0 [ 6384.468288] Read of size 4 at addr 8881d5dc9478 by task xfs_io/17238 [ 6384.469545] CPU: 1 PID: 17238 Comm: xfs_io Not tainted 5.2.0+ #444 [ 6384.469550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 [ 6384.469554] Call Trace: [ 6384.469563] dump_stack+0x5b/0x90 [ 6384.469569] print_address_description+0x6f/0x332 [ 6384.469573] ? rwsem_down_write_slowpath+0x67d/0x8a0 [ 6384.469575] ? rwsem_down_write_slowpath+0x67d/0x8a0 [ 6384.469579] __kasan_report.cold+0x1a/0x3e [ 6384.469583] ? rwsem_down_write_slowpath+0x67d/0x8a0 [ 6384.469588] kasan_report+0xe/0x12 [ 6384.469591] rwsem_down_write_slowpath+0x67d/0x8a0 [ 6384.469596] ? __ceph_caps_issued_mask+0xe7/0x280 [ 6384.469599] ? find_held_lock+0xc9/0xf0 [ 6384.469604] ? 
__ceph_do_getattr+0x19f/0x290 [ 6384.469608] ? down_read_non_owner+0x1c0/0x1c0 [ 6384.469612] ? do_raw_spin_unlock+0xa3/0x130 [ 6384.469617] ? _raw_spin_unlock+0x24/0x30 [ 6384.469622] ? __lock_acquire.isra.0+0x486/0x770 [ 6384.469629] ? path_openat+0x7ef/0xfe0 [ 6384.469635] ? down_write+0x11e/0x130 [ 6384.469638] down_write+0x11e/0x130 [ 6384.469642] ? down_read_killable+0x1e0/0x1e0 [ 6384.469646] ? __sb_start_write+0x11c/0x170 [ 6384.469650] ? __mnt_want_write+0xb4/0xd0 [ 6384.469655] path_openat+0x7ef/0xfe0 [ 6384.469661] ? path_mountpoint+0x4d0/0x4d0 [ 6384.469667] ? __is_insn_slot_addr+0x93/0xb0 [ 6384.469671] ? kernel_text_address+0x113/0x120 [ 6384.469674] ? __kernel_text_address+0xe/0x30 [ 6384.469679] ? unwind_get_return_address+0x2f/0x50 [ 6384.469683] ? swiotlb_map.cold+0x25/0x25 [ 6384.469687] ? arch_stack_walk+0x8f/0xe0 [ 6384.469692] do_filp_open+0x12b/0x1c0 [ 6384.469695] ? may_open_dev+0x50/0x50 [ 6384.469702] ? __alloc_fd+0x115/0x280 [ 6384.469705] ? lock_downgrade+0x350/0x350 [ 6384.469709] ? do_raw_spin_lock+0x113/0x1d0 [ 6384.469713] ? rwlock_bug.part.0+0x60/0x60 [ 6384.469718] ? do_raw_spin_unlock+0xa3/0x130 [ 6384.469722] ? _raw_spin_unlock+0x24/0x30 [ 6384.469725] ? __alloc_fd+0x115/0x280 [ 6384.469731] do_sys_open+0x1f0/0x2d0 [ 6384.469735] ? filp_open+0x50/0x50 [ 6384.469738] ? switch_fpu_return+0x13e/0x230 [ 6384.469742] ? 
__do_page_fault+0x4b5/0x670 [ 6384.469748] do_syscall_64+0x63/0x1c0 [ 6384.469753] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 6384.469756] RIP: 0033:0x7fe961434528 [ 6384.469760] Code: 00 00 41 00 3d 00 00 41 00 74 47 48 8d 05 20 4d 0d 00 8b 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 94 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 [ 6384.469762] RSP: 002b:7ffd9bbabb20 EFLAGS: 0246 ORIG_RAX: 0101 [ 6384.469765] RAX: ffda RBX: 0242 RCX: 7fe961434528 [ 6384.469767] RDX: 0242 RSI: 7ffd9bbae2a5 RDI: ff9c [ 6384.469769] RBP: 7ffd9bbae2a5 R08: 0001 R09: [ 6384.469771] R10: 0180 R11: 0246 R12: 0242 [ 6384.469773] R13: 7ffd9bbabe00 R14: 0180 R15: 0060 [ 6384.470018] Allocated by task 16593: [ 6384.470562] __kasan_kmalloc.part.0+0x3c/0xa0 [ 6384.470565] kmem_cache_alloc+0xdc/0x240 [ 6384.470569] copy_process+0x1dce/0x27b0 [ 6384.470572] _do_fork+0xec/0x540 [ 6384.470576] __se_sys_clone+0xb2/0x100 [ 6384.470581] do_syscall_64+0x63/0x1c0 [ 6384.470586] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 6384.470823] Freed by task 9: [ 6384.471235] __kasan_slab_free+0x147/0x200 [ 6384.471240] kmem_cache_free+0x111/0x330 [ 6384.471246] rcu_core+0x2f9/0x830 [ 6384.471251] __do_softirq+0x154/0x486 [ 6384.471493] The buggy address belongs to the object at 8881d5dc9440 which
[PATCH 0/4] Sleeping functions in invalid context bug fixes
Hi,

I'm sending three "sleeping function called from invalid context" bug
fixes that I had on my TODO for a while. All of them are ceph_buffer_put
related, and all the fixes follow the same pattern: delay the operation
until the ci->i_ceph_lock is released.

The first patch simply allows ceph_buffer_put to receive a NULL buffer so
that the NULL check doesn't need to be performed in all the other patches.
IOW, it's not really required, just convenient.

(Note: maybe these patches should all be tagged for stable.)

Luis Henriques (4):
  libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
  ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
  ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()
  ceph: fix buffer free while holding i_ceph_lock in fill_inode()

 fs/ceph/caps.c              |  5 -
 fs/ceph/inode.c             |  7 ---
 fs/ceph/snap.c              |  4 +++-
 fs/ceph/super.h             |  2 +-
 fs/ceph/xattr.c             | 19 ++-
 include/linux/ceph/buffer.h |  3 ++-
 6 files changed, 28 insertions(+), 12 deletions(-)
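The pattern shared by these fixes (stash the old buffer while the lock is held, drop the reference only after unlocking) can be modelled with a small userspace toy. Everything below is hypothetical and only loosely mirrors the kernel code: `struct toy_lock` merely records whether it is held, `struct toy_buf` is a bare refcount, and `frees_under_lock` counts the situation the series wants to avoid.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy lock that only records whether it is held (stands in for i_ceph_lock). */
struct toy_lock {
	int held;
};

static void lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Toy refcounted buffer (stands in for struct ceph_buffer). */
struct toy_buf {
	int refs;
};

struct toy_inode {
	struct toy_lock lock;
	struct toy_buf *blob;
};

/* Counts frees that happened while the lock was held; should stay at 0. */
static int frees_under_lock;

static struct toy_buf *toy_buf_new(void)
{
	struct toy_buf *b = malloc(sizeof(*b));

	if (b)
		b->refs = 1;
	return b;
}

/* NULL-tolerant put, in the spirit of patch 1/4. */
static void toy_buf_put(const struct toy_lock *l, struct toy_buf *b)
{
	if (!b)
		return;
	if (--b->refs == 0) {
		if (l->held)
			frees_under_lock++;
		free(b);
	}
}

/* Replace the blob: swap pointers under the lock, release the old one after. */
static void toy_set_blob(struct toy_inode *ti, struct toy_buf *new_blob)
{
	struct toy_buf *old_blob;

	lock_acquire(&ti->lock);
	old_blob = ti->blob;		/* remember it, but do not free it here */
	ti->blob = new_blob;
	lock_release(&ti->lock);

	toy_buf_put(&ti->lock, old_blob);	/* safe: lock no longer held */
}
```

Because the put tolerates NULL, `toy_set_blob()` needs no special case for the first call, when there is no old blob yet.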
[PATCH 1/4] libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
Signed-off-by: Luis Henriques
---
 include/linux/ceph/buffer.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/buffer.h b/include/linux/ceph/buffer.h
index 5e58bb29b1a3..11cdc7c60480 100644
--- a/include/linux/ceph/buffer.h
+++ b/include/linux/ceph/buffer.h
@@ -30,7 +30,8 @@ static inline struct ceph_buffer *ceph_buffer_get(struct ceph_buffer *b)
 
 static inline void ceph_buffer_put(struct ceph_buffer *b)
 {
-	kref_put(&b->kref, ceph_buffer_release);
+	if (b)
+		kref_put(&b->kref, ceph_buffer_release);
 }
 
 extern int ceph_decode_buffer(struct ceph_buffer **b, void **p, void *end);
[PATCH 2/4] ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
i_xattrs.prealloc_blob buffer while holding the i_ceph_lock. This can be
fixed by postponing the call until later, when the lock is released.

The following backtrace was triggered by fstests generic/117.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
  3 locks held by fsstress/650:
   #0: 870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
   #1: ba0c4c74 (&type->i_mutex_dir_key#6){}, at: vfs_setxattr+0x55/0xa0
   #2: 8dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
  CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   __ceph_setxattr+0x2b4/0x810
   __vfs_setxattr+0x66/0x80
   __vfs_setxattr_noperm+0x59/0xf0
   vfs_setxattr+0x81/0xa0
   setxattr+0x115/0x230
   ? filename_lookup+0xc9/0x140
   ? rcu_read_lock_sched_held+0x74/0x80
   ? rcu_sync_lockdep_assert+0x2e/0x60
   ? __sb_start_write+0x142/0x1a0
   ? mnt_want_write+0x20/0x50
   path_setxattr+0xba/0xd0
   __x64_sys_lsetxattr+0x24/0x30
   do_syscall_64+0x50/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7ff23514359a

Signed-off-by: Luis Henriques
---
 fs/ceph/xattr.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 37b458a9af3a..c083557b3657 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1036,6 +1036,7 @@ int __ceph_setxattr(struct inode *inode, const char *name,
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
 	struct ceph_cap_flush *prealloc_cf = NULL;
+	struct ceph_buffer *old_blob = NULL;
 	int issued;
 	int err;
 	int dirty = 0;
@@ -1109,13 +1110,15 @@ int __ceph_setxattr(struct inode *inode, const char *name,
 		struct ceph_buffer *blob;
 
 		spin_unlock(&ci->i_ceph_lock);
-		dout(" preaallocating new blob size=%d\n", required_blob_size);
+		ceph_buffer_put(old_blob); /* Shouldn't be required */
+		dout(" pre-allocating new blob size=%d\n", required_blob_size);
 		blob = ceph_buffer_new(required_blob_size, GFP_NOFS);
 		if (!blob)
 			goto do_sync_unlocked;
 		spin_lock(&ci->i_ceph_lock);
+		/* prealloc_blob can't be released while holding i_ceph_lock */
 		if (ci->i_xattrs.prealloc_blob)
-			ceph_buffer_put(ci->i_xattrs.prealloc_blob);
+			old_blob = ci->i_xattrs.prealloc_blob;
 		ci->i_xattrs.prealloc_blob = blob;
 		goto retry;
 	}
@@ -1131,6 +1134,7 @@ int __ceph_setxattr(struct inode *inode, const char *name,
 	}
 
 	spin_unlock(&ci->i_ceph_lock);
+	ceph_buffer_put(old_blob);
 	if (lock_snap_rwsem)
 		up_read(&mdsc->snap_rwsem);
 	if (dirty)
[PATCH 3/4] ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()
Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
freeing the i_xattrs.blob buffer while holding the i_ceph_lock.  This can
be fixed by having this function return the old blob buffer and having
its callers free it once the lock is released.

The following backtrace was triggered by fstests generic/117.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
  4 locks held by fsstress/649:
   #0: a7478e7e (>s_umount_key#19){}, at: iterate_supers+0x77/0xf0
   #1: f8de1423 (&(>i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
   #2: 562f2b27 (>s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
   #3: f83ce16a (>snap_rwsem){}, at: ceph_check_caps+0x3ed/0xc60
  CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   __ceph_build_xattrs_blob+0x12b/0x170
   __send_cap+0x302/0x540
   ? __lock_acquire+0x23c/0x1e40
   ? __mark_caps_flushing+0x15c/0x280
   ? _raw_spin_unlock+0x24/0x30
   ceph_check_caps+0x5f0/0xc60
   ceph_flush_dirty_caps+0x7c/0x150
   ? __ia32_sys_fdatasync+0x20/0x20
   ceph_sync_fs+0x5a/0x130
   iterate_supers+0x8f/0xf0
   ksys_sync+0x4f/0xb0
   __ia32_sys_sync+0xa/0x10
   do_syscall_64+0x50/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fc6409ab617

Signed-off-by: Luis Henriques
---
 fs/ceph/caps.c  |  5 ++++-
 fs/ceph/snap.c  |  4 +++-
 fs/ceph/super.h |  2 +-
 fs/ceph/xattr.c | 11 ++++++++---
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d98dcd976c80..ce0f5658720a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1301,6 +1301,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 {
 	struct ceph_inode_info *ci = cap->ci;
 	struct inode *inode = &ci->vfs_inode;
+	struct ceph_buffer *old_blob = NULL;
 	struct cap_msg_args arg;
 	int held, revoking;
 	int wake = 0;
@@ -1365,7 +1366,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	ci->i_requested_max_size = arg.max_size;
 
 	if (flushing & CEPH_CAP_XATTR_EXCL) {
-		__ceph_build_xattrs_blob(ci);
+		old_blob = __ceph_build_xattrs_blob(ci);
 		arg.xattr_version = ci->i_xattrs.version;
 		arg.xattr_buf = ci->i_xattrs.blob;
 	} else {
@@ -1409,6 +1410,8 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 
 	spin_unlock(&ci->i_ceph_lock);
 
+	ceph_buffer_put(old_blob);
+
 	ret = send_cap_msg(&arg);
 	if (ret < 0) {
 		dout("error sending cap msg, must requeue %p\n", inode);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 4c6494eb02b5..ccfcc66aaf44 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -465,6 +465,7 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
 	struct inode *inode = &ci->vfs_inode;
 	struct ceph_cap_snap *capsnap;
 	struct ceph_snap_context *old_snapc, *new_snapc;
+	struct ceph_buffer *old_blob = NULL;
 	int used, dirty;
 
 	capsnap = kzalloc(sizeof(*capsnap), GFP_NOFS);
@@ -541,7 +542,7 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
 	capsnap->gid = inode->i_gid;
 
 	if (dirty & CEPH_CAP_XATTR_EXCL) {
-		__ceph_build_xattrs_blob(ci);
+		old_blob = __ceph_build_xattrs_blob(ci);
 		capsnap->xattr_blob =
 			ceph_buffer_get(ci->i_xattrs.blob);
 		capsnap->xattr_version = ci->i_xattrs.version;
@@ -584,6 +585,7 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
 	}
 	spin_unlock(&ci->i_ceph_lock);
 
+	ceph_buffer_put(old_blob);
 	kfree(capsnap);
 	ceph_put_snap_context(old_snapc);
 }
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index d2352fd95dbc..6b9f1ee7de85 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -926,7 +926,7 @@ extern int ceph_getattr(const struct path *path, struct kstat *stat,
 int __ceph_setxattr(struct inode *, const char *, const void *, size_t, int);
 ssize_t __ceph_getxattr(struct inode *, const char *, void *, size_t);
 extern ssize_t ceph_listxattr(struct dentry *, char *, size_t);
-extern void __ceph_build_xattrs_blob(struct ceph_inode_info *ci);
+extern struct ceph_buffer *__ceph_build_xattrs_blob(struct ceph_inode_info *ci);
 extern void __ceph_destroy_xattrs(struct ceph_inode_info *ci);
 extern const struct xattr_handler *ceph_xattr_handlers[];
 
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index c083557b3657..939eab7aa219 100644
--- a/fs/ceph/xattr.c
+++
[PATCH 4/4] ceph: fix buffer free while holding i_ceph_lock in fill_inode()
Calling ceph_buffer_put() in fill_inode() may result in freeing the
i_xattrs.blob buffer while holding the i_ceph_lock.  This can be fixed by
postponing the call until later, when the lock is released.

The following backtrace was triggered by fstests generic/070.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
  6 locks held by kworker/0:4/3852:
   #0: 4270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
   #1: eb420803 ((work_completion)(&(>work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
   #2: be1c53a4 (>s_mutex){+.+.}, at: dispatch+0x288/0x1476
   #3: 559cb958 (>snap_rwsem){}, at: dispatch+0x2eb/0x1476
   #4: 0d5ebbae (>r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
   #5: a83d0514 (&(>i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
  CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Workqueue: ceph-msgr ceph_con_workfn
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   fill_inode.isra.0+0xa9b/0xf70
   ceph_fill_trace+0x13b/0xc70
   ? dispatch+0x2eb/0x1476
   dispatch+0x320/0x1476
   ? __mutex_unlock_slowpath+0x4d/0x2a0
   ceph_con_workfn+0xc97/0x2ec0
   ? process_one_work+0x1b8/0x5f0
   process_one_work+0x244/0x5f0
   worker_thread+0x4d/0x3e0
   kthread+0x105/0x140
   ? process_one_work+0x5f0/0x5f0
   ? kthread_park+0x90/0x90
   ret_from_fork+0x3a/0x50

Signed-off-by: Luis Henriques
---
 fs/ceph/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 791f84a13bb8..18500edefc56 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -736,6 +736,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	int issued, new_issued, info_caps;
 	struct timespec64 mtime, atime, ctime;
 	struct ceph_buffer *xattr_blob = NULL;
+	struct ceph_buffer *old_blob = NULL;
 	struct ceph_string *pool_ns = NULL;
 	struct ceph_cap *new_cap = NULL;
 	int err = 0;
@@ -881,7 +882,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	if ((ci->i_xattrs.version == 0 || !(issued & CEPH_CAP_XATTR_EXCL)) &&
 	    le64_to_cpu(info->xattr_version) > ci->i_xattrs.version) {
 		if (ci->i_xattrs.blob)
-			ceph_buffer_put(ci->i_xattrs.blob);
+			old_blob = ci->i_xattrs.blob;
 		ci->i_xattrs.blob = xattr_blob;
 		if (xattr_blob)
 			memcpy(ci->i_xattrs.blob->vec.iov_base,
@@ -1022,8 +1023,8 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 out:
 	if (new_cap)
 		ceph_put_cap(mdsc, new_cap);
-	if (xattr_blob)
-		ceph_buffer_put(xattr_blob);
+	ceph_buffer_put(old_blob);
+	ceph_buffer_put(xattr_blob);
 	ceph_put_string(pool_ns);
 	return err;
 }
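The pattern shared by the patches in this series (stash the old buffer while the lock is held, drop the reference only after unlocking) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: the toy lock, struct buf, buf_new(), buf_put() and holder_set_blob() are all hypothetical names, the refcount is not atomic, and the assert() in buf_put() merely plays the role that might_sleep() plays in the backtrace above.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a spinlock: it only records whether it is held. */
int lock_held;

void toy_lock(void)   { assert(!lock_held); lock_held = 1; }
void toy_unlock(void) { assert(lock_held);  lock_held = 0; }

/* Hypothetical refcounted buffer, loosely modeled on struct ceph_buffer. */
struct buf {
	int refs;
	void *data;
};

int bufs_freed;	/* number of buffers released so far */

struct buf *buf_new(size_t len)
{
	struct buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(len);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->refs = 1;
	return b;
}

/*
 * Drop one reference; NULL is ignored.  The final put frees the buffer,
 * which stands in for the vfree() that may sleep in the kernel case.
 */
void buf_put(struct buf *b)
{
	if (!b || --b->refs > 0)
		return;
	assert(!lock_held);	/* plays the role of might_sleep() */
	free(b->data);
	free(b);
	bufs_freed++;
}

struct holder {
	struct buf *blob;	/* protected by the toy lock */
};

/*
 * Install new_blob.  The old blob is only stashed under the lock and is
 * released after the lock has been dropped.
 */
void holder_set_blob(struct holder *h, struct buf *new_blob)
{
	struct buf *old_blob;

	toy_lock();
	old_blob = h->blob;
	h->blob = new_blob;
	toy_unlock();

	buf_put(old_blob);
}
```

Moving the buf_put() call between toy_lock() and toy_unlock() would trip the assertion as soon as the last reference is dropped, which is the userspace analogue of the "sleeping function called from invalid context" report.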
[PATCH] ceph: use generic_delete_inode() for ->drop_inode
ceph_drop_inode() implementation is not any different from the generic
function, thus there's no point in keeping it around.

Signed-off-by: Luis Henriques
---
 fs/ceph/inode.c | 10 ----------
 fs/ceph/super.c |  2 +-
 fs/ceph/super.h |  1 -
 3 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 761451f36e2d..211140e6ef9c 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -578,16 +578,6 @@ void ceph_destroy_inode(struct inode *inode)
 	ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
 }
 
-int ceph_drop_inode(struct inode *inode)
-{
-	/*
-	 * Positve dentry and corresponding inode are always accompanied
-	 * in MDS reply. So no need to keep inode in the cache after
-	 * dropping all its aliases.
-	 */
-	return 1;
-}
-
 static inline blkcnt_t calc_inode_blocks(u64 size)
 {
 	return (size + (1<<9) - 1) >> 9;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index d57fa60dcd43..b4a4772756cb 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -843,7 +843,7 @@ static const struct super_operations ceph_super_ops = {
 	.destroy_inode	= ceph_destroy_inode,
 	.free_inode	= ceph_free_inode,
 	.write_inode	= ceph_write_inode,
-	.drop_inode	= ceph_drop_inode,
+	.drop_inode	= generic_delete_inode,
 	.sync_fs	= ceph_sync_fs,
 	.put_super	= ceph_put_super,
 	.remount_fs	= ceph_remount,
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 5f27e1f7f2d6..622e6c96c960 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -878,7 +878,6 @@ extern const struct inode_operations ceph_file_iops;
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
 extern void ceph_destroy_inode(struct inode *inode);
 extern void ceph_free_inode(struct inode *inode);
-extern int ceph_drop_inode(struct inode *inode);
 
 extern struct inode *ceph_get_inode(struct super_block *sb,
 				    struct ceph_vino vino);
Re: [PATCH] ceph: fix end offset in truncate_inode_pages_range call
"Jeff Layton" writes:

> On Mon, 2019-07-01 at 18:16 +0100, Luis Henriques wrote:
>> Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to
>> filemap_write_and_wait_range()") fixed the end offset parameter used to
>> call filemap_write_and_wait_range and invalidate_inode_pages2_range.
>> Unfortunately it missed truncate_inode_pages_range, introducing a
>> regression that is easily detected by xfstest generic/130.
>>
>> The problem is that when doing direct IO it is possible that an extra page
>> is truncated from the page cache when the end offset is page aligned.
>> This can cause data loss if that page hasn't been sync'ed to the OSDs.
>>
>> While there, change code to use PAGE_ALIGN macro instead.
>>
>> Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()")
>> Signed-off-by: Luis Henriques
>> ---
>>  fs/ceph/file.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 183c37c0a8fc..7a57db8e2fa9 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1007,7 +1007,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>>  			 * may block.
>>  			 */
>>  			truncate_inode_pages_range(inode->i_mapping, pos,
>> -					(pos+len) | (PAGE_SIZE - 1));
>> +						   PAGE_ALIGN(pos + len) - 1);
>>  
>>  			req->r_mtime = mtime;
>>  		}
>
> Luis, should this be sent to stable? It seems like a data corruption
> problem...

Yes, I believe so.  But I believe all the active stable kernels that
include commit e450f4d1a5d6 (or a backport of it) will pick it up anyway
due to the 'Fixes:' tag.  AFAIK only 5.1 and 5.2 are affected.

Cheers,
--
Luis
[PATCH] ceph: fix end offset in truncate_inode_pages_range call
Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to
filemap_write_and_wait_range()") fixed the end offset parameter used to
call filemap_write_and_wait_range and invalidate_inode_pages2_range.
Unfortunately it missed truncate_inode_pages_range, introducing a
regression that is easily detected by xfstest generic/130.

The problem is that when doing direct IO it is possible that an extra page
is truncated from the page cache when the end offset is page aligned.
This can cause data loss if that page hasn't been sync'ed to the OSDs.

While there, change code to use PAGE_ALIGN macro instead.

Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()")
Signed-off-by: Luis Henriques
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 183c37c0a8fc..7a57db8e2fa9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1007,7 +1007,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 			 * may block.
 			 */
 			truncate_inode_pages_range(inode->i_mapping, pos,
-					(pos+len) | (PAGE_SIZE - 1));
+						   PAGE_ALIGN(pos + len) - 1);
 
 			req->r_mtime = mtime;
 		}
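To make the off-by-one-page case concrete, here is a small userspace sketch of the two end-offset computations. It assumes 4 KiB pages and defines its own PAGE_SIZE and PAGE_ALIGN stand-ins rather than using the kernel headers; old_lend() and new_lend() are hypothetical helper names introduced only for this illustration.

```c
#include <stdio.h>

/* Assumption: 4 KiB pages.  In the kernel, PAGE_SIZE is per-architecture. */
#define PAGE_SIZE 4096ULL
/* Userspace stand-in for the kernel's PAGE_ALIGN(): round up to a page boundary. */
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Inclusive end offset as computed before the patch. */
unsigned long long old_lend(unsigned long long pos, unsigned long long len)
{
	return (pos + len) | (PAGE_SIZE - 1);
}

/* Inclusive end offset as computed after the patch. */
unsigned long long new_lend(unsigned long long pos, unsigned long long len)
{
	return PAGE_ALIGN(pos + len) - 1;
}

/* Print both results for one (pos, len) pair. */
void show(unsigned long long pos, unsigned long long len)
{
	printf("pos=%llu len=%llu old=%llu new=%llu\n",
	       pos, len, old_lend(pos, len), new_lend(pos, len));
}
```

With pos = 0 and len = 8192 (a page-aligned end), old_lend() yields 12287, which reaches into the page starting at offset 8192, while new_lend() yields 8191 and stops at the end of the written range. For an unaligned end such as pos + len = 8193, both helpers return 12287, so the two expressions only differ in the aligned case described in the commit message.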
Re: [RFC PATCH] ceph: initialize superblock s_time_gran to 1
Jeff Layton writes:

> On Thu, 2019-06-27 at 15:44 +, Sage Weil wrote:
>> On Thu, 27 Jun 2019, Jeff Layton wrote:
>> > On Thu, 2019-06-27 at 14:51 +0100, Luis Henriques wrote:
>> > > Having granularity set to 1us results in having inode timestamps with an
>> > > accuracy different from the fuse client (i.e. atime, ctime and mtime will
>> > > always end with '000').  This patch normalizes this behaviour and sets the
>> > > granularity to 1.
>> > >
>> > > Signed-off-by: Luis Henriques
>> > > ---
>> > >  fs/ceph/super.c | 2 +-
>> > >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > >
>> > > Hi!
>> > >
>> > > As far as I could see there are no other side-effects of changing
>> > > s_time_gran but I'm really not sure why it was initially set to 1000 in
>> > > the first place so I may be missing something.
>> > >
>> > > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>> > > index d57fa60dcd43..35dd75bc9cd0 100644
>> > > --- a/fs/ceph/super.c
>> > > +++ b/fs/ceph/super.c
>> > > @@ -980,7 +980,7 @@ static int ceph_set_super(struct super_block *s, void *data)
>> > >  	s->s_d_op = &ceph_dentry_ops;
>> > >  	s->s_export_op = &ceph_export_ops;
>> > >  
>> > > -	s->s_time_gran = 1000;  /* 1000 ns == 1 us */
>> > > +	s->s_time_gran = 1;
>> > >  
>> > >  	ret = set_anon_super(s, NULL);  /* what is that second arg for? */
>> > >  	if (ret != 0)
>> >
>> >
>> > Looks like it was set that way since the client code was originally
>> > merged. Was this an earlier limitation of ceph that is no longer
>> > applicable?
>> >
>> > In any case, I see no need at all to keep this at 1000, so:
>>
>> As long as the encoded on-write time value is at ns resolution, I
>> agree!  No recollection of why I did this :(
>>
>> Reviewed-by: Sage Weil
>
> Good enough for me. I went ahead and merged this into the testing
> branch. Assuming nothing breaks, this should make v5.3.

Awesome, thanks.  AFAICS it shouldn't break anything, especially because
the fuse client seems to be using ns resolution too.  But yeah,
unexpected side-effects show up in unexpected ways :-)

Cheers,
--
Luis
[RFC PATCH] ceph: initialize superblock s_time_gran to 1
Having granularity set to 1us results in having inode timestamps with an
accuracy different from the fuse client (i.e. atime, ctime and mtime will
always end with '000').  This patch normalizes this behaviour and sets the
granularity to 1.

Signed-off-by: Luis Henriques
---
 fs/ceph/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Hi!

As far as I could see there are no other side-effects of changing
s_time_gran but I'm really not sure why it was initially set to 1000 in
the first place so I may be missing something.

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index d57fa60dcd43..35dd75bc9cd0 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -980,7 +980,7 @@ static int ceph_set_super(struct super_block *s, void *data)
 	s->s_d_op = &ceph_dentry_ops;
 	s->s_export_op = &ceph_export_ops;
 
-	s->s_time_gran = 1000;  /* 1000 ns == 1 us */
+	s->s_time_gran = 1;
 
 	ret = set_anon_super(s, NULL);  /* what is that second arg for? */
 	if (ret != 0)
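The "always end with '000'" effect follows from rounding the nanosecond field down to a multiple of the granularity. The userspace sketch below shows that arithmetic; trunc_to_gran() is a hypothetical helper written for this illustration (it is modeled on, but is not, the kernel's timestamp truncation code), and gran is assumed to be a positive number of nanoseconds smaller than one second.

```c
#include <time.h>

/*
 * Hypothetical helper: round the nanosecond part of t down to a multiple
 * of gran (in nanoseconds).  A granularity of 1 leaves the value unchanged.
 */
struct timespec trunc_to_gran(struct timespec t, long gran)
{
	if (gran > 1)
		t.tv_nsec -= t.tv_nsec % gran;
	return t;
}
```

For a timestamp with tv_nsec = 123456789, a granularity of 1000 gives 123456000 (the trailing '000' mentioned in the commit message), while a granularity of 1 keeps all nine digits.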
Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc
Luis Henriques writes:

> "Yan, Zheng" writes:
>
>> On Fri, Mar 22, 2019 at 6:04 PM Luis Henriques wrote:
>>>
>>> Luis Henriques writes:
>>>
>>> > "Yan, Zheng" writes:
>>> >
>>> >> On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques wrote:
>>> >>>
>>> >>> "Yan, Zheng" writes:
>>> >>>
>>> >>> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques wrote:
>>> >>> >>
>>> >>> >> "Yan, Zheng" writes:
>>> >>> >>
>>> >>> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques wrote:
>>> >>> >> >>
>>> >>> >> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
>>> >>> >> >>
>>> >>> >> >> unreferenced object 0x8881fccca940 (size 32):
>>> >>> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
>>> >>> >> >>   hex dump (first 32 bytes):
>>> >>> >> >>     01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
>>> >>> >> >>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> >>> >> >>   backtrace:
>>> >>> >> >>     [<d741a1ea>] build_snap_context+0x5b/0x2a0
>>> >>> >> >>     [<21a00533>] rebuild_snap_realms+0x27/0x90
>>> >>> >> >>     [<ac538600>] rebuild_snap_realms+0x42/0x90
>>> >>> >> >>     [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
>>> >>> >> >>     [<a9550416>] ceph_handle_snap+0x317/0x5f3
>>> >>> >> >>     [<fc287b83>] dispatch+0x362/0x176c
>>> >>> >> >>     [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
>>> >>> >> >>     [<4168e3a9>] process_one_work+0x1d4/0x400
>>> >>> >> >>     [<2188e9e7>] worker_thread+0x2d/0x3c0
>>> >>> >> >>     [<b593e4b3>] kthread+0x112/0x130
>>> >>> >> >>     [<a8587dca>] ret_from_fork+0x35/0x40
>>> >>> >> >>     [<ba1c9c1d>] 0x
>>> >>> >> >>
>>> >>> >> >> It looks like it is possible that we miss a flush_ack from the MDS when,
>>> >>> >> >> for example, umounting the filesystem.  In that case, we can simply drop
>>> >>> >> >> the reference to the ceph_snap_context obtained in ceph_queue_cap_snap().
>>> >>> >> >>
>>> >>> >> >> Link: https://tracker.ceph.com/issues/38224
>>> >>> >> >> Cc: sta...@vger.kernel.org
>>> >>> >> >> Signed-off-by: Luis Henriques
>>> >>> >> >> ---
>>> >>> >> >>  fs/ceph/caps.c | 7 +++++++
>>> >>> >> >>  1 file changed, 7 insertions(+)
>>> >>> >> >>
>>> >>> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> >>> >> >> index 36a8dc699448..208f4dc6f574 100644
>>> >>> >> >> --- a/fs/ceph/caps.c
>>> >>> >> >> +++ b/fs/ceph/caps.c
>>> >>> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
>>> >>> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
>>> >>> >> >>  {
>>> >>> >> >>  	struct ceph_snap_realm *realm = ci->i_snap_realm;
>>> >>> >> >> +
>>> >>> >> >>  	spin_lock(&realm->inodes_with_caps_lock);
>>> >>> >> >>  	list_del_init(&ci->i_snap_realm_item);
>>> >>> >> >>  	ci->i_snap_realm_counter++;
Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc
"Yan, Zheng" writes:

> On Fri, Mar 22, 2019 at 6:04 PM Luis Henriques wrote:
>>
>> Luis Henriques writes:
>>
>> > "Yan, Zheng" writes:
>> >
>> >> On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques wrote:
>> >>>
>> >>> "Yan, Zheng" writes:
>> >>>
>> >>> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques wrote:
>> >>> >>
>> >>> >> "Yan, Zheng" writes:
>> >>> >>
>> >>> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques wrote:
>> >>> >> >>
>> >>> >> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
>> >>> >> >>
>> >>> >> >> unreferenced object 0x8881fccca940 (size 32):
>> >>> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
>> >>> >> >>   hex dump (first 32 bytes):
>> >>> >> >>     01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
>> >>> >> >>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> >>> >> >>   backtrace:
>> >>> >> >>     [<d741a1ea>] build_snap_context+0x5b/0x2a0
>> >>> >> >>     [<21a00533>] rebuild_snap_realms+0x27/0x90
>> >>> >> >>     [<ac538600>] rebuild_snap_realms+0x42/0x90
>> >>> >> >>     [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
>> >>> >> >>     [<a9550416>] ceph_handle_snap+0x317/0x5f3
>> >>> >> >>     [<fc287b83>] dispatch+0x362/0x176c
>> >>> >> >>     [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
>> >>> >> >>     [<4168e3a9>] process_one_work+0x1d4/0x400
>> >>> >> >>     [<2188e9e7>] worker_thread+0x2d/0x3c0
>> >>> >> >>     [<b593e4b3>] kthread+0x112/0x130
>> >>> >> >>     [<a8587dca>] ret_from_fork+0x35/0x40
>> >>> >> >>     [<ba1c9c1d>] 0x
>> >>> >> >>
>> >>> >> >> It looks like it is possible that we miss a flush_ack from the MDS when,
>> >>> >> >> for example, umounting the filesystem.  In that case, we can simply drop
>> >>> >> >> the reference to the ceph_snap_context obtained in ceph_queue_cap_snap().
>> >>> >> >>
>> >>> >> >> Link: https://tracker.ceph.com/issues/38224
>> >>> >> >> Cc: sta...@vger.kernel.org
>> >>> >> >> Signed-off-by: Luis Henriques
>> >>> >> >> ---
>> >>> >> >>  fs/ceph/caps.c | 7 +++++++
>> >>> >> >>  1 file changed, 7 insertions(+)
>> >>> >> >>
>> >>> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> >>> >> >> index 36a8dc699448..208f4dc6f574 100644
>> >>> >> >> --- a/fs/ceph/caps.c
>> >>> >> >> +++ b/fs/ceph/caps.c
>> >>> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
>> >>> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
>> >>> >> >>  {
>> >>> >> >>  	struct ceph_snap_realm *realm = ci->i_snap_realm;
>> >>> >> >> +
>> >>> >> >>  	spin_lock(&realm->inodes_with_caps_lock);
>> >>> >> >>  	list_del_init(&ci->i_snap_realm_item);
>> >>> >> >>  	ci->i_snap_realm_counter++;
>> >>> >> >> @@ -1063,6 +1064,12 @@ static void drop_inode_snap_realm(struct ceph_inode_info *ci)
>> >>> >> >>  	spin_unlock(&realm->inodes_with_caps_lock);
>> >>> >> >>