Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

2021-04-09 Thread Luis Henriques
Nicolas Boichat  writes:

> On Wed, Feb 24, 2021 at 6:44 PM Nicolas Boichat  wrote:
>>
>> On Wed, Feb 24, 2021 at 6:22 PM Luis Henriques  wrote:
>> >
>> > On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote:
>> > > On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques  
>> > > wrote:
>> > > >
>> > > > A regression has been reported by Nicolas Boichat, found while using the
>> > > > copy_file_range syscall to copy a tracefs file.  Before commit
>> > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>> > > > kernel would return -EXDEV to userspace when trying to copy a file across
>> > > > different filesystems.  After this commit, the syscall doesn't fail anymore
>> > > > and instead returns zero (zero bytes copied), as this file's content is
>> > > > generated on-the-fly and thus reports a size of zero.
>> > > >
>> > > > This patch restores some cross-filesystem copy restrictions that existed
>> > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> > > > devices").  Filesystems are still allowed to fall-back to the VFS
>> > > > generic_copy_file_range() implementation, but that has now to be done
>> > > > explicitly.
>> > > >
>> > > > nfsd is also modified to fall-back into generic_copy_file_range() in case
>> > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>> > > >
>> > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
>> > > > Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
>> > > > Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
>> > > > Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
>> > > > Reported-by: Nicolas Boichat 
>> > > > Signed-off-by: Luis Henriques 
>> > >
>> > > I tested v8 and I believe it works for NFS.
>> >
>> > Thanks a lot for the testing.  And to everyone else for reviews,
>> > feedback,... and patience.
>>
>> Thanks so much to you!!!
>>
>> Works here, you can add my
>> Tested-by: Nicolas Boichat 
>
> What happened to this patch? It does not seem to have been picked up
> yet? Any reason why?

Hmm... good question.  I'm not actually sure who would be picking it.  Al,
maybe...?

Cheers,
-- 
Luis

>
>> >
>> > I'll now go look into the manpage and see what needs to be changed.
>> >
>> > Cheers,
>> > --
>> > Luís
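
For userspace callers, the behaviour discussed in this thread suggests treating copy_file_range() as an optimisation that may refuse to work (or may report zero bytes for files whose content is generated on the fly).  A minimal sketch of such a caller follows, assuming Linux with a glibc that provides the copy_file_range() wrapper; the helper name copy_fd_with_fallback() is made up for illustration and is not part of the patch:

```c
#define _GNU_SOURCE
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): copy everything from fd_in to
 * fd_out.  copy_file_range() is tried first; if it fails (for example with
 * EXDEV or EOPNOTSUPP) or copies nothing at all, plain read()/write() is
 * used instead, continuing from the current file offsets.
 * Returns the number of bytes copied, or -1 on error.
 */
ssize_t copy_fd_with_fallback(int fd_in, int fd_out)
{
	char buf[4096];
	ssize_t total = 0, n;

	for (;;) {
		n = copy_file_range(fd_in, NULL, fd_out, NULL, 1 << 20, 0);
		if (n <= 0)
			break;
		total += n;
	}
	if (n == 0 && total > 0)
		return total;	/* data was copied and end of file reached */

	/* Error, or zero bytes copied: fall back to read()/write(). */
	while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {
			ssize_t w = write(fd_out, p, n);

			if (w <= 0)
				return -1;
			p += w;
			n -= w;
			total += w;
		}
	}
	return n < 0 ? -1 : total;
}
```

The zero-bytes case is deliberately treated like a failure on the first call, since a size of zero is exactly what a tracefs file reports.  EINTR handling is left out to keep the sketch short.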



Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2

2021-03-29 Thread Luis Henriques
Vivek Goyal  writes:

> On Mon, Mar 29, 2021 at 03:54:03PM +0100, Luis Henriques wrote:
>> On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote:
>> > Fuse client needs to send additional information to file server when
>> > it calls SETXATTR(system.posix_acl_access). Right now there is no extra
>> > space in fuse_setxattr_in. So introduce a v2 of the structure which has
>> > more space in it and can be used to send extra flags.
>> > 
>> > "struct fuse_setxattr_in_v2" is only used if file server opts-in for it 
>> > using
>> > flag FUSE_SETXATTR_V2 during feature negotiations.
>> > 
>> > Signed-off-by: Vivek Goyal 
>> > ---
>> >  fs/fuse/acl.c |  2 +-
>> >  fs/fuse/fuse_i.h  |  5 -
>> >  fs/fuse/inode.c   |  4 +++-
>> >  fs/fuse/xattr.c   | 21 +++--
>> >  include/uapi/linux/fuse.h | 10 ++
>> >  5 files changed, 33 insertions(+), 9 deletions(-)
>> > 
>> > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
>> > index e9c0f916349d..d31260a139d4 100644
>> > --- a/fs/fuse/acl.c
>> > +++ b/fs/fuse/acl.c
>> > @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, 
>> > struct inode *inode,
>> >return ret;
>> >}
>> >  
>> > -  ret = fuse_setxattr(inode, name, value, size, 0);
>> > +  ret = fuse_setxattr(inode, name, value, size, 0, 0);
>> >kfree(value);
>> >} else {
>> >ret = fuse_removexattr(inode, name);
>> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> > index 63d97a15ffde..d00bf0b9a38c 100644
>> > --- a/fs/fuse/fuse_i.h
>> > +++ b/fs/fuse/fuse_i.h
>> > @@ -668,6 +668,9 @@ struct fuse_conn {
>> >/** Is setxattr not implemented by fs? */
>> >unsigned no_setxattr:1;
>> >  
>> > +  /** Does file server support setxattr_v2 */
>> > +  unsigned setxattr_v2:1;
>> > +
>> >/** Is getxattr not implemented by fs? */
>> >unsigned no_getxattr:1;
>> >  
>> > @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool 
>> > locked);
>> >  bool fuse_lock_inode(struct inode *inode);
>> >  
>> >  int fuse_setxattr(struct inode *inode, const char *name, const void 
>> > *value,
>> > -size_t size, int flags);
>> > +size_t size, int flags, unsigned extra_flags);
>> >  ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
>> >  size_t size);
>> >  ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
>> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> > index b0e18b470e91..1c726df13f80 100644
>> > --- a/fs/fuse/inode.c
>> > +++ b/fs/fuse/inode.c
>> > @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount 
>> > *fm, struct fuse_args *args,
>> >fc->handle_killpriv_v2 = 1;
>> >fm->sb->s_flags |= SB_NOSEC;
>> >}
>> > +  if (arg->flags & FUSE_SETXATTR_V2)
>> > +  fc->setxattr_v2 = 1;
>> >} else {
>> >ra_pages = fc->max_read / PAGE_SIZE;
>> >fc->no_lock = 1;
>> > @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm)
>> >FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
>> >FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
>> >FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
>> > -  FUSE_HANDLE_KILLPRIV_V2;
>> > +  FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2;
>> >  #ifdef CONFIG_FUSE_DAX
>> >if (fm->fc->dax)
>> >ia->in.flags |= FUSE_MAP_ALIGNMENT;
>> > diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
>> > index 1a7d7ace54e1..f2aae72653dc 100644
>> > --- a/fs/fuse/xattr.c
>> > +++ b/fs/fuse/xattr.c
>> > @@ -12,24 +12,33 @@
>> >  #include 
>> >  
>> >  int fuse_setxattr(struct inode *inode, const char *name, const void 
>> > *value,
>> > -size_t size, int flags)
>> > +size_t size, int flags, unsigned extra_flags)
>> >  {
>> >struct fuse_mount *fm = get_fuse_mount(inode);
>> >FUSE_A

Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2

2021-03-29 Thread Luis Henriques
On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote:
> Fuse client needs to send additional information to file server when
> it calls SETXATTR(system.posix_acl_access). Right now there is no extra
> space in fuse_setxattr_in. So introduce a v2 of the structure which has
> more space in it and can be used to send extra flags.
> 
> "struct fuse_setxattr_in_v2" is only used if file server opts-in for it using
> flag FUSE_SETXATTR_V2 during feature negotiations.
> 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/fuse/acl.c |  2 +-
>  fs/fuse/fuse_i.h  |  5 -
>  fs/fuse/inode.c   |  4 +++-
>  fs/fuse/xattr.c   | 21 +++--
>  include/uapi/linux/fuse.h | 10 ++
>  5 files changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
> index e9c0f916349d..d31260a139d4 100644
> --- a/fs/fuse/acl.c
> +++ b/fs/fuse/acl.c
> @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
> inode *inode,
>   return ret;
>   }
>  
> - ret = fuse_setxattr(inode, name, value, size, 0);
> + ret = fuse_setxattr(inode, name, value, size, 0, 0);
>   kfree(value);
>   } else {
>   ret = fuse_removexattr(inode, name);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 63d97a15ffde..d00bf0b9a38c 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -668,6 +668,9 @@ struct fuse_conn {
>   /** Is setxattr not implemented by fs? */
>   unsigned no_setxattr:1;
>  
> + /** Does file server support setxattr_v2 */
> + unsigned setxattr_v2:1;
> +
>   /** Is getxattr not implemented by fs? */
>   unsigned no_getxattr:1;
>  
> @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool 
> locked);
>  bool fuse_lock_inode(struct inode *inode);
>  
>  int fuse_setxattr(struct inode *inode, const char *name, const void *value,
> -   size_t size, int flags);
> +   size_t size, int flags, unsigned extra_flags);
>  ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
> size_t size);
>  ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b0e18b470e91..1c726df13f80 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, 
> struct fuse_args *args,
>   fc->handle_killpriv_v2 = 1;
>   fm->sb->s_flags |= SB_NOSEC;
>   }
> + if (arg->flags & FUSE_SETXATTR_V2)
> + fc->setxattr_v2 = 1;
>   } else {
>   ra_pages = fc->max_read / PAGE_SIZE;
>   fc->no_lock = 1;
> @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm)
>   FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
>   FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
>   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> - FUSE_HANDLE_KILLPRIV_V2;
> + FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2;
>  #ifdef CONFIG_FUSE_DAX
>   if (fm->fc->dax)
>   ia->in.flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
> index 1a7d7ace54e1..f2aae72653dc 100644
> --- a/fs/fuse/xattr.c
> +++ b/fs/fuse/xattr.c
> @@ -12,24 +12,33 @@
>  #include 
>  
>  int fuse_setxattr(struct inode *inode, const char *name, const void *value,
> -   size_t size, int flags)
> +   size_t size, int flags, unsigned extra_flags)
>  {
>   struct fuse_mount *fm = get_fuse_mount(inode);
>   FUSE_ARGS(args);
>   struct fuse_setxattr_in inarg;
> + struct fuse_setxattr_in_v2 inarg_v2;
> + bool setxattr_v2 = fm->fc->setxattr_v2;
>   int err;
>  
>   if (fm->fc->no_setxattr)
>   return -EOPNOTSUPP;
>  
>   memset(&inarg, 0, sizeof(inarg));
> - inarg.size = size;
> - inarg.flags = flags;
> + memset(&inarg_v2, 0, sizeof(inarg_v2));
> + if (setxattr_v2) {
> + inarg_v2.size = size;
> + inarg_v2.flags = flags;
> + inarg_v2.setxattr_flags = extra_flags;
> + } else {
> + inarg.size = size;
> + inarg.flags = flags;
> + }
>   args.opcode = FUSE_SETXATTR;
>   args.nodeid = get_node_id(inode);
>   args.in_numargs = 3;
> - args.in_args[0].size = sizeof(inarg);
> - args.in_args[0].value = &inarg;
> + args.in_args[0].size = setxattr_v2 ? sizeof(inarg_v2) : sizeof(inarg);
> + args.in_args[0].value = setxattr_v2 ? &inarg_v2 : (void *)&inarg;

And yet another minor:

It's a bit awkward to have to cast '&inarg' to 'void *' just because
you're using the ternary operator.  Why not use an 'if' statement instead
for initializing .size and .value?

Cheers,
--
Luís
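
To illustrate the suggestion: the cast is only needed because both arms of a conditional expression must have compatible pointer types (or one of them must be 'void *').  With an 'if' statement each assignment converts to 'void *' implicitly.  A rough, self-contained sketch with stand-in types; the mock_* names and struct layouts are assumptions for illustration, not the real fuse definitions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the kernel types; the layouts are assumptions. */
struct mock_setxattr_in {
	uint32_t size;
	uint32_t flags;
};

struct mock_setxattr_in_v2 {
	uint32_t size;
	uint32_t flags;
	uint32_t setxattr_flags;
	uint32_t padding;
};

struct mock_in_arg {
	unsigned int size;
	const void *value;
};

/* Fill the first request argument with either the v1 or the v2 header. */
void mock_fill_setxattr_arg(struct mock_in_arg *arg, bool use_v2,
			    struct mock_setxattr_in *v1,
			    struct mock_setxattr_in_v2 *v2,
			    size_t size, int flags, unsigned int extra_flags)
{
	if (use_v2) {
		memset(v2, 0, sizeof(*v2));
		v2->size = size;
		v2->flags = flags;
		v2->setxattr_flags = extra_flags;
		arg->size = sizeof(*v2);
		arg->value = v2;	/* implicit conversion, no cast */
	} else {
		memset(v1, 0, sizeof(*v1));
		v1->size = size;
		v1->flags = flags;
		arg->size = sizeof(*v1);
		arg->value = v1;	/* implicit conversion, no cast */
	}
}
```

This also keeps the size/flags initialisation and the .size/.value setup for each variant in one place.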

>   

Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2

2021-03-29 Thread Luis Henriques
On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote:
> Fuse client needs to send additional information to file server when
> it calls SETXATTR(system.posix_acl_access). Right now there is no extra
> space in fuse_setxattr_in. So introduce a v2 of the structure which has
> more space in it and can be used to send extra flags.
> 
> "struct fuse_setxattr_in_v2" is only used if file server opts-in for it using
> flag FUSE_SETXATTR_V2 during feature negotiations.
> 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/fuse/acl.c |  2 +-
>  fs/fuse/fuse_i.h  |  5 -
>  fs/fuse/inode.c   |  4 +++-
>  fs/fuse/xattr.c   | 21 +++--
>  include/uapi/linux/fuse.h | 10 ++
>  5 files changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
> index e9c0f916349d..d31260a139d4 100644
> --- a/fs/fuse/acl.c
> +++ b/fs/fuse/acl.c
> @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
> inode *inode,
>   return ret;
>   }
>  
> - ret = fuse_setxattr(inode, name, value, size, 0);
> + ret = fuse_setxattr(inode, name, value, size, 0, 0);
>   kfree(value);
>   } else {
>   ret = fuse_removexattr(inode, name);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 63d97a15ffde..d00bf0b9a38c 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -668,6 +668,9 @@ struct fuse_conn {
>   /** Is setxattr not implemented by fs? */
>   unsigned no_setxattr:1;
>  
> + /** Does file server support setxattr_v2 */
> + unsigned setxattr_v2:1;
> +

Minor (pedantic!) comment: most of the fields here start with 'no_*', so
maybe it's worth setting the logic to use 'no_setxattr_v2' instead?

Cheers,
--
Luís


>   /** Is getxattr not implemented by fs? */
>   unsigned no_getxattr:1;
>  
> @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool 
> locked);
>  bool fuse_lock_inode(struct inode *inode);
>  
>  int fuse_setxattr(struct inode *inode, const char *name, const void *value,
> -   size_t size, int flags);
> +   size_t size, int flags, unsigned extra_flags);
>  ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
> size_t size);
>  ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b0e18b470e91..1c726df13f80 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, 
> struct fuse_args *args,
>   fc->handle_killpriv_v2 = 1;
>   fm->sb->s_flags |= SB_NOSEC;
>   }
> + if (arg->flags & FUSE_SETXATTR_V2)
> + fc->setxattr_v2 = 1;
>   } else {
>   ra_pages = fc->max_read / PAGE_SIZE;
>   fc->no_lock = 1;
> @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm)
>   FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
>   FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
>   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> - FUSE_HANDLE_KILLPRIV_V2;
> + FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2;
>  #ifdef CONFIG_FUSE_DAX
>   if (fm->fc->dax)
>   ia->in.flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
> index 1a7d7ace54e1..f2aae72653dc 100644
> --- a/fs/fuse/xattr.c
> +++ b/fs/fuse/xattr.c
> @@ -12,24 +12,33 @@
>  #include 
>  
>  int fuse_setxattr(struct inode *inode, const char *name, const void *value,
> -   size_t size, int flags)
> +   size_t size, int flags, unsigned extra_flags)
>  {
>   struct fuse_mount *fm = get_fuse_mount(inode);
>   FUSE_ARGS(args);
>   struct fuse_setxattr_in inarg;
> + struct fuse_setxattr_in_v2 inarg_v2;
> + bool setxattr_v2 = fm->fc->setxattr_v2;
>   int err;
>  
>   if (fm->fc->no_setxattr)
>   return -EOPNOTSUPP;
>  
>   memset(&inarg, 0, sizeof(inarg));
> - inarg.size = size;
> - inarg.flags = flags;
> + memset(&inarg_v2, 0, sizeof(inarg_v2));
> + if (setxattr_v2) {
> + inarg_v2.size = size;
> + inarg_v2.flags = flags;
> + inarg_v2.setxattr_flags = extra_flags;
> + } else {
> + inarg.size = size;
> + inarg.flags = flags;
> + }
>   args.opcode = FUSE_SETXATTR;
>   args.nodeid = get_node_id(inode);
>   args.in_numargs = 3;
> - args.in_args[0].size = sizeof(inarg);
> - args.in_args[0].value = &inarg;
> + args.in_args[0].size = setxattr_v2 ? sizeof(inarg_v2) : sizeof(inarg);
> + args.in_args[0].value = setxattr_v2 ? &inarg_v2 : (void *)&inarg;
>   args.in_args[1].size = strlen(name) + 1;
>   

Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-29 Thread Luis Henriques
On Fri, Mar 19, 2021 at 09:02:33AM +0000, Luis Henriques wrote:
> On Thu, Mar 18, 2021 at 11:55:43AM +0000, Matthew Wilcox wrote:
> > On Thu, Mar 18, 2021 at 11:29:28AM +0000, Luis Henriques wrote:
> > > On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0 
> > > > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 
> > > > > > compound_pincount:0
> > > > > 
> > > > > This is a compound page alright.   Have no idea how it got into fuse's
> > > > > pagecache.
> > > > 
> > > > 
> > > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?
> > > 
> > > Yes, it looks like Tumbleweed kernels have that config option enabled by
> > > > default.  And this feature was introduced in 5.4 (the bug doesn't seem
> > > to be reproducible in 5.3).
> > 
> > Can you try adding this patch?
> > 
> > https://git.infradead.org/users/willy/pagecache.git/commitdiff/369a4fcd78369b7a026bdef465af9669bde98ef4
> 
> Good news, looks like this patch fixes the issue[1].  Thanks a lot
> everyone.  Is this already queued somewhere for 5.12?  Also, it would be
> nice to have it Cc'ed for stable kernels >= 5.4.

Ping.  Are you planning to push this for 5.12, or is that queued for the
5.13 merge window?  Or "none of the above"? :)

Cheers,
--
Luís


Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-19 Thread Luis Henriques
On Thu, Mar 18, 2021 at 11:55:43AM +0000, Matthew Wilcox wrote:
> On Thu, Mar 18, 2021 at 11:29:28AM +0000, Luis Henriques wrote:
> > On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0 
> > > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 
> > > > > compound_pincount:0
> > > > 
> > > > This is a compound page alright.   Have no idea how it got into fuse's
> > > > pagecache.
> > > 
> > > 
> > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?
> > 
> > Yes, it looks like Tumbleweed kernels have that config option enabled by
> > > default.  And this feature was introduced in 5.4 (the bug doesn't seem
> > to be reproducible in 5.3).
> 
> Can you try adding this patch?
> 
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/369a4fcd78369b7a026bdef465af9669bde98ef4

Good news, looks like this patch fixes the issue[1].  Thanks a lot
everyone.  Is this already queued somewhere for 5.12?  Also, it would be
nice to have it Cc'ed for stable kernels >= 5.4.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1182929#c24

Cheers,
--
Luís


Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-18 Thread Luis Henriques
On Thu, Mar 18, 2021 at 11:55:43AM +0000, Matthew Wilcox wrote:
> On Thu, Mar 18, 2021 at 11:29:28AM +0000, Luis Henriques wrote:
> > On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > > > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0 
> > > > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 
> > > > > compound_pincount:0
> > > > 
> > > > This is a compound page alright.   Have no idea how it got into fuse's
> > > > pagecache.
> > > 
> > > 
> > > Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?
> > 
> > Yes, it looks like Tumbleweed kernels have that config option enabled by
> > > default.  And this feature was introduced in 5.4 (the bug doesn't seem
> > to be reproducible in 5.3).
> 
> Can you try adding this patch?
> 
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/369a4fcd78369b7a026bdef465af9669bde98ef4

Yep, sure.  Unfortunately, the testing round-trip can be a bit high.  I'll
push a new kernel build and ask the reporter to give it a try.

[ I'll add this patch on top of the s/BUG_ON/VM_BUG_ON_PAGE change. ]

Cheers,
--
Luís


Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-18 Thread Luis Henriques
On Thu, Mar 18, 2021 at 02:03:02PM +0300, Kirill A. Shutemov wrote:
> On Thu, Mar 18, 2021 at 11:59:59AM +0100, Miklos Szeredi wrote:
> > [CC linux-mm]
> > 
> > On Thu, Mar 18, 2021 at 10:25 AM Luis Henriques  wrote:
> > >
> > > (I thought Vlastimil was already on CC...)
> > >
> > > > On Mon, Mar 15, 2021 at 11:06:59AM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Mar 15, 2021 at 09:47:45AM +0000, Luis Henriques wrote:
> > > > > On Fri, Mar 12, 2021 at 01:11:23PM +0000, Matthew Wilcox wrote:
> > > > > > > On Fri, Mar 12, 2021 at 12:21:59PM +0000, Luis Henriques wrote:
> > > > > > > > > I've seen a bug report (5.10.16 kernel splat below) that 
> > > > > > > > > seems to be
> > > > > > > > > reproducible in kernels as early as 5.4.
> > > > > >
> > > > > > If this is reproducible, can you turn this BUG_ON into a 
> > > > > > VM_BUG_ON_PAGE()
> > > > > > so we know what kind of problem we're dealing with?  Assuming the 
> > > > > > SUSE
> > > > > > tumbleweed kernels enable CONFIG_DEBUG_VM, which I'm sure they do.
> > > > >
> > > > > Just to make sure I got this right, you want to test something like 
> > > > > this:
> > > > >
> > > > > }
> > > > > }
> > > > > -   BUG_ON(page_mapped(page));
> > > > > +   VM_BUG_ON_PAGE(page_mapped(page), page);
> > > > > ret2 = do_launder_page(mapping, page);
> > > > > if (ret2 == 0) {
> > > > > if (!invalidate_complete_page2(mapping, 
> > > > > page))
> > > >
> > > > Yes, exactly.
> > >
> > > > Ok, finally I got some feedback from the bug reporter.  Please see below
> > > the kernel log with the VM_BUG_ON_PAGE() in place.  Also note that this is
> > > on a 5.12-rc3, vanilla.
> > >
> > > Cheers,
> > > --
> > > Luís
> > >
> > > [16247.536348] page:dfe36ab1 refcount:673 mapcount:0 
> > > mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
> > > [16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 
> > > compound_pincount:0
> > 
> > This is a compound page alright.   Have no idea how it got into fuse's
> > pagecache.
> 
> 
> Luis, do you have CONFIG_READ_ONLY_THP_FOR_FS enabled?

Yes, it looks like Tumbleweed kernels have that config option enabled by
default.  And this feature was introduced in 5.4 (the bug doesn't seem
to be reproducible in 5.3).

Cheers,
--
Luís
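
As a side note on reading the page dump: "order:9" means the compound page spans 2^9 base pages, which is the huge page size one would expect from THP on x86-64.  The small sketch below redoes that arithmetic and checks that the reported index and pfn are aligned to the compound page size.  It assumes 4 KiB base pages (not stated in the log), and the helper names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed base page size: 4 KiB (x86-64 default); not stated in the log. */
#define BASE_PAGE_SHIFT 12

/* Number of base pages covered by a compound page of the given order. */
uint64_t pages_in_order(unsigned int order)
{
	return UINT64_C(1) << order;
}

/* Size in bytes of a compound page of the given order. */
uint64_t bytes_in_order(unsigned int order)
{
	return pages_in_order(order) << BASE_PAGE_SHIFT;
}

/* True if value (a page-cache index or a pfn) is a multiple of the
 * number of base pages in a compound page of the given order. */
bool aligned_to_order(uint64_t value, unsigned int order)
{
	return (value & (pages_in_order(order) - 1)) == 0;
}
```

With order 9 this gives 512 base pages, i.e. 2 MiB, and both index 0x1400 and pfn 0x4c65e00 from the dump are multiples of 512.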


> > > [16247.536361] memcg:8e730012b000
> > > [16247.536364] aops:fuse_file_aops [fuse] ino:8b8 dentry name:"cc1plus"
> > > [16247.536379] flags: 
> > > 0xa800010037(locked|referenced|uptodate|lru|active|head)
> > > [16247.536385] raw: 00a800010037 d6519ed9c448 d651abea5b08 
> > > 8eb2f9a02ef8
> > > [16247.536388] raw: 1400  02a1 
> > > 8e730012b000
> > > [16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> > > > [16247.536399] ------------[ cut here ]------------
> > > [16247.536400] kernel BUG at mm/truncate.c:678!
> > > [16247.536406] invalid opcode:  [#1] SMP PTI
> > > [16247.536416] CPU: 42 PID: 2063761 Comm: g++ Not tainted 
> > > 5.12.0-rc3-1.g008d601-default #1 openSUSE Tumbleweed (unreleased)
> > > [16247.536423] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a 
> > > 10/16/2019
> > > [16247.536427] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550
> > > [16247.536436] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 
> > > 01 00 84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 
> > > 00 <0f> 0b 48 8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff
> > > [16247.536444] RSP: :a18cb0af7a40 EFLAGS: 00010246
> > > [16247.536450] RAX: 0036 RBX: 000d RCX: 
> > > 8ef13fc9a748
> > > [16247.536455] RDX:  RSI: 0027 RDI: 
> > > 8ef13fc9a740
> > > [16247.536460] RBP: 8eb2f9a02ef8 R08: 8ef23ffb48a8 R09: 
> > > 0004fffb
> > > [16247.536464] R10:  R11: 3fff R12: 
> > > 1400
> > > [16247

Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-18 Thread Luis Henriques
(I thought Vlastimil was already on CC...)

On Mon, Mar 15, 2021 at 11:06:59AM +0000, Matthew Wilcox wrote:
> On Mon, Mar 15, 2021 at 09:47:45AM +0000, Luis Henriques wrote:
> > On Fri, Mar 12, 2021 at 01:11:23PM +0000, Matthew Wilcox wrote:
> > > On Fri, Mar 12, 2021 at 12:21:59PM +0000, Luis Henriques wrote:
> > > > > > I've seen a bug report (5.10.16 kernel splat below) that seems to be
> > > > > > reproducible in kernels as early as 5.4.
> > > 
> > > If this is reproducible, can you turn this BUG_ON into a VM_BUG_ON_PAGE()
> > > so we know what kind of problem we're dealing with?  Assuming the SUSE
> > > tumbleweed kernels enable CONFIG_DEBUG_VM, which I'm sure they do.
> > 
> > Just to make sure I got this right, you want to test something like this:
> > 
> > }
> > }
> > -   BUG_ON(page_mapped(page));
> > +   VM_BUG_ON_PAGE(page_mapped(page), page);
> > ret2 = do_launder_page(mapping, page);
> > if (ret2 == 0) {
> > if (!invalidate_complete_page2(mapping, page))
> 
> Yes, exactly.

Ok, finally I got some feedback from the bug reporter.  Please see below
the kernel log with the VM_BUG_ON_PAGE() in place.  Also note that this is
on a 5.12-rc3, vanilla.

Cheers,
--
Luís

[16247.536348] page:dfe36ab1 refcount:673 mapcount:0 
mapping:f982a7f8 index:0x1400 pfn:0x4c65e00
[16247.536359] head:dfe36ab1 order:9 compound_mapcount:0 
compound_pincount:0
[16247.536361] memcg:8e730012b000
[16247.536364] aops:fuse_file_aops [fuse] ino:8b8 dentry name:"cc1plus"
[16247.536379] flags: 
0xa800010037(locked|referenced|uptodate|lru|active|head)
[16247.536385] raw: 00a800010037 d6519ed9c448 d651abea5b08 
8eb2f9a02ef8
[16247.536388] raw: 1400  02a1 
8e730012b000
[16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
[16247.536399] ------------[ cut here ]------------
[16247.536400] kernel BUG at mm/truncate.c:678!
[16247.536406] invalid opcode:  [#1] SMP PTI
[16247.536416] CPU: 42 PID: 2063761 Comm: g++ Not tainted 
5.12.0-rc3-1.g008d601-default #1 openSUSE Tumbleweed (unreleased)
[16247.536423] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a 
10/16/2019
[16247.536427] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550
[16247.536436] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 01 00 
84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 00 <0f> 0b 48 
8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff
[16247.536444] RSP: :a18cb0af7a40 EFLAGS: 00010246
[16247.536450] RAX: 0036 RBX: 000d RCX: 8ef13fc9a748
[16247.536455] RDX:  RSI: 0027 RDI: 8ef13fc9a740
[16247.536460] RBP: 8eb2f9a02ef8 R08: 8ef23ffb48a8 R09: 0004fffb
[16247.536464] R10:  R11: 3fff R12: 1400
[16247.536468] R13: 8eb2f9a02f00 R14:  R15: d651b1978000
[16247.536473] FS:  7f97c1717740() GS:8ef13fc8() 
knlGS:
[16247.536478] CS:  0010 DS:  ES:  CR0: 80050033
[16247.536483] CR2: 7fd48a25a7c0 CR3: 0040aa3ac006 CR4: 007706e0
[16247.536487] DR0:  DR1:  DR2: 
[16247.536491] DR3:  DR6: fffe0ff0 DR7: 0400
[16247.536495] PKRU: 5554
[16247.536498] Call Trace:
[16247.536506]  fuse_finish_open+0x82/0x150 [fuse]
[16247.536520]  fuse_open_common+0x1a8/0x1b0 [fuse]
[16247.536530]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
[16247.536540]  do_dentry_open+0x14e/0x380
[16247.536547]  path_openat+0xaf6/0x10a0
[16247.536555]  do_filp_open+0x88/0x130
[16247.536560]  ? security_prepare_creds+0x6d/0x90
[16247.536566]  ? __kmalloc+0x157/0x2e0
[16247.536575]  do_open_execat+0x6d/0x1a0
[16247.536581]  bprm_execve+0x128/0x660
[16247.536587]  do_execveat_common+0x192/0x1c0
[16247.536593]  __x64_sys_execve+0x39/0x50
[16247.536599]  do_syscall_64+0x33/0x80
[16247.536606]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[16247.536614] RIP: 0033:0x7f97c0efec37
[16247.536621] Code: Unable to access opcode bytes at RIP 0x7f97c0efec0d.
[16247.536625] RSP: 002b:7ffdc2fdea68 EFLAGS: 0202 ORIG_RAX: 
003b
[16247.536631] RAX: ffda RBX: 7f97c17176a0 RCX: 7f97c0efec37
[16247.536635] RDX: 00ea42c0 RSI: 00ea5848 RDI: 00ea5d00
[16247.536639] RBP: 0001 R08:  R09: 
[16247.536643] R10: 7ffdc2fdde60 R11: 0202 R12: 
[16247.536647] R13: 0001 R14: 00ea5d00 R15: 
[16247.536653] Modules l

Re: [PATCH mm] kfence: make compatible with kmemleak

2021-03-17 Thread Luis Henriques
On Wed, Mar 17, 2021 at 09:47:40AM +0100, Marco Elver wrote:
> Because memblock allocations are registered with kmemleak, the KFENCE
> pool was seen by kmemleak as one large object. Later allocations through
> kfence_alloc() that were registered with kmemleak via
> slab_post_alloc_hook() would then overlap and trigger a warning.
> Therefore, once the pool is initialized, we can remove (free) it from
> kmemleak again, since it should be treated as allocator-internal and be
> seen as "free memory".
> 
> The second problem is that kmemleak is passed the rounded size, and not
> the originally requested size, which is also the size of KFENCE objects.
> To avoid kmemleak scanning past the end of an object and trigger a
> KFENCE out-of-bounds error, fix the size if it is a KFENCE object.
> 
> For simplicity, to avoid a call to kfence_ksize() in
> slab_post_alloc_hook() (and avoid new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
> guard), just call kfence_ksize() in mm/kmemleak.c:create_object().
> 
> Reported-by: Luis Henriques 
> Cc: Catalin Marinas 
> Signed-off-by: Marco Elver 

Tested-by: Luis Henriques 

> ---
>  mm/kfence/core.c | 9 +
>  mm/kmemleak.c| 3 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index f7106f28443d..768dbd58170d 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -481,6 +482,14 @@ static bool __init kfence_init_pool(void)
>   addr += 2 * PAGE_SIZE;
>   }
>  
> + /*
> +  * The pool is live and will never be deallocated from this point on.
> +  * Remove the pool object from the kmemleak object tree, as it would
> +  * otherwise overlap with allocations returned by kfence_alloc(), which
> +  * are registered with kmemleak through the slab post-alloc hook.
> +  */
> + kmemleak_free(__kfence_pool);
> +
>   return true;
>  
>  err:
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index c0014d3b91c1..fe6e3ae8e8c6 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -97,6 +97,7 @@
>  #include 
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -589,7 +590,7 @@ static struct kmemleak_object *create_object(unsigned 
> long ptr, size_t size,
>   atomic_set(&object->use_count, 1);
>   object->flags = OBJECT_ALLOCATED;
>   object->pointer = ptr;
> - object->size = size;
> + object->size = kfence_ksize((void *)ptr) ?: size;
>   object->excess_ref = 0;
>   object->min_count = min_count;
>   object->count = 0;  /* white color initially */
> -- 
> 2.31.0.rc2.261.g7f71774620-goog
> 
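
For readers not used to the "?:" form in the kmemleak hunk: it is the GNU C conditional with the middle operand omitted, so "a ?: b" means "a if a is non-zero, otherwise b", with a evaluated only once.  The sketch below mimics the size selection with a stand-in for kfence_ksize(); the mock pool, the object size of 24 and the function names are assumptions for illustration only:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the KFENCE pool; illustration only. */
static char mock_pool[64];
#define MOCK_OBJECT_SIZE 24

/*
 * Hypothetical stand-in for kfence_ksize(): returns the originally
 * requested object size for pointers inside the mock pool, 0 otherwise.
 */
size_t mock_kfence_ksize(const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;
	uintptr_t start = (uintptr_t)mock_pool;

	return (p >= start && p < start + sizeof(mock_pool)) ?
		MOCK_OBJECT_SIZE : 0;
}

/* Pick the size to track for an object, as the patched line does. */
size_t tracked_size(const void *ptr, size_t rounded_size)
{
	/* GNU extension: use the pool-reported size if non-zero. */
	return mock_kfence_ksize(ptr) ?: rounded_size;
}
```

For a pointer inside the pool the smaller, originally requested size wins, so a scanner using it would not read past the end of the object; for any other pointer the rounded size is kept unchanged.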


[PATCH v2] virtiofs: fix memory leak in virtio_fs_probe()

2021-03-17 Thread Luis Henriques
When accidentally passing the same tag twice to qemu, kmemleak ended up
reporting a memory leak in virtiofs.  Also, looking at the log I saw the
following error (that's when I realised the tag was duplicated):

  virtiofs: probe of virtio5 failed with error -17

Here's the kmemleak log for reference:

unreferenced object 0x888103d47800 (size 1024):
  comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
  hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  
  backtrace:
[<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
[<f8aca419>] virtio_dev_probe+0x15f/0x210
[<4d6baf3c>] really_probe+0xea/0x430
[<a6ceeac8>] device_driver_attach+0xa8/0xb0
[<196f47a7>] __driver_attach+0x98/0x140
[<0b20601d>] bus_for_each_dev+0x7b/0xc0
[<399c7b7f>] bus_add_driver+0x11b/0x1f0
[<32b09ba7>] driver_register+0x8f/0xe0
[<cdd55998>] 0xa002c013
[<0ea196a2>] do_one_initcall+0x64/0x2e0
[<08f727ce>] do_init_module+0x5c/0x260
[<3cdedab6>] __do_sys_finit_module+0xb5/0x120
[<ad2f48c6>] do_syscall_64+0x33/0x40
[<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: sta...@vger.kernel.org
Signed-off-by: Luis Henriques 
---
Changes since v1:
- Use kfree() to free fs->vqs instead of calling virtio_fs_put()

 fs/fuse/virtio_fs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8868ac31a3c0..989ef4f88636 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -896,6 +896,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 out_vqs:
vdev->config->reset(vdev);
virtio_fs_cleanup_vqs(vdev, fs);
+   kfree(fs->vqs);
 
 out:
vdev->priv = NULL;
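
The pattern behind this one-line fix is the usual goto-based unwinding: each error label releases what was allocated before the failing step, and a missing release only shows up on the rarely taken path (here -17, i.e. -EEXIST on Linux).  A simplified userspace sketch follows; the mock_* names, the allocation counter and the sizes are assumptions for illustration and do not mirror the real driver:

```c
#include <errno.h>
#include <stdlib.h>

struct mock_fs {
	void *vqs;
};

/* Counts live allocations so that a leak would be observable. */
int live_allocs;

static void *counted_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		live_allocs++;
	return p;
}

static void counted_free(void *p)
{
	if (p)
		live_allocs--;
	free(p);
}

/* Mock probe: fails with -EEXIST when the tag is already registered. */
int mock_probe(int tag_already_registered)
{
	struct mock_fs *fs;
	int ret;

	fs = counted_alloc(sizeof(*fs));
	if (!fs)
		return -ENOMEM;

	fs->vqs = counted_alloc(1024);
	if (!fs->vqs) {
		ret = -ENOMEM;
		goto out;
	}

	if (tag_already_registered) {
		ret = -EEXIST;
		goto out_vqs;
	}

	/* A real driver would keep fs alive here; the sketch tears it
	 * down again so the counter returns to zero. */
	ret = 0;

out_vqs:
	counted_free(fs->vqs);	/* the release that was missing */
out:
	counted_free(fs);
	return ret;
}
```

Dropping the counted_free(fs->vqs) line leaves live_allocs at 1 after a failed probe, which is the kind of leftover object kmemleak reports.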


Re: Issue with kfence and kmemleak

2021-03-17 Thread Luis Henriques
On Tue, Mar 16, 2021 at 07:47:00PM +0100, Marco Elver wrote:
> On Tue, Mar 16, 2021 at 06:19PM +0000, Catalin Marinas wrote:
> > On Tue, Mar 16, 2021 at 06:30:00PM +0100, Marco Elver wrote:
> > > On Tue, Mar 16, 2021 at 04:42PM +0000, Luis Henriques wrote:
> > > > This is probably a known issue, but just in case: looks like it's not
> > > > possible to use kmemleak when kfence is enabled:
> > > > 
> > > > [0.272136] kmemleak: Cannot insert 0x888236e02f00 into the 
> > > > object search tree (overlaps existing)
> > > > [0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92
> > > > [0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > > BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> > > > [0.272136] Call Trace:
> > > > [0.272136]  dump_stack+0x6d/0x89
> > > > [0.272136]  create_object.isra.0.cold+0x40/0x62
> > > > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > > > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > > > [0.272136]  kmem_cache_alloc_trace+0x110/0x2f0
> > > > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > > > [0.272136]  kthread+0x3f/0x150
> > > > [0.272136]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
> > > > [0.272136]  ? __kthread_bind_mask+0x60/0x60
> > > > [0.272136]  ret_from_fork+0x22/0x30
> > > > [0.272136] kmemleak: Kernel memory leak detector disabled
> > > > [0.272136] kmemleak: Object 0x888236e0 (size 2097152):
> > > > [0.272136] kmemleak:   comm "swapper", pid 0, jiffies 4294892296
> > > > [0.272136] kmemleak:   min_count = 0
> > > > [0.272136] kmemleak:   count = 0
> > > > [0.272136] kmemleak:   flags = 0x1
> > > > [0.272136] kmemleak:   checksum = 0
> > > > [0.272136] kmemleak:   backtrace:
> > > > [0.272136]  memblock_alloc_internal+0x6d/0xb0
> > > > [0.272136]  memblock_alloc_try_nid+0x6c/0x8a
> > > > [0.272136]  kfence_alloc_pool+0x26/0x3f
> > > > [0.272136]  start_kernel+0x242/0x548
> > > > [0.272136]  secondary_startup_64_no_verify+0xb0/0xbb
> > > > 
> > > > I've tried the hack below but it didn't really help.  Obviously I don't
> > > > really understand what's going on ;-)  But I think the reason for this
> > > > patch not working as (I) expected is because kfence is initialised
> > > > *before* kmemleak.
> > > > 
> > > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> > > > index 3b8ec938470a..b4ffd7695268 100644
> > > > --- a/mm/kfence/core.c
> > > > +++ b/mm/kfence/core.c
> > > > @@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void)
> > > >  
> > > > if (!__kfence_pool)
> > > > pr_err("failed to allocate pool\n");
> > > > +   kmemleak_no_scan(__kfence_pool);
> > > >  }
> > > 
> > > Can you try the below patch?
> > > 
> > > Thanks,
> > > -- Marco
> > > 
> > > -- >8 --
> > > 
> > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> > > index f7106f28443d..5891019721f6 100644
> > > --- a/mm/kfence/core.c
> > > +++ b/mm/kfence/core.c
> > > @@ -12,6 +12,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -481,6 +482,13 @@ static bool __init kfence_init_pool(void)
> > >   addr += 2 * PAGE_SIZE;
> > >   }
> > >  
> > > + /*
> > > +  * The pool is live and will never be deallocated from this point on;
> > > +  * tell kmemleak this is now free memory, so that later allocations can
> > > +  * correctly be tracked.
> > > +  */
> > > + kmemleak_free_part_phys(__pa(__kfence_pool), KFENCE_POOL_SIZE);
> > 
> > I presume this pool does not refer any objects that are only tracked
> > through pool pointers.
> 
> No, at this point this memory should not have been touched by anything.
> 
> > kmemleak_free() (or *_free_part) should work, no need for the _phys
> > variant (which converts it back with __va).
> 
> Will fix.
> 
> > Since we normally use kmemleak_ignore() (or no_scan) for objects we
> > don't care about, I'd exp

Re: [PATCH] virtiofs: fix memory leak in virtio_fs_probe()

2021-03-16 Thread Luis Henriques
Vivek Goyal  writes:

> On Tue, Mar 16, 2021 at 05:02:34PM +0000, Luis Henriques wrote:
>> When accidentally passing the same tag twice to qemu, kmemleak ended up
>> reporting a memory leak in virtiofs.  Also, looking at the log I saw the
>> following error (that's when I realised the duplicated tag):
>> 
>>   virtiofs: probe of virtio5 failed with error -17
>> 
>> Here's the kmemleak log for reference:
>> 
>> unreferenced object 0x888103d47800 (size 1024):
>>   comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
>>   hex dump (first 32 bytes):
>> 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
>> ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  
>>   backtrace:
>> [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
>> [<f8aca419>] virtio_dev_probe+0x15f/0x210
>> [<4d6baf3c>] really_probe+0xea/0x430
>> [<a6ceeac8>] device_driver_attach+0xa8/0xb0
>> [<196f47a7>] __driver_attach+0x98/0x140
>> [<0b20601d>] bus_for_each_dev+0x7b/0xc0
>> [<399c7b7f>] bus_add_driver+0x11b/0x1f0
>> [<32b09ba7>] driver_register+0x8f/0xe0
>> [<cdd55998>] 0xa002c013
>> [<0ea196a2>] do_one_initcall+0x64/0x2e0
>> [<08f727ce>] do_init_module+0x5c/0x260
>> [<3cdedab6>] __do_sys_finit_module+0xb5/0x120
>> [<ad2f48c6>] do_syscall_64+0x33/0x40
>> [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> 
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Luis Henriques 
>
> Hi Luis,
>
> Thanks for the report and the fix. So it looks like the leak happens
> because we are not doing kfree(fs->vqs) in the error path.

Yep!

>> ---
>>  fs/fuse/virtio_fs.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 8868ac31a3c0..4e6ef9f24e84 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -899,7 +899,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
>>  
>>  out:
>>  vdev->priv = NULL;
>> -kfree(fs);
>> +virtio_fs_put(fs);
>
> [ CC virtio-fs list ]

Oops, forgot to include it.  Maybe it should be added to the MAINTAINERS
file (although IIRC it's not an open list).

> The fs object is not fully formed, so calling virtio_fs_put() is a little
> odd. I would expect it to be called only if somebody has taken a reference
> using _get(), or in the final virtio_fs_remove() when the creation
> reference should go away.
>
> How about open-coding it and freeing fs->vqs explicitly?  Something like
> the following:

Ok, I'll send v2 later (I'm currently away from my devel workstation).  To
be honest, my initial version was doing exactly what you're suggesting.  I
decided to change to virtio_fs_put() because the refcount was already
initialised early in the function.  Bad decision.

Cheers,
-- 
Luis

>
> @@ -896,7 +896,7 @@ static int virtio_fs_probe(struct virtio
>  out_vqs:
> vdev->config->reset(vdev);
> virtio_fs_cleanup_vqs(vdev, fs);
> -
> +   kfree(fs->vqs);
>  out:
> vdev->priv = NULL;
> kfree(fs);
>
> Thanks
> Vivek
>


Re: Issue with kfence and kmemleak

2021-03-16 Thread Luis Henriques
On Tue, Mar 16, 2021 at 06:30:00PM +0100, Marco Elver wrote:
> On Tue, Mar 16, 2021 at 04:42PM +0000, Luis Henriques wrote:
> > Hi!
> > 
> > This is probably a known issue, but just in case: looks like it's not
> > possible to use kmemleak when kfence is enabled:
> 
> Thanks for spotting this.
> 
> > [0.272136] kmemleak: Cannot insert 0x888236e02f00 into the object 
> > search tree (overlaps existing)
> > [0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92
> > [0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> > [0.272136] Call Trace:
> > [0.272136]  dump_stack+0x6d/0x89
> > [0.272136]  create_object.isra.0.cold+0x40/0x62
> > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > [0.272136]  kmem_cache_alloc_trace+0x110/0x2f0
> > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > [0.272136]  kthread+0x3f/0x150
> > [0.272136]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
> > [0.272136]  ? __kthread_bind_mask+0x60/0x60
> > [0.272136]  ret_from_fork+0x22/0x30
> > [0.272136] kmemleak: Kernel memory leak detector disabled
> > [0.272136] kmemleak: Object 0x888236e0 (size 2097152):
> > [0.272136] kmemleak:   comm "swapper", pid 0, jiffies 4294892296
> > [0.272136] kmemleak:   min_count = 0
> > [0.272136] kmemleak:   count = 0
> > [0.272136] kmemleak:   flags = 0x1
> > [0.272136] kmemleak:   checksum = 0
> > [0.272136] kmemleak:   backtrace:
> > [0.272136]  memblock_alloc_internal+0x6d/0xb0
> > [0.272136]  memblock_alloc_try_nid+0x6c/0x8a
> > [0.272136]  kfence_alloc_pool+0x26/0x3f
> > [0.272136]  start_kernel+0x242/0x548
> > [0.272136]  secondary_startup_64_no_verify+0xb0/0xbb
> > 
> > I've tried the hack below but it didn't really help.  Obviously I don't
> > really understand what's going on ;-)  But I think the reason for this
> > patch not working as (I) expected is because kfence is initialised
> > *before* kmemleak.
> > 
> > diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> > index 3b8ec938470a..b4ffd7695268 100644
> > --- a/mm/kfence/core.c
> > +++ b/mm/kfence/core.c
> > @@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void)
> >  
> > if (!__kfence_pool)
> > pr_err("failed to allocate pool\n");
> > +   kmemleak_no_scan(__kfence_pool);
> >  }
> 
> Can you try the below patch?

Yep, that seems to fix the issue.  Feel free to add my Tested-by.  Thanks!

Cheers,
--
Luís

> 
> Thanks,
> -- Marco
> 
> -- >8 --
> 
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index f7106f28443d..5891019721f6 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -481,6 +482,13 @@ static bool __init kfence_init_pool(void)
>   addr += 2 * PAGE_SIZE;
>   }
>  
> + /*
> +  * The pool is live and will never be deallocated from this point on;
> +  * tell kmemleak this is now free memory, so that later allocations can
> +  * correctly be tracked.
> +  */
> + kmemleak_free_part_phys(__pa(__kfence_pool), KFENCE_POOL_SIZE);
> +
>   return true;
>  
>  err:
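
The "overlaps existing" failure and the effect of the fix can be modelled with
a toy range tracker.  This is only a sketch under stated assumptions: the
demo_* names are hypothetical, the flat array stands in for kmemleak's object
search tree, the base address and the 192-byte object size are made up, and
demo_untrack() only handles exact matches whereas the real kmemleak free
helpers can also release part of a tracked block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_OBJECTS 16

/* One tracked allocation: [start, start + size). */
struct demo_range {
    uintptr_t start;
    size_t size;
    bool used;
};

static struct demo_range demo_objects[DEMO_MAX_OBJECTS];

static bool demo_overlaps(const struct demo_range *r, uintptr_t start,
                          size_t size)
{
    return r->used && start < r->start + r->size &&
           r->start < start + size;
}

/* Returns true if the range is now tracked, false on overlap or no room. */
bool demo_track(uintptr_t start, size_t size)
{
    size_t i, slot = DEMO_MAX_OBJECTS;

    assert(size > 0);
    for (i = 0; i < DEMO_MAX_OBJECTS; i++) {
        if (demo_overlaps(&demo_objects[i], start, size))
            return false;       /* "overlaps existing" */
        if (!demo_objects[i].used && slot == DEMO_MAX_OBJECTS)
            slot = i;
    }
    if (slot == DEMO_MAX_OBJECTS)
        return false;

    demo_objects[slot].start = start;
    demo_objects[slot].size = size;
    demo_objects[slot].used = true;
    return true;
}

/* Forget a range that exactly matches a tracked one. */
bool demo_untrack(uintptr_t start, size_t size)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_OBJECTS; i++) {
        if (demo_objects[i].used && demo_objects[i].start == start &&
            demo_objects[i].size == size) {
            demo_objects[i].used = false;
            return true;
        }
    }
    return false;
}
```

With the 2097152-byte pool tracked as a single object, any later allocation
carved out of it is rejected; once the pool range is dropped from the tracker,
the same allocation can be recorded on its own.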


[PATCH] virtiofs: fix memory leak in virtio_fs_probe()

2021-03-16 Thread Luis Henriques
When accidentally passing the same tag twice to qemu, kmemleak ended up
reporting a memory leak in virtiofs.  Also, looking at the log I saw the
following error (that's when I realised the duplicated tag):

  virtiofs: probe of virtio5 failed with error -17

Here's the kmemleak log for reference:

unreferenced object 0x888103d47800 (size 1024):
  comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
  hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  
  backtrace:
[<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
[<f8aca419>] virtio_dev_probe+0x15f/0x210
[<4d6baf3c>] really_probe+0xea/0x430
[<a6ceeac8>] device_driver_attach+0xa8/0xb0
[<196f47a7>] __driver_attach+0x98/0x140
[<0b20601d>] bus_for_each_dev+0x7b/0xc0
[<399c7b7f>] bus_add_driver+0x11b/0x1f0
[<32b09ba7>] driver_register+0x8f/0xe0
[<cdd55998>] 0xa002c013
[<0ea196a2>] do_one_initcall+0x64/0x2e0
[<08f727ce>] do_init_module+0x5c/0x260
[<3cdedab6>] __do_sys_finit_module+0xb5/0x120
[<ad2f48c6>] do_syscall_64+0x33/0x40
[<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: sta...@vger.kernel.org
Signed-off-by: Luis Henriques 
---
 fs/fuse/virtio_fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8868ac31a3c0..4e6ef9f24e84 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -899,7 +899,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 
 out:
vdev->priv = NULL;
-   kfree(fs);
+   virtio_fs_put(fs);
return ret;
 }
 


Issue with kfence and kmemleak

2021-03-16 Thread Luis Henriques
Hi!

This is probably a known issue, but just in case: looks like it's not
possible to use kmemleak when kfence is enabled:

[0.272136] kmemleak: Cannot insert 0x888236e02f00 into the object 
search tree (overlaps existing)
[0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92
[0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[0.272136] Call Trace:
[0.272136]  dump_stack+0x6d/0x89
[0.272136]  create_object.isra.0.cold+0x40/0x62
[0.272136]  ? process_one_work+0x5a0/0x5a0
[0.272136]  ? process_one_work+0x5a0/0x5a0
[0.272136]  kmem_cache_alloc_trace+0x110/0x2f0
[0.272136]  ? process_one_work+0x5a0/0x5a0
[0.272136]  kthread+0x3f/0x150
[0.272136]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
[0.272136]  ? __kthread_bind_mask+0x60/0x60
[0.272136]  ret_from_fork+0x22/0x30
[0.272136] kmemleak: Kernel memory leak detector disabled
[0.272136] kmemleak: Object 0x888236e0 (size 2097152):
[0.272136] kmemleak:   comm "swapper", pid 0, jiffies 4294892296
[0.272136] kmemleak:   min_count = 0
[0.272136] kmemleak:   count = 0
[0.272136] kmemleak:   flags = 0x1
[0.272136] kmemleak:   checksum = 0
[0.272136] kmemleak:   backtrace:
[0.272136]  memblock_alloc_internal+0x6d/0xb0
[0.272136]  memblock_alloc_try_nid+0x6c/0x8a
[0.272136]  kfence_alloc_pool+0x26/0x3f
[0.272136]  start_kernel+0x242/0x548
[0.272136]  secondary_startup_64_no_verify+0xb0/0xbb

I've tried the hack below but it didn't really help.  Obviously I don't
really understand what's going on ;-)  But I think the reason for this
patch not working as (I) expected is because kfence is initialised
*before* kmemleak.

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 3b8ec938470a..b4ffd7695268 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void)
 
if (!__kfence_pool)
pr_err("failed to allocate pool\n");
+   kmemleak_no_scan(__kfence_pool);
 }


Cheers,
--
Luís


Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-15 Thread Luis Henriques
On Fri, Mar 12, 2021 at 01:11:23PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 12, 2021 at 12:21:59PM +0000, Luis Henriques wrote:
> > > > I've seen a bug report (5.10.16 kernel splat below) that seems to be
> > > > reproducible in kernels as early as 5.4.
> 
> If this is reproducible, can you turn this BUG_ON into a VM_BUG_ON_PAGE()
> so we know what kind of problem we're dealing with?  Assuming the SUSE
> tumbleweed kernels enable CONFIG_DEBUG_VM, which I'm sure they do.

Just to make sure I got this right, you want to test something like this:

}
}
-   BUG_ON(page_mapped(page));
+   VM_BUG_ON_PAGE(page_mapped(page), page);
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))

Cheers,
--
Luís

> 
> > > Page fault locks the page before installing a new pte, at least
> > > AFAICS, so the BUG looks impossible.  The referenced commits only
> > > touch very high level control of writeback, so they may well increase
> > > the chance of a bug triggering, but very unlikely to be the actual
> > > cause of the bug.   I'm guessing this to be an MM issue.
> > 
> > Ok, thank you for having a look at it.
> > 
> > Interestingly, there's a single commit to mm/truncate.c in 5.4:
> > ef18a1ca847b ("mm/thp: allow dropping THP from page cache").  I'm Cc'ing
> > Andrew and Kirill, maybe they have some ideas.
> 
> That's probably not it; unless FUSE has developed the ability to insert
> compound pages into the page cache without me noticing.
> 
> (if it had, that would absolutely explain it -- i have a fix in my thp
> tree for this case, but it doesn't affect any existing filesystem
> because only shmem uses compound pages and it doesn't call
> invalidate_inode_pages2_range)
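
The point of swapping BUG_ON() for VM_BUG_ON_PAGE() is that the latter dumps
the state of the offending page before stopping (when CONFIG_DEBUG_VM is
enabled), so the report says what kind of page tripped the check.  A rough
userspace sketch of that idea follows; struct demo_page and
demo_bug_on_page() are hypothetical names, the fields are a tiny subset
chosen for illustration, and the sketch only formats a report instead of
halting.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, heavily reduced stand-in for struct page. */
struct demo_page {
    unsigned long index;
    int mapcount;
    unsigned long flags;
};

/*
 * Returns 1 and fills buf with a diagnostic when cond is true, 0 otherwise.
 * The kernel macro would go on to BUG(); this sketch only reports.
 */
int demo_bug_on_page(int cond, const char *expr,
                     const struct demo_page *page, char *buf, size_t len)
{
    assert(expr && page && buf && len > 0);

    buf[0] = '\0';
    if (!cond)
        return 0;

    /* Dump the page state first, then name the failed condition. */
    snprintf(buf, len, "page index:%lu mapcount:%d flags:%#lx BUG_ON_PAGE(%s)",
             page->index, page->mapcount, page->flags, expr);
    return 1;
}
```

A plain BUG_ON() equivalent would only be able to say that the condition
fired, not which page state led to it.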


Re: fuse: kernel BUG at mm/truncate.c:763!

2021-03-12 Thread Luis Henriques
On Fri, Mar 12, 2021 at 10:48:40AM +0100, Miklos Szeredi wrote:
> On Fri, Mar 12, 2021 at 9:51 AM Luis Henriques  wrote:
> >
> > Hi Miklos,
> >
> > I've seen a bug report (5.10.16 kernel splat below) that seems to be
> > reproducible in kernels as early as 5.4.
> >
> > The commit that caught my attention when looking at what was merged in 5.4
> > was e4648309b85a ("fuse: truncate pending writes on O_TRUNC") but I didn't
> > dig too deep into that -- I was wondering if you have seen something
> > similar before.
> 
> Don't remember seeing this.
> 
> Excerpt from invalidate_inode_pages2_range():
> 
> lock_page(page);
> [...]
> if (page_mapped(page)) {
>  [...]
> unmap_mapping_pages(mapping, index,
> 1, false);
> }
> }
> BUG_ON(page_mapped(page));
> 
> Page fault locks the page before installing a new pte, at least
> AFAICS, so the BUG looks impossible.  The referenced commits only
> touch very high level control of writeback, so they may well increase
> the chance of a bug triggering, but very unlikely to be the actual
> cause of the bug.   I'm guessing this to be an MM issue.

Ok, thank you for having a look at it.

Interestingly, there's a single commit to mm/truncate.c in 5.4:
ef18a1ca847b ("mm/thp: allow dropping THP from page cache").  I'm Cc'ing
Andrew and Kirill, maybe they have some ideas.

> Is this reproducible on vanilla, or just openSUSE kernels?

Well, this is on a Tumbleweed kernel, which is pretty much the stable
kernel with a few patches that AFAIK touch mostly drivers.  But I'll see
if I can get the reporter trying to reproduce on a vanilla kernel.

Cheers,
--
Luís

> 
> Thanks,
> Miklos
> 
> 
> 
> >
> >
> > There's another splat in the bug report[1] for a 5.4.14 kernel (which may
> > be for a different bug, but the traces don't look as reliable as the one
> > below).
> >
> > [1] https://bugzilla.opensuse.org/show_bug.cgi?id=1182929
> >
> > [97604.721590] kernel BUG at mm/truncate.c:763!
> > [97604.721601] invalid opcode:  [#1] SMP PTI
> > [97604.721613] CPU: 18 PID: 1584438 Comm: g++ Tainted: P   O
> >  5.10.16-1-default #1 openSUSE Tumbleweed
> > [97604.721618] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a
> > 10/16/2019
> > [97604.721631] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0
> > [97604.721637] Code: 0f 48 f0 e9 19 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00
> > 48 89 ee e8 1a c5 02 00 4c 89 ff e8 02 1b 01 00 84 c0 0f 84 ca fe ff ff <0f>
> > 0b 49 8b 57 18 49 39 d4 0f 85 e2 fe ff ff 49 f7 07 00 60 00 00
> > [97604.721645] RSP: 0018:a613aa54ba40 EFLAGS: 00010202
> > [97604.721651] RAX: 0001 RBX: 000a RCX:
> > 0200
> > [97604.721656] RDX: 0090 RSI: 00a800010037 RDI:
> > d880718e
> > [97604.721660] RBP: 1400 R08: 1400 R09:
> > 1a73
> > [97604.721664] R10:  R11: 04a684da R12:
> > 8a28d4549d78
> > [97604.721669] R13:  R14:  R15:
> > d880718e
> > [97604.721674] FS:  7f9cdd7fb740() GS:8a5c7f98()
> > knlGS:
> > [97604.721679] CS:  0010 DS:  ES:  CR0: 80050033
> > [97604.721683] CR2: 7f89d3d78d80 CR3: 004d8a14e005 CR4:
> > 007706e0
> > [97604.721688] DR0:  DR1:  DR2:
> > 
> > [97604.721692] DR3:  DR6: fffe0ff0 DR7:
> > 0400
> > [97604.721696] PKRU: 5554
> > [97604.721699] Call Trace:
> > [97604.721719]  ? request_wait_answer+0x11a/0x210 [fuse]
> > [97604.721729]  ? fuse_dentry_delete+0xb/0x20 [fuse]
> > [97604.721740]  fuse_finish_open+0x85/0x150 [fuse]
> > [97604.721750]  fuse_open_common+0x1a8/0x1b0 [fuse]
> > [97604.721759]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
> > [97604.721766]  do_dentry_open+0x14e/0x380
> > [97604.721775]  path_openat+0x600/0x10d0
> > [97604.721782]  ? handle_mm_fault+0x103c/0x1a00
> > [97604.721791]  ? follow_page_pte+0x314/0x5f0
> > [97604.721795]  do_filp_open+0x88/0x130
> > [97604.721803]  ? security_prepare_creds+0x6d/0x90
> > [97604.721808]  ? __kmalloc+0x11d/0x2a0
> > [97604.721814]  do_open_execat+0x6d/0x1a0
> > [97604.721819]  bprm_execve+0x190/0x6b0
> > [97604.721825]  do_execveat_common+0x192/0x1c0
> > [97604.721830]  __x64_sys_execve+0x39/0x50
> > [97604.721836]  do_s

fuse: kernel BUG at mm/truncate.c:763!

2021-03-12 Thread Luis Henriques
Hi Miklos,

I've seen a bug report (5.10.16 kernel splat below) that seems to be
reproducible in kernels as early as 5.4.

The commit that caught my attention when looking at what was merged in 5.4
was e4648309b85a ("fuse: truncate pending writes on O_TRUNC") but I didn't
dig too deep into that -- I was wondering if you have seen something
similar before.

There's another splat in the bug report[1] for a 5.4.14 kernel (which may
be for a different bug, but the traces don't look as reliable as the one
below).

[1] https://bugzilla.opensuse.org/show_bug.cgi?id=1182929

[97604.721590] kernel BUG at mm/truncate.c:763!
[97604.721601] invalid opcode:  [#1] SMP PTI
[97604.721613] CPU: 18 PID: 1584438 Comm: g++ Tainted: P   O 
 5.10.16-1-default #1 openSUSE Tumbleweed
[97604.721618] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a
10/16/2019
[97604.721631] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0
[97604.721637] Code: 0f 48 f0 e9 19 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00
48 89 ee e8 1a c5 02 00 4c 89 ff e8 02 1b 01 00 84 c0 0f 84 ca fe ff ff <0f>
0b 49 8b 57 18 49 39 d4 0f 85 e2 fe ff ff 49 f7 07 00 60 00 00
[97604.721645] RSP: 0018:a613aa54ba40 EFLAGS: 00010202
[97604.721651] RAX: 0001 RBX: 000a RCX:
0200
[97604.721656] RDX: 0090 RSI: 00a800010037 RDI:
d880718e
[97604.721660] RBP: 1400 R08: 1400 R09:
1a73
[97604.721664] R10:  R11: 04a684da R12:
8a28d4549d78
[97604.721669] R13:  R14:  R15:
d880718e
[97604.721674] FS:  7f9cdd7fb740() GS:8a5c7f98()
knlGS:
[97604.721679] CS:  0010 DS:  ES:  CR0: 80050033
[97604.721683] CR2: 7f89d3d78d80 CR3: 004d8a14e005 CR4:
007706e0
[97604.721688] DR0:  DR1:  DR2:

[97604.721692] DR3:  DR6: fffe0ff0 DR7:
0400
[97604.721696] PKRU: 5554
[97604.721699] Call Trace:
[97604.721719]  ? request_wait_answer+0x11a/0x210 [fuse]
[97604.721729]  ? fuse_dentry_delete+0xb/0x20 [fuse]
[97604.721740]  fuse_finish_open+0x85/0x150 [fuse]
[97604.721750]  fuse_open_common+0x1a8/0x1b0 [fuse]
[97604.721759]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
[97604.721766]  do_dentry_open+0x14e/0x380
[97604.721775]  path_openat+0x600/0x10d0
[97604.721782]  ? handle_mm_fault+0x103c/0x1a00
[97604.721791]  ? follow_page_pte+0x314/0x5f0
[97604.721795]  do_filp_open+0x88/0x130
[97604.721803]  ? security_prepare_creds+0x6d/0x90
[97604.721808]  ? __kmalloc+0x11d/0x2a0
[97604.721814]  do_open_execat+0x6d/0x1a0
[97604.721819]  bprm_execve+0x190/0x6b0
[97604.721825]  do_execveat_common+0x192/0x1c0
[97604.721830]  __x64_sys_execve+0x39/0x50
[97604.721836]  do_syscall_64+0x33/0x80
[97604.721843]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[97604.721848] RIP: 0033:0x7f9cdcfe2c37
[97604.721853] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de
64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48>
3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 08 12 30 00 f7 d8 64 89 02
[97604.721862] RSP: 002b:7ffe444f5758 EFLAGS: 0202 ORIG_RAX:
003b
[97604.721867] RAX: ffda RBX: 7f9cdd7fb6a0 RCX:
7f9cdcfe2c37
[97604.721872] RDX: 020f5300 RSI: 020f3bf8 RDI:
020f36a0
[97604.721876] RBP: 0001 R08:  R09:

[97604.721880] R10: 7ffe444f4b60 R11: 0202 R12:

[97604.721884] R13: 0001 R14: 020f36a0 R15:

[97604.721890] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 dns_resolver
nfsv3 nfs fscache libafs(PO) iscsi_ibft iscsi_boot_sysfs rfkill
vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dmi_sysfs intel_rapl_msr
intel_rapl_common isst_if_common joydev ipmi_ssif i40iw ib_uverbs iTCO_wdt
intel_pmc_bxt ib_core hid_generic iTCO_vendor_support skx_edac nfit
libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel acpi_ipmi
usbhid kvm i40e ipmi_si ioatdma mei_me i2c_i801 irqbypass ipmi_devintf mei
i2c_smbus lpc_ich dca efi_pstore pcspkr ipmi_msghandler tiny_power_button
acpi_pad button nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl lockd
auth_rpcgss grace sunrpc fuse configfs nfs_ssc ast i2c_algo_bit
drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
cec rc_core drm_ttm_helper xhci_pci ttm xhci_pci_renesas xhci_hcd
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
drm glue_helper crypto_simd cryptd usbcore wmi sg br_netfilter bridge stp
llc
[97604.721991]  dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
msr efivarfs
[97604.722031] ---[ end trace edcabaccd35272e2 ]---
[97604.727773] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0

Cheers,
--
Luís



Re: [RFC PATCH] fuse: Clear SGID bit when setting mode in setacl

2021-03-01 Thread Luis Henriques
On Mon, Mar 01, 2021 at 11:33:24AM -0500, Vivek Goyal wrote:
> On Fri, Feb 26, 2021 at 06:33:57PM +0000, Luis Henriques wrote:
> > Setting file permissions with POSIX ACLs (setxattr) isn't clearing the
> > setgid bit.  This seems to be CVE-2016-7097, detected by running fstest
> > generic/375 in virtiofs.  Unfortunately, when the fix for this CVE landed
> > in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when
> > setting file permissions"), FUSE didn't have ACL support yet.
> 
> Hi Luis,
> 
> Interesting. I did not know that "chmod" can lead to clearing of SGID
> as well. Recently we implemented FUSE_HANDLE_KILLPRIV_V2 flag which
> means that file server is responsible for clearing of SUID/SGID/caps
> as per following rules.
> 
> - caps are always cleared on chown/write/truncate
> - suid is always cleared on chown, while for truncate/write it is cleared
>   only if caller does not have CAP_FSETID.
> - sgid is always cleared on chown, while for truncate/write it is cleared
>   only if the caller does not have CAP_FSETID and the file has group
>   execute permission.
> 
> And we don't have anything about "chmod" in this list. Well, I will test
> this and come back to this little later.
> 
> I see following comment in fuse_set_acl().
> 
> /*
>  * Fuse userspace is responsible for updating access
>  * permissions in the inode, if needed. fuse_setxattr
>  * invalidates the inode attributes, which will force
>  * them to be refreshed the next time they are used,
>  * and it also updates i_ctime.
>  */
> 
> So looks like that original code has been written with intent that
> file server is responsible for updating inode permissions. I am
> assuming this will include clearing of S_ISGID if needed.
> 
> But the question is, does the file server have enough information to be
> able to handle proper clearing of S_ISGID? IIUC, the file server will need
> at least two pieces of information.
> 
> - gid of the caller.
> - Whether caller has CAP_FSETID or not.
> 
> I think we have the first piece of information but not the second one.
> Maybe we need to send this in fuse_setxattr_in->flags. And the file server
> can drop CAP_FSETID while doing setxattr().
> 
> What about "gid" info? We don't change to the caller's uid/gid while doing
> setxattr(). So the host might not clear S_ISGID, or clear it when it should
> not. I am wondering whether we can switch to the caller's uid/gid in
> setxattr(), at least while setting acls.

Thanks for looking into this.  To be honest, initially I thought that the
fix should be done in the server too, but when I looked into the code I
couldn't find an easy way to get that done (without modifying the data
being passed from the kernel in setxattr).

So, what I've done was to look at what other filesystems were doing in the
ACL code, and that's where I found out about this CVE.  The CVE fix for
the other filesystems looked easy enough to be included in FUSE too.

Cheers,
--
Luís
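
The FUSE_HANDLE_KILLPRIV_V2 rules listed in the quoted message can be written
down as a small function.  This is a sketch of those three bullets only, not
the FUSE or server implementation: demo_kill_priv() and enum demo_op are
hypothetical names, file capabilities are not modelled, and whether the
caller holds CAP_FSETID is passed in as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

enum demo_op { DEMO_CHOWN, DEMO_WRITE, DEMO_TRUNCATE };

/* Returns the mode after the suid/sgid bits have been cleared as needed. */
mode_t demo_kill_priv(mode_t mode, enum demo_op op, bool has_fsetid)
{
    assert(op == DEMO_CHOWN || op == DEMO_WRITE || op == DEMO_TRUNCATE);

    /* chown: suid and sgid are always cleared. */
    if (op == DEMO_CHOWN)
        return mode & ~(mode_t)(S_ISUID | S_ISGID);

    /* write/truncate: nothing is cleared for a CAP_FSETID caller. */
    if (has_fsetid)
        return mode;

    mode &= ~(mode_t)S_ISUID;
    /* sgid is only cleared when the file has group execute permission. */
    if (mode & S_IXGRP)
        mode &= ~(mode_t)S_ISGID;
    return mode;
}
```

Note that chmod and setxattr do not appear among the operations, which is the
gap the message above points out.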


[RFC PATCH] fuse: Clear SGID bit when setting mode in setacl

2021-02-26 Thread Luis Henriques
Setting file permissions with POSIX ACLs (setxattr) isn't clearing the
setgid bit.  This seems to be CVE-2016-7097, detected by running fstest
generic/375 in virtiofs.  Unfortunately, when the fix for this CVE landed
in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when
setting file permissions"), FUSE didn't have ACL support yet.

Signed-off-by: Luis Henriques 
---
 fs/fuse/acl.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index f529075a2ce8..1b273277c1c9 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -54,7 +54,9 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
struct fuse_conn *fc = get_fuse_conn(inode);
const char *name;
+   umode_t mode = inode->i_mode;
int ret;
+   bool update_mode = false;
 
if (fuse_is_bad(inode))
return -EIO;
@@ -62,11 +64,18 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
if (!fc->posix_acl || fc->no_setxattr)
return -EOPNOTSUPP;
 
-   if (type == ACL_TYPE_ACCESS)
+   if (type == ACL_TYPE_ACCESS) {
name = XATTR_NAME_POSIX_ACL_ACCESS;
-   else if (type == ACL_TYPE_DEFAULT)
+   if (acl) {
+   ret = posix_acl_update_mode(inode, &mode, &acl);
+   if (ret)
+   return ret;
+   if (inode->i_mode != mode)
+   update_mode = true;
+   }
+   } else if (type == ACL_TYPE_DEFAULT) {
name = XATTR_NAME_POSIX_ACL_DEFAULT;
-   else
+   } else
return -EINVAL;
 
if (acl) {
@@ -98,6 +107,20 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
} else {
ret = fuse_removexattr(inode, name);
}
+   if (!ret && update_mode) {
+   struct dentry *entry;
+   struct iattr attr;
+
+   entry = d_find_alias(inode);
+   if (entry) {
+   memset(&attr, 0, sizeof(attr));
+   attr.ia_valid = ATTR_MODE | ATTR_CTIME;
+   attr.ia_mode = mode;
+   attr.ia_ctime = current_time(inode);
+   ret = fuse_do_setattr(entry, &attr, NULL);
+   dput(entry);
+   }
+   }
forget_all_cached_acls(inode);
fuse_invalidate_attr(inode);
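
For context, the SGID rule that posix_acl_update_mode() applies (the helper
introduced by the CVE fix referenced above) can be sketched as follows.  This
is a simplified userspace model, not the kernel function:
demo_acl_update_mode() is a hypothetical name, the ACL-to-mode conversion is
replaced by a plain permission-bits argument, and group membership and
CAP_FSETID are passed in as booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Returns the new mode: permission bits come from the ACL, and the setgid
 * bit is dropped unless the caller is in the file's group or has CAP_FSETID.
 */
mode_t demo_acl_update_mode(mode_t old_mode, mode_t acl_perms,
                            bool caller_in_group, bool has_fsetid)
{
    mode_t mode;

    assert((acl_perms & ~(mode_t)0777) == 0);

    /* Keep file type and special bits, take rwx bits from the ACL. */
    mode = (old_mode & ~(mode_t)0777) | acl_perms;

    if (!caller_in_group && !has_fsetid)
        mode &= ~(mode_t)S_ISGID;

    return mode;
}
```

In the patch above, the resulting mode is compared with inode->i_mode to
decide whether a separate setattr needs to be sent.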
 


Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

2021-02-25 Thread Luis Henriques
On Wed, Feb 24, 2021 at 06:10:45PM +0200, Amir Goldstein wrote:
> On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques  wrote:
> >
> > Update man-page with recent changes to this syscall.
> >
> > Signed-off-by: Luis Henriques 
> > ---
> > Hi!
> >
> > Here's a suggestion for fixing the manpage for copy_file_range().  Note that
> > I've assumed the fix will hit 5.12.
> >
> >  man2/copy_file_range.2 | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > index 611a39b8026b..b0fd85e2631e 100644
> > --- a/man2/copy_file_range.2
> > +++ b/man2/copy_file_range.2
> > @@ -169,6 +169,9 @@ Out of memory.
> >  .B ENOSPC
> >  There is not enough space on the target filesystem to complete the copy.
> >  .TP
> > +.B EOPNOTSUPP
> > +The filesystem does not support this operation.
> > +.TP
> >  .B EOVERFLOW
> >  The requested source or destination range is too large to represent in the
> >  specified data types.
> > @@ -187,7 +190,7 @@ refers to an active swap file.
> >  .B EXDEV
> >  The files referred to by
> >  .IR fd_in " and " fd_out
> > -are not on the same mounted filesystem (pre Linux 5.3).
> > +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
> 
> I think you need to drop the (Linux range) altogether.
> What's missing here is the NFS cross server copy use case.
> Maybe:
> 
> ...are not on the same mounted filesystem and the source and target
> filesystems do not support cross-filesystem copy.
> 
> You may refer the reader to the VERSIONS section, where it will say which
> filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and
> nfs).
> 
> >  .SH VERSIONS
> >  The
> >  .BR copy_file_range ()
> > @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
> >  .PP
> >  First support for cross-filesystem copies was introduced in Linux 5.3.
> >  Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> > +.PP
> > +After Linux 5.12, support for copies between different filesystems was dropped.
> > +However, individual filesystems may still provide
> > +.BR copy_file_range ()
> > +implementations that allow copies across different devices.
> 
> Again, this is not likely to stay up to date for very long.
> The stable kernels are expected to apply your patch (because it fixes
> a regression), so this should be phrased differently.
> If it were me, I would provide all the details of the situation to
> Michael and ask him to write the best description for this section.

Thanks Amir.

Yeah, it's tricky.  Support was added and then dropped.  Since stable
kernels will be picking up this patch, maybe the best thing to do is to not
mention the generic cross-filesystem support at all...?  Or simply say
that 5.3 temporarily supported it but that support was later dropped.

Michael (or Alejandro), would you be OK handling this yourself as Amir
suggested?

Cheers,
--
Luís


[PATCH] copy_file_range.2: Kernel v5.12 updates

2021-02-24 Thread Luis Henriques
Update man-page with recent changes to this syscall.

Signed-off-by: Luis Henriques 
---
Hi!

Here's a suggestion for fixing the manpage for copy_file_range().  Note that
I've assumed the fix will hit 5.12.

 man2/copy_file_range.2 | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 611a39b8026b..b0fd85e2631e 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
 .B ENOSPC
 There is not enough space on the target filesystem to complete the copy.
 .TP
+.B EOPNOTSUPP
+The filesystem does not support this operation.
+.TP
 .B EOVERFLOW
 The requested source or destination range is too large to represent in the
 specified data types.
@@ -187,7 +190,7 @@ refers to an active swap file.
 .B EXDEV
 The files referred to by
 .IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
 .SH VERSIONS
 The
 .BR copy_file_range ()
@@ -202,6 +205,11 @@ Applications should target the behaviour and requirements 
of 5.3 kernels.
 .PP
 First support for cross-filesystem copies was introduced in Linux 5.3.
 Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+.PP
+After Linux 5.12, support for copies between different filesystems was dropped.
+However, individual filesystems may still provide
+.BR copy_file_range ()
+implementations that allow copies across different devices.
 .SH CONFORMING TO
 The
 .BR copy_file_range ()
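
For illustration only (not part of the man-page patch): a minimal userspace
sketch of how an application might react to the two errors discussed above,
EXDEV and EOPNOTSUPP, by falling back to a plain read()/write() loop.  The
helper names are hypothetical, treating ENOSYS the same way is my own
assumption, and the syscall is invoked through syscall(2) only so that the
sketch does not depend on a libc wrapper being available.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Errors after which a plain read()/write() copy is worth trying.
 * EXDEV and EOPNOTSUPP come from the thread; ENOSYS is an assumption. */
int cfr_should_fall_back(int err)
{
	return err == EXDEV || err == EOPNOTSUPP || err == ENOSYS;
}

/* Hypothetical helper: copy up to 'len' bytes with read()/write().
 * Returns the number of bytes copied, or -1 with errno set. */
ssize_t copy_by_read_write(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;		/* end of file */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Hypothetical helper: one copy_file_range(2) attempt, with a fallback
 * when the kernel or filesystem refuses the operation.  A single call
 * may copy fewer than 'len' bytes; real callers would loop. */
ssize_t copy_range_with_fallback(int fd_in, int fd_out, size_t len)
{
	ssize_t n = syscall(SYS_copy_file_range, fd_in, NULL, fd_out, NULL,
			    len, 0U);

	if (n < 0 && cfr_should_fall_back(errno))
		return copy_by_read_write(fd_in, fd_out, len);
	return n;
}
```

Errors other than the ones listed (ENOSPC, for instance) are passed straight
back to the caller, since retrying with read()/write() would not help there.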


Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

2021-02-24 Thread Luis Henriques
On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote:
> On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques  wrote:
> >
> > A regression has been reported by Nicolas Boichat, found while using the
> > copy_file_range syscall to copy a tracefs file.  Before commit
> > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > kernel would return -EXDEV to userspace when trying to copy a file across
> > different filesystems.  After this commit, the syscall doesn't fail anymore
> > and instead returns zero (zero bytes copied), as this file's content is
> > generated on-the-fly and thus reports a size of zero.
> >
> > This patch restores some cross-filesystem copy restrictions that existed
> > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > devices").  Filesystems are still allowed to fall-back to the VFS
> > generic_copy_file_range() implementation, but that has now to be done
> > explicitly.
> >
> > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >
> > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
> > Reported-by: Nicolas Boichat 
> > Signed-off-by: Luis Henriques 
> 
> I tested v8 and I believe it works for NFS.

Thanks a lot for the testing.  And to everyone else for reviews,
feedback,... and patience.

I'll now go look into the manpage and see what needs to be changed.

Cheers,
--
Luís


Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

2021-02-23 Thread Luis Henriques
On Tue, Feb 23, 2021 at 08:57:38AM -0800, dai@oracle.com wrote:
> 
> On 2/23/21 8:47 AM, Amir Goldstein wrote:
> > On Tue, Feb 23, 2021 at 6:02 PM  wrote:
> > > 
> > > On 2/23/21 7:29 AM, dai@oracle.com wrote:
> > > > On 2/23/21 2:32 AM, Luis Henriques wrote:
> > > > > On Mon, Feb 22, 2021 at 08:25:27AM -0800, dai....@oracle.com wrote:
> > > > > > On 2/22/21 2:24 AM, Luis Henriques wrote:
> > > > > > > A regression has been reported by Nicolas Boichat, found while
> > > > > > > using the
> > > > > > > copy_file_range syscall to copy a tracefs file.  Before commit
> > > > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across 
> > > > > > > devices") the
> > > > > > > kernel would return -EXDEV to userspace when trying to copy a file
> > > > > > > across
> > > > > > > different filesystems.  After this commit, the syscall doesn't 
> > > > > > > fail
> > > > > > > anymore
> > > > > > > and instead returns zero (zero bytes copied), as this file's
> > > > > > > content is
> > > > > > > generated on-the-fly and thus reports a size of zero.
> > > > > > > 
> > > > > > > This patch restores some cross-filesystem copy restrictions that
> > > > > > > existed
> > > > > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
> > > > > > > across
> > > > > > > devices").  Filesystems are still allowed to fall-back to the VFS
> > > > > > > generic_copy_file_range() implementation, but that has now to be 
> > > > > > > done
> > > > > > > explicitly.
> > > > > > > 
> > > > > > > nfsd is also modified to fall-back into generic_copy_file_range()
> > > > > > > in case
> > > > > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> > > > > > > 
> > > > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > > > > devices")
> > > > > > > > Link:
> > > > > > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
> > > > > > > > Link:
> > > > > > > > https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
> > > > > > > > Link:
> > > > > > > > https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
> > > > > > > Reported-by: Nicolas Boichat 
> > > > > > > Signed-off-by: Luis Henriques 
> > > > > > > ---
> > > > > > > Changes since v7
> > > > > > > - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
> > > > > > > that the
> > > > > > >  error returned is always related to the 'copy' operation
> > > > > > > Changes since v6
> > > > > > > - restored i_sb checks for the clone operation
> > > > > > > Changes since v5
> > > > > > > - check if ->copy_file_range is NULL before calling it
> > > > > > > Changes since v4
> > > > > > > - nfsd falls-back to generic_copy_file_range() only *if* it gets
> > > > > > > -EOPNOTSUPP
> > > > > > >  or -EXDEV.
> > > > > > > Changes since v3
> > > > > > > - dropped the COPY_FILE_SPLICE flag
> > > > > > > - kept the f_op's checks early in generic_copy_file_checks,
> > > > > > > implementing
> > > > > > >  Amir's suggestions
> > > > > > > - modified nfsd to use generic_copy_file_range()
> > > > > > > Changes since v2
> > > > > > > - do all the required checks earlier, in 
> > > > > > > generic_copy_file_checks(),
> > > > > > >

Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

2021-02-23 Thread Luis Henriques
On Mon, Feb 22, 2021 at 08:25:27AM -0800, dai@oracle.com wrote:
> 
> On 2/22/21 2:24 AM, Luis Henriques wrote:
> > A regression has been reported by Nicolas Boichat, found while using the
> > copy_file_range syscall to copy a tracefs file.  Before commit
> > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > kernel would return -EXDEV to userspace when trying to copy a file across
> > different filesystems.  After this commit, the syscall doesn't fail anymore
> > and instead returns zero (zero bytes copied), as this file's content is
> > generated on-the-fly and thus reports a size of zero.
> > 
> > This patch restores some cross-filesystem copy restrictions that existed
> > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > devices").  Filesystems are still allowed to fall-back to the VFS
> > generic_copy_file_range() implementation, but that has now to be done
> > explicitly.
> > 
> > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> > 
> > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
> > Reported-by: Nicolas Boichat 
> > Signed-off-by: Luis Henriques 
> > ---
> > Changes since v7
> > - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
> >error returned is always related to the 'copy' operation
> > Changes since v6
> > - restored i_sb checks for the clone operation
> > Changes since v5
> > - check if ->copy_file_range is NULL before calling it
> > Changes since v4
> > - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
> >or -EXDEV.
> > Changes since v3
> > - dropped the COPY_FILE_SPLICE flag
> > - kept the f_op's checks early in generic_copy_file_checks, implementing
> >Amir's suggestions
> > - modified nfsd to use generic_copy_file_range()
> > Changes since v2
> > - do all the required checks earlier, in generic_copy_file_checks(),
> >adding new checks for ->remap_file_range
> > - new COPY_FILE_SPLICE flag
> > - don't remove filesystem's fallback to generic_copy_file_range()
> > - updated commit changelog (and subject)
> > Changes since v1 (after Amir review)
> > - restored do_copy_file_range() helper
> > - return -EOPNOTSUPP if fs doesn't implement CFR
> > - updated commit description
> > 
> >   fs/nfsd/vfs.c   |  8 +++-
> >   fs/read_write.c | 49 -
> >   2 files changed, 31 insertions(+), 26 deletions(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 04937e51de56..23dab0fa9087 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, 
> > u64 src_pos,
> >   ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file 
> > *dst,
> >  u64 dst_pos, u64 count)
> >   {
> > +   ssize_t ret;
> > /*
> >  * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> > @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 
> > src_pos, struct file *dst,
> >  * limit like this and pipeline multiple COPY requests.
> >  */
> > count = min_t(u64, count, 1 << 22);
> > -   return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > +   ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > +
> > +   if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > +   ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> > + count, 0);
> > +   return ret;
> >   }
> >   __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 75f76

[PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

2021-02-22 Thread Luis Henriques
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: 
https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: 
https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
Changes since v7
- set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
  error returned is always related to the 'copy' operation
Changes since v6
- restored i_sb checks for the clone operation
Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
  or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
  Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description
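
For illustration only (not part of the patch): a small userspace model of the
fallback pattern the nfsd hunk in this patch uses.  The first copy routine is
tried, and only -EOPNOTSUPP or -EXDEV trigger the generic one.  All names
here are hypothetical stand-ins; the function pointers merely take the place
of vfs_copy_file_range() and generic_copy_file_range(), and the 4MB clamp
mirrors the min_t() call in the hunk.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for a copy routine: returns bytes copied or a negative errno. */
typedef int64_t (*model_copy_fn)(uint64_t count);

#define MODEL_MAX_COPY (UINT64_C(1) << 22)	/* 4MB, as in the nfsd hunk */

/* Try 'primary'; fall back to 'generic' only for -EOPNOTSUPP or -EXDEV. */
int64_t model_copy_with_fallback(model_copy_fn primary, model_copy_fn generic,
				 uint64_t count)
{
	int64_t ret;

	if (count > MODEL_MAX_COPY)
		count = MODEL_MAX_COPY;
	ret = primary(count);
	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = generic(count);
	return ret;
}

/* Stubs used to exercise the model. */
int64_t model_stub_exdev(uint64_t count)
{
	(void)count;
	return -EXDEV;
}

int64_t model_stub_enospc(uint64_t count)
{
	(void)count;
	return -ENOSPC;
}

int64_t model_stub_copy_all(uint64_t count)
{
	return (int64_t)count;
}
```

Any other error from the first routine (-ENOSPC in the stub above) is
returned unchanged, which is the point of making the fallback conditional.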

 fs/nfsd/vfs.c   |  8 +++-
 fs/read_write.c | 49 -
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
 ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 			     u64 dst_pos, u64 count)
 {
+	ssize_t ret;
 
 	/*
 	 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 	 * limit like this and pipeline multiple COPY requests.
 	 */
 	count = min_t(u64, count, 1 << 22);
-	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+	ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
+		ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+					      count, 0);
+	return ret;
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..5a26297fd410 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, 
loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
-   /*
-* Although we now allow filesystems to handle cross sb copy, passing
-* a file of the wrong filesystem type to filesystem driver can result
-* in an attempt to dereference the wrong type of ->private_data, so
-* avoid doing that until we really have a good reason.  NFS defines
-* several different file_system_type structures, but they all end up
-* using the same ->copy_file_range() function pointer.
-*/
-   if (file_out->f_op->copy_file_range &&
-   file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
-   return file_out->f_op->copy_file_range(file_in, pos_in,
-  file_out, pos_out,
- 

[PATCH v7] vfs: fix copy_file_range regression in cross-fs copies

2021-02-21 Thread Luis Henriques
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: 
https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: 
https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
Changes since v6
- restored i_sb checks for the clone operation
Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
  or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
  Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

 fs/nfsd/vfs.c   |  8 +++-
 fs/read_write.c | 50 -
 2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
 ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 			     u64 dst_pos, u64 count)
 {
+	ssize_t ret;
 
 	/*
 	 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 	 * limit like this and pipeline multiple COPY requests.
 	 */
 	count = min_t(u64, count, 1 << 22);
-	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+	ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
+		ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+					      count, 0);
+	return ret;
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..463345c0ee30 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, 
loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
-   /*
-* Although we now allow filesystems to handle cross sb copy, passing
-* a file of the wrong filesystem type to filesystem driver can result
-* in an attempt to dereference the wrong type of ->private_data, so
-* avoid doing that until we really have a good reason.  NFS defines
-* several different file_system_type structures, but they all end up
-* using the same ->copy_file_range() function pointer.
-*/
-   if (file_out->f_op->copy_file_range &&
-   file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
-   return file_out->f_op->copy_file_range(file_in, pos_in,
-  file_out, pos_out,
-  len, flags);
-
-   return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
-  flags);
-}
-
 /

[PATCH v6] vfs: fix copy_file_range regression in cross-fs copies

2021-02-18 Thread Luis Henriques
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: 
https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: 
https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
And v6 is upon us.  Behold!

Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
  or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
  Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description
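
For illustration only (not part of the patch): a tiny userspace model of the
"check if ->copy_file_range is NULL before calling it" item above, combined
with the earlier changelog item about returning -EOPNOTSUPP when the
filesystem does not implement CFR.  The struct and function names are
hypothetical, and pairing these two changelog items in a single check is my
reading of the changelog, not a quote of the read_write.c hunk.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for the one file_operations member that matters here. */
struct model_fops {
	long (*copy_file_range)(size_t len);	/* may be NULL */
};

/* Call the optional hook, or report that the operation is unsupported. */
long model_call_copy(const struct model_fops *fops, size_t len)
{
	if (!fops->copy_file_range)
		return -EOPNOTSUPP;
	return fops->copy_file_range(len);
}

/* Stub hook that pretends every requested byte was copied. */
long model_copy_stub(size_t len)
{
	return (long)len;
}
```

Without the NULL test, a filesystem that leaves the hook unset would lead to
a call through a NULL function pointer.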

 fs/nfsd/vfs.c   |  8 +++-
 fs/read_write.c | 53 -
 2 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
 ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 			     u64 dst_pos, u64 count)
 {
+	ssize_t ret;
 
 	/*
 	 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 	 * limit like this and pipeline multiple COPY requests.
 	 */
 	count = min_t(u64, count, 1 << 22);
-	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+	ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
+		ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+					      count, 0);
+	return ret;
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..0348aaa9e237 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, 
loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
-   /*
-* Although we now allow filesystems to handle cross sb copy, passing
-* a file of the wrong filesystem type to filesystem driver can result
-* in an attempt to dereference the wrong type of ->private_data, so
-* avoid doing that until we really have a good reason.  NFS defines
-* several different file_system_type structures, but they all end up
-* using the same ->copy_file_range() function pointer.
-*/
-   if (file_out->f_op->copy_file_range &&
-   file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
-   return file_out->f_op->copy_file_range(file_in, pos_in,
-  file_out, pos_out,
-  len, flags);
-
-   return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
-  flags);
-}
-
 /*
  * Performs

Re: [PATCH v5] vfs: fix copy_file_range regression in cross-fs copies

2021-02-18 Thread Luis Henriques
Amir Goldstein  writes:

> On Thu, Feb 18, 2021 at 5:16 PM Luis Henriques  wrote:
>>
>> A regression has been reported by Nicolas Boichat, found while using the
>> copy_file_range syscall to copy a tracefs file.  Before commit
>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>> kernel would return -EXDEV to userspace when trying to copy a file across
>> different filesystems.  After this commit, the syscall doesn't fail anymore
>> and instead returns zero (zero bytes copied), as this file's content is
>> generated on-the-fly and thus reports a size of zero.
>>
>> This patch restores some cross-filesystem copy restrictions that existed
>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> devices").  Filesystems are still allowed to fall-back to the VFS
>> generic_copy_file_range() implementation, but that has now to be done
>> explicitly.
>>
>> nfsd is also modified to fall-back into generic_copy_file_range() in case
>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>
>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
>> Link: 
>> https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
>> Link: 
>> https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
>> Link: 
>> https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
>> Reported-by: Nicolas Boichat 
>> Signed-off-by: Luis Henriques 
>> ---
>> And v5!  Sorry.  Sure, it makes sense to go through all the vfs_cfr()
>> checks first.
>
> You missed my other comment on v4...
>
> not checking NULL copy_file_range case.

Ah, yeah, I did miss it.  I'll follow up with yet another revision.

Cheers,
-- 
Luis


[PATCH v5] vfs: fix copy_file_range regression in cross-fs copies

2021-02-18 Thread Luis Henriques
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: 
https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: 
https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: 
https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
And v5!  Sorry.  Sure, it makes sense to go through all the vfs_cfr()
checks first.

Again, here's my request for testing.

Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
  or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
  Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description
 fs/nfsd/vfs.c   |  8 +++-
 fs/read_write.c | 50 +++--
 2 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
 ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 			     u64 dst_pos, u64 count)
 {
+	ssize_t ret;
 
 	/*
 	 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 	 * limit like this and pipeline multiple COPY requests.
 	 */
 	count = min_t(u64, count, 1 << 22);
-	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+	ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
+		ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+					      count, 0);
+	return ret;
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..214d44f7cbfa 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, 
loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
-   /*
-* Although we now allow filesystems to handle cross sb copy, passing
-* a file of the wrong filesystem type to filesystem driver can result
-* in an attempt to dereference the wrong type of ->private_data, so
-* avoid doing that until we really have a good reason.  NFS defines
-* several different file_system_type structures, but they all end up
-* using the same ->copy_file_range() function pointer.
-*/
-   if (file_out->f_op->copy_file_range &&
-   file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
-   return file_out->f_op->copy_file_range(file_in, pos_in,
-  file_out, pos_out,
-  len, flags);
-
-   return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
-  flags);
-}
-
 /*
  * Perform

[PATCH v4] vfs: fix copy_file_range regression in cross-fs copies

2021-02-18 Thread Luis Henriques
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to use generic_copy_file_range() instead of
vfs_copy_file_range() so that it can still fall-back to splice without going
through all the checks.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
And here's v4.  I'd like to request help for testing.  I know Nicolas is
doing that (thanks!  and thanks for the reviews).  But it would be great to
get at least the nfs code tested.  Olga, can you help here?

Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
  Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

 fs/nfsd/vfs.c   |  2 +-
 fs/read_write.c | 50 +++--
 2 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..49dd28ee2602 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -578,7 +578,7 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 * limit like this and pipeline multiple COPY requests.
 */
count = min_t(u64, count, 1 << 22);
-   return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+   return generic_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..214d44f7cbfa 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
-				  struct file *file_out, loff_t pos_out,
-				  size_t len, unsigned int flags)
-{
-	/*
-	 * Although we now allow filesystems to handle cross sb copy, passing
-	 * a file of the wrong filesystem type to filesystem driver can result
-	 * in an attempt to dereference the wrong type of ->private_data, so
-	 * avoid doing that until we really have a good reason.  NFS defines
-	 * several different file_system_type structures, but they all end up
-	 * using the same ->copy_file_range() function pointer.
-	 */
-	if (file_out->f_op->copy_file_range &&
-	    file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
-		return file_out->f_op->copy_file_range(file_in, pos_in,
-						       file_out, pos_out,
-						       len, flags);
-
-	return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
-				       flags);
-}
-
 /*
  * Performs necessary checks before doing a file copy
  *
@@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
loff_t size_in;
int ret;
 
+   /*
+* Although we now allow filesystems to handle cross sb copy, passing
+* a file of the wrong filesystem type to filesystem driver can result
+* in an attempt to dereference the wrong type of ->private_data, so
+* avoid doing that until we really have a good reason.  NFS defin

Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-18 Thread Luis Henriques
Luis Henriques  writes:

> Amir Goldstein  writes:
>
>> On Thu, Feb 18, 2021 at 9:42 AM Christoph Hellwig  wrote:
>>>
>>> Looks good:
>>>
>>> Reviewed-by: Christoph Hellwig 
>>>
>>> This whole idea of cross-device copies has always been a horrible idea,
>>> and I've been arguing against it since the patches were posted.
>>
>> Ok. I'm good with this v2 as well, but need to add the fallback to
>> do_splice_direct()
>> in nfsd_copy_file_range(), because this patch breaks it.
>>
>> And the commit message of v3 is better in describing the reported issue.
>
> Except that, as I said in a previous email, v2 doesn't really fix the
> issue: all the checks need to be done earlier in generic_copy_file_checks().
>
> I'll work on getting v4 out, based on v2 but moving the checks earlier and
> implementing your review suggestions for v3 (plus this nfs change).

There's something else:

The filesystems (nfs, ceph, cifs, fuse) rely on the fallback to
generic_copy_file_range() if something's wrong.  And this "something's
wrong" is fs specific.  For example: in ceph it is possible to offload the
file copy to the OSDs even if the files are in different filesystems as
long as these filesystems are on the *same* ceph cluster.  If the copy
being done is across two different clusters, then the copy reverts to
splice.  This means that the boilerplate code being removed in v2 of this
patch needs to be restored and replaced by:

	ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
				     len, flags);

	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = do_splice_direct(src_file, &src_off, dst_file, &dst_off,
				       len > MAX_RW_COUNT ? MAX_RW_COUNT : len,
				       flags);
	return ret;

A quick look at the other filesystems' code indicates similar patterns.
Since at this point we've gone through all the syscall checks already,
calling do_splice_direct() shouldn't be a huge change.  But I may be
missing something.  Again.  Which is quite likely :-)
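
The wrapper pattern described above can be modelled outside the kernel.  The
following is only a userspace sketch: copy_fn, the offload_* stubs,
generic_copy() and copy_with_fallback() are hypothetical names invented for
illustration and are not kernel APIs.  It only shows the control flow: try the
filesystem-specific offload first, and fall back to the generic path solely on
-EOPNOTSUPP or -EXDEV.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a copy routine: bytes copied or -errno. */
typedef ssize_t (*copy_fn)(size_t len);

static int generic_calls;	/* counts how often the generic path ran */

/* Models an offload that cannot cross clusters. */
static ssize_t offload_cross_cluster(size_t len) { (void)len; return -EXDEV; }

/* Models an offload that works (same cluster). */
static ssize_t offload_same_cluster(size_t len) { return (ssize_t)len; }

/* Models an offload that fails for an unrelated reason. */
static ssize_t offload_io_error(size_t len) { (void)len; return -EIO; }

/* Models the splice-based generic path. */
static ssize_t generic_copy(size_t len) { generic_calls++; return (ssize_t)len; }

/* Try the fs-specific copy; fall back only on -EOPNOTSUPP or -EXDEV. */
static ssize_t copy_with_fallback(copy_fn fs_copy, copy_fn fallback, size_t len)
{
	ssize_t ret;

	assert(fs_copy && fallback);
	ret = fs_copy(len);
	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = fallback(len);
	return ret;
}
```

Any other error (such as -EIO in the stub above) is returned to the caller
unchanged, so the fallback never hides a real failure of the offload.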

Cheers,
-- 
Luis


Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-18 Thread Luis Henriques
Amir Goldstein  writes:

> On Thu, Feb 18, 2021 at 9:42 AM Christoph Hellwig  wrote:
>>
>> Looks good:
>>
>> Reviewed-by: Christoph Hellwig 
>>
>> This whole idea of cross-device copies has always been a horrible idea,
>> and I've been arguing against it since the patches were posted.
>
> Ok. I'm good with this v2 as well, but need to add the fallback to
> do_splice_direct()
> in nfsd_copy_file_range(), because this patch breaks it.
>
> And the commit message of v3 is better in describing the reported issue.

Except that, as I said in a previous email, v2 doesn't really fix the
issue: all the checks need to be done earlier in generic_copy_file_checks().

I'll work on getting v4 out, based on v2 but moving the checks earlier and
implementing your review suggestions for v3 (plus this nfs change).

Cheers,
-- 
Luis


[PATCH v3] vfs: fix copy_file_range regression in cross-fs copies

2021-02-17 Thread Luis Henriques
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.  Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems.  After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystems copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices").  It also introduces a flag (COPY_FILE_SPLICE) that can be used
by filesystems calling directly into the vfs copy_file_range to override
these restrictions.  Right now, only NFS needs to set this flag.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=zmpsg47cyhfjwnw7zs...@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Reported-by: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
Ok, I've tried to address all the issues and comments.  Hopefully this v3
is a bit closer to the final fix.

Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
  adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

 fs/nfsd/vfs.c  |  3 ++-
 fs/read_write.c| 44 +---
 include/linux/fs.h |  7 +++
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..14e55822c223 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -578,7 +578,8 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
 * limit like this and pipeline multiple COPY requests.
 */
count = min_t(u64, count, 1 << 22);
-   return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+   return vfs_copy_file_range(src, src_pos, dst, dst_pos, count,
+  COPY_FILE_SPLICE);
 }
 
 __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..40a16003fb05 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1410,6 +1410,33 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
   flags);
 }
 
+/*
+ * This helper function checks whether copy_file_range can actually be used,
+ * depending on the source and destination filesystems being the same.
+ *
+ * In-kernel callers may set COPY_FILE_SPLICE to override these checks.
+ */
+static int fops_copy_file_checks(struct file *file_in, struct file *file_out,
+				 unsigned int flags)
+{
+	if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
+		return -EINVAL;
+
+	if (flags & COPY_FILE_SPLICE)
+		return 0;
+	/*
+	 * We got here from userspace, so forbid copies if copy_file_range isn't
+	 * implemented or if we're doing a cross-fs copy.
+	 */
+	if (!file_out->f_op->copy_file_range)
+		return -EOPNOTSUPP;
+	else if (file_out->f_op->copy_file_range !=
+		 file_in->f_op->copy_file_range)
+		return -EXDEV;
+
+	return 0;
+}
+
 /*
  * Performs necessary checks before doing a file copy
  *
@@ -1427,6 +1454,14 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
loff_t size_in;
int ret;
 
+	/* Only check f_ops if we're not trying to clone */
+	if (!file_in->f_op->remap_file_range ||
+	    (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb)) {
+		ret = fops_copy_file_checks(file_in, file_out, flags);
+		if (ret)
+			return ret;
+	}
+
ret = generic_file_rw_checks(file_in, file_out);
if (ret)
return ret;
@@ -1474,9 +1509,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
ssize_t ret;
 
-   if (flags != 0)
-   return -EINVAL;
-
 	ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
 				       flags);
if (unlikely(ret))
@@ -1511,6 +1543,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
ret = clo

Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-16 Thread Luis Henriques
Amir Goldstein  writes:

> On Tue, Feb 16, 2021 at 6:41 PM Luis Henriques  wrote:
>>
>> Amir Goldstein  writes:
>>
>> >> Ugh.  And I guess overlayfs may have a similar problem.
>> >
>> > Not exactly.
>> > Generally speaking, overlayfs should call vfs_copy_file_range()
>> > with the flags it got from layer above, so if called from nfsd it
>> > will allow cross fs copy and when called from syscall it won't.
>> >
>> > There are some corner cases where overlayfs could benefit from
>> > COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but
>> > let's leave those for now. Just leave overlayfs code as is.
>>
>> Got it, thanks for clarifying.
>>
>> >> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
>> >> > is internal to kernel users.
>> >> >
>> >> > FWIW, you may want to look at the loop in ovl_copy_up_data()
>> >> > for improvements to nfsd_copy_file_range().
>> >> >
>> >> > We can move the check out to copy_file_range syscall:
>> >> >
>> >> > if (flags != 0)
>> >> > return -EINVAL;
>> >> >
>> >> > Leave the fallback from all filesystems and check for the
>> >> > COPY_FILE_SPLICE flag inside generic_copy_file_range().
>> >>
>> >> Ok, the diff below is just to make sure I understood your suggestion.
>> >>
>> >> The patch will also need to:
>> >>
>> >>  - change nfs and overlayfs calls to vfs_copy_file_range() so that they
>> >>use the new flag.
>> >>
>> >>  - check flags in generic_copy_file_checks() to make sure only valid flags
>> >>are used (COPY_FILE_SPLICE at the moment).
>> >>
>> >> Also, where should this flag be defined?  include/uapi/linux/fs.h?
>> >
>> > Grep for REMAP_FILE_
>> > Same header file, same Documentation rst file.
>> >
>> >>
>> >> Cheers,
>> >> --
>> >> Luis
>> >>
>> >> diff --git a/fs/read_write.c b/fs/read_write.c
>> >> index 75f764b43418..341d315d2a96 100644
>> >> --- a/fs/read_write.c
>> >> +++ b/fs/read_write.c
>> >> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file 
>> >> *file_in, loff_t pos_in,
>> >> struct file *file_out, loff_t pos_out,
>> >> size_t len, unsigned int flags)
>> >>  {
>> >> +   if (!(flags & COPY_FILE_SPLICE)) {
>> >> +   if (!file_out->f_op->copy_file_range)
>> >> +   return -EOPNOTSUPP;
>> >> +   else if (file_out->f_op->copy_file_range !=
>> >> +file_in->f_op->copy_file_range)
>> >> +   return -EXDEV;
>> >> +   }
>> >
>> > That looks strange, because you are duplicating the logic in
>> > do_copy_file_range(). Maybe better:
>> >
>> > if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
>> >         return -EINVAL;
>> > if (flags & COPY_FILE_SPLICE)
>> >         return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
>> >                                 len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
>>
>> My initial reasoning for duplicating the logic in do_copy_file_range() was
>> to allow the generic_copy_file_range() callers to be left unmodified and
>> allow the filesystems to default to this implementation.
>>
>> With this change, I guess that the calls to generic_copy_file_range() from
>> the different filesystems can be dropped, as in my initial patch, as they
>> will always get -EINVAL.  The other option would be to set the
>> COPY_FILE_SPLICE flag in those calls, but that would get us back to the
>> problem we're trying to solve.
>
> I don't understand the problem.
>
> What exactly is wrong with the code I suggested?
> Why should any filesystem be changed?
>
> Maybe I am missing something.

Ok, I have to do a full brain reboot and start all over.

Before that, I picked the code you suggested and tested it.  I've mounted
a cephfs filesystem and used xfs_io to execute a 'copy_range' command
using /sys/kernel/debug/sched_features as source.  The result was a
0-sized file in cephfs.  And the reason is the vfs_copy_file_range()
early exit in:

if (len == 0)
return 0;

'len' is set in generic_copy_file_checks().
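
To make the early exit concrete, here is a small userspace model of how the
requested length ends up as zero.  It assumes that the checks limit the copy
to the bytes between pos_in and the size the source inode reports; the names
clamp_copy_len() and model_copy() are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the length clamp: the copy is limited to the bytes between
 * pos_in and the size reported for the source file.
 */
static uint64_t clamp_copy_len(uint64_t size_in, uint64_t pos_in, uint64_t len)
{
	if (pos_in >= size_in)
		return 0;
	return len < size_in - pos_in ? len : size_in - pos_in;
}

/* Model of the early exit: a clamped length of zero means "0 bytes copied". */
static int64_t model_copy(uint64_t size_in, uint64_t pos_in, uint64_t len)
{
	len = clamp_copy_len(size_in, pos_in, len);
	assert(len <= INT64_MAX);
	if (len == 0)
		return 0;	/* the source is never read */
	return (int64_t)len;	/* pretend the whole clamped range was copied */
}
```

With a reported size of zero, as for files whose content is generated on the
fly, model_copy(0, 0, 4096) returns 0 regardless of how much data a read()
would actually produce.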

This means that we're not solving the original problem anymore (probably
since v1 of this patch, haven't checked).

Also, re-reading Trond's emails, I read: "... also disallowing the copy
from, say, an XFS formatted partition to an ext4 partition".  Isn't that
*exactly* what we're trying to do here?  I.e. _prevent_ these copies from
happening so that tracefs files can't be CFR'ed?
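
If such copies are rejected again, userspace has to fall back to plain reads
and writes.  The sketch below shows one way that could look; copy_fd() and
copy_rw() are hypothetical helper names, the set of errno values that trigger
the fallback is an assumption rather than a rule taken from this thread, and
the 4MB chunk size is arbitrary.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>

/* Plain read()/write() loop; works for any readable fd. */
static ssize_t copy_rw(int in, int out)
{
	char buf[65536];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(in, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return total;	/* end of input */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}

/* Try copy_file_range() first, then read whatever is left the slow way. */
static ssize_t copy_fd(int in, int out)
{
	ssize_t total = 0;

	assert(in >= 0 && out >= 0);
	for (;;) {
		ssize_t n = copy_file_range(in, NULL, out, NULL,
					    (size_t)1 << 22, 0);
		ssize_t r;

		if (n > 0) {
			total += n;
			continue;
		}
		if (n < 0 && errno != EXDEV && errno != EOPNOTSUPP &&
		    errno != ENOSYS && errno != EINVAL)
			return -1;
		/*
		 * Either the call is unsupported for this pair of files, or
		 * it returned 0, which may mean EOF or a source that reports
		 * a size of zero.  Reading covers both cases.
		 */
		r = copy_rw(in, out);
		return r < 0 ? -1 : total + r;
	}
}
```

Treating a return value of zero as "try reading anyway" costs one extra
read() at a real end of file, but it also copes with kernels where a
generated file is reported as empty.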

/me stops now and waits to see if the morning brings some sun :-)

Cheers,
-- 
Luis


Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-16 Thread Luis Henriques
Amir Goldstein  writes:

>> Ugh.  And I guess overlayfs may have a similar problem.
>
> Not exactly.
> Generally speaking, overlayfs should call vfs_copy_file_range()
> with the flags it got from layer above, so if called from nfsd it
> will allow cross fs copy and when called from syscall it won't.
>
> There are some corner cases where overlayfs could benefit from
> COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but
> let's leave those for now. Just leave overlayfs code as is.

Got it, thanks for clarifying.
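
For reference, the distinction between in-kernel callers and the syscall path
can be modelled in userspace as below.  This is a sketch only: struct
model_fops, struct model_file and model_copy_checks() are hypothetical, and
the numeric value given to COPY_FILE_SPLICE is made up because no value is
stated in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define COPY_FILE_SPLICE 1u	/* hypothetical value, for this model only */

/* Minimal stand-ins for struct file_operations and struct file. */
struct model_fops { int (*copy_file_range)(void); };
struct model_file { const struct model_fops *f_op; };

static int nfs_cfr(void) { return 1; }
static int ceph_cfr(void) { return 2; }

static const struct model_fops nfs_fops = { nfs_cfr };
static const struct model_fops ceph_fops = { ceph_cfr };
/* Models a filesystem that does not implement ->copy_file_range(). */
static const struct model_fops tracefs_fops = { NULL };

/* 0 if the copy may proceed, otherwise a negative errno value. */
static int model_copy_checks(const struct model_file *in,
			     const struct model_file *out,
			     unsigned int flags)
{
	assert(in && out);
	if (flags & ~COPY_FILE_SPLICE)
		return -EINVAL;		/* unknown flag */
	if (flags & COPY_FILE_SPLICE)
		return 0;		/* in-kernel caller: skip the checks */
	if (!out->f_op->copy_file_range)
		return -EOPNOTSUPP;
	if (out->f_op->copy_file_range != in->f_op->copy_file_range)
		return -EXDEV;		/* different implementations */
	return 0;
}
```

In this model a caller that passes COPY_FILE_SPLICE is allowed to copy between
any two files, while a caller that passes no flags gets -EXDEV as soon as the
two files do not share the same ->copy_file_range() implementation.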

>> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
>> > is internal to kernel users.
>> >
>> > FWIW, you may want to look at the loop in ovl_copy_up_data()
>> > for improvements to nfsd_copy_file_range().
>> >
>> > We can move the check out to copy_file_range syscall:
>> >
>> > if (flags != 0)
>> > return -EINVAL;
>> >
>> > Leave the fallback from all filesystems and check for the
>> > COPY_FILE_SPLICE flag inside generic_copy_file_range().
>>
>> Ok, the diff below is just to make sure I understood your suggestion.
>>
>> The patch will also need to:
>>
>>  - change nfs and overlayfs calls to vfs_copy_file_range() so that they
>>use the new flag.
>>
>>  - check flags in generic_copy_file_checks() to make sure only valid flags
>>are used (COPY_FILE_SPLICE at the moment).
>>
>> Also, where should this flag be defined?  include/uapi/linux/fs.h?
>
> Grep for REMAP_FILE_
> Same header file, same Documentation rst file.
>
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..341d315d2a96 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, 
>> loff_t pos_in,
>> struct file *file_out, loff_t pos_out,
>> size_t len, unsigned int flags)
>>  {
>> +   if (!(flags & COPY_FILE_SPLICE)) {
>> +   if (!file_out->f_op->copy_file_range)
>> +   return -EOPNOTSUPP;
>> +   else if (file_out->f_op->copy_file_range !=
>> +file_in->f_op->copy_file_range)
>> +   return -EXDEV;
>> +   }
>
> That looks strange, because you are duplicating the logic in
> do_copy_file_range(). Maybe better:
>
> if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
>         return -EINVAL;
> if (flags & COPY_FILE_SPLICE)
>         return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
>                                 len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);

My initial reasoning for duplicating the logic in do_copy_file_range() was
to allow the generic_copy_file_range() callers to be left unmodified and
allow the filesystems to default to this implementation.

With this change, I guess that the calls to generic_copy_file_range() from
the different filesystems can be dropped, as in my initial patch, as they
will always get -EINVAL.  The other option would be to set the
COPY_FILE_SPLICE flag in those calls, but that would get us back to the
problem we're trying to solve.

> if (!file_out->f_op->copy_file_range)
> return -EOPNOTSUPP;
> return -EXDEV;
>
>>  }
>> @@ -1474,9 +1481,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, 
>> loff_t pos_in,
>>  {
>> ssize_t ret;
>>
>> -   if (flags != 0)
>> -   return -EINVAL;
>> -
>
> This needs to move to the beginning of SYSCALL_DEFINE6(copy_file_range,...

Yep, I didn't include that change in my diff as I wasn't sure if you'd
like to have the flag visible in userspace.

Anyway, thanks for your patience!

Cheers,
-- 
Luis


Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-16 Thread Luis Henriques
"gre...@linuxfoundation.org"  writes:

> On Tue, Feb 16, 2021 at 11:17:34AM +, Luis Henriques wrote:
>> Amir Goldstein  writes:
>> 
>> > On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust  
>> > wrote:
>> >>
>> >> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote:
>> >> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust <
>> >> > tron...@hammerspace.com> wrote:
>> >> > >
>> >> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
>> >> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques <
>> >> > > > lhenriq...@suse.de>
>> >> > > > wrote:
>> >> > > > >
>> >> > > > > Nicolas Boichat reported an issue when trying to use the
>> >> > > > > copy_file_range
>> >> > > > > syscall on a tracefs file.  It failed silently because the file
>> >> > > > > content is
>> >> > > > > generated on-the-fly (reporting a size of zero) and
>> >> > > > > copy_file_range
>> >> > > > > needs
>> >> > > > > to know in advance how much data is present.
>> >> > > > >
>> >> > > > > This commit restores the cross-fs restrictions that existed
>> >> > > > > prior
>> >> > > > > to
>> >> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> >> > > > > devices")
>> >> > > > > and
>> >> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse,
>> >> > > > > and
>> >> > > > > nfs.
>> >> > > > >
>> >> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> >> > > > > devices")
>> >> > > > > Link:
>> >> > > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
>> >> > > > > Cc: Nicolas Boichat 
>> >> > > > > Signed-off-by: Luis Henriques 
>> >> > > >
>> >> > > > Code looks ok.
>> >> > > > You may add:
>> >> > > >
>> >> > > > Reviewed-by: Amir Goldstein 
>> >> > > >
>> >> > > > I agree with Trond that the first paragraph of the commit message
>> >> > > > could
>> >> > > > be improved.
>> >> > > > The purpose of this change is to fix the change of behavior that
>> >> > > > caused the regression.
>> >> > > >
>> >> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to
>> >> > > > read.
>> >> > > > After v5.3, behavior is zero size copy.
>> >> > > >
>> >> > > > It does not matter so much what makes sense for CFR to do in this
>> >> > > > case (generic cross-fs copy).  What matters is that nobody asked
>> >> > > > for
>> >> > > > this change and that it caused problems.
>> >> > > >
>> >> > >
>> >> > > No. I'm saying that this patch should be NACKed unless there is a
>> >> > > real
>> >> > > explanation for why we give crap about this tracefs corner case and
>> >> > > why
>> >> > > it can't be fixed.
>> >> > >
>> >> > > There are plenty of reasons why copy offload across filesystems
>> >> > > makes
>> >> > > sense, and particularly when you're doing NAS. Clone just doesn't
>> >> > > cut
>> >> > > it when it comes to disaster recovery (whereas backup to a
>> >> > > different
>> >> > > storage unit does). If the client has to do the copy, then you're
>> >> > > effectively doubling the load on the server, and you're adding
>> >> > > potentially unnecessary network traffic (or at the very least you
>> >> > > are
>> >> > > doubling that traffic).
>> >> > >
>> >> >
>> >> > I don't understand the use case you are describing.
>> >> >
>> >> > Which filesystem types are you tal

Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-16 Thread Luis Henriques
Amir Goldstein  writes:

> On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust  
> wrote:
>>
>> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote:
>> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust <
>> > tron...@hammerspace.com> wrote:
>> > >
>> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
>> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques <
>> > > > lhenriq...@suse.de>
>> > > > wrote:
>> > > > >
>> > > > > Nicolas Boichat reported an issue when trying to use the
>> > > > > copy_file_range
>> > > > > syscall on a tracefs file.  It failed silently because the file
>> > > > > content is
>> > > > > generated on-the-fly (reporting a size of zero) and
>> > > > > copy_file_range
>> > > > > needs
>> > > > > to know in advance how much data is present.
>> > > > >
>> > > > > This commit restores the cross-fs restrictions that existed
>> > > > > prior
>> > > > > to
>> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> > > > > devices")
>> > > > > and
>> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse,
>> > > > > and
>> > > > > nfs.
>> > > > >
>> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> > > > > devices")
>> > > > > Link:
>> > > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
>> > > > > Cc: Nicolas Boichat 
>> > > > > Signed-off-by: Luis Henriques 
>> > > >
>> > > > Code looks ok.
>> > > > You may add:
>> > > >
>> > > > Reviewed-by: Amir Goldstein 
>> > > >
>> > > > I agree with Trond that the first paragraph of the commit message
>> > > > could
>> > > > be improved.
>> > > > The purpose of this change is to fix the change of behavior that
>> > > > caused the regression.
>> > > >
>> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to
>> > > > read.
>> > > > After v5.3, behavior is zero size copy.
>> > > >
>> > > > It does not matter so much what makes sense for CFR to do in this
>> > > > case (generic cross-fs copy).  What matters is that nobody asked
>> > > > for
>> > > > this change and that it caused problems.
>> > > >
>> > >
>> > > No. I'm saying that this patch should be NACKed unless there is a
>> > > real
>> > > explanation for why we give crap about this tracefs corner case and
>> > > why
>> > > it can't be fixed.
>> > >
>> > > There are plenty of reasons why copy offload across filesystems
>> > > makes
>> > > sense, and particularly when you're doing NAS. Clone just doesn't
>> > > cut
>> > > it when it comes to disaster recovery (whereas backup to a
>> > > different
>> > > storage unit does). If the client has to do the copy, then you're
>> > > effectively doubling the load on the server, and you're adding
>> > > potentially unnecessary network traffic (or at the very least you
>> > > are
>> > > doubling that traffic).
>> > >
>> >
>> > I don't understand the use case you are describing.
>> >
>> > Which filesystem types are you talking about for source and target
>> > of copy_file_range()?
>> >
>> > To be clear, the original change was done to support NFS/CIFS server-
>> > side
>> > copy and those should not be affected by this change.
>> >
>>
>> That is incorrect:
>>
>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file
>> *dst,
>>  u64 dst_pos, u64 count)
>> {
>>
>>  /*
>>  * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>>  * thread and client rpc slot. The choice of 4MB is somewhat
>>  * arbitrary. We might instead base this on r/wsize, or make it
>>  * tunable, or use a time instead of a byte limit, or implement
>>  * asynchronous copy. In theory a client could also recognize a
>>  * limit like this an

[PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-15 Thread Luis Henriques
Nicolas Boichat reported an issue when trying to use the copy_file_range
syscall on a tracefs file.  It failed silently because the file content is
generated on-the-fly (reporting a size of zero) and copy_file_range needs
to know in advance how much data is present.

This commit restores the cross-fs restrictions that existed prior to
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") and
removes generic_copy_file_range() calls from ceph, cifs, fuse, and nfs.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
Cc: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

 fs/ceph/file.c | 21 +++-
 fs/cifs/cifsfs.c   |  3 ---
 fs/fuse/file.c | 21 +++-
 fs/nfs/nfs4file.c  | 20 +++
 fs/read_write.c| 49 ++
 include/linux/fs.h |  3 ---
 6 files changed, 19 insertions(+), 98 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 209535d5b8d3..639bd7bfaea9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off
return bytes;
 }
 
-static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
+static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+   struct file *dst_file, loff_t dst_off,
+   size_t len, unsigned int flags)
 {
struct inode *src_inode = file_inode(src_file);
struct inode *dst_inode = file_inode(dst_file);
@@ -2456,21 +2456,6 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
return ret;
 }
 
-static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
-   struct file *dst_file, loff_t dst_off,
-   size_t len, unsigned int flags)
-{
-   ssize_t ret;
-
-   ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
-len, flags);
-
-   if (ret == -EOPNOTSUPP || ret == -EXDEV)
-   ret = generic_copy_file_range(src_file, src_off, dst_file,
- dst_off, len, flags);
-   return ret;
-}
-
 const struct file_operations ceph_file_fops = {
.open = ceph_open,
.release = ceph_release,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index ab883e84e116..7aa3d20f21c0 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1229,9 +1229,6 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
len, flags);
free_xid(xid);
 
-   if (rc == -EOPNOTSUPP || rc == -EXDEV)
-   rc = generic_copy_file_range(src_file, off, dst_file,
-destoff, len, flags);
return rc;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb55fb8..0dd703278e49 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3330,9 +3330,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
return err;
 }
 
-static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
+static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
 {
struct fuse_file *ff_in = file_in->private_data;
struct fuse_file *ff_out = file_out->private_data;
@@ -3439,21 +3439,6 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
return err;
 }
 
-static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
-   struct file *dst_file, loff_t dst_off,
-   size_t len, unsigned int flags)
-{
-   ssize_t ret;
-
-   ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
-len, flags);
-
-   if (ret == -EOPNOTSUPP || ret == -EXDEV)
-   ret = generic_copy_file_range(src_file, src_off, dst_file,
- dst_off, len, flags);
-   return ret;
-}
-
 static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_

Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-15 Thread Luis Henriques
Amir Goldstein  writes:

> On Mon, Feb 15, 2021 at 2:21 PM Luis Henriques  wrote:
>>
>> Luis Henriques  writes:
>>
>> > Amir Goldstein  writes:
>> >
>> >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques  wrote:
>> > ...
>> >>> Sure, I just wanted to point out that *maybe* there are other options 
>> >>> than
>> >>> simply reverting that commit :-)
>> >>>
>> >>> Something like the patch below (completely untested!) should revert to 
>> >>> the
>> >>> old behaviour in filesystems that don't implement the CFR syscall.
>> >>>
>> >>> Cheers,
>> >>> --
>> >>> Luis
>> >>>
>> >>> diff --git a/fs/read_write.c b/fs/read_write.c
>> >>> index 75f764b43418..bf5dccc43cc9 100644
>> >>> --- a/fs/read_write.c
>> >>> +++ b/fs/read_write.c
>> >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file 
>> >>> *file_in, loff_t pos_in,
>> >>>file_out, pos_out,
>> >>>len, flags);
>> >>>
>> >>> -   return generic_copy_file_range(file_in, pos_in, file_out, 
>> >>> pos_out, len,
>> >>> -  flags);
>> >>> +   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> >>> +   return -EXDEV;
>> >>> +   else
>> >>> +   generic_copy_file_range(file_in, pos_in, file_out, 
>> >>> pos_out, len,
>> >>> +   flags);
>> >>>  }
>> >>>
>> >>
>> >> Which kernel is this patch based on?
>> >
>> > It was v5.11-rc7.
>> >
>> >> At this point, I am with Dave and Darrick on not falling back to
>> >> generic_copy_file_range() at all.
>> >>
>> >> We do not have proof of any workload that benefits from it and the
>> >> above patch does not protect from a weird use case of trying to copy a 
>> >> file
>> >> from sysfs to sysfs.
>> >>
>> >
>> > Ok, cool.  I can post a new patch doing just that.  I guess that function
>> > do_copy_file_range() can be dropped in that case.
>> >
>> >> I am indecisive about what should be done with generic_copy_file_range()
>> >> called as fallback from within filesystems.
>> >>
>> >> I think the wise choice is to not do the fallback in any case, but this 
>> >> is up
>> >> to the specific filesystem maintainers to decide.
>> >
>> > I see what you mean.  You're suggesting to have userspace handle all the
>> > -EOPNOTSUPP and -EXDEV errors.  Would you rather have a patch that also
>> > removes all the calls to generic_copy_file_range() function?  And that
>> > function can also be deleted too, of course.
>>
>> Here's a first stab at this patch.  Hopefully I didn't forget anything
>> here.  Let me know if you prefer the more conservative approach, i.e. not
>> touching any of the filesystems and let them use generic_copy_file_range.
>>
>
> I'm fine with this one (modulo my comment below).
> CC'ing fuse/cifs/nfs maintainers.
> Though I don't think the FS maintainers should mind removing the fallback.
> It was added by "us" (64bf5ff58dff "vfs: no fallback for ->copy_file_range()")

Thanks for your review, Amir.  I'll be posting v2 shortly.

Cheers,
-- 
Luis


>> Once everyone agrees on the final solution, I can follow-up with the
>> manpages update.
>>
>> Cheers,
>> --
>> Luis
>>
>> From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001
>> From: Luis Henriques 
>> Date: Fri, 12 Feb 2021 18:03:23 +
>> Subject: [PATCH] vfs: prevent copy_file_range to copy across devices
>>
>> Nicolas Boichat reported an issue when trying to use the copy_file_range
>> syscall on a tracefs file.  It failed silently because the file content is
>> generated on-the-fly (reporting a size of zero) and copy_file_range needs
>> to know in advance how much data is present.
>>
>> This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to
>> copy across devices").  Now the copy is done only if the filesystems for 
>> source
>> and destination files are the sa

Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-15 Thread Luis Henriques
Luis Henriques  writes:

> Amir Goldstein  writes:
>
>> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques  wrote:
> ...
>>> Sure, I just wanted to point out that *maybe* there are other options than
>>> simply reverting that commit :-)
>>>
>>> Something like the patch below (completely untested!) should revert to the
>>> old behaviour in filesystems that don't implement the CFR syscall.
>>>
>>> Cheers,
>>> --
>>> Luis
>>>
>>> diff --git a/fs/read_write.c b/fs/read_write.c
>>> index 75f764b43418..bf5dccc43cc9 100644
>>> --- a/fs/read_write.c
>>> +++ b/fs/read_write.c
>>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file 
>>> *file_in, loff_t pos_in,
>>>file_out, pos_out,
>>>len, flags);
>>>
>>> -   return generic_copy_file_range(file_in, pos_in, file_out, pos_out, 
>>> len,
>>> -  flags);
>>> +   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>>> +   return -EXDEV;
>>> +   else
>>> +   generic_copy_file_range(file_in, pos_in, file_out, pos_out, 
>>> len,
>>> +   flags);
>>>  }
>>>
>>
>> Which kernel is this patch based on?
>
> It was v5.11-rc7.
>
>> At this point, I am with Dave and Darrick on not falling back to
>> generic_copy_file_range() at all.
>>
>> We do not have proof of any workload that benefits from it and the
>> above patch does not protect from a weird use case of trying to copy a file
>> from sysfs to sysfs.
>>
>
> Ok, cool.  I can post a new patch doing just that.  I guess that function
> do_copy_file_range() can be dropped in that case.
>
>> I am indecisive about what should be done with generic_copy_file_range()
>> called as fallback from within filesystems.
>>
>> I think the wise choice is to not do the fallback in any case, but this is up
>> to the specific filesystem maintainers to decide.
>
> I see what you mean.  You're suggesting to have userspace handle all the
> -EOPNOTSUPP and -EXDEV errors.  Would you rather have a patch that also
> removes all the calls to generic_copy_file_range() function?  And that
> function can also be deleted too, of course.

Here's a first stab at this patch.  Hopefully I didn't forget anything
here.  Let me know if you prefer the more conservative approach, i.e. not
touching any of the filesystems and let them use generic_copy_file_range.

Once everyone agrees on the final solution, I can follow-up with the
manpages update.

Cheers,
-- 
Luis

>From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001
From: Luis Henriques 
Date: Fri, 12 Feb 2021 18:03:23 +
Subject: [PATCH] vfs: prevent copy_file_range to copy across devices

Nicolas Boichat reported an issue when trying to use the copy_file_range
syscall on a tracefs file.  It failed silently because the file content is
generated on-the-fly (reporting a size of zero) and copy_file_range needs
to know in advance how much data is present.

This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to
copy across devices").  Now the copy is done only if the filesystems for source
and destination files are the same and they implement this syscall.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Cc: Nicolas Boichat 
Signed-off-by: Luis Henriques 
---
 fs/ceph/file.c | 21 +++
 fs/cifs/cifsfs.c   |  3 ---
 fs/fuse/file.c | 21 +++
 fs/nfs/nfs4file.c  | 20 +++---
 fs/read_write.c| 65 --
 include/linux/fs.h |  3 ---
 6 files changed, 20 insertions(+), 113 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 209535d5b8d3..639bd7bfaea9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off
return bytes;
 }
 
-static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
+static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+   struct file *dst_file, loff_t dst_off,
+   size_t len, unsigned int flags)
 {
struct inode *src_inode = file_inode(src_file);
struct inode *dst_inode = file_inode(dst_file);
@@ -2456,2

Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-15 Thread Luis Henriques
Amir Goldstein  writes:

> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques  wrote:
...
>> Sure, I just wanted to point out that *maybe* there are other options than
>> simply reverting that commit :-)
>>
>> Something like the patch below (completely untested!) should revert to the
>> old behaviour in filesystems that don't implement the CFR syscall.
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..bf5dccc43cc9 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file 
>> *file_in, loff_t pos_in,
>>file_out, pos_out,
>>len, flags);
>>
>> -   return generic_copy_file_range(file_in, pos_in, file_out, pos_out, 
>> len,
>> -  flags);
>> +   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> +   return -EXDEV;
>> +   else
>> +   generic_copy_file_range(file_in, pos_in, file_out, pos_out, 
>> len,
>> +   flags);
>>  }
>>
>
> Which kernel is this patch based on?

It was v5.11-rc7.

> At this point, I am with Dave and Darrick on not falling back to
> generic_copy_file_range() at all.
>
> We do not have proof of any workload that benefits from it and the
> above patch does not protect from a weird use case of trying to copy a file
> from sysfs to sysfs.
>

Ok, cool.  I can post a new patch doing just that.  I guess that function
do_copy_file_range() can be dropped in that case.

> I am indecisive about what should be done with generic_copy_file_range()
> called as fallback from within filesystems.
>
> I think the wise choice is to not do the fallback in any case, but this is up
> to the specific filesystem maintainers to decide.

I see what you mean.  You're suggesting to have userspace handle all the
-EOPNOTSUPP and -EXDEV errors.  Would you rather have a patch that also
removes all the calls to generic_copy_file_range() function?  And that
function can also be deleted too, of course.
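
For illustration, handling those two errors in userspace could look roughly
like the sketch below.  It is only a sketch under assumptions and not code
from any patch in this thread: copy_fd() and copy_by_read_write() are
hypothetical helper names, and treating ENOSYS like the other two errors is
an extra assumption for older kernels.  After copy_file_range() reports zero
bytes the helper still attempts a plain read(), so files whose content is
generated on-the-fly (and which report a size of zero) get copied as well.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() copy of everything left in fd_in, until EOF. */
static ssize_t copy_by_read_write(int fd_in, int fd_out)
{
	char buf[65536];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(fd_in, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return total;	/* EOF */

		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}

/*
 * Hypothetical helper: try copy_file_range() first, then fall back to a
 * read/write loop when the kernel refuses (EXDEV, EOPNOTSUPP, and, as an
 * assumption for old kernels, ENOSYS) or when it reports zero bytes.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_fd(int fd_in, int fd_out)
{
	ssize_t total = 0;
	ssize_t rest;

	for (;;) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    1 << 20, 0);

		if (n > 0) {
			total += n;
			continue;
		}
		if (n == 0)
			break;	/* EOF, or a generated file of "size" zero */
		if (errno == EINTR)
			continue;
		if (errno == EXDEV || errno == EOPNOTSUPP || errno == ENOSYS)
			break;	/* let the read/write loop take over */
		return -1;
	}

	/* Both descriptors' offsets already account for what was copied. */
	rest = copy_by_read_write(fd_in, fd_out);
	return rest < 0 ? -1 : total + rest;
}
```

With NULL offsets both calls advance the file positions of the descriptors,
which is what lets the fallback loop continue where copy_file_range() stopped.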

Cheers,
-- 
Luis


Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-12 Thread Luis Henriques
Greg KH  writes:

> On Fri, Feb 12, 2021 at 12:41:48PM +0000, Luis Henriques wrote:
>> Greg KH  writes:
...
>> >> >> Our options now are:
>> >> >> - Restore the cross-fs restriction into generic_copy_file_range()
>> >> >
>> >> > Yes.
>> >> >
>> >> 
>> >> Restoring this restriction will actually change the current cephfs CFR
>> >> behaviour.  Since that commit we have allowed doing remote copies between
>> >> different filesystems within the same ceph cluster.  See commit
>> >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
>> >> the same cluster").
>> >> 
>> >> Although I'm not aware of any current users for this scenario, the
>> >> performance impact can actually be huge as it's the difference between
>> >> asking the OSDs for copying a file and doing a full read+write on the
>> >> client side.
>> >
>> > Regression in performance is ok if it fixes a regression for things that
>> > used to work just fine in the past :)
>> >
>> > First rule, make it work.
>> 
>> Sure, I just wanted to point out that *maybe* there are other options than
>> simply reverting that commit :-)
>> 
>> Something like the patch below (completely untested!) should revert to the
>> old behaviour in filesystems that don't implement the CFR syscall.
>> 
>> Cheers,
>> -- 
>> Luis
>> 
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..bf5dccc43cc9 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file 
>> *file_in, loff_t pos_in,
>> file_out, pos_out,
>> len, flags);
>>  
>> -return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> -   flags);
>> +if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> +return -EXDEV;
>> +else
>> +generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> +flags);
>>  }
>>  
>>  /*
>
> That would make much more sense to me.

Great.  I can send a proper patch with changelog, if this is really
what we want.  But I would rather hear from others first.  I guess that at
least the NFS devs have something to say here.

Cheers,
-- 
Luis


Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-12 Thread Luis Henriques
Greg KH  writes:

> On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote:
>> Greg KH  writes:
>> 
>> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
>> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH  
>> >> wrote:
>> >> >
>> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
>> >> > > Filesystems such as procfs and sysfs generate their content at
>> >> > > runtime. This implies the file sizes do not usually match the
>> >> > > amount of data that can be read from the file, and that seeking
>> >> > > may not work as intended.
>> >> > >
>> >> > > This will be useful to disallow copy_file_range with input files
>> >> > > from such filesystems.
>> >> > >
>> >> > > Signed-off-by: Nicolas Boichat 
>> >> > > ---
>> >> > > I first thought of adding a new field to struct file_operations,
>> >> > > but that doesn't quite scale as every single file creation
>> >> > > operation would need to be modified.
>> >> >
>> >> > Even so, you missed a load of filesystems in the kernel with this patch
>> >> > series, what makes the ones you did mark here different from the
>> >> > "internal" filesystems that you did not?
>> >> >
>> >> > This feels wrong, why is userspace suddenly breaking?  What changed in
>> >> > the kernel that caused this?  Procfs has been around for a _very_ long
>> >> > time :)
>> >> 
>> >> That would be because of (v5.3):
>> >> 
>> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>> >> 
>> >> The intention of this change (series) was to allow server side copy
>> >> for nfs and cifs via copy_file_range().
>> >> This is mostly work by Dave Chinner that I picked up following requests
>> >> from the NFS folks.
>> >> 
>> >> But the above change also includes this generic change:
>> >> 
>> >> -   /* this could be relaxed once a method supports cross-fs copies */
>> >> -   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> >> -   return -EXDEV;
>> >> -
>> >> 
>> >> The change of behavior was documented in the commit message.
>> >> It was also documented in:
>> >> 
>> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
>> >> 
>> >> I think our rationale for the generic change was:
>> >> "Why not? What could go wrong? (TM)"
>> >> I am not sure if any workload really gained something from this
>> >> kernel cross-fs CFR.
>> >
>> > Why not put that check back?
>> >
>> >> In retrospect, I think it would have been safer to allow cross-fs CFR
>> >> only to the filesystems that implement ->{copy,remap}_file_range()...
>> >
>> > Why not make this change?  That seems easier and should fix this for
>> > everyone, right?
>> >
>> >> Our options now are:
>> >> - Restore the cross-fs restriction into generic_copy_file_range()
>> >
>> > Yes.
>> >
>> 
>> Restoring this restriction will actually change the current cephfs CFR
>> behaviour.  Since that commit we have allowed doing remote copies between
>> different filesystems within the same ceph cluster.  See commit
>> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
>> the same cluster").
>> 
>> Although I'm not aware of any current users for this scenario, the
>> performance impact can actually be huge as it's the difference between
>> asking the OSDs for copying a file and doing a full read+write on the
>> client side.
>
> Regression in performance is ok if it fixes a regression for things that
> used to work just fine in the past :)
>
> First rule, make it work.

Sure, I just wanted to point out that *maybe* there are other options than
simply reverting that commit :-)

Something like the patch below (completely untested!) should revert to the
old behaviour in filesystems that don't implement the CFR syscall.

Cheers,
-- 
Luis

diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..bf5dccc43cc9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
   file_out, pos_out,
   len, flags);
 
-   return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
-  flags);
+   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+   return -EXDEV;
+   else
+   generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+   flags);
 }
 
 /*
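
Note that, as posted, the else branch in the hunk above calls
generic_copy_file_range() without returning its result.  A minimal userspace
model of the intended check, with both branches returning a value, might look
like the sketch below.  The struct definitions, generic_copy_stub() and
copy_fallback() are stand-ins invented for this illustration; they are not
the kernel's definitions and only mirror the superblock comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Minimal stand-ins for the kernel types; only what the check needs. */
struct super_block { int id; };
struct inode { struct super_block *i_sb; };
struct file { struct inode *f_inode; };

static struct inode *file_inode(const struct file *f)
{
	return f->f_inode;
}

/* Stand-in for generic_copy_file_range(): pretend the range was copied. */
static ssize_t generic_copy_stub(struct file *in, struct file *out, size_t len)
{
	(void)in;
	(void)out;
	return (ssize_t)len;
}

/*
 * Model of the fallback path: refuse cross-superblock copies with -EXDEV,
 * otherwise return whatever the generic copy returns.
 */
static ssize_t copy_fallback(struct file *in, struct file *out, size_t len)
{
	if (file_inode(in)->i_sb != file_inode(out)->i_sb)
		return -EXDEV;

	return generic_copy_stub(in, out, len);
}
```

Dropping the else and returning directly keeps every path of the non-void
function returning a value.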


Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

2021-02-12 Thread Luis Henriques
Greg KH  writes:

> On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
>> On Fri, Feb 12, 2021 at 9:49 AM Greg KH  wrote:
>> >
>> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
>> > > Filesystems such as procfs and sysfs generate their content at
>> > > runtime. This implies the file sizes do not usually match the
>> > > amount of data that can be read from the file, and that seeking
>> > > may not work as intended.
>> > >
>> > > This will be useful to disallow copy_file_range with input files
>> > > from such filesystems.
>> > >
>> > > Signed-off-by: Nicolas Boichat 
>> > > ---
>> > > I first thought of adding a new field to struct file_operations,
>> > > but that doesn't quite scale as every single file creation
>> > > operation would need to be modified.
>> >
>> > Even so, you missed a load of filesystems in the kernel with this patch
>> > series, what makes the ones you did mark here different from the
>> > "internal" filesystems that you did not?
>> >
>> > This feels wrong, why is userspace suddenly breaking?  What changed in
>> > the kernel that caused this?  Procfs has been around for a _very_ long
>> > time :)
>> 
>> That would be because of (v5.3):
>> 
>> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>> 
>> The intention of this change (series) was to allow server side copy
>> for nfs and cifs via copy_file_range().
>> This is mostly work by Dave Chinner that I picked up following requests
>> from the NFS folks.
>> 
>> But the above change also includes this generic change:
>> 
>> -   /* this could be relaxed once a method supports cross-fs copies */
>> -   if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> -   return -EXDEV;
>> -
>> 
>> The change of behavior was documented in the commit message.
>> It was also documented in:
>> 
>> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
>> 
>> I think our rationale for the generic change was:
>> "Why not? What could go wrong? (TM)"
>> I am not sure if any workload really gained something from this
>> kernel cross-fs CFR.
>
> Why not put that check back?
>
>> In retrospect, I think it would have been safer to allow cross-fs CFR
>> only to the filesystems that implement ->{copy,remap}_file_range()...
>
> Why not make this change?  That seems easier and should fix this for
> everyone, right?
>
>> Our options now are:
>> - Restore the cross-fs restriction into generic_copy_file_range()
>
> Yes.
>

Restoring this restriction will actually change the current cephfs CFR
behaviour.  Since that commit we have allowed doing remote copies between
different filesystems within the same ceph cluster.  See commit
6fd4e6348352 ("ceph: allow object copies across different filesystems in
the same cluster").

Although I'm not aware of any current users for this scenario, the
performance impact can actually be huge as it's the difference between
asking the OSDs for copying a file and doing a full read+write on the
client side.

Cheers,
-- 
Luis


>> - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does
>
> No.  That way lies constant auditing and someone being "vigilant" for
> the next 30+ years.  Which will not happen.
>
> thanks,
>
> greg k-h


Re: [PATCH v2] ceph: add ceph.caps vxattr

2020-11-24 Thread Luis Henriques
Jeff Layton  writes:

> On Mon, 2020-11-23 at 17:38 +0000, Luis Henriques wrote:
>> Add a new vxattr that allows userspace to list the caps for a specific
>> directory or file.
>> 
>> Signed-off-by: Luis Henriques 
>> ---
>> Hi!
>> 
>> Here's a version that also shows the caps in hexadecimal format, as
>> suggested by Jeff.  IMO the parentheses and the '0x' prefix help the
>> readability, but they may make it a bit harder for scripts to parse the
>> output.  I'm OK dropping those.
>> 
>> Cheers,
>
> Looks good, merged into "testing".

Awesome, thanks!

> I did make a slight change to the format -- instead of putting the hex
> value in parentheses, I separated the two fields with a /, which I think
> should make things easier for scripts to parse.
>
> You should be able to do something like this to get at the hex value for
> testing:
>
> $ getfattr -n ceph.caps foo | cut -d / -f2
>
> Let me know if you see issues with that and we can revisit the format.

Sure, I'm OK with that.  Or even simply dropping any separator, having
only a space/tab between the string and the hex value.

Another option I saw was to have two vxattrs: ceph.caps.string and
ceph.caps.int.  But that's probably overkill.
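
For a program rather than a shell pipeline, splitting the value on the '/'
separator could be sketched as below.  This assumes the merged format is the
caps string, a '/', and then the hex value; parse_ceph_caps() is a
hypothetical helper, and the sample value in the usage note is made up rather
than taken from a real mount.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split "<caps string>/<hex value>" into its two
 * fields.  Returns 0 on success, -1 if the value doesn't have that shape
 * or the caps string doesn't fit in 'names'.
 */
static int parse_ceph_caps(const char *value, char *names, size_t names_len,
			   unsigned int *mask)
{
	const char *sep = strrchr(value, '/');
	char *end;
	unsigned long v;
	size_t n;

	if (!sep || sep[1] == '\0')
		return -1;

	n = (size_t)(sep - value);
	if (n >= names_len)
		return -1;

	/* Base 16 accepts an optional "0x" prefix. */
	v = strtoul(sep + 1, &end, 16);
	if (end == sep + 1 || *end != '\0')
		return -1;

	memcpy(names, value, n);
	names[n] = '\0';
	*mask = (unsigned int)v;
	return 0;
}
```

The value itself would come from getxattr(2) with the name "ceph.caps"; for a
made-up value such as "pAsLsXsFs/0x155" the helper yields the string
"pAsLsXsFs" and the mask 0x155.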

Cheers,
-- 
Luis


[PATCH v2] ceph: add ceph.caps vxattr

2020-11-23 Thread Luis Henriques
Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

Signed-off-by: Luis Henriques 
---
Hi!

Here's a version that also shows the caps in hexadecimal format, as
suggested by Jeff.  IMO the parentheses and the '0x' prefix help the
readability, but they may make it a bit harder for scripts to parse the
output.  I'm OK dropping those.

Cheers,
-- 
Luis

fs/ceph/xattr.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 197cb1234341..aec9bd5c8e77 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -303,6 +303,19 @@ static ssize_t ceph_vxattrcb_snap_btime(struct ceph_inode_info *ci, char *val,
ci->i_snap_btime.tv_nsec);
 }
 
+static ssize_t ceph_vxattrcb_caps(struct ceph_inode_info *ci, char *val,
+   size_t size)
+{
+   int issued;
+
+   spin_lock(&ci->i_ceph_lock);
+   issued = __ceph_caps_issued(ci, NULL);
+   spin_unlock(&ci->i_ceph_lock);
+
+   return ceph_fmt_xattr(val, size, "%s (0x%x)",
+ ceph_cap_string(issued), issued);
+}
+
 #define CEPH_XATTR_NAME(_type, _name)  XATTR_CEPH_PREFIX #_type "." #_name
 #define CEPH_XATTR_NAME2(_type, _name, _name2) \
XATTR_CEPH_PREFIX #_type "." #_name "." #_name2
@@ -378,6 +391,13 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = {
.exists_cb = ceph_vxattrcb_snap_btime_exists,
.flags = VXATTR_FLAG_READONLY,
},
+   {
+   .name = "ceph.caps",
+   .name_size = sizeof("ceph.caps"),
+   .getxattr_cb = ceph_vxattrcb_caps,
+   .exists_cb = NULL,
+   .flags = VXATTR_FLAG_HIDDEN,
+   },
{ .name = NULL, 0 } /* Required table terminator */
 };
 
@@ -403,6 +423,13 @@ static struct ceph_vxattr ceph_file_vxattrs[] = {
.exists_cb = ceph_vxattrcb_snap_btime_exists,
.flags = VXATTR_FLAG_READONLY,
},
+   {
+   .name = "ceph.caps",
+   .name_size = sizeof("ceph.caps"),
+   .getxattr_cb = ceph_vxattrcb_caps,
+   .exists_cb = NULL,
+   .flags = VXATTR_FLAG_HIDDEN,
+   },
{ .name = NULL, 0 } /* Required table terminator */
 };
 


[RFC PATCH] ceph: add ceph.caps vxattr

2020-11-23 Thread Luis Henriques
Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

Signed-off-by: Luis Henriques 
---
 fs/ceph/xattr.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 197cb1234341..996512e05513 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -303,6 +303,18 @@ static ssize_t ceph_vxattrcb_snap_btime(struct ceph_inode_info *ci, char *val,
ci->i_snap_btime.tv_nsec);
 }
 
+static ssize_t ceph_vxattrcb_caps(struct ceph_inode_info *ci, char *val,
+   size_t size)
+{
+   int issued;
+
+   spin_lock(&ci->i_ceph_lock);
+   issued = __ceph_caps_issued(ci, NULL);
+   spin_unlock(&ci->i_ceph_lock);
+
+   return ceph_fmt_xattr(val, size, "%s", ceph_cap_string(issued));
+}
+
 #define CEPH_XATTR_NAME(_type, _name)  XATTR_CEPH_PREFIX #_type "." #_name
 #define CEPH_XATTR_NAME2(_type, _name, _name2) \
XATTR_CEPH_PREFIX #_type "." #_name "." #_name2
@@ -378,6 +390,13 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = {
.exists_cb = ceph_vxattrcb_snap_btime_exists,
.flags = VXATTR_FLAG_READONLY,
},
+   {
+   .name = "ceph.caps",
+   .name_size = sizeof("ceph.caps"),
+   .getxattr_cb = ceph_vxattrcb_caps,
+   .exists_cb = NULL,
+   .flags = VXATTR_FLAG_HIDDEN,
+   },
{ .name = NULL, 0 } /* Required table terminator */
 };
 
@@ -403,6 +422,13 @@ static struct ceph_vxattr ceph_file_vxattrs[] = {
.exists_cb = ceph_vxattrcb_snap_btime_exists,
.flags = VXATTR_FLAG_READONLY,
},
+   {
+   .name = "ceph.caps",
+   .name_size = sizeof("ceph.caps"),
+   .getxattr_cb = ceph_vxattrcb_caps,
+   .exists_cb = NULL,
+   .flags = VXATTR_FLAG_HIDDEN,
+   },
{ .name = NULL, 0 } /* Required table terminator */
 };
 


[PATCH] Revert "ceph: allow rename operation under different quota realms"

2020-11-12 Thread Luis Henriques
This reverts commit dffdcd71458e699e839f0bf47c3d42d64210b939.

When doing a rename across quota realms, there's a corner case that isn't
handled correctly.  Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 100
  mv files limit/

The above will succeed because ftruncate(2) won't immediately notify the
MDSs with the new file size, and thus the quota realms stats won't be
updated.

Since the possible fixes for this issue would have a huge performance impact,
the solution for now is to simply revert to returning -EXDEV when doing a cross
quota realms rename.

URL: https://tracker.ceph.com/issues/48203
Signed-off-by: Luis Henriques 
---
 fs/ceph/dir.c   |  9 
 fs/ceph/quota.c | 58 +
 fs/ceph/super.h |  3 +--
 3 files changed, 6 insertions(+), 64 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index a4d48370b2b3..858ee7362ff5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1202,12 +1202,11 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
op = CEPH_MDS_OP_RENAMESNAP;
else
return -EROFS;
-   } else if (old_dir != new_dir) {
-   err = ceph_quota_check_rename(mdsc, d_inode(old_dentry),
- new_dir);
-   if (err)
-   return err;
}
+   /* don't allow cross-quota renames */
+   if ((old_dir != new_dir) &&
+   (!ceph_quota_is_same_realm(old_dir, new_dir)))
+   return -EXDEV;
 
dout("rename dir %p dentry %p to dir %p dentry %p\n",
 old_dir, old_dentry, new_dir, new_dentry);
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index 9b785f11e95a..4e32c9600ecc 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -264,7 +264,7 @@ static struct ceph_snap_realm *get_quota_realm(struct ceph_mds_client *mdsc,
return NULL;
 }
 
-static bool ceph_quota_is_same_realm(struct inode *old, struct inode *new)
+bool ceph_quota_is_same_realm(struct inode *old, struct inode *new)
 {
struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(old->i_sb);
struct ceph_snap_realm *old_realm, *new_realm;
@@ -516,59 +516,3 @@ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, struct kstatfs *buf)
return is_updated;
 }
 
-/*
- * ceph_quota_check_rename - check if a rename can be executed
- * @mdsc:  MDS client instance
- * @old:   inode to be copied
- * @new:   destination inode (directory)
- *
- * This function verifies if a rename (e.g. moving a file or directory) can be
- * executed.  It forces an rstat update in the @new target directory (and in the
- * source @old as well, if it's a directory).  The actual check is done both for
- * max_files and max_bytes.
- *
- * This function returns 0 if it's OK to do the rename, or, if quotas are
- * exceeded, -EXDEV (if @old is a directory) or -EDQUOT.
- */
-int ceph_quota_check_rename(struct ceph_mds_client *mdsc,
-   struct inode *old, struct inode *new)
-{
-   struct ceph_inode_info *ci_old = ceph_inode(old);
-   int ret = 0;
-
-   if (ceph_quota_is_same_realm(old, new))
-   return 0;
-
-   /*
-* Get the latest rstat for target directory (and for source, if a
-* directory)
-*/
-   ret = ceph_do_getattr(new, CEPH_STAT_RSTAT, false);
-   if (ret)
-   return ret;
-
-   if (S_ISDIR(old->i_mode)) {
-   ret = ceph_do_getattr(old, CEPH_STAT_RSTAT, false);
-   if (ret)
-   return ret;
-   ret = check_quota_exceeded(new, QUOTA_CHECK_MAX_BYTES_OP,
-  ci_old->i_rbytes);
-   if (!ret)
-   ret = check_quota_exceeded(new,
-  QUOTA_CHECK_MAX_FILES_OP,
-  ci_old->i_rfiles +
-  ci_old->i_rsubdirs);
-   if (ret)
-   ret = -EXDEV;
-   } else {
-   ret = check_quota_exceeded(new, QUOTA_CHECK_MAX_BYTES_OP,
-  i_size_read(old));
-   if (!ret)
-   ret = check_quota_exceeded(new,
-  QUOTA_CHECK_MAX_FILES_OP, 1);
-   if (ret)
-   ret = -EDQUOT;
-   }
-
-   return ret;
-}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 482473e4cce1..8dbb0babddea 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1222,14 +1222,13 @@ extern void ceph_handle_quota(struct ceph_mds_client *mdsc,
  struct ceph_mds_session *session,
  struct ceph_msg *msg);
 extern bool ceph_q

Re: [RFC PATCH] ceph: fix cross quota realms renames with new truncated files

2020-11-12 Thread Luis Henriques
Jeff Layton  writes:

> On Thu, 2020-11-12 at 10:40 +0000, Luis Henriques wrote:
>> Jeff Layton  writes:
>> 
>> > On Wed, 2020-11-11 at 18:28 +, Luis Henriques wrote:
>> > > Jeff Layton  writes:
>> > > 
>> > > > On Wed, 2020-11-11 at 15:39 +, Luis Henriques wrote:
>> > > > > When doing a rename across quota realms, there's a corner case that 
>> > > > > isn't
>> > > > > handled correctly.  Here's a testcase:
>> > > > > 
>> > > > >   mkdir files limit
>> > > > >   truncate files/file -s 10G
>> > > > >   setfattr limit -n ceph.quota.max_bytes -v 100
>> > > > >   mv files limit/
>> > > > > 
>> > > > > The above will succeed because ftruncate(2) won't result in an 
>> > > > > immediate
>> > > > > notification of the MDSs with the new file size, and thus the quota 
>> > > > > realms
>> > > > > stats won't be updated.
>> > > > > 
>> > > > > This patch forces a sync with the MDS every time there's an 
>> > > > > ATTR_SIZE that
>> > > > > sets a new i_size, even if we have Fx caps.
>> > > > > 
>> > > > > Cc: sta...@vger.kernel.org
>> > > > > Fixes: dffdcd71458e ("ceph: allow rename operation under different 
>> > > > > quota realms")
>> > > > > URL: https://tracker.ceph.com/issues/36593
>> > > > > Signed-off-by: Luis Henriques 
>> > > > > ---
>> > > > >  fs/ceph/inode.c | 11 ++-
>> > > > >  1 file changed, 2 insertions(+), 9 deletions(-)
>> > > > > 
>> > > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> > > > > index 526faf4778ce..30e3f240ac96 100644
>> > > > > --- a/fs/ceph/inode.c
>> > > > > +++ b/fs/ceph/inode.c
>> > > > > @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, 
>> > > > > struct iattr *attr)
>> > > > >  if (ia_valid & ATTR_SIZE) {
>> > > > >  dout("setattr %p size %lld -> %lld\n", inode,
>> > > > >   inode->i_size, attr->ia_size);
>> > > > > -if ((issued & CEPH_CAP_FILE_EXCL) &&
>> > > > > -attr->ia_size > inode->i_size) {
>> > > > > -i_size_write(inode, attr->ia_size);
>> > > > > -inode->i_blocks = 
>> > > > > calc_inode_blocks(attr->ia_size);
>> > > > > -ci->i_reported_size = attr->ia_size;
>> > > > > -dirtied |= CEPH_CAP_FILE_EXCL;
>> > > > > -ia_valid |= ATTR_MTIME;
>> > > > > -} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>> > > > > -   attr->ia_size != inode->i_size) {
>> > > > > +if ((issued & 
>> > > > > (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
>> > > > > +(attr->ia_size != inode->i_size)) {
>> > > > >  req->r_args.setattr.size = 
>> > > > > cpu_to_le64(attr->ia_size);
>> > > > >  req->r_args.setattr.old_size =
>> > > > >  cpu_to_le64(inode->i_size);
>> > > > 
>> > > > Hmm...this makes truncates more expensive when we have caps. I'd rather
>> > > > not do that if we can help it.
>> > > 
>> > > Yeah, as I mentioned in the tracker, there's indeed a performance impact
>> > > with this fix.  That's what made me add the RFC in the subject ;-)
>> > > 
>> > > > What about instead having the client mimic a fsync when there is a
>> > > > rename across quota realms? If we can't tell that reliably then we 
>> > > > could
>> > > > also just do an effective fsync ahead of any cross-directory rename?
>> > > 
>> > > Ok, thanks for the suggestion.  That may actually work, although it will
>> > > make the rename more expensive of course.  I'll test that tomorrow and
>> > > eventually follow-up with a patch.
>> > > 
>> > 
>> > Patrick po

Re: [PATCH] ceph: fix race in concurrent __ceph_remove_cap invocations

2020-11-12 Thread Luis Henriques
Jeff Layton  writes:

> On Thu, 2020-11-12 at 20:43 +0800, Yan, Zheng wrote:
>> On Thu, Nov 12, 2020 at 6:48 PM Luis Henriques  wrote:
>> > 
>> > A NULL pointer dereference may occur in __ceph_remove_cap with some of the
>> > callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
>> > remove_session_caps_cb.  These aren't protected against the concurrent
>> > execution of __ceph_remove_cap.
>> > 
>> 
>> they are protected by session mutex, never get executed concurrently
>> 
>
> Maybe not concurrently with one another, but the s_mutex is _not_ held
> when __ceph_remove_cap is called from ceph_evict_inode. We can't rely
> on it to protect this.

Hmm, yeah.  I guess the changelog could mention that.  Thanks, Jeff.

Cheers,
-- 
Luis

>> > Since the callers of this function hold the i_ceph_lock, the fix is simply
>> > a matter of returning immediately if cap->ci is NULL.
>> > 
>> > Based on a patch from Jeff Layton.
>> > 
>> > Cc: sta...@vger.kernel.org
>> > URL: https://tracker.ceph.com/issues/43272
>> > Link: https://www.spinics.net/lists/ceph-devel/msg47064.html
>> > Signed-off-by: Luis Henriques 
>> > ---
>> >  fs/ceph/caps.c | 11 +++++++++--
>> >  1 file changed, 9 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> > index ded4229c314a..443f164760d5 100644
>> > --- a/fs/ceph/caps.c
>> > +++ b/fs/ceph/caps.c
>> > @@ -1140,12 +1140,19 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool 
>> > queue_release)
>> >  {
>> > struct ceph_mds_session *session = cap->session;
>> > struct ceph_inode_info *ci = cap->ci;
>> > -   struct ceph_mds_client *mdsc =
>> > -   ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
>> > +   struct ceph_mds_client *mdsc;
>> > int removed = 0;
>> > 
>> > +   /* 'ci' being NULL means the remove has already occurred */
>> > +   if (!ci) {
>> > +   dout("%s: cap inode is NULL\n", __func__);
>> > +   return;
>> > +   }
>> > +
>> > dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
>> > 
>> > +   mdsc = ceph_inode_to_client(&ci->vfs_inode)->mdsc;
>> > +
>> > /* remove from inode's cap rbtree, and clear auth cap */
>> > rb_erase(&cap->ci_node, &ci->i_caps);
>> > if (ci->i_auth_cap == cap) {
>
> -- 
> Jeff Layton 
>


[PATCH] ceph: downgrade warning from mdsmap decode to debug

2020-11-12 Thread Luis Henriques
While the MDS cluster is unstable and changing state the client may get
mdsmap updates that will trigger warnings:

  [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay)

This patch downgrades these warnings to debug, as they may flood the logs
if the cluster is unstable for a while.

Signed-off-by: Luis Henriques 
---
Hi!

This patch follows from my other patch "ceph: fix race in concurrent
__ceph_remove_cap invocations", where I see a *lot* of warnings before the
NULL pointer.  Maybe this could be a pr_warn_once instead, but not sure
that would be useful.

Note that before commit 4d7ace02ba5c ("ceph: fix mdsmap cluster available
check based on laggy number") this was simply ignored without any pr_warn
or dout.

Cheers,
--
Luis

 fs/ceph/mdsmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c
index e4aba6c6d3b5..1096d1d3a84c 100644
--- a/fs/ceph/mdsmap.c
+++ b/fs/ceph/mdsmap.c
@@ -243,8 +243,8 @@ struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end)
}
 
if (state <= 0) {
-   pr_warn("mdsmap_decode got incorrect state(%s)\n",
-   ceph_mds_state_name(state));
+   dout("mdsmap_decode got incorrect state(%s)\n",
+ceph_mds_state_name(state));
continue;
}
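
The cover note above wonders whether pr_warn_once would be a better fit than
dout.  As a rough user-space illustration of what "warn once" means (this is
not the kernel macro; toy_warn_once, toy_warnings_printed and
toy_count_valid_states are hypothetical names made up for this sketch), the
idea can be modelled like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counter so the behaviour can be observed from outside. */
static int toy_warnings_printed;

/*
 * User-space model of a "warn once" helper: each call site owns a static
 * flag, so the message is printed the first time only.  Not thread-safe.
 */
#define toy_warn_once(...)					\
	do {							\
		static bool warned;				\
		if (!warned) {					\
			warned = true;				\
			toy_warnings_printed++;			\
			fprintf(stderr, __VA_ARGS__);		\
		}						\
	} while (0)

/*
 * Loosely mirrors the loop in the patch: entries with state <= 0 are
 * reported and skipped, everything else is counted as valid.
 */
static size_t toy_count_valid_states(const int *states, size_t n)
{
	size_t valid = 0;

	assert(states != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (states[i] <= 0) {
			toy_warn_once("toy decode: got incorrect state(%d)\n",
				      states[i]);
			continue;
		}
		valid++;
	}
	return valid;
}
```

With a helper like this the log gets a single line however long the cluster
stays unstable, which is also why it may be of limited use: later states are
never reported.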
 


[PATCH] ceph: fix race in concurrent __ceph_remove_cap invocations

2020-11-12 Thread Luis Henriques
A NULL pointer dereference may occur in __ceph_remove_cap with some of the
callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
remove_session_caps_cb.  These aren't protected against the concurrent
execution of __ceph_remove_cap.

Since the callers of this function hold the i_ceph_lock, the fix is simply
a matter of returning immediately if cap->ci is NULL.

Based on a patch from Jeff Layton.

Cc: sta...@vger.kernel.org
URL: https://tracker.ceph.com/issues/43272
Link: https://www.spinics.net/lists/ceph-devel/msg47064.html
Signed-off-by: Luis Henriques 
---
 fs/ceph/caps.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index ded4229c314a..443f164760d5 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1140,12 +1140,19 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool 
queue_release)
 {
struct ceph_mds_session *session = cap->session;
struct ceph_inode_info *ci = cap->ci;
-   struct ceph_mds_client *mdsc =
-   ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
+   struct ceph_mds_client *mdsc;
int removed = 0;
 
+   /* 'ci' being NULL means the remove has already occurred */
+   if (!ci) {
+   dout("%s: cap inode is NULL\n", __func__);
+   return;
+   }
+
dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
 
+   mdsc = ceph_inode_to_client(&ci->vfs_inode)->mdsc;
+
/* remove from inode's cap rbtree, and clear auth cap */
rb_erase(&cap->ci_node, &ci->i_caps);
if (ci->i_auth_cap == cap) {
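
The fix relies on a simple pattern: the remove path clears the back-pointer
while the lock is held, and any later caller that finds it NULL returns early.
A minimal user-space sketch of that pattern follows.  It assumes a pthread
mutex standing in for i_ceph_lock; toy_inode, toy_cap and the toy_remove_*
helpers are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for ceph_inode_info. */
struct toy_inode {
	pthread_mutex_t lock;	/* plays the role of i_ceph_lock */
	int ncaps;		/* number of caps still attached */
};

/* Hypothetical stand-in for ceph_cap. */
struct toy_cap {
	struct toy_inode *ci;	/* set to NULL once the cap is removed */
};

/*
 * Caller must hold the inode lock.  Returns 1 if this call removed the
 * cap, 0 if an earlier call had already done so.
 */
static int toy_remove_cap_locked(struct toy_cap *cap)
{
	struct toy_inode *ci;

	assert(cap != NULL);
	ci = cap->ci;
	if (!ci)		/* already removed: nothing to dereference */
		return 0;

	ci->ncaps--;
	cap->ci = NULL;		/* mark as removed while still locked */
	return 1;
}

/* Convenience wrapper that takes the lock around the removal. */
static int toy_remove_cap(struct toy_inode *inode, struct toy_cap *cap)
{
	int removed;

	pthread_mutex_lock(&inode->lock);
	removed = toy_remove_cap_locked(cap);
	pthread_mutex_unlock(&inode->lock);
	return removed;
}
```

The second caller sees a NULL back-pointer and backs off instead of
dereferencing it, which is the behaviour the early return in the patch is
meant to provide.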


Re: [RFC PATCH] ceph: fix cross quota realms renames with new truncated files

2020-11-12 Thread Luis Henriques
Jeff Layton  writes:

> On Wed, 2020-11-11 at 18:28 +0000, Luis Henriques wrote:
>> Jeff Layton  writes:
>> 
>> > On Wed, 2020-11-11 at 15:39 +0000, Luis Henriques wrote:
>> > > When doing a rename across quota realms, there's a corner case that isn't
>> > > handled correctly.  Here's a testcase:
>> > > 
>> > >   mkdir files limit
>> > >   truncate files/file -s 10G
>> > >   setfattr limit -n ceph.quota.max_bytes -v 100
>> > >   mv files limit/
>> > > 
>> > > The above will succeed because ftruncate(2) won't result in an immediate
>> > > notification of the MDSs with the new file size, and thus the quota 
>> > > realms
>> > > stats won't be updated.
>> > > 
>> > > This patch forces a sync with the MDS every time there's an ATTR_SIZE 
>> > > that
>> > > sets a new i_size, even if we have Fx caps.
>> > > 
>> > > Cc: sta...@vger.kernel.org
>> > > Fixes: dffdcd71458e ("ceph: allow rename operation under different quota 
>> > > realms")
>> > > URL: https://tracker.ceph.com/issues/36593
>> > > Signed-off-by: Luis Henriques 
>> > > ---
>> > >  fs/ceph/inode.c | 11 ++---------
>> > >  1 file changed, 2 insertions(+), 9 deletions(-)
>> > > 
>> > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> > > index 526faf4778ce..30e3f240ac96 100644
>> > > --- a/fs/ceph/inode.c
>> > > +++ b/fs/ceph/inode.c
>> > > @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct 
>> > > iattr *attr)
>> > >  if (ia_valid & ATTR_SIZE) {
>> > >  dout("setattr %p size %lld -> %lld\n", inode,
>> > >   inode->i_size, attr->ia_size);
>> > > -if ((issued & CEPH_CAP_FILE_EXCL) &&
>> > > -attr->ia_size > inode->i_size) {
>> > > -i_size_write(inode, attr->ia_size);
>> > > -inode->i_blocks = 
>> > > calc_inode_blocks(attr->ia_size);
>> > > -ci->i_reported_size = attr->ia_size;
>> > > -dirtied |= CEPH_CAP_FILE_EXCL;
>> > > -ia_valid |= ATTR_MTIME;
>> > > -} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>> > > -   attr->ia_size != inode->i_size) {
>> > > +if ((issued & 
>> > > (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
>> > > +(attr->ia_size != inode->i_size)) {
>> > >  req->r_args.setattr.size = 
>> > > cpu_to_le64(attr->ia_size);
>> > >  req->r_args.setattr.old_size =
>> > >  cpu_to_le64(inode->i_size);
>> > 
>> > Hmm...this makes truncates more expensive when we have caps. I'd rather
>> > not do that if we can help it.
>> 
>> Yeah, as I mentioned in the tracker, there's indeed a performance impact
>> with this fix.  That's what made me add the RFC in the subject ;-)
>> 
>> > What about instead having the client mimic a fsync when there is a
>> > rename across quota realms? If we can't tell that reliably then we could
>> > also just do an effective fsync ahead of any cross-directory rename?
>> 
>> Ok, thanks for the suggestion.  That may actually work, although it will
>> make the rename more expensive of course.  I'll test that tomorrow and
>> eventually follow-up with a patch.
>> 
>
> Patrick pointed out to me on IRC that since you're moving the parent
> directory of the truncated file, flushing the caps on the directory
> won't really help. You'd need to walk the entire subtree and try to
> flush every dirty inode, or basically do a syncfs() prior to renaming
> the directory across quotarealms.
>
> I think we probably will need to revert the change to allow cross-
> quotarealm renames of directories and make those return EXDEV again.
> Anything else sounds like it's probably going to be too expensive.

Hmm... that sounds a bit drastic and it would make the kernel client
behave differently from the fuse client -- from what I could understand
the fuse client does the sync ATTR_SIZE and thus doesn't have this issue.

Obviously, I agree with you that the performance penalty is too high for
such a common operation.  But maybe renames across quotarealms aren't that
common and paying the penalty of doing a full ceph_flush_dirty_caps() is
acceptable for such cases?

Cheers,
-- 
Luis


Re: [RFC PATCH] ceph: fix cross quota realms renames with new truncated files

2020-11-11 Thread Luis Henriques
Jeff Layton  writes:

> On Wed, 2020-11-11 at 15:39 +0000, Luis Henriques wrote:
>> When doing a rename across quota realms, there's a corner case that isn't
>> handled correctly.  Here's a testcase:
>> 
>>   mkdir files limit
>>   truncate files/file -s 10G
>>   setfattr limit -n ceph.quota.max_bytes -v 100
>>   mv files limit/
>> 
>> The above will succeed because ftruncate(2) won't result in an immediate
>> notification of the MDSs with the new file size, and thus the quota realms
>> stats won't be updated.
>> 
>> This patch forces a sync with the MDS every time there's an ATTR_SIZE that
>> sets a new i_size, even if we have Fx caps.
>> 
>> Cc: sta...@vger.kernel.org
>> Fixes: dffdcd71458e ("ceph: allow rename operation under different quota 
>> realms")
>> URL: https://tracker.ceph.com/issues/36593
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/inode.c | 11 ++---------
>>  1 file changed, 2 insertions(+), 9 deletions(-)
>> 
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 526faf4778ce..30e3f240ac96 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr 
>> *attr)
>>  if (ia_valid & ATTR_SIZE) {
>>  dout("setattr %p size %lld -> %lld\n", inode,
>>   inode->i_size, attr->ia_size);
>> -if ((issued & CEPH_CAP_FILE_EXCL) &&
>> -attr->ia_size > inode->i_size) {
>> -i_size_write(inode, attr->ia_size);
>> -inode->i_blocks = calc_inode_blocks(attr->ia_size);
>> -ci->i_reported_size = attr->ia_size;
>> -dirtied |= CEPH_CAP_FILE_EXCL;
>> -ia_valid |= ATTR_MTIME;
>> -} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>> -   attr->ia_size != inode->i_size) {
>> +if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
>> +(attr->ia_size != inode->i_size)) {
>>  req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>  req->r_args.setattr.old_size =
>>  cpu_to_le64(inode->i_size);
>
> Hmm...this makes truncates more expensive when we have caps. I'd rather
> not do that if we can help it.

Yeah, as I mentioned in the tracker, there's indeed a performance impact
with this fix.  That's what made me add the RFC in the subject ;-)

> What about instead having the client mimic a fsync when there is a
> rename across quota realms? If we can't tell that reliably then we could
> also just do an effective fsync ahead of any cross-directory rename?

Ok, thanks for the suggestion.  That may actually work, although it will
make the rename more expensive of course.  I'll test that tomorrow and
eventually follow-up with a patch.

Cheers,
-- 
Luis


[RFC PATCH] ceph: fix cross quota realms renames with new truncated files

2020-11-11 Thread Luis Henriques
When doing a rename across quota realms, there's a corner case that isn't
handled correctly.  Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 100
  mv files limit/

The above will succeed because ftruncate(2) won't result in an immediate
notification of the MDSs with the new file size, and thus the quota realms
stats won't be updated.

This patch forces a sync with the MDS every time there's an ATTR_SIZE that
sets a new i_size, even if we have Fx caps.

Cc: sta...@vger.kernel.org
Fixes: dffdcd71458e ("ceph: allow rename operation under different quota 
realms")
URL: https://tracker.ceph.com/issues/36593
Signed-off-by: Luis Henriques 
---
 fs/ceph/inode.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 526faf4778ce..30e3f240ac96 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr 
*attr)
if (ia_valid & ATTR_SIZE) {
dout("setattr %p size %lld -> %lld\n", inode,
 inode->i_size, attr->ia_size);
-   if ((issued & CEPH_CAP_FILE_EXCL) &&
-   attr->ia_size > inode->i_size) {
-   i_size_write(inode, attr->ia_size);
-   inode->i_blocks = calc_inode_blocks(attr->ia_size);
-   ci->i_reported_size = attr->ia_size;
-   dirtied |= CEPH_CAP_FILE_EXCL;
-   ia_valid |= ATTR_MTIME;
-   } else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
-  attr->ia_size != inode->i_size) {
+   if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
+   (attr->ia_size != inode->i_size)) {
req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
req->r_args.setattr.old_size =
cpu_to_le64(inode->i_size);
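
To make the behavioural change in the hunk above easier to see, the two
conditions can be written as pure functions.  This is only a sketch: the
TOY_CAP_* bit values are hypothetical (the real CEPH_CAP_FILE_* constants live
in the kernel headers), and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only so the two flags are distinct. */
#define TOY_CAP_FILE_SHARED	0x1u
#define TOY_CAP_FILE_EXCL	0x2u

static_assert(TOY_CAP_FILE_SHARED != TOY_CAP_FILE_EXCL,
	      "the two toy flags must be distinct bits");

/*
 * Before the patch: with Fx caps a size increase is applied locally
 * (i_size updated, cap marked dirty), so no setattr request is sent.
 */
static bool toy_size_goes_to_mds_old(unsigned int issued,
				     int64_t old_size, int64_t new_size)
{
	if ((issued & TOY_CAP_FILE_EXCL) && new_size > old_size)
		return false;
	return (issued & TOY_CAP_FILE_SHARED) == 0 || new_size != old_size;
}

/* After the patch: the local shortcut is gone. */
static bool toy_size_goes_to_mds_new(unsigned int issued,
				     int64_t old_size, int64_t new_size)
{
	return (issued & (TOY_CAP_FILE_EXCL | TOY_CAP_FILE_SHARED)) ||
	       new_size != old_size;
}
```

For the testcase in the changelog (a new file truncated from 0 to 10G while
the client holds Fx), the old predicate is false and the new one is true,
which is where the extra cost of the fix comes from.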


Re: [PATCH] ceph: remove unnecessary return in switch statement

2020-08-14 Thread Luis Henriques
David Laight  writes:

> From: Luis Henriques
>> Sent: 14 August 2020 10:38
>> 
>> Since there's a return immediately after the 'break', there's no need for
>> this extra 'return' in the S_IFDIR case.
>> 
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/file.c | 2 --
>>  1 file changed, 2 deletions(-)
>> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index d51c3f2fdca0..04ab99c0223a 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -256,8 +256,6 @@ static int ceph_init_file(struct inode *inode, struct 
>> file *file, int fmode)
>>  case S_IFDIR:
>>  ret = ceph_init_file_info(inode, file, fmode,
>>  S_ISDIR(inode->i_mode));
>> -if (ret)
>> -return ret;
>>  break;
>> 
>>  case S_IFLNK:
>
> I'd move the other way and just do:
>   return ceph_init_file_info(...);

Sure, that would work too, although my preference would be to have a
single function exit point.  But I'll leave that decision to Jeff :-)

Cheers,
-- 
Luis

>
>   David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
>


[PATCH] ceph: remove unnecessary return in switch statement

2020-08-14 Thread Luis Henriques
Since there's a return immediately after the 'break', there's no need for
this extra 'return' in the S_IFDIR case.

Signed-off-by: Luis Henriques 
---
 fs/ceph/file.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d51c3f2fdca0..04ab99c0223a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -256,8 +256,6 @@ static int ceph_init_file(struct inode *inode, struct file 
*file, int fmode)
case S_IFDIR:
ret = ceph_init_file_info(inode, file, fmode,
S_ISDIR(inode->i_mode));
-   if (ret)
-   return ret;
break;
 
case S_IFLNK:



drm: list_add corruption followed by BUG (stack guard page was hit)

2020-07-31 Thread Luis Henriques
Hi!

I've just got the following WARNING followed by a BUG on rc7.  Maybe it's
already a known issue, but here it is anyway.

Cheers,
-- 
Luis

[   38.001304] ------------[ cut here ]------------
[   38.001312] list_add corruption. prev->next should be next 
(8fe713397b88), but was . (prev=8fe715fb9140).
[   38.001337] WARNING: CPU: 3 PID: 501 at lib/list_debug.c:26 
__list_add_valid+0x4d/0x70
[   38.001340] Modules linked in: cdc_ether(E) usbnet(E) r8152(E) mii(E) 
hid_generic(E) usbhid(E) snd_hda_codec_hdmi(E) iwlmvm(E) dell_rbtn(E) 
mac80211(E) libarc4(E) snd_hda_codec_realtek(E) x86_pkg_temp_thermal(E) 
intel_powerclamp(E) snd_hda_codec_generic(E) coretemp(E) mei_wdt(E) 
dell_laptop(E) kvm_intel(E) ledtrig_audio(E) intel_rapl_msr(E) 
dell_smm_hwmon(E) snd_hda_intel(E) kvm(E) uvcvideo(E) snd_intel_dspcfg(E) 
videobuf2_vmalloc(E) irqbypass(E) iwlwifi(E) videobuf2_memops(E) rapl(E) 
snd_hda_codec(E) videobuf2_v4l2(E) intel_cstate(E) dell_wmi(E) joydev(E) 
snd_hwdep(E) pcspkr(E) serio_raw(E) intel_uncore(E) dell_smbios(E) 
videobuf2_common(E) dcdbas(E) snd_hda_core(E) iTCO_wdt(E) snd_pcm(E) 
videodev(E) wmi_bmof(E) snd_timer(E) dell_wmi_descriptor(E) 
intel_wmi_thunderbolt(E) iTCO_vendor_support(E) mei_me(E) snd(E) cfg80211(E) 
soundcore(E) tpm_crb(E) mc(E) rfkill(E) mei(E) intel_pch_thermal(E) sg(E) 
processor_thermal_device(E) intel_rapl_common(E) intel_soc_dts_iosf(E) 
battery(E) tpm_tis(E)
[   38.001397]  int3403_thermal(E) tpm_tis_core(E) tpm(E) dell_smo8800(E) 
intel_hid(E) evdev(E) rng_core(E) int3400_thermal(E) int3402_thermal(E) 
acpi_thermal_rel(E) int340x_thermal_zone(E) sparse_keymap(E) acpi_pad(E) ac(E) 
nft_counter(E) nft_ct(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) 
i2c_dev(E) nf_tables(E) parport_pc(E) ppdev(E) nfnetlink(E) lp(E) parport(E) 
ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) 
btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) 
dm_crypt(E) cbc(E) encrypted_keys(E) dm_mod(E) sd_mod(E) t10_pi(E) 
rtsx_pci_sdmmc(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) 
ghash_clmulni_intel(E) mmc_core(E) aesni_intel(E) crypto_simd(E) cryptd(E) 
glue_helper(E) ahci(E) nouveau(E) i915(E) mxm_wmi(E) i2c_i801(E) i2c_smbus(E) 
libahci(E) ttm(E) i2c_algo_bit(E) rtsx_pci(E) xhci_pci(E) drm_kms_helper(E) 
intel_lpss_pci(E) libata(E) syscopyarea(E) intel_lpss(E) xhci_hcd(E) idma64(E) 
sysfillrect(E) virt_dma(E)
[   38.001461]  sysimgblt(E) scsi_mod(E) fb_sys_fops(E) mfd_core(E) usbcore(E) 
usb_common(E) drm(E) fan(E) thermal(E) i2c_hid(E) hid(E) wmi(E) video(E) 
button(E)
[   38.001482] CPU: 3 PID: 501 Comm: kworker/3:4 Tainted: GE 
5.8.0-rc7 #43
[   38.001485] Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.14.2 
05/25/2020
[   38.001513] Workqueue: events_long drm_dp_mst_link_probe_work 
[drm_kms_helper]
[   38.001521] RIP: 0010:__list_add_valid+0x4d/0x70
[   38.001527] Code: c3 4c 89 c1 48 c7 c7 98 34 ed af e8 7f 16 c9 ff 0f 0b 31 
c0 c3 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 e8 34 ed af e8 65 16 c9 ff <0f> 0b 31 
c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 38 35 ed af e8
[   38.001530] RSP: 0018:a47680417ca0 EFLAGS: 00010286
[   38.001535] RAX:  RBX: 8fe713397b88 RCX: 0027
[   38.001538] RDX: 0027 RSI: 0096 RDI: 8fe71e197b68
[   38.001541] RBP: 8fe7133977e8 R08: 8fe71e197b60 R09: 0084
[   38.001544] R10: a47680417b48 R11: a47680417b4d R12: 8fe71869f800
[   38.001547] R13: 8fe715fb9140 R14: 8fe71869f940 R15: 8fe713397b68
[   38.001551] FS:  () GS:8fe71e18() 
knlGS:
[   38.001555] CS:  0010 DS:  ES:  CR0: 80050033
[   38.001558] CR2: 7f9e3f7159d0 CR3: 00040e60a004 CR4: 003606e0
[   38.001561] DR0:  DR1:  DR2: 
[   38.001563] DR3:  DR6: fffe0ff0 DR7: 0400
[   38.001566] Call Trace:
[   38.001593]  drm_dp_queue_down_tx+0x5c/0x110 [drm_kms_helper]
[   38.001604]  ? i2c_register_adapter+0x1d0/0x390
[   38.001627]  drm_dp_send_enum_path_resources+0x54/0x120 [drm_kms_helper]
[   38.001650]  drm_dp_send_link_address+0x682/0x990 [drm_kms_helper]
[   38.001657]  ? prepare_to_wait_event+0x7e/0x150
[   38.001661]  ? finish_wait+0x3f/0x80
[   38.001684]  drm_dp_check_and_send_link_address+0xad/0xd0 [drm_kms_helper]
[   38.001707]  drm_dp_mst_link_probe_work+0x94/0x180 [drm_kms_helper]
[   38.001714]  process_one_work+0x1ae/0x370
[   38.001720]  worker_thread+0x50/0x3a0
[   38.001725]  ? process_one_work+0x370/0x370
[   38.001729]  kthread+0x11b/0x140
[   38.001734]  ? kthread_associate_blkcg+0x90/0x90
[   38.001741]  ret_from_fork+0x22/0x30
[   38.001747] ---[ end trace ca03f107384f1adc ]---
[   38.001759] BUG: stack guard page was hit at 62c9d455 (stack is 
e3f6f298..86ee600f)
[   38.001766] kernel stack 

Re: Warning triggered in drm_dp_delayed_destroy_work workqueue

2020-06-26 Thread Luis Henriques
On Fri, Jun 26, 2020 at 05:06:00PM +0300, Ville Syrjälä wrote:
> On Fri, Jun 26, 2020 at 03:40:20PM +0200, Daniel Vetter wrote:
> > Adding Lyude, she's been revamping all the lifetime refcounting in the
> > dp code last few kernel releases. At a glance I don't even have an
> > idea what's going wrong here ...
> 
> Already fixed by Imre I believe.
> 
> 7d11507605a7 ("drm/dp_mst: Fix the DDC I2C device unregistration of an MST 
> port")
> 

Ah!  It does seem to be the same issue indeed!  Thanks a lot for pointing
me at this commit.  Hopefully this fix can be included in 5.8.  Not that
I'm seeing this WARNING frequently, but frequent enough to annoy me :-)

Cheers,
--
Luis

> > -Daniel
> > 
> > On Thu, Jun 25, 2020 at 12:22 PM Luis Henriques  wrote:
> > >
> > > Hi!
> > >
> > > I've been seeing this warning occasionally, not sure if it has been
> > > reported yet.  It's not a regression as I remember seeing it in, at least,
> > > 5.7.
> > >
> > > Anyway, here it is:
> > >
> > > ------------[ cut here ]------------
> > > sysfs group 'power' not found for kobject 'i2c-7'
> > > WARNING: CPU: 1 PID: 17996 at fs/sysfs/group.c:279 
> > > sysfs_remove_group+0x74/0x80
> > > Modules linked in: ccm(E) dell_rbtn(E) iwlmvm(E) mei_wdt(E) mac80211(E) 
> > > libarc4(E) uvcvideo(E) dell_laptop(E) videobuf2_vmalloc(E) intel_rapl_>
> > >  soundcore(E) intel_soc_dts_iosf(E) rng_core(E) battery(E) acpi_pad(E) 
> > > sparse_keymap(E) acpi_thermal_rel(E) intel_pch_thermal(E) int3402_therm>
> > >  sysfillrect(E) intel_lpss(E) sysimgblt(E) fb_sys_fops(E) idma64(E) 
> > > scsi_mod(E) virt_dma(E) mfd_core(E) drm(E) fan(E) thermal(E) i2c_hid(E) 
> > > hi>
> > > CPU: 1 PID: 17996 Comm: kworker/1:1 Tainted: GE 
> > > 5.8.0-rc2+ #36
> > > Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.14.2 05/25/2020
> > > Workqueue: events drm_dp_delayed_destroy_work [drm_kms_helper]
> > > RIP: 0010:sysfs_remove_group+0x74/0x80
> > > Code: ff 5b 48 89 ef 5d 41 5c e9 79 bc ff ff 48 89 ef e8 01 b8 ff ff eb 
> > > cc 49 8b 14 24 48 8b 33 48 c7 c7 90 ac 8b 93 e8 de b1 d4 ff <0f> 0b 5b>
> > > RSP: :b12d40c13c38 EFLAGS: 00010282
> > > RAX:  RBX: 936e6a60 RCX: 0027
> > > RDX: 0027 RSI: 0086 RDI: 8e37de097b68
> > > RBP:  R08: 8e37de097b60 R09: 93fb4624
> > > R10: 0904 R11: 0001002c R12: 8e37d3081c18
> > > R13: 8e375f1450a8 R14:  R15: 8e375f145410
> > > FS:  () GS:8e37de08() 
> > > knlGS:
> > > CS:  0010 DS:  ES:  CR0: 80050033
> > > CR2:  CR3: 0004ab20a001 CR4: 003606e0
> > > DR0:  DR1:  DR2: 
> > > DR3:  DR6: fffe0ff0 DR7: 0400
> > > Call Trace:
> > >  device_del+0x97/0x3f0
> > >  cdev_device_del+0x15/0x30
> > >  put_i2c_dev+0x7b/0x90 [i2c_dev]
> > >  i2cdev_detach_adapter+0x33/0x60 [i2c_dev]
> > >  notifier_call_chain+0x47/0x70
> > >  blocking_notifier_call_chain+0x3d/0x60
> > >  device_del+0x8f/0x3f0
> > >  device_unregister+0x16/0x60
> > >  i2c_del_adapter+0x247/0x300
> > >  drm_dp_port_set_pdt+0x90/0x2c0 [drm_kms_helper]
> > >  drm_dp_delayed_destroy_work+0x2be/0x340 [drm_kms_helper]
> > >  process_one_work+0x1ae/0x370
> > >  worker_thread+0x50/0x3a0
> > >  ? process_one_work+0x370/0x370
> > >  kthread+0x11b/0x140
> > >  ? kthread_associate_blkcg+0x90/0x90
> > >  ret_from_fork+0x22/0x30
> > > ---[ end trace 16486ad3c2627482 ]---
> > > ------------[ cut here ]------------
> > >
> > > Cheers,
> > > --
> > > Luis
> > 
> > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> > ___
> > dri-devel mailing list
> > dri-de...@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> -- 
> Ville Syrjälä
> Intel



Warning triggered in drm_dp_delayed_destroy_work workqueue

2020-06-25 Thread Luis Henriques
Hi!

I've been seeing this warning occasionally, not sure if it has been
reported yet.  It's not a regression as I remember seeing it in, at least,
5.7.

Anyway, here it is:

------------[ cut here ]------------
sysfs group 'power' not found for kobject 'i2c-7'
WARNING: CPU: 1 PID: 17996 at fs/sysfs/group.c:279 sysfs_remove_group+0x74/0x80
Modules linked in: ccm(E) dell_rbtn(E) iwlmvm(E) mei_wdt(E) mac80211(E) 
libarc4(E) uvcvideo(E) dell_laptop(E) videobuf2_vmalloc(E) intel_rapl_>
 soundcore(E) intel_soc_dts_iosf(E) rng_core(E) battery(E) acpi_pad(E) 
sparse_keymap(E) acpi_thermal_rel(E) intel_pch_thermal(E) int3402_therm>
 sysfillrect(E) intel_lpss(E) sysimgblt(E) fb_sys_fops(E) idma64(E) scsi_mod(E) 
virt_dma(E) mfd_core(E) drm(E) fan(E) thermal(E) i2c_hid(E) hi>
CPU: 1 PID: 17996 Comm: kworker/1:1 Tainted: GE 5.8.0-rc2+ #36
Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.14.2 05/25/2020
Workqueue: events drm_dp_delayed_destroy_work [drm_kms_helper]
RIP: 0010:sysfs_remove_group+0x74/0x80
Code: ff 5b 48 89 ef 5d 41 5c e9 79 bc ff ff 48 89 ef e8 01 b8 ff ff eb cc 49 
8b 14 24 48 8b 33 48 c7 c7 90 ac 8b 93 e8 de b1 d4 ff <0f> 0b 5b>
RSP: :b12d40c13c38 EFLAGS: 00010282
RAX:  RBX: 936e6a60 RCX: 0027
RDX: 0027 RSI: 0086 RDI: 8e37de097b68
RBP:  R08: 8e37de097b60 R09: 93fb4624
R10: 0904 R11: 0001002c R12: 8e37d3081c18
R13: 8e375f1450a8 R14:  R15: 8e375f145410
FS:  () GS:8e37de08() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 0004ab20a001 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 device_del+0x97/0x3f0
 cdev_device_del+0x15/0x30
 put_i2c_dev+0x7b/0x90 [i2c_dev]
 i2cdev_detach_adapter+0x33/0x60 [i2c_dev]
 notifier_call_chain+0x47/0x70
 blocking_notifier_call_chain+0x3d/0x60
 device_del+0x8f/0x3f0
 device_unregister+0x16/0x60
 i2c_del_adapter+0x247/0x300
 drm_dp_port_set_pdt+0x90/0x2c0 [drm_kms_helper]
 drm_dp_delayed_destroy_work+0x2be/0x340 [drm_kms_helper]
 process_one_work+0x1ae/0x370
 worker_thread+0x50/0x3a0
 ? process_one_work+0x370/0x370
 kthread+0x11b/0x140
 ? kthread_associate_blkcg+0x90/0x90
 ret_from_fork+0x22/0x30
---[ end trace 16486ad3c2627482 ]---
------------[ cut here ]------------

Cheers,
--
Luis


[PATCH v2] ceph: don't return -ESTALE if there's still an open file

2020-05-18 Thread Luis Henriques
Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at.  The issue has been detected by
xfstest generic/467, when doing:

 - name_to_handle_at("/cephfs/myfile")
 - open("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - sync; sync;
 - drop caches
 - open_by_handle_at()

The call to open_by_handle_at should not fail because the file hasn't been
deleted yet (only unlinked) and we do have a valid handle to it.  -ESTALE
shall be returned only if i_nlink is 0 *and* i_count is 1.

This patch also makes sure we have LINK caps before checking i_nlink.

Signed-off-by: Luis Henriques 
---
Hi!

(and sorry for the delay in my reply!)

So, from the discussion thread and some IRC chat with Jeff, I'm sending
v2.  What changed?  Everything! :-)

- Use i_count instead of __ceph_is_file_opened to check for open files
- Add call to ceph_do_getattr to make sure we have LINK caps before
  accessing i_nlink

Cheers,
--
Luis

 fs/ceph/export.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index 79dc06881e78..e088843a7734 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -172,9 +172,16 @@ struct inode *ceph_lookup_inode(struct super_block *sb, 
u64 ino)
 static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
 {
struct inode *inode = __lookup_inode(sb, ino);
+   int err;
+
if (IS_ERR(inode))
return ERR_CAST(inode);
-   if (inode->i_nlink == 0) {
+   /* We need LINK caps to reliably check i_nlink */
+   err = ceph_do_getattr(inode, CEPH_CAP_LINK_SHARED, false);
+   if (err)
+   return ERR_PTR(err);
+   /* -ESTALE if inode has been unlinked and no file is open */
+   if ((inode->i_nlink == 0) && (atomic_read(&inode->i_count) == 1)) {
iput(inode);
return ERR_PTR(-ESTALE);
}
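
The rule stated in the changelog (-ESTALE only if i_nlink is 0 *and* i_count
is 1) can be captured in a few lines of user-space C.  This is a sketch under
stated assumptions: toy_inode and toy_fh_check are hypothetical names, and
refcount models i_count after the lookup has taken its own reference.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the two inode fields the patch looks at. */
struct toy_inode {
	unsigned int nlink;	/* models i_nlink */
	int refcount;		/* models i_count, including our own ref */
};

/*
 * Returns -ESTALE only when the inode has been unlinked and nobody else
 * holds a reference to it (i.e. no file is still open); 0 otherwise.
 */
static int toy_fh_check(const struct toy_inode *inode)
{
	assert(inode != NULL && inode->refcount >= 1);

	if (inode->nlink == 0 && inode->refcount == 1)
		return -ESTALE;
	return 0;
}
```

An unlinked inode that is still open has a reference count above 1 and is
therefore still reachable through its handle, matching the generic/467
sequence; without the extra open(2) the check returns -ESTALE, as in commit
03f219041fdb.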



Re: [PATCH] ceph: don't return -ESTALE if there's still an open file

2020-05-15 Thread Luis Henriques
On Fri, May 15, 2020 at 09:42:24AM +0300, Amir Goldstein wrote:
> +CC: fstests
> 
> On Thu, May 14, 2020 at 4:15 PM Jeff Layton  wrote:
> >
> > On Thu, 2020-05-14 at 13:48 +0100, Luis Henriques wrote:
> > > On Thu, May 14, 2020 at 08:10:09AM -0400, Jeff Layton wrote:
> > > > On Thu, 2020-05-14 at 12:14 +0100, Luis Henriques wrote:
> > > > > Similarly to commit 03f219041fdb ("ceph: check i_nlink while 
> > > > > converting
> > > > > a file handle to dentry"), this fixes another corner case with
> > > > > name_to_handle_at/open_by_handle_at.  The issue has been detected by
> > > > > xfstest generic/467, when doing:
> > > > >
> > > > >  - name_to_handle_at("/cephfs/myfile")
> > > > >  - open("/cephfs/myfile")
> > > > >  - unlink("/cephfs/myfile")
> > > > >  - open_by_handle_at()
> > > > >
> > > > > The call to open_by_handle_at should not fail because the file still
> > > > > exists and we do have a valid handle to it.
> > > > >
> > > > > Signed-off-by: Luis Henriques 
> > > > > ---
> > > > >  fs/ceph/export.c | 13 +++++++++++--
> > > > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> > > > > index 79dc06881e78..8556df9d94d0 100644
> > > > > --- a/fs/ceph/export.c
> > > > > +++ b/fs/ceph/export.c
> > > > > @@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct 
> > > > > super_block *sb, u64 ino)
> > > > >
> > > > >  static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
> > > > >  {
> > > > > + struct ceph_inode_info *ci;
> > > > >   struct inode *inode = __lookup_inode(sb, ino);
> > > > > +
> > > > >   if (IS_ERR(inode))
> > > > >   return ERR_CAST(inode);
> > > > >   if (inode->i_nlink == 0) {
> > > > > - iput(inode);
> > > > > - return ERR_PTR(-ESTALE);
> > > > > + bool is_open;
> > > > > + ci = ceph_inode(inode);
> > > > > + spin_lock(&ci->i_ceph_lock);
> > > > > + is_open = __ceph_is_file_opened(ci);
> > > > > + spin_unlock(&ci->i_ceph_lock);
> > > > > + if (!is_open) {
> > > > > + iput(inode);
> > > > > + return ERR_PTR(-ESTALE);
> > > > > + }
> > > > >   }
> > > > >   return d_obtain_alias(inode);
> > > > >  }
> > > >
> > > > Thanks Luis. Out of curiosity, is there any reason we shouldn't ignore
> > > > the i_nlink value here? Does anything obviously break if we do?
> > >
> > > Yes, the scenario described in commit 03f219041fdb is still valid, which
> > > is basically the same but without the extra open(2):
> > >
> > >   - name_to_handle_at("/cephfs/myfile")
> > >   - unlink("/cephfs/myfile")
> > >   - open_by_handle_at()
> > >
> >
> > Ok, I guess we end up doing some delayed cleanup, and that allows the
> > inode to be found in that situation.
> >
> > > The open_by_handle_at man page isn't really clear about these 2 scenarios,
> > > but generic/426 will fail if -ESTALE isn't returned.  Want me to add a
> > > comment to the code, describing these 2 scenarios?
> > >
> >
> > (cc'ing Amir since he added this test)
> >
> > I don't think there is any hard requirement that open_by_handle_at
> > should fail in that situation. It generally does for most filesystems
> > due to the way they handle cleaning up unlinked inodes, but I don't
> > think it's technically illegal to allow the inode to still be found. If
> > the caller cares about whether it has been unlinked it can always test
> > i_nlink itself.
> >
> > Amir, is this required for some reason that I'm not aware of?
> 
> Hi Jeff,
> 
> The origin of this test is in fstests commit:
> 794798fa xfsqa: test open_by_handle() on unlinked and freed inode clusters
> 
> It was introduced to catch an xfs bug, so this behavior is the expectation
> of xfs filesystem, but note that it is not a general expecta

Re: [PATCH] ceph: don't return -ESTALE if there's still an open file

2020-05-14 Thread Luis Henriques
On Thu, May 14, 2020 at 08:10:09AM -0400, Jeff Layton wrote:
> On Thu, 2020-05-14 at 12:14 +0100, Luis Henriques wrote:
> > Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
> > a file handle to dentry"), this fixes another corner case with
> > name_to_handle_at/open_by_handle_at.  The issue has been detected by
> > xfstest generic/467, when doing:
> > 
> >  - name_to_handle_at("/cephfs/myfile")
> >  - open("/cephfs/myfile")
> >  - unlink("/cephfs/myfile")
> >  - open_by_handle_at()
> > 
> > The call to open_by_handle_at should not fail because the file still
> > exists and we do have a valid handle to it.
> > 
> > Signed-off-by: Luis Henriques 
> > ---
> >  fs/ceph/export.c | 13 +++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> > index 79dc06881e78..8556df9d94d0 100644
> > --- a/fs/ceph/export.c
> > +++ b/fs/ceph/export.c
> > @@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct super_block 
> > *sb, u64 ino)
> >  
> >  static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
> >  {
> > +   struct ceph_inode_info *ci;
> > struct inode *inode = __lookup_inode(sb, ino);
> > +
> > if (IS_ERR(inode))
> > return ERR_CAST(inode);
> > if (inode->i_nlink == 0) {
> > -   iput(inode);
> > -   return ERR_PTR(-ESTALE);
> > +   bool is_open;
> > +   ci = ceph_inode(inode);
> > > +   spin_lock(&ci->i_ceph_lock);
> > > +   is_open = __ceph_is_file_opened(ci);
> > > +   spin_unlock(&ci->i_ceph_lock);
> > +   if (!is_open) {
> > +   iput(inode);
> > +   return ERR_PTR(-ESTALE);
> > +   }
> > }
> > return d_obtain_alias(inode);
> >  }
> 
> Thanks Luis. Out of curiosity, is there any reason we shouldn't ignore
> the i_nlink value here? Does anything obviously break if we do?

Yes, the scenario described in commit 03f219041fdb is still valid, which
is basically the same but without the extra open(2):

  - name_to_handle_at("/cephfs/myfile")
  - unlink("/cephfs/myfile")
  - open_by_handle_at()

The open_by_handle_at man page isn't really clear about these 2 scenarios,
but generic/426 will fail if -ESTALE isn't returned.  Want me to add a
comment to the code, describing these 2 scenarios?

Cheers,
--
Luis
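
The two scenarios above reduce to a small decision table.  A minimal
userspace model of the check in the patched __fh_to_dentry(), assuming the
only inputs are the link count and whether any file is still open;
model_fh_lookup() is a hypothetical name and this is not the kernel code:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model (hypothetical, not kernel code) of the decision made
 * when converting a file handle back into a dentry.
 *
 * Returns 0 when the handle may be resolved, -ESTALE when it must be
 * rejected.
 */
int model_fh_lookup(unsigned int nlink, bool is_open)
{
	/* Unlinked and nobody holds it open: the generic/426 case. */
	if (nlink == 0 && !is_open)
		return -ESTALE;

	/* Still linked, or unlinked but still open: the generic/467 case. */
	return 0;
}
```

With this model, (nlink=0, is_open=true) resolves while (nlink=0,
is_open=false) yields -ESTALE, which matches what the two xfstests expect.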


[PATCH] ceph: don't return -ESTALE if there's still an open file

2020-05-14 Thread Luis Henriques
Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at.  The issue has been detected by
xfstest generic/467, when doing:

 - name_to_handle_at("/cephfs/myfile")
 - open("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - open_by_handle_at()

The call to open_by_handle_at should not fail because the file still
exists and we do have a valid handle to it.

Signed-off-by: Luis Henriques 
---
 fs/ceph/export.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index 79dc06881e78..8556df9d94d0 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -171,12 +171,21 @@ struct inode *ceph_lookup_inode(struct super_block *sb, u64 ino)
 
 static struct dentry *__fh_to_dentry(struct super_block *sb, u64 ino)
 {
+   struct ceph_inode_info *ci;
struct inode *inode = __lookup_inode(sb, ino);
+
if (IS_ERR(inode))
return ERR_CAST(inode);
if (inode->i_nlink == 0) {
-   iput(inode);
-   return ERR_PTR(-ESTALE);
+   bool is_open;
+   ci = ceph_inode(inode);
+   spin_lock(&ci->i_ceph_lock);
+   is_open = __ceph_is_file_opened(ci);
+   spin_unlock(&ci->i_ceph_lock);
+   if (!is_open) {
+   iput(inode);
+   return ERR_PTR(-ESTALE);
+   }
}
return d_obtain_alias(inode);
 }


[PATCH] ceph: demote quotarealm lookup warning to a debug message

2020-05-05 Thread Luis Henriques
A misconfigured cephx can easily result in having the kernel client
flooding the logs with:

  ceph: Can't lookup inode 1 (err: -13)

Change this message to debug level.

Link: https://tracker.ceph.com/issues/44546
Signed-off-by: Luis Henriques 
---
Hi!

This patch should fix some harmless warnings when using cephx to restrict
users' access to certain filesystem paths.  I've added a comment to the
tracker noting that removing this warning could (unlikely, IMHO!) cause an
admin to miss not-so-harmless bogus configurations.

Cheers,
--
Luís

 fs/ceph/quota.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index de56dee60540..19507e2fdb57 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -159,8 +159,8 @@ static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc,
}
 
if (IS_ERR(in)) {
-   pr_warn("Can't lookup inode %llx (err: %ld)\n",
-   realm->ino, PTR_ERR(in));
+   dout("Can't lookup inode %llx (err: %ld)\n",
+realm->ino, PTR_ERR(in));
qri->timeout = jiffies + msecs_to_jiffies(60 * 1000); /* XXX */
} else {
qri->timeout = 0;



Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap

2019-10-23 Thread Luis Henriques


Luis Henriques  writes:

> On Tue, Oct 22, 2019 at 08:48:56PM +0800, Yan, Zheng wrote:
>> On Mon, Oct 21, 2019 at 10:55 PM Luis Henriques  wrote:
>> 
>> >
>> > Jeff Layton  writes:
>> >
>> > > On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote:
>> > >> KASAN reports a use-after-free when running xfstest generic/531, with
>> > the
>> > >> following trace:
>> > >>
>> > >> [  293.903362]  kasan_report+0xe/0x20
>> > >> [  293.903365]  rb_erase+0x1f/0x790
>> > >> [  293.903370]  __ceph_remove_cap+0x201/0x370
>> > >> [  293.903375]  __ceph_remove_caps+0x4b/0x70
>> > >> [  293.903380]  ceph_evict_inode+0x4e/0x360
>> > >> [  293.903386]  evict+0x169/0x290
>> > >> [  293.903390]  __dentry_kill+0x16f/0x250
>> > >> [  293.903394]  dput+0x1c6/0x440
>> > >> [  293.903398]  __fput+0x184/0x330
>> > >> [  293.903404]  task_work_run+0xb9/0xe0
>> > >> [  293.903410]  exit_to_usermode_loop+0xd3/0xe0
>> > >> [  293.903413]  do_syscall_64+0x1a0/0x1c0
>> > >> [  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > >>
>> > >> This happens because __ceph_remove_cap() may queue a cap release
>> > >> (__ceph_queue_cap_release) which can be scheduled before that cap is
>> > >> removed from the inode list with
>> > >>
>> > >>  rb_erase(&cap->ci_node, &ci->i_caps);
>> > >>
>> > >> And, when this finally happens, the use-after-free will occur.
>> > >>
>> > >> This can be fixed by protecting the rb_erase with the s_cap_lock
>> > spinlock,
>> > >> which is used by ceph_send_cap_releases(), before the cap is freed.
>> > >>
>> > >> Signed-off-by: Luis Henriques 
>> > >> ---
>> > >>  fs/ceph/caps.c | 4 ++--
>> > >>  1 file changed, 2 insertions(+), 2 deletions(-)
>> > >>
>> > >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> > >> index d3b9c9d5c1bd..21ee38cabe98 100644
>> > >> --- a/fs/ceph/caps.c
>> > >> +++ b/fs/ceph/caps.c
>> > >> @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap,
>> > bool queue_release)
>> > >>  }
>> > >>  cap->cap_ino = ci->i_vino.ino;
>> > >>
>> > >> -spin_unlock(&session->s_cap_lock);
>> > >> -
>> > >>  /* remove from inode list */
>> > >>  rb_erase(&cap->ci_node, &ci->i_caps);
>> > >>  if (ci->i_auth_cap == cap)
>> > >>  ci->i_auth_cap = NULL;
>> > >>
>> > >> +spin_unlock(&session->s_cap_lock);
>> > >> +
>> > >>  if (removed)
>> > >>  ceph_put_cap(mdsc, cap);
>> > >>
>> > >
>> > > Is there any reason we need to wait until this point to remove it from
>> > > the rbtree? ISTM that we ought to just do that at the beginning of the
>> > > function, before we take the s_cap_lock.
>> >
>> > That sounds good to me, at least at a first glance.  I spent some time
>> > looking for any possible issues in the code, and even ran a few tests.
>> >
>> > However, looking at git log I found commit f818a73674c5 ("ceph: fix cap
>> > removal races"), which moved that rb_erase from the beginning of the
>> > function to its current position.  So, unless the race mentioned in
>> > this commit has disappeared in the meantime (which is possible, this
>> > commit is from 2010!), this rbtree operation shouldn't be changed.
>> >
>> > And I now wonder if my patch isn't introducing a race too...
>> > __ceph_remove_cap() is supposed to always be called with the session
>> > mutex held, except for the ceph_evict_inode() path.  Which is where I'm
>> > seeing the UAF.  So, maybe what's missing here is the s_mutex.  Hmm...
>> >
>> >
>> we can't lock s_mutex here, because i_ceph_lock is locked
>
> Well, my idea wasn't to get s_mutex here but earlier in the stack.
> Maybe in ceph_evict_inode, protecting the call to __ceph_remove_caps.
> But I didn't really look into that yet, so I'm not really sure if

Ok, I looked into that now and obviously that's not possible.  So, I
guess my original patch is still the best option.

Cheers,
-- 
Luis

> that's feasible (or even if that would fix this UAF).  I suspect that's
> not possible anyway, due to the comment above __ceph_remove_cap:
>
>   caller will not hold session s_mutex if called from destroy_inode.



Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap

2019-10-22 Thread Luis Henriques
On Tue, Oct 22, 2019 at 08:48:56PM +0800, Yan, Zheng wrote:
> On Mon, Oct 21, 2019 at 10:55 PM Luis Henriques  wrote:
> 
> >
> > Jeff Layton  writes:
> >
> > > On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote:
> > >> KASAN reports a use-after-free when running xfstest generic/531, with
> > the
> > >> following trace:
> > >>
> > >> [  293.903362]  kasan_report+0xe/0x20
> > >> [  293.903365]  rb_erase+0x1f/0x790
> > >> [  293.903370]  __ceph_remove_cap+0x201/0x370
> > >> [  293.903375]  __ceph_remove_caps+0x4b/0x70
> > >> [  293.903380]  ceph_evict_inode+0x4e/0x360
> > >> [  293.903386]  evict+0x169/0x290
> > >> [  293.903390]  __dentry_kill+0x16f/0x250
> > >> [  293.903394]  dput+0x1c6/0x440
> > >> [  293.903398]  __fput+0x184/0x330
> > >> [  293.903404]  task_work_run+0xb9/0xe0
> > >> [  293.903410]  exit_to_usermode_loop+0xd3/0xe0
> > >> [  293.903413]  do_syscall_64+0x1a0/0x1c0
> > >> [  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >>
> > >> This happens because __ceph_remove_cap() may queue a cap release
> > >> (__ceph_queue_cap_release) which can be scheduled before that cap is
> > >> removed from the inode list with
> > >>
> > > >>  rb_erase(&cap->ci_node, &ci->i_caps);
> > >>
> > >> And, when this finally happens, the use-after-free will occur.
> > >>
> > >> This can be fixed by protecting the rb_erase with the s_cap_lock
> > spinlock,
> > >> which is used by ceph_send_cap_releases(), before the cap is freed.
> > >>
> > >> Signed-off-by: Luis Henriques 
> > >> ---
> > >>  fs/ceph/caps.c | 4 ++--
> > >>  1 file changed, 2 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > >> index d3b9c9d5c1bd..21ee38cabe98 100644
> > >> --- a/fs/ceph/caps.c
> > >> +++ b/fs/ceph/caps.c
> > >> @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap,
> > bool queue_release)
> > >>  }
> > >>  cap->cap_ino = ci->i_vino.ino;
> > >>
> > > >> -spin_unlock(&session->s_cap_lock);
> > > >> -
> > > >>  /* remove from inode list */
> > > >>  rb_erase(&cap->ci_node, &ci->i_caps);
> > > >>  if (ci->i_auth_cap == cap)
> > > >>  ci->i_auth_cap = NULL;
> > > >>
> > > >> +spin_unlock(&session->s_cap_lock);
> > >> +
> > >>  if (removed)
> > >>  ceph_put_cap(mdsc, cap);
> > >>
> > >
> > > Is there any reason we need to wait until this point to remove it from
> > > the rbtree? ISTM that we ought to just do that at the beginning of the
> > > function, before we take the s_cap_lock.
> >
> > That sounds good to me, at least at a first glance.  I spent some time
> > looking for any possible issues in the code, and even ran a few tests.
> >
> > However, looking at git log I found commit f818a73674c5 ("ceph: fix cap
> > removal races"), which moved that rb_erase from the beginning of the
> > function to its current position.  So, unless the race mentioned in
> > this commit has disappeared in the meantime (which is possible, this
> > commit is from 2010!), this rbtree operation shouldn't be changed.
> >
> > And I now wonder if my patch isn't introducing a race too...
> > __ceph_remove_cap() is supposed to always be called with the session
> > mutex held, except for the ceph_evict_inode() path.  Which is where I'm
> > seeing the UAF.  So, maybe what's missing here is the s_mutex.  Hmm...
> >
> >
> we can't lock s_mutex here, because i_ceph_lock is locked

Well, my idea wasn't to get s_mutex here but earlier in the stack.
Maybe in ceph_evict_inode, protecting the call to __ceph_remove_caps.
But I didn't really look into that yet, so I'm not really sure if
that's feasible (or even if that would fix this UAF).  I suspect that's
not possible anyway, due to the comment above __ceph_remove_cap:

  caller will not hold session s_mutex if called from destroy_inode.

Cheers,
--
Luís


Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap

2019-10-21 Thread Luis Henriques


Jeff Layton  writes:

> On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote:
>> KASAN reports a use-after-free when running xfstest generic/531, with the
>> following trace:
>>
>> [  293.903362]  kasan_report+0xe/0x20
>> [  293.903365]  rb_erase+0x1f/0x790
>> [  293.903370]  __ceph_remove_cap+0x201/0x370
>> [  293.903375]  __ceph_remove_caps+0x4b/0x70
>> [  293.903380]  ceph_evict_inode+0x4e/0x360
>> [  293.903386]  evict+0x169/0x290
>> [  293.903390]  __dentry_kill+0x16f/0x250
>> [  293.903394]  dput+0x1c6/0x440
>> [  293.903398]  __fput+0x184/0x330
>> [  293.903404]  task_work_run+0xb9/0xe0
>> [  293.903410]  exit_to_usermode_loop+0xd3/0xe0
>> [  293.903413]  do_syscall_64+0x1a0/0x1c0
>> [  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> This happens because __ceph_remove_cap() may queue a cap release
>> (__ceph_queue_cap_release) which can be scheduled before that cap is
>> removed from the inode list with
>>
>>  rb_erase(&cap->ci_node, &ci->i_caps);
>>
>> And, when this finally happens, the use-after-free will occur.
>>
>> This can be fixed by protecting the rb_erase with the s_cap_lock spinlock,
>> which is used by ceph_send_cap_releases(), before the cap is freed.
>>
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/caps.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index d3b9c9d5c1bd..21ee38cabe98 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool 
>> queue_release)
>>  }
>>  cap->cap_ino = ci->i_vino.ino;
>>
>> -spin_unlock(&session->s_cap_lock);
>> -
>>  /* remove from inode list */
>>  rb_erase(&cap->ci_node, &ci->i_caps);
>>  if (ci->i_auth_cap == cap)
>>  ci->i_auth_cap = NULL;
>>
>> +spin_unlock(&session->s_cap_lock);
>> +
>>  if (removed)
>>  ceph_put_cap(mdsc, cap);
>>
>
> Is there any reason we need to wait until this point to remove it from
> the rbtree? ISTM that we ought to just do that at the beginning of the
> function, before we take the s_cap_lock.

That sounds good to me, at least at a first glance.  I spent some time
looking for any possible issues in the code, and even ran a few tests.

However, looking at git log I found commit f818a73674c5 ("ceph: fix cap
removal races"), which moved that rb_erase from the beginning of the
function to its current position.  So, unless the race mentioned in
this commit has disappeared in the meantime (which is possible, this
commit is from 2010!), this rbtree operation shouldn't be changed.

And I now wonder if my patch isn't introducing a race too...
__ceph_remove_cap() is supposed to always be called with the session
mutex held, except for the ceph_evict_inode() path.  Which is where I'm
seeing the UAF.  So, maybe what's missing here is the s_mutex.  Hmm...

Cheers,
--
Luis


[PATCH] ceph: Fix use-after-free in __ceph_remove_cap

2019-10-17 Thread Luis Henriques
KASAN reports a use-after-free when running xfstest generic/531, with the
following trace:

[  293.903362]  kasan_report+0xe/0x20
[  293.903365]  rb_erase+0x1f/0x790
[  293.903370]  __ceph_remove_cap+0x201/0x370
[  293.903375]  __ceph_remove_caps+0x4b/0x70
[  293.903380]  ceph_evict_inode+0x4e/0x360
[  293.903386]  evict+0x169/0x290
[  293.903390]  __dentry_kill+0x16f/0x250
[  293.903394]  dput+0x1c6/0x440
[  293.903398]  __fput+0x184/0x330
[  293.903404]  task_work_run+0xb9/0xe0
[  293.903410]  exit_to_usermode_loop+0xd3/0xe0
[  293.903413]  do_syscall_64+0x1a0/0x1c0
[  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because __ceph_remove_cap() may queue a cap release
(__ceph_queue_cap_release) which can be scheduled before that cap is
removed from the inode list with

rb_erase(&cap->ci_node, &ci->i_caps);

And, when this finally happens, the use-after-free will occur.

This can be fixed by protecting the rb_erase with the s_cap_lock spinlock,
which is used by ceph_send_cap_releases(), before the cap is freed.

Signed-off-by: Luis Henriques 
---
 fs/ceph/caps.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d3b9c9d5c1bd..21ee38cabe98 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
}
cap->cap_ino = ci->i_vino.ino;
 
-   spin_unlock(&session->s_cap_lock);
-
/* remove from inode list */
rb_erase(&cap->ci_node, &ci->i_caps);
if (ci->i_auth_cap == cap)
ci->i_auth_cap = NULL;
 
+   spin_unlock(&session->s_cap_lock);
+
if (removed)
ceph_put_cap(mdsc, cap);
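
The ordering argument in the commit message can be sketched in a
single-threaded userspace model.  Everything named model_* is hypothetical:
the boolean stands in for s_cap_lock, a singly linked list for the i_caps
rbtree, and free() for the final cap put.  It only illustrates, under those
assumptions, why the unlink has to happen before the lock is dropped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the cap, session and inode structures. */
struct model_cap {
	struct model_cap *next_in_inode;  /* models the i_caps rbtree link */
	struct model_cap *next_release;   /* models the release queue link */
};

struct model_session {
	bool cap_lock_held;               /* models s_cap_lock */
	struct model_cap *release_queue;
};

struct model_inode {
	struct model_cap *caps;
};

static void model_lock(struct model_session *s)
{
	assert(!s->cap_lock_held);
	s->cap_lock_held = true;
}

static void model_unlock(struct model_session *s)
{
	assert(s->cap_lock_held);
	s->cap_lock_held = false;
}

/* Unlink cap from the inode's list, if present. */
static void model_inode_unlink(struct model_inode *ino, struct model_cap *cap)
{
	struct model_cap **pp;

	for (pp = &ino->caps; *pp; pp = &(*pp)->next_in_inode) {
		if (*pp == cap) {
			*pp = cap->next_in_inode;
			cap->next_in_inode = NULL;
			return;
		}
	}
}

/*
 * Patched ordering: the cap is queued for release and unlinked from the
 * inode within one lock hold, so the releaser cannot free it in between.
 * The cap is not touched after the unlock.
 */
void model_remove_cap(struct model_session *s, struct model_inode *ino,
		      struct model_cap *cap)
{
	model_lock(s);
	cap->next_release = s->release_queue;
	s->release_queue = cap;
	model_inode_unlink(ino, cap);
	model_unlock(s);
}

/* Releaser: takes the same lock and frees every queued cap. */
int model_send_cap_releases(struct model_session *s)
{
	int freed = 0;

	model_lock(s);
	while (s->release_queue) {
		struct model_cap *cap = s->release_queue;

		s->release_queue = cap->next_release;
		free(cap);
		freed++;
	}
	model_unlock(s);
	return freed;
}

/* Number of caps still linked into the inode. */
int model_inode_cap_count(const struct model_inode *ino)
{
	const struct model_cap *c;
	int n = 0;

	for (c = ino->caps; c; c = c->next_in_inode)
		n++;
	return n;
}
```

In the unpatched ordering the unlink would sit after model_unlock(), which
is the window where a concurrent releaser could already have freed the cap.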
 


'unable to handle page fault' in pstore

2019-10-08 Thread Luis Henriques
Hi,

Maybe this is a known issue with pstore, I didn't investigate, but it's
pretty easy to reproduce:

I've efi-pstore loaded, with a bunch of files in /sys/fs/pstore.  If I
unload my backend driver (efi-pstore) and try to remove a file from
/sys/fs/pstore I'll see the following splat:

BUG: unable to handle page fault for address: c0bcf090
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 64a60c067 P4D 64a60c067 PUD 64a60e067 PMD 892200067 PTE 0
Oops:  [#1] SMP PTI
CPU: 0 PID: 3154 Comm: mv Tainted: GE 5.4.0-rc2 #19
Hardware name: Dell Inc. Precision 5510/0N8J4R, BIOS 1.10.0 02/25/2019
RIP: 0010:pstore_unlink+0x1e/0x70
Code: 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 54 49 89 fc 55 53 
48 8b 46 30 48 8b 80 38 02 00 00 48 8b 58 10 48 8b 3b <48> 83 bf 90 00 00 00 00 
74 39 48 83 c7 0c 81 42 00
RSP: 0018:b2eb40587e70 EFLAGS: 00010246
RAX: 8920cea3e8a0 RBX: 8920cd55b400 RCX: 
RDX: 0001 RSI: 8920cf76f300 RDI: c0bcf000
RBP: 8920cf76f300 R08: 8920cf76f300 R09: 007a2e636e652e32
R10: 0007 R11: 7fff R12: 8920d4c91d80
R13: 8920d4c91d80 R14: 8920d80c3480 R15: 8920d80c3520
FS:  7f3afe720640() GS:8920dda0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: c0bcf090 CR3: 00088e260001 CR4: 003606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 vfs_unlink+0x10f/0x1f0
 do_unlinkat+0x1af/0x2f0
 do_syscall_64+0x4c/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3afe8ef2f7
Code: 73 01 c3 48 8b 0d 89 db 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 
84 00 00 00 00 00 0f 1f 44 00 00 b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 
c3 48 8b 0d 59 64 89 01 48
RSP: 002b:7ffd60fea168 EFLAGS: 0202 ORIG_RAX: 0107
RAX: ffda RBX: 5629c4e3ef70 RCX: 7f3afe8ef2f7
RDX:  RSI: 5629c4e3dd40 RDI: ff9c
RBP: 5629c4e3dcb0 R08: 0003 R09: 
R10: f24b R11: 0202 R12: 
R13: 7ffd60fea260 R14: 5629c4e3ef70 R15: 0002

My understanding is that pstore_unlink() is exploding when running:

if (!record->psi->erase)

because that address (psi) was on the efi-pstore module.

I'm not sure what's the best way to fix this, probably a

if (psinfo && record->psi->erase)

would be enough, but ugly (and still racy?).

Cheers,
-- 
Luis
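
One possible shape for such a guard, as a userspace sketch only.  The
model_* names are hypothetical stand-ins for psinfo, struct pstore_info and
struct pstore_record, and the sketch takes no lock, so it does nothing about
the race mentioned above:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the backend descriptor. */
struct model_backend {
	int (*erase)(void *record);
};

/* Hypothetical stand-in for a record created by some backend. */
struct model_record {
	struct model_backend *psi;   /* backend that created the record */
};

/* Currently registered backend; NULL once it has been unregistered. */
static struct model_backend *model_psinfo;

void model_register(struct model_backend *b)
{
	model_psinfo = b;
}

void model_unregister(void)
{
	model_psinfo = NULL;
}

/* Example erase callback that always succeeds. */
int model_erase_ok(void *record)
{
	(void)record;
	return 0;
}

int model_unlink(struct model_record *rec)
{
	/*
	 * Only pointer values are compared here: rec->psi is dereferenced
	 * solely when it is still the registered backend.
	 */
	if (!model_psinfo || rec->psi != model_psinfo)
		return -EPERM;
	if (!rec->psi->erase)
		return -EPERM;
	return rec->psi->erase(rec);
}
```

Returning -EPERM for a record whose backend is gone is an assumption of the
sketch; whether that is the right error, or whether the records should be
torn down at unregister time instead, is left open.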


Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster

2019-09-10 Thread Luis Henriques
Gregory Farnum  writes:

> On Mon, Sep 9, 2019 at 4:15 AM Luis Henriques  wrote:
>>
>> "Jeff Layton"  writes:
>>
>> > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> >> OSDs are able to perform object copies across different pools.  Thus,
>> >> there's no need to prevent copy_file_range from doing remote copies if the
>> >> source and destination superblocks are different.  Only return -EXDEV if
>> >> they have different fsid (the cluster ID).
>> >>
>> >> Signed-off-by: Luis Henriques 
>> >> ---
>> >>  fs/ceph/file.c | 18 ++
>> >>  1 file changed, 14 insertions(+), 4 deletions(-)
>> >>
>> >> Hi,
>> >>
>> >> Here's the patch changelog since initial submission:
>> >>
>> >> - Dropped have_fsid checks on client structs
>> >> - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> >> - Fixed 'To:' field in email so that this time the patch hits vger
>> >>
>> >> Cheers,
>> >> --
>> >> Luis
>> >>
>> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> >> index 685a03cc4b77..4a624a1dd0bb 100644
>> >> --- a/fs/ceph/file.c
>> >> +++ b/fs/ceph/file.c
>> >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> >> *src_file, loff_t src_off,
>> >>  struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> >>  struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> >>  struct ceph_cap_flush *prealloc_cf;
>> >> +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> >>  struct ceph_object_locator src_oloc, dst_oloc;
>> >>  struct ceph_object_id src_oid, dst_oid;
>> >>  loff_t endoff = 0, size;
>> >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file 
>> >> *src_file, loff_t src_off,
>> >>
>> >>  if (src_inode == dst_inode)
>> >>  return -EINVAL;
>> >> -if (src_inode->i_sb != dst_inode->i_sb)
>> >> -return -EXDEV;
>> >> +if (src_inode->i_sb != dst_inode->i_sb) {
>> >> +struct ceph_fs_client *dst_fsc = 
>> >> ceph_inode_to_client(dst_inode);
>> >> +
>> >> +if (ceph_fsid_compare(&src_fsc->client->fsid,
>> >> +  &dst_fsc->client->fsid)) {
>> >> +dout("Copying object across different clusters:");
>> >> +dout("  src fsid: %pU dst fsid: %pU\n",
>> >> + &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> >> +return -EXDEV;
>> >> +}
>> >> +}
>> >
>> > Just to be clear: what happens here if I mount two entirely separate
>> > clusters, and their OSDs don't have any access to one another? Will this
>> > fail at some later point with an error that we can catch so that we can
>> > fall back?
>>
>> This is exactly what this check prevents: if we have two CephFS from two
>> unrelated clusters mounted and we try to copy a file across them, the
>> operation will fail with -EXDEV[1] because the FSIDs for these two
>> ceph_fs_client will be different.  OTOH, if these two filesystems are
>> within the same cluster (and thus with the same FSID), then the OSDs are
>> able to do 'copy-from' operations between them.
>>
>> I've tested all these scenarios and they seem to be handled correctly.
>> Now, I'm assuming that *all* OSDs within the same ceph cluster can
>> communicate between themselves; if this assumption is false, then this
>> patch is broken.  But again, I'm not aware of any mechanism that
>> prevents 2 OSDs from communicating between them.
>
> Your assumption is correct: all OSDs in a Ceph cluster can communicate
> with each other. I'm not aware of any plans to change this.
>
> I spent a bit of time trying to figure out how this could break
> security models and things and didn't come up with anything, so I
> think functionally it's fine even though I find it a bit scary.
>
> Also, yes, cluster FSIDs are UUIDs so they shouldn't collide.

Awesome, thanks for clarifying these points!

Cheers,
-- 
Luis
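
The rule discussed in this thread (different superblocks are fine, different
clusters are not) can be modelled in userspace.  A minimal sketch, assuming
an fsid is a 16-byte UUID compared bytewise; the model_* types and the sb_id
field are hypothetical, not the ceph structures:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte cluster fsid (a UUID). */
struct model_fsid {
	unsigned char fsid[16];
};

/* Hypothetical stand-in for one mounted CephFS instance. */
struct model_mount {
	int sb_id;                /* identifies the superblock */
	struct model_fsid fsid;   /* cluster this mount belongs to */
};

/*
 * Returns 0 when a remote object copy may be attempted and -EXDEV when
 * source and destination belong to different clusters.
 */
int model_remote_copy_allowed(const struct model_mount *src,
			      const struct model_mount *dst)
{
	if (src->sb_id != dst->sb_id &&
	    memcmp(&src->fsid, &dst->fsid, sizeof(src->fsid)) != 0)
		return -EXDEV;
	return 0;
}
```

Two mounts with different sb_id but the same fsid pass the check; a caller
that gets -EXDEV would fall back to a plain read/write copy.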


> -Greg
>
>>
>> [1] Actually, the files will still be copied because we'll fallback into
>> the defau

[PATCH v3] ceph: allow object copies across different filesystems in the same cluster

2019-09-09 Thread Luis Henriques
OSDs are able to perform object copies across different pools.  Thus,
there's no need to prevent copy_file_range from doing remote copies if the
source and destination superblocks are different.  Only return -EXDEV if
they have different fsid (the cluster ID).

Signed-off-by: Luis Henriques 
---
 fs/ceph/file.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

Hi,

Here's the changelog:

* since v2

- single dout() in error path

* since v1:

- Dropped have_fsid checks on client structs
- Use %pU to print the fsid instead of raw hex strings (%*ph)
- Fixed 'To:' field in email so that this time the patch hits vger

Cheers,
--
Luis

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..846cf5aea85e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
struct ceph_inode_info *src_ci = ceph_inode(src_inode);
struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
struct ceph_cap_flush *prealloc_cf;
+   struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
struct ceph_object_locator src_oloc, dst_oloc;
struct ceph_object_id src_oid, dst_oid;
loff_t endoff = 0, size;
@@ -1915,8 +1916,16 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 
if (src_inode == dst_inode)
return -EINVAL;
-   if (src_inode->i_sb != dst_inode->i_sb)
-   return -EXDEV;
+   if (src_inode->i_sb != dst_inode->i_sb) {
+   struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
+
+   if (ceph_fsid_compare(&src_fsc->client->fsid,
+ &dst_fsc->client->fsid)) {
+   dout("Copying files across clusters: src: %pU dst: %pU\n",
+&src_fsc->client->fsid, &dst_fsc->client->fsid);
+   return -EXDEV;
+   }
+   }
if (ceph_snap(dst_inode) != CEPH_NOSNAP)
return -EROFS;
 
@@ -1928,7 +1937,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 * efficient).
 */
 
-   if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
+   if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
return -EOPNOTSUPP;
 
if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
@@ -2044,7 +2053,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
dst_ci->i_vino.ino, dst_objnum);
/* Do an object remote copy */
err = ceph_osdc_copy_from(
-   &ceph_inode_to_client(src_inode)->client->osdc,
+   &src_fsc->client->osdc,
src_ci->i_vino.snap, 0,
&src_oid, &src_oloc,
CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster

2019-09-09 Thread Luis Henriques
"Jeff Layton"  writes:

> On Mon, 2019-09-09 at 06:35 -0400, Jeff Layton wrote:
>> On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> > OSDs are able to perform object copies across different pools.  Thus,
>> > there's no need to prevent copy_file_range from doing remote copies if the
>> > source and destination superblocks are different.  Only return -EXDEV if
>> > they have different fsid (the cluster ID).
>> > 
>> > Signed-off-by: Luis Henriques 
>> > ---
>> >  fs/ceph/file.c | 18 ++
>> >  1 file changed, 14 insertions(+), 4 deletions(-)
>> > 
>> > Hi,
>> > 
>> > Here's the patch changelog since initial submission:
>> > 
>> > - Dropped have_fsid checks on client structs
>> > - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> > - Fixed 'To:' field in email so that this time the patch hits vger
>> > 
>> > Cheers,
>> > --
>> > Luis
>> > 
>> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > index 685a03cc4b77..4a624a1dd0bb 100644
>> > --- a/fs/ceph/file.c
>> > +++ b/fs/ceph/file.c
>> > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> > *src_file, loff_t src_off,
>> >struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> >struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> >struct ceph_cap_flush *prealloc_cf;
>> > +  struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> >struct ceph_object_locator src_oloc, dst_oloc;
>> >struct ceph_object_id src_oid, dst_oid;
>> >loff_t endoff = 0, size;
>> > @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file 
>> > *src_file, loff_t src_off,
>> >  
>> >if (src_inode == dst_inode)
>> >return -EINVAL;
>> > -  if (src_inode->i_sb != dst_inode->i_sb)
>> > -  return -EXDEV;
>> > +  if (src_inode->i_sb != dst_inode->i_sb) {
>> > +  struct ceph_fs_client *dst_fsc = 
>> > ceph_inode_to_client(dst_inode);
>> > +
>> > +  if (ceph_fsid_compare(&src_fsc->client->fsid,
>> > +&dst_fsc->client->fsid)) {
>> > +  dout("Copying object across different clusters:");
>> > +  dout("  src fsid: %pU dst fsid: %pU\n",
>> > +   &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> > +  return -EXDEV;
>> > +  }
>> > +  }
>> 
>> Just to be clear: what happens here if I mount two entirely separate
>> clusters, and their OSDs don't have any access to one another? Will this
>> fail at some later point with an error that we can catch so that we can
>> fall back?
>> 
>
> Duh, sorry I asked before I had a cup of coffee this morning. The whole
> point is to skip that case.
>
> That said...I wonder if it's possible to have an fsid collision across
> two separate clusters and this fail to catch that case? Aren't these
> things just allocated via a simple counter increment?

My understanding is that this is some sort of UUID.  Looking at
doc/install/manual-deployment.rst it says that the fsid is a unique ID
that should be generated using uuidgen (I believe that's what vstart.sh
clusters use).

That said, it's obviously possible to reuse an fsid in two clusters.
And mounting both filesystems with the same fsid on the same client may
already cause some troubles without even trying to copy_file_range files
across them (for ex., fscache code seems to assume unique fsids).  But I
have never tested such sort of things (probably no one did) and I really
don't know what are the consequences.  In this specific case, I would
expect the 'copy-from' operation to fail with some error from the OSDs.

> Probably not worth worrying about overmuch, but might be good to
> understand what would happen in that case if only to field mailing list
> reports.

If there are concerns regarding this, I'm OK simply dropping the patch
for now and continue forbidding object copies when superblocks are
different.  I just thought this was a low-hanging fruit, and didn't
realize that it's not very easy to ensure that 2 cephfs instances
actually belong to the same cluster.  Maybe there are other checks that
could be done...?

Cheers,
-- 
Luis

> Other than that, this looks fine, modulo Ilya's comment about the two
> dout messages.
>
>> 
>> >if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> >return -EROFS;
>> >  

Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster

2019-09-09 Thread Luis Henriques
"Jeff Layton"  writes:

> On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> OSDs are able to perform object copies across different pools.  Thus,
>> there's no need to prevent copy_file_range from doing remote copies if the
>> source and destination superblocks are different.  Only return -EXDEV if
>> they have different fsid (the cluster ID).
>> 
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/file.c | 18 ++
>>  1 file changed, 14 insertions(+), 4 deletions(-)
>> 
>> Hi,
>> 
>> Here's the patch changelog since initial submission:
>> 
>> - Dropped have_fsid checks on client structs
>> - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> - Fixed 'To:' field in email so that this time the patch hits vger
>> 
>> Cheers,
>> --
>> Luis
>> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 685a03cc4b77..4a624a1dd0bb 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>  struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>>  struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>>  struct ceph_cap_flush *prealloc_cf;
>> +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>>  struct ceph_object_locator src_oloc, dst_oloc;
>>  struct ceph_object_id src_oid, dst_oid;
>>  loff_t endoff = 0, size;
>> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>  
>>  if (src_inode == dst_inode)
>>  return -EINVAL;
>> -if (src_inode->i_sb != dst_inode->i_sb)
>> -return -EXDEV;
>> +if (src_inode->i_sb != dst_inode->i_sb) {
>> +struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> +
>> +if (ceph_fsid_compare(&src_fsc->client->fsid,
>> +  &dst_fsc->client->fsid)) {
>> +dout("Copying object across different clusters:");
>> +dout("  src fsid: %pU dst fsid: %pU\n",
>> + &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> +return -EXDEV;
>> +}
>> +}
>
> Just to be clear: what happens here if I mount two entirely separate
> clusters, and their OSDs don't have any access to one another? Will this
> fail at some later point with an error that we can catch so that we can
> fall back?

This is exactly what this check prevents: if we have two CephFS from two
unrelated clusters mounted and we try to copy a file across them, the
operation will fail with -EXDEV[1] because the FSIDs for these two
ceph_fs_client will be different.  OTOH, if these two filesystems are
within the same cluster (and thus with the same FSID), then the OSDs are
able to do 'copy-from' operations between them.
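
The check described above can be sketched in plain userspace C.  This is
only an illustration, not the kernel code: struct fsid, struct mount and
check_remote_copy() are made-up stand-ins, and fsid_compare() assumes the
kernel helper amounts to a byte comparison of the 16-byte cluster ID.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-in for the 16-byte cluster ID (hypothetical type). */
struct fsid {
	unsigned char fsid[16];
};

/* Stand-in for a mounted filesystem: a superblock identity plus the
 * fsid of the cluster it was mounted from (hypothetical layout). */
struct mount {
	int sb_id;
	struct fsid cluster;
};

/* Returns 0 when both IDs are byte-for-byte identical, non-zero otherwise. */
static int fsid_compare(const struct fsid *a, const struct fsid *b)
{
	return memcmp(a, b, sizeof(*a));
}

/* Returns 0 if a remote object copy may be attempted, -EXDEV otherwise. */
static int check_remote_copy(const struct mount *src, const struct mount *dst)
{
	assert(src && dst);

	/* Different superblocks are fine as long as the cluster matches. */
	if (src->sb_id != dst->sb_id &&
	    fsid_compare(&src->cluster, &dst->cluster))
		return -EXDEV;
	return 0;
}
```

Two mounts of the same cluster pass the check; two mounts from unrelated
clusters get -EXDEV.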

I've tested all these scenarios and they seem to be handled correctly.
Now, I'm assuming that *all* OSDs within the same ceph cluster can
communicate between themselves; if this assumption is false, then this
patch is broken.  But again, I'm not aware of any mechanism that
prevents 2 OSDs from communicating between them.

[1] Actually, the files will still be copied because we'll fall back to
the default VFS generic_copy_file_range behaviour, which is to do
read+write operations.
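
Applications that cannot rely on the kernel falling back for them usually
do the same thing themselves.  A minimal sketch, assuming a glibc that
provides the copy_file_range() wrapper (2.27 or later);
copy_with_fallback() is a hypothetical helper, and treating EINVAL as a
reason to fall back is a simplification that real code may want to refine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Copy up to len bytes from fd_in to fd_out at their current offsets.
 * Tries copy_file_range() first and switches to read()+write() once the
 * kernel refuses.  Returns the number of bytes copied, or -1 with errno
 * set.  EINTR handling is left out to keep the sketch short. */
static ssize_t copy_with_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;
	int try_cfr = 1;

	assert(fd_in >= 0 && fd_out >= 0);

	while (done < len) {
		size_t want = len - done;
		ssize_t n;

		if (try_cfr) {
			n = copy_file_range(fd_in, NULL, fd_out, NULL, want, 0);
			if (n > 0) {
				done += (size_t)n;
				continue;
			}
			if (n == 0)
				break;		/* end of input */
			if (errno != EXDEV && errno != EOPNOTSUPP &&
			    errno != ENOSYS && errno != EINVAL)
				return -1;	/* a real error */
			try_cfr = 0;		/* stop asking the kernel */
		}

		if (want > sizeof(buf))
			want = sizeof(buf);
		n = read(fd_in, buf, want);
		if (n < 0)
			return -1;
		if (n == 0)
			break;			/* end of input */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
			if (w < 0)
				return -1;
			off += w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```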

Cheers,
-- 
Luis


>
>
>>  if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>>  return -EROFS;
>>  
>> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>   * efficient).
>>   */
>>  
>> -if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>>  return -EOPNOTSUPP;
>>  
>>  if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>  dst_ci->i_vino.ino, dst_objnum);
>>  /* Do an object remote copy */
>>  err = ceph_osdc_copy_from(
>> -&ceph_inode_to_client(src_inode)->client->osdc,
>> +&src_fsc->client->osdc,
>>  src_ci->i_vino.snap, 0,
>>  &src_oid, &src_oloc,
>>  CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


[PATCH v2] ceph: allow object copies across different filesystems in the same cluster

2019-09-09 Thread Luis Henriques
OSDs are able to perform object copies across different pools.  Thus,
there's no need to prevent copy_file_range from doing remote copies if the
source and destination superblocks are different.  Only return -EXDEV if
they have different fsid (the cluster ID).

Signed-off-by: Luis Henriques 
---
 fs/ceph/file.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

Hi,

Here's the patch changelog since initial submission:

- Dropped have_fsid checks on client structs
- Use %pU to print the fsid instead of raw hex strings (%*ph)
- Fixed 'To:' field in email so that this time the patch hits vger

Cheers,
--
Luis

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..4a624a1dd0bb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file 
*src_file, loff_t src_off,
struct ceph_inode_info *src_ci = ceph_inode(src_inode);
struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
struct ceph_cap_flush *prealloc_cf;
+   struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
struct ceph_object_locator src_oloc, dst_oloc;
struct ceph_object_id src_oid, dst_oid;
loff_t endoff = 0, size;
@@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file 
*src_file, loff_t src_off,
 
if (src_inode == dst_inode)
return -EINVAL;
-   if (src_inode->i_sb != dst_inode->i_sb)
-   return -EXDEV;
+   if (src_inode->i_sb != dst_inode->i_sb) {
+   struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
+
+   if (ceph_fsid_compare(&src_fsc->client->fsid,
+ &dst_fsc->client->fsid)) {
+   dout("Copying object across different clusters:");
+   dout("  src fsid: %pU dst fsid: %pU\n",
+&src_fsc->client->fsid, &dst_fsc->client->fsid);
+   return -EXDEV;
+   }
+   }
if (ceph_snap(dst_inode) != CEPH_NOSNAP)
return -EROFS;
 
@@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file 
*src_file, loff_t src_off,
 * efficient).
 */
 
-   if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
+   if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
return -EOPNOTSUPP;
 
if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
@@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file 
*src_file, loff_t src_off,
dst_ci->i_vino.ino, dst_objnum);
/* Do an object remote copy */
err = ceph_osdc_copy_from(
-   &ceph_inode_to_client(src_inode)->client->osdc,
+   &src_fsc->client->osdc,
src_ci->i_vino.snap, 0,
&src_oid, &src_oloc,
CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

2019-09-09 Thread Luis Henriques
"Jeff Layton"  writes:

> On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote:
>> "Jeff Layton"  writes:
>> 
>> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
>> > > OSDs are able to perform object copies across different pools.  Thus,
>> > > there's no need to prevent copy_file_range from doing remote copies if 
>> > > the
>> > > source and destination superblocks are different.  Only return -EXDEV if
>> > > they have different fsid (the cluster ID).
>> > > 
>> > > Signed-off-by: Luis Henriques 
>> > > ---
>> > >  fs/ceph/file.c | 23 +++
>> > >  1 file changed, 19 insertions(+), 4 deletions(-)
>> > > 
>> > > Hi!
>> > > 
>> > > I've finally managed to run some tests using multiple filesystems, both
>> > > within a single cluster and also using two different clusters.  The
>> > > behaviour of copy_file_range (with this patch, of course) was what I
>> > > expected:
>> > > 
>> > >   - Object copies work fine across different filesystems within the same
>> > > cluster (even with pools in different PGs);
>> > >   - -EXDEV is returned if the fsid is different
>> > > 
>> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
>> > >  Because this is actually what's in ceph.conf fsid in "[global]"
>> > >  section.  Anyway...)
>> > > 
>> > > So, what's missing right now is (I always mention this when I have the
>> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
>> > > And add the corresponding support for the new flag to the kernel
>> > > client, of course.
>> > > 
>> > > Cheers,
>> > > --
>> > > Luis
>> > > 
>> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > > index 685a03cc4b77..88d116893c2b 100644
>> > > --- a/fs/ceph/file.c
>> > > +++ b/fs/ceph/file.c
>> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> > > *src_file, loff_t src_off,
>> > >  struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> > >  struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> > >  struct ceph_cap_flush *prealloc_cf;
>> > > +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> > >  struct ceph_object_locator src_oloc, dst_oloc;
>> > >  struct ceph_object_id src_oid, dst_oid;
>> > >  loff_t endoff = 0, size;
>> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file 
>> > > *src_file, loff_t src_off,
>> > >  
>> > >  if (src_inode == dst_inode)
>> > >  return -EINVAL;
>> > > -if (src_inode->i_sb != dst_inode->i_sb)
>> > > -return -EXDEV;
>> > > +if (src_inode->i_sb != dst_inode->i_sb) {
>> > > +struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> > > +
>> > > +if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> > > +dout("No fsid in a fs client\n");
>> > > +return -EXDEV;
>> > > +}
>> > 
>> > In what situation is there no fsid? Old cluster version?
>> > 
>> > If there is no fsid, can we take that to indicate that there is only a
>> > single filesystem possible in the cluster and that we should attempt the
>> > copy anyway?
>> 
>> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call.  It is
>> set to 'true' when handling the monmap, and it's never changed back to
>> 'false'.  Since I don't think copy_file_range will be invoked *before*
>> we get the monmap, it should be safe to drop this check.  Maybe it could
>> be replaced by a WARN_ON()?
>> 
>
> Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg
> in ceph_check_fsid when the client is initially created. Maybe there is
> some better way to achieve that?

I guess the struct ceph_fsid embedded in the client(s) could be changed
into a pointer initialized to NULL (and later dynamically allocated).
Then, the have_fsid check could be replaced by a NULL check.  Not sure
if it would bring any real benefit, though.  Want me to give that a try?
Or maybe I misunderstood your question.
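
The pointer idea floated above could look something like this.  Purely a
sketch in userspace C under my own assumptions: struct client,
client_set_fsid() and client_has_fsid() are hypothetical names, malloc()
stands in for a kernel allocation, and nothing like this was part of the
posted patches.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the 16-byte cluster ID (hypothetical type). */
struct fsid {
	unsigned char fsid[16];
};

/* Hypothetical client: fsid stays NULL until the first monmap is seen,
 * so no separate have_fsid flag is needed. */
struct client {
	struct fsid *fsid;
};

/* A NULL check replaces the have_fsid test. */
static int client_has_fsid(const struct client *c)
{
	return c->fsid != NULL;
}

/* Record the fsid taken from a monmap.  The first call allocates and
 * stores it; later calls only verify that it has not changed.
 * Returns 0 on success, -1 on mismatch or allocation failure. */
static int client_set_fsid(struct client *c, const struct fsid *from_monmap)
{
	assert(c && from_monmap);

	if (client_has_fsid(c))
		return memcmp(c->fsid, from_monmap, sizeof(*from_monmap)) ? -1 : 0;

	c->fsid = malloc(sizeof(*c->fsid));
	if (!c->fsid)
		return -1;
	*c->fsid = *from_monmap;
	return 0;
}
```

The cost is an extra allocation and one more pointer to free at teardown,
which is probably why the benefit looks marginal.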

> In any case, I'd just drop that condition here.

Ok, I'll send v2 in a second, without this check.

[ BTW, looks like my initial post didn't make it to vger.kernel.org.
  It was probably dropped because I screwed up the 'To:' field in my
  email (no idea how I did that, TBH). ]

Cheers,
-- 
Luis


Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

2019-09-06 Thread Luis Henriques
"Jeff Layton"  writes:

> On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
>> OSDs are able to perform object copies across different pools.  Thus,
>> there's no need to prevent copy_file_range from doing remote copies if the
>> source and destination superblocks are different.  Only return -EXDEV if
>> they have different fsid (the cluster ID).
>> 
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/file.c | 23 +++
>>  1 file changed, 19 insertions(+), 4 deletions(-)
>> 
>> Hi!
>> 
>> I've finally managed to run some tests using multiple filesystems, both
>> within a single cluster and also using two different clusters.  The
>> behaviour of copy_file_range (with this patch, of course) was what I
>> expected:
>> 
>>   - Object copies work fine across different filesystems within the same
>> cluster (even with pools in different PGs);
>>   - -EXDEV is returned if the fsid is different
>> 
>> (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
>>  Because this is actually what's in ceph.conf fsid in "[global]"
>>  section.  Anyway...)
>> 
>> So, what's missing right now is (I always mention this when I have the
>> opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
>> And add the corresponding support for the new flag to the kernel
>> client, of course.
>> 
>> Cheers,
>> --
>> Luis
>> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 685a03cc4b77..88d116893c2b 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>  struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>>  struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>>  struct ceph_cap_flush *prealloc_cf;
>> +struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>>  struct ceph_object_locator src_oloc, dst_oloc;
>>  struct ceph_object_id src_oid, dst_oid;
>>  loff_t endoff = 0, size;
>> @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>  
>>  if (src_inode == dst_inode)
>>  return -EINVAL;
>> -if (src_inode->i_sb != dst_inode->i_sb)
>> -return -EXDEV;
>> +if (src_inode->i_sb != dst_inode->i_sb) {
>> +struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> +
>> +if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> +dout("No fsid in a fs client\n");
>> +return -EXDEV;
>> +}
>
> In what situation is there no fsid? Old cluster version?
>
> If there is no fsid, can we take that to indicate that there is only a
> single filesystem possible in the cluster and that we should attempt the
> copy anyway?

TBH I'm not sure if 'have_fsid' can ever be 'false' in this call.  It is
set to 'true' when handling the monmap, and it's never changed back to
'false'.  Since I don't think copy_file_range will be invoked *before*
we get the monmap, it should be safe to drop this check.  Maybe it could
be replaced by a WARN_ON()?

Cheers,
-- 
Luis

>
>> +if (ceph_fsid_compare(&src_fsc->client->fsid,
>> +  &dst_fsc->client->fsid)) {
>> +dout("Copying object across different clusters:");
>> +dout("  src fsid: %*ph\n  dst fsid: %*ph\n",
>> + 16, &src_fsc->client->fsid,
>> + 16, &dst_fsc->client->fsid);
>> +return -EXDEV;
>> +}
>> +}
>>  if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>>  return -EROFS;
>>  
>> @@ -1928,7 +1943,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>   * efficient).
>>   */
>>  
>> -if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>>  return -EOPNOTSUPP;
>>  
>>  if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2059,7 @@ static ssize_t __ceph_copy_file_range(struct file 
>> *src_file, loff_t src_off,
>>  dst_ci->i_vino.ino, dst_objnum);
>>  /* Do an object remote copy */
>>  err = ceph_osdc_copy_from(
>> -&ceph_inode_to_client(src_inode)->client->osdc,
>> +&src_fsc->client->osdc,
>>  src_ci->i_vino.snap, 0,
>>  &src_oid, &src_oloc,
>>  CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


Re: [RFC PATCH] ceph: fix directories inode i_blkbits initialization

2019-07-24 Thread Luis Henriques
Luis Henriques  writes:

> "Jeff Layton"  writes:
>
>> On Tue, 2019-07-23 at 16:50 +0100, Luis Henriques wrote:
>>> When filling an inode with info from the MDS, i_blkbits is being
>>> initialized using fl_stripe_unit, which contains the stripe unit in
>>> bytes.  Unfortunately, this doesn't make sense for directories as they
>>> have fl_stripe_unit set to '0'.  This means that i_blkbits will be set
>>> to 0xff, causing a UBSAN undefined behaviour report in i_blocksize():
>>> 
>>>   UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12
>>>   shift exponent 255 is too large for 32-bit type 'int'
>>> 
>>> Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit
>>> is zero.
>>> 
>>> Signed-off-by: Luis Henriques 
>>> ---
>>>  fs/ceph/inode.c | 7 ++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>> 
>>> Hi Jeff,
>>> 
>>> To be honest, I'm not sure CEPH_BLOCK_SHIFT is the right value to use
>>> here, but for sure the one currently being used isn't correct if the
>>> inode is a directory.  Using stripe units seems to be a bug that has
>>> been there since the beginning, but it definitely became a bigger problem
>>> after commit 69448867abcb ("fs: shave 8 bytes off of struct inode").
>>> 
>>> This fix could also be moved into the 'switch' statement later in that
>>> function, in the S_IFDIR case, similar to commit 5ba72e607cdb ("ceph:
>>> set special inode's blocksize to page size").  Let me know which version
>>> you would prefer.
>>> 
>>
>> What happens with (e.g.) named pipes or symlinks? Do those inodes also
>> get this bogus value? Assuming that they do, I'd probably prefer this
>> patch since it'd fix things for all inode types, not just directories.
>
> I tested symlinks and they seem to be handled correctly (i.e. the stripe
> unit seems to be the same as the target file's).  Regarding pipes, I
> didn't test them, but from the code it should be set to PAGE_SHIFT (see
> the above mentioned commit 5ba72e607cdb).

Ok, after looking closer at the other inode types and running a few
tests with extra debug code, it all seems to be sane -- only directories
(root dir is an exception) will cause problems with i_blkbits being set
to a bogus value.  So, I'm sticking with my original RFC patch approach,
which should be easy to apply to stable kernels.
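
For reference, the arithmetic behind the bogus value can be reproduced in
userspace.  A minimal sketch, assuming i_blkbits is an 8-bit field (which
the 0xff in the report implies) and that CEPH_BLOCK_SHIFT is 22; fls32(),
blkbits_old() and blkbits_fixed() are hypothetical stand-ins, not kernel
functions.

```c
#include <assert.h>

#define EX_BLOCK_SHIFT 22	/* assumed value of CEPH_BLOCK_SHIFT (4 MB) */

/* Portable stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, or 0 when no bit is set. */
static int fls32(unsigned int x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Old behaviour: fls(0) - 1 is -1, which wraps to 0xff in an 8-bit field. */
static unsigned char blkbits_old(unsigned int stripe_unit)
{
	return (unsigned char)(fls32(stripe_unit) - 1);
}

/* Patched behaviour: use the default shift when the stripe unit is zero. */
static unsigned char blkbits_fixed(unsigned int stripe_unit)
{
	int bits = stripe_unit ? fls32(stripe_unit) - 1 : EX_BLOCK_SHIFT;

	/* Anything outside 0..31 would be an invalid shift for a 32-bit int. */
	assert(bits >= 0 && bits < 32);
	return (unsigned char)bits;
}
```

With a stripe unit of zero the old expression yields 255, and shifting by
255 is exactly what UBSAN complains about; the guarded version returns the
default shift instead.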

Cheers,
-- 
Luis

>
> Anyway, I can change the code to do *all* the i_blkbits initialization
> inside the switch statement.  Something like:
>
> switch (inode->i_mode & S_IFMT) {
> case S_IFIFO:
> case S_IFBLK:
> case S_IFCHR:
> case S_IFSOCK:
>   inode->i_blkbits = PAGE_SHIFT;
> ...
> case S_IFREG:
>   inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
>   ...
> case S_IFLNK:
>   inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
>   ...
> case S_IFDIR:
>   inode->i_blkbits = CEPH_BLOCK_SHIFT;
>   ...
> default:
>   pr_err();
> ...
> }
>
> This would add some code duplication (S_IFREG and S_IFLNK cases), but
> maybe it's a bit more clear.  The other option would be obviously to
> leave the initialization outside the switch and only change the
> i_blkbits value in the S_IF{IFO,BLK,CHR,SOCK,DIR} cases.
>
> Cheers,


Re: [RFC PATCH] ceph: fix directories inode i_blkbits initialization

2019-07-23 Thread Luis Henriques
"Jeff Layton"  writes:

> On Tue, 2019-07-23 at 16:50 +0100, Luis Henriques wrote:
>> When filling an inode with info from the MDS, i_blkbits is being
>> initialized using fl_stripe_unit, which contains the stripe unit in
>> bytes.  Unfortunately, this doesn't make sense for directories as they
>> have fl_stripe_unit set to '0'.  This means that i_blkbits will be set
>> to 0xff, causing a UBSAN undefined behaviour report in i_blocksize():
>> 
>>   UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12
>>   shift exponent 255 is too large for 32-bit type 'int'
>> 
>> Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit
>> is zero.
>> 
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/inode.c | 7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> Hi Jeff,
>> 
>> To be honest, I'm not sure CEPH_BLOCK_SHIFT is the right value to use
>> here, but for sure the one currently being used isn't correct if the
>> inode is a directory.  Using stripe units seems to be a bug that has
>> been there since the beginning, but it definitely became a bigger problem
>> after commit 69448867abcb ("fs: shave 8 bytes off of struct inode").
>> 
>> This fix could also be moved into the 'switch' statement later in that
>> function, in the S_IFDIR case, similar to commit 5ba72e607cdb ("ceph:
>> set special inode's blocksize to page size").  Let me know which version
>> you would prefer.
>> 
>
> What happens with (e.g.) named pipes or symlinks? Do those inodes also
> get this bogus value? Assuming that they do, I'd probably prefer this
> patch since it'd fix things for all inode types, not just directories.

I tested symlinks and they seem to be handled correctly (i.e. the stripe
unit seems to be the same as the target file's).  Regarding pipes, I
didn't test them, but from the code it should be set to PAGE_SHIFT (see
the above mentioned commit 5ba72e607cdb).

Anyway, I can change the code to do *all* the i_blkbits initialization
inside the switch statement.  Something like:

switch (inode->i_mode & S_IFMT) {
case S_IFIFO:
case S_IFBLK:
case S_IFCHR:
case S_IFSOCK:
inode->i_blkbits = PAGE_SHIFT;
...
case S_IFREG:
inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
...
case S_IFLNK:
inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
...
case S_IFDIR:
inode->i_blkbits = CEPH_BLOCK_SHIFT;
...
default:
pr_err();
...
}

This would add some code duplication (S_IFREG and S_IFLNK cases), but
maybe it's a bit more clear.  The other option would be obviously to
leave the initialization outside the switch and only change the
i_blkbits value in the S_IF{IFO,BLK,CHR,SOCK,DIR} cases.
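
One way to keep everything inside the switch without duplicating the
S_IFREG and S_IFLNK arms is to let the two cases share a body.  A
userspace sketch only: blkbits_for(), fls32() and the EX_* constants are
hypothetical, with EX_PAGE_SHIFT assumed to be 12 and EX_BLOCK_SHIFT
assumed to match CEPH_BLOCK_SHIFT (22).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

#define EX_PAGE_SHIFT  12	/* assumed 4 KB pages */
#define EX_BLOCK_SHIFT 22	/* assumed value of CEPH_BLOCK_SHIFT */

/* Portable stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, or 0 when no bit is set. */
static int fls32(unsigned int x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Pick a block shift from the inode type.  Returns -1 for an unknown
 * type, where the kernel code would log an error instead. */
static int blkbits_for(mode_t mode, unsigned int stripe_unit)
{
	int bits;

	switch (mode & S_IFMT) {
	case S_IFIFO:
	case S_IFBLK:
	case S_IFCHR:
	case S_IFSOCK:
		return EX_PAGE_SHIFT;
	case S_IFREG:
	case S_IFLNK:
		/* Shared body; a zero stripe unit gets the default shift. */
		bits = stripe_unit ? fls32(stripe_unit) - 1 : EX_BLOCK_SHIFT;
		assert(bits >= 0 && bits < 32);
		return bits;
	case S_IFDIR:
		return EX_BLOCK_SHIFT;
	default:
		return -1;
	}
}
```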

Cheers,
-- 
Luis


>
>> Cheers,
>> --
>> Luis
>> 
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 791f84a13bb8..0e6d6db848b7 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -800,7 +800,12 @@ static int fill_inode(struct inode *inode, struct page 
>> *locked_page,
>>  
>>  /* update inode */
>>  inode->i_rdev = le32_to_cpu(info->rdev);
>> -inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
>> +/* directories have fl_stripe_unit set to zero */
>> +if (le32_to_cpu(info->layout.fl_stripe_unit))
>> +inode->i_blkbits =
>> +fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
>> +else
>> +inode->i_blkbits = CEPH_BLOCK_SHIFT;
>>  
>>  __ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
>>  


[RFC PATCH] ceph: fix directories inode i_blkbits initialization

2019-07-23 Thread Luis Henriques
When filling an inode with info from the MDS, i_blkbits is being
initialized using fl_stripe_unit, which contains the stripe unit in
bytes.  Unfortunately, this doesn't make sense for directories as they
have fl_stripe_unit set to '0'.  This means that i_blkbits will be set
to 0xff, causing a UBSAN undefined behaviour report in i_blocksize():

  UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12
  shift exponent 255 is too large for 32-bit type 'int'

Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit
is zero.

Signed-off-by: Luis Henriques 
---
 fs/ceph/inode.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Hi Jeff,

To be honest, I'm not sure CEPH_BLOCK_SHIFT is the right value to use
here, but for sure the one currently being used isn't correct if the
inode is a directory.  Using stripe units seems to be a bug that has
been there since the beginning, but it definitely became a bigger problem
after commit 69448867abcb ("fs: shave 8 bytes off of struct inode").

This fix could also be moved into the 'switch' statement later in that
function, in the S_IFDIR case, similar to commit 5ba72e607cdb ("ceph:
set special inode's blocksize to page size").  Let me know which version
you would prefer.

Cheers,
--
Luis

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 791f84a13bb8..0e6d6db848b7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -800,7 +800,12 @@ static int fill_inode(struct inode *inode, struct page 
*locked_page,
 
/* update inode */
inode->i_rdev = le32_to_cpu(info->rdev);
-   inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
+   /* directories have fl_stripe_unit set to zero */
+   if (le32_to_cpu(info->layout.fl_stripe_unit))
+   inode->i_blkbits =
+   fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
+   else
+   inode->i_blkbits = CEPH_BLOCK_SHIFT;
 
__ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
 


Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t

2019-07-21 Thread Luis Henriques
Waiman Long  writes:

> On 7/20/19 4:41 AM, Luis Henriques wrote:
>> "Linus Torvalds"  writes:
>>
>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long  wrote:
>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>> only accesses data within the rw_semaphore structures. I don't know why it
>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>> exactly which statement is doing the invalid access.
>>> The stack traces should show line numbers if you run them through
>>> scripts/decode_stacktrace.sh.
>>>
>>> You need to have debug info enabled for that, though.
>>>
>>> Luis?
>>>
>>>  Linus
>> Yep, sure.  And I should have done this in the initial report.  It's a
>> different trace, I had to recompile the kernel.
>>
>> (I'm also adding Jeff to the CC list.)
>>
>> Cheers,
>
> Thanks for the information. I think I know where the problem is. Would
> you mind applying the attached patch to see if it can fix the KASAN error.

Yep, that seems to work -- I can't reproduce the error anymore (and
sorry for the delay).  Thanks!  And feel free to add my Tested-by.

Cheers,
-- 
Luis


Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t

2019-07-20 Thread Luis Henriques
Luis Henriques  writes:

> Luis Henriques  writes:
>
>> "Linus Torvalds"  writes:
>>
>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long  wrote:
>>>>
>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>> only accesses data within the rw_semaphore structures. I don't know why it
>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>> exactly which statement is doing the invalid access.
>>>
>>> The stack traces should show line numbers if you run them through
>>> scripts/decode_stacktrace.sh.
>>>
>>> You need to have debug info enabled for that, though.
>>>
>>> Luis?
>>>
>>>  Linus
>>
>> Yep, sure.  And I should have done this in the initial report.  It's a
>> different trace, I had to recompile the kernel.
>>
>> (I'm also adding Jeff to the CC list.)
>>
>
> Ah, and I also managed to reproduce this on btrfs so I guess this rules
> out a bug in the filesystem code.

Just another detail (before I go completely offline until tomorrow
evening): in the btrfs case I'm seeing the bug on the
rwsem_down_read_slowpath path, not on rwsem_down_write_slowpath.  But it
seems to be on the same place (i.e. rwsem_can_spin_on_owner).

Cheers,
-- 
Luis


Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t

2019-07-20 Thread Luis Henriques
Luis Henriques  writes:

> "Linus Torvalds"  writes:
>
>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long  wrote:
>>>
>>> This patch shouldn't change the behavior of the rwsem code. The code
>>> only accesses data within the rw_semaphore structures. I don't know why it
>>> will cause a KASAN error. I will have to reproduce it and figure out
>>> exactly which statement is doing the invalid access.
>>
>> The stack traces should show line numbers if you run them through
>> scripts/decode_stacktrace.sh.
>>
>> You need to have debug info enabled for that, though.
>>
>> Luis?
>>
>>  Linus
>
> Yep, sure.  And I should have done this in the initial report.  It's a
> different trace, I had to recompile the kernel.
>
> (I'm also adding Jeff to the CC list.)
>

Ah, and I also managed to reproduce this on btrfs so I guess this rules
out a bug in the filesystem code.

Cheers,
-- 
Luis


Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t

2019-07-20 Thread Luis Henriques
"Linus Torvalds"  writes:

> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long  wrote:
>>
>> This patch shouldn't change the behavior of the rwsem code. The code
>> only accesses data within the rw_semaphore structures. I don't know why it
>> will cause a KASAN error. I will have to reproduce it and figure out
>> exactly which statement is doing the invalid access.
>
> The stack traces should show line numbers if you run them through
> scripts/decode_stacktrace.sh.
>
> You need to have debug info enabled for that, though.
>
> Luis?
>
>  Linus

Yep, sure.  And I should have done this in the initial report.  It's a
different trace, I had to recompile the kernel.

(I'm also adding Jeff to the CC list.)

Cheers,
-- 
Luis

[   39.801179] 
==
[   39.801973] BUG: KASAN: use-after-free in rwsem_down_write_slowpath 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 
/home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.802733] Read of size 4 at addr 8881f1f65138 by task xfs_io/2145

[   39.803598] CPU: 0 PID: 2145 Comm: xfs_io Not tainted 5.2.0+ #460
[   39.803600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[   39.803602] Call Trace:
[   39.803609] dump_stack (/home/miguel/kernel/linux/lib/dump_stack.c:115) 
[   39.803615] print_address_description 
(/home/miguel/kernel/linux/mm/kasan/report.c:352) 
[   39.803618] ? rwsem_down_write_slowpath 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 
/home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803621] ? rwsem_down_write_slowpath 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 
/home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803624] __kasan_report.cold 
(/home/miguel/kernel/linux/mm/kasan/report.c:483) 
[   39.803629] ? rwsem_down_write_slowpath 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 
/home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803633] kasan_report 
(/home/miguel/kernel/linux/./arch/x86/include/asm/smap.h:69 
/home/miguel/kernel/linux/mm/kasan/common.c:613) 
[   39.803636] rwsem_down_write_slowpath 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 
/home/miguel/kernel/linux/kernel/locking/rwsem.c:1125) 
[   39.803641] ? __ceph_caps_issued_mask 
(/home/miguel/kernel/linux/fs/ceph/caps.c:914) 
[   39.803644] ? find_held_lock 
(/home/miguel/kernel/linux/kernel/locking/lockdep.c:4004) 
[   39.803649] ? __ceph_do_getattr 
(/home/miguel/kernel/linux/fs/ceph/inode.c:2246) 
[   39.803653] ? down_read_non_owner 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:1116) 
[   39.803658] ? do_raw_spin_unlock 
(/home/miguel/kernel/linux/./include/linux/compiler.h:218 
/home/miguel/kernel/linux/./include/asm-generic/qspinlock.h:94 
/home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:139) 
[   39.803663] ? _raw_spin_unlock 
(/home/miguel/kernel/linux/kernel/locking/spinlock.c:184) 
[   39.803667] ? __lock_acquire.isra.0 
(/home/miguel/kernel/linux/kernel/locking/lockdep.c:3884) 
[   39.803674] ? path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 
/home/miguel/kernel/linux/fs/namei.c:3533) 
[   39.803680] ? down_write 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486) 
[   39.803683] down_write 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486) 
[   39.803687] ? down_read_killable 
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:1482) 
[   39.803690] ? __sb_start_write 
(/home/miguel/kernel/linux/./include/linux/compiler.h:194 
/home/miguel/kernel/linux/./include/linux/rcu_sync.h:38 
/home/miguel/kernel/linux/./include/linux/percpu-rwsem.h:52 
/home/miguel/kernel/linux/fs/super.c:1608) 
[   39.803694] ? __mnt_want_write (/home/miguel/kernel/linux/fs/namespace.c:253 
/home/miguel/kernel/linux/fs/namespace.c:297 
/home/miguel/kernel/linux/fs/namespace.c:337) 
[   39.803699] path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 
/home/miguel/kernel/linux/fs/namei.c:3533) 
[   39.803706] ? path_mountpoint (/home/miguel/kernel/linux/fs/namei.c:3518) 
[   39.803711] ? __is_insn_slot_addr 
(/home/miguel/kernel/linux/kernel/kprobes.c:291) 
[   39.803716] ? kernel_text_address 
(/home/miguel/kernel/linux/kernel/extable.c:113) 
[   39.803719] ? __kernel_text_address 
(/home/miguel/kernel/linux/kernel/extable.c:95) 
[   39.803724] ? unwind_get_return_address 
(/home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:311 
/home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:306) 
[   39.803727] ? swiotlb_map.cold 
(/home/miguel/kernel/linux/kernel/stacktrace.c:83) 
[   39.803730] ? arch_stack_walk 
(/home/miguel/kernel/linux/arch/x86/kernel/stacktrace.c:26) 
[   39.803735] do_filp_open (/home/miguel/kernel/linux/fs/namei.c:3563) 
[   39.803739] ? may_open_dev (/home/miguel/kernel/linux/fs/namei.c:3557) 
[   39.803746] ? __alloc_fd (/home/miguel/kernel/linux/fs/file.c:536) 
[   39.803749] ? lock_downgrade 

Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t

2019-07-19 Thread Luis Henriques
Waiman Long  writes:

> On 7/19/19 2:45 PM, Luis Henriques wrote:
>> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>>> The rwsem->owner contains not just the task structure pointer, it also
>>> holds some flags for storing the current state of the rwsem. Some of
>>> the flags may have to be atomically updated. To reflect the new reality,
>>> the owner is now changed to an atomic_long_t type.
>>>
>>> New helper functions are added to properly separate out the task
>>> structure pointer and the embedded flags.
>> I started seeing KASAN use-after-free with current master, and a bisect
>> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
>> rwsem->owner an atomic_long_t") was the problem.  Does it ring any
>> bells?  I can easily reproduce it with xfstests (generic/464).
>>
>> Cheers,
>> --
>> Luís
>
> This patch shouldn't change the behavior of the rwsem code. The code
> only accesses data within the rw_semaphore structures. I don't know why it
> will cause a KASAN error. I will have to reproduce it and figure out
> exactly which statement is doing the invalid access.

Yeah, screwing up the bisection is something I've done in the past so I may
have got the wrong commit.  Another detail is that I was running
xfstests against CephFS; I haven't tried any other filesystem.  I
can try to reproduce with btrfs or xfs next week.

Cheers,
-- 
Luis


Re: [PATCH v8 13/19] locking/rwsem: Make rwsem->owner an atomic_long_t

2019-07-19 Thread Luis Henriques
On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
> The rwsem->owner contains not just the task structure pointer, it also
> holds some flags for storing the current state of the rwsem. Some of
> the flags may have to be atomically updated. To reflect the new reality,
> the owner is now changed to an atomic_long_t type.
> 
> New helper functions are added to properly separate out the task
> structure pointer and the embedded flags.

I started seeing KASAN use-after-free with current master, and a bisect
showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
rwsem->owner an atomic_long_t") was the problem.  Does it ring any
bells?  I can easily reproduce it with xfstests (generic/464).

Cheers,
--
Luís

[ 6380.820179] run fstests generic/464 at 2019-07-19 12:04:05
[ 6381.504693] libceph: mon0 (1)192.168.155.1:40786 session established
[ 6381.506790] libceph: client4572 fsid 86b39301-7192-4052-8427-a241af35a591
[ 6381.618830] libceph: mon0 (1)192.168.155.1:40786 session established
[ 6381.619993] libceph: client4573 fsid 86b39301-7192-4052-8427-a241af35a591
[ 6384.464561] 
==
[ 6384.466165] BUG: KASAN: use-after-free in 
rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.468288] Read of size 4 at addr 8881d5dc9478 by task xfs_io/17238

[ 6384.469545] CPU: 1 PID: 17238 Comm: xfs_io Not tainted 5.2.0+ #444
[ 6384.469550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[ 6384.469554] Call Trace:
[ 6384.469563]  dump_stack+0x5b/0x90
[ 6384.469569]  print_address_description+0x6f/0x332
[ 6384.469573]  ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469575]  ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469579]  __kasan_report.cold+0x1a/0x3e
[ 6384.469583]  ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469588]  kasan_report+0xe/0x12
[ 6384.469591]  rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469596]  ? __ceph_caps_issued_mask+0xe7/0x280
[ 6384.469599]  ? find_held_lock+0xc9/0xf0
[ 6384.469604]  ? __ceph_do_getattr+0x19f/0x290
[ 6384.469608]  ? down_read_non_owner+0x1c0/0x1c0
[ 6384.469612]  ? do_raw_spin_unlock+0xa3/0x130
[ 6384.469617]  ? _raw_spin_unlock+0x24/0x30
[ 6384.469622]  ? __lock_acquire.isra.0+0x486/0x770
[ 6384.469629]  ? path_openat+0x7ef/0xfe0
[ 6384.469635]  ? down_write+0x11e/0x130
[ 6384.469638]  down_write+0x11e/0x130
[ 6384.469642]  ? down_read_killable+0x1e0/0x1e0
[ 6384.469646]  ? __sb_start_write+0x11c/0x170
[ 6384.469650]  ? __mnt_want_write+0xb4/0xd0
[ 6384.469655]  path_openat+0x7ef/0xfe0
[ 6384.469661]  ? path_mountpoint+0x4d0/0x4d0
[ 6384.469667]  ? __is_insn_slot_addr+0x93/0xb0
[ 6384.469671]  ? kernel_text_address+0x113/0x120
[ 6384.469674]  ? __kernel_text_address+0xe/0x30
[ 6384.469679]  ? unwind_get_return_address+0x2f/0x50
[ 6384.469683]  ? swiotlb_map.cold+0x25/0x25
[ 6384.469687]  ? arch_stack_walk+0x8f/0xe0
[ 6384.469692]  do_filp_open+0x12b/0x1c0
[ 6384.469695]  ? may_open_dev+0x50/0x50
[ 6384.469702]  ? __alloc_fd+0x115/0x280
[ 6384.469705]  ? lock_downgrade+0x350/0x350
[ 6384.469709]  ? do_raw_spin_lock+0x113/0x1d0
[ 6384.469713]  ? rwlock_bug.part.0+0x60/0x60
[ 6384.469718]  ? do_raw_spin_unlock+0xa3/0x130
[ 6384.469722]  ? _raw_spin_unlock+0x24/0x30
[ 6384.469725]  ? __alloc_fd+0x115/0x280
[ 6384.469731]  do_sys_open+0x1f0/0x2d0
[ 6384.469735]  ? filp_open+0x50/0x50
[ 6384.469738]  ? switch_fpu_return+0x13e/0x230
[ 6384.469742]  ? __do_page_fault+0x4b5/0x670
[ 6384.469748]  do_syscall_64+0x63/0x1c0
[ 6384.469753]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6384.469756] RIP: 0033:0x7fe961434528
[ 6384.469760] Code: 00 00 41 00 3d 00 00 41 00 74 47 48 8d 05 20 4d 0d 00 8b 
00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 
f0 ff ff 0f 87 94 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
[ 6384.469762] RSP: 002b:7ffd9bbabb20 EFLAGS: 0246 ORIG_RAX: 
0101
[ 6384.469765] RAX: ffda RBX: 0242 RCX: 7fe961434528
[ 6384.469767] RDX: 0242 RSI: 7ffd9bbae2a5 RDI: ff9c
[ 6384.469769] RBP: 7ffd9bbae2a5 R08: 0001 R09: 
[ 6384.469771] R10: 0180 R11: 0246 R12: 0242
[ 6384.469773] R13: 7ffd9bbabe00 R14: 0180 R15: 0060

[ 6384.470018] Allocated by task 16593:
[ 6384.470562]  __kasan_kmalloc.part.0+0x3c/0xa0
[ 6384.470565]  kmem_cache_alloc+0xdc/0x240
[ 6384.470569]  copy_process+0x1dce/0x27b0
[ 6384.470572]  _do_fork+0xec/0x540
[ 6384.470576]  __se_sys_clone+0xb2/0x100
[ 6384.470581]  do_syscall_64+0x63/0x1c0
[ 6384.470586]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 6384.470823] Freed by task 9:
[ 6384.471235]  __kasan_slab_free+0x147/0x200
[ 6384.471240]  kmem_cache_free+0x111/0x330
[ 6384.471246]  rcu_core+0x2f9/0x830
[ 6384.471251]  __do_softirq+0x154/0x486

[ 6384.471493] The buggy address belongs to the object at 8881d5dc9440
which 
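
For readers unfamiliar with the change under discussion: the quoted commit
message describes keeping a task pointer and a few state flags in one
atomically updated word, with helpers to separate the two.  The block below
is only a userspace sketch of that general idea, assuming C11 atomics and a
pointer aligned so its two low bits are free.  Every demo_* name and both
flag values are hypothetical; this is not the kernel's rwsem code and it
does not reproduce the KASAN report above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag bits stored in the low bits of the owner word. */
#define OWNER_FLAG_A     ((uintptr_t)0x1)
#define OWNER_FLAG_B     ((uintptr_t)0x2)
#define OWNER_FLAGS_MASK ((uintptr_t)0x3)

/* Stand-in for a task structure; aligned so the two low pointer bits are 0. */
struct demo_task {
	_Alignas(8) long id;
};

/* Stand-in for a semaphore holding a combined pointer+flags owner word. */
struct demo_sem {
	atomic_uintptr_t owner;
};

/* Store the task pointer and flags together in a single atomic word. */
static void demo_set_owner(struct demo_sem *sem, struct demo_task *t,
			   uintptr_t flags)
{
	/* The packing only works if the pointer's low bits are unused. */
	assert(((uintptr_t)t & OWNER_FLAGS_MASK) == 0);
	atomic_store(&sem->owner, (uintptr_t)t | (flags & OWNER_FLAGS_MASK));
}

/* Helper: recover the task pointer by masking the flag bits away. */
static struct demo_task *demo_owner_task(struct demo_sem *sem)
{
	return (struct demo_task *)(atomic_load(&sem->owner) & ~OWNER_FLAGS_MASK);
}

/* Helper: recover only the flag bits. */
static uintptr_t demo_owner_flags(struct demo_sem *sem)
{
	return atomic_load(&sem->owner) & OWNER_FLAGS_MASK;
}

/* Set a flag atomically without disturbing the pointer bits. */
static void demo_set_flag(struct demo_sem *sem, uintptr_t flag)
{
	atomic_fetch_or(&sem->owner, flag & OWNER_FLAGS_MASK);
}
```

The point of the atomic type is visible in demo_set_flag(): a flag can be
updated with one read-modify-write instead of a separate load and store.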

[PATCH 0/4] Sleeping functions in invalid context bug fixes

2019-07-19 Thread Luis Henriques
Hi,

I'm sending three "sleeping function called from invalid context" bug
fixes that I had on my TODO for a while.  All of them are ceph_buffer_put
related, and all the fixes follow the same pattern: delay the operation
until the ci->i_ceph_lock is released.

The first patch simply allows ceph_buffer_put to receive a NULL buffer so
that the NULL check doesn't need to be performed in all the other patches.
IOW, it's not really required, just convenient.

(Note: maybe these patches should all be tagged for stable.)

Luis Henriques (4):
  libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
  ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
  ceph: fix buffer free while holding i_ceph_lock in
__ceph_build_xattrs_blob()
  ceph: fix buffer free while holding i_ceph_lock in fill_inode()

 fs/ceph/caps.c  |  5 -
 fs/ceph/inode.c |  7 ---
 fs/ceph/snap.c  |  4 +++-
 fs/ceph/super.h |  2 +-
 fs/ceph/xattr.c | 19 ++-
 include/linux/ceph/buffer.h |  3 ++-
 6 files changed, 28 insertions(+), 12 deletions(-)
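
The pattern shared by these fixes (detach the old buffer while the lock is
held, drop the reference only after unlocking, and let the put helper
accept NULL) can be sketched in plain userspace C.  This is an illustration
under stated assumptions, not the ceph code: the spinlock is modelled by a
simple depth counter, the "may sleep" release by a function that records
whether it ran under the lock, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a refcounted buffer. */
struct demo_buffer {
	int refcount;
};

/* Stand-in for an inode that owns one buffer, protected by a lock. */
struct demo_inode {
	struct demo_buffer *blob;
};

static int demo_atomic_depth;    /* > 0 while the pretend spinlock is held */
static int demo_sleep_in_atomic; /* releases that ran with the lock held */
static int demo_released;        /* total number of releases */

static void demo_lock(void)
{
	demo_atomic_depth++;
}

static void demo_unlock(void)
{
	assert(demo_atomic_depth > 0);
	demo_atomic_depth--;
}

static struct demo_buffer *demo_buffer_new(void)
{
	struct demo_buffer *b = malloc(sizeof(*b));

	if (b)
		b->refcount = 1;
	return b;
}

/* Pretend this may sleep: note a violation if the lock is held. */
static void demo_buffer_release(struct demo_buffer *b)
{
	if (demo_atomic_depth > 0)
		demo_sleep_in_atomic++;
	demo_released++;
	free(b);
}

/* NULL-tolerant put, so callers need no check of their own. */
static void demo_buffer_put(struct demo_buffer *b)
{
	if (b && --b->refcount == 0)
		demo_buffer_release(b);
}

/* Swap in a new blob; the old one is only remembered under the lock. */
static void demo_replace_blob(struct demo_inode *in,
			      struct demo_buffer *new_blob)
{
	struct demo_buffer *old_blob;

	demo_lock();
	old_blob = in->blob;       /* must not be released here */
	in->blob = new_blob;
	demo_unlock();

	demo_buffer_put(old_blob); /* safe: lock dropped, NULL is fine */
}
```

Because demo_buffer_put() tolerates NULL, the first call on an inode with
no blob needs no special casing, which is the convenience the cover letter
attributes to patch 1/4.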



[PATCH 1/4] libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer

2019-07-19 Thread Luis Henriques
Signed-off-by: Luis Henriques 
---
 include/linux/ceph/buffer.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/buffer.h b/include/linux/ceph/buffer.h
index 5e58bb29b1a3..11cdc7c60480 100644
--- a/include/linux/ceph/buffer.h
+++ b/include/linux/ceph/buffer.h
@@ -30,7 +30,8 @@ static inline struct ceph_buffer *ceph_buffer_get(struct 
ceph_buffer *b)
 
 static inline void ceph_buffer_put(struct ceph_buffer *b)
 {
-   kref_put(&b->kref, ceph_buffer_release);
+   if (b)
+   kref_put(&b->kref, ceph_buffer_release);
 }
 
 extern int ceph_decode_buffer(struct ceph_buffer **b, void **p, void *end);


[PATCH 2/4] ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()

2019-07-19 Thread Luis Henriques
Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
i_xattrs.prealloc_blob buffer while holding the i_ceph_lock.  This can be
fixed by postponing the call until later, when the lock is released.

The following backtrace was triggered by fstests generic/117.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
  3 locks held by fsstress/650:
   #0: 870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
   #1: ba0c4c74 (>i_mutex_dir_key#6){}, at: 
vfs_setxattr+0x55/0xa0
   #2: 8dfbb3f2 (&(>i_ceph_lock)->rlock){+.+.}, at: 
__ceph_setxattr+0x297/0x810
  CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   __ceph_setxattr+0x2b4/0x810
   __vfs_setxattr+0x66/0x80
   __vfs_setxattr_noperm+0x59/0xf0
   vfs_setxattr+0x81/0xa0
   setxattr+0x115/0x230
   ? filename_lookup+0xc9/0x140
   ? rcu_read_lock_sched_held+0x74/0x80
   ? rcu_sync_lockdep_assert+0x2e/0x60
   ? __sb_start_write+0x142/0x1a0
   ? mnt_want_write+0x20/0x50
   path_setxattr+0xba/0xd0
   __x64_sys_lsetxattr+0x24/0x30
   do_syscall_64+0x50/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7ff23514359a

Signed-off-by: Luis Henriques 
---
 fs/ceph/xattr.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 37b458a9af3a..c083557b3657 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1036,6 +1036,7 @@ int __ceph_setxattr(struct inode *inode, const char *name,
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
struct ceph_cap_flush *prealloc_cf = NULL;
+   struct ceph_buffer *old_blob = NULL;
int issued;
int err;
int dirty = 0;
@@ -1109,13 +1110,15 @@ int __ceph_setxattr(struct inode *inode, const char 
*name,
struct ceph_buffer *blob;
 
spin_unlock(&ci->i_ceph_lock);
-   dout(" preaallocating new blob size=%d\n", required_blob_size);
+   ceph_buffer_put(old_blob); /* Shouldn't be required */
+   dout(" pre-allocating new blob size=%d\n", required_blob_size);
blob = ceph_buffer_new(required_blob_size, GFP_NOFS);
if (!blob)
goto do_sync_unlocked;
spin_lock(&ci->i_ceph_lock);
+   /* prealloc_blob can't be released while holding i_ceph_lock */
if (ci->i_xattrs.prealloc_blob)
-   ceph_buffer_put(ci->i_xattrs.prealloc_blob);
+   old_blob = ci->i_xattrs.prealloc_blob;
ci->i_xattrs.prealloc_blob = blob;
goto retry;
}
@@ -1131,6 +1134,7 @@ int __ceph_setxattr(struct inode *inode, const char *name,
}
 
spin_unlock(&ci->i_ceph_lock);
+   ceph_buffer_put(old_blob);
if (lock_snap_rwsem)
up_read(&mdsc->snap_rwsem);
if (dirty)


[PATCH 3/4] ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()

2019-07-19 Thread Luis Henriques
Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
freeing the i_xattrs.blob buffer while holding the i_ceph_lock.  This can
be fixed by having this function return the old blob buffer and having
its callers free it once the lock is released.

The following backtrace was triggered by fstests generic/117.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
  4 locks held by fsstress/649:
   #0: a7478e7e (>s_umount_key#19){}, at: 
iterate_supers+0x77/0xf0
   #1: f8de1423 (&(>i_ceph_lock)->rlock){+.+.}, at: 
ceph_check_caps+0x7b/0xc60
   #2: 562f2b27 (>s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
   #3: f83ce16a (>snap_rwsem){}, at: 
ceph_check_caps+0x3ed/0xc60
  CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   __ceph_build_xattrs_blob+0x12b/0x170
   __send_cap+0x302/0x540
   ? __lock_acquire+0x23c/0x1e40
   ? __mark_caps_flushing+0x15c/0x280
   ? _raw_spin_unlock+0x24/0x30
   ceph_check_caps+0x5f0/0xc60
   ceph_flush_dirty_caps+0x7c/0x150
   ? __ia32_sys_fdatasync+0x20/0x20
   ceph_sync_fs+0x5a/0x130
   iterate_supers+0x8f/0xf0
   ksys_sync+0x4f/0xb0
   __ia32_sys_sync+0xa/0x10
   do_syscall_64+0x50/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fc6409ab617

Signed-off-by: Luis Henriques 
---
 fs/ceph/caps.c  |  5 -
 fs/ceph/snap.c  |  4 +++-
 fs/ceph/super.h |  2 +-
 fs/ceph/xattr.c | 11 ---
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d98dcd976c80..ce0f5658720a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1301,6 +1301,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, 
struct ceph_cap *cap,
 {
struct ceph_inode_info *ci = cap->ci;
struct inode *inode = &ci->vfs_inode;
+   struct ceph_buffer *old_blob = NULL;
struct cap_msg_args arg;
int held, revoking;
int wake = 0;
@@ -1365,7 +1366,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, 
struct ceph_cap *cap,
ci->i_requested_max_size = arg.max_size;
 
if (flushing & CEPH_CAP_XATTR_EXCL) {
-   __ceph_build_xattrs_blob(ci);
+   old_blob = __ceph_build_xattrs_blob(ci);
arg.xattr_version = ci->i_xattrs.version;
arg.xattr_buf = ci->i_xattrs.blob;
} else {
@@ -1409,6 +1410,8 @@ static int __send_cap(struct ceph_mds_client *mdsc, 
struct ceph_cap *cap,
 
spin_unlock(&ci->i_ceph_lock);
 
+   ceph_buffer_put(old_blob);
+
ret = send_cap_msg(&arg);
if (ret < 0) {
dout("error sending cap msg, must requeue %p\n", inode);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 4c6494eb02b5..ccfcc66aaf44 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -465,6 +465,7 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
struct inode *inode = &ci->vfs_inode;
struct ceph_cap_snap *capsnap;
struct ceph_snap_context *old_snapc, *new_snapc;
+   struct ceph_buffer *old_blob = NULL;
int used, dirty;
 
capsnap = kzalloc(sizeof(*capsnap), GFP_NOFS);
@@ -541,7 +542,7 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
capsnap->gid = inode->i_gid;
 
if (dirty & CEPH_CAP_XATTR_EXCL) {
-   __ceph_build_xattrs_blob(ci);
+   old_blob = __ceph_build_xattrs_blob(ci);
capsnap->xattr_blob =
ceph_buffer_get(ci->i_xattrs.blob);
capsnap->xattr_version = ci->i_xattrs.version;
@@ -584,6 +585,7 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
}
spin_unlock(&ci->i_ceph_lock);
 
+   ceph_buffer_put(old_blob);
kfree(capsnap);
ceph_put_snap_context(old_snapc);
 }
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index d2352fd95dbc..6b9f1ee7de85 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -926,7 +926,7 @@ extern int ceph_getattr(const struct path *path, struct 
kstat *stat,
 int __ceph_setxattr(struct inode *, const char *, const void *, size_t, int);
 ssize_t __ceph_getxattr(struct inode *, const char *, void *, size_t);
 extern ssize_t ceph_listxattr(struct dentry *, char *, size_t);
-extern void __ceph_build_xattrs_blob(struct ceph_inode_info *ci);
+extern struct ceph_buffer *__ceph_build_xattrs_blob(struct ceph_inode_info 
*ci);
 extern void __ceph_destroy_xattrs(struct ceph_inode_info *ci);
 extern const struct xattr_handler *ceph_xattr_handlers[];
 
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index c083557b3657..939eab7aa219 100644
--- a/fs/ceph/xattr.c
+++ 

[PATCH 4/4] ceph: fix buffer free while holding i_ceph_lock in fill_inode()

2019-07-19 Thread Luis Henriques
Calling ceph_buffer_put() in fill_inode() may result in freeing the
i_xattrs.blob buffer while holding the i_ceph_lock.  This can be fixed by
postponing the call until later, when the lock is released.

The following backtrace was triggered by fstests generic/070.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
  6 locks held by kworker/0:4/3852:
   #0: 4270f6bb ((wq_completion)ceph-msgr){+.+.}, at: 
process_one_work+0x1b8/0x5f0
   #1: eb420803 ((work_completion)(&(>work)->work)){+.+.}, at: 
process_one_work+0x1b8/0x5f0
   #2: be1c53a4 (>s_mutex){+.+.}, at: dispatch+0x288/0x1476
   #3: 559cb958 (>snap_rwsem){}, at: dispatch+0x2eb/0x1476
   #4: 0d5ebbae (>r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
   #5: a83d0514 (&(>i_ceph_lock)->rlock){+.+.}, at: 
fill_inode.isra.0+0xf8/0xf70
  CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Workqueue: ceph-msgr ceph_con_workfn
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   fill_inode.isra.0+0xa9b/0xf70
   ceph_fill_trace+0x13b/0xc70
   ? dispatch+0x2eb/0x1476
   dispatch+0x320/0x1476
   ? __mutex_unlock_slowpath+0x4d/0x2a0
   ceph_con_workfn+0xc97/0x2ec0
   ? process_one_work+0x1b8/0x5f0
   process_one_work+0x244/0x5f0
   worker_thread+0x4d/0x3e0
   kthread+0x105/0x140
   ? process_one_work+0x5f0/0x5f0
   ? kthread_park+0x90/0x90
   ret_from_fork+0x3a/0x50

Signed-off-by: Luis Henriques 
---
 fs/ceph/inode.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 791f84a13bb8..18500edefc56 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -736,6 +736,7 @@ static int fill_inode(struct inode *inode, struct page 
*locked_page,
int issued, new_issued, info_caps;
struct timespec64 mtime, atime, ctime;
struct ceph_buffer *xattr_blob = NULL;
+   struct ceph_buffer *old_blob = NULL;
struct ceph_string *pool_ns = NULL;
struct ceph_cap *new_cap = NULL;
int err = 0;
@@ -881,7 +882,7 @@ static int fill_inode(struct inode *inode, struct page 
*locked_page,
if ((ci->i_xattrs.version == 0 || !(issued & CEPH_CAP_XATTR_EXCL))  &&
le64_to_cpu(info->xattr_version) > ci->i_xattrs.version) {
if (ci->i_xattrs.blob)
-   ceph_buffer_put(ci->i_xattrs.blob);
+   old_blob = ci->i_xattrs.blob;
ci->i_xattrs.blob = xattr_blob;
if (xattr_blob)
memcpy(ci->i_xattrs.blob->vec.iov_base,
@@ -1022,8 +1023,8 @@ static int fill_inode(struct inode *inode, struct page 
*locked_page,
 out:
if (new_cap)
ceph_put_cap(mdsc, new_cap);
-   if (xattr_blob)
-   ceph_buffer_put(xattr_blob);
+   ceph_buffer_put(old_blob);
+   ceph_buffer_put(xattr_blob);
ceph_put_string(pool_ns);
return err;
 }


[PATCH] ceph: use generic_delete_inode() for ->drop_inode

2019-07-05 Thread Luis Henriques
ceph_drop_inode() implementation is not any different from the generic
function, thus there's no point in keeping it around.

Signed-off-by: Luis Henriques 
---
 fs/ceph/inode.c | 10 --
 fs/ceph/super.c |  2 +-
 fs/ceph/super.h |  1 -
 3 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 761451f36e2d..211140e6ef9c 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -578,16 +578,6 @@ void ceph_destroy_inode(struct inode *inode)
ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
 }
 
-int ceph_drop_inode(struct inode *inode)
-{
-   /*
-* Positve dentry and corresponding inode are always accompanied
-* in MDS reply. So no need to keep inode in the cache after
-* dropping all its aliases.
-*/
-   return 1;
-}
-
 static inline blkcnt_t calc_inode_blocks(u64 size)
 {
return (size + (1<<9) - 1) >> 9;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index d57fa60dcd43..b4a4772756cb 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -843,7 +843,7 @@ static const struct super_operations ceph_super_ops = {
.destroy_inode  = ceph_destroy_inode,
.free_inode = ceph_free_inode,
.write_inode= ceph_write_inode,
-   .drop_inode = ceph_drop_inode,
+   .drop_inode = generic_delete_inode,
.sync_fs= ceph_sync_fs,
.put_super  = ceph_put_super,
.remount_fs = ceph_remount,
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 5f27e1f7f2d6..622e6c96c960 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -878,7 +878,6 @@ extern const struct inode_operations ceph_file_iops;
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
 extern void ceph_destroy_inode(struct inode *inode);
 extern void ceph_free_inode(struct inode *inode);
-extern int ceph_drop_inode(struct inode *inode);
 
 extern struct inode *ceph_get_inode(struct super_block *sb,
struct ceph_vino vino);


Re: [PATCH] ceph: fix end offset in truncate_inode_pages_range call

2019-07-02 Thread Luis Henriques
"Jeff Layton"  writes:

> On Mon, 2019-07-01 at 18:16 +0100, Luis Henriques wrote:
>> Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to
>> filemap_write_and_wait_range()") fixed the end offset parameter used to
>> call filemap_write_and_wait_range and invalidate_inode_pages2_range.
>> Unfortunately it missed truncate_inode_pages_range, introducing a
>> regression that is easily detected by xfstest generic/130.
>> 
>> The problem is that when doing direct IO it is possible that an extra page
>> is truncated from the page cache when the end offset is page aligned.
>> This can cause data loss if that page hasn't been sync'ed to the OSDs.
>> 
>> While there, change code to use PAGE_ALIGN macro instead.
>> 
>> Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to 
>> filemap_write_and_wait_range()")
>> Signed-off-by: Luis Henriques 
>> ---
>>  fs/ceph/file.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 183c37c0a8fc..7a57db8e2fa9 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1007,7 +1007,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct 
>> iov_iter *iter,
>>   * may block.
>>   */
>>  truncate_inode_pages_range(inode->i_mapping, pos,
>> -(pos+len) | (PAGE_SIZE - 1));
>> +   PAGE_ALIGN(pos + len) - 1);
>>  
>>  req->r_mtime = mtime;
>>  }
>
> Luis, should this be sent to stable? It seems like a data corruption
> problem...

Yes, I believe so.  But I believe all the active stable kernels that
include commit e450f4d1a5d6 (or a backport of it) will pick it up anyway
due to the 'Fixes:' tag.  AFAIK only 5.1 and 5.2 are affected.

Cheers,
-- 
Luis


[PATCH] ceph: fix end offset in truncate_inode_pages_range call

2019-07-01 Thread Luis Henriques
Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to
filemap_write_and_wait_range()") fixed the end offset parameter used to
call filemap_write_and_wait_range and invalidate_inode_pages2_range.
Unfortunately it missed truncate_inode_pages_range, introducing a
regression that is easily detected by xfstest generic/130.

The problem is that when doing direct IO it is possible that an extra page
is truncated from the page cache when the end offset is page aligned.
This can cause data loss if that page hasn't been sync'ed to the OSDs.

While there, change code to use PAGE_ALIGN macro instead.

Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to 
filemap_write_and_wait_range()")
Signed-off-by: Luis Henriques 
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 183c37c0a8fc..7a57db8e2fa9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1007,7 +1007,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct 
iov_iter *iter,
 * may block.
 */
truncate_inode_pages_range(inode->i_mapping, pos,
-   (pos+len) | (PAGE_SIZE - 1));
+  PAGE_ALIGN(pos + len) - 1);
 
req->r_mtime = mtime;
}
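
The one-line change is easier to follow with numbers.  The sketch below
compares the old and new end-offset expressions, assuming a 4096-byte page
and an inclusive end offset, as the commit message describes.  The demo_*
helpers are hypothetical stand-ins written for this note; demo_page_align()
only mimics what the kernel's PAGE_ALIGN macro does.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Round x up to the next multiple of the page size (like PAGE_ALIGN). */
static uint64_t demo_page_align(uint64_t x)
{
	return (x + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/* Old expression: last byte of the page that contains offset pos+len. */
static uint64_t demo_old_lend(uint64_t pos, uint64_t len)
{
	return (pos + len) | (DEMO_PAGE_SIZE - 1);
}

/* New expression: last byte of the page that contains offset pos+len-1. */
static uint64_t demo_new_lend(uint64_t pos, uint64_t len)
{
	assert(pos + len > 0); /* aligning 0 and subtracting 1 would wrap */
	return demo_page_align(pos + len) - 1;
}
```

For a write of 4096 bytes at offset 0 the old expression yields 8191, so the
range also covers the second page (bytes 4096..8191) that the I/O never
touched, while the new one yields 4095.  For an unaligned end, such as 5000
bytes at offset 0, both yield 8191, which is why the problem only shows up
when the end offset is page aligned.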


Re: [RFC PATCH] ceph: initialize superblock s_time_gran to 1

2019-06-28 Thread Luis Henriques
Jeff Layton  writes:

> On Thu, 2019-06-27 at 15:44 +, Sage Weil wrote:
>> On Thu, 27 Jun 2019, Jeff Layton wrote:
>> > On Thu, 2019-06-27 at 14:51 +0100, Luis Henriques wrote:
>> > > Having granularity set to 1us results in inode timestamps with an
>> > > accuracy different from the fuse client (i.e. atime, ctime and mtime
>> > > will always end with '000').  This patch normalizes this behaviour and
>> > > sets the granularity to 1ns.
>> > > 
>> > > Signed-off-by: Luis Henriques 
>> > > ---
>> > >  fs/ceph/super.c | 2 +-
>> > >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > > 
>> > > Hi!
>> > > 
>> > > As far as I could see there are no other side-effects of changing
>> > > s_time_gran but I'm really not sure why it was initially set to 1000 in
>> > > the first place so I may be missing something.
>> > > 
>> > > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>> > > index d57fa60dcd43..35dd75bc9cd0 100644
>> > > --- a/fs/ceph/super.c
>> > > +++ b/fs/ceph/super.c
>> > > @@ -980,7 +980,7 @@ static int ceph_set_super(struct super_block *s, 
>> > > void *data)
>> > > s->s_d_op = &ceph_dentry_ops;
>> > > s->s_export_op = &ceph_export_ops;
>> > >  
>> > > -   s->s_time_gran = 1000;  /* 1000 ns == 1 us */
>> > > +   s->s_time_gran = 1;
>> > >  
>> > > ret = set_anon_super(s, NULL);  /* what is that second arg for? */
>> > > if (ret != 0)
>> > 
>> > 
>> > Looks like it was set that way since the client code was originally
>> > merged. Was this an earlier limitation of ceph that is no longer
>> > applicable?
>> > 
>> > In any case, I see no need at all to keep this at 1000, so:
>> 
>> As long as the encoded on-write time value is at ns resolution, I 
>> agree!  No recollection of why I did this :(
>> 
>> Reviewed-by: Sage Weil 
>
> Good enough for me. I went ahead and merged this into the testing
> branch. Assuming nothing breaks, this should make v5.3.

Awesome, thanks.  AFAICS it shouldn't break anything, especially because
the fuse client seems to be using ns resolution too.  But yeah,
unexpected side-effects show up in unexpected ways :-)

Cheers,
-- 
Luis


[RFC PATCH] ceph: initialize superblock s_time_gran to 1

2019-06-27 Thread Luis Henriques
Having granularity set to 1us results in inode timestamps with an
accuracy different from the fuse client (i.e. atime, ctime and mtime will
always end with '000').  This patch normalizes this behaviour and sets the
granularity to 1ns.

Signed-off-by: Luis Henriques 
---
 fs/ceph/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Hi!

As far as I could see there are no other side-effects of changing
s_time_gran but I'm really not sure why it was initially set to 1000 in
the first place so I may be missing something.

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index d57fa60dcd43..35dd75bc9cd0 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -980,7 +980,7 @@ static int ceph_set_super(struct super_block *s, void *data)
s->s_d_op = &ceph_dentry_ops;
s->s_export_op = &ceph_export_ops;
 
-   s->s_time_gran = 1000;  /* 1000 ns == 1 us */
+   s->s_time_gran = 1;
 
ret = set_anon_super(s, NULL);  /* what is that second arg for? */
if (ret != 0)
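
The '000' symptom follows from how a timestamp granularity is typically
applied: the nanosecond field is rounded down to a multiple of the
granularity.  The helper below is a minimal sketch of that arithmetic,
assuming round-down behaviour; demo_truncate_nsec() is a hypothetical name
and is not a kernel function.

```c
#include <assert.h>

/*
 * Round a nanosecond value down to a multiple of gran (in nanoseconds).
 * Assumes 0 <= nsec < 1e9 and gran >= 1.
 */
static long demo_truncate_nsec(long nsec, long gran)
{
	assert(gran >= 1);
	assert(nsec >= 0 && nsec < 1000000000L);
	return nsec - (nsec % gran);
}
```

With a granularity of 1000, a value of 123456789 ns becomes 123456000 ns,
so the last three digits are always zero.  With a granularity of 1 the
value is left unchanged.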


Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-04-16 Thread Luis Henriques
Luis Henriques  writes:

> "Yan, Zheng"  writes:
>
>> On Fri, Mar 22, 2019 at 6:04 PM Luis Henriques  wrote:
>>>
>>> Luis Henriques  writes:
>>>
>>> > "Yan, Zheng"  writes:
>>> >
>>> >> On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques  
>>> >> wrote:
>>> >>>
>>> >>> "Yan, Zheng"  writes:
>>> >>>
>>> >>> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques  
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> "Yan, Zheng"  writes:
>>> >>> >>
>>> >>> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques 
>>> >>> >> >  wrote:
>>> >>> >> >>
>>> >>> >> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
>>> >>> >> >>
>>> >>> >> >> unreferenced object 0x8881fccca940 (size 32):
>>> >>> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
>>> >>> >> >>   hex dump (first 32 bytes):
>>> >>> >> >> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
>>> >>> >> >> 
>>> >>> >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>>> >>> >> >> 
>>> >>> >> >>   backtrace:
>>> >>> >> >> [<d741a1ea>] build_snap_context+0x5b/0x2a0
>>> >>> >> >> [<21a00533>] rebuild_snap_realms+0x27/0x90
>>> >>> >> >> [<ac538600>] rebuild_snap_realms+0x42/0x90
>>> >>> >> >> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
>>> >>> >> >> [<a9550416>] ceph_handle_snap+0x317/0x5f3
>>> >>> >> >> [<fc287b83>] dispatch+0x362/0x176c
>>> >>> >> >> [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
>>> >>> >> >> [<4168e3a9>] process_one_work+0x1d4/0x400
>>> >>> >> >> [<2188e9e7>] worker_thread+0x2d/0x3c0
>>> >>> >> >> [<b593e4b3>] kthread+0x112/0x130
>>> >>> >> >> [<a8587dca>] ret_from_fork+0x35/0x40
>>> >>> >> >> [<ba1c9c1d>] 0x
>>> >>> >> >>
>>> >>> >> >> It looks like it is possible that we miss a flush_ack from the 
>>> >>> >> >> MDS when,
>>> >>> >> >> for example, umounting the filesystem.  In that case, we can 
>>> >>> >> >> simply drop
>>> >>> >> >> the reference to the ceph_snap_context obtained in 
>>> >>> >> >> ceph_queue_cap_snap().
>>> >>> >> >>
>>> >>> >> >> Link: https://tracker.ceph.com/issues/38224
>>> >>> >> >> Cc: sta...@vger.kernel.org
>>> >>> >> >> Signed-off-by: Luis Henriques 
>>> >>> >> >> ---
>>> >>> >> >>  fs/ceph/caps.c | 7 +++
>>> >>> >> >>  1 file changed, 7 insertions(+)
>>> >>> >> >>
>>> >>> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> >>> >> >> index 36a8dc699448..208f4dc6f574 100644
>>> >>> >> >> --- a/fs/ceph/caps.c
>>> >>> >> >> +++ b/fs/ceph/caps.c
>>> >>> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
>>> >>> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
>>> >>> >> >>  {
>>> >>> >> >> struct ceph_snap_realm *realm = ci->i_snap_realm;
>>> >>> >> >> +
>>> >>> >> >> spin_lock(&realm->inodes_with_caps_lock);
>>> >>> >> >> list_del_init(&ci->i_snap_realm_item);
>>> >>> >> >> ci->i_snap_realm_counter++;
>>> >

Re: [PATCH] ceph: Fix a memory leak in ci->i_head_snapc

2019-04-03 Thread Luis Henriques
"Yan, Zheng"  writes:

> On Fri, Mar 22, 2019 at 6:04 PM Luis Henriques  wrote:
>>
>> Luis Henriques  writes:
>>
>> > "Yan, Zheng"  writes:
>> >
>> >> On Tue, Mar 19, 2019 at 12:22 AM Luis Henriques  
>> >> wrote:
>> >>>
>> >>> "Yan, Zheng"  writes:
>> >>>
>> >>> > On Mon, Mar 18, 2019 at 6:33 PM Luis Henriques  
>> >>> > wrote:
>> >>> >>
>> >>> >> "Yan, Zheng"  writes:
>> >>> >>
>> >>> >> > On Fri, Mar 15, 2019 at 7:13 PM Luis Henriques 
>> >>> >> >  wrote:
>> >>> >> >>
>> >>> >> >> I'm occasionally seeing a kmemleak warning in xfstest generic/013:
>> >>> >> >>
>> >>> >> >> unreferenced object 0x8881fccca940 (size 32):
>> >>> >> >>   comm "kworker/0:1", pid 12, jiffies 4295005883 (age 130.648s)
>> >>> >> >>   hex dump (first 32 bytes):
>> >>> >> >> 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  
>> >>> >> >> 
>> >>> >> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> >>> >> >> 
>> >>> >> >>   backtrace:
>> >>> >> >> [<d741a1ea>] build_snap_context+0x5b/0x2a0
>> >>> >> >> [<21a00533>] rebuild_snap_realms+0x27/0x90
>> >>> >> >> [<ac538600>] rebuild_snap_realms+0x42/0x90
>> >>> >> >> [<0e955fac>] ceph_update_snap_trace+0x2ee/0x610
>> >>> >> >> [<a9550416>] ceph_handle_snap+0x317/0x5f3
>> >>> >> >> [<fc287b83>] dispatch+0x362/0x176c
>> >>> >> >> [<a312c741>] ceph_con_workfn+0x9ce/0x2cf0
>> >>> >> >> [<4168e3a9>] process_one_work+0x1d4/0x400
>> >>> >> >> [<2188e9e7>] worker_thread+0x2d/0x3c0
>> >>> >> >> [<b593e4b3>] kthread+0x112/0x130
>> >>> >> >> [<a8587dca>] ret_from_fork+0x35/0x40
>> >>> >> >> [<ba1c9c1d>] 0x
>> >>> >> >>
>> >>> >> >> It looks like it is possible that we miss a flush_ack from the MDS 
>> >>> >> >> when,
>> >>> >> >> for example, umounting the filesystem.  In that case, we can 
>> >>> >> >> simply drop
>> >>> >> >> the reference to the ceph_snap_context obtained in 
>> >>> >> >> ceph_queue_cap_snap().
>> >>> >> >>
>> >>> >> >> Link: https://tracker.ceph.com/issues/38224
>> >>> >> >> Cc: sta...@vger.kernel.org
>> >>> >> >> Signed-off-by: Luis Henriques 
>> >>> >> >> ---
>> >>> >> >>  fs/ceph/caps.c | 7 +++
>> >>> >> >>  1 file changed, 7 insertions(+)
>> >>> >> >>
>> >>> >> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> >>> >> >> index 36a8dc699448..208f4dc6f574 100644
>> >>> >> >> --- a/fs/ceph/caps.c
>> >>> >> >> +++ b/fs/ceph/caps.c
>> >>> >> >> @@ -1054,6 +1054,7 @@ int ceph_is_any_caps(struct inode *inode)
>> >>> >> >>  static void drop_inode_snap_realm(struct ceph_inode_info *ci)
>> >>> >> >>  {
>> >>> >> >> struct ceph_snap_realm *realm = ci->i_snap_realm;
>> >>> >> >> +
>> >>> >> >> spin_lock(&realm->inodes_with_caps_lock);
>> >>> >> >> list_del_init(&ci->i_snap_realm_item);
>> >>> >> >> ci->i_snap_realm_counter++;
>> >>> >> >> @@ -1063,6 +1064,12 @@ static void drop_inode_snap_realm(struct 
>> >>> >> >> ceph_inode_info *ci)
>> >>> >> >> spin_unlock(&realm->inodes_with_caps_lock);
>> >>> >> >> 
>> >>> &g
