Re: [PATCH] net: sunrpc: xprt.c: fix shift-out-of-bounds in xprt_calc_majortimeo

2021-04-19 Thread Trond Myklebust
On Mon, 2021-04-19 at 16:36 +, Kurt Manucredo wrote:
> Fix shift-out-of-bounds in xprt_calc_majortimeo().
> 
> UBSAN: shift-out-of-bounds in net/sunrpc/xprt.c:658:14
> shift exponent 536871232 is too large for 64-bit type 'long u
> 
> Reported-by: syzbot+ba2e91df8f7480941...@syzkaller.appspotmail.com
> Signed-off-by: Kurt Manucredo 
> ---
>  net/sunrpc/xprt.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 691ccf8049a4..07128ac3d51d 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -655,7 +655,10 @@ static unsigned long xprt_calc_majortimeo(struct
> rpc_rqst *req)
> unsigned long majortimeo = req->rq_timeout;
>  
> if (to->to_exponential)
> -   majortimeo <<= to->to_retries;
> +   if (to->to_retries >= sizeof(majortimeo) * 8)
> +   majortimeo = to->to_maxval;
> +   else
> +   majortimeo <<= to->to_retries;
> else
> majortimeo += to->to_increment * to->to_retries;
> if (majortimeo > to->to_maxval || majortimeo == 0)

I've already stated on the mailing list that I'm not accepting any
changes to xprt_calc_majortimeo() for this problem. There is a fix to
the NFS mount code that addresses this bounds issue, and that commit is
already being tested in linux-next.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] SUNRPC: Add a check for gss_release_msg

2021-04-08 Thread Trond Myklebust
On Thu, 2021-04-08 at 11:24 -0400, Olga Kornievskaia wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Trond Myklebust <
> tron...@hammerspace.com> wrote:
> > 
> > On Tue, 2021-04-06 at 19:16 -0500, Aditya Pakki wrote:
> > > In gss_pipe_destroy_msg(), in case of error in msg,
> > > gss_release_msg
> > > deletes gss_msg. The patch adds a check to avoid a potential
> > > double
> > > free.
> > > 
> > > Signed-off-by: Aditya Pakki 
> > > ---
> > >  net/sunrpc/auth_gss/auth_gss.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/net/sunrpc/auth_gss/auth_gss.c
> > > b/net/sunrpc/auth_gss/auth_gss.c
> > > index 5f42aa5fc612..eb52eebb3923 100644
> > > --- a/net/sunrpc/auth_gss/auth_gss.c
> > > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > > @@ -848,7 +848,8 @@ gss_pipe_destroy_msg(struct rpc_pipe_msg
> > > *msg)
> > >     warn_gssd();
> > >     gss_release_msg(gss_msg);
> > >     }
> > > -   gss_release_msg(gss_msg);
> > > +   if (gss_msg)
> > > +   gss_release_msg(gss_msg);
> > >  }
> > > 
> > >  static void gss_pipe_dentry_destroy(struct dentry *dir,
> > 
> > 
> > NACK. There's no double free there.
> 
> I disagree that there is no double free, the wording of the commit
> describes the problem in the error case is that we call
> gss_release_msg() and then we call it again but the 1st one released
> the gss_msg. However, I think the fix should probably be instead:
> diff --git a/net/sunrpc/auth_gss/auth_gss.c
> b/net/sunrpc/auth_gss/auth_gss.c
> index 5f42aa5fc612..e8aae617e981 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -846,7 +846,7 @@ gss_pipe_destroy_msg(struct rpc_pipe_msg *msg)
>     gss_unhash_msg(gss_msg);
>     if (msg->errno == -ETIMEDOUT)
>     warn_gssd();
> -   gss_release_msg(gss_msg);
> +   return gss_release_msg(gss_msg);
>     }
>     gss_release_msg(gss_msg);
>  }
> 
Please look one line further up: there is a refcount_inc() that matches
that first gss_release_msg(). Removing that call results in a
reintroduction of the leak that Neil fixed in commit 1cded9d2974fe
("SUNRPC: fix refcounting problems with auth_gss messages.").

We might, however, instead opt to remove both the refcount_inc() and
matching gss_release_msg(). Those do seem to be redundant, given that
the caller also holds a refcount.

> > 
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.mykleb...@hammerspace.com
> > 
> > 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] SUNRPC: Add a check for gss_release_msg

2021-04-08 Thread Trond Myklebust
On Tue, 2021-04-06 at 19:16 -0500, Aditya Pakki wrote:
> In gss_pipe_destroy_msg(), in case of error in msg, gss_release_msg
> deletes gss_msg. The patch adds a check to avoid a potential double
> free.
> 
> Signed-off-by: Aditya Pakki 
> ---
>  net/sunrpc/auth_gss/auth_gss.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/auth_gss/auth_gss.c
> b/net/sunrpc/auth_gss/auth_gss.c
> index 5f42aa5fc612..eb52eebb3923 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -848,7 +848,8 @@ gss_pipe_destroy_msg(struct rpc_pipe_msg *msg)
> warn_gssd();
> gss_release_msg(gss_msg);
> }
> -   gss_release_msg(gss_msg);
> +   if (gss_msg)
> +   gss_release_msg(gss_msg);
>  }
>  
>  static void gss_pipe_dentry_destroy(struct dentry *dir,


NACK. There's no double free there.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: fs_context: validate UDP retrans to prevent shift out-of-bounds

2021-03-16 Thread Trond Myklebust
Hi Randy,

On Tue, 2021-03-16 at 15:35 -0700, Randy Dunlap wrote:
> ping?
> 

I can take this patch for the next merge window.

> On 3/1/21 4:19 PM, Randy Dunlap wrote:
> > Fix shift out-of-bounds in xprt_calc_majortimeo(). This is caused
> > by a garbage timeout (retrans) mount option being passed to nfs
> > mount,
> > in this case from syzkaller.
> > 
> > If the protocol is XPRT_TRANSPORT_UDP, then 'retrans' is a shift
> > value for a 64-bit long integer, so 'retrans' cannot be >= 64.
> > If it is >= 64, fail the mount and return an error.
> > 
> > Fixes: 9954bf92c0cd ("NFS: Move mount parameterisation bits into
> > their own file")
> > Reported-by: syzbot+ba2e91df8f7480941...@syzkaller.appspotmail.com
> > Reported-by: syzbot+f3a0fa110fd630ab5...@syzkaller.appspotmail.com
> > Signed-off-by: Randy Dunlap 
> > Cc: Trond Myklebust 
> > Cc: Anna Schumaker 
> > Cc: linux-...@vger.kernel.org
> > Cc: David Howells 
> > Cc: Al Viro 
> > Cc: sta...@vger.kernel.org
> > ---
> >  fs/nfs/fs_context.c |   12 
> >  1 file changed, 12 insertions(+)
> > 
> > --- lnx-512-rc1.orig/fs/nfs/fs_context.c
> > +++ lnx-512-rc1/fs/nfs/fs_context.c
> > @@ -974,6 +974,15 @@ static int nfs23_parse_monolithic(struct
> >    sizeof(mntfh->data) - mntfh->size);
> >  
> > /*
> > +    * for proto == XPRT_TRANSPORT_UDP, which is what
> > uses
> > +    * to_exponential, implying shift: limit the shift
> > value
> > +    * to BITS_PER_LONG (majortimeo is unsigned long)
> > +    */
> > +   if (!(data->flags & NFS_MOUNT_TCP)) /* this will be
> > UDP */
> > +   if (data->retrans >= 64) /* shift value is
> > too large */
> > +   goto out_invalid_data;
> > +
> > +   /*
> >  * Translate to nfs_fs_context, which
> > nfs_fill_super
> >  * can deal with.
> >      */
> > @@ -1073,6 +1082,9 @@ out_no_address:
> >  
> >  out_invalid_fh:
> > return nfs_invalf(fc, "NFS: invalid root filehandle");
> > +
> > +out_invalid_data:
> > +   return nfs_invalf(fc, "NFS: invalid binary mount data");
> >  }
> >  
> >  #if IS_ENABLED(CONFIG_NFS_V4)
> > 
> 
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] net: sunrpc: fixed function and extra line coding styles

2021-03-16 Thread Trond Myklebust
n cache
>   */
>  struct rpc_cred *
>  rpcauth_lookup_credcache(struct rpc_auth *auth, struct auth_cred *
> acred,
> @@ -566,6 +557,7 @@ rpcauth_lookup_credcache(struct rpc_auth *auth,
> struct auth_cred * acred,
>     cred->cr_ops->cr_init != NULL &&
>     !(flags & RPCAUTH_LOOKUP_NEW)) {
> int res = cred->cr_ops->cr_init(auth, cred);
> +
> if (res < 0) {
> put_rpccred(cred);
> cred = ERR_PTR(res);

NACK. This is not consistent with the preferred style for NFS and
SUNRPC (or the preferred style in Documentation/process/coding-
style.rst).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-15 Thread Trond Myklebust
On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote:
> On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust <
> tron...@hammerspace.com> wrote:
> > 
> > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
> > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques <
> > > lhenriq...@suse.de>
> > > wrote:
> > > > 
> > > > Nicolas Boichat reported an issue when trying to use the
> > > > copy_file_range
> > > > syscall on a tracefs file.  It failed silently because the file
> > > > content is
> > > > generated on-the-fly (reporting a size of zero) and
> > > > copy_file_range
> > > > needs
> > > > to know in advance how much data is present.
> > > > 
> > > > This commit restores the cross-fs restrictions that existed
> > > > prior
> > > > to
> > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > devices")
> > > > and
> > > > removes generic_copy_file_range() calls from ceph, cifs, fuse,
> > > > and
> > > > nfs.
> > > > 
> > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > devices")
> > > > Link:
> > > > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
> > > > Cc: Nicolas Boichat 
> > > > Signed-off-by: Luis Henriques 
> > > 
> > > Code looks ok.
> > > You may add:
> > > 
> > > Reviewed-by: Amir Goldstein 
> > > 
> > > I agree with Trond that the first paragraph of the commit message
> > > could
> > > be improved.
> > > The purpose of this change is to fix the change of behavior that
> > > caused the regression.
> > > 
> > > Before v5.3, behavior was -EXDEV and userspace could fallback to
> > > read.
> > > After v5.3, behavior is zero size copy.
> > > 
> > > It does not matter so much what makes sense for CFR to do in this
> > > case (generic cross-fs copy).  What matters is that nobody asked
> > > for
> > > this change and that it caused problems.
> > > 
> > 
> > No. I'm saying that this patch should be NACKed unless there is a
> > real
> > explanation for why we give crap about this tracefs corner case and
> > why
> > it can't be fixed.
> > 
> > There are plenty of reasons why copy offload across filesystems
> > makes
> > sense, and particularly when you're doing NAS. Clone just doesn't
> > cut
> > it when it comes to disaster recovery (whereas backup to a
> > different
> > storage unit does). If the client has to do the copy, then you're
> > effectively doubling the load on the server, and you're adding
> > potentially unnecessary network traffic (or at the very least you
> > are
> > doubling that traffic).
> > 
> 
> I don't understand the use case you are describing.
> 
> Which filesystem types are you talking about for source and target
> of copy_file_range()?
> 
> To be clear, the original change was done to support NFS/CIFS server-
> side
> copy and those should not be affected by this change.
> 

That is incorrect: 

ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file
*dst,
 u64 dst_pos, u64 count)
{

 /*
 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
 * thread and client rpc slot. The choice of 4MB is somewhat
 * arbitrary. We might instead base this on r/wsize, or make it
 * tunable, or use a time instead of a byte limit, or implement
 * asynchronous copy. In theory a client could also recognize a
 * limit like this and pipeline multiple COPY requests.
 */
 count = min_t(u64, count, 1 << 22);
 return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
}

You are now explicitly changing the behaviour of knfsd when the source
and destination filesystem differ.

For one thing, you are disallowing the NFSv4.2 copy offload use case of
copying from a local filesystem to a remote NFS server. However you are
also disallowing the copy from, say, an XFS formatted partition to an
ext4 partition.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-15 Thread Trond Myklebust
On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
> On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques 
> wrote:
> > 
> > Nicolas Boichat reported an issue when trying to use the
> > copy_file_range
> > syscall on a tracefs file.  It failed silently because the file
> > content is
> > generated on-the-fly (reporting a size of zero) and copy_file_range
> > needs
> > to know in advance how much data is present.
> > 
> > This commit restores the cross-fs restrictions that existed prior
> > to
> > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > and
> > removes generic_copy_file_range() calls from ceph, cifs, fuse, and
> > nfs.
> > 
> > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > devices")
> > Link: 
> > https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drink...@chromium.org/
> > Cc: Nicolas Boichat 
> > Signed-off-by: Luis Henriques 
> 
> Code looks ok.
> You may add:
> 
> Reviewed-by: Amir Goldstein 
> 
> I agree with Trond that the first paragraph of the commit message
> could
> be improved.
> The purpose of this change is to fix the change of behavior that
> caused the regression.
> 
> Before v5.3, behavior was -EXDEV and userspace could fallback to
> read.
> After v5.3, behavior is zero size copy.
> 
> It does not matter so much what makes sense for CFR to do in this
> case (generic cross-fs copy).  What matters is that nobody asked for
> this change and that it caused problems.
> 

No. I'm saying that this patch should be NACKed unless there is a real
explanation for why we give crap about this tracefs corner case and why
it can't be fixed.

There are plenty of reasons why copy offload across filesystems makes
sense, and particularly when you're doing NAS. Clone just doesn't cut
it when it comes to disaster recovery (whereas backup to a different
storage unit does). If the client has to do the copy, then you're
effectively doubling the load on the server, and you're adding
potentially unnecessary network traffic (or at the very least you are
doubling that traffic).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-15 Thread Trond Myklebust
On Mon, 2021-02-15 at 15:43 +, Luis Henriques wrote:
> Nicolas Boichat reported an issue when trying to use the
> copy_file_range
> syscall on a tracefs file.  It failed silently because the file
> content is
> generated on-the-fly (reporting a size of zero) and copy_file_range
> needs
> to know in advance how much data is present.

That explanation makes no sense whatsoever. copy_file_range is a non-
atomic operation and so the file can change while being copied. Any
determination of 'how much data is present' that is made in advance
would therefore be a flaw in the copy process being used (i.e.
do_splice_direct()). Does sendfile() also 'issue' in the same way?


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: UBSAN: shift-out-of-bounds in xprt_do_reserve

2021-02-10 Thread Trond Myklebust
Hi Randy,

On Wed, 2021-02-10 at 16:52 -0800, Randy Dunlap wrote:
> On 2/9/21 5:24 PM, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    dd86e7fa Merge tag 'pci-v5.11-fixes-2' of
> > git://git.kernel..
> > git tree:   upstream
> > console output:
> > https://syzkaller.appspot.com/x/log.txt?x=105930c4d0
> > kernel config: 
> > https://syzkaller.appspot.com/x/.config?x=266a5362c89c8127
> > dashboard link:
> > https://syzkaller.appspot.com/bug?extid=f3a0fa110fd630ab56c8
> > compiler:   Debian clang version 11.0.1-2
> > syz repro: 
> > https://syzkaller.appspot.com/x/repro.syz?x=17ba3038d0
> > C reproducer:  
> > https://syzkaller.appspot.com/x/repro.c?x=15cf0d64d0
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to
> > the commit:
> > Reported-by: syzbot+f3a0fa110fd630ab5...@syzkaller.appspotmail.com
> 
> #syz dup: UBSAN: shift-out-of-bounds in xprt_calc_majortimeo
> 
> > ===
> > =
> > UBSAN: shift-out-of-bounds in net/sunrpc/xprt.c:658:14
> > shift exponent 536870976 is too large for 64-bit type 'unsigned
> > long'
> > CPU: 1 PID: 8411 Comm: syz-executor902 Not tainted 5.11.0-rc6-
> > syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:79 [inline]
> >  dump_stack+0x137/0x1be lib/dump_stack.c:120
> >  ubsan_epilogue lib/ubsan.c:148 [inline]
> >  __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
> >  xprt_calc_majortimeo net/sunrpc/xprt.c:658 [inline]
> >  xprt_init_majortimeo net/sunrpc/xprt.c:686 [inline]
> 
> 

So, firstly, this is a case of 'doctor it hurts when I do this...' so
it isn't a critcal issue. It is a case where garbage mount options
produces garbage timeout values.
However, more importantly, it is a case where we could easily be
checking these values once at mount time instead of adding runtime
checks that can end up being called several times per RPC call.

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again

2021-02-08 Thread Trond Myklebust
On Mon, 2021-02-08 at 19:48 +, Chuck Lever wrote:
> 
> 
> > On Feb 8, 2021, at 2:34 PM, Trond Myklebust <
> > tron...@hammerspace.com> wrote:
> > 
> > On Tue, 2021-01-19 at 20:25 -0500, Sasha Levin wrote:
> > > From: Chuck Lever 
> > > 
> > > [ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]
> > > 
> > > Daire Byrne reports a ~50% aggregrate throughput regression on
> > > his
> > > Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server
> > > to
> > > use xprt_sock_sendmsg for socket sends"), which replaced
> > > kernel_send_page() calls in NFSD's socket send path with calls to
> > > sock_sendmsg() using iov_iter.
> > > 
> > > Investigation showed that tcp_sendmsg() was not using zero-copy
> > > to
> > > send the xdr_buf's bvec pages, but instead was relying on memcpy.
> > > This means copying every byte of a large NFS READ payload.
> > > 
> > > It looks like TLS sockets do indeed support a ->sendpage method,
> > > so it's really not necessary to use xprt_sock_sendmsg() to
> > > support
> > > TLS fully on the server. A mechanical reversion of da1661b93bf4
> > > is
> > > not possible at this point, but we can re-implement the server's
> > > TCP socket sendmsg path using kernel_sendpage().
> > > 
> > > Reported-by: Daire Byrne 
> > > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
> > > Signed-off-by: Chuck Lever 
> > > Signed-off-by: Sasha Levin 
> > > ---
> > >  net/sunrpc/svcsock.c | 86
> > > +++-
> > >  1 file changed, 85 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index c2752e2b9ce34..4404c491eb388 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -1062,6 +1062,90 @@ static int svc_tcp_recvfrom(struct
> > > svc_rqst
> > > *rqstp)
> > >     return 0;   /* record not complete */
> > >  }
> > >  
> > > +static int svc_tcp_send_kvec(struct socket *sock, const struct
> > > kvec
> > > *vec,
> > > + int flags)
> > > +{
> > > +   return kernel_sendpage(sock, virt_to_page(vec->iov_base),
> > > +  offset_in_page(vec->iov_base),
> > > +  vec->iov_len, flags);
> 
> Thanks for your review!
> 
> > I'm having trouble with this line. This looks like it is trying to
> > push
> > a slab page into kernel_sendpage().
> 
> The head and tail kvec's in rq_res are not kmalloc'd, they are
> backed by pages in rqstp->rq_pages[].
> 
> 
> > What guarantees that the nfsd
> > thread won't call kfree() before the socket layer is done
> > transmitting
> > the page?
> 
> If I understand correctly what Neil told us last week, the page
> reference count on those pages is set up so that one of
> svc_xprt_release() or the network layer does the final put_page(),
> in a safe fashion.
> 
> Before da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
> for socket sends"), the original svc_send_common() code did this:
> 
> -   /* send head */
> -   if (slen == xdr->head[0].iov_len)
> -   flags = 0;
> -   len = kernel_sendpage(sock, headpage, headoffset,
> - xdr->head[0].iov_len, flags);
> -   if (len != xdr->head[0].iov_len)
> -   goto out;
> -   slen -= xdr->head[0].iov_len;
> -   if (slen == 0)
> -   goto out;
> 
> 
> 

OK, so then only the argument kvec can be allocated on the slab (thanks
to  svc_deferred_recv)? Is that correct?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again

2021-02-08 Thread Trond Myklebust
On Tue, 2021-01-19 at 20:25 -0500, Sasha Levin wrote:
> From: Chuck Lever 
> 
> [ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]
> 
> Daire Byrne reports a ~50% aggregrate throughput regression on his
> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to
> use xprt_sock_sendmsg for socket sends"), which replaced
> kernel_send_page() calls in NFSD's socket send path with calls to
> sock_sendmsg() using iov_iter.
> 
> Investigation showed that tcp_sendmsg() was not using zero-copy to
> send the xdr_buf's bvec pages, but instead was relying on memcpy.
> This means copying every byte of a large NFS READ payload.
> 
> It looks like TLS sockets do indeed support a ->sendpage method,
> so it's really not necessary to use xprt_sock_sendmsg() to support
> TLS fully on the server. A mechanical reversion of da1661b93bf4 is
> not possible at this point, but we can re-implement the server's
> TCP socket sendmsg path using kernel_sendpage().
> 
> Reported-by: Daire Byrne 
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
> Signed-off-by: Chuck Lever 
> Signed-off-by: Sasha Levin 
> ---
>  net/sunrpc/svcsock.c | 86
> +++-
>  1 file changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index c2752e2b9ce34..4404c491eb388 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1062,6 +1062,90 @@ static int svc_tcp_recvfrom(struct svc_rqst
> *rqstp)
> return 0;   /* record not complete */
>  }
>  
> +static int svc_tcp_send_kvec(struct socket *sock, const struct kvec
> *vec,
> + int flags)
> +{
> +   return kernel_sendpage(sock, virt_to_page(vec->iov_base),
> +  offset_in_page(vec->iov_base),
> +  vec->iov_len, flags);

I'm having trouble with this line. This looks like it is trying to push
a slab page into kernel_sendpage(). What guarantees that the nfsd
thread won't call kfree() before the socket layer is done transmitting
the page?



-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [GIT PULL] Please pull NFS client bugfixes for 5.11

2021-01-31 Thread Trond Myklebust
On Sun, 2021-01-31 at 11:22 -0800, Linus Torvalds wrote:
> On Sun, Jan 31, 2021 at 8:59 AM Trond Myklebust <
> tron...@hammerspace.com> wrote:
> > 
> >   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-
> > for-5.11-3
> 
> Merged. However, it looks like you won't get a pr-tracker-bot reply
> because I'm not seeing this email on lore.
> 
> So I'm doing these manual replies for now, it looks like the mailing
> list is not doing great.
> 
>  Linus


Yeah, I was curious about that. Wasn't sure if the problem was on my
end or if indeed the mailing lists are down.
Thanks anyway for merging and taking the time to reply!

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: rpc_xprt_debugfs_register() - atomic_inc_return() usage

2021-01-21 Thread Trond Myklebust
On Wed, 2021-01-20 at 16:52 -0700, Shuah Khan wrote:
> Hi Anna and Trond,
> 
> I came across the following while reviewing atomic_inc_return()
> usages
> that cast return value to unsigned
> 
> rpc_xprt_debugfs_register()'s atomic_inc_return() usage looks a bit
> odd.
> 
> - cur_id isn't initialized
> - id = (unsigned int)atomic_inc_return(_id);
> 
> Please note that id is int. Is it expected that cur_id could
> overflow?
> Is there a maximum limit for this value?
> 

Yes, we do expect cur_id to eventually overflow (once you have created
2 billion RPC client instances), however the atomic increment
operations are expected to handle this correctly according to the
maintainers (I already asked them in a different context). Furthermore,
the code itself doesn't care about strict sequentiality. All it wants
from the counter is uniqueness, with that uniqueness condition actually
being enforced by the subsequent debugfs_create_file() call.

IOW: I don't think this is a real problem.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes

2021-01-12 Thread Trond Myklebust
Hi Linus,

The following changes since commit e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62:

  Linux 5.11-rc2 (2021-01-03 15:55:30 -0800)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.11-2

for you to fetch changes up to 896567ee7f17a8a736cda8a28cc987228410a2ac:

  NFS: nfs_igrab_and_active must first reference the superblock (2021-01-10 
16:29:28 -0500)


NFS client bugfixes for Linux 5.11

Highlights include:

Bugfixes:
- Fix parsing of link-local IPv6 addresses
- Fix confusing logging of mount errors that was introduced by the
  fsopen() patchset.
- Fix a tracing use after free in _nfs4_do_setlk()
- Layout return-on-close fixes when called from nfs4_evict_inode()
- Layout segments were being leaked in pnfs_generic_clear_request_commit()
- Don't leak DS commits in pnfs_generic_retry_commit()
- Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
  calls iput() on an inode after the super block has gone away.


Dave Wysochanski (1):
  NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock

Scott Mayhew (1):
  NFS: Adjust fs_context error logging

Trond Myklebust (9):
  pNFS: Mark layout for return if return-on-close was not sent
  pNFS: We want return-on-close to complete when evicting the inode
  pNFS: Clean up pnfs_layoutreturn_free_lsegs()
  pNFS: Stricter ordering of layoutget and layoutreturn
  NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
  NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
  NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
  NFS: nfs_delegation_find_inode_server must first reference the superblock
  NFS: nfs_igrab_and_active must first reference the superblock

j.nixd...@avm.de (1):
  net: sunrpc: interpret the return value of kstrtou32 correctly

 fs/nfs/delegation.c | 12 ++
 fs/nfs/internal.h   | 38 +++---
 fs/nfs/nfs4proc.c   | 28 +-
 fs/nfs/nfs4super.c  |  4 ++--
 fs/nfs/pnfs.c   | 67 -
 fs/nfs/pnfs.h   |  8 +++
 fs/nfs/pnfs_nfs.c   | 22 --
 net/sunrpc/addr.c   |  2 +-
 8 files changed, 99 insertions(+), 82 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: linux-next: Fixes tag needs some work in the nfs tree

2021-01-10 Thread Trond Myklebust
Hi Stephen,

On Sun, 2021-01-10 at 13:14 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> In commit
> 
>   2cc8aca9d547 ("NFS: Adjust fs_context error logging")
> 
> Fixes tag
> 
>   Fixes: Fixes: ce8866f0913f ("NFS: Attach supplementary error
> information to fs_context.")
> 
> has these problem(s):
> 
>   - No SHA1 recognised
> 
> See duplicate "Fixes: " :-(
> 

Sorry, that was my fault. I added the 'Fixes' tag when I applied the
patch. and failed to notice the copy-paste error.
I'll fix up the commit changelog.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: NFS 4.2 client support broken on 5.10.5

2021-01-08 Thread Trond Myklebust
On Fri, 2021-01-08 at 12:41 +0100, Kurt Garloff wrote:
> Hi Neil, Anna, Trond,
> 
> compiling a kernel, I suddenly started getting errors from objtool
> orc.
> (This first occurs on init/main.o.)
> 
> I looked at all kind of things, before I noticed that this was not a
> toolchain issue (gcc-10.2.1 self compiled), gcc plugins (I use
> structleak and stackleak) nor an issue with objtool or libelf,
> but that there was an -EIO error.
> 
> The kernel tree is on an NFS share, and I run 5.10.5 client kernel
> against the kernel NFS (4.2) server, running a 5.10.3 kernel.
> 
> The issue does NOT occur on a 5.10.3 client kernel, but is easily
> reproducible on 5.10.5. Note that 5.10.5 on a local file system or
> against an NFSv3 server does not show the issue.
> 
> Test program that reproduces this on the first pwrite64() is
> attached.
> Note that the call to ftruncate() is required to make the problem
> happen.
> 
> I could go on bisecting this to a particular patch, but you'll
> probably be able to see right away what's wrong.
> 
> Best,
> 

Hmm... If this is NFSv4.2 do you have READ_PLUS turned on or off in
.config? It really is not safe to enable READ_PLUS on 5.10 kernels
since that can cause random memory corruption.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [GIT PULL] Please pull NFS client changes for 5.11

2020-12-17 Thread Trond Myklebust
On Thu, 2020-12-17 at 12:25 -0800, Linus Torvalds wrote:
> On Thu, Dec 17, 2020 at 10:05 AM Trond Myklebust
>  wrote:
> > 
> >   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-
> > for-
> > 5.11-1
> > 
> > for you to fetch changes up to
> > 52104f274e2d7f134d34bab11cada8913d4544e2:
> 
> pr-tracker-bot doesn't seem to have reacted to this email.
> 
> I suspect it's because of the odd and unfortunate line breaks in it,
> but who knows. Maybe it's just delayed (but the other filesystem
> pulls
> seem to have been tracked).
> 
> Anyway, it's pulled, even if the automation seems to not have
> noticed.
> 
>     Linus

Thanks! Much appreciated.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client changes for 5.11

2020-12-17 Thread Trond Myklebust
Hi Linus,

The following changes since commit
63e2fffa59a9dd91e443b08832656399fd80b7f0:

  pNFS/flexfiles: Fix array overflow when flexfiles mirroring is
enabled (2020-11-30 10:52:22 -0500)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-
5.11-1

for you to fetch changes up to
52104f274e2d7f134d34bab11cada8913d4544e2:

  NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read() (2020-12-16
17:25:24 -0500)


Stephen Rothwell has noted that there are conflicts with the upstream
tree.
They are due to the last-minute fix to the NFSv4.2 READ_PLUS code in
5.10,
and due to the changes for the nfsd XDR code. The fixups should be
trivial
but please let me know if you'd prefer that I do so.

Cheers
  Trond


NFS client updates for Linux 5.11

Highlights include:

Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
  support.
- A series of patches to improve readdir performance, particularly with
  large directories.
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
  drivers.
- Micro-optimisations for RDMA.
- RDMA tracing improvements.

Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving partial
  pages (Dan Aloni).
- Various fixes for getxattr and listxattr, when used over non-TCP
  transports.
- Fixes for containerised NFS from Sargun Dhillon.
- switch nfsiod to be an UNBOUND workqueue (Neil Brown).
- READDIR should not ask for security label information if there is no
  LSM policy. (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum
Mackay).
- A series of RPC and NFS layer fixes to support the NFSv4.2 READ_PLUS
code.
- A couple of fixes for pnfs/flexfiles read failover

Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
  READ_PLUS fixes.


Calum Mackay (1):
  lockd: don't use interval-based rebinding over TCP

Chuck Lever (16):
  xprtrdma: Replace dprintk call sites in ERR_CHUNK path
  xprtrdma: Introduce Receive completion IDs
  xprtrdma: Introduce Send completion IDs
  xprtrdma: Introduce FRWR completion IDs
  xprtrdma: Clean up trace_xprtrdma_post_linv
  xprtrdma: Clean up reply parsing error tracepoints
  xprtrdma: Clean up tracepoints in the reply path
  xprtrdma: Clean up xprtrdma callback tracepoints
  xprtrdma: Clean up trace_xprtrdma_nomrs()
  xprtrdma: Display the task ID when reporting MR events
  xprtrdma: Trace unmap_sync calls
  xprtrdma: Move rpcrdma_mr_put()
  xprtrdma: Micro-optimize MR DMA-unmapping
  SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
  NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS
operation
  xprtrdma: Fix XDRBUF_SPARSE_PAGES support

Colin Ian King (1):
  NFSv4.2: fix error return on memory allocation failure

Dai Ngo (1):
  NFSv4.2: Fix 5 seconds delay when doing inter server copy

Dan Aloni (1):
  sunrpc: fix xs_read_xdr_buf for partial pages receive

Fedor Tokarev (1):
  net: sunrpc: Fix 'snprintf' return value check in
'do_xprt_debugfs'

Frank van der Linden (1):
  NFSv4.2: improve page handling for GETXATTR

Geliang Tang (1):
  NFSv4.1: use BITS_PER_LONG macro in nfs4session.h

NeilBrown (1):
  NFS: switch nfsiod to be an UNBOUND workqueue.

Olga Kornievskaia (1):
  NFSv4.2: condition READDIR's mask for security label based on LSM
state

Sargun Dhillon (2):
  NFS: NFSv2/NFSv3: Use cred from fs_context during mount
  NFSv4: Refactor to use user namespaces for nfs4idmap

Trond Myklebust (63):
  NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
  NFSv3: Add emulation of the lookupp() operation
  NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp
  SUNRPC: rpc_wake_up() should wake up tasks in the correct order
  NFS: Ensure contents of struct nfs_open_dir_context are
consistent
  NFS: Clean up readdir struct nfs_cache_array
  NFS: Clean up nfs_readdir_page_filler()
  NFS: Clean up directory array handling
  NFS: Don't discard readdir results
  NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
  NFS: Replace kmap() with kmap_atomic() in
nfs_readdir_search_array()
  NFS: Simplify struct nfs_cache_array_entry
  NFS: Support larger readdir buffers
  NFS: More readdir cleanups
  NFS: nfs_do_filldir() does not return a value
  NFS: Reduce readdir stack usage
  NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
  NFS: Allow the NFS generic code to pass in a verifier to readdir
  NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir
calls
  NFS: Improve handling of directory verifiers
  NFS: Optimisations for monotonically increasing readdir cookies
  NFS: Reduce number of RPC calls when doing uncached readdir

Re: linux-next: manual merge of the nfs tree with Linus' tree

2020-12-14 Thread Trond Myklebust
On Tue, 2020-12-15 at 11:24 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the nfs tree got a conflict in:
> 
>   fs/nfs/nfs4proc.c
> 
> between commit:
> 
>   21e31401fc45 ("NFS: Disable READ_PLUS by default")
> 
> from Linus' tree and commit:
> 
>   5c3485bb12c9 ("NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet")
> 
> from the nfs tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your
> tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any
> particularly
> complex conflicts.
> 

Thanks Stephen!

Yes, the fixes for the client side READ_PLUS code ended up being a
little more extensive than hoped for in the last week of the 2.10
cycle, hence the need for a Kconfig option to disable it. Apologies for
the extra work it caused you.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: remove trailing semicolon in macro definition

2020-11-29 Thread Trond Myklebust
On Sun, 2020-11-29 at 16:42 +, Trond Myklebust wrote:
> Hi Tom,
> 
> On Fri, 2020-11-27 at 11:43 -0800, t...@redhat.com wrote:
> > From: Tom Rix 
> > 
> > The macro use will already have a semicolon.
> > 
> > Signed-off-by: Tom Rix 
> > ---
> >  net/sunrpc/auth_gss/gss_generic_token.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/sunrpc/auth_gss/gss_generic_token.c
> > b/net/sunrpc/auth_gss/gss_generic_token.c
> > index fe97f3106536..9ae22d797390 100644
> > --- a/net/sunrpc/auth_gss/gss_generic_token.c
> > +++ b/net/sunrpc/auth_gss/gss_generic_token.c
> > @@ -46,7 +46,7 @@
> >  /* TWRITE_STR from gssapiP_generic.h */
> >  #define TWRITE_STR(ptr, str, len) \
> > memcpy((ptr), (char *) (str), (len)); \
> > -   (ptr) += (len);
> > +   (ptr) += (len)
> >  
> >  /*  this code currently makes the assumption that a mech oid
> > will
> >     never be longer than 127 bytes.  This assumption is not
> > inherent
> > in
> 
> There is exactly 1 use of this macro in the code AFAICS. Can we
> please
> just get rid of it, and make the code trivially easier to read?
> 


BTW: To illustrate just how obfuscating this kind of macro can be, note
that the line you are changing above will be completely optimised away
in the 1 use case we're talking about. It is bumping a pointer value
that immediately gets discarded.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: remove trailing semicolon in macro definition

2020-11-29 Thread Trond Myklebust
Hi Tom,

On Fri, 2020-11-27 at 11:43 -0800, t...@redhat.com wrote:
> From: Tom Rix 
> 
> The macro use will already have a semicolon.
> 
> Signed-off-by: Tom Rix 
> ---
>  net/sunrpc/auth_gss/gss_generic_token.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/auth_gss/gss_generic_token.c
> b/net/sunrpc/auth_gss/gss_generic_token.c
> index fe97f3106536..9ae22d797390 100644
> --- a/net/sunrpc/auth_gss/gss_generic_token.c
> +++ b/net/sunrpc/auth_gss/gss_generic_token.c
> @@ -46,7 +46,7 @@
>  /* TWRITE_STR from gssapiP_generic.h */
>  #define TWRITE_STR(ptr, str, len) \
> memcpy((ptr), (char *) (str), (len)); \
> -   (ptr) += (len);
> +   (ptr) += (len)
>  
>  /*  this code currently makes the assumption that a mech oid
> will
>     never be longer than 127 bytes.  This assumption is not inherent
> in

There is exactly 1 use of this macro in the code AFAICS. Can we please
just get rid of it, and make the code trivially easier to read?

Thanks
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: Retry the CLOSE if the embedded GETATTR is rejected with ERR_STALE

2020-11-18 Thread Trond Myklebust
On Wed, 2020-11-18 at 21:29 +, Anchal Agarwal wrote:
> On Wed, Nov 18, 2020 at 03:17:20AM +0000, Trond Myklebust wrote:
> > CAUTION: This email originated from outside of the organization. Do
> > not click links or open attachments unless you can confirm the
> > sender and know the content is safe.
> > 
> > 
> > 
> > On Wed, 2020-11-18 at 00:24 +, Anchal Agarwal wrote:
> > > If our CLOSE RPC call is rejected with an ERR_STALE error, then
> > > we
> > > should remove the GETATTR call from the compound RPC and retry.
> > > This could happen in a scenario where two clients tries to access
> > > the same file. One client opens the file and the other client
> > > removes
> > > the file while it's opened by first client. When the first client
> > > attempts to close the file the server returns ESTALE and the file
> > > ends
> > > up being leaked on the server. This depends on how nfs server is
> > > configured and is not reproducible if running against nfsd.
> > 
> > That would be a seriously broken server. If you return
> > NFS4ERR_STALE to
> > the client, you cannot expect any further interaction with that
> > file
> > from the client. It won't try to send CLOSE or DELEGRETURN or any
> > other
> > stateful operation.
> > 
> In this scenario, the setup we have at EFS is more of a distributed
> fashion. Multiple
> clients are connected to multiple servers with a common filesystem.
> So the above
> scenario leads to leaked open file handles on the client that tries
> to close deleted
> file. So I was of the view, in that case client could retry close
> without getattr
> in the close sequence without anything to do on server side.


If you send the client an NFS4ERR_STALE, you are telling it that its
access to the file has been revoked. That is not a temporary error, it
is a fatal one. The client is not responsible for cleaning up any
state.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: Retry the CLOSE if the embedded GETATTR is rejected with ERR_STALE

2020-11-17 Thread Trond Myklebust
On Wed, 2020-11-18 at 00:24 +, Anchal Agarwal wrote:
> If our CLOSE RPC call is rejected with an ERR_STALE error, then we
> should remove the GETATTR call from the compound RPC and retry.
> This could happen in a scenario where two clients tries to access
> the same file. One client opens the file and the other client removes
> the file while it's opened by first client. When the first client
> attempts to close the file the server returns ESTALE and the file
> ends
> up being leaked on the server. This depends on how nfs server is
> configured and is not reproducible if running against nfsd.

That would be a seriously broken server. If you return NFS4ERR_STALE to
the client, you cannot expect any further interaction with that file
from the client. It won't try to send CLOSE or DELEGRETURN or any other
stateful operation.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread Trond Myklebust
On Mon, 2020-11-16 at 16:12 +1100, NeilBrown wrote:
> On Mon, Nov 16 2020, Trond Myklebust wrote:
> 
> > On Mon, 2020-11-16 at 16:00 +1100, NeilBrown wrote:
> > > On Mon, Nov 16 2020, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2020-11-16 at 15:43 +1100, NeilBrown wrote:
> > > > > On Mon, Nov 16 2020, Trond Myklebust wrote:
> > > > > 
> > > > > > On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
> > > > > > > 
> > > > > > > Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
> > > > > > > nfs_lookup_revalidate()")
> > > > > > > and error from nfs_lookup_verify_inode() other than -
> > > > > > > ESTALE
> > > > > > > would
> > > > > > > result
> > > > > > > in nfs_lookup_revalidate() returning that error code (-
> > > > > > > ESTALE
> > > > > > > is
> > > > > > > mapped
> > > > > > > to zero).
> > > > > > > Since that commit, all errors result in zero being
> > > > > > > returned.
> > > > > > > 
> > > > > > > When nfs_lookup_revalidate() returns zero, the dentry is
> > > > > > > invalidated
> > > > > > > and, significantly, if the dentry is a directory that is
> > > > > > > mounted
> > > > > > > on,
> > > > > > > that mountpoint is lost.
> > > > > > > 
> > > > > > > If you:
> > > > > > >  - mount an NFS filesystem which contains a directory
> > > > > > >  - mount something (e.g. tmpfs) on that directory
> > > > > > >  - use iptables (or scissors) to block traffic to the
> > > > > > > server
> > > > > > >  - ls -l the-mounted-on-directory
> > > > > > >  - interrupt the 'ls -l'
> > > > > > > you will find that the directory has been unmounted.
> > > > > > > 
> > > > > > > This can be fixed by returning the actual error code from
> > > > > > > nfs_lookup_verify_inode() rather then zero (except for -
> > > > > > > ESTALE).
> > > > > > > 
> > > > > > > Fixes: 5ceb9d7fdaaf ("NFS: Refactor
> > > > > > > nfs_lookup_revalidate()")
> > > > > > > Signed-off-by: NeilBrown 
> > > > > > > ---
> > > > > > >  fs/nfs/dir.c | 8 +---
> > > > > > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > > index cb52db9a0cfb..d24acf556e9e 100644
> > > > > > > --- a/fs/nfs/dir.c
> > > > > > > +++ b/fs/nfs/dir.c
> > > > > > > @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct
> > > > > > > inode
> > > > > > > *dir,
> > > > > > > struct dentry *dentry,
> > > > > > >  unsigned int flags)
> > > > > > >  {
> > > > > > > struct inode *inode;
> > > > > > > -   int error;
> > > > > > > +   int error = 0;
> > > > > > >  
> > > > > > > nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
> > > > > > > inode = d_inode(dentry);
> > > > > > > @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct
> > > > > > > inode
> > > > > > > *dir,
> > > > > > > struct dentry *dentry,
> > > > > > >     nfs_check_verifier(dir, dentry, flags &
> > > > > > > LOOKUP_RCU))
> > > > > > > {
> > > > > > > error = nfs_lookup_verify_inode(inode,
> > > > > > > flags);
> > > > > > > if (error) {
> > > > > > > -   if (error == -ESTALE)
> > > > > > > +   if (error == -ESTALE) {
> > > > > > > nfs_zap_caches(dir);
> > > > > > > +   error = 0;
> > > > > > > +   }
> > 

Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread Trond Myklebust
On Mon, 2020-11-16 at 16:00 +1100, NeilBrown wrote:
> On Mon, Nov 16 2020, Trond Myklebust wrote:
> 
> > On Mon, 2020-11-16 at 15:43 +1100, NeilBrown wrote:
> > > On Mon, Nov 16 2020, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
> > > > > 
> > > > > Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
> > > > > nfs_lookup_revalidate()")
> > > > > and error from nfs_lookup_verify_inode() other than -ESTALE
> > > > > would
> > > > > result
> > > > > in nfs_lookup_revalidate() returning that error code (-ESTALE
> > > > > is
> > > > > mapped
> > > > > to zero).
> > > > > Since that commit, all errors result in zero being returned.
> > > > > 
> > > > > When nfs_lookup_revalidate() returns zero, the dentry is
> > > > > invalidated
> > > > > and, significantly, if the dentry is a directory that is
> > > > > mounted
> > > > > on,
> > > > > that mountpoint is lost.
> > > > > 
> > > > > If you:
> > > > >  - mount an NFS filesystem which contains a directory
> > > > >  - mount something (e.g. tmpfs) on that directory
> > > > >  - use iptables (or scissors) to block traffic to the server
> > > > >  - ls -l the-mounted-on-directory
> > > > >  - interrupt the 'ls -l'
> > > > > you will find that the directory has been unmounted.
> > > > > 
> > > > > This can be fixed by returning the actual error code from
> > > > > nfs_lookup_verify_inode() rather then zero (except for -
> > > > > ESTALE).
> > > > > 
> > > > > Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
> > > > > Signed-off-by: NeilBrown 
> > > > > ---
> > > > >  fs/nfs/dir.c | 8 +---
> > > > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > index cb52db9a0cfb..d24acf556e9e 100644
> > > > > --- a/fs/nfs/dir.c
> > > > > +++ b/fs/nfs/dir.c
> > > > > @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode
> > > > > *dir,
> > > > > struct dentry *dentry,
> > > > >  unsigned int flags)
> > > > >  {
> > > > > struct inode *inode;
> > > > > -   int error;
> > > > > +   int error = 0;
> > > > >  
> > > > > nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
> > > > > inode = d_inode(dentry);
> > > > > @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode
> > > > > *dir,
> > > > > struct dentry *dentry,
> > > > >     nfs_check_verifier(dir, dentry, flags &
> > > > > LOOKUP_RCU))
> > > > > {
> > > > > error = nfs_lookup_verify_inode(inode,
> > > > > flags);
> > > > > if (error) {
> > > > > -   if (error == -ESTALE)
> > > > > +   if (error == -ESTALE) {
> > > > > nfs_zap_caches(dir);
> > > > > +   error = 0;
> > > > > +   }
> > > > > goto out_bad;
> > > > > }
> > > > > nfs_advise_use_readdirplus(dir);
> > > > > @@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode
> > > > > *dir,
> > > > > struct dentry *dentry,
> > > > >  out_bad:
> > > > > if (flags & LOOKUP_RCU)
> > > > > return -ECHILD;
> > > > > -   return nfs_lookup_revalidate_done(dir, dentry, inode,
> > > > > 0);
> > > > > +   return nfs_lookup_revalidate_done(dir, dentry, inode,
> > > > > error);
> > > > 
> > > > Which errors do we actually need to return here? As far as I
> > > > can
> > > > tell,
> > > > the only errors that nfs_lookup_verify_inode() is supposed to
> > > > return is
> > > > ENOMEM, ESTALE, ECHILD, and possibly EIO or ETiMEDOUT.
> > > 

Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread Trond Myklebust
On Mon, 2020-11-16 at 15:43 +1100, NeilBrown wrote:
> On Mon, Nov 16 2020, Trond Myklebust wrote:
> 
> > On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
> > > 
> > > Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
> > > nfs_lookup_revalidate()")
> > > and error from nfs_lookup_verify_inode() other than -ESTALE would
> > > result
> > > in nfs_lookup_revalidate() returning that error code (-ESTALE is
> > > mapped
> > > to zero).
> > > Since that commit, all errors result in zero being returned.
> > > 
> > > When nfs_lookup_revalidate() returns zero, the dentry is
> > > invalidated
> > > and, significantly, if the dentry is a directory that is mounted
> > > on,
> > > that mountpoint is lost.
> > > 
> > > If you:
> > >  - mount an NFS filesystem which contains a directory
> > >  - mount something (e.g. tmpfs) on that directory
> > >  - use iptables (or scissors) to block traffic to the server
> > >  - ls -l the-mounted-on-directory
> > >  - interrupt the 'ls -l'
> > > you will find that the directory has been unmounted.
> > > 
> > > This can be fixed by returning the actual error code from
> > > nfs_lookup_verify_inode() rather then zero (except for -ESTALE).
> > > 
> > > Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
> > > Signed-off-by: NeilBrown 
> > > ---
> > >  fs/nfs/dir.c | 8 +---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > index cb52db9a0cfb..d24acf556e9e 100644
> > > --- a/fs/nfs/dir.c
> > > +++ b/fs/nfs/dir.c
> > > @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
> > > struct dentry *dentry,
> > >  unsigned int flags)
> > >  {
> > > struct inode *inode;
> > > -   int error;
> > > +   int error = 0;
> > >  
> > > nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
> > > inode = d_inode(dentry);
> > > @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode
> > > *dir,
> > > struct dentry *dentry,
> > >     nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU))
> > > {
> > > error = nfs_lookup_verify_inode(inode, flags);
> > > if (error) {
> > > -   if (error == -ESTALE)
> > > +   if (error == -ESTALE) {
> > > nfs_zap_caches(dir);
> > > +   error = 0;
> > > +   }
> > > goto out_bad;
> > > }
> > > nfs_advise_use_readdirplus(dir);
> > > @@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
> > > struct dentry *dentry,
> > >  out_bad:
> > > if (flags & LOOKUP_RCU)
> > > return -ECHILD;
> > > -   return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
> > > +   return nfs_lookup_revalidate_done(dir, dentry, inode,
> > > error);
> > 
> > Which errors do we actually need to return here? As far as I can
> > tell,
> > the only errors that nfs_lookup_verify_inode() is supposed to
> > return is
> > ENOMEM, ESTALE, ECHILD, and possibly EIO or ETiMEDOUT.
> > 
> > Why would it be better to return those errors rather than just a 0
> > when
> > we need to invalidate the inode, particularly since we already have
> > a
> > special case in nfs_lookup_revalidate_done() when the dentry is
> > root?
> 
> ERESTARTSYS is the error that easily causes problems.
> 
> Returning 0 causes d_invalidate() to be called which is quite heavy
> handed in mountpoints.

My point is that it shouldn't get returned for mountpoints. See
nfs_lookup_revalidate_done().

> So it is only reasonable to return 0 when we have unambiguous
> confirmation from the server that the object no longer exists. 
> ESTALE
> is unambiguous. EIO might be unambiguous.  ERESTARTSYS, ENOMEM,
> ETIMEDOUT are transient and don't justify d_invalidate() being
> called.
> 
> (BTW, Commit cc89684c9a26 ("NFS: only invalidate dentrys that are
> clearly invalid.")
>  fixed much the same bug 3 years ago).
>  
> Thanks,
> NeilBrown
> 
> 
> > 
> > >  }
> > >  
> > >  static int
> > 
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.mykleb...@hammerspace.com

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
​
www.hammer.space



Re: [PATCH] NFS: only invalidate dentrys that are clearly invalid.

2020-11-15 Thread Trond Myklebust
On Mon, 2020-11-16 at 13:59 +1100, NeilBrown wrote:
> 
> Prior to commit 5ceb9d7fdaaf ("NFS: Refactor
> nfs_lookup_revalidate()")
> and error from nfs_lookup_verify_inode() other than -ESTALE would
> result
> in nfs_lookup_revalidate() returning that error code (-ESTALE is
> mapped
> to zero).
> Since that commit, all errors result in zero being returned.
> 
> When nfs_lookup_revalidate() returns zero, the dentry is invalidated
> and, significantly, if the dentry is a directory that is mounted on,
> that mountpoint is lost.
> 
> If you:
>  - mount an NFS filesystem which contains a directory
>  - mount something (e.g. tmpfs) on that directory
>  - use iptables (or scissors) to block traffic to the server
>  - ls -l the-mounted-on-directory
>  - interrupt the 'ls -l'
> you will find that the directory has been unmounted.
> 
> This can be fixed by returning the actual error code from
> nfs_lookup_verify_inode() rather then zero (except for -ESTALE).
> 
> Fixes: 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()")
> Signed-off-by: NeilBrown 
> ---
>  fs/nfs/dir.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index cb52db9a0cfb..d24acf556e9e 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1350,7 +1350,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
> struct dentry *dentry,
>  unsigned int flags)
>  {
> struct inode *inode;
> -   int error;
> +   int error = 0;
>  
> nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
> inode = d_inode(dentry);
> @@ -1372,8 +1372,10 @@ nfs_do_lookup_revalidate(struct inode *dir,
> struct dentry *dentry,
>     nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU)) {
> error = nfs_lookup_verify_inode(inode, flags);
> if (error) {
> -   if (error == -ESTALE)
> +   if (error == -ESTALE) {
> nfs_zap_caches(dir);
> +   error = 0;
> +   }
> goto out_bad;
> }
> nfs_advise_use_readdirplus(dir);
> @@ -1395,7 +1397,7 @@ nfs_do_lookup_revalidate(struct inode *dir,
> struct dentry *dentry,
>  out_bad:
> if (flags & LOOKUP_RCU)
> return -ECHILD;
> -   return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
> +   return nfs_lookup_revalidate_done(dir, dentry, inode, error);

Which errors do we actually need to return here? As far as I can tell,
the only errors that nfs_lookup_verify_inode() is supposed to return is
ENOMEM, ESTALE, ECHILD, and possibly EIO or ETiMEDOUT.

Why would it be better to return those errors rather than just a 0 when
we need to invalidate the inode, particularly since we already have a
special case in nfs_lookup_revalidate_done() when the dentry is root?

>  }
>  
>  static int

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-11 Thread Trond Myklebust
On Wed, 2020-11-11 at 18:57 +, Sargun Dhillon wrote:
> On Wed, Nov 11, 2020 at 02:38:11PM +0000, Trond Myklebust wrote:
> > On Wed, 2020-11-11 at 11:12 +, Sargun Dhillon wrote:
> > > On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > > > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > > > > Hi,
> > > > > 
> > > > > I tested the patches on top of 5.10.0-rc3+ and I could mount
> > > > > an
> > > > > NFS
> > > > > share with a different user namespace. fsopen() is done in
> > > > > the
> > > > > container namespaces (user, mnt and net namespaces) while
> > > > > fsconfig(),
> > > > > fsmount() and move_mount() are done on the host namespaces.
> > > > > The
> > > > > mount
> > > > > on the host is available in the container via mount
> > > > > propagation
> > > > > from
> > > > > the host mount.
> > > > > 
> > > > > With this, the files on the NFS server with uid 0 are
> > > > > available
> > > > > in
> > > > > the
> > > > > container with uid 0. On the host, they are available with
> > > > > uid
> > > > > 4294967294 (make_kuid(_user_ns, -2)).
> > > > > 
> > > > 
> > > > Can someone please tell me what is broken with the _current_
> > > > design
> > > > before we start trying to push "fixes" that clearly break it?
> > > Currently the mechanism of mounting nfs4 in a user namespace is
> > > as
> > > follows:
> > > 
> > > Parent: fork()
> > > Child: setns(userns)
> > > C: fsopen("nfs4") = 3
> > > C->P: Send FD 3
> > > P: FSConfig...
> > > P: fsmount... (This is where the CAP_SYS_ADMIN check happens))
> > > 
> > > 
> > > Right now, when you mount an NFS filesystem in a non-init user
> > > namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
> > > are sent to the server are not the UIDs from the mounting
> > > namespace,
> > > instead they are the UIDs from the init user ns.
> > > 
> > > The reason for this is that you can call fsopen("nfs4") in the
> > > unprivileged 
> > > namespace, and that configures fs_context with all the right
> > > information for 
> > > that user namespace, but we currently require CAP_SYS_ADMIN in
> > > the
> > > init user 
> > > namespace to call fsmount. This means that the superblock's user
> > > namespace is 
> > > set "correctly" to the container, but there's absolutely no way
> > > nfs4uidmap
> > > to consume an unprivileged user namespace.
> > > 
> > > This behaviour happens "the other way" as well, where the UID in
> > > the
> > > container
> > > may be 0, but the corresponding kuid is 1000. When a response
> > > from an
> > > NFS
> > > server comes in we decode it according to the idmap userns[1].
> > > The
> > > userns
> > > used to get create idmap is generated at fsmount time, and not as
> > > fsopen
> > > time. So, even if the filesystem is in the user namespace, and
> > > the
> > > server
> > > responds with UID 0, it'll come up with an unmapped UID.
> > > 
> > > This is because we do
> > > Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS
> > > from_kuid(container_ns, 0) -> invalid uid
> > > 
> > > This is broken behaviour, in my humble opinion as is it makes it
> > > impossible to 
> > > use NFSv4 (and v3 for that matter) out of the box with
> > > unprivileged
> > > user 
> > > namespaces. At least in our environment, using usernames / GSS
> > > isn't
> > > an option,
> > > so we have to rely on UIDs being set correctly [at least from the
> > > container's
> > > perspective].
> > > 
> > 
> > The current code for setting server->cred was developed
> > independently
> > of fsopen() (and predates it actually). I'm fine with the change to
> > have server->cred be the cred of the user that called fsopen().
> > That's
> > in line with what we used to do for sys_mount().
> > 
> Just curious, without FS_USERNS, how were you mounting NFSv4 in an
> unprivileged user ns?

The code was originally develop

Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-11 Thread Trond Myklebust
On Wed, 2020-11-11 at 11:12 +, Sargun Dhillon wrote:
> On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > > Hi,
> > > 
> > > I tested the patches on top of 5.10.0-rc3+ and I could mount an
> > > NFS
> > > share with a different user namespace. fsopen() is done in the
> > > container namespaces (user, mnt and net namespaces) while
> > > fsconfig(),
> > > fsmount() and move_mount() are done on the host namespaces. The
> > > mount
> > > on the host is available in the container via mount propagation
> > > from
> > > the host mount.
> > > 
> > > With this, the files on the NFS server with uid 0 are available
> > > in
> > > the
> > > container with uid 0. On the host, they are available with uid
> > > 4294967294 (make_kuid(_user_ns, -2)).
> > > 
> > 
> > Can someone please tell me what is broken with the _current_ design
> > before we start trying to push "fixes" that clearly break it?
> Currently the mechanism of mounting nfs4 in a user namespace is as
> follows:
> 
> Parent: fork()
> Child: setns(userns)
> C: fsopen("nfs4") = 3
> C->P: Send FD 3
> P: FSConfig...
> P: fsmount... (This is where the CAP_SYS_ADMIN check happens))
> 
> 
> Right now, when you mount an NFS filesystem in a non-init user
> namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
> are sent to the server are not the UIDs from the mounting namespace,
> instead they are the UIDs from the init user ns.
> 
> The reason for this is that you can call fsopen("nfs4") in the
> unprivileged 
> namespace, and that configures fs_context with all the right
> information for 
> that user namespace, but we currently require CAP_SYS_ADMIN in the
> init user 
> namespace to call fsmount. This means that the superblock's user
> namespace is 
> set "correctly" to the container, but there's absolutely no way
> nfs4uidmap
> to consume an unprivileged user namespace.
> 
> This behaviour happens "the other way" as well, where the UID in the
> container
> may be 0, but the corresponding kuid is 1000. When a response from an
> NFS
> server comes in we decode it according to the idmap userns[1]. The
> userns
> used to get create idmap is generated at fsmount time, and not as
> fsopen
> time. So, even if the filesystem is in the user namespace, and the
> server
> responds with UID 0, it'll come up with an unmapped UID.
> 
> This is because we do
> Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS
> from_kuid(container_ns, 0) -> invalid uid
> 
> This is broken behaviour, in my humble opinion as is it makes it
> impossible to 
> use NFSv4 (and v3 for that matter) out of the box with unprivileged
> user 
> namespaces. At least in our environment, using usernames / GSS isn't
> an option,
> so we have to rely on UIDs being set correctly [at least from the
> container's
> perspective].
> 

The current code for setting server->cred was developed independently
of fsopen() (and predates it actually). I'm fine with the change to
have server->cred be the cred of the user that called fsopen(). That's
in line with what we used to do for sys_mount().

However all the other stuff to throw errors when the user namespace is
not init_user_ns introduces massive regressions.

> > 
> > The current design assumes that the user namespace being used is
> > the one where 
> > the mount itself is performed. That means that the uids and gids or
> > usernames 
> > and groupnames that go on the wire match the uids and gids of the
> > container in 
> > which the mount occurred.
> > 
> 
> Right now, NFS does not have the ability for the fsmount() call to be
> called in an unprivileged user namespace. We can change that
> behaviour
> elsewhere if we want, but it's orthogonal to this.
> 
> > The assumption is that the server has authenticated that client as
> > belonging to a domain that it recognises (either through strong
> > RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> > addresses to a list of acceptable clients).
> > 
> I added a rejection for upcalls because upcalls can happen in the
> init 
> namespaces. We can drop that restriction from the nfs4 patch if you'd
> like. I
> *believe* (and I'm not a little out of my depth) that the request-key
> handler gets called with the *network namespace* of the NFS mount,
> but the userns is a privileged one, allowing for potential hazards.
> 

The idmapper already rejects upcalls to the

Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-10 Thread Trond Myklebust
On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> Hi,
> 
> I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> share with a different user namespace. fsopen() is done in the
> container namespaces (user, mnt and net namespaces) while fsconfig(),
> fsmount() and move_mount() are done on the host namespaces. The mount
> on the host is available in the container via mount propagation from
> the host mount.
> 
> With this, the files on the NFS server with uid 0 are available in
> the
> container with uid 0. On the host, they are available with uid
> 4294967294 (make_kuid(_user_ns, -2)).
> 

Can someone please tell me what is broken with the _current_ design
before we start trying to push "fixes" that clearly break it?

The current design assumes that the user namespace being used is the
one where the mount itself is performed. That means that the uids and
gids or usernames and groupnames that go on the wire match the uids and
gids of the container in which the mount occurred.

The assumption is that the server has authenticated that client as
belonging to a domain that it recognises (either through strong
RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
addresses to a list of acceptable clients).

If you go ahead and change the user namespace on the client without
going through the mount process again to mount a different super block
with a different user namespace, then you will now get the exact same
behaviour as if you do that with any other filesystem.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-09 Thread Trond Myklebust
On Mon, 2020-11-09 at 09:01 -0500, t...@kernel.org wrote:
> Hello,
> 
> On Mon, Nov 09, 2020 at 01:50:40PM +0000, Trond Myklebust wrote:
> > > I'm thinking the real problem is that you're abusing workqueues.
> > > Just
> > > don't stuff so much work into it that this becomes a problem. Or
> > > rather,
> > > if you do, don't lie to it about it.
> > 
> > If we can't use workqueues to call iput_final() on an inode, then
> > what
> > is the point of having them at all?
> > 
> > Neil's use case is simply a file that has managed to accumulate a
> > seriously large page cache, and is therefore taking a long time to
> > complete the call to truncate_inode_pages_final(). Are you saying
> > we
> > have to allocate a dedicated thread for every case where this
> > happens?
> 
> I think the right thing to do here is setting CPU_INTENSIVE or using
> an
> unbound workqueue. Concurrency controlled per-cpu workqueue is
> unlikely to
> be a good fit if the work can run long enough to need cond_resched().
> Better
> to let the scheduler handle it. Making workqueue warn against long-
> running
> concurrency managed per-cpu work items would be great. I'll put that
> on my
> todo list but if anyone is interested please be my guest.
> 

That means changing all filesystem code to use cpu-intensive queues. As
far as I can tell, they all use workqueues (most of them using the
standard system queue) for fput(), dput() and/or iput() calls.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-09 Thread Trond Myklebust
On Mon, 2020-11-09 at 09:00 +0100, Peter Zijlstra wrote:
> On Mon, Nov 09, 2020 at 01:54:59PM +1100, NeilBrown wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4418f5cb8324..728870965df1 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1784,7 +1784,12 @@ static inline int
> > test_tsk_need_resched(struct task_struct *tsk)
> >  #ifndef CONFIG_PREEMPTION
> >  extern int _cond_resched(void);
> >  #else
> > -static inline int _cond_resched(void) { return 0; }
> > +static inline int _cond_resched(void)
> > +{
> > +   if (current->flags & PF_WQ_WORKER)
> > +   workqueue_cond_resched();
> > +   return 0;
> > +}
> >  #endif
> >  
> >  #define cond_resched() ({  \
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..5b2e38567a0c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5620,6 +5620,8 @@ SYSCALL_DEFINE0(sched_yield)
> >  #ifndef CONFIG_PREEMPTION
> >  int __sched _cond_resched(void)
> >  {
> > +   if (current->flags & PF_WQ_WORKER)
> > +   workqueue_cond_resched();
> > if (should_resched(0)) {
> > preempt_schedule_common();
> > return 1;
> 
> 
> Much hate for this.. :/ cond_resched() should be a NOP on !PREEMPT
> and
> you wreck that. Also, you call into that workqueue_cond_resched()
> unconditionally, even when it wouldn't have rescheduled, which seems
> very wrong too.
> 
> On top of all that, you're adding an extra load to the funcion :/
> 
> At some poine Paul tried to frob cond_resched() for RCU and ran into
> all
> sorts of performance issues, I'm thinking this will too.
> 
> 
> Going by your justification for all this:
> 
> > I think that once a worker calls cond_resched(), it should be
> > treated as
> > though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-
> > intensive
> > tasks need to call cond_resched().  This would allow other workers
> > to be
> > scheduled.
> 
> I'm thinking the real problem is that you're abusing workqueues. Just
> don't stuff so much work into it that this becomes a problem. Or
> rather,
> if you do, don't lie to it about it.

If we can't use workqueues to call iput_final() on an inode, then what
is the point of having them at all?

Neil's use case is simply a file that has managed to accumulate a
seriously large page cache, and is therefore taking a long time to
complete the call to truncate_inode_pages_final(). Are you saying we
have to allocate a dedicated thread for every case where this happens?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH 2/2] NFS: Limit the number of retries

2020-11-02 Thread Trond Myklebust
On Tue, 2020-11-03 at 00:24 +0800, Wenle Chen wrote:
>   We can't wait forever, even if the state
> is always delayed.
> 
> Signed-off-by: Wenle Chen 
> ---
>  fs/nfs/nfs4proc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index f6b5dc792b33..bb2316bf13f6 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -7390,15 +7390,17 @@ int nfs4_lock_delegation_recall(struct
> file_lock *fl, struct nfs4_state *state,
>  {
> struct nfs_server *server = NFS_SERVER(state->inode);
> int err;
> +   int retry = 3;
>  
> err = nfs4_set_lock_state(state, fl);
> if (err != 0)
> return err;
> do {
> err = _nfs4_do_setlk(state, F_SETLK, fl,
> NFS_LOCK_NEW);
> -   if (err != -NFS4ERR_DELAY)
> +   if (err != -NFS4ERR_DELAY || retry == 0)
> break;
> ssleep(1);
> +   --retry;
> } while (1);
> return nfs4_handle_delegation_recall_error(server, state,
> stateid, fl, err);
>  }

This patch will just cause the locks to be silently lost, no?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: Fix mode bits and nlink count for v4 referral dirs

2020-10-15 Thread Trond Myklebust
On Thu, 2020-10-15 at 09:36 -0400, Chuck Lever wrote:
> > On Oct 15, 2020, at 8:06 AM, Trond Myklebust <
> > tron...@hammerspace.com> wrote:
> > 
> > On Thu, 2020-10-15 at 00:39 +0530, Ashish Sangwan wrote:
> > > On Wed, Oct 14, 2020 at 11:47 PM Trond Myklebust
> > >  wrote:
> > > > On Tue, 2020-10-06 at 08:14 -0700, Ashish Sangwan wrote:
> > > > > Request for mode bits and nlink count in the
> > > > > nfs4_get_referral
> > > > > call
> > > > > and if server returns them use them instead of hard coded
> > > > > values.
> > > > > 
> > > > > CC: sta...@vger.kernel.org
> > > > > Signed-off-by: Ashish Sangwan 
> > > > > ---
> > > > > fs/nfs/nfs4proc.c | 20 +---
> > > > > 1 file changed, 17 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > > > > index 6e95c85fe395..efec05c5f535 100644
> > > > > --- a/fs/nfs/nfs4proc.c
> > > > > +++ b/fs/nfs/nfs4proc.c
> > > > > @@ -266,7 +266,9 @@ const u32 nfs4_fs_locations_bitmap[3] = {
> > > > >  | FATTR4_WORD0_FSID
> > > > >  | FATTR4_WORD0_FILEID
> > > > >  | FATTR4_WORD0_FS_LOCATIONS,
> > > > > - FATTR4_WORD1_OWNER
> > > > > + FATTR4_WORD1_MODE
> > > > > + | FATTR4_WORD1_NUMLINKS
> > > > > + | FATTR4_WORD1_OWNER
> > > > >  | FATTR4_WORD1_OWNER_GROUP
> > > > >  | FATTR4_WORD1_RAWDEV
> > > > >  | FATTR4_WORD1_SPACE_USED
> > > > > @@ -7594,16 +7596,28 @@ nfs4_listxattr_nfs4_user(struct inode
> > > > > *inode,
> > > > > char *list, size_t list_len)
> > > > >  */
> > > > > static void nfs_fixup_referral_attributes(struct nfs_fattr
> > > > > *fattr)
> > > > > {
> > > > > + bool fix_mode = true, fix_nlink = true;
> > > > > +
> > > > >  if (!(((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
> > > > > ||
> > > > > (fattr->valid & NFS_ATTR_FATTR_FILEID)) &&
> > > > >(fattr->valid & NFS_ATTR_FATTR_FSID) &&
> > > > >(fattr->valid & NFS_ATTR_FATTR_V4_LOCATIONS)))
> > > > >  return;
> > > > > 
> > > > > + if (fattr->valid & NFS_ATTR_FATTR_MODE)
> > > > > + fix_mode = false;
> > > > > + if (fattr->valid & NFS_ATTR_FATTR_NLINK)
> > > > > + fix_nlink = false;
> > > > >  fattr->valid |= NFS_ATTR_FATTR_TYPE |
> > > > > NFS_ATTR_FATTR_MODE |
> > > > >  NFS_ATTR_FATTR_NLINK |
> > > > > NFS_ATTR_FATTR_V4_REFERRAL;
> > > > > - fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
> > > > > - fattr->nlink = 2;
> > > > > +
> > > > > + if (fix_mode)
> > > > > + fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
> > > > > + else
> > > > > + fattr->mode |= S_IFDIR;
> > > > > +
> > > > > + if (fix_nlink)
> > > > > + fattr->nlink = 2;
> > > > > }
> > > > > 
> > > > > static int _nfs4_proc_fs_locations(struct rpc_clnt *client,
> > > > > struct
> > > > > inode *dir,
> > > > 
> > > > NACK to this patch. The whole point is that if the server has a
> > > > referral, then it is not going to give us any attributes other
> > > > than
> > > > the
> > > > ones we're already asking for because it may not even have a
> > > > real
> > > > directory. The client is required to fake up an inode, hence
> > > > the
> > > > existing code.
> > > 
> > > Hi Trond, thanks for reviewing the patch!
> > > Sorry but I didn't understand the reason to NACK it. Could you
> > > please
> > > elaborate your concern?
> > > These are the current attributes we request from the server on a
> > > referral:
> > > FATTR4_WORD0_CHANGE
> > > > FATTR4_WORD0_SIZE
> > > > FATTR4_WORD0_FSID
> > > > FATTR4_WORD0_FILEID
> > > > FATTR4_WORD0_FS_LOCATIO

Re: [PATCH] NFS: Fix mode bits and nlink count for v4 referral dirs

2020-10-15 Thread Trond Myklebust
On Thu, 2020-10-15 at 00:39 +0530, Ashish Sangwan wrote:
> On Wed, Oct 14, 2020 at 11:47 PM Trond Myklebust
>  wrote:
> > On Tue, 2020-10-06 at 08:14 -0700, Ashish Sangwan wrote:
> > > Request for mode bits and nlink count in the nfs4_get_referral
> > > call
> > > and if server returns them use them instead of hard coded values.
> > > 
> > > CC: sta...@vger.kernel.org
> > > Signed-off-by: Ashish Sangwan 
> > > ---
> > >  fs/nfs/nfs4proc.c | 20 +---
> > >  1 file changed, 17 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > > index 6e95c85fe395..efec05c5f535 100644
> > > --- a/fs/nfs/nfs4proc.c
> > > +++ b/fs/nfs/nfs4proc.c
> > > @@ -266,7 +266,9 @@ const u32 nfs4_fs_locations_bitmap[3] = {
> > >   | FATTR4_WORD0_FSID
> > >   | FATTR4_WORD0_FILEID
> > >   | FATTR4_WORD0_FS_LOCATIONS,
> > > - FATTR4_WORD1_OWNER
> > > + FATTR4_WORD1_MODE
> > > + | FATTR4_WORD1_NUMLINKS
> > > + | FATTR4_WORD1_OWNER
> > >   | FATTR4_WORD1_OWNER_GROUP
> > >   | FATTR4_WORD1_RAWDEV
> > >   | FATTR4_WORD1_SPACE_USED
> > > @@ -7594,16 +7596,28 @@ nfs4_listxattr_nfs4_user(struct inode
> > > *inode,
> > > char *list, size_t list_len)
> > >   */
> > >  static void nfs_fixup_referral_attributes(struct nfs_fattr
> > > *fattr)
> > >  {
> > > + bool fix_mode = true, fix_nlink = true;
> > > +
> > >   if (!(((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) ||
> > >  (fattr->valid & NFS_ATTR_FATTR_FILEID)) &&
> > > (fattr->valid & NFS_ATTR_FATTR_FSID) &&
> > > (fattr->valid & NFS_ATTR_FATTR_V4_LOCATIONS)))
> > >   return;
> > > 
> > > + if (fattr->valid & NFS_ATTR_FATTR_MODE)
> > > + fix_mode = false;
> > > + if (fattr->valid & NFS_ATTR_FATTR_NLINK)
> > > + fix_nlink = false;
> > >   fattr->valid |= NFS_ATTR_FATTR_TYPE | NFS_ATTR_FATTR_MODE |
> > >   NFS_ATTR_FATTR_NLINK | NFS_ATTR_FATTR_V4_REFERRAL;
> > > - fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
> > > - fattr->nlink = 2;
> > > +
> > > + if (fix_mode)
> > > + fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
> > > + else
> > > + fattr->mode |= S_IFDIR;
> > > +
> > > + if (fix_nlink)
> > > + fattr->nlink = 2;
> > >  }
> > > 
> > >  static int _nfs4_proc_fs_locations(struct rpc_clnt *client,
> > > struct
> > > inode *dir,
> > 
> > NACK to this patch. The whole point is that if the server has a
> > referral, then it is not going to give us any attributes other than
> > the
> > ones we're already asking for because it may not even have a real
> > directory. The client is required to fake up an inode, hence the
> > existing code.
> 
> Hi Trond, thanks for reviewing the patch!
> Sorry but I didn't understand the reason to NACK it. Could you please
> elaborate your concern?
> These are the current attributes we request from the server on a
> referral:
> FATTR4_WORD0_CHANGE
> > FATTR4_WORD0_SIZE
> > FATTR4_WORD0_FSID
> > FATTR4_WORD0_FILEID
> > FATTR4_WORD0_FS_LOCATIONS,
> FATTR4_WORD1_OWNER
> > FATTR4_WORD1_OWNER_GROUP
> > FATTR4_WORD1_RAWDEV
> > FATTR4_WORD1_SPACE_USED
> > FATTR4_WORD1_TIME_ACCESS
> > FATTR4_WORD1_TIME_METADATA
> > FATTR4_WORD1_TIME_MODIFY
> > FATTR4_WORD1_MOUNTED_ON_FILEID,
> 
> So you are suggesting that it's ok to ask for SIZE, OWNER, OWNER
> GROUP, SPACE USED, TIMESTAMPs etc but not ok to ask for mode bits and
> numlinks?

No. We shouldn't be asking for any of that information for a referral
because the server isn't supposed to return any values for it.

Chuck and Anna, what's the deal with commit c05cefcc7241? That appears
to have changed the original code to speculatively assume that the
server will violate RFC5661 Section 11.3.1 and/or RFC7530 Section
8.3.1. Specifically, the paragraph that says:

"
   Other attributes SHOULD NOT be made available for absent file
   systems, even when it is possible to provide them.  The server should
   not assume that more information is always better and should avoid
   gratuitously providing additional information."

So why is the client asking for them?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: Fix mode bits and nlink count for v4 referral dirs

2020-10-14 Thread Trond Myklebust
On Tue, 2020-10-06 at 08:14 -0700, Ashish Sangwan wrote:
> Request for mode bits and nlink count in the nfs4_get_referral call
> and if server returns them use them instead of hard coded values.
> 
> CC: sta...@vger.kernel.org
> Signed-off-by: Ashish Sangwan 
> ---
>  fs/nfs/nfs4proc.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 6e95c85fe395..efec05c5f535 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -266,7 +266,9 @@ const u32 nfs4_fs_locations_bitmap[3] = {
>   | FATTR4_WORD0_FSID
>   | FATTR4_WORD0_FILEID
>   | FATTR4_WORD0_FS_LOCATIONS,
> - FATTR4_WORD1_OWNER
> + FATTR4_WORD1_MODE
> + | FATTR4_WORD1_NUMLINKS
> + | FATTR4_WORD1_OWNER
>   | FATTR4_WORD1_OWNER_GROUP
>   | FATTR4_WORD1_RAWDEV
>   | FATTR4_WORD1_SPACE_USED
> @@ -7594,16 +7596,28 @@ nfs4_listxattr_nfs4_user(struct inode *inode,
> char *list, size_t list_len)
>   */
>  static void nfs_fixup_referral_attributes(struct nfs_fattr *fattr)
>  {
> + bool fix_mode = true, fix_nlink = true;
> +
>   if (!(((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) ||
>  (fattr->valid & NFS_ATTR_FATTR_FILEID)) &&
> (fattr->valid & NFS_ATTR_FATTR_FSID) &&
> (fattr->valid & NFS_ATTR_FATTR_V4_LOCATIONS)))
>   return;
>  
> + if (fattr->valid & NFS_ATTR_FATTR_MODE)
> + fix_mode = false;
> + if (fattr->valid & NFS_ATTR_FATTR_NLINK)
> + fix_nlink = false;
>   fattr->valid |= NFS_ATTR_FATTR_TYPE | NFS_ATTR_FATTR_MODE |
>   NFS_ATTR_FATTR_NLINK | NFS_ATTR_FATTR_V4_REFERRAL;
> - fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
> - fattr->nlink = 2;
> +
> + if (fix_mode)
> + fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
> + else
> + fattr->mode |= S_IFDIR;
> +
> + if (fix_nlink)
> + fattr->nlink = 2;
>  }
>  
>  static int _nfs4_proc_fs_locations(struct rpc_clnt *client, struct
> inode *dir,

NACK to this patch. The whole point is that if the server has a
referral, then it is not going to give us any attributes other than the
ones we're already asking for because it may not even have a real
directory. The client is required to fake up an inode, hence the
existing code.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes

2020-09-28 Thread Trond Myklebust
Hi Linus,

The following changes since commit 856deb866d16e29bd65952e0289066f6078af773:

  Linux 5.9-rc5 (2020-09-13 16:06:00 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.9-3

for you to fetch changes up to b9df46d08a8d098ea2124cb9e3b84458a474b4d4:

  pNFS/flexfiles: Be consistent about mirror index types (2020-09-18 09:25:33 
-0400)

Cheers,
  Trond

NFS client bugfixes for Linux 5.9

Highlights include:

Bugfixes:
- NFSv4.2: copy_file_range needs to invalidate caches on success
- NFSv4.2: Fix security label length not being reset
- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
- pNFS/flexfiles: Fix signed/unsigned type issues with mirror indices


Jeffrey Mitchell (1):
  nfs: Fix security label length not being reset

Olga Kornievskaia (1):
  NFSv4.2: fix client's attribute cache management for copy_file_range

Trond Myklebust (2):
  pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
  pNFS/flexfiles: Be consistent about mirror index types

 fs/nfs/dir.c   |  3 +++
 fs/nfs/flexfilelayout/flexfilelayout.c | 43 +-
 fs/nfs/nfs42proc.c | 10 +++-
 include/linux/nfs_xdr.h|  4 ++--
 4 files changed, 36 insertions(+), 24 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes for Linux 5.9

2020-09-09 Thread Trond Myklebust
Hi Linus,

The following changes since commit d012a7190fc1fd72ed48911e77ca97ba4521bccd:

  Linux 5.9-rc2 (2020-08-23 14:08:43 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.9-2

for you to fetch changes up to 8c6b6c793ed32b8f9770ebcdf1ba99af423c303b:

  SUNRPC: stop printk reading past end of string (2020-09-05 10:39:41 -0400)


NFS client bugfixes for Linux 5.9

Highlights include:

Bugfixes:
- Fix an NFS/RDMA resource leak
- Fix the error handling during delegation recall
- NFSv4.0 needs to return the delegation on a zero-stateid SETATTR
- Stop printk reading past end of string


Chuck Lever (2):
  xprtrdma: Release in-flight MRs on disconnect
  NFS: Zero-stateid SETATTR should first return delegation

J. Bruce Fields (1):
  SUNRPC: stop printk reading past end of string

Olga Kornievskaia (1):
  NFSv4.1 handle ERR_DELAY error reclaiming locking state on delegation 
recall

Trond Myklebust (1):
  Merge tag 'nfs-rdma-for-5.9-1' of 
git://git.linux-nfs.org/projects/anna/linux-nfs

 fs/nfs/nfs4proc.c   | 11 +--
 net/sunrpc/rpcb_clnt.c  |  4 ++--
 net/sunrpc/xprtrdma/verbs.c |  2 ++
 3 files changed, 13 insertions(+), 4 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client updates for Linux 5.9

2020-08-14 Thread Trond Myklebust
Hi Linus,

The following changes since commit c1326210477ecc06c53221f0005c64419aba30d6:

  nfs,nfsd: NFSv4.2 extended attribute protocol definitions (2020-07-13 
17:20:49 -0400)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.9-1

for you to fetch changes up to 563c53e73b8b6ec842828736f77e633f7b0911e9:

  NFS: Fix flexfiles read failover (2020-08-12 11:20:29 -0400)

Cheers
  Trond


NFS client updates for Linux 5.9

Highlights include:

Stable fixes:
- pNFS: Don't return layout segments that are being used for I/O
- pNFS: Don't move layout segments off the active list when being used for I/O

Features:
- NFS: Add support for user xattrs through the NFSv4.2 protocol
- NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
- NFSv4.0 allow nconnect for v4.0

Bugfixes and cleanups:
- nfs: ensure correct writeback errors are returned on close()
- nfs: nfs_file_write() should check for writeback errors
- nfs: Fix getxattr kernel panic and memory overflow
- NFS: Fix the pNFS/flexfiles mirrored read failover code
- SUNRPC: dont update timeout value on connection reset
- freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for 
NFS
- sunrpc: destroy rpc_inode_cachep after unregister_filesystem


Colin Ian King (1):
  NFS: remove redundant initialization of variable result

Dan Aloni (1):
  sunrpc: destroy rpc_inode_cachep after unregister_filesystem

Frank van der Linden (13):
  nfs: add client side only definitions for user xattrs
  NFSv4.2: define limits and sizes for user xattr handling
  NFSv4.2: query the server for extended attribute support
  NFSv4.2: add client side XDR handling for extended attributes
  nfs: define nfs_access_get_cached function
  NFSv4.2: query the extended attribute access bits
  nfs: modify update_changeattr to deal with regular files
  nfs: define and use the NFS_INO_INVALID_XATTR flag
  nfs: make the buf_to_pages_noslab function available to the nfs code
  NFSv4.2: add the extended attribute proc functions.
  NFSv4.2: hook in the user extended attribute handlers
  NFSv4.2: add client side xattr caching.
  NFSv4.2: xattr cache: get rid of cache discard work queue

He Zhe (1):
  freezer: Add unsafe versions of freezable_schedule_timeout_interruptible 
for NFS

Jeffrey Mitchell (1):
  nfs: Fix getxattr kernel panic and memory overflow

Olga Kornievskaia (2):
  NFSv4.0 allow nconnect for v4.0
  SUNRPC dont update timeout value on connection reset

Randy Dunlap (1):
  fs: nfs: delete repeated words in comments

Scott Mayhew (2):
  nfs: ensure correct writeback errors are returned on close()
  nfs: nfs_file_write() should check for writeback errors

Trond Myklebust (11):
  NFS: Allow applications to speed up readdir+statx() using 
AT_STATX_DONT_SYNC
  pNFS/flexfiles: Clean up redundant calls to pnfs_put_lseg()
  pNFS/flexfiles: The mirror count could depend on the layout segment range
  Merge commit 'c1326210477ecc06c53221f0005c64419aba30d6' from 
nfsd/linux-next
  Merge branch 'xattr-devel'
  NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close()
  NFS: Add tracepoints for layouterror and layoutstats.
  NFS: Add layout segment info to pnfs read/write/commit tracepoints
  NFS: Don't move layouts to plh_return_segs list while in use
  NFS: Don't return layout segments that are in use
  NFS: Fix flexfiles read failover

Xu Wang (1):
  rpc_pipefs: convert comma to semicolon

 fs/nfs/Makefile|2 +-
 fs/nfs/blocklayout/rpc_pipefs.c|2 +-
 fs/nfs/client.c|   22 +-
 fs/nfs/dir.c   |   24 +-
 fs/nfs/direct.c|2 +-
 fs/nfs/file.c  |   17 +-
 fs/nfs/flexfilelayout/flexfilelayout.c |   64 +-
 fs/nfs/fs_context.c|2 +-
 fs/nfs/inode.c |   20 +-
 fs/nfs/nfs42.h |   24 +
 fs/nfs/nfs42proc.c |  258 +++-
 fs/nfs/nfs42xattr.c| 1056 
 fs/nfs/nfs42xdr.c  |  438 +
 fs/nfs/nfs4_fs.h   |   35 ++
 fs/nfs/nfs4client.c|   33 +-
 fs/nfs/nfs4file.c  |5 +-
 fs/nfs/nfs4proc.c  |  241 +++-
 fs/nfs/nfs4super.c |   10 +
 fs/nfs/nfs4trace.h |   46 +-
 fs/nfs/nfs4xdr.c   |   39 +-
 fs/nfs/nfstrace.h  |3 +-
 fs/nfs/pnfs.c  |   52 +-
 fs/nfs/pnfs.h  |2 +-
 include/linux/freezer.h|   14 +
 include

Re: [PATCH] nfs: Fix getxattr kernel panic and memory overflow

2020-08-11 Thread Trond Myklebust
On Wed, 2020-08-05 at 12:23 -0500, Jeffrey Mitchell wrote:
> Move the buffer size check to decode_attr_security_label() before
> memcpy()
> Only call memcpy() if the buffer is large enough
> 
> Signed-off-by: Jeffrey Mitchell 
> ---
>  fs/nfs/nfs4proc.c | 2 --
>  fs/nfs/nfs4xdr.c  | 5 -
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 2e2dac29a9e9..45e0585e0667 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5845,8 +5845,6 @@ static int _nfs4_get_security_label(struct
> inode *inode, void *buf,
>   return ret;
>   if (!(fattr.valid & NFS_ATTR_FATTR_V4_SECURITY_LABEL))
>   return -ENOENT;
> - if (buflen < label.len)
> - return -ERANGE;
>   return 0;
>  }
>  
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index 47817ef0aadb..50162e0a959c 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -4166,7 +4166,10 @@ static int decode_attr_security_label(struct
> xdr_stream *xdr, uint32_t *bitmap,
>   return -EIO;
>   if (len < NFS4_MAXLABELLEN) {
>   if (label) {
> - memcpy(label->label, p, len);
> + if (label->len && label->len < len)
> + return -ERANGE;
> + if (label->len)
> + memcpy(label->label, p, len);

Just a heads up that I fixed this to avoid the duplicate test of label-
>len != 0 and I added a Fixes: before applying.

>   label->len = len;
>   label->pi = pi;
>   label->lfs = lfs;
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] nfs: pnfs: use GFP_ATOMIC under spin lock

2020-06-29 Thread Trond Myklebust
On Mon, 2020-06-29 at 14:21 +0800, Xu Wang wrote:
> A spin lock is taken here so we should use GFP_ATOMIC.
> 
> Signed-off-by: Xu Wang 
> ---
>  fs/nfs/pnfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index dd2e14f5875d..d84c1b7b71d2 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -2170,7 +2170,7 @@ _pnfs_grab_empty_layout(struct inode *ino,
> struct nfs_open_context *ctx)
>   struct pnfs_layout_hdr *lo;
>  
>   spin_lock(>i_lock);
> - lo = pnfs_find_alloc_layout(ino, ctx, GFP_KERNEL);
> + lo = pnfs_find_alloc_layout(ino, ctx, GFP_ATOMIC);
>   if (!lo)
>   goto out_unlock;
>   if (!test_bit(NFS_LAYOUT_INVALID_STID, >plh_flags))

NACK. Please read the code before sending yet another one of these
pointless patches.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()

2020-05-24 Thread Trond Myklebust
On Sun, 2020-05-24 at 10:40 -0600, Jens Axboe wrote:
> On 5/24/20 10:30 AM, Jens Axboe wrote:
> > On 5/24/20 8:05 AM, Trond Myklebust wrote:
> > > On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
> > > > Use the async page locking infrastructure, if IOCB_WAITQ is set
> > > > in
> > > > the
> > > > passed in iocb. The caller must expect an -EIOCBQUEUED return
> > > > value,
> > > > which means that IO is started but not done yet. This is
> > > > similar to
> > > > how
> > > > O_DIRECT signals the same operation. Once the callback is
> > > > received by
> > > > the caller for IO completion, the caller must retry the
> > > > operation.
> > > > 
> > > > Signed-off-by: Jens Axboe 
> > > > ---
> > > >  mm/filemap.c | 33 ++---
> > > >  1 file changed, 26 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index c746541b1d49..a3b86c9acdc8 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -1219,6 +1219,14 @@ static int
> > > > __wait_on_page_locked_async(struct
> > > > page *page,
> > > > return ret;
> > > >  }
> > > >  
> > > > +static int wait_on_page_locked_async(struct page *page,
> > > > +struct wait_page_queue
> > > > *wait)
> > > > +{
> > > > +   if (!PageLocked(page))
> > > > +   return 0;
> > > > +   return __wait_on_page_locked_async(compound_head(page),
> > > > wait,
> > > > false);
> > > > +}
> > > > +
> > > >  /**
> > > >   * put_and_wait_on_page_locked - Drop a reference and wait for
> > > > it to
> > > > be unlocked
> > > >   * @page: The page to wait for.
> > > > @@ -2058,17 +2066,25 @@ static ssize_t
> > > > generic_file_buffered_read(struct kiocb *iocb,
> > > > index, last_index -
> > > > index);
> > > > }
> > > > if (!PageUptodate(page)) {
> > > > -   if (iocb->ki_flags & IOCB_NOWAIT) {
> > > > -   put_page(page);
> > > > -   goto would_block;
> > > > -   }
> > > > -
> > > > /*
> > > >  * See comment in do_read_cache_page on
> > > > why
> > > >  * wait_on_page_locked is used to avoid
> > > > unnecessarily
> > > >  * serialisations and why it's safe.
> > > >  */
> > > > -   error =
> > > > wait_on_page_locked_killable(page);
> > > > +   if (iocb->ki_flags & IOCB_WAITQ) {
> > > > +   if (written) {
> > > > +   put_page(page);
> > > > +   goto out;
> > > > +   }
> > > > +   error =
> > > > wait_on_page_locked_async(page,
> > > > +   
> > > > iocb-
> > > > > private);
> > > 
> > > If it is being used in 'generic_file_buffered_read()' as storage
> > > for a
> > > wait queue, then it is hard to consider this a 'private' field.
> > 
> > private isn't the prettiest, and in fact this one in particular is
> > a bit
> > of a mess. It's not clear if it's caller or callee owned. It's
> > generally
> > not used, outside of the old usb gadget code, iomap O_DIRECT, and
> > ocfs2.
> > With FMODE_BUF_RASYNC, the fs obviously can't set it if it uses
> > ->private for buffered IO.
> > 
> > > Perhaps either rename and add type checking, or else add a
> > > separate
> > > field altogether to struct kiocb?
> > 
> > I'd hate to add a new field and increase the size of the kiocb...
> > One
> > alternative is to do:
> > 
> > union {
> > void *private;
> > struct wait_page_queue *ki_waitq;
> > };
> > 
>

Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()

2020-05-24 Thread Trond Myklebust
On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
> Use the async page locking infrastructure, if IOCB_WAITQ is set in
> the
> passed in iocb. The caller must expect an -EIOCBQUEUED return value,
> which means that IO is started but not done yet. This is similar to
> how
> O_DIRECT signals the same operation. Once the callback is received by
> the caller for IO completion, the caller must retry the operation.
> 
> Signed-off-by: Jens Axboe 
> ---
>  mm/filemap.c | 33 ++---
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c746541b1d49..a3b86c9acdc8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1219,6 +1219,14 @@ static int __wait_on_page_locked_async(struct
> page *page,
>   return ret;
>  }
>  
> +static int wait_on_page_locked_async(struct page *page,
> +  struct wait_page_queue *wait)
> +{
> + if (!PageLocked(page))
> + return 0;
> + return __wait_on_page_locked_async(compound_head(page), wait,
> false);
> +}
> +
>  /**
>   * put_and_wait_on_page_locked - Drop a reference and wait for it to
> be unlocked
>   * @page: The page to wait for.
> @@ -2058,17 +2066,25 @@ static ssize_t
> generic_file_buffered_read(struct kiocb *iocb,
>   index, last_index - index);
>   }
>   if (!PageUptodate(page)) {
> - if (iocb->ki_flags & IOCB_NOWAIT) {
> - put_page(page);
> - goto would_block;
> - }
> -
>   /*
>* See comment in do_read_cache_page on why
>* wait_on_page_locked is used to avoid
> unnecessarily
>* serialisations and why it's safe.
>*/
> - error = wait_on_page_locked_killable(page);
> + if (iocb->ki_flags & IOCB_WAITQ) {
> + if (written) {
> + put_page(page);
> + goto out;
> + }
> + error = wait_on_page_locked_async(page,
> + iocb-
> >private);

If it is being used in 'generic_file_buffered_read()' as storage for a
wait queue, then it is hard to consider this a 'private' field.

Perhaps either rename and add type checking, or else add a separate
field altogether to struct kiocb?

> + } else {
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + put_page(page);
> + goto would_block;
> + }
> + error =
> wait_on_page_locked_killable(page);
> + }
>   if (unlikely(error))
>   goto readpage_error;
>   if (PageUptodate(page))
> @@ -2156,7 +2172,10 @@ static ssize_t
> generic_file_buffered_read(struct kiocb *iocb,
>  
>  page_not_up_to_date:
>   /* Get exclusive access to the page ... */
> - error = lock_page_killable(page);
> + if (iocb->ki_flags & IOCB_WAITQ)
> + error = lock_page_async(page, iocb->private);
> + else
> + error = lock_page_killable(page);
>   if (unlikely(error))
>   goto readpage_error;
>  
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes

2020-05-15 Thread Trond Myklebust
Hi Linus,

The following changes since commit 2ef96a5bb12be62ef75b5828c0aab838ebb29cb8:

  Linux 5.7-rc5 (2020-05-10 15:16:58 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.7-5

for you to fetch changes up to 8eed292bc8cbf737e46fb1c119d4c8f6dcb00650:

  NFSv3: fix rpc receive buffer size for MOUNT call (2020-05-14 18:42:44 -0400)


NFS client bugfixes for Linux 5.7

Highlights include:

Stable fixes:
- nfs: fix NULL deference in nfs4_get_valid_delegation

Bugfixes:
- Fix corruption of the return value in cachefiles_read_or_alloc_pages()
- Fix several fscache cookie issues
- Fix a fscache queuing race that can trigger a BUG_ON
- NFS: Fix 2 use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
- SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
- SUNRPC: Fix a race when tearing down the rpc client debugfs directory
- SUNRPC: Signalled ASYNC tasks need to exit
- NFSv3: fix rpc receive buffer size for MOUNT call


Chuck Lever (1):
  SUNRPC: Signalled ASYNC tasks need to exit

Dave Wysochanski (3):
  NFS: Fix fscache super_cookie index_key from changing after umount
  NFS: Fix fscache super_cookie allocation
  NFSv4: Fix fscache cookie aux_data to ensure change_attr is included

David Howells (1):
  cachefiles: Fix corruption of the return value in 
cachefiles_read_or_alloc_pages()

J. Bruce Fields (2):
  nfs: fix NULL deference in nfs4_get_valid_delegation
  SUNRPC: 'Directory with parent 'rpc_clnt' already present!'

Lei Xue (1):
  cachefiles: Fix race between read_waiter and read_copier involving 
op->to_do

NeilBrown (1):
  SUNRPC: fix use-after-free in rpc_free_client_work()

Olga Kornievskaia (1):
  NFSv3: fix rpc receive buffer size for MOUNT call

Trond Myklebust (3):
  Merge tag 'fscache-fixes-20200508-2' of 
git://git.kernel.org/.../dhowells/linux-fs
  NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
  NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs

 fs/cachefiles/rdwr.c | 12 ++--
 fs/nfs/fscache.c | 39 ++-
 fs/nfs/mount_clnt.c  |  3 ++-
 fs/nfs/nfs4proc.c|  2 +-
 fs/nfs/nfs4state.c   |  2 +-
 fs/nfs/pagelist.c|  5 +++--
 fs/nfs/pnfs_nfs.c|  3 ++-
 fs/nfs/super.c   |  1 -
 fs/nfs/write.c   |  4 ++--
 net/sunrpc/clnt.c|  9 +++--
 10 files changed, 42 insertions(+), 38 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH -next] nfs: fsinfo: fix build when CONFIG_NFS_V4 is not enabled

2020-05-15 Thread Trond Myklebust
On Fri, 2020-05-15 at 10:27 -0700, Randy Dunlap wrote:
> From: Randy Dunlap 
> 
> Fix multiple build errors when CONFIG_NFS_V4 is not enabled.
> 
> ../fs/nfs/fsinfo.c: In function 'nfs_fsinfo_get_supports':
> ../fs/nfs/fsinfo.c:153:12: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if (server->attr_bitmask[0] & FATTR4_WORD0_SIZE)
> ^~
> ../fs/nfs/fsinfo.c:155:12: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if (server->attr_bitmask[1] & FATTR4_WORD1_NUMLINKS)
> ^~
> ../fs/nfs/fsinfo.c:158:12: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if (server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE)
> ^~
> ../fs/nfs/fsinfo.c:160:12: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN)
> ^~
> ../fs/nfs/fsinfo.c:162:12: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM)
> ^~
> ../fs/nfs/fsinfo.c: In function 'nfs_fsinfo_get_features':
> ../fs/nfs/fsinfo.c:205:12: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if (server->attr_bitmask[0] & FATTR4_WORD0_CASE_INSENSITIVE)
> ^~
> ../fs/nfs/fsinfo.c:207:13: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   if ((server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE) ||
>  ^~
> ../fs/nfs/fsinfo.c:208:13: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN) ||
>  ^~
> ../fs/nfs/fsinfo.c:209:13: error: 'const struct nfs_server' has no
> member named 'attr_bitmask'
>   (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM))
>  ^~
> 
> 
> Signed-off-by: Randy Dunlap 
> Cc: linux-...@vger.kernel.org
> Cc: Trond Myklebust 
> Cc: Anna Schumaker 
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: David Howells 
> ---
>  fs/nfs/fsinfo.c |5 +
>  1 file changed, 5 insertions(+)
> 
> --- linux-next-20200515.orig/fs/nfs/fsinfo.c
> +++ linux-next-20200515/fs/nfs/fsinfo.c
> @@ -5,6 +5,7 @@
>   * Written by David Howells (dhowe...@redhat.com)
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include "internal.h"
> @@ -150,6 +151,7 @@ static int nfs_fsinfo_get_supports(struc
>   sup->stx_mask |= STATX_CTIME;
>   if (server->caps & NFS_CAP_MTIME)
>   sup->stx_mask |= STATX_MTIME;
> +#if IS_ENABLED(CONFIG_NFS_V4)
>   if (server->attr_bitmask[0] & FATTR4_WORD0_SIZE)
>   sup->stx_mask |= STATX_SIZE;
>   if (server->attr_bitmask[1] & FATTR4_WORD1_NUMLINKS)
> @@ -161,6 +163,7 @@ static int nfs_fsinfo_get_supports(struc
>   sup->win_file_attrs |= ATTR_HIDDEN;
>   if (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM)
>   sup->win_file_attrs |= ATTR_SYSTEM;
> +#endif
>  
>   sup->stx_attributes = STATX_ATTR_AUTOMOUNT;
>   return sizeof(*sup);
> @@ -202,12 +205,14 @@ static int nfs_fsinfo_get_features(struc
>   if (server->caps & NFS_CAP_MTIME)
>   fsinfo_set_feature(ft, FSINFO_FEAT_HAS_MTIME);
>  
> +#if IS_ENABLED(CONFIG_NFS_V4)
>   if (server->attr_bitmask[0] & FATTR4_WORD0_CASE_INSENSITIVE)
>   fsinfo_set_feature(ft, FSINFO_FEAT_NAME_CASE_INDEP);
>   if ((server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE) ||
>   (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN) ||
>   (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM))
>   fsinfo_set_feature(ft, FSINFO_FEAT_WINDOWS_ATTRS);
> +#endif
>  
>   return sizeof(*ft);
>  }

This whole thing needs to be reviewed and acked by the NFS community,
and quite frankly I'm inclined to NAK this. This is the second time
David tries to push this unwanted rewrite of totally unrelated code.

> 
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: linux-next: Tree for May 15 (nfs/fsinfo.c)

2020-05-15 Thread Trond Myklebust
On Fri, 2020-05-15 at 16:32 +, Trond Myklebust wrote:
> Hi Randy,
> 
> On Fri, 2020-05-15 at 08:28 -0700, Randy Dunlap wrote:
> > On 5/15/20 5:42 AM, Stephen Rothwell wrote:
> > > Hi all,
> > > 
> > > Changes since 20200514:
> > > 
> > > -
> > > ---
> > 
> > on i386:
> > 
> >   CC  fs/nfs/fsinfo.o
> > ../fs/nfs/fsinfo.c: In function ‘nfs_fsinfo_get_supports’:
> > ../fs/nfs/fsinfo.c:153:12: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if (server->attr_bitmask[0] & FATTR4_WORD0_SIZE)
> > ^~
> > ../fs/nfs/fsinfo.c:155:12: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if (server->attr_bitmask[1] & FATTR4_WORD1_NUMLINKS)
> > ^~
> > ../fs/nfs/fsinfo.c:158:12: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if (server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE)
> > ^~
> > ../fs/nfs/fsinfo.c:160:12: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN)
> > ^~
> > ../fs/nfs/fsinfo.c:162:12: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM)
> > ^~
> > ../fs/nfs/fsinfo.c: In function ‘nfs_fsinfo_get_features’:
> > ../fs/nfs/fsinfo.c:205:12: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if (server->attr_bitmask[0] & FATTR4_WORD0_CASE_INSENSITIVE)
> > ^~
> > ../fs/nfs/fsinfo.c:207:13: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   if ((server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE) ||
> >  ^~
> > ../fs/nfs/fsinfo.c:208:13: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN) ||
> >  ^~
> > ../fs/nfs/fsinfo.c:209:13: error: ‘const struct nfs_server’ has no
> > member named ‘attr_bitmask’
> >   (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM))
> >  ^~
> > 
> > Full randconfig file is attached.
> > 
> 
> Where is this file coming from? I'm not aware of any fs/nfs/fsinfo.c
> in
> the current tree or in my linux-next for 5.7, and a cursory glance is
> showing it up in Anna's linux-next for the 5.8 merge window.

...is not showing it up in Anna's linux-next.


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: linux-next: Tree for May 15 (nfs/fsinfo.c)

2020-05-15 Thread Trond Myklebust
Hi Randy,

On Fri, 2020-05-15 at 08:28 -0700, Randy Dunlap wrote:
> On 5/15/20 5:42 AM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20200514:
> > 
> > -
> > ---
> 
> on i386:
> 
>   CC  fs/nfs/fsinfo.o
> ../fs/nfs/fsinfo.c: In function ‘nfs_fsinfo_get_supports’:
> ../fs/nfs/fsinfo.c:153:12: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if (server->attr_bitmask[0] & FATTR4_WORD0_SIZE)
> ^~
> ../fs/nfs/fsinfo.c:155:12: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if (server->attr_bitmask[1] & FATTR4_WORD1_NUMLINKS)
> ^~
> ../fs/nfs/fsinfo.c:158:12: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if (server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE)
> ^~
> ../fs/nfs/fsinfo.c:160:12: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN)
> ^~
> ../fs/nfs/fsinfo.c:162:12: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM)
> ^~
> ../fs/nfs/fsinfo.c: In function ‘nfs_fsinfo_get_features’:
> ../fs/nfs/fsinfo.c:205:12: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if (server->attr_bitmask[0] & FATTR4_WORD0_CASE_INSENSITIVE)
> ^~
> ../fs/nfs/fsinfo.c:207:13: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   if ((server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE) ||
>  ^~
> ../fs/nfs/fsinfo.c:208:13: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN) ||
>  ^~
> ../fs/nfs/fsinfo.c:209:13: error: ‘const struct nfs_server’ has no
> member named ‘attr_bitmask’
>   (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM))
>  ^~
> 
> Full randconfig file is attached.
> 

Where is this file coming from? I'm not aware of any fs/nfs/fsinfo.c in
the current tree or in my linux-next for 5.7, and a cursory glance is
showing it up in Anna's linux-next for the 5.8 merge window.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH 0/5] cachefiles, nfs: Fixes

2020-05-10 Thread Trond Myklebust
Hi David,

On Fri, 2020-05-08 at 23:16 +0100, David Howells wrote:
> Hi Linus, Trond, Anna,
> 
> Can you pull these fixes for cachefiles and NFS's use of
> fscache?  Should
> they go through the NFS tree or directly upstream?  The things fixed
> are:
> 
>  (1) The reorganisation of bmap() use accidentally caused the return
> value
>  of cachefiles_read_or_alloc_pages() to get corrupted.
> 
>  (2) The NFS superblock index key accidentally got changed to include
> a
>  number of kernel pointers - meaning that the key isn't matchable
> after
>  a reboot.
> 
>  (3) A redundant check in nfs_fscache_get_super_cookie().
> 
>  (4) The NFS change_attr sometimes set in the auxiliary data for the
>  caching of an file and sometimes not, which causes the cache to
> get
>  discarded when it shouldn't.
> 
>  (5) There's a race between cachefiles_read_waiter() and
>  cachefiles_read_copier() that causes an occasional assertion
> failure.
> 
> The patches are tagged here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-
> fs.git
>   tag fscache-fixes-20200508-2
> 
> Thanks,
> David
> ---
> Dave Wysochanski (3):
>   NFS: Fix fscache super_cookie index_key from changing after
> umount
>   NFS: Fix fscache super_cookie allocation
>   NFSv4: Fix fscache cookie aux_data to ensure change_attr is
> included
> 
> David Howells (1):
>   cachefiles: Fix corruption of the return value in
> cachefiles_read_or_alloc_pages()
> 
> Lei Xue (1):
>   cachefiles: Fix race between read_waiter and read_copier
> involving op->to_do
> 
> 
>  fs/cachefiles/rdwr.c |   12 ++--
>  fs/nfs/fscache.c |   39 ++-
>  fs/nfs/super.c   |    1 -
>  3 files changed, 24 insertions(+), 28 deletions(-)
> 

I can pull this branch, and send it together with the NFS client
bugfixes for 5.7-rc5.


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes

2020-05-02 Thread Trond Myklebust
Hi Linus,

The following changes since commit ae83d0b416db002fe95601e7f97f64b59514d936:

  Linux 5.7-rc2 (2020-04-19 14:35:30 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.7-4

for you to fetch changes up to 9c07b75b80eeff714420fb6a4c880b284e529d0f:

  NFS: Fix a race in __nfs_list_for_each_server() (2020-04-30 15:08:26 -0400)

Thanks
  Trond


NFS client bugfixes for Linux 5.7

Highlights include:

Stable fixes
- fix handling of backchannel binding in BIND_CONN_TO_SESSION

Bugfixes
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion

Cleanups
- Remove unreachable error conditions


Andreas Gruenbacher (1):
  nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl

Chuck Lever (3):
  xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
  xprtrdma: Fix trace point use-after-free race
  xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}

NeilBrown (1):
  SUNRPC: defer slow parts of rpc_free_client() to a workqueue.

Olga Kornievskaia (1):
  NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION

Trond Myklebust (4):
  NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn 
completion
  NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
  Merge tag 'nfs-rdma-for-5.7-2' of 
git://git.linux-nfs.org/projects/anna/linux-nfs
  NFS: Fix a race in __nfs_list_for_each_server()

Xiyu Yang (2):
  SUNRPC: Remove unreachable error condition
  NFSv4: Remove unreachable error condition due to rpc_run_task()

 fs/nfs/nfs3acl.c   | 22 +++---
 fs/nfs/nfs4proc.c  | 11 +--
 fs/nfs/pnfs.c  | 11 +--
 fs/nfs/super.c |  2 +-
 include/linux/nfs_xdr.h|  2 ++
 include/linux/sunrpc/clnt.h| 13 -
 include/trace/events/rpcrdma.h | 12 
 net/sunrpc/clnt.c  | 24 ++--
 net/sunrpc/xprtrdma/rpc_rdma.c | 15 +++
 net/sunrpc/xprtrdma/verbs.c|  3 ++-
 10 files changed, 79 insertions(+), 36 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH 0/4] allow multiple kthreadd's

2020-05-01 Thread Trond Myklebust
On Fri, 2020-05-01 at 14:49 -0400, J. Bruce Fields wrote:
> On Fri, May 01, 2020 at 02:21:54PM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, May 01, 2020 at 10:59:24AM -0700, Linus Torvalds wrote:
> > > Which kind of makes me want to point a finger at Tejun. But it's
> > > been
> > > mostly PeterZ touching this file lately..
> > 
> > Looks fine to me too. I don't quite understand the usecase tho. It
> > looks
> > like all it's being used for is to tag some kthreads as belonging
> > to the
> > same group.
> 
> Pretty much.
> 

Wen running an instance of knfsd from inside a container, you want to
be able to have the knfsd kthreads be parented to the container init
process so that they get killed off when the container is killed.

Right now, we can easily leak those kernel threads simply by killing
the container.

> > Can't that be done with kthread_data()?
> 
> Huh, maybe so, thanks.
> 
> I need to check this from generic file locking code that could be run
> by
> any task--but I assume there's an easy way I can check if I'm a
> kthread
> before calling  kthread_data(current).
> 
> I do expect to expose a delegation interface for userspace servers
> eventually too.  But we could do the tgid check for them and still
> use
> kthread_data() for nfsd.  That might work.
> 
> --b.
> 
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: About patch NFS: Fix O_DIRECT accounting of number of bytes read/written

2019-10-15 Thread Trond Myklebust
On Wed, 2019-10-16 at 01:00 +, Su, Yanjun wrote:
> Hi trond,
> Because My mail system cant receive nfs mail list’s mails, I reply
> your patch here.
> I have some question for the patch.
> 
> > No. Basic O_DIRECT does not guarantee atomicity of requests, which
> > is
> > why we do not have generic locking at the VFS level when reading
> > and
> > writing. The only guarantee being offered is that O_DIRECT and
> > buffered
> > writes do not collide.
> Do you mean other fs also cant guarantee atomicity of O_DIRECT
> request or just nfs?
> 
> > IOW: I think the basic premise for this test is just broken (as I
> > commented in the patch series I sent) because it is assuming a
> > behaviour that is simply not guaranteed.
> So the generic/465 of xfstests can’t apply to nfs for now, am I
> right?

As far as I can see, it is does not belong in the 'generic' category at
all.

> 
> > BTW: note that buffered writes have the same property. They are
> > ordered
> > when being written into the page cache, meaning that reads on the
> > same
> > client will see no holes, however if you try to read from another
> > client, then you will see the same behaviour, with temporary holes
> > magically appearing in the file.
> As you say, nfs buffered write also has the hole problem with
> multiple r/w on different clients.
> I want to know if the problem exists in other local fs such as
> xfs,ext4?
> 

There is no VFS locking that enforces any serialisation for O_DIRECT.
So unless the filesystem implements its own serialisation (which xfs
does) then there is no guarantee.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: Fix O_DIRECT read problem when another write is going on

2019-09-30 Thread Trond Myklebust
Hi Su,

On Mon, 2019-09-30 at 17:11 +0800, Su Yanjun wrote:
> In xfstests generic/465 tests failed. Because O_DIRECT r/w use
> async rpc calls, when r/w rpc calls are running concurrently we
> may read partial data which is wrong.
> 
> For example as follows.
>  user buffer
> /\
> >||
>  rpc0 rpc1
> 
> When rpc0 runs it encounters eof so return 0, then another writes
> something. When rpc1 runs it returns some data. The total data
> buffer contains wrong data.
> 
> In this patch we check eof mark for each direct request. If
> encounters
> eof then set eof mark in the request, when we meet it again report
> -EAGAIN error. In nfs_direct_complete we convert -EAGAIN as if read
> nothing. When the reader issue another read it will read ok.
> 
> Signed-off-by: Su Yanjun 
> ---
>  fs/nfs/direct.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 222d711..7f737a3 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -93,6 +93,7 @@ struct nfs_direct_req {
>   bytes_left, /* bytes left to be
> sent */
>   error;  /* any reported error
> */
>   struct completion   completion; /* wait for i/o completion */
> + int eof;/* eof mark in the
> req */
>  
>   /* commit state */
>   struct nfs_mds_commit_info mds_cinfo;   /* Storage for cinfo
> */
> @@ -380,6 +381,12 @@ static void nfs_direct_complete(struct
> nfs_direct_req *dreq)
>  {
>   struct inode *inode = dreq->inode;
>  
> + /* read partial data just as read nothing */
> + if (dreq->error == -EAGAIN) {
> + dreq->count = 0;
> + dreq->error = 0;
> + }
> +
>   inode_dio_end(inode);
>  
>   if (dreq->iocb) {
> @@ -413,8 +420,13 @@ static void nfs_direct_read_completion(struct
> nfs_pgio_header *hdr)
>   if (hdr->good_bytes != 0)
>   nfs_direct_good_bytes(dreq, hdr);
>  
> - if (test_bit(NFS_IOHDR_EOF, >flags))
> + if (dreq->eof)
> + dreq->error = -EAGAIN;
> +
> + if (test_bit(NFS_IOHDR_EOF, >flags)) {
>   dreq->error = 0;
> + dreq->eof = 1;
> + }
>  
>   spin_unlock(>lock);
>  

Thanks for looking into this issue. I agree with your analysis of what
is going wrong in generic/465.

However, I think the problem is greater than just EOF. I think we also
need to look at the generic error handling, and ensure that it handles
a truncated RPC call in the middle of a series of calls correctly.

Please see the two patches I sent you just now and check if they fix
the problem for you.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: PROBLEM: nfs? crash in Linux 5.3 (possible regression)

2019-09-24 Thread Trond Myklebust
On Tue, 2019-09-24 at 13:55 -0400, Nick Bowler wrote:
> On 9/20/19, Nick Bowler  wrote:
> > On 9/20/19, Trond Myklebust  wrote:
> > > On Fri, 2019-09-20 at 14:23 -0400, Nick Bowler wrote:
> > > > Not sure how reproducible this is.  Since I've never seen a
> > > > crash
> > > > like this before it may be a regression compared to, say, Linux
> > > > 4.19
> > > > but I am not certain because this particular machine is brand
> > > > new so
> > > > I don't have experience with older kernels on it...
> > 
> > So it actually seems pretty reliably reproducible, 4 attempts to
> > compile
> > Linux on Linux 5.3 and all four crash the same way, although
> > there's
> > definitely some randomness here...
> > 
> > On the other hand, I cannot reproduce if I install Linux 5.2 so it
> > does
> > seem like a regression in 5.3.  I will see how well bisecting
> > goes...
> 
> Not well I guess?  I'm not sure what it's doing but for some reason
> git
> bisect does not seem to be converging towards any solution.  I run
> for
> several hours before marking 'good' and it only seems to be reducing
> the
> '# of revisions left to test' by a tiny amount, not the ~half like
> I'd expect so bleh...
> 
> Any ideas what could be wrong?  Hints on how to reproduce faster?

As I said, this is a keyring bug. It has nothing whatsoever to do with
NFS, so I have no advice.

> 
> git bisect start
> # bad: [4d856f72c10ecb060868ed10ff1b1453943fc6c8] Linux 5.3
> git bisect bad 4d856f72c10ecb060868ed10ff1b1453943fc6c8
> # good: [0ecfebd2b52404ae0c54a878c872bb93363ada36] Linux 5.2
> git bisect good 0ecfebd2b52404ae0c54a878c872bb93363ada36
> # skip: [c236b6dd48dcf2ae6ed14b9068830eccc3e181e6] Merge tag
> 'keys-request-20190626' of
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect skip c236b6dd48dcf2ae6ed14b9068830eccc3e181e6
> # skip: [028db3e290f15ac509084c0fc3b9d021f668f877] Revert "Merge tag
> 'keys-acl-20190703' of
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs"
> git bisect skip 028db3e290f15ac509084c0fc3b9d021f668f877
> # bad: [002c5f73c508f7df5681bda339831c27f3c1aef4] KVM: x86/mmu:
> Reintroduce fast invalidate/zap for flushing memslot
> git bisect bad 002c5f73c508f7df5681bda339831c27f3c1aef4
> # bad: [d3464ccd105b42f87302572ee1f097e6e0b432c6] Merge tag
> 'dmaengine-fix-5.3' of git://git.infradead.org/users/vkoul/slave-dma
> git bisect bad d3464ccd105b42f87302572ee1f097e6e0b432c6
> # bad: [0445971000375859008414f87e7c72fa0d809cf8] Merge tag
> 'mmc-v5.3-rc7' of
> git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
> git bisect bad 0445971000375859008414f87e7c72fa0d809cf8
> # bad: [046ddeed0461b5d270470c253cbb321103d048b6] KVM: Check
> preempted_in_kernel for involuntary preemption
> git bisect bad 046ddeed0461b5d270470c253cbb321103d048b6
> # skip: [8f6ccf6159aed1f04c6d179f61f6fb2691261e84] Merge tag
> 'clone3-v5.3' of
> git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
> git bisect skip 8f6ccf6159aed1f04c6d179f61f6fb2691261e84
> # good: [d84f6269ce24eb4c468e246b24fc0fdce34ab6f6] crypto: ccree -
> check that cryptocell reset completed
> git bisect good d84f6269ce24eb4c468e246b24fc0fdce34ab6f6
> # good: [5f92229d184b80712a8b94d098318960171ae749] ASoC: mxs:
> mxs-sgtl5000: don't select unnecessary Platform
> git bisect good 5f92229d184b80712a8b94d098318960171ae749
> # good: [a2928d28643e3c064ff41397281d20c445525032] r8169: use paged
> versions of phylib MDIO access functions
> git bisect good a2928d28643e3c064ff41397281d20c445525032
> # good: [1b2fc358ddfb1b0915922e441182cda7043f5116] perf tools: Add
> missing util.h to pick up 'page_size' variable
> git bisect good 1b2fc358ddfb1b0915922e441182cda7043f5116
> # good: [a011b49f4ed7813777a15da12a426ab939c58f14] net/mlx5e:
> Consider
> XSK in XDP MTU limit calculation
> git bisect good a011b49f4ed7813777a15da12a426ab939c58f14
> # good: [89a237aa84c7047cafba99f5dc81983ed0c40704] staging: kpc2000:
> Use '%llx' for printing 'long long int' type
> git bisect good 89a237aa84c7047cafba99f5dc81983ed0c40704
> # good: [60e8523e2ea18dc0c0cea69d6c1d69a065019062] ocxl: Allow
> contexts to be attached with a NULL mm
> git bisect good 60e8523e2ea18dc0c0cea69d6c1d69a065019062
> # good: [452181936931f0f08923aba5e04e1e9ef58c389f] afs: Trace
> afs_server usage
> git bisect good 452181936931f0f08923aba5e04e1e9ef58c389f
> # good: [78ff751f8e6a9446e9fb26b2bff0b8d3f8974cbd] scsi: mac_scsi:
> Fix
> pseudo DMA implementation, take 2
> git bisect good 78ff751f8e6a9446e9fb26b2bff0b8d3f8974cbd
> # good: [5315f9d40191f98abcd3164e632a8a8f737b1cf0] rtlwifi: remove
> redundant assi

Re: PROBLEM: nfs? crash in Linux 5.3 (possible regression)

2019-09-20 Thread Trond Myklebust
On Fri, 2019-09-20 at 14:23 -0400, Nick Bowler wrote:
> Hi all,
> 
> I hit this oops on Linux 5.3 yesterday.  The crash itself occurred
> while
> compiling Linux (source and build dirs on NFS).  Afterwards, the
> system
> remained mostly alive but my NFS mounts became very busted with lots
> (but not all) I/O operations appearing to hang forever.
> 
> Not sure how reproducible this is.  Since I've never seen a crash
> like this before it may be a regression compared to, say, Linux 4.19
> but I am not certain because this particular machine is brand new so
> I don't have experience with older kernels on it...
> 
> Full dmesg is attached (gzipped).
> 
> Let me know if you need any more info.
> 
> [  796.050025] BUG: kernel NULL pointer dereference, address:
> 0014
> [  796.051280] #PF: supervisor read access in kernel mode
> [  796.053063] #PF: error_code(0x) - not-present page
> [  796.054636] PGD 0 P4D 0
> [  796.055688] Oops:  [#1] PREEMPT SMP
> [  796.056768] CPU: 2 PID: 190 Comm: kworker/2:2 Tainted: GW
>   5.3.0 #6
> [  796.057953] Hardware name: To Be Filled By O.E.M. To Be Filled By
> O.E.M./B450 Gaming-ITX/ac, BIOS P3.30 05/17/2019
> [  796.059329] Workqueue: events key_garbage_collector
> [  796.060623] RIP: 0010:keyring_gc_check_iterator+0x27/0x30

That would be the keyring garbage collector, not NFS.

Cced keyri...@vger.kernel.org


> [  796.061845] Code: 44 00 00 48 83 e7 fc b8 01 00 00 00 f6 87 80 00
> 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0
> 00 00 00 <0f> b6 40 14 c3 0f 1f 40 00 48 83 e7 fc e9 27 eb ff ff 0f
> 1f
> 80 00
> [  796.064638] RSP: 0018:b40fc0757df8 EFLAGS: 00010282
> [  796.066058] RAX:  RBX: a14338caed80 RCX:
> b40fc0757e40
> [  796.067531] RDX: a1433ae85558 RSI: b40fc0757e40 RDI:
> a1433ae85500
> [  796.069014] RBP: b40fc0757e40 R08:  R09:
> 000f
> [  796.070513] R10: 8080808080808080 R11: 0001 R12:
> a4cd6180
> [  796.072025] R13: a14338caee10 R14: a14338caedf0 R15:
> a1433ffeff00
> [  796.073567] FS:  () GS:a1434048()
> knlGS:
> [  796.075171] CS:  0010 DS:  ES:  CR0: 80050033
> [  796.076785] CR2: 0014 CR3: 000747ce6000 CR4:
> 003406e0
> [  796.078445] Call Trace:
> [  796.080091]  assoc_array_subtree_iterate+0x55/0x100
> [  796.081770]  keyring_gc+0x3f/0x80
> [  796.083447]  key_garbage_collector+0x330/0x3d0
> [  796.085155]  process_one_work+0x1cb/0x320
> [  796.086869]  worker_thread+0x28/0x3c0
> [  796.088603]  ? process_one_work+0x320/0x320
> [  796.090335]  kthread+0x106/0x120
> [  796.092053]  ? kthread_create_on_node+0x40/0x40
> [  796.093810]  ret_from_fork+0x1f/0x30
> [  796.095569] Modules linked in: sha1_ssse3 sha1_generic cbc cts
> rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace ext4 crc16 mbcache
> jbd2 iwlmvm mac80211 libarc4 amdgpu iwlwifi snd_hda_codec_realtek
> snd_hda_codec_generic kvm_amd gpu_sched kvm snd_hda_codec_hdmi
> drm_kms_helper irqbypass k10temp syscopyarea sysfillrect sysimgblt
> fb_sys_fops video ttm cfg80211 snd_hda_intel snd_hda_codec drm
> snd_hwdep rfkill snd_hda_core backlight snd_pcm evdev snd_timer snd
> soundcore efivarfs dm_crypt hid_generic igb hwmon i2c_algo_bit sr_mod
> cdrom sunrpc dm_mod
> [  796.104033] CR2: 0014
> [  796.106304] ---[ end trace 695aee10f9202347 ]---
> [  796.108585] RIP: 0010:keyring_gc_check_iterator+0x27/0x30
> [  796.110894] Code: 44 00 00 48 83 e7 fc b8 01 00 00 00 f6 87 80 00
> 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0
> 00 00 00 <0f> b6 40 14 c3 0f 1f 40 00 48 83 e7 fc e9 27 eb ff ff 0f
> 1f
> 80 00
> [  796.115773] RSP: 0018:b40fc0757df8 EFLAGS: 00010282
> [  796.118209] RAX:  RBX: a14338caed80 RCX:
> b40fc0757e40
> [  796.120683] RDX: a1433ae85558 RSI: b40fc0757e40 RDI:
> a1433ae85500
> [  796.123176] RBP: b40fc0757e40 R08:  R09:
> 000f
> [  796.125668] R10: 8080808080808080 R11: 0001 R12:
> a4cd6180
> [  796.128104] R13: a14338caee10 R14: a14338caedf0 R15:
> a1433ffeff00
> [  796.130493] FS:  () GS:a1434048()
> knlGS:
> [  796.132923] CS:  0010 DS:  ES:  CR0: 80050033
> [  796.135266] CR2: 0014 CR3: 000747ce6000 CR4:
> 003406e0
> 
> Thanks,
>   Nick
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Regression in 5.1.20: Reading long directory fails

2019-09-12 Thread Trond Myklebust
On Thu, 2019-09-12 at 09:13 -0400, J. Bruce Fields wrote:
> (Unless I'm missing something.  I haven't looked at this code in a
> while.  Though it was problem me that wrote it originally--apologies
> for
> that)
> 

The function itself is fine. It was just the name I'm objecting to,
since we're actually moving the last 'n' bytes in the message in order
to be able to read them.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Regression in 5.1.20: Reading long directory fails

2019-09-12 Thread Trond Myklebust
On Thu, 2019-09-12 at 09:08 -0400, Benjamin Coddington wrote:
> On 12 Sep 2019, at 8:53, Trond Myklebust wrote:
> > Let's please just scrap this function and rewrite it as a generic
> > function for reading the MIC. It clearly is not a generic function
> > for
> > reading arbitrary netobjs, and modifications like the above just
> > make
> > the misnomer painfully obvious.
> > 
> > Let's rewrite it as xdr_buf_read_mic() so that we can simplify it
> > where
> > possible.
> 
> Ok.  I want to assume the mic will not land in the head, but I am not
> sure..
> Is there a scenario where the mic might land in the head, or is that
> bit of
> the current function left over from other uses?

It is likely to land in the head for most RPC calls that have short
replies. Particularly in error cases.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Regression in 5.1.20: Reading long directory fails

2019-09-12 Thread Trond Myklebust
On Thu, 2019-09-12 at 08:29 -0400, Benjamin Coddington wrote:
> On 11 Sep 2019, at 13:54, Chuck Lever wrote:
> 
> > > On Sep 11, 2019, at 1:50 PM, Benjamin Coddington 
> > >  wrote:
> > > 
> > > On 11 Sep 2019, at 13:40, Benjamin Coddington wrote:
> > > 
> > > > On 11 Sep 2019, at 13:29, Chuck Lever wrote:
> > > > 
> > > > > > On Sep 11, 2019, at 1:26 PM, Benjamin Coddington 
> > > > > >  wrote:
> > > > > > 
> > > > > > 
> > > > > > On 11 Sep 2019, at 12:39, Chuck Lever wrote:
> > > > > > 
> > > > > > > > On Sep 11, 2019, at 12:25 PM, Benjamin Coddington 
> > > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > Instead, I think we want to make sure the mic falls
> > > > > > > > squarely 
> > > > > > > > into the tail
> > > > > > > > every time.
> > > > > > > 
> > > > > > > I'm not clear how you could do that. The length of the
> > > > > > > page data 
> > > > > > > is not
> > > > > > > known to the client before it parses the reply. Are you 
> > > > > > > suggesting that
> > > > > > > gss_unwrap should do it somehow?
> > > > > > 
> > > > > > Is it too niave to always put the mic at the end of the
> > > > > > tail?
> > > > > 
> > > > > The size of the page content is variable.
> > > > > 
> > > > > The only way the MIC will fall into the tail is if the page
> > > > > content 
> > > > > is
> > > > > exactly the largest expected size. When the page content is
> > > > > smaller 
> > > > > than
> > > > > that, the receive logic will place part or all of the MIC in 
> > > > > ->pages.
> > > > 
> > > > Ok, right.  But what I meant is that xdr_buf_read_netobj()
> > > > should be 
> > > > renamed
> > > > and repurposed to be "move the mic from wherever it is to the
> > > > end of
> > > > xdr_buf's tail".
> > > > 
> > > > But now I see what you mean, and I also see that it is already 
> > > > trying to do
> > > > that.. and we don't want to overlap the copy..
> > > > 
> > > > So, really, we need the tail to be larger than twice the mic..
> > > > less 
> > > > 1.  That
> > > > means the fix is probably just increasing rslack for krb5i.
> > > 
> > > .. or we can keep the tighter tail space, and if we detect the
> > > mic 
> > > straddles
> > > the page and tail, we can move the mic into the tail with 2
> > > copies, 
> > > first
> > > move the bit in the tail back, then move the bit in the pages.
> > > 
> > > Which is preferred, less allocation, or in the rare case this
> > > occurs, 
> > > doing
> > > copy twice?
> > 
> > It sounds like the bug is that the current code does not deal 
> > correctly
> > when the MIC crosses the boundary between ->pages and ->tail? I'd
> > like
> > to see that addressed rather than changing rslack.
> 
> Here's what I'm about to run through my testing:
> 
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 48c93b9e525e..d6ffc9011269 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -1238,14 +1238,21 @@ EXPORT_SYMBOL_GPL(xdr_encode_word);
> 
>   /* If the netobj starting offset bytes from the start of xdr_buf
> is 
> contained
>* entirely in the head or the tail, set object to point to it; 
> otherwise
> - * try to find space for it at the end of the tail, copy it there,
> and
> - * set obj to point to it. */
> + * try to find space for it at the end of the tail, and copy it
> there.  
> If
> + * the netobj is partly within the page data and tail, shrink the
> pages 
> to
> + * move the object into the tail */
>   int xdr_buf_read_netobj(struct xdr_buf *buf, struct xdr_netobj
> *obj, 
> unsigned int offset)
>   {
>  struct xdr_buf subbuf;
> +   unsigned int page_range;
> 
>  if (xdr_decode_word(buf, offset, >len))
>  return -EFAULT;
> +
> +   page_range = buf->head->iov_len + buf->page_len - offset + 4;
> +   if (page_range > 0 && page_range < obj->len)
> +   xdr_shrink_pagelen(buf, page_range);
> +
>  if (xdr_buf_subsegment(buf, , offset + 4, obj->len))
>  return -EFAULT;
> 
> 

Let's please just scrap this function and rewrite it as a generic
function for reading the MIC. It clearly is not a generic function for
reading arbitrary netobjs, and modifications like the above just make
the misnomer painfully obvious.

Let's rewrite it as xdr_buf_read_mic() so that we can simplify it where
possible.


Thanks
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH v3 16/26] NFS: Move mount parameterisation bits into their own file

2019-09-11 Thread Trond Myklebust
On Wed, 2019-09-11 at 14:24 -0400, Chuck Lever wrote:
> > On Sep 11, 2019, at 12:16 PM, Scott Mayhew 
> > wrote:
> > 
> > From: David Howells 
> > 
> > Split various bits relating to mount parameterisation out from
> > fs/nfs/super.c into their own file to form the basis of filesystem
> > context
> > handling for NFS.
> > 
> > No other changes are made to the code beyond removing 'static'
> > qualifiers.
> > 
> > Signed-off-by: David Howells 
> > Signed-off-by: Al Viro 
> > ---
> > fs/nfs/Makefile |2 +-
> > fs/nfs/fs_context.c | 1418
> > +++
> > fs/nfs/internal.h   |   29 +
> > fs/nfs/super.c  | 1411 
> > --
> > 4 files changed, 1448 insertions(+), 1412 deletions(-)
> > create mode 100644 fs/nfs/fs_context.c
> > 
> > diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
> > index 34cdeaecccf6..2433c3e03cfa 100644
> > --- a/fs/nfs/Makefile
> > +++ b/fs/nfs/Makefile
> > @@ -9,7 +9,7 @@ CFLAGS_nfstrace.o += -I$(src)
> > nfs-y   := client.o dir.o file.o getroot.o
> > inode.o super.o \
> >io.o direct.o pagelist.o read.o symlink.o
> > unlink.o \
> >write.o namespace.o mount_clnt.o nfstrace.o
> > \
> > -  export.o sysfs.o
> > +  export.o sysfs.o fs_context.o
> > nfs-$(CONFIG_ROOT_NFS)  += nfsroot.o
> > nfs-$(CONFIG_SYSCTL)+= sysctl.o
> > nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
> > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> > new file mode 100644
> > index ..82b312a5cdde
> > --- /dev/null
> > +++ b/fs/nfs/fs_context.c
> > @@ -0,0 +1,1418 @@
> > +/* NFS mount handling.
> > + *
> > + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> > + * Written by David Howells (dhowe...@redhat.com)
> > + *
> > + * Split from fs/nfs/super.c:
> > + *
> > + *  Copyright (C) 1992  Rick Sladkey
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public Licence
> > + * as published by the Free Software Foundation; either version
> > + * 2 of the Licence, or (at your option) any later version.
> > + */
> 
> New source files should have an SPDX tag instead of boilerplate.
> I suggest:
> 
> // SPDX-License-Identifier: GPL-2.0-only
> 

Agreed. It is also quite a long stretch to claim authorship of the
entire file as implied above. Given that this is mostly a copy-paste
effort, then most of the actual copyrights belong to the people who've
contributed to super.c (and to inode.c before it). David is one of
those authors, but he is one of many.


> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include "nfs.h"
> > +#include "internal.h"
> > +
> > +#define NFSDBG_FACILITYNFSDBG_MOUNT
> > +
> > +#if IS_ENABLED(CONFIG_NFS_V3)
> > +#define NFS_DEFAULT_VERSION 3
> > +#else
> > +#define NFS_DEFAULT_VERSION 2
> > +#endif
> > +
> > +#define NFS_MAX_CONNECTIONS 16
> > +
> > +enum {
> > +   /* Mount options that take no arguments */
> > +   Opt_soft, Opt_softerr, Opt_hard,
> > +   Opt_posix, Opt_noposix,
> > +   Opt_cto, Opt_nocto,
> > +   Opt_ac, Opt_noac,
> > +   Opt_lock, Opt_nolock,
> > +   Opt_udp, Opt_tcp, Opt_rdma,
> > +   Opt_acl, Opt_noacl,
> > +   Opt_rdirplus, Opt_nordirplus,
> > +   Opt_sharecache, Opt_nosharecache,
> > +   Opt_resvport, Opt_noresvport,
> > +   Opt_fscache, Opt_nofscache,
> > +   Opt_migration, Opt_nomigration,
> > +
> > +   /* Mount options that take integer arguments */
> > +   Opt_port,
> > +   Opt_rsize, Opt_wsize, Opt_bsize,
> > +   Opt_timeo, Opt_retrans,
> > +   Opt_acregmin, Opt_acregmax,
> > +   Opt_acdirmin, Opt_acdirmax,
> > +   Opt_actimeo,
> > +   Opt_namelen,
> > +   Opt_mountport,
> > +   Opt_mountvers,
> > +   Opt_minorversion,
> > +
> > +   /* Mount options that take string arguments */
> > +   Opt_nfsvers,
> > +   Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost,
> > +   Opt_addr, Opt_mountaddr, Opt_clientaddr,
> > +   Opt_nconnect,
> > +   Opt_lookupcache,
> > +   Opt_fscache_uniq,
> > +   Opt_local_lock,
> > +
> > +   /* Special mount options */
> > +   Opt_userspace, Opt_deprecated, Opt_sloppy,
> > +
> > +   Opt_err
> > +};
> > +
> > +static const match_table_t nfs_mount_option_tokens = {
> > +   { Opt_userspace, "bg" },
> > +   { Opt_userspace, "fg" },
> > +   { Opt_userspace, "retry=%s" },
> > +
> > +   { Opt_sloppy, "sloppy" },
> > +
> > +   { Opt_soft, "soft" },
> > +   { Opt_softerr, "softerr" },
> > +   { Opt_hard, "hard" },
> > +   { Opt_deprecated, "intr" },
> > +   { Opt_deprecated, "nointr" },
> > +   { Opt_posix, "posix" },
> > +   { Opt_noposix, "noposix" },
> > +   { Opt_cto, "cto" },
> > +   { Opt_nocto, "nocto" },
> > +   { Opt_ac, "ac" },
> > +   { Opt_noac, "noac" },
> > +   { Opt_lock, "lock" },
> > +   { Opt_nolock, "nolock" },
> > +   { Opt_udp, "udp" },
> > +   { Opt_tcp, "tcp" },
> > +   { Opt_rdma, "rdma" },
> > +   { 

[PATCH v2 1/2] NFS: Fix inode fileid checks in attribute revalidation code

2019-09-10 Thread Trond Myklebust
We want to throw out the attrbute if it refers to the mounted on fileid,
and not the real fileid. However we do not want to block cache consistency
updates from NFSv4 writes.

Reported-by: Murphy Zhou 
Fixes: 7e10cc25bfa0 ("NFS: Don't refresh attributes with mounted-on-file...")
Signed-off-by: Trond Myklebust 
Signed-off-by: Christophe Leroy 
---
 fs/nfs/inode.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index c764cfe456e5..2a03bfeec10a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1403,11 +1403,12 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
if (NFS_PROTO(inode)->have_delegation(inode, FMODE_READ))
return 0;
 
-   /* No fileid? Just exit */
-   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID))
-   return 0;
+   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
+   /* Only a mounted-on-fileid? Just exit */
+   if (fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
+   return 0;
/* Has the inode gone and changed behind our back? */
-   if (nfsi->fileid != fattr->fileid) {
+   } else if (nfsi->fileid != fattr->fileid) {
/* Is this perhaps the mounted-on fileid? */
if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
nfsi->fileid == fattr->mounted_on_fileid)
@@ -1807,11 +1808,12 @@ static int nfs_update_inode(struct inode *inode, struct 
nfs_fattr *fattr)
nfs_display_fhandle_hash(NFS_FH(inode)),
atomic_read(>i_count), fattr->valid);
 
-   /* No fileid? Just exit */
-   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID))
-   return 0;
+   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
+   /* Only a mounted-on-fileid? Just exit */
+   if (fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
+   return 0;
/* Has the inode gone and changed behind our back? */
-   if (nfsi->fileid != fattr->fileid) {
+   } else if (nfsi->fileid != fattr->fileid) {
/* Is this perhaps the mounted-on fileid? */
if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
nfsi->fileid == fattr->mounted_on_fileid)
-- 
2.13.3



Re: Regression in 5.1.20: Reading long directory fails

2019-09-08 Thread Trond Myklebust
On Sun, 2019-09-08 at 11:48 -0400, Chuck Lever wrote:
> > On Sep 8, 2019, at 11:19 AM, Trond Myklebust <
> > tron...@hammerspace.com> wrote:
> > 
> > On Sun, 2019-09-08 at 07:39 -0400, Benjamin Coddington wrote:
> > > On 6 Sep 2019, at 16:50, Chuck Lever wrote:
> > > 
> > > > > On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <
> > > > > ti...@math.uh.edu> 
> > > > > wrote:
> > > > > 
> > > > > > > > > > "JBF" == J Bruce Fields 
> > > > > > > > > > writes:
> > > > > 
> > > > > JBF> Those readdir changes were client-side, right?  Based on
> > > > > that 
> > > > > I'd
> > > > > JBF> been assuming a client bug, but maybe it'd be worth
> > > > > getting
> > > > > a 
> > > > > full
> > > > > JBF> packet capture of the readdir reply to make sure it's
> > > > > legit.
> > > > > 
> > > > > I have been working with bcodding on IRC for the past couple
> > > > > of
> > > > > days 
> > > > > on
> > > > > this.  Fortunately I was able to come up with way to fill up
> > > > > a 
> > > > > directory
> > > > > in such a way that it will fail with certainty and as a bonus
> > > > > doesn't
> > > > > include any user data so I can feel OK about sharing packet
> > > > > captures. 
> > > > > I
> > > > > have a capture alongside a kernel trace of the problematic
> > > > > operation 
> > > > > in
> > > > > https://www.math.uh.edu/~tibbs/nfs/.  Not that I can
> > > > > particularly 
> > > > > tell
> > > > > anything useful from that, but bcodding says that it seems to
> > > > > point 
> > > > > to
> > > > > some issue in sunrpc.
> > > > > 
> > > > > And because I can easily reproduce this and I was able to do
> > > > > a 
> > > > > bisect:
> > > > > 
> > > > > 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad
> > > > > commit
> > > > > commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
> > > > > Author: Chuck Lever 
> > > > > Date:   Mon Feb 11 11:25:41 2019 -0500
> > > > > 
> > > > >   SUNRPC: Use au_rslack when computing reply buffer size
> > > > > 
> > > > >   au_rslack is significantly smaller than (au_cslack << 2).
> > > > > Using
> > > > >   that value results in smaller receive buffers. In some
> > > > > cases
> > > > > this
> > > > >   eliminates an extra segment in Reply chunks (RPC/RDMA).
> > > > > 
> > > > >   Signed-off-by: Chuck Lever 
> > > > >   Signed-off-by: Anna Schumaker 
> > > > > 
> > > > > :04 04 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505 
> > > > > 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M  net
> > > > > 
> > > > > But of course, I can't say whether this is the actual bad
> > > > > commit
> > > > > or
> > > > > whether it just introduced a behavior change which alters
> > > > > the 
> > > > > conditions
> > > > > under which the problem appears.
> > > > 
> > > > The first place I'd start looking is the XDR constants at the
> > > > head
> > > > of 
> > > > fs/nfs/nfs4xdr.c
> > > > having to do with READDIR.
> > > > 
> > > > The report of behavior changes with the use of krb5p also makes
> > > > this 
> > > > commit plausible.
> > > 
> > > After sprinkling the printk's, we're coming up one word short in
> > > the 
> > > receive
> > > buffer.  I think we're not accounting for the xdr pad of buf-
> > > >pages
> > > for 
> > > NFS4
> > > readdir -- but I need to check the RFCs.  Anyone know if v4
> > > READDIR 
> > > results
> > > have to be aligned?
> > > 
> > > Also need to check just why krb5i is the only auth that cares..
> > > 
> > 
> > I'm not seeing that. If you look at commit 02ef04e432ba, you'll see
> > that Chuck did add a 'padding term' to decod

Re: Regression in 5.1.20: Reading long directory fails

2019-09-08 Thread Trond Myklebust
On Sun, 2019-09-08 at 07:39 -0400, Benjamin Coddington wrote:
> On 6 Sep 2019, at 16:50, Chuck Lever wrote:
> 
> > > On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <
> > > ti...@math.uh.edu> 
> > > wrote:
> > > 
> > > > > > > > "JBF" == J Bruce Fields  writes:
> > > 
> > > JBF> Those readdir changes were client-side, right?  Based on
> > > that 
> > > I'd
> > > JBF> been assuming a client bug, but maybe it'd be worth getting
> > > a 
> > > full
> > > JBF> packet capture of the readdir reply to make sure it's legit.
> > > 
> > > I have been working with bcodding on IRC for the past couple of
> > > days 
> > > on
> > > this.  Fortunately I was able to come up with way to fill up a 
> > > directory
> > > in such a way that it will fail with certainty and as a bonus
> > > doesn't
> > > include any user data so I can feel OK about sharing packet
> > > captures. 
> > >  I
> > > have a capture alongside a kernel trace of the problematic
> > > operation 
> > > in
> > > https://www.math.uh.edu/~tibbs/nfs/.  Not that I can
> > > particularly 
> > > tell
> > > anything useful from that, but bcodding says that it seems to
> > > point 
> > > to
> > > some issue in sunrpc.
> > > 
> > > And because I can easily reproduce this and I was able to do a 
> > > bisect:
> > > 
> > > 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
> > > commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
> > > Author: Chuck Lever 
> > > Date:   Mon Feb 11 11:25:41 2019 -0500
> > > 
> > >SUNRPC: Use au_rslack when computing reply buffer size
> > > 
> > >au_rslack is significantly smaller than (au_cslack << 2).
> > > Using
> > >that value results in smaller receive buffers. In some cases
> > > this
> > >eliminates an extra segment in Reply chunks (RPC/RDMA).
> > > 
> > >Signed-off-by: Chuck Lever 
> > >Signed-off-by: Anna Schumaker 
> > > 
> > > :04 04 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505 
> > > 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M  net
> > > 
> > > But of course, I can't say whether this is the actual bad commit
> > > or
> > > whether it just introduced a behavior change which alters the 
> > > conditions
> > > under which the problem appears.
> > 
> > The first place I'd start looking is the XDR constants at the head
> > of 
> > fs/nfs/nfs4xdr.c
> > having to do with READDIR.
> > 
> > The report of behavior changes with the use of krb5p also makes
> > this 
> > commit plausible.
> 
> After sprinkling the printk's, we're coming up one word short in the 
> receive
> buffer.  I think we're not accounting for the xdr pad of buf->pages
> for 
> NFS4
> readdir -- but I need to check the RFCs.  Anyone know if v4 READDIR 
> results
> have to be aligned?
> 
> Also need to check just why krb5i is the only auth that cares..
> 

I'm not seeing that. If you look at commit 02ef04e432ba, you'll see
that Chuck did add a 'padding term' to decode_readdir_maxsz in the
NFSv4 case.
The other thing to remember is that a readdir 'dirlist4' entry is
always word aligned (irrespective of the length of the filename), so
there is no padding that needs to be taken into account.

I think we probably rather want to look at how auth->au_ralign is being
calculated for the case of krb5i. I'm really not understanding why
auth->au_ralign should not take into account the presence of the mic.
Chuck?


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS bugfix

2019-09-04 Thread Trond Myklebust
Hi Linus,

One more pull request to fix a NFSv4 regression that affects only the
v5.3-rc series.

The following changes since commit 089cf7f6ecb266b6a4164919a2e69bd2f938374a:

  Linux 5.3-rc7 (2019-09-02 09:57:40 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.3-4

for you to fetch changes up to eb3d8f42231aec65b64b079dd17bd6c008a3fe29:

  NFS: Fix inode fileid checks in attribute revalidation code (2019-09-02 
13:10:19 -0400)


NFS client bugfixes for Linux 5.3

Highlights include:

Bugfixes:
- Fix inode fileid checks in attribute revalidation code


Trond Myklebust (1):
  NFS: Fix inode fileid checks in attribute revalidation code

 fs/nfs/inode.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH 4.19 35/98] NFS: Fix regression whereby fscache errors are appearing on nofsc mounts

2019-08-28 Thread Trond Myklebust
On Wed, 2019-08-28 at 09:11 +0200, Pavel Machek wrote:
> On Tue 2019-08-27 09:50:14, Greg Kroah-Hartman wrote:
> > [ Upstream commit dea1bb35c5f35e0577cfc61f79261d80b8715221 ]
> > 
> > People are reporing seeing fscache errors being reported concerning
> > duplicate cookies even in cases where they are not setting up
> > fscache
> > at all. The rule needs to be that if fscache is not enabled, then
> > it
> > should have no side effects at all.
> > 
> > To ensure this is the case, we disable fscache completely on all
> > superblocks
> > for which the 'fsc' mount option was not set. In order to avoid
> > issues
> > with '-oremount', we also disable the ability to turn fscache on
> > via
> > remount.
> 
> Actually, the code seems to suggest that you disable the ability to
> turn fscache _off_ via remount, too.
> 
> Is that intentional?
> 

Yes. That is intentional. Otherwise we would have to clear all the
fscache cookies from the inodes.

> Best regards,
>   Pavel
> 
> > @@ -2239,6 +2239,7 @@ nfs_compare_remount_data(struct nfs_server
> > *nfss,
> > data->acdirmin != nfss->acdirmin / HZ ||
> > data->acdirmax != nfss->acdirmax / HZ ||
> > data->timeo != (10U * nfss->client->cl_timeout->to_initval
> > / HZ) ||
> > +   (data->options & NFS_OPTION_FSCACHE) != (nfss->options &
> > NFS_OPTION_FSCACHE) ||
> > data->nfs_server.port != nfss->port ||
> > data->nfs_server.addrlen != nfss->nfs_client->cl_addrlen ||
> > !rpc_cmp_addr((struct sockaddr *)>nfs_server.address,
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space




[GIT PULL] Please pull NFS client bugfixes

2019-08-27 Thread Trond Myklebust
Hi Linus,

The following changes since commit d1abaeb3be7b5fa6d7a1fbbd2e14e3310005c4c1:

  Linux 5.3-rc5 (2019-08-18 14:31:08 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.3-3

for you to fetch changes up to 99300a85260c2b7febd57082a617d1062532067e:

  NFS: remove set but not used variable 'mapping' (2019-08-27 10:24:56 -0400)


NFS client bugfixes for Linux 5.3

Highlights include:

Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidat
ed"

Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort for
  soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice.

--------
Trond Myklebust (17):
  NFS: Don't refresh attributes with mounted-on-file information
  NFSv4: Fix return values for nfs4_file_open()
  NFSv4: Fix return value in nfs_finish_open()
  NFSv4/pnfs: Fix a page lock leak in nfs_pageio_resend()
  NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0
  NFS: Fix initialisation of I/O result struct in nfs_pgio_rpcsetup
  NFS: On fatal writeback errors, we need to call nfs_inode_remove_request()
  SUNRPC: Don't handle errors if the bind/connect succeeded
  pNFS/flexfiles: Turn off soft RPC calls
  SUNRPC: Handle EADDRINUSE and ENOBUFS correctly
  Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was 
invalidated"
  SUNRPC: Handle connection breakages correctly in call_status()
  pNFS/flexfiles: Don't time out requests on hard mounts
  NFS: Fix spurious EIO read errors
  NFS: Fix writepage(s) error handling to not report errors twice
  NFSv2: Fix eof handling
  NFSv2: Fix write regression

YueHaibing (1):
  NFS: remove set but not used variable 'mapping'

 fs/nfs/dir.c   |  2 +-
 fs/nfs/direct.c| 27 ---
 fs/nfs/flexfilelayout/flexfilelayout.c | 28 +++-
 fs/nfs/inode.c | 33 ++--
 fs/nfs/internal.h  | 10 
 fs/nfs/nfs4file.c  | 12 -
 fs/nfs/pagelist.c  | 19 --
 fs/nfs/pnfs_nfs.c  | 15 +++
 fs/nfs/proc.c  |  7 +++--
 fs/nfs/read.c  | 35 ++---
 fs/nfs/write.c | 38 ---
 include/linux/sunrpc/sched.h   |  1 -
 net/sunrpc/clnt.c  | 47 +++---
 net/sunrpc/xprt.c  |  7 -
 14 files changed, 163 insertions(+), 118 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Linux-next-20190823: x86_64/i386: prot_hsymlinks.c:325: Failed to run cmd: useradd hsym

2019-08-27 Thread Trond Myklebust
On Tue, 2019-08-27 at 06:25 -0400, Jan Stancek wrote:
> That theory is probably not correct for this case, since EIO I see
> appears
> to originate from write and nfs_writeback_result(). This function
> also
> produces message we saw in logs from Naresh.
> 
> I can't find where/how is resp->count updated on WRITE reply in
> NFSv2.
> Issue also goes away with patch below, though I can't speak about its
> correctness:
> 
> NFS version TypeTestReturn code
> nfsvers=2   tcp -b:base 0
> nfsvers=2   tcp -g:general  0
> nfsvers=2   tcp -s:special  0
> nfsvers=2   tcp -l:lock 0
> Total time: 141
> 
> diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
> index cbc17a203248..4913c6da270b 100644
> --- a/fs/nfs/nfs2xdr.c
> +++ b/fs/nfs/nfs2xdr.c
> @@ -897,6 +897,16 @@ static int nfs2_xdr_dec_writeres(struct rpc_rqst
> *req, struct xdr_stream *xdr,
>  void *data)
>  {
> struct nfs_pgio_res *result = data;
> +   struct rpc_task *rq_task  = req->rq_task;
> +
> +   if (rq_task) {
> +   struct nfs_pgio_args *args = rq_task-
> >tk_msg.rpc_argp;
> +
> +   if (args) {
> +   result->count = args->count;
> +   }
> +   }
>  
> /* All NFSv2 writes are "file sync" writes */
> result->verf->committed = NFS_FILE_SYNC;

Thanks! I've moved the above to nfs_write_done() so that we do it only
on success (see 
http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3ba5688da709dd0f7d917029c206bc1848a6ae74
)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Linux-next-20190823: x86_64/i386: prot_hsymlinks.c:325: Failed to run cmd: useradd hsym

2019-08-26 Thread Trond Myklebust
On Mon, 2019-08-26 at 19:12 -0400, Jan Stancek wrote:
> - Original Message -
> > On Mon, 2019-08-26 at 10:38 -0400, Jan Stancek wrote:
> > > - Original Message -
> > > > Hi Jan and Cyril,
> > > > 
> > > > On Mon, 26 Aug 2019 at 16:35, Jan Stancek 
> > > > wrote:
> > > > > 
> > > > > - Original Message -
> > > > > > Hi!
> > > > > > > Do you see this LTP prot_hsymlinks failure on linux next
> > > > > > > 20190823 on
> > > > > > > x86_64 and i386 devices?
> > > > > > > 
> > > > > > > test output log,
> > > > > > > useradd: failure while writing changes to /etc/passwd
> > > > > > > useradd: /home/hsym was created, but could not be removed
> > > > > > 
> > > > > > This looks like an unrelated problem, failure to write to
> > > > > > /etc/passwd
> > > > > > probably means that filesystem is full or some problem
> > > > > > happend
> > > > > > and how
> > > > > > is remounted RO.
> > > > > 
> > > > > In Naresh' example, root is on NFS:
> > > > >   root=/dev/nfs rw
> > > > >  
> > > > > nfsroot=10.66.16.123:/var/lib/lava/dispatcher/tmp/886412/extr
> > > > > act-
> > > > > nfsrootfs-tyuevoxm,tcp,hard,intr
> > > > 
> > > > Right !
> > > > root is mounted on NFS.
> > > > 
> > > > > 10.66.16.123:/var/lib/lava/dispatcher/tmp/886412/extract-
> > > > > nfsrootfs-tyuevoxm
> > > > > on / type nfs
> > > > > (rw,relatime,vers=2,rsize=4096,wsize=4096,namlen=255,hard,nol
> > > > > ock,
> > > > > proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.66.16.123,
> > > > > moun
> > > > > tvers=1,mountproto=tcp,local_lock=all,addr=10.66.16.123)
> > > > > devtmpfs on /dev type devtmpfs
> > > > > (rw,relatime,size=3977640k,nr_inodes=994410,mode=755)
> > > > > 
> > 
> > The only thing I can think of that might cause an EIO on NFSv2
> > would be
> > this patch
> > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=627d48e597ec5993c4abb3b81dc75e554a07c7c0
> > assuming that a bind-related error is leaking through.
> > 
> > I'd suggest something like the following to fix it up:
> 
> No change with that patch,
> but following one fixes it for me:
> 
> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> index 20b3717cd7ca..56cefa0ab804 100644
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -590,7 +590,7 @@ static void nfs_pgio_rpcsetup(struct
> nfs_pgio_header *hdr,
> }
>  
> hdr->res.fattr   = >fattr;
> -   hdr->res.count   = 0;
> +   hdr->res.count   = count;
> hdr->res.eof = 0;
> hdr->res.verf= >verf;
> nfs_fattr_init(>fattr);
> 
> which is functionally revert of "NFS: Fix initialisation of I/O
> result struct in nfs_pgio_rpcsetup".
> 
> This hunk caught my eye, could res.eof == 0 explain those I/O errors?

Interesting hypothesis. It could if res.count ends up being 0. So does
the following also fix the problem?
8<
From b5bc0812350e94f8c9331174d22f24692411aef9 Mon Sep 17 00:00:00 2001
From: Trond Myklebust 
Date: Mon, 26 Aug 2019 20:41:16 -0400
Subject: [PATCH] NFSv2: Fix eof handling

If we received a reply from the server with a zero length read and
no error, then that implies we are at eof.

Signed-off-by: Trond Myklebust 
---
 fs/nfs/proc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
index 5552fa8b6e12..5919878549d2 100644
--- a/fs/nfs/proc.c
+++ b/fs/nfs/proc.c
@@ -594,7 +594,8 @@ static int nfs_read_done(struct rpc_task *task, struct 
nfs_pgio_header *hdr)
/* Emulate the eof flag, which isn't normally needed in NFSv2
 * as it is guaranteed to always return the file attributes
 */
-   if (hdr->args.offset + hdr->res.count >= hdr->res.fattr->size)
+   if (hdr->res.count == 0 && hdr->args.count > 0 ||
+   hdr->args.offset + hdr->res.count >= hdr->res.fattr->size)
hdr->res.eof = 1;
}
return 0;
-- 
2.21.0

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Linux-next-20190823: x86_64/i386: prot_hsymlinks.c:325: Failed to run cmd: useradd hsym

2019-08-26 Thread Trond Myklebust
On Mon, 2019-08-26 at 10:38 -0400, Jan Stancek wrote:
> - Original Message -
> > Hi Jan and Cyril,
> > 
> > On Mon, 26 Aug 2019 at 16:35, Jan Stancek 
> > wrote:
> > > 
> > > 
> > > - Original Message -
> > > > Hi!
> > > > > Do you see this LTP prot_hsymlinks failure on linux next
> > > > > 20190823 on
> > > > > x86_64 and i386 devices?
> > > > > 
> > > > > test output log,
> > > > > useradd: failure while writing changes to /etc/passwd
> > > > > useradd: /home/hsym was created, but could not be removed
> > > > 
> > > > This looks like an unrelated problem, failure to write to
> > > > /etc/passwd
> > > > probably means that filesystem is full or some problem happend
> > > > and how
> > > > is remounted RO.
> > > 
> > > In Naresh' example, root is on NFS:
> > >   root=/dev/nfs rw
> > >  
> > > nfsroot=10.66.16.123:/var/lib/lava/dispatcher/tmp/886412/extract-
> > > nfsrootfs-tyuevoxm,tcp,hard,intr
> > 
> > Right !
> > root is mounted on NFS.
> > 
> > > 10.66.16.123:/var/lib/lava/dispatcher/tmp/886412/extract-
> > > nfsrootfs-tyuevoxm
> > > on / type nfs
> > > (rw,relatime,vers=2,rsize=4096,wsize=4096,namlen=255,hard,nolock,
> > > proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.66.16.123,moun
> > > tvers=1,mountproto=tcp,local_lock=all,addr=10.66.16.123)
> > > devtmpfs on /dev type devtmpfs
> > > (rw,relatime,size=3977640k,nr_inodes=994410,mode=755)
> > > 

The only thing I can think of that might cause an EIO on NFSv2 would be
this patch 
http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=627d48e597ec5993c4abb3b81dc75e554a07c7c0
assuming that a bind-related error is leaking through.

I'd suggest something like the following to fix it up:

8<---
From 1e9336ac5363914dfcc1f49bf091409edbf36f8d Mon Sep 17 00:00:00 2001
From: Trond Myklebust 
Date: Mon, 26 Aug 2019 11:44:04 -0400
Subject: [PATCH] fixup! SUNRPC: Don't handle errors if the bind/connect
 succeeded

Signed-off-by: Trond Myklebust 
---
 net/sunrpc/clnt.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index f13ec73c8299..a07b516e503a 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1980,9 +1980,11 @@ call_bind_status(struct rpc_task *task)
 
dprint_status(task);
trace_rpc_bind_status(task);
-   if (task->tk_status >= 0 || xprt_bound(xprt)) {
-   task->tk_action = call_connect;
-   return;
+   if (task->tk_status >= 0)
+   goto out_next;
+   if (xprt_bound(xprt)) {
+   task->tk_status = 0;
+   goto out_next;
}
 
switch (task->tk_status) {
@@ -2045,6 +2047,9 @@ call_bind_status(struct rpc_task *task)
 
rpc_call_rpcerror(task, status);
return;
+out_next:
+   task->tk_action = call_connect;
+   return;
 retry_timeout:
task->tk_status = 0;
task->tk_action = call_bind;
@@ -2107,8 +2112,10 @@ call_connect_status(struct rpc_task *task)
clnt->cl_stats->netreconn++;
goto out_next;
}
-   if (xprt_connected(xprt))
+   if (xprt_connected(xprt)) {
+   task->tk_status = 0;
goto out_next;
+   }
 
task->tk_status = 0;
switch (status) {
-- 
2.21.0



-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: nfs4 server stops responding

2019-08-19 Thread Trond Myklebust
On Mon, 2019-08-19 at 12:25 +, Rantala, Tommi T. (Nokia - FI/Espoo)
wrote:
> Hello,
> 
> I have two VMs, exporting some directories in one VM:
> # cat /etc/exports
> /mnt 192.168.1.0/24(ro,fsid=0,no_subtree_check,sync)
> /mnt/export
> 192.168.1.0/24(rw,no_root_squash,sync,no_wdelay,no_subtree_check)
> [...]

Are /mnt and /mnt/export on different filesystems? If not, then your
server configuration is pretty much guaranteed to be broken, with
conflicting sets of rules being imposed on /mnt/export. I'm guessing
that is the case, because I'm not seeing any 'nohide' or 'crossmnt'
entries above.

> 
> And NFS mounting in the second VM:
> # grep nfs /proc/mounts 
> server:/export /mnt/export nfs4
> rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,
> acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,nordirplus,
> proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.11,
> local_lock=none,addr=192.168.1.10 0 0
> [...]
> 
> If I keep some file descriptor open for several minutes in the second
> VM,
> for example by running this:
> # sleep 10m >/mnt/export/test
> 
> Then result is that the NFS mount stops responding: the sleep process
> never finished but is "forever" stuck in (killable) D state, and any
> I/O
> attempt from other processes in /mnt/export never finish.
> It's always reproducible with this sleep command.
> To recover the mountpoint I need to reboot the second VM.
> 
> Kernel version is 5.3.0-rc4 in both VMs.
> Also reproducible with 4.14.x and 4.19.x
> 
> # ps aux|grep sleep
> root  2524  0.0  0.0   5900   688 pts/0D14:04   0:00
> sleep 5m
> 
> # grep -C100 nfs /proc/*/stack
> /proc/2524/stack:[<0>] nfs4_do_close+0x87d/0xb20 [nfsv4]
> /proc/2524/stack:[<0>] __put_nfs_open_context+0x297/0x4f0 [nfs]
> /proc/2524/stack:[<0>] nfs_file_release+0xbe/0xf0 [nfs]
> /proc/2524/stack-[<0>] __fput+0x1df/0x690
> /proc/2524/stack-[<0>] task_work_run+0x123/0x1b0
> /proc/2524/stack-[<0>] exit_to_usermode_loop+0x121/0x140
> /proc/2524/stack-[<0>] do_syscall_64+0x2d1/0x370
> /proc/2524/stack-[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> --
> /proc/561/stack-[<0>] __rpc_execute+0x692/0xb10 [sunrpc]
> /proc/561/stack-[<0>] rpc_run_task+0x45f/0x5d0 [sunrpc]
> /proc/561/stack:[<0>] nfs4_call_sync_sequence+0x12a/0x210 [nfsv4]
> /proc/561/stack:[<0>] _nfs4_proc_getattr+0x19a/0x200 [nfsv4]
> /proc/561/stack:[<0>] nfs4_proc_getattr+0xda/0x230 [nfsv4]
> /proc/561/stack:[<0>] __nfs_revalidate_inode+0x2ed/0x7a0 [nfs]
> /proc/561/stack:[<0>] nfs_do_access+0x605/0xd00 [nfs]
> /proc/561/stack:[<0>] nfs_permission+0x500/0x5e0 [nfs]
> /proc/561/stack-[<0>] inode_permission+0x2dd/0x3f0
> /proc/561/stack-[<0>] link_path_walk.part.60+0x681/0xe40
> /proc/561/stack-[<0>] path_lookupat.isra.63+0x1af/0x850
> /proc/561/stack-[<0>] filename_lookup.part.79+0x165/0x360
> /proc/561/stack-[<0>] vfs_statx+0xb9/0x140
> /proc/561/stack-[<0>] __do_sys_newstat+0x77/0xd0
> /proc/561/stack-[<0>] do_syscall_64+0x9a/0x370
> /proc/561/stack-[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> In dmesg of second VM sometimes nfs complaints are seen:
> 
> [  386.362897] nfs: server xyz not responding, still trying
> 
> Any ideas what's going wrong here...?
> 
The 'server not responding' implies that there is a network connection
issue. Does 'netstat -nt | grep :2049' show a correctly established TCP
connection between the client and server VMs?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes

2019-08-08 Thread Trond Myklebust
Hi Linus,

The following changes since commit e21a712a9685488f5ce80495b37b9fdbe96c230d:

  Linux 5.3-rc3 (2019-08-04 18:40:12 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.3-2

for you to fetch changes up to 67e7b52d44e3d539dfbfcd866c3d3d69da23a909:

  NFSv4: Ensure state recovery handles ETIMEDOUT correctly (2019-08-07 12:55:11 
-0400)


NFS client bugfixes for Linux 5.3

Highlights include:

Stable fixes:
- NFSv4: Ensure we check the return value of update_open_stateid() so we
  correctly track active open state.
- NFSv4: Fix for delegation state recovery to ensure we recover all open
  modes that are active.
- NFSv4: Fix an Oops in nfs4_do_setattr

Bugfixes:
- NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
- NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
- NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
- pNFS: Report errors from the call to nfs4_select_rw_stateid()
- NFSv4: Various other delegation and open stateid recovery fixes
- NFSv4: Fix state recovery behaviour when server connection times out


Trond Myklebust (12):
  NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
  NFSv4: Fix delegation state recovery
  NFSv4: Print an error in the syslog when state is marked as irrecoverable
  NFSv4: When recovering state fails with EAGAIN, retry the same recovery
  NFSv4: Report the error from nfs4_select_rw_stateid()
  NFSv4.1: Fix open stateid recovery
  NFSv4.1: Only reap expired delegations
  NFSv4: Check the return value of update_open_stateid()
  NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
  NFSv4: Fix an Oops in nfs4_do_setattr
  NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
  NFSv4: Ensure state recovery handles ETIMEDOUT correctly

 fs/nfs/delegation.c |  25 
 fs/nfs/delegation.h |   2 +-
 fs/nfs/fscache.c|   7 +++-
 fs/nfs/fscache.h|   2 +-
 fs/nfs/nfs4_fs.h|   3 +-
 fs/nfs/nfs4client.c |   5 ++-
 fs/nfs/nfs4proc.c   | 109 ++--
 fs/nfs/nfs4state.c  |  49 ++-
 fs/nfs/pnfs.c   |   7 +---
 fs/nfs/super.c  |   1 +
 10 files changed, 135 insertions(+), 75 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS changes for Linux 5.3

2019-07-18 Thread Trond Myklebust
Hi Linus,

The following changes since commit bcc0e65f47def010d8d1c4cf09bdc698fe061b77:

  Merge tag 'mips_fixes_5.2_2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux (2019-07-06 10:32:12 
-0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.3-1

for you to fetch changes up to d5b9216fd5114be4ed98ca9c1ecc5f164cd8cf5e:

  pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS 
(2019-07-18 15:50:28 -0400)


NFS client updates for Linux 5.3

Highlights include:

Stable fixes:
- SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC request
- Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR() dereference
- Revert buggy NFS readdirplus optimisation
- NFSv4: Handle the special Linux file open access mode
- pnfs: Fix a problem where we gratuitously start doing I/O through the MDS

Features:
- Allow NFS client to set up multiple TCP connections to the server using
   a new 'nconnect=X' mount option. Queue length is used to balance load.
- Enhance statistics reporting to report on all transports when using
   multiple connections.
- Speed up SUNRPC by removing bh-safe spinlocks
- Add a mechanism to allow NFSv4 to request that containers set a
   unique per-host identifier for when the hostname is not set.
- Ensure NFSv4 updates the lease_time after a clientid update

Bugfixes and cleanup:
- Fix use-after-free in rpcrdma_post_recvs
- Fix a memory leak when nfs_match_client() is interrupted
- Fix buggy file access checking in NFSv4 open for execute
- disable unsupported client side deduplication
- Fix spurious client disconnections
- Fix occasional RDMA transport deadlock
- Various RDMA cleanups
- Various tracepoint fixes
- Fix the TCP callback channel to guarantee the server can actually send
   the number of callback requests that was negotiated at mount time.


Anna Schumaker (1):
  SUNRPC: Drop redundant CONFIG_ from 
CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES

Benjamin Coddington (1):
  NFS: Cleanup if nfs_match_client is interrupted

Chuck Lever (19):
  xprtrdma: Fix a BUG when tracing is enabled with NFSv4.1 on RDMA
  xprtrdma: Fix use-after-free in rpcrdma_post_recvs
  xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_req
  xprtrdma: Fix occasional transport deadlock
  xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag
  xprtrdma: Remove fr_state
  xprtrdma: Add mechanism to place MRs back on the free list
  xprtrdma: Reduce context switching due to Local Invalidation
  xprtrdma: Wake RPCs directly in rpcrdma_wc_send path
  xprtrdma: Simplify rpcrdma_rep_create
  xprtrdma: Streamline rpcrdma_post_recvs
  xprtrdma: Refactor chunk encoding
  xprtrdma: Remove rpcrdma_req::rl_buffer
  xprtrdma: Modernize ops->connect
  NFS4: Add a trace event to record invalid CB sequence IDs
  NFS: Fix show_nfs_errors macros again
  NFS: Display symbolic status code names in trace log
  NFS: Update symbolic flags displayed by trace events
  NFS: Record task, client ID, and XID in xdr_status trace points

Darrick J. Wong (1):
  nfs: disable client side deduplication

Dave Wysochanski (5):
  SUNRPC: Move call to rpc_count_iostats before rpc_call_done
  SUNRPC: Use proper printk specifiers for unsigned long long
  SUNRPC: Count ops completing with tk_status < 0
  NFSv4: Add lease_time and lease_expired to 'nfs4:' line of mountstats
  SUNRPC: Fix possible autodisconnect during connect due to old last_used

Donald Buczek (4):
  nfs: Fix copy-and-paste error in debug message
  nfs4: Make nfs4_proc_get_lease_time available for nfs4.0
  nfs4: Rename nfs41_setup_state_renewal
  nfs4.0: Refetch lease_time after clientid update

Markus Elfring (2):
  NFS: Use seq_putc() in nfs_show_stats()
  NFS: Replace 16 seq_printf() calls by seq_puts()

Max Kellermann (1):
  Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

NeilBrown (3):
  NFS: send state management on a single connection.
  SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.
  SUNRPC: add links for all client xprts to debugfs

Trond Myklebust (36):
  NFSv4: Handle open for execute correctly
  NFSv4: Handle the special Linux file open access mode
  SUNRPC: Replace the queue timer with a delayed work function
  SUNRPC: Replace direct task wakeups from softirq context
  SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lock
  NFS: Create a root NFS directory in /sys/fs/nfs
  NFS: Cleanup - add nfs_clients_exit to mirror nfs_clients_init
  SUNRPC: Add basic load balancing to the transport switch
  SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lock
  NFS: Add deferred cache invalida

Re: [PATCH] Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

2019-07-12 Thread Trond Myklebust
On Fri, 2019-07-12 at 17:28 +0200, Greg KH wrote:
> On Fri, Jul 12, 2019 at 04:18:06PM +0200, Max Kellermann wrote:
> > This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564.
> > 
> > That commit caused a severe memory leak in nfs_readdir_make_qstr().
> > 
> > When listing a directory with more than 100 files (this is how many
> > struct nfs_cache_array_entry elements fit in one 4kB page), all
> > allocated file name strings past those 100 leak.
> > 
> > The root of the leakage is that those string pointers are managed
> > in
> > pages which are never linked into the page cache.
> > 
> > fs/nfs/dir.c puts pages into the page cache by calling
> > read_cache_page(); the callback function nfs_readdir_filler() will
> > then fill the given page struct which was passed to it, which is
> > already linked in the page cache (by do_read_cache_page() calling
> > add_to_page_cache_lru()).
> > 
> > Commit be4c2d4723a4 added another (local) array of allocated pages,
> > to
> > be filled with more data, instead of discarding excess items
> > received
> > from the NFS server.  Those additional pages can be used by the
> > next
> > nfs_readdir_filler() call (from within the same nfs_readdir()
> > call).
> > 
> > The leak happens when some of those additional pages are never used
> > (copied to the page cache using copy_highpage()).  The pages will
> > be
> > freed by nfs_readdir_free_pages(), but their contents will
> > not.  The
> > commit did not invoke nfs_readdir_clear_array() (and doing so would
> > have been dangerous, because it did not track which of those pages
> > were already copied to the page cache, risking double free bugs).
> > 
> > How to reproduce the leak:
> > 
> > - Use a kernel with CONFIG_SLUB_DEBUG_ON.
> > 
> > - Create a directory on a NFS mount with more than 100 files with
> >   names long enough to use the "kmalloc-32" slab (so we can easily
> >   look up the allocation counts):
> > 
> >   for i in `seq 110`; do touch ${i}_0123456789abcdef; done
> > 
> > - Drop all caches:
> > 
> >   echo 3 >/proc/sys/vm/drop_caches
> > 
> > - Check the allocation counter:
> > 
> >   grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> >   30564391 nfs_readdir_add_to_array+0x73/0xd0
> > age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1
> > 
> > - Request a directory listing and check the allocation counters
> > again:
> > 
> >   ls
> >   [...]
> >   grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> >   30564511 nfs_readdir_add_to_array+0x73/0xd0
> > age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1
> > 
> > There are now 120 new allocations.
> > 
> > - Drop all caches and check the counters again:
> > 
> >   echo 3 >/proc/sys/vm/drop_caches
> >   grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> >   30564401 nfs_readdir_add_to_array+0x73/0xd0
> > age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1
> > 
> > 110 allocations are gone, but 10 have leaked and will never be
> > freed.
> > 
> > Unhelpfully, those allocations are explicitly excluded from
> > KMEMLEAK,
> > that's why my initial attempts with KMEMLEAK were not successful:
> > 
> > /*
> >  * Avoid a kmemleak false positive. The pointer to the name is
> > stored
> >  * in a page cache page which kmemleak does not scan.
> >  */
> > kmemleak_not_leak(string->name);
> > 
> > It would be possible to solve this bug without reverting the whole
> > commit:
> > 
> > - keep track of which pages were not used, and call
> >   nfs_readdir_clear_array() on them, or
> > - manually link those pages into the page cache
> > 
> > But for now I have decided to just revert the commit, because the
> > real
> > fix would require complex considerations, risking more dangerous
> > (crash) bugs, which may seem unsuitable for the stable branches.
> > 
> > Signed-off-by: Max Kellermann 
> > ---
> >  fs/nfs/dir.c  | 90 ---
> > 
> >  fs/nfs/internal.h |  3 +-
> >  2 files changed, 7 insertions(+), 86 deletions(-)
> 
> No cc: stable@vger on this patch to get it backported?
> 

I've added one. I've also backed out the 3 fixes for other problems
with the same patch that were in the linux-next tree. (St. Stephen
Rothwell please forgive me, for I have sinned...)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: linux-next: Signed-off-by missing for commits in the nfs tree

2019-07-09 Thread Trond Myklebust
Yes please.

On Tue, 9 Jul 2019 at 10:30, Schumaker, Anna  wrote:
>
> On Sun, 2019-07-07 at 17:30 -0400, Trond Myklebust wrote:
> > NetApp Security WARNING: This is an external email. Do not click
> > links or open attachments unless you recognize the sender and know
> > the content is safe.
> >
> >
> >
> >
> > On Sun, 7 Jul 2019 at 17:29, Stephen Rothwell 
> > wrote:
> > > Hi all,
> > >
> > > Commits
> > >
> > >   fe9ad197bd8a ("xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag")
> > >   08d720bcd822 ("xprtrdma: Fix occasional transport deadlock")
> > >   beb843739ea0 ("xprtrdma: Replace use of xdr_stream_pos in
> > > rpcrdma_marshal_req")
> > >
> > > are missing a Signed-off-by from their committer.
> > >
> >
> > Anna?
>
> Looks like I missed the "-s" flag to `git-am` when I was applying
> these. I just fixed it locally, do you want me to send you a new pull
> request?
>
> Anna


Re: linux-next: Signed-off-by missing for commits in the nfs tree

2019-07-07 Thread Trond Myklebust
On Sun, 7 Jul 2019 at 17:29, Stephen Rothwell  wrote:
>
> Hi all,
>
> Commits
>
>   fe9ad197bd8a ("xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag")
>   08d720bcd822 ("xprtrdma: Fix occasional transport deadlock")
>   beb843739ea0 ("xprtrdma: Replace use of xdr_stream_pos in 
> rpcrdma_marshal_req")
>
> are missing a Signed-off-by from their committer.
>

Anna?


Re: [REGRESSION v5.2-rc] SUNRPC: Declare RPC timers as TIMER_DEFERRABLE (431235818bc3)

2019-06-05 Thread Trond Myklebust
On Wed, 2019-06-05 at 09:40 +0100, Jon Hunter wrote:
> Hi Trond,
> 
> I have been noticing intermittent failures with a system suspend test
> on
> some of our machines that have a NFS mounted root file-system.
> Bisecting
> this issue points to your commit 431235818bc3 ("SUNRPC: Declare RPC
> timers as TIMER_DEFERRABLE") and reverting this on top of v5.2-rc3
> does
> appear to resolve the problem.
> 
> The cause of the suspend failure appears to be a long delay observed
> sometimes when resuming from suspend, and this is causing our test to
> timeout. For example, in a failing case I see something like the
> following ...
> 
> [   69.667385] PM: suspend entry (deep)
> 
> [   69.675642] Filesystems sync: 0.000 seconds
> 
> [   69.684983] Freezing user space processes ... (elapsed 0.001
> seconds) done.
> 
> [   69.697880] OOM killer disabled.
> 
> [   69.705670] Freezing remaining freezable tasks ... (elapsed 0.001
> seconds) done.
> 
> [   69.719043] printk: Suspending console(s) (use no_console_suspend
> to debug)
> 
> [   69.758911] Disabling non-boot CPUs ...
> 
> [   69.761875] IRQ 17: no longer affine to CPU3
> 
> [   69.762609] Entering suspend state LP1
> 
> [   69.762636] Enabling non-boot CPUs ...
> 
> [   69.763600] CPU1 is up
> 
> [   69.764517] CPU2 is up
> 
> [   69.765532] CPU3 is up
> 
> [   69.845832] mmc1: queuing unknown CIS tuple 0x80 (50 bytes)
> 
> [   69.854223] mmc1: queuing unknown CIS tuple 0x80 (7 bytes)
> 
> [   69.857238] mmc1: queuing unknown CIS tuple 0x80 (7 bytes)
> 
> [   69.892700] mmc1: queuing unknown CIS tuple 0x02 (1 bytes)
> 
> [   70.407286] OOM killer enabled.
> 
> [   70.414674] Restarting tasks ... done.
> 
> [   70.423232] PM: suspend exit
> 
> [   73.533252] asix 1-1:1.0 eth0: link up, 100Mbps, full-duplex, lpa
> 0xCDE1
> 
> [  105.461852] nfs: server 192.168.99.1 not responding, still trying
> 
> [  105.462347] nfs: server 192.168.99.1 not responding, still trying
> 
> [  105.484809] nfs: server 192.168.99.1 OK
> 
> [  105.486454] nfs: server 192.168.99.1 OK
> 
> 
> So it would appear that making these timers deferrable is having an
> impact
> when resuming from suspend. Do you have any thoughts on this?
> 

I'd be OK with just reverting this patch if it is causing a performance
issue.

Anna?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [LKP] [SUNRPC] 0472e47660: fsmark.app_overhead 16.0% regression

2019-05-30 Thread Trond Myklebust
On Thu, 2019-05-30 at 15:20 +0800, Xing Zhengjun wrote:
> 
> On 5/30/2019 10:00 AM, Trond Myklebust wrote:
> > Hi Xing,
> > 
> > On Thu, 2019-05-30 at 09:35 +0800, Xing Zhengjun wrote:
> > > Hi Trond,
> > > 
> > > On 5/20/2019 1:54 PM, kernel test robot wrote:
> > > > Greeting,
> > > > 
> > > > FYI, we noticed a 16.0% improvement of fsmark.app_overhead due
> > > > to
> > > > commit:
> > > > 
> > > > 
> > > > commit: 0472e476604998c127f3c80d291113e77c5676ac ("SUNRPC:
> > > > Convert
> > > > socket page send code to use iov_iter()")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
> > > > master
> > > > 
> > > > in testcase: fsmark
> > > > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @
> > > > 3.00GHz with 384G memory
> > > > with following parameters:
> > > > 
> > > > iterations: 1x
> > > > nr_threads: 64t
> > > > disk: 1BRD_48G
> > > > fs: xfs
> > > > fs2: nfsv4
> > > > filesize: 4M
> > > > test_size: 40G
> > > > sync_method: fsyncBeforeClose
> > > > cpufreq_governor: performance
> > > > 
> > > > test-description: The fsmark is a file system benchmark to test
> > > > synchronous write workloads, for example, mail servers
> > > > workload.
> > > > test-url: https://sourceforge.net/projects/fsmark/
> > > > 
> > > > 
> > > > 
> > > > Details are as below:
> > > > -
> > > > 
> > > > ->
> > > > 
> > > > 
> > > > To reproduce:
> > > > 
> > > >   git clone https://github.com/intel/lkp-tests.git
> > > >   cd lkp-tests
> > > >   bin/lkp install job.yaml  # job file is attached in
> > > > this
> > > > email
> > > >   bin/lkp run job.yaml
> > > > 
> > > > ===
> > > > 
> > > > ==
> > > > compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconf
> > > > ig/n
> > > > r_threads/rootfs/sync_method/tbox_group/test_size/testcase:
> > > > gcc-7/performance/1BRD_48G/4M/nfsv4/xfs/1x/x86_64-rhel-
> > > > 7.6/64t/debian-x86_64-2018-04-03.cgz/fsyncBeforeClose/lkp-ivb-
> > > > ep01/40G/fsmark
> > > > 
> > > > commit:
> > > > e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use
> > > > iov_iter_kvec()")
> > > > 0472e47660 ("SUNRPC: Convert socket page send code to use
> > > > iov_iter()")
> > > > 
> > > > e791f8e9380d945e 0472e476604998c127f3c80d291
> > > >  ---
> > > >  fail:runs  %reproductionfail:runs
> > > >  | | |
> > > >  :4   50%   2:4 dmesg.WARNING:a
> > > > t#for
> > > > _ip_interrupt_entry/0x
> > > >%stddev %change %stddev
> > > >\  |\
> > > > 15118573
> > > > ±  2% +16.0%   17538083fsmark.app_overhead
> > > >   510.93   -
> > > > 22.7% 395.12fsmark.files_per_sec
> > > >24.90   +22.8%  30.57fsmark.time.ela
> > > > psed_
> > > > time
> > > >24.90   +22.8%  30.57fsmark.time.ela
> > > > psed_
> > > > time.max
> > > >   288.00 ±  2% -
> > > > 27.8% 208.00fsmark.time.percent_of_cpu_this_job_got
> > > >70.03 ±  2% -
> > > > 11.3%  62.14fsmark.time.system_time
> > > > 
> > > 
> > > Do you have time to take a look at this regression?
> > 
> >  From your stats, it looks to me as if the problem is increased
> > NUMA
> > overhead. Pretty much everything else appears to be the same or
> > actually performing better than previously. Am I interpreting that
> > correctly?
> The real regression is the throughput(fsmark.files_per_sec) is
> decreased 
> by 22.7%.

Understood, but I'm trying to make sense of why. I'm not able to
reproduce this, so I have to rely on your performance stats to
understand where the 22.7% regression is coming from. As far as I can
see, the only numbers in the stats you published that are showing a
performance regression (other than the fsmark number itself), are the
NUMA numbers. Is that a correct interpretation?

> > If my interpretation above is correct, then I'm not seeing where
> > this
> > patch would be introducing new NUMA regressions. It is just
> > converting
> > from using one method of doing socket I/O to another. Could it
> > perhaps
> > be a memory artefact due to your running the NFS client and server
> > on
> > the same machine?
> > 
> > Apologies for pushing back a little, but I just don't have the
> > hardware available to test NUMA configurations, so I'm relying on
> > external testing for the above kind of scenario.
> > 
> Thanks for looking at this.  If you need more information, please let
> me
> know.
> > Thanks
> >Trond
> > 
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space




Re: [LKP] [SUNRPC] 0472e47660: fsmark.app_overhead 16.0% regression

2019-05-29 Thread Trond Myklebust
Hi Xing,

On Thu, 2019-05-30 at 09:35 +0800, Xing Zhengjun wrote:
> Hi Trond,
> 
> On 5/20/2019 1:54 PM, kernel test robot wrote:
> > Greeting,
> > 
> > FYI, we noticed a 16.0% improvement of fsmark.app_overhead due to
> > commit:
> > 
> > 
> > commit: 0472e476604998c127f3c80d291113e77c5676ac ("SUNRPC: Convert
> > socket page send code to use iov_iter()")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
> > master
> > 
> > in testcase: fsmark
> > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @
> > 3.00GHz with 384G memory
> > with following parameters:
> > 
> > iterations: 1x
> > nr_threads: 64t
> > disk: 1BRD_48G
> > fs: xfs
> > fs2: nfsv4
> > filesize: 4M
> > test_size: 40G
> > sync_method: fsyncBeforeClose
> > cpufreq_governor: performance
> > 
> > test-description: The fsmark is a file system benchmark to test
> > synchronous write workloads, for example, mail servers workload.
> > test-url: https://sourceforge.net/projects/fsmark/
> > 
> > 
> > 
> > Details are as below:
> > -
> > ->
> > 
> > 
> > To reproduce:
> > 
> >  git clone https://github.com/intel/lkp-tests.git
> >  cd lkp-tests
> >  bin/lkp install job.yaml  # job file is attached in this
> > email
> >  bin/lkp run job.yaml
> > 
> > ===
> > ==
> > compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/n
> > r_threads/rootfs/sync_method/tbox_group/test_size/testcase:
> >gcc-7/performance/1BRD_48G/4M/nfsv4/xfs/1x/x86_64-rhel-
> > 7.6/64t/debian-x86_64-2018-04-03.cgz/fsyncBeforeClose/lkp-ivb-
> > ep01/40G/fsmark
> > 
> > commit:
> >e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use
> > iov_iter_kvec()")
> >0472e47660 ("SUNRPC: Convert socket page send code to use
> > iov_iter()")
> > 
> > e791f8e9380d945e 0472e476604998c127f3c80d291
> >  ---
> > fail:runs  %reproductionfail:runs
> > | | |
> > :4   50%   2:4 dmesg.WARNING:at#for
> > _ip_interrupt_entry/0x
> >   %stddev %change %stddev
> >   \  |\
> >15118573 ±  2% +16.0%   17538083fsmark.app_overhead
> >  510.93   -22.7% 395.12fsmark.files_per_sec
> >   24.90   +22.8%  30.57fsmark.time.elapsed_
> > time
> >   24.90   +22.8%  30.57fsmark.time.elapsed_
> > time.max
> >  288.00 ±  2% -
> > 27.8% 208.00fsmark.time.percent_of_cpu_this_job_got
> >   70.03 ±  2% -
> > 11.3%  62.14fsmark.time.system_time
> > 4391964   -16.7%3658341meminfo.max_used_kB
> >6.10 ±  4%  +1.97.97
> > ±  3%  mpstat.cpu.all.iowait%
> >0.27-0.00.24 ±  3%  mpstat.cpu.all.soft%
> >13668070 ± 40%+118.0%   29801846 ± 19%  numa-
> > numastat.node0.local_node
> >1364 ± 40%+117.9%   29810258 ± 19%  numa-
> > numastat.node0.numa_hit
> >5.70 ±  3% +32.1%   7.53 ±  3%  iostat.cpu.iowait
> >   16.42 ±  2%  -5.8%  15.47iostat.cpu.system
> >2.57-4.1%   2.46iostat.cpu.user
> > 1406781 ±  2% -15.5%1188498vmstat.io.bo
> >  251792 ±  3% -16.6% 209928vmstat.system.cs
> >   84841-1.9%  83239vmstat.system.in
> >97374502 ± 20% +66.1%  1.617e+08 ± 17%  cpuidle.C1E.time
> >  573934 ± 19% +44.6% 829662 ± 26%  cpuidle.C1E.usage
> >   5.892e+08 ±  8% +15.3%  6.796e+08 ±  2%  cpuidle.C6.time
> > 1968016 ±  3% -15.1%1670867 ±  3%  cpuidle.POLL.time
> >  106420 ± 47% +86.2% 198108 ± 35%  numa-
> > meminfo.node0.Active
> >  106037 ± 48% +86.2% 197395 ± 35%  numa-
> > meminfo.node0.Active(anon)
> >  105052 ± 48% +86.6% 196037 ± 35%  numa-
> > meminfo.node0.AnonPages
> >  212876 ± 24% -41.5% 124572 ± 56%  numa-
> > meminfo.node1.Active
> >  211801 ± 24% -41.5% 123822 ± 56%  numa-
> > meminfo.node1.Active(anon)
> >  208559 ± 24% -42.2% 120547 ± 57%  numa-
> > meminfo.node1.AnonPages
> >9955+1.6%  10116proc-
> > vmstat.nr_kernel_stack
> >  452.25 ± 59%+280.9%   1722 ±100%  proc-
> > vmstat.numa_hint_faults_local
> >33817303   +55.0%   52421773 ±  5%  proc-vmstat.numa_hit
> >33804286   +55.0%   52408807 ±  5%  proc-
> > vmstat.numa_local
> >33923002   +81.8%   61663426 ±  5%  proc-
> > vmstat.pgalloc_normal
> >  184765+9.3% 201985proc-vmstat.pgfault
> >12840986  +216.0%   40581327 ±  7%  proc-vmstat.pgfree
> >   

Re: PROBLEM: oops spew with Linux 5.1.5 (NFS regression?)

2019-05-29 Thread Trond Myklebust
On Wed, 2019-05-29 at 11:10 -0400, Nick Bowler wrote:
> Hi,
> 
> I upgraded to Linux 5.1.5 on one machine yesterday, and this morning
> I
> happened noticed a large amount of backtraces in the log.  It appears
> that the system oopsed 62 times over a period of about 5 minutes,
> producing about half a megabyte of log messages, after which the
> messages stopped.  No idea what action (if any) triggered these.
> 
> However, other than the noise in the logs there is nothing obviously
> broken, but I thought I should report the spews anyway.  I was
> running
> 5.0.9 previously and have not seen any similar errors.  The first
> couple
> spews are appended.  All 64 faults look very similar to these ones,
> with
> the same faulting address and the same rpc_check_timeout function at
> the
> top of the backtrace.

OK, I think this is the same problem that Olga was seeing (Cced), and
it looks like I missed the use-after-free issue when the server returns
a credential error when she asked.

I believe that the following patch should fix it:

8<--
From 33905f5a7d1d200db8eeb3f4ea8670c9da4cb64d Mon Sep 17 00:00:00 2001
From: Trond Myklebust 
Date: Wed, 29 May 2019 12:49:52 -0400
Subject: [PATCH] SUNRPC: Fix a use after free when a server rejects the
 RPCSEC_GSS credential

The addition of rpc_check_timeout() to call_decode causes an Oops
when the RPCSEC_GSS credential is rejected.
The reason is that rpc_decode_header() will call xprt_release() in
order to free task->tk_rqstp, which is needed by rpc_check_timeout()
to check whether or not we should exit due to a soft timeout.

The fix is to move the call to xprt_release() into call_decode() so
we can perform it after rpc_check_timeout().

Reported-by: Olga Kornievskaia 
Reported-by: Nick Bowler 
Fixes: cea57789e408 ("SUNRPC: Clean up")
Cc: sta...@vger.kernel.org # v5.1+
Signed-off-by: Trond Myklebust 
---
 net/sunrpc/clnt.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index d6e57da56c94..4c02c37fa774 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2426,17 +2426,21 @@ call_decode(struct rpc_task *task)
return;
case -EAGAIN:
task->tk_status = 0;
-   /* Note: rpc_decode_header() may have freed the RPC slot */
-   if (task->tk_rqstp == req) {
-   xdr_free_bvec(>rq_rcv_buf);
-   req->rq_reply_bytes_recvd = 0;
-   req->rq_rcv_buf.len = 0;
-   if (task->tk_client->cl_discrtry)
-   xprt_conditional_disconnect(req->rq_xprt,
-   
req->rq_connect_cookie);
-   }
+   xdr_free_bvec(>rq_rcv_buf);
+   req->rq_reply_bytes_recvd = 0;
+   req->rq_rcv_buf.len = 0;
+   if (task->tk_client->cl_discrtry)
+   xprt_conditional_disconnect(req->rq_xprt,
+   req->rq_connect_cookie);
task->tk_action = call_encode;
rpc_check_timeout(task);
+   break;
+   case -EKEYREJECTED:
+   task->tk_action = call_reserve;
+   rpc_check_timeout(task);
+   rpcauth_invalcred(task);
+   /* Ensure we obtain a new XID if we retry! */
+   xprt_release(task);
}
 }
 
@@ -2572,11 +2576,7 @@ rpc_decode_header(struct rpc_task *task, struct 
xdr_stream *xdr)
break;
task->tk_cred_retry--;
trace_rpc__stale_creds(task);
-   rpcauth_invalcred(task);
-   /* Ensure we obtain a new XID! */
-   xprt_release(task);
-   task->tk_action = call_reserve;
-   return -EAGAIN;
+   return -EKEYREJECTED;
    case rpc_autherr_badcred:
case rpc_autherr_badverf:
/* possibly garbled cred/verf? */
-- 
2.21.0

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull 1 NFS client bugfix

2019-04-20 Thread Trond Myklebust
Hi Linus,

The following changes since commit dc4060a5dc2557e6b5aa813bf5b73677299d62d2:

  Linux 5.1-rc5 (2019-04-14 15:17:41 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-5

for you to fetch changes up to a7b1a4839ff979b4dd4fb6c1ccd31af11de9ca87:

  SUNRPC: Ignore queue transmission errors on successful transmission 
(2019-04-17 16:07:28 -0400)

Cheers
  Trond


NFS client bugfixes for Linux 5.1

Highlights include:

Bugfixes:
- Fix a regression in which an RPC call can be tagged with an error despite
  the transmission being successful


Trond Myklebust (1):
  SUNRPC: Ignore queue transmission errors on successful transmission

 net/sunrpc/clnt.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes

2019-04-13 Thread Trond Myklebust
Hi Linus,

The following changes since commit 582549e3fbe137eb6ce9be591aca25ca36b4:

  Merge tag 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma (2019-04-10 09:39:04 
-1000)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-4

for you to fetch changes up to af6b61d7ef58099c82d854395a0e002be6bd036c:

  Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping" 
(2019-04-11 15:41:14 -0400)


NFS client bugfixes for Linux 5.1

Highlights include:

Stable fixes:
- Fix a deadlock in close() due to incorrect draining of RDMA queues

Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
  as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return values


Chuck Lever (2):
  NFS: Fix handling of reply page vector
  xprtrdma: Fix helper that drains the transport

Olga Kornievskaia (1):
  NFSv4.1 fix incorrect return value in copy_file_range

Tetsuo Handa (1):
  NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.

Trond Myklebust (1):
  Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"

 fs/nfs/nfs42proc.c   |  3 ---
 fs/nfs/nfs4file.c|  4 +++-
 fs/nfs/nfs4xdr.c |  4 ++--
 fs/nfs/super.c   |  3 ++-
 include/linux/sunrpc/sched.h |  8 
 net/sunrpc/clnt.c| 45 
 net/sunrpc/xprtrdma/verbs.c  |  2 +-
 7 files changed, 16 insertions(+), 53 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: 5.1.0-rc4: Oops in __rpc_execute() when trying to boot from NFS

2019-04-12 Thread Trond Myklebust
On Fri, 2019-04-12 at 07:57 +0200, Daniel Mack wrote:
> Hi Trond,
> 
> On 11/4/2019 9:50 PM, Trond Myklebust wrote:
> > Could you please try pulling the 'testing' branch from 
> > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/testing
> > 
> > i.e. 'git pull git://git.linux-nfs.org/projects/trondmy/linux-
> > nfs.git
> > testing'
> > 
> > to see if that suffices to fix the issue you're reporting?
> 
> Yes, that revert in your branch fixes the problem. All fine again!

OK. Thanks for testing!

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: 5.1.0-rc4: Oops in __rpc_execute() when trying to boot from NFS

2019-04-11 Thread Trond Myklebust
Hi Daniel,

On Tue, 2019-04-09 at 19:54 +0200, Daniel Mack wrote:
> On 9/4/2019 6:55 PM, Trond Myklebust wrote:
> > On Tue, 2019-04-09 at 18:25 +0200, Daniel Mack wrote:
> > > On 8/4/2019 8:51 PM, Trond Myklebust wrote:
> > > > On Mon, 2019-04-08 at 19:01 +0200, Daniel Mack wrote:
> > > > > Hi,
> > > > > 
> > > > > I'm seeing the Oops below when trying to boot 5.1.0-rc4 on an
> > > > > ARM
> > > > > PXA3xx
> > > > > platform. v5.0 did not show this effect with the same
> > > > > cmdline.
> > > > > 
> > > > Please do bisect if that is at all practical. I'm having
> > > > trouble
> > > > interpreting this Oops.
> > > 
> > > Here you go:
> > > 
> > > 009a82f6437490c262584d65a14094a818bcb747 is the first bad commit
> > > commit 009a82f6437490c262584d65a14094a818bcb747
> > > Author: Trond Myklebust 
> > > Date:   Sat Mar 9 12:07:17 2019 -0500
> > > 
> > > SUNRPC: Micro-optimise when the task is known not to be
> > > sleeping
> > > 
> > > In cases where we know the task is not sleeping, try to
> > > optimise
> > > away the indirect call to task->tk_action() by replacing it
> > > with
> > > a direct call.
> > > Only change tail calls, to allow gcc to perform tail call
> > > elimination.
> > 
> > Ah... It looks like we explicitly turn off tail call optimisation
> > in
> > some ARM configs, so this might be a stack overflow.
> > 
> > Does your config file have THUMB2_AVOID_R_ARM_THM_JUMP11 set?
> 
> Nope. I don't even have THUMB2_KERNEL.
> 
> In the meantime, I tried to trace that with some printks, but the bug
> appears evasive, and the backtrace changes as soon as I modify the
> timing. Hmm.
> 
> Happy to test patches if you have any idea.
> 

Could you please try pulling the 'testing' branch from 
http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/testing

i.e. 'git pull git://git.linux-nfs.org/projects/trondmy/linux-nfs.git
testing'

to see if that suffices to fix the issue you're reporting?

Thanks
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: linux-next: Tree for Apr 9

2019-04-09 Thread Trond Myklebust
On Tue, 2019-04-09 at 11:25 -0700, Guenter Roeck wrote:
> On Tue, Apr 09, 2019 at 06:00:42PM +1000, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20190408:
> > 
> > The mac80211-next tree gained a conflict against the mac80211 tree.
> > 
> > The drm tree still had its build failure for which I disabled a
> > driver.
> > 
> > The drm-misc tree gained conflicts against the drm tree and also a
> > build
> > failure for which I marked a driver as BROKEN.
> > 
> > The scsi-mkp tree gained a build failure for which I reverted a
> > commit.
> > 
> > The rtc tree gained a conflict against the omap tree.
> > 
> > Non-merge commits (relative to Linus' tree): 5587
> >  5386 files changed, 176906 insertions(+), 87493 deletions(-)
> > 
> 
> This one deserves a reply.
> 
> Build results:
>   total: 159 pass: 144 fail: 15
> Failed builds: 
>   
> Qemu test results:
>   total: 345 pass: 227 fail: 118
> Failed tests: 
>   
> 
> Build failure is:
> 
> s/nfsd/nfssvc.c: In function 'nfsd_support_acl_version':
> fs/nfsd/nfssvc.c:145:14: error: 'NFSD_ACL_MINVERS' undeclared
> 
> and similar. Build failures and (many ? most ? all ?) of the qemu
> failures
> are due to commit 55d4c716ea ("nfsd: Add custom rpcbind callbacks for
> knfsd").

An updated patch set that fixes this issue was sent out earlier today.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: 5.1.0-rc4: Oops in __rpc_execute() when trying to boot from NFS

2019-04-09 Thread Trond Myklebust
On Tue, 2019-04-09 at 18:25 +0200, Daniel Mack wrote:
> On 8/4/2019 8:51 PM, Trond Myklebust wrote:
> > On Mon, 2019-04-08 at 19:01 +0200, Daniel Mack wrote:
> > > Hi,
> > > 
> > > I'm seeing the Oops below when trying to boot 5.1.0-rc4 on an ARM
> > > PXA3xx
> > > platform. v5.0 did not show this effect with the same cmdline.
> > > 
> > Please do bisect if that is at all practical. I'm having trouble
> > interpreting this Oops.
> 
> Here you go:
> 
> 009a82f6437490c262584d65a14094a818bcb747 is the first bad commit
> commit 009a82f6437490c262584d65a14094a818bcb747
> Author: Trond Myklebust 
> Date:   Sat Mar 9 12:07:17 2019 -0500
> 
> SUNRPC: Micro-optimise when the task is known not to be sleeping
> 
> In cases where we know the task is not sleeping, try to optimise
> away the indirect call to task->tk_action() by replacing it with
> a direct call.
> Only change tail calls, to allow gcc to perform tail call
> elimination.

Ah... It looks like we explicitly turn off tail call optimisation in
some ARM configs, so this might be a stack overflow.

Does your config file have THUMB2_AVOID_R_ARM_THM_JUMP11 set?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: 5.1.0-rc4: Oops in __rpc_execute() when trying to boot from NFS

2019-04-08 Thread Trond Myklebust
On Mon, 2019-04-08 at 19:01 +0200, Daniel Mack wrote:
> Hi,
> 
> I'm seeing the Oops below when trying to boot 5.1.0-rc4 on an ARM
> PXA3xx
> platform. v5.0 did not show this effect with the same cmdline.
> 
> Relevant bits from the config are:
> 
> CONFIG_NFS_FS=y
> CONFIG_NFS_V2=y
> CONFIG_NFS_V3=y
> # CONFIG_NFS_V3_ACL is not set
> # CONFIG_NFS_V4 is not set
> # CONFIG_NFS_SWAP is not set
> CONFIG_ROOT_NFS=y
> CONFIG_NFS_FSCACHE=y
> # CONFIG_NFSD is not set
> CONFIG_NFS_COMMON=y
> 
> Anything else I can provide? I could bisect the issue if that helps.
> 

Please do bisect if that is at all practical. I'm having trouble
interpreting this Oops.

Thanks!
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client fixes

2019-03-26 Thread Trond Myklebust
Hi Linus,

The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:

  Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-3

for you to fetch changes up to 01f2f5b82a2b523ae76af53f2ff43c48dde10a00:

  SUNRPC: fix uninitialized variable warning (2019-03-26 13:04:32 -0700)


NFS client bugfixes for Linux 5.1

Highlights include:

Stable fixes:
- Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()
- fix mount/umount race in nlmclnt.
- NFSv4.1 don't free interrupted slot on open

Bugfixes:
- Don't let RPC_SOFTCONN tasks time out if the transport is connected
- Fix a typo in nfs_init_timeout_values()
- Fix layoutstats handling during read failovers
- fix uninitialized variable warning


Alakesh Haloi (1):
  SUNRPC: fix uninitialized variable warning

Catalin Marinas (1):
  NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()

NeilBrown (1):
  NFS: fix mount/umount race in nlmclnt.

Olga Kornievskaia (1):
  NFSv4.1 don't free interrupted slot on open

Trond Myklebust (3):
  SUNRPC: Don't let RPC_SOFTCONN tasks time out if the transport is 
connected
  NFS: Fix a typo in nfs_init_timeout_values()
  pNFS/flexfiles: Fix layoutstats handling during read failovers

 fs/lockd/host.c|  3 +--
 fs/nfs/client.c|  2 +-
 fs/nfs/flexfilelayout/flexfilelayout.c |  5 -
 fs/nfs/nfs4proc.c  |  5 ++---
 net/sunrpc/clnt.c  | 12 +++-
 net/sunrpc/xprtsock.c  |  4 ++--
 6 files changed, 21 insertions(+), 10 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH] NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()

2019-03-18 Thread Trond Myklebust
On Mon, 2019-03-18 at 17:00 +, Catalin Marinas wrote:
> Commit 7b587e1a5a6c ("NFS: use locks_copy_lock() to copy locks.")
> changed the lock copying from memcpy() to the dedicated
> locks_copy_lock() function. The latter correctly increments the
> nfs4_lock_state.ls_count via nfs4_fl_copy_lock(), however, this
> refcount
> has already been incremented in the nfs4_alloc_{lock,unlock}data().
> Kmemleak subsequently reports an unreferenced nfs4_lock_state object
> as
> below (arm64 platform):
> 
> unreferenced object 0x8000fce0b000 (size 256):
>   comm "systemd-sysuser", pid 1608, jiffies 4294892825 (age 32.348s)
>   hex dump (first 32 bytes):
> 20 57 4c fb 00 80 ff ff 20 57 4c fb 00 80 ff ff   WL. WL.
> 00 57 4c fb 00 80 ff ff 01 00 00 00 00 00 00 00  .WL.
>   backtrace:
> [<0d15010d>] kmem_cache_alloc+0x178/0x208
> [<d7c1d264>] nfs4_set_lock_state+0x124/0x1f0
> [<9c867628>] nfs4_proc_lock+0x90/0x478
> [<1686bd74>] do_setlk+0x64/0xe8
> [<e01500d4>] nfs_lock+0xe8/0x1f0
> [<4f387d8d>] vfs_lock_file+0x18/0x40
> [<656ab79b>] do_lock_file_wait+0x68/0xf8
> [<f17c4a4b>] fcntl_setlk+0x224/0x280
> [<52a242c6>] do_fcntl+0x418/0x730
> [<4f47291a>] __arm64_sys_fcntl+0x84/0xd0
> [<d6856e01>] el0_svc_common+0x80/0xf0
> [<9c4bd1df>] el0_svc_handler+0x2c/0x80
> [<b1a0d479>] el0_svc+0x8/0xc
> [<56c62a0f>] 0x
> 
> This patch removes the original refcount_inc(>ls_count) that was
> paired with the memcpy() lock copying.
> 
> Fixes: 7b587e1a5a6c ("NFS: use locks_copy_lock() to copy locks.")
> Cc:  # 5.0.x-
> Cc: NeilBrown 
> Signed-off-by: Catalin Marinas 
> ---
>  fs/nfs/nfs4proc.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 4dbb0ee23432..6d2812a39287 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -6301,7 +6301,6 @@ static struct nfs4_unlockdata
> *nfs4_alloc_unlockdata(struct file_lock *fl,
>   p->arg.seqid = seqid;
>   p->res.seqid = seqid;
>   p->lsp = lsp;
> - refcount_inc(>ls_count);
>   /* Ensure we don't close file until we're done freeing locks!
> */
>   p->ctx = get_nfs_open_context(ctx);
>   p->l_ctx = nfs_get_lock_context(ctx);
> @@ -6526,7 +6525,6 @@ static struct nfs4_lockdata
> *nfs4_alloc_lockdata(struct file_lock *fl,
>   p->res.lock_seqid = p->arg.lock_seqid;
>   p->lsp = lsp;
>   p->server = server;
> - refcount_inc(>ls_count);
>   p->ctx = get_nfs_open_context(ctx);
>   locks_init_lock(>fl);
>   locks_copy_lock(>fl, fl);

Thanks Catalin! Good catch...

I'm applying this to my linux-next branch.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client bugfixes for 5.1

2019-03-16 Thread Trond Myklebust
Hi Linus,

The following changes since commit 4d6c671ace569d4b0d3f8d92ab3aef18a5d166bc:

  SUNRPC: Take the transport send lock before binding+connecting (2019-03-10 
14:08:19 -0400)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-2

for you to fetch changes up to 5e3863fd597eba8c6679de805681631b1aad9bdb:

  SUNRPC: Remove redundant check for the reply length in call_decode() 
(2019-03-15 13:11:36 -0400)

Cheers
  Trond


NFS client bugfixes for Linux 5.1

Highlights include:

Bugfixes:
- Fix an Oops in SUNRPC back channel tracepoints
- Fix a SUNRPC client regression when handling oversized replies
- Fix the minimal size for SUNRPC reply buffer allocation
- rpc_decode_header() must always return a non-zero value on error
- Fix a typo in pnfs_update_layout()

Cleanups:
- Remove redundant check for the reply length in call_decode()


Olga Kornievskaia (1):
  fix null pointer deref in tracepoints in back channel

Trond Myklebust (7):
  pNFS: Fix a typo in pnfs_update_layout
  SUNRPC: Fix a client regression when handling oversized replies
  SUNRPC: Fix the minimal size for reply buffer allocation
  SUNRPC: Use the ENOTCONN error on socket disconnect
  SUNRPC: rpc_decode_header() must always return a non-zero value on error
  SUNRPC: Handle the SYSTEM_ERR rpc error
  SUNRPC: Remove redundant check for the reply length in call_decode()

 fs/nfs/pnfs.c |  2 +-
 include/trace/events/sunrpc.h |  6 --
 net/sunrpc/clnt.c | 32 ++--
 net/sunrpc/xprt.c |  2 +-
 net/sunrpc/xprtsock.c |  2 +-
 5 files changed, 21 insertions(+), 23 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client updates for Linux 5.1

2019-03-12 Thread Trond Myklebust
Hi Linus,

The following changes since commit 2137397c92aec3713fa10be3c9b830f9a1674e60:

  Merge tag 'sound-5.0' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound (2019-02-20 09:42:52 
-0800)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-1

for you to fetch changes up to 4d6c671ace569d4b0d3f8d92ab3aef18a5d166bc:

  SUNRPC: Take the transport send lock before binding+connecting (2019-03-10 
14:08:19 -0400)


NFS client updates for Linux 5.1

Highlights include:

Stable fixes:
- Fixes for NFS I/O request leakages
- Fix error handling paths in the NFS I/O recoalescing code
- Reinitialise NFSv4.1 sequence results before retransmitting a request
- Fix a soft lockup in the delegation recovery code
- Bulk destroy of layouts needs to be safe w.r.t. umount
- Prevent thundering herd issues when the SUNRPC socket is not connected
- Respect RPC call timeouts when retrying transmission

Features:
- Convert rpc auth layer to use xdr_streams
- Config option to disable insecure RPCSEC_GSS crypto types
- Reduce size of RPC receive buffers
- Readdirplus optimization by cache mechanism
- Convert SUNRPC socket send code to use iov_iter()
- SUNRPC micro-optimisations to avoid indirect calls
- Add support for the pNFS LAYOUTERROR operation and use it with the
  pNFS/flexfiles driver
- Add trace events to report non-zero NFS status codes
- Various removals of unnecessary dprintks

Bugfixes and cleanups:
- Fix a number of sparse warnings and documentation format warnings
- Fix nfs_parse_devname to not modify it's argument
- Fix potential corruption of page being written through pNFS/blocks
- fix xfstest generic/099 failures on nfsv3
- Avoid NFSv4.1 "false retries" when RPC calls are interrupted
- Abort I/O early if the pNFS/flexfiles layout segment was invalidated
- Avoid unnecessary pNFS/flexfiles layout invalidations


Anna Schumaker (1):
  NFS: Add missing encode / decode sequence_maxsz to v4.2 operations

Chuck Lever (23):
  xprtrdma: Fix sparse warnings
  xprtrdma: Check inline size before providing a Write chunk
  xprtrdma: Reduce the doorbell rate (Receive)
  SUNRPC: Display symbolic flag names in RPC trace events
  SUNRPC: Add xdr_stream::rqst field
  SUNRPC: Add XDR overflow trace event
  SUNRPC: Add trace event that reports reply page vector alignment
  NFS: Remove print_overflow_msg()
  NFS: Add trace events to report non-zero NFS status codes
  SUNRPC: Remove some dprintk() call sites from auth functions
  SUNRPC: Remove rpc_xprt::tsh_size
  SUNRPC: Add build option to disable support for insecure enctypes
  SUNRPC: Use struct xdr_stream when constructing RPC Call header
  SUNRPC: Clean up rpc_verify_header()
  SUNRPC: Use struct xdr_stream when decoding RPC Reply header
  SUNRPC: Introduce trace points in rpc_auth_gss.ko
  SUNRPC: Remove xdr_buf_trim()
  SUNRPC: Add SPDX IDs to some net/sunrpc/auth_gss/ files
  SUNRPC: Introduce rpc_prepare_reply_pages()
  NFS: Account for XDR pad of buf->pages
  SUNRPC: Make AUTH_SYS and AUTH_NULL set au_verfsize
  SUNRPC: Add rpc_auth::au_ralign field
  SUNRPC: Use au_rslack when computing reply buffer size

Eric W. Biederman (1):
  fs/nfs: Fix nfs_parse_devname to not modify it's argument

Julia Lawall (1):
  NFS: drop useless LIST_HEAD

Kazuo Ito (2):
  pNFS: Fix potential corruption of page being written
  pNFS: Avoid read/modify/write when it is not necessary

NeilBrown (1):
  SUNRPC: remove pointless test in unx_match()

Trond Myklebust (58):
  NFS: Fix I/O request leakages
  NFS: Fix an I/O request leakage in nfs_do_recoalesce
  NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
  NFS: Clean up list moves of struct nfs_page
  NFS: Pass error information to the pgio error cleanup routine
  NFS: Ensure NFS writeback allocations don't recurse back into NFS.
  NFS: EINTR is also a fatal error.
  NFS: ENOMEM should also be a fatal error.
  NFS: Fix up documentation warnings
  NFS: Fix sparse annotations for nfs_set_open_stateid_locked()
  SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs
  SUNRPC: Use poll() to fix up the socket requeue races
  SUNRPC: Ensure rq_bytes_sent is reset before request transmission
  SUNRPC: Don't reset the stream record info when the receive worker is 
running
  SUNRPC: Handle zero length fragments correctly
  SUNRPC: Don't suppress socket errors when a message read completes
  SUNRPC: Initiate a connection close on an ESHUTDOWN error in stream 
receive
  SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()
  SUNRPC: Convert socket page send code to use iov_iter()
  SUNRPC: Further cleanups of xs_sendpages()
  S

Re: Regression: SUNRPC: Use poll() to fix up the socket requeue races

2019-02-26 Thread Trond Myklebust
On Tue, 2019-02-26 at 10:33 -0800, Tony Lindgren wrote:
> Hi,
> 
> * Trond Myklebust  [700101 00:00]:
> > On Mon, 2019-02-25 at 22:27 +, Jon Hunter wrote:
> > > On 25/02/2019 21:03, Trond Myklebust wrote:
> > > This is nfsroot. I don't specify any particular NFS version from 
> > > the kernel cmdline, but this is seen with ARM kernel configs
> > > tegra_defconfig and multi_v7_defconfig. 
> > > 
> > > Looking at the logs I am seeing the following crash which appears
> > > to point to UDP ...
> > > 
> > > [8.032956] Unable to handle kernel NULL pointer dereference
> > > at
> > > virtual address 0024
> > > [8.041137] pgd = (ptrval)
> > > [8.043858] [0024] *pgd=
> > > [8.047437] Internal error: Oops: 5 [#1] SMP ARM
> > > [8.052049] Modules linked in:
> > > [8.055104] CPU: 1 PID: 100 Comm: kworker/u9:2 Not tainted
> > > 5.0.0-
> > > rc7-next-20190222-g94a4752 #1
> > > [8.063699] Hardware name: NVIDIA Tegra SoC (Flattened Device
> > > Tree)
> > > [8.069960] Workqueue: xprtiod xs_udp_data_receive_workfn
> > > [8.075353] PC is at udp_poll+0x30/0x64
> > > [8.079178] LR is at udp_poll+0x10/0x64
> > 
> > Thanks! I see what the issue is now and I'll be fixing it ASAP.
> 
> I'm seeing this also with NFSroot. I can test the fix when available,
> that is if you can please Cc me on the fix.
> 
> Thanks,
> 
> Tony

I've applied the following patch to my 'testing' branch: 
http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=patch;h=a73881c96d73ee72b7dbbd38a6eeef66182a8ef7

It has so far stood up to testing on my side, so I'm expecting to push
it to linux-next this evening.

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Regression: SUNRPC: Use poll() to fix up the socket requeue races

2019-02-25 Thread Trond Myklebust
On Mon, 2019-02-25 at 22:27 +, Jon Hunter wrote:
> On 25/02/2019 21:03, Trond Myklebust wrote:
> > On Mon, 2019-02-25 at 20:25 +, Jon Hunter wrote:
> > > Hi Trond,
> > > 
> > > Starting in next-20190222 I have observed a regression with NFS
> > > causing
> > > some of our boards to fail to boot. Bisect points to your commit
> > > ...
> > > 
> > > commit 0ffe86f48026b7f34db22d1004bc9992f0db8b33
> > > Author: Trond Myklebust 
> > > Date:   Wed Jan 30 14:51:26 2019 -0500
> > > 
> > > SUNRPC: Use poll() to fix up the socket requeue races
> > > 
> > > 
> > > After reverting this on top of -next I no longer see the problem.
> > > I
> > > have
> > > not had chance to look any closer, but wanted to see if you had
> > > any
> > > ideas what might be the problem.
> > > 
> > > Cheers
> > > Jon
> > 
> > What kind of boot is this? UDP or TCP? nfsroot? NFSv3 or NFSv4? 
> 
> This is nfsroot. I don't specify any particular NFS version from 
> the kernel cmdline, but this is seen with ARM kernel configs
> tegra_defconfig and multi_v7_defconfig. 
> 
> Looking at the logs I am seeing the following crash which appears
> to point to UDP ...
> 
> [8.032956] Unable to handle kernel NULL pointer dereference at
> virtual address 0024
> [8.041137] pgd = (ptrval)
> [8.043858] [0024] *pgd=
> [8.047437] Internal error: Oops: 5 [#1] SMP ARM
> [8.052049] Modules linked in:
> [8.055104] CPU: 1 PID: 100 Comm: kworker/u9:2 Not tainted 5.0.0-
> rc7-next-20190222-g94a4752 #1
> [8.063699] Hardware name: NVIDIA Tegra SoC (Flattened Device
> Tree)
> [8.069960] Workqueue: xprtiod xs_udp_data_receive_workfn
> [8.075353] PC is at udp_poll+0x30/0x64
> [8.079178] LR is at udp_poll+0x10/0x64

Thanks! I see what the issue is now and I'll be fixing it ASAP.

Cheers
   Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: Regression: SUNRPC: Use poll() to fix up the socket requeue races

2019-02-25 Thread Trond Myklebust
On Mon, 2019-02-25 at 20:25 +, Jon Hunter wrote:
> Hi Trond,
> 
> Starting in next-20190222 I have observed a regression with NFS
> causing
> some of our boards to fail to boot. Bisect points to your commit ...
> 
> commit 0ffe86f48026b7f34db22d1004bc9992f0db8b33
> Author: Trond Myklebust 
> Date:   Wed Jan 30 14:51:26 2019 -0500
> 
> SUNRPC: Use poll() to fix up the socket requeue races
> 
> 
> After reverting this on top of -next I no longer see the problem. I
> have
> not had chance to look any closer, but wanted to see if you had any
> ideas what might be the problem.
> 
> Cheers
> Jon

What kind of boot is this? UDP or TCP? nfsroot? NFSv3 or NFSv4? 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [PATCH 15/20] NFS: drop useless LIST_HEAD

2019-02-21 Thread Trond Myklebust
On Sun, 2018-12-23 at 09:57 +0100, Julia Lawall wrote:
> Drop LIST_HEAD where the variable it declares has never
> been used.
> 
> The semantic patch that fixes this problem is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // 
> @@
> identifier x;
> @@
> - LIST_HEAD(x);
>   ... when != x
> // 
> 
> Fixes: 0e20162ed1e9 ("NFSv4.1 Use MDS auth flavor for data server
> connection")
> Signed-off-by: Julia Lawall 
> 
> ---
> Successfully 0-day tested on 151 configurations.
> 
>  fs/nfs/nfs4client.c |1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 2548405da1f7..735c1056a91c 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -145,7 +145,6 @@ static void
>  nfs4_shutdown_ds_clients(struct nfs_client *clp)
>  {
>   struct nfs4_ds_server *dss;
> - LIST_HEAD(shutdown_list);
>  
>   while (!list_empty(>cl_ds_clients)) {
>   dss = list_entry(clp->cl_ds_clients.next,
> 

Thanks Julia! Applied to my linux-next branch for inclusion in the 5.1
merge window.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects

2019-02-20 Thread Trond Myklebust
On Tue, 2019-02-19 at 23:03 +, David Howells wrote:
> Trond Myklebust  wrote:
> 
> > Do we really need a new system call to set up containers? That
> > would
> > force changes to all existing orchestration software.
> 
> No, it wouldn't.  Nothing in my patches forces existing orchestration
> software
> to change, unless it wants to use the new facilities - then it would
> have to
> be changed anyway, right?  I will grant, though, that the extent of
> the change
> might vary.

Right. It depends on what you want to the orchestrator to do. If you
want it to manage authenticated storage for you, then I grant that you
may need to change the existing orchestrator. However if you just want
the containerised software to be able to manage AFS/CIFS/... keys for
its own processes, then it's not obvious to me why you would need a new
orchestrator.

> > Given that the main thing we want to achieve is to direct messages
> > from
> > the kernel to an appropriate handler, why not focus on adding
> > functionality to do just that?
> 
> Because it's *not* just that that is added here.  There are a number
> of things
> this patchset (and one it depends on) provides:
> 
>  (1) The ability to intercept request_key() upcalls that happen
> inside a
>  container, filtered by operative namespace.

The requirement that you need to filter derives from the fact that the
kernel is being forced to run an untrusted executable in user space.
That may be acceptable when running in an uncontainerised environment,
where the executable can be vetted by the sysadmin, but it clearly
isn't in an environment where containers can be set up by untrusted
users.

If we replace the executable with a daemon that is started from inside
the container (presumably by the init process running there), then
there should be no requirement for the orchestrator to filter.

>  (2) The ability to provide a per-container keyring that can hold
> keys that
>  can be used inside the container without any action on behalf of
> the
>  denizens of the container.

Keyrings already define some inheritance semantics for child processes.
Why can't we tweak those semantics to do what is needed?

IOW: instead of adding a container syscall and a new keyring type, why
can't we just define the required keyring type and let it be inherited
through the existing clone() mechanism?

>  (3) The ability to grant permissions to a *container* as a subject,
> allowing
>  it and its denizens to use, but not necessarily read, modify,
> link or
>  invalidate a key.

Again, this sounds like a child process keyring inheritance issue.
Right now, the session keyring does not appear to match the semantics
that you describe, but why couldn't we set up a new keyring type that
can provide them?

>  (4) The ability to create superblocks inside a container with a
> separate
>  mount namespace from outside, such that they can use the
> container keys,
>  thereby allowing the root of a container to be on an
> authenticated
>  filesystem.
> 

I'm not sure that I understand the premise. If the orchestrator is
setting up and managing that authenticated root filesystem, then why do
the containerised processes need to be involved at all?

If, OTOH, the intention is to allow the containerised processes to
manage the filesystems without knowledge of the keyring contents, then
again isn't that really the same issue as (3)?

> > Is there any reason why a syscall to allow an appropriately
> > privileged
> > process to add a keyring-specific message queue to its own
> > user_namespace and obtain a file descriptor to that message queue
> > might
> > not work?
> 
> Yes.  That forces the use of a new user_namespace for every container
> in which
> you want to use any of the above features.  The user_namespace is
> already way
> too big and intrusive a hammer as it is.

No. I would need a user_namespace if I want to allow child processes to
handle request upcalls. Is that unreasonable?

> > With such an implementation, the fallback mechanism could be to
> > walk
> > back up the hierarchy of user_namespaces until a message queue is
> > found, and to invoke the existing request_key mechanism if not.
> 
> That's definitely wrong.  /sbin/request-key should *not* be spawned
> if the key
> to be instantiated is not in all the init namespaces.
> 
> I went with a container object with namespaces for a reason:
> initially, it was
> so that the upcall could take place inside of the container's
> namespaces, but
> now it's do that any request that doesn't match the namespaces on the
> container gets rejected at the boundary - so that some daemon up the
> chain
> doesn't try servicing a request for which it can't access the co

Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects

2019-02-17 Thread Trond Myklebust
   By default, if the container is active and its fd is closed, the
>  container is left running and wil be cleaned up when its 'init'
> exits.
>  The default can be changed with the CONTAINER_KILL_ON_CLOSE
> flag.
> 
>  (4) Supervising the container.
> 
>  Given that we have an fd attached to the container, we could
> make it
>  such that the supervising process could monitor and override
> EPERM
>  returns for mount and other privileged operations within the
>  container.
> 
>  (5) Per-container keyring.
> 
>  Each container can point to a per-container keyring for the
> holding of
>  integrity keys and filesystem keys for use inside the
> container.  This
>  would be attached:
> 
>   keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
> 
>  This keyring would be searched by request_key() after it has
> searched
>  the thread, process and session keyrings.
> 
>  (6) Running different LSM policies by container.  This might
> particularly
>  make sense with something like Apparmor where different path-
> based
>  rules might be required inside a container to inside the parent.
> 
> Signed-off-by: David Howells 
> ---

Do we really need a new system call to set up containers? That would
force changes to all existing orchestration software.

Given that the main thing we want to achieve is to direct messages from
the kernel to an appropriate handler, why not focus on adding
functionality to do just that?

Is there any reason why a syscall to allow an appropriately privileged
process to add a keyring-specific message queue to its own
user_namespace and obtain a file descriptor to that message queue might
not work? That forces the container to use a daemon if it cares to
intercept keyring traffic, rather than worrying about the kernel
running request_key (in fact, it might make sense to allow a trivial
implementation of the daemon to be to just read the messages, parse
them and run request_key).

With such an implementation, the fallback mechanism could be to walk
back up the hierarchy of user_namespaces until a message queue is
found, and to invoke the existing request_key mechanism if not.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: [GIT PULL] Please pull NFS client updates for 4.21

2019-01-02 Thread Trond Myklebust
On Thu, 2019-01-03 at 15:53 +1100, NeilBrown wrote:
> On Wed, Jan 02 2019, Linus Torvalds wrote:
> 
> > On Wed, Jan 2, 2019 at 2:42 PM Schumaker, Anna
> >  wrote:
> > > We also were unable to track down a maintainer for Neil Brown's
> > > changes to
> > > the generic cred code that are prerequisites to his RPC cred
> > > cleanup patches.
> > > We've been asking around for several months without any response,
> > > so
> > > hopefully it's okay to include those patches in this pull
> > > request.
> > 
> > Looks ok to me, although I wonder what the semantics of
> > cred_fscmp()
> > are across namespaces?
> > 
> > IOW, it seems potentially a bit suspicious to do cred_fscmp() if
> > the
> > two creds have different namnespaces? Hmm?
> > 
> > Is there some reason that can't happen, or some reason it doesn't
> > matter?
> > 
> >   Linus
> 
> Interesting question.
> For the current use in NFS, it is consistent with existing practice
> to
> ignore the name space.
> NFS file accesses (when using the normal uid-based access checks)
> always
> use the manifest uid of the process - the one returned by getuid()
> (or
> more accurately, getfsuid()).
> Maybe this is wrong?  Maybe we should always use from_kuid() or
> whatever
> to get the uid/gid to send over the wire?
> 
> Anna/Trond: do you have thoughts on this?  If a process in a user
> namespace accesses a file over NFS, should the UID presented to the
> server be the one in that name-space, or the one you get by mapping
> to
> the global name-space?
> Or should we map to the namespace that was active when the filesystem
> was mounted?
> 
> I don't think cred_fscmp() should do any of this mapping, but maybe
> it
> should treat creds from different namespaces as different - as a
> precaution.
> 
> Thanks,
> NeilBrown

The values being compared are in cred_fscmp() are all of type kuid_t or
kgid_t so that means they have already been mapped from the user
namespace into the kernel uid/gid space.
When we put those kuid/kgid values onto the wire, we currently always
use the init namespace rather than the user namespace of the mount
process.

When using strong authentication (i.e. krb5) then none of this matters,
since the server performs its own mapping of the presented RPCSEC_GSS
session into a credential. That mapping is independent of the user
namespace on the client, it just depends on which krb5 principal the
process used to identify itself.

The problem case is limited to when we're using the weak AUTH_UNIX
authentication, since the server is then implicitly trusting the client
to protect against identity spoofing. This is particularly true if the
NFS server is being accessed through NAT, in which case it has very
limited possibilities for discriminating between containers on the same
client using the export table because they will all originate from the
same source IP address. I think that for these cases, using the init
namespace is the right thing to do for the same reason we use it with
local filesystems: if we try to use a different namespace then
unprivileged userspace processes might be able to manipulate the
mapping to spoof the identities of privileged users or groups, or
otherwise gain access to files to which they normally should not have
access.

Does that argument make sense?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




signature.asc
Description: This is a digitally signed message part


[GIT PULL] Please pull NFS client bugfixes

2018-12-19 Thread Trond Myklebust
Hi Linus,

The following 3 patches fix a regression in the NFS/RPC TPC re-
connection code which can cause the RPC transmission to hang. The issue
was discovered by Dave Wysochanski last week.

With this pull, we still have one more regression to fix. MIPS is
seeing data corruption due to the fact that the iovec_iter code does
not appear to call flush_dcache_page() after copying data into the bvec
pages. We need guidance from Al as to how he wants this fixed.

Cheers
  Trond

The following changes since commit 7566ec393f4161572ba6f11ad5171fd5d59b0fbd:

  Linux 4.20-rc7 (2018-12-16 15:46:55 -0800)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-4.20-6

for you to fetch changes up to abc13275771fac77e2d7b129c289522dacb644b6:

  SUNRPC: Remove xprt_connect_status() (2018-12-18 11:04:10 -0500)


NFS client bugfixes for Linux 4.20

Bugfixes:
- Fix TCP socket disconnection races by ensuring we always call
  xprt_disconnect_done() after releasing the socket.
- Fix a race when clearing both XPRT_CONNECTING and XPRT_LOCKED
- Remove xprt_connect_status() so it does not mask errors that should
  be handled by call_connect_status()


Trond Myklebust (3):
  SUNRPC: Fix disconnection races
  SUNRPC: Fix a race with XPRT_CONNECTING
  SUNRPC: Remove xprt_connect_status()

 net/sunrpc/clnt.c |  1 +
 net/sunrpc/xprt.c | 35 ---
 net/sunrpc/xprtsock.c | 10 --
 3 files changed, 9 insertions(+), 37 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)

2018-12-17 Thread Trond Myklebust
On Mon, 2018-12-17 at 19:55 +0100, Geert Uytterhoeven wrote:
>   Hi Trond,
> 
> (For the newly added CCs, first message was
> https://lore.kernel.org/lkml/CAMuHMdVJr0PwvJg3FeTCy7vxuyY1=s1tplho7hpsozx4wz+...@mail.gmail.com/)
> 
> > On Mon, Dec 17, 2018 at 3:51 PM Trond Myklebust <
> > tron...@hammerspace.com> wrote:
> > > On Mon, 2018-12-17 at 15:03 +0100, Geert Uytterhoeven wrote:
> > > > On Wed, Dec 5, 2018 at 3:47 PM Geert Uytterhoeven <
> > > > ge...@linux-m68k.org> wrote:
> > > > > On Wed, Dec 5, 2018 at 2:45 PM Trond Myklebust <
> > > > > tron...@hammerspace.com> wrote:
> > > > > > On Wed, 2018-12-05 at 14:41 +0100, Geert Uytterhoeven
> > > > > > wrote:
> > > > > > > On Wed, Dec 5, 2018 at 2:11 PM Atsushi Nemoto <
> > > > > > > an...@mba.ocn.ne.jp>
> > > > > > > wrote:
> > > > > > > > On Tue, 4 Dec 2018 14:53:07 +0100, Geert Uytterhoeven <
> > > > > > > > ge...@linux-m68k.org> wrote:
> > > > > > > > > I found similar crashes in a report from 2006, but of
> > > > > > > > > course the
> > > > > > > > > code
> > > > > > > > > has changed too much to apply the solution proposed
> > > > > > > > > there
> > > > > > > > > (
> > > > > > > > > https://www.linux-mips.org/archives/linux-mips/2006-09/msg00169.html
> > > > > > > > > ).
> > > > > > > > > 
> > > > > > > > > Userland is Debian 8 (the last release supporting
> > > > > > > > > "old"
> > > > > > > > > MIPS).
> > > > > > > > > My kernel is based on v4.20.0-rc5, but the issue
> > > > > > > > > happens
> > > > > > > > > with
> > > > > > > > > v4.20-rc1,
> > > > > > > > > too.
> > > > > > > > > 
> > > > > > > > > However, I noticed it works in v4.19! Hence I've
> > > > > > > > > bisected
> > > > > > > > > this,
> > > > > > > > > to commit
> > > > > > > > > 277e4ab7d530bf28 ("SUNRPC: Simplify TCP receive code
> > > > > > > > > by
> > > > > > > > > switching
> > > > > > > > > to using
> > > > > > > > > iterators").
> > > > > > > > > 
> > > > > > > > > Dropping the ",tcp" part from the nfsroot parameter
> > > > > > > > > also
> > > > > > > > > fixes
> > > > > > > > > the issue.
> > > > > > > > > 
> > > > > > > > > Given RBTX4927 is little endian, just like my
> > > > > > > > > arm/arm64
> > > > > > > > > boards,
> > > > > > > > > it's probably
> > > > > > > > > not an endianness issue.  Sparse didn't show anything
> > > > > > > > > suspicious
> > > > > > > > > before/after
> > > > > > > > > the guilty commit.
> > > > > > > > > 
> > > > > > > > > Do you have a clue?
> > > > > > > > 
> > > > > > > > If it was a cache issue, disabling i-cache or d-cache
> > > > > > > > completely
> > > > > > > > might
> > > > > > > > help understanding the problem.  I added TXx9 specific
> > > > > > > > "icdisable"
> > > > > > > > and
> > > > > > > > "dcdisable" kernel options for debugging long ago.
> > > > > > > > 
> > > > > > > > I hope these options still works correctly with recent
> > > > > > > > kernel
> > > > > > > > but
> > > > > > > > not
> > > > > > > > sure.
> > > > > > > > 
> > > > > > > > Also, disabling i-cache makes your board VERY slow, of
> > > > > > > > course.
> > > > > > >

Re: NFS/TCP crashes on MIPS/RBTX4927 in v4.20-rcX (bisected)

2018-12-17 Thread Trond Myklebust
Hi Geert,

On Mon, 2018-12-17 at 15:03 +0100, Geert Uytterhoeven wrote:
> Hi Trond,
> 
> On Wed, Dec 5, 2018 at 3:47 PM Geert Uytterhoeven <
> ge...@linux-m68k.org> wrote:
> > On Wed, Dec 5, 2018 at 2:45 PM Trond Myklebust <
> > tron...@hammerspace.com> wrote:
> > > On Wed, 2018-12-05 at 14:41 +0100, Geert Uytterhoeven wrote:
> > > > On Wed, Dec 5, 2018 at 2:11 PM Atsushi Nemoto <
> > > > an...@mba.ocn.ne.jp>
> > > > wrote:
> > > > > On Tue, 4 Dec 2018 14:53:07 +0100, Geert Uytterhoeven <
> > > > > ge...@linux-m68k.org> wrote:
> > > > > > I found similar crashes in a report from 2006, but of
> > > > > > course the
> > > > > > code
> > > > > > has changed too much to apply the solution proposed there
> > > > > > (
> > > > > > https://www.linux-mips.org/archives/linux-mips/2006-09/msg00169.html
> > > > > > ).
> > > > > > 
> > > > > > Userland is Debian 8 (the last release supporting "old"
> > > > > > MIPS).
> > > > > > My kernel is based on v4.20.0-rc5, but the issue happens
> > > > > > with
> > > > > > v4.20-rc1,
> > > > > > too.
> > > > > > 
> > > > > > However, I noticed it works in v4.19! Hence I've bisected
> > > > > > this,
> > > > > > to commit
> > > > > > 277e4ab7d530bf28 ("SUNRPC: Simplify TCP receive code by
> > > > > > switching
> > > > > > to using
> > > > > > iterators").
> > > > > > 
> > > > > > Dropping the ",tcp" part from the nfsroot parameter also
> > > > > > fixes
> > > > > > the issue.
> > > > > > 
> > > > > > Given RBTX4927 is little endian, just like my arm/arm64
> > > > > > boards,
> > > > > > it's probably
> > > > > > not an endianness issue.  Sparse didn't show anything
> > > > > > suspicious
> > > > > > before/after
> > > > > > the guilty commit.
> > > > > > 
> > > > > > Do you have a clue?
> > > > > 
> > > > > If it was a cache issue, disabling i-cache or d-cache
> > > > > completely
> > > > > might
> > > > > help understanding the problem.  I added TXx9 specific
> > > > > "icdisable"
> > > > > and
> > > > > "dcdisable" kernel options for debugging long ago.
> > > > > 
> > > > > I hope these options still works correctly with recent kernel
> > > > > but
> > > > > not
> > > > > sure.
> > > > > 
> > > > > Also, disabling i-cache makes your board VERY slow, of
> > > > > course.
> > > > 
> > > > Thanks!
> > > > 
> > > > When using these options, I do see a slowdown in early boot,
> > > > but the
> > > > issue
> > > > is still there.
> > > > 
> > > > My next guess is an unaligned access not using
> > > > {get,put}_unaligned(),
> > > > which
> > > > doesn't seem to work on tx4927, but doesn't cause an exception
> > > > neither.
> > > 
> > > Can you try my linux-next branch on git.linux-nfs.org? It
> > > contains a
> > > fixes for a hang that results from the above commit.
> > > 
> > > git pull git://git.linux-nfs.org/projects/trondmy/linux-nfs.git
> > > linux-next
> > 
> > Thanks for the suggestion, but unfortunately it doesn't help.
> 
> In the mean time, I tried your newer linux-next, no change.
> I tried several other things:
>   - remove the packed attribute (why did you add that?),

The packed attribute allows us to avoid a series of copy operations
when decoding the first three elements of a RPC over TCP header (which
is why they are all declared as big endian). The alternative would be
to have a 12 byte buffer there for temporary storage, and then a
duplicate set of 3 32-bit words into which we copy the buffer contents
after extracting them from the (non-blocking) socket.

>   - verify (at runtime) that all accesses to fraghdr, xid, and
> calldir
> are aligned,
>   - enable RPC_DEBUG_DATA, nothing fishy seen at first sight.
> 
> Is anyone else seeing this on MIPS, or any other platform?
> Does mounting NFS with -o nfsvers=3,tcp work on other MIPS platforms?

I have no access to any MIPS hardware for the purposes of testing so
that would be a question for the community.

One thing that I have noticed is that unlike the old code, the bvec
'generic' code does appear to fail to call flush_dcache_page(). Could
that be causing the problem here? If so, why would that not be a
problem in the context of regular block I/O?

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client changes

2018-12-06 Thread Trond Myklebust
Hi Linus,

Apologies for the relatively high volume this late in -rc. This is mainly
fallout from the updates to the SUNRPC code that is being triggered from
less common combinations of NFS mount options.

Cheers
  Trond

The following changes since commit 4b78317679c4f3782a3cff0ddb269c1fcfde7621:

  Merge branch 'x86-pti-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2018-12-01 12:35:48 
-0800)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-4.20-5

for you to fetch changes up to 79462857eb547e5d17fc8445b9768615e02dc1cf:

  SUNRPC: Don't force a redundant disconnection in xs_read_stream() (2018-12-05 
07:11:12 -0500)


NFS client bugfixes for Linux 4.20

Highlights include:

Stable fixes:
 - Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.

Bugfixes:
 - Fix a regression that causes the RPC receive code to hang
 - Fix call_connect_status() so that it handles tasks that got transmitted
   while queued waiting for the socket lock.
 - Fix a memory leak in call_encode()
 - Fix several other connect races.
 - Fix receive code error handling.
 - Use the discard iterator rather than MSG_TRUNC for compatibility with
   AF_UNIX/AF_LOCAL sockets.
 - nfs: don't dirty kernel pages read by direct-io
 - pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4 data
   servers


Chuck Lever (1):
  SUNRPC: Fix leak of krb5p encode pages

Dave Kleikamp (1):
  nfs: don't dirty kernel pages read by direct-io

Tigran Mkrtchyan (1):
  flexfiles: enforce per-mirror stateid only for v4 DSes

Trond Myklebust (9):
  SUNRPC: call_connect_status() must handle tasks that got transmitted
  SUNRPC: Fix a memory leak in call_encode()
  SUNRPC: Fix a potential race in xprt_connect()
  SUNRPC: Fix RPC receive hangs
  SUNRPC: Fix up handling of the XDRBUF_SPARSE_PAGES flag
  SUNRPC: Treat EFAULT as a truncated message in xs_read_stream_request()
  SUNRPC: Use the discard iterator rather than MSG_TRUNC
  SUNRPC: Fix up socket polling
  SUNRPC: Don't force a redundant disconnection in xs_read_stream()

 fs/nfs/direct.c|  9 +++-
 fs/nfs/flexfilelayout/flexfilelayout.c |  6 ++-
 include/linux/sunrpc/xdr.h |  1 -
 net/sunrpc/auth_gss/auth_gss.c |  4 ++
 net/sunrpc/clnt.c  |  8 
 net/sunrpc/xprt.c  | 13 +-
 net/sunrpc/xprtsock.c  | 81 --
 7 files changed, 73 insertions(+), 49 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




[GIT PULL] Please pull NFS client changes

2018-12-06 Thread Trond Myklebust
Hi Linus,

Apologies for the relatively high volume this late in -rc. This is mainly
fallout from the updates to the SUNRPC code that is being triggered from
less common combinations of NFS mount options.

Cheers
  Trond

The following changes since commit 4b78317679c4f3782a3cff0ddb269c1fcfde7621:

  Merge branch 'x86-pti-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2018-12-01 12:35:48 
-0800)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-4.20-5

for you to fetch changes up to 79462857eb547e5d17fc8445b9768615e02dc1cf:

  SUNRPC: Don't force a redundant disconnection in xs_read_stream() (2018-12-05 
07:11:12 -0500)


NFS client bugfixes for Linux 4.20

Highlights include:

Stable fixes:
 - Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.

Bugfixes:
 - Fix a regression that causes the RPC receive code to hang
 - Fix call_connect_status() so that it handles tasks that got transmitted
   while queued waiting for the socket lock.
 - Fix a memory leak in call_encode()
 - Fix several other connect races.
 - Fix receive code error handling.
 - Use the discard iterator rather than MSG_TRUNC for compatibility with
   AF_UNIX/AF_LOCAL sockets.
 - nfs: don't dirty kernel pages read by direct-io
 - pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4 data
   servers


Chuck Lever (1):
  SUNRPC: Fix leak of krb5p encode pages

Dave Kleikamp (1):
  nfs: don't dirty kernel pages read by direct-io

Tigran Mkrtchyan (1):
  flexfiles: enforce per-mirror stateid only for v4 DSes

Trond Myklebust (9):
  SUNRPC: call_connect_status() must handle tasks that got transmitted
  SUNRPC: Fix a memory leak in call_encode()
  SUNRPC: Fix a potential race in xprt_connect()
  SUNRPC: Fix RPC receive hangs
  SUNRPC: Fix up handling of the XDRBUF_SPARSE_PAGES flag
  SUNRPC: Treat EFAULT as a truncated message in xs_read_stream_request()
  SUNRPC: Use the discard iterator rather than MSG_TRUNC
  SUNRPC: Fix up socket polling
  SUNRPC: Don't force a redundant disconnection in xs_read_stream()

 fs/nfs/direct.c|  9 +++-
 fs/nfs/flexfilelayout/flexfilelayout.c |  6 ++-
 include/linux/sunrpc/xdr.h |  1 -
 net/sunrpc/auth_gss/auth_gss.c |  4 ++
 net/sunrpc/clnt.c  |  8 
 net/sunrpc/xprt.c  | 13 +-
 net/sunrpc/xprtsock.c  | 81 --
 7 files changed, 73 insertions(+), 49 deletions(-)
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com




  1   2   3   4   5   6   7   8   9   10   >