Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Chuck Lever
On Wed, Apr 10, 2024 at 01:07:41PM -0400, Steven Rostedt wrote:
> On Wed, 10 Apr 2024 12:38:53 -0400
> Chuck Lever  wrote:
> 
> > On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote:
> > > From: "Steven Rostedt (Google)" 
> > > 
> > > The rpcgss_context trace event acceptor field is a dynamically sized
> > > string that records the "data" parameter. But this parameter is also
> > > dependent on the "len" field to determine the size of the data.
> > > 
> > > It needs to use __string_len() helper macro where the length can be passed
> > > in. It also incorrectly uses strncpy() to save it instead of
> > > __assign_str(). As these macros can change, it is not wise to open code
> > > them in trace events.
> > > 
> > > As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
> > > __assign_str() can be used for both __string() and __string_len() fields.
> > > Before that commit, __assign_str_len() is required to be used. This needs
> > > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
> > > Rework __assign_str() and __string() to not duplicate getting the string")
> > > is the commit that makes __assign_str_len() obsolete).
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
> > > Signed-off-by: Steven Rostedt (Google)   
> > 
> > Acked-by: Chuck Lever 
> > 
> 
> Thanks, but feel free to take it if you want. Unless you rather have me
> take it through my tree?

Well I guess I could test it for you. I'll take it for nfsd v6.9-rc.


-- 
Chuck Lever



Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field

2024-04-10 Thread Chuck Lever
On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The rpcgss_context trace event acceptor field is a dynamically sized
> string that records the "data" parameter. But this parameter is also
> dependent on the "len" field to determine the size of the data.
> 
> It needs to use __string_len() helper macro where the length can be passed
> in. It also incorrectly uses strncpy() to save it instead of
> __assign_str(). As these macros can change, it is not wise to open code
> them in trace events.
> 
> As of commit c759e609030c ("tracing: Remove __assign_str_len()"),
> __assign_str() can be used for both __string() and __string_len() fields.
> Before that commit, __assign_str_len() is required to be used. This needs
> to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing:
> Rework __assign_str() and __string() to not duplicate getting the string")
> is the commit that makes __assign_str_len() obsolete).
> 
> Cc: sta...@vger.kernel.org
> Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
> Signed-off-by: Steven Rostedt (Google) 

Acked-by: Chuck Lever 


> ---
>  include/trace/events/rpcgss.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/rpcgss.h b/include/trace/events/rpcgss.h
> index ba2d96a1bc2f..f50fcafc69de 100644
> --- a/include/trace/events/rpcgss.h
> +++ b/include/trace/events/rpcgss.h
> @@ -609,7 +609,7 @@ TRACE_EVENT(rpcgss_context,
>   __field(unsigned int, timeout)
>   __field(u32, window_size)
>   __field(int, len)
> - __string(acceptor, data)
> + __string_len(acceptor, data, len)
>   ),
>  
>   TP_fast_assign(
> @@ -618,7 +618,7 @@ TRACE_EVENT(rpcgss_context,
>   __entry->timeout = timeout;
>   __entry->window_size = window_size;
>   __entry->len = len;
> - strncpy(__get_str(acceptor), data, len);
> + __assign_str(acceptor, data);
>   ),
>  
>   TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s",
> -- 
> 2.43.0
> 

-- 
Chuck Lever
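
A self-contained userspace sketch (not the kernel trace macros
themselves) of the sizing mismatch the patch above addresses: the GSS
acceptor name is a counted buffer of "len" bytes with no terminating
NUL, so storage has to be sized by the explicit length rather than by
strlen(). Names and values here are purely illustrative.

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* A counted buffer, as handed to the trace event: no trailing NUL. */
	const char data[] = { 'n', 'f', 's', '@', 'h', 'o', 's', 't' };
	size_t len = sizeof(data);
	char stored[64];

	/*
	 * Rough equivalent of __string_len(acceptor, data, len) followed by
	 * __assign_str(): reserve len + 1 bytes and terminate explicitly.
	 * Sizing by strlen(data), as __string(acceptor, data) did, would
	 * read past the end of the buffer.
	 */
	memcpy(stored, data, len);
	stored[len] = '\0';

	printf("acceptor=%.*s\n", (int)len, stored);
	return 0;
}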



Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro

2024-02-22 Thread Chuck Lever
On Thu, Feb 22, 2024 at 12:28:28PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> I'm working on restructuring the __string* macros so that it doesn't need
> to recalculate the string twice. That is, it will save it off when
> processing __string() and the __assign_str() will not need to do the work
> again as it currently does.
> 
> Currently __string_len(item, src, len) doesn't actually use "src", but my
> changes will require src to be correct as that is where the __assign_str()
> will get its value from.
> 
> The event class nfsd_clid_class has:
> 
>   __string_len(name, name, clp->cl_name.len)
> 
> But the second "name" does not exist and causes my changes to fail to
> build. That second parameter should be: clp->cl_name.data.
> 
> Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for 
> nfsd_clid_class")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/nfsd/trace.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index d1e8cf079b0f..2cd57033791f 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class,
>   __array(unsigned char, addr, sizeof(struct sockaddr_in6))
>   __field(unsigned long, flavor)
>   __array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
> - __string_len(name, name, clp->cl_name.len)
> + __string_len(name, clp->cl_name.data, clp->cl_name.len)
>   ),
>   TP_fast_assign(
>   __entry->cl_boot = clp->cl_clientid.cl_boot;
> -- 
> 2.43.0
> 

Do you want me to take this through the nfsd tree, or would you like
an Ack from me so you can handle it as part of your clean up? Just
in case:

Acked-by: Chuck Lever 

-- 
Chuck Lever
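
For context, a small self-contained userspace model (purely
illustrative, not the real macros) of the rework Steven describes
above: the source pointer is captured when the field is declared, and
the assignment step reuses that saved pointer instead of recomputing
the string, which is why the declared source in nfsd_clid_class now
has to be correct.

#include <stdio.h>
#include <string.h>

struct str_field {
	const char *src;	/* captured at "__string_len()" time */
	size_t len;
	char buf[64];
};

static void declare_str(struct str_field *f, const char *src, size_t len)
{
	f->src = src;
	f->len = len;
}

/* Reuses the saved pointer, mirroring how the reworked __assign_str()
 * no longer recomputes the string at assignment time. */
static void assign_str(struct str_field *f)
{
	memcpy(f->buf, f->src, f->len);
	f->buf[f->len] = '\0';
}

int main(void)
{
	const char name[] = "client-1";
	struct str_field f;

	/* Declaring the field with the wrong source would make assign_str()
	 * copy the wrong data -- the bug class fixed in nfsd_clid_class. */
	declare_str(&f, name, strlen(name));
	assign_str(&f);
	printf("name=%s\n", f.buf);
	return 0;
}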



Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs

2024-01-25 Thread Chuck Lever
On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote:
> Long ago, file locks used to hang off of a singly-linked list in struct
> inode. Because of this, when leases were added, they were added to the
> same list and so they had to be tracked using the same sort of
> structure.
> 
> Several years ago, we added struct file_lock_context, which allowed us
> to use separate lists to track different types of file locks. Given
> that, leases no longer need to be tracked using struct file_lock.
> 
> That said, a lot of the underlying infrastructure _is_ the same between
> file leases and locks, so we can't completely separate everything.
> 
> This patchset first splits a group of fields used by both file locks and
> leases into a new struct file_lock_core, that is then embedded in struct
> file_lock. Coccinelle was then used to convert a lot of the callers to
> deal with the move, with the remaining 25% or so converted by hand.
> 
> It then converts several internal functions in fs/locks.c to work
> with struct file_lock_core. Lastly, struct file_lock is split into
> struct file_lock and file_lease, and the lease-related APIs converted to
> take struct file_lease.
> 
> After the first few patches (which I left split up for easier review),
> the set should be bisectable. I'll plan to squash the first few
> together to make sure the resulting set is bisectable before merge.
> 
> Finally, I left the coccinelle scripts I used in tree. I had heard it
> was preferable to merge those along with the patches that they
> generate, but I wasn't sure where they go. I can either move those to a
> more appropriate location or we can just drop that commit if it's not
> needed.
> 
> Signed-off-by: Jeff Layton 

v2 looks nicer.

I would add a few list handling primitives, as I see enough
instances of list_for_each_entry, list_for_each_entry_safe,
list_first_entry, and list_first_entry_or_null on fl_core.flc_list
to make it worth having those.

Also, there doesn't seem to be benefit for API consumers to have to
understand the internal structure of struct file_lock/lease to reach
into fl_core. Having accessor functions for common fields like
fl_type and fl_flags could be cleaner.
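
To make that concrete, a minimal sketch of the kind of accessors meant
here, with mock stand-in structures so it compiles on its own; the
helper names and the mock types are hypothetical, not something the
series adds:

#include <stdio.h>

/* Mock stand-ins for the kernel structures, using the v2 "flc_" names. */
struct file_lock_core {
	unsigned char flc_type;
	unsigned int flc_flags;
};

struct file_lock {
	struct file_lock_core fl_core;
};

/* Accessors so API consumers never reach into fl_core directly. */
static inline unsigned char lock_type(const struct file_lock *fl)
{
	return fl->fl_core.flc_type;
}

static inline unsigned int lock_flags(const struct file_lock *fl)
{
	return fl->fl_core.flc_flags;
}

int main(void)
{
	struct file_lock fl = {
		.fl_core = { .flc_type = 1, .flc_flags = 0x80 },
	};

	printf("type=%u flags=%#x\n", lock_type(&fl), lock_flags(&fl));
	return 0;
}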

For the series:

Reviewed-by: Chuck Lever 

For the nfsd and lockd parts:

Acked-by: Chuck Lever 


> ---
> Changes in v2:
> - renamed file_lock_core fields to have "flc_" prefix
> - used macros to more easily do the change piecemeal
> - broke up patches into per-subsystem ones
> - Link to v1: 
> https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org
> 
> ---
> Jeff Layton (41):
>   filelock: rename some fields in tracepoints
>   filelock: rename fl_pid variable in lock_get_status
>   dlm: rename fl_flags variable in dlm_posix_unlock
>   nfs: rename fl_flags variable in nfs4_proc_unlck
>   nfsd: rename fl_type and fl_flags variables in nfsd4_lock
>   lockd: rename fl_flags and fl_type variables in nlmclnt_lock
>   9p: rename fl_type variable in v9fs_file_do_lock
>   afs: rename fl_type variable in afs_next_locker
>   filelock: drop the IS_* macros
>   filelock: split common fields into struct file_lock_core
>   filelock: add coccinelle scripts to move fields to struct file_lock_core
>   filelock: have fs/locks.c deal with file_lock_core directly
>   filelock: convert some internal functions to use file_lock_core instead
>   filelock: convert more internal functions to use file_lock_core
>   filelock: make posix_same_owner take file_lock_core pointers
>   filelock: convert posix_owner_key to take file_lock_core arg
>   filelock: make locks_{insert,delete}_global_locks take file_lock_core 
> arg
>   filelock: convert locks_{insert,delete}_global_blocked
>   filelock: make __locks_delete_block and __locks_wake_up_blocks take 
> file_lock_core
>   filelock: convert __locks_insert_block, conflict and deadlock checks to 
> use file_lock_core
>   filelock: convert fl_blocker to file_lock_core
>   filelock: clean up locks_delete_block internals
>   filelock: reorganize locks_delete_block and __locks_insert_block
>   filelock: make assign_type helper take a file_lock_core pointer
>   filelock: convert locks_wake_up_blocks to take a file_lock_core pointer
>   filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx
>   filelock: convert locks_translate_pid to take file_lock_core
>   filelock: convert seqfile handling to use file_lock_core
>   9p: adapt to breakup of struct file_lock
>   afs: adapt to breakup of struct file_lock
>   ceph: adapt to breakup of struct file_lock
>   dlm: adapt to breakup of struct file_lock
>   gfs2: adapt to breakup of struct file

Re: [PATCH 04/20] filelock: fixups after the coccinelle changes

2024-01-17 Thread Chuck Lever
On Tue, Jan 16, 2024 at 02:46:00PM -0500, Jeff Layton wrote:
> The coccinelle script doesn't catch quite everythng (particularly with
> embedded structs). These are some by-hand fixups after the split of
> common fields into struct file_lock_core.
> 
> Signed-off-by: Jeff Layton 

For the changes in fs/lockd/ and fs/nfsd/:

Acked-by: Chuck Lever 


> ---
>  fs/ceph/locks.c |  8 ++---
>  fs/lockd/clnt4xdr.c |  8 ++---
>  fs/lockd/clntproc.c |  6 ++--
>  fs/lockd/clntxdr.c  |  8 ++---
>  fs/lockd/svc4proc.c | 10 +++---
>  fs/lockd/svclock.c  | 54 +
>  fs/lockd/svcproc.c  | 10 +++---
>  fs/lockd/svcsubs.c  |  4 +--
>  fs/lockd/xdr.c  |  8 ++---
>  fs/lockd/xdr4.c |  8 ++---
>  fs/locks.c  | 67 
> +
>  fs/nfs/delegation.c |  2 +-
>  fs/nfs/nfs4state.c  |  2 +-
>  fs/nfs/nfs4trace.h  |  4 +--
>  fs/nfs/write.c  |  4 +--
>  fs/nfsd/nfs4callback.c  |  2 +-
>  fs/nfsd/nfs4state.c |  4 +--
>  fs/smb/client/file.c|  2 +-
>  fs/smb/server/vfs.c |  2 +-
>  include/trace/events/afs.h  |  4 +--
>  include/trace/events/filelock.h | 32 ++--
>  21 files changed, 126 insertions(+), 123 deletions(-)
> 
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index ee12f9864980..55be5d231e38 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -386,9 +386,9 @@ void ceph_count_locks(struct inode *inode, int 
> *fcntl_count, int *flock_count)
>   ctx = locks_inode_context(inode);
>   if (ctx) {
>   spin_lock(&ctx->flc_lock);
> - list_for_each_entry(lock, &ctx->flc_posix, fl_list)
> + list_for_each_entry(lock, &ctx->flc_posix, fl_core.fl_list)
>   ++(*fcntl_count);
> - list_for_each_entry(lock, &ctx->flc_flock, fl_list)
> + list_for_each_entry(lock, &ctx->flc_flock, fl_core.fl_list)
>   ++(*flock_count);
>   spin_unlock(&ctx->flc_lock);
>   }
> @@ -455,7 +455,7 @@ int ceph_encode_locks_to_buffer(struct inode *inode,
>   return 0;
>  
>   spin_lock(&ctx->flc_lock);
> - list_for_each_entry(lock, &ctx->flc_posix, fl_list) {
> + list_for_each_entry(lock, &ctx->flc_posix, fl_core.fl_list) {
>   ++seen_fcntl;
>   if (seen_fcntl > num_fcntl_locks) {
>   err = -ENOSPC;
> @@ -466,7 +466,7 @@ int ceph_encode_locks_to_buffer(struct inode *inode,
>   goto fail;
>   ++l;
>   }
> - list_for_each_entry(lock, &ctx->flc_flock, fl_list) {
> + list_for_each_entry(lock, &ctx->flc_flock, fl_core.fl_list) {
>   ++seen_flock;
>   if (seen_flock > num_flock_locks) {
>   err = -ENOSPC;
> diff --git a/fs/lockd/clnt4xdr.c b/fs/lockd/clnt4xdr.c
> index ed00bd2869a7..083a3b1bf288 100644
> --- a/fs/lockd/clnt4xdr.c
> +++ b/fs/lockd/clnt4xdr.c
> @@ -243,7 +243,7 @@ static void encode_nlm4_holder(struct xdr_stream *xdr,
>   u64 l_offset, l_len;
>   __be32 *p;
>  
> - encode_bool(xdr, lock->fl.fl_type == F_RDLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_RDLCK);
>   encode_int32(xdr, lock->svid);
>   encode_netobj(xdr, lock->oh.data, lock->oh.len);
>  
> @@ -357,7 +357,7 @@ static void nlm4_xdr_enc_testargs(struct rpc_rqst *req,
>   const struct nlm_lock *lock = &args->lock;
>  
>   encode_cookie(xdr, &args->cookie);
> - encode_bool(xdr, lock->fl.fl_type == F_WRLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK);
>   encode_nlm4_lock(xdr, lock);
>  }
>  
> @@ -380,7 +380,7 @@ static void nlm4_xdr_enc_lockargs(struct rpc_rqst *req,
>  
>   encode_cookie(xdr, &args->cookie);
>   encode_bool(xdr, args->block);
> - encode_bool(xdr, lock->fl.fl_type == F_WRLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK);
>   encode_nlm4_lock(xdr, lock);
>   encode_bool(xdr, args->reclaim);
>   encode_int32(xdr, args->state);
> @@ -403,7 +403,7 @@ static void nlm4_xdr_enc_cancargs(struct rpc_rqst *req,
>  
>   encode_cookie(xdr, &args->cookie);
>   encode_bool(xdr, args->block);
> - encode_bool(xdr, lock->fl.fl_type == F_WRLCK);
> + encode_bool(xdr, lock->fl.fl_core.fl_type == F_WRLCK);
>   encode_nlm4_lock(xdr, lock);
>  }
>  
> diff --git a/fs/lockd/clntproc.c b/fs/lockd/clntproc.c
> index ac1d0

Re: [PATCH 00/20] filelock: split struct file_lock into file_lock and file_lease structs

2024-01-17 Thread Chuck Lever
  2 +-
>  fs/nfs/nfs4proc.c   |  39 +-
>  fs/nfs/nfs4state.c  |   6 +-
>  fs/nfs/nfs4trace.h  |   4 +-
>  fs/nfs/nfs4xdr.c|   8 +-
>  fs/nfs/write.c  |   8 +-
>  fs/nfsd/filecache.c |   4 +-
>  fs/nfsd/nfs4callback.c  |   2 +-
>  fs/nfsd/nfs4layouts.c   |  34 +-
>  fs/nfsd/nfs4state.c |  98 ++---
>  fs/ocfs2/locks.c|  12 +-
>  fs/ocfs2/stack_user.c   |   2 +-
>  fs/smb/client/cifsfs.c  |   2 +-
>  fs/smb/client/cifssmb.c |   8 +-
>  fs/smb/client/file.c|  74 ++--
>  fs/smb/client/smb2file.c|   2 +-
>  fs/smb/server/smb2pdu.c |  44 +--
>  fs/smb/server/vfs.c |  14 +-
>  include/linux/filelock.h|  58 ++-
>  include/linux/fs.h  |   5 +-
>  include/linux/lockd/lockd.h |   8 +-
>  include/trace/events/afs.h  |   4 +-
>  include/trace/events/filelock.h |  54 +--
>  48 files changed, 1119 insertions(+), 825 deletions(-)
> ---
> base-commit: 052d534373b7ed33712a63d5e17b2b6cdbce84fd
> change-id: 20240116-flsplit-bdb46824db68
> 
> Best regards,
> -- 
> Jeff Layton 
> 

-- 
Chuck Lever



Re: [PATCH 03/20] filelock: the results of the coccinelle conversion

2024-01-17 Thread Chuck Lever
On Tue, Jan 16, 2024 at 02:45:59PM -0500, Jeff Layton wrote:
> This is the direct result of the changes generated by coccinelle. There
> is still about 1/4 of the callsites that need to be touched by hand
> here.
> 
> Signed-off-by: Jeff Layton 

For the changes in include/linux/lockd/lockd.h, fs/lockd/, and
fs/nfsd/:

Acked-by: Chuck Lever 


> ---
>  fs/9p/vfs_file.c|  38 ++---
>  fs/afs/flock.c  |  55 +++---
>  fs/ceph/locks.c |  66 +++
>  fs/dlm/plock.c  |  44 ++---
>  fs/fuse/file.c  |  14 +-
>  fs/gfs2/file.c  |  16 +-
>  fs/lockd/clnt4xdr.c |   6 +-
>  fs/lockd/clntlock.c |   2 +-
>  fs/lockd/clntproc.c |  60 ---
>  fs/lockd/clntxdr.c  |   6 +-
>  fs/lockd/svclock.c  |  10 +-
>  fs/lockd/svcsubs.c  |  20 +--
>  fs/lockd/xdr.c  |   6 +-
>  fs/lockd/xdr4.c |   6 +-
>  fs/locks.c  | 406 
> +++-
>  fs/nfs/delegation.c |   2 +-
>  fs/nfs/file.c   |  22 +--
>  fs/nfs/nfs3proc.c   |   2 +-
>  fs/nfs/nfs4proc.c   |  35 ++--
>  fs/nfs/nfs4state.c  |   4 +-
>  fs/nfs/nfs4xdr.c|   8 +-
>  fs/nfs/write.c  |   4 +-
>  fs/nfsd/filecache.c |   4 +-
>  fs/nfsd/nfs4layouts.c   |  15 +-
>  fs/nfsd/nfs4state.c |  73 
>  fs/ocfs2/locks.c|  12 +-
>  fs/ocfs2/stack_user.c   |   2 +-
>  fs/smb/client/cifssmb.c |   8 +-
>  fs/smb/client/file.c|  72 
>  fs/smb/client/smb2file.c|   2 +-
>  fs/smb/server/smb2pdu.c |  44 ++---
>  fs/smb/server/vfs.c |  12 +-
>  include/linux/lockd/lockd.h |   8 +-
>  33 files changed, 554 insertions(+), 530 deletions(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 11cd8d23f6f2..f35ac7cb782e 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -107,7 +107,7 @@ static int v9fs_file_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>  
>   p9_debug(P9_DEBUG_VFS, "filp: %p lock: %p\n", filp, fl);
>  
> - if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
> + if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_core.fl_type != 
> F_UNLCK) {
>   filemap_write_and_wait(inode->i_mapping);
>   invalidate_mapping_pages(&inode->i_data, 0, -1);
>   }
> @@ -127,7 +127,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   fid = filp->private_data;
>   BUG_ON(fid == NULL);
>  
> - BUG_ON((fl->fl_flags & FL_POSIX) != FL_POSIX);
> + BUG_ON((fl->fl_core.fl_flags & FL_POSIX) != FL_POSIX);
>  
>   res = locks_lock_file_wait(filp, fl);
>   if (res < 0)
> @@ -136,7 +136,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   /* convert posix lock to p9 tlock args */
>   memset(&flock, 0, sizeof(flock));
>   /* map the lock type */
> - switch (fl->fl_type) {
> + switch (fl->fl_core.fl_type) {
>   case F_RDLCK:
>   flock.type = P9_LOCK_TYPE_RDLCK;
>   break;
> @@ -152,7 +152,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   flock.length = 0;
>   else
>   flock.length = fl->fl_end - fl->fl_start + 1;
> - flock.proc_id = fl->fl_pid;
> + flock.proc_id = fl->fl_core.fl_pid;
>   flock.client_id = fid->clnt->name;
>   if (IS_SETLKW(cmd))
>   flock.flags = P9_LOCK_FLAGS_BLOCK;
> @@ -207,12 +207,12 @@ static int v9fs_file_do_lock(struct file *filp, int 
> cmd, struct file_lock *fl)
>* incase server returned error for lock request, revert
>* it locally
>*/
> - if (res < 0 && fl->fl_type != F_UNLCK) {
> - fl_type = fl->fl_type;
> - fl->fl_type = F_UNLCK;
> + if (res < 0 && fl->fl_core.fl_type != F_UNLCK) {
> + fl_type = fl->fl_core.fl_type;
> + fl->fl_core.fl_type = F_UNLCK;
>   /* Even if this fails we want to return the remote error */
>   locks_lock_file_wait(filp, fl);
> - fl->fl_type = fl_type;
> + fl->fl_core.fl_type = fl_type;
>   }
>   if (flock.client_id != fid->clnt->name)
>   kfree(flock.client_id);
> @@ -234,7 +234,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>* if we have a conflicting lock locally, no need to validate
>

Re: [PATCH 016/141] nfsd: Fix fall-through warnings for Clang

2021-04-20 Thread Chuck Lever III



> On Apr 20, 2021, at 4:28 PM, Gustavo A. R. Silva  
> wrote:
> 
> Hi all,
> 
> Friendly ping: who can take this, please?

Hum. I thought this was going through another tree.
I can take it for the second v5.13 NFSD PR.


> Thanks
> --
> Gustavo
> 
> On 11/23/20 16:46, Gustavo A. R. Silva wrote:
>> On Fri, Nov 20, 2020 at 01:27:51PM -0500, Chuck Lever wrote:
>>> 
>>> 
>>>> On Nov 20, 2020, at 1:26 PM, Gustavo A. R. Silva  
>>>> wrote:
>>>> 
>>>> In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
>>>> warnings by explicitly adding a couple of break statements instead of
>>>> just letting the code fall through to the next case.
>>>> 
>>>> Link: https://github.com/KSPP/linux/issues/115
>>>> Signed-off-by: Gustavo A. R. Silva 
>>>> ---
>>>> fs/nfsd/nfs4state.c | 1 +
>>>> fs/nfsd/nfsctl.c| 1 +
>>>> 2 files changed, 2 insertions(+)
>>>> 
>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>> index d7f27ed6b794..cdab0d5be186 100644
>>>> --- a/fs/nfsd/nfs4state.c
>>>> +++ b/fs/nfsd/nfs4state.c
>>>> @@ -3113,6 +3113,7 @@ nfsd4_exchange_id(struct svc_rqst *rqstp, struct 
>>>> nfsd4_compound_state *cstate,
>>>>goto out_nolock;
>>>>}
>>>>new->cl_mach_cred = true;
>>>> +  break;
>>>>case SP4_NONE:
>>>>break;
>>>>default:/* checked by xdr code */
>>>> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
>>>> index f6d5d783f4a4..9a3bb1e217f9 100644
>>>> --- a/fs/nfsd/nfsctl.c
>>>> +++ b/fs/nfsd/nfsctl.c
>>>> @@ -1165,6 +1165,7 @@ static struct inode *nfsd_get_inode(struct 
>>>> super_block *sb, umode_t mode)
>>>>inode->i_fop = &simple_dir_operations;
>>>>inode->i_op = &simple_dir_inode_operations;
>>>>inc_nlink(inode);
>>>> +  break;
>>>>default:
>>>>break;
>>>>}
>>>> -- 
>>>> 2.27.0
>>>> 
>>> 
>>> Acked-by: Chuck Lever 
>> 
>> Thanks, Chuck.
>> --
>> Gustavo

--
Chuck Lever





Re: [PATCH] nfsd: remove unused function

2021-04-15 Thread Chuck Lever III



> On Apr 15, 2021, at 4:38 AM, Jiapeng Chong  
> wrote:
> 
> Fix the following clang warning:
> 
> fs/nfsd/nfs4state.c:6276:1: warning: unused function 'end_offset'
> [-Wunused-function].
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 

Thanks for your patch. It's been added to the for-next topic branch in

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git


> ---
> fs/nfsd/nfs4state.c | 9 -
> 1 file changed, 9 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 97447a6..32b11ff 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -6272,15 +6272,6 @@ static void nfsd4_close_open_stateid(struct 
> nfs4_ol_stateid *s)
>   return status;
> }
> 
> -static inline u64
> -end_offset(u64 start, u64 len)
> -{
> - u64 end;
> -
> - end = start + len;
> - return end >= start ? end: NFS4_MAX_UINT64;
> -}
> -
> /* last octet in a range */
> static inline u64
> last_byte_offset(u64 start, u64 len)
> -- 
> 1.8.3.1
> 

--
Chuck Lever





Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-09 Thread Chuck Lever III



> On Apr 9, 2021, at 10:26 AM, Tom Talpey  wrote:
> 
> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>  
>>> We need to get a better idea what correctness testing has been done,
>>> and whether positive correctness testing results can be replicated
>>> on a variety of platforms.
>> RO has been rolling out slowly on mlx5 over a few years and storage
>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>> turned on for a long time, userspace HPC applications have been using
>> it for a while now too.
> 
> I'd love to see RO be used more, it was always something the RDMA
> specs supported and carefully architected for. My only concern is
> that it's difficult to get right, especially when the platforms
> have been running strictly-ordered for so long. The ULPs need
> testing, and a lot of it.
> 
>> We know there are platforms with broken RO implementations (like
>> Haswell) but the kernel is supposed to globally turn off RO on all
>> those cases. I'd be a bit surprised if we discover any more from this
>> series.
>> On the other hand there are platforms that get huge speed ups from
>> turning this on, AMD is one example, there are a bunch in the ARM
>> world too.
> 
> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly). The RO pipeline will completely reorder
> DMA writes, and consumers which infer ordering from memory contents may
> break. This can even apply within the provider code, which may attempt
> to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


> The Mellanox adapter, itself, historically has strict in-order DMA
> semantics, and while it's great to relax that, changing it by default
> for all consumers is something to consider very cautiously.
> 
>> Still, obviously people should test on the platforms they have.
> 
> Yes, and "test" be taken seriously with focus on ULP data integrity.
> Speedups will mean nothing if the data is ever damaged.

I agree that data integrity comes first.

Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.


--
Chuck Lever





Re: [PATCH -next] NFSD: Use DEFINE_SPINLOCK() for spinlock

2021-04-06 Thread Chuck Lever III
Hello-

> On Apr 6, 2021, at 8:08 AM, Huang Guobin  wrote:
> 
> From: Guobin Huang 
> 
> spinlock can be initialized automatically with DEFINE_SPINLOCK()
> rather than explicitly calling spin_lock_init().
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Guobin Huang 

This change has been pushed to the for-next topic branch in

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git


> ---
> fs/nfsd/nfssvc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index b2eef4112bc2..82ba034fa579 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -84,7 +84,7 @@ DEFINE_MUTEX(nfsd_mutex);
>  * version 4.1 DRC caches.
>  * nfsd_drc_pages_used tracks the current version 4.1 DRC memory usage.
>  */
> -spinlock_t   nfsd_drc_lock;
> +DEFINE_SPINLOCK(nfsd_drc_lock);
> unsigned long nfsd_drc_max_mem;
> unsigned long nfsd_drc_mem_used;
> 
> @@ -563,7 +563,6 @@ static void set_max_drc(void)
>   nfsd_drc_max_mem = (nr_free_buffer_pages()
>   >> NFSD_DRC_SIZE_SHIFT) * PAGE_SIZE;
>   nfsd_drc_mem_used = 0;
> - spin_lock_init(&nfsd_drc_lock);
>   dprintk("%s nfsd_drc_max_mem %lu \n", __func__, nfsd_drc_max_mem);
> }
> 
> 

--
Chuck Lever





Re: [PATCH] sunrpc: Remove unused function ip_map_lookup

2021-04-06 Thread Chuck Lever III
Hello-

> On Apr 5, 2021, at 11:46 PM, Jiapeng Chong  
> wrote:
> 
> Fix the following clang warnings:
> 
> net/sunrpc/svcauth_unix.c:306:30: warning: unused function
> 'ip_map_lookup' [-Wunused-function].
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 

This has been pushed to the for-next topic branch in:

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git


> ---
> net/sunrpc/svcauth_unix.c | 9 -
> 1 file changed, 9 deletions(-)
> 
> diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
> index 97c0bdd..35b7966 100644
> --- a/net/sunrpc/svcauth_unix.c
> +++ b/net/sunrpc/svcauth_unix.c
> @@ -303,15 +303,6 @@ static struct ip_map *__ip_map_lookup(struct 
> cache_detail *cd, char *class,
>   return NULL;
> }
> 
> -static inline struct ip_map *ip_map_lookup(struct net *net, char *class,
> - struct in6_addr *addr)
> -{
> - struct sunrpc_net *sn;
> -
> - sn = net_generic(net, sunrpc_net_id);
> - return __ip_map_lookup(sn->ip_map_cache, class, addr);
> -}
> -
> static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
>   struct unix_domain *udom, time64_t expiry)
> {
> -- 
> 1.8.3.1
> 

--
Chuck Lever





Re: [PATCH -next] NFSD: Use DEFINE_SPINLOCK() for spinlock

2021-04-06 Thread Chuck Lever III



> On Apr 6, 2021, at 8:08 AM, Huang Guobin  wrote:
> 
> From: Guobin Huang 
> 
> spinlock can be initialized automatically with DEFINE_SPINLOCK()
> rather than explicitly calling spin_lock_init().
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Guobin Huang 

This same patch was sent last July, without the Reported-by:

https://lore.kernel.org/linux-nfs/1617710898-49064-1-git-send-email-huangguob...@huawei.com/T/#u

If I'm reading the code correctly, it appears that set_max_drc()
can be called more than once by user space API functions via
nfsd_create_serv().

It seems to me we want to guarantee that nfsd_drc_lock is
initialized exactly once, and using DEFINE_SPINLOCK() would
do just that.

Likewise, re-initializing nfsd_drc_mem_used is probably not good
to do either, but clobbering a spinlock like that might lead to
a crash.

Bruce, any further thoughts?
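
As a userspace analogy (pthreads rather than the kernel spinlock API,
purely illustrative) of the one-time-initialization point: a
statically initialized lock is set up exactly once, while re-running
the init call on a lock that may currently be held clobbers its state.

#include <pthread.h>
#include <stdio.h>

/* Plays the role of DEFINE_SPINLOCK(nfsd_drc_lock): initialized once, statically. */
static pthread_mutex_t drc_lock = PTHREAD_MUTEX_INITIALIZER;

static void set_max_drc_like(void)
{
	/*
	 * The old code re-ran spin_lock_init() on every call; the pthread
	 * equivalent, pthread_mutex_init(&drc_lock, NULL), is undefined if
	 * another thread holds the lock. With a static initializer there
	 * is nothing to redo.
	 */
	pthread_mutex_lock(&drc_lock);
	/* ... recompute DRC limits ... */
	pthread_mutex_unlock(&drc_lock);
}

int main(void)
{
	set_max_drc_like();	/* safe to call any number of times */
	set_max_drc_like();
	printf("ok\n");
	return 0;
}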


> ---
> fs/nfsd/nfssvc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index b2eef4112bc2..82ba034fa579 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -84,7 +84,7 @@ DEFINE_MUTEX(nfsd_mutex);
>  * version 4.1 DRC caches.
>  * nfsd_drc_pages_used tracks the current version 4.1 DRC memory usage.
>  */
> -spinlock_t   nfsd_drc_lock;
> +DEFINE_SPINLOCK(nfsd_drc_lock);
> unsigned long nfsd_drc_max_mem;
> unsigned long nfsd_drc_mem_used;
> 
> @@ -563,7 +563,6 @@ static void set_max_drc(void)
>   nfsd_drc_max_mem = (nr_free_buffer_pages()
>   >> NFSD_DRC_SIZE_SHIFT) * PAGE_SIZE;
>   nfsd_drc_mem_used = 0;
> - spin_lock_init(&nfsd_drc_lock);
>   dprintk("%s nfsd_drc_max_mem %lu \n", __func__, nfsd_drc_max_mem);
> }
> 
> 

--
Chuck Lever





Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Chuck Lever III



> On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe  wrote:
> 
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky 
>>> 
>>>> From Avihai,
>>> 
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>> 
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>> 
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>> 
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> I think it is a typo (or at least mit makes no sense to be talking
> about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> workload.

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

I have an old Haswell dual-socket system in my lab, but otherwise
I'm not sure I have a platform that would be interesting for such a
test.


> AMD dual socket systems are well known to benefit from relaxed
> ordering, people have been doing this in userspace for a while now
> with the opt in.


--
Chuck Lever





Re: [PATCH][next] UAPI: nfsfh.h: Replace one-element array with flexible-array member

2021-03-29 Thread Chuck Lever
Sorry for the reply via Gmail; the original patch did not show up in
my Oracle mailbox.

I've been waiting for a resolution of this thread (and perhaps a
Reviewed-by). But in
the meantime I've committed this, provisionally, to the for-next topic branch in

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

On Wed, Mar 24, 2021 at 4:39 AM Gustavo A. R. Silva
 wrote:
>
> There is a regular need in the kernel to provide a way to declare having
> a dynamically sized set of trailing elements in a structure. Kernel code
> should always use “flexible array members”[1] for these cases. The older
> style of one-element or zero-length arrays should no longer be used[2].
>
> Use an anonymous union with a couple of anonymous structs in order to
> keep userspace unchanged:
>
> $ pahole -C nfs_fhbase_new fs/nfsd/nfsfh.o
> struct nfs_fhbase_new {
> union {
> struct {
> __u8   fb_version_aux;   /* 0 1 */
> __u8   fb_auth_type_aux; /* 1 1 */
> __u8   fb_fsid_type_aux; /* 2 1 */
> __u8   fb_fileid_type_aux;   /* 3 1 */
> __u32  fb_auth[1];   /* 4 4 */
> };   /* 0 8 */
> struct {
> __u8   fb_version;   /* 0 1 */
> __u8   fb_auth_type; /* 1 1 */
> __u8   fb_fsid_type; /* 2 1 */
> __u8   fb_fileid_type;   /* 3 1 */
> __u32  fb_auth_flex[0];  /* 4 0 */
> };   /* 0 4 */
> };   /* 0 8 */
>
> /* size: 8, cachelines: 1, members: 1 */
> /* last cacheline: 8 bytes */
> };
>
> Also, this helps with the ongoing efforts to enable -Warray-bounds by
> fixing the following warnings:
>
> fs/nfsd/nfsfh.c: In function ‘nfsd_set_fh_dentry’:
> fs/nfsd/nfsfh.c:191:41: warning: array subscript 1 is above array bounds of 
> ‘__u32[1]’ {aka ‘unsigned int[1]’} [-Warray-bounds]
>   191 |ntohl((__force __be32)fh->fh_fsid[1])));
>   |  ~~~^~~
> ./include/linux/kdev_t.h:12:46: note: in definition of macro ‘MKDEV’
>12 | #define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))
>   |  ^~
> ./include/uapi/linux/byteorder/little_endian.h:40:26: note: in expansion of 
> macro ‘__swab32’
>40 | #define __be32_to_cpu(x) __swab32((__force __u32)(__be32)(x))
>   |  ^~~~
> ./include/linux/byteorder/generic.h:136:21: note: in expansion of macro 
> ‘__be32_to_cpu’
>   136 | #define ___ntohl(x) __be32_to_cpu(x)
>   | ^
> ./include/linux/byteorder/generic.h:140:18: note: in expansion of macro 
> ‘___ntohl’
>   140 | #define ntohl(x) ___ntohl(x)
>   |  ^~~~
> fs/nfsd/nfsfh.c:191:8: note: in expansion of macro ‘ntohl’
>   191 |ntohl((__force __be32)fh->fh_fsid[1])));
>   |^
> fs/nfsd/nfsfh.c:192:32: warning: array subscript 2 is above array bounds of 
> ‘__u32[1]’ {aka ‘unsigned int[1]’} [-Warray-bounds]
>   192 |fh->fh_fsid[1] = fh->fh_fsid[2];
>   | ~~~^~~
> fs/nfsd/nfsfh.c:192:15: warning: array subscript 1 is above array bounds of 
> ‘__u32[1]’ {aka ‘unsigned int[1]’} [-Warray-bounds]
>   192 |fh->fh_fsid[1] = fh->fh_fsid[2];
>   |~~~^~~
>
> [1] https://en.wikipedia.org/wiki/Flexible_array_member
> [2] 
> https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays
>
> Link: https://github.com/KSPP/linux/issues/79
> Link: https://github.com/KSPP/linux/issues/109
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  include/uapi/linux/nfsd/nfsfh.h | 27 +++
>  1 file changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/include/uapi/linux/nfsd/nfsfh.h b/include/uapi/linux/nfsd/nfsfh.h
> index ff0ca88b1c8f..427294dd56a1 100644
> --- a/include/uapi/linux/nfsd/nfsfh.h
> +++ b/include/uapi/linux/nfsd/nfsfh.h
> @@ -64,13 +64,24 @@ struct nfs_fhbase_old {
>   *   in include/linux/exportfs.h for currently registered values.
>   */
>  struct nfs_fhbase_new {
> -   __u8fb_version; /* == 1, even => nfs_fhbase_old */
> -   __u8fb_auth_type;
> -   __u8fb_fsid_type;
> -   __u8fb_fileid_type;
> -   __u32   fb_auth[1];
> -/* __u32   fb_fsid[0]; floating */
> -/* __u32   fb_fileid[0]; floating */
> +   union {
> +   struct {
> +   __u8fb_version_aux; /* == 1, even => 

Re: [PATCH 2/2] SUNRPC: Refresh rq_pages using a bulk page allocator

2021-03-23 Thread Chuck Lever III



> On Mar 23, 2021, at 3:56 PM, Mel Gorman  wrote:
> 
> On Tue, Mar 23, 2021 at 11:10:05AM -0400, Chuck Lever wrote:
>> Reduce the rate at which nfsd threads hammer on the page allocator.
>> This improves throughput scalability by enabling the threads to run
>> more independently of each other.
>> 
>> Signed-off-by: Chuck Lever 
> 
> I've picked up the series and merged the leader with the first patch
> because I think the array vs list data is interesting but I did change
> the patch.
> 
>> +for (;;) {
>> +filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
>> +rqstp->rq_pages);
>> +/* We assume that if the next array element is populated,
>> + * all the following elements are as well, thus we're done. */
>> +if (filled == pages || rqstp->rq_pages[filled])
>> +break;
>> +
> 
> I altered this check because the implementation now returns a useful
> index. I know I had concerns about this but while the implementation
> cost is higher, the caller needs less knowledge of alloc_bulk_pages
> implementation. It might be unfortunate if new users all had to have
> their own optimisations around hole management so lets keep it simpler
> to start with.

Agreed! Your version below looks like what I'm testing now --
the "rq_pages[filled]" test and the comment have been removed.


> Version current in my tree is below but also available in 
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> mm-bulk-rebase-v6r5
> 
> ---8<---
> SUNRPC: Refresh rq_pages using a bulk page allocator
> 
> From: Chuck Lever 
> 
> Reduce the rate at which nfsd threads hammer on the page allocator.
> This improves throughput scalability by enabling the threads to run
> more independently of each other.
> 
> [mgorman: Update interpretation of alloc_pages_bulk return value]
> Signed-off-by: Chuck Lever 
> Signed-off-by: Mel Gorman 
> ---
> net/sunrpc/svc_xprt.c | 31 +++
> 1 file changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 609bda97d4ae..0c27c3291ca1 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -643,30 +643,29 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> {
>   struct svc_serv *serv = rqstp->rq_server;
>   struct xdr_buf *arg = &rqstp->rq_arg;
> - int pages;
> - int i;
> + unsigned long pages, filled;
> 
> - /* now allocate needed pages.  If we get a failure, sleep briefly */
>   pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>   if (pages > RPCSVC_MAXPAGES) {
> - pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
> + pr_warn_once("svc: warning: pages=%lu > RPCSVC_MAXPAGES=%lu\n",
>pages, RPCSVC_MAXPAGES);
>   /* use as many pages as possible */
>   pages = RPCSVC_MAXPAGES;
>   }
> - for (i = 0; i < pages ; i++)
> - while (rqstp->rq_pages[i] == NULL) {
> - struct page *p = alloc_page(GFP_KERNEL);
> - if (!p) {
> - set_current_state(TASK_INTERRUPTIBLE);
> - if (signalled() || kthread_should_stop()) {
> - set_current_state(TASK_RUNNING);
> - return -EINTR;
> - }
> - schedule_timeout(msecs_to_jiffies(500));
> - }
> - rqstp->rq_pages[i] = p;
> +
> + for (;;) {
> + filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> + rqstp->rq_pages);
> + if (filled == pages)
> + break;
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> +     if (signalled() || kthread_should_stop()) {
> + set_current_state(TASK_RUNNING);
> + return -EINTR;
>   }
> + schedule_timeout(msecs_to_jiffies(500));
> + }
>   rqstp->rq_page_end = >rq_pages[pages];
>   rqstp->rq_pages[pages] = NULL; /* this might be seen in 
> nfsd_splice_actor() */
> 

--
Chuck Lever





Re: [PATCH 0/3 v5] Introduce a bulk order-0 page allocator

2021-03-23 Thread Chuck Lever III
_hole:
>   /* Attempt the batch allocation */
>   local_irq_save(flags);
>   pcp = &this_cpu_ptr(zone->pageset)->pcp;
>   pcp_list = &pcp->lists[ac.migratetype];
> 
>   while (nr_populated < nr_pages) {
> - /*
> -  * Stop allocating if the next index has a populated
> -  * page or the page will be prepared a second time when
> -  * IRQs are enabled.
> -  */
> +
> + /* Skip existing pages */
>   if (page_array && page_array[nr_populated]) {
> - hole = true;
>   nr_populated++;
> - break;
> + continue;
>   }
> 
>   page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> @@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>   __count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
>   zone_statistics(ac.preferred_zoneref->zone, zone);
> 
> + prep_new_page(page, 0, gfp, 0);
>   list_add(&page->lru, page_list);
>   list_add(>lru, page_list);
>   else
> @@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 
>   local_irq_restore(flags);
> 
> - /* Prep pages with IRQs enabled. */
> - if (page_list) {
> - list_for_each_entry(page, page_list, lru)
> - prep_new_page(page, 0, gfp, 0);
> - } else {
> - while (prep_index < nr_populated)
> - prep_new_page(page_array[prep_index++], 0, gfp, 0);
> -
> - /*
> -  * If the array is sparse, check whether the array is
> -  * now fully populated. Continue allocations if
> -  * necessary.
> -  */
> - while (nr_populated < nr_pages && page_array[nr_populated])
> - nr_populated++;
> - if (hole && nr_populated < nr_pages)
> - goto retry_hole;
> - }
> -
>   return nr_populated;
> 
> failed_irq:
> 
> -- 
> Mel Gorman
> SUSE Labs

--
Chuck Lever





[PATCH 2/2] SUNRPC: Refresh rq_pages using a bulk page allocator

2021-03-23 Thread Chuck Lever
Reduce the rate at which nfsd threads hammer on the page allocator.
This improves throughput scalability by enabling the threads to run
more independently of each other.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/svc_xprt.c |   33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 609bda97d4ae..d2792b2bf006 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -643,30 +643,31 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
struct svc_serv *serv = rqstp->rq_server;
	struct xdr_buf *arg = &rqstp->rq_arg;
-   int pages;
-   int i;
+   unsigned long pages, filled;
 
-   /* now allocate needed pages.  If we get a failure, sleep briefly */
pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
if (pages > RPCSVC_MAXPAGES) {
-   pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
+   pr_warn_once("svc: warning: pages=%lu > RPCSVC_MAXPAGES=%lu\n",
 pages, RPCSVC_MAXPAGES);
/* use as many pages as possible */
pages = RPCSVC_MAXPAGES;
}
-   for (i = 0; i < pages ; i++)
-   while (rqstp->rq_pages[i] == NULL) {
-   struct page *p = alloc_page(GFP_KERNEL);
-   if (!p) {
-   set_current_state(TASK_INTERRUPTIBLE);
-   if (signalled() || kthread_should_stop()) {
-   set_current_state(TASK_RUNNING);
-   return -EINTR;
-   }
-   schedule_timeout(msecs_to_jiffies(500));
-   }
-   rqstp->rq_pages[i] = p;
+
+   for (;;) {
+   filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
+   rqstp->rq_pages);
+   /* We assume that if the next array element is populated,
+* all the following elements are as well, thus we're done. */
+   if (filled == pages || rqstp->rq_pages[filled])
+   break;
+
+   set_current_state(TASK_INTERRUPTIBLE);
+   if (signalled() || kthread_should_stop()) {
+   set_current_state(TASK_RUNNING);
+   return -EINTR;
}
+   schedule_timeout(msecs_to_jiffies(500));
+   }
	rqstp->rq_page_end = &rqstp->rq_pages[pages];
rqstp->rq_pages[pages] = NULL; /* this might be seen in 
nfsd_splice_actor() */
 




[PATCH 1/2] SUNRPC: Set rq_page_end differently

2021-03-23 Thread Chuck Lever
Refactor:

I'm about to use the loop variable @i for something else.

As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.

Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit cf8208d0eabd ("sendfile: convert nfsd to
splice_direct_to_actor()").

Signed-off-by: Chuck Lever 
---
 net/sunrpc/svc_xprt.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 3cdd71a8df1e..609bda97d4ae 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,7 +642,7 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
struct svc_serv *serv = rqstp->rq_server;
-   struct xdr_buf *arg;
+   struct xdr_buf *arg = &rqstp->rq_arg;
int pages;
int i;
 
@@ -667,11 +667,10 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
}
rqstp->rq_pages[i] = p;
}
-   rqstp->rq_page_end = &rqstp->rq_pages[i];
-   rqstp->rq_pages[i++] = NULL; /* this might be seen in nfs_read_actor */
+   rqstp->rq_page_end = &rqstp->rq_pages[pages];
+   rqstp->rq_pages[pages] = NULL; /* this might be seen in 
nfsd_splice_actor() */
 
/* Make arg->head point to first page and arg->pages point to rest */
-   arg = &rqstp->rq_arg;
arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
arg->head[0].iov_len = PAGE_SIZE;
arg->pages = rqstp->rq_pages + 1;




[PATCH 0/2] SUNRPC consumer for the bulk page allocator

2021-03-23 Thread Chuck Lever
This patch set and the measurements below are based on yesterday's
bulk allocator series:

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9

The patches change SUNRPC to invoke the array-based bulk allocator
instead of alloc_page().

The micro-benchmark results are promising. I ran a mixture of 256KB
reads and writes over NFSv3. The server's kernel is built with KASAN
enabled, so the comparison is exaggerated but I believe it is still
valid.

I instrumented svc_recv() to measure the latency of each call to
svc_alloc_arg() and report it via a trace point. The following
results are averages across the trace events.

Single page: 25.007 us per call over 532,571 calls
Bulk list:    6.258 us per call over 517,034 calls
Bulk array:   4.590 us per call over 517,442 calls

For SUNRPC, the simplicity and better performance of the array-based
API make it superior to the list-based API.

---

Chuck Lever (2):
  SUNRPC: Set rq_page_end differently
  SUNRPC: Refresh rq_pages using a bulk page allocator


 net/sunrpc/svc_xprt.c | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

--
Chuck Lever



Re: [PATCH 0/3 v5] Introduce a bulk order-0 page allocator

2021-03-22 Thread Chuck Lever III



> On Mar 22, 2021, at 3:49 PM, Mel Gorman  wrote:
> 
> On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Mar 22, 2021, at 5:18 AM, Mel Gorman  wrote:
>>> 
>>> This series is based on top of Matthew Wilcox's series "Rationalise
>>> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
>>> test and are not using Andrew's tree as a baseline, I suggest using the
>>> following git tree
>>> 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
>>> mm-bulk-rebase-v5r9
>>> 
>>> The users of the API have been dropped in this version as the callers
>>> need to check whether they prefer an array or list interface (whether
>>> preference is based on convenience or performance).
>> 
>> I now have a consumer implementation that uses the array
>> API. If I understand the contract correctly, the return
>> value is the last array index that __alloc_pages_bulk()
>> visits. My consumer uses the return value to determine
>> if it needs to call the allocator again.
>> 
> 
> For either arrays or lists, the return value is the number of valid
> pages. For arrays, the pattern is expected to be
> 
> nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array);
> for (i = 0; i < nr_pages; i++) {
>   do something with page_array[i] 
> }
> 
> There *could* be populated valid elements on and after nr_pages but the
> implementation did not visit those elements. The implementation can abort
> early if the array looks like this
> 
> PPPPPP
> 
> Where P is a page and . is NULL. The implementation would skip the
> first three pages, allocate four pages and then abort when a new page
> was encountered. This is an implementation detail around how I handled
> prep_new_page. It could be addressed if many callers expect to pass in
> an array that has holes in the middle.
> 
>> It is returning some confusing (to me) results. I'd like
>> to get these resolved before posting any benchmark
>> results.
>> 
>> 1. When it has visited every array element, it returns the
>> same value as was passed in @nr_pages. That's the N + 1th
>> array element, which shouldn't be touched. Should the
>> allocator return nr_pages - 1 in the fully successful case?
>> Or should the documentation describe the return value as
>> "the number of elements visited" ?
>> 
> 
> I phrased it as "the known number of populated elements in the
> page_array".

The comment you added states:

+ * For lists, nr_pages is the number of pages that should be allocated.
+ *
+ * For arrays, only NULL elements are populated with pages and nr_pages
+ * is the maximum number of pages that will be stored in the array.
+ *
+ * Returns the number of pages added to the page_list or the index of the
+ * last known populated element of page_array.


> I did not want to write it as "the number of valid elements
> in the array" because that is not necessarily the case if an array is
> passed in with holes in the middle. I'm open to any suggestions on how
> the __alloc_pages_bulk description can be improved.

The comment states that, for the array case, a /count/ of
pages is passed in, and an /index/ is returned. If you want
to return the same type for lists and arrays, it should be
documented as a count in both cases, to match @nr_pages.
Consumers will want to compare @nr_pages with the return
value to see if they need to call again.

Comparing a count to an index is a notorious source of
off-by-one errors.


> The definition of the return value as-is makes sense for either a list
> or an array. Returning "nr_pages - 1" suits an array because it's the
> last valid index but it makes less sense when returning a list.
> 
>> 2. Frequently the allocator returns a number smaller than
>> the total number of elements. As you may recall, sunrpc
>> will delay a bit (via a call to schedule_timeout) then call
>> again. This is supposed to be a rare event, and the delay
>> is substantial. But with the array-based API, a not-fully-
>> successful allocator call seems to happen more than half
>> the time. Is that expected? I'm calling with GFP_KERNEL,
>> seems like the allocator should be trying harder.
>> 
> 
> It's not expected that the array implementation would be worse *unless*
> you are passing in arrays with holes in the middle. Otherwise, the success
> rate should be similar.

Essentially, sunrpc will always pass an array with a hole.
Each RPC consumes the first N elements in the rq_pages array.
Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
pass in an array with more than

Re: [PATCH 0/3 v5] Introduce a bulk order-0 page allocator

2021-03-22 Thread Chuck Lever III



> On Mar 22, 2021, at 5:18 AM, Mel Gorman  wrote:
> 
> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> mm-bulk-rebase-v5r9
> 
> The users of the API have been dropped in this version as the callers
> need to check whether they prefer an array or list interface (whether
> preference is based on convenience or performance).

I now have a consumer implementation that uses the array
API. If I understand the contract correctly, the return
value is the last array index that __alloc_pages_bulk()
visits. My consumer uses the return value to determine
if it needs to call the allocator again.

It is returning some confusing (to me) results. I'd like
to get these resolved before posting any benchmark
results.

1. When it has visited every array element, it returns the
same value as was passed in @nr_pages. That's the N + 1th
array element, which shouldn't be touched. Should the
allocator return nr_pages - 1 in the fully successful case?
Or should the documentation describe the return value as
"the number of elements visited" ?

2. Frequently the allocator returns a number smaller than
the total number of elements. As you may recall, sunrpc
will delay a bit (via a call to schedule_timeout) then call
again. This is supposed to be a rare event, and the delay
is substantial. But with the array-based API, a not-fully-
successful allocator call seems to happen more than half
the time. Is that expected? I'm calling with GFP_KERNEL,
seems like the allocator should be trying harder.

3. Is the current design intended so that if the consumer
does call again, is it supposed to pass in the array address
+ the returned index (and @nr_pages reduced by the returned
index) ?

Thanks for all your hard work, Mel.
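
As a toy, self-contained model (illustrative only, not the kernel
implementation) of the caller-side contract under discussion: NULL
slots in the array get filled, already-populated slots are skipped,
and the caller compares the returned count of populated entries
against the number it asked for to decide whether to retry.

#include <stdio.h>
#include <stdlib.h>

static unsigned long bulk_fill(unsigned long nr, void **array)
{
	unsigned long filled = 0;

	for (unsigned long i = 0; i < nr; i++) {
		if (!array[i])
			array[i] = malloc(64);	/* stand-in for one page */
		if (array[i])
			filled++;
	}
	return filled;
}

int main(void)
{
	void *pages[8] = { NULL };
	unsigned long want = 8, filled;

	pages[0] = malloc(64);	/* pretend one slot survived the previous request */

	/* The sunrpc caller sleeps between attempts; this loop just retries. */
	while ((filled = bulk_fill(want, pages)) < want)
		;

	printf("populated %lu of %lu slots\n", filled, want);
	for (unsigned long i = 0; i < want; i++)
		free(pages[i]);
	return 0;
}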


> Changelog since v4
> o Drop users of the API
> o Remove free_pages_bulk interface, no users
> o Add array interface
> o Allocate single page if watermark checks on local zones fail
> 
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
>  series, the mm patches have to be merged before the sunrpc and net
>  users.
> 
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
> 
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
> 
> This series introduces a bulk order-0 page allocator with the
> intent that sunrpc and the network page pool become the first users.
> The implementation is not particularly efficient and the intention is to
> iron out what the semantics of the API should have for users. Despite
> that, this is a performance-related enhancement for users that require
> multiple pages for an operation without multiple round-trips to the page
> allocator. Quoting the last patch for the prototype high-speed networking
> use-case.
> 
>For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
>redirecting xdp_frame packets into a veth, that does XDP_PASS to
>create an SKB from the xdp_frame, which then cannot return the page
>to the page_pool. In this case, we saw[1] an improvement of 18.8%
>from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> Both potential users in this series are corner cases (NFS and high-speed
> networks) so it is unlikely that most users will see any benefit in the
> short term. Other potential other users are batch allocations for page
> cache readahead, fault around and SLUB allocations when high-order pages
> are unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
> 
> Light testing passed, I'm relying on Chuck and Jesper to test their
> implementations, choose whether to use lists or arrays and document
> performance gains/losses in the changelogs.
> 
> Patch 1 renames a variable name that is particularly unpopular
> 
> Patch 2 adds a bulk page allocator
> 
> Patch 3 adds an array-based version of the bulk allocator
> 
> include/linux/gfp.h |  18 +
> mm/page_alloc.c | 171 ++--
> 2 files changed, 185 insertions(+), 4 deletions(-)
> 
> -- 
> 2.26.2
> 

--
Chuck Lever





Re: [PATCH] SUNRPC: Output oversized frag reclen as ASCII if printable

2021-03-22 Thread Chuck Lever III



> On Mar 19, 2021, at 6:08 PM, J. Bruce Fields  wrote:
> 
> On Fri, Mar 19, 2021 at 02:58:14PM +0000, Chuck Lever III wrote:
>> Hi Chris-
>> 
>>> On Mar 19, 2021, at 10:54 AM, Chris Down  wrote:
>>> 
>>> The reclen is taken directly from the first four bytes of the message
>>> with the highest bit stripped, which makes it ripe for protocol mixups.
>>> For example, if someone tries to send a HTTP GET request to us, we'll
>>> interpret it as a 1195725856-sized fragment (ie. (u32)'GET '), and print
>>> a ratelimited KERN_NOTICE with that number verbatim.
>>> 
>>> This can be confusing for downstream users, who don't know what messages
>>> like "fragment too large: 1195725856" actually mean, or that they
>>> indicate some misconfigured infrastructure elsewhere.
>> 
>> One wonders whether that error message is actually useful at all.
>> We could, for example, turn this into a tracepoint, or just get
>> rid of it.
> 
> Just going on vague memories here, but: I think we've seen both spurious
> and real bugs reported based on this.
> 
> I'm inclined to go with a dprintk or tracepoint but not removing it
> entirely.

Because this event can be chatty in some cases, I would prefer making
it a tracepoint rather than directing it to the log. Note also it would
be helpful if the server's net namespace and the client's IP address
and port were recorded.

Chris, there exists some boilerplate in fs/nfsd/trace.h to help with
the latter (just so you can see how to build the trace event definition;
you don't have to copy the macros to include/trace/events/sunrpc.h).


> --b.
> 
>> 
>> 
>>> To allow users to more easily understand and debug these cases, add the
>>> number interpreted as ASCII if all characters are printable:
>>> 
>>>   RPC: fragment too large: 1195725856 (ASCII "GET ")
>>> 
>>> If demand grows elsewhere, a new printk format that takes a number and
>>> outputs it in various formats is also a possible solution. For now, it
>>> seems reasonable to put this here since this particular code path is the
>>> one that has repeatedly come up in production.
>>> 
>>> Signed-off-by: Chris Down 
>>> Cc: Chuck Lever 
>>> Cc: J. Bruce Fields 
>>> Cc: Trond Myklebust 
>>> Cc: David S. Miller 
>>> ---
>>> net/sunrpc/svcsock.c | 39 +--
>>> 1 file changed, 37 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 2e2f007dfc9f..046b1d104340 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -46,6 +46,7 @@
>>> #include 
>>> #include 
>>> #include 
>>> +#include <linux/ctype.h>
>>> 
>>> #include 
>>> #include 
>>> @@ -863,6 +864,34 @@ static void svc_tcp_clear_pages(struct svc_sock *svsk)
>>> svsk->sk_datalen = 0;
>>> }
>>> 
>>> +/* The reclen is taken directly from the first four bytes of the message 
>>> with
>>> + * the highest bit stripped, which makes it ripe for protocol mixups. For
>>> + * example, if someone tries to send a HTTP GET request to us, we'll 
>>> interpret
>>> + * it as a 1195725856-sized fragment (ie. (u32)'GET '), and print a 
>>> ratelimited
>>> + * KERN_NOTICE with that number verbatim.
>>> + *
>>> + * To allow users to more easily understand and debug these cases, this
>>> + * function decodes the purported length as ASCII, and returns it if all
>>> + * characters were printable. Otherwise, we return NULL.
>>> + *
>>> + * WARNING: Since we reuse the u32 directly, the return value is not null
>>> + * terminated, and must be printed using %.*s with
>>> + * sizeof(svc_sock_reclen(svsk)).
>>> + */
>>> +static char *svc_sock_reclen_ascii(struct svc_sock *svsk)
>>> +{
>>> +   u32 len_be = cpu_to_be32(svc_sock_reclen(svsk));
>>> +   char *len_be_ascii = (char *)&len_be;
>>> +   size_t i;
>>> +
>>> +   for (i = 0; i < sizeof(len_be); i++) {
>>> +   if (!isprint(len_be_ascii[i]))
>>> +   return NULL;
>>> +   }
>>> +
>>> +   return len_be_ascii;
>>> +}
>>> +
>>> /*
>>> * Receive fragment record header into sk_marker.
>>> */
>>> @@ -870,6 +899,7 @@ static ssize_t svc_tcp_read_marker(struct svc_sock 
>>> *svsk,
>>>  

Re: [PATCH] SUNRPC: Output oversized frag reclen as ASCII if printable

2021-03-19 Thread Chuck Lever III
Hi Chris-

> On Mar 19, 2021, at 10:54 AM, Chris Down  wrote:
> 
> The reclen is taken directly from the first four bytes of the message
> with the highest bit stripped, which makes it ripe for protocol mixups.
> For example, if someone tries to send a HTTP GET request to us, we'll
> interpret it as a 1195725856-sized fragment (ie. (u32)'GET '), and print
> a ratelimited KERN_NOTICE with that number verbatim.
> 
> This can be confusing for downstream users, who don't know what messages
> like "fragment too large: 1195725856" actually mean, or that they
> indicate some misconfigured infrastructure elsewhere.

One wonders whether that error message is actually useful at all.
We could, for example, turn this into a tracepoint, or just get
rid of it.


> To allow users to more easily understand and debug these cases, add the
> number interpreted as ASCII if all characters are printable:
> 
>RPC: fragment too large: 1195725856 (ASCII "GET ")
> 
> If demand grows elsewhere, a new printk format that takes a number and
> outputs it in various formats is also a possible solution. For now, it
> seems reasonable to put this here since this particular code path is the
> one that has repeatedly come up in production.
> 
> Signed-off-by: Chris Down 
> Cc: Chuck Lever 
> Cc: J. Bruce Fields 
> Cc: Trond Myklebust 
> Cc: David S. Miller 
> ---
> net/sunrpc/svcsock.c | 39 +--
> 1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 2e2f007dfc9f..046b1d104340 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -46,6 +46,7 @@
> #include 
> #include 
> #include 
> +#include <linux/ctype.h>
> 
> #include 
> #include 
> @@ -863,6 +864,34 @@ static void svc_tcp_clear_pages(struct svc_sock *svsk)
>   svsk->sk_datalen = 0;
> }
> 
> +/* The reclen is taken directly from the first four bytes of the message with
> + * the highest bit stripped, which makes it ripe for protocol mixups. For
> + * example, if someone tries to send a HTTP GET request to us, we'll 
> interpret
> + * it as a 1195725856-sized fragment (ie. (u32)'GET '), and print a 
> ratelimited
> + * KERN_NOTICE with that number verbatim.
> + *
> + * To allow users to more easily understand and debug these cases, this
> + * function decodes the purported length as ASCII, and returns it if all
> + * characters were printable. Otherwise, we return NULL.
> + *
> + * WARNING: Since we reuse the u32 directly, the return value is not null
> + * terminated, and must be printed using %.*s with
> + * sizeof(svc_sock_reclen(svsk)).
> + */
> +static char *svc_sock_reclen_ascii(struct svc_sock *svsk)
> +{
> + u32 len_be = cpu_to_be32(svc_sock_reclen(svsk));
> + char *len_be_ascii = (char *)&len_be;
> + size_t i;
> +
> + for (i = 0; i < sizeof(len_be); i++) {
> + if (!isprint(len_be_ascii[i]))
> + return NULL;
> + }
> +
> + return len_be_ascii;
> +}
> +
> /*
>  * Receive fragment record header into sk_marker.
>  */
> @@ -870,6 +899,7 @@ static ssize_t svc_tcp_read_marker(struct svc_sock *svsk,
>  struct svc_rqst *rqstp)
> {
>   ssize_t want, len;
> + char *reclen_ascii;
> 
>   /* If we haven't gotten the record length yet,
>* get the next four bytes.
> @@ -898,9 +928,14 @@ static ssize_t svc_tcp_read_marker(struct svc_sock *svsk,
>   return svc_sock_reclen(svsk);
> 
> err_too_large:
> - net_notice_ratelimited("svc: %s %s RPC fragment too large: %d\n",
> + reclen_ascii = svc_sock_reclen_ascii(svsk);
> + net_notice_ratelimited("svc: %s %s RPC fragment too large: %d%s%.*s%s\n",
>  __func__, svsk->sk_xprt.xpt_server->sv_name,
> -svc_sock_reclen(svsk));
> +svc_sock_reclen(svsk),
> +reclen_ascii ? " (ASCII \"" : "",
> +(int)sizeof(u32),
> +reclen_ascii ?: "",
> +reclen_ascii ? "\")" : "");
>   set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
> err_short:
>   return -EAGAIN;
> -- 
> 2.30.2
> 

--
Chuck Lever





Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-14 Thread Chuck Lever III



> On Mar 14, 2021, at 8:52 AM, Mel Gorman  wrote:
> 
> On Sat, Mar 13, 2021 at 07:33:43PM +, Matthew Wilcox wrote:
>> On Sat, Mar 13, 2021 at 04:56:31PM +0000, Chuck Lever III wrote:
>>> IME lists are indeed less CPU-efficient, but I wonder if that
>>> expense is insignificant compared to serialization primitives like
>>> disabling and re-enabling IRQs, which we are avoiding by using
>>> bulk page allocation.
>> 
>> Cache misses are a worse problem than serialisation.  Paul McKenney had
>> a neat demonstration where he took a sheet of toilet paper to represent
>> an instruction, and then unrolled two rolls of toilet paper around the
>> lecture theatre to represent an L3 cache miss.  Obviously a serialising
>> instruction is worse than an add instruction, but i'm thinking maybe
>> 50-100 sheets of paper, not an entire roll?
>> 
> 
> I'm well aware of the advantages of arrays over lists. The reality is that
> the penalty is incurred unconditionally as the pages have to be removed
> from the per-cpu or buddy lists and the cache footprint of the allocator
> and the data copies are already large. It's also the case that bulk free
> interfaces already exist that operate on lists (free_unref_page_list)
> so there is existing precedent. The bulk free API in this series was not
> used by the callers so I've deleted it.
> 
> Obviously the callers would need to be adjusted to use the array
> interface. The sunrpc user has an array but it is coded in a way that
> expects the array could be partially populated or has holes so the API has
> to skip populated elements. The caller is responsible for making sure that
> there are enough NULL elements available to store nr_pages or the buffer
> overruns. nr_elements could be passed in to avoid the buffer overrun but
> then further logic is needed to distinguish between a failed allocation
> and a failure to have enough space in the array to store the pointer.
> It also means that prep_new_page() should not be deferred outside of
> the IRQ disabled section as it does not have the storage to track which
> pages were freshly allocated and which ones were already on the array. It
> could be tracked using the lower bit of the pointer but that is not free
> either. Ideally the callers simply would ensure the array does not have
> valid struct page pointers in it already so prepping the new page could
> always be deferred.  Obviously the callers are also responsible for
> ensuring protecting the array from parallel access if necessary while
> calling into the allocator.
> 
>> Anyway, I'm not arguing against a bulk allocator, nor even saying this
>> is a bad interface.  It just maybe could be better.
>> 
> 
> I think it puts more responsibility on the caller to use the API correctly
> but I also see no value in arguing about it further because there is no
> supporting data either way (I don't have routine access to a sufficiently
> fast network to generate the data). I can add the following patch and let
> callers figure out which interface is preferred. If one of the interfaces
> is dead in a year, it can be removed.
> 
> As there are a couple of ways the arrays could be used, I'm leaving it
> up to Jesper and Chuck which interface they want to use. In particular,
> it would be preferred if the array has no valid struct pages in it but
> it's up to them to judge how practical that is.

I'm interested to hear from Jesper.

My two cents (US):

If svc_alloc_arg() is the /only/ consumer that wants to fill
a partially populated array of page pointers, then there's no
code-duplication benefit to changing the synopsis of
alloc_pages_bulk() at this point.

Also, if the consumers still have to pass in the number of
pages the array needs, rather than having the bulk allocator
figure it out, then there's not much additional benefit, IMO.

Ideally (for SUNRPC) alloc_pages_bulk() would take a pointer
to a sparsely-populated array and the total number of elements
in that array, and fill in the NULL elements. The return value
would be "success -- all elements are populated" or "failure --
some elements remain NULL".

But again, if no other consumer finds that useful, or that API
design obscures the performance benefits of the bulk allocator,
I can be very happy with the list-centric API. My interest in
this part of the exercise is simply to reduce the overall amount
of new complexity across mm/ and consumers of the bulk allocator.
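
Purely for illustration, the calling convention I'm picturing is
something like the following. The function name and exact signature
are hypothetical, not a proposal for the final API:

	/* Sketch: fill any NULL slots in page_array, which has
	 * nr_elements entries in total. Returns true only when
	 * every slot ends up populated. */
	bool alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_elements,
				    struct page **page_array);

	/* sunrpc caller side, roughly: */
	while (!alloc_pages_bulk_array(GFP_KERNEL, pages, rqstp->rq_pages)) {
		/* sleep briefly and retry, as svc_alloc_arg() does today */
	}

That would let svc_alloc_arg() drop its own NULL-slot bookkeeping
entirely.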


> Patch is only lightly tested with a poor conversion of the sunrpc code
> to use the array interface.
> 
> ---8<---
> mm/page_alloc: Add an array-based interface to the bulk page allocator
> 
> The existing callers for the bulk allocator are storing the pages in
> arrays. This patc

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-13 Thread Chuck Lever III



> On Mar 13, 2021, at 11:39 AM, Matthew Wilcox  wrote:
> 
> On Sat, Mar 13, 2021 at 01:16:48PM +, Mel Gorman wrote:
>>> I'm not claiming the pagevec is definitely a win, but it's very
>>> unclear which tradeoff is actually going to lead to better performance.
>>> Hopefully Jesper or Chuck can do some tests and figure out what actually
>>> works better with their hardware & usage patterns.
>> 
>> The NFS user is often going to need to make round trips to get the pages it
>> needs. The pagevec would have to be copied into the target array meaning
>> it's not much better than a list manipulation.
> 
> I don't think you fully realise how bad CPUs are at list manipulation.
> See the attached program (and run it on your own hardware).  On my
> less-than-a-year-old core-i7:
> 
> $ gcc -W -Wall -O2 -g array-vs-list.c -o array-vs-list
> $ ./array-vs-list 
> walked sequential array in 0.001765s
> walked sequential list in 0.002920s
> walked sequential array in 0.001777s
> walked shuffled list in 0.081955s
> walked shuffled array in 0.007367s
> 
> If you happen to get the objects in-order, it's only 64% worse to walk
> a list as an array.  If they're out of order, it's *11.1* times as bad.
> 

IME lists are indeed less CPU-efficient, but I wonder if that
expense is insignificant compared to serialization primitives like
disabling and re-enabling IRQs, which we are avoiding by using
bulk page allocation.

My initial experience with the current interface left me feeling
uneasy about re-using the lru list field. That seems to expose an
internal API feature to consumers of the page allocator. If we
continue with a list-centric bulk allocator API I hope there can
be some conveniently-placed documentation that explains when it is
safe to use that field. Or perhaps the field should be renamed.

I have a mild preference for an array-style interface because that's
more natural for the NFSD consumer, but I'm happy to have a bulk
allocator either way. Purely from a code-reuse point of view, I
wonder how many consumers of alloc_pages_bulk() will be like
svc_alloc_arg(), where they need to fill in pages in an array. Each
such consumer would need to repeat the logic to convert the returned
list into an array. We have, for instance, release_pages(), which is
an array-centric page allocator API. Maybe a helper function or two
might prevent duplication of the list conversion logic.

And I agree with Mel that passing a single large array seems more
useful then having to build code at each consumer call-site to
iterate over smaller page_vecs until that array is filled.
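
For instance, a small helper along these lines (the name is invented
here, and this is untested) could live next to the allocator so each
array-based consumer doesn't open-code the same conversion:

	static inline unsigned int
	alloc_bulk_list_to_array(struct list_head *list, struct page **array,
				 unsigned int count)
	{
		struct page *page, *next;
		unsigned int filled = 0;

		list_for_each_entry_safe(page, next, list, lru) {
			if (filled >= count)
				break;
			/* drop the allocator's list linkage before handing
			 * the page to the caller's array */
			list_del(&page->lru);
			array[filled++] = page;
		}
		return filled;	/* number of array slots populated */
	}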


--
Chuck Lever





[PATCH] SUNRPC: Refresh rq_pages using a bulk page allocator

2021-03-12 Thread Chuck Lever
Reduce the rate at which nfsd threads hammer on the page allocator.
This improves throughput scalability by enabling the threads to run
more independently of each other.

Signed-off-by: Chuck Lever 
---
Hi Mel-

This patch replaces patch 5/7 in v4 of your alloc_pages_bulk()
series. It implements code clean-ups suggested by Alexander Duyck.
It builds and has seen some light testing.


 net/sunrpc/svc_xprt.c |   39 +++
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4d58424db009..791ea24159b1 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -661,11 +661,13 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
struct svc_serv *serv = rqstp->rq_server;
+   unsigned long needed;
struct xdr_buf *arg;
+   struct page *page;
+   LIST_HEAD(list);
int pages;
int i;
 
-   /* now allocate needed pages.  If we get a failure, sleep briefly */
pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
if (pages > RPCSVC_MAXPAGES) {
pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
@@ -673,19 +675,32 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
/* use as many pages as possible */
pages = RPCSVC_MAXPAGES;
}
-   for (i = 0; i < pages ; i++)
-   while (rqstp->rq_pages[i] == NULL) {
-   struct page *p = alloc_page(GFP_KERNEL);
-   if (!p) {
-   set_current_state(TASK_INTERRUPTIBLE);
-   if (signalled() || kthread_should_stop()) {
-   set_current_state(TASK_RUNNING);
-   return -EINTR;
-   }
-   schedule_timeout(msecs_to_jiffies(500));
+
+   for (needed = 0, i = 0; i < pages ; i++) {
+   if (!rqstp->rq_pages[i])
+   needed++;
+   }
+   i = 0;
+   while (needed) {
+   needed -= alloc_pages_bulk(GFP_KERNEL, 0, needed, &list);
+   for (; i < pages; i++) {
+   if (rqstp->rq_pages[i])
+   continue;
+   page = list_first_entry_or_null(&list, struct page, lru);
+   if (likely(page)) {
+   list_del(&page->lru);
+   rqstp->rq_pages[i] = page;
+   continue;
}
-   rqstp->rq_pages[i] = p;
+   set_current_state(TASK_INTERRUPTIBLE);
+   if (signalled() || kthread_should_stop()) {
+   set_current_state(TASK_RUNNING);
+   return -EINTR;
+   }
+   schedule_timeout(msecs_to_jiffies(500));
+   break;
}
+   }
rqstp->rq_page_end = &rqstp->rq_pages[pages];
rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 




Re: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator

2021-03-12 Thread Chuck Lever III
Mel, I can send you a tidied and tested update to this patch,
or you can drop the two NFSD patches and I can submit them via
the NFSD tree when alloc_pages_bulk() is merged.

> On Mar 12, 2021, at 1:44 PM, Alexander Duyck  
> wrote:
> 
> On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman  
> wrote:
>> 
>> From: Chuck Lever 
>> 
>> Reduce the rate at which nfsd threads hammer on the page allocator.
>> This improves throughput scalability by enabling the threads to run
>> more independently of each other.
>> 
>> Signed-off-by: Chuck Lever 
>> Signed-off-by: Mel Gorman 
>> ---
>> net/sunrpc/svc_xprt.c | 43 +++
>> 1 file changed, 31 insertions(+), 12 deletions(-)
>> 
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index cfa7e4776d0e..38a8d6283801 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv 
>> *serv)
>> static int svc_alloc_arg(struct svc_rqst *rqstp)
>> {
>>struct svc_serv *serv = rqstp->rq_server;
>> +   unsigned long needed;
>>struct xdr_buf *arg;
>> +   struct page *page;
>>int pages;
>>int i;
>> 
>> -   /* now allocate needed pages.  If we get a failure, sleep briefly */
>>pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>if (pages > RPCSVC_MAXPAGES) {
>>pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
>> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>/* use as many pages as possible */
>>pages = RPCSVC_MAXPAGES;
>>}
>> -   for (i = 0; i < pages ; i++)
>> -   while (rqstp->rq_pages[i] == NULL) {
>> -   struct page *p = alloc_page(GFP_KERNEL);
>> -   if (!p) {
>> -   set_current_state(TASK_INTERRUPTIBLE);
>> -   if (signalled() || kthread_should_stop()) {
>> -   set_current_state(TASK_RUNNING);
>> -   return -EINTR;
>> -   }
>> -   schedule_timeout(msecs_to_jiffies(500));
>> +
> 
>> +   for (needed = 0, i = 0; i < pages ; i++)
>> +   if (!rqstp->rq_pages[i])
>> +   needed++;
> 
> I would use an opening and closing braces for the for loop since
> technically the if is a multiline statement. It will make this more
> readable.
> 
>> +   if (needed) {
>> +   LIST_HEAD(list);
>> +
>> +retry:
> 
> Rather than kind of open code a while loop why not just make this
> "while (needed)"? Then all you have to do is break out of the for loop
> and you will automatically return here instead of having to jump to
> two different labels.
> 
>> +   alloc_pages_bulk(GFP_KERNEL, needed, &list);
> 
> Rather than not using the return value would it make sense here to
> perhaps subtract it from needed? Then you would know if any of the
> allocation requests weren't fulfilled.
> 
>> +   for (i = 0; i < pages; i++) {
> 
> It is probably optimizing for the exception case, but I don't think
> you want the "i = 0" here. If you are having to stop because the list
> is empty it probably makes sense to resume where you left off. So you
> should probably be initializing i to 0 before we check for needed.
> 
>> +   if (!rqstp->rq_pages[i]) {
> 
> It might be cleaner here to just do a "continue" if rq_pages[i] is populated.
> 
>> +   page = list_first_entry_or_null(&list,
>> +   struct page,
>> +   lru);
>> +   if (unlikely(!page))
>> +   goto empty_list;
> 
> I think I preferred the original code that wasn't jumping away from
> the loop here. With the change I suggested above that would switch the
> if(needed) to while(needed) you could have it just break out of the
> for loop to place itself back in the while loop.
> 
>> +   list_del(&page->lru);
>> +   rqstp->rq_pages[i] = page;
>> +   needed--;
>>}
>> -

Re: [PATCH v25 17/25] LSM: Use lsmcontext in security_inode_getsecctx

2021-03-12 Thread Chuck Lever III
Hi Casey-

> On Mar 9, 2021, at 9:42 AM, Casey Schaufler  wrote:
> 
> Change the security_inode_getsecctx() interface to fill
> a lsmcontext structure instead of data and length pointers.
> This provides the information about which LSM created the
> context so that security_release_secctx() can use the
> correct hook.
> 
> Acked-by: Stephen Smalley 
> Acked-by: Paul Moore 
> Reviewed-by: John Johansen 
> Signed-off-by: Casey Schaufler 
> Cc: linux-...@vger.kernel.org

For the NFSD hunks in 15/25 and 17/25:

Acked-by: Chuck Lever 


> ---
> fs/nfsd/nfs4xdr.c| 23 +--
> include/linux/security.h |  5 +++--
> security/security.c  | 13 +++--
> 3 files changed, 23 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index afccc4f257d0..a796268ec757 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2727,11 +2727,11 @@ nfsd4_encode_layout_types(struct xdr_stream *xdr, u32 
> layout_types)
> #ifdef CONFIG_NFSD_V4_SECURITY_LABEL
> static inline __be32
> nfsd4_encode_security_label(struct xdr_stream *xdr, struct svc_rqst *rqstp,
> - void *context, int len)
> + struct lsmcontext *context)
> {
>   __be32 *p;
> 
> - p = xdr_reserve_space(xdr, len + 4 + 4 + 4);
> + p = xdr_reserve_space(xdr, context->len + 4 + 4 + 4);
>   if (!p)
>   return nfserr_resource;
> 
> @@ -2741,13 +2741,13 @@ nfsd4_encode_security_label(struct xdr_stream *xdr, 
> struct svc_rqst *rqstp,
>*/
>   *p++ = cpu_to_be32(0); /* lfs */
>   *p++ = cpu_to_be32(0); /* pi */
> - p = xdr_encode_opaque(p, context, len);
> + p = xdr_encode_opaque(p, context->context, context->len);
>   return 0;
> }
> #else
> static inline __be32
> nfsd4_encode_security_label(struct xdr_stream *xdr, struct svc_rqst *rqstp,
> - void *context, int len)
> + struct lsmcontext *context)
> { return 0; }
> #endif
> 
> @@ -2844,9 +2844,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct 
> svc_fh *fhp,
>   int err;
>   struct nfs4_acl *acl = NULL;
> #ifdef CONFIG_NFSD_V4_SECURITY_LABEL
> - struct lsmcontext scaff; /* scaffolding */
> - void *context = NULL;
> - int contextlen;
> + struct lsmcontext context = { };
> #endif
>   bool contextsupport = false;
>   struct nfsd4_compoundres *resp = rqstp->rq_resp;
> @@ -2904,7 +2902,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct 
> svc_fh *fhp,
>bmval0 & FATTR4_WORD0_SUPPORTED_ATTRS) {
>   if (exp->ex_flags & NFSEXP_SECURITY_LABEL)
>   err = security_inode_getsecctx(d_inode(dentry),
> - &context, &contextlen);
> + &context);
>   else
>   err = -EOPNOTSUPP;
>   contextsupport = (err == 0);
> @@ -3324,8 +3322,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct 
> svc_fh *fhp,
> 
> #ifdef CONFIG_NFSD_V4_SECURITY_LABEL
>   if (bmval2 & FATTR4_WORD2_SECURITY_LABEL) {
> - status = nfsd4_encode_security_label(xdr, rqstp, context,
> - contextlen);
> + status = nfsd4_encode_security_label(xdr, rqstp, &context);
>   if (status)
>   goto out;
>   }
> @@ -3346,10 +3343,8 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct 
> svc_fh *fhp,
> 
> out:
> #ifdef CONFIG_NFSD_V4_SECURITY_LABEL
> - if (context) {
> - lsmcontext_init(&scaff, context, contextlen, 0); /*scaffolding*/
> - security_release_secctx(&scaff);
> - }
> + if (context.context)
> + security_release_secctx(&context);
> #endif /* CONFIG_NFSD_V4_SECURITY_LABEL */
>   kfree(acl);
>   if (tempfh) {
> diff --git a/include/linux/security.h b/include/linux/security.h
> index d0e1b6ba330d..9dcc910036f4 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -582,7 +582,7 @@ void security_release_secctx(struct lsmcontext *cp);
> void security_inode_invalidate_secctx(struct inode *inode);
> int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
> int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
> -int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
> +int security_inode_getsecctx(struct inode *inode, struct lsmcontext *cp);
> int security_locked_down(enum lockdown_reason what);
> #else /* CONFIG_SECURITY */
> 
> @@ -1442,7 +1442,8 @@ static inline int sec

Re: [PATCH 3/5] SUNRPC: Refresh rq_pages using a bulk page allocator

2021-03-11 Thread Chuck Lever III



> On Mar 11, 2021, at 6:49 AM, Mel Gorman  wrote:
> 
> From: Chuck Lever 
> 
> Reduce the rate at which nfsd threads hammer on the page allocator.
> This improve throughput scalability by enabling the threads to run
> more independently of each other.

Mel, if you should repost this series: ^improve^improves


> Signed-off-by: Chuck Lever 
> Signed-off-by: Mel Gorman 
> ---
> net/sunrpc/svc_xprt.c | 43 +++
> 1 file changed, 31 insertions(+), 12 deletions(-)
> 
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index cfa7e4776d0e..38a8d6283801 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
> static int svc_alloc_arg(struct svc_rqst *rqstp)
> {
>   struct svc_serv *serv = rqstp->rq_server;
> + unsigned long needed;
>   struct xdr_buf *arg;
> + struct page *page;
>   int pages;
>   int i;
> 
> - /* now allocate needed pages.  If we get a failure, sleep briefly */
>   pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>   if (pages > RPCSVC_MAXPAGES) {
>   pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>   /* use as many pages as possible */
>   pages = RPCSVC_MAXPAGES;
>   }
> - for (i = 0; i < pages ; i++)
> - while (rqstp->rq_pages[i] == NULL) {
> - struct page *p = alloc_page(GFP_KERNEL);
> - if (!p) {
> - set_current_state(TASK_INTERRUPTIBLE);
> - if (signalled() || kthread_should_stop()) {
> - set_current_state(TASK_RUNNING);
> - return -EINTR;
> - }
> - schedule_timeout(msecs_to_jiffies(500));
> +
> + for (needed = 0, i = 0; i < pages ; i++)
> + if (!rqstp->rq_pages[i])
> + needed++;
> + if (needed) {
> + LIST_HEAD(list);
> +
> +retry:
> + alloc_pages_bulk(GFP_KERNEL, needed, &list);
> + for (i = 0; i < pages; i++) {
> + if (!rqstp->rq_pages[i]) {
> + page = list_first_entry_or_null(&list,
> + struct page,
> + lru);
> + if (unlikely(!page))
> + goto empty_list;
> + list_del(&page->lru);
> + rqstp->rq_pages[i] = page;
> + needed--;
>   }
> - rqstp->rq_pages[i] = p;
>   }
> + }
>   rqstp->rq_page_end = &rqstp->rq_pages[pages];
>   rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
> 
> @@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>   arg->len = (pages-1)*PAGE_SIZE;
>   arg->tail[0].iov_len = 0;
>   return 0;
> +
> +empty_list:
> +     set_current_state(TASK_INTERRUPTIBLE);
> + if (signalled() || kthread_should_stop()) {
> + set_current_state(TASK_RUNNING);
> + return -EINTR;
> + }
> + schedule_timeout(msecs_to_jiffies(500));
> + goto retry;
> }
> 
> static bool
> -- 
> 2.26.2
> 

--
Chuck Lever





Re: [PATCH 07/20] SUNRPC: Manual replacement of the deprecated strlcpy() with return values

2021-03-01 Thread Chuck Lever



> On Feb 22, 2021, at 10:12 AM, Romain Perier  wrote:
> 
> The strlcpy() reads the entire source buffer first, it is dangerous if
> the source buffer length is unbounded or possibly not NUL-terminated.
> It can lead to linear read overflows, crashes, etc...
> 
> As recommended in the deprecated interfaces [1], it should be replaced
> by strscpy.
> 
> This commit replaces all calls to strlcpy that handle the return values
> by the corresponding strscpy calls with new handling of the return
> values (as it is quite different between the two functions).
> 
> [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
> 
> Signed-off-by: Romain Perier 

Hi Romain-

I assume you are waiting for a maintainer's Ack? IMHO Trond or Anna
should provide it for changes to this particular source file.


> ---
> net/sunrpc/clnt.c |6 +-
> 1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 612f0a641f4c..3c5c4ad8a808 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -282,7 +282,7 @@ static struct rpc_xprt *rpc_clnt_set_transport(struct 
> rpc_clnt *clnt,
> 
> static void rpc_clnt_set_nodename(struct rpc_clnt *clnt, const char *nodename)
> {
> - clnt->cl_nodelen = strlcpy(clnt->cl_nodename,
> + clnt->cl_nodelen = strscpy(clnt->cl_nodename,
>   nodename, sizeof(clnt->cl_nodename));
> }
> 
> @@ -422,6 +422,10 @@ static struct rpc_clnt * rpc_new_client(const struct 
> rpc_create_args *args,
>   nodename = utsname()->nodename;
>   /* save the nodename */
>   rpc_clnt_set_nodename(clnt, nodename);
> + if (clnt->cl_nodelen == -E2BIG) {
> + err = -ENOMEM;
> +     goto out_no_path;
> + }
> 
>   err = rpc_client_register(clnt, args->authflavor, args->client_name);
>   if (err)
> 

--
Chuck Lever





Re: possible deadlock in ipv6_sock_mc_close

2021-03-01 Thread Chuck Lever



> On Mar 1, 2021, at 8:49 AM, syzbot 
>  wrote:
> 
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:eee7ede6 Merge branch 'bnxt_en-error-recovery-bug-fixes'
> git tree:   net
> console output: https://syzkaller.appspot.com/x/log.txt?x=123ad632d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e2d5ba72abae4f14
> dashboard link: https://syzkaller.appspot.com/bug?extid=e2fa57709a385e6db10f
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=109d89b6d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=12e9e0dad0
> 
> The issue was bisected to:
> 
> commit c8e88e3aa73889421461f878cd569ef84f231ceb
> Author: Chuck Lever 
> Date:   Tue Nov 3 20:06:04 2020 +
> 
>NFSD: Replace READ* macros in nfsd4_decode_layoutget()
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13bef9ccd0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=107ef9ccd0
> console output: https://syzkaller.appspot.com/x/log.txt?x=17bef9ccd0
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+e2fa57709a385e6db...@syzkaller.appspotmail.com
> Fixes: c8e88e3aa738 ("NFSD: Replace READ* macros in nfsd4_decode_layoutget()")
> 
> ==
> WARNING: possible circular locking dependency detected
> 5.11.0-syzkaller #0 Not tainted
> --
> syz-executor905/8822 is trying to acquire lock:
> 8d678fe8 (rtnl_mutex){+.+.}-{3:3}, at: ipv6_sock_mc_close+0xd7/0x110 
> net/ipv6/mcast.c:323
> 
> but task is already holding lock:
> 888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock 
> include/net/sock.h:1600 [inline]
> 888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: 
> mptcp6_release+0x57/0x130 net/mptcp/protocol.c:3507
> 
> which lock already depends on the new lock.

Hi, thanks for the report.

Initial analysis:

c8e88e3aa738 ("NFSD: Replace READ* macros in nfsd4_decode_layoutget()"
changes code several layers above the network layer. In addition,
neither of the stack traces contain NFSD functions. And, repro.c does
not appear to exercise any filesystem code.

Therefore the bisect result looks implausible to me. I don't see any
obvious connection between the lockdep splat and c8e88e3aa738. (If
someone else does, please let me know where to look).


> the existing dependency chain (in reverse order) is:
> 
> -> #1 (sk_lock-AF_INET6){+.+.}-{0:0}:
>   lock_sock_nested+0xca/0x120 net/core/sock.c:3071
>   lock_sock include/net/sock.h:1600 [inline]
>   gtp_encap_enable_socket+0x277/0x4a0 drivers/net/gtp.c:824
>   gtp_encap_enable drivers/net/gtp.c:855 [inline]
>   gtp_newlink+0x2b3/0xc60 drivers/net/gtp.c:683
>   __rtnl_newlink+0x1059/0x1710 net/core/rtnetlink.c:3443
>   rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3491
>   rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
>   netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
>   netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
>   netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
>   netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
>   sock_sendmsg_nosec net/socket.c:654 [inline]
>   sock_sendmsg+0xcf/0x120 net/socket.c:674
>   sys_sendmsg+0x6e8/0x810 net/socket.c:2350
>   ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
>   __sys_sendmsg+0xe5/0x1b0 net/socket.c:2437
>   do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> -> #0 (rtnl_mutex){+.+.}-{3:3}:
>   check_prev_add kernel/locking/lockdep.c:2936 [inline]
>   check_prevs_add kernel/locking/lockdep.c:3059 [inline]
>   validate_chain kernel/locking/lockdep.c:3674 [inline]
>   __lock_acquire+0x2b14/0x54c0 kernel/locking/lockdep.c:4900
>   lock_acquire kernel/locking/lockdep.c:5510 [inline]
>   lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
>   __mutex_lock_common kernel/locking/mutex.c:946 [inline]
>   __mutex_lock+0x139/0x1120 kernel/locking/mutex.c:1093
>   ipv6_sock_mc_close+0xd7/0x110 net/ipv6/mcast.c:323
>   mptcp6_release+0xb9/0x130 net/mptcp/protocol.c:3515
>   __sock_release+0xcd/0x280 net/socket.c:599
>   sock_close+0x18/0x20 net/socket.c:1258
>   __fput+0x288/0x920 fs/file_table.c:280
>   task_work_run+0xdd/0x1a0 kernel/task_work.c:140
>   tracehook_notify_resume include/linux/tracehook.h:189 [inline]
>   exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
>   exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:208
> 

Re: [PATCH] Repair misuse of sv_lock in 5.10.16-rt30.

2021-02-26 Thread Chuck Lever



> On Feb 26, 2021, at 10:19 AM, Joe Korty  wrote:
> 
> On Fri, Feb 26, 2021 at 03:15:46PM +0000, Chuck Lever wrote:
>> 
>> 
>>> On Feb 26, 2021, at 10:00 AM, J. Bruce Fields  wrote:
>>> 
>>> Adding Chuck, linux-nfs.
>>> 
>>> Makes sense to me.--b.
>> 
>> Joe, I can add this to nfsd-5.12-rc. Would it be appropriate to add:
>> 
>> Fixes: 719f8bcc883e ("svcrpc: fix xpt_list traversal locking on shutdown")
> 
> Sure.
> And thanks, everybody, for the quick response.
> Joe

Your patch has been added to the for-rc topic branch at

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

with minor edits to the patch description.

--
Chuck Lever





Re: [PATCH] Repair misuse of sv_lock in 5.10.16-rt30.

2021-02-26 Thread Chuck Lever
49258 (nfsd_mutex){+.+.}-{3:3}, at: 
>> nfsd+0x19a/0x270
>> [  109.442508]  #1: 994245cb00b0 (>sv_lock){+.+.}-{0:0}, at: 
>> svc_close_list+0x1f/0x90
>> [  109.442512]  #2: 93a81b20 (rcu_read_lock){}-{1:2}, at: 
>> rt_spin_lock+0x5/0xc0
>> [  109.442518] 
>> [  109.442518] stack backtrace:
>> [  109.442519] CPU: 0 PID: 1032 Comm: nfsd Not tainted 5.10.16-rt30 #1
>> [  109.442522] Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 
>> 09/22/2015
>> [  109.442524] Call Trace:
>> [  109.442527]  dump_stack+0x77/0x97
>> [  109.442533]  check_noncircular+0xdc/0xf0
>> [  109.442546]  __lock_acquire+0x1264/0x20b0
>> [  109.442553]  lock_acquire+0xc2/0x400
>> [  109.442564]  rt_spin_lock+0x2b/0xc0
>> [  109.442570]  __local_bh_disable_ip+0xd9/0x270
>> [  109.442573]  svc_xprt_do_enqueue+0xc0/0x4d0
>> [  109.442577]  svc_close_list+0x60/0x90
>> [  109.442581]  svc_close_net+0x49/0x1a0
>> [  109.442585]  svc_shutdown_net+0x12/0x40
>> [  109.442588]  nfsd_destroy+0xc5/0x180
>> [  109.442590]  nfsd+0x1bc/0x270
>> [  109.442595]  kthread+0x194/0x1b0
>> [  109.442600]  ret_from_fork+0x22/0x30
>> [  109.518225] nfsd: last server has exited, flushing export cache
>> [  OK  ] Stopped NFSv4 ID-name mapping service.
>> [  OK  ] Stopped GSSAPI Proxy Daemon.
>> [  OK  ] Stopped NFS Mount Daemon.
>> [  OK  ] Stopped NFS status monitor for NFSv2/3 locking..
>> Index: b/net/sunrpc/svc_xprt.c
>> ===
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -1062,7 +1062,7 @@ static int svc_close_list(struct svc_ser
>>  struct svc_xprt *xprt;
>>  int ret = 0;
>> 
>> -spin_lock(&serv->sv_lock);
>> +spin_lock_bh(&serv->sv_lock);
>>  list_for_each_entry(xprt, xprt_list, xpt_list) {
>>  if (xprt->xpt_net != net)
>>  continue;
>> @@ -1070,7 +1070,7 @@ static int svc_close_list(struct svc_ser
>>  set_bit(XPT_CLOSE, >xpt_flags);
>>  svc_xprt_enqueue(xprt);
>>  }
>> -spin_unlock(&serv->sv_lock);
>> +spin_unlock_bh(&serv->sv_lock);
>>  return ret;
>> }
>> 
>> 
> 

--
Chuck Lever





Re: [PATCH] nfsd: fix kconfig dependency warning for NFSD_V4

2021-02-24 Thread Chuck Lever
Hi Julian-

> On Feb 19, 2021, at 4:56 PM, Julian Braha  wrote:
> 
> When NFSD_V4 is enabled and CRYPTO is disabled,
> Kbuild gives the following warning:
> 
> WARNING: unmet direct dependencies detected for CRYPTO_SHA256
>  Depends on [n]: CRYPTO [=n]
>  Selected by [y]:
>  - NFSD_V4 [=y] && NETWORK_FILESYSTEMS [=y] && NFSD [=y] && PROC_FS [=y]
> 
> WARNING: unmet direct dependencies detected for CRYPTO_MD5
>  Depends on [n]: CRYPTO [=n]
>  Selected by [y]:
>  - NFSD_V4 [=y] && NETWORK_FILESYSTEMS [=y] && NFSD [=y] && PROC_FS [=y]
> 
> This is because NFSD_V4 selects CRYPTO_MD5 and CRYPTO_SHA256,
> without depending on or selecting CRYPTO, despite those config options
> being subordinate to CRYPTO.
> 
> Signed-off-by: Julian Braha 

I've included this patch in the for-rc topic branch at:

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

However, there was some harumphing over on linux-rdma about
a similar change for CRYPTO_CRC32 and the rxe driver. I'm
not quite clear about the outcome of that conversation. It
seems like we are going with either

 - add "select CRYPTO"

or

 - add "depends on CRYPTO"


> --- a/fs/nfsd/Kconfig 2021-02-09 22:05:29.462030761 -0500
> +++ b/fs/nfsd/Kconfig 2021-02-11 12:00:48.974076992 -0500
> @@ -73,6 +73,7 @@
>   select NFSD_V3
>   select FS_POSIX_ACL
>   select SUNRPC_GSS
> + select CRYPTO
>   select CRYPTO_MD5
>   select CRYPTO_SHA256
>   select GRACE_PERIOD
> 
> 

--
Chuck Lever





Re: [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc

2021-02-24 Thread Chuck Lever



> On Feb 24, 2021, at 5:26 AM, Mel Gorman  wrote:
> 
> This is a prototype series that introduces a bulk order-0 page allocator
> with sunrpc being the first user. The implementation is not particularly
> efficient and the intention is to iron out what the semantics of the API
> should be. That said, sunrpc was reported to have reduced allocation
> latency when refilling a pool.
> 
> As a side-note, while the implementation could be more efficient, it
> would require fairly deep surgery in numerous places. The lock scope would
> need to be significantly reduced, particularly as vmstat, per-cpu and the
> buddy allocator have different locking protocols that overlap -- e.g. all
> partially depend on irqs being disabled at various points. Secondly,
> the core of the allocator deals with single pages where as both the bulk
> allocator and per-cpu allocator operate in batches. All of that has to
> be reconciled with all the existing users and their constraints (memory
> offline, CMA and cpusets being the trickiest).
> 
> In terms of semantics required by new users, my preference is that a pair
> of patches be applied -- the first which adds the required semantic to
> the bulk allocator and the second which adds the new user.
> 
> Patch 1 of this series is a cleanup to sunrpc, it could be merged
>   separately but is included here for convenience.
> 
> Patch 2 is the prototype bulk allocator
> 
> Patch 3 is the sunrpc user. Chuck also has a patch which further caches
>   pages but is not included in this series. It's not directly
>   related to the bulk allocator and as it caches pages, it might
>   have other concerns (e.g. does it need a shrinker?)
> 
> This has only been lightly tested on a low-end NFS server. It did not break
> but would benefit from an evaluation to see how much, if any, the headline
> performance changes. The biggest concern is that a light test case showed
> that there are a *lot* of bulk requests for 1 page which gets delegated to
> the normal allocator.  The same criteria should apply to any other users.
> 
> include/linux/gfp.h   |  13 +
> mm/page_alloc.c   | 113 +-
> net/sunrpc/svc_xprt.c |  47 ++++--
> 3 files changed, 157 insertions(+), 16 deletions(-)

Hi Mel-

Thank you for carrying the torch!


--
Chuck Lever





Re: NFS Caching broken in 4.19.37

2021-02-20 Thread Chuck Lever



> On Feb 20, 2021, at 3:13 PM, Anton Ivanov  
> wrote:
> 
> On 20/02/2021 20:04, Salvatore Bonaccorso wrote:
>> Hi,
>> 
>> On Mon, Jul 08, 2019 at 07:19:54PM +0100, Anton Ivanov wrote:
>>> Hi list,
>>> 
>>> NFS caching appears broken in 4.19.37.
>>> 
>>> The more cores/threads the easier to reproduce. Tested with identical
>>> results on Ryzen 1600 and 1600X.
>>> 
>>> 1. Mount an openwrt build tree over NFS v4
>>> 2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in a
>>> loop
>>> 3. Result after 3-4 iterations:
>>> 
>>> State on the client
>>> 
>>> ls -laF 
>>> /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>> 
>>> total 8
>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>> 
>>> State as seen on the server (mounted via nfs from localhost):
>>> 
>>> ls -laF 
>>> /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>> total 12
>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>> 
>>> Actual state on the filesystem:
>>> 
>>> ls -laF 
>>> /exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>> total 12
>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>> 
>>> So the client has quite clearly lost the plot. Telling it to drop caches and
>>> re-reading the directory shows the file present.
>>> 
>>> It is possible to reproduce this using a linux kernel tree too, just takes
>>> much more iterations - 10+ at least.
>>> 
>>> Both client and server run 4.19.37 from Debian buster. This is filed as
>>> debian bug 931500. I originally thought it to be autofs related, but IMHO it
>>> is actually something fundamentally broken in nfs caching resulting in cache
>>> corruption.
>> According to the reporter downstream in Debian, at
>> https://bugs.debian.org/940821#26 thi seem still reproducible with
>> more recent kernels than the initial reported. Is there anything Anton
>> can provide to try to track down the issue?
>> 
>> Anton, can you reproduce with current stable series?
> 
> 100% reproducible with any kernel from 4.9 to 5.4, stable or backports. It 
> may exist in earlier versions, but I do not have a machine with anything 
> before 4.9 to test at present.

Confirming you are varying client-side kernels. Should the Linux
NFS client maintainers be Cc'd?


> From 1-2 make clean && make  cycles to one afternoon depending on the number 
> of machine cores. More cores/threads the faster it does it.
> 
> I tried playing with protocol minor versions, caching options, etc - it is 
> still reproducible for any nfs4 settings as long as there is client side 
> caching of metadata.
> 
> A.
> 
>> 
>> Regards,
>> Salvatore
>> 
> 
> -- 
> Anton R. Ivanov
> Cambridgegreys Limited. Registered in England. Company Number 10273661
> https://www.cambridgegreys.com/

--
Chuck Lever





Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again

2021-02-08 Thread Chuck Lever



> On Feb 8, 2021, at 3:12 PM, Trond Myklebust  wrote:
> 
> On Mon, 2021-02-08 at 19:48 +0000, Chuck Lever wrote:
>> 
>> 
>>> On Feb 8, 2021, at 2:34 PM, Trond Myklebust <
>>> tron...@hammerspace.com> wrote:
>>> 
>>> On Tue, 2021-01-19 at 20:25 -0500, Sasha Levin wrote:
>>>> From: Chuck Lever 
>>>> 
>>>> [ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]
>>>> 
>>>> Daire Byrne reports a ~50% aggregrate throughput regression on
>>>> his
>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server
>>>> to
>>>> use xprt_sock_sendmsg for socket sends"), which replaced
>>>> kernel_send_page() calls in NFSD's socket send path with calls to
>>>> sock_sendmsg() using iov_iter.
>>>> 
>>>> Investigation showed that tcp_sendmsg() was not using zero-copy
>>>> to
>>>> send the xdr_buf's bvec pages, but instead was relying on memcpy.
>>>> This means copying every byte of a large NFS READ payload.
>>>> 
>>>> It looks like TLS sockets do indeed support a ->sendpage method,
>>>> so it's really not necessary to use xprt_sock_sendmsg() to
>>>> support
>>>> TLS fully on the server. A mechanical reversion of da1661b93bf4
>>>> is
>>>> not possible at this point, but we can re-implement the server's
>>>> TCP socket sendmsg path using kernel_sendpage().
>>>> 
>>>> Reported-by: Daire Byrne 
>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>>>> Signed-off-by: Chuck Lever 
>>>> Signed-off-by: Sasha Levin 
>>>> ---
>>>>  net/sunrpc/svcsock.c | 86
>>>> +++-
>>>>  1 file changed, 85 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>> index c2752e2b9ce34..4404c491eb388 100644
>>>> --- a/net/sunrpc/svcsock.c
>>>> +++ b/net/sunrpc/svcsock.c
>>>> @@ -1062,6 +1062,90 @@ static int svc_tcp_recvfrom(struct
>>>> svc_rqst
>>>> *rqstp)
>>>> return 0;   /* record not complete */
>>>>  }
>>>>  
>>>> +static int svc_tcp_send_kvec(struct socket *sock, const struct
>>>> kvec
>>>> *vec,
>>>> + int flags)
>>>> +{
>>>> +   return kernel_sendpage(sock, virt_to_page(vec->iov_base),
>>>> +  offset_in_page(vec->iov_base),
>>>> +  vec->iov_len, flags);
>> 
>> Thanks for your review!
>> 
>>> I'm having trouble with this line. This looks like it is trying to
>>> push
>>> a slab page into kernel_sendpage().
>> 
>> The head and tail kvec's in rq_res are not kmalloc'd, they are
>> backed by pages in rqstp->rq_pages[].
>> 
>> 
>>> What guarantees that the nfsd
>>> thread won't call kfree() before the socket layer is done
>>> transmitting
>>> the page?
>> 
>> If I understand correctly what Neil told us last week, the page
>> reference count on those pages is set up so that one of
>> svc_xprt_release() or the network layer does the final put_page(),
>> in a safe fashion.
>> 
>> Before da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
>> for socket sends"), the original svc_send_common() code did this:
>> 
>> -   /* send head */
>> -   if (slen == xdr->head[0].iov_len)
>> -   flags = 0;
>> -   len = kernel_sendpage(sock, headpage, headoffset,
>> - xdr->head[0].iov_len, flags);
>> -   if (len != xdr->head[0].iov_len)
>> -   goto out;
>> -   slen -= xdr->head[0].iov_len;
>> -   if (slen == 0)
>> -   goto out;
>> 
>> 
>> 
> 
> OK, so then only the argument kvec can be allocated on the slab (thanks
> to  svc_deferred_recv)? Is that correct?

The RPC/RDMA Receive buffer is kmalloc'd, that would be used for
rq_arg.head/tail. But for TCP, I believe the head kvec is always
pulled out of rq_pages[].

svc_process() sets up rq_res.head this way:

1508 resv->iov_base = page_address(rqstp->rq_respages[0]);
1509 resv->iov_len = 0;

I would need to audit the code to confirm that rq_res.tail is never
kmalloc'd.
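
If it would help to make that assumption explicit rather than rely on
an audit, a guard along these lines could be added to the helper
quoted above. This is only a sketch and is untested; sendpage_ok()
is the existing helper in include/linux/net.h that rejects slab pages
and pages with a zero refcount:

static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
			     int flags)
{
	struct page *page = virt_to_page(vec->iov_base);

	/* Document the assumption that the head and tail kvecs are
	 * always backed by rq_pages[], never by kmalloc'd memory. */
	if (WARN_ON_ONCE(!sendpage_ok(page)))
		return -EIO;

	return kernel_sendpage(sock, page, offset_in_page(vec->iov_base),
			       vec->iov_len, flags);
}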


--
Chuck Lever





Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again

2021-02-08 Thread Chuck Lever



> On Feb 8, 2021, at 2:34 PM, Trond Myklebust  wrote:
> 
> On Tue, 2021-01-19 at 20:25 -0500, Sasha Levin wrote:
>> From: Chuck Lever 
>> 
>> [ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]
>> 
>> Daire Byrne reports a ~50% aggregrate throughput regression on his
>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to
>> use xprt_sock_sendmsg for socket sends"), which replaced
>> kernel_send_page() calls in NFSD's socket send path with calls to
>> sock_sendmsg() using iov_iter.
>> 
>> Investigation showed that tcp_sendmsg() was not using zero-copy to
>> send the xdr_buf's bvec pages, but instead was relying on memcpy.
>> This means copying every byte of a large NFS READ payload.
>> 
>> It looks like TLS sockets do indeed support a ->sendpage method,
>> so it's really not necessary to use xprt_sock_sendmsg() to support
>> TLS fully on the server. A mechanical reversion of da1661b93bf4 is
>> not possible at this point, but we can re-implement the server's
>> TCP socket sendmsg path using kernel_sendpage().
>> 
>> Reported-by: Daire Byrne 
>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>> Signed-off-by: Chuck Lever 
>> Signed-off-by: Sasha Levin 
>> ---
>>  net/sunrpc/svcsock.c | 86
>> +++-
>>  1 file changed, 85 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index c2752e2b9ce34..4404c491eb388 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -1062,6 +1062,90 @@ static int svc_tcp_recvfrom(struct svc_rqst
>> *rqstp)
>> return 0;   /* record not complete */
>>  }
>>  
>> +static int svc_tcp_send_kvec(struct socket *sock, const struct kvec
>> *vec,
>> + int flags)
>> +{
>> +   return kernel_sendpage(sock, virt_to_page(vec->iov_base),
>> +  offset_in_page(vec->iov_base),
>> +  vec->iov_len, flags);

Thanks for your review!

> I'm having trouble with this line. This looks like it is trying to push
> a slab page into kernel_sendpage().

The head and tail kvec's in rq_res are not kmalloc'd, they are
backed by pages in rqstp->rq_pages[].


> What guarantees that the nfsd
> thread won't call kfree() before the socket layer is done transmitting
> the page?

If I understand correctly what Neil told us last week, the page
reference count on those pages is set up so that one of
svc_xprt_release() or the network layer does the final put_page(),
in a safe fashion.

Before da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
for socket sends"), the original svc_send_common() code did this:

-   /* send head */
-   if (slen == xdr->head[0].iov_len)
-   flags = 0;
-   len = kernel_sendpage(sock, headpage, headoffset,
- xdr->head[0].iov_len, flags);
-   if (len != xdr->head[0].iov_len)
-   goto out;
-   slen -= xdr->head[0].iov_len;
-   if (slen == 0)
-   goto out;


--
Chuck Lever





Re: [PATCH][next] nfsd: fix check of statid returned from call to find_stateid_by_type

2021-01-28 Thread Chuck Lever
Hi Dan-

> On Jan 28, 2021, at 10:34 AM, Dan Carpenter  wrote:
> 
> On Thu, Jan 28, 2021 at 03:05:06PM +0000, Chuck Lever wrote:
>> Hi Colin-
>> 
>>> On Jan 28, 2021, at 9:49 AM, Colin King  wrote:
>>> 
>>> From: Colin Ian King 
>>> 
>>> The call to find_stateid_by_type is setting the return value in *stid
>>> yet the NULL check of the return is checking stid instead of *stid.
>>> Fix this by adding in the missing pointer * operator.
>>> 
>>> Addresses-Coverity: ("Dereference before null check")
>>> Fixes: 6cdaa72d4dde ("nfsd: find_cpntf_state cleanup")
>>> Signed-off-by: Colin Ian King 
>> 
>> Thanks for your patch. I've committed it to the for-next branch at
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>> 
>> in preparation for the v5.12 merge window, with the following changes:
>> 
>> - ^statid^stateid
>> - Fixes: tag removed, since no stable backport is necessary
>> 
>> The commit you are fixing has not been merged upstream yet.
> 
> Fixes tags don't meant the patch has to be backported.  Is your tree
> rebased?  In that case, the fixes tag probably doesn't make sense
> because the tag can change.  You might want to just consider folding
> Colin's fix into the original commit.

Yes, this branch can be rebased on occasion. Since you and Bruce
suggest squashing the fix into the original patch, I will do that.


> Fixes tags are used for a lot of different things:
> 1)  If there is a fixes tag, then you can tell it does *NOT* have to
>be back ported because the original commit is not in the stable
>tree.  It saves time for the stable maintainers.
> 2)  Metrics to figure out how quickly we are fixing bugs.
> 3)  Sometimes the Fixes tag helps because we want to review the original
>patch to see what the intent was.
> 
> All sorts of stuff.  Etc.

Yep, I'm a fan of all that. I just want to avoid poking the stable
automation bear when it's unnecessary.

--
Chuck Lever





Re: [PATCH][next] nfsd: fix check of statid returned from call to find_stateid_by_type

2021-01-28 Thread Chuck Lever
Hi Colin-

> On Jan 28, 2021, at 9:49 AM, Colin King  wrote:
> 
> From: Colin Ian King 
> 
> The call to find_stateid_by_type is setting the return value in *stid
> yet the NULL check of the return is checking stid instead of *stid.
> Fix this by adding in the missing pointer * operator.
> 
> Addresses-Coverity: ("Dereference before null check")
> Fixes: 6cdaa72d4dde ("nfsd: find_cpntf_state cleanup")
> Signed-off-by: Colin Ian King 

Thanks for your patch. I've committed it to the for-next branch at

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

in preparation for the v5.12 merge window, with the following changes:

- ^statid^stateid
- Fixes: tag removed, since no stable backport is necessary

The commit you are fixing has not been merged upstream yet.


> ---
> fs/nfsd/nfs4state.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index f554e3480bb1..423fd6683f3a 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -5824,7 +5824,7 @@ static __be32 find_cpntf_state(struct nfsd_net *nn, 
> stateid_t *st,
> 
> *stid = find_stateid_by_type(found, &cps->cp_p_stateid,
>   NFS4_DELEG_STID|NFS4_OPEN_STID|NFS4_LOCK_STID);
> - if (stid)
> + if (*stid)
>   status = nfs_ok;
>   else
>   status = nfserr_bad_stateid;
> -- 
> 2.29.2
> 

--
Chuck Lever





Re: [PATCH 32/32] NFS: Convert readpage to readahead and use netfs_readahead for fscache

2021-01-26 Thread Chuck Lever



> On Jan 25, 2021, at 8:36 PM, Matthew Wilcox  wrote:
> 
> 
> For Subject: s/readpage/readpages/
> 
> On Mon, Jan 25, 2021 at 09:37:29PM +, David Howells wrote:
>> +int __nfs_readahead_from_fscache(struct nfs_readdesc *desc,
>> + struct readahead_control *rac)
> 
> I thought you wanted it called ractl instead of rac?  That's what I've
> been using in new code.
> 
>> -dfprintk(FSCACHE, "NFS: nfs_getpages_from_fscache (0x%p/%u/0x%p)\n",
>> - nfs_i_fscache(inode), npages, inode);
>> +dfprintk(FSCACHE, "NFS: nfs_readahead_from_fscache (0x%p/0x%p)\n",
>> + nfs_i_fscache(inode), inode);
> 
> We do have readahead_count() if this is useful information to be logging.

As a sidebar, the Linux NFS community is transitioning to tracepoints.
It would be helpful (but not completely necessary) to use tracepoints
in new code instead of printk.


>> +static inline int nfs_readahead_from_fscache(struct nfs_readdesc *desc,
>> + struct readahead_control *rac)
>> {
>> -if (NFS_I(inode)->fscache)
>> -return __nfs_readpages_from_fscache(ctx, inode, mapping, pages,
>> -nr_pages);
>> +if (NFS_I(rac->mapping->host)->fscache)
>> +return __nfs_readahead_from_fscache(desc, rac);
>>  return -ENOBUFS;
>> }
> 
> Not entirely sure that it's worth having the two functions separated any more.
> 
>>  /* attempt to read as many of the pages as possible from the cache
>>   * - this returns -ENOBUFS immediately if the cookie is negative
>>   */
>> -ret = nfs_readpages_from_fscache(desc.ctx, inode, mapping,
>> - pages, &nr_pages);
>> +ret = nfs_readahead_from_fscache(&desc, rac);
>>  if (ret == 0)
>>  goto read_complete; /* all pages were read */
>> 
>>  nfs_pageio_init_read(, inode, false,
>>   &nfs_async_read_completion_ops);
>> 
>> -ret = read_cache_pages(mapping, pages, readpage_async_filler, &desc);
>> +while ((page = readahead_page(rac))) {
>> +    ret = readpage_async_filler(&desc, page);
>> +put_page(page);
>> +}
> 
> I thought with the new API we didn't need to do this kind of thing
> any more?  ie no matter whether fscache is configured in or not, it'll
> submit the I/Os.

--
Chuck Lever





Re: Why the auxiliary cipher in gss_krb5_crypto.c?

2020-12-04 Thread Chuck Lever



> On Dec 4, 2020, at 10:46 AM, Bruce Fields  wrote:
> 
> On Fri, Dec 04, 2020 at 02:59:35PM +, David Howells wrote:
>> Hi Chuck, Bruce,
>> 
>> Why is gss_krb5_crypto.c using an auxiliary cipher?  For reference, the
>> gss_krb5_aes_encrypt() code looks like the attached.
>> 
>>> From what I can tell, in AES mode, the difference between the main cipher 
>>> and
>> the auxiliary cipher is that the latter is "cbc(aes)" whereas the former is
>> "cts(cbc(aes))" - but they have the same key.
>> 
>> Reading up on CTS, I'm guessing the reason it's like this is that CTS is the
>> same as the non-CTS, except for the last two blocks, but the non-CTS one is
>> more efficient.
> 
> CTS is cipher-text stealing, isn't it?  I think it was Kevin Coffman
> that did that, and I don't remember the history.  I thought it was
> required by some spec or peer implementation (maybe Windows?) but I
> really don't remember.  It may predate git.  I'll dig around and see
> what I can find.

I can't add more here; this design comes from well before I started
working on this body of code (though I worked near Kevin when he
implemented it).
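
For anyone reading along, the two-transform arrangement David describes
boils down to something like the sketch below: both transforms carry the
same key, "cbc(aes)" covers everything except the last two blocks, and
"cts(cbc(aes))" covers the tail. This is an illustration only, not the
actual gss_krb5_crypto.c code, and error handling is omitted:

    struct crypto_sync_skcipher *cipher, *aux_cipher;

    cipher = crypto_alloc_sync_skcipher("cts(cbc(aes))", 0, 0);
    aux_cipher = crypto_alloc_sync_skcipher("cbc(aes)", 0, 0);

    /* key/keylen stand in for the derived Kerberos key */
    crypto_sync_skcipher_setkey(cipher, key, keylen);
    crypto_sync_skcipher_setkey(aux_cipher, key, keylen);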


--
Chuck Lever





Re: [PATCH] nfs_common: need lock during iterate through the list

2020-12-01 Thread Chuck Lever
Hello!

> On Dec 1, 2020, at 7:06 AM, Yi Wang  wrote:
> 
> From: Cheng Lin  
> 
> If the elem is deleted while it is being iterated over, the iteration
> process will fall into an endless loop.
> 
> kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [nfsd:17137]
> 
> PID: 17137  TASK: 8818d93c  CPU: 4   COMMAND: "nfsd" 
> [exception RIP: __state_in_grace+76]
> RIP: c00e817c  RSP: 8818d3aefc98  RFLAGS: 0246
> RAX: 881dc0c38298  RBX: 81b03580  RCX: 881dc02c9f50
> RDX: 881e3fce8500  RSI: 0001  RDI: 81b03580
> RBP: 8818d3aefca0   R8: 0020   R9: 8818d3aefd40
> R10: 88017fc03800  R11: 8818e83933c0  R12: 8818d3aefd40
> R13:   R14: 8818e8391068  R15: 8818fa6e4000
> CS: 0010  SS: 0018
>  #0 [8818d3aefc98] opens_in_grace at c00e81e3 [grace]
>  #1 [8818d3aefca8] nfs4_preprocess_stateid_op at c02a3e6c [nfsd]
>  #2 [8818d3aefd18] nfsd4_write at c028ed5b [nfsd]
>  #3 [8818d3aefd80] nfsd4_proc_compound at c0290a0d [nfsd]
>  #4 [8818d3aefdd0] nfsd_dispatch at c027b800 [nfsd]
>  #5 [8818d3aefe08] svc_process_common at c02017f3 [sunrpc]
>  #6 [8818d3aefe70] svc_process at c0201ce3 [sunrpc]
>  #7 [8818d3aefe98] nfsd at c027b117 [nfsd]
>  #8 [8818d3aefec8] kthread at 810b88c1
>  #9 [8818d3aeff50] ret_from_fork at 816d1607
> 
> The troublemaker elem:
> crash> lock_manager 881dc0c38298
> struct lock_manager {
>   list = {
> next = 0x881dc0c38298,
> prev = 0x881dc0c38298
>   },
>   block_opens = false
> }
> 
> Signed-off-by: Cheng Lin  
> Signed-off-by: Yi Wang  
> ---
>  fs/nfs_common/grace.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs_common/grace.c b/fs/nfs_common/grace.c
> index b73d9dd37..26f2a50ec 100644
> --- a/fs/nfs_common/grace.c
> +++ b/fs/nfs_common/grace.c
> @@ -69,10 +69,14 @@ __state_in_grace(struct net *net, bool open)
>  if (!open)
>  return !list_empty(grace_list);
>   
> +spin_lock(&grace_lock);
>  list_for_each_entry(lm, grace_list, list) {
> -if (lm->block_opens)
> +if (lm->block_opens) {
> +spin_unlock(&grace_lock);
>  return true;
> +}
>  }
> +spin_unlock(&grace_lock);
>  return false;
>  }
>   
> --

This looks most closely related to NFSD, so I've applied it to
my NFSD tree for the next merge window. I've also added

Fixes: c87fb4a378f9 ("lockd: NLM grace period shouldn't block NFSv4 opens")

You can find it in the cel-next topic branch in my kernel repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Incidentally, the e-mail encoding mangled the white space and I
don't see the e-mail showing up on lore.kernel.org. I applied it
by hand since it was small, but this should be addressed for
future patches so our patch handling infrastructure can deal
properly with your submissions. Thanks!


--
Chuck Lever





Re: [PATCH] nfsd: Fix kernel test robot warning

2020-11-28 Thread Chuck Lever
Hello Souptick-

This looks like the same error that Coverity caught earlier this
week. AFAIK Bruce intends to address this issue with a replacement
patch:

https://lore.kernel.org/linux-nfs/20201125164738.ga7...@fieldses.org/


> On Nov 28, 2020, at 8:50 AM, Souptick Joarder  wrote:
> 
> Kernel test robot throws below warning -
> 
>>> fs/nfsd/nfs3xdr.c:299:6: warning: variable 'err' is used
>>> uninitialized whenever 'if' condition is false
>>> [-Wsometimes-uninitialized]
>   if (!v4 || !inode->i_sb->s_export_op->fetch_iversion)
>   ^~~~
>   fs/nfsd/nfs3xdr.c:304:6: note: uninitialized use occurs here
>   if (err) {
>   ^~~
>   fs/nfsd/nfs3xdr.c:299:2: note: remove the 'if' if its condition is
> always true
>   if (!v4 || !inode->i_sb->s_export_op->fetch_iversion)
>   ^
>   fs/nfsd/nfs3xdr.c:293:12: note: initialize the variable 'err' to
> silence this warning
>   __be32 err;
> ^
>  = 0
>   1 warning generated.
> 
> Initialize err = 0 to silence this warning.
> 
> Reported-by: kernel test robot 
> Signed-off-by: Souptick Joarder 
> ---
> fs/nfsd/nfs3xdr.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index abb1608..47aeaee 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -290,7 +290,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> {
>   bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
>   struct inode *inode = d_inode(fhp->fh_dentry);
> - __be32 err;
> + __be32 err = 0;
> 
>   if (fhp->fh_post_saved)
>   printk("nfsd: inode locked twice during operation.\n");
> -- 
> 1.9.1
> 

--
Chuck Lever





Re: [PATCH] nfsd: Fix error return code in nfsd_file_cache_init()

2020-11-25 Thread Chuck Lever



> On Nov 25, 2020, at 9:17 AM, J. Bruce Fields  wrote:
> 
> On Wed, Nov 25, 2020 at 03:39:33AM -0500, Huang Guobin wrote:
>> Fix to return PTR_ERR() error code from the error handling case instead of
>> 0 in function nfsd_file_cache_init(), as done elsewhere in this function.
>> 
>> Fixes: 65294c1f2c5e7("nfsd: add a new struct file caching facility to nfsd")
>> Signed-off-by: Huang Guobin 
>> ---
>> fs/nfsd/filecache.c | 1 +
>> 1 file changed, 1 insertion(+)
>> 
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index c8b9d2667ee6..a8a5b555f08b 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -686,6 +686,7 @@ nfsd_file_cache_init(void)
>>  pr_err("nfsd: unable to create fsnotify group: %ld\n",
>>  PTR_ERR(nfsd_file_fsnotify_group));
>>  nfsd_file_fsnotify_group = NULL;
>> +ret = PTR_ERR(nfsd_file_fsnotify_group);
> 
> I think you meant to add that one line earlier.
> 
> Otherwise fine, but it looks like an unlikely case so can probably wait
> for the merge window.

Applied for the v5.11 merge window with Bruce's suggested change,
and pushed to the cel-next branch in

  git://git.linux-nfs.org/projects/cel/cel-2.6.git

or

  https://git.linux-nfs.org/?p=cel/cel-2.6.git;a=summary
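
For the archives: the ordering matters because PTR_ERR() has to be
captured before the pointer is cleared; PTR_ERR(NULL) is 0, which is
exactly the kind of silent success this patch is trying to avoid. The
end result should look roughly like this (a sketch, not the committed
hunk):

    if (IS_ERR(nfsd_file_fsnotify_group)) {
            pr_err("nfsd: unable to create fsnotify group: %ld\n",
                    PTR_ERR(nfsd_file_fsnotify_group));
            ret = PTR_ERR(nfsd_file_fsnotify_group);
            nfsd_file_fsnotify_group = NULL;
            goto out_notifier;
    }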


>>  goto out_notifier;
>>  }
>> 
>> -- 
>> 2.22.0

--
Chuck Lever





Re: linux-next: Signed-off-by missing for commit in the cel tree

2020-11-24 Thread Chuck Lever



> On Nov 24, 2020, at 6:03 PM, Stephen Rothwell  wrote:
> 
> Hi all,
> 
> Commits
> 
>  287428e8d11f ("nfsd: skip some unnecessary stats in the v4 case")
>  6f76e787daea ("nfs: use change attribute for NFS re-exports")
>  9fb0a61fedf7 ("nfsd4: don't query change attribute in v2/v3 case")
> 
> are missing a Signed-off-by from their committers.

My bad. I assumed my SoB was not required because I didn't alter the
patch in any way.

But, now fixed in my public repo.


--
Chuck Lever





Re: [PATCH 016/141] nfsd: Fix fall-through warnings for Clang

2020-11-20 Thread Chuck Lever



> On Nov 20, 2020, at 1:26 PM, Gustavo A. R. Silva  
> wrote:
> 
> In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
> warnings by explicitly adding a couple of break statements instead of
> just letting the code fall through to the next case.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 
> ---
> fs/nfsd/nfs4state.c | 1 +
> fs/nfsd/nfsctl.c| 1 +
> 2 files changed, 2 insertions(+)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index d7f27ed6b794..cdab0d5be186 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3113,6 +3113,7 @@ nfsd4_exchange_id(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   goto out_nolock;
>   }
>   new->cl_mach_cred = true;
> + break;
>   case SP4_NONE:
>   break;
>   default:/* checked by xdr code */
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index f6d5d783f4a4..9a3bb1e217f9 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1165,6 +1165,7 @@ static struct inode *nfsd_get_inode(struct super_block 
> *sb, umode_t mode)
>   inode->i_fop = _dir_operations;
>   inode->i_op = _dir_inode_operations;
>   inc_nlink(inode);
> +     break;
>   default:
>   break;
>   }
> -- 
> 2.27.0
> 

Acked-by: Chuck Lever 

--
Chuck Lever





Re: [RFC][PATCH 00/18] crypto: Add generic Kerberos library

2020-11-12 Thread Chuck Lever



> On Nov 12, 2020, at 4:07 PM, Bruce Fields  wrote:
> 
> On Thu, Nov 12, 2020 at 04:54:06PM +, David Howells wrote:
>> Chuck Lever  wrote:
>> 
>>> Really? My understanding of the Linux kernel SUNRPC implementation is
>>> that it uses asynchronous, even for small data items. Maybe I'm using
>>> the terminology incorrectly.
>> 
>> Seems to be synchronous, at least in its use of skcipher:
> 
> Yes, it's all synchronous.  The only cases where we defer and revisit a
> request is when we need to do upcalls to userspace.
> 
> (And those upcalls mostly come after we're done with unwrapping and
> verifying a request, so now I'm sort of curious exactly what Chuck was
> seeing.)

I vaguely recall that setting CRYPTO_TFM_REQ_MAY_SLEEP allows the
crypto API to sleep and defer completion.


--
Chuck Lever





Re: [RFC][PATCH 00/18] crypto: Add generic Kerberos library

2020-11-12 Thread Chuck Lever
   |  190 +
>> crypto/krb5/rfc3961_simplified.c |  732 ++
>> crypto/krb5/rfc3962_aes.c|  140 
>> crypto/krb5/rfc6803_camellia.c   |  249 ++
>> crypto/krb5/rfc8009_aes2.c   |  440 +++
>> crypto/krb5/selftest.c   |  543 +
>> crypto/krb5/selftest_data.c  |  289 +++
>> fs/afs/misc.c|   13 +
>> include/crypto/krb5.h|  100 +++
>> include/keys/rxrpc-type.h|   17 +
>> include/trace/events/rxrpc.h |4 +
>> include/uapi/linux/rxrpc.h   |   17 +
>> net/rxrpc/Kconfig|   10 +
>> net/rxrpc/Makefile   |5 +
>> net/rxrpc/ar-internal.h  |   20 +
>> net/rxrpc/conn_object.c  |2 +
>> net/rxrpc/key.c  |  319 
>> net/rxrpc/rxgk.c | 1232 ++
>> net/rxrpc/rxgk_app.c |  424 ++
>> net/rxrpc/rxgk_common.h  |  164 
>> net/rxrpc/rxgk_kdf.c |  271 +++
>> net/rxrpc/security.c |6 +
>> 26 files changed, 5530 insertions(+), 1 deletion(-)
>> create mode 100644 crypto/krb5/kdf.c
>> create mode 100644 crypto/krb5/rfc3961_simplified.c
>> create mode 100644 crypto/krb5/rfc3962_aes.c
>> create mode 100644 crypto/krb5/rfc6803_camellia.c
>> create mode 100644 crypto/krb5/rfc8009_aes2.c
>> create mode 100644 crypto/krb5/selftest.c
>> create mode 100644 crypto/krb5/selftest_data.c
>> create mode 100644 net/rxrpc/rxgk.c
>> create mode 100644 net/rxrpc/rxgk_app.c
>> create mode 100644 net/rxrpc/rxgk_common.h
>> create mode 100644 net/rxrpc/rxgk_kdf.c

--
Chuck Lever





Re: [RFC][PATCH 00/18] crypto: Add generic Kerberos library

2020-11-12 Thread Chuck Lever



> On Nov 12, 2020, at 10:42 AM, David Howells  wrote:
> 
> Chuck Lever  wrote:
> 
>>> There are three main interfaces to it:
>>> 
>>> (*) I/O crypto: encrypt, decrypt, get_mic and verify_mic.
>>> 
>>>These all do in-place crypto, using an sglist to define the buffer
>>>with the data in it.  Is it necessary to make it able to take separate
>>>input and output buffers?
>> 
>> Hi David, Wondering if these "I/O" APIs use synchronous or async
>> crypto under the covers. For small data items like MICs, synchronous
>> might be a better choice, especially if asynchronous crypto could
>> result in incoming requests getting re-ordered and falling out of
>> the GSS sequence number window.
>> 
>> What say ye?
> 
> For the moment I'm using synchronous APIs as that's what sunrpc is using (I
> borrowed the basic code from there).

Really? My understanding of the Linux kernel SUNRPC implementation is
that it uses asynchronous, even for small data items. Maybe I'm using
the terminology incorrectly.

The problem that arises is on the server. The asynchronous API can
schedule, and if the server has other work to do, that can delay a
verify_mic long enough that the request drops out of the GSS sequence
number window (even a large window).

Whatever the mechanism, we need to have deterministic ordering, at
least on the server-side.


> It would be interesting to consider using async, but there's a potential
> issue.  For the simplified profile, encryption and integrity checksum
> generation can be done simultaneously, but decryption and verification can't.
> For the AES-2 profile, the reverse is true.
> 
> For my purposes in rxrpc, async mode isn't actually that useful since I'm only
> doing the contents of a UDP packet at a time.  Either I'm encrypting with the
> intention of immediate transmission or decrypting with the intention of
> immediately using the data, so I'm in a context where I can wait anyway.
> 
> What might get me more of a boost would be to encrypt the app data directly
> into a UDP packet and decrypt the UDP packet directly into the app buffers.
> This is easier said than done, though, as there's typically security metadata
> inserted into the packet inside the encrypted region.


--
Chuck Lever





Re: [RFC][PATCH 00/18] crypto: Add generic Kerberos library

2020-11-12 Thread Chuck Lever
> net/rxrpc/Makefile   |5 +
> net/rxrpc/ar-internal.h  |   20 +
> net/rxrpc/conn_object.c  |2 +
> net/rxrpc/key.c  |  319 
> net/rxrpc/rxgk.c | 1232 ++
> net/rxrpc/rxgk_app.c |  424 ++
> net/rxrpc/rxgk_common.h  |  164 
> net/rxrpc/rxgk_kdf.c |  271 +++
> net/rxrpc/security.c |6 +
> 26 files changed, 5530 insertions(+), 1 deletion(-)
> create mode 100644 crypto/krb5/kdf.c
> create mode 100644 crypto/krb5/rfc3961_simplified.c
> create mode 100644 crypto/krb5/rfc3962_aes.c
> create mode 100644 crypto/krb5/rfc6803_camellia.c
> create mode 100644 crypto/krb5/rfc8009_aes2.c
> create mode 100644 crypto/krb5/selftest.c
> create mode 100644 crypto/krb5/selftest_data.c
> create mode 100644 net/rxrpc/rxgk.c
> create mode 100644 net/rxrpc/rxgk_app.c
> create mode 100644 net/rxrpc/rxgk_common.h
> create mode 100644 net/rxrpc/rxgk_kdf.c
> 
> 

--
Chuck Lever





Re: [PATCH] nfsd/nfs3: remove unused macro nfsd3_fhandleres

2020-11-06 Thread Chuck Lever



> On Nov 6, 2020, at 12:40 AM, Alex Shi  wrote:
> 
> The macro is unused, remove it to tame gcc warning:
> fs/nfsd/nfs3proc.c:702:0: warning: macro "nfsd3_fhandleres" is not used
> [-Wunused-macros]
> 
> Signed-off-by: Alex Shi 
> Cc: "J. Bruce Fields"  
> Cc: Chuck Lever  
> Cc: linux-...@vger.kernel.org 
> Cc: linux-kernel@vger.kernel.org 
> ---
> fs/nfsd/nfs3proc.c | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index 781e265921aa..6e79bae0af4d 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -687,7 +687,6 @@
> #define nfsd3_mkdirargs   nfsd3_createargs
> #define nfsd3_readdirplusargs nfsd3_readdirargs
> #define nfsd3_fhandleargs nfsd_fhandle
> -#define nfsd3_fhandleres nfsd3_attrstat
> #define nfsd3_attrstatres nfsd3_attrstat
> #define nfsd3_wccstatres  nfsd3_attrstat
> #define nfsd3_createres   nfsd3_diropres
> -- 
> 1.8.3.1

Applied to cel-next, thanks Alex.

The topic branch that collects patches for nfsd-5.11 is here:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/cel-next


--
Chuck Lever





Re: [PATCH] nfsd: remove unneeded semicolon

2020-11-02 Thread Chuck Lever



> On Nov 1, 2020, at 10:32 AM, t...@redhat.com wrote:
> 
> From: Tom Rix 
> 
> A semicolon is not needed after a switch statement.
> 
> Signed-off-by: Tom Rix 
> ---
> fs/nfsd/nfs4xdr.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 259d5ad0e3f4..6020f0ff6795 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2558,7 +2558,7 @@ static u32 nfs4_file_type(umode_t mode)
>   case S_IFREG:   return NF4REG;
>   case S_IFSOCK:  return NF4SOCK;
>   default:return NF4BAD;
> - };
> + }
> }
> 
> static inline __be32
> -- 
> 2.18.1
> 

I can take this for 5.11.

--
Chuck Lever





Re: [PATCH] SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()

2020-10-21 Thread Chuck Lever



> On Oct 20, 2020, at 3:16 AM, Martijn de Gouw 
>  wrote:
> 
> Hi,
> 
> On 20-10-2020 00:04, J. Bruce Fields wrote:
>> On Mon, Oct 19, 2020 at 03:46:39PM +, Martijn de Gouw wrote:
>>> Hi
>>> 
>>> On 19-10-2020 17:23, J. Bruce Fields wrote:
>>>> On Mon, Oct 19, 2020 at 01:42:27PM +0200, Martijn de Gouw wrote:
>>>>> When the passed token is longer than 4032 bytes, the remaining part
>>>>> of the token must be copied from the rqstp->rq_arg.pages. But the
>>>>> copy must make sure it happens in a consecutive way.
>>>> 
>>>> Thanks.  Apologies, but I don't immediately see where the copy is
>>>> non-consecutive.  What exactly is the bug in the existing code?
>>> 
>>> In the first memcpy 'length' bytes are copied from argv->iobase, but
>>> since the header is in front, this never fills the whole first page of
>>> in_token->pages.
>>> 
>>> The memcpy in the loop copies the following bytes, but starts writing at
>>> the next page of in_token->pages. This leaves the last bytes of page 0
>>> unwritten.
>>> 
>>> Next to that, the remaining data is in page 0 of rqstp->rq_arg.pages,
>>> not page 1.
>> 
>> Got it, thanks.  Looks like the culprit might be a patch from a year ago
>> from Chuck, 5866efa8cbfb "SUNRPC: Fix svcauth_gss_proxy_init()"?  At
>> least, that's the last major patch to touch this code.

It's likely that we didn't have a test scenario at bake-a-thon that
presents large tokens, so that new tail copy logic was never properly
exercised.


> I found this issue when setting up NFSv4 with Active Directory as KDC 
> and gssproxy. Users with many groups where not able to access the NFS 
> shares, while others could access them just fine. During debugging I 
> found that the token was not the same on both sides.
> 
> I do not have the HW to setup a rdma version of NFSv4, so I'm unable to 
> test if it still works via rdma.

You don't need special HW to get NFS/RDMA with Linux working, though.
Linux now has soft iWARP, and NFS/RDMA works fine with that.

As stated in the patch description for commit 5866efa8cbfb, the original
issue won't appear with Linux clients, because they use TCP to handle
the ACCEPT_SEC_CONTEXT handshake. You'd need to have both Solaris and
RDMA to test it. Maybe we can scrounge something up, but that would only
be enough to ensure that your patch doesn't regress the Solaris NFS/RDMA
with Kerberos setup when using small tokens.


--
Chuck Lever





Re: [PATCH] NFS: Fix mode bits and nlink count for v4 referral dirs

2020-10-15 Thread Chuck Lever



> On Oct 15, 2020, at 9:59 AM, Trond Myklebust  wrote:
> 
> On Thu, 2020-10-15 at 09:36 -0400, Chuck Lever wrote:
>>> On Oct 15, 2020, at 8:06 AM, Trond Myklebust <
>>> tron...@hammerspace.com> wrote:
>>> 
>>> On Thu, 2020-10-15 at 00:39 +0530, Ashish Sangwan wrote:
>>>> On Wed, Oct 14, 2020 at 11:47 PM Trond Myklebust
>>>>  wrote:
>>>>> On Tue, 2020-10-06 at 08:14 -0700, Ashish Sangwan wrote:
>>>>>> Request for mode bits and nlink count in the
>>>>>> nfs4_get_referral
>>>>>> call
>>>>>> and if server returns them use them instead of hard coded
>>>>>> values.
>>>>>> 
>>>>>> CC: sta...@vger.kernel.org
>>>>>> Signed-off-by: Ashish Sangwan 
>>>>>> ---
>>>>>> fs/nfs/nfs4proc.c | 20 +---
>>>>>> 1 file changed, 17 insertions(+), 3 deletions(-)
>>>>>> 
>>>>>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>>>>>> index 6e95c85fe395..efec05c5f535 100644
>>>>>> --- a/fs/nfs/nfs4proc.c
>>>>>> +++ b/fs/nfs/nfs4proc.c
>>>>>> @@ -266,7 +266,9 @@ const u32 nfs4_fs_locations_bitmap[3] = {
>>>>>> | FATTR4_WORD0_FSID
>>>>>> | FATTR4_WORD0_FILEID
>>>>>> | FATTR4_WORD0_FS_LOCATIONS,
>>>>>> - FATTR4_WORD1_OWNER
>>>>>> + FATTR4_WORD1_MODE
>>>>>> + | FATTR4_WORD1_NUMLINKS
>>>>>> + | FATTR4_WORD1_OWNER
>>>>>> | FATTR4_WORD1_OWNER_GROUP
>>>>>> | FATTR4_WORD1_RAWDEV
>>>>>> | FATTR4_WORD1_SPACE_USED
>>>>>> @@ -7594,16 +7596,28 @@ nfs4_listxattr_nfs4_user(struct inode
>>>>>> *inode,
>>>>>> char *list, size_t list_len)
>>>>>> */
>>>>>> static void nfs_fixup_referral_attributes(struct nfs_fattr
>>>>>> *fattr)
>>>>>> {
>>>>>> + bool fix_mode = true, fix_nlink = true;
>>>>>> +
>>>>>> if (!(((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
>>>>>> ||
>>>>>>(fattr->valid & NFS_ATTR_FATTR_FILEID)) &&
>>>>>>   (fattr->valid & NFS_ATTR_FATTR_FSID) &&
>>>>>>   (fattr->valid & NFS_ATTR_FATTR_V4_LOCATIONS)))
>>>>>> return;
>>>>>> 
>>>>>> + if (fattr->valid & NFS_ATTR_FATTR_MODE)
>>>>>> + fix_mode = false;
>>>>>> + if (fattr->valid & NFS_ATTR_FATTR_NLINK)
>>>>>> + fix_nlink = false;
>>>>>> fattr->valid |= NFS_ATTR_FATTR_TYPE |
>>>>>> NFS_ATTR_FATTR_MODE |
>>>>>> NFS_ATTR_FATTR_NLINK |
>>>>>> NFS_ATTR_FATTR_V4_REFERRAL;
>>>>>> - fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
>>>>>> - fattr->nlink = 2;
>>>>>> +
>>>>>> + if (fix_mode)
>>>>>> + fattr->mode = S_IFDIR | S_IRUGO | S_IXUGO;
>>>>>> + else
>>>>>> + fattr->mode |= S_IFDIR;
>>>>>> +
>>>>>> + if (fix_nlink)
>>>>>> + fattr->nlink = 2;
>>>>>> }
>>>>>> 
>>>>>> static int _nfs4_proc_fs_locations(struct rpc_clnt *client,
>>>>>> struct
>>>>>> inode *dir,
>>>>> 
>>>>> NACK to this patch. The whole point is that if the server has a
>>>>> referral, then it is not going to give us any attributes other
>>>>> than
>>>>> the
>>>>> ones we're already asking for because it may not even have a
>>>>> real
>>>>> directory. The client is required to fake up an inode, hence
>>>>> the
>>>>> existing code.
>>>> 
>>>> Hi Trond, thanks for reviewing the patch!
>>>> Sorry but I didn't understand the reason to NACK it. Could you
>>>> please
>>>> elaborate your concern?
>>>> These are the current attributes we request from the server on a
>>>> referral:
>>>> FATTR4_WORD0_CHANGE
>>>>> FATTR4_WORD0_SIZE
>>>

Re: [PATCH] NFS: Fix mode bits and nlink count for v4 referral dirs

2020-10-15 Thread Chuck Lever
d
about the ugly appearance of referral anchors. The strange "special"
default values made the client appear to be broken, and were confusing
to some. I consider this to be a UX issue: the information displayed
in this case is not meant to be factual, but rather to prevent the
user from concluding that something is wrong.

I'm not attached to this particular solution, though. Does it make
sense to perform the referral mount before returning "ls" results
so that the target server has a chance to supply reasonable
attribute values for the mounted-on directory object? Just spit
balling here.


> Specifically, the paragraph that says:
> 
> "
>   Other attributes SHOULD NOT be made available for absent file
>   systems, even when it is possible to provide them.  The server should
>   not assume that more information is always better and should avoid
>   gratuitously providing additional information."
> 
> So why is the client asking for them?

This paragraph (and its most modern incarnation in RFC 8881 Section
11.4.1) describes server behavior. The current client behavior is
spec-compliant because there is no explicit prohibition in the spec
language against a client requesting additional attributes in this
case.

Either the server can clear those bitmap flags on the GETATTR reply
and not supply those attributes, and clients must be prepared for
that.

Or, it's also possible to read this paragraph to mean that the
server can provide those attributes and the values should not
reflect attributes for the absent file system, but rather something
else (eg, server-manufactured defaults, or the attributes from the
object on the source server).

And since this is a SHOULD NOT rather than a MUST NOT, servers are
still free to return information about the absent file system.
Clients are not guaranteed this will be the case, however.

I don't think c05cefcc7241 makes any assumption about whether the
server is lying about the extra attributes. Perhaps the server has
no better values for these attributes than the client's defaults
were.


--
Chuck Lever





Please pull NFS server fixes for v5.9

2020-09-25 Thread Chuck Lever
The following changes since commit ba4f184e126b751d1bffad5897f263108befc780:

  Linux 5.9-rc6 (2020-09-20 16:33:55 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/cel/cel-2.6.git tags/nfsd-5.9-2

for you to fetch changes up to 13a9a9d74d4d9689ad65938966dbc66386063648:

  SUNRPC: Fix svc_flush_dcache() (2020-09-21 10:13:25 -0400)


Fixes:

- Incorrect calculation on platforms that implement flush_dcache_page()


Chuck Lever (1):
  SUNRPC: Fix svc_flush_dcache()

 net/sunrpc/svcsock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
Chuck Lever





Re: fs/nfsd/nfs4xdr.c:4683:24: sparse: sparse: incorrect type in return expression (different base types)

2020-09-23 Thread Chuck Lever
 + 3) & ~3);
>  4829 if (xdrlen > xdrleft) {
>  4830 if (count == 0) {
>  4831 /*
>  4832  * Can't even fit the first attribute 
> name.
>  4833  */
>  4834 status = nfserr_toosmall;
>  4835 goto out;
>  4836 }
>  4837 eof = 0;
>  4838 goto wreof;
>  4839 }
>  4840 
>  4841 left -= XATTR_USER_PREFIX_LEN;
>  4842 sp += XATTR_USER_PREFIX_LEN;
>  4843 if (nuser++ < offset)
>  4844 goto contloop;
>  4845 
>  4846 
>  4847 p = xdr_reserve_space(xdr, xdrlen);
>  4848 if (!p) {
>  4849 status = nfserr_resource;
>  4850 goto out;
>  4851 }
>  4852 
>  4853 p = xdr_encode_opaque(p, sp, slen);
>  4854 
>  4855 xdrleft -= xdrlen;
>  4856 count++;
>  4857 contloop:
>  4858 sp += slen + 1;
>  4859 left -= slen + 1;
>  4860 }
>  4861 
>  4862 /*
>  4863  * If there were user attributes to copy, but we didn't copy
>  4864  * any, the offset was too large (e.g. the cookie was invalid).
>  4865  */
>  4866 if (nuser > 0 && count == 0) {
>  4867 status = nfserr_badcookie;
>  4868 goto out;
>  4869 }
>  4870 
>  4871 wreof:
>  4872 p = xdr_reserve_space(xdr, 4);
>  4873 if (!p) {
>  4874 status = nfserr_resource;
>  4875 goto out;
>  4876 }
>  4877 *p = cpu_to_be32(eof);
>  4878 
>  4879 cookie = offset + count;
>  4880 
>  4881 write_bytes_to_xdr_buf(xdr->buf, cookie_offset, &cookie, 8);
>> 4882 count = htonl(count);
>  4883 write_bytes_to_xdr_buf(xdr->buf, count_offset, &count, 4);
>  4884 out:
>  4885 if (listxattrs->lsxa_len)
>  4886 kvfree(listxattrs->lsxa_buf);
>  4887 return status;
>  4888 }
>  4889 
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
> <.config.gz>

--
Chuck Lever





Re: [PATCH] SUNRPC: Fix svc_flush_dcache()

2020-09-22 Thread Chuck Lever



> On Sep 22, 2020, at 3:13 AM, He Zhe  wrote:
> 
> 
> 
> On 9/21/20 3:51 AM, Chuck Lever wrote:
>> On platforms that implement flush_dcache_page(), a large NFS WRITE
>> triggers the WARN_ONCE in bvec_iter_advance():
>> 
>> Sep 20 14:01:05 klimt.1015granger.net kernel: Attempted to advance past end 
>> of bvec iter
>> Sep 20 14:01:05 klimt.1015granger.net kernel: WARNING: CPU: 0 PID: 1032 at 
>> include/linux/bvec.h:101 bvec_iter_advance.isra.0+0xa7/0x158 [sunrpc]
>> 
>> Sep 20 14:01:05 klimt.1015granger.net kernel: Call Trace:
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  svc_tcp_recvfrom+0x60c/0x12c7 
>> [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>> bvec_iter_advance.isra.0+0x158/0x158 [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? del_timer_sync+0x4b/0x55
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? test_bit+0x1d/0x27 [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  svc_recv+0x1193/0x15e4 
>> [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>> try_to_freeze.isra.0+0x6f/0x6f [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>> refcount_sub_and_test.constprop.0+0x13/0x40 [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? svc_xprt_put+0x1e/0x29f 
>> [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? svc_send+0x39f/0x3c1 
>> [sunrpc]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  nfsd+0x282/0x345 [nfsd]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? __kthread_parkme+0x74/0xba
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  kthread+0x2ad/0x2bc
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? nfsd_destroy+0x124/0x124 
>> [nfsd]
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? test_bit+0x1d/0x27
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>> kthread_mod_delayed_work+0x115/0x115
>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ret_from_fork+0x22/0x30
>> 
>> Reported-by: He Zhe 
>> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/svcsock.c |2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> Hi Zhe-
>> 
>> If you confirm this fixes your issue and there are no other
>> objections or regressions, I can submit this for v5.9-rc.
> 
> I don't quite get why we add "seek" to "size". It seems this action does not
> reflect the actual scenario and forcedly neutralizes the WARN_ONCE check in
> bvec_iter_advance, so that it may "advance past end of bvec iter" and thus
> introduces overflow.

> Why don't we avoid this problem at the very beginning like my v1? That is, 
> call
> svc_flush_bvec only when we have received more than we want to seek.
> 
> len = sock_recvmsg(svsk->sk_sock, &msg, MSG_DONTWAIT);
> -   if (len > 0)
> +   if (len > 0 && (size_t)len > (seek & PAGE_MASK))
> svc_flush_bvec(bvec, len, seek);

Because this doesn't fix the underlying bug that triggered the
WARN_ONCE.

svc_tcp_recvfrom() attempts to assemble a possibly large RPC Call
from a sequence of sock_recvmsg's.

@seek is the running number of bytes that has been received so
far for the RPC Call we are assembling. @size is the number of
bytes that was just received in the most recent sock_recvmsg.

We want svc_flush_bvec to flush just the area of @bvec that
hasn't been flushed yet.

Thus: the current size of the partial Call message in @bvec is
@seek + @size. The starting location of the flush is
@seek & PAGE_MASK. This aligns the flush so it starts on a page
boundary.

This:

 230 struct bvec_iter bi = {
 231 .bi_size= size + seek,
 232 };

 235 bvec_iter_advance(bvec, &bi, seek & PAGE_MASK);

advances the bvec_iter to the part of @bvec that hasn't been
flushed yet.

This loop:

 236 for_each_bvec(bv, bvec, bi, bi)
 237 flush_dcache_page(bv.bv_page);

flushes each page starting at that point to the end of the bytes
that have been received so far.

In other words, ca07eda33e01 was wrong because it always flushed
the first section of @bvec, never the later parts of it.


> Regards,
> Zhe
> 
>> 
>> 
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index d5805fa1d066..c2752e2b9ce3 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -228,7 +228,7 @@ static int svc_one_sock_name(struct svc_sock *svsk, char 
>> *buf, int remaining)
>> static void svc_flush_bvec(const struct bio_vec *bvec, size_t size, size_t 
>> seek)
>> {
>>  struct bvec_iter bi = {
>> -.bi_size= size,
>> +.bi_size= size + seek,
>>  };
>>  struct bio_vec bv;

--
Chuck Lever





Re: [PATCH 08/14] xprtrdma: drop double zeroing

2020-09-20 Thread Chuck Lever
Thanks, Julia!

> On Sep 20, 2020, at 7:26 AM, Julia Lawall  wrote:
> 
> sg_init_table zeroes its first argument, so the allocation of that argument
> doesn't have to.
> 
> the semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // 
> @@
> expression x,n,flags;
> @@
> 
> x = 
> - kcalloc
> + kmalloc_array
>  (n,sizeof(*x),flags)
> ...
> sg_init_table(x,n)
> // 
> 
> Signed-off-by: Julia Lawall 

Acked-by: Chuck Lever 

This one goes to Anna.


> ---
> net/sunrpc/xprtrdma/frwr_ops.c |2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -u -p a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -124,7 +124,7 @@ int frwr_mr_init(struct rpcrdma_xprt *r_
>   if (IS_ERR(frmr))
>   goto out_mr_err;
> 
> - sg = kcalloc(depth, sizeof(*sg), GFP_NOFS);
> + sg = kmalloc_array(depth, sizeof(*sg), GFP_NOFS);
>   if (!sg)
>   goto out_list_err;
> 
> 

--
Chuck Lever





Re: [PATCH] SUNRPC: Flush dcache only when receiving more seeking

2020-09-18 Thread Chuck Lever



> On Sep 18, 2020, at 8:50 AM, zhe...@windriver.com wrote:
> 
> From: He Zhe 
> 
> commit ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()") introduces
> svc_flush_bvec to after sock_recvmsg, but sometimes we receive less than we
> seek, which triggers the following warning.
> 
> WARNING: CPU: 0 PID: 18266 at include/linux/bvec.h:101 
> bvec_iter_advance+0x44/0xa8
> Attempted to advance past end of bvec iter
> Modules linked in: sch_fq_codel openvswitch nsh nf_conncount nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> CPU: 1 PID: 18266 Comm: nfsd Not tainted 5.9.0-rc5 #1
> Hardware name: Xilinx Zynq Platform
> [<80112ec0>] (unwind_backtrace) from [<8010c3a8>] (show_stack+0x18/0x1c)
> [<8010c3a8>] (show_stack) from [<80755214>] (dump_stack+0x9c/0xd0)
> [<80755214>] (dump_stack) from [<80125e64>] (__warn+0xdc/0xf4)
> [<80125e64>] (__warn) from [<80126244>] (warn_slowpath_fmt+0x84/0xac)
> [<80126244>] (warn_slowpath_fmt) from [<80c88514>] 
> (bvec_iter_advance+0x44/0xa8)
> [<80c88514>] (bvec_iter_advance) from [<80c88940>] 
> (svc_tcp_read_msg+0x10c/0x1bc)
> [<80c88940>] (svc_tcp_read_msg) from [<80c895d4>] 
> (svc_tcp_recvfrom+0x98/0x63c)
> [<80c895d4>] (svc_tcp_recvfrom) from [<80c97bf4>] 
> (svc_handle_xprt+0x48c/0x4f8)
> [<80c97bf4>] (svc_handle_xprt) from [<80c98038>] (svc_recv+0x94/0x1e0)
> [<80c98038>] (svc_recv) from [<804747cc>] (nfsd+0xf0/0x168)
> [<804747cc>] (nfsd) from [<80148a0c>] (kthread+0x144/0x154)
> [<80148a0c>] (kthread) from [<80100114>] (ret_from_fork+0x14/0x20)
> 
> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
> Cc:  # 5.8+
> Signed-off-by: He Zhe 
> ---
> net/sunrpc/svcsock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index d5805fa1d066..ea3bc9635448 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -277,7 +277,7 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, 
> size_t buflen,
>   buflen -= seek;
>   }
>   len = sock_recvmsg(svsk->sk_sock, &msg, MSG_DONTWAIT);
> - if (len > 0)
> + if (len > (seek & PAGE_MASK))

I don't understand how this addresses the WARNING. Can you provide
an example set of inputs that trigger the issue?

Also this change introduces a mixed-sign comparison, so NACK on
this particular patch unless it can be demonstrated that the
implicit type conversion here is benign (I don't think it is,
but I haven't thought through it).
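
To spell out the mixed-sign concern: if I'm reading svc_tcp_read_msg()
correctly, @len is ssize_t and @seek is size_t, so in
"len > (seek & PAGE_MASK)" the usual arithmetic conversions turn the
signed @len into an unsigned value. A negative return from
sock_recvmsg() (say, -EAGAIN) would then compare as a huge positive
number and the flush would run on an error path. Keeping a "len > 0"
check and an explicit cast would side-step that particular problem, but
the bvec accounting itself still has to be right.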


>   svc_flush_bvec(bvec, len, seek);
> 
>   /* If we read a full record, then assume there may be more
> -- 
> 2.17.1
> 

--
Chuck Lever





Please pull NFS server fixes for v5.9

2020-08-25 Thread Chuck Lever
Hi Linus -

The following changes since commit 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5:

 Linux 5.9-rc1 (2020-08-16 13:04:57 -0700)

are available in the Git repository at:

 git://git.linux-nfs.org/projects/cel/cel-2.6.git tags/nfsd-5.9-1

for you to fetch changes up to ad112aa8b1ac4bf5e8da67734fcb535fd3cd564e:

 SUNRPC: remove duplicate include (2020-08-19 13:19:42 -0400)


Fixes:

- Eliminate an oops introduced in v5.8
- Remove a duplicate #include added by nfsd-5.9


J. Bruce Fields (1):
 nfsd: fix oops on mixed NFSv4/NFSv3 client access

Wang Hai (1):
 SUNRPC: remove duplicate include

fs/nfsd/nfs4state.c | 2 ++
net/sunrpc/auth_gss/trace.c | 1 -
2 files changed, 2 insertions(+), 1 deletion(-)

--
Chuck Lever





Re: [PATCH v2] nfsd: Convert to use the preferred fallthrough macro

2020-08-20 Thread Chuck Lever
Hi-

> On Aug 19, 2020, at 10:57 PM, Miaohe Lin  wrote:
> 
> Convert the uses of fallthrough comments to fallthrough macro. Please see
> commit 294f69e662d1 ("compiler_attributes.h: Add 'fallthrough' pseudo
> keyword for switch/case use") for detail.
> 
> Signed-off-by: Hongxiang Lou 
> Signed-off-by: Miaohe Lin 

LGTM. If he also approves, I assume Bruce is taking this one for v5.10.


> ---
> fs/nfsd/nfs4callback.c | 2 +-
> fs/nfsd/nfs4proc.c | 2 +-
> fs/nfsd/nfs4state.c| 2 +-
> fs/nfsd/nfsfh.c| 2 +-
> fs/nfsd/vfs.c  | 4 ++--
> 5 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 7fbe9840a03e..052be5bf9ef5 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1119,7 +1119,7 @@ static bool nfsd4_cb_sequence_done(struct rpc_task 
> *task, struct nfsd4_callback
>   break;
>   case -ESERVERFAULT:
>   ++session->se_cb_seq_nr;
> - /* Fall through */
> + fallthrough;
>   case 1:
>   case -NFS4ERR_BADSESSION:
>   nfsd4_mark_cb_fault(cb->cb_clp, cb->cb_seq_status);
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index a527da3d8052..eaf50eafa935 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -428,7 +428,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   goto out;
>   open->op_openowner->oo_flags |= NFS4_OO_CONFIRMED;
>   reclaim = true;
> - /* fall through */
> + fallthrough;
>   case NFS4_OPEN_CLAIM_FH:
>   case NFS4_OPEN_CLAIM_DELEG_CUR_FH:
>   status = do_open_fhandle(rqstp, cstate, open);
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 81ed8e8bab3f..2f77f4b66cbc 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3117,7 +3117,7 @@ nfsd4_exchange_id(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   break;
>   default:/* checked by xdr code */
>   WARN_ON_ONCE(1);
> - /* fall through */
> + fallthrough;
>   case SP4_SSV:
>   status = nfserr_encr_alg_unsupp;
>   goto out_nolock;
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 37bc8f5f4514..a0a8d27539ae 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -469,7 +469,7 @@ static bool fsid_type_ok_for_exp(u8 fsid_type, struct 
> svc_export *exp)
>   case FSID_UUID16:
>   if (!is_root_export(exp))
>   return false;
> - /* fall through */
> + fallthrough;
>   case FSID_UUID4_INUM:
>   case FSID_UUID16_INUM:
>   return exp->ex_uuid != NULL;
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 7d2933b85b65..aba5af9df328 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1456,7 +1456,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh 
> *fhp,
>   *created = true;
>   break;
>   }
> - /* fall through */
> + fallthrough;
>   case NFS4_CREATE_EXCLUSIVE4_1:
>   if (   d_inode(dchild)->i_mtime.tv_sec == v_mtime
>   && d_inode(dchild)->i_atime.tv_sec == v_atime
> @@ -1465,7 +1465,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh 
> *fhp,
>       *created = true;
>   goto set_attr;
>   }
> - /* fall through */
> + fallthrough;
>   case NFS3_CREATE_GUARDED:
>   err = nfserr_exist;
>   }
> -- 
> 2.19.1
> 

--
Chuck Lever





Re: [PATCH net-next] SUNRPC: remove duplicate include

2020-08-19 Thread Chuck Lever



> On Aug 18, 2020, at 10:49 PM, Wang Hai  wrote:
> 
> Remove linux/sunrpc/auth_gss.h which is included more than once
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Hai 

I've reviewed and compile-tested this, so no objection from me.

Since this duplicate was introduced in nfsd-5.9, I can take this
for an nfsd-5.9-rc pull, if there are no other objections.


> ---
> net/sunrpc/auth_gss/trace.c | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/net/sunrpc/auth_gss/trace.c b/net/sunrpc/auth_gss/trace.c
> index d26036a57443..76685abba60f 100644
> --- a/net/sunrpc/auth_gss/trace.c
> +++ b/net/sunrpc/auth_gss/trace.c
> @@ -9,7 +9,6 @@
> #include 
> #include 
> #include 
> -#include 
> 
> #define CREATE_TRACE_POINTS
> #include 
> -- 
> 2.17.1
> 

--
Chuck Lever





Re: [PATCH] nfsd: Convert to use the preferred fallthrough macro

2020-08-19 Thread Chuck Lever



> On Aug 19, 2020, at 5:26 AM, Miaohe Lin  wrote:
> 
> Convert the uses of fallthrough comments to fallthrough macro.

The patch description would be more helpful if it referenced the
commit that added the fallthrough macro to the kernel, or a
permanent mailing list link that described its appropriate usage.

Thanks!

> Signed-off-by: Hongxiang Lou 
> Signed-off-by: Miaohe Lin 
> ---
> fs/nfsd/nfs4callback.c | 2 +-
> fs/nfsd/nfs4proc.c | 2 +-
> fs/nfsd/nfs4state.c| 2 +-
> fs/nfsd/nfsfh.c| 2 +-
> fs/nfsd/vfs.c  | 4 ++--
> 5 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 7fbe9840a03e..052be5bf9ef5 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1119,7 +1119,7 @@ static bool nfsd4_cb_sequence_done(struct rpc_task 
> *task, struct nfsd4_callback
>   break;
>   case -ESERVERFAULT:
>   ++session->se_cb_seq_nr;
> - /* Fall through */
> + fallthrough;
>   case 1:
>   case -NFS4ERR_BADSESSION:
>   nfsd4_mark_cb_fault(cb->cb_clp, cb->cb_seq_status);
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index a527da3d8052..eaf50eafa935 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -428,7 +428,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   goto out;
>   open->op_openowner->oo_flags |= NFS4_OO_CONFIRMED;
>   reclaim = true;
> - /* fall through */
> + fallthrough;
>   case NFS4_OPEN_CLAIM_FH:
>   case NFS4_OPEN_CLAIM_DELEG_CUR_FH:
>   status = do_open_fhandle(rqstp, cstate, open);
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 81ed8e8bab3f..2f77f4b66cbc 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3117,7 +3117,7 @@ nfsd4_exchange_id(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   break;
>   default:/* checked by xdr code */
>   WARN_ON_ONCE(1);
> - /* fall through */
> + fallthrough;
>   case SP4_SSV:
>   status = nfserr_encr_alg_unsupp;
>   goto out_nolock;
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 37bc8f5f4514..a0a8d27539ae 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -469,7 +469,7 @@ static bool fsid_type_ok_for_exp(u8 fsid_type, struct 
> svc_export *exp)
>   case FSID_UUID16:
>   if (!is_root_export(exp))
>   return false;
> - /* fall through */
> + fallthrough;
>   case FSID_UUID4_INUM:
>   case FSID_UUID16_INUM:
>   return exp->ex_uuid != NULL;
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 7d2933b85b65..aba5af9df328 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1456,7 +1456,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh 
> *fhp,
>   *created = true;
>   break;
>   }
> - /* fall through */
> + fallthrough;
>   case NFS4_CREATE_EXCLUSIVE4_1:
>   if (   d_inode(dchild)->i_mtime.tv_sec == v_mtime
>   && d_inode(dchild)->i_atime.tv_sec == v_atime
> @@ -1465,7 +1465,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh 
> *fhp,
>   *created = true;
>   goto set_attr;
>   }
> - /* fall through */
> + fallthrough;
>   case NFS3_CREATE_GUARDED:
>   err = nfserr_exist;
>   }
> -- 
> 2.19.1
> 

--
Chuck Lever





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-14 Thread Chuck Lever



> On Aug 13, 2020, at 11:10 AM, James Bottomley 
>  wrote:
> 
> On Thu, 2020-08-13 at 10:42 -0400, Chuck Lever wrote:
>>> On Aug 12, 2020, at 11:51 AM, James Bottomley >> enPartnership.com> wrote:
>>> On Wed, 2020-08-12 at 10:15 -0400, Chuck Lever wrote:
>>>>> On Aug 11, 2020, at 11:53 AM, James Bottomley
>>>>>  wrote:
>>>>> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
> [...]
>>>>>>>> The client would have to reconstruct that tree again if
>>>>>>>> memory pressure caused some or all of the tree to be
>>>>>>>> evicted, so perhaps an on-demand mechanism is preferable.
>>>>>>> 
>>>>>>> Right, but I think that's implementation detail.  Probably
>>>>>>> what we need is a way to get the log(N) verification hashes
>>>>>>> from the server and it's up to the client whether it caches
>>>>>>> them or not.
>>>>>> 
>>>>>> Agreed, these are implementation details. But see above about
>>>>>> the trustworthiness of the intermediate hashes. If they are
>>>>>> conveyed on an untrusted network, then they can't be trusted
>>>>>> either.
>>>>> 
>>>>> Yes, they can, provided enough of them are asked for to
>>>>> verify.  If you look at the simple example above, suppose I
>>>>> have cached H11 and H12, but I've lost the entire H2X layer.  I
>>>>> want to verify B3 so I also ask you for your copy of H24.  Then
>>>>> I generate H23 from B3 and Hash H23 and H24.  If this doesn't
>>>>> hash to H12 I know either you supplied me the wrong block or
>>>>> lied about H24.  However, if it all hashes correctly I know you
>>>>> supplied me with both the correct B3 and the correct H24.
>>>> 
>>>> My point is there is a difference between a trusted cache and an
>>>> untrusted cache. I argue there is not much value in a cache where
>>>> the hashes have to be verified again.
>>> 
>>> And my point isn't about caching, it's about where the tree comes
>>> from. I claim and you agree the client can get the tree from the
>>> server a piece at a time (because it can path verify it) and
>>> doesn't have to generate it itself.
>> 
>> OK, let's focus on where the tree comes from. It is certainly
>> possible to build protocol to exchange parts of a Merkle tree.
> 
> Which is what I think we need to extend IMA to do.
> 
>> The question is how it might be stored on the server.
> 
> I think the only thing the server has to guarantee to store is the head
> hash, possibly signed.
> 
>> There are some underlying assumptions about the metadata storage
>> mechanism that should be stated up front.
>> 
>> Current forms of IMA metadata are limited in size and stored in a
>> container that is read and written in a single operation. If we stick
>> with that container format, I don't see a way to store a Merkle tree
>> in there for all file sizes.
> 
> Well, I don't think you need to.  The only thing that needs to be
> stored is the head hash.  Everything else can be reconstructed.  If you
> asked me to implement it locally, I'd probably put the head hash in an
> xattr but use a CAM based cache for the merkel trees and construct the
> tree on first access if it weren't already in the cache.

The contents of the security.ima xattr might be modeled after
EVM_IMA_DIGSIG:

- a format enumerator (used by all IMA metadata formats)
- the tree's unit size
- a fingerprint of the signer's certificate
  - digest algorithm name and full digest
- the root hash, always signed
  - signing algorithm name and signature

The rest of the hash tree is always stored somewhere else or
constructed on-demand.
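
Purely to make that field list concrete, a hypothetical layout might
look like the struct below. This is not an existing kernel structure or
on-disk format, just an illustration of the fields above:

    struct ima_merkle_root_hdr {
            u8      format;          /* IMA metadata format enumerator */
            u32     unit_size;       /* Merkle tree unit (block) size */
            u8      cert_hash_algo;  /* digest algo for the signer fingerprint */
            u8      cert_digest[64]; /* fingerprint of the signer's certificate */
            u8      root_hash_algo;  /* digest algo used for the tree */
            u8      sig_algo;        /* signature algorithm */
            u16     sig_len;         /* length of the signature that follows */
            u8      data[];          /* root hash followed by its signature */
    } __packed;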

My experience of security communities both within and outside the
IETF is that they would insist on always having a signature.

If one doesn't care about signing, a self-signed certificate can be
automatically provisioned when ima-evm-utils is installed that can
be used for those cases. That would make the signature process
invisible to any administrator who doesn't care about signed
metadata.

Because storage in NFS would cross trust boundaries, it would have
to require the use of a signed root hash. I don't want to be in the
position where copying a file with an unsigned root hash into NFS
makes it unreadable because of a change in policy.


> However, the above isn't what fs-verity does: it stores the tree in a
> hidden section of the file.  That's why I don't think we'd mandate

Re: [PATCH 1/2] nfsd: Remove unnecessary assignment in nfs4xdr.c

2020-08-13 Thread Chuck Lever



> On Aug 12, 2020, at 4:36 PM, Frank van der Linden  wrote:
> 
> On Wed, Aug 12, 2020 at 03:12:51PM +0100, Alex Dewar wrote:
>> 
>> In nfsd4_encode_listxattrs(), the variable p is assigned to at one point
>> but this value is never used before p is reassigned. Fix this.
>> 
>> Addresses-Coverity: ("Unused value")
>> Signed-off-by: Alex Dewar 
>> ---
>> fs/nfsd/nfs4xdr.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>> index 259d5ad0e3f47..1a0341fd80f9a 100644
>> --- a/fs/nfsd/nfs4xdr.c
>> +++ b/fs/nfsd/nfs4xdr.c
>> @@ -4859,7 +4859,7 @@ nfsd4_encode_listxattrs(struct nfsd4_compoundres 
>> *resp, __be32 nfserr,
>>goto out;
>>}
>> 
>> -   p = xdr_encode_opaque(p, sp, slen);
>> +   xdr_encode_opaque(p, sp, slen);
>> 
>>xdrleft -= xdrlen;
>>count++;
>> --
>> 2.28.0
>> 
> 
> Yep, I guess my linting missed that, thanks for the fix.

Bruce, these two don't appear to be urgent, so I'm deferring them
to you for v5.10.


--
Chuck Lever





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-13 Thread Chuck Lever



> On Aug 13, 2020, at 10:42 AM, James Bottomley 
>  wrote:
> 
> On Thu, 2020-08-13 at 10:21 -0400, Chuck Lever wrote:
>>> On Aug 12, 2020, at 11:42 AM, James Bottomley >> enPartnership.com> wrote:
> [...]
>>> For most people the security mechanism of local xattrs is
>>> sufficient.  If you're paranoid, you don't believe it is and you
>>> use EVM.
>> 
>> When IMA metadata happens to be stored in local filesystems in
>> a trusted xattr, it's going to enjoy the protection you describe
>> without needing the addition of a cryptographic signature.
>> 
>> However, that metadata doesn't live its whole life there. It
>> can reside in a tar file, it can cross a network, it can live
>> on a back-up tape. I think we agree that any time that metadata
>> is in transit or at rest outside of a Linux local filesystem, it
>> is exposed.
>> 
>> Thus I'm interested in a metadata protection mechanism that does
>> not rely on the security characteristics of a particular storage
>> container. For me, a cryptographic signature fits that bill
>> nicely.
> 
> Sure, but one of the points about IMA is a separation of mechanism from
> policy.  Signed hashes (called appraisal in IMA terms) is just one
> policy you can decide to require or not or even make it conditional on
> other things.

AFAICT, the current EVM_IMA_DIGSIG and EVM_PORTABLE_DIGSIG formats are
always signed. The policy choice is whether or not to verify the
signature, not whether or not the metadata format is signed.


--
Chuck Lever
chuckle...@gmail.com





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-13 Thread Chuck Lever



> On Aug 12, 2020, at 11:51 AM, James Bottomley 
>  wrote:
> 
> On Wed, 2020-08-12 at 10:15 -0400, Chuck Lever wrote:
>>> On Aug 11, 2020, at 11:53 AM, James Bottomley
>>>  wrote:
>>> 
>>> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
> [...]
>>>>> 
>>>>> and what is nice to have to speed up the verification
>>>>> process.  The choice for the latter is cache or reconstruct
>>>>> depending on the resources available.  If the tree gets cached
>>>>> on the server, that would be a server implementation detail
>>>>> invisible to the client.
>>>> 
>>>> We assume that storage targets (for block or file) are not
>>>> trusted. Therefore storage clients cannot rely on intermediate
>>>> results (eg, middle nodes in a Merkle tree) unless those results
>>>> are generated within the client's trust envelope.
>>> 
>>> Yes, they can ... because supplied nodes can be verified.  That's
>>> the whole point of a merkle tree.  As long as I'm sure of the root
>>> hash I can verify all the rest even if supplied by an untrusted
>>> source.  If you consider a simple merkle tree covering 4 blocks:
>>> 
>>>  R
>>>/   \
>>> H11 H12
>>> / \ / \
>>> H21 H22 H23 H24
>>> |   |   |   |
>>> 
>>> B1   B2  B3  B4
>>> 
>>> Assume I have the verified root hash R.  If you supply B3 you also
>>> supply H24 and H11 as proof.  I verify by hashing B3 to produce H23
>>> then hash H23 and H24 to produce H12 and if H12 and your supplied
>>> H11 hash to R the tree is correct and the B3 you supplied must
>>> likewise be correct.
>> 
>> I'm not sure what you are proving here. Obviously this has to work
>> in order for a client to reconstruct the file's Merkle tree given
>> only R and the file content.
> 
> You implied the server can't be trusted to generate the merkel tree. 
> I'm showing above it can because of the tree path based verification.

What I was implying is that clients can't trust intermediate Merkle
tree content that is not also signed. So far we are talking about
signing only the tree root.

The storage server can store the tree durably, but if the intermediate
parts of the tree are not signed, the client has to verify them anyway,
and that reduces the value of storing potentially large data structures.
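
To make the per-read work concrete, path verification for James's
four-block example amounts to the sketch below. The sha256() and
sha256_pair() helpers are illustrative (sha256_pair() hashes the
concatenation of two digests); none of this is existing kernel code:

    /*
     * Verify leaf B3 given the trusted (signed) root and the
     * server-supplied proof nodes H24 and H11.
     */
    static bool verify_b3(const u8 *b3, size_t len, const u8 *h24,
                          const u8 *h11, const u8 *signed_root)
    {
            u8 h23[SHA256_DIGEST_SIZE];
            u8 h12[SHA256_DIGEST_SIZE];
            u8 root[SHA256_DIGEST_SIZE];

            sha256(b3, len, h23);        /* hash the block itself */
            sha256_pair(h23, h24, h12);  /* rebuild the interior node */
            sha256_pair(h11, h12, root); /* rebuild the candidate root */

            return memcmp(root, signed_root, SHA256_DIGEST_SIZE) == 0;
    }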


>> It's the construction of the tree and verification of the hashes that
>> are potentially expensive. The point of caching intermediate hashes
>> is so that the client verifies them as few times as possible.  I
>> don't see value in caching those hashes on an untrusted server --
>> the client will have to reverify them anyway, and there will be no
>> savings.
> 
> I'm not making any claim about server caching, I'm just saying the
> client can request pieces of the tree from the server without having to
> reconstruct the whole thing itself because it can verify their
> correctness.

To be clear, my concern is about how much of the tree might be stored
in a Merkle-based metadata format. I just don't see that it has much
value to store more than the signed tree root, because the client will
have to reconstitute or verify some tree contents on most every read.

For sufficiently large files, the tree itself can be larger than what
can be stored in an xattr. This is the same problem that fs-verity
faces. And, as I stated earlier, xattr objects are read in their
entirety; they can't be seeked into or read piecemeal.

What it seemed to me that you were suggesting was an offloaded cache
of the Merkle tree. Either the whole tree is stored on the storage
server, or the storage server provides a service that reconstitutes
that tree on behalf of clients. (Please correct me if I misunderstood).
I just don't think that will be practicable or provide the kind of
benefit you might want.


>> Cache once, as close as you can to where the data will be used.
>> 
>> 
>>>> So: if the storage target is considered inside the client's trust
>>>> envelope, it can cache or store durably any intermediate parts of
>>>> the verification process. If not, the network and file storage is
>>>> considered untrusted, and the client has to rely on nothing but
>>>> the signed digest of the tree root.
>>>> 
>>>> We could build a scheme around, say, fscache, that might save the
>>>> intermediate results durably and locally.
>>> 
>>> I agree we want caching on the client, but we can always page in
>>> from the remote as long as we page enough t

Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-13 Thread Chuck Lever



> On Aug 12, 2020, at 11:42 AM, James Bottomley 
>  wrote:
> 
> On Wed, 2020-08-12 at 09:56 -0400, Chuck Lever wrote:
>>> On Aug 11, 2020, at 2:28 PM, James Bottomley wrote:
>>> 
>>> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
>>>> Mimi's earlier point is that any IMA metadata format that
>>>> involves unsigned digests is exposed to an alteration attack at
>>>> rest or in transit, thus will not provide a robust end-to-end
>>>> integrity guarantee.
>>> 
>>> I don't believe that is Mimi's point, because it's mostly not
>>> correct: the xattr mechanism does provide this today.  The point is
>>> the mechanism we use for storing IMA hashes and signatures today is
>>> xattrs because they have robust security properties for local
>>> filesystems that the kernel enforces.  This use goes beyond IMA,
>>> selinux labels for instance use this property as well.
>> 
>> I don't buy this for a second. If storing a security label in a
>> local xattr is so secure, we wouldn't have any need for EVM.
> 
> What don't you buy?  Security xattrs can only be updated by local root.
> If you trust local root, the xattr mechanism is fine ... it's the only
> one a lot of LSMs use, for instance.  If you don't trust local root or
> worry about offline backups, you use EVM.  A thing isn't secure or
> insecure, it depends on the threat model.  However, if you don't trust
> the NFS server it doesn't matter whether you do or don't trust local
> root, you can't believe the contents of the xattr.
> 
>>> What I think you're saying is that NFS can't provide the robust
>>> security for xattrs we've been relying on, so you need some other
>>> mechanism for storing them.
>> 
>> For NFS, there's a network traversal which is an attack surface.
>> 
>> A local xattr can be attacked as well: a device or bus malfunction
>> can corrupt the content of an xattr, or a privileged user can modify
>> it.
>> 
>> How does that metadata get from the software provider to the end
>> user? It's got to go over a network, stored in various ways, some
>> of which will not be trusted. To attain an unbroken chain of
>> provenance, that metadata has to be signed.
>> 
>> I don't think the question is the storage mechanism, but rather the
>> protection mechanism. Signing the metadata protects it in all of
>> these cases.
> 
> I think we're saying about the same thing.

Roughly.


> For most people the
> security mechanism of local xattrs is sufficient.  If you're paranoid,
> you don't believe it is and you use EVM.

When IMA metadata happens to be stored in local filesystems in
a trusted xattr, it's going to enjoy the protection you describe
without needing the addition of a cryptographic signature.

However, that metadata doesn't live its whole life there. It
can reside in a tar file, it can cross a network, it can live
on a back-up tape. I think we agree that any time that metadata
is in transit or at rest outside of a Linux local filesystem, it
is exposed.

Thus I'm interested in a metadata protection mechanism that does
not rely on the security characteristics of a particular storage
container. For me, a cryptographic signature fits that bill
nicely.


>>> I think Mimi's other point is actually that IMA uses a flat hash
>>> which we derive by reading the entire file and then watching for
>>> mutations. Since you cannot guarantee we get notice of mutation
>>> with NFS, the entire IMA mechanism can't really be applied in its
>>> current form and we have to resort to chunk at a time verifications
>>> that a Merkel tree would provide.
>> 
>> I'm not sure what you mean by this. An NFS client relies on
>> notification of mutation to maintain the integrity of its cache of
>> NFS file content, and it's done that since the 1980s.
> 
> Mutation detection is part of the current IMA security model.  If IMA
> sees a file mutate it has to be rehashed the next time it passes the
> gate.  If we can't trust the NFS server, we can't trust the NFS
> mutation notification and we have to have a different mechanism to
> check the file.

When an NFS server lies about mtime and ctime, then NFS is completely
broken. Untrusted NFS server doesn't mean "broken behavior" -- I
would think that local filesystems will have the same problem if
they can't trust a local block device to store filesystem metadata
like indirect blocks and timestamps.

It's not clear to me that IMA as currently implemented can protect
against broken storage devices or incorrect filesystem behavior.


>> In addition to examining a file's mtime and ctim

Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-12 Thread Chuck Lever



> On Aug 11, 2020, at 11:32 AM, James Bottomley 
>  wrote:
> 
> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
>>> On Aug 11, 2020, at 1:43 AM, James Bottomley
>>>  wrote:
>>> On Mon, 2020-08-10 at 19:36 -0400, Chuck Lever wrote:
> [...]
>>>> Thanks for the help! I just want to emphasize that documentation
>>>> (eg, a specification) will be critical for remote filesystems.
>>>> 
>>>> If any of this is to be supported by a remote filesystem, then we
>>>> need an unencumbered description of the new metadata format
>>>> rather than code. GPL-encumbered formats cannot be contributed to
>>>> the NFS standard, and are probably difficult for other
>>>> filesystems that are not Linux-native, like SMB, as well.
>>> 
>>> I don't understand what you mean by GPL encumbered formats.  The
>>> GPL is a code licence not a data or document licence.
>> 
>> IETF contributions occur under a BSD-style license incompatible
>> with the GPL.
>> 
>> https://trustee.ietf.org/trust-legal-provisions.html
>> 
>> Non-Linux implementers (of OEM storage devices) rely on such
>> standards processes to indemnify them against licensing claims.
> 
> Well, that simply means we won't be contributing the Linux
> implementation, right?

At the present time, there is nothing but the Linux implementation.
There's no English description, there's no specification of the
formats, the format is described only by source code.

The only way to contribute current IMA metadata formats to an open
standards body like the IETF is to look at encumbered code first.
We would effectively be contributing an implementation in this case.

(I'm not saying the current formats should or should not be
contributed; merely that there is a legal stumbling block to doing
so that can be avoided for newly defined formats).


> Well, let me put the counterpoint: I can write a book about how linux
> device drivers work (which includes describing the data formats)


Our position is that someone who reads that book and implements those
formats under a non-GPL-compatible license would be in breach of the
GPL.

The point of the standards process is to indemnify implementing
and distributing under _any_ license what has been published by the
standards body. That legally enables everyone to use the published
protocol/format in their own code no matter how it happens to be
licensed.


> Fine, good grief, people who take a sensible view of this can write the
> data format down and publish it under any licence you like then you can
> pick it up again safely.


That's what I proposed. Write it down under the IETF Trust legal
provisions license. And I volunteered to do that.

All I'm saying is that description needs to come before code.


--
Chuck Lever
chuckle...@gmail.com





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-12 Thread Chuck Lever



> On Aug 11, 2020, at 5:03 PM, James Morris  wrote:
> 
> On Sat, 8 Aug 2020, Chuck Lever wrote:
> 
>> My interest is in code integrity enforcement for executables stored
>> in NFS files.
>> 
>> My struggle with IPE is that due to its dependence on dm-verity, it
>> does not seem to able to protect content that is stored separately
>> from its execution environment and accessed via a file access
>> protocol (FUSE, SMB, NFS, etc).
> 
> It's not dependent on DM-Verity, that's just one possible integrity 
> verification mechanism, and one of two supported in this initial 
> version. The other is 'boot_verified' for a verified or otherwise trusted 
> rootfs. Future versions will support FS-Verity, at least.
> 
> IPE was designed to be extensible in this way, with a strong separation of 
> mechanism and policy.

I got that, but it looked to me like the whole system relied on having
access to the block device under the filesystem. That's not possible
for a remote filesystem like Ceph or NFS.

I'm happy to take a closer look if someone can point me the right way.


--
Chuck Lever
chuckle...@gmail.com





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-12 Thread Chuck Lever



> On Aug 11, 2020, at 11:53 AM, James Bottomley 
>  wrote:
> 
> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
>>> On Aug 11, 2020, at 1:43 AM, James Bottomley wrote:
>>> 
>>> On Mon, 2020-08-10 at 19:36 -0400, Chuck Lever wrote:
>>>>> On Aug 10, 2020, at 11:35 AM, James Bottomley
>>>>>  wrote:
> [...]
>>>>> The first basic is that a merkle tree allows unit at a time
>>>>> verification. First of all we should agree on the unit.  Since
>>>>> we always fault a page at a time, I think our merkle tree unit
>>>>> should be a page not a block.
>>>> 
>>>> Remote filesystems will need to agree that the size of that unit
>>>> is the same everywhere, or the unit size could be stored in the
>>>> per-file metadata.
>>>> 
>>>> 
>>>>> Next, we should agree where the check gates for the per page
>>>>> accesses should be ... definitely somewhere in readpage, I
>>>>> suspect and finally we should agree how the merkle tree is
>>>>> presented at the gate.  I think there are three ways:
>>>>> 
>>>>> 1. Ahead of time transfer:  The merkle tree is transferred and
>>>>> verified
>>>>>at some time before the accesses begin, so we already have
>>>>> a
>>>>>verified copy and can compare against the lower leaf.
>>>>> 2. Async transfer:  We provide an async mechanism to transfer
>>>>> the
>>>>>necessary components, so when presented with a unit, we
>>>>> check the
>>>>>log n components required to get to the root
>>>>> 3. The protocol actually provides the capability of 2 (like
>>>>> the SCSI
>>>>>DIF/DIX), so to IMA all the pieces get presented instead of
>>>>> IMA
>>>>>having to manage the tree
>>>> 
>>>> A Merkle tree is potentially large enough that it cannot be
>>>> stored in an extended attribute. In addition, an extended
>>>> attribute is not a byte stream that you can seek into or read
>>>> small parts of, it is retrieved in a single shot.
>>> 
>>> Well you wouldn't store the tree would you, just the head
>>> hash.  The rest of the tree can be derived from the data.  You need
>>> to distinguish between what you *must* have to verify integrity
>>> (the head hash, possibly signed)
>> 
>> We're dealing with an untrusted storage device, and for a remote
>> filesystem, an untrusted network.
>> 
>> Mimi's earlier point is that any IMA metadata format that involves
>> unsigned digests is exposed to an alteration attack at rest or in
>> transit, thus will not provide a robust end-to-end integrity
>> guarantee.
>> 
>> Therefore, tree root digests must be cryptographically signed to be
>> properly protected in these environments. Verifying that signature
>> should be done infrequently relative to reading a file's content.
> 
> I'm not disagreeing there has to be a way for the relying party to
> trust the root hash.
> 
>>> and what is nice to have to speed up the verification
>>> process.  The choice for the latter is cache or reconstruct
>>> depending on the resources available.  If the tree gets cached on
>>> the server, that would be a server implementation detail invisible
>>> to the client.
>> 
>> We assume that storage targets (for block or file) are not trusted.
>> Therefore storage clients cannot rely on intermediate results (eg,
>> middle nodes in a Merkle tree) unless those results are generated
>> within the client's trust envelope.
> 
> Yes, they can ... because supplied nodes can be verified.  That's the
> whole point of a merkle tree.  As long as I'm sure of the root hash I
> can verify all the rest even if supplied by an untrusted source.  If
> you consider a simple merkle tree covering 4 blocks:
> 
>              R
>            /   \
>         H11     H12
>         / \     / \
>       H21 H22 H23 H24
>        |   |   |   |
>       B1  B2  B3  B4
> 
> Assume I have the verified root hash R.  If you supply B3 you also
> supply H24 and H11 as proof.  I verify by hashing B3 to produce H23
> then hash H23 and H24 to produce H12 and if H12 and your supplied H11
> hash to R the tree is correct and the B3 you supplied must likewise be
> correct.
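
As an illustrative aside, the proof walk described above can be
sketched in a few lines of userspace C. The hash below is a toy
stand-in (FNV-1a) for a real cryptographic digest such as SHA-256,
and every name here is invented for illustration:

/*
 * Sketch of the proof check: given a trusted root R, a leaf block,
 * and one sibling hash per level (H24 then H11 in the 4-block
 * example), recompute upward and compare against R.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint64_t toy_hash(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a: NOT cryptographic */

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

static uint64_t hash_pair(uint64_t left, uint64_t right)
{
	uint64_t pair[2] = { left, right };

	return toy_hash(pair, sizeof(pair));
}

/*
 * @leaf:     the data block being read (B3 in the example)
 * @index:    its 0-based position among the leaves
 * @siblings: one sibling hash per level, leaf level first
 * @root:     the trusted root hash R
 */
static int verify_leaf(const void *leaf, size_t len, unsigned int index,
		       const uint64_t *siblings, unsigned int depth,
		       uint64_t root)
{
	uint64_t h = toy_hash(leaf, len);
	unsigned int i;

	for (i = 0; i < depth; i++) {
		if (index & 1)			/* we are the right child */
			h = hash_pair(siblings[i], h);
		else				/* we are the left child */
			h = hash_pair(h, siblings[i]);
		index >>= 1;
	}
	return h == root;
}

int main(void)
{
	const char *blocks[4] = { "B1", "B2", "B3", "B4" };
	uint64_t h2[4], h1[2], root, proof[2];
	int i;

	/* Build the 4-block tree from the example. */
	for (i = 0; i < 4; i++)
		h2[i] = toy_hash(blocks[i], strlen(blocks[i]));
	h1[0] = hash_pair(h2[0], h2[1]);	/* H11 */
	h1[1] = hash_pair(h2[2], h2[3]);	/* H12 */
	root = hash_pair(h1[0], h1[1]);		/* R */

	/* Proof for B3 (index 2): sibling H24, then sibling H11. */
	proof[0] = h2[3];
	proof[1] = h1[0];

	printf("B3 verifies: %s\n",
	       verify_leaf("B3", 2, 2, proof, 2, root) ? "yes" : "no");
	return 0;
}

The same walk works for any leaf: the proof is one sibling hash per
level, so its size grows with the log of the number of units rather
than with the file size.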

I'm not sure what you are proving here. Obviously this has to work
in order for a client to reconstruct the file's Merkle tree given
only R and the fi

Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-12 Thread Chuck Lever



> On Aug 11, 2020, at 2:28 PM, James Bottomley 
>  wrote:
> 
> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
>> Mimi's earlier point is that any IMA metadata format that involves
>> unsigned digests is exposed to an alteration attack at rest or in
>> transit, thus will not provide a robust end-to-end integrity
>> guarantee.
> 
> I don't believe that is Mimi's point, because it's mostly not correct:
> the xattr mechanism does provide this today.  The point is the
> mechanism we use for storing IMA hashes and signatures today is xattrs
> because they have robust security properties for local filesystems that
> the kernel enforces.  This use goes beyond IMA, selinux labels for
> instance use this property as well.

I don't buy this for a second. If storing a security label in a
local xattr is so secure, we wouldn't have any need for EVM.


> What I think you're saying is that NFS can't provide the robust
> security for xattrs we've been relying on, so you need some other
> mechanism for storing them.

For NFS, there's a network traversal which is an attack surface.

A local xattr can be attacked as well: a device or bus malfunction
can corrupt the content of an xattr, or a privileged user can modify
it.

How does that metadata get from the software provider to the end
user? It's got to go over a network, stored in various ways, some
of which will not be trusted. To attain an unbroken chain of
provenance, that metadata has to be signed.

I don't think the question is the storage mechanism, but rather the
protection mechanism. Signing the metadata protects it in all of
these cases.


> I think Mimi's other point is actually that IMA uses a flat hash which
> we derive by reading the entire file and then watching for mutations. 
> Since you cannot guarantee we get notice of mutation with NFS, the
> entire IMA mechanism can't really be applied in its current form and we
> have to resort to chunk at a time verifications that a Merkel tree
> would provide.

I'm not sure what you mean by this. An NFS client relies on notification
of mutation to maintain the integrity of its cache of NFS file content,
and it's done that since the 1980s.

In addition to examining a file's mtime and ctime as maintained by
the NFS server, a client can rely on the file's NFSv4 change attribute
or an NFSv4 delegation.
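
To illustrate that point, here is a simplified model of
change-attribute-based revalidation (the structure and names are
invented for illustration; this is not actual NFS client code): the
client records the change attribute alongside its cached data and
drops the cache whenever a fresh GETATTR shows a different value.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model only: per-file cache state keyed by the change attribute. */
struct cached_file {
	uint64_t change_attr;	/* server's change attribute at cache time */
	bool     pages_valid;	/* whether cached content may be used */
};

static void revalidate(struct cached_file *cf, uint64_t server_change)
{
	if (cf->change_attr != server_change) {
		/* Server-side mutation detected: cached content (and any
		 * integrity state derived from it) must be discarded. */
		cf->pages_valid = false;
		cf->change_attr = server_change;
	}
}

int main(void)
{
	struct cached_file cf = { .change_attr = 7, .pages_valid = true };

	revalidate(&cf, 7);	/* unchanged: cache remains usable */
	revalidate(&cf, 8);	/* mutation: cache dropped */
	printf("pages_valid=%d change_attr=%llu\n",
	       cf.pages_valid, (unsigned long long)cf.change_attr);
	return 0;
}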


> Doesn't this make moot any thinking about
> standardisation in NFS for the current IMA flat hash mechanism because
> we simply can't use it ... If I were to construct a prototype I'd have
> to work out and securely cache the hash of ever chunk when verifying
> the flat hash so I could recheck on every chunk read.  I think that's
> infeasible for large files.
> 
> James
> 

--
Chuck Lever
chuckle...@gmail.com





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-11 Thread Chuck Lever



> On Aug 11, 2020, at 1:43 AM, James Bottomley 
>  wrote:
> 
> On Mon, 2020-08-10 at 19:36 -0400, Chuck Lever wrote:
>>> On Aug 10, 2020, at 11:35 AM, James Bottomley
>>>  wrote:
>>> On Sun, 2020-08-09 at 13:16 -0400, Mimi Zohar wrote:
>>>> On Sat, 2020-08-08 at 13:47 -0400, Chuck Lever wrote:
> [...]
>>>>> The first priority (for me, anyway) therefore is getting the
>>>>> ability to move IMA metadata between NFS clients and servers
>>>>> shoveled into the NFS protocol, but that's been blocked for
>>>>> various legal reasons.
>>>> 
>>>> Up to now, verifying remote filesystem file integrity has been
>>>> out of scope for IMA.   With fs-verity file signatures I can at
>>>> least grasp how remote file integrity could possibly work.  I
>>>> don't understand how remote file integrity with existing IMA
>>>> formats could be supported. You might want to consider writing a
>>>> whitepaper, which could later be used as the basis for a patch
>>>> set cover letter.
>>> 
>>> I think, before this, we can help with the basics (and perhaps we
>>> should sort them out before we start documenting what we'll do).
>> 
>> Thanks for the help! I just want to emphasize that documentation
>> (eg, a specification) will be critical for remote filesystems.
>> 
>> If any of this is to be supported by a remote filesystem, then we
>> need an unencumbered description of the new metadata format rather
>> than code. GPL-encumbered formats cannot be contributed to the NFS
>> standard, and are probably difficult for other filesystems that are
>> not Linux-native, like SMB, as well.
> 
> I don't understand what you mean by GPL encumbered formats.  The GPL is
> a code licence not a data or document licence.

IETF contributions occur under a BSD-style license incompatible
with the GPL.

https://trustee.ietf.org/trust-legal-provisions.html

Non-Linux implementers (of OEM storage devices) rely on such standards
processes to indemnify them against licensing claims.

Today, there is no specification for existing IMA metadata formats,
there is only code. My lawyer tells me that because the code that
implements these formats is under GPL, the formats themselves cannot
be contributed to, say, the IETF without express permission from the
authors of that code. There are a lot of authors of the Linux IMA
code, so this is proving to be an impediment to contribution. That
blocks the ability to provide a fully-specified NFS protocol
extension to support IMA metadata formats.


> The way the spec
> process works in Linux is that we implement or evolve a data format
> under a GPL implementaiton, but that implementation doesn't implicate
> the later standardisation of the data format and people are free to
> reimplement under any licence they choose.

That technology transfer can happen only if all the authors of that
prototype agree to contribute to a standard. That's much easier if
that agreement comes before an implementation is done. The current
IMA code base is more than a decade old, and there are more than a
hundred authors who have contributed to that base.

Thus IMO we want an unencumbered description of any IMA metadata
format that is to be contributed to an open standards body (as it
would have to be to extend, say, the NFS protocol).

I'm happy to write that specification, as long as any contributions
here are unencumbered and acknowledged. In fact, I have been working
on a (so far, flawed) NFS extension:

https://datatracker.ietf.org/doc/draft-ietf-nfsv4-integrity-measurement/

Now, for example, the components of a putative Merkle-based IMA
metadata format are all already open:

- The unit size is just an integer

- A certificate fingerprint is a de facto standard, and the
fingerprint digest algorithms are all standardized

- Merkle trees are public domain, I believe, and we're not adding
any special sauce here

- Digital signing algorithms are all standardized

Wondering if we want to hash and save the file's mtime and size too.
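
To make that concrete, here is a purely hypothetical layout (none of
these field names, sizes, or values are proposed or existing formats;
the point is only that every component has an open, easily described
encoding):

/*
 * Hypothetical sketch only: not a proposed or existing IMA/NFS format.
 * It shows that the components listed above (unit size, certificate
 * fingerprint, tree root, signature) need only a handful of fields.
 */
#include <stdint.h>

#define MAX_DIGEST_LEN	64		/* large enough for SHA-512 */
#define MAX_SIG_LEN	512

struct merkle_ima_metadata {
	uint32_t version;			/* format revision */
	uint32_t unit_size;			/* bytes covered by each leaf */
	uint16_t digest_alg;			/* registry value for the hash */
	uint16_t signature_alg;			/* registry value for the signature */
	uint8_t  cert_fingerprint[MAX_DIGEST_LEN]; /* identifies the signer's cert */
	uint8_t  root_digest[MAX_DIGEST_LEN];	/* Merkle tree root */
	uint16_t signature_len;
	uint8_t  signature[MAX_SIG_LEN];	/* covers all of the fields above */
	/* possibly file size and mtime, folded into the signed region */
};

Whether the mtime and size belong inside the signed region is exactly
the kind of question a written specification would settle.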


>>> The first basic is that a merkle tree allows unit at a time
>>> verification. First of all we should agree on the unit.  Since we
>>> always fault a page at a time, I think our merkle tree unit should
>>> be a page not a block.
>> 
>> Remote filesystems will need to agree that the size of that unit is
>> the same everywhere, or the unit size could be stored in the per-file
>> metadata.
>> 
>> 
>>> Next, we should agree where the check gates for the per page
>>> accesses should be ... definitely somewhere in readpage, I suspect
>>> and finally we should agree how the merkle tree is presented at the

Re: Please pull NFS server updates for v5.9

2020-08-11 Thread Chuck Lever



> On Aug 11, 2020, at 2:15 AM, Stephen Rothwell  wrote:
> 
> Hi Chuck,
> 
> On Mon, 10 Aug 2020 08:25:14 -0400 Chuck Lever  wrote:
>> 
>> Is there something I need to change? The public copy of the cel-testing
>> branch has had this content for the past 12 days.
> 
> You just need to keep your cel-next branch up to the top commit that is
> ready.  That is the branch you told me to fetch.  It is currently at commit
> 
>  0a8e7b7d0846 ("SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove 
> xdr_buf_trim()")")
> 
> It looks like that is what Linus merged into v5.7-rc2.

Ah. I thought you were pulling cel-testing, so I deleted cel-next. I will
rehydrate cel-next right now.

--
Chuck Lever





Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-10 Thread Chuck Lever



> On Aug 10, 2020, at 11:35 AM, James Bottomley 
>  wrote:
> 
> On Sun, 2020-08-09 at 13:16 -0400, Mimi Zohar wrote:
>> On Sat, 2020-08-08 at 13:47 -0400, Chuck Lever wrote:
>>>> On Aug 5, 2020, at 2:15 PM, Mimi Zohar 
>>>> wrote:
>> 
>> 
>> 
>>>> If block layer integrity was enough, there wouldn't have been a
>>>> need for fs-verity.   Even fs-verity is limited to read only
>>>> filesystems, which makes validating file integrity so much
>>>> easier.  From the beginning, we've said that fs-verity signatures
>>>> should be included in the measurement list.  (I thought someone
>>>> signed on to add that support to IMA, but have not yet seen
>>>> anything.)
>>> 
>>> Mimi, when you and I discussed this during LSS NA 2019, I didn't
>>> fully understand that you expected me to implement signed Merkle
>>> trees for all filesystems. At the time, it sounded to me like you
>>> wanted signed Merkle trees only for NFS files. Is that still the
>>> case?
>> 
>> I definitely do not expect you to support signed Merkle trees for all
>> filesystems.  My interest is from an IMA perspective of measuring
>> and verifying the fs-verity Merkle tree root (and header info)
>> signature. This is independent of which filesystems support it.
>> 
>>> 
>>> The first priority (for me, anyway) therefore is getting the
>>> ability to move IMA metadata between NFS clients and servers
>>> shoveled into the NFS protocol, but that's been blocked for various
>>> legal reasons.
>> 
>> Up to now, verifying remote filesystem file integrity has been out of
>> scope for IMA.   With fs-verity file signatures I can at least grasp
>> how remote file integrity could possibly work.  I don't understand
>> how remote file integrity with existing IMA formats could be
>> supported. You might want to consider writing a whitepaper, which
>> could later be used as the basis for a patch set cover letter.
> 
> I think, before this, we can help with the basics (and perhaps we
> should sort them out before we start documenting what we'll do).

Thanks for the help! I just want to emphasize that documentation
(eg, a specification) will be critical for remote filesystems.

If any of this is to be supported by a remote filesystem, then we
need an unencumbered description of the new metadata format rather
than code. GPL-encumbered formats cannot be contributed to the NFS
standard, and are probably difficult for other filesystems that are
not Linux-native, like SMB, as well.


> The
> first basic is that a merkle tree allows unit at a time verification. 
> First of all we should agree on the unit.  Since we always fault a page
> at a time, I think our merkle tree unit should be a page not a block.

Remote filesystems will need to agree that the size of that unit is
the same everywhere, or the unit size could be stored in the per-file
metadata.


> Next, we should agree where the check gates for the per page accesses
> should be ... definitely somewhere in readpage, I suspect and finally
> we should agree how the merkle tree is presented at the gate.  I think
> there are three ways:
> 
>   1. Ahead of time transfer:  The merkle tree is transferred and verified
>  at some time before the accesses begin, so we already have a
>  verified copy and can compare against the lower leaf.
>   2. Async transfer:  We provide an async mechanism to transfer the
>  necessary components, so when presented with a unit, we check the
>  log n components required to get to the root
>   3. The protocol actually provides the capability of 2 (like the SCSI
>  DIF/DIX), so to IMA all the pieces get presented instead of IMA
>  having to manage the tree

A Merkle tree is potentially large enough that it cannot be stored in
an extended attribute. In addition, an extended attribute is not a
byte stream that you can seek into or read small parts of, it is
retrieved in a single shot.

For this reason, the idea was to save only the signature of the tree's
root on durable storage. The client would retrieve that signature
possibly at open time, and reconstruct the tree at that time.

Or the tree could be partially constructed on-demand at the time each
unit is to be checked (say, as part of 2. above).

The client would have to reconstruct that tree again if memory pressure
caused some or all of the tree to be evicted, so perhaps an on-demand
mechanism is preferable.
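
A rough sketch of the bookkeeping behind that trade-off follows (toy
userspace code with invented names; it models only which tree nodes
still need checking, not the hashing or the transfer of missing nodes
from the server):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 1024		/* toy bound; heap layout, root at index 1 */

struct tree_cache {
	bool verified[MAX_NODES];	/* true once checked against the root */
};

/*
 * Count how many nodes on the path from this leaf toward the root
 * still need their hashes verified, then remember the whole path as
 * verified.  The first access to a region pays the full path; later
 * accesses stop at the nearest verified ancestor.  After memory
 * pressure evicts the cache, the client simply pays the full path
 * again.
 */
static unsigned int nodes_to_check(struct tree_cache *tc, unsigned int leaf)
{
	unsigned int n, count = 0;

	for (n = leaf; n > 1 && !tc->verified[n]; n /= 2)
		count++;
	for (n = leaf; n >= 1; n /= 2)
		tc->verified[n] = true;
	return count;
}

int main(void)
{
	struct tree_cache tc;

	memset(&tc, 0, sizeof(tc));
	tc.verified[1] = true;		/* the signed root is trusted */

	/* With 8 leaves, the leaves sit at heap positions 8..15. */
	printf("first read of leaf 3:  %u nodes\n", nodes_to_check(&tc, 11));
	printf("second read of leaf 3: %u nodes\n", nodes_to_check(&tc, 11));
	printf("read of adjacent leaf: %u nodes\n", nodes_to_check(&tc, 10));
	return 0;
}

The hashing and the fetch of missing nodes would hang off the miss
path; the sketch only shows how the cost amortizes and what eviction
costs.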


> There are also a load of minor things like how we get the head hash,
> which must be presented and verified ahead of time for each of the
> above 3.

Also, changes to a file's content and its tree signature are not
atomic. If a file is mutable, then there is the period between when
the file content has changed and when the signature is updated.
Some discussion of how a client is to behave in those situations will
be necessary.


--
Chuck Lever
chuckle...@gmail.com





Re: Please pull NFS server updates for v5.9

2020-08-10 Thread Chuck Lever



> On Aug 9, 2020, at 7:03 PM, Stephen Rothwell  wrote:
> 
> Hi Chuck,
> 
> On Sun, 9 Aug 2020 11:44:15 -0400 Chuck Lever  wrote:
>> 
>> The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:
>> 
>>  Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)
>> 
>> are available in the Git repository at:
>> 
>>  git://git.linux-nfs.org/projects/cel/cel-2.6.git tags/nfsd-5.9
>> 
>> for you to fetch changes up to b297fed699ad9e50315b27e78de42ac631c9990d:
>> 
>>  svcrdma: CM event handler clean up (2020-07-28 10:18:15 -0400)
> 
> Despite you having a branch included in linux-next, only one of these
> commits has been in linux-next :-( (and that via Trond's nfs tree)

Is there something I need to change? The public copy of the cel-testing
branch has had this content for the past 12 days.


--
Chuck Lever





Please pull NFS server updates for v5.9

2020-08-09 Thread Chuck Lever
Hello Linus-

The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:

  Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/cel/cel-2.6.git tags/nfsd-5.9

for you to fetch changes up to b297fed699ad9e50315b27e78de42ac631c9990d:

  svcrdma: CM event handler clean up (2020-07-28 10:18:15 -0400)


Highlights:

- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls

Notable fixes:

- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference

Other:

- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints


Chuck Lever (24):
  SUNRPC: Augment server-side rpcgss tracepoints
  svcrdma: Fix page leak in svc_rdma_recv_read_chunk()
  svcrdma: Remove save_io_pages() call from send_error_msg()
  svcrdma: Add @rctxt parameter to svc_rdma_send_error() functions
  svcrdma: Add a @status parameter to svc_rdma_send_error_msg()
  svcrdma: Eliminate return value for svc_rdma_send_error_msg()
  svcrdma: Make svc_rdma_send_error_msg() a global function
  svcrdma: Consolidate send_error helper functions
  svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
  svcrdma: Remove declarations for functions long removed
  SUNRPC: Add helpers for decoding list discriminators symbolically
  svcrdma: Add common XDR decoders for RDMA and Read segments
  svcrdma: Add common XDR encoders for RDMA and Read segments
  svcrdma: Introduce infrastructure to support completion IDs
  svcrdma: Introduce Receive completion IDs
  svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
  svcrdma: Introduce Send completion IDs
  svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
  svcrdma: Display chunk completion ID when posting a rw_ctxt
  SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
  SUNRPC: Refresh the show_rqstp_flags() macro
  svcrdma: Fix another Receive buffer leak
  svcrdma: Remove transport reference counting
  svcrdma: CM event handler clean up

Frank van der Linden (10):
  nfs,nfsd: NFSv4.2 extended attribute protocol definitions
  xattr: break delegations in {set,remove}xattr
  xattr: add a function to check if a namespace is supported
  nfsd: split off the write decode code into a separate function
  nfsd: add defines for NFSv4.2 extended attribute support
  nfsd: define xattr functions to call into their vfs counterparts
  nfsd: take xattr bits into account for permission checks
  nfsd: add structure definitions for xattr requests / responses
  nfsd: implement the xattr functions and en/decode logic
  nfsd: add fattr support for user extended attributes

J. Bruce Fields (1):
  nfsd4: a client's own opens needn't prevent delegations

Randy Dunlap (1):
  nfsd: netns.h: delete a duplicated word

Scott Mayhew (1):
  nfsd: avoid a NULL dereference in __cld_pipe_upcall()

Xu Wang (1):
  nfsd: Use seq_putc() in two functions

 fs/locks.c |   3 +
 fs/nfsd/netns.h|   2 +-
 fs/nfsd/nfs4idmap.c|   4 +-
 fs/nfsd/nfs4proc.c | 128 ++-
 fs/nfsd/nfs4recover.c  |  24 ++---
 fs/nfsd/nfs4state.c|  54 +++---
 fs/nfsd/nfs4xdr.c  | 531 +++--
 fs/nfsd/nfsd.h |   5 +-
 fs/nfsd/vfs.c  | 239 ++
 fs/nfsd/vfs.h  |  10 ++
 fs/nfsd/xdr4.h |  31 ++
 fs/xattr.c | 111 ++--
 include/linux/nfs4.h   |  22 +++-
 include/linux/sunrpc/rpc_rdma.h|  74 +
 include/linux/sunrpc/rpc_rdma_cid.h|  24 +
 include/linux/sunrpc/svc_rdma.h|  17 ++-
 include/linux/sunrpc/xdr.h |  26 +
 include/linux/xattr.h  |   4 +
 include/trace/events/rpcgss.h  | 168 +-
 include/trace/events/rpcrdma.h | 207 +++-
 include/trace/events/sunrpc.h  |  35 +--
 include/uapi/linux/nfs4.h  |   3 +
 net/sunrpc/auth_gss/gss_krb5_wrap.c|   2 +-
 net/sunrpc/auth_gss/svcauth_gss.c  | 118 ++---
 net/sunrpc/auth_gss/trace.c|   3 +
 net/sunrpc/xprtrdma/frwr_ops.c

Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)

2020-08-08 Thread Chuck Lever



> On Aug 5, 2020, at 2:15 PM, Mimi Zohar  wrote:
> 
> On Wed, 2020-08-05 at 09:59 -0700, James Morris wrote:
>> On Wed, 5 Aug 2020, James Bottomley wrote:
>> 
>>> I'll leave Mimi to answer, but really this is exactly the question that
>>> should have been asked before writing IPE.  However, since we have the
>>> cart before the horse, let me break the above down into two specific
>>> questions.
>> 
>> The question is valid and it was asked. We decided to first prototype what 
>> we needed and then evaluate if it should be integrated with IMA. We 
>> discussed this plan in person with Mimi (at LSS-NA in 2019), and presented 
>> a more mature version of IPE to LSS-NA in 2020, with the expectation that 
>> such a discussion may come up (it did not).
> 
> When we first spoke the concepts weren't fully formulated, at least to
> me.
>> 
>> These patches are still part of this process and 'RFC' status.
>> 
>>>   1. Could we implement IPE in IMA (as in would extensions to IMA cover
>>>  everything).  I think the answers above indicate this is a "yes".
>> 
>> It could be done, if needed.
>> 
>>>   2. Should we extend IMA to implement it?  This is really whether from a
>>>  usability standpoint two seperate LSMs would make sense to cover the
>>>  different use cases.
>> 
>> One issue here is that IMA is fundamentally a measurement & appraisal 
>> scheme which has been extended to include integrity enforcement. IPE was 
>> designed from scratch to only perform integrity enforcement. As such, it 
>> is a cleaner design -- "do one thing and do it well" is a good design 
>> pattern.
>> 
>> In our use-case, we utilize _both_ IMA and IPE, for attestation and code 
>> integrity respectively. It is useful to be able to separate these 
>> concepts. They really are different:
>> 
>> - Code integrity enforcement ensures that code running locally is of known 
>> provenance and has not been modified prior to execution.

My interest is in code integrity enforcement for executables stored
in NFS files.

My struggle with IPE is that due to its dependence on dm-verity, it
does not seem to able to protect content that is stored separately
from its execution environment and accessed via a file access
protocol (FUSE, SMB, NFS, etc).


>> - Attestation is about measuring the health of a system and having that 
>> measurement validated by a remote system. (Local attestation is useless).
>> 
>> I'm not sure there is value in continuing to shoe-horn both of these into 
>> IMA.
> 
> True, IMA was originally limited to measurement and attestation, but
> most of the original EVM concepts were subsequently included in IMA. 
> (Remember, Reiner Sailer wrote the original IMA, which I inherited.  I
> was originally working on EVM code integrity.)  From a naming
> perspective including EVM code integrity in IMA was a mistake.  My
> thinking at the time was that as IMA was already calculating the file
> hash, instead of re-calculating the file hash for integrity, calculate
> the file hash once and re-use it for multiple things - measurement, 
> integrity, and audit.   At the same time define a single system wide
> policy.
> 
> When we first started working on IMA, EVM, trusted, and encrypted keys,
> the general kernel community didn't see a need for any of it.  Thus, a
> lot of what was accomplished has been accomplished without the backing
> of the real core filesystem people.
> 
> If block layer integrity was enough, there wouldn't have been a need
> for fs-verity.   Even fs-verity is limited to read only filesystems,
> which makes validating file integrity so much easier.  From the
> beginning, we've said that fs-verity signatures should be included in
> the measurement list.  (I thought someone signed on to add that support
> to IMA, but have not yet seen anything.)

Mimi, when you and I discussed this during LSS NA 2019, I didn't fully
understand that you expected me to implement signed Merkle trees for all
filesystems. At the time, it sounded to me like you wanted signed Merkle
trees only for NFS files. Is that still the case?

The first priority (for me, anyway) therefore is getting the ability to
move IMA metadata between NFS clients and servers shoveled into the NFS
protocol, but that's been blocked for various legal reasons.

IMO we need agreement from everyone (integrity developers, FS
implementers, and Linux distributors) that a signed Merkle tree IMA
metadata format, stored in either an xattr or appended to an executable
file, will be the way forward for IMA in all filesystems.


> Going forward I see a lot of what we've accomplished being incorporated
> into the filesystems.  When IMA will be limited to defining a system
> wide policy, I'll have completed my job.
> 
> Mimi
> 
>> 
>>> I've got to say the least attractive thing
>>>  about separation is the fact that you now both have a policy parser.
>>>   You've tried to differentiate yours by making it more Kconfig
>>>  based, but policy has a way of 

Re: [PATCH -next] nfsd: use DEFINE_SPINLOCK() for spinlock

2020-07-31 Thread Chuck Lever
Hello-

> On Jul 25, 2020, at 4:56 AM, Qinglang Miao  wrote:
> 
> nfsd_drc_lock can be initialized automatically with
> DEFINE_SPINLOCK() rather than explicitly calling spin_lock_init().
> 
> Signed-off-by: Qinglang Miao 

Thanks for your patch.

I am quite honestly on the fence about this one. It doesn't
seem to make a difference behaviorally or in terms of code
legibility.

A broader clean-up that moves set_max_drc and those global
variables into nfs4state.c might be better, but again, there
isn't much to justify such a change.

Bruce, any thoughts?


> ---
> fs/nfsd/nfssvc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index b603dfcdd..20f0a27fe 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -88,7 +88,7 @@ DEFINE_MUTEX(nfsd_mutex);
>  * version 4.1 DRC caches.
>  * nfsd_drc_pages_used tracks the current version 4.1 DRC memory usage.
>  */
> -spinlock_t   nfsd_drc_lock;
> +DEFINE_SPINLOCK(nfsd_drc_lock);
> unsigned long nfsd_drc_max_mem;
> unsigned long nfsd_drc_mem_used;
> 
> @@ -568,7 +568,6 @@ static void set_max_drc(void)
>   nfsd_drc_max_mem = (nr_free_buffer_pages()
>   >> NFSD_DRC_SIZE_SHIFT) * PAGE_SIZE;
>   nfsd_drc_mem_used = 0;
> - spin_lock_init(&nfsd_drc_lock);
>   dprintk("%s nfsd_drc_max_mem %lu \n", __func__, nfsd_drc_max_mem);
> }
> 
> -- 
> 2.25.1
> 

--
Chuck Lever





Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-20 Thread Chuck Lever



> On Jul 18, 2020, at 11:55 AM, Chuck Lever  wrote:
> 
> 
> 
>> On Jul 17, 2020, at 3:46 PM, Pierre Sauter  wrote:
>> 
>> Am Freitag, 17. Juli 2020, 19:56:09 CEST schrieb Kai-Heng Feng:
>>>> Pierre, thanks for confirming!
>>>> 
>>>> Kai-Heng suspected an upstream stable commit that is missing in 5.4.0-40,
>>>> but I don't have any good suggestions.
>>> 
>>> Well, Ubuntu's 5.4 kernel is based on upstream stable v5.4, so I asked 
>>> users to test stable v5.4.51, however the feedback was negative, and that's 
>>> the reason why I raised the issue here.
>>> 
>>> Anyway, good to know that it's fixed in upstream stable, everything's good 
>>> now!
>>> Thanks for your effort Chuck.
>>> 
>>> Kai-Heng
>> 
>> Sorry to have caused premature happiness. Kai-Hengs last message reminded me
>> that I had seen the bug earlier in the week on Ubuntu Mainline v.5.4.51.
>> So I decided to rebuild vanilla v5.4.51 with Ubuntus config + KASAN, and 
>> voila.
>> It seems that their config is just really good in exposing the bug on mount. 
>> I
>> am off for the weekend, can do more testing next week.
>> 
>> [   21.664580] 
>> ==
>> [   21.664657] BUG: KASAN: slab-out-of-bounds in _copy_from_pages+0xed/0x210 
>> [sunrpc]
>> [   21.664705] Write of size 64 at addr 8883b6b7d444 by task 
>> update-desktop-/1345
>> 
>> [   21.664764] CPU: 0 PID: 1345 Comm: update-desktop- Not tainted 5.4.51 #1
>> [   21.664765] Hardware name: XX
>> [   21.664766] Call Trace:
>> [   21.664771]  dump_stack+0x96/0xca
>> [   21.664775]  print_address_description.constprop.0+0x20/0x210
>> [   21.664795]  ? _copy_from_pages+0xed/0x210 [sunrpc]
>> [   21.664797]  __kasan_report.cold+0x1b/0x41
>> [   21.664816]  ? _copy_from_pages+0xed/0x210 [sunrpc]
>> [   21.664819]  kasan_report+0x14/0x20
>> [   21.664820]  check_memory_region+0x129/0x1b0
>> [   21.664822]  memcpy+0x38/0x50
>> [   21.664840]  _copy_from_pages+0xed/0x210 [sunrpc]
>> [   21.664859]  xdr_shrink_pagelen+0x1d6/0x440 [sunrpc]
>> [   21.664877]  xdr_align_pages+0x15f/0x580 [sunrpc]
>> [   21.664897]  ? decode_setattr+0x120/0x120 [nfsv4]
>> [   21.664916]  xdr_read_pages+0x44/0x290 [sunrpc]
>> [   21.664933]  ? __decode_op_hdr+0x29/0x430 [nfsv4]
>> [   21.664950]  nfs4_xdr_dec_readlink+0x238/0x390 [nfsv4]
> 
> READLINK appears to be a common element in these splats. Is there
> an especially large symbolic link in your home directory? Knowing
> that might help me reproduce the problem here.
> 
> You confirmed the crash does not occur in v5.5.19, but the 5.8-ish
> kernel you tested was Ubuntu's. Do you have test results for a
> stock upstream v5.8-rc5 kernel?
> 
> Do you know if v5.6.19 has this issue?

I have a workload that can reproduce this exact KASAN splat on
v5.4.51. Looking into it now.


>> [   21.664966]  ? nfs4_xdr_dec_read+0x3c0/0x3c0 [nfsv4]
>> [   21.664969]  ? __kasan_slab_free+0x14e/0x180
>> [   21.664985]  ? nfs4_xdr_dec_read+0x3c0/0x3c0 [nfsv4]
>> [   21.665003]  rpcauth_unwrap_resp_decode+0xaa/0x100 [sunrpc]
>> [   21.665009]  gss_unwrap_resp+0x99d/0x1570 [auth_rpcgss]
>> [   21.665014]  ? gss_destroy_cred+0x460/0x460 [auth_rpcgss]
>> [   21.665016]  ? finish_task_switch+0x163/0x670
>> [   21.665019]  ? __switch_to_asm+0x34/0x70
>> [   21.665023]  ? gss_wrap_req+0x1700/0x1700 [auth_rpcgss]
>> [   21.665026]  ? prepare_to_wait+0xea/0x2b0
>> [   21.665045]  rpcauth_unwrap_resp+0xac/0x100 [sunrpc]
>> [   21.665061]  call_decode+0x454/0x7e0 [sunrpc]
>> [   21.665077]  ? rpc_decode_header+0x10a0/0x10a0 [sunrpc]
>> [   21.665079]  ? var_wake_function+0x140/0x140
>> [   21.665095]  ? call_transmit_status+0x31e/0x5d0 [sunrpc]
>> [   21.665110]  ? rpc_decode_header+0x10a0/0x10a0 [sunrpc]
>> [   21.665127]  __rpc_execute+0x204/0xbd0 [sunrpc]
>> [   21.665143]  ? xprt_wait_for_reply_request_def+0x170/0x170 [sunrpc]
>> [   21.665160]  ? rpc_exit+0xc0/0xc0 [sunrpc]
>> [   21.665162]  ? __kasan_check_read+0x11/0x20
>> [   21.665164]  ? wake_up_bit+0x42/0x50
>> [   21.665181]  rpc_execute+0x1a0/0x1f0 [sunrpc]
>> [   21.665197]  rpc_run_task+0x454/0x5e0 [sunrpc]
>> [   21.665213]  nfs4_call_sync_custom+0x12/0x70 [nfsv4]
>> [   21.665229]  nfs4_call_sync_sequence+0x143/0x1f0 [nfsv4]
>> [   21.665244]  ? nfs4_call_sync_custom+0x70/0x70 [nfsv4]
>> [   21.665247]  ? get_page_from_freelist+0x24d0/0x45f0
>> [   21.665263]  _nfs4_proc_readlink+0x1a6/0x250 [nfsv4]
>

Re: [PATCH] nfsd: netns.h: delete a duplicated word

2020-07-20 Thread Chuck Lever



> On Jul 19, 2020, at 8:14 PM, Randy Dunlap  wrote:
> 
> Drop the repeated word "the" in a comment.
> 
> Signed-off-by: Randy Dunlap 
> Cc: "J. Bruce Fields" 
> Cc: Chuck Lever 
> Cc: linux-...@vger.kernel.org

Acked-by: Chuck Lever 


> ---
> fs/nfsd/netns.h |2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next-20200717.orig/fs/nfsd/netns.h
> +++ linux-next-20200717/fs/nfsd/netns.h
> @@ -171,7 +171,7 @@ struct nfsd_net {
>   unsigned int longest_chain_cachesize;
> 
>   struct shrinker nfsd_reply_cache_shrinker;
> - /* utsname taken from the the process that starts the server */
> + /* utsname taken from the process that starts the server */
>   charnfsd_name[UNX_MAXNODENAME+1];
> };
> 

--
Chuck Lever





Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-18 Thread Chuck Lever
bytes inside of
>2048-byte region [8883b6b7cc80, 8883b6b7d480)
> [   21.665939] The buggy address belongs to the page:
> [   21.665970] page:ea000edade00 refcount:1 mapcount:0 
> mapping:88840afecc00 index:0x0 compound_mapcount: 0
> [   21.666029] flags: 0x17c0010200(slab|head)
> [   21.666059] raw: 0017c0010200 dead0100 dead0122 
> 88840afecc00
> [   21.666107] raw:  800f000f 0001 
> 
> [   21.666152] page dumped because: kasan: bad access detected
> 
> [   21.666197] Memory state around the buggy address:
> [   21.666228]  8883b6b7d380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00
> [   21.666272]  8883b6b7d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00
> [   21.666315] >8883b6b7d480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> fc fc
> [   21.666358]^
> [   21.666379]  8883b6b7d500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> fc fc
> [   21.666423]  8883b6b7d580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> fc fc
> [   21.666465] 
> ==
> [   21.666509] Disabling lock debugging due to kernel taint
> 
> 
> 

--
Chuck Lever





Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-17 Thread Chuck Lever



> On Jul 17, 2020, at 1:29 PM, Pierre Sauter  wrote:
> 
> Hi Chuck,
> 
> Am Donnerstag, 16. Juli 2020, 21:25:40 CEST schrieb Chuck Lever:
>> So this makes me think there's a possibility you are not using upstream
>> stable kernels. I can't help if I don't know what source code and commit
>> stream you are using. It also makes me question the bisect result.
> 
> Yes you are right, I was referring to Ubuntu kernels 5.4.0-XX. From the
> discussion in the Ubuntu bugtracker I got the impression that Ubuntu kernels
> 5.4.0-XX and upstream 5.4.XX are closely related, obviously they are not. The
> bisection was done by the original bug reporter and also refers to the Ubuntu
> kernel.
> 
> In the meantime I tested v5.4.51 upstream, which shows no problems. Sorry for
> the bother.

Pierre, thanks for confirming!

Kai-Heng suspected an upstream stable commit that is missing in 5.4.0-40,
but I don't have any good suggestions.


>>> My krb5 etype is aes256-cts-hmac-sha1-96.
>> 
>> Thanks! And what is your NFS server and filesystem? It's possible that the
>> client is not estimating the size of the reply correctly. Variables include
>> the size of file handles, MIC verifiers, and wrap tokens.
> 
> The server is Debian with v4.19.130 upstream, filesystem ext4.
> 
>> You might try:
>> 
>> e8d70b321ecc ("SUNRPC: Fix another issue with MIC buffer space")
> 
> That one is actually in Ubuntus 5.4.0-40, from looking at the code.

--
Chuck Lever





Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-16 Thread Chuck Lever
Hi Pierre-

> On Jul 16, 2020, at 2:40 PM, Pierre Sauter  wrote:
> 
> Hi,
> 
> Am 2020-07-15 20:54, schrieb Chuck Lever:
>> v5.4.40 does not have 31c9590ae468 and friends, but the claim is this
>> one crashes?
> 
> To my knowledge 31c9590ae468 and friends are in v5.4.40.

Those upstream commits were merged in v5.4.42.

The last commit that is applied to net/sunrpc/auth_gss/gss_krb5_wrap.c
in v5.4.40 is 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()").


>> And v5.4.51 has those three and 89a3c9f5b9f0, which Pierre claims fixes
>> the problem for him; but another commenter says v5.4.51 still crashes.
> 
> v5.4.51 still crashes for me (and afaik it does not have 89a3c9f5b9f0). 
> I applied 89a3c9f5b9f0 to the original v5.4.40 which helps mostly.

In the v5.4.51 upstream stable kernel, I see:

commit 7b99577ff376d58addf149eebe046351b3d7
Author: Chuck Lever 
AuthorDate: Thu Jun 25 11:32:34 2020 -0400
Commit: Sasha Levin 
CommitDate: Tue Jun 30 15:37:12 2020 -0400

SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()

commit 89a3c9f5b9f0bcaa9aea3e8b2a616fcaea9aad78 upstream.

According to "git describe --contains", 7b99577ff376 was merged in v5.4.50.

So this makes me think there's a possibility you are not using upstream
stable kernels. I can't help if I don't know what source code and commit
stream you are using. It also makes me question the bisect result.


> My krb5 etype is aes256-cts-hmac-sha1-96.

Thanks! And what is your NFS server and filesystem? It's possible that the
client is not estimating the size of the reply correctly. Variables include
the size of file handles, MIC verifiers, and wrap tokens.


> Below is the bug in the original v5.4.40 with KASAN enabled. It happened 
> immediately after mount of /home, no login neccessary:
> 
> [   21.501730] 
> ==
> [   21.501756] BUG: KASAN: slab-out-of-bounds in _copy_from_pages+0xe9/0x200 
> [sunrpc]
> [   21.501759] Write of size 64 at addr 8883bc9f3244 by task 
> update-desktop-/1478
> 
> [   21.501763] CPU: 0 PID: 1478 Comm: update-desktop- Tainted: G   OE 
> 5.4.0-40-generic #44

So, I don't know what 5.4.0-40-generic is, but it's not an upstream
stable kernel. If it's an Ubuntu kernel, you should work with them
directly to nail this issue down.


> [   21.501764] Hardware name: XX
> [   21.501765] Call Trace:
> [   21.501769]  dump_stack+0x96/0xca
> [   21.501772]  print_address_description.constprop.0+0x20/0x210
> [   21.501789]  ? _copy_from_pages+0xe9/0x200 [sunrpc]
> [   21.501790]  __kasan_report.cold+0x1b/0x41
> [   21.501806]  ? _copy_from_pages+0xe9/0x200 [sunrpc]
> [   21.501807]  kasan_report+0x12/0x20
> [   21.501809]  check_memory_region+0x129/0x1b0
> [   21.501811]  memcpy+0x38/0x50
> [   21.501825]  _copy_from_pages+0xe9/0x200 [sunrpc]
> [   21.501839]  ? call_decode+0x2fd/0x7e0 [sunrpc]
> [   21.501854]  ? __rpc_execute+0x204/0xbd0 [sunrpc]
> [   21.501869]  xdr_shrink_pagelen+0x198/0x3c0 [sunrpc]

You might try:

e8d70b321ecc ("SUNRPC: Fix another issue with MIC buffer space")


> [   21.501871]  ? trailing_symlink+0x6fe/0x810
> [   21.501886]  xdr_align_pages+0x15f/0x580 [sunrpc]
> [   21.501904]  ? decode_setattr+0x120/0x120 [nfsv4]
> [   21.501920]  xdr_read_pages+0x44/0x290 [sunrpc]
> [   21.501935]  ? __decode_op_hdr+0x29/0x430 [nfsv4]
> [   21.501949]  nfs4_xdr_dec_readlink+0x238/0x390 [nfsv4]
> [   21.501963]  ? nfs4_xdr_dec_read+0x3c0/0x3c0 [nfsv4]
> [   21.501966]  ? __kasan_slab_free+0x14e/0x180
> [   21.501970]  ? gss_validate+0x37e/0x610 [auth_rpcgss]
> [   21.501972]  ? kasan_slab_free+0xe/0x10
> [   21.501985]  ? nfs4_xdr_dec_read+0x3c0/0x3c0 [nfsv4]
> [   21.502001]  rpcauth_unwrap_resp_decode+0xaa/0x100 [sunrpc]
> [   21.502005]  gss_unwrap_resp+0x99d/0x1570 [auth_rpcgss]
> [   21.502009]  ? gss_destroy_cred+0x460/0x460 [auth_rpcgss]
> [   21.502011]  ? finish_task_switch+0x163/0x670
> [   21.502014]  ? __switch_to_asm+0x34/0x70
> [   21.502017]  ? gss_wrap_req+0x830/0x830 [auth_rpcgss]
> [   21.502020]  ? prepare_to_wait+0xea/0x2b0
> [   21.502036]  rpcauth_unwrap_resp+0xac/0x100 [sunrpc]
> [   21.502049]  call_decode+0x454/0x7e0 [sunrpc]
> [   21.502063]  ? rpc_decode_header+0x10a0/0x10a0 [sunrpc]
> [   21.502065]  ? var_wake_function+0x140/0x140
> [   21.502078]  ? call_transmit_status+0x31e/0x5d0 [sunrpc]
> [   21.502091]  ? rpc_decode_header+0x10a0/0x10a0 [sunrpc]
> [   21.502106]  __rpc_execute+0x204/0xbd0 [sunrpc]
> [   21.502119]  ? xprt_wait_for_reply_request_def+0x170/0x170 [sunrpc]
> [   21.502134]  ? rpc_exit+0xc0/0xc0 [sunrpc]
> [   21.502135]  ? __kasan_check_read+0x11/0x20
> [   21.502137]  ? wake_up_

Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-15 Thread Chuck Lever



> On Jul 15, 2020, at 11:14 AM, Chuck Lever  wrote:
> 
> 
> 
>> On Jul 15, 2020, at 11:08 AM, Kai-Heng Feng  
>> wrote:
>> 
>>> On Jul 15, 2020, at 23:02, Chuck Lever  wrote:
>>> 
>>>> On Jul 15, 2020, at 10:48 AM, Kai-Heng Feng  
>>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Multiple users reported NFS causes NULL pointer dereference [1] on Ubuntu, 
>>>> due to commit "SUNRPC: Add "@len" parameter to gss_unwrap()" and commit 
>>>> "SUNRPC: Fix GSS privacy computation of auth->au_ralign".
>>>> 
>>>> The same issue happens on upstream stable 5.4.y branch.
>>>> The mainline kernel doesn't have this issue though.
>>>> 
>>>> Should we revert them? Or is there any missing commits need to be 
>>>> backported to v5.4?
>>>> 
>>>> [1] https://bugs.launchpad.net/bugs/1886277
>>>> 
>>>> Kai-Heng
>>> 
>>> 31c9590ae468 ("SUNRPC: Add "@len" parameter to gss_unwrap()") is a 
>>> refactoring
>>> change. It shouldn't have introduced any behavior difference. But in theory,
>>> practice and theory should be the same...
>>> 
>>> Check if 0a8e7b7d0846 ("SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove 
>>> xdr_buf_trim()")")
>>> is also applied to 5.4.0-40-generic.
>> 
>> Yes, it's included. The commit is part of upstream stable 5.4.
>> 
>>> 
>>> It would help to know if v5.5 stable is working for you. I haven't had any
>>> problems with it.
>> 
>> I'll ask users to test it out. 
>> Thanks for your quick reply!
> 
> Another thought: Please ask what encryption type is in use. The
> kerberos_v1 enctypes might exercise a code path I wasn't able to
> test.

OK.

v5.4.40 does not have 31c9590ae468 and friends, but the claim is this
one crashes?

And v5.4.51 has those three and 89a3c9f5b9f0, which Pierre claims fixes
the problem for him; but another commenter says v5.4.51 still crashes.

So we're getting inconsistent problem reports.

Have the testers enable memory debugging : KASAN or SLUB debugging
might provide more information. I might have some time later this week
to try reproducing on upstream stable, but no guarantees.


--
Chuck Lever





Re: [PATCH] xprtrdma: fix incorrect header size calcations

2020-07-15 Thread Chuck Lever



> On Jul 15, 2020, at 12:31 PM, Colin Ian King  wrote:
> 
> Bah, $SUBJECT typo "calcations" -> "calculations". can that be fixed up
> when it's applied, or shall I send a V2?

Anna's preference.

Reviewed-by: Chuck Lever 


> On 15/07/2020 17:26, Colin King wrote:
>> From: Colin Ian King 
>> 
>> Currently the header size calculations are using an assignment
>> operator instead of a += operator when accumulating the header
>> size leading to incorrect sizes.  Fix this by using the correct
>> operator.
>> 
>> Addresses-Coverity: ("Unused value")
>> Fixes: 302d3deb2068 ("xprtrdma: Prevent inline overflow")
>> Signed-off-by: Colin Ian King 
>> ---
>> net/sunrpc/xprtrdma/rpc_rdma.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index 935bbef2f7be..453bacc99907 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -71,7 +71,7 @@ static unsigned int rpcrdma_max_call_header_size(unsigned 
>> int maxsegs)
>>  size = RPCRDMA_HDRLEN_MIN;
>> 
>>  /* Maximum Read list size */
>> -size = maxsegs * rpcrdma_readchunk_maxsz * sizeof(__be32);
>> +size += maxsegs * rpcrdma_readchunk_maxsz * sizeof(__be32);
>> 
>>  /* Minimal Read chunk size */
>>  size += sizeof(__be32); /* segment count */
>> @@ -94,7 +94,7 @@ static unsigned int rpcrdma_max_reply_header_size(unsigned 
>> int maxsegs)
>>  size = RPCRDMA_HDRLEN_MIN;
>> 
>>  /* Maximum Write list size */
>> -    size = sizeof(__be32);  /* segment count */
>> +size += sizeof(__be32); /* segment count */
>>  size += maxsegs * rpcrdma_segment_maxsz * sizeof(__be32);
>>  size += sizeof(__be32); /* list discriminator */
>> 
>> 
> 

--
Chuck Lever





Re: xprtrdma: Prevent inline overflow

2020-07-15 Thread Chuck Lever



> On Jul 15, 2020, at 11:56 AM, Colin Ian King  wrote:
> 
> Hi,
> 
> Static analysis with Coverity has found a potential issue with the
> header size calculations in source net/sunrpc/xprtrdma/rpc_rdma.c in
> functions rpcrdma_max_call_header_size and rpcrdma_max_reply_header_size.
> 
> The commit in question is relatively old:
> 
> commit 302d3deb20682a076e1ab551821cacfdc81c5e4f
> Author: Chuck Lever 
> Date:   Mon May 2 14:41:05 2016 -0400
> 
>xprtrdma: Prevent inline overflow
> 
> The two issues are as follows:
> 
> Issue #1:
> 
> 66 static unsigned int rpcrdma_max_call_header_size(unsigned int maxsegs)
> 67 {
> 68         unsigned int size;
> 69
> 70         /* Fixed header fields and list discriminators */
> 
> Unused value (UNUSED_VALUE)
> 
> 71         size = RPCRDMA_HDRLEN_MIN;
> 72
> 73         /* Maximum Read list size */
> 74         size = maxsegs * rpcrdma_readchunk_maxsz * sizeof(__be32);
> 75
> 
> should the size assignment on line 74 be instead:
> 
>   size += maxsegs * rpcrdma_readchunk_maxsz * sizeof(__be32);
> 
> 
> Issue #2:
> 
> 89 static unsigned int rpcrdma_max_reply_header_size(unsigned int maxsegs)
> 90 {
> 91         unsigned int size;
> 92
> 93         /* Fixed header fields and list discriminators */
> 
> Unused value (UNUSED_VALUE)
> 
> 94         size = RPCRDMA_HDRLEN_MIN;
> 95
> 96         /* Maximum Write list size */
> 97         size = sizeof(__be32);  /* segment count */
> 
> should the size assignment in line 97 be instead:
> 
>   size += sizeof(__be32)?

Colin, Yes to both questions. Can you send a fix to Anna?

--
Chuck Lever





Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-15 Thread Chuck Lever



> On Jul 15, 2020, at 11:08 AM, Kai-Heng Feng  
> wrote:
> 
>> On Jul 15, 2020, at 23:02, Chuck Lever  wrote:
>> 
>>> On Jul 15, 2020, at 10:48 AM, Kai-Heng Feng  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> Multiple users reported NFS causes NULL pointer dereference [1] on Ubuntu, 
>>> due to commit "SUNRPC: Add "@len" parameter to gss_unwrap()" and commit 
>>> "SUNRPC: Fix GSS privacy computation of auth->au_ralign".
>>> 
>>> The same issue happens on upstream stable 5.4.y branch.
>>> The mainline kernel doesn't have this issue though.
>>> 
>>> Should we revert them? Or are there any missing commits that need to be
>>> backported to v5.4?
>>> 
>>> [1] https://bugs.launchpad.net/bugs/1886277
>>> 
>>> Kai-Heng
>> 
>> 31c9590ae468 ("SUNRPC: Add "@len" parameter to gss_unwrap()") is a 
>> refactoring
>> change. It shouldn't have introduced any behavior difference. But in theory,
>> practice and theory should be the same...
>> 
>> Check if 0a8e7b7d0846 ("SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove 
>> xdr_buf_trim()")")
>> is also applied to 5.4.0-40-generic.
> 
> Yes, it's included. The commit is part of upstream stable 5.4.
> 
>> 
>> It would help to know if v5.5 stable is working for you. I haven't had any
>> problems with it.
> 
> I'll ask users to test it out. 
> Thanks for your quick reply!

Another thought: Please ask what encryption type is in use. The
kerberos_v1 enctypes might exercise a code path I wasn't able to
test.


--
Chuck Lever





Re: [Regression] "SUNRPC: Add "@len" parameter to gss_unwrap()" breaks NFS Kerberos on upstream stable 5.4.y

2020-07-15 Thread Chuck Lever



> On Jul 15, 2020, at 10:48 AM, Kai-Heng Feng  
> wrote:
> 
> Hi,
> 
> Multiple users reported NFS causes NULL pointer dereference [1] on Ubuntu, 
> due to commit "SUNRPC: Add "@len" parameter to gss_unwrap()" and commit 
> "SUNRPC: Fix GSS privacy computation of auth->au_ralign".
> 
> The same issue happens on upstream stable 5.4.y branch.
> The mainline kernel doesn't have this issue though.
> 
> Should we revert them? Or are there any missing commits that need to be
> backported to v5.4?
> 
> [1] https://bugs.launchpad.net/bugs/1886277
> 
> Kai-Heng

31c9590ae468 ("SUNRPC: Add "@len" parameter to gss_unwrap()") is a refactoring
change. It shouldn't have introduced any behavior difference. But in theory,
practice and theory should be the same...

Check if 0a8e7b7d0846 ("SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove 
xdr_buf_trim()")")
is also applied to 5.4.0-40-generic.

It would help to know if v5.5 stable is working for you. I haven't had any
problems with it.


--
Chuck Lever





Re: [PATCH] nfsd: Use seq_putc() in two functions

2020-07-04 Thread Chuck Lever
Hi-

> On Jun 17, 2020, at 9:56 PM, Xu Wang  wrote:
> 
> A single character (line break) should be put into a sequence.
> Thus use the corresponding function "seq_putc()".
> 
> Signed-off-by: Xu Wang 

Applied to nfsd-5.9. Thanks.


> ---
> fs/nfsd/nfs4idmap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c
> index 9460be8a8321..f92161ce1f97 100644
> --- a/fs/nfsd/nfs4idmap.c
> +++ b/fs/nfsd/nfs4idmap.c
> @@ -168,7 +168,7 @@ idtoname_show(struct seq_file *m, struct cache_detail 
> *cd, struct cache_head *h)
>   ent->id);
>   if (test_bit(CACHE_VALID, &h->flags))
>   seq_printf(m, " %s", ent->name);
> - seq_printf(m, "\n");
> + seq_putc(m, '\n');
>   return 0;
> }
> 
> @@ -346,7 +346,7 @@ nametoid_show(struct seq_file *m, struct cache_detail 
> *cd, struct cache_head *h)
>   ent->name);
>   if (test_bit(CACHE_VALID, &h->flags))
>   seq_printf(m, " %u", ent->id);
> - seq_printf(m, "\n");
> + seq_putc(m, '\n');
>   return 0;
> }
> 
> -- 
> 2.17.1
> 

--
Chuck Lever
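
A brief illustration of the substitution (example_show() is a hypothetical
seq_file show callback, not taken from the patch): seq_putc() emits a single
character without the format-string processing that seq_printf() performs,
which is why it is preferred for a lone line break.

#include <linux/seq_file.h>

static int example_show(struct seq_file *m, void *v)
{
        seq_printf(m, "id %u", 42);     /* formatted output */
        seq_putc(m, '\n');              /* single trailing character */
        return 0;
}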





Re: [PATCH v2] SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE

2020-06-15 Thread Chuck Lever



> On Jun 15, 2020, at 2:25 AM, Christophe Leroy  
> wrote:
> 
> Even if that's only a warning, not including asm/cacheflush.h
> leads to svc_flush_bvec() being empty although powerpc defines
> ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> 
>  CC  net/sunrpc/svcsock.o
> net/sunrpc/svcsock.c:227:5: warning: "ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE" is 
> not defined [-Wundef]
> #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
> ^
> 
> Include linux/highmem.h so that asm/cacheflush.h will be included.
> 
> Reported-by: Christophe Leroy 
> Reported-by: kernel test robot 
> Cc: Chuck Lever 
> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
> Signed-off-by: Christophe Leroy 

LGTM.

Acked-by: Chuck Lever 


> ---
> v2: Use linux/highmem.h instead of asm/cacheflush.h
> Signed-off-by: Christophe Leroy 
> ---
> net/sunrpc/svcsock.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 5c4ec9386f81..c537272f9c7e 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -44,6 +44,7 @@
> #include 
> #include 
> #include 
> +#include <linux/highmem.h>
> #include 
> 
> #include 
> -- 
> 2.25.0
> 

--
Chuck Lever
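
A sketch of the pattern behind the warning (example_flush_pages() is a
hypothetical stand-in, loosely modeled on the idea in net/sunrpc/svcsock.c):
when ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE has not been brought into scope, the
#if evaluates the undefined symbol as 0 (hence the -Wundef warning) and the
empty stub is compiled even on architectures such as powerpc whose
<asm/cacheflush.h> defines it to 1. Including <linux/highmem.h> pulls that
header in and restores the real implementation.

#include <linux/highmem.h>      /* indirectly provides <asm/cacheflush.h> */

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
static void example_flush_pages(struct page **pages, unsigned int count)
{
        unsigned int i;

        for (i = 0; i < count; i++)
                flush_dcache_page(pages[i]);    /* keep the D-cache coherent */
}
#else
static void example_flush_pages(struct page **pages, unsigned int count)
{
        /* nothing to do on cache-coherent architectures */
}
#endif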





Re: [PATCH] SUNRPC: Add missing asm/cacheflush.h

2020-06-14 Thread Chuck Lever
Hi Christophe -

> On Jun 14, 2020, at 1:07 PM, Christophe Leroy  
> wrote:
> 
> Even if that's only a warning, not including asm/cacheflush.h
> leads to svc_flush_bvec() being empty although powerpc defines
> ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> 
>  CC  net/sunrpc/svcsock.o
> net/sunrpc/svcsock.c:227:5: warning: "ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE" is 
> not defined [-Wundef]
> #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
> ^
> 
> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
> Signed-off-by: Christophe Leroy 
> ---
> I detected this on linux-next on June 4th and warned Chuck. Seems like it 
> went into mainline anyway.

Thanks for your patch. I've searched my mailbox. It appears I never
received your June 4th e-mail.

Does your patch also address:

    https://marc.info/?l=linux-kernel&m=159194369128024&w=2 ?

If so, then

   Reported-by: kernel test robot 

should be added to the patch description.

Ideally, compilation on x86_64 should have thrown the same warning,
but it didn't. Why would the x86_64 build behave differently than
ppc64 or i386?


> net/sunrpc/svcsock.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 5c4ec9386f81..d9e99cb09aab 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -45,6 +45,7 @@
> #include 
> #include 
> #include 
> +#include <asm/cacheflush.h>

Nit: Let's include <linux/highmem.h> in net/sunrpc/svcsock.h instead
of <asm/cacheflush.h> directly.


> #include 
> #include 
> -- 
> 2.25.0
> 

--
Chuck Lever





Re: net/sunrpc/svcsock.c:226:5: warning: "ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE" is not defined

2020-06-12 Thread Chuck Lever
> ... size_t size,
>   240   size_t seek)
>   241 {
>   242 }
>   243 #endif
>   244 
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
> <.config.gz>

--
Chuck Lever





Re: linux-next: manual merge of the nfsd tree with the nfs-anna tree

2020-05-29 Thread Chuck Lever


> On May 28, 2020, at 8:59 PM, Stephen Rothwell  wrote:
> 
> Hi all,
> 
> Today's linux-next merge of the nfsd tree got a conflict in:
> 
>  include/trace/events/sunrpc.h
> 
> between commit:
> 
>  2baebf955125 ("SUNRPC: Split the xdr_buf event class")
> 
> from the nfs-anna tree and commit:
> 
>  998024dee197 ("SUNRPC: Add more svcsock tracepoints")
> 
> from the nfsd tree.

Alternately, I can provide a v4 nfsd-5.8 series for Bruce that
includes 2baebf955125 ("SUNRPC: Split the xdr_buf event class")
so that these merge conflicts are avoided.


> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc include/trace/events/sunrpc.h
> index 73193c79fcaa,852413cbb7d9..
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@@ -14,9 -14,41 +14,42 @@@
>  #include 
>  #include 
> 
> + TRACE_DEFINE_ENUM(SOCK_STREAM);
> + TRACE_DEFINE_ENUM(SOCK_DGRAM);
> + TRACE_DEFINE_ENUM(SOCK_RAW);
> + TRACE_DEFINE_ENUM(SOCK_RDM);
> + TRACE_DEFINE_ENUM(SOCK_SEQPACKET);
> + TRACE_DEFINE_ENUM(SOCK_DCCP);
> + TRACE_DEFINE_ENUM(SOCK_PACKET);
> + 
> + #define show_socket_type(type)  \
> + __print_symbolic(type,  \
> + { SOCK_STREAM,  "STREAM" }, \
> + { SOCK_DGRAM,   "DGRAM" },  \
> + { SOCK_RAW, "RAW" },\
> + { SOCK_RDM, "RDM" },\
> + { SOCK_SEQPACKET,   "SEQPACKET" },  \
> + { SOCK_DCCP,"DCCP" },   \
> + { SOCK_PACKET,  "PACKET" })
> + 
> + /* This list is known to be incomplete, add new enums as needed. */
> + TRACE_DEFINE_ENUM(AF_UNSPEC);
> + TRACE_DEFINE_ENUM(AF_UNIX);
> + TRACE_DEFINE_ENUM(AF_LOCAL);
> + TRACE_DEFINE_ENUM(AF_INET);
> + TRACE_DEFINE_ENUM(AF_INET6);
> + 
> + #define rpc_show_address_family(family) \
> + __print_symbolic(family,\
> + { AF_UNSPEC,"AF_UNSPEC" },  \
> + { AF_UNIX,  "AF_UNIX" },\
> + { AF_LOCAL, "AF_LOCAL" },   \
> + { AF_INET,  "AF_INET" },\
> + { AF_INET6, "AF_INET6" })
> + 
> -DECLARE_EVENT_CLASS(xdr_buf_class,
> +DECLARE_EVENT_CLASS(rpc_xdr_buf_class,
>   TP_PROTO(
> + const struct rpc_task *task,
>   const struct xdr_buf *xdr
>   ),
> 

--
Chuck Lever
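
An aside on the macros in the resolved hunk: TRACE_DEFINE_ENUM() exports each
enum value so that userspace tools can decode the names that __print_symbolic()
emits in the event's print format. A minimal sketch (hypothetical event name;
the usual TRACE_SYSTEM and include-guard boilerplate of a trace header is
omitted) of how helpers like show_socket_type() are consumed:

TRACE_EVENT(example_svc_socket,
        TP_PROTO(int family, int type),

        TP_ARGS(family, type),

        TP_STRUCT__entry(
                __field(int, family)
                __field(int, type)
        ),

        TP_fast_assign(
                __entry->family = family;
                __entry->type = type;
        ),

        TP_printk("family=%s type=%s",
                rpc_show_address_family(__entry->family),
                show_socket_type(__entry->type))
);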





Re: [PATCH v3 16/26] NFS: Move mount parameterisation bits into their own file

2019-09-11 Thread Chuck Lever



> On Sep 11, 2019, at 12:16 PM, Scott Mayhew  wrote:
> 
> From: David Howells 
> 
> Split various bits relating to mount parameterisation out from
> fs/nfs/super.c into their own file to form the basis of filesystem context
> handling for NFS.
> 
> No other changes are made to the code beyond removing 'static' qualifiers.
> 
> Signed-off-by: David Howells 
> Signed-off-by: Al Viro 
> ---
> fs/nfs/Makefile |2 +-
> fs/nfs/fs_context.c | 1418 +++
> fs/nfs/internal.h   |   29 +
> fs/nfs/super.c  | 1411 --
> 4 files changed, 1448 insertions(+), 1412 deletions(-)
> create mode 100644 fs/nfs/fs_context.c
> 
> diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
> index 34cdeaecccf6..2433c3e03cfa 100644
> --- a/fs/nfs/Makefile
> +++ b/fs/nfs/Makefile
> @@ -9,7 +9,7 @@ CFLAGS_nfstrace.o += -I$(src)
> nfs-y := client.o dir.o file.o getroot.o inode.o 
> super.o \
>  io.o direct.o pagelist.o read.o symlink.o unlink.o \
>  write.o namespace.o mount_clnt.o nfstrace.o \
> -export.o sysfs.o
> +export.o sysfs.o fs_context.o
> nfs-$(CONFIG_ROOT_NFS)+= nfsroot.o
> nfs-$(CONFIG_SYSCTL)  += sysctl.o
> nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> new file mode 100644
> index ..82b312a5cdde
> --- /dev/null
> +++ b/fs/nfs/fs_context.c
> @@ -0,0 +1,1418 @@
> +/* NFS mount handling.
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowe...@redhat.com)
> + *
> + * Split from fs/nfs/super.c:
> + *
> + *  Copyright (C) 1992  Rick Sladkey
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */

New source files should have an SPDX tag instead of boilerplate.
I suggest:

// SPDX-License-Identifier: GPL-2.0-only


> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "nfs.h"
> +#include "internal.h"
> +
> +#define NFSDBG_FACILITY  NFSDBG_MOUNT
> +
> +#if IS_ENABLED(CONFIG_NFS_V3)
> +#define NFS_DEFAULT_VERSION 3
> +#else
> +#define NFS_DEFAULT_VERSION 2
> +#endif
> +
> +#define NFS_MAX_CONNECTIONS 16
> +
> +enum {
> + /* Mount options that take no arguments */
> + Opt_soft, Opt_softerr, Opt_hard,
> + Opt_posix, Opt_noposix,
> + Opt_cto, Opt_nocto,
> + Opt_ac, Opt_noac,
> + Opt_lock, Opt_nolock,
> + Opt_udp, Opt_tcp, Opt_rdma,
> + Opt_acl, Opt_noacl,
> + Opt_rdirplus, Opt_nordirplus,
> + Opt_sharecache, Opt_nosharecache,
> + Opt_resvport, Opt_noresvport,
> + Opt_fscache, Opt_nofscache,
> + Opt_migration, Opt_nomigration,
> +
> + /* Mount options that take integer arguments */
> + Opt_port,
> + Opt_rsize, Opt_wsize, Opt_bsize,
> + Opt_timeo, Opt_retrans,
> + Opt_acregmin, Opt_acregmax,
> + Opt_acdirmin, Opt_acdirmax,
> + Opt_actimeo,
> + Opt_namelen,
> + Opt_mountport,
> + Opt_mountvers,
> + Opt_minorversion,
> +
> + /* Mount options that take string arguments */
> + Opt_nfsvers,
> + Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost,
> + Opt_addr, Opt_mountaddr, Opt_clientaddr,
> + Opt_nconnect,
> + Opt_lookupcache,
> + Opt_fscache_uniq,
> + Opt_local_lock,
> +
> + /* Special mount options */
> + Opt_userspace, Opt_deprecated, Opt_sloppy,
> +
> + Opt_err
> +};
> +
> +static const match_table_t nfs_mount_option_tokens = {
> + { Opt_userspace, "bg" },
> + { Opt_userspace, "fg" },
> + { Opt_userspace, "retry=%s" },
> +
> + { Opt_sloppy, "sloppy" },
> +
> + { Opt_soft, "soft" },
> + { Opt_softerr, "softerr" },
> + { Opt_hard, "hard" },
> + { Opt_deprecated, "intr" },
> + { Opt_deprecated, "nointr" },
> + { Opt_posix, "posix" },
> + { Opt_noposix, "noposix" },
> + { Opt_cto, "cto" },
> + { Opt_nocto, "nocto" },
> + { Opt_ac, "ac" },
> + { Opt_noac, "noac" },
> + { Opt_lock, "lock" },
> + { Opt_nolock, "nolock" },
> + { Opt_udp, "udp" },
> + { Opt_tcp, "tcp" },
> + { Opt_rdma, "rdma" },
> + { Opt_acl, "acl" },
> + { Opt_noacl, "noacl" },
> + { Opt_rdirplus, "rdirplus" },
> + { Opt_nordirplus, "nordirplus" },
> + { Opt_sharecache, "sharecache" },
> + { Opt_nosharecache, "nosharecache" },
> + { Opt_resvport, "resvport" },
> + { Opt_noresvport, "noresvport" },
> + { Opt_fscache, "fsc" },
> + { Opt_nofscache, "nofsc" },
> + { Opt_migration, "migration" },
> + { Opt_nomigration, "nomigration" },
> +
> + { Opt_port, "port=%s" },
> + { Opt_rsize, "rsize=%s" },
> + { 

Re: Regression in 5.1.20: Reading long directory fails

2019-09-11 Thread Chuck Lever



> On Sep 11, 2019, at 1:50 PM, Benjamin Coddington  wrote:
> 
> On 11 Sep 2019, at 13:40, Benjamin Coddington wrote:
> 
>> On 11 Sep 2019, at 13:29, Chuck Lever wrote:
>> 
>>>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington  
>>>> wrote:
>>>> 
>>>> 
>>>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>>> 
>>>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington  
>>>>>> wrote:
>>>>>> 
>>>> 
>>>>>> Instead, I think we want to make sure the mic falls squarely into the 
>>>>>> tail
>>>>>> every time.
>>>>> 
>>>>> I'm not clear how you could do that. The length of the page data is not
>>>>> known to the client before it parses the reply. Are you suggesting that
>>>>> gss_unwrap should do it somehow?
>>>> 
>>>> Is it too naive to always put the mic at the end of the tail?
>>> 
>>> The size of the page content is variable.
>>> 
>>> The only way the MIC will fall into the tail is if the page content is
>>> exactly the largest expected size. When the page content is smaller than
>>> that, the receive logic will place part or all of the MIC in ->pages.
>> 
>> Ok, right.  But what I meant is that xdr_buf_read_netobj() should be renamed
>> and repurposed to be "move the mic from wherever it is to the end of
>> xdr_buf's tail".
>> 
>> But now I see what you mean, and I also see that it is already trying to do
>> that.. and we don't want to overlap the copy..
>> 
>> So, really, we need the tail to be larger than twice the mic.. less 1.  That
>> means the fix is probably just increasing rslack for krb5i.
> 
> .. or we can keep the tighter tail space, and if we detect the mic straddles
> the page and tail, we can move the mic into the tail with 2 copies, first
> move the bit in the tail back, then move the bit in the pages.
> 
> Which is preferred, less allocation, or in the rare case this occurs, doing
> copy twice?

It sounds like the bug is that the current code does not deal correctly
when the MIC crosses the boundary between ->pages and ->tail? I'd like
to see that addressed rather than changing rslack.


--
Chuck Lever
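
A rough sketch of the condition under discussion (mic_straddles_pages_and_tail()
is a hypothetical helper, not the eventual fix): the GSS MIC occupies mic_len
bytes starting at offset into the xdr_buf; when it begins inside ->pages but
ends beyond page_len, it spans the ->pages/->tail boundary and has to be made
contiguous before it can be verified.

#include <linux/types.h>
#include <linux/sunrpc/xdr.h>

static bool mic_straddles_pages_and_tail(const struct xdr_buf *buf,
                                         unsigned int offset,
                                         unsigned int mic_len)
{
        unsigned int page_start = buf->head[0].iov_len;
        unsigned int page_end = page_start + buf->page_len;

        /* starts inside ->pages but spills past the end of the page data */
        return offset >= page_start && offset < page_end &&
               offset + mic_len > page_end;
}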




