Re: [Cluster-devel] [PATCH 01/35] fscache: Remove unused ->now_uncached callback
Jan Kara wrote:

> The callback doesn't ever get called. Remove it.

Hmmm... I should perhaps be calling this. I'm not sure why I never did.

At the moment, it doesn't strictly matter as ops on pages marked with
PG_fscache get ignored if the cache has suffered an I/O error or has been
withdrawn - but it will incur a performance penalty (the PG_fscache flag is
checked in the netfs before calling into fscache).

The downside of calling this is that when a cache is removed, fscache would go
through all the cookies for that cache and iterate over all the pages
associated with those cookies - which could cause a performance dip in the
system.

David
Re: [Cluster-devel] [PATCH 1/2] gfs2: Convert gfs2 to fs_context
Andrew Price wrote:

> sget() is still used instead of sget_fc() as there doesn't seem to be an
> obvious replacement for the bdev pointer propagation to the test/set
> functions (yet?)

Umm... What about the fs_context struct? Why can't that be used to propagate
the bdev pointer? That's kind of what it's for...

	struct super_block *sget_fc(
		struct fs_context *fc,
		int (*test)(struct super_block *, struct fs_context *),
		int (*set)(struct super_block *, struct fs_context *))

It looks like you should be able to stash the bdev pointer in the gfs2_args
struct.

> +	fsparam_s32("commit", Opt_commit),
> +	fsparam_s32("statfs_quantum", Opt_statfs_quantum),
> +	fsparam_s32("statfs_percent", Opt_statfs_percent),

Why s32? Why not u32?

David
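The suggestion above (carry the bdev pointer in the filesystem's private args and read it back in the test/set callbacks) can be sketched in userspace as follows. This is only an illustration under stated assumptions: every `mock_*` type, and the `ar_bdev` field in particular, is a hypothetical stand-in rather than a real kernel definition. Only the callback shape mirrors the sget_fc() prototype quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct mock_bdev {
	int dev;
};

struct mock_gfs2_args {
	struct mock_bdev *ar_bdev;	/* assumed field: bdev stashed at mount time */
};

struct mock_fs_context {
	void *fs_private;		/* points at a struct mock_gfs2_args */
};

struct mock_super_block {
	struct mock_bdev *s_bdev;
	int s_dev;
};

/* test callback: does this superblock sit on the bdev named in the context? */
static int mock_test_super(struct mock_super_block *s,
			   struct mock_fs_context *fc)
{
	struct mock_gfs2_args *args = fc->fs_private;

	return s->s_bdev == args->ar_bdev;
}

/* set callback: initialise a fresh superblock from the context */
static int mock_set_super(struct mock_super_block *s,
			  struct mock_fs_context *fc)
{
	struct mock_gfs2_args *args = fc->fs_private;

	assert(args->ar_bdev != NULL);
	s->s_bdev = args->ar_bdev;
	s->s_dev = args->ar_bdev->dev;
	return 0;
}
```

Both callbacks receive only the context, so no separate `void *data` cookie is needed to carry the block device.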
Re: [Cluster-devel] [PATCH 1/2] gfs2: Convert gfs2 to fs_context
Andrew Price wrote:

> > Umm... What about the fs_context struct? Why can't that be used to
> > propagate the bdev pointer? That's kind of what it's for...
>
> It would be useful to have the block device pointer in the fs_context since so
> many of the filesystems use them and it makes for an obvious API migration.

That may be so. I've argued also that we should put a net-namespace pointer in
there, but Al disagreed.

However, I think most bdev-based filesystems use mount_bdev() which gfs2 does
not.

> > It looks like you should be able to stash the bdev pointer in the gfs2_args
> > struct.
>
> Sure, but since the new API is young I figured I'd hold off until we had this
> conversation because adding it to the fs_context might be agreeable :)

Note that I have patches to kill off sget_userns() and sget() will hopefully
soon follow. Have a look at:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=mount-api-viro

David
Re: [Cluster-devel] [PATCH v2 1/2] gfs2: Convert gfs2 to fs_context
Andrew Price wrote:

> +		pr_warn("-o debug and -o errors=panic are mutually exclusive\n");
> +		return -EINVAL;

	return invalf(fc, "gfs2: -o debug and -o errors=panic are mutually exclusive");

(Note: no "\n")

> +		if (result.int_32 > 0)
> +			args->ar_quota = opt_quota_values[result.int_32];
> +		else if (result.negated)
> +			args->ar_quota = GFS2_QUOTA_OFF;
> +		else
> +			args->ar_quota = GFS2_QUOTA_ON;

I recommend checking result.negated first.

> +	/* Not allowed to change locking details */
> +	if (strcmp(newargs->ar_lockproto, oldargs->ar_lockproto) ||
> +	    strcmp(newargs->ar_locktable, oldargs->ar_locktable) ||
> +	    strcmp(newargs->ar_hostdata, oldargs->ar_hostdata))
> +		return -EINVAL;

Use errorf(). (Not invalf - the parameter isn't exactly invalid, it's just
that you're not allowed to do this operation).

> +		error = gfs2_make_fs_ro(sdp);
> +	else
> +		error = gfs2_make_fs_rw(sdp);
> +	if (error)
> +		return error;

Might want to call errorf() here too.

> -	s = sget(&gfs2_fs_type, test_gfs2_super, set_meta_super, flags,
> +	s = sget(&gfs2_fs_type, test_meta_super, set_meta_super, flags,

Try and use sget_fc() please. If you look at the fuse patchset I cc'd you on,
there's a commit there that adds a ->bdev and ->bdev_mode to fs_context that
may be of use to you.

Can you use vfs_get_block_super()?

Would it be of use to export test_bdev_super_fc() and set_bdev_super_fc()?

David
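The recommendation to check result.negated first can be illustrated with a small standalone sketch. The `mock_*` names, the enum values and the shape of the result struct are assumptions made for illustration; they only imitate the fields used in the quoted hunk and are not the fs_parser definitions.

```c
#include <assert.h>
#include <stdbool.h>

enum mock_quota { MOCK_QUOTA_OFF, MOCK_QUOTA_ACCOUNT, MOCK_QUOTA_ON };

/* Hypothetical imitation of the parse result used in the quoted hunk. */
struct mock_parse_result {
	bool negated;	/* option was given in its "no..." form */
	int int_32;	/* index of the matched "quota=<value>", 0 if none */
};

static const enum mock_quota mock_quota_values[] = {
	MOCK_QUOTA_OFF, MOCK_QUOTA_ACCOUNT, MOCK_QUOTA_ON,
};

#define MOCK_NR_QUOTA_VALUES \
	((int)(sizeof(mock_quota_values) / sizeof(mock_quota_values[0])))

/* Decide the quota mode, testing the negated form before anything else. */
static enum mock_quota mock_parse_quota(const struct mock_parse_result *result)
{
	if (result->negated)
		return MOCK_QUOTA_OFF;
	if (result->int_32 > 0) {
		assert(result->int_32 < MOCK_NR_QUOTA_VALUES);
		return mock_quota_values[result->int_32];
	}
	return MOCK_QUOTA_ON;	/* bare "quota" */
}
```

With the negated test first, the explicit "off" request is handled before the value index is consulted at all.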
Re: [Cluster-devel] [PATCH v2 1/2] gfs2: Convert gfs2 to fs_context
Andrew Price wrote:

> Would this fly?

Can you update the comment to document what it's for?

David
Re: [Cluster-devel] [PATCH] vfs: Allow selection of fs root independent of sb
That's not what I meant. There's a kerneldoc comment on vfs_get_block_super()
that you need to modify. I would in any case roll your patch into mine.

> This is intended for gfs2 to select between the normal root and the
> 'meta' root, which must be done whether a root already exists for a
> superblock or not, i.e. it cannot be decided in fill_super().

Can this be done in gfs2_get_tree, where it sets fc->root?

David
[Cluster-devel] [RFC PATCH 00/68] VFS: Convert a bunch of filesystems to the new mount API
Hi Al,

Here's a set of patches that converts a bunch of filesystems (but not yet
all!) to the new mount API. To this end, it makes the following changes:

 (1) Provides a convenience member in struct fs_context that is OR'd into
     sb->s_iflags by sget_fc().

 (2) Provides a convenience helper function, vfs_init_pseudo_fs_context(),
     for doing most of the work in mounting a pseudo filesystem.

 (3) Provides a convenience helper function, vfs_get_block_super(), for
     doing the work in setting up a block-based superblock.

 (4) Improves the handling of fd-type parameters.

 (5) Moves some of the subtype handling into fuse.

 (6) Provides a convenience helper function, vfs_get_mtd_super(), for
     doing the work in setting up an MTD device-based superblock.

 (7) Kills off mount_pseudo(), mount_pseudo_xattr(), mount_ns(),
     sget_userns(), mount_mtd(), mount_single().

 (8) Converts a slew of filesystems to use the mount API.

 (9) Fixes a bug in hypfs.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	mount-api-viro

David
---
Andrew Price (1):
      gfs2: Convert gfs2 to fs_context

David Howells (66):
      vfs: Update mount API docs
      vfs: Fix refcounting of filenames in fs_parser
      vfs: Provide sb->s_iflags settings in fs_context struct
      vfs: Provide a mount_pseudo-replacement for the new mount API
      vfs: Convert aio to use the new mount API
      vfs: Convert anon_inodes to use the new mount API
      vfs: Convert bdev to use the new mount API
      vfs: Convert nsfs to use the new mount API
      vfs: Convert pipe to use the new mount API
      vfs: Convert zsmalloc to use the new mount API
      vfs: Convert sockfs to use the new mount API
      vfs: Convert dax to use the new mount API
      vfs: Convert drm to use the new mount API
      vfs: Convert ia64 perfmon to use the new mount API
      vfs: Convert cxl to use the new mount API
      vfs: Convert ocxlflash to use the new mount API
      vfs: Convert virtio_balloon to use the new mount API
      vfs: Convert btrfs_test to use the new mount API
      vfs: Kill off mount_pseudo() and mount_pseudo_xattr()
      vfs: Use sget_fc() for pseudo-filesystems
      vfs: Convert binderfs to use the new mount API
      vfs: Convert nfsctl to use the new mount API
      vfs: Convert rpc_pipefs to use the new mount API
      vfs: Kill mount_ns()
      vfs: Kill sget_userns()
      vfs: Convert binfmt_misc to use the new mount API
      vfs: Convert configfs to use the new mount API
      vfs: Convert efivarfs to use the new mount API
      vfs: Convert fusectl to use the new mount API
      vfs: Convert qib_fs/ipathfs to use the new mount API
      vfs: Convert ibmasmfs to use the new mount API
      vfs: Convert oprofilefs to use the new mount API
      vfs: Convert gadgetfs to use the new mount API
      vfs: Convert xenfs to use the new mount API
      vfs: Convert openpromfs to use the new mount API
      vfs: Convert apparmorfs to use the new mount API
      vfs: Convert securityfs to use the new mount API
      vfs: Convert selinuxfs to use the new mount API
      vfs: Convert smackfs to use the new mount API
      vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API
      vfs: Create fs_context-aware mount_bdev() replacement
      vfs: Make fs_parse() handle fs_param_is_fd-type params better
      vfs: Convert fuse to use the new mount API
      vfs: Move the subtype parameter into fuse
      mtd: Provide fs_context-aware mount_mtd() replacement
      vfs: Convert romfs to use the new mount API
      vfs: Convert cramfs to use the new mount API
      vfs: Convert jffs2 to use the new mount API
      mtd: Kill mount_mtd()
      vfs: Convert squashfs to use the new mount API
      vfs: Convert ceph to use the new mount API
      vfs: Convert functionfs to use the new mount API
      vfs: Add a single-or-reconfig keying to vfs_get_super()
      vfs: Convert debugfs to use the new mount API
      vfs: Convert tracefs to use the new mount API
      vfs: Convert pstore to use the new mount API
      hypfs: Fix error number left in struct pointer member
      vfs: Convert hypfs to use the new mount API
      vfs: Convert spufs to use the new mount API
      vfs: Kill mount_single()
      vfs: Convert coda to use the new mount API
      vfs: Convert autofs to use the new mount API
      vfs: Convert devpts to use the new mount API
      vfs: Convert bpf to use the new mount API
      vfs: Convert ubifs to use the new mount API
      vfs: Convert orangefs to use the new mount API

Masahiro Yamada (1):
      kbuild: skip sub-make for in-tree build with GNU Make 4.x

 Documentation/filesystems/mount_api.txt |  367 ---
 Documentation/filesystems/vfs.txt       |    4
 Makefile                                |   31 +
 arch/ia64/kernel/perfmon.c              |   14 -
 arch/powerpc/platforms/cell/spuf
[Cluster-devel] [RFC PATCH 68/68] gfs2: Convert gfs2 to fs_context
From: Andrew Price

Convert gfs2 and gfs2meta to fs_context. Removes the duplicated vfs code
from gfs2_mount and instead uses the new vfs_get_block_super() before
switching the ->root to the appropriate dentry.

The mount option parsing has been converted to the new API and error
reporting for invalid options has been made more precise at the same
time.

All of the mount/remount code has been moved into ops_fstype.c

Signed-off-by: Andrew Price
Signed-off-by: David Howells
cc: cluster-devel@redhat.com
---

 fs/gfs2/incore.h     |    8 -
 fs/gfs2/ops_fstype.c |  495 ++
 fs/gfs2/super.c      |  335 --
 fs/gfs2/super.h      |    3
 4 files changed, 382 insertions(+), 459 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index cdf07b408f54..10febb298da5 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -587,10 +587,10 @@ struct gfs2_args {
 	unsigned int ar_rgrplvb:1;		/* use lvbs for rgrp info */
 	unsigned int ar_loccookie:1;		/* use location based readdir cookies */
-	int ar_commit;				/* Commit interval */
-	int ar_statfs_quantum;			/* The fast statfs interval */
-	int ar_quota_quantum;			/* The quota interval */
-	int ar_statfs_percent;			/* The % change to force sync */
+	s32 ar_commit;				/* Commit interval */
+	s32 ar_statfs_quantum;			/* The fast statfs interval */
+	s32 ar_quota_quantum;			/* The quota interval */
+	s32 ar_statfs_percent;			/* The % change to force sync */
 };

 struct gfs2_tune {
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index b041cb8ae383..8e8d624170a1 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include

 #include "gfs2.h"
 #include "incore.h"
@@ -1024,16 +1025,17 @@ void gfs2_online_uevent(struct gfs2_sbd *sdp)
 }

 /**
- * fill_super - Read in superblock
+ * gfs2_fill_super - Read in superblock
  * @sb: The VFS superblock
- * @data: Mount options
+ * @args: Mount options
  * @silent: Don't complain if it's not a GFS2 filesystem
  *
- * Returns: errno
+ * Returns: -errno
  */
-
-static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent)
+static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
 {
+	struct gfs2_args *args = fc->fs_private;
+	int silent = fc->sb_flags & SB_SILENT;
 	struct gfs2_sbd *sdp;
 	struct gfs2_holder mount_gh;
 	int error;
@@ -1200,161 +1202,415 @@ static int fill_super(struct super_block *sb, struct gfs2_args *args, int silent
 	return error;
 }

-static int set_gfs2_super(struct super_block *s, void *data)
+/**
+ * gfs2_get_tree - Get the GFS2 superblock and root directory
+ * @fc: The filesystem context
+ *
+ * Returns: 0 or -errno on error
+ */
+static int gfs2_get_tree(struct fs_context *fc)
 {
-	s->s_bdev = data;
-	s->s_dev = s->s_bdev->bd_dev;
-	s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
+	struct gfs2_args *args = fc->fs_private;
+	struct gfs2_sbd *sdp;
+	int error;
+
+	error = vfs_get_block_super(fc, gfs2_fill_super);
+	if (error)
+		return error;
+
+	sdp = fc->root->d_sb->s_fs_info;
+	dput(fc->root);
+	if (args->ar_meta)
+		fc->root = dget(sdp->sd_master_dir);
+	else
+		fc->root = dget(sdp->sd_root_dir);
 	return 0;
 }

-static int test_gfs2_super(struct super_block *s, void *ptr)
+static void gfs2_fc_free(struct fs_context *fc)
 {
-	struct block_device *bdev = ptr;
-	return (bdev == s->s_bdev);
+	struct gfs2_args *args = fc->fs_private;
+
+	kfree(args);
 }

-/**
- * gfs2_mount - Get the GFS2 superblock
- * @fs_type: The GFS2 filesystem type
- * @flags: Mount flags
- * @dev_name: The name of the device
- * @data: The mount arguments
- *
- * Q. Why not use get_sb_bdev() ?
- * A. We need to select one of two root directories to mount, independent
- *    of whether this is the initial, or subsequent, mount of this sb
- *
- * Returns: 0 or -ve on error
- */
+enum gfs2_param {
+	Opt_lockproto,
+	Opt_locktable,
+	Opt_hostdata,
+	Opt_spectator,
+	Opt_ignore_local_fs,
+	Opt_localflocks,
+	Opt_localcaching,
+	Opt_debug,
+	Opt_upgrade,
+	Opt_acl,
+	Opt_quota,
+	Opt_suiddir,
+	Opt_data,
+	Opt_meta,
+	Opt_discard,
+	Opt_commit,
+	Opt_errors,
+	Opt_statfs_quantum,
+	Opt_statfs_percent,
+	Opt_quota_quantum,
+	Opt_barrier,
+	Opt_rgrplvb,
+	Opt_loccookie,
+
Re: [Cluster-devel] [RFC][PATCH] wake_up_var() memory ordering
Peter Zijlstra wrote:

> I tried using wake_up_var() today and accidentally noticed that it
> didn't imply an smp_mb() and specifically requires it through
> wake_up_bit() / waitqueue_active().

Thinking about it again, I'm not sure why you need to add the barrier when
wake_up() (which this is a wrapper around) is required to impose a barrier at
the front if there's anything to wake up (ie. the wait queue isn't empty).

If this is insufficient, does it make sense just to have wake_up*() functions
do an unconditional release or full barrier right at the front, rather than it
being conditional on something being woken up?

> @@ -619,9 +614,7 @@ static int dvb_usb_fe_sleep(struct dvb_frontend *fe)
>  err:
>  	if (!adap->suspend_resume_active) {
>  		adap->active_fe = -1;

I'm wondering if there's a missing barrier here. Should the clear_bit() on
the next line be clear_bit_unlock() or clear_bit_release()?

> -		clear_bit(ADAP_SLEEP, &adap->state_bits);
> -		smp_mb__after_atomic();
> -		wake_up_bit(&adap->state_bits, ADAP_SLEEP);
> +		clear_and_wake_up_bit(ADAP_SLEEP, &adap->state_bits);
>  	}
>
>  	dev_dbg(&d->udev->dev, "%s: ret=%d\n", __func__, ret);
> diff --git a/fs/afs/fs_probe.c b/fs/afs/fs_probe.c
> index cfe62b154f68..377ee07d5f76 100644
> --- a/fs/afs/fs_probe.c
> +++ b/fs/afs/fs_probe.c
> @@ -18,6 +18,7 @@ static bool afs_fs_probe_done(struct afs_server *server)
>
>  	wake_up_var(&server->probe_outstanding);
>  	clear_bit_unlock(AFS_SERVER_FL_PROBING, &server->flags);
> +	smp_mb__after_atomic();
>  	wake_up_bit(&server->flags, AFS_SERVER_FL_PROBING);
>  	return true;
>  }

Looking at this and the dvb one, does it make sense to stick the release
semantics of clear_bit_unlock() into clear_and_wake_up_bit()?

Also, should clear_bit_unlock() be renamed to clear_bit_release() (and
similarly test_and_set_bit_lock() -> test_and_set_bit_acquire()) if we seem to
be trying to standardise on that terminology.

David
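The ordering question raised above (a release-style clear, then a full barrier, then the check for waiters) can be pictured with a userspace analogy built on C11 atomics. This is not kernel code. Mapping clear_bit_unlock() to a release read-modify-write and smp_mb__after_atomic() to a sequentially consistent fence is an assumption made purely for illustration, and `struct mock_waitable` with its waiter count is a hypothetical stand-in for a wait queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a flags word plus its wait queue. */
struct mock_waitable {
	atomic_ulong flags;
	atomic_int nr_waiters;	/* non-zero means "wait queue is not empty" */
	int wakeups;		/* number of wake-ups issued, for illustration */
};

/* Clear a bit with release semantics, then wake a waiter if there is one. */
static bool mock_clear_and_wake_up_bit(int bit, struct mock_waitable *w)
{
	assert(bit >= 0 && bit < (int)(sizeof(unsigned long) * 8));

	/* Release: writes made before the clear are visible to whoever
	 * subsequently observes the bit as clear. */
	atomic_fetch_and_explicit(&w->flags, ~(1UL << bit),
				  memory_order_release);

	/* Full fence: order the clear before the waiter check below. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&w->nr_waiters, memory_order_relaxed) == 0)
		return false;	/* nobody to wake */
	w->wakeups++;
	return true;
}
```

Folding the release into the helper, as the message suggests, means callers cannot forget it; the fence is still needed because a release alone does not order the clear against the later load of the waiter state.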
Re: [Cluster-devel] [PATCH 29/33] rxrpc_sock_set_min_security_level
Christoph Hellwig wrote:

> +int rxrpc_sock_set_min_security_level(struct sock *sk, unsigned int val);
> +

Looks good - but you do need to add this to Documentation/networking/rxrpc.txt
also, thanks.

David
Re: [Cluster-devel] [PATCH 21/33] ipv4: add ip_sock_set_mtu_discover
Christoph Hellwig wrote:

> +		ip_sock_set_mtu_discover(conn->params.local->socket->sk,
> +					 IP_PMTUDISC_DONT);

Um... The socket in question could be an AF_INET6 socket, not an AF_INET4
socket - I presume it will work in that case. If so:

Reviewed-by: David Howells [rxrpc bits]
Re: [Cluster-devel] [PATCH 20/33] ipv4: add ip_sock_set_recverr
Christoph Hellwig wrote:

> Add a helper to directly set the IP_RECVERR sockopt from kernel space
> without going through a fake uaccess.

It looks like if this is an AF_INET6 socket, it will just pass the message
straight through to AF_INET4, so:

Reviewed-by: David Howells
Re: [Cluster-devel] [PATCH 23/33] ipv6: add ip6_sock_set_recverr
Christoph Hellwig wrote:

> Add a helper to directly set the IPV6_RECVERR sockopt from kernel space
> without going through a fake uaccess.
>
> Signed-off-by: Christoph Hellwig

Reviewed-by: David Howells
Re: [Cluster-devel] [PATCH 06/33] net: add sock_set_timestamps
Christoph Hellwig wrote:

> Add a helper to directly set the SO_TIMESTAMP* sockopts from kernel space
> without going through a fake uaccess.
>
> Signed-off-by: Christoph Hellwig

Reviewed-by: David Howells
Re: [Cluster-devel] [PATCH 29/33] rxrpc_sock_set_min_security_level
Christoph Hellwig wrote:

> > Looks good - but you do need to add this to
> > Documentation/networking/rxrpc.txt
> > also, thanks.
>
> That file doesn't exist, instead we now have a
> Documentation/networking/rxrpc.rst in weird markup.

Yeah - that's only in net/next thus far.

> Where do you want this to be added, and with what text? Remember I don't
> really know what this thing does, I just provide a shortcut.

The document itself describes what each rxrpc sockopt does. Just look for
RXRPC_MIN_SECURITY_LEVEL in there;-)

Anyway, see the attached. This also fixes a couple of errors in the doc that
I noticed.

David
---
diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
index 5ad35113d0f4..68552b92dc44 100644
--- a/Documentation/networking/rxrpc.rst
+++ b/Documentation/networking/rxrpc.rst
@@ -477,7 +477,7 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
 	 Encrypted checksum plus packet padded and first eight bytes of packet
 	 encrypted - which includes the actual packet length.

-     (c) RXRPC_SECURITY_ENCRYPTED
+     (c) RXRPC_SECURITY_ENCRYPT

 	 Encrypted checksum plus entire packet padded and encrypted, including
 	 actual packet length.
@@ -578,7 +578,7 @@ A client would issue an operation by:
      This issues a request_key() to get the key representing the security
      context. The minimum security level can be set::

-	unsigned int sec = RXRPC_SECURITY_ENCRYPTED;
+	unsigned int sec = RXRPC_SECURITY_ENCRYPT;
 	setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
 		   &sec, sizeof(sec));

@@ -1090,6 +1090,15 @@ The kernel interface functions are as follows:
      jiffies). In the event of the timeout occurring, the call will be
      aborted and -ETIME or -ETIMEDOUT will be returned.

+ (#) Apply the RXRPC_MIN_SECURITY_LEVEL sockopt to a socket from within in the
+     kernel::
+
+       int rxrpc_sock_set_min_security_level(struct sock *sk,
+					     unsigned int val);
+
+     This specifies the minimum security level required for calls on this
+     socket.
+

 Configurable Parameters
 ===
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 7dfcbd58da85..e313dae01674 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -57,7 +57,7 @@ int afs_open_socket(struct afs_net *net)
 	srx.transport.sin6.sin6_port	= htons(AFS_CM_PORT);

 	ret = rxrpc_sock_set_min_security_level(socket->sk,
-					       RXRPC_SECURITY_ENCRYPT);
+						RXRPC_SECURITY_ENCRYPT);
 	if (ret < 0)
 		goto error_2;
Re: [Cluster-devel] [PATCH 21/33] ipv4: add ip_sock_set_mtu_discover
Christoph Hellwig wrote:

> > > +		ip_sock_set_mtu_discover(conn->params.local->socket->sk,
> > > +					 IP_PMTUDISC_DONT);
> >
> > Um... The socket in question could be an AF_INET6 socket, not an AF_INET4
> > socket - I presume it will work in that case. If so:
>
> Yes, the implementation of that sockopt, including the inet_sock
> structure where these options are set is shared between ipv4 and ipv6.

Great! Could you note that either in the patch description or in the
kerneldoc attached to the function?

Thanks,
David
Re: [Cluster-devel] [PATCH 27/33] sctp: export sctp_setsockopt_bindx
Christoph Hellwig wrote:

> > The advantage on using kernel_setsockopt here is that sctp module will
> > only be loaded if dlm actually creates a SCTP socket. With this
> > change, sctp will be loaded on setups that may not be actually using
> > it. It's a quite big module and might expose the system.
>
> True. Note that the intent is to kill kernel space callers of setsockopt,
> as I plan to remove the set_fs address space override used for it.

For getsockopt, does it make sense to have the core kernel load optval/optlen
into a buffer before calling the protocol driver? Then the driver need not
see the userspace pointer at all.

Similar could be done for setsockopt - allocate a buffer of the size requested
by the user inside the kernel and pass it into the driver, then copy the data
back afterwards.

David
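The buffering idea floated above, where the core copies the option data into its own buffer so that the protocol driver never handles the caller's pointer, can be sketched in plain userspace C. Everything named `mock_*` is hypothetical, `memcpy()` merely stands in for copy_from_user(), and the size cap is arbitrary. The sketch shows only the set direction and is not a proposal for the actual kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_MAX_OPTLEN 256	/* arbitrary cap chosen for this sketch */

/* Hypothetical driver hook: it only ever sees a core-owned buffer. */
typedef int (*mock_setsockopt_fn)(int optname, const void *kbuf, size_t len);

static int mock_last_value;	/* what the example driver stored */

/* Example driver that accepts a single int-sized option. */
static int mock_driver_set_int(int optname, const void *kbuf, size_t len)
{
	(void)optname;
	if (len != sizeof(int))
		return -EINVAL;
	memcpy(&mock_last_value, kbuf, sizeof(int));
	return 0;
}

/*
 * Core-side wrapper: bounce the caller's data into a private buffer and
 * hand only that buffer to the driver.
 */
static int mock_core_setsockopt(mock_setsockopt_fn driver, int optname,
				const void *user_optval, size_t optlen)
{
	void *kbuf;
	int ret;

	assert(driver != NULL);
	if (optlen == 0 || optlen > MOCK_MAX_OPTLEN)
		return -EINVAL;

	kbuf = malloc(optlen);
	if (!kbuf)
		return -ENOMEM;
	memcpy(kbuf, user_optval, optlen);	/* stand-in for copy_from_user() */

	ret = driver(optname, kbuf, optlen);
	free(kbuf);
	return ret;
}
```

One consequence of this shape is that in-kernel callers and userspace callers can share the same driver entry point, since the driver no longer cares where the data originally lived.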
Re: [Cluster-devel] [PATCH 29/33] rxrpc: add rxrpc_sock_set_min_security_level
Christoph Hellwig wrote:

> Add a helper to directly set the RXRPC_MIN_SECURITY_LEVEL sockopt from
> kernel space without going through a fake uaccess.
>
> Thanks to David Howells for the documentation updates.
>
> Signed-off-by: Christoph Hellwig

Acked-by: David Howells
Re: [Cluster-devel] [PATCH 05/23] afs: Convert afs_writepages_region() to use filemap_get_folios_tag()
Vishal Moola (Oracle) wrote:

> Convert to use folios throughout. This function is in preparation to
> remove find_get_pages_range_tag().
>
> Also modified this function to write the whole batch one at a time,
> rather than calling for a new set every single write.
>
> Signed-off-by: Vishal Moola (Oracle)

Tested-by: David Howells
Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag
Benjamin Coddington wrote:

> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.

Um, what's task_frag?

David
Re: [Cluster-devel] [PATCH] filelock: move file locking definitions to separate header file
Jeff Layton wrote:

> The file locking definitions have lived in fs.h since the dawn of time,
> but they are only used by a small subset of the source files that
> include it.
>
> Move the file locking definitions to a new header file, and add the
> appropriate #include directives to the source files that need them. By
> doing this we trim down fs.h a bit and limit the amount of rebuilding
> that has to be done when we make changes to the file locking APIs.
>
> Signed-off-by: Jeff Layton

Reviewed-by: David Howells
[Cluster-devel] [RFC PATCH 26/28] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells
cc: Christine Caulfield
cc: David Teigland
cc: "David S. Miller"
cc: Eric Dumazet
cc: Jakub Kicinski
cc: Paolo Abeni
cc: Jens Axboe
cc: Matthew Wilcox
cc: cluster-devel@redhat.com
cc: net...@vger.kernel.org
---
 fs/dlm/lowcomms.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index a9b14f81d655..9c0c691b6106 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1394,8 +1394,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
 /* Send a message */
 static int send_to_sock(struct connection *con)
 {
-	const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
 	struct writequeue_entry *e;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+	};
 	int len, offset, ret;

 	spin_lock_bh(&con->writequeue_lock);
@@ -1411,8 +1414,9 @@ static int send_to_sock(struct connection *con)
 	WARN_ON_ONCE(len == 0 && e->users == 0);
 	spin_unlock_bh(&con->writequeue_lock);

-	ret = kernel_sendpage(con->sock, e->page, offset, len,
-			      msg_flags);
+	bvec_set_page(&bvec, e->page, len, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+	ret = sock_sendmsg(con->sock, &msg);
 	trace_dlm_send(con->nodeid, ret);
 	if (ret == -EAGAIN || ret == 0) {
 		lock_sock(con->sock->sk);
Re: [Cluster-devel] [linux-next:master] [splice] 2cb1e08985: stress-ng.sendfile.ops_per_sec 11.6% improvement
kernel test robot wrote:

> kernel test robot noticed a 11.6% improvement of
> stress-ng.sendfile.ops_per_sec on:

If it's sending to a socket, this is entirely feasible. The
splice_to_socket() function now sends multiple pages in one go to the network
protocol's sendmsg() method to process instead of using sendpage to send one
page at a time.

David
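The effect described above, handing several buffers to sendmsg() in one call rather than making one call per buffer, has a loose userspace analogue in scatter-gather I/O. The sketch below uses the ordinary POSIX sendmsg() with an iovec array. It does not use MSG_SPLICE_PAGES, which is a kernel-internal flag, and the helper name `send_buffers_at_once` is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Send `count` buffers with a single sendmsg() call instead of issuing
 * one send per buffer.  Returns the number of bytes queued, or -1 with
 * errno set on failure.
 */
static ssize_t send_buffers_at_once(int fd, struct iovec *bufs, size_t count)
{
	struct msghdr msg;

	assert(bufs != NULL);
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = bufs;
	msg.msg_iovlen = count;

	return sendmsg(fd, &msg, MSG_NOSIGNAL);
}
```

Batching this way crosses into the protocol layer once per batch rather than once per buffer, which is the same kind of saving the message attributes the benchmark improvement to.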
[Cluster-devel] [PATCH net-next 09/17] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells
cc: Christine Caulfield
cc: David Teigland
cc: "David S. Miller"
cc: Eric Dumazet
cc: Jakub Kicinski
cc: Paolo Abeni
cc: Jens Axboe
cc: Matthew Wilcox
cc: cluster-devel@redhat.com
cc: net...@vger.kernel.org
---
 fs/dlm/lowcomms.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 3d3802c47b8b..5c12d8cdfc16 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1395,8 +1395,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
 /* Send a message */
 static int send_to_sock(struct connection *con)
 {
-	const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
 	struct writequeue_entry *e;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+	};
 	int len, offset, ret;

 	spin_lock_bh(&con->writequeue_lock);
@@ -1412,8 +1415,9 @@ static int send_to_sock(struct connection *con)
 	WARN_ON_ONCE(len == 0 && e->users == 0);
 	spin_unlock_bh(&con->writequeue_lock);

-	ret = kernel_sendpage(con->sock, e->page, offset, len,
-			      msg_flags);
+	bvec_set_page(&bvec, e->page, len, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+	ret = sock_sendmsg(con->sock, &msg);
 	trace_dlm_send(con->nodeid, ret);
 	if (ret == -EAGAIN || ret == 0) {
 		lock_sock(con->sock->sk);
[Cluster-devel] [PATCH net-next v2 09/17] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced rather using sendpage. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells cc: Christine Caulfield cc: David Teigland cc: "David S. Miller" cc: Eric Dumazet cc: Jakub Kicinski cc: Paolo Abeni cc: Jens Axboe cc: Matthew Wilcox cc: cluster-devel@redhat.com cc: net...@vger.kernel.org --- fs/dlm/lowcomms.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 3d3802c47b8b..5c12d8cdfc16 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -1395,8 +1395,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg) /* Send a message */ static int send_to_sock(struct connection *con) { - const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL; struct writequeue_entry *e; + struct bio_vec bvec; + struct msghdr msg = { + .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL, + }; int len, offset, ret; spin_lock_bh(&con->writequeue_lock); @@ -1412,8 +1415,9 @@ static int send_to_sock(struct connection *con) WARN_ON_ONCE(len == 0 && e->users == 0); spin_unlock_bh(&con->writequeue_lock); - ret = kernel_sendpage(con->sock, e->page, offset, len, - msg_flags); + bvec_set_page(&bvec, e->page, len, offset); + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len); + ret = sock_sendmsg(con->sock, &msg); trace_dlm_send(con->nodeid, ret); if (ret == -EAGAIN || ret == 0) { lock_sock(con->sock->sk);
[Cluster-devel] [PATCH net-next v3 09/18] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced rather using sendpage. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells cc: Christine Caulfield cc: David Teigland cc: "David S. Miller" cc: Eric Dumazet cc: Jakub Kicinski cc: Paolo Abeni cc: Jens Axboe cc: Matthew Wilcox cc: cluster-devel@redhat.com cc: net...@vger.kernel.org --- fs/dlm/lowcomms.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 3d3802c47b8b..5c12d8cdfc16 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -1395,8 +1395,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg) /* Send a message */ static int send_to_sock(struct connection *con) { - const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL; struct writequeue_entry *e; + struct bio_vec bvec; + struct msghdr msg = { + .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL, + }; int len, offset, ret; spin_lock_bh(&con->writequeue_lock); @@ -1412,8 +1415,9 @@ static int send_to_sock(struct connection *con) WARN_ON_ONCE(len == 0 && e->users == 0); spin_unlock_bh(&con->writequeue_lock); - ret = kernel_sendpage(con->sock, e->page, offset, len, - msg_flags); + bvec_set_page(&bvec, e->page, len, offset); + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len); + ret = sock_sendmsg(con->sock, &msg); trace_dlm_send(con->nodeid, ret); if (ret == -EAGAIN || ret == 0) { lock_sock(con->sock->sk);
[Cluster-devel] [PATCH net-next v4 06/15] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage. This allows ->sendpage() to be replaced by something that
can handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells
cc: Christine Caulfield
cc: David Teigland
cc: "David S. Miller"
cc: Eric Dumazet
cc: Jakub Kicinski
cc: Paolo Abeni
cc: Jens Axboe
cc: Matthew Wilcox
cc: cluster-devel@redhat.com
cc: net...@vger.kernel.org
---
 fs/dlm/lowcomms.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 3d3802c47b8b..5c12d8cdfc16 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1395,8 +1395,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
 /* Send a message */
 static int send_to_sock(struct connection *con)
 {
-	const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
 	struct writequeue_entry *e;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+	};
 	int len, offset, ret;
 
 	spin_lock_bh(&con->writequeue_lock);
@@ -1412,8 +1415,9 @@ static int send_to_sock(struct connection *con)
 	WARN_ON_ONCE(len == 0 && e->users == 0);
 	spin_unlock_bh(&con->writequeue_lock);
 
-	ret = kernel_sendpage(con->sock, e->page, offset, len,
-			      msg_flags);
+	bvec_set_page(&bvec, e->page, len, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+	ret = sock_sendmsg(con->sock, &msg);
 	trace_dlm_send(con->nodeid, ret);
 	if (ret == -EAGAIN || ret == 0) {
 		lock_sock(con->sock->sk);
[Cluster-devel] [PATCH net-next v5 06/16] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage. This allows ->sendpage() to be replaced by something that
can handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells
cc: Christine Caulfield
cc: David Teigland
cc: "David S. Miller"
cc: Eric Dumazet
cc: Jakub Kicinski
cc: Paolo Abeni
cc: Jens Axboe
cc: Matthew Wilcox
cc: cluster-devel@redhat.com
cc: net...@vger.kernel.org
---
 fs/dlm/lowcomms.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 3d3802c47b8b..5c12d8cdfc16 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1395,8 +1395,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
 /* Send a message */
 static int send_to_sock(struct connection *con)
 {
-	const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
 	struct writequeue_entry *e;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+	};
 	int len, offset, ret;
 
 	spin_lock_bh(&con->writequeue_lock);
@@ -1412,8 +1415,9 @@ static int send_to_sock(struct connection *con)
 	WARN_ON_ONCE(len == 0 && e->users == 0);
 	spin_unlock_bh(&con->writequeue_lock);
 
-	ret = kernel_sendpage(con->sock, e->page, offset, len,
-			      msg_flags);
+	bvec_set_page(&bvec, e->page, len, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+	ret = sock_sendmsg(con->sock, &msg);
 	trace_dlm_send(con->nodeid, ret);
 	if (ret == -EAGAIN || ret == 0) {
 		lock_sock(con->sock->sk);
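
For readers who want to see the shape of the call outside the kernel, here is a rough userspace analogy, not the kernel code itself. It describes one (buffer, offset, length) region with a single vector entry and hands it to one sendmsg() call, much as the patch builds one bio_vec and passes it to sock_sendmsg(). MSG_SPLICE_PAGES is internal to the kernel and cannot be passed from userspace, so it is left out here. The names send_region() and demo_roundtrip() are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send len bytes starting at buf + offset with a single
 * sendmsg() call. One iovec plays the role that the single bio_vec plays in
 * the patch.
 */
ssize_t send_region(int fd, const char *buf, size_t buf_size,
		    size_t offset, size_t len)
{
	/* The described region must lie inside the buffer. */
	assert(offset <= buf_size && len <= buf_size - offset);

	struct iovec iov = {
		.iov_base = (void *)(buf + offset),
		.iov_len = len,
	};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
	};

	/* Userspace passes flags as an argument rather than in msg.msg_flags. */
	return sendmsg(fd, &msg, MSG_DONTWAIT | MSG_NOSIGNAL);
}

/* Send a slice over a socketpair and check it arrives; 0 on success. */
int demo_roundtrip(void)
{
	static const char page[] = "headerPAYLOADtrailer";
	char got[16] = { 0 };
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return -1;

	/* "PAYLOAD" starts at offset 6 and is 7 bytes long. */
	ssize_t sent = send_region(sv[0], page, sizeof(page), 6, 7);
	ssize_t rcvd = (sent == 7) ? recv(sv[1], got, sizeof(got) - 1, 0) : -1;
	int ok = (rcvd == 7 && memcmp(got, "PAYLOAD", 7) == 0);

	close(sv[0]);
	close(sv[1]);
	return ok ? 0 : -1;
}
```

As in send_to_sock(), a nonblocking send can return fewer bytes than requested or fail with EAGAIN, so a real caller would need to handle both cases rather than assume the whole region went out.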