On Thu, Nov 06, 2025 at 05:18:00PM +0100, Stefano Garzarella wrote:
On Thu, Oct 23, 2025 at 11:27:43AM -0700, Bobby Eshleman wrote:
> From: Bobby Eshleman <[email protected]>
>
> Add netns logic to vsock core. Additionally, modify transport hook
> prototypes to be used by later transport-specific patches (e.g.,
> *_seqpacket_allow()).
>
> Namespaces are supported primarily by changing socket lookup functions
> (e.g., vsock_find_connected_socket()) to take into account the socket
> namespace and the namespace mode before considering a candidate socket a
> "match".
>
> Introduce a dummy namespace struct, __vsock_global_dummy_net, to be
> used by transports that do not support namespacing. This dummy always
> has mode "global" to preserve previous CID behavior.
>
> This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode that
> accepts the "global" or "local" mode strings.
>
> The transports (besides vhost) are modified to use the global dummy,
> which makes them behave as if always in the global namespace. Vhost is
> an exception because it inherits its namespace from the process that
> opens the vhost device.
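As a rough illustration of the matching described above, here is a small userspace model. It assumes that two endpoints match when they share a namespace, or when both namespaces are in global mode. That rule and every name in it (`model_net`, `model_net_match`, `NS_MODE_*`) are assumptions made for illustration, not code taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from the patch. */
enum ns_mode { NS_MODE_GLOBAL, NS_MODE_LOCAL };

struct model_net {
	int id;            /* stands in for the identity of a struct net */
	enum ns_mode mode; /* per-namespace vsock mode */
};

/*
 * Assumed rule: a candidate matches if it lives in the same namespace,
 * or if both namespaces are in global mode.
 */
static bool model_net_match(const struct model_net *a,
			    const struct model_net *b)
{
	assert(a && b);

	if (a->id == b->id)
		return true;

	return a->mode == NS_MODE_GLOBAL && b->mode == NS_MODE_GLOBAL;
}
```

Under this assumed rule, a transport pinned to a global-mode dummy matches sockets in any global-mode namespace and never matches sockets in a local-mode namespace, which is consistent with the "behave as if always in the global namespace" description.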
>
> Add netns functionality (initialization, passing to transports, procfs,
> etc...) to the af_vsock socket layer. Later patches that add netns
> support to transports depend on this patch.
>
> seqpacket_allow() callbacks are modified to take a vsk so that transport
> implementations can inspect sock_net(sk) and vsk->net_mode when performing
> lookups (e.g., vhost does this in its future netns patch). Because the
> API change affects all transports, it seemed more appropriate to make
> this internal API change in the "vsock core" patch than in the "vhost"
> patch.
>
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> Changes in v7:
> - hv_sock: fix hyperv build error
> - explain why vhost does not use the dummy
> - explain usage of __vsock_global_dummy_net
> - explain why VSOCK_NET_MODE_STR_MAX is 8 characters
> - use switch-case in vsock_net_mode_string()
> - avoid changing transports as much as possible
> - add vsock_find_{bound,connected}_socket_net()
> - rename `vsock_hdr` to `sysctl_hdr`
> - add virtio_vsock_alloc_linear_skb() wrapper for setting dummy net and
> global mode for virtio-vsock, move skb->cb zero-ing into wrapper
> - explain seqpacket_allow() change
> - move net setting to __vsock_create() instead of vsock_create() so
> that child sockets also have their net assigned upon accept()
>
> Changes in v6:
> - unregister sysctl ops in vsock_exit()
> - af_vsock: clarify description of CID behavior
> - af_vsock: fix buf vs buffer naming, and length checking
> - af_vsock: fix length checking w/ correct ctl_table->maxlen
>
> Changes in v5:
> - vsock_global_net() -> vsock_global_dummy_net()
> - update comments for new uAPI
> - use /proc/sys/net/vsock/ns_mode instead of /proc/net/vsock_ns_mode
> - add prototype changes so patch remains compilable
> ---
> drivers/vhost/vsock.c | 4 +-
> include/linux/virtio_vsock.h | 21 ++++
> include/net/af_vsock.h | 14 ++-
> net/vmw_vsock/af_vsock.c | 264 ++++++++++++++++++++++++++++++++++++---
> net/vmw_vsock/virtio_transport.c | 7 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 6 files changed, 288 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index ae01457ea2cd..34adf0cf9124 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -404,7 +404,7 @@ static bool vhost_transport_msgzerocopy_allow(void)
> return true;
> }
>
> -static bool vhost_transport_seqpacket_allow(u32 remote_cid);
> +static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk, u32 remote_cid);
>
> static struct virtio_transport vhost_transport = {
> .transport = {
> @@ -460,7 +460,7 @@ static struct virtio_transport vhost_transport = {
> .send_pkt = vhost_transport_send_pkt,
> };
>
> -static bool vhost_transport_seqpacket_allow(u32 remote_cid)
> +static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk, u32 remote_cid)
> {
> struct vhost_vsock *vsock;
> bool seqpacket_allow = false;
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 7f334a32133c..29290395054c 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -153,6 +153,27 @@ static inline void virtio_vsock_skb_set_net_mode(struct sk_buff *skb,
> VIRTIO_VSOCK_SKB_CB(skb)->net_mode = net_mode;
> }
>
> +static inline struct sk_buff *
> +virtio_vsock_alloc_rx_skb(unsigned int size, gfp_t mask)
> +{
> + struct sk_buff *skb;
> +
> + skb = virtio_vsock_alloc_linear_skb(size, mask);
> + if (!skb)
> + return NULL;
> +
> + memset(skb->head, 0, VIRTIO_VSOCK_SKB_HEADROOM);
> +
> + /* virtio-vsock does not yet support namespaces, so on receive
> + * we force legacy namespace behavior using the global dummy net
> + * and global net mode.
> + */
> + virtio_vsock_skb_set_net(skb, vsock_global_dummy_net());
> + virtio_vsock_skb_set_net_mode(skb, VSOCK_NET_MODE_GLOBAL);
> +
> + return skb;
> +}
Why are we introducing this change in this patch?
Where is the net of the virtio skb read?
Oh good point, this is a weird place for this. I'll move this to where
it is actually used.
[...]
>
> +static int vsock_net_mode_string(const struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + char data[VSOCK_NET_MODE_STR_MAX] = {0};
> + enum vsock_net_mode mode;
> + struct ctl_table tmp;
> + struct net *net;
> + int ret;
> +
> + if (!table->data || !table->maxlen || !*lenp) {
> + *lenp = 0;
> + return 0;
> + }
> +
> + net = current->nsproxy->net_ns;
> + tmp = *table;
> + tmp.data = data;
> +
> + if (!write) {
> + const char *p;
> +
> + mode = vsock_net_mode(net);
> +
> + switch (mode) {
> + case VSOCK_NET_MODE_GLOBAL:
> + p = VSOCK_NET_MODE_STR_GLOBAL;
> + break;
> + case VSOCK_NET_MODE_LOCAL:
> + p = VSOCK_NET_MODE_STR_LOCAL;
> + break;
> + default:
> + WARN_ONCE(true, "netns has invalid vsock mode");
> + *lenp = 0;
> + return 0;
> + }
> +
> + strscpy(data, p, sizeof(data));
> + tmp.maxlen = strlen(p);
> + }
> +
> + ret = proc_dostring(&tmp, write, buffer, lenp, ppos);
> + if (ret)
> + return ret;
> +
> + if (write) {
Do we need to check some capability, e.g. CAP_NET_ADMIN?
We get that for free via the sysctl_net registration, through this path
on open (CAP_NET_ADMIN is checked in net_ctl_permissions()):

	net_ctl_permissions+1
	sysctl_perm+24
	proc_sys_permission+117
	inode_permission+217
	link_path_walk+162
	path_openat+152
	do_filp_open+171
	do_sys_openat2+98
	__x64_sys_openat+69
	do_syscall_64+93

Verified with:

	cp /bin/echo /tmp/echo_netadmin
	setcap cap_net_admin+ep /tmp/echo_netadmin

(A non-root user fails with regular echo, but succeeds with
/tmp/echo_netadmin.)
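
For anyone who wants to repeat the check without copying echo around, a minimal userspace sketch could look like the following. The helper name `try_write()` is hypothetical. Since the trace above shows the permission check happening on the open path, the expectation (an assumption, not something shown in the trace) is that an unprivileged caller sees open() fail and gets a negative errno back.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to write val to path.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_write(const char *path, const char *val)
{
	size_t len;
	ssize_t n;
	int fd, err = 0;

	assert(path && val);
	len = strlen(val);

	/* For a sysctl file, the permission check runs here, on open. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO; /* treat a short write as a failure */

	close(fd);
	return err;
}
```

Calling `try_write("/proc/sys/net/vsock/ns_mode", "local")` from a binary with and without cap_net_admin should then mirror the echo experiment above.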