[PATCH v2 net-next] af_unix: ensure POLLOUT on remote close() for connected dgram socket
Applications use -ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. However, when using connected
dgram unix sockets in a poll/write loop, a final POLLOUT event can be
missed when the remote end closes. Thus, the poll is stuck forever:

          thread 1 (client)                 thread 2 (server)

  connect() to server
  write() returns -EAGAIN
  unix_dgram_poll()
   -> unix_recvq_full() is true
                                            close()
                                             ->unix_release_sock()
                                              ->wake_up_interruptible_all()
  unix_dgram_poll() (due to the
   wake_up_interruptible_all)
   -> unix_recvq_full() still is true
                                              ->free all skbs

Now thread 1 is stuck and will not receive any more wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs,
otherwise the 'free all skbs' step would in fact cause a wakeup and a
POLLOUT return. So the race here is probably fairly rare because it
means there are no skbs that thread 1 queued and that thread 1
schedules before the 'free all skbs' step.

This issue was reported as a hang when /dev/log is closed.

The fix is to signal POLLOUT if the socket is marked as SOCK_DEAD, which
means a subsequent write() will get -ECONNREFUSED.

Reported-by: Ian Lance Taylor
Cc: David Rientjes
Cc: Rainer Weikusat
Cc: Eric Dumazet
Signed-off-by: Jason Baron
---
v2: use check for SOCK_DEAD, since skbs can be purged in
    unix_sock_destructor()
---
 net/unix/af_unix.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 1772a0e..d1edfa3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -430,7 +430,12 @@ static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
 
 	connected = unix_dgram_peer_wake_connect(sk, other);
 
-	if (unix_recvq_full(other))
+	/* If other is SOCK_DEAD, we want to make sure we signal
+	 * POLLOUT, such that a subsequent write() can get a
+	 * -ECONNREFUSED. Otherwise, if we haven't queued any skbs
+	 * to other and it's full, we will hang waiting for POLLOUT.
+	 */
+	if (unix_recvq_full(other) && !sock_flag(other, SOCK_DEAD))
 		return 1;
 
 	if (connected)
-- 
1.9.1
[PATCH net] af_unix: ensure POLLOUT on remote close() for connected dgram sockets
Applications use ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. When using connected dgram
unix sockets in a poll/write loop, this relies on POLLOUT being
signaled when the remote end closes. However, due to a race, POLLOUT
can be missed when the remote closes:

          thread 1 (client)                 thread 2 (server)

  connect() to server
  write() returns -EAGAIN
  unix_dgram_poll()
   -> unix_recvq_full() is true
                                            close()
                                             ->unix_release_sock()
                                              ->wake_up_interruptible_all()
  unix_dgram_poll() (due to the
   wake_up_interruptible_all)
   -> unix_recvq_full() still is true
                                              ->free all skbs

Now thread 1 is stuck and will not receive any more wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs,
otherwise the 'free all skbs' step would in fact cause a wakeup and a
POLLOUT return. So the race here is probably fairly rare because it
means there are no skbs that thread 1 queued and that thread 1 runs
before the 'free all skbs' step.

Nevertheless, this has been observed when the syslog daemon closes
/dev/log. Tested against a reproducer that re-creates the syslog hang.

The proposed fix is to move the wake_up_interruptible_all() call after
the 'free all skbs' step.

Reported-by: Ian Lance Taylor
Cc: Rainer Weikusat
Signed-off-by: Jason Baron
---
 net/unix/af_unix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c0..de242cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -529,8 +529,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	sk->sk_state = TCP_CLOSE;
 	unix_state_unlock(sk);
 
-	wake_up_interruptible_all(&u->peer_wait);
-
 	skpair = unix_peer(sk);
 
 	if (skpair != NULL) {
@@ -560,6 +558,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 		kfree_skb(skb);
 	}
 
+	/* after freeing skbs, make sure POLLOUT triggers */
+	wake_up_interruptible_all(&u->peer_wait);
+
 	if (path.dentry)
 		path_put(&path);
-- 
2.7.4
Re: Bug report: epoll can fail to report EPOLLOUT when unix datagram socket peer is closed
On 06/26/2018 10:18 AM, Ian Lance Taylor wrote:
> I'm reporting what appears to be a bug in the Linux kernel's epoll
> support. It seems that epoll appears to sometimes fail to report an
> EPOLLOUT event when the other side of an AF_UNIX/SOCK_DGRAM socket is
> closed. This bug report started as a Go program reported at
> https://golang.org/issue/23604. I've written a C program that
> demonstrates the same symptoms, at
> https://github.com/golang/go/issues/23604#issuecomment-398945027 .
>
> The C program sets up an AF_UNIX/SOCK_DGRAM server and several
> identical clients, all running in non-blocking mode. All the
> non-blocking sockets are added to epoll, using EPOLLET. The server
> periodically closes and reopens its socket. The clients look for
> ECONNREFUSED errors on their write calls, and close and reopen their
> sockets when they see one.
>
> The clients will sometimes fill up their buffer and block with EAGAIN.
> At that point they expect the poller to return an EPOLLOUT event to
> tell them when they are ready to write again. The expectation is that
> either the server will read data, freeing up buffer space, or will
> close the socket, which should cause the sending packets to be
> discarded, freeing up buffer space. Generally the EPOLLOUT event
> happens. But sometimes, the poller never returns such an event, and
> the client stalls. In the test program this is reported as a client
> that waits more than 20 seconds to be told to continue.
>
> A similar bug report was made, with few details, at
> https://stackoverflow.com/questions/38441059/edge-triggered-epoll-for-unix-domain-socket .
>
> I've tested the program and seen the failure on kernel 4.9.0-6-amd64.
> A colleague has tested the program and seen the failure on
> 4.18.0-smp-DEV #3 SMP @1529531011 x86_64 GNU/Linux.
>
> If there is a better way for me to report this, please let me know.
>
> Thanks for your attention.
>
> Ian
>

Hi,

Thanks for the report and the test program.
The patch below seems to have cured the reproducer for me. But perhaps
you can confirm?

Thanks,

-Jason

[PATCH] af_unix: ensure POLLOUT on remote close() for connected dgram socket

Applications use ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. When using connected dgram
unix sockets in a poll/write loop, this relies on POLLOUT being
signaled when the remote end closes. However, due to a race, POLLOUT
can be missed when the remote closes:

          thread 1 (client)                 thread 2 (server)

  connect() to server
  write() returns -EAGAIN
  unix_dgram_poll()
   -> unix_recvq_full() is true
                                            close()
                                             ->unix_release_sock()
                                              ->wake_up_interruptible_all()
  unix_dgram_poll() (due to the
   wake_up_interruptible_all)
   -> unix_recvq_full() still is true
                                              ->free all skbs

Now thread 1 is stuck and will not receive any more wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs,
otherwise the 'free all skbs' step would in fact cause a wakeup and a
POLLOUT return. So the race here is probably fairly rare because it
means there are no skbs that thread 1 queued and that thread 1
schedules before the 'free all skbs' step. Nevertheless, this has been
observed in the wild via syslog.

The proposed fix is to move the wake_up_interruptible_all() call after
the 'free all skbs' step.

Signed-off-by: Jason Baron
---
 net/unix/af_unix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c0..de242cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -529,8 +529,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	sk->sk_state = TCP_CLOSE;
 	unix_state_unlock(sk);
 
-	wake_up_interruptible_all(&u->peer_wait);
-
 	skpair = unix_peer(sk);
 
 	if (skpair != NULL) {
@@ -560,6 +558,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 		kfree_skb(skb);
 	}
 
+	/* after freeing skbs, make sure POLLOUT triggers */
+	wake_up_interruptible_all(&u->peer_wait);
+
 	if (path.dentry)
 		path_put(&path);
-- 
2.7.4
[PATCH v4 2/3] qemu: virtio-net: use 64-bit values for feature flags
In preparation for using some of the high-order feature bits, make sure
that virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c            | 55 +++++++++++++++++++++---------------
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..54823af 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-    uint32_t flags;
+    uint64_t flags;
     size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-    {.flags = 1 << VIRTIO_NET_F_MAC,
+    {.flags = 1ULL << VIRTIO_NET_F_MAC,
      .end = endof(struct virtio_net_config, mac)},
-    {.flags = 1 << VIRTIO_NET_F_STATUS,
+    {.flags = 1ULL << VIRTIO_NET_F_STATUS,
      .end = endof(struct virtio_net_config, status)},
-    {.flags = 1 << VIRTIO_NET_F_MQ,
+    {.flags = 1ULL << VIRTIO_NET_F_MQ,
      .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-    {.flags = 1 << VIRTIO_NET_F_MTU,
+    {.flags = 1ULL << VIRTIO_NET_F_MTU,
      .end = endof(struct virtio_net_config, mtu)},
     {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     int i;
 
     if (n->net_conf.mtu) {
-        n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+        n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
     }
 
     virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,46 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-    DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-    DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
+                    VIRTIO_NET_F_CSUM, true),
+    DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_CSUM, true),
-    DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-    DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+    DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_TSO4, true),
-    DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_TSO6, true),
-    DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_ECN, true),
-    DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_UFO, true),
-    DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-    DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_TSO4, true),
-    DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_TSO6, true),
-    DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_ECN, true),
-    DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_UFO, true),
-    DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
                     VIRTIO_NET_F_MRG_RXBUF, true),
-    DEFINE_PROP_BIT("status", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("status", VirtIONet, host_features,
                     VIRTIO_NET_F_STATUS, true),
-    DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_VQ, true),
-    DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
[PATCH net-next v4 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63; the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 drivers/net/virtio_net.c        | 23 ++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h | 13 +++++++++++++
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6fb7b65..4f27508 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1894,6 +1894,24 @@ static void virtnet_init_settings(struct net_device *dev)
 	vi->duplex = DUPLEX_UNKNOWN;
 }
 
+static void virtnet_update_settings(struct virtnet_info *vi)
+{
+	u32 speed;
+	u8 duplex;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
+		return;
+
+	speed = virtio_cread32(vi->vdev, offsetof(struct virtio_net_config,
+						  speed));
+	if (ethtool_validate_speed(speed))
+		vi->speed = speed;
+	duplex = virtio_cread8(vi->vdev, offsetof(struct virtio_net_config,
+						  duplex));
+	if (ethtool_validate_duplex(duplex))
+		vi->duplex = duplex;
+}
+
 static const struct ethtool_ops virtnet_ethtool_ops = {
 	.get_drvinfo = virtnet_get_drvinfo,
 	.get_link = ethtool_op_get_link,
@@ -2147,6 +2165,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
 	vi->status = v;
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
+		virtnet_update_settings(vi);
 		netif_carrier_on(vi->dev);
 		netif_tx_wake_all_queues(vi->dev);
 	} else {
@@ -2695,6 +2714,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 		schedule_work(&vi->config_work);
 	} else {
 		vi->status = VIRTIO_NET_S_LINK_UP;
+		virtnet_update_settings(vi);
 		netif_carrier_on(dev);
 	}
 
@@ -2796,7 +2816,8 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
-	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+	VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..5de6ed3 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
  * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_config {
 	__u16 max_virtqueue_pairs;
 	/* Default maximum transmit unit advice */
 	__u16 mtu;
+	/*
+	 * speed, in units of 1Mb. All values 0 to INT_MAX are legal.
+	 * Any other value stands for unknown.
+	 */
+	__u32 speed;
+	/*
+	 * 0x00 - half duplex
+	 * 0x01 - full duplex
+	 * Any other value stands for unknown.
+	 */
+	__u8 duplex;
 } __attribute__((packed));

/* 
-- 
2.6.1
[PATCH v4 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting
We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need for
guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.

The ability to set linkspeed and duplex is useful in various cases as
described here:

16032be virtio_net: add ethtool support for set and get of settings

Using 'ethtool -s' continues to over-write the linkspeed/duplex
settings with this patch.

The 1/3 patch is against net-next, while the 2-3/3 patches are the
associated qemu changes that would go in after, as
update-linux-headers.sh should be run first. So the qemu patches are a
demonstration of how I intend this to work.

Thanks,

-Jason

linux changes:

changes from v3:
 * break the speed/duplex read into a function and also call from
   virtnet_probe when the status bit is not negotiated
 * only do the speed/duplex read in virtnet_config_changed_work() on
   LINK_UP

changes from v2:
 * move speed/duplex read into virtnet_config_changed_work() so link
   up changes are detected

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c        | 23 ++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h | 13 +++++++++++++
 2 files changed, 35 insertions(+), 1 deletion(-)

qemu changes:

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c                         | 87 ++++++++++++++++-------
 include/hw/virtio/virtio-net.h              |  5 +-
 include/standard-headers/linux/virtio_net.h | 13 ++++
 3 files changed, 77 insertions(+), 28 deletions(-)

-- 
2.6.1
[PATCH v4 3/3] qemu: add linkspeed and duplex settings to virtio-net
Although linkspeed and duplex can be set in a linux guest via
'ethtool -s', this requires custom ethtool commands for virtio-net by
default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=10000,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c                         | 32 ++++++++++++++++++++++
 include/hw/virtio/virtio-net.h              |  3 +++
 include/standard-headers/linux/virtio_net.h | 13 +++++++++
 3 files changed, 48 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 54823af..cd63659 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed */
+#define DUPLEX_UNKNOWN 0xff
+#define DUPLEX_HALF 0x00
+#define DUPLEX_FULL 0x01
+#define SPEED_UNKNOWN -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
      .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
     {.flags = 1ULL << VIRTIO_NET_F_MTU,
      .end = endof(struct virtio_net_config, mtu)},
+    {.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+     .end = endof(struct virtio_net_config, duplex)},
     {}
 };
 
@@ -89,6 +97,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
     virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
     virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
+    virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+    netcfg.duplex = n->net_conf.duplex;
     memcpy(config, &netcfg, n->config_size);
 }
 
@@ -1941,6 +1951,26 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
         n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
     }
 
+    if (n->net_conf.duplex_str) {
+        if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+            n->net_conf.duplex = DUPLEX_HALF;
+        } else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+            n->net_conf.duplex = DUPLEX_FULL;
+        } else {
+            error_setg(errp, "'duplex' must be 'half' or 'full'");
+        }
+        n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+    } else {
+        n->net_conf.duplex = DUPLEX_UNKNOWN;
+    }
+
+    if (n->net_conf.speed < SPEED_UNKNOWN) {
+        error_setg(errp, "'speed' must be between -1 (SPEED_UNKNOWN) and "
+                   "INT_MAX");
+    } else if (n->net_conf.speed >= 0) {
+        n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+    }
+
     virtio_net_set_config_size(n, n->host_features);
     virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2161,6 +2191,8 @@ static Property virtio_net_properties[] = {
     DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
     DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
                      true),
+    DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+    DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
     uint16_t rx_queue_size;
     uint16_t tx_queue_size;
     uint16_t mtu;
+    int32_t speed;
+    char *duplex_str;
+    uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..17c8531 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,8 @@
  * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_confi
Re: [PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
On 01/04/2018 01:22 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 04, 2018 at 01:12:30PM -0500, Jason Baron wrote:
>>
>> On 01/04/2018 12:05 PM, Michael S. Tsirkin wrote:
>>> On Thu, Jan 04, 2018 at 12:16:44AM -0500, Jason Baron wrote:
>>>> The ability to set speed and duplex for virtio_net is useful in various
>>>> scenarios as described here:
>>>>
>>>> 16032be virtio_net: add ethtool support for set and get of settings
>>>>
>>>> However, it would be nice to be able to set this from the hypervisor,
>>>> such that virtio_net doesn't require custom guest ethtool commands.
>>>>
>>>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>>>> the hypervisor to export a linkspeed and duplex setting. The user can
>>>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>>>
>>>> Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
>>>> is that device feature bits are to grow down from bit 63, since the
>>>> transports are starting from bit 24 and growing up.
>>>>
>>>> Signed-off-by: Jason Baron <jba...@akamai.com>
>>>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>>>> Cc: Jason Wang <jasow...@redhat.com>
>>>> Cc: virtio-...@lists.oasis-open.org
>>>> ---
>>>>  drivers/net/virtio_net.c        | 19 ++++++++++++++++++-
>>>>  include/uapi/linux/virtio_net.h | 13 +++++++++++++
>>>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 6fb7b65..0b2d314 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct work_struct *work)
>>>>
>>>>  	vi->status = v;
>>>>
>>>> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
>>>
>>> BTW we can avoid this read for when link goes down.
>>> Not a big deal but still.
>>
>> So you are saying that we can just set vi->speed and vi->duplex to
>> 'unknown' when the link goes down and not check for the presence of
>> VIRTIO_NET_F_SPEED_DUPLEX?
>>
>> If so, that could over-write what the user may have configured in the
>> guest via 'ethtool -s' when the link goes down, so that would be a
>> change in behavior, but perhaps that is ok?
>
> No - what I am saying is that your patch overwrites the values
> set by user when link goes down.
>
> I suggest limiting this call to when
>
> if (vi->status & VIRTIO_NET_S_LINK_UP)
>
> and then the values are overwritten when link goes up
> which seems closer to what a user might expect.

ok, will update this for v4.

Thanks,

-Jason

>
>>
>> I think I would prefer to have the link down event still check for
>> VIRTIO_NET_F_SPEED_DUPLEX before changing speed/duplex. That way we
>> still have 2 modes for updating the fields:
>>
>> 1) completely guest controlled. Same as we have now and host does not
>> change any values and does not set VIRTIO_NET_F_SPEED_DUPLEX flag (hence
>> don't remove above check).
>>
>> 2) if speed or duplex or speed is set in the qemu command line, then set
>> the VIRTIO_NET_F_SPEED_DUPLEX and have host control the settings of
>> speed/duplex (with ability of guest to over-write if it wanted to).
>>
>>>
>>>> +		u32 speed;
>>>> +		u8 duplex;
>>>> +
>>>> +		speed = virtio_cread32(vi->vdev,
>>>> +				       offsetof(struct virtio_net_config,
>>>> +						speed));
>>>> +		if (ethtool_validate_speed(speed))
>>>> +			vi->speed = speed;
>>>> +		duplex = virtio_cread8(vi->vdev,
>>>> +				       offsetof(struct virtio_net_config,
>>>> +						duplex));
>>>> +		if (ethtool_validate_duplex(duplex))
>>>> +			vi->duplex = duplex;
>>>> +	}
>>>> +
>>>>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>>>  		netif_carrier_on(vi->dev);
>>>>  		netif_tx_wake_all_queues(vi->dev);
>>>
>>> OK so this handles the case when VIRTIO_NET_F_STATUS is set,
Re: [PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
On 01/04/2018 12:05 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 04, 2018 at 12:16:44AM -0500, Jason Baron wrote:
>> The ability to set speed and duplex for virtio_net is useful in various
>> scenarios as described here:
>>
>> 16032be virtio_net: add ethtool support for set and get of settings
>>
>> However, it would be nice to be able to set this from the hypervisor,
>> such that virtio_net doesn't require custom guest ethtool commands.
>>
>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>> the hypervisor to export a linkspeed and duplex setting. The user can
>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>
>> Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
>> is that device feature bits are to grow down from bit 63, since the
>> transports are starting from bit 24 and growing up.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> Cc: Jason Wang <jasow...@redhat.com>
>> Cc: virtio-...@lists.oasis-open.org
>> ---
>>  drivers/net/virtio_net.c        | 19 ++++++++++++++++++-
>>  include/uapi/linux/virtio_net.h | 13 +++++++++++++
>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 6fb7b65..0b2d314 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct work_struct *work)
>>
>>  	vi->status = v;
>>
>> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
>
> BTW we can avoid this read for when link goes down.
> Not a big deal but still.

So you are saying that we can just set vi->speed and vi->duplex to
'unknown' when the link goes down and not check for the presence of
VIRTIO_NET_F_SPEED_DUPLEX?

If so, that could over-write what the user may have configured in the
guest via 'ethtool -s' when the link goes down, so that would be a
change in behavior, but perhaps that is ok?

I think I would prefer to have the link down event still check for
VIRTIO_NET_F_SPEED_DUPLEX before changing speed/duplex. That way we
still have 2 modes for updating the fields:

1) completely guest controlled. Same as we have now and host does not
change any values and does not set VIRTIO_NET_F_SPEED_DUPLEX flag (hence
don't remove above check).

2) if speed or duplex or speed is set in the qemu command line, then set
the VIRTIO_NET_F_SPEED_DUPLEX and have host control the settings of
speed/duplex (with ability of guest to over-write if it wanted to).

>
>> +		u32 speed;
>> +		u8 duplex;
>> +
>> +		speed = virtio_cread32(vi->vdev,
>> +				       offsetof(struct virtio_net_config,
>> +						speed));
>> +		if (ethtool_validate_speed(speed))
>> +			vi->speed = speed;
>> +		duplex = virtio_cread8(vi->vdev,
>> +				       offsetof(struct virtio_net_config,
>> +						duplex));
>> +		if (ethtool_validate_duplex(duplex))
>> +			vi->duplex = duplex;
>> +	}
>> +
>>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>  		netif_carrier_on(vi->dev);
>>  		netif_tx_wake_all_queues(vi->dev);
>
> OK so this handles the case when VIRTIO_NET_F_STATUS is set,
> but when it's clear we need to call this from virtnet_probe.
>
> I propose moving this chunk to a function and calling from two places.
>

good point. will update.

Thanks,

-Jason

>
>> @@ -2796,7 +2812,8 @@ static struct virtio_device_id id_table[] = {
>>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>> +	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> +	VIRTIO_NET_F_SPEED_DUPLEX
>>
>>  static unsigned int features[] = {
>>  	VIRTNET_FEATURES,
>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>> index fc353b5..5de6ed3 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -57,6 +57,8 @@
>>   * Steering */
>>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>
>> +#define VIRTIO_NET_F_SPEED_DUPLEX 63
Re: [PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
On 01/04/2018 11:27 AM, Michael S. Tsirkin wrote:
> On Thu, Jan 04, 2018 at 12:16:44AM -0500, Jason Baron wrote:
>> The ability to set speed and duplex for virtio_net is useful in various
>> scenarios as described here:
>>
>> 16032be virtio_net: add ethtool support for set and get of settings
>>
>> However, it would be nice to be able to set this from the hypervisor,
>> such that virtio_net doesn't require custom guest ethtool commands.
>>
>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>> the hypervisor to export a linkspeed and duplex setting. The user can
>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>
>> Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
>> is that device feature bits are to grow down from bit 63, since the
>> transports are starting from bit 24 and growing up.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> Cc: Jason Wang <jasow...@redhat.com>
>> Cc: virtio-...@lists.oasis-open.org
>> ---
>>  drivers/net/virtio_net.c        | 19 ++++++++++++++++++-
>>  include/uapi/linux/virtio_net.h | 13 +++++++++++++
>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 6fb7b65..0b2d314 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct work_struct *work)
>>
>>  	vi->status = v;
>>
>> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
>> +		u32 speed;
>> +		u8 duplex;
>> +
>> +		speed = virtio_cread32(vi->vdev,
>> +				       offsetof(struct virtio_net_config,
>> +						speed));
>> +		if (ethtool_validate_speed(speed))
>> +			vi->speed = speed;
>> +		duplex = virtio_cread8(vi->vdev,
>> +				       offsetof(struct virtio_net_config,
>> +						duplex));
>> +		if (ethtool_validate_duplex(duplex))
>> +			vi->duplex = duplex;
>> +	}
>> +
>>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>  		netif_carrier_on(vi->dev);
>>  		netif_tx_wake_all_queues(vi->dev);
>> @@ -2796,7 +2812,8 @@ static struct virtio_device_id id_table[] = {
>>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>> +	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> +	VIRTIO_NET_F_SPEED_DUPLEX
>>
>>  static unsigned int features[] = {
>>  	VIRTNET_FEATURES,
>
> Still missing the update from virtnet_config_changed_work, and I think
> it's important to reflect host changes within guest when the
> feature bit has been acked.
>

I update vi->speed and vi->duplex in virtnet_config_changed_work(). And
I tested using the 'set_link' on/off from the qemu monitor.
Specifically, an 'off' sets the speed and link to 'unknown', and an
'on' returns the speed and link to the configured speed and duplex. So
they are being updated dynamically now. What host changes are you
referring to that are not reflected?

Thanks,

-Jason

>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>> index fc353b5..5de6ed3 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -57,6 +57,8 @@
>>   * Steering */
>>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>
>> +#define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>> +
>>  #ifndef VIRTIO_NET_NO_LEGACY
>>  #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
>>  #endif /* VIRTIO_NET_NO_LEGACY */
>> @@ -76,6 +78,17 @@ struct virtio_net_config {
>>  	__u16 max_virtqueue_pairs;
>>  	/* Default maximum transmit unit advice */
>>  	__u16 mtu;
>> +	/*
>> +	 * speed, in units of 1Mb. All values 0 to INT_MAX are legal.
>> +	 * Any other value stands for unknown.
>> +	 */
>> +	__u32 speed;
>> +	/*
>> +	 * 0x00 - half duplex
>> +	 * 0x01 - full duplex
>> +	 * Any other value stands for unknown.
>> +	 */
>> +	__u8 duplex;
>>  } __attribute__((packed));
>>
>>  /*
>> -- 
>> 2.6.1
[PATCH v3 3/3] qemu: add linkspeed and duplex settings to virtio-net
Although linkspeed and duplex can be set in a linux guest via 'ethtool -s', this requires custom ethtool commands for virtio-net by default. Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows the hypervisor to export a linkspeed and duplex setting. The user can subsequently overwrite it later if desired via: 'ethtool -s'. Linkspeed and duplex settings can be set as: '-device virtio-net,speed=1,duplex=full' where speed is [-1...INT_MAX], and duplex is ["half"|"full"]. Signed-off-by: Jason Baron <jba...@akamai.com> Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> Cc: virtio-...@lists.oasis-open.org --- hw/net/virtio-net.c | 35 + include/hw/virtio/virtio-net.h | 3 +++ include/standard-headers/linux/virtio_net.h | 13 +++ 3 files changed, 51 insertions(+) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index adc20df..eec8422 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -40,6 +40,12 @@ #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE +/* duplex and speed */ +#define DUPLEX_UNKNOWN 0xff +#define DUPLEX_HALF 0x00 +#define DUPLEX_FULL 0x01 +#define SPEED_UNKNOWN -1 + /* * Calculate the number of bytes up to and including the given 'field' of * 'container'. 
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };

@@ -89,6 +97,14 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
+if (n->status & VIRTIO_NET_S_LINK_UP) {
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
+} else {
+virtio_stl_p(vdev, &netcfg.speed, SPEED_UNKNOWN);
+netcfg.duplex = DUPLEX_UNKNOWN;
+}
+
 memcpy(config, &netcfg, n->config_size);
 }

@@ -1941,6 +1957,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }

+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKNOWN) and "
+           "INT_MAX");
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);

@@ -2160,6 +2193,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
                  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644 --- a/include/hw/virtio/virtio-net.h +++ b/include/hw/virtio/virtio-net.h @@ -38,6 +38,9 @@ typedef struct virtio_net_conf uint16_t rx_queue_size; uint16_t tx_queue_size; uint16_t mtu; +int32_t speed; +char *duplex_str; +uint8_t duplex; } virtio_net_conf; /* Maximum packet size we can receive from tap device: header + 64k */ diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h index 30ff249..17c8531 100644 --- a/include/standard-headers/linux/virtio_net.h +++ b/include/standard-headers/linux/virtio_net.h @@ -57,6 +57,8 @@ * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ +#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */ + #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ #endif /* VIRTIO_NET_NO_LEGACY */ @@
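[Editor's note] The feature_sizes[] change above is what lets the config space grow to cover the new fields only when VIRTIO_NET_F_SPEED_DUPLEX is offered. A hypothetical mini-model of QEMU's virtio_net_set_config_size() scan (struct and constants are trimmed local mirrors, not the QEMU definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_F_MTU           3
#define VIRTIO_NET_F_MAC           5
#define VIRTIO_NET_F_STATUS        16
#define VIRTIO_NET_F_MQ            22
#define VIRTIO_NET_F_SPEED_DUPLEX  63

/* Local mirror of the config layout, packed like the real one. */
struct virtio_net_config_model {
	uint8_t  mac[6];
	uint16_t status;
	uint16_t max_virtqueue_pairs;
	uint16_t mtu;
	uint32_t speed;
	uint8_t  duplex;
} __attribute__((packed));

#define endof(c, f) (offsetof(struct c, f) + sizeof(((struct c *)0)->f))

struct feature_size {
	uint64_t flags;
	size_t end;
};

static const struct feature_size feature_sizes[] = {
	{ 1ULL << VIRTIO_NET_F_MAC,          endof(virtio_net_config_model, mac) },
	{ 1ULL << VIRTIO_NET_F_STATUS,       endof(virtio_net_config_model, status) },
	{ 1ULL << VIRTIO_NET_F_MQ,           endof(virtio_net_config_model, max_virtqueue_pairs) },
	{ 1ULL << VIRTIO_NET_F_MTU,          endof(virtio_net_config_model, mtu) },
	{ 1ULL << VIRTIO_NET_F_SPEED_DUPLEX, endof(virtio_net_config_model, duplex) },
	{ 0, 0 },
};

/* Config size is the largest 'end' among the offered feature bits;
 * the mac field is always present. */
static size_t config_size_for(uint64_t host_features)
{
	size_t size = 6;
	int i;

	for (i = 0; feature_sizes[i].flags != 0; i++) {
		if ((host_features & feature_sizes[i].flags) &&
		    feature_sizes[i].end > size)
			size = feature_sizes[i].end;
	}
	return size;
}
```

With the packed layout, offering only MTU stops the config at 12 bytes, while offering SPEED_DUPLEX extends it to 17 bytes (through 'duplex').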
[PATCH v3 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting
We have found it useful to be able to set the linkspeed and duplex settings from the host-side for virtio_net. This obviates the need for guest changes and settings for these fields, and does not require custom ethtool commands for virtio_net.

The ability to set linkspeed and duplex is useful in various cases as described here:

16032be virtio_net: add ethtool support for set and get of settings

Using 'ethtool -s' continues to overwrite the linkspeed/duplex settings with this patch.

The 1/3 patch is against net-next, while patches 2-3/3 are the associated qemu changes that would go in afterwards, since update-linux-headers.sh should be run first. So the qemu patches are a demonstration of how I intend this to work.

Thanks,

-Jason

linux changes:
changes from v2:
 * move speed/duplex read into virtnet_config_changed_work() so link up
   changes are detected

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c        | 19 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 31 insertions(+), 1 deletion(-)

qemu changes:
changes from v2:
 * if link up return configured speed/duplex, else return UNKNOWN speed
   and duplex

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c                         | 89 -
 include/hw/virtio/virtio-net.h              |  5 +-
 include/standard-headers/linux/virtio_net.h | 13 +
 3 files changed, 79 insertions(+), 28 deletions(-)

--
2.6.1
[PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
The ability to set speed and duplex for virtio_net is useful in various scenarios as described here: 16032be virtio_net: add ethtool support for set and get of settings However, it would be nice to be able to set this from the hypervisor, such that virtio_net doesn't require custom guest ethtool commands. Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows the hypervisor to export a linkspeed and duplex setting. The user can subsequently overwrite it later if desired via: 'ethtool -s'. Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention is that device feature bits are to grow down from bit 63, since the transports are starting from bit 24 and growing up. Signed-off-by: Jason Baron <jba...@akamai.com> Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> Cc: virtio-...@lists.oasis-open.org --- drivers/net/virtio_net.c| 19 ++- include/uapi/linux/virtio_net.h | 13 + 2 files changed, 31 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 6fb7b65..0b2d314 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct work_struct *work) vi->status = v; + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) { + u32 speed; + u8 duplex; + + speed = virtio_cread32(vi->vdev, + offsetof(struct virtio_net_config, + speed)); + if (ethtool_validate_speed(speed)) + vi->speed = speed; + duplex = virtio_cread8(vi->vdev, + offsetof(struct virtio_net_config, + duplex)); + if (ethtool_validate_duplex(duplex)) + vi->duplex = duplex; + } + if (vi->status & VIRTIO_NET_S_LINK_UP) { netif_carrier_on(vi->dev); netif_tx_wake_all_queues(vi->dev); @@ -2796,7 +2812,8 @@ static struct virtio_device_id id_table[] = { VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \ VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \ VIRTIO_NET_F_CTRL_MAC_ADDR, \ - VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS + 
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \ + VIRTIO_NET_F_SPEED_DUPLEX static unsigned int features[] = { VIRTNET_FEATURES, diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index fc353b5..5de6ed3 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -57,6 +57,8 @@ * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ +#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */ + #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ #endif /* VIRTIO_NET_NO_LEGACY */ @@ -76,6 +78,17 @@ struct virtio_net_config { __u16 max_virtqueue_pairs; /* Default maximum transmit unit advice */ __u16 mtu; + /* +* speed, in units of 1Mb. All values 0 to INT_MAX are legal. +* Any other value stands for unknown. +*/ + __u32 speed; + /* +* 0x00 - half duplex +* 0x01 - full duplex +* Any other value stands for unknown. +*/ + __u8 duplex; } __attribute__((packed)); /* -- 2.6.1
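[Editor's note] The driver reads the new fields with virtio_cread32(vdev, offsetof(struct virtio_net_config, speed)), so the patch depends on 'speed' and 'duplex' landing at fixed offsets right after 'mtu' in the packed struct. A sketch checking that layout against a local mirror of the uapi struct (trimmed to the fields shown in the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct virtio_net_config from the patch above:
 * because it is packed, there is no padding between fields. */
struct virtio_net_config {
	uint8_t  mac[6];
	uint16_t status;
	uint16_t max_virtqueue_pairs;
	uint16_t mtu;
	uint32_t speed;   /* in units of 1Mb; 0..INT_MAX legal, else unknown */
	uint8_t  duplex;  /* 0x00 half, 0x01 full, else unknown */
} __attribute__((packed));
```

Note that without the packed attribute, most ABIs would pad before 'speed' and the guest and hypervisor would disagree about where the new fields live.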
[PATCH v3 2/3] qemu: virtio-net: use 64-bit values for feature flags
In preparation for using some of the high order feature bits, make sure that virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c            | 54 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..adc20df 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))

 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;

 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };

@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 int i;

 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }

 virtio_net_set_config_size(n, n->host_features);

@@ -2109,45 +2109,45 @@ static const VMStateDescription vmstate_virtio_net = {
 };

 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet,
host_features, VIRTIO_NET_F_GSO, true), -DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features, +DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true), +DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features, VIRTIO_NET_F_GUEST_TSO4, true), -DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features, VIRTIO_NET_F_GUEST_TSO6, true), -DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features, VIRTIO_NET_F_GUEST_ECN, true), -DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features, VIRTIO_NET_F_GUEST_UFO, true), -DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features, VIRTIO_NET_F_GUEST_ANNOUNCE, true), -DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features, VIRTIO_NET_F_HOST_TSO4, true), -DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features, VIRTIO_NET_F_HOST_TSO6, true), -DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features, VIRTIO_NET_F_HOST_ECN, true), -DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features, VIRTIO_NET_F_HOST_UFO, true), -DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features, +DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features, VIRTIO_NET_F_MRG_RXBUF, true), -DEFINE_PROP_BIT("status", VirtIONet, host_features, +DEFINE_PROP_BIT64("status", VirtIONet, host_features, VIRTIO_NET_F_STATUS, true), -DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features, +DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features, VIRTIO_NET_F_CTRL_VQ, true), -DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features, +DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features, V
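[Editor's note] The motivation for the 32-to-64-bit conversion is that bit 63 simply cannot live in a 32-bit flags word: a plain (1 << 63) is undefined behavior on 32-bit int, and the mask is lost on truncation. A small illustration (the helper is hypothetical, only the widening pattern matches the patch):

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_NET_F_MTU          3
#define VIRTIO_NET_F_SPEED_DUPLEX 63

/* 1ULL forces the shift to happen in 64 bits, so bit 63 is usable;
 * with a 32-bit flags word the high feature bits would vanish. */
static uint64_t feature_mask(unsigned int bit)
{
	return 1ULL << bit;
}
```

Truncating the bit-63 mask to 32 bits yields zero, which is exactly why host_features and DEFINE_PROP_BIT had to become 64-bit before VIRTIO_NET_F_SPEED_DUPLEX could be added.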
Re: [PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
On 12/27/2017 04:43 PM, David Miller wrote: > From: Jason Baron <jba...@akamai.com> > Date: Fri, 22 Dec 2017 16:54:01 -0500 > >> The ability to set speed and duplex for virtio_net in useful in various >> scenarios as described here: >> >> 16032be virtio_net: add ethtool support for set and get of settings >> >> However, it would be nice to be able to set this from the hypervisor, >> such that virtio_net doesn't require custom guest ethtool commands. >> >> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows >> the hypervisor to export a linkspeed and duplex setting. The user can >> subsequently overwrite it later if desired via: 'ethtool -s'. >> >> Signed-off-by: Jason Baron <jba...@akamai.com> >> Cc: "Michael S. Tsirkin" <m...@redhat.com> >> Cc: Jason Wang <jasow...@redhat.com> > > Looks mostly fine to me but need some virtio_net reviewers on this one. > >> @@ -57,6 +57,8 @@ >> * Steering */ >> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ >> >> +#define VIRTIO_NET_F_SPEED_DUPLEX 63/* Host set linkspeed and >> duplex */ >> + > > Why use a value so far away from the largest existing one? > > Just curious. > So that came from a discussion with Michael about which bit to use for this, and he suggested using 63: " Transports started from bit 24 and are growing up. So I would say devices should start from bit 63 and grow down. " https://patchwork.ozlabs.org/patch/848814/#1826669 I will add a comment to explain it. Thanks, -Jason
[PATCH 3/3] qemu: add linkspeed and duplex settings to virtio-net
Although linkspeed and duplex can be set in a linux guest via 'ethtool -s', this requires custom ethtool commands for virtio-net by default. Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows the hypervisor to export a linkspeed and duplex setting. The user can subsequently overwrite it later if desired via: 'ethtool -s'. Linkspeed and duplex settings can be set as: '-device virtio-net,speed=1,duplex=full' where speed is [-1...INT_MAX], and duplex is ["half"|"full"]. Signed-off-by: Jason Baron <jba...@akamai.com> Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> --- hw/net/virtio-net.c | 29 + include/hw/virtio/virtio-net.h | 3 +++ include/standard-headers/linux/virtio_net.h | 4 3 files changed, 36 insertions(+) create mode 16 pixman diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index adc20df..7fafe6e 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -40,6 +40,12 @@ #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE +/* duplex and speed defines */ +#define DUPLEX_UNKNOWN 0xff +#define DUPLEX_HALF 0x00 +#define DUPLEX_FULL 0x01 +#define SPEED_UNKNOWN -1 + /* * Calculate the number of bytes up to and including the given 'field' of * 'container'. 
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };

@@ -88,6 +96,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.status, n->status);
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 memcpy(config, &netcfg, n->config_size);
 }

@@ -1941,6 +1951,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }

+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKNOWN) and "
+           "INT_MAX");
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);

@@ -2160,6 +2187,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
                  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++
b/include/hw/virtio/virtio-net.h @@ -38,6 +38,9 @@ typedef struct virtio_net_conf uint16_t rx_queue_size; uint16_t tx_queue_size; uint16_t mtu; +int32_t speed; +char *duplex_str; +uint8_t duplex; } virtio_net_conf; /* Maximum packet size we can receive from tap device: header + 64k */ diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h index 30ff249..0ff1447 100644 --- a/include/standard-headers/linux/virtio_net.h +++ b/include/standard-headers/linux/virtio_net.h @@ -36,6 +36,7 @@ #define VIRTIO_NET_F_GUEST_CSUM1 /* Guest handles pkts w/ partial csum */ #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */ #define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */ +#define VIRTIO_NET_F_SPEED_DUPLEX 4/* Host set linkspeed and duplex */ #define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */ #define VIRTIO_NET_F_GUEST_TSO47 /* Guest can handle TSOv4 in. */ #defi
[PATCH 2/3] qemu: use 64-bit values for feature flags in virtio-net
In preparation for using some of the high order feature bits, make sure that virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 hw/net/virtio-net.c            | 54 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..adc20df 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))

 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;

 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };

@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 int i;

 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }

 virtio_net_set_config_size(n, n->host_features);

@@ -2109,45 +2109,45 @@ static const VMStateDescription vmstate_virtio_net = {
 };

 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features, +DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true), +DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features, VIRTIO_NET_F_GUEST_TSO4, true), -DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features, VIRTIO_NET_F_GUEST_TSO6, true), -DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features, VIRTIO_NET_F_GUEST_ECN, true), -DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features, VIRTIO_NET_F_GUEST_UFO, true), -DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features, +DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features, VIRTIO_NET_F_GUEST_ANNOUNCE, true), -DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features, VIRTIO_NET_F_HOST_TSO4, true), -DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features, VIRTIO_NET_F_HOST_TSO6, true), -DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features, VIRTIO_NET_F_HOST_ECN, true), -DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features, +DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features, VIRTIO_NET_F_HOST_UFO, true), -DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features, +DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features, VIRTIO_NET_F_MRG_RXBUF, true), -DEFINE_PROP_BIT("status", VirtIONet, host_features, +DEFINE_PROP_BIT64("status", VirtIONet, host_features, VIRTIO_NET_F_STATUS, true), -DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features, +DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features, VIRTIO_NET_F_CTRL_VQ, true), -DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features, +DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features, VIRTIO_NET_F_CTRL_RX, true),
[PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
The ability to set speed and duplex for virtio_net is useful in various scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor, such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows the hypervisor to export a linkspeed and duplex setting. The user can subsequently overwrite it later if desired via: 'ethtool -s'.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c        | 17 -
 include/uapi/linux/virtio_net.h |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 511f833..4168d82 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2621,6 +2621,20 @@ static int virtnet_probe(struct virtio_device *vdev)
 netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);

 virtnet_init_settings(dev);
+if (virtio_has_feature(vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
+u32 speed;
+u8 duplex;
+
+speed = virtio_cread32(vdev, offsetof(struct virtio_net_config,
+speed));
+if (ethtool_validate_speed(speed))
+vi->speed = speed;
+duplex = virtio_cread8(vdev,
+offsetof(struct virtio_net_config,
+duplex));
+if (ethtool_validate_duplex(duplex))
+vi->duplex = duplex;
+}

 err = register_netdev(dev);
 if (err) {

@@ -2746,7 +2760,8 @@ static struct virtio_device_id id_table[] = {
 VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
 VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 VIRTIO_NET_F_CTRL_MAC_ADDR, \
-VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+VIRTIO_NET_F_SPEED_DUPLEX

 static unsigned int features[] = {
 VIRTNET_FEATURES,

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..0f1548e 100644
---
a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -57,6 +57,8 @@ * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ +#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Host set linkspeed and duplex */ + #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ #endif /* VIRTIO_NET_NO_LEGACY */ @@ -76,6 +78,9 @@ struct virtio_net_config { __u16 max_virtqueue_pairs; /* Default maximum transmit unit advice */ __u16 mtu; + /* Host exported linkspeed and duplex */ + __u32 speed; + __u8 duplex; } __attribute__((packed)); /* -- 2.6.1
[PATCH v2 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting
We have found it useful to be able to set the linkspeed and duplex settings from the host-side for virtio_net. This obviates the need for guest changes and settings for these fields, and does not require custom ethtool commands for virtio_net.

The ability to set linkspeed and duplex is useful in various cases as described here:

16032be virtio_net: add ethtool support for set and get of settings

Using 'ethtool -s' continues to overwrite the linkspeed/duplex settings with this patch.

The 1/3 patch is against net-next, while patches 2-3/3 are the associated qemu changes that would go in afterwards, since update-linux-headers.sh should be run first. So the qemu patches are a demonstration of how I intend this to work.

Thanks,

-Jason

linux changes:

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c        | 17 -
 include/uapi/linux/virtio_net.h |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

qemu changes:

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c                         | 83 +++--
 include/hw/virtio/virtio-net.h              |  5 +-
 include/standard-headers/linux/virtio_net.h |  4 ++
 3 files changed, 64 insertions(+), 28 deletions(-)
[PATCH net] tcp: correct memory barrier usage in tcp_check_space()
From: Jason Baron <jba...@akamai.com> sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit(). Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient. Fixes: 3c7151275c0c ("tcp: add memory barriers to write space paths") Cc: Eric Dumazet <eric.duma...@gmail.com> Cc: Oleg Nesterov <o...@redhat.com> Signed-off-by: Jason Baron <jba...@akamai.com> --- net/ipv4/tcp_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bfa165cc455a..1e22ae4a5b38 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5028,7 +5028,7 @@ static void tcp_check_space(struct sock *sk) if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) { sock_reset_flag(sk, SOCK_QUEUE_SHRUNK); /* pairs with tcp_poll() */ - smp_mb__after_atomic(); + smp_mb(); if (sk->sk_socket && test_bit(SOCK_NOSPACE, >sk_socket->flags)) { tcp_new_space(sk); -- 2.6.1
Re: wrong smp_mb__after_atomic() in tcp_check_space() ?
On 01/23/2017 01:04 PM, Eric Dumazet wrote: On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote: On 01/23/2017 09:30 AM, Oleg Nesterov wrote: Hello, smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) (non-atomic too) won't be reordered. Indeed. Here's a bit of discussion on it: http://marc.info/?l=linux-netdev=146662325920596=2 It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" and the patch looks correct in that we need the barriers in tcp_check_space() and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? Yes, I think it should be upgraded to an smp_mb() there. If you agree with this analysis, I will send a patch to upgrade it. Note, I did not actually run into this race in practice. SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll(). (Otherwise it would be using atomic set/clear operations) I do not see obvious reason why we have this smp_mb__after_atomic() in tcp_check_space(). The idea of the smp_mb__after_atomic() in tcp_check_space() was to ensure that the 'read' of SOCK_NOSPACE there didn't happen before any of the 'write' to make sk_stream_is_writeable() true. Otherwise, we could miss doing the wakeup from tcp_check_space(). There is probably an argument here that there will likely be a subsequent call to tcp_check_space() that will see the SOCK_NOSPACE bit set, but in theory we could have a small send buffer, or a lot of data could be ack'd in one go. What I missed in the original patch was that sock_reset_flag() isn't an atomic operation and thus the smp_mb__after_atomic() is wrong. But looking at this code, it seems we lack one barrier if sk_sndbuf is ever increased. Fortunately this almost never happen during TCP session lifetime... But the wakeup from sk->sk_write_space(sk) will imply a smp_wmb() as per the comment in __wake_up() ? 
Thanks,

-Jason

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk)
 	sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
 	sndmem *= nr_segs * per_mss;

-	if (sk->sk_sndbuf < sndmem)
+	if (sk->sk_sndbuf < sndmem) {
 		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+		/* Paired with second sk_stream_is_writeable(sk)
+		 * test from tcp_poll()
+		 */
+		smp_wmb();
+	}
 }

 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
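[Editor's note] The crux of the thread above is that smp_mb__after_atomic() only orders against atomic RMW operations, while sock_reset_flag() is a plain __clear_bit(), so nothing stops the SOCK_NOSPACE read from being reordered before the clear. A userspace analogue of the tcp_check_space() shape, with atomic_thread_fence() standing in for the full smp_mb() of the fix (flag bits and helper names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SOCK_QUEUE_SHRUNK (1u << 0)
#define SOCK_NOSPACE      (1u << 1)

static unsigned int sock_flags;

/* Shape of tcp_check_space(): clear SOCK_QUEUE_SHRUNK with a plain
 * store (like __clear_bit()), then read SOCK_NOSPACE.  The full fence
 * between them is what keeps the read from moving above the clear
 * against a concurrent tcp_poll()-style thread. */
static bool check_space(void)
{
	if (sock_flags & SOCK_QUEUE_SHRUNK) {
		sock_flags &= ~SOCK_QUEUE_SHRUNK;  /* plain, non-atomic clear */
		atomic_thread_fence(memory_order_seq_cst); /* the smp_mb() */
		if (sock_flags & SOCK_NOSPACE)
			return true;  /* would call tcp_new_space() / wake up */
	}
	return false;
}
```

Run single-threaded the fence changes nothing observable; it only matters against a concurrent poller, which is also why the missing-wakeup race is so hard to hit in practice.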
Re: wrong smp_mb__after_atomic() in tcp_check_space() ?
On 01/23/2017 09:30 AM, Oleg Nesterov wrote: Hello, smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE) (non-atomic too) won't be reordered. Indeed. Here's a bit of discussion on it: http://marc.info/?l=linux-netdev=146662325920596=2 It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths" and the patch looks correct in that we need the barriers in tcp_check_space() and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ? Yes, I think it should be upgraded to an smp_mb() there. If you agree with this analysis, I will send a patch to upgrade it. Note, I did not actually run into this race in practice. Thanks, -Jason
[PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
From: Jason Baron <jba...@akamai.com>

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab') are abruptly terminated (via
^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is
accepted by the Linux stack but the RST is sent with the same sequence
number as the FIN, and Linux responds with a challenge ACK per RFC 5961.
The OSX client then sometimes (they are rate-limited) does not reply with
any RST as would be expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in the
CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This
sequence of events can tie up a lot of resources on the Linux server since
there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

A packetdrill test demonstrating the behavior:

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{ assert tcpi_state == 7 }%

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a34e9278c07..bfa165cc455a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5199,6 +5199,23 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX rate limits replies to challenge
+ * ACKs on the closed socket. In addition middleboxes can drop either the
+ * challenge ACK or a subsequent RST.
+ */
+static bool tcp_reset_check(const struct sock *sk, const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely(TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1) &&
+			(1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK |
+					       TCPF_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5237,20 +5254,25 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 					  LINUX_MIB_TCPACKSKIPPEDSEQ,
 					  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
+		} else if (tcp_reset_check(sk, skb)) {
+			tcp_reset(sk);
 		}
 		goto discard;
 	}
 
 	/* Step 2: check RST bit */
 	if (th->rst) {
-		/* RFC 5961 3.2 (extend to match against SACK too if available):
-		 * If seq num matches RCV.NXT or the right-most SACK block,
+		/* RFC 5961 3.2 (extend to match against (RCV.NXT - 1) after a
+		 * FIN and SACK too if available):
+		 * If seq num matches RCV.NXT or (RCV.NXT - 1) after a FIN, or
+		 * the right-most SACK block,
 		 * then
 		 *     RESET the connection
 		 *   else
 		 *     Send a challenge ACK
 		 */
-		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+		    tcp_reset_check(sk, skb)) {
 			rst_seq_match = true;
 		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 			struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1
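As a side note on the comparison in tcp_reset_check(): `seq == tp->rcv_nxt - 1` is plain 32-bit unsigned arithmetic, so the match also behaves correctly across the 2^32 sequence-number wrap. A minimal standalone sketch (the function name here is hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: the rcv_nxt - 1 match relies on unsigned wraparound, so a
 * RST carrying seq 0xffffffff still matches when rcv_nxt has just
 * wrapped to 0.
 */
static bool rst_matches_fin_seq(uint32_t seq, uint32_t rcv_nxt)
{
    return seq == (uint32_t)(rcv_nxt - 1);
}
```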
Re: [RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
On 01/11/2017 10:48 AM, Eric Dumazet wrote:
> On Thu, 2017-01-05 at 16:33 -0500, Jason Baron wrote:
>
>> +/* Accept RST for rcv_nxt - 1 after a FIN.
>> + * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
>> + * FIN is sent followed by a RST packet. The RST is sent with the same
>> + * sequence number as the FIN, and thus according to RFC 5961 a challenge
>> + * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
>> + * with a RST on the closed socket, hence accept this class of RSTs.
>> + */
>> +static bool tcp_reset_check(struct sock *sk, struct sk_buff *skb)
>
> const struct sock *sk, const struct sk_buff *skb
>
>> +{
>> +	struct tcp_sock *tp = tcp_sk(sk);
>> +
>> +	return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
>> +			(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
>
> Why is the test on end_seq needed ?

Hi,

(Re-sending - seems like my reply was lost)

I wanted to define this condition as narrowly as I could. I'm ok dropping
it - I'm not sure it's going to make much difference in practice. So to
that end, dropping this extra check makes sense.

I posted this as RFC because I don't think RFC 5961 says anything about
accepting rcv_nxt - 1 in this case, so I was wondering what people
thought...

Thanks,

-Jason

>> +			(sk->sk_state == TCP_CLOSE_WAIT ||
>> +			 sk->sk_state == TCP_LAST_ACK ||
>> +			 sk->sk_state == TCP_CLOSING));
>> +}
>
> Testing many states can be done more efficiently :
>
>	(1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK |
>			       TCPF_CLOSING)
>
> Thanks
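Eric's suggestion works because each TCP state maps to one bit (TCPF_X == 1 << TCP_X), so membership in a set of states collapses to a single mask test instead of a chain of comparisons. A standalone model (the helper name is hypothetical; the enum values mirror include/net/tcp_states.h):

```c
#include <stdbool.h>

/* TCP state numbering as in include/net/tcp_states.h */
enum { TCP_ESTABLISHED = 1, TCP_SYN_SENT, TCP_SYN_RECV, TCP_FIN_WAIT1,
       TCP_FIN_WAIT2, TCP_TIME_WAIT, TCP_CLOSE, TCP_CLOSE_WAIT,
       TCP_LAST_ACK, TCP_LISTEN, TCP_CLOSING };

#define TCPF_CLOSE_WAIT (1 << TCP_CLOSE_WAIT)
#define TCPF_LAST_ACK   (1 << TCP_LAST_ACK)
#define TCPF_CLOSING    (1 << TCP_CLOSING)

/* one mask test replaces three equality comparisons */
static bool state_in_fin_set(int state)
{
    return ((1 << state) &
            (TCPF_CLOSE_WAIT | TCPF_LAST_ACK | TCPF_CLOSING)) != 0;
}
```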
Re: [RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
On 01/11/2017 12:17 AM, Christoph Paasch wrote:
> Hello Jason,
>
> (resending as Gmail sent out with HTML)
>
> On 05/01/17 - 16:33:28, Jason Baron wrote:
>> Using a Mac OSX box as a client connecting to a Linux server, we have
>> found that when certain applications (such as 'ab') are abruptly
>> terminated (via ^C), a FIN is sent followed by a RST packet on tcp
>> connections. The FIN is accepted by the Linux stack but the RST is sent
>> with the same sequence number as the FIN, and Linux responds with a
>> challenge ACK per RFC 5961. The OSX client then does not reply with any
>> RST as would be expected on a closed socket.
>
> do you see this behavior consistently, even in a controlled environment?
>
> The problem seems rather to be that after the first RST, the NAT on the
> path has dropped its mapping and is thus dropping all other traffic. So,
> Linux's challenge-ack does not go through to the OSX-host to
> "re-synchronize" the state (which would allow OSX to send a RST with the
> updated sequence numbers).
>
> This is also documented in RFC 5961:
>
>   9.3.  Middleboxes That Drop the Challenge ACK
>
>   It also needs to be noted that, some middleboxes (Firewalls/NATs)
>   that don't have the fix recommended in the document, may drop the
>   challenge ACK.  This can happen because, the original RST segment
>   that was in window had already cleared the flow state pertaining to
>   the TCP connection in the middlebox.  In such cases, the end hosts
>   that have implemented the RST mitigation described in this document,
>   will have the TCP connection left open.  This is a corner case and
>   can go away if the middlebox is conformant with the changes proposed
>   in this document.

Yes, I've observed a couple of different ways that this can happen:

1) The case where Mac OSX does not send an RST in response to the
challenge ACK. I found that there is actually a rate limit to the number
of 'RSTs' that Mac OSX will send on a closed socket. It's controlled by
'net.inet.icmp.icmplim'. By default it's set to 250, which means it will
send up to 250 'RSTs' on closed sockets per second. I've confirmed that I
can hit this limit by looking in the Mac OSX logs.

2) When I had the Mac OSX box connecting over a VPN to the Linux server,
I found that while the challenge ACK sent by Linux did reach the Mac OSX
box, the RST that Mac OSX sent could get dropped somewhere in between.
That is, I see it sent from the Mac OSX box but never see it received by
the Linux server. This is closer to the scenario you described above.

Also, as mentioned, we've had this deployed in production where we've
seen this change make a real difference in capacity (due to freeing up
memory resources much more quickly). Since I don't have dumps from the
client side, I can't say exactly what the sequence of events is there.

Thanks,

-Jason

> Cheers,
> Christoph
[RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab') are abruptly terminated (via
^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is
accepted by the Linux stack but the RST is sent with the same sequence
number as the FIN, and Linux responds with a challenge ACK per RFC 5961.
The OSX client then does not reply with any RST as would be expected on a
closed socket.

This results in sockets accumulating on the Linux server left mostly in the
CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This
sequence of events can tie up a lot of resources on the Linux server since
there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

I also found a posting that the iOS client behaves in a similar manner here
(it will send a FIN followed by a RST for rcv_nxt - 1):

https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/

A packetdrill test demonstrating the behavior:

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{ assert tcpi_state == 7 }%

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec6d84363024..373bea05c93b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5249,6 +5249,24 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
+ * with a RST on the closed socket, hence accept this class of RSTs.
+ */
+static bool tcp_reset_check(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+			(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
+			(sk->sk_state == TCP_CLOSE_WAIT ||
+			 sk->sk_state == TCP_LAST_ACK ||
+			 sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5287,6 +5305,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 					  LINUX_MIB_TCPACKSKIPPEDSEQ,
 					  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
+		} else if (tcp_reset_check(sk, skb)) {
+			tcp_reset(sk);
 		}
 		goto discard;
 	}
@@ -5300,7 +5320,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		 * else
 		 *     Send a challenge ACK
 		 */
-		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+		    tcp_reset_check(sk, skb)) {
 			rst_seq_match = true;
 		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 			struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1
Re: [net PATCH] fib_trie: Correct /proc/net/route off by one error
On 11/04/2016 03:11 PM, Alexander Duyck wrote:
> The display of /proc/net/route has had a couple issues due to the fact
> that when I originally rewrote most of fib_trie I made it so that the
> iterator was tracking the next value to use instead of the current.
>
> In addition it had an off by 1 error where I was tracking the first
> piece of data as position 0, even though in reality that belonged to the
> SEQ_START_TOKEN.
>
> This patch updates the code so the iterator tracks the last reported
> position and key instead of the next expected position and key. In
> addition it shifts things so that all of the leaves start at 1 instead
> of trying to report leaves starting with offset 0 as being valid. With
> these two issues addressed this should resolve any off by one errors
> that were present in the display of /proc/net/route.
>
> Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
> Cc: Andy Whitcroft <a...@canonical.com>
> Reported-by: Jason Baron <jba...@akamai.com>
> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
> ---
>  net/ipv4/fib_trie.c | 21 +++++++++------------
>  1 file changed, 9 insertions(+), 12 deletions(-)

Ok. Works for me. Feel free to add:

Reviewed-and-Tested-by: Jason Baron <jba...@akamai.com>

Thanks,

-Jason
Re: [PATCH net] fib_trie: correct /proc/net/route for large read buffer
On 11/04/2016 02:43 PM, Alexander Duyck wrote:
> On Fri, Nov 4, 2016 at 7:45 AM, Jason Baron <jba...@akamai.com> wrote:
>> From: Jason Baron <jba...@akamai.com>
>>
>> When read() is called on /proc/net/route requesting a size that is one
>> entry size (128 bytes) less than m->size or greater, the resulting
>> output has missing and/or duplicate entries. Since m->size is typically
>> PAGE_SIZE, for a PAGE_SIZE of 4,096 this means that reads requesting
>> more than 3,968 bytes will see bogus output. For example:
>>
>> for i in {100..200}; do
>>         ip route add 192.168.1.$i dev eth0
>> done
>> dd if=/proc/net/route of=/tmp/good bs=1024
>> dd if=/proc/net/route of=/tmp/bad bs=4096
>> # diff -q /tmp/good /tmp/bad
>> Files /tmp/good and /tmp/bad differ
>>
>> I think this has gone unnoticed, since the output of 'netstat -r' and
>> 'route' is generated by reading in 1,024 byte increments and thus not
>> corrupted. Further, the number of entries in the route table needs to
>> be sufficiently large in order to trigger the problematic case.
>>
>> The issue arises because fib_route_get_idx() does not properly handle
>> the case where pos equals iter->pos. This case only arises when we have
>> a large read buffer size because we end up re-requesting the last entry
>> that overflowed m->buf. In the case of a smaller read buffer size, we
>> don't exceed the size of m->buf, and thus fib_route_get_idx() is called
>> with pos greater than iter->pos. Fix by properly handling the
>> iter->pos == pos case.
>>
>> Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
>> Cc: Andy Whitcroft <a...@canonical.com>
>> Cc: Alexander Duyck <alexander.h.du...@intel.com>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> ---
>>  net/ipv4/fib_trie.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> index 31cef3602585..1017533fc75c 100644
>> --- a/net/ipv4/fib_trie.c
>> +++ b/net/ipv4/fib_trie.c
>> @@ -2411,12 +2411,17 @@ static struct key_vector *fib_route_get_idx(struct fib_route_iter *iter,
>>  					    loff_t pos)
>>  {
>>  	struct key_vector *l, **tp = &iter->tnode;
>> +	loff_t saved_pos = 0;
>>  	t_key key;
>>
>>  	/* use cache location of next-to-find key */
>>  	if (iter->pos > 0 && pos >= iter->pos) {
>>  		pos -= iter->pos;
>>  		key = iter->key;
>> +		if (pos == 0) {
>> +			saved_pos = iter->pos;
>> +			key--;
>> +		}
>>  	} else {
>>  		iter->pos = 0;
>>  		key = 0;
>> @@ -2436,10 +2441,13 @@ static struct key_vector *fib_route_get_idx(struct fib_route_iter *iter,
>>  			break;
>>  	}
>>
>> -	if (l)
>> +	if (l) {
>>  		iter->key = key;	/* remember it */
>> -	else
>> +		if (saved_pos)
>> +			iter->pos = saved_pos;
>> +	} else {
>>  		iter->pos = 0;	/* forget it */
>> +	}
>>
>>  	return l;
>>  }
>
> This doesn't seem correct to me. I will have to look through this. My
> understanding is that the value of iter->pos is supposed to be the next
> position for us to grab, not the last one that was retrieved. If we are
> trying to re-request the last value then we should be falling back into
> the else case for this since pos should be one less than iter->pos.
>
> The problem is the table could change out from under us which is one of
> the reasons why we don't want to try and rewind the key like you are
> doing here.
>
> - Alex

Hi Alex,

In this case, seq_read() has called m->op->next(), which sets iter->pos
equal to pos and iter->key to key + 1. However, when we then go to output
the item associated with key, the output overflows m->buf. Thus, we have a
situation where iter->pos equals pos and iter->key = key + 1, but we have
not displayed the item at position 'key' (thus the bug is that we miss the
item at key).

The change I proposed was simply to restart the search from 'key' in this
case. If that item has disappeared, we will output the next one, or if
it's been replaced, we will display its replacement. I think that is ok?

The bug could also be fixed by changing:

	if (iter->pos > 0 && pos >= iter->pos) {

to say:

	if (iter->pos > 0 && pos > iter->pos) {

But that restarts the search on every overflow, which could mean every
page size, and that seems suboptimal to me. Likewise, if we make pos 1
less than iter->pos, that restarts the search. The idea with this patch is
to not force us to redo the entire search on each overflow.

Thanks,

-Jason
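The resume-vs-re-request distinction being debated here can be illustrated with a toy userspace model. This is only a sketch under loud assumptions: the real fib_route_get_idx() walks a trie with leaf_walk_rcu() and keys can be sparse or change between calls, whereas here keys are simply 0..nkeys-1 and the names (`route_iter`, `get_idx`) are hypothetical simplifications. The point is the cached (pos, key) pair and the special case where seq_read() asks for the *same* pos again after an output overflow.

```c
/* Cache of the last entry handed out, like fib_route_iter */
struct route_iter {
    long pos;       /* 1-based position of last returned entry, 0 = none */
    unsigned key;   /* key of that entry */
};

/* Return the key at 1-based position pos, or -1 past the end.
 * pos == iter->pos means "re-request after overflow": re-search from the
 * cached key instead of skipping past it (the bug) or rescanning from 0
 * (correct but slow).
 */
static long get_idx(struct route_iter *iter, long pos, unsigned nkeys)
{
    unsigned key = 0;
    long remaining = pos;       /* entries still to walk past */

    if (iter->pos > 0 && pos > iter->pos) {
        /* normal resume: continue after the cached entry */
        key = iter->key + 1;
        remaining = pos - iter->pos;
    } else if (iter->pos > 0 && pos == iter->pos) {
        /* overflow retry: hand out the cached entry again */
        key = iter->key;
        remaining = 1;
    }
    /* linear stand-in for the kernel's trie walk */
    key += (unsigned)(remaining - 1);
    if (key >= nkeys)
        return -1;
    iter->pos = pos;
    iter->key = key;
    return (long)key;
}
```

With a dense key space, position p always maps to key p - 1, and a repeated request for the same position returns the same entry rather than silently skipping one.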
[PATCH net] fib_trie: correct /proc/net/route for large read buffer
From: Jason Baron <jba...@akamai.com>

When read() is called on /proc/net/route requesting a size that is one
entry size (128 bytes) less than m->size or greater, the resulting output
has missing and/or duplicate entries. Since m->size is typically
PAGE_SIZE, for a PAGE_SIZE of 4,096 this means that reads requesting more
than 3,968 bytes will see bogus output. For example:

for i in {100..200}; do
        ip route add 192.168.1.$i dev eth0
done
dd if=/proc/net/route of=/tmp/good bs=1024
dd if=/proc/net/route of=/tmp/bad bs=4096
# diff -q /tmp/good /tmp/bad
Files /tmp/good and /tmp/bad differ

I think this has gone unnoticed, since the output of 'netstat -r' and
'route' is generated by reading in 1,024 byte increments and thus not
corrupted. Further, the number of entries in the route table needs to be
sufficiently large in order to trigger the problematic case.

The issue arises because fib_route_get_idx() does not properly handle the
case where pos equals iter->pos. This case only arises when we have a
large read buffer size because we end up re-requesting the last entry that
overflowed m->buf. In the case of a smaller read buffer size, we don't
exceed the size of m->buf, and thus fib_route_get_idx() is called with pos
greater than iter->pos. Fix by properly handling the iter->pos == pos
case.

Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft <a...@canonical.com>
Cc: Alexander Duyck <alexander.h.du...@intel.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/fib_trie.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 31cef3602585..1017533fc75c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2411,12 +2411,17 @@ static struct key_vector *fib_route_get_idx(struct fib_route_iter *iter,
 					    loff_t pos)
 {
 	struct key_vector *l, **tp = &iter->tnode;
+	loff_t saved_pos = 0;
 	t_key key;
 
 	/* use cache location of next-to-find key */
 	if (iter->pos > 0 && pos >= iter->pos) {
 		pos -= iter->pos;
 		key = iter->key;
+		if (pos == 0) {
+			saved_pos = iter->pos;
+			key--;
+		}
 	} else {
 		iter->pos = 0;
 		key = 0;
@@ -2436,10 +2441,13 @@ static struct key_vector *fib_route_get_idx(struct fib_route_iter *iter,
 			break;
 	}
 
-	if (l)
+	if (l) {
 		iter->key = key;	/* remember it */
-	else
+		if (saved_pos)
+			iter->pos = saved_pos;
+	} else {
 		iter->pos = 0;	/* forget it */
+	}
 
 	return l;
 }
-- 
2.6.1
Re: [PATCH v2 net-next 0/2] bnx2x: page allocation failure
On 10/03/2016 08:19 PM, David Miller wrote:
> From: "Baron, Jason"
> Date: Mon, 3 Oct 2016 20:24:32 +
>
>> Or should I just send the incremental at this point?
>
> Incremental is required at this point.

Hi David,

Ok. The above question was sent out erroneously. I have already posted the
incremental patch that I was referring to here:

https://patchwork.ozlabs.org/patch/675227/

Prior to that incremental patch post, I had composed the above question,
but realized that the right thing to do was simply to send the
incremental. However, I forgot to delete the 'draft' mail that I had, and
it accidentally got sent out on Monday.

Please ignore the noise, we are all set here :)

Thanks,

-Jason
[PATCH net-next] bnx2x: free the mac filter group list before freeing the cmd
The group list must be freed prior to freeing the command, otherwise we
have a use-after-free.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Yuval Mintz <yuval.mi...@qlogic.com>
Cc: Ariel Elior <ariel.el...@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index 4947a9c..cea6bdc 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2714,8 +2714,8 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 		elem_group = (struct bnx2x_mcast_elem_group *)
 			     __get_free_page(GFP_ATOMIC | __GFP_ZERO);
 		if (!elem_group) {
-			kfree(new_cmd);
 			bnx2x_free_groups(&new_cmd->group_head);
+			kfree(new_cmd);
 			return -ENOMEM;
 		}
 		total_elems -= MCAST_MAC_ELEMS_PER_PG;
-- 
1.9.1
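The ordering rule the fix enforces is general: when a list head is embedded in an object, the list must be emptied before the object is freed, because walking the list dereferences the (now freed) head. A minimal userspace sketch, with simplified hypothetical types standing in for the driver's structures:

```c
#include <stdlib.h>

/* cmd->group_head plays the role of new_cmd->group_head: the list head
 * lives *inside* the command, so free(cmd) must come last.
 */
struct group { struct group *next; };
struct cmd  { struct group *group_head; };

static int freed_groups; /* instrumentation for the example */

static void free_groups(struct group **head)
{
    while (*head) {
        struct group *g = *head;
        *head = g->next;
        free(g);
        freed_groups++;
    }
}

static void destroy_cmd(struct cmd *cmd)
{
    free_groups(&cmd->group_head); /* 1: empty the embedded list */
    free(cmd);                     /* 2: only then free the container */
}

static struct cmd *make_cmd(int n)
{
    struct cmd *cmd = calloc(1, sizeof(*cmd));
    while (n-- > 0) {
        struct group *g = malloc(sizeof(*g));
        g->next = cmd->group_head;
        cmd->group_head = g;
    }
    return cmd;
}
```

Reversing the two steps in destroy_cmd() is exactly the bug the patch fixes: free_groups() would read cmd->group_head out of freed memory.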
Re: [PATCH] fs/select: add vmalloc fallback for select(2)
Hi,

On 09/23/2016 03:24 AM, Nicholas Piggin wrote:
> On Fri, 23 Sep 2016 14:42:53 +0800
> "Hillf Danton" wrote:
>
>>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size
>>> grows with the number of fds passed. We had a customer report page
>>> allocation failures of order-4 for this allocation. This is a costly
>>> order, so it might easily fail, as the VM expects such allocation to
>>> have a lower-order fallback.
>>>
>>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>>> physically contiguous. Also the allocation is temporary for the
>>> duration of the syscall, so it's unlikely to stress vmalloc too much.
>>>
>>> Note that the poll(2) syscall seems to use a linked list of order-0
>>> pages, so it doesn't need this kind of fallback.
>
> How about something like this? (untested) Eric isn't wrong about
> vmalloc sucking :)
>
> Thanks,
> Nick
>
> ---
>  fs/select.c | 57 +++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 43 insertions(+), 14 deletions(-)
>
> diff --git a/fs/select.c b/fs/select.c
> index 8ed9da5..3b4834c 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -555,6 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	void *bits;
>  	int ret, max_fds;
>  	unsigned int size;
> +	size_t nr_bytes;
>  	struct fdtable *fdt;
>  	/* Allocate small arguments on the stack to save memory and be faster */
>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> @@ -576,21 +577,39 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	 * since we used fdset we need to allocate memory in units of
>  	 * long-words.
>  	 */
> -	size = FDS_BYTES(n);
> +	ret = -ENOMEM;
>  	bits = stack_fds;
> -	if (size > sizeof(stack_fds) / 6) {
> -		/* Not enough space in on-stack array; must use kmalloc */
> +	size = FDS_BYTES(n);
> +	nr_bytes = 6 * size;
> +
> +	if (unlikely(nr_bytes > PAGE_SIZE)) {
> +		/* Avoid multi-page allocation if possible */
>  		ret = -ENOMEM;
> -		bits = kmalloc(6 * size, GFP_KERNEL);
> -		if (!bits)
> -			goto out_nofds;
> +		fds.in = kmalloc(size, GFP_KERNEL);
> +		fds.out = kmalloc(size, GFP_KERNEL);
> +		fds.ex = kmalloc(size, GFP_KERNEL);
> +		fds.res_in = kmalloc(size, GFP_KERNEL);
> +		fds.res_out = kmalloc(size, GFP_KERNEL);
> +		fds.res_ex = kmalloc(size, GFP_KERNEL);
> +
> +		if (!(fds.in && fds.out && fds.ex &&
> +		      fds.res_in && fds.res_out && fds.res_ex))
> +			goto out;
> +	} else {
> +		if (nr_bytes > sizeof(stack_fds)) {
> +			/* Not enough space in on-stack array */
> +			if (nr_bytes > PAGE_SIZE * 2)

The 'if' looks extraneous?

Also, I wonder if we can just avoid some allocations altogether by
checking if the user fd_set pointers are NULL? That can avoid failures :)

Thanks,

-Jason
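The all-or-nothing pattern in Nick's sketch (`if (!(a && b && ...)) goto out;`) works because freeing a NULL pointer is defined as a no-op, so one cleanup path can release whatever subset of the allocations succeeded. A standalone userspace version, with hypothetical names (`fd_sets`, `alloc_fd_sets`) and malloc standing in for kmalloc:

```c
#include <stdlib.h>
#include <string.h>

/* Six independent single-"page" allocations replacing one 6*size
 * allocation; the group fails as a unit.
 */
struct fd_sets { void *in, *out, *ex, *res_in, *res_out, *res_ex; };

static int alloc_fd_sets(struct fd_sets *fds, size_t size)
{
    fds->in      = malloc(size);
    fds->out     = malloc(size);
    fds->ex      = malloc(size);
    fds->res_in  = malloc(size);
    fds->res_out = malloc(size);
    fds->res_ex  = malloc(size);

    if (fds->in && fds->out && fds->ex &&
        fds->res_in && fds->res_out && fds->res_ex)
        return 0;

    /* free(NULL) is a no-op, so no per-pointer checks are needed */
    free(fds->in); free(fds->out); free(fds->ex);
    free(fds->res_in); free(fds->res_out); free(fds->res_ex);
    memset(fds, 0, sizeof(*fds));
    return -1; /* -ENOMEM in the kernel */
}
```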
[PATCH v2 net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
From: Jason Baron <jba...@akamai.com>

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters. For example,
we have seen order 2 page allocation failures with ~500 multicast
addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Yuval Mintz <yuval.mi...@qlogic.com>
Cc: Ariel Elior <ariel.el...@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 79 +++++++++++++++---------
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..20fe6a8c35c1 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,64 @@ static int bnx2x_close(struct net_device *dev)
 	return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
-				      struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-	int mc_count = netdev_mc_count(bp->dev);
-	struct bnx2x_mcast_list_elem *mc_mac =
-		kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-	struct netdev_hw_addr *ha;
+	struct list_head mcast_group_link;
+	struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-	if (!mc_mac) {
-		BNX2X_ERR("Failed to allocate mc MAC list\n");
-		return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+	 sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
				      struct bnx2x_mcast_list_elem_group,
				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		free_page((unsigned long)current_mcast_group);
 	}
+}
 
-	INIT_LIST_HEAD(&p->mcast_list);
+static int
+bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+			   struct bnx2x_mcast_ramrod_params *p,
+			   struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem *mc_mac;
+	struct netdev_hw_addr *ha;
+	struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+	int mc_count = netdev_mc_count(bp->dev);
+	int offset = 0;
 
+	INIT_LIST_HEAD(&p->mcast_list);
 	netdev_for_each_mc_addr(ha, bp->dev) {
+		if (!offset) {
+			current_mcast_group =
+				(struct bnx2x_mcast_list_elem_group *)
+				__get_free_page(GFP_ATOMIC);
+			if (!current_mcast_group) {
+				bnx2x_free_mcast_macs_list(mcast_group_list);
+				BNX2X_ERR("Failed to allocate mc MAC list\n");
+				return -ENOMEM;
+			}
+			list_add(&current_mcast_group->mcast_group_link,
+				 mcast_group_list);
+		}
+		mc_mac = &current_mcast_group->mcast_elems[offset];
 		mc_mac->mac = bnx2x_mc_addr(ha);
 		list_add_tail(&mc_mac->link, &p->mcast_list);
-		mc_mac++;
+		offset++;
+		if (offset == MCAST_ELEMS_PER_PG)
+			offset = 0;
 	}
-
 	p->mcast_list_len = mc_count;
-
 	return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-	struct bnx2x_mcast_ramrod_params *p)
-{
-	struct bnx2x_mcast_list_elem *mc_mac =
-		list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-				 link);
-
-	WARN_ON(!mc_mac);
-	kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12668,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+	LIST_HEAD(mcast_group_list);
 	struct net_device *dev = bp->dev;
 	struct bnx2x_mcast_ramrod_params rparam = {NULL};
 	int rc = 0;
@@ -12662,7 +12684,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
 	/* then, configure a new MACs list */
 	if (netdev_mc_count(dev)) {
-		rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+		rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
 		if (rc)
 			return rc;
 
@@ -12673,7 +12695,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
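The layout the patch introduces — a page-sized group holding a link header followed by a flexible array of elements — can be sketched standalone. This is an illustration only, with hypothetical simplified types (`list_elem`, `elem_group`) and calloc standing in for `__get_free_page(GFP_ATOMIC)`; the element count per page falls out of the same `(PAGE_SIZE - header) / element` arithmetic as MCAST_ELEMS_PER_PG.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096UL

struct list_elem {
    unsigned char mac[6];
    struct list_elem *link;   /* stand-in for struct list_head */
};

struct elem_group {
    struct elem_group *next;  /* stand-in for mcast_group_link */
    struct list_elem elems[]; /* flexible array fills the page */
};

#define ELEMS_PER_PG \
    ((PAGE_SIZE - sizeof(struct elem_group)) / sizeof(struct list_elem))

/* Allocate room for n elements, one page per group, prepending each new
 * group to *head. Returns the number of pages used, or -1 on failure
 * (caller would then free the partially built list, as the driver does).
 */
static long alloc_groups(struct elem_group **head, unsigned long n)
{
    unsigned long i;
    long pages = 0;

    for (i = 0; i < n; i++) {
        if (i % ELEMS_PER_PG == 0) { /* current page is full (or first) */
            struct elem_group *g = calloc(1, PAGE_SIZE);
            if (!g)
                return -1;
            g->next = *head;
            *head = g;
            pages++;
        }
    }
    return pages;
}
```

With ~500 addresses this needs only a couple of order-0 pages, which is the whole point: each allocation is a single page, so no high-order GFP_ATOMIC request is ever made.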
[PATCH v2 net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
From: Jason Baron <jba...@akamai.com>

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters. For example,
we have seen order 2 page allocation failures with ~500 multicast
addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Yuval Mintz <yuval.mi...@qlogic.com>
Cc: Ariel Elior <ariel.el...@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 123 +++++++++++++++++--------
 1 file changed, 86 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..4947a9cbf0c1 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
 	int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+	struct bnx2x_mcast_bin_elem bin_elem;
+	struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+	struct list_head mcast_group_link;
+	union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+	 sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
 	struct list_head link;
+	struct list_head group_head;
 	int type; /* BNX2X_MCAST_CMD_X */
 	union {
 		struct list_head macs_head;
@@ -2638,16 +2653,29 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
 	return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
				      struct bnx2x_mcast_elem_group,
				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		free_page((unsigned long)current_mcast_group);
+	}
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 				   struct bnx2x_mcast_obj *o,
 				   struct bnx2x_mcast_ramrod_params *p,
 				   enum bnx2x_mcast_cmd cmd)
 {
-	int total_sz;
 	struct bnx2x_pending_mcast_cmd *new_cmd;
-	struct bnx2x_mcast_mac_elem *cur_mac = NULL;
 	struct bnx2x_mcast_list_elem *pos;
-	int macs_list_len = 0, macs_list_len_size;
+	struct bnx2x_mcast_elem_group *elem_group;
+	struct bnx2x_mcast_mac_elem *mac_elem;
+	int total_elems = 0, macs_list_len = 0, offset = 0;
 
 	/* When adding MACs we'll need to store their values */
 	if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2685,61 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 	if (!p->mcast_list_len)
 		return 0;
 
-	/* For a set command, we need to allocate sufficient memory for all
-	 * the bins, since we can't analyze at this point how much memory would
-	 * be required.
-	 */
-	macs_list_len_size = macs_list_len *
-		sizeof(struct bnx2x_mcast_mac_elem);
-	if (cmd == BNX2X_MCAST_CMD_SET) {
-		int bin_size = BNX2X_MCAST_BINS_NUM *
-			sizeof(struct bnx2x_mcast_bin_elem);
-
-		if (bin_size > macs_list_len_size)
-			macs_list_len_size = bin_size;
-	}
-
-	total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
 	/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-	new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+	new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
 	if (!new_cmd)
 		return -ENOMEM;
 
-	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
-	   cmd, macs_list_len);
-
 	INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+	INIT_LIST_HEAD(&new_cmd->group_head);
 	new_cmd->type = cmd;
 	new_cmd->done = false;
 
+	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
+	   cmd, macs_list_len);
+
 	switch (cmd) {
 	case BNX2X_MCAST_CMD_ADD:
 	case BNX2X_MCAST_CMD_SET:
-		cur_mac = (struct bnx2x_mcast_mac_elem *)
-			((u8 *)new_cmd + sizeof(*new_cmd));
-
-		/* Push the MACs of the current command into the pending command
-		 * MACs list: FIFO
+		/* For a set command, we need to allocate sufficient memory for
+		 * all the bins, since we can't analyze at this point how much
+		 * memory would be required.
 		 */
+
[PATCH v2 net-next 0/2] bnx2x: page allocation failure
Hi,

While configuring ~500 multicast addrs, we ran into high order page
allocation failures. They don't need to be high order, and thus I'm
proposing to split them into at most PAGE_SIZE allocations.

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: GW
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206] 88012d873a78 8267f7c7 c020
[1207325.736754] 88012d873b08 8212f8e0 fffc0003
[1207325.745301] 88041ffecd80 88040030 0002 c0206800da13
[1207325.753846] Call Trace:
[1207325.756789] [] dump_stack+0x4d/0x63
[1207325.762426] [] warn_alloc_failed+0xe0/0x130
[1207325.768756] [] ? wakeup_kswapd+0x48/0x140
[1207325.774914] [] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761] [] alloc_pages_current+0x91/0x100
[1207325.788260] [] alloc_kmem_pages+0xe/0x10
[1207325.794329] [] kmalloc_order+0x18/0x50
[1207325.800227] [] kmalloc_order_trace+0x26/0xb0
[1207325.806642] [] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404] [] __kmalloc+0x19a/0x1b0
[1207325.819142] [] bnx2x_set_rx_mode_inner+0x3d5/0x590 [bnx2x]
[1207325.827000] [] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197] [] process_one_work+0x134/0x3c0
[1207325.840522] [] worker_thread+0x121/0x460
[1207325.846585] [] ? process_one_work+0x3c0/0x3c0
[1207325.853089] [] kthread+0xc9/0xe0
[1207325.858459] [] ? notify_die+0x10/0x40
[1207325.864263] [] ? kthread_create_on_node+0x180/0x180
[1207325.871288] [] ret_from_fork+0x42/0x70
[1207325.877183] [] ? kthread_create_on_node+0x180/0x180

v2:
-make use of list_next_entry()
-only use PAGE_SIZE allocations

Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 79 +--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 123 ---
 2 files changed, 137 insertions(+), 65 deletions(-)

-- 
2.6.1
Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
On 09/20/2016 07:30 AM, David Laight wrote:
> From: Jason Baron
> Sent: 19 September 2016 19:34
> ...
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page
>> on x86. So if we want to fit 2,048 elements, we need 12 pages.
>
> If you only need to save the mcast addresses you could use a 'heap'
> that requires no overhead per entry and gives you O(log) lookup.
> 6 bytes per entry is 682 in a 4k page.
>
> 	David

Indeed, that would save space here. Based on Yuval's comments it sounds
as though he agrees that it makes sense to go beyond a page (even if we
get 682 per page as you suggest) when configuring these mac filters. So
we would then have to allocate and manage the page pointers.

Currently, there is a list_head per entry to manage the macs as a
linked list. The patch I proposed continues to use that same data
structure, so it will not add to the memory footprint; it only proposes
to break that footprint up into PAGE_SIZE chunks. So I think the change
you suggest can be viewed as an additional enhancement here. Also note
that the memory allocations here are short-lived. That is, they only
exist in memory until the NIC is re-configured.

Thanks,

-Jason
Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
On 09/20/2016 11:00 AM, Mintz, Yuval wrote:
>>> The question I rose was whether it actually makes a difference under
>>> such circumstances whether the device would actually filter those
>>> multicast addresses or be completely multicast promiscuous.
>>> e.g., whether it's significant to be filtering out multicast ingress
>>> traffic when you're already allowing 1/2 of all random multicast
>>> packets to be classified for the interface.
>>
>> Agreed, I think this is the more interesting question here. I thought
>> that we would want to make sure we are using most of the bins before
>> falling back to multicast ingress. The reason being that even if it's
>> more expensive for the NIC to do the filtering than the multicast
>> mode, it would be more than made up for by not having to drop the
>> traffic higher up the stack. So I think if we can determine the
>> percent of the bins that we want to use, we can then back into the
>> average number of filters required to get there. As I said, I thought
>> we would want to make sure we filled basically all the bins (with a
>> high probability that is) before falling back to multicast, and so I
>> threw out 2,048.
>
> AFAIK configuring multiple filters doesn't incur any performance
> penalty from the adapter side. And I agree that from 'offloading'
> perspective it's probably better to filter in HW even if the gain is
> negligible. So for the upper limit - there's not much of a reason to
> it; the only gain would be to prevent the driver from allocating
> lots-and-lots of memory temporarily for an unnecessary configuration.

Ok. We already have an upper limit to an extent with
/proc/sys/net/ipv4/igmp_max_memberships, and as posted I didn't include
one b/c of the higher-level limits already in place.

Thanks,

-Jason
Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
On 09/20/2016 03:41 AM, Mintz, Yuval wrote:
>>>> Currently, we can have high order page allocations that specify
>>>> GFP_ATOMIC when configuring multicast MAC address filters. For
>>>> example, we have seen order 2 page allocation failures with ~500
>>>> multicast addresses configured.
>>>>
>>>> Convert the allocation for the pending list to be done in PAGE_SIZE
>>>> increments.
>>>>
>>>> Signed-off-by: Jason Baron <jba...@akamai.com>
>>>
>>> While I appreciate the effort, I wonder whether it's worth it:
>>> - The hardware [even in its newer generation] provides an
>>>   approximate based classification [I.e., hashed] with 256 bins.
>>>   When configuring 500 multicast addresses, one can argue the
>>>   difference between multicast-promisc mode and actual configuration
>>>   is insignificant.
>>
>> With 256 bins, I think it takes close to: 256*lg(256) or 2,048
>> multicast addresses to expect to have all bins have at least one
>> hash, assuming a uniform distribution of the hashes.
>>
>>> Perhaps the easier-to-maintain alternative would simply be to
>>> determine the maximal number of multicast addresses that can be
>>> configured using a single PAGE, and if in need of more than that
>>> simply move into multicast-promisc.
>>
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page
>> on x86. So if we want to fit 2,048 elements, we need 12 pages.
>
> That's not exactly what I mean - let's assume you'd have problems
> allocating more than a PAGE. According to your calculation, that means
> you're already using more than 170 multicast addresses.
> I didn't bother trying to solve the combinatorics question of how many
> bins you'd use on average for 170 filters given there are only 256
> bins, but that would be a significant portion.

On average for 170 filters, I get an average of 124 bins in use out of
256 possible bins.

> The question I rose was whether it actually makes a difference under
> such circumstances whether the device would actually filter those
> multicast addresses or be completely multicast promiscuous.
> e.g., whether it's significant to be filtering out multicast ingress
> traffic when you're already allowing 1/2 of all random multicast
> packets to be classified for the interface.

Agreed, I think this is the more interesting question here. I thought
that we would want to make sure we are using most of the bins before
falling back to multicast ingress. The reason being that even if it's
more expensive for the NIC to do the filtering than the multicast mode,
it would be more than made up for by not having to drop the traffic
higher up the stack. So I think if we can determine the percent of the
bins that we want to use, we can then back into the average number of
filters required to get there. As I said, I thought we would want to
make sure we filled basically all the bins (with a high probability
that is) before falling back to multicast, and so I threw out 2,048.

Thanks,

-Jason
Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
On 09/18/2016 06:25 AM, Mintz, Yuval wrote:
>> Currently, we can have high order page allocations that specify
>> GFP_ATOMIC when configuring multicast MAC address filters. For
>> example, we have seen order 2 page allocation failures with ~500
>> multicast addresses configured.
>>
>> Convert the allocation for the pending list to be done in PAGE_SIZE
>> increments.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>
> While I appreciate the effort, I wonder whether it's worth it:
> - The hardware [even in its newer generation] provides an
>   approximate based classification [I.e., hashed] with 256 bins.
>   When configuring 500 multicast addresses, one can argue the
>   difference between multicast-promisc mode and actual configuration
>   is insignificant.

With 256 bins, I think it takes close to: 256*lg(256) or 2,048
multicast addresses to expect to have all bins have at least one hash,
assuming a uniform distribution of the hashes.

> Perhaps the easier-to-maintain alternative would simply be to
> determine the maximal number of multicast addresses that can be
> configured using a single PAGE, and if in need of more than that
> simply move into multicast-promisc.

sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page on
x86. So if we want to fit 2,048 elements, we need 12 pages.

> - While GFP_ATOMIC is required in this flow due to the fact it's
>   being called from sleepless context, I do believe this is mostly a
>   remnant - it's possible that by slightly changing the locking scheme
>   we can have the configuration done from sleepless context and
>   simply switch to GFP_KERNEL instead.

Ok, but even if it's GFP_KERNEL, I think it's still undesirable to do
large page order allocations (unless of course it's converted to a
single page, but I'm not sure that makes sense, as mentioned).

> Regarding the patch itself, only comment I have:
>
> +		elem_group = (struct bnx2x_mcast_elem_group *)
> +			     elem_group->mcast_group_link.next;
>
> Let's use list_next_entry() instead.

Yes, agreed.

I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
enforce some maximum number of elements (perhaps 2,048 based on the
above math) for the !CHIP_IS_E1() case on top of what I already posted.

Thanks,

-Jason
[PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters. For example,
we have seen order 2 page allocation failures with ~500 multicast
addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 131 ++---
 1 file changed, 94 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..6db8dd252d7c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
 	int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+	struct bnx2x_mcast_bin_elem bin_elem;
+	struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+	struct list_head mcast_group_link;
+	union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+	 sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
 	struct list_head link;
+	struct list_head group_head;
 	int type; /* BNX2X_MCAST_CMD_X */
 	union {
 		struct list_head macs_head;
@@ -2638,16 +2653,30 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
 	return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+						       struct bnx2x_mcast_elem_group,
+						       mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		kfree(current_mcast_group);
+	}
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 				   struct bnx2x_mcast_obj *o,
 				   struct bnx2x_mcast_ramrod_params *p,
 				   enum bnx2x_mcast_cmd cmd)
 {
-	int total_sz;
 	struct bnx2x_pending_mcast_cmd *new_cmd;
-	struct bnx2x_mcast_mac_elem *cur_mac = NULL;
 	struct bnx2x_mcast_list_elem *pos;
-	int macs_list_len = 0, macs_list_len_size;
+	struct bnx2x_mcast_elem_group *elem_group;
+	struct bnx2x_mcast_mac_elem *mac_elem;
+	int i = 0, offset = 0, macs_list_len = 0;
+	int total_elems, alloc_size;
 
 	/* When adding MACs we'll need to store their values */
 	if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2686,68 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 	if (!p->mcast_list_len)
 		return 0;
 
-	/* For a set command, we need to allocate sufficient memory for all
-	 * the bins, since we can't analyze at this point how much memory would
-	 * be required.
-	 */
-	macs_list_len_size = macs_list_len *
-		sizeof(struct bnx2x_mcast_mac_elem);
-	if (cmd == BNX2X_MCAST_CMD_SET) {
-		int bin_size = BNX2X_MCAST_BINS_NUM *
-			sizeof(struct bnx2x_mcast_bin_elem);
-
-		if (bin_size > macs_list_len_size)
-			macs_list_len_size = bin_size;
-	}
-
-	total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
 	/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-	new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+	new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
 	if (!new_cmd)
 		return -ENOMEM;
 
-	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
-	   cmd, macs_list_len);
-
 	INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+	INIT_LIST_HEAD(&new_cmd->group_head);
 	new_cmd->type = cmd;
 	new_cmd->done = false;
 
+	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
+	   cmd, macs_list_len);
+
 	switch (cmd) {
 	case BNX2X_MCAST_CMD_ADD:
 	case BNX2X_MCAST_CMD_SET:
-		cur_mac = (struct bnx2x_mcast_mac_elem *)
-			  ((u8 *)new_cmd + sizeof(*new_cmd));
-
-		/* Push the MACs of the current command into the pending command
-		 * MACs list: FIFO
+		/* For a set command, we need to allocate sufficient memory for
+		 * all the bins, since we can't analyze at this point how much
+		 * memory would be required.
 		 */
+		total_elems = macs_list_len;
+		if (cmd == BNX2X_MCAST_CMD_SET) {
+
[PATCH net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters. For example,
we have seen order 2 page allocation failures with ~500 multicast
addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 85 
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..9f5b2d94e4df 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,70 @@ static int bnx2x_close(struct net_device *dev)
 	return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
-				      struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-	int mc_count = netdev_mc_count(bp->dev);
-	struct bnx2x_mcast_list_elem *mc_mac =
-		kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-	struct netdev_hw_addr *ha;
+	struct list_head mcast_group_link;
+	struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-	if (!mc_mac) {
-		BNX2X_ERR("Failed to allocate mc MAC list\n");
-		return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+	 sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+						       struct bnx2x_mcast_list_elem_group,
+						       mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		kfree(current_mcast_group);
 	}
+}
 
-	INIT_LIST_HEAD(&p->mcast_list);
+static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+				      struct bnx2x_mcast_ramrod_params *p,
+				      struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem *mc_mac;
+	struct netdev_hw_addr *ha;
+	struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+	int mc_count = netdev_mc_count(bp->dev);
+	int i = 0, offset = 0, alloc_size;
 
+	INIT_LIST_HEAD(&p->mcast_list);
 	netdev_for_each_mc_addr(ha, bp->dev) {
+		if (!offset) {
+			if ((mc_count - i) < MCAST_ELEMS_PER_PG)
+				alloc_size = sizeof(struct list_head) +
+					(sizeof(struct bnx2x_mcast_list_elem) *
+					 (mc_count - i));
+			else
+				alloc_size = PAGE_SIZE;
+			current_mcast_group = kmalloc(alloc_size, GFP_ATOMIC);
+			if (!current_mcast_group) {
+				bnx2x_free_mcast_macs_list(mcast_group_list);
+				BNX2X_ERR("Failed to allocate mc MAC list\n");
+				return -ENOMEM;
+			}
+			list_add(&current_mcast_group->mcast_group_link,
+				 mcast_group_list);
+		}
+		mc_mac = &current_mcast_group->mcast_elems[offset];
 		mc_mac->mac = bnx2x_mc_addr(ha);
 		list_add_tail(&mc_mac->link, &p->mcast_list);
-		mc_mac++;
+		offset++;
+		if (offset == MCAST_ELEMS_PER_PG) {
+			i += offset;
+			offset = 0;
+		}
 	}
-
 	p->mcast_list_len = mc_count;
-
 	return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-	struct bnx2x_mcast_ramrod_params *p)
-{
-	struct bnx2x_mcast_list_elem *mc_mac =
-		list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-				 link);
-
-	WARN_ON(!mc_mac);
-	kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12674,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+	LIST_HEAD(mcast_group_list);
 	struct net_device *dev = bp->dev;
 	struct bnx2x_mcast_ramrod_params rparam = {NULL};
 	int rc = 0;
@@ -12662,7 +12690,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
 	/* then, configure a new MACs list */
 	if (netdev_mc_count(dev)) {
-		rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+		rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
[PATCH net-next 0/2] bnx2x: page allocation failure
Hi,

While configuring ~500 multicast addrs, we ran into high order page
allocation failures. They don't need to be high order, and thus I'm
proposing to split them into at most PAGE_SIZE allocations.

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: GW
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206] 88012d873a78 8267f7c7 c020
[1207325.736754] 88012d873b08 8212f8e0 fffc0003
[1207325.745301] 88041ffecd80 88040030 0002 c0206800da13
[1207325.753846] Call Trace:
[1207325.756789] [] dump_stack+0x4d/0x63
[1207325.762426] [] warn_alloc_failed+0xe0/0x130
[1207325.768756] [] ? wakeup_kswapd+0x48/0x140
[1207325.774914] [] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761] [] alloc_pages_current+0x91/0x100
[1207325.788260] [] alloc_kmem_pages+0xe/0x10
[1207325.794329] [] kmalloc_order+0x18/0x50
[1207325.800227] [] kmalloc_order_trace+0x26/0xb0
[1207325.806642] [] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404] [] __kmalloc+0x19a/0x1b0
[1207325.819142] [] bnx2x_set_rx_mode_inner+0x3d5/0x590 [bnx2x]
[1207325.827000] [] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197] [] process_one_work+0x134/0x3c0
[1207325.840522] [] worker_thread+0x121/0x460
[1207325.846585] [] ? process_one_work+0x3c0/0x3c0
[1207325.853089] [] kthread+0xc9/0xe0
[1207325.858459] [] ? notify_die+0x10/0x40
[1207325.864263] [] ? kthread_create_on_node+0x180/0x180
[1207325.871288] [] ret_from_fork+0x42/0x70
[1207325.877183] [] ? kthread_create_on_node+0x180/0x180

Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 85 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 131 ---
 2 files changed, 151 insertions(+), 65 deletions(-)

-- 
2.6.1
Re: strange Mac OSX RST behavior
Hi,

After looking at this further we found that there is actually a rate
limit on 'rst' packets sent by OSX on a closed socket. It's set to 250
per second and controlled via net.inet.icmp.icmplim. Increasing that
limit resolves the issue, but the default is apparently 250.

Thanks,

-Jason

On 07/01/2016 02:16 PM, One Thousand Gnomes wrote:
>> yes, we do in fact see a POLLRDHUP from the FIN in this case and
>> read of zero, but we still have more data to write to the socket, and
>> b/c the RST is dropped here, the socket stays in TIME_WAIT until
>> things eventually time out...
>
> After the FIN when you send/retransmit your next segment do you then
> get a valid RST back from the Mac end?
>
> Alan
[PATCH net] tcp: enable per-socket rate limiting of all 'challenge acks'
From: Jason Baron <jba...@akamai.com>

The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:

commit f2b2c582e824 ("tcp: mitigate ACK loops for connections as
tcp_sock")

And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.

Since we have the global tcp_challenge_ack_limit, this patch allows
tcp_challenge_ack_limit to be set to a large value and effectively rely
on the per-socket limit, or tcp_challenge_ack_limit to be set to a
lower value and still prevent a single connection from consuming the
entire challenge ack quota. It further moves in the direction of
eliminating the global limit at some point, as Eric Dumazet has
suggested.

This is a follow-up to:
Subject: tcp: make challenge acks less predictable

Cc: Eric Dumazet <eduma...@google.com>
Cc: David S. Miller <da...@davemloft.net>
Cc: Neal Cardwell <ncardw...@google.com>
Cc: Yuchung Cheng <ych...@google.com>
Cc: Yue Cao <ycao...@ucr.edu>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 39 ++-
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fda084462ed9..f9f9e375d7de 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3424,6 +3424,23 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
 	return flag;
 }
 
+static bool __tcp_oow_rate_limited(struct net *net, int mib_idx,
+				   u32 *last_oow_ack_time)
+{
+	if (*last_oow_ack_time) {
+		s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
+
+		if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
+			NET_INC_STATS(net, mib_idx);
+			return true;	/* rate-limited: don't send yet! */
+		}
+	}
+
+	*last_oow_ack_time = tcp_time_stamp;
+
+	return false;	/* not rate-limited: go ahead, send dupack now! */
+}
+
 /* Return true if we're currently rate-limiting out-of-window ACKs and
  * thus shouldn't send a dupack right now. We rate-limit dupacks in
  * response to out-of-window SYNs or ACKs to mitigate ACK loops or DoS
@@ -3437,21 +3454,9 @@ bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
 	/* Data packets without SYNs are not likely part of an ACK loop. */
 	if ((TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq) &&
 	    !tcp_hdr(skb)->syn)
-		goto not_rate_limited;
-
-	if (*last_oow_ack_time) {
-		s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
-
-		if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
-			NET_INC_STATS(net, mib_idx);
-			return true;	/* rate-limited: don't send yet! */
-		}
-	}
-
-	*last_oow_ack_time = tcp_time_stamp;
+		return false;
 
-not_rate_limited:
-	return false;	/* not rate-limited: go ahead, send dupack now! */
+	return __tcp_oow_rate_limited(net, mib_idx, last_oow_ack_time);
 }
 
 /* RFC 5961 7 [ACK Throttling] */
@@ -3464,9 +3469,9 @@ static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb)
 	u32 count, now;
 
 	/* First check our per-socket dupack rate limit. */
-	if (tcp_oow_rate_limited(sock_net(sk), skb,
-				 LINUX_MIB_TCPACKSKIPPEDCHALLENGE,
-				 &tp->last_oow_ack_time))
+	if (__tcp_oow_rate_limited(sock_net(sk),
+				   LINUX_MIB_TCPACKSKIPPEDCHALLENGE,
+				   &tp->last_oow_ack_time))
 		return;
 
 	/* Then check host-wide RFC 5961 rate limit. */
-- 
2.6.1
Re: strange Mac OSX RST behavior
On 07/01/2016 02:16 PM, One Thousand Gnomes wrote:
>> yes, we do in fact see a POLLRDHUP from the FIN in this case and
>> read of zero, but we still have more data to write to the socket, and
>> b/c the RST is dropped here, the socket stays in TIME_WAIT until
>> things eventually time out...
>
> After the FIN when you send/retransmit your next segment do you then
> get a valid RST back from the Mac end?
>
> Alan
>

No, we only get the single RST after the FIN from the Mac side, which
is dropped. I would have expected the RST from the Mac after the
retransmits, but we don't see any further transmits from the Mac. And
the linux socket stays in CLOSE-WAIT (I mistakenly said TIME_WAIT
above). For reference, I put the packet exchange in my initial mail.

Thanks,

-Jason
Re: strange Mac OSX RST behavior
On 07/01/2016 01:08 PM, Rick Jones wrote:
> On 07/01/2016 08:10 AM, Jason Baron wrote:
>> I'm wondering if anybody else has run into this...
>>
>> On Mac OSX 10.11.5 (latest version), we have found that when tcp
>> connections are abruptly terminated (via ^C), a FIN is sent followed
>> by an RST packet.
>
> That just seems, well, silly. If the client application wants to use
> abortive close (sigh..) it should do so, there shouldn't be this
> little-bit-pregnant, correct close initiation (FIN) followed by a RST.
>
>> The RST is sent with the same sequence number as the
>> FIN, and thus dropped since the stack only accepts RST packets
>> matching rcv_nxt (RFC 5961). This could also be resolved if Mac OSX
>> replied with an RST on the closed socket, but it appears that it does
>> not.
>>
>> The workaround here is then to reset the connection, if the RST is
>> equal to rcv_nxt - 1, if we have already received a FIN.
>>
>> The RST attack surface is limited b/c we only accept the RST after
>> we've accepted a FIN and have not previously sent a FIN and received
>> back the corresponding ACK. In other words RST is only accepted in
>> the tcp states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.
>>
>> I'm interested if anybody else has run into this issue. Its
>> problematic since it takes up server resources for sockets sitting in
>> TCP_CLOSE_WAIT.
>
> Isn't the server application expected to act on the read return of
> zero (which is supposed to be) triggered by the receipt of the FIN
> segment?
>

yes, we do in fact see a POLLRDHUP from the FIN in this case and
read of zero, but we still have more data to write to the socket, and
b/c the RST is dropped here, the socket stays in TIME_WAIT until
things eventually time out...

Thanks,

-Jason

> rick jones
>
>> We are also in the process of contacting Apple to see what can be
>> done here...workaround patch is below.
strange Mac OSX RST behavior
I'm wondering if anybody else has run into this...

On Mac OSX 10.11.5 (latest version), we have found that when tcp
connections are abruptly terminated (via ^C), a FIN is sent followed
by an RST packet. The RST is sent with the same sequence number as the
FIN, and thus dropped since the stack only accepts RST packets matching
rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with
an RST on the closed socket, but it appears that it does not.

The workaround here is then to reset the connection if the RST is equal
to rcv_nxt - 1 and we have already received a FIN.

The RST attack surface is limited b/c we only accept the RST after
we've accepted a FIN and have not previously sent a FIN and received
back the corresponding ACK. In other words, the RST is only accepted in
the tcp states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.

I'm interested if anybody else has run into this issue. It's
problematic since it takes up server resources for sockets sitting in
TCP_CLOSE_WAIT. We are also in the process of contacting Apple to see
what can be done here...workaround patch is below.
Here is the sequence from wireshark, mac osx is client sending the fin:

84581 14.752908 -> TCP 66 49896 > http [FIN, ACK] Seq=673257230 Ack=924722210 Win=131072 Len=0 TSval=622455547 TSecr=346246436
84984 14.788056 -> TCP 60 49896 > http [RST] Seq=673257230 Win=0 Len=0
84985 14.788061 -> TCP 66 http > 49896 [ACK] Seq=924739994 Ack=673257231 Win=28960 Len=0 TSval=346246723 TSecr=622455547

followed by a bunch of retransmits from server:

85138 14.994217 -> TCP 1054 [TCP segment of a reassembled PDU]
85237 15.348217 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85337 16.056224 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85436 17.472225 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85540 20.304222 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85644 25.968218 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85745 37.280230 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85845 59.904235 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]

Thanks,

-Jason
---
 net/ipv4/tcp_input.c | 25 +++2
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff97523..b3c55b91140c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5155,6 +5155,25 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/*
+ * Mac OSX 10.11.5 can send a FIN followed by a RST where the RST
+ * has the same sequence number as the FIN. This is not compliant
+ * with RFC 5961, but ends up in a number of sockets tied up mostly
+ * in TCP_CLOSE_WAIT. The rst attack surface is limited b/c we only
+ * accept the RST after we've accepted a FIN and have not previously
+ * sent a FIN and received back the corresponding ACK.
+ */
+static bool tcp_fin_rst_check(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+			(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
+			(sk->sk_state == TCP_CLOSE_WAIT ||
+			 sk->sk_state == TCP_LAST_ACK ||
+			 sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5193,7 +5212,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
-		}
+		} else if (tcp_fin_rst_check(sk, skb))
+			tcp_reset(sk);
 		goto discard;
 	}
 
@@ -5206,7 +5226,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	 *     else
 	 *         Send a challenge ACK
 	 */
-	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+	    tcp_fin_rst_check(sk, skb)) {
 		rst_seq_match = true;
 	} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 		struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1
Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set
On 06/22/2016 02:51 PM, Eric Dumazet wrote:
> On Wed, 2016-06-22 at 11:43 -0700, Eric Dumazet wrote:
>> On Wed, 2016-06-22 at 14:18 -0400, Jason Baron wrote:
>>> For 1/2, getting the correct memory barrier, should I re-submit
>>> that as a separate patch?
>>
>> Are you sure a full memory barrier (smp_mb()) is needed ? Maybe
>> smp_wmb() would be enough ? (And smp_rmb() in tcp_poll() ?)
>
> Well, in tcp_poll() smp_mb__after_atomic() is fine as it follows
> set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>
> (although we might add a comment why we should keep
> sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk) before the set_bit() !)
>
> But presumably smp_wmb() would be enough in tcp_check_space()

hmm, I think we need the smp_mb() there. From tcp_poll() we have:

1) set_bit(SOCK_NOSPACE, ...) (write)
2) smp_mb__after_atomic();
3) if (sk_stream_is_writeable(sk)) (read)

while in tcp_check_space() it's:

1) the state that sk_stream_is_writeable() cares about (write)
2) smp_mb();
3) if (sk->sk_socket && test_bit(SOCK_NOSPACE,...) (read)

So if we can show that there are sufficient barriers for #1 (directly
above), maybe it can be down-graded or eliminated. But it would still
seem somewhat fragile. Note I didn't observe any missing wakeups here,
but I just wanted to make sure we didn't miss any, since they can be
quite hard to debug.

Thanks,

-Jason
Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set
On 06/22/2016 01:34 PM, Eric Dumazet wrote: On Wed, 2016-06-22 at 11:32 -0400, Jason Baron wrote: From: Jason Baron <jba...@akamai.com> When SO_SNDBUF is set and we are under tcp memory pressure, the effective write buffer space can be much lower than what was set using SO_SNDBUF. For example, we may have set the buffer to 100kb, but we may only be able to write 10kb. In this scenario poll()/select()/epoll() are going to continuously return POLLOUT, followed by -EAGAIN from write(), and thus result in a tight loop. Note that epoll in edge-triggered mode does not have this issue since it only triggers once data has been ack'd. There is no issue here when SO_SNDBUF is not set, since the tcp layer will auto tune the sk->sndbuf. Still, generating one POLLOUT event per incoming ACK will not please epoll() users in edge-trigger mode. The host is under global memory pressure, so we probably want to drain socket queues _and_ reduce cpu pressure. A strategy to ensure all sockets converge to small amounts ASAP is simply the best answer. Letting big SO_SNDBUF offenders hog memory while their queue is big is not going to help sockets that cannot get ACKs (elephants get more ACKs than mice, so they have more chance to succeed in their new allocations). Your patch adds a lot of complex logic in tcp_sendmsg() and tcp_sendpage(). I would prefer a simpler patch like: Ok, fair enough. I'm going to assume that you will submit this as a formal patch. For 1/2, getting the correct memory barrier, should I re-submit that as a separate patch? Thanks, -Jason
[PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set
From: Jason Baron <jba...@akamai.com> When SO_SNDBUF is set and we are under tcp memory pressure, the effective write buffer space can be much lower than what was set using SO_SNDBUF. For example, we may have set the buffer to 100kb, but we may only be able to write 10kb. In this scenario poll()/select()/epoll() are going to continuously return POLLOUT, followed by -EAGAIN from write(), and thus result in a tight loop. Note that epoll in edge-triggered mode does not have this issue since it only triggers once data has been ack'd. There is no issue here when SO_SNDBUF is not set, since the tcp layer will auto tune the sk->sndbuf. Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the socket when we have a short write due to memory pressure. By then testing for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a non-zero amount of data has been ack'd. In a previous approach: http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a new field in 'struct sock' to solve this issue, but it's undesirable to add bloat to 'struct sock'. We could also address this issue by waiting for the buffer to become completely empty, but that may reduce throughput since the write buffer would be empty while waiting for subsequent writes. This change brings us in line with the current epoll edge-trigger behavior for poll()/select() and epoll level-trigger. We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and SOCK_NOSPACE is set as well (in sk_stream_wait_memory()). I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the server side, to induce tcp memory pressure. A single server thread reduced its cpu usage from 100% to 19%, while maintaining the same level of throughput. Note: This is not a new issue; it has been this way for some time.
Cc: Eric Dumazet <eric.duma...@gmail.com> Signed-off-by: Jason Baron <jba...@akamai.com> --- include/net/sock.h | 6 ++ net/ipv4/tcp.c | 26 +++--- net/ipv4/tcp_input.c | 3 ++- 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 649d2a8c17fc..616e8e1a5d5d 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -741,6 +741,7 @@ enum sock_flags { SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */ SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */ SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ + SOCK_SHORT_WRITE, /* Couldn't fill sndbuf due to memory pressure */ }; #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)) @@ -1114,6 +1115,11 @@ static inline bool sk_stream_is_writeable(const struct sock *sk) sk_stream_memory_free(sk); } +static inline void sk_set_short_write(struct sock *sk) +{ + if (sk->sk_wmem_queued > 0 && sk_stream_is_writeable(sk)) + sock_set_flag(sk, SOCK_SHORT_WRITE); +} static inline bool sk_has_memory_pressure(const struct sock *sk) { diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 5c7ed147449c..4577a90d7d87 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -517,7 +517,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait) mask |= POLLIN | POLLRDNORM; if (!(sk->sk_shutdown & SEND_SHUTDOWN)) { - if (sk_stream_is_writeable(sk)) { + if (sk_stream_is_writeable(sk) && + !sock_flag(sk, SOCK_SHORT_WRITE)) { mask |= POLLOUT | POLLWRNORM; } else { /* send SIGIO later */ sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk); @@ -529,7 +530,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait) * pairs with the input side. 
*/ smp_mb__after_atomic(); - if (sk_stream_is_writeable(sk)) + if (sk_stream_is_writeable(sk) && + !sock_flag(sk, SOCK_SHORT_WRITE)) mask |= POLLOUT | POLLWRNORM; } } else @@ -917,8 +919,10 @@ new_segment: skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, skb_queue_empty(&sk->sk_write_queue)); - if (!skb) + if (!skb) { + sk_set_short_write(sk); goto wait_for_memory; + } skb_entail(sk, skb); copy = size_goal; @@ -933,8 +937,10 @@ new_segment:
[PATCH v2 net-next 1/2] tcp: replace smp_mb__after_atomic() with smp_mb() in tcp_poll()
From: Jason Baron <jba...@akamai.com> sock_reset_flag() maps to __clear_bit(), not the atomic version clear_bit(), hence we need an smp_mb() there; smp_mb__after_atomic() is not sufficient. Fixes: 3c7151275c0c ("tcp: add memory barriers to write space paths") Cc: Eric Dumazet <eric.duma...@gmail.com> Signed-off-by: Jason Baron <jba...@akamai.com> --- net/ipv4/tcp_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 94d4aff97523..3ba526ecdeb9 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4987,7 +4987,7 @@ static void tcp_check_space(struct sock *sk) if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) { sock_reset_flag(sk, SOCK_QUEUE_SHRUNK); /* pairs with tcp_poll() */ - smp_mb__after_atomic(); + smp_mb(); if (sk->sk_socket && test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) tcp_new_space(sk); -- 2.6.1
[PATCH v2 net-next 0/2] tcp: reduce cpu usage when SO_SNDBUF is set
Hi, I've added a patch to upgrade smp_mb__after_atomic() to a smp_mb() as Eric Dumazet pointed out. Also combined the clearing of SOCK_SHORT_WRITE with SOCK_QUEUE_SHRUNK. Thanks, -Jason v2: -upgrade smp_mb__after_atomic to smp_mb() in tcp_poll() -combine clear of SOCK_SHORT_WRITE with SOCK_QUEUE_SHRUNK Jason Baron (2): tcp: replace smp_mb__after_atomic() with smp_mb() in tcp_poll() tcp: reduce cpu usage when SO_SNDBUF is set include/net/sock.h | 6 ++ net/ipv4/tcp.c | 26 +++--- net/ipv4/tcp_input.c | 5 +++-- 3 files changed, 28 insertions(+), 9 deletions(-) -- 2.6.1
Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set
On 06/20/2016 06:29 PM, Eric Dumazet wrote: > On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote: >> From: Jason Baron <jba...@akamai.com> >> >> When SO_SNDBUF is set and we are under tcp memory pressure, the effective >> write buffer space can be much lower than what was set using SO_SNDBUF. For >> example, we may have set the buffer to 100kb, but we may only be able to >> write 10kb. In this scenario poll()/select()/epoll(), are going to >> continuously return POLLOUT, followed by -EAGAIN from write(), and thus >> result in a tight loop. Note that epoll in edge-triggered does not have >> this issue since it only triggers once data has been ack'd. There is no >> issue here when SO_SNDBUF is not set, since the tcp layer will auto tune >> the sk->sndbuf. >> >> Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the >> socket when we have a short write due to memory pressure. By then testing >> for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a >> non-zero amount of data has been ack'd. In a previous approach: >> http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a >> new field in 'struct sock' to solve this issue, but it's undesirable to add >> bloat to 'struct sock'. We also could address this issue, by waiting for >> the buffer to become completely empty, but that may reduce throughput since >> the write buffer would be empty while waiting for subsequent writes. This >> change brings us in line with the current epoll edge-trigger behavior for >> the poll()/select() and epoll level-trigger. >> >> We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set >> SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and >> SOCK_NOSPACE is set as well (in sk_stream_wait_memory()). >> >> I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the >> server side, to induce tcp memory pressure.
A single server thread reduced >> its cpu usage from 100% to 19%, while maintaining the same level of >> throughput. >> >> Signed-off-by: Jason Baron <jba...@akamai.com> >> --- > > > When this bug was added, and by which commit ? > I think we have always had this behavior, I've seen it at least back to 3.10. It requires SO_SNDBUF, lots of sockets and memory pressure... That said it's quite easy to reproduce. > This looks serious, but your patch looks way too complex. > > This SOCK_SHORT_WRITE bit constantly cleared in sk_stream_write_space() > is expensive, because it dirties sk_flags which is not supposed to be > written often. This field is located in a read mostly cache line. > > I believe my suggestion was not to add a per socket bit, > but have a global state like tcp_memory_pressure. (or reuse it) > > I.e., when under global memory pressure, tcp_poll() should apply a > strategy to give POLLOUT only on sockets having less than 4K > (tcp_wmem[0]) of memory consumed in their write queue. > > > > ok, I think you're saying to just add: if (sk_under_memory_pressure(sk) && sk->sk_wmem_queued > tcp_wmem[0]) return false; to the beginning of sk_stream_is_writeable(). My concern is that it then allows the sndbuf to basically completely empty before we are triggered to write again. The patch I presented attempts to write again as soon as any new space is available for the flow. I also didn't necessarily want to affect all flows. If so_sndbuf is not set, we generally don't need to change the wakeup behavior b/c the sndbuf shrinks to the point where the normal write wakeup does not cause these tight poll()/write() loops, where we don't make any progress. We also don't need to change the behavior if we are under memory pressure but have written enough (filled more than 2/3 of the buffer), such that the normal wakeups again would work fine. So we could incorporate some of this logic into the above check, but it becomes more complex.
In my patch, I could replace the: sock_reset_flag(sk, SOCK_QUEUE_SHRUNK); in tcp_check_space() with something like: sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) | (1UL << SOCK_SHORT_WRITE)); Since we are already writing to sk_flags there this should have very minimal overhead. And then remove the clear in sk_stream_write_space(). Thanks, -Jason
[PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set
From: Jason Baron <jba...@akamai.com> When SO_SNDBUF is set and we are under tcp memory pressure, the effective write buffer space can be much lower than what was set using SO_SNDBUF. For example, we may have set the buffer to 100kb, but we may only be able to write 10kb. In this scenario poll()/select()/epoll() are going to continuously return POLLOUT, followed by -EAGAIN from write(), and thus result in a tight loop. Note that epoll in edge-triggered mode does not have this issue since it only triggers once data has been ack'd. There is no issue here when SO_SNDBUF is not set, since the tcp layer will auto tune the sk->sndbuf. Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the socket when we have a short write due to memory pressure. By then testing for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a non-zero amount of data has been ack'd. In a previous approach: http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a new field in 'struct sock' to solve this issue, but it's undesirable to add bloat to 'struct sock'. We could also address this issue by waiting for the buffer to become completely empty, but that may reduce throughput since the write buffer would be empty while waiting for subsequent writes. This change brings us in line with the current epoll edge-trigger behavior for poll()/select() and epoll level-trigger. We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and SOCK_NOSPACE is set as well (in sk_stream_wait_memory()). I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the server side, to induce tcp memory pressure. A single server thread reduced its cpu usage from 100% to 19%, while maintaining the same level of throughput.
Signed-off-by: Jason Baron <jba...@akamai.com> --- include/net/sock.h | 6 ++ net/core/stream.c | 1 + net/ipv4/tcp.c | 26 +++--- 3 files changed, 26 insertions(+), 7 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 649d2a8c17fc..616e8e1a5d5d 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -741,6 +741,7 @@ enum sock_flags { SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */ SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */ SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ + SOCK_SHORT_WRITE, /* Couldn't fill sndbuf due to memory pressure */ }; #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)) @@ -1114,6 +1115,11 @@ static inline bool sk_stream_is_writeable(const struct sock *sk) sk_stream_memory_free(sk); } +static inline void sk_set_short_write(struct sock *sk) +{ + if (sk->sk_wmem_queued > 0 && sk_stream_is_writeable(sk)) + sock_set_flag(sk, SOCK_SHORT_WRITE); +} static inline bool sk_has_memory_pressure(const struct sock *sk) { diff --git a/net/core/stream.c b/net/core/stream.c index 159516a11b7e..ead768b1e95d 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk) if (sk_stream_is_writeable(sk) && sock) { clear_bit(SOCK_NOSPACE, &sock->flags); + sock_reset_flag(sk, SOCK_SHORT_WRITE); rcu_read_lock(); wq = rcu_dereference(sk->sk_wq); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 5c7ed147449c..254c7bb6d3d5 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -517,7 +517,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait) mask |= POLLIN | POLLRDNORM; if (!(sk->sk_shutdown & SEND_SHUTDOWN)) { - if (sk_stream_is_writeable(sk)) { + if (!sock_flag(sk, SOCK_SHORT_WRITE) && + sk_stream_is_writeable(sk)) { mask |= POLLOUT | POLLWRNORM; } else { /* send SIGIO later */ sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk); @@ -529,7 +530,8 @@ unsigned int tcp_poll(struct file *file, struct socket
*sock, poll_table *wait) * pairs with the input side. */ smp_mb__after_atomic(); - if (sk_stream_is_writeable(sk)) + if (!sock_flag(sk, SOCK_SHORT_WRITE) && + sk_stream_is_writeable(sk)) mask |= POLLOUT | POLLWRNORM; } } else @@ -917,8 +919,10 @@ new_segment: skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, skb_queue_empty(&sk->
Re: use-after-free in sctp_do_sm
On 12/04/2015 12:03 PM, Joe Perches wrote: > On Fri, 2015-12-04 at 11:47 -0500, Jason Baron wrote: >> When DYNAMIC_DEBUG is enabled we have this wrapper from >> include/linux/dynamic_debug.h: >> >> if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) >> >> >> So the compiler is not emitting the side-effects in this >> case. > > Huh? Do I misunderstand what you are writing? Yes, I wasn't terribly clear - I was trying to say that the 'side-effects', in this case the debug code and use-after-free, are hidden behind the branch. They aren't invoked unless we enable the debug statement. Thanks, -Jason > > You are testing a variable that is not generally set > so the call is not being performed in the general case, > but the compiler can not elide the code. > > If the variable was enabled via the control file, the > __dynamic_pr_debug would be performed with the > use-after-free. > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: use-after-free in sctp_do_sm
On 12/04/2015 11:12 AM, Dmitry Vyukov wrote: > On Thu, Dec 3, 2015 at 9:51 PM, Joe Perches <j...@perches.com> wrote: >> (adding lkml as this is likely better discussed there) >> >> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote: >>> On 12/03/2015 03:24 PM, Joe Perches wrote: >>>> On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote: >>>>> On 12/03/2015 03:03 PM, Joe Perches wrote: >>>>>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote: >>>>>>> On 12/03/2015 01:52 PM, Aaron Conole wrote: >>>>>>>> I think that as a minimum, the following patch should be evaluted, >>>>>>>> but am unsure to whom I should submit it (after I test): >>>>>> [] >>>>>>> Agreed - the intention here is certainly to have no side effects. It >>>>>>> looks like 'no_printk()' is used in quite a few other places that would >>>>>>> benefit from this change. So we probably want a generic >>>>>>> 'really_no_printk()' macro. >>>>>> >>>>>> https://lkml.org/lkml/2012/6/17/231 >>>>> >>>>> I don't see this in the tree. >>>> >>>> It never got applied. >>>> >>>>> Also maybe we should just convert >>>>> no_printk() to do what your 'eliminated_printk()'. >>>> >>>> Some of them at least. >>>> >>>>> So we can convert all users with this change? >>>> >>>> I don't think so, I think there are some >>>> function evaluation/side effects that are >>>> required. I believe some do hardware I/O. >>>> >>>> It'd be good to at least isolate them. >>>> >>>> I'm not sure how to find them via some >>>> automated tool/mechanism though. >>>> >>>> I asked Julia Lawall about it once in this >>>> thread: https://lkml.org/lkml/2014/12/3/696 >>>> >>> >>> Seems rather fragile to have side effects that we rely >>> upon hidden in a printk(). >> >> Yup. >> >>> Just convert them and see what breaks :) >> >> I appreciate your optimism. It's very 1995. >> Try it and see what happens. > > > But Aaron says that DYNAMIC_DEBUG is enabled in most major > distributions, and all these side-effects don't happen with > DYNAMIC_DEBUG. 
When DYNAMIC_DEBUG is enabled we have this wrapper from include/linux/dynamic_debug.h: if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) So the compiler is not emitting the side-effects in this case. > This suggests that we can make these side-effects not > happen without DYNAMIC_DEBUG as well. > Or am I missing something here? > When DYNAMIC_DEBUG is disabled we are instead replacing pr_debug() with the 'no_printk()' function as you've pointed out. We are changing this to emit no code at all: http://marc.info/?l=linux-kernel&m=144918276518878&w=2 Thanks, -Jason
Re: use-after-free in sctp_do_sm
On 12/03/2015 01:52 PM, Aaron Conole wrote: > Dmitry Vyukov writes: >> On Thu, Dec 3, 2015 at 6:02 PM, Eric Dumazet wrote: >>> On Thu, Dec 3, 2015 at 7:55 AM, Dmitry Vyukov wrote: On Thu, Dec 3, 2015 at 3:48 PM, Eric Dumazet wrote: >> >> No, I don't. But pr_debug always computes its arguments. See no_printk >> in printk.h. So this use-after-free happens for all users. > > Hmm. > > pr_debug() should be a nop unless either DEBUG or > CONFIG_DYNAMIC_DEBUG are set. > > On our production kernels, pr_debug() is a nop. > > Can you double check ? Thanks ! Why should it be a nop? The no_printk thing in printk.h pretty much explicitly makes it not a nop... > > Because it was until commit 5264f2f75d8. It also violates my reading of > the following from printk.h: > > * All of these will print unconditionally, although note that pr_debug() > * and other debug macros are compiled out unless either DEBUG is defined > * or CONFIG_DYNAMIC_DEBUG is set. > Double-checked: debug_post_sfx leads to some generated code: debug_post_sfx(); 8229f256: 48 8b 85 58 fe ff ff mov -0x1a8(%rbp),%rax 8229f25d: 48 85 c0 test %rax,%rax 8229f260: 74 24 je 8229f286 8229f262: 8b b0 a8 00 00 00 mov 0xa8(%rax),%esi 8229f268: 48 8b 85 60 fe ff ff mov -0x1a0(%rbp),%rax 8229f26f: 44 89 85 74 fe ff ff mov %r8d,-0x18c(%rbp) 8229f276: 48 8b 78 20 mov 0x20(%rax),%rdi 8229f27a: e8 71 28 01 00 callq 822b1af0 8229f27f: 44 8b 85 74 fe ff ff mov -0x18c(%rbp),%r8d return error; } 8229f286: 48 81 c4 a0 01 00 00 add $0x1a0,%rsp 8229f28d: 44 89 c0 mov %r8d,%eax 8229f290: 5b pop %rbx 8229f291: 41 5c pop %r12 8229f293: 41 5d pop %r13 8229f295: 41 5e pop %r14 8229f297: 41 5f pop %r15 8229f299: 5d pop %rbp 8229f29a: c3 retq >>> >>> This is a serious concern, because in the past we let in lots of patches >>> converting traditional > > +1 > >>> #ifdef DEBUG >>> # define some_hand_coded_ugly_debug() printk( _ >>> #else >>> # define some_hand_coded_ugly_debug() >>> #endif >>> >>> On the premise pr_debug() would be a nop.
>>> >>> It seems it is not always the case. This is a very serious problem. > > +1 > >>> We probably have hundreds of potential bugs, because few people >>> actually make sure all debugging stuff is correct, >>> like comments can be wrong because they are not updated properly as >>> time flies. >>> >>> It is definitely a nop for many cases. >>> >>> +void eric_test_pr_debug(struct sock *sk) >>> +{ >>> + if (atomic_read(&sk->sk_omem_alloc)) >>> + pr_debug("%s: optmem leakage for sock %p\n", >>> +__func__, sk); >>> +} >>> >>> -> >>> >>> 4740 : >>> 4740: e8 00 00 00 00 callq 4745 >>> 4741: R_X86_64_PC32 __fentry__-0x4 >>> 4745: 55 push %rbp >>> 4746: 8b 87 24 01 00 00 mov 0x124(%rdi),%eax // >>> atomic_read() but nothing follows >>> 474c: 48 89 e5 mov %rsp,%rbp >>> 474f: 5d pop %rbp >>> 4750: c3 retq >> >> >> >> I would expect that it is a nop when argument evaluation does not have >> side-effects. For example, for a load of a variable the compiler will most >> likely elide it (though, it does not have to elide it, because the >> load is spelled in the code, so it can also legally emit the load and >> not use the result). >> But if argument computation has side-effects (or the compiler can't prove >> otherwise), it must emit code. It must emit code for function calls >> when the function is defined in a different translation unit, and for >> volatile accesses (most likely including atomic accesses), etc. > > This isn't 100% true. As you state, in order to reach the return 0, all > side effects must be evaluated. A load generally does not have side > effects, so it can be safely elided, but function() must be emitted. > > However, that is _not_ required to get the desired warning emission on a > printf argument function, see http://pastebin.com/UHuaydkj for an > example. > > I think that as a minimum, the following patch should be evaluated, but am > unsure to whom I should submit it (after I test): Agreed - the
Re: use-after-free in sctp_do_sm
On 12/03/2015 03:03 PM, Joe Perches wrote: > On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote: >> On 12/03/2015 01:52 PM, Aaron Conole wrote: >>> I think that as a minimum, the following patch should be evaluated, >>> but am unsure to whom I should submit it (after I test): > [] >> Agreed - the intention here is certainly to have no side effects. It >> looks like 'no_printk()' is used in quite a few other places that would >> benefit from this change. So we probably want a generic >> 'really_no_printk()' macro. > > https://lkml.org/lkml/2012/6/17/231 > I don't see this in the tree. Also maybe we should just convert no_printk() to do what your 'eliminated_printk()' does. So we can convert all users with this change? Thanks, -Jason
Re: use-after-free in sctp_do_sm
On 12/03/2015 03:24 PM, Joe Perches wrote: > On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote: >> On 12/03/2015 03:03 PM, Joe Perches wrote: >>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote: >>>> On 12/03/2015 01:52 PM, Aaron Conole wrote: >>>>> I think that as a minimum, the following patch should be evaluated, >>>>> but am unsure to whom I should submit it (after I test): >>> [] >>>> Agreed - the intention here is certainly to have no side effects. It >>>> looks like 'no_printk()' is used in quite a few other places that would >>>> benefit from this change. So we probably want a generic >>>> 'really_no_printk()' macro. >>> >>> https://lkml.org/lkml/2012/6/17/231 >> >> I don't see this in the tree. > > It never got applied. > >> Also maybe we should just convert >> no_printk() to do what your 'eliminated_printk()'. > > Some of them at least. > >> So we can convert all users with this change? > > I don't think so, I think there are some > function evaluation/side effects that are > required. I believe some do hardware I/O. > > It'd be good to at least isolate them. > > I'm not sure how to find them via some > automated tool/mechanism though. > > I asked Julia Lawall about it once in this > thread: https://lkml.org/lkml/2014/12/3/696 > Seems rather fragile to have side effects that we rely upon hidden in a printk(). Just convert them and see what breaks :)
Re: [PATCH 09/13] mm: memcontrol: generalize the socket accounting jump label
Hi, On 11/24/2015 04:52 PM, Johannes Weiner wrote: > The unified hierarchy memory controller is going to use this jump > label as well to control the networking callbacks. Move it to the > memory controller code and give it a more generic name. > > Signed-off-by: Johannes Weiner > Acked-by: Michal Hocko > Reviewed-by: Vladimir Davydov > --- > include/linux/memcontrol.h | 4 > include/net/sock.h | 7 --- > mm/memcontrol.c| 3 +++ > net/core/sock.c| 5 - > net/ipv4/tcp_memcontrol.c | 4 ++-- > 5 files changed, 9 insertions(+), 14 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d99fefe..dad56ef 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -681,6 +681,8 @@ static inline void mem_cgroup_wb_stats(struct > bdi_writeback *wb, > > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > struct sock; > +extern struct static_key memcg_sockets_enabled_key; > +#define mem_cgroup_sockets_enabled > static_key_false(&memcg_sockets_enabled_key) We're trying to move to the updated API, so this should be: static_branch_unlikely(&memcg_sockets_enabled_key) see: include/linux/jump_label.h for details.
> void sock_update_memcg(struct sock *sk); > void sock_release_memcg(struct sock *sk); > bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int > nr_pages); > @@ -689,6 +691,8 @@ static inline bool > mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) > { > return memcg->tcp_mem.memory_pressure; > } > +#else > +#define mem_cgroup_sockets_enabled 0 > #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */ > > #ifdef CONFIG_MEMCG_KMEM > diff --git a/include/net/sock.h b/include/net/sock.h > index 1a94b85..fcc9442 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -1065,13 +1065,6 @@ static inline void sk_refcnt_debug_release(const > struct sock *sk) > #define sk_refcnt_debug_release(sk) do { } while (0) > #endif /* SOCK_REFCNT_DEBUG */ > > -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET) > -extern struct static_key memcg_socket_limit_enabled; > -#define mem_cgroup_sockets_enabled > static_key_false(&memcg_socket_limit_enabled) > -#else > -#define mem_cgroup_sockets_enabled 0 > -#endif > - > static inline bool sk_stream_memory_free(const struct sock *sk) > { > if (sk->sk_wmem_queued >= sk->sk_sndbuf) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 68d67fc..0602bee 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -291,6 +291,9 @@ static inline struct mem_cgroup > *mem_cgroup_from_id(unsigned short id) > /* Writing them here to avoid exposing memcg's inner layout */ > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > > +struct static_key memcg_sockets_enabled_key; And this would be: static DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); > void sock_update_memcg(struct sock *sk) > { > struct mem_cgroup *memcg; > diff --git a/net/core/sock.c b/net/core/sock.c > index 6486b0d..c5435b5 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -201,11 +201,6 @@ EXPORT_SYMBOL(sk_net_capable); > static struct lock_class_key af_family_keys[AF_MAX]; > static struct lock_class_key af_family_slock_keys[AF_MAX]; > > -#if
defined(CONFIG_MEMCG_KMEM) > -struct static_key memcg_socket_limit_enabled; > -EXPORT_SYMBOL(memcg_socket_limit_enabled); > -#endif > - > /* > * Make lock validator output more readable. (we pre-construct these > * strings build-time, so that runtime initialization of socket > diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c > index e507825..9a22e2d 100644 > --- a/net/ipv4/tcp_memcontrol.c > +++ b/net/ipv4/tcp_memcontrol.c > @@ -34,7 +34,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg) > return; > > if (memcg->tcp_mem.active) > - static_key_slow_dec(&memcg_socket_limit_enabled); > + static_key_slow_dec(&memcg_sockets_enabled_key); > static_branch_dec(&memcg_sockets_enabled_key); } > > static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages) > @@ -65,7 +65,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, > unsigned long nr_pages) >* because when this value change, the code to process it is not >* patched in yet. >*/ > - static_key_slow_inc(&memcg_socket_limit_enabled); > + static_key_slow_inc(&memcg_sockets_enabled_key); > memcg->tcp_mem.active = true; > } > > static_branch_inc(&memcg_sockets_enabled_key); Thanks, -Jason
Re: [PATCH 09/13] mm: memcontrol: generalize the socket accounting jump label
On 11/30/2015 04:50 PM, Johannes Weiner wrote: > On Mon, Nov 30, 2015 at 04:08:18PM -0500, Jason Baron wrote: >> We're trying to move to the updated API, so this should be: >> static_branch_unlikely(&memcg_sockets_enabled_key) >> >> see: include/linux/jump_label.h for details. > > Good point. There is another struct static_key in there as well. How > about the following on top of this series? > Looks fine - you may be able to make use of 'static_branch_enable()/disable()' instead of the inc()/dec() to simply set the branch direction, if you think it's more readable. Although I didn't look to see if it would be racy here. Thanks, -Jason > > --- > From b784aa0323628d43272e13a67ead2a2ce0e93ea6 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <han...@cmpxchg.org> > Date: Mon, 30 Nov 2015 16:41:38 -0500 > Subject: [PATCH] mm: memcontrol: switch to the updated jump-label API > > According to <linux/jump_label.h>, the direct use of struct static_key > is deprecated. Update the socket and slab accounting code accordingly. > > Reported-by: Jason Baron <jba...@akamai.com> > Signed-off-by: Johannes Weiner <han...@cmpxchg.org> > --- > include/linux/memcontrol.h | 8 > mm/memcontrol.c| 12 ++-- > net/ipv4/tcp_memcontrol.c | 4 ++-- > 3 files changed, 12 insertions(+), 12 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index a8df46c..9a19590 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -704,8 +704,8 @@ static inline void mem_cgroup_wb_stats(struct > bdi_writeback *wb, > > #ifdef CONFIG_INET > struct sock; > -extern struct static_key memcg_sockets_enabled_key; > -#define mem_cgroup_sockets_enabled > static_key_false(&memcg_sockets_enabled_key) > +extern struct static_key_false memcg_sockets_enabled_key; > +#define mem_cgroup_sockets_enabled > static_branch_unlikely(&memcg_sockets_enabled_key) > void sock_update_memcg(struct sock *sk); > void sock_release_memcg(struct sock *sk); > bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int >
nr_pages); > @@ -727,7 +727,7 @@ static inline bool > mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) > #endif /* CONFIG_INET */ > > #ifdef CONFIG_MEMCG_KMEM > -extern struct static_key memcg_kmem_enabled_key; > +extern struct static_key_false memcg_kmem_enabled_key; > > extern int memcg_nr_cache_ids; > void memcg_get_cache_ids(void); > @@ -743,7 +743,7 @@ void memcg_put_cache_ids(void); > > static inline bool memcg_kmem_enabled(void) > { > - return static_key_false(_kmem_enabled_key); > + return static_branch_unlikely(_kmem_enabled_key); > } > > static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index a0da91f..5fe45d68 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -346,7 +346,7 @@ void memcg_put_cache_ids(void) > * conditional to this static branch, we'll have to allow modules that does > * kmem_cache_alloc and the such to see this symbol as well > */ > -struct static_key memcg_kmem_enabled_key; > +DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key); > EXPORT_SYMBOL(memcg_kmem_enabled_key); > > #endif /* CONFIG_MEMCG_KMEM */ > @@ -2883,7 +2883,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, > err = page_counter_limit(>kmem, nr_pages); > VM_BUG_ON(err); > > - static_key_slow_inc(_kmem_enabled_key); > + static_branch_inc(_kmem_enabled_key); > /* >* A memory cgroup is considered kmem-active as soon as it gets >* kmemcg_id. 
Setting the id after enabling static branching will > @@ -3622,7 +3622,7 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg) > { > if (memcg->kmem_acct_activated) { > memcg_destroy_kmem_caches(memcg); > - static_key_slow_dec(_kmem_enabled_key); > + static_branch_dec(_kmem_enabled_key); > WARN_ON(page_counter_read(>kmem)); > } > tcp_destroy_cgroup(memcg); > @@ -4258,7 +4258,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > > #ifdef CONFIG_INET > if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) > - static_key_slow_inc(_sockets_enabled_key); > + static_branch_inc(_sockets_enabled_key); > #endif > > /* > @@ -4302,7 +4302,7 @@ static void mem_cgroup_css_free(struct > cgroup_subsys_state *css) > memcg_destroy_kmem(memcg); > #ifdef CONFIG_INET > if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) > - static_key_sl
Re: use-after-free in sock_wake_async
On 11/24/2015 10:21 AM, Eric Dumazet wrote: > On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukovwrote: >> Hello, >> >> The following program triggers use-after-free in sock_wake_async: >> >> // autogenerated by syzkaller (http://github.com/google/syzkaller) >> #include >> #include >> #include >> #include >> >> long r2 = -1; >> long r3 = -1; >> long r7 = -1; >> >> void *thr0(void *arg) >> { >> syscall(SYS_splice, r2, 0x0ul, r7, 0x0ul, 0x4ul, 0x8ul); >> return 0; >> } >> >> void *thr1(void *arg) >> { >> syscall(SYS_close, r2, 0, 0, 0, 0, 0); >> return 0; >> } >> >> void *thr2(void *arg) >> { >> syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0); >> return 0; >> } >> >> int main() >> { >> long r0 = syscall(SYS_mmap, 0x2000ul, 0x1ul, 0x3ul, >> 0x32ul, 0xul, 0x0ul); >> long r1 = syscall(SYS_socketpair, 0x1ul, 0x1ul, 0x0ul, >> 0x2000ul, 0, 0); >> r2 = *(uint32_t*)0x2000; >> r3 = *(uint32_t*)0x2004; >> >> *(uint64_t*)0x20001000 = 0x4; >> long r5 = syscall(SYS_ioctl, r2, 0x5452ul, 0x20001000ul, 0, 0, 0); >> >> long r6 = syscall(SYS_pipe2, 0x20002000ul, 0x80800ul, 0, 0, 0, 0); >> r7 = *(uint32_t*)0x20002004; >> >> pthread_t th[3]; >> pthread_create([0], 0, thr0, 0); >> pthread_create([1], 0, thr1, 0); >> pthread_create([2], 0, thr2, 0); >> pthread_join(th[0], 0); >> pthread_join(th[1], 0); >> pthread_join(th[2], 0); >> return 0; >> } >> >> >> The use-after-free fires after a minute of running it in a tight >> parallel loop. 
I use the stress utility for this: >> >> $ go get golang.org/x/tools/cmd/stress >> $ stress -p 128 -failure "ignore" ./a.out >> >> >> == >> BUG: KASAN: use-after-free in sock_wake_async+0x325/0x340 at addr >> 880061d1ad10 >> Read of size 8 by task a.out/23178 >> = >> BUG sock_inode_cache (Not tainted): kasan: bad access detected >> - >> >> Disabling lock debugging due to kernel taint >> INFO: Allocated in sock_alloc_inode+0x1d/0x220 age=0 cpu=2 pid=23183 >> [< none >] kmem_cache_alloc+0x1a6/0x1f0 mm/slub.c:2514 >> [< none >] sock_alloc_inode+0x1d/0x220 net/socket.c:250 >> [< none >] alloc_inode+0x61/0x180 fs/inode.c:198 >> [< none >] new_inode_pseudo+0x17/0xe0 fs/inode.c:878 >> [< none >] sock_alloc+0x3d/0x260 net/socket.c:540 >> [< none >] __sock_create+0xa7/0x620 net/socket.c:1133 >> [< inline >] sock_create net/socket.c:1209 >> [< inline >] SYSC_socketpair net/socket.c:1281 >> [< none >] SyS_socketpair+0x112/0x4e0 net/socket.c:1260 >> [< none >] entry_SYSCALL_64_fastpath+0x16/0x7a >> arch/x86/entry/entry_64.S:185 >> >> INFO: Freed in sock_destroy_inode+0x56/0x70 age=0 cpu=2 pid=23185 >> [< none >] kmem_cache_free+0x24e/0x260 mm/slub.c:2742 >> [< none >] sock_destroy_inode+0x56/0x70 net/socket.c:279 >> [< none >] destroy_inode+0xc4/0x120 fs/inode.c:255 >> [< none >] evict+0x36b/0x580 fs/inode.c:559 >> [< inline >] iput_final fs/inode.c:1477 >> [< none >] iput+0x4a0/0x790 fs/inode.c:1504 >> [< inline >] dentry_iput fs/dcache.c:358 >> [< none >] __dentry_kill+0x4fe/0x700 fs/dcache.c:543 >> [< inline >] dentry_kill fs/dcache.c:587 >> [< none >] dput+0x6ab/0x7a0 fs/dcache.c:796 >> [< none >] __fput+0x3fb/0x6e0 fs/file_table.c:226 >> [< none >] fput+0x15/0x20 fs/file_table.c:244 >> [< none >] task_work_run+0x163/0x1f0 kernel/task_work.c:115 >> (discriminator 1) >> [< inline >] tracehook_notify_resume include/linux/tracehook.h:191 >> [< none >] exit_to_usermode_loop+0x180/0x1a0 >> arch/x86/entry/common.c:251 >> [< inline >] prepare_exit_to_usermode 
arch/x86/entry/common.c:282 >> [< none >] syscall_return_slowpath+0x19f/0x210 >> arch/x86/entry/common.c:344 >> [< none >] int_ret_from_sys_call+0x25/0x9f >> arch/x86/entry/entry_64.S:281 >> >> INFO: Slab 0xea0001874600 objects=25 used=2 fp=0x880061d1c100 >> flags=0x5004080 >> INFO: Object 0x880061d1ad00 @offset=11520 fp=0x880061d1a300 >> CPU: 3 PID: 23178 Comm: a.out Tainted: GB 4.4.0-rc1+ #84 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> 880061baf8f0 825d3336 88003e0dc280 >> 880061d1ad00 880061d18000 880061baf920 81618784 >> 88003e0dc280 ea0001874600 880061d1ad00 00e7 >> >> Call Trace: >> [] __asan_report_load8_noabort+0x3e/0x40 >> mm/kasan/report.c:280 >> [< inline >]
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/20/2015 05:07 PM, Rainer Weikusat wrote: > Rainer Weikusat <rweiku...@mobileactivedefense.com> writes: > An AF_UNIX datagram socket being the client in an n:1 association with > some server socket is only allowed to send messages to the server if the > receive queue of this socket contains at most sk_max_ack_backlog > datagrams. This implies that prospective writers might be forced to go > to sleep despite none of the message presently enqueued on the server > receive queue were sent by them. In order to ensure that these will be > woken up once space becomes again available, the present unix_dgram_poll > routine does a second sock_poll_wait call with the peer_wait wait queue > of the server socket as queue argument (unix_dgram_recvmsg does a wake > up on this queue after a datagram was received). This is inherently > problematic because the server socket is only guaranteed to remain alive > for as long as the client still holds a reference to it. In case the > connection is dissolved via connect or by the dead peer detection logic > in unix_dgram_sendmsg, the server socket may be freed despite "the > polling mechanism" (in particular, epoll) still has a pointer to the > corresponding peer_wait queue. There's no way to forcibly deregister a > wait queue with epoll. > > Based on an idea by Jason Baron, the patch below changes the code such > that a wait_queue_t belonging to the client socket is enqueued on the > peer_wait queue of the server whenever the peer receive queue full > condition is detected by either a sendmsg or a poll. A wake up on the > peer queue is then relayed to the ordinary wait queue of the client > socket via wake function. The connection to the peer wait queue is again > dissolved if either a wake up is about to be relayed or the client > socket reconnects or a dead peer is detected or the client socket is > itself closed. 
This enables removing the second sock_poll_wait from > unix_dgram_poll, thus avoiding the use-after-free, while still ensuring > that no blocked writer sleeps forever. > > Signed-off-by: Rainer Weikusat <rweiku...@mobileactivedefense.com> > Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets") Looks good to me. Reviewed-by: Jason Baron <jba...@akamai.com> Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
On 11/19/2015 06:52 PM, Rainer Weikusat wrote: [...] > @@ -1590,21 +1718,35 @@ restart: > goto out_unlock; > } > > - if (unix_peer(other) != sk && unix_recvq_full(other)) { > - if (!timeo) { > + if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) { > + if (timeo) { > + timeo = unix_wait_for_peer(other, timeo); > + > + err = sock_intr_errno(timeo); > + if (signal_pending(current)) > + goto out_free; > + > + goto restart; > + } > + > + if (unix_peer(sk) != other || > + unix_dgram_peer_wake_me(sk, other)) { > err = -EAGAIN; > goto out_unlock; > } Hi, So here we are calling unix_dgram_peer_wake_me() without the sk lock the first time through - right? In that case, we can end up registering on the queue of other for the callback, but we might have already connected to a different remote. In that case, the wakeup will crash if 'sk' has been freed in the meantime. Thanks, -Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
On 11/16/2015 05:28 PM, Rainer Weikusat wrote: > An AF_UNIX datagram socket being the client in an n:1 association with > some server socket is only allowed to send messages to the server if the > receive queue of this socket contains at most sk_max_ack_backlog > datagrams. This implies that prospective writers might be forced to go > to sleep despite none of the message presently enqueued on the server > receive queue were sent by them. In order to ensure that these will be > woken up once space becomes again available, the present unix_dgram_poll > routine does a second sock_poll_wait call with the peer_wait wait queue > of the server socket as queue argument (unix_dgram_recvmsg does a wake > up on this queue after a datagram was received). This is inherently > problematic because the server socket is only guaranteed to remain alive > for as long as the client still holds a reference to it. In case the > connection is dissolved via connect or by the dead peer detection logic > in unix_dgram_sendmsg, the server socket may be freed despite "the > polling mechanism" (in particular, epoll) still has a pointer to the > corresponding peer_wait queue. There's no way to forcibly deregister a > wait queue with epoll. > > Based on an idea by Jason Baron, the patch below changes the code such > that a wait_queue_t belonging to the client socket is enqueued on the > peer_wait queue of the server whenever the peer receive queue full > condition is detected by either a sendmsg or a poll. A wake up on the > peer queue is then relayed to the ordinary wait queue of the client > socket via wake function. The connection to the peer wait queue is again > dissolved if either a wake up is about to be relayed or the client > socket reconnects or a dead peer is detected or the client socket is > itself closed. This enables removing the second sock_poll_wait from > unix_dgram_poll, thus avoiding the use-after-free, while still ensuring > that no blocked writer sleeps forever. 
> > Signed-off-by: Rainer Weikusat <rweiku...@mobileactivedefense.com> > Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets") > --- > > Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll() > behavior for EPOLLOUT event": This shouldn't be an issue anymore with > this change despite it restores the "only when writable" behaviour" as > the wake up relay will also be set up once _dgram_sendmsg returned > EAGAIN for a send attempt on a n:1 connected socket. > > Hi, My only comment was about potentially avoiding the double lock in the write path, otherwise this looks ok to me. Thanks, -Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/15/2015 01:32 PM, Rainer Weikusat wrote: > > That was my original idea. The problem with this is that the code > starting after the _lock and running until the main code path unlock has > to be executed in one go with the other lock held as the results of the > tests above this one may become invalid as soon as the other lock is > released. This means instead of continuing execution with the send code > proper after the block in case other became receive-ready between the > first and the second test (possible because _dgram_recvmsg does not > take the unix state lock), the whole procedure starting with acquiring > the other lock would need to be restarted. Given sufficiently unfavorable > circumstances, this could even turn into an endless loop which couldn't > be interrupted (unless code for this was added). > Hmmm - I think we can avoid it by doing the wakeup from the write path in the rare case that the queue has emptied - and avoid the double lock. IE: unix_state_unlock(other); unix_state_lock(sk); err = -EAGAIN; if (unix_peer(sk) == other) { unix_dgram_peer_wake_connect(sk, other); if (skb_queue_len(&other->sk_receive_queue) == 0) need_wakeup = true; } unix_state_unlock(sk); if (need_wakeup) wake_up_interruptible_poll(sk_sleep(sk), POLLOUT | POLLWRNORM | POLLWRBAND); goto out_free; Thanks, -Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/13/2015 01:51 PM, Rainer Weikusat wrote: [...] > > - if (unix_peer(other) != sk && unix_recvq_full(other)) { > - if (!timeo) { > - err = -EAGAIN; > - goto out_unlock; > - } > + if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) { Remind me why the 'unix_peer(sk) == other' is added here? If the remote is not connected we still want to make sure that we don't overflow the remote rcv queue, right? In terms of this added 'double' lock for both sk and other, where previously we just held the 'other' lock: I think we could continue to just hold the 'other' lock unless the remote queue is full, so something like: if (unix_peer(other) != sk && unix_recvq_full(other)) { bool need_wakeup = false; /* skipping the blocking case... */ err = -EAGAIN; if (!other_connected) goto out_unlock; unix_state_unlock(other); unix_state_lock(sk); /* if the remote peer has changed under us, the connect() will wake up any pending waiter, so just return -EAGAIN */ if (unix_peer(sk) == other) { /* In case we see there is space available, queue the wakeup and we will try again. This should be an unlikely condition */ if (!unix_dgram_peer_wake_me(sk, other)) need_wakeup = true; } unix_state_unlock(sk); if (need_wakeup) wake_up_interruptible_poll(sk_sleep(sk), POLLOUT | POLLWRNORM | POLLWRBAND); goto out_free; } So I'm not sure if the 'double' lock really affects any workload, but the above might be a way to avoid it. Also - it might be helpful to add a 'Fixes:' tag referencing where this issue started, in the changelog. Worth mentioning too is that this patch should improve the polling case here dramatically, as we currently wake the entire queue on every remote read even when we have room in the rcv buffer. So this patch will cut down on the context-switching rate dramatically from what we currently have.
Thanks, -Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Hi Rainer, > + > +/* Needs sk unix state lock. After recv_ready indicated not ready, > + * establish peer_wait connection if still needed. > + */ > +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other) > +{ > + int connected; > + > + connected = unix_dgram_peer_wake_connect(sk, other); > + > + if (unix_recvq_full(other)) > + return 1; > + > + if (connected) > + unix_dgram_peer_wake_disconnect(sk, other); > + > + return 0; > +} > + So the comment above this function says 'needs unix state lock'; however, the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage in unix_dgram_poll() has the 'sk' lock. So this looks racy. Also, another tweak on this scheme: instead of calling '__remove_wait_queue()' in unix_dgram_peer_wake_relay(), we could instead simply mark each item in the queue as 'WQ_FLAG_EXCLUSIVE'. Then, since 'unix_dgram_recvmsg()' does an exclusive wakeup, the queue has effectively been disabled (minus the first exclusive item in the list, which can just return if it's marked exclusive). This means that in dgram_poll(), we add to the list if we have not yet been added, and if we are on the list, we do a remove and then an add, clearing the exclusive flag. Thus, all the waiters that need a wakeup are at the beginning of the queue, and the disabled ones are at the end with the 'WQ_FLAG_EXCLUSIVE' flag set. This does make the list potentially long, but if we only walk it to the point we are doing the wakeup, it has no impact. I like the fact that in this scheme the wakeup doesn't have to call remove against a long list of waiters - it's just setting the exclusive flag.
Thanks, -Jason > static inline int unix_writable(struct sock *sk) > { > return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf; > @@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int > embrion) > skpair->sk_state_change(skpair); > sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP); > } > + > + unix_dgram_peer_wake_disconnect(sk, skpair); > sock_put(skpair); /* It may now die */ > unix_peer(sk) = NULL; > } > @@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct > socket *sock, int kern) > INIT_LIST_HEAD(&u->link); > mutex_init(&u->readlock); /* single task reading lock */ > init_waitqueue_head(&u->peer_wait); > + init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay); > unix_insert_socket(unix_sockets_unbound(sk), sk); > out: > if (sk == NULL) > @@ -1031,6 +1150,13 @@ restart: > if (unix_peer(sk)) { > struct sock *old_peer = unix_peer(sk); > unix_peer(sk) = other; > + > + if (unix_dgram_peer_wake_disconnect(sk, other)) > + wake_up_interruptible_poll(sk_sleep(sk), > +POLLOUT | > +POLLWRNORM | > +POLLWRBAND); > + > unix_state_double_unlock(sk, other); > > if (other != old_peer) > @@ -1565,6 +1691,13 @@ restart: > unix_state_lock(sk); > if (unix_peer(sk) == other) { > unix_peer(sk) = NULL; > + > + if (unix_dgram_peer_wake_disconnect(sk, other)) > + wake_up_interruptible_poll(sk_sleep(sk), > +POLLOUT | > +POLLWRNORM | > +POLLWRBAND); > + > unix_state_unlock(sk); > > unix_dgram_disconnected(sk, other); > @@ -1590,19 +1723,21 @@ restart: > goto out_unlock; > } > > - if (unix_peer(other) != sk && unix_recvq_full(other)) { > - if (!timeo) { > - err = -EAGAIN; > - goto out_unlock; > - } > + if (!unix_dgram_peer_recv_ready(sk, other)) { > + if (timeo) { > + timeo = unix_wait_for_peer(other, timeo); > > - timeo = unix_wait_for_peer(other, timeo); > + err = sock_intr_errno(timeo); > + if (signal_pending(current)) > + goto out_free; > > - err = sock_intr_errno(timeo); > - if (signal_pending(current)) > - goto out_free; > + goto
restart; > + } > > - goto restart; > + if (unix_dgram_peer_wake_me(sk, other)) { > + err = -EAGAIN; > + goto out_unlock; > + } > } > > if (sock_flag(other, SOCK_RCVTSTAMP)) > @@ -2453,14 +2588,16 @@ static unsigned int unix_dgram_poll(struct
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/09/2015 09:40 AM, Rainer Weikusat wrote: > An AF_UNIX datagram socket being the client in an n:1 association with > some server socket is only allowed to send messages to the server if the > receive queue of this socket contains at most sk_max_ack_backlog > datagrams. This implies that prospective writers might be forced to go > to sleep despite none of the message presently enqueued on the server > receive queue were sent by them. In order to ensure that these will be > woken up once space becomes again available, the present unix_dgram_poll > routine does a second sock_poll_wait call with the peer_wait wait queue > of the server socket as queue argument (unix_dgram_recvmsg does a wake > up on this queue after a datagram was received). This is inherently > problematic because the server socket is only guaranteed to remain alive > for as long as the client still holds a reference to it. In case the > connection is dissolved via connect or by the dead peer detection logic > in unix_dgram_sendmsg, the server socket may be freed despite "the > polling mechanism" (in particular, epoll) still has a pointer to the > corresponding peer_wait queue. There's no way to forcibly deregister a > wait queue with epoll. > > Based on an idea by Jason Baron, the patch below changes the code such > that a wait_queue_t belonging to the client socket is enqueued on the > peer_wait queue of the server whenever the peer receive queue full > condition is detected by either a sendmsg or a poll. A wake up on the > peer queue is then relayed to the ordinary wait queue of the client > socket via wake function. The connection to the peer wait queue is again > dissolved if either a wake up is about to be relayed or the client > socket reconnects or a dead peer is detected or the client socket is > itself closed. This enables removing the second sock_poll_wait from > unix_dgram_poll, thus avoiding the use-after-free, while still ensuring > that no blocked writer sleeps forever. 
> > Signed-off-by: Rainer Weikusat <rweiku...@mobileactivedefense.com> [...] > @@ -1590,10 +1723,14 @@ restart: > goto out_unlock; > } > > - if (unix_peer(other) != sk && unix_recvq_full(other)) { > + if (!unix_dgram_peer_recv_ready(sk, other)) { > if (!timeo) { > - err = -EAGAIN; > - goto out_unlock; > + if (unix_dgram_peer_wake_me(sk, other)) { > + err = -EAGAIN; > + goto out_unlock; > + } > + > + goto restart; > } So this will cause 'unix_state_lock(other)' to be called twice in a row if we 'goto restart' (and hence will softlock the box). It just needs a 'unix_state_unlock(other);' before the 'goto restart'. I also tested this patch with a single unix server and 200 client threads doing roughly epoll() followed by write() until -EAGAIN in a loop. The throughput for the test was roughly the same as current upstream, but the cpu usage was a lot higher. I think it's b/c this patch takes the server wait queue lock in the _poll() routine. This causes a lot of contention. The previous patch you posted for this, where you did not clear the wait queue on every wakeup and thus didn't need the queue lock in poll() (unless we were adding to it), performed much better. However, the previous patch which tested better didn't add to the remote queue when it was full on sendmsg() - so it wouldn't be correct for epoll ET. Adding to the remote queue for every sendmsg() that fails does seem undesirable, if we aren't even doing poll(). So I'm not sure if just going back to the previous patch is a great option either. I'm also not sure how realistic the test case I have is. It would be great if we had some other workloads to test against. Thanks, -Jason
Re: Use-after-free in ep_remove_wait_queue
On 11/06/2015 08:06 AM, Dmitry Vyukov wrote: > On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov wrote: >> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet wrote: >>> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote: >>>> Probably the issue discussed in http://thread.gmane.org/gmane.linux.kernel/2057497/ and previous related threads. >>> >>> Same issue, but Dmitry apparently did not trust me. >> >> Just wanted to make sure. Let's consider this as fixed. > > Is it meant to be fixed now? > > I am still seeing use-after-free reports in ep_unregister_pollwait on > d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5). Hi, Indeed - no formal patch has been posted for this yet - we've agreed that Rainer is going to post the patch for this issue. Thanks, -Jason
Re: [RFC] unix: fix use-after-free in unix_dgram_poll()
On 10/28/2015 12:46 PM, Rainer Weikusat wrote: > Rainer Weikusat <r...@doppelsaurus.mobileactivedefense.com> writes: >> Jason Baron <jba...@akamai.com> writes: > > [...] > >>> 2) >>> >>> For the case of epoll() in edge triggered mode we need to ensure that >>> when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full() >>> is true, we need to add a unix_peer_wake_connect() call to guarantee a >>> wakeup. Otherwise, we are going to potentially hang there. >> >> I consider this necessary. > > (As already discussed privately) just doing this would open up another > way for sockets to be enqueued on the peer_wait queue of the peer > forever despite no one wants to be notified of write space > availability. Here's another RFC patch addressing the issues so far plus > this one by breaking the connection to the peer socket from the wake up > relaying function. This has the nice additional property that the > dgram_poll code becomes somewhat simpler as the "dequeued where we > didn't enqueue" situation can no longer occur and the not-so-nice > additional property that the connect and disconnect functions need to > take the peer_wait.lock spinlock explicitly so that this lock is used to > ensure that no two threads modify the private pointer of the client > wait_queue_t. Hmmm... I thought these were already all guarded by unix_state_lock(sk). In any case, the rest of the patch overall looks good to me. Thanks, -Jason
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/18/2015 04:58 PM, Rainer Weikusat wrote: [...] > > The idea behind 'the wait queue' (insofar I'm aware of it) is that it > will be used as list of threads who need to be notified when the > associated event occurs. Since you seem to argue that the run-of-the-mill > algorithm is too slow for this particular case, is there anything to > back this up? > Generally the poll() routines only add to a wait queue once at the beginning, and all subsequent calls to poll() simply check the wakeup conditions. So here you are proposing to add/remove to the wait queue on subsequent invocations of poll(). So the initial patch I did continued in the usual pattern and only added once on registration or connect(). However, I do think that this is a special case, since the registration is on a shared wait queue, and thus having a long list of registered waiters is going to affect all waiters. So I am fine with doing the add/removes in the poll() routine, and I agree that the patch below is more compact than what I initially posted. A couple of notes on your patch: 1) In the unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and thus it requires proper dereferencing. Something like: struct unix_sock *u; struct socket_wq *wq; u = container_of(wait, struct unix_sock, wait); rcu_read_lock(); wq = rcu_dereference(u->sk.sk_wq); if (wq_has_sleeper(wq)) wake_up_interruptible_sync_poll(&wq->wait, key); rcu_read_unlock(); 2) For the case of epoll() in edge triggered mode, we need to ensure that when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full() is true, we add a unix_peer_wake_connect() call to guarantee a wakeup. Otherwise, we are going to potentially hang there. With these changes (or tell me why they are not needed), I'm happy to ack this patch. Thanks, -Jason
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
> > X-Signed-Off-By: Rainer Weikusat > Hi, So the patches I've posted and yours both use the idea of relaying the remote peer wakeup via callbacks that are internal to net/unix, such that we avoid exposing the remote peer wakeup to the external poll()/select()/epoll(). They differ in when and how those callbacks are registered/unregistered. So I think your approach here will generally keep the peer wait wakeup queue to its absolute minimum, by removing from that queue when we set POLLOUT; however, it requires taking the peer waitqueue lock on every poll() call. So I think there are tradeoffs here vs. what I've posted. So for example, if there are a lot of writers against one 'server' socket, there is going to be a lot of lock contention with your approach here. So I think the performance is going to depend on the workload that is tested. I don't have a specific workload that I am trying to solve here, but since you introduced the waiting on the remote peer queue, perhaps you can post numbers comparing the patches that have been posted for the workload that this was developed for? I will send you the latest version of what I have privately - so as not to overly spam the list. Thanks, -Jason
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/12/2015 04:41 PM, Rainer Weikusat wrote: > Jason Baron <jba...@akamai.com> writes: >> On 10/05/2015 12:31 PM, Rainer Weikusat wrote: > > [...] > >>> Here's a more simple idea which _might_ work. The underlying problem >>> seems to be that the second sock_poll_wait introduces a covert reference >>> to the peer socket which isn't accounted for. The basic idea behind this >>> is to execute an additional sock_hold for the peer whenever the >>> sock_poll_wait is called for it and store it (the struct sock *) in a >>> new struct unix_sock member. > > [...] > >> Interesting - will this work for the test case you supplied where we are in >> epoll_wait() and another thread does a connect() to a new target? In that >> case, even if we issue a wakeup to the epoll thread, its not going to have >> a non-NULL poll_table, so it wouldn't be added to the new target. IE >> the first test case here: >> >> https://lkml.org/lkml/2015/10/4/154 > > "Indeed it would not." I've also meanwhile found the time to check what > is and isn't locked here and found that Eric's "this looks racy" was > also justified. In theory, a clean solution could be based on modifying > the various polling implementations to keep a piece of data for a polled > something and provided that again on each subsequent poll call. This > could then be used to keep the peer socket alive for as long as > necessary were it possible to change the set of wait queues with every > poll call. Since this also isn't the case, the idea to increment the > reference count of the peer socket won't fly. > > OTOH, something I seriously dislike about your relaying implementation > is the unconditional add_wait_queue upon connect as this builds up a > possibly large wait queue of entities entirely uninterested in the > event which will need to be traversed even if peer_wait wakeups will > only happen if at least someone actually wants to be notified. 
> This could be changed such that the struct unix_sock member is only put
> onto the peer's wait queue in poll and only if it hasn't already been put
> onto it. The connection could then be severed if some kind of
> 'disconnect' occurs.
>
> The code below (again based on 3.2.54) is what I'm currently running and
> it has survived testing during the day (without trying the exercise in
> hexadecimal as that doesn't cause failure for me, anyway). The wakeup
> relaying function checks that a socket wait queue actually still exists
> because I used to get null pointer oopses every now and then
> (I've also tested this with an additional printk printing 'no q' in case
> the pointer was actually null to verify that this really occurs here).

Hi,

What about the following race?

1) thread A does poll() on f, finds the wakeup condition low, and adds
   itself to the remote peer_wait queue.

2) thread B sets the wakeup condition in dgram_recvmsg(), but does not
   execute the wakeup of threads yet.

3) thread C also does a poll() on f, finds now that the wakeup condition
   is set, and hence removes f from the remote peer_wait queue.

Then thread A misses the POLLOUT and hangs.

I understand your concern about POLLIN-only waiters - I think we could add
the 'relay callback' in dgram_poll() only for those who are looking for
POLLOUT, and simply avoid the de-registration, as in practice I think it's
unlikely we are going to have a socket switching from POLLOUT to *only*
POLLIN. I suspect that will cover most of the cases that concern you?

Thanks,

-Jason
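The missed-POLLOUT hang that motivates this whole thread can be exercised from userspace. Below is a minimal sketch (mine, not from the thread) of the poll/write loop described in the patch at the top: the client fills the server's receive queue until -EAGAIN, the server closes, and the client then expects POLLOUT followed by -ECONNREFUSED on write. On a kernel with the bug, the poll() call is where the writer would hang; the abstract-namespace socket name is arbitrary, and the demo assumes Linux.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Build a Linux abstract-namespace address (leading NUL, no filesystem). */
static socklen_t abstract_addr(struct sockaddr_un *a, const char *name)
{
    memset(a, 0, sizeof(*a));
    a->sun_family = AF_UNIX;
    memcpy(a->sun_path + 1, name, strlen(name));
    return offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
}

/* Fill the server's queue until EAGAIN, close the server, then expect
 * POLLOUT on the client and ECONNREFUSED from the next send(). */
int stuck_writer_demo(void)
{
    char name[64];
    snprintf(name, sizeof(name), "af_unix-pollout-demo-%d", (int)getpid());

    struct sockaddr_un addr;
    socklen_t len = abstract_addr(&addr, name);

    int srv = socket(AF_UNIX, SOCK_DGRAM, 0);
    int cli = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (srv < 0 || cli < 0 ||
        bind(srv, (struct sockaddr *)&addr, len) < 0 ||
        connect(cli, (struct sockaddr *)&addr, len) < 0 ||
        fcntl(cli, F_SETFL, O_NONBLOCK) < 0)
        return -1;

    char buf[1024] = { 0 };
    while (send(cli, buf, sizeof(buf), 0) > 0)
        ;                               /* fill until queue/bufs are full */
    if (errno != EAGAIN)
        return -2;

    close(srv);                         /* remote close purges queued skbs */

    struct pollfd pfd = { .fd = cli, .events = POLLOUT };
    if (poll(&pfd, 1, 1000) != 1 || !(pfd.revents & POLLOUT))
        return -3;                      /* a missed wakeup would time out here */

    int err = send(cli, buf, sizeof(buf), 0) < 0 ? errno : 0;
    close(cli);
    return err;                         /* expect ECONNREFUSED */
}
```

This is the "thread 1/thread 2" sequence from the patch description compressed into one process; using separate threads, as in the original report, only widens the race window.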
Re: [PATCH v4 0/3] net: unix: fix use-after-free
On 10/11/2015 07:55 AM, David Miller wrote:
> From: Jason Baron <jba...@akamai.com>
> Date: Fri, 9 Oct 2015 00:15:59 -0400
>
>> These patches are against mainline, I can re-base to net-next, please
>> let me know.
>>
>> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
>> which causes the use-after-free quite quickly and here:
>> https://lkml.org/lkml/2015/10/2/693.
>
> Hi,
>
> I'd like to understand how patches that don't even compile can be
> "tested"?
>
> net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this function)
> net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only
> once for each function it appears in
>
> Could you explain how that works, I'm having a hard time understanding
> this?

Traveling this week, so responses are a bit delayed.

Yes, I screwed up the posting. I had some outstanding code in my local tree
to make it compile, but I failed to refresh my patch series with this
outstanding code before mailing it out. So what I tested/built was not
quite what I mailed out. As soon as I noticed this issue in patch 3/3 I
re-posted it here: http://marc.info/?l=linux-netdev=10355808472=2 in an
attempt to avoid this confusion. I'm happy to re-post the series or
whatever makes things easiest for you.

> Also please address Hannes's feedback, thanks.

I've replied directly to Hannes.

Thanks,

-Jason
Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
On 10/09/2015 10:38 AM, Hannes Frederic Sowa wrote: > Hi, > > Jason Baron <jba...@akamai.com> writes: > >> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait >> queue associated with the socket s that we are poll'ing against, but also >> calls >> sock_poll_wait() for a remote peer socket p, if it is connected. Thus, >> if we call poll()/select()/epoll() for the socket s, there are then >> a couple of code paths in which the remote peer socket p and its associated >> peer_wait queue can be freed before poll()/select()/epoll() have a chance >> to remove themselves from the remote peer socket. >> >> The way that remote peer socket can be freed are: >> >> 1. If s calls connect() to a connect to a new socket other than p, it will >> drop its reference on p, and thus a close() on p will free it. >> >> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop >> the final reference to p, allowing it to be freed. >> >> Address this issue, by reverting unix_dgram_poll() to only register with >> the wait queue associated with s and register a callback with the remote peer >> socket on connect() that will wake up the wait queue associated with s. If >> scenarios 1 or 2 occur above we then simply remove the callback from the >> remote peer. This then presents the expected semantics to poll()/select()/ >> epoll(). >> >> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET >> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll(). >> >> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected >> DGRAM sockets"). >> >> Tested-by: Mathias Krause <mini...@googlemail.com> >> Signed-off-by: Jason Baron <jba...@akamai.com> > > While I think this approach works, I haven't seen where the current code > leaks a reference. Assignment to unix_peer(sk) in general take spin_lock > and increment refcount. Are there bugs at the two places you referred > to? 
>
> Is an easier fix just to use atomic_inc_not_zero(&sk->sk_refcnt) in
> unix_peer_get() which could also help other places?

Hi,

So we could potentially inc the refcnt on the remote peer such that the
remote peer does not free before the socket that has connected to it.
However, the socket that has taken the reference against the peer socket
then has to potentially record a number of remote sockets (all the ones
that it has connected to over its lifetime), and then drop all of their
refcnts when it finally closes. The reason for this is that with the
current code, when we do poll()/select()/epoll() on a socket with a peer
socket, those calls take a reference on the peer socket. Specifically,
they record the remote peer whead, such that they can remove their
callbacks when they return. So it's not safe to just drop a reference on
the remote peer when it closes, because there might be outstanding
poll()/select()/epoll() references pending. Normally,
poll()/select()/epoll() are waiting on a whead associated directly with
the fd/file that they are waiting for.

The other point here is that the way this patch structures things, when
the socket connects to a new remote and hence disconnects from an
existing remote, POLLOUT events will continue to be correctly delivered.
That was not possible with the current structure of things because there
was no way to inform poll to re-register with the remote peer whead. So
that means that the first test case here now works:

https://lkml.org/lkml/2015/10/4/154

whereas with the old code the test case would just hang forever. So yes,
there is a bit of code churn here, but I think it moves the code-base in
a direction that not only solves this issue, but corrects additional
poll() behaviors as well.

Thanks,

-Jason
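As a baseline for the semantics being preserved in this discussion (my own sketch, not code from the thread): once the remote end of a connected AF_UNIX SOCK_DGRAM pair is closed, poll() must report the surviving socket writable so that the next write() can return -ECONNREFUSED — which is exactly what applications key off of to tear the socket down.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close one end of a connected dgram pair, then verify the survivor
 * polls writable and that the next send() reports ECONNREFUSED. */
int dead_peer_demo(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
        return -1;

    close(fds[1]);                      /* the remote end goes away */

    struct pollfd pfd = { .fd = fds[0], .events = POLLOUT };
    if (poll(&pfd, 1, 1000) != 1 || !(pfd.revents & POLLOUT))
        return -2;                      /* POLLOUT must be signalled */

    int err = send(fds[0], "x", 1, 0) < 0 ? errno : 0;
    close(fds[0]);
    return err;                         /* expect ECONNREFUSED */
}
```

Note this uses socketpair() for brevity; the hang in the original report involves a bound server socket whose queue is full, but the close-then-ECONNREFUSED contract is the same.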
Re: [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
On 10/09/2015 12:29 AM, kbuild test robot wrote: > Hi Jason, > > [auto build test ERROR on v4.3-rc3 -- if it's inappropriate base, please > ignore] > > config: x86_64-randconfig-i0-201540 (attached as .config) > reproduce: > # save the attached .config to linux build tree > make ARCH=x86_64 > > All errors (new ones prefixed by >>): > >net/unix/af_unix.c: In function 'unix_dgram_writable': >>> net/unix/af_unix.c:2465:3: error: 'other_full' undeclared (first use in >>> this function) > *other_full = false; > ^ >net/unix/af_unix.c:2465:3: note: each undeclared identifier is reported > only once for each function it appears in > Forgot to refresh this patch before sending. The one that I tested with is below. Thanks, -Jason Now that connect() permanently registers a callback routine, we can induce extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up its peer_wait queue on every receive. This patch makes the wakeup there conditional on there being waiters. Tested using: http://www.spinics.net/lists/netdev/msg145533.html Signed-off-by: Jason Baron <jba...@akamai.com> --- include/net/af_unix.h | 1 + net/unix/af_unix.c| 92 +-- 2 files changed, 69 insertions(+), 24 deletions(-) diff --git a/include/net/af_unix.h b/include/net/af_unix.h index 6a4a345..cf21ffd 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -61,6 +61,7 @@ struct unix_sock { unsigned long flags; #define UNIX_GC_CANDIDATE 0 #define UNIX_GC_MAYBE_CYCLE1 +#define UNIX_NOSPACE 2 struct socket_wqpeer_wq; wait_queue_twait; }; diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index f789423..ac9bcd8 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -326,7 +326,7 @@ found: return s; } -static inline int unix_writable(struct sock *sk) +static inline bool unix_writable(struct sock *sk) { return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf; } @@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo) prepare_to_wait_exclusive(>peer_wait, , 
TASK_INTERRUPTIBLE); + set_bit(UNIX_NOSPACE, >flags); + /* Ensure that we either see space in the peer sk_receive_queue via the +* unix_recvq_full() check below, or we receive a wakeup when it +* empties. Pairs with the mb in unix_dgram_recvmsg(). +*/ + smp_mb__after_atomic(); sched = !sock_flag(other, SOCK_DEAD) && !(other->sk_shutdown & RCV_SHUTDOWN) && unix_recvq_full(other); @@ -1623,17 +1629,27 @@ restart: if (unix_peer(other) != sk && unix_recvq_full(other)) { if (!timeo) { - err = -EAGAIN; - goto out_unlock; - } - - timeo = unix_wait_for_peer(other, timeo); + set_bit(UNIX_NOSPACE, _sk(other)->flags); + /* Ensure that we either see space in the peer +* sk_receive_queue via the unix_recvq_full() check +* below, or we receive a wakeup when it empties. This +* makes sure that epoll ET triggers correctly. Pairs +* with the mb in unix_dgram_recvmsg(). +*/ + smp_mb__after_atomic(); + if (unix_recvq_full(other)) { + err = -EAGAIN; + goto out_unlock; + } + } else { + timeo = unix_wait_for_peer(other, timeo); - err = sock_intr_errno(timeo); - if (signal_pending(current)) - goto out_free; + err = sock_intr_errno(timeo); + if (signal_pending(current)) + goto out_free; - goto restart; + goto restart; + } } if (sock_flag(other, SOCK_RCVTSTAMP)) @@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, goto out_unlock; } - wake_up_interruptible_sync_poll(>peer_wait, - POLLOUT | POLLWRNORM | POLLWRBAND); + /* Ensure that waiters on our sk->sk_receive_queue draining that check +* via unix_recvq_full() either see space in the queue or get a wakeup +* below. sk->sk_receive_queue is reduece by the __skb_recv_datagram() +* call above. Pairs with the mb in unix_dgram_sendmsg(), +*unix_dgram_poll(), and unix_wait_for_peer(). +*/ + smp_mb(); +
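The set_bit()/smp_mb() pairing in the patch above is the classic lost-wakeup protocol: the waiter publishes UNIX_NOSPACE before re-checking unix_recvq_full(), and the waker drains the queue before testing the bit. A tiny exhaustive interleaving check of that protocol, assuming sequential consistency (my own model, not kernel code):

```c
#include <stdbool.h>

/* Shared state: 'full' models unix_recvq_full(), 'nospace' models the
 * UNIX_NOSPACE bit. The waiter runs W1;W2, the waker runs K1;K2:
 *   W1: nospace = 1            K1: full = 0  (datagram consumed)
 *   W2: sleep if full          K2: wake if nospace
 * A lost wakeup means: the waiter sleeps but the waker skips the wake. */
bool lost_wakeup(int order[4])
{
    bool full = true, nospace = false;
    bool sleeps = false, wakes = false;

    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 0: nospace = true;  break;  /* W1: set_bit(UNIX_NOSPACE) */
        case 1: sleeps = full;   break;  /* W2: recheck unix_recvq_full() */
        case 2: full = false;    break;  /* K1: __skb_recv_datagram() */
        case 3: wakes = nospace; break;  /* K2: test_bit() + wake_up */
        }
    }
    return sleeps && !wakes;
}

/* Enumerate every interleaving that keeps W1 before W2 and K1 before K2;
 * with the barrier-enforced program order, none of them loses a wakeup. */
int count_lost_wakeups(void)
{
    int orders[6][4] = {
        { 0, 1, 2, 3 }, { 0, 2, 1, 3 }, { 0, 2, 3, 1 },
        { 2, 0, 1, 3 }, { 2, 0, 3, 1 }, { 2, 3, 0, 1 },
    };
    int lost = 0;

    for (int i = 0; i < 6; i++)
        lost += lost_wakeup(orders[i]);
    return lost;
}
```

Swapping W1 and W2 (i.e. rechecking the queue before publishing the flag, which is what the smp_mb__after_atomic() calls forbid) does admit an interleaving where the waiter sleeps forever — that is the case the barrier comments in the patch are guarding against.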
[PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 92 ++++++++++++++++++++++++++++++++-----------------
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
 	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
+#define UNIX_NOSPACE		2
 	struct socket_wq	peer_wq;
 	wait_queue_t		wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..05fbd00 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
 	return s;
 }

-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo)

 	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);

+	set_bit(UNIX_NOSPACE, &u->flags);
+	/* Ensure that we either see space in the peer sk_receive_queue via the
+	 * unix_recvq_full() check below, or we receive a wakeup when it
+	 * empties. Pairs with the mb in unix_dgram_recvmsg().
+	 */
+	smp_mb__after_atomic();
 	sched = !sock_flag(other, SOCK_DEAD) &&
 		!(other->sk_shutdown & RCV_SHUTDOWN) &&
 		unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:

 	if (unix_peer(other) != sk && unix_recvq_full(other)) {
 		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
-
-		timeo = unix_wait_for_peer(other, timeo);
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+			/* Ensure that we either see space in the peer
+			 * sk_receive_queue via the unix_recvq_full() check
+			 * below, or we receive a wakeup when it empties. This
+			 * makes sure that epoll ET triggers correctly. Pairs
+			 * with the mb in unix_dgram_recvmsg().
+			 */
+			smp_mb__after_atomic();
+			if (unix_recvq_full(other)) {
+				err = -EAGAIN;
+				goto out_unlock;
+			}
+		} else {
+			timeo = unix_wait_for_peer(other, timeo);

-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;

-		goto restart;
+			goto restart;
+		}
 	}

 	if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}

-	wake_up_interruptible_sync_poll(&u->peer_wait,
-					POLLOUT | POLLWRNORM | POLLWRBAND);
+	/* Ensure that waiters on our sk->sk_receive_queue draining that check
+	 * via unix_recvq_full() either see space in the queue or get a wakeup
+	 * below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+	 * call above. Pairs with the mb in unix_dgram_sendmsg(),
+	 * unix_dgram_poll(), and unix_wait_for_peer().
+	 */
+	smp_mb();
+	if (test_bit(UNIX_NOSPACE, &u->flags)) {
+		clear_bit(UNIX_NOSPACE, &u->flags);
+		wake_up_interruptible_sync_poll(&u->peer_wait,
+						POLLOUT | POLLWRNORM |
+						POLLWRBAND);
+	}

 	if (msg->msg_name)
 		unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,25 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	return mask;
 }

+static bool unix_dgram_writable(struct sock *sk, struct sock *other,
+				bool *other_nospace)
+{
+	*other_full = false;
+
+	if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
+		*other_full = true;
+		return false;
+	}
+
+	return unix_writable(sk);
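For reference, the unix_writable() predicate that the patch above converts to bool considers a socket writable only while at most a quarter of its send buffer is charged to in-flight write memory. A hypothetical restatement of just that arithmetic, outside the kernel:

```c
#include <stdbool.h>

/* Mirrors the sense of the kernel's unix_writable(): sk_wmem_alloc
 * shifted left by two (i.e. multiplied by four) must still fit within
 * sk_sndbuf, so the socket stops polling writable once more than a
 * quarter of the send buffer is in use. */
bool unix_writable_model(long wmem_alloc, long sndbuf)
{
    return (wmem_alloc << 2) <= sndbuf;
}
```

This quarter-of-sndbuf hysteresis is why, in the scenarios in this thread, a writer can remain unwritable even after its peer changes: the bytes it already queued stay charged to its own wmem until they are consumed or purged.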
[PATCH v4 0/3] net: unix: fix use-after-free
Hi,

These patches are against mainline; I can re-base to net-next, please
let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly, and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

v4:
 - set UNIX_NOSPACE only if the peer socket has receive space

v3:
 - beef up memory barrier comments in 3/3 (Peter Zijlstra)
 - clean up unix_dgram_writable() function in 3/3 (Joe Perches)

Jason Baron (3):
  net: unix: fix use-after-free in unix_dgram_poll()
  net: unix: Convert gc_flags to flags
  net: unix: optimize wakeups in unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c    | 124 +++++++++++++++++++++++++++++++++++-----------
 net/unix/garbage.c    |  12 ++---
 3 files changed, 108 insertions(+), 32 deletions(-)

-- 
2.6.1
[PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also
calls sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then a couple
of code paths in which the remote peer socket p and its associated peer_wait
queue can be freed before poll()/select()/epoll() have a chance to remove
themselves from the remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to connect to a new socket other than p, it will
   drop its reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop
   the final reference to p, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with
the wait queue associated with s, and register a callback with the remote
peer socket on connect() that will wake up the wait queue associated with
s. If scenarios 1 or 2 occur above, we then simply remove the callback from
the remote peer. This then presents the expected semantics to
poll()/select()/epoll().

I've implemented this for sock-types SOCK_RAW, SOCK_DGRAM, and
SOCK_SEQPACKET, but not for SOCK_STREAM, since SOCK_STREAM does not use
unix_dgram_poll().

Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
DGRAM sockets").
Tested-by: Mathias Krause <mini...@googlemail.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		wait;
 };

 #define unix_sk(__sk) ((struct unix_sock *)__sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	skpair = unix_peer(sk);

 	if (skpair != NULL) {
+		if (sk->sk_type != SOCK_STREAM)
+			remove_wait_queue(&unix_sk(skpair)->peer_wait,
+					  &u->wait);
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
 			unix_state_lock(skpair);
 			/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;

+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct unix_sock *u;
+
+	u = container_of(wait, struct unix_sock, wait);
+	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+	return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->wait, peer_wake);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 		 */
 		if (unix_peer(sk)) {
 			struct sock *old_peer = unix_peer(sk);
+
+			remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+					  &unix_sk(sk)->wait);
 			unix_peer(sk) = other;
+			add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 			unix_state_double_unlock(sk, other);

 			if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
 			sock_put(old_peer);
 		} else {
 			unix_peer(sk) = other;
+			add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 			unix_state_double_unlock(sk, other);
 		}
+	/* New remote may have created write space for us */
+	wake_up_interruptible_sync_poll(sk_sleep(sk),
+					POLLOUT | POLLWRNORM | POLLWRBAND);

 	return 0;

 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 	sock_hold(sk);
 	unix_peer(newsk)	= sk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 	smp_mb__after_atomic();	/* sock
[PATCH v4 2/3] net: unix: Convert gc_flags to flags
Convert gc_flags to flags in preparation for the subsequent patch, which
will make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
 	atomic_long_t		inflight;
 	spinlock_t		lock;
 	unsigned char		recursion_level;
-	unsigned long		gc_flags;
+	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			 * have been added to the queues after
 			 * starting the garbage collection
 			 */
-			if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
+			if (test_bit(UNIX_GC_CANDIDATE, &u->flags)) {
 				hit = true;

 				func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 	 * of the list, so that it's checked even if it was already
 	 * passed over
 	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
+	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->flags))
 		list_move_tail(&u->link, &gc_candidates);
 }

@@ -305,8 +305,8 @@ void unix_gc(void)
 		BUG_ON(total_refs < inflight_refs);
 		if (total_refs == inflight_refs) {
 			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__set_bit(UNIX_GC_CANDIDATE, &u->flags);
+			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 		}
 	}

@@ -332,7 +332,7 @@ void unix_gc(void)

 		if (atomic_long_read(&u->inflight) > 0) {
 			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 			scan_children(&u->sk, inc_inflight_move_tail, NULL);
 		}
 	}
@@ -343,7 +343,7 @@ void unix_gc(void)
 	 */
 	while (!list_empty(&not_cycle_list)) {
 		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
+		__clear_bit(UNIX_GC_CANDIDATE, &u->flags);
 		list_move_tail(&u->link, &gc_inflight_list);
 	}

-- 
2.6.1
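The gc_flags rename above exists so a later patch in the series can add UNIX_NOSPACE alongside the gc bits in the same word. As a hypothetical userspace sketch of how distinct bit numbers coexist in one unsigned long (non-atomic stand-ins for the kernel's set_bit()/clear_bit()/test_bit(); the atomic kernel versions are what make the sharing safe under concurrency):

```c
#include <stdbool.h>

/* Bit numbers, matching the layout in the patched struct unix_sock. */
enum { GC_CANDIDATE = 0, GC_MAYBE_CYCLE = 1, NOSPACE = 2 };

void flag_set(unsigned long *flags, int bit)   { *flags |=  (1UL << bit); }
void flag_clear(unsigned long *flags, int bit) { *flags &= ~(1UL << bit); }
bool flag_test(unsigned long *flags, int bit)  { return *flags & (1UL << bit); }
```

Each bit is independent: setting or clearing NOSPACE never disturbs the gc bits, which is why the rename from gc_flags to flags is purely mechanical.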
[PATCH v3 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 85 +++++++++++++++++++++++++++++++--------------
 2 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
 	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
+#define UNIX_NOSPACE		2
 	struct socket_wq	peer_wq;
 	wait_queue_t		wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..66979d4 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
 	return s;
 }

-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo)

 	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);

+	set_bit(UNIX_NOSPACE, &u->flags);
+	/* Ensure that we either see space in the peer sk_receive_queue via the
+	 * unix_recvq_full() check below, or we receive a wakeup when it
+	 * empties. Pairs with the mb in unix_dgram_recvmsg().
+	 */
+	smp_mb__after_atomic();
 	sched = !sock_flag(other, SOCK_DEAD) &&
 		!(other->sk_shutdown & RCV_SHUTDOWN) &&
 		unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:

 	if (unix_peer(other) != sk && unix_recvq_full(other)) {
 		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+			/* Ensure that we either see space in the peer
+			 * sk_receive_queue via the unix_recvq_full() check
+			 * below, or we receive a wakeup when it empties. This
+			 * makes sure that epoll ET triggers correctly. Pairs
+			 * with the mb in unix_dgram_recvmsg().
+			 */
+			smp_mb__after_atomic();
+			if (unix_recvq_full(other)) {
+				err = -EAGAIN;
+				goto out_unlock;
+			}
+		} else {
+			timeo = unix_wait_for_peer(other, timeo);

-		timeo = unix_wait_for_peer(other, timeo);
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;

-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
-
-		goto restart;
+			goto restart;
+		}
 	}

 	if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}

-	wake_up_interruptible_sync_poll(&u->peer_wait,
-					POLLOUT | POLLWRNORM | POLLWRBAND);
+	/* Ensure that waiters on our sk->sk_receive_queue draining that check
+	 * via unix_recvq_full() either see space in the queue or get a wakeup
+	 * below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+	 * call above. Pairs with the mb in unix_dgram_sendmsg(),
+	 * unix_dgram_poll(), and unix_wait_for_peer().
+	 */
+	smp_mb();
+	if (test_bit(UNIX_NOSPACE, &u->flags)) {
+		clear_bit(UNIX_NOSPACE, &u->flags);
+		wake_up_interruptible_sync_poll(&u->peer_wait,
+						POLLOUT | POLLWRNORM |
+						POLLWRBAND);
+	}

 	if (msg->msg_name)
 		unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,19 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	return mask;
 }

+static bool unix_dgram_writable(struct sock *sk, struct sock *other)
+{
+	if (other && unix_peer(other) != sk && unix_recvq_full(other))
+		return false;
+
+	return unix_writable(sk);
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 				    poll_table *wait)
 {
 	struct sock *sk = sock->sk, *other;
-	u
[PATCH v3 0/3] net: unix: fix use-after-free
Hi,

These patches are against mainline; I can re-base to net-next, just
let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly, and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

v3:
 - beef up memory barrier comments in 3/3 (Peter Zijlstra)
 - clean up unix_dgram_writable() function in 3/3 (Joe Perches)

Jason Baron (3):
  net: unix: fix use-after-free in unix_dgram_poll()
  net: unix: Convert gc_flags to flags
  net: unix: optimize wakeups in unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c    | 117 ++++++++++++++++++++++++++++++++++---------
 net/unix/garbage.c    |  12 +++---
 3 files changed, 101 insertions(+), 32 deletions(-)

-- 
2.6.1
[PATCH v3 1/3] net: unix: fix use-after-free in unix_dgram_poll()
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also
calls sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then a couple
of code paths in which the remote peer socket p and its associated peer_wait
queue can be freed before poll()/select()/epoll() have a chance to remove
themselves from the remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to connect to a new socket other than p, it will
   drop its reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop
   the final reference to p, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with
the wait queue associated with s, and register a callback with the remote
peer socket on connect() that will wake up the wait queue associated with
s. If scenarios 1 or 2 occur above, we then simply remove the callback from
the remote peer. This then presents the expected semantics to
poll()/select()/epoll().

I've implemented this for sock-types SOCK_RAW, SOCK_DGRAM, and
SOCK_SEQPACKET, but not for SOCK_STREAM, since SOCK_STREAM does not use
unix_dgram_poll().

Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
DGRAM sockets").
Tested-by: Mathias Krause <mini...@googlemail.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		wait;
 };

 #define unix_sk(__sk) ((struct unix_sock *)__sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	skpair = unix_peer(sk);

 	if (skpair != NULL) {
+		if (sk->sk_type != SOCK_STREAM)
+			remove_wait_queue(&unix_sk(skpair)->peer_wait,
+					  &u->wait);
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
 			unix_state_lock(skpair);
 			/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;

+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct unix_sock *u;
+
+	u = container_of(wait, struct unix_sock, wait);
+	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+	return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->wait, peer_wake);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 		 */
 		if (unix_peer(sk)) {
 			struct sock *old_peer = unix_peer(sk);
+
+			remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+					  &unix_sk(sk)->wait);
 			unix_peer(sk) = other;
+			add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 			unix_state_double_unlock(sk, other);

 			if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
 			sock_put(old_peer);
 		} else {
 			unix_peer(sk) = other;
+			add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 			unix_state_double_unlock(sk, other);
 		}
+	/* New remote may have created write space for us */
+	wake_up_interruptible_sync_poll(sk_sleep(sk),
+					POLLOUT | POLLWRNORM | POLLWRBAND);

 	return 0;

 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 	sock_hold(sk);
 	unix_peer(newsk)	= sk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 	smp_mb__after_atomic();	/* sock
[PATCH v3 2/3] net: unix: Convert gc_flags to flags
Convert gc_flags to flags in preparation for the subsequent patch, which
will make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
 	atomic_long_t		inflight;
 	spinlock_t		lock;
 	unsigned char		recursion_level;
-	unsigned long		gc_flags;
+	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			 * have been added to the queues after
 			 * starting the garbage collection
 			 */
-			if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
+			if (test_bit(UNIX_GC_CANDIDATE, &u->flags)) {
 				hit = true;

 				func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 	 * of the list, so that it's checked even if it was already
 	 * passed over
 	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
+	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->flags))
 		list_move_tail(&u->link, &gc_candidates);
 }

@@ -305,8 +305,8 @@ void unix_gc(void)
 		BUG_ON(total_refs < inflight_refs);
 		if (total_refs == inflight_refs) {
 			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__set_bit(UNIX_GC_CANDIDATE, &u->flags);
+			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 		}
 	}

@@ -332,7 +332,7 @@ void unix_gc(void)

 		if (atomic_long_read(&u->inflight) > 0) {
 			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 			scan_children(&u->sk, inc_inflight_move_tail, NULL);
 		}
 	}
@@ -343,7 +343,7 @@ void unix_gc(void)
 	 */
 	while (!list_empty(&not_cycle_list)) {
 		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
+		__clear_bit(UNIX_GC_CANDIDATE, &u->flags);
 		list_move_tail(&u->link, &gc_inflight_list);
 	}

-- 
2.6.1
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/05/2015 12:31 PM, Rainer Weikusat wrote:
> Jason Baron <jba...@akamai.com> writes:
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also
>> calls sock_poll_wait() for a remote peer socket p, if it is connected.
>> Thus, if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket p and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from the remote peer socket.
>>
>> The way that remote peer socket can be freed are:
>>
>> 1. If s calls connect() to a connect to a new socket other than p, it will
>> drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
>> the final reference to p, allowing it to be freed.
>
> Here's a more simple idea which _might_ work. The underlying problem
> seems to be that the second sock_poll_wait introduces a covert reference
> to the peer socket which isn't accounted for. The basic idea behind this
> is to execute an additional sock_hold for the peer whenever the
> sock_poll_wait is called for it and store it (the struct sock *) in a
> new struct unix_sock member. Upon entering unix_dgram_poll, if the
> member is not NULL, it's cleared and a sock_put for its former value is
> done. The 'poll peer not NULL -> sock_put it' code is also added to the
> destructor, although I'm unsure if this is really necessary. The patch
> below also includes the additional SOCK_DEAD test suggested by Martin as
> that seems generally sensible to me.
>
> NB: This has survived both Martin's and my test programs for a number
> of executions/ longer periods of time than was common before without
> generating list corruption warnings. The patch below is against 'my'
> 3.2.54 and is here provided as a suggestion in the hope that it will be
> useful for someting, not as patch submission, as I spent less time
> thinking through this than I should ideally have but despite of this,
> it's another 2.5 hours of my life spent on something completely
> different than what I should be working on at the moment.
>
> --
> diff -pru linux-2-6/include/net/af_unix.h linux-2-6.p/include/net/af_unix.h
> --- linux-2-6/include/net/af_unix.h	2014-01-20 21:52:53.0 +
> +++ linux-2-6.p/include/net/af_unix.h	2015-10-05 15:11:20.270958297 +0100
> @@ -50,6 +50,7 @@ struct unix_sock {
>  	struct vfsmount		*mnt;
>  	struct mutex		readlock;
>  	struct sock		*peer;
> +	struct sock		*poll_peer;
>  	struct sock		*other;
>  	struct list_head	link;
>  	atomic_long_t		inflight;
> diff -pru linux-2-6/net/unix/af_unix.c linux-2-6.p/net/unix/af_unix.c
> --- linux-2-6/net/unix/af_unix.c	2014-01-22 16:51:52.0 +
> +++ linux-2-6.p/net/unix/af_unix.c	2015-10-05 17:05:28.358273567 +0100
> @@ -361,6 +361,9 @@ static void unix_sock_destructor(struct
>  	if (u->addr)
>  		unix_release_addr(u->addr);
>
> +	if (u->poll_peer)
> +		sock_put(u->poll_peer);
> +
>  	atomic_long_dec(&unix_nr_socks);
>  	local_bh_disable();
>  	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> @@ -625,6 +628,7 @@ static struct sock *unix_create1(struct
>  	u = unix_sk(sk);
>  	u->dentry = NULL;
>  	u->mnt	  = NULL;
> +	u->poll_peer = NULL;
>  	spin_lock_init(&u->lock);
>  	atomic_long_set(&u->inflight, 0);
>  	INIT_LIST_HEAD(&u->link);
> @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
>  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  				    poll_table *wait)
>  {
> -	struct sock *sk = sock->sk, *other;
> -	unsigned int mask, writable;
> +	struct sock *sk = sock->sk, *other, *pp;
> +	struct unix_sock *u;
> +	unsigned int mask, writable, dead;
> +
> +	u = unix_sk(sk);
> +	pp = u->poll_peer;
> +	if (pp) {
> +		u->poll_peer = NULL;
> +		sock_put(pp);
> +	}
>
>  	sock_poll_wait(file, sk_sleep(sk), wait);
>  	mask = 0;
> @@ -2170,7 +2182,20 @@ static unsigned int unix_dgram_poll(stru
>  	other = unix_peer_get(sk);
>  	if (other) {
>  		if (unix_peer(other) != sk) {
> -			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
> +			unix
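For readers following the thread from userspace, the sequence being debated can be sketched as a small reproducer. This is my own illustration, not code from the thread: the abstract socket name and the `run_demo()` helper are invented, and it assumes Linux's abstract-namespace AF_UNIX sockets. A client connects to a datagram server, fills the server's receive queue until the kernel returns -EAGAIN, and then polls for POLLOUT while the server closes. On fixed kernels the bounded poll() reports POLLOUT after the remote close; on the buggy ones discussed here it could hang, which is why the timeout is essential.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int run_demo(void)
{
	struct sockaddr_un addr;
	int srv, cli, i;
	char byte = 0;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	/* abstract-namespace address (Linux-specific); name is arbitrary */
	strcpy(addr.sun_path + 1, "dgram-poll-demo");

	srv = socket(AF_UNIX, SOCK_DGRAM, 0);
	cli = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (srv < 0 || cli < 0)
		return -1;
	if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)))
		return -1;
	if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)))
		return -1;
	fcntl(cli, F_SETFL, O_NONBLOCK);

	/* fill the server's receive queue until the kernel pushes back */
	for (i = 0; i < 1000000; i++) {
		if (send(cli, &byte, 1, 0) < 0) {
			assert(errno == EAGAIN || errno == EWOULDBLOCK);
			break;
		}
	}

	close(srv);	/* remote close: client should now see POLLOUT */

	struct pollfd pfd = { .fd = cli, .events = POLLOUT };
	int n = poll(&pfd, 1, 1000);	/* bounded, in case of the hang */

	printf("poll returned %d, revents=0x%x\n", n, pfd.revents);
	close(cli);
	return n > 0 ? 0 : -1;
}
```

After the peer is gone, a further send() from the client would fail with -ECONNREFUSED, which is the signal applications use to tear the socket down, as described at the top of the thread.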
Re: [PATCH v2 3/3] af_unix: optimize the unix_dgram_recvmsg()
On 10/05/2015 03:41 AM, Peter Zijlstra wrote:
> On Fri, Oct 02, 2015 at 08:44:02PM +0000, Jason Baron wrote:
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index f789423..b8ed1bc 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>
>> @@ -1079,6 +1079,9 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
>>
>>  	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);
>>
>> +	set_bit(UNIX_NOSPACE, &u->flags);
>> +	/* pairs with mb in unix_dgram_recv */
>> +	smp_mb__after_atomic();
>>  	sched = !sock_flag(other, SOCK_DEAD) &&
>>  		!(other->sk_shutdown & RCV_SHUTDOWN) &&
>>  		unix_recvq_full(other);
>> @@ -1623,17 +1626,22 @@ restart:
>>
>>  	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>>  		if (!timeo) {
>> +			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
>> +			/* pairs with mb in unix_dgram_recv */
>> +			smp_mb__after_atomic();
>> +			if (unix_recvq_full(other)) {
>> +				err = -EAGAIN;
>> +				goto out_unlock;
>> +			}
>> +		} else {
>
>> @@ -1939,8 +1947,14 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>>  		goto out_unlock;
>>  	}
>>
>> +	/* pairs with unix_dgram_poll() and wait_for_peer() */
>> +	smp_mb();
>> +	if (test_bit(UNIX_NOSPACE, &u->flags)) {
>> +		clear_bit(UNIX_NOSPACE, &u->flags);
>> +		wake_up_interruptible_sync_poll(&u->peer_wait,
>> +						POLLOUT | POLLWRNORM |
>> +						POLLWRBAND);
>> +	}
>>
>>  	if (msg->msg_name)
>>  		unix_copy_addr(msg, skb->sk);
>> @@ -2468,20 +2493,19 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>>  	if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
>>  		return mask;
>>
>>  	other = unix_peer_get(sk);
>> +	if (unix_dgram_writable(sk, other)) {
>>  		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
>> +	} else {
>>  		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
>> +		set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
>> +		/* pairs with mb in unix_dgram_recv */
>> +		smp_mb__after_atomic();
>> +		if (unix_dgram_writable(sk, other))
>> +			mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
>> +	}
>> +	if (other)
>> +		sock_put(other);
>>
>>  	return mask;
>>  }
>
> So I must object to these barrier comments; stating which other barrier
> they pair with is indeed good and required, but its not sufficient.
>
> A barrier comment should also explain the data ordering -- the most
> important part.
>
> As it stands its not clear to me these barriers are required at all, but
> this is not code I understand so I might well miss something obvious.

So in patch 1/3 I've added sockets that call connect() onto the 'peer_wait'
queue of the server. The reason being that when the server reads from its
socket (unix_dgram_recvmsg()) it can wake up n clients that may be waiting
for the receive queue of the server to empty. This was previously
accomplished in unix_dgram_poll() via the:

	sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);

which I've removed. The issue with that approach is that the 'other' socket,
or server socket, can close() and go away out from under the poll() call.
Thus, I've converted unix_dgram_poll() back to simply waiting on its own
socket's wait queue, which is the semantics that poll()/select()/epoll()
expect. However, I still needed a way to signal the client socket, and thus
I've added the callback on connect() in patch 1/3.

However, now that the client sockets are added during connect(), and not
during poll() as was previously done, this means that any call to
unix_dgram_recvmsg() has to walk the entire wakeup list, even if nobody is
sitting in poll(). Thus, I've introduced the UNIX_NOSPACE flag to signal to
the server side that we are in poll(), so that it doesn't have to
potentially walk a long list of wakeups for no reason. This makes a
difference on this test case:

	http://www.spinics.net/lists/netdev/msg145533.html

which I found while working on this issue, and this patch restores the
performance for that test case. So what we need to guarantee is that we
either see t
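The ordering question Peter raises is the classic store-load ("Dekker") pattern: the sender stores the NOSPACE flag and then re-loads the queue state, while the receiver updates the queue and then loads the flag; a full barrier on each side ensures at least one of them observes the other, so a wakeup is never lost. Below is a hedged userspace analogue of my own (the `qlen`/`nospace`/`run_demo` names are invented) using C11 atomics and a POSIX semaphore in place of set_bit()/smp_mb()/wake_up():

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

#define LIMIT	8
#define ITEMS	10000

static atomic_int qlen;		/* stand-in for the receive queue length */
static atomic_int nospace;	/* stand-in for the UNIX_NOSPACE bit */
static sem_t wake;		/* stand-in for the peer_wait wakeup */

static void *sender(void *arg)
{
	for (int i = 0; i < ITEMS; i++) {
		while (atomic_load(&qlen) >= LIMIT) {
			/* like set_bit(UNIX_NOSPACE) + smp_mb__after_atomic() */
			atomic_store(&nospace, 1);
			atomic_thread_fence(memory_order_seq_cst);
			if (atomic_load(&qlen) < LIMIT)	/* recheck closes the race */
				break;
			sem_wait(&wake);	/* sleep until the receiver wakes us */
		}
		atomic_fetch_add(&qlen, 1);	/* "queue a datagram" */
	}
	return NULL;
}

static void *receiver(void *arg)
{
	for (int i = 0; i < ITEMS; i++) {
		while (atomic_load(&qlen) == 0)
			;			/* wait for something to consume */
		atomic_fetch_sub(&qlen, 1);	/* "dequeue a datagram" */
		/* like smp_mb() in recvmsg, then test-and-clear the flag */
		atomic_thread_fence(memory_order_seq_cst);
		if (atomic_exchange(&nospace, 0))
			sem_post(&wake);	/* wake only if someone asked */
	}
	return NULL;
}

int run_demo(void)
{
	pthread_t s, r;

	sem_init(&wake, 0, 0);
	pthread_create(&s, NULL, sender, NULL);
	pthread_create(&r, NULL, receiver, NULL);
	pthread_join(s, NULL);
	pthread_join(r, NULL);
	return atomic_load(&qlen);	/* 0: nothing lost, no deadlock */
}
```

Without the fences (or with the flag stored after the recheck), the sender could see a stale "queue full" while the receiver sees a stale "flag clear", and both would go idle forever — exactly the lost-wakeup the barrier comments are meant to document.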
[PATCH] unix: fix use-after-free with unix_dgram_poll()
From: Jason Baron <jba...@akamai.com>

The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we've called poll() on, but it also
calls sock_poll_wait() for a remote peer socket's wait queue, if it's
connected. Thus, if we call poll()/select()/epoll() for the socket s, there
are then a couple of code paths in which the remote peer socket s2 and its
associated peer_wait queue can be freed before poll()/select()/epoll() have
a chance to remove themselves from this remote peer socket s2's wait queue.

The remote peer's socket and associated wait queues can be freed via:

1. If s calls connect() to connect to a new socket other than s2, it will
drop its reference on s2, and thus a close() on s2 will free it.

2. If we call close() on s2, then a subsequent sendmsg() from s will drop
the final reference to s2, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with the
wait queue associated with s, and simply drop the second sock_poll_wait()
registration for the remote peer socket's wait queue. This then presents
the expected semantics to poll()/select()/epoll().

This works because we will continue to get POLLOUT wakeups from
unix_write_space(), which is called via sock_wfree(). In fact, we avoid
having two wakeup calls here for every buffer we read, since
unix_dgram_recvmsg() unconditionally calls wake_up_interruptible_sync_poll()
on its 'peer_wait' queue and we will no longer be in poll against that
queue. So I think this should be more performant than the current code, and
we avoid the second poll() callback invocation during registration as well.

unix_write_space() should probably be enhanced such that it checks for the
unix_recvq_full() condition as well. In fact, it should probably look for
some fraction of that buffer being free, as is done in unix_writable(). But
I'm considering that a separate enhancement from fixing this issue.
I've tested this by specifically reproducing cases #1 and #2 above, as well
as by running the test code here: https://lkml.org/lkml/2015/9/13/195

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/unix/af_unix.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..c1ae595 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2441,7 +2441,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	other = unix_peer_get(sk);
 	if (other) {
 		if (unix_peer(other) != sk) {
-			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
 			if (unix_recvq_full(other))
 				writable = 0;
 		}
--
1.8.2.rc2
Re: [PATCH] unix: fix use-after-free with unix_dgram_poll()
On 10/02/2015 03:30 PM, Rainer Weikusat wrote:
> Jason Baron <jba...@akamai.com> writes:
>> From: Jason Baron <jba...@akamai.com>
>>
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we've called poll() on, but it also
>> calls sock_poll_wait() for a remote peer socket's wait queue, if it's
>> connected. Thus, if we call poll()/select()/epoll() for the socket s, there
>> are then a couple of code paths in which the remote peer socket s2 and its
>> associated peer_wait queue can be freed before poll()/select()/epoll() have
>> a chance to remove themselves from this remote peer socket s2's wait queue.
>
> [...]
>
>> This works because we will continue to get POLLOUT wakeups from
>> unix_write_space(), which is called via sock_wfree().
>
> As pointed out in my original comment, this doesn't work (as far as I
> can/could tell) because it will only wake up sockets which had a chance
> to enqueue datagrams to the queue of the receiving socket, as only
> skbuffs enqueued there will be consumed. A socket which is really
> waiting for space in the receiving queue won't ever be woken up in this
> way.

Ok, good point. I was hoping to avoid a more complex approach here. I think
then that the patch I posted in the previous thread on this would address
this concern. I will post it for review.

> Further, considering that you're demonstrably not interested in
> debugging and fixing this issue (as you haven't even bothered to post
> one of the test programs you claim to have), I'm beginning to wonder why
> this tripe is being sent to me at all --- it's not "git on autopilot"
> this time as someone took the time to dig up my current e-mail address
> as the one in the original commit is not valid anymore. Could you please
> refrain from such exercises in future unless a discussion is actually
> intended?

Just trying to help fix this.

Thanks,

-Jason
[PATCH v2 0/3] af_unix: fix use-after-free
Hi,

These patches are against mainline; I can re-base to net-next, just let me
know. They have been tested against https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly, and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

Jason Baron (3):
  unix: fix use-after-free in unix_dgram_poll()
  af_unix: Convert gc_flags to flags
  af_unix: optimize the unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c    | 104 ++
 net/unix/garbage.c    |  12 +++---
 3 files changed, 88 insertions(+), 32 deletions(-)

--
1.8.2.rc2
[PATCH v2 2/3] af_unix: Convert gc_flags to flags
Convert gc_flags to flags in preparation for the subsequent patch, which will
make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
 	atomic_long_t		inflight;
 	spinlock_t		lock;
 	unsigned char		recursion_level;
-	unsigned long		gc_flags;
+	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 				 * have been added to the queues after
 				 * starting the garbage collection
 				 */
-				if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
+				if (test_bit(UNIX_GC_CANDIDATE, &u->flags)) {
 					hit = true;

 					func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 	 * of the list, so that it's checked even if it was already
 	 * passed over
 	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
+	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->flags))
 		list_move_tail(&u->link, &gc_candidates);
 }
@@ -305,8 +305,8 @@ void unix_gc(void)
 		BUG_ON(total_refs < inflight_refs);
 		if (total_refs == inflight_refs) {
 			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__set_bit(UNIX_GC_CANDIDATE, &u->flags);
+			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 		}
 	}
@@ -332,7 +332,7 @@ void unix_gc(void)
 		if (atomic_long_read(&u->inflight) > 0) {
 			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 			scan_children(&u->sk, inc_inflight_move_tail, NULL);
 		}
 	}
@@ -343,7 +343,7 @@ void unix_gc(void)
 	 */
 	while (!list_empty(&not_cycle_list)) {
 		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
+		__clear_bit(UNIX_GC_CANDIDATE, &u->flags);
 		list_move_tail(&u->link, &gc_inflight_list);
 	}
--
1.8.2.rc2
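Since the point of this rename is sharing one flags word between the GC and a new, unrelated bit, here is a tiny illustration of why that works: test_bit()-style helpers address a bit number inside the same unsigned long, so setting or clearing the GC bits leaves the later UNIX_NOSPACE bit untouched. The helpers below are simplified, non-atomic userspace stand-ins of my own, not the kernel's implementations:

```c
#include <assert.h>

/* Bit numbers as defined in include/net/af_unix.h; UNIX_NOSPACE is the
 * non-gc bit the following patch in the series introduces. */
#define UNIX_GC_CANDIDATE	0
#define UNIX_GC_MAYBE_CYCLE	1
#define UNIX_NOSPACE		2

/* Simplified stand-ins for __set_bit(), __clear_bit() and test_bit(). */
static void set_bit_ul(int nr, unsigned long *w)       { *w |=  1UL << nr; }
static void clear_bit_ul(int nr, unsigned long *w)     { *w &= ~(1UL << nr); }
static int  test_bit_ul(int nr, const unsigned long *w) { return (*w >> nr) & 1; }

int flags_demo(void)
{
	unsigned long flags = 0;	/* formerly gc_flags */

	set_bit_ul(UNIX_GC_CANDIDATE, &flags);
	set_bit_ul(UNIX_GC_MAYBE_CYCLE, &flags);
	set_bit_ul(UNIX_NOSPACE, &flags);

	/* clearing the GC bits must not disturb the unrelated bit */
	clear_bit_ul(UNIX_GC_CANDIDATE, &flags);
	clear_bit_ul(UNIX_GC_MAYBE_CYCLE, &flags);

	return test_bit_ul(UNIX_NOSPACE, &flags);	/* still set */
}
```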
[PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are polling against, but also for
a remote peer socket p, if it is connected. Thus, if we call
poll()/select()/epoll() for the socket s, there are a couple of code paths
in which the remote peer socket p and its associated peer_wait queue can be
freed before poll()/select()/epoll() have a chance to remove themselves from
the remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to a new socket other than p, it will drop its
reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop the
final reference to p, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with the
wait queue associated with s, and instead register a callback with the
remote peer socket on connect() that will wake up the wait queue associated
with s. If scenario 1 or 2 above occurs, we simply remove the callback from
the remote peer. This presents the expected semantics to
poll()/select()/epoll().

I've implemented this for the socket types SOCK_RAW, SOCK_DGRAM, and
SOCK_SEQPACKET, but not for SOCK_STREAM, since SOCK_STREAM does not use
unix_dgram_poll().
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		wait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	skpair = unix_peer(sk);

 	if (skpair != NULL) {
+		if (sk->sk_type != SOCK_STREAM)
+			remove_wait_queue(&unix_sk(skpair)->peer_wait,
+					  &u->wait);
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
 			unix_state_lock(skpair);
 			/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;

+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct unix_sock *u;
+
+	u = container_of(wait, struct unix_sock, wait);
+	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+	return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->wait, peer_wake);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
 	if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 	 */
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
+
+		remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+				  &unix_sk(sk)->wait);
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);

 		if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
 			sock_put(old_peer);
 	} else {
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);
 	}
+	/* New remote may have created write space for us */
+	wake_up_interruptible_sync_poll(sk_sleep(sk),
+					POLLOUT | POLLWRNORM | POLLWRBAND);
 	return 0;

out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 	sock_hold(sk);
 	unix_peer(newsk)	= sk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
 	unix_peer(sk) = newsk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_s
[PATCH v2 3/3] af_unix: optimize the unix_dgram_recvmsg()
Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters interested in wait events.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 72 ++-
 2 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
 	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
+#define UNIX_NOSPACE		2
 	struct socket_wq	peer_wq;
 	wait_queue_t		wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..b8ed1bc 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
 	return s;
 }

-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,9 @@ static long unix_wait_for_peer(struct sock *other, long timeo)

 	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);

+	set_bit(UNIX_NOSPACE, &u->flags);
+	/* pairs with mb in unix_dgram_recv */
+	smp_mb__after_atomic();
 	sched = !sock_flag(other, SOCK_DEAD) &&
 		!(other->sk_shutdown & RCV_SHUTDOWN) &&
 		unix_recvq_full(other);
@@ -1623,17 +1626,22 @@ restart:

 	if (unix_peer(other) != sk && unix_recvq_full(other)) {
 		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
-
-		timeo = unix_wait_for_peer(other, timeo);
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+			/* pairs with mb in unix_dgram_recv */
+			smp_mb__after_atomic();
+			if (unix_recvq_full(other)) {
+				err = -EAGAIN;
+				goto out_unlock;
+			}
+		} else {
+			timeo = unix_wait_for_peer(other, timeo);

-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;

-		goto restart;
+			goto restart;
+		}
 	}

 	if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1947,14 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}

-	wake_up_interruptible_sync_poll(&u->peer_wait,
-					POLLOUT | POLLWRNORM | POLLWRBAND);
+	/* pairs with unix_dgram_poll() and wait_for_peer() */
+	smp_mb();
+	if (test_bit(UNIX_NOSPACE, &u->flags)) {
+		clear_bit(UNIX_NOSPACE, &u->flags);
+		wake_up_interruptible_sync_poll(&u->peer_wait,
+						POLLOUT | POLLWRNORM |
+						POLLWRBAND);
+	}

 	if (msg->msg_name)
 		unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2446,22 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	return mask;
 }

+static bool unix_dgram_writable(struct sock *sk, struct sock *other)
+{
+	bool writable;
+
+	writable = unix_writable(sk);
+	if (other && unix_peer(other) != sk && unix_recvq_full(other))
+		writable = false;
+
+	return writable;
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 				    poll_table *wait)
 {
 	struct sock *sk = sock->sk, *other;
-	unsigned int mask, writable;
+	unsigned int mask;

 	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
@@ -2468,20 +2493,19 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
 		return mask;

-	writable = unix_writable(sk);
 	other = unix_peer_get(sk);
-	if (other) {
-		if (unix_peer(other) != sk) {
-			if (unix_recvq_full(other))
-				writable = 0;
-		}
-		sock_put(other);
-	}
-
-	if (writable)
+	if (unix_dgram_writable(sk, other)) {
 		mask |= POLLO
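The writability test being refactored here is simple enough to check by hand: `(wmem_alloc << 2) <= sndbuf` means a socket counts as writable while the bytes already queued are at most a quarter of sk_sndbuf, and a connected dgram sender is additionally throttled when the peer's receive queue is full. A userspace model of my own (plain longs and bools stand in for struct sock state):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of unix_writable(): writable while wmem_alloc <= sndbuf / 4. */
static bool unix_writable_model(long wmem_alloc, long sndbuf)
{
	return (wmem_alloc << 2) <= sndbuf;
}

/* Model of unix_dgram_writable(): also require the connected peer's
 * receive queue not to be full (unless the peer is the socket itself,
 * as with socketpair()). */
static bool unix_dgram_writable_model(long wmem_alloc, long sndbuf,
				      bool have_peer, bool peer_is_self,
				      bool peer_recvq_full)
{
	bool writable = unix_writable_model(wmem_alloc, sndbuf);

	if (have_peer && !peer_is_self && peer_recvq_full)
		writable = false;
	return writable;
}
```

This is the "fraction of the buffer being free" logic the earlier message suggested unix_write_space() should eventually mirror.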
Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
On 09/30/2015 03:34 AM, Michal Kubecek wrote:
> On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron <jba...@akamai.com> wrote:
>>> However, if we call connect on socket 's', to connect to a new socket
>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>> now close socket 'o' without unregistering from epoll. Then, when we
>>> either close the ep or unregister 'o', we end up with this list
>>> corruption. Thus, this is not a race per se, but can be triggered
>>> sequentially.
>>
>> Sounds profound, but the reproducer calls connect only once per
>> socket. So there is no "connect to a new socket", no?
>
> I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
> still connected to it. This is detected by 's' in unix_dgram_sendmsg()
> so that 's' releases its reference on 'o' and 'o' can be freed. If this
> happens before 's' is unregistered, we get use-after-free as 'o' has
> never been unregistered. And as the interval between freeing 'o' and
> unregistering 's' can be quite long, there is a chance for the memory to
> be reused. This is what one of our customers has seen:
>
>     [exception RIP: _raw_spin_lock_irqsave+156]
>     RIP: 8040f5bc  RSP: 8800e929de78  RFLAGS: 00010082
>     RAX: a32c  RBX: 88003954ab80  RCX: 1000
>     RDX: f232  RSI: f232  RDI: 88003954ab80
>     RBP: 5220  R8: dead00100100  R9:
>     R10: 7fff1a284960  R11: 0246  R12:
>     R13: 8800e929de8c  R14: 000e  R15:
>     ORIG_RAX:   CS: 1e030  SS: e02b
>  #8 [8800e929de70] _raw_spin_lock_irqsave at 8040f5a9
>  #9 [8800e929deb0] remove_wait_queue at 8006ad09
> #10 [8800e929ded0] ep_unregister_pollwait at 80170043
> #11 [8800e929def0] ep_remove at 80170073
> #12 [8800e929df10] sys_epoll_ctl at 80171453
> #13 [8800e929df80] system_call_fastpath at 80417553
>
> In this case, crash happened on unregistering 's' which had null peer
> (i.e. not reconnected but rather disconnected) but there were still two
> items in the list, the other pointing to an unallocated page which has
> apparently been modified in between.
>
> IMHO unix_dgram_disconnected() could be the place to handle this issue:
> it is called from both places where we disconnect from a peer (dead peer
> detection in unix_dgram_sendmsg() and reconnect in unix_dgram_connect())
> just before the reference to peer is released. I'm not familiar with the
> epoll implementation so I'm still trying to find what exactly needs to
> be done to unregister the peer at this moment.

Indeed, that is a path as well, and the patch I posted here deals with it:
unix_dgram_disconnected() gets called, as you point out, when we are
removing the remote peer, and I do a remove_wait_queue() in that case.

The patch I posted converts us back to polling against a single wait queue.
The wait structure that epoll()/select()/poll() adds to the peer wait queue
is really opaque to the unix code. The normal pattern is for
epoll()/select()/poll() to do the unregister, not the socket/fd that we are
waiting on. Further, we could not just release all of the wait queues in
unix_dgram_disconnected() b/c there could be multiple waiters there. So the
POLLFREE thing really has to be done from the unix_sock_destructor() path,
since that is going to free the entire queue.

In addition, I think that removing the wait queue from
unix_dgram_disconnected() would still be broken, b/c we would need to re-add
to the remote peer via a subsequent poll() to get events if the socket was
re-connected to a new peer.

Thanks,

-Jason
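The scenario Michal describes can be written down as a short userspace sequence. The sketch below is my own (the abstract socket name and `run_demo()` helper are invented, and it assumes Linux AF_UNIX semantics): register a connected dgram socket s with epoll, close its peer o, then send from s — the failed send drops s's last reference on o. On fixed kernels every step succeeds; on the buggy ones, the final EPOLL_CTL_DEL walked a wait queue inside freed memory.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int run_demo(void)
{
	struct sockaddr_un addr;
	int ep, o, s;
	char byte = 0;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path + 1, "epoll-uaf-demo");	/* abstract name */

	o = socket(AF_UNIX, SOCK_DGRAM, 0);	/* the peer, 'o' */
	s = socket(AF_UNIX, SOCK_DGRAM, 0);	/* the polled socket, 's' */
	ep = epoll_create1(0);
	if (o < 0 || s < 0 || ep < 0)
		return -1;
	if (bind(o, (struct sockaddr *)&addr, sizeof(addr)))
		return -1;
	if (connect(s, (struct sockaddr *)&addr, sizeof(addr)))
		return -1;

	struct epoll_event ev = { .events = EPOLLOUT, .data.fd = s };
	if (epoll_ctl(ep, EPOLL_CTL_ADD, s, &ev))
		return -1;

	close(o);	/* the peer becomes SOCK_DEAD */

	/* the send fails and drops s's final reference on the dead peer */
	if (send(s, &byte, 1, 0) != -1 || errno != ECONNREFUSED)
		return -1;

	/* on the buggy kernels this unregistration touched freed memory */
	if (epoll_ctl(ep, EPOLL_CTL_DEL, s, NULL))
		return -1;

	close(s);
	close(ep);
	return 0;
}
```

Note the ECONNREFUSED from send(): that is the dead-peer detection in unix_dgram_sendmsg() that Michal identifies as the moment the last reference is dropped.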
Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
On 09/30/2015 01:54 AM, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron <jba...@akamai.com> wrote:
>> However, if we call connect on socket 's', to connect to a new socket
>> 'o2', we drop the reference on the original socket 'o'. Thus, we can now
>> close socket 'o' without unregistering from epoll. Then, when we either
>> close the ep or unregister 'o', we end up with this list corruption.
>> Thus, this is not a race per se, but can be triggered sequentially.
>
> Sounds profound, but the reproducer calls connect only once per
> socket. So there is no "connect to a new socket", no?
> But w/e, see below.

Yes, but it can be reproduced this way too. It can also happen with a
close() on the remote peer 'o' and a send to 'o' from 's', which the
reproducer can do, as pointed out by Michal. The patch I sent deals with
both cases.

Thanks,

-Jason
Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
On 09/29/2015 02:09 PM, Mathias Krause wrote:
> On 14 September 2015 at 04:39, Eric Wong <normalper...@yhbt.net> wrote:
>> +cc Jason Baron since he might be able to provide more insight into
>> epoll.
>>
>> Mathias Krause <mini...@googlemail.com> wrote:
>>> Hi,
>>>
>>> this is an attempt to resurrect the thread initially started here:
>>>
>>>   http://thread.gmane.org/gmane.linux.network/353003
>>>
>>> As that patch fixed the issue for the mentioned reproducer, it did not
>>> fix the bug for the production code Olivier is using. :(
>>>
>>> Changing the reproducer only slightly allows me to trigger the following
>>> list debug splat (CONFIG_DEBUG_LIST=y) reliably within seconds -- even
>>> with the above linked patch applied:
>>>
>>> [   50.264249] [ cut here ]
>>> [   50.264249] WARNING: CPU: 0 PID: 214 at lib/list_debug.c:59 __list_del_entry+0xa4/0xd0()
>>> [   50.264249] list_del corruption. prev->next should be 88003c2c1bb8, but was 88003f07bbb8
>>> [   50.264249] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
>>> [   50.264249] CPU: 0 PID: 214 Comm: epoll_bug Not tainted 4.2.0 #75
>>> [   50.264249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
>>> [   50.264249]  817e902e 88087d08 8155593c 0007
>>> [   50.264249]  88087d58 88087d48 8105202a 0001
>>> [   50.264249]  88003c2c1bb8 88003f07bb80 0286 88003f736640
>>> [   50.264249] Call Trace:
>>> [   50.264249]  [] dump_stack+0x4c/0x65
>>> [   50.264249]  [] warn_slowpath_common+0x8a/0xc0
>>> [   50.264249]  [] warn_slowpath_fmt+0x46/0x50
>>> [   50.264249]  [] __list_del_entry+0xa4/0xd0
>>> [   50.264249]  [] list_del+0x11/0x40
>>> [   50.264249]  [] remove_wait_queue+0x29/0x40
>>> [   50.264249]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
>>> [   50.264249]  [] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
>>> [   50.264249]  [] ep_remove+0x22/0x110
>>> [   50.264249]  [] SyS_epoll_ctl+0x62b/0xf70
>>> [   50.264249]  [] ? lockdep_sys_exit_thunk+0x12/0x14
>>> [   50.264249]  [] entry_SYSCALL_64_fastpath+0x12/0x6f
>>> [   50.264249] ---[ end trace d9af9b915df9667e ]---
>>> [   50.572100] [ cut here ]
>>> [   50.572100] WARNING: CPU: 1 PID: 212 at lib/list_debug.c:62 __list_del_entry+0xc3/0xd0()
>>> [   50.584263] list_del corruption. next->prev should be 88003f664c90, but was 88003f0cb5b8
>>> [   50.584263] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
>>> [   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: G        W 4.2.0 #75
>>> [   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
>>> [   50.584263]  817e902e 88003d37fce8 8155593c 0007
>>> [   50.584263]  88003d37fd38 88003d37fd28 8105202a 0001
>>> [   50.584263]  88003f664c90 88003f0cb580 0282 88003f731740
>>> [   50.584263] Call Trace:
>>> [   50.584263]  [] dump_stack+0x4c/0x65
>>> [   50.584263]  [] warn_slowpath_common+0x8a/0xc0
>>> [   50.584263]  [] warn_slowpath_fmt+0x46/0x50
>>> [   50.584263]  [] __list_del_entry+0xc3/0xd0
>>> [   50.584263]  [] list_del+0x11/0x40
>>> [   50.584263]  [] remove_wait_queue+0x29/0x40
>>> [   50.584263]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
>>> [   50.584263]  [] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
>>> [   50.584263]  [] ep_remove+0x22/0x110
>>> [   50.584263]  [] eventpoll_release_file+0x62/0xa0
>>> [   50.584263]  [] __fput+0x1af/0x200
>>> [   50.584263]  [] ? int_very_careful+0x5/0x3f
>>> [   50.584263]  [] fput+0xe/0x10
>>> [   50.584263]  [] task_work_run+0x8d/0xc0
>>> [   50.584263]  [] do_notify_resume+0x4f/0x60
>>> [   50.584263]  [] int_signal+0x12/0x17
>>> [   50.584263] ---[ end trace d9af9b915df9667f ]---
>>> [   50.584263] BUG: spinlock already unlocked on CPU#1, epoll_bug/212
Re: [PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On 08/11/2015 01:59 PM, Jason Baron wrote:
> On 08/11/2015 12:12 PM, Eric Dumazet wrote:
>> On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
>>> Yes, so the test case I'm using to test against is somewhat contrived.
>>> In that I am simply allocating around 40,000 sockets that are idle to
>>> create a 'permanent' memory pressure in the background. Then, I have
>>> just 1 flow that sets SO_SNDBUF, which results in the: poll(), write()
>>> loop.
>>>
>>> That said, we encountered this issue initially where we had 10,000+
>>> flows and whenever the system would get into memory pressure, we
>>> would see all the cpus spin at 100%. So the testcase I wrote was just
>>> a simplistic version for testing. But I am going to try and test
>>> against the more realistic workload where this issue was initially
>>> observed.
>>
>> Note that I am still trying to understand why we need to increase the
>> socket structure, for something which is inherently a problem of
>> sharing memory with an unknown (potentially big) number of sockets.
>
> I was trying to mirror the wakeups when SO_SNDBUF is not set, where we
> continue to trigger on 1/3 of the buffer being available, as the
> sk->sndbuf is shrunk. And I saw this value as dynamic depending on the
> number of sockets and read/write buffer usage. So that's where I was
> coming from with it. Also, at least with the .config I have, the
> tcp_sock structure didn't increase in size (although struct sock did
> go up by 8 and not 4).
>
>> I suggested to use a flag (one bit). If set, then we should fall back
>> to tcp_wmem[0] (each socket has 4096 bytes, so that we can avoid
>> starvation)
>
> Ok, I will test this approach.

Hi Eric,

So I created a test here with 20,000 streams, and if I set SO_SNDBUF
high enough on the server side, I can create tcp memory pressure above
tcp_mem[2]. In this case, with the 'one bit' approach using tcp_wmem[0]
as the wakeup threshold, I can still observe the 100% cpu spinning
issue, but with this v2 patch, cpu usage is minimal (1-2%). That is
because we don't guarantee tcp_wmem[0] above tcp_mem[2].

So using the 'one bit' definitely alleviates the spinning between
tcp_mem[1] and tcp_mem[2], but not above tcp_mem[2] in my testing.
Maybe nobody cares about this case (you are getting what you ask for
by using SO_SNDBUF), but it seems to me that it would be nice to avoid
this sort of behavior.

I also like the fact that with sk_effective_sndbuf, we keep doing
wakeups on 1/3 of the write buffer emptying, which keeps the wakeup
behavior consistent. In theory this would matter for a high latency
and bandwidth link, but in the testing I did, I didn't observe any
throughput differences between this v2 patch and the 'one bit'
approach.

As I mentioned, with this v2 the 'struct sock' grows by 4 bytes, but
struct tcp_sock does not increase. So since this is tcp specific, we
could add the sk_effective_sndbuf only to the struct tcp_sock.

So the 'one bit' approach definitely seems to me to be an improvement,
but I wanted to get feedback on this testing, before deciding how to
proceed.

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
From: Jason Baron <jba...@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the
effective write buffer space can be much lower than what was set using
SO_SNDBUF. For example, we may have set the buffer to 100kb, but we may
only be able to write 10kb. In this scenario poll()/select()/epoll()
are going to continuously return POLLOUT, followed by -EAGAIN from
write(), in a very tight loop.

Introduce sk->sk_effective_sndbuf, such that we can track the
'effective' size of the sndbuf, when we have a short write due to
memory pressure. By using sk->sk_effective_sndbuf instead of
sk->sk_sndbuf when we are under memory pressure, we can delay the
POLLOUT until 1/3 of the buffer clears, as we normally do. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto
tune the sk->sndbuf.

Although sk->sk_effective_sndbuf, and even sk->sk_sndbuf, can change in
subsequent calls to tcp_poll() (queued via wakeups), we can guarantee
that we will eventually satisfy the write space check in all cases,
since the write queue will eventually drain to 0, and sk->sk_sndbuf and
sk->sk_effective_sndbuf will be non-zero when used in the write space
check. If the write queue does not drain to 0, then the client is not
responding, and we will eventually time out the socket.

In my testing, this brought a single thread's cpu usage down from 100%
to ~1% while maintaining the same level of throughput.
Signed-off-by: Jason Baron <jba...@akamai.com>
---
Changes from V1:
 -Add READ_ONCE() in sk_stream_wspace() (Eric Dumazet)
 -Restrict sk_set_effective_sndbuf() to the write() path
---
 include/net/sock.h | 16 ++++++++++++++++
 net/core/sock.c    |  1 +
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c     | 20 +++++++++++++++-----
 4 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 43c6abc..86598e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -243,6 +243,8 @@ struct cg_proto;
  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
  *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
  *	@sk_sndbuf: size of send buffer in bytes
+ *	@sk_effective_sndbuf: Less than or equal to the size of sk_sndbuf when
+ *			      under memory pressure, 0 otherwise.
  *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  *	@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
@@ -380,6 +382,7 @@ struct sock {
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
+	int			sk_effective_sndbuf;
 	struct sk_buff_head	sk_write_queue;
 	kmemcheck_bitfield_begin(flags);
 	unsigned int		sk_shutdown  : 2,
@@ -779,6 +782,14 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline void sk_set_effective_sndbuf(struct sock *sk)
+{
+	if (sk->sk_wmem_queued > sk->sk_sndbuf)
+		sk->sk_effective_sndbuf = sk->sk_sndbuf;
+	else
+		sk->sk_effective_sndbuf = sk->sk_wmem_queued;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
@@ -789,6 +800,11 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
+	int effective_sndbuf = READ_ONCE(sk->sk_effective_sndbuf);
+
+	if (effective_sndbuf)
+		return effective_sndbuf - sk->sk_wmem_queued;
+
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d..4fce879 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2309,6 +2309,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_allocation	=	GFP_KERNEL;
 	sk->sk_rcvbuf		=	sysctl_rmem_default;
 	sk->sk_sndbuf		=	sysctl_wmem_default;
+	sk->sk_effective_sndbuf =	0;
 	sk->sk_state		=	TCP_CLOSE;
 	sk_set_socket(sk, sock);
 
diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a..7c175e7 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
 	if (sk_stream_is_writeable(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
+		sk->sk_effective_sndbuf = 0;
 
 		rcu_read_lock();
 		wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 45534a5..d935ad5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -923,8 +923,10 @@ new_segment:
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue
Re: [PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On 08/11/2015 10:49 AM, Eric Dumazet wrote:
> On Tue, 2015-08-11 at 14:38 +0000, Jason Baron wrote:
>> From: Jason Baron <jba...@akamai.com>
>>
>> In my testing, this brought a single thread's cpu usage down from
>> 100% to ~1% while maintaining the same level of throughput.
>
> Hi Jason.
>
> Could you give more details on this test?
>
> How many flows are competing?

Yes, so the test case I'm using to test against is somewhat contrived.
In that I am simply allocating around 40,000 sockets that are idle to
create a 'permanent' memory pressure in the background. Then, I have
just 1 flow that sets SO_SNDBUF, which results in the: poll(), write()
loop.

That said, we encountered this issue initially where we had 10,000+
flows and whenever the system would get into memory pressure, we would
see all the cpus spin at 100%. So the testcase I wrote was just a
simplistic version for testing. But I am going to try and test against
the more realistic workload where this issue was initially observed.

Thanks,

-Jason
Re: [PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On 08/11/2015 12:12 PM, Eric Dumazet wrote:
> On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
>> Yes, so the test case I'm using to test against is somewhat
>> contrived. In that I am simply allocating around 40,000 sockets that
>> are idle to create a 'permanent' memory pressure in the background.
>> Then, I have just 1 flow that sets SO_SNDBUF, which results in the:
>> poll(), write() loop.
>>
>> That said, we encountered this issue initially where we had 10,000+
>> flows and whenever the system would get into memory pressure, we
>> would see all the cpus spin at 100%. So the testcase I wrote was just
>> a simplistic version for testing. But I am going to try and test
>> against the more realistic workload where this issue was initially
>> observed.
>
> Note that I am still trying to understand why we need to increase the
> socket structure, for something which is inherently a problem of
> sharing memory with an unknown (potentially big) number of sockets.

I was trying to mirror the wakeups when SO_SNDBUF is not set, where we
continue to trigger on 1/3 of the buffer being available, as the
sk->sndbuf is shrunk. And I saw this value as dynamic depending on the
number of sockets and read/write buffer usage. So that's where I was
coming from with it. Also, at least with the .config I have, the
tcp_sock structure didn't increase in size (although struct sock did go
up by 8 and not 4).

> I suggested to use a flag (one bit). If set, then we should fall back
> to tcp_wmem[0] (each socket has 4096 bytes, so that we can avoid
> starvation)

Ok, I will test this approach.

Thanks,

-Jason