[PATCH v2 net-next] af_unix: ensure POLLOUT on remote close() for connected dgram socket

2018-08-03 Thread Jason Baron
Applications use -ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. However, when using connected
dgram unix sockets in a poll/write loop, a final POLLOUT event can be
missed when the remote end closes. Thus, the poll is stuck forever:

thread 1 (client)                     thread 2 (server)

connect() to server
write() returns -EAGAIN
unix_dgram_poll()
 -> unix_recvq_full() is true
                                      close()
                                       ->unix_release_sock()
                                        ->wake_up_interruptible_all()
unix_dgram_poll() (due to the
 wake_up_interruptible_all)
 -> unix_recvq_full() still is true
                                       ->free all skbs


Now thread 1 is stuck and will not receive any more wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs;
otherwise the 'free all skbs' step would in fact cause a wakeup and
a POLLOUT return. So the race here is probably fairly rare because
it means there are no skbs that thread 1 queued and that thread 1
schedules before the 'free all skbs' step.

This issue was reported as a hang when /dev/log is closed.

The fix is to signal POLLOUT if the socket is marked as SOCK_DEAD, which
means a subsequent write() will get -ECONNREFUSED.
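
A userspace loop of the kind described above looks roughly like the
following minimal sketch (illustrative only, not the actual reproducer;
the socket path and buffer size are made up):

/* Connected AF_UNIX/SOCK_DGRAM sender: block in poll() on EAGAIN and
 * treat ECONNREFUSED from write() as "the peer has closed".
 */
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	char buf[256] = "x";
	int fd;

	strcpy(addr.sun_path, "/tmp/dgram.sock");   /* made-up path */
	fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_NONBLOCK, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;

	for (;;) {
		if (write(fd, buf, sizeof(buf)) >= 0)
			continue;
		if (errno == ECONNREFUSED) {    /* remote closed: give up */
			close(fd);
			return 0;
		}
		if (errno == EAGAIN) {          /* peer queue full: wait for POLLOUT */
			struct pollfd pfd = { .fd = fd, .events = POLLOUT };
			poll(&pfd, 1, -1);      /* stuck here when the wakeup is missed */
		}
	}
}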

Reported-by: Ian Lance Taylor 
Cc: David Rientjes 
Cc: Rainer Weikusat 
Cc: Eric Dumazet 
Signed-off-by: Jason Baron 
---
v2: use a check for SOCK_DEAD, since skbs can be purged in unix_sock_destructor()
---
 net/unix/af_unix.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 1772a0e..d1edfa3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -430,7 +430,12 @@ static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
 
connected = unix_dgram_peer_wake_connect(sk, other);
 
-   if (unix_recvq_full(other))
+   /* If other is SOCK_DEAD, we want to make sure we signal
+* POLLOUT, such that a subsequent write() can get a
+* -ECONNREFUSED. Otherwise, if we haven't queued any skbs
+* to other and it's full, we will hang waiting for POLLOUT.
+*/
+   if (unix_recvq_full(other) && !sock_flag(other, SOCK_DEAD))
return 1;
 
if (connected)
-- 
1.9.1



[PATCH net] af_unix: ensure POLLOUT on remote close() for connected dgram sockets

2018-07-16 Thread Jason Baron
Applications use ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. When using connected dgram
unix sockets in a poll/write loop, this relies on POLLOUT being
signaled when the remote end closes. However, due to a race POLLOUT
can be missed when the remote closes:

thread 1 (client)                     thread 2 (server)

connect() to server
write() returns -EAGAIN
unix_dgram_poll()
 -> unix_recvq_full() is true
                                      close()
                                       ->unix_release_sock()
                                        ->wake_up_interruptible_all()
unix_dgram_poll() (due to the
 wake_up_interruptible_all)
 -> unix_recvq_full() still is true
                                       ->free all skbs


Now thread 1 is stuck and will not receive any more wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs;
otherwise the 'free all skbs' step would in fact cause a wakeup and
a POLLOUT return. So the race here is probably fairly rare because
it means there are no skbs that thread 1 queued and that thread 1
runs before the 'free all skbs' step. Nevertheless, this has been
observed when the syslog daemon closes /dev/log. Tested against
a reproducer that re-creates the syslog hang.
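
The reproducer's client side is essentially an edge-triggered epoll wait
for POLLOUT; a minimal sketch of just that part (illustrative only, the
socket path and timeout are made up; see the program referenced in the
original report for the full test):

#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	struct epoll_event ev = { 0 }, out;
	int fd, epfd, n;

	strcpy(addr.sun_path, "/tmp/dgram.sock");   /* made-up path */
	fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_NONBLOCK, 0);
	connect(fd, (struct sockaddr *)&addr, sizeof(addr));

	epfd = epoll_create1(0);
	ev.events = EPOLLOUT | EPOLLET;
	ev.data.fd = fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

	/* After write() has returned -1/EAGAIN, the client parks here and
	 * relies on the server's close() to generate an EPOLLOUT wakeup.
	 */
	n = epoll_wait(epfd, &out, 1, 20000);
	if (n == 0)
		fprintf(stderr, "stalled: no EPOLLOUT within 20s\n");
	return 0;
}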

The proposed fix is to move the wake_up_interruptible_all() call
after the 'free all skbs' step.

Reported-by: Ian Lance Taylor 
Cc: Rainer Weikusat 
Signed-off-by: Jason Baron 
---
 net/unix/af_unix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c0..de242cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -529,8 +529,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
sk->sk_state = TCP_CLOSE;
unix_state_unlock(sk);
 
-   wake_up_interruptible_all(&u->peer_wait);
-
skpair = unix_peer(sk);
 
if (skpair != NULL) {
@@ -560,6 +558,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
kfree_skb(skb);
}
 
+   /* after freeing skbs to make sure POLLOUT triggers */
+   wake_up_interruptible_all(&u->peer_wait);
+
if (path.dentry)
path_put(&path);
 
-- 
2.7.4



Re: Bug report: epoll can fail to report EPOLLOUT when unix datagram socket peer is closed

2018-07-11 Thread Jason Baron



On 06/26/2018 10:18 AM, Ian Lance Taylor wrote:
> I'm reporting what appears to be a bug in the Linux kernel's epoll
> support.  It seems that epoll appears to sometimes fail to report an
> EPOLLOUT event when the other side of an AF_UNIX/SOCK_DGRAM socket is
> closed.  This bug report started as a Go program reported at
> https://golang.org/issue/23604.  I've written a C program that
> demonstrates the same symptoms, at
> https://github.com/golang/go/issues/23604#issuecomment-398945027 .
> 
> The C program sets up an AF_UNIX/SOCK_DGRAM server and several
> identical clients, all running in non-blocking mode.  All the
> non-blocking sockets are added to epoll, using EPOLLET.  The server
> periodically closes and reopens its socket.  The clients look for
> ECONNREFUSED errors on their write calls, and close and reopen their
> sockets when they see one.
> 
> The clients will sometimes fill up their buffer and block with EAGAIN.
> At that point they expect the poller to return an EPOLLOUT event to
> tell them when they are ready to write again.  The expectation is that
> either the server will read data, freeing up buffer space, or will
> close the socket, which should cause the sending packets to be
> discarded, freeing up buffer space.  Generally the EPOLLOUT event
> happens.  But sometimes, the poller never returns such an event, and
> the client stalls.  In the test program this is reported as a client
> that waits more than 20 seconds to be told to continue.
> 
> A similar bug report was made, with few details, at
> https://stackoverflow.com/questions/38441059/edge-triggered-epoll-for-unix-domain-socket
> .
> 
> I've tested the program and seen the failure on kernel 4.9.0-6-amd64.
> A colleague has tested the program and seen the failure on
> 4.18.0-smp-DEV #3 SMP @1529531011 x86_64 GNU/Linux.
> 
> If there is a better way for me to report this, please let me know.
> 
> Thanks for your attention.
> 
> Ian
> 

Hi,

Thanks for the report and the test program. The patch below seems to
have cured the reproducer for me. But perhaps you can confirm?

Thanks,

-Jason


[PATCH] af_unix: ensure POLLOUT on remote close() for connected dgram socket

Applications use ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. When using connected dgram
unix sockets in a poll/write loop, this relies on POLLOUT being
signaled when the remote end closes. However, due to a race POLLOUT
can be missed when the remote closes:

thread 1 (client)                     thread 2 (server)

connect() to server
write() returns -EAGAIN
unix_dgram_poll()
 -> unix_recvq_full() is true
                                      close()
                                       ->unix_release_sock()
                                        ->wake_up_interruptible_all()
unix_dgram_poll() (due to the
 wake_up_interruptible_all)
 -> unix_recvq_full() still is true
                                       ->free all skbs


Now thread 1 is stuck and will not receive any more wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs;
otherwise the 'free all skbs' step would in fact cause a wakeup and
a POLLOUT return. So the race here is probably fairly rare because
it means there are no skbs that thread 1 queued and that thread 1
schedules before the 'free all skbs' step. Nevertheless, this has
been observed in the wild via syslog.

The proposed fix is to move the wake_up_interruptible_all() call
after the 'free all skbs' step.

Signed-off-by: Jason Baron 
---
 net/unix/af_unix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e5473c0..de242cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -529,8 +529,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
sk->sk_state = TCP_CLOSE;
unix_state_unlock(sk);

-   wake_up_interruptible_all(&u->peer_wait);
-
skpair = unix_peer(sk);

if (skpair != NULL) {
@@ -560,6 +558,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
kfree_skb(skb);
}

+   /* after freeing skbs to make sure POLLOUT triggers */
+   wake_up_interruptible_all(&u->peer_wait);
+
if (path.dentry)
path_put(&path);

-- 
2.7.4




[PATCH v4 2/3] qemu: virtio-net: use 64-bit values for feature flags

2018-01-05 Thread Jason Baron
In preparation for using some of the high order feature bits, make sure that
virtio-net uses 64-bit values everywhere.
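
The reason this matters, as a standalone demonstration (not QEMU code;
the bit numbers are just for illustration): a plain '1 << bit' is an int
shift, which cannot represent feature bits at or above bit 31 and is
undefined for bit 63, whereas '1ULL << bit' produces the full 64-bit mask.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Low feature bits happen to work with a 32-bit literal... */
	uint64_t low  = 1 << 23;
	/* ...but a high-order bit such as 63 needs a 64-bit literal; a plain
	 * '1 << 63' would be undefined behaviour on a 32-bit int and could
	 * not be stored in a 32-bit flags field in any case.
	 */
	uint64_t high = 1ULL << 63;

	printf("bit23=%#llx bit63=%#llx\n",
	       (unsigned long long)low, (unsigned long long)high);
	return 0;
}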

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c| 55 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..54823af 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 int i;
 
 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
 virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,46 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
+VIRTIO_NET_F_CSUM, true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO4, true),
-DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO6, true),
-DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ECN, true),
-DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_UFO, true),
-DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO4, true),
-DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO6, true),
-DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_ECN, true),
-DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_UFO, true),
-DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
 VIRTIO_NET_F_MRG_RXBUF, true),
-DEFINE_PROP_BIT("status", VirtIONet, host_features,
+DEFINE_PROP_BIT64("status", VirtIONet, host_features,
 VIRTIO_NET_F_STATUS, true),
-DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VQ, true),
-DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
   

[PATCH net-next v4 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2018-01-05 Thread Jason Baron
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 drivers/net/virtio_net.c| 23 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6fb7b65..4f27508 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1894,6 +1894,24 @@ static void virtnet_init_settings(struct net_device *dev)
vi->duplex = DUPLEX_UNKNOWN;
 }
 
+static void virtnet_update_settings(struct virtnet_info *vi)
+{
+   u32 speed;
+   u8 duplex;
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
+   return;
+
+   speed = virtio_cread32(vi->vdev, offsetof(struct virtio_net_config,
+ speed));
+   if (ethtool_validate_speed(speed))
+   vi->speed = speed;
+   duplex = virtio_cread8(vi->vdev, offsetof(struct virtio_net_config,
+ duplex));
+   if (ethtool_validate_duplex(duplex))
+   vi->duplex = duplex;
+}
+
 static const struct ethtool_ops virtnet_ethtool_ops = {
.get_drvinfo = virtnet_get_drvinfo,
.get_link = ethtool_op_get_link,
@@ -2147,6 +2165,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
vi->status = v;
 
if (vi->status & VIRTIO_NET_S_LINK_UP) {
+   virtnet_update_settings(vi);
netif_carrier_on(vi->dev);
netif_tx_wake_all_queues(vi->dev);
} else {
@@ -2695,6 +2714,7 @@ static int virtnet_probe(struct virtio_device *vdev)
schedule_work(&vi->config_work);
} else {
vi->status = VIRTIO_NET_S_LINK_UP;
+   virtnet_update_settings(vi);
netif_carrier_on(dev);
}
 
@@ -2796,7 +2816,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
-   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+   VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..5de6ed3 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_config {
__u16 max_virtqueue_pairs;
/* Default maximum transmit unit advice */
__u16 mtu;
+   /*
+* speed, in units of 1Mb. All values 0 to INT_MAX are legal.
+* Any other value stands for unknown.
+*/
+   __u32 speed;
+   /*
+* 0x00 - half duplex
+* 0x01 - full duplex
+* Any other value stands for unknown.
+*/
+   __u8 duplex;
 } __attribute__((packed));
 
 /*
-- 
2.6.1



[PATCH v4 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting

2018-01-05 Thread Jason Baron
We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need
for guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.

The ability to set linkspeed and duplex is useful in various cases
as described here:

16032be virtio_net: add ethtool support for set and get of settings

Using 'ethtool -s' continues to over-write the linkspeed/duplex
settings with this patch.

Patch 1/3 is against net-next, while patches 2-3/3 are the associated
qemu changes that would go in afterwards, since update-linux-headers.sh should
be run first. So the qemu patches are a demonstration of how I intend this
to work.

Thanks,

-Jason

linux changes:

changes from v3:
* break the speed/duplex read into a function and also call from virtnet_probe
  when status bit is not negotiated
* only do speed/duplex read in virtnet_config_changed_work() on LINK_UP

changes from v2:
* move speed/duplex read into virtnet_config_changed_work() so link up changes
  are detected

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c| 23 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 35 insertions(+), 1 deletion(-)


qemu changes:

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c | 87 -
 include/hw/virtio/virtio-net.h  |  5 +-
 include/standard-headers/linux/virtio_net.h | 13 +
 3 files changed, 77 insertions(+), 28 deletions(-)

-- 
2.6.1



[PATCH v4 3/3] qemu: add linkspeed and duplex settings to virtio-net

2018-01-05 Thread Jason Baron
Although linkspeed and duplex can be set in a linux guest via 'ethtool -s',
this requires custom ethtool commands for virtio-net by default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=1,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c | 32 +
 include/hw/virtio/virtio-net.h  |  3 +++
 include/standard-headers/linux/virtio_net.h | 13 
 3 files changed, 48 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 54823af..cd63659 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed */
+#define DUPLEX_UNKNOWN  0xff
+#define DUPLEX_HALF 0x00
+#define DUPLEX_FULL 0x01
+#define SPEED_UNKNOWN   -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };
 
@@ -89,6 +97,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
 memcpy(config, &netcfg, n->config_size);
 }
 
@@ -1941,6 +1951,26 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKOWN) and "
+   "INT_MAX");
+} else if (n->net_conf.speed >= 0) {
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2161,6 +2191,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
 uint16_t rx_queue_size;
 uint16_t tx_queue_size;
 uint16_t mtu;
+int32_t speed;
+char *duplex_str;
+uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..17c8531 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_confi

Re: [PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2018-01-04 Thread Jason Baron


On 01/04/2018 01:22 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 04, 2018 at 01:12:30PM -0500, Jason Baron wrote:
>>
>>
>> On 01/04/2018 12:05 PM, Michael S. Tsirkin wrote:
>>> On Thu, Jan 04, 2018 at 12:16:44AM -0500, Jason Baron wrote:
>>>> The ability to set speed and duplex for virtio_net is useful in various
>>>> scenarios as described here:
>>>>
>>>> 16032be virtio_net: add ethtool support for set and get of settings
>>>>
>>>> However, it would be nice to be able to set this from the hypervisor,
>>>> such that virtio_net doesn't require custom guest ethtool commands.
>>>>
>>>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>>>> the hypervisor to export a linkspeed and duplex setting. The user can
>>>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>>>
>>>> Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
>>>> is that device feature bits are to grow down from bit 63, since the
>>>> transports are starting from bit 24 and growing up.
>>>>
>>>> Signed-off-by: Jason Baron <jba...@akamai.com>
>>>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>>>> Cc: Jason Wang <jasow...@redhat.com>
>>>> Cc: virtio-...@lists.oasis-open.org
>>>> ---
>>>>  drivers/net/virtio_net.c| 19 ++-
>>>>  include/uapi/linux/virtio_net.h | 13 +
>>>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 6fb7b65..0b2d314 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct 
>>>> work_struct *work)
>>>>  
>>>>vi->status = v;
>>>>  
>>>> +  if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
>>>
>>> BTW we can avoid this read for when link goes down.
>>> Not a big deal but still.
>>
>> So you are saying that we can just set vi->speed and vi->duplex to
>> 'unknown' when the link goes down and not check for the presence of
>> VIRTIO_NET_F_SPEED_DUPLEX?
>>
>> If so, that could over-write what the user may have configured in the
>> guest via 'ethtool -s' when the link goes down, so that would be a
>> change in behavior, but perhaps that is ok?
> 
> No - what I am saying is that your patch overwrites the values
> set by user when link goes down.
> 
> I suggest limiting this call to when
> 
> if (vi->status & VIRTIO_NET_S_LINK_UP)
> 
> and then the values are overwritten when link goes up
> which seems closer to what a user might expect.
ok, will update this for v4.

Thanks,

-Jason

> 
>>
>> I think I would prefer to have the link down event still check for
>> VIRTIO_NET_F_SPEED_DUPLEX before changing speed/duplex. That way we
>> still have 2 modes for updating the fields:
>>
>> 1) completely guest controlled. Same as we have now and host does not
>> change any values and does not set VIRTIO_NET_F_SPEED_DUPLEX flag (hence
>> don't remove above check).
>>
>> 2) if speed or duplex is set in the qemu command line, then set
>> the VIRTIO_NET_F_SPEED_DUPLEX and have host control the settings of
>> speed/duplex (with ability of guest to over-write if it wanted to).
>>
>>
>>>
>>>> +  u32 speed;
>>>> +  u8 duplex;
>>>> +
>>>> +  speed = virtio_cread32(vi->vdev,
>>>> + offsetof(struct virtio_net_config,
>>>> +  speed));
>>>> +  if (ethtool_validate_speed(speed))
>>>> +  vi->speed = speed;
>>>> +  duplex = virtio_cread8(vi->vdev,
>>>> + offsetof(struct virtio_net_config,
>>>> +  duplex));
>>>> +  if (ethtool_validate_duplex(duplex))
>>>> +  vi->duplex = duplex;
>>>> +  }
>>>> +
>>>>if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>>>netif_carrier_on(vi->dev);
>>>>netif_tx_wake_all_queues(vi->dev);
>>>
>>> OK so this handles the case when VIRTIO_NET_F_STATUS is set,

Re: [PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2018-01-04 Thread Jason Baron


On 01/04/2018 12:05 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 04, 2018 at 12:16:44AM -0500, Jason Baron wrote:
>> The ability to set speed and duplex for virtio_net is useful in various
>> scenarios as described here:
>>
>> 16032be virtio_net: add ethtool support for set and get of settings
>>
>> However, it would be nice to be able to set this from the hypervisor,
>> such that virtio_net doesn't require custom guest ethtool commands.
>>
>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>> the hypervisor to export a linkspeed and duplex setting. The user can
>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>
>> Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
>> is that device feature bits are to grow down from bit 63, since the
>> transports are starting from bit 24 and growing up.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> Cc: Jason Wang <jasow...@redhat.com>
>> Cc: virtio-...@lists.oasis-open.org
>> ---
>>  drivers/net/virtio_net.c| 19 ++-
>>  include/uapi/linux/virtio_net.h | 13 +
>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 6fb7b65..0b2d314 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct 
>> work_struct *work)
>>  
>>  vi->status = v;
>>  
>> +if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
> 
> BTW we can avoid this read for when link goes down.
> Not a big deal but still.

So you are saying that we can just set vi->speed and vi->duplex to
'unknown' when the link goes down and not check for the presence of
VIRTIO_NET_F_SPEED_DUPLEX?

If so, that could over-write what the user may have configured in the
guest via 'ethtool -s' when the link goes down, so that would be a
change in behavior, but perhaps that is ok?

I think I would prefer to have the link down event still check for
VIRTIO_NET_F_SPEED_DUPLEX before changing speed/duplex. That way we
still have 2 modes for updating the fields:

1) completely guest controlled. Same as we have now and host does not
change any values and does not set VIRTIO_NET_F_SPEED_DUPLEX flag (hence
don't remove above check).

2) if speed or duplex is set in the qemu command line, then set
the VIRTIO_NET_F_SPEED_DUPLEX and have host control the settings of
speed/duplex (with ability of guest to over-write if it wanted to).


> 
>> +u32 speed;
>> +u8 duplex;
>> +
>> +speed = virtio_cread32(vi->vdev,
>> +   offsetof(struct virtio_net_config,
>> +speed));
>> +if (ethtool_validate_speed(speed))
>> +vi->speed = speed;
>> +duplex = virtio_cread8(vi->vdev,
>> +   offsetof(struct virtio_net_config,
>> +duplex));
>> +if (ethtool_validate_duplex(duplex))
>> +vi->duplex = duplex;
>> +}
>> +
>>  if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>  netif_carrier_on(vi->dev);
>>  netif_tx_wake_all_queues(vi->dev);
> 
> OK so this handles the case when VIRTIO_NET_F_STATUS is set,
> but when it's clear we need to call this from virtnet_probe.
> 
> I propose moving this chunk to a function and calling from two places.
> 

good point. will update.

Thanks,

-Jason

> 
>> @@ -2796,7 +2812,8 @@ static struct virtio_device_id id_table[] = {
>>  VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>  VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>  VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>> +VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> +VIRTIO_NET_F_SPEED_DUPLEX
>>  
>>  static unsigned int features[] = {
>>  VIRTNET_FEATURES,
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index fc353b5..5de6ed3 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -57,6 +57,8 @@
>>   * Steering */
>>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23   /* Set MAC address */
>>  
>> +#define VIRTIO_NET_F_SPEED_DUPLEX 63

Re: [PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2018-01-04 Thread Jason Baron


On 01/04/2018 11:27 AM, Michael S. Tsirkin wrote:
> On Thu, Jan 04, 2018 at 12:16:44AM -0500, Jason Baron wrote:
>> The ability to set speed and duplex for virtio_net is useful in various
>> scenarios as described here:
>>
>> 16032be virtio_net: add ethtool support for set and get of settings
>>
>> However, it would be nice to be able to set this from the hypervisor,
>> such that virtio_net doesn't require custom guest ethtool commands.
>>
>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>> the hypervisor to export a linkspeed and duplex setting. The user can
>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>
>> Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
>> is that device feature bits are to grow down from bit 63, since the
>> transports are starting from bit 24 and growing up.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> Cc: Jason Wang <jasow...@redhat.com>
>> Cc: virtio-...@lists.oasis-open.org
>> ---
>>  drivers/net/virtio_net.c| 19 ++-
>>  include/uapi/linux/virtio_net.h | 13 +
>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 6fb7b65..0b2d314 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct 
>> work_struct *work)
>>  
>>  vi->status = v;
>>  
>> +if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
>> +u32 speed;
>> +u8 duplex;
>> +
>> +speed = virtio_cread32(vi->vdev,
>> +   offsetof(struct virtio_net_config,
>> +speed));
>> +if (ethtool_validate_speed(speed))
>> +vi->speed = speed;
>> +duplex = virtio_cread8(vi->vdev,
>> +   offsetof(struct virtio_net_config,
>> +duplex));
>> +if (ethtool_validate_duplex(duplex))
>> +vi->duplex = duplex;
>> +}
>> +
>>  if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>  netif_carrier_on(vi->dev);
>>  netif_tx_wake_all_queues(vi->dev);
>> @@ -2796,7 +2812,8 @@ static struct virtio_device_id id_table[] = {
>>  VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>  VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>  VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>> +VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> +VIRTIO_NET_F_SPEED_DUPLEX
>>  
>>  static unsigned int features[] = {
>>  VIRTNET_FEATURES,
> 
> Still missing the update from virtnet_config_changed_work, and I think
> it's important to reflex host changes within guest when the
> feature bit has been acked.
> 

I update vi->speed and vi->duplex in virtnet_config_changed_work(). And
I tested using the 'set_link' on/off from the qemu monitor.
Specifically, an 'off' sets the speed and duplex to 'unknown', and an 'on'
returns them to the configured speed and duplex. So they
are being updated dynamically now. What host changes are you referring
to that are not reflected?

Thanks,

-Jason


>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index fc353b5..5de6ed3 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -57,6 +57,8 @@
>>   * Steering */
>>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23   /* Set MAC address */
>>  
>> +#define VIRTIO_NET_F_SPEED_DUPLEX 63/* Device set linkspeed and 
>> duplex */
>> +
>>  #ifndef VIRTIO_NET_NO_LEGACY
>>  #define VIRTIO_NET_F_GSO6   /* Host handles pkts w/ any GSO type */
>>  #endif /* VIRTIO_NET_NO_LEGACY */
>> @@ -76,6 +78,17 @@ struct virtio_net_config {
>>  __u16 max_virtqueue_pairs;
>>  /* Default maximum transmit unit advice */
>>  __u16 mtu;
>> +/*
>> + * speed, in units of 1Mb. All values 0 to INT_MAX are legal.
>> + * Any other value stands for unknown.
>> + */
>> +__u32 speed;
>> +/*
>> + * 0x00 - half duplex
>> + * 0x01 - full duplex
>> + * Any other value stands for unknown.
>> + */
>> +__u8 duplex;
>>  } __attribute__((packed));
>>  
>>  /*
>> -- 
>> 2.6.1


[PATCH v3 3/3] qemu: add linkspeed and duplex settings to virtio-net

2018-01-03 Thread Jason Baron
Although linkspeed and duplex can be set in a linux guest via 'ethtool -s',
this requires custom ethtool commands for virtio-net by default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=1,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c | 35 +
 include/hw/virtio/virtio-net.h  |  3 +++
 include/standard-headers/linux/virtio_net.h | 13 +++
 3 files changed, 51 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index adc20df..eec8422 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed */
+#define DUPLEX_UNKNOWN  0xff
+#define DUPLEX_HALF 0x00
+#define DUPLEX_FULL 0x01
+#define SPEED_UNKNOWN   -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };
 
@@ -89,6 +97,14 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
+if (n->status & VIRTIO_NET_S_LINK_UP) {
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
+} else {
+virtio_stl_p(vdev, &netcfg.speed, SPEED_UNKNOWN);
+netcfg.duplex = DUPLEX_UNKNOWN;
+}
+
 memcpy(config, &netcfg, n->config_size);
 }
 
@@ -1941,6 +1957,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKOWN) and "
+   "INT_MAX");
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2160,6 +2193,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
 uint16_t rx_queue_size;
 uint16_t tx_queue_size;
 uint16_t mtu;
+int32_t speed;
+char *duplex_str;
+uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..17c8531 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ 

[PATCH v3 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting

2018-01-03 Thread Jason Baron
We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need
for guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.

The ability to set linkspeed and duplex is useful in various cases
as described here:

16032be virtio_net: add ethtool support for set and get of settings 

  
Using 'ethtool -s' continues to over-write the linkspeed/duplex
settings with this patch.

Patch 1/3 is against net-next, while patches 2-3/3 are the associated
qemu changes that would go in afterwards, since update-linux-headers.sh should
be run first. So the qemu patches are a demonstration of how I intend this
to work.

Thanks,

-Jason  

linux changes:

changes from v2:
* move speed/duplex read into virtnet_config_changed_work() so link up changes
  are detected

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c| 19 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 31 insertions(+), 1 deletion(-)

qemu changes:

changes from v2:
* if link up return configured speed/duplex, else return UNKNOWN speed and 
duplex

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c | 89 -
 include/hw/virtio/virtio-net.h  |  5 +-
 include/standard-headers/linux/virtio_net.h | 13 +
 3 files changed, 79 insertions(+), 28 deletions(-)


-- 
2.6.1



[PATCH net-next v3 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2018-01-03 Thread Jason Baron
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 drivers/net/virtio_net.c| 19 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6fb7b65..0b2d314 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2146,6 +2146,22 @@ static void virtnet_config_changed_work(struct work_struct *work)
 
vi->status = v;
 
+   if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
+   u32 speed;
+   u8 duplex;
+
+   speed = virtio_cread32(vi->vdev,
+  offsetof(struct virtio_net_config,
+   speed));
+   if (ethtool_validate_speed(speed))
+   vi->speed = speed;
+   duplex = virtio_cread8(vi->vdev,
+  offsetof(struct virtio_net_config,
+   duplex));
+   if (ethtool_validate_duplex(duplex))
+   vi->duplex = duplex;
+   }
+
if (vi->status & VIRTIO_NET_S_LINK_UP) {
netif_carrier_on(vi->dev);
netif_tx_wake_all_queues(vi->dev);
@@ -2796,7 +2812,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
-   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+   VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..5de6ed3 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_config {
__u16 max_virtqueue_pairs;
/* Default maximum transmit unit advice */
__u16 mtu;
+   /*
+* speed, in units of 1Mb. All values 0 to INT_MAX are legal.
+* Any other value stands for unknown.
+*/
+   __u32 speed;
+   /*
+* 0x00 - half duplex
+* 0x01 - full duplex
+* Any other value stands for unknown.
+*/
+   __u8 duplex;
 } __attribute__((packed));
 
 /*
-- 
2.6.1



[PATCH v3 2/3] qemu: virtio-net: use 64-bit values for feature flags

2018-01-03 Thread Jason Baron
In preparation for using some of the high order feature bits, make sure that
virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c| 54 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..adc20df 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 int i;
 
 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
 virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,45 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, 
true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO4, true),
-DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO6, true),
-DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ECN, true),
-DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_UFO, true),
-DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO4, true),
-DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO6, true),
-DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_ECN, true),
-DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_UFO, true),
-DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
 VIRTIO_NET_F_MRG_RXBUF, true),
-DEFINE_PROP_BIT("status", VirtIONet, host_features,
+DEFINE_PROP_BIT64("status", VirtIONet, host_features,
 VIRTIO_NET_F_STATUS, true),
-DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VQ, true),
-DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
 V

Re: [PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2017-12-28 Thread Jason Baron


On 12/27/2017 04:43 PM, David Miller wrote:
> From: Jason Baron <jba...@akamai.com>
> Date: Fri, 22 Dec 2017 16:54:01 -0500
> 
>> The ability to set speed and duplex for virtio_net is useful in various
>> scenarios as described here:
>>
>> 16032be virtio_net: add ethtool support for set and get of settings
>>
>> However, it would be nice to be able to set this from the hypervisor,
>> such that virtio_net doesn't require custom guest ethtool commands.
>>
>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>> the hypervisor to export a linkspeed and duplex setting. The user can
>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> Cc: Jason Wang <jasow...@redhat.com>
> 
> Looks mostly fine to me but need some virtio_net reviewers on this one.
> 
>> @@ -57,6 +57,8 @@
>>   * Steering */
>>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23   /* Set MAC address */
>>  
>> +#define VIRTIO_NET_F_SPEED_DUPLEX 63/* Host set linkspeed and 
>> duplex */
>> +
> 
> Why use a value so far away from the largest existing one?
> 
> Just curious.
> 

So that came from a discussion with Michael about which bit to use for
this, and he suggested using 63:

"
Transports started from bit 24 and are growing up.
So I would say devices should start from bit 63 and grow down.
"

https://patchwork.ozlabs.org/patch/848814/#1826669

I will add a comment to explain it.

Thanks,

-Jason



[PATCH 3/3] qemu: add linkspeed and duplex settings to virtio-net

2017-12-22 Thread Jason Baron
Although linkspeed and duplex can be set in a linux guest via 'ethtool -s',
this requires custom ethtool commands for virtio-net by default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=1,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 hw/net/virtio-net.c | 29 +
 include/hw/virtio/virtio-net.h  |  3 +++
 include/standard-headers/linux/virtio_net.h |  4 
 3 files changed, 36 insertions(+)
 create mode 160000 pixman

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index adc20df..7fafe6e 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed defines */
+#define DUPLEX_UNKNOWN  0xff
+#define DUPLEX_HALF 0x00
+#define DUPLEX_FULL 0x01
+#define SPEED_UNKNOWN   -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };
 
@@ -88,6 +96,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.status, n->status);
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 memcpy(config, &netcfg, n->config_size);
 }
@@ -1941,6 +1951,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKOWN) and "
+   "INT_MAX");
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2160,6 +2187,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
 uint16_t rx_queue_size;
 uint16_t tx_queue_size;
 uint16_t mtu;
+int32_t speed;
+char *duplex_str;
+uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..0ff1447 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -36,6 +36,7 @@
 #define VIRTIO_NET_F_GUEST_CSUM1   /* Guest handles pkts w/ 
partial csum */
 #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
 #define VIRTIO_NET_F_MTU   3   /* Initial MTU advice */
+#define VIRTIO_NET_F_SPEED_DUPLEX 4/* Host set linkspeed and duplex */
 #define VIRTIO_NET_F_MAC   5   /* Host has given MAC address. */
 #define VIRTIO_NET_F_GUEST_TSO47   /* Guest can handle TSOv4 in. */
 #defi

[PATCH 2/3] qemu: use 64-bit values for feature flags in virtio-net

2017-12-22 Thread Jason Baron
In preparation for using some of the high order feature bits, make sure that
virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 hw/net/virtio-net.c| 54 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..adc20df 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 int i;
 
 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
 virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,45 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, 
true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO4, true),
-DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO6, true),
-DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ECN, true),
-DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_UFO, true),
-DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO4, true),
-DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO6, true),
-DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_ECN, true),
-DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_UFO, true),
-DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
 VIRTIO_NET_F_MRG_RXBUF, true),
-DEFINE_PROP_BIT("status", VirtIONet, host_features,
+DEFINE_PROP_BIT64("status", VirtIONet, host_features,
 VIRTIO_NET_F_STATUS, true),
-DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VQ, true),
-DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_RX, true),

[PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2017-12-22 Thread Jason Baron
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently override it via 'ethtool -s' if desired.
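
For illustration, assuming the companion qemu patches (2-3/3) expose the new
config fields as 'speed' and 'duplex' device properties (those property names
are my assumption here, not something defined by this patch), the host side
would look roughly like:

  qemu-system-x86_64 ... \
      -device virtio-net-pci,netdev=net0,speed=10000,duplex=full

and the guest would then report 10000Mb/s full duplex via ethtool until
overridden with 'ethtool -s'.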

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c| 17 -
 include/uapi/linux/virtio_net.h |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 511f833..4168d82 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2621,6 +2621,20 @@ static int virtnet_probe(struct virtio_device *vdev)
netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
virtnet_init_settings(dev);
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
+   u32 speed;
+   u8 duplex;
+
+   speed = virtio_cread32(vdev, offsetof(struct virtio_net_config,
+  speed));
+   if (ethtool_validate_speed(speed))
+   vi->speed = speed;
+   duplex = virtio_cread8(vdev,
+  offsetof(struct virtio_net_config,
+  duplex));
+   if (ethtool_validate_duplex(duplex))
+   vi->duplex = duplex;
+   }
 
err = register_netdev(dev);
if (err) {
@@ -2746,7 +2760,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
-   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+   VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..0f1548e 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Host set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,9 @@ struct virtio_net_config {
__u16 max_virtqueue_pairs;
/* Default maximum transmit unit advice */
__u16 mtu;
+   /* Host exported linkspeed and duplex */
+   __u32 speed;
+   __u8 duplex;
 } __attribute__((packed));
 
 /*
-- 
2.6.1



[PATCH v2 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting

2017-12-22 Thread Jason Baron
We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need
for guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.

  
The ability to set linkspeed and duplex is useful in various cases
as described here:

16032be virtio_net: add ethtool support for set and get of settings

Using 'ethtool -s' continues to over-write the linkspeed/duplex
settings with this patch.

The 1/3 patch is against net-next, while patches 2-3/3 are the associated
qemu changes that would go in afterwards, once update-linux-headers.sh has
been run. So the qemu patches are a demonstration of how I intend this to
work.

Thanks,

-Jason

 linux changes:

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c| 17 -
 include/uapi/linux/virtio_net.h |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

  

  
 qemu changes:

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c | 83 +++--
 include/hw/virtio/virtio-net.h  |  5 +-
 include/standard-headers/linux/virtio_net.h |  4 ++
 3 files changed, 64 insertions(+), 28 deletions(-)


[PATCH net] tcp: correct memory barrier usage in tcp_check_space()

2017-01-24 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.

Fixes: 3c7151275c0c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.duma...@gmail.com>
Cc: Oleg Nesterov <o...@redhat.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bfa165cc455a..1e22ae4a5b38 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5028,7 +5028,7 @@ static void tcp_check_space(struct sock *sk)
if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
/* pairs with tcp_poll() */
-   smp_mb__after_atomic();
+   smp_mb();
if (sk->sk_socket &&
    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
tcp_new_space(sk);
-- 
2.6.1



Re: wrong smp_mb__after_atomic() in tcp_check_space() ?

2017-01-23 Thread Jason Baron



On 01/23/2017 01:04 PM, Eric Dumazet wrote:

On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote:

On 01/23/2017 09:30 AM, Oleg Nesterov wrote:

Hello,

smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the
non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE)
(non-atomic too) won't be reordered.



Indeed. Here's a bit of discussion on it:
http://marc.info/?l=linux-netdev=146662325920596=2


It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths"
and the patch looks correct in that we need the barriers in tcp_check_space()
and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ?



Yes, I think it should be upgraded to an smp_mb() there. If you agree
with this analysis, I will send a patch to upgrade it. Note, I did not
actually run into this race in practice.


SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll().

(Otherwise it would be using atomic set/clear operations)

I do not see obvious reason why we have this smp_mb__after_atomic() in
tcp_check_space().




The idea of the  smp_mb__after_atomic() in tcp_check_space() was to 
ensure that the 'read' of SOCK_NOSPACE there didn't happen before any of 
the 'write' to make sk_stream_is_writeable() true. Otherwise, we could 
miss doing the wakeup from tcp_check_space(). There is probably an 
argument here that there will likely be a subsequent call to 
tcp_check_space() that will see the SOCK_NOSPACE bit set, but in theory 
we could have a small send buffer, or a lot of data could be ack'd in 
one go.
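
To make the intended pairing concrete, roughly (a sketch of the two paths,
not the exact code):

  CPU A: tcp_poll()                        CPU B: tcp_check_space()
  -----------------                        ------------------------
  set_bit(SOCK_NOSPACE, ...);              /* ACK frees skbs, sndbuf space */
  smp_mb__after_atomic();                  smp_mb();  /* needs a full mb */
  if (sk_stream_is_writeable(sk))          if (test_bit(SOCK_NOSPACE, ...))
          mask |= POLLOUT;                         tcp_new_space(sk);

If the SOCK_NOSPACE read in tcp_check_space() can move before the writes
that made the socket writeable, both sides can miss each other: tcp_poll()
sees no space and tcp_check_space() sees no waiter, so no wakeup.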


What I missed in the original patch was that sock_reset_flag() isn't an 
atomic operation and thus the smp_mb__after_atomic() is wrong.




But looking at this code, it seems we lack one barrier if sk_sndbuf is
ever increased. Fortunately this almost never happen during TCP session
lifetime...



But the wakeup from sk->sk_write_space(sk) will imply a smp_wmb() as per 
the comment in __wake_up() ?


Thanks,

-Jason



diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk)
sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
sndmem *= nr_segs * per_mss;

-   if (sk->sk_sndbuf < sndmem)
+   if (sk->sk_sndbuf < sndmem) {
sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+   /* Paired with second sk_stream_is_writeable(sk)
+* test from tcp_poll()
+*/
+   smp_wmb();
+   }
 }

 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)




Re: wrong smp_mb__after_atomic() in tcp_check_space() ?

2017-01-23 Thread Jason Baron

On 01/23/2017 09:30 AM, Oleg Nesterov wrote:

Hello,

smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the
non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE)
(non-atomic too) won't be reordered.



Indeed. Here's a bit of discussion on it:
http://marc.info/?l=linux-netdev=146662325920596=2


It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths"
and the patch looks correct in that we need the barriers in tcp_check_space()
and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ?



Yes, I think it should be upgraded to an smp_mb() there. If you agree 
with this analysis, I will send a patch to upgrade it. Note, I did not 
actually run into this race in practice.


Thanks,

-Jason




[PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-17 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab') are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (its RSTs are rate-limited) does not
reply with any RST as would be expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

A packetdrill test demonstrating the behavior:

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 
0.100 > S. 0:0(0) ack 1 
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a34e9278c07..bfa165cc455a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5199,6 +5199,23 @@ static int tcp_copy_to_iovec(struct sock *sk, struct 
sk_buff *skb, int hlen)
return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX rate limits replies to challenge
+ * ACKs on the closed socket. In addition middleboxes can drop either the
+ * challenge ACK or a subsequent RST.
+ */
+static bool tcp_reset_check(const struct sock *sk, const struct sk_buff *skb)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   return unlikely(TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1) &&
+   (1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK |
+  TCPF_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5237,20 +5254,25 @@ static bool tcp_validate_incoming(struct sock *sk, 
struct sk_buff *skb,
  LINUX_MIB_TCPACKSKIPPEDSEQ,
  &tp->last_oow_ack_time))
tcp_send_dupack(sk, skb);
+   } else if (tcp_reset_check(sk, skb)) {
+   tcp_reset(sk);
}
goto discard;
}
 
/* Step 2: check RST bit */
if (th->rst) {
-   /* RFC 5961 3.2 (extend to match against SACK too if available):
-* If seq num matches RCV.NXT or the right-most SACK block,
+   /* RFC 5961 3.2 (extend to match against (RCV.NXT - 1) after a
+* FIN and SACK too if available):
+* If seq num matches RCV.NXT or (RCV.NXT - 1) after a FIN, or
+* the right-most SACK block,
 * then
 * RESET the connection
 * else
 * Send a challenge ACK
 */
-   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+   tcp_reset_check(sk, skb)) {
rst_seq_match = true;
} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1



Re: [RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-13 Thread Jason Baron
On 01/11/2017 10:48 AM, Eric Dumazet wrote:

> On Thu, 2017-01-05 at 16:33 -0500, Jason Baron wrote:
>
>>  
>> +/* Accept RST for rcv_nxt - 1 after a FIN.
>> + * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
>> + * FIN is sent followed by a RST packet. The RST is sent with the same
>> + * sequence number as the FIN, and thus according to RFC 5961 a challenge
>> + * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
>> + * with a RST on the closed socket, hence accept this class of RSTs.
>> + */
>> +static bool tcp_reset_check(struct sock *sk, struct sk_buff *skb)
> const struct sock *sk, const struct sk_buff *skb
>
>> +{
>> +struct tcp_sock *tp = tcp_sk(sk);
>> +
>> +return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
>> +(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
> Why is the test on end_seq needed ?

Hi,

(Re-sending - seems like my reply was lost)

I wanted to define this condition as narrowly as I could. I'm ok
dropping it -
I'm not sure its going to make much difference in practice. So to that end,
dropping this extra check makes sense.

I posted this as RFC because RFC 5961, I don't think says anything about
accepting rcv_nxt - 1 in this case, so I was wondering what people
thought...

Thanks,

-Jason

>> +(sk->sk_state == TCP_CLOSE_WAIT ||
>> + sk->sk_state == TCP_LAST_ACK ||
>> + sk->sk_state == TCP_CLOSING));
>> +}
> Testing many states can be done more efficiently :
>
>(1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK |
>   TCPF_CLOSING)
>
> Thanks
>



Re: [RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-11 Thread Jason Baron

On 01/11/2017 12:17 AM, Christoph Paasch wrote:

Hello Jason,

(resending as Gmail sent out with HTML)

On 05/01/17 - 16:33:28, Jason Baron wrote:

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab') are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then does not reply with any RST as would be
expected on a closed socket.


do you see this behavior consistently, even in a controlled environment?

The problem seems rather to be that after the first RST, the NAT on the path
has dropped its mapping and is thus dropping all other traffic. So, Linux's
challenge-ack does not go through to the OSX-host to "re-synchronize" the
state (which would allow OSX to send a RST with the updated sequence numbers).

This is also documented in RFC5961:

9.3.  Middleboxes That Drop the Challenge ACK

   It also needs to be noted that, some middleboxes (Firewalls/NATs)
   that don't have the fix recommended in the document, may drop the
   challenge ACK.  This can happen because, the original RST segment
   that was in window had already cleared the flow state pertaining to
   the TCP connection in the middlebox.  In such cases, the end hosts
   that have implemented the RST mitigation described in this document,
   will have the TCP connection left open.  This is a corner case and
   can go away if the middlebox is conformant with the changes proposed
   in this document.



Yes, I've observed a couple of different ways that this can happen:

1) The case where Mac OSX does not send an RST in response to the 
challenge ACK. I found that there is actually a rate limit to the number 
of 'RSTs' that Mac OSX will send on a closed socket. Its
controlled by: 'net.inet.icmp.icmplim'. By default its set to 250, which 
means it will send up to 250 'RSTs' on closed sockets per second. Thus, 
I've confirmed that I can hit this limit by looking in the Mac OSX logs.
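
(For anyone reproducing this: the limit can be read on the OSX side with
'sysctl net.inet.icmp.icmplim' and raised with 'sysctl -w', which should make
this particular missing-RST case much harder to hit.)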


2) When I had the Mac OSX box connecting over a vpn to the linux server 
I found that while the challenge ack sent by Linux did reach the Mac OSX 
box, the RST that Mac OSX sent could get dropped somewhere in between. 
That is I see it sent from the Mac OSX box but never see it received by 
the Linux server. Thus, this is closer to the scenario you described above.


Also, as mentioned we've had this deployed in production where we've 
seen this change make a real difference in capacity (due to freeing up 
memory resources much more quickly). Since, I don't have dumps from the 
client side I can't say exactly what the sequence of events is there.


Thanks,

-Jason






Cheers,
Christoph



This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

I also found a posting that the iOS client behaves in a similar manner here
(it will send a FIN followed by a RST for rcv_nxt - 1):
https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/

A packetdrill test demonstrating the behavior.

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 
0.100 > S. 0:0(0) ack 1 
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec6d84363024..373bea05c93b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5249,6 +5249,24 @@ static int tcp_copy_to_iovec(struct sock *sk, struct 
sk_buff *skb, int hlen)
return err;
 }

+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
+ * with a RST on the closed socket, hence accept this c

[RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-05 Thread Jason Baron
Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab') are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then does not reply with any RST as would be
expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

I also found a posting that the iOS client behaves in a similar manner here
(it will send a FIN followed by a RST for rcv_nxt - 1):
https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/

A packetdrill test demonstrating the behavior.

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 
0.100 > S. 0:0(0) ack 1 
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec6d84363024..373bea05c93b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5249,6 +5249,24 @@ static int tcp_copy_to_iovec(struct sock *sk, struct 
sk_buff *skb, int hlen)
return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
+ * with a RST on the closed socket, hence accept this class of RSTs.
+ */
+static bool tcp_reset_check(struct sock *sk, struct sk_buff *skb)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+   (TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
+   (sk->sk_state == TCP_CLOSE_WAIT ||
+sk->sk_state == TCP_LAST_ACK ||
+sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5287,6 +5305,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct 
sk_buff *skb,
  LINUX_MIB_TCPACKSKIPPEDSEQ,
  &tp->last_oow_ack_time))
tcp_send_dupack(sk, skb);
+   } else if (tcp_reset_check(sk, skb)) {
+   tcp_reset(sk);
}
goto discard;
}
@@ -5300,7 +5320,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct 
sk_buff *skb,
 * else
 * Send a challenge ACK
 */
-   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+   tcp_reset_check(sk, skb)) {
rst_seq_match = true;
} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1



Re: [net PATCH] fib_trie: Correct /proc/net/route off by one error

2016-11-07 Thread Jason Baron



On 11/04/2016 03:11 PM, Alexander Duyck wrote:

The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.

In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.

This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key.  In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid.  With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.

Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in 
/proc/net/route")
Cc: Andy Whitcroft <a...@canonical.com>
Reported-by: Jason Baron <jba...@akamai.com>
Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/fib_trie.c |   21 +
 1 file changed, 9 insertions(+), 12 deletions(-)



Ok. Works for me.

Feel free to add:
Reviewed-and-Tested-by: Jason Baron <jba...@akamai.com>

Thanks,

-Jason


Re: [PATCH net] fib_trie: correct /proc/net/route for large read buffer

2016-11-04 Thread Jason Baron



On 11/04/2016 02:43 PM, Alexander Duyck wrote:

On Fri, Nov 4, 2016 at 7:45 AM, Jason Baron <jba...@akamai.com> wrote:

From: Jason Baron <jba...@akamai.com>

When read() is called on /proc/net/route requesting a size that is one
entry size (128 bytes) less than m->size or greater, the resulting output
has missing and/or duplicate entries. Since m->size is typically PAGE_SIZE,
for a PAGE_SIZE of 4,096 this means that reads requesting more than 3,968
bytes will see bogus output.

For example:

for i in {100..200}; do
ip route add 192.168.1.$i dev eth0
done
dd if=/proc/net/route of=/tmp/good bs=1024
dd if=/proc/net/route of=/tmp/bad bs=4096

# diff -q /tmp/good /tmp/bad
Files /tmp/good and /tmp/bad differ

I think this has gone unnoticed, since the output of 'netstat -r' and
'route' is generated by reading in 1,024 byte increments and thus not
corrupted. Further, the number of entries in the route table needs to be
sufficiently large in order to trigger the problematic case.

The issue arises because fib_route_get_idx() does not properly handle
the case where pos equals iter->pos. This case only arises when we have
a large read buffer size because we end up re-requesting the last entry
that overflowed m->buf. In the case of a smaller read buffer size,
we don't exceed the size of m->buf, and thus fib_route_get_idx() is called
with pos greater than iter->pos.

Fix by properly handling the iter->pos == pos case.

Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in 
/proc/net/route")
Cc: Andy Whitcroft <a...@canonical.com>
Cc: Alexander Duyck <alexander.h.du...@intel.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/fib_trie.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 31cef3602585..1017533fc75c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2411,12 +2411,17 @@ static struct key_vector *fib_route_get_idx(struct 
fib_route_iter *iter,
loff_t pos)
 {
struct key_vector *l, **tp = &iter->tnode;
+   loff_t saved_pos = 0;
t_key key;

/* use cache location of next-to-find key */
if (iter->pos > 0 && pos >= iter->pos) {
pos -= iter->pos;
key = iter->key;
+   if (pos == 0) {
+   saved_pos = iter->pos;
+   key--;
+   }
} else {
iter->pos = 0;
key = 0;
@@ -2436,10 +2441,13 @@ static struct key_vector *fib_route_get_idx(struct 
fib_route_iter *iter,
break;
}

-   if (l)
+   if (l) {
iter->key = key;/* remember it */
-   else
+   if (saved_pos)
+   iter->pos = saved_pos;
+   } else {
iter->pos = 0;  /* forget it */
+   }

return l;
 }


This doesn't seem correct to me.  I will have to look through this.
My understanding is that the value of iter->pos is supposed to be the
next position for us to grab, not the last one that was retrieved.  If
we are trying to re-request the last value then we should be falling
back into the else case for this since pos should be one less than
iter->pos.  The problem is the table could change out from under us
which is one of the reasons why we don't want to try and rewind the
key like you are doing here.

- Alex



Hi Alex,

In this case, seq_read() has called m->op->next(), which sets iter->pos 
equal to pos and iter->key to key + 1. However, when we then go to 
output the item associated with key, the 'm->op->next()' call overflows. 
Thus, we have a situation where iter->pos equals pos, iter->key = key + 
1, but we have not displayed the item at position 'key' (thus the bug is 
that we miss the item at key).


The change I proposed was simply to restart the search from 'key' in 
this case. If that item has disappeared, we will output the next one, or 
if its been replaced we will display its replacement. I think that is

ok?

The bug could also be fixed by changing:

if (iter->pos > 0 && pos >= iter->pos) {

to say:

if (iter->pos > 0 && pos > iter->pos) {

But that restarts the search on every overflow, which could mean every 
page size, and that seems suboptimal to me. Like-wise, if we make pos 1 
less than iter->pos that restarts the search. The idea with this patch 
is to not force us to redo the entire search on each overflow.


Thanks,

-Jason


[PATCH net] fib_trie: correct /proc/net/route for large read buffer

2016-11-04 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

When read() is called on /proc/net/route requesting a size that is one
entry size (128 bytes) less than m->size or greater, the resulting output
has missing and/or duplicate entries. Since m->size is typically PAGE_SIZE,
for a PAGE_SIZE of 4,096 this means that reads requesting more than 3,968
bytes will see bogus output.

For example:

for i in {100..200}; do
ip route add 192.168.1.$i dev eth0
done
dd if=/proc/net/route of=/tmp/good bs=1024
dd if=/proc/net/route of=/tmp/bad bs=4096

# diff -q /tmp/good /tmp/bad
Files /tmp/good and /tmp/bad differ

I think this has gone unnoticed, since the output of 'netstat -r' and
'route' is generated by reading in 1,024 byte increments and thus not
corrupted. Further, the number of entries in the route table needs to be
sufficiently large in order to trigger the problematic case.

The issue arises because fib_route_get_idx() does not properly handle
the case where pos equals iter->pos. This case only arises when we have
a large read buffer size because we end up re-requesting the last entry
that overflowed m->buf. In the case of a smaller read buffer size,
we don't exceed the size of m->buf, and thus fib_route_get_idx() is called
with pos greater than iter->pos.

Fix by properly handling the iter->pos == pos case.

Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in 
/proc/net/route")
Cc: Andy Whitcroft <a...@canonical.com>
Cc: Alexander Duyck <alexander.h.du...@intel.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/fib_trie.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 31cef3602585..1017533fc75c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2411,12 +2411,17 @@ static struct key_vector *fib_route_get_idx(struct 
fib_route_iter *iter,
loff_t pos)
 {
struct key_vector *l, **tp = &iter->tnode;
+   loff_t saved_pos = 0;
t_key key;
 
/* use cache location of next-to-find key */
if (iter->pos > 0 && pos >= iter->pos) {
pos -= iter->pos;
key = iter->key;
+   if (pos == 0) {
+   saved_pos = iter->pos;
+   key--;
+   }
} else {
iter->pos = 0;
key = 0;
@@ -2436,10 +2441,13 @@ static struct key_vector *fib_route_get_idx(struct 
fib_route_iter *iter,
break;
}
 
-   if (l)
+   if (l) {
iter->key = key;/* remember it */
-   else
+   if (saved_pos)
+   iter->pos = saved_pos;
+   } else {
iter->pos = 0;  /* forget it */
+   }
 
return l;
 }
-- 
2.6.1



Re: [PATCH v2 net-next 0/2] bnx2x: page allocation failure

2016-10-05 Thread Jason Baron


On 10/03/2016 08:19 PM, David Miller wrote:
> From: "Baron, Jason" 
> Date: Mon, 3 Oct 2016 20:24:32 +
>
>> Or should I just send the incremental at this point?
> Incremental is required at this point.

Hi David,

Ok. The above question was sent out erroneously. I have already posted
the incremental patch that I was referring to here:
https://patchwork.ozlabs.org/patch/675227/

Prior to that incremental patch post, I had composed the above question,
but realized that the right thing to do was simply to send the
incremental. However, I forgot to delete the 'draft' mail that I had,
and it accidentally got sent out on Monday.

Please ignore the noise, we are all set here :)

Thanks,

-Jason


[PATCH net-next] bnx2x: free the mac filter group list before freeing the cmd

2016-09-26 Thread Jason Baron
The group list must be freed prior to freeing the command otherwise
we have a use-after-free.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Yuval Mintz <yuval.mi...@qlogic.com>
Cc: Ariel Elior <ariel.el...@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index 4947a9c..cea6bdc 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2714,8 +2714,8 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
elem_group = (struct bnx2x_mcast_elem_group *)
 __get_free_page(GFP_ATOMIC | __GFP_ZERO);
if (!elem_group) {
-   kfree(new_cmd);
bnx2x_free_groups(&new_cmd->group_head);
+   kfree(new_cmd);
return -ENOMEM;
}
total_elems -= MCAST_MAC_ELEMS_PER_PG;
-- 
1.9.1



Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-23 Thread Jason Baron

Hi,

On 09/23/2016 03:24 AM, Nicholas Piggin wrote:

On Fri, 23 Sep 2016 14:42:53 +0800
"Hillf Danton"  wrote:



The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be
physically contiguous. Also the allocation is temporary for the duration of the
syscall, so it's unlikely to stress vmalloc too much.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.


How about something like this? (untested)

Eric isn't wrong about vmalloc sucking :)

Thanks,
Nick


---
  fs/select.c | 57 +++--
  1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da5..3b4834c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -555,6 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
void *bits;
int ret, max_fds;
unsigned int size;
+   size_t nr_bytes;
struct fdtable *fdt;
/* Allocate small arguments on the stack to save memory and be faster */
long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
@@ -576,21 +577,39 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
 * since we used fdset we need to allocate memory in units of
 * long-words.
 */
-   size = FDS_BYTES(n);
+   ret = -ENOMEM;
bits = stack_fds;
-   if (size > sizeof(stack_fds) / 6) {
-   /* Not enough space in on-stack array; must use kmalloc */
+   size = FDS_BYTES(n);
+   nr_bytes = 6 * size;
+
+   if (unlikely(nr_bytes > PAGE_SIZE)) {
+   /* Avoid multi-page allocation if possible */
ret = -ENOMEM;
-   bits = kmalloc(6 * size, GFP_KERNEL);
-   if (!bits)
-   goto out_nofds;
+   fds.in = kmalloc(size, GFP_KERNEL);
+   fds.out = kmalloc(size, GFP_KERNEL);
+   fds.ex = kmalloc(size, GFP_KERNEL);
+   fds.res_in = kmalloc(size, GFP_KERNEL);
+   fds.res_out = kmalloc(size, GFP_KERNEL);
+   fds.res_ex = kmalloc(size, GFP_KERNEL);
+
+   if (!(fds.in && fds.out && fds.ex &&
+   fds.res_in && fds.res_out && fds.res_ex))
+   goto out;
+   } else {
+   if (nr_bytes > sizeof(stack_fds)) {
+   /* Not enough space in on-stack array */
+   if (nr_bytes > PAGE_SIZE * 2)


The 'if' looks extraneous?

Also, I wonder if we can just avoid some allocations altogether by 
checking whether the user fd_set pointers are NULL? That can avoid failures :)
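
Something along these lines is what I have in mind (untested sketch, just to
illustrate the idea -- each res_* bitmap only matters when its corresponding
input set is non-NULL):

	size = FDS_BYTES(n);
	nr_bytes = 0;
	if (inp)
		nr_bytes += 2 * size;	/* in  + res_in  */
	if (outp)
		nr_bytes += 2 * size;	/* out + res_out */
	if (exp)
		nr_bytes += 2 * size;	/* ex  + res_ex  */

So e.g. select(n, &rfds, NULL, NULL, &tv) would need only a third of the
memory, which is probably a common calling pattern.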


Thanks,

-Jason


[PATCH v2 net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments

2016-09-22 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Yuval Mintz <yuval.mi...@qlogic.com>
Cc: Ariel Elior <ariel.el...@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 79 +++-
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..20fe6a8c35c1 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,64 @@ static int bnx2x_close(struct net_device *dev)
return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
- struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-   int mc_count = netdev_mc_count(bp->dev);
-   struct bnx2x_mcast_list_elem *mc_mac =
-   kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-   struct netdev_hw_addr *ha;
+   struct list_head mcast_group_link;
+   struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-   if (!mc_mac) {
-   BNX2X_ERR("Failed to allocate mc MAC list\n");
-   return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+   ((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+   sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+   while (!list_empty(mcast_group_list)) {
+   current_mcast_group = list_first_entry(mcast_group_list,
+ struct bnx2x_mcast_list_elem_group,
+ mcast_group_link);
list_del(&current_mcast_group->mcast_group_link);
+   free_page((unsigned long)current_mcast_group);
}
+}
 
-   INIT_LIST_HEAD(&p->mcast_list);
+static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+ struct bnx2x_mcast_ramrod_params *p,
+ struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_list_elem *mc_mac;
+   struct netdev_hw_addr *ha;
+   struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+   int mc_count = netdev_mc_count(bp->dev);
+   int offset = 0;
 
+   INIT_LIST_HEAD(&p->mcast_list);
netdev_for_each_mc_addr(ha, bp->dev) {
+   if (!offset) {
+   current_mcast_group =
+   (struct bnx2x_mcast_list_elem_group *)
+   __get_free_page(GFP_ATOMIC);
+   if (!current_mcast_group) {
+   bnx2x_free_mcast_macs_list(mcast_group_list);
+   BNX2X_ERR("Failed to allocate mc MAC list\n");
+   return -ENOMEM;
+   }
+   list_add(&current_mcast_group->mcast_group_link,
+mcast_group_list);
+   }
+   mc_mac = &current_mcast_group->mcast_elems[offset];
mc_mac->mac = bnx2x_mc_addr(ha);
list_add_tail(&mc_mac->link, &p->mcast_list);
-   mc_mac++;
+   offset++;
+   if (offset == MCAST_ELEMS_PER_PG)
+   offset = 0;
}
-
p->mcast_list_len = mc_count;
-
return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-   struct bnx2x_mcast_ramrod_params *p)
-{
-   struct bnx2x_mcast_list_elem *mc_mac =
-   list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-link);
-
-   WARN_ON(!mc_mac);
-   kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12668,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+   LIST_HEAD(mcast_group_list);
struct net_device *dev = bp->dev;
struct bnx2x_mcast_ramrod_params rparam = {NULL};
int rc = 0;
@@ -12662,7 +12684,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
/* then, configure a new MACs list */
if (netdev_mc_count(dev)) {
-   rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+   rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
if (rc)
return rc;
 
@@ -12673,7 +12695,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)

[PATCH v2 net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-22 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
Cc: Yuval Mintz <yuval.mi...@qlogic.com>
Cc: Ariel Elior <ariel.el...@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 123 +
 1 file changed, 86 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..4947a9cbf0c1 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+   struct bnx2x_mcast_bin_elem bin_elem;
+   struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+   struct list_head mcast_group_link;
+   union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+   ((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+   sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
struct list_head link;
+   struct list_head group_head;
int type; /* BNX2X_MCAST_CMD_X */
union {
struct list_head macs_head;
@@ -2638,16 +2653,29 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_elem_group *current_mcast_group;
+
+   while (!list_empty(mcast_group_list)) {
+   current_mcast_group = list_first_entry(mcast_group_list,
+ struct bnx2x_mcast_elem_group,
+ mcast_group_link);
list_del(&current_mcast_group->mcast_group_link);
+   free_page((unsigned long)current_mcast_group);
+   }
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
   struct bnx2x_mcast_obj *o,
   struct bnx2x_mcast_ramrod_params *p,
   enum bnx2x_mcast_cmd cmd)
 {
-   int total_sz;
struct bnx2x_pending_mcast_cmd *new_cmd;
-   struct bnx2x_mcast_mac_elem *cur_mac = NULL;
struct bnx2x_mcast_list_elem *pos;
-   int macs_list_len = 0, macs_list_len_size;
+   struct bnx2x_mcast_elem_group *elem_group;
+   struct bnx2x_mcast_mac_elem *mac_elem;
+   int total_elems = 0, macs_list_len = 0, offset = 0;
 
/* When adding MACs we'll need to store their values */
if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2685,61 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
if (!p->mcast_list_len)
return 0;
 
-   /* For a set command, we need to allocate sufficient memory for all
-* the bins, since we can't analyze at this point how much memory would
-* be required.
-*/
-   macs_list_len_size = macs_list_len *
-sizeof(struct bnx2x_mcast_mac_elem);
-   if (cmd == BNX2X_MCAST_CMD_SET) {
-   int bin_size = BNX2X_MCAST_BINS_NUM *
-  sizeof(struct bnx2x_mcast_bin_elem);
-
-   if (bin_size > macs_list_len_size)
-   macs_list_len_size = bin_size;
-   }
-   total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-   new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+   new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
if (!new_cmd)
return -ENOMEM;
 
-   DP(BNX2X_MSG_SP, "About to enqueue a new %d command. 
macs_list_len=%d\n",
-  cmd, macs_list_len);
-
INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+   INIT_LIST_HEAD(&new_cmd->group_head);
new_cmd->type = cmd;
new_cmd->done = false;
 
+   DP(BNX2X_MSG_SP, "About to enqueue a new %d command. 
macs_list_len=%d\n",
+  cmd, macs_list_len);
+
switch (cmd) {
case BNX2X_MCAST_CMD_ADD:
case BNX2X_MCAST_CMD_SET:
-   cur_mac = (struct bnx2x_mcast_mac_elem *)
- ((u8 *)new_cmd + sizeof(*new_cmd));
-
-   /* Push the MACs of the current command into the pending command
-* MACs list: FIFO
+   /* For a set command, we need to allocate sufficient memory for
+* all the bins, since we can't analyze at this point how much
+* memory would be required.
 */
+  

[PATCH v2 net-next 0/2] bnx2x: page allocation failure

2016-09-22 Thread Jason Baron
Hi,

While configuring ~500 multicast addrs, we ran into high order
page allocation failures. They don't need to be high order, and
thus I'm proposing to split them into at most PAGE_SIZE allocations.

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create 
multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: GW
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, 
BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206]   88012d873a78 8267f7c7 
c020
[1207325.736754]   88012d873b08 8212f8e0 
fffc0003
[1207325.745301]  88041ffecd80 88040030 0002 
c0206800da13
[1207325.753846] Call Trace:
[1207325.756789]  [] dump_stack+0x4d/0x63
[1207325.762426]  [] warn_alloc_failed+0xe0/0x130
[1207325.768756]  [] ? wakeup_kswapd+0x48/0x140
[1207325.774914]  [] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761]  [] alloc_pages_current+0x91/0x100
[1207325.788260]  [] alloc_kmem_pages+0xe/0x10
[1207325.794329]  [] kmalloc_order+0x18/0x50
[1207325.800227]  [] kmalloc_order_trace+0x26/0xb0
[1207325.806642]  [] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404]  [] __kmalloc+0x19a/0x1b0
[1207325.819142]  [] bnx2x_set_rx_mode_inner+0x3d5/0x590 
[bnx2x]
[1207325.827000]  [] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197]  [] process_one_work+0x134/0x3c0
[1207325.840522]  [] worker_thread+0x121/0x460
[1207325.846585]  [] ? process_one_work+0x3c0/0x3c0
[1207325.853089]  [] kthread+0xc9/0xe0
[1207325.858459]  [] ? notify_die+0x10/0x40
[1207325.864263]  [] ? kthread_create_on_node+0x180/0x180
[1207325.871288]  [] ret_from_fork+0x42/0x70
[1207325.877183]  [] ? kthread_create_on_node+0x180/0x180

v2:
 -make use of list_next_entry()
 -only use PAGE_SIZE allocations

Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  79 +--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 123 ---
 2 files changed, 137 insertions(+), 65 deletions(-)

-- 
2.6.1



Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-20 Thread Jason Baron



On 09/20/2016 07:30 AM, David Laight wrote:

From: Jason Baron

Sent: 19 September 2016 19:34

...


sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
page on x86. So if we want to fit 2,048 elements, we need 12 pages.


If you only need to save the mcast addresses you could use a 'heap'
that requires no overhead per entry and gives you O(log) lookup.
6 bytes per entry is 682 in a 4k page.

David



Indeed, that would save space here.
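
Something like this is how I read the suggestion (rough sketch, untested, and
the struct name is made up): one page holds a packed, sorted array of raw
6-byte MACs with no per-entry list_head, so ~682 entries per 4K page:

	struct bnx2x_mcast_mac_page {
		u16 count;	/* entries in use */
		u8  macs[(PAGE_SIZE - sizeof(u16)) / ETH_ALEN][ETH_ALEN];
	};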

Based on Yuval's comments it sounds as though he agrees that it makes 
sense to go beyond a page (even if we get 682 per page as you suggest)
when configuring these mac filters. So we would then have to allocate 
and manage the page pointers. Currently, there is a list_head per entry 
to manage the macs as a linked list. The patch I proposed continues to 
use that same data structure, thus it will not add to the memory 
footprint, it only proposes to break that footprint up into PAGE_SIZE 
chunks.


So I think the change you suggest can be viewed as an additional 
enhancement here, and also note that the memory allocations here are 
short-lived. That is, they only exist in memory until the NIC is 
re-configured.


Thanks,

-Jason


Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-20 Thread Jason Baron



On 09/20/2016 11:00 AM, Mintz, Yuval wrote:

The question I rose was whether it actually makes a difference under
such circumstances whether the device would actually filter those
multicast addresses or be completely multicast promiscuous.
e.g., whether it's significant to be filtering out multicast ingress
traffic when you're already allowing 1/2 of all random multicast
packets to be classified for the interface.



Agreed, I think this is the more interesting question here. I thought that we
would want to make sure we are using most of the bins before falling back to
multicast ingress. The reason being that even if it's more expensive for the
NIC to do the filtering than the multicast mode, it would be more than made
up for by not having to drop the traffic higher up the stack. So I think if
we can determine the percent of the bins that we want to use, we can then
back into the average number of filters required to get there. As I said, I
thought we would want to make sure we filled basically all the bins (with a
high probability that is) before falling back to multicast, and so I threw
out 2,048.


AFAIK configuring multiple filters doesn't incur any performance penalty
from the adapter side.
And I agree that from 'offloading' perspective it's probably better to
filter in HW even if the gain is negligible.
So for the upper limit - there's not much of a reason to it; The only gain
would be to prevent driver from allocating lots-and-lots of memory
temporarily for an unnecessary configuration.



Ok. We already have an upper limit to an extent with 
/proc/sys/net/ipv4/igmp_max_memberships. And as posted I didn't include 
one b/c of the higher level limits already in place.


Thanks,

-Jason


Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-20 Thread Jason Baron



On 09/20/2016 03:41 AM, Mintz, Yuval wrote:

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>


While I appreciate the effort, I wonder whether it's worth it:

- The hardware [even in its newer generation] provides an approximate
based classification [I.e., hashed] with 256 bins.
When configuring 500 multicast addresses, one can argue the difference
between multicast-promisc mode and actual configuration is
insignificant.


With 256 bins, I think it takes close to 256*lg(256), or 2,048, multicast
addresses to expect all bins to have at least one hash, assuming a uniform
distribution of the hashes.


Perhaps the easier-to-maintain alternative would simply be to
determine the maximal number of multicast addresses that can be
configured using a single PAGE, and if in need of more than that
simply move into multicast-promisc.



sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page on x86. So
if we want to fit 2,048 elements, we need 12 pages.


That's not exactly what I mean - let's assume you'd have problems
allocating more than a PAGE. According to your calculation, that
means you're already using more than 170 multicast addresses.
I didn't bother trying to solve the combinatorics question of how
many bins you'd use on average for 170 filters given there are only
256 bins, but that would be a significant portion.


On average for 170 filters, I get an average of 124 bins in use out
of 256 possible bins.
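
(That's just 256 * (1 - (255/256)^n) for n uniformly hashed filters; a quick
userspace check, for reference:

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		int n = 170;

		printf("expected bins: %.1f\n",
		       256.0 * (1.0 - pow(255.0 / 256.0, n)));
		return 0;
	}

which prints ~124 for n = 170.)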


The question I rose was whether it actually makes a difference
under such circumstances whether the device would actually filter
those multicast addresses or be completely multicast promiscuous.
e.g., whether it's significant to be filtering out multicast ingress
traffic when you're already allowing 1/2 of all random multicast
packets to be classified for the interface.



Agreed, I think this is the more interesting question here. I thought 
that we would want to make sure we are using most of the bins before 
falling back to multicast ingress. The reason being that even if it's
more expensive for the NIC to do the filtering than the multicast mode,
it would be more than made up for by not having to drop the traffic higher
up the stack. So I think if we can determine the percent of the bins 
that we want to use, we can then back into the average number of filters 
required to get there. As I said, I thought we would want to make sure 
we filled basically all the bins (with a high probability that is) 
before falling back to multicast, and so I threw out 2,048.


Thanks,

-Jason


Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-19 Thread Jason Baron

On 09/18/2016 06:25 AM, Mintz, Yuval wrote:

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>


While I appreciate the effort, I wonder whether it's worth it:

- The hardware [even in its newer generation] provides an approximate
based classification [I.e., hashed] with 256 bins.
When configuring 500 multicast addresses, one can argue the
difference between multicast-promisc mode and actual configuration
is insignificant.


With 256 bins, I think it takes close to 256*lg(256), or 2,048,
multicast addresses to expect all bins to have at least one hash,
assuming a uniform distribution of the hashes.



Perhaps the easier-to-maintain alternative would simply be to
determine the maximal number of multicast addresses that can be
configured using a single PAGE, and if in need of more than that
simply move into multicast-promisc.



sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
page on x86. So if we want to fit 2,048 elements, we need 12 pages.


  - While GFP_ATOMIC is required in this flow due to the fact it's being
called from sleepless context, I do believe this is mostly a remnant -
it's possible that by slightly changing the locking scheme we can have
the configuration done from sleepless context and simply switch to
GFP_KERNEL instead.



Ok, if it's GFP_KERNEL, I think it's still undesirable to do large
page-order allocations (unless of course it's converted to a single page,
but I'm not sure this makes sense, as mentioned).


Regarding the patch itself, only comment I have:

+   elem_group = (struct bnx2x_mcast_elem_group *)
+elem_group->mcast_group_link.next;

Let's use list_next_entry() instead.




Yes, agreed.

I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
enforce some maximum number of elements (perhaps 2,048 based on the
above math) for the !CHIP_IS_E1() case on top of what I already posted.
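
Roughly (untested sketch on top of the posted patches; the 2,048 constant is
just the estimate from above, and the exact placement in
bnx2x_set_rx_mode_inner() would need checking):

	/* Too many filters for the 256-bin approximate matching to buy us
	 * much over just accepting all multicast.
	 */
	if (!CHIP_IS_E1(bp) && netdev_mc_count(bp->dev) > 2048)
		bp->rx_mode = BNX2X_RX_MODE_ALLMULTI;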

Thanks,

-Jason


[PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-16 Thread Jason Baron
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 131 ++---
 1 file changed, 94 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..6db8dd252d7c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+   struct bnx2x_mcast_bin_elem bin_elem;
+   struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+   struct list_head mcast_group_link;
+   union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+   ((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+   sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
struct list_head link;
+   struct list_head group_head;
int type; /* BNX2X_MCAST_CMD_X */
union {
struct list_head macs_head;
@@ -2638,16 +2653,30 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_elem_group *current_mcast_group;
+
+   while (!list_empty(mcast_group_list)) {
+   current_mcast_group = list_first_entry(mcast_group_list,
+ struct bnx2x_mcast_elem_group,
+ mcast_group_link);
+   list_del(&current_mcast_group->mcast_group_link);
+   kfree(current_mcast_group);
+   }
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
   struct bnx2x_mcast_obj *o,
   struct bnx2x_mcast_ramrod_params *p,
   enum bnx2x_mcast_cmd cmd)
 {
-   int total_sz;
struct bnx2x_pending_mcast_cmd *new_cmd;
-   struct bnx2x_mcast_mac_elem *cur_mac = NULL;
struct bnx2x_mcast_list_elem *pos;
-   int macs_list_len = 0, macs_list_len_size;
+   struct bnx2x_mcast_elem_group *elem_group;
+   struct bnx2x_mcast_mac_elem *mac_elem;
+   int i = 0, offset = 0, macs_list_len = 0;
+   int total_elems, alloc_size;
 
/* When adding MACs we'll need to store their values */
if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2686,68 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
if (!p->mcast_list_len)
return 0;
 
-   /* For a set command, we need to allocate sufficient memory for all
-* the bins, since we can't analyze at this point how much memory would
-* be required.
-*/
-   macs_list_len_size = macs_list_len *
-sizeof(struct bnx2x_mcast_mac_elem);
-   if (cmd == BNX2X_MCAST_CMD_SET) {
-   int bin_size = BNX2X_MCAST_BINS_NUM *
-  sizeof(struct bnx2x_mcast_bin_elem);
-
-   if (bin_size > macs_list_len_size)
-   macs_list_len_size = bin_size;
-   }
-   total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-   new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+   new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
if (!new_cmd)
return -ENOMEM;
 
-   DP(BNX2X_MSG_SP, "About to enqueue a new %d command. 
macs_list_len=%d\n",
-  cmd, macs_list_len);
-
INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+   INIT_LIST_HEAD(&new_cmd->group_head);
new_cmd->type = cmd;
new_cmd->done = false;
 
+   DP(BNX2X_MSG_SP, "About to enqueue a new %d command. 
macs_list_len=%d\n",
+  cmd, macs_list_len);
+
switch (cmd) {
case BNX2X_MCAST_CMD_ADD:
case BNX2X_MCAST_CMD_SET:
-   cur_mac = (struct bnx2x_mcast_mac_elem *)
- ((u8 *)new_cmd + sizeof(*new_cmd));
-
-   /* Push the MACs of the current command into the pending command
-* MACs list: FIFO
+   /* For a set command, we need to allocate sufficient memory for
+* all the bins, since we can't analyze at this point how much
+* memory would be required.
 */
+   total_elems = macs_list_len;
+   if (cmd == BNX2X_MCAST_CMD_SET) {
+  

[PATCH net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments

2016-09-16 Thread Jason Baron
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 85 
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..9f5b2d94e4df 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,70 @@ static int bnx2x_close(struct net_device *dev)
return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
- struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-   int mc_count = netdev_mc_count(bp->dev);
-   struct bnx2x_mcast_list_elem *mc_mac =
-   kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-   struct netdev_hw_addr *ha;
+   struct list_head mcast_group_link;
+   struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-   if (!mc_mac) {
-   BNX2X_ERR("Failed to allocate mc MAC list\n");
-   return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+   ((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+   sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+   while (!list_empty(mcast_group_list)) {
+   current_mcast_group = list_first_entry(mcast_group_list,
+ struct bnx2x_mcast_list_elem_group,
+ mcast_group_link);
+   list_del(&current_mcast_group->mcast_group_link);
+   kfree(current_mcast_group);
}
+}
 
-   INIT_LIST_HEAD(&p->mcast_list);
+static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+ struct bnx2x_mcast_ramrod_params *p,
+ struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_list_elem *mc_mac;
+   struct netdev_hw_addr *ha;
+   struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+   int mc_count = netdev_mc_count(bp->dev);
+   int i = 0, offset = 0, alloc_size;
 
+   INIT_LIST_HEAD(&p->mcast_list);
netdev_for_each_mc_addr(ha, bp->dev) {
+   if (!offset) {
+   if ((mc_count - i) < MCAST_ELEMS_PER_PG)
+   alloc_size = sizeof(struct list_head) +
+   (sizeof(struct bnx2x_mcast_list_elem) *
+   (mc_count - i));
+   else
+   alloc_size = PAGE_SIZE;
+   current_mcast_group = kmalloc(alloc_size, GFP_ATOMIC);
+   if (!current_mcast_group) {
+   bnx2x_free_mcast_macs_list(mcast_group_list);
+   BNX2X_ERR("Failed to allocate mc MAC list\n");
+   return -ENOMEM;
+   }
+   list_add(&current_mcast_group->mcast_group_link,
+mcast_group_list);
+   }
+   mc_mac = &current_mcast_group->mcast_elems[offset];
mc_mac->mac = bnx2x_mc_addr(ha);
list_add_tail(&mc_mac->link, &p->mcast_list);
-   mc_mac++;
+   offset++;
+   if (offset == MCAST_ELEMS_PER_PG) {
+   i += offset;
+   offset = 0;
+   }
}
-
p->mcast_list_len = mc_count;
-
return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-   struct bnx2x_mcast_ramrod_params *p)
-{
-   struct bnx2x_mcast_list_elem *mc_mac =
-   list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-link);
-
-   WARN_ON(!mc_mac);
-   kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12674,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+   LIST_HEAD(mcast_group_list);
struct net_device *dev = bp->dev;
struct bnx2x_mcast_ramrod_params rparam = {NULL};
int rc = 0;
@@ -12662,7 +12690,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
/* then, configure a new MACs list */
if (netdev_mc_count(dev)) {
-   rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+   rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
 

[PATCH net-next 0/2] bnx2x: page allocation failure

2016-09-16 Thread Jason Baron
Hi,

While configuring ~500 multicast addrs, we ran into high order
page allocation failures. They don't need to be high order, and
thus I'm proposing to split them into at most PAGE_SIZE allocations.

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create 
multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: GW
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, 
BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206]   88012d873a78 8267f7c7 
c020
[1207325.736754]   88012d873b08 8212f8e0 
fffc0003
[1207325.745301]  88041ffecd80 88040030 0002 
c0206800da13
[1207325.753846] Call Trace:
[1207325.756789]  [] dump_stack+0x4d/0x63
[1207325.762426]  [] warn_alloc_failed+0xe0/0x130
[1207325.768756]  [] ? wakeup_kswapd+0x48/0x140
[1207325.774914]  [] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761]  [] alloc_pages_current+0x91/0x100
[1207325.788260]  [] alloc_kmem_pages+0xe/0x10
[1207325.794329]  [] kmalloc_order+0x18/0x50
[1207325.800227]  [] kmalloc_order_trace+0x26/0xb0
[1207325.806642]  [] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404]  [] __kmalloc+0x19a/0x1b0
[1207325.819142]  [] bnx2x_set_rx_mode_inner+0x3d5/0x590 
[bnx2x]
[1207325.827000]  [] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197]  [] process_one_work+0x134/0x3c0
[1207325.840522]  [] worker_thread+0x121/0x460
[1207325.846585]  [] ? process_one_work+0x3c0/0x3c0
[1207325.853089]  [] kthread+0xc9/0xe0
[1207325.858459]  [] ? notify_die+0x10/0x40
[1207325.864263]  [] ? kthread_create_on_node+0x180/0x180
[1207325.871288]  [] ret_from_fork+0x42/0x70
[1207325.877183]  [] ? kthread_create_on_node+0x180/0x180


Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  85 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 131 ---
 2 files changed, 151 insertions(+), 65 deletions(-)

-- 
2.6.1



Re: strange Mac OSX RST behavior

2016-07-22 Thread Jason Baron

Hi,

After looking at this further we found that there is actually
a rate limit on 'rst' packets sent by OSX on a closed socket.
It's set to 250 per second and controlled via
net.inet.icmp.icmplim. Increasing that limit resolves the
issue, but the default is apparently 250.
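
For anyone who needs to bump it programmatically rather than via
"sysctl -w net.inet.icmp.icmplim=<n>", a minimal userspace sketch for
the Mac side looks roughly like this (needs root to set; 2500 is just
an example value):

#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
	int cur = 0, new_limit = 2500;
	size_t len = sizeof(cur);

	/* read the current RST/ICMP rate limit */
	if (sysctlbyname("net.inet.icmp.icmplim", &cur, &len, NULL, 0) == 0)
		printf("net.inet.icmp.icmplim = %d\n", cur);

	/* raise it (root only) */
	if (sysctlbyname("net.inet.icmp.icmplim", NULL, NULL,
			 &new_limit, sizeof(new_limit)) != 0)
		perror("sysctlbyname");
	return 0;
}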

Thanks,

-Jason

On 07/01/2016 02:16 PM, One Thousand Gnomes wrote:

yes, we do in fact see a POLLRDHUP from the FIN in this case and
read of zero, but we still have more data to write to the socket, and
b/c the RST is dropped here, the socket stays in TIME_WAIT until
things eventually time out...

After the FIN when you send/retransmit your next segment do you then get
a valid RST back from the Mac end?

Alan




[PATCH net] tcp: enable per-socket rate limiting of all 'challenge acks'

2016-07-14 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:

commit f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock")

And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.

Since we have the global tcp_challenge_ack_limit, this patch allows
tcp_challenge_ack_limit to be set to a large value, effectively relying on
the per-socket limit, or to a lower value while still preventing a single
connection from consuming the entire challenge ack quota.

It further moves in the direction of eliminating the global limit at some
point, as Eric Dumazet has suggested. This is a follow-up to:
Subject: tcp: make challenge acks less predictable

Cc: Eric Dumazet <eduma...@google.com>
Cc: David S. Miller <da...@davemloft.net>
Cc: Neal Cardwell <ncardw...@google.com>
Cc: Yuchung Cheng <ych...@google.com>
Cc: Yue Cao <ycao...@ucr.edu>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 39 ++-
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fda084462ed9..f9f9e375d7de 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3424,6 +3424,23 @@ static int tcp_ack_update_window(struct sock *sk, const 
struct sk_buff *skb, u32
return flag;
 }
 
+static bool __tcp_oow_rate_limited(struct net *net, int mib_idx,
+  u32 *last_oow_ack_time)
+{
+   if (*last_oow_ack_time) {
+   s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
+
+   if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
+   NET_INC_STATS(net, mib_idx);
+   return true;/* rate-limited: don't send yet! */
+   }
+   }
+
+   *last_oow_ack_time = tcp_time_stamp;
+
+   return false;   /* not rate-limited: go ahead, send dupack now! */
+}
+
 /* Return true if we're currently rate-limiting out-of-window ACKs and
  * thus shouldn't send a dupack right now. We rate-limit dupacks in
  * response to out-of-window SYNs or ACKs to mitigate ACK loops or DoS
@@ -3437,21 +3454,9 @@ bool tcp_oow_rate_limited(struct net *net, const struct 
sk_buff *skb,
/* Data packets without SYNs are not likely part of an ACK loop. */
if ((TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq) &&
!tcp_hdr(skb)->syn)
-   goto not_rate_limited;
-
-   if (*last_oow_ack_time) {
-   s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);
-
-   if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
-   NET_INC_STATS(net, mib_idx);
-   return true;/* rate-limited: don't send yet! */
-   }
-   }
-
-   *last_oow_ack_time = tcp_time_stamp;
+   return false;
 
-not_rate_limited:
-   return false;   /* not rate-limited: go ahead, send dupack now! */
+   return __tcp_oow_rate_limited(net, mib_idx, last_oow_ack_time);
 }
 
 /* RFC 5961 7 [ACK Throttling] */
@@ -3464,9 +3469,9 @@ static void tcp_send_challenge_ack(struct sock *sk, const 
struct sk_buff *skb)
u32 count, now;
 
/* First check our per-socket dupack rate limit. */
-   if (tcp_oow_rate_limited(sock_net(sk), skb,
-LINUX_MIB_TCPACKSKIPPEDCHALLENGE,
-   &tp->last_oow_ack_time))
+   if (__tcp_oow_rate_limited(sock_net(sk),
+  LINUX_MIB_TCPACKSKIPPEDCHALLENGE,
+  &tp->last_oow_ack_time))
return;
 
/* Then check host-wide RFC 5961 rate limit. */
-- 
2.6.1



Re: strange Mac OSX RST behavior

2016-07-01 Thread Jason Baron


On 07/01/2016 02:16 PM, One Thousand Gnomes wrote:
>> yes, we do in fact see a POLLRDHUP from the FIN in this case and
>> read of zero, but we still have more data to write to the socket, and
>> b/c the RST is dropped here, the socket stays in TIME_WAIT until
>> things eventually time out...
> 
> After the FIN when you send/retransmit your next segment do you then get
> a valid RST back from the Mac end?
> 
> Alan
> 

No, we only get the single RST after the FIN from the Mac side, which
is dropped. I would have expected an RST from the Mac after the
retransmits, but we don't see any further transmits from the Mac.
And the Linux socket stays in CLOSE-WAIT (I mistakenly said
TIME_WAIT above).

For reference, I put the packet exchange in my initial mail.

Thanks,

-Jason


Re: strange Mac OSX RST behavior

2016-07-01 Thread Jason Baron
On 07/01/2016 01:08 PM, Rick Jones wrote:
> On 07/01/2016 08:10 AM, Jason Baron wrote:
>> I'm wondering if anybody else has run into this...
>>
>> On Mac OSX 10.11.5 (latest version), we have found that when tcp
>> connections are abruptly terminated (via ^C), a FIN is sent followed
>> by an RST packet.
> 
> That just seems, well, silly.  If the client application wants to use
> abortive close (sigh..) it should do so, there shouldn't be this
> little-bit-pregnant, correct close initiation (FIN) followed by a RST.
> 
>> The RST is sent with the same sequence number as the
>> FIN, and thus dropped since the stack only accepts RST packets matching
>> rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with
>> an RST on the closed socket, but it appears that it does not.
>>
>> The workaround here is then to reset the connection, if the RST is
>> is equal to rcv_nxt - 1, if we have already received a FIN.
>>
>> The RST attack surface is limited b/c we only accept the RST after we've
>> accepted a FIN and have not previously sent a FIN and received back the
>> corresponding ACK. In other words RST is only accepted in the tcp
>> states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.
>>
>> I'm interested if anybody else has run into this issue. Its problematic
>> since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT.
> 
> Isn't the server application expected to act on the read return of zero
> (which is supposed to be) triggered by the receipt of the FIN segment?
>

yes, we do in fact see a POLLRDHUP from the FIN in this case and
read of zero, but we still have more data to write to the socket, and
b/c the RST is dropped here, the socket stays in TIME_WAIT until
things eventually time out...

Thanks,

-Jason

> rick jones
> 
>> We are also in the process of contacting Apple to see what can be done
>> here...workaround patch is below.


strange Mac OSX RST behavior

2016-07-01 Thread Jason Baron
I'm wondering if anybody else has run into this...

On Mac OSX 10.11.5 (latest version), we have found that when tcp
connections are abruptly terminated (via ^C), a FIN is sent followed
by an RST packet. The RST is sent with the same sequence number as the
FIN, and thus dropped since the stack only accepts RST packets matching
rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with
an RST on the closed socket, but it appears that it does not.

The workaround here is then to reset the connection if the RST
is equal to rcv_nxt - 1 and we have already received a FIN.

The RST attack surface is limited b/c we only accept the RST after we've
accepted a FIN and have not previously sent a FIN and received back the
corresponding ACK. In other words RST is only accepted in the tcp
states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.

I'm interested if anybody else has run into this issue. It's problematic
since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT.
We are also in the process of contacting Apple to see what can be done
here...workaround patch is below.


Here is the sequence from wireshark, mac osx is client sending the
fin:

84581  14.752908  ->  TCP 66 49896 > http [FIN, ACK] 
Seq=673257230 Ack=924722210 Win=131072 Len=0 TSval=622455547 TSecr=346246436
84984  14.788056  ->  TCP 60 49896 > http [RST] 
Seq=673257230 Win=0 Len=0
84985  14.788061  ->  TCP 66 http > 49896 [ACK] 
Seq=924739994 Ack=673257231 Win=28960 Len=0 TSval=346246723 TSecr=622455547

followed by a bunch of retransmits from server:

85138  14.994217  ->  TCP 1054 [TCP segment of a 
reassembled PDU]
85237  15.348217  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85337  16.056224  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85436  17.472225  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85540  20.304222  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85644  25.968218  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85745  37.280230  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85845  59.904235  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]

Thanks,

-Jason

---
 net/ipv4/tcp_input.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff97523..b3c55b91140c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5155,6 +5155,25 @@ static int tcp_copy_to_iovec(struct sock *sk, struct 
sk_buff *skb, int hlen)
return err;
 }
 
+/*
+ * Mac OSX 10.11.5 can send a FIN followed by a RST where the RST
+ * has the same sequence number as the FIN. This is not compliant
+ * with RFC 5961, but ends up in a number of sockets tied up mostly
+ * in TCP_CLOSE_WAIT. The rst attack surface is limited b/c we only
+ * accept the RST after we've accepted a FIN and have not previously
+ * sent a FIN and received back the corresponding ACK.
+ */
+static bool tcp_fin_rst_check(struct sock *sk, struct sk_buff *skb)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+   (TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
+   (sk->sk_state == TCP_CLOSE_WAIT ||
+sk->sk_state == TCP_LAST_ACK ||
+sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5193,7 +5212,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct 
sk_buff *skb,
  LINUX_MIB_TCPACKSKIPPEDSEQ,
  &tp->last_oow_ack_time))
tcp_send_dupack(sk, skb);
-   }
+   } else if (tcp_fin_rst_check(sk, skb))
+   tcp_reset(sk);
goto discard;
}
 
@@ -5206,7 +5226,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct 
sk_buff *skb,
 * else
 * Send a challenge ACK
 */
-   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+   tcp_fin_rst_check(sk, skb)) {
rst_seq_match = true;
} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1



Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Jason Baron



On 06/22/2016 02:51 PM, Eric Dumazet wrote:

On Wed, 2016-06-22 at 11:43 -0700, Eric Dumazet wrote:

On Wed, 2016-06-22 at 14:18 -0400, Jason Baron wrote:

For 1/2, getting the correct memory barrier, should I re-submit
that as a separate patch?

Are you sure a full memory barrier (smp_mb() is needed ?

Maybe smp_wmb() would be enough ?

(And smp_rmb() in tcp_poll() ?)

Well, in tcp_poll() smp_mb__after_atomic() is fine as it follows
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

(although we might add a comment why we should keep
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk) before the set_bit() !)

But presumably smp_wmb() would be enough in tcp_check_space()






hmm, I think we need the smp_mb() there. From
tcp_poll() we have:

1) set_bit(SOCK_NOSPACE, ...)  (write)
2) smp_mb__after_atomic();
3) if (sk_stream_is_writeable(sk)) (read)

while in tcp_check_space() its:

1) the state that sk_stream_is_writeable() cares about (write)
2) smp_mb();
3) if (sk->sk_socket && test_bit(SOCK_NOSPACE,...) (read)

So if we can show that there are sufficient barriers
for #1 (directly above), maybe it can be down-graded or
eliminated. But it would still seem somewhat fragile.
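
To spell out the pattern I'm worried about (an illustration of the
ordering only, not the exact kernel code):

/*
 *   CPU A: tcp_poll()                     CPU B: tcp_check_space()
 *   -----------------                     ------------------------
 *   set_bit(SOCK_NOSPACE, ...);  [store]  write space is freed up      [store]
 *   smp_mb();                             smp_mb();
 *   sk_stream_is_writeable(sk)?  [load]   test_bit(SOCK_NOSPACE, ...)? [load]
 *     -> report POLLOUT                     -> tcp_new_space(sk) wakeup
 *
 * This is the store->load case: smp_wmb()/smp_rmb() only order stores
 * against stores and loads against loads, so with anything weaker than
 * smp_mb() CPU A could read the stale "not writeable" state while CPU B
 * reads the stale "NOSPACE not set" state, and the wakeup is lost.
 */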

Note I didn't observe any missing wakeups here, but I
just wanted to make sure we didn't miss any, since they
can be quite hard to debug.

Thanks,

-Jason


Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Jason Baron



On 06/22/2016 01:34 PM, Eric Dumazet wrote:

On Wed, 2016-06-22 at 11:32 -0400, Jason Baron wrote:

From: Jason Baron <jba...@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the effective
write buffer space can be much lower than what was set using SO_SNDBUF. For
example, we may have set the buffer to 100kb, but we may only be able to
write 10kb. In this scenario poll()/select()/epoll(), are going to
continuously return POLLOUT, followed by -EAGAIN from write(), and thus
result in a tight loop. Note that epoll in edge-triggered does not have
this issue since it only triggers once data has been ack'd. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
the sk->sndbuf.


Still, generating one POLLOUT event per incoming ACK will not please
epoll() users in edge-trigger mode.

Host is under global memory pressure, so we probably want to drain
socket queues _and_ reduce cpu pressure.

Strategy to insure all sockets converge to small amounts ASAP is simply
the best answer.

Letting big SO_SNDBUF offenders hog memory while their queue is big
is not going to help sockets who can not get ACK
(elephants get more ACK than mice, so they have more chance to succeed
their new allocations)

Your patch adds lot of complexity logic in tcp_sendmsg() and
tcp_sendpage().


I would prefer a simpler patch like :




Ok, fair enough. I'm going to assume that you will submit this as
a formal patch.

For 1/2, getting the correct memory barrier, should I re-submit
that as a separate patch?

Thanks,

-Jason



[PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the effective
write buffer space can be much lower than what was set using SO_SNDBUF. For
example, we may have set the buffer to 100kb, but we may only be able to
write 10kb. In this scenario poll()/select()/epoll() are going to
continuously return POLLOUT, followed by -EAGAIN from write(), and thus
result in a tight loop. Note that epoll in edge-triggered does not have
this issue since it only triggers once data has been ack'd. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
the sk->sndbuf.

Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the
socket when we have a short write due to memory pressure. By then testing
for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a
non-zero amount of data has been ack'd. In a previous approach:
http://marc.info/?l=linux-netdev=143930393211782=2, I had introduced a
new field in 'struct sock' to solve this issue, but its undesirable to add
bloat to 'struct sock'. We also could address this issue, by waiting for
the buffer to become completely empty, but that may reduce throughput since
the write buffer would be empty while waiting for subsequent writes. This
change brings us in line with the current epoll edge-trigger behavior for
the poll()/select() and epoll level-trigger.

We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set
SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and
SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).

I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the
server side, to induce tcp memory pressure. A single server thread reduced
its cpu usage from 100% to 19%, while maintaining the same level of
throughput.

Note: This is not a new issue, it has been this way for some time.

Cc: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/sock.h   |  6 ++
 net/ipv4/tcp.c   | 26 +++---
 net/ipv4/tcp_input.c |  3 ++-
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 649d2a8c17fc..616e8e1a5d5d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -741,6 +741,7 @@ enum sock_flags {
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
+   SOCK_SHORT_WRITE, /* Couldn't fill sndbuf due to memory pressure */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1114,6 +1115,11 @@ static inline bool sk_stream_is_writeable(const struct 
sock *sk)
   sk_stream_memory_free(sk);
 }
 
+static inline void sk_set_short_write(struct sock *sk)
+{
+   if (sk->sk_wmem_queued > 0 && sk_stream_is_writeable(sk))
+   sock_set_flag(sk, SOCK_SHORT_WRITE);
+}
 
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed147449c..4577a90d7d87 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -517,7 +517,8 @@ unsigned int tcp_poll(struct file *file, struct socket 
*sock, poll_table *wait)
mask |= POLLIN | POLLRDNORM;
 
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-   if (sk_stream_is_writeable(sk)) {
+   if (sk_stream_is_writeable(sk) &&
+   !sock_flag(sk, SOCK_SHORT_WRITE)) {
mask |= POLLOUT | POLLWRNORM;
} else {  /* send SIGIO later */
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -529,7 +530,8 @@ unsigned int tcp_poll(struct file *file, struct socket 
*sock, poll_table *wait)
 * pairs with the input side.
 */
smp_mb__after_atomic();
-   if (sk_stream_is_writeable(sk))
+   if (sk_stream_is_writeable(sk) &&
+   !sock_flag(sk, SOCK_SHORT_WRITE))
mask |= POLLOUT | POLLWRNORM;
}
} else
@@ -917,8 +919,10 @@ new_segment:
 
skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
  
skb_queue_empty(&sk->sk_write_queue));
-   if (!skb)
+   if (!skb) {
+   sk_set_short_write(sk);
goto wait_for_memory;
+   }
 
skb_entail(sk, skb);
copy = size_goal;
@@ -933,8 +937,10 @@ new_segment:

[PATCH v2 net-next 1/2] tcp: replace smp_mb__after_atomic() with smp_mb() in tcp_poll()

2016-06-22 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit(),
hence we need an smp_mb() there, smp_mb__after_atomic() is not sufficient.

Fixes: 3c7151275c0c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff97523..3ba526ecdeb9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4987,7 +4987,7 @@ static void tcp_check_space(struct sock *sk)
if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
/* pairs with tcp_poll() */
-   smp_mb__after_atomic();
+   smp_mb();
if (sk->sk_socket &&
test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
tcp_new_space(sk);
-- 
2.6.1



[PATCH v2 net-next 0/2] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-22 Thread Jason Baron
Hi,

I've added a patch to upgrade smp_mb__after_atomic() to a smp_mb()
as Eric Dumazet pointed out. Also combined the clearing of
SOCK_SHORT_WRITE with SOCK_QUEUE_SHRUNK.

Thanks,

-Jason

v2:
-upgrade smp_mb__after_atomic to smp_mb() in tcp_poll()
-combine clear of SOCK_SHORT_WRITE with SOCK_QUEUE_SHRUNK

Jason Baron (2):
  tcp: replace smp_mb__after_atomic() with smp_mb() in tcp_poll()
  tcp: reduce cpu usage when SO_SNDBUF is set

 include/net/sock.h   |  6 ++
 net/ipv4/tcp.c   | 26 +++---
 net/ipv4/tcp_input.c |  5 +++--
 3 files changed, 28 insertions(+), 9 deletions(-)

-- 
2.6.1



Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-21 Thread Jason Baron


On 06/20/2016 06:29 PM, Eric Dumazet wrote:
> On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote:
>> From: Jason Baron <jba...@akamai.com>
>>
>> When SO_SNDBUF is set and we are under tcp memory pressure, the effective
>> write buffer space can be much lower than what was set using SO_SNDBUF. For
>> example, we may have set the buffer to 100kb, but we may only be able to
>> write 10kb. In this scenario poll()/select()/epoll(), are going to
>> continuously return POLLOUT, followed by -EAGAIN from write(), and thus
>> result in a tight loop. Note that epoll in edge-triggered does not have
>> this issue since it only triggers once data has been ack'd. There is no
>> issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
>> the sk->sndbuf.
>>
>> Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the
>> socket when we have a short write due to memory pressure. By then testing
>> for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a
>> non-zero amount of data has been ack'd. In a previous approach:
>> http://marc.info/?l=linux-netdev=143930393211782=2, I had introduced a
>> new field in 'struct sock' to solve this issue, but its undesirable to add
>> bloat to 'struct sock'. We also could address this issue, by waiting for
>> the buffer to become completely empty, but that may reduce throughput since
>> the write buffer would be empty while waiting for subsequent writes. This
>> change brings us in line with the current epoll edge-trigger behavior for
>> the poll()/select() and epoll level-trigger.
>>
>> We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set
>> SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and
>> SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).
>>
>> I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the
>> server side, to induce tcp memory pressure. A single server thread reduced
>> its cpu usage from 100% to 19%, while maintaining the same level of
>> throughput.
>>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
>> ---
> 
> 
> When this bug was added, and by which commit ?
> 

I think we have always had this behavior, I've seen it at least back
to 3.10. It requires SO_SNDBUF, lots of sockets and memory pressure...
That said it's quite easy to reproduce.

> This looks serious, but your patch looks way too complex.
> 
> This SOCK_SHORT_WRITE bit constantly cleared in sk_stream_write_space()
> is expensive, because it dirties sk_flags which is not supposed to
> written often. This field is located in a read mostly cache line.
> 
> I believe my suggestion was not to add a per socket bit,
> but have a global state like tcp_memory_pressure. (or reuse it)
> 
> Ie , when under global memory pressure, tcp_poll() should apply a
> strategy to give POLLOUT only on sockets having less than 4K
> (tcp_wmem[0]) of memory consumed in their write queue.
> 
> 
> 
> 

ok, I think you're saying to just add:

if (sk_under_memory_pressure(sk) && sk->sk_wmem_queued > tcp_wmem[0])
return false;

to the beginning of sk_stream_is_writeable().

My concern is that it then allows the sndbuf to drain almost completely
before we are triggered to write again. The patch I presented
attempts to write again as soon as any new space is available for
the flow.

I also didn't necessarily want to affect all flows. If SO_SNDBUF is
not set, we generally don't need to change the wakeup behavior b/c
the sndbuf shrinks to the point where the normal write wakeup does
not cause these tight poll()/write() loops, where we don't make
any progress. We also don't need to change the behavior if we are
under memory pressure but have written enough (filled more than 2/3
of the buffer), such that the normal wakeups again would work fine.
So we could incorporate some of this logic into the above check, but
it becomes more complex.

In my patch, I could replace the:

sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);

in tcp_check_space() with something like:

sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) | (1UL << SOCK_SHORT_WRITE));

Since we are already writing to sk_flags there this should have very
minimal overhead. And then remove the clear in sk_stream_write_space().

Thanks,

-Jason


[PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-20 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the effective
write buffer space can be much lower than what was set using SO_SNDBUF. For
example, we may have set the buffer to 100kb, but we may only be able to
write 10kb. In this scenario poll()/select()/epoll() are going to
continuously return POLLOUT, followed by -EAGAIN from write(), and thus
result in a tight loop. Note that epoll in edge-triggered does not have
this issue since it only triggers once data has been ack'd. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
the sk->sndbuf.

Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the
socket when we have a short write due to memory pressure. By then testing
for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a
non-zero amount of data has been ack'd. In a previous approach:
http://marc.info/?l=linux-netdev=143930393211782=2, I had introduced a
new field in 'struct sock' to solve this issue, but its undesirable to add
bloat to 'struct sock'. We also could address this issue, by waiting for
the buffer to become completely empty, but that may reduce throughput since
the write buffer would be empty while waiting for subsequent writes. This
change brings us in line with the current epoll edge-trigger behavior for
the poll()/select() and epoll level-trigger.

We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set
SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and
SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).

I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the
server side, to induce tcp memory pressure. A single server thread reduced
its cpu usage from 100% to 19%, while maintaining the same level of
throughput.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/sock.h |  6 ++
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c | 26 +++---
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 649d2a8c17fc..616e8e1a5d5d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -741,6 +741,7 @@ enum sock_flags {
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
+   SOCK_SHORT_WRITE, /* Couldn't fill sndbuf due to memory pressure */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1114,6 +1115,11 @@ static inline bool sk_stream_is_writeable(const struct 
sock *sk)
   sk_stream_memory_free(sk);
 }
 
+static inline void sk_set_short_write(struct sock *sk)
+{
+   if (sk->sk_wmem_queued > 0 && sk_stream_is_writeable(sk))
+   sock_set_flag(sk, SOCK_SHORT_WRITE);
+}
 
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
diff --git a/net/core/stream.c b/net/core/stream.c
index 159516a11b7e..ead768b1e95d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
if (sk_stream_is_writeable(sk) && sock) {
clear_bit(SOCK_NOSPACE, &sock->flags);
+   sock_reset_flag(sk, SOCK_SHORT_WRITE);
 
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed147449c..254c7bb6d3d5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -517,7 +517,8 @@ unsigned int tcp_poll(struct file *file, struct socket 
*sock, poll_table *wait)
mask |= POLLIN | POLLRDNORM;
 
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-   if (sk_stream_is_writeable(sk)) {
+   if (!sock_flag(sk, SOCK_SHORT_WRITE) &&
+   sk_stream_is_writeable(sk)) {
mask |= POLLOUT | POLLWRNORM;
} else {  /* send SIGIO later */
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -529,7 +530,8 @@ unsigned int tcp_poll(struct file *file, struct socket 
*sock, poll_table *wait)
 * pairs with the input side.
 */
smp_mb__after_atomic();
-   if (sk_stream_is_writeable(sk))
+   if (!sock_flag(sk, SOCK_SHORT_WRITE) &&
+   sk_stream_is_writeable(sk))
mask |= POLLOUT | POLLWRNORM;
}
} else
@@ -917,8 +919,10 @@ new_segment:
 
skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
  
skb_queue_empty(>

Re: use-after-free in sctp_do_sm

2015-12-04 Thread Jason Baron


On 12/04/2015 12:03 PM, Joe Perches wrote:
> On Fri, 2015-12-04 at 11:47 -0500, Jason Baron wrote:
>> When DYNAMIC_DEBUG is enabled we have this wrapper from
>> include/linux/dynamic_debug.h:
>>
>> if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))
>>  
>>
>> So the compiler is not emitting the side-effects in this
>> case.
> 
> Huh?  Do I misunderstand what you are writing?

Yes, I wasn't terribly clear - I was trying to say that the
'side-effects', in this case the debug code and use-after-free, are
hidden behind the branch. They aren't invoked unless we enable the debug
statement.

Thanks,

-Jason

> 
> You are testing a variable that is not generally set
> so the call is not being performed in the general case,
> but the compiler can not elide the code.
> 
> If the variable was enabled via the control file, the
> __dynamic_pr_debug would be performed with the
> use-after-free.
> 


Re: use-after-free in sctp_do_sm

2015-12-04 Thread Jason Baron
On 12/04/2015 11:12 AM, Dmitry Vyukov wrote:
> On Thu, Dec 3, 2015 at 9:51 PM, Joe Perches <j...@perches.com> wrote:
>> (adding lkml as this is likely better discussed there)
>>
>> On Thu, 2015-12-03 at 15:42 -0500, Jason Baron wrote:
>>> On 12/03/2015 03:24 PM, Joe Perches wrote:
>>>> On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote:
>>>>> On 12/03/2015 03:03 PM, Joe Perches wrote:
>>>>>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote:
>>>>>>> On 12/03/2015 01:52 PM, Aaron Conole wrote:
>>>>>>>> I think that as a minimum, the following patch should be evaluted,
>>>>>>>> but am unsure to whom I should submit it (after I test):
>>>>>> []
>>>>>>> Agreed - the intention here is certainly to have no side effects. It
>>>>>>> looks like 'no_printk()' is used in quite a few other places that would
>>>>>>> benefit from this change. So we probably want a generic
>>>>>>> 'really_no_printk()' macro.
>>>>>>
>>>>>> https://lkml.org/lkml/2012/6/17/231
>>>>>
>>>>> I don't see this in the tree.
>>>>
>>>> It never got applied.
>>>>
>>>>> Also maybe we should just convert
>>>>> no_printk() to do what your 'eliminated_printk()'.
>>>>
>>>> Some of them at least.
>>>>
>>>>> So we can convert all users with this change?
>>>>
>>>> I don't think so, I think there are some
>>>> function evaluation/side effects that are
>>>> required.  I believe some do hardware I/O.
>>>>
>>>> It'd be good to at least isolate them.
>>>>
>>>> I'm not sure how to find them via some
>>>> automated tool/mechanism though.
>>>>
>>>> I asked Julia Lawall about it once in this
>>>> thread:  https://lkml.org/lkml/2014/12/3/696
>>>>
>>>
>>> Seems rather fragile to have side effects that we rely
>>> upon hidden in a printk().
>>
>> Yup.
>>
>>> Just convert them and see what breaks :)
>>
>> I appreciate your optimism.  It's very 1995.
>> Try it and see what happens.
> 
> 
> But Aaron says that DYNAMIC_DEBUG is enabled in most major
> distributions, and all these side-effects don't happen with
> DYNAMIC_DEBUG.

When DYNAMIC_DEBUG is enabled we have this wrapper from
include/linux/dynamic_debug.h:

if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))


So the compiler is not emitting the side-effects in this
case.

>This suggests that we can make these side-effects not
> happen without DYNAMIC_DEBUG as well.
> Or I am missing something here?
> 

When DYNAMIC_DEBUG is disabled we are instead replacing
pr_debug() with the 'no_printk()' function as you've pointed
out. We are changing this to emit no code at all:

http://marc.info/?l=linux-kernel=144918276518878=2
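
Roughly, the idea (the really_no_printk name is illustrative -- the
exact macro that lands upstream may differ) is to keep the printf-style
type checking but hide the argument evaluation behind a constant-false
branch, so the compiler is free to discard it:

#define really_no_printk(fmt, ...)			\
({							\
	if (0)						\
		printk(fmt, ##__VA_ARGS__);		\
	0;						\
})

whereas the current static-inline no_printk() still requires its
arguments (and any side effects buried in them) to be evaluated unless
the compiler can prove they are side-effect free:

static inline __printf(1, 2)
int no_printk(const char *fmt, ...)
{
	return 0;
}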

Thanks,

-Jason


Re: use-after-free in sctp_do_sm

2015-12-03 Thread Jason Baron


On 12/03/2015 01:52 PM, Aaron Conole wrote:
> Dmitry Vyukov  writes:
>> On Thu, Dec 3, 2015 at 6:02 PM, Eric Dumazet  wrote:
>>> On Thu, Dec 3, 2015 at 7:55 AM, Dmitry Vyukov  wrote:
 On Thu, Dec 3, 2015 at 3:48 PM, Eric Dumazet  wrote:
>>
>> No, I don't. But pr_debug always computes its arguments. See no_printk
>> in printk.h. So this use-after-free happens for all users.
>
> Hmm.
>
> pr_debug() should be a nop unless either DEBUG or
> CONFIG_DYNAMIC_DEBUG are set
>
> On our production kernels, pr_debug() is a nop.
>
> Can you double check ? Thanks !


 Why should it be nop? no_printk thing in printk.h pretty much
 explicitly makes it not a nop...
> 
> Because it was until commit 5264f2f75d8. It also violates my reading of
> the following from printk.h:
> 
>  * All of these will print unconditionally, although note that pr_debug()
>  * and other debug macros are compiled out unless either DEBUG is defined
>  * or CONFIG_DYNAMIC_DEBUG is set.
> 

 Double-checked: debug_post_sfx leads to some generated code:

 debug_post_sfx();
 8229f256:   48 8b 85 58 fe ff ffmov-0x1a8(%rbp),%rax
 8229f25d:   48 85 c0test   %rax,%rax
 8229f260:   74 24   je
 8229f286 
 8229f262:   8b b0 a8 00 00 00   mov0xa8(%rax),%esi
 8229f268:   48 8b 85 60 fe ff ffmov-0x1a0(%rbp),%rax
 8229f26f:   44 89 85 74 fe ff ffmov%r8d,-0x18c(%rbp)
 8229f276:   48 8b 78 20 mov0x20(%rax),%rdi
 8229f27a:   e8 71 28 01 00  callq
 822b1af0 
 8229f27f:   44 8b 85 74 fe ff ffmov-0x18c(%rbp),%r8d

 return error;
 }
 8229f286:   48 81 c4 a0 01 00 00add$0x1a0,%rsp
 8229f28d:   44 89 c0mov%r8d,%eax
 8229f290:   5b  pop%rbx
 8229f291:   41 5c   pop%r12
 8229f293:   41 5d   pop%r13
 8229f295:   41 5e   pop%r14
 8229f297:   41 5f   pop%r15
 8229f299:   5d  pop%rbp
 8229f29a:   c3  retq
>>>
>>> This is a serious concern, because we let in the past lot of patches
>>> converting traditional
> 
> +1
> 
>>> #ifdef DEBUG
>>> # define some_hand_coded_ugly_debug()  printk( _
>>> #else
>>> # define some_hand_coded_ugly_debug()
>>> #endif
>>>
>>> On the premise pr_debug() would be a nop.
>>>
>>> It seems it is not always the case. This is a very serious problem.
> 
> +1
> 
>>> We probably have hundred of potential bugs, because few people
>>> actually make sure all debugging stuff is correct,
>>> like comments can be wrong because they are not updated properly as
>>> time flies.
>>>
>>> It is definitely a nop for many cases.
>>>
>>> +void eric_test_pr_debug(struct sock *sk)
>>> +{
>>> +   if (atomic_read(&sk->sk_omem_alloc))
>>> +   pr_debug("%s: optmem leakage for sock %p\n",
>>> +__func__, sk);
>>> +}
>>>
>>> ->
>>>
>>> 4740 :
>>> 4740: e8 00 00 00 00   callq  4745 
>>> 4741: R_X86_64_PC32 __fentry__-0x4
>>> 4745: 55   push   %rbp
>>> 4746: 8b 87 24 01 00 00 mov0x124(%rdi),%eax //
>>> atomic_read()  but nothing follows
>>> 474c: 48 89 e5 mov%rsp,%rbp
>>> 474f: 5d   pop%rbp
>>> 4750: c3   retq
>>
>>
>>
>> I would expect that it is nop when argument evaluation does not have
>> side-effects. For example, for a load of a variable compiler will most
>> likely elide it (though, it does not have to elide it, because the
>> load is spelled in the code, so it can also legally emit the load and
>> doesn't use the result).
>> But if argument computation has side-effect (or compiler can't prove
>> otherwise), it must emit code. It must emit code for function calls
>> when the function is defined in a different translation unit, and for
>> volatile accesses (most likely including atomic accesses), etc
> 
> This isn't 100% true. As you state, in order to reach the return 0, all
> side effects must be evaluated. Load generally does not have side
> effects, so it can be safely elided, but function() must be emitted.
> 
> However, that is _not_ required to get the desired warning emission on a
> printf argument function, see http://pastebin.com/UHuaydkj for an
> example.
> 
> I think that as a minimum, the following patch should be evaluted, but am
> unsure to whom I should submit it (after I test):

Agreed - the 

Re: use-after-free in sctp_do_sm

2015-12-03 Thread Jason Baron


On 12/03/2015 03:03 PM, Joe Perches wrote:
> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote:
>> On 12/03/2015 01:52 PM, Aaron Conole wrote:
>>> I think that as a minimum, the following patch should be evaluted,
>>> but am unsure to whom I should submit it (after I test):
> []
>> Agreed - the intention here is certainly to have no side effects. It
>> looks like 'no_printk()' is used in quite a few other places that would
>> benefit from this change. So we probably want a generic
>> 'really_no_printk()' macro.
> 
> https://lkml.org/lkml/2012/6/17/231
> 

I don't see this in the tree. Also maybe we should just convert
no_printk() to do what your 'eliminated_printk()' does. Then we can convert all
users with this change?

Thanks,

-Jason


Re: use-after-free in sctp_do_sm

2015-12-03 Thread Jason Baron
On 12/03/2015 03:24 PM, Joe Perches wrote:
> On Thu, 2015-12-03 at 15:10 -0500, Jason Baron wrote:
>> On 12/03/2015 03:03 PM, Joe Perches wrote:
>>> On Thu, 2015-12-03 at 14:32 -0500, Jason Baron wrote:
>>>> On 12/03/2015 01:52 PM, Aaron Conole wrote:
>>>>> I think that as a minimum, the following patch should be evaluted,
>>>>> but am unsure to whom I should submit it (after I test):
>>> []
>>>> Agreed - the intention here is certainly to have no side effects. It
>>>> looks like 'no_printk()' is used in quite a few other places that would
>>>> benefit from this change. So we probably want a generic
>>>> 'really_no_printk()' macro.
>>>
>>> https://lkml.org/lkml/2012/6/17/231
>>
>> I don't see this in the tree.
> 
> It never got applied.
> 
>> Also maybe we should just convert
>> no_printk() to do what your 'eliminated_printk()'.
> 
> Some of them at least.
> 
>> So we can convert all users with this change?
> 
> I don't think so, I think there are some
> function evaluation/side effects that are
> required.  I believe some do hardware I/O.
> 
> It'd be good to at least isolate them.
> 
> I'm not sure how to find them via some
> automated tool/mechanism though.
> 
> I asked Julia Lawall about it once in this
> thread:  https://lkml.org/lkml/2014/12/3/696
> 

Seems rather fragile to have side effects that we rely
upon hidden in a printk().

Just convert them and see what breaks :)


Re: [PATCH 09/13] mm: memcontrol: generalize the socket accounting jump label

2015-11-30 Thread Jason Baron
Hi,

On 11/24/2015 04:52 PM, Johannes Weiner wrote:
> The unified hierarchy memory controller is going to use this jump
> label as well to control the networking callbacks. Move it to the
> memory controller code and give it a more generic name.
> 
> Signed-off-by: Johannes Weiner 
> Acked-by: Michal Hocko 
> Reviewed-by: Vladimir Davydov 
> ---
>  include/linux/memcontrol.h | 4 
>  include/net/sock.h | 7 ---
>  mm/memcontrol.c| 3 +++
>  net/core/sock.c| 5 -
>  net/ipv4/tcp_memcontrol.c  | 4 ++--
>  5 files changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d99fefe..dad56ef 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -681,6 +681,8 @@ static inline void mem_cgroup_wb_stats(struct 
> bdi_writeback *wb,
>  
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  struct sock;
> +extern struct static_key memcg_sockets_enabled_key;
> +#define mem_cgroup_sockets_enabled 
> static_key_false(&memcg_sockets_enabled_key)


We're trying to move to the updated API, so this should be:
static_branch_unlikely(&memcg_sockets_enabled_key)

see: include/linux/jump_label.h for details.
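
For reference, the mapping between the old and the updated API is
roughly (sketch only; see include/linux/jump_label.h):

	/* old style */
	struct static_key memcg_sockets_enabled_key;		/* defaults to false */
	if (static_key_false(&memcg_sockets_enabled_key))	/* unlikely branch */
		...
	static_key_slow_inc(&memcg_sockets_enabled_key);
	static_key_slow_dec(&memcg_sockets_enabled_key);

	/* updated style */
	DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
	if (static_branch_unlikely(&memcg_sockets_enabled_key))
		...
	static_branch_inc(&memcg_sockets_enabled_key);
	static_branch_dec(&memcg_sockets_enabled_key);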


>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int 
> nr_pages);
> @@ -689,6 +691,8 @@ static inline bool 
> mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  {
>   return memcg->tcp_mem.memory_pressure;
>  }
> +#else
> +#define mem_cgroup_sockets_enabled 0
>  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1a94b85..fcc9442 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1065,13 +1065,6 @@ static inline void sk_refcnt_debug_release(const 
> struct sock *sk)
>  #define sk_refcnt_debug_release(sk) do { } while (0)
>  #endif /* SOCK_REFCNT_DEBUG */
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
> -extern struct static_key memcg_socket_limit_enabled;
> -#define mem_cgroup_sockets_enabled 
> static_key_false(&memcg_socket_limit_enabled)
> -#else
> -#define mem_cgroup_sockets_enabled 0
> -#endif
> -
>  static inline bool sk_stream_memory_free(const struct sock *sk)
>  {
>   if (sk->sk_wmem_queued >= sk->sk_sndbuf)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 68d67fc..0602bee 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -291,6 +291,9 @@ static inline struct mem_cgroup 
> *mem_cgroup_from_id(unsigned short id)
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  
> +struct static_key memcg_sockets_enabled_key;


And this would be:

static DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);


>  void sock_update_memcg(struct sock *sk)
>  {
>   struct mem_cgroup *memcg;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6486b0d..c5435b5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -201,11 +201,6 @@ EXPORT_SYMBOL(sk_net_capable);
>  static struct lock_class_key af_family_keys[AF_MAX];
>  static struct lock_class_key af_family_slock_keys[AF_MAX];
>  
> -#if defined(CONFIG_MEMCG_KMEM)
> -struct static_key memcg_socket_limit_enabled;
> -EXPORT_SYMBOL(memcg_socket_limit_enabled);
> -#endif
> -
>  /*
>   * Make lock validator output more readable. (we pre-construct these
>   * strings build-time, so that runtime initialization of socket
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index e507825..9a22e2d 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -34,7 +34,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
>   return;
>  
>   if (memcg->tcp_mem.active)
> - static_key_slow_dec(&memcg_socket_limit_enabled);
> + static_key_slow_dec(&memcg_sockets_enabled_key);
>  

static_branch_dec(&memcg_sockets_enabled_key);

}
>  
>  static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
> @@ -65,7 +65,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, 
> unsigned long nr_pages)
>* because when this value change, the code to process it is not
>* patched in yet.
>*/
> - static_key_slow_inc(&memcg_socket_limit_enabled);
> + static_key_slow_inc(&memcg_sockets_enabled_key);
>   memcg->tcp_mem.active = true;
>   }
>  
> 

static_branch_inc(&memcg_sockets_enabled_key);

Thanks,

-Jason



Re: [PATCH 09/13] mm: memcontrol: generalize the socket accounting jump label

2015-11-30 Thread Jason Baron
On 11/30/2015 04:50 PM, Johannes Weiner wrote:
> On Mon, Nov 30, 2015 at 04:08:18PM -0500, Jason Baron wrote:
>> We're trying to move to the updated API, so this should be:
>> static_branch_unlikely(&memcg_sockets_enabled_key)
>>
>> see: include/linux/jump_label.h for details.
> 
> Good point. There is another struct static_key in there as well. How
> about the following on top of this series?
> 

Looks fine - you may be able to make use of
'static_branch_enable()/disable()' instead of the inc()/dec() to simply
set the branch direction, if you think its more readable. Although I
didn't look to see if it would be racy here.

Thanks,

-Jason


> ---
> From b784aa0323628d43272e13a67ead2a2ce0e93ea6 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <han...@cmpxchg.org>
> Date: Mon, 30 Nov 2015 16:41:38 -0500
> Subject: [PATCH] mm: memcontrol: switch to the updated jump-label API
> 
> According to  the direct use of struct static_key
> is deprecated. Update the socket and slab accounting code accordingly.
> 
> Reported-by: Jason Baron <jba...@akamai.com>
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  8 
>  mm/memcontrol.c| 12 ++--
>  net/ipv4/tcp_memcontrol.c  |  4 ++--
>  3 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a8df46c..9a19590 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -704,8 +704,8 @@ static inline void mem_cgroup_wb_stats(struct 
> bdi_writeback *wb,
>  
>  #ifdef CONFIG_INET
>  struct sock;
> -extern struct static_key memcg_sockets_enabled_key;
> -#define mem_cgroup_sockets_enabled 
> static_key_false(&memcg_sockets_enabled_key)
> +extern struct static_key_false memcg_sockets_enabled_key;
> +#define mem_cgroup_sockets_enabled 
> static_branch_unlikely(&memcg_sockets_enabled_key)
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int 
> nr_pages);
> @@ -727,7 +727,7 @@ static inline bool 
> mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  #endif /* CONFIG_INET */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -extern struct static_key memcg_kmem_enabled_key;
> +extern struct static_key_false memcg_kmem_enabled_key;
>  
>  extern int memcg_nr_cache_ids;
>  void memcg_get_cache_ids(void);
> @@ -743,7 +743,7 @@ void memcg_put_cache_ids(void);
>  
>  static inline bool memcg_kmem_enabled(void)
>  {
> - return static_key_false(&memcg_kmem_enabled_key);
> + return static_branch_unlikely(&memcg_kmem_enabled_key);
>  }
>  
>  static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a0da91f..5fe45d68 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -346,7 +346,7 @@ void memcg_put_cache_ids(void)
>   * conditional to this static branch, we'll have to allow modules that does
>   * kmem_cache_alloc and the such to see this symbol as well
>   */
> -struct static_key memcg_kmem_enabled_key;
> +DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
>  EXPORT_SYMBOL(memcg_kmem_enabled_key);
>  
>  #endif /* CONFIG_MEMCG_KMEM */
> @@ -2883,7 +2883,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg,
>   err = page_counter_limit(&memcg->kmem, nr_pages);
>   VM_BUG_ON(err);
>  
> - static_key_slow_inc(&memcg_kmem_enabled_key);
> + static_branch_inc(&memcg_kmem_enabled_key);
>   /*
>* A memory cgroup is considered kmem-active as soon as it gets
>* kmemcg_id. Setting the id after enabling static branching will
> @@ -3622,7 +3622,7 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
>  {
>   if (memcg->kmem_acct_activated) {
>   memcg_destroy_kmem_caches(memcg);
> - static_key_slow_dec(&memcg_kmem_enabled_key);
> + static_branch_dec(&memcg_kmem_enabled_key);
>   WARN_ON(page_counter_read(&memcg->kmem));
>   }
>   tcp_destroy_cgroup(memcg);
> @@ -4258,7 +4258,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  
>  #ifdef CONFIG_INET
>   if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
> - static_key_slow_inc(&memcg_sockets_enabled_key);
> + static_branch_inc(&memcg_sockets_enabled_key);
>  #endif
>  
>   /*
> @@ -4302,7 +4302,7 @@ static void mem_cgroup_css_free(struct 
> cgroup_subsys_state *css)
>   memcg_destroy_kmem(memcg);
>  #ifdef CONFIG_INET
>   if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
> - static_key_sl

Re: use-after-free in sock_wake_async

2015-11-24 Thread Jason Baron


On 11/24/2015 10:21 AM, Eric Dumazet wrote:
> On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukov  wrote:
>> Hello,
>>
>> The following program triggers use-after-free in sock_wake_async:
>>
>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> long r2 = -1;
>> long r3 = -1;
>> long r7 = -1;
>>
>> void *thr0(void *arg)
>> {
>> syscall(SYS_splice, r2, 0x0ul, r7, 0x0ul, 0x4ul, 0x8ul);
>> return 0;
>> }
>>
>> void *thr1(void *arg)
>> {
>> syscall(SYS_close, r2, 0, 0, 0, 0, 0);
>> return 0;
>> }
>>
>> void *thr2(void *arg)
>> {
>> syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0);
>> return 0;
>> }
>>
>> int main()
>> {
>> long r0 = syscall(SYS_mmap, 0x2000ul, 0x1ul, 0x3ul,
>> 0x32ul, 0xul, 0x0ul);
>> long r1 = syscall(SYS_socketpair, 0x1ul, 0x1ul, 0x0ul,
>> 0x2000ul, 0, 0);
>> r2 = *(uint32_t*)0x2000;
>> r3 = *(uint32_t*)0x2004;
>>
>> *(uint64_t*)0x20001000 = 0x4;
>> long r5 = syscall(SYS_ioctl, r2, 0x5452ul, 0x20001000ul, 0, 0, 0);
>>
>> long r6 = syscall(SYS_pipe2, 0x20002000ul, 0x80800ul, 0, 0, 0, 0);
>> r7 = *(uint32_t*)0x20002004;
>>
>> pthread_t th[3];
>> pthread_create(&th[0], 0, thr0, 0);
>> pthread_create(&th[1], 0, thr1, 0);
>> pthread_create(&th[2], 0, thr2, 0);
>> pthread_join(th[0], 0);
>> pthread_join(th[1], 0);
>> pthread_join(th[2], 0);
>> return 0;
>> }
>>
>>
>> The use-after-free fires after a minute of running it in a tight
>> parallel loop. I use the stress utility for this:
>>
>> $ go get golang.org/x/tools/cmd/stress
>> $ stress -p 128 -failure "ignore" ./a.out
>>
>>
>> ==
>> BUG: KASAN: use-after-free in sock_wake_async+0x325/0x340 at addr
>> 880061d1ad10
>> Read of size 8 by task a.out/23178
>> =
>> BUG sock_inode_cache (Not tainted): kasan: bad access detected
>> -
>>
>> Disabling lock debugging due to kernel taint
>> INFO: Allocated in sock_alloc_inode+0x1d/0x220 age=0 cpu=2 pid=23183
>> [<  none  >] kmem_cache_alloc+0x1a6/0x1f0 mm/slub.c:2514
>> [<  none  >] sock_alloc_inode+0x1d/0x220 net/socket.c:250
>> [<  none  >] alloc_inode+0x61/0x180 fs/inode.c:198
>> [<  none  >] new_inode_pseudo+0x17/0xe0 fs/inode.c:878
>> [<  none  >] sock_alloc+0x3d/0x260 net/socket.c:540
>> [<  none  >] __sock_create+0xa7/0x620 net/socket.c:1133
>> [< inline >] sock_create net/socket.c:1209
>> [< inline >] SYSC_socketpair net/socket.c:1281
>> [<  none  >] SyS_socketpair+0x112/0x4e0 net/socket.c:1260
>> [<  none  >] entry_SYSCALL_64_fastpath+0x16/0x7a
>> arch/x86/entry/entry_64.S:185
>>
>> INFO: Freed in sock_destroy_inode+0x56/0x70 age=0 cpu=2 pid=23185
>> [<  none  >] kmem_cache_free+0x24e/0x260 mm/slub.c:2742
>> [<  none  >] sock_destroy_inode+0x56/0x70 net/socket.c:279
>> [<  none  >] destroy_inode+0xc4/0x120 fs/inode.c:255
>> [<  none  >] evict+0x36b/0x580 fs/inode.c:559
>> [< inline >] iput_final fs/inode.c:1477
>> [<  none  >] iput+0x4a0/0x790 fs/inode.c:1504
>> [< inline >] dentry_iput fs/dcache.c:358
>> [<  none  >] __dentry_kill+0x4fe/0x700 fs/dcache.c:543
>> [< inline >] dentry_kill fs/dcache.c:587
>> [<  none  >] dput+0x6ab/0x7a0 fs/dcache.c:796
>> [<  none  >] __fput+0x3fb/0x6e0 fs/file_table.c:226
>> [<  none  >] fput+0x15/0x20 fs/file_table.c:244
>> [<  none  >] task_work_run+0x163/0x1f0 kernel/task_work.c:115
>> (discriminator 1)
>> [< inline >] tracehook_notify_resume include/linux/tracehook.h:191
>> [<  none  >] exit_to_usermode_loop+0x180/0x1a0
>> arch/x86/entry/common.c:251
>> [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
>> [<  none  >] syscall_return_slowpath+0x19f/0x210
>> arch/x86/entry/common.c:344
>> [<  none  >] int_ret_from_sys_call+0x25/0x9f
>> arch/x86/entry/entry_64.S:281
>>
>> INFO: Slab 0xea0001874600 objects=25 used=2 fp=0x880061d1c100
>> flags=0x5004080
>> INFO: Object 0x880061d1ad00 @offset=11520 fp=0x880061d1a300
>> CPU: 3 PID: 23178 Comm: a.out Tainted: GB   4.4.0-rc1+ #84
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>   880061baf8f0 825d3336 88003e0dc280
>>  880061d1ad00 880061d18000 880061baf920 81618784
>>  88003e0dc280 ea0001874600 880061d1ad00 00e7
>>
>> Call Trace:
>>  [] __asan_report_load8_noabort+0x3e/0x40
>> mm/kasan/report.c:280
>>  [< inline >] 

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-23 Thread Jason Baron


On 11/20/2015 05:07 PM, Rainer Weikusat wrote:
> Rainer Weikusat <rweiku...@mobileactivedefense.com> writes:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
> 
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
> 
> Signed-off-by: Rainer Weikusat <rweiku...@mobileactivedefense.com>
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")

Looks good to me.
Reviewed-by: Jason Baron <jba...@akamai.com>

Thanks,

-Jason


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-20 Thread Jason Baron
On 11/19/2015 06:52 PM, Rainer Weikusat wrote:

[...]

> @@ -1590,21 +1718,35 @@ restart:
>   goto out_unlock;
>   }
>  
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> + if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
> + if (timeo) {
> + timeo = unix_wait_for_peer(other, timeo);
> +
> + err = sock_intr_errno(timeo);
> + if (signal_pending(current))
> + goto out_free;
> +
> + goto restart;
> + }
> +
> + if (unix_peer(sk) != other ||
> + unix_dgram_peer_wake_me(sk, other)) {
>   err = -EAGAIN;
>   goto out_unlock;
>   }

Hi,

So here we are calling unix_dgram_peer_wake_me() without the sk lock the first
time through - right? In that case, we can end up registering on other's queue
for the callback even though we might have already connected to a different
remote, and the wakeup will then crash if 'sk' has been freed in the meantime.

Thanks,

-Jason



Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-17 Thread Jason Baron


On 11/16/2015 05:28 PM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
> 
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
> 
> Signed-off-by: Rainer Weikusat <rweiku...@mobileactivedefense.com>
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
> ---
> 
> Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
> behavior for EPOLLOUT event": This shouldn't be an issue anymore with
> this change despite it restores the "only when writable" behaviour" as
> the wake up relay will also be set up once _dgram_sendmsg returned
> EAGAIN for a send attempt on a n:1 connected socket.
> 
> 

Hi,

My only comment was about potentially avoiding the double lock in the
write path, otherwise this looks ok to me.

Thanks,

-Jason


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-17 Thread Jason Baron


On 11/15/2015 01:32 PM, Rainer Weikusat wrote:

> 
> That was my original idea. The problem with this is that the code
> starting after the _lock and running until the main code path unlock has
> to be executed in one go with the other lock held as the results of the
> tests above this one may become invalid as soon as the other lock is
> released. This means instead of continuing execution with the send code
> proper after the block in case other became receive-ready between the
> first and the second test (possible because _dgram_recvmsg does not
> take the unix state lock), the whole procedure starting with acquiring
> the other lock would need to be restarted. Given sufficiently unfavorable
> circumstances, this could even turn into an endless loop which couldn't
> be interrupted. (unless code for this was added). 
> 

Hmm - I think we can avoid it by doing the wakeup from the write path in the
rare case that the queue has emptied, and thereby avoid the double lock, i.e.:

unix_state_unlock(other);
unix_state_lock(sk);
err = -EAGAIN;
if (unix_peer(sk) == other) {
	unix_dgram_peer_wake_connect(sk, other);
	if (skb_queue_len(&other->sk_receive_queue) == 0)
		need_wakeup = true;
}
unix_state_unlock(sk);
if (need_wakeup)
	wake_up_interruptible_poll(sk_sleep(sk), POLLOUT
				   | POLLWRNORM | POLLWRBAND);
goto out_free;


Thanks,

-Jason


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-13 Thread Jason Baron


On 11/13/2015 01:51 PM, Rainer Weikusat wrote:

[...]

>  
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> - }
> + if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {

Remind me why the 'unix_peer(sk) == other' check is added here? If the remote
is not connected back to us, we still want to make sure that we don't overflow
the remote rcv queue, right?

Regarding the added 'double' lock for both sk and other, where previously we
just held the 'other' lock: I think we could continue to hold only the 'other'
lock unless the remote queue is full, so something like:

if (unix_peer(other) != sk && unix_recvq_full(other)) {
	bool need_wakeup = false;

	/* ... skipping the blocking (timeo) case ... */

	err = -EAGAIN;
	if (!other_connected)
		goto out_unlock;
	unix_state_unlock(other);
	unix_state_lock(sk);

	/* If the remote peer has changed under us, the connect()
	 * will wake up any pending waiter, so just return -EAGAIN.
	 */
	if (unix_peer(sk) == other) {
		/* In case we now see space available, queue the
		 * wakeup and we will try again. This should be an
		 * unlikely condition.
		 */
		if (!unix_dgram_peer_wake_me(sk, other))
			need_wakeup = true;
	}
	unix_state_unlock(sk);
	if (need_wakeup)
		wake_up_interruptible_poll(sk_sleep(sk), POLLOUT
					   | POLLWRNORM | POLLWRBAND);
	goto out_free;
}

So I'm not sure if the 'double' lock really affects any workload, but the
above might be a way to avoid it.

Also - it might be helpful to add a 'Fixes:' tag referencing where this
issue started, in the changelog.

Worth mentioning too is that this patch should improve the polling case here
dramatically, as we currently wake the entire queue on every remote read even
when we have room in the rcv buffer. So this patch should cut the context
switching rate down considerably from what we currently have.

Thanks,

-Jason


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-11 Thread Jason Baron
Hi Rainer,

> +
> +/* Needs sk unix state lock. After recv_ready indicated not ready,
> + * establish peer_wait connection if still needed.
> + */
> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
> +{
> + int connected;
> +
> + connected = unix_dgram_peer_wake_connect(sk, other);
> +
> + if (unix_recvq_full(other))
> + return 1;
> +
> + if (connected)
> + unix_dgram_peer_wake_disconnect(sk, other);
> +
> + return 0;
> +}
> +

So the comment above this function says 'needs unix state lock', however
the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
in unix_dgram_poll() has the 'sk' lock. So this looks racy.

Also, another tweak on this scheme: instead of calling
'__remove_wait_queue()' in unix_dgram_peer_wake_relay(), we could simply mark
each item in the queue as 'WQ_FLAG_EXCLUSIVE'. Then, since
'unix_dgram_recvmsg()' does an exclusive wakeup, the queue has effectively
been disabled (minus the first exclusive item in the list, which can just
return if it's marked exclusive). This means that in dgram_poll(), we add to
the list if we have not yet been added, and if we are already on the list, we
do a remove and then an add, clearing the exclusive flag. Thus, all the
waiters that need a wakeup are at the beginning of the queue, and the
disabled ones are at the end with the 'WQ_FLAG_EXCLUSIVE' flag set; a rough
sketch of the relay callback under this scheme follows below.

This does make the list potentially long, but if we only walk it to the point
where we are doing the wakeup, it has no impact. I like the fact that in this
scheme the wakeup doesn't have to call remove against a long list of waiters -
it's just setting the exclusive flag.
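
A rough sketch of the relay callback under this scheme, as mentioned above
(illustration only, untested; it assumes the client's wait_queue_t member is
named 'peer_wake' as in your patch, and that unix_dgram_recvmsg() keeps doing
an exclusive wakeup with nr_exclusive == 1):

static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode,
				      int flags, void *key)
{
	struct unix_sock *u = container_of(q, struct unix_sock, peer_wake);

	/* Already relayed once and not yet re-armed by poll(): consume the
	 * exclusive wakeup so the walk stops here instead of scanning the
	 * disabled tail of the queue.
	 */
	if (q->flags & WQ_FLAG_EXCLUSIVE)
		return 1;

	/* Relay the wakeup to our own waiters, then disable this entry by
	 * flagging it exclusive; poll() would re-arm it by removing it,
	 * clearing the flag and re-adding it at the head of the queue.
	 */
	q->flags |= WQ_FLAG_EXCLUSIVE;
	wake_up_interruptible_sync_poll(sk_sleep(&u->sk),
					POLLOUT | POLLWRNORM | POLLWRBAND);
	return 0;
}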

Thanks,

-Jason





>  static inline int unix_writable(struct sock *sk)
>  {
>   return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
> @@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int 
> embrion)
>   skpair->sk_state_change(skpair);
>   sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
>   }
> +
> + unix_dgram_peer_wake_disconnect(sk, skpair);
>   sock_put(skpair); /* It may now die */
>   unix_peer(sk) = NULL;
>   }
> @@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct 
> socket *sock, int kern)
>   INIT_LIST_HEAD(>link);
>   mutex_init(>readlock); /* single task reading lock */
>   init_waitqueue_head(>peer_wait);
> + init_waitqueue_func_entry(>peer_wake, unix_dgram_peer_wake_relay);
>   unix_insert_socket(unix_sockets_unbound(sk), sk);
>  out:
>   if (sk == NULL)
> @@ -1031,6 +1150,13 @@ restart:
>   if (unix_peer(sk)) {
>   struct sock *old_peer = unix_peer(sk);
>   unix_peer(sk) = other;
> +
> + if (unix_dgram_peer_wake_disconnect(sk, other))
> + wake_up_interruptible_poll(sk_sleep(sk),
> +POLLOUT |
> +POLLWRNORM |
> +POLLWRBAND);
> +
>   unix_state_double_unlock(sk, other);
>  
>   if (other != old_peer)
> @@ -1565,6 +1691,13 @@ restart:
>   unix_state_lock(sk);
>   if (unix_peer(sk) == other) {
>   unix_peer(sk) = NULL;
> +
> + if (unix_dgram_peer_wake_disconnect(sk, other))
> + wake_up_interruptible_poll(sk_sleep(sk),
> +POLLOUT |
> +POLLWRNORM |
> +POLLWRBAND);
> +
>   unix_state_unlock(sk);
>  
>   unix_dgram_disconnected(sk, other);
> @@ -1590,19 +1723,21 @@ restart:
>   goto out_unlock;
>   }
>  
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> - }
> + if (!unix_dgram_peer_recv_ready(sk, other)) {
> + if (timeo) {
> + timeo = unix_wait_for_peer(other, timeo);
>  
> - timeo = unix_wait_for_peer(other, timeo);
> + err = sock_intr_errno(timeo);
> + if (signal_pending(current))
> + goto out_free;
>  
> - err = sock_intr_errno(timeo);
> - if (signal_pending(current))
> - goto out_free;
> + goto restart;
> + }
>  
> - goto restart;
> + if (unix_dgram_peer_wake_me(sk, other)) {
> + err = -EAGAIN;
> + goto out_unlock;
> + }
>   }
>  
>   if (sock_flag(other, SOCK_RCVTSTAMP))
> @@ -2453,14 +2588,16 @@ static unsigned int unix_dgram_poll(struct 

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-09 Thread Jason Baron
On 11/09/2015 09:40 AM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
> 
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
> 
> Signed-off-by: Rainer Weikusuat <rweiku...@mobileactivedefense.com>

[...]

> @@ -1590,10 +1723,14 @@ restart:
>   goto out_unlock;
>   }
>  
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> + if (!unix_dgram_peer_recv_ready(sk, other)) {
>   if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> + if (unix_dgram_peer_wake_me(sk, other)) {
> + err = -EAGAIN;
> + goto out_unlock;
> + }
> +
> + goto restart;
>   }


So this will cause 'unix_state_lock(other)' to be called twice in a row if we
'goto restart' (and hence will softlock the box). It just needs a
'unix_state_unlock(other);' before the 'goto restart'.
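
I.e., something like this in that hunk (sketch only, just to show where the
missing unlock goes):

	if (!timeo) {
		if (unix_dgram_peer_wake_me(sk, other)) {
			err = -EAGAIN;
			goto out_unlock;
		}

		unix_state_unlock(other);
		goto restart;
	}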

I also tested this patch with a single unix server and 200 client threads
doing roughly epoll() followed by write() until -EAGAIN in a loop. The
throughput for the test was roughly the same as current upstream, but the cpu
usage was a lot higher. I think that's because this patch takes the server
wait queue lock in the _poll() routine, which causes a lot of contention. The
previous patch you posted for this, where you did not clear the wait queue on
every wakeup and thus didn't need the queue lock in poll() (unless we were
adding to it), performed much better.

However, the previous patch which tested better didn't add to the remote
queue when it was full on sendmsg() - so it wouldn't be correct for epoll ET.
Adding to the remote queue for every sendmsg() that fails does seem
undesirable if we aren't even doing poll(). So I'm not sure if just going
back to the previous patch is a great option either. I'm also not sure how
realistic my test case is. It would be great if we had some other workloads
to test against.

Thanks,

-Jason


Re: Use-after-free in ep_remove_wait_queue

2015-11-06 Thread Jason Baron
On 11/06/2015 08:06 AM, Dmitry Vyukov wrote:
> On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov  wrote:
>> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet  wrote:
>>> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:
>>>
 Probably the issue discussed in

   http://thread.gmane.org/gmane.linux.kernel/2057497/

 and previous related threads.

>>>
>>> Same issue, but Dmitry apparently did not trust me.
>>
>> Just wanted to make sure. Let's consider this as fixed.
> 
> 
> Is it meant to be fixed now?
> 
> I am still seeing use-after-free reports in ep_unregister_pollwait on
> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
> 

Hi,

Indeed - No formal patch has been posted for this yet - we've agreed
that Rainer is going to post the patch for this issue.

Thanks,

-Jason


Re: [RFC] unix: fix use-after-free in unix_dgram_poll()

2015-10-28 Thread Jason Baron
On 10/28/2015 12:46 PM, Rainer Weikusat wrote:
> Rainer Weikusat <r...@doppelsaurus.mobileactivedefense.com> writes:
>> Jason Baron <jba...@akamai.com> writes:
> 
> [...]
> 
>>> 2)
>>>
>>> For the case of epoll() in edge triggered mode we need to ensure that
>>> when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full()
>>> is true, we need to add a unix_peer_wake_connect() call to guarantee a
>>> wakeup. Otherwise, we are going to potentially hang there.
>>
>> I consider this necessary.
> 
> (As already discussed privately) just doing this would open up another
> way for sockets to be enqueued on the peer_wait queue of the peer
> forever despite no one wants to be notified of write space
> availability. Here's another RFC patch addressing the issues so far plus
> this one by breaking the connection to the peer socket from the wake up
> relaying function. This has the nice additional property that the
> dgram_poll code becomes somewhat simpler as the "dequeued where we
> didn't enqueue" situation can no longer occur and the not-so-nice
> additional property that the connect and disconnect functions need to
> take the peer_wait.lock spinlock explicitly so that this lock is used to
> ensure that no two threads modifiy the private pointer of the client
> wait_queue_t.

Hmm... I thought these were already all guarded by unix_state_lock(sk).
In any case, the rest of the patch looks good to me overall.

Thanks,

-Jason



Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-19 Thread Jason Baron
On 10/18/2015 04:58 PM, Rainer Weikusat wrote:

[...]

> 
> The idea behind 'the wait queue' (insofar I'm aware of it) is that it
> will be used as list of threads who need to be notified when the
> associated event occurs. Since you seem to argue that the run-of-the-mill
> algorithm is too slow for this particular case, is there anything to
> back this up?
>

Generally the poll() routines only add to a wait queue once at the beginning,
and all subsequent calls to poll() simply check the wakeup conditions. Here
you are proposing to add to and remove from the wait queue on subsequent
invocations of poll(), whereas the initial patch I did continued in the usual
pattern and only added once on registration or connect().

However, I do think that this is a special case since the registration is on
a shared wait queue, and thus having a long list of registered waiters is
going to affect all waiters. So I am fine with doing the add/removes in the
poll() routine, and I agree that the patch below is more compact than what I
initially posted.

A couple of notes on your patch:

1)

In the unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and thus
requires proper dereferencing. Something like:

struct unix_sock *u;
struct socket_wq *wq;

u = container_of(wait, struct unix_sock, wait);
rcu_read_lock();
wq = rcu_dereference(u->sk.sk_wq);
if (wq_has_sleeper(wq))
	wake_up_interruptible_sync_poll(&wq->wait, key);
rcu_read_unlock();


2)

For the case of epoll() in edge-triggered mode, when we return -EAGAIN from
unix_dgram_sendmsg() because unix_recvq_full() is true, we need to add a
unix_peer_wake_connect() call to guarantee a wakeup. Otherwise, we are
potentially going to hang there.
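
I.e., something along these lines in the !timeo path of unix_dgram_sendmsg()
(sketch only, reusing the helper name from your patch):

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		if (!timeo) {
			/* guarantee a relayed wakeup even though we return
			 * -EAGAIN, so that epoll ET still sees the POLLOUT
			 */
			unix_peer_wake_connect(sk, other);
			err = -EAGAIN;
			goto out_unlock;
		}

		/* ... blocking case unchanged ... */
	}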

With these changes (or tell me why they are not needed), I'm happy to
ack this patch.

Thanks,

-Jason


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-14 Thread Jason Baron

> 
> X-Signed-Off-By: Rainer Weikusat 
> 

Hi,

So the patches I've posted and yours both use the idea of relaying the remote
peer wakeup via callbacks that are internal to net/unix, such that we avoid
exposing the remote peer wakeup to the external poll()/select()/epoll()
interfaces. They differ in when and how those callbacks are
registered/unregistered.

Your approach here will generally keep the peer wait wakeup queue to its
absolute minimum, by removing from that queue when we set POLLOUT; however,
it requires taking the peer waitqueue lock on every poll() call. So there are
tradeoffs here vs. what I've posted. For example, if there are a lot of
writers against one 'server' socket, there is going to be a lot of lock
contention with your approach, so the performance is going to depend on the
workload that is tested.

I don't have a specific workload that I am trying to solve here, but
since you introduced the waiting on the remote peer queue, perhaps you
can post numbers comparing the patches that have been posted for the
workload that this was developed for? I will send you the latest version
of what I have privately - so as not to overly spam the list.

Thanks,

-Jason


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-13 Thread Jason Baron
On 10/12/2015 04:41 PM, Rainer Weikusat wrote:
> Jason Baron <jba...@akamai.com> writes:
>> On 10/05/2015 12:31 PM, Rainer Weikusat wrote:
> 
> [...]
> 
>>> Here's a more simple idea which _might_ work. The underlying problem
>>> seems to be that the second sock_poll_wait introduces a covert reference
>>> to the peer socket which isn't accounted for. The basic idea behind this
>>> is to execute an additional sock_hold for the peer whenever the
>>> sock_poll_wait is called for it and store it (the struct sock *) in a
>>> new struct unix_sock member.
> 
> [...]
> 
>> Interesting - will this work for the test case you supplied where we are in
>> epoll_wait() and another thread does a connect() to a new target? In that
>> case, even if we issue a wakeup to the epoll thread, its not going to have
>> a non-NULL poll_table, so it wouldn't be added to the new target. IE
>> the first test case here:
>>
>> https://lkml.org/lkml/2015/10/4/154
> 
> "Indeed it would not." I've also meanwhile found the time to check what
> is and isn't locked here and found that Eric's "this looks racy" was
> also justified. In theory, a clean solution could be based on modifying
> the various polling implementations to keep a piece of data for a polled
> something and provided that again on each subsequent poll call. This
> could then be used to keep the peer socket alive for as long as
> necessary were it possible to change the set of wait queues with every
> poll call. Since this also isn't the case, the idea to increment the
> reference count of the peer socket won't fly.
> 
> OTOH, something I seriously dislike about your relaying implementation
> is the unconditional add_wait_queue upon connect as this builds up a
> possibly large wait queue of entities entirely uninterested in the
> event which will need to be traversed even if peer_wait wakeups will
> only happen if at least someone actually wants to be notified. This
> could be changed such that the struct unix_sock member is only put onto
> the peer's wait queue in poll and only if it hasn't already been put
> onto it. The connection could then be severed if some kind of
> 'disconnect' occurs.
> 
> The code below (again based on 3.2.54) is what I'm currently running and
> it has survived testing during the day (without trying the exercise in
> hexadecimal as that doesn't cause failure for me, anyway). The wakeup
> relaying function checks that a socket wait queue actually still exists
> because I used to get null pointers oopses without every now and then
> (I've also tested this with an additional printk printing 'no q' in case
> the pointer was actually null to verify that this really occurs here).
> 

Hi,

What about the following race?

1) thread A does poll() on f, finds the wakeup condition low, and adds
itself to the remote peer_wait queue.

2) thread B sets the wake up condition in dgram_recvmsg(), but does not
execute the wakeup of threads yet.

3) thread C also does a poll() on f, finds now that the wakeup condition
is set, and hence removes f from the remote peer_wait queue.

Then, thread A misses the POLLOUT, and hangs.

I understand your concern about POLLIN-only waiters - I think we could add
the 'relay callback' in dgram_poll() only for those who are looking for
POLLOUT, and simply avoid the de-registration, as in practice I think it's
unlikely we are going to have a socket switching from POLLOUT to *only*
POLLIN. I suspect that will cover most of the cases that concern you?
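
For example, something like this in unix_dgram_poll() (sketch only;
poll_requested_events() and the helper names here are my assumptions, not
taken from your patch):

	other = unix_peer_get(sk);
	if (other) {
		/* only writers arm the peer_wait relay, and we skip the
		 * de-registration entirely
		 */
		if ((poll_requested_events(wait) &
		     (POLLOUT | POLLWRNORM | POLLWRBAND)) &&
		    unix_peer(other) != sk && unix_recvq_full(other))
			unix_dgram_peer_wake_connect(sk, other);
		sock_put(other);
	}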

Thanks,

-Jason



Re: [PATCH v4 0/3] net: unix: fix use-after-free

2015-10-12 Thread Jason Baron
On 10/11/2015 07:55 AM, David Miller wrote:
> From: Jason Baron <jba...@akamai.com>
> Date: Fri,  9 Oct 2015 00:15:59 -0400
> 
>> These patches are against mainline, I can re-base to net-next, please
>> let me know.
>>
>> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
>> which causes the use-after-free quite quickly and here:
>> https://lkml.org/lkml/2015/10/2/693.
> 

Hi,

> I'd like to understand how patches that don't even compile can be
> "tested"?
> 
> net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this 
> function)
> net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only 
> once for each function it appears in
> 
> Could you explain how that works, I'm having a hard time understanding
> this?
> 

Traveling this week, so responses a bit delayed.

Yes, I screwed up the posting. I had some outstanding code in my local tree
that was needed to make it compile, but I failed to refresh my patch series
with it before mailing the series out. So what I tested/built was not quite
what I mailed out.

As soon as I noticed this issue in patch 3/3 I re-posted it here:

http://marc.info/?l=linux-netdev=10355808472=2

in an attempt to avoid this confusion. I'm happy to re-post the series
or whatever makes things easiest for you.

> Also please address Hannes's feedback, thanks.
> 

I've replied directly to Hannes.

Thanks,

-Jason


Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()

2015-10-12 Thread Jason Baron
On 10/09/2015 10:38 AM, Hannes Frederic Sowa wrote:
> Hi,
> 
> Jason Baron <jba...@akamai.com> writes:
> 
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also 
>> calls
>> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
>> if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket p and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from the remote peer socket.
>>
>> The way that remote peer socket can be freed are:
>>
>> 1. If s calls connect() to a connect to a new socket other than p, it will
>> drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
>> the final reference to p, allowing it to be freed.
>>
>> Address this issue, by reverting unix_dgram_poll() to only register with
>> the wait queue associated with s and register a callback with the remote peer
>> socket on connect() that will wake up the wait queue associated with s. If
>> scenarios 1 or 2 occur above we then simply remove the callback from the
>> remote peer. This then presents the expected semantics to poll()/select()/
>> epoll().
>>
>> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
>> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
>>
>> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
>> DGRAM sockets").
>>
>> Tested-by: Mathias Krause <mini...@googlemail.com>
>> Signed-off-by: Jason Baron <jba...@akamai.com>
> 
> While I think this approach works, I haven't seen where the current code
> leaks a reference. Assignment to unix_peer(sk) in general take spin_lock
> and increment refcount. Are there bugs at the two places you referred
> to?
> 
> Is an easier fix just to use atomic_inc_not_zero(>sk_refcnt) in
> unix_peer_get() which could also help other places?
> 

Hi,

So we could potentially inc the refcnt on the remote peer such that the
remote peer is not freed before the socket that has connected to it. However,
the socket that has taken the reference against the peer socket then has to
potentially record a number of remote sockets (all the ones that it has
connected to over its lifetime), and drop all of their refcnts when it
finally closes.

The reason for this is that with the current code, when we do
poll()/select()/epoll() on a socket with a peer socket, those calls take a
reference on the peer socket. Specifically, they record the remote peer
whead, such that they can remove their callbacks when they return. So it's
not safe to just drop a reference on the remote peer when it closes, because
there might be outstanding poll()/select()/epoll() references pending.

Normally, poll()/select()/epoll() are waiting on a whead associated
directly with the fd/file that they are waiting for.

The other point here is that the way this patch structures things, when the
socket connects to a new remote and hence disconnects from an existing
remote, POLLOUT events will continue to be correctly delivered. That was not
possible with the current structure of things because there was no way to
inform poll to re-register with the remote peer whead. So that means that the
first test case here now works:

https://lkml.org/lkml/2015/10/4/154

Whereas with the old code, the test case would just hang forever.

So yes there is a bit of code churn here, but I think it moves the
code-base in a direction that not only solves this issue, but corrects
additional poll() behaviors as well.

Thanks,

-Jason





Re: [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()

2015-10-09 Thread Jason Baron
On 10/09/2015 12:29 AM, kbuild test robot wrote:
> Hi Jason,
> 
> [auto build test ERROR on v4.3-rc3 -- if it's inappropriate base, please 
> ignore]
> 
> config: x86_64-randconfig-i0-201540 (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>net/unix/af_unix.c: In function 'unix_dgram_writable':
>>> net/unix/af_unix.c:2465:3: error: 'other_full' undeclared (first use in 
>>> this function)
>  *other_full = false;
>   ^
>net/unix/af_unix.c:2465:3: note: each undeclared identifier is reported 
> only once for each function it appears in
> 


Forgot to refresh this patch before sending. The one that I tested with
is below.

Thanks,

-Jason




Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 92 +--
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
+#define UNIX_NOSPACE   2
struct socket_wqpeer_wq;
wait_queue_twait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..ac9bcd8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long 
timeo)
 
prepare_to_wait_exclusive(>peer_wait, , TASK_INTERRUPTIBLE);
 
+   set_bit(UNIX_NOSPACE, >flags);
+   /* Ensure that we either see space in the peer sk_receive_queue via the
+* unix_recvq_full() check below, or we receive a wakeup when it
+* empties. Pairs with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
if (unix_peer(other) != sk && unix_recvq_full(other)) {
if (!timeo) {
-   err = -EAGAIN;
-   goto out_unlock;
-   }
-
-   timeo = unix_wait_for_peer(other, timeo);
+   set_bit(UNIX_NOSPACE, _sk(other)->flags);
+   /* Ensure that we either see space in the peer
+* sk_receive_queue via the unix_recvq_full() check
+* below, or we receive a wakeup when it empties. This
+* makes sure that epoll ET triggers correctly. Pairs
+* with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
+   if (unix_recvq_full(other)) {
+   err = -EAGAIN;
+   goto out_unlock;
+   }
+   } else {
+   timeo = unix_wait_for_peer(other, timeo);
 
-   err = sock_intr_errno(timeo);
-   if (signal_pending(current))
-   goto out_free;
+   err = sock_intr_errno(timeo);
+   if (signal_pending(current))
+   goto out_free;
 
-   goto restart;
+   goto restart;
+   }
}
 
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(>peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   /* Ensure that waiters on our sk->sk_receive_queue draining that check
+* via unix_recvq_full() either see space in the queue or get a wakeup
+* below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+* call above. Pairs with the mb in unix_dgram_sendmsg(),
+* unix_dgram_poll(), and unix_wait_for_peer().
+*/
+   smp_mb();
+

[PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()

2015-10-08 Thread Jason Baron
Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 92 +--
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
+#define UNIX_NOSPACE   2
struct socket_wqpeer_wq;
wait_queue_twait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..05fbd00 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long 
timeo)
 
prepare_to_wait_exclusive(>peer_wait, , TASK_INTERRUPTIBLE);
 
+   set_bit(UNIX_NOSPACE, >flags);
+   /* Ensure that we either see space in the peer sk_receive_queue via the
+* unix_recvq_full() check below, or we receive a wakeup when it
+* empties. Pairs with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
if (unix_peer(other) != sk && unix_recvq_full(other)) {
if (!timeo) {
-   err = -EAGAIN;
-   goto out_unlock;
-   }
-
-   timeo = unix_wait_for_peer(other, timeo);
+   set_bit(UNIX_NOSPACE, _sk(other)->flags);
+   /* Ensure that we either see space in the peer
+* sk_receive_queue via the unix_recvq_full() check
+* below, or we receive a wakeup when it empties. This
+* makes sure that epoll ET triggers correctly. Pairs
+* with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
+   if (unix_recvq_full(other)) {
+   err = -EAGAIN;
+   goto out_unlock;
+   }
+   } else {
+   timeo = unix_wait_for_peer(other, timeo);
 
-   err = sock_intr_errno(timeo);
-   if (signal_pending(current))
-   goto out_free;
+   err = sock_intr_errno(timeo);
+   if (signal_pending(current))
+   goto out_free;
 
-   goto restart;
+   goto restart;
+   }
}
 
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(>peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   /* Ensure that waiters on our sk->sk_receive_queue draining that check
+* via unix_recvq_full() either see space in the queue or get a wakeup
+* below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+* call above. Pairs with the mb in unix_dgram_sendmsg(),
+* unix_dgram_poll(), and unix_wait_for_peer().
+*/
+   smp_mb();
+   if (test_bit(UNIX_NOSPACE, >flags)) {
+   clear_bit(UNIX_NOSPACE, >flags);
+   wake_up_interruptible_sync_poll(>peer_wait,
+   POLLOUT | POLLWRNORM |
+   POLLWRBAND);
+   }
 
if (msg->msg_name)
unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,25 @@ static unsigned int unix_poll(struct file *file, struct 
socket *sock, poll_table
return mask;
 }
 
+static bool unix_dgram_writable(struct sock *sk, struct sock *other,
+   bool *other_nospace)
+{
+   *other_full = false;
+
+   if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
+   *other_full = true;
+   return false;
+   }
+
+   retur

[PATCH v4 0/3] net: unix: fix use-after-free

2015-10-08 Thread Jason Baron
Hi,

These patches are against mainline, I can re-base to net-next, please
let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

v4:
-set UNIX_NOSPACE only if the peer socket has receive space

v3:
-beef up memory barrier comments in 3/3 (Peter Zijlstra)
-clean up unix_dgram_writable() function in 3/3 (Joe Perches)

Jason Baron (3):
  net: unix: fix use-after-free in unix_dgram_poll()
  net: unix: Convert gc_flags to flags
  net: unix: optimize wakeups in unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c| 124 --
 net/unix/garbage.c|  12 ++---
 3 files changed, 108 insertions(+), 32 deletions(-)

-- 
2.6.1



[PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()

2015-10-08 Thread Jason Baron
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also calls
sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then
a couple of code paths in which the remote peer socket p and its associated
peer_wait queue can be freed before poll()/select()/epoll() have a chance
to remove themselves from the remote peer socket.

The way that remote peer socket can be freed are:

1. If s calls connect() to a connect to a new socket other than p, it will
drop its reference on p, and thus a close() on p will free it.

2. If we call close on p(), then a subsequent sendmsg() from s, will drop
the final reference to p, allowing it to be freed.

Address this issue, by reverting unix_dgram_poll() to only register with
the wait queue associated with s and register a callback with the remote peer
socket on connect() that will wake up the wait queue associated with s. If
scenarios 1 or 2 occur above we then simply remove the callback from the
remote peer. This then presents the expected semantics to poll()/select()/
epoll().

I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().

Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
DGRAM sockets").

Tested-by: Mathias Krause <mini...@googlemail.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_twait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair = unix_peer(sk);
 
if (skpair != NULL) {
+   if (sk->sk_type != SOCK_STREAM)
+   remove_wait_queue(&unix_sk(skpair)->peer_wait,
+ &unix_sk(sk)->wait);
if (sk->sk_type == SOCK_STREAM || sk->sk_type == 
SOCK_SEQPACKET) {
unix_state_lock(skpair);
/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+   struct unix_sock *u;
+
+   u = container_of(wait, struct unix_sock, wait);
+   wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+   return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int 
kern)
 {
struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct 
socket *sock, int kern)
INIT_LIST_HEAD(>link);
mutex_init(>readlock); /* single task reading lock */
init_waitqueue_head(>peer_wait);
+   init_waitqueue_func_entry(&u->wait, peer_wake);
unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 */
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
+
+   remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+ &unix_sk(sk)->wait);
unix_peer(sk) = other;
+   add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
unix_state_double_unlock(sk, other);
 
if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
sock_put(old_peer);
} else {
unix_peer(sk) = other;
+   add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
unix_state_double_unlock(sk, other);
}
+   /* New remote may have created write space for us */
+   wake_up_interruptible_sync_poll(sk_sleep(sk),
+   POLLOUT | POLLWRNORM | POLLWRBAND);
return 0;
 
 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 
sock_hold(sk);
unix_peer(newsk)= sk;
+   if (sk->sk_type == SOCK_SEQPACKET)
+   add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
newsk->sk_state = TCP_ESTABLISHED;
newsk->sk_type  = sk->sk_type;
init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 
smp_mb__after_atomic(); /* sock

[PATCH v4 2/3] net: unix: Convert gc_flags to flags

2015-10-08 Thread Jason Baron
Convert gc_flags to flags in perparation for the subsequent patch, which will
make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c| 12 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
atomic_long_t   inflight;
spinlock_t  lock;
unsigned char   recursion_level;
-   unsigned long   gc_flags;
+   unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void 
(*func)(struct unix_sock *),
 * have been added to the queues after
 * starting the garbage collection
 */
-   if (test_bit(UNIX_GC_CANDIDATE, 
>gc_flags)) {
+   if (test_bit(UNIX_GC_CANDIDATE, 
>flags)) {
hit = true;
 
func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 * of the list, so that it's checked even if it was already
 * passed over
 */
-   if (test_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags))
+   if (test_bit(UNIX_GC_MAYBE_CYCLE, >flags))
list_move_tail(>link, _candidates);
 }
 
@@ -305,8 +305,8 @@ void unix_gc(void)
BUG_ON(total_refs < inflight_refs);
if (total_refs == inflight_refs) {
list_move_tail(>link, _candidates);
-   __set_bit(UNIX_GC_CANDIDATE, >gc_flags);
-   __set_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags);
+   __set_bit(UNIX_GC_CANDIDATE, >flags);
+   __set_bit(UNIX_GC_MAYBE_CYCLE, >flags);
}
}
 
@@ -332,7 +332,7 @@ void unix_gc(void)
 
if (atomic_long_read(>inflight) > 0) {
list_move_tail(>link, _cycle_list);
-   __clear_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags);
+   __clear_bit(UNIX_GC_MAYBE_CYCLE, >flags);
scan_children(>sk, inc_inflight_move_tail, NULL);
}
}
@@ -343,7 +343,7 @@ void unix_gc(void)
 */
while (!list_empty(_cycle_list)) {
u = list_entry(not_cycle_list.next, struct unix_sock, link);
-   __clear_bit(UNIX_GC_CANDIDATE, >gc_flags);
+   __clear_bit(UNIX_GC_CANDIDATE, >flags);
list_move_tail(>link, _inflight_list);
}
 
-- 
2.6.1



[PATCH v3 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()

2015-10-07 Thread Jason Baron
Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 85 ---
 2 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
+#define UNIX_NOSPACE   2
struct socket_wqpeer_wq;
wait_queue_twait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..66979d4 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long 
timeo)
 
prepare_to_wait_exclusive(>peer_wait, , TASK_INTERRUPTIBLE);
 
+   set_bit(UNIX_NOSPACE, >flags);
+   /* Ensure that we either see space in the peer sk_receive_queue via the
+* unix_recvq_full() check below, or we receive a wakeup when it
+* empties. Pairs with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
if (unix_peer(other) != sk && unix_recvq_full(other)) {
if (!timeo) {
-   err = -EAGAIN;
-   goto out_unlock;
-   }
+   set_bit(UNIX_NOSPACE, _sk(other)->flags);
+   /* Ensure that we either see space in the peer
+* sk_receive_queue via the unix_recvq_full() check
+* below, or we receive a wakeup when it empties. This
+* makes sure that epoll ET triggers correctly. Pairs
+* with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
+   if (unix_recvq_full(other)) {
+   err = -EAGAIN;
+   goto out_unlock;
+   }
+   } else {
+   timeo = unix_wait_for_peer(other, timeo);
 
-   timeo = unix_wait_for_peer(other, timeo);
+   err = sock_intr_errno(timeo);
+   if (signal_pending(current))
+   goto out_free;
 
-   err = sock_intr_errno(timeo);
-   if (signal_pending(current))
-   goto out_free;
-
-   goto restart;
+   goto restart;
+   }
}
 
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(>peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   /* Ensure that waiters which check for our sk->sk_receive_queue draining
+    * via unix_recvq_full() either see space in the queue or get a wakeup
+    * below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+    * call above. Pairs with the mb in unix_dgram_sendmsg(),
+    * unix_dgram_poll(), and unix_wait_for_peer().
+    */
+   smp_mb();
+   if (test_bit(UNIX_NOSPACE, >flags)) {
+   clear_bit(UNIX_NOSPACE, >flags);
+   wake_up_interruptible_sync_poll(>peer_wait,
+   POLLOUT | POLLWRNORM |
+   POLLWRBAND);
+   }
 
if (msg->msg_name)
unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,19 @@ static unsigned int unix_poll(struct file *file, struct 
socket *sock, poll_table
return mask;
 }
 
+static bool unix_dgram_writable(struct sock *sk, struct sock *other)
+{
+   if (other && unix_peer(other) != sk && unix_recvq_full(other))
+   return false;
+
+   return unix_writable(sk);
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
poll_table *wait)
 {
struct sock *sk = sock->sk, *other;
-   u

[PATCH v3 0/3] net: unix: fix use-after-free

2015-10-07 Thread Jason Baron
Hi,

These patches are against mainline, I can re-base to net-next, just
let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

v3:
-beef up memory barrier comments in 3/3 (Peter Zijlstra)
-clean up unix_dgram_writable() function in 3/3 (Joe Perches)

Jason Baron (3):
  net: unix: fix use-after-free in unix_dgram_poll()
  net: unix: Convert gc_flags to flags
  net: unix: optimize wakeups in unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c| 117 +++---
 net/unix/garbage.c|  12 +++---
 3 files changed, 101 insertions(+), 32 deletions(-)

-- 
2.6.1



[PATCH v3 1/3] net: unix: fix use-after-free in unix_dgram_poll()

2015-10-07 Thread Jason Baron
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also calls
sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then
a couple of code paths in which the remote peer socket p and its associated
peer_wait queue can be freed before poll()/select()/epoll() have a chance
to remove themselves from the remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to connect to a new socket other than p, it will
drop its reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop
the final reference to p, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with
the wait queue associated with s and register a callback with the remote peer
socket on connect() that will wake up the wait queue associated with s. If
scenario 1 or 2 above occurs, we then simply remove the callback from the
remote peer. This then presents the expected semantics to poll()/select()/
epoll().

I've implemented this for the socket types SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET,
but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().

Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
DGRAM sockets").

Tested-by: Mathias Krause <mini...@googlemail.com>
Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_twait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair = unix_peer(sk);
 
if (skpair != NULL) {
+   if (sk->sk_type != SOCK_STREAM)
+   remove_wait_queue(_sk(skpair)->peer_wait,
+ >wait);
if (sk->sk_type == SOCK_STREAM || sk->sk_type == 
SOCK_SEQPACKET) {
unix_state_lock(skpair);
/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+   struct unix_sock *u;
+
+   u = container_of(wait, struct unix_sock, wait);
+   wake_up_interruptible_sync_poll(sk_sleep(>sk), key);
+
+   return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int 
kern)
 {
struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct 
socket *sock, int kern)
INIT_LIST_HEAD(>link);
mutex_init(>readlock); /* single task reading lock */
init_waitqueue_head(>peer_wait);
+   init_waitqueue_func_entry(>wait, peer_wake);
unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 */
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
+
+   remove_wait_queue(_sk(old_peer)->peer_wait,
+ _sk(sk)->wait);
unix_peer(sk) = other;
+   add_wait_queue(_sk(other)->peer_wait, _sk(sk)->wait);
unix_state_double_unlock(sk, other);
 
if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
sock_put(old_peer);
} else {
unix_peer(sk) = other;
+   add_wait_queue(_sk(other)->peer_wait, _sk(sk)->wait);
unix_state_double_unlock(sk, other);
}
+   /* New remote may have created write space for us */
+   wake_up_interruptible_sync_poll(sk_sleep(sk),
+   POLLOUT | POLLWRNORM | POLLWRBAND);
return 0;
 
 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 
sock_hold(sk);
unix_peer(newsk)= sk;
+   if (sk->sk_type == SOCK_SEQPACKET)
+   add_wait_queue(_sk(sk)->peer_wait, _sk(newsk)->wait);
newsk->sk_state = TCP_ESTABLISHED;
newsk->sk_type  = sk->sk_type;
init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 
smp_mb__after_atomic(); /* sock

[PATCH v3 2/3] net: unix: Convert gc_flags to flags

2015-10-07 Thread Jason Baron
Convert gc_flags to flags in preparation for the subsequent patch, which will
make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c| 12 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
atomic_long_t   inflight;
spinlock_t  lock;
unsigned char   recursion_level;
-   unsigned long   gc_flags;
+   unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void 
(*func)(struct unix_sock *),
 * have been added to the queues after
 * starting the garbage collection
 */
-   if (test_bit(UNIX_GC_CANDIDATE, 
>gc_flags)) {
+   if (test_bit(UNIX_GC_CANDIDATE, 
>flags)) {
hit = true;
 
func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 * of the list, so that it's checked even if it was already
 * passed over
 */
-   if (test_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags))
+   if (test_bit(UNIX_GC_MAYBE_CYCLE, >flags))
list_move_tail(>link, _candidates);
 }
 
@@ -305,8 +305,8 @@ void unix_gc(void)
BUG_ON(total_refs < inflight_refs);
if (total_refs == inflight_refs) {
list_move_tail(>link, _candidates);
-   __set_bit(UNIX_GC_CANDIDATE, >gc_flags);
-   __set_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags);
+   __set_bit(UNIX_GC_CANDIDATE, >flags);
+   __set_bit(UNIX_GC_MAYBE_CYCLE, >flags);
}
}
 
@@ -332,7 +332,7 @@ void unix_gc(void)
 
if (atomic_long_read(>inflight) > 0) {
list_move_tail(>link, _cycle_list);
-   __clear_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags);
+   __clear_bit(UNIX_GC_MAYBE_CYCLE, >flags);
scan_children(>sk, inc_inflight_move_tail, NULL);
}
}
@@ -343,7 +343,7 @@ void unix_gc(void)
 */
while (!list_empty(_cycle_list)) {
u = list_entry(not_cycle_list.next, struct unix_sock, link);
-   __clear_bit(UNIX_GC_CANDIDATE, >gc_flags);
+   __clear_bit(UNIX_GC_CANDIDATE, >flags);
list_move_tail(>link, _inflight_list);
}
 
-- 
2.6.1



Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-05 Thread Jason Baron
On 10/05/2015 12:31 PM, Rainer Weikusat wrote:
> Jason Baron <jba...@akamai.com> writes:
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also 
>> calls
>> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
>> if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket p and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from the remote peer socket.
>>
>> The way that remote peer socket can be freed are:
>>
>> 1. If s calls connect() to a connect to a new socket other than p, it will
>> drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
>> the final reference to p, allowing it to be freed.
> 
> Here's a more simple idea which _might_ work. The underlying problem
> seems to be that the second sock_poll_wait introduces a covert reference
> to the peer socket which isn't accounted for. The basic idea behind this
> is to execute an additional sock_hold for the peer whenever the
> sock_poll_wait is called for it and store it (the struct sock *) in a
> new struct unix_sock member. Upon entering unix_dgram_poll, if the
> member is not NULL, it's cleared and a sock_put for its former value is
> done. The 'poll peer not NULL -> sock_put it' code is also added to the
> destructor, although I'm unsure if this is really necessary. The patch
> below also includes the additional SOCK_DEAD test suggested by Martin as
> that seems generally sensible to me.
> 
> NB: This has survived both Martin's and my test programs for a number
> of executions/ longer periods of time than was common before without
> generating list corruption warnings. The patch below is against 'my'
> 3.2.54 and is here provided as a suggestion in the hope that it will be
> useful for someting, not as patch submission, as I spent less time
> thinking through this than I should ideally have but despite of this,
> it's another 2.5 hours of my life spent on something completely
> different than what I should be working on at the moment.
> 
> --
> diff -pru linux-2-6/include/net/af_unix.h linux-2-6.p/include/net/af_unix.h
> --- linux-2-6/include/net/af_unix.h   2014-01-20 21:52:53.0 +
> +++ linux-2-6.p/include/net/af_unix.h 2015-10-05 15:11:20.270958297 +0100
> @@ -50,6 +50,7 @@ struct unix_sock {
>   struct vfsmount *mnt;
>   struct mutexreadlock;
>   struct sock *peer;
> + struct sock *poll_peer;
>   struct sock *other;
>   struct list_headlink;
>   atomic_long_t   inflight;
> diff -pru linux-2-6/net/unix/af_unix.c linux-2-6.p/net/unix/af_unix.c
> --- linux-2-6/net/unix/af_unix.c  2014-01-22 16:51:52.0 +
> +++ linux-2-6.p/net/unix/af_unix.c2015-10-05 17:05:28.358273567 +0100
> @@ -361,6 +361,9 @@ static void unix_sock_destructor(struct
>   if (u->addr)
>   unix_release_addr(u->addr);
>  
> + if (u->poll_peer)
> + sock_put(u->poll_peer);
> +
>   atomic_long_dec(_nr_socks);
>   local_bh_disable();
>   sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> @@ -625,6 +628,7 @@ static struct sock *unix_create1(struct
>   u = unix_sk(sk);
>   u->dentry = NULL;
>   u->mnt= NULL;
> + u->poll_peer = NULL;
>   spin_lock_init(>lock);
>   atomic_long_set(>inflight, 0);
>   INIT_LIST_HEAD(>link);
> @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
>  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>   poll_table *wait)
>  {
> - struct sock *sk = sock->sk, *other;
> - unsigned int mask, writable;
> + struct sock *sk = sock->sk, *other, *pp;
> + struct unix_sock *u;
> + unsigned int mask, writable, dead;
> +
> + u = unix_sk(sk);
> + pp = u->poll_peer;
> + if (pp) {
> + u->poll_peer = NULL;
> + sock_put(pp);
> + }
>  
>   sock_poll_wait(file, sk_sleep(sk), wait);
>   mask = 0;
> @@ -2170,7 +2182,20 @@ static unsigned int unix_dgram_poll(stru
>   other = unix_peer_get(sk);
>   if (other) {
>   if (unix_peer(other) != sk) {
> - sock_poll_wait(file, _sk(other)->peer_wait, wait);
> + unix

Re: [PATCH v2 3/3] af_unix: optimize the unix_dgram_recvmsg()

2015-10-05 Thread Jason Baron
On 10/05/2015 03:41 AM, Peter Zijlstra wrote:
> On Fri, Oct 02, 2015 at 08:44:02PM +0000, Jason Baron wrote:
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index f789423..b8ed1bc 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
> 
>> @@ -1079,6 +1079,9 @@ static long unix_wait_for_peer(struct sock *other, 
>> long timeo)
>>  
>>  prepare_to_wait_exclusive(>peer_wait, , TASK_INTERRUPTIBLE);
>>  
>> +set_bit(UNIX_NOSPACE, >flags);
>> +/* pairs with mb in unix_dgram_recv */
>> +smp_mb__after_atomic();
>>  sched = !sock_flag(other, SOCK_DEAD) &&
>>  !(other->sk_shutdown & RCV_SHUTDOWN) &&
>>  unix_recvq_full(other);
>> @@ -1623,17 +1626,22 @@ restart:
>>  
>>  if (unix_peer(other) != sk && unix_recvq_full(other)) {
>>  if (!timeo) {
>> +set_bit(UNIX_NOSPACE, _sk(other)->flags);
>> +/* pairs with mb in unix_dgram_recv */
>> +smp_mb__after_atomic();
>> +if (unix_recvq_full(other)) {
>> +err = -EAGAIN;
>> +goto out_unlock;
>> +}
>> +} else {
> 
>> @@ -1939,8 +1947,14 @@ static int unix_dgram_recvmsg(struct socket *sock, 
>> struct msghdr *msg,
>>  goto out_unlock;
>>  }
>>  
>> +/* pairs with unix_dgram_poll() and wait_for_peer() */
>> +smp_mb();
>> +if (test_bit(UNIX_NOSPACE, >flags)) {
>> +clear_bit(UNIX_NOSPACE, >flags);
>> +wake_up_interruptible_sync_poll(>peer_wait,
>> +POLLOUT | POLLWRNORM |
>> +POLLWRBAND);
>> +}
>>  
>>  if (msg->msg_name)
>>  unix_copy_addr(msg, skb->sk);
>> @@ -2468,20 +2493,19 @@ static unsigned int unix_dgram_poll(struct file 
>> *file, struct socket *sock,
>>  if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
>>  return mask;
>>  
>>  other = unix_peer_get(sk);
>> +if (unix_dgram_writable(sk, other)) {
>>  mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
>> +} else {
>>  set_bit(SOCK_ASYNC_NOSPACE, >sk_socket->flags);
>> +set_bit(UNIX_NOSPACE, _sk(other)->flags);
>> +/* pairs with mb in unix_dgram_recv */
>> +smp_mb__after_atomic();
>> +if (unix_dgram_writable(sk, other))
>> +mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
>> +}
>> +if (other)
>> +sock_put(other);
>>  
>>  return mask;
>>  }
> 
> 
> So I must object to these barrier comments; stating which other barrier
> they pair with is indeed good and required, but its not sufficient.
> 
> A barrier comment should also explain the data ordering -- the most
> important part.
> 
> As it stands its not clear to me these barriers are required at all, but
> this is not code I understand so I might well miss something obvious.
> 

So in patch 1/3 I've added sockets that call connect() onto the 'peer_wait'
queue of the server. The reason being that when the server reads from its
socket (unix_dgram_recvmsg()) it can wake up n clients that may be waiting
for the receive queue of the server to empty. This was previously
accomplished in unix_dgram_poll() via the:

sock_poll_wait(file, _sk(other)->peer_wait, wait);

which I've removed. The issue with that approach is that the 'other' socket
or server socket can close() and go away out from under the poll() call.
Thus, I've converted back unix_dgram_poll(), to simply wait on its
own socket's wait queue, which is the semantics that poll()/select()/
epoll() expect. However, I still needed a way to signal the client socket,
and thus I've added the callback on connect() in patch 1/3.

However, now that the client sockets are added during connect(), and
not poll() as was previously done, this means that any calls to
unix_dgram_recvmsg() have to walk the entire wakeup list, even if nobody
is sitting in poll().

Thus, I've introduced the UNIX_NOSPACE flag to signal to the server side
that we are in poll(), such that it doesn't have to potentially walk a long
list of wakeups for no reason. This makes a difference on this test
case:

http://www.spinics.net/lists/netdev/msg145533.html

which I found while working on this issue. And this patch restores the
performance for that test case.

So what we need to guarantee is that we either see t
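
As a purely illustrative userspace analogue of that guarantee (C11 atomics;
every name here is made up), the idea is that either the sender's re-check
of the queue observes the dequeue, or the receiver's check of the flag
observes the waiter and issues the wakeup; the barriers rule out both sides
missing:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int queue_len;	/* stands in for the peer's sk_receive_queue length */
static atomic_int nospace;	/* stands in for the UNIX_NOSPACE bit */

/* sender/poller side, before sleeping or returning -EAGAIN */
static bool sender_must_wait(int limit)
{
	atomic_store(&nospace, 1);			/* set_bit(UNIX_NOSPACE) */
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb__after_atomic() */
	return atomic_load(&queue_len) > limit;		/* unix_recvq_full() re-check */
}

/* receiver side, just after dequeueing one skb */
static bool receiver_must_wake(void)
{
	atomic_fetch_sub(&queue_len, 1);		/* __skb_recv_datagram() */
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_exchange(&nospace, 0);		/* test + clear -> wake POLLOUT */
}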

[PATCH] unix: fix use-after-free with unix_dgram_poll()

2015-10-02 Thread Jason Baron
From: Jason Baron <jba...@akamai.com>

The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we've called poll() on, but it also
calls sock_poll_wait() for a remote peer socket's wait queue, if it's connected.
Thus, if we call poll()/select()/epoll() for the socket s, there are then
a couple of code paths in which the remote peer socket s2 and its associated
peer_wait queue can be freed before poll()/select()/epoll() have a chance
to remove themselves from this remote peer socket s2's wait queue.

The remote peer's socket and associated wait queues can be freed via:

1. If s calls connect() to connect to a new socket other than s2, it will
drop its reference on s2, and thus a close() on s2 will free it.

2. If we call close() on s2, then a subsequent sendmsg() from s will drop
the final reference to s2, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with
the wait queue associated with s and simply drop the second sock_poll_wait()
registration for the remote peer socket wait queue. This then presents the
expected semantics to poll()/select()/epoll().

This works because we will continue to get POLLOUT wakeups from
unix_write_space(), which is called via sock_wfree(). In fact, we avoid having
two wakeup calls here for every buffer we read, since unix_dgram_recvmsg()
unconditionally calls wake_up_interruptible_sync_poll() on its 'peer_wait' queue
and we will no longer be polling against that queue. So I think this should be
more performant than the current code. And we avoid the second poll() call here
as well during registration.

unix_write_space() should probably be enhanced such that it checks for the
unix_recvq_full() condition as well. In fact, it should probably look for
some fraction of that buffer being free, as is done in unix_writable(). But I
consider that a separate enhancement from fixing this issue.
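
A rough sketch of the kind of check being suggested (entirely hypothetical,
not part of this patch), mirroring the fraction-based test that
unix_writable() applies to the send buffer, but for the peer's receive
queue:

/* hypothetical helper: only report write space once the peer's receive
 * queue has drained below half of its limit (unix_recvq_full() compares
 * the queue length against sk_max_ack_backlog)
 */
static inline bool unix_peer_recvq_has_space(struct sock *other)
{
	return skb_queue_len(&other->sk_receive_queue) <=
	       (other->sk_max_ack_backlog >> 1);
}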

I've tested this by specifically reproducing cases #1 and #2 above as well as
by running the test code here: https://lkml.org/lkml/2015/9/13/195

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/unix/af_unix.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..c1ae595 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2441,7 +2441,6 @@ static unsigned int unix_dgram_poll(struct file *file, 
struct socket *sock,
other = unix_peer_get(sk);
if (other) {
if (unix_peer(other) != sk) {
-   sock_poll_wait(file, _sk(other)->peer_wait, wait);
if (unix_recvq_full(other))
writable = 0;
}
-- 
1.8.2.rc2



Re: [PATCH] unix: fix use-after-free with unix_dgram_poll()

2015-10-02 Thread Jason Baron
On 10/02/2015 03:30 PM, Rainer Weikusat wrote:
> Jason Baron <jba...@akamai.com> writes:
>> From: Jason Baron <jba...@akamai.com>
>>
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we've called poll() on, but it also
>> calls sock_poll_wait() for a remote peer socket's wait queue, if it's 
>> connected.
>> Thus, if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket s2 and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from this remote peer socket s2's wait queue.
> 
> [...]
> 
>> This works because we will continue to get POLLOUT wakeups from
>> unix_write_space(), which is called via sock_wfree().
> 
> As pointed out in my original comment, this doesn't work (as far as I
> can/ could tell) because it will only wake up sockets which had a chance
> to enqueue datagrams to the queue of the receiving socket as only
> skbuffs enqueued there will be consumed. A socket which is really
> waiting for space in the receiving queue won't ever be woken up in this
> way.

Ok, good point. I was hoping to avoid a more complex approach here. I think
then that the patch I posted in the previous thread on this would address
this concern. I will post it for review.

> 
> Further, considering that you're demonstrably not interested in
> debugging and fixing this issue (as you haven't even bothered to post
> one of the test programs you claim to have), I'm beginning to wonder why
> this tripe is being sent to me at all --- it's not "git on autopilot"
> this time as someone took the time to dig up my current e-mail address
> as the one in the original commit is not valid anymore. Could you please
> refrain from such exercises in future unless a discussion is actually
> intended?
> 
> 

Just trying to help fix this.

Thanks,

-Jason



[PATCH v2 0/3] af_unix: fix use-after-free

2015-10-02 Thread Jason Baron
Hi,

These patches are against mainline, I can re-base to net-next, just
let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason


Jason Baron (3):
  unix: fix use-after-free in unix_dgram_poll()
  af_unix: Convert gc_flags to flags
  af_unix: optimize the unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c| 104 ++
 net/unix/garbage.c|  12 +++---
 3 files changed, 88 insertions(+), 32 deletions(-)

-- 
1.8.2.rc2



[PATCH v2 2/3] af_unix: Convert gc_flags to flags

2015-10-02 Thread Jason Baron
Convert gc_flags to flags in preparation for the subsequent patch, which will
make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c| 12 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
atomic_long_t   inflight;
spinlock_t  lock;
unsigned char   recursion_level;
-   unsigned long   gc_flags;
+   unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void 
(*func)(struct unix_sock *),
 * have been added to the queues after
 * starting the garbage collection
 */
-   if (test_bit(UNIX_GC_CANDIDATE, 
>gc_flags)) {
+   if (test_bit(UNIX_GC_CANDIDATE, 
>flags)) {
hit = true;
 
func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 * of the list, so that it's checked even if it was already
 * passed over
 */
-   if (test_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags))
+   if (test_bit(UNIX_GC_MAYBE_CYCLE, >flags))
list_move_tail(>link, _candidates);
 }
 
@@ -305,8 +305,8 @@ void unix_gc(void)
BUG_ON(total_refs < inflight_refs);
if (total_refs == inflight_refs) {
list_move_tail(>link, _candidates);
-   __set_bit(UNIX_GC_CANDIDATE, >gc_flags);
-   __set_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags);
+   __set_bit(UNIX_GC_CANDIDATE, >flags);
+   __set_bit(UNIX_GC_MAYBE_CYCLE, >flags);
}
}
 
@@ -332,7 +332,7 @@ void unix_gc(void)
 
if (atomic_long_read(>inflight) > 0) {
list_move_tail(>link, _cycle_list);
-   __clear_bit(UNIX_GC_MAYBE_CYCLE, >gc_flags);
+   __clear_bit(UNIX_GC_MAYBE_CYCLE, >flags);
scan_children(>sk, inc_inflight_move_tail, NULL);
}
}
@@ -343,7 +343,7 @@ void unix_gc(void)
 */
while (!list_empty(_cycle_list)) {
u = list_entry(not_cycle_list.next, struct unix_sock, link);
-   __clear_bit(UNIX_GC_CANDIDATE, >gc_flags);
+   __clear_bit(UNIX_GC_CANDIDATE, >flags);
list_move_tail(>link, _inflight_list);
}
 
-- 
1.8.2.rc2



[PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-02 Thread Jason Baron
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also calls
sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then
a couple of code paths in which the remote peer socket p and its associated
peer_wait queue can be freed before poll()/select()/epoll() have a chance
to remove themselves from the remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to connect to a new socket other than p, it will
drop its reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop
the final reference to p, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with
the wait queue associated with s and register a callback with the remote peer
socket on connect() that will wake up the wait queue associated with s. If
scenario 1 or 2 above occurs, we then simply remove the callback from the
remote peer. This then presents the expected semantics to poll()/select()/
epoll().

I've implemented this for the socket types SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET,
but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_twait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair = unix_peer(sk);
 
if (skpair != NULL) {
+   if (sk->sk_type != SOCK_STREAM)
+   remove_wait_queue(_sk(skpair)->peer_wait,
+ >wait);
if (sk->sk_type == SOCK_STREAM || sk->sk_type == 
SOCK_SEQPACKET) {
unix_state_lock(skpair);
/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+   struct unix_sock *u;
+
+   u = container_of(wait, struct unix_sock, wait);
+   wake_up_interruptible_sync_poll(sk_sleep(>sk), key);
+
+   return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int 
kern)
 {
struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct 
socket *sock, int kern)
INIT_LIST_HEAD(>link);
mutex_init(>readlock); /* single task reading lock */
init_waitqueue_head(>peer_wait);
+   init_waitqueue_func_entry(>wait, peer_wake);
unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 */
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
+
+   remove_wait_queue(_sk(old_peer)->peer_wait,
+ _sk(sk)->wait);
unix_peer(sk) = other;
+   add_wait_queue(_sk(other)->peer_wait, _sk(sk)->wait);
unix_state_double_unlock(sk, other);
 
if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
sock_put(old_peer);
} else {
unix_peer(sk) = other;
+   add_wait_queue(_sk(other)->peer_wait, _sk(sk)->wait);
unix_state_double_unlock(sk, other);
}
+   /* New remote may have created write space for us */
+   wake_up_interruptible_sync_poll(sk_sleep(sk),
+   POLLOUT | POLLWRNORM | POLLWRBAND);
return 0;
 
 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 
sock_hold(sk);
unix_peer(newsk)= sk;
+   if (sk->sk_type == SOCK_SEQPACKET)
+   add_wait_queue(_sk(sk)->peer_wait, _sk(newsk)->wait);
newsk->sk_state = TCP_ESTABLISHED;
newsk->sk_type  = sk->sk_type;
init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 
smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
unix_peer(sk)   = newsk;
+   if (sk->sk_type == SOCK_SEQPACKET)
+   add_wait_queue(_sk(newsk)->peer_wait, _s

[PATCH v2 3/3] af_unix: optimize the unix_dgram_recvmsg()

2015-10-02 Thread Jason Baron
Now that connect() permanently registers a callback routine, we can incur
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes that wakeup
conditional on there being waiters interested in wakeup events.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 72 ++-
 2 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
+#define UNIX_NOSPACE   2
struct socket_wqpeer_wq;
wait_queue_twait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..b8ed1bc 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,9 @@ static long unix_wait_for_peer(struct sock *other, long 
timeo)
 
prepare_to_wait_exclusive(>peer_wait, , TASK_INTERRUPTIBLE);
 
+   set_bit(UNIX_NOSPACE, >flags);
+   /* pairs with mb in unix_dgram_recv */
+   smp_mb__after_atomic();
sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);
@@ -1623,17 +1626,22 @@ restart:
 
if (unix_peer(other) != sk && unix_recvq_full(other)) {
if (!timeo) {
-   err = -EAGAIN;
-   goto out_unlock;
-   }
-
-   timeo = unix_wait_for_peer(other, timeo);
+   set_bit(UNIX_NOSPACE, _sk(other)->flags);
+   /* pairs with mb in unix_dgram_recv */
+   smp_mb__after_atomic();
+   if (unix_recvq_full(other)) {
+   err = -EAGAIN;
+   goto out_unlock;
+   }
+   } else {
+   timeo = unix_wait_for_peer(other, timeo);
 
-   err = sock_intr_errno(timeo);
-   if (signal_pending(current))
-   goto out_free;
+   err = sock_intr_errno(timeo);
+   if (signal_pending(current))
+   goto out_free;
 
-   goto restart;
+   goto restart;
+   }
}
 
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1947,14 @@ static int unix_dgram_recvmsg(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(>peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   /* pairs with unix_dgram_poll() and wait_for_peer() */
+   smp_mb();
+   if (test_bit(UNIX_NOSPACE, >flags)) {
+   clear_bit(UNIX_NOSPACE, >flags);
+   wake_up_interruptible_sync_poll(>peer_wait,
+   POLLOUT | POLLWRNORM |
+   POLLWRBAND);
+   }
 
if (msg->msg_name)
unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2446,22 @@ static unsigned int unix_poll(struct file *file, struct 
socket *sock, poll_table
return mask;
 }
 
+static bool unix_dgram_writable(struct sock *sk, struct sock *other)
+{
+   bool writable;
+
+   writable = unix_writable(sk);
+   if (other && unix_peer(other) != sk && unix_recvq_full(other))
+   writable = false;
+
+   return writable;
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
poll_table *wait)
 {
struct sock *sk = sock->sk, *other;
-   unsigned int mask, writable;
+   unsigned int mask;
 
sock_poll_wait(file, sk_sleep(sk), wait);
mask = 0;
@@ -2468,20 +2493,19 @@ static unsigned int unix_dgram_poll(struct file *file, 
struct socket *sock,
if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
return mask;
 
-   writable = unix_writable(sk);
other = unix_peer_get(sk);
-   if (other) {
-   if (unix_peer(other) != sk) {
-   if (unix_recvq_full(other))
-   writable = 0;
-   }
-   sock_put(other);
-   }
-
-   if (writable)
+   if (unix_dgram_writable(sk, other)) {
mask |= POLLO

Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-30 Thread Jason Baron
On 09/30/2015 03:34 AM, Michal Kubecek wrote:
> On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron <jba...@akamai.com> wrote:
>>> However, if we call connect on socket 's', to connect to a new socket 'o2', 
>>> we
>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>> race per se, but can be triggered sequentially.
>>
>> Sounds profound, but the reproducers calls connect only once per
>> socket. So there is no "connect to a new socket", no?
> 
> I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
> still connected to it. This is detected by 's' in unix_dgram_sendmsg()
> so that 's' releases its reference on 'o' and 'o' can be freed. If this
> happens before 's' is unregistered, we get use-after-free as 'o' has
> never been unregistered. And as the interval between freeing 'o' and
> unregistering 's' can be quite long, there is a chance for the memory to
> be reused. This is what one of our customers has seen:
> 
> [exception RIP: _raw_spin_lock_irqsave+156]
> RIP: 8040f5bc  RSP: 8800e929de78  RFLAGS: 00010082
> RAX: a32c  RBX: 88003954ab80  RCX: 1000
> RDX: f232  RSI: f232  RDI: 88003954ab80
> RBP: 5220   R8: dead00100100   R9: 
> R10: 7fff1a284960  R11: 0246  R12: 
> R13: 8800e929de8c  R14: 000e  R15: 
> ORIG_RAX:   CS: 1e030  SS: e02b
>  #8 [8800e929de70] _raw_spin_lock_irqsave at 8040f5a9
>  #9 [8800e929deb0] remove_wait_queue at 8006ad09
> #10 [8800e929ded0] ep_unregister_pollwait at 80170043
> #11 [8800e929def0] ep_remove at 80170073
> #12 [8800e929df10] sys_epoll_ctl at 80171453
> #13 [8800e929df80] system_call_fastpath at 80417553
> 
> In this case, crash happened on unregistering 's' which had null peer
> (i.e. not reconnected but rather disconnected) but there were still two
> items in the list, the other pointing to an unallocated page which has
> apparently been modified in between.
> 
> IMHO unix_dgram_disonnected() could be the place to handle this issue:
> it is called from both places where we disconnect from a peer (dead peer
> detection in unix_dgram_sendmsg() and reconnect in unix_dgram_connect())
> just before the reference to peer is released. I'm not familiar with the
> epoll implementation so I'm still trying to find what exactly needs to
> be done to unregister the peer at this moment.
> 

Indeed, that is a path as well. The patch I posted here deals with that
case too; it does a remove_wait_queue() in that case.

unix_dgram_disconnected() gets called as you point out when we are removing
the remote peer, and I have a remove_wait_queue() in that case. The patch
I posted converts us back to polling against a single wait queue.

The wait structure that epoll()/select()/poll() adds to the peer wait
queue is really opaque to the unix code. The normal pattern is for
epoll()/select()/poll() to do the unregister, not the socket/fd that
we are waiting on. Further, we could not just release all of the wait
queues in unix_dgram_disconnected() b/c there could be multiple waiters
there. So the POLLFREE thing really has to be done from the
unix_sock_destructor() path, since it is going to free the entire queue.

In addition, I think that removing the wait queue from
unix_dgram_disconnected() will still be broken, b/c we would need to
re-add to the remote peer via a subsequent poll() to get events if the
socket was re-connected to a new peer.

Thanks,

-Jason



Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-30 Thread Jason Baron
On 09/30/2015 01:54 AM, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron <jba...@akamai.com> wrote:
>> However, if we call connect on socket 's', to connect to a new socket 'o2', 
>> we
>> drop the reference on the original socket 'o'. Thus, we can now close socket
>> 'o' without unregistering from epoll. Then, when we either close the ep
>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>> race per se, but can be triggered sequentially.
> 
> Sounds profound, but the reproducers calls connect only once per
> socket. So there is no "connect to a new socket", no?
> But w/e, see below.

Yes, but it can be reproduced this way too. It can also happen with a
close() on the remote peer 'o', and a send to 'o' from 's', which the
reproducer can do, as pointed out by Michal. The patch I sent deals with
both cases.

Thanks,

-Jason


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-29 Thread Jason Baron
On 09/29/2015 02:09 PM, Mathias Krause wrote:
> On 14 September 2015 at 04:39, Eric Wong <normalper...@yhbt.net> wrote:
>> +cc Jason Baron since he might be able to provide more insight into
>> epoll.
>>
>> Mathias Krause <mini...@googlemail.com> wrote:
>>> Hi,
>>>
>>> this is an attempt to resurrect the thread initially started here:
>>>
>>>   http://thread.gmane.org/gmane.linux.network/353003
>>>
>>> As that patch fixed the issue for the mentioned reproducer, it did not
>>> fix the bug for the production code Olivier is using. :(
>>>
>>> Changing the reproducer only slightly allows me to trigger the following
>>> list debug splat (CONFIG_DEBUG_LIST=y) reliable within seconds -- even
>>> with the above linked patch applied:
>>>
>>> [   50.264249] [ cut here ]
>>> [   50.264249] WARNING: CPU: 0 PID: 214 at lib/list_debug.c:59 
>>> __list_del_entry+0xa4/0xd0()
>>> [   50.264249] list_del corruption. prev->next should be 88003c2c1bb8, 
>>> but was 88003f07bbb8
>>> [   50.264249] Modules linked in: ipv6 pcspkr serio_raw microcode 
>>> virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
>>> [   50.264249] CPU: 0 PID: 214 Comm: epoll_bug Not tainted 4.2.0 #75
>>> [   50.264249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.7.5-20140531_083030-gandalf 04/01/2014
>>> [   50.264249]  817e902e 88087d08 8155593c 
>>> 0007
>>> [   50.264249]  88087d58 88087d48 8105202a 
>>> 0001
>>> [   50.264249]  88003c2c1bb8 88003f07bb80 0286 
>>> 88003f736640
>>> [   50.264249] Call Trace:
>>> [   50.264249]  [] dump_stack+0x4c/0x65
>>> [   50.264249]  [] warn_slowpath_common+0x8a/0xc0
>>> [   50.264249]  [] warn_slowpath_fmt+0x46/0x50
>>> [   50.264249]  [] __list_del_entry+0xa4/0xd0
>>> [   50.264249]  [] list_del+0x11/0x40
>>> [   50.264249]  [] remove_wait_queue+0x29/0x40
>>> [   50.264249]  [] 
>>> ep_unregister_pollwait.isra.6+0x58/0x1a0
>>> [   50.264249]  [] ? 
>>> ep_unregister_pollwait.isra.6+0xa9/0x1a0
>>> [   50.264249]  [] ep_remove+0x22/0x110
>>> [   50.264249]  [] SyS_epoll_ctl+0x62b/0xf70
>>> [   50.264249]  [] ? lockdep_sys_exit_thunk+0x12/0x14
>>> [   50.264249]  [] entry_SYSCALL_64_fastpath+0x12/0x6f
>>> [   50.264249] ---[ end trace d9af9b915df9667e ]---
>>> [   50.572100] [ cut here ]
>>> [   50.572100] WARNING: CPU: 1 PID: 212 at lib/list_debug.c:62 
>>> __list_del_entry+0xc3/0xd0()
>>> [   50.584263] list_del corruption. next->prev should be 88003f664c90, 
>>> but was 88003f0cb5b8
>>> [   50.584263] Modules linked in: ipv6 pcspkr serio_raw microcode 
>>> virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
>>> [   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: GW   
>>> 4.2.0 #75
>>> [   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.7.5-20140531_083030-gandalf 04/01/2014
>>> [   50.584263]  817e902e 88003d37fce8 8155593c 
>>> 0007
>>> [   50.584263]  88003d37fd38 88003d37fd28 8105202a 
>>> 0001
>>> [   50.584263]  88003f664c90 88003f0cb580 0282 
>>> 88003f731740
>>> [   50.584263] Call Trace:
>>> [   50.584263]  [] dump_stack+0x4c/0x65
>>> [   50.584263]  [] warn_slowpath_common+0x8a/0xc0
>>> [   50.584263]  [] warn_slowpath_fmt+0x46/0x50
>>> [   50.584263]  [] __list_del_entry+0xc3/0xd0
>>> [   50.584263]  [] list_del+0x11/0x40
>>> [   50.584263]  [] remove_wait_queue+0x29/0x40
>>> [   50.584263]  [] 
>>> ep_unregister_pollwait.isra.6+0x58/0x1a0
>>> [   50.584263]  [] ? 
>>> ep_unregister_pollwait.isra.6+0xa9/0x1a0
>>> [   50.584263]  [] ep_remove+0x22/0x110
>>> [   50.584263]  [] eventpoll_release_file+0x62/0xa0
>>> [   50.584263]  [] __fput+0x1af/0x200
>>> [   50.584263]  [] ? int_very_careful+0x5/0x3f
>>> [   50.584263]  [] fput+0xe/0x10
>>> [   50.584263]  [] task_work_run+0x8d/0xc0
>>> [   50.584263]  [] do_notify_resume+0x4f/0x60
>>> [   50.584263]  [] int_signal+0x12/0x17
>>> [   50.584263] ---[ end trace d9af9b915df9667f ]---
>>> [   50.584263] BUG: spinlock already unlocked on CPU#1, epoll_bug/212
>>

Re: [PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set

2015-08-21 Thread Jason Baron


On 08/11/2015 01:59 PM, Jason Baron wrote:
 
 
 On 08/11/2015 12:12 PM, Eric Dumazet wrote:
 On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:


 Yes, so the test case I'm using to test against is somewhat contrived.
 In that I am simply allocating around 40,000 sockets that are idle to
 create a 'permanent' memory pressure in the background. Then, I have
 just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.

 That said, we encountered this issue initially where we had 10,000+
 flows and whenever the system would get into memory pressure, we would
 see all the cpus spin at 100%.

 So the testcase I wrote, was just a simplistic version for testing. But
 I am going to try and test against the more realistic workload where
 this issue was initially observed.


 Note that I am still trying to understand why we need to increase socket
 structure, for something which is inherently a problem of sharing memory
 with an unknown (potentially big) number of sockets.

 
 I was trying to mirror the wakeups when SO_SNDBUF is not set, where we
 continue to trigger on 1/3 of the buffer being available, as the
 sk->sndbuf is shrunk. And I saw this value as dynamic depending on
 number of sockets and read/write buffer usage. So that's where I was
 coming from with it.
 
 Also, at least with the .config I have the tcp_sock structure didn't
 increase in size (although struct sock did go up by 8 and not 4).
 
 I suggested to use a flag (one bit).

 If set, then we should fallback to tcp_wmem[0] (each socket has 4096
 bytes, so that we can avoid starvation)



 
 Ok, I will test this approach.

Hi Eric,

So I created a test here with 20,000 streams, and if I set SO_SNDBUF
high enough on the server side, I can create tcp memory pressure above
tcp_mem[2]. In this case, with the 'one bit' approach using tcp_wmem[0]
as the wakeup threshold, I can still observe the 100% cpu spinning issue,
but with this v2 patch, cpu usage is minimal (1-2%). This is because we don't
guarantee tcp_wmem[0] above tcp_mem[2]. So using the 'one bit' approach
definitely alleviates the spinning between tcp_mem[1] and tcp_mem[2],
but not above tcp_mem[2] in my testing.

Maybe nobody cares about this case (you are getting what you ask for by
using SO_SNDBUF), but it seems to me that it would be nice to avoid this
sort of behavior. I also like the fact that with the
sk_effective_sndbuf, we keep doing wakeups on 1/3 of the write buffer
emptying, which keeps the wakeup behavior consistent. In theory this
would matter for a high latency and bandwidth link, but in the testing I
did, I didn't observe any throughput differences between this v2 patch,
and the 'one bit' approach.

As I mentioned with this v2, the 'struct sock' grows by 4 bytes, but
struct tcp_sock does not increase. So since this is tcp specific, we
could add the sk_effective_sndbuf only to the struct tcp_sock.

So the 'one bit' approach definitely seems to me to be an improvement,
but I wanted to get feedback on this testing, before deciding how to
proceed.

Thanks,

-Jason


[PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set

2015-08-11 Thread Jason Baron
From: Jason Baron jba...@akamai.com

When SO_SNDBUF is set and we are under tcp memory pressure, the effective write
buffer space can be much lower than what was set using SO_SNDBUF. For example,
we may have set the buffer to 100kb, but we may only be able to write 10kb. In
this scenario, poll()/select()/epoll() are going to continuously return POLLOUT,
followed by -EAGAIN from write() in a very tight loop.
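
For reference, a rough sketch of the application-side loop in question
(illustrative only, no error handling; assumes fd is a non-blocking,
connected TCP socket with SO_SNDBUF set):

#include <errno.h>
#include <poll.h>
#include <unistd.h>

static void send_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		ssize_t n;

		/* under memory pressure this returns POLLOUT immediately ... */
		poll(&pfd, 1, -1);
		n = write(fd, buf + off, len - off);
		if (n > 0)
			off += n;
		else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
			break;
		/* ... while write() keeps returning -EAGAIN, so the loop
		 * spins at 100% cpu until the memory pressure eases
		 */
	}
}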

Introduce sk->sk_effective_sndbuf, such that we can track the 'effective' size
of the sndbuf, when we have a short write due to memory pressure. By using the
sk->sk_effective_sndbuf instead of the sk->sk_sndbuf when we are under memory
pressure, we can delay the POLLOUT until 1/3 of the buffer clears as we normally
do. There is no issue here when SO_SNDBUF is not set, since the tcp layer will
auto tune the sk->sndbuf.

Although sk->sk_effective_sndbuf, and even sk->sk_sndbuf, can change in
subsequent calls to tcp_poll() (queued via wakeups), we can guarantee that we
will eventually satisfy the write space check in all cases, since the write
queue will eventually drain to 0, and sk->sk_sndbuf and sk->sk_effective_sndbuf
will be non-zero when used in the write space check. If the write queue does not
drain to 0, then the client is not responding, and we will eventually time out
the socket.

In my testing, this brought a single thread's cpu usage down from 100% to ~1%
while maintaining the same level of throughput.

Signed-off-by: Jason Baron jba...@akamai.com
---
Changes from V1:
-Add READ_ONCE() in sk_stream_wspace() (Eric Dumazet)
-Restrict sk_set_effective_sndbuf() to the write() path
---
 include/net/sock.h | 16 
 net/core/sock.c|  1 +
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c | 20 +++-
 4 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 43c6abc..86598e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -243,6 +243,8 @@ struct cg_proto;
   *@sk_pacing_rate: Pacing rate (if supported by transport/packet 
scheduler)
   *@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *@sk_sndbuf: size of send buffer in bytes
+  *@sk_effective_sndbuf: Less than or equal to the size of sk_sndbuf when
+  *  under memory pressure, 0 otherwise.
   *@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
@@ -380,6 +382,7 @@ struct sock {
atomic_tsk_wmem_alloc;
atomic_tsk_omem_alloc;
int sk_sndbuf;
+   int sk_effective_sndbuf;
struct sk_buff_head sk_write_queue;
kmemcheck_bitfield_begin(flags);
unsigned intsk_shutdown  : 2,
@@ -779,6 +782,14 @@ static inline bool sk_acceptq_is_full(const struct sock 
*sk)
return sk-sk_ack_backlog  sk-sk_max_ack_backlog;
 }
 
+static inline void sk_set_effective_sndbuf(struct sock *sk)
+{
+   if (sk-sk_wmem_queued  sk-sk_sndbuf)
+   sk-sk_effective_sndbuf = sk-sk_sndbuf;
+   else
+   sk-sk_effective_sndbuf = sk-sk_wmem_queued;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
@@ -789,6 +800,11 @@ static inline int sk_stream_min_wspace(const struct sock 
*sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
+   int effective_sndbuf = READ_ONCE(sk-sk_effective_sndbuf);
+
+   if (effective_sndbuf)
+   return effective_sndbuf - sk-sk_wmem_queued;
+
return sk-sk_sndbuf - sk-sk_wmem_queued;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d..4fce879 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2309,6 +2309,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk-sk_allocation   =   GFP_KERNEL;
sk-sk_rcvbuf   =   sysctl_rmem_default;
sk-sk_sndbuf   =   sysctl_wmem_default;
+   sk-sk_effective_sndbuf =   0;
sk-sk_state=   TCP_CLOSE;
sk_set_socket(sk, sock);
 
diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a..7c175e7 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
if (sk_stream_is_writeable(sk)  sock) {
clear_bit(SOCK_NOSPACE, sock-flags);
+   sk-sk_effective_sndbuf = 0;
 
rcu_read_lock();
wq = rcu_dereference(sk-sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 45534a5..d935ad5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -923,8 +923,10 @@ new_segment:
 
skb = sk_stream_alloc_skb(sk, 0, sk-sk_allocation,
  
skb_queue_empty(sk-sk_write_queue

Re: [PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set

2015-08-11 Thread Jason Baron


On 08/11/2015 10:49 AM, Eric Dumazet wrote:
 On Tue, 2015-08-11 at 14:38 +, Jason Baron wrote:
 From: Jason Baron jba...@akamai.com
 
 In my testing, this brought a single threaad's cpu usage down from 100% to 
 ~1%
 while maintaining the same level of throughput.

 
 Hi Jason. Could you give more details on this test ?
 
 How many flows are competing ?
 


Yes, so the test case I'm using to test against is somewhat contrived,
in that I am simply allocating around 40,000 sockets that are idle to
create a 'permanent' memory pressure in the background. Then, I have
just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.

That said, we encountered this issue initially where we had 10,000+
flows and whenever the system would get into memory pressure, we would
see all the cpus spin at 100%.

So the testcase I wrote was just a simplistic version for testing. But
I am going to try and test against the more realistic workload where
this issue was initially observed.

Thanks,

-Jason



Re: [PATCH net-next v2] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set

2015-08-11 Thread Jason Baron


On 08/11/2015 12:12 PM, Eric Dumazet wrote:
 On Tue, 2015-08-11 at 11:03 -0400, Jason Baron wrote:
 

 Yes, so the test case I'm using to test against is somewhat contrived.
 In that I am simply allocating around 40,000 sockets that are idle to
 create a 'permanent' memory pressure in the background. Then, I have
 just 1 flow that sets SO_SNDBUF, which results in the: poll(), write() loop.

 That said, we encountered this issue initially where we had 10,000+
 flows and whenever the system would get into memory pressure, we would
 see all the cpus spin at 100%.

 So the testcase I wrote, was just a simplistic version for testing. But
 I am going to try and test against the more realistic workload where
 this issue was initially observed.

 
 Note that I am still trying to understand why we need to increase socket
 structure, for something which is inherently a problem of sharing memory
 with an unknown (potentially big) number of sockets.
 

I was trying to mirror the wakeups when SO_SNDBUF is not set, where we
continue to trigger on 1/3 of the buffer being available, as the
sk->sndbuf is shrunk. And I saw this value as dynamic depending on
number of sockets and read/write buffer usage. So that's where I was
coming from with it.

Also, at least with the .config I have the tcp_sock structure didn't
increase in size (although struct sock did go up by 8 and not 4).

 I suggested to use a flag (one bit).
 
 If set, then we should fallback to tcp_wmem[0] (each socket has 4096
 bytes, so that we can avoid starvation)
 
 
 

Ok, I will test this approach.

Thanks,

-Jason

