Re: Reference counting struct inet_peer

2017-01-23 Thread Julian Anastasov

Hello,

On Mon, 23 Jan 2017, David Windsor wrote:

> Hi,
> 
> I'm working on a patchset that adds overflow protection to kernel
> reference counters, as part of the KSPP effort.  We're introducing a
> new type, tentatively called refcount_t, that will ultimately replace
> atomic_t as the type used for kernel reference counters.  refcount_t
> has a constrained interface relative to atomic_t and stores reference
> counts as unsigned integers.
> 
> While performing an audit of kernel reference counters, we've come
> upon a few corner cases that we're unable to cleanly migrate to
> refcount_t.  One of these is the reference counting scheme for struct
> inet_peer.

...

> We're also seeing the same thing (freeing shared objects when their
> refcount becomes -1) in ip_vs.h:
> 
> http://lxr.free-electrons.com/source/include/net/ip_vs.h#L1424
> 
> static inline void ip_vs_dest_put_and_free(struct ip_vs_dest *dest)
> {
> if (atomic_dec_return(&dest->refcnt) < 0)
> kfree(dest);
> }

I think this is easy to fix. The problem is that
dest_trash currently holds deleted dests (unlinked from RCU lists)
with refcnt=0. If we change dest_trash to hold dests
with refcnt=1, the above atomic_dec_return can be changed to
atomic_dec_and_test. The change should be small: ip_vs_dest_put
should be removed from __ip_vs_del_dest(), ip_vs_dest_hold()
from ip_vs_trash_get_dest(), and the refcnt check in
ip_vs_dest_trash_expire() should be updated. Let me know if
this blocks your work, I can provide such patch to fix it.

Regards

--
Julian Anastasov 


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 10:59:22AM -0800, Wei Wang wrote:
> This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
> alternative way to perform Fast Open on the active side (client).

Wei, I think that nothing prevents us from reusing the original TCP_FASTOPEN
sockopt instead of adding a new one. The original one does this :

case TCP_FASTOPEN:
if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
TCPF_LISTEN))) {
tcp_fastopen_init_key_once(true);

fastopen_queue_tune(sk, val);
} else {
err = -EINVAL;
}
break;

and your new option does this :

case TCP_FASTOPEN_CONNECT:
if (val > 1 || val < 0) {
err = -EINVAL;
} else if (sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) {
if (sk->sk_state == TCP_CLOSE)
tp->fastopen_connect = val;
else
err = -EINVAL;
} else {
err = -EOPNOTSUPP;
}
break;

Now if we compare :
  - the value ranges are the same (0,1)
  - tcp_fastopen_init_key_once() only performs an initialization once
  - fastopen_queue_tune() only sets sk->max_qlen based on the backlog,
this has no effect on an outgoing connection ;
  - tp->fastopen_connect can be applied to a listening socket without
side effect.

Thus I think we can merge them this way :

case TCP_FASTOPEN:
if (val >= 0) {
if ((sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) &&
(sk->sk_state == TCP_CLOSE))
tp->fastopen_connect = val;

if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) {
tcp_fastopen_init_key_once(true);
fastopen_queue_tune(sk, val);
}
} else {
err = -EINVAL;
}
break;

And for the userland, the API is even simpler because we can use the
same TCP_FASTOPEN sockopt regardless of the socket direction. Also,
I don't know if TCP_FASTOPEN is supported on simultaneous connect,
but at least if it works it would be easier to understand this way.

Do you think there's a compelling reason for adding a new option or
are you interested in a small patch to perform the change above ?

Regards,
Willy


Re: [PATCH] Bluetooth: hidp: might sleep error in hidp_session_thread

2017-01-23 Thread jeffy

Hi brian,

On 01/24/2017 10:31 AM, Brian Norris wrote:

Hi Jeffy,

On Fri, Jan 20, 2017 at 09:52:08PM +0800, Jeffy Chen wrote:

[   39.044329] do not call blocking ops when !TASK_RUNNING; state=1 set
at [] hidp_session_thread+0x110/0x568 [hidp]
...
[   40.159664] Call trace:
[   40.162122] [] __might_sleep+0x64/0x90
[   40.167443] [] lock_sock_nested+0x30/0x78
[   40.173047] [] l2cap_sock_sendmsg+0x90/0xf0
[bluetooth]
[   40.179842] [] sock_sendmsg+0x4c/0x68
[   40.185072] [] kernel_sendmsg+0x54/0x68
[   40.190477] [] hidp_send_frame+0x78/0xa0 [hidp]
[   40.196574] [] hidp_process_transmit+0x44/0x98
[hidp]
[   40.203191] [] hidp_session_thread+0x364/0x568
[hidp]

Am I crazy, or are several other protocols broken like this too? I see a
similar structure in net/bluetooth/bnep/core.c and
net/bluetooth/cmtp/core.c, at least, each of which also call
kernel_sendmsg(), which might be an l2cap socket (...I think? I'm not a
bluetooth expert really).
Thanks, I uploaded a new series of patches, covering hidp, cmtp &
bnep: 9534023/9534025/9534027



Following the approach described in https://lwn.net/Articles/628628/.

Signed-off-by: Jeffy Chen 
---

  net/bluetooth/hidp/core.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 0bec458..bfd3fb8 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -1180,7 +1180,9 @@ static void hidp_session_run(struct hidp_session *session)
struct sock *ctrl_sk = session->ctrl_sock->sk;
struct sock *intr_sk = session->intr_sock->sk;
struct sk_buff *skb;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
  
+	add_wait_queue(sk_sleep(intr_sk), &wait);

for (;;) {
/*
 * This thread can be woken up two ways:
@@ -1188,12 +1190,10 @@ static void hidp_session_run(struct hidp_session 
*session)
 *session->terminate flag and wakes this thread up.
 *  - Via modifying the socket state of ctrl/intr_sock. This
 *thread is woken up by ->sk_state_changed().
-*
-* Note: set_current_state() performs any necessary
-* memory-barriers for us.
 */
-   set_current_state(TASK_INTERRUPTIBLE);
  
+		/* Ensure session->terminate is updated */

+   smp_mb__before_atomic();
if (atomic_read(&session->terminate))
break;
  
@@ -1227,11 +1227,14 @@ static void hidp_session_run(struct hidp_session *session)

hidp_process_transmit(session, &session->ctrl_transmit,
  session->ctrl_sock);
  
-		schedule();

+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);

I think this looks mostly good, except what about the
hidp_session_terminate() condition? In that case, you're running
wake_up_process() -- which won't set WQ_FLAG_WOKEN for us. So what
happens if we see a hidp_session_terminate() call in between the check
for the ->terminate count, but before we call wait_woken()? IIUC, I
think we'll just ignore the call and keep waiting for the next wake
signal.

I think you might need to rework hidp_session_terminate() too, to
actually target the wait queue and not just the processes.

IIUC, of course. I could be wrong...

Ok, that makes sense, thanks for pointing that out.


Brian


}
+   remove_wait_queue(sk_sleep(intr_sk), &wait);
  
  	atomic_inc(&session->terminate);

-   set_current_state(TASK_RUNNING);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
  }
  
  /*

--
2.1.4
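
One way to close the race Brian describes (a kernel-style sketch, not
compilable standalone — it assumes the DEFINE_WAIT_FUNC/wait_woken
conversion from the patch above, and that the interrupt socket's wait
queue is the one the session thread sleeps on) is to have termination
wake the wait queue rather than only the task, so woken_wake_function
sets WQ_FLAG_WOKEN and a terminate racing with the flag check is not
lost:

```c
/* Sketch: route termination through the wait queue that wait_woken()
 * sleeps on.  wake_up_interruptible() invokes woken_wake_function for
 * the queued entry, setting WQ_FLAG_WOKEN, so a wakeup issued between
 * the terminate check and wait_woken() is remembered. */
static void hidp_session_terminate(struct hidp_session *session)
{
	atomic_inc(&session->terminate);
	/* Pairs with the barrier before the terminate check in the loop. */
	wake_up_interruptible(sk_sleep(session->intr_sock->sk));
}
```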


Re: [patch net-next 1/4] net: Introduce psample, a new genetlink channel for packet sampling

2017-01-23 Thread Jiri Pirko
Tue, Jan 24, 2017 at 12:57:50AM CET, step...@networkplumber.org wrote:
>On Sun, 22 Jan 2017 12:44:44 +0100
>Jiri Pirko  wrote:
>
>> +static LIST_HEAD(psample_groups_list);
>> +static DEFINE_SPINLOCK(psample_groups_lock);
>> +
>
>Why not a mutex? You aren't acquiring this in IRQ context?

Well, that is not an argument for a mutex. A mutex is much heavier and does
not need to be used when not necessary (i.e. when you don't need to sleep).



Re: [patch] samples/bpf: silence shift wrapping warning

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 5:27 AM, Arnaldo Carvalho de Melo
 wrote:
> On Sun, Jan 22, 2017 at 02:51:25PM -0800, Alexei Starovoitov wrote:
>> On Sat, Jan 21, 2017 at 07:51:43AM +0300, Dan Carpenter wrote:
>> > max_key is a value in the 0-63 range, so on 32 bit systems the shift
>> > could wrap.
>> >
>> > Signed-off-by: Dan Carpenter 
>>
>> Looks fine. I think 'net-next' is ok.
>
> I could process these patches, if that would help,

Thanks for the offer.
I don't think there will be conflicts with all the work happening in net-next,
but it's best to avoid even the possibility of it when we can.
Dan,
can you please resend the patch cc-ing Dave and netdev ?
please mention [PATCH net-next] in the subject.

> - Arnaldo
>
>> Acked-by: Alexei Starovoitov 
>
>> > diff --git a/samples/bpf/lwt_len_hist_user.c 
>> > b/samples/bpf/lwt_len_hist_user.c
>> > index ec8f3bb..bd06eef 100644
>> > --- a/samples/bpf/lwt_len_hist_user.c
>> > +++ b/samples/bpf/lwt_len_hist_user.c
>> > @@ -68,7 +68,7 @@ int main(int argc, char **argv)
>> > for (i = 1; i <= max_key + 1; i++) {
>> > stars(starstr, data[i - 1], max_value, MAX_STARS);
>> > printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
>> > -  (1l << i) >> 1, (1l << i) - 1, data[i - 1],
>> > +  (1ULL << i) >> 1, (1ULL << i) - 1, data[i - 1],
>> >MAX_STARS, starstr);
>> > }
>> >


[PATCH] ibmveth: Add a proper check for the availability of the checksum features

2017-01-23 Thread Thomas Huth
When using the ibmveth driver in a KVM/QEMU based VM, it currently
always prints out a scary error message like this when it is started:

 ibmveth 7103 (unregistered net_device): unable to change
 checksum offload settings. 1 rc=-2 ret_attr=7103

This happens because the driver always tries to enable the checksum
offloading without checking for the availability of this feature first.
QEMU does not support checksum offloading for the spapr-vlan device,
thus we always get the error message here.
According to the LoPAPR specification, the "ibm,illan-options" property
of the corresponding device tree node should be checked first to see
whether the H_ILLAN_ATTRIBUTES hypercall and thus the checksum offloading
feature is available. Thus let's do this in the ibmveth driver, too, so
that the error message is really only limited to cases where something
goes wrong, and does not occur if the feature is just missing.

Signed-off-by: Thomas Huth 
---
 drivers/net/ethernet/ibm/ibmveth.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
b/drivers/net/ethernet/ibm/ibmveth.c
index a831f94..309f5c6 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1601,8 +1601,11 @@ static int ibmveth_probe(struct vio_dev *dev, const 
struct vio_device_id *id)
netdev->netdev_ops = &ibmveth_netdev_ops;
netdev->ethtool_ops = &ibmveth_ethtool_ops;
SET_NETDEV_DEV(netdev, &dev->dev);
-   netdev->hw_features = NETIF_F_SG | NETIF_F_RXCSUM |
-   NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+   netdev->hw_features = NETIF_F_SG;
+   if (vio_get_attribute(dev, "ibm,illan-options", NULL) != NULL) {
+   netdev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+  NETIF_F_RXCSUM;
+   }
 
netdev->features |= netdev->hw_features;
 
-- 
1.8.3.1



[PATCH v8 08/13] net: ethernet: aquantia: PCI operations

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add functions that handle the PCI bus interface.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 .../net/ethernet/aquantia/atlantic/aq_pci_func.c   | 345 +
 .../net/ethernet/aquantia/atlantic/aq_pci_func.h   |  34 ++
 2 files changed, 379 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_pci_func.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
new file mode 100644
index 000..afcecdb
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
@@ -0,0 +1,345 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_pci_func.c: Definition of PCI functions. */
+
+#include "aq_pci_func.h"
+#include "aq_nic.h"
+#include "aq_vec.h"
+#include "aq_hw.h"
+#include 
+
+struct aq_pci_func_s {
+   struct pci_dev *pdev;
+   struct aq_nic_s *port[AQ_CFG_PCI_FUNC_PORTS];
+   void __iomem *mmio;
+   void *aq_vec[AQ_CFG_PCI_FUNC_MSIX_IRQS];
+   resource_size_t mmio_pa;
+   unsigned int msix_entry_mask;
+   unsigned int irq_type;
+   unsigned int ports;
+   bool is_pci_enabled;
+   bool is_regions;
+   bool is_pci_using_dac;
+   struct aq_hw_caps_s aq_hw_caps;
+   struct msix_entry msix_entry[AQ_CFG_PCI_FUNC_MSIX_IRQS];
+};
+
+struct aq_pci_func_s *aq_pci_func_alloc(struct aq_hw_ops *aq_hw_ops,
+   struct pci_dev *pdev,
+   const struct net_device_ops *ndev_ops,
+   const struct ethtool_ops *eth_ops)
+{
+   struct aq_pci_func_s *self = NULL;
+   int err = 0;
+   unsigned int port = 0U;
+
+   if (!aq_hw_ops) {
+   err = -EFAULT;
+   goto err_exit;
+   }
+   self = kzalloc(sizeof(*self), GFP_KERNEL);
+   if (!self) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+
+   pci_set_drvdata(pdev, self);
+   self->pdev = pdev;
+
+   err = aq_hw_ops->get_hw_caps(NULL, &self->aq_hw_caps);
+   if (err < 0)
+   goto err_exit;
+
+   self->ports = self->aq_hw_caps.ports;
+
+   for (port = 0; port < self->ports; ++port) {
+   struct aq_nic_s *aq_nic = aq_nic_alloc_cold(ndev_ops, eth_ops,
+   &pdev->dev, self,
+   port, aq_hw_ops);
+
+   if (!aq_nic) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+   self->port[port] = aq_nic;
+   }
+
+err_exit:
+   if (err < 0) {
+   if (self)
+   aq_pci_func_free(self);
+   self = NULL;
+   }
+
+   (void)err;
+   return self;
+}
+
+int aq_pci_func_init(struct aq_pci_func_s *self)
+{
+   int err = 0;
+   unsigned int bar = 0U;
+   unsigned int port = 0U;
+   unsigned int i = 0U;
+
+   err = pci_enable_device(self->pdev);
+   if (err < 0)
+   goto err_exit;
+
+   self->is_pci_enabled = true;
+
+   err = pci_set_dma_mask(self->pdev, DMA_BIT_MASK(64));
+   if (!err) {
+   err = pci_set_consistent_dma_mask(self->pdev, DMA_BIT_MASK(64));
+   self->is_pci_using_dac = 1;
+   }
+   if (err) {
+   err = pci_set_dma_mask(self->pdev, DMA_BIT_MASK(32));
+   if (!err)
+   err = pci_set_consistent_dma_mask(self->pdev,
+ DMA_BIT_MASK(32));
+   self->is_pci_using_dac = 0;
+   }
+   if (err != 0) {
+   err = -ENOSR;
+   goto err_exit;
+   }
+
+   err = pci_request_regions(self->pdev, AQ_CFG_DRV_NAME "_mmio");
+   if (err < 0)
+   goto err_exit;
+
+   self->is_regions = true;
+
+   pci_set_master(self->pdev);
+
+   for (bar = 0; bar < 4; ++bar) {
+   if (IORESOURCE_MEM & pci_resource_flags(self->pdev, bar)) {
+   resource_size_t reg_sz;
+
+   self->mmio_pa = pci_resource_start(self->pdev, bar);
+   if (self->mmio_pa == 0U) {
+   err = -EIO;
+ 

[PATCH v8 05/13] net: ethernet: aquantia: Support for NIC-specific code

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add support for code specific to the Atlantic NIC.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/atlantic/aq_main.c   | 273 ++
 drivers/net/ethernet/aquantia/atlantic/aq_main.h   |  17 +
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c| 937 +
 drivers/net/ethernet/aquantia/atlantic/aq_nic.h| 108 +++
 .../ethernet/aquantia/atlantic/aq_nic_internal.h   |  46 +
 5 files changed, 1381 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_main.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_main.h
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_nic.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_nic.h
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_nic_internal.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_main.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_main.c
new file mode 100644
index 000..c17c70a
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_main.c
@@ -0,0 +1,273 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_main.c: Main file for aQuantia Linux driver. */
+
+#include "aq_main.h"
+#include "aq_nic.h"
+#include "aq_pci_func.h"
+#include "aq_ethtool.h"
+#include "hw_atl/hw_atl_a0.h"
+#include "hw_atl/hw_atl_b0.h"
+
+#include 
+#include 
+
+static const struct pci_device_id aq_pci_tbl[] = {
+   { PCI_VDEVICE(AQUANTIA, HW_ATL_DEVICE_ID_0001), },
+   { PCI_VDEVICE(AQUANTIA, HW_ATL_DEVICE_ID_D100), },
+   { PCI_VDEVICE(AQUANTIA, HW_ATL_DEVICE_ID_D107), },
+   { PCI_VDEVICE(AQUANTIA, HW_ATL_DEVICE_ID_D108), },
+   { PCI_VDEVICE(AQUANTIA, HW_ATL_DEVICE_ID_D109), },
+   {}
+};
+
+MODULE_DEVICE_TABLE(pci, aq_pci_tbl);
+
+MODULE_LICENSE("GPL v2");
+MODULE_VERSION(AQ_CFG_DRV_VERSION);
+MODULE_AUTHOR(AQ_CFG_DRV_AUTHOR);
+MODULE_DESCRIPTION(AQ_CFG_DRV_DESC);
+
+static struct aq_hw_ops *aq_pci_probe_get_hw_ops_by_id(struct pci_dev *pdev)
+{
+   struct aq_hw_ops *ops = NULL;
+
+   ops = hw_atl_a0_get_ops_by_id(pdev);
+   if (!ops)
+   ops = hw_atl_b0_get_ops_by_id(pdev);
+
+   return ops;
+}
+
+static int aq_ndev_open(struct net_device *ndev)
+{
+   struct aq_nic_s *aq_nic = NULL;
+   int err = 0;
+
+   aq_nic = aq_nic_alloc_hot(ndev);
+   if (!aq_nic) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+   err = aq_nic_init(aq_nic);
+   if (err < 0)
+   goto err_exit;
+   err = aq_nic_start(aq_nic);
+   if (err < 0)
+   goto err_exit;
+
+err_exit:
+   if (err < 0)
+   aq_nic_deinit(aq_nic);
+   return err;
+}
+
+static int aq_ndev_close(struct net_device *ndev)
+{
+   int err = 0;
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+
+   err = aq_nic_stop(aq_nic);
+   if (err < 0)
+   goto err_exit;
+   aq_nic_deinit(aq_nic);
+   aq_nic_free_hot_resources(aq_nic);
+
+err_exit:
+   return err;
+}
+
+static int aq_ndev_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+   int err = 0;
+
+   err = aq_nic_xmit(aq_nic, skb);
+   if (err < 0)
+   goto err_exit;
+
+err_exit:
+   return err;
+}
+
+static int aq_ndev_change_mtu(struct net_device *ndev, int new_mtu)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+   int err = 0;
+
+   if (new_mtu == ndev->mtu) {
+   err = 0;
+   goto err_exit;
+   }
+   if (new_mtu < 68) {
+   err = -EINVAL;
+   goto err_exit;
+   }
+   err = aq_nic_set_mtu(aq_nic, new_mtu + ETH_HLEN);
+   if (err < 0)
+   goto err_exit;
+   ndev->mtu = new_mtu;
+
+   if (netif_running(ndev)) {
+   aq_ndev_close(ndev);
+   aq_ndev_open(ndev);
+   }
+
+err_exit:
+   return err;
+}
+
+static int aq_ndev_set_features(struct net_device *ndev,
+   netdev_features_t features)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+   struct aq_nic_cfg_s *aq_cfg = aq_nic_get_cfg(aq_nic);
+   bool is_lro = false;
+
+   if (aq_cfg->hw_features & NETIF_F_LRO) {
+   is_lro = features & NETIF_F_LRO;
+
+   if (aq_cfg->is_lro != is_lro) {
+   aq_cfg->is_lro = is_lro;

[PATCH v8 02/13] net: ethernet: aquantia: Common functions and definitions

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add files containing the functions and definitions used in common in
different functional areas.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h| 77 ++
 drivers/net/ethernet/aquantia/atlantic/aq_common.h | 23 +++
 drivers/net/ethernet/aquantia/atlantic/aq_utils.h  | 50 ++
 3 files changed, 150 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_common.h
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_utils.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h 
b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
new file mode 100644
index 000..5f99237
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
@@ -0,0 +1,77 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_cfg.h: Definition of configuration parameters and constants. */
+
+#ifndef AQ_CFG_H
+#define AQ_CFG_H
+
+#define AQ_CFG_VECS_DEF   4U
+#define AQ_CFG_TCS_DEF1U
+
+#define AQ_CFG_TXDS_DEF4096U
+#define AQ_CFG_RXDS_DEF1024U
+
+#define AQ_CFG_IS_POLLING_DEF 0U
+
+#define AQ_CFG_FORCE_LEGACY_INT 0U
+
+#define AQ_CFG_IS_INTERRUPT_MODERATION_DEF   1U
+#define AQ_CFG_INTERRUPT_MODERATION_RATE_DEF 0xU
+#define AQ_CFG_IRQ_MASK  0x1FFU
+
+#define AQ_CFG_VECS_MAX   8U
+#define AQ_CFG_TCS_MAX8U
+
+#define AQ_CFG_TX_FRAME_MAX  (16U * 1024U)
+#define AQ_CFG_RX_FRAME_MAX  (4U * 1024U)
+
+/* LRO */
+#define AQ_CFG_IS_LRO_DEF   1U
+
+/* RSS */
+#define AQ_CFG_RSS_INDIRECTION_TABLE_MAX  128U
+#define AQ_CFG_RSS_HASHKEY_SIZE   320U
+
+#define AQ_CFG_IS_RSS_DEF   1U
+#define AQ_CFG_NUM_RSS_QUEUES_DEF   AQ_CFG_VECS_DEF
+#define AQ_CFG_RSS_BASE_CPU_NUM_DEF 0U
+
+#define AQ_CFG_PCI_FUNC_MSIX_IRQS   9U
+#define AQ_CFG_PCI_FUNC_PORTS   2U
+
+#define AQ_CFG_SERVICE_TIMER_INTERVAL(2 * HZ)
+#define AQ_CFG_POLLING_TIMER_INTERVAL   ((unsigned int)(2 * HZ))
+
+#define AQ_CFG_SKB_FRAGS_MAX   32U
+
+#define AQ_CFG_NAPI_WEIGHT 64U
+
+#define AQ_CFG_MULTICAST_ADDRESS_MAX 32U
+
+/*#define AQ_CFG_MAC_ADDR_PERMANENT {0x30, 0x0E, 0xE3, 0x12, 0x34, 0x56}*/
+
+#define AQ_CFG_FC_MODE 3U
+
+#define AQ_CFG_SPEED_MSK  0xU  /* 0xU==auto_neg */
+
+#define AQ_CFG_IS_AUTONEG_DEF   1U
+#define AQ_CFG_MTU_DEF  1514U
+
+#define AQ_CFG_LOCK_TRYS   100U
+
+#define AQ_CFG_DRV_AUTHOR  "aQuantia"
+#define AQ_CFG_DRV_DESC"aQuantia Corporation(R) Network Driver"
+#define AQ_CFG_DRV_NAME"aquantia"
+#define AQ_CFG_DRV_VERSION __stringify(NIC_MAJOR_DRIVER_VERSION)"."\
+   __stringify(NIC_MINOR_DRIVER_VERSION)"."\
+   __stringify(NIC_BUILD_DRIVER_VERSION)"."\
+   __stringify(NIC_REVISION_DRIVER_VERSION)
+
+#endif /* AQ_CFG_H */
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_common.h 
b/drivers/net/ethernet/aquantia/atlantic/aq_common.h
new file mode 100644
index 000..9eb5e22
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_common.h
@@ -0,0 +1,23 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_common.h: Basic includes for all files in project. */
+
+#ifndef AQ_COMMON_H
+#define AQ_COMMON_H
+
+#include 
+#include 
+
+#include "ver.h"
+#include "aq_nic.h"
+#include "aq_cfg.h"
+#include "aq_utils.h"
+
+#endif /* AQ_COMMON_H */
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_utils.h 
b/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
new file mode 100644
index 000..4446bd9
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_utils.h
@@ -0,0 +1,50 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_utils.h: Useful macro and structures used in all layers of driver. 

[PATCH v8 11/13] net: ethernet: aquantia: Ethtool support

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add the driver interfaces required for support by the ethtool utility.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 .../net/ethernet/aquantia/atlantic/aq_ethtool.c| 261 +
 .../net/ethernet/aquantia/atlantic/aq_ethtool.h|  19 ++
 2 files changed, 280 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_ethtool.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
new file mode 100644
index 000..c5b025e
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
@@ -0,0 +1,261 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_ethtool.c: Definition of ethtool related functions. */
+
+#include "aq_ethtool.h"
+#include "aq_nic.h"
+
+static void aq_ethtool_get_regs(struct net_device *ndev,
+   struct ethtool_regs *regs, void *p)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+   u32 regs_count = aq_nic_get_regs_count(aq_nic);
+
+   memset(p, 0, regs_count * sizeof(u32));
+   aq_nic_get_regs(aq_nic, regs, p);
+}
+
+static int aq_ethtool_get_regs_len(struct net_device *ndev)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+   u32 regs_count = aq_nic_get_regs_count(aq_nic);
+
+   return regs_count * sizeof(u32);
+}
+
+static u32 aq_ethtool_get_link(struct net_device *ndev)
+{
+   return ethtool_op_get_link(ndev);
+}
+
+static int aq_ethtool_get_settings(struct net_device *ndev,
+  struct ethtool_cmd *cmd)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+
+   aq_nic_get_link_settings(aq_nic, cmd);
+   ethtool_cmd_speed_set(cmd, netif_carrier_ok(ndev) ?
+   aq_nic_get_link_speed(aq_nic) : 0U);
+
+   return 0;
+}
+
+static int aq_ethtool_set_settings(struct net_device *ndev,
+  struct ethtool_cmd *cmd)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+
+   return aq_nic_set_link_settings(aq_nic, cmd);
+}
+
+/* there "5U" is number of queue[#] stats lines (InPackets+...+InErrors) */
+static const unsigned int aq_ethtool_stat_queue_lines = 5U;
+static const unsigned int aq_ethtool_stat_queue_chars =
+   5U * ETH_GSTRING_LEN;
+static const char aq_ethtool_stat_names[][ETH_GSTRING_LEN] = {
+   "InPackets",
+   "InUCast",
+   "InMCast",
+   "InBCast",
+   "InErrors",
+   "OutPackets",
+   "OutUCast",
+   "OutMCast",
+   "OutBCast",
+   "InUCastOctects",
+   "OutUCastOctects",
+   "InMCastOctects",
+   "OutMCastOctects",
+   "InBCastOctects",
+   "OutBCastOctects",
+   "InOctects",
+   "OutOctects",
+   "InPacketsDma",
+   "OutPacketsDma",
+   "InOctetsDma",
+   "OutOctetsDma",
+   "InDroppedDma",
+   "Queue[0] InPackets",
+   "Queue[0] OutPackets",
+   "Queue[0] InJumboPackets",
+   "Queue[0] InLroPackets",
+   "Queue[0] InErrors",
+   "Queue[1] InPackets",
+   "Queue[1] OutPackets",
+   "Queue[1] InJumboPackets",
+   "Queue[1] InLroPackets",
+   "Queue[1] InErrors",
+   "Queue[2] InPackets",
+   "Queue[2] OutPackets",
+   "Queue[2] InJumboPackets",
+   "Queue[2] InLroPackets",
+   "Queue[2] InErrors",
+   "Queue[3] InPackets",
+   "Queue[3] OutPackets",
+   "Queue[3] InJumboPackets",
+   "Queue[3] InLroPackets",
+   "Queue[3] InErrors",
+   "Queue[4] InPackets",
+   "Queue[4] OutPackets",
+   "Queue[4] InJumboPackets",
+   "Queue[4] InLroPackets",
+   "Queue[4] InErrors",
+   "Queue[5] InPackets",
+   "Queue[5] OutPackets",
+   "Queue[5] InJumboPackets",
+   "Queue[5] InLroPackets",
+   "Queue[5] InErrors",
+   "Queue[6] InPackets",
+   "Queue[6] OutPackets",
+   "Queue[6] InJumboPackets",
+   "Queue[6] InLroPackets",
+   "Queue[6] InErrors",
+   "Queue[7] InPackets",
+   "Queue[7] OutPackets",
+   "Queue[7] InJumboPackets",
+   "Queue[7] InLroPackets",
+   "Queue[7] InErrors",
+};
+
+static void aq_ethtool_stats(struct net_device *ndev,
+struct ethtool_stats *stats, u64 *data)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(ndev);
+
+/* ASSERT: Need add 

[PATCH v8 10/13] net: ethernet: aquantia: Hardware interface and utility functions

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add functions to interface with the hardware and some utility functions.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/atlantic/aq_hw.h | 177 +
 .../net/ethernet/aquantia/atlantic/aq_hw_utils.c   |  68 
 .../net/ethernet/aquantia/atlantic/aq_hw_utils.h   |  47 ++
 3 files changed, 292 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_hw.h
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_hw_utils.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_hw_utils.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_hw.h 
b/drivers/net/ethernet/aquantia/atlantic/aq_hw.h
new file mode 100644
index 000..fce0fd3
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_hw.h
@@ -0,0 +1,177 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_hw.h: Declaration of abstract interface for NIC hardware specific
+ * functions.
+ */
+
+#ifndef AQ_HW_H
+#define AQ_HW_H
+
+#include "aq_common.h"
+
+/* NIC H/W capabilities */
+struct aq_hw_caps_s {
+   u64 hw_features;
+   u64 link_speed_msk;
+   unsigned int hw_priv_flags;
+   u32 rxds;
+   u32 txds;
+   u32 txhwb_alignment;
+   u32 irq_mask;
+   u32 vecs;
+   u32 mtu;
+   u32 mac_regs_count;
+   u8 ports;
+   u8 msix_irqs;
+   u8 tcs;
+   u8 rxd_alignment;
+   u8 rxd_size;
+   u8 txd_alignment;
+   u8 txd_size;
+   u8 tx_rings;
+   u8 rx_rings;
+   bool flow_control;
+   bool is_64_dma;
+   u32 fw_ver_expected;
+};
+
+struct aq_hw_link_status_s {
+   unsigned int mbps;
+};
+
+#define AQ_HW_IRQ_INVALID 0U
+#define AQ_HW_IRQ_LEGACY  1U
+#define AQ_HW_IRQ_MSI 2U
+#define AQ_HW_IRQ_MSIX3U
+
+#define AQ_HW_POWER_STATE_D0   0U
+#define AQ_HW_POWER_STATE_D3   3U
+
+#define AQ_HW_FLAG_STARTED 0x0004U
+#define AQ_HW_FLAG_STOPPING0x0008U
+#define AQ_HW_FLAG_RESETTING   0x0010U
+#define AQ_HW_FLAG_CLOSING 0x0020U
+#define AQ_HW_LINK_DOWN0x0400U
+#define AQ_HW_FLAG_ERR_UNPLUG  0x4000U
+#define AQ_HW_FLAG_ERR_HW  0x8000U
+
+#define AQ_HW_FLAG_ERRORS  (AQ_HW_FLAG_ERR_HW | AQ_HW_FLAG_ERR_UNPLUG)
+
+struct aq_hw_s {
+   struct aq_obj_s header;
+   struct aq_nic_cfg_s *aq_nic_cfg;
+   struct aq_pci_func_s *aq_pci_func;
+   void __iomem *mmio;
+   unsigned int not_ff_addr;
+   struct aq_hw_link_status_s aq_link_status;
+};
+
+struct aq_ring_s;
+struct aq_ring_param_s;
+struct aq_nic_cfg_s;
+struct sk_buff;
+
+struct aq_hw_ops {
+   struct aq_hw_s *(*create)(struct aq_pci_func_s *aq_pci_func,
+ unsigned int port, struct aq_hw_ops *ops);
+
+   void (*destroy)(struct aq_hw_s *self);
+
+   int (*get_hw_caps)(struct aq_hw_s *self,
+  struct aq_hw_caps_s *aq_hw_caps);
+
+   int (*hw_ring_tx_xmit)(struct aq_hw_s *self, struct aq_ring_s *aq_ring,
+  unsigned int frags);
+
+   int (*hw_ring_rx_receive)(struct aq_hw_s *self,
+ struct aq_ring_s *aq_ring);
+
+   int (*hw_ring_rx_fill)(struct aq_hw_s *self, struct aq_ring_s *aq_ring,
+  unsigned int sw_tail_old);
+
+   int (*hw_ring_tx_head_update)(struct aq_hw_s *self,
+ struct aq_ring_s *aq_ring);
+
+   int (*hw_get_mac_permanent)(struct aq_hw_s *self,
+   struct aq_hw_caps_s *aq_hw_caps,
+   u8 *mac);
+
+   int (*hw_set_mac_address)(struct aq_hw_s *self, u8 *mac_addr);
+
+   int (*hw_get_link_status)(struct aq_hw_s *self,
+ struct aq_hw_link_status_s *link_status);
+
+   int (*hw_set_link_speed)(struct aq_hw_s *self, u32 speed);
+
+   int (*hw_reset)(struct aq_hw_s *self);
+
+   int (*hw_init)(struct aq_hw_s *self, struct aq_nic_cfg_s *aq_nic_cfg,
+  u8 *mac_addr);
+
+   int (*hw_start)(struct aq_hw_s *self);
+
+   int (*hw_stop)(struct aq_hw_s *self);
+
+   int (*hw_ring_tx_init)(struct aq_hw_s *self, struct aq_ring_s *aq_ring,
+  struct aq_ring_param_s *aq_ring_param);
+
+   int (*hw_ring_tx_start)(struct aq_hw_s *self,
+   struct aq_ring_s 
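The `aq_hw_ops` table above is the driver's hardware abstraction point: core code never calls silicon-specific functions directly, only through these function pointers, so the A0 and B0 revisions can each supply their own implementation. A minimal user-space sketch of the same ops-table pattern follows; all names in it are hypothetical illustrations, not part of the driver.

```c
#include <assert.h>

/* Hypothetical miniature of the aq_hw_ops pattern: core code only sees
 * the ops table; each hardware revision fills it in differently. */
struct hw;
struct hw_ops {
	int (*reset)(struct hw *self);
	int (*start)(struct hw *self);
};

struct hw {
	const struct hw_ops *ops;
	int state;              /* 0 = down, 1 = up */
};

static int a0_reset(struct hw *self) { self->state = 0; return 0; }
static int a0_start(struct hw *self) { self->state = 1; return 0; }

static const struct hw_ops a0_ops = {
	.reset = a0_reset,
	.start = a0_start,
};

/* Revision-agnostic core code, mirroring how aq_nic calls
 * self->aq_hw_ops->hw_reset() / hw_start(). */
static int hw_bring_up(struct hw *self)
{
	int err = self->ops->reset(self);

	if (err < 0)
		return err;
	return self->ops->start(self);
}
```

A B0-specific table would plug in unchanged to `hw_bring_up()`, which is the point of the indirection.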

[PATCH v8 07/13] net: ethernet: aquantia: Vector operations

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add functions to manipulate the vector of receive and transmit rings.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel.Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/atlantic/aq_vec.c | 392 
 drivers/net/ethernet/aquantia/atlantic/aq_vec.h |  42 +++
 2 files changed, 434 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_vec.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_vec.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_vec.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_vec.c
new file mode 100644
index 000..140962f
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_vec.c
@@ -0,0 +1,392 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_vec.c: Definition of common structure for vector of Rx and Tx rings.
+ * Definition of functions for Rx and Tx rings. Friendly module for aq_nic.
+ */
+
+#include "aq_vec.h"
+#include "aq_nic.h"
+#include "aq_ring.h"
+#include "aq_hw.h"
+
+#include <linux/netdevice.h>
+
+struct aq_vec_s {
+   struct aq_obj_s header;
+   struct aq_hw_ops *aq_hw_ops;
+   struct aq_hw_s *aq_hw;
+   struct aq_nic_s *aq_nic;
+   unsigned int tx_rings;
+   unsigned int rx_rings;
+   struct aq_ring_param_s aq_ring_param;
+   struct napi_struct napi;
+   struct aq_ring_s ring[AQ_CFG_TCS_MAX][2];
+};
+
+#define AQ_VEC_TX_ID 0
+#define AQ_VEC_RX_ID 1
+
+static int aq_vec_poll(struct napi_struct *napi, int budget)
+__releases(&self->lock)
+__acquires(&self->lock)
+{
+   struct aq_vec_s *self = container_of(napi, struct aq_vec_s, napi);
+   struct aq_ring_s *ring = NULL;
+   int work_done = 0;
+   int err = 0;
+   unsigned int i = 0U;
+   unsigned int sw_tail_old = 0U;
+   bool was_tx_cleaned = false;
+
+   if (!self) {
+   err = -EINVAL;
+   } else if (spin_trylock(&self->header.lock)) {
+   for (i = 0U, ring = self->ring[0];
+   self->tx_rings > i; ++i, ring = self->ring[i]) {
+   if (self->aq_hw_ops->hw_ring_tx_head_update) {
+   err = self->aq_hw_ops->hw_ring_tx_head_update(
+   self->aq_hw,
+   &ring[AQ_VEC_TX_ID]);
+   if (err < 0)
+   goto err_exit;
+   }
+
+   if (ring[AQ_VEC_TX_ID].sw_head !=
+   ring[AQ_VEC_TX_ID].hw_head) {
+   err = aq_ring_tx_clean(&ring[AQ_VEC_TX_ID]);
+   if (err < 0)
+   goto err_exit;
+   was_tx_cleaned = true;
+   }
+
+   err = self->aq_hw_ops->hw_ring_rx_receive(self->aq_hw,
+   &ring[AQ_VEC_RX_ID]);
+   if (err < 0)
+   goto err_exit;
+
+   if (ring[AQ_VEC_RX_ID].sw_head !=
+   ring[AQ_VEC_RX_ID].hw_head) {
+   err = aq_ring_rx_clean(&ring[AQ_VEC_RX_ID],
+  &work_done,
+  budget - work_done);
+   if (err < 0)
+   goto err_exit;
+
+   sw_tail_old = ring[AQ_VEC_RX_ID].sw_tail;
+
+   err = aq_ring_rx_fill(&ring[AQ_VEC_RX_ID]);
+   if (err < 0)
+   goto err_exit;
+
+   err = self->aq_hw_ops->hw_ring_rx_fill(
+   self->aq_hw,
+   &ring[AQ_VEC_RX_ID], sw_tail_old);
+   if (err < 0)
+   goto err_exit;
+   }
+   }
+
+   if (was_tx_cleaned)
+   work_done = budget;
+
+   if (work_done < budget) {
+   napi_complete(napi);
+   self->aq_hw_ops->hw_irq_enable(self->aq_hw,
+   1U << self->aq_ring_param.vec_idx);
+   }
+
+err_exit:
+   spin_unlock(&self->header.lock);
+ 
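aq_vec_poll() above follows the standard NAPI contract: consume at most `budget` packets per poll; if fewer were consumed, complete the NAPI cycle and re-enable the device interrupt. The accounting can be sketched in user space as follows; this is a simplified illustration with hypothetical names, not the real NAPI API.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the NAPI budget contract used by aq_vec_poll(): consume up
 * to 'budget' packets; if fewer than budget were consumed, the real
 * code calls napi_complete() and re-enables the interrupt. */
struct fake_ring {
	int pending;            /* packets waiting in the ring */
	bool irq_enabled;
};

static int fake_poll(struct fake_ring *r, int budget)
{
	int work_done = 0;

	while (r->pending > 0 && work_done < budget) {
		r->pending--;           /* "process" one packet */
		work_done++;
	}

	if (work_done < budget)
		r->irq_enabled = true;  /* napi_complete + hw_irq_enable */

	return work_done;
}
```

Leaving the interrupt disabled while `work_done == budget` is what keeps the softirq polling instead of taking an interrupt per packet under load.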

[PATCH v8 13/13] net: ethernet: aquantia: Integrate AQtion 2.5/5 GB NIC driver

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Modify the drivers/net/ethernet/{Makefile,Kconfig} file to make them a
part of the network drivers build.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/Kconfig  | 1 +
 drivers/net/ethernet/Makefile | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index e4c28fe..1258b55 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -29,6 +29,7 @@ source "drivers/net/ethernet/amazon/Kconfig"
 source "drivers/net/ethernet/amd/Kconfig"
 source "drivers/net/ethernet/apm/Kconfig"
 source "drivers/net/ethernet/apple/Kconfig"
+source "drivers/net/ethernet/aquantia/Kconfig"
 source "drivers/net/ethernet/arc/Kconfig"
 source "drivers/net/ethernet/atheros/Kconfig"
 source "drivers/net/ethernet/aurora/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index 24330f4..d8a33c7 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_NET_VENDOR_AMAZON) += amazon/
 obj-$(CONFIG_NET_VENDOR_AMD) += amd/
 obj-$(CONFIG_NET_XGENE) += apm/
 obj-$(CONFIG_NET_VENDOR_APPLE) += apple/
+obj-$(CONFIG_NET_VENDOR_AQUANTIA) += aquantia/
 obj-$(CONFIG_NET_VENDOR_ARC) += arc/
 obj-$(CONFIG_NET_VENDOR_ATHEROS) += atheros/
 obj-$(CONFIG_NET_VENDOR_AURORA) += aurora/
-- 
2.9.3



[PATCH v8 12/13] net: ethernet: aquantia: Receive side scaling

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add definitions that support receive side scaling.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/atlantic/aq_rss.h | 26 +
 1 file changed, 26 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_rss.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_rss.h 
b/drivers/net/ethernet/aquantia/atlantic/aq_rss.h
new file mode 100644
index 0000000..1db6eb2
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_rss.h
@@ -0,0 +1,26 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_rss.h: Receive Side Scaling definitions. */
+
+#ifndef AQ_RSS_H
+#define AQ_RSS_H
+
+#include "aq_common.h"
+#include "aq_cfg.h"
+
+struct aq_rss_parameters {
+   u16 base_cpu_number;
+   u16 indirection_table_size;
+   u16 hash_secret_key_size;
+   u32 hash_secret_key[AQ_CFG_RSS_HASHKEY_SIZE / sizeof(u32)];
+   u8 indirection_table[AQ_CFG_RSS_INDIRECTION_TABLE_MAX];
+};
+
+#endif /* AQ_RSS_H */
-- 
2.9.3
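`aq_rss_parameters` above carries the two inputs RSS hardware needs: a secret hash key (typically fed to a Toeplitz hash) and an indirection table mapping hash values to receive queues. Once the NIC has hashed a flow, the steering step is conceptually just a table lookup, sketched below; the table size and fill pattern are illustrative only.

```c
#include <assert.h>

/* Sketch of RSS queue selection: index the indirection table with the
 * low bits of the flow hash. The same flow always hashes to the same
 * value, so it always lands on the same queue (flow affinity). */
#define INDIR_SIZE 64U

static unsigned int rss_pick_queue(const unsigned char indir[INDIR_SIZE],
				   unsigned int flow_hash)
{
	return indir[flow_hash % INDIR_SIZE];
}
```

Spreading queue numbers evenly across the table is what balances flows across CPUs; remapping a table entry migrates the flows that hash to it.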



[PATCH v8 09/13] net: ethernet: aquantia: Atlantic hardware abstraction layer

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add common functions for Atlantic hardware abstraction layer.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 .../aquantia/atlantic/hw_atl/hw_atl_utils.c| 570 +
 .../aquantia/atlantic/hw_atl/hw_atl_utils.h| 210 
 2 files changed, 780 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c 
b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
new file mode 100644
index 0000000..8d6d8f5
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
@@ -0,0 +1,570 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File hw_atl_utils.c: Definition of common functions for Atlantic hardware
+ * abstraction layer.
+ */
+
+#include "../aq_hw.h"
+#include "../aq_hw_utils.h"
+#include "../aq_pci_func.h"
+#include "../aq_ring.h"
+#include "../aq_vec.h"
+#include "hw_atl_utils.h"
+#include "hw_atl_llh.h"
+
+#include <linux/random.h>
+
+#define HW_ATL_UCP_0X370_REG    0x0370U
+
+#define HW_ATL_FW_SM_RAM        0x2U
+#define HW_ATL_MPI_CONTROL_ADR  0x0368U
+#define HW_ATL_MPI_STATE_ADR    0x036CU
+
+#define HW_ATL_MPI_STATE_MSK    0x00FFU
+#define HW_ATL_MPI_STATE_SHIFT  0U
+#define HW_ATL_MPI_SPEED_MSK    0xFFFFU
+#define HW_ATL_MPI_SPEED_SHIFT  16U
+
+static int hw_atl_utils_fw_downld_dwords(struct aq_hw_s *self, u32 a,
+u32 *p, u32 cnt)
+{
+   int err = 0;
+
+   AQ_HW_WAIT_FOR(reg_glb_cpu_sem_get(self,
+  HW_ATL_FW_SM_RAM) == 1U,
+  1U, 1U);
+
+   if (err < 0) {
+   bool is_locked;
+
+   reg_glb_cpu_sem_set(self, 1U, HW_ATL_FW_SM_RAM);
+   is_locked = reg_glb_cpu_sem_get(self, HW_ATL_FW_SM_RAM);
+   if (!is_locked) {
+   err = -ETIME;
+   goto err_exit;
+   }
+   }
+
+   aq_hw_write_reg(self, 0x0208U, a);
+
+   for (++cnt; --cnt;) {
+   u32 i = 0U;
+
+   aq_hw_write_reg(self, 0x0200U, 0x8000U);
+
+   for (i = 1024U;
+   (0x100U & aq_hw_read_reg(self, 0x0200U)) && --i;) {
+   }
+
+   *(p++) = aq_hw_read_reg(self, 0x020CU);
+   }
+
+   reg_glb_cpu_sem_set(self, 1U, HW_ATL_FW_SM_RAM);
+
+err_exit:
+   return err;
+}
+
+static int hw_atl_utils_fw_upload_dwords(struct aq_hw_s *self, u32 a, u32 *p,
+u32 cnt)
+{
+   int err = 0;
+   bool is_locked;
+
+   is_locked = reg_glb_cpu_sem_get(self, HW_ATL_FW_SM_RAM);
+   if (!is_locked) {
+   err = -ETIME;
+   goto err_exit;
+   }
+
+   aq_hw_write_reg(self, 0x0208U, a);
+
+   for (++cnt; --cnt;) {
+   u32 i = 0U;
+
+   aq_hw_write_reg(self, 0x020CU, *(p++));
+   aq_hw_write_reg(self, 0x0200U, 0xC000U);
+
+   for (i = 1024U;
+   (0x100U & aq_hw_read_reg(self, 0x0200U)) && --i;) {
+   }
+   }
+
+   reg_glb_cpu_sem_set(self, 1U, HW_ATL_FW_SM_RAM);
+
+err_exit:
+   return err;
+}
+
+static int hw_atl_utils_ver_match(u32 ver_expected, u32 ver_actual)
+{
+   int err = 0;
+   const u32 dw_major_mask = 0xff000000U;
+   const u32 dw_minor_mask = 0x00ffffffU;
+
+   err = (dw_major_mask & (ver_expected ^ ver_actual)) ? -EOPNOTSUPP : 0;
+   if (err < 0)
+   goto err_exit;
+   err = ((dw_minor_mask & ver_expected) > (dw_minor_mask & ver_actual)) ?
+   -EOPNOTSUPP : 0;
+err_exit:
+   return err;
+}
+
+static int hw_atl_utils_init_ucp(struct aq_hw_s *self,
+struct aq_hw_caps_s *aq_hw_caps)
+{
+   int err = 0;
+
+   if (!aq_hw_read_reg(self, 0x370U)) {
+   unsigned int rnd = 0U;
+   unsigned int ucp_0x370 = 0U;
+
+   get_random_bytes(&rnd, sizeof(unsigned int));
+
+   ucp_0x370 = 0x02020202U | (0xFEFEFEFEU & rnd);
+   aq_hw_write_reg(self, HW_ATL_UCP_0X370_REG, ucp_0x370);
+   }
+
+   reg_glb_cpu_scratch_scp_set(self, 0x00000000U, 25U);
+
+   /* check 10 times 
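hw_atl_utils_ver_match() above encodes the firmware compatibility rule: the major version (top byte of the dword) must match exactly, and the running firmware's minor part must be at least the expected one. A user-space sketch of that rule follows; the mask values mirror the driver's, but the function name and error value are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the firmware version check: major byte must match exactly;
 * the actual minor part must not be older than the expected one. */
#define DW_MAJOR_MASK 0xff000000U
#define DW_MINOR_MASK 0x00ffffffU

static int ver_match(uint32_t expected, uint32_t actual)
{
	if ((expected ^ actual) & DW_MAJOR_MASK)
		return -1;      /* wrong major: unsupported firmware */
	if ((expected & DW_MINOR_MASK) > (actual & DW_MINOR_MASK))
		return -1;      /* firmware minor too old */
	return 0;
}
```

XOR-ing the two versions makes the major comparison a single mask test, which is exactly the shape of the driver's first check.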

[PATCH v8 06/13] net: ethernet: aquantia: Atlantic A0 and B0 specific functions.

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add Atlantic A0 and B0 specific functions.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c  | 905 +++
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.h  |  34 +
 .../aquantia/atlantic/hw_atl/hw_atl_a0_internal.h  | 155 
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c  | 958 +
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.h  |  34 +
 .../aquantia/atlantic/hw_atl/hw_atl_b0_internal.h  | 207 +
 6 files changed, 2293 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.h
 create mode 100644 
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0_internal.h
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.h
 create mode 100644 
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0_internal.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c 
b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
new file mode 100644
index 000..1f38805
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c
@@ -0,0 +1,905 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File hw_atl_a0.c: Definition of Atlantic hardware specific functions. */
+
+#include "../aq_hw.h"
+#include "../aq_hw_utils.h"
+#include "../aq_ring.h"
+#include "hw_atl_a0.h"
+#include "hw_atl_utils.h"
+#include "hw_atl_llh.h"
+#include "hw_atl_a0_internal.h"
+
+static int hw_atl_a0_get_hw_caps(struct aq_hw_s *self,
+struct aq_hw_caps_s *aq_hw_caps)
+{
+   memcpy(aq_hw_caps, &hw_atl_a0_hw_caps_, sizeof(*aq_hw_caps));
+   return 0;
+}
+
+static struct aq_hw_s *hw_atl_a0_create(struct aq_pci_func_s *aq_pci_func,
+   unsigned int port,
+   struct aq_hw_ops *ops)
+{
+   struct hw_atl_s *self = NULL;
+
+   self = kzalloc(sizeof(*self), GFP_KERNEL);
+   if (!self)
+   goto err_exit;
+
+   self->base.aq_pci_func = aq_pci_func;
+
+   self->base.not_ff_addr = 0x10U;
+
+err_exit:
+   return (struct aq_hw_s *)self;
+}
+
+static void hw_atl_a0_destroy(struct aq_hw_s *self)
+{
+   kfree(self);
+}
+
+static int hw_atl_a0_hw_reset(struct aq_hw_s *self)
+{
+   int err = 0;
+
+   glb_glb_reg_res_dis_set(self, 1U);
+   pci_pci_reg_res_dis_set(self, 0U);
+   rx_rx_reg_res_dis_set(self, 0U);
+   tx_tx_reg_res_dis_set(self, 0U);
+
+   HW_ATL_FLUSH();
+   glb_soft_res_set(self, 1);
+
+   /* check 10 times by 1ms */
+   AQ_HW_WAIT_FOR(glb_soft_res_get(self) == 0, 1000U, 10U);
+   if (err < 0)
+   goto err_exit;
+
+   itr_irq_reg_res_dis_set(self, 0U);
+   itr_res_irq_set(self, 1U);
+
+   /* check 10 times by 1ms */
+   AQ_HW_WAIT_FOR(itr_res_irq_get(self) == 0, 1000U, 10U);
+   if (err < 0)
+   goto err_exit;
+
+   hw_atl_utils_mpi_set(self, MPI_RESET, 0x0U);
+
+   err = aq_hw_err_from_flags(self);
+
+err_exit:
+   return err;
+}
+
+static int hw_atl_a0_hw_qos_set(struct aq_hw_s *self)
+{
+   u32 tc = 0U;
+   u32 buff_size = 0U;
+   unsigned int i_priority = 0U;
+   bool is_rx_flow_control = false;
+
+   /* TPS Descriptor rate init */
+   tps_tx_pkt_shed_desc_rate_curr_time_res_set(self, 0x0U);
+   tps_tx_pkt_shed_desc_rate_lim_set(self, 0xA);
+
+   /* TPS VM init */
+   tps_tx_pkt_shed_desc_vm_arb_mode_set(self, 0U);
+
+   /* TPS TC credits init */
+   tps_tx_pkt_shed_desc_tc_arb_mode_set(self, 0U);
+   tps_tx_pkt_shed_data_arb_mode_set(self, 0U);
+
+   tps_tx_pkt_shed_tc_data_max_credit_set(self, 0xFFF, 0U);
+   tps_tx_pkt_shed_tc_data_weight_set(self, 0x64, 0U);
+   tps_tx_pkt_shed_desc_tc_max_credit_set(self, 0x50, 0U);
+   tps_tx_pkt_shed_desc_tc_weight_set(self, 0x1E, 0U);
+
+   /* Tx buf size */
+   buff_size = HW_ATL_A0_TXBUF_MAX;
+
+   tpb_tx_pkt_buff_size_per_tc_set(self, buff_size, tc);
+   tpb_tx_buff_hi_threshold_per_tc_set(self,
+   (buff_size * (1024 / 32U) * 66U) /
+   100U, tc);
+   
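The reset path above leans on the AQ_HW_WAIT_FOR idiom: re-check a hardware condition a bounded number of times with a delay between polls ("check 10 times by 1ms"), and fail with a timeout error if it never becomes true. A user-space sketch of that idiom follows; the helper names are hypothetical, and the real macro sleeps between polls rather than spinning.

```c
#include <assert.h>

/* Sketch of the AQ_HW_WAIT_FOR idiom: poll a condition up to 'tries'
 * times; return 0 on success, a -ETIME-style error on timeout.
 * In the driver a udelay()/usleep_range() sits between polls. */
static int wait_for(int (*cond)(void *ctx), void *ctx, unsigned int tries)
{
	while (tries--) {
		if (cond(ctx))
			return 0;
		/* driver: delay here, e.g. 1 ms per try */
	}
	return -62;             /* -ETIME */
}

/* Fake condition that becomes true on the third poll. */
static int counter_reached(void *ctx)
{
	int *c = ctx;

	return ++(*c) >= 3;
}
```

Bounding the retries is what turns a wedged hardware register into a clean error return instead of a hung probe.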

[PATCH v8 01/13] net: ethernet: aquantia: Make and configuration files.

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Patches to create the make and configuration files.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/Kconfig   | 24 ++
 drivers/net/ethernet/aquantia/Makefile  |  5 +++
 drivers/net/ethernet/aquantia/atlantic/Makefile | 42 +
 drivers/net/ethernet/aquantia/atlantic/ver.h| 18 +++
 4 files changed, 89 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/Kconfig
 create mode 100644 drivers/net/ethernet/aquantia/Makefile
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/Makefile
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/ver.h

diff --git a/drivers/net/ethernet/aquantia/Kconfig 
b/drivers/net/ethernet/aquantia/Kconfig
new file mode 100644
index 0000000..cdf78e0
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/Kconfig
@@ -0,0 +1,24 @@
+#
+# aQuantia device configuration
+#
+
+config NET_VENDOR_AQUANTIA
+   bool "aQuantia devices"
+   default y
+   ---help---
+ Set this to y if you have an Ethernet network card that uses the
+ aQuantia AQC107/AQC108 chipset.
+
+ This option does not build any drivers; it causes the aQuantia
+ drivers that can be built to appear in the list of Ethernet drivers.
+
+
+if NET_VENDOR_AQUANTIA
+
+config AQTION
+   tristate "aQuantia AQtion(tm) Support"
+   depends on PCI && X86_64
+   ---help---
+ This enables the support for the aQuantia AQtion(tm) Ethernet card.
+
+endif # NET_VENDOR_AQUANTIA
diff --git a/drivers/net/ethernet/aquantia/Makefile 
b/drivers/net/ethernet/aquantia/Makefile
new file mode 100644
index 0000000..4f4897b
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the aQuantia device drivers.
+#
+
+obj-$(CONFIG_AQTION) += atlantic/
diff --git a/drivers/net/ethernet/aquantia/atlantic/Makefile 
b/drivers/net/ethernet/aquantia/atlantic/Makefile
new file mode 100644
index 0000000..e4ae696
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/Makefile
@@ -0,0 +1,42 @@
+
+#
+# aQuantia Ethernet Controller AQtion Linux Driver
+# Copyright(c) 2014-2017 aQuantia Corporation.
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms and conditions of the GNU General Public License,
+# version 2, as published by the Free Software Foundation.
+#
+# This program is distributed in the hope it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+# more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+# The full GNU General Public License is included in this distribution in
+# the file called "COPYING".
+#
+# Contact Information: 
+# aQuantia Corporation, 105 E. Tasman Dr. San Jose, CA 95134, USA
+#
+
+
+#
+# Makefile for the AQtion(tm) Ethernet driver
+#
+
+obj-$(CONFIG_AQTION) += atlantic.o
+
+atlantic-objs := aq_main.o \
+   aq_nic.o \
+   aq_pci_func.o \
+   aq_vec.o \
+   aq_ring.o \
+   aq_hw_utils.o \
+   aq_ethtool.o \
+   hw_atl/hw_atl_a0.o \
+   hw_atl/hw_atl_b0.o \
+   hw_atl/hw_atl_utils.o \
+   hw_atl/hw_atl_llh.o
diff --git a/drivers/net/ethernet/aquantia/atlantic/ver.h 
b/drivers/net/ethernet/aquantia/atlantic/ver.h
new file mode 100644
index 0000000..0de858d
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/ver.h
@@ -0,0 +1,18 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+#ifndef VER_H
+#define VER_H
+
+#define NIC_MAJOR_DRIVER_VERSION   1
+#define NIC_MINOR_DRIVER_VERSION   5
+#define NIC_BUILD_DRIVER_VERSION   345
+#define NIC_REVISION_DRIVER_VERSION0
+
+#endif /* VER_H */
-- 
2.9.3



[PATCH v8 03/13] net: ethernet: aquantia: Add ring support code

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

Add code to support the transmit and receive ring buffers.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
 drivers/net/ethernet/aquantia/atlantic/aq_ring.c | 376 +++
 drivers/net/ethernet/aquantia/atlantic/aq_ring.h | 157 ++
 2 files changed, 533 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_ring.c
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/aq_ring.h

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
new file mode 100644
index 0000000..b517b26
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -0,0 +1,376 @@
+/*
+ * aQuantia Corporation Network Driver
+ * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+
+/* File aq_ring.c: Definition of functions for Rx/Tx rings. */
+
+#include "aq_ring.h"
+#include "aq_nic.h"
+#include "aq_hw.h"
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+
+static struct aq_ring_s *aq_ring_alloc(struct aq_ring_s *self,
+  struct aq_nic_s *aq_nic)
+{
+   int err = 0;
+
+   self->buff_ring =
+   kcalloc(self->size, sizeof(struct aq_ring_buff_s), GFP_KERNEL);
+
+   if (!self->buff_ring) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+   self->dx_ring = dma_alloc_coherent(aq_nic_get_dev(aq_nic),
+   self->size * self->dx_size,
+   &self->dx_ring_pa, GFP_KERNEL);
+   if (!self->dx_ring) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+
+err_exit:
+   if (err < 0) {
+   aq_ring_free(self);
+   self = NULL;
+   }
+   return self;
+}
+
+struct aq_ring_s *aq_ring_tx_alloc(struct aq_ring_s *self,
+  struct aq_nic_s *aq_nic,
+  unsigned int idx,
+  struct aq_nic_cfg_s *aq_nic_cfg)
+{
+   int err = 0;
+
+   self->aq_nic = aq_nic;
+   self->idx = idx;
+   self->size = aq_nic_cfg->txds;
+   self->dx_size = aq_nic_cfg->aq_hw_caps->txd_size;
+
+   self = aq_ring_alloc(self, aq_nic);
+   if (!self) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+
+err_exit:
+   if (err < 0) {
+   aq_ring_free(self);
+   self = NULL;
+   }
+   return self;
+}
+
+struct aq_ring_s *aq_ring_rx_alloc(struct aq_ring_s *self,
+  struct aq_nic_s *aq_nic,
+  unsigned int idx,
+  struct aq_nic_cfg_s *aq_nic_cfg)
+{
+   int err = 0;
+
+   self->aq_nic = aq_nic;
+   self->idx = idx;
+   self->size = aq_nic_cfg->rxds;
+   self->dx_size = aq_nic_cfg->aq_hw_caps->rxd_size;
+
+   self = aq_ring_alloc(self, aq_nic);
+   if (!self) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+
+err_exit:
+   if (err < 0) {
+   aq_ring_free(self);
+   self = NULL;
+   }
+   return self;
+}
+
+int aq_ring_init(struct aq_ring_s *self)
+{
+   self->hw_head = 0;
+   self->sw_head = 0;
+   self->sw_tail = 0;
+   return 0;
+}
+
+void aq_ring_tx_append_buffs(struct aq_ring_s *self,
+struct aq_ring_buff_s *buffer,
+unsigned int buffers)
+{
+   if (likely(self->sw_tail + buffers < self->size)) {
+   memcpy(&self->buff_ring[self->sw_tail], buffer,
+  sizeof(buffer[0]) * buffers);
+   } else {
+   unsigned int first_part = self->size - self->sw_tail;
+   unsigned int second_part = buffers - first_part;
+
+   memcpy(&self->buff_ring[self->sw_tail], buffer,
+  sizeof(buffer[0]) * first_part);
+
+   memcpy(&self->buff_ring[0], &buffer[first_part],
+  sizeof(buffer[0]) * second_part);
+   }
+}
+
+int aq_ring_tx_clean(struct aq_ring_s *self)
+{
+   struct device *dev = aq_nic_get_dev(self->aq_nic);
+
+   for (; self->sw_head != self->hw_head;
+   self->sw_head = aq_ring_next_dx(self, self->sw_head)) {
+   struct aq_ring_buff_s *buff = &self->buff_ring[self->sw_head];
+
+   if (likely(buff->is_mapped)) {
+   if (unlikely(buff->is_sop))
+   
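aq_ring_tx_append_buffs() above is a classic circular-buffer append: when the new entries would run past the end of the ring, the copy is split into a tail part and a wrapped head part. The same logic can be sketched with plain ints standing in for `struct aq_ring_buff_s`; the ring size and boundary handling here are simplified for illustration.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the wraparound copy in aq_ring_tx_append_buffs(): if the
 * span crosses the end of the ring, copy the part that fits at the
 * tail, then wrap the remainder to index 0. */
#define RING_SIZE 8U

static void ring_append(int ring[RING_SIZE], unsigned int sw_tail,
			const int *src, unsigned int n)
{
	if (sw_tail + n <= RING_SIZE) {
		memcpy(&ring[sw_tail], src, sizeof(int) * n);
	} else {
		unsigned int first = RING_SIZE - sw_tail;

		memcpy(&ring[sw_tail], src, sizeof(int) * first);
		memcpy(&ring[0], &src[first], sizeof(int) * (n - first));
	}
}
```

The two-memcpy split avoids a per-element modulo in the common non-wrapping case, which is the same trade-off the driver makes.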

[PATCH v8 00/13] net: ethernet: aquantia: Add AQtion 2.5/5 GB NIC driver

2017-01-23 Thread Alexander Loktionov
From: David VomLehn 

This series introduces the AQtion NIC driver for the aQuantia
AQC107/AQC108 network devices.

v1: Initial version
v2: o Make necessary drivers/net/ethernet changes to integrate software
o Drop intermediate atlantic directory
o Remove Makefile things only appropriate to out of tree module
  building
v3: o Move changes to drivers/net/ethernet/{Kconfig,Makefile} to the last
  patch to ensure clean bisection.
o Removed inline attribute aq_hw_write_req() as it was defined in
  only one .c file.
o #included pci.h in aq_common.h to get struct pci definition.
o Modified code to unlock based execution flow rather than using a
  flag.
o Made a number of functions that were only used in a single file
  static.
o Cleaned up error and return code handling in various places.
o Remove AQ_CFG_IP_ALIGN definition.
o Other minor code clean up.
v4: o Using do_div for 64 bit division.
o Modified NIC statistics code.
o Using build_skb instead of netdev_alloc_skb for single fragment packets.
o Removed extra aq_nic.o from Makefile
v5: o Removed extra newline at the end of the files.
v6: o Removed unnecessary cast from void*.
o Reworked strings array for ethtool statistics.
o Added stringset == ETH_SS_STATS checking.
o AQ_OBJ_HEADER replaced with aq_obj_header_s struct.
o AQ_OBJ_SET/TST/CLR macros replaced with inline functions.
o Driver sources placed in to atlantic directory.
o Fixed compilation warnings (Make W=1)
o Added firmware version checking.
o Code cleaning.
v7  o Removed unnecessary cast from memory allocation function (aq_ring.c).
v8  o Switched to using kcalloc instead of kzalloc.
o Now provide bus_info for ethtool
o Used div() to avoid __bad_udelay build error.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: David M. VomLehn 
---
David VomLehn (13):
  net: ethernet: aquantia: Make and configuration files.
  net: ethernet: aquantia: Common functions and definitions
  net: ethernet: aquantia: Add ring support code
  net: ethernet: aquantia: Low-level hardware interfaces
  net: ethernet: aquantia: Support for NIC-specific code
  net: ethernet: aquantia: Atlantic A0 and B0 specific functions.
  net: ethernet: aquantia: Vector operations
  net: ethernet: aquantia: PCI operations
  net: ethernet: aquantia: Atlantic hardware abstraction layer
  net: ethernet: aquantia: Hardware interface and utility functions
  net: ethernet: aquantia: Ethtool support
  net: ethernet: aquantia: Receive side scaling
  net: ethernet: aquantia: Integrate AQtion 2.5/5 GB NIC driver

 drivers/net/ethernet/Kconfig   |1 +
 drivers/net/ethernet/Makefile  |1 +
 drivers/net/ethernet/aquantia/Kconfig  |   24 +
 drivers/net/ethernet/aquantia/Makefile |5 +
 drivers/net/ethernet/aquantia/atlantic/Makefile|   42 +
 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h|   77 +
 drivers/net/ethernet/aquantia/atlantic/aq_common.h |   23 +
 .../net/ethernet/aquantia/atlantic/aq_ethtool.c|  261 +++
 .../net/ethernet/aquantia/atlantic/aq_ethtool.h|   19 +
 drivers/net/ethernet/aquantia/atlantic/aq_hw.h |  177 ++
 .../net/ethernet/aquantia/atlantic/aq_hw_utils.c   |   68 +
 .../net/ethernet/aquantia/atlantic/aq_hw_utils.h   |   47 +
 drivers/net/ethernet/aquantia/atlantic/aq_main.c   |  273 +++
 drivers/net/ethernet/aquantia/atlantic/aq_main.h   |   17 +
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c|  937 
 drivers/net/ethernet/aquantia/atlantic/aq_nic.h|  108 +
 .../ethernet/aquantia/atlantic/aq_nic_internal.h   |   46 +
 .../net/ethernet/aquantia/atlantic/aq_pci_func.c   |  345 +++
 .../net/ethernet/aquantia/atlantic/aq_pci_func.h   |   34 +
 drivers/net/ethernet/aquantia/atlantic/aq_ring.c   |  376 
 drivers/net/ethernet/aquantia/atlantic/aq_ring.h   |  157 ++
 drivers/net/ethernet/aquantia/atlantic/aq_rss.h|   26 +
 drivers/net/ethernet/aquantia/atlantic/aq_utils.h  |   50 +
 drivers/net/ethernet/aquantia/atlantic/aq_vec.c|  392 
 drivers/net/ethernet/aquantia/atlantic/aq_vec.h|   42 +
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.c  |  905 
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_a0.h  |   34 +
 .../aquantia/atlantic/hw_atl/hw_atl_a0_internal.h  |  155 ++
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c  |  958 
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.h  |   34 +
 .../aquantia/atlantic/hw_atl/hw_atl_b0_internal.h  |  207 ++
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.c | 1394 
 .../ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.h |  677 ++
 .../aquantia/atlantic/hw_atl/hw_atl_llh_internal.h | 2375 
 

[PATCH net] sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment

2017-01-23 Thread Xin Long
Now sctp gso puts segments into skb's frag_list, then processes these
segments in skb_segment. But skb_segment handles them only when sg is
enabled, as they are in the same branch as skb's frags.

Almost all NICs support sg, other than some old ones. But since commit
1e16aa3ddf86 ("net: gso: use feature flag argument in all protocol gso
handlers"), features &= skb->dev->hw_enc_features, and xfrm_output_gso
calls skb_segment with features = 0, which means sctp gso would call
skb_segment with sg = 0, and skb_segment would not work as expected.

This patch is to fix it by setting features param with NETIF_F_SG when
calling skb_segment so that it can go the right branch to process the
skb's frag_list.

Signed-off-by: Xin Long 
---
 net/sctp/offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 7e869d0..4f5a2b5 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -68,7 +68,7 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
goto out;
}
 
-   segs = skb_segment(skb, features | NETIF_F_HW_CSUM);
+   segs = skb_segment(skb, features | NETIF_F_HW_CSUM | NETIF_F_SG);
if (IS_ERR(segs))
goto out;
 
-- 
2.1.0
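The bug and the fix above are pure flag arithmetic: `features &= skb->dev->hw_enc_features` can clear the SG bit (and xfrm_output_gso passes features = 0 outright), so OR-ing NETIF_F_SG back in before skb_segment() guarantees the frag_list branch runs. A tiny sketch of that arithmetic follows; the bit values are hypothetical stand-ins for the real netdev feature flags.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the flag handling in the fix, with illustrative bit
 * positions (the real flags live in netdev_features_t). */
#define F_HW_CSUM (1u << 0)
#define F_SG      (1u << 1)

/* What sctp_gso_segment() does after the patch: force both bits on
 * regardless of what the caller's mask left behind. */
static uint32_t gso_features(uint32_t features)
{
	return features | F_HW_CSUM | F_SG;
}
```

With features = 0 (the xfrm path), the unpatched code would pass SG cleared; the OR makes the outcome independent of the caller's mask.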



[PATCH net] sctp: sctp_addr_id2transport should verify the addr before looking up assoc

2017-01-23 Thread Xin Long
sctp_addr_id2transport is a function for sockopt to look up assoc by
address. As the address is from userspace, it can be a v4-mapped v6
address. But in sctp protocol stack, it always handles a v4-mapped
v6 address as a v4 address. So it's necessary to convert it to a v4
address before looking up assoc by address.

This patch is to fix it by calling sctp_verify_addr in which it can do
this conversion before calling sctp_endpoint_lookup_assoc, just like
what sctp_sendmsg and __sctp_connect do for the address from users.

Signed-off-by: Xin Long 
---
 net/sctp/socket.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 318c678..37eeab7 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -235,8 +235,12 @@ static struct sctp_transport *sctp_addr_id2transport(struct sock *sk,
   sctp_assoc_t id)
 {
struct sctp_association *addr_asoc = NULL, *id_asoc = NULL;
-   struct sctp_transport *transport;
+   struct sctp_af *af = sctp_get_af_specific(addr->ss_family);
union sctp_addr *laddr = (union sctp_addr *)addr;
+   struct sctp_transport *transport;
+
+   if (sctp_verify_addr(sk, laddr, af->sockaddr_len))
+   return NULL;
 
 	addr_asoc = sctp_endpoint_lookup_assoc(sctp_sk(sk)->ep,
 					       laddr,
 					       &transport);
-- 
2.1.0
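The patch above hinges on what a v4-mapped v6 address is: an IPv6 address of the form `::ffff:a.b.c.d`, which carries an IPv4 address in its last four bytes and which the SCTP stack must normalize to the v4 form before association lookup. The detection and extraction can be sketched as follows; this is an illustrative user-space version, not the kernel's sctp_v6_map_v4() helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A v4-mapped address is 80 zero bits, then 0xffff, then the IPv4
 * address (RFC 4291, section 2.5.5.2). */
static int is_v4_mapped(const uint8_t a6[16])
{
	static const uint8_t prefix[12] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
	};

	return memcmp(a6, prefix, 12) == 0;
}

/* Extract the embedded IPv4 address from the last 4 bytes. */
static void v4_from_mapped(const uint8_t a6[16], uint8_t a4[4])
{
	memcpy(a4, a6 + 12, 4);
}
```

sctp_verify_addr() performs this normalization internally, which is why calling it before the lookup makes the userspace-supplied address match the v4 form stored in the association.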



Re: [PATCH 2/3] ath10k: use dma_zalloc_coherent()

2017-01-23 Thread Joe Perches
On Tue, 2017-01-24 at 05:18 +, Valo, Kalle wrote:
> Joe Perches  writes:
> 
> > On Mon, 2017-01-23 at 15:04 +, Srinivas Kandagatla wrote:
> > > use dma_zalloc_coherent() instead of dma_alloc_coherent and memset().
> > 
> > []
> > > diff --git a/drivers/net/wireless/ath/ath10k/pci.c 
> > > b/drivers/net/wireless/ath/ath10k/pci.c
> > 
> > []
> > > @@ -896,7 +896,7 @@ static int ath10k_pci_diag_read_mem(struct ath10k 
> > > *ar, u32 address, void *data,
> > >*/
> > >   alloc_nbytes = min_t(unsigned int, nbytes, DIAG_TRANSFER_LIMIT);
> > >  
> > > - data_buf = (unsigned char *)dma_alloc_coherent(ar->dev,
> > > + data_buf = (unsigned char *)dma_zalloc_coherent(ar->dev,
> > >  alloc_nbytes,
> > >  _data_base,
> > >  GFP_ATOMIC);
> > 
> > trivia:
> > 
> > Nicer to realign arguments and remove the unnecessary cast.
> > 
> > Perhaps:
> > 
> > data_buf = dma_zalloc_coherent(ar->dev, alloc_nbytes, _data_base,
> >    GFP_ATOMIC);
> 
> Sure, but that should be in a separate patch.

I don't think so, trivial patches can be combined.

It's also nicer to realign all modified multiline
arguments when performing these changes.

Coccinelle generally does it automatically.


loopback device reference count leakage

2017-01-23 Thread Kaiwen Xu
Hi netdev folks,

I am currently experiencing an issue related to the loopback device during
network device shutdown inside a network namespace, which manifested as

unregister_netdevice: waiting for lo to become free. Usage count = 1

showing up every 10 seconds or so in the kernel log. It eventually
causes a deadlock on new network namespace creation.

When I was trying to debug the issue, I tested from the latest kernel
(4.10-rc) back to 3.19, and the issue can be reproduced on all of them.
I also tried disabling IPv6, which doesn't help either.

So I patched the kernel with the following change to add additional
logging for dev_put() and dev_hold().

https://lkml.org/lkml/2008/7/20/5

Here are all the dev_put()/dev_hold() calls for the loopback device inside
a namespace, from creation to shutdown, when the issue occurs.

Jan 11 16:17:43  kernel: [ 2196.943640] lo: dev_hold 1 rx_queue_add_kobject
Jan 11 16:17:43  kernel: [ 2196.943652] lo: dev_hold 2 netdev_queue_add_kobject
Jan 11 16:17:43  kernel: [ 2196.943654] lo: dev_hold 3 register_netdevice
Jan 11 16:17:43  kernel: [ 2196.943658] lo: dev_hold 4 neigh_parms_alloc
Jan 11 16:17:43  kernel: [ 2196.943660] lo: dev_hold 5 inetdev_init
Jan 11 16:17:43  kernel: [ 2197.001771] lo: dev_hold 6 qdisc_alloc
Jan 11 16:17:43  kernel: [ 2197.001815] lo: dev_hold 7 dev_get_by_index
Jan 11 16:17:43  kernel: [ 2197.001824] lo: dev_hold 8 dev_get_by_index
Jan 11 16:17:43  kernel: [ 2197.001831] lo: dev_hold 9 fib_check_nh
Jan 11 16:17:43  kernel: [ 2197.001836] lo: dev_hold 10 fib_check_nh
Jan 11 16:17:43  kernel: [ 2197.001843] lo: dev_hold 11 dev_get_by_index
Jan 11 16:17:43  kernel: [ 2197.001849] lo: dev_hold 12 dev_get_by_index
Jan 11 16:17:43  kernel: [ 2197.001852] lo: dev_hold 13 fib_check_nh
Jan 11 16:17:43  kernel: [ 2197.001856] lo: dev_hold 14 fib_check_nh
Jan 11 16:17:43  kernel: [ 2197.025451] lo: dev_put 14 free_fib_info_rcu
Jan 11 16:17:43  kernel: [ 2197.025473] lo: dev_put 13 free_fib_info_rcu
Jan 11 16:17:43  kernel: [ 2197.025475] lo: dev_put 12 free_fib_info_rcu
Jan 11 16:17:43  kernel: [ 2197.025477] lo: dev_put 11 free_fib_info_rcu
Jan 11 16:17:43  kernel: [ 2197.025480] lo: dev_put 10 free_fib_info_rcu
Jan 11 16:17:43  kernel: [ 2197.025482] lo: dev_put 9 free_fib_info_rcu
Jan 11 16:17:43  kernel: [ 2197.414900] lo: dev_hold 9 dst_init
Jan 11 16:17:43  kernel: [ 2197.418634] lo: dev_hold 10 dst_init
Jan 11 16:17:43  kernel: [ 2197.418638] lo: dev_hold 11 dst_init
Jan 11 16:17:43  kernel: [ 2197.445496] lo: dev_put 11 dst_destroy
Jan 11 16:17:44  kernel: [ 2197.614240] lo: dev_hold 11 dst_init
Jan 11 16:20:41  kernel: [ 2374.898896] lo: dev_hold 12 dst_init
Jan 11 16:20:41  kernel: [ 2374.898917] lo: dev_hold 13 __neigh_create
Jan 11 16:23:42  kernel: [ 2555.900160] lo: dev_hold 14 dst_init
Jan 11 16:24:10  kernel: [ 2583.573193] lo: dev_hold 15 dst_init
Jan 11 16:26:09  kernel: [ 2702.992174] lo: dev_hold 16 dst_init
Jan 11 16:27:15  kernel: [ 2769.188044] lo: dev_hold 17 dst_init
Jan 11 16:31:57  kernel: [ 3050.821593] lo: dev_hold 18 dst_init
Jan 11 16:37:41  kernel: [ 3394.962714] lo: dev_hold 19 dst_init
Jan 11 16:46:27  kernel: [ 3921.376163] lo: dev_hold 20 dst_init
Jan 11 16:46:30  kernel: [ 3923.759813] lo: dev_hold 21 dst_init
Jan 11 16:47:01  kernel: [ 3954.802588] lo: dev_hold 22 dst_ifdown
Jan 11 16:47:01  kernel: [ 3954.802596] lo: dev_hold 23 dst_ifdown
Jan 11 16:47:01  kernel: [ 3954.802599] lo: dev_hold 24 dst_ifdown
Jan 11 16:47:01  kernel: [ 3954.802602] lo: dev_hold 25 dst_ifdown
Jan 11 16:47:01  kernel: [ 3954.802605] lo: dev_hold 26 dst_ifdown
Jan 11 16:47:01  kernel: [ 3954.802609] lo: dev_hold 27 dst_ifdown
Jan 11 16:47:01  kernel: [ 3955.482563] lo: dev_put 27 dst_destroy
Jan 11 16:47:01  kernel: [ 3955.482566] lo: dev_put 26 dst_destroy
Jan 11 16:47:01  kernel: [ 3955.482568] lo: dev_put 25 dst_destroy
Jan 11 16:47:01  kernel: [ 3955.482570] lo: dev_put 24 dst_destroy
Jan 11 16:47:01  kernel: [ 3955.482572] lo: dev_put 23 dst_destroy
Jan 11 16:47:01  kernel: [ 3955.482574] lo: dev_put 22 dst_destroy
Jan 11 16:49:20  kernel: [ 4093.651505] lo: dev_put 21 neigh_destroy
Jan 11 16:49:20  kernel: [ 4093.653397] lo: dev_put 20 qdisc_destroy
Jan 11 16:49:20  kernel: [ 4093.653432] lo: dev_put 19 neigh_parms_release
Jan 11 16:49:20  kernel: [ 4093.653463] lo: dev_put 18 rx_queue_release
Jan 11 16:49:20  kernel: [ 4093.653474] lo: dev_put 17 netdev_queue_release
Jan 11 16:49:20  kernel: [ 4093.653520] lo: dev_put 16 rollback_registered_many
Jan 11 16:49:20  kernel: [ 4093.663424] lo: dev_put 15 free_fib_info_rcu
Jan 11 16:49:20  kernel: [ 4093.667401] lo: dev_put 14 free_fib_info_rcu
Jan 11 16:49:20  kernel: [ 4093.667430] lo: dev_put 13 dst_destroy
Jan 11 16:49:20  kernel: [ 4093.667432] lo: dev_put 12 dst_destroy
Jan 11 16:49:20  kernel: [ 4093.667434] lo: dev_put 11 dst_destroy
Jan 11 16:49:20  kernel: [ 4093.667436] lo: dev_put 10 dst_destroy
Jan 11 16:49:20  kernel: [ 4093.667438] lo: dev_put 9 dst_destroy
Jan 11 

Re: [PATCH 2/3] ath10k: use dma_zalloc_coherent()

2017-01-23 Thread Valo, Kalle
Joe Perches  writes:

> On Mon, 2017-01-23 at 15:04 +, Srinivas Kandagatla wrote:
>> use dma_zalloc_coherent() instead of dma_alloc_coherent and memset().
> []
>> diff --git a/drivers/net/wireless/ath/ath10k/pci.c 
>> b/drivers/net/wireless/ath/ath10k/pci.c
> []
>> @@ -896,7 +896,7 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, 
>> u32 address, void *data,
>>   */
>>  alloc_nbytes = min_t(unsigned int, nbytes, DIAG_TRANSFER_LIMIT);
>>  
>> -data_buf = (unsigned char *)dma_alloc_coherent(ar->dev,
>> +data_buf = (unsigned char *)dma_zalloc_coherent(ar->dev,
>> alloc_nbytes,
>> _data_base,
>> GFP_ATOMIC);
>
> trivia:
>
> Nicer to realign arguments and remove the unnecessary cast.
>
> Perhaps:
>
>   data_buf = dma_zalloc_coherent(ar->dev, alloc_nbytes, _data_base,
>  GFP_ATOMIC);

Sure, but that should be in a separate patch.

-- 
Kalle Valo

linux-next: manual merge of the kselftest tree with the net-next tree

2017-01-23 Thread Stephen Rothwell
Hi Shuah,

Today's linux-next merge of the kselftest tree got a conflict in:

  tools/testing/selftests/bpf/Makefile

between commit:

  4d3381f5a322 ("bpf: Add tests for the lpm trie map")

from the net-next tree and commit:

  88baa78d1f31 ("selftests: remove duplicated all and clean target")

from the kselftest tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non-trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

P.S. Shuah, that kselftest commit has a different email address in its
Author and Signed-off-by.

-- 
Cheers,
Stephen Rothwell

diff --cc tools/testing/selftests/bpf/Makefile
index 064a3e5f2836,058351b0694f..
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@@ -1,13 -1,7 +1,7 @@@
  CFLAGS += -Wall -O2 -I../../../../usr/include
  
- test_objs = test_verifier test_maps test_lru_map test_lpm_map
 -TEST_GEN_PROGS = test_verifier test_maps test_lru_map
++TEST_GEN_PROGS = test_verifier test_maps test_lru_map test_lpm_map
  
- TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
- TEST_FILES := $(test_objs)
- 
- all: $(test_objs)
+ TEST_PROGS := test_kmod.sh
  
  include ../lib.mk
- 
- clean:
-   $(RM) $(test_objs)


[PATCH net-next v2 0/2] vxlan: misc fdb fixes

2017-01-23 Thread Roopa Prabhu
From: Roopa Prabhu 

Balakrishnan Raman (1):
  vxlan: do not age static remote mac entries

Roopa Prabhu (1):
  vxlan: don't flush static fdb entries on admin down

 drivers/net/vxlan.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

-- 
v1 -> v2: use bool as suggested by DavidM



[PATCH net-next v2 2/2] vxlan: do not age static remote mac entries

2017-01-23 Thread Roopa Prabhu
From: Balakrishnan Raman 

MAC aging applies only to dynamically learnt remote MAC entries.
Check for user-configured static remote MAC entries and skip
aging them.

Signed-off-by: Balakrishnan Raman 
Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index b34822f..09aa52e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2268,7 +2268,7 @@ static void vxlan_cleanup(unsigned long arg)
= container_of(p, struct vxlan_fdb, hlist);
unsigned long timeout;
 
-   if (f->state & NUD_PERMANENT)
+   if (f->state & (NUD_PERMANENT | NUD_NOARP))
continue;
 
timeout = f->used + vxlan->cfg.age_interval * HZ;
-- 
1.9.1



[PATCH net-next v2 1/2] vxlan: don't flush static fdb entries on admin down

2017-01-23 Thread Roopa Prabhu
From: Roopa Prabhu 

This patch skips flushing static fdb entries in
ndo_stop, but flushes all fdb entries during vxlan
device delete. This is consistent with the bridge
driver fdb

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 19b1653..b34822f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2354,7 +2354,7 @@ static int vxlan_open(struct net_device *dev)
 }
 
 /* Purge the forwarding table */
-static void vxlan_flush(struct vxlan_dev *vxlan)
+static void vxlan_flush(struct vxlan_dev *vxlan, bool do_all)
 {
unsigned int h;
 
@@ -2364,6 +2364,8 @@ static void vxlan_flush(struct vxlan_dev *vxlan)
hlist_for_each_safe(p, n, &vxlan->fdb_head[h]) {
struct vxlan_fdb *f
= container_of(p, struct vxlan_fdb, hlist);
+   if (!do_all && (f->state & (NUD_PERMANENT | NUD_NOARP)))
+   continue;
/* the all_zeros_mac entry is deleted at vxlan_uninit */
if (!is_zero_ether_addr(f->eth_addr))
vxlan_fdb_destroy(vxlan, f);
@@ -2385,7 +2387,7 @@ static int vxlan_stop(struct net_device *dev)
 
del_timer_sync(&vxlan->age_timer);
 
-   vxlan_flush(vxlan);
+   vxlan_flush(vxlan, false);
vxlan_sock_release(vxlan);
 
return ret;
@@ -3058,6 +3060,8 @@ static void vxlan_dellink(struct net_device *dev, struct list_head *head)
struct vxlan_dev *vxlan = netdev_priv(dev);
struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
 
+   vxlan_flush(vxlan, true);
+
spin_lock(&vn->sock_lock);
if (!hlist_unhashed(&vxlan->hlist))
hlist_del_rcu(&vxlan->hlist);
-- 
1.9.1



Re: XDP offload to hypervisor

2017-01-23 Thread Michael S. Tsirkin
On Mon, Jan 23, 2017 at 07:50:31PM -0800, Alexei Starovoitov wrote:
> On Tue, Jan 24, 2017 at 05:33:37AM +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 23, 2017 at 05:02:02PM -0800, Alexei Starovoitov wrote:
> > > Frankly I don't understand the whole virtio nit picking that was 
> > > happening.
> > > imo virtio+xdp by itself is only useful for debugging, development and 
> > > testing
> > > of xdp programs in a VM. The discussion about performance of virtio+xdp
> > > will only be meaningful when corresponding host part is done.
> > > Likely in the form of vhost extensions and maybe driver changes.
> > > Trying to optimize virtio+xdp when host is doing traditional skb+vhost
> > > isn't going to be impactful.
> > 
> > Well if packets can be dropped without a host/guest
> > transition then yes, that will have an impact even
> > with traditional skbs.
> 
> I don't think it's worth optimizing for though, since the speed of drop
> matters for ddos-like use case

It's not just drops. adjust head + xmit can handle bridging
without entering the VM.

> and if we let host be flooded with skbs,
> we already lost, since the only thing cpu is doing is allocating skbs
> and moving them around. Whether drop is happening upon entry into VM
> or host does it in some post-vhost layer doesn't change the picture much.
> That said, I do like the idea of offloading virto+xdp into host somehow.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 8:05 PM, David Ahern  wrote:
> On 1/23/17 8:37 PM, Andy Lutomirski wrote:
>> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
>> subject to BPF code installed by a different, potentially unrelated
>> process.  That's a new situation.  The failure can happen when a
>> privileged supervisor (whoever runs ip vrf) runs a clueless or
>> unprivileged program (the thing calling unshare()).
>
> There are many, many ways to misconfigure networking and to run programs in a 
> context or with an input argument that causes the program to not work at all, 
> not work as expected or stop working. This situation is no different.
>
> For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are 
> namespace based is the ifindex. You brought up the example of changing 
> namespaces where the ifindex is not defined. Alexei mentioned an example 
> where interfaces can be moved to another namespace breaking any ifindex based 
> programs. Another example is the interface can be deleted. Deleting an 
> interface with sockets bound to it does not impact the program in any way - 
> no notification, no wakeup, nothing. The sockets just don't work.

And if you use 'ip vrf' to bind to a vrf with ifindex 4 and a program
unshares netns and creates an interface with ifindex 4, then that
program will end up with its sockets magically bound to ifindex 4 and
will silently malfunction.

I can think of multiple ways to address this problem.  You could scope
the hooks to a netns (which is sort of what my patch does).  You could
find a way to force programs in a given cgroup to only execute in a
single netns, although that would probably cause other breakage.  You
could improve the BPF hook API to be netns-aware, which could plausbly
address issues related to unshare() but might get very tricky when
setns() is involved.

My point is that, in 4.10-rc, it doesn't work right, and I doubt this
problem is restricted to just 'ip vrf'.  Without some kind of change
to the way that netns and cgroup+bpf interact, anything that uses
sk_bound_dev_if or reads the ifindex on an skb will be subject to a
huge footgun that unprivileged programs can trigger and any future
attempt to make the cgroup+bpf work for unprivileged users is going to
be more complicated than it deserves to be.

(Aside: I wonder if a similar goal to 'ip vrf' could be achieved by
moving the vrf to a different network namespace and putting programs
that you want to be subject to the VRF into that namespace.)

>
> The point of using a management tool like ip is to handle the details to make 
> things sane -- which is the consistent with the whole point of ip netns vs 
> running unshare -n.

I still don't understand how you're going to make 'ip netns' make this
problem go away.  Keep in mind that there are real programs that call
the unshare() syscall directly.

>
>>
>> If the way cgroup+bpf worked was that a program did a prctl() or
>> similar to opt in to being managed by a provided cgroup-aware BPF
>> program, then I'd agree with everything you're saying.  But that's not
>> at all how this code works.
>
> I don't want an opt-in approach. The program does not need to know or even 
> care that it has some restriction. Today, someone runs 'ip netns exec NAME 
> COMMAND' the command does not need to know or care that the network namespace 
> was changed on it. Same with 'ip vrf exec' -- it is a network policy that is 
> forced on the program by the user.

I absolutely agree.  My point is that cgroup+bpf is *not* opt-in, so
the bar should be higher.  You can't blame the evil program that
called unshare() for breaking things when 'ip vrf' unavoidably sets
things up so that unshare will malfunction.  It should avoid
malfunctioning when running Linux programs that are unaware of it.
This should include programs like unshare and 'ip netns'.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 8:10 PM, David Ahern  wrote:
> On 1/23/17 7:39 PM, Andy Lutomirski wrote:
>> I'm not sure I followed what you meant.  If I understood right (which
>> is a big if) you're saying that ip vrf when run in a netns works
>> correctly in that netns.  I agree, but I think that it would continue
>> to work (even more reliably) if the hooks only executed in the netns
>> in which they were created.  I think that would probably be a good
>> improvement, and it could be done on top of my patch, but I'm not
>> about to write that code for 4.10.
>
> Sounds like an implementation optimization that does not require
> limiting the code today to just the init namespace.
>

It's an ABI change, not an implementation change.  If someone wants to
make the code fully netns-aware in time for 4.10, that sounds great.
I'm not going to do that.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread David Ahern
On 1/23/17 8:37 PM, Andy Lutomirski wrote:
> Yes, it is a bug because cgroup+bpf causes unwitting programs to be
> subject to BPF code installed by a different, potentially unrelated
> process.  That's a new situation.  The failure can happen when a
> privileged supervisor (whoever runs ip vrf) runs a clueless or
> unprivileged program (the thing calling unshare()).

There are many, many ways to misconfigure networking and to run programs in a 
context or with an input argument that causes the program to not work at all, 
not work as expected or stop working. This situation is no different. 

For example, the only aspect of BPF_PROG_TYPE_CGROUP_SOCK filters that are 
namespace based is the ifindex. You brought up the example of changing 
namespaces where the ifindex is not defined. Alexei mentioned an example where 
interfaces can be moved to another namespace breaking any ifindex based 
programs. Another example is the interface can be deleted. Deleting an 
interface with sockets bound to it does not impact the program in any way - no 
notification, no wakeup, nothing. The sockets just don't work.

The point of using a management tool like ip is to handle the details to make 
things sane -- which is the consistent with the whole point of ip netns vs 
running unshare -n.

> 
> If the way cgroup+bpf worked was that a program did a prctl() or
> similar to opt in to being managed by a provided cgroup-aware BPF
> program, then I'd agree with everything you're saying.  But that's not
> at all how this code works.

I don't want an opt-in approach. The program does not need to know or even care 
that it has some restriction. Today, someone runs 'ip netns exec NAME COMMAND' 
the command does not need to know or care that the network namespace was 
changed on it. Same with 'ip vrf exec' -- it is a network policy that is forced 
on the program by the user.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread David Ahern
On 1/23/17 7:39 PM, Andy Lutomirski wrote:
> I'm not sure I followed what you meant.  If I understood right (which
> is a big if) you're saying that ip vrf when run in a netns works
> correctly in that netns.  I agree, but I think that it would continue
> to work (even more reliably) if the hooks only executed in the netns
> in which they were created.  I think that would probably be a good
> improvement, and it could be done on top of my patch, but I'm not
> about to write that code for 4.10.

Sounds like an implementation optimization that does not require limiting
the code today to just the init namespace.




[PATCH 2/3] Bluetooth: cmtp: fix possible might sleep error in cmtp_session

2017-01-23 Thread Jeffy Chen
It looks like cmtp_session has the same pattern as the issue reported in
the old rfcomm code:

while (1) {
set_current_state(TASK_INTERRUPTIBLE);
if (condition)
break;
// may call might_sleep here
schedule();
}
__set_current_state(TASK_RUNNING);

Which was fixed by:
dfb2fae Bluetooth: Fix nested sleeps

So let's fix it the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

 net/bluetooth/cmtp/core.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 9e59b66..6b03f2b 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -280,16 +280,16 @@ static int cmtp_session(void *arg)
struct cmtp_session *session = arg;
struct sock *sk = session->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
BT_DBG("session %p", session);
 
set_user_nice(current, -15);
 
-   init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
 
if (atomic_read(&session->terminate))
break;
@@ -306,9 +306,8 @@ static int cmtp_session(void *arg)
 
cmtp_process_transmit(session);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
 
down_write(&cmtp_session_sem);
@@ -393,7 +392,11 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, struct socket *sock)
err = cmtp_attach_device(session);
if (err < 0) {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(sk_sleep(session->sock->sk));
up_write(&cmtp_session_sem);
return err;
}
@@ -431,7 +434,11 @@ int cmtp_del_connection(struct cmtp_conndel_req *req)
 
/* Stop session thread */
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(sk_sleep(session->sock->sk));
} else
err = -ENOENT;
 
-- 
2.1.4




[PATCH 1/3] Bluetooth: bnep: fix possible might sleep error in bnep_session

2017-01-23 Thread Jeffy Chen
It looks like bnep_session has the same pattern as the issue reported in
the old rfcomm code:

while (1) {
set_current_state(TASK_INTERRUPTIBLE);
if (condition)
break;
// may call might_sleep here
schedule();
}
__set_current_state(TASK_RUNNING);

Which was fixed by:
dfb2fae Bluetooth: Fix nested sleeps

So let's fix it the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

 net/bluetooth/bnep/core.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index fbf251f..da04d51 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -484,16 +484,16 @@ static int bnep_session(void *arg)
struct net_device *dev = s->dev;
struct sock *sk = s->sock->sk;
struct sk_buff *skb;
-   wait_queue_t wait;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
BT_DBG("");
 
set_user_nice(current, -15);
 
-   init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sk), &wait);
while (1) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
 
if (atomic_read(&s->terminate))
break;
@@ -515,9 +515,8 @@ static int bnep_session(void *arg)
break;
netif_wake_queue(dev);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
-   __set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
 
/* Cleanup session */
@@ -666,7 +665,11 @@ int bnep_del_connection(struct bnep_conndel_req *req)
s = __bnep_get_session(req->dst);
if (s) {
atomic_inc(&s->terminate);
-   wake_up_process(s->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(sk_sleep(s->sock->sk));
} else
err = -ENOENT;
 
-- 
2.1.4




[PATCH 3/3] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread

2017-01-23 Thread Jeffy Chen
It looks like hidp_session_thread has the same pattern as the issue
reported in the old rfcomm code:

while (1) {
set_current_state(TASK_INTERRUPTIBLE);
if (condition)
break;
// may call might_sleep here
schedule();
}
__set_current_state(TASK_RUNNING);

Which was fixed by:
dfb2fae Bluetooth: Fix nested sleeps

So let's fix it the same way, also following the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen 
---

 net/bluetooth/hidp/core.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 0bec458..43d6e6a 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -36,6 +36,7 @@
 #define VERSION "1.2"
 
 static DECLARE_RWSEM(hidp_session_sem);
+static DECLARE_WAIT_QUEUE_HEAD(hidp_session_wq);
 static LIST_HEAD(hidp_session_list);
 
 static unsigned char hidp_keycode[256] = {
@@ -1068,12 +1069,15 @@ static int hidp_session_start_sync(struct hidp_session *session)
  * Wake up session thread and notify it to stop. This is asynchronous and
  * returns immediately. Call this whenever a runtime error occurs and you want
  * the session to stop.
- * Note: wake_up_process() performs any necessary memory-barriers for us.
  */
 static void hidp_session_terminate(struct hidp_session *session)
 {
atomic_inc(&session->terminate);
-   wake_up_process(session->task);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
+
+   wake_up_interruptible(&hidp_session_wq);
 }
 
 /*
@@ -1180,7 +1184,9 @@ static void hidp_session_run(struct hidp_session *session)
struct sock *ctrl_sk = session->ctrl_sock->sk;
struct sock *intr_sk = session->intr_sock->sk;
struct sk_buff *skb;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
+   add_wait_queue(&hidp_session_wq, &wait);
for (;;) {
/*
 * This thread can be woken up two ways:
@@ -1188,12 +1194,10 @@ static void hidp_session_run(struct hidp_session *session)
 *session->terminate flag and wakes this thread up.
 *  - Via modifying the socket state of ctrl/intr_sock. This
 *thread is woken up by ->sk_state_changed().
-*
-* Note: set_current_state() performs any necessary
-* memory-barriers for us.
 */
-   set_current_state(TASK_INTERRUPTIBLE);
 
+   /* Ensure session->terminate is updated */
+   smp_mb__before_atomic();
if (atomic_read(&session->terminate))
break;
 
@@ -1227,11 +1231,14 @@ static void hidp_session_run(struct hidp_session *session)
hidp_process_transmit(session, &session->ctrl_transmit,
  session->ctrl_sock);
 
-   schedule();
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
+   remove_wait_queue(&hidp_session_wq, &wait);
 
atomic_inc(&session->terminate);
-   set_current_state(TASK_RUNNING);
+
+   /* Ensure session->terminate is updated */
+   smp_mb__after_atomic();
 }
 
 /*
-- 
2.1.4




Re: XDP offload to hypervisor

2017-01-23 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 05:33:37AM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 23, 2017 at 05:02:02PM -0800, Alexei Starovoitov wrote:
> > Frankly I don't understand the whole virtio nit picking that was happening.
> > imo virtio+xdp by itself is only useful for debugging, development and 
> > testing
> > of xdp programs in a VM. The discussion about performance of virtio+xdp
> > will only be meaningful when corresponding host part is done.
> Likely in the form of vhost extensions and maybe driver changes.
> > Trying to optimize virtio+xdp when host is doing traditional skb+vhost
> > isn't going to be impactful.
> 
> Well if packets can be dropped without a host/guest
> transition then yes, that will have an impact even
> with traditional skbs.

I don't think it's worth optimizing for though, since the speed of drop
matters for ddos-like use case and if we let host be flooded with skbs,
we already lost, since the only thing cpu is doing is allocating skbs
and moving them around. Whether drop is happening upon entry into VM
or host does it in some post-vhost layer doesn't change the picture much.
That said, I do like the idea of offloading virto+xdp into host somehow.



Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 7:13 PM, Alexei Starovoitov
 wrote:
> On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote:
>> Please explain how the change results in a broken ABI and how the
>> current ABI is better.  I gave a fully worked out example of how the
>> current ABI misbehaves using only standard Linux networking tools.
>
> I gave an example in the other thread that demonstrated
> cgroup escape by the appliction when it changes netns.
> If application became part of cgroup, it has to stay there,
> no matter setns() calls afterwards.

The other thread is out of control.  Can you restate your example?
Please keep in mind that unprivileged programs can unshare their netns.

>
>> >
>> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> index e89acea22ecf..c0bbc55e244d 100644
>> >> --- a/kernel/bpf/syscall.c
>> >> +++ b/kernel/bpf/syscall.c
>> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr 
>> >> *attr)
>> >>   struct cgroup *cgrp;
>> >>   enum bpf_prog_type ptype;
>> >>
>> >> + /*
>> >> +  * For now, socket bpf hooks attached to cgroups can only be
>> >> +  * installed in the init netns and only affect the init netns.
>> >> +  * This could be relaxed in the future once some semantic issues
>> >> +  * are resolved.  For example, ifindexes belonging to one netns
>> >> +  * should probably not be visible to hooks installed by programs
>> >> +  * running in a different netns.
>> >
>> > the comment is bogus and shows complete lack of understanding
>> > how bpf is working and how it's being used.
>> > See SKF_AD_IFINDEX that was in classic bpf forever.
>> >
>>
>> I think I disagree completely.
>>
>> With classic BPF, a program creates a socket and binds a bpf program
>> to it.  That program can see the ifindex, which is fine because that
>> ifindex is scoped to the socket's netns, which is (unless the caller
>> uses setns() or unshare()) the same as the caller's netns, so the
>> ifindex means exactly what the caller thinks it means.
>>
>> With cgroup+bpf, the program installing the hook can be completely
>> unrelated to the program whose sockets are subject to the hook, and,
>> if they're using different namespaces, it breaks.
>
> Please also see ingress_ifindex, ifindex, bpf_redirect(), bpf_clone_redirect()
> that all operate on the ifindex while the program can be detached from
> the application. In general bpf program and application that loaded and
> attached it are mostly disjoint in case of networking. We have tail_call
> mechanism and master bpf prog may call programs installed by other apps
> that may have exited already.

If program A acquires a BPF object from program B where program B runs
in a different netns from program A, and program B uses or tail-calls
that BPF object, then A is either doing it intentionally and is
netns-aware or it is terminally buggy and deserves what it gets.

> cls_bpf is scoped by netdev that belongs to some netns.
> If after attaching a program with hardcoded if(ifindex==3) check
> to such netdev, this netdev is moved into different netns, this 'if' check
> and the program become bogus and won't do what program author
> expected. It is expected behavior. The same thing with current 'ip vrf'
> implementation that Dave is adjusting. It's not a bug in cgroup+bpf behavior.
>

Yes, it is a bug because cgroup+bpf causes unwitting programs to be
subject to BPF code installed by a different, potentially unrelated
process.  That's a new situation.  The failure can happen when a
privileged supervisor (whoever runs ip vrf) runs a clueless or
unprivileged program (the thing calling unshare()).

If the way cgroup+bpf worked was that a program did a prctl() or
similar to opt in to being managed by a provided cgroup-aware BPF
program, then I'd agree with everything you're saying.  But that's not
at all how this code works.


Re: XDP offload to hypervisor

2017-01-23 Thread Michael S. Tsirkin
On Mon, Jan 23, 2017 at 05:02:02PM -0800, Alexei Starovoitov wrote:
> Frankly I don't understand the whole virtio nit picking that was happening.
> imo virtio+xdp by itself is only useful for debugging, development and testing
> of xdp programs in a VM. The discussion about performance of virtio+xdp
> will only be meaningful when corresponding host part is done.
> Likely in the form of vhost extensions and may be driver changes.
> Trying to optimize virtio+xdp when host is doing traditional skb+vhost
> isn't going to be impactful.

Well if packets can be dropped without a host/guest
transition then yes, that will have an impact even
with traditional skbs.

-- 
MST


[PATCH net-next] net: dsa: Drop WARN() in tag_brcm.c

2017-01-23 Thread Florian Fainelli
We may be able to see invalid Broadcom tags when the hardware and drivers are
misconfigured, or just while exercising the error path. Instead of flooding
the console with messages, flat out drop the packet.

Signed-off-by: Florian Fainelli 
---
 net/dsa/tag_brcm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/dsa/tag_brcm.c b/net/dsa/tag_brcm.c
index af82927674e0..cb5a2b7a0118 100644
--- a/net/dsa/tag_brcm.c
+++ b/net/dsa/tag_brcm.c
@@ -121,7 +121,8 @@ static int brcm_tag_rcv(struct sk_buff *skb, struct net_device *dev,
/* We should never see a reserved reason code without knowing how to
 * handle it
 */
-   WARN_ON(brcm_tag[2] & BRCM_EG_RC_RSVD);
+   if (unlikely(brcm_tag[2] & BRCM_EG_RC_RSVD))
+   goto out_drop;
 
/* Locate which port this is coming from */
source_port = brcm_tag[3] & BRCM_EG_PID_MASK;
-- 
2.9.3



Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 06:42:27PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
> >> To see how cgroup+bpf interacts with network namespaces, I wrote a
> >> little program called show_bind that calls getsockopt(...,
> >> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> >>
> >>  # ./ip link add dev vrf0 type vrf table 10
> >>  # ./ip vrf exec vrf0 ./show_bind
> >>  Default binding is "vrf0"
> >>  # ./ip vrf exec vrf0 unshare -n ./show_bind
> >>  show_bind: getsockopt: No such device
> >>
> >> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> >> the init netns and installs a hook that binds sockets to that
> >> ifindex.  When the hook runs in a different netns, it sets
> >> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> >> incorrect behavior.  In this particular example, the ifindex was 4
> >> and there was no ifindex 4 in the new netns.  If there had been,
> >> this test would have malfunctioned differently
> >>
> >> Since it's rather late in the release cycle, let's punt.  This patch
> >> makes it impossible to install cgroup+bpf hooks outside the init
> >> netns and makes them not run on sockets that aren't in the init
> >> netns.
> >>
> >> In a future release, it should be relatively straightforward to make
> >> these hooks be local to a netns and, if needed, to add a flag so
> >> that hooks can be made global if necessary.  Global hooks should
> >> presumably be constrained so that they can't write to any ifindex
> >> fields.
> >>
> >> Cc: Daniel Borkmann 
> >> Cc: Alexei Starovoitov 
> >> Cc: David Ahern 
> >> Signed-off-by: Andy Lutomirski 
> >> ---
> >>
> >> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
> >> bug can be hit using current iproute2 -git.  please consider this for
> >> -net.
> >>
> >> Changes from v1:
> >>  - Fix the commit message.  'git commit' was very clever and thought that
> >>all the interesting bits of the test case were intended to be comments
> >>and stripped them.  Whoops!
> >>
> >> kernel/bpf/cgroup.c  | 21 +
> >>  kernel/bpf/syscall.c | 11 +++
> >>  2 files changed, 32 insertions(+)
> >>
> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> >> index a515f7b007c6..a824f543de69 100644
> >> --- a/kernel/bpf/cgroup.c
> >> +++ b/kernel/bpf/cgroup.c
> >> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
> >>   if (!sk || !sk_fullsock(sk))
> >>   return 0;
> >>
> >> + /*
> >> +  * For now, socket bpf hooks attached to cgroups can only be
> >> +  * installed in the init netns and only affect the init netns.
> >> +  * This could be relaxed in the future once some semantic issues
> >> +  * are resolved.  For example, ifindexes belonging to one netns
> >> +  * should probably not be visible to hooks installed by programs
> >> +  * running in a different netns.
> >> +  */
> >> + if (sock_net(sk) != &init_net)
> >> + return 0;
> >
> > Nack.
> > Such check will create absolutely broken abi that we won't be able to fix 
> > later.
> 
> Please explain how the change results in a broken ABI and how the
> current ABI is better.  I gave a fully worked out example of how the
> current ABI misbehaves using only standard Linux networking tools.

I gave an example in the other thread that demonstrated
cgroup escape by the application when it changes netns.
If application became part of cgroup, it has to stay there,
no matter setns() calls afterwards.

> >
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index e89acea22ecf..c0bbc55e244d 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >>   struct cgroup *cgrp;
> >>   enum bpf_prog_type ptype;
> >>
> >> + /*
> >> +  * For now, socket bpf hooks attached to cgroups can only be
> >> +  * installed in the init netns and only affect the init netns.
> >> +  * This could be relaxed in the future once some semantic issues
> >> +  * are resolved.  For example, ifindexes belonging to one netns
> >> +  * should probably not be visible to hooks installed by programs
> >> +  * running in a different netns.
> >
> > the comment is bogus and shows complete lack of understanding
> > how bpf is working and how it's being used.
> > See SKF_AD_IFINDEX that was in classic bpf forever.
> >
> 
> I think I disagree completely.
> 
> With classic BPF, a program creates a socket and binds a bpf program
> to it.  That program can see the ifindex, which is fine because that
> ifindex is scoped to the socket's netns, which is (unless the caller
> uses setns() or unshare()) the same as the caller's netns, 

Re: XDP offload to hypervisor

2017-01-23 Thread Michael S. Tsirkin
On Mon, Jan 23, 2017 at 05:02:02PM -0800, Alexei Starovoitov wrote:
> > Another issue is around host/guest ABI. Guest BPF could add new features
> > at any point. What if hypervisor can not support it all?  I guess we
> > could try loading program into hypervisor and run it within guest on
> > failure to load, but this ignores question of cross-version
> > compatibility - someone might start guest on a new host
> > then try to move to an old one. So we will need an option
> > "behave like an older host" such that guest can start and then
> > move to an older host later. This will likely mean
> > implementing this validation of programs in qemu userspace unless linux
> > can supply something like this. Is this (disabling some features)
> > something that might be of interest to larger bpf community?
> 
> In case of x86->nic offload not all xdp features will be supported
> by the nic and that is expected. The user will request 'offload of xdp prog'
> in some form and if it cannot be done, then xdp programs will run
> on x86 as before. Same thing, I imagine, is applicable to virtio->host
> offload. Therefore I don't see a need for user space visible
> feature negotiation.

Not userspace visible - guest visible. As guests move between hosts,
you need to make sure source host does not commit to a feature that
destination host does not support.

-- 
MST


Re: [PULL] vhost: cleanups and fixes

2017-01-23 Thread Michael S. Tsirkin
On Mon, Jan 23, 2017 at 01:50:32PM -0800, Linus Torvalds wrote:
> On Mon, Jan 23, 2017 at 7:05 AM, Michael S. Tsirkin  wrote:
> >
> > virtio, vhost: fixes, cleanups
> 
> Was there a reason why you sent this twice?
> 
> Or was this *supposed* to be the ARM DMA fix pull request? Because it wasn't.
> 
>   Linus

Ouch sorry. Was tired and forgot I already sent it.  Should have
checked.

-- 
MST


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 6:09 PM, Alexei Starovoitov
 wrote:
> On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
>> To see how cgroup+bpf interacts with network namespaces, I wrote a
>> little program called show_bind that calls getsockopt(...,
>> SO_BINDTODEVICE, ...) and prints the result.  It did this:
>>
>>  # ./ip link add dev vrf0 type vrf table 10
>>  # ./ip vrf exec vrf0 ./show_bind
>>  Default binding is "vrf0"
>>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>>  show_bind: getsockopt: No such device
>>
>> What's happening here is that "ip vrf" looks up vrf0's ifindex in
>> the init netns and installs a hook that binds sockets to that
>> ifindex.  When the hook runs in a different netns, it sets
>> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
>> incorrect behavior.  In this particular example, the ifindex was 4
>> and there was no ifindex 4 in the new netns.  If there had been,
>> this test would have malfunctioned differently
>>
>> Since it's rather late in the release cycle, let's punt.  This patch
>> makes it impossible to install cgroup+bpf hooks outside the init
>> netns and makes them not run on sockets that aren't in the init
>> netns.
>>
>> In a future release, it should be relatively straightforward to make
>> these hooks be local to a netns and, if needed, to add a flag so
>> that hooks can be made global if necessary.  Global hooks should
>> presumably be constrained so that they can't write to any ifindex
>> fields.
>>
>> Cc: Daniel Borkmann 
>> Cc: Alexei Starovoitov 
>> Cc: David Ahern 
>> Signed-off-by: Andy Lutomirski 
>> ---
>>
>> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
>> bug can be hit using current iproute2 -git.  please consider this for
>> -net.
>>
>> Changes from v1:
>>  - Fix the commit message.  'git commit' was very clever and thought that
>>all the interesting bits of the test case were intended to be comments
>>and stripped them.  Whoops!
>>
>> kernel/bpf/cgroup.c  | 21 +
>>  kernel/bpf/syscall.c | 11 +++
>>  2 files changed, 32 insertions(+)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index a515f7b007c6..a824f543de69 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>   if (!sk || !sk_fullsock(sk))
>>   return 0;
>>
>> + /*
>> +  * For now, socket bpf hooks attached to cgroups can only be
>> +  * installed in the init netns and only affect the init netns.
>> +  * This could be relaxed in the future once some semantic issues
>> +  * are resolved.  For example, ifindexes belonging to one netns
>> +  * should probably not be visible to hooks installed by programs
>> +  * running in a different netns.
>> +  */
> >> + if (sock_net(sk) != &init_net)
>> + return 0;
>
> Nack.
> Such check will create absolutely broken abi that we won't be able to fix 
> later.

Please explain how the change results in a broken ABI and how the
current ABI is better.  I gave a fully worked out example of how the
current ABI misbehaves using only standard Linux networking tools.

>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e89acea22ecf..c0bbc55e244d 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>>   struct cgroup *cgrp;
>>   enum bpf_prog_type ptype;
>>
>> + /*
>> +  * For now, socket bpf hooks attached to cgroups can only be
>> +  * installed in the init netns and only affect the init netns.
>> +  * This could be relaxed in the future once some semantic issues
>> +  * are resolved.  For example, ifindexes belonging to one netns
>> +  * should probably not be visible to hooks installed by programs
>> +  * running in a different netns.
>
> the comment is bogus and shows complete lack of understanding
> how bpf is working and how it's being used.
> See SKF_AD_IFINDEX that was in classic bpf forever.
>

I think I disagree completely.

With classic BPF, a program creates a socket and binds a bpf program
to it.  That program can see the ifindex, which is fine because that
ifindex is scoped to the socket's netns, which is (unless the caller
uses setns() or unshare()) the same as the caller's netns, so the
ifindex means exactly what the caller thinks it means.

With cgroup+bpf, the program installing the hook can be completely
unrelated to the program whose sockets are subject to the hook, and,
if they're using different namespaces, it breaks.

--Andy


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 6:31 PM, David Ahern  wrote:
> On 1/23/17 7:09 PM, Alexei Starovoitov wrote:
>>> + */
> >>> +if (current->nsproxy->net_ns != &init_net)
>>> +return -EINVAL;
>>
>> this restriction I actually don't mind, since it indeed can be
>> relaxed later, but please make it proper with net_eq()
>>
>
> I do mind. Why have different restrictions for the skb and sk filters?
>
> I have already shown that ip can alleviate the management aspects of the 
> implementation -- just like ip netns does.

I'm not sure I followed what you meant.  If I understood right (which
is a big if) you're saying that ip vrf when run in a netns works
correctly in that netns.  I agree, but I think that it would continue
to work (even more reliably) if the hooks only executed in the netns
in which they were created.  I think that would probably be a good
improvement, and it could be done on top of my patch, but I'm not
about to write that code for 4.10.

--Andy


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread David Ahern
On 1/23/17 7:09 PM, Alexei Starovoitov wrote:
>> + */
> >> +if (current->nsproxy->net_ns != &init_net)
>> +return -EINVAL;
> 
> this restriction I actually don't mind, since it indeed can be
> relaxed later, but please make it proper with net_eq()
> 

I do mind. Why have different restrictions for the skb and sk filters?

I have already shown that ip can alleviate the management aspects of the 
implementation -- just like ip netns does.


Re: [PATCH] Bluetooth: hidp: might sleep error in hidp_session_thread

2017-01-23 Thread Brian Norris
Hi Jeffy,

On Fri, Jan 20, 2017 at 09:52:08PM +0800, Jeffy Chen wrote:
> [   39.044329] do not call blocking ops when !TASK_RUNNING; state=1 set
> at [] hidp_session_thread+0x110/0x568 [hidp]
> ...
> [   40.159664] Call trace:
> [   40.162122] [] __might_sleep+0x64/0x90
> [   40.167443] [] lock_sock_nested+0x30/0x78
> [   40.173047] [] l2cap_sock_sendmsg+0x90/0xf0
> [bluetooth]
> [   40.179842] [] sock_sendmsg+0x4c/0x68
> [   40.185072] [] kernel_sendmsg+0x54/0x68
> [   40.190477] [] hidp_send_frame+0x78/0xa0 [hidp]
> [   40.196574] [] hidp_process_transmit+0x44/0x98
> [hidp]
> [   40.203191] [] hidp_session_thread+0x364/0x568
> [hidp]

Am I crazy, or are several other protocols broken like this too? I see a
similar structure in net/bluetooth/bnep/core.c and
net/bluetooth/cmtp/core.c, at least, each of which also call
kernel_sendmsg(), which might be an l2cap socket (...I think? I'm not a
bluetooth expert really).

> 
> Following (https://lwn.net/Articles/628628/).
> 
> Signed-off-by: Jeffy Chen 
> ---
> 
>  net/bluetooth/hidp/core.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
> index 0bec458..bfd3fb8 100644
> --- a/net/bluetooth/hidp/core.c
> +++ b/net/bluetooth/hidp/core.c
> @@ -1180,7 +1180,9 @@ static void hidp_session_run(struct hidp_session *session)
>   struct sock *ctrl_sk = session->ctrl_sock->sk;
>   struct sock *intr_sk = session->intr_sock->sk;
>   struct sk_buff *skb;
> + DEFINE_WAIT_FUNC(wait, woken_wake_function);
>  
> + add_wait_queue(sk_sleep(intr_sk), &wait);
>   for (;;) {
>   /*
>* This thread can be woken up two ways:
> @@ -1188,12 +1190,10 @@ static void hidp_session_run(struct hidp_session *session)
>*session->terminate flag and wakes this thread up.
>*  - Via modifying the socket state of ctrl/intr_sock. This
>*thread is woken up by ->sk_state_changed().
> -  *
> -  * Note: set_current_state() performs any necessary
> -  * memory-barriers for us.
>*/
> - set_current_state(TASK_INTERRUPTIBLE);
>  
> + /* Ensure session->terminate is updated */
> + smp_mb__before_atomic();
>   if (atomic_read(&session->terminate))
>   break;
>  
> @@ -1227,11 +1227,14 @@ static void hidp_session_run(struct hidp_session *session)
>   hidp_process_transmit(session, &session->ctrl_transmit,
> session->ctrl_sock);
>  
> - schedule();
> + wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);

I think this looks mostly good, except what about the
hidp_session_terminate() condition? In that case, you're running
wake_up_process() -- which won't set WQ_FLAG_WOKEN for us. So what
happens if we see a hidp_session_terminate() call in between the check
for the ->terminate count, but before we call wait_woken()? IIUC, I
think we'll just ignore the call and keep waiting for the next wake
signal.

I think you might need to rework hidp_session_terminate() too, to
actually target the wait queue and not just the processes.

IIUC, of course. I could be wrong...

Brian

>   }
> + remove_wait_queue(sk_sleep(intr_sk), &wait);
>  
>   atomic_inc(&session->terminate);
> - set_current_state(TASK_RUNNING);
> +
> + /* Ensure session->terminate is updated */
> + smp_mb__after_atomic();
>  }
>  
>  /*
> -- 
> 2.1.4
> 
> 


Re: [PATCH net-next] bpf, lpm: fix kfree of im_node in trie_update_elem

2017-01-23 Thread David Miller
From: Daniel Borkmann 
Date: Tue, 24 Jan 2017 01:26:46 +0100

> We need to initialize im_node to NULL, otherwise in case of error path
> it gets passed to kfree() as uninitialized pointer.
> 
> Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
> Signed-off-by: Daniel Borkmann 
> ---
>  Mentioned it in http://patchwork.ozlabs.org/patch/718070/, but
>  was probably overlooked.

Applied, thanks.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 12:36:08PM -0800, Andy Lutomirski wrote:
> To see how cgroup+bpf interacts with network namespaces, I wrote a
> little program called show_bind that calls getsockopt(...,
> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> 
>  # ./ip link add dev vrf0 type vrf table 10
>  # ./ip vrf exec vrf0 ./show_bind
>  Default binding is "vrf0"
>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>  show_bind: getsockopt: No such device
> 
> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> the init netns and installs a hook that binds sockets to that
> ifindex.  When the hook runs in a different netns, it sets
> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> incorrect behavior.  In this particular example, the ifindex was 4
> and there was no ifindex 4 in the new netns.  If there had been,
> this test would have malfunctioned differently
> 
> Since it's rather late in the release cycle, let's punt.  This patch
> makes it impossible to install cgroup+bpf hooks outside the init
> netns and makes them not run on sockets that aren't in the init
> netns.
> 
> In a future release, it should be relatively straightforward to make
> these hooks be local to a netns and, if needed, to add a flag so
> that hooks can be made global if necessary.  Global hooks should
> presumably be constrained so that they can't write to any ifindex
> fields.
> 
> Cc: Daniel Borkmann 
> Cc: Alexei Starovoitov 
> Cc: David Ahern 
> Signed-off-by: Andy Lutomirski 
> ---
> 
> DaveM, this mitigates a bug in a feature that's new in 4.10, and the
> bug can be hit using current iproute2 -git.  please consider this for
> -net.
> 
> Changes from v1:
>  - Fix the commit message.  'git commit' was very clever and thought that
>all the interesting bits of the test case were intended to be comments
>and stripped them.  Whoops!
> 
> kernel/bpf/cgroup.c  | 21 +
>  kernel/bpf/syscall.c | 11 +++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index a515f7b007c6..a824f543de69 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -143,6 +143,17 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>   if (!sk || !sk_fullsock(sk))
>   return 0;
>  
> + /*
> +  * For now, socket bpf hooks attached to cgroups can only be
> +  * installed in the init netns and only affect the init netns.
> +  * This could be relaxed in the future once some semantic issues
> +  * are resolved.  For example, ifindexes belonging to one netns
> +  * should probably not be visible to hooks installed by programs
> +  * running in a different netns.
> +  */
> + if (sock_net(sk) != &init_net)
> + return 0;

Nack.
Such check will create absolutely broken abi that we won't be able to fix later.

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e89acea22ecf..c0bbc55e244d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   struct cgroup *cgrp;
>   enum bpf_prog_type ptype;
>  
> + /*
> +  * For now, socket bpf hooks attached to cgroups can only be
> +  * installed in the init netns and only affect the init netns.
> +  * This could be relaxed in the future once some semantic issues
> +  * are resolved.  For example, ifindexes belonging to one netns
> +  * should probably not be visible to hooks installed by programs
> +  * running in a different netns.

the comment is bogus and shows complete lack of understanding
how bpf is working and how it's being used.
See SKF_AD_IFINDEX that was in classic bpf forever.

> +  */
> + if (current->nsproxy->net_ns != &init_net)
> + return -EINVAL;

this restriction I actually don't mind, since it indeed can be
relaxed later, but please make it proper with net_eq()



[PATCH] net: ks8851: Drop eeprom_size structure member

2017-01-23 Thread Stephen Boyd
After commit 51b7b1c34e19 (KSZ8851-SNL: Add ethtool support for
EEPROM via eeprom_93cx6, 2011-11-21) this structure member is
unused. Delete it.

Signed-off-by: Stephen Boyd 
---

Found while debugging an eeprom issue.

 drivers/net/ethernet/micrel/ks8851.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/net/ethernet/micrel/ks8851.c b/drivers/net/ethernet/micrel/ks8851.c
index e7e1aff40bd9..955d69a8e8d3 100644
--- a/drivers/net/ethernet/micrel/ks8851.c
+++ b/drivers/net/ethernet/micrel/ks8851.c
@@ -84,7 +84,6 @@ union ks8851_tx_hdr {
  * @rc_ier: Cached copy of KS_IER.
  * @rc_ccr: Cached copy of KS_CCR.
  * @rc_rxqcr: Cached copy of KS_RXQCR.
- * @eeprom_size: Companion eeprom size in Bytes, 0 if no eeprom
  * @eeprom: 93CX6 EEPROM state for accessing on-board EEPROM.
  * @vdd_reg:   Optional regulator supplying the chip
  * @vdd_io: Optional digital power supply for IO
@@ -120,7 +119,6 @@ struct ks8851_net {
u16 rc_ier;
u16 rc_rxqcr;
u16 rc_ccr;
-   u16 eeprom_size;
 
struct mii_if_info  mii;
struct ks8851_rxctrlrxctrl;
@@ -1533,11 +1531,6 @@ static int ks8851_probe(struct spi_device *spi)
/* cache the contents of the CCR register for EEPROM, etc. */
ks->rc_ccr = ks8851_rdreg16(ks, KS_CCR);
 
-   if (ks->rc_ccr & CCR_EEPROM)
-   ks->eeprom_size = 128;
-   else
-   ks->eeprom_size = 0;
-
ks8851_read_selftest(ks);
ks8851_init_mac(ks);
 
-- 
2.10.0.297.gf6727b0



Re: XDP offload to hypervisor

2017-01-23 Thread Alexei Starovoitov
On Mon, Jan 23, 2017 at 11:40:29PM +0200, Michael S. Tsirkin wrote:
> I've been thinking about passing XDP programs from guest to the
> hypervisor.  Basically, after getting an incoming packet, we could run
> an XDP program in host kernel.
> 
> If the result is XDP_DROP or XDP_TX we don't need to wake up the guest at all!

that's an interesting idea!
Long term 'xdp offload' needs to be defined, since NICs become smarter
and can accelerate xdp programs.
So pushing the xdp program down from virtio in the guest into host
and from x86 into nic cpu should probably be handled through the same api.

> When using tun for networking - especially with adjust_head - this
> unfortunately probably means we need to do a data copy unless there is
> enough headroom.  How much is enough though?

Frankly I don't understand the whole virtio nit picking that was happening.
imo virtio+xdp by itself is only useful for debugging, development and testing
of xdp programs in a VM. The discussion about performance of virtio+xdp
will only be meaningful when corresponding host part is done.
Likely in the form of vhost extensions and may be driver changes.
Trying to optimize virtio+xdp when host is doing traditional skb+vhost
isn't going to be impactful.
But when host can do xdp in physical NIC that can deliver raw
pages into vhost that gets picked up by guest virtio, then we hopefully
will be around 10G line rate. page pool is likely needed in such scenario.
Some new xdp action like xdp_tx_into_vhost or whatever.
And guest will be seeing full pages that host nic provided and discussion
about headroom will be automatically solved.
Arguing that skb has 64-byte headroom and therefore we need to
reduce XDP_PACKET_HEADROOM is really upside down.

> Another issue is around host/guest ABI. Guest BPF could add new features
> at any point. What if hypervisor can not support it all?  I guess we
> could try loading program into hypervisor and run it within guest on
> failure to load, but this ignores question of cross-version
> compatibility - someone might start guest on a new host
> then try to move to an old one. So we will need an option
> "behave like an older host" such that guest can start and then
> move to an older host later. This will likely mean
> implementing this validation of programs in qemu userspace unless linux
> can supply something like this. Is this (disabling some features)
> something that might be of interest to larger bpf community?

In case of x86->nic offload not all xdp features will be supported
by the nic and that is expected. The user will request 'offload of xdp prog'
in some form and if it cannot be done, then xdp programs will run
on x86 as before. Same thing, I imagine, is applicable to virtio->host
offload. Therefore I don't see a need for user space visible
feature negotiation.

> With a device such as macvtap there exist configurations where a single
> guest is in control of the device (aka passthrough mode) in that case
> there's a potential to run xdp on host before host skb is built, unless
> host already has an xdp program attached.  If it does we could run the
> program within guest, but what if a guest program got attached first?
> Maybe we should pass a flag in the packet "xdp passed on this packet in
> host". Then, guest can skip running it.  Unless we do a full reset
> there's always a potential for packets to slip through, e.g. on xdp
> program changes. Maybe a flush command is needed, or force queue or
> device reset to make sure nothing is going on. Does this make sense?

All valid questions and concerns.
Since there is still no xdp_adjust_head support in virtio,
it feels kinda early to get into detailed 'virtio offload' discussion.



[PATCH v2 net 2/2] ipv6: fix ip6_tnl_parse_tlv_enc_lim()

2017-01-23 Thread Eric Dumazet
This function suffers from multiple issues.

First one is that pskb_may_pull() may reallocate skb->head,
so the 'raw' pointer needs either to be reloaded or not used at all.

Second issue is that NEXTHDR_DEST handling does not validate
that the options are present in skb->data, so we might read
garbage or access non existent memory.

With help from Willem de Bruijn.

Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov  
Cc: Willem de Bruijn 
---
 net/ipv6/ip6_tunnel.c | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 02923f956ac8ba5f3270e8d5282fd9637c2d..ff8ee06491c335d209e86bb15f2526ab1915 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -400,18 +400,19 @@ ip6_tnl_dev_uninit(struct net_device *dev)
 
 __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
 {
-   const struct ipv6hdr *ipv6h = (const struct ipv6hdr *) raw;
-   __u8 nexthdr = ipv6h->nexthdr;
-   __u16 off = sizeof(*ipv6h);
+   const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)raw;
+   unsigned int nhoff = raw - skb->data;
+   unsigned int off = nhoff + sizeof(*ipv6h);
+   u8 next, nexthdr = ipv6h->nexthdr;
 
while (ipv6_ext_hdr(nexthdr) && nexthdr != NEXTHDR_NONE) {
-   __u16 optlen = 0;
struct ipv6_opt_hdr *hdr;
-   if (raw + off + sizeof(*hdr) > skb->data &&
-   !pskb_may_pull(skb, raw - skb->data + off + sizeof (*hdr)))
+   u16 optlen;
+
+   if (!pskb_may_pull(skb, off + sizeof(*hdr)))
break;
 
-   hdr = (struct ipv6_opt_hdr *) (raw + off);
+   hdr = (struct ipv6_opt_hdr *)(skb->data + off);
if (nexthdr == NEXTHDR_FRAGMENT) {
struct frag_hdr *frag_hdr = (struct frag_hdr *) hdr;
if (frag_hdr->frag_off)
@@ -422,20 +423,29 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
} else {
optlen = ipv6_optlen(hdr);
}
+   /* cache hdr->nexthdr, since pskb_may_pull() might
+* invalidate hdr
+*/
+   next = hdr->nexthdr;
if (nexthdr == NEXTHDR_DEST) {
-   __u16 i = off + 2;
+   u16 i = 2;
+
+   /* Remember : hdr is no longer valid at this point. */
+   if (!pskb_may_pull(skb, off + optlen))
+   break;
+
while (1) {
struct ipv6_tlv_tnl_enc_lim *tel;
 
/* No more room for encapsulation limit */
-   if (i + sizeof (*tel) > off + optlen)
+   if (i + sizeof(*tel) > optlen)
break;
 
-   tel = (struct ipv6_tlv_tnl_enc_lim *) &raw[i];
+   tel = (struct ipv6_tlv_tnl_enc_lim *) skb->data + off + i;
/* return index of option if found and valid */
if (tel->type == IPV6_TLV_TNL_ENCAP_LIMIT &&
tel->length == 1)
-   return i;
+   return i + off - nhoff;
/* else jump to next option */
if (tel->type)
i += tel->length + 2;
@@ -443,7 +453,7 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
i++;
}
}
-   nexthdr = hdr->nexthdr;
+   nexthdr = next;
off += optlen;
}
return 0;
-- 
2.11.0.483.g087da7b7c-goog



[PATCH v2 net 1/2] ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()

2017-01-23 Thread Eric Dumazet
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(),
we must reload any pointer that was related to skb->head
(or skb->data), or risk use after free.

Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: Eric Dumazet 
Cc: Dmitry Kozlov 
---
 net/ipv6/ip6_gre.c| 3 +++
 net/ipv6/ip6_tunnel.c | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 75b6108234dd05a54af0ae51c7c11eaf1ca2..558631860d91bbcb1321a4b566b6e92ddcaf 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -582,6 +582,9 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, struct net_device *dev)
return -1;
 
offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb));
+   /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head */
+   ipv6h = ipv6_hdr(skb);
+
if (offset > 0) {
struct ipv6_tlv_tnl_enc_lim *tel;
tel = (struct ipv6_tlv_tnl_enc_lim *)&skb_network_header(skb)[offset];
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 753d6d0860fb14c100ab8b20799782ab8160..02923f956ac8ba5f3270e8d5282fd9637c2d 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1303,6 +1303,8 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
fl6.flowlabel = key->label;
} else {
offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb));
+   /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head */
+   ipv6h = ipv6_hdr(skb);
if (offset > 0) {
struct ipv6_tlv_tnl_enc_lim *tel;
 
-- 
2.11.0.483.g087da7b7c-goog



[PATCH v2 net 0/2] ipv6: fix ip6_tnl_parse_tlv_enc_lim() issues

2017-01-23 Thread Eric Dumazet
First patch fixes ip6_tnl_parse_tlv_enc_lim() callers,
bug added in linux-3.7

Second patch fixes ip6_tnl_parse_tlv_enc_lim() itself,
bug predates linux-2.6.12

Based on a report from Dmitry Vyukov, thanks to KASAN.


Eric Dumazet (2):
  ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()
  ipv6: fix ip6_tnl_parse_tlv_enc_lim()

 net/ipv6/ip6_gre.c|  3 +++
 net/ipv6/ip6_tunnel.c | 36 
 2 files changed, 27 insertions(+), 12 deletions(-)

-- 
2.11.0.483.g087da7b7c-goog



linux-next: manual merge of the net-next tree with the net tree

2017-01-23 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/mpls/af_mpls.c

between commit:

  9f427a0e474a ("net: mpls: Fix multipath selection for LSR use case")

from the net tree and commit:

  27d691056bde ("mpls: Packet stats")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/mpls/af_mpls.c
index 5b77377e5a15,4dc81963af8f..
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@@ -98,10 -94,35 +94,35 @@@ bool mpls_pkt_too_big(const struct sk_b
  }
  EXPORT_SYMBOL_GPL(mpls_pkt_too_big);
  
+ void mpls_stats_inc_outucastpkts(struct net_device *dev,
+const struct sk_buff *skb)
+ {
+   struct mpls_dev *mdev;
+ 
+   if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+   mdev = mpls_dev_get(dev);
+   if (mdev)
+   MPLS_INC_STATS_LEN(mdev, skb->len,
+  tx_packets,
+  tx_bytes);
+   } else if (skb->protocol == htons(ETH_P_IP)) {
+   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+ #if IS_ENABLED(CONFIG_IPV6)
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   struct inet6_dev *in6dev = __in6_dev_get(dev);
+ 
+   if (in6dev)
+   IP6_UPD_PO_STATS(dev_net(dev), in6dev,
+IPSTATS_MIB_OUT, skb->len);
+ #endif
+   }
+ }
+ EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts);
+ 
 -static u32 mpls_multipath_hash(struct mpls_route *rt,
 - struct sk_buff *skb, bool bos)
 +static u32 mpls_multipath_hash(struct mpls_route *rt, struct sk_buff *skb)
  {
struct mpls_entry_decoded dec;
 +  unsigned int mpls_hdr_len = 0;
struct mpls_shim_hdr *hdr;
bool eli_seen = false;
int label_index;
@@@ -280,27 -308,24 +310,24 @@@ static int mpls_forward(struct sk_buff 
hdr = mpls_hdr(skb);
dec = mpls_entry_decode(hdr);
  
 -  /* Pop the label */
 -  skb_pull(skb, sizeof(*hdr));
 -  skb_reset_network_header(skb);
 -
 -  skb_orphan(skb);
 -
rt = mpls_route_input_rcu(net, dec.label);
-   if (!rt)
+   if (!rt) {
+   MPLS_INC_STATS(mdev, rx_noroute);
goto drop;
+   }
  
 -  nh = mpls_select_multipath(rt, skb, dec.bos);
 +  nh = mpls_select_multipath(rt, skb);
if (!nh)
-   goto drop;
- 
-   /* Find the output device */
-   out_dev = rcu_dereference(nh->nh_dev);
-   if (!mpls_output_possible(out_dev))
-   goto drop;
+   goto err;
  
 +  /* Pop the label */
 +  skb_pull(skb, sizeof(*hdr));
 +  skb_reset_network_header(skb);
 +
 +  skb_orphan(skb);
 +
if (skb_warn_if_lro(skb))
-   goto drop;
+   goto err;
  
skb_forward_csum(skb);
  


Re: [PATCH net-next] bpf, lpm: fix kfree of im_node in trie_update_elem

2017-01-23 Thread Alexei Starovoitov
On Tue, Jan 24, 2017 at 01:26:46AM +0100, Daniel Borkmann wrote:
> We need to initialize im_node to NULL, otherwise in case of error path
> it gets passed to kfree() as uninitialized pointer.
> 
> Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map 
> implementation")
> Signed-off-by: Daniel Borkmann 

Great catch.
Acked-by: Alexei Starovoitov 



[PATCH net-next] bpf, lpm: fix kfree of im_node in trie_update_elem

2017-01-23 Thread Daniel Borkmann
We need to initialize im_node to NULL, otherwise in case of error path
it gets passed to kfree() as uninitialized pointer.

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann 
---
 Mentioned it in http://patchwork.ozlabs.org/patch/718070/, but
 was probably overlooked.

 kernel/bpf/lpm_trie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index ba19241d..144e976 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -262,7 +262,7 @@ static int trie_update_elem(struct bpf_map *map,
void *_key, void *value, u64 flags)
 {
struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
-   struct lpm_trie_node *node, *im_node, *new_node = NULL;
+   struct lpm_trie_node *node, *im_node = NULL, *new_node = NULL;
struct lpm_trie_node __rcu **slot;
struct bpf_lpm_trie_key *key = _key;
unsigned long irq_flags;
-- 
1.9.3



Re: [PATCH cumulus-4.1.y 2/5] vxlan: don't replace fdb entry if nothing changed

2017-01-23 Thread Roopa Prabhu
On 1/23/17, 9:02 AM, Stephen Hemminger wrote:
> On Fri, 20 Jan 2017 23:40:07 -0800
> Roopa Prabhu  wrote:
>
>> +if (!vxlan_addr_equal(&rd->remote_ip, ip) ||
>> +rd->remote_port != port ||
>> +rd->remote_vni != vni ||
>> +rd->remote_ifindex != ifindex) {
>> +dst_cache_reset(&rd->dst_cache);
>> +rd->remote_ip = *ip;
>> +rd->remote_port = port;
>> +rd->remote_vni = vni;
>> +rd->remote_ifindex = ifindex;
>> +return 1;
>> +}
>> +
> I think it would be clearer if negative logic was avoided.
>
>   if (vxlan_addr_equal(&rd->remote_ip, ip) &&
>   rd->remote_port == port &&
>   rd->remote_vni == vni &&
> rd->remote_ifindex == ifindex)
>   return 1;
>
>   dst_cache_reset ...

ack, this was an accidental hit on send as well.
It is on my upstream patch stack, but I think this patch is not really needed
upstream because a previous call to vxlan_fdb_find_rdst in vxlan_fdb_replace
does the same thing.

I will test again and repost if needed, thanks.


Re: [RFC PATCH] mlx5: Fix page rfcnt issue

2017-01-23 Thread David Miller
From: Tom Herbert 
Date: Mon, 23 Jan 2017 15:56:58 -0800

> One other potential problem in the driver is the use of put_page in
> release pages.  Comparing how the allocation is done in other drivers
> (for instance comparing to ixgbe) some seem to use __free_pages instead.
> I don't know which is correct to use, but somehow it doesn't seem like
> they can both be right.

The only difference I can see is that put_page() does a
__page_cache_release() which shouldn't be necessary for driver RX
pages, so it could be unnecessary overhead.


Re: [PATCH net-next 1/2] vxlan: don't flush static fdb entries on admin down

2017-01-23 Thread Roopa Prabhu
On 1/23/17, 1:07 PM, David Miller wrote:
> From: Roopa Prabhu 
> Date: Fri, 20 Jan 2017 23:43:18 -0800
>
>>  /* Purge the forwarding table */
>> -static void vxlan_flush(struct vxlan_dev *vxlan)
>> +static void vxlan_flush(struct vxlan_dev *vxlan, int do_all)
> Please use 'bool' and true/false for this new argument.
>
> FWIW, I am fine with changing the default behavior in this way.

ack, will fix and post a v2, thanks.


Re: [PATCH cumulus-4.1.y 1/5] vxlan: flush fdb entries on oper down

2017-01-23 Thread Roopa Prabhu
On 1/23/17, 8:59 AM, Stephen Hemminger wrote:
> On Fri, 20 Jan 2017 23:40:06 -0800
> Roopa Prabhu  wrote:
>
>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>> index 19b1653..15b1c23 100644
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
>> @@ -3276,6 +3276,12 @@ static int vxlan_netdevice_event(struct 
>> notifier_block *unused,
>>  vxlan_handle_lowerdev_unregister(vn, dev);
>>  else if (event == NETDEV_UDP_TUNNEL_PUSH_INFO)
>>  vxlan_push_rx_ports(dev);
>> +else if (event == NETDEV_CHANGE) {
>> +if (dev->netdev_ops == &vxlan_netdev_ops) {
>> +if (netif_running(dev) && !netif_oper_up(dev))
>> +vxlan_flush(netdev_priv(dev));
>> +}
>> +}
> Looks correct.
> Maybe logic would be clearer with a switch() statement here.
ack. This was an internal patch I accidentally sent. I have refined it a bit
for upstream since then.
Will post an update.

thanks




Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support

2017-01-23 Thread Roopa Prabhu
On 1/23/17, 9:03 AM, Or Gerlitz wrote:
> On Mon, Jan 23, 2017 at 6:13 PM, Roopa Prabhu 
> wrote:
>
>> Also, the goal is to reduce the number of vxlan devices from say 4k to 1.
>> I don't think replacing it with 8k (egress + ingress) rules is going in the
>> right direction.
>>
> Can't you take advantage of the shared vxlan device configuration
> introduced throughout the LWT work such that you have single device dealing
> with many tunnels? why?
>
I tried to cover this in my initial paragraph in the cover letter:
"lwt and dst_metadata/collect_metadata have enabled vxlan l3 deployments to use 
a 'single vxlan
netdev for multiple vnis' eliminating the scalability problem with using a 
'single vxlan netdev per vni'.
This series tries to do the same for vxlan netdevs in pure l2 bridged networks. 
Use-case/deployment and
details are below." There is more in the cover letter on this.

There is no route pointing to the vxlan device here. The vxlan device is a
bridge port, and it bridges local host ports to remote vxlan tunnels,
vlan-to-vxlan.


[PATCH net-next 5/5] bpf: enable verifier to better track const alu ops

2017-01-23 Thread Daniel Borkmann
William reported couple of issues in relation to direct packet
access. Typical scheme is to check for data + [off] <= data_end,
where [off] can be either immediate or coming from a tracked
register that contains an immediate, depending on the branch, we
can then access the data. However, in case of calculating [off]
for either the mentioned test itself or for access after the test
in a more "complex" way, then the verifier will stop tracking the
CONST_IMM marked register and will mark it as UNKNOWN_VALUE one.

Adding that UNKNOWN_VALUE typed register to a pkt() marked
register, the verifier then bails out in check_packet_ptr_add()
as it finds the registers imm value below 48. In the first below
example, that is due to evaluate_reg_imm_alu() not handling right
shifts and thus marking the register as UNKNOWN_VALUE via helper
__mark_reg_unknown_value() that resets imm to 0.

In the second case the same happens at the time when r4 is set
to r4 &= r5, where it transitions to UNKNOWN_VALUE from
evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside
evaluate_reg_alu(), where the register's imm turns into 3. That
is, for registers with type UNKNOWN_VALUE, imm of 0 means that
we don't know what value the register has, and for imm > 0 it
means that the value has [imm] upper zero bits. F.e. when shifting
an UNKNOWN_VALUE register by 3 to the right, no matter what value
it had, we know that the 3 upper most bits must be zero now.
This is to make sure that ALU operations with unknown registers
don't overflow. Meaning, once we know that we have more than 48
upper zero bits, or, in other words cannot go beyond 0xffff offset
with ALU ops, such an addition will track the target register
as a new pkt() register with a new id, but 0 offset and 0 range,
so for that a new data/data_end test will be required. If the source
register is a CONST_IMM one that is to be added to the pkt() register,
or the source instruction is an add instruction with an immediate
value, then it will get added if it stays within max 0xffff bounds.
From there, pkt() type can be accessed should reg->off + imm be
within the access range of pkt().

  [...]
  from 28 to 30: R0=imm1,min_value=1,max_value=1
R1=pkt(id=0,off=0,r=22) R2=pkt_end
R3=imm144,min_value=144,max_value=144
R4=imm0,min_value=0,max_value=0
R5=inv48,min_value=2054,max_value=2054 R10=fp
  30: (bf) r5 = r3
  31: (07) r5 += 23
  32: (77) r5 >>= 3
  33: (bf) r6 = r1
  34: (0f) r6 += r5
  cannot add integer value with 0 upper zero bits to ptr_to_packet

  [...]
  from 52 to 80: R0=imm1,min_value=1,max_value=1
R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
R4=imm272 R5=inv56,min_value=17,max_value=17
R6=pkt(id=0,off=26,r=34) R10=fp
  80: (07) r4 += 71
  81: (18) r5 = 0xfff8
  83: (5f) r4 &= r5
  84: (77) r4 >>= 3
  85: (0f) r1 += r4
  cannot add integer value with 3 upper zero bits to ptr_to_packet

Thus to get above use-cases working, evaluate_reg_imm_alu() has
been extended for further ALU ops. This is fine, because we only
operate strictly within realm of CONST_IMM types, so here we don't
care about overflows as they will happen in the simulated but also
real execution and interaction with pkt() in check_packet_ptr_add()
will check actual imm value once added to pkt(), but it's irrelevant
before.

With regards to 06c1c049721a ("bpf: allow helpers access to variable
memory") that works on UNKNOWN_VALUE registers, the verifier becomes
now a bit smarter as it can better resolve ALU ops, so we need to
adapt two test cases there, as min/max bound tracking only becomes
necessary when registers were spilled to stack. So while mask was
set before to track upper bound for UNKNOWN_VALUE case, it's now
resolved directly as CONST_IMM, and such constructs are only necessary
when f.e. registers are spilled.

For commit 6b17387307ba ("bpf: recognize 64bit immediate loads as
consts") that initially enabled dw load tracking only for nfp jit/
analyzer, I did couple of tests on large, complex programs and we
don't increase complexity badly (my tests were in ~3% range on avg).
I've added a couple of tests similar to affected code above, and
it works fine with verifier now.

Reported-by: William Tu 
Signed-off-by: Daniel Borkmann 
Cc: Gianluca Borello 
Cc: William Tu 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 64 +++---
 tools/testing/selftests/bpf/test_verifier.c | 82 +
 2 files changed, 127 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8f69df7..fb3513b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1566,22 +1566,54 @@ static int evaluate_reg_imm_alu(struct bpf_verifier_env *env,
struct bpf_reg_state *dst_reg = &regs[insn->dst_reg];
struct bpf_reg_state *src_reg = &regs[insn->src_reg];
u8 opcode = BPF_OP(insn->code);

[PATCH net-next 1/5] bpf: simplify __is_valid_access test on cb

2017-01-23 Thread Daniel Borkmann
The __is_valid_access() test for cb[] from 62c7989b24db ("bpf: allow
b/h/w/dw access for bpf's cb in ctx") was done unnecessarily complex,
we can just simplify it the same way as recent fix from 2d071c643f1c
("bpf, trace: make ctx access checks more robust") did. Overflow can
never happen as size is 1/2/4/8 depending on access.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9038386..883975f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2784,19 +2784,8 @@ static bool __is_valid_access(int off, int size)
switch (off) {
case offsetof(struct __sk_buff, cb[0]) ...
 offsetof(struct __sk_buff, cb[4]) + sizeof(__u32) - 1:
-   if (size == sizeof(__u16) &&
-   off > offsetof(struct __sk_buff, cb[4]) + sizeof(__u16))
-   return false;
-   if (size == sizeof(__u32) &&
-   off > offsetof(struct __sk_buff, cb[4]))
-   return false;
-   if (size == sizeof(__u64) &&
-   off > offsetof(struct __sk_buff, cb[2]))
-   return false;
-   if (size != sizeof(__u8)  &&
-   size != sizeof(__u16) &&
-   size != sizeof(__u32) &&
-   size != sizeof(__u64))
+   if (off + size >
+   offsetof(struct __sk_buff, cb[4]) + sizeof(__u32))
return false;
break;
default:
-- 
1.9.3



[PATCH net-next 0/5] Misc BPF improvements

2017-01-23 Thread Daniel Borkmann
This series adds various misc improvements to BPF, f.e. allowing
skb_load_bytes() helper to be used with filter/reuseport programs
to facilitate programming, test cases for program tag, etc. For
details, please see individual patches.

Thanks!

Daniel Borkmann (5):
  bpf: simplify __is_valid_access test on cb
  bpf: enable load bytes helper for filter/reuseport progs
  bpf: allow option for setting bpf_l4_csum_replace from scratch
  bpf: add prog tag test case to bpf selftests
  bpf: enable verifier to better track const alu ops

 include/uapi/linux/bpf.h|   1 +
 kernel/bpf/verifier.c   |  64 ++---
 net/core/filter.c   |  63 -
 tools/testing/selftests/bpf/Makefile|   4 +-
 tools/testing/selftests/bpf/test_tag.c  | 202 
 tools/testing/selftests/bpf/test_verifier.c |  82 +++
 6 files changed, 364 insertions(+), 52 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tag.c

-- 
1.9.3



[PATCH net-next 4/5] bpf: add prog tag test case to bpf selftests

2017-01-23 Thread Daniel Borkmann
Add the test case used to compare the results from fdinfo with
af_alg's output on the tag. Tests are from min to max sized
programs, with and without maps included.

  # ./test_tag
  test_tag: OK (40945 tests)

Tested on x86_64 and s390x.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_tag.c | 202 +
 2 files changed, 204 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tag.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 064a3e5..769a6cb 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map test_lpm_map
+test_objs = test_verifier test_tag test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
+TEST_PROGS := $(test_objs) test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_tag.c b/tools/testing/selftests/bpf/test_tag.c
new file mode 100644
index 000..6ab4793
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tag.c
@@ -0,0 +1,202 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "../../../include/linux/filter.h"
+
+#include "bpf_sys.h"
+
+static struct bpf_insn prog[BPF_MAXINSNS];
+
+static void bpf_gen_imm_prog(unsigned int insns, int fd_map)
+{
+   int i;
+
+   srand(time(NULL));
+   for (i = 0; i < insns; i++)
+   prog[i] = BPF_ALU64_IMM(BPF_MOV, i % BPF_REG_10, rand());
+   prog[i - 1] = BPF_EXIT_INSN();
+}
+
+static void bpf_gen_map_prog(unsigned int insns, int fd_map)
+{
+   int i, j = 0;
+
+   for (i = 0; i + 1 < insns; i += 2) {
+   struct bpf_insn tmp[] = {
+   BPF_LD_MAP_FD(j++ % BPF_REG_10, fd_map)
+   };
+
+   memcpy(&prog[i], tmp, sizeof(tmp));
+   }
+   if (insns % 2 == 0)
+   prog[insns - 2] = BPF_ALU64_IMM(BPF_MOV, i % BPF_REG_10, 42);
+   prog[insns - 1] = BPF_EXIT_INSN();
+}
+
+static int bpf_try_load_prog(int insns, int fd_map,
+void (*bpf_filler)(unsigned int insns,
+   int fd_map))
+{
+   int fd_prog;
+
+   bpf_filler(insns, fd_map);
+   fd_prog = bpf_prog_load(BPF_PROG_TYPE_SCHED_CLS, prog, insns *
+   sizeof(struct bpf_insn), "", NULL, 0);
+   assert(fd_prog > 0);
+   if (fd_map > 0)
+   bpf_filler(insns, 0);
+   return fd_prog;
+}
+
+static int __hex2bin(char ch)
+{
+   if ((ch >= '0') && (ch <= '9'))
+   return ch - '0';
+   ch = tolower(ch);
+   if ((ch >= 'a') && (ch <= 'f'))
+   return ch - 'a' + 10;
+   return -1;
+}
+
+static int hex2bin(uint8_t *dst, const char *src, size_t count)
+{
+   while (count--) {
+   int hi = __hex2bin(*src++);
+   int lo = __hex2bin(*src++);
+
+   if ((hi < 0) || (lo < 0))
+   return -1;
+   *dst++ = (hi << 4) | lo;
+   }
+   return 0;
+}
+
+static void tag_from_fdinfo(int fd_prog, uint8_t *tag, uint32_t len)
+{
+   const int prefix_len = sizeof("prog_tag:\t") - 1;
+   char buff[256];
+   int ret = -1;
+   FILE *fp;
+
+   snprintf(buff, sizeof(buff), "/proc/%d/fdinfo/%d", getpid(),
+fd_prog);
+   fp = fopen(buff, "r");
+   assert(fp);
+
+   while (fgets(buff, sizeof(buff), fp)) {
+   if (strncmp(buff, "prog_tag:\t", len))
+   continue;
+   ret = hex2bin(tag, buff + prefix_len, len);
+   break;
+   }
+
+   fclose(fp);
+   assert(!ret);
+}
+
+static void tag_from_alg(int insns, uint8_t *tag, uint32_t len)
+{
+   static const struct sockaddr_alg alg = {
+   .salg_family= AF_ALG,
+   .salg_type  = "hash",
+   .salg_name  = "sha1",
+   };
+   int fd_base, fd_alg, ret;
+   ssize_t size;
+
+   fd_base = socket(AF_ALG, SOCK_SEQPACKET, 0);
+   assert(fd_base > 0);
+
+   ret = bind(fd_base, (struct sockaddr *)&alg, sizeof(alg));
+   assert(!ret);
+
+   fd_alg = accept(fd_base, NULL, 0);
+   assert(fd_alg > 0);
+
+   insns *= sizeof(struct bpf_insn);
+   size = write(fd_alg, prog, insns);
+   assert(size == insns);
+
+   size = read(fd_alg, tag, len);
+   assert(size == len);
+
+   close(fd_alg);
+   close(fd_base);
+}
+
+static void tag_dump(const char *prefix, uint8_t *tag, uint32_t len)
+{
+   int i;
+
+   

[PATCH net-next 2/5] bpf: enable load bytes helper for filter/reuseport progs

2017-01-23 Thread Daniel Borkmann
BPF_PROG_TYPE_SOCKET_FILTER are used in various facilities such as
for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm,
etc, and yet the only facility they can use is BPF_LD with {BPF_ABS,
BPF_IND} for single byte/half/word access.

Direct packet access is only restricted to tc programs right now,
but we can still facilitate usage by allowing skb_load_bytes() helper
added back then in 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper")
that calls skb_header_pointer() similarly to bpf_load_pointer(), but
for stack buffers with larger access size.

Name the previous sk_filter_func_proto() as bpf_base_func_proto()
since this is used everywhere else as well, similarly for the ctx
converter, that is, bpf_convert_ctx_access().

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 883975f..e2263da 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2598,7 +2598,7 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 };
 
 static const struct bpf_func_proto *
-sk_filter_func_proto(enum bpf_func_id func_id)
+bpf_base_func_proto(enum bpf_func_id func_id)
 {
switch (func_id) {
case BPF_FUNC_map_lookup_elem:
@@ -2626,6 +2626,17 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 }
 
 static const struct bpf_func_proto *
+sk_filter_func_proto(enum bpf_func_id func_id)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return _skb_load_bytes_proto;
+   default:
+   return bpf_base_func_proto(func_id);
+   }
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id)
 {
switch (func_id) {
@@ -2680,7 +2691,7 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
case BPF_FUNC_skb_under_cgroup:
return _skb_under_cgroup_proto;
default:
-   return sk_filter_func_proto(func_id);
+   return bpf_base_func_proto(func_id);
}
 }
 
@@ -2695,7 +2706,7 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
case BPF_FUNC_xdp_adjust_head:
return _xdp_adjust_head_proto;
default:
-   return sk_filter_func_proto(func_id);
+   return bpf_base_func_proto(func_id);
}
 }
 
@@ -2706,7 +2717,7 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
case BPF_FUNC_skb_load_bytes:
return _skb_load_bytes_proto;
default:
-   return sk_filter_func_proto(func_id);
+   return bpf_base_func_proto(func_id);
}
 }
 
@@ -2733,7 +2744,7 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
case BPF_FUNC_skb_under_cgroup:
return _skb_under_cgroup_proto;
default:
-   return sk_filter_func_proto(func_id);
+   return bpf_base_func_proto(func_id);
}
 }
 
@@ -2983,10 +2994,10 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static u32 sk_filter_convert_ctx_access(enum bpf_access_type type,
-   const struct bpf_insn *si,
-   struct bpf_insn *insn_buf,
-   struct bpf_prog *prog)
+static u32 bpf_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
 {
struct bpf_insn *insn = insn_buf;
int off;
@@ -3210,7 +3221,7 @@ static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type,
  offsetof(struct net_device, ifindex));
break;
default:
-   return sk_filter_convert_ctx_access(type, si, insn_buf, prog);
+   return bpf_convert_ctx_access(type, si, insn_buf, prog);
}
 
return insn - insn_buf;
@@ -3242,7 +3253,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 static const struct bpf_verifier_ops sk_filter_ops = {
.get_func_proto = sk_filter_func_proto,
.is_valid_access= sk_filter_is_valid_access,
-   .convert_ctx_access = sk_filter_convert_ctx_access,
+   .convert_ctx_access = bpf_convert_ctx_access,
 };
 
 static const struct bpf_verifier_ops tc_cls_act_ops = {
@@ -3261,24 +3272,24 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 static const struct bpf_verifier_ops cg_skb_ops = {
.get_func_proto = cg_skb_func_proto,
.is_valid_access= sk_filter_is_valid_access,
-   

Re: [PATCH net-next 2/2] ipv6: fix ip6_tnl_parse_tlv_enc_lim()

2017-01-23 Thread Eric Dumazet
On Mon, Jan 23, 2017 at 3:58 PM, Eric Dumazet  wrote:
> This function suffers from multiple issues.
>
> First one is that pskb_may_pull() may reallocate skb->head,
> so the 'raw' pointer needs either to be reloaded or not used at all.
>
> Second issue is that NEXTHDR_DEST handling does not validate
> that the options are present in skb->data, so we might read
> garbage or access non existent memory.
>
> With help from Willem de Bruijn.

Hmm, I've added a bug. Will send a V2, sorry for this.


Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support

2017-01-23 Thread Roopa Prabhu
On 1/23/17, 8:24 AM, Jiri Benc wrote:
> On Mon, 23 Jan 2017 08:13:30 -0800, Roopa Prabhu wrote:
>> And, a 'vlan-to-tunid' mapping is a very common configuration in L2 ethernet 
>> vpn configurations.
> You have one particular and narrow use case in mind and are proposing a
> rather large patchset to add support for that (and only that) single
> use case, while we already have a generic mechanism in place to address
> this and many similar (and dissimilar, too) use cases. That doesn't
> sound right.
Let me clarify:
The generic mechanism you are talking about is the dst_metadata infra. Any
subsystem can use it.
The tc vlan and dst_metadata wrapper/filter provide a creative way to use it
inside the tc subsystem, and they are very useful for people using tc
all-around.
What I am proposing here is hooks in the bridge to use the dst_metadata for
pure L2 networks that use the bridge driver. This is similar to how we have
lwt plugged into the L3 (routing) code.
If you are using the bridge driver for vlan config and filtering, I don't see
why one has to duplicate vlan config using tc. It's painful trying to deploy
l2 networks with vlan config spanning multiple subsystems and apis.

Regarding the patch-set size, let me give you a breakdown:
If I used tc for passing dst_metadata (assume 4k vlans that are participating
in l2 ethernet vpn):
(a) configure bridging/vlan filtering using bridge driver (4k vlans)
(b) configure tc rules to map vlans to tunnel-id (Additional patch to tc to 
only allow tunnel-id in dst_metadata: ingress + egress = 8k tc rules)
(c) vxlan driver patch to make it bridge friendly (my patch in this series is
required regardless of whether I use tc or the bridge driver for dst_metadata,
because the vxlan driver learns and needs to carry the forwarding information
database)
(d) ethernet vpn controller (quagga bgp) looks at 'bridge api + vxlan api + tc 
filtering rules'
   

My current series:
(a) configure bridging/vlan filtering using bridge driver (4k vlans with tunnel 
info)
(b) vxlan driver patch to make it bridge friendly (my patch in this series is
required regardless of whether I use tc or the bridge driver for dst_metadata,
because the vxlan driver learns and needs to carry the forwarding information
database)
(c) ethernet vpn controller (quagga bgp) looks at 'bridge api + vxlan api'


And btw, most of the functions that I am adding in the bridge driver are
related to vlan range handling. Vlan range code is tricky, and I am trying to
also support vlan-tunnelid mapping in ranges; I have tried to rewrite my own
vlan range code (added long back) to include tunnel info. The rest is just
use of the dst_metadata infra to store and use dst_metadata per vlan.


>
> If the current generic mechanisms have bottlenecks for your use case,
> let's work on removing those bottlenecks. That way, everybody benefits,
> not just a single use case.
For people using all tc, the tc wrapper for dst_metadata is a good fit.
I see my series as still using the generic 'dst_metadata' mechanism/infra for a 
newer use case.
like i say above, I see this similar to how we have plugged dst_metadata into 
the L3 (routing) code.
This does it in the bridging code (for L2 networks).

Thanks,
Roopa




[PATCH net-next 2/2] ipv6: fix ip6_tnl_parse_tlv_enc_lim()

2017-01-23 Thread Eric Dumazet
This function suffers from multiple issues.

First one is that pskb_may_pull() may reallocate skb->head,
so the 'raw' pointer needs either to be reloaded or not used at all.

Second issue is that NEXTHDR_DEST handling does not validate
that the options are present in skb->data, so we might read
garbage or access non existent memory.

With help from Willem de Bruijn.

Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
Cc: Willem de Bruijn 
---
 net/ipv6/ip6_tunnel.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 02923f956ac8ba5f3270e8d5282fd9637c2d..f9c43c9043f7e386108bca2f2aeb866cfb76 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -400,18 +400,18 @@ ip6_tnl_dev_uninit(struct net_device *dev)
 
 __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
 {
-   const struct ipv6hdr *ipv6h = (const struct ipv6hdr *) raw;
-   __u8 nexthdr = ipv6h->nexthdr;
-   __u16 off = sizeof(*ipv6h);
+   const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)raw;
+   unsigned int nhoff = (raw - skb->data) + sizeof(*ipv6h);
+   u8 next, nexthdr = ipv6h->nexthdr;
 
while (ipv6_ext_hdr(nexthdr) && nexthdr != NEXTHDR_NONE) {
-   __u16 optlen = 0;
struct ipv6_opt_hdr *hdr;
-   if (raw + off + sizeof(*hdr) > skb->data &&
-   !pskb_may_pull(skb, raw - skb->data + off + sizeof (*hdr)))
+   u16 optlen;
+
+   if (!pskb_may_pull(skb, nhoff + sizeof(*hdr)))
break;
 
-   hdr = (struct ipv6_opt_hdr *) (raw + off);
+   hdr = (struct ipv6_opt_hdr *)(skb->data + nhoff);
if (nexthdr == NEXTHDR_FRAGMENT) {
struct frag_hdr *frag_hdr = (struct frag_hdr *) hdr;
if (frag_hdr->frag_off)
@@ -422,13 +422,23 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
} else {
optlen = ipv6_optlen(hdr);
}
+   /* cache hdr->nexthdr, since pskb_may_pull() might
+* invalidate hdr
+*/
+   next = hdr->nexthdr;
if (nexthdr == NEXTHDR_DEST) {
-   __u16 i = off + 2;
+   u16 i = 2;
+
+   /* Remember : hdr is no longer valid at this point. */
+   if (!pskb_may_pull(skb, nhoff + optlen))
+   break;
+
+   raw = skb->data + nhoff;
while (1) {
struct ipv6_tlv_tnl_enc_lim *tel;
 
/* No more room for encapsulation limit */
-   if (i + sizeof (*tel) > off + optlen)
+   if (i + sizeof(*tel) > optlen)
break;
 
				tel = (struct ipv6_tlv_tnl_enc_lim *) &raw[i];
@@ -443,8 +453,8 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 
*raw)
i++;
}
}
-   nexthdr = hdr->nexthdr;
-   off += optlen;
+   nexthdr = next;
+   nhoff += optlen;
}
return 0;
 }
-- 
2.11.0.483.g087da7b7c-goog



[PATCH net 0/2] ipv6: fix ip6_tnl_parse_tlv_enc_lim() issues

2017-01-23 Thread Eric Dumazet
First patch fixes ip6_tnl_parse_tlv_enc_lim() callers,
bug added in linux-3.7

Second patch fixes ip6_tnl_parse_tlv_enc_lim() itself,
bug predates linux-2.6.12

Based on a report from Dmitry Vyukov, thanks to KASAN.

Eric Dumazet (2):
  ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()
  ipv6: fix ip6_tnl_parse_tlv_enc_lim()

 net/ipv6/ip6_gre.c|  3 +++
 net/ipv6/ip6_tunnel.c | 34 +++---
 2 files changed, 26 insertions(+), 11 deletions(-)

-- 
2.11.0.483.g087da7b7c-goog



[PATCH net-next 1/2] ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()

2017-01-23 Thread Eric Dumazet
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(),
we must reload any pointer that was related to skb->head
(or skb->data), or risk use after free.

Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: Eric Dumazet 
Cc: Dmitry Kozlov 
---
 net/ipv6/ip6_gre.c| 3 +++
 net/ipv6/ip6_tunnel.c | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 
75b6108234dd05a54af0ae51c7c11eaf1ca2..558631860d91bbcb1321a4b566b6e92ddcaf 
100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -582,6 +582,9 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, 
struct net_device *dev)
return -1;
 
offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb));
+   /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head */
+   ipv6h = ipv6_hdr(skb);
+
if (offset > 0) {
struct ipv6_tlv_tnl_enc_lim *tel;
		tel = (struct ipv6_tlv_tnl_enc_lim *)&skb_network_header(skb)[offset];
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 
753d6d0860fb14c100ab8b20799782ab8160..02923f956ac8ba5f3270e8d5282fd9637c2d 
100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1303,6 +1303,8 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
fl6.flowlabel = key->label;
} else {
offset = ip6_tnl_parse_tlv_enc_lim(skb, 
skb_network_header(skb));
+   /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head 
*/
+   ipv6h = ipv6_hdr(skb);
if (offset > 0) {
struct ipv6_tlv_tnl_enc_lim *tel;
 
-- 
2.11.0.483.g087da7b7c-goog



Re: [patch net-next 1/4] net: Introduce psample, a new genetlink channel for packet sampling

2017-01-23 Thread Stephen Hemminger
On Sun, 22 Jan 2017 12:44:44 +0100
Jiri Pirko  wrote:

> +static LIST_HEAD(psample_groups_list);
> +static DEFINE_SPINLOCK(psample_groups_lock);
> +

Why not a mutex? You aren't acquiring this in IRQ context?


[RFC PATCH] mlx5: Fix page refcnt issue

2017-01-23 Thread Tom Herbert
This patch is an FYI about possible issues in mlx5.

There are two issues we discovered in the mlx5 backport from 4.9 to
4.6...

The bad behaviour we were seeing was refcnts going negative and
eventually killing hosts. We've only seen this running a real
application work load and it does take a while to trigger. The root
cause is unclear, however this patch seems to circumvent the issue. I
have not been able to reproduce this issue upstream because I haven't
been able to get the app running cleanly. At this point I cannot verify
if it is an upstream issue or not.

Specific issues that I have identified are:

1) In mlx5e_free_rx_mpwqe we are seeing cases where wi->skbs_frags[i] >
   pg_strides so that a negative refcnt is being subtracted. The warning
   at en_rx.c:409 of this patch does trigger.
2) Checksum faults are occurring. I have previously described this
   problem.

As for the checksum faults, they seem to be an unrelated issue and I do
believe that was an upstream issue in 4.9, but it may have been fixed in
4.10. This is easier to reproduce than the page refcnt issue-- just a
few concurrent netperf TCP_STREAMs was able to trigger the faults.

One other potential problem in the driver is the use of put_page when
releasing pages. Comparing how the allocation is done in other drivers
(for instance comparing to ixgbe) some seem to use __free_pages instead.
I don't know which is correct to use, but somehow it doesn't seem like
they can both be right.

This patch:
1) We no longer do the page_ref_sub/page_ref_add; instead we take the
   more traditional approach of just doing get_page when giving the
   page to an skb.
2) Add warnings to catch cases where operations on page refcnts are
   nonsensical
3) Comment out checksum-complete processing in order to squelch the
   checksum faults.

---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 20f116f..b24c2b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -209,6 +209,7 @@ static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
}
 
if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+   WARN_ON(page_ref_count(cache->page_cache[cache->head].page) <= 0);
rq->stats.cache_busy++;
return false;
}
@@ -292,6 +293,13 @@ static inline void mlx5e_add_skb_frag_mpwqe(struct 
mlx5e_rq *rq,
wi->umr.dma_info[page_idx].addr + frag_offset,
len, DMA_FROM_DEVICE);
wi->skbs_frags[page_idx]++;
+   WARN_ON(wi->skbs_frags[page_idx] > mlx5e_mpwqe_strides_per_page(rq));
+
+   /* Take a page reference every time we give page to skb (alternative
+* to original mlx code).
+*/
+   get_page(wi->umr.dma_info[page_idx].page);
+
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
wi->umr.dma_info[page_idx].page, frag_offset,
len, truesize);
@@ -372,7 +380,6 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
if (unlikely(err))
goto err_unmap;
wi->umr.mtt[i] = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
-   page_ref_add(dma_info->page, pg_strides);
wi->skbs_frags[i] = 0;
}
 
@@ -385,7 +392,6 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
while (--i >= 0) {
		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
 
-   page_ref_sub(dma_info->page, pg_strides);
mlx5e_page_release(rq, dma_info, true);
}
 
@@ -400,7 +406,7 @@ void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct 
mlx5e_mpw_info *wi)
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
 
-   page_ref_sub(dma_info->page, pg_strides - wi->skbs_frags[i]);
+   WARN_ON(pg_strides - wi->skbs_frags[i] < 0);
mlx5e_page_release(rq, dma_info, true);
}
 }
@@ -565,12 +571,17 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
return;
}
 
+#if 0
+   /* Disabled since we are seeing checksum faults occurring. This should
+* not have any noticeable impact (in the short term).
+*/
if (is_first_ethertype_ip(skb)) {
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
rq->stats.csum_complete++;
return;
}
+#endif
 
if (likely((cqe->hds_ip_ext & CQE_L3_OK) &&
   (cqe->hds_ip_ext & CQE_L4_OK))) {
@@ -929,6 +940,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, 

Re: sk_buff and reference counting netdev pointers

2017-01-23 Thread Cong Wang
On Mon, Jan 23, 2017 at 2:37 PM, Joel Cunningham  wrote:
> Hi,
>
> I’m working on a research effort to understand the synchronization mechanisms 
> for accessing and modifying a struct net_device object.  One area that isn’t 
> clear is the net device pointer (dev) stored in a struct sk_buff.  From my 
> investigation, the pointer appears to be assigned without increasing the 
> struct net_device’s reference count (example __netdev_alloc_skb doesn’t call 
> dev_hold) and also when the sk_buff is freed (kfree_skb) no call to dev_put() 
> is made.  This seems to leave a possibility of an skb referencing a stale net 
> device unless something is cleaning up all the skbs during 
> unregister_netdevice() (which waits for all outstanding references to be 
> released).  Any insight in understanding how this is working would be 
> appreciated!
>

This is a very common question.

synchronize_net() is supposed to wait for in-flight packets, since
we acquire the RCU read lock on both the TX and RX paths.


sk_buff and reference counting netdev pointers

2017-01-23 Thread Joel Cunningham
Hi,

I’m working on a research effort to understand the synchronization mechanisms 
for accessing and modifying a struct net_device object.  One area that isn’t 
clear is the net device pointer (dev) stored in a struct sk_buff.  From my 
investigation, the pointer appears to be assigned without increasing the struct 
net_device’s reference count (example __netdev_alloc_skb doesn’t call dev_hold) 
and also when the sk_buff is freed (kfree_skb) no call to dev_put() is made.  
This seems to leave a possibility of an skb referencing a stale net device 
unless something is cleaning up all the skbs during unregister_netdevice() 
(which waits for all outstanding references to be released).  Any insight in 
understanding how this is working would be appreciated!

Joel

Re: [PATCH 2/3] ath10k: use dma_zalloc_coherent()

2017-01-23 Thread Joe Perches
On Mon, 2017-01-23 at 15:04 +, Srinivas Kandagatla wrote:
> use dma_zalloc_coherent() instead of dma_alloc_coherent and memset().
[]
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c 
> b/drivers/net/wireless/ath/ath10k/pci.c
[]
> @@ -896,7 +896,7 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, 
> u32 address, void *data,
>*/
>   alloc_nbytes = min_t(unsigned int, nbytes, DIAG_TRANSFER_LIMIT);
>  
> - data_buf = (unsigned char *)dma_alloc_coherent(ar->dev,
> + data_buf = (unsigned char *)dma_zalloc_coherent(ar->dev,
>  alloc_nbytes,
>   &ce_data_base,
>  GFP_ATOMIC);

trivia:

Nicer to realign arguments and remove the unnecessary cast.

Perhaps:

	data_buf = dma_zalloc_coherent(ar->dev, alloc_nbytes, &ce_data_base,
				       GFP_ATOMIC);




Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 02:57:31PM -0800, Wei Wang wrote:
> Yes. That seems to be a valid fix to it.
> Let me try it with my existing test cases as well to see if it works for
> all scenarios I have.

Perfect. Note that since the state 2 is transient I initially thought
about abusing the flags passed to __inet_stream_connect() to say "hey
I'm sendmsg() and not connect()" but that would have been an ugly hack,
while here we really have the 3 socket states represented even though
one changes twice around a call.

Thanks,
Willy


Re: Question about veth_xmit()

2017-01-23 Thread Cong Wang
On Mon, Jan 23, 2017 at 2:05 PM, Eric Dumazet  wrote:
> On Mon, 2017-01-23 at 13:46 -0800, Xiangning Yu wrote:
>> On Mon, Jan 23, 2017 at 12:56 PM, Cong Wang  wrote:
>> > On Mon, Jan 23, 2017 at 10:46 AM, Xiangning Yu  
>> > wrote:
>> >> Hi netdev folks,
>> >>
>> >> It looks like we call dev_forward_skb() in veth_xmit(), which calls
>> >> netif_rx() eventually.
>> >>
>> >> While netif_rx() will enqueue the skb to the CPU RX backlog before the
>> >> actual processing takes place. So, this actually means a TX skb has to
>> >> wait some un-related RX skbs to finish. And this will happen twice for
>> >> a single ping, because the veth device always works as a pair?
>> >
>> > For me it is more like for the completeness of network stack of each
>> > netns. The /proc net.core.netdev_max_backlog etc. are per netns, which
>> > means each netns, as an independent network stack, should respect it
>> > too.
>> >
>> > Since you care about latency, why not tune net.core.dev_weight for your
>> > own netns?
>>
>> I haven't tried that yet, thank you for the hint! Though normally one
>> of the veth device will be in the global namespace.
>
> Well, per cpu backlog are not per net ns, but per cpu.

Right, they all point to the same weight_p. But weight_p itself
is not per cpu, softnet_data is. However, if a container uses
cpusets and netns for isolation, per-cpu effectively turns into per-netns.

The point of network stack completeness still stands.


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 11:01:21PM +0100, Willy Tarreau wrote:
> On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote:
> > On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> > > Hi Willy,
> > > 
> > > True. If you call connect() multiple times on a socket which already has
> > > cookie without a write(), the second and onward connect() call will return
> > > EINPROGRESS.
> > > It is basically because the following code block in 
> > > __inet_stream_connect()
> > > can't distinguish if it is the first time connect() is called or not:
> > > 
> > > case SS_CONNECTING:
> > > if (inet_sk(sk)->defer_connect)  <- defer_connect will
> > > be 0 only after a write() is called
> > > err = -EINPROGRESS;
> > > else
> > > err = -EALREADY;
> > > /* Fall out of switch with err, set for this state */
> > > break;
> > 
> > Ah OK that totally makes sense, thanks for the explanation!
> > 
> > > I guess we can add some extra logic here to address this issue. So the
> > > second connect() and onwards will return EALREADY.
> 
> Thinking about it a bit more, I really think it would make more
> sense to return -EISCONN here if we want to match the semantics
> of a connect() returning zero on the first call. This way the
> caller knows it can write whenever it wants and can disable
> write polling until needed.
> 
> I'm currently rebuilding a kernel with this change to see if it
> behaves any better :
> 
> -err = -EINPROGRESS;
> +err = -EISCONN;

OK so obviously it didn't work since sendmsg() goes there as well.

But that made me realize that there really are 3 states, not 2 :

  - after connect() and before sendmsg() :
     defer_connect = 1, we want to lie to userland and pretend we're
 connected so that userland can call send(). A connect() must
 return either zero or -EISCONN.

  - during first sendmsg(), still connecting :
 the connection is in progress, EINPROGRESS must be returned to
 the first sendmsg().

  - after the first sendmsg() :
     defer_connect = 0; connect() must return -EALREADY. We want to
 return real socket states from now on.

Thus I modified defer_connect to take two bits to represent the extra
state we need to indicate the transition. Now sendmsg() upgrades
defer_connect from 1 to 2 before calling __inet_stream_connect(), which
then knows it must return EINPROGRESS to sendmsg().

This way we correctly cover all these situations. Even if we call
connect() again after the first connect() attempt it still matches
the first result :

accept4(7, {sa_family=AF_INET, sin_port=htons(36860), 
sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 1
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
accept4(7, 0x7ffc2282fcb0, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(1, 0x7b53a4, 7006, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily 
unavailable)
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 2
fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
setsockopt(2, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(2, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
connect(2, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_wait(0, [], 200, 0)   = 0
connect(2, {sa_family=AF_INET, sin_port=htons(8001), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EISCONN (Transport endpoint is 
already connected)
epoll_wait(0, [], 200, 0)   = 0
recvfrom(2, 0x7b53a4, 8030, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily 
unavailable)
epoll_ctl(0, EPOLL_CTL_ADD, 2, {EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}) = 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [], 200, 1000)= 0
epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "GET / HTTP/1.1\r\n", 8030, 0, NULL, NULL) = 16
sendto(2, "GET / HTTP/1.1\r\n", 16, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 16
epoll_wait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "\r\n", 8030, 0, NULL, NULL) = 2
sendto(2, "\r\n", 2, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 2
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=2, u64=2}}], 200, 1000) = 1
recvfrom(2, "HTTP/1.1 302 Found\r\nCache-Contro"..., 8030, 0, NULL, NULL) = 98
sendto(1, "HTTP/1.1 302 Found\r\nCache-Contro"..., 98, 
MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 98
shutdown(1, SHUT_WR)= 0
epoll_ctl(0, EPOLL_CTL_DEL, 2, 0x6ff55c) = 0
epoll_wait(0, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=1, u64=1}}], 200, 1000) = 1
recvfrom(1, "", 8030, 0, NULL, NULL)= 0
close(1)= 0
shutdown(2, SHUT_WR)= 0
close(2)= 0

Here's what I changed on top of your patchset :

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h

Re: [net PATCH v5 6/6] virtio_net: XDP support for adjust_head

2017-01-23 Thread Michael S. Tsirkin
On Mon, Jan 23, 2017 at 02:12:47PM -0800, John Fastabend wrote:
> On 17-01-23 12:09 PM, Michael S. Tsirkin wrote:
> > On Mon, Jan 23, 2017 at 09:22:36PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 17, 2017 at 02:22:59PM -0800, John Fastabend wrote:
> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>> index 62dbf4b..3b129b4 100644
> >>> --- a/drivers/net/virtio_net.c
> >>> +++ b/drivers/net/virtio_net.c
> >>> @@ -41,6 +41,9 @@
> >>>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> >>>  #define GOOD_COPY_LEN128
> >>>  
> >>> +/* Amount of XDP headroom to prepend to packets for use by 
> >>> xdp_adjust_head */
> >>> +#define VIRTIO_XDP_HEADROOM 256
> >>> +
> >>>  /* RX packet size EWMA. The average packet size is used to determine the 
> >>> packet
> >>>   * buffer size when refilling RX rings. As the entire RX ring may be 
> >>> refilled
> >>>   * at once, the weight is chosen so that the EWMA will be insensitive to 
> >>> short-
> >>
> >> I wonder where does this number come from?  This is quite a lot and
> >> means that using XDP_PASS will slow down any sockets on top of it.
> >> Which in turn means people will try to remove XDP when not in use,
> >> causing resets.  E.g. build_skb (which I have a patch to switch to) uses
> >> a much more reasonable NET_SKB_PAD.
> 
> I just used the value Alexei (or someone?) came up with. I think it needs to 
> be
> large enough to avoid copy in header encap cases. So minimum
> 
>   VXLAN_HDR + OUTER_UDP + OUTER_IPV6_HDR + OUTER_MAC =
>  8  + 8 +40  +  14   =  70
> 
> The choice of VXLAN hdr was sort of arbitrary but seems good for estimates. 
> For
> what its worth there is also a ndo_set_rx_headroom could we use that to set it
> and choose a reasonable default.
> 
> >>
> >> -- 
> >> MST
> > 
> > 
> > Let me show you a patch that I've been cooking.  What is missing there
> > is handling corner cases like e.g.  when ring size is ~4 entries so
> > using smaller buffers might mean we no longer have enough space to store
> > a full packet.  So it looks like I have to maintain the skb copy path
> > for this hardware.
> > 
> > With this patch, standard configuration has NET_SKB_PAD + NET_IP_ALIGN
> > bytes head padding. Would this be enough for XDP? If yes we do not
> > need the resets.
> 
> Based on above seems a bit small (L1_CACHE_BYTES + 2)? How tricky would it
> be to add support for ndo_set_rx_headroom.

Dunno, but then what? Expose it to userspace and let the admin
make the decision for us?


> > 
> > Thoughts?
> 
> I'll take a look at the patch this afternoon. Thanks.
> 
> > 
> > --->
> > 
> > virtio_net: switch to build_skb for mrg_rxbuf
> > 
> > For small packets data copy was observed to
> > take up about 15% CPU time. Switch to build_skb
> > and avoid the copy when using mergeable rx buffers.
> > 
> > As a bonus, medium-size skbs that fit in a page will be
> > completely linear.
> > 
> > Of course, we now need to lower the lower bound on packet size,
> > to make sure a sane number of skbs fits in rx socket buffer.
> > By how much? I don't know yet.
> > 
> > It might also be useful to prefetch the packet buffer since
> > net stack will likely use it soon.
> > 
> > Lightly tested, in particular, I didn't yet test what this
> > actually does to performance - sending this out for early
> > feedback/flames.
> > 
> > TODO: it appears that Linux won't handle correctly the case of first
> > buffer being very small (or consisting exclusively of virtio header).
> > This is already the case for current code, need to fix.
> > TODO: might be unfair to the last packet in a fragment as we include
> > remaining space if any in its truesize.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > 
> > 
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index b425fa1..a6b996f 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -38,6 +38,8 @@ module_param(gso, bool, 0444);
> >  
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > +//#define MIN_PACKET_ALLOC GOOD_PACKET_LEN
> > +#define MIN_PACKET_ALLOC 128
> >  #define GOOD_COPY_LEN  128
> >  
> >  /* RX packet size EWMA. The average packet size is used to determine the 
> > packet
> > @@ -246,6 +248,9 @@ static void *mergeable_ctx_to_buf_address(unsigned long 
> > mrg_ctx)
> >  static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
> >  {
> > unsigned int size = truesize / MERGEABLE_BUFFER_ALIGN;
> > +
> > +   BUG_ON((unsigned long)buf & (MERGEABLE_BUFFER_ALIGN - 1));
> > +   BUG_ON(size - 1 >= MERGEABLE_BUFFER_ALIGN);
> > return (unsigned long)buf | (size - 1);
> >  }
> >  
> > @@ -354,25 +359,54 @@ static struct sk_buff *receive_big(struct net_device 
> > *dev,
> > return NULL;
> >  }
> >  
> > +#define VNET_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
> > +#define VNET_SKB_BUG (VNET_SKB_PAD < 

Re: XDP offload to hypervisor

2017-01-23 Thread Michael S. Tsirkin
On Mon, Jan 23, 2017 at 01:56:16PM -0800, John Fastabend wrote:
> On 17-01-23 01:40 PM, Michael S. Tsirkin wrote:
> > I've been thinking about passing XDP programs from guest to the
> > hypervisor.  Basically, after getting an incoming packet, we could run
> > an XDP program in host kernel.
> > 
> 
> Interesting. I am planning on adding XDP to tun driver. My use case
> is we want to use XDP to restrict VM traffic. I was planning on pushing
> the xdp program execution into tun_get_user(). So different from "offloading"
> an xdp program into the hypervisor.

tun currently supports TUNATTACHFILTER. Do you plan to extend it then?

So maybe there's a need to support more than one program.

Would it work if we run one (host-supplied)
and then, if we get XDP_PASS, run another (guest-supplied);
otherwise don't wake up the guest?


> > If the result is XDP_DROP or XDP_TX we don't need to wake up the guest at 
> > all!
> > 
> 
> nice win.
> 
> > When using tun for networking - especially with adjust_head - this
> > unfortunately probably means we need to do a data copy unless there is
> > enough headroom.  How much is enough though?
> 
> We were looking at making headroom configurable on Intel drivers or at
> least matching it with XDP headroom guidelines. (although the developers
> had the same complaint about 256B being large).
> Then at least on supported
> drivers the copy could be an exception path.

So I am concerned that userspace comes to depend on support for the
256-byte headroom that this patchset enables. How about
-#define XDP_PACKET_HEADROOM 256
+#define XDP_PACKET_HEADROOM 64
so we start with a conservative value?
In fact NET_SKB_PAD would be ideal I think but it's
platform dependent.
In fact NET_SKB_PAD would be ideal I think but it's
platform dependent.

Or at least do the equivalent for virtio only ...

> > 
> > Another issue is around host/guest ABI. Guest BPF could add new features
> > at any point. What if hypervisor can not support it all?  I guess we
> > could try loading program into hypervisor and run it within guest on
> > failure to load, but this ignores question of cross-version
> > compatibility - someone might start guest on a new host
> > then try to move to an old one. So we will need an option
> > "behave like an older host" such that guest can start and then
> > move to an older host later. This will likely mean
> > implementing this validation of programs in qemu userspace unless linux
> > can supply something like this. Is this (disabling some features)
> > something that might be of interest to larger bpf community?
> 
> This is interesting to me at least. Another interesting "feature" of
> running bpf in qemu userspace is it could work with vhost_user as well
> presumably?

I think with vhost user you would want to push it out
to the switch, not run it in qemu.
IOW qemu gets the program and sends it to the switch.
The response is sent to the guest so it knows whether the switch can support it.

> > 
> > With a device such as macvtap there exist configurations where a single
> > guest is in control of the device (aka passthrough mode) in that case
> > there's a potential to run xdp on host before host skb is built, unless
> > host already has an xdp program attached.  If it does we could run the
> > program within guest, but what if a guest program got attached first?
> > Maybe we should pass a flag in the packet "xdp passed on this packet in
> > host". Then, guest can skip running it.  Unless we do a full reset
> > there's always a potential for packets to slip through, e.g. on xdp
> > program changes. Maybe a flush command is needed, or force queue or
> > device reset to make sure nothing is going on. Does this make sense?
> > 
> 
> Could the virtio driver pretend its "offloading" the XDP program to
> hardware? This would make it explicit in VM that the program is run
> before data is received by virtio_net. Then qemu is enabling the
> offload framework which would be interesting.

On the qemu side this is not a problem: a command causes
a trap and qemu could flush the queue. But the packets
are still in the rx queue and get processed by napi later.
I think the cleanest interface for it might be a command consuming
an rx buffer and writing a pre-defined pattern into it.
This way the guest can figure out how far the device got in the rx queue.


> > Thanks!
> > 


Re: [net PATCH v5 6/6] virtio_net: XDP support for adjust_head

2017-01-23 Thread John Fastabend
On 17-01-23 12:09 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 23, 2017 at 09:22:36PM +0200, Michael S. Tsirkin wrote:
>> On Tue, Jan 17, 2017 at 02:22:59PM -0800, John Fastabend wrote:
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 62dbf4b..3b129b4 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -41,6 +41,9 @@
>>>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>>>  #define GOOD_COPY_LEN  128
>>>  
>>> +/* Amount of XDP headroom to prepend to packets for use by xdp_adjust_head 
>>> */
>>> +#define VIRTIO_XDP_HEADROOM 256
>>> +
>>>  /* RX packet size EWMA. The average packet size is used to determine the 
>>> packet
>>>   * buffer size when refilling RX rings. As the entire RX ring may be 
>>> refilled
>>>   * at once, the weight is chosen so that the EWMA will be insensitive to 
>>> short-
>>
>> I wonder where does this number come from?  This is quite a lot and
>> means that using XDP_PASS will slow down any sockets on top of it.
>> Which in turn means people will try to remove XDP when not in use,
>> causing resets.  E.g. build_skb (which I have a patch to switch to) uses
>> a much more reasonable NET_SKB_PAD.

I just used the value Alexei (or someone?) came up with. I think it needs to be
large enough to avoid copy in header encap cases. So minimum

  VXLAN_HDR + OUTER_UDP + OUTER_IPV6_HDR + OUTER_MAC =
 8  + 8 +40  +  14   =  70

The choice of VXLAN hdr was sort of arbitrary but seems good for estimates. For
what it's worth there is also ndo_set_rx_headroom; could we use that to set it
and choose a reasonable default.

>>
>> -- 
>> MST
> 
> 
> Let me show you a patch that I've been cooking.  What is missing there
> is handling corner cases like e.g.  when ring size is ~4 entries so
> using smaller buffers might mean we no longer have enough space to store
> a full packet.  So it looks like I have to maintain the skb copy path
> for this hardware.
> 
> With this patch, standard configuration has NET_SKB_PAD + NET_IP_ALIGN
> bytes head padding. Would this be enough for XDP? If yes we do not
> need the resets.

Based on the above that seems a bit small (L1_CACHE_BYTES + 2)? How tricky would
it be to add support for ndo_set_rx_headroom?

> 
> Thoughts?

I'll take a look at the patch this afternoon. Thanks.

> 
> --->
> 
> virtio_net: switch to build_skb for mrg_rxbuf
> 
> For small packets data copy was observed to
> take up about 15% CPU time. Switch to build_skb
> and avoid the copy when using mergeable rx buffers.
> 
> As a bonus, medium-size skbs that fit in a page will be
> completely linear.
> 
> Of course, we now need to lower the lower bound on packet size,
> to make sure a sane number of skbs fits in rx socket buffer.
> By how much? I don't know yet.
> 
> It might also be useful to prefetch the packet buffer since
> net stack will likely use it soon.
> 
> Lightly tested, in particular, I didn't yet test what this
> actually does to performance - sending this out for early
> feedback/flames.
> 
> TODO: it appears that Linux won't handle correctly the case of first
> buffer being very small (or consisting exclusively of virtio header).
> This is already the case for current code, need to fix.
> TODO: might be unfair to the last packet in a fragment as we include
> remaining space if any in its truesize.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> 
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b425fa1..a6b996f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -38,6 +38,8 @@ module_param(gso, bool, 0444);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> +//#define MIN_PACKET_ALLOC GOOD_PACKET_LEN
> +#define MIN_PACKET_ALLOC 128
>  #define GOOD_COPY_LEN128
>  
>  /* RX packet size EWMA. The average packet size is used to determine the 
> packet
> @@ -246,6 +248,9 @@ static void *mergeable_ctx_to_buf_address(unsigned long 
> mrg_ctx)
>  static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
>  {
>   unsigned int size = truesize / MERGEABLE_BUFFER_ALIGN;
> +
> + BUG_ON((unsigned long)buf & (MERGEABLE_BUFFER_ALIGN - 1));
> + BUG_ON(size - 1 >= MERGEABLE_BUFFER_ALIGN);
>   return (unsigned long)buf | (size - 1);
>  }
>  
> @@ -354,25 +359,54 @@ static struct sk_buff *receive_big(struct net_device 
> *dev,
>   return NULL;
>  }
>  
> +#define VNET_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
> +#define VNET_SKB_BUG (VNET_SKB_PAD < sizeof(struct virtio_net_hdr_mrg_rxbuf))
> +#define VNET_SKB_LEN(len) ((len) - sizeof(struct virtio_net_hdr_mrg_rxbuf))
> +#define VNET_SKB_OFF VNET_SKB_LEN(VNET_SKB_PAD)
> +
> +static struct sk_buff *vnet_build_skb(struct virtnet_info *vi,
> +   void *buf,
> +   unsigned int len, unsigned int truesize)
> +{
> + struct sk_buff *skb 

Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 10:37:32PM +0100, Willy Tarreau wrote:
> On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> > Hi Willy,
> > 
> > True. If you call connect() multiple times on a socket which already has
> > cookie without a write(), the second and onward connect() call will return
> > EINPROGRESS.
> > It is basically because the following code block in __inet_stream_connect()
> > can't distinguish if it is the first time connect() is called or not:
> > 
> > case SS_CONNECTING:
> > if (inet_sk(sk)->defer_connect)  <- defer_connect will
> > be 0 only after a write() is called
> > err = -EINPROGRESS;
> > else
> > err = -EALREADY;
> > /* Fall out of switch with err, set for this state */
> > break;
> 
> Ah OK that totally makes sense, thanks for the explanation!
> 
> > I guess we can add some extra logic here to address this issue. So the
> > second connect() and onwards will return EALREADY.

Thinking about it a bit more, I really think it would make more
sense to return -EISCONN here if we want to match the semantics
of a connect() returning zero on the first call. This way the
caller knows it can write whenever it wants and can disable
write polling until needed.

I'm currently rebuilding a kernel with this change to see if it
behaves any better :

-err = -EINPROGRESS;
+err = -EISCONN;

I'll keep you updated.

Thanks,
Willy


Re: Question about veth_xmit()

2017-01-23 Thread Eric Dumazet
On Mon, 2017-01-23 at 13:46 -0800, Xiangning Yu wrote:
> On Mon, Jan 23, 2017 at 12:56 PM, Cong Wang  wrote:
> > On Mon, Jan 23, 2017 at 10:46 AM, Xiangning Yu  
> > wrote:
> >> Hi netdev folks,
> >>
> >> It looks like we call dev_forward_skb() in veth_xmit(), which calls
> >> netif_rx() eventually.
> >>
> >> While netif_rx() will enqueue the skb to the CPU RX backlog before the
> >> actual processing takes place. So, this actually means a TX skb has to
> >> wait some un-related RX skbs to finish. And this will happen twice for
> >> a single ping, because the veth device always works as a pair?
> >
> > For me it is more like for the completeness of network stack of each
> > netns. The /proc net.core.netdev_max_backlog etc. are per netns, which
> > means each netns, as an independent network stack, should respect it
> > too.
> >
> > Since you care about latency, why not tune net.core.dev_weight for your
> > own netns?
> 
> I haven't tried that yet, thank you for the hint! Though normally one
> of the veth device will be in the global namespace.

Well, the per-cpu backlog is not per netns, but per cpu.

So Cong's suggestion is not going to work.




Re: [net PATCH v5 1/6] virtio_net: use dev_kfree_skb for small buffer XDP receive

2017-01-23 Thread John Fastabend
On 17-01-23 01:08 PM, Michael S. Tsirkin wrote:
> On Tue, Jan 17, 2017 at 02:19:50PM -0800, John Fastabend wrote:
>> In the small buffer case during driver unload we currently use
>> put_page instead of dev_kfree_skb. Resolve this by adding a check
>> for virtnet mode when checking XDP queue type. Also name the
>> function so that the code reads correctly to match the additional
>> check.
>>
>> Fixes: bb91accf2733 ("virtio-net: XDP support for small buffers")
>> Signed-off-by: John Fastabend 
>> Acked-by: Jason Wang 
> 
> Acked-by: Michael S. Tsirkin 
> 
> I think we definitely want this one in -net as it's
> a bugfix.
> 

Agreed, let me pull this fix out of the series and submit it for
net.


Re: XDP offload to hypervisor

2017-01-23 Thread John Fastabend
On 17-01-23 01:40 PM, Michael S. Tsirkin wrote:
> I've been thinking about passing XDP programs from guest to the
> hypervisor.  Basically, after getting an incoming packet, we could run
> an XDP program in host kernel.
> 

Interesting. I am planning on adding XDP to tun driver. My use case
is we want to use XDP to restrict VM traffic. I was planning on pushing
the xdp program execution into tun_get_user(). So that's different from
"offloading" an xdp program into the hypervisor.

> If the result is XDP_DROP or XDP_TX we don't need to wake up the guest at all!
> 

nice win.

> When using tun for networking - especially with adjust_head - this
> unfortunately probably means we need to do a data copy unless there is
> enough headroom.  How much is enough though?

We were looking at making headroom configurable on Intel drivers or at
least matching it with XDP headroom guidelines. (although the developers
had the same complaint about 256B being large). Then at least on supported
drivers the copy could be an exception path.

> 
> Another issue is around host/guest ABI. Guest BPF could add new features
> at any point. What if hypervisor can not support it all?  I guess we
> could try loading program into hypervisor and run it within guest on
> failure to load, but this ignores question of cross-version
> compatibility - someone might start guest on a new host
> then try to move to an old one. So we will need an option
> "behave like an older host" such that guest can start and then
> move to an older host later. This will likely mean
> implementing this validation of programs in qemu userspace unless linux
> can supply something like this. Is this (disabling some features)
> something that might be of interest to larger bpf community?

This is interesting to me at least. Another interesting "feature" of
running bpf in qemu userspace is it could work with vhost_user as well
presumably?

> 
> With a device such as macvtap there exist configurations where a single
> guest is in control of the device (aka passthrough mode) in that case
> there's a potential to run xdp on host before host skb is built, unless
> host already has an xdp program attached.  If it does we could run the
> program within guest, but what if a guest program got attached first?
> Maybe we should pass a flag in the packet "xdp passed on this packet in
> host". Then, guest can skip running it.  Unless we do a full reset
> there's always a potential for packets to slip through, e.g. on xdp
> program changes. Maybe a flush command is needed, or force queue or
> device reset to make sure nothing is going on. Does this make sense?
> 

Could the virtio driver pretend it's "offloading" the XDP program to
hardware? This would make it explicit in VM that the program is run
before data is received by virtio_net. Then qemu is enabling the
offload framework which would be interesting.

> Thanks!
> 



Re: [PULL] vhost: cleanups and fixes

2017-01-23 Thread Linus Torvalds
On Mon, Jan 23, 2017 at 7:05 AM, Michael S. Tsirkin  wrote:
>
> virtio, vhost: fixes, cleanups

Was there a reason why you sent this twice?

Or was this *supposed* to be the ARM DMA fix pull request? Because it wasn't.

  Linus


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread Andy Lutomirski
On Mon, Jan 23, 2017 at 1:03 PM, David Ahern  wrote:
> On 1/23/17 1:36 PM, Andy Lutomirski wrote:
>> To see how cgroup+bpf interacts with network namespaces, I wrote a
>> little program called show_bind that calls getsockopt(...,
>> SO_BINDTODEVICE, ...) and prints the result.  It did this:
>>
>>  # ./ip link add dev vrf0 type vrf table 10
>>  # ./ip vrf exec vrf0 ./show_bind
>>  Default binding is "vrf0"
>>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>>  show_bind: getsockopt: No such device
>>
>> What's happening here is that "ip vrf" looks up vrf0's ifindex in
>> the init netns and installs a hook that binds sockets to that
>
> It looks up the device name in the current namespace.
>
>> ifindex.  When the hook runs in a different netns, it sets
>> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
>> incorrect behavior.  In this particular example, the ifindex was 4
>> and there was no ifindex 4 in the new netns.  If there had been,
>> this test would have malfunctioned differently
>
> While the cgroups and network namespace interaction needs improvement, a 
> management tool can work around the deficiencies:
>
> A shell in the default namespace, mgmt vrf (PS1 tells me the network context):
> dsa@kenny:mgmt:~$
>
> Switch to a different namespace (one that I run VMs for network testing):
> dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa
>
> And then bind the shell to vrf2
> dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa
> dsa@kenny:vms:vrf2:~$
>
> Or I can go straight to vrf2:
> dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa
> dsa@kenny:vms:vrf2:~$

Indeed, if you're careful to set up the vrf cgroup in the same netns
that you end up using it in, it'll work.  But there's a bigger footgun
there than I think is warranted, and I'm not sure how iproute2 is
supposed to do all that much better given that the eBPF program can
neither see what namespace a socket is bound to nor can it act in a
way that works correctly in any namespace.

Long-term, I think the real fix is to make the hook work on a
per-netns basis and, if needed, add an interface for a cross-netns
hook to work sensibly.  But I think it's a bit late to do that for
4.10, so instead I'm proposing to limit the API to the case where it
works and the semantics are unambiguous and to leave further
improvements for later.

It's a bit unfortunate that there seems to be an impedance mismatch in
that "ip vrf" acts on cgroups and that cgroups are somewhat orthogonal
to network namespaces.

>
>
> I am testing additional iproute2 cleanups which will be sent before 4.10 is 
> released.
>
> -8<-
>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e89acea22ecf..c0bbc55e244d 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>>   struct cgroup *cgrp;
>>   enum bpf_prog_type ptype;
>>
>> + /*
>> +  * For now, socket bpf hooks attached to cgroups can only be
>> +  * installed in the init netns and only affect the init netns.
>> +  * This could be relaxed in the future once some semantic issues
>> +  * are resolved.  For example, ifindexes belonging to one netns
>> +  * should probably not be visible to hooks installed by programs
>> +  * running in a different netns.
>> +  */
> + if (current->nsproxy->net_ns != &init_net)
>> + return -EINVAL;
>> +
>>   if (!capable(CAP_NET_ADMIN))
>>   return -EPERM;
>>
>
> But should this patch be taken, shouldn't the EPERM check outrank the namespace
> check?
>

I could see that going either way.  If the hook becomes per-netns,
then the capable() check could potentially become ns_capable() and it
would start succeeding.  I'd be happy to change it, though.

--Andy
-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: Question about veth_xmit()

2017-01-23 Thread Xiangning Yu
On Mon, Jan 23, 2017 at 12:56 PM, Cong Wang  wrote:
> On Mon, Jan 23, 2017 at 10:46 AM, Xiangning Yu  wrote:
>> Hi netdev folks,
>>
>> It looks like we call dev_forward_skb() in veth_xmit(), which calls
>> netif_rx() eventually.
>>
>> While netif_rx() will enqueue the skb to the CPU RX backlog before the
>> actual processing takes place. So, this actually means a TX skb has to
>> wait some un-related RX skbs to finish. And this will happen twice for
>> a single ping, because the veth device always works as a pair?
>
> For me it is more like for the completeness of network stack of each
> netns. The /proc net.core.netdev_max_backlog etc. are per netns, which
> means each netns, as an independent network stack, should respect it
> too.
>
> Since you care about latency, why not tune net.core.dev_weight for your
> own netns?

I haven't tried that yet, thank you for the hint! Though normally one
of the veth device will be in the global namespace.

Thanks,
- Xiangning


XDP offload to hypervisor

2017-01-23 Thread Michael S. Tsirkin
I've been thinking about passing XDP programs from guest to the
hypervisor.  Basically, after getting an incoming packet, we could run
an XDP program in host kernel.

If the result is XDP_DROP or XDP_TX we don't need to wake up the guest at all!

When using tun for networking - especially with adjust_head - this
unfortunately probably means we need to do a data copy unless there is
enough headroom.  How much is enough though?

Another issue is around host/guest ABI. Guest BPF could add new features
at any point. What if hypervisor can not support it all?  I guess we
could try loading program into hypervisor and run it within guest on
failure to load, but this ignores question of cross-version
compatibility - someone might start guest on a new host
then try to move to an old one. So we will need an option
"behave like an older host" such that guest can start and then
move to an older host later. This will likely mean
implementing this validation of programs in qemu userspace unless linux
can supply something like this. Is this (disabling some features)
something that might be of interest to larger bpf community?

With a device such as macvtap there exist configurations where a single
guest is in control of the device (aka passthrough mode) in that case
there's a potential to run xdp on host before host skb is built, unless
host already has an xdp program attached.  If it does we could run the
program within guest, but what if a guest program got attached first?
Maybe we should pass a flag in the packet "xdp passed on this packet in
host". Then, guest can skip running it.  Unless we do a full reset
there's always a potential for packets to slip through, e.g. on xdp
program changes. Maybe a flush command is needed, or force queue or
device reset to make sure nothing is going on. Does this make sense?

Thanks!

-- 
MST
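
The dispatch proposed above can be sketched as follows. The verdict values are the ones from uapi/linux/bpf.h; the "xdp_done_in_host" flag is hypothetical, standing in for the per-packet marker suggested in the mail:

```python
# XDP verdicts, numbered as in uapi/linux/bpf.h
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX = 0, 1, 2, 3

def host_rx(pkt, host_xdp_prog, wake_guest):
    """Run the guest's XDP program in the host before waking the guest.

    Returns True when the guest had to be woken."""
    verdict = host_xdp_prog(pkt)
    if verdict in (XDP_DROP, XDP_TX):
        return False                  # handled entirely in the host
    pkt["xdp_done_in_host"] = True    # hypothetical marker: guest can skip re-running
    wake_guest(pkt)
    return True
```

With this shape, a host-side XDP_DROP or XDP_TX never wakes the guest, and a passed packet carries the marker so the guest's copy of the program is not run twice.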


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
On Mon, Jan 23, 2017 at 01:28:53PM -0800, Wei Wang wrote:
> Hi Willy,
> 
> True. If you call connect() multiple times on a socket which already has
> cookie without a write(), the second and onward connect() call will return
> EINPROGRESS.
> It is basically because the following code block in __inet_stream_connect()
> can't distinguish if it is the first time connect() is called or not:
> 
> case SS_CONNECTING:
> if (inet_sk(sk)->defer_connect)  <- defer_connect will
> be 0 only after a write() is called
> err = -EINPROGRESS;
> else
> err = -EALREADY;
> /* Fall out of switch with err, set for this state */
> break;

Ah OK that totally makes sense, thanks for the explanation!

> I guess we can add some extra logic here to address this issue. So the
> second connect() and onwards will return EALREADY.

If that's possible at little cost it would be nice, because your patch
makes it so easy to enable TFO on outgoing connections now that I
expect many people will blindly run the setsockopt() before connect().

Do not hesitate to ask me to run some tests. While 4 years ago it was
not easy, here it's very simple for me. By the way I'm seeing an ~10%
performance increase on haproxy by enabling this, it's really cool!

Thanks,
Willy


Re: [PATCH net-next 3/3] net/tcp-fastopen: Add new API support

2017-01-23 Thread Willy Tarreau
Hi Wei,

first, thanks a lot for doing this, it's really awesome!

I'm testing it on 4.9 on haproxy and I met a corner case : when I
perform a connect() to a server and I have nothing to send, upon
POLLOUT notification since I have nothing to send I simply probe the
connection using connect() again to see if it returns EISCONN or
anything else. But here now I'm seeing EINPROGRESS loops.

To illustrate this, here's what I'm doing :

:8000  :8001
  [ client ] ---> [ proxy ] ---> [ server ]

The proxy is configured to enable TFO to the server and the server
supports TFO as well. The proxy and the server are in fact two proxy
instances in haproxy running in the same process for convenience.

When I already have data to send here's what I'm seeing (so it works fine) :

06:29:16.861190 accept4(7, {sa_family=AF_INET, sin_port=htons(33986), sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 9
06:29:16.861277 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.861342 accept4(7, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
06:29:16.861417 recvfrom(9, "BLAH\n", 7006, 0, NULL, NULL) = 5
06:29:16.861509 recvfrom(9, 0x2619329, 7001, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
06:29:16.861657 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
06:29:16.861730 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
06:29:16.861779 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.861882 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
06:29:16.861942 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
06:29:16.862015 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, u64=9}}) = 0
06:29:16.862072 epoll_wait(3, [], 200, 0) = 0
06:29:16.862126 sendto(10, "BLAH\n", 5, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 5
06:29:16.862281 epoll_wait(3, [{EPOLLIN, {u32=8, u64=8}}], 200, 0) = 1
06:29:16.862334 recvfrom(10, 0x26173a4, 8030, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
06:29:16.862385 accept4(8, {sa_family=AF_INET, sin_port=htons(46760), sin_addr=inet_addr("127.0.0.1")}, [128->16], SOCK_NONBLOCK) = 11
06:29:16.862450 setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:16.862504 accept4(8, 0x7ffd0d794430, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
06:29:16.862564 recvfrom(11, "BLAH\n", 7006, 0, NULL, NULL) = 5


When I don't have data, here's what I'm seeing :

06:29:24.047801 accept4(7, {sa_family=AF_INET, sin_port=htons(33988), sin_addr=inet_addr("192.168.0.176")}, [128->16], SOCK_NONBLOCK) = 9
06:29:24.047899 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:24.047966 accept4(7, 0x7ffdedb2c7f0, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
06:29:24.048043 recvfrom(9, 0xd31324, 7006, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
06:29:24.048281 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
06:29:24.048342 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
06:29:24.048392 setsockopt(10, SOL_TCP, TCP_NODELAY, [1], 4) = 0
06:29:24.048447 setsockopt(10, SOL_TCP, 0x1e /* TCP_??? */, [1], 4) = 0
06:29:24.048508 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
06:29:24.048593 epoll_ctl(3, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=9, u64=9}}) = 0
06:29:24.048651 epoll_wait(3, [], 200, 0) = 0
06:29:24.048699 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.048751 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
06:29:24.048808 epoll_ctl(3, EPOLL_CTL_ADD, 10, {EPOLLOUT, {u32=10, u64=10}}) = 0
06:29:24.048860 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1
06:29:24.048912 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.048963 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
06:29:24.049018 epoll_wait(3, [{EPOLLOUT, {u32=10, u64=10}}], 200, 1000) = 1
06:29:24.049072 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
06:29:24.049122 connect(10, {sa_family=AF_INET, sin_port=htons(8001), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)


I theorically understand why but I think we have something wrong here
and instead we should have -1 EISCONN (to pretend the connection is
established) or return EALREADY (to mention that a previous request was
already made and that we're waiting for the next step).

While I can instrument my connect() *not* to use TFO when connecting
without any pending data, I don't always know this (eg when I use
openssl and cross fingers so that it decides to quickly send something
on the next round).

I think it's easy to fall into this tricky corner case and am wondering
what can be done about it. Does the EINPROGRESS happen only because there
is no cookie yet? If so, 

Re: [PATCH v4 0/3] bpf: add longest prefix match map

2017-01-23 Thread David Miller
From: Daniel Mack 
Date: Sat, 21 Jan 2017 17:26:10 +0100

> This patch set adds a longest prefix match algorithm that can be used
> to match IP addresses to a stored set of ranges. It is exposed as a
> bpf map type.
>
> Internally, data is stored in an unbalanced tree of nodes that has a
> maximum height of n, where n is the prefixlen the trie was created
> with.
>  
> Note that this has nothing to do with fib or fib6 and is in no way meant
> to replace or share code with it. It's rather a much simpler
> implementation that is specifically written with bpf maps in mind.
>  
> Patch 1/3 adds the implementation, 2/3 an extensive test suite, and 3/3
> benchmarking code for the new trie type.
> 
> Feedback is much appreciated.

Series applied, thank you.
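
For readers unfamiliar with the map semantics, the lookup behaviour (though not the trie data structure itself, which the kernel implements as an unbalanced tree of height up to the prefixlen) can be sketched in a few lines. The helper names here are illustrative, not part of the bpf API:

```python
def _mask(plen):
    """IPv4 network mask for a prefix length."""
    return ((1 << plen) - 1) << (32 - plen) if plen else 0

def lpm_insert(table, prefix, plen, value):
    # Normalize the prefix so host bits never affect the key.
    table[(prefix & _mask(plen), plen)] = value

def lpm_lookup(table, addr):
    """Return the value of the longest matching prefix, or None."""
    for plen in range(32, -1, -1):  # try the most specific prefix first
        hit = table.get((addr & _mask(plen), plen))
        if hit is not None:
            return hit
    return None
```

With 192.168.0.0/16 and 192.168.1.0/24 both installed, an address in 192.168.1.0/24 matches the /24 and every other 192.168.x address falls back to the /16, which is exactly the matching rule the map type implements.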


Re: [net PATCH v5 1/6] virtio_net: use dev_kfree_skb for small buffer XDP receive

2017-01-23 Thread Michael S. Tsirkin
On Tue, Jan 17, 2017 at 02:19:50PM -0800, John Fastabend wrote:
> In the small buffer case during driver unload we currently use
> put_page instead of dev_kfree_skb. Resolve this by adding a check
> for virtnet mode when checking XDP queue type. Also name the
> function so that the code reads correctly to match the additional
> check.
> 
> Fixes: bb91accf2733 ("virtio-net: XDP support for small buffers")
> Signed-off-by: John Fastabend 
> Acked-by: Jason Wang 

Acked-by: Michael S. Tsirkin 

I think we definitely want this one in -net as it's
a bugfix.

> ---
>  drivers/net/virtio_net.c |8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4a10500..d97bb71 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1890,8 +1890,12 @@ static void free_receive_page_frags(struct 
> virtnet_info *vi)
>   put_page(vi->rq[i].alloc_frag.page);
>  }
>  
> -static bool is_xdp_queue(struct virtnet_info *vi, int q)
> +static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
>  {
> + /* For small receive mode always use kfree_skb variants */
> + if (!vi->mergeable_rx_bufs)
> + return false;
> +
>   if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
>   return false;
>   else if (q < vi->curr_queue_pairs)
> @@ -1908,7 +1912,7 @@ static void free_unused_bufs(struct virtnet_info *vi)
>   for (i = 0; i < vi->max_queue_pairs; i++) {
>   struct virtqueue *vq = vi->sq[i].vq;
>   while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> - if (!is_xdp_queue(vi, i))
> + if (!is_xdp_raw_buffer_queue(vi, i))
>   dev_kfree_skb(buf);
>   else
>   put_page(virt_to_head_page(buf));


Re: [PATCH net-next 1/2] vxlan: don't flush static fdb entries on admin down

2017-01-23 Thread David Miller
From: Roopa Prabhu 
Date: Fri, 20 Jan 2017 23:43:18 -0800

>  /* Purge the forwarding table */
> -static void vxlan_flush(struct vxlan_dev *vxlan)
> +static void vxlan_flush(struct vxlan_dev *vxlan, int do_all)

Please use 'bool' and true/false for this new argument.

FWIW, I am fine with changing the default behavior in this way.


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-23 Thread David Ahern
On 1/23/17 1:36 PM, Andy Lutomirski wrote:
> To see how cgroup+bpf interacts with network namespaces, I wrote a
> little program called show_bind that calls getsockopt(...,
> SO_BINDTODEVICE, ...) and prints the result.  It did this:
> 
>  # ./ip link add dev vrf0 type vrf table 10
>  # ./ip vrf exec vrf0 ./show_bind
>  Default binding is "vrf0"
>  # ./ip vrf exec vrf0 unshare -n ./show_bind
>  show_bind: getsockopt: No such device
> 
> What's happening here is that "ip vrf" looks up vrf0's ifindex in
> the init netns and installs a hook that binds sockets to that

It looks up the device name in the current namespace.

> ifindex.  When the hook runs in a different netns, it sets
> sk_bound_dev_if to an ifindex from the wrong netns, resulting in
> incorrect behavior.  In this particular example, the ifindex was 4
> and there was no ifindex 4 in the new netns.  If there had been,
> this test would have malfunctioned differently

While the cgroups and network namespace interaction needs improvement, a 
management tool can work around the deficiencies:

A shell in the default namespace, mgmt vrf (PS1 tells me the network context):
dsa@kenny:mgmt:~$ 

Switch to a different namespace (one that I run VMs for network testing):
dsa@kenny:mgmt:~$ sudo ip netns exec vms su - dsa

And then bind the shell to vrf2
dsa@kenny:vms:~$ sudo ip vrf exec vrf2 su - dsa
dsa@kenny:vms:vrf2:~$

Or I can go straight to vrf2:
dsa@kenny:mgmt:~$ sudo ip netns exec vms ip vrf exec vrf2 su - dsa
dsa@kenny:vms:vrf2:~$


I am testing additional iproute2 cleanups which will be sent before 4.10 is 
released.

-8<-

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e89acea22ecf..c0bbc55e244d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -902,6 +902,17 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   struct cgroup *cgrp;
>   enum bpf_prog_type ptype;
>  
> + /*
> +  * For now, socket bpf hooks attached to cgroups can only be
> +  * installed in the init netns and only affect the init netns.
> +  * This could be relaxed in the future once some semantic issues
> +  * are resolved.  For example, ifindexes belonging to one netns
> +  * should probably not be visible to hooks installed by programs
> +  * running in a different netns.
> +  */
> + if (current->nsproxy->net_ns != &init_net)
> + return -EINVAL;
> +
>   if (!capable(CAP_NET_ADMIN))
>   return -EPERM;
>  

But should this patch be taken, shouldn't the EPERM check outrank the namespace
check?



Re: [PATCH v2] net: xilinx: constify net_device_ops structure

2017-01-23 Thread David Miller
From: Bhumika Goyal 
Date: Sat, 21 Jan 2017 12:28:58 +0530

> Declare net_device_ops structure as const as it is only stored in
> the netdev_ops field of a net_device structure. This field is of type
> const, so net_device_ops structures having same properties can be made
> const too.
> Done using Coccinelle:
 ...
> Signed-off-by: Bhumika Goyal 

Applied.


Re: [PATCH v2] net: moxa: constify net_device_ops structures

2017-01-23 Thread David Miller
From: Bhumika Goyal 
Date: Sat, 21 Jan 2017 12:27:26 +0530

> Declare net_device_ops structure as const as it is only stored in
> the netdev_ops field of a net_device structure. This field is of type
> const, so net_device_ops structures having same properties can be made
> const too.
> Done using Coccinelle:
 ...
> Signed-off-by: Bhumika Goyal 

Applied.


Re: [RFC PATCH net-next 2/5] net: split skb_checksum_help

2017-01-23 Thread Tom Herbert
On Mon, Jan 23, 2017 at 8:52 AM, Davide Caratti  wrote:
> skb_checksum_help is designed to compute the Internet Checksum only. To
> avoid duplicating code when other checksumming algorithms (e.g. crc32c)
> are used, separate common part from RFC1624-specific part.
>
> Reviewed-by: Marcelo Ricardo Leitner 
> Signed-off-by: Davide Caratti 
> ---
>  net/core/dev.c | 51 +++
>  1 file changed, 35 insertions(+), 16 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ad5959e..6742160 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2532,13 +2532,36 @@ static void skb_warn_bad_offload(const struct sk_buff 
> *skb)
>  skb_shinfo(skb)->gso_type, skb->ip_summed);
>  }
>
> -/*
> - * Invalidate hardware checksum when packet is to be mangled, and
> +/* compute 16-bit RFC1624 checksum and store it at skb->data + offset */
> +static int skb_rfc1624_csum(struct sk_buff *skb, int offset)
> +{
> +   __wsum csum;
> +   int ret = 0;
> +
> +   csum = skb_checksum(skb, offset, skb->len - offset, 0);
> +
> +   offset += skb->csum_offset;
> +   BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
> +
> +   if (skb_cloned(skb) &&
> +   !skb_clone_writable(skb, offset + sizeof(__sum16))) {
> +   ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> +   if (ret)
> +   goto out;
> +   }
> +   *(__sum16 *)(skb->data + offset) = csum_fold(csum) ?: CSUM_MANGLED_0;
> +out:
> +   return ret;
> +}
> +
> +/* Invalidate hardware checksum when packet is to be mangled, and
>   * complete checksum manually on outgoing path.
> + *@skb - buffer that needs checksum
> + *@csum_algo(skb, offset) - function used to compute the checksum
>   */
> -int skb_checksum_help(struct sk_buff *skb)
> +static int __skb_checksum_help(struct sk_buff *skb,
> +  int (*csum_algo)(struct sk_buff *, int))
>  {
> -   __wsum csum;
> int ret = 0, offset;
>
> if (skb->ip_summed == CHECKSUM_COMPLETE)

skb_checksum_help is specific to the Internet checksum. For instance,
CHECKSUM_COMPLETE can _only_ refer to Internet checksum calculation
nothing else will work. Checksums and CRCs are very different things
with very different processing. They are not interchangeable, have
very different properties, and hence it is a mistake to try to shoe
horn things so that they use a common infrastructure.

It might make sense to create some CRC helper functions, but last time
I checked there are so few users of CRC in skbufs I'm not even sure
that would make sense.

Tom

> @@ -2560,24 +2583,20 @@ int skb_checksum_help(struct sk_buff *skb)
>
> offset = skb_checksum_start_offset(skb);
> BUG_ON(offset >= skb_headlen(skb));
> -   csum = skb_checksum(skb, offset, skb->len - offset, 0);
> -
> -   offset += skb->csum_offset;
> -   BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
> -
> -   if (skb_cloned(skb) &&
> -   !skb_clone_writable(skb, offset + sizeof(__sum16))) {
> -   ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> -   if (ret)
> -   goto out;
> -   }
>
> -   *(__sum16 *)(skb->data + offset) = csum_fold(csum) ?: CSUM_MANGLED_0;
> +   ret = csum_algo(skb, offset);
> +   if (ret)
> +   goto out;
>  out_set_summed:
> skb->ip_summed = CHECKSUM_NONE;
>  out:
> return ret;
>  }
> +
> +int skb_checksum_help(struct sk_buff *skb)
> +{
> +   return __skb_checksum_help(skb, skb_rfc1624_csum);
> +}
>  EXPORT_SYMBOL(skb_checksum_help);
>
>  __be16 skb_network_protocol(struct sk_buff *skb, int *depth)
> --
> 2.7.4
>
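
What the skb_rfc1624_csum() helper in the patch computes over the payload is the standard Internet checksum. A Python sketch of the one's-complement sum and the csum_fold() step, including the CSUM_MANGLED_0 rule that a computed checksum of zero is stored as 0xffff (relevant for UDP, where a zero field means "no checksum"):

```python
def csum_partial(data, initial=0):
    """16-bit one's-complement sum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"          # pad an odd-length buffer
    s = initial
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
    while s >> 16:               # fold the carries back into 16 bits
        s = (s & 0xFFFF) + (s >> 16)
    return s

def csum_fold(s):
    """Complement the folded sum; never store 0 (CSUM_MANGLED_0)."""
    return (~s & 0xFFFF) or 0xFFFF
```

The RFC 1071 worked example (bytes 00 01 f2 03 f4 f5 f6 f7) sums to 0xddf2, giving a stored checksum of 0x220d, which makes a handy sanity check for any reimplementation.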

