Re: [RFC 02/11] Add RoCE driver framework

2016-09-12 Thread Leon Romanovsky
On Mon, Sep 12, 2016 at 07:07:36PM +0300, Ram Amrani wrote:
> Adds a skeletal implementation of the qed* RoCE driver -
> basically the ability to communicate with the qede driver and
> receive notifications from it regarding various init/exit events.
>
> Signed-off-by: Rajesh Borundia 
> Signed-off-by: Ram Amrani 
> ---

<...>

> +
> +#define QEDR_MODULE_VERSION  "8.10.10.0"

I am strongly against module versions. You should rely on the official
kernel version.

> +#define QEDR_NODE_DESC "QLogic 579xx RoCE HCA"
> +#define DP_NAME(dev) ((dev)->ibdev.name)




Re: [RFC 02/11] Add RoCE driver framework

2016-09-12 Thread Leon Romanovsky
On Mon, Sep 12, 2016 at 07:17:35PM +, Yuval Mintz wrote:
> >> +uint debug;
> >> +module_param(debug, uint, 0);
> >> +MODULE_PARM_DESC(debug, "Default debug msglevel");
>
> >Why are you adding this as a module parameter?
>
> I believe this is mostly to follow the same line as qede, which also defines
> a 'debug' module parameter to allow easy user control of debug
> prints [& specifically probe prints, which can't be controlled
> otherwise].

Can you give us an example where dynamic debug and tracing infrastructures
are not enough?

AFAIK, most of these debug module parameters are legacy copy/paste
code that is useless in real-life scenarios.
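
For illustration, a minimal sketch of the alternative (the function name and
the message are made up; only dev->pdev and dev->num_cnq are taken from the
patch): a plain dev_dbg() stays silent until it is enabled per call-site at
run time through dynamic debug, so no module parameter is needed at all:

/* sketch only: silent until enabled at run time, e.g. via
 *   echo 'module qedr +p' > /sys/kernel/debug/dynamic_debug/control
 */
static void qedr_debug_example(struct qedr_dev *dev)
{
	dev_dbg(&dev->pdev->dev, "allocated %d CNQs\n", dev->num_cnq);
}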

Thanks






Re: icmpv6: issue with routing table entries from link local addresses

2016-09-12 Thread Andreas Hübner
On Mon, Sep 12, 2016 at 07:26:23PM +0200, Hannes Frederic Sowa wrote:
> > Actually, I'm convinced I must be doing something wrong here. The setup
> > for the issue is quite trivial, someone would have tripped over it
> > already. The only condition is that one host has multiple interfaces
> > with ipv6 enabled.
> > 
> > Any help in shedding some light onto this issue would be appreciated.
> 
> This shouldn't be the case. We certainly carry over the ifindex of the
> received packet into the routing lookup of the outgoing packet, thus the
> appropriate rule, with outgoing ifindex should be selected.

I saw this in the code, and that's the reason I wrote the initial
mail. I was trying to trace it with ftrace, but got stuck somewhere around
the find_rr_leaf function. (If there is any good documentation on the
internal fib data structure, please point me to it.)

> I also couldn't reproduce your problem here with my system. Can you
> verify with tcpdump that the packet is leaving on another interface?

It is not leaving on another interface; it is simply discarded on the host.
The Ip6OutNoRoutes stat in /proc/net/snmp6 is increased.
From my understanding, the routing subsystem finds the first matching entry
in the main routing table, checks the interface and bails out because it
does not match.

I did omit a crucial piece of information in the last mail: I'm currently
stuck on an older distribution kernel (3.16).
I'll try to check if there have been any relevant changes to IPv6
route lookup in the last two years.
(Maybe I should try to reproduce it with a current kernel; sorry that
I didn't think of this before.)


Andreas


Re: [RFC 03/11] Add support for RoCE HW init

2016-09-12 Thread Mark Bloch


On 12/09/2016 19:07, Ram Amrani wrote:
> Allocate and setup RoCE resources, interrupts and completion queues.
> Adds device attributes.
> 
> Signed-off-by: Rajesh Borundia 
> Signed-off-by: Ram Amrani 
> ---
>  drivers/infiniband/hw/qedr/main.c  | 408 +++-
>  drivers/infiniband/hw/qedr/qedr.h  | 118 
>  drivers/infiniband/hw/qedr/qedr_hsi.h  |  56 ++
>  drivers/infiniband/hw/qedr/qedr_hsi_rdma.h |  96 +++
>  drivers/net/ethernet/qlogic/qed/Makefile   |   1 +
>  drivers/net/ethernet/qlogic/qed/qed.h  |  26 +-
>  drivers/net/ethernet/qlogic/qed/qed_cxt.c  |   6 +
>  drivers/net/ethernet/qlogic/qed/qed_cxt.h  |   6 +
>  drivers/net/ethernet/qlogic/qed/qed_dev.c  | 155 +
>  drivers/net/ethernet/qlogic/qed/qed_main.c |  44 +-
>  drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |   7 +
>  drivers/net/ethernet/qlogic/qed/qed_roce.c | 887 
> +
>  drivers/net/ethernet/qlogic/qed/qed_roce.h | 117 
>  drivers/net/ethernet/qlogic/qed/qed_sp.h   |   1 +
>  drivers/net/ethernet/qlogic/qed/qed_spq.c  |   8 +
>  drivers/net/ethernet/qlogic/qede/qede_roce.c   |   2 +-
>  include/linux/qed/qed_if.h |   5 +-
>  include/linux/qed/qed_roce_if.h| 345 ++
>  18 files changed, 2281 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/infiniband/hw/qedr/qedr_hsi.h
>  create mode 100644 drivers/infiniband/hw/qedr/qedr_hsi_rdma.h
>  create mode 100644 drivers/net/ethernet/qlogic/qed/qed_roce.c
>  create mode 100644 drivers/net/ethernet/qlogic/qed/qed_roce.h
>  create mode 100644 include/linux/qed/qed_roce_if.h
> 
> diff --git a/drivers/infiniband/hw/qedr/main.c 
> b/drivers/infiniband/hw/qedr/main.c
> index 3fe58a3..0b5274a 100644
> --- a/drivers/infiniband/hw/qedr/main.c
> +++ b/drivers/infiniband/hw/qedr/main.c
> @@ -36,6 +36,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include "qedr.h"
>  
>  MODULE_DESCRIPTION("QLogic 40G/100G ROCE Driver");
> @@ -80,6 +82,139 @@ static int qedr_register_device(struct qedr_dev *dev)
>   return 0;
>  }
>  
> +/* This function allocates fast-path status block memory */
> +static int qedr_alloc_mem_sb(struct qedr_dev *dev,
> +  struct qed_sb_info *sb_info, u16 sb_id)
> +{
> + struct status_block *sb_virt;
> + dma_addr_t sb_phys;
> + int rc;
> +
> + sb_virt = dma_alloc_coherent(&dev->pdev->dev,
> +  sizeof(*sb_virt), &sb_phys, GFP_KERNEL);
> + if (!sb_virt) {
> + pr_err("Status block allocation failed\n");
> + return -ENOMEM;
> + }
> +
> + rc = dev->ops->common->sb_init(dev->cdev, sb_info,
> +sb_virt, sb_phys, sb_id,
> +QED_SB_TYPE_CNQ);
> + if (rc) {
> + pr_err("Status block initialization failed\n");
> + dma_free_coherent(&dev->pdev->dev, sizeof(*sb_virt),
> +   sb_virt, sb_phys);
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +static void qedr_free_mem_sb(struct qedr_dev *dev,
> +  struct qed_sb_info *sb_info, int sb_id)
> +{
> + if (sb_info->sb_virt) {
> + dev->ops->common->sb_release(dev->cdev, sb_info, sb_id);
> + dma_free_coherent(&dev->pdev->dev, sizeof(*sb_info->sb_virt),
> +   (void *)sb_info->sb_virt, sb_info->sb_phys);
> + }
> +}
> +
> +static void qedr_free_resources(struct qedr_dev *dev)
> +{
> + int i;
> +
> + for (i = 0; i < dev->num_cnq; i++) {
> + qedr_free_mem_sb(dev, &dev->sb_array[i], dev->sb_start + i);
> + dev->ops->common->chain_free(dev->cdev, &dev->cnq_array[i].pbl);
> + }
> +
> + kfree(dev->cnq_array);
> + kfree(dev->sb_array);
> + kfree(dev->sgid_tbl);
> +}
> +
> +static int qedr_alloc_resources(struct qedr_dev *dev)
> +{
> + struct qedr_cnq *cnq;
> + __le16 *cons_pi;
> + u16 n_entries;
> + int i, rc;
> +
> + dev->sgid_tbl = kzalloc(sizeof(union ib_gid) *
> + QEDR_MAX_SGID, GFP_KERNEL);
> + if (!dev->sgid_tbl)
> + return -ENOMEM;
> +
> + spin_lock_init(&dev->sgid_lock);
> +
> + /* Allocate Status blocks for CNQ */
> + dev->sb_array = kcalloc(dev->num_cnq, sizeof(*dev->sb_array),
> + GFP_KERNEL);
> + if (!dev->sb_array) {
> + rc = -ENOMEM;
> + goto err1;
> + }
> +
> + dev->cnq_array = kcalloc(dev->num_cnq,
> +  sizeof(*dev->cnq_array), GFP_KERNEL);
> + if (!dev->cnq_array) {
> + rc = -ENOMEM;
> + goto err2;
> + }
> +
> + dev->sb_start = dev->ops->rdma_get_start_sb(dev->cdev);
> +
> + /* Allocate CNQ PBLs */
> + n_entries = min_t(u32, QED_RDMA_MAX_CNQ_SIZE, QEDR_ROCE_MAX_CNQ_SIZE

Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC

2016-09-12 Thread Leon Romanovsky
On Mon, Sep 12, 2016 at 05:39:35PM +, Yuval Mintz wrote:
> >>>  include/linux/qed/common_hsi.h |    1 +
> >>>  include/linux/qed/qed_if.h |    9 +-
> >>>  include/linux/qed/qed_ll2_if.h |  140 +
> >>>  include/linux/qed/qed_roce_if.h    |  604 
> >>>  include/linux/qed/qede_roce.h  |   88 +
> >> > include/linux/qed/rdma_common.h    |    1 +
> >>
> >> Something not directly related to your patches, but they brought my
> >> attention to the fact that all these new (and old) rdma<->net devices
> >> are polluting include/linux
> >>
> > ocrdma driver includes be_roce.h located in net/ethernet/emulex/benet
> > location instead of include/linux/.
> > This file helps to bind rdma to net device or underlying hw device.
>
> > May be similar change can be done for rest of the drivers for
> > rdma<-->net devices?
>
> By adding explicit inclusion paths in the Makefile, a la
> ccflags-y := -Idrivers/net/ethernet/emulex/benet   ?
>
> While this might work, I personally dislike it, as I find it
> counter-intuitive when going over the code -
> I don't expect a driver to locally modify the inclusion path.
> Besides, we're [eventually] going to have a whole suite of drivers based
> on the qed module, some of which would reside under drivers/scsi;
> I'm not sure it's best to have 3 or 4 different drivers privately include the
> same directory under a different subsystem.

I agree with you that ocrdma's way can be valuable for small drivers.

ocrdma has a small set of shared headers and doesn't need to change them
rapidly to support different devices.

I thought of placing them in a directory similar to include/soc/* and removing
them from include/linux/. We have include/rdma/ and it looks like a good
candidate.

>
> >> Filtered output:
> >> ➜  linux-rdma git:(topic/fixes-for-4.8-2) ls -dl include/linux/*/
> >> drwxrwxr-x  2 leonro leonro  4096 Aug 30 16:27 include/linux/hsi/
> >> drwxrwxr-x  2 leonro leonro  4096 Sep 12 19:08 include/linux/mlx4/
> >> drwxrwxr-x  2 leonro leonro  4096 Sep  7 15:31 include/linux/mlx5/
> >> drwxrwxr-x  2 leonro leonro  4096 Sep  8 17:46 include/linux/qed/
> >>
> >> Is this the right place for them?
> >
> > Thanks
>




[PATCH V3] net-next: dsa: add FIB support

2016-09-12 Thread John Crispin
Add SWITCHDEV_OBJ_ID_IPV4_FIB support to the DSA layer.

Signed-off-by: John Crispin 
---
Changes in V2
* rebase on latest net-next to fix compile errors

Changes in V3
* fix subject prefix. this needs to go into the next tree

 Documentation/networking/dsa/dsa.txt |   18 +++
 include/net/dsa.h|   13 +++
 net/dsa/slave.c  |   41 ++
 3 files changed, 72 insertions(+)

diff --git a/Documentation/networking/dsa/dsa.txt 
b/Documentation/networking/dsa/dsa.txt
index 6d6c07c..6cd5831 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -607,6 +607,24 @@ of DSA, would be the its port-based VLAN, used by the 
associated bridge device.
   function that the driver has to call for each MAC address known to be behind
   the given port. A switchdev object is used to carry the VID and MDB info.
 
+Route offloading
+
+
+- ipv4_fib_prepare: routing layer function invoked to prepare the installation
+  of an ipv4 route into the switches routing database. If the operation is not
+  supported this function should return -EOPNOTSUPP. No hardware setup must be
+  done in this function. See ipv4_fib_add for this and details.
+
+- ipv4_fib_add: routing layer function invoked when a new ipv4 route should be
+  installed into the routing database of the switch, it should be programmed
+  with the route for the specified VLAN ID of the device that the route applies
+  to.
+
+- ipv4_fib_del: routing layer function invoked when an ipv4 route should be
+  removed from the routing database, the switch hardware should be programmed
+  to delete the specified MAC address of the nexthop from the specified VLAN ID
+  if it was mapped into the forwarding database.
+
 TODO
 
 
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7556646..7fdd63e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -237,6 +237,7 @@ struct switchdev_obj;
 struct switchdev_obj_port_fdb;
 struct switchdev_obj_port_mdb;
 struct switchdev_obj_port_vlan;
+struct switchdev_obj_ipv4_fib;
 
 struct dsa_switch_ops {
struct list_headlist;
@@ -386,6 +387,18 @@ struct dsa_switch_ops {
int (*port_mdb_dump)(struct dsa_switch *ds, int port,
 struct switchdev_obj_port_mdb *mdb,
 int (*cb)(struct switchdev_obj *obj));
+
+   /*
+* IPV4 routing
+*/
+   int (*ipv4_fib_prepare)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_ipv4_fib *fib4,
+   struct switchdev_trans *trans);
+   int (*ipv4_fib_add)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_ipv4_fib *fib4,
+   struct switchdev_trans *trans);
+   int (*ipv4_fib_del)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_ipv4_fib *fib4);
 };
 
 void register_switch_driver(struct dsa_switch_ops *type);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9ecbe78..c974ac0 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -334,6 +334,38 @@ static int dsa_slave_port_mdb_dump(struct net_device *dev,
return -EOPNOTSUPP;
 }
 
+static int dsa_slave_ipv4_fib_add(struct net_device *dev,
+ const struct switchdev_obj_ipv4_fib *fib4,
+ struct switchdev_trans *trans)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+   int ret;
+
+   if (!ds->ops->ipv4_fib_prepare || !ds->ops->ipv4_fib_add)
+   return -EOPNOTSUPP;
+
+   if (switchdev_trans_ph_prepare(trans))
+   ret = ds->ops->ipv4_fib_prepare(ds, p->port, fib4, trans);
+   else
+   ret = ds->ops->ipv4_fib_add(ds, p->port, fib4, trans);
+
+   return ret;
+}
+
+static int dsa_slave_ipv4_fib_del(struct net_device *dev,
+ const struct switchdev_obj_ipv4_fib *fib4)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+   int ret = -EOPNOTSUPP;
+
+   if (ds->ops->ipv4_fib_del)
+   ret = ds->ops->ipv4_fib_del(ds, p->port, fib4);
+
+   return ret;
+}
+
 static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
@@ -465,6 +497,11 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
  SWITCHDEV_OBJ_PORT_VLAN(obj),
  trans);
break;
+   case SWITCHDEV_OBJ_ID_IPV4_FIB:
+   err = dsa_slave_ipv4_fib_add(dev,
+SWITCHDEV_OBJ_IPV4_FIB(obj),
+trans);
+

[PATCH V2] net: dsa: add FIB support

2016-09-12 Thread John Crispin
Add SWITCHDEV_OBJ_ID_IPV4_FIB support to the DSA layer.

Signed-off-by: John Crispin 
---
Changes in V2
* rebase on latest net-next to fix compile errors

 Documentation/networking/dsa/dsa.txt |   18 +++
 include/net/dsa.h|   13 +++
 net/dsa/slave.c  |   41 ++
 3 files changed, 72 insertions(+)

diff --git a/Documentation/networking/dsa/dsa.txt 
b/Documentation/networking/dsa/dsa.txt
index 6d6c07c..6cd5831 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -607,6 +607,24 @@ of DSA, would be the its port-based VLAN, used by the 
associated bridge device.
   function that the driver has to call for each MAC address known to be behind
   the given port. A switchdev object is used to carry the VID and MDB info.
 
+Route offloading
+
+
+- ipv4_fib_prepare: routing layer function invoked to prepare the installation
+  of an ipv4 route into the switches routing database. If the operation is not
+  supported this function should return -EOPNOTSUPP. No hardware setup must be
+  done in this function. See ipv4_fib_add for this and details.
+
+- ipv4_fib_add: routing layer function invoked when a new ipv4 route should be
+  installed into the routing database of the switch, it should be programmed
+  with the route for the specified VLAN ID of the device that the route applies
+  to.
+
+- ipv4_fib_del: routing layer function invoked when an ipv4 route should be
+  removed from the routing database, the switch hardware should be programmed
+  to delete the specified MAC address of the nexthop from the specified VLAN ID
+  if it was mapped into the forwarding database.
+
 TODO
 
 
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7556646..7fdd63e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -237,6 +237,7 @@ struct switchdev_obj;
 struct switchdev_obj_port_fdb;
 struct switchdev_obj_port_mdb;
 struct switchdev_obj_port_vlan;
+struct switchdev_obj_ipv4_fib;
 
 struct dsa_switch_ops {
struct list_headlist;
@@ -386,6 +387,18 @@ struct dsa_switch_ops {
int (*port_mdb_dump)(struct dsa_switch *ds, int port,
 struct switchdev_obj_port_mdb *mdb,
 int (*cb)(struct switchdev_obj *obj));
+
+   /*
+* IPV4 routing
+*/
+   int (*ipv4_fib_prepare)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_ipv4_fib *fib4,
+   struct switchdev_trans *trans);
+   int (*ipv4_fib_add)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_ipv4_fib *fib4,
+   struct switchdev_trans *trans);
+   int (*ipv4_fib_del)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_ipv4_fib *fib4);
 };
 
 void register_switch_driver(struct dsa_switch_ops *type);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9ecbe78..c974ac0 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -334,6 +334,38 @@ static int dsa_slave_port_mdb_dump(struct net_device *dev,
return -EOPNOTSUPP;
 }
 
+static int dsa_slave_ipv4_fib_add(struct net_device *dev,
+ const struct switchdev_obj_ipv4_fib *fib4,
+ struct switchdev_trans *trans)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+   int ret;
+
+   if (!ds->ops->ipv4_fib_prepare || !ds->ops->ipv4_fib_add)
+   return -EOPNOTSUPP;
+
+   if (switchdev_trans_ph_prepare(trans))
+   ret = ds->ops->ipv4_fib_prepare(ds, p->port, fib4, trans);
+   else
+   ret = ds->ops->ipv4_fib_add(ds, p->port, fib4, trans);
+
+   return ret;
+}
+
+static int dsa_slave_ipv4_fib_del(struct net_device *dev,
+ const struct switchdev_obj_ipv4_fib *fib4)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+   int ret = -EOPNOTSUPP;
+
+   if (ds->ops->ipv4_fib_del)
+   ret = ds->ops->ipv4_fib_del(ds, p->port, fib4);
+
+   return ret;
+}
+
 static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
@@ -465,6 +497,11 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
  SWITCHDEV_OBJ_PORT_VLAN(obj),
  trans);
break;
+   case SWITCHDEV_OBJ_ID_IPV4_FIB:
+   err = dsa_slave_ipv4_fib_add(dev,
+SWITCHDEV_OBJ_IPV4_FIB(obj),
+trans);
+   break;
default:
err = -EOPNOTSUPP;
   

Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not

2016-09-12 Thread Tom Herbert
On Mon, Sep 12, 2016 at 8:00 PM, Alexander Duyck
 wrote:
> On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
>  wrote:
>> The BQL API does not reference the sk_buff nor does the driver need to
>> reference the sk_buff to calculate the length of a transmitted frame.
>> This patch removes an sk_buff reference from the xmit irq path and
>> also allows packets sent from XDP to use BQL.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++-
>>  1 file changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
>> b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> index f42129d..62a7f8d 100644
>> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
>> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter 
>> *adapter,
>> if (cleaned) {
>> total_tx_packets += buffer_info->segs;
>> total_tx_bytes += buffer_info->bytecount;
>> -   if (buffer_info->skb) {
>> -   bytes_compl += buffer_info->skb->len;
>> -   pkts_compl++;
>> -   }
>> -
>> +   bytes_compl += buffer_info->length;
>> +   pkts_compl++;
>> }
>> e1000_unmap_and_free_tx_resource(adapter, 
>> buffer_info);
>> tx_desc->upper.data = 0;
>
> Actually it might be worth looking into why we have two different
> stats for tracking bytecount and segs.  From what I can tell the
> pkts_compl value is never actually used.  The function doesn't even
> use the value so it is just wasted cycles.  And as far as the bytes go

Transmit flow steering, which I posted and which is pending some testing,
uses the BQL packet count to track in-flight packets.

> the accounting would be more accurate if you were to use bytecount
> instead of buffer_info->skb->len.  You would just need to update the
> xmit function to use that on the other side so that they match.
>
> - Alex


Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Alexander Duyck
On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
 wrote:
> From: Alexei Starovoitov 
>
> This patch adds initial support for XDP on e1000 driver. Note e1000
> driver does not support page recycling in general which could be
> added as a further improvement. However XDP_DROP case will recycle.
> XDP_TX and XDP_PASS do not support recycling.
>
> e1000 only supports a single tx queue at this time so the queue
> is shared between xdp program and Linux stack. It is possible for
> an XDP program to starve the stack in this model.
>
> The XDP program will drop packets on XDP_TX errors. This can occur
> when the tx descriptors are exhausted. This behavior is the same
> for both shared queue models like e1000 and dedicated tx queue
> models used in multiqueue devices. However if both the stack and
> XDP are transmitting packets it is perhaps more likely to occur in
> the shared queue model. Further refinement to the XDP model may be
> possible in the future.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device.
>
> CC: William Tu 
> Signed-off-by: Alexei Starovoitov 
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h  |2
>  drivers/net/ethernet/intel/e1000/e1000_main.c |  176 
> +
>  2 files changed, 175 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h 
> b/drivers/net/ethernet/intel/e1000/e1000.h
> index d7bdea7..5cf8a0a 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -150,6 +150,7 @@ struct e1000_adapter;
>   */
>  struct e1000_tx_buffer {
> struct sk_buff *skb;
> +   struct page *page;
> dma_addr_t dma;
> unsigned long time_stamp;
> u16 length;

I'm not really a huge fan of adding yet another member to this
structure.  Each e1000_tx_buffer is already pretty big at 40 bytes,
pushing it to 48 just means we lose that much more memory.  If nothing
else we may want to look at doing something like creating a union
between the skb, page, and an unsigned long.  Then you could use the
lowest bit of the address as a flag indicating if this is a skb or a
page.
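
To make the idea concrete, a rough sketch (untested, names invented, the other
fields copied from the struct quoted above) of a single word with a low-order
tag bit:

#define E1000_TX_FLAG_PAGE	0x1UL

struct e1000_tx_buffer_sketch {
	unsigned long buf;	/* skb or page pointer, bit 0 set for page */
	dma_addr_t dma;
	unsigned long time_stamp;
	u16 length;
	/* ... remaining fields unchanged ... */
};

static inline struct sk_buff *e1000_tx_buf_skb(struct e1000_tx_buffer_sketch *b)
{
	return (b->buf & E1000_TX_FLAG_PAGE) ? NULL : (struct sk_buff *)b->buf;
}

static inline struct page *e1000_tx_buf_page(struct e1000_tx_buffer_sketch *b)
{
	return (b->buf & E1000_TX_FLAG_PAGE) ?
	       (struct page *)(b->buf & ~E1000_TX_FLAG_PAGE) : NULL;
}

That keeps the structure at its current size and still lets the cleanup path
tell the two cases apart.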

> @@ -279,6 +280,7 @@ struct e1000_adapter {
>  struct e1000_rx_ring *rx_ring,
>  int cleaned_count);
> struct e1000_rx_ring *rx_ring;  /* One per active queue */
> +   struct bpf_prog *prog;
> struct napi_struct napi;
>
> int num_tx_queues;
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
> b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 62a7f8d..232b927 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  char e1000_driver_name[] = "e1000";
>  static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
> return 0;
>  }
>
> +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
> +{
> +   struct e1000_adapter *adapter = netdev_priv(netdev);
> +   struct bpf_prog *old_prog;
> +
> +   old_prog = xchg(&adapter->prog, prog);
> +   if (old_prog) {
> +   synchronize_net();
> +   bpf_prog_put(old_prog);
> +   }
> +
> +   if (netif_running(netdev))
> +   e1000_reinit_locked(adapter);
> +   else
> +   e1000_reset(adapter);

What is the point of the reset?  If the interface isn't running, is
there anything in the hardware you actually need to clean up?

> +   return 0;
> +}
> +
> +static bool e1000_xdp_attached(struct net_device *dev)
> +{
> +   struct e1000_adapter *priv = netdev_priv(dev);
> +
> +   return !!priv->prog;
> +}
> +
> +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> +{
> +   switch (xdp->command) {
> +   case XDP_SETUP_PROG:
> +   return e1000_xdp_set(dev, xdp->prog);
> +   case XDP_QUERY_PROG:
> +   xdp->prog_attached = e1000_xdp_attached(dev);
> +   return 0;
> +   default:
> +   return -EINVAL;
> +   }
> +}
> +
>  static const struct net_device_ops e1000_netdev_ops = {
> .ndo_open   = e1000_open,
> .ndo_stop   = e1000_close,
> @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
>  #endif
> .ndo_fix_features   = e1000_fix_features,
> .ndo_set_features   = e1000_set_features,
> +   .ndo_xdp= e1000_xdp,
>  };
>
>  /**
> @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
> e1000_down_and_stop(adapter);
> e1000_release_manageability(adapter);
>
> +   if (adapter->prog)
> +   bpf_prog_put(adapter->prog);
> +
> unregister_netdev(netdev);
>
>   

Re: [PATCH 00/15] drivers: net: use IS_ENABLED() instead of checking for built-in or module

2016-09-12 Thread David Miller
From: Javier Martinez Canillas 
Date: Mon, 12 Sep 2016 10:03:31 -0400

> This trivial series is similar to [0] for net/ that you already merged, but
> for drivers/net. The patches replace the open coding that checks for a Kconfig
> symbol being built-in or a module with the IS_ENABLED() macro, which does the same.

Series applied, thanks.


Re: icmpv6: issue with routing table entries from link local addresses

2016-09-12 Thread Sowmini Varadhan
On (09/13/16 04:42), Hannes Frederic Sowa wrote:
> > But a couple of unexpected things I noticed in linux: the link-local
> > prefix should have a prefixlen of /10 according to
> > http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml
> > but "ip -6 route show"  lists this as a /64..
> 
> The link local subnet is still specified to be a /64 as the other parts
> of the address must be 0. Legally we probably could blackhole them.
> https://tools.ietf.org/html/rfc4291#section-2.5.6

A bit of a gray area. RFC 4291 does not specify this as MBZ, and the IANA
registration is a /10. Both Solaris and BSD use /10. And while fec0
is deprecated, I suppose some similar thing could come up in the
future. ymmv.

> We don't have urpf checks for ipv6, those are implemented in netfilter
> only. This could very well be a firewall issue or something like that.

Yes, I know that (there is no rp_filter check for ipv6), and that's why I said
it may be some similar variant.  What tripped me up is that onlink prefixes
(which are multipath routes in that they have the same dst, mask and metric)
are not treated as part of the typical IP_ROUTE_MULTIPATH handling in many
places in the code, because the fib_nhs data structures do not get set up.
(Thus, e.g., one ipoib config I was looking at recently, which
had multiple ports connected to the same IB switch and had the same
onlink prefix on these ports, would not load-spread across all ports
until I explicitly did an 'ip route change' to tell the kernel to
ECMP that prefix.)

Let's see what Andreas reports.

--Sowmini


Re: icmpv6: issue with routing table entries from link local addresses

2016-09-12 Thread YOSHIFUJI Hideaki
Sowmini Varadhan wrote:
> On Mon, Sep 12, 2016 at 1:26 PM, Hannes Frederic Sowa
>  wrote:
>> Hello,
>>
>> On 12.09.2016 16:27, Andreas Hübner wrote:
> 
>>>
>>> I have the following setup:
>>>   - 2 directly connected hosts (A+B), both have only link local addresses
>>> configured (interface on both hosts is eth0)
>:
>>>   fe80::/64 dev eth1  proto kernel  metric 256
>>>   fe80::/64 dev eth0  proto kernel  metric 256
>:
>>> The issue I currently have is, that the echo reply that host B should
>>> generate is never sent back to host A. If I change the order of the
>>> routing table entries on host B, everything works fine.
>>> (host A is connected on eth0)
>:
>> This shouldn't be the case. We certainly carry over the ifindex of the
>> received packet into the routing lookup of the outgoing packet, thus the
>> appropriate rule, with outgoing ifindex should be selected.
> 
> Like Hannes,  I too would first check "is B sending out the echo-resp? on
> which interface?".
> 
> But a couple of unexpected things I noticed in linux: the link-local
> prefix should have a prefixlen of /10 according to
> http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml
> but "ip -6 route show"  lists this as a /64..

Do not be confused; the link-local address for Ethernet is described by
the IPv6-over-FOO document (e.g., RFC 2464 for Ethernet).  The address
(fe80::/64 for Ethernet, for example) is defined inside the link-local
scoped unicast address space (/10).

-- 
Hideaki Yoshifuji 
Technical Division, MIRACLE LINUX CORPORATION


Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not

2016-09-12 Thread Alexander Duyck
On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
 wrote:
> The BQL API does not reference the sk_buff nor does the driver need to
> reference the sk_buff to calculate the length of a transmitted frame.
> This patch removes an sk_buff reference from the xmit irq path and
> also allows packets sent from XDP to use BQL.
>
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
> b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index f42129d..62a7f8d 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter 
> *adapter,
> if (cleaned) {
> total_tx_packets += buffer_info->segs;
> total_tx_bytes += buffer_info->bytecount;
> -   if (buffer_info->skb) {
> -   bytes_compl += buffer_info->skb->len;
> -   pkts_compl++;
> -   }
> -
> +   bytes_compl += buffer_info->length;
> +   pkts_compl++;
> }
> e1000_unmap_and_free_tx_resource(adapter, 
> buffer_info);
> tx_desc->upper.data = 0;

Actually it might be worth looking into why we have two different
stats for tracking bytecount and segs.  From what I can tell the
pkts_compl value is never actually used.  The function doesn't even
use the value so it is just wasted cycles.  And as far as the bytes go
the accounting would be more accurate if you were to use bytecount
instead of buffer_info->skb->len.  You would just need to update the
xmit function to use that on the other side so that they match.
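
Something like this on the transmit side (sketch only, not compile-tested)
would keep the two halves of the BQL accounting symmetric, assuming the
accounting today is done with skb->len in e1000_xmit_frame():

	/* account BQL with the same descriptor byte count that
	 * e1000_clean_tx_irq() sums up on completion */
	netdev_sent_queue(netdev, buffer_info->bytecount);

i.e. replace the skb->len based netdev_sent_queue() call with the per-buffer
bytecount so both sides count the same thing.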

- Alex


Re: [RFC 02/11] Add RoCE driver framework

2016-09-12 Thread Mark Bloch

Hi Ram,

Just a few thoughts 

On 12/09/2016 19:07, Ram Amrani wrote:
> Adds a skeletal implementation of the qed* RoCE driver -
> basically the ability to communicate with the qede driver and
> receive notifications from it regarding various init/exit events.
> 
> Signed-off-by: Rajesh Borundia 
> Signed-off-by: Ram Amrani 
> ---
>  drivers/infiniband/Kconfig   |   2 +
>  drivers/infiniband/hw/Makefile   |   1 +
>  drivers/infiniband/hw/qedr/Kconfig   |   7 +
>  drivers/infiniband/hw/qedr/Makefile  |   3 +
>  drivers/infiniband/hw/qedr/main.c| 293 +
>  drivers/infiniband/hw/qedr/qedr.h|  60 ++
>  drivers/net/ethernet/qlogic/qede/Makefile|   1 +
>  drivers/net/ethernet/qlogic/qede/qede.h  |   9 +
>  drivers/net/ethernet/qlogic/qede/qede_main.c |  35 ++-
>  drivers/net/ethernet/qlogic/qede/qede_roce.c | 309 
> +++
>  include/linux/qed/qed_if.h   |   3 +-
>  include/linux/qed/qede_roce.h|  88 
>  include/uapi/linux/pci_regs.h|   3 +
>  13 files changed, 803 insertions(+), 11 deletions(-)
>  create mode 100644 drivers/infiniband/hw/qedr/Kconfig
>  create mode 100644 drivers/infiniband/hw/qedr/Makefile
>  create mode 100644 drivers/infiniband/hw/qedr/main.c
>  create mode 100644 drivers/infiniband/hw/qedr/qedr.h
>  create mode 100644 drivers/net/ethernet/qlogic/qede/qede_roce.c
>  create mode 100644 include/linux/qed/qede_roce.h

[SNIP]

> +
> +MODULE_DESCRIPTION("QLogic 40G/100G ROCE Driver");
> +MODULE_AUTHOR("QLogic Corporation");
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION(QEDR_MODULE_VERSION);
> +
> +uint debug;
> +module_param(debug, uint, 0);
> +MODULE_PARM_DESC(debug, "Default debug msglevel");

Why are you adding this as a module parameter? 


> +static LIST_HEAD(qedr_dev_list);
> +static DEFINE_SPINLOCK(qedr_devlist_lock);
> +

You already have a qedr_dev_list mutex in the qede_roce.c file,
why do you need this spinlock as well?

> +void qedr_ib_dispatch_event(struct qedr_dev *dev, u8 port_num,
> + enum ib_event_type type)
> +{
> + struct ib_event ibev;
> +
> + ibev.device = &dev->ibdev;
> + ibev.element.port_num = port_num;
> + ibev.event = type;
> +
> + ib_dispatch_event(&ibev);
> +}
> +
> +static enum rdma_link_layer qedr_link_layer(struct ib_device *device,
> + u8 port_num)
> +{
> + return IB_LINK_LAYER_ETHERNET;
> +}
> +
> +static int qedr_register_device(struct qedr_dev *dev)
> +{
> + strlcpy(dev->ibdev.name, "qedr%d", IB_DEVICE_NAME_MAX);
> +
> + memcpy(dev->ibdev.node_desc, QEDR_NODE_DESC, sizeof(QEDR_NODE_DESC));
> + dev->ibdev.owner = THIS_MODULE;
> +
> + dev->ibdev.get_link_layer = qedr_link_layer;
> +
> + return 0;
> +}
> +
> +/* QEDR sysfs interface */
> +static ssize_t show_rev(struct device *device, struct device_attribute *attr,
> + char *buf)
> +{
> + struct qedr_dev *dev = dev_get_drvdata(device);
> +
> + return scnprintf(buf, PAGE_SIZE, "0x%x\n", dev->pdev->vendor);
> +}
> +
> +static ssize_t show_fw_ver(struct device *device, struct device_attribute 
> *attr,
> +char *buf)
> +{
> + return scnprintf(buf, PAGE_SIZE, "%s\n", "FW_VER_TO_ADD");
> +}

Ira Weiny has added a generic way to expose firmware versions in the RDMA
stack; can you please have a look at c73428230d98d1352bcc69cd8306c292a85e1e42
and see how he converted the mlx5_ib module to use it?
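
For reference, the conversion would look roughly like this (sketch only; the
callback signature is assumed from the 4.8-era ib_device API, and the version
string is the same placeholder the RFC already uses until the real firmware
query is wired up):

static void qedr_get_dev_fw_str(struct ib_device *ibdev, char *str,
				size_t str_len)
{
	snprintf(str, str_len, "%s", "FW_VER_TO_ADD");
}

and in qedr_register_device():

	dev->ibdev.get_dev_fw_str = qedr_get_dev_fw_str;

which lets you drop the private fw_ver sysfs attribute entirely.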

> +static ssize_t show_hca_type(struct device *device,
> +  struct device_attribute *attr, char *buf)
> +{
> + return scnprintf(buf, PAGE_SIZE, "%s\n", "HCA_TYPE_TO_SET");
> +}
> +
> +static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
> +static DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
> +static DEVICE_ATTR(hca_type, S_IRUGO, show_hca_type, NULL);
> +
> +static struct device_attribute *qedr_attributes[] = {
> + &dev_attr_hw_rev,
> + &dev_attr_fw_ver,
> + &dev_attr_hca_type
> +};
> +
> +static void qedr_remove_sysfiles(struct qedr_dev *dev)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(qedr_attributes); i++)
> + device_remove_file(&dev->ibdev.dev, qedr_attributes[i]);
> +}
> +
> +void qedr_config_debug(uint debug, u32 *p_dp_module, u8 *p_dp_level)
> +{
> + *p_dp_level = 0;
> + *p_dp_module = 0;
> +
> + if (debug & QED_LOG_VERBOSE_MASK) {
> + *p_dp_level = QED_LEVEL_VERBOSE;
> + *p_dp_module = (debug & 0x3FFF);
> + } else if (debug & QED_LOG_INFO_MASK) {
> + *p_dp_level = QED_LEVEL_INFO;
> + } else if (debug & QED_LOG_NOTICE_MASK) {
> + *p_dp_level = QED_LEVEL_NOTICE;
> + }
> +}
> +
> +static void qedr_pci_set_atomic(struct qedr_dev *dev, struct pci_dev *pdev)
> +{
> + struct pci_dev *bridge;
> + u32 val;
> +
> + dev->atom

Re: icmpv6: issue with routing table entries from link local addresses

2016-09-12 Thread Hannes Frederic Sowa
On 13.09.2016 04:03, Sowmini Varadhan wrote:
> On Mon, Sep 12, 2016 at 1:26 PM, Hannes Frederic Sowa
>  wrote:
>> Hello,
>>
>> On 12.09.2016 16:27, Andreas Hübner wrote:
> 
>>>
>>> I have the following setup:
>>>   - 2 directly connected hosts (A+B), both have only link local addresses
>>> configured (interface on both hosts is eth0)
>:
>>>   fe80::/64 dev eth1  proto kernel  metric 256
>>>   fe80::/64 dev eth0  proto kernel  metric 256
>:
>>> The issue I currently have is, that the echo reply that host B should
>>> generate is never sent back to host A. If I change the order of the
>>> routing table entries on host B, everything works fine.
>>> (host A is connected on eth0)
>:
>> This shouldn't be the case. We certainly carry over the ifindex of the
>> received packet into the routing lookup of the outgoing packet, thus the
>> appropriate rule, with outgoing ifindex should be selected.
> 
> Like Hannes,  I too would first check "is B sending out the echo-resp? on
> which interface?".
> 
> But a couple of unexpected things I noticed in linux: the link-local
> prefix should have a prefixlen of /10 according to
> http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml
> but "ip -6 route show"  lists this as a /64..

The link local subnet is still specified to be a /64 as the other parts
of the address must be 0. Legally we probably could blackhole them.
https://tools.ietf.org/html/rfc4291#section-2.5.6

> moreover, even though I cannot use "ip [-6] route add.." to add the
> same prefix multiple times (with different nexthop and/or interface)
> unless I explicitly mark them as ECMP with /sbin/ip, it seems like you
> can create the same onlink prefix on multiple interfaces, but the
> kernel will not treat this as an ECMP group (and sometimes this
> can produce inconsistent results depending on the order of
> route addition, e.g., for ipv4 rp_filter checks). I don't know if some
> variant of this (latter observation) may be the reason for the behavior
> that Andreas reports.

iproute sets the NLM_F_EXCL flag. Use ip route prepend ...

We don't have urpf checks for ipv6, those are implemented in netfilter
only. This could very well be a firewall issue or something like that.

Bye,
Hannes



Re: icmpv6: issue with routing table entries from link local addresses

2016-09-12 Thread Sowmini Varadhan
On Mon, Sep 12, 2016 at 1:26 PM, Hannes Frederic Sowa
 wrote:
> Hello,
>
> On 12.09.2016 16:27, Andreas Hübner wrote:

>>
>> I have the following setup:
>>   - 2 directly connected hosts (A+B), both have only link local addresses
>> configured (interface on both hosts is eth0)
   :
>>   fe80::/64 dev eth1  proto kernel  metric 256
>>   fe80::/64 dev eth0  proto kernel  metric 256
   :
>> The issue I currently have is, that the echo reply that host B should
>> generate is never sent back to host A. If I change the order of the
>> routing table entries on host B, everything works fine.
>> (host A is connected on eth0)
   :
> This shouldn't be the case. We certainly carry over the ifindex of the
> received packet into the routing lookup of the outgoing packet, thus the
> appropriate rule, with outgoing ifindex should be selected.

Like Hannes,  I too would first check "is B sending out the echo-resp? on
which interface?".

But a couple of unexpected things I noticed in linux: the link-local
prefix should have a prefixlen of /10 according to
http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml
but "ip -6 route show"  lists this as a /64..

moreover, even though I cannot use "ip [-6] route add.." to add the
same prefix multiple times (with different nexthop and/or interface)
unless I explicitly mark them as ECMP with /sbin/ip, it seems like you
can create the same onlink prefix on multiple interfaces, but the
kernel will not treat this as an ECMP group (and sometimes this
can produce inconsistent results depending on the order of
route addition, e.g., for ipv4 rp_filter checks). I don't know if some
variant of this (latter observation) may be the reason for the behavior
that Andreas reports.

--Sowmini


Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet  wrote:
> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >
> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> The simplest program that drops all the packets will make the box 
> >> unpingable.
> >
> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> > scooter on 101 highway ;)
> >
> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> > absolutely no documentation on it, warning users about possible
> > limitations/outcomes.
> >
> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >
> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> > but tx completion will call netdev_tx_completed_queue() -> crash
> >
> > Do we have one test to validate that a XDP_TX implementation is actually
> > correct ?
> >
> Obviously not for e1000 :-(. We really need some real test and
> performance results and analysis on the interaction between the stack
> data path and XDP data path.

No, we don't need it for e1k and we cannot really do it;
this patch is for debugging of xdp programs only.

> The fact that these changes are being
> passed off as something only needed for KCM is irrelevant; e1000 is a
> well-deployed NIC and there's no restriction that I see that would
> prevent any users from enabling this feature on real devices.

e1k is not even manufactured any more. Probably the only place
where it can be found is a computer history museum.
e1000e fares slightly better, but it's a different nic and this
patch is not about it.



Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-12 Thread Andrew Lunn
> +static int
> +qca8k_setup(struct dsa_switch *ds)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> + int ret, i, phy_mode = -1;
> +
> + /* Keep track of which port is the connected to the cpu. This can be
> +  * phy 11 / port 0 or phy 5 / port 6.
> +  */
> + switch (dsa_upstream_port(ds)) {
> + case 11:
> + priv->cpu_port = 0;
> + break;
> + case 5:
> + priv->cpu_port = 6;
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }

Hi John

Can you explain this funkiness with the CPU port numbers? Why use 11
instead of 0? Ideally I would like to use 0 here, but if that is not
possible, it needs documenting in the device tree binding that 5 and
11 are special, representing ports 0 and 6.

> +
> + /* Start by setting up the register mapping */
> + priv->regmap = devm_regmap_init(ds->dev, NULL, priv,
> + &qca8k_regmap_config);
> +
> + if (IS_ERR(priv->regmap))
> + pr_warn("regmap initialization failed");
> +
> + /* Initialize CPU port pad mode (xMII type, delays...) */
> + phy_mode = of_get_phy_mode(ds->ports[ds->dst->cpu_port].dn);
> + if (phy_mode < 0) {
> + pr_err("Can't find phy-mode for master device\n");
> + return phy_mode;
> + }
> + ret = qca8k_set_pad_ctrl(priv, priv->cpu_port, phy_mode);
> + if (ret < 0)
> + return ret;
> +
> + /* Enable CPU Port */
> + qca8k_reg_set(priv, QCA8K_REG_GLOBAL_FW_CTRL0,
> +   QCA8K_GLOBAL_FW_CTRL0_CPU_PORT_EN);

Does it matter which port is the CPU port here? 11 or 5

> +
> + /* Enable MIB counters */
> + qca8k_reg_set(priv, QCA8K_REG_MIB, QCA8K_MIB_CPU_KEEP);
> + qca8k_write(priv, QCA8K_REG_MODULE_EN, QCA8K_MODULE_EN_MIB);
> +
> + /* Enable QCA header mode on Port 0 */
> + qca8k_write(priv, QCA8K_REG_PORT_HDR_CTRL(priv->cpu_port),
> + QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_TX_S |
> + QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_RX_S);

Does the header mode only work on port 0?

> +static int
> +qca8k_phy_read(struct dsa_switch *ds, int phy, int regnum)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> +
> + return mdiobus_read(priv->bus, phy, regnum);
> +}
> +
> +static int
> +qca8k_phy_write(struct dsa_switch *ds, int phy, int regnum, u16 val)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> +
> + return mdiobus_write(priv->bus, phy, regnum, val);
> +}

snip

> +static int
> +qca8k_sw_probe(struct mdio_device *mdiodev)
> +{
> + struct qca8k_priv *priv;
> + u32 phy_id;
> +
> + /* sw_addr is irrelevant as the switch occupies the MDIO bus from
> +  * addresses 0 to 4 (PHYs) and 16-23 (for MDIO 32bits protocol). So
> +  * we'll probe address 0 to see if we see the right switch family.
> +  */

So let's see if I have this right.

Port 0 has no internal phy.
Port 1 has an internal PHY at MDIO address 0.
Port 2 has an internal PHY at MDIO address 1.
...
Port 5 has an internal PHY at MDIO address 4.
Port 6 has no internal PHY.

This is why you have funky port numbers, and phy_to_port.

I think it would be a lot cleaner to handle this in qca8k_phy_read()
and qca8k_phy_write(). 
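
i.e. something like this untested sketch (assuming the internal PHYs for
ports 1-5 really do sit at MDIO addresses 0-4, as listed above):

static int
qca8k_phy_read(struct dsa_switch *ds, int port, int regnum)
{
	struct qca8k_priv *priv = qca8k_to_priv(ds);

	/* ports 0 and 6 have no internal PHY */
	if (port < 1 || port > 5)
		return 0xffff;

	/* port N uses the internal PHY at MDIO address N - 1 */
	return mdiobus_read(priv->bus, port - 1, regnum);
}

so the rest of the driver can keep using natural port numbers.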

Also, the comment is a bit misleading. You are probing the PHY ID, not
the switch ID. At least for the Marvell switches, different switches
can have the same embedded PHY. It would be annoying to find there is
another incompatible switch with the same PHY ID.

Is the embedded PHY compatible with the at803x driver?

   Thanks
Andrew


Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 04:46:08PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> 
> > yep. there are various ways to shoot yourself in the foot with xdp.
> > The simplest program that drops all the packets will make the box 
> > unpingable.
> 
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
> 
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

that's fair critique. there is no documentation for xdp in general.

> BTW, I am not sure mlx4 implementation even works, vs BQL :

it doesn't and it shouldn't. xdp and stack use different tx queues.

> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash

not quite. netdev_tx_completed_queue() is not called for xdp queues.

> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?

it's correct in the scope that it was defined for.
There is no objective to share the same tx queue with the stack in xdp.
The queues must be absolutely separate otherwise performance will tank.

This e1k patch is really special, because the e1k has one tx queue,
but this is for debugging of the programs, so it's ok to cut corners.
The e1k xdp support doesn't need to interact nicely with stack.
It's impossible in the first place. xdp is the layer before the stack.



Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-12 Thread Andrew Lunn
Hi John

> +static u16
> +qca8k_phy_mmd_read(struct qca8k_priv *priv, int phy_addr, u16 addr, u16 reg)
> +{
> + u16 data;
> +
> + mutex_lock(&priv->bus->mdio_lock);
> +
> + priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_ADDR, addr);
> + priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_DATA, reg);
> + priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_ADDR, addr | 0x4000);
> + data = priv->bus->read(priv->bus, phy_addr, MII_ATH_MMD_DATA);
> +
> + mutex_unlock(&priv->bus->mdio_lock);
> +
> + return data;
> +}

snip

> +static int
> +qca8k_get_eee(struct dsa_switch *ds, int port,
> +   struct ethtool_eee *e)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> + struct ethtool_eee *p = &priv->port_sts[qca8k_phy_to_port(port)].eee;
> + u32 lp, adv, supported;
> + u16 val;
> +
> + /* The switch has no way to tell the result of the AN so we need to
> +  * read the result directly from the PHYs MMD registers
> +  */
> + val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
> + supported = mmd_eee_cap_to_ethtool_sup_t(val);
> +
> + val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_AN, MDIO_AN_EEE_ADV);
> + adv = mmd_eee_adv_to_ethtool_adv_t(val);
> +
> + val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_AN, MDIO_AN_EEE_LPABLE);
> + lp = mmd_eee_adv_to_ethtool_adv_t(val);
> +
> + e->eee_enabled = p->eee_enabled;
> + e->eee_active = !!(supported & adv & lp);
> +
> + return 0;
> +}

Couldn't you just call phy_ethtool_get_eee(phydev)? Then you don't
need qca8k_phy_mmd_read()?

 Andrew


[PATCH v6 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

This action is intended to be an upgrade over pedit from a usability
perspective (as well as for operational debuggability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or, worse,
try to debug a policy with destination mac, source mac and
ethertype. Then make a few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesn't get dropped by or confuse an interconnecting (learning)
switch, or get dropped by a target machine (which looks at the dst mac). And
at times, when flipping the packet back, a swap of the MAC addresses is needed.
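
For readers skimming the patch: stripped of locking, stats and byte-order
handling, the per-packet work boils down to a flag-driven rewrite of the
Ethernet header, roughly as sketched here (the real code is in act_skbmod.c
below; this helper is illustrative only):

static void skbmod_rewrite_sketch(struct sk_buff *skb,
				  const struct tcf_skbmod_params *p)
{
	struct ethhdr *eth = eth_hdr(skb);
	u8 tmp[ETH_ALEN];

	if (p->flags & SKBMOD_F_DMAC)
		ether_addr_copy(eth->h_dest, p->eth_dst);
	if (p->flags & SKBMOD_F_SMAC)
		ether_addr_copy(eth->h_source, p->eth_src);
	if (p->flags & SKBMOD_F_ETYPE)
		eth->h_proto = p->eth_type;
	if (p->flags & SKBMOD_F_SWAPMAC) {
		ether_addr_copy(tmp, eth->h_dest);
		ether_addr_copy(eth->h_dest, eth->h_source);
		ether_addr_copy(eth->h_source, tmp);
	}
}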

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbmod.h|  30 
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +
 net/sched/Kconfig |  11 ++
 net/sched/Makefile|   1 +
 net/sched/act_skbmod.c| 301 ++
 5 files changed, 382 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include 
+#include 
+
+struct tcf_skbmod_params {
+   struct rcu_head rcu;
+   u64 flags; /*up to 64 types of operations; extend if needed */
+   u8  eth_dst[ETH_ALEN];
+   u16 eth_type;
+   u8  eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+   struct tc_actioncommon;
+   struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h 
b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include 
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC  0x1
+#define SKBMOD_F_SMAC  0x2
+#define SKBMOD_F_ETYPE 0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+   tc_gen;
+   __u64 flags;
+};
+
+enum {
+   TCA_SKBMOD_UNSPEC,
+   TCA_SKBMOD_TM,
+   TCA_SKBMOD_PARMS,
+   TCA_SKBMOD_DMAC,
+   TCA_SKBMOD_SMAC,
+   TCA_SKBMOD_ETYPE,
+   TCA_SKBMOD_PAD,
+   __TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
  To compile this code as a module, choose M here: the
  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+tristate "skb data modification action"
+depends on NET_CLS_ACT
+---help---
+ Say Y here to allow modification of skb data
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_skbmod.
+
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.

linux-next: manual merge of the netfilter-next tree with the net tree

2016-09-12 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the netfilter-next tree got a conflict in:

  net/netfilter/nf_tables_netdev.c

between commit:

  c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr 
assignment")

from the net tree and commit:

  ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()")

from the netfilter-next tree.

I fixed it up (I used the latter version of this file and applied the
patch below) and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging.  You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

From: Stephen Rothwell 
Date: Tue, 13 Sep 2016 10:08:58 +1000
Subject: [PATCH] netfilter: merge fixup for "nf_tables_netdev: remove
 redundant ip_hdr assignment"

Signed-off-by: Stephen Rothwell 
---
 include/net/netfilter/nf_tables_ipv4.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/netfilter/nf_tables_ipv4.h 
b/include/net/netfilter/nf_tables_ipv4.h
index 968f00b82fb5..25e33aee91e7 100644
--- a/include/net/netfilter/nf_tables_ipv4.h
+++ b/include/net/netfilter/nf_tables_ipv4.h
@@ -33,7 +33,6 @@ __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt,
if (!iph)
return -1;
 
-   iph = ip_hdr(skb);
if (iph->ihl < 5 || iph->version != 4)
return -1;
 
-- 
2.8.1

-- 
Cheers,
Stephen Rothwell


Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

Ok, screwup. I used version 5 again ;-< I will resend and incorporate
Florian's comment.

cheers,
jamal


[PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

This action is intended to be an upgrade over pedit from a usability
perspective (as well as for operational debuggability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or, worse,
try to debug a policy with destination mac, source mac and
ethertype. Then make a few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesn't get dropped by or confuse an interconnecting (learning)
switch, or get dropped by a target machine (which looks at the dst mac). And
at times, when flipping the packet back, a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbmod.h|  30 
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +
 net/sched/Kconfig |  11 ++
 net/sched/Makefile|   1 +
 net/sched/act_skbmod.c| 301 ++
 5 files changed, 382 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include 
+#include 
+
+struct tcf_skbmod_params {
+   struct rcu_head rcu;
+   u64 flags; /*up to 64 types of operations; extend if needed */
+   u8  eth_dst[ETH_ALEN];
+   u16 eth_type;
+   u8  eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+   struct tc_actioncommon;
+   struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h 
b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include 
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC  0x1
+#define SKBMOD_F_SMAC  0x2
+#define SKBMOD_F_ETYPE 0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+   tc_gen;
+   __u64 flags;
+};
+
+enum {
+   TCA_SKBMOD_UNSPEC,
+   TCA_SKBMOD_TM,
+   TCA_SKBMOD_PARMS,
+   TCA_SKBMOD_DMAC,
+   TCA_SKBMOD_SMAC,
+   TCA_SKBMOD_ETYPE,
+   TCA_SKBMOD_PAD,
+   __TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
  To compile this code as a module, choose M here: the
  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+tristate "skb data modification action"
+depends on NET_CLS_ACT
+---help---
+ Say Y here to allow modification of skb data
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_skbmod.
+
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.

Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Tom Herbert
On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet  wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>
>> yep. there are various ways to shoot yourself in the foot with xdp.
>> The simplest program that drops all the packets will make the box unpingable.
>
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
>
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.
>
> BTW, I am not sure mlx4 implementation even works, vs BQL :
>
> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash
>
> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?
>
Obviously not for e1000 :-(. We really need some real test and
performance results and analysis on the interaction between the stack
data path and XDP data path. The fact that these changes are being
passed of as something only needed for KCM is irrelevant, e1000 is a
well deployed a NIC and there's no restriction that I see that would
prevent any users from enabling this feature on real devices. So these
patches need to be tested and justified. Eric's skepticism in all this
really doesn't surprise me...

Tom

>
>


Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Florian Westphal
Jamal Hadi Salim  wrote:
> +#define MAX_EDIT_LEN ETH_HLEN

This define is only referenced in a comment and seems unused otherwise.


Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 19:56 -0400, Jamal Hadi Salim wrote:

> 
> Yes, I did ;->
> Another version coming. Is there a good way to catch
> something like this (ala userspace valgrind)?

You might try CONFIG_DEBUG_KMEMLEAK=y,
or simply run your script deleting/inserting the action one million
time ;)






Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 07:40 PM, Eric Dumazet wrote:

On Mon, 2016-09-12 at 19:30 -0400, Jamal Hadi Salim wrote:

From: Jamal Hadi Salim 



+
+static struct tc_action_ops act_skbmod_ops = {
+   .kind   =   "skbmod",
+   .type   =   TCA_ACT_SKBMOD,
+   .owner  =   THIS_MODULE,
+   .act=   tcf_skbmod_run,
+   .dump   =   tcf_skbmod_dump,
+   .init   =   tcf_skbmod_init,
+   .walk   =   tcf_skbmod_walker,
+   .lookup =   tcf_skbmod_search,
+   .size   =   sizeof(struct tcf_skbmod),
+};


Looks like you missed a .cleanup() handler ?

Otherwise, d->skbmod_p wont be freed .





Yes, I did ;->
Another version coming. Is there a good way to catch
something like this (ala userspace valgrind)?

cheers,
jamal



Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:

> yep. there are various ways to shoot yourself in the foot with xdp.
> The simplest program that drops all the packets will make the box unpingable.

Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
scooter on 101 highway ;)

This XDP_TX thing was one of the XDP marketing stuff, but there is
absolutely no documentation on it, warning users about possible
limitations/outcomes.

BTW, I am not sure mlx4 implementation even works, vs BQL :

mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
but tx completion will call netdev_tx_completed_queue() -> crash

Do we have one test to validate that a XDP_TX implementation is actually
correct ?
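
For reference, the BQL pairing in question: whatever is accounted via
netdev_tx_sent_queue() on the xmit path must be matched by
netdev_tx_completed_queue() on the completion path for the same txq.
A minimal sketch (helper names are illustrative, this is not mlx4 code):

	#include <linux/netdevice.h>

	/* xmit side: account the bytes just queued to the HW ring */
	static void toy_xdp_tx_account(struct netdev_queue *txq, unsigned int len)
	{
		netdev_tx_sent_queue(txq, len);
	}

	/* completion side: report matching packet/byte counts, otherwise
	 * the BQL bookkeeping underflows or the queue never wakes up */
	static void toy_tx_clean_account(struct netdev_queue *txq,
					 unsigned int pkts, unsigned int bytes)
	{
		netdev_tx_completed_queue(txq, pkts, bytes);
	}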





Re: [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines

2016-09-12 Thread Tom Herbert
On Mon, Sep 12, 2016 at 3:14 PM, John Fastabend
 wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
John,

It still looks to me like this will break the normal TX path. Can you
test these patches on real e1000 to get performance numbers for both
the drop and forwarding case. Also can you run this against an
antagonist workload over the normal stack to see if one or the other
path can be starved?

Thanks,
Tom

> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h  |   10 +++
>  drivers/net/ethernet/intel/e1000/e1000_main.c |   80 
> +++--
>  2 files changed, 70 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h 
> b/drivers/net/ethernet/intel/e1000/e1000.h
> index 5cf8a0a..877b377 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -133,6 +133,8 @@ struct e1000_adapter;
>  #define E1000_TX_QUEUE_WAKE16
>  /* How many Rx Buffers do we bundle into one write to the hardware ? */
>  #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
> +/* How many XDP XMIT buffers to bundle into one xmit transaction */
> +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
>
>  #define AUTO_ALL_MODES 0
>  #define E1000_EEPROM_82544_APM 0x0004
> @@ -168,6 +170,11 @@ struct e1000_rx_buffer {
> dma_addr_t dma;
>  };
>
> +struct e1000_rx_buffer_bundle {
> +   struct e1000_rx_buffer *buffer;
> +   u32 length;
> +};
> +
>  struct e1000_tx_ring {
> /* pointer to the descriptor ring memory */
> void *desc;
> @@ -206,6 +213,9 @@ struct e1000_rx_ring {
> struct e1000_rx_buffer *buffer_info;
> struct sk_buff *rx_skb_top;
>
> +   /* array of XDP buffer information structs */
> +   struct e1000_rx_buffer_bundle *xdp_buffer;
> +
> /* cpu for rx queue */
> int cpu;
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
> b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 232b927..31489d4 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, 
> struct bpf_prog *prog)
> struct e1000_adapter *adapter = netdev_priv(netdev);
> struct bpf_prog *old_prog;
>
> +   if (!adapter->rx_ring[0].xdp_buffer) {
> +   int size = sizeof(struct e1000_rx_buffer_bundle) *
> +   E1000_XDP_XMIT_BUNDLE_MAX;
> +
> +   adapter->rx_ring[0].xdp_buffer = vzalloc(size);
> +   if (!adapter->rx_ring[0].xdp_buffer)
> +   return -ENOMEM;
> +   }
> +
> old_prog = xchg(&adapter->prog, prog);
> if (old_prog) {
> synchronize_net();
> @@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
> if (adapter->prog)
> bpf_prog_put(adapter->prog);
>
> +   if (adapter->rx_ring[0].xdp_buffer)
> +   vfree(adapter->rx_ring[0].xdp_buffer);
> +
> unregister_netdev(netdev);
>
> e1000_phy_hw_reset(hw);
> @@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring 
> *tx_ring,
>
>  static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>  u32 len,
> +struct e1000_adapter *adapter,
>  struct net_device *netdev,
> -struct e1000_adapter *adapter)
> +struct e1000_tx_ring *tx_ring)
>  {
> -   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> -   struct e1000_hw *hw = &adapter->hw;
> -   struct e1000_tx_ring *tx_ring;
> +   const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>
> if (len > E1000_MAX_DATA_PER_TXD)
> return;
>
> -   /* e1000 only support a single txq at the moment so the queue is being
> -* shared with stack. To support this requires locking to ensure the
> -* stack and XDP are not running at the same time. Devices with
> -* multiple queues should allocate a separate queue space.
> -*/
> -   HARD_TX_LOCK(netdev, txq, smp_processor_id());
> -
> -   tx_ring = adapter->tx_ring;
> -
> -   if (E1000_DESC_UNUSED(tx_ring) < 2) {
> -   HARD_TX_UNLOCK(netdev, txq);
> +   if (E1000_DESC_UNUSED(tx_ring) < 2)
> return;
> - 

Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 19:30 -0400, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim 

> +
> +static struct tc_action_ops act_skbmod_ops = {
> + .kind   =   "skbmod",
> + .type   =   TCA_ACT_SKBMOD,
> + .owner  =   THIS_MODULE,
> + .act=   tcf_skbmod_run,
> + .dump   =   tcf_skbmod_dump,
> + .init   =   tcf_skbmod_init,
> + .walk   =   tcf_skbmod_walker,
> + .lookup =   tcf_skbmod_search,
> + .size   =   sizeof(struct tcf_skbmod),
> +};

Looks like you missed a .cleanup() handler ?

Otherwise, d->skbmod_p wont be freed .
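
A minimal cleanup along these lines (sketch only, using the structures from
the patch above; the exact ops hook signature is assumed from memory) would
be enough to release the params block via RCU:

	static void tcf_skbmod_cleanup(struct tc_action *a, int bind)
	{
		struct tcf_skbmod *d = to_skbmod(a);
		struct tcf_skbmod_params *p;

		p = rcu_dereference_protected(d->skbmod_p, 1);
		kfree_rcu(p, rcu);
	}

and then wire it up with ".cleanup = tcf_skbmod_cleanup," in act_skbmod_ops.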





Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 07:20 PM, Eric Dumazet wrote:

On Mon, 2016-09-12 at 16:10 -0700, Eric Dumazet wrote:



diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 
508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041
 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running,
return;
}
do {
-   if (running)
+   if (running) {
+   local_bh_disable();
seq = read_seqcount_begin(running);
+   }
bstats->bytes = b->bytes;
bstats->packets = b->packets;
+   if (running)
+   local_bh_enable();
} while (running && read_seqcount_retry(running, seq));
 }


Ah well, forget this patch, re-enabling bh right before
read_seqcount_retry() is not going to help.



I have to say I have seen some odd issues once in a while reading
generic action stats.
I had a program that opened a netlink socket into the kernel.
Every X seconds it does a dump of all the actions to read the
stats.
There is a very reproducible behavior that the stats
are not in sync with the kernel. Given generic stats is lockless
I thought maybe rcu or per-cpu stats was the issue. I havent had time
to look closely.
The solution is instead of keeping the socket open all the time;
I open, read stats, close (repeat every x seconds).

If there is something you want me to try - I could do sometimes
this week. Your patch above may be useful!

cheers,
jamal




[PATCH v5 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

This action is intended to be an upgrade from a usability perspective
over pedit (as well as operational debuggability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or, worse,
try to debug a policy with destination mac, source mac and
ethertype. Then make a few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesn't get dropped by, or confuse, an interconnecting (learning)
switch, or get dropped by a target machine (which looks at the dst mac). And
at times, when flipping back the packet, a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbmod.h|  30 
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +
 net/sched/Kconfig |  11 ++
 net/sched/Makefile|   1 +
 net/sched/act_skbmod.c| 291 ++
 5 files changed, 372 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include 
+#include 
+
+struct tcf_skbmod_params {
+   struct rcu_head rcu;
+   u64 flags; /*up to 64 types of operations; extend if needed */
+   u8  eth_dst[ETH_ALEN];
+   u16 eth_type;
+   u8  eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+   struct tc_actioncommon;
+   struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h 
b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include 
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC  0x1
+#define SKBMOD_F_SMAC  0x2
+#define SKBMOD_F_ETYPE 0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+   tc_gen;
+   __u64 flags;
+};
+
+enum {
+   TCA_SKBMOD_UNSPEC,
+   TCA_SKBMOD_TM,
+   TCA_SKBMOD_PARMS,
+   TCA_SKBMOD_DMAC,
+   TCA_SKBMOD_SMAC,
+   TCA_SKBMOD_ETYPE,
+   TCA_SKBMOD_PAD,
+   __TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
  To compile this code as a module, choose M here: the
  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+tristate "skb data modification action"
+depends on NET_CLS_ACT
+---help---
+ Say Y here to allow modification of skb data
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_skbmod.
+
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.

Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 07:02 PM, Jamal Hadi Salim wrote:


Looks like typical starvation caused by aggressive softirq.



Well, then it is strange that in one case a tc dump of the rule
was immediate and in the other case it was consistent for 5-15
seconds.



I may be going nuts because i cant reproduce that issue anymore.
VM is configured for SMP 4 vcpus.

I will remove that check and submit a new version.

cheers,
jamal


Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms

2016-09-12 Thread Scott Wood
On Mon, 2016-09-12 at 06:39 +, Y.B. Lu wrote:
> Hi Scott,
> 
> Thanks for your review :)
> See my comment inline.
> 
> > 
> > -Original Message-
> > From: Scott Wood [mailto:o...@buserror.net]
> > Sent: Friday, September 09, 2016 11:47 AM
> > To: Y.B. Lu; linux-...@vger.kernel.org; ulf.hans...@linaro.org; Arnd
> > Bergmann
> > Cc: linuxppc-...@lists.ozlabs.org; devicet...@vger.kernel.org; linux-arm-
> > ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux-
> > c...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux-
> > foundation.org; netdev@vger.kernel.org; Mark Rutland; Rob Herring;
> > Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; Bhupesh
> > Sharma; Qiang Zhao; Kumar Gala; Santosh Shilimkar; Leo Li; X.B. Xie
> > Subject: Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
> > 
> > On Tue, 2016-09-06 at 16:28 +0800, Yangbo Lu wrote:
> > > 
> > > The global utilities block controls power management, I/O device
> > > enabling, power-onreset(POR) configuration monitoring, alternate
> > > function selection for multiplexed signals,and clock control.
> > > 
> > > This patch adds a driver to manage and access global utilities block.
> > > Initially only reading SVR and registering soc device are supported.
> > > Other guts accesses, such as reading RCW, should eventually be moved
> > > into this driver as well.
> > > 
> > > Signed-off-by: Yangbo Lu 
> > > Signed-off-by: Scott Wood 
> > Don't put my signoff on patches that I didn't put it on
> > myself.  Definitely don't put mine *after* yours on patches that were
> > last modified by you.
> > 
> > If you want to mention that the soc_id encoding was my suggestion, then
> > do so explicitly.
> > 
> [Lu Yangbo-B47093] I found your 'signoff' on this patch at below link.
> http://patchwork.ozlabs.org/patch/649211/
> 
> So, let me just change the order in next version ?
> Signed-off-by: Scott Wood 
> Signed-off-by: Yangbo Lu 

No.  This isn't my patch so my signoff shouldn't be on it.

> [Lu Yangbo-B47093] It's a good idea to move die into .family I think.
> In my opinion, it's better to keep svr and name in soc_id just like your
> suggestion above.
> > 
> > {
> > .soc_id = "svr:0x85490010,name:T1023E,",
> > .family = "QorIQ T1024",
> > }
> The user probably don’t like to learn the svr value. What they want is just
> to match the soc they use.
> It's convenient to use name+rev for them to match a soc.

What the user should want 99% of the time is to match the die (plus revision),
not the soc.

> Regarding shrinking the table, I think it's hard to use svr+mask. Because I
> find many platforms use different masks.
> We couldn’t know the mask according svr value.

The mask would be part of the table:

{
{
.die = "T1024",
.svr = 0x8540,
.mask = 0xfff0,
},
{
.die = "T1040",
.svr = 0x8520,
.mask = 0xfff0,
},
{
.die = "LS1088A",
.svr = 0x8703,
.mask = 0x,
},
...
}

There's a small risk that we get the mask wrong and a different die is created
that matches an existing table, but it doesn't seem too likely, and can easily
be fixed with a kernel update if it happens.
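
To make that concrete, a minimal sketch of a mask-based lookup over such a
table (names and types here are illustrative, not the final driver code):

	#include <linux/types.h>

	struct fsl_soc_die_attr {
		const char *die;
		u32 svr;
		u32 mask;
	};

	static const struct fsl_soc_die_attr *
	fsl_soc_die_match(u32 svr, const struct fsl_soc_die_attr *matches)
	{
		/* terminate on a zeroed sentinel entry at the end of the table */
		while (matches->svr) {
			if (matches->svr == (svr & matches->mask))
				return matches;
			matches++;
		}
		return NULL;
	}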

BTW, aren't ls2080a and ls2085a the same die?  And is there no non-E version
of LS2080A/LS2040A?

> > > + do {
> > > + if (!matches->soc_id)
> > > + return NULL;
> > > + if (glob_match(svr_match, matches->soc_id))
> > > + break;
> > > + } while (matches++);
> > Are you expecting "matches++" to ever evaluate as false?
> [Lu Yangbo-B47093] Yes, this is used to match the soc we use in qoriq_soc
> array until getting true. 
> We need to get the name and die information defined in array.

I'm not asking whether the glob_match will ever return true.  I'm saying that
"matches++" will never become NULL.

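One way to write that walk without the always-true "matches++" condition, in
terms of the variables from the quoted hunk (untested sketch):

	for (; matches->soc_id; matches++) {
		if (glob_match(svr_match, matches->soc_id))
			return matches;
	}
	return NULL;
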
> > > + /* Register soc device */
> > > + soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL);
> > > + if (!soc_dev_attr) {
> > > + ret = -ENOMEM;
> > > + goto out_unmap;
> > > + }
> > Couldn't this be statically allocated?
> [Lu Yangbo-B47093] Do you mean we define this struct statically ?
> 
> static struct soc_device_attribute soc_dev_attr;

Yes.

> > > +
> > > + soc_dev = soc_device_register(soc_dev_attr);
> > > + if (IS_ERR(soc_dev)) {
> > > + ret = -ENODEV;
> > Why are you changing the error code?
> [Lu Yangbo-B47093] What error code should we use ? :)

ret = PTR_ERR(soc_dev);

> > > +   }
> > > + return 0;
> > > +out:
> > > + kfree(soc_dev_attr->machine);
> > > + kfree(soc_dev_attr->family);
> > > + kfree(soc_dev_attr->soc_id);
> > > + kfree(soc_dev_attr->revision);
> > > + kfree(soc_dev_attr);
> > > +out_unmap:
> > > + iounmap(guts->regs);
> > > +out_free:
> > > + kfree(guts);
> > devm
> [Lu Yangbo-B47093] What's the devm 

Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 16:10 -0700, Eric Dumazet wrote:

> 
> diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
> index 
> 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041
>  100644
> --- a/net/core/gen_stats.c
> +++ b/net/core/gen_stats.c
> @@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running,
>   return;
>   }
>   do {
> - if (running)
> + if (running) {
> + local_bh_disable();
>   seq = read_seqcount_begin(running);
> + }
>   bstats->bytes = b->bytes;
>   bstats->packets = b->packets;
> + if (running)
> + local_bh_enable();
>   } while (running && read_seqcount_retry(running, seq));
>  }

Ah well, forget this patch, re-enabling bh right before
read_seqcount_retry() is not going to help.







Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-12 Thread Jason Gunthorpe
On Mon, Sep 12, 2016 at 10:43:00PM +, Adit Ranadive wrote:
> On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote:
> > On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote:
> > > [2] Libpvrdma User-level library - 
> > > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
> > 
> > You will probably find that rdma-plumbing will be the best way to get
> > your userspace component into the distributors.

> Sorry I haven't been paying attention to that discussion. Do you know how soon
> distros will pick up the rdma-plumbing stuff?

Hopefully things will be clearer after the call on Tuesday.

I'm hoping we can begin the process sooner than later.

There are 4 new drivers being submitted and at least 2-3 more kernel
drivers that are not yet in Debian/SuSE/etc.

By far the simplest way to push those 6 new user space drivers into
the distros will be via rdma-plumbing, I'm optimistic that the vendors
will work together on this.

> It seems like it should be fairly straightforward to include the libpvrdma git
> within the rdma-plumbing git as well. Let me know what the process is since
> I may have to discuss it internally.

For now, you can prepare your own clone and create a providers/pvrdma
directory similar to the others. It should be very
straightforward. Let me know if you have any problems.

You should ensure clean compilation on FC24 (for instance using the
provided docker tooling), and review the commit stream in
rdma-plumbing to make sure you pick up all the systemic fixes that
have already been done to the other providers.

If Doug accepts your provider upstream before we finalize
rdma-plumbing then I will add your git to the giant merge, however
please make my life easier and ensure the code does not have any of
the systemic problems I already fixed in other providers :(

After finalization you'll have to post your user space provider to the
mailing list and Doug will pick it up.

Jason


Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 19:02 -0400, Jamal Hadi Salim wrote:
> On 16-09-12 06:26 PM, Eric Dumazet wrote:
> > On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote:
> >
> >> I noticed some very weird issues when I took that out.
> >> Running sufficiently large amount of traffic (ping -f is sufficient)
> >> I saw that when i did a dump it took anywhere between 6-15 seconds.
> >> With the read_lock in place response was immediate.
> >> I can go back and run things to verify - but it was very odd.
> >
> > This was on uni processor ?
> >
> 
> It was a VM.
> 
> > Looks like typical starvation caused by aggressive softirq.
> >
> 
> Well, then it is strange that in one case a tc dump of the rule
> was immediate and in the other case it was consistent for 5-15
> seconds.
> 

This needs investigation ;)

One possible loop under high stress would be possible in
__gnet_stats_copy_basic(), since we might restart the loop if we are
really really unlucky, but this would have nothing with your patches.



diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 
508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041
 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running,
return;
}
do {
-   if (running)
+   if (running) {
+   local_bh_disable();
seq = read_seqcount_begin(running);
+   }
bstats->bytes = b->bytes;
bstats->packets = b->packets;
+   if (running)
+   local_bh_enable();
} while (running && read_seqcount_retry(running, seq));
 }
 EXPORT_SYMBOL(__gnet_stats_copy_basic);






Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 03:46:39PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> > From: Alexei Starovoitov 
> 
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +u32 len,
> > +struct net_device *netdev,
> > +struct e1000_adapter *adapter)
> > +{
> > +   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +   struct e1000_hw *hw = &adapter->hw;
> > +   struct e1000_tx_ring *tx_ring;
> > +
> > +   if (len > E1000_MAX_DATA_PER_TXD)
> > +   return;
> > +
> > +   /* e1000 only support a single txq at the moment so the queue is being
> > +* shared with stack. To support this requires locking to ensure the
> > +* stack and XDP are not running at the same time. Devices with
> > +* multiple queues should allocate a separate queue space.
> > +*/
> > +   HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +   tx_ring = adapter->tx_ring;
> > +
> > +   if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +   HARD_TX_UNLOCK(netdev, txq);
> > +   return;
> > +   }
> > +
> > +   if (netif_xmit_frozen_or_stopped(txq))
> > +   return;
> > +
> > +   e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +   netdev_sent_queue(netdev, len);
> > +   e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +   writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +   mmiowb();
> > +
> > +   HARD_TX_UNLOCK(netdev, txq);
> > +}
> 
> 
> e1000_tx_map() is full of workarounds.
> 
> Have a look at last_tx_tso for example.
> 
>/* Workaround for Controller erratum --
>  * descriptor for non-tso packet in a linear SKB that follows 
> a
>  * tso gets written back prematurely before the data is fully
>  * DMA'd to the controller
>  */
> if (!skb->data_len && tx_ring->last_tx_tso &&
> !skb_is_gso(skb)) {
> tx_ring->last_tx_tso = false;
> size -= 4;
> }
> 
> Look, this XDP_TX thing is hard to properly implement and test on
> various NIC revisions.

my reading is that these e1k workarounds are for old versions of the nic that
don't even exist anymore. I believe none of them apply to qemu e1k.

> Without proper queue management, high prio packets in qdisc wont be sent
> if NIC is under RX -> XDP_TX flood.

yep. there are various ways to shoot yourself in the foot with xdp.
The simplest program that drops all the packets will make the box unpingable.
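
That simplest program is literally a single return statement; a minimal
sketch (the SEC() section macro is assumed, as in samples/bpf):

	#include <linux/bpf.h>

	#define SEC(NAME) __attribute__((section(NAME), used))

	SEC("xdp")
	int xdp_drop_all(struct xdp_md *ctx)
	{
		return XDP_DROP;	/* drops everything, ping included */
	}

Attach that to the interface and the box stops answering until the program
is removed.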

> Sounds a horrible feature to me.

yep. not pretty by any means.
This whole thing is only to test xdp programs under qemu which
realistically emulates only e1k.
I simply don't see any other options to test xdp under kvm.



[PATCH v3 net 1/1] net sched actions: fix GETing actions

2016-09-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

With the batch changes that translated transient actions into
a temporary list lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
the permanent location if the refcount is zero.

Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes:
commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"),
commit 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
commit f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_api.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..50720b1 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -592,6 +592,16 @@ err_out:
return ERR_PTR(err);
 }
 
+static void cleanup_a(struct list_head *actions, int ovr)
+{
+   struct tc_action *a;
+
+   list_for_each_entry(a, actions, list) {
+   if (ovr)
+   a->tcfa_refcnt -= 1;
+   }
+}
+
 int tcf_action_init(struct net *net, struct nlattr *nla,
  struct nlattr *est, char *name, int ovr,
  int bind, struct list_head *actions)
@@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
goto err;
}
act->order = i;
+   if (ovr)
+   act->tcfa_refcnt += 1;
list_add_tail(&act->list, actions);
}
+
+   /* Remove the temp refcnt which was necessary to protect against
+* destroying an existing action which was being replaced
+*/
+   cleanup_a(actions, ovr);
return 0;
 
 err:
@@ -883,6 +900,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct 
nlmsghdr *n,
goto err;
}
act->order = i;
+   if (event == RTM_GETACTION)
+   act->tcfa_refcnt += 1;
list_add_tail(&act->list, &actions);
}
 
-- 
1.9.1



Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 06:26 PM, Eric Dumazet wrote:

On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote:


I noticed some very weird issues when I took that out.
Running sufficiently large amount of traffic (ping -f is sufficient)
I saw that when i did a dump it took anywhere between 6-15 seconds.
With the read_lock in place response was immediate.
I can go back and run things to verify - but it was very odd.


This was on uni processor ?



It was a VM.


Looks like typical starvation caused by aggressive softirq.



Well, then it is strange that in one case a tc dump of the rule
was immediate and in the other case it was consistent for 5-15
seconds.


Anyway, I suspect your kernel build has rcu_read_lock() and
rcu_read_unlock() as NOP ;)



Which doesnt give me a good feel if i tested this well ;->

I would like to try again with those two kernels just to
make sure i was not imagining this.

cheers,
jamal


Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-12 Thread Andrew Lunn
Hi John

> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

One linux/phy.h is enough.

> + /* Setup connection between CPU ports & PHYs */
> + for (i = 0; i < DSA_MAX_PORTS; i++) {
> + /* CPU port gets connected to all PHYs in the switch */
> + if (dsa_is_cpu_port(ds, i)) {
> + qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(priv->cpu_port),
> +   QCA8K_PORT_LOOKUP_MEMBER,
> +   ds->enabled_port_mask << 1);
> + }
> +
> + /* Invividual PHYs gets connected to CPU port only */
> + if (ds->enabled_port_mask & BIT(i)) {
> + int port = qca8k_phy_to_port(i);
> + int shift = 16 * (port % 2);
> +

snip

> +static void
> +qca8k_get_ethtool_stats(struct dsa_switch *ds, int phy,
> + uint64_t *data)
> +{
> + struct qca8k_priv *priv = qca8k_to_priv(ds);
> + const struct qca8k_mib_desc *mib;
> + u32 reg, i, port;
> + u64 hi;
> +
> + port = qca8k_phy_to_port(phy);

snip

> +static inline int qca8k_phy_to_port(int phy)
> +{
> + if (phy < 5)
> + return phy + 1;
> +
> + return -1;
> +}

What keep throwing me while reading this code is the use of PHY/phy.

What is meant by a phy in this driver?

DSA is all about switches. Switches are a switch fabric and a number
of ports. The ports contain Ethernet MACs, and optionally contain
embedded PHYs. If there are not embedded PHYs, there are often
external PHYs, and sometimes we just have MACs connected back to back.

Generally, DSA drivers don't need to do much with the MAC. Maybe set
the speed and duplex, and maybe signal delays. They also don't need to
do much with the PHY, phylib and a phy driver does all the work. DSA
is all about the switch fabric.

Yet i see phy scattered in a few places in this driver, and it seems
like they have nothing to do with the PHY.

Please can you change the terminology here? It might help if you can
remove qca8k_phy_to_port(). Why do you need to add 1? Could this be
eliminated via a better choice of reg in the device tree?

Thanks
Andrew 


Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> From: Alexei Starovoitov 

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +  u32 len,
> +  struct net_device *netdev,
> +  struct e1000_adapter *adapter)
> +{
> + struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> + struct e1000_hw *hw = &adapter->hw;
> + struct e1000_tx_ring *tx_ring;
> +
> + if (len > E1000_MAX_DATA_PER_TXD)
> + return;
> +
> + /* e1000 only support a single txq at the moment so the queue is being
> +  * shared with stack. To support this requires locking to ensure the
> +  * stack and XDP are not running at the same time. Devices with
> +  * multiple queues should allocate a separate queue space.
> +  */
> + HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> + tx_ring = adapter->tx_ring;
> +
> + if (E1000_DESC_UNUSED(tx_ring) < 2) {
> + HARD_TX_UNLOCK(netdev, txq);
> + return;
> + }
> +
> + if (netif_xmit_frozen_or_stopped(txq))
> + return;
> +
> + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> + netdev_sent_queue(netdev, len);
> + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> + writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> + mmiowb();
> +
> + HARD_TX_UNLOCK(netdev, txq);
> +}


e1000_tx_map() is full of workarounds.

Have a look at last_tx_tso for example.

   /* Workaround for Controller erratum --
 * descriptor for non-tso packet in a linear SKB that follows a
 * tso gets written back prematurely before the data is fully
 * DMA'd to the controller
 */
if (!skb->data_len && tx_ring->last_tx_tso &&
!skb_is_gso(skb)) {
tx_ring->last_tx_tso = false;
size -= 4;
}

Look, this XDP_TX thing is hard to properly implement and test on
various NIC revisions.

Without proper queue management, high prio packets in qdisc wont be sent
if NIC is under RX -> XDP_TX flood.

Sounds a horrible feature to me.




Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-12 Thread Adit Ranadive
On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote:
> On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote:
> > [2] Libpvrdma User-level library - 
> > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
> 
> You will probably find that rdma-plumbing will be the best way to get
> your userspace component into the distributors.

Hi Jason,

Sorry I haven't been paying attention to that discussion. Do you know how soon
distros will pick up the rdma-plumbing stuff? 
It seems like it should be fairly straightforward to include the libpvrdma git 
within the rdma-plumbing git as well. Let me know what the process is since
I may have to discuss it internally.

Thanks!
Adit


Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote:

> I noticed some very weird issues when I took that out.
> Running sufficiently large amount of traffic (ping -f is sufficient)
> I saw that when i did a dump it took anywhere between 6-15 seconds.
> With the read_lock in place response was immediate.
> I can go back and run things to verify - but it was very odd.

This was on uni processor ?

Looks like typical starvation caused by aggressive softirq.

Anyway, I suspect your kernel build has rcu_read_lock() and
rcu_read_unlock() as NOP ;)





Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 06:01 PM, Eric Dumazet wrote:

On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:


+
+static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
+  int bind, int ref)
+{
+   struct tcf_skbmod *d = to_skbmod(a);
+   unsigned char *b = skb_tail_pointer(skb);
+   struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
+   struct tc_skbmod opt = {
+   .index   = d->tcf_index,
+   .refcnt  = d->tcf_refcnt - ref,
+   .bindcnt = d->tcf_bindcnt - bind,
+   .action  = d->tcf_action,
+   };
+   struct tcf_t t;
+
+   rcu_read_lock();


You do not need rcu read lock protection here, RTNL is enough.


I noticed some very weird issues when I took that out.
Running sufficiently large amount of traffic (ping -f is sufficient)
I saw that when i did a dump it took anywhere between 6-15 seconds.
With the read_lock in place response was immediate.
I can go back and run things to verify - but it was very odd.

cheers,
jamal






[net-next PATCH v3 3/3] e1000: bundle xdp xmit routines

2016-09-12 Thread John Fastabend
e1000 supports a single TX queue so it is being shared with the stack
when XDP runs XDP_TX action. This requires taking the xmit lock to
ensure we don't corrupt the tx ring. To avoid taking and dropping the
lock per packet this patch adds a bundling implementation to submit
a bundle of packets to the xmit routine.

I tested this patch running e1000 in a VM using KVM over a tap
device using pktgen to generate traffic along with 'ping -f -l 100'.

Suggested-by: Jesper Dangaard Brouer 
Signed-off-by: John Fastabend 
---
 drivers/net/ethernet/intel/e1000/e1000.h  |   10 +++
 drivers/net/ethernet/intel/e1000/e1000_main.c |   80 +++--
 2 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h 
b/drivers/net/ethernet/intel/e1000/e1000.h
index 5cf8a0a..877b377 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -133,6 +133,8 @@ struct e1000_adapter;
 #define E1000_TX_QUEUE_WAKE16
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
+/* How many XDP XMIT buffers to bundle into one xmit transaction */
+#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
 
 #define AUTO_ALL_MODES 0
 #define E1000_EEPROM_82544_APM 0x0004
@@ -168,6 +170,11 @@ struct e1000_rx_buffer {
dma_addr_t dma;
 };
 
+struct e1000_rx_buffer_bundle {
+   struct e1000_rx_buffer *buffer;
+   u32 length;
+};
+
 struct e1000_tx_ring {
/* pointer to the descriptor ring memory */
void *desc;
@@ -206,6 +213,9 @@ struct e1000_rx_ring {
struct e1000_rx_buffer *buffer_info;
struct sk_buff *rx_skb_top;
 
+   /* array of XDP buffer information structs */
+   struct e1000_rx_buffer_bundle *xdp_buffer;
+
/* cpu for rx queue */
int cpu;
 
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 232b927..31489d4 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct 
bpf_prog *prog)
struct e1000_adapter *adapter = netdev_priv(netdev);
struct bpf_prog *old_prog;
 
+   if (!adapter->rx_ring[0].xdp_buffer) {
+   int size = sizeof(struct e1000_rx_buffer_bundle) *
+   E1000_XDP_XMIT_BUNDLE_MAX;
+
+   adapter->rx_ring[0].xdp_buffer = vzalloc(size);
+   if (!adapter->rx_ring[0].xdp_buffer)
+   return -ENOMEM;
+   }
+
old_prog = xchg(&adapter->prog, prog);
if (old_prog) {
synchronize_net();
@@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
if (adapter->prog)
bpf_prog_put(adapter->prog);
 
+   if (adapter->rx_ring[0].xdp_buffer)
+   vfree(adapter->rx_ring[0].xdp_buffer);
+
unregister_netdev(netdev);
 
e1000_phy_hw_reset(hw);
@@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring 
*tx_ring,
 
 static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
 u32 len,
+struct e1000_adapter *adapter,
 struct net_device *netdev,
-struct e1000_adapter *adapter)
+struct e1000_tx_ring *tx_ring)
 {
-   struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
-   struct e1000_hw *hw = &adapter->hw;
-   struct e1000_tx_ring *tx_ring;
+   const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
 
if (len > E1000_MAX_DATA_PER_TXD)
return;
 
-   /* e1000 only support a single txq at the moment so the queue is being
-* shared with stack. To support this requires locking to ensure the
-* stack and XDP are not running at the same time. Devices with
-* multiple queues should allocate a separate queue space.
-*/
-   HARD_TX_LOCK(netdev, txq, smp_processor_id());
-
-   tx_ring = adapter->tx_ring;
-
-   if (E1000_DESC_UNUSED(tx_ring) < 2) {
-   HARD_TX_UNLOCK(netdev, txq);
+   if (E1000_DESC_UNUSED(tx_ring) < 2)
return;
-   }
 
if (netif_xmit_frozen_or_stopped(txq))
return;
@@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer 
*rx_buffer_info,
e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
netdev_sent_queue(netdev, len);
e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+}
 
+static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
+ struct net_device *netdev,
+ struct e1000_adapter *adapter)
+{
+   struct netdev_queue *tx

[net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-12 Thread John Fastabend
From: Alexei Starovoitov 

This patch adds initial support for XDP on e1000 driver. Note e1000
driver does not support page recycling in general which could be
added as a further improvement. However XDP_DROP case will recycle.
XDP_TX and XDP_PASS do not support recycling.

e1000 only supports a single tx queue at this time so the queue
is shared between xdp program and Linux stack. It is possible for
an XDP program to starve the stack in this model.

The XDP program will drop packets on XDP_TX errors. This can occur
when the tx descriptors are exhausted. This behavior is the same
for both shared queue models like e1000 and dedicated tx queue
models used in multiqueue devices. However if both the stack and
XDP are transmitting packets it is perhaps more likely to occur in
the shared queue model. Further refinement to the XDP model may be
possible in the future.

I tested this patch running e1000 in a VM using KVM over a tap
device.

CC: William Tu 
Signed-off-by: Alexei Starovoitov 
Signed-off-by: John Fastabend 
---
 drivers/net/ethernet/intel/e1000/e1000.h  |2 
 drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +
 2 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h 
b/drivers/net/ethernet/intel/e1000/e1000.h
index d7bdea7..5cf8a0a 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -150,6 +150,7 @@ struct e1000_adapter;
  */
 struct e1000_tx_buffer {
struct sk_buff *skb;
+   struct page *page;
dma_addr_t dma;
unsigned long time_stamp;
u16 length;
@@ -279,6 +280,7 @@ struct e1000_adapter {
 struct e1000_rx_ring *rx_ring,
 int cleaned_count);
struct e1000_rx_ring *rx_ring;  /* One per active queue */
+   struct bpf_prog *prog;
struct napi_struct napi;
 
int num_tx_queues;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 62a7f8d..232b927 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
return 0;
 }
 
+static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+   struct e1000_adapter *adapter = netdev_priv(netdev);
+   struct bpf_prog *old_prog;
+
+   old_prog = xchg(&adapter->prog, prog);
+   if (old_prog) {
+   synchronize_net();
+   bpf_prog_put(old_prog);
+   }
+
+   if (netif_running(netdev))
+   e1000_reinit_locked(adapter);
+   else
+   e1000_reset(adapter);
+   return 0;
+}
+
+static bool e1000_xdp_attached(struct net_device *dev)
+{
+   struct e1000_adapter *priv = netdev_priv(dev);
+
+   return !!priv->prog;
+}
+
+static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+   switch (xdp->command) {
+   case XDP_SETUP_PROG:
+   return e1000_xdp_set(dev, xdp->prog);
+   case XDP_QUERY_PROG:
+   xdp->prog_attached = e1000_xdp_attached(dev);
+   return 0;
+   default:
+   return -EINVAL;
+   }
+}
+
 static const struct net_device_ops e1000_netdev_ops = {
.ndo_open   = e1000_open,
.ndo_stop   = e1000_close,
@@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
 #endif
.ndo_fix_features   = e1000_fix_features,
.ndo_set_features   = e1000_set_features,
+   .ndo_xdp= e1000_xdp,
 };
 
 /**
@@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
e1000_down_and_stop(adapter);
e1000_release_manageability(adapter);
 
+   if (adapter->prog)
+   bpf_prog_put(adapter->prog);
+
unregister_netdev(netdev);
 
e1000_phy_hw_reset(hw);
@@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter 
*adapter)
struct e1000_hw *hw = &adapter->hw;
u32 rdlen, rctl, rxcsum;
 
-   if (adapter->netdev->mtu > ETH_DATA_LEN) {
+   if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
rdlen = adapter->rx_ring[0].count *
sizeof(struct e1000_rx_desc);
adapter->clean_rx = e1000_clean_jumbo_rx_irq;
@@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter 
*adapter,
dev_kfree_skb_any(buffer_info->skb);
buffer_info->skb = NULL;
}
+   if (buffer_info->page) {
+   put_page(buffer_info->page);
+   buffer_info->page = NULL;
+   }
+
buffer_info->time_stamp = 0;

Re: [PATCH v4 16/16] MAINTAINERS: Update for PVRDMA driver

2016-09-12 Thread Adit Ranadive
On Mon, Sep 12, 2016 at 10:52:26 -0700, Jason Gunthorpe wrote: 
> On Sun, Sep 11, 2016 at 09:49:26PM -0700, Adit Ranadive wrote:
> > Add maintainer info for the PVRDMA driver.
> 
> You can probably squash the last three patches.

Thanks for taking a look.
Since Doug mentioned that he would squash my commits anyways while applying I
decided to keep them separate. Let me know if you still want me to squash them.

> .. and fix the __u32 stuff throughout the entire driver please.

Will do.

Thanks,
Adit


[net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not

2016-09-12 Thread John Fastabend
The BQL API does not reference the sk_buff nor does the driver need to
reference the sk_buff to calculate the length of a transmitted frame.
This patch removes an sk_buff reference from the xmit irq path and
also allows packets sent from XDP to use BQL.

Signed-off-by: John Fastabend 
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f42129d..62a7f8d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter 
*adapter,
if (cleaned) {
total_tx_packets += buffer_info->segs;
total_tx_bytes += buffer_info->bytecount;
-   if (buffer_info->skb) {
-   bytes_compl += buffer_info->skb->len;
-   pkts_compl++;
-   }
-
+   bytes_compl += buffer_info->length;
+   pkts_compl++;
}
e1000_unmap_and_free_tx_resource(adapter, buffer_info);
tx_desc->upper.data = 0;



[net-next PATCH v3 0/3] e1000 XDP implementation

2016-09-12 Thread John Fastabend
This patch implements XDP on e1000 a few comments below:

The XDP_TX path does not violate BQL in this series so we have to increment
and decrement the bql counters correctly. When a TX queue can have both stack
traffic and XDP traffic I believe user configured BQL should be correct. If
users do not want BQL they can always disable it or raise the limits.


I left the xdp program attached to the adapter structure because in the e1000
case the program is global because we only run on a single queue. Pushing the
program into the rx_ring just creates a bunch of ring[0] references in the code
and adds little in my opinion. This is really just a style issue though. Notice
I did push the bundle logic onto the queue but I view the bundling and rx
ring buffers to be so closely linked it was worth it and only caused a couple
extra ring[0] lines.


XDP_TX will drop packets if any errors occur while attempting to push onto the
TX descriptor ring. This seems to me at least to be the most sensible thing to
do and keeps e1000 inline with existing mlx XDP drivers. This can always be
extended if some consensus is made later.

Thanks for all the comments in v{1|2}. As always any comments/feedback
appreciated.

Also initial patch was based on a patch I took from Alexei so I left his
signed-off there even though I mangled the code a fair amount since then and
did not give him any chance to review off-list.

---

Alexei Starovoitov (1):
  e1000: add initial XDP support

John Fastabend (2):
  e1000: track BQL bytes regardless of skb or not
  e1000: bundle xdp xmit routines


 drivers/net/ethernet/intel/e1000/e1000.h  |   12 +
 drivers/net/ethernet/intel/e1000/e1000_main.c |  227 -
 2 files changed, 229 insertions(+), 10 deletions(-)

--
Signature


Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 05:58 PM, Eric Dumazet wrote:

On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:

From: Jamal Hadi Salim 



+
+   /* XXX: if you are going to edit more fields beyond ethernet header
+* (example when you add IP header replacement or vlan swap)
+* then MAX_EDIT_LEN needs to change appropriately
+   */
+   err = skb_ensure_writable(skb, ETH_HLEN);
+   if (unlikely(err)) /* best policy is to drop on the floor */
+   action = TC_ACT_SHOT;
+
+   rcu_read_lock();
+   action = READ_ONCE(d->tcf_action);


You are overwriting @action, while you might have put TC_ACT_SHOT in it
3 lines above.

Maybe you meant :
if (err)
return TC_ACT_SHOT;



Thanks for catching that (leftover from when i used a lock).
Will resubmit.

cheers,
jamal


Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:

> +
> +static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
> +int bind, int ref)
> +{
> + struct tcf_skbmod *d = to_skbmod(a);
> + unsigned char *b = skb_tail_pointer(skb);
> + struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
> + struct tc_skbmod opt = {
> + .index   = d->tcf_index,
> + .refcnt  = d->tcf_refcnt - ref,
> + .bindcnt = d->tcf_bindcnt - bind,
> + .action  = d->tcf_action,
> + };
> + struct tcf_t t;
> +
> + rcu_read_lock();

You do not need rcu read lock protection here, RTNL is enough.

> +
> + opt.flags  = p->flags;
> + if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
> + goto nla_put_failure;
> + if ((p->flags & SKBMOD_F_DMAC) &&
> + nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst))
> + goto nla_put_failure;
> + if ((p->flags & SKBMOD_F_SMAC) &&
> + nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src))
> + goto nla_put_failure;
> + if ((p->flags & SKBMOD_F_ETYPE) &&
> + nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type)))
> + goto nla_put_failure;
> +
> + tcf_tm_dump(&t, &d->tcf_tm);
> + if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
> + goto nla_put_failure;
> +
> + rcu_read_unlock();
> +
> + return skb->len;
> +nla_put_failure:
> + rcu_read_unlock();
> + nlmsg_trim(skb, b);
> + return -1;
> +}
> +



Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Eric Dumazet
On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim 

> +
> + /* XXX: if you are going to edit more fields beyond ethernet header
> +  * (example when you add IP header replacement or vlan swap)
> +  * then MAX_EDIT_LEN needs to change appropriately
> + */
> + err = skb_ensure_writable(skb, ETH_HLEN);
> + if (unlikely(err)) /* best policy is to drop on the floor */
> + action = TC_ACT_SHOT;
> +
> + rcu_read_lock();
> + action = READ_ONCE(d->tcf_action);

You are overwriting @action, while you might have put TC_ACT_SHOT in it
3 lines above.

Maybe you meant :
if (err)
return TC_ACT_SHOT;





Re: [PATCH] [RFC] proc connector: add namespace events

2016-09-12 Thread Evgeniy Polyakov
Hi everyone

08.09.2016, 18:39, "Alban Crequy" :
> The act of a process creating or joining a namespace via clone(),
> unshare() or setns() is a useful signal for monitoring applications.

> + if (old_ns->mnt_ns != new_ns->mnt_ns)
> + proc_ns_connector(tsk, CLONE_NEWNS, PROC_NM_REASON_CLONE, old_mntns_inum, 
> new_mntns_inum);
> +
> + if (old_ns->uts_ns != new_ns->uts_ns)
> + proc_ns_connector(tsk, CLONE_NEWUTS, PROC_NM_REASON_CLONE, 
> old_ns->uts_ns->ns.inum, new_ns->uts_ns->ns.inum);
> +
> + if (old_ns->ipc_ns != new_ns->ipc_ns)
> + proc_ns_connector(tsk, CLONE_NEWIPC, PROC_NM_REASON_CLONE, 
> old_ns->ipc_ns->ns.inum, new_ns->ipc_ns->ns.inum);
> +
> + if (old_ns->net_ns != new_ns->net_ns)
> + proc_ns_connector(tsk, CLONE_NEWNET, PROC_NM_REASON_CLONE, 
> old_ns->net_ns->ns.inum, new_ns->net_ns->ns.inum);
> +
> + if (old_ns->cgroup_ns != new_ns->cgroup_ns)
> + proc_ns_connector(tsk, CLONE_NEWCGROUP, PROC_NM_REASON_CLONE, 
> old_ns->cgroup_ns->ns.inum, new_ns->cgroup_ns->ns.inum);
> +
> + if (old_ns->pid_ns_for_children != new_ns->pid_ns_for_children)
> + proc_ns_connector(tsk, CLONE_NEWPID, PROC_NM_REASON_CLONE, 
> old_ns->pid_ns_for_children->ns.inum, new_ns->pid_ns_for_children->ns.inum);
> + }
> +

Patch looks good to me from a technical/connector point of view, but this event
multiplication is a bit weird imho.

I'm not against it, but did you consider sending just 2 serialized ns
structures via a single message, and the client would check all ns bits
himself?
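
Something along those lines, purely as a sketch of that alternative (the
layout and field names here are made up, not taken from the patch):

	#include <linux/types.h>

	/* one event carrying both namespace sets; userspace diffs the
	 * inode numbers itself instead of getting one event per type */
	struct ns_set {
		__u64 mnt, uts, ipc, net, cgroup, pid;
	};

	struct proc_ns_event {
		__u32 reason;		/* clone / unshare / setns */
		struct ns_set old_ns;
		struct ns_set new_ns;
	};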


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Tom Herbert
On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer
 wrote:
> On Fri, 9 Sep 2016 18:03:09 +0300
> Saeed Mahameed  wrote:
>
>> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>>  wrote:
>> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >>
>> >> I'm sorry but I have a problem with this patch!
>> >> Looking at this patch, I want to bring up a fundamental architectural
>> >> concern with the development direction of XDP transmit.
>> >>
>> >>
>> >> What you are trying to implement, with delaying the doorbell, is
>> >> basically TX bulking for TX_XDP.
>> >>
>> >>  Why not implement a TX bulking interface directly instead?!?
>> >>
>> >> Yes, the tailptr/doorbell is the most costly operation, but why not
>> >> also take advantage of the benefits of bulking for other parts of the
>> >> code? (benefit is smaller, by every cycles counts in this area)
>> >>
>> >> This whole XDP exercise is about avoiding having a transaction cost per
>> >> packet, that reads "bulking" or "bundling" of packets, where possible.
>> >>
>> >>  Lets do bundling/bulking from the start!
>>
>> Jesper, what we did here is also bulking, instead of bulking in a
>> temporary list in the driver we list the packets in the HW and once
>> done we transmit all at once via the xdp_doorbell indication.
>>
>> I agree with you that we can take advantage and improve the icache by
>> bulking first in software and then queue all at once in the hw then
>> ring one doorbell.
>>
>> but I also agree with Alexei that this will introduce an extra
>> pointer/list handling in the driver and we need to do the comparison
>> between both approaches before we decide which is better.
>
> I welcome implementing both approaches and benchmarking them against
> each-other, I'll gladly dedicate time for this!
>
Yes, please implement this so we can have something clear to evaluate
and compare. There is far too much spewing of "expert opinions"
happening here :-(

> I'm reacting so loudly, because this is a mental model switch, that
> need to be applied to the full drivers RX path. Also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is between 10%-25% perf gain here.
>
> The key point is to stop seeing the lowest driver RX as something that
> works on a per packet basis.  It might be easier to view this as a kind
> of vector processing.  This is about having a vector of packets, where
> we apply some action/operation.
>
> This is about using the CPU more efficiently, getting it to do more
> instructions per cycle.  The next level of optimization (for >= Sandy
> Bridge CPUs) is to make these vector processing stages small enough to fit
> into the CPUs decoded-I-cache section.
>
>
> It might also be important to mention, that for netstack delivery, I
> don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
> packets.  Why? Because the NIC-HW runs independently and has the
> opportunity to deliver more frames into the RX ring queue, while the
> stack slowly processes packets.  You can view this as "bulking" from
> the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
>
>
>> this must be marked as future work and not have this from the start.
>
> We both know that statement is BS, and the other approach will never be
> implemented once this patch is accepted upstream.
>
>
>> > mlx4 already does bulking and this proposed mlx5 set of patches
>> > does bulking as well.
>
> I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
> way IMHO.  And now mlx5 is building on the same principle. That is why
> I'm yelling STOP.
>
>
>> >> The reason behind the xmit_more API is that we could not change the
>> >> API of all the drivers.  And we found that calling an explicit NDO
>> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> >> would hit the common single packet use-case.
>> >>
>> >> It should be really easy to build a bundle of packets that need XDP_TX
>> >> action, especially given you only have a single destination "port".
>> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
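
For readers not following the whole thread, a loose sketch of the "vector of packets"
RX stage being argued for above (illustrative only, not taken from mlx4/mlx5; all
names are made up):

#include <linux/netdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

#define RX_BUNDLE_MAX	16	/* small enough to keep each stage in the icache */

struct rx_bundle {
	struct sk_buff *skbs[RX_BUNDLE_MAX];
	unsigned int count;
};

/* Pulling up to RX_BUNDLE_MAX frames off the RX ring is driver specific
 * and elided; the point is that each processing step then runs as its
 * own small loop over the bundle instead of once per packet.
 */
static void rx_deliver_bundle(struct napi_struct *napi, struct rx_bundle *b)
{
	unsigned int i;

	for (i = 0; i < b->count; i++)		/* stage 1: warm up the data */
		prefetch(b->skbs[i]->data);

	for (i = 0; i < b->count; i++)		/* stage 2: hand over to the stack */
		napi_gro_receive(napi, b->skbs[i]);
}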


[PATCH net-next 2/2] bpf: use skb_at_tc_ingress helper in tcf_bpf

2016-09-12 Thread Daniel Borkmann
We have a small skb_at_tc_ingress() helper for testing for ingress, so
make use of it. cls_bpf already uses it and so should act_bpf.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/sched/act_bpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 78400de..1d39600 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -39,10 +39,10 @@ static struct tc_action_ops act_bpf_ops;
 static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act,
   struct tcf_result *res)
 {
+   bool at_ingress = skb_at_tc_ingress(skb);
struct tcf_bpf *prog = to_bpf(act);
struct bpf_prog *filter;
int action, filter_res;
-   bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS;
 
tcf_lastuse_update(&prog->tcf_tm);
bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb);
-- 
1.9.3
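
For reference, the helper being switched to is, roughly (include/net/sch_generic.h of
this period), just a wrapper around the same open-coded test:

static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
{
#ifdef CONFIG_NET_CLS_ACT
	return G_TC_AT(skb->tc_verd) & AT_INGRESS;
#else
	return false;
#endif
}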



[PATCH net-next 1/2] bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf

2016-09-12 Thread Daniel Borkmann
The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is
actually unnecessary and can be removed altogether. This was added by
commit a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative
offsets"), which was later on improved by 3431205e0397 ("bpf: make programs
see skb->data == L2 for ingress and egress"). We're always guaranteed to
have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf().

Reason is that since 6d1ccff62780 ("net: reset mac header in dev_start_xmit()")
we do skb_reset_mac_header() in __dev_queue_xmit() before we could call
into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress()
always sees a valid mac header as well (things like skb_reset_mac_len()
would badly fail otherwise). Thus, drop the unnecessary test in classifier
and action case.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/sched/act_bpf.c | 3 ---
 net/sched/cls_bpf.c | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index bfa8707..78400de 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -44,9 +44,6 @@ static int tcf_bpf(struct sk_buff *skb, const struct 
tc_action *act,
int action, filter_res;
bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS;
 
-   if (unlikely(!skb_mac_header_was_set(skb)))
-   return TC_ACT_UNSPEC;
-
tcf_lastuse_update(&prog->tcf_tm);
bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb);
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 4742f41..1d92d4d 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -83,9 +83,6 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
struct cls_bpf_prog *prog;
int ret = -1;
 
-   if (unlikely(!skb_mac_header_was_set(skb)))
-   return -1;
-
/* Needed here for accessing maps. */
rcu_read_lock();
list_for_each_entry_rcu(prog, &head->plist, link) {
-- 
1.9.3



[PATCH net-next 0/2] Misc cls_bpf/act_bpf improvements

2016-09-12 Thread Daniel Borkmann
Two minor improvements to {cls,act}_bpf. For details please see
individual patches.

Thanks!

Daniel Borkmann (2):
  bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf
  bpf: use skb_at_tc_ingress helper in tcf_bpf

 net/sched/act_bpf.c | 5 +
 net/sched/cls_bpf.c | 3 ---
 2 files changed, 1 insertion(+), 7 deletions(-)

-- 
1.9.3



Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems

2016-09-12 Thread Martin Blumenstingl
Hi Matt,

On Sun, Sep 11, 2016 at 10:57 PM, Matt Corallo  wrote:
> The general advice is to force it into 100M mode then it seems to work fine.
> There is some issue with the driver seemingly not fully supporting 1G (at
> least on Odroid C2) which needs to be worked out.
thanks for the hint - by "forcing 100M mode" you mean "ethtool -s eth0
speed 100 duplex full"?
Do you know more about this issue (or do you just know the
workaround)? It'd be interesting to know which "driver" does not
support Gbit speeds (as there are three drivers involved: realtek PHY
driver, stmmac and the Meson stmmac glue driver)?


Regards,
Martin


Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems

2016-09-12 Thread Martin Blumenstingl
Hi Alexandre,

On Mon, Sep 12, 2016 at 6:37 PM, Alexandre Torgue
 wrote:
> Which Synopsys IP version do you use ?
found this in a dmesg log:
[1.504784] stmmac - user ID: 0x11, Synopsys ID: 0x37
[1.509785]  Ring mode enabled
[1.512796]  DMA HW capability register supported
[1.517286]  Normal descriptors
[1.520565]  RX Checksum Offload Engine supported
[1.525219]  COE Type 2
[1.527638]  TX Checksum insertion supported
[1.531862]  Wake-Up On Lan supported
[1.535483]  Enable RX Mitigation via HW Watchdog Timer
[1.543851] libphy: stmmac: probed
[1.544025] eth0: PHY ID 001cc916 at 0 IRQ POLL (stmmac-0:00) active
[1.550321] eth0: PHY ID 001cc916 at 7 IRQ POLL (stmmac-0:07)

>> Gbit ethernet on my device is provided by a Realtek RTL8211F RGMII PHY.
>> Similar issues were reported in #linux-amlogic by a user with an
>> Odroid C2 board (= similar hardware).
>>
>> The symptoms are:
>> Receiving data is plenty fast (I can max out my internet connection
>> easily, and with iperf3 I get ~900Mbit/s).
>> Transmitting data from the device is unfortunately very slow, traffic
>> sometimes even stalls completely.
>>
>> I have attached the iperf results and the output of
>> /sys/kernel/debug/stmmaceth/eth0/descriptors_status.
>> Below you can find the ifconfig, netstat and stmmac dma_cap info
>> (*after* I ran all tests).
>>
>> The "involved parties" are:
>> - Meson GXBB specific network configuration registers (I have have
>> double-checked them with the reference drivers: everything seems fine
>> here)
>> - stmmac: it seems that nobody else has reported these kind of issues
>> so far, however I'd still like to hear where I should enable some
>> debugging bits to rule out any stmmac bug
>
>
> On my side, I just tested on the same "kind" of system:
> -SYNOPSYS GMAC 3.7
> -RTL8211EG as PHY
>
> With I perf, I reach:
> -RX: 932 Mbps
> -TX: 820Mbps
>
> Can you check ethtool -S eth0 (most precisely "MMC"counter and errors) ?
> Which kernel version do you use ?
I am using a 4.8.0-rc4 kernel, based on Kevin's "integration" branch: [0]
Unfortunately I don't have access to my device in the next few days,
but I'll keep you updated once I have the ethtool output.


Thanks for your time
Regards,
Martin


[0] 
https://git.kernel.org/cgit/linux/kernel/git/khilman/linux-amlogic.git/log/?h=v4.8/integ


Re: [PATCH 00/26] constify local structures

2016-09-12 Thread Julia Lawall


On Mon, 12 Sep 2016, Jarkko Sakkinen wrote:

> On Mon, Sep 12, 2016 at 04:43:58PM +0300, Felipe Balbi wrote:
> >
> > Hi,
> >
> > Jarkko Sakkinen  writes:
> > > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote:
> > >>
> > >>
> > >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote:
> > >>
> > >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote:
> > >> > > Constify local structures.
> > >> > >
> > >> > > The semantic patch that makes this change is as follows:
> > >> > > (http://coccinelle.lip6.fr/)
> > >> >
> > >> > Just my two cents but:
> > >> >
> > >> > 1. You *can* use a static analysis too to find bugs or other issues.
> > >> > 2. However, you should manually do the commits and proper commit
> > >> >messages to subsystems based on your findings. And I generally think
> > >> >that if one contributes code one should also at least smoke test 
> > >> > changes
> > >> >somehow.
> > >> >
> > >> > I don't know if I'm alone with my opinion. I just think that one should
> > >> > also do the analysis part and not blindly create and submit patches.
> > >>
> > >> All of the patches are compile tested.  And the individual patches are
> > >
> > > Compile-testing is not testing. If you are not able to test a commit,
> > > you should explain why.
> >
> > Dude, Julia has been doing semantic patching for years already and
> > nobody has raised any concerns so far. There's already an expectation
> > that Coccinelle *works* and Julia's semantic patches are sound.
> >
> > Besides, adding 'const' is something that causes virtually no functional
> > changes to the point that build-testing is really all you need. Any
> > problems caused by adding 'const' to a definition will be seen by build
> > errors or warnings.
> >
> > Really, just stop with the pointless discussion and go read a bit about
> > Coccinelle and what semantic patches are giving you. The work done by
> > Julia and her peers at INRIA has measurable benefits.
> >
> > You're really making a thunderstorm in a glass of water.
>
> Hmm... I've been using coccinelle in cyclic basis for some time now.
> My comment was oversized but I didn't mean it to be impolite or attack
> of any kind for that matter.

No problem :)  Thanks for the feedback.

julia


Re: [PATCH v2 net 1/1] net sched actions: fix GETing actions

2016-09-12 Thread Jamal Hadi Salim

On 16-09-12 01:18 PM, Cong Wang wrote:

On Mon, Sep 12, 2016 at 3:21 AM, Jamal Hadi Salim  wrote:




+   if (ovr)
+   a->tcfa_refcnt-=1;


How about tcfa_bindcnt?


We don't touch it when mucking around only with actions (as is in
this case).



+   }
+}
+
 int tcf_action_init(struct net *net, struct nlattr *nla,
  struct nlattr *est, char *name, int ovr,
  int bind, struct list_head *actions)
@@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
goto err;
}
act->order = i;
+   if (ovr)


Need to check this boolean? It looks like we need this for !ovr case too?



They are fine if they didn't exist.

I will fix the coding style issues and submit.

cheers,
jamal


Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer
On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov  wrote:

> On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Sep 2016 23:30:50 -0700
> > Alexei Starovoitov  wrote:
> >   
> > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:  
> > [...]  
> > > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > > Every time you call the stack code, then you flush your icache.  When
> > > > returning to the driver code, you will have to reload all the icache
> > > > associated with the XDP_TX, this is a costly operation.
> > >   
> > [...]  
> > > To make further progress in this discussion can we talk about
> > > the use case you have in mind instead? Then solution will
> > > be much clear, I hope.  
> > 
> > The DDoS use-case _is_ affected by this "hidden" bulking design.
> > 
> > Lets say, I want to implement a DDoS facility. Instead of just
> > dropping the malicious packets, I want to see the bad packets.  I
> > implement this by rewriting the destination-MAC to be my monitor
> > machine and then XDP_TX the packet.  
> 
> not following the use case. you want to implement a DDoS generator?
> Or just forward all bad packets from affected host to another host
> in the same rack? so two servers will be spammed with traffic and
> even more load on the tor? I really don't see how this is useful
> for anything but stress testing.

As I wrote below, the purpose of the monitor machine is to diagnose
false positives.  If you worry about the added load I would either
forward out another interface (which is not supported yet) or simply do
sampling of packets being forwarded to the monitor host.

> > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> > of the traffic is delivered to the stack. (See note 1)  
> 
> hmm. DoS prevention use case is when 99% of the traffic is dropped.

As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

> > Once the DDoS attack starts, then the traffic pattern changes, and XDP
> > should (hopefully only) catch the malicious traffic (monitor machine can
> > help diagnose false positive).  Now, due to interleaving the DDoS
> > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> > more icache misses...
> > 
> > 
> > 
> > Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> > program with 100% delivery to the stack, actually slows down the
> > normal stack.  This is due to hitting a bottleneck in the page
> > allocator.  I'm working removing that bottleneck with page_pool, and
> > that solution is orthogonal to this problem.  
> 
> sure. no one arguing against improving page allocator.
> 
> >  It is actually an excellent argument, for why you would want to run a
> > DDoS XDP filter only on a restricted number of RX queues.  
> 
> no. it's the opposite. If the host is under DoS there is no way
> the host can tell in advance which rx queue will be seeing bad
> packets.

Sorry, this note was not related to the DoS use-case.  You
misunderstood it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
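
A minimal sketch of the monitor use-case described above — rewrite the destination MAC
of suspected-bad frames and XDP_TX them back out — with the actual matching logic left
as a placeholder (illustrative only; the monitor MAC is made up):

#include <linux/bpf.h>
#include <linux/if_ether.h>

#define SEC(NAME) __attribute__((section(NAME), used))

static const unsigned char monitor_mac[ETH_ALEN] = {
	0x02, 0x15, 0x15, 0x15, 0x15, 0x15	/* made-up monitor host MAC */
};

/* Placeholder: the real filter would match the DDoS signature here. */
static inline __attribute__((always_inline))
int classify_bad(void *data, void *data_end)
{
	return 0;
}

SEC("xdp")
int ddos_monitor(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	int i;

	if (data + sizeof(*eth) > data_end)
		return XDP_PASS;

	if (!classify_bad(data, data_end))
		return XDP_PASS;		/* clean traffic: normal stack delivery */

	for (i = 0; i < ETH_ALEN; i++)		/* divert to the monitor host */
		eth->h_dest[i] = monitor_mac[i];

	return XDP_TX;				/* bounce it back out the same port */
}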


[PATCH v4 net-next 1/1] net_sched: Introduce skbmod action

2016-09-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debuggability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

To: 

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
ethertype. Then make a few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that the packet doesn't confuse an interconnecting (learning) switch
or get dropped by the target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbmod.h|  30 
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +
 net/sched/Kconfig |  11 ++
 net/sched/Makefile|   1 +
 net/sched/act_skbmod.c| 293 ++
 5 files changed, 374 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include 
+#include 
+
+struct tcf_skbmod_params {
+   struct rcu_head rcu;
+   u64 flags; /*up to 64 types of operations; extend if needed */
+   u8  eth_dst[ETH_ALEN];
+   u16 eth_type;
+   u8  eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+   struct tc_actioncommon;
+   struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h 
b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include 
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC  0x1
+#define SKBMOD_F_SMAC  0x2
+#define SKBMOD_F_ETYPE 0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+   tc_gen;
+   __u64 flags;
+};
+
+enum {
+   TCA_SKBMOD_UNSPEC,
+   TCA_SKBMOD_TM,
+   TCA_SKBMOD_PARMS,
+   TCA_SKBMOD_DMAC,
+   TCA_SKBMOD_SMAC,
+   TCA_SKBMOD_ETYPE,
+   TCA_SKBMOD_PAD,
+   __TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
  To compile this code as a module, choose M here: the
  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+tristate "skb data modification action"
+depends on NET_CLS_ACT
+---help---
+ Say Y here to allow modification of skb data
+
+ If unsure, say N.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_skbmod.
+
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmar

Re: [RFC PATCH 9/9] ethernet: sun8i-emac: add pm_runtime support

2016-09-12 Thread Maxime Ripard
Hi,

On Fri, Sep 09, 2016 at 02:45:17PM +0200, Corentin Labbe wrote:
> This patch add pm_runtime support to sun8i-emac.
> For the moment, only basic support is added, (the device is marked as
> used when net/open)
> 
> Signed-off-by: Corentin Labbe 
> ---
>  drivers/net/ethernet/allwinner/sun8i-emac.c | 62 
> -
>  1 file changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/allwinner/sun8i-emac.c 
> b/drivers/net/ethernet/allwinner/sun8i-emac.c
> index 1c4bc80..cce886e 100644
> --- a/drivers/net/ethernet/allwinner/sun8i-emac.c
> +++ b/drivers/net/ethernet/allwinner/sun8i-emac.c
> @@ -9,7 +9,6 @@
>   * - MAC filtering
>   * - Jumbo frame
>   * - features rx-all (NETIF_F_RXALL_BIT)
> - * - PM runtime
>   */
>  #include 
>  #include 
> @@ -27,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1301,11 +1301,18 @@ static int sun8i_emac_open(struct net_device *ndev)
>   int err;
>   u32 v;
>  
> + err = pm_runtime_get_sync(priv->dev);
> + if (err) {
> + pm_runtime_put_noidle(priv->dev);
> + dev_err(priv->dev, "pm_runtime error: %d\n", err);
> + return err;
> + }
> +
>   err = request_irq(priv->irq, sun8i_emac_dma_interrupt, 0,
> dev_name(priv->dev), ndev);
>   if (err) {
>   dev_err(priv->dev, "Cannot request IRQ: %d\n", err);
> - return err;
> + goto err_runtime;
>   }
>  
>   /* Set interface mode (and configure internal PHY on H3) */
> @@ -1395,6 +1402,8 @@ err_syscon:
>   sun8i_emac_unset_syscon(ndev);
>  err_irq:
>   free_irq(priv->irq, ndev);
> +err_runtime:
> + pm_runtime_put(priv->dev);
>   return err;
>  }
>  
> @@ -1483,6 +1492,8 @@ static int sun8i_emac_stop(struct net_device *ndev)
>   dma_free_coherent(priv->dev, priv->nbdesc_tx * sizeof(struct dma_desc),
> priv->dd_tx, priv->dd_tx_phy);
>  
> + pm_runtime_put(priv->dev);
> +
>   return 0;
>  }
>  
> @@ -2210,6 +2221,8 @@ static int sun8i_emac_probe(struct platform_device 
> *pdev)
>   goto probe_err;
>   }
>  
> + pm_runtime_enable(priv->dev);
> +
>   return 0;
>  
>  probe_err:
> @@ -2221,6 +2234,8 @@ static int sun8i_emac_remove(struct platform_device 
> *pdev)
>  {
>   struct net_device *ndev = platform_get_drvdata(pdev);
>  
> + pm_runtime_disable(&pdev->dev);
> +
>   unregister_netdev(ndev);
>   platform_set_drvdata(pdev, NULL);
>   free_netdev(ndev);
> @@ -2228,6 +2243,47 @@ static int sun8i_emac_remove(struct platform_device 
> *pdev)
>   return 0;
>  }
>  
> +static int __maybe_unused sun8i_emac_suspend(struct platform_device *pdev, 
> pm_message_t state)
> +{
> + struct net_device *ndev = platform_get_drvdata(pdev);
> + struct sun8i_emac_priv *priv = netdev_priv(ndev);
> +
> + napi_disable(&priv->napi);
> +
> + if (netif_running(ndev))
> + netif_device_detach(ndev);
> +
> + sun8i_emac_stop_tx(ndev);
> + sun8i_emac_stop_rx(ndev);
> +
> + sun8i_emac_rx_clean(ndev);
> + sun8i_emac_tx_clean(ndev);
> +
> + phy_stop(ndev->phydev);
> +
> + return 0;
> +}
> +
> +static int __maybe_unused sun8i_emac_resume(struct platform_device *pdev)
> +{
> + struct net_device *ndev = platform_get_drvdata(pdev);
> + struct sun8i_emac_priv *priv = netdev_priv(ndev);
> +
> + phy_start(ndev->phydev);
> +
> + sun8i_emac_start_tx(ndev);
> + sun8i_emac_start_rx(ndev);
> +
> + if (netif_running(ndev))
> + netif_device_attach(ndev);
> +
> + netif_start_queue(ndev);
> +
> + napi_enable(&priv->napi);
> +
> + return 0;
> +}

The main idea behind the runtime PM hooks is that they bring the
device to a working state and shut it down when it's not needed
anymore.

However, they shouldn't be called when the device is still in use, so
all the mangling with NAPI, the phy and so on is irrelevant here, but
the clocks, resets, for example, are.

>  static const struct of_device_id sun8i_emac_of_match_table[] = {
>   { .compatible = "allwinner,sun8i-a83t-emac",
> .data = &emac_variant_a83t },
> @@ -2246,6 +2302,8 @@ static struct platform_driver sun8i_emac_driver = {
>   .name   = "sun8i-emac",
>   .of_match_table = sun8i_emac_of_match_table,
>   },
> + .suspend= sun8i_emac_suspend,
> + .resume = sun8i_emac_resume,

These are not the runtime PM hooks. How did you test that?

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
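
For comparison, a minimal sketch of runtime PM callbacks wired up through dev_pm_ops,
which is what the comment above asks for; the clock field (priv->ahb_clk) and the
function names are assumptions for illustration, not the driver's actual layout:

#include <linux/clk.h>
#include <linux/pm_runtime.h>

/* Sketch only: gate the (assumed) bus clock in the runtime PM callbacks
 * and register them through dev_pm_ops instead of the legacy
 * platform_driver .suspend/.resume.
 */
static int __maybe_unused sun8i_emac_runtime_suspend(struct device *dev)
{
	struct net_device *ndev = dev_get_drvdata(dev);
	struct sun8i_emac_priv *priv = netdev_priv(ndev);

	clk_disable_unprepare(priv->ahb_clk);
	return 0;
}

static int __maybe_unused sun8i_emac_runtime_resume(struct device *dev)
{
	struct net_device *ndev = dev_get_drvdata(dev);
	struct sun8i_emac_priv *priv = netdev_priv(ndev);

	return clk_prepare_enable(priv->ahb_clk);
}

static const struct dev_pm_ops sun8i_emac_pm_ops = {
	SET_RUNTIME_PM_OPS(sun8i_emac_runtime_suspend,
			   sun8i_emac_runtime_resume, NULL)
};

/* and in the platform_driver:
 *	.driver = {
 *		.name		= "sun8i-emac",
 *		.of_match_table	= sun8i_emac_of_match_table,
 *		.pm		= &sun8i_emac_pm_ops,
 *	},
 */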




Re: [RFC V3 PATCH 18/26] net/netpolicy: set tx queues according to policy

2016-09-12 Thread Tom Herbert
On Mon, Sep 12, 2016 at 7:55 AM,   wrote:
> From: Kan Liang 
>
> When the device tries to transmit a packet, netdev_pick_tx is called to
> find the available tx queues. If the net policy is applied, it picks up
> the assigned tx queue from net policy subsystem, and redirect the
> traffic to the assigned queue.
>
> Signed-off-by: Kan Liang 
> ---
>  include/net/sock.h |  9 +
>  net/core/dev.c | 20 ++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e1e9e3d..ca97f35 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2280,4 +2280,13 @@ extern int sysctl_optmem_max;
>  extern __u32 sysctl_wmem_default;
>  extern __u32 sysctl_rmem_default;
>
> +/* Return netpolicy instance information from socket. */
> +static inline struct netpolicy_instance *netpolicy_find_instance(struct sock 
> *sk)
> +{
> +#ifdef CONFIG_NETPOLICY
> +   if (is_net_policy_valid(sk->sk_netpolicy.policy))
> +   return &sk->sk_netpolicy;
> +#endif
> +   return NULL;
> +}
>  #endif /* _SOCK_H */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 34b5322..b9a8044 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3266,6 +3266,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device 
> *dev,
> struct sk_buff *skb,
> void *accel_priv)
>  {
> +   struct sock *sk = skb->sk;
> int queue_index = 0;
>
>  #ifdef CONFIG_XPS
> @@ -3280,8 +3281,23 @@ struct netdev_queue *netdev_pick_tx(struct net_device 
> *dev,
> if (ops->ndo_select_queue)
> queue_index = ops->ndo_select_queue(dev, skb, 
> accel_priv,
> __netdev_pick_tx);
> -   else
> -   queue_index = __netdev_pick_tx(dev, skb);
> +   else {
> +#ifdef CONFIG_NETPOLICY
> +   struct netpolicy_instance *instance;
> +
> +   queue_index = -1;
> +   if (dev->netpolicy && sk) {
> +   instance = netpolicy_find_instance(sk);
> +   if (instance) {
> +   if (!instance->dev)
> +   instance->dev = dev;
> +   queue_index = 
> netpolicy_pick_queue(instance, false);
> +   }
> +   }
> +   if (queue_index < 0)
> +#endif

I doubt this produces the intended effect. Several drivers use
ndo_select_queue (such as mlx4) where they might do something special
for a few packets but end up calling the default handler,
__netdev_pick_tx, for most packets. So in such cases the netpolicy path
would be routinely bypassed. Maybe this code should be in
__netdev_pick_tx.

Tom

> +   queue_index = __netdev_pick_tx(dev, skb);
> +   }
>
> if (!accel_priv)
> queue_index = netdev_cap_txqueue(dev, queue_index);
> --
> 2.5.5
>
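
A rough sketch of that suggestion — factor the policy lookup into a helper called from
__netdev_pick_tx() so the fallback path also honours it; the helper names are taken
from this RFC series and the fallback is abridged:

/* Sketch: a helper used from __netdev_pick_tx(), so drivers whose
 * ndo_select_queue() falls back to __netdev_pick_tx() also hit the
 * policy lookup.
 */
static int netpolicy_tx_queue(struct net_device *dev, struct sk_buff *skb)
{
#ifdef CONFIG_NETPOLICY
	struct sock *sk = skb->sk;

	if (dev->netpolicy && sk) {
		struct netpolicy_instance *instance = netpolicy_find_instance(sk);

		if (instance) {
			if (!instance->dev)
				instance->dev = dev;
			return netpolicy_pick_queue(instance, false);
		}
	}
#endif
	return -1;	/* no policy: use the normal selection below */
}

/* ...and at the top of __netdev_pick_tx():
 *
 *	int queue_index = netpolicy_tx_queue(dev, skb);
 *
 *	if (queue_index < 0)
 *		queue_index = ...existing sk_tx_queue/XPS logic...;
 */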


Re: [PATCH 00/26] constify local structures

2016-09-12 Thread Jarkko Sakkinen
On Mon, Sep 12, 2016 at 04:43:58PM +0300, Felipe Balbi wrote:
> 
> Hi,
> 
> Jarkko Sakkinen  writes:
> > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote:
> >> 
> >> 
> >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote:
> >> 
> >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote:
> >> > > Constify local structures.
> >> > >
> >> > > The semantic patch that makes this change is as follows:
> >> > > (http://coccinelle.lip6.fr/)
> >> >
> >> > Just my two cents but:
> >> >
> >> > 1. You *can* use a static analysis too to find bugs or other issues.
> >> > 2. However, you should manually do the commits and proper commit
> >> >messages to subsystems based on your findings. And I generally think
> >> >that if one contributes code one should also at least smoke test 
> >> > changes
> >> >somehow.
> >> >
> >> > I don't know if I'm alone with my opinion. I just think that one should
> >> > also do the analysis part and not blindly create and submit patches.
> >> 
> >> All of the patches are compile tested.  And the individual patches are
> >
> > Compile-testing is not testing. If you are not able to test a commit,
> > you should explain why.
> 
> Dude, Julia has been doing semantic patching for years already and
> nobody has raised any concerns so far. There's already an expectation
> that Coccinelle *works* and Julia's semantic patches are sound.
> 
> Besides, adding 'const' is something that causes virtually no functional
> changes to the point that build-testing is really all you need. Any
> problems caused by adding 'const' to a definition will be seen by build
> errors or warnings.
> 
> Really, just stop with the pointless discussion and go read a bit about
> Coccinelle and what semantic patches are giving you. The work done by
> Julia and her peers at INRIA has measurable benefits.
> 
> You're really making a thunderstorm in a glass of water.

Hmm... I've been using coccinelle in cyclic basis for some time now.
My comment was oversized but I didn't mean it to be impolite or attack
of any kind for that matter.

> -- 
> balbi

/Jarkko


[PATCH net-next 2/2] net: deprecate eth_change_mtu, remove usage

2016-09-12 Thread Jarod Wilson
With centralized MTU checking, there's nothing productive done by
eth_change_mtu that isn't already done in dev_set_mtu, so mark it as
deprecated and remove all usage of it in the kernel. All callers have been
audited for calls to alloc_etherdev* or ether_setup directly, which means
they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu
prints out a netdev_warn about being deprecated, for the benefit of
out-of-tree drivers that might be utilizing it.

Of note, dvb_net.c actually had dev->mtu = 4096, while using
eth_change_mtu, meaning that if you ever tried changing it's mtu, you
couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set
to 4096 to remedy that.

CC: "David S. Miller" 
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 arch/m68k/emu/nfeth.c | 1 -
 drivers/isdn/hysdn/hysdn_net.c| 1 -
 drivers/media/dvb-core/dvb_net.c  | 2 +-
 drivers/net/appletalk/ipddp.c | 1 -
 drivers/net/cris/eth_v10.c| 1 -
 drivers/net/ethernet/3com/3c509.c | 1 -
 drivers/net/ethernet/3com/3c515.c | 1 -
 drivers/net/ethernet/3com/3c574_cs.c  | 1 -
 drivers/net/ethernet/3com/3c589_cs.c  | 1 -
 drivers/net/ethernet/3com/3c59x.c | 2 --
 drivers/net/ethernet/3com/typhoon.c   | 1 -
 drivers/net/ethernet/8390/8390.c  | 1 -
 drivers/net/ethernet/8390/8390p.c | 1 -
 drivers/net/ethernet/8390/ax88796.c   | 1 -
 drivers/net/ethernet/8390/axnet_cs.c  | 1 -
 drivers/net/ethernet/8390/etherh.c| 1 -
 drivers/net/ethernet/8390/hydra.c | 1 -
 drivers/net/ethernet/8390/mac8390.c   | 1 -
 drivers/net/ethernet/8390/mcf8390.c   | 1 -
 drivers/net/ethernet/8390/ne2k-pci.c  | 1 -
 drivers/net/ethernet/8390/pcnet_cs.c  | 1 -
 drivers/net/ethernet/8390/smc-ultra.c | 1 -
 drivers/net/ethernet/8390/wd.c| 1 -
 drivers/net/ethernet/8390/zorro8390.c | 1 -
 drivers/net/ethernet/adaptec/starfire.c   | 1 -
 drivers/net/ethernet/adi/bfin_mac.c   | 1 -
 drivers/net/ethernet/allwinner/sun4i-emac.c   | 1 -
 drivers/net/ethernet/amd/a2065.c  | 1 -
 drivers/net/ethernet/amd/am79c961a.c  | 1 -
 drivers/net/ethernet/amd/ariadne.c| 1 -
 drivers/net/ethernet/amd/atarilance.c | 1 -
 drivers/net/ethernet/amd/au1000_eth.c | 1 -
 drivers/net/ethernet/amd/declance.c   | 1 -
 drivers/net/ethernet/amd/hplance.c| 1 -
 drivers/net/ethernet/amd/lance.c  | 1 -
 drivers/net/ethernet/amd/mvme147.c| 1 -
 drivers/net/ethernet/amd/ni65.c   | 1 -
 drivers/net/ethernet/amd/nmclan_cs.c  | 1 -
 drivers/net/ethernet/amd/pcnet32.c| 1 -
 drivers/net/ethernet/amd/sun3lance.c  | 1 -
 drivers/net/ethernet/amd/sunlance.c   | 1 -
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c  | 1 -
 drivers/net/ethernet/apple/bmac.c | 1 -
 drivers/net/ethernet/apple/mace.c | 1 -
 drivers/net/ethernet/apple/macmace.c  | 1 -
 drivers/net/ethernet/aurora/nb8800.c  | 1 -
 drivers/net/ethernet/cadence/macb.c   | 1 -
 drivers/net/ethernet/cirrus/cs89x0.c  | 1 -
 drivers/net/ethernet/cirrus/ep93xx_eth.c  | 1 -
 drivers/net/ethernet/cirrus/mac89x0.c | 1 -
 drivers/net/ethernet/davicom/dm9000.c | 1 -
 drivers/net/ethernet/dec/tulip/de2104x.c  | 1 -
 drivers/net/ethernet/dec/tulip/de4x5.c| 1 -
 drivers/net/ethernet/dec/tulip/dmfe.c | 1 -
 drivers/net/ethernet/dec/tulip/tulip_core.c   | 1 -
 drivers/net/ethernet/dec/tulip/uli526x.c  | 1 -
 drivers/net/ethernet/dec/tulip/winbond-840.c  | 1 -
 drivers/net/ethernet/dec/tulip/xircom_cb.c| 1 -
 drivers/net/ethernet/dnet.c   | 1 -
 drivers/net/ethernet/ec_bhf.c | 1 -
 drivers/net/ethernet/fealnx.c | 1 -
 drivers/net/ethernet/freescale/fec_main.c | 1 -
 drivers/net/ethernet/freescale/fec_mpc52xx.c  | 1 -
 drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c | 1 -
 drivers/net/ethernet/freescale/ucc_geth.c | 1 -
 drivers/net/ethernet/fujitsu/fmvj18x_cs.c | 1 -
 drivers/net/ethernet/hisilicon/hip04_eth.c| 1 -
 drivers/net/ethernet/hisilicon/hisi_femac.c   | 1 -
 drivers/net/ethernet/hp/hp100.c   | 2 --

[PATCH net-next 0/2] net: centralize net_device MTU bounds checking

2016-09-12 Thread Jarod Wilson
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simply store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().

This pair of patches looks to introduce centralized MTU range checking
infrastructure, while maintaining compatibility with all existing drivers,
and start to make use of it, converting all eth_change_mtu/ether_setup users
over to this new infra.

Assuming these pass review muster, I've got a ton of follow-on patches to
clean up MTU settings for everything in the kernel with an ndo_change_mtu.

Jarod Wilson (2):
  net: centralize net_device min/max MTU checking
  net: deprecate eth_change_mtu, remove usage

 arch/m68k/emu/nfeth.c |  1 -
 drivers/isdn/hysdn/hysdn_net.c|  1 -
 drivers/media/dvb-core/dvb_net.c  |  2 +-
 drivers/net/appletalk/ipddp.c |  1 -
 drivers/net/cris/eth_v10.c|  1 -
 drivers/net/ethernet/3com/3c509.c |  1 -
 drivers/net/ethernet/3com/3c515.c |  1 -
 drivers/net/ethernet/3com/3c574_cs.c  |  1 -
 drivers/net/ethernet/3com/3c589_cs.c  |  1 -
 drivers/net/ethernet/3com/3c59x.c |  2 --
 drivers/net/ethernet/3com/typhoon.c   |  1 -
 drivers/net/ethernet/8390/8390.c  |  1 -
 drivers/net/ethernet/8390/8390p.c |  1 -
 drivers/net/ethernet/8390/ax88796.c   |  1 -
 drivers/net/ethernet/8390/axnet_cs.c  |  1 -
 drivers/net/ethernet/8390/etherh.c|  1 -
 drivers/net/ethernet/8390/hydra.c |  1 -
 drivers/net/ethernet/8390/mac8390.c   |  1 -
 drivers/net/ethernet/8390/mcf8390.c   |  1 -
 drivers/net/ethernet/8390/ne2k-pci.c  |  1 -
 drivers/net/ethernet/8390/pcnet_cs.c  |  1 -
 drivers/net/ethernet/8390/smc-ultra.c |  1 -
 drivers/net/ethernet/8390/wd.c|  1 -
 drivers/net/ethernet/8390/zorro8390.c |  1 -
 drivers/net/ethernet/adaptec/starfire.c   |  1 -
 drivers/net/ethernet/adi/bfin_mac.c   |  1 -
 drivers/net/ethernet/allwinner/sun4i-emac.c   |  1 -
 drivers/net/ethernet/amd/a2065.c  |  1 -
 drivers/net/ethernet/amd/am79c961a.c  |  1 -
 drivers/net/ethernet/amd/ariadne.c|  1 -
 drivers/net/ethernet/amd/atarilance.c |  1 -
 drivers/net/ethernet/amd/au1000_eth.c |  1 -
 drivers/net/ethernet/amd/declance.c   |  1 -
 drivers/net/ethernet/amd/hplance.c|  1 -
 drivers/net/ethernet/amd/lance.c  |  1 -
 drivers/net/ethernet/amd/mvme147.c|  1 -
 drivers/net/ethernet/amd/ni65.c   |  1 -
 drivers/net/ethernet/amd/nmclan_cs.c  |  1 -
 drivers/net/ethernet/amd/pcnet32.c|  1 -
 drivers/net/ethernet/amd/sun3lance.c  |  1 -
 drivers/net/ethernet/amd/sunlance.c   |  1 -
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c  |  1 -
 drivers/net/ethernet/apple/bmac.c |  1 -
 drivers/net/ethernet/apple/mace.c |  1 -
 drivers/net/ethernet/apple/macmace.c  |  1 -
 drivers/net/ethernet/aurora/nb8800.c  |  1 -
 drivers/net/ethernet/cadence/macb.c   |  1 -
 drivers/net/ethernet/cirrus/cs89x0.c  |  1 -
 drivers/net/ethernet/cirrus/ep93xx_eth.c  |  1 -
 drivers/net/ethernet/cirrus/mac89x0.c |  1 -
 drivers/net/ethernet/davicom/dm9000.c |  1 -
 drivers/net/ethernet/dec/tulip/de2104x.c  |  1 -
 drivers/net/ethernet/dec/tulip/de4x5.c|  1 -
 drivers/net/ethernet/dec/tulip/dmfe.c |  1 -
 drivers/net/ethernet/dec/tulip/tulip_core.c   |  1 -
 drivers/net/ethernet/dec/tulip/uli526x.c  |  1 -
 drivers/net/ethernet/dec/tulip/winbond-840.c  |  1 -
 drivers/net/ethernet/dec/tulip/xircom_cb.c|  1 -
 drivers/net/ethernet/dnet.c   |  1 -
 drivers/net/ethernet/ec_bhf.c |  1 -
 drivers/net/ethernet/fealnx.c |  1 -
 drivers/net/ethernet/freescale/fec_main.c |  1 -
 drivers/net/ethernet/freescale/fec_mpc52xx.c  

[PATCH net-next 1/2] net: centralize net_device min/max MTU checking

2016-09-12 Thread Jarod Wilson
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simply store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().

In theory, there should be zero functional change with this patch, it just
puts the infrastructure in place. Subsequent patches will attempt to start
using said infrastructure, with theoretically zero change in
functionality.

CC: "David S. Miller" 
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 include/linux/netdevice.h |  4 
 net/core/dev.c| 12 ++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 67bb978..e466949 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1490,6 +1490,8 @@ enum netdev_priv_flags {
  * @if_port:   Selectable AUI, TP, ...
  * @dma:   DMA channel
  * @mtu:   Interface MTU value
+ * @min_mtu:   Interface Minimum MTU value
+ * @max_mtu:   Interface Maximum MTU value
  * @type:  Interface hardware type
  * @hard_header_len: Maximum hardware header length.
  *
@@ -1710,6 +1712,8 @@ struct net_device {
unsigned char   dma;
 
unsigned intmtu;
+   unsigned intmin_mtu;
+   unsigned intmax_mtu;
unsigned short  type;
unsigned short  hard_header_len;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 34b5322..9785588 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6473,9 +6473,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
if (new_mtu == dev->mtu)
return 0;
 
-   /*  MTU must be positive.*/
-   if (new_mtu < 0)
+   if (new_mtu < dev->min_mtu) {
+   net_err_ratelimited("%s: Invalid MTU %d requested, hw min %d\n",
+   dev->name, new_mtu, dev->min_mtu);
return -EINVAL;
+   }
+
+   if (dev->max_mtu > 0 && new_mtu > dev->max_mtu) {
+   net_err_ratelimited("%s: Invalid MTU %d requested, hw max %d\n",
+   dev->name, new_mtu, dev->max_mtu);
+   return -EINVAL;
+   }
 
if (!netif_device_present(dev))
return -ENODEV;
-- 
2.10.0
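
To make the intended usage concrete, a hypothetical follow-up conversion for a driver
(here called "foo", entirely made up) reduces to publishing the bounds once and
dropping the local range check:

/* Hypothetical driver "foo": publish the bounds once at setup time and
 * let dev_set_mtu() enforce them; a range-check-only ndo_change_mtu can
 * then be dropped entirely.
 */
static int foo_init_netdev(struct net_device *ndev)
{
	ether_setup(ndev);

	ndev->min_mtu = 68;	/* smallest usable Ethernet MTU */
	ndev->max_mtu = 9000;	/* whatever the hardware really supports */

	/* No per-driver range check any more: dev_set_mtu() now rejects
	 * values outside [min_mtu, max_mtu] centrally.
	 */
	return register_netdev(ndev);
}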



Re: [PATCH net-next 2/2] errqueue: include linux/time.h

2016-09-12 Thread Willem de Bruijn
On Mon, Sep 12, 2016 at 3:26 PM, kbuild test robot  wrote:
> Hi Willem,
>
> [auto build test ERROR on net-next/master]
>
> url:
> https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/uapi-include-time-h-from-errqueue-h/20160913-020431
> config: i386-defconfig (attached as .config)
> compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):

This error report shows that breakage can occur with applications that
include linux/errqueue.h before the libc time headers.

The libc-compat definitions in this patch set only fix compilation
when uapi headers are included after the userspace headers.

These errors indeed go away when errqueue.h is included after the
userspace time includes, as in the diff below.

For the inverse, the libc headers need additional #if __UAPI_DEF_FOO
changes, as described in include/uapli/linux/libc-compat.h. Those
changes are a noop without kernel definitions, so arguably that libc
patch should be merged before this kernel patch.

I will remove this patch set from the patchwork queue for now.

diff --git a/Documentation/networking/timestamping/txtimestamp.c
b/Documentation/networking/timestamping/txtimestamp.c
index 5df0704..f073801 100644
--- a/Documentation/networking/timestamping/txtimestamp.c
+++ b/Documentation/networking/timestamping/txtimestamp.c
@@ -37,7 +37,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -59,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 


Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov  wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > Every time you call the stack code, then you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX, this is a costly operation.  
> > 
> [...]
> > To make further progress in this discussion can we talk about
> > the use case you have in mind instead? Then solution will
> > be much clear, I hope.
> 
> The DDoS use-case _is_ affected by this "hidden" bulking design.
> 
> Lets say, I want to implement a DDoS facility. Instead of just
> dropping the malicious packets, I want to see the bad packets.  I
> implement this by rewriting the destination-MAC to be my monitor
> machine and then XDP_TX the packet.

not following the use case. you want to implement a DDoS generator?
Or just forward all bad packets from affected host to another host
in the same rack? so two servers will be spammed with traffic and
even more load on the tor? I really don't see how this is useful
for anything but stress testing.

> In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> of the traffic is delivered to the stack. (See note 1)

hmm. DoS prevention use case is when 99% of the traffic is dropped.

> Once the DDoS attack starts, then the traffic pattern changes, and XDP
> should (hopefully only) catch the malicious traffic (monitor machine can
> help diagnose false positive).  Now, due to interleaving the DDoS
> traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> more icache misses...
> 
> 
> 
> Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> program with 100% delivery to the stack, actually slows down the
> normal stack.  This is due to hitting a bottleneck in the page
> allocator.  I'm working removing that bottleneck with page_pool, and
> that solution is orthogonal to this problem.

sure. no one arguing against improving page allocator.

>  It is actually an excellent argument, for why you would want to run a
> DDoS XDP filter only on a restricted number of RX queues.

no. it's the opposite. If the host is under DoS there is no way
the host can tell in advance which rx queue will be seeing bad packets.



Re: [PATCH net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail

2016-09-12 Thread Marcelo Ricardo Leitner
On Sat, Sep 10, 2016 at 12:03:53AM +0800, Xin Long wrote:
> > That said, have you considered the retransmit case?  That is to say, if you
> > queue and flush the outq, and some packets fail delivery, and in the time
> > between the intial send and the expiration of the RTX timer (during which 
> > the
> > socket lock will have been released), an event may occur which changes the
> > transport state, which will then be ignored with your patch.
> Sorry, I'm not sure if I got it.
> 
> You mean "during which changes q->asoc->state", right ?
> 
> This patch removes the check of q->asoc->state in sctp_outq_tail().
> 
> sctp_outq_tail() is called for data only in:
> sctp_primitive_SEND -> sctp_do_sm -> sctp_cmd_send_msg ->
> sctp_cmd_interpreter -> sctp_cmd_send_msg() -> sctp_outq_tail()
> 
> before calling sctp_primitive_SEND, hold sock lock first.
> then sctp_primitive_SEND choose FUNC according:
> 
> #define TYPE_SCTP_PRIMITIVE_SEND  {
> 
> 
> if asoc->state is unavailable, FUNC can't be sctp_cmd_send_msg
> but sctp_sf_error_closed/sctp_sf_error_shutdown, so sctp_outq_tail
> can't be called either.
> I mean sctp_primitive_SEND do the same check for asoc->state
> already actually.
> 
> so the code in sctp_outq_tail is redundant actually.

I also don't see an issue with this patch, btw.

Xin, you may want to add more/such details to the changelog, specially
about the timer versus primitive handling.

  Marcelo



Re: [PATCH net 3/6] sctp: free msg->chunks when sctp_primitive_SEND return err

2016-09-12 Thread Marcelo Ricardo Leitner
On Thu, Sep 08, 2016 at 05:31:47PM +0800, Xin Long wrote:
> Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
> made sctp_primitive_SEND return err only when asoc state is unavailable.
> In this case, chunks are not enqueued, they have no chance to be freed if
> we don't take care of them later.
> 
> This Patch is actually to revert commit 1cd4d5c4326a ("sctp: remove the
> unused sctp_datamsg_free()") and commit 8b570dc9f7b6 ("sctp: only drop the
> reference on the datamsg after sending a msg"), to use sctp_datamsg_free to
> free the chunks of current msg.
> 

Considering the subsequent patches in this series are improving error
return and the first is a cleanup/optimization, Xin, this patch and the
previous one both deserve a Fixes tag too, for 8b570dc9f7b6. Thanks

> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/structs.h |  1 +
>  net/sctp/chunk.c   | 13 +
>  net/sctp/socket.c  |  6 --
>  3 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index ce93c4b..f61fb7c 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -537,6 +537,7 @@ struct sctp_datamsg {
>  struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *,
>   struct sctp_sndrcvinfo *,
>   struct iov_iter *);
> +void sctp_datamsg_free(struct sctp_datamsg *);
>  void sctp_datamsg_put(struct sctp_datamsg *);
>  void sctp_chunk_fail(struct sctp_chunk *, int error);
>  int sctp_chunk_abandoned(struct sctp_chunk *);
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index a55e547..af9cc80 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -70,6 +70,19 @@ static struct sctp_datamsg *sctp_datamsg_new(gfp_t gfp)
>   return msg;
>  }
>  
> +void sctp_datamsg_free(struct sctp_datamsg *msg)
> +{
> + struct sctp_chunk *chunk;
> +
> + /* This doesn't have to be a _safe variant because
> +  * sctp_chunk_free() only drops the refs.
> +  */
> + list_for_each_entry(chunk, &msg->chunks, frag_list)
> + sctp_chunk_free(chunk);
> +
> + sctp_datamsg_put(msg);
> +}
> +
>  /* Final destructruction of datamsg memory. */
>  static void sctp_datamsg_destroy(struct sctp_datamsg *msg)
>  {
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 9fc417a..9d8e2be 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1970,13 +1970,15 @@ static int sctp_sendmsg(struct sock *sk, struct 
> msghdr *msg, size_t msg_len)
>* breaks.
>*/
>   err = sctp_primitive_SEND(net, asoc, datamsg);
> - sctp_datamsg_put(datamsg);
>   /* Did the lower layer accept the chunk? */
> - if (err)
> + if (err) {
> + sctp_datamsg_free(datamsg);
>   goto out_free;
> + }
>  
>   pr_debug("%s: we sent primitively\n", __func__);
>  
> + sctp_datamsg_put(datamsg);
>   err = msg_len;
>  
>   if (unlikely(wait_connect)) {
> -- 
> 2.1.0
> 
> 


Re: [ovs-dev] [PATCH 2/2] openvswitch: use percpu flow stats

2016-09-12 Thread pravin shelar
On Fri, Sep 9, 2016 at 1:41 PM, Thadeu Lima de Souza Cascardo
 wrote:
> Instead of using flow stats per NUMA node, use it per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On a E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there were no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> ---
>  net/openvswitch/flow.c   | 43 +++
>  net/openvswitch/flow.h   |  4 ++--
>  net/openvswitch/flow_table.c | 23 ---
>  3 files changed, 37 insertions(+), 33 deletions(-)
>
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 3609f37..2970a9f 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
> tcp_flags,
>  {
> struct flow_stats *stats;
> int node = numa_node_id();
> +   int cpu = get_cpu();
> int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
>
This function is always called from BH context. So calling
smp_processor_id() for cpu id is fine. There is no need to handle
pre-emption here.

> -   stats = rcu_dereference(flow->stats[node]);
> +   stats = rcu_dereference(flow->stats[cpu]);
>
...
> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index 957a3c3..60e5ae0 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
...

> @@ -102,9 +103,9 @@ struct sw_flow *ovs_flow_alloc(void)
>
> RCU_INIT_POINTER(flow->stats[0], stats);
>
> -   for_each_node(node)
> -   if (node != 0)
> -   RCU_INIT_POINTER(flow->stats[node], NULL);
> +   for_each_possible_cpu(cpu)
> +   if (cpu != 0)
> +   RCU_INIT_POINTER(flow->stats[cpu], NULL);
>
I think at this point we should just use GFP_ZERO flag for allocating
struct flow.
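
A sketch of both review points, for illustration rather than as a replacement patch:

/* ovs_flow_stats_update() always runs in BH context, so preemption is
 * already disabled and the plain smp_processor_id() is enough:
 */
void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
			   const struct sk_buff *skb)
{
	int cpu = smp_processor_id();	/* no get_cpu()/put_cpu() pair needed */
	struct flow_stats *stats = rcu_dereference(flow->stats[cpu]);
	int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);

	/* ... rest of the update as in the patch, minus the put_cpu() ... */
}

/* And in ovs_flow_alloc(), assuming flow_cache is created without a
 * constructor, a zeroing allocation removes the explicit per-CPU NULL
 * loop:
 *
 *	flow = kmem_cache_zalloc(flow_cache, GFP_KERNEL);
 */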


Re: [PATCH net-next 2/2] errqueue: include linux/time.h

2016-09-12 Thread kbuild test robot
Hi Willem,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/uapi-include-time-h-from-errqueue-h/20160913-020431
config: i386-defconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from ./usr/include/linux/errqueue.h:4:0,
from 
Documentation/networking/timestamping/timestamping.c:46:
>> ./usr/include/linux/time.h:26:8: error: redefinition of 'struct timezone'
struct timezone {
   ^~~~
   In file included from 
Documentation/networking/timestamping/timestamping.c:37:0:
   /usr/include/x86_64-linux-gnu/sys/time.h:55:8: note: originally defined here
struct timezone
   ^~~~
--
   In file included from ./usr/include/linux/errqueue.h:4:0,
from Documentation/networking/timestamping/txtimestamp.c:40:
>> ./usr/include/linux/time.h:19:8: error: redefinition of 'struct timeval'
struct timeval {
   ^~~
   In file included from /usr/include/x86_64-linux-gnu/sys/select.h:45:0,
from /usr/include/x86_64-linux-gnu/sys/types.h:219,
from /usr/include/x86_64-linux-gnu/sys/uio.h:23,
from /usr/include/x86_64-linux-gnu/sys/socket.h:26,
from /usr/include/netinet/in.h:23,
from /usr/include/arpa/inet.h:22,
from Documentation/networking/timestamping/txtimestamp.c:35:
   /usr/include/x86_64-linux-gnu/bits/time.h:30:8: note: originally defined here
struct timeval
   ^~~
   In file included from 
Documentation/networking/timestamping/txtimestamp.c:59:0:
>> /usr/include/x86_64-linux-gnu/sys/time.h:55:8: error: redefinition of 
>> 'struct timezone'
struct timezone
   ^~~~
   In file included from ./usr/include/linux/errqueue.h:4:0,
from Documentation/networking/timestamping/txtimestamp.c:40:
   ./usr/include/linux/time.h:26:8: note: originally defined here
struct timezone {
   ^~~~
>> ./usr/include/linux/time.h:38:22: error: expected identifier before numeric 
>> constant
#define ITIMER_REAL  0
 ^
   In file included from 
Documentation/networking/timestamping/txtimestamp.c:59:0:
>> /usr/include/x86_64-linux-gnu/sys/time.h:107:8: error: redefinition of 
>> 'struct itimerval'
struct itimerval
   ^
   In file included from ./usr/include/linux/errqueue.h:4:0,
from Documentation/networking/timestamping/txtimestamp.c:40:
   ./usr/include/linux/time.h:51:8: note: originally defined here
struct itimerval {
   ^
   In file included from 
Documentation/networking/timestamping/txtimestamp.c:61:0:
>> /usr/include/time.h:161:8: error: redefinition of 'struct itimerspec'
struct itimerspec
   ^~
   In file included from ./usr/include/linux/errqueue.h:4:0,
from Documentation/networking/timestamping/txtimestamp.c:40:
   ./usr/include/linux/time.h:44:8: note: originally defined here
struct itimerspec {
   ^~

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH RFC] e1000: Send XDP_TX packets using dev_queue_xmit

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 09:21:10AM -0700, Tom Herbert wrote:
> In order to use XDP with non multi-queue drivers (such as e1000) we need
> a method to handle XDP_TX return code from xdp program. Since there is
> only one queue the XDP transmit path needs to share that queue with the
> normal stack path, and in order to do that we can't break the existing
> stack's mechanisms of locking, qdiscs, and BQL for the queue.
> 
> This patch allocates an skbuff when XDP_TX is returned from the XDP
> program and call dev_queue_xmit to transmit the skbuff.
> 
> Note that this is expected to be a performance hit (which needs to be
> quantified) relative to a MQ XDP implementation where we can reserve
> queues to be used by XDP. The first intent of these patches is
> correctness. This should not affect performance in the XDP_DROP path.
> 
> This patch needs to be applied after "e1000: add initial XDP support".
> 
> Tested: I've only built this, need to see if I can find some e1000
> machines. John, if you could try this that would be much appreciated.
> 
> Signed-off-by: Tom Herbert 
> ---
>  drivers/net/ethernet/intel/e1000/e1000_main.c | 29 
> ---
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
> b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 91d5c87..f457ea8 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -4338,13 +4338,28 @@ static bool e1000_clean_jumbo_rx_irq(struct 
> e1000_adapter *adapter,
>   case XDP_PASS:
>   break;
>   case XDP_TX:
> - dma_sync_single_for_device(&pdev->dev,
> -dma,
> -length,
> -DMA_TO_DEVICE);
> - e1000_xmit_raw_frame(buffer_info, length,
> -  netdev, adapter);
> - buffer_info->rxbuf.page = NULL;
> + /* There is only one transmit queue so we need
> +  * to inject the packet at the qdisc layer
> +  * to maintain sanity in the stack. We allocate
> +  * an skbuff, set up frags with the page, and
> +  * then do dev_queue_xmit.
> +  */
> + skb = alloc_skb(0, GFP_ATOMIC);
> + if (skb) {
> + dma_unmap_page(&pdev->dev,
> +buffer_info->dma,
> +adapter->rx_buffer_len,
> +DMA_FROM_DEVICE);
> + skb->dev = netdev;
> + skb_fill_page_desc(skb, 0,
> + buffer_info->rxbuf.page,
> + 0, length);
> + dev_queue_xmit(skb);

I don't like this approach, since it does exactly what it's meant to:
it uses the qdisc and everything else in the xmit path of the stack.
It's a serious issue, since for some drivers XDP_TX would mean raw xmit
whereas for e1k it goes via stack. Sooner or later program authors will
start relying on that and that's not acceptable from XDP design point of view.



Re: icmpv6: issue with routing table entries from link local addresses

2016-09-12 Thread David Ahern
On 9/12/16 11:26 AM, Hannes Frederic Sowa wrote:
> Hello,
> 
> On 12.09.2016 16:27, Andreas Hübner wrote:
>> Hi,
>>
>> I'm currently debugging a potential issue with the icmpv6 stack and
>> hopefully this is the correct place to ask. (Was actually looking for a
>> more specific list, but didn't find anything. Please point me to a more
>> appropriate list if this is out of place here.)
>>
>> I have the following setup:
>>   - 2 directly connected hosts (A+B), both have only link local addresses
>> configured (interface on both hosts is eth0)
>>   - host B is also connected to another host C (via interface eth1)
>>   - main routing table (relevant part) on host B looks like this:
>>
>>   fe80::/64 dev eth1  proto kernel  metric 256
>>   fe80::/64 dev eth0  proto kernel  metric 256
>>
>>   - host A is trying to ICMPv6 ping the link local address of host B
>>
>> The issue I currently have is, that the echo reply that host B should
>> generate is never sent back to host A. If I change the order of the
>> routing table entries on host B, everything works fine.
>> (host A is connected on eth0)
>>
>> I'm wondering, if this is how it is supposed to work. Do we need to do a
>> routing table lookup when generating an ICMPv6 echo reply for link local
>> addresses?  (From my understanding, this is not done in the neighbour
>> discovery stack, so why here?)
> 
> For global addresses this is necessary as asymmetric routing could be
> involved and we don't want to treat ping echoes in any way special.
> 
>> Actually, I'm convinced I must be doing something wrong here. The setup
>> for the issue is quite trivial, someone would have tripped over it
>> already. The only condition is that one host has multiple interfaces
>> with ipv6 enabled.
>>
>> Any help in shedding some light onto this issue would be appreciated.
> 
> This shouldn't be the case. We certainly carry over the ifindex of the
> received packet into the routing lookup of the outgoing packet, thus the
> appropriate rule, with outgoing ifindex should be selected.
> 
> I also couldn't reproduce your problem here with my system. Can you
> verify with tcpdump that the packet is leaving on another interface?

v4.4 and on there are fib6 tracepoints that show the lookup result. May provide 
some insights.

perf record -a -e fib6:* 
perf script




Re: [RFC 02/11] Add RoCE driver framework

2016-09-12 Thread Yuval Mintz
>> +uint debug;
>> +module_param(debug, uint, 0);
>> +MODULE_PARM_DESC(debug, "Default debug msglevel");

>Why are you adding this as a module parameter? 

I believe this is mostly to follow same line as qede which also defines
'debug' module parameter for allowing easy user control of debug
prints [& specifically for probe prints, which can't be controlled
otherwise].
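
For context, a minimal sketch (not taken from this series) of how such a
'debug' parameter is typically consumed at probe time, assuming a
qede-style private structure with a msg_enable field; 'dev' and 'pdev'
are placeholders:

        /* turn the raw module parameter into a netif msg bitmap */
        dev->msg_enable = netif_msg_init(debug, NETIF_MSG_DRV | NETIF_MSG_PROBE);
        if (netif_msg_probe(dev))
                dev_info(&pdev->dev, "probe-time diagnostics enabled\n");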
 

Re: [RFC PATCH v3 3/7] proc: Reduce cache miss in snmp6_seq_show

2016-09-12 Thread Marcelo
On Fri, Sep 09, 2016 at 02:33:58PM +0800, Jia He wrote:
> This is to use the generic interface snmp_get_cpu_field{,64}_batch to 
> aggregate the data by going through all the items of each cpu sequentially.
> 
> Signed-off-by: Jia He 
> ---
>  net/ipv6/proc.c | 32 +++-
>  1 file changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
> index 679253d0..50ba2c3 100644
> --- a/net/ipv6/proc.c
> +++ b/net/ipv6/proc.c
> @@ -30,6 +30,11 @@
>  #include 
>  #include 
>  
> +#define MAX4(a, b, c, d) \
> + max_t(u32, max_t(u32, a, b), max_t(u32, c, d))
> +#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
> + IPSTATS_MIB_MAX, ICMP_MIB_MAX)
> +
>  static int sockstat6_seq_show(struct seq_file *seq, void *v)
>  {
>   struct net *net = seq->private;
> @@ -192,13 +197,19 @@ static void snmp6_seq_show_item(struct seq_file *seq, 
> void __percpu *pcpumib,
>   const struct snmp_mib *itemlist)
>  {
>   int i;
> - unsigned long val;
> -
> - for (i = 0; itemlist[i].name; i++) {
> - val = pcpumib ?
> - snmp_fold_field(pcpumib, itemlist[i].entry) :
> - atomic_long_read(smib + itemlist[i].entry);
> - seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
> + unsigned long buff[SNMP_MIB_MAX];
> +
> + memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);

This memset() could be moved...

> +
> + if (pcpumib) {

... here, so it's not executed if it hits the else block.
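
I.e., something along these lines (untested sketch):

        unsigned long buff[SNMP_MIB_MAX];

        if (pcpumib) {
                memset(buff, 0, sizeof(buff));
                snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
                for (i = 0; itemlist[i].name; i++)
                        seq_printf(seq, "%-32s\t%lu\n",
                                   itemlist[i].name, buff[i]);
        } else {
                for (i = 0; itemlist[i].name; i++)
                        seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
                                   atomic_long_read(smib + itemlist[i].entry));
        }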

> + snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
> + for (i = 0; itemlist[i].name; i++)
> + seq_printf(seq, "%-32s\t%lu\n",
> +itemlist[i].name, buff[i]);
> + } else {
> + for (i = 0; itemlist[i].name; i++)
> + seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
> +atomic_long_read(smib + itemlist[i].entry));
>   }
>  }
>  
> @@ -206,10 +217,13 @@ static void snmp6_seq_show_item64(struct seq_file *seq, 
> void __percpu *mib,
> const struct snmp_mib *itemlist, size_t 
> syncpoff)
>  {
>   int i;
> + u64 buff64[SNMP_MIB_MAX];
> +
> + memset(buff64, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
>  
> + snmp_get_cpu_field64_batch(buff64, itemlist, mib, syncpoff);
>   for (i = 0; itemlist[i].name; i++)
> - seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
> -snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
> + seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff64[i]);
>  }
>  
>  static int snmp6_seq_show(struct seq_file *seq, void *v)
> -- 
> 1.8.3.1
> 


Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-12 Thread Marcelo
On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:
> This is to use the generic interface snmp_get_cpu_field{,64}_batch to 
> aggregate the data by going through all the items of each cpu sequentially.
> Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid the
> build warning about "the frame size" being larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

> 
> Signed-off-by: Jia He 
> ---
>  net/ipv4/proc.c | 106 
> +++-
>  1 file changed, 74 insertions(+), 32 deletions(-)
> 
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 9f665b6..c6fc80e 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -46,6 +46,8 @@
>  #include 
>  #include 
>  
> +#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)
> +
>  /*
>   *   Report socket allocation statistics [m...@utu.fi]
>   */
> @@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
>  /*
>   *   Called from the PROCfs module. This outputs /proc/net/snmp.
>   */
> -static int snmp_seq_show(struct seq_file *seq, void *v)
> +static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
>  {
>   int i;
> + u64 buff64[IPSTATS_MIB_MAX];
>   struct net *net = seq->private;
>  
> - seq_puts(seq, "Ip: Forwarding DefaultTTL");
> + memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
>  
> + seq_puts(seq, "Ip: Forwarding DefaultTTL");
>   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
>   seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
>  
> @@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
>  net->ipv4.sysctl_ip_default_ttl);
>  
>   BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
> + snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
> +net->mib.ip_statistics,
> +offsetof(struct ipstats_mib, syncp));
>   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
> - seq_printf(seq, " %llu",
> -snmp_fold_field64(net->mib.ip_statistics,
> -  snmp4_ipstats_list[i].entry,
> -  offsetof(struct ipstats_mib, 
> syncp)));
> + seq_printf(seq, " %llu", buff64[i]);
>  
> - icmp_put(seq);  /* RFC 2011 compatibility */
> - icmpmsg_put(seq);
> + return 0;
> +}
> +
> +static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
> +{
> + int i;
> + unsigned long buff[TCPUDP_MIB_MAX];
> + struct net *net = seq->private;
> +
> + memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
>  
>   seq_puts(seq, "\nTcp:");
>   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
>   seq_printf(seq, " %s", snmp4_tcp_list[i].name);
>  
>   seq_puts(seq, "\nTcp:");
> + snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
> +  net->mib.tcp_statistics);
>   for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
>   /* MaxConn field is signed, RFC 2012 */
>   if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
> - seq_printf(seq, " %ld",
> -snmp_fold_field(net->mib.tcp_statistics,
> -snmp4_tcp_list[i].entry));
> + seq_printf(seq, " %ld", buff[i]);
>   else
> - seq_printf(seq, " %lu",
> -snmp_fold_field(net->mib.tcp_statistics,
> -snmp4_tcp_list[i].entry));
> + seq_printf(seq, " %lu", buff[i]);
>   }
>  
> + memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
> +
> + snmp_get_cpu_field_batch(buff, snmp4_udp_list,
> +  net->mib.udp_statistics);
>   seq_puts(seq, "\nUdp:");
>   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
>   seq_printf(seq, " %s", snmp4_udp_list[i].name);
> -
>   seq_puts(seq, "\nUdp:");
>   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
> - seq_printf(seq, " %lu",
> -snmp_fold_field(net->mib.udp_statistics,
> -snmp4_udp_list[i].entry));
> + seq_printf(seq, " %lu", buff[i]);
> +
> + memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
>  
>   /* the UDP and UDP-Lite MIBs are the same */
>   seq_puts(seq, "\nUdpLite:");
> + snmp_get_cpu_field_batch(buff, snmp4_udp_list,
> +  net->mib.udplite_statistics);
>   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
>   seq_printf(seq, " %s", snmp4_udp_list[i].name);
> -
>   seq_puts(seq, "\nUdpLite:");
>   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
> - seq_printf(seq, " %lu",
> - 

Re: [PATCH 00/26] constify local structures

2016-09-12 Thread Jarkko Sakkinen
On Mon, Sep 12, 2016 at 03:52:08PM +0200, Julia Lawall wrote:
> 
> 
> On Mon, 12 Sep 2016, Felipe Balbi wrote:
> 
> >
> > Hi,
> >
> > Jarkko Sakkinen  writes:
> > > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote:
> > >>
> > >>
> > >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote:
> > >>
> > >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote:
> > >> > > Constify local structures.
> > >> > >
> > >> > > The semantic patch that makes this change is as follows:
> > >> > > (http://coccinelle.lip6.fr/)
> > >> >
> > >> > Just my two cents but:
> > >> >
> > >> > 1. You *can* use a static analysis too to find bugs or other issues.
> > >> > 2. However, you should manually do the commits and proper commit
> > >> >messages to subsystems based on your findings. And I generally think
> > >> >that if one contributes code one should also at least smoke test 
> > >> > changes
> > >> >somehow.
> > >> >
> > >> > I don't know if I'm alone with my opinion. I just think that one should
> > >> > also do the analysis part and not blindly create and submit patches.
> > >>
> > >> All of the patches are compile tested.  And the individual patches are
> > >
> > > Compile-testing is not testing. If you are not able to test a commit,
> > > you should explain why.
> >
> > Dude, Julia has been doing semantic patching for years already and
> > nobody has raised any concerns so far. There's already an expectation
> > that Coccinelle *works* and Julia's semantic patches are sound.
> >
> > Besides, adding 'const' is something that causes virtually no functional
> > changes to the point that build-testing is really all you need. Any
> > problems caused by adding 'const' to a definition will be seen by build
> > errors or warnings.
> >
> > Really, just stop with the pointless discussion and go read a bit about
> > Coccinelle and what semantic patches are giving you. The work done by
> > Julia and her peers at INRIA has measurable benefits.
> >
> > You're really making a thunderstorm in a glass of water.
> 
Thanks for the defense, but since a lot of these patches turned out to be
> wrong, due to an incorrect parse by Coccinelle, combined with an
> unpleasantly lax compiler, Jarkko does have a point that I should have
> looked at the patches more carefully.  In any case, I have written to the
> maintainers relevant to the patches that turned out to be incorrect.

Exactly. I'm not expecting that every commit would require extensive
analysis but it would be good to quickly at least skim through commits
and see if they make sense (or ask if unsure) :) 

And I'm fine with compile testing if it is mentioned in the commit msg.

> julia

/Jarkko


Re: [PATCHv2 net] sctp: hold the transport before using it in sctp_hash_cmp

2016-09-12 Thread Marcelo Ricardo Leitner
On Sat, Sep 10, 2016 at 11:11:23PM +0800, Xin Long wrote:
> Since commit 4f0087812648 ("sctp: apply rhashtable api to send/recv
> path"), sctp uses transport rhashtable with .obj_cmpfn sctp_hash_cmp,
> in which it compares the members of the transport with the rhashtable
> args to check if it's the right transport.
> 
> But sctp uses the transport without holding it in sctp_hash_cmp, which can
> cause a use-after-free panic: after it gets the transport from the hashtable,
> another CPU may close the sk and free the asoc. In sctp_association_free,
> it frees all the transports, meanwhile, the assoc's refcnt may be reduced
> to 0, assoc can be destroyed by sctp_association_destroy.
> 
> So after that, transport->assoc is actually an unavailable memory address
> in sctp_hash_cmp. Although sctp_hash_cmp is under rcu_read_lock, it still
> can not avoid this, as assoc is not freed by RCU.
> 
> This patch is to hold the transport before checking its members with
> sctp_transport_hold, in which it checks the refcnt first, holds it if
> it's not 0.
> 
> Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
> Signed-off-by: Xin Long 

Note that we cannot defer the free of the asoc too, because that causes
issues with port re-use (an issue already hit and fixed in the past), as
the port would still be in use during the RCU grace period.
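
For reference, the conditional hold this relies on would look roughly like
this (sketch only, assuming transport->refcnt is still an atomic_t):

        int sctp_transport_hold(struct sctp_transport *t)
        {
                /* only take a reference if the refcnt has not already hit 0 */
                return atomic_add_unless(&t->refcnt, 1, 0);
        }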

Acked-by: Marcelo Ricardo Leitner 

> ---
>  net/sctp/input.c | 27 +--
>  1 file changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 69444d3..1555fb8 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -796,27 +796,34 @@ struct sctp_hash_cmp_arg {
>  static inline int sctp_hash_cmp(struct rhashtable_compare_arg *arg,
>   const void *ptr)
>  {
> + struct sctp_transport *t = (struct sctp_transport *)ptr;
>   const struct sctp_hash_cmp_arg *x = arg->key;
> - const struct sctp_transport *t = ptr;
> - struct sctp_association *asoc = t->asoc;
> - const struct net *net = x->net;
> + struct sctp_association *asoc;
> + int err = 1;
>  
>   if (!sctp_cmp_addr_exact(&t->ipaddr, x->paddr))
> - return 1;
> - if (!net_eq(sock_net(asoc->base.sk), net))
> - return 1;
> + return err;
> + if (!sctp_transport_hold(t))
> + return err;
> +
> + asoc = t->asoc;
> + if (!net_eq(sock_net(asoc->base.sk), x->net))
> + goto out;
>   if (x->ep) {
>   if (x->ep != asoc->ep)
> - return 1;
> + goto out;
>   } else {
>   if (x->laddr->v4.sin_port != htons(asoc->base.bind_addr.port))
> - return 1;
> + goto out;
>   if (!sctp_bind_addr_match(&asoc->base.bind_addr,
> x->laddr, sctp_sk(asoc->base.sk)))
> - return 1;
> + goto out;
>   }
>  
> - return 0;
> + err = 0;
> +out:
> + sctp_transport_put(t);
> + return err;
>  }
>  
>  static inline u32 sctp_hash_obj(const void *data, u32 len, u32 seed)
> -- 
> 2.1.0
> 


Re: [PATCH] net: dsa: add FIB support

2016-09-12 Thread John Crispin


On 12/09/2016 16:00, Andrew Lunn wrote:
> Hi John
> 
>> +if (!ds->drv->ipv4_fib_prepare || !ds->drv->ipv4_fib_add)
> 
> drv recently got renamed to ops, which is what 0-day is complaining
> about.
> 
>   Andrew
> 

I based this on a net-next from 5-6 days ago, will rebase/resend tomorrow.

John


Re: [PATCH] net: inet: diag: Fix an error handling

2016-09-12 Thread Christophe JAILLET

Le 12/09/2016 à 16:35, David Ahern a écrit :

On 9/12/16 12:02 AM, Christophe JAILLET wrote:

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index abfbe492ebfe..795af25cf84c 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -1134,7 +1134,6 @@ int inet_diag_handler_get_info(struct sk_buff *skb, 
struct sock *sk)
  
  	handler = inet_diag_lock_handler(sk->sk_protocol);

if (IS_ERR(handler)) {
-   inet_diag_unlock_handler(handler);
nlmsg_cancel(skb, nlh);
return PTR_ERR(handler);
}


That call is needed. inet_diag_unlock_handler does not operate on the input arg 
but a global mutex. Perhaps a better patch is to change 
inet_diag_unlock_handler() to a void.

This is obvious. Sorry for the noise.

CJ


Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines

2016-09-12 Thread John Fastabend
On 16-09-12 05:17 AM, Jesper Dangaard Brouer wrote:
> On Fri, 09 Sep 2016 14:29:38 -0700
> John Fastabend  wrote:
> 
>> e1000 supports a single TX queue so it is being shared with the stack
>> when XDP runs XDP_TX action. This requires taking the xmit lock to
>> ensure we don't corrupt the tx ring. To avoid taking and dropping the
>> lock per packet this patch adds a bundling implementation to submit
>> a bundle of packets to the xmit routine.
>>
>> I tested this patch running e1000 in a VM using KVM over a tap
>> device using pktgen to generate traffic along with 'ping -f -l 100'.
>>
>> Suggested-by: Jesper Dangaard Brouer 
> 
> Thank you for actually implementing this! :-)
> 

Yep, no problem. The effects are minimal on e1000 but should be
noticeable on 10/40/100Gbps NICs.

>> Signed-off-by: John Fastabend 
>> ---
> [...]


[...]

>> +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle 
>> *buffer_info,
>> +  struct net_device *netdev,
>> +  struct e1000_adapter *adapter)
>> +{
>> +struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>> +struct e1000_tx_ring *tx_ring = adapter->tx_ring;
>> +struct e1000_hw *hw = &adapter->hw;
>> +int i = 0;
>> +
>>  /* e1000 only support a single txq at the moment so the queue is being
>>   * shared with stack. To support this requires locking to ensure the
>>   * stack and XDP are not running at the same time. Devices with
>>   * multiple queues should allocate a separate queue space.
>> + *
>> + * To amortize the locking cost e1000 bundles the xmits and sends as
>> + * many as possible until either running out of descriptors or failing.
>>   */
>>  HARD_TX_LOCK(netdev, txq, smp_processor_id());
>>  
>> -tx_ring = adapter->tx_ring;
>> -
>> -if (E1000_DESC_UNUSED(tx_ring) < 2) {
>> -HARD_TX_UNLOCK(netdev, txq);
>> -return;
>> +for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
>^^^
>> +e1000_xmit_raw_frame(buffer_info[i].buffer,
>> + buffer_info[i].length,
>> + adapter, tx_ring);
>> +buffer_info[i].buffer->rxbuf.page = NULL;
>> +buffer_info[i].buffer = NULL;
>> +buffer_info[i].length = 0;
>> +i++;
> ^^^
> Looks like "i" is incremented twice, is that correct?
> 
>>  }

Yep, this and a couple of other issues are resolved in v3, which I'll send
out in a moment.
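
For the record, the fix is just dropping the inner increment so the loop
header alone advances i (sketch based on the hunk above):

        for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
                e1000_xmit_raw_frame(buffer_info[i].buffer,
                                     buffer_info[i].length,
                                     adapter, tx_ring);
                buffer_info[i].buffer->rxbuf.page = NULL;
                buffer_info[i].buffer = NULL;
                buffer_info[i].length = 0;
        }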

Also in v3 I kept the program in the adapter structure. Moving it into
the ring structure made the code a bit uglier IMO. I agree with the
logic but practically only one program can exist for e1000.



[PATCHv2 next 3/3] ipvlan: Introduce l3s mode

2016-09-12 Thread Mahesh Bandewar
From: Mahesh Bandewar 

In a typical IPvlan L3 setup the master is in the default-ns and
each slave is in a different (slave) ns. In this setup, egress
packet processing for traffic originating from a slave-ns will
hit all NF_HOOKs in the slave-ns as well as the default-ns. However,
the same is not true for ingress processing: all these NF_HOOKs are
hit only in the slave-ns, skipping them in the default-ns.
IPvlan in L3 mode is restrictive, and if admins want to deploy
iptables rules in the default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of the l3_rcv() handler (added as part of the l3mdev
enhancements) to perform an input route lookup on RX packets without
changing skb->dev, and then uses an nf_hook at NF_INET_LOCAL_IN
to change skb->dev just before handing the skb over to L4.
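
A rough sketch of the hook registration this implies (only ipvlan_nf_input
is taken from the patch; the ipvl_nfops name and registration site are
illustrative):

        static struct nf_hook_ops ipvl_nfops[] = {
                {
                        .hook     = ipvlan_nf_input,
                        .pf       = NFPROTO_IPV4,
                        .hooknum  = NF_INET_LOCAL_IN,
                        .priority = INT_MAX,
                },
        };

        /* registered once, with RTNL held, when the first l3s port is created */
        err = _nf_register_hooks(ipvl_nfops, ARRAY_SIZE(ipvl_nfops));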

Signed-off-by: Mahesh Bandewar 
---
 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig |  1 +
 drivers/net/ipvlan/ipvlan.h |  7 +++
 drivers/net/ipvlan/ipvlan_core.c| 94 +
 drivers/net/ipvlan/ipvlan_main.c| 60 ---
 include/uapi/linux/if_link.h|  1 +
 6 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvlan.txt 
b/Documentation/networking/ipvlan.txt
index 14422f8fcdc4..24196cef7c91 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or 
as a module
There are no module parameters for this driver and it can be configured
 using IProute2/ip utility.
 
-   ip link add link   type ipvlan mode { l2 | L3 }
+   ip link add link   type ipvlan mode { l2 | l3 | 
l3s }
 
e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
 
@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that 
instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.
 
+4.3 L3S mode:
+   This is very similar to the L3 mode except that iptables (conn-tracking)
+works in this mode and hence it is L3-symmetric (L3s). This will have slightly 
less
+performance but that shouldn't matter since you are choosing this mode over 
plain-L3
+mode to make conn-tracking work.
 
 5. What to choose (macvlan vs. ipvlan)?
These two devices are very similar in many regards and the specific use
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c5415b05ea9..8768a625350d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,6 +149,7 @@ config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
 depends on IPV6
+depends on NET_L3_MASTER_DEV
 ---help---
   This allows one to create virtual devices off of a main interface
   and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 695a5dc9ace3..68b270b59ba9 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -23,11 +23,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define IPVLAN_DRV "ipvlan"
 #define IPV_DRV_VER"0.1"
@@ -96,6 +98,7 @@ struct ipvl_port {
struct work_struct  wq;
struct sk_buff_head backlog;
int count;
+   boolipt_hook_added;
struct rcu_head rcu;
 };
 
@@ -124,4 +127,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev 
*ipvlan,
   const void *iaddr, bool is_v6);
 bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6);
 void ipvlan_ht_addr_del(struct ipvl_addr *addr);
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+ u16 proto);
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+const struct nf_hook_state *state);
 #endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b5f9511d819e..b4e990743e1d 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
case IPVLAN_MODE_L2:
return ipvlan_xmit_mode_l2(skb, dev);
case IPVLAN_MODE_L3:
+   case IPVLAN_MODE_L3S:
return ipvlan_xmit_mode_l3(skb, dev);
}
 
@@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
**pskb)
return ipvlan_handle_mode_l2(pskb, port);
case IPVLAN_MODE_L3:
return ipvlan_handle_mode_l3(pskb, port);
+   case IPVLAN_MODE_L3S:
+   return RX_HANDLER_PASS;
}
 
/* Should not reach here */
@@ -672,3 +675,94 @@ rx_handler_result_t ipvlan_handle_frame(struct

Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC

2016-09-12 Thread Jason Gunthorpe
On Mon, Sep 12, 2016 at 07:07:34PM +0300, Ram Amrani wrote:
>  drivers/infiniband/hw/qedr/main.c  |  907 ++
>  drivers/infiniband/hw/qedr/qedr.h  |  494 
>  drivers/infiniband/hw/qedr/qedr_cm.c   |  626 +
>  drivers/infiniband/hw/qedr/qedr_cm.h   |   61 +
>  drivers/infiniband/hw/qedr/qedr_hsi.h  |   56 +
>  drivers/infiniband/hw/qedr/qedr_hsi_rdma.h |  748 +
>  drivers/infiniband/hw/qedr/qedr_user.h |   80 +

We are requiring new uAPI headers are placed under include/uapi/rdma/,
please coordinate with Leon on the path.

Jason


[PATCHv2 next 2/3] net: Add _nf_(un)register_hooks symbols

2016-09-12 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Add _nf_register_hooks() and _nf_unregister_hooks() calls, which allow
the caller to hold the RTNL mutex.
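
The intended call pattern is simply (sketch; 'reg' and 'n' are the caller's
hook array and count):

        /* caller already holds RTNL, e.g. inside an rtnl_lock() section */
        ASSERT_RTNL();
        err = _nf_register_hooks(reg, n);
        if (err)
                return err;
        /* ... and later, still under RTNL: _nf_unregister_hooks(reg, n); */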

Signed-off-by: Mahesh Bandewar 
CC: Pablo Neira Ayuso 
---
 include/linux/netfilter.h |  2 ++
 net/netfilter/core.c  | 51 ++-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9aee896..e82b76781bf6 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg);
 void nf_unregister_hook(struct nf_hook_ops *reg);
 int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
 
 /* Functions to register get/setsockopt ranges (non-inclusive).  You
need to check permissions yourself! */
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index f39276d1c2d7..2c5327e43a88 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks);
 
 static LIST_HEAD(nf_hook_list);
 
-int nf_register_hook(struct nf_hook_ops *reg)
+static int _nf_register_hook(struct nf_hook_ops *reg)
 {
struct net *net, *last;
int ret;
 
-   rtnl_lock();
for_each_net(net) {
ret = nf_register_net_hook(net, reg);
if (ret && ret != -ENOENT)
goto rollback;
}
list_add_tail(®->list, &nf_hook_list);
-   rtnl_unlock();
 
return 0;
 rollback:
@@ -210,19 +208,34 @@ rollback:
break;
nf_unregister_net_hook(net, reg);
}
+   return ret;
+}
+
+int nf_register_hook(struct nf_hook_ops *reg)
+{
+   int ret;
+
+   rtnl_lock();
+   ret = _nf_register_hook(reg);
rtnl_unlock();
+
return ret;
 }
 EXPORT_SYMBOL(nf_register_hook);
 
-void nf_unregister_hook(struct nf_hook_ops *reg)
+static void _nf_unregister_hook(struct nf_hook_ops *reg)
 {
struct net *net;
 
-   rtnl_lock();
list_del(®->list);
for_each_net(net)
nf_unregister_net_hook(net, reg);
+}
+
+void nf_unregister_hook(struct nf_hook_ops *reg)
+{
+   rtnl_lock();
+   _nf_unregister_hook(reg);
rtnl_unlock();
 }
 EXPORT_SYMBOL(nf_unregister_hook);
@@ -246,6 +259,26 @@ err:
 }
 EXPORT_SYMBOL(nf_register_hooks);
 
+/* Caller MUST take rtnl_lock() */
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+   unsigned int i;
+   int err = 0;
+
+   for (i = 0; i < n; i++) {
+   err = _nf_register_hook(®[i]);
+   if (err)
+   goto err;
+   }
+   return err;
+
+err:
+   if (i > 0)
+   _nf_unregister_hooks(reg, i);
+   return err;
+}
+EXPORT_SYMBOL(_nf_register_hooks);
+
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 {
while (n-- > 0)
@@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned 
int n)
 }
 EXPORT_SYMBOL(nf_unregister_hooks);
 
+/* Caller MUST take rtnl_lock */
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+   while (n-- > 0)
+   _nf_unregister_hook(®[n]);
+}
+EXPORT_SYMBOL(_nf_unregister_hooks);
+
 unsigned int nf_iterate(struct list_head *head,
struct sk_buff *skb,
struct nf_hook_state *state,
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-12 Thread Jason Gunthorpe
On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote:
> [2] Libpvrdma User-level library - 
> http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary

You will probably find that rdma-plumbing will be the best way to get
your userspace component into the distributors.

  http://www.spinics.net/lists/linux-rdma/msg39026.html
  http://www.spinics.net/lists/linux-rdma/msg39328.html
  http://www.spinics.net/lists/linux-rdma/msg40014.html
  http://www.spinics.net/lists/linux-rdma/msg39026.html

Jason


[PATCHv2 next 1/3] ipv6: Export ip6_route_input_lookup symbol

2016-09-12 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Make ip6_route_input_lookup available outside of the ipv6 module,
similar to ip_route_input_noref in the IPv4 world.
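
A sketch of how an out-of-module caller could use it for an input-route
lookup on a received skb (the flowi6 setup here is illustrative only, not
taken from this series):

        const struct ipv6hdr *ip6h = ipv6_hdr(skb);
        struct flowi6 fl6 = {
                .flowi6_iif   = dev->ifindex,
                .daddr        = ip6h->daddr,
                .saddr        = ip6h->saddr,
                .flowi6_mark  = skb->mark,
                .flowi6_proto = ip6h->nexthdr,
        };
        struct dst_entry *dst;

        dst = ip6_route_input_lookup(dev_net(dev), dev, &fl6, 0);
        skb_dst_drop(skb);
        skb_dst_set(skb, dst);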

Signed-off-by: Mahesh Bandewar 
---
 include/net/ip6_route.h | 3 +++
 net/ipv6/route.c| 7 ---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index d97305d0e71f..e0cd318d5103 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr 
*daddr)
 }
 
 void ip6_route_input(struct sk_buff *skb);
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+struct net_device *dev,
+struct flowi6 *fl6, int flags);
 
 struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock 
*sk,
 struct flowi6 *fl6, int flags);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 09d43ff11a8d..9563eedd4f97 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net 
*net, struct fib6_table *
return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags);
 }
 
-static struct dst_entry *ip6_route_input_lookup(struct net *net,
-   struct net_device *dev,
-   struct flowi6 *fl6, int flags)
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+struct net_device *dev,
+struct flowi6 *fl6, int flags)
 {
if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG)
flags |= RT6_LOOKUP_F_IFACE;
 
return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input);
 }
+EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
 
 void ip6_route_input(struct sk_buff *skb)
 {
-- 
2.8.0.rc3.226.g39d4020



[PATCHv2 next 0/3] ipvlan introduce l3s mode

2016-09-12 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Same old problem, new approach, incorporating suggestions from the
earlier patch series.

The first thing is that this is introduced as a new mode rather than by
modifying the old (L3) mode. So the behavior of the existing modes is
preserved as-is, and the new L3s mode obeys iptables so that the intended
conn-tracking can work.

To do this, the code uses the newly added l3mdev_rcv() handler and an
iptables hook: l3mdev_rcv() performs an inbound route lookup with the
correct (IPvlan slave) interface, and then the iptables hook at LOCAL_INPUT
changes the input device from master to slave to complete the formality.

The supporting stack changes are trivial: export a symbol so the
IPv4-equivalent input lookup is also available for IPv6, and allow the
netfilter hook registration code to be called with RTNL already held.
Please look into the individual patches for details.

Mahesh Bandewar (3):
  ipv6: Export ip6_route_input_lookup symbol
  net: Add _nf_(un)register_hooks symbols
  ipvlan: Introduce l3s mode

 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig |  1 +
 drivers/net/ipvlan/ipvlan.h |  7 +++
 drivers/net/ipvlan/ipvlan_core.c| 94 +
 drivers/net/ipvlan/ipvlan_main.c| 60 ---
 include/linux/netfilter.h   |  2 +
 include/net/ip6_route.h |  3 ++
 include/uapi/linux/if_link.h|  1 +
 net/ipv6/route.c|  7 +--
 net/netfilter/core.c| 51 ++--
 10 files changed, 217 insertions(+), 16 deletions(-)


v1: Initial post
v2: Text correction and config changed from "select" to "depends on"
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 1/2] uapi glibc compat: make linux/time.h compile with user time.h files

2016-09-12 Thread Willem de Bruijn
From: Willem de Bruijn 

Add libc-compat workaround for definitions in linux/time.h that
duplicate those in libc time.h, sys/time.h and bits/time.h.

With this change, userspace builds succeed when linux/time.h is
included after libc time.h and when it is included after sys/time.h.

The inverse requires additional changes to those userspace headers.

Without this patch, when compiling the following program after
make headers_install:

  echo -e "#include \n#include " | \
gcc -Wall -Werror -Iusr/include -c -xc -

gcc gives these errors:

  #include 
  #include 

In file included from ../test_time.c:3:0:
/usr/include/time.h:120:8: error: redefinition of ‘struct timespec’
 struct timespec
^
In file included from ../test_time.c:2:0:
./usr/include/linux/time.h:9:8: note: originally defined here
 struct timespec {
^
In file included from ../test_time.c:3:0:
/usr/include/time.h:161:8: error: redefinition of ‘struct itimerspec’
 struct itimerspec
^
In file included from ../test_time.c:2:0:
./usr/include/linux/time.h:34:8: note: originally defined here
 struct itimerspec {

and this warning by indirect inclusion of bits/time.h:

In file included from ../test_time.c:4:0:
./usr/include/linux/time.h:67:0: error: "TIMER_ABSTIME" redefined [-Werror]
 #define TIMER_ABSTIME   0x01
 ^
In file included from /usr/include/time.h:41:0,
 from ../test_time.c:3:
/usr/include/x86_64-linux-gnu/bits/time.h:82:0: note: this is the location 
of the previous definition
 #   define TIMER_ABSTIME  1
 ^

The _SYS_TIME_H variant resolves similar errors for timeval, timezone,
itimerval and warnings for ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF.

Ran the same program for sys/time.h and bits/time.h.

Signed-off-by: Willem de Bruijn 
---
 include/uapi/linux/libc-compat.h | 50 
 include/uapi/linux/time.h| 15 
 2 files changed, 65 insertions(+)

diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 44b8a6b..b08b0f5 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -165,6 +165,43 @@
 #define __UAPI_DEF_XATTR   1
 #endif
 
+/* Definitions for time.h */
+#if defined(__timespec_defined)
+#define __UAPI_DEF_TIMESPEC0
+#else
+#define __UAPI_DEF_TIMESPEC1
+#endif
+
+#if defined(_TIME_H) && defined(__USE_POSIX199309)
+#define __UAPI_DEF_ITIMERSPEC  0
+#else
+#define __UAPI_DEF_ITIMERSPEC  1
+#endif
+
+/* Definitions for sys/time.h */
+#if defined(_SYS_TIME_H)
+#define __UAPI_DEF_TIMEVAL 0
+#define __UAPI_DEF_ITIMERVAL   0
+#define __UAPI_DEF_ITIMER_WHICH0
+#else
+#define __UAPI_DEF_TIMEVAL 1
+#define __UAPI_DEF_ITIMERVAL   1
+#define __UAPI_DEF_ITIMER_WHICH1
+#endif
+
+/* Definitions for bits/time.h */
+#if defined(_BITS_TIME_H)
+#define __UAPI_DEF_ABSTIME 0
+#else
+#define __UAPI_DEF_ABSTIME 1
+#endif
+
+#if defined(_SYS_TIME_H) && defined(__USE_BSD)
+#define __UAPI_DEF_TIMEZONE0
+#else
+#define __UAPI_DEF_TIMEZONE1
+#endif
+
 /* If we did not see any headers from any supported C libraries,
  * or we are being included in the kernel, then define everything
  * that we need. */
@@ -208,6 +245,19 @@
 /* Definitions for xattr.h */
 #define __UAPI_DEF_XATTR   1
 
+/* Definitions for time.h */
+#define __UAPI_DEF_TIMESPEC1
+#define __UAPI_DEF_ITIMERSPEC  1
+
+/* Definitions for sys/time.h */
+#define __UAPI_DEF_TIMEVAL 1
+#define __UAPI_DEF_ITIMERVAL   1
+#define __UAPI_DEF_ITIMER_WHICH1
+#define __UAPI_DEF_TIMEZONE1
+
+/* Definitions for bits/time.h */
+#define __UAPI_DEF_ABSTIME 1
+
 #endif /* __GLIBC__ */
 
 #endif /* _UAPI_LIBC_COMPAT_H */
diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h
index e75e1b6..4e7333c 100644
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -1,9 +1,11 @@
 #ifndef _UAPI_LINUX_TIME_H
 #define _UAPI_LINUX_TIME_H
 
+#include 
 #include 
 
 
+#if __UAPI_DEF_TIMESPEC
 #ifndef _STRUCT_TIMESPEC
 #define _STRUCT_TIMESPEC
 struct timespec {
@@ -11,35 +13,46 @@ struct timespec {
longtv_nsec;/* nanoseconds */
 };
 #endif
+#endif
 
+#if __UAPI_DEF_TIMEVAL
 struct timeval {
__kernel_time_t tv_sec; /* seconds */
__kernel_suseconds_ttv_usec;/* microseconds */
 };
+#endif
 
+#if __UAPI_DEF_TIMEZONE
 struct timezone {
int tz_minuteswest; /* minutes west of Greenwich */
int tz_dsttime; /* type of dst correction */
 };
+#endif
 
 
 /*
  * Names of the interval timers, and structure
  * defining a timer setting:
  */
+#if __UAPI_DEF_ITIMER_WHICH
 #defineITIMER_REAL 0
 #define

Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Alexei Starovoitov
On Mon, Sep 12, 2016 at 10:56:55AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov  wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > > >  Lets do bundling/bulking from the start!
> > > > 
> > > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > > does bulking as well.
> > > > See nothing wrong about it. RX side processes the packets and
> > > > when it's done it tells TX to xmit whatever it collected.  
> > > 
> > > This is doing "hidden" bulking and not really taking advantage of using
> > > the icache more efficiently.
> > > 
> > > Let me explain the problem I see, little more clear then, so you
> > > hopefully see where I'm going.
> > > 
> > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > Every time you call the stack code, then you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX, this is a costly operation.  
> > 
> > correct. And why is that a problem?
> 
> It is good that you can see and acknowledge the I-cache problem.
> 
> XDP is all about performance.  What I hear is that you are arguing
> against a model that will yield better performance; that does not make
> sense to me.  Let me explain this again, in another way.

I'm arguing against your proposal that I think will be more complex and
lower performance than what Saeed and the team already implemented.
Therefore I don't think it's fair to block the patch and ask them to
reimplement it just to test an idea that may or may not improve performance.

Getting maximum performance is tricky. Good is better than perfect.
It's important to argue about user space visible bits upfront, but
on the kernel performance side we should build/test incrementally.
This particular patch 11/11 is simple, easy to review and provides
good performance. What's not to like?



[PATCH net-next 2/2] errqueue: include linux/time.h

2016-09-12 Thread Willem de Bruijn
From: Willem de Bruijn 

struct scm_timestamping has fields of type struct timespec. Now that
it is safe to include linux/time.h and time.h at the same time,
include linux/time.h directly in linux/errqueue.h

Without this patch, when compiling the following program after
make headers_install:

gcc -Wall -Werror -Iusr/include -c -xc - <<EOF
  static struct scm_timestamping tss;
  int main(void) { tss.ts[0].tv_sec = 1; return 0; }
EOF

gcc gives this error:

  In file included from :1:0:
  usr/include/linux/errqueue.h:33:18: error: array type has incomplete element 
type
struct timespec ts[3];

Reported-by: Brooks Moses 
Signed-off-by: Willem de Bruijn 
---
 include/uapi/linux/errqueue.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..abafec8 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -1,6 +1,7 @@
 #ifndef _UAPI_LINUX_ERRQUEUE_H
 #define _UAPI_LINUX_ERRQUEUE_H
 
+#include 
 #include 
 
 struct sock_extended_err {
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v4 16/16] MAINTAINERS: Update for PVRDMA driver

2016-09-12 Thread Jason Gunthorpe
On Sun, Sep 11, 2016 at 09:49:26PM -0700, Adit Ranadive wrote:
> Add maintainer info for the PVRDMA driver.

You can probably squash the last three patches.

.. and fix the __u32 stuff throughout the entire driver please.

Jason


[PATCH net-next 0/2] uapi: include time.h from errqueue.h

2016-09-12 Thread Willem de Bruijn
From: Willem de Bruijn 

It was reported that linux/errqueue.h requires linux/time.h, but that
adding the include directly may cause userspace conflicts between
linux/time.h and glibc time.h:

  https://lkml.org/lkml/2016/7/10/10

Address the conflicts using the standard libc-compat approach, then
add the #include to errqueue.h

The first patch is a resubmit. It was previously submitted to
tip/timers/core, but given the commit history, the maintainer
suggested this tree, instead.

  https://lkml.org/lkml/2016/8/10/748

This also allows sending the follow-up as part of the patchset.

Willem de Bruijn (2):
  uapi glibc compat: make linux/time.h compile with user time.h files
  errqueue: include linux/time.h

 include/uapi/linux/errqueue.h|  1 +
 include/uapi/linux/libc-compat.h | 50 
 include/uapi/linux/time.h| 15 
 3 files changed, 66 insertions(+)

-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v4 03/16] IB/pvrdma: Add virtual device RDMA structures

2016-09-12 Thread Jason Gunthorpe
On Sun, Sep 11, 2016 at 09:49:13PM -0700, Adit Ranadive wrote:
> + __u8raw[16];
> + struct {
> + __be64  subnet_prefix;
> + __be64  interface_id;
> + } global;

If this is not a userspace header do not use the __ variants.

Jason


Re: [PATCH v4 02/16] IB/pvrdma: Add user-level shared functions

2016-09-12 Thread Jason Gunthorpe
On Sun, Sep 11, 2016 at 09:49:12PM -0700, Adit Ranadive wrote:
> We share some common structures with the user-level driver. This patch
> adds those structures and shared functions to traverse the QP/CQ rings.

>  create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_uapi.h
>  create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_user.h

The files that are intended to be shared with userspace must be under
include/uapi/, please coordinate with Leon on the path.

Same for all the new drivers.

> +static inline __s32 pvrdma_idx(atomic_t *var, __u32 max_elems)
> +{
> + const unsigned int idx = atomic_read(var);

Eh? Does this even compile in userspace?

If this is not a userspace header then why does it use __u32 and
related ??

> +#define PVRDMA_UVERBS_ABI_VERSION3
> +#define PVRDMA_BOARD_ID  1
> +#define PVRDMA_REV_ID1
> +
> +struct pvrdma_alloc_ucontext_resp {
> + u32 qp_tab_size;
> + u32 reserved;
> +};

This certainly looks like a userspace header, shouldn't it use __u32?
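
I.e., if it is destined for include/uapi, something like (sketch):

        struct pvrdma_alloc_ucontext_resp {
                __u32 qp_tab_size;
                __u32 reserved;
        };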

NAK

Jason

