Re: [RFC 02/11] Add RoCE driver framework
On Mon, Sep 12, 2016 at 07:07:36PM +0300, Ram Amrani wrote: > Adds a skeletal implementation of the qed* RoCE driver - > basically the ability to communicate with the qede driver and > receive notifications from it regarding various init/exit events. > > Signed-off-by: Rajesh Borundia > Signed-off-by: Ram Amrani > --- <...> > + > +#define QEDR_MODULE_VERSION "8.10.10.0" I am strongly against module versions. You should rely on the official kernel version instead. > +#define QEDR_NODE_DESC "QLogic 579xx RoCE HCA" > +#define DP_NAME(dev) ((dev)->ibdev.name)
Re: [RFC 02/11] Add RoCE driver framework
On Mon, Sep 12, 2016 at 07:17:35PM +, Yuval Mintz wrote: > >> +uint debug; > >> +module_param(debug, uint, 0); > >> +MODULE_PARM_DESC(debug, "Default debug msglevel"); > > >Why are you adding this as a module parameter? > > I believe this is mostly to follow same line as qede which also defines > 'debug' module parameter for allowing easy user control of debug > prints [& specifically for probe prints, which can't be controlled > otherwise]. Can you give us an example where dynamic debug and tracing infrastructures are not enough? AFAIK, most of these debug module parameters are legacy copy/paste code which is useless in real life scenarios. Thanks
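For context on the reviewer's point, here is a minimal sketch (not taken from the qedr patches) of what the dynamic-debug alternative to a private 'debug' module parameter looks like; the module and message names are made up for illustration.

/*
 * Minimal sketch, not from the qedr patch: probe-time debug output via
 * the kernel's dynamic debug facility instead of a 'debug' module
 * parameter.  Each pr_debug() call site can be switched on at runtime,
 * e.g.
 *	echo 'module example_mod +p' > /sys/kernel/debug/dynamic_debug/control
 * or at module load time with the dyndbg="+p" parameter.
 */
#include <linux/module.h>
#include <linux/printk.h>

static int __init example_mod_init(void)
{
	/* emitted only when this call site is enabled via dynamic debug */
	pr_debug("probe: example_mod loaded\n");
	return 0;
}

static void __exit example_mod_exit(void)
{
	pr_debug("example_mod unloaded\n");
}

module_init(example_mod_init);
module_exit(example_mod_exit);
MODULE_LICENSE("GPL");

This covers the probe-print use case mentioned above without adding a module parameter, provided CONFIG_DYNAMIC_DEBUG is enabled.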
Re: icmpv6: issue with routing table entries from link local addresses
On Mon, Sep 12, 2016 at 07:26:23PM +0200, Hannes Frederic Sowa wrote: > > Actually, I'm convinced I must be doing something wrong here. The setup > > for the issue is quite trivial, someone would have tripped over it > > already. The only condition is that one host has multiple interfaces > > with ipv6 enabled. > > > > Any help in shedding some light onto this issue would be appreciated. > > This shouldn't be the case. We certainly carry over the ifindex of the > received packet into the routing lookup of the outgoing packet, thus the > appropriate rule, with outgoing ifindex should be selected. I saw this in the code and that's the reason why I wrote the initial mail. I was trying to trace it with ftrace, but got stuck somewhere around the find_rr_leaf function. (If there is any good documentation on the internal fib data structure, please point me to it.) > I also couldn't reproduce your problem here with my system. Can you > verify with tcpdump that the packet is leaving on another interface? It is not leaving on another interface but simply discarded on the host. The Ip6OutNoRoutes stat in /proc/net/snmp6 is increased. From my understanding the routing subsystem finds the first matching entry in the main routing table, checks the interface and bails out because it does not match. I did omit a crucial piece of information in the last mail: I'm currently stuck on an older distribution kernel (3.16). I'll try to check if there have been any relevant changes to IPv6 route lookup in the last two years. (Maybe I should try to reproduce it with the current kernel, sorry that I didn't think of this before.) Andreas
Re: [RFC 03/11] Add support for RoCE HW init
On 12/09/2016 19:07, Ram Amrani wrote: > Allocate and setup RoCE resources, interrupts and completion queues. > Adds device attributes. > > Signed-off-by: Rajesh Borundia > Signed-off-by: Ram Amrani > --- > drivers/infiniband/hw/qedr/main.c | 408 +++- > drivers/infiniband/hw/qedr/qedr.h | 118 > drivers/infiniband/hw/qedr/qedr_hsi.h | 56 ++ > drivers/infiniband/hw/qedr/qedr_hsi_rdma.h | 96 +++ > drivers/net/ethernet/qlogic/qed/Makefile | 1 + > drivers/net/ethernet/qlogic/qed/qed.h | 26 +- > drivers/net/ethernet/qlogic/qed/qed_cxt.c | 6 + > drivers/net/ethernet/qlogic/qed/qed_cxt.h | 6 + > drivers/net/ethernet/qlogic/qed/qed_dev.c | 155 + > drivers/net/ethernet/qlogic/qed/qed_main.c | 44 +- > drivers/net/ethernet/qlogic/qed/qed_reg_addr.h | 7 + > drivers/net/ethernet/qlogic/qed/qed_roce.c | 887 > + > drivers/net/ethernet/qlogic/qed/qed_roce.h | 117 > drivers/net/ethernet/qlogic/qed/qed_sp.h | 1 + > drivers/net/ethernet/qlogic/qed/qed_spq.c | 8 + > drivers/net/ethernet/qlogic/qede/qede_roce.c | 2 +- > include/linux/qed/qed_if.h | 5 +- > include/linux/qed/qed_roce_if.h| 345 ++ > 18 files changed, 2281 insertions(+), 7 deletions(-) > create mode 100644 drivers/infiniband/hw/qedr/qedr_hsi.h > create mode 100644 drivers/infiniband/hw/qedr/qedr_hsi_rdma.h > create mode 100644 drivers/net/ethernet/qlogic/qed/qed_roce.c > create mode 100644 drivers/net/ethernet/qlogic/qed/qed_roce.h > create mode 100644 include/linux/qed/qed_roce_if.h > > diff --git a/drivers/infiniband/hw/qedr/main.c > b/drivers/infiniband/hw/qedr/main.c > index 3fe58a3..0b5274a 100644 > --- a/drivers/infiniband/hw/qedr/main.c > +++ b/drivers/infiniband/hw/qedr/main.c > @@ -36,6 +36,8 @@ > #include > #include > #include > +#include > +#include > #include "qedr.h" > > MODULE_DESCRIPTION("QLogic 40G/100G ROCE Driver"); > @@ -80,6 +82,139 @@ static int qedr_register_device(struct qedr_dev *dev) > return 0; > } > > +/* This function allocates fast-path status block memory */ > +static int qedr_alloc_mem_sb(struct qedr_dev *dev, > + struct qed_sb_info *sb_info, u16 sb_id) > +{ > + struct status_block *sb_virt; > + dma_addr_t sb_phys; > + int rc; > + > + sb_virt = dma_alloc_coherent(&dev->pdev->dev, > + sizeof(*sb_virt), &sb_phys, GFP_KERNEL); > + if (!sb_virt) { > + pr_err("Status block allocation failed\n"); > + return -ENOMEM; > + } > + > + rc = dev->ops->common->sb_init(dev->cdev, sb_info, > +sb_virt, sb_phys, sb_id, > +QED_SB_TYPE_CNQ); > + if (rc) { > + pr_err("Status block initialization failed\n"); > + dma_free_coherent(&dev->pdev->dev, sizeof(*sb_virt), > + sb_virt, sb_phys); > + return rc; > + } > + > + return 0; > +} > + > +static void qedr_free_mem_sb(struct qedr_dev *dev, > + struct qed_sb_info *sb_info, int sb_id) > +{ > + if (sb_info->sb_virt) { > + dev->ops->common->sb_release(dev->cdev, sb_info, sb_id); > + dma_free_coherent(&dev->pdev->dev, sizeof(*sb_info->sb_virt), > + (void *)sb_info->sb_virt, sb_info->sb_phys); > + } > +} > + > +static void qedr_free_resources(struct qedr_dev *dev) > +{ > + int i; > + > + for (i = 0; i < dev->num_cnq; i++) { > + qedr_free_mem_sb(dev, &dev->sb_array[i], dev->sb_start + i); > + dev->ops->common->chain_free(dev->cdev, &dev->cnq_array[i].pbl); > + } > + > + kfree(dev->cnq_array); > + kfree(dev->sb_array); > + kfree(dev->sgid_tbl); > +} > + > +static int qedr_alloc_resources(struct qedr_dev *dev) > +{ > + struct qedr_cnq *cnq; > + __le16 *cons_pi; > + u16 n_entries; > + int i, rc; > + > + dev->sgid_tbl = kzalloc(sizeof(union ib_gid) * > + QEDR_MAX_SGID, GFP_KERNEL); > + if 
(!dev->sgid_tbl) > + return -ENOMEM; > + > + spin_lock_init(&dev->sgid_lock); > + > + /* Allocate Status blocks for CNQ */ > + dev->sb_array = kcalloc(dev->num_cnq, sizeof(*dev->sb_array), > + GFP_KERNEL); > + if (!dev->sb_array) { > + rc = -ENOMEM; > + goto err1; > + } > + > + dev->cnq_array = kcalloc(dev->num_cnq, > + sizeof(*dev->cnq_array), GFP_KERNEL); > + if (!dev->cnq_array) { > + rc = -ENOMEM; > + goto err2; > + } > + > + dev->sb_start = dev->ops->rdma_get_start_sb(dev->cdev); > + > + /* Allocate CNQ PBLs */ > + n_entries = min_t(u32, QED_RDMA_MAX_CNQ_SIZE, QEDR_ROCE_MAX_CNQ_SIZE
Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC
On Mon, Sep 12, 2016 at 05:39:35PM +, Yuval Mintz wrote: > >>> include/linux/qed/common_hsi.h | 1 + > >>> include/linux/qed/qed_if.h | 9 +- > >>> include/linux/qed/qed_ll2_if.h | 140 + > >>> include/linux/qed/qed_roce_if.h | 604 > >>> include/linux/qed/qede_roce.h | 88 + > >> > include/linux/qed/rdma_common.h | 1 + > >> > >> Something not directly related to your patches, but they brought my > >> attention to the fact that all these new (and old) rdma<->net devices > >> are polluting include/linux > >> > > ocrdma driver includes be_roce.h located in net/ethernet/emulex/benet > > location instead of include/linux/. > > This file helps to bind rdma to net device or underlying hw device. > > > May be similar change can be done for rest of the drivers for > > rdma<-->net devices? > > By adding explicit inclusion paths in the Makefile, a la > ccflags-y := -Idrivers/net/ethernet/emulex/benet ? > > While this might work, I personally dislike it as I find it > counter-intuitive when going over the code - > I don't expect driver to locally modify the inclusion path. > Besides, we're going to [eventually] a whole suite of drivers based > on the qed module, some of which would reside under drivers/scsi; > Not sure it's best to have 3 or 4 different drivers privately include the > same directory under a different subsystem. I agree with you that ocrdma's way can be valuable for small drivers. ocrdma has a small shared header set and doesn't need to change it rapidly to support different devices. I thought to place them in a similar directory to include/soc/* and remove them from include/linux/. We have include/rdma/ and it looks like a good candidate. > > >> Filtered output: > >> ➜ linux-rdma git:(topic/fixes-for-4.8-2) ls -dl include/linux/*/ > >> drwxrwxr-x 2 leonro leonro 4096 Aug 30 16:27 include/linux/hsi/ > >> drwxrwxr-x 2 leonro leonro 4096 Sep 12 19:08 include/linux/mlx4/ > >> drwxrwxr-x 2 leonro leonro 4096 Sep 7 15:31 include/linux/mlx5/ > >> drwxrwxr-x 2 leonro leonro 4096 Sep 8 17:46 include/linux/qed/ > >> > >> Is this the right place for them? > > > > Thanks
[PATCH V3] net-next: dsa: add FIB support
Add SWITCHDEV_OBJ_ID_IPV4_FIB support to the DSA layer. Signed-off-by: John Crispin --- Changes in V2 * rebase on latest net-next to fix compile errors Changes in V3 * fix subject prefix. this needs to go into the next tree Documentation/networking/dsa/dsa.txt | 18 +++ include/net/dsa.h| 13 +++ net/dsa/slave.c | 41 ++ 3 files changed, 72 insertions(+) diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt index 6d6c07c..6cd5831 100644 --- a/Documentation/networking/dsa/dsa.txt +++ b/Documentation/networking/dsa/dsa.txt @@ -607,6 +607,24 @@ of DSA, would be the its port-based VLAN, used by the associated bridge device. function that the driver has to call for each MAC address known to be behind the given port. A switchdev object is used to carry the VID and MDB info. +Route offloading + + +- ipv4_fib_prepare: routing layer function invoked to prepare the installation + of an ipv4 route into the switches routing database. If the operation is not + supported this function should return -EOPNOTSUPP. No hardware setup must be + done in this function. See ipv4_fib_add for this and details. + +- ipv4_fib_add: routing layer function invoked when a new ipv4 route should be + installed into the routing database of the switch, it should be programmed + with the route for the specified VLAN ID of the device that the route applies + to. + +- ipv4_fib_del: routing layer function invoked when an ipv4 route should be + removed from the routing database, the switch hardware should be programmed + to delete the specified MAC address of the nexthop from the specified VLAN ID + if it was mapped into the forwarding database. + TODO diff --git a/include/net/dsa.h b/include/net/dsa.h index 7556646..7fdd63e 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -237,6 +237,7 @@ struct switchdev_obj; struct switchdev_obj_port_fdb; struct switchdev_obj_port_mdb; struct switchdev_obj_port_vlan; +struct switchdev_obj_ipv4_fib; struct dsa_switch_ops { struct list_headlist; @@ -386,6 +387,18 @@ struct dsa_switch_ops { int (*port_mdb_dump)(struct dsa_switch *ds, int port, struct switchdev_obj_port_mdb *mdb, int (*cb)(struct switchdev_obj *obj)); + + /* +* IPV4 routing +*/ + int (*ipv4_fib_prepare)(struct dsa_switch *ds, int port, + const struct switchdev_obj_ipv4_fib *fib4, + struct switchdev_trans *trans); + int (*ipv4_fib_add)(struct dsa_switch *ds, int port, + const struct switchdev_obj_ipv4_fib *fib4, + struct switchdev_trans *trans); + int (*ipv4_fib_del)(struct dsa_switch *ds, int port, + const struct switchdev_obj_ipv4_fib *fib4); }; void register_switch_driver(struct dsa_switch_ops *type); diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 9ecbe78..c974ac0 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -334,6 +334,38 @@ static int dsa_slave_port_mdb_dump(struct net_device *dev, return -EOPNOTSUPP; } +static int dsa_slave_ipv4_fib_add(struct net_device *dev, + const struct switchdev_obj_ipv4_fib *fib4, + struct switchdev_trans *trans) +{ + struct dsa_slave_priv *p = netdev_priv(dev); + struct dsa_switch *ds = p->parent; + int ret; + + if (!ds->ops->ipv4_fib_prepare || !ds->ops->ipv4_fib_add) + return -EOPNOTSUPP; + + if (switchdev_trans_ph_prepare(trans)) + ret = ds->ops->ipv4_fib_prepare(ds, p->port, fib4, trans); + else + ret = ds->ops->ipv4_fib_add(ds, p->port, fib4, trans); + + return ret; +} + +static int dsa_slave_ipv4_fib_del(struct net_device *dev, + const struct switchdev_obj_ipv4_fib *fib4) +{ + struct dsa_slave_priv *p = netdev_priv(dev); + struct 
dsa_switch *ds = p->parent; + int ret = -EOPNOTSUPP; + + if (ds->ops->ipv4_fib_del) + ret = ds->ops->ipv4_fib_del(ds, p->port, fib4); + + return ret; +} + static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) { struct dsa_slave_priv *p = netdev_priv(dev); @@ -465,6 +497,11 @@ static int dsa_slave_port_obj_add(struct net_device *dev, SWITCHDEV_OBJ_PORT_VLAN(obj), trans); break; + case SWITCHDEV_OBJ_ID_IPV4_FIB: + err = dsa_slave_ipv4_fib_add(dev, +SWITCHDEV_OBJ_IPV4_FIB(obj), +trans); +
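To make the prepare/add/del contract described in the patch's documentation above more concrete, here is a hypothetical driver-side sketch. struct my_priv, MY_MAX_ROUTES and the my_hw_route_* helpers are invented for illustration and are not part of the patch; only the prepare/commit split mirrors the switchdev transaction model the patch relies on.

/*
 * Hypothetical sketch of a switch driver wiring up the proposed
 * ipv4_fib_* ops.  All my_* names are placeholders.
 */
struct my_priv {
	unsigned int num_routes;
};

#define MY_MAX_ROUTES 256

/* hypothetical hardware accessors, not a real API */
int my_hw_route_add(struct my_priv *priv, int port,
		    const struct switchdev_obj_ipv4_fib *fib4);
int my_hw_route_del(struct my_priv *priv, int port,
		    const struct switchdev_obj_ipv4_fib *fib4);

static int my_ipv4_fib_prepare(struct dsa_switch *ds, int port,
			       const struct switchdev_obj_ipv4_fib *fib4,
			       struct switchdev_trans *trans)
{
	struct my_priv *priv = ds->priv;

	/* Prepare phase: validate only, no hardware writes allowed. */
	if (priv->num_routes >= MY_MAX_ROUTES)
		return -ENOSPC;

	return 0;
}

static int my_ipv4_fib_add(struct dsa_switch *ds, int port,
			   const struct switchdev_obj_ipv4_fib *fib4,
			   struct switchdev_trans *trans)
{
	struct my_priv *priv = ds->priv;
	int err;

	/* Commit phase: program the route into the switch. */
	err = my_hw_route_add(priv, port, fib4);
	if (!err)
		priv->num_routes++;

	return err;
}

static int my_ipv4_fib_del(struct dsa_switch *ds, int port,
			   const struct switchdev_obj_ipv4_fib *fib4)
{
	struct my_priv *priv = ds->priv;
	int err;

	err = my_hw_route_del(priv, port, fib4);
	if (!err)
		priv->num_routes--;

	return err;
}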
[PATCH V2] net: dsa: add FIB support
Add SWITCHDEV_OBJ_ID_IPV4_FIB support to the DSA layer. Signed-off-by: John Crispin --- Changes in V2 * rebase on latest net-next to fix compile errors Documentation/networking/dsa/dsa.txt | 18 +++ include/net/dsa.h| 13 +++ net/dsa/slave.c | 41 ++ 3 files changed, 72 insertions(+) diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt index 6d6c07c..6cd5831 100644 --- a/Documentation/networking/dsa/dsa.txt +++ b/Documentation/networking/dsa/dsa.txt @@ -607,6 +607,24 @@ of DSA, would be the its port-based VLAN, used by the associated bridge device. function that the driver has to call for each MAC address known to be behind the given port. A switchdev object is used to carry the VID and MDB info. +Route offloading + + +- ipv4_fib_prepare: routing layer function invoked to prepare the installation + of an ipv4 route into the switches routing database. If the operation is not + supported this function should return -EOPNOTSUPP. No hardware setup must be + done in this function. See ipv4_fib_add for this and details. + +- ipv4_fib_add: routing layer function invoked when a new ipv4 route should be + installed into the routing database of the switch, it should be programmed + with the route for the specified VLAN ID of the device that the route applies + to. + +- ipv4_fib_del: routing layer function invoked when an ipv4 route should be + removed from the routing database, the switch hardware should be programmed + to delete the specified MAC address of the nexthop from the specified VLAN ID + if it was mapped into the forwarding database. + TODO diff --git a/include/net/dsa.h b/include/net/dsa.h index 7556646..7fdd63e 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -237,6 +237,7 @@ struct switchdev_obj; struct switchdev_obj_port_fdb; struct switchdev_obj_port_mdb; struct switchdev_obj_port_vlan; +struct switchdev_obj_ipv4_fib; struct dsa_switch_ops { struct list_headlist; @@ -386,6 +387,18 @@ struct dsa_switch_ops { int (*port_mdb_dump)(struct dsa_switch *ds, int port, struct switchdev_obj_port_mdb *mdb, int (*cb)(struct switchdev_obj *obj)); + + /* +* IPV4 routing +*/ + int (*ipv4_fib_prepare)(struct dsa_switch *ds, int port, + const struct switchdev_obj_ipv4_fib *fib4, + struct switchdev_trans *trans); + int (*ipv4_fib_add)(struct dsa_switch *ds, int port, + const struct switchdev_obj_ipv4_fib *fib4, + struct switchdev_trans *trans); + int (*ipv4_fib_del)(struct dsa_switch *ds, int port, + const struct switchdev_obj_ipv4_fib *fib4); }; void register_switch_driver(struct dsa_switch_ops *type); diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 9ecbe78..c974ac0 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -334,6 +334,38 @@ static int dsa_slave_port_mdb_dump(struct net_device *dev, return -EOPNOTSUPP; } +static int dsa_slave_ipv4_fib_add(struct net_device *dev, + const struct switchdev_obj_ipv4_fib *fib4, + struct switchdev_trans *trans) +{ + struct dsa_slave_priv *p = netdev_priv(dev); + struct dsa_switch *ds = p->parent; + int ret; + + if (!ds->ops->ipv4_fib_prepare || !ds->ops->ipv4_fib_add) + return -EOPNOTSUPP; + + if (switchdev_trans_ph_prepare(trans)) + ret = ds->ops->ipv4_fib_prepare(ds, p->port, fib4, trans); + else + ret = ds->ops->ipv4_fib_add(ds, p->port, fib4, trans); + + return ret; +} + +static int dsa_slave_ipv4_fib_del(struct net_device *dev, + const struct switchdev_obj_ipv4_fib *fib4) +{ + struct dsa_slave_priv *p = netdev_priv(dev); + struct dsa_switch *ds = p->parent; + int ret = -EOPNOTSUPP; + + if 
(ds->ops->ipv4_fib_del) + ret = ds->ops->ipv4_fib_del(ds, p->port, fib4); + + return ret; +} + static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) { struct dsa_slave_priv *p = netdev_priv(dev); @@ -465,6 +497,11 @@ static int dsa_slave_port_obj_add(struct net_device *dev, SWITCHDEV_OBJ_PORT_VLAN(obj), trans); break; + case SWITCHDEV_OBJ_ID_IPV4_FIB: + err = dsa_slave_ipv4_fib_add(dev, +SWITCHDEV_OBJ_IPV4_FIB(obj), +trans); + break; default: err = -EOPNOTSUPP;
Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
On Mon, Sep 12, 2016 at 8:00 PM, Alexander Duyck wrote: > On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend > wrote: >> The BQL API does not reference the sk_buff nor does the driver need to >> reference the sk_buff to calculate the length of a transmitted frame. >> This patch removes an sk_buff reference from the xmit irq path and >> also allows packets sent from XDP to use BQL. >> >> Signed-off-by: John Fastabend >> --- >> drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++- >> 1 file changed, 2 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c >> b/drivers/net/ethernet/intel/e1000/e1000_main.c >> index f42129d..62a7f8d 100644 >> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c >> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c >> @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter >> *adapter, >> if (cleaned) { >> total_tx_packets += buffer_info->segs; >> total_tx_bytes += buffer_info->bytecount; >> - if (buffer_info->skb) { >> - bytes_compl += buffer_info->skb->len; >> - pkts_compl++; >> - } >> - >> + bytes_compl += buffer_info->length; >> + pkts_compl++; >> } >> e1000_unmap_and_free_tx_resource(adapter, >> buffer_info); >> tx_desc->upper.data = 0; > > Actually it might be worth looking into why we have two different > stats for tracking bytecount and segs. From what I can tell the > pkts_compl value is never actually used. The function doesn't even > use the value so it is just wasted cycles. And as far as the bytes go Transmit flow steering which I posted and is pending on some testing uses the pkt count BQL to track inflight packets. > the accounting would be more accurate if you were to use bytecount > instead of buffer_info->skb->len. You would just need to update the > xmit function to use that on the other side so that they match. > > - Alex
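For readers following the BQL discussion, this is the enqueue/completion pairing that has to stay balanced. The netdev_tx_* helpers are the in-kernel BQL API; the wrapper names and the single-queue assumption are illustrative only.

/*
 * Minimal sketch of the BQL pairing under discussion.  Bytes reported
 * at transmit time via netdev_tx_sent_queue() must be matched by the
 * bytes reported at completion time via netdev_tx_completed_queue().
 */
#include <linux/netdevice.h>

static void example_xmit_account(struct net_device *dev, unsigned int bytes)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

	/* enqueue side: called from the xmit path */
	netdev_tx_sent_queue(txq, bytes);
}

static void example_clean_account(struct net_device *dev,
				  unsigned int pkts, unsigned int bytes)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

	/* completion side: must mirror what was reported on enqueue */
	netdev_tx_completed_queue(txq, pkts, bytes);
}

The byte count reported on the completion side does not have to come from an skb, which is why switching to buffer_info->length (recorded at map time) keeps the accounting correct for XDP frames as well as for normal stack traffic.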
Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend wrote: > From: Alexei Starovoitov > > This patch adds initial support for XDP on e1000 driver. Note e1000 > driver does not support page recycling in general which could be > added as a further improvement. However XDP_DROP case will recycle. > XDP_TX and XDP_PASS do not support recycling. > > e1000 only supports a single tx queue at this time so the queue > is shared between xdp program and Linux stack. It is possible for > an XDP program to starve the stack in this model. > > The XDP program will drop packets on XDP_TX errors. This can occur > when the tx descriptors are exhausted. This behavior is the same > for both shared queue models like e1000 and dedicated tx queue > models used in multiqueue devices. However if both the stack and > XDP are transmitting packets it is perhaps more likely to occur in > the shared queue model. Further refinement to the XDP model may be > possible in the future. > > I tested this patch running e1000 in a VM using KVM over a tap > device. > > CC: William Tu > Signed-off-by: Alexei Starovoitov > Signed-off-by: John Fastabend > --- > drivers/net/ethernet/intel/e1000/e1000.h |2 > drivers/net/ethernet/intel/e1000/e1000_main.c | 176 > + > 2 files changed, 175 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/intel/e1000/e1000.h > b/drivers/net/ethernet/intel/e1000/e1000.h > index d7bdea7..5cf8a0a 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000.h > +++ b/drivers/net/ethernet/intel/e1000/e1000.h > @@ -150,6 +150,7 @@ struct e1000_adapter; > */ > struct e1000_tx_buffer { > struct sk_buff *skb; > + struct page *page; > dma_addr_t dma; > unsigned long time_stamp; > u16 length; I'm not really a huge fan of adding yet another member to this structure. Each e1000_tx_buffer is already pretty big at 40 bytes, pushing it to 48 just means we lose that much more memory. If nothing else we may wan to look at doing something like creating a union between the skb, page, and an unsigned long. Then you could use the lowest bit of the address as a flag indicating if this is a skb or a page. > @@ -279,6 +280,7 @@ struct e1000_adapter { > struct e1000_rx_ring *rx_ring, > int cleaned_count); > struct e1000_rx_ring *rx_ring; /* One per active queue */ > + struct bpf_prog *prog; > struct napi_struct napi; > > int num_tx_queues; > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c > b/drivers/net/ethernet/intel/e1000/e1000_main.c > index 62a7f8d..232b927 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c > @@ -32,6 +32,7 @@ > #include > #include > #include > +#include > > char e1000_driver_name[] = "e1000"; > static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver"; > @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev, > return 0; > } > > +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog) > +{ > + struct e1000_adapter *adapter = netdev_priv(netdev); > + struct bpf_prog *old_prog; > + > + old_prog = xchg(&adapter->prog, prog); > + if (old_prog) { > + synchronize_net(); > + bpf_prog_put(old_prog); > + } > + > + if (netif_running(netdev)) > + e1000_reinit_locked(adapter); > + else > + e1000_reset(adapter); What is the point of the reset? If the interface isn't running is there anything in the hardware you actually need to cleanup? 
> + return 0; > +} > + > +static bool e1000_xdp_attached(struct net_device *dev) > +{ > + struct e1000_adapter *priv = netdev_priv(dev); > + > + return !!priv->prog; > +} > + > +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp) > +{ > + switch (xdp->command) { > + case XDP_SETUP_PROG: > + return e1000_xdp_set(dev, xdp->prog); > + case XDP_QUERY_PROG: > + xdp->prog_attached = e1000_xdp_attached(dev); > + return 0; > + default: > + return -EINVAL; > + } > +} > + > static const struct net_device_ops e1000_netdev_ops = { > .ndo_open = e1000_open, > .ndo_stop = e1000_close, > @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = { > #endif > .ndo_fix_features = e1000_fix_features, > .ndo_set_features = e1000_set_features, > + .ndo_xdp= e1000_xdp, > }; > > /** > @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev) > e1000_down_and_stop(adapter); > e1000_release_manageability(adapter); > > + if (adapter->prog) > + bpf_prog_put(adapter->prog); > + > unregister_netdev(netdev); > >
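A sketch of the tagged union Alexander suggests above, assuming the low pointer bit is free because both sk_buff and page pointers are word-aligned; the struct and helper names are illustrative and not part of the posted patch.

/*
 * Sketch only: one pointer-sized field shared by skb and page entries,
 * with the low bit used to tell them apart.
 */
#include <linux/skbuff.h>
#include <linux/mm_types.h>

#define TX_BUF_IS_PAGE	0x1UL

struct example_tx_buffer {
	union {
		struct sk_buff *skb;
		struct page *page;
		unsigned long data;
	};
	/* dma, length, segs, bytecount, etc. as in e1000_tx_buffer */
};

static inline void tx_buf_set_page(struct example_tx_buffer *b,
				   struct page *page)
{
	b->data = (unsigned long)page | TX_BUF_IS_PAGE;
}

static inline bool tx_buf_is_page(const struct example_tx_buffer *b)
{
	return b->data & TX_BUF_IS_PAGE;
}

static inline struct page *tx_buf_page(const struct example_tx_buffer *b)
{
	return (struct page *)(b->data & ~TX_BUF_IS_PAGE);
}

This keeps the buffer struct at its current size instead of growing it by another pointer for the page case.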
Re: [PATCH 00/15] drivers: net: use IS_ENABLED() instead of checking for built-in or module
From: Javier Martinez Canillas Date: Mon, 12 Sep 2016 10:03:31 -0400 > This trivial series is similar to [0] for net/ that you already merged, but > for drivers/net. The patches replaces the open coding to check for a Kconfig > symbol being built-in or module, with IS_ENABLED() macro that does the same. Series applied, thanks.
Re: icmpv6: issue with routing table entries from link local addresses
On (09/13/16 04:42), Hannes Frederic Sowa wrote: > > But a couple of unexpected things I noticed in linux: the link-local > > prefix should have a prefixlen of /10 according to > > http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml > > but "ip -6 route show" lists this as a /64.. > > The link local subnet is still specified to be a /64 as the other parts > of the address must be 0. Legally we probably could blackhole them. > https://tools.ietf.org/html/rfc4291#section-2.5.6 A bit of a gray area. 4291 does not specify this as MBZ, and IANA registration is a /10. Both Solaris and BSD use /10. And while fec0 is deprecated, I suppose some similar thing could come up in the future. ymmv. > We don't have urpf checks for ipv6, those are implemented in netfilter > only. This could very well be a firewall issue or something like that. yes, I know that (no rp_filter check for ipv6), and thats why I said it may be some similar variant. What tripped me up is that onlink prefixes (which are multipath routes in that they have the same dst, mask, metric) are not treated as part of the typical IP_ROUTE_MULTIPATH in many places in the code because the fib_nhs data-structures do not get set up. (thus, e.g., one ipoib config I was looking at recently, which had multiple ports connected to the same IB switch, and had the same onlink prefix on these ports, would not load-spread across all ports until I explicitly did the 'ip route change' to tell the kernel to ecmp that prefix). Lets see what Andreas reports.. --Sowmini
Re: icmpv6: issue with routing table entries from link local addresses
Sowmini Varadhan wrote: > On Mon, Sep 12, 2016 at 1:26 PM, Hannes Frederic Sowa > wrote: >> Hello, >> >> On 12.09.2016 16:27, Andreas Hübner wrote: > >>> >>> I have the following setup: >>> - 2 directly connected hosts (A+B), both have only link local addresses >>> configured (interface on both hosts is eth0) >: >>> fe80::/64 dev eth1 proto kernel metric 256 >>> fe80::/64 dev eth0 proto kernel metric 256 >: >>> The issue I currently have is, that the echo reply that host B should >>> generate is never sent back to host A. If I change the order of the >>> routing table entries on host B, everything works fine. >>> (host A is connected on eth0) >: >> This shouldn't be the case. We certainly carry over the ifindex of the >> received packet into the routing lookup of the outgoing packet, thus the >> appropriate rule, with outgoing ifindex should be selected. > > Like Hannes, I too would first check "is B sending out the echo-resp? on > which interface?". > > But a couple of unexpected things I noticed in linux: the link-local > prefix should have a prefixlen of /10 according to > http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml > but "ip -6 route show" lists this as a /64.. Do not be confused; link-local address for ethernet is described by IPv6 over FOO document (e.g., RFC2464 for Ethernet). The address (fe80::/64 for Ethernet, for example) is defined inside the link-local scope unicast address space (/10). -- Hideaki Yoshifuji Technical Division, MIRACLE LINUX CORPORATION
Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend wrote: > The BQL API does not reference the sk_buff nor does the driver need to > reference the sk_buff to calculate the length of a transmitted frame. > This patch removes an sk_buff reference from the xmit irq path and > also allows packets sent from XDP to use BQL. > > Signed-off-by: John Fastabend > --- > drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c > b/drivers/net/ethernet/intel/e1000/e1000_main.c > index f42129d..62a7f8d 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c > @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter > *adapter, > if (cleaned) { > total_tx_packets += buffer_info->segs; > total_tx_bytes += buffer_info->bytecount; > - if (buffer_info->skb) { > - bytes_compl += buffer_info->skb->len; > - pkts_compl++; > - } > - > + bytes_compl += buffer_info->length; > + pkts_compl++; > } > e1000_unmap_and_free_tx_resource(adapter, > buffer_info); > tx_desc->upper.data = 0; Actually it might be worth looking into why we have two different stats for tracking bytecount and segs. From what I can tell the pkts_compl value is never actually used. The function doesn't even use the value so it is just wasted cycles. And as far as the bytes go the accounting would be more accurate if you were to use bytecount instead of buffer_info->skb->len. You would just need to update the xmit function to use that on the other side so that they match. - Alex
Re: [RFC 02/11] Add RoCE driver framework
Hi Ram, Just a few thoughts On 12/09/2016 19:07, Ram Amrani wrote: > Adds a skeletal implementation of the qed* RoCE driver - > basically the ability to communicate with the qede driver and > receive notifications from it regarding various init/exit events. > > Signed-off-by: Rajesh Borundia > Signed-off-by: Ram Amrani > --- > drivers/infiniband/Kconfig | 2 + > drivers/infiniband/hw/Makefile | 1 + > drivers/infiniband/hw/qedr/Kconfig | 7 + > drivers/infiniband/hw/qedr/Makefile | 3 + > drivers/infiniband/hw/qedr/main.c| 293 + > drivers/infiniband/hw/qedr/qedr.h| 60 ++ > drivers/net/ethernet/qlogic/qede/Makefile| 1 + > drivers/net/ethernet/qlogic/qede/qede.h | 9 + > drivers/net/ethernet/qlogic/qede/qede_main.c | 35 ++- > drivers/net/ethernet/qlogic/qede/qede_roce.c | 309 > +++ > include/linux/qed/qed_if.h | 3 +- > include/linux/qed/qede_roce.h| 88 > include/uapi/linux/pci_regs.h| 3 + > 13 files changed, 803 insertions(+), 11 deletions(-) > create mode 100644 drivers/infiniband/hw/qedr/Kconfig > create mode 100644 drivers/infiniband/hw/qedr/Makefile > create mode 100644 drivers/infiniband/hw/qedr/main.c > create mode 100644 drivers/infiniband/hw/qedr/qedr.h > create mode 100644 drivers/net/ethernet/qlogic/qede/qede_roce.c > create mode 100644 include/linux/qed/qede_roce.h [SNIP] > + > +MODULE_DESCRIPTION("QLogic 40G/100G ROCE Driver"); > +MODULE_AUTHOR("QLogic Corporation"); > +MODULE_LICENSE("Dual BSD/GPL"); > +MODULE_VERSION(QEDR_MODULE_VERSION); > + > +uint debug; > +module_param(debug, uint, 0); > +MODULE_PARM_DESC(debug, "Default debug msglevel"); Why are you adding this as a module parameter? > +static LIST_HEAD(qedr_dev_list); > +static DEFINE_SPINLOCK(qedr_devlist_lock); > + You already have a qedr_dev_list mutex in the qede_roce.c file, why do you need this spinlock as well? > +void qedr_ib_dispatch_event(struct qedr_dev *dev, u8 port_num, > + enum ib_event_type type) > +{ > + struct ib_event ibev; > + > + ibev.device = &dev->ibdev; > + ibev.element.port_num = port_num; > + ibev.event = type; > + > + ib_dispatch_event(&ibev); > +} > + > +static enum rdma_link_layer qedr_link_layer(struct ib_device *device, > + u8 port_num) > +{ > + return IB_LINK_LAYER_ETHERNET; > +} > + > +static int qedr_register_device(struct qedr_dev *dev) > +{ > + strlcpy(dev->ibdev.name, "qedr%d", IB_DEVICE_NAME_MAX); > + > + memcpy(dev->ibdev.node_desc, QEDR_NODE_DESC, sizeof(QEDR_NODE_DESC)); > + dev->ibdev.owner = THIS_MODULE; > + > + dev->ibdev.get_link_layer = qedr_link_layer; > + > + return 0; > +} > + > +/* QEDR sysfs interface */ > +static ssize_t show_rev(struct device *device, struct device_attribute *attr, > + char *buf) > +{ > + struct qedr_dev *dev = dev_get_drvdata(device); > + > + return scnprintf(buf, PAGE_SIZE, "0x%x\n", dev->pdev->vendor); > +} > + > +static ssize_t show_fw_ver(struct device *device, struct device_attribute > *attr, > +char *buf) > +{ > + return scnprintf(buf, PAGE_SIZE, "%s\n", "FW_VER_TO_ADD"); > +} Ira Weiny has added a generic way to expose firmware versions in the rdma stack, can you have please have a look at c73428230d98d1352bcc69cd8306c292a85e1e42 and see how he converted the mlx5_ib module to use it. 
> +static ssize_t show_hca_type(struct device *device, > + struct device_attribute *attr, char *buf) > +{ > + return scnprintf(buf, PAGE_SIZE, "%s\n", "HCA_TYPE_TO_SET"); > +} > + > +static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); > +static DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); > +static DEVICE_ATTR(hca_type, S_IRUGO, show_hca_type, NULL); > + > +static struct device_attribute *qedr_attributes[] = { > + &dev_attr_hw_rev, > + &dev_attr_fw_ver, > + &dev_attr_hca_type > +}; > + > +static void qedr_remove_sysfiles(struct qedr_dev *dev) > +{ > + int i; > + > + for (i = 0; i < ARRAY_SIZE(qedr_attributes); i++) > + device_remove_file(&dev->ibdev.dev, qedr_attributes[i]); > +} > + > +void qedr_config_debug(uint debug, u32 *p_dp_module, u8 *p_dp_level) > +{ > + *p_dp_level = 0; > + *p_dp_module = 0; > + > + if (debug & QED_LOG_VERBOSE_MASK) { > + *p_dp_level = QED_LEVEL_VERBOSE; > + *p_dp_module = (debug & 0x3FFF); > + } else if (debug & QED_LOG_INFO_MASK) { > + *p_dp_level = QED_LEVEL_INFO; > + } else if (debug & QED_LOG_NOTICE_MASK) { > + *p_dp_level = QED_LEVEL_NOTICE; > + } > +} > + > +static void qedr_pci_set_atomic(struct qedr_dev *dev, struct pci_dev *pdev) > +{ > + struct pci_dev *bridge; > + u32 val; > + > + dev->atom
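Picking up the fw_ver comment in the review above, roughly what the suggested conversion looks like on the driver side. This is a hedged sketch: the callback prototype shown is the v4.8-era one and may differ on other kernel versions, and the version numbers are placeholders rather than the real qedr firmware query.

/*
 * Hedged sketch of the generic firmware-version interface: the driver
 * fills a callback on struct ib_device and drops its private fw_ver
 * sysfs attribute; the RDMA core exposes the string itself.
 */
static void qedr_example_get_dev_fw_str(struct ib_device *ibdev,
					char *str, size_t str_len)
{
	/* in a real driver this would come from the device/firmware query */
	snprintf(str, str_len, "%d.%d.%d.%d", 8, 10, 10, 0);
}

/* during qedr_register_device():
 *	dev->ibdev.get_dev_fw_str = qedr_example_get_dev_fw_str;
 */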
Re: icmpv6: issue with routing table entries from link local addresses
On 13.09.2016 04:03, Sowmini Varadhan wrote: > On Mon, Sep 12, 2016 at 1:26 PM, Hannes Frederic Sowa > wrote: >> Hello, >> >> On 12.09.2016 16:27, Andreas Hübner wrote: > >>> >>> I have the following setup: >>> - 2 directly connected hosts (A+B), both have only link local addresses >>> configured (interface on both hosts is eth0) >: >>> fe80::/64 dev eth1 proto kernel metric 256 >>> fe80::/64 dev eth0 proto kernel metric 256 >: >>> The issue I currently have is, that the echo reply that host B should >>> generate is never sent back to host A. If I change the order of the >>> routing table entries on host B, everything works fine. >>> (host A is connected on eth0) >: >> This shouldn't be the case. We certainly carry over the ifindex of the >> received packet into the routing lookup of the outgoing packet, thus the >> appropriate rule, with outgoing ifindex should be selected. > > Like Hannes, I too would first check "is B sending out the echo-resp? on > which interface?". > > But a couple of unexpected things I noticed in linux: the link-local > prefix should have a prefixlen of /10 according to > http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml > but "ip -6 route show" lists this as a /64.. The link local subnet is still specified to be a /64 as the other parts of the address must be 0. Legally we probably could blackhole them. https://tools.ietf.org/html/rfc4291#section-2.5.6 > moreover, even though I cannot use "ip [-6] route add.." to add the > same prefix multiple times (with different nexthop and/or interface) > unless I explicitly mark them as ECMP with /sbin/ip, it seems like you > can create the same onlink prefix on multiple interfaces, but the > kernel will not treat this as an ECMP group (and sometimes this > can produce inconsistent results depending on the order of > route addition, e.g., for ipv4 rp_filter checks). I dont know if some > variant of this (latter observation) may be the reason for the behavior > that Andreas reports. iproute sets the NLM_F_EXCL flag. Use ip route prepend ... We don't have urpf checks for ipv6, those are implemented in netfilter only. This could very well be a firewall issue or something like that. Bye, Hannes
Re: icmpv6: issue with routing table entries from link local addresses
On Mon, Sep 12, 2016 at 1:26 PM, Hannes Frederic Sowa wrote: > Hello, > > On 12.09.2016 16:27, Andreas Hübner wrote: >> >> I have the following setup: >> - 2 directly connected hosts (A+B), both have only link local addresses >> configured (interface on both hosts is eth0) : >> fe80::/64 dev eth1 proto kernel metric 256 >> fe80::/64 dev eth0 proto kernel metric 256 : >> The issue I currently have is, that the echo reply that host B should >> generate is never sent back to host A. If I change the order of the >> routing table entries on host B, everything works fine. >> (host A is connected on eth0) : > This shouldn't be the case. We certainly carry over the ifindex of the > received packet into the routing lookup of the outgoing packet, thus the > appropriate rule, with outgoing ifindex should be selected. Like Hannes, I too would first check "is B sending out the echo-resp? on which interface?". But a couple of unexpected things I noticed in linux: the link-local prefix should have a prefixlen of /10 according to http://www.iana.org/assignments/ipv6-address-space/ipv6-address-space.xhtml but "ip -6 route show" lists this as a /64.. moreover, even though I cannot use "ip [-6] route add.." to add the same prefix multiple times (with different nexthop and/or interface) unless I explicitly mark them as ECMP with /sbin/ip, it seems like you can create the same onlink prefix on multiple interfaces, but the kernel will not treat this as an ECMP group (and sometimes this can produce inconsistent results depending on the order of route addition, e.g., for ipv4 rp_filter checks). I dont know if some variant of this (latter observation) may be the reason for the behavior that Andreas reports. --Sowmini
Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote: > On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet wrote: > > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote: > > > >> yep. there are various ways to shoot yourself in the foot with xdp. > >> The simplest program that drops all the packets will make the box > >> unpingable. > > > > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a > > scooter on 101 highway ;) > > > > This XDP_TX thing was one of the XDP marketing stuff, but there is > > absolutely no documentation on it, warning users about possible > > limitations/outcomes. > > > > BTW, I am not sure mlx4 implementation even works, vs BQL : > > > > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(), > > but tx completion will call netdev_tx_completed_queue() -> crash > > > > Do we have one test to validate that a XDP_TX implementation is actually > > correct ? > > > Obviously not for e1000 :-(. We really need some real test and > performance results and analysis on the interaction between the stack > data path and XDP data path. no. we don't need it for e1k and we cannot really do it. this patch is for debugging of xdp programs only. > The fact that these changes are being > passed of as something only needed for KCM is irrelevant, e1000 is a > well deployed a NIC and there's no restriction that I see that would > prevent any users from enabling this feature on real devices. e1k is not even manufactured any more. Probably the only place where it can be found is computer history museum. e1000e fairs slightly better, but it's a different nic and this patch is not about it.
Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
> +static int > +qca8k_setup(struct dsa_switch *ds) > +{ > + struct qca8k_priv *priv = qca8k_to_priv(ds); > + int ret, i, phy_mode = -1; > + > + /* Keep track of which port is the connected to the cpu. This can be > + * phy 11 / port 0 or phy 5 / port 6. > + */ > + switch (dsa_upstream_port(ds)) { > + case 11: > + priv->cpu_port = 0; > + break; > + case 5: > + priv->cpu_port = 6; > + break; > + default: > + return -EOPNOTSUPP; > + } Hi John Can you explain this funkiness with CPU port numbers. Why use 11 instead of 0? I ideally i would like to use 0 here, but if that it not possible, it needs documenting in the device tree binding that 5 and 11 are special, representing ports 0 and 6. > + > + /* Start by setting up the register mapping */ > + priv->regmap = devm_regmap_init(ds->dev, NULL, priv, > + &qca8k_regmap_config); > + > + if (IS_ERR(priv->regmap)) > + pr_warn("regmap initialization failed"); > + > + /* Initialize CPU port pad mode (xMII type, delays...) */ > + phy_mode = of_get_phy_mode(ds->ports[ds->dst->cpu_port].dn); > + if (phy_mode < 0) { > + pr_err("Can't find phy-mode for master device\n"); > + return phy_mode; > + } > + ret = qca8k_set_pad_ctrl(priv, priv->cpu_port, phy_mode); > + if (ret < 0) > + return ret; > + > + /* Enable CPU Port */ > + qca8k_reg_set(priv, QCA8K_REG_GLOBAL_FW_CTRL0, > + QCA8K_GLOBAL_FW_CTRL0_CPU_PORT_EN); Does it matter which port is the CPU port here? 11 or 5 > + > + /* Enable MIB counters */ > + qca8k_reg_set(priv, QCA8K_REG_MIB, QCA8K_MIB_CPU_KEEP); > + qca8k_write(priv, QCA8K_REG_MODULE_EN, QCA8K_MODULE_EN_MIB); > + > + /* Enable QCA header mode on Port 0 */ > + qca8k_write(priv, QCA8K_REG_PORT_HDR_CTRL(priv->cpu_port), > + QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_TX_S | > + QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_RX_S); Does the header mode only work on port 0? > +static int > +qca8k_phy_read(struct dsa_switch *ds, int phy, int regnum) > +{ > + struct qca8k_priv *priv = qca8k_to_priv(ds); > + > + return mdiobus_read(priv->bus, phy, regnum); > +} > + > +static int > +qca8k_phy_write(struct dsa_switch *ds, int phy, int regnum, u16 val) > +{ > + struct qca8k_priv *priv = qca8k_to_priv(ds); > + > + return mdiobus_write(priv->bus, phy, regnum, val); > +} snip > +static int > +qca8k_sw_probe(struct mdio_device *mdiodev) > +{ > + struct qca8k_priv *priv; > + u32 phy_id; > + > + /* sw_addr is irrelevant as the switch occupies the MDIO bus from > + * addresses 0 to 4 (PHYs) and 16-23 (for MDIO 32bits protocol). So > + * we'll probe address 0 to see if we see the right switch family. > + */ So lets see if i have this right. Port 0 has no internal phy. Port 1 has an internal PHY at MDIO address 0. Port 2 has an internal PHY at MDIO address 1. ... Port 5 has an internal PHY ad MDIO address 4. Port 6 has no internal PHY. This is why you have funky port numbers, and phy_to_port. I think it would be a lot cleaner to handle this in qca8k_phy_read() and qca8k_phy_write(). Also, the comment it a bit misleading. You are probing the PHY ID, not the switch ID. At least for the Marvell switches, different switches can have the same embedded PHY. It would be annoying to find there is another incompatible switch with the same PHY ID. Is the embedded PHY compatible with the at803x driver? Thanks Andrew
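One way to read the suggestion above, hiding the port-to-PHY offset inside the phy_read/phy_write ops, is sketched below. The mapping (ports 1-5 to internal PHYs at MDIO addresses 0-4) follows Andrew's description; the helper names are illustrative and not from the posted driver.

/*
 * Sketch only: translate DSA port numbers to internal PHY addresses in
 * the ops themselves, so the rest of the driver uses plain port numbers.
 */
static int qca8k_port_to_phy(int port)
{
	/* Port 1 -> PHY 0, ..., Port 5 -> PHY 4; ports 0 and 6 have no PHY */
	if (port < 1 || port > 5)
		return -ENODEV;

	return port - 1;
}

static int example_phy_read(struct dsa_switch *ds, int port, int regnum)
{
	struct qca8k_priv *priv = qca8k_to_priv(ds);
	int phy = qca8k_port_to_phy(port);

	if (phy < 0)
		return 0xffff;	/* no PHY behind this port */

	return mdiobus_read(priv->bus, phy, regnum);
}

static int example_phy_write(struct dsa_switch *ds, int port,
			     int regnum, u16 val)
{
	struct qca8k_priv *priv = qca8k_to_priv(ds);
	int phy = qca8k_port_to_phy(port);

	if (phy < 0)
		return -ENODEV;

	return mdiobus_write(priv->bus, phy, regnum, val);
}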
Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, Sep 12, 2016 at 04:46:08PM -0700, Eric Dumazet wrote: > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote: > > > yep. there are various ways to shoot yourself in the foot with xdp. > > The simplest program that drops all the packets will make the box > > unpingable. > > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a > scooter on 101 highway ;) > > This XDP_TX thing was one of the XDP marketing stuff, but there is > absolutely no documentation on it, warning users about possible > limitations/outcomes. that's fair critique. there is no documentation for xdp in general. > BTW, I am not sure mlx4 implementation even works, vs BQL : it doesn't and it shouldn't. xdp and stack use different tx queues. > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(), > but tx completion will call netdev_tx_completed_queue() -> crash not quite. netdev_tx_completed_queue() is not called for xdp queues. > Do we have one test to validate that a XDP_TX implementation is actually > correct ? it's correct in the scope that it was defined for. There is no objective to share the same tx queue with the stack in xdp. The queues must be absolutely separate otherwise performance will tank. This e1k patch is really special, because the e1k has one tx queue, but this is for debugging of the programs, so it's ok to cut corners. The e1k xdp support doesn't need to interact nicely with stack. It's impossible in the first place. xdp is the layer before the stack.
Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
Hi John > +static u16 > +qca8k_phy_mmd_read(struct qca8k_priv *priv, int phy_addr, u16 addr, u16 reg) > +{ > + u16 data; > + > + mutex_lock(&priv->bus->mdio_lock); > + > + priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_ADDR, addr); > + priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_DATA, reg); > + priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_ADDR, addr | 0x4000); > + data = priv->bus->read(priv->bus, phy_addr, MII_ATH_MMD_DATA); > + > + mutex_unlock(&priv->bus->mdio_lock); > + > + return data; > +} snip > +static int > +qca8k_get_eee(struct dsa_switch *ds, int port, > + struct ethtool_eee *e) > +{ > + struct qca8k_priv *priv = qca8k_to_priv(ds); > + struct ethtool_eee *p = &priv->port_sts[qca8k_phy_to_port(port)].eee; > + u32 lp, adv, supported; > + u16 val; > + > + /* The switch has no way to tell the result of the AN so we need to > + * read the result directly from the PHYs MMD registers > + */ > + val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE); > + supported = mmd_eee_cap_to_ethtool_sup_t(val); > + > + val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_AN, MDIO_AN_EEE_ADV); > + adv = mmd_eee_adv_to_ethtool_adv_t(val); > + > + val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_AN, MDIO_AN_EEE_LPABLE); > + lp = mmd_eee_adv_to_ethtool_adv_t(val); > + > + e->eee_enabled = p->eee_enabled; > + e->eee_active = !!(supported & adv & lp); > + > + return 0; > +} Couldn't you just call phy_ethtool_get_eee(phydev)? Then you don't need qca8k_phy_mmd_read()? Andrew
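A sketch of the alternative Andrew suggests, assuming the switch driver can reach the port's phy_device (how it gets that pointer is driver-specific and not shown). phy_ethtool_get_eee() fills supported/advertised/lp_advertised from the same MMD registers (MDIO_PCS_EEE_ABLE, MDIO_AN_EEE_ADV, MDIO_AN_EEE_LPABLE) the open-coded version reads; the wrapper name and the 'stored' parameter are illustrative.

/*
 * Sketch only: let phylib do the MMD reads, then derive eee_active the
 * same way the posted code does.
 */
#include <linux/phy.h>
#include <linux/ethtool.h>

static int example_get_eee(struct phy_device *phydev,
			   const struct ethtool_eee *stored,
			   struct ethtool_eee *e)
{
	int err;

	/* fills e->supported, e->advertised and e->lp_advertised */
	err = phy_ethtool_get_eee(phydev, e);
	if (err)
		return err;

	e->eee_enabled = stored->eee_enabled;
	e->eee_active = !!(e->supported & e->advertised & e->lp_advertised);

	return 0;
}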
[PATCH v6 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim --- include/net/tc_act/tc_skbmod.h| 30 include/uapi/linux/tc_act/tc_skbmod.h | 39 + net/sched/Kconfig | 11 ++ net/sched/Makefile| 1 + net/sched/act_skbmod.c| 301 ++ 5 files changed, 382 insertions(+) create mode 100644 include/net/tc_act/tc_skbmod.h create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h create mode 100644 net/sched/act_skbmod.c diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h new file mode 100644 index 000..644a211 --- /dev/null +++ b/include/net/tc_act/tc_skbmod.h @@ -0,0 +1,30 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. +*/ + +#ifndef __NET_TC_SKBMOD_H +#define __NET_TC_SKBMOD_H + +#include +#include + +struct tcf_skbmod_params { + struct rcu_head rcu; + u64 flags; /*up to 64 types of operations; extend if needed */ + u8 eth_dst[ETH_ALEN]; + u16 eth_type; + u8 eth_src[ETH_ALEN]; +}; + +struct tcf_skbmod { + struct tc_actioncommon; + struct tcf_skbmod_params __rcu *skbmod_p; +}; +#define to_skbmod(a) ((struct tcf_skbmod *)a) + +#endif /* __NET_TC_SKBMOD_H */ diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h new file mode 100644 index 000..10fc07d --- /dev/null +++ b/include/uapi/linux/tc_act/tc_skbmod.h @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+*/ + +#ifndef __LINUX_TC_SKBMOD_H +#define __LINUX_TC_SKBMOD_H + +#include + +#define TCA_ACT_SKBMOD 15 + +#define SKBMOD_F_DMAC 0x1 +#define SKBMOD_F_SMAC 0x2 +#define SKBMOD_F_ETYPE 0x4 +#define SKBMOD_F_SWAPMAC 0x8 + +struct tc_skbmod { + tc_gen; + __u64 flags; +}; + +enum { + TCA_SKBMOD_UNSPEC, + TCA_SKBMOD_TM, + TCA_SKBMOD_PARMS, + TCA_SKBMOD_DMAC, + TCA_SKBMOD_SMAC, + TCA_SKBMOD_ETYPE, + TCA_SKBMOD_PAD, + __TCA_SKBMOD_MAX +}; +#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1) + +#endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 72e3426..7795d5a 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -749,6 +749,17 @@ config NET_ACT_CONNMARK To compile this code as a module, choose M here: the module will be called act_connmark. +config NET_ACT_SKBMOD +tristate "skb data modification action" +depends on NET_CLS_ACT +---help--- + Say Y here to allow modification of skb data + + If unsure, say N. + + To compile this code as a module, choose M here: the + module will be called act_skbmod. + config NET_ACT_IFE tristate "Inter-FE action based on IETF ForCES InterFE LFB" depends on NET_CLS_ACT diff --git a/net/sched/Makefile b/net/sched/Makefile index b9d046b..148ae0d 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.
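For readers skimming the patch, this is roughly how the SKBMOD_F_* flags defined above translate into packet edits on the datapath. It is an illustrative sketch, not the posted tcf_skbmod_run(); it assumes eth_type is already stored in network byte order and that the caller has made the header writable.

/*
 * Illustrative sketch of applying the SKBMOD_F_* flags; not the posted
 * implementation.  A real action must also make sure the Ethernet
 * header is writable (e.g. via skb_ensure_writable()) before editing.
 */
#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>

static void example_skbmod_apply(struct sk_buff *skb,
				 const struct tcf_skbmod_params *p)
{
	struct ethhdr *eth = eth_hdr(skb);
	u8 tmp[ETH_ALEN];

	if (p->flags & SKBMOD_F_DMAC)
		ether_addr_copy(eth->h_dest, p->eth_dst);
	if (p->flags & SKBMOD_F_SMAC)
		ether_addr_copy(eth->h_source, p->eth_src);
	if (p->flags & SKBMOD_F_ETYPE)
		eth->h_proto = p->eth_type; /* assumed network byte order */
	if (p->flags & SKBMOD_F_SWAPMAC) {
		/* swap source and destination MAC addresses */
		ether_addr_copy(tmp, eth->h_dest);
		ether_addr_copy(eth->h_dest, eth->h_source);
		ether_addr_copy(eth->h_source, tmp);
	}
}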
linux-next: manual merge of the netfilter-next tree with the net tree
Hi all, Today's linux-next merge of the netfilter-next tree got a conflict in: net/netfilter/nf_tables_netdev.c between commit: c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment") from the net tree and commit: ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()") from the netfilter-next tree. I fixed it up (I used the latter version of this file and applied the patch below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. From: Stephen Rothwell Date: Tue, 13 Sep 2016 10:08:58 +1000 Subject: [PATCH] netfilter: merge fixup for "nf_tables_netdev: remove redundant ip_hdr assignment" Signed-off-by: Stephen Rothwell --- include/net/netfilter/nf_tables_ipv4.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/net/netfilter/nf_tables_ipv4.h b/include/net/netfilter/nf_tables_ipv4.h index 968f00b82fb5..25e33aee91e7 100644 --- a/include/net/netfilter/nf_tables_ipv4.h +++ b/include/net/netfilter/nf_tables_ipv4.h @@ -33,7 +33,6 @@ __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt, if (!iph) return -1; - iph = ip_hdr(skb); if (iph->ihl < 5 || iph->version != 4) return -1; -- 2.8.1 -- Cheers, Stephen Rothwell
Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
Ok, screwup. I used version 5 again ;-< I will resend and incorporate Florian's comment. cheers, jamal
[PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim --- include/net/tc_act/tc_skbmod.h| 30 include/uapi/linux/tc_act/tc_skbmod.h | 39 + net/sched/Kconfig | 11 ++ net/sched/Makefile| 1 + net/sched/act_skbmod.c| 301 ++ 5 files changed, 382 insertions(+) create mode 100644 include/net/tc_act/tc_skbmod.h create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h create mode 100644 net/sched/act_skbmod.c diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h new file mode 100644 index 000..644a211 --- /dev/null +++ b/include/net/tc_act/tc_skbmod.h @@ -0,0 +1,30 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. +*/ + +#ifndef __NET_TC_SKBMOD_H +#define __NET_TC_SKBMOD_H + +#include +#include + +struct tcf_skbmod_params { + struct rcu_head rcu; + u64 flags; /*up to 64 types of operations; extend if needed */ + u8 eth_dst[ETH_ALEN]; + u16 eth_type; + u8 eth_src[ETH_ALEN]; +}; + +struct tcf_skbmod { + struct tc_actioncommon; + struct tcf_skbmod_params __rcu *skbmod_p; +}; +#define to_skbmod(a) ((struct tcf_skbmod *)a) + +#endif /* __NET_TC_SKBMOD_H */ diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h new file mode 100644 index 000..10fc07d --- /dev/null +++ b/include/uapi/linux/tc_act/tc_skbmod.h @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+*/ + +#ifndef __LINUX_TC_SKBMOD_H +#define __LINUX_TC_SKBMOD_H + +#include + +#define TCA_ACT_SKBMOD 15 + +#define SKBMOD_F_DMAC 0x1 +#define SKBMOD_F_SMAC 0x2 +#define SKBMOD_F_ETYPE 0x4 +#define SKBMOD_F_SWAPMAC 0x8 + +struct tc_skbmod { + tc_gen; + __u64 flags; +}; + +enum { + TCA_SKBMOD_UNSPEC, + TCA_SKBMOD_TM, + TCA_SKBMOD_PARMS, + TCA_SKBMOD_DMAC, + TCA_SKBMOD_SMAC, + TCA_SKBMOD_ETYPE, + TCA_SKBMOD_PAD, + __TCA_SKBMOD_MAX +}; +#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1) + +#endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 72e3426..7795d5a 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -749,6 +749,17 @@ config NET_ACT_CONNMARK To compile this code as a module, choose M here: the module will be called act_connmark. +config NET_ACT_SKBMOD +tristate "skb data modification action" +depends on NET_CLS_ACT +---help--- + Say Y here to allow modification of skb data + + If unsure, say N. + + To compile this code as a module, choose M here: the + module will be called act_skbmod. + config NET_ACT_IFE tristate "Inter-FE action based on IETF ForCES InterFE LFB" depends on NET_CLS_ACT diff --git a/net/sched/Makefile b/net/sched/Makefile index b9d046b..148ae0d 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.
Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet wrote: > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote: > >> yep. there are various ways to shoot yourself in the foot with xdp. >> The simplest program that drops all the packets will make the box unpingable. > > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a > scooter on 101 highway ;) > > This XDP_TX thing was one of the XDP marketing stuff, but there is > absolutely no documentation on it, warning users about possible > limitations/outcomes. > > BTW, I am not sure mlx4 implementation even works, vs BQL : > > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(), > but tx completion will call netdev_tx_completed_queue() -> crash > > Do we have one test to validate that a XDP_TX implementation is actually > correct ? > Obviously not for e1000 :-(. We really need some real test and performance results and analysis on the interaction between the stack data path and XDP data path. The fact that these changes are being passed off as something only needed for KCM is irrelevant, e1000 is a well-deployed NIC and there's no restriction that I see that would prevent any users from enabling this feature on real devices. So these patches need to be tested and justified. Eric's skepticism in all this really doesn't surprise me... Tom
Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
Jamal Hadi Salim wrote: > +#define MAX_EDIT_LEN ETH_HLEN This define is only referenced in a comment and seems unused otherwise.
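A minimal sketch of how the define could be wired in so it stops being dead code (hypothetical, not part of the posted patch): make the writable-length request follow MAX_EDIT_LEN instead of hard-coding ETH_HLEN, so future header edits only need to bump one constant.

	/* hypothetical use of the otherwise-unused define */
	err = skb_ensure_writable(skb, MAX_EDIT_LEN);
	if (unlikely(err))
		return TC_ACT_SHOT;	/* drop if the header cannot be made writable */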
Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 19:56 -0400, Jamal Hadi Salim wrote: > > Yes, I did ;-> > Another version coming. Is there a good way to catch > something like this (ala userspace valgrind)? You might try CONFIG_DEBUG_KMEMLEAK=y, or simply run your script deleting/inserting the action one million times ;)
Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
On 16-09-12 07:40 PM, Eric Dumazet wrote: On Mon, 2016-09-12 at 19:30 -0400, Jamal Hadi Salim wrote: From: Jamal Hadi Salim + +static struct tc_action_ops act_skbmod_ops = { + .kind = "skbmod", + .type = TCA_ACT_SKBMOD, + .owner = THIS_MODULE, + .act= tcf_skbmod_run, + .dump = tcf_skbmod_dump, + .init = tcf_skbmod_init, + .walk = tcf_skbmod_walker, + .lookup = tcf_skbmod_search, + .size = sizeof(struct tcf_skbmod), +}; Looks like you missed a .cleanup() handler ? Otherwise, d->skbmod_p wont be freed . Yes, I did ;-> Another version coming. Is there a good way to catch something like this (ala userspace valgrind)? cheers, jamal
Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote: > yep. there are various ways to shoot yourself in the foot with xdp. > The simplest program that drops all the packets will make the box unpingable. Well, my comment was about XDP_TX only, not about XDP_DROP or driving a scooter on 101 highway ;) This XDP_TX thing was one of the XDP marketing stuff, but there is absolutely no documentation on it, warning users about possible limitations/outcomes. BTW, I am not sure mlx4 implementation even works, vs BQL : mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(), but tx completion will call netdev_tx_completed_queue() -> crash Do we have one test to validate that a XDP_TX implementation is actually correct ?
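For readers following the BQL point: the contract being referenced is that every byte reported at completion time must also have been reported at xmit time on the same netdev_queue. A minimal sketch of that pairing (driver function names here are placeholders, not mlx4 code):

	/* xmit side: account every byte posted to the hardware */
	static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

		/* ... post TX descriptors for skb ... */
		netdev_tx_sent_queue(txq, skb->len);
		return NETDEV_TX_OK;
	}

	/* completion side: must mirror exactly what was accounted above */
	static void foo_tx_complete(struct net_device *dev, unsigned int pkts,
				    unsigned int bytes)
	{
		struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

		netdev_tx_completed_queue(txq, pkts, bytes);
	}

If one side is skipped for XDP frames, as suspected above, the accounting goes out of sync.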
Re: [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
On Mon, Sep 12, 2016 at 3:14 PM, John Fastabend wrote: > e1000 supports a single TX queue so it is being shared with the stack > when XDP runs XDP_TX action. This requires taking the xmit lock to > ensure we don't corrupt the tx ring. To avoid taking and dropping the > lock per packet this patch adds a bundling implementation to submit > a bundle of packets to the xmit routine. > > I tested this patch running e1000 in a VM using KVM over a tap > device using pktgen to generate traffic along with 'ping -f -l 100'. > John, It still looks to me like this will break the normal TX path. Can you test these patches on real e1000 to get performance numbers for both the drop and forwarding case. Also can you run this against an antagonist workload over the normal stack to see if one or the other path can be starved? Thanks, Tom > Suggested-by: Jesper Dangaard Brouer > Signed-off-by: John Fastabend > --- > drivers/net/ethernet/intel/e1000/e1000.h | 10 +++ > drivers/net/ethernet/intel/e1000/e1000_main.c | 80 > +++-- > 2 files changed, 70 insertions(+), 20 deletions(-) > > diff --git a/drivers/net/ethernet/intel/e1000/e1000.h > b/drivers/net/ethernet/intel/e1000/e1000.h > index 5cf8a0a..877b377 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000.h > +++ b/drivers/net/ethernet/intel/e1000/e1000.h > @@ -133,6 +133,8 @@ struct e1000_adapter; > #define E1000_TX_QUEUE_WAKE16 > /* How many Rx Buffers do we bundle into one write to the hardware ? */ > #define E1000_RX_BUFFER_WRITE 16 /* Must be power of 2 */ > +/* How many XDP XMIT buffers to bundle into one xmit transaction */ > +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE > > #define AUTO_ALL_MODES 0 > #define E1000_EEPROM_82544_APM 0x0004 > @@ -168,6 +170,11 @@ struct e1000_rx_buffer { > dma_addr_t dma; > }; > > +struct e1000_rx_buffer_bundle { > + struct e1000_rx_buffer *buffer; > + u32 length; > +}; > + > struct e1000_tx_ring { > /* pointer to the descriptor ring memory */ > void *desc; > @@ -206,6 +213,9 @@ struct e1000_rx_ring { > struct e1000_rx_buffer *buffer_info; > struct sk_buff *rx_skb_top; > > + /* array of XDP buffer information structs */ > + struct e1000_rx_buffer_bundle *xdp_buffer; > + > /* cpu for rx queue */ > int cpu; > > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c > b/drivers/net/ethernet/intel/e1000/e1000_main.c > index 232b927..31489d4 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c > @@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, > struct bpf_prog *prog) > struct e1000_adapter *adapter = netdev_priv(netdev); > struct bpf_prog *old_prog; > > + if (!adapter->rx_ring[0].xdp_buffer) { > + int size = sizeof(struct e1000_rx_buffer_bundle) * > + E1000_XDP_XMIT_BUNDLE_MAX; > + > + adapter->rx_ring[0].xdp_buffer = vzalloc(size); > + if (!adapter->rx_ring[0].xdp_buffer) > + return -ENOMEM; > + } > + > old_prog = xchg(&adapter->prog, prog); > if (old_prog) { > synchronize_net(); > @@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev) > if (adapter->prog) > bpf_prog_put(adapter->prog); > > + if (adapter->rx_ring[0].xdp_buffer) > + vfree(adapter->rx_ring[0].xdp_buffer); > + > unregister_netdev(netdev); > > e1000_phy_hw_reset(hw); > @@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring > *tx_ring, > > static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, > u32 len, > +struct e1000_adapter *adapter, > struct net_device *netdev, > -struct e1000_adapter *adapter) > +struct 
e1000_tx_ring *tx_ring) > { > - struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); > - struct e1000_hw *hw = &adapter->hw; > - struct e1000_tx_ring *tx_ring; > + const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); > > if (len > E1000_MAX_DATA_PER_TXD) > return; > > - /* e1000 only support a single txq at the moment so the queue is being > -* shared with stack. To support this requires locking to ensure the > -* stack and XDP are not running at the same time. Devices with > -* multiple queues should allocate a separate queue space. > -*/ > - HARD_TX_LOCK(netdev, txq, smp_processor_id()); > - > - tx_ring = adapter->tx_ring; > - > - if (E1000_DESC_UNUSED(tx_ring) < 2) { > - HARD_TX_UNLOCK(netdev, txq); > + if (E1000_DESC_UNUSED(tx_ring) < 2) > return; > -
Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 19:30 -0400, Jamal Hadi Salim wrote: > From: Jamal Hadi Salim > + > +static struct tc_action_ops act_skbmod_ops = { > + .kind = "skbmod", > + .type = TCA_ACT_SKBMOD, > + .owner = THIS_MODULE, > + .act= tcf_skbmod_run, > + .dump = tcf_skbmod_dump, > + .init = tcf_skbmod_init, > + .walk = tcf_skbmod_walker, > + .lookup = tcf_skbmod_search, > + .size = sizeof(struct tcf_skbmod), > +}; Looks like you missed a .cleanup() handler ? Otherwise, d->skbmod_p wont be freed .
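For reference, a minimal sketch of the kind of .cleanup() handler being asked for (illustrative only; the exact signature follows the tc_action_ops of this tree and may differ from what the author ends up posting): release the RCU-protected parameter block when the action is destroyed.

	static void tcf_skbmod_cleanup(struct tc_action *a, int bind)
	{
		struct tcf_skbmod *d = to_skbmod(a);
		struct tcf_skbmod_params *p;

		p = rcu_dereference_protected(d->skbmod_p, 1);
		if (p)
			kfree_rcu(p, rcu);
	}

plus a matching .cleanup = tcf_skbmod_cleanup entry in act_skbmod_ops.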
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On 16-09-12 07:20 PM, Eric Dumazet wrote: On Mon, 2016-09-12 at 16:10 -0700, Eric Dumazet wrote: diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c index 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041 100644 --- a/net/core/gen_stats.c +++ b/net/core/gen_stats.c @@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running, return; } do { - if (running) + if (running) { + local_bh_disable(); seq = read_seqcount_begin(running); + } bstats->bytes = b->bytes; bstats->packets = b->packets; + if (running) + local_bh_enable(); } while (running && read_seqcount_retry(running, seq)); } Ah well, forget this patch, re-enabling bh right before read_seqcount_retry() is not going to help. I have to say I have seen some odd issues once in a while reading generic action stats. I had a program that opened a netlink socket into the kernel. Every X seconds it does a dump of all the actions to read the stats. There is a very reproducible behavior that the stats are not in sync with the kernel. Given generic stats is lockless I thought maybe rcu or per-cpu stats was the issue. I havent had time to look closely. The solution is instead of keeping the socket open all the time; I open, read stats, close (repeat every x seconds). If there is something you want me to try - I could do sometimes this week. Your patch above may be useful! cheers, jamal
[PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim --- include/net/tc_act/tc_skbmod.h| 30 include/uapi/linux/tc_act/tc_skbmod.h | 39 + net/sched/Kconfig | 11 ++ net/sched/Makefile| 1 + net/sched/act_skbmod.c| 291 ++ 5 files changed, 372 insertions(+) create mode 100644 include/net/tc_act/tc_skbmod.h create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h create mode 100644 net/sched/act_skbmod.c diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h new file mode 100644 index 000..644a211 --- /dev/null +++ b/include/net/tc_act/tc_skbmod.h @@ -0,0 +1,30 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. +*/ + +#ifndef __NET_TC_SKBMOD_H +#define __NET_TC_SKBMOD_H + +#include +#include + +struct tcf_skbmod_params { + struct rcu_head rcu; + u64 flags; /*up to 64 types of operations; extend if needed */ + u8 eth_dst[ETH_ALEN]; + u16 eth_type; + u8 eth_src[ETH_ALEN]; +}; + +struct tcf_skbmod { + struct tc_actioncommon; + struct tcf_skbmod_params __rcu *skbmod_p; +}; +#define to_skbmod(a) ((struct tcf_skbmod *)a) + +#endif /* __NET_TC_SKBMOD_H */ diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h new file mode 100644 index 000..10fc07d --- /dev/null +++ b/include/uapi/linux/tc_act/tc_skbmod.h @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+*/ + +#ifndef __LINUX_TC_SKBMOD_H +#define __LINUX_TC_SKBMOD_H + +#include + +#define TCA_ACT_SKBMOD 15 + +#define SKBMOD_F_DMAC 0x1 +#define SKBMOD_F_SMAC 0x2 +#define SKBMOD_F_ETYPE 0x4 +#define SKBMOD_F_SWAPMAC 0x8 + +struct tc_skbmod { + tc_gen; + __u64 flags; +}; + +enum { + TCA_SKBMOD_UNSPEC, + TCA_SKBMOD_TM, + TCA_SKBMOD_PARMS, + TCA_SKBMOD_DMAC, + TCA_SKBMOD_SMAC, + TCA_SKBMOD_ETYPE, + TCA_SKBMOD_PAD, + __TCA_SKBMOD_MAX +}; +#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1) + +#endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 72e3426..7795d5a 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -749,6 +749,17 @@ config NET_ACT_CONNMARK To compile this code as a module, choose M here: the module will be called act_connmark. +config NET_ACT_SKBMOD +tristate "skb data modification action" +depends on NET_CLS_ACT +---help--- + Say Y here to allow modification of skb data + + If unsure, say N. + + To compile this code as a module, choose M here: the + module will be called act_skbmod. + config NET_ACT_IFE tristate "Inter-FE action based on IETF ForCES InterFE LFB" depends on NET_CLS_ACT diff --git a/net/sched/Makefile b/net/sched/Makefile index b9d046b..148ae0d 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.
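The act_skbmod.c fast path is not shown above; purely as an illustration (not the posted code), the SKBMOD_F_SWAPMAC case could look roughly like the following, once skb_ensure_writable() has succeeded:

	if (flags & SKBMOD_F_SWAPMAC) {
		struct ethhdr *eth = eth_hdr(skb);
		u8 tmpaddr[ETH_ALEN];

		/* swap source and destination MAC via a bounce buffer */
		memcpy(tmpaddr, eth->h_dest, ETH_ALEN);
		memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
		memcpy(eth->h_source, tmpaddr, ETH_ALEN);
	}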
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On 16-09-12 07:02 PM, Jamal Hadi Salim wrote: Looks like typical starvation caused by aggressive softirq. Well, then it is strange that in one case a tc dump of the rule was immediate and in the other case it was consistent for 5-15 seconds. I may be going nuts because I can't reproduce that issue anymore. The VM is configured for SMP with 4 vcpus. I will remove that check and submit a new version. cheers, jamal
Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
On Mon, 2016-09-12 at 06:39 +, Y.B. Lu wrote: > Hi Scott, > > Thanks for your review :) > See my comment inline. > > > > > -Original Message- > > From: Scott Wood [mailto:o...@buserror.net] > > Sent: Friday, September 09, 2016 11:47 AM > > To: Y.B. Lu; linux-...@vger.kernel.org; ulf.hans...@linaro.org; Arnd > > Bergmann > > Cc: linuxppc-...@lists.ozlabs.org; devicet...@vger.kernel.org; linux-arm- > > ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux- > > c...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux- > > foundation.org; netdev@vger.kernel.org; Mark Rutland; Rob Herring; > > Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; Bhupesh > > Sharma; Qiang Zhao; Kumar Gala; Santosh Shilimkar; Leo Li; X.B. Xie > > Subject: Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms > > > > On Tue, 2016-09-06 at 16:28 +0800, Yangbo Lu wrote: > > > > > > The global utilities block controls power management, I/O device > > > enabling, power-onreset(POR) configuration monitoring, alternate > > > function selection for multiplexed signals,and clock control. > > > > > > This patch adds a driver to manage and access global utilities block. > > > Initially only reading SVR and registering soc device are supported. > > > Other guts accesses, such as reading RCW, should eventually be moved > > > into this driver as well. > > > > > > Signed-off-by: Yangbo Lu > > > Signed-off-by: Scott Wood > > Don't put my signoff on patches that I didn't put it on > > myself. Definitely don't put mine *after* yours on patches that were > > last modified by you. > > > > If you want to mention that the soc_id encoding was my suggestion, then > > do so explicitly. > > > [Lu Yangbo-B47093] I found your 'signoff' on this patch at below link. > http://patchwork.ozlabs.org/patch/649211/ > > So, let me just change the order in next version ? > Signed-off-by: Scott Wood > Signed-off-by: Yangbo Lu No. This isn't my patch so my signoff shouldn't be on it. > [Lu Yangbo-B47093] It's a good idea to move die into .family I think. > In my opinion, it's better to keep svr and name in soc_id just like your > suggestion above. > > > > { > > .soc_id = "svr:0x85490010,name:T1023E,", > > .family = "QorIQ T1024", > > } > The user probably don’t like to learn the svr value. What they want is just > to match the soc they use. > It's convenient to use name+rev for them to match a soc. What the user should want 99% of the time is to match the die (plus revision), not the soc. > Regarding shrinking the table, I think it's hard to use svr+mask. Because I > find many platforms use different masks. > We couldn’t know the mask according svr value. The mask would be part of the table: { { .die = "T1024", .svr = 0x8540, .mask = 0xfff0, }, { .die = "T1040", .svr = 0x8520, .mask = 0xfff0, }, { .die = "LS1088A", .svr = 0x8703, .mask = 0x, }, ... } There's a small risk that we get the mask wrong and a different die is created that matches an existing table, but it doesn't seem too likely, and can easily be fixed with a kernel update if it happens. BTW, aren't ls2080a and ls2085a the same die? And is there no non-E version of LS2080A/LS2040A? > > > + do { > > > + if (!matches->soc_id) > > > + return NULL; > > > + if (glob_match(svr_match, matches->soc_id)) > > > + break; > > > + } while (matches++); > > Are you expecting "matches++" to ever evaluate as false? > [Lu Yangbo-B47093] Yes, this is used to match the soc we use in qoriq_soc > array until getting true. 
> We need to get the name and die information defined in array. I'm not asking whether the glob_match will ever return true. I'm saying that "matches++" will never become NULL. > > > + /* Register soc device */ > > > + soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL); > > > + if (!soc_dev_attr) { > > > + ret = -ENOMEM; > > > + goto out_unmap; > > > + } > > Couldn't this be statically allocated? > [Lu Yangbo-B47093] Do you mean we define this struct statically ? > > static struct soc_device_attribute soc_dev_attr; Yes. > > > + > > > + soc_dev = soc_device_register(soc_dev_attr); > > > + if (IS_ERR(soc_dev)) { > > > + ret = -ENODEV; > > Why are you changing the error code? > [Lu Yangbo-B47093] What error code should we use ? :) ret = PTR_ERR(soc_dev); + } > > > + return 0; > > > +out: > > > + kfree(soc_dev_attr->machine); > > > + kfree(soc_dev_attr->family); > > > + kfree(soc_dev_attr->soc_id); > > > + kfree(soc_dev_attr->revision); > > > + kfree(soc_dev_attr); > > > +out_unmap: > > > + iounmap(guts->regs); > > > +out_free: > > > + kfree(guts); > > devm > [Lu Yangbo-B47093] What's the devm
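A small sketch of the table-driven lookup being described (struct and function names are made up for illustration, not the driver's actual API): walk a NULL-terminated array and match on (svr & mask) == svr.

	struct qoriq_die_desc {
		const char *die;
		u32 svr;
		u32 mask;
	};

	static const struct qoriq_die_desc *qoriq_match_die(u32 svr,
					const struct qoriq_die_desc *table)
	{
		/* table is terminated by an entry with .die == NULL */
		for (; table->die; table++)
			if ((svr & table->mask) == table->svr)
				return table;

		return NULL;	/* unknown die; warn or fall back */
	}

The family/soc_id strings can then be built from the matched entry plus the raw SVR, so a new revision of a known die needs no kernel change.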
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 16:10 -0700, Eric Dumazet wrote: > > diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c > index > 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041 > 100644 > --- a/net/core/gen_stats.c > +++ b/net/core/gen_stats.c > @@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running, > return; > } > do { > - if (running) > + if (running) { > + local_bh_disable(); > seq = read_seqcount_begin(running); > + } > bstats->bytes = b->bytes; > bstats->packets = b->packets; > + if (running) > + local_bh_enable(); > } while (running && read_seqcount_retry(running, seq)); > } Ah well, forget this patch, re-enabling bh right before read_seqcount_retry() is not going to help.
Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver
On Mon, Sep 12, 2016 at 10:43:00PM +, Adit Ranadive wrote: > On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote: > > On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote: > > > [2] Libpvrdma User-level library - > > > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary > > > > You will probably find that rdma-plumbing will be the best way to get > > your userspace component into the distributors. > Sorry I haven't paying attention to that discussion. Do you know how soon > distros will pick up the rdma-plumbing stuff? Hopefully things will be clearer after the call on Tuesday. I'm hoping we can begin the process sooner than later. There are 4 new drivers being submitted and at least 2-3 more kernel drivers that are not yet in Debian/SuSE/etc. By far the simplest way to push those 6 new user space drivers into the distros will be via rdma-plumbing, I'm optimistic that the vendors will work together on this. > It seems like it should be fairly straightforward to include the libpvrdma git > within the rdma-plumbing git as well. Let me know what the process is since > I may have to discuss it internally. For now, you can prepare your own clone and create a providers/pvrdma directory similar to the others. It should be very straightforward. Let me know if you have any problems. You should ensure clean compilation on FC24 (for instance using the provided docker tooling), and review the commit stream in rdma-plumbing to make sure you pick up all the systemic fixes that have already been done to the other providers. If Doug accepts your provider upstream before we finalize rdma-plumbing then I will add your git to the giant merge, however please make my life easier and ensure the code does not have any of the systemic problems I alreadly fixed in other providers :( After finalization you'll have to post your user space provider to the mailing list and Doug will pick it up. Jason
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 19:02 -0400, Jamal Hadi Salim wrote: > On 16-09-12 06:26 PM, Eric Dumazet wrote: > > On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote: > > > >> I noticed some very weird issues when I took that out. > >> Running sufficiently large amount of traffic (ping -f is sufficient) > >> I saw that when i did a dump it took anywhere between 6-15 seconds. > >> With the read_lock in place response was immediate. > >> I can go back and run things to verify - but it was very odd. > > > > This was on uni processor ? > > > > It was a VM. > > > Looks like typical starvation caused by aggressive softirq. > > > > Well, then it is strange that in one case a tc dump of the rule > was immediate and in the other case it was consistent for 5-15 > seconds. > This needs investigation ;) One possible loop under high stress would be possible in __gnet_stats_copy_basic(), since we might restart the loop if we are really really unlucky, but this would have nothing with your patches. diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c index 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041 100644 --- a/net/core/gen_stats.c +++ b/net/core/gen_stats.c @@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running, return; } do { - if (running) + if (running) { + local_bh_disable(); seq = read_seqcount_begin(running); + } bstats->bytes = b->bytes; bstats->packets = b->packets; + if (running) + local_bh_enable(); } while (running && read_seqcount_retry(running, seq)); } EXPORT_SYMBOL(__gnet_stats_copy_basic);
Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, Sep 12, 2016 at 03:46:39PM -0700, Eric Dumazet wrote: > On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote: > > From: Alexei Starovoitov > > > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, > > +u32 len, > > +struct net_device *netdev, > > +struct e1000_adapter *adapter) > > +{ > > + struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); > > + struct e1000_hw *hw = &adapter->hw; > > + struct e1000_tx_ring *tx_ring; > > + > > + if (len > E1000_MAX_DATA_PER_TXD) > > + return; > > + > > + /* e1000 only support a single txq at the moment so the queue is being > > +* shared with stack. To support this requires locking to ensure the > > +* stack and XDP are not running at the same time. Devices with > > +* multiple queues should allocate a separate queue space. > > +*/ > > + HARD_TX_LOCK(netdev, txq, smp_processor_id()); > > + > > + tx_ring = adapter->tx_ring; > > + > > + if (E1000_DESC_UNUSED(tx_ring) < 2) { > > + HARD_TX_UNLOCK(netdev, txq); > > + return; > > + } > > + > > + if (netif_xmit_frozen_or_stopped(txq)) > > + return; > > + > > + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); > > + netdev_sent_queue(netdev, len); > > + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); > > + > > + writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt); > > + mmiowb(); > > + > > + HARD_TX_UNLOCK(netdev, txq); > > +} > > > e1000_tx_map() is full of workarounds. > > Have a look at last_tx_tso for example. > >/* Workaround for Controller erratum -- > * descriptor for non-tso packet in a linear SKB that follows > a > * tso gets written back prematurely before the data is fully > * DMA'd to the controller > */ > if (!skb->data_len && tx_ring->last_tx_tso && > !skb_is_gso(skb)) { > tx_ring->last_tx_tso = false; > size -= 4; > } > > Look, this XDP_TX thing is hard to properly implement and test on > various NIC revisions. my reading of these e1k workarounds are for old versions of the nic that don't even exist anymore. I believe none of them apply to qemu e1k. > Without proper queue management, high prio packets in qdisc wont be sent > if NIC is under RX -> XDP_TX flood. yep. there are various ways to shoot yourself in the foot with xdp. The simplest program that drops all the packets will make the box unpingable. > Sounds a horrible feature to me. yep. not pretty by any means. This whole thing is only to test xdp programs under qemu which realistically emulates only e1k. I simply don't see any other options to test xdp under kvm.
[PATCH v3 net 1/1] net sched actions: fix GETing actions
From: Jamal Hadi Salim With the batch changes that translated transient actions into a temporary list lost in the translation was the fact that tcf_action_destroy() will eventually delete the action from the permanent location if the refcount is zero. Example of what broke: ...add a gact action to drop sudo $TC actions add action drop index 10 ...now retrieve it, looks good sudo $TC actions get action gact index 10 ...retrieve it again and find it is gone! sudo $TC actions get action gact index 10 Fixes: commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"), commit 824a7e8863b3 ("net_sched: remove an unnecessary list_del()") commit f07fed82ad79 ("net_sched: remove the leftover cleanup_a()") Signed-off-by: Jamal Hadi Salim --- net/sched/act_api.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index d09d068..50720b1 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -592,6 +592,16 @@ err_out: return ERR_PTR(err); } +static void cleanup_a(struct list_head *actions, int ovr) +{ + struct tc_action *a; + + list_for_each_entry(a, actions, list) { + if (ovr) + a->tcfa_refcnt -= 1; + } +} + int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions) @@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla, goto err; } act->order = i; + if (ovr) + act->tcfa_refcnt += 1; list_add_tail(&act->list, actions); } + + /* Remove the temp refcnt which was necessary to protect against +* destroying an existing action which was being replaced +*/ + cleanup_a(actions, ovr); return 0; err: @@ -883,6 +900,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n, goto err; } act->order = i; + if (event == RTM_GETACTION) + act->tcfa_refcnt += 1; list_add_tail(&act->list, &actions); } -- 1.9.1
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On 16-09-12 06:26 PM, Eric Dumazet wrote: On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote: I noticed some very weird issues when I took that out. Running sufficiently large amount of traffic (ping -f is sufficient) I saw that when i did a dump it took anywhere between 6-15 seconds. With the read_lock in place response was immediate. I can go back and run things to verify - but it was very odd. This was on uni processor ? It was a VM. Looks like typical starvation caused by aggressive softirq. Well, then it is strange that in one case a tc dump of the rule was immediate and in the other case it was consistent for 5-15 seconds. Anyway, I suspect your kernel build has rcu_read_lock() and rcu_read_unlock() as NOP ;) Which doesnt give me a good feel if i tested this well ;-> I would like to try again with those two kernels just to make sure i was not imagining this. cheers, jamal
Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
Hi John > +#include > +#include > +#include > +#include > +#include > +#include One linux/phy.h is enough. > + /* Setup connection between CPU ports & PHYs */ > + for (i = 0; i < DSA_MAX_PORTS; i++) { > + /* CPU port gets connected to all PHYs in the switch */ > + if (dsa_is_cpu_port(ds, i)) { > + qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(priv->cpu_port), > + QCA8K_PORT_LOOKUP_MEMBER, > + ds->enabled_port_mask << 1); > + } > + > + /* Invividual PHYs gets connected to CPU port only */ > + if (ds->enabled_port_mask & BIT(i)) { > + int port = qca8k_phy_to_port(i); > + int shift = 16 * (port % 2); > + snip > +static void > +qca8k_get_ethtool_stats(struct dsa_switch *ds, int phy, > + uint64_t *data) > +{ > + struct qca8k_priv *priv = qca8k_to_priv(ds); > + const struct qca8k_mib_desc *mib; > + u32 reg, i, port; > + u64 hi; > + > + port = qca8k_phy_to_port(phy); snip > +static inline int qca8k_phy_to_port(int phy) > +{ > + if (phy < 5) > + return phy + 1; > + > + return -1; > +} What keep throwing me while reading this code is the use of PHY/phy. What is meant by a phy in this driver? DSA is all about switches. Switches are a switch fabric and a number of ports. The ports contain Ethernet MACs, and optionally contain embedded PHYs. If there are not embedded PHYs, there are often external PHYs, and sometimes we just have MACs connected back to back. Generally, DSA drivers don't need to do much with the MAC. Maybe set the speed and duplex, and maybe signal delays. They also don't need to do much with the PHY, phylib and a phy driver does all the work. DSA is all about the switch fabric. Yet i see phy scattered in a few places in this driver, and it seems like they have nothing to do with the PHY. Please can you change the terminology here? It might help if you can remove qca8k_phy_to_port(). Why do you need to add 1? Could this be eliminated via a better choice of reg in the device tree? Thanks Andrew
Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote: > From: Alexei Starovoitov > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, > + u32 len, > + struct net_device *netdev, > + struct e1000_adapter *adapter) > +{ > + struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); > + struct e1000_hw *hw = &adapter->hw; > + struct e1000_tx_ring *tx_ring; > + > + if (len > E1000_MAX_DATA_PER_TXD) > + return; > + > + /* e1000 only support a single txq at the moment so the queue is being > + * shared with stack. To support this requires locking to ensure the > + * stack and XDP are not running at the same time. Devices with > + * multiple queues should allocate a separate queue space. > + */ > + HARD_TX_LOCK(netdev, txq, smp_processor_id()); > + > + tx_ring = adapter->tx_ring; > + > + if (E1000_DESC_UNUSED(tx_ring) < 2) { > + HARD_TX_UNLOCK(netdev, txq); > + return; > + } > + > + if (netif_xmit_frozen_or_stopped(txq)) > + return; > + > + e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); > + netdev_sent_queue(netdev, len); > + e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); > + > + writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt); > + mmiowb(); > + > + HARD_TX_UNLOCK(netdev, txq); > +} e1000_tx_map() is full of workarounds. Have a look at last_tx_tso for example. /* Workaround for Controller erratum -- * descriptor for non-tso packet in a linear SKB that follows a * tso gets written back prematurely before the data is fully * DMA'd to the controller */ if (!skb->data_len && tx_ring->last_tx_tso && !skb_is_gso(skb)) { tx_ring->last_tx_tso = false; size -= 4; } Look, this XDP_TX thing is hard to properly implement and test on various NIC revisions. Without proper queue management, high prio packets in qdisc wont be sent if NIC is under RX -> XDP_TX flood. Sounds a horrible feature to me.
Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver
On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote: > On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote: > > [2] Libpvrdma User-level library - > > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary > > You will probably find that rdma-plumbing will be the best way to get > your userspace component into the distributors. Hi Jason, Sorry, I haven't been paying attention to that discussion. Do you know how soon distros will pick up the rdma-plumbing stuff? It seems like it should be fairly straightforward to include the libpvrdma git within the rdma-plumbing git as well. Let me know what the process is since I may have to discuss it internally. Thanks! Adit
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote: > I noticed some very weird issues when I took that out. > Running sufficiently large amount of traffic (ping -f is sufficient) > I saw that when i did a dump it took anywhere between 6-15 seconds. > With the read_lock in place response was immediate. > I can go back and run things to verify - but it was very odd. This was on uni processor ? Looks like typical starvation caused by aggressive softirq. Anyway, I suspect your kernel build has rcu_read_lock() and rcu_read_unlock() as NOP ;)
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On 16-09-12 06:01 PM, Eric Dumazet wrote: On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote: + +static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a, + int bind, int ref) +{ + struct tcf_skbmod *d = to_skbmod(a); + unsigned char *b = skb_tail_pointer(skb); + struct tcf_skbmod_params *p = rtnl_dereference(d->skbmod_p); + struct tc_skbmod opt = { + .index = d->tcf_index, + .refcnt = d->tcf_refcnt - ref, + .bindcnt = d->tcf_bindcnt - bind, + .action = d->tcf_action, + }; + struct tcf_t t; + + rcu_read_lock(); You do not need rcu read lock protection here, RTNL is enough. I noticed some very weird issues when I took that out. Running sufficiently large amount of traffic (ping -f is sufficient) I saw that when i did a dump it took anywhere between 6-15 seconds. With the read_lock in place response was immediate. I can go back and run things to verify - but it was very odd. cheers, jamal
[net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
e1000 supports a single TX queue so it is being shared with the stack when XDP runs XDP_TX action. This requires taking the xmit lock to ensure we don't corrupt the tx ring. To avoid taking and dropping the lock per packet this patch adds a bundling implementation to submit a bundle of packets to the xmit routine. I tested this patch running e1000 in a VM using KVM over a tap device using pktgen to generate traffic along with 'ping -f -l 100'. Suggested-by: Jesper Dangaard Brouer Signed-off-by: John Fastabend --- drivers/net/ethernet/intel/e1000/e1000.h | 10 +++ drivers/net/ethernet/intel/e1000/e1000_main.c | 80 +++-- 2 files changed, 70 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h index 5cf8a0a..877b377 100644 --- a/drivers/net/ethernet/intel/e1000/e1000.h +++ b/drivers/net/ethernet/intel/e1000/e1000.h @@ -133,6 +133,8 @@ struct e1000_adapter; #define E1000_TX_QUEUE_WAKE16 /* How many Rx Buffers do we bundle into one write to the hardware ? */ #define E1000_RX_BUFFER_WRITE 16 /* Must be power of 2 */ +/* How many XDP XMIT buffers to bundle into one xmit transaction */ +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE #define AUTO_ALL_MODES 0 #define E1000_EEPROM_82544_APM 0x0004 @@ -168,6 +170,11 @@ struct e1000_rx_buffer { dma_addr_t dma; }; +struct e1000_rx_buffer_bundle { + struct e1000_rx_buffer *buffer; + u32 length; +}; + struct e1000_tx_ring { /* pointer to the descriptor ring memory */ void *desc; @@ -206,6 +213,9 @@ struct e1000_rx_ring { struct e1000_rx_buffer *buffer_info; struct sk_buff *rx_skb_top; + /* array of XDP buffer information structs */ + struct e1000_rx_buffer_bundle *xdp_buffer; + /* cpu for rx queue */ int cpu; diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c index 232b927..31489d4 100644 --- a/drivers/net/ethernet/intel/e1000/e1000_main.c +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c @@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog) struct e1000_adapter *adapter = netdev_priv(netdev); struct bpf_prog *old_prog; + if (!adapter->rx_ring[0].xdp_buffer) { + int size = sizeof(struct e1000_rx_buffer_bundle) * + E1000_XDP_XMIT_BUNDLE_MAX; + + adapter->rx_ring[0].xdp_buffer = vzalloc(size); + if (!adapter->rx_ring[0].xdp_buffer) + return -ENOMEM; + } + old_prog = xchg(&adapter->prog, prog); if (old_prog) { synchronize_net(); @@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev) if (adapter->prog) bpf_prog_put(adapter->prog); + if (adapter->rx_ring[0].xdp_buffer) + vfree(adapter->rx_ring[0].xdp_buffer); + unregister_netdev(netdev); e1000_phy_hw_reset(hw); @@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring, static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, u32 len, +struct e1000_adapter *adapter, struct net_device *netdev, -struct e1000_adapter *adapter) +struct e1000_tx_ring *tx_ring) { - struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); - struct e1000_hw *hw = &adapter->hw; - struct e1000_tx_ring *tx_ring; + const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); if (len > E1000_MAX_DATA_PER_TXD) return; - /* e1000 only support a single txq at the moment so the queue is being -* shared with stack. To support this requires locking to ensure the -* stack and XDP are not running at the same time. Devices with -* multiple queues should allocate a separate queue space. 
-*/ - HARD_TX_LOCK(netdev, txq, smp_processor_id()); - - tx_ring = adapter->tx_ring; - - if (E1000_DESC_UNUSED(tx_ring) < 2) { - HARD_TX_UNLOCK(netdev, txq); + if (E1000_DESC_UNUSED(tx_ring) < 2) return; - } if (netif_xmit_frozen_or_stopped(txq)) return; @@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info, e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len); netdev_sent_queue(netdev, len); e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1); +} +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info, + struct net_device *netdev, + struct e1000_adapter *adapter) +{ + struct netdev_queue *tx
[net-next PATCH v3 2/3] e1000: add initial XDP support
From: Alexei Starovoitov This patch adds initial support for XDP on e1000 driver. Note e1000 driver does not support page recycling in general which could be added as a further improvement. However XDP_DROP case will recycle. XDP_TX and XDP_PASS do not support recycling. e1000 only supports a single tx queue at this time so the queue is shared between xdp program and Linux stack. It is possible for an XDP program to starve the stack in this model. The XDP program will drop packets on XDP_TX errors. This can occur when the tx descriptors are exhausted. This behavior is the same for both shared queue models like e1000 and dedicated tx queue models used in multiqueue devices. However if both the stack and XDP are transmitting packets it is perhaps more likely to occur in the shared queue model. Further refinement to the XDP model may be possible in the future. I tested this patch running e1000 in a VM using KVM over a tap device. CC: William Tu Signed-off-by: Alexei Starovoitov Signed-off-by: John Fastabend --- drivers/net/ethernet/intel/e1000/e1000.h |2 drivers/net/ethernet/intel/e1000/e1000_main.c | 176 + 2 files changed, 175 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h index d7bdea7..5cf8a0a 100644 --- a/drivers/net/ethernet/intel/e1000/e1000.h +++ b/drivers/net/ethernet/intel/e1000/e1000.h @@ -150,6 +150,7 @@ struct e1000_adapter; */ struct e1000_tx_buffer { struct sk_buff *skb; + struct page *page; dma_addr_t dma; unsigned long time_stamp; u16 length; @@ -279,6 +280,7 @@ struct e1000_adapter { struct e1000_rx_ring *rx_ring, int cleaned_count); struct e1000_rx_ring *rx_ring; /* One per active queue */ + struct bpf_prog *prog; struct napi_struct napi; int num_tx_queues; diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c index 62a7f8d..232b927 100644 --- a/drivers/net/ethernet/intel/e1000/e1000_main.c +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c @@ -32,6 +32,7 @@ #include #include #include +#include char e1000_driver_name[] = "e1000"; static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver"; @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev, return 0; } +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog) +{ + struct e1000_adapter *adapter = netdev_priv(netdev); + struct bpf_prog *old_prog; + + old_prog = xchg(&adapter->prog, prog); + if (old_prog) { + synchronize_net(); + bpf_prog_put(old_prog); + } + + if (netif_running(netdev)) + e1000_reinit_locked(adapter); + else + e1000_reset(adapter); + return 0; +} + +static bool e1000_xdp_attached(struct net_device *dev) +{ + struct e1000_adapter *priv = netdev_priv(dev); + + return !!priv->prog; +} + +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp) +{ + switch (xdp->command) { + case XDP_SETUP_PROG: + return e1000_xdp_set(dev, xdp->prog); + case XDP_QUERY_PROG: + xdp->prog_attached = e1000_xdp_attached(dev); + return 0; + default: + return -EINVAL; + } +} + static const struct net_device_ops e1000_netdev_ops = { .ndo_open = e1000_open, .ndo_stop = e1000_close, @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = { #endif .ndo_fix_features = e1000_fix_features, .ndo_set_features = e1000_set_features, + .ndo_xdp= e1000_xdp, }; /** @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev) e1000_down_and_stop(adapter); e1000_release_manageability(adapter); + if (adapter->prog) + 
bpf_prog_put(adapter->prog); + unregister_netdev(netdev); e1000_phy_hw_reset(hw); @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter) struct e1000_hw *hw = &adapter->hw; u32 rdlen, rctl, rxcsum; - if (adapter->netdev->mtu > ETH_DATA_LEN) { + if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) { rdlen = adapter->rx_ring[0].count * sizeof(struct e1000_rx_desc); adapter->clean_rx = e1000_clean_jumbo_rx_irq; @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter, dev_kfree_skb_any(buffer_info->skb); buffer_info->skb = NULL; } + if (buffer_info->page) { + put_page(buffer_info->page); + buffer_info->page = NULL; + } + buffer_info->time_stamp = 0;
Re: [PATCH v4 16/16] MAINTAINERS: Update for PVRDMA driver
On Mon, Sep 12, 2016 at 10:52:26 -0700, Jason Gunthorpe wrote: > On Sun, Sep 11, 2016 at 09:49:26PM -0700, Adit Ranadive wrote: > > Add maintainer info for the PVRDMA driver. > > You can probably squash the last three patches. Thanks for taking a look. Since Doug mentioned that he would squash my commits anyways while applying I decided to keep them separate. Let me know if you still want me to squash them. > .. and fix the __u32 stuff throughout the entire driver please. Will do. Thanks, Adit
[net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
The BQL API does not reference the sk_buff nor does the driver need to reference the sk_buff to calculate the length of a transmitted frame. This patch removes an sk_buff reference from the xmit irq path and also allows packets sent from XDP to use BQL. Signed-off-by: John Fastabend --- drivers/net/ethernet/intel/e1000/e1000_main.c |7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c index f42129d..62a7f8d 100644 --- a/drivers/net/ethernet/intel/e1000/e1000_main.c +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter *adapter, if (cleaned) { total_tx_packets += buffer_info->segs; total_tx_bytes += buffer_info->bytecount; - if (buffer_info->skb) { - bytes_compl += buffer_info->skb->len; - pkts_compl++; - } - + bytes_compl += buffer_info->length; + pkts_compl++; } e1000_unmap_and_free_tx_resource(adapter, buffer_info); tx_desc->upper.data = 0;
[net-next PATCH v3 0/3] e1000 XDP implementation
This patch implements XDP on e1000 a few comments below: The XDP_TX path does not violate BQL in this series so we have to increment and decrement the bql counters correctly. When a TX queue can have both stack traffic and XDP traffic I believe user configured BQL should be correct. If users do not want BQL they can always disable it or raise the limits. I left the xdp program attached to the adapter structure because in the e1000 case the program is global because we only run on a single queue. Pushing the program into the rx_ring just creates a bunch of ring[0] references in the code and adds little in my opinion. This is really just a style issue though. Notice I did push the bundle logic onto the queue but I view the bundling and rx ring buffers to be so closely linked it was worth it and only caused a couple extra ring[0] lines. XDP_TX will drop packets if any errors occur while attempting to push onto the TX descriptor ring. This seems to me at least to be the most sensible thing to do and keeps e1000 inline with existing mlx XDP drivers. This can always be extended if some consensus is made later. Thanks for all the comments in v{1|2}. As always any comments/feedback appreciated. Also initial patch was based on a patch I took from Alexei so I left his signed-off there even though I mangled the code a fair amount since then and did not give him any chance to review off-list. --- Alexei Starovoitov (1): e1000: add initial XDP support John Fastabend (2): e1000: track BQL bytes regardless of skb or not e1000: bundle xdp xmit routines drivers/net/ethernet/intel/e1000/e1000.h | 12 + drivers/net/ethernet/intel/e1000/e1000_main.c | 227 - 2 files changed, 229 insertions(+), 10 deletions(-) -- Signature
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On 16-09-12 05:58 PM, Eric Dumazet wrote: On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote: From: Jamal Hadi Salim + + /* XXX: if you are going to edit more fields beyond ethernet header +* (example when you add IP header replacement or vlan swap) +* then MAX_EDIT_LEN needs to change appropriately + */ + err = skb_ensure_writable(skb, ETH_HLEN); + if (unlikely(err)) /* best policy is to drop on the floor */ + action = TC_ACT_SHOT; + + rcu_read_lock(); + action = READ_ONCE(d->tcf_action); You are overwriting @action, while you might have put TC_ACT_SHOT in it 3 lines above. Maybe you meant : if (err) return TC_ACT_SHOT; Thanks for catching that (leftover from when i used a lock). Will resubmit. cheers, jamal
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote: > + > +static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a, > +int bind, int ref) > +{ > + struct tcf_skbmod *d = to_skbmod(a); > + unsigned char *b = skb_tail_pointer(skb); > + struct tcf_skbmod_params *p = rtnl_dereference(d->skbmod_p); > + struct tc_skbmod opt = { > + .index = d->tcf_index, > + .refcnt = d->tcf_refcnt - ref, > + .bindcnt = d->tcf_bindcnt - bind, > + .action = d->tcf_action, > + }; > + struct tcf_t t; > + > + rcu_read_lock(); You do not need rcu read lock protection here, RTNL is enough. > + > + opt.flags = p->flags; > + if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt)) > + goto nla_put_failure; > + if ((p->flags & SKBMOD_F_DMAC) && > + nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst)) > + goto nla_put_failure; > + if ((p->flags & SKBMOD_F_SMAC) && > + nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src)) > + goto nla_put_failure; > + if ((p->flags & SKBMOD_F_ETYPE) && > + nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type))) > + goto nla_put_failure; > + > + tcf_tm_dump(&t, &d->tcf_tm); > + if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD)) > + goto nla_put_failure; > + > + rcu_read_unlock(); > + > + return skb->len; > +nla_put_failure: > + rcu_read_unlock(); > + nlmsg_trim(skb, b); > + return -1; > +} > +
Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote: > From: Jamal Hadi Salim > + > + /* XXX: if you are going to edit more fields beyond ethernet header > + * (example when you add IP header replacement or vlan swap) > + * then MAX_EDIT_LEN needs to change appropriately > + */ > + err = skb_ensure_writable(skb, ETH_HLEN); > + if (unlikely(err)) /* best policy is to drop on the floor */ > + action = TC_ACT_SHOT; > + > + rcu_read_lock(); > + action = READ_ONCE(d->tcf_action); You are overwriting @action, while you might have put TC_ACT_SHOT in it 3 lines above. Maybe you meant : if (err) return TC_ACT_SHOT;
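In other words, a sketch of the suggested ordering (not the final patch): bail out before reading tcf_action so the TC_ACT_SHOT result cannot be overwritten.

	err = skb_ensure_writable(skb, ETH_HLEN);
	if (unlikely(err))
		return TC_ACT_SHOT;	/* drop; do not fall through to the read below */

	rcu_read_lock();
	action = READ_ONCE(d->tcf_action);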
Re: [PATCH] [RFC] proc connector: add namespace events
Hi everyone 08.09.2016, 18:39, "Alban Crequy" : > The act of a process creating or joining a namespace via clone(), > unshare() or setns() is a useful signal for monitoring applications. > + if (old_ns->mnt_ns != new_ns->mnt_ns) > + proc_ns_connector(tsk, CLONE_NEWNS, PROC_NM_REASON_CLONE, old_mntns_inum, > new_mntns_inum); > + > + if (old_ns->uts_ns != new_ns->uts_ns) > + proc_ns_connector(tsk, CLONE_NEWUTS, PROC_NM_REASON_CLONE, > old_ns->uts_ns->ns.inum, new_ns->uts_ns->ns.inum); > + > + if (old_ns->ipc_ns != new_ns->ipc_ns) > + proc_ns_connector(tsk, CLONE_NEWIPC, PROC_NM_REASON_CLONE, > old_ns->ipc_ns->ns.inum, new_ns->ipc_ns->ns.inum); > + > + if (old_ns->net_ns != new_ns->net_ns) > + proc_ns_connector(tsk, CLONE_NEWNET, PROC_NM_REASON_CLONE, > old_ns->net_ns->ns.inum, new_ns->net_ns->ns.inum); > + > + if (old_ns->cgroup_ns != new_ns->cgroup_ns) > + proc_ns_connector(tsk, CLONE_NEWCGROUP, PROC_NM_REASON_CLONE, > old_ns->cgroup_ns->ns.inum, new_ns->cgroup_ns->ns.inum); > + > + if (old_ns->pid_ns_for_children != new_ns->pid_ns_for_children) > + proc_ns_connector(tsk, CLONE_NEWPID, PROC_NM_REASON_CLONE, > old_ns->pid_ns_for_children->ns.inum, new_ns->pid_ns_for_children->ns.inum); > + } > + Patch looks good to me from a technical/connector point of view, but this event multiplication is a bit weird imho. I'm not against it, but did you consider sending just 2 serialized ns structures via a single message, and letting the client check all the ns bits itself?
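To make that suggestion concrete, a rough sketch of such a combined event payload (all names here are hypothetical, nothing is from the patch): one message carries the old and new inode numbers for every namespace type, and the listener diffs whichever of them it cares about.

	struct proc_ns_set {
		__u32 mnt_inum;
		__u32 uts_inum;
		__u32 ipc_inum;
		__u32 net_inum;
		__u32 pid_inum;
		__u32 cgroup_inum;
	};

	struct proc_event_ns {
		__u32 reason;			/* clone / unshare / setns */
		struct proc_ns_set old_ns;	/* namespaces before the call */
		struct proc_ns_set new_ns;	/* namespaces after the call */
	};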
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer wrote: > On Fri, 9 Sep 2016 18:03:09 +0300 > Saeed Mahameed wrote: > >> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev >> wrote: >> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: >> >> >> >> I'm sorry but I have a problem with this patch! >> >> Looking at this patch, I want to bring up a fundamental architectural >> >> concern with the development direction of XDP transmit. >> >> >> >> >> >> What you are trying to implement, with delaying the doorbell, is >> >> basically TX bulking for TX_XDP. >> >> >> >> Why not implement a TX bulking interface directly instead?!? >> >> >> >> Yes, the tailptr/doorbell is the most costly operation, but why not >> >> also take advantage of the benefits of bulking for other parts of the >> >> code? (benefit is smaller, by every cycles counts in this area) >> >> >> >> This hole XDP exercise is about avoiding having a transaction cost per >> >> packet, that reads "bulking" or "bundling" of packets, where possible. >> >> >> >> Lets do bundling/bulking from the start! >> >> Jesper, what we did here is also bulking, instead of bulking in a >> temporary list in the driver we list the packets in the HW and once >> done we transmit all at once via the xdp_doorbell indication. >> >> I agree with you that we can take advantage and improve the icache by >> bulking first in software and then queue all at once in the hw then >> ring one doorbell. >> >> but I also agree with Alexei that this will introduce an extra >> pointer/list handling in the driver and we need to do the comparison >> between both approaches before we decide which is better. > > I welcome implementing both approaches and benchmarking them against > each-other, I'll gladly dedicate time for this! > Yes, please implement this so we can have something clear to evaluate and compare. There is far to much spewing of "expert opinions" happening here :-( > I'm reacting so loudly, because this is a mental model switch, that > need to be applied to the full drivers RX path. Also for normal stack > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated, > there is between 10%-25% perf gain here. > > The key point is stop seeing the lowest driver RX as something that > works on a per packet basis. It might be easier to view this as a kind > of vector processing. This is about having a vector of packets, where > we apply some action/operation. > > This is about using the CPU more efficiently, getting it to do more > instructions per cycle. The next level of optimization (for >= Sandy > Bridge CPUs) is to make these vector processing stages small enough to fit > into the CPUs decoded-I-cache section. > > > It might also be important to mention, that for netstack delivery, I > don't imagine bulking 64 packets. Instead, I imagine doing 8-16 > packets. Why, because the NIC-HW runs independently and have the > opportunity to deliver more frames in the RX ring queue, while the > stack "slow" processed packets. You can view this as "bulking" from > the RX ring queue, with a "look-back" before exiting the NAPI poll loop. > > >> this must be marked as future work and not have this from the start. > > We both know that statement is BS, and the other approach will never be > implemented once this patch is accepted upstream. > > >> > mlx4 already does bulking and this proposed mlx5 set of patches >> > does bulking as well. > > I'm reacting exactly because mlx4 is also doing "bulking" in the wrong > way IMHO. 
And now mlx5 is building on the same principle. That is why > I'm yelling STOP. > > >> >> The reason behind the xmit_more API is that we could not change the >> >> API of all the drivers. And we found that calling an explicit NDO >> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that >> >> would hit the common single packet use-case. >> >> >> >> It should be really easy to build a bundle of packets that need XDP_TX >> >> action, especially given you only have a single destination "port". >> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. > > > [1] http://lists.openwall.net/netdev/2016/04/19/89 > [2] http://lists.openwall.net/netdev/2016/01/15/51 > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
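To make the "vector processing" idea concrete, a pseudo-C sketch (every name below is made up; this is not mlx4/mlx5 code): pull a small bundle off the RX ring, run the per-packet stage over the whole bundle while it is hot in the I-cache, then pay the per-bundle costs (doorbell, stack hand-off) once.

	#define RX_BUNDLE_MAX	16

	struct rx_bundle {
		unsigned int count;
		void *pkts[RX_BUNDLE_MAX];
	};

	static void rx_poll_bundle(struct rx_ring *ring, struct rx_bundle *b)
	{
		unsigned int i;

		b->count = 0;
		while (b->count < RX_BUNDLE_MAX && rx_desc_ready(ring))
			b->pkts[b->count++] = rx_pull_packet(ring);

		/* stage 1: tight per-packet loop, ideally small enough to sit
		 * in the decoded I-cache
		 */
		for (i = 0; i < b->count; i++)
			run_xdp_or_classify(b->pkts[i]);

		/* stage 2: one transaction for the whole bundle */
		flush_tx_doorbell_and_deliver(b);
	}

Keeping the bundle at 8-16 packets, as suggested above, lets the NIC keep refilling the ring while the bundle is being processed.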
[PATCH net-next 2/2] bpf: use skb_at_tc_ingress helper in tcf_bpf
We have a small skb_at_tc_ingress() helper for testing for ingress, so make use of it. cls_bpf already uses it and so should act_bpf. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- net/sched/act_bpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 78400de..1d39600 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -39,10 +39,10 @@ static struct tc_action_ops act_bpf_ops; static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act, struct tcf_result *res) { + bool at_ingress = skb_at_tc_ingress(skb); struct tcf_bpf *prog = to_bpf(act); struct bpf_prog *filter; int action, filter_res; - bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS; tcf_lastuse_update(&prog->tcf_tm); bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb); -- 1.9.3
[PATCH net-next 1/2] bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf
The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is actually unnecessary and can be removed altogether. This was added by commit a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets"), which was later on improved by 3431205e0397 ("bpf: make programs see skb->data == L2 for ingress and egress"). We're always guaranteed to have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf(). Reason is that since 6d1ccff62780 ("net: reset mac header in dev_start_xmit()") we do skb_reset_mac_header() in __dev_queue_xmit() before we could call into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress() always sees a valid mac header as well (things like skb_reset_mac_len() would badly fail otherwise). Thus, drop the unnecessary test in classifier and action case. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- net/sched/act_bpf.c | 3 --- net/sched/cls_bpf.c | 3 --- 2 files changed, 6 deletions(-) diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index bfa8707..78400de 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -44,9 +44,6 @@ static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act, int action, filter_res; bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS; - if (unlikely(!skb_mac_header_was_set(skb))) - return TC_ACT_UNSPEC; - tcf_lastuse_update(&prog->tcf_tm); bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb); diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 4742f41..1d92d4d 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -83,9 +83,6 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp, struct cls_bpf_prog *prog; int ret = -1; - if (unlikely(!skb_mac_header_was_set(skb))) - return -1; - /* Needed here for accessing maps. */ rcu_read_lock(); list_for_each_entry_rcu(prog, &head->plist, link) { -- 1.9.3
[PATCH net-next 0/2] Misc cls_bpf/act_bpf improvements
Two minor improvements to {cls,act}_bpf. For details please see individual patches. Thanks! Daniel Borkmann (2): bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf bpf: use skb_at_tc_ingress helper in tcf_bpf net/sched/act_bpf.c | 5 + net/sched/cls_bpf.c | 3 --- 2 files changed, 1 insertion(+), 7 deletions(-) -- 1.9.3
Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
Hi Matt, On Sun, Sep 11, 2016 at 10:57 PM, Matt Corallo wrote: > The general advice is to force it into 100M mode then it seems to work fine. > There is some issue with the driver seemingly not fully supporting 1G (at > least on Odroid C2) which needs to be worked out. Thanks for the hint - by "forcing 100M mode" you mean "ethtool -s eth0 speed 100 duplex full"? Do you know more about this issue (or do you just know the workaround)? It'd be interesting to know which "driver" does not support Gbit speeds (as there are three drivers involved: the Realtek PHY driver, stmmac and the Meson stmmac glue driver). Regards, Martin
Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
Hi Alexandre, On Mon, Sep 12, 2016 at 6:37 PM, Alexandre Torgue wrote: > Which Synopsys IP version do you use ? found this in a dmesg log: [1.504784] stmmac - user ID: 0x11, Synopsys ID: 0x37 [1.509785] Ring mode enabled [1.512796] DMA HW capability register supported [1.517286] Normal descriptors [1.520565] RX Checksum Offload Engine supported [1.525219] COE Type 2 [1.527638] TX Checksum insertion supported [1.531862] Wake-Up On Lan supported [1.535483] Enable RX Mitigation via HW Watchdog Timer [1.543851] libphy: stmmac: probed [1.544025] eth0: PHY ID 001cc916 at 0 IRQ POLL (stmmac-0:00) active [1.550321] eth0: PHY ID 001cc916 at 7 IRQ POLL (stmmac-0:07) >> Gbit ethernet on my device is provided by a Realtek RTL8211F RGMII PHY. >> Similar issues were reported in #linux-amlogic by a user with an >> Odroid C2 board (= similar hardware). >> >> The symptoms are: >> Receiving data is plenty fast (I can max out my internet connection >> easily, and with iperf3 I get ~900Mbit/s). >> Transmitting data from the device is unfortunately very slow, traffic >> sometimes even stalls completely. >> >> I have attached the iperf results and the output of >> /sys/kernel/debug/stmmaceth/eth0/descriptors_status. >> Below you can find the ifconfig, netstat and stmmac dma_cap info >> (*after* I ran all tests). >> >> The "involved parties" are: >> - Meson GXBB specific network configuration registers (I have have >> double-checked them with the reference drivers: everything seems fine >> here) >> - stmmac: it seems that nobody else has reported these kind of issues >> so far, however I'd still like to hear where I should enable some >> debugging bits to rule out any stmmac bug > > > On my side, I just tested on the same "kind" of system: > -SYNOPSYS GMAC 3.7 > -RTL8211EG as PHY > > With I perf, I reach: > -RX: 932 Mbps > -TX: 820Mbps > > Can you check ethtool -S eth0 (most precisely "MMC"counter and errors) ? > Which kernel version do you use ? I am using a 4.8.0-rc4 kernel, based on Kevin's "integration" branch: [0] Unfortunately I don't have access to my device in the next few days, but I'll keep you updated once I have the ethtool output. Thanks for your time Regards, Martin [0] https://git.kernel.org/cgit/linux/kernel/git/khilman/linux-amlogic.git/log/?h=v4.8/integ
Re: [PATCH 00/26] constify local structures
On Mon, 12 Sep 2016, Jarkko Sakkinen wrote: > On Mon, Sep 12, 2016 at 04:43:58PM +0300, Felipe Balbi wrote: > > > > Hi, > > > > Jarkko Sakkinen writes: > > > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote: > > >> > > >> > > >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote: > > >> > > >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote: > > >> > > Constify local structures. > > >> > > > > >> > > The semantic patch that makes this change is as follows: > > >> > > (http://coccinelle.lip6.fr/) > > >> > > > >> > Just my two cents but: > > >> > > > >> > 1. You *can* use a static analysis too to find bugs or other issues. > > >> > 2. However, you should manually do the commits and proper commit > > >> >messages to subsystems based on your findings. And I generally think > > >> >that if one contributes code one should also at least smoke test > > >> > changes > > >> >somehow. > > >> > > > >> > I don't know if I'm alone with my opinion. I just think that one should > > >> > also do the analysis part and not blindly create and submit patches. > > >> > > >> All of the patches are compile tested. And the individual patches are > > > > > > Compile-testing is not testing. If you are not able to test a commit, > > > you should explain why. > > > > Dude, Julia has been doing semantic patching for years already and > > nobody has raised any concerns so far. There's already an expectation > > that Coccinelle *works* and Julia's sematic patches are sound. > > > > Besides, adding 'const' is something that causes virtually no functional > > changes to the point that build-testing is really all you need. Any > > problems caused by adding 'const' to a definition will be seen by build > > errors or warnings. > > > > Really, just stop with the pointless discussion and go read a bit about > > Coccinelle and what semantic patches are giving you. The work done by > > Julia and her peers are INRIA have measurable benefits. > > > > You're really making a thunderstorm in a glass of water. > > Hmm... I've been using coccinelle in cyclic basis for some time now. > My comment was oversized but I didn't mean it to be impolite or attack > of any kind for that matter. No problem :) Thanks for the feedback. julia
Re: [PATCH v2 net 1/1] net sched actions: fix GETing actions
On 16-09-12 01:18 PM, Cong Wang wrote: On Mon, Sep 12, 2016 at 3:21 AM, Jamal Hadi Salim wrote: + if (ovr) + a->tcfa_refcnt-=1; How about tcfa_bindcnt? We don't touch it when mucking around only with actions (as is the case here). + } +} + int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est, char *name, int ovr, int bind, struct list_head *actions) @@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla, goto err; } act->order = i; + if (ovr) Need to check this boolean? It looks like we need this for the !ovr case too? They are fine if they didn't exist. I will fix the coding style issues and submit. cheers, jamal
Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Mon, 12 Sep 2016 12:56:28 -0700 Alexei Starovoitov wrote: > On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote: > > On Thu, 8 Sep 2016 23:30:50 -0700 > > Alexei Starovoitov wrote: > > > > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote: > > [...] > > > > Imagine you have packets intermixed towards the stack and XDP_TX. > > > > Every time you call the stack code, then you flush your icache. When > > > > returning to the driver code, you will have to reload all the icache > > > > associated with the XDP_TX, this is a costly operation. > > > > > [...] > > > To make further progress in this discussion can we talk about > > > the use case you have in mind instead? Then solution will > > > be much clear, I hope. > > > > The DDoS use-case _is_ affected by this "hidden" bulking design. > > > > Lets say, I want to implement a DDoS facility. Instead of just > > dropping the malicious packets, I want to see the bad packets. I > > implement this by rewriting the destination-MAC to be my monitor > > machine and then XDP_TX the packet. > > not following the use case. you want to implement a DDoS generator? > Or just forward all bad packets from affected host to another host > in the same rack? so two servers will be spammed with traffic and > even more load on the tor? I really don't see how this is useful > for anything but stress testing. As I wrote below, the purpose of the monitor machine is to diagnose false positives. If you worry about the added load I would either, forward out another interface (which is not supported yet) or simply do sampling of packets being forwarded to the monitor host. > > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100% > > of the traffic is delivered to the stack. (See note 1) > > hmm. DoS prevention use case is when 99% of the traffic is dropped. As I wrote below, until the DDoS attack starts, all packets are delivered to the stack. > > Once the DDoS attack starts, then the traffic pattern changes, and XDP > > should (hopefully only) catch the malicious traffic (monitor machine can > > help diagnose false positive). Now, due to interleaving the DDoS > > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to > > more icache misses... > > > > > > > > Note(1): Notice I have already demonstrated that loading a XDP/eBPF > > program with 100% delivery to the stack, actually slows down the > > normal stack. This is due to hitting a bottleneck in the page > > allocator. I'm working removing that bottleneck with page_pool, and > > that solution is orthogonal to this problem. > > sure. no one arguing against improving page allocator. > > > It is actually an excellent argument, for why you would want to run a > > DDoS XDP filter only on a restricted number of RX queues. > > no. it's the opposite. If the host is under DoS there is no way > the host can tell in advance which rx queue will be seeing bad > packets. Sorry, this note was not related to the DoS use-case. You misunderstood it. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
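The use case described above fits in a few lines of restricted C. The sketch below is illustrative only -- it is not from any posted patch, the monitor MAC is a placeholder, and the classification test is stubbed out -- but it shows why clean and malicious packets end up interleaved on the same RX queue: the same program returns XDP_PASS for most frames and XDP_TX only for the rewritten ones.

        #include <linux/bpf.h>
        #include <linux/if_ether.h>

        /* placeholder address of the monitor host */
        static const unsigned char monitor_mac[ETH_ALEN] = { 0x02, 0x15, 0x15, 0x15, 0x15, 0x15 };

        int xdp_ddos_monitor(struct xdp_md *ctx)
        {
                void *data = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;
                struct ethhdr *eth = data;
                int suspect = 0;        /* placeholder: real DDoS classification goes here */
                int i;

                if (data + sizeof(*eth) > data_end)
                        return XDP_DROP;

                if (!suspect)
                        return XDP_PASS;        /* clean traffic -> normal stack delivery */

                /* redirect the bad packet to the monitor host for inspection */
                for (i = 0; i < ETH_ALEN; i++)
                        eth->h_dest[i] = monitor_mac[i];

                return XDP_TX;
        }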
[PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe To: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesn't get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim --- include/net/tc_act/tc_skbmod.h| 30 include/uapi/linux/tc_act/tc_skbmod.h | 39 + net/sched/Kconfig | 11 ++ net/sched/Makefile| 1 + net/sched/act_skbmod.c| 293 ++ 5 files changed, 374 insertions(+) create mode 100644 include/net/tc_act/tc_skbmod.h create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h create mode 100644 net/sched/act_skbmod.c diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h new file mode 100644 index 000..644a211 --- /dev/null +++ b/include/net/tc_act/tc_skbmod.h @@ -0,0 +1,30 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. +*/ + +#ifndef __NET_TC_SKBMOD_H +#define __NET_TC_SKBMOD_H + +#include +#include + +struct tcf_skbmod_params { + struct rcu_head rcu; + u64 flags; /*up to 64 types of operations; extend if needed */ + u8 eth_dst[ETH_ALEN]; + u16 eth_type; + u8 eth_src[ETH_ALEN]; +}; + +struct tcf_skbmod { + struct tc_actioncommon; + struct tcf_skbmod_params __rcu *skbmod_p; +}; +#define to_skbmod(a) ((struct tcf_skbmod *)a) + +#endif /* __NET_TC_SKBMOD_H */ diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h new file mode 100644 index 000..10fc07d --- /dev/null +++ b/include/uapi/linux/tc_act/tc_skbmod.h @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2016, Jamal Hadi Salim + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+*/ + +#ifndef __LINUX_TC_SKBMOD_H +#define __LINUX_TC_SKBMOD_H + +#include + +#define TCA_ACT_SKBMOD 15 + +#define SKBMOD_F_DMAC 0x1 +#define SKBMOD_F_SMAC 0x2 +#define SKBMOD_F_ETYPE 0x4 +#define SKBMOD_F_SWAPMAC 0x8 + +struct tc_skbmod { + tc_gen; + __u64 flags; +}; + +enum { + TCA_SKBMOD_UNSPEC, + TCA_SKBMOD_TM, + TCA_SKBMOD_PARMS, + TCA_SKBMOD_DMAC, + TCA_SKBMOD_SMAC, + TCA_SKBMOD_ETYPE, + TCA_SKBMOD_PAD, + __TCA_SKBMOD_MAX +}; +#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1) + +#endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 72e3426..7795d5a 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -749,6 +749,17 @@ config NET_ACT_CONNMARK To compile this code as a module, choose M here: the module will be called act_connmark. +config NET_ACT_SKBMOD +tristate "skb data modification action" +depends on NET_CLS_ACT +---help--- + Say Y here to allow modification of skb data + + If unsure, say N. + + To compile this code as a module, choose M here: the + module will be called act_skbmod. + config NET_ACT_IFE tristate "Inter-FE action based on IETF ForCES InterFE LFB" depends on NET_CLS_ACT diff --git a/net/sched/Makefile b/net/sched/Makefile index b9d046b..148ae0d 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)+= act_csum.o obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmar
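The excerpt above is cut off before net/sched/act_skbmod.c itself, so purely as a hedged sketch (not the code from the patch), the per-packet half of such an action essentially has to translate the SKBMOD_F_* flags into writes on the ethernet header, along these lines (assuming eth_type is stored in network byte order, and noting that the real action must first make the header writable):

        static void skbmod_apply(struct sk_buff *skb, const struct tcf_skbmod_params *p)
        {
                struct ethhdr *eth = eth_hdr(skb);
                u8 tmp[ETH_ALEN];

                /* the real code would call skb_ensure_writable() before touching eth */
                if (p->flags & SKBMOD_F_DMAC)
                        ether_addr_copy(eth->h_dest, p->eth_dst);
                if (p->flags & SKBMOD_F_SMAC)
                        ether_addr_copy(eth->h_source, p->eth_src);
                if (p->flags & SKBMOD_F_ETYPE)
                        eth->h_proto = p->eth_type;
                if (p->flags & SKBMOD_F_SWAPMAC) {
                        ether_addr_copy(tmp, eth->h_dest);
                        ether_addr_copy(eth->h_dest, eth->h_source);
                        ether_addr_copy(eth->h_source, tmp);
                }
        }

That is also why the commit message emphasizes the mirred/redirect use case: rewriting or swapping the MACs is all that is needed to keep a mirrored copy from being dropped by the next hop.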
Re: [RFC PATCH 9/9] ethernet: sun8i-emac: add pm_runtime support
Hi, On Fri, Sep 09, 2016 at 02:45:17PM +0200, Corentin Labbe wrote: > This patch add pm_runtime support to sun8i-emac. > For the moment, only basic support is added, (the device is marked as > used when net/open) > > Signed-off-by: Corentin Labbe > --- > drivers/net/ethernet/allwinner/sun8i-emac.c | 62 > - > 1 file changed, 60 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/allwinner/sun8i-emac.c > b/drivers/net/ethernet/allwinner/sun8i-emac.c > index 1c4bc80..cce886e 100644 > --- a/drivers/net/ethernet/allwinner/sun8i-emac.c > +++ b/drivers/net/ethernet/allwinner/sun8i-emac.c > @@ -9,7 +9,6 @@ > * - MAC filtering > * - Jumbo frame > * - features rx-all (NETIF_F_RXALL_BIT) > - * - PM runtime > */ > #include > #include > @@ -27,6 +26,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1301,11 +1301,18 @@ static int sun8i_emac_open(struct net_device *ndev) > int err; > u32 v; > > + err = pm_runtime_get_sync(priv->dev); > + if (err) { > + pm_runtime_put_noidle(priv->dev); > + dev_err(priv->dev, "pm_runtime error: %d\n", err); > + return err; > + } > + > err = request_irq(priv->irq, sun8i_emac_dma_interrupt, 0, > dev_name(priv->dev), ndev); > if (err) { > dev_err(priv->dev, "Cannot request IRQ: %d\n", err); > - return err; > + goto err_runtime; > } > > /* Set interface mode (and configure internal PHY on H3) */ > @@ -1395,6 +1402,8 @@ err_syscon: > sun8i_emac_unset_syscon(ndev); > err_irq: > free_irq(priv->irq, ndev); > +err_runtime: > + pm_runtime_put(priv->dev); > return err; > } > > @@ -1483,6 +1492,8 @@ static int sun8i_emac_stop(struct net_device *ndev) > dma_free_coherent(priv->dev, priv->nbdesc_tx * sizeof(struct dma_desc), > priv->dd_tx, priv->dd_tx_phy); > > + pm_runtime_put(priv->dev); > + > return 0; > } > > @@ -2210,6 +2221,8 @@ static int sun8i_emac_probe(struct platform_device > *pdev) > goto probe_err; > } > > + pm_runtime_enable(priv->dev); > + > return 0; > > probe_err: > @@ -2221,6 +2234,8 @@ static int sun8i_emac_remove(struct platform_device > *pdev) > { > struct net_device *ndev = platform_get_drvdata(pdev); > > + pm_runtime_disable(&pdev->dev); > + > unregister_netdev(ndev); > platform_set_drvdata(pdev, NULL); > free_netdev(ndev); > @@ -2228,6 +2243,47 @@ static int sun8i_emac_remove(struct platform_device > *pdev) > return 0; > } > > +static int __maybe_unused sun8i_emac_suspend(struct platform_device *pdev, > pm_message_t state) > +{ > + struct net_device *ndev = platform_get_drvdata(pdev); > + struct sun8i_emac_priv *priv = netdev_priv(ndev); > + > + napi_disable(&priv->napi); > + > + if (netif_running(ndev)) > + netif_device_detach(ndev); > + > + sun8i_emac_stop_tx(ndev); > + sun8i_emac_stop_rx(ndev); > + > + sun8i_emac_rx_clean(ndev); > + sun8i_emac_tx_clean(ndev); > + > + phy_stop(ndev->phydev); > + > + return 0; > +} > + > +static int __maybe_unused sun8i_emac_resume(struct platform_device *pdev) > +{ > + struct net_device *ndev = platform_get_drvdata(pdev); > + struct sun8i_emac_priv *priv = netdev_priv(ndev); > + > + phy_start(ndev->phydev); > + > + sun8i_emac_start_tx(ndev); > + sun8i_emac_start_rx(ndev); > + > + if (netif_running(ndev)) > + netif_device_attach(ndev); > + > + netif_start_queue(ndev); > + > + napi_enable(&priv->napi); > + > + return 0; > +} The main idea behind the runtime PM hooks is that they bring the device to a working state and shuts it down when it's not needed anymore. 
However, they shouldn't be called when the device is still in use, so all the mangling with NAPI, the phy and so on is irrelevant here, but the clocks, resets, for example, are. > static const struct of_device_id sun8i_emac_of_match_table[] = { > { .compatible = "allwinner,sun8i-a83t-emac", > .data = &emac_variant_a83t }, > @@ -2246,6 +2302,8 @@ static struct platform_driver sun8i_emac_driver = { > .name = "sun8i-emac", > .of_match_table = sun8i_emac_of_match_table, > }, > + .suspend= sun8i_emac_suspend, > + .resume = sun8i_emac_resume, These are not the runtime PM hooks. How did you test that? Maxime -- Maxime Ripard, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com signature.asc Description: PGP signature
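To make the distinction concrete, a hedged sketch of what is being pointed at here -- runtime PM callbacks registered through dev_pm_ops and limited to the shared resources -- might look like the following. The clock field name on the priv struct is an assumption, not taken from the sun8i-emac driver:

        static int __maybe_unused sun8i_emac_runtime_suspend(struct device *dev)
        {
                struct net_device *ndev = dev_get_drvdata(dev);
                struct sun8i_emac_priv *priv = netdev_priv(ndev);

                clk_disable_unprepare(priv->ahb_clk);   /* assumed clock handle */
                return 0;
        }

        static int __maybe_unused sun8i_emac_runtime_resume(struct device *dev)
        {
                struct net_device *ndev = dev_get_drvdata(dev);
                struct sun8i_emac_priv *priv = netdev_priv(ndev);

                return clk_prepare_enable(priv->ahb_clk);       /* assumed clock handle */
        }

        static const struct dev_pm_ops sun8i_emac_pm_ops = {
                SET_RUNTIME_PM_OPS(sun8i_emac_runtime_suspend, sun8i_emac_runtime_resume, NULL)
        };

The platform_driver would then reference these via .driver.pm = &sun8i_emac_pm_ops instead of the legacy .suspend/.resume members used in the patch.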
Re: [RFC V3 PATCH 18/26] net/netpolicy: set tx queues according to policy
On Mon, Sep 12, 2016 at 7:55 AM, wrote: > From: Kan Liang > > When the device tries to transmit a packet, netdev_pick_tx is called to > find the available tx queues. If the net policy is applied, it picks up > the assigned tx queue from net policy subsystem, and redirect the > traffic to the assigned queue. > > Signed-off-by: Kan Liang > --- > include/net/sock.h | 9 + > net/core/dev.c | 20 ++-- > 2 files changed, 27 insertions(+), 2 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index e1e9e3d..ca97f35 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -2280,4 +2280,13 @@ extern int sysctl_optmem_max; > extern __u32 sysctl_wmem_default; > extern __u32 sysctl_rmem_default; > > +/* Return netpolicy instance information from socket. */ > +static inline struct netpolicy_instance *netpolicy_find_instance(struct sock > *sk) > +{ > +#ifdef CONFIG_NETPOLICY > + if (is_net_policy_valid(sk->sk_netpolicy.policy)) > + return &sk->sk_netpolicy; > +#endif > + return NULL; > +} > #endif /* _SOCK_H */ > diff --git a/net/core/dev.c b/net/core/dev.c > index 34b5322..b9a8044 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -3266,6 +3266,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device > *dev, > struct sk_buff *skb, > void *accel_priv) > { > + struct sock *sk = skb->sk; > int queue_index = 0; > > #ifdef CONFIG_XPS > @@ -3280,8 +3281,23 @@ struct netdev_queue *netdev_pick_tx(struct net_device > *dev, > if (ops->ndo_select_queue) > queue_index = ops->ndo_select_queue(dev, skb, > accel_priv, > __netdev_pick_tx); > - else > - queue_index = __netdev_pick_tx(dev, skb); > + else { > +#ifdef CONFIG_NETPOLICY > + struct netpolicy_instance *instance; > + > + queue_index = -1; > + if (dev->netpolicy && sk) { > + instance = netpolicy_find_instance(sk); > + if (instance) { > + if (!instance->dev) > + instance->dev = dev; > + queue_index = > netpolicy_pick_queue(instance, false); > + } > + } > + if (queue_index < 0) > +#endif I doubt this produces the intended effect. Several drivers use ndo_select_queue (such as mlx4) where there might do something special for a few packets but end up called the default handler which __netdev_pick_tx for most packets. So in such cases the netpolicy path would be routinely bypassed. Maybe this code should be in __netdev_pick_tx. Tom > + queue_index = __netdev_pick_tx(dev, skb); > + } > > if (!accel_priv) > queue_index = netdev_cap_txqueue(dev, queue_index); > -- > 2.5.5 >
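Tom's suggestion can be illustrated with a small hypothetical helper (the name netpolicy_tx_queue() is made up for this sketch; netpolicy_find_instance() and netpolicy_pick_queue() are the functions from the RFC itself) that __netdev_pick_tx() would consult first, so the lookup happens even when a driver's ndo_select_queue falls back to the default picker:

        static int netpolicy_tx_queue(struct net_device *dev, struct sk_buff *skb)
        {
                struct netpolicy_instance *instance;
                struct sock *sk = skb->sk;

                if (!dev->netpolicy || !sk)
                        return -1;
                instance = netpolicy_find_instance(sk);
                if (!instance)
                        return -1;
                if (!instance->dev)
                        instance->dev = dev;
                return netpolicy_pick_queue(instance, false);
        }

        /* at the top of __netdev_pick_tx():
         *
         *      int q = netpolicy_tx_queue(dev, skb);
         *
         *      if (q >= 0)
         *              return q;
         */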
Re: [PATCH 00/26] constify local structures
On Mon, Sep 12, 2016 at 04:43:58PM +0300, Felipe Balbi wrote: > > Hi, > > Jarkko Sakkinen writes: > > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote: > >> > >> > >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote: > >> > >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote: > >> > > Constify local structures. > >> > > > >> > > The semantic patch that makes this change is as follows: > >> > > (http://coccinelle.lip6.fr/) > >> > > >> > Just my two cents but: > >> > > >> > 1. You *can* use a static analysis too to find bugs or other issues. > >> > 2. However, you should manually do the commits and proper commit > >> >messages to subsystems based on your findings. And I generally think > >> >that if one contributes code one should also at least smoke test > >> > changes > >> >somehow. > >> > > >> > I don't know if I'm alone with my opinion. I just think that one should > >> > also do the analysis part and not blindly create and submit patches. > >> > >> All of the patches are compile tested. And the individual patches are > > > > Compile-testing is not testing. If you are not able to test a commit, > > you should explain why. > > Dude, Julia has been doing semantic patching for years already and > nobody has raised any concerns so far. There's already an expectation > that Coccinelle *works* and Julia's sematic patches are sound. > > Besides, adding 'const' is something that causes virtually no functional > changes to the point that build-testing is really all you need. Any > problems caused by adding 'const' to a definition will be seen by build > errors or warnings. > > Really, just stop with the pointless discussion and go read a bit about > Coccinelle and what semantic patches are giving you. The work done by > Julia and her peers are INRIA have measurable benefits. > > You're really making a thunderstorm in a glass of water. Hmm... I've been using coccinelle in cyclic basis for some time now. My comment was oversized but I didn't mean it to be impolite or attack of any kind for that matter. > -- > balbi /Jarkko
[PATCH net-next 2/2] net: deprecate eth_change_mtu, remove usage
With centralized MTU checking, there's nothing productive done by eth_change_mtu that isn't already done in dev_set_mtu, so mark it as deprecated and remove all usage of it in the kernel. All callers have been audited for calls to alloc_etherdev* or ether_setup directly, which means they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu prints out a netdev_warn about being deprecated, for the benefit of out-of-tree drivers that might be utilizing it. Of note, dvb_net.c actually had dev->mtu = 4096, while using eth_change_mtu, meaning that if you ever tried changing it's mtu, you couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set to 4096 to remedy that. CC: "David S. Miller" CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson --- arch/m68k/emu/nfeth.c | 1 - drivers/isdn/hysdn/hysdn_net.c| 1 - drivers/media/dvb-core/dvb_net.c | 2 +- drivers/net/appletalk/ipddp.c | 1 - drivers/net/cris/eth_v10.c| 1 - drivers/net/ethernet/3com/3c509.c | 1 - drivers/net/ethernet/3com/3c515.c | 1 - drivers/net/ethernet/3com/3c574_cs.c | 1 - drivers/net/ethernet/3com/3c589_cs.c | 1 - drivers/net/ethernet/3com/3c59x.c | 2 -- drivers/net/ethernet/3com/typhoon.c | 1 - drivers/net/ethernet/8390/8390.c | 1 - drivers/net/ethernet/8390/8390p.c | 1 - drivers/net/ethernet/8390/ax88796.c | 1 - drivers/net/ethernet/8390/axnet_cs.c | 1 - drivers/net/ethernet/8390/etherh.c| 1 - drivers/net/ethernet/8390/hydra.c | 1 - drivers/net/ethernet/8390/mac8390.c | 1 - drivers/net/ethernet/8390/mcf8390.c | 1 - drivers/net/ethernet/8390/ne2k-pci.c | 1 - drivers/net/ethernet/8390/pcnet_cs.c | 1 - drivers/net/ethernet/8390/smc-ultra.c | 1 - drivers/net/ethernet/8390/wd.c| 1 - drivers/net/ethernet/8390/zorro8390.c | 1 - drivers/net/ethernet/adaptec/starfire.c | 1 - drivers/net/ethernet/adi/bfin_mac.c | 1 - drivers/net/ethernet/allwinner/sun4i-emac.c | 1 - drivers/net/ethernet/amd/a2065.c | 1 - drivers/net/ethernet/amd/am79c961a.c | 1 - drivers/net/ethernet/amd/ariadne.c| 1 - drivers/net/ethernet/amd/atarilance.c | 1 - drivers/net/ethernet/amd/au1000_eth.c | 1 - drivers/net/ethernet/amd/declance.c | 1 - drivers/net/ethernet/amd/hplance.c| 1 - drivers/net/ethernet/amd/lance.c | 1 - drivers/net/ethernet/amd/mvme147.c| 1 - drivers/net/ethernet/amd/ni65.c | 1 - drivers/net/ethernet/amd/nmclan_cs.c | 1 - drivers/net/ethernet/amd/pcnet32.c| 1 - drivers/net/ethernet/amd/sun3lance.c | 1 - drivers/net/ethernet/amd/sunlance.c | 1 - drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 1 - drivers/net/ethernet/apple/bmac.c | 1 - drivers/net/ethernet/apple/mace.c | 1 - drivers/net/ethernet/apple/macmace.c | 1 - drivers/net/ethernet/aurora/nb8800.c | 1 - drivers/net/ethernet/cadence/macb.c | 1 - drivers/net/ethernet/cirrus/cs89x0.c | 1 - drivers/net/ethernet/cirrus/ep93xx_eth.c | 1 - drivers/net/ethernet/cirrus/mac89x0.c | 1 - drivers/net/ethernet/davicom/dm9000.c | 1 - drivers/net/ethernet/dec/tulip/de2104x.c | 1 - drivers/net/ethernet/dec/tulip/de4x5.c| 1 - drivers/net/ethernet/dec/tulip/dmfe.c | 1 - drivers/net/ethernet/dec/tulip/tulip_core.c | 1 - drivers/net/ethernet/dec/tulip/uli526x.c | 1 - drivers/net/ethernet/dec/tulip/winbond-840.c | 1 - drivers/net/ethernet/dec/tulip/xircom_cb.c| 1 - drivers/net/ethernet/dnet.c | 1 - drivers/net/ethernet/ec_bhf.c | 1 - drivers/net/ethernet/fealnx.c | 1 - drivers/net/ethernet/freescale/fec_main.c | 1 - drivers/net/ethernet/freescale/fec_mpc52xx.c | 1 - drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c | 1 - drivers/net/ethernet/freescale/ucc_geth.c | 1 - 
drivers/net/ethernet/fujitsu/fmvj18x_cs.c | 1 - drivers/net/ethernet/hisilicon/hip04_eth.c| 1 - drivers/net/ethernet/hisilicon/hisi_femac.c | 1 - drivers/net/ethernet/hp/hp100.c | 2 --
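The diff above is truncated before net/ethernet/eth.c, so purely as a reconstruction from the commit message (not a quote of the actual hunk), the deprecated helper would end up reduced to something like:

        int eth_change_mtu(struct net_device *dev, int new_mtu)
        {
                netdev_warn(dev, "%s is deprecated\n", __func__);
                dev->mtu = new_mtu;
                return 0;
        }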
[PATCH net-next 0/2] net: centralize net_device MTU bounds checking
While looking into an MTU issue with sfc, I started noticing that almost every NIC driver with an ndo_change_mtu function implemented almost exactly the same range checks, and in many cases, that was the only practical thing their ndo_change_mtu function was doing. Quite a few drivers have either 68, 64, 60 or 46 as their minimum MTU value checked, and then various sizes from 1500 to 65535 for their maximum MTU value. We can remove a whole lot of redundant code here if we simple store min_mtu and max_mtu in net_device, and check against those in net/core/dev.c's dev_set_mtu(). This pair of patches looks to introduce centralized MTU range checking infrastructure, while maintaining compatibility with all existing drivers, and start to make use of it, converting all eth_change_mtu/ether_setup users over to this new infra. Assuming these pass review muster, I've got a ton of follow-on patches to clean up MTU settings for everything in the kernel with an ndo_change_mtu. Jarod Wilson (2): net: centralize net_device min/max MTU checking net: deprecate eth_change_mtu, remove usage arch/m68k/emu/nfeth.c | 1 - drivers/isdn/hysdn/hysdn_net.c| 1 - drivers/media/dvb-core/dvb_net.c | 2 +- drivers/net/appletalk/ipddp.c | 1 - drivers/net/cris/eth_v10.c| 1 - drivers/net/ethernet/3com/3c509.c | 1 - drivers/net/ethernet/3com/3c515.c | 1 - drivers/net/ethernet/3com/3c574_cs.c | 1 - drivers/net/ethernet/3com/3c589_cs.c | 1 - drivers/net/ethernet/3com/3c59x.c | 2 -- drivers/net/ethernet/3com/typhoon.c | 1 - drivers/net/ethernet/8390/8390.c | 1 - drivers/net/ethernet/8390/8390p.c | 1 - drivers/net/ethernet/8390/ax88796.c | 1 - drivers/net/ethernet/8390/axnet_cs.c | 1 - drivers/net/ethernet/8390/etherh.c| 1 - drivers/net/ethernet/8390/hydra.c | 1 - drivers/net/ethernet/8390/mac8390.c | 1 - drivers/net/ethernet/8390/mcf8390.c | 1 - drivers/net/ethernet/8390/ne2k-pci.c | 1 - drivers/net/ethernet/8390/pcnet_cs.c | 1 - drivers/net/ethernet/8390/smc-ultra.c | 1 - drivers/net/ethernet/8390/wd.c| 1 - drivers/net/ethernet/8390/zorro8390.c | 1 - drivers/net/ethernet/adaptec/starfire.c | 1 - drivers/net/ethernet/adi/bfin_mac.c | 1 - drivers/net/ethernet/allwinner/sun4i-emac.c | 1 - drivers/net/ethernet/amd/a2065.c | 1 - drivers/net/ethernet/amd/am79c961a.c | 1 - drivers/net/ethernet/amd/ariadne.c| 1 - drivers/net/ethernet/amd/atarilance.c | 1 - drivers/net/ethernet/amd/au1000_eth.c | 1 - drivers/net/ethernet/amd/declance.c | 1 - drivers/net/ethernet/amd/hplance.c| 1 - drivers/net/ethernet/amd/lance.c | 1 - drivers/net/ethernet/amd/mvme147.c| 1 - drivers/net/ethernet/amd/ni65.c | 1 - drivers/net/ethernet/amd/nmclan_cs.c | 1 - drivers/net/ethernet/amd/pcnet32.c| 1 - drivers/net/ethernet/amd/sun3lance.c | 1 - drivers/net/ethernet/amd/sunlance.c | 1 - drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 1 - drivers/net/ethernet/apple/bmac.c | 1 - drivers/net/ethernet/apple/mace.c | 1 - drivers/net/ethernet/apple/macmace.c | 1 - drivers/net/ethernet/aurora/nb8800.c | 1 - drivers/net/ethernet/cadence/macb.c | 1 - drivers/net/ethernet/cirrus/cs89x0.c | 1 - drivers/net/ethernet/cirrus/ep93xx_eth.c | 1 - drivers/net/ethernet/cirrus/mac89x0.c | 1 - drivers/net/ethernet/davicom/dm9000.c | 1 - drivers/net/ethernet/dec/tulip/de2104x.c | 1 - drivers/net/ethernet/dec/tulip/de4x5.c| 1 - drivers/net/ethernet/dec/tulip/dmfe.c | 1 - drivers/net/ethernet/dec/tulip/tulip_core.c | 1 - drivers/net/ethernet/dec/tulip/uli526x.c | 1 - drivers/net/ethernet/dec/tulip/winbond-840.c | 1 - drivers/net/ethernet/dec/tulip/xircom_cb.c| 1 - 
drivers/net/ethernet/dnet.c | 1 - drivers/net/ethernet/ec_bhf.c | 1 - drivers/net/ethernet/fealnx.c | 1 - drivers/net/ethernet/freescale/fec_main.c | 1 - drivers/net/ethernet/freescale/fec_mpc52xx.c
[PATCH net-next 1/2] net: centralize net_device min/max MTU checking
While looking into an MTU issue with sfc, I started noticing that almost every NIC driver with an ndo_change_mtu function implemented almost exactly the same range checks, and in many cases, that was the only practical thing their ndo_change_mtu function was doing. Quite a few drivers have either 68, 64, 60 or 46 as their minimum MTU value checked, and then various sizes from 1500 to 65535 for their maximum MTU value. We can remove a whole lot of redundant code here if we simply store min_mtu and max_mtu in net_device, and check against those in net/core/dev.c's dev_set_mtu(). In theory, there should be zero functional change with this patch, it just puts the infrastructure in place. Subsequent patches will attempt to start using said infrastructure, with theoretically zero change in functionality. CC: "David S. Miller" CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson --- include/linux/netdevice.h | 4 net/core/dev.c | 12 ++-- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 67bb978..e466949 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1490,6 +1490,8 @@ enum netdev_priv_flags { * @if_port: Selectable AUI, TP, ... * @dma: DMA channel * @mtu: Interface MTU value + * @min_mtu: Interface Minimum MTU value + * @max_mtu: Interface Maximum MTU value * @type: Interface hardware type * @hard_header_len: Maximum hardware header length. * @@ -1710,6 +1712,8 @@ struct net_device { unsigned char dma; unsigned int mtu; + unsigned int min_mtu; + unsigned int max_mtu; unsigned short type; unsigned short hard_header_len; diff --git a/net/core/dev.c b/net/core/dev.c index 34b5322..9785588 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6473,9 +6473,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu) if (new_mtu == dev->mtu) return 0; - /* MTU must be positive.*/ - if (new_mtu < 0) + if (new_mtu < dev->min_mtu) { + net_err_ratelimited("%s: Invalid MTU %d requested, hw min %d\n", + dev->name, new_mtu, dev->min_mtu); return -EINVAL; + } + + if (dev->max_mtu > 0 && new_mtu > dev->max_mtu) { + net_err_ratelimited("%s: Invalid MTU %d requested, hw max %d\n", + dev->name, new_mtu, dev->max_mtu); + return -EINVAL; + } if (!netif_device_present(dev)) return -ENODEV; -- 2.10.0
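As an illustration of how a driver would consume the new fields once this lands (example values only, not taken from any posted conversion), the range check moves out of ndo_change_mtu and into two assignments at probe time:

        static void example_set_mtu_bounds(struct net_device *dev)
        {
                dev->min_mtu = 68;      /* historic minimum most drivers checked for */
                dev->max_mtu = 9000;    /* whatever the hardware actually supports   */
        }

After that, dev_set_mtu() rejects out-of-range values centrally and the driver's own ndo_change_mtu can shrink to just the hardware reconfiguration, or disappear entirely.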
Re: [PATCH net-next 2/2] errqueue: include linux/time.h
On Mon, Sep 12, 2016 at 3:26 PM, kbuild test robot wrote: > Hi Willem, > > [auto build test ERROR on net-next/master] > > url: > https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/uapi-include-time-h-from-errqueue-h/20160913-020431 > config: i386-defconfig (attached as .config) > compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 > reproduce: > # save the attached .config to linux build tree > make ARCH=i386 > > All errors (new ones prefixed by >>): This error report shows that breakage can occur with applications that include linux/errqueue.h before the libc time headers. The libc-compat definitions in this patch set only fix compilation when uapi headers are included after the userspace headers. These errors indeed go away when errqueue.h is included after the userspace time includes, as in the diff below. For the inverse, the libc headers need additional #if __UAPI_DEF_FOO changes, as described in include/uapli/linux/libc-compat.h. Those changes are a noop without kernel definitions, so arguably that libc patch should be merged before this kernel patch. I will remove this patch set from the patchwork queue for now. diff --git a/Documentation/networking/timestamping/txtimestamp.c b/Documentation/networking/timestamping/txtimestamp.c index 5df0704..f073801 100644 --- a/Documentation/networking/timestamping/txtimestamp.c +++ b/Documentation/networking/timestamping/txtimestamp.c @@ -37,7 +37,6 @@ #include #include #include -#include #include #include #include @@ -59,6 +58,7 @@ #include #include #include +#include #include
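The "#if __UAPI_DEF_FOO changes" referred to above follow the usual libc-compat pattern; a hedged sketch for the timeval case would look like the following, where __UAPI_DEF_TIMEVAL is an illustrative guard name rather than a definition already present in include/uapi/linux/libc-compat.h:

        /* in the uapi header: only emit the kernel definition when libc has not
         * already provided its own (the guard name here is illustrative)
         */
        #if __UAPI_DEF_TIMEVAL
        struct timeval {
                __kernel_time_t         tv_sec;
                __kernel_suseconds_t    tv_usec;
        };
        #endif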
Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote: > On Thu, 8 Sep 2016 23:30:50 -0700 > Alexei Starovoitov wrote: > > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote: > [...] > > > Imagine you have packets intermixed towards the stack and XDP_TX. > > > Every time you call the stack code, then you flush your icache. When > > > returning to the driver code, you will have to reload all the icache > > > associated with the XDP_TX, this is a costly operation. > > > [...] > > To make further progress in this discussion can we talk about > > the use case you have in mind instead? Then solution will > > be much clear, I hope. > > The DDoS use-case _is_ affected by this "hidden" bulking design. > > Lets say, I want to implement a DDoS facility. Instead of just > dropping the malicious packets, I want to see the bad packets. I > implement this by rewriting the destination-MAC to be my monitor > machine and then XDP_TX the packet. not following the use case. you want to implement a DDoS generator? Or just forward all bad packets from affected host to another host in the same rack? so two servers will be spammed with traffic and even more load on the tor? I really don't see how this is useful for anything but stress testing. > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100% > of the traffic is delivered to the stack. (See note 1) hmm. DoS prevention use case is when 99% of the traffic is dropped. > Once the DDoS attack starts, then the traffic pattern changes, and XDP > should (hopefully only) catch the malicious traffic (monitor machine can > help diagnose false positive). Now, due to interleaving the DDoS > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to > more icache misses... > > > > Note(1): Notice I have already demonstrated that loading a XDP/eBPF > program with 100% delivery to the stack, actually slows down the > normal stack. This is due to hitting a bottleneck in the page > allocator. I'm working removing that bottleneck with page_pool, and > that solution is orthogonal to this problem. sure. no one arguing against improving page allocator. > It is actually an excellent argument, for why you would want to run a > DDoS XDP filter only on a restricted number of RX queues. no. it's the opposite. If the host is under DoS there is no way the host can tell in advance which rx queue will be seeing bad packets.
Re: [PATCH net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail
On Sat, Sep 10, 2016 at 12:03:53AM +0800, Xin Long wrote: > > That said, have you considered the retransmit case? That is to say, if you > > queue and flush the outq, and some packets fail delivery, and in the time > > between the intial send and the expiration of the RTX timer (during which > > the > > socket lock will have been released), an event may occur which changes the > > transport state, which will then be ignored with your patch. > Sorry, I'm not sure if I got it. > > You mean "during which changes q->asoc->state", right ? > > This patch removes the check of q->asoc->state in sctp_outq_tail(). > > sctp_outq_tail() is called for data only in: > sctp_primitive_SEND -> sctp_do_sm -> sctp_cmd_send_msg -> > sctp_cmd_interpreter -> sctp_cmd_send_msg() -> sctp_outq_tail() > > before calling sctp_primitive_SEND, hold sock lock first. > then sctp_primitive_SEND choose FUNC according: > > #define TYPE_SCTP_PRIMITIVE_SEND { > > > if asoc->state is unavailable, FUNC can't be sctp_cmd_send_msg, > but sctp_sf_error_closed/sctp_sf_error_shutdown, sctp_outq_tail > can't be called, either. > I mean sctp_primitive_SEND do the same check for asoc->state > already actually. > > so the code in sctp_outq_tail is redundant actually. I also don't see an issue with this patch, btw. Xin, you may want to add more/such details to the changelog, specially about the timer versus primitive handling. Marcelo
Re: [PATCH net 3/6] sctp: free msg->chunks when sctp_primitive_SEND return err
On Thu, Sep 08, 2016 at 05:31:47PM +0800, Xin Long wrote: > Last patch "sctp: do not return the transmit err back to sctp_sendmsg" > made sctp_primitive_SEND return err only when asoc state is unavailable. > In this case, chunks are not enqueued, they have no chance to be freed if > we don't take care of them later. > > This Patch is actually to revert commit 1cd4d5c4326a ("sctp: remove the > unused sctp_datamsg_free()") and commit 8b570dc9f7b6 ("sctp: only drop the > reference on the datamsg after sending a msg"), to use sctp_datamsg_free to > free the chunks of current msg. > Considering the subsequent patches in this series are improving error return and the first is a cleanup/optimization, Xin, this patch and the previous one both deserve a Fixes tag too, for 8b570dc9f7b6. Thanks > Signed-off-by: Xin Long > --- > include/net/sctp/structs.h | 1 + > net/sctp/chunk.c | 13 + > net/sctp/socket.c | 6 -- > 3 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h > index ce93c4b..f61fb7c 100644 > --- a/include/net/sctp/structs.h > +++ b/include/net/sctp/structs.h > @@ -537,6 +537,7 @@ struct sctp_datamsg { > struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *, > struct sctp_sndrcvinfo *, > struct iov_iter *); > +void sctp_datamsg_free(struct sctp_datamsg *); > void sctp_datamsg_put(struct sctp_datamsg *); > void sctp_chunk_fail(struct sctp_chunk *, int error); > int sctp_chunk_abandoned(struct sctp_chunk *); > diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c > index a55e547..af9cc80 100644 > --- a/net/sctp/chunk.c > +++ b/net/sctp/chunk.c > @@ -70,6 +70,19 @@ static struct sctp_datamsg *sctp_datamsg_new(gfp_t gfp) > return msg; > } > > +void sctp_datamsg_free(struct sctp_datamsg *msg) > +{ > + struct sctp_chunk *chunk; > + > + /* This doesn't have to be a _safe vairant because > + * sctp_chunk_free() only drops the refs. > + */ > + list_for_each_entry(chunk, &msg->chunks, frag_list) > + sctp_chunk_free(chunk); > + > + sctp_datamsg_put(msg); > +} > + > /* Final destructruction of datamsg memory. */ > static void sctp_datamsg_destroy(struct sctp_datamsg *msg) > { > diff --git a/net/sctp/socket.c b/net/sctp/socket.c > index 9fc417a..9d8e2be 100644 > --- a/net/sctp/socket.c > +++ b/net/sctp/socket.c > @@ -1970,13 +1970,15 @@ static int sctp_sendmsg(struct sock *sk, struct > msghdr *msg, size_t msg_len) >* breaks. >*/ > err = sctp_primitive_SEND(net, asoc, datamsg); > - sctp_datamsg_put(datamsg); > /* Did the lower layer accept the chunk? */ > - if (err) > + if (err) { > + sctp_datamsg_free(datamsg); > goto out_free; > + } > > pr_debug("%s: we sent primitively\n", __func__); > > + sctp_datamsg_put(datamsg); > err = msg_len; > > if (unlikely(wait_connect)) { > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: [ovs-dev] [PATCH 2/2] openvswitch: use percpu flow stats
On Fri, Sep 9, 2016 at 1:41 PM, Thadeu Lima de Souza Cascardo wrote: > Instead of using flow stats per NUMA node, use it per CPU. When using > megaflows, the stats lock can be a bottleneck in scalability. > > On a E5-2690 12-core system, usual throughput went from ~4Mpps to > ~15Mpps when forwarding between two 40GbE ports with a single flow > configured on the datapath. > > This has been tested on a system with possible CPUs 0-7,16-23. After > module removal, there were no corruption on the slab cache. > > Signed-off-by: Thadeu Lima de Souza Cascardo > --- > net/openvswitch/flow.c | 43 +++ > net/openvswitch/flow.h | 4 ++-- > net/openvswitch/flow_table.c | 23 --- > 3 files changed, 37 insertions(+), 33 deletions(-) > > diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c > index 3609f37..2970a9f 100644 > --- a/net/openvswitch/flow.c > +++ b/net/openvswitch/flow.c > @@ -29,6 +29,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 > tcp_flags, > { > struct flow_stats *stats; > int node = numa_node_id(); > + int cpu = get_cpu(); > int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0); > This function is always called from BH context. So calling smp_processor_id() for cpu id is fine. There is no need to handle pre-emption here. > - stats = rcu_dereference(flow->stats[node]); > + stats = rcu_dereference(flow->stats[cpu]); > ... > diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c > index 957a3c3..60e5ae0 100644 > --- a/net/openvswitch/flow_table.c > +++ b/net/openvswitch/flow_table.c ... > @@ -102,9 +103,9 @@ struct sw_flow *ovs_flow_alloc(void) > > RCU_INIT_POINTER(flow->stats[0], stats); > > - for_each_node(node) > - if (node != 0) > - RCU_INIT_POINTER(flow->stats[node], NULL); > + for_each_possible_cpu(cpu) > + if (cpu != 0) > + RCU_INIT_POINTER(flow->stats[cpu], NULL); > I think at this point we should just use GFP_ZERO flag for allocating struct flow.
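The review comment about get_cpu() can be illustrated with a purely schematic helper (not proposed code) that mirrors the hunk above with the suggestion applied; it assumes the same struct sw_flow / flow_stats layout as the patch. Because the caller is always in BH context, preemption is already disabled and the raw CPU id cannot change underneath us, so no get_cpu()/put_cpu() pinning is needed:

        static void flow_stats_touch(struct sw_flow *flow, unsigned int len)
        {
                struct flow_stats *stats;
                int cpu = smp_processor_id();   /* OK: always called from BH context */

                stats = rcu_dereference(flow->stats[cpu]);
                if (likely(stats)) {
                        spin_lock(&stats->lock);
                        stats->packet_count++;
                        stats->byte_count += len;
                        spin_unlock(&stats->lock);
                }
        }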
Re: [PATCH net-next 2/2] errqueue: include linux/time.h
Hi Willem, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/uapi-include-time-h-from-errqueue-h/20160913-020431 config: i386-defconfig (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): In file included from ./usr/include/linux/errqueue.h:4:0, from Documentation/networking/timestamping/timestamping.c:46: >> ./usr/include/linux/time.h:26:8: error: redefinition of 'struct timezone' struct timezone { ^~~~ In file included from Documentation/networking/timestamping/timestamping.c:37:0: /usr/include/x86_64-linux-gnu/sys/time.h:55:8: note: originally defined here struct timezone ^~~~ -- In file included from ./usr/include/linux/errqueue.h:4:0, from Documentation/networking/timestamping/txtimestamp.c:40: >> ./usr/include/linux/time.h:19:8: error: redefinition of 'struct timeval' struct timeval { ^~~ In file included from /usr/include/x86_64-linux-gnu/sys/select.h:45:0, from /usr/include/x86_64-linux-gnu/sys/types.h:219, from /usr/include/x86_64-linux-gnu/sys/uio.h:23, from /usr/include/x86_64-linux-gnu/sys/socket.h:26, from /usr/include/netinet/in.h:23, from /usr/include/arpa/inet.h:22, from Documentation/networking/timestamping/txtimestamp.c:35: /usr/include/x86_64-linux-gnu/bits/time.h:30:8: note: originally defined here struct timeval ^~~ In file included from Documentation/networking/timestamping/txtimestamp.c:59:0: >> /usr/include/x86_64-linux-gnu/sys/time.h:55:8: error: redefinition of >> 'struct timezone' struct timezone ^~~~ In file included from ./usr/include/linux/errqueue.h:4:0, from Documentation/networking/timestamping/txtimestamp.c:40: ./usr/include/linux/time.h:26:8: note: originally defined here struct timezone { ^~~~ >> ./usr/include/linux/time.h:38:22: error: expected identifier before numeric >> constant #define ITIMER_REAL 0 ^ In file included from Documentation/networking/timestamping/txtimestamp.c:59:0: >> /usr/include/x86_64-linux-gnu/sys/time.h:107:8: error: redefinition of >> 'struct itimerval' struct itimerval ^ In file included from ./usr/include/linux/errqueue.h:4:0, from Documentation/networking/timestamping/txtimestamp.c:40: ./usr/include/linux/time.h:51:8: note: originally defined here struct itimerval { ^ In file included from Documentation/networking/timestamping/txtimestamp.c:61:0: >> /usr/include/time.h:161:8: error: redefinition of 'struct itimerspec' struct itimerspec ^~ In file included from ./usr/include/linux/errqueue.h:4:0, from Documentation/networking/timestamping/txtimestamp.c:40: ./usr/include/linux/time.h:44:8: note: originally defined here struct itimerspec { ^~ --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH RFC] e1000: Send XDP_TX packets using dev_queue_xmit
On Mon, Sep 12, 2016 at 09:21:10AM -0700, Tom Herbert wrote: > In order to use XDP with non multi-queue drivers (such as e1000) we need > a method to handle XDP_TX return code from xdp program. Since there is > only one queue the XDP transmit path needs to share that queue with the > normal stack path, and in order to do that we can't break the existing > stack's mechanisms of locking, qdiscs, and BQL for the queue. > > This patch allocates an skbuff when XDP_TX is returned from the XDP > program and call dev_queue_xmit to transmit the skbuff. > > Note that this is expected to be a performance hit (which needs to be > quantified) relative to a MQ XDP implementation where we can reserve > queues to be used by XDP. The first intent of these patches is > correctness. This should not affect performance in the XDP_DROP path. > > This patch needs to be applied after "e1000: add initial XDP support". > > Tested: I've only built this, need to see if I can find some e1000 > machines. John, if you could try this that would be much appreciated. > > Signed-off-by: Tom Herbert > --- > drivers/net/ethernet/intel/e1000/e1000_main.c | 29 > --- > 1 file changed, 22 insertions(+), 7 deletions(-) > > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c > b/drivers/net/ethernet/intel/e1000/e1000_main.c > index 91d5c87..f457ea8 100644 > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c > @@ -4338,13 +4338,28 @@ static bool e1000_clean_jumbo_rx_irq(struct > e1000_adapter *adapter, > case XDP_PASS: > break; > case XDP_TX: > - dma_sync_single_for_device(&pdev->dev, > -dma, > -length, > -DMA_TO_DEVICE); > - e1000_xmit_raw_frame(buffer_info, length, > - netdev, adapter); > - buffer_info->rxbuf.page = NULL; > + /* There is only one transmit queue so we need > + * to inject the packet at the qdisc layer > + * to maintain sanity in the stack. We allocate > + * an skbuff, set up frags with the page, and > + * then do dev_queue_xmit. > + */ > + skb = alloc_skb(0, GFP_ATOMIC); > + if (skb) { > + dma_unmap_page(&pdev->dev, > +buffer_info->dma, > +adapter->rx_buffer_len, > +DMA_FROM_DEVICE); > + skb->dev = netdev; > + skb_fill_page_desc(skb, 0, > + buffer_info->rxbuf.page, > + 0, length); > + dev_queue_xmit(skb); I don't like this approach, since it does exactly what it meant to: It uses qdisc and everything else in xmit path of the stack. It's a serious issue, since for some drivers XDP_TX would mean raw xmit whereas for e1k it goes via stack. Sooner or later program authors will start relying on that and that's not acceptable from XDP design point of view.
Re: icmpv6: issue with routing table entries from link local addresses
On 9/12/16 11:26 AM, Hannes Frederic Sowa wrote: > Hello, > > On 12.09.2016 16:27, Andreas Hübner wrote: >> Hi, >> >> I'm currently debugging a potential issue with the icmpv6 stack and >> hopefully this is the correct place to ask. (Was actually looking for a >> more specific list, but didn't find anything. Please point me to a more >> apropriate list if this is out of place here.) >> >> I have the following setup: >> - 2 directly connected hosts (A+B), both have only link local addresses >> configured (interface on both hosts is eth0) >> - host B is also connected to another host C (via interface eth1) >> - main routing table (relevant part) on host B looks like this: >> >> fe80::/64 dev eth1 proto kernel metric 256 >> fe80::/64 dev eth0 proto kernel metric 256 >> >> - host A is trying to ICMPv6 ping the link local address of host B >> >> The issue I currently have is, that the echo reply that host B should >> generate is never sent back to host A. If I change the order of the >> routing table entries on host B, everything works fine. >> (host A is connected on eth0) >> >> I'm wondering, if this is how it is supposed to work. Do we need to do a >> routing table lookup when generating an ICMPv6 echo reply for link local >> addresses? (From my understanding, this is not done in the neighbour >> discovery stack, so why here?) > > For global addresses this is necessary as asymetric routing could be > involved and we don't want to treat ping echos in any way special. > >> Actually, I'm convinced I must be doing something wrong here. The setup >> for the issue is quite trivial, someone would have tripped over it >> already. The only condition is that one host has multiple interfaces >> with ipv6 enabled. >> >> Any help in shedding some light onto this issue would be appreciated. > > This shouldn't be the case. We certainly carry over the ifindex of the > received packet into the routing lookup of the outgoing packet, thus the > appropriate rule, with outgoing ifindex should be selected. > > I also couldn't reproduce your problem here with my system. Can you > verify with tcpdump that the packet is leaving on another interface? v4.4 and on there are fib6 tracepoints that show the lookup result. May provide some insights. perf record -a -e fib6:* perf script
Re: [RFC 02/11] Add RoCE driver framework
>> +uint debug; >> +module_param(debug, uint, 0); >> +MODULE_PARM_DESC(debug, "Default debug msglevel"); >Why are you adding this as a module parameter? I believe this is mostly to follow same line as qede which also defines 'debug' module parameter for allowing easy user control of debug prints [& specifically for probe prints, which can't be controlled otherwise].
Re: [RFC PATCH v3 3/7] proc: Reduce cache miss in snmp6_seq_show
On Fri, Sep 09, 2016 at 02:33:58PM +0800, Jia He wrote: > This is to use the generic interface snmp_get_cpu_field{,64}_batch to > aggregate the data by going through all the items of each cpu sequentially. > > Signed-off-by: Jia He > --- > net/ipv6/proc.c | 32 +++- > 1 file changed, 23 insertions(+), 9 deletions(-) > > diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c > index 679253d0..50ba2c3 100644 > --- a/net/ipv6/proc.c > +++ b/net/ipv6/proc.c > @@ -30,6 +30,11 @@ > #include > #include > > +#define MAX4(a, b, c, d) \ > + max_t(u32, max_t(u32, a, b), max_t(u32, c, d)) > +#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \ > + IPSTATS_MIB_MAX, ICMP_MIB_MAX) > + > static int sockstat6_seq_show(struct seq_file *seq, void *v) > { > struct net *net = seq->private; > @@ -192,13 +197,19 @@ static void snmp6_seq_show_item(struct seq_file *seq, > void __percpu *pcpumib, > const struct snmp_mib *itemlist) > { > int i; > - unsigned long val; > - > - for (i = 0; itemlist[i].name; i++) { > - val = pcpumib ? > - snmp_fold_field(pcpumib, itemlist[i].entry) : > - atomic_long_read(smib + itemlist[i].entry); > - seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val); > + unsigned long buff[SNMP_MIB_MAX]; > + > + memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX); This memset() could be moved... > + > + if (pcpumib) { ... here, so it's not executed if it hits the else block. > + snmp_get_cpu_field_batch(buff, itemlist, pcpumib); > + for (i = 0; itemlist[i].name; i++) > + seq_printf(seq, "%-32s\t%lu\n", > +itemlist[i].name, buff[i]); > + } else { > + for (i = 0; itemlist[i].name; i++) > + seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, > +atomic_long_read(smib + itemlist[i].entry)); > } > } > > @@ -206,10 +217,13 @@ static void snmp6_seq_show_item64(struct seq_file *seq, > void __percpu *mib, > const struct snmp_mib *itemlist, size_t > syncpoff) > { > int i; > + u64 buff64[SNMP_MIB_MAX]; > + > + memset(buff64, 0, sizeof(unsigned long) * SNMP_MIB_MAX); > > + snmp_get_cpu_field64_batch(buff64, itemlist, mib, syncpoff); > for (i = 0; itemlist[i].name; i++) > - seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, > -snmp_fold_field64(mib, itemlist[i].entry, syncpoff)); > + seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff64[i]); > } > > static int snmp6_seq_show(struct seq_file *seq, void *v) > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
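Applied to the hunk above, the suggestion amounts to the following shape (parameter list abbreviated from the quoted context, so treat the exact signature as an assumption): the scratch buffer is only zeroed on the path that actually uses it.

        static void snmp6_seq_show_item(struct seq_file *seq, void __percpu *pcpumib,
                                        atomic_long_t *smib,
                                        const struct snmp_mib *itemlist)
        {
                unsigned long buff[SNMP_MIB_MAX];
                int i;

                if (pcpumib) {
                        memset(buff, 0, sizeof(buff));  /* only needed on this branch */
                        snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
                        for (i = 0; itemlist[i].name; i++)
                                seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, buff[i]);
                } else {
                        for (i = 0; itemlist[i].name; i++)
                                seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
                                           atomic_long_read(smib + itemlist[i].entry));
                }
        }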
Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show
On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote: > This is to use the generic interface snmp_get_cpu_field{,64}_batch to > aggregate the data by going through all the items of each cpu sequentially. > Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build > warning "the frame size" larger than 1024 on s390. Yeah about that, did you test it with stack overflow detection? These arrays can be quite large. One more below.. > > Signed-off-by: Jia He > --- > net/ipv4/proc.c | 106 > +++- > 1 file changed, 74 insertions(+), 32 deletions(-) > > diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c > index 9f665b6..c6fc80e 100644 > --- a/net/ipv4/proc.c > +++ b/net/ipv4/proc.c > @@ -46,6 +46,8 @@ > #include > #include > > +#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX) > + > /* > * Report socket allocation statistics [m...@utu.fi] > */ > @@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq) > /* > * Called from the PROCfs module. This outputs /proc/net/snmp. > */ > -static int snmp_seq_show(struct seq_file *seq, void *v) > +static int snmp_seq_show_ipstats(struct seq_file *seq, void *v) > { > int i; > + u64 buff64[IPSTATS_MIB_MAX]; > struct net *net = seq->private; > > - seq_puts(seq, "Ip: Forwarding DefaultTTL"); > + memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64)); > > + seq_puts(seq, "Ip: Forwarding DefaultTTL"); > for (i = 0; snmp4_ipstats_list[i].name != NULL; i++) > seq_printf(seq, " %s", snmp4_ipstats_list[i].name); > > @@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v) > net->ipv4.sysctl_ip_default_ttl); > > BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0); > + snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list, > +net->mib.ip_statistics, > +offsetof(struct ipstats_mib, syncp)); > for (i = 0; snmp4_ipstats_list[i].name != NULL; i++) > - seq_printf(seq, " %llu", > -snmp_fold_field64(net->mib.ip_statistics, > - snmp4_ipstats_list[i].entry, > - offsetof(struct ipstats_mib, > syncp))); > + seq_printf(seq, " %llu", buff64[i]); > > - icmp_put(seq); /* RFC 2011 compatibility */ > - icmpmsg_put(seq); > + return 0; > +} > + > +static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v) > +{ > + int i; > + unsigned long buff[TCPUDP_MIB_MAX]; > + struct net *net = seq->private; > + > + memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long)); > > seq_puts(seq, "\nTcp:"); > for (i = 0; snmp4_tcp_list[i].name != NULL; i++) > seq_printf(seq, " %s", snmp4_tcp_list[i].name); > > seq_puts(seq, "\nTcp:"); > + snmp_get_cpu_field_batch(buff, snmp4_tcp_list, > + net->mib.tcp_statistics); > for (i = 0; snmp4_tcp_list[i].name != NULL; i++) { > /* MaxConn field is signed, RFC 2012 */ > if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN) > - seq_printf(seq, " %ld", > -snmp_fold_field(net->mib.tcp_statistics, > -snmp4_tcp_list[i].entry)); > + seq_printf(seq, " %ld", buff[i]); > else > - seq_printf(seq, " %lu", > -snmp_fold_field(net->mib.tcp_statistics, > -snmp4_tcp_list[i].entry)); > + seq_printf(seq, " %lu", buff[i]); > } > > + memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long)); > + > + snmp_get_cpu_field_batch(buff, snmp4_udp_list, > + net->mib.udp_statistics); > seq_puts(seq, "\nUdp:"); > for (i = 0; snmp4_udp_list[i].name != NULL; i++) > seq_printf(seq, " %s", snmp4_udp_list[i].name); > - > seq_puts(seq, "\nUdp:"); > for (i = 0; snmp4_udp_list[i].name != NULL; i++) > - seq_printf(seq, " %lu", > -snmp_fold_field(net->mib.udp_statistics, > -snmp4_udp_list[i].entry)); > + seq_printf(seq, " %lu", buff[i]); > + > + memset(buff, 0, 
TCPUDP_MIB_MAX * sizeof(unsigned long)); > > /* the UDP and UDP-Lite MIBs are the same */ > seq_puts(seq, "\nUdpLite:"); > + snmp_get_cpu_field_batch(buff, snmp4_udp_list, > + net->mib.udplite_statistics); > for (i = 0; snmp4_udp_list[i].name != NULL; i++) > seq_printf(seq, " %s", snmp4_udp_list[i].name); > - > seq_puts(seq, "\nUdpLite:"); > for (i = 0; snmp4_udp_list[i].name != NULL; i++) > - seq_printf(seq, " %lu", > -
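The hunk above is cut off, so for orientation here is a guess, clearly not the actual submission, at how the two halves get stitched together. The reason for the split is in the comments: each helper keeps its own scratch array on its own stack frame, so no single frame has to hold both the IPSTATS_MIB_MAX u64 buffer and the TCPUDP_MIB_MAX unsigned long buffer, which is what triggered the s390 frame-size warning.

/* Guessed glue, for illustration only. */
static int snmp_seq_show(struct seq_file *seq, void *v)
{
	snmp_seq_show_ipstats(seq, v);	/* u64 buff64[IPSTATS_MIB_MAX] lives in this frame */

	icmp_put(seq);			/* RFC 2011 compatibility */
	icmpmsg_put(seq);

	snmp_seq_show_tcp_udp(seq, v);	/* unsigned long buff[TCPUDP_MIB_MAX] lives here */

	return 0;
}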
Re: [PATCH 00/26] constify local structures
On Mon, Sep 12, 2016 at 03:52:08PM +0200, Julia Lawall wrote: > > > On Mon, 12 Sep 2016, Felipe Balbi wrote: > > > > > Hi, > > > > Jarkko Sakkinen writes: > > > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote: > > >> > > >> > > >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote: > > >> > > >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote: > > >> > > Constify local structures. > > >> > > > > >> > > The semantic patch that makes this change is as follows: > > >> > > (http://coccinelle.lip6.fr/) > > >> > > > >> > Just my two cents but: > > >> > > > >> > 1. You *can* use a static analysis too to find bugs or other issues. > > >> > 2. However, you should manually do the commits and proper commit > > >> >messages to subsystems based on your findings. And I generally think > > >> >that if one contributes code one should also at least smoke test > > >> > changes > > >> >somehow. > > >> > > > >> > I don't know if I'm alone with my opinion. I just think that one should > > >> > also do the analysis part and not blindly create and submit patches. > > >> > > >> All of the patches are compile tested. And the individual patches are > > > > > > Compile-testing is not testing. If you are not able to test a commit, > > > you should explain why. > > > > Dude, Julia has been doing semantic patching for years already and > > nobody has raised any concerns so far. There's already an expectation > > that Coccinelle *works* and Julia's sematic patches are sound. > > > > Besides, adding 'const' is something that causes virtually no functional > > changes to the point that build-testing is really all you need. Any > > problems caused by adding 'const' to a definition will be seen by build > > errors or warnings. > > > > Really, just stop with the pointless discussion and go read a bit about > > Coccinelle and what semantic patches are giving you. The work done by > > Julia and her peers are INRIA have measurable benefits. > > > > You're really making a thunderstorm in a glass of water. > > Thanks for the defense, but since a lot of these patches torned out to be > wrong, due to an incorrect parse by Coccinelle, combined with an > unpleasantly lax compiler, Jarkko does have a point that I should have > looked at the patches more carefully. In any case, I have written to the > maintainers relevant to the patches that turned out to be incorrect. Exactly. I'm not excepting that every commit would require extensive analysis but it would be good to quickly at least skim through commits and see if they make sense (or ask if unsure) :) And I'm fine with compile testing if it is mentioned in the commit msg. > julia /Jarkko
Re: [PATCHv2 net] sctp: hold the transport before using it in sctp_hash_cmp
On Sat, Sep 10, 2016 at 11:11:23PM +0800, Xin Long wrote: > Since commit 4f0087812648 ("sctp: apply rhashtable api to send/recv > path"), sctp uses transport rhashtable with .obj_cmpfn sctp_hash_cmp, > in which it compares the members of the transport with the rhashtable > args to check if it's the right transport. > > But sctp uses the transport without holding it in sctp_hash_cmp, it can > cause a use-after-free panic. As after it gets transport from hashtable, > another CPU may close the sk and free the asoc. In sctp_association_free, > it frees all the transports, meanwhile, the assoc's refcnt may be reduced > to 0, assoc can be destroyed by sctp_association_destroy. > > So after that, transport->assoc is actually an unavailable memory address > in sctp_hash_cmp. Although sctp_hash_cmp is under rcu_read_lock, it still > can not avoid this, as assoc is not freed by RCU. > > This patch is to hold the transport before checking it's members with > sctp_transport_hold, in which it checks the refcnt first, holds it if > it's not 0. > > Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path") > Signed-off-by: Xin Long Note that we cannot defer the free of the asoc too because that cause issues with port re-use (issue already hit and fixed in the past), as the port would be still in use during the RCU grace period. Acked-by: Marcelo Ricardo Leitner > --- > net/sctp/input.c | 27 +-- > 1 file changed, 17 insertions(+), 10 deletions(-) > > diff --git a/net/sctp/input.c b/net/sctp/input.c > index 69444d3..1555fb8 100644 > --- a/net/sctp/input.c > +++ b/net/sctp/input.c > @@ -796,27 +796,34 @@ struct sctp_hash_cmp_arg { > static inline int sctp_hash_cmp(struct rhashtable_compare_arg *arg, > const void *ptr) > { > + struct sctp_transport *t = (struct sctp_transport *)ptr; > const struct sctp_hash_cmp_arg *x = arg->key; > - const struct sctp_transport *t = ptr; > - struct sctp_association *asoc = t->asoc; > - const struct net *net = x->net; > + struct sctp_association *asoc; > + int err = 1; > > if (!sctp_cmp_addr_exact(&t->ipaddr, x->paddr)) > - return 1; > - if (!net_eq(sock_net(asoc->base.sk), net)) > - return 1; > + return err; > + if (!sctp_transport_hold(t)) > + return err; > + > + asoc = t->asoc; > + if (!net_eq(sock_net(asoc->base.sk), x->net)) > + goto out; > if (x->ep) { > if (x->ep != asoc->ep) > - return 1; > + goto out; > } else { > if (x->laddr->v4.sin_port != htons(asoc->base.bind_addr.port)) > - return 1; > + goto out; > if (!sctp_bind_addr_match(&asoc->base.bind_addr, > x->laddr, sctp_sk(asoc->base.sk))) > - return 1; > + goto out; > } > > - return 0; > + err = 0; > +out: > + sctp_transport_put(t); > + return err; > } > > static inline u32 sctp_hash_obj(const void *data, u32 len, u32 seed) > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
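What "holds it if it's not 0" boils down to is the usual inc-not-zero idiom. A minimal sketch (the real helper lives in net/sctp/; this only shows the shape, assuming the 2016-era atomic_t refcnt in struct sctp_transport): the compare function may only touch t->asoc after this succeeds, and a transport whose last reference is already gone can no longer be picked up from the RCU-protected lookup.

#include <linux/atomic.h>
#include <net/sctp/structs.h>

static bool transport_hold_sketch(struct sctp_transport *t)
{
	/* fails once the refcount has dropped to zero, so an object that is
	 * already on its way to being freed cannot be resurrected here
	 */
	return atomic_inc_not_zero(&t->refcnt);
}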
Re: [PATCH] net: dsa: add FIB support
On 12/09/2016 16:00, Andrew Lunn wrote: > Hi John > >> +if (!ds->drv->ipv4_fib_prepare || !ds->drv->ipv4_fib_add) > > drv recently got renamed to ops, which is what 0-day is complaining > about. > > Andrew > I based this on a net-next from 5-6 days ago, will rebase/resend tomorrow. John
Re: [PATCH] net: inet: diag: Fix an error handling
On 12/09/2016 at 16:35, David Ahern wrote: On 9/12/16 12:02 AM, Christophe JAILLET wrote: diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index abfbe492ebfe..795af25cf84c 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -1134,7 +1134,6 @@ int inet_diag_handler_get_info(struct sk_buff *skb, struct sock *sk) handler = inet_diag_lock_handler(sk->sk_protocol); if (IS_ERR(handler)) { - inet_diag_unlock_handler(handler); nlmsg_cancel(skb, nlh); return PTR_ERR(handler); } That call is needed. inet_diag_unlock_handler does not operate on the input arg but a global mutex. Perhaps a better patch is to change inet_diag_unlock_handler() to a void. This is obvious. Sorry for the noise. CJ
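To make the point concrete, a reduced sketch of the locking pattern being discussed (not the real net/ipv4/inet_diag.c bodies): the lock helper takes a global mutex unconditionally, so even an ERR_PTR() return leaves the mutex held and the caller's error path still needs the unlock call.

#include <linux/err.h>
#include <linux/in.h>
#include <linux/inet_diag.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(diag_table_mutex);
static const struct inet_diag_handler *diag_table[IPPROTO_MAX];

static const struct inet_diag_handler *diag_lock_handler(int proto)
{
	mutex_lock(&diag_table_mutex);		/* taken whether or not a handler exists */
	if (!diag_table[proto])
		return ERR_PTR(-ENOENT);	/* error return, mutex still held */
	return diag_table[proto];
}

static void diag_unlock_handler(const struct inet_diag_handler *handler)
{
	/* the argument is ignored; only the global mutex matters */
	mutex_unlock(&diag_table_mutex);
}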
Re: [net-next PATCH v2 2/2] e1000: bundle xdp xmit routines
On 16-09-12 05:17 AM, Jesper Dangaard Brouer wrote: > On Fri, 09 Sep 2016 14:29:38 -0700 > John Fastabend wrote: > >> e1000 supports a single TX queue so it is being shared with the stack >> when XDP runs XDP_TX action. This requires taking the xmit lock to >> ensure we don't corrupt the tx ring. To avoid taking and dropping the >> lock per packet this patch adds a bundling implementation to submit >> a bundle of packets to the xmit routine. >> >> I tested this patch running e1000 in a VM using KVM over a tap >> device using pktgen to generate traffic along with 'ping -f -l 100'. >> >> Suggested-by: Jesper Dangaard Brouer > > Thank you for actually implementing this! :-) > Yep no problem the effects are minimal on e1000 but should be noticeable at 10/40/100gbps nics. >> Signed-off-by: John Fastabend >> --- > [...] [...] >> +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle >> *buffer_info, >> + struct net_device *netdev, >> + struct e1000_adapter *adapter) >> +{ >> +struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0); >> +struct e1000_tx_ring *tx_ring = adapter->tx_ring; >> +struct e1000_hw *hw = &adapter->hw; >> +int i = 0; >> + >> /* e1000 only support a single txq at the moment so the queue is being >> * shared with stack. To support this requires locking to ensure the >> * stack and XDP are not running at the same time. Devices with >> * multiple queues should allocate a separate queue space. >> + * >> + * To amortize the locking cost e1000 bundles the xmits and sends as >> + * many as possible until either running out of descriptors or failing. >> */ >> HARD_TX_LOCK(netdev, txq, smp_processor_id()); >> >> -tx_ring = adapter->tx_ring; >> - >> -if (E1000_DESC_UNUSED(tx_ring) < 2) { >> -HARD_TX_UNLOCK(netdev, txq); >> -return; >> +for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) { >^^^ >> +e1000_xmit_raw_frame(buffer_info[i].buffer, >> + buffer_info[i].length, >> + adapter, tx_ring); >> +buffer_info[i].buffer->rxbuf.page = NULL; >> +buffer_info[i].buffer = NULL; >> +buffer_info[i].length = 0; >> +i++; > ^^^ > Looks like "i" is incremented twice, is that correct? > >> } Yep this and a couple other issues are resolved in v3 which I'll send out in a moment. Also in v3 I kept the program in the adapter structure. Moving it into the ring structure made the code a bit uglier IMO. I agree with the logic but practically only one program can exist for e1000.
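For clarity, this is the shape of the bundle loop without the stray increment, i.e. what the review comment points at (a sketch, not the v3 patch itself); the single lock/unlock pair around the whole bundle is the part that amortizes the cost.

	int i;

	HARD_TX_LOCK(netdev, txq, smp_processor_id());

	/* 'i' is advanced only in the for() header */
	for (i = 0; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
		e1000_xmit_raw_frame(buffer_info[i].buffer,
				     buffer_info[i].length,
				     adapter, tx_ring);
		buffer_info[i].buffer->rxbuf.page = NULL;
		buffer_info[i].buffer = NULL;
		buffer_info[i].length = 0;
	}

	/* tail/doorbell update, if any, goes here before dropping the lock */
	HARD_TX_UNLOCK(netdev, txq);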
[PATCHv2 next 3/3] ipvlan: Introduce l3s mode
From: Mahesh Bandewar In a typical IPvlan L3 setup where master is in default-ns and each slave is into different (slave) ns. In this setup egress packet processing for traffic originating from slave-ns will hit all NF_HOOKs in slave-ns as well as default-ns. However same is not true for ingress processing. All these NF_HOOKs are hit only in the slave-ns skipping them in the default-ns. IPvlan in L3 mode is restrictive and if admins want to deploy iptables rules in default-ns, this asymmetric data path makes it impossible to do so. This patch makes use of the l3_rcv() (added as part of l3mdev enhancements) to perform input route lookup on RX packets without changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN to change the skb->dev just before handing over skb to L4. Signed-off-by: Mahesh Bandewar --- Documentation/networking/ipvlan.txt | 7 ++- drivers/net/Kconfig | 1 + drivers/net/ipvlan/ipvlan.h | 7 +++ drivers/net/ipvlan/ipvlan_core.c| 94 + drivers/net/ipvlan/ipvlan_main.c| 60 --- include/uapi/linux/if_link.h| 1 + 6 files changed, 162 insertions(+), 8 deletions(-) diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index 14422f8fcdc4..24196cef7c91 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module There are no module parameters for this driver and it can be configured using IProute2/ip utility. - ip link add link type ipvlan mode { l2 | L3 } + ip link add link type ipvlan mode { l2 | l3 | l3s } e.g. ip link add link ipvl0 eth0 type ipvlan mode l2 @@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves will not receive nor can send multicast / broadcast traffic. +4.3 L3S mode: + This is very similar to the L3 mode except that iptables (conn-tracking) +works in this mode and hence it is L3-symmetric (L3s). This will have slightly less +performance but that shouldn't matter since you are choosing this mode over plain-L3 +mode to make conn-tracking work. 5. What to choose (macvlan vs. ipvlan)? 
These two devices are very similar in many regards and the specific use diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 0c5415b05ea9..8768a625350d 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -149,6 +149,7 @@ config IPVLAN tristate "IP-VLAN support" depends on INET depends on IPV6 +depends on NET_L3_MASTER_DEV ---help--- This allows one to create virtual devices off of a main interface and packets will be delivered based on the dest L3 (IPv6/IPv4 addr) diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h index 695a5dc9ace3..68b270b59ba9 100644 --- a/drivers/net/ipvlan/ipvlan.h +++ b/drivers/net/ipvlan/ipvlan.h @@ -23,11 +23,13 @@ #include #include #include +#include #include #include #include #include #include +#include #define IPVLAN_DRV "ipvlan" #define IPV_DRV_VER"0.1" @@ -96,6 +98,7 @@ struct ipvl_port { struct work_struct wq; struct sk_buff_head backlog; int count; + boolipt_hook_added; struct rcu_head rcu; }; @@ -124,4 +127,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan, const void *iaddr, bool is_v6); bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6); void ipvlan_ht_addr_del(struct ipvl_addr *addr); +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb, + u16 proto); +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb, +const struct nf_hook_state *state); #endif /* __IPVLAN_H */ diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c index b5f9511d819e..b4e990743e1d 100644 --- a/drivers/net/ipvlan/ipvlan_core.c +++ b/drivers/net/ipvlan/ipvlan_core.c @@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) case IPVLAN_MODE_L2: return ipvlan_xmit_mode_l2(skb, dev); case IPVLAN_MODE_L3: + case IPVLAN_MODE_L3S: return ipvlan_xmit_mode_l3(skb, dev); } @@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb) return ipvlan_handle_mode_l2(pskb, port); case IPVLAN_MODE_L3: return ipvlan_handle_mode_l3(pskb, port); + case IPVLAN_MODE_L3S: + return RX_HANDLER_PASS; } /* Should not reach here */ @@ -672,3 +675,94 @@ rx_handler_result_t ipvlan_handle_frame(struct
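The diff above is truncated before ipvlan_l3_rcv() and ipvlan_nf_input() are shown, so here is a rough, simplified sketch of the mechanism the commit message describes. It is illustrative only and does not claim to match the submitted code: IPv4 only, and l3s_find_slave() is a made-up stand-in for however the driver maps a destination address to its slave device.

#include <linux/ip.h>
#include <linux/netfilter.h>
#include <linux/skbuff.h>
#include <net/route.h>

/* hypothetical helper: map the packet's destination to the owning slave */
static struct net_device *l3s_find_slave(struct net_device *port_dev,
					 struct sk_buff *skb);

/* l3mdev receive hook: do the input route lookup against the slave device
 * while skb->dev still points at the master, so the hooks in the master's
 * namespace see the packet on its way in.
 */
static struct sk_buff *l3s_rcv_sketch(struct net_device *master,
				      struct sk_buff *skb)
{
	struct net_device *slave = l3s_find_slave(master, skb);
	const struct iphdr *ip4h = ip_hdr(skb);

	if (!slave)
		return skb;

	/* sets skb_dst() based on the slave's routing context */
	ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr, ip4h->tos, slave);
	return skb;
}

/* NF_INET_LOCAL_IN hook: switch skb->dev over to the slave just before the
 * packet is handed to L4, as the commit message puts it.
 */
static unsigned int l3s_nf_input_sketch(void *priv, struct sk_buff *skb,
					const struct nf_hook_state *state)
{
	struct net_device *slave = l3s_find_slave(skb->dev, skb);

	if (slave)
		skb->dev = slave;
	return NF_ACCEPT;
}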
Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC
On Mon, Sep 12, 2016 at 07:07:34PM +0300, Ram Amrani wrote: > drivers/infiniband/hw/qedr/main.c | 907 ++ > drivers/infiniband/hw/qedr/qedr.h | 494 > drivers/infiniband/hw/qedr/qedr_cm.c | 626 + > drivers/infiniband/hw/qedr/qedr_cm.h | 61 + > drivers/infiniband/hw/qedr/qedr_hsi.h | 56 + > drivers/infiniband/hw/qedr/qedr_hsi_rdma.h | 748 + > drivers/infiniband/hw/qedr/qedr_user.h | 80 + We are requiring new uAPI headers are placed under include/uapi/rdma/, please coordinate with Leon on the path. Jason
[PATCHv2 next 2/3] net: Add _nf_(un)register_hooks symbols
From: Mahesh Bandewar Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow caller to hold RTNL mutex. Signed-off-by: Mahesh Bandewar CC: Pablo Neira Ayuso --- include/linux/netfilter.h | 2 ++ net/netfilter/core.c | 51 ++- 2 files changed, 48 insertions(+), 5 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 9230f9aee896..e82b76781bf6 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg); void nf_unregister_hook(struct nf_hook_ops *reg); int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n); void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n); +int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n); +void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n); /* Functions to register get/setsockopt ranges (non-inclusive). You need to check permissions yourself! */ diff --git a/net/netfilter/core.c b/net/netfilter/core.c index f39276d1c2d7..2c5327e43a88 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks); static LIST_HEAD(nf_hook_list); -int nf_register_hook(struct nf_hook_ops *reg) +static int _nf_register_hook(struct nf_hook_ops *reg) { struct net *net, *last; int ret; - rtnl_lock(); for_each_net(net) { ret = nf_register_net_hook(net, reg); if (ret && ret != -ENOENT) goto rollback; } list_add_tail(®->list, &nf_hook_list); - rtnl_unlock(); return 0; rollback: @@ -210,19 +208,34 @@ rollback: break; nf_unregister_net_hook(net, reg); } + return ret; +} + +int nf_register_hook(struct nf_hook_ops *reg) +{ + int ret; + + rtnl_lock(); + ret = _nf_register_hook(reg); rtnl_unlock(); + return ret; } EXPORT_SYMBOL(nf_register_hook); -void nf_unregister_hook(struct nf_hook_ops *reg) +static void _nf_unregister_hook(struct nf_hook_ops *reg) { struct net *net; - rtnl_lock(); list_del(®->list); for_each_net(net) nf_unregister_net_hook(net, reg); +} + +void nf_unregister_hook(struct nf_hook_ops *reg) +{ + rtnl_lock(); + _nf_unregister_hook(reg); rtnl_unlock(); } EXPORT_SYMBOL(nf_unregister_hook); @@ -246,6 +259,26 @@ err: } EXPORT_SYMBOL(nf_register_hooks); +/* Caller MUST take rtnl_lock() */ +int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n) +{ + unsigned int i; + int err = 0; + + for (i = 0; i < n; i++) { + err = _nf_register_hook(®[i]); + if (err) + goto err; + } + return err; + +err: + if (i > 0) + _nf_unregister_hooks(reg, i); + return err; +} +EXPORT_SYMBOL(_nf_register_hooks); + void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n) { while (n-- > 0) @@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n) } EXPORT_SYMBOL(nf_unregister_hooks); +/* Caller MUST take rtnl_lock */ +void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n) +{ + while (n-- > 0) + _nf_unregister_hook(®[n]); +} +EXPORT_SYMBOL(_nf_unregister_hooks); + unsigned int nf_iterate(struct list_head *head, struct sk_buff *skb, struct nf_hook_state *state, -- 2.8.0.rc3.226.g39d4020
Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver
On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote: > [2] Libpvrdma User-level library - > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary You will probably find that rdma-plumbing will be the best way to get your userspace component into the distributors. http://www.spinics.net/lists/linux-rdma/msg39026.html http://www.spinics.net/lists/linux-rdma/msg39328.html http://www.spinics.net/lists/linux-rdma/msg40014.html Jason
[PATCHv2 next 1/3] ipv6: Export ip6_route_input_lookup symbol
From: Mahesh Bandewar Make ip6_route_input_lookup available outside of ipv6 the module similar to ip_route_input_noref in the IPv4 world. Signed-off-by: Mahesh Bandewar --- include/net/ip6_route.h | 3 +++ net/ipv6/route.c| 7 --- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index d97305d0e71f..e0cd318d5103 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr) } void ip6_route_input(struct sk_buff *skb); +struct dst_entry *ip6_route_input_lookup(struct net *net, +struct net_device *dev, +struct flowi6 *fl6, int flags); struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk, struct flowi6 *fl6, int flags); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 09d43ff11a8d..9563eedd4f97 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table * return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags); } -static struct dst_entry *ip6_route_input_lookup(struct net *net, - struct net_device *dev, - struct flowi6 *fl6, int flags) +struct dst_entry *ip6_route_input_lookup(struct net *net, +struct net_device *dev, +struct flowi6 *fl6, int flags) { if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG) flags |= RT6_LOOKUP_F_IFACE; return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input); } +EXPORT_SYMBOL_GPL(ip6_route_input_lookup); void ip6_route_input(struct sk_buff *skb) { -- 2.8.0.rc3.226.g39d4020
[PATCHv2 next 0/3] ipvlan introduce l3s mode
From: Mahesh Bandewar Same problem as before, with a new approach based on suggestions on the earlier patch series. First, this is introduced as a new mode rather than a modification of the old (L3) mode, so the behavior of the existing modes is preserved as-is while the new L3s mode obeys iptables so that the intended conn-tracking can work. To do this, the code uses the newly added l3mdev_rcv() handler and an iptables hook: l3mdev_rcv() performs an inbound route lookup with the correct (IPvlan slave) interface, and the iptables hook at NF_INET_LOCAL_IN then changes the input device from master to slave to complete the formality. The supporting stack changes are trivial: export a symbol so that IPv6 has the equivalent of the already-exported IPv4 code, and allow the netfilter hook registration code to be called with RTNL held. Please look into the individual patches for details. Mahesh Bandewar (3): ipv6: Export ip6_route_input_lookup symbol net: Add _nf_(un)register_hooks symbols ipvlan: Introduce l3s mode Documentation/networking/ipvlan.txt | 7 ++- drivers/net/Kconfig | 1 + drivers/net/ipvlan/ipvlan.h | 7 +++ drivers/net/ipvlan/ipvlan_core.c| 94 + drivers/net/ipvlan/ipvlan_main.c| 60 --- include/linux/netfilter.h | 2 + include/net/ip6_route.h | 3 ++ include/uapi/linux/if_link.h| 1 + net/ipv6/route.c| 7 +-- net/netfilter/core.c| 51 ++-- 10 files changed, 217 insertions(+), 16 deletions(-) v1: Initial post v2: Text correction and config changed from "select" to "depends on" -- 2.8.0.rc3.226.g39d4020
[PATCH net-next 1/2] uapi glibc compat: make linux/time.h compile with user time.h files
From: Willem de Bruijn Add libc-compat workaround for definitions in linux/time.h that duplicate those in libc time.h, sys/time.h and bits/time.h. With this change, userspace builds succeeds when linux/time.h is included after libc time.h and when it is included after sys/time.h. The inverse requires additional changes to those userspace headers. Without this patch, when compiling the following program after make headers_install: echo -e "#include \n#include " | \ gcc -Wall -Werror -Iusr/include -c -xc - gcc gives these errors: #include #include In file included from ../test_time.c:3:0: /usr/include/time.h:120:8: error: redefinition of ‘struct timespec’ struct timespec ^ In file included from ../test_time.c:2:0: ./usr/include/linux/time.h:9:8: note: originally defined here struct timespec { ^ In file included from ../test_time.c:3:0: /usr/include/time.h:161:8: error: redefinition of ‘struct itimerspec’ struct itimerspec ^ In file included from ../test_time.c:2:0: ./usr/include/linux/time.h:34:8: note: originally defined here struct itimerspec { and this warning by indirect inclusion of bits/time.h: In file included from ../test_time.c:4:0: ./usr/include/linux/time.h:67:0: error: "TIMER_ABSTIME" redefined [-Werror] #define TIMER_ABSTIME 0x01 ^ In file included from /usr/include/time.h:41:0, from ../test_time.c:3: /usr/include/x86_64-linux-gnu/bits/time.h:82:0: note: this is the location of the previous definition # define TIMER_ABSTIME 1 ^ The _SYS_TIME_H variant resolves similar errors for timeval, timezone, itimerval and warnings for ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF. Ran the same program for sys/time.h and bits/time.h. Signed-off-by: Willem de Bruijn --- include/uapi/linux/libc-compat.h | 50 include/uapi/linux/time.h| 15 2 files changed, 65 insertions(+) diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h index 44b8a6b..b08b0f5 100644 --- a/include/uapi/linux/libc-compat.h +++ b/include/uapi/linux/libc-compat.h @@ -165,6 +165,43 @@ #define __UAPI_DEF_XATTR 1 #endif +/* Definitions for time.h */ +#if defined(__timespec_defined) +#define __UAPI_DEF_TIMESPEC0 +#else +#define __UAPI_DEF_TIMESPEC1 +#endif + +#if defined(_TIME_H) && defined(__USE_POSIX199309) +#define __UAPI_DEF_ITIMERSPEC 0 +#else +#define __UAPI_DEF_ITIMERSPEC 1 +#endif + +/* Definitions for sys/time.h */ +#if defined(_SYS_TIME_H) +#define __UAPI_DEF_TIMEVAL 0 +#define __UAPI_DEF_ITIMERVAL 0 +#define __UAPI_DEF_ITIMER_WHICH0 +#else +#define __UAPI_DEF_TIMEVAL 1 +#define __UAPI_DEF_ITIMERVAL 1 +#define __UAPI_DEF_ITIMER_WHICH1 +#endif + +/* Definitions for bits/time.h */ +#if defined(_BITS_TIME_H) +#define __UAPI_DEF_ABSTIME 0 +#else +#define __UAPI_DEF_ABSTIME 1 +#endif + +#if defined(_SYS_TIME_H) && defined(__USE_BSD) +#define __UAPI_DEF_TIMEZONE0 +#else +#define __UAPI_DEF_TIMEZONE1 +#endif + /* If we did not see any headers from any supported C libraries, * or we are being included in the kernel, then define everything * that we need. 
*/ @@ -208,6 +245,19 @@ /* Definitions for xattr.h */ #define __UAPI_DEF_XATTR 1 +/* Definitions for time.h */ +#define __UAPI_DEF_TIMESPEC1 +#define __UAPI_DEF_ITIMERSPEC 1 + +/* Definitions for sys/time.h */ +#define __UAPI_DEF_TIMEVAL 1 +#define __UAPI_DEF_ITIMERVAL 1 +#define __UAPI_DEF_ITIMER_WHICH1 +#define __UAPI_DEF_TIMEZONE1 + +/* Definitions for bits/time.h */ +#define __UAPI_DEF_ABSTIME 1 + #endif /* __GLIBC__ */ #endif /* _UAPI_LIBC_COMPAT_H */ diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h index e75e1b6..4e7333c 100644 --- a/include/uapi/linux/time.h +++ b/include/uapi/linux/time.h @@ -1,9 +1,11 @@ #ifndef _UAPI_LINUX_TIME_H #define _UAPI_LINUX_TIME_H +#include #include +#if __UAPI_DEF_TIMESPEC #ifndef _STRUCT_TIMESPEC #define _STRUCT_TIMESPEC struct timespec { @@ -11,35 +13,46 @@ struct timespec { longtv_nsec;/* nanoseconds */ }; #endif +#endif +#if __UAPI_DEF_TIMEVAL struct timeval { __kernel_time_t tv_sec; /* seconds */ __kernel_suseconds_ttv_usec;/* microseconds */ }; +#endif +#if __UAPI_DEF_TIMEZONE struct timezone { int tz_minuteswest; /* minutes west of Greenwich */ int tz_dsttime; /* type of dst correction */ }; +#endif /* * Names of the interval timers, and structure * defining a timer setting: */ +#if __UAPI_DEF_ITIMER_WHICH #defineITIMER_REAL 0 #define
Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Mon, Sep 12, 2016 at 10:56:55AM +0200, Jesper Dangaard Brouer wrote: > On Thu, 8 Sep 2016 23:30:50 -0700 > Alexei Starovoitov wrote: > > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote: > > > > > Lets do bundling/bulking from the start! > > > > > > > > mlx4 already does bulking and this proposed mlx5 set of patches > > > > does bulking as well. > > > > See nothing wrong about it. RX side processes the packets and > > > > when it's done it tells TX to xmit whatever it collected. > > > > > > This is doing "hidden" bulking and not really taking advantage of using > > > the icache more effeciently. > > > > > > Let me explain the problem I see, little more clear then, so you > > > hopefully see where I'm going. > > > > > > Imagine you have packets intermixed towards the stack and XDP_TX. > > > Every time you call the stack code, then you flush your icache. When > > > returning to the driver code, you will have to reload all the icache > > > associated with the XDP_TX, this is a costly operation. > > > > correct. And why is that a problem? > > It is good that you can see and acknowledge the I-cache problem. > > XDP is all about performance. What I hear is, that you are arguing > against a model that will yield better performance, that does not make > sense to me. Let me explain this again, in another way. I'm arguing against your proposal that I think will be more complex and lower performance than what Saeed and the team already implemented. Therefore I don't think it's fair to block the patch and ask them to reimplement it just to test an idea that may or may not improve performance. Getting maximum performance is tricky. Good is better than perfect. It's important to argue about user space visible bits upfront, but on the kernel performance side we should build/test incrementally. This particular patch 11/11 is simple, easy to review and provides good performance. What's not to like?
[PATCH net-next 2/2] errqueue: include linux/time.h
From: Willem de Bruijn struct scm_timestamping has fields of type struct timespec. Now that it is safe to include linux/time.h and time.h at the same time, include linux/time.h directly in linux/errqueue.h Without this patch, when compiling the following program after make headers_install: gcc -Wall -Werror -Iusr/include -c -xc - < static struct scm_timestamping tss; int main(void) { tss.ts[0].tv_sec = 1; return 0; } EOF gcc gives this error: In file included from :1:0: usr/include/linux/errqueue.h:33:18: error: array type has incomplete element type struct timespec ts[3]; Reported-by: Brooks Moses Signed-off-by: Willem de Bruijn --- include/uapi/linux/errqueue.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 07bdce1..abafec8 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -1,6 +1,7 @@ #ifndef _UAPI_LINUX_ERRQUEUE_H #define _UAPI_LINUX_ERRQUEUE_H +#include #include struct sock_extended_err { -- 2.8.0.rc3.226.g39d4020
Re: [PATCH v4 16/16] MAINTAINERS: Update for PVRDMA driver
On Sun, Sep 11, 2016 at 09:49:26PM -0700, Adit Ranadive wrote: > Add maintainer info for the PVRDMA driver. You can probably squash the last three patches. .. and fix the __u32 stuff throughout the entire driver please. Jason
[PATCH net-next 0/2] uapi: include time.h from errqueue.h
From: Willem de Bruijn It was reported that linux/errqueue.h requires linux/time.h, but that adding the include directly may cause userspace conflicts between linux/time.h and glibc time.h: https://lkml.org/lkml/2016/7/10/10 Address the conflicts using the standard libc-compat approach, then add the #include to errqueue.h The first patch is a resubmit. It was previously submitted to tip/timers/core, but given the commit history, the maintainer suggested this tree, instead. https://lkml.org/lkml/2016/8/10/748 This also allows sending the follow-up as part of the patchset. Willem de Bruijn (2): uapi glibc compat: make linux/time.h compile with user time.h files errqueue: include linux/time.h include/uapi/linux/errqueue.h| 1 + include/uapi/linux/libc-compat.h | 50 include/uapi/linux/time.h| 15 3 files changed, 66 insertions(+) -- 2.8.0.rc3.226.g39d4020
Re: [PATCH v4 03/16] IB/pvrdma: Add virtual device RDMA structures
On Sun, Sep 11, 2016 at 09:49:13PM -0700, Adit Ranadive wrote: > + __u8 raw[16]; > + struct { > + __be64 subnet_prefix; > + __be64 interface_id; > + } global; If this is not a userspace header do not use the __ variants.. Jason
Re: [PATCH v4 02/16] IB/pvrdma: Add user-level shared functions
On Sun, Sep 11, 2016 at 09:49:12PM -0700, Adit Ranadive wrote: > We share some common structures with the user-level driver. This patch > adds those structures and shared functions to traverse the QP/CQ rings. > create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_uapi.h > create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_user.h The files that are intended to be shared with userspace must be under include/uapi/, please coordinate with Leon on the path. Same for all the new drivers. > +static inline __s32 pvrdma_idx(atomic_t *var, __u32 max_elems) > +{ > + const unsigned int idx = atomic_read(var); Eh? Does this even compile in userspace? If this is not a userspace header then why does it use __u32 and related ?? > +#define PVRDMA_UVERBS_ABI_VERSION3 > +#define PVRDMA_BOARD_ID 1 > +#define PVRDMA_REV_ID1 > + > +struct pvrdma_alloc_ucontext_resp { > + u32 qp_tab_size; > + u32 reserved; > +}; This certainly looks like a userspace header, shouldn't it use __u32? NAK Jason
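To make the objection concrete, a hypothetical fragment of what a header meant to be shared with userspace is expected to look like (made-up names, not a proposed fix for pvrdma): only fixed-width __u32/__u64 types from <linux/types.h>, plain data, and no kernel-internal types such as atomic_t; whatever atomicity is needed has to be provided by each side, not by the shared header.

#include <linux/types.h>

/* Hypothetical shared ring state: plain fixed-width fields only. */
struct example_ring_state {
	__u32 prod_idx;		/* advanced by the producer side only */
	__u32 cons_idx;		/* advanced by the consumer side only */
};

/* Index arithmetic both kernel and userspace can use as-is;
 * assumes max_elems is a power of two.
 */
static inline __u32 example_ring_off(__u32 idx, __u32 max_elems)
{
	return idx & (max_elems - 1);
}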