Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, Apr 1, 2016 at 7:16 PM, David Miller wrote:
> From: Alexander Duyck
> Date: Fri, 1 Apr 2016 12:58:41 -0700
>
> > RFC 6864 is pretty explicit about this, IPv4 ID used only for
> > fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1
> >
> > The goal with this change is to try and keep most of the existing
> > behavior intact without violating this rule. I would think the
> > sequence number should give you the ability to infer a drop in the
> > case of TCP. In the case of UDP tunnels we are now getting a bit more
> > data since we were ignoring the outer IP header ID before.
>
> When retransmits happen, the sequence numbers are the same. But you
> can then use the IP ID to see exactly what happened. You can even
> tell if multiple retransmits got reordered.
>
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

I'm not sure the IP ID is really all that useful. On a 10G link it takes about 80ms for it to wrap using an MTU of 1500, and that value keeps dropping as we move up to 40G and 100G. That was one of the motivations behind RFC 6864: keeping the IP IDs truly unique starts to become a bottleneck.

In addition, while this change would allow us to combine disjoint IP IDs, I don't think you would lose the retransmission, as there would likely be a gap in sequence numbers that would cause the flow to be flushed from GRO; and it isn't as if we can prepend the retransmit to the aggregated frame, we are always adding to the tail.

I would think the most likely case where this change would foul up any IP IDs is the garbage-in/garbage-out case, like an IPv6-to-IPv4 translation that uses a fixed IP ID of 0. I agree that it isn't desirable, but at the same time there is nothing in RFC 6864 to say we cannot do that.
In addition, it would likely help to mitigate some of the impact of the IP ID on things like SLHC compression, since the resegmented frames would have incrementing IDs, which would reduce the number of times the IP ID field has to be updated. - Alex
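Alex's "about 80ms" figure checks out arithmetically. A quick sketch of the calculation (not from the thread; the 20 bytes of preamble, SFD, and interframe gap per frame are the standard Ethernet wire overhead):

```c
/* Time for the 16-bit IPv4 ID to wrap when every packet of a flow
 * increments it: 65536 packets at the link's maximum frame rate.
 * Each frame occupies (frame_bytes + 20) bytes on the wire. */
static double ip_id_wrap_ms(double link_gbps, int frame_bytes)
{
	double pps = link_gbps * 1e9 / ((frame_bytes + 20) * 8.0);

	return 65536.0 / pps * 1000.0;
}
```

At 10 Gb/s with 1500-byte frames this gives roughly 80 ms; at 100 Gb/s it drops below 10 ms, which is the scaling problem RFC 6864 reacts to.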
Re: [PATCH] RDS: sync congestion map updating
On 4/1/16 6:14 PM, Leon Romanovsky wrote:
On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote: (cc-ing netdev)
On 3/30/2016 7:59 PM, Wengang Wang wrote:
On 2016-03-31 09:51, Wengang Wang wrote:
On 2016-03-31 01:16, santosh shilimkar wrote: Hi Wengang,
On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:

The problem found is that some among a lot of parallel RDS communications hang. In my test, ten or so among 33 communications hang. The send requests got -ENOBUF errors, meaning the peer socket (port) is congested. But meanwhile, the peer socket (port) is not congested.

The congestion map updating can happen in two paths: one is the rds_recvmsg path and the other is when it receives packets from the hardware. There is no synchronization when updating the congestion map, so a bit operation (clearing) in the rds_recvmsg path can be lost to another bit operation (setting) in the hardware packet receiving path.

To be more detailed: here, the two paths (user calls recvmsg and hardware receives data) are for different rds socks, thus the rds_sock->rs_recv_lock is not helpful to sync the updating of the congestion map.

For archive purposes, let me try to conclude the thread. I synced with Wengang offlist and came up with the fix below. I was under the impression that __set_bit_le() was the atomic version. After fixing it as in the patch (end of the email), the bug gets addressed. I will probably send this as a fix for stable as well.

From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar
Date: Wed, 30 Mar 2016 23:26:47 -0700
Subject: [PATCH] RDS: Fix the atomicity for congestion map update

Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via the receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Let's use atomic ops to avoid such an issue.
Full credit to Wengang for finding the issue, analysing it, and also pointing out the offending code with a spinlock-based fix.

I'm glad that you solved the issue without spinlocks. Out of curiosity, I see that this patch needs to be sent to Dave and applied by him. Is that right?

Right. I was planning to send this one along with one more fix together on netdev for Dave to pick up.

➜ linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
Santosh Shilimkar (supporter:RDS - RELIABLE DATAGRAM SOCKETS)
"David S. Miller" (maintainer:NETWORKING [GENERAL])
netdev@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-r...@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
rds-de...@oss.oracle.com (moderated list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-ker...@vger.kernel.org (open list)

Signed-off-by: Wengang Wang
Signed-off-by: Santosh Shilimkar
Reviewed-by: Leon Romanovsky

Thanks for review.
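The race and the fix can be sketched outside the kernel with GCC's atomic builtins standing in for set_bit()/clear_bit() (the map layout and names here are hypothetical, not the RDS code): a plain `word |= bit` in one path can read the word, lose a concurrent `word &= ~bit` from the other path, and write the stale value back, whereas an atomic read-modify-write cannot.

```c
#include <stdint.h>

/* Toy congestion bitmap: one bit per port, 64 ports per word. */
static uint64_t cong_map[16];

/* Atomic set, analogous to set_bit(): safe against a concurrent
 * clear of another bit in the same word. */
static void cong_set_bit(unsigned int port)
{
	__atomic_fetch_or(&cong_map[port / 64], 1ULL << (port % 64),
			  __ATOMIC_RELAXED);
}

/* Atomic clear, analogous to clear_bit(). A non-atomic version
 * (__set_bit_le-style plain |=/&=) is only safe when a lock already
 * serializes all writers to the word. */
static void cong_clear_bit(unsigned int port)
{
	__atomic_fetch_and(&cong_map[port / 64], ~(1ULL << (port % 64)),
			   __ATOMIC_RELAXED);
}

static int cong_test_bit(unsigned int port)
{
	return (__atomic_load_n(&cong_map[port / 64], __ATOMIC_RELAXED)
		>> (port % 64)) & 1;
}
```

Two ports that hash to the same word (say 3 and 67) can now be set and cleared from different threads without one update erasing the other.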
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> > bpf programs require a skb context to navigate the packet, build a
> > percpu fake skb with the minimal fields. This avoids the costly
> > allocation for packets that end up being dropped.
> >
> > +	/* A bpf program gets first chance to drop the packet. It may
> > +	 * read bytes but not past the end of the frag. A non-zero
> > +	 * return indicates packet should be dropped.
> > +	 */
> > +	if (prog) {
> > +		struct ethhdr *ethh;
> > +
> > +		ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +					 frags[0].page_offset);
> > +		if (mlx4_call_bpf(prog, ethh, length)) {
> > +			priv->stats.rx_dropped++;
> > +			goto next;
> > +		}
> > +	}
> > +
>
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame.
>
> Still you pass a single fragment but total 'length' here : BPF program
> can read past the end of this first fragment and panic the box.
>
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.

Yep. My reading of that part was that num_frags > 1 is only for large MTU sizes, so if we limit this to num_frags == 1 only for now, we should be ok, and it's still applicable for most of the use cases?

> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.

Yes, good point. I bet it was copy-pasted from a few lines below. Should be trivial to convert it to percpu.

> This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.

Yes, it could be per ring as well. My guess is we're hitting the 14.5 Mpps limit both for an empty bpf program and for a program that actually looks into the packet because we're hitting the 10G PHY limit of the 40G NIC, since physically a 40G NIC consists of four 10G PHYs.
There will be the same problem with 100G and 50G NICs: both will be hitting the 25G PHY limit. We need to vary the packets somehow. Hopefully Or can explain that bit of HW design. Jesper's experiments with mlx4 showed the same 14.5 Mpps limit when the sender was blasting the same packet over and over again. Great to see the experiments converging.
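For reference, the ~14.5 Mpps numbers in this thread sit right at 10G line rate for minimum-size frames. A sketch of the standard calculation (not from the thread; the 20 bytes are preamble, SFD, and interframe gap):

```c
/* Theoretical maximum packet rate of an Ethernet link, in Mpps.
 * A frame occupies (frame_bytes + 20) bytes on the wire, so a
 * minimum 64-byte frame costs 84 bytes = 672 bit times. */
static double line_rate_mpps(double link_gbps, int frame_bytes)
{
	return link_gbps * 1e9 / ((frame_bytes + 20) * 8.0) / 1e6;
}
```

10G with 64-byte frames gives about 14.88 Mpps, so the observed 14.5 Mpps drop rate is essentially one 10G PHY's worth of line rate.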
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Sat, 2016-04-02 at 10:19 +0800, Herbert Xu wrote:
> On Fri, Apr 01, 2016 at 07:15:33PM -0700, Eric Dumazet wrote:
> > On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> > >
> > > We could easily fix that by adding a feature bit to control this,
> > > something like SKB_GSO_TCP_FIXEDID.
> >
> > I understood the patch allowed to aggregate 4 segments having
> >
> > ID=12 ID=40 ID=80 ID=1000
>
> Right. But I haven't seen any justification that we should aggregate
> such packets at all. The only valid case that I have seen is for
> the case of unchanging IDs, no?

Presumably repeats of "DF=1 ID=0" should be what we really need to handle.

On my wish list, having some reordering logic in GRO would be far more interesting than these minor details ;)
Re: [PATCH v2 net-next] net/core: generic support for disabling netdev features down stack
Hi,

Sorry for digging up an old patch, but... ;-)

dev_disable_lro() is a leftover from ancient times. If you read commit 27660515a, there is a hint where it should go. Please read on if you'd like to fix this properly.

2015-11-03 3:55 GMT+01:00 Jarod Wilson:
> There are some netdev features, which when disabled on an upper device,
> such as a bonding master or a bridge, must be disabled and cannot be
> re-enabled on underlying devices.
[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6288,6 +6288,44 @@ static void rollback_registered(struct net_device *dev)
> 	list_del(&single);
> }
>
> +static netdev_features_t netdev_sync_upper_features(struct net_device *lower,
> +	struct net_device *upper, netdev_features_t features)
> +{
> +	netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
> +	netdev_features_t feature;
> +
> +	for_each_netdev_feature(&upper_disables, feature) {
> +		if (!(upper->wanted_features & feature)
> +		    && (features & feature)) {
> +			netdev_dbg(lower, "Dropping feature %pNF, upper dev %s has it off.\n",
> +				   &feature, upper->name);
> +			features &= ~feature;
> +		}
> +	}

You could do this once:

	disable_features = ~upper->features & features & NETIF_F_UPPER_DISABLES;
	if (features & disable_features)
		netdev_dbg(...)
	features &= ~disable_features;

> +
> +	return features;
> +}
[...]
> @@ -6370,6 +6410,10 @@ int __netdev_update_features(struct net_device *dev)
> 	/* driver might be less strict about feature dependencies */
> 	features = netdev_fix_features(dev, features);
>
> +	/* some features can't be enabled if they're off on an upper device */
> +	netdev_for_each_upper_dev_rcu(dev, upper, iter)
> +		features = netdev_sync_upper_features(dev, upper, features);
> +
> 	if (dev->features == features)
> 		return 0;
>

This should go into netdev_fix_features(), as that is the one single place where all feature dependencies are verified.
[...]
> @@ -6386,6 +6430,12 @@ int __netdev_update_features(struct net_device *dev)
> 		return -1;
> 	}
>
> +	/* some features must be disabled on lower devices when disabled
> +	 * on an upper device (think: bonding master or bridge)
> +	 */
> +	netdev_for_each_lower_dev(dev, lower, iter)
> +		netdev_sync_lower_features(dev, lower, features);
> +
[...]

This should be equal in resulting flags to:

	netdev_for_each_lower_dev(dev, lower, ...)
		netdev_update_features(lower);

because netdev_fix_features(lower) will see the already-changed dev->features.

Best Regards,
Michał Mirosław
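Michał's one-shot mask form can be checked in isolation. A sketch with hypothetical feature bits (the names below stand in for netdev_features_t values and NETIF_F_UPPER_DISABLES; they are not the kernel's definitions):

```c
typedef unsigned long long features_t;

#define F_LRO		(1ULL << 0)	/* hypothetical feature bits */
#define F_GRO		(1ULL << 1)
#define UPPER_DISABLES	F_LRO		/* stand-in for NETIF_F_UPPER_DISABLES */

/* Equivalent of the per-feature loop: clear every UPPER_DISABLES
 * feature that the upper device has off but the lower still requests. */
static features_t sync_upper(features_t upper_features, features_t features)
{
	features_t disable = ~upper_features & features & UPPER_DISABLES;

	return features & ~disable;
}
```

If the upper has LRO off, a lower requesting LRO|GRO ends up with only GRO; features outside UPPER_DISABLES (GRO here) are never touched, which is the whole point of the mask.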
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote:
> From: Alexander Duyck
> Date: Fri, 1 Apr 2016 12:58:41 -0700
>
> > RFC 6864 is pretty explicit about this, IPv4 ID used only for
> > fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1
> >
> > The goal with this change is to try and keep most of the existing
> > behavior intact without violating this rule. I would think the
> > sequence number should give you the ability to infer a drop in the
> > case of TCP. In the case of UDP tunnels we are now getting a bit more
> > data since we were ignoring the outer IP header ID before.
>
> When retransmits happen, the sequence numbers are the same. But you
> can then use the IP ID to see exactly what happened. You can even
> tell if multiple retransmits got reordered.
>
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

Yes, our team (including Van Jacobson ;) ) would be sad not to have sequential IP IDs (but then we don't have them for IPv6 ;) )

Since the cost of generating them is pretty small (the inet->inet_id counter), we probably should keep them in Linux. Their usage will phase out as IPv6 wins the Internet war...
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, Apr 01, 2016 at 07:15:33PM -0700, Eric Dumazet wrote:
> On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> >
> > We could easily fix that by adding a feature bit to control this,
> > something like SKB_GSO_TCP_FIXEDID.
>
> I understood the patch allowed to aggregate 4 segments having
>
> ID=12 ID=40 ID=80 ID=1000

Right. But I haven't seen any justification that we should aggregate such packets at all. The only valid case that I have seen is the case of unchanging IDs, no?

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Alexander Duyck
Date: Fri, 1 Apr 2016 12:58:41 -0700

> RFC 6864 is pretty explicit about this, IPv4 ID used only for
> fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1
>
> The goal with this change is to try and keep most of the existing
> behavior intact without violating this rule. I would think the
> sequence number should give you the ability to infer a drop in the
> case of TCP. In the case of UDP tunnels we are now getting a bit more
> data since we were ignoring the outer IP header ID before.

When retransmits happen, the sequence numbers are the same. But you can then use the IP ID to see exactly what happened. You can even tell if multiple retransmits got reordered.

Eric's use case is extremely useful, and flat out eliminates ambiguity when analyzing TCP traces.
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> Eric Dumazet wrote:
> >
> > I do not particularly care, but it is worth mentioning that GRO+TSO
> > would not be idempotent anymore.
>
> We could easily fix that by adding a feature bit to control this,
> something like SKB_GSO_TCP_FIXEDID.

I understood the patch to allow aggregating 4 segments having

ID=12 ID=40 ID=80 ID=1000

-> resulting in a GRO packet with 4 segments and ID=12.

Would TSO generate 12,13,14,15 or 12,12,12,12 with your flag?

(Before the patch, no aggregation took place, and the exact same packets were forwarded with IDs 12, 40, 80, 1000.)

As I said, I am not sure we should care :)
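The two resegmentation behaviours Eric contrasts can be written out directly (a hypothetical helper for illustration, not kernel code): with a FIXEDID-style flag TSO would replicate the super-packet's ID into every segment, while the default behaviour increments per segment.

```c
#include <stdint.h>

/* Fill the IPv4 IDs that TSO would emit for nsegs segments of one
 * super-packet whose first ID is first_id. fixed_id models the
 * proposed SKB_GSO_TCP_FIXEDID-style flag. */
static void tso_fill_ids(uint16_t first_id, int fixed_id,
			 uint16_t *out, int nsegs)
{
	for (int i = 0; i < nsegs; i++)
		out[i] = fixed_id ? first_id : (uint16_t)(first_id + i);
}
```

For a 4-segment GRO packet with ID=12, the flag decides between 12,12,12,12 and 12,13,14,15 — which is exactly Eric's question, since neither reproduces the original 12, 40, 80, 1000.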
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> +	/* A bpf program gets first chance to drop the packet. It may
> +	 * read bytes but not past the end of the frag. A non-zero
> +	 * return indicates packet should be dropped.
> +	 */
> +	if (prog) {
> +		struct ethhdr *ethh;
> +
> +		ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +					 frags[0].page_offset);
> +		if (mlx4_call_bpf(prog, ethh, length)) {
> +			priv->stats.rx_dropped++;
> +			goto next;
> +		}
> +	}
> +

1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet frame.

Still you pass a single fragment but the total 'length' here: the BPF program can read past the end of this first fragment and panic the box.

Please take a look at mlx4_en_complete_rx_desc() and you'll see what I mean.

2) priv->stats.rx_dropped is shared by all the RX queues -> false sharing.

This is probably the right time to add a rx_dropped field in struct mlx4_en_rx_ring, since you guys want to drop 14 Mpps, and 50 Mpps on higher speed links.
Re: [RESEND PATCH V4 2/3] IB/hns: Add HiSilicon RoCE driver support
On Fri, Apr 01, 2016 at 05:21:31PM +0800, Lijun Ou wrote:
> The driver for HiSilicon RoCE is a platform driver.
> The driver will support multiple versions of hardware. Currently only "v1"
> for the hip06 SoC is supported.
> The driver includes two parts: common driver and hardware-specific
> operations. hns_roce_v1_hw.c and hns_roce_v1_hw.h are files for
> hardware-specific operations only for the v1 engine; the other files (.c and .h)
> are for common algorithms and common hardware operations.
>
> Signed-off-by: Lijun Ou
> Signed-off-by: Wei Hu(Xavier)
> Signed-off-by: Znlong
> ---
>  MAINTAINERS                                        |    8 +
>  drivers/infiniband/Kconfig                         |    1 +
>  drivers/infiniband/hw/Makefile                     |    1 +
>  drivers/infiniband/hw/hisilicon/hns/Kconfig        |   10 +
>  drivers/infiniband/hw/hisilicon/hns/Makefile       |    9 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c  |  110 +

We do not add the name of the company (hisilicon) to the paths of InfiniBand HW drivers:

drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c ---> drivers/infiniband/hw/hns/hns_roce_ah.c

>  .../infiniband/hw/hisilicon/hns/hns_roce_alloc.c   |  239 ++
        ^^
Please fix your paths.
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.c |  338 +++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.h |   80 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_common.h  |  308 +++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cq.c  |  436 +++
>  .../infiniband/hw/hisilicon/hns/hns_roce_device.h  |  794 ++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.c  |  758 ++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.h  |  132 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.c |  578 ++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.h |  112 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_main.c    | 1097 ++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_mr.c  |  605 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_pd.c  |  124 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_qp.c  |  841 ++
>  .../infiniband/hw/hisilicon/hns/hns_roce_user.h    |   31 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.c   | 2832 ++++
>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.h   |  985 +++
        ^^
Do you support v1 of RoCE, or v1 of your HW?

>  23 files changed, 10429 insertions(+)

Please appreciate the effort needed to review such a large patch, and invest the time and effort to divide this into a number of small, easy-to-review patches.
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
Eric Dumazet wrote:
>
> I do not particularly care, but it is worth mentioning that GRO+TSO
> would not be idempotent anymore.

We could easily fix that by adding a feature bit to control this, something like SKB_GSO_TCP_FIXEDID.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Question on rhashtable in worst-case scenario.
On Fri, Apr 01, 2016 at 11:34:10PM +0200, Johannes Berg wrote:
>
> I was thinking about that one - it's not obvious to me from the code
> how this "explicitly checking for dups" would be done, or let's say how
> rhashtable differentiates. But since it seems to work for Ben until
> hitting a certain number of identical keys, surely that's just me not
> understanding the code rather than anything else :)

It's really simple: rhashtable_insert_fast does not check for dups, while rhashtable_lookup_insert_* do.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
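Herbert's distinction can be illustrated with a toy bucket chain (plain C, not the rhashtable API): the fast variant links the new entry in blindly, while the lookup-insert variant searches the chain first and rejects a duplicate key with -EEXIST.

```c
#include <stddef.h>

/* One hash bucket's chain, standing in for an rhashtable bucket. */
struct node {
	int key;
	struct node *next;
};

/* insert_fast-style: no duplicate check, always succeeds. */
static void insert_fast(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/* lookup_insert-style: scan for the key first, refuse duplicates. */
static int lookup_insert(struct node **head, struct node *n)
{
	for (struct node *p = *head; p; p = p->next)
		if (p->key == n->key)
			return -17;	/* -EEXIST */
	insert_fast(head, n);
	return 0;
}
```

With insert_fast, Ben's worst case (many identical keys) simply grows one chain; with lookup_insert, the second entry with the same key is rejected up front.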
[RFC PATCH 3/5] rtnl: add option for setting link bpf prog
Sets the bpf program represented by fd as an early filter in the rx path of the netdev. The fd must have been created as BPF_PROG_TYPE_PHYS_DEV. Providing a negative value as fd clears the program. Getting the fd back via rtnl is not possible; therefore reading this value merely provides a bool of whether the program is valid on the link or not.

Signed-off-by: Brenden Blanco
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c488066..08d66a3 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -155,6 +155,7 @@ enum {
 	IFLA_PROTO_DOWN,
 	IFLA_GSO_MAX_SEGS,
 	IFLA_GSO_MAX_SIZE,
+	IFLA_BPF_FD,
 	__IFLA_MAX
 };

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f206677..1c4a603 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -909,6 +909,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_link_get_af_size(dev, ext_filter_mask) /* IFLA_AF_SPEC */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
+	       + nla_total_size(4) /* IFLA_BPF_FD */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 }

@@ -1241,6 +1242,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
+	    nla_put_s32(skb, IFLA_BPF_FD, dev->bpf_valid) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
 		goto nla_put_failure;

@@ -1374,6 +1376,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_PROTO_DOWN]	= { .type = NLA_U8 },
+	[IFLA_BPF_FD]		= { .type = NLA_S32 },
 };

 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2028,6 +2031,13 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_NOTIFY;
 	}

+	if (tb[IFLA_BPF_FD]) {
+		err = dev_change_bpf_fd(dev, nla_get_s32(tb[IFLA_BPF_FD]));
+		if (err)
+			goto errout;
+		status |= DO_SETLINK_NOTIFY;
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if (status & DO_SETLINK_NOTIFY)
--
2.8.0
[RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since bpf programs require a skb context to navigate the packet, build a percpu fake skb with the minimal fields. This avoids the costly allocation for packets that end up being dropped.

Since mlx4 is so far the only user of this pseudo skb, the build function is defined locally.

Signed-off-by: Brenden Blanco
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
 3 files changed, 81 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c..89ca787 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */

+#include
 #include
 #include
 #include
@@ -1966,6 +1967,9 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 		mlx4_en_destroy_cq(priv, &priv->rx_cq[i]);
 	}

+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 }

 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
@@ -2456,6 +2460,61 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }

+static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
+
+static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
+				     unsigned int length)
+{
+	/* data_len is intentionally not set here so that skb_is_nonlinear()
+	 * returns false
+	 */
+
+	skb->len = length;
+	skb->head = data;
+	skb->data = data;
+}
+
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
+{
+	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
+	int ret;
+
+	build_pseudo_skb_for_bpf(skb, data, length);
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)skb);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int mlx4_bpf_set(struct net_device *dev, int fd)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *prog = NULL, *old_prog;
+
+	if (fd >= 0) {
+		prog = bpf_prog_get(fd);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+	}
+
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	priv->dev->bpf_valid = !!prog;
+
+	return 0;
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };

 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };

 struct mlx4_en_bond {

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 86bcfe5..03fe005 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;

+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));

+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag. A non-zero
+		 * return indicates packet should be dropped.
+		 */
+
[RFC PATCH 0/5] Add driver bpf hook for early packet drop
This patch set introduces new infrastructure for programmatically processing packets in the earliest stages of rx, as part of an effort others are calling Express Data Path (XDP) [1]. We start this effort by introducing a new bpf program type for early packet filtering, before even an skb has been allocated. With this, we hope to enable line rate filtering, with this initial implementation providing a drop/allow action only.

Patch 1 introduces the new prog type and helpers for validating the bpf program. A new userspace struct is defined containing only len as a field, with others to follow in the future.

In patch 2, we create a new ndo to pass the fd to supporting drivers.

In patch 3, we expose a new rtnl option to userspace.

In patch 4, we enable support in the mlx4 driver. No skb allocation is required; instead a static percpu skb is kept in the driver and minimally initialized for each driver frag.

In patch 5, we create a sample drop-and-count program. With a single core, we achieved a ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data access, a bpf array lookup, and an increment. Interestingly, accessing packet data from the program did not have a noticeable impact on performance. Even so, future enhancements to prefetching / batching / page-allocs should hopefully improve the performance in this path.
[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

Brenden Blanco (5):
  bpf: add PHYS_DEV prog type for early driver filter
  net: add ndo to set bpf prog in adapter rx
  rtnl: add option for setting link bpf prog
  mlx4: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |  18 +++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
 include/linux/netdevice.h                      |   8 ++
 include/uapi/linux/bpf.h                       |   5 +
 include/uapi/linux/if_link.h                   |   1 +
 kernel/bpf/verifier.c                          |   1 +
 net/core/dev.c                                 |  12 ++
 net/core/filter.c                              |  68 +++++++
 net/core/rtnetlink.c                           |  10 ++
 samples/bpf/Makefile                           |   4 +
 samples/bpf/bpf_load.c                         |   8 ++
 samples/bpf/netdrvx1_kern.c                    |  26 +++++
 samples/bpf/netdrvx1_user.c                    | 155 +++++++++++++
 14 files changed, 379 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

--
2.8.0
[RFC PATCH 5/5] Add sample for adding simple drop program to link
Add a sample program that only drops packets at the BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program, the observed single-core rate is ~14.6Mpps. Other tests were run; for instance, without the dropcnt increment or without reading from the packet header, the packet rate was mostly unchanged.

$ perf record -a samples/bpf/netdrvx1 $(
---
 samples/bpf/Makefile        |   4 ++
 samples/bpf/bpf_load.c      |   8 +++
 samples/bpf/netdrvx1_kern.c |  26 ++++
 samples/bpf/netdrvx1_user.c | 155 ++++++++++
 4 files changed, 193 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc..ad36bb8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -19,6 +19,7 @@ hostprogs-y += lathist
 hostprogs-y += offwaketime
 hostprogs-y += spintest
 hostprogs-y += map_perf_test
+hostprogs-y += netdrvx1

 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -38,6 +39,7 @@ lathist-objs := bpf_load.o libbpf.o lathist_user.o
 offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
+netdrvx1-objs := bpf_load.o libbpf.o netdrvx1_user.o

 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -56,6 +58,7 @@ always += lathist_kern.o
 always += offwaketime_kern.o
 always += spintest_kern.o
 always += map_perf_test_kern.o
+always += netdrvx1_kern.o

 HOSTCFLAGS += -I$(objtree)/usr/include
@@ -75,6 +78,7 @@ HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
+HOSTLOADLIBES_netdrvx1 += -lelf

 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 58f86bd..9308fbc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -49,6 +49,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_socket = strncmp(event, "socket", 6) == 0;
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_phys_dev = strncmp(event, "phys_dev", 8) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -63,6 +64,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
 	} else if (is_kprobe || is_kretprobe) {
 		prog_type = BPF_PROG_TYPE_KPROBE;
+	} else if (is_phys_dev) {
+		prog_type = BPF_PROG_TYPE_PHYS_DEV;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -76,6 +79,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	prog_fd[prog_cnt++] = fd;

+	if (is_phys_dev)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -304,6 +310,7 @@ int load_bpf_file(char *path)
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
+			    memcmp(shname_prog, "phys_dev", 8) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -320,6 +327,7 @@ int load_bpf_file(char *path)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "phys_dev", 8) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}

diff --git a/samples/bpf/netdrvx1_kern.c b/samples/bpf/netdrvx1_kern.c
new file mode 100644
index 000..9837d8a
--- /dev/null
+++ b/samples/bpf/netdrvx1_kern.c
@@ -0,0 +1,26 @@
+#include
+#include
+#include
+#include
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+SEC("phys_dev1")
+int bpf_prog1(struct xdp_metadata *ctx)
+{
+	int index = load_byte(ctx, ETH_HLEN + offsetof(struct iphdr, protocol));
+	long *value;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return 1;
+}
+char _license[] SEC("license") = "GPL";

diff --git a/samples/bpf/netdrvx1_user.c b/samples/bpf/netdrvx1_user.c
new file mode 100644
index 000..9e6ec9a
--- /dev/null
+++ b/samples/bpf/netdrvx1_user.c
@@
[RFC PATCH 2/5] net: add ndo to set bpf prog in adapter rx
Add a new netdev op for drivers implementing the BPF_PROG_TYPE_PHYS_DEV
filter to get configuration. Since the fd is only used by the driver to
fetch the prog, the netdev should just keep a bit to indicate the
program is valid.

Signed-off-by: Brenden Blanco
---
 include/linux/netdevice.h | 8
 net/core/dev.c            | 12
 2 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d0..c46e2e3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1102,6 +1102,11 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int (*ndo_bpf_set)(struct net_device *dev, int fd);
+ *	This function is used to set or clear a bpf program used in the
+ *	earliest stages of packet rx. The fd must be a program loaded as
+ *	BPF_PROG_TYPE_PHYS_DEV. Negative values of fd indicate the program
+ *	should be removed.
  *
  */
 struct net_device_ops {
@@ -1292,6 +1297,7 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_bpf_set)(struct net_device *dev, int fd);
 };
 
 /**
@@ -1875,6 +1881,7 @@ struct net_device {
 	struct phy_device	*phydev;
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
+	bool			bpf_valid;
 };
 
 #define to_net_dev(d) container_of(d, struct net_device, dev)
@@ -3268,6 +3275,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_bpf_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..eb93414 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6480,6 +6480,18 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
+int dev_change_bpf_fd(struct net_device *dev, int fd)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_bpf_set)
+		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
+	return ops->ndo_bpf_set(dev, fd);
+}
+EXPORT_SYMBOL(dev_change_bpf_fd);
+
 /**
  * dev_new_index - allocate an ifindex
  * @net: the applicable net namespace
-- 
2.8.0
[RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_metadata, is exposed to userspace. So far
it only exposes the packet length, and only for reading.

The PHYS_DEV name is chosen to represent that the program is meant only
for physical adapters, rather than all netdevs.

While the user visible struct is new, the underlying context must be
implemented as a minimal skb in order for the packet load_* instructions
to work. The skb filled in by the driver must have skb->len, skb->head,
and skb->data set, and skb->data_len == 0.

Signed-off-by: Brenden Blanco
---
 include/uapi/linux/bpf.h | 5
 kernel/bpf/verifier.c    | 1 +
 net/core/filter.c        | 68
 3 files changed, 74 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 924f537..b8a4ef2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -92,6 +92,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_KPROBE,
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
+	BPF_PROG_TYPE_PHYS_DEV,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -367,6 +368,10 @@ struct __sk_buff {
 	__u32 tc_classid;
 };
 
+struct xdp_metadata {
+	__u32 len;
+};
+
 struct bpf_tunnel_key {
 	__u32 tunnel_id;
 	union {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e..804ca70 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_PHYS_DEV:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index b7177d0..c417db6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+phys_dev_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	/* check bounds */
@@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_metadata))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool phys_dev_is_valid_access(int off, int size,
+				     enum bpf_access_type type)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_metadata, len):
+		break;
+	default:
+		return false;
+	}
+	return __is_valid_xdp_access(off, size, type);
+}
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
+					   int dst_reg, int src_reg,
+					   int ctx_off,
+					   struct bpf_insn *insn_buf,
+					   struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_metadata, len):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto = sk_filter_func_proto,
 	.is_valid_access = sk_filter_is_valid_access,
@@ -,6 +2278,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access = bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops phys_dev_ops = {
+	.get_func_proto = phys_dev_func_proto,
+	.is_valid_access = phys_dev_is_valid_access,
+	.convert_ctx_access = bpf_phys_dev_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops = &sk_filter_ops,
 	.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2237,11 +2299,17 @@ sta
Re: [PATCH] RDS: sync congestion map updating
On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote:
> (cc-ing netdev)
> On 3/30/2016 7:59 PM, Wengang Wang wrote:
> >
> >
> >On 2016-03-31 09:51, Wengang Wang wrote:
> >>
> >>
> >>On 2016-03-31 01:16, santosh shilimkar wrote:
> >>>Hi Wengang,
> >>>
> >>>On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
> >Problem is found that some among a lot of parallel RDS
> >communications hang. In my test ten or so among 33 communications hang. The send
> >requests got
> >-ENOBUF error meaning the peer socket (port) is congested. But
> >meanwhile,
> >peer socket (port) is not congested.
> >
> >The congestion map updating can happen in two paths: one is in
> >rds_recvmsg path
> >and the other is when it receives packets from the hardware. There
> >is no
> >synchronization when updating the congestion map. So a bit
> >operation (clearing)
> >in the rds_recvmsg path can be skipped by another bit operation
> >(setting) in
> >hardware packet receiving path.
> >
> >
> >To be more detailed. Here, the two paths (user calls recvmsg and
> >hardware receives data) are for different rds socks. Thus the
> >rds_sock->rs_recv_lock is not helpful to sync the updating on the congestion
> >map.
>
> For archive purposes, let me try to conclude the thread. I synced
> with Wengang offlist and came up with the below fix. I was under the
> impression that __set_bit_le() was the atomic version. After fixing
> it like the patch (end of the email), the bug gets addressed.
>
> I will probably send this as a fix for stable as well.
>
> From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
> From: Santosh Shilimkar
> Date: Wed, 30 Mar 2016 23:26:47 -0700
> Subject: [PATCH] RDS: Fix the atomicity for congestion map update
>
> Two different threads with different rds sockets may be in
> rds_recv_rcvbuf_delta() via the receive path. If their ports
> both map to the same word in the congestion map, then
> using non-atomic ops to update it could cause the map to
> be incorrect. Let's use atomics to avoid such an issue.
>
> Full credit to Wengang for
> finding the issue, analysing it and also pointing out
> the offending code with a spin-lock-based fix.

I'm glad that you solved the issue without spinlocks.

Out of curiosity, I see that this patch needs to be sent to Dave and
applied by him. Is that right?

➜  linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
Santosh Shilimkar (supporter:RDS - RELIABLE DATAGRAM SOCKETS)
"David S. Miller" (maintainer:NETWORKING [GENERAL])
netdev@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-r...@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
rds-de...@oss.oracle.com (moderated list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-ker...@vger.kernel.org (open list)

> Signed-off-by: Wengang Wang
> Signed-off-by: Santosh Shilimkar

Reviewed-by: Leon Romanovsky
[GIT] Networking
1) Missing device reference in IPSEC input path results in crashes during device unregistration. From Subash Abhinov Kasiviswanathan. 2) Per-queue ISR register writes not being done properly in macb driver, from Cyrille Pitchen. 3) Stats accounting bugs in bcmgenet, from Patri Gynther. 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from Quentin Armitage. 5) SXGBE driver has off-by-one in probe error paths, from Rasmus Villemoes. 6) Fix race in save/swap/delete options in netfilter ipset, from Vishwanath Pai. 7) Ageing time of bridge not set properly when not operating over a switchdev device. Fix from Haishuang Yan. 8) Fix GRO regression wrt. nested FOU/GUE based tunnels, from Alexander Duyck. 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet. 10) FEC driver should only access registers that actually exist on the given chipset, fix from Fabio Estevam. Please pull, thanks a lot! The following changes since commit e46b4e2b46e173889b1b8bd033d5e8b3acf0: Merge tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace (2016-03-24 10:52:25 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net for you to fetch changes up to db5dd0db2d8352bb7fd5e9d16e17b79d66c7e4e3: net: mvneta: fix changing MTU when using per-cpu processing (2016-04-01 15:16:37 -0400) Alexander Duyck (3): ixgbe: Store VXLAN port number in network order ixgbe: Fix ATR so that it correctly handles IPv6 extension headers gro: Allow tunnel stacking in the case of FOU/GUE Arnd Bergmann (1): openvswitch: call only into reachable nf-nat code Bjorn Helgaas (1): netpoll: Fix extra refcount release in netpoll_cleanup() Bjørn Mork (1): qmi_wwan: add "D-Link DWM-221 B1" device id Charles Keepax (1): net: macb: Only call GPIO functions if there is a valid GPIO Colin Ian King (1): qed: initialize return rc to avoid returning garbage Cosmin-Gabriel Samoila (1): Drivers: isdn: hisax: isac.c: Fix assignment and check into 
one expression. Cyrille Pitchen (2): net: macb: replace macb_writel() call by queue_writel() to update queue ISR net: macb: remove BUG_ON() and reset the queue to handle RX errors Daniel Borkmann (3): bpf: add missing map_flags to bpf_map_show_fdinfo bpf: make padding in bpf_tunnel_key explicit tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter Daniele Palmas (1): net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card David S. Miller (4): Merge branch 'hns-fixes' Merge git://git.kernel.org/.../pablo/nf Merge branch '10GbE' of git://git.kernel.org/.../jkirsher/net-queue Merge branch 'stmmac-fixes' Diego Viola (1): drivers/net/usb/plusb.c: Fix typo Emil Tantilov (2): ixgbevf: fix error code path when setting MAC address ixgbe: make __ixgbe_setup_tc static Eric Dumazet (1): ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates Fabio Estevam (1): fec: Do not access unexisting register in Coldfire Florian Westphal (3): netfilter: x_tables: validate e->target_offset early netfilter: x_tables: make sure e->next_offset covers remaining blob size netfilter: x_tables: fix unconditional helper Giuseppe CAVALLARO (3): stmmac: fix TX normal DESC Revert "stmmac: Fix 'eth0: No PHY found' regression" stmmac: fix MDIO settings Haishuang Yan (2): openvswitch: Use proper buffer size in nla_memcpy bridge: Allow set bridge ageing time when switchdev disabled Jaedon Shin (1): net: phy: bcm7xxx: Add entries for Broadcom BCM7346 and BCM7362 Jarno Rajahalme (1): openvswitch: Fix checking for new expected connections. 
Jisheng Zhang (5): net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvpp2: fix maybe-uninitialized warning net: mvpp2: use cache_line_size() to get cacheline size net: mvneta: use cache_line_size() to get cacheline size Kejian Yan (1): net: hns: fix warning of passing zero to 'PTR_ERR' Lino Sanfilippo (1): ravb: fix software timestamping Liping Zhang (1): netfilter: ipv4: fix NULL dereference Lisheng (2): net: hns: fixed the setting and getting overtime bug net: hns: set-coalesce-usecs returns errno by dsaf.ko Manish Chopra (1): qlge: Update version to 1.00.00.35 Marcelo Ricardo Leitner (1): sctp: really allow using GFP_KERNEL on sctp_packet_transmit Marcin Wojtas (1): net: mvneta: fix changing MTU when using per-cpu processing Mark Rustad (1): ixgbe: Use udelay to avoid sleeping while atomic Michael Chan (3): bnxt_en: Implement proper firmware message padding. bnxt_en: Fix typo in bnxt_hwrm_set_pause_common
Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
On 04/01/2016 11:28 PM, Daniel Borkmann wrote: On 04/01/2016 09:00 PM, David Miller wrote: From: Daniel Borkmann Date: Fri, 1 Apr 2016 11:41:03 +0200 Moreover, I noticed that when in the non-error path the __skb_pull() is done and the original offset to mac header was non-zero, we fixup from a wrong skb->data offset in the checksum complete processing. So the skb_postpush_rcsum() really needs to be done before __skb_pull() where skb->data still points to the mac header start. Ugh, what a mess, are you sure any of this is right even after your change? What happens (outside of the csum part) is this: __skb_push(offset); __vlan_insert_tag(); { skb_push(VLAN_HLEN); ... memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN); } __skb_pull(offset); If I understand this correctly, the last pull will therefore put skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header pushed by __vlan_insert_tag(). That is assuming skb->data began right after the original ethernet header. Yes, this is correct. Now, continuing this train of thought: you have skb->data pointing _currently_ at vlan_ethhdr->h_vlan_TCI. And then you call: skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN); So, we point from the ->vlan_TCI + (2 * ETH_ALEN) as start offset, and with VLAN_HLEN (= 4 bytes that we added) as length for the csum correction as input. So, we point way beyond what we actually wanted to fixup wrt csum, no? But what we actually want to sum is [h_vlan_proto + h_vlan_TCI], which is where above skb_postpush_rcsum() call points to _before_ the last __skb_pull() happens. In other words, still at that time, we have the same expectations as in __vlan_insert_tag(). To me, that postpull csum currently is absolutely in the correct spot, because it's acting upon the pull done by __vlan_insert_tag(), not the one done here by skb_vlan_push(). Right? Can you tell me how you tested this? Just curious... 
I noticed both while reviewing the code. The error path fixup is not critical for ovs or act_vlan as the skb is dropped afterwards, but not necessarily in the eBPF case, so there it matters, as eBPF doesn't know at this point what the program is going to do with it (a similar fixup is done in the __skb_vlan_pop() error path, btw). For the csum, I did a hexdump to compare what we write and what is being passed in for the csum correction. Anyway ... Aside from all this and based on your comment, I'm investigating whether for the vlan push and also pop case the __skb_pull(skb, offset) in the success case is actually enough and whether it needs to take VLAN_HLEN into account as well. But I need to run more tests on that one first. At least the skb_vlan_push() comment says "__vlan_insert_tag expect skb->data pointing to mac header. So change skb->data before calling it and change back to original position later", Jiri? For this part, what is meant by "original" position (relative to the start of the ethernet header [currently the case], or relative to some data, e.g. before/after the call, I still expect skb->data position to point to my IP header)? Thanks, Daniel
Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
On Fri, Apr 1, 2016 at 4:13 PM, Cong Wang wrote: > On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau wrote: >> + bh_lock_sock(sk); >> + if (!sock_owned_by_user(sk)) >> + ip6_datagram_dst_update(sk, false); >> + bh_unlock_sock(sk); > > > My discussion with Eric shows that we probably don't need to hold > this sock lock here, and you are Cc'ed in that thread, so > > 1) why do you still take the lock here? > 2) why didn't you involve in our discussion if you disagree? Here is the original thread: http://comments.gmane.org/gmane.linux.network/405404
Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau wrote: > + bh_lock_sock(sk); > + if (!sock_owned_by_user(sk)) > + ip6_datagram_dst_update(sk, false); > + bh_unlock_sock(sk); My discussion with Eric shows that we probably don't need to hold this sock lock here, and you are Cc'ed in that thread, so 1) why do you still take the lock here? 2) why didn't you involve in our discussion if you disagree?
[RFC PATCH net 2/4] ipv6: datagram: Refactor dst lookup and update codes to a new function
This patch moves the route lookup and update codes for connected datagram sk to a newly created function ip6_datagram_dst_update() It will be reused during the pmtu update in the later patch. Signed-off-by: Martin KaFai Lau Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- net/ipv6/datagram.c | 103 +--- 1 file changed, 57 insertions(+), 46 deletions(-) diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index f07c1dd..140665b 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -64,16 +64,65 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); } +static int ip6_datagram_dst_update(struct sock *sk) +{ + struct ip6_flowlabel *flowlabel = NULL; + struct in6_addr *final_p, final; + struct ipv6_txoptions *opt; + struct dst_entry *dst; + struct inet_sock *inet = inet_sk(sk); + struct ipv6_pinfo *np = inet6_sk(sk); + struct flowi6 fl6; + int err = 0; + + if (np->sndflow && np->flow_label) { + flowlabel = fl6_sock_lookup(sk, np->flow_label); + if (!flowlabel) + return -EINVAL; + } + ip6_datagram_flow_key_init(&fl6, sk); + + rcu_read_lock(); + opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); + final_p = fl6_update_dst(&fl6, opt, &final); + rcu_read_unlock(); + + dst = ip6_dst_lookup_flow(sk, &fl6, final_p); + if (IS_ERR(dst)) { + err = PTR_ERR(dst); + goto out; + } + + if (ipv6_addr_any(&np->saddr)) + np->saddr = fl6.saddr; + + if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { + sk->sk_v6_rcv_saddr = fl6.saddr; + inet->inet_rcv_saddr = LOOPBACK4_IPV6; + if (sk->sk_prot->rehash) + sk->sk_prot->rehash(sk); + } + + ip6_dst_store(sk, dst, + ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ? + &sk->sk_v6_daddr : NULL, +#ifdef CONFIG_IPV6_SUBTREES + ipv6_addr_equal(&fl6.saddr, &np->saddr) ? 
+ &np->saddr : +#endif + NULL); + +out: + fl6_sock_release(flowlabel); + return err; +} + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; struct inet_sock*inet = inet_sk(sk); struct ipv6_pinfo *np = inet6_sk(sk); - struct in6_addr *daddr, *final_p, final; - struct dst_entry*dst; - struct flowi6 fl6; - struct ip6_flowlabel*flowlabel = NULL; - struct ipv6_txoptions *opt; + struct in6_addr *daddr; int addr_type; int err; __be32 fl6_flowlabel = 0; @@ -91,14 +140,8 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; - if (np->sndflow) { + if (np->sndflow) fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK; - if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) { - flowlabel = fl6_sock_lookup(sk, fl6_flowlabel); - if (!flowlabel) - return -EINVAL; - } - } addr_type = ipv6_addr_type(&usin->sin6_addr); @@ -178,45 +221,13 @@ ipv4_connected: * destination cache for it. */ - ip6_datagram_flow_key_init(&fl6, sk); - - rcu_read_lock(); - opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); - final_p = fl6_update_dst(&fl6, opt, &final); - rcu_read_unlock(); - - dst = ip6_dst_lookup_flow(sk, &fl6, final_p); - err = 0; - if (IS_ERR(dst)) { - err = PTR_ERR(dst); + err = ip6_datagram_dst_update(sk); + if (err) goto out; - } - - /* source address lookup done in ip6_dst_lookup */ - - if (ipv6_addr_any(&np->saddr)) - np->saddr = fl6.saddr; - - if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { - sk->sk_v6_rcv_saddr = fl6.saddr; - inet->inet_rcv_saddr = LOOPBACK4_IPV6; - if (sk->sk_prot->rehash) - sk->sk_prot->rehash(sk); - } - - ip6_dst_store(sk, dst, - ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ? - &sk->sk_v6_daddr : NULL, -#ifdef CONFIG_IPV6_SUBTREES - ipv6_addr_equal(&fl6.saddr, &np->saddr) ? 
- &np->saddr : -#endif - NULL); sk->sk_state = TCP_ESTABLISHED; sk_set_txhash(sk); out: - fl6_sock_release(flowlabel);
[RFC PATCH net 4/4] ipv6: udp: Do a route lookup and update during release_cb
This patch adds a release_cb for UDPv6. It does a route lookup and
updates sk->sk_dst_cache if it is needed. It picks up the left-over
job from ip6_sk_update_pmtu() if the sk is owned by user during the
pmtu update.

Signed-off-by: Martin KaFai Lau
Reported-by: Wei Wang
Cc: Cong Wang
Cc: Eric Dumazet
Cc: Wei Wang
---
 include/net/ipv6.h  | 1 +
 net/ipv6/datagram.c | 12
 net/ipv6/udp.c      | 1 +
 3 files changed, 14 insertions(+)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fd02e90..1be050a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -960,6 +960,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
 int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
+void ip6_datagram_release_cb(struct sock *sk);
 
 int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
 		    int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 0b60f1e..a4f06bd 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -119,6 +119,18 @@ out:
 	return err;
 }
 
+void ip6_datagram_release_cb(struct sock *sk)
+{
+	struct dst_entry *dst;
+
+	dst = __sk_dst_get(sk);
+	if (!dst || dst->ops->check(dst, inet6_sk(sk)->dst_cookie))
+		return;
+
+	ip6_datagram_dst_update(sk, false);
+}
+EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8125931..6bc5c66 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1539,6 +1539,7 @@ struct proto udpv6_prot = {
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
 	.backlog_rcv	   = __udpv6_queue_rcv_skb,
+	.release_cb	   = ip6_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
-- 
2.5.1
[RFC PATCH net 1/4] ipv6: datagram: Refactor flowi6 init codes to a new function
Move flowi6 init codes for connected datagram sk to a newly created function ip6_datagram_flow_key_init(). It will be reused during pmtu update in the later patch. Signed-off-by: Martin KaFai Lau Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- net/ipv6/datagram.c | 50 ++ 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 4281621..f07c1dd 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -40,6 +40,30 @@ static bool ipv6_mapped_addr_any(const struct in6_addr *a) return ipv6_addr_v4mapped(a) && (a->s6_addr32[3] == 0); } +static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) +{ + struct inet_sock *inet = inet_sk(sk); + struct ipv6_pinfo *np = inet6_sk(sk); + + memset(fl6, 0, sizeof(*fl6)); + fl6->flowi6_proto = sk->sk_protocol; + fl6->daddr = sk->sk_v6_daddr; + fl6->saddr = np->saddr; + fl6->flowi6_oif = sk->sk_bound_dev_if; + fl6->flowi6_mark = sk->sk_mark; + fl6->fl6_dport = inet->inet_dport; + fl6->fl6_sport = inet->inet_sport; + fl6->flowlabel = np->flow_label; + + if (!fl6->flowi6_oif) + fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex; + + if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr)) + fl6->flowi6_oif = np->mcast_oif; + + security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); +} + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; @@ -52,6 +76,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a struct ipv6_txoptions *opt; int addr_type; int err; + __be32 fl6_flowlabel = 0; if (usin->sin6_family == AF_INET) { if (__ipv6_only_sock(sk)) @@ -66,11 +91,10 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; - memset(&fl6, 0, sizeof(fl6)); if (np->sndflow) { - fl6.flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK; - if 
(fl6.flowlabel&IPV6_FLOWLABEL_MASK) { - flowlabel = fl6_sock_lookup(sk, fl6.flowlabel); + fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK; + if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) { + flowlabel = fl6_sock_lookup(sk, fl6_flowlabel); if (!flowlabel) return -EINVAL; } @@ -145,7 +169,7 @@ ipv4_connected: } sk->sk_v6_daddr = *daddr; - np->flow_label = fl6.flowlabel; + np->flow_label = fl6_flowlabel; inet->inet_dport = usin->sin6_port; @@ -154,21 +178,7 @@ ipv4_connected: * destination cache for it. */ - fl6.flowi6_proto = sk->sk_protocol; - fl6.daddr = sk->sk_v6_daddr; - fl6.saddr = np->saddr; - fl6.flowi6_oif = sk->sk_bound_dev_if; - fl6.flowi6_mark = sk->sk_mark; - fl6.fl6_dport = inet->inet_dport; - fl6.fl6_sport = inet->inet_sport; - - if (!fl6.flowi6_oif) - fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex; - - if (!fl6.flowi6_oif && (addr_type&IPV6_ADDR_MULTICAST)) - fl6.flowi6_oif = np->mcast_oif; - - security_sk_classify_flow(sk, flowi6_to_flowi(&fl6)); + ip6_datagram_flow_key_init(&fl6, sk); rcu_read_lock(); opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); -- 2.5.1
[RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
There is a case with a connected UDP socket such that getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence could be the following:

1. Create a connected UDP socket
2. Send some datagrams out
3. Receive an ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check() logic to
   update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

This patch updates the sk->sk_dst_cache for a connected datagram sk. It is done under the '!sock_owned_by_user(sk)' condition because the user may make another ip6_datagram_connect() while dst lookup and update are happening. For the sock_owned_by_user(sk) == true case, the next patch will introduce a release_cb() which will update the sk->sk_dst_cache.

Test:

Server (Connected UDP Socket):
~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium

A simple python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
    try:
        data = s.recv(1024)
    except socket.error as se:
        if se.errno == errno.EMSGSIZE:
            pmtu = s.getsockopt(41, 24)
            print("PMTU:%d" % pmtu)
            break
s.close()

Python program output after getting an ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300

Cache routes after receiving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
    cache expires 463sec mtu 1300 pref medium

Client (Send the ICMPV6_PKT_TOOBIG):
~~~
scapy is used to generate the TOOBIG message.
Here is the scapy script I have used: >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', >>> type=0x86dd)/IPv6(src='2fac::face', >>> dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac:: 1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53) >>> sendp(p, iface='qemubr0') Signed-off-by: Martin KaFai Lau Reported-by: Wei Wang Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- include/net/ipv6.h | 1 + net/ipv6/datagram.c | 20 +++- net/ipv6/route.c| 11 +++ 3 files changed, 23 insertions(+), 9 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index d0aeb97..fd02e90 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -959,6 +959,7 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int optname, int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len); int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr, int addr_len); +int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr); int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 140665b..0b60f1e 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -64,7 +64,7 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); } -static int ip6_datagram_dst_update(struct sock *sk) +int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr) { struct ip6_flowlabel *flowlabel = NULL; struct in6_addr *final_p, final; @@ -93,14 +93,16 @@ static int ip6_datagram_dst_update(struct sock *sk) goto out; } - if (ipv6_addr_any(&np->saddr)) - np->saddr = fl6.saddr; + if (fix_sk_saddr) { + if (ipv6_addr_any(&np->saddr)) + np->saddr = fl6.saddr; - if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { - sk->sk_v6_rcv_saddr = fl6.saddr; - inet->inet_rcv_saddr = LOOPBACK4_IPV6; - if (sk->sk_prot->rehash) - sk->sk_prot->rehash(sk); + if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { + 
sk->sk_v6_rcv_saddr = fl6.saddr; + inet->inet_rcv_saddr = LOOPBACK4_IPV6; + if (sk->sk_prot->rehash) + sk->sk_prot->rehash(sk); + } } ip6_dst_store(sk, dst, @@ -221,7 +223,7 @@ ipv4_connected: * destination cache for it. */ - err = ip6_datagram_dst_update(sk); + err = ip6_datagram_dst_update(sk, true); if (err) goto out; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index ed44663..f7e6a6d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1417,8 +1417,19 @@ EXPORT_SYMBOL_GPL(ip6_
[RFC PATCH net 0/4] ip6: datagram: Update dst cache of a connected datagram sk during pmtu update
There is a case with a connected UDP socket where getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence is the following: 1. Create a connected UDP socket 2. Send some datagrams out 3. Receive an ICMPV6_PKT_TOOBIG 4. No new outgoing datagrams arrive to trigger the sk_dst_check() logic that updates sk->sk_dst_cache. 5. getsockopt(IPV6_MTU) returns the MTU from the invalid sk->sk_dst_cache instead of from the newly created RTF_CACHE clone. Patches 1 and 2 are the prep work. Patches 3 and 4 are the fixes.
Re: [v7, 0/5] Fix eSDHC host version register bug
On Fri, 2016-04-01 at 11:07 +0800, Yangbo Lu wrote: > This patchset is used to fix a host version register bug in the T4240-R1.0 > -R2.0 > eSDHC controller. To get the SoC version and revision, it's needed to add > the > GUTS driver to access the global utilities registers. > > So, the first three patches are to add the GUTS driver. > The following two patches are to enable GUTS driver support to get SVR in > eSDHC > driver and fix host version for T4240. > > Yangbo Lu (5): > ARM64: dts: ls2080a: add device configuration node > soc: fsl: add GUTS driver for QorIQ platforms > dt: move guts devicetree doc out of powerpc directory > powerpc/fsl: move mpc85xx.h to include/linux/fsl > mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0 Acked-by: Scott Wood -Scott
[net-next PATCH 2/2 v4] ibmvnic: enable RX checksum offload
Enable RX Checksum offload feature in the ibmvnic driver. Signed-off-by: Thomas Falcon Cc: John Allen --- v4: this patch included since it is enabled by the previous patch --- drivers/net/ethernet/ibm/ibmvnic.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 4e97e76..21bccf6 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -2105,6 +2105,10 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter) if (buf->tcp_ipv6_chksum || buf->udp_ipv6_chksum) adapter->netdev->features |= NETIF_F_IPV6_CSUM; + if ((adapter->netdev->features & + (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM))) + adapter->netdev->features |= NETIF_F_RXCSUM; + memset(&crq, 0, sizeof(crq)); crq.control_ip_offload.first = IBMVNIC_CRQ_CMD; crq.control_ip_offload.cmd = CONTROL_IP_OFFLOAD; -- 2.4.11
[net-next PATCH 1/2 v4] ibmvnic: map L2/L3/L4 header descriptors to firmware
Allow the VNIC driver to provide descriptors containing L2/L3/L4 headers to firmware. This feature is needed for greater hardware compatibility and enablement of checksum and TCP offloading features. A new function is included for the hypervisor call, H_SEND_SUBCRQ_INDIRECT, allowing a DMA-mapped array of SCRQ descriptor elements to be sent to the VNIC server. These additions will help fully enable checksum offloading as well as other features as they are included later. Signed-off-by: Thomas Falcon Cc: John Allen --- v2: Fixed typo error caught by kbuild test bot v3: Fixed erroneous patch sender v4: sorry for the delay in resending, Thanks to David Miller for comments, removed all extra memory allocations, merged some helper functions, calculate all header lengths to meet firmware requirements, fixed endian bugs in the send_subcrq_indirect --- drivers/net/ethernet/ibm/ibmvnic.c | 195 - drivers/net/ethernet/ibm/ibmvnic.h | 3 + 2 files changed, 194 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 6e9e16ee..4e97e76 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -61,6 +61,7 @@ #include #include #include +#include #include #include #include @@ -94,6 +95,7 @@ static int ibmvnic_reenable_crq_queue(struct ibmvnic_adapter *); static int ibmvnic_send_crq(struct ibmvnic_adapter *, union ibmvnic_crq *); static int send_subcrq(struct ibmvnic_adapter *adapter, u64 remote_handle, union sub_crq *sub_crq); +static int send_subcrq_indirect(struct ibmvnic_adapter *, u64, u64, u64); static irqreturn_t ibmvnic_interrupt_rx(int irq, void *instance); static int enable_scrq_irq(struct ibmvnic_adapter *, struct ibmvnic_sub_crq_queue *); @@ -561,10 +563,141 @@ static int ibmvnic_close(struct net_device *netdev) return 0; } +/** + * build_hdr_data - creates L2/L3/L4 header data buffer + * @hdr_field - bitfield determining needed headers + * @skb - socket buffer + * @hdr_len - 
array of header lengths + * @tot_len - total length of data + * + * Reads hdr_field to determine which headers are needed by firmware. + * Builds a buffer containing these headers. Saves individual header + * lengths and total buffer length to be used to build descriptors. + */ +static int build_hdr_data(u8 hdr_field, struct sk_buff *skb, + int *hdr_len, u8 *hdr_data) +{ + int len = 0; + u8 *hdr; + + hdr_len[0] = sizeof(struct ethhdr); + + if (skb->protocol == htons(ETH_P_IP)) { + hdr_len[1] = ip_hdr(skb)->ihl * 4; + if (ip_hdr(skb)->protocol == IPPROTO_TCP) + hdr_len[2] = tcp_hdrlen(skb); + else if (ip_hdr(skb)->protocol == IPPROTO_UDP) + hdr_len[2] = sizeof(struct udphdr); + } else if (skb->protocol == htons(ETH_P_IPV6)) { + hdr_len[1] = sizeof(struct ipv6hdr); + if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP) + hdr_len[2] = tcp_hdrlen(skb); + else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP) + hdr_len[2] = sizeof(struct udphdr); + } + + memset(hdr_data, 0, 120); + if ((hdr_field >> 6) & 1) { + hdr = skb_mac_header(skb); + memcpy(hdr_data, hdr, hdr_len[0]); + len += hdr_len[0]; + } + + if ((hdr_field >> 5) & 1) { + hdr = skb_network_header(skb); + memcpy(hdr_data + len, hdr, hdr_len[1]); + len += hdr_len[1]; + } + + if ((hdr_field >> 4) & 1) { + hdr = skb_transport_header(skb); + memcpy(hdr_data + len, hdr, hdr_len[2]); + len += hdr_len[2]; + } + return len; +} + +/** + * create_hdr_descs - create header and header extension descriptors + * @hdr_field - bitfield determining needed headers + * @data - buffer containing header data + * @len - length of data buffer + * @hdr_len - array of individual header lengths + * @scrq_arr - descriptor array + * + * Creates header and, if needed, header extension descriptors and + * places them in a descriptor array, scrq_arr + */ + +static void create_hdr_descs(u8 hdr_field, u8 *hdr_data, int len, int *hdr_len, +union sub_crq *scrq_arr) +{ + union sub_crq hdr_desc; + int tmp_len = len; + u8 *data, *cur; + int tmp; + + while (tmp_len 
> 0) { + cur = hdr_data + len - tmp_len; + + memset(&hdr_desc, 0, sizeof(hdr_desc)); + if (cur != hdr_data) { + data = hdr_desc.hdr_ext.data; + tmp = tmp_len > 29 ? 29 : tmp_len; + hdr_desc.hdr_ext.first = IBMVNIC_CRQ_CMD; + hdr_desc.hdr_ext.type = IBMVNIC_HDR_
Re: [Odd commit author id merge via netdev]
On 4/1/2016 1:01 PM, Johannes Berg wrote: On Fri, 2016-04-01 at 10:51 -0700, santosh shilimkar wrote: Hi Dave, I noticed something odd while checking that my recent commits in the kernel.org tree made it via netdev. I don't know if it's the patchwork tool doing this. The usual author line in my git objects is: Author: Santosh Shilimkar But the commits going via your tree seem to be like below: Author: email-id A few more examples of such commits are at the end of the email. Can this be fixed for future commits? The git objects you pulled from my tree directly have the right author format, whereas the ones picked up from patchwork seem to be odd. Patchwork does store this info somehow and re-uses it, quite possibly from the very first patch you ever sent. I think this bug was *just* fixed in patchwork, but it'll probably be a while until that fix lands. However, you can go and create a patchwork account with the real name, associate it with all the email addresses you use, and then I think it'll pick it up. Not entirely sure though, you'll have to test it. I will try that. Thanks for the tip. Regards, Santosh
Re: Question on rhashtable in worst-case scenario.
On Fri, 2016-04-01 at 08:46 +0800, Herbert Xu wrote: > On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote: > > > > > > Does removing this completely disable the "-EEXIST" error? I can't > > say > > I fully understand the elasticity stuff in > > __rhashtable_insert_fast(). > What EEXIST error are you talking about? The only one that can be > returned on insertion is if you're explicitly checking for dups > which clearly can't be the case for you. I was thinking about that one - it's not obvious to me from the code how this "explicitly checking for dups" would be done or let's say how rhashtable differentiates. But since it seems to work for Ben until hitting a certain number of identical keys, surely that's just me not understanding the code rather than anything else :) johannes
Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
On 04/01/2016 09:00 PM, David Miller wrote: From: Daniel Borkmann Date: Fri, 1 Apr 2016 11:41:03 +0200 Moreover, I noticed that when in the non-error path the __skb_pull() is done and the original offset to mac header was non-zero, we fixup from a wrong skb->data offset in the checksum complete processing. So the skb_postpush_rcsum() really needs to be done before __skb_pull() where skb->data still points to the mac header start. Ugh, what a mess, are you sure any of this is right even after your change? What happens (outside of the csum part) is this: __skb_push(offset); __vlan_insert_tag(); { skb_push(VLAN_HLEN); ... memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN); } __skb_pull(offset); If I understand this correctly, the last pull will therefore put skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header pushed by __vlan_insert_tag(). That is assuming skb->data began right after the original ethernet header. Yes, this is correct. Now, continuing this train of thought: you have skb->data pointing _currently_ at vlan_ethhdr->h_vlan_TCI. And then you call: skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN); So, we point from the ->vlan_TCI + (2 * ETH_ALEN) as start offset, and with VLAN_HLEN (= 4 bytes that we added) as length for the csum correction as input. So, we point way beyond what we actually wanted to fixup wrt csum, no? But what we actually want to sum is [h_vlan_proto + h_vlan_TCI], which is where above skb_postpush_rcsum() call points to _before_ the last __skb_pull() happens. In other words, still at that time, we have the same expectations as in __vlan_insert_tag(). To me, that postpull csum currently is absolutely in the correct spot, because it's acting upon the pull done by __vlan_insert_tag(), not the one done here by skb_vlan_push(). Right? Can you tell me how you tested this? Just curious... 
I noticed both while reviewing the code. The error path fixup is not critical for ovs or act_vlan, as the skb is dropped afterwards, but that is not necessarily so in the eBPF case; there it matters, as eBPF doesn't know at this point what the program is going to do with the skb (a similar fixup is done in the __skb_vlan_pop() error path, btw). For the csum, I did a hexdump to compare what we write and what is being passed in for the csum correction. Anyway ... Aside from all this and based on your comment, I'm investigating whether, for both the vlan push and pop cases, the __skb_pull(skb, offset) in the success case is actually enough or whether it needs to take VLAN_HLEN into account as well. But I need to do more testing for that one first. At least the skb_vlan_push() comment says "__vlan_insert_tag expect skb->data pointing to mac header. So change skb->data before calling it and change back to original position later", Jiri?
[PATCH v4 net-next 11/15] nfp: sync ring state during FW reconfiguration
FW reconfiguration in .ndo_open()/.ndo_stop() should reset/ restore queue state. Since we need IRQs to be disabled when filling rings on RX path we have to move disable_irq() from .ndo_open() all the way up to IRQ allocation. nfp_net_start_vec() becomes trivial now so it's inlined. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 45 -- 1 file changed, 16 insertions(+), 29 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index f171a7da8931..2878ac021eda 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1520,6 +1520,7 @@ nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, nn_err(nn, "Error requesting IRQ %d\n", entry->vector); return err; } + disable_irq(entry->vector); /* Setup NAPI */ netif_napi_add(nn->netdev, &r_vec->napi, @@ -1648,13 +1649,14 @@ static void nfp_net_clear_config_and_disable(struct nfp_net *nn) nn_writel(nn, NFP_NET_CFG_CTRL, new_ctrl); err = nfp_net_reconfig(nn, update); - if (err) { + if (err) nn_err(nn, "Could not disable device: %d\n", err); - return; - } - for (r = 0; r < nn->num_r_vecs; r++) + for (r = 0; r < nn->num_r_vecs; r++) { + nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring); + nfp_net_tx_ring_reset(nn, nn->r_vecs[r].tx_ring); nfp_net_vec_clear_ring_data(nn, r); + } nn->ctrl = new_ctrl; } @@ -1722,6 +1724,9 @@ static int __nfp_net_set_config_and_enable(struct nfp_net *nn) nn->ctrl = new_ctrl; + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_rx_ring_fill_freelist(nn->r_vecs[r].rx_ring); + /* Since reconfiguration requests while NFP is down are ignored we * have to wipe the entire VXLAN configuration and reinitialize it. 
*/ @@ -1750,26 +1755,6 @@ static int nfp_net_set_config_and_enable(struct nfp_net *nn) } /** - * nfp_net_start_vec() - Start ring vector - * @nn: NFP Net device structure - * @r_vec: Ring vector to be started - */ -static void -nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) -{ - unsigned int irq_vec; - - irq_vec = nn->irq_entries[r_vec->irq_idx].vector; - - disable_irq(irq_vec); - - nfp_net_rx_ring_fill_freelist(r_vec->rx_ring); - napi_enable(&r_vec->napi); - - enable_irq(irq_vec); -} - -/** * nfp_net_open_stack() - Start the device from stack's perspective * @nn: NFP Net device to reconfigure */ @@ -1777,8 +1762,10 @@ static void nfp_net_open_stack(struct nfp_net *nn) { unsigned int r; - for (r = 0; r < nn->num_r_vecs; r++) - nfp_net_start_vec(nn, &nn->r_vecs[r]); + for (r = 0; r < nn->num_r_vecs; r++) { + napi_enable(&nn->r_vecs[r].napi); + enable_irq(nn->irq_entries[nn->r_vecs[r].irq_idx].vector); + } netif_tx_wake_all_queues(nn->netdev); @@ -1903,8 +1890,10 @@ static void nfp_net_close_stack(struct nfp_net *nn) netif_carrier_off(nn->netdev); nn->link_up = false; - for (r = 0; r < nn->num_r_vecs; r++) + for (r = 0; r < nn->num_r_vecs; r++) { + disable_irq(nn->irq_entries[nn->r_vecs[r].irq_idx].vector); napi_disable(&nn->r_vecs[r].napi); + } netif_tx_disable(nn->netdev); } @@ -1918,9 +1907,7 @@ static void nfp_net_close_free_all(struct nfp_net *nn) unsigned int r; for (r = 0; r < nn->num_r_vecs; r++) { - nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring); nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring); - nfp_net_tx_ring_reset(nn, nn->r_vecs[r].tx_ring); nfp_net_rx_ring_free(nn->r_vecs[r].rx_ring); nfp_net_tx_ring_free(nn->r_vecs[r].tx_ring); nfp_net_cleanup_vector(nn, &nn->r_vecs[r]); -- 1.9.1
[PATCH v4 net-next 08/15] nfp: preallocate RX buffers early in .ndo_open
We want the .ndo_open() to have following structure: - allocate resources; - configure HW/FW; - enable the device from stack perspective. Therefore filling RX rings needs to be moved to the beginning of .ndo_open(). Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 34 +++--- 1 file changed, 11 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index b057102769f9..c04706cd7d51 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1667,28 +1667,19 @@ static void nfp_net_clear_config_and_disable(struct nfp_net *nn) * @nn: NFP Net device structure * @r_vec: Ring vector to be started */ -static int nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) +static void +nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) { unsigned int irq_vec; - int err = 0; irq_vec = nn->irq_entries[r_vec->irq_idx].vector; disable_irq(irq_vec); - err = nfp_net_rx_ring_bufs_alloc(r_vec->nfp_net, r_vec->rx_ring); - if (err) { - nn_err(nn, "RV%02d: couldn't allocate enough buffers\n", - r_vec->irq_idx); - goto out; - } nfp_net_rx_ring_fill_freelist(r_vec->rx_ring); - napi_enable(&r_vec->napi); -out: - enable_irq(irq_vec); - return err; + enable_irq(irq_vec); } static int nfp_net_netdev_open(struct net_device *netdev) @@ -1743,6 +1734,10 @@ static int nfp_net_netdev_open(struct net_device *netdev) err = nfp_net_rx_ring_alloc(nn->r_vecs[r].rx_ring); if (err) goto err_free_tx_ring_p; + + err = nfp_net_rx_ring_bufs_alloc(nn, nn->r_vecs[r].rx_ring); + if (err) + goto err_flush_rx_ring_p; } err = netif_set_real_num_tx_queues(netdev, nn->num_tx_rings); @@ -1815,11 +1810,8 @@ static int nfp_net_netdev_open(struct net_device *netdev) * - enable all TX queues * - set link state */ - for (r = 0; r < nn->num_r_vecs; r++) { - err = nfp_net_start_vec(nn, &nn->r_vecs[r]); 
- if (err) - goto err_disable_napi; - } + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_start_vec(nn, &nn->r_vecs[r]); netif_tx_wake_all_queues(netdev); @@ -1828,18 +1820,14 @@ static int nfp_net_netdev_open(struct net_device *netdev) return 0; -err_disable_napi: - while (r--) { - napi_disable(&nn->r_vecs[r].napi); - nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring); - nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring); - } err_clear_config: nfp_net_clear_config_and_disable(nn); err_free_rings: r = nn->num_r_vecs; err_free_prev_vecs: while (r--) { + nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring); +err_flush_rx_ring_p: nfp_net_rx_ring_free(nn->r_vecs[r].rx_ring); err_free_tx_ring_p: nfp_net_tx_ring_free(nn->r_vecs[r].tx_ring); -- 1.9.1
RE: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
| For transmit we can leave the IP ID code as is. For receive we should not be | snooping into the IP ID for any frames that have the DF bit set as devices | that have adopted RFC 6864 on their transmit path will end up causing issues. Currently, GRO does not coalesce TCP packets originating from nodes which do IPv6-to-IPv4 translation. These packets have DF set to 1 and the IP ID set to 0, per https://tools.ietf.org/html/rfc6145#page-17. With this patch, we should be able to see coalescing happen for those cases as well.
[PATCH v4 net-next 02/15] nfp: move link state interrupt request/free calls
We need to be able to disable the link state interrupt when the device is brought down. We used to just free the IRQ at the beginning of .ndo_stop(). As we now move towards more ordered .ndo_open()/.ndo_stop() paths LSC allocation should be placed in the "allocate resource" section. Since the IRQ can't be freed early in .ndo_stop(), it is disabled instead. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 23 +++--- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 307c02c4ba4a..9ce04b179fac 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1730,10 +1730,16 @@ static int nfp_net_netdev_open(struct net_device *netdev) NFP_NET_IRQ_EXN_IDX, nn->exn_handler); if (err) return err; + err = nfp_net_aux_irq_request(nn, NFP_NET_CFG_LSC, "%s-lsc", + nn->lsc_name, sizeof(nn->lsc_name), + NFP_NET_IRQ_LSC_IDX, nn->lsc_handler); + if (err) + goto err_free_exn; + disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); err = nfp_net_alloc_rings(nn); if (err) - goto err_free_exn; + goto err_free_lsc; err = netif_set_real_num_tx_queues(netdev, nn->num_tx_rings); if (err) @@ -1813,19 +1819,11 @@ static int nfp_net_netdev_open(struct net_device *netdev) netif_tx_wake_all_queues(netdev); - err = nfp_net_aux_irq_request(nn, NFP_NET_CFG_LSC, "%s-lsc", - nn->lsc_name, sizeof(nn->lsc_name), - NFP_NET_IRQ_LSC_IDX, nn->lsc_handler); - if (err) - goto err_stop_tx; + enable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); nfp_net_read_link_status(nn); return 0; -err_stop_tx: - netif_tx_disable(netdev); - for (r = 0; r < nn->num_r_vecs; r++) - nfp_net_tx_flush(nn->r_vecs[r].tx_ring); err_disable_napi: while (r--) { napi_disable(&nn->r_vecs[r].napi); @@ -1835,6 +1833,8 @@ err_clear_config: nfp_net_clear_config_and_disable(nn); err_free_rings: nfp_net_free_rings(nn); 
+err_free_lsc: + nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); err_free_exn: nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); return err; @@ -1856,7 +1856,7 @@ static int nfp_net_netdev_close(struct net_device *netdev) /* Step 1: Disable RX and TX rings from the Linux kernel perspective */ - nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); + disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); netif_carrier_off(netdev); nn->link_up = false; @@ -1877,6 +1877,7 @@ static int nfp_net_netdev_close(struct net_device *netdev) } nfp_net_free_rings(nn); + nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); nn_dbg(nn, "%s down", netdev->name); -- 1.9.1
[PATCH v4 net-next 03/15] nfp: break up nfp_net_{alloc|free}_rings
nfp_net_{alloc|free}_rings contained strange mix of allocations and vector initialization. Remove it, declare vector init as a separate function and handle allocations explicitly. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 126 - 1 file changed, 47 insertions(+), 79 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 9ce04b179fac..b435d15ef8d6 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1489,91 +1489,40 @@ err_alloc: return -ENOMEM; } -static void __nfp_net_free_rings(struct nfp_net *nn, unsigned int n_free) +static int +nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, + int idx) { - struct nfp_net_r_vector *r_vec; - struct msix_entry *entry; + struct msix_entry *entry = &nn->irq_entries[r_vec->irq_idx]; + int err; - while (n_free--) { - r_vec = &nn->r_vecs[n_free]; - entry = &nn->irq_entries[r_vec->irq_idx]; + snprintf(r_vec->name, sizeof(r_vec->name), +"%s-rxtx-%d", nn->netdev->name, idx); + err = request_irq(entry->vector, r_vec->handler, 0, r_vec->name, r_vec); + if (err) { + nn_err(nn, "Error requesting IRQ %d\n", entry->vector); + return err; + } - nfp_net_rx_ring_free(r_vec->rx_ring); - nfp_net_tx_ring_free(r_vec->tx_ring); + /* Setup NAPI */ + netif_napi_add(nn->netdev, &r_vec->napi, + nfp_net_poll, NAPI_POLL_WEIGHT); - irq_set_affinity_hint(entry->vector, NULL); - free_irq(entry->vector, r_vec); + irq_set_affinity_hint(entry->vector, &r_vec->affinity_mask); - netif_napi_del(&r_vec->napi); - } -} + nn_dbg(nn, "RV%02d: irq=%03d/%03d\n", idx, entry->vector, entry->entry); -/** - * nfp_net_free_rings() - Free all ring resources - * @nn: NFP Net device to reconfigure - */ -static void nfp_net_free_rings(struct nfp_net *nn) -{ - __nfp_net_free_rings(nn, nn->num_r_vecs); + return 0; } -/** - * nfp_net_alloc_rings() - Allocate 
resources for RX and TX rings - * @nn: NFP Net device to reconfigure - * - * Return: 0 on success or negative errno on error. - */ -static int nfp_net_alloc_rings(struct nfp_net *nn) +static void +nfp_net_cleanup_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) { - struct nfp_net_r_vector *r_vec; - struct msix_entry *entry; - int err; - int r; + struct msix_entry *entry = &nn->irq_entries[r_vec->irq_idx]; - for (r = 0; r < nn->num_r_vecs; r++) { - r_vec = &nn->r_vecs[r]; - entry = &nn->irq_entries[r_vec->irq_idx]; - - /* Setup NAPI */ - netif_napi_add(nn->netdev, &r_vec->napi, - nfp_net_poll, NAPI_POLL_WEIGHT); - - snprintf(r_vec->name, sizeof(r_vec->name), -"%s-rxtx-%d", nn->netdev->name, r); - err = request_irq(entry->vector, r_vec->handler, 0, - r_vec->name, r_vec); - if (err) { - nn_dbg(nn, "Error requesting IRQ %d\n", entry->vector); - goto err_napi_del; - } - - irq_set_affinity_hint(entry->vector, &r_vec->affinity_mask); - - nn_dbg(nn, "RV%02d: irq=%03d/%03d\n", - r, entry->vector, entry->entry); - - /* Allocate TX ring resources */ - err = nfp_net_tx_ring_alloc(r_vec->tx_ring); - if (err) - goto err_free_irq; - - /* Allocate RX ring resources */ - err = nfp_net_rx_ring_alloc(r_vec->rx_ring); - if (err) - goto err_free_tx; - } - - return 0; - -err_free_tx: - nfp_net_tx_ring_free(r_vec->tx_ring); -err_free_irq: irq_set_affinity_hint(entry->vector, NULL); - free_irq(entry->vector, r_vec); -err_napi_del: netif_napi_del(&r_vec->napi); - __nfp_net_free_rings(nn, r); - return err; + free_irq(entry->vector, r_vec); } /** @@ -1737,9 +1686,19 @@ static int nfp_net_netdev_open(struct net_device *netdev) goto err_free_exn; disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); - err = nfp_net_alloc_rings(nn); - if (err) - goto err_free_lsc; + for (r = 0; r < nn->num_r_vecs; r++) { + err = nfp_net_prepare_vector(nn, &nn->r_vecs[r], r); + if (err) + goto err_free_prev_vecs; + + err = nfp_net_tx_ring_alloc(nn->r_vecs[r].tx_ring); + if (err) + goto e
[PATCH v4 net-next 09/15] nfp: move filling ring information to FW config
nfp_net_[rt]x_ring_{alloc,free} should only allocate or free ring resources without touching the device. Move setting parameters in the BAR to separate functions. This will make it possible to reuse alloc/free functions to allocate new rings while the device is running. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 50 ++ 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index c04706cd7d51..f504de12ed2a 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1388,10 +1388,6 @@ static void nfp_net_tx_ring_free(struct nfp_net_tx_ring *tx_ring) struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; - nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(tx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_TXR_SZ(tx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_TXR_VEC(tx_ring->idx), 0); - kfree(tx_ring->txbufs); if (tx_ring->txds) @@ -1431,11 +1427,6 @@ static int nfp_net_tx_ring_alloc(struct nfp_net_tx_ring *tx_ring) if (!tx_ring->txbufs) goto err_alloc; - /* Write the DMA address, size and MSI-X info to the device */ - nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(tx_ring->idx), tx_ring->dma); - nn_writeb(nn, NFP_NET_CFG_TXR_SZ(tx_ring->idx), ilog2(tx_ring->cnt)); - nn_writeb(nn, NFP_NET_CFG_TXR_VEC(tx_ring->idx), r_vec->irq_idx); - netif_set_xps_queue(nn->netdev, &r_vec->affinity_mask, tx_ring->idx); nn_dbg(nn, "TxQ%02d: QCidx=%02d cnt=%d dma=%#llx host=%p\n", @@ -1459,10 +1450,6 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring) struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; - nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(rx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_RXR_SZ(rx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_RXR_VEC(rx_ring->idx), 0); - kfree(rx_ring->rxbufs); if (rx_ring->rxds) @@ -1502,11 +1489,6 @@ static int 
nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring) if (!rx_ring->rxbufs) goto err_alloc; - /* Write the DMA address, size and MSI-X info to the device */ - nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(rx_ring->idx), rx_ring->dma); - nn_writeb(nn, NFP_NET_CFG_RXR_SZ(rx_ring->idx), ilog2(rx_ring->cnt)); - nn_writeb(nn, NFP_NET_CFG_RXR_VEC(rx_ring->idx), r_vec->irq_idx); - nn_dbg(nn, "RxQ%02d: FlQCidx=%02d RxQCidx=%02d cnt=%d dma=%#llx host=%p\n", rx_ring->idx, rx_ring->fl_qcidx, rx_ring->rx_qcidx, rx_ring->cnt, (unsigned long long)rx_ring->dma, rx_ring->rxds); @@ -1631,6 +1613,17 @@ static void nfp_net_write_mac_addr(struct nfp_net *nn, const u8 *mac) get_unaligned_be16(nn->netdev->dev_addr + 4) << 16); } +static void nfp_net_vec_clear_ring_data(struct nfp_net *nn, unsigned int idx) +{ + nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(idx), 0); + nn_writeb(nn, NFP_NET_CFG_RXR_SZ(idx), 0); + nn_writeb(nn, NFP_NET_CFG_RXR_VEC(idx), 0); + + nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(idx), 0); + nn_writeb(nn, NFP_NET_CFG_TXR_SZ(idx), 0); + nn_writeb(nn, NFP_NET_CFG_TXR_VEC(idx), 0); +} + /** * nfp_net_clear_config_and_disable() - Clear control BAR and disable NFP * @nn: NFP Net device to reconfigure @@ -1638,6 +1631,7 @@ static void nfp_net_write_mac_addr(struct nfp_net *nn, const u8 *mac) static void nfp_net_clear_config_and_disable(struct nfp_net *nn) { u32 new_ctrl, update; + unsigned int r; int err; new_ctrl = nn->ctrl; @@ -1659,9 +1653,26 @@ static void nfp_net_clear_config_and_disable(struct nfp_net *nn) return; } + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_vec_clear_ring_data(nn, r); + nn->ctrl = new_ctrl; } +static void +nfp_net_vec_write_ring_data(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, + unsigned int idx) +{ + /* Write the DMA address, size and MSI-X info to the device */ + nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(idx), r_vec->rx_ring->dma); + nn_writeb(nn, NFP_NET_CFG_RXR_SZ(idx), ilog2(r_vec->rx_ring->cnt)); + nn_writeb(nn, NFP_NET_CFG_RXR_VEC(idx), r_vec->irq_idx); + + 
nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(idx), r_vec->tx_ring->dma); + nn_writeb(nn, NFP_NET_CFG_TXR_SZ(idx), ilog2(r_vec->tx_ring->cnt)); + nn_writeb(nn, NFP_NET_CFG_TXR_VEC(idx), r_vec->irq_idx); +} + /** * nfp_net_start_vec() - Start ring vector * @nn: NFP Net device structure @@ -1769,6 +1780,9 @@ static int nfp_net_netdev_open(struct net_device *netdev) * - Set the Freelist buffer size * - Enable the FW */ + for (r = 0; r < nn->num_r_vecs;
[PATCH v4 net-next 15/15] nfp: allow ring size reconfiguration at runtime
Since much of the required changes have already been made for changing MTU at runtime let's use it for ring size changes as well. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 1 + .../net/ethernet/netronome/nfp/nfp_net_common.c| 126 + .../net/ethernet/netronome/nfp/nfp_net_ethtool.c | 30 ++--- 3 files changed, 136 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 1e08c9cf3ee0..90ad6264e62c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -724,6 +724,7 @@ void nfp_net_rss_write_key(struct nfp_net *nn); void nfp_net_coalesce_write_cfg(struct nfp_net *nn); int nfp_net_irqs_alloc(struct nfp_net *nn); void nfp_net_irqs_disable(struct nfp_net *nn); +int nfp_net_set_ring_size(struct nfp_net *nn, u32 rxd_cnt, u32 txd_cnt); #ifdef CONFIG_NFP_NET_DEBUG void nfp_net_debugfs_create(void); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 631168a1971e..57f330a4736e 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1445,6 +1445,59 @@ err_alloc: return -ENOMEM; } +static struct nfp_net_tx_ring * +nfp_net_shadow_tx_rings_prepare(struct nfp_net *nn, u32 buf_cnt) +{ + struct nfp_net_tx_ring *rings; + unsigned int r; + + rings = kcalloc(nn->num_tx_rings, sizeof(*rings), GFP_KERNEL); + if (!rings) + return NULL; + + for (r = 0; r < nn->num_tx_rings; r++) { + nfp_net_tx_ring_init(&rings[r], nn->tx_rings[r].r_vec, r); + + if (nfp_net_tx_ring_alloc(&rings[r], buf_cnt)) + goto err_free_prev; + } + + return rings; + +err_free_prev: + while (r--) + nfp_net_tx_ring_free(&rings[r]); + kfree(rings); + return NULL; +} + +static struct nfp_net_tx_ring * +nfp_net_shadow_tx_rings_swap(struct nfp_net *nn, struct nfp_net_tx_ring *rings) +{ + struct nfp_net_tx_ring 
*old = nn->tx_rings; + unsigned int r; + + for (r = 0; r < nn->num_tx_rings; r++) + old[r].r_vec->tx_ring = &rings[r]; + + nn->tx_rings = rings; + return old; +} + +static void +nfp_net_shadow_tx_rings_free(struct nfp_net *nn, struct nfp_net_tx_ring *rings) +{ + unsigned int r; + + if (!rings) + return; + + for (r = 0; r < nn->num_tx_rings; r++) + nfp_net_tx_ring_free(&rings[r]); + + kfree(rings); +} + /** * nfp_net_rx_ring_free() - Free resources allocated to a RX ring * @rx_ring: RX ring to free @@ -1561,6 +1614,9 @@ nfp_net_shadow_rx_rings_free(struct nfp_net *nn, struct nfp_net_rx_ring *rings) { unsigned int r; + if (!rings) + return; + for (r = 0; r < nn->num_r_vecs; r++) { nfp_net_rx_ring_bufs_free(nn, &rings[r]); nfp_net_rx_ring_free(&rings[r]); @@ -2106,6 +2162,76 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) return err; } +int nfp_net_set_ring_size(struct nfp_net *nn, u32 rxd_cnt, u32 txd_cnt) +{ + struct nfp_net_tx_ring *tx_rings = NULL; + struct nfp_net_rx_ring *rx_rings = NULL; + u32 old_rxd_cnt, old_txd_cnt; + int err; + + if (!netif_running(nn->netdev)) { + nn->rxd_cnt = rxd_cnt; + nn->txd_cnt = txd_cnt; + return 0; + } + + old_rxd_cnt = nn->rxd_cnt; + old_txd_cnt = nn->txd_cnt; + + /* Prepare new rings */ + if (nn->rxd_cnt != rxd_cnt) { + rx_rings = nfp_net_shadow_rx_rings_prepare(nn, nn->fl_bufsz, + rxd_cnt); + if (!rx_rings) + return -ENOMEM; + } + if (nn->txd_cnt != txd_cnt) { + tx_rings = nfp_net_shadow_tx_rings_prepare(nn, txd_cnt); + if (!tx_rings) { + nfp_net_shadow_rx_rings_free(nn, rx_rings); + return -ENOMEM; + } + } + + /* Stop device, swap in new rings, try to start the firmware */ + nfp_net_close_stack(nn); + nfp_net_clear_config_and_disable(nn); + + if (rx_rings) + rx_rings = nfp_net_shadow_rx_rings_swap(nn, rx_rings); + if (tx_rings) + tx_rings = nfp_net_shadow_tx_rings_swap(nn, tx_rings); + + nn->rxd_cnt = rxd_cnt; + nn->txd_cnt = txd_cnt; + + err = nfp_net_set_config_and_enable(nn); + if (err) { + const int 
err_new = err; + + /* Try with old configuration and old rings */ + if (rx_rings) + rx_rings = nfp_net_shadow_rx_rings_swap(nn, rx_rings); +
[PATCH v4 net-next 05/15] nfp: allocate ring SW structs dynamically
To be able to switch rings more easily on config changes, allocate them dynamically, separately from the nfp_net structure. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 6 ++--- .../net/ethernet/netronome/nfp/nfp_net_common.c| 28 +- 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index ab264e1bccd0..0a87571a7d9c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -472,6 +472,9 @@ struct nfp_net { u32 rx_offset; + struct nfp_net_tx_ring *tx_rings; + struct nfp_net_rx_ring *rx_rings; + #ifdef CONFIG_PCI_IOV unsigned int num_vfs; struct vf_data_storage *vfinfo; @@ -504,9 +507,6 @@ struct nfp_net { int txd_cnt; int rxd_cnt; - struct nfp_net_tx_ring tx_rings[NFP_NET_MAX_TX_RINGS]; - struct nfp_net_rx_ring rx_rings[NFP_NET_MAX_RX_RINGS]; - u8 num_irqs; u8 num_r_vecs; struct nfp_net_r_vector r_vecs[NFP_NET_MAX_TX_RINGS]; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 7233471af7a8..8f7e2e044811 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -414,12 +414,6 @@ static void nfp_net_irqs_assign(struct net_device *netdev) r_vec->irq_idx = NFP_NET_NON_Q_VECTORS + r; cpumask_set_cpu(r, &r_vec->affinity_mask); - - r_vec->tx_ring = &nn->tx_rings[r]; - nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, r); - - r_vec->rx_ring = &nn->rx_rings[r]; - nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, r); } } @@ -1504,6 +1498,12 @@ nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, struct msix_entry *entry = &nn->irq_entries[r_vec->irq_idx]; int err; + r_vec->tx_ring = &nn->tx_rings[idx]; + nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, idx); + + r_vec->rx_ring = &nn->rx_rings[idx]; + nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, idx); + 
snprintf(r_vec->name, sizeof(r_vec->name), "%s-rxtx-%d", nn->netdev->name, idx); err = request_irq(entry->vector, r_vec->handler, 0, r_vec->name, r_vec); @@ -1694,6 +1694,15 @@ static int nfp_net_netdev_open(struct net_device *netdev) goto err_free_exn; disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); + nn->rx_rings = kcalloc(nn->num_rx_rings, sizeof(*nn->rx_rings), + GFP_KERNEL); + if (!nn->rx_rings) + goto err_free_lsc; + nn->tx_rings = kcalloc(nn->num_tx_rings, sizeof(*nn->tx_rings), + GFP_KERNEL); + if (!nn->tx_rings) + goto err_free_rx_rings; + for (r = 0; r < nn->num_r_vecs; r++) { err = nfp_net_prepare_vector(nn, &nn->r_vecs[r], r); if (err) @@ -1808,6 +1817,10 @@ err_free_tx_ring_p: err_cleanup_vec_p: nfp_net_cleanup_vector(nn, &nn->r_vecs[r]); } + kfree(nn->tx_rings); +err_free_rx_rings: + kfree(nn->rx_rings); +err_free_lsc: nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); err_free_exn: nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); @@ -1853,6 +1866,9 @@ static int nfp_net_netdev_close(struct net_device *netdev) nfp_net_cleanup_vector(nn, &nn->r_vecs[r]); } + kfree(nn->rx_rings); + kfree(nn->tx_rings); + nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); -- 1.9.1
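The allocate-then-unwind flow this patch adds to `.ndo_open()` can be sketched in plain userspace C. This is an illustrative model only, not the driver's code: the `fake_*` names stand in for `nfp_net_netdev_open()`'s kcalloc of the ring arrays and its reverse-order goto error labels.

```c
#include <stdlib.h>

/* Userspace sketch of the error-unwind pattern: allocate the RX ring
 * array, then the TX ring array; on any failure, jump to a label that
 * frees everything allocated so far, in reverse order. */
struct fake_nn {
	void *rx_rings;
	void *tx_rings;
};

static int fake_open(struct fake_nn *nn, size_t num_rx, size_t num_tx,
		     size_t ring_sz)
{
	nn->rx_rings = calloc(num_rx, ring_sz);
	if (!nn->rx_rings)
		goto err_out;

	nn->tx_rings = calloc(num_tx, ring_sz);
	if (!nn->tx_rings)
		goto err_free_rx_rings;

	return 0;

err_free_rx_rings:
	free(nn->rx_rings);
	nn->rx_rings = NULL;
err_out:
	return -1;
}

static void fake_close(struct fake_nn *nn)
{
	free(nn->rx_rings);
	free(nn->tx_rings);
	nn->rx_rings = nn->tx_rings = NULL;
}
```

Each later allocation failure unwinds only what was already set up, which is why the patch threads the new kcalloc calls between the existing `err_free_lsc`-style labels.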
[PATCH v4 net-next 12/15] nfp: propagate list buffer size in struct rx_ring
The free list buffer size needs to be propagated to a few functions as a parameter and added to struct nfp_net_rx_ring, since soon some of the functions will be reused to manage rings with buffers of a size different from nn->fl_bufsz. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 3 +++ .../net/ethernet/netronome/nfp/nfp_net_common.c| 24 ++ 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 0a87571a7d9c..1e08c9cf3ee0 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -298,6 +298,8 @@ struct nfp_net_rx_buf { * @rxds: Virtual address of FL/RX ring in host memory * @dma:DMA address of the FL/RX ring * @size: Size, in bytes, of the FL/RX ring (needed to free) + * @bufsz: Buffer allocation size for convenience of management routines + * (NOTE: this is in second cache line, do not use on fast path!) */ struct nfp_net_rx_ring { struct nfp_net_r_vector *r_vec; @@ -319,6 +321,7 @@ struct nfp_net_rx_ring { dma_addr_t dma; unsigned int size; + unsigned int bufsz; } cacheline_aligned; /** diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 2878ac021eda..eeabc33fe13d 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -958,25 +958,27 @@ static inline int nfp_net_rx_space(struct nfp_net_rx_ring *rx_ring) * nfp_net_rx_alloc_one() - Allocate and map skb for RX * @rx_ring: RX ring structure of the skb * @dma_addr: Pointer to storage for DMA address (output param) + * @fl_bufsz: size of freelist buffers * * This function will allcate a new skb, map it for DMA. * * Return: allocated skb or NULL on failure. 
*/ static struct sk_buff * -nfp_net_rx_alloc_one(struct nfp_net_rx_ring *rx_ring, dma_addr_t *dma_addr) +nfp_net_rx_alloc_one(struct nfp_net_rx_ring *rx_ring, dma_addr_t *dma_addr, +unsigned int fl_bufsz) { struct nfp_net *nn = rx_ring->r_vec->nfp_net; struct sk_buff *skb; - skb = netdev_alloc_skb(nn->netdev, nn->fl_bufsz); + skb = netdev_alloc_skb(nn->netdev, fl_bufsz); if (!skb) { nn_warn_ratelimit(nn, "Failed to alloc receive SKB\n"); return NULL; } *dma_addr = dma_map_single(&nn->pdev->dev, skb->data, - nn->fl_bufsz, DMA_FROM_DEVICE); + fl_bufsz, DMA_FROM_DEVICE); if (dma_mapping_error(&nn->pdev->dev, *dma_addr)) { dev_kfree_skb_any(skb); nn_warn_ratelimit(nn, "Failed to map DMA RX buffer\n"); @@ -1069,7 +1071,7 @@ nfp_net_rx_ring_bufs_free(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) continue; dma_unmap_single(&pdev->dev, rx_ring->rxbufs[i].dma_addr, -nn->fl_bufsz, DMA_FROM_DEVICE); +rx_ring->bufsz, DMA_FROM_DEVICE); dev_kfree_skb_any(rx_ring->rxbufs[i].skb); rx_ring->rxbufs[i].dma_addr = 0; rx_ring->rxbufs[i].skb = NULL; @@ -1091,7 +1093,8 @@ nfp_net_rx_ring_bufs_alloc(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) for (i = 0; i < rx_ring->cnt - 1; i++) { rxbufs[i].skb = - nfp_net_rx_alloc_one(rx_ring, &rxbufs[i].dma_addr); + nfp_net_rx_alloc_one(rx_ring, &rxbufs[i].dma_addr, +rx_ring->bufsz); if (!rxbufs[i].skb) { nfp_net_rx_ring_bufs_free(nn, rx_ring); return -ENOMEM; @@ -1279,7 +1282,8 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) skb = rx_ring->rxbufs[idx].skb; - new_skb = nfp_net_rx_alloc_one(rx_ring, &new_dma_addr); + new_skb = nfp_net_rx_alloc_one(rx_ring, &new_dma_addr, + nn->fl_bufsz); if (!new_skb) { nfp_net_rx_give_one(rx_ring, rx_ring->rxbufs[idx].skb, rx_ring->rxbufs[idx].dma_addr); @@ -1466,10 +1470,12 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring) /** * nfp_net_rx_ring_alloc() - Allocate resource for a RX ring * @rx_ring: RX ring to allocate + * @fl_bufsz: Size of buffers to allocate * 
* Return: 0 on success, negative errno otherwise. */ -static int nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring) +static int +nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz) { struct nfp_net_r_vector *r_vec = rx
[PATCH v4 net-next 13/15] nfp: convert .ndo_change_mtu() to prepare/commit paradigm
When changing the MTU on a running device, first allocate new rings and buffers, and only once that succeeds proceed with changing the MTU. Allocation of new rings is not really necessary for this operation - it's done to keep the code simple and because the size of the extra ring memory is quite small compared to the size of the buffers. The operation can still fail midway through if FW communication times out. In that case we retry with the old MTU (and rings). Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 110 +++-- 1 file changed, 103 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index eeabc33fe13d..33001ce1d8bf 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1507,6 +1507,64 @@ err_alloc: return -ENOMEM; } +static struct nfp_net_rx_ring * +nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz) +{ + struct nfp_net_rx_ring *rings; + unsigned int r; + + rings = kcalloc(nn->num_rx_rings, sizeof(*rings), GFP_KERNEL); + if (!rings) + return NULL; + + for (r = 0; r < nn->num_rx_rings; r++) { + nfp_net_rx_ring_init(&rings[r], nn->rx_rings[r].r_vec, r); + + if (nfp_net_rx_ring_alloc(&rings[r], fl_bufsz)) + goto err_free_prev; + + if (nfp_net_rx_ring_bufs_alloc(nn, &rings[r])) + goto err_free_ring; + } + + return rings; + +err_free_prev: + while (r--) { + nfp_net_rx_ring_bufs_free(nn, &rings[r]); +err_free_ring: + nfp_net_rx_ring_free(&rings[r]); + } + kfree(rings); + return NULL; +} + +static struct nfp_net_rx_ring * +nfp_net_shadow_rx_rings_swap(struct nfp_net *nn, struct nfp_net_rx_ring *rings) +{ + struct nfp_net_rx_ring *old = nn->rx_rings; + unsigned int r; + + for (r = 0; r < nn->num_rx_rings; r++) + old[r].r_vec->rx_ring = &rings[r]; + + nn->rx_rings = rings; + return old; +} + +static void +nfp_net_shadow_rx_rings_free(struct nfp_net *nn, struct nfp_net_rx_ring 
*rings) +{ + unsigned int r; + + for (r = 0; r < nn->num_r_vecs; r++) { + nfp_net_rx_ring_bufs_free(nn, &rings[r]); + nfp_net_rx_ring_free(&rings[r]); + } + + kfree(rings); +} + static int nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, int idx) @@ -1985,24 +2043,62 @@ static void nfp_net_set_rx_mode(struct net_device *netdev) static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) { + unsigned int old_mtu, old_fl_bufsz, new_fl_bufsz; struct nfp_net *nn = netdev_priv(netdev); + struct nfp_net_rx_ring *tmp_rings; + int err; if (new_mtu < 68 || new_mtu > nn->max_mtu) { nn_err(nn, "New MTU (%d) is not valid\n", new_mtu); return -EINVAL; } - netdev->mtu = new_mtu; - nn->fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + + old_mtu = netdev->mtu; + old_fl_bufsz = nn->fl_bufsz; + new_fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + MPLS_HLEN * 8 + new_mtu; - /* restart if running */ - if (netif_running(netdev)) { - nfp_net_netdev_close(netdev); - nfp_net_netdev_open(netdev); + if (!netif_running(netdev)) { + netdev->mtu = new_mtu; + nn->fl_bufsz = new_fl_bufsz; + return 0; } - return 0; + /* Prepare new rings */ + tmp_rings = nfp_net_shadow_rx_rings_prepare(nn, new_fl_bufsz); + if (!tmp_rings) + return -ENOMEM; + + /* Stop device, swap in new rings, try to start the firmware */ + nfp_net_close_stack(nn); + nfp_net_clear_config_and_disable(nn); + + tmp_rings = nfp_net_shadow_rx_rings_swap(nn, tmp_rings); + + netdev->mtu = new_mtu; + nn->fl_bufsz = new_fl_bufsz; + + err = nfp_net_set_config_and_enable(nn); + if (err) { + const int err_new = err; + + /* Try with old configuration and old rings */ + tmp_rings = nfp_net_shadow_rx_rings_swap(nn, tmp_rings); + + netdev->mtu = old_mtu; + nn->fl_bufsz = old_fl_bufsz; + + err = __nfp_net_set_config_and_enable(nn); + if (err) + nn_err(nn, "Can't restore MTU - FW communication failed (%d,%d)\n", + err_new, err); + } + + nfp_net_shadow_rx_rings_free(nn, tmp_rings); + + 
nfp_net_open_stack(nn); + + return err; } static struct rtnl_link_stats64 *nfp_net_stat64(struct net_device *netdev, -- 1.9.1
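The commit-with-fallback flow of the new `.ndo_change_mtu()` can be modeled in a few lines of userspace C. This is a hedged sketch with invented `fake_*` names, not the driver's code; `fake_fw_enable` is a pure test stub standing in for the FW handshake that can time out. As in the patch, when the new configuration fails but the restore succeeds, the function returns the restore's (successful) result, and the original error is reported only if even the restore fails.

```c
/* Illustrative configuration state; the real driver carries MTU,
 * fl_bufsz and the ring set. */
struct fake_cfg {
	int mtu;
};

/* Test stub for the FW handshake: "times out" for jumbo MTUs so the
 * fallback path can be exercised. Not a real driver function. */
static int fake_fw_enable(const struct fake_cfg *cfg)
{
	return cfg->mtu > 9000 ? -1 : 0;
}

/* Prepare/commit with fallback, modeled on the patched
 * nfp_net_change_mtu(): commit the new value, try to enable; on
 * failure restore the old configuration and re-enable. */
static int fake_commit_mtu(struct fake_cfg *live, int new_mtu,
			   int (*fw_enable)(const struct fake_cfg *))
{
	struct fake_cfg old = *live;
	int err;

	live->mtu = new_mtu;		/* commit new configuration */
	err = fw_enable(live);
	if (err) {
		const int err_new = err;

		*live = old;		/* fall back to old config */
		err = fw_enable(live);
		if (err)
			return err_new;	/* could not even restore */
	}
	return err;
}
```

The key property is that the only failures after the commit point are FW ones, and both configurations needed for recovery are still in hand.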
[PATCH v4 net-next 07/15] nfp: reorganize initial filling of RX rings
Separate the allocation of buffers from giving them to the FW; thanks to this it will be possible to move allocation earlier on the .ndo_open() path and reuse buffers during runtime reconfiguration. Similarly to the TX side, clean up the spill of functionality from flush into freeing the ring. Unlike on the TX side, RX ring reset does not free buffers from the ring. Ring reset means only that FW pointers are zeroed and buffers on the ring must be placed in [0, cnt - 1) positions. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 119 ++--- 1 file changed, 78 insertions(+), 41 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 9a027a3cfe02..b057102769f9 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1021,62 +1021,100 @@ static void nfp_net_rx_give_one(struct nfp_net_rx_ring *rx_ring, } /** - * nfp_net_rx_flush() - Free any buffers currently on the RX ring - * @rx_ring: RX ring to remove buffers from + * nfp_net_rx_ring_reset() - Reflect in SW state of freelist after disable + * @rx_ring: RX ring structure * - * Assumes that the device is stopped + * Warning: Do *not* call if ring buffers were never put on the FW freelist + * (i.e. device was not enabled)! 
*/ -static void nfp_net_rx_flush(struct nfp_net_rx_ring *rx_ring) +static void nfp_net_rx_ring_reset(struct nfp_net_rx_ring *rx_ring) { - struct nfp_net *nn = rx_ring->r_vec->nfp_net; - struct pci_dev *pdev = nn->pdev; - int idx; + unsigned int wr_idx, last_idx; - while (rx_ring->rd_p != rx_ring->wr_p) { - idx = rx_ring->rd_p % rx_ring->cnt; + /* Move the empty entry to the end of the list */ + wr_idx = rx_ring->wr_p % rx_ring->cnt; + last_idx = rx_ring->cnt - 1; + rx_ring->rxbufs[wr_idx].dma_addr = rx_ring->rxbufs[last_idx].dma_addr; + rx_ring->rxbufs[wr_idx].skb = rx_ring->rxbufs[last_idx].skb; + rx_ring->rxbufs[last_idx].dma_addr = 0; + rx_ring->rxbufs[last_idx].skb = NULL; - if (rx_ring->rxbufs[idx].skb) { - dma_unmap_single(&pdev->dev, -rx_ring->rxbufs[idx].dma_addr, -nn->fl_bufsz, DMA_FROM_DEVICE); - dev_kfree_skb_any(rx_ring->rxbufs[idx].skb); - rx_ring->rxbufs[idx].dma_addr = 0; - rx_ring->rxbufs[idx].skb = NULL; - } + memset(rx_ring->rxds, 0, sizeof(*rx_ring->rxds) * rx_ring->cnt); + rx_ring->wr_p = 0; + rx_ring->rd_p = 0; + rx_ring->wr_ptr_add = 0; +} - memset(&rx_ring->rxds[idx], 0, sizeof(rx_ring->rxds[idx])); +/** + * nfp_net_rx_ring_bufs_free() - Free any buffers currently on the RX ring + * @nn:NFP Net device + * @rx_ring: RX ring to remove buffers from + * + * Assumes that the device is stopped and buffers are in [0, ring->cnt - 1) + * entries. After device is disabled nfp_net_rx_ring_reset() must be called + * to restore required ring geometry. + */ +static void +nfp_net_rx_ring_bufs_free(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) +{ + struct pci_dev *pdev = nn->pdev; + unsigned int i; - rx_ring->rd_p++; + for (i = 0; i < rx_ring->cnt - 1; i++) { + /* NULL skb can only happen when initial filling of the ring +* fails to allocate enough buffers and calls here to free +* already allocated ones. 
+*/ + if (!rx_ring->rxbufs[i].skb) + continue; + + dma_unmap_single(&pdev->dev, rx_ring->rxbufs[i].dma_addr, +nn->fl_bufsz, DMA_FROM_DEVICE); + dev_kfree_skb_any(rx_ring->rxbufs[i].skb); + rx_ring->rxbufs[i].dma_addr = 0; + rx_ring->rxbufs[i].skb = NULL; } } /** - * nfp_net_rx_fill_freelist() - Attempt filling freelist with RX buffers - * @rx_ring: RX ring to fill - * - * Try to fill as many buffers as possible into freelist. Return - * number of buffers added. - * - * Return: Number of freelist buffers added. + * nfp_net_rx_ring_bufs_alloc() - Fill RX ring with buffers (don't give to FW) + * @nn:NFP Net device + * @rx_ring: RX ring to remove buffers from */ -static int nfp_net_rx_fill_freelist(struct nfp_net_rx_ring *rx_ring) +static int +nfp_net_rx_ring_bufs_alloc(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) { - struct sk_buff *skb; - dma_addr_t dma_addr; + struct nfp_net_rx_buf *rxbufs; + unsigned int i; + + rxbufs = rx_ring->rxbufs; - while (nfp_net_rx_space(rx_ring)) { - skb = nfp_net_rx_alloc_one(rx_ring, &dma_addr); - if (!skb) { - nf
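The index handling of the new ring reset - moving the one empty freelist slot to the end so the buffers occupy `[0, cnt - 1)` - can be sketched in userspace C. Illustrative names only, not the driver's code:

```c
#include <stddef.h>

struct fake_rx_buf {
	unsigned long dma_addr;
	void *skb;
};

/* Userspace sketch of the index math in nfp_net_rx_ring_reset(): the
 * freelist always keeps one empty entry at wr_p % cnt; reset copies
 * the last entry into that slot and clears the last slot, so that
 * positions [0, cnt - 1) all hold buffers. */
static void fake_ring_reset(struct fake_rx_buf *bufs, unsigned int cnt,
			    unsigned int wr_p)
{
	unsigned int wr_idx = wr_p % cnt;	/* the one empty slot */
	unsigned int last_idx = cnt - 1;

	bufs[wr_idx].dma_addr = bufs[last_idx].dma_addr;
	bufs[wr_idx].skb = bufs[last_idx].skb;
	bufs[last_idx].dma_addr = 0;
	bufs[last_idx].skb = NULL;
}
```

After this, the companion `bufs_free`/`bufs_alloc` helpers can walk a fixed `[0, cnt - 1)` range regardless of where the FW pointers had wandered.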
[PATCH v4 net-next 00/15] MTU/buffer reconfig changes
Hi! Sorry it takes me so long to iterate this. The previous series included some not entirely related patches; this one is cut down. The main issue I'm trying to solve here is that .ndo_change_mtu() in the nfpvf driver does a full close/open to reallocate buffers - which, if the open fails, can result in the device being basically closed even though the interface is started. As suggested by you, I try to move towards a paradigm where the resources are allocated first and the MTU change is only done once I'm certain (almost) nothing can fail. Almost, because I need to communicate with the FW and that can always time out. Patch 1 fixes a small issue. The next 10 patches reorganize things so that I can easily allocate new rings and sets of buffers while the device is running. Patches 13 and 15 reshape the .ndo_change_mtu() and ethtool's ring-resize operation into the desired form. Jakub Kicinski (15): nfp: correct RX buffer length calculation nfp: move link state interrupt request/free calls nfp: break up nfp_net_{alloc|free}_rings nfp: make *x_ring_init do all the init nfp: allocate ring SW structs dynamically nfp: cleanup tx ring flush and rename to reset nfp: reorganize initial filling of RX rings nfp: preallocate RX buffers early in .ndo_open nfp: move filling ring information to FW config nfp: slice .ndo_open() and .ndo_stop() up nfp: sync ring state during FW reconfiguration nfp: propagate list buffer size in struct rx_ring nfp: convert .ndo_change_mtu() to prepare/commit paradigm nfp: pass ring count as function parameter nfp: allow ring size reconfiguration at runtime drivers/net/ethernet/netronome/nfp/nfp_net.h | 10 +- .../net/ethernet/netronome/nfp/nfp_net_common.c| 905 ++--- .../net/ethernet/netronome/nfp/nfp_net_ethtool.c | 30 +- 3 files changed, 617 insertions(+), 328 deletions(-) -- 1.9.1
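The allocate-first paradigm the cover letter describes boils down to a prepare/swap/free trio. A minimal userspace sketch, with invented `fake_*` names rather than the driver's:

```c
#include <stdlib.h>

/* "Shadow rings" pattern: build a complete new set of rings off to
 * the side, swap it in (getting the old set back), and free whichever
 * set is no longer needed. */
struct fake_ring_set {
	unsigned int cnt;
	int *descs;
};

static struct fake_ring_set *fake_rings_prepare(unsigned int cnt)
{
	struct fake_ring_set *rs = malloc(sizeof(*rs));

	if (!rs)
		return NULL;
	rs->descs = calloc(cnt, sizeof(*rs->descs));
	if (!rs->descs) {
		free(rs);
		return NULL;	/* live configuration untouched */
	}
	rs->cnt = cnt;
	return rs;
}

/* Swap in the new set and hand the old one back to the caller, so it
 * can either free it or swap it back if enabling the FW fails. */
static struct fake_ring_set *
fake_rings_swap(struct fake_ring_set **live, struct fake_ring_set *new_set)
{
	struct fake_ring_set *old = *live;

	*live = new_set;
	return old;
}

static void fake_rings_free(struct fake_ring_set *rs)
{
	if (!rs)
		return;
	free(rs->descs);
	free(rs);
}
```

Because prepare can fail without touching live state, and swap is infallible, the only remaining failure point is the FW handshake - exactly the property the series is after.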
[PATCH v4 net-next 04/15] nfp: make *x_ring_init do all the init
nfp_net_[rt]x_ring_init functions used to be called from probe path only and some of their functionality was spilled to the call site. In order to reuse them for ring reconfiguration we need them to do all the init. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 28 ++ 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index b435d15ef8d6..7233471af7a8 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -348,12 +348,18 @@ static irqreturn_t nfp_net_irq_exn(int irq, void *data) /** * nfp_net_tx_ring_init() - Fill in the boilerplate for a TX ring * @tx_ring: TX ring structure + * @r_vec:IRQ vector servicing this ring + * @idx: Ring index */ -static void nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring) +static void +nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring, +struct nfp_net_r_vector *r_vec, unsigned int idx) { - struct nfp_net_r_vector *r_vec = tx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; + tx_ring->idx = idx; + tx_ring->r_vec = r_vec; + tx_ring->qcidx = tx_ring->idx * nn->stride_tx; tx_ring->qcp_q = nn->tx_bar + NFP_QCP_QUEUE_OFF(tx_ring->qcidx); } @@ -361,12 +367,18 @@ static void nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring) /** * nfp_net_rx_ring_init() - Fill in the boilerplate for a RX ring * @rx_ring: RX ring structure + * @r_vec:IRQ vector servicing this ring + * @idx: Ring index */ -static void nfp_net_rx_ring_init(struct nfp_net_rx_ring *rx_ring) +static void +nfp_net_rx_ring_init(struct nfp_net_rx_ring *rx_ring, +struct nfp_net_r_vector *r_vec, unsigned int idx) { - struct nfp_net_r_vector *r_vec = rx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; + rx_ring->idx = idx; + rx_ring->r_vec = r_vec; + rx_ring->fl_qcidx = rx_ring->idx * nn->stride_rx; rx_ring->rx_qcidx = rx_ring->fl_qcidx + (nn->stride_rx - 
1); @@ -404,14 +416,10 @@ static void nfp_net_irqs_assign(struct net_device *netdev) cpumask_set_cpu(r, &r_vec->affinity_mask); r_vec->tx_ring = &nn->tx_rings[r]; - nn->tx_rings[r].idx = r; - nn->tx_rings[r].r_vec = r_vec; - nfp_net_tx_ring_init(r_vec->tx_ring); + nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, r); r_vec->rx_ring = &nn->rx_rings[r]; - nn->rx_rings[r].idx = r; - nn->rx_rings[r].r_vec = r_vec; - nfp_net_rx_ring_init(r_vec->rx_ring); + nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, r); } } -- 1.9.1
[PATCH v4 net-next 10/15] nfp: slice .ndo_open() and .ndo_stop() up
Divide .ndo_open() and .ndo_stop() into logical, callable chunks. No functional changes. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 218 + 1 file changed, 136 insertions(+), 82 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index f504de12ed2a..f171a7da8931 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1673,6 +1673,82 @@ nfp_net_vec_write_ring_data(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, nn_writeb(nn, NFP_NET_CFG_TXR_VEC(idx), r_vec->irq_idx); } +static int __nfp_net_set_config_and_enable(struct nfp_net *nn) +{ + u32 new_ctrl, update = 0; + unsigned int r; + int err; + + new_ctrl = nn->ctrl; + + if (nn->cap & NFP_NET_CFG_CTRL_RSS) { + nfp_net_rss_write_key(nn); + nfp_net_rss_write_itbl(nn); + nn_writel(nn, NFP_NET_CFG_RSS_CTRL, nn->rss_cfg); + update |= NFP_NET_CFG_UPDATE_RSS; + } + + if (nn->cap & NFP_NET_CFG_CTRL_IRQMOD) { + nfp_net_coalesce_write_cfg(nn); + + new_ctrl |= NFP_NET_CFG_CTRL_IRQMOD; + update |= NFP_NET_CFG_UPDATE_IRQMOD; + } + + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_vec_write_ring_data(nn, &nn->r_vecs[r], r); + + nn_writeq(nn, NFP_NET_CFG_TXRS_ENABLE, nn->num_tx_rings == 64 ? + 0xULL : ((u64)1 << nn->num_tx_rings) - 1); + + nn_writeq(nn, NFP_NET_CFG_RXRS_ENABLE, nn->num_rx_rings == 64 ? 
+ 0xULL : ((u64)1 << nn->num_rx_rings) - 1); + + nfp_net_write_mac_addr(nn, nn->netdev->dev_addr); + + nn_writel(nn, NFP_NET_CFG_MTU, nn->netdev->mtu); + nn_writel(nn, NFP_NET_CFG_FLBUFSZ, nn->fl_bufsz); + + /* Enable device */ + new_ctrl |= NFP_NET_CFG_CTRL_ENABLE; + update |= NFP_NET_CFG_UPDATE_GEN; + update |= NFP_NET_CFG_UPDATE_MSIX; + update |= NFP_NET_CFG_UPDATE_RING; + if (nn->cap & NFP_NET_CFG_CTRL_RINGCFG) + new_ctrl |= NFP_NET_CFG_CTRL_RINGCFG; + + nn_writel(nn, NFP_NET_CFG_CTRL, new_ctrl); + err = nfp_net_reconfig(nn, update); + + nn->ctrl = new_ctrl; + + /* Since reconfiguration requests while NFP is down are ignored we +* have to wipe the entire VXLAN configuration and reinitialize it. +*/ + if (nn->ctrl & NFP_NET_CFG_CTRL_VXLAN) { + memset(&nn->vxlan_ports, 0, sizeof(nn->vxlan_ports)); + memset(&nn->vxlan_usecnt, 0, sizeof(nn->vxlan_usecnt)); + vxlan_get_rx_port(nn->netdev); + } + + return err; +} + +/** + * nfp_net_set_config_and_enable() - Write control BAR and enable NFP + * @nn: NFP Net device to reconfigure + */ +static int nfp_net_set_config_and_enable(struct nfp_net *nn) +{ + int err; + + err = __nfp_net_set_config_and_enable(nn); + if (err) + nfp_net_clear_config_and_disable(nn); + + return err; +} + /** * nfp_net_start_vec() - Start ring vector * @nn: NFP Net device structure @@ -1693,20 +1769,33 @@ nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) enable_irq(irq_vec); } +/** + * nfp_net_open_stack() - Start the device from stack's perspective + * @nn: NFP Net device to reconfigure + */ +static void nfp_net_open_stack(struct nfp_net *nn) +{ + unsigned int r; + + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_start_vec(nn, &nn->r_vecs[r]); + + netif_tx_wake_all_queues(nn->netdev); + + enable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); + nfp_net_read_link_status(nn); +} + static int nfp_net_netdev_open(struct net_device *netdev) { struct nfp_net *nn = netdev_priv(netdev); int err, r; - u32 update = 0; - u32 new_ctrl; if 
(nn->ctrl & NFP_NET_CFG_CTRL_ENABLE) { nn_err(nn, "Dev is already enabled: 0x%08x\n", nn->ctrl); return -EBUSY; } - new_ctrl = nn->ctrl; - /* Step 1: Allocate resources for rings and the like * - Request interrupts * - Allocate RX and TX ring resources @@ -1759,20 +1848,6 @@ static int nfp_net_netdev_open(struct net_device *netdev) if (err) goto err_free_rings; - if (nn->cap & NFP_NET_CFG_CTRL_RSS) { - nfp_net_rss_write_key(nn); - nfp_net_rss_write_itbl(nn); - nn_writel(nn, NFP_NET_CFG_RSS_CTRL, nn->rss_cfg); - update |= NFP_NET_CFG_UPDATE_RSS; - } - - if (nn->cap & NFP_NET_CFG_CTRL_IRQMOD) { - nfp_net_coalesce_write_cfg(nn); - - new_ctrl |= NFP_NET_CFG_CTRL_IRQMOD; - update |= NFP_NET_CFG_UPDATE_IRQMOD; - } -
[PATCH v4 net-next 06/15] nfp: cleanup tx ring flush and rename to reset
Since we never used flush without freeing the ring afterwards, the functionality of the two operations got mixed. Rename flush to ring reset and move into it all the things which have to be done after the FW ring state is cleared. While at it, do some clean-ups. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 81 ++ 1 file changed, 37 insertions(+), 44 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 8f7e2e044811..9a027a3cfe02 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -868,61 +868,59 @@ static void nfp_net_tx_complete(struct nfp_net_tx_ring *tx_ring) } /** - * nfp_net_tx_flush() - Free any untransmitted buffers currently on the TX ring - * @tx_ring: TX ring structure + * nfp_net_tx_ring_reset() - Free any untransmitted buffers and reset pointers + * @nn:NFP Net device + * @tx_ring: TX ring structure * * Assumes that the device is stopped */ -static void nfp_net_tx_flush(struct nfp_net_tx_ring *tx_ring) +static void +nfp_net_tx_ring_reset(struct nfp_net *nn, struct nfp_net_tx_ring *tx_ring) { - struct nfp_net_r_vector *r_vec = tx_ring->r_vec; - struct nfp_net *nn = r_vec->nfp_net; - struct pci_dev *pdev = nn->pdev; const struct skb_frag_struct *frag; struct netdev_queue *nd_q; - struct sk_buff *skb; - int nr_frags; - int fidx; - int idx; + struct pci_dev *pdev = nn->pdev; while (tx_ring->rd_p != tx_ring->wr_p) { - idx = tx_ring->rd_p % tx_ring->cnt; + int nr_frags, fidx, idx; + struct sk_buff *skb; + idx = tx_ring->rd_p % tx_ring->cnt; skb = tx_ring->txbufs[idx].skb; - if (skb) { - nr_frags = skb_shinfo(skb)->nr_frags; - fidx = tx_ring->txbufs[idx].fidx; - - if (fidx == -1) { - /* unmap head */ - dma_unmap_single(&pdev->dev, -tx_ring->txbufs[idx].dma_addr, -skb_headlen(skb), -DMA_TO_DEVICE); - } else { - /* unmap fragment */ - frag = &skb_shinfo(skb)->frags[fidx]; - 
dma_unmap_page(&pdev->dev, - tx_ring->txbufs[idx].dma_addr, - skb_frag_size(frag), - DMA_TO_DEVICE); - } - - /* check for last gather fragment */ - if (fidx == nr_frags - 1) - dev_kfree_skb_any(skb); - - tx_ring->txbufs[idx].dma_addr = 0; - tx_ring->txbufs[idx].skb = NULL; - tx_ring->txbufs[idx].fidx = -2; + nr_frags = skb_shinfo(skb)->nr_frags; + fidx = tx_ring->txbufs[idx].fidx; + + if (fidx == -1) { + /* unmap head */ + dma_unmap_single(&pdev->dev, +tx_ring->txbufs[idx].dma_addr, +skb_headlen(skb), DMA_TO_DEVICE); + } else { + /* unmap fragment */ + frag = &skb_shinfo(skb)->frags[fidx]; + dma_unmap_page(&pdev->dev, + tx_ring->txbufs[idx].dma_addr, + skb_frag_size(frag), DMA_TO_DEVICE); } - memset(&tx_ring->txds[idx], 0, sizeof(tx_ring->txds[idx])); + /* check for last gather fragment */ + if (fidx == nr_frags - 1) + dev_kfree_skb_any(skb); + + tx_ring->txbufs[idx].dma_addr = 0; + tx_ring->txbufs[idx].skb = NULL; + tx_ring->txbufs[idx].fidx = -2; tx_ring->qcp_rd_p++; tx_ring->rd_p++; } + memset(tx_ring->txds, 0, sizeof(*tx_ring->txds) * tx_ring->cnt); + tx_ring->wr_p = 0; + tx_ring->rd_p = 0; + tx_ring->qcp_rd_p = 0; + tx_ring->wr_ptr_add = 0; + nd_q = netdev_get_tx_queue(nn->netdev, tx_ring->idx); netdev_tx_reset_queue(nd_q); } @@ -1363,11 +1361,6 @@ static void nfp_net_tx_ring_free(struct nfp_net_tx_ring *tx_ring) tx_ring->txds, tx_ring->dma); tx_ring->cnt = 0; - tx_ring->wr_p = 0; - tx_ring->rd_p = 0; - tx_ring->qcp_rd_p = 0; - tx_ring->wr_ptr_add = 0; - tx_ring->
[PATCH v4 net-next 14/15] nfp: pass ring count as function parameter
Soon ring resize will call these functions with values different from the current configuration, so we need to explicitly pass the ring count as a parameter. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 23 +- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 33001ce1d8bf..631168a1971e 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1408,17 +1408,18 @@ static void nfp_net_tx_ring_free(struct nfp_net_tx_ring *tx_ring) /** * nfp_net_tx_ring_alloc() - Allocate resource for a TX ring * @tx_ring: TX Ring structure to allocate + * @cnt: Ring buffer count * * Return: 0 on success, negative errno otherwise. */ -static int nfp_net_tx_ring_alloc(struct nfp_net_tx_ring *tx_ring) +static int nfp_net_tx_ring_alloc(struct nfp_net_tx_ring *tx_ring, u32 cnt) { struct nfp_net_r_vector *r_vec = tx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; int sz; - tx_ring->cnt = nn->txd_cnt; + tx_ring->cnt = cnt; tx_ring->size = sizeof(*tx_ring->txds) * tx_ring->cnt; tx_ring->txds = dma_zalloc_coherent(&pdev->dev, tx_ring->size, @@ -1471,18 +1472,20 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring) * nfp_net_rx_ring_alloc() - Allocate resource for a RX ring * @rx_ring: RX ring to allocate * @fl_bufsz: Size of buffers to allocate + * @cnt: Ring buffer count * * Return: 0 on success, negative errno otherwise. 
*/ static int -nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz) +nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz, + u32 cnt) { struct nfp_net_r_vector *r_vec = rx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; int sz; - rx_ring->cnt = nn->rxd_cnt; + rx_ring->cnt = cnt; rx_ring->bufsz = fl_bufsz; rx_ring->size = sizeof(*rx_ring->rxds) * rx_ring->cnt; @@ -1508,7 +1511,8 @@ err_alloc: } static struct nfp_net_rx_ring * -nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz) +nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz, + u32 buf_cnt) { struct nfp_net_rx_ring *rings; unsigned int r; @@ -1520,7 +1524,7 @@ nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz) for (r = 0; r < nn->num_rx_rings; r++) { nfp_net_rx_ring_init(&rings[r], nn->rx_rings[r].r_vec, r); - if (nfp_net_rx_ring_alloc(&rings[r], fl_bufsz)) + if (nfp_net_rx_ring_alloc(&rings[r], fl_bufsz, buf_cnt)) goto err_free_prev; if (nfp_net_rx_ring_bufs_alloc(nn, &rings[r])) @@ -1879,12 +1883,12 @@ static int nfp_net_netdev_open(struct net_device *netdev) if (err) goto err_free_prev_vecs; - err = nfp_net_tx_ring_alloc(nn->r_vecs[r].tx_ring); + err = nfp_net_tx_ring_alloc(nn->r_vecs[r].tx_ring, nn->txd_cnt); if (err) goto err_cleanup_vec_p; err = nfp_net_rx_ring_alloc(nn->r_vecs[r].rx_ring, - nn->fl_bufsz); + nn->fl_bufsz, nn->rxd_cnt); if (err) goto err_free_tx_ring_p; @@ -2065,7 +2069,8 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) } /* Prepare new rings */ - tmp_rings = nfp_net_shadow_rx_rings_prepare(nn, new_fl_bufsz); + tmp_rings = nfp_net_shadow_rx_rings_prepare(nn, new_fl_bufsz, + nn->rxd_cnt); if (!tmp_rings) return -ENOMEM; -- 1.9.1
[PATCH v4 net-next 01/15] nfp: correct RX buffer length calculation
When calculating the RX buffer length we need to account for up to 2 VLAN tags and up to 8 MPLS labels. Rounding up to 1k is a relic of a distant past and can be removed. While at it also remove a trivial print statement. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 43c618bafdb6..307c02c4ba4a 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -61,6 +61,7 @@ #include +#include #include #include "nfp_net_ctrl.h" @@ -1911,9 +1912,6 @@ static void nfp_net_set_rx_mode(struct net_device *netdev) static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) { struct nfp_net *nn = netdev_priv(netdev); - u32 tmp; - - nn_dbg(nn, "New MTU = %d\n", new_mtu); if (new_mtu < 68 || new_mtu > nn->max_mtu) { nn_err(nn, "New MTU (%d) is not valid\n", new_mtu); @@ -1921,10 +1919,8 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) } netdev->mtu = new_mtu; - - /* Freelist buffer size rounded up to the nearest 1K */ - tmp = new_mtu + ETH_HLEN + VLAN_HLEN + NFP_NET_MAX_PREPEND; - nn->fl_bufsz = roundup(tmp, 1024); + nn->fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + + MPLS_HLEN * 8 + new_mtu; /* restart if running */ if (netif_running(netdev)) { -- 1.9.1
Re: [PATCH v2 -next] net/core/dev: Warn on a too-short GRO frame
On Fri, 2016-04-01 at 15:58 -0400, Aaron Conole wrote: > From: Aaron Conole > > When signaling that a GRO frame is ready to be processed, the network stack > correctly checks length and aborts processing when a frame is less than 14 > bytes. However, such a condition is really indicative of a broken driver, > and should be loudly signaled, rather than silently dropped as the case is > today. > > Convert the condition to use net_warn_ratelimited() to ensure the stack > loudly complains about such broken drivers. > > Signed-off-by: Aaron Conole > --- Shouldn't we give a hint of device name ? (available in napi->dev->name )
[PATCH] ip6_tunnel: set rtnl_link_ops before calling register_netdevice
When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set before ip6_tnl_create2 is called. When register_netdevice is called, there is no linkinfo attribute in the NEWLINK message because of that. Setting rtnl_link_ops before calling register_netdevice fixes that. Signed-off-by: Thadeu Lima de Souza Cascardo --- net/ipv6/ip6_tunnel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index eb2ac4b..1f20345 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -252,12 +252,12 @@ static int ip6_tnl_create2(struct net_device *dev) t = netdev_priv(dev); + dev->rtnl_link_ops = &ip6_link_ops; err = register_netdevice(dev); if (err < 0) goto out; strcpy(t->parms.name, dev->name); - dev->rtnl_link_ops = &ip6_link_ops; dev_hold(dev); ip6_tnl_link(ip6n, t); -- 2.5.0
Re: [Odd commit author id merge via netdev]
On Fri, 2016-04-01 at 10:51 -0700, santosh shilimkar wrote: > Hi Dave, > > I noticed something odd while checking the recent > commits of mine in kernel.org tree made it via netdev. > > Don't know if its patchwork tool doing this. > Usual author line in my git objects : > Author: Santosh Shilimkar > > But the commits going via your tree seems to be like below.. > Author: email-id > > Few more examples of the commits end of the email. Can this > be fixed for future commits ? The git objects you pulled from > my tree directly have right author format where as ones which > are picked from patchworks seems to be odd. > Patchwork does store this info somehow and re-use it, quite possibly from the very first patch you ever sent. I think this bug was *just* fixed in patchwork, but it'll probably be a while until that fix lands. However, you can go and create a patchwork account with the real name, associate it with all the email addresses you use and then I think it'll pick it up. Not entirely sure though, you'll have to test it. johannes
[PATCH v2 -next] net/core/dev: Warn on a too-short GRO frame
From: Aaron Conole When signaling that a GRO frame is ready to be processed, the network stack correctly checks length and aborts processing when a frame is less than 14 bytes. However, such a condition is really indicative of a broken driver, and should be loudly signaled, rather than silently dropped as the case is today. Convert the condition to use net_warn_ratelimited() to ensure the stack loudly complains about such broken drivers. Signed-off-by: Aaron Conole --- v2: * Convert from WARN_ON to net_warn_ratelimited net/core/dev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe7..1be269e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4663,6 +4663,8 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi) if (unlikely(skb_gro_header_hard(skb, hlen))) { eth = skb_gro_header_slow(skb, hlen, 0); if (unlikely(!eth)) { + net_warn_ratelimited("%s: dropping impossible skb\n", +__func__); napi_reuse_skb(napi, skb); return NULL; } -- 2.5.5
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, Apr 1, 2016 at 12:24 PM, David Miller wrote: > From: Eric Dumazet > Date: Fri, 01 Apr 2016 11:49:03 -0700 > >> For example, TCP stack tracks per socket ID generation, even if it >> sends DF=1 frames. Damn useful for tcpdump analysis and drop >> inference. > > Thanks for mentioning this, I never considered this use case. RFC 6864 is pretty explicit about this, IPv4 ID used only for fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1 The goal with this change is to try and keep most of the existing behavior intact without violating this rule. I would think the sequence number should give you the ability to infer a drop in the case of TCP. In the case of UDP tunnels we are now getting a bit more data since we were ignoring the outer IP header ID before. >> With your change, the resulting GRO packet would propagate the ID of >> first frag. Most GSO/GSO engines will then provide a ID sequence, >> which might not match original packets. Right. But that is only in the case where the IP IDs did not already increment or were left uninitialized, meaning the transmitter was probably already following RFC 6864 and chose a fixed value. Odds are in such a case we end up improving the performance if anything, as there are plenty of legacy systems out there that still require the IPv4 ID increment in order to get LRO/GRO. >> I do not particularly care, but it is worth mentioning that GRO+TSO >> would not be idempotent anymore. In the patch I mentioned we had already broken that. I'm basically just going through and fixing the cases for tunnels where we were doing the outer header wrong while at the same time relaxing the requirements for the inner header if DF is set. I'll probably add some documentation to the Documentation folder about it as well. I'm currently in the process of writing up documentation for GSO and GSO partial for the upcoming patchset. I can pretty easily throw in a few comments about GRO as well.
> Our eventual plan was to start emitting zero in the ID field for > outgoing TCP datagrams with DF set, since the issue that caused us to > generate incrementing IDs in the first place (buggy Microsoft SLHC > compression) we decided is not relevant and important enough to > accommodate any more. For the GSO partial stuff I was probably just going to have the IP ID on the inner headers lurch forward in chunks equal to gso_segs when we are doing the segmentation. I didn't want to use a fixed value just because that would likely make it easy to identify Linux devices being a bump in the wire. I figure if there are already sources that weren't updating IP ID for their segmentation offloads then if we just take that approach odds are we will blend in with the other devices and be more difficult to single out. Another reason for doing it this way is that different devices are going to have different behaviors with GSO partial. In the case of the i40e driver it recognizes both inner and outer network headers so it can increment both correctly. In the case of igb and ixgbe they only can support the outer header so the inner IP ID value would be lurching by gso_size every time we move from one GSO frame to the next. > So outside of your TCP behavior analysis case, there isn't a > compelling argument to keeping that code around any more, rather than > just put zero in the ID field. > > I suppose we could keep the counter code around and allow it to be > enabled using a sysctl or socket option, but how strongly do you > really feel about this? I'm not suggesting we drop the counter code for transmit. What RFC 6864 says is "Originating sources MAY set the IPv4 ID field of atomic datagrams to any value." For transmit we can leave the IP ID code as is. For receive we should not be snooping into the IP ID for any frames that have the DF bit set as devices that have adopted RFC 6864 on their transmit path will end up causing issues. - Alex
Re: [PATCH v3 net-next] net: ipv4: Consider failed nexthops in multipath routes
Hello, On Fri, 1 Apr 2016, David Ahern wrote: > v3 > - Julian comments: changed use of dead in documentation to failed, > init state to NUD_REACHABLE which simplifies fib_good_nh, use of > nh_dev for neighbor lookup, fallback to first entry which is what > current logic does > > v2 > - use rcu locking to avoid refcnts per Eric's suggestion > - only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's > comment > - drop the 'state == NUD_REACHABLE' from the state check since it is > part of NUD_VALID (comment from Julian) > - wrapped the use of the neigh in a sysctl > > Documentation/networking/ip-sysctl.txt | 10 ++ > include/net/netns/ipv4.h | 3 +++ > net/ipv4/fib_semantics.c | 32 > net/ipv4/sysctl_net_ipv4.c | 11 +++ > 4 files changed, 52 insertions(+), 4 deletions(-) > > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > index d97268e8ff10..e08abf96824a 100644 > --- a/net/ipv4/fib_semantics.c > +++ b/net/ipv4/fib_semantics.c > @@ -1559,21 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int > nh_flags) > } > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > +static bool fib_good_nh(const struct fib_nh *nh) > +{ > + int state = NUD_REACHABLE; > + > + if (nh->nh_scope == RT_SCOPE_LINK) { > + struct neighbour *n = NULL; NULL is not needed anymore. > + > + rcu_read_lock_bh(); > + > + n = __neigh_lookup_noref(&arp_tbl, &nh->nh_gw, nh->nh_dev); > + if (n) > + state = n->nud_state; > + > + rcu_read_unlock_bh(); > + } > + > + return !!(state & NUD_VALID); > +} > > void fib_select_multipath(struct fib_result *res, int hash) > { > struct fib_info *fi = res->fi; > + struct net *net = fi->fib_net; > + unsigned char first_nhsel = 0; Looking at fib_table_lookup() res->nh_sel is not 0 in all cases. 
I personally don't like that we do not fallback properly but to make this logic more correct we can use something like this: bool first = false; > > for_nexthops(fi) { > if (hash > atomic_read(&nh->nh_upper_bound)) > continue; > > - res->nh_sel = nhsel; > - return; > + if (!net->ipv4.sysctl_fib_multipath_use_neigh || > + fib_good_nh(nh)) { > + res->nh_sel = nhsel; > + return; > + } > + if (!first_nhsel) > + first_nhsel = nhsel; if (!first) { res->nh_sel = nhsel; first = true; } > } endfor_nexthops(fi); > > - /* Race condition: route has just become dead. */ > - res->nh_sel = 0; > + res->nh_sel = first_nhsel; And then this is not needed anymore. Even setting to 0 was not needed because 0 is not better than current nh_sel when both are DEAD/LINKDOWN. Regards -- Julian Anastasov
Re: [PATCH] RDS: sync congestion map updating
(cc-ing netdev) On 3/30/2016 7:59 PM, Wengang Wang wrote: On 2016-03-31 09:51, Wengang Wang wrote: On 2016-03-31 01:16, santosh shilimkar wrote: Hi Wengang, On 3/30/2016 9:19 AM, Leon Romanovsky wrote: On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote: Problem is found that some among a lot of parallel RDS communications hang. In my test ten or so among 33 communications hang. The send requests got -ENOBUF error meaning the peer socket (port) is congested. But meanwhile, peer socket (port) is not congested. The congestion map updating can happen in two paths: one is in the rds_recvmsg path and the other is when it receives packets from the hardware. There is no synchronization when updating the congestion map. So a bit operation (clearing) in the rds_recvmsg path can be skipped by another bit operation (setting) in the hardware packet receiving path. To be more detailed: here, the two paths (user calls recvmsg and hardware receives data) are for different rds socks, thus the rds_sock->rs_recv_lock is not helpful to sync the updating on the congestion map. For archive purposes, let me try to conclude the thread. I synced with Wengang offlist and came up with the below fix. I was under the impression that __set_bit_le() was the atomic version. After fixing it like the patch (end of the email), the bug gets addressed. I will probably send this as a fix for stable as well. From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001 From: Santosh Shilimkar Date: Wed, 30 Mar 2016 23:26:47 -0700 Subject: [PATCH] RDS: Fix the atomicity for congestion map update Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via the receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Let's use atomics to avoid such an issue. Full credit to Wengang for finding the issue, analysing it and also pointing out the offending code with a spin lock based fix.
Signed-off-by: Wengang Wang Signed-off-by: Santosh Shilimkar --- net/rds/cong.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/cong.c b/net/rds/cong.c index e6144b8..6641bcf 100644 --- a/net/rds/cong.c +++ b/net/rds/cong.c @@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __set_bit_le(off, (void *)map->m_page_addrs[i]); + set_bit_le(off, (void *)map->m_page_addrs[i]); } void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) @@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __clear_bit_le(off, (void *)map->m_page_addrs[i]); + clear_bit_le(off, (void *)map->m_page_addrs[i]); } static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port) -- 1.7.1
[RFC v3 -next 0/2] virtio-net: Advised MTU feature
The following series adds the ability for a hypervisor to set an MTU on the guest during the feature negotiation phase. This is useful for VM orchestration when, for instance, tunneling is involved and the MTU of the various systems should be homogeneous. The first patch adds the feature bit as described in the proposed VIRTIO spec addition found at https://lists.oasis-open.org/archives/virtio-dev/201603/msg1.html The second patch adds a user of the bit, and a warning when the guest changes the MTU from the hypervisor-advised MTU. Future patches may add more thorough error handling. v2: * Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni * Additional test before printing a warning v3: * Removed the warning when changing MTU (which simplified the code) Aaron Conole (2): virtio: Start feature MTU support virtio_net: Read the advised MTU drivers/net/virtio_net.c| 8 include/uapi/linux/virtio_net.h | 3 +++ 2 files changed, 11 insertions(+) -- 2.5.5
[RFC v3 -next 2/2] virtio_net: Read the advised MTU
This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it exists, read the advised MTU and use it. No proper error handling is provided for the case where a user changes the negotiated MTU. A future commit will add proper error handling. Instead, a warning is emitted if the guest changes the device MTU after previously being given advice. Signed-off-by: Aaron Conole --- v2: * Whitespace cleanup in the last hunk * Code style change around the pr_warn * Additional test for mtu change before printing warning v3: * removed the mtu change warning drivers/net/virtio_net.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 49d84e5..2308083 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1450,6 +1450,7 @@ static const struct ethtool_ops virtnet_ethtool_ops = { static int virtnet_change_mtu(struct net_device *dev, int new_mtu) { + struct virtnet_info *vi = netdev_priv(dev); if (new_mtu < MIN_MTU || new_mtu > MAX_MTU) return -EINVAL; dev->mtu = new_mtu; @@ -1896,6 +1897,12 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ)) vi->has_cvq = true; + if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) { + dev->mtu = virtio_cread16(vdev, + offsetof(struct virtio_net_config, + mtu)); + } + if (vi->any_header_sg) dev->needed_headroom = vi->hdr_len; @@ -2081,6 +2088,7 @@ static unsigned int features[] = { VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, VIRTIO_NET_F_CTRL_MAC_ADDR, VIRTIO_F_ANY_LAYOUT, + VIRTIO_NET_F_MTU, }; static struct virtio_driver virtio_net_driver = { -- 2.5.5
[RFC v3 -net 1/2] virtio: Start feature MTU support
This commit adds the feature bit and associated mtu device entry for the virtio network device. Future commits will make use of these bits to support negotiated MTU. Signed-off-by: Aaron Conole --- v2,v3: * No change include/uapi/linux/virtio_net.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index ec32293..41a6a01 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -55,6 +55,7 @@ #define VIRTIO_NET_F_MQ22 /* Device supports Receive Flow * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ +#define VIRTIO_NET_F_MTU 25/* Device supports Default MTU Negotiation */ #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ @@ -73,6 +74,8 @@ struct virtio_net_config { * Legal values are between 1 and 0x8000 */ __u16 max_virtqueue_pairs; + /* Default maximum transmit unit advice */ + __u16 mtu; } __attribute__((packed)); /* -- 2.5.5
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Eric Dumazet Date: Fri, 01 Apr 2016 11:49:03 -0700 > For example, TCP stack tracks per socket ID generation, even if it > sends DF=1 frames. Damn useful for tcpdump analysis and drop > inference. Thanks for mentioning this, I never considered this use case. > With your change, the resulting GRO packet would propagate the ID of > first frag. Most GSO/GSO engines will then provide a ID sequence, > which might not match original packets. > > I do not particularly care, but it is worth mentioning that GRO+TSO > would not be idempotent anymore. Our eventual plan was to start emitting zero in the ID field for outgoing TCP datagrams with DF set, since the issue that caused us to generate incrementing IDs in the first place (buggy Microsoft SLHC compression) we decided is not relevant and important enough to accommodate any more. So outside of your TCP behavior analysis case, there isn't a compelling argument to keeping that code around any more, rather than just put zero in the ID field. I suppose we could keep the counter code around and allow it to be enabled using a sysctl or socket option, but how strongly do you really feel about this?
Re: [PATCH] net: mvneta: fix changing MTU when using per-cpu processing
From: Marcin Wojtas Date: Fri, 1 Apr 2016 15:21:18 +0200 > After enabling per-cpu processing it appeared that under heavy load > changing MTU can result in blocking all port's interrupts and transmitting > data is not possible after the change. > > This commit fixes above issue by disabling percpu interrupts for the > time, when TXQs and RXQs are reconfigured. > > Signed-off-by: Marcin Wojtas Applied, thanks. When I reviewed this I was worried that this was yet another case where the ndo op could be invoked in a potentially atomic or similar context, whereby on_each_cpu() would be illegal to use. But that appears to not be the case, and thus this change is just fine. Thanks.
Re: [v7, 4/5] powerpc/fsl: move mpc85xx.h to include/linux/fsl
On 03/31/2016 08:07 PM, Yangbo Lu wrote: > drivers/clk/clk-qoriq.c | 3 +-- > For clk part: Acked-by: Stephen Boyd -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
From: Daniel Borkmann Date: Fri, 1 Apr 2016 11:41:03 +0200 > Moreover, I noticed that when in the non-error path the __skb_pull() > is done and the original offset to mac header was non-zero, we fixup > from a wrong skb->data offset in the checksum complete processing. > > So the skb_postpush_rcsum() really needs to be done before __skb_pull() > where skb->data still points to the mac header start. Ugh, what a mess, are you sure any of this is right even after your change? What happens (outside of the csum part) is this: __skb_push(offset); __vlan_insert_tag(); { skb_push(VLAN_HLEN); ... memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN); } __skb_pull(offset); If I understand this correctly, the last pull will therefore put skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header pushed by __vlan_insert_tag(). That is assuming skb->data began right after the original ethernet header. To me, that postpull csum currently is absolutely in the correct spot, because it's acting upon the pull done by __vlan_insert_tag(), not the one done here by skb_vlan_push(). Right? Can you tell me how you tested this? Just curious...
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, 2016-04-01 at 11:05 -0700, Alexander Duyck wrote: > RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other > than fragmentation and reassembly. Currently we are looking at this field > as a way of identifying what frames can be aggregated and which cannot for > GRO. While this is valid for frames that do not have DF set, it is invalid > to do so if the bit is set. > > In addition we were generating IPv4 ID collisions when 2 or more flows were > interleaved over the same tunnel. To prevent that we store the result of > all IP ID checks via a "|=" instead of overwriting previous values. Note that for atomic datagrams (DF = 1), since fragmentation and reassembly cannot occur, maybe some people use the ID field for other purposes. For example, the TCP stack tracks per-socket ID generation, even if it sends DF=1 frames. Damn useful for tcpdump analysis and drop inference. With your change, the resulting GRO packet would propagate the ID of the first frag. Most GSO/TSO engines will then provide an ID sequence, which might not match original packets. I do not particularly care, but it is worth mentioning that GRO+TSO would not be idempotent anymore.
Re: [PATCH] net: mvpp2: use cache_line_size() to get cacheline size
From: Jisheng Zhang Date: Fri, 1 Apr 2016 17:11:05 +0800 > L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size > to determine the cacheline size in runtime. > > Signed-off-by: Jisheng Zhang > Suggested-by: Marcin Wojtas Applied.
Re: [PATCH 1/2] ipv6: rework the lock in addrconf_permanent_addr
From: roy.qing...@gmail.com Date: Fri, 1 Apr 2016 17:26:58 +0800 > From: Li RongQing > > 1. nothing of idev is changed, so read lock is enough > 2. ifp is changed, so used ifp->lock or cmpxchg to protect it > > Signed-off-by: Li RongQing You posted this patch twice and didn't post patch 2/2. I'm tossing this from patchwork, please resubmit this properly.
Re: [PATCH] net: mvneta: use cache_line_size() to get cacheline size
From: Jisheng Zhang Date: Fri, 1 Apr 2016 17:12:49 +0800 > L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size > to determine the cacheline size in runtime. > > Signed-off-by: Jisheng Zhang > Suggested-by: Marcin Wojtas Applied.
Re: [PATCH] bridge: remove br_dev_set_multicast_list
From: roy.qing...@gmail.com Date: Fri, 1 Apr 2016 16:16:10 +0800 > From: Li RongQing > > remove br_dev_set_multicast_list which does nothing > > Signed-off-by: Li RongQing This will break SIOCADDMULTI et al. on the bridge, see net/core/dev.c which checks whether this ndo OP is NULL or not. Please sufficiently grep the source tree on how an ndo operation is used before making changes like this. Thanks.
Re: [PATCH (net.git) 0/3] stmmac MDIO and normal descr fixes
From: Giuseppe Cavallaro Date: Fri, 1 Apr 2016 09:07:13 +0200 > This patch series is to fix the problems below and recently debugged > in this mailing list: > > o to fix a problem for the HW where the normal descriptor > o to fix the mdio registration according to the different > platform configurations > > I am resending all the patches again: built on top of net.git repo. Series applied, thanks.
Re: [PATCH] net: mvpp2: fix maybe-uninitialized warning
From: Jisheng Zhang Date: Thu, 31 Mar 2016 17:01:23 +0800 > This is to fix the following maybe-uninitialized warning: > > drivers/net/ethernet/marvell/mvpp2.c:6007:18: warning: 'err' may be > used uninitialized in this function [-Wmaybe-uninitialized] > > Signed-off-by: Jisheng Zhang Applied.
Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
On 04/01/2016 08:33 PM, David Miller wrote: From: Daniel Borkmann Date: Fri, 01 Apr 2016 10:10:11 +0200 Dave, do you need me to resubmit this one w/o changes: http://patchwork.ozlabs.org/patch/603903/ ? I'll apply this and queue it up for -stable, thanks. Ok, thanks!
Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
On 04/01/2016 08:10 PM, Alexei Starovoitov wrote: On 4/1/16 2:58 AM, Naveen N. Rao wrote: PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2. Enable with: echo 1 > /proc/sys/net/core/bpf_jit_enable or echo 2 > /proc/sys/net/core/bpf_jit_enable ... to see the generated JIT code. This can further be processed with tools/net/bpf_jit_disasm. With CONFIG_TEST_BPF=m and 'modprobe test_bpf': test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed] ... on both ppc64 BE and LE. The details of the approach are documented through various comments in the code, as are the TODOs. Some of the prominent TODOs include implementing BPF tail calls and skb loads. Cc: Matt Evans Cc: Michael Ellerman Cc: Paul Mackerras Cc: Alexei Starovoitov Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Signed-off-by: Naveen N. Rao --- arch/powerpc/include/asm/ppc-opcode.h | 19 +- arch/powerpc/net/Makefile | 4 + arch/powerpc/net/bpf_jit.h| 66 ++- arch/powerpc/net/bpf_jit64.h | 58 +++ arch/powerpc/net/bpf_jit_comp64.c | 828 ++ 5 files changed, 973 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/net/bpf_jit64.h create mode 100644 arch/powerpc/net/bpf_jit_comp64.c ... -#ifdef CONFIG_PPC64 +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2) impressive stuff! +1, awesome to see another one! Everything nicely documented. Could you add few words for the above condition as well ? Or may be a new macro, since it occurs many times? What are these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs ? Will there ever be v3 ? Minor TODO would also be to convert to use bpf_jit_binary_alloc() and bpf_jit_binary_free() API for the image, which is done by other eBPF jits, too. So far most of the bpf jits were going via net-next tree, but if in this case no changes to the core is necessary then I guess it's fine to do it via powerpc tree. What's your plan?
Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Daniel Borkmann Date: Fri, 01 Apr 2016 10:10:11 +0200 > Dave, do you need me to resubmit this one w/o changes: > http://patchwork.ozlabs.org/patch/603903/ ? I'll apply this and queue it up for -stable, thanks.
[PATCH] Marvell phy: add fiber status check for some components
From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001 From: Charles-Antoine Couret Date: Fri, 1 Apr 2016 16:16:35 +0200 Subject: [PATCH] Marvell phy: add fiber status check for some components This patch is not tested with all Marvell PHYs. The new function is activated only for tested PHYs. Signed-off-by: Charles-Antoine Couret --- drivers/net/phy/marvell.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index ab1d0fc..5ac186e 100644 --- a/drivers/net/phy/marvell.c +++ b/drivers/net/phy/marvell.c @@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev) return 0; } +/* marvell_read_fiber_status + * + * Some Marvell PHYs have two modes: fiber and copper. + * Both need status checked. + * Description: + * First, check the fiber link and status. + * If the fiber link is down, check the copper link and status which + * will be the default value if both links are down. + */ +static int marvell_read_fiber_status(struct phy_device *phydev) +{ + int err; + + /* Check the fiber mode first */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); + if (err < 0) + return err; + + err = marvell_read_status(phydev); + if (err < 0) + return err; + + if (phydev->link) + return 0; + + /* If fiber link is down, check and save copper mode state */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); + if (err < 0) + return err; + + return marvell_read_status(phydev); +} + static int marvell_aneg_done(struct phy_device *phydev) { int retval = phy_read(phydev, MII_M1011_PHY_STATUS); @@ -1122,7 +1155,7 @@ static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &m88e_config_init, .config_aneg = &marvell_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .resume = &genphy_resume, @@ -1270,7 +1303,7 @@
static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &marvell_config_init, .config_aneg = &m88e1510_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .did_interrupt = &m88e1121_did_interrupt, -- 2.5.5
Re: Question on rhashtable in worst-case scenario.
On 03/31/2016 05:46 PM, Herbert Xu wrote: On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote: Does removing this completely disable the "-EEXIST" error? I can't say I fully understand the elasticity stuff in __rhashtable_insert_fast(). What EEXIST error are you talking about? The only one that can be returned on insertion is if you're explicitly checking for dups which clearly can't be the case for you. If you're talking about the EEXIST error due to a rehash then it is completely hidden from you by rhashtable_insert_rehash. If you actually meant EBUSY then yes this should prevent it from occurring, unless your chain-length exceeds 2^32. EEXIST was on removal, and was a symptom of the failure to insert, not really a problem itself. I reverted my revert (i.e., back to rhashtable), added Johannes' patch to check insertion (and added my own pr_err there). I also added this: diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index 38ef0be..c25b945 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -66,6 +66,7 @@ static const struct rhashtable_params sta_rht_params = { .nelem_hint = 3, /* start small */ + .insecure_elasticity = true, /* Disable chain-length checks. */ .automatic_shrinking = true, .head_offset = offsetof(struct sta_info, hash_node), .key_offset = offsetof(struct sta_info, addr), Now, my test case seems to pass, though I did have one strange issue before I put in the pr_err. I'm not sure if it was a hashtable issue or something else, but I have lots of something-else going on in this system, so I'd say that likely the patch above fixes rhashtable for my use case. I will of course let you know if I run into more issues that appear to be hashtable related! Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
On 4/1/16 2:58 AM, Naveen N. Rao wrote: PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2. Enable with: echo 1 > /proc/sys/net/core/bpf_jit_enable or echo 2 > /proc/sys/net/core/bpf_jit_enable ... to see the generated JIT code. This can further be processed with tools/net/bpf_jit_disasm. With CONFIG_TEST_BPF=m and 'modprobe test_bpf': test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed] ... on both ppc64 BE and LE. The details of the approach are documented through various comments in the code, as are the TODOs. Some of the prominent TODOs include implementing BPF tail calls and skb loads. Cc: Matt Evans Cc: Michael Ellerman Cc: Paul Mackerras Cc: Alexei Starovoitov Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Signed-off-by: Naveen N. Rao --- arch/powerpc/include/asm/ppc-opcode.h | 19 +- arch/powerpc/net/Makefile | 4 + arch/powerpc/net/bpf_jit.h| 66 ++- arch/powerpc/net/bpf_jit64.h | 58 +++ arch/powerpc/net/bpf_jit_comp64.c | 828 ++ 5 files changed, 973 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/net/bpf_jit64.h create mode 100644 arch/powerpc/net/bpf_jit_comp64.c ... -#ifdef CONFIG_PPC64 +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2) impressive stuff! Everything nicely documented. Could you add a few words for the above condition as well? Or maybe a new macro, since it occurs many times? What do these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs? Will there ever be v3? So far most of the bpf jits were going via net-next tree, but if in this case no changes to the core are necessary then I guess it's fine to do it via powerpc tree. What's your plan?
[net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other than fragmentation and reassembly. Currently we are looking at this field as a way of identifying what frames can be aggregated and which cannot for GRO. While this is valid for frames that do not have DF set, it is invalid to do so if the bit is set. In addition we were generating IPv4 ID collisions when 2 or more flows were interleaved over the same tunnel. To prevent that we store the result of all IP ID checks via a "|=" instead of overwriting previous values. Signed-off-by: Alexander Duyck --- net/core/dev.c |1 + net/ipv4/af_inet.c | 23 --- net/ipv6/ip6_offload.c |3 --- 3 files changed, 17 insertions(+), 10 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 77a71cd68535..3429632398a4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb) unsigned long diffs; NAPI_GRO_CB(p)->flush = 0; + NAPI_GRO_CB(p)->flush_id = 0; if (hash != skb_get_hash_raw(p)) { NAPI_GRO_CB(p)->same_flow = 0; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9e481992dbae..7d8733393934 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1347,14 +1347,23 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, (iph->tos ^ iph2->tos) | ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)); - /* Save the IP ID check to be included later when we get to -* the transport layer so only the inner most IP ID is checked. -* This is because some GSO/TSO implementations do not -* correctly increment the IP ID for the outer hdrs. -*/ - NAPI_GRO_CB(p)->flush_id = - ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id); NAPI_GRO_CB(p)->flush |= flush; + + /* For non-atomic datagrams we need to save the IP ID offset +* to be included later. If the frame has the DF bit set +* we must ignore the IP ID value as per RFC 6864. 
+*/ + if (iph2->frag_off & htons(IP_DF)) + continue; + + /* We must save the offset as it is possible to have multiple +* flows using the same protocol and address pairs so we +* need to wait until we can validate this is part of the +* same flow with a 5-tuple or better to avoid unnecessary +* collisions between flows. +*/ + NAPI_GRO_CB(p)->flush_id |= ntohs(iph2->id) ^ + (u16)(id - NAPI_GRO_CB(p)->count); } NAPI_GRO_CB(skb)->flush |= flush; diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 82e9f3076028..9aa53f64dffd 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head, /* flush if Traffic Class fields are different */ NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF0)); NAPI_GRO_CB(p)->flush |= flush; - - /* Clear flush_id, there's really no concept of ID in IPv6. */ - NAPI_GRO_CB(p)->flush_id = 0; } NAPI_GRO_CB(skb)->flush |= flush;
[net PATCH 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck --- include/linux/netdevice.h |5 - net/core/dev.c|1 + net/ipv4/fou.c|6 ++ net/ipv4/gre_offload.c|8 net/ipv4/ip_gre.c | 13 ++--- 5 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cb0d5d09c2e4..8395308a2445 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2120,7 +2120,10 @@ struct napi_gro_cb { /* Used in foo-over-udp, set in udp[46]_gro_receive */ u8 is_ipv6:1; - /* 7 bit hole */ + /* Used in GRE, set in fou/gue_gro_receive */ + u8 is_fou:1; + + /* 6 bit hole */ /* used to support CHECKSUM_COMPLETE for tunneling protocols */ __wsum csum; diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe77d913..77a71cd68535 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff NAPI_GRO_CB(skb)->flush = 0; NAPI_GRO_CB(skb)->free = 0; NAPI_GRO_CB(skb)->encap_mark = 0; + NAPI_GRO_CB(skb)->is_fou = 0; NAPI_GRO_CB(skb)->gro_remcsum_start = 0; /* Setup for GRO checksum validation */ diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index 5a94aea280d3..a39068b4a4d9 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -203,6 +203,9 @@ static 
struct sk_buff **fou_gro_receive(struct sk_buff **head, */ NAPI_GRO_CB(skb)->encap_mark = 0; + /* Flag this frame as already having an outer encap header */ + NAPI_GRO_CB(skb)->is_fou = 1; + rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); @@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head, */ NAPI_GRO_CB(skb)->encap_mark = 0; + /* Flag this frame as already having an outer encap header */ + NAPI_GRO_CB(skb)->is_fou = 1; + rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[guehdr->proto_ctype]); diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index c47539d04b88..6a5bd4317866 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head, if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0) goto out; + /* We can only support GRE_CSUM if we can track the location of +* the GRE header. In the case of FOU/GUE we cannot because the +* outer UDP header displaces the GRE header leaving us in a state +* of limbo. +*/ + if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou) + goto out; + type = greh->protocol; rcu_read_lock(); diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 31936d387cfd..af5d1f38217f 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev) dev->hw_features|= GRE_FEATURES; if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) { - /* TCP offload with GRE SEQ is not supported. */ - dev->features|= NETIF_F_GSO_SOFTWARE; - dev->hw_features |= NETIF_F_GSO_SOFTWARE; + /* TCP offload with GRE SEQ is not supported, nor +* can we support 2 levels of outer headers requiring +* an update. 
+*/ + if (!(tunnel->parms.o_flags & TUNNEL_CSUM) || + (tunnel->encap.type == TUNNEL_ENCAP_NONE)) { + dev->features|= NETIF_F_GSO_SOFTWARE; + dev->hw_features |= NETIF_F_GSO_SOFTWARE; + } + /* Can use a lockless transmit, unless we generate * output sequences */
[net PATCH 0/2] Fixes for GRO and GRE tunnels
This pair of patches addresses a few issues I have discovered over the last week or so concerning GRO and GRE tunnels. The first patch addresses an item I called out as an issue with FOU/GUE encapsulating GRE, and I finally had a chance to test it and verify that the code concerning it was broken, so I took the opportunity to fix it so that we cannot generate a FOU/GUE frame that is encapsulating a GRE tunnel with checksum while requesting TSO/GSO for the frame. The second patch actually addresses something I realized was an issue if we feed a tunnel through GRO and back out through GSO. Specifically it was possible for GRO to generate overlapping IPv4 ID ranges as the outer IP IDs were being ignored for tunnels. Ignoring the IP IDs like this should only be valid if the DF bit is set. This is normally the case for IPIP, SIT, and GRE tunnels, but not so for UDP tunnels. In the case that the DF bit is not set we store off the fact that there was a delta from what we were expecting, and when we hit the inner-most header we validate the value so as to avoid generating a frame which could lead to an IP ID collision on packets that could eventually be fragmented. A side effect is that the inner-most IP ID test is relaxed as well, but the worst case scenario is that we GRO a frame with a throw-away ID sequence anyway, so if anything, segmenting such a frame with the wrong IP IDs should have no negative effects. --- Alexander Duyck (2): GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU ipv4/GRO: Make GRO conform to RFC 6864 include/linux/netdevice.h |5 - net/core/dev.c|2 ++ net/ipv4/af_inet.c| 23 --- net/ipv4/fou.c|6 ++ net/ipv4/gre_offload.c|8 net/ipv4/ip_gre.c | 13 ++--- net/ipv6/ip6_offload.c|3 --- 7 files changed, 46 insertions(+), 14 deletions(-) --
Re: [PATCH 4/4] samples/bpf: Enable powerpc support
On 4/1/16 7:41 AM, Naveen N. Rao wrote: On 2016/03/31 10:52AM, Alexei Starovoitov wrote: On 3/31/16 4:25 AM, Naveen N. Rao wrote: ... + +#ifdef __powerpc__ +#define BPF_KPROBE_READ_RET_IP(ip, ctx){ (ip) = (ctx)->link; } +#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) BPF_KPROBE_READ_RET_IP(ip, ctx) +#else +#define BPF_KPROBE_READ_RET_IP(ip, ctx) \ + bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)) +#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) \ + bpf_probe_read(&(ip), sizeof(ip), \ + (void *)(PT_REGS_FP(ctx) + sizeof(ip))) makes sense, but please use ({ }) gcc extension instead of {} and open call to make sure that macro body is scoped. To be sure I understand this right, do you mean something like this? + +#ifdef __powerpc__ +#define BPF_KPROBE_READ_RET_IP(ip, ctx)({ (ip) = (ctx)->link; }) +#define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP +#else +#define BPF_KPROBE_READ_RET_IP(ip, ctx)({ \ + bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); }) +#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) ({ \ + bpf_probe_read(&(ip), sizeof(ip), \ + (void *)(PT_REGS_FP(ctx) + sizeof(ip))); }) +#endif yes. Thanks!
Re: [PATCH 2/4] samples/bpf: Use llc in PATH, rather than a hardcoded value
On 4/1/16 7:37 AM, Naveen N. Rao wrote: On 2016/03/31 08:19PM, Daniel Borkmann wrote: On 03/31/2016 07:46 PM, Alexei Starovoitov wrote: On 3/31/16 4:25 AM, Naveen N. Rao wrote: clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ --O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@ +-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=obj -o $@ clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ --O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s +-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=asm -o $@.s that was a workaround when clang/llvm didn't have bpf support. Now clang 3.7 and 3.8 have bpf built-in, so make sense to remove manual calls to llc completely. Just use 'clang -target bpf -O2 -D... -c $< -o $@' +1, the clang part in that Makefile should also more correctly be called with '-target bpf' as it turns out (despite llc with '-march=bpf' ...). Better to use clang directly as suggested by Alexei. I'm likely missing something obvious, but I cannot get this to work. With this diff: $(obj)/%.o: $(src)/%.c clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ - -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@ - clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ - -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ - -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s + -O2 -target bpf -c $< -o $@ I see far too many errors thrown starting with: ./arch/x86/include/asm/arch_hweight.h:31:10: error: invalid output constraint '=a' in asm : "="REG_OUT (res) ahh. yes. when processing kernel headers clang has to assume x86 style inline asm, though all of these functions will be ignored. I don't have a quick fix for this yet. 
Let's go back to your original change $(LLC)->llc
[Odd commit author id merge via netdev]
Hi Dave, I noticed something odd while checking whether recent commits of mine in the kernel.org tree made it via netdev. Don't know if it's the patchwork tool doing this. Usual author line in my git objects: Author: Santosh Shilimkar But the commits going via your tree seem to be like below.. Author: email-id A few more examples of the commits are at the end of the email. Can this be fixed for future commits? The git objects you pulled from my tree directly have the right author format, whereas the ones picked from patchwork seem to be odd. Regards, Santosh commit ad6832f950d35df8c70b577993a24b31b34d88e4 Author: santosh.shilim...@oracle.com commit 2cb2912d65633e751d3f8397377174501412aa47 Author: santosh.shilim...@oracle.com commit db42753adb638b63572583162bb08ea193947309 Author: santosh.shilim...@oracle.com [] commit 06766513232d1619ac84e87b1d839d3fcc23a540 Author: Santosh Shilimkar commit 41a4e9646229801624e38f7a1cc53033a0affdb1 Author: Santosh Shilimkar
[PATCH v2] sctp: use list_* in sctp_list_dequeue
Use list_* helpers in sctp_list_dequeue, more readable. Signed-off-by: Marcelo Ricardo Leitner --- v2: patch rechecked include/net/sctp/sctp.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h index 65521cfdcadeee35d61f280165a387cc2164ab6d..03fb33efcae21d54192204629ff4ced2e36e7d4d 100644 --- a/include/net/sctp/sctp.h +++ b/include/net/sctp/sctp.h @@ -386,11 +386,9 @@ static inline struct list_head *sctp_list_dequeue(struct list_head *list) { struct list_head *result = NULL; - if (list->next != list) { + if (!list_empty(list)) { result = list->next; - list->next = result->next; - list->next->prev = list; - INIT_LIST_HEAD(result); + list_del_init(result); } return result; } -- 2.5.0
Re: qdisc spin lock
2016-03-31 19:19 GMT-07:00 David Miller : > From: Michael Ma > Date: Thu, 31 Mar 2016 16:48:43 -0700 > >> I didn't really know that multiple qdiscs can be isolated using MQ so > ... > > Please stop top-posting. Sorry that I wasn't aware of this...
Re: [PATCH] Marvell phy: add fiber status check for some components
On Fri, Apr 01, 2016 at 06:33:48PM +0200, Charles-Antoine Couret wrote: > >From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001 > From: Charles-Antoine Couret > Date: Fri, 1 Apr 2016 16:16:35 +0200 > Subject: [PATCH] Marvell phy: add fiber status check for some components > > This patch is not tested with all Marvell's phy. The new function was actived > only for tested phys. > > Signed-off-by: Charles-Antoine Couret > --- > drivers/net/phy/marvell.c | 37 +++-- > 1 file changed, 35 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c > index ab1d0fc..5ac186e 100644 > --- a/drivers/net/phy/marvell.c > +++ b/drivers/net/phy/marvell.c > @@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev) > return 0; > } > > +/* marvell_read_fiber_status > + * > + * Some Marvell's phys have two modes: fiber and copper. > + * Both need status checked. > + * Description: > + * First, check the fiber link and status. > + * If the fiber link is down, check the copper link and status which > + * will be the default value if both link are down. > + */ > +static int marvell_read_fiber_status(struct phy_device *phydev) > +{ > + int err; > + > + /* Check the fiber mode first */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); > + if (err < 0) > + return err; > + > + err = marvell_read_status(phydev); > + if (err < 0) > + return err; > + > + if (phydev->link) > + return 0; > + > + /* If fiber link is down, check and save copper mode state */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + if (err < 0) > + return err; > + > + return marvell_read_status(phydev); > +} Hi Charles Shouldn't you return to page 0, i.e. MII_M_COPPER, under all conditions? Andrew
[PATCH v2] sctp: flush if we can't fit another DATA chunk
There is no point in delaying the packet if we can't fit a single byte of data on it anymore. So let's just reduce the threshold by the amount that a data chunk with 4 bytes (rounding) would use. v2: based on the right tree Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/output.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sctp/output.c b/net/sctp/output.c index 97745351d58c2fb32b9f9b57d61831d7724d83b2..9844fe573029b9e262743440980f15277ddaf5a1 100644 --- a/net/sctp/output.c +++ b/net/sctp/output.c @@ -705,7 +705,8 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet, /* Check whether this chunk and all the rest of pending data will fit * or delay in hopes of bundling a full sized packet. */ - if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead) + if (chunk->skb->len + q->out_qlen > + transport->pathmtu - packet->overhead - sizeof(sctp_data_chunk_t) - 4) /* Enough data queued to fill a packet */ return SCTP_XMIT_OK; -- 2.5.0
Re: [PATCH 3/4] net: w5100: enable to support sleepable register access interface
2016-04-01 4:30 GMT+09:00 David Miller : > From: Akinobu Mita > Date: Thu, 31 Mar 2016 01:38:39 +0900 > >> + struct sk_buff_head tx_queue; > > The way the queueing works in this driver is that it is only possible > to have one SKB being transmitted at one time. > > This is evident by how the driver immediately stops the TX queue when > it is given a new packet to transmit, and this is woken up by the TX > completion IRQ. > > So don't use a queue here, just use a plain single pointer. > > The SKB queue you're using here is going to also do locking which is > even more unnecessary overhead. Thanks for spotting this. Using a single pointer works fine. Maybe we can support sending multiple packets at a time, but it should be another separate patch.
[PATCH] Marvell phy: add fiber status check for some components
>From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001 From: Charles-Antoine Couret Date: Fri, 1 Apr 2016 16:16:35 +0200 Subject: [PATCH] Marvell phy: add fiber status check for some components This patch has not been tested with all Marvell PHYs. The new function is activated only for tested PHYs. Signed-off-by: Charles-Antoine Couret --- drivers/net/phy/marvell.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index ab1d0fc..5ac186e 100644 --- a/drivers/net/phy/marvell.c +++ b/drivers/net/phy/marvell.c @@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev) return 0; } +/* marvell_read_fiber_status + * + * Some Marvell PHYs have two modes: fiber and copper. + * Both need their status checked. + * Description: + * First, check the fiber link and status. + * If the fiber link is down, check the copper link and status, which + * will be the default value if both links are down. + */ +static int marvell_read_fiber_status(struct phy_device *phydev) +{ + int err; + + /* Check the fiber mode first */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); + if (err < 0) + return err; + + err = marvell_read_status(phydev); + if (err < 0) + return err; + + if (phydev->link) + return 0; + + /* If fiber link is down, check and save copper mode state */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); + if (err < 0) + return err; + + return marvell_read_status(phydev); +} + static int marvell_aneg_done(struct phy_device *phydev) { int retval = phy_read(phydev, MII_M1011_PHY_STATUS); @@ -1122,7 +1155,7 @@ static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &m88e_config_init, .config_aneg = &marvell_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .resume = &genphy_resume, @@ -1270,7 +1303,7 @@ 
static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &marvell_config_init, .config_aneg = &m88e1510_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .did_interrupt = &m88e1121_did_interrupt, -- 2.5.5
Re: [PATCH v2 net-next 11/11] tcp: rate limit ACK sent by SYN_RECV request sockets
On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet wrote: > Attackers like to use SYNFLOOD targeting one 5-tuple, as they > hit a single RX queue (and cpu) on the victim. > > If they use random sequence numbers in their SYN, we detect > they do not match the expected window and send back an ACK. > > This patch adds a rate limitation, so that the effect of such > attacks is limited to ingress only. > > We roughly double our ability to absorb such attacks. Thanks, Eric! Acked-by: Neal Cardwell neal
[PATCH v2 net-next 03/11] tcp/dccp: remove BH disable/enable in lookup
Since linux 2.6.29, lookups only use rcu locking. Signed-off-by: Eric Dumazet --- include/net/inet_hashtables.h | 7 +-- net/ipv6/inet6_hashtables.c | 2 -- 2 files changed, 1 insertion(+), 8 deletions(-) diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 50f635c2c536..a77acee93aaf 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -280,11 +280,8 @@ static inline struct sock *inet_lookup_listener(struct net *net, net_eq(sock_net(__sk), (__net))) #endif /* 64-bit arch */ -/* - * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need +/* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need * not check it for lookups anymore, thanks Alexey. -DaveM - * - * Local BH must be disabled here. */ struct sock *__inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo, @@ -326,10 +323,8 @@ static inline struct sock *inet_lookup(struct net *net, { struct sock *sk; - local_bh_disable(); sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr, dport, dif); - local_bh_enable(); return sk; } diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 70f2628be6fa..d253f32874c9 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -200,10 +200,8 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo, { struct sock *sk; - local_bh_disable(); sk = __inet6_lookup(net, hashinfo, skb, doff, saddr, sport, daddr, ntohs(dport), dif); - local_bh_enable(); return sk; } -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 01/11] net: add SOCK_RCU_FREE socket flag
We want a generic way to insert an RCU grace period before socket freeing for cases where SLAB_DESTROY_BY_RCU is adding too much overhead. SLAB_DESTROY_BY_RCU strict rules force us to take a reference on the socket sk_refcnt, and it is a performance problem for UDP encapsulation, or TCP synflood behavior, as many CPUs might attempt the atomic operations on a shared sk_refcnt. UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their lookup can use traditional RCU rules, without refcount changes. They can set the flag only once hashed and visible by other cpus. Signed-off-by: Eric Dumazet Cc: Tom Herbert --- include/net/sock.h | 2 ++ net/core/sock.c| 14 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/net/sock.h b/include/net/sock.h index 255d3e03727b..c88785a3e76c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -438,6 +438,7 @@ struct sock { struct sk_buff *skb); void(*sk_destruct)(struct sock *sk); struct sock_reuseport __rcu *sk_reuseport_cb; + struct rcu_head sk_rcu; }; #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data))) @@ -720,6 +721,7 @@ enum sock_flags { */ SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */ SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */ + SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ }; #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)) diff --git a/net/core/sock.c b/net/core/sock.c index b67b9aedb230..238a94f879ca 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority, } EXPORT_SYMBOL(sk_alloc); -void sk_destruct(struct sock *sk) +/* Sockets having SOCK_RCU_FREE will call this function after one RCU + * grace period. This is the case for UDP sockets and TCP listeners. 
+ */ +static void __sk_destruct(struct rcu_head *head) { + struct sock *sk = container_of(head, struct sock, sk_rcu); struct sk_filter *filter; if (sk->sk_destruct) @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk) sk_prot_free(sk->sk_prot_creator, sk); } +void sk_destruct(struct sock *sk) +{ + if (sock_flag(sk, SOCK_RCU_FREE)) + call_rcu(&sk->sk_rcu, __sk_destruct); + else + __sk_destruct(&sk->sk_rcu); +} + static void __sk_free(struct sock *sk) { if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt)) -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 10/11] ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply()
TCP uses per cpu 'sockets' to send some packets : - RST packets ( tcp_v4_send_reset()) ) - ACK packets for SYN_RECV and TIMEWAIT sockets By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree() to not call sk_write_space() since these internal sockets do not care. This gives a small performance improvement, merely by allowing cpu to properly predict the sock_wfree() conditional branch, and avoiding one atomic operation. Signed-off-by: Eric Dumazet --- net/ipv4/tcp_ipv4.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index f3ce0afe70aa..456ff3d6a132 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2384,6 +2384,7 @@ static int __net_init tcp_sk_init(struct net *net) IPPROTO_TCP, net); if (res) goto fail; + sock_set_flag(sk, SOCK_USE_WRITE_QUEUE); *per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk; } -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 00/11] net: various udp/tcp changes
First round of patches for linux-4.7 Add a generic facility for sockets to be freed after an RCU grace period, if they need to. Then UDP stack is changed to no longer use SLAB_DESTROY_BY_RCU, in order to speed up rx processing for traffic encapsulated in UDP. It gives a 17 % speedup for normal UDP reception in stress conditions. Then TCP listeners are changed to use SOCK_RCU_FREE as well to avoid touching sk_refcnt in the synflood case: I got up to 30 % performance increase for a mono listener. Then three patches add SK_MEMINFO_DROPS to sock_diag and add per socket rx drops accounting to TCP. Last patch adds rate limiting on ACKs sent on behalf of SYN_RECV to better resist SYNFLOOD targeting one or a few flows. Eric Dumazet (11): net: add SOCK_RCU_FREE socket flag udp: no longer use SLAB_DESTROY_BY_RCU tcp/dccp: remove BH disable/enable in lookup tcp/dccp: use rcu locking in inet_diag_find_one_icsk() inet: reqsk_alloc() needs to take care of dead listeners tcp/dccp: do not touch listener sk_refcnt under synflood sock_diag: add SK_MEMINFO_DROPS tcp: increment sk_drops for dropped rx packets tcp: increment sk_drops for listeners ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply() tcp: rate limit ACK sent by SYN_RECV request sockets include/linux/udp.h| 8 +- include/net/inet6_hashtables.h | 12 +- include/net/inet_hashtables.h | 47 +++ include/net/request_sock.h | 31 +++-- include/net/sock.h | 21 ++- include/net/tcp.h | 13 ++ include/net/udp.h | 2 +- include/uapi/linux/sock_diag.h | 1 + net/core/sock.c| 15 ++- net/core/sock_diag.c | 1 + net/dccp/ipv4.c| 7 +- net/dccp/ipv6.c| 7 +- net/ipv4/inet_diag.c | 10 +- net/ipv4/inet_hashtables.c | 77 --- net/ipv4/tcp_input.c | 41 +++--- net/ipv4/tcp_ipv4.c| 74 ++- net/ipv4/tcp_minisocks.c | 5 +- net/ipv4/udp.c | 293 - net/ipv4/udp_diag.c| 18 +-- net/ipv6/inet6_hashtables.c| 62 +++-- net/ipv6/tcp_ipv6.c| 32 +++-- net/ipv6/udp.c | 196 +-- net/netfilter/xt_socket.c | 6 +- 23 files changed, 401 insertions(+), 578 deletions(-) 
-- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 05/11] inet: reqsk_alloc() needs to take care of dead listeners
We'll soon no longer take a refcount on listeners, so reqsk_alloc()
can not assume a listener refcount is not zero. We need to use
atomic_inc_not_zero().

Signed-off-by: Eric Dumazet
---
 include/net/request_sock.h | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index f49759decb28..6ebe13eb1c4c 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -85,24 +85,23 @@ reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
 	struct request_sock *req;
 
 	req = kmem_cache_alloc(ops->slab, GFP_ATOMIC | __GFP_NOWARN);
-
-	if (req) {
-		req->rsk_ops = ops;
-		if (attach_listener) {
-			sock_hold(sk_listener);
-			req->rsk_listener = sk_listener;
-		} else {
-			req->rsk_listener = NULL;
+	if (!req)
+		return NULL;
+	req->rsk_listener = NULL;
+	if (attach_listener) {
+		if (unlikely(!atomic_inc_not_zero(&sk_listener->sk_refcnt))) {
+			kmem_cache_free(ops->slab, req);
+			return NULL;
 		}
-		req_to_sk(req)->sk_prot = sk_listener->sk_prot;
-		sk_node_init(&req_to_sk(req)->sk_node);
-		sk_tx_queue_clear(req_to_sk(req));
-		req->saved_syn = NULL;
-		/* Following is temporary. It is coupled with debugging
-		 * helpers in reqsk_put() & reqsk_free()
-		 */
-		atomic_set(&req->rsk_refcnt, 0);
+		req->rsk_listener = sk_listener;
 	}
+	req->rsk_ops = ops;
+	req_to_sk(req)->sk_prot = sk_listener->sk_prot;
+	sk_node_init(&req_to_sk(req)->sk_node);
+	sk_tx_queue_clear(req_to_sk(req));
+	req->saved_syn = NULL;
+	atomic_set(&req->rsk_refcnt, 0);
+
 	return req;
 }
-- 
2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 06/11] tcp/dccp: do not touch listener sk_refcnt under synflood
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus
contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure, we can relax
TCP_LISTEN lookup rules and avoid touching sk_refcnt.

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :
On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet
---
 include/net/inet6_hashtables.h | 12 ---
 include/net/inet_hashtables.h  | 40 ++-
 net/dccp/ipv4.c                |  7 ++--
 net/dccp/ipv6.c                |  7 ++--
 net/ipv4/inet_diag.c           |  3 +-
 net/ipv4/inet_hashtables.c     | 73 +++---
 net/ipv4/tcp_ipv4.c            | 66 +++---
 net/ipv6/inet6_hashtables.c    | 56 +---
 net/ipv6/tcp_ipv6.c            | 27 +---
 net/netfilter/xt_socket.c      |  6 ++--
 10 files changed, 134 insertions(+), 163 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 28332bdac333..b87becacd9d3 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -66,13 +66,15 @@ static inline struct sock *__inet6_lookup(struct net *net,
 					  const __be16 sport,
 					  const struct in6_addr *daddr,
 					  const u16 hnum,
-					  const int dif)
+					  const int dif,
+					  bool *refcounted)
 {
 	struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
 						sport, daddr, hnum, dif);
+	*refcounted = true;
 	if (sk)
 		return sk;
-
+	*refcounted = false;
 	return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
 				     daddr, hnum, dif);
 }
@@ -81,17 +83,19 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      struct sk_buff *skb, int doff,
 					      const __be16 sport,
 					      const __be16 dport,
-					      int iif)
+					      int iif,
+					      bool *refcounted)
 {
 	struct sock *sk = skb_steal_sock(skb);
 
+	*refcounted = true;
 	if (sk)
 		return sk;
 
 	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
			      doff, &ipv6_hdr(skb)->saddr, sport,
			      &ipv6_hdr(skb)->daddr, ntohs(dport),
-			      iif);
+			      iif, refcounted);
 }
 
 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a77acee93aaf..0574493e3899 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -100,14 +100,10 @@ struct inet_bind_hashbucket {
 
 /*
  * Sockets can be hashed in established or listening table
- * We must use different 'nulls' end-of-chain value for listening
- * hash table, or we might find a socket that was closed and
- * reallocated/inserted into established hash table
 */
-#define LISTENING_NULLS_BASE (1U << 29)
 struct inet_listen_hashbucket {
 	spinlock_t		lock;
-	struct hlist_nulls_head	head;
+	struct hlist_head	head;
 };
 
 /* This is for listening sockets, thus all sockets which possess wildcards. */
@@ -304,14 +300,20 @@ static inline struct sock *__inet_lookup(struct net *net,
 					 struct sk_buff *skb, int doff,
 					 const __be32 saddr, const __be16 sport,
 					 const __be32 daddr, const __be16 dport,
-					 const int dif)
+					 const int dif,
+					 bool *refcounted)
 {
 	u16 hnum = ntohs(dport);
-	struct sock *sk = __inet_lookup_established(net, hashinfo,
-				saddr, sport, daddr, hnum, dif);
+	struct sock *sk;
 
-	return sk ? : __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
-					     sport, daddr, hnum, dif);
+	sk = __inet_lookup_established(net, hashinfo, saddr, sport,
+				       daddr, hnum, dif);
+	*refc
[PATCH v2 net-next 02/11] udp: no longer use SLAB_DESTROY_BY_RCU
Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a
grace period before freeing a socket. UDP sockets are not short lived
in the high usage case, so the added cost of call_rcu() should not be
a concern.

This actually removes a lot of complexity in UDP stack.

Multicast receives no longer need to hold a bucket spinlock.

Note that ip early demux still needs to take a reference on the
socket. Same remark for functions used by xt_socket and xt_PROXY
netfilter modules, but this might be changed later.

Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.

Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.

v2: Addressed Willem de Bruijn feedback in multicast handling
 - keep early demux break in __udp4_lib_demux_lookup()

Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Willem de Bruijn
---
 include/linux/udp.h |   8 +-
 include/net/sock.h  |  12 +--
 include/net/udp.h   |   2 +-
 net/ipv4/udp.c      | 293
 net/ipv4/udp_diag.c |  18 ++--
 net/ipv6/udp.c      | 196 ---
 6 files changed, 171 insertions(+), 358 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 87c094961bd5..32342754643a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
 	return udp_sk(sk)->no_check6_rx;
 }
 
-#define udp_portaddr_for_each_entry(__sk, node, list) \
-	hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node)
+#define udp_portaddr_for_each_entry(__sk, list) \
+	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
-#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
-	hlist_nulls_for_each_entry_rcu(__sk, node, list, __sk_common.skc_portaddr_node)
+#define udp_portaddr_for_each_entry_rcu(__sk, list) \
+	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
 #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
diff --git a/include/net/sock.h b/include/net/sock.h
index c88785a3e76c..c3a707d1cee8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -178,7 +178,7 @@ struct sock_common {
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
-		struct hlist_nulls_node skc_portaddr_node;
+		struct hlist_node	skc_portaddr_node;
 	};
 	struct proto		*skc_prot;
 	possible_net_t		skc_net;
@@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
 	hlist_for_each_entry(__sk, list, sk_bind_node)
 
 /**
- * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset
+ * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
 * @tpos:	the type * to use as a loop cursor.
 * @pos:	the &struct hlist_node to use as a loop cursor.
 * @head:	the head for your list.
 * @offset:	offset of hlist_node within the struct.
 *
 */
-#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)	\
-	for (pos = (head)->first;					\
-	     (!is_a_nulls(pos)) &&					\
+#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)		\
+	for (pos = rcu_dereference((head)->first);			\
+	     pos != NULL &&						\
	      ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});	\
-	     pos = pos->next)
+	     pos = rcu_dereference(pos->next))
 
 static inline struct user_namespace *sk_user_ns(struct sock *sk)
 {
diff --git a/include/net/udp.h b/include/net/udp.h
index 92927f729ac8..d870ec1611c4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -59,7 +59,7 @@ struct udp_skb_cb {
 *	@lock:	spinlock protecting changes to head/count
 */
 struct udp_hslot {
-	struct hlist_nulls_head	head;
+	struct hlist_head	head;
 	int			count;
 	spinlock_t		lock;
 } __attribute__((aligned(2 * sizeof(long))));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e16df0..0475aaf95040 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 			       unsigned int log)
 {
 	struct sock *sk2;
-	struct hlist_nulls_node *node;
 	kuid_t uid = sock_i_uid(sk);
 
-	sk_nulls_for_each(sk2, node, &hslot->head) {
+	sk_for_each(sk2, &hslot->head) {
 		if (net_eq(sock_net(sk2), net) &&
 		    sk2 != sk &&
 		    (bitmap || udp_sk(sk2)->udp_port_hash == num
[PATCH v2 net-next 04/11] tcp/dccp: use rcu locking in inet_diag_find_one_icsk()
RX packet processing holds rcu_read_lock(), so we can remove pairs of
rcu_read_lock()/rcu_read_unlock() in lookup functions if inet_diag
also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment refcount on the
found listener.

Signed-off-by: Eric Dumazet
---
 net/ipv4/inet_diag.c        | 7 +--
 net/ipv4/inet_hashtables.c  | 4
 net/ipv6/inet6_hashtables.c | 4
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 5fdb02f5598e..ea8df527b279 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -356,6 +356,7 @@ struct sock *inet_diag_find_one_icsk(struct net *net,
 {
 	struct sock *sk;
 
+	rcu_read_lock();
 	if (req->sdiag_family == AF_INET)
 		sk = inet_lookup(net, hashinfo, NULL, 0, req->id.idiag_dst[0],
 				 req->id.idiag_dport, req->id.idiag_src[0],
@@ -376,9 +377,11 @@ struct sock *inet_diag_find_one_icsk(struct net *net,
 				  req->id.idiag_if);
 	}
 #endif
-	else
+	else {
+		rcu_read_unlock();
 		return ERR_PTR(-EINVAL);
-
+	}
+	rcu_read_unlock();
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..387338d71dcd 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -220,7 +220,6 @@ struct sock *__inet_lookup_listener(struct net *net,
 	bool select_ok = true;
 	u32 phash = 0;
 
-	rcu_read_lock();
 begin:
 	result = NULL;
 	hiscore = 0;
@@ -269,7 +268,6 @@ found:
 			goto begin;
 		}
 	}
-	rcu_read_unlock();
 	return result;
 }
 EXPORT_SYMBOL_GPL(__inet_lookup_listener);
@@ -312,7 +310,6 @@ struct sock *__inet_lookup_established(struct net *net,
 	unsigned int slot = hash & hashinfo->ehash_mask;
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
-	rcu_read_lock();
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
 		if (sk->sk_hash != hash)
@@ -339,7 +336,6 @@ begin:
 out:
 	sk = NULL;
 found:
-	rcu_read_unlock();
 	return sk;
 }
 EXPORT_SYMBOL_GPL(__inet_lookup_established);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index d253f32874c9..e6ef6ce1ed74 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -69,7 +69,6 @@ struct sock *__inet6_lookup_established(struct net *net,
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
-	rcu_read_lock();
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
 		if (sk->sk_hash != hash)
@@ -90,7 +89,6 @@ begin:
 out:
 	sk = NULL;
 found:
-	rcu_read_unlock();
 	return sk;
 }
 EXPORT_SYMBOL(__inet6_lookup_established);
@@ -138,7 +136,6 @@ struct sock *inet6_lookup_listener(struct net *net,
 	unsigned int hash = inet_lhashfn(net, hnum);
 	struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
 
-	rcu_read_lock();
 begin:
 	result = NULL;
 	hiscore = 0;
@@ -187,7 +184,6 @@ found:
 			goto begin;
 		}
 	}
-	rcu_read_unlock();
 	return result;
 }
 EXPORT_SYMBOL_GPL(inet6_lookup_listener);
-- 
2.8.0.rc3.226.g39d4020