Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
On 2015-09-16 08:27, Damian Hobson-Garcia wrote:
> From: Martin Sustrik
>
> When implementing network protocols in user space, one has to implement
> fake file descriptors to represent the sockets for the protocol.
> Polling on such fake file descriptors is a problem (poll/select/epoll
> accept only true file descriptors) and forces protocol implementers to use
> various workarounds resulting in complex, non-standard and convoluted APIs.
>
> More generally, ability to create full-blown file descriptors for
> userspace-to-userspace signalling is missing. While eventfd(2) goes half
> the way towards this goal it has following shortcomings:
>
> I.  There's no way to signal POLLPRI, POLLHUP etc.
> II. There's no way to signal arbitrary combination of POLL* flags. Most
>     notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
>     combination for a network protocol (rx buffer is empty and tx buffer is
>     full), cannot be signaled using eventfd.
>
> This patch implements new EFD_MASK flag which solves the above problems.
>
> Additionally, to provide a way to associate user-space state with eventfd
> object, it allows to attach user-space data to the file descriptor.

The above paragraph is a leftover from the past. The functionality no
longer exists.

> The semantics of EFD_MASK are as follows:
>
> eventfd(2):
>
> If eventfd is created with EFD_MASK flag set, it is initialised in such a
> way as to signal no events on the file descriptor when it is polled on.
> The 'initval' argument is ignored.
>
> write(2):
>
> User is allowed to write only buffers containing the following structure:
>
> struct efd_mask {
> 	uint32_t events;
> };

Is it worth having a struct here? Why not just uint32_t?

Martin

> The value of 'events' should be any combination of event flags as defined
> by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
> events will be signaled when polling (select, poll, epoll) on the eventfd
> is done later on.
>
> read(2):
>
> read is not supported and will fail with EINVAL.
>
> select(2), poll(2) and similar:
>
> When polling on the eventfd marked by EFD_MASK flag, all the events
> specified in last written 'events' field shall be signaled.
>
> Signed-off-by: Martin Sustrik
> [dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
> Signed-off-by: Damian Hobson-Garcia
> ---
>  fs/eventfd.c                 | 102 ++-
>  include/linux/eventfd.h      |  16 +--
>  include/uapi/linux/eventfd.h |  39 +
>  3 files changed, 132 insertions(+), 25 deletions(-)
>  create mode 100644 include/uapi/linux/eventfd.h
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8d0c0df..1a6a066 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -2,6 +2,7 @@
>   *  fs/eventfd.c
>   *
>   *  Copyright (C) 2007  Davide Libenzi
> + *  Copyright (C) 2013  Martin Sustrik
>   *
>   */
>
> @@ -22,18 +23,31 @@
>  #include
>  #include
>
> +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
> +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
> +
>  struct eventfd_ctx {
>  	struct kref kref;
>  	wait_queue_head_t wqh;
> -	/*
> -	 * Every time that a write(2) is performed on an eventfd, the
> -	 * value of the __u64 being written is added to "count" and a
> -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> -	 * value to userspace, and will reset "count" to zero. The kernel
> -	 * side eventfd_signal() also, adds to the "count" counter and
> -	 * issue a wakeup.
> -	 */
> -	__u64 count;
> +	union {
> +		/*
> +		 * Every time that a write(2) is performed on an eventfd, the
> +		 * value of the __u64 being written is added to "count" and a
> +		 * wakeup is performed on "wqh". A read(2) will return the
> +		 * "count" value to userspace, and will reset "count" to zero.
> +		 * The kernel side eventfd_signal() also, adds to the "count"
> +		 * counter and issue a wakeup.
> +		 */
> +		__u64 count;
> +
> +		/*
> +		 * When using eventfd in EFD_MASK mode this structure stores the
> +		 * current events to be signaled on the eventfd (events member)
> +		 * along with opaque user-defined data (data member).
> +		 */
> +		struct efd_mask mask;
> +	};
>  	unsigned int flags;
>  };
>
> @@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
>  	return events;
>  }
>
> +static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +	return ctx->mask.events;
> +}
> +
>  static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
>  {
>  	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> @@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct fi
request for stable inclusion
Hi Dave,

Commit 9293267 "net/mlx4_core: Capping number of requested MSIXs to
MAX_MSIX" fixes a bug where the driver fails to start on a machine with
more than 32 cores.

The bug was introduced in 4.2-rc1, but the fix missed 4.2. Could you
please push it to 4.2 -stable? If you would prefer that we submit it there
directly, that is fine too.

thanks,

Or.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next 2/2] bpf: add bpf_redirect() helper
On 15-09-15 11:05 PM, Alexei Starovoitov wrote:
> Existing bpf_clone_redirect() helper clones skb before redirecting
> it to RX or TX of destination netdev.
> Introduce bpf_redirect() helper that does that without cloning.
>
> Benchmarked with two hosts using 10G ixgbe NICs.
> One host is doing line rate pktgen.
> Another host is configured as:
> $ tc qdisc add dev $dev ingress
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>    action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
> so it receives the packet on $dev and immediately xmits it on $dev + 1
> The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
> that does bpf_clone_redirect() and performance is 2.0 Mpps
>
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>    action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
> which is using bpf_redirect() - 2.4 Mpps
>
> and using cls_bpf with integrated actions as:
> $ tc filter add dev $dev root pref 10 \
>   bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
> performance is 2.5 Mpps
>
> To summarize:
> u32+act_bpf using clone_redirect - 2.0 Mpps
> u32+act_bpf using redirect - 2.4 Mpps
> cls_bpf using redirect - 2.5 Mpps
>
> For comparison linux bridge in this setup is doing 2.1 Mpps
> and ixgbe rx + drop in ip_rcv - 7.8 Mpps
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> ---
> This approach is using per_cpu scratch area to store ifindex and flags.
> The other alternatives discussed at plumbers are slower and more intrusive.
>
> v1->v2: dropped redundant iff_up check
>
>  include/net/sch_generic.h    |    1 +
>  include/uapi/linux/bpf.h     |    8
>  include/uapi/linux/pkt_cls.h |    1 +
>  net/core/dev.c               |    8
>  net/core/filter.c            |   44 ++
>  net/sched/act_bpf.c          |    1 +
>  net/sched/cls_bpf.c          |    1 +
>  samples/bpf/bpf_helpers.h    |    4
>  samples/bpf/tcbpf1_kern.c    |   24 ++-
>  9 files changed, 91 insertions(+), 1 deletion(-)
>

Acked-by: John Fastabend
[PATCH] ip: find correct route for socket which is not bound to a device
For multicast, we should also find a valid route (and thus get a
meaningful PMTU) for packets sent on a socket that is not bound to a
device (sk_bound_dev_if being 0).

From the man page of socket(7):

  SO_BINDTODEVICE
      Bind this socket to a particular device like “eth0”, as specified
      in the passed interface name. If the name is an empty string or the
      option length is zero, the socket device binding is removed. The
      passed option is a variable-length null-terminated interface name
      string with the maximum size of IFNAMSIZ. If a socket is bound to
      an interface, only packets received from that particular interface
      are processed by the socket. Note that this works only for some
      socket types, particularly AF_INET sockets. It is not supported for
      packet sockets (use normal bind(2) there).

The man page does not say that packets from a socket that is not bound to
a device will not be routed.

The problem I hit is that all multicast packets are dropped by the kernel
on the sending host. The lower layer is IPoIB with an MTU of 7000, and I
was sending multicast packets of length 4096. Inside IPoIB the first send
is dropped because it exceeds the internal packet size limit, mcast_mtu,
which is 2044. IPoIB therefore calls ip_rt_update_pmtu (indirectly) to
set the path MTU. A correct route is configured for the multicast group,
so setting the PMTU succeeded, and the next multicast packet to the same
target was expected to succeed, since it would be fragmented according to
the PMTU just set. But the second and later multicast packets were
dropped too. The reason is that the lookup (fib_lookup) is skipped
because the socket is not bound to a device (sk_bound_dev_if being 0).

With the patch proposed here applied, it works fine.

Signed-off-by: Wengang Wang
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a556..032481a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2097,7 +2097,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 			 */

 			fl4->flowi4_oif = dev_out->ifindex;
-			goto make_route;
+			goto lookup;
 		}

 		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
@@ -2153,6 +2153,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		goto make_route;
 	}

+lookup:
 	if (fib_lookup(net, fl4, &res, 0)) {
 		res.fi = NULL;
 		res.table = NULL;
--
2.1.0
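As a rough illustration of the fragmentation arithmetic described in the
message above, the sketch below counts the IPv4 fragments needed once the
path MTU has dropped to the mcast_mtu value of 2044. It assumes a UDP
datagram (8-byte UDP header), a 20-byte IPv4 header without options, and
the usual rule that every fragment except the last carries a multiple of
8 payload bytes. The function name is hypothetical and nothing here is
taken from the kernel sources.

```c
#include <stddef.h>

/*
 * Number of IPv4 fragments needed to carry l4_len bytes of
 * transport-layer data (transport header included) over a path with the
 * given MTU. ip_hdr_len is the IPv4 header length in bytes.
 * Returns 0 if the MTU cannot hold the header plus 8 payload bytes.
 */
static size_t demo_ipv4_fragment_count(size_t l4_len, size_t mtu,
				       size_t ip_hdr_len)
{
	size_t per_frag;

	if (mtu < ip_hdr_len + 8)
		return 0;

	/* Fits in a single packet: no fragmentation needed. */
	if (l4_len + ip_hdr_len <= mtu)
		return 1;

	/* Non-final fragments carry a multiple of 8 payload bytes. */
	per_frag = (mtu - ip_hdr_len) & ~(size_t)7;

	/* Round up to the number of fragments. */
	return (l4_len + per_frag - 1) / per_frag;
}
```

Under these assumptions, demo_ipv4_fragment_count(4096 + 8, 2044, 20)
yields 3 fragments, while the same datagram fits in one packet at the
interface MTU of 7000.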
[PATCH v2 0/1] Generalize poll events from eventfd
Using eventfd, user space can generate POLLIN/POLLOUT events, but some
applications may want to generate POLLPRI/POLLERR events as well. This
patch submission aims to generalize the events generated by an eventfd.

This is a resubmission of a patch from Feb 2013[1]. The original
discussion trailed off without any conclusion, but the original author
has recently confirmed[2] that this functionality is still useful, so I
volunteered to rebase and resubmit the patch for discussion.

[1] https://lkml.org/lkml/2013/2/18/147
[2] https://lkml.org/lkml/2015/7/9/153

Changes in v2:
* Rebased on Linux v4.3-rc1
* Move file operation implementations for EFD_MASK to a separate structure
* Remove 'data' element from efd_mask structure
* read() is no longer supported when EFD_MASK is set (fails with EINVAL)
* eventfd_ctx_fileget() now returns EINVAL when EFD_MASK is set,
  eliminating the possibility of triggering the original BUG_ON() macros,
  which have now been removed.

Thank you,
Damian

Martin Sustrik (1):
  eventfd: implementation of EFD_MASK flag

 fs/eventfd.c                 | 91 ++--
 include/linux/eventfd.h      | 16 +---
 include/uapi/linux/eventfd.h | 40 +++
 3 files changed, 121 insertions(+), 26 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

--
1.9.1
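To make the starting point of the cover letter concrete, here is a
minimal user-space sketch of what a plain eventfd (without the proposed
EFD_MASK flag) reports to poll(2). It uses only the existing eventfd(2),
poll(2), write(2) and close(2) calls; the helper names are hypothetical.
It is meant as an illustration of the current behaviour, not as part of
the patch.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Poll fd once without blocking; return revents, or -1 on poll() failure. */
static int demo_poll_once(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT | POLLPRI };

	if (poll(&pfd, 1, 0) < 0)
		return -1;
	return pfd.revents;
}

/*
 * Create a plain eventfd with counter 0 and record what poll() reports
 * before and after writing 1 to it. Returns 0 on success, -1 on error.
 */
static int demo_eventfd_poll(int *before, int *after)
{
	uint64_t one = 1;
	int fd = eventfd(0, EFD_NONBLOCK);

	if (fd < 0)
		return -1;

	*before = demo_poll_once(fd);	/* counter is 0: writable only */

	if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
		close(fd);
		return -1;
	}

	*after = demo_poll_once(fd);	/* counter is 1: readable and writable */
	close(fd);
	return 0;
}
```

On a plain eventfd the reported events are driven entirely by the
counter value, so the application has no direct way to request POLLPRI
or an arbitrary combination of flags; that gap is what the series
addresses.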
[PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
From: Martin Sustrik

When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.
Polling on such fake file descriptors is a problem (poll/select/epoll
accept only true file descriptors) and forces protocol implementers to use
various workarounds resulting in complex, non-standard and convoluted APIs.

More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half
the way towards this goal it has following shortcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most
    notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
    combination for a network protocol (rx buffer is empty and tx buffer is
    full), cannot be signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a
way as to signal no events on the file descriptor when it is polled on.
The 'initval' argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
	uint32_t events;
};

The value of 'events' should be any combination of event flags as defined
by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will be signaled when polling (select, poll, epoll) on the eventfd
is done later on.

read(2):

read is not supported and will fail with EINVAL.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events
specified in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik
[dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
Signed-off-by: Damian Hobson-Garcia
---
 fs/eventfd.c                 | 102 ++-
 include/linux/eventfd.h      |  16 +--
 include/uapi/linux/eventfd.h |  39 +
 3 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..1a6a066 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi
+ *  Copyright (C) 2013  Martin Sustrik
  *
  */

@@ -22,18 +23,31 @@
 #include
 #include

+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
 struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
-	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
-	 */
-	__u64 count;
+	union {
+		/*
+		 * Every time that a write(2) is performed on an eventfd, the
+		 * value of the __u64 being written is added to "count" and a
+		 * wakeup is performed on "wqh". A read(2) will return the
+		 * "count" value to userspace, and will reset "count" to zero.
+		 * The kernel side eventfd_signal() also, adds to the "count"
+		 * counter and issue a wakeup.
+		 */
+		__u64 count;
+
+		/*
+		 * When using eventfd in EFD_MASK mode this structure stores the
+		 * current events to be signaled on the eventfd (events member)
+		 * along with opaque user-defined data (data member).
+		 */
+		struct efd_mask mask;
+	};
 	unsigned int flags;
 };

@@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
 	return events;
 }

+static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->wqh, wait);
+	return ctx->mask.events;
+}
+
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
 	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
@@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
 }

+static ssize_t eventfd_mask_read(struct file *file, char __user *buf,
+			    size_t co
Re: kernel 4.2 : "bridge vlan" command return empty result (works with kernel 4.1.3)
>>Do you have a bond in your system ?.

Yes, indeed. Removing the bond fixes the problem.

I'll try your patch today.

Thanks !

Alexandre

----- Original Message -----
From: "roopa"
To: "aderumier"
Cc: "netdev" , "Scott Feldman"
Sent: Tuesday, 15 September 2015 21:02:34
Subject: Re: kernel 4.2 : "bridge vlan" command return empty result (works with kernel 4.1.3)

On 9/15/15, 10:39 AM, Alexandre DERUMIER wrote:
> Hi,
>
> since kernel 4.2, "bridge vlan" command return empty result.
>
>
> kernel 4.1.3
>
> # bridge vlan
> port    vlan ids
> eth0     1 PVID Egress Untagged
>          90
>          91
>          92
>          93
>          94
>          95
>          96
>          97
>          98
>          99
>          100
>
> vmbr0    1 PVID Egress Untagged
>          94
>
>
>
> kernel 4.2
>
> # bridge vlan
> port    vlan ids
>
>
>
> Note that vlans are correctly working,it seem that is just the display.
>
> tcpdump -e -i vmbr0
>
> 19:38:08.005055 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4,
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 339613, win 5523,
> length 0
> 19:38:08.007730 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4,
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 342145, win 5568,
> length 0
> 19:38:08.010977 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4,
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 344677, win 5614,
> length 0
> 19:3

I was able to reproduce this when there is a bond in the system.
Looks like this was due to 85fdb956726ff2a ("switchdev: cut over to new
switchdev_port_bridge_getlink"). When CONFIG_SWITCHDEV is off, nodes that
use switchdev api for ndo_bridge_getlink (example, bonds, teams, rocker)
can return -EOPNOTSUPP.

The problem went away on my box with the following patch. I will submit
an official patch in a bit.
Do you have a bond in your system ?.

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 01ced4a..bdb3842 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3013,6 +3013,7 @@ static int rtnl_bridge_getlink(struct sk_buff *skb, struct
 	u32 portid = NETLINK_CB(cb->skb).portid;
 	u32 seq = cb->nlh->nlmsg_seq;
 	u32 filter_mask = 0;
+	int err;

 	if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
 		struct nlattr *extfilt;
@@ -3033,20 +3034,25 @@ static int rtnl_bridge_getlink(struct sk_buff *skb, stru
 		struct net_device *br_dev = netdev_master_upper_dev_get(dev);

 		if (br_dev && br_dev->netdev_ops->ndo_bridge_getlink) {
-			if (idx >= cb->args[0] &&
-			    br_dev->netdev_ops->ndo_bridge_getlink(
-				    skb, portid, seq, dev, filter_mask,
-				    NLM_F_MULTI) < 0)
-				break;
+			if (idx >= cb->args[0]) {
+				err = br_dev->netdev_ops->ndo_bridge_getlink(
+						skb, portid, seq, dev,
+						filter_mask, NLM_F_MULTI);
+				if ( err < 0 && err != -EOPNOTSUPP)
+					break;
+			}
 			idx++;
 		}

 		if (ops->ndo_bridge_getlink) {
-			if (idx >= cb->args[0] &&
-			    ops->ndo_bridge_getlink(skb, portid, seq, dev,
-						    filter_mask,
-						    NLM_F_MULTI) < 0)
-				break;
+			if (idx >= cb->args[0]) {
+				err = ops->ndo_bridge_getlink(skb, portid,
+							      seq, dev,
+							      filter_mask,
+							      NLM_F_MULTI);
+				if ( err < 0 && err != -EOPNOTSUPP)
+					break;
+			}
 			idx++;
 		}
 	}
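The idea behind the patch above, stripped of the netlink details, is to
keep walking when one handler reports -EOPNOTSUPP and to stop only on
other errors. The following user-space sketch shows that control flow in
isolation; the callback type and the demo_* names are hypothetical
stand-ins, not kernel interfaces.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for ndo_bridge_getlink. */
typedef int (*demo_getlink_fn)(void);

static int demo_ok(void)          { return 0; }
static int demo_unsupported(void) { return -EOPNOTSUPP; }
static int demo_fail(void)        { return -ENOMEM; }

/*
 * Call each handler in turn. A handler that returns -EOPNOTSUPP is
 * treated as "nothing to report" and the walk continues; any other
 * negative value stops the walk. Returns the number of handlers
 * visited before stopping.
 */
static size_t demo_dump_all(const demo_getlink_fn *handlers, size_t n)
{
	size_t idx;

	for (idx = 0; idx < n; idx++) {
		int err = handlers[idx]();

		if (err < 0 && err != -EOPNOTSUPP)
			break;
	}
	return idx;
}
```

With the pre-patch behaviour (stop on any negative return) a single
unsupported device near the start of the list would hide every device
after it, which matches the empty "bridge vlan" output reported in this
thread.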
[PATCH v2 net-next 0/2] bpf: performance improvements
v1->v2: dropped redundant iff_up check in patch 2

At plumbers we discussed different options on how to get rid of skb_clone
from bpf_clone_redirect(), the patch 2 implements the best option.
Patch 1 adds 'integrated exts' to cls_bpf to improve performance by
combining simple actions into bpf classifier.

Alexei Starovoitov (1):
  bpf: add bpf_redirect() helper

Daniel Borkmann (1):
  cls_bpf: introduce integrated actions

 include/net/sch_generic.h    |    3 ++-
 include/uapi/linux/bpf.h     |    9 +++
 include/uapi/linux/pkt_cls.h |    4 +++
 net/core/dev.c               |    8 ++
 net/core/filter.c            |   58 +++
 net/sched/act_bpf.c          |    1 +
 net/sched/cls_bpf.c          |   61 ++
 samples/bpf/bpf_helpers.h    |    4 +++
 samples/bpf/tcbpf1_kern.c    |   24 -
 9 files changed, 159 insertions(+), 13 deletions(-)

--
1.7.9.5
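Patch 2 of this series notes that bpf_redirect() stores ifindex and flags
in a per-CPU scratch area and returns TC_ACT_REDIRECT, leaving the actual
redirect to the caller. The sketch below imitates that two-step hand-off
in user space. It is only an analogy: thread-local storage stands in for
the kernel's per-CPU data, all demo_* names are hypothetical, and
clearing the scratch area after use is a choice made for this sketch.

```c
#include <stdint.h>

#define DEMO_TC_ACT_REDIRECT 7	/* value of TC_ACT_REDIRECT in the patch */
#define DEMO_F_INGRESS 1u	/* bit 0: redirect to ingress (hypothetical name) */

struct demo_redirect_info {
	uint32_t ifindex;
	uint32_t flags;
};

/* Thread-local storage stands in for the kernel's per-CPU scratch area. */
static _Thread_local struct demo_redirect_info demo_ri;

/* Helper side: only record the target and return the action code. */
static int demo_bpf_redirect(uint32_t ifindex, uint32_t flags)
{
	demo_ri.ifindex = ifindex;
	demo_ri.flags = flags;
	return DEMO_TC_ACT_REDIRECT;
}

/* Caller side: consume the recorded target, then clear the scratch area. */
static void demo_do_redirect(uint32_t *ifindex, int *to_ingress)
{
	*ifindex = demo_ri.ifindex;
	*to_ingress = (demo_ri.flags & DEMO_F_INGRESS) != 0;
	demo_ri.ifindex = 0;
	demo_ri.flags = 0;
}
```

The point of the split is that the helper itself never touches the
packet, so no clone is needed; the code that sees the action code acts on
the original skb.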
[PATCH v2 net-next 2/2] bpf: add bpf_redirect() helper
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
---
This approach is using per_cpu scratch area to store ifindex and flags.
The other alternatives discussed at plumbers are slower and more intrusive.

v1->v2: dropped redundant iff_up check

 include/net/sch_generic.h    |    1 +
 include/uapi/linux/bpf.h     |    8
 include/uapi/linux/pkt_cls.h |    1 +
 net/core/dev.c               |    8
 net/core/filter.c            |   44 ++
 net/sched/act_bpf.c          |    1 +
 net/sched/cls_bpf.c          |    1 +
 samples/bpf/bpf_helpers.h    |    4
 samples/bpf/tcbpf1_kern.c    |   24 ++-
 9 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index da61febb9091..4c79ce8c1f92 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -402,6 +402,7 @@ void __qdisc_calculate_pkt_len(struct sk_buff *skb,
 				const struct qdisc_size_table *stab);
 bool tcf_destroy(struct tcf_proto *tp, bool force);
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
+int skb_do_redirect(struct sk_buff *);

 /* Reset all TX qdiscs greater then index of a device. */
 static inline void qdisc_reset_all_tx_gt(struct net_device *dev, unsigned int i)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2fbd1c71fa3b..4ec0b5488294 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -272,6 +272,14 @@ enum bpf_func_id {
 	BPF_FUNC_skb_get_tunnel_key,
 	BPF_FUNC_skb_set_tunnel_key,
 	BPF_FUNC_perf_event_read,	/* u64 bpf_perf_event_read(&map, index) */
+	/**
+	 * bpf_redirect(ifindex, flags) - redirect to another netdev
+	 * @ifindex: ifindex of the net device
+	 * @flags: bit 0 - if set, redirect to ingress instead of egress
+	 *         other bits - reserved
+	 * Return: TC_ACT_REDIRECT
+	 */
+	BPF_FUNC_redirect,
 	__BPF_FUNC_MAX_ID,
 };

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 0a262a83f9d4..439873775d49 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -87,6 +87,7 @@ enum {
 #define TC_ACT_STOLEN		4
 #define TC_ACT_QUEUED		5
 #define TC_ACT_REPEAT		6
+#define TC_ACT_REDIRECT		7
 #define TC_ACT_JUMP		0x1000

 /* Action type identifiers*/
diff --git a/net/core/dev.c b/net/core/dev.c
index 877c84834d81..d6a492e57874 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3668,6 +3668,14 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 	case TC_ACT_QUEUED:
 		kfree_skb(skb);
 		return NULL;
+	case TC_ACT_REDIRECT:
+		/* skb_mac_header check was done by cls/act_bpf, so
+		 * we can safely push the L2 header back before
+		 * redirecting to another netdev
+		 */
+		__skb_push(skb, skb->mac_len);
+		skb_do_redirect(skb);
+		return NULL;
 	default:
 		break;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 971d6ba89758..da3f3d94d6e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1427,6 +1427,48 @@ const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };

+struct redirect_info {
+	u32 ifindex;
+	u32 flags;
+};
+
+static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+static u64 bpf_redirect(u64 ifindex, u64 flags, u64 r3, u64 r4, u64 r5)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	ri-
[PATCH v2 net-next 1/2] cls_bpf: introduce integrated actions
From: Daniel Borkmann

Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
	/* classify arp, ip, ipv6 into different traffic classes
	 * and drop all other packets
	 */
	switch (skb->protocol) {
	case htons(ETH_P_ARP):
		skb->tc_classid = 1;
		break;
	case htons(ETH_P_IP):
		skb->tc_classid = 2;
		break;
	case htons(ETH_P_IPV6):
		skb->tc_classid = 3;
		break;
	default:
		return TC_ACT_SHOT;
	}

	return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
---
v1->v2: no changes

 include/net/sch_generic.h    |    2 +-
 include/uapi/linux/bpf.h     |    1 +
 include/uapi/linux/pkt_cls.h |    3 +++
 net/core/filter.c            |   14 ++
 net/sched/cls_bpf.c          |   60 ++
 5 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 444faa89a55f..da61febb9091 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -251,7 +251,7 @@ struct tcf_proto {
 struct qdisc_skb_cb {
 	unsigned int		pkt_len;
 	u16			slave_dev_queue_mapping;
-	u16			_pad;
+	u16			tc_classid;
 #define QDISC_CB_PRIV_LEN 20
 	unsigned char		data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2d5461..2fbd1c71fa3b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -293,6 +293,7 @@ struct __sk_buff {
 	__u32 tc_index;
 	__u32 cb[5];
 	__u32 hash;
+	__u32 tc_classid;
 };

 struct bpf_tunnel_key {
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 4f0d1bc3647d..0a262a83f9d4 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -373,6 +373,8 @@ enum {

 /* BPF classifier */

+#define TCA_BPF_FLAG_ACT_DIRECT	(1 << 0)
+
 enum {
 	TCA_BPF_UNSPEC,
 	TCA_BPF_ACT,
@@ -382,6 +384,7 @@ enum {
 	TCA_BPF_OPS,
 	TCA_BPF_FD,
 	TCA_BPF_NAME,
+	TCA_BPF_FLAGS,
 	__TCA_BPF_MAX,
 };

diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f03902e..971d6ba89758 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1632,6 +1632,9 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 static bool sk_filter_is_valid_access(int off, int size,
 				      enum bpf_access_type type)
 {
+	if (off == offsetof(struct __sk_buff, tc_classid))
+		return false;
+
 	if (type == BPF_WRITE) {
 		switch (off) {
 		case offsetof(struct __sk_buff, cb[0]) ...
@@ -1648,6 +1651,9 @@ static bool sk_filter_is_valid_access(int off, int size,
 static bool tc_cls_act_is_valid_access(int off, int size,
 				       enum bpf_access_type type)
 {
+	if (off == offsetof(struct __sk_buff, tc_classid))
+		return type == BPF_WRITE ? true : false;
+
 	if (type == BPF_WRITE) {
 		switch (off) {
 		case offsetof(struct __sk_buff, mark):
@@ -1760,6 +1766,14 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
 		break;

+	case offsetof(struct __sk_buff, tc_classid):
+		ctx_off -= offsetof(struct __sk_buff, tc_classid);
+		ctx_off += offsetof(struct sk_buff, cb);
+		ctx_off += offsetof(struct qdisc_skb_cb, tc_classid);
+		WARN_ON(type != BPF_WRITE);
+		*insn++ = BPF_STX_MEM(BPF_H, dst_reg, src_reg, ctx_off);
+		break;
+
 	case offsetof(struct __sk_buff, tc_index):
 #ifdef CONFIG_NET_SCHED
 		BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tc_index) != 2);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index e5168f8b9640..77b0ef148256 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -38,6 +38,7 @@ struct cls_bpf_prog {
 	struct bpf_prog *filter;
 	struct list_head link;
 	struct tcf_result res;
+	bool exts_integrated;
 	struct tcf_exts exts;
 	u32 handle;
 	union {
@@ -52,6 +53,7 @@ struct cls_bpf_prog {
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
 	[TCA_BPF_CLASSID]	= { .type = NLA_U32 },
+	[TCA_BPF_FLAGS]		= { .type = NLA_U32 },
 	[TCA_BPF_FD]		= { .type = NLA_U32 },
 	[TCA_BPF_NAME]		= { .type = NLA_NUL_STRING, .len = CLS_BPF_NAME_LEN },
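The classifier from the commit message of this patch can be exercised
outside the kernel with a small stand-in. The sketch below is a
user-space imitation under stated assumptions: struct demo_skb is a
hypothetical two-field substitute for struct __sk_buff, the EtherType and
action constants are defined locally with DEMO_ prefixes, and the switch
is written as an if/else chain because htons() is not guaranteed to be a
constant expression in user space.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Local stand-ins for the EtherType and TC action constants. */
#define DEMO_ETH_P_IP    0x0800
#define DEMO_ETH_P_ARP   0x0806
#define DEMO_ETH_P_IPV6  0x86DD
#define DEMO_TC_ACT_OK   0
#define DEMO_TC_ACT_SHOT 2

/* Hypothetical, heavily reduced substitute for struct __sk_buff. */
struct demo_skb {
	uint32_t protocol;	/* EtherType in network byte order */
	uint32_t tc_classid;	/* written by the classifier */
};

/*
 * Classify ARP, IPv4 and IPv6 into traffic classes 1, 2 and 3 and ask
 * for every other packet to be dropped. The return value is the
 * action; the class id is passed back through skb->tc_classid.
 */
static int demo_cls_prog(struct demo_skb *skb)
{
	if (skb->protocol == htons(DEMO_ETH_P_ARP))
		skb->tc_classid = 1;
	else if (skb->protocol == htons(DEMO_ETH_P_IP))
		skb->tc_classid = 2;
	else if (skb->protocol == htons(DEMO_ETH_P_IPV6))
		skb->tc_classid = 3;
	else
		return DEMO_TC_ACT_SHOT;

	return DEMO_TC_ACT_OK;
}
```

This mirrors the split the patch introduces: one program run produces
both the classification and the verdict, so no separate drop action
needs to be attached.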
FW: [RFC net-next 01/10] qed: Add module with basic common support
> From: Yuval Mintz
> Date: Thu, 10 Sep 2015 16:54:12 +0300
>
> > Documentation/networking/LICENSE.qlogic| 288 ++
>
> I do not want to get into the habit of having to add copy after copy
> of the GPL v2 to the source tree, so this is rather inappropriate.
> Everything said in that file is explicitly covered by the top-level
> COPYING file.

You're right; we'll remove this and add a comment referring to COPYING
instead in v2.

On a related but different topic, I've noticed a lack of commentary on
this series. I don't know if it's due to lack of time, interest, or
something else.
[I know we write perfect code [ ;-) ], but I would have guessed that
adding 10Ks of lines of code to the kernel would generate at least some
rejects].

Should we wait any longer before sending the next revision?
Re: [PATCHv2 net] cxgb4vf: support for single-threading access to adapter mailbox registers
On Mon, Sep 14, 2015 at 19:08:34 +0530, Hariprasad Shenai wrote:
> The issue is that for the Virtual Function Driver, the only way to get the
> Virtual Interface statistics is to issue mailbox commands to ask the
> firmware for the VI Stats. And, because the VI Stats command can only
> retrieve a smallish number of stats per mailbox command, we have to issue
> three mailbox commands in quick succession. What we ran into was irqbalance
> coming in every 10 seconds and interrogating every network interface in the
> system.
>
> Signed-off-by: Hariprasad Shenai
> ---
> V2: Updated description and using linux completion API's instead of
> for loop based on review comments by David Miller
>
>  drivers/net/ethernet/chelsio/cxgb4vf/adapter.h     |  9 +
>  .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c    |  4 ++
>  drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c     | 46 +-
>  3 files changed, 58 insertions(+), 1 deletion(-)
>

Hi David,

There is an issue with this patch. Can you please drop it? I will send a
V3 with the fixes.

The condition below should be a while loop instead of an if:

	/* If we're at the head, break out and start the mailbox
	 * protocol.
	 */
	if (list_first_entry(&adapter->mlist.list, struct mbox_list,
			     list) != &entry) {
		int ret;

Thanks
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, Sep 15, 2015 at 5:03 PM, Eric Dumazet wrote:
> On Tue, 2015-09-15 at 16:45 -0700, Tom Herbert wrote:
>> > +       if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
>> > +           skb->l4_hash)
>> > +               return skb->hash;
>> > +
>> >         if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
>> >             !bond_flow_dissect(bond, skb, &flow))
>> >                 return bond_eth_hash(skb);
>> >
>> >
>> Ugh, bond_flow_dissect is yet another instance of customized flow
>> dissection! We should really clean this up. I suggest that in cases
>> where we want L4 hash a call to skb_get_hash should suffice. We can
>> create skb_get_l3hash when caller explicitly wants an L3 hash-- this
>> would return skb->hash if it's valid and skb->l4_hash is not set, else
>> call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
>> normal hash over flow keys (don't save result in skb->hash in this
>> case).
>
> This code predates all the change you did recently ;)
>
> BTW, the simple xor weakness is showing up after
> our change favoring even ports at connect() time, for a bonding device
> with 2 or 4 slaves.
>
Right, xor as a packet hash should be eliminated. It seems possible
that all these modes can be implemented using flow_dissector and the
jhash. If I'm reading the meaning of modes correctly:

BOND_XMIT_POLICY_ENCAP34 is equivalent to skb_get_hash

BOND_XMIT_POLICY_LAYER23 is flow dissection with
FLOW_DISSECTOR_F_STOP_AT_L3 and then normal hash

BOND_XMIT_POLICY_LAYER34 is flow dissection with
FLOW_DISSECTOR_F_STOP_AT_ENCAP

BOND_XMIT_POLICY_LAYER2 would be flow dissection with
FLOW_DISSECTOR_F_STOP_AT_L2 (new flag) and then normal hash

BOND_XMIT_POLICY_ENCAP23 is a little more interesting. This could be
accomplished with custom flow dissector targets that exclude L4
information (ports, flow label, key ID).

Also noticed a little bug in flow_dissector, we should get out on
FLOW_DISSECTOR_F_STOP_AT_L3 before IPv6 flow label is processed
(that's considered L4). I'll fix that.

Tom

> (commit 07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5
> "tcp/dccp: try to not exhaust ip_local_port_range in connect()")
>
>
>
>

-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
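The xor weakness mentioned above can be illustrated with a small userspace sketch. This is not the bonding driver's code: xor_ports_hash() is a deliberately simplified stand-in for an xor-style hash, mixed_ports_hash() uses a generic integer mixer (the murmur3 finalizer) purely as an example of a hash that diffuses all input bits (the thread suggests jhash for the kernel), and all function names here are made up for illustration. With only even source ports and an even destination port, the xor result always has its low bit clear, so with 2 slaves every flow lands on slave 0, and with 4 slaves only slaves 0 and 2 are used.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified xor-style hash over the ports only (illustrative,
 * NOT the kernel's bond_xmit_hash()). */
static uint32_t xor_ports_hash(uint16_t sport, uint16_t dport)
{
    return (uint32_t)(sport ^ dport);
}

/* Generic 32-bit mixer (murmur3 finalizer), used here only as an
 * example of a hash where every input bit affects the low bits. */
static uint32_t mixed_ports_hash(uint16_t sport, uint16_t dport)
{
    uint32_t h = ((uint32_t)sport << 16) | dport;

    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

typedef uint32_t (*port_hash_fn)(uint16_t, uint16_t);

/* Count how many of nflows flows land on each slave when every flow
 * uses an even source port (32768, 32770, ...). Keep nflows below
 * 16384 so the source port stays within 16 bits. */
static void spread_even_ports(port_hash_fn hash, uint16_t dport,
                              unsigned int n_slaves, unsigned int nflows,
                              unsigned int *per_slave)
{
    unsigned int i;

    assert(n_slaves > 0 && nflows < 16384);
    for (i = 0; i < n_slaves; i++)
        per_slave[i] = 0;
    for (i = 0; i < nflows; i++) {
        uint16_t sport = (uint16_t)(32768u + 2u * i);

        per_slave[hash(sport, dport) % n_slaves]++;
    }
}
```

Calling spread_even_ports() with xor_ports_hash, destination port 80 and 2 slaves puts every flow in per_slave[0]; passing mixed_ports_hash instead lets the low bits of the result depend on all port bits.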
Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper
On 9/15/15 9:50 PM, John Fastabend wrote:
> Looks like you can remove the check. I would prefer to let the stack
> handle this case using normal mechanisms.
>
> I had to do a bit of tracking but netif_running check equates roughly
> to your IFF_UP case via,
> ...
> Seem reasonable? Or did you put it there to work around some
> specific case I'm missing?

well, in the forwarding path is_skb_forwardable() does the IFF_UP check
before netif_running() has to do it, so yeah this check can be dropped.
Will fix in v2.
Thanks for the review!
Re: linux-next: build failure after merge of the tip tree
From: Stephen Rothwell
Date: Wed, 16 Sep 2015 11:30:53 +1000

> I have added the following fix patch for today:
>
> From: Stephen Rothwell
> Date: Wed, 16 Sep 2015 11:10:16 +1000
> Subject: [PATCH] cdc: add header guards
>
> Signed-off-by: Stephen Rothwell

Applied, thanks Stephen.
Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper
On 15-09-15 09:11 PM, Alexei Starovoitov wrote:
> On 9/15/15 8:10 PM, John Fastabend wrote:
>> Nice, I like this. But just to be sure I read this correctly this will
>> only work on the ingress qdisc for now right? To get the tx side working
>> will require a bit more care.
>
> correct.
> For egress I'm waiting for Daniel to resubmit his preclassifier patch
> and I'll hook this skb_do_redirect() there as well.
> Other options are also possible, but preclassifier looks the best for
> this purpose, since it's lockless.
>

Great, works for me. One other question/observation,

+int skb_do_redirect(struct sk_buff *skb)
+{
[...]
+
+       if (unlikely(!(dev->flags & IFF_UP))) {
+               kfree_skb(skb);
+               return -EINVAL;
+       }

The IFF_UP check is not needed as best I can tell, the dev_queue_xmit()
will check if the qdisc is active and the dev_forward_skb() path will do
a !netif_running check in enqueue_to_backlog() call. Looks like you can
remove the check. I would prefer to let the stack handle this case using
normal mechanisms.

I had to do a bit of tracking but netif_running check equates roughly to
your IFF_UP case via,

> __dev_change_flags()
> [...]
>       if ((old_flags ^ flags) & IFF_UP)
>               ret = ((old_flags & IFF_UP) ? __dev_close :
>                      __dev_open)(dev);
>
>
> __dev_close()
> [...]
>       __dev_close_many()
>
> __dev_close_many()
> [...]
>       clear_bit(__LINK_STATE_START, &dev->state);

Seem reasonable? Or did you put it there to work around some specific
case I'm missing?

.John
Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper
On 9/15/15 8:10 PM, John Fastabend wrote:
> Nice, I like this. But just to be sure I read this correctly this will
> only work on the ingress qdisc for now right? To get the tx side working
> will require a bit more care.

correct.
For egress I'm waiting for Daniel to resubmit his preclassifier patch
and I'll hook this skb_do_redirect() there as well.
Other options are also possible, but preclassifier looks the best for
this purpose, since it's lockless.
Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper
On 15-09-15 06:51 PM, Alexei Starovoitov wrote:
> Existing bpf_clone_redirect() helper clones skb before redirecting
> it to RX or TX of destination netdev.
> Introduce bpf_redirect() helper that does that without cloning.
>
> Benchmarked with two hosts using 10G ixgbe NICs.
> One host is doing line rate pktgen.
> Another host is configured as:
> $ tc qdisc add dev $dev ingress
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>    action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
> so it receives the packet on $dev and immediately xmits it on $dev + 1
> The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
> that does bpf_clone_redirect() and performance is 2.0 Mpps
>
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>    action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
> which is using bpf_redirect() - 2.4 Mpps
>
> and using cls_bpf with integrated actions as:
> $ tc filter add dev $dev root pref 10 \
>    bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
> performance is 2.5 Mpps
>
> To summarize:
> u32+act_bpf using clone_redirect - 2.0 Mpps
> u32+act_bpf using redirect - 2.4 Mpps
> cls_bpf using redirect - 2.5 Mpps
>
> For comparison linux bridge in this setup is doing 2.1 Mpps
> and ixgbe rx + drop in ip_rcv - 7.8 Mpps
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> ---

Nice, I like this. But just to be sure I read this correctly this will
only work on the ingress qdisc for now right? To get the tx side working
will require a bit more care.

Thanks,
.John
[net-next:master 10/14] ERROR: "cdc_parse_cdc_header" [drivers/net/usb/cdc-phonet.ko] undefined!
tree: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master head: d5566fd72ec1924958fcfd48b65c022c8f7eae64 commit: 7b6ee48d3f4d432bfa6c9c9662fbdbd97681240e [10/14] cdc-phonet: use common parser config: x86_64-randconfig-a0-09160932 (attached as .config) reproduce: git checkout 7b6ee48d3f4d432bfa6c9c9662fbdbd97681240e # save the attached .config to linux build tree make ARCH=x86_64 All error/warnings (new ones prefixed by >>): >> ERROR: "cdc_parse_cdc_header" [drivers/net/usb/cdc-phonet.ko] undefined! --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation # # Automatically generated file; DO NOT EDIT. # Linux/x86_64 4.2.0 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_PERF_EVENTS_INTEL_UNCORE=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_MMU=y CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=4 
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_CONSTRUCTORS=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SYSVIPC=y CONFIG_POSIX_MQUEUE=y # CONFIG_CROSS_MEMORY_ATTACH is not set CONFIG_FHANDLE=y # CONFIG_USELIB is not set # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_IRQ_DOMAIN_DEBUG=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ is not set # CONFIG_HIGH_RES_TIMERS is not set # # CPU/Task time and stats accounting # # CONFIG_TICK_CPU_ACCOUNTING is not set # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set CONFIG_IRQ_TIME_ACCOUNTING=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # # RCU Subsystem # CONFIG_PREEMPT_RCU=y CONFIG_RCU_EXPERT=y CONFIG_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_FANOUT=64 CONFIG_RCU_FANOUT_LEAF=16 # CONFIG_TREE_RCU_TRACE is not set CONFIG_RCU_BOOST=y CONFIG_RCU_KTHREAD_PRIO=1 
CONFIG_RCU_BOOST_DELAY=500 # CONFIG_RCU_NOCB_CPU is not set # CONFIG_RCU_EXPEDITE_BOOT is not set CONFIG_BUILD_BIN2C=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_SUPPORTS_INT128=y CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not set CONFIG_CGROUP_FREEZER=y CONFIG_CGROUP_PIDS=y # CONFIG_CGROUP_DEVICE is not set CONFIG_CPUSETS=y # CONFIG_PROC_PID_CPUSET is not set # CONFIG_CGROUP_CPUACCT is not set # CONFIG_MEMCG is not set # CONFIG_CGROUP_HUGETLB is not set # CONFIG_CGROUP_PERF is not set CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y # CONFIG_CHECKPOINT_RESTORE is not set # CONFIG_NAMESPACES is not set CONFIG_SCHED_AUTOGROUP=y # CONFIG_SYSFS_DEPRECATED is not set # CONFIG_RELAY is not set CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_RD_GZIP=y CONFIG_RD_BZIP2=y CONFIG_RD_LZMA=y CONFIG_RD_XZ=y CONFIG_RD_
Re: [PATCH for-next] cxgb4: add device ID for few T5 adapters
On Tue, Sep 15, 2015 at 11:55:19 -0700, David Miller wrote:
> From: Hariprasad Shenai
> Date: Tue, 15 Sep 2015 17:20:09 +0530
>
> > Signed-off-by: Hariprasad Shenai
>
> Adding just some new device IDs is definitely 'net' material, mind
> if I apply it there instead?

No issues. Thanks.
Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR
On 09/15/2015 02:17 PM, Phil Sutter wrote: > On Tue, Sep 15, 2015 at 11:11:53AM -0400, Vlad Yasevich wrote: >> On 09/14/2015 04:06 PM, Phil Sutter wrote: >>> On Mon, Sep 14, 2015 at 02:21:10PM -0400, Vlad Yasevich wrote: On 09/11/2015 04:20 PM, Phil Sutter wrote: > On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote: >> On Fri, 11 Sep 2015 21:22:03 +0200 >> Phil Sutter wrote: >> >>> When forwarding packets from an 802.1Q interface with REORDER_HDR set to >>> zero, the VLAN header previously inserted by vlan_do_receive() needs to >>> be stripped from the packet and the mac_header adjustment undone, >>> otherwise a tagged frame with first four bytes missing will be >>> transmitted. >>> >>> Signed-off-by: Phil Sutter >>> --- >>> net/bridge/br_input.c | 10 ++ >>> 1 file changed, 10 insertions(+) >>> >>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c >>> index f921a5d..e4e3fc7 100644 >>> --- a/net/bridge/br_input.c >>> +++ b/net/bridge/br_input.c >>> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff >>> **pskb) >>> } >>> >>> forward: >>> + if (is_vlan_dev(skb->dev) && >>> + !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) { >>> + unsigned int offset = skb->data - skb_mac_header(skb); >>> + >>> + skb_push(skb, offset); >>> + memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN); >>> + skb->mac_header += VLAN_HLEN; >>> + skb_pull(skb, offset); >>> + skb_reset_mac_len(skb); >>> + } >>> switch (p->state) { >>> case BR_STATE_FORWARDING: >>> rhook = rcu_dereference(br_should_route_hook); >> >> Thanks for finding this. Is this a new thing or has it always been there? > > Sorry, I didn't check if this is a regression or not. Seen initially > with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting > is by far not as old as it might seem. But it's surely not a brand new > problem of net-next or so. 
> > Since nowadays no sane mind touches REORDER_HDR (there was originally a > bug in NetworkManager which defaulted this to 0), it may very well be > there for a long time already. > >> Sorry, this looks so special case it doesn't seem like a good idea. >> Something is broken in VLAN handling if this is required. > > It is so ugly, I wish I had found a better way to fix the problem. Well, > maybe I miss something: > > - packet enters __netif_receive_skb_core(): > - skb->protocol is set to ETH_P_8021Q, so: > - packet is untagged > - skb->vlan_tci set > - skb->protocol set to 'real' protocol > - skb_vlan_tag_present(skb) == true, so: > - vlan_do_receive() is called: > - tags the packet again > - zeroes vlan_tci > - goto another_round > - __netif_receive_skb_core(), round 2: > - skb->protocol is not ETH_P_8021Q -> no untagging > - skb_vlan_tag_present(skb) == false -> no vlan_do_receive() > - rx_handler handler (== br_handle_frame) is called > > IMO the root of all evil is the existence of REORDER_HDR itself. It > causes an skb which should have been untagged to being passed along with > VLAN header present and code dealing with it needs to clean up the mess. So the problem here appears the be the code the in br_dev_queue_push_xmit(). It assumes that MAC_HLEN worth of data has been removed from the skb, which is normal in case of normal VLAN processing. However, without REORDER_HEADER set this is no longer the case. In this case, the ethernet header is shifted 4 bytes, and when we push the it back we miss the 4 bytes of the destination mac address... >>> >>> Please note that vlan_do_receive() also inserts the VLAN header in >>> between ethernet header and IP header, therefore: >>> I wonder if it would be safe to just use skb->mac_len. >>> >>> Given this works, the bridge would still forward a tagged frame which >>> should have been untagged in the first place. 
>>> >>> I just wondered where this added VLAN header is dropped if the interface >>> does not belong to a bridge, but then realized that further packet >>> processing simply ignores the ethernet header (and everything following >>> it). So unless I forget something, this should indeed be a >>> bridge-specific problem. >>> >> >> Looks like macvtap is also susceptible to this problem. It seems to be a bad >> idea to allow any upper device configuration on top of a REORDER_HDR=0 vlan. >> It is also not enough to just check is_vlan_dev(skb->dev) because vlan may >> be at >> lower in the device stack. > > Oh well. Apart from implementing workarounds for this wor
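The memmove() in the proposed br_handle_frame() hunk can be easier to follow on a plain byte buffer. The sketch below is a userspace model only, under the assumption of a standard 802.1Q frame layout (6-byte destination MAC, 6-byte source MAC, 2-byte TPID 0x8100, 2-byte TCI, then the inner EtherType). strip_vlan_tag() is a made-up helper, not a kernel function, and it does not model skb_push()/skb_pull() or the mac_header bookkeeping. It slides the two MAC addresses forward by VLAN_HLEN bytes so that they overwrite the tag, which is the same move the patch performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN    6       /* bytes in one MAC address */
#define VLAN_HLEN   4       /* TPID (2 bytes) + TCI (2 bytes) */
#define ETH_P_8021Q 0x8100  /* 802.1Q tag protocol identifier */

/* Strip an 802.1Q tag from an Ethernet frame in place.
 *
 * Returns a pointer to the start of the untagged frame (which is
 * frame + VLAN_HLEN) and shrinks *len by VLAN_HLEN, or returns NULL
 * and leaves everything untouched if the frame is too short or does
 * not carry an 802.1Q tag. */
static uint8_t *strip_vlan_tag(uint8_t *frame, size_t *len)
{
    uint16_t tpid;

    assert(frame != NULL && len != NULL);
    if (*len < 2 * ETH_ALEN + VLAN_HLEN + 2)
        return NULL;

    tpid = (uint16_t)((frame[12] << 8) | frame[13]);
    if (tpid != ETH_P_8021Q)
        return NULL;

    /* Move dst+src MAC forward over the tag; regions overlap,
     * hence memmove() rather than memcpy(). */
    memmove(frame + VLAN_HLEN, frame, 2 * ETH_ALEN);
    *len -= VLAN_HLEN;
    return frame + VLAN_HLEN;
}
```

After the call, the returned pointer addresses a normal untagged frame: both MAC addresses followed directly by the inner EtherType. If the MAC header offset is not advanced along with the data, the frame goes out with its first four bytes missing, which is the symptom described in the patch.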
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, 2015-09-15 at 17:15 -0700, Tom Herbert wrote:

> A more fundamental question is whether we can eliminate some of these
> hashing types (I see five of them in if_bonding.h). Is there any
> substantial difference between this and IPv4/v6 ECMP routing such that
> they shouldn't all have the same path selection modes?

We had an issue on a router that did not like a change in the hashing
done by the host behind it.

Do not ask me for details that I cannot provide, but I would guess it is
better not changing legacy modes unilaterally.
[PATCH net-next 1/2] cls_bpf: introduce integrated actions
From: Daniel Borkmann Often cls_bpf classifier is used with single action drop attached. Optimize this use case and let cls_bpf return both classid and action. For backwards compatibility reasons enable this feature under TCA_BPF_FLAG_ACT_DIRECT flag. Then more interesting programs like the following are easier to write: int cls_bpf_prog(struct __sk_buff *skb) { /* classify arp, ip, ipv6 into different traffic classes * and drop all other packets */ switch (skb->protocol) { case htons(ETH_P_ARP): skb->tc_classid = 1; break; case htons(ETH_P_IP): skb->tc_classid = 2; break; case htons(ETH_P_IPV6): skb->tc_classid = 3; break; default: return TC_ACT_SHOT; } return TC_ACT_OK; } Joint work with Daniel Borkmann. Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- include/net/sch_generic.h|2 +- include/uapi/linux/bpf.h |1 + include/uapi/linux/pkt_cls.h |3 +++ net/core/filter.c| 14 ++ net/sched/cls_bpf.c | 60 ++ 5 files changed, 68 insertions(+), 12 deletions(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 444faa89a55f..da61febb9091 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -251,7 +251,7 @@ struct tcf_proto { struct qdisc_skb_cb { unsigned intpkt_len; u16 slave_dev_queue_mapping; - u16 _pad; + u16 tc_classid; #define QDISC_CB_PRIV_LEN 20 unsigned char data[QDISC_CB_PRIV_LEN]; }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 92a48e2d5461..2fbd1c71fa3b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -293,6 +293,7 @@ struct __sk_buff { __u32 tc_index; __u32 cb[5]; __u32 hash; + __u32 tc_classid; }; struct bpf_tunnel_key { diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index 4f0d1bc3647d..0a262a83f9d4 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -373,6 +373,8 @@ enum { /* BPF classifier */ +#define TCA_BPF_FLAG_ACT_DIRECT(1 << 0) + enum { TCA_BPF_UNSPEC, TCA_BPF_ACT, @@ -382,6 +384,7 @@ enum { 
TCA_BPF_OPS, TCA_BPF_FD, TCA_BPF_NAME, + TCA_BPF_FLAGS, __TCA_BPF_MAX, }; diff --git a/net/core/filter.c b/net/core/filter.c index 13079f03902e..971d6ba89758 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1632,6 +1632,9 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type) static bool sk_filter_is_valid_access(int off, int size, enum bpf_access_type type) { + if (off == offsetof(struct __sk_buff, tc_classid)) + return false; + if (type == BPF_WRITE) { switch (off) { case offsetof(struct __sk_buff, cb[0]) ... @@ -1648,6 +1651,9 @@ static bool sk_filter_is_valid_access(int off, int size, static bool tc_cls_act_is_valid_access(int off, int size, enum bpf_access_type type) { + if (off == offsetof(struct __sk_buff, tc_classid)) + return type == BPF_WRITE ? true : false; + if (type == BPF_WRITE) { switch (off) { case offsetof(struct __sk_buff, mark): @@ -1760,6 +1766,14 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg, *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off); break; + case offsetof(struct __sk_buff, tc_classid): + ctx_off -= offsetof(struct __sk_buff, tc_classid); + ctx_off += offsetof(struct sk_buff, cb); + ctx_off += offsetof(struct qdisc_skb_cb, tc_classid); + WARN_ON(type != BPF_WRITE); + *insn++ = BPF_STX_MEM(BPF_H, dst_reg, src_reg, ctx_off); + break; + case offsetof(struct __sk_buff, tc_index): #ifdef CONFIG_NET_SCHED BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tc_index) != 2); diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index e5168f8b9640..77b0ef148256 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -38,6 +38,7 @@ struct cls_bpf_prog { struct bpf_prog *filter; struct list_head link; struct tcf_result res; + bool exts_integrated; struct tcf_exts exts; u32 handle; union { @@ -52,6 +53,7 @@ struct cls_bpf_prog { static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = { [TCA_BPF_CLASSID] = { .type = NLA_U32 }, + [TCA_BPF_FLAGS] = { .type = NLA_U32 }, 
[TCA_BPF_FD]= { .type = NLA_U32 }, [TCA_BPF_NAME] = { .type = NLA_NUL_STRING, .len = CLS_BPF_NAME_LEN }, [TCA_BPF_OPS_LEN]
[PATCH net-next 2/2] bpf: add bpf_redirect() helper
Existing bpf_clone_redirect() helper clones skb before redirecting it to RX or TX of destination netdev. Introduce bpf_redirect() helper that does that without cloning. Benchmarked with two hosts using 10G ixgbe NICs. One host is doing line rate pktgen. Another host is configured as: $ tc qdisc add dev $dev ingress $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \ action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop so it receives the packet on $dev and immediately xmits it on $dev + 1 The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program that does bpf_clone_redirect() and performance is 2.0 Mpps $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \ action bpf run object-file tcbpf1_kern.o section redirect_xmit drop which is using bpf_redirect() - 2.4 Mpps and using cls_bpf with integrated actions as: $ tc filter add dev $dev root pref 10 \ bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1 performance is 2.5 Mpps To summarize: u32+act_bpf using clone_redirect - 2.0 Mpps u32+act_bpf using redirect - 2.4 Mpps cls_bpf using redirect - 2.5 Mpps For comparison linux bridge in this setup is doing 2.1 Mpps and ixgbe rx + drop in ip_rcv - 7.8 Mpps Signed-off-by: Alexei Starovoitov Acked-by: Daniel Borkmann --- This approach is using per_cpu scratch area to store ifindex and flags. The other alternatives discussed at plumbers are slower and more intrusive. 
include/net/sch_generic.h|1 + include/uapi/linux/bpf.h |8 +++ include/uapi/linux/pkt_cls.h |1 + net/core/dev.c |8 +++ net/core/filter.c| 49 ++ net/sched/act_bpf.c |1 + net/sched/cls_bpf.c |1 + samples/bpf/bpf_helpers.h|4 samples/bpf/tcbpf1_kern.c| 24 - 9 files changed, 96 insertions(+), 1 deletion(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index da61febb9091..4c79ce8c1f92 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -402,6 +402,7 @@ void __qdisc_calculate_pkt_len(struct sk_buff *skb, const struct qdisc_size_table *stab); bool tcf_destroy(struct tcf_proto *tp, bool force); void tcf_destroy_chain(struct tcf_proto __rcu **fl); +int skb_do_redirect(struct sk_buff *); /* Reset all TX qdiscs greater then index of a device. */ static inline void qdisc_reset_all_tx_gt(struct net_device *dev, unsigned int i) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 2fbd1c71fa3b..4ec0b5488294 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -272,6 +272,14 @@ enum bpf_func_id { BPF_FUNC_skb_get_tunnel_key, BPF_FUNC_skb_set_tunnel_key, BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */ + /** +* bpf_redirect(ifindex, flags) - redirect to another netdev +* @ifindex: ifindex of the net device +* @flags: bit 0 - if set, redirect to ingress instead of egress +* other bits - reserved +* Return: TC_ACT_REDIRECT +*/ + BPF_FUNC_redirect, __BPF_FUNC_MAX_ID, }; diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index 0a262a83f9d4..439873775d49 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -87,6 +87,7 @@ enum { #define TC_ACT_STOLEN 4 #define TC_ACT_QUEUED 5 #define TC_ACT_REPEAT 6 +#define TC_ACT_REDIRECT7 #define TC_ACT_JUMP0x1000 /* Action type identifiers*/ diff --git a/net/core/dev.c b/net/core/dev.c index 877c84834d81..d6a492e57874 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3668,6 +3668,14 @@ static inline 
struct sk_buff *handle_ing(struct sk_buff *skb, case TC_ACT_QUEUED: kfree_skb(skb); return NULL; + case TC_ACT_REDIRECT: + /* skb_mac_header check was done by cls/act_bpf, so +* we can safely push the L2 header back before +* redirecting to another netdev +*/ + __skb_push(skb, skb->mac_len); + skb_do_redirect(skb); + return NULL; default: break; } diff --git a/net/core/filter.c b/net/core/filter.c index 971d6ba89758..5bf273bab781 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1427,6 +1427,53 @@ const struct bpf_func_proto bpf_clone_redirect_proto = { .arg3_type = ARG_ANYTHING, }; +struct redirect_info { + u32 ifindex; + u32 flags; +}; + +static DEFINE_PER_CPU(struct redirect_info, redirect_info); +static u64 bpf_redirect(u64 ifindex, u64 flags, u64 r3, u64 r4, u64 r5) +{ + struct redirect_info *ri = this_cpu_ptr(&redirect_info); + + ri->ifindex = ifindex; + ri->flags = fla
[PATCH net-next 0/2] bpf: performance improvements
At plumbers we discussed different options on how to get rid of skb_clone
from bpf_clone_redirect(), the patch 2 implements the best option.
Patch 1 adds 'integrated exts' to cls_bpf to improve performance by
combining simple actions into bpf classifier.

Alexei Starovoitov (1):
  bpf: add bpf_redirect() helper

Daniel Borkmann (1):
  cls_bpf: introduce integrated actions

 include/net/sch_generic.h    |    3 +-
 include/uapi/linux/bpf.h     |    9 ++
 include/uapi/linux/pkt_cls.h |    4 +++
 net/core/dev.c               |    8 ++
 net/core/filter.c            |   63 ++
 net/sched/act_bpf.c          |    1 +
 net/sched/cls_bpf.c          |   61
 samples/bpf/bpf_helpers.h    |    4 +++
 samples/bpf/tcbpf1_kern.c    |   24 +++-
 9 files changed, 164 insertions(+), 13 deletions(-)

--
1.7.9.5
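The two-step pattern behind bpf_redirect() (the helper only records where the packet should go, and the caller acts on the TC_ACT_REDIRECT verdict afterwards) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: a thread-local variable stands in for the kernel's per-CPU scratch area, model_bpf_redirect() and model_do_redirect() are made-up names, REDIRECT_F_INGRESS is a hypothetical name for "bit 0" of the flags, and clearing the scratch entry after use is an assumption, not something shown in the patch text here. Only struct redirect_info and the value of TC_ACT_REDIRECT come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TC_ACT_REDIRECT    7   /* value taken from the patch */
#define REDIRECT_F_INGRESS 1u  /* hypothetical name for flags bit 0 */

struct redirect_info {
    uint32_t ifindex;
    uint32_t flags;
};

/* Userspace stand-in for the per-CPU scratch area. */
static _Thread_local struct redirect_info redirect_scratch;

/* Step 1: the helper called from the program. It does no packet work,
 * it only records the target and returns the verdict. */
static int model_bpf_redirect(uint32_t ifindex, uint32_t flags)
{
    redirect_scratch.ifindex = ifindex;
    redirect_scratch.flags = flags;
    return TC_ACT_REDIRECT;
}

struct redirect_result {
    uint32_t ifindex;
    int to_ingress;  /* 1: deliver to RX of target, 0: transmit on it */
};

/* Step 2: run by the caller once it sees TC_ACT_REDIRECT. Reads the
 * scratch area on the same CPU (here: same thread) and consumes it.
 * Returns 0 on success, -1 if nothing was recorded. */
static int model_do_redirect(struct redirect_result *out)
{
    assert(out != 0);
    if (redirect_scratch.ifindex == 0)
        return -1;

    out->ifindex = redirect_scratch.ifindex;
    out->to_ingress = (redirect_scratch.flags & REDIRECT_F_INGRESS) != 0;
    redirect_scratch.ifindex = 0;  /* assumption: entry is single-use */
    return 0;
}
```

The design point is that nothing is cloned or allocated between the two steps. The scheme relies on the program and the code handling its verdict running back to back on the same CPU, which the thread-local variable approximates here.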
linux-next: build failure after merge of the tip tree
Hi all, After merging the next-20150915 version of the tip tree, today's linux-next build (x86_64 allmodconfig) failed like this: In file included from drivers/usb/gadget/function/u_ether.h:20:0, from drivers/usb/gadget/function/f_ncm.c:26: include/linux/usb/cdc.h:23:8: error: redefinition of 'struct usb_cdc_parsed_header' struct usb_cdc_parsed_header { ^ In file included from drivers/usb/gadget/function/f_ncm.c:24:0: include/linux/usb/cdc.h:23:8: note: originally defined here struct usb_cdc_parsed_header { ^ In file included from drivers/usb/gadget/function/u_ether.h:20:0, from drivers/usb/gadget/function/f_ncm.c:26: include/linux/usb/cdc.h:44:5: error: conflicting types for 'cdc_parse_cdc_header' int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr, ^ In file included from drivers/usb/gadget/function/f_ncm.c:24:0: include/linux/usb/cdc.h:44:5: note: previous declaration of 'cdc_parse_cdc_header' was here int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr, ^ In file included from drivers/usb/gadget/function/u_serial.h:16:0, from drivers/usb/gadget/legacy/cdc2.c:17: include/linux/usb/cdc.h:23:8: error: redefinition of 'struct usb_cdc_parsed_header' struct usb_cdc_parsed_header { ^ In file included from drivers/usb/gadget/function/u_ether.h:20:0, from drivers/usb/gadget/legacy/cdc2.c:16: include/linux/usb/cdc.h:23:8: note: originally defined here struct usb_cdc_parsed_header { ^ In file included from drivers/usb/gadget/function/u_serial.h:16:0, from drivers/usb/gadget/legacy/cdc2.c:17: include/linux/usb/cdc.h:44:5: error: conflicting types for 'cdc_parse_cdc_header' int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr, ^ In file included from drivers/usb/gadget/function/u_ether.h:20:0, from drivers/usb/gadget/legacy/cdc2.c:16: include/linux/usb/cdc.h:44:5: note: previous declaration of 'cdc_parse_cdc_header' was here int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr, ^ Caused by commit c40a2c8817e4 ("CDC: common parser for extra 
headers") from the net-next tree that added include/linux/usb/cdc.h with no reinclusion guards. I am not sure why I did not see this failure when building after merging the net-next tree. Maybe it is exposed by some config change in the tip tree? I have added the following fix patch for today: From: Stephen Rothwell Date: Wed, 16 Sep 2015 11:10:16 +1000 Subject: [PATCH] cdc: add header guards Signed-off-by: Stephen Rothwell --- include/linux/usb/cdc.h | 4 include/uapi/linux/usb/cdc.h | 6 +++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/linux/usb/cdc.h b/include/linux/usb/cdc.h index 959d0c838113..b5706f94ee9e 100644 --- a/include/linux/usb/cdc.h +++ b/include/linux/usb/cdc.h @@ -7,6 +7,8 @@ * modify it under the terms of the GNU General Public License * version 2 as published by the Free Software Foundation. */ +#ifndef __LINUX_USB_CDC_H +#define __LINUX_USB_CDC_H #include @@ -45,3 +47,5 @@ int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr, struct usb_interface *intf, u8 *buffer, int buflen); + +#endif /* __LINUX_USB_CDC_H */ diff --git a/include/uapi/linux/usb/cdc.h b/include/uapi/linux/usb/cdc.h index b6a9cdd6e096..e2bc417b243b 100644 --- a/include/uapi/linux/usb/cdc.h +++ b/include/uapi/linux/usb/cdc.h @@ -6,8 +6,8 @@ * firmware based USB peripherals. */ -#ifndef __LINUX_USB_CDC_H -#define __LINUX_USB_CDC_H +#ifndef __UAPI_LINUX_USB_CDC_H +#define __UAPI_LINUX_USB_CDC_H #include @@ -444,4 +444,4 @@ struct usb_cdc_ncm_ndp_input_size { #define USB_CDC_NCM_CRC_NOT_APPENDED 0x00 #define USB_CDC_NCM_CRC_APPENDED 0x01 -#endif /* __LINUX_USB_CDC_H */ +#endif /* __UAPI_LINUX_USB_CDC_H */ -- 2.5.1 -- Cheers, Stephen Rothwells...@canb.auug.org.au -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
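For readers unfamiliar with the failure mode, the effect of the added guards can be reproduced in a single file by pasting the same guarded "header" text twice, which is roughly what the preprocessor sees when two headers both include an unguarded third one. Everything below (EXAMPLE_PARSED_HEADER_H, struct example_parsed_header, example_count()) is made up for illustration and is not the contents of include/linux/usb/cdc.h. Without the #ifndef/#define/#endif lines, the second copy would trigger the same "redefinition of struct" error shown in the build log.

```c
#include <assert.h>

/* ---- first "inclusion" of the hypothetical header ---- */
#ifndef EXAMPLE_PARSED_HEADER_H
#define EXAMPLE_PARSED_HEADER_H

struct example_parsed_header {
    int n_descriptors;
};

static int example_count(const struct example_parsed_header *hdr)
{
    assert(hdr != 0);
    return hdr->n_descriptors;
}

#endif /* EXAMPLE_PARSED_HEADER_H */

/* ---- second "inclusion" of the same text ----
 * The guard macro is already defined, so the preprocessor drops
 * everything up to the matching #endif and the compiler never sees
 * a second definition. */
#ifndef EXAMPLE_PARSED_HEADER_H
#define EXAMPLE_PARSED_HEADER_H

struct example_parsed_header {
    int n_descriptors;
};

static int example_count(const struct example_parsed_header *hdr)
{
    assert(hdr != 0);
    return hdr->n_descriptors;
}

#endif /* EXAMPLE_PARSED_HEADER_H */
```

The fix patch also renames the guard in the uapi header to __UAPI_LINUX_USB_CDC_H. Had both files used __LINUX_USB_CDC_H, whichever was included first would have suppressed the other.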
[PATCH v2 net] net/mlx4_en: really allow to change RSS key
From: Eric Dumazet

When changing rss key, we do not want to overwrite user provided key
by the one provided by netdev_rss_key_fill(), which is the host random
key generated at boot time.

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet
Cc: Eyal Perry
CC: Amir Vadai
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4402a1e48c9b..0ce6ffe73ca8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1268,8 +1268,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 		rss_context->hash_fn = MLX4_RSS_HASH_TOP;
 		memcpy(rss_context->rss_key, priv->rss_key,
 		       MLX4_EN_RSS_KEY_SIZE);
-		netdev_rss_key_fill(rss_context->rss_key,
-				    MLX4_EN_RSS_KEY_SIZE);
 	} else {
 		en_err(priv, "Unknown RSS hash function requested\n");
 		err = -EINVAL;
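The bug being removed is a copy followed immediately by an overwrite of the same buffer. A minimal userspace model, with made-up names and an 8-byte key chosen only to keep the example short (it is not the real MLX4_EN_RSS_KEY_SIZE), might look like this; fill_boot_key() stands in for netdev_rss_key_fill() and boot_key for the host key generated at boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_KEY_SIZE 8  /* illustrative size only */

/* Stand-in for the host-wide key generated at boot. */
static const uint8_t boot_key[EXAMPLE_KEY_SIZE] = {
    0xde, 0xad, 0xbe, 0xef, 0x01, 0x23, 0x45, 0x67
};

/* Stand-in for netdev_rss_key_fill(): copy the boot key into dst. */
static void fill_boot_key(uint8_t *dst, size_t len)
{
    assert(len <= EXAMPLE_KEY_SIZE);
    memcpy(dst, boot_key, len);
}

/* Before the fix: the user's key is copied, then clobbered. */
static void config_key_buggy(uint8_t *hw_key, const uint8_t *user_key)
{
    memcpy(hw_key, user_key, EXAMPLE_KEY_SIZE);
    fill_boot_key(hw_key, EXAMPLE_KEY_SIZE);  /* overwrites the copy */
}

/* After the fix: only the user's key is programmed. */
static void config_key_fixed(uint8_t *hw_key, const uint8_t *user_key)
{
    memcpy(hw_key, user_key, EXAMPLE_KEY_SIZE);
}
```

With the buggy variant the hardware context always ends up holding the boot key regardless of what the user configured, which is why the two deleted lines are enough to fix it.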
Re: [PATCH net] net/mlx4_en:
Arg, patch title was meant to be

net/mlx4_en: really allow to change RSS key
[PATCH net] net/mlx4_en:
From: Eric Dumazet

When changing rss key, we do not want to overwrite user provided key
by the one provided by netdev_rss_key_fill(), which is the host random
key generated at boot time.

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet
Cc: Eyal Perry
CC: Amir Vadai
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4402a1e48c9b..0ce6ffe73ca8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1268,8 +1268,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 		rss_context->hash_fn = MLX4_RSS_HASH_TOP;
 		memcpy(rss_context->rss_key, priv->rss_key,
 		       MLX4_EN_RSS_KEY_SIZE);
-		netdev_rss_key_fill(rss_context->rss_key,
-				    MLX4_EN_RSS_KEY_SIZE);
 	} else {
 		en_err(priv, "Unknown RSS hash function requested\n");
 		err = -EINVAL;
[PATCH next 21/30] ipv6: Only compute net once in ip6_finish_output2
Signed-off-by: "Eric W. Biederman"
---
 net/ipv6/ip6_output.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a80502c64523..12d0166a64cd 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,6 +60,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
+	struct net *net = dev_net(dev);
 	struct neighbour *neigh;
 	struct in6_addr *nexthop;
 	int ret;
@@ -71,7 +72,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 		struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
 		if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
-		    ((mroute6_socket(dev_net(dev), skb) &&
+		    ((mroute6_socket(net, skb) &&
 		     !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
 		     ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
 					 &ipv6_hdr(skb)->saddr))) {
@@ -86,15 +87,14 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 					dev_loopback_xmit);
 
 			if (ipv6_hdr(skb)->hop_limit == 0) {
-				IP6_INC_STATS(dev_net(dev), idev,
+				IP6_INC_STATS(net, idev,
 					      IPSTATS_MIB_OUTDISCARDS);
 				kfree_skb(skb);
 				return 0;
 			}
 		}
 
-		IP6_UPD_PO_STATS(dev_net(dev), idev, IPSTATS_MIB_OUTMCAST,
-				 skb->len);
+		IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
 
 		if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
 		    IPV6_ADDR_SCOPE_NODELOCAL &&
@@ -116,8 +116,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	}
 	rcu_read_unlock_bh();
 
-	IP6_INC_STATS(dev_net(dst->dev),
-		      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+	IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
2.2.1
[PATCH next 25/30] bridge: Pass net into br_nf_push_frag_xmit
When struct net starts being passed through the ipv4 and ipv6 fragment
routines br_nf_push_frag_xmit will need to take a net parameter.

Prepare br_nf_push_frag_xmit before that is needed and introduce
br_nf_push_frag_xmit_sk for the call sites that still need the old
calling conventions.

Signed-off-by: "Eric W. Biederman"
---
 net/bridge/br_netfilter_hooks.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 971d45d24c64..e6910b71af6e 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -668,7 +668,7 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 }
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4) || IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
-static int br_nf_push_frag_xmit(struct sock *sk, struct sk_buff *skb)
+static int br_nf_push_frag_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct brnf_frag_data *data;
 	int err;
@@ -692,6 +692,11 @@ static int br_nf_push_frag_xmit(struct sock *sk, struct sk_buff *skb)
 	nf_bridge_info_free(skb);
 	return br_dev_queue_push_xmit(sk, skb);
 }
+static int br_nf_push_frag_xmit_sk(struct sock *sk, struct sk_buff *skb)
+{
+	struct net *net = dev_net(skb_dst(skb)->dev);
+	return br_nf_push_frag_xmit(net, sk, skb);
+}
 #endif
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
@@ -760,7 +765,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct sk_buff *skb)
 		skb_copy_from_linear_data_offset(skb, -data->size, data->mac,
 						 data->size);
 
-		return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit);
+		return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit_sk);
 	}
 #endif
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
@@ -783,7 +788,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct sk_buff *skb)
 						 data->size);
 
 		if (v6ops)
-			return v6ops->fragment(sk, skb, br_nf_push_frag_xmit_sk);
+			return v6ops->fragment(sk, skb, br_nf_push_frag_xmit_sk);
 
 		kfree_skb(skb);
 		return -EMSGSIZE;
-- 
2.2.1
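The transitional shim introduced by this patch follows a common pattern: a function gains a new leading argument, and a thin wrapper with the old prototype derives that argument so existing callback users keep working. Below is a minimal user-space sketch of the pattern. All model_* names are hypothetical; the real code uses struct net, struct sock, struct sk_buff and dev_net(skb_dst(skb)->dev).

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net, net_device, sock and sk_buff. */
struct model_net  { int id; };
struct model_dev  { struct model_net *net; };
struct model_sock { int unused; };
struct model_skb {
	struct model_dev *dst_dev;	/* models skb_dst(skb)->dev */
	struct model_net *seen_net;	/* records the namespace that was used */
};

/* New-style function: the caller supplies the namespace explicitly. */
static int model_push_frag_xmit(struct model_net *net, struct model_sock *sk,
				struct model_skb *skb)
{
	(void)sk;
	skb->seen_net = net;
	return 0;
}

/* Old-prototype shim: derives net from the skb and forwards the call. */
static int model_push_frag_xmit_sk(struct model_sock *sk, struct model_skb *skb)
{
	struct model_net *net = skb->dst_dev->net;

	return model_push_frag_xmit(net, sk, skb);
}

/* A fragmenter that still takes the old two-argument callback. */
static int model_fragment(struct model_sock *sk, struct model_skb *skb,
			  int (*output)(struct model_sock *, struct model_skb *))
{
	return output(sk, skb);
}
```

Once every caller of the old callback type has been converted, a shim of this kind can simply be deleted, which appears to be the intent of preparing the function "before that is needed".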
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, 2015-09-15 at 17:04 -0700, Mahesh Bandewar wrote:
> On Tue, Sep 15, 2015 at 4:20 PM, Eric Dumazet wrote:
> > On Tue, 2015-09-15 at 15:54 -0700, Mahesh Bandewar wrote:
> >
> >> > +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> >> > +	    skb->l4_hash)
> >> if ((ENCAP34 || LAYER34) && l4_hash) maybe?
> >
> > Hmm, traditional BOND_XMIT_POLICY_LAYER34 did not do a full flow bisection
> > (tunnel awareness added in commit
> > 32819dc1834866cb9547cb75f81af9edd58d33cd bonding: modify the old and add
> > new xmit hash policies)
> >
> > This could radically change some setups and behavior.
> >
> > BOND_XMIT_POLICY_ENCAP34 looks a better fit to me.
> >
> Agreed, this will change flow distribution for LAYER34 policy but then
> lose out on calculating hash per packet which I think is unnecessary.

We added the new bonding policy exactly for this.

If people are stuck with LAYER34, that is their choice.

>
> This elimination of hash calculation is a good step but I'm feeling
> that it's somehow tied to ENCAP policy which is actually orthogonal
> and should be applied to LAYER34 also.

You can send a followup patch, once fully tested.

I've tested the ENCAP34 mode only; I do not want to add cycles for a
mode that is potentially a legacy one that nobody uses.
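The selection logic under discussion can be sketched in user space as follows. This is an illustration only: the model_* types, the policy enum and model_compute_hash() are hypothetical, and the assumption that the quoted condition guards a "reuse skb->hash" shortcut is inferred from the thread, not taken from the bonding driver's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two bonding policies named in the thread. */
enum model_xmit_policy { MODEL_POLICY_LAYER34, MODEL_POLICY_ENCAP34 };

struct model_skb {
	uint32_t hash;		/* flow hash already computed by the stack */
	bool     l4_hash;	/* true when 'hash' covers the L4 ports */
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Placeholder per-packet hash; NOT the bonding driver's real function. */
static uint32_t model_compute_hash(const struct model_skb *skb)
{
	uint32_t h = skb->saddr ^ skb->daddr;

	h ^= ((uint32_t)skb->sport << 16) | skb->dport;
	h ^= h >> 16;
	h ^= h >> 8;
	return h;
}

/* Reuse the stack's L4 hash only for ENCAP34; otherwise hash per packet. */
static uint32_t model_xmit_hash(enum model_xmit_policy policy,
				const struct model_skb *skb)
{
	if (policy == MODEL_POLICY_ENCAP34 && skb->l4_hash)
		return skb->hash;
	return model_compute_hash(skb);
}
```

Extending the shortcut to LAYER34, as suggested in the quoted text, would amount to widening the first condition; the reply's concern is that this changes flow distribution for existing LAYER34 setups.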
[PATCH next 19/30] net: Remove dev_queue_xmit_sk
A function with weird arguments that it will never use to accommodate a
netfilter callback prototype is absolutely in the core of the
networking stack. Frankly it does not make sense and it causes a lot
of confusion as to why arguments that are never used are being passed
to the function.

As I am preparing to make a second change to arguments to the okfn even
the names stop making sense.

As I have removed the two callers of this function remove this
confusion from the networking stack.

Signed-off-by: "Eric W. Biederman"
---
 include/linux/netdevice.h | 6 +-----
 net/core/dev.c            | 4 ++--
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88a00694eda5..e664f87c8e4c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2213,11 +2213,7 @@ int dev_close(struct net_device *dev);
 int dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct sock *sk, struct sk_buff *newskb);
-int dev_queue_xmit_sk(struct sock *sk, struct sk_buff *skb);
-static inline int dev_queue_xmit(struct sk_buff *skb)
-{
-	return dev_queue_xmit_sk(skb->sk, skb);
-}
+int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index 877c84834d81..dcf9ff913925 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3143,11 +3143,11 @@ out:
 	return rc;
 }
 
-int dev_queue_xmit_sk(struct sock *sk, struct sk_buff *skb)
+int dev_queue_xmit(struct sk_buff *skb)
 {
 	return __dev_queue_xmit(skb, NULL);
 }
-EXPORT_SYMBOL(dev_queue_xmit_sk);
+EXPORT_SYMBOL(dev_queue_xmit);
 
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
 {
-- 
2.2.1
[PATCH next 23/30] ipv6: Compute net once in raw6_send_hdrinc
Signed-off-by: "Eric W. Biederman"
---
 net/ipv6/raw.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 1636537705f5..5aa461302716 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -614,6 +614,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 			unsigned int flags)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct net *net = sock_net(sk);
 	struct ipv6hdr *iph;
 	struct sk_buff *skb;
 	int err;
@@ -652,7 +653,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 	if (err)
 		goto error_fault;
 
-	IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
+	IP6_UPD_PO_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
 	err = NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT, sk, skb,
 		      NULL, rt->dst.dev, dst_output);
 	if (err > 0)
@@ -666,7 +667,7 @@ error_fault:
 	err = -EFAULT;
 	kfree_skb(skb);
 error:
-	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	if (err == -ENOBUFS && !np->recverr)
 		err = 0;
 	return err;
-- 
2.2.1
[PATCH next 26/30] bridge: Cache net in br_nf_pre_routing_finish
This is prep work for passing net to the netfilter hooks.

Signed-off-by: "Eric W. Biederman"
---
 net/bridge/br_netfilter_hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index e6910b71af6e..c1127908e23a 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -346,6 +346,7 @@ static int br_nf_pre_routing_finish(struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb->dev;
 	struct iphdr *iph = ip_hdr(skb);
+	struct net *net = dev_net(dev);
 	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 	struct rtable *rt;
 	int err;
@@ -371,7 +372,7 @@ static int br_nf_pre_routing_finish(struct sock *sk, struct sk_buff *skb)
 			if (err != -EHOSTUNREACH || !in_dev || IN_DEV_FORWARD(in_dev))
 				goto free_skb;
 
-			rt = ip_route_output(dev_net(dev), iph->daddr, 0,
+			rt = ip_route_output(net, iph->daddr, 0,
 					     RT_TOS(iph->tos), 0);
 			if (!IS_ERR(rt)) {
 				/* - Bridged-and-DNAT'ed traffic doesn't
-- 
2.2.1
[PATCH next 24/30] bridge: Pass net into br_nf_ip_fragment
This is prep work for passing struct net through ip_do_fragment and
later the netfilter okfn. Doing this independently makes the later
code changes clearer.

Signed-off-by: "Eric W. Biederman"
---
 net/bridge/br_netfilter_hooks.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 0a6f095bb0c9..971d45d24c64 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -695,18 +695,17 @@ static int br_nf_push_frag_xmit(struct sock *sk, struct sk_buff *skb)
 #endif
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
-static int br_nf_ip_fragment(struct sock *sk, struct sk_buff *skb,
-			     int (*output)(struct sock *, struct sk_buff *))
+static int
+br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
+		  int (*output)(struct sock *, struct sk_buff *))
 {
 	unsigned int mtu = ip_skb_dst_mtu(skb);
 	struct iphdr *iph = ip_hdr(skb);
-	struct rtable *rt = skb_rtable(skb);
-	struct net_device *dev = rt->dst.dev;
 
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
@@ -726,6 +725,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct sk_buff *skb)
 {
 	struct nf_bridge_info *nf_bridge;
 	unsigned int mtu_reserved;
+	struct net *net = dev_net(skb_dst(skb)->dev);
 
 	mtu_reserved = nf_bridge_mtu_reduction(skb);
 
@@ -760,7 +760,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct sk_buff *skb)
 		skb_copy_from_linear_data_offset(skb, -data->size, data->mac,
 						 data->size);
 
-		return br_nf_ip_fragment(sk, skb, br_nf_push_frag_xmit);
+		return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit);
 	}
 #endif
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
-- 
2.2.1
[PATCH next 20/30] ipv6: Don't recompute net in ip6_rcv
Avoid silly redundant code

Signed-off-by: "Eric W. Biederman"
---
 net/ipv6/ip6_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index adba03ac7ce9..c628dba477d4 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -109,7 +109,7 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt
 	if (hdr->version != 6)
 		goto err;
 
-	IP6_ADD_STATS_BH(dev_net(dev), idev,
+	IP6_ADD_STATS_BH(net, idev,
 			 IPSTATS_MIB_NOECTPKTS +
 				(ipv6_get_dsfield(hdr) & INET_ECN_MASK),
 			 max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
-- 
2.2.1
[PATCH next 11/30] ipv4: Only compute net once in ip_do_fragment
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_output.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9ee622ad8dfa..85b72d450184 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -531,9 +531,11 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
+	struct net *net;
 	int err = 0;
 
 	dev = rt->dst.dev;
+	net = dev_net(dev);
 
 	/*
 	 *	Point into the IP datagram header.
@@ -626,7 +628,7 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
 			err = output(sk, skb);
 
 			if (!err)
-				IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
+				IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -636,7 +638,7 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
 		}
 
 		if (err == 0) {
-			IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
+			IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
 			return 0;
 		}
 
@@ -645,7 +647,7 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
 			kfree_skb(frag);
 			frag = skb;
 		}
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
 		return err;
 
 slow_path_clean:
@@ -767,15 +769,15 @@ slow_path:
 		if (err)
 			goto fail;
 
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
+		IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
 	}
 	consume_skb(skb);
-	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
+	IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
 	return err;
 
 fail:
 	kfree_skb(skb);
-	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+	IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
 	return err;
 }
 EXPORT_SYMBOL(ip_do_fragment);
-- 
2.2.1
[PATCH next 28/30] netfilter: Pass struct net into the netfilter hooks
Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known, which allows the network namespace to
be determined easily and reliably.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
ipv6/ip6_output.c:ip6_xmit()                    sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in than the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman"
---
 drivers/net/vrf.c                         | 7 ---
 include/linux/netfilter.h                 | 27 ---
 net/bridge/br_forward.c                   | 13 +++--
 net/bridge/br_input.c                     | 13 +++--
 net/bridge/br_multicast.c                 | 4 ++--
 net/bridge/br_netfilter_hooks.c           | 15 ---
 net/bridge/br_netfilter_ipv6.c            | 7 ---
 net/bridge/br_stp_bpdu.c                  | 4 ++--
 net/decnet/dn_neigh.c                     | 15 +--
 net/decnet/dn_nsp_in.c                    | 4 ++--
 net/decnet/dn_route.c                     | 24
 net/ipv4/arp.c                            | 10 ++
 net/ipv4/ip_forward.c                     | 5 +++--
 net/ipv4/ip_input.c                       | 8
 net/ipv4/ip_output.c                      | 22 +-
 net/ipv4/ipmr.c                           | 4 ++--
 net/ipv4/raw.c                            | 5 +++--
 net/ipv4/xfrm4_input.c                    | 4 ++--
 net/ipv4/xfrm4_output.c                   | 6 --
 net/ipv6/ip6_input.c                      | 8
 net/ipv6/ip6_output.c                     | 15 ---
 net/ipv6/ip6mr.c                          | 4 ++--
 net/ipv6/mcast.c                          | 7 ---
 net/ipv6/ndisc.c                          | 4 ++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 2 +-
 net/ipv6/output_core.c                    | 6 --
 net/ipv6/raw.c                            | 2 +-
 net/ipv6/xfrm6_input.c                    | 4 ++--
 net/ipv6/xfrm6_output.c                   | 6 --
 net/netfilter/ipvs/ip_vs_xmit.c           | 4 ++--
 net/xfrm/xfrm_output.c                    | 3 ++-
 31 files changed, 142 insertions(+), 120 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index e7094fbd7568..c82260341b72 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -298,14 +298,15 @@ err:
 static int vrf_output(struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
+	struct net *net = dev_net(dev);
 
-	IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
-	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, sk, skb,
-			    NULL, dev,
+	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
+			    net, sk, skb, NULL, dev,
 			    vrf_finish_output,
 			    !(IPCB(skb)->flags & IPSKB_REROUTED));
 }
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 042148dc1e22..295f2650b5dc 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -190,12 +190,11 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 	return 1;
 }
 
-static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sock *sk,
-			  struct sk_buff *skb, struct net_device *indev,
-			  struct net_device *outdev,
+static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
+			  struct sock *sk, struct sk_buff *skb,
+			  struct net_device *indev, struct net_device *outdev,
 			  int (*okfn)(struct sock *, struct sk_buff *))
 {
-	struct net *net = dev_net(indev ? ind
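The difference between the historic "dev_net(in?in:out)" idiom and reading "state->net" can be sketched in user space. The model_* types and helpers below are hypothetical; only the two idioms themselves come from the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net, net_device and nf_hook_state. */
struct model_net { int id; };
struct model_dev { struct model_net *net; };

struct model_hook_state {
	struct model_dev *in;
	struct model_dev *out;
	struct model_net *net;	/* supplied by the call site */
};

/* The call site knows the namespace and records it in the state. */
static void model_state_init(struct model_hook_state *s, struct model_dev *in,
			     struct model_dev *out, struct model_net *net)
{
	s->in = in;
	s->out = out;
	s->net = net;
}

/*
 * Old idiom: guess the namespace from whichever device is set.
 * The NULL guard is only for this model; the kernel idiom has none.
 */
static struct model_net *model_net_from_devs(const struct model_hook_state *s)
{
	const struct model_dev *dev = s->in ? s->in : s->out;

	return dev ? dev->net : NULL;
}

/* New idiom: trust what the call site passed in. */
static struct model_net *model_net_from_state(const struct model_hook_state *s)
{
	return s->net;
}
```

In the common case both helpers return the same namespace. The exceptions listed in the commit message correspond to call sites that pass something other than the first device's namespace (for example a socket's namespace), where the two helpers would disagree.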
[PATCH next 30/30] netfilter: Pass net into okfn
This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done its job, passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman"
---
 drivers/net/vrf.c                    | 2 +-
 include/linux/netdevice.h            | 2 +-
 include/linux/netfilter.h            | 26 ++
 include/linux/netfilter_bridge.h     | 2 +-
 include/net/dn_neigh.h               | 6 +++---
 include/net/dst.h                    | 4
 include/net/ipv6.h                   | 2 +-
 include/net/netfilter/br_netfilter.h | 2 +-
 net/bridge/br_forward.c              | 5 ++---
 net/bridge/br_input.c                | 7 ---
 net/bridge/br_netfilter_hooks.c      | 21 +
 net/bridge/br_netfilter_ipv6.c       | 3 +--
 net/bridge/br_private.h              | 6 +++---
 net/bridge/br_stp_bpdu.c             | 3 ++-
 net/core/dev.c                       | 4 +++-
 net/decnet/dn_neigh.c                | 8
 net/decnet/dn_nsp_in.c               | 3 ++-
 net/decnet/dn_route.c                | 6 +++---
 net/ipv4/arp.c                       | 7 +++
 net/ipv4/ip_forward.c                | 3 +--
 net/ipv4/ip_input.c                  | 7 ++-
 net/ipv4/ip_output.c                 | 4 ++--
 net/ipv4/ipmr.c                      | 4 ++--
 net/ipv4/raw.c                       | 2 +-
 net/ipv4/xfrm4_input.c               | 3 ++-
 net/ipv4/xfrm4_output.c              | 2 +-
 net/ipv6/ip6_input.c                 | 5 ++---
 net/ipv6/ip6_output.c                | 7 ---
 net/ipv6/ip6mr.c                     | 3 +--
 net/ipv6/mcast.c                     | 4 ++--
 net/ipv6/ndisc.c                     | 2 +-
 net/ipv6/output_core.c               | 2 +-
 net/ipv6/raw.c                       | 2 +-
 net/ipv6/xfrm6_output.c              | 2 +-
 net/netfilter/ipvs/ip_vs_xmit.c      | 4 ++--
 net/netfilter/nf_queue.c             | 2 +-
 net/xfrm/xfrm_output.c               | 12 ++--
 37 files changed, 95 insertions(+), 94 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index c82260341b72..4dd701d7b8e6 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -253,7 +253,7 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
 }
 
 /* modelled after ip_finish_output2 */
-static int vrf_finish_output(struct sock *sk, struct sk_buff *skb)
+static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct rtable *rt = (struct rtable *)dst;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97ab5c9a7069..b791405958b4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2212,7 +2212,7 @@ int dev_open(struct net_device *dev);
 int dev_close(struct net_device *dev);
 int dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
-int dev_loopback_xmit(struct sock *sk, struct sk_buff *newskb);
+int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
 int register_netdevice(struct net_device *dev);
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 295f2650b5dc..0b4d4560f33d 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -56,7 +56,7 @@ struct nf_hook_state {
 	struct sock *sk;
 	struct net *net;
 	struct list_head *hook_list;
-	int (*okfn)(struct sock *, struct sk_buff *);
+	int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
 static inline void nf_hook_state_init(struct nf_hook_state *p,
@@ -67,7 +67,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
 				      struct net_device *outdev,
 				      struct sock *sk,
 				      struct net *net,
-				      int (*okfn)(struct sock *, struct sk_buff *))
+				      int (*okfn)(struct net *, struct sock *, struct sk_buff *))
 {
 	p->hook = hook;
 	p->thresh = thresh;
@@ -175,7 +175,7 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 				 struct sk_buff *skb,
 				 struct net_device *indev,
 				 struct net_device *outdev,
-
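How a three-argument okfn and an adapter such as dst_output_okfn fit together can be sketched in user space. Everything named model_* is hypothetical, including the toy hook; the only facts taken from the patch are the new callback prototype (net, sk, skb) and that dst_output_okfn drops the net argument for now.

```c
/* Hypothetical stand-ins for struct net, sock and sk_buff. */
struct model_net  { int id; };
struct model_sock { int id; };
struct model_skb  { int len; int sent; };

/* New continuation prototype: net is passed ahead of sk and skb. */
typedef int (*model_okfn_t)(struct model_net *, struct model_sock *,
			    struct model_skb *);

/* Existing output function that does not know about net. */
static int model_dst_output(struct model_sock *sk, struct model_skb *skb)
{
	(void)sk;
	skb->sent = 1;
	return 0;
}

/* Adapter in the spirit of dst_output_okfn: silently drops net. */
static int model_dst_output_okfn(struct model_net *net, struct model_sock *sk,
				 struct model_skb *skb)
{
	(void)net;
	return model_dst_output(sk, skb);
}

/*
 * Toy hook: if the packet is accepted, continue with okfn and hand it
 * the namespace the call site chose; otherwise report a drop.
 */
static int model_nf_hook(int accept, struct model_net *net,
			 struct model_sock *sk, struct model_skb *skb,
			 model_okfn_t okfn)
{
	if (!accept)
		return -1;	/* models a drop verdict */
	return okfn(net, sk, skb);
}
```

The adapter keeps functions like the output path unchanged while the callback type evolves, so continuation functions that do need the namespace can take it from the argument instead of recomputing it.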
[PATCH next 16/30] ipv6: Only compute net once in ip6mr_forward2_finish
Signed-off-by: "Eric W. Biederman"
---
 net/ipv6/ip6mr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e95f6b6281de..3e3085b37a91 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1987,9 +1987,10 @@ int ip6mr_compat_ioctl(struct sock *sk, unsigned int cmd, void __user *arg)
 
 static inline int ip6mr_forward2_finish(struct sock *sk, struct sk_buff *skb)
 {
-	IP6_INC_STATS_BH(dev_net(skb_dst(skb)->dev), ip6_dst_idev(skb_dst(skb)),
+	struct net *net = dev_net(skb_dst(skb)->dev);
+	IP6_INC_STATS_BH(net, ip6_dst_idev(skb_dst(skb)),
 			 IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP6_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), ip6_dst_idev(skb_dst(skb)),
+	IP6_ADD_STATS_BH(net, ip6_dst_idev(skb_dst(skb)),
 			 IPSTATS_MIB_OUTOCTETS, skb->len);
 	return dst_output(sk, skb);
 }
-- 
2.2.1
[PATCH next 27/30] bridge: Add br_netif_receive_skb remove netif_receive_skb_sk
netif_receive_skb_sk is only called once in the bridge code; replace
it with a bridge-specific function that calls netif_receive_skb.

Signed-off-by: "Eric W. Biederman"
---
 include/linux/netdevice.h | 6 +-----
 net/bridge/br_input.c     | 7 ++++++-
 net/core/dev.c            | 4 ++--
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e664f87c8e4c..97ab5c9a7069 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2985,11 +2985,7 @@ static inline void dev_consume_skb_any(struct sk_buff *skb)
 
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
-int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb);
-static inline int netif_receive_skb(struct sk_buff *skb)
-{
-	return netif_receive_skb_sk(skb->sk, skb);
-}
+int netif_receive_skb(struct sk_buff *skb);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f921a5dce22d..2359c041e27c 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -26,6 +26,11 @@
 br_should_route_hook_t __rcu *br_should_route_hook __read_mostly;
 EXPORT_SYMBOL(br_should_route_hook);
 
+static int br_netif_receive_skb(struct sock *sk, struct sk_buff *skb)
+{
+	return netif_receive_skb(skb);
+}
+
 static int br_pass_frame_up(struct sk_buff *skb)
 {
 	struct net_device *indev, *brdev = BR_INPUT_SKB_CB(skb)->brdev;
@@ -57,7 +62,7 @@ static int br_pass_frame_up(struct sk_buff *skb)
 
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, NULL, skb,
 		       indev, NULL,
-		       netif_receive_skb_sk);
+		       br_netif_receive_skb);
 }
 
 static void br_do_proxy_arp(struct sk_buff *skb, struct net_bridge *br,
diff --git a/net/core/dev.c b/net/core/dev.c
index dcf9ff913925..7db9b012dfb7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3982,13 +3982,13 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
  *	NET_RX_SUCCESS: no congestion
  *	NET_RX_DROP: packet was dropped
  */
-int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb)
+int netif_receive_skb(struct sk_buff *skb)
 {
 	trace_netif_receive_skb_entry(skb);
 
 	return netif_receive_skb_internal(skb);
 }
-EXPORT_SYMBOL(netif_receive_skb_sk);
+EXPORT_SYMBOL(netif_receive_skb);
 
 /* Network device is going away, flush any packets still pending
  * Called with irqs disabled.
-- 
2.2.1
[PATCH next 18/30] bridge: Introduce br_send_bpdu_finish
The function dev_queue_xmit_sk is unnecessary and very confusing.
Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_sk,
and have br_send_bpdu_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman"
---
 net/bridge/br_stp_bpdu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_stp_bpdu.c b/net/bridge/br_stp_bpdu.c
index 534fc4cd263e..3017a396cdef 100644
--- a/net/bridge/br_stp_bpdu.c
+++ b/net/bridge/br_stp_bpdu.c
@@ -30,6 +30,11 @@
 
 #define LLC_RESERVE sizeof(struct llc_pdu_un)
 
+static int br_send_bpdu_finish(struct sock *sk, struct sk_buff *skb)
+{
+	return dev_queue_xmit(skb);
+}
+
 static void br_send_bpdu(struct net_bridge_port *p,
 			 const unsigned char *data, int length)
 {
@@ -56,7 +61,7 @@ static void br_send_bpdu(struct net_bridge_port *p,
 
 	NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_OUT, NULL, skb,
 		NULL, skb->dev,
-		dev_queue_xmit_sk);
+		br_send_bpdu_finish);
 }
 
 static inline void br_set_ticks(unsigned char *dest, int j)
-- 
2.2.1
[PATCH next 13/30] ipv4: Only compute net once in ip_finish_output2
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 095754c99061..fc550e97daac 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -177,14 +177,15 @@ static int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
 	struct dst_entry *dst = skb_dst(skb);
 	struct rtable *rt = (struct rtable *)dst;
 	struct net_device *dev = dst->dev;
+	struct net *net = dev_net(dev);
 	unsigned int hh_len = LL_RESERVED_SPACE(dev);
 	struct neighbour *neigh;
 	u32 nexthop;
 
 	if (rt->rt_type == RTN_MULTICAST) {
-		IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTMCAST, skb->len);
+		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST)
-		IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTBCAST, skb->len);
+		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-- 
2.2.1
[PATCH next 22/30] ipv6: Cache net in ip6_output
Keep net in a local variable so I can use it in NF_HOOK_COND
when I pass struct net to all of the netfilter hooks.

Signed-off-by: "Eric W. Biederman"
---
 net/ipv6/ip6_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 12d0166a64cd..8cab909b181e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -135,9 +135,9 @@ int ip6_output(struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+	struct net *net = dev_net(dev);
 	if (unlikely(idev->cnf.disable_ipv6)) {
-		IP6_INC_STATS(dev_net(dev), idev,
-			      IPSTATS_MIB_OUTDISCARDS);
+		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 		kfree_skb(skb);
 		return 0;
 	}
-- 
2.2.1
[PATCH next 17/30] arp: Introduce arp_xmit_finish
The function dev_queue_xmit_sk is unnecessary and very confusing.
Introduce arp_xmit_finish to remove the need for dev_queue_xmit_sk,
and have arp_xmit_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/arp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 30409b75e925..3632e98eb0f9 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -621,6 +621,11 @@ out:
 }
 EXPORT_SYMBOL(arp_create);
 
+static int arp_xmit_finish(struct sock *sk, struct sk_buff *skb)
+{
+	return dev_queue_xmit(skb);
+}
+
 /*
  *	Send an arp packet.
  */
@@ -628,7 +633,7 @@ void arp_xmit(struct sk_buff *skb)
 {
 	/* Send it off, maybe filter it using firewalling first. */
 	NF_HOOK(NFPROTO_ARP, NF_ARP_OUT, NULL, skb,
-		NULL, skb->dev, dev_queue_xmit_sk);
+		NULL, skb->dev, arp_xmit_finish);
 }
 EXPORT_SYMBOL(arp_xmit);
 
-- 
2.2.1
[PATCH next 15/30] ipv4: Only compute net once in ipmr_forward_finish
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ipmr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 075bc695ae34..dfe4e8ec6c3a 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1681,9 +1681,10 @@ static void ip_encap(struct net *net, struct sk_buff *skb,
 static inline int ipmr_forward_finish(struct sock *sk, struct sk_buff *skb)
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
+	struct net *net = dev_net(skb_dst(skb)->dev);
 
-	IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS_BH(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS_BH(net, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
-- 
2.2.1
[PATCH next 29/30] netfilter: Use nf_hook_state.net
Instead of saying "net = dev_net(state->in?state->in:state->out)" just
say "state->net", as that information is now available, much less
confusing and much less error prone.

Signed-off-by: "Eric W. Biederman"
---
 net/bridge/netfilter/ebtable_filter.c          | 4 ++--
 net/bridge/netfilter/ebtable_nat.c             | 4 ++--
 net/ipv4/netfilter/arptable_filter.c           | 4 +---
 net/ipv4/netfilter/ip_tables.c                 | 8
 net/ipv4/netfilter/ipt_CLUSTERIP.c             | 2 +-
 net/ipv4/netfilter/ipt_SYNPROXY.c              | 2 +-
 net/ipv4/netfilter/iptable_filter.c            | 6 ++
 net/ipv4/netfilter/iptable_mangle.c            | 7 +++
 net/ipv4/netfilter/iptable_nat.c               | 5 ++---
 net/ipv4/netfilter/iptable_raw.c               | 6 ++
 net/ipv4/netfilter/iptable_security.c          | 5 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 4 ++--
 net/ipv6/netfilter/ip6_tables.c                | 8
 net/ipv6/netfilter/ip6t_SYNPROXY.c             | 2 +-
 net/ipv6/netfilter/ip6table_filter.c           | 5 ++---
 net/ipv6/netfilter/ip6table_mangle.c           | 6 +++---
 net/ipv6/netfilter/ip6table_nat.c              | 5 ++---
 net/ipv6/netfilter/ip6table_raw.c              | 5 ++---
 net/ipv6/netfilter/ip6table_security.c         | 4 +---
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 4 ++--
 net/netfilter/nfnetlink_queue_core.c           | 3 +--
 21 files changed, 41 insertions(+), 58 deletions(-)

diff --git a/net/bridge/netfilter/ebtable_filter.c b/net/bridge/netfilter/ebtable_filter.c
index 8a3f63b2e807..ab20d6ed6e2f 100644
--- a/net/bridge/netfilter/ebtable_filter.c
+++ b/net/bridge/netfilter/ebtable_filter.c
@@ -61,7 +61,7 @@ ebt_in_hook(const struct nf_hook_ops *ops, struct sk_buff *skb,
 	    const struct nf_hook_state *state)
 {
 	return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-			    dev_net(state->in)->xt.frame_filter);
+			    state->net->xt.frame_filter);
 }
 
 static unsigned int
@@ -69,7 +69,7 @@ ebt_out_hook(const struct nf_hook_ops *ops, struct sk_buff *skb,
 	     const struct nf_hook_state *state)
 {
 	return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-			    dev_net(state->out)->xt.frame_filter);
+			    state->net->xt.frame_filter);
 }
 
 static struct nf_hook_ops ebt_ops_filter[] __read_mostly = {
diff --git a/net/bridge/netfilter/ebtable_nat.c b/net/bridge/netfilter/ebtable_nat.c
index c5ef5b1ab678..ad81a5a65644 100644
--- a/net/bridge/netfilter/ebtable_nat.c
+++ b/net/bridge/netfilter/ebtable_nat.c
@@ -61,7 +61,7 @@ ebt_nat_in(const struct nf_hook_ops *ops, struct sk_buff *skb,
 	   const struct nf_hook_state *state)
 {
 	return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-			    dev_net(state->in)->xt.frame_nat);
+			    state->net->xt.frame_nat);
 }
 
 static unsigned int
@@ -69,7 +69,7 @@ ebt_nat_out(const struct nf_hook_ops *ops, struct sk_buff *skb,
 	    const struct nf_hook_state *state)
 {
 	return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-			    dev_net(state->out)->xt.frame_nat);
+			    state->net->xt.frame_nat);
 }
 
 static struct nf_hook_ops ebt_ops_nat[] __read_mostly = {
diff --git a/net/ipv4/netfilter/arptable_filter.c b/net/ipv4/netfilter/arptable_filter.c
index 93876d03120c..d217e4c19645 100644
--- a/net/ipv4/netfilter/arptable_filter.c
+++ b/net/ipv4/netfilter/arptable_filter.c
@@ -30,10 +30,8 @@ static unsigned int
 arptable_filter_hook(const struct nf_hook_ops *ops, struct sk_buff *skb,
 		     const struct nf_hook_state *state)
 {
-	const struct net *net = dev_net(state->in ? state->in : state->out);
-
 	return arpt_do_table(skb, ops->hooknum, state,
-			     net->ipv4.arptable_filter);
+			     state->net->ipv4.arptable_filter);
 }
 
 static struct nf_hook_ops *arpfilter_ops __read_mostly;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index b0a86e73451c..5d514eac4c31 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -246,7 +246,8 @@ get_chainname_rulenum(const struct ipt_entry *s, const struct ipt_entry *e,
 	return 0;
 }
 
-static void trace_packet(const struct sk_buff *skb,
+static void trace_packet(struct net *net,
+			 const struct sk_buff *skb,
 			 unsigned int hook,
 			 const struct net_device *in,
 			 const struct net_device *out,
@@ -258,7 +259,6 @@ static void trace_packet(const struct sk_buff *skb,
 	const char *hookname, *chainname, *comment;
 	const struct ipt_entry *iter;
 	unsigned int rulenum = 0;
-	struct net *net = dev_net(in ? in : out);
 
 	root = get_entry(private->entries, pr
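The change in this patch can be illustrated with a small user-space model. The struct layouts below are simplified stand-ins that keep only the fields named in the diffs, and `hook_net_old`, `hook_net_new` and `demo_old_and_new_agree` are hypothetical names introduced for the sketch.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
struct net { int id; };
struct net_device { struct net *nd_net; };
struct nf_hook_state {
	struct net_device *in;
	struct net_device *out;
	struct net *net;	/* recorded once when the state is set up */
};

static struct net *dev_net(const struct net_device *dev)
{
	return dev->nd_net;
}

/* Old pattern: each hook guesses the namespace from whichever device is set. */
static struct net *hook_net_old(const struct nf_hook_state *state)
{
	return dev_net(state->in ? state->in : state->out);
}

/* New pattern: the namespace was already stored in the hook state. */
static struct net *hook_net_new(const struct nf_hook_state *state)
{
	return state->net;
}

/* Returns 1 when both patterns yield the same namespace for an output hook. */
static int demo_old_and_new_agree(void)
{
	struct net ns = { 1 };
	struct net_device eth0 = { &ns };
	/* Output-only hook: in is NULL, so the old code falls back to out. */
	struct nf_hook_state state = { NULL, &eth0, &ns };

	return hook_net_old(&state) == &ns && hook_net_new(&state) == &ns;
}
```

In this toy case both forms return the same pointer; the new form simply removes the per-hook decision about which device to consult.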
[PATCH next 14/30] ipv4: Only compute net once in ip_rcv_finish
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_input.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index ff908863f22e..cc242b9501d9 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -314,6 +314,7 @@ EXPORT_SYMBOL(sysctl_ip_early_demux);
 static int ip_rcv_finish(struct sock *sk, struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
+	struct net *net = dev_net(skb->dev);
 	struct rtable *rt;
 
 	if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) {
@@ -337,8 +338,7 @@ static int ip_rcv_finish(struct sock *sk, struct sk_buff *skb)
 					       iph->tos, skb->dev);
 		if (unlikely(err)) {
 			if (err == -EXDEV)
-				NET_INC_STATS_BH(dev_net(skb->dev),
-						 LINUX_MIB_IPRPFILTER);
+				NET_INC_STATS_BH(net, LINUX_MIB_IPRPFILTER);
 			goto drop;
 		}
 	}
@@ -359,11 +359,9 @@ static int ip_rcv_finish(struct sock *sk, struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 	if (rt->rt_type == RTN_MULTICAST) {
-		IP_UPD_PO_STATS_BH(dev_net(rt->dst.dev), IPSTATS_MIB_INMCAST,
-				skb->len);
+		IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_INMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST)
-		IP_UPD_PO_STATS_BH(dev_net(rt->dst.dev), IPSTATS_MIB_INBCAST,
-				skb->len);
+		IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_INBCAST, skb->len);
 
 	return dst_input(skb);
 
-- 
2.2.1
[PATCH next 12/30] ipv4: Explicitly compute net in ip_fragment
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_output.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 85b72d450184..095754c99061 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -500,10 +500,9 @@ static int ip_fragment(struct sock *sk, struct sk_buff *skb,
 	if (unlikely(!skb->ignore_df ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		struct rtable *rt = skb_rtable(skb);
-		struct net_device *dev = rt->dst.dev;
+		struct net *net = dev_net(skb_rtable(skb)->dst.dev);
 
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		kfree_skb(skb);
-- 
2.2.1
[PATCH next 08/30] ipv4: Compute net once in ip_rcv
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_input.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index f4fc8a77aaa7..ff908863f22e 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -378,6 +378,7 @@ drop:
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
 {
 	const struct iphdr *iph;
+	struct net *net;
 	u32 len;
 
 	/* When the interface is in promisc. mode, drop all the crap
@@ -387,11 +388,12 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
 
 
-	IP_UPD_PO_STATS_BH(dev_net(dev), IPSTATS_MIB_IN, skb->len);
+	net = dev_net(dev);
+	IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (!skb) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+		IP_INC_STATS_BH(net, IPSTATS_MIB_INDISCARDS);
 		goto out;
 	}
 
@@ -417,7 +419,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	BUILD_BUG_ON(IPSTATS_MIB_ECT1PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_1);
 	BUILD_BUG_ON(IPSTATS_MIB_ECT0PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_0);
 	BUILD_BUG_ON(IPSTATS_MIB_CEPKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_CE);
-	IP_ADD_STATS_BH(dev_net(dev),
+	IP_ADD_STATS_BH(net,
 			IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
 			max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
@@ -431,7 +433,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
+		IP_INC_STATS_BH(net, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
@@ -441,7 +443,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 * Note this now means skb->len holds ntohs(iph->tot_len).
 	 */
 	if (pskb_trim_rcsum(skb, len)) {
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+		IP_INC_STATS_BH(net, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -458,9 +460,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 
 csum_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_CSUMERRORS);
+	IP_INC_STATS_BH(net, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+	IP_INC_STATS_BH(net, IPSTATS_MIB_INHDRERRORS);
 drop:
 	kfree_skb(skb);
 out:
-- 
2.2.1
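The "compute net once" pattern used in this and the neighbouring patches can be sketched in user space. Everything here is a toy model: the counters, `rcv_before`, `rcv_after` and `lookups_for` are hypothetical, and the call counter exists only to make the difference visible. In the kernel the lookup itself is cheap, so the value of the change is mainly that `net` becomes a named local that later patches can pass on.

```c
/* Toy stand-ins; not the kernel definitions. */
struct net { unsigned long in_pkts, in_discards; };
struct net_device { struct net *nd_net; };

static int dev_net_calls;	/* instrumentation for the sketch only */

static struct net *dev_net(const struct net_device *dev)
{
	dev_net_calls++;
	return dev->nd_net;
}

/* Before: every statistics update re-derives the namespace. */
static void rcv_before(struct net_device *dev, int drop)
{
	dev_net(dev)->in_pkts++;
	if (drop)
		dev_net(dev)->in_discards++;
}

/* After: derive the namespace once and reuse the local variable. */
static void rcv_after(struct net_device *dev, int drop)
{
	struct net *net = dev_net(dev);

	net->in_pkts++;
	if (drop)
		net->in_discards++;
}

/* Returns how many namespace lookups one dropped packet causes. */
static int lookups_for(void (*rcv)(struct net_device *, int))
{
	struct net ns = { 0, 0 };
	struct net_device dev = { &ns };

	dev_net_calls = 0;
	rcv(&dev, 1);
	return dev_net_calls;
}
```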
[PATCH next 03/30] netfilter: Pass net to nf_hook_thresh
Signed-off-by: "Eric W. Biederman"
---
 include/linux/netfilter.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 889ac0e11f01..042148dc1e22 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -170,6 +170,7 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state);
  * value indicates the packet has been consumed by the hook.
  */
 static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
+				 struct net *net,
 				 struct sock *sk,
 				 struct sk_buff *skb,
 				 struct net_device *indev,
@@ -177,7 +178,6 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 				 int (*okfn)(struct sock *, struct sk_buff *),
 				 int thresh)
 {
-	struct net *net = dev_net(indev ? indev : outdev);
 	struct list_head *hook_list = &net->nf.hooks[pf][hook];
 
 	if (nf_hook_list_active(hook_list, pf, hook)) {
@@ -195,7 +195,8 @@ static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sock *sk,
 			  struct net_device *outdev,
 			  int (*okfn)(struct sock *, struct sk_buff *))
 {
-	return nf_hook_thresh(pf, hook, sk, skb, indev, outdev, okfn, INT_MIN);
+	struct net *net = dev_net(indev ? indev : outdev);
+	return nf_hook_thresh(pf, hook, net, sk, skb, indev, outdev, okfn, INT_MIN);
 }
 
 /* Activate hook; either okfn or kfree_skb called, unless a hook
@@ -221,7 +222,8 @@ NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sock *sk,
 	       struct net_device *out,
 	       int (*okfn)(struct sock *, struct sk_buff *), int thresh)
 {
-	int ret = nf_hook_thresh(pf, hook, sk, skb, in, out, okfn, thresh);
+	struct net *net = dev_net(in ? in : out);
+	int ret = nf_hook_thresh(pf, hook, net, sk, skb, in, out, okfn, thresh);
 	if (ret == 1)
 		ret = okfn(sk, skb);
 	return ret;
@@ -232,10 +234,11 @@ NF_HOOK_COND(uint8_t pf, unsigned int hook, struct sock *sk,
 	     struct sk_buff *skb, struct net_device *in, struct net_device *out,
 	     int (*okfn)(struct sock *, struct sk_buff *), bool cond)
 {
+	struct net *net = dev_net(in ? in : out);
 	int ret;
 
 	if (!cond ||
-	    ((ret = nf_hook_thresh(pf, hook, sk, skb, in, out, okfn, INT_MIN)) == 1))
+	    ((ret = nf_hook_thresh(pf, hook, net, sk, skb, in, out, okfn, INT_MIN)) == 1))
 		ret = okfn(sk, skb);
 	return ret;
 }
-- 
2.2.1
[PATCH next 02/30] netfilter: Store net in nf_hook_state
Signed-off-by: "Eric W. Biederman"
---
 include/linux/netfilter.h         | 5 -
 include/linux/netfilter_ingress.h | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 1abac85ec907..889ac0e11f01 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -54,6 +54,7 @@ struct nf_hook_state {
 	struct net_device *in;
 	struct net_device *out;
 	struct sock *sk;
+	struct net *net;
 	struct list_head *hook_list;
 	int (*okfn)(struct sock *, struct sk_buff *);
 };
@@ -65,6 +66,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
 				      struct net_device *indev,
 				      struct net_device *outdev,
 				      struct sock *sk,
+				      struct net *net,
 				      int (*okfn)(struct sock *, struct sk_buff *))
 {
 	p->hook = hook;
@@ -73,6 +75,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
 	p->in = indev;
 	p->out = outdev;
 	p->sk = sk;
+	p->net = net;
 	p->hook_list = hook_list;
 	p->okfn = okfn;
 }
@@ -181,7 +184,7 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 		struct nf_hook_state state;
 
 		nf_hook_state_init(&state, hook_list, hook, thresh,
-				   pf, indev, outdev, sk, okfn);
+				   pf, indev, outdev, sk, net, okfn);
 		return nf_hook_slow(skb, &state);
 	}
 	return 1;
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index cb0727fe2b3d..187feabe557c 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -17,7 +17,7 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 
 	nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
 			   NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV, NULL,
-			   skb->dev, NULL, NULL);
+			   skb->dev, NULL, dev_net(skb->dev), NULL);
 	return nf_hook_slow(skb, &state);
 }
 
-- 
2.2.1
[PATCH next 05/30] net: Merge dst_output and dst_output_sk
Add a sock parameter to dst_output, making dst_output_sk superfluous.
Pass skb->sk as that argument in all of the existing callers of dst_output.
Have the callers of dst_output_sk call dst_output.

Signed-off-by: "Eric W. Biederman"
---
 include/net/dst.h               | 6 +-
 net/decnet/dn_nsp_out.c         | 4 ++--
 net/ipv4/ip_forward.c           | 2 +-
 net/ipv4/ip_output.c            | 6 +++---
 net/ipv4/ip_vti.c               | 2 +-
 net/ipv4/ipmr.c                 | 2 +-
 net/ipv4/raw.c                  | 2 +-
 net/ipv4/xfrm4_output.c         | 2 +-
 net/ipv6/ip6_output.c           | 4 ++--
 net/ipv6/ip6_vti.c              | 2 +-
 net/ipv6/ip6mr.c                | 2 +-
 net/ipv6/mcast.c                | 4 ++--
 net/ipv6/ndisc.c                | 2 +-
 net/ipv6/output_core.c          | 4 ++--
 net/ipv6/raw.c                  | 2 +-
 net/ipv6/xfrm6_output.c         | 2 +-
 net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
 net/xfrm/xfrm_output.c          | 2 +-
 net/xfrm/xfrm_policy.c          | 2 +-
 19 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 9261d928303d..c72e58474e52 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -454,14 +454,10 @@ static inline void dst_set_expires(struct dst_entry *dst, int timeout)
 }
 
 /* Output packet to network from transport.  */
-static inline int dst_output_sk(struct sock *sk, struct sk_buff *skb)
+static inline int dst_output(struct sock *sk, struct sk_buff *skb)
 {
 	return skb_dst(skb)->output(sk, skb);
 }
-static inline int dst_output(struct sk_buff *skb)
-{
-	return dst_output_sk(skb->sk, skb);
-}
 
 /* Input packet from network to transport.  */
 static inline int dst_input(struct sk_buff *skb)
diff --git a/net/decnet/dn_nsp_out.c b/net/decnet/dn_nsp_out.c
index 1aaa51ebbda6..4b02dd300f50 100644
--- a/net/decnet/dn_nsp_out.c
+++ b/net/decnet/dn_nsp_out.c
@@ -85,7 +85,7 @@ static void dn_nsp_send(struct sk_buff *skb)
 	if (dst) {
 try_again:
 		skb_dst_set(skb, dst);
-		dst_output(skb);
+		dst_output(skb->sk, skb);
 		return;
 	}
 
@@ -582,7 +582,7 @@ static __inline__ void dn_nsp_do_disc(struct sock *sk, unsigned char msgflg,
 	 * associations.
 	 */
 	skb_dst_set(skb, dst_clone(dst));
-	dst_output(skb);
+	dst_output(skb->sk, skb);
 }
 
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 2d3aa408fbdc..28fb90108f56 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -72,7 +72,7 @@ static int ip_forward_finish(struct sock *sk, struct sk_buff *skb)
 		ip_forward_options(skb);
 
 	skb_sender_cpu_clear(skb);
-	return dst_output_sk(sk, skb);
+	return dst_output(sk, skb);
 }
 
 int ip_forward(struct sk_buff *skb)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0138fada0951..f076f11aa94a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -102,7 +102,7 @@ static int __ip_local_out_sk(struct sock *sk, struct sk_buff *skb)
 	iph->tot_len = htons(skb->len);
 	ip_send_check(iph);
 	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, sk, skb, NULL,
-		       skb_dst(skb)->dev, dst_output_sk);
+		       skb_dst(skb)->dev, dst_output);
 }
 
 int __ip_local_out(struct sk_buff *skb)
@@ -116,7 +116,7 @@ int ip_local_out_sk(struct sock *sk, struct sk_buff *skb)
 
 	err = __ip_local_out(skb);
 	if (likely(err == 1))
-		err = dst_output_sk(sk, skb);
+		err = dst_output(sk, skb);
 
 	return err;
 }
@@ -271,7 +271,7 @@ static int ip_finish_output(struct sock *sk, struct sk_buff *skb)
 	/* Policy lookup after SNAT yielded a new policy */
 	if (skb_dst(skb)->xfrm) {
 		IPCB(skb)->flags |= IPSKB_REROUTED;
-		return dst_output_sk(sk, skb);
+		return dst_output(sk, skb);
 	}
 #endif
 	mtu = ip_skb_dst_mtu(skb);
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 0c152087ca15..3b87ec5178f9 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -197,7 +197,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct net_device *dev,
 	skb_dst_set(skb, dst);
 	skb->dev = skb_dst(skb)->dev;
 
-	err = dst_output(skb);
+	err = dst_output(skb->sk, skb);
 	if (net_xmit_eval(err) == 0)
 		err = skb->len;
 	iptunnel_xmit_stats(err, &dev->stats, dev->tstats);
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 866ee89f5254..a0a5def920fc 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1688,7 +1688,7 @@ static inline int ipmr_forward_finish(struct sock *sk, struct sk_buff *skb)
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	return dst_output_sk(sk, skb);
+	return dst_output(sk, skb);
 }
 
 /*
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 561cd4b8fc6e..09ab5bb6913a 100644
--- a/net/ipv4/raw.c
+++ b/n
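The merge described in this patch can be modelled in user space as below. The types are reduced stand-ins, and `report_sock` plus the two `demo_` functions are hypothetical helpers added to show which socket the output method receives in each calling style.

```c
/* Reduced stand-ins for the kernel types. */
struct sock { int id; };
struct sk_buff;
struct dst_entry { int (*output)(struct sock *sk, struct sk_buff *skb); };
struct sk_buff { struct sock *sk; struct dst_entry *dst; };

static struct dst_entry *skb_dst(const struct sk_buff *skb)
{
	return skb->dst;
}

/* Single entry point after the merge: the socket is always explicit. */
static int dst_output(struct sock *sk, struct sk_buff *skb)
{
	return skb_dst(skb)->output(sk, skb);
}

/* Toy output method: reports which socket it was handed (-1 for none). */
static int report_sock(struct sock *sk, struct sk_buff *skb)
{
	(void)skb;
	return sk ? sk->id : -1;
}

/* Former dst_output(skb) callers now spell out skb->sk themselves. */
static int demo_former_one_arg_caller(void)
{
	struct sock owner = { 7 };
	struct dst_entry dst = { report_sock };
	struct sk_buff skb = { &owner, &dst };

	return dst_output(skb.sk, &skb);
}

/* Former dst_output_sk(sk, skb) callers pass the socket they were given. */
static int demo_former_sk_caller(void)
{
	struct sock owner = { 7 }, other = { 9 };
	struct dst_entry dst = { report_sock };
	struct sk_buff skb = { &owner, &dst };

	return dst_output(&other, &skb);
}
```

Behaviour is unchanged for both groups of callers in this model; only the spelling at the call site differs.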
[PATCH next 09/30] ipv4: Remember the net in ip_output and ip_mc_output
This is a preparatory patch to passing net into the netfilter hooks,
where net will be used again.

Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_output.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index f076f11aa94a..9ee622ad8dfa 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -288,11 +288,12 @@ int ip_mc_output(struct sock *sk, struct sk_buff *skb)
 {
 	struct rtable *rt = skb_rtable(skb);
 	struct net_device *dev = rt->dst.dev;
+	struct net *net = dev_net(dev);
 
 	/*
 	 *	If the indicated interface is up and running, send the packet.
 	 */
-	IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -347,8 +348,9 @@ int ip_mc_output(struct sock *sk, struct sk_buff *skb)
 int ip_output(struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
+	struct net *net = dev_net(dev);
 
-	IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
-- 
2.2.1
[PATCH next 07/30] ipv4: Compute net once in ip_forward_finish
Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_forward.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index ba2f66b3b3f6..95235c813f18 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -63,10 +63,11 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 
 static int ip_forward_finish(struct sock *sk, struct sk_buff *skb)
 {
+	struct net *net = dev_net(skb_dst(skb)->dev);
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS_BH(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS_BH(net, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
-- 
2.2.1
[PATCH next 06/30] ipv4: Compute net once in ip_forward
Compute struct net from the input device in ip_forward before it is used.

Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ip_forward.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 28fb90108f56..ba2f66b3b3f6 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -81,6 +81,7 @@ int ip_forward(struct sk_buff *skb)
 	struct iphdr *iph;	/* Our header */
 	struct rtable *rt;	/* Route we use */
 	struct ip_options *opt	= &(IPCB(skb)->opt);
+	struct net *net;
 
 	/* that should never happen */
 	if (skb->pkt_type != PACKET_HOST)
@@ -99,6 +100,7 @@ int ip_forward(struct sk_buff *skb)
 		return NET_RX_SUCCESS;
 
 	skb_forward_csum(skb);
+	net = dev_net(skb->dev);
 
 	/*
 	 *	According to the RFC, we must first decrease the TTL field. If
@@ -119,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
 	if (ip_exceeds_mtu(skb, mtu)) {
-		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		goto drop;
@@ -155,7 +157,7 @@ sr_failed:
 
 too_many_hops:
 	/* Tell the sender its packet died... */
-	IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_INHDRERRORS);
+	IP_INC_STATS_BH(net, IPSTATS_MIB_INHDRERRORS);
 	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 drop:
 	kfree_skb(skb);
-- 
2.2.1
[PATCH next 10/30] ipv4: Don't recompute net in ipmr_queue_xmit
Calling dev_net(dev) here is just silly, as net is already passed in as
a parameter.

Signed-off-by: "Eric W. Biederman"
---
 net/ipv4/ipmr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a0a5def920fc..075bc695ae34 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1745,7 +1745,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * to blackhole.
 		 */
 
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS_BH(net, IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
-- 
2.2.1
[PATCH next 04/30] xfrm: Remove unused afinfo method init_dst
Signed-off-by: "Eric W. Biederman"
---
 include/net/xfrm.h     | 2 --
 net/xfrm/xfrm_policy.c | 2 --
 2 files changed, 4 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 312e3fee9ccf..fd176106909a 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -296,8 +296,6 @@ struct xfrm_policy_afinfo {
 						  struct flowi *fl,
 						  int reverse);
 	int			(*get_tos)(const struct flowi *fl);
-	void			(*init_dst)(struct net *net,
-					    struct xfrm_dst *dst);
 	int			(*init_path)(struct xfrm_dst *path,
 					     struct dst_entry *dst,
 					     int nfheader_len);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 94af3d065785..6b5d6e2b9a49 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1583,8 +1583,6 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
 
 		memset(dst + 1, 0, sizeof(*xdst) - sizeof(*dst));
 		xdst->flo.ops = &xfrm_bundle_fc_ops;
-		if (afinfo->init_dst)
-			afinfo->init_dst(net, xdst);
 	} else
 		xdst = ERR_PTR(-ENOBUFS);
 
-- 
2.2.1
[PATCH next 01/30] netfilter: Remove !CONFIG_NETFILTER definition of nf_hook_thresh
The !CONFIG_NETFILTER definition of nf_hook_thresh calls okfn when the
CONFIG_NETFILTER definition does not, making it buggy. As the
!CONFIG_NETFILTER definition of nf_hook_thresh is not used, remove it
rather than fix it.

Signed-off-by: "Eric W. Biederman"
---
 include/linux/netfilter.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 36a652531791..1abac85ec907 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -344,15 +344,6 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi *fl, u_int8_t family)
 #else /* !CONFIG_NETFILTER */
 #define NF_HOOK(pf, hook, sk, skb, indev, outdev, okfn) (okfn)(sk, skb)
 #define NF_HOOK_COND(pf, hook, sk, skb, indev, outdev, okfn, cond) (okfn)(sk, skb)
-static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
-				 struct sock *sk,
-				 struct sk_buff *skb,
-				 struct net_device *indev,
-				 struct net_device *outdev,
-				 int (*okfn)(struct sock *sk, struct sk_buff *), int thresh)
-{
-	return okfn(sk, skb);
-}
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sock *sk,
 			  struct sk_buff *skb, struct net_device *indev,
 			  struct net_device *outdev,
-- 
2.2.1
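The mismatch described in this commit message can be made concrete with a toy model. The names `thresh_real`, `thresh_stub`, `count_okfn` and `okfn_calls_inside` are hypothetical, and `thresh_real` only models the case where no hook intervenes and the function returns 1 so that its caller invokes okfn.

```c
/* User-space stand-in; not the kernel definition. */
struct sk_buff { int okfn_calls; };

/* Toy okfn that records each invocation on the packet. */
static int count_okfn(struct sk_buff *skb)
{
	skb->okfn_calls++;
	return 0;
}

/* Models the CONFIG_NETFILTER behaviour when nothing consumes the packet:
 * return 1 and leave calling okfn to the caller. */
static int thresh_real(struct sk_buff *skb, int (*okfn)(struct sk_buff *))
{
	(void)skb;
	(void)okfn;
	return 1;
}

/* Models the removed !CONFIG_NETFILTER stub: it calls okfn itself. */
static int thresh_stub(struct sk_buff *skb, int (*okfn)(struct sk_buff *))
{
	return okfn(skb);
}

/* Returns how often okfn ran inside the given nf_hook_thresh variant. */
static int okfn_calls_inside(int (*thresh)(struct sk_buff *,
					   int (*)(struct sk_buff *)))
{
	struct sk_buff skb = { 0 };

	thresh(&skb, count_okfn);
	return skb.okfn_calls;
}
```

A caller written against the first variant expects to run okfn itself after seeing a return value of 1, so the two definitions could not be used interchangeably.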
[PATCH next 0/30] Passing net through the netfilter hooks
My primary goal with this patchset and its follow-ups is to clean up the
network routing paths so that we do not look at the output device to
derive the network namespace. My plan is to pass the network namespace
of the transmitting socket through the output path, to replace code that
looks at the output network device today.

Once that is done we can have routes with output devices outside of the
current network namespace. That should allow reception and transmission
of packets in network namespaces to be as fast as normal packet
reception and transmission with early demux disabled, because it will be
the same code path.

Once skb_dst(skb)->dev is a little better under control I think it will
also be possible to use rcu to clean up the ancient hack that sets
dst->dev to loopback_dev when a network device is removed.

The work to get there is a series of code cleanups. I am starting with
passing net into the netfilter hooks and into the functions that are
called after the netfilter hooks. This removes from netfilter the need
to guess which network namespace it is working on.

To get there I perform a series of minor prep patches so the big changes
at the end are possible to audit without getting lost in the noise. In
particular I have a lot of patches computing net into a local variable
and then using it throughout the function.

So this patchset encompasses:
- Removing dead code.
- Sorting out the _sk functions that were added last time someone pushed
  a prototype change through the post netfilter functions.
- Cleaning up individual functions' use of the network namespace.
- Passing net into the netfilter hooks.
- Passing net into the post netfilter functions.
- Using state->net in the netfilter code where it is available and
  trivially usable.

Pablo, Dave, I don't know whose tree this makes more sense to go
through. I am assuming at least initially Pablo's, as netfilter is
involved.
From what I have seen there will be a lot of back and forth between the
netfilter code paths and the routing code paths.

The patches are also available (against 4.3-rc1) at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next.git master

Eric W. Biederman (30):
  netfilter: Remove !CONFIG_NETFILTER definition of nf_hook_thresh
  netfilter: Store net in nf_hook_state
  netfilter: Pass net to nf_hook_thresh
  xfrm: Remove unused afinfo method init_dst
  net: Merge dst_output and dst_output_sk
  ipv4: Compute net once in ip_forward
  ipv4: Compute net once in ip_forward_finish
  ipv4: Compute net once in ip_rcv
  ipv4: Remember the net in ip_output and ip_mc_output
  ipv4: Don't recompute net in ipmr_queue_xmit
  ipv4: Only compute net once in ip_do_fragment
  ipv4: Explicitly compute net in ip_fragment
  ipv4: Only compute net once in ip_finish_output2
  ipv4: Only compute net once in ip_rcv_finish
  ipv4: Only compute net once in ipmr_forward_finish
  ipv6: Only compute net once in ip6mr_forward2_finish
  arp: Introduce arp_xmit_finish
  bridge: Introduce br_send_bpdu_finish
  net: Remove dev_queue_xmit_sk
  ipv6: Don't recompute net in ip6_rcv
  ipv6: Only compute net once in ip6_finish_output2
  ipv6: Cache net in ip6_output
  ipv6: Compute net once in raw6_send_hdrinc
  bridge: Pass net into br_nf_ip_fragment
  bridge: Pass net into br_nf_push_frag_xmit
  bridge: Cache net in br_nf_pre_routing_finish
  bridge: Add br_netif_receive_skb remove netif_receive_skb_sk
  netfilter: Pass struct net into the netfilter hooks
  netfilter: Use nf_hook_state.net
  netfilter: Pass net into okfn

 drivers/net/vrf.c                     | 9 ++--
 include/linux/netdevice.h             | 14 ++
 include/linux/netfilter.h             | 68 --
 include/linux/netfilter_bridge.h      | 2 +-
 include/linux/netfilter_ingress.h     | 2 +-
 include/net/dn_neigh.h                | 6 +--
 include/net/dst.h                     | 6 +--
 include/net/ipv6.h                    | 2 +-
 include/net/netfilter/br_netfilter.h  | 2 +-
 include/net/xfrm.h                    | 2 -
 net/bridge/br_forward.c               | 16 +++---
 net/bridge/br_input.c                 | 25 ++
 net/bridge/br_multicast.c             | 4 +-
 net/bridge/br_netfilter_hooks.c       | 54 ++--
 net/bridge/br_netfilter_ipv6.c        | 8 +--
 net/bridge/br_private.h               | 6 +--
 net/bridge/br_stp_bpdu.c              | 12 +++--
 net/bridge/netfilter/ebtable_filter.c | 4 +-
 net/bridge/netfilter/ebtable_nat.c    | 4 +-
 net/core/dev.c                        | 12 +++--
 net/decnet/dn_neigh.c                 | 23 +
 net/decnet/dn_nsp_in.c                | 7 +--
 net/decnet/dn_nsp_out.c
[net-next 04/18] fm10k: disable service task during suspend
From: Jacob Keller The service task reads some registers as part of its normal routine, even while the interface is down. Normally this is ok. However, during suspend we have disabled the PCI device. Due to this, registers will read in the same way as a surprise-remove event. Disable the service task while we suspend, and re-enable it after we resume. If we don't do this, the device could be UP when you suspend and come back from resume as closed (since fm10k closes the device when it gets a surprise remove). Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index ce53ff2..8413ab5 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -1983,6 +1983,16 @@ static int fm10k_resume(struct pci_dev *pdev) if (err) return err; + /* assume host is not ready, to prevent race with watchdog in case we +* actually don't have connection to the switch +*/ + interface->host_ready = false; + fm10k_watchdog_host_not_ready(interface); + + /* clear the service task disable bit to allow service task to start */ + clear_bit(__FM10K_SERVICE_DISABLE, &interface->state); + fm10k_service_event_schedule(interface); + /* restore SR-IOV interface */ fm10k_iov_resume(pdev); @@ -2010,6 +2020,15 @@ static int fm10k_suspend(struct pci_dev *pdev, fm10k_iov_suspend(pdev); + /* the watchdog tasks may read registers, which will appear like a +* surprise-remove event once the PCI device is disabled. This will +* cause us to close the netdevice, so we don't retain the open/closed +* state post-resume. Prevent this by disabling the service task while +* suspended, until we actually resume. 
+*/ + set_bit(__FM10K_SERVICE_DISABLE, &interface->state); + cancel_work_sync(&interface->service_task); + rtnl_lock(); if (netif_running(netdev)) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
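The suspend/resume gating described in this patch can be pictured with a small userspace model. This is only a sketch: `intfc_model`, `SVC_DISABLE`, `model_suspend` and `model_resume` are hypothetical names, plain bit operations stand in for the kernel's atomic `set_bit`/`clear_bit`, and a counter stands in for queueing the work item.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag standing in for __FM10K_SERVICE_DISABLE. */
#define SVC_DISABLE (1UL << 0)

struct intfc_model {
	unsigned long state;	/* bit flags, like interface->state */
	int queued;		/* how many times the service task was queued */
};

/* Queue the service task unless it has been disabled. */
static bool service_event_schedule(struct intfc_model *i)
{
	if (i->state & SVC_DISABLE)
		return false;
	i->queued++;
	return true;
}

/* Suspend: block any further scheduling while the device is off. */
static void model_suspend(struct intfc_model *i)
{
	i->state |= SVC_DISABLE;
}

/* Resume: re-enable the task and kick it once. */
static void model_resume(struct intfc_model *i)
{
	assert(i->state & SVC_DISABLE);	/* only valid after a suspend */
	i->state &= ~SVC_DISABLE;
	service_event_schedule(i);
}
```

In this model, any schedule request made between `model_suspend()` and `model_resume()` is dropped, which is the property the patch relies on to keep the task away from registers of a disabled PCI device.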
[net-next 05/18] fm10k: only prevent removal of default VID rules
From: Jacob Keller This allows us to correctly add a VLAN even if it matches our default VID. However, we don't want to remove the VID rules once that VLAN is deleted. Correctly remove the stack layers information of the VLAN, but then return to forwarding that VID as untagged frames. If we deleted the VID rules here, we would begin dropping traffic due to VLAN membership violations. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c index 99228bf..818bc8b 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c @@ -775,8 +775,8 @@ static int fm10k_update_vid(struct net_device *netdev, u16 vid, bool set) if (!set) clear_bit(vid, interface->active_vlans); - /* if default VLAN is already present do nothing */ - if (vid == hw->mac.default_vid) + /* Do not remove default VID related entries from VLAN and MAC tables */ + if (!set && vid == hw->mac.default_vid) return 0; fm10k_mbx_lock(interface); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
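The effect of the changed condition is easiest to see by comparing the old and new early-return checks side by side. The helper names below are hypothetical and the functions only model the predicate, not the rest of fm10k_update_vid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old check: any request touching the default VID returned early. */
static bool old_skips_update(uint16_t vid, uint16_t default_vid, bool set)
{
	(void)set;
	return vid == default_vid;
}

/* New check: only a removal of the default VID returns early. */
static bool new_skips_update(uint16_t vid, uint16_t default_vid, bool set)
{
	return !set && vid == default_vid;
}

/* The two checks differ only when the default VID is being added. */
static void check_difference(uint16_t default_vid)
{
	assert(old_skips_update(default_vid, default_vid, true));
	assert(!new_skips_update(default_vid, default_vid, true));
	assert(old_skips_update(default_vid, default_vid, false) ==
	       new_skips_update(default_vid, default_vid, false));
}
```

Adding a VLAN equal to the default VID now falls through to the normal update path, while deleting it still leaves the default VID rules in place.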
[net-next 09/18] fm10k: Report MAC address on driver load
From: Alexander Duyck This change adds the MAC address to the list of values recorded on driver load. The MAC address represents the serial number of the unit and allows us to track the value should a card be replaced in a system. The log message should now be similar in output to that of ixgbe. Signed-off-by: Alexander Duyck Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index db237b7..9f2b2f1 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -1905,6 +1905,9 @@ static int fm10k_probe(struct pci_dev *pdev, /* print warning for non-optimal configurations */ fm10k_slot_warn(interface); + /* report MAC address for logging */ + dev_info(&pdev->dev, "%pM\n", netdev->dev_addr); + /* enable SR-IOV after registering netdev to enforce PF/VF ordering */ fm10k_iov_configure(pdev, 0); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 17/18] fm10k: Only trigger data path reset if fabric is up
From: Alexander Duyck This change makes it so that we only trigger the data path reset if the fabric is ready to handle traffic. The general idea is to avoid triggering the reset unless the switch API is ready for us. Otherwise we can just postpone the reset until we receive a switch ready notification. Signed-off-by: Alexander Duyck Signed-off-by: Jacob Keller Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c index 241b969..d806d87 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c @@ -59,6 +59,11 @@ static s32 fm10k_reset_hw_pf(struct fm10k_hw *hw) if (reg & (FM10K_DMA_CTRL_TX_ACTIVE | FM10K_DMA_CTRL_RX_ACTIVE)) return FM10K_ERR_DMA_PENDING; + /* verify the switch is ready for reset */ + reg = fm10k_read_reg(hw, FM10K_DMA_CTRL2); + if (!(reg & FM10K_DMA_CTRL2_SWITCH_READY)) + goto out; + /* Inititate data path reset */ reg |= FM10K_DMA_CTRL_DATAPATH_RESET; fm10k_write_reg(hw, FM10K_DMA_CTRL, reg); @@ -72,6 +77,7 @@ static s32 fm10k_reset_hw_pf(struct fm10k_hw *hw) if (!(reg & FM10K_IP_NOTINRESET)) err = FM10K_ERR_RESET_FAILED; +out: return err; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 01/18] ixgbe: fix issue with SFP events with new X550 devices
From: Don Skidmore Add checks for systems that don't have SFP's to avoid incorrectly acting on interrupts that are falsely interpreted as SFP events. This also includes a modified check generating the EICR mask to be more forward-looking. Signed-off-by: Don Skidmore Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 63b2cfe..b9267e2 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2495,17 +2495,26 @@ static inline bool ixgbe_is_sfp(struct ixgbe_hw *hw) static void ixgbe_check_sfp_event(struct ixgbe_adapter *adapter, u32 eicr) { struct ixgbe_hw *hw = &adapter->hw; + u32 eicr_mask = IXGBE_EICR_GPI_SDP2(hw); - if (eicr & IXGBE_EICR_GPI_SDP2(hw)) { + if (!ixgbe_is_sfp(hw)) + return; + + /* Later MAC's use different SDP */ + if (hw->mac.type >= ixgbe_mac_X540) + eicr_mask = IXGBE_EICR_GPI_SDP0_X540; + + if (eicr & eicr_mask) { /* Clear the interrupt */ - IXGBE_WRITE_REG(hw, IXGBE_EICR, IXGBE_EICR_GPI_SDP2(hw)); + IXGBE_WRITE_REG(hw, IXGBE_EICR, eicr_mask); if (!test_bit(__IXGBE_DOWN, &adapter->state)) { adapter->flags2 |= IXGBE_FLAG2_SFP_NEEDS_RESET; ixgbe_service_event_schedule(adapter); } } - if (eicr & IXGBE_EICR_GPI_SDP1(hw)) { + if (adapter->hw.mac.type == ixgbe_mac_82599EB && + (eicr & IXGBE_EICR_GPI_SDP1(hw))) { /* Clear the interrupt */ IXGBE_WRITE_REG(hw, IXGBE_EICR, IXGBE_EICR_GPI_SDP1(hw)); if (!test_bit(__IXGBE_DOWN, &adapter->state)) { -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 11/18] fm10k: don't store sw_vid at reset
From: Jacob Keller If we store the sw_vid at reset of PF, then we accidentally prevent the VF from receiving the message to update its default VID. This only occurs if the VF is created before the PF has come up, which is the standard way of creating VFs when using the module parameter. This fixes an issue where we request the incorrect MAC/VLAN combinations, and prevents us from accidentally reporting some frames as VLAN tagged. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c index 94571e6..0e25a80 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c @@ -228,9 +228,6 @@ int fm10k_iov_resume(struct pci_dev *pdev) hw->iov.ops.set_lport(hw, vf_info, i, FM10K_VF_FLAG_MULTI_CAPABLE); - /* assign our default vid to the VF following reset */ - vf_info->sw_vid = hw->mac.default_vid; - /* mailbox is disconnected so we don't send a message */ hw->iov.ops.assign_default_mac_vlan(hw, vf_info); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 10/18] fm10k: allow creation of VLAN interfaces even while down
From: Jacob Keller We re-sync upon going up, so there is little reason to worry about not syncing immediately with switch. This prevents an error that occurs if you add a VLAN interface while down. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c index b2065cb..e1ceb3a 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c @@ -779,6 +779,12 @@ static int fm10k_update_vid(struct net_device *netdev, u16 vid, bool set) if (!set && vid == hw->mac.default_vid) return 0; + /* Do not throw an error if the interface is down. We will sync once +* we come up +*/ + if (test_bit(__FM10K_DOWN, &interface->state)) + return 0; + fm10k_mbx_lock(interface); /* only need to update the VLAN if not in promiscuous mode */ -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 16/18] fm10k: re-enable VF after a full reset on detection of a Malicious event
From: Jacob Keller Modify behavior of Malicious Driver Detection events. Presently, the hardware disables the VF queues and re-assigns them to the PF. This causes the VF in question to continuously Tx hang, because it assumes that it can transmit over the queues in question. For transient events, this results in continuous logging of malicious events. New behavior is to reset the LPORT and VF state, so that the VF will have to reset and re-enable itself. This does mean that malicious VFs will possibly be able to continue and attempt malicious events again. However, it is expected that system administrators will step in and manually remove or disable the VF in question. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 30 ++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index 9bdc04d..3d71c52 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -880,10 +880,12 @@ void fm10k_netpoll(struct net_device *netdev) #endif #define FM10K_ERR_MSG(type) case (type): error = #type; break -static void fm10k_print_fault(struct fm10k_intfc *interface, int type, +static void fm10k_handle_fault(struct fm10k_intfc *interface, int type, struct fm10k_fault *fault) { struct pci_dev *pdev = interface->pdev; + struct fm10k_hw *hw = &interface->hw; + struct fm10k_iov_data *iov_data = interface->iov_data; char *error; switch (type) { @@ -937,6 +939,30 @@ static void fm10k_print_fault(struct fm10k_intfc *interface, int type, "%s Address: 0x%llx SpecInfo: 0x%x Func: %02x.%0x\n", error, fault->address, fault->specinfo, PCI_SLOT(fault->func), PCI_FUNC(fault->func)); + + /* For VF faults, clear out the respective LPORT, reset the queue +* resources, and then reconnect to the mailbox. This allows the +* VF in question to resume behavior. 
For transient faults that are +* the result of non-malicious behavior this will log the fault and +* allow the VF to resume functionality. Obviously for malicious VFs +* they will be able to attempt malicious behavior again. In this +* case, the system administrator will need to step in and manually +* remove or disable the VF in question. +*/ + if (fault->func && iov_data) { + int vf = fault->func - 1; + struct fm10k_vf_info *vf_info = &iov_data->vf_info[vf]; + + hw->iov.ops.reset_lport(hw, vf_info); + hw->iov.ops.reset_resources(hw, vf_info); + + /* reset_lport disables the VF, so re-enable it */ + hw->iov.ops.set_lport(hw, vf_info, vf, + FM10K_VF_FLAG_MULTI_CAPABLE); + + /* reset_resources will disconnect from the mbx */ + vf_info->mbx.ops.connect(hw, &vf_info->mbx); + } } static void fm10k_report_fault(struct fm10k_intfc *interface, u32 eicr) @@ -960,7 +986,7 @@ static void fm10k_report_fault(struct fm10k_intfc *interface, u32 eicr) continue; } - fm10k_print_fault(interface, type, &fault); + fm10k_handle_fault(interface, type, &fault); } } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 06/18] fm10k: update fm10k_slot_warn to use pcie_get_minimum link
From: Jacob Keller This is useful in cases where we connect to a slot at Gen3, but the slot is behind a bus which only connected at Gen2. This generally only happens when a PCIe switch is in the sequence of devices, and can be very confusing when you see slow performance with no obvious cause. I am aware this patch has a few lines that break 80 characters, but there does not seem to be a readable way to format them to less than 80 characters. Suggestions welcome. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 105 +++ 1 file changed, 76 insertions(+), 29 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index 8413ab5..2d87c32 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -1705,22 +1705,86 @@ static int fm10k_sw_init(struct fm10k_intfc *interface, static void fm10k_slot_warn(struct fm10k_intfc *interface) { - struct device *dev = &interface->pdev->dev; + enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN; + enum pci_bus_speed speed = PCI_SPEED_UNKNOWN; struct fm10k_hw *hw = &interface->hw; + int max_gts = 0, expected_gts = 0; - if (hw->mac.ops.is_slot_appropriate(hw)) + if (pcie_get_minimum_link(interface->pdev, &speed, &width) || + speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) { + dev_warn(&interface->pdev->dev, +"Unable to determine PCI Express bandwidth.\n"); return; + } + + switch (speed) { + case PCIE_SPEED_2_5GT: + /* 8b/10b encoding reduces max throughput by 20% */ + max_gts = 2 * width; + break; + case PCIE_SPEED_5_0GT: + /* 8b/10b encoding reduces max throughput by 20% */ + max_gts = 4 * width; + break; + case PCIE_SPEED_8_0GT: + /* 128b/130b encoding has less than 2% impact on throughput */ + max_gts = 8 * width; + break; + default: + dev_warn(&interface->pdev->dev, +"Unable to determine PCI Express bandwidth.\n"); + 
return; + } + + dev_info(&interface->pdev->dev, +"PCI Express bandwidth of %dGT/s available\n", +max_gts); + dev_info(&interface->pdev->dev, +"(Speed:%s, Width: x%d, Encoding Loss:%s, Payload:%s)\n", +(speed == PCIE_SPEED_8_0GT ? "8.0GT/s" : + speed == PCIE_SPEED_5_0GT ? "5.0GT/s" : + speed == PCIE_SPEED_2_5GT ? "2.5GT/s" : + "Unknown"), +hw->bus.width, +(speed == PCIE_SPEED_2_5GT ? "20%" : + speed == PCIE_SPEED_5_0GT ? "20%" : + speed == PCIE_SPEED_8_0GT ? "<2%" : + "Unknown"), +(hw->bus.payload == fm10k_bus_payload_128 ? "128B" : + hw->bus.payload == fm10k_bus_payload_256 ? "256B" : + hw->bus.payload == fm10k_bus_payload_512 ? "512B" : + "Unknown")); - dev_warn(dev, -"For optimal performance, a %s %s slot is recommended.\n", -(hw->bus_caps.width == fm10k_bus_width_pcie_x1 ? "x1" : - hw->bus_caps.width == fm10k_bus_width_pcie_x4 ? "x4" : - "x8"), -(hw->bus_caps.speed == fm10k_bus_speed_2500 ? "2.5GT/s" : - hw->bus_caps.speed == fm10k_bus_speed_5000 ? "5.0GT/s" : - "8.0GT/s")); - dev_warn(dev, -"A slot with more lanes and/or higher speed is suggested.\n"); + switch (hw->bus_caps.speed) { + case fm10k_bus_speed_2500: + /* 8b/10b encoding reduces max throughput by 20% */ + expected_gts = 2 * hw->bus_caps.width; + break; + case fm10k_bus_speed_5000: + /* 8b/10b encoding reduces max throughput by 20% */ + expected_gts = 4 * hw->bus_caps.width; + break; + case fm10k_bus_speed_8000: + /* 128b/130b encoding has less than 2% impact on throughput */ + expected_gts = 8 * hw->bus_caps.width; + break; + default: + dev_warn(&interface->pdev->dev, +"Unable to determine expected PCI Express bandwidth.\n"); + return; + } + + if (max_gts < expected_gts) { + dev_warn(&interface->pdev->dev, +"This device requires %dGT/s of bandwidth for optimal performance.\n", +expected_gts); + dev_warn(&interface->pdev->dev, +"A %sslot with x%d lanes is suggested.\n", +(hw->bus_cap
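The bandwidth comparison in this patch boils down to a per-lane figure multiplied by the lane count. A minimal sketch of that arithmetic follows; `enum link_speed`, `link_gts` and `slot_is_limiting` are hypothetical stand-ins, not the kernel's `pci_bus_speed` values or any fm10k function.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's PCIe speed enumeration. */
enum link_speed { SPEED_2_5GT, SPEED_5_0GT, SPEED_8_0GT, SPEED_UNKNOWN };

/*
 * Usable GT/s for a link, or -1 if the speed is unknown.
 * Uses the same integer multipliers as the patch:
 *   2.5GT/s with 8b/10b encoding (20% loss)     -> 2 per lane
 *   5.0GT/s with 8b/10b encoding (20% loss)     -> 4 per lane
 *   8.0GT/s with 128b/130b encoding (<2% loss)  -> 8 per lane
 */
static int link_gts(enum link_speed speed, int width)
{
	assert(width > 0);
	switch (speed) {
	case SPEED_2_5GT:
		return 2 * width;
	case SPEED_5_0GT:
		return 4 * width;
	case SPEED_8_0GT:
		return 8 * width;
	default:
		return -1;
	}
}

/* Nonzero if the slot offers less than the device expects. */
static int slot_is_limiting(int max_gts, int expected_gts)
{
	return max_gts < expected_gts;
}
```

For example, a device that expects Gen3 x8 (64) but sits behind a Gen2 x8 link (32) would be flagged, which is the PCIe-switch case the commit message describes.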
[net-next 07/18] fm10k: update netdev perm_addr during reinit, instead of at up
From: Jacob Keller Updating the netdev permanent address during fm10k_reinit enables the user to immediately see the new MAC address on the VF even if the device isn't up. The previous code required that the device be opened before changes would appear. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 15 --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c| 15 +++ 2 files changed, 15 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c index 818bc8b..b2065cb 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c @@ -996,21 +996,6 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface) int xcast_mode; u16 vid, glort; - /* restore our address if perm_addr is set */ - if (hw->mac.type == fm10k_mac_vf) { - if (is_valid_ether_addr(hw->mac.perm_addr)) { - ether_addr_copy(hw->mac.addr, hw->mac.perm_addr); - ether_addr_copy(netdev->perm_addr, hw->mac.perm_addr); - ether_addr_copy(netdev->dev_addr, hw->mac.perm_addr); - netdev->addr_assign_type &= ~NET_ADDR_RANDOM; - } - - if (hw->mac.vlan_override) - netdev->features &= ~NETIF_F_HW_VLAN_CTAG_RX; - else - netdev->features |= NETIF_F_HW_VLAN_CTAG_RX; - } - /* record glort for this interface */ glort = interface->glort; diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index 2d87c32..db237b7 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -170,6 +170,21 @@ static void fm10k_reinit(struct fm10k_intfc *interface) /* reassociate interrupts */ fm10k_mbx_request_irq(interface); + /* update hardware address for VFs if perm_addr has changed */ + if (hw->mac.type == fm10k_mac_vf) { + if (is_valid_ether_addr(hw->mac.perm_addr)) { + ether_addr_copy(hw->mac.addr, hw->mac.perm_addr); +
ether_addr_copy(netdev->perm_addr, hw->mac.perm_addr); + ether_addr_copy(netdev->dev_addr, hw->mac.perm_addr); + netdev->addr_assign_type &= ~NET_ADDR_RANDOM; + } + + if (hw->mac.vlan_override) + netdev->features &= ~NETIF_F_HW_VLAN_CTAG_RX; + else + netdev->features |= NETIF_F_HW_VLAN_CTAG_RX; + } + /* reset clock */ fm10k_ts_reset(interface); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 03/18] ixgbe: Limit lowest interrupt rate for adaptive interrupt moderation to 12K
From: Alexander Duyck

This patch updates the lowest limit for adaptive interrupt moderation to roughly 12K interrupts per second.

The way I came about reaching 12K as the desired interrupt rate is by testing with UDP flows. Specifically I had a simple test that ran a netperf UDP_STREAM test at varying sizes. What I found was as the packet sizes increased the performance fell steadily behind until we were only able to receive at ~4Gb/s with a message size of 65507. A bit of digging found that we were dropping packets for the socket in the network stack, and looking at things further what I found was I could solve it by increasing the interrupt rate, or increasing the rmem_default/rmem_max. What I found was that when the interrupt coalescing resulted in more data being processed per interrupt than could be stored in the socket buffer we started losing packets and the performance dropped.

So I reached 12K based on the following math.

rmem_default = 212992
skb->truesize = 2994
212992 / 2994 = 71.14 packets to fill the buffer
packet rate at 1514 packet size is 812744pps
71.14 / 812744 = 87.9us to fill socket buffer

From there it was just a matter of choosing the interrupt rate and providing a bit of wiggle room which is why I decided to go with 12K interrupts per second as that uses a value of 84us.

The data below is based on VM to VM over a direct assigned ixgbe interface. The test run was: netperf -H -t UDP_STREAM

Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB
Before:
212992   65507   60.00     1100662      0       9613.4   10.89    0.557
212992           60.00      473474              4135.4   11.27    0.576
After:
212992   65507   60.00     1100413      0       9611.2   10.73    0.549
212992           60.00      974132              8508.3   11.69    0.598

Using bare metal the data is similar but not as dramatic as the throughput increases from about 8.5Gb/s to 9.5Gb/s.
Signed-off-by: Alexander Duyck Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 3 +-- drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 4 ++-- 4 files changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index edf1fb9..a699c99 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -539,8 +539,7 @@ struct hwmon_buff { #define IXGBE_MIN_RSC_ITR 24 #define IXGBE_100K_ITR 40 #define IXGBE_20K_ITR 200 -#define IXGBE_10K_ITR 400 -#define IXGBE_8K_ITR 500 +#define IXGBE_12K_ITR 336 /* ixgbe_test_staterr - tests bits in Rx descriptor status and error fields */ static inline __le32 ixgbe_test_staterr(union ixgbe_adv_rx_desc *rx_desc, diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c index ab2edc8..94c4912 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c @@ -2286,7 +2286,7 @@ static int ixgbe_set_coalesce(struct net_device *netdev, adapter->tx_itr_setting = ec->tx_coalesce_usecs; if (adapter->tx_itr_setting == 1) - tx_itr_param = IXGBE_10K_ITR; + tx_itr_param = IXGBE_12K_ITR; else tx_itr_param = adapter->tx_itr_setting; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c index 68e1e75..f3168bc 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c @@ -866,7 +866,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter, if (txr_count && !rxr_count) { /* tx only vector */ if (adapter->tx_itr_setting == 1) - q_vector->itr = IXGBE_10K_ITR; + q_vector->itr = IXGBE_12K_ITR; else q_vector->itr = adapter->tx_itr_setting; } else { diff --git 
a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index c04480e..acb1b91 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2261,7 +2261,7 @@ static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector, /* simple throttlerate management * 0-10MB/s lowest (10 ints/s) * 10-20MB/s low(2 ints/s) -* 20-1249MB/s bulk (8000 ints/s) +* 20-1249MB/s bulk (12000 ints/s) */ /* what was last interrupt timeslice? */
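The interrupt-rate arithmetic from the commit message of this patch can be reproduced with a few lines of C. This is a sketch under stated assumptions: the helper names are hypothetical, the 24 bytes of per-frame overhead (preamble, inter-frame gap and FCS) is my reading of where the 812744pps figure comes from, and the relation between IXGBE_12K_ITR (336) and 84us assumes the ITR value is the interval in microseconds multiplied by four.

```c
#include <assert.h>

/* Packets per second at line rate, assuming 24 bytes of overhead per frame. */
static double line_rate_pps(double link_bps, double frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / ((frame_bytes + 24.0) * 8.0);
}

/* Microseconds needed to fill a socket buffer at a given packet rate. */
static double fill_time_us(double rmem_bytes, double truesize, double pps)
{
	assert(truesize > 0 && pps > 0);
	return (rmem_bytes / truesize) / pps * 1e6;
}

/* Interrupts per second for a given interrupt interval in microseconds. */
static double ints_per_sec(double interval_us)
{
	assert(interval_us > 0);
	return 1e6 / interval_us;
}
```

With rmem_default = 212992 and truesize = 2994, fill_time_us() comes out a little under 88us at the 10GbE line rate for 1514-byte packets, so an 84us interval (about 11.9K interrupts per second) leaves the small margin the message mentions.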
[net-next 12/18] fm10k: remove is_slot_appropriate
From: Jacob Keller This function is no longer used now that we have updated fm10k_slot_warn functionality. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 14 -- drivers/net/ethernet/intel/fm10k/fm10k_type.h | 1 - drivers/net/ethernet/intel/fm10k/fm10k_vf.c | 14 -- 3 files changed, 29 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c index 3ca0233..241b969 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c @@ -185,19 +185,6 @@ static s32 fm10k_init_hw_pf(struct fm10k_hw *hw) } /** - * fm10k_is_slot_appropriate_pf - Indicate appropriate slot for this SKU - * @hw: pointer to hardware structure - * - * Looks at the PCIe bus info to confirm whether or not this slot can support - * the necessary bandwidth for this device. - **/ -static bool fm10k_is_slot_appropriate_pf(struct fm10k_hw *hw) -{ - return (hw->bus.speed == hw->bus_caps.speed) && - (hw->bus.width == hw->bus_caps.width); -} - -/** * fm10k_update_vlan_pf - Update status of VLAN ID in VLAN filter table * @hw: pointer to hardware structure * @vid: VLAN ID to add to table @@ -1849,7 +1836,6 @@ static struct fm10k_mac_ops mac_ops_pf = { .init_hw= &fm10k_init_hw_pf, .start_hw = &fm10k_start_hw_generic, .stop_hw= &fm10k_stop_hw_generic, - .is_slot_appropriate= &fm10k_is_slot_appropriate_pf, .update_vlan= &fm10k_update_vlan_pf, .read_mac_addr = &fm10k_read_mac_addr_pf, .update_uc_addr = &fm10k_update_uc_addr_pf, diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_type.h b/drivers/net/ethernet/intel/fm10k/fm10k_type.h index 2a17d82..bac8d48 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_type.h +++ b/drivers/net/ethernet/intel/fm10k/fm10k_type.h @@ -521,7 +521,6 @@ struct fm10k_mac_ops { s32 (*stop_hw)(struct fm10k_hw *); s32 (*get_bus_info)(struct fm10k_hw *); s32 (*get_host_state)(struct fm10k_hw *, bool *); - 
bool (*is_slot_appropriate)(struct fm10k_hw *); s32 (*update_vlan)(struct fm10k_hw *, u32, u8, bool); s32 (*read_mac_addr)(struct fm10k_hw *); s32 (*update_uc_addr)(struct fm10k_hw *, u16, const u8 *, diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_vf.c b/drivers/net/ethernet/intel/fm10k/fm10k_vf.c index 94f0f6a..36c8b0a 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_vf.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_vf.c @@ -131,19 +131,6 @@ static s32 fm10k_init_hw_vf(struct fm10k_hw *hw) return 0; } -/** - * fm10k_is_slot_appropriate_vf - Indicate appropriate slot for this SKU - * @hw: pointer to hardware structure - * - * Looks at the PCIe bus info to confirm whether or not this slot can support - * the necessary bandwidth for this device. Since the VF has no control over - * the "slot" it is in, always indicate that the slot is appropriate. - **/ -static bool fm10k_is_slot_appropriate_vf(struct fm10k_hw *hw) -{ - return true; -} - /* This structure defines the attibutes to be parsed below */ const struct fm10k_tlv_attr fm10k_mac_vlan_msg_attr[] = { FM10K_TLV_ATTR_U32(FM10K_MAC_VLAN_MSG_VLAN), @@ -552,7 +539,6 @@ static struct fm10k_mac_ops mac_ops_vf = { .init_hw= &fm10k_init_hw_vf, .start_hw = &fm10k_start_hw_generic, .stop_hw= &fm10k_stop_hw_vf, - .is_slot_appropriate= &fm10k_is_slot_appropriate_vf, .update_vlan= &fm10k_update_vlan_vf, .read_mac_addr = &fm10k_read_mac_addr_vf, .update_uc_addr = &fm10k_update_uc_addr_vf, -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 02/18] ixgbe: Teardown SR-IOV before unregister_netdev()
From: Alex Williamson When the .remove() callback for a PF is called, SR-IOV support for the device is disabled, which requires unbinding and removing the VFs. The VFs may be in-use either by the host kernel or userspace, such as assigned to a VM through vfio-pci. In this latter case, the VFs may be removed either by shutting down the VM or hot-unplugging the devices from the VM. Unfortunately in the case of a Windows 2012 R2 guest, hot-unplug is broken due to the ordering of the PF driver teardown. Disabling SR-IOV prior to unregister_netdev() avoids this issue. Signed-off-by: Alex Williamson Acked-by: Mitch Williams Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index b9267e2..c04480e 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -9028,12 +9028,12 @@ static void ixgbe_remove(struct pci_dev *pdev) /* remove the added san mac */ ixgbe_del_sanmac_netdev(netdev); - if (netdev->reg_state == NETREG_REGISTERED) - unregister_netdev(netdev); - #ifdef CONFIG_PCI_IOV ixgbe_disable_sriov(adapter); #endif + if (netdev->reg_state == NETREG_REGISTERED) + unregister_netdev(netdev); + ixgbe_clear_interrupt_scheme(adapter); ixgbe_release_hw_control(adapter); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 08/18] fm10k: Don't assume page fragments are page size
From: Alexander Duyck This change pulls out the optimization that assumed that all fragments would be limited to page size. That hasn't been the case for some time now and to assume this is incorrect as the TCP allocator can provide up to a 32K page fragment. Signed-off-by: Alexander Duyck Acked-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index b5b2925..6ad03e0 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -1079,9 +1079,7 @@ netdev_tx_t fm10k_xmit_frame_ring(struct sk_buff *skb, struct fm10k_tx_buffer *first; int tso; u32 tx_flags = 0; -#if PAGE_SIZE > FM10K_MAX_DATA_PER_TXD unsigned short f; -#endif u16 count = TXD_USE_COUNT(skb_headlen(skb)); /* need: 1 descriptor per page * PAGE_SIZE/FM10K_MAX_DATA_PER_TXD, @@ -1089,12 +1087,9 @@ netdev_tx_t fm10k_xmit_frame_ring(struct sk_buff *skb, * + 2 desc gap to keep tail from touching head * otherwise try next time */ -#if PAGE_SIZE > FM10K_MAX_DATA_PER_TXD for (f = 0; f < skb_shinfo(skb)->nr_frags; f++) count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size); -#else - count += skb_shinfo(skb)->nr_frags; -#endif + if (fm10k_maybe_stop_tx(tx_ring, count + 3)) { tx_ring->tx_stats.tx_busy++; return NETDEV_TX_BUSY; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
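Why the removed shortcut undercounts can be shown with a small descriptor-count model. The 16 KiB per-descriptor limit below is an assumption (check FM10K_MAX_DATA_PER_TXD in fm10k.h for the real value), and `txd_use_count` and `count_descriptors` are hypothetical helpers that mirror the ceiling division and the loop kept by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-descriptor data limit (16 KiB); not taken from the patch. */
#define MAX_DATA_PER_TXD (1u << 14)

/* Descriptors needed for one buffer: ceiling division by the limit. */
static unsigned int txd_use_count(unsigned int size)
{
	return (size + MAX_DATA_PER_TXD - 1) / MAX_DATA_PER_TXD;
}

/* Sum descriptors over the head and every fragment, sized individually. */
static unsigned int count_descriptors(unsigned int headlen,
				      const unsigned int *frag_sizes,
				      size_t nr_frags)
{
	unsigned int count = txd_use_count(headlen);
	size_t f;

	assert(frag_sizes || nr_frags == 0);
	for (f = 0; f < nr_frags; f++)
		count += txd_use_count(frag_sizes[f]);
	return count;
}
```

Under that assumed limit, a 32K fragment needs two descriptors, so counting one descriptor per fragment (the removed `count += nr_frags` branch) would come up one short for it.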
[net-next 00/18][pull request] Intel Wired LAN Driver Updates 2015-09-15
This series contains updates to ixgbe and fm10k. Don fixes a ixgbe issue by adding checks for systems that do not have SFP's to avoid incorrectly acting on interrupts that are falsely interpreted as SFP events. Alex Williamson adds a fix for ixgbe to disable SR-IOV prior to unregistering the netdev to avoid issues with guest OS's which do not support hot-unplug or their hot-unplug is broken. Alex Duyck update the lowest limit for adaptive interrupt interrupt moderation to about 12K interrupts per second for ixgbe. This change increases the performance for ixgbe. Also fixed up fm10k to remove the optimization that assumed that all fragments would be limited to page size, since that assumption is incorrect as the TCP allocator can provide up to a 32K page fragment. Updated fm10k to add the MAC address to the list of values recorded on driver load. Fixes fm10k so that we only trigger the data path reset if the fabric is ready to handle traffic to avoid triggering the reset unless the switch API is ready for us. Jacob updates the fm10k driver to disable the service task during suspend and re-enable it after we resume. If we don't do this, the device could be UP when you suspend and come back from resume as DOWN. Also update fm10k to prevent the removal of default VID rules, and correctly remove the stack layers information of the VLAN, but then return to forwarding that VID as untagged frames. If we deleted the VID rules here, we would begin dropping traffic due to VLAN membership violations. Fixed fm10k to use pcie_get_minimum_link(), which is useful in cases where we connect to a slot at Gen3, but the slot is behind a bus which is only connected at Gen2. Updated fm10k to update the netdev permanent address during reinit instead of up to enable users to immediately see the new MAC address on the VF even if the device is not up. Adds the creation of VLAN interfaces on a device, even while the device is down for fm10k. 
Fixed an issue where we request the incorrect MAC/VLAN combinations, and prevented us from accidentally reporting some frames as VLAN tagged. Provided a couple of trivial fixes for fm10k to fix code style and typos in code comments.

The following are changes since commit ad1e7b97b3adb91d46f3adb70a7867a50fc274cf:
  cdc: Fix build warning.
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Alex Williamson (1):
  ixgbe: Teardown SR-IOV before unregister_netdev()

Alexander Duyck (4):
  ixgbe: Limit lowest interrupt rate for adaptive interrupt moderation to 12K
  fm10k: Don't assume page fragments are page size
  fm10k: Report MAC address on driver load
  fm10k: Only trigger data path reset if fabric is up

Don Skidmore (1):
  ixgbe: fix issue with SFP events with new X550 devices

Jacob Keller (12):
  fm10k: disable service task during suspend
  fm10k: only prevent removal of default VID rules
  fm10k: update fm10k_slot_warn to use pcie_get_minimum link
  fm10k: update netdev perm_addr during reinit, instead of at up
  fm10k: allow creation of VLAN interfaces even while down
  fm10k: don't store sw_vid at reset
  fm10k: remove is_slot_appropriate
  fm10k: TRIVIAL fix up ordering of __always_unused and style
  fm10k: send traffic on default VID to VLAN device if we have one
  fm10k: TRIVIAL fix typo in fm10k_netdev.c
  fm10k: re-enable VF after a full reset on detection of a Malicious event
  fm10k: fix iov_msg_mac_vlan_pf VID checks

 drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c |   5 +-
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c     |   3 -
 drivers/net/ethernet/intel/fm10k/fm10k_main.c    |  12 +-
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c  |  39 ++---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c     | 176 +++
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c      | 105 --
 drivers/net/ethernet/intel/fm10k/fm10k_type.h    |   1 -
 drivers/net/ethernet/intel/fm10k/fm10k_vf.c      |  14 --
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c     |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  25 ++--
 12 files changed, 252 insertions(+), 135 deletions(-)

--
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 15/18] fm10k: TRIVIAL fix typo in fm10k_netdev.c
From: Jacob Keller

Signed-off-by: Jacob Keller
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 3a6230b..639263d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1048,7 +1048,7 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
 					   vid, true, 0);
 	}
 
-	/* update xcast mode before syncronizing addresses */
+	/* update xcast mode before synchronizing addresses */
 	hw->mac.ops.update_xcast_mode(hw, glort, xcast_mode);
 
 	/* synchronize all of the addresses */
--
2.4.3
[net-next 14/18] fm10k: send traffic on default VID to VLAN device if we have one
From: Jacob Keller This patch ensures that VLAN traffic on the default VID will go to the corresponding VLAN device if it exists. To do this, mask the rx_ring VID if we have an active VLAN on that VID. For this to work correctly, we need to update fm10k_process_skb_fields to correctly mask off the VLAN_PRIO_MASK bits and compare them separately, otherwise we incorrectly compare the priority bits with the cleared flag. This also happens to fix a related bug where having priority bits set causes us to incorrectly classify traffic. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 5 - drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 12 drivers/net/ethernet/intel/fm10k/fm10k_pci.c| 4 3 files changed, 20 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 6ad03e0..92d4155 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -497,8 +497,11 @@ static unsigned int fm10k_process_skb_fields(struct fm10k_ring *rx_ring, if (rx_desc->w.vlan) { u16 vid = le16_to_cpu(rx_desc->w.vlan); - if (vid != rx_ring->vid) + if ((vid & VLAN_VID_MASK) != rx_ring->vid) __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vid); + else if (vid & VLAN_PRIO_MASK) + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), + vid & VLAN_PRIO_MASK); } fm10k_type_trans(rx_ring, rx_desc, skb); diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c index e1ceb3a..3a6230b 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c @@ -758,6 +758,7 @@ static int fm10k_update_vid(struct net_device *netdev, u16 vid, bool set) struct fm10k_intfc *interface = netdev_priv(netdev); struct fm10k_hw *hw = &interface->hw; s32 err; + int i; /* updates do not apply to VLAN 0 */ if (!vid) 
@@ -775,6 +776,17 @@ static int fm10k_update_vid(struct net_device *netdev, u16 vid, bool set) if (!set) clear_bit(vid, interface->active_vlans); + /* disable the default VID on ring if we have an active VLAN */ + for (i = 0; i < interface->num_rx_queues; i++) { + struct fm10k_ring *rx_ring = interface->rx_ring[i]; + u16 rx_vid = rx_ring->vid & (VLAN_N_VID - 1); + + if (test_bit(rx_vid, interface->active_vlans)) + rx_ring->vid |= FM10K_VLAN_CLEAR; + else + rx_ring->vid &= ~FM10K_VLAN_CLEAR; + } + /* Do not remove default VID related entries from VLAN and MAC tables */ if (!set && vid == hw->mac.default_vid) return 0; diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index 9f2b2f1..9bdc04d 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -678,6 +678,10 @@ static void fm10k_configure_rx_ring(struct fm10k_intfc *interface, /* assign default VLAN to queue */ ring->vid = hw->mac.default_vid; + /* if we have an active VLAN, disable default VID */ + if (test_bit(hw->mac.default_vid, interface->active_vlans)) + ring->vid |= FM10K_VLAN_CLEAR; + /* Map interrupt */ if (ring->q_vector) { rxint = ring->q_vector->v_idx + NON_Q_VECTORS(hw); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
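The receive-side check in the fm10k_main.c hunk above can be modelled in userspace, as a rough sketch only. VLAN_VID_MASK and VLAN_PRIO_MASK use the values from the kernel's if_vlan.h; the value given for FM10K_VLAN_CLEAR (bit 15) and the helper name rx_vlan_tag() are assumptions made for illustration and do not come from the patch.

```c
#include <stdint.h>

#define VLAN_VID_MASK    0x0fffu /* low 12 bits: VLAN ID */
#define VLAN_PRIO_MASK   0xe000u /* top 3 bits: priority */
#define FM10K_VLAN_CLEAR 0x8000u /* assumed value (bit 15), for illustration */

/*
 * Hypothetical helper mirroring the patched branch: returns the tag
 * that would be handed to the stack, or 0 when no tag is applied.
 */
static uint16_t rx_vlan_tag(uint16_t desc_vlan, uint16_t ring_vid)
{
	if (!desc_vlan)
		return 0;

	/* Not the ring's default VID: keep the full tag. */
	if ((desc_vlan & VLAN_VID_MASK) != ring_vid)
		return desc_vlan;

	/* Default VID with priority bits set: keep only the priority. */
	if (desc_vlan & VLAN_PRIO_MASK)
		return (uint16_t)(desc_vlan & VLAN_PRIO_MASK);

	/* Default VID, no priority: deliver untagged. */
	return 0;
}
```

In this model, once FM10K_VLAN_CLEAR is ORed into the ring's vid, the first comparison can no longer match a 12-bit VID, so frames on the default VID keep their tag and can reach the VLAN device.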
[net-next 13/18] fm10k: TRIVIAL fix up ordering of __always_unused and style
From: Jacob Keller

Fix some style issues in debugfs code, and correct ordering of void and
__always_unused. Technically, the order does not matter, but preferred
style is to put the macro between the type and name.

Signed-off-by: Jacob Keller
Tested-by: Krishneil Singh
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c b/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
index f45b4d7..08ecf43 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
@@ -37,7 +37,8 @@ static void *fm10k_dbg_desc_seq_start(struct seq_file *s, loff_t *pos)
 }
 
 static void *fm10k_dbg_desc_seq_next(struct seq_file *s,
-				     void __always_unused *v, loff_t *pos)
+				     void __always_unused *v,
+				     loff_t *pos)
 {
 	struct fm10k_ring *ring = s->private;
 
@@ -45,7 +46,7 @@ static void *fm10k_dbg_desc_seq_next(struct seq_file *s,
 }
 
 static void fm10k_dbg_desc_seq_stop(struct seq_file __always_unused *s,
-				    __always_unused void *v)
+				    void __always_unused *v)
 {
 	/* Do nothing. */
 }
--
2.4.3
[net-next 18/18] fm10k: fix iov_msg_mac_vlan_pf VID checks
From: Jacob Keller The VF will send a message to request multicast addresses with the default VID. In the current code, if the PF has statically assigned a VLAN to a VF, then the VF will not get the multicast addresses. Fix up all of the various VLAN messages to use identical checks (since each check was different). Also use set as a variable, so that it simplifies our check for whether VLAN matches the pf_vid. The new logic will allow set of a VLAN if it is zero, automatically converting to the default VID. Otherwise it will allow setting the PF VID, or any VLAN if PF has not statically assigned a VLAN. This is consistent behavior, and allows VF to request either 0 or the default_vid without silently failing. Note that we need the check for zero since VFs might not get the default VID message in time to actually request non-zero VLANs. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 85 ++--- 1 file changed, 52 insertions(+), 33 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c index d806d87..8c0bdc4 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c @@ -1155,6 +1155,24 @@ s32 fm10k_iov_msg_msix_pf(struct fm10k_hw *hw, u32 **results, } /** + * fm10k_iov_select_vid - Select correct default VID + * @hw: Pointer to hardware structure + * @vid: VID to correct + * + * Will report an error if VID is out of range. For VID = 0, it will return + * either the pf_vid or sw_vid depending on which one is set. + */ +static inline s32 fm10k_iov_select_vid(struct fm10k_vf_info *vf_info, u16 vid) +{ + if (!vid) + return vf_info->pf_vid ? 
vf_info->pf_vid : vf_info->sw_vid; + else if (vf_info->pf_vid && vid != vf_info->pf_vid) + return FM10K_ERR_PARAM; + else + return vid; +} + +/** * fm10k_iov_msg_mac_vlan_pf - Message handler for MAC/VLAN request from VF * @hw: Pointer to hardware structure * @results: Pointer array to message, results[0] is pointer to message @@ -1168,9 +1186,10 @@ s32 fm10k_iov_msg_mac_vlan_pf(struct fm10k_hw *hw, u32 **results, struct fm10k_mbx_info *mbx) { struct fm10k_vf_info *vf_info = (struct fm10k_vf_info *)mbx; - int err = 0; u8 mac[ETH_ALEN]; u32 *result; + int err = 0; + bool set; u16 vlan; u32 vid; @@ -1186,19 +1205,21 @@ s32 fm10k_iov_msg_mac_vlan_pf(struct fm10k_hw *hw, u32 **results, if (err) return err; - /* if VLAN ID is 0, set the default VLAN ID instead of 0 */ - if (!vid || (vid == FM10K_VLAN_CLEAR)) { - if (vf_info->pf_vid) - vid |= vf_info->pf_vid; - else - vid |= vf_info->sw_vid; - } else if (vid != vf_info->pf_vid) { + /* verify upper 16 bits are zero */ + if (vid >> 16) return FM10K_ERR_PARAM; - } + + set = !(vid & FM10K_VLAN_CLEAR); + vid &= ~FM10K_VLAN_CLEAR; + + err = fm10k_iov_select_vid(vf_info, vid); + if (err < 0) + return err; + else + vid = err; /* update VSI info for VF in regards to VLAN table */ - err = hw->mac.ops.update_vlan(hw, vid, vf_info->vsi, - !(vid & FM10K_VLAN_CLEAR)); + err = hw->mac.ops.update_vlan(hw, vid, vf_info->vsi, set); } if (!err && !!results[FM10K_MAC_VLAN_MSG_MAC]) { @@ -1214,19 +1235,18 @@ s32 fm10k_iov_msg_mac_vlan_pf(struct fm10k_hw *hw, u32 **results, memcmp(mac, vf_info->mac, ETH_ALEN)) return FM10K_ERR_PARAM; - /* if VLAN ID is 0, set the default VLAN ID instead of 0 */ - if (!vlan || (vlan == FM10K_VLAN_CLEAR)) { - if (vf_info->pf_vid) - vlan |= vf_info->pf_vid; - else - vlan |= vf_info->sw_vid; - } else if (vf_info->pf_vid) { - return FM10K_ERR_PARAM; - } + set = !(vlan & FM10K_VLAN_CLEAR); + vlan &= ~FM10K_VLAN_CLEAR; + + err = fm10k_iov_select_vid(vf_info, vlan); + if (err < 0) + return err; + else + vlan = err; 
/* notify switch of request for new unicast address */ - err = hw->mac.ops.update_uc_addr(hw, vf_info->glort, mac, vlan, -!(vlan & FM10K_VLAN_CLE
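The selection rule described in the commit message can be checked in isolation with a small userspace sketch. It follows the logic of fm10k_iov_select_vid() as shown in the patch, but takes plain integers instead of struct fm10k_vf_info; ERR_PARAM is a hypothetical stand-in for FM10K_ERR_PARAM, whose real value is not shown in this thread.

```c
#include <stdint.h>

#define ERR_PARAM (-2) /* hypothetical stand-in for FM10K_ERR_PARAM */

/*
 * pf_vid: VLAN statically assigned by the PF, 0 if none
 * sw_vid: default VLAN reported by the switch
 * vid:    VLAN requested by the VF, with the CLEAR flag already masked off
 *
 * Returns the VID to use, or a negative error code.
 */
static int32_t select_vid(uint16_t pf_vid, uint16_t sw_vid, uint16_t vid)
{
	/* A request for VID 0 is converted to the default VID. */
	if (!vid)
		return pf_vid ? pf_vid : sw_vid;

	/* With a PF-assigned VLAN, only that VLAN may be requested. */
	if (pf_vid && vid != pf_vid)
		return ERR_PARAM;

	return vid;
}
```

Because the result is either a non-negative VID or a negative error, a caller can test `err < 0` and otherwise use the value as the VID, which is the pattern the message handlers in the patch follow.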
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, Sep 15, 2015 at 5:03 PM, Eric Dumazet wrote:
> On Tue, 2015-09-15 at 16:45 -0700, Tom Herbert wrote:
>> > +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
>> > +	    skb->l4_hash)
>> > +		return skb->hash;
>> > +
>> >  	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
>> >  	    !bond_flow_dissect(bond, skb, &flow))
>> >  		return bond_eth_hash(skb);
>> >
>> >
>> Ugh, bond_flow_dissect is yet another instance of customized flow
>> dissection! We should really clean this up. I suggest that in cases
>> where we want L4 hash a call to skb_get_hash should suffice. We can
>> create skb_get_l3hash when caller explicitly wants an L3 hash-- this
>> would return skb->hash if it's valid and skb->l4_hash is not set, else
>> call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
>> normal hash over flow keys (don't save result in skb->hash in this
>> case).
>
> This code predates all the change you did recently ;)
>
A more fundamental question is whether we can eliminate some of these
hashing types (I see five of them in if_bonding.h). Is there any
substantial difference between this and IPv4/v6 ECMP routing such that
they shouldn't all have the same path selection modes?

Tom

> BTW, the simple xor weakness is showing up after
> our change favoring even ports at connect() time, for a bonding device
> with 2 or 4 slaves.
>
> (commit 07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5
> "tcp/dccp: try to not exhaust ip_local_port_range in connect()")
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, Sep 15, 2015 at 4:20 PM, Eric Dumazet wrote:
> On Tue, 2015-09-15 at 15:54 -0700, Mahesh Bandewar wrote:
>
>> > +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
>> > +	    skb->l4_hash)
>> if ((ENCAP34 || LAYER34) && l4_hash) may be?
>
> Hmm, traditional BOND_XMIT_POLICY_LAYER34 did not do a full flow dissection
> (tunnel awareness added in commit
> 32819dc1834866cb9547cb75f81af9edd58d33cd bonding: modify the old and add
> new xmit hash policies)
>
> This could radically change some setups and behavior.
>
> BOND_XMIT_POLICY_ENCAP34 looks a better fit to me.
>
Agreed, this will change flow distribution for LAYER34 policy but then
lose out on calculating hash per packet which I think is unnecessary.
This elimination of hash calculation is a good step but I'm feeling
that it's somehow tied to ENCAP policy which is actually orthogonal and
should be applied to LAYER34 also. However if that change in the
behavior for LAYER34 is considered too drastic then I'm perfectly fine
tying it to ENCAP34 policy.
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, 2015-09-15 at 16:45 -0700, Tom Herbert wrote:
> > +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> > +	    skb->l4_hash)
> > +		return skb->hash;
> > +
> >  	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
> >  	    !bond_flow_dissect(bond, skb, &flow))
> >  		return bond_eth_hash(skb);
> >
> >
> Ugh, bond_flow_dissect is yet another instance of customized flow
> dissection! We should really clean this up. I suggest that in cases
> where we want L4 hash a call to skb_get_hash should suffice. We can
> create skb_get_l3hash when caller explicitly wants an L3 hash-- this
> would return skb->hash if it's valid and skb->l4_hash is not set, else
> call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
> normal hash over flow keys (don't save result in skb->hash in this
> case).

This code predates all the change you did recently ;)

BTW, the simple xor weakness is showing up after our change favoring
even ports at connect() time, for a bonding device with 2 or 4 slaves.

(commit 07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5
"tcp/dccp: try to not exhaust ip_local_port_range in connect()")
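The XOR weakness mentioned above can be illustrated with a toy model. This is deliberately not the bonding driver's hash: it XORs only the two ports (ignoring addresses and any folding), picks the slave as hash modulo slave count, and assumes every source port is even. All names are hypothetical.

```c
#include <stdint.h>

/* Toy stand-in for an XOR-based layer3+4 hash: ports only. */
static uint32_t xor_hash(uint16_t sport, uint16_t dport)
{
	return (uint32_t)sport ^ (uint32_t)dport;
}

/*
 * Count how many distinct slaves receive traffic when only even
 * source ports in [lo, hi) are used. Assumes n_slaves <= 32.
 */
static unsigned int slaves_used(unsigned int n_slaves, uint16_t dport,
				uint16_t lo, uint16_t hi)
{
	uint32_t seen = 0; /* bit i set: slave i was selected at least once */
	unsigned int count = 0;
	uint32_t p;
	unsigned int i;

	for (p = lo; p < hi; p++) {
		if (p & 1)
			continue; /* model: connect() only hands out even ports */
		seen |= 1u << (xor_hash((uint16_t)p, dport) % n_slaves);
	}

	for (i = 0; i < n_slaves; i++)
		count += (seen >> i) & 1u;

	return count;
}
```

With a fixed destination port, the low bit of the XOR never changes in this model, so a power-of-two slave count uses only half of the slaves (one of 2, two of 4), while an odd count such as 3 is not affected in the same way.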
Re: [PATCH 25/39] net: core: drop null test before destroy functions
From: Julia Lawall
Date: Sun, 13 Sep 2015 14:15:18 +0200

> Remove unneeded NULL test.
>
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
>
> //
> @@ expression x; @@
> -if (x != NULL) {
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>   x = NULL;
> -}
> //
>
> Signed-off-by: Julia Lawall

Applied.
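The semantic patch only makes sense if the destroy functions themselves tolerate a NULL argument, as the patch description implies. A userspace sketch of that convention follows; struct pool, pool_create() and pool_destroy() are hypothetical names standing in for the kernel objects, and free(NULL) plays the role of the well-known precedent.

```c
#include <stdlib.h>

struct pool {
	void *mem;
};

static int destroy_calls; /* counts real teardowns, for demonstration only */

static struct pool *pool_create(size_t size)
{
	struct pool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;

	p->mem = malloc(size);
	if (!p->mem) {
		free(p);
		return NULL;
	}
	return p;
}

/* Like free(NULL), destroying a NULL pool is a no-op. */
static void pool_destroy(struct pool *p)
{
	if (!p)
		return;

	free(p->mem);
	free(p);
	destroy_calls++;
}
```

With the check inside pool_destroy(), callers can write `pool_destroy(x); x = NULL;` unconditionally, which is the shape the semantic patch rewrites the kernel callers into.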
Re: [PATCH 34/39] dccp: drop null test before destroy functions
From: Julia Lawall
Date: Sun, 13 Sep 2015 14:15:27 +0200

> Remove unneeded NULL test.
>
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
>
> //
> @@
> expression x;
> @@
>
> -if (x != NULL)
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>
> @@
> expression x;
> @@
>
> -if (x != NULL) {
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>   x = NULL;
> -}
> //
>
> Signed-off-by: Julia Lawall

Applied.
Re: [PATCH 10/39] atm: he: drop null test before destroy functions
From: Julia Lawall
Date: Sun, 13 Sep 2015 14:15:03 +0200

> Remove unneeded NULL test.
>
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
>
> //
> @@ expression x; @@
> -if (x != NULL)
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
> //
>
> Signed-off-by: Julia Lawall

Applied.
Re: IPv6 routing/fragmentation panic
David Woodhouse wrote: > I can repeatably crash my router with 'ping6 -s 2000' to an external > machine: > [ 61.741618] skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 > head:dec98000 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan > [ 61.754128] [ cut here ] > [ 61.758754] Kernel BUG at c1201b1f [verbose debug info unavailable] > [ 61.764005] invalid opcode: [#1] > [ 61.764005] Modules linked in: sch_teql 8139cp mii iptable_nat pppoe > nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE > xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit > xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox > ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp > nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp > nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt > act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw > sch_hfsc sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT > nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle > ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm > geode_aes cbc arc4 aes_i586 > [ 61.764005] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0+ #2 > [ 61.764005] task: c138d540 ti: c1386000 task.ti: c1386000 > [ 61.764005] EIP: 0060:[] EFLAGS: 00210286 CPU: 0 > [ 61.764005] EIP is at skb_panic+0x3b/0x3d > [ 61.764005] EAX: 007c EBX: deca3000 ECX: c13a0910 EDX: c139f3c4 > [ 61.764005] ESI: dee85d8c EDI: dec9800a EBP: defe3b40 ESP: dec0bd50 > [ 61.764005] DS: 007b ES: 007b FS: GS: SS: 0068 > [ 61.764005] CR0: 8005003b CR2: b7704474 CR3: 1ef0d000 CR4: 0090 > [ 61.764005] Stack: > [ 61.764005] c135e48c c12e1580 c1277f1e 050e 000e dec98000 > dec97ffc dec9850a > [ 61.764005] dec98f40 deca3000 dee85d00 c120337b c12e1580 c1277f1e > 000e > [ 61.764005] dee85d7c ff671e02 deca3000 c109afd3 00200282 1d91 > 0028 dec98012 > [ 61.764005] Call Trace: > [ 
61.764005] [] ? ip6_finish_output2+0x196/0x4da

Hmm, unlike IPv4, the ip6 stack doesn't check the headroom size before adding the link-layer header (hh).

> But should the kernel *panic* without it? If there are requirements on
> the headroom I must leave on received packets, where are they
> documented? Or is this a bug in the IPv6 fragmentation code, to make
> such assumptions?

I'm not sure the ipv6 (re)fragmentation code is to blame here. In particular, we could have setups where additional headers need to be inserted, which could also require headroom expansion.

> I'm not entirely sure how to interpret the above stack trace. Is the
> incoming IPv6 packet being reassembled for netfilter's benefit, then
> re-fragmented for transmission?

Yes, ipv6 connection tracking depends on defragmentation. ip6_fragment should use the frag_list of the (reassembled) skb, so no refragmentation should be happening; we should just be re-using the original fragmented skbs from that fraglist.

What I don't understand is why you see this with fragmented ipv6 packets only (and not with all ipv6 forwarded skbs).
Something like this copy-pastry from ip_finish_output2 should fix it:

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -62,6 +62,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	struct net_device *dev = dst->dev;
 	struct neighbour *neigh;
 	struct in6_addr *nexthop;
+	unsigned int hh_len;
 	int ret;
 
 	skb->protocol = htons(ETH_P_IPV6);
@@ -104,6 +105,21 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
+	hh_len = LL_RESERVED_SPACE(dev);
+	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
+		struct sk_buff *skb2;
+
+		skb2 = skb_realloc_headroom(skb, hh_len);
+		if (!skb2) {
+			kfree_skb(skb);
+			return -ENOMEM;
+		}
+		if (skb->sk)
+			skb_set_owner_w(skb2, skb->sk);
+		consume_skb(skb);
+		skb = skb2;
+	}
+
 	rcu_read_lock_bh();
 	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
 	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
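The idea behind the proposed check can be shown with a userspace model of a buffer that has headroom in front of its payload. This is only a sketch: struct buf and the buf_*() helpers are hypothetical and are not the sk_buff API. The 14-byte header used in the test matches the put:14 in the report above.

```c
#include <stdlib.h>
#include <string.h>

struct buf {
	unsigned char *head; /* start of the allocation */
	unsigned char *data; /* start of the payload */
	size_t len;          /* payload length in bytes */
};

static size_t buf_headroom(const struct buf *b)
{
	return (size_t)(b->data - b->head);
}

/*
 * Make sure at least `need` bytes are free in front of the payload,
 * reallocating and copying if necessary. Returns 0 on success, -1 on
 * allocation failure (the buffer is left untouched in that case).
 */
static int buf_ensure_headroom(struct buf *b, size_t need)
{
	unsigned char *nhead;

	if (buf_headroom(b) >= need)
		return 0;

	nhead = malloc(need + b->len);
	if (!nhead)
		return -1;

	memcpy(nhead + need, b->data, b->len);
	free(b->head);
	b->head = nhead;
	b->data = nhead + need;
	return 0;
}

/* Prepend a header; refuses to write in front of `head`. */
static int buf_push(struct buf *b, const void *hdr, size_t hlen)
{
	if (buf_headroom(b) < hlen)
		return -1;

	b->data -= hlen;
	b->len += hlen;
	memcpy(b->data, hdr, hlen);
	return 0;
}
```

In the model, pushing without enough headroom simply fails; the kernel's skb_push() instead hits skb_under_panic, which is why the check has to come before the header is added.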
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, Sep 15, 2015 at 3:24 PM, Eric Dumazet wrote:
> From: Eric Dumazet
>
> If skb carries a l4 hash, no need to perform a flow dissection.
>
> Performance is slightly better :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39012e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39393e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39988e+06
>
> After patch :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.43579e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44304e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44312e+06
>
> Signed-off-by: Eric Dumazet
> Cc: Tom Herbert
> Cc: Mahesh Bandewar
> ---
>  drivers/net/bonding/bond_main.c |    4
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 771a449..9250d1e 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3136,6 +3136,10 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
>  	struct flow_keys flow;
>  	u32 hash;
>
> +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> +	    skb->l4_hash)
> +		return skb->hash;
> +
>  	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
>  	    !bond_flow_dissect(bond, skb, &flow))
>  		return bond_eth_hash(skb);
>
>
Ugh, bond_flow_dissect is yet another instance of customized flow
dissection! We should really clean this up. I suggest that in cases
where we want L4 hash a call to skb_get_hash should suffice. We can
create skb_get_l3hash when caller explicitly wants an L3 hash-- this
would return skb->hash if it's valid and skb->l4_hash is not set, else
call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
normal hash over flow keys (don't save result in skb->hash in this
case).

Tom
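The skb_get_l3hash idea proposed above does not exist as code in this thread, so the following is purely a userspace model of the described control flow. struct pkt, its fields, and the placeholder mixing in l3_hash_from_keys() are all assumptions; only the decision (reuse a valid non-L4 hash, otherwise compute an L3-only hash without caching it) comes from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant sk_buff state. */
struct pkt {
	uint32_t hash;    /* cached hash value */
	bool hash_valid;  /* a hash has been stored */
	bool l4_hash;     /* the stored hash covers L4 ports */
	uint32_t saddr;   /* L3 flow keys used by the fallback */
	uint32_t daddr;
};

/* Placeholder mixing over the L3 keys only; not the kernel's flow hash. */
static uint32_t l3_hash_from_keys(const struct pkt *p)
{
	uint32_t h = p->saddr ^ p->daddr;

	h ^= h >> 16;
	return h ? h : 1;
}

/*
 * Reuse the cached hash when it is valid and not an L4 hash;
 * otherwise compute an L3 hash and leave the cached value untouched.
 */
static uint32_t pkt_get_l3hash(const struct pkt *p)
{
	if (p->hash_valid && !p->l4_hash)
		return p->hash;

	return l3_hash_from_keys(p);
}
```

Taking a const pointer makes the "don't save result" part of the proposal explicit: the fallback path cannot overwrite a cached L4 hash.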
Re: [PATCH v2 1/3] net: irda: pxaficp_ir: use sched_clock() for time management
From: Robert Jarzmik
Date: Sat, 12 Sep 2015 13:45:22 +0200

> Instead of using directly the OS timer through direct register access,
> use the standard sched_clock(), which will end up in OSCR reading
> anyway.
>
> This is a first step for direct access register removal and machine
> specific code removal from this driver.
>
> Signed-off-by: Robert Jarzmik

What is the granularity of the OSCR register? If it is not nanoseconds,
then you need to adjust calculations such as this one:

> @@ -549,7 +548,7 @@ static int pxa_irda_hard_xmit(struct sk_buff *skb, struct net_device *dev)
>  		skb_copy_from_linear_data(skb, si->dma_tx_buff, skb->len);
>
>  		if (mtt)
> -			while ((unsigned)(readl_relaxed(OSCR) - si->last_oscr)/4 < mtt)
> +			while ((sched_clock() - si->last_clk) / 4 < mtt)
>  				cpu_relax();
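The unit concern can be made concrete with a little arithmetic. sched_clock() reports nanoseconds; the sketch below additionally assumes that mtt is expressed in microseconds and that the old divide-by-4 was a rough ticks-to-microseconds conversion, neither of which is stated in this thread. The helper names are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ull

/* Old style: elapsed hardware ticks to microseconds, given the tick rate. */
static uint64_t ticks_to_usec(uint32_t ticks, uint32_t rate_hz)
{
	return (uint64_t)ticks * 1000000ull / rate_hz;
}

/* sched_clock() style: elapsed nanoseconds to microseconds. */
static uint64_t nsec_to_usec(uint64_t delta_ns)
{
	return delta_ns / NSEC_PER_USEC;
}

/* Non-zero while the minimum turnaround time has not yet elapsed. */
static int must_wait(uint64_t now_ns, uint64_t last_ns, uint32_t mtt_usec)
{
	return nsec_to_usec(now_ns - last_ns) < mtt_usec;
}
```

Under these assumptions, keeping the divisor of 4 on a nanosecond clock would end a 1000 microsecond wait after about 4 microseconds, since 4000 ns / 4 already equals 1000.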
Re: [PATCH net-next 1/2] tcp: provide skb->hash to synack packets
On Tue, Sep 15, 2015 at 3:24 PM, Eric Dumazet wrote: > From: Eric Dumazet > > In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf > on xmit"), Tom provided a l4 hash to most outgoing TCP packets. > > We'd like to provide one as well for SYNACK packets, so that all packets > of a given flow share same txhash, to later enable bonding driver to > also use skb->hash to perform slave selection. > > Note that a SYNACK retransmit shuffles the tx hash, as Tom did > in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing > advice") for established sockets. > > This has nice effect making TCP flows resilient to some kind of black > holes, even at connection establish phase. > Acked-by: Tom Herbert > Signed-off-by: Eric Dumazet > Cc: Tom Herbert > Cc: Mahesh Bandewar > --- > include/linux/tcp.h |1 + > include/net/sock.h| 12 > net/ipv4/tcp_input.c |1 + > net/ipv4/tcp_ipv4.c |2 +- > net/ipv4/tcp_output.c |2 ++ > net/ipv6/tcp_ipv6.c |2 +- > 6 files changed, 14 insertions(+), 6 deletions(-) > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index 48c3696..937b978 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -113,6 +113,7 @@ struct tcp_request_sock { > struct inet_request_sockreq; > const struct tcp_request_sock_ops *af_specific; > booltfo_listener; > + u32 txhash; > u32 rcv_isn; > u32 snt_isn; > u32 snt_synack; /* synack sent time */ > diff --git a/include/net/sock.h b/include/net/sock.h > index 7aa7844..94dff7f 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -1654,12 +1654,16 @@ static inline void sock_graft(struct sock *sk, struct > socket *parent) > kuid_t sock_i_uid(struct sock *sk); > unsigned long sock_i_ino(struct sock *sk); > > -static inline void sk_set_txhash(struct sock *sk) > +static inline u32 net_tx_rndhash(void) > { > - sk->sk_txhash = prandom_u32(); > + u32 v = prandom_u32(); > + > + return v ?: 1; > +} > > - if (unlikely(!sk->sk_txhash)) > - sk->sk_txhash = 1; > +static inline 
void sk_set_txhash(struct sock *sk) > +{ > + sk->sk_txhash = net_tx_rndhash(); > } > > static inline void sk_rethink_txhash(struct sock *sk) > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index a8f515b..a62e9c7 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -6228,6 +6228,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, > } > > tcp_rsk(req)->snt_isn = isn; > + tcp_rsk(req)->txhash = net_tx_rndhash(); > tcp_openreq_init_rwin(req, sk, dst); > fastopen = !want_cookie && >tcp_try_fastopen(sk, skb, req, &foc, dst); > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 93898e0..d671d74 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -1276,8 +1276,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, > struct sk_buff *skb, > newinet->mc_index = inet_iif(skb); > newinet->mc_ttl = ip_hdr(skb)->ttl; > newinet->rcv_tos = ip_hdr(skb)->tos; > + newsk->sk_txhash = tcp_rsk(req)->txhash; > inet_csk(newsk)->icsk_ext_hdr_len = 0; > - sk_set_txhash(newsk); > if (inet_opt) > inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen; > newinet->inet_id = newtp->write_seq ^ jiffies; > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index f9a8a12..d0ad355 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -2987,6 +2987,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct > dst_entry *dst, > rcu_read_lock(); > md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req)); > #endif > + skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4); > tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5, > foc) + sizeof(*th); > > @@ -3505,6 +3506,7 @@ int tcp_rtx_synack(struct sock *sk, struct request_sock > *req) > struct flowi fl; > int res; > > + tcp_rsk(req)->txhash = net_tx_rndhash(); > res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL); > if (!res) { > TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS); > diff --git a/net/ipv6/tcp_ipv6.c 
b/net/ipv6/tcp_ipv6.c > index 97d9314..f9c0e26 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -1090,7 +1090,7 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock > *sk, struct sk_buff *skb, > newsk->sk_v6_rcv_saddr = ireq->ir_v6_loc_addr; > newsk->sk_bound_dev_if = ireq->ir_iif; > > - sk_set_txhash(newsk); > + newsk->sk_txhash = tcp_rsk(req)->txhash; > >
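The hash handling in this patch can be summarised with a userspace sketch: a random hash that is never zero is chosen for the request, reshuffled on a SYNACK retransmit, and copied to the child socket. The struct and function names below are hypothetical stand-ins, rand() replaces prandom_u32(), and the reading that zero is reserved to mean "no hash set" is an assumption rather than something stated in the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Map 0 to 1 so that a zero value can keep meaning "unset". */
static uint32_t nonzero_hash(uint32_t v)
{
	return v ? v : 1;
}

/* Stand-in for net_tx_rndhash(); rand() replaces prandom_u32(). */
static uint32_t rnd_txhash(void)
{
	return nonzero_hash((uint32_t)rand());
}

struct req_model {  /* stands in for struct tcp_request_sock */
	uint32_t txhash;
};

struct sock_model { /* stands in for struct sock */
	uint32_t txhash;
};

/* Connection request received: pick the hash used for the SYNACK. */
static void req_init(struct req_model *req)
{
	req->txhash = rnd_txhash();
}

/* SYNACK retransmit: reshuffle, which may select a different path. */
static void req_retransmit(struct req_model *req)
{
	req->txhash = rnd_txhash();
}

/* Child socket creation: inherit the hash instead of drawing a new one. */
static void sock_from_req(struct sock_model *sk, const struct req_model *req)
{
	sk->txhash = req->txhash;
}
```

Copying the hash in sock_from_req() is what keeps the SYNACK and the later packets of the established flow on the same hash.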
Re: [PATCH net v2] openvswitch: Fix mask generation for nested attributes.
From: Jesse Gross
Date: Fri, 11 Sep 2015 18:38:28 -0700

> Masks were added to OVS flows in a way that was backwards compatible
> with userspace programs that did not generate masks. As a result, it is
> possible that we may receive flows that do not have a mask and we need
> to synthesize one.
>
> Generating a mask requires iterating over attributes and descending into
> nested attributes. For each level we need to know the size to generate the
> correct mask. We do this with a linked table of attribute types.
>
> Although the logic to handle these nested attributes was there in concept,
> there are a number of bugs in practice. Examples include incomplete links
> between tables, variable length attributes being treated as nested and
> missing sanity checks.
>
> Signed-off-by: Jesse Gross
> ---
> v2: Fix whitespace errors.
>     Add check for unknown bytes in VXLAN extensions.
>     Factor out check for nested or variable attributes.

Applied, thanks.
Re: [PATCH] net: smc91x: convert pxa dma to dmaengine
From: Robert Jarzmik
Date: Thu, 10 Sep 2015 21:26:04 +0200

> Convert the dma transfers to be dmaengine based, now pxa has a dmaengine
> slave driver. This makes this driver a bit more PXA agnostic.
>
> The driver was tested on pxa27x (mainstone) and pxa310 (zylonite),
> ie. only pxa platforms.
>
> Signed-off-by: Robert Jarzmik
> Cc: Russell King
> Cc: Arnd Bergmann
> ---
> This has potential to break other platform such as Neponset, Idp,
> halibut and qsd8x50, so I added Russell and Arnd as they were discussing
> smc91x support last February.

Is someone testing whether such platforms break or not?

I'm waiting for that before I consider applying this patch.
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, 2015-09-15 at 15:54 -0700, Mahesh Bandewar wrote:

> > +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> > +	    skb->l4_hash)
> if (ENCAP34 || LAYER34) && l4_hash) may be?

Hmm, traditional BOND_XMIT_POLICY_LAYER34 did not do a full flow
dissection (tunnel awareness was added in commit
32819dc1834866cb9547cb75f81af9edd58d33cd, "bonding: modify the old and
add new xmit hash policies").

This could radically change some setups and behavior.

BOND_XMIT_POLICY_ENCAP34 looks a better fit to me.
Re: [PATCH net-next 2/2] bonding: use l4 hash if available
On Tue, Sep 15, 2015 at 3:24 PM, Eric Dumazet wrote:
>
> From: Eric Dumazet
>
> If skb carries a l4 hash, no need to perform a flow dissection.
>
> Performance is slightly better :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39012e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39393e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39988e+06
>
> After patch :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.43579e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44304e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44312e+06
>
> Signed-off-by: Eric Dumazet
> Cc: Tom Herbert
> Cc: Mahesh Bandewar
> ---
>  drivers/net/bonding/bond_main.c |    4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 771a449..9250d1e 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3136,6 +3136,10 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
>  	struct flow_keys flow;
>  	u32 hash;
>
> +	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> +	    skb->l4_hash)

if (ENCAP34 || LAYER34) && l4_hash) may be?

> +		return skb->hash;
> +
>  	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
>  	    !bond_flow_dissect(bond, skb, &flow))
>  		return bond_eth_hash(skb);
>
>
Re: [PATCH v3 net-next 1/2] ipv4: L3 hash-based multipath
On Tue, 15 Sep 2015 14:40:48 -0700
Alexander Duyck wrote:

> On 09/15/2015 01:29 PM, Peter Nørlund wrote:
> > Replaces the per-packet multipath with a hash-based multipath using
> > source and destination address.
> >
> > Signed-off-by: Peter Nørlund
> > ---
> >  include/net/ip_fib.h     |  11 ++--
> >  net/ipv4/fib_semantics.c | 137 +--
> >  net/ipv4/route.c         |  23 +++-
> >  3 files changed, 102 insertions(+), 69 deletions(-)
> >
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index a37d043..c335dd4 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -79,7 +79,7 @@ struct fib_nh {
> >  	unsigned char		nh_scope;
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >  	int			nh_weight;
> > -	int			nh_power;
> > +	atomic_t		nh_upper_bound;
> >  #endif
> >  #ifdef CONFIG_IP_ROUTE_CLASSID
> >  	__u32			nh_tclassid;
> > @@ -118,7 +118,7 @@ struct fib_info {
> >  #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
> >  	int			fib_nhs;
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -	int			fib_power;
> > +	int			fib_weight;
> >  #endif
> >  	struct rcu_head		rcu;
> >  	struct fib_nh		fib_nh[0];
> > @@ -312,7 +312,12 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
> >  int fib_sync_down_dev(struct net_device *dev, unsigned long event);
> >  int fib_sync_down_addr(struct net *net, __be32 local);
> >  int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
> > -void fib_select_multipath(struct fib_result *res);
> > +
> > +extern u32 fib_multipath_secret __read_mostly;
> > +
> > +typedef int (*multipath_hash_func_t)(void *ctx);
> > +void fib_select_multipath(struct fib_result *res,
> > +			  multipath_hash_func_t hash_func, void *ctx);
> >
> >  /* Exported by fib_trie.c */
> >  void fib_trie_init(void);
> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> > index 064bd3c..64d3e0e 100644
> > --- a/net/ipv4/fib_semantics.c
> > +++ b/net/ipv4/fib_semantics.c
> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
> >  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
> >
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -
> > -static DEFINE_SPINLOCK(fib_multipath_lock);
> > +u32 fib_multipath_secret __read_mostly;
> >
> >  #define for_nexthops(fi) {						\
> >  	int nhsel; const struct fib_nh *nh;				\
> > @@ -468,6 +467,55 @@ static int fib_count_nexthops(struct rtnexthop *rtnh, int remaining)
> >  	return remaining > 0 ? 0 : nhs;
> >  }
> >
> > +static void fib_rebalance(struct fib_info *fi)
> > +{
> > +	int total;
> > +	int w;
> > +	struct in_device *in_dev;
> > +
> > +	if (fi->fib_nhs < 2)
> > +		return;
> > +
> > +	total = 0;
> > +	for_nexthops(fi) {
> > +		if (nh->nh_flags & RTNH_F_DEAD)
> > +			continue;
> > +
> > +		in_dev = __in_dev_get_rcu(nh->nh_dev);
> > +
> > +		if (in_dev &&
> > +		    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > +		    nh->nh_flags & RTNH_F_LINKDOWN)
> > +			continue;
> > +
> > +		total += nh->nh_weight;
> > +	} endfor_nexthops(fi);
> > +
> > +	w = 0;
> > +	change_nexthops(fi) {
> > +		int upper_bound;
> > +
> > +		in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> > +
> > +		if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
> > +			upper_bound = -1;
> > +		} else if (in_dev &&
> > +			   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > +			   nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
> > +			upper_bound = -1;
> > +		} else {
> > +			w += nexthop_nh->nh_weight;
> > +			upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w,
> > +							total) - 1;
> > +		}
> > +
> > +		atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
> > +	} endfor_nexthops(fi);
> > +
> > +	net_get_random_once(&fib_multipath_secret,
> > +			    sizeof(fib_multipath_secret));
> > +}
> > +
> >  static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
> >  		       int remaining, struct fib_config *cfg)
> >  {
> > @@ -1094,8 +1142,15 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
> >
> >  	change_nexthops(fi) {
> >  		fib_info_update_nh_saddr(net, nexthop_nh);
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +		fi->fib_weight += nexthop_nh->nh_weight;
> > +#endif
> >  	} endfor_nexthops(fi)
> >
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +	fib_rebalance(fi);
> > +#endif
> > +
> >  link_it:
> >  	ofi = fib_find_info(fi);
> >  	if (ofi) {
> > @@ -1317,12 +1372,6 @@ int fib_sync_down_dev(struct net_device
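The upper-bound arithmetic in the quoted fib_rebalance() can be tried out in userspace with the rough sketch below. `struct nexthop`, `rebalance()` and `select_nexthop()` are hypothetical stand-ins, a single `alive` field replaces the RTNH_F_DEAD and link-down checks, and the selection rule (pick the first nexthop whose upper bound is at least the hash) is an assumption, since the part of the patch that consumes nh_upper_bound is cut off above.

```c
#include <stdint.h>

struct nexthop {
	int weight;		/* configured weight */
	int alive;		/* 0 models a dead or link-down nexthop */
	int32_t upper_bound;	/* -1 means "never selected" */
};

/* Integer division rounded to nearest, for non-negative operands. */
static int64_t div_round_closest(int64_t x, int64_t d)
{
	return (x + d / 2) / d;
}

/* Split the hash range 0..2^31-1 among live nexthops by weight. */
static void rebalance(struct nexthop *nh, int n)
{
	int64_t total = 0, w = 0;
	int i;

	for (i = 0; i < n; i++)
		if (nh[i].alive)
			total += nh[i].weight;

	for (i = 0; i < n; i++) {
		if (!nh[i].alive || total == 0) {
			nh[i].upper_bound = -1;
			continue;
		}
		w += nh[i].weight;
		nh[i].upper_bound =
			(int32_t)(div_round_closest(2147483648LL * w, total) - 1);
	}
}

/* Assumed selection rule; hash must lie in 0..2^31-1. */
static int select_nexthop(const struct nexthop *nh, int n, int32_t hash)
{
	int i;

	for (i = 0; i < n; i++)
		if (hash <= nh[i].upper_bound)
			return i;
	return -1;
}
```

With weights 1 and 3, the first nexthop ends up owning the lowest quarter of the hash range and the second the rest; marking a nexthop dead gives it a bound of -1 so no non-negative hash can select it.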
Re: [PATCH v3 net-next] rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats
From: Sowmini Varadhan
Date: Fri, 11 Sep 2015 16:48:48 -0400

>
> Many commonly used functions like getifaddrs() invoke RTM_GETLINK
> to dump the interface information, and do not need the
> AF_INET6 statistics that are always returned by default
> from rtnl_fill_ifinfo().
>
> Computing the statistics can be an expensive operation that impacts
> scaling, so it is desirable to avoid this if the information is
> not needed.
>
> This patch adds the RTEXT_FILTER_SKIP_STATS extended info flag that
> can be passed with netlink_request() to avoid statistics computation
> for the ifinfo path.
>
> Signed-off-by: Sowmini Varadhan
> ---
> v2: David Miller comments: pass u32 ext_filter_mask down.
> v3: non-RFC version of v2.

Applied, with one minor change:

> +	if (!!(ext_filter_mask & RTEXT_FILTER_SKIP_STATS))

I got rid of the "!!" as it really isn't needed for an expression
like this.

Thanks!
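For readers wondering why the "!!" can go: an `if` condition only cares whether the value is zero, so normalising the masked bit to 0 or 1 first changes nothing. A minimal sketch, where `SKIP_STATS_FLAG` is an illustrative bit standing in for RTEXT_FILTER_SKIP_STATS and both helper functions are hypothetical:

```c
#include <stdint.h>

/* Illustrative bit only; stands in for RTEXT_FILTER_SKIP_STATS. */
#define SKIP_STATS_FLAG (1u << 3)

/* Form as originally posted: "!!" collapses the masked value to 0 or 1. */
static int want_stats_verbose(uint32_t ext_filter_mask)
{
	if (!!(ext_filter_mask & SKIP_STATS_FLAG))
		return 0;
	return 1;
}

/* Form as applied: any non-zero value already counts as true. */
static int want_stats(uint32_t ext_filter_mask)
{
	if (ext_filter_mask & SKIP_STATS_FLAG)
		return 0;
	return 1;
}
```

The "!!" idiom still matters when the result is stored in a narrower type or compared against 1, which is not the case here.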
[PATCH net-next 1/2] tcp: provide skb->hash to synack packets
From: Eric Dumazet

In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf
on xmit"), Tom provided a l4 hash to most outgoing TCP packets.

We'd like to provide one as well for SYNACK packets, so that all packets
of a given flow share same txhash, to later enable bonding driver to
also use skb->hash to perform slave selection.

Note that a SYNACK retransmit shuffles the tx hash, as Tom did
in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing
advice") for established sockets.

This has nice effect making TCP flows resilient to some kind of black
holes, even at connection establish phase.

Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Mahesh Bandewar
---
 include/linux/tcp.h   |    1 +
 include/net/sock.h    |   12 ++++++++----
 net/ipv4/tcp_input.c  |    1 +
 net/ipv4/tcp_ipv4.c   |    2 +-
 net/ipv4/tcp_output.c |    2 ++
 net/ipv6/tcp_ipv6.c   |    2 +-
 6 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 48c3696..937b978 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -113,6 +113,7 @@ struct tcp_request_sock {
 	struct inet_request_sock	req;
 	const struct tcp_request_sock_ops *af_specific;
 	bool				tfo_listener;
+	u32				txhash;
 	u32				rcv_isn;
 	u32				snt_isn;
 	u32				snt_synack; /* synack sent time */
diff --git a/include/net/sock.h b/include/net/sock.h
index 7aa7844..94dff7f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1654,12 +1654,16 @@ static inline void sock_graft(struct sock *sk, struct socket *parent)
 kuid_t sock_i_uid(struct sock *sk);
 unsigned long sock_i_ino(struct sock *sk);

-static inline void sk_set_txhash(struct sock *sk)
+static inline u32 net_tx_rndhash(void)
 {
-	sk->sk_txhash = prandom_u32();
+	u32 v = prandom_u32();
+
+	return v ?: 1;
+}

-	if (unlikely(!sk->sk_txhash))
-		sk->sk_txhash = 1;
+static inline void sk_set_txhash(struct sock *sk)
+{
+	sk->sk_txhash = net_tx_rndhash();
 }

 static inline void sk_rethink_txhash(struct sock *sk)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a8f515b..a62e9c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6228,6 +6228,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	}

 	tcp_rsk(req)->snt_isn = isn;
+	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
 	fastopen = !want_cookie &&
 		   tcp_try_fastopen(sk, skb, req, &foc, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 93898e0..d671d74 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1276,8 +1276,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	newinet->mc_index = inet_iif(skb);
 	newinet->mc_ttl = ip_hdr(skb)->ttl;
 	newinet->rcv_tos = ip_hdr(skb)->tos;
+	newsk->sk_txhash = tcp_rsk(req)->txhash;
 	inet_csk(newsk)->icsk_ext_hdr_len = 0;
-	sk_set_txhash(newsk);
 	if (inet_opt)
 		inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 	newinet->inet_id = newtp->write_seq ^ jiffies;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f9a8a12..d0ad355 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2987,6 +2987,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	rcu_read_lock();
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
+	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
 	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
 					     foc) + sizeof(*th);

@@ -3505,6 +3506,7 @@ int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
 	struct flowi fl;
 	int res;

+	tcp_rsk(req)->txhash = net_tx_rndhash();
 	res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
 	if (!res) {
 		TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 97d9314..f9c0e26 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1090,7 +1090,7 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	newsk->sk_v6_rcv_saddr = ireq->ir_v6_loc_addr;
 	newsk->sk_bound_dev_if = ireq->ir_iif;

-	sk_set_txhash(newsk);
+	newsk->sk_txhash = tcp_rsk(req)->txhash;

 	/* Now IPv6 options...

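The life cycle of the hash in this patch (picked at connection request, stamped on each SYNACK, reshuffled on retransmit, inherited by the child socket) can be modelled in userspace roughly as follows. Everything here is a simplified stand-in rather than kernel code: the structs, the `rng_fn` callback replacing prandom_u32(), and the deterministic `fake_rng()` that exists only so the sketch can be exercised predictably.

```c
#include <stdint.h>

typedef uint32_t (*rng_fn)(void);	/* stand-in for prandom_u32() */

/* Deterministic stand-in RNG: returns 0, 5, 10, ... */
static uint32_t fake_state;
static uint32_t fake_rng(void)
{
	uint32_t v = fake_state;

	fake_state += 5;
	return v;
}

/* Never return 0, which would read as "no hash". */
static uint32_t tx_rndhash(rng_fn rng)
{
	uint32_t v = rng();

	return v ? v : 1;	/* portable spelling of the GNU "v ?: 1" */
}

struct request { uint32_t txhash; };		/* models tcp_request_sock */
struct child   { uint32_t txhash; };		/* models the new socket */
struct packet  { uint32_t hash; int l4_hash; };	/* models the skb */

/* On SYN receipt: pick the flow's tx hash once. */
static void conn_request(struct request *req, rng_fn rng)
{
	req->txhash = tx_rndhash(rng);
}

/* Every SYNACK carries the request's hash, marked as an L4 hash. */
static void make_synack(const struct request *req, struct packet *pkt)
{
	pkt->hash = req->txhash;
	pkt->l4_hash = 1;
}

/* A retransmit reshuffles the hash before building the SYNACK. */
static void rtx_synack(struct request *req, struct packet *pkt, rng_fn rng)
{
	req->txhash = tx_rndhash(rng);
	make_synack(req, pkt);
}

/* The established socket inherits the last hash that was used. */
static void syn_recv_sock(const struct request *req, struct child *c)
{
	c->txhash = req->txhash;
}
```

Because the child copies the hash instead of drawing a new one, the SYNACK that succeeded and the rest of the flow share one value, and a device hashing on it can keep them on one path.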
[PATCH net-next 2/2] bonding: use l4 hash if available
From: Eric Dumazet

If skb carries a l4 hash, no need to perform a flow dissection.

Performance is slightly better :

lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39012e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39393e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39988e+06

After patch :

lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.43579e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.44304e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.44312e+06

Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Mahesh Bandewar
---
 drivers/net/bonding/bond_main.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 771a449..9250d1e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3136,6 +3136,10 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 	struct flow_keys flow;
 	u32 hash;

+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
+	    skb->l4_hash)
+		return skb->hash;
+
 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
 	    !bond_flow_dissect(bond, skb, &flow))
 		return bond_eth_hash(skb);
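The decision order this hunk introduces can be sketched outside the kernel as below. The enum, `struct packet` and `xmit_hash()` are hypothetical simplifications: the precomputed `l2_hash` and `flow_hash` fields stand in for bond_eth_hash() and the result of a successful bond_flow_dissect(), and the `dissected` out-parameter exists only to show when the expensive path would run.

```c
#include <stdint.h>

enum xmit_policy { POLICY_LAYER2, POLICY_LAYER34, POLICY_ENCAP34 };

struct packet {
	uint32_t hash;		/* hash already carried by the packet */
	int l4_hash;		/* non-zero when 'hash' is a valid L4 hash */
	uint32_t l2_hash;	/* what an Ethernet-header hash would give */
	uint32_t flow_hash;	/* what a full flow dissection would give */
};

/*
 * Returns the transmit hash and reports through 'dissected' whether the
 * (expensive) flow dissection would have been needed.
 */
static uint32_t xmit_hash(enum xmit_policy policy, const struct packet *pkt,
			  int *dissected)
{
	*dissected = 0;

	/* Fast path from the patch: reuse the L4 hash for ENCAP34 only. */
	if (policy == POLICY_ENCAP34 && pkt->l4_hash)
		return pkt->hash;

	if (policy == POLICY_LAYER2)
		return pkt->l2_hash;

	*dissected = 1;
	return pkt->flow_hash;
}
```

Restricting the shortcut to the ENCAP34 policy means a LAYER34 bond keeps hashing exactly as it did before, which is the behaviour-preservation concern raised in review.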
[PATCH] net: Fix vti use case with oif in dst lookups
Steffen reported that the recent change to add oif to dst lookups breaks
the VTI use case. The problem is that with the oif set in the flow struct
the comparison to the nh_oif is triggered. Fix by splitting the
FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
nh oif (FLOWI_FLAG_SKIP_NH_OIF).

Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")
Signed-off-by: David Ahern
---
IPv6 does not show this problem for me. So no change is added for IPv6.
If your mileage varies let me know and I'll take another look.

 drivers/net/vrf.c       | 3 ++-
 include/net/flow.h      | 1 +
 include/net/route.h     | 2 +-
 net/ipv4/fib_trie.c     | 2 +-
 net/ipv4/udp.c          | 3 ++-
 net/ipv4/xfrm4_policy.c | 2 ++
 6 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index e7094fbd7568..488c6f50df73 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -193,7 +193,8 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
 		.flowi4_oif = vrf_dev->ifindex,
 		.flowi4_iif = LOOPBACK_IFINDEX,
 		.flowi4_tos = RT_TOS(ip4h->tos),
-		.flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC,
+		.flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC |
+				FLOWI_FLAG_SKIP_NH_OIF,
 		.daddr = ip4h->daddr,
 	};

diff --git a/include/net/flow.h b/include/net/flow.h
index acd6a096250e..9b85db85f13c 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -35,6 +35,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
 #define FLOWI_FLAG_VRFSRC		0x04
+#define FLOWI_FLAG_SKIP_NH_OIF		0x08
 	__u32	flowic_secid;
 	struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index cc61cb95f059..f46af256880c 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -255,7 +255,7 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 		flow_flags |= FLOWI_FLAG_ANYSRC;

 	if (netif_index_is_vrf(sock_net(sk), oif))
-		flow_flags |= FLOWI_FLAG_VRFSRC;
+		flow_flags |= FLOWI_FLAG_VRFSRC | FLOWI_FLAG_SKIP_NH_OIF;

 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 26d6ffb6d23c..6c2af797f2f9 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1426,7 +1426,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+			if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
 				if (flp->flowi4_oif &&
 				    flp->flowi4_oif != nh->nh_oif)
 					continue;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c0a15e7f359f..f7d1d5e19e95 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1024,7 +1024,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (netif_index_is_vrf(net, ipc.oif)) {
 			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
 					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-					   (flow_flags | FLOWI_FLAG_VRFSRC),
+					   (flow_flags | FLOWI_FLAG_VRFSRC |
+					    FLOWI_FLAG_SKIP_NH_OIF),
 					   faddr, saddr, dport,
 					   inet->inet_sport);

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index bb919b28619f..c10a9ee68433 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -33,6 +33,8 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 	if (saddr)
 		fl4->saddr = saddr->a4;

+	fl4->flowi4_flags = FLOWI_FLAG_SKIP_NH_OIF;
+
 	rt = __ip_route_output_key(net, fl4);
 	if (!IS_ERR(rt))
 		return &rt->dst;
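The effect of the flag split on the nexthop check in fib_table_lookup() can be shown with a small userspace model. The flag values mirror the ones in the flow.h hunk, but the `FLAG_*` names, `struct flow`, `struct nexthop` and `nexthop_matches()` are simplified stand-ins and the interface indexes used with them are made up.

```c
/* Values mirror the flow.h hunk; names shortened for this sketch. */
#define FLAG_ANYSRC	 0x01u
#define FLAG_KNOWN_NH	 0x02u
#define FLAG_VRFSRC	 0x04u	/* vrf device cache bypass only */
#define FLAG_SKIP_NH_OIF 0x08u	/* do not compare the nexthop oif */

struct flow {
	int oif;		/* requested output interface, 0 if unset */
	unsigned int flags;
};

struct nexthop {
	int oif;		/* interface the route actually points at */
};

/* Returns 1 when the nexthop is acceptable for this flow, else 0. */
static int nexthop_matches(const struct flow *fl, const struct nexthop *nh)
{
	if (!(fl->flags & FLAG_SKIP_NH_OIF)) {
		if (fl->oif && fl->oif != nh->oif)
			return 0;
	}
	return 1;
}
```

A VTI lookup sets the oif to the tunnel device while the matching route points at the underlying interface, so it needs the oif comparison skipped without also opting into the VRF cache bypass; that is what the separate bit allows.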
Re: [PATCH v2] net: stmmac: Use msleep rather then udelay for reset delay
From: Sjoerd Simons
Date: Fri, 11 Sep 2015 22:25:48 +0200

> The reset delays used for stmmac are in the order of 10ms to 1 second,
> which is far too long for udelay usage, so switch to using msleep.
>
> Practically this fixes the PHY not being reliably detected in some cases
> as udelay wouldn't actually delay for long enough to let the phy
> reliably be reset.
>
> Signed-off-by: Sjoerd Simons

Applied, thanks.