[PATCH net] sctp: fix the transports round robin issue when init is retransmitted
Prior to this patch, if an association starts with two paths whose parameters are identical except for last_time_heard, retransmission tries the paths like this: 1st cycle: try trans1, fail, then trans2 is selected (because its last_time_heard is after trans1's). 2nd cycle: try trans2, fail, then trans2 is selected again (because its last_time_heard is still after trans1's). 3rd cycle: try trans2, fail, then trans2 is selected again. trans1 never gets a chance to be selected, which is not what we expect: we should keep round-robining all the paths when they have just been added at the beginning. So, first, every transport's last_time_heard should be initialized to 0, so that all transports start with the same value; only then can all the transports get an equal chance to be selected. Second, sctp_trans_elect_best should return the trans_next one when *trans == *trans_next, so that we can try the next path if the current one fails, but now it always returns trans. Fix it by exchanging these two parameters when calling sctp_trans_elect_tie().
Fixes: 4c47af4d5eb2 ('net: sctp: rework multihoming retransmission path selection to rfc4960')
Signed-off-by: Xin Long
---
 net/sctp/associola.c | 2 +-
 net/sctp/transport.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 2bf8ec9..cd87344 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1263,7 +1263,7 @@ static struct sctp_transport *sctp_trans_elect_best(struct sctp_transport *curr,
 	if (score_curr > score_best)
 		return curr;
 	else if (score_curr == score_best)
-		return sctp_trans_elect_tie(curr, best);
+		return sctp_trans_elect_tie(best, curr);
 	else
 		return best;
 }
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index a431c14..d517153 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -72,7 +72,7 @@ static struct sctp_transport *sctp_transport_init(struct net *net,
 	 */
 	peer->rto = msecs_to_jiffies(net->sctp.rto_initial);

-	peer->last_time_heard = ktime_get();
+	peer->last_time_heard = ktime_set(0, 0);
 	peer->last_time_ecne_reduced = jiffies;

 	peer->param_flags = SPP_HB_DISABLE |
-- 
2.1.0
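The effect of the two hunks can be modeled in a few lines of userspace C. The names below are hypothetical stand-ins for the real code in net/sctp/associola.c; the sketch only shows why zero-initialized last_time_heard plus the swapped tie-break arguments yield round robin:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sctp_transport. */
struct toy_trans {
	int error_count;
	long last_time_heard;	/* 0 at init, per the transport.c hunk */
};

/* Mirrors the shape of sctp_trans_elect_tie(): on a complete tie it
 * returns its first argument. */
static struct toy_trans *toy_elect_tie(struct toy_trans *t1,
				       struct toy_trans *t2)
{
	if (t1->error_count > t2->error_count)
		return t2;
	if (t1->error_count == t2->error_count &&
	    t2->last_time_heard > t1->last_time_heard)
		return t2;
	return t1;
}

/* The equal-score branch of sctp_trans_elect_best() after the fix:
 * arguments are passed as (best, curr), so a complete tie keeps
 * 'best' (the candidate found first in iteration order) instead of
 * always overwriting it with 'curr'. */
static struct toy_trans *toy_elect_best_on_tie(struct toy_trans *curr,
					       struct toy_trans *best)
{
	return toy_elect_tie(best, curr);
}
```

With both transports freshly initialized (equal error counts, last_time_heard both 0), the fixed order preserves the first candidate after the failed path, so retransmission rotates instead of sticking to one transport.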
Re: [PATCH V4 0/3] basic busy polling support for vhost_net
Hi Greg, > Greg Kurz wrote on 03/09/2016 09:26:45 PM: > > On Fri, 4 Mar 2016 06:24:50 -0500 > > Jason Wang wrote: > > > This series tries to add basic busy polling for vhost net. The idea is > > simple: at the end of tx/rx processing, busy polling for new tx added > > descriptor and rx receive socket for a while. The maximum number of > > time (in us) could be spent on busy polling was specified ioctl. > > > > Test A were done through: > > > > - 50 us as busy loop timeout > > - Netperf 2.6 > > - Two machines with back to back connected mlx4 > > Hi Jason, > > Could this also improve performance if both VMs are > on the same host system ? I've experimented a little with Jason's patches and guest-to-guest netperf when both guests were on the same host, and I saw improvements for that case. > > - Guest with 1 vcpus and 1 queue > > > > Results: > > - Obvious improvements (%5 - 20%) for latency (TCP_RR). > > - Get a better or minor regression on most of the TX tests, but see > > some regression on 4096 size. > > - Except for 8 sessions of 4096 size RX, have a better or same > > performance. > > - CPU utilization were incrased as expected. 
> > TCP_RR:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >     1/   1/  +8%/ -32%/  +8%/  +8%/  +7%
> >     1/  50/  +7%/ -19%/  +7%/  +7%/  +1%
> >     1/ 100/  +5%/ -21%/  +5%/  +5%/   0%
> >     1/ 200/  +5%/ -21%/  +7%/  +7%/  +1%
> >    64/   1/ +11%/ -29%/ +11%/ +11%/ +10%
> >    64/  50/  +7%/ -19%/  +8%/  +8%/  +2%
> >    64/ 100/  +8%/ -18%/  +9%/  +9%/  +2%
> >    64/ 200/  +6%/ -19%/  +6%/  +6%/   0%
> >   256/   1/  +7%/ -33%/  +7%/  +7%/  +6%
> >   256/  50/  +7%/ -18%/  +7%/  +7%/   0%
> >   256/ 100/  +9%/ -18%/  +8%/  +8%/  +2%
> >   256/ 200/  +9%/ -18%/ +10%/ +10%/  +3%
> >  1024/   1/ +20%/ -28%/ +20%/ +20%/ +19%
> >  1024/  50/  +8%/ -18%/  +9%/  +9%/  +2%
> >  1024/ 100/  +6%/ -19%/  +5%/  +5%/   0%
> >  1024/ 200/  +8%/ -18%/  +9%/  +9%/  +2%
> > Guest TX:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >    64/   1/  -5%/ -28%/ +11%/ +12%/ +10%
> >    64/   4/  -2%/ -26%/ +13%/ +13%/ +13%
> >    64/   8/  -6%/ -29%/  +9%/ +10%/ +10%
> >   512/   1/ +15%/  -7%/ +13%/ +11%/  +3%
> >   512/   4/ +17%/  -6%/ +18%/ +13%/ +11%
> >   512/   8/ +14%/  -7%/ +13%/  +7%/  +7%
> >  1024/   1/ +27%/  -2%/ +26%/ +29%/ +12%
> >  1024/   4/  +8%/  -9%/  +6%/  +1%/  +6%
> >  1024/   8/ +41%/ +12%/ +34%/ +20%/  -3%
> >  4096/   1/ -22%/ -21%/ -36%/ +81%/+1360%
> >  4096/   4/ -57%/ -58%/+286%/ +15%/+2074%
> >  4096/   8/ +67%/ +70%/ -45%/  -8%/ +63%
> > 16384/   1/  -2%/  -5%/  +5%/  -3%/ +80%
> > 16384/   4/   0%/   0%/   0%/  +4%/+138%
> > 16384/   8/   0%/   0%/   0%/  +1%/ +41%
> > 65535/   1/  -3%/  -6%/  +2%/ +11%/+113%
> > 65535/   4/  -2%/  -1%/  -2%/  -3%/+484%
> > 65535/   8/   0%/  +1%/   0%/  +2%/ +40%
> > Guest RX:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >    64/   1/ +31%/  -3%/  +8%/  +8%/  +8%
> >    64/   4/ +11%/ -17%/ +13%/ +14%/ +15%
> >    64/   8/  +4%/ -23%/ +11%/ +11%/ +12%
> >   512/   1/ +24%/   0%/ +18%/ +14%/  -8%
> >   512/   4/  +4%/ -15%/  +6%/  +5%/  +6%
> >   512/   8/ +26%/   0%/ +21%/ +10%/  +3%
> >  1024/   1/ +88%/ +47%/ +69%/ +44%/ -30%
> >  1024/   4/ +18%/  -5%/ +19%/ +16%/  +2%
> >  1024/   8/ +15%/  -4%/ +13%/  +8%/  +1%
> >  4096/   1/  -3%/  -5%/  +2%/  -2%/ +41%
> >  4096/   4/  +2%/  +3%/ -20%/ -14%/ -24%
> >  4096/   8/ -43%/ -45%/ +69%/ -24%/ +94%
> > 16384/   1/  -3%/ -11%/ +23%/  +7%/ +42%
> > 16384/   4/  -3%/  -3%/  -4%/  +5%/+115%
> > 16384/   8/  -1%/   0%/  -1%/  -3%/ +32%
> > 65535/   1/  +1%/   0%/  +2%/   0%/ +66%
> > 65535/   4/  -1%/  -1%/   0%/  +4%/+492%
> > 65535/   8/   0%/  -1%/  -1%/  +4%/ +38%
> >
> > Changes from V3:
> > - drop single_task_running()
> > - use cpu_relax_lowlatency() instead of cpu_relax()
> >
> > Changes from V2:
> > - rename vhost_vq_more_avail() to vhost_vq_avail_empty(). And return
> >   false we __get_user() fails.
> > - do not bother premmptions/timers for good path.
> > - use vhost_vring_state as ioctl parameter instead of reinveting a new
> >   one.
> > - add the unit of timeout (us) to the comment of new added ioctls
> >
> > Changes from V1:
> > - remove the buggy vq_error() in vhost_vq_more_avail().
> > - leave vhost_enable_notify() untouched.
> >
> > Changes from RFC V3:
> > - small tweak on the code to avoid multiple duplicate conditions in
> >   critical path when busy loop is not enabled.
> > - add the test result of multiple VMs
> >
> > Changes from RFC V2:
> > - poll also at the end of rx handling
> > - factor out the polling logic and opti
Re: Hang occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test
On Thu, Mar 10, 2016 at 01:21:05PM +0900, Yuki Machida wrote:
> Hi all,
>
> A hang occurred in the Linux kernel after v4.4-rc1 during the IPv6 Ready Logo
> Conformance Test. The bug is not fixed yet as of v4.5-rc7.
>
> Currently, it is under investigation.
>
> The following are the details:

I think you forgot to paste the details?

> IPv6 Ready Logo
> https://www.ipv6ready.org/
>
> I ran the IPv6 Ready Logo Core Conformance Test on an Intel D510MO (Atom D510).
> The userland is built with the Yocto Project.
>
> Test Environment
> Test Specification : 4.0.6
> Tool Version : REL_3_3_2
> Test Program Version : V6LC_5_0_0
> Target Device : Intel D510MO (Atom D510)
>
> Regards,
> Yuki Machida
Re: [PATCH net-next] bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS
From: Alexei Starovoitov Date: Wed, 9 Mar 2016 18:56:49 -0800 > 0-day bot reported build error: > kernel/built-in.o: In function `map_lookup_elem': >>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to >>> `bpf_stackmap_copy' > when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not. > Add weak definition to resolve it. > This code path in map_lookup_elem() is never taken > when CONFIG_PERF_EVENTS is not set. > > Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") > Reported-by: Fengguang Wu > Signed-off-by: Alexei Starovoitov Applied.
Re: [PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()
From: Alexei Starovoitov Date: Wed, 9 Mar 2016 20:02:33 -0800 > Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but > the result is typically passed to print("%s", buf) and extra bytes > after zero don't cause any harm. > In bpf the result of bpf_get_current_comm() is used as the part of > map key and was causing spurious hash map mismatches. > Use strlcpy() to guarantee zero-terminated string. > bpf verifier checks that output buffer is zero-initialized, > so even for short task names the output buffer don't have junk bytes. > Note it's not a security concern, since kprobe+bpf is root only. > > Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm > accessors") > Reported-by: Tobias Waldekranz > Signed-off-by: Alexei Starovoitov > --- > Targeting net-next, since it's too late for net. > I think it makes sense for stable as well. Applied and queued up for -stable, thanks.
Re: [PATCH net-next 0/3] validate variable length ll headers
From: Willem de Bruijn Date: Wed, 9 Mar 2016 23:22:59 -0500 > On Wed, Mar 9, 2016 at 10:13 PM, David Miller wrote: >> From: Willem de Bruijn >> Date: Wed, 9 Mar 2016 21:58:31 -0500 >> >>> Allow device-specific validation of link layer headers. Existing >>> checks drop all packets shorter than hard_header_len. For variable >>> length protocols, such packets can be valid. >>> >>> patch 1 adds header_ops.validate and dev_validate_header >>> patch 2 implements the protocol specific callback for AX25 >>> patch 3 replaces ll_header_truncated with dev_validate_header >> >> Series applied and queued up for -stable, thanks! > > Thanks, David. Please don't queue the earlier net patchset for stable. > I spotted an issue while converting to net-next, I'm afraid. I can send a > net v2 (as net-next won't apply), or a separate stable patch once 4.5 is cut. I'll use the net-next patches as the basis for the backport I do, thanks.
Re: [PATCH net-next 0/3] validate variable length ll headers
On Wed, Mar 9, 2016 at 10:13 PM, David Miller wrote: > From: Willem de Bruijn > Date: Wed, 9 Mar 2016 21:58:31 -0500 > >> Allow device-specific validation of link layer headers. Existing >> checks drop all packets shorter than hard_header_len. For variable >> length protocols, such packets can be valid. >> >> patch 1 adds header_ops.validate and dev_validate_header >> patch 2 implements the protocol specific callback for AX25 >> patch 3 replaces ll_header_truncated with dev_validate_header > > Series applied and queued up for -stable, thanks! Thanks, David. Please don't queue the earlier net patchset for stable. I spotted an issue while converting to net-next, I'm afraid. I can send a net v2 (as net-next won't apply), or a separate stable patch once 4.5 is cut.
Hang occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test
Hi all,

A hang occurred in the Linux kernel after v4.4-rc1 during the IPv6 Ready Logo Conformance Test. The bug is not fixed yet as of v4.5-rc7.

Currently, it is under investigation.

The following are the details:

IPv6 Ready Logo
https://www.ipv6ready.org/

I ran the IPv6 Ready Logo Core Conformance Test on an Intel D510MO (Atom D510). The userland is built with the Yocto Project.

Test Environment
Test Specification : 4.0.6
Tool Version : REL_3_3_2
Test Program Version : V6LC_5_0_0
Target Device : Intel D510MO (Atom D510)

Regards,
Yuki Machida
[PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); the result is typically passed to print("%s", buf) and the extra bytes after the terminating zero cause no harm. In bpf the result of bpf_get_current_comm() is used as part of a map key and was causing spurious hash map mismatches. Use strlcpy() to guarantee a zero-terminated string. The bpf verifier checks that the output buffer is zero-initialized, so even for short task names the output buffer does not end up with junk bytes. Note it is not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz
Signed-off-by: Alexei Starovoitov
---
Targeting net-next, since it's too late for net.
I think it makes sense for stable as well.

 kernel/bpf/helpers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4504ca66118d..50da680c479f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -166,7 +166,7 @@ static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5)
 	if (!task)
 		return -EINVAL;

-	memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
+	strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
 	return 0;
 }
-- 
2.8.0.rc1
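The mismatch this patch fixes can be reproduced in plain userspace C. strlcpy() is a kernel/BSD function not guaranteed by every libc, so the sketch defines a local equivalent; the buffer size is an illustrative stand-in for TASK_COMM_LEN:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COMM_LEN 16	/* stand-in for TASK_COMM_LEN */

/* Local equivalent of the kernel's strlcpy(): copy at most size-1
 * bytes and always NUL-terminate. */
static size_t my_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Two comm buffers holding the same string "sh" but different junk
 * after the NUL, as happens when a task's previous, longer name left
 * stale bytes behind. */
static void make_comms(char *comm1, char *comm2)
{
	memset(comm1, 0, COMM_LEN);
	memcpy(comm1, "sh", 3);		/* copies "sh\0", rest is zero */
	memset(comm2, 'X', COMM_LEN);	/* stale junk */
	memcpy(comm2, "sh", 3);		/* copies "sh\0", junk follows */
}
```

Comparing the raw buffers (what memcpy of the full TASK_COMM_LEN put into the map key) shows a mismatch, while strlcpy into zero-initialized key buffers yields identical keys — exactly the hash-map behavior described above.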
Re: [PATCH nf-next v9 8/8] openvswitch: Interface with NAT.
Hi Jarno,

Thanks for working on this. Mostly just a few style things around #ifdefs below.

On 9 March 2016 at 15:10, Jarno Rajahalme wrote:
> Extend OVS conntrack interface to cover NAT. New nested
> OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action.
> A bare OVS_CT_ATTR_NAT only mangles existing and expected connections.
> If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested
> attributes, new (non-committed/non-confirmed) connections are mangled
> according to the rest of the nested attributes.
>
> The corresponding OVS userspace patch series includes test cases (in
> tests/system-traffic.at) that also serve as example uses.
>
> This work extends on a branch by Thomas Graf at
> https://github.com/tgraf/ovs/tree/nat.

Thomas, I guess there was no signoff in these patches, so Jarno does not have your signoff in this patch.

> Signed-off-by: Jarno Rajahalme
> ---
> v9: Fixed module dependencies.
>
>  include/uapi/linux/openvswitch.h |  49 ++++
>  net/openvswitch/Kconfig          |   3 +-
>  net/openvswitch/conntrack.c      | 523 ++++++++++++++++++++++++++++++++++---
>  net/openvswitch/conntrack.h      |   3 +-
>  4 files changed, 551 insertions(+), 27 deletions(-)
>
> diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
> index cd5fd9d..23471a4 100644
> --- a/net/openvswitch/Kconfig
> +++ b/net/openvswitch/Kconfig
> @@ -6,7 +6,8 @@ config OPENVSWITCH
>  	tristate "Open vSwitch"
>  	depends on INET
>  	depends on !NF_CONNTRACK || \
> -		   (NF_CONNTRACK && (!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6))
> +		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
> +		      (!NF_NAT || NF_NAT)))

Whitespace.
>  	select LIBCRC32C
>  	select MPLS
>  	select NET_MPLS_GSO
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 5711f80..6455237 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
>  struct ovs_ct_len_tbl {
> -	size_t maxlen;
> -	size_t minlen;
> +	int maxlen;
> +	int minlen;
>  };

Are these changed for a specific reason, or just to use INT_MAX rather than SIZE_MAX in ovs_ct_len_tbl?

>  /* Metadata mark for masked write to conntrack mark */
> @@ -42,15 +52,29 @@ struct md_labels {
>  	struct ovs_key_ct_labels mask;
>  };
>
> +#ifdef CONFIG_NF_NAT_NEEDED
> +enum ovs_ct_nat {
> +	OVS_CT_NAT = 1 << 0,		/* NAT for committed connections only. */
> +	OVS_CT_SRC_NAT = 1 << 1,	/* Source NAT for NEW connections. */
> +	OVS_CT_DST_NAT = 1 << 2,	/* Destination NAT for NEW connections. */
> +};
> +#endif

Here...

>  /* Conntrack action context for execution. */
>  struct ovs_conntrack_info {
>  	struct nf_conntrack_helper *helper;
>  	struct nf_conntrack_zone zone;
>  	struct nf_conn *ct;
>  	u8 commit : 1;
> +#ifdef CONFIG_NF_NAT_NEEDED
> +	u8 nat : 3;	/* enum ovs_ct_nat */
> +#endif

and here.. I wonder if we can trim more of these #ifdefs, for readability and more compiler coverage if the feature is disabled.

>  	u16 family;
>  	struct md_mark mark;
>  	struct md_labels labels;
> +#ifdef CONFIG_NF_NAT_NEEDED
> +	struct nf_nat_range range;	/* Only present for SRC NAT and DST NAT. */
> +#endif
>  };

(probably not this one, though)

>  static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
> @@ -137,12 +161,15 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
>  	ovs_ct_get_labels(ct, &key->ct.labels);
>  }
>
> -/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has
> - * previously sent the packet to conntrack via the ct action.
> +/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has
> + * previously sent the packet to conntrack via the ct action.  If
> + * 'keep_nat_flags' is true, the existing NAT flags retained, else they are
> + * initialized from the connection status.
>   */
>  static void ovs_ct_update_key(const struct sk_buff *skb,
>  			      const struct ovs_conntrack_info *info,
> -			      struct sw_flow_key *key, bool post_ct)
> +			      struct sw_flow_key *key, bool post_ct,
> +			      bool keep_nat_flags)
>  {
>  	const struct nf_conntrack_zone *zone = &nf_ct_zone_dflt;
>  	enum ip_conntrack_info ctinfo;
> @@ -160,6 +187,16 @@ static void ovs_ct_update_key(const struct sk_buff *skb,
>  	 */
>  	if (ct->master)
>  		state |= OVS_CS_F_RELATED;
> +#ifdef CONFIG_NF_NAT_NEEDED
> +	if (keep_nat_flags) {
> +		state |= key->ct.state & OVS_CS_F_NAT_MASK;
> +	} else {
> +		if (ct->status & IPS_SRC_NAT)
> +			state |= OVS_CS_F_SRC_NAT;
> +		if (ct->stat
Re: [PATCH] bpf: make bpf_stackmap_copy conditionally called
On Thu, Mar 10, 2016 at 02:43:42AM +0100, Arnd Bergmann wrote:
> Changing the bpf syscall to use the new bpf_stackmap_copy() helper for
> BPF_MAP_TYPE_STACK_TRACE causes a link error when CONFIG_PERF_EVENTS
> is disabled:
>
> kernel/built-in.o: In function `map_lookup_elem':
> :(.text+0x7fca4): undefined reference to `bpf_stackmap_copy'
>
> This patch simply avoids handling that case, which may or may not
> be the correct answer here.
>
> Signed-off-by: Arnd Bergmann
> Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
...
> -	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
> +	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE &&
> +		   IS_ENABLED(CONFIG_PERF_EVENTS)) {
>  		err = bpf_stackmap_copy(map, key, value);
>  	} else {

Yes, this is also an ok-ish fix. I've sent a different version already:
http://patchwork.ozlabs.org/patch/595617/

I considered an option like yours, but it relies on gcc doing dead code elimination of the 'if (false) {}' branch, and though the kernel is never compiled with -O0, I didn't want to take the risk. I'm fine with either approach though.
[PATCH] bpf: make bpf_stackmap_copy conditionally called
Changing the bpf syscall to use the new bpf_stackmap_copy() helper for BPF_MAP_TYPE_STACK_TRACE causes a link error when CONFIG_PERF_EVENTS is disabled:

kernel/built-in.o: In function `map_lookup_elem':
:(.text+0x7fca4): undefined reference to `bpf_stackmap_copy'

This patch simply avoids handling that case, which may or may not be the correct answer here.

Signed-off-by: Arnd Bergmann
Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
---
 kernel/bpf/syscall.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2978d0d08869..b014b64b6116 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -290,7 +290,8 @@ static int map_lookup_elem(union bpf_attr *attr)
 		err = bpf_percpu_hash_copy(map, key, value);
 	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
 		err = bpf_percpu_array_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE &&
+		   IS_ENABLED(CONFIG_PERF_EVENTS)) {
 		err = bpf_stackmap_copy(map, key, value);
 	} else {
 		rcu_read_lock();
-- 
2.7.0
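The trade-off discussed in the follow-up — relying on the compiler to drop the branch guarded by IS_ENABLED() — rests on IS_ENABLED() collapsing to the integer constant 1 or 0 at preprocessing time. Below is a minimal userspace model of the kernel macro from include/linux/kconfig.h (simplified: the tristate/_MODULE handling is omitted), together with a toy lookup function standing in for the map_lookup_elem() branch:

```c
#include <assert.h>

/* Simplified model of IS_ENABLED() from include/linux/kconfig.h.
 * Config options that are on are defined to 1; options that are off
 * are simply not defined at all. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_DEMO_ON 1
/* CONFIG_DEMO_OFF deliberately left undefined */

/* Because IS_ENABLED() yields an ordinary integer constant, a branch
 * guarded by it stays syntactically checked but can be eliminated as
 * dead code -- which is what makes the undefined reference disappear,
 * provided the optimizer actually runs. */
static int toy_lookup(int enabled)
{
	if (enabled)
		return 42;	/* would call bpf_stackmap_copy() */
	return -1;		/* fallback path */
}
```

The macro works by token-pasting: for a defined option, `__ARG_PLACEHOLDER_1` expands to `0,`, shifting the literal `1` into the second-argument slot; for an undefined option the paste produces an inert token and the default `0` is selected.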
Re: [PATCH net-next 0/3] validate variable length ll headers
From: Willem de Bruijn Date: Wed, 9 Mar 2016 21:58:31 -0500 > Allow device-specific validation of link layer headers. Existing > checks drop all packets shorter than hard_header_len. For variable > length protocols, such packets can be valid. > > patch 1 adds header_ops.validate and dev_validate_header > patch 2 implements the protocol specific callback for AX25 > patch 3 replaces ll_header_truncated with dev_validate_header Series applied and queued up for -stable, thanks!
[PATCH net-next 3/3] packet: validate variable length ll headers
From: Willem de Bruijn

Replace link layer header validation check ll_header_truncate with more generic dev_validate_header. Validation based on hard_header_len incorrectly drops valid packets in variable length protocols, such as AX25. dev_validate_header calls header_ops.validate for such protocols to ensure correctness below hard_header_len.

See also http://comments.gmane.org/gmane.linux.network/401064

Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Willem de Bruijn
---
 net/packet/af_packet.c | 43 ++++++++++++++++++-------------------------
 1 file changed, 18 insertions(+), 25 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d41b107..1ecfa71 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1915,6 +1915,10 @@ retry:
 		goto retry;
 	}

+	if (!dev_validate_header(dev, skb->data, len)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
 	if (len > (dev->mtu + dev->hard_header_len + extra_len) &&
 	    !packet_extra_vlan_len_allowed(dev, skb)) {
 		err = -EMSGSIZE;
@@ -2393,18 +2397,6 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }

-static bool ll_header_truncated(const struct net_device *dev, int len)
-{
-	/* net device doesn't like empty head */
-	if (unlikely(len < dev->hard_header_len)) {
-		net_warn_ratelimited("%s: packet size is too short (%d < %d)\n",
-				     current->comm, len, dev->hard_header_len);
-		return true;
-	}
-
-	return false;
-}
-
 static void tpacket_set_protocol(const struct net_device *dev,
 				 struct sk_buff *skb)
 {
@@ -2522,16 +2514,20 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		if (unlikely(err < 0))
 			return -EINVAL;
 	} else if (copylen) {
+		int hdrlen = min_t(int, copylen, tp_len);
+
 		skb_push(skb, dev->hard_header_len);
 		skb_put(skb, copylen - dev->hard_header_len);
-		err = skb_store_bits(skb, 0, data, copylen);
+		err = skb_store_bits(skb, 0, data, hdrlen);
 		if (unlikely(err))
 			return err;
+		if (!dev_validate_header(dev, skb->data, hdrlen))
+			return -EINVAL;
 		if (!skb->protocol)
 			tpacket_set_protocol(dev, skb);

-		data += copylen;
-		to_write -= copylen;
+		data += hdrlen;
+		to_write -= hdrlen;
 	}

 	offset = offset_in_page(data);
@@ -2703,13 +2699,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			copylen = __virtio16_to_cpu(vio_le(),
 						    vnet_hdr->hdr_len);
 		}
-		if (dev->hard_header_len) {
-			if (ll_header_truncated(dev, tp_len)) {
-				tp_len = -EINVAL;
-				goto tpacket_error;
-			}
-			copylen = max_t(int, copylen, dev->hard_header_len);
-		}
+		copylen = max_t(int, copylen, dev->hard_header_len);
 		skb = sock_alloc_send_skb(&po->sk,
 				hlen + tlen +
 				sizeof(struct sockaddr_ll) +
 				(copylen - dev->hard_header_len),
@@ -2905,9 +2895,6 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		offset = dev_hard_header(skb, dev, ntohs(proto), addr, NULL, len);
 		if (unlikely(offset < 0))
 			goto out_free;
-	} else {
-		if (ll_header_truncated(dev, len))
-			goto out_free;
 	}

 	/* Returns -EFAULT on error */
@@ -2915,6 +2902,12 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (err)
 		goto out_free;

+	if (sock->type == SOCK_RAW &&
+	    !dev_validate_header(dev, skb->data, len)) {
+		err = -EINVAL;
+		goto out_free;
+	}
+
 	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);

 	if (!vnet_hdr.gso_type && (len > dev->mtu + reserve + extra_len) &&
-- 
2.7.0.rc3.207.g0ac5344
[PATCH net-next 2/3] ax25: add link layer header validation function
From: Willem de Bruijn

As variable length protocol, AX25 fails link layer header validation tests based on a minimum length. header_ops.validate allows protocols to validate headers that are shorter than hard_header_len. Implement this callback for AX25.

See also http://comments.gmane.org/gmane.linux.network/401064

Signed-off-by: Willem de Bruijn
---
 net/ax25/ax25_ip.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index b563a3f..2fa3be9 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -228,8 +228,23 @@ netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 }
 #endif

+static bool ax25_validate_header(const char *header, unsigned int len)
+{
+	ax25_digi digi;
+
+	if (!len)
+		return false;
+
+	if (header[0])
+		return true;
+
+	return ax25_addr_parse(header + 1, len - 1, NULL, NULL, &digi, NULL,
+			       NULL);
+}
+
 const struct header_ops ax25_header_ops = {
 	.create = ax25_hard_header,
+	.validate = ax25_validate_header,
 };

 EXPORT_SYMBOL(ax25_header_ops);
-- 
2.7.0.rc3.207.g0ac5344
[PATCH net-next 1/3] net: validate variable length ll headers
From: Willem de Bruijn

Netdevice parameter hard_header_len is variously interpreted both as an upper and lower bound on link layer header length. The field is used as upper bound when reserving room at allocation, as lower bound when validating user input in PF_PACKET. Clarify the definition to be maximum header length.

For validation of untrusted headers, add an optional validate member to header_ops. Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance for deliberate testing of corrupt input. In this case, pad trailing bytes, as some device drivers expect completely initialized headers.

See also http://comments.gmane.org/gmane.linux.network/401064

Signed-off-by: Willem de Bruijn
---
 include/linux/netdevice.h | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efe7cec..fd30cb5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -268,6 +268,7 @@ struct header_ops {
 	void	(*cache_update)(struct hh_cache *hh,
 				const struct net_device *dev,
 				const unsigned char *haddr);
+	bool	(*validate)(const char *ll_header, unsigned int len);
 };

 /* These flag bits are private to the generic network queueing
@@ -1459,8 +1460,7 @@ enum netdev_priv_flags {
  *	@dma:		DMA channel
  *	@mtu:		Interface MTU value
  *	@type:		Interface hardware type
- *	@hard_header_len: Hardware header length, which means that this is the
- *			  minimum size of a packet.
+ *	@hard_header_len: Maximum hardware header length.
  *
  *	@needed_headroom: Extra headroom the hardware may need, but not in all
  *			  cases can this be guaranteed
@@ -2687,6 +2687,24 @@ static inline int dev_parse_header(const struct sk_buff *skb,
 	return dev->header_ops->parse(skb, haddr);
 }

+/* ll_header must have at least hard_header_len allocated */
+static inline bool dev_validate_header(const struct net_device *dev,
+				       char *ll_header, int len)
+{
+	if (likely(len >= dev->hard_header_len))
+		return true;
+
+	if (capable(CAP_SYS_RAWIO)) {
+		memset(ll_header + len, 0, dev->hard_header_len - len);
+		return true;
+	}
+
+	if (dev->header_ops && dev->header_ops->validate)
+		return dev->header_ops->validate(ll_header, len);
+
+	return false;
+}
+
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr, int len);
 int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
 static inline int unregister_gifconf(unsigned int family)
-- 
2.7.0.rc3.207.g0ac5344
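The shape of the new check can be illustrated with a toy variable-length protocol in userspace C (all names below are hypothetical): the first header byte encodes the actual header length, so a frame shorter than the maximum can still be complete, and only the protocol itself can tell:

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_MAX_HDR = 8 };	/* plays the role of hard_header_len */

/* Protocol-specific check, the analogue of header_ops.validate:
 * byte 0 carries the real header length for this toy protocol. */
static bool toy_validate(const char *ll_header, unsigned int len)
{
	return len >= 1 && (unsigned char)ll_header[0] <= len;
}

/* Analogue of dev_validate_header() without the CAP_SYS_RAWIO escape
 * hatch: full-length headers pass as before, while short ones are
 * referred to the protocol callback instead of being dropped. */
static bool toy_dev_validate_header(const char *ll_header, unsigned int len)
{
	if (len >= TOY_MAX_HDR)
		return true;
	return toy_validate(ll_header, len);
}
```

Under the old hard_header_len-only rule, every frame shorter than TOY_MAX_HDR would have been rejected; with the callback, a 3-byte frame whose length byte says 3 is accepted.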
[PATCH net-next 0/3] validate variable length ll headers
From: Willem de Bruijn

Allow device-specific validation of link layer headers. Existing checks drop all packets shorter than hard_header_len. For variable length protocols, such packets can be valid.

patch 1 adds header_ops.validate and dev_validate_header
patch 2 implements the protocol specific callback for AX25
patch 3 replaces ll_header_truncated with dev_validate_header

Willem de Bruijn (3):
  net: validate variable length ll headers
  ax25: add link layer header validation function
  packet: validate variable length ll headers

 include/linux/netdevice.h | 22 ++++++++++++++++++++--
 net/ax25/ax25_ip.c        | 15 +++++++++++++++
 net/packet/af_packet.c    | 43 ++++++++++++++++++-------------------------
 3 files changed, 53 insertions(+), 27 deletions(-)
-- 
2.7.0.rc3.207.g0ac5344
[PATCH net-next] bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS
0-day bot reported build error:

kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'

when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not. Add weak definition to resolve it. This code path in map_lookup_elem() is never taken when CONFIG_PERF_EVENTS is not set.

Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Reported-by: Fengguang Wu
Signed-off-by: Alexei Starovoitov
---
 kernel/bpf/syscall.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2978d0d08869..2a2efe1bc76c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -244,6 +244,11 @@ static void __user *u64_to_ptr(__u64 val)
 	return (void __user *) (unsigned long) val;
 }

+int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
+{
+	return -ENOTSUPP;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
-- 
2.8.0.rc1
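The linker mechanics behind this fix can be sketched in userspace C with GCC/Clang's `__attribute__((weak))`, which is what the kernel's `__weak` expands to. The function name below is a stand-in; the point is that the weak body is used only when no strong definition exists in any linked object:

```c
#include <assert.h>

#define ENOTSUPP 524	/* the kernel's ENOTSUPP value, for illustration */

/* Weak fallback, analogous to the stub added for bpf_stackmap_copy():
 * if some other translation unit provided a strong stackmap_copy(),
 * the linker would pick that one and this body would be discarded.
 * Here nothing else defines it, so the stub resolves the reference. */
__attribute__((weak)) int stackmap_copy(void *key, void *value)
{
	(void)key;
	(void)value;
	return -ENOTSUPP;
}
```

This is why the patch fixes the link error without touching the call site: the reference in map_lookup_elem() always resolves, and the stub body is only ever reached on a path the commit message says is never taken when CONFIG_PERF_EVENTS is off.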
[RFC PATCH 1/1] net namespace: dynamically configure new net namespace inherit net config
Sometimes the system engineer and application expect a new net namespace to inherit config from the base net config. Sometimes the current net config is expected by the system engineer and application. So it is necessary that the system engineer and application can choose a new net namespace to inherit from the base net config, or the current net config. For example, the value of /proc/sys/net/ipv4/ip_forward is taken as an example. The value of /proc/sys/net/ipv4/ip_forward in the base net config is 0 while the value of /proc/sys/net/ipv4/ip_forward is changed to 1 in the current net config. The system engineer and application can choose a new net namespace to inherit the value of /proc/sys/net/ipv4/ip_forward from the base or the current settings. Test case: 1. % cat /proc/sys/net/ipv4/net_ns_inherit 1 2. Set ip forwarding in the "base namespace" % echo 1 > /proc/sys/net/ipv4/ip_forward % cat /proc/sys/net/ipv4/ip_forward 1 3. Create a new namespace % ip netns add mynewns 4. Check ip forwarding in the new namespace % ip netns exec mynewns cat /proc/sys/net/ipv4/ip_forward 1 5. % echo 0 > /proc/sys/net/ipv4/net_ns_inherit % cat /proc/sys/net/ipv4/net_ns_inherit 0 6. Set ip forwarding in the "base namespace" % echo 1 > /proc/sys/net/ipv4/ip_forward % cat /proc/sys/net/ipv4/ip_forward 1 7. Create a new namespace % ip netns add mynewns_new 8. Check ip forwarding in the new namespace % ip netns exec mynewns_new cat /proc/sys/net/ipv4/ip_forward 0 Suggested-by: Bruce Ashfield Signed-off-by: Zhu Yanjun CC: David S. 
Miller CC: Alexey Kuznetsov CC: James Morris CC: Hideaki YOSHIFUJI CC: Patrick McHardy --- include/linux/inetdevice.h |2 +- include/net/ip.h|3 +++ include/uapi/linux/sysctl.h |1 + net/ipv4/devinet.c | 58 --- net/ipv4/sysctl_net_ipv4.c |7 ++ 5 files changed, 61 insertions(+), 10 deletions(-) diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h index ee971f3..1c0ae93 100644 --- a/include/linux/inetdevice.h +++ b/include/linux/inetdevice.h @@ -164,7 +164,7 @@ static inline struct net_device *ip_dev_find(struct net *net, __be32 addr) int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b); int devinet_ioctl(struct net *net, unsigned int cmd, void __user *); -void devinet_init(void); +int devinet_init(void); struct in_device *inetdev_by_index(struct net *, int); __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope); __be32 inet_confirm_addr(struct net *net, struct in_device *in_dev, __be32 dst, diff --git a/include/net/ip.h b/include/net/ip.h index 1a98f1c..0ad4a7d 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -245,6 +245,9 @@ extern int inet_peer_threshold; extern int inet_peer_minttl; extern int inet_peer_maxttl; +/* From devinet.c */ +extern int net_ns_inherit; + /* From ip_input.c */ extern int sysctl_ip_early_demux; diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 0956373..350c3ce 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -426,6 +426,7 @@ enum NET_TCP_ALLOWED_CONG_CONTROL=123, NET_TCP_MAX_SSTHRESH=124, NET_TCP_FRTO_RESPONSE=125, + NET_IPV4_NET_NS_INHERIT = 126, }; enum { diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index cebd9d3..b68d7fa 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -2277,28 +2277,31 @@ static struct ctl_table ctl_forward_entry[] = { }; #endif +#define NET_NS_INIT_DEFAULT0 +#define NET_NS_INIT_MODIFIED 1 + +/* net ns initialized from current */ +int net_ns_inherit __read_mostly = NET_NS_INIT_MODIFIED; 
+static struct ipv4_devconf *all_backup, *dflt_backup; + static __net_init int devinet_init_net(struct net *net) { int err; - struct ipv4_devconf *all, *dflt; + struct ipv4_devconf *all = NULL, *dflt = NULL; #ifdef CONFIG_SYSCTL struct ctl_table *tbl = ctl_forward_entry; struct ctl_table_header *forw_hdr; #endif - err = -ENOMEM; - all = &ipv4_devconf; - dflt = &ipv4_devconf_dflt; - if (!net_eq(net, &init_net)) { - all = kmemdup(all, sizeof(ipv4_devconf), GFP_KERNEL); + if (net_ns_inherit == NET_NS_INIT_DEFAULT) { + all = kmemdup(all_backup, sizeof(ipv4_devconf), GFP_KERNEL); if (!all) goto err_alloc_all; - dflt = kmemdup(dflt, sizeof(ipv4_devconf_dflt), GFP_KERNEL); + dflt = kmemdup(dflt_backup, sizeof(ipv4_devconf_dflt), GFP_KERNEL); if (!dflt) goto err_alloc_dflt; - #ifdef CONFIG_SYSCTL tbl = kmemdup(tbl, sizeof(ctl_forward_entry), GFP_KERNEL); if (!tbl) @@ -2309,6 +2312,29 @@ static __net_init int devinet_init_net(struct net *net) tbl[0].extra2 = net; #endif
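The decision the patch adds to devinet_init_net() can be sketched as a small userspace C model — a new namespace copies its devconf either from a pristine backup taken at init time (NET_NS_INIT_DEFAULT) or from the possibly-modified current config (NET_NS_INIT_MODIFIED). The struct and function names below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <string.h>

#define NET_NS_INIT_DEFAULT  0
#define NET_NS_INIT_MODIFIED 1

/* toy stand-in for struct ipv4_devconf */
struct devconf { int ip_forward; };

static struct devconf base_backup = { .ip_forward = 0 }; /* snapshot at init */
static struct devconf current_cfg = { .ip_forward = 1 }; /* modified later  */

/* what devinet_init_net() does for a new namespace, conceptually:
 * kmemdup() from the backup or from the live config, depending on
 * the net_ns_inherit sysctl */
struct devconf new_ns_conf(int net_ns_inherit)
{
	struct devconf c;

	if (net_ns_inherit == NET_NS_INIT_DEFAULT)
		memcpy(&c, &base_backup, sizeof(c));
	else
		memcpy(&c, &current_cfg, sizeof(c));
	return c;
}
```

This mirrors the test case in the commit message: with net_ns_inherit=1 the new namespace sees ip_forward=1; after `echo 0 > .../net_ns_inherit` it sees the base value 0.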
Re: [PATCH nf-next v9 7/8] openvswitch: Delay conntrack helper call for new connections.
On 9 March 2016 at 15:10, Jarno Rajahalme wrote:
> There is no need to help connections that are not confirmed, so we can
> delay helping new connections to the time when they are confirmed.
> This change is needed for NAT support, and having this as a separate
> patch will make the following NAT patch a bit easier to review.
>
> Signed-off-by: Jarno Rajahalme
> ---
>  net/openvswitch/conntrack.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 92613de..5711f80 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -506,11 +510,17 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
>  			return -ENOENT;
>
>  		ovs_ct_update_key(skb, info, key, true);
> +	}
>
> -	if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
> -		WARN_ONCE(1, "helper rejected packet");
> -		return -EINVAL;
> -	}
> +	/* Call the helper only if we did nf_conntrack_in() above ('!cached')
> +	 * for confirmed connections, but only when committing for unconfirmed
> +	 * connections.
> +	 */

Minor nit, try this wording for readability?

/* Call the helper only if:
 * - nf_conntrack_in() was executed above ("!cached") for a confirmed
 *   connection, or
 * - when committing an unconfirmed connection.
 */
Re: [PATCH nf-next v9 5/8] openvswitch: Find existing conntrack entry after upcall.
On 9 March 2016 at 15:10, Jarno Rajahalme wrote: > Add a new function ovs_ct_find_existing() to find an existing > conntrack entry for which this packet was already applied to. This is > only to be called when there is evidence that the packet was already > tracked and committed, but we lost the ct reference due to an > userspace upcall. > > ovs_ct_find_existing() is called from skb_nfct_cached(), which can now > hide the fact that the ct reference may have been lost due to an > upcall. This allows ovs_ct_commit() to be simplified. > > This patch is needed by later "openvswitch: Interface with NAT" patch, > as we need to be able to pass the packet through NAT using the > original ct reference also after the reference is lost after an > upcall. > > Signed-off-by: Jarno Rajahalme Acked-by: Joe Stringer
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On Wed, Mar 9, 2016 at 5:45 PM, Jonas Markussen wrote: > >> On 10 Mar 2016, at 01:20, Yuchung Cheng wrote: >> >> PS. I don't understand how (old) RDB can masquerade the losses by >> skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets >> and the last 3 were (s)acked. We perform RDB to send a packet that has >> previous 4 payloads + 1 new byte. The sender still gets the loss >> information? >> > > If I’ve understood you correctly, you’re talking about sending 4 > packets and the first one is lost? > > In this case, RDB will not only bundle on the last/new packet but also > as it sends packet 2 (which will contain 1+2), packet 3 (1+2+3) > and packet 4 (1+2+3+4). > > So the fact that packet 1 was lost is masqueraded when it is > recovered by packet 2 and there won’t be any gap in the SACK window > indicating that packet 1 was lost. I see. Thanks for the clarification. So my question is still if thin-stream app has enough inflight to use ack-triggered recovery. i.e., it has to send at least twice within an RTT. Also have you tested this with non-Linux receivers? Thanks. > > Best regards, > Jonas Markussen
Re: [ovs-dev] [PATCH nf-next v9 4/8] openvswitch: Update the CT state key only after nf_conntrack_in().
On 9 March 2016 at 15:10, Jarno Rajahalme wrote: > Only a successful nf_conntrack_in() call can effect a connection state > change, so if suffices to update the key only after the > nf_conntrack_in() returns. "it" suffices to update... > This change is needed for the later NAT patches. > > Signed-off-by: Jarno Rajahalme Acked-by: Joe Stringer
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
> On 10 Mar 2016, at 01:20, Yuchung Cheng wrote: > > PS. I don't understand how (old) RDB can masquerade the losses by > skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets > and the last 3 were (s)acked. We perform RDB to send a packet that has > previous 4 payloads + 1 new byte. The sender still gets the loss > information? > If I’ve understood you correctly, you’re talking about sending 4 packets and the first one is lost? In this case, RDB will not only bundle on the last/new packet but also as it sends packet 2 (which will contain 1+2), packet 3 (1+2+3) and packet 4 (1+2+3+4). So the fact that packet 1 was lost is masqueraded when it is recovered by packet 2 and there won’t be any gap in the SACK window indicating that packet 1 was lost. Best regards, Jonas Markussen
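Jonas's example can be expressed as a toy receiver model: because an RDB segment always re-carries everything from the first unacked byte, the loss of packet 1 never opens a SACK hole at the receiver, so no DUPACKs are generated. This is purely an illustration of the discussion above, not code from the patch set:

```c
#include <assert.h>

/* receiver state: highest contiguous byte received (cumulative ACK point) */
int rcv_nxt;

/* deliver a segment covering bytes [start, end) */
void deliver(int start, int end)
{
	if (start <= rcv_nxt && end > rcv_nxt)
		rcv_nxt = end;	/* fills or extends the head of line */
	/* start > rcv_nxt would leave a gap that SACK reports as a hole;
	 * RDB segments always start at the first unacked byte, so with
	 * RDB that branch is never taken */
}
```

With four 1-byte payloads and packet 1 lost: a non-RDB packet 2 covers only [1,2) and leaves a hole at [0,1), while an RDB packet 2 covers [0,2) and silently repairs the loss — which is exactly why the sender never learns about it.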
[PATCH net] net: hns: bug fix about ping6
The current upstream code fails to ping other IPv6 net devices, because the enet receives multicast packets whose source MAC address is the same as its own MAC address. These packets need to be dropped.

Signed-off-by: Kejian Yan
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 3f77ff7..9ad5da4 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -564,6 +564,7 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	struct sk_buff *skb;
 	struct hnae_desc *desc;
 	struct hnae_desc_cb *desc_cb;
+	struct ethhdr *eh;
 	unsigned char *va;
 	int bnum, length, i;
 	int pull_len;
@@ -670,6 +671,14 @@ out_bnum_err:
 		return -EFAULT;
 	}
 
+	/* filter out multicast pkt with the same src mac as this port */
+	eh = (struct ethhdr *)skb->data;
+	if (unlikely(is_multicast_ether_addr(eh->h_dest) &&
+		     ether_addr_equal(ndev->dev_addr, eh->h_source))) {
+		dev_kfree_skb_any(skb);
+		return -EFAULT;
+	}
+
 	ring->stats.rx_pkts++;
 	ring->stats.rx_bytes += skb->len;
-- 
1.9.1
[patch net 2/2] net: hns: fixes a bug of RSS
When trying to get the receive flow hash indirection table by ethtool, ethtool needs to call .get_rxnfc to get the ring number first, so this patch implements the .get_rxnfc ethtool op. Also, the data type of rss_indir_table entries is u32, so the entry count has to be multiplied by the width of the data type when using memcpy.

Signed-off-by: Kejian Yan
---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |  6 --
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c  | 20 
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index 3b8f301..c733a5a 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -779,7 +779,8 @@ static int hns_ae_get_rss(struct hnae_handle *handle, u32 *indir, u8 *key,
 	memcpy(key, ppe_cb->rss_key, HNS_PPEV2_RSS_KEY_SIZE);
 
 	/* update the current hash->queue mappings from the shadow RSS table */
-	memcpy(indir, ppe_cb->rss_indir_table, HNS_PPEV2_RSS_IND_TBL_SIZE);
+	memcpy(indir, ppe_cb->rss_indir_table,
+	       HNS_PPEV2_RSS_IND_TBL_SIZE * sizeof(*indir));
 
 	return 0;
 }
@@ -794,7 +795,8 @@ static int hns_ae_set_rss(struct hnae_handle *handle, const u32 *indir,
 		hns_ppe_set_rss_key(ppe_cb, (u32 *)key);
 
 	/* update the shadow RSS table with user specified qids */
-	memcpy(ppe_cb->rss_indir_table, indir, HNS_PPEV2_RSS_IND_TBL_SIZE);
+	memcpy(ppe_cb->rss_indir_table, indir,
+	       HNS_PPEV2_RSS_IND_TBL_SIZE * sizeof(*indir));
 
 	/* now update the hardware */
 	hns_ppe_set_indir_table(ppe_cb, ppe_cb->rss_indir_table);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index ada8e04..a070392 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -1244,6 +1244,25 @@ hns_set_rss(struct net_device *netdev, const u32 *indir, const u8 *key,
 	return ret;
 }
 
+static int hns_get_rxnfc(struct net_device *netdev,
+			 struct ethtool_rxnfc *cmd,
+			 u32 *rule_locs)
+{
+	struct hns_nic_priv *priv = netdev_priv(netdev);
+	int ret = 0;
+
+	switch (cmd->cmd) {
+	case ETHTOOL_GRXRINGS:
+		cmd->data = priv->ae_handle->q_num;
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		break;
+	}
+
+	return ret;
+}
+
 static struct ethtool_ops hns_ethtool_ops = {
 	.get_drvinfo = hns_nic_get_drvinfo,
 	.get_link = hns_nic_get_link,
@@ -1267,6 +1286,7 @@ static struct ethtool_ops hns_ethtool_ops = {
 	.get_rxfh_indir_size = hns_get_rss_indir_size,
 	.get_rxfh = hns_get_rss,
 	.set_rxfh = hns_set_rss,
+	.get_rxnfc = hns_get_rxnfc,
 };
 
 void hns_ethtool_set_ops(struct net_device *ndev)
-- 
1.9.1
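The sizeof() fix above comes down to one property of memcpy(): its size argument counts bytes, not array elements. Copying a u32 table with the element count moves only a quarter of the data, as this standalone sketch shows (the table size here is illustrative, not the hardware's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IND_TBL_SIZE 256	/* number of u32 entries, illustrative */

uint32_t shadow[IND_TBL_SIZE];	/* driver's shadow RSS table */
uint32_t user_tbl[IND_TBL_SIZE];	/* buffer handed to ethtool */

/* the bug: copies IND_TBL_SIZE *bytes*, i.e. only 64 u32 entries */
void copy_wrong(void)
{
	memcpy(user_tbl, shadow, IND_TBL_SIZE);
}

/* the fix: multiply by the element width to copy the whole table */
void copy_right(void)
{
	memcpy(user_tbl, shadow, IND_TBL_SIZE * sizeof(*user_tbl));
}
```

With the buggy variant, ethtool would only ever see the first quarter of the indirection table — the "can not get the total table" symptom from the cover letter.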
[patch net 0/2] net: hns: get and set RSS indirection table by using ethtool
When we use ethtool to retrieve or configure the receive flow hash indirection table, ethtool needs to call .get_rxnfc to get the ring number, so this patchset implements .get_rxnfc and fixes the bug that we could not get the total table each time.

Kejian Yan (2):
  net: hns: fix return value of the function about rss
  net: hns: fixes a bug of RSS

 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |  8 ---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c |  2 +-
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c  | 28 +++
 3 files changed, 30 insertions(+), 8 deletions(-)

-- 
1.9.1
[patch net 1/2] net: hns: fix return value of the function about rss
Both .get_rxfh and .set_rxfh always return 0; they should instead return the result from the hardware when getting or setting RSS. And the RSS functions should use the correct data type.

Signed-off-by: Kejian Yan
---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c  | 8
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index a0070d0..3b8f301 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -791,7 +791,7 @@ static int hns_ae_set_rss(struct hnae_handle *handle, const u32 *indir,
 
 	/* set the RSS Hash Key if specififed by the user */
 	if (key)
-		hns_ppe_set_rss_key(ppe_cb, (int *)key);
+		hns_ppe_set_rss_key(ppe_cb, (u32 *)key);
 
 	/* update the shadow RSS table with user specified qids */
 	memcpy(ppe_cb->rss_indir_table, indir, HNS_PPEV2_RSS_IND_TBL_SIZE);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index f302ef9..811ef35 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -27,7 +27,7 @@ void hns_ppe_set_tso_enable(struct hns_ppe_cb *ppe_cb, u32 value)
 void hns_ppe_set_rss_key(struct hns_ppe_cb *ppe_cb,
 			 const u32 rss_key[HNS_PPEV2_RSS_KEY_NUM])
 {
-	int key_item = 0;
+	u32 key_item = 0;
 
 	for (key_item = 0; key_item < HNS_PPEV2_RSS_KEY_NUM; key_item++)
 		dsaf_write_dev(ppe_cb, PPEV2_RSS_KEY_REG + key_item * 0x4,
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index 3df2284..ada8e04 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -1165,7 +1165,7 @@ hns_get_rss_key_size(struct net_device *netdev)
 	if (AE_IS_VER1(priv->enet_ver)) {
 		netdev_err(netdev,
 			   "RSS feature is not supported on this hardware\n");
-		return -EOPNOTSUPP;
+		return (u32)-EOPNOTSUPP;
 	}
 
 	ops = priv->ae_handle->dev->ops;
@@ -1184,7 +1184,7 @@ hns_get_rss_indir_size(struct net_device *netdev)
 	if (AE_IS_VER1(priv->enet_ver)) {
 		netdev_err(netdev,
 			   "RSS feature is not supported on this hardware\n");
-		return -EOPNOTSUPP;
+		return (u32)-EOPNOTSUPP;
 	}
 
 	ops = priv->ae_handle->dev->ops;
@@ -1213,7 +1213,7 @@ hns_get_rss(struct net_device *netdev, u32 *indir, u8 *key, u8 *hfunc)
 
 	ret = ops->get_rss(priv->ae_handle, indir, key, hfunc);
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1241,7 +1241,7 @@ hns_set_rss(struct net_device *netdev, const u32 *indir, const u8 *key,
 
 	ret = ops->set_rss(priv->ae_handle, indir, key, hfunc);
 
-	return 0;
+	return ret;
 }
 
 static struct ethtool_ops hns_ethtool_ops = {
-- 
1.9.1
Re: [PATCH net-next v3] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
On Wed, Mar 09, 2016 at 03:11:50PM -0800, Yuchung Cheng wrote: > On Wed, Mar 9, 2016 at 10:43 AM, Martin KaFai Lau wrote: > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index e90db85..24557a8 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct > > sk_buff *skb) > > skb->truesize = 2; > > } > > > > +static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff > > *skb) > > +{ > > + u16 segs_in; > > + > > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > > + tp->segs_in += segs_in; > > + if (skb->len > tcp_hdrlen(skb)) > > + tp->data_segs_in += segs_in; > > +} > > + > > #endif /* _TCP_H */ > > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > > index fdb286d..f583c85 100644 > > --- a/net/ipv4/tcp_fastopen.c > > +++ b/net/ipv4/tcp_fastopen.c > > @@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock > > *req, > > void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > > { > > struct tcp_sock *tp = tcp_sk(sk); > > + u16 segs_in; > > > > if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt) > > return; > > @@ -154,6 +155,9 @@ void tcp_fastopen_add_skb(struct sock *sk, struct > > sk_buff *skb) > > * as we certainly are not changing upper 32bit value (0) > > */ > > tp->bytes_received = skb->len; > > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > > + tp->segs_in = segs_in; > > + tp->data_segs_in = segs_in; > why not use the new tcp_segs_in() helper? It is because segs_in has been set to 1 by tcp_create_openreq_child(), so calling tcp_segs_in() will have it double counted and tcphdr has been pulled from skb, so skb->len does not include tcp_hdrlen. I will add some remark in this change.
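Martin's double-counting point can be modeled in a few lines: the child socket is created with segs_in already set to 1 (for the SYN), so the fastopen path must assign rather than reuse the "+=" helper. The struct and function names below are a simplified userspace model of the thread's explanation, not the real tcp_sock code:

```c
#include <assert.h>

/* toy stand-in for the relevant tcp_sock fields */
struct tp_model {
	unsigned short segs_in;
	unsigned short data_segs_in;
};

/* max_t(u16, 1, gso_segs) */
unsigned short seg_count(unsigned short gso_segs)
{
	return gso_segs > 1 ? gso_segs : 1;
}

/* the generic helper: adds (normal receive path) */
void tcp_segs_in_model(struct tp_model *tp, unsigned short gso_segs)
{
	tp->segs_in += seg_count(gso_segs);
}

/* fastopen path: the SYN was already counted by
 * tcp_create_openreq_child(), so assign instead of adding */
void fastopen_add_skb_model(struct tp_model *tp, unsigned short gso_segs)
{
	unsigned short s = seg_count(gso_segs);

	tp->segs_in = s;
	tp->data_segs_in = s;
}
```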
Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
On Wed, Mar 9, 2016 at 4:18 PM, Joe Perches wrote: > On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: >> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert wrote: >> > I took a look inlining these. >> > >> > #define rol32(V, X) ({ \ >> > int word = V; \ >> > if (__builtin_constant_p(X))\ >> > asm("roll $" #X ",%[word]\n\t" \ >> > : [word] "=r" (word)); \ >> > else\ >> > asm("roll %%cl,%[word]\n\t" \ >> > : [word] "=r" (word)\ >> > : "c" (X)); \ >> > word; \ >> > }) >> > >> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... >> Is gcc really not converting the rol32 calls into rotates? > > No, it is. > > The difference in the object code with the asm for instance is: > > (old, compiled with gcc 5.3.1) > > : > 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi > 854: 81 ef 09 41 52 21 sub$0x21524109,%edi > 85a: 55 push %rbp > 85b: 89 f0 mov%esi,%eax > 85d: 89 f2 mov%esi,%edx > 85f: 48 ff 05 00 00 00 00incq 0x0(%rip)# 866 > > 866: c1 c2 0erol$0xe,%edx > 869: 35 f7 be ad de xor$0xdeadbef7,%eax > 86e: 48 89 e5mov%rsp,%rbp > 871: 29 d0 sub%edx,%eax > 873: 48 ff 05 00 00 00 00incq 0x0(%rip)# 87a > > 87a: 48 ff 05 00 00 00 00incq 0x0(%rip)# 881 > > 881: 89 c2 mov%eax,%edx > 883: 31 c7 xor%eax,%edi > 885: c1 c2 0brol$0xb,%edx > 888: 29 d7 sub%edx,%edi > 88a: 89 fa mov%edi,%edx > 88c: 31 fe xor%edi,%esi > 88e: c1 ca 07ror$0x7,%edx > 891: 29 d6 sub%edx,%esi > 893: 89 f2 mov%esi,%edx > 895: 31 f0 xor%esi,%eax > 897: c1 c2 10rol$0x10,%edx > 89a: 29 d0 sub%edx,%eax > 89c: 89 c2 mov%eax,%edx > 89e: 31 c7 xor%eax,%edi > 8a0: c1 c2 04rol$0x4,%edx > 8a3: 29 d7 sub%edx,%edi > 8a5: 31 fe xor%edi,%esi > 8a7: c1 c7 0erol$0xe,%edi > 8aa: 29 fe sub%edi,%esi > 8ac: 31 f0 xor%esi,%eax > 8ae: c1 ce 08ror$0x8,%esi > 8b1: 29 f0 sub%esi,%eax > 8b3: 5d pop%rbp > 8b4: c3 retq > > vs Tom's asm > > 084e : > 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi > 854: 8d 87 f7 be ad de lea-0x21524109(%rdi),%eax > 85a: 55 push %rbp > 85b: 89 f2 mov%esi,%edx > 85d: 48 ff 05 00 00 00 00incq 
0x0(%rip)# 864 > > 864: 48 ff 05 00 00 00 00incq 0x0(%rip)# 86b > > 86b: 81 f2 f7 be ad de xor$0xdeadbef7,%edx > 871: 48 89 e5mov%rsp,%rbp > 874: c1 c1 0erol$0xe,%ecx > 877: 29 ca sub%ecx,%edx > 879: 31 d0 xor%edx,%eax > 87b: c1 c7 0brol$0xb,%edi > 87e: 29 f8 sub%edi,%eax > 880: 48 ff 05 00 00 00 00incq 0x0(%rip)# 887 > > 887: 31 c6 xor%eax,%esi > 889: c1 c7 19rol$0x19,%edi > 88c: 29 fe sub%edi,%esi > 88e: 31 f2 xor%esi,%edx > 890: c1 c7 10rol$0x10,%edi > 893: 29 fa sub%edi,%edx > 895: 31 d0 xor%edx,%eax > 897: c1 c7 04rol$0x4,%edi > 89a: 29 f8 sub%edi,%eax > 89c: 31 f0 xor%esi,%eax > 89e: 29 c8 sub%ecx,%eax > 8a0: 31 d0 xor%edx,%eax > 8a2: 5d pop%rbp > 8a3: c1 c2 18rol$0x18,%edx > 8a6: 29 d0
Re: [ovs-dev] [PATCH net-next] ovs: allow nl 'flow set' to use ufid without flow key
On Wed, Mar 9, 2016 at 4:25 PM, Samuel Gauthier wrote: > Sorry, I missed that. Thank you for pointing it out. > > Although, set command is also used to reset the flow statistics, and the > action attribute seems optional. Would you find acceptable to make the key > attribute mandatory only if the action attribute is provided? > > The ovs-dpctl command could then be simplified to support ovs-dpctl --clear > mod-flow ufid: instead of having to specify the flow and action when > you just want to reset the statistics of a flow. > That is fine with me. Thanks, Pravin.
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad wrote: > > Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing > the latency for applications sending time-dependent data. > Latency-sensitive applications or services, such as online games and > remote desktop, produce traffic with thin-stream characteristics, > characterized by small packets and a relatively high ITT. By bundling > already sent data in packets with new data, RDB alleviates head-of-line > blocking by reducing the need to retransmit data segments when packets > are lost. RDB is a continuation on the work on latency improvements for > TCP in Linux, previously resulting in two thin-stream mechanisms in the > Linux kernel > (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). > > The RDB implementation has been thoroughly tested, and shows > significant latency reductions when packet loss occurs[1]. The tests > show that, by imposing restrictions on the bundling rate, it can be > made not to negatively affect competing traffic in an unfair manner. > > Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" > (http://patchwork.ozlabs.org/patch/510674) > > These patches have also been tested with as set of packetdrill scripts > located at > https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb > (The tests require patching packetdrill with a new socket option: > https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) > > Detailed info about the RDB mechanism can be found at > http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as > in the paper "Latency and Fairness Trade-Off for Thin Streams using > Redundant Data Bundling in TCP"[2]. > > [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf > [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf I read the paper. I think the underlying idea is neat. 
but the implementation is a little heavyweight: it requires changes on the fast path (tcp_write_xmit) and space in the skb control block, while ultimately this patch is meant for a small set of specific applications.

In my mental model (please correct me if I am wrong), losses on these thin streams would mostly resort to RTOs instead of fast recovery, due to the bursty nature of Internet losses. The HOLB comes from the RTO retransmitting only the first (tiny) unacked packet while a small amount of new data is readily available. But since Linux congestion control is packet-based, and the cwnd after loss is 1, the new data needs to wait until the 1st packet is acked, which takes another RTT. Instead, what if we only performed RDB on the (first and recurring) RTO retransmissions?

PS. I don't understand how (old) RDB can masquerade the losses by skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets and the last 3 were (s)acked. We perform RDB to send a packet that has the previous 4 payloads + 1 new byte. The sender still gets the loss information?

>
> Changes:
>
> v6 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
>    * Minor doc changes
>
>  * tcp-Add-DPIFL-thin-stream-detection-mechanism:
>    * Minor doc changes
>
> v5 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Removed two unnecessary EXPORT_SYMOBOLs (Thanks Eric)
>    * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
>    * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
>    * Merged the two if tests for max payload of RDB packet in
>      rdb_can_bundle_test()
>    * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss()
>      and restructured to reduce indentation.
>    * Improved docs
>    * Revised commit message to be more detailed.
> > * tcp-Add-DPIFL-thin-stream-detection-mechanism: >* Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c) > > v4 (PATCH): > * tcp-Add-Redundant-Data-Bundling-RDB: >* Moved skb_append_data() to tcp_output.c and call this > function from tcp_collapse_retrans() as well. >* Merged functionality of create_rdb_skb() into > tcp_transmit_rdb_skb() >* Removed one parameter from rdb_can_bundle_test() > > v3 (PATCH): > * tcp-Add-Redundant-Data-Bundling-RDB: >* Changed name of sysctl variable from tcp_rdb_max_skbs to > tcp_rdb_max_packets after comment from Eric Dumazet about > not exposing internal (kernel) names like skb. >* Formatting and function docs fixes > > v2 (RFC/PATCH): > * tcp-Add-DPIFL-thin-stream-detection-mechanism: >* Change calculation in tcp_stream_is_thin_dpifl based on > feedback from Eric Dumazet. > > * tcp-Add-Redundant-Data-Bundling-RDB: >* Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) > to reduce complexity as commented by Neal Cardwell. >* Cleaned up loss detection code in rdb_check_rtx_queue_loss > > v1 (RFC/PATCH) > > > Bendik Rønning Opstad (2): > tcp: Add DPIFL thin stream detection mechanism
Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: > On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert wrote: > > I took a look inlining these. > > > > #define rol32(V, X) ({ \ > > int word = V; \ > > if (__builtin_constant_p(X))\ > > asm("roll $" #X ",%[word]\n\t" \ > > : [word] "=r" (word)); \ > > else\ > > asm("roll %%cl,%[word]\n\t" \ > > : [word] "=r" (word)\ > > : "c" (X)); \ > > word; \ > > }) > > > > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... > Is gcc really not converting the rol32 calls into rotates? No, it is. The difference in the object code with the asm for instance is: (old, compiled with gcc 5.3.1) : 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi 854: 81 ef 09 41 52 21 sub$0x21524109,%edi 85a: 55 push %rbp 85b: 89 f0 mov%esi,%eax 85d: 89 f2 mov%esi,%edx 85f: 48 ff 05 00 00 00 00incq 0x0(%rip)# 866 866: c1 c2 0erol$0xe,%edx 869: 35 f7 be ad de xor$0xdeadbef7,%eax 86e: 48 89 e5mov%rsp,%rbp 871: 29 d0 sub%edx,%eax 873: 48 ff 05 00 00 00 00incq 0x0(%rip)# 87a 87a: 48 ff 05 00 00 00 00incq 0x0(%rip)# 881 881: 89 c2 mov%eax,%edx 883: 31 c7 xor%eax,%edi 885: c1 c2 0brol$0xb,%edx 888: 29 d7 sub%edx,%edi 88a: 89 fa mov%edi,%edx 88c: 31 fe xor%edi,%esi 88e: c1 ca 07ror$0x7,%edx 891: 29 d6 sub%edx,%esi 893: 89 f2 mov%esi,%edx 895: 31 f0 xor%esi,%eax 897: c1 c2 10rol$0x10,%edx 89a: 29 d0 sub%edx,%eax 89c: 89 c2 mov%eax,%edx 89e: 31 c7 xor%eax,%edi 8a0: c1 c2 04rol$0x4,%edx 8a3: 29 d7 sub%edx,%edi 8a5: 31 fe xor%edi,%esi 8a7: c1 c7 0erol$0xe,%edi 8aa: 29 fe sub%edi,%esi 8ac: 31 f0 xor%esi,%eax 8ae: c1 ce 08ror$0x8,%esi 8b1: 29 f0 sub%esi,%eax 8b3: 5d pop%rbp 8b4: c3 retq vs Tom's asm 084e : 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi 854: 8d 87 f7 be ad de lea-0x21524109(%rdi),%eax 85a: 55 push %rbp 85b: 89 f2 mov%esi,%edx 85d: 48 ff 05 00 00 00 00incq 0x0(%rip)# 864 864: 48 ff 05 00 00 00 00incq 0x0(%rip)# 86b 86b: 81 f2 f7 be ad de xor$0xdeadbef7,%edx 871: 48 89 e5mov%rsp,%rbp 874: c1 c1 0erol$0xe,%ecx 877: 29 ca sub%ecx,%edx 879: 
31 d0 xor%edx,%eax 87b: c1 c7 0brol$0xb,%edi 87e: 29 f8 sub%edi,%eax 880: 48 ff 05 00 00 00 00incq 0x0(%rip)# 887 887: 31 c6 xor%eax,%esi 889: c1 c7 19rol$0x19,%edi 88c: 29 fe sub%edi,%esi 88e: 31 f2 xor%esi,%edx 890: c1 c7 10rol$0x10,%edi 893: 29 fa sub%edi,%edx 895: 31 d0 xor%edx,%eax 897: c1 c7 04rol$0x4,%edi 89a: 29 f8 sub%edi,%eax 89c: 31 f0 xor%esi,%eax 89e: 29 c8 sub%ecx,%eax 8a0: 31 d0 xor%edx,%eax 8a2: 5d pop%rbp 8a3: c1 c2 18rol$0x18,%edx 8a6: 29 d0 sub%edx,%eax 8a8: c3 retq > If we need this type of code in order to get the rotates to occur as > expected then maybe we need to look at doing arch specific versions o
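For readers following the thread: the portable C idiom below is the generic rotate pattern that modern gcc and clang recognize and compile down to a single rol/ror on x86, without inline asm. This is the textbook pattern, not necessarily the kernel's exact rol32 from bitops.h; the masking avoids undefined behavior for a shift count of 0:

```c
#include <assert.h>
#include <stdint.h>

/* Portable 32-bit rotate-left. Both shift counts are masked to 0..31,
 * so k == 0 is well defined (a naive "v >> (32 - k)" would shift by
 * 32, which is undefined behavior in C). Compilers pattern-match this
 * into one rotate instruction. */
static inline uint32_t rol32(uint32_t v, unsigned int k)
{
	k &= 31;
	return (v << k) | (v >> ((32 - k) & 31));
}
```

Whether the hand-written asm above wins in practice then comes down to register allocation around the rotates, as the two object-code dumps in this thread show, not to whether a rotate instruction is emitted at all.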
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Tue, Mar 8, 2016 at 10:42 AM, Mahesh Bandewar wrote: >> The more subsystems involves, the more struct net pointers you >> potentially need to touch, the less likely you can make it correct >> by just switching skb->dev. > > Please drop that prejudice and read the patch-set carefully. I'm > neither changing > any *net* pointers nor changing the skb->dev pointers anywhere. All I'm saying > is dev->l3_dev is what we'll use for *all* L3 processing and no need to change > skb->dev or net-ns of any device(s) involved. Please don't misinterpret me. I said _switching_, not overwriting or changing, you use dev->l3_dev to _switch_ skb->dev and/or net, this is exactly what I am complaining about. This is not how netns works.
RE: [net PATCH 1/2] e1000: Do not overestimate descriptor counts in Tx pre-check
> From: netdev-ow...@vger.kernel.org [mailto:netdev- > ow...@vger.kernel.org] On Behalf Of Alexander Duyck > Sent: Wednesday, March 2, 2016 1:16 PM > To: netdev@vger.kernel.org; jogre...@redhat.com; intel-wired- > l...@lists.osuosl.org; Kirsher, Jeffrey T ; > sassm...@redhat.com > Subject: [net PATCH 1/2] e1000: Do not overestimate descriptor counts in Tx > pre-check > > The current code path is capable of grossly overestimating the number of > descriptors needed to transmit a new frame. This specifically occurs if > the skb contains a number of 4K pages. The issue is that the logic for > determining the descriptors needed is ((S) >> (X)) + 1. When X is 12 it > means that we were indicating that we required 2 descriptors for each 4K > page when we only needed one. > > This change corrects this by instead adding (1 << (X)) - 1 to the S value > instead of adding 1 after the fact. This way we get an accurate descriptor > needed count as we are essentially doing a DIV_ROUNDUP(). > > Reported-by: Ivan Suzdal > Signed-off-by: Alexander Duyck > --- > drivers/net/ethernet/intel/e1000/e1000_main.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Tested-by: Aaron Brown
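The arithmetic being fixed is easy to check in isolation. Assuming the 4 KiB-per-descriptor limit (a shift of 12) described in the commit message — the function names here are illustrative, not the driver's:

```c
#include <assert.h>

#define TXD_SHIFT 12	/* max bytes per descriptor = 1 << 12 = 4096 */

/* old formula: ((S) >> (X)) + 1 — over-counts exact multiples of 4K */
unsigned int txd_old(unsigned int len)
{
	return (len >> TXD_SHIFT) + 1;
}

/* fixed formula: add (1 << X) - 1 before shifting, i.e. DIV_ROUND_UP */
unsigned int txd_new(unsigned int len)
{
	return (len + (1u << TXD_SHIFT) - 1) >> TXD_SHIFT;
}
```

A single 4096-byte page is the case the commit message calls out: the old formula claims two descriptors where one suffices.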
RE: [Intel-wired-lan] [net PATCH 2/2] e1000: Double Tx descriptors needed check for 82544
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On > Behalf Of Alexander Duyck > Sent: Wednesday, March 2, 2016 1:16 PM > To: netdev@vger.kernel.org; jogre...@redhat.com; intel-wired- > l...@lists.osuosl.org; Kirsher, Jeffrey T ; > sassm...@redhat.com > Subject: [Intel-wired-lan] [net PATCH 2/2] e1000: Double Tx descriptors > needed check for 82544 > > The 82544 has code that adds one additional descriptor per data buffer. > However we weren't taking that into acount when determining the > descriptors > needed for the next transmit at the end of the xmit_frame path. > > This change takes that into account by doubling the number of descriptors > needed for the 82544 so that we can avoid a potential issue where we could > hang the Tx ring by loading frames with xmit_more enabled and then > stopping > the ring without writing the tail. > > In addition it adds a few more descriptors to account for some additional > workarounds that have been added over time. > > Signed-off-by: Alexander Duyck > --- > drivers/net/ethernet/intel/e1000/e1000_main.c | 19 > ++- > 1 file changed, 18 insertions(+), 1 deletion(-) Tested-by: Aaron Brown
Re: [PATCH] rxrpc: Replace all unsigned with unsigned int
David Howells wrote: > Replace all "unsigned" types with "unsigned int" types. > > Reported-by: David Miller > Signed-off-by: David Howells This is aimed at net-next. David
[PATCH] rxrpc: Replace all unsigned with unsigned int
Replace all "unsigned" types with "unsigned int" types. Reported-by: David Miller Signed-off-by: David Howells --- net/rxrpc/af_rxrpc.c |2 +- net/rxrpc/ar-ack.c| 12 ++-- net/rxrpc/ar-call.c |4 ++-- net/rxrpc/ar-connection.c |2 +- net/rxrpc/ar-internal.h | 22 +++--- net/rxrpc/ar-output.c |2 +- net/rxrpc/ar-transport.c |2 +- net/rxrpc/sysctl.c| 32 8 files changed, 39 insertions(+), 39 deletions(-) diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index a76501757b59..9d935fa5a2a9 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -81,7 +81,7 @@ static int rxrpc_validate_address(struct rxrpc_sock *rx, struct sockaddr_rxrpc *srx, int len) { - unsigned tail; + unsigned int tail; if (len < sizeof(struct sockaddr_rxrpc)) return -EINVAL; diff --git a/net/rxrpc/ar-ack.c b/net/rxrpc/ar-ack.c index 20f3f001694e..16d967075eaf 100644 --- a/net/rxrpc/ar-ack.c +++ b/net/rxrpc/ar-ack.c @@ -23,7 +23,7 @@ * How long to wait before scheduling ACK generation after seeing a * packet with RXRPC_REQUEST_ACK set (in jiffies). */ -unsigned rxrpc_requested_ack_delay = 1; +unsigned int rxrpc_requested_ack_delay = 1; /* * How long to wait before scheduling an ACK with subtype DELAY (in jiffies). @@ -32,7 +32,7 @@ unsigned rxrpc_requested_ack_delay = 1; * all consumed within this time we will send a DELAY ACK if an ACK was not * requested to let the sender know it doesn't need to resend. */ -unsigned rxrpc_soft_ack_delay = 1 * HZ; +unsigned int rxrpc_soft_ack_delay = 1 * HZ; /* * How long to wait before scheduling an ACK with subtype IDLE (in jiffies). @@ -41,7 +41,7 @@ unsigned rxrpc_soft_ack_delay = 1 * HZ; * further packets aren't immediately received to decide when to send an IDLE * ACK let the other end know that it can free up its Tx buffer space. */ -unsigned rxrpc_idle_ack_delay = 0.5 * HZ; +unsigned int rxrpc_idle_ack_delay = 0.5 * HZ; /* * Receive window size in packets. 
This indicates the maximum number of @@ -49,19 +49,19 @@ unsigned rxrpc_idle_ack_delay = 0.5 * HZ; * limit is hit, we should generate an EXCEEDS_WINDOW ACK and discard further * packets. */ -unsigned rxrpc_rx_window_size = 32; +unsigned int rxrpc_rx_window_size = 32; /* * Maximum Rx MTU size. This indicates to the sender the size of jumbo packet * made by gluing normal packets together that we're willing to handle. */ -unsigned rxrpc_rx_mtu = 5692; +unsigned int rxrpc_rx_mtu = 5692; /* * The maximum number of fragments in a received jumbo packet that we tell the * sender that we're willing to handle. */ -unsigned rxrpc_rx_jumbo_max = 4; +unsigned int rxrpc_rx_jumbo_max = 4; static const char *rxrpc_acks(u8 reason) { diff --git a/net/rxrpc/ar-call.c b/net/rxrpc/ar-call.c index 4a499e0100f1..7c8d300ade9b 100644 --- a/net/rxrpc/ar-call.c +++ b/net/rxrpc/ar-call.c @@ -21,12 +21,12 @@ /* * Maximum lifetime of a call (in jiffies). */ -unsigned rxrpc_max_call_lifetime = 60 * HZ; +unsigned int rxrpc_max_call_lifetime = 60 * HZ; /* * Time till dead call expires after last use (in jiffies). */ -unsigned rxrpc_dead_call_expiry = 2 * HZ; +unsigned int rxrpc_dead_call_expiry = 2 * HZ; const char *const rxrpc_call_states[NR__RXRPC_CALL_STATES] = { [RXRPC_CALL_CLIENT_SEND_REQUEST]= "ClSndReq", diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c index 53df14cb8d25..9942da1edbf6 100644 --- a/net/rxrpc/ar-connection.c +++ b/net/rxrpc/ar-connection.c @@ -21,7 +21,7 @@ /* * Time till a connection expires after last use (in seconds). 
*/ -unsigned rxrpc_connection_expiry = 10 * 60; +unsigned int rxrpc_connection_expiry = 10 * 60; static void rxrpc_connection_reaper(struct work_struct *work); diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 8b495aed517d..a3002f4ddc90 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -478,12 +478,12 @@ int rxrpc_reject_call(struct rxrpc_sock *); /* * ar-ack.c */ -extern unsigned rxrpc_requested_ack_delay; -extern unsigned rxrpc_soft_ack_delay; -extern unsigned rxrpc_idle_ack_delay; -extern unsigned rxrpc_rx_window_size; -extern unsigned rxrpc_rx_mtu; -extern unsigned rxrpc_rx_jumbo_max; +extern unsigned int rxrpc_requested_ack_delay; +extern unsigned int rxrpc_soft_ack_delay; +extern unsigned int rxrpc_idle_ack_delay; +extern unsigned int rxrpc_rx_window_size; +extern unsigned int rxrpc_rx_mtu; +extern unsigned int rxrpc_rx_jumbo_max; void __rxrpc_propose_ACK(struct rxrpc_call *, u8, u32, bool); void rxrpc_propose_ACK(struct rxrpc_call *, u8, u32, bool); @@ -492,8 +492,8 @@ void rxrpc_process_call(struct work_struct *); /* * ar-call.c */ -extern unsigned rxrpc_max_call_lifetime; -extern unsigned rxrpc_dead_call_expiry; +extern
Re: [PATCH nf-next v8 3/8] openvswitch: Add commentary to conntrack.c
Sergei, Just found this from my junk mail box, sorry. Will fix spelling, but I just sent v9 and will wait for other reviews before re-posting. Jarno > On Mar 9, 2016, at 5:50 AM, Sergei Shtylyov > wrote: > > Hello. > > On 3/9/2016 3:24 AM, Jarno Rajahalme wrote: > >> This makes the code easier to understand and the following patches >> more focused. >> >> Signed-off-by: Jarno Rajahalme >> --- >> net/openvswitch/conntrack.c | 21 - >> 1 file changed, 20 insertions(+), 1 deletion(-) >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 3045290..8cd0110 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c > [...] >> @@ -418,6 +429,13 @@ static int ovs_ct_lookup(struct net *net, struct >> sw_flow_key *key, >> { >> struct nf_conntrack_expect *exp; >> >> +/* If we pass an expected packet through nf_conntrack_in() the >> + * expectiation is typically removed, but the packet could still be > > Expectation. > [...] > > MBR, Sergei >
Re: [PATCH net-next v3] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
On Wed, Mar 9, 2016 at 10:43 AM, Martin KaFai Lau wrote: > > Per RFC4898, they count segments sent/received > containing a positive length data segment (that includes > retransmission segments carrying data). Unlike > tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments > carrying no data (e.g. pure ack). > > The patch also updates the segs_in in tcp_fastopen_add_skb() > so that segs_in >= data_segs_in property is kept. > > Together with retransmission data, tcpi_data_segs_out > gives a better signal on the rxmit rate. > > v3: Add const modifier to the skb parameter in tcp_segs_in() > > v2: Rework based on recent fix by Eric: > commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment") > > Signed-off-by: Martin KaFai Lau > Cc: Chris Rapier > Cc: Eric Dumazet > Cc: Marcelo Ricardo Leitner > Cc: Neal Cardwell > Cc: Yuchung Cheng > --- > include/linux/tcp.h | 6 ++ > include/net/tcp.h| 10 ++ > include/uapi/linux/tcp.h | 2 ++ > net/ipv4/tcp.c | 2 ++ > net/ipv4/tcp_fastopen.c | 4 > net/ipv4/tcp_ipv4.c | 2 +- > net/ipv4/tcp_minisocks.c | 2 +- > net/ipv4/tcp_output.c| 4 +++- > net/ipv6/tcp_ipv6.c | 2 +- > 9 files changed, 30 insertions(+), 4 deletions(-) > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index bcbf51d..7be9b12 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -158,6 +158,9 @@ struct tcp_sock { > u32 segs_in;/* RFC4898 tcpEStatsPerfSegsIn > * total number of segments in. > */ > + u32 data_segs_in; /* RFC4898 tcpEStatsPerfDataSegsIn > +* total number of data segments in. > +*/ > u32 rcv_nxt;/* What we want to receive next */ > u32 copied_seq; /* Head of yet unread data */ > u32 rcv_wup;/* rcv_nxt on last window update sent */ > @@ -165,6 +168,9 @@ struct tcp_sock { > u32 segs_out; /* RFC4898 tcpEStatsPerfSegsOut > * The total number of segments sent. > */ > + u32 data_segs_out; /* RFC4898 tcpEStatsPerfDataSegsOut > +* total number of data segments sent. 
> +*/ > u64 bytes_acked;/* RFC4898 tcpEStatsAppHCThruOctetsAcked > * sum(delta(snd_una)), or how many bytes > * were acked. > diff --git a/include/net/tcp.h b/include/net/tcp.h > index e90db85..24557a8 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff > *skb) > skb->truesize = 2; > } > > +static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff > *skb) > +{ > + u16 segs_in; > + > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > + tp->segs_in += segs_in; > + if (skb->len > tcp_hdrlen(skb)) > + tp->data_segs_in += segs_in; > +} > + > #endif /* _TCP_H */ > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index fe95446..53e8e3f 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -199,6 +199,8 @@ struct tcp_info { > > __u32 tcpi_notsent_bytes; > __u32 tcpi_min_rtt; > + __u32 tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */ > + __u32 tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */ > }; > > /* for TCP_MD5SIG socket option */ > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index f9faadb..6b01b48 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2728,6 +2728,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info > *info) > info->tcpi_notsent_bytes = max(0, notsent_bytes); > > info->tcpi_min_rtt = tcp_min_rtt(tp); > + info->tcpi_data_segs_in = tp->data_segs_in; > + info->tcpi_data_segs_out = tp->data_segs_out; > } > EXPORT_SYMBOL_GPL(tcp_get_info); > > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > index fdb286d..f583c85 100644 > --- a/net/ipv4/tcp_fastopen.c > +++ b/net/ipv4/tcp_fastopen.c > @@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock > *req, > void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > { > struct tcp_sock *tp = tcp_sk(sk); > + u16 segs_in; > > if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt) > return; > @@ -154,6 +155,9 @@ void 
tcp_fastopen_add_skb(struct sock *sk, struct sk_buff > *skb) > * as we certainly are not changing upper 32bit value (0) > */ > tp->bytes_received = skb->len; > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > + tp->segs_in = segs_in; > + tp->data_segs_in = segs_in; why not
[PATCH nf-next v9 4/8] openvswitch: Update the CT state key only after nf_conntrack_in().
Only a successful nf_conntrack_in() call can effect a connection state change, so it suffices to update the key only after nf_conntrack_in() returns. This change is needed for the later NAT patches. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 8cd0110..da804d3 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -382,7 +382,8 @@ static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb, } /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if - * not done already. Update key with new CT state. + * not done already. Update key with new CT state after passing the packet + * through conntrack. * Note that if the packet is deemed invalid by conntrack, skb->nfct will be * set to NULL and 0 will be returned. */ @@ -411,14 +412,14 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, skb) != NF_ACCEPT) return -ENOENT; + ovs_ct_update_key(skb, info, key, true); + if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) { WARN_ONCE(1, "helper rejected packet"); return -EINVAL; } } - ovs_ct_update_key(skb, info, key, true); - return 0; } -- 2.1.4
[PATCH nf-next v9 6/8] openvswitch: Handle NF_REPEAT in conntrack action.
Repeat the nf_conntrack_in() call when it returns NF_REPEAT. This avoids dropping a SYN packet re-opening an existing TCP connection. Signed-off-by: Jarno Rajahalme Acked-by: Joe Stringer --- net/openvswitch/conntrack.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index b838536..92613de 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -485,6 +485,7 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, */ if (!skb_nfct_cached(net, key, info, skb)) { struct nf_conn *tmpl = info->ct; + int err; /* Associate skb with specified zone. */ if (tmpl) { @@ -495,8 +496,13 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, skb->nfctinfo = IP_CT_NEW; } - if (nf_conntrack_in(net, info->family, NF_INET_PRE_ROUTING, - skb) != NF_ACCEPT) + /* Repeat if requested, see nf_iterate(). */ + do { + err = nf_conntrack_in(net, info->family, + NF_INET_PRE_ROUTING, skb); + } while (err == NF_REPEAT); + + if (err != NF_ACCEPT) return -ENOENT; ovs_ct_update_key(skb, info, key, true); -- 2.1.4
Re: [PATCH nf-next v8 8/8] openvswitch: Interface with NAT.
Just sent a v9 addressing these dependency issues. Jarno > On Mar 9, 2016, at 12:04 AM, kbuild test robot wrote: > > Hi Jarno, > > [auto build test ERROR on nf-next/master] > > url: > https://github.com/0day-ci/linux/commits/Jarno-Rajahalme/netfilter-Remove-IP_CT_NEW_REPLY-definition/20160309-083126 > base: https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next master > config: i386-randconfig-x0-03091344 (attached as .config) > reproduce: ># save the attached .config to linux build tree >make ARCH=i386 > > All errors (new ones prefixed by >>): > > net/built-in.o: In function `__ovs_ct_lookup': >>> conntrack.c:(.text+0x227267): undefined reference to `nf_ct_nat_ext_add' >>> conntrack.c:(.text+0x227404): undefined reference to >>> `nf_nat_icmp_reply_translation' > conntrack.c:(.text+0x2274ab): undefined reference to > `nf_nat_icmpv6_reply_translation' >>> conntrack.c:(.text+0x227585): undefined reference to `nf_nat_setup_info' >>> conntrack.c:(.text+0x2275a1): undefined reference to >>> `nf_nat_alloc_null_binding' >>> conntrack.c:(.text+0x2275d2): undefined reference to `nf_nat_packet' > > --- > 0-DAY kernel test infrastructureOpen Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation > <.config.gz>
[PATCH nf-next v9 8/8] openvswitch: Interface with NAT.
Extend OVS conntrack interface to cover NAT. New nested OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action. A bare OVS_CT_ATTR_NAT only mangles existing and expected connections. If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested attributes, new (non-committed/non-confirmed) connections are mangled according to the rest of the nested attributes. The corresponding OVS userspace patch series includes test cases (in tests/system-traffic.at) that also serve as example uses. This work extends on a branch by Thomas Graf at https://github.com/tgraf/ovs/tree/nat. Signed-off-by: Jarno Rajahalme --- v9: Fixed module dependencies. include/uapi/linux/openvswitch.h | 49 net/openvswitch/Kconfig | 3 +- net/openvswitch/conntrack.c | 523 +-- net/openvswitch/conntrack.h | 3 +- 4 files changed, 551 insertions(+), 27 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index a27222d..616d047 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -454,6 +454,14 @@ struct ovs_key_ct_labels { #define OVS_CS_F_REPLY_DIR 0x08 /* Flow is in the reply direction. */ #define OVS_CS_F_INVALID 0x10 /* Could not track connection. */ #define OVS_CS_F_TRACKED 0x20 /* Conntrack has occurred. */ +#define OVS_CS_F_SRC_NAT 0x40 /* Packet's source address/port was +* mangled by NAT. +*/ +#define OVS_CS_F_DST_NAT 0x80 /* Packet's destination address/port +* was mangled by NAT. +*/ + +#define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT) /** * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands. @@ -632,6 +640,8 @@ struct ovs_action_hash { * mask. For each bit set in the mask, the corresponding bit in the value is * copied to the connection tracking label field in the connection. * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. + * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address + * translation (NAT) on the packet. 
*/ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -641,12 +651,51 @@ enum ovs_ct_attr { OVS_CT_ATTR_LABELS, /* labels to associate with this connection. */ OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of related connections. */ + OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ __OVS_CT_ATTR_MAX }; #define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1) /** + * enum ovs_nat_attr - Attributes for %OVS_CT_ATTR_NAT. + * + * @OVS_NAT_ATTR_SRC: Flag for Source NAT (mangle source address/port). + * @OVS_NAT_ATTR_DST: Flag for Destination NAT (mangle destination + * address/port). Only one of (@OVS_NAT_ATTR_SRC, @OVS_NAT_ATTR_DST) may be + * specified. Effective only for packets for ct_state NEW connections. + * Packets of committed connections are mangled by the NAT action according to + * the committed NAT type regardless of the flags specified. As a corollary, a + * NAT action without a NAT type flag will only mangle packets of committed + * connections. The following NAT attributes only apply for NEW + * (non-committed) connections, and they may be included only when the CT + * action has the @OVS_CT_ATTR_COMMIT flag and either @OVS_NAT_ATTR_SRC or + * @OVS_NAT_ATTR_DST is also included. 
+ * @OVS_NAT_ATTR_IP_MIN: struct in_addr or struct in6_addr + * @OVS_NAT_ATTR_IP_MAX: struct in_addr or struct in6_addr + * @OVS_NAT_ATTR_PROTO_MIN: u16 L4 protocol specific lower boundary (port) + * @OVS_NAT_ATTR_PROTO_MAX: u16 L4 protocol specific upper boundary (port) + * @OVS_NAT_ATTR_PERSISTENT: Flag for persistent IP mapping across reboots + * @OVS_NAT_ATTR_PROTO_HASH: Flag for pseudo random L4 port mapping (MD5) + * @OVS_NAT_ATTR_PROTO_RANDOM: Flag for fully randomized L4 port mapping + */ +enum ovs_nat_attr { + OVS_NAT_ATTR_UNSPEC, + OVS_NAT_ATTR_SRC, + OVS_NAT_ATTR_DST, + OVS_NAT_ATTR_IP_MIN, + OVS_NAT_ATTR_IP_MAX, + OVS_NAT_ATTR_PROTO_MIN, + OVS_NAT_ATTR_PROTO_MAX, + OVS_NAT_ATTR_PERSISTENT, + OVS_NAT_ATTR_PROTO_HASH, + OVS_NAT_ATTR_PROTO_RANDOM, + __OVS_NAT_ATTR_MAX, +}; + +#define OVS_NAT_ATTR_MAX (__OVS_NAT_ATTR_MAX - 1) + +/** * enum ovs_action_attr - Action types. * * @OVS_ACTION_ATTR_OUTPUT: Output packet to port. diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig index cd5fd9d..23471a4 100644 --- a/net/openvswitch/Kconfig +++ b/net/openvswitch/Kconfig @@ -6,7 +6,8 @@ config OPENVSWITCH tristate "Open vSwitch" depends on INET depends on !NF_CONNTRACK || \ - (NF_CONNTRACK && (!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6)) + (NF_CONNTRACK && ((!NF_DEFRA
[PATCH nf-next v9 7/8] openvswitch: Delay conntrack helper call for new connections.
There is no need to help connections that are not confirmed, so we can delay helping new connections to the time when they are confirmed. This change is needed for NAT support, and having this as a separate patch will make the following NAT patch a bit easier to review. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 92613de..5711f80 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -483,7 +483,11 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, * actually run the packet through conntrack twice unless it's for a * different zone. */ - if (!skb_nfct_cached(net, key, info, skb)) { + bool cached = skb_nfct_cached(net, key, info, skb); + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; + + if (!cached) { struct nf_conn *tmpl = info->ct; int err; @@ -506,11 +510,17 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, return -ENOENT; ovs_ct_update_key(skb, info, key, true); + } - if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) { - WARN_ONCE(1, "helper rejected packet"); - return -EINVAL; - } + /* Call the helper only if we did nf_conntrack_in() above ('!cached') +* for confirmed connections, but only when committing for unconfirmed +* connections. +*/ + ct = nf_ct_get(skb, &ctinfo); + if (ct && (nf_ct_is_confirmed(ct) ? !cached : info->commit) && + ovs_ct_helper(skb, info->family) != NF_ACCEPT) { + WARN_ONCE(1, "helper rejected packet"); + return -EINVAL; } return 0; -- 2.1.4
[PATCH nf-next v9 3/8] openvswitch: Add commentary to conntrack.c
This makes the code easier to understand and the following patches more focused. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 21 - 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 3045290..8cd0110 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -152,8 +152,12 @@ static void ovs_ct_update_key(const struct sk_buff *skb, ct = nf_ct_get(skb, &ctinfo); if (ct) { state = ovs_ct_get_state(ctinfo); + /* All unconfirmed entries are NEW connections. */ if (!nf_ct_is_confirmed(ct)) state |= OVS_CS_F_NEW; + /* OVS persists the related flag for the duration of the +* connection. +*/ if (ct->master) state |= OVS_CS_F_RELATED; zone = nf_ct_zone(ct); @@ -165,6 +169,9 @@ static void ovs_ct_update_key(const struct sk_buff *skb, __ovs_ct_update_key(key, state, zone, ct); } +/* This is called to initialize CT key fields possibly coming in from the local + * stack. + */ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key) { ovs_ct_update_key(skb, NULL, key, false); @@ -199,7 +206,6 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, struct nf_conn *ct; u32 new_mark; - /* The connection could be invalid, in which case set_mark is no-op. */ ct = nf_ct_get(skb, &ctinfo); if (!ct) @@ -375,6 +381,11 @@ static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb, return true; } +/* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if + * not done already. Update key with new CT state. + * Note that if the packet is deemed invalid by conntrack, skb->nfct will be + * set to NULL and 0 will be returned. 
+ */ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info, struct sk_buff *skb) @@ -418,6 +429,13 @@ static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key, { struct nf_conntrack_expect *exp; + /* If we pass an expected packet through nf_conntrack_in() the +* expectiation is typically removed, but the packet could still be +* lost in upcall processing. To prevent this from happening we +* perform an explicit expectation lookup. Expected connections are +* always new, and will be passed through conntrack only when they are +* committed, as it is OK to remove the expectation at that time. +*/ exp = ovs_ct_expect_find(net, &info->zone, info->family, skb); if (exp) { u8 state; @@ -455,6 +473,7 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, err = __ovs_ct_lookup(net, key, info, skb); if (err) return err; + /* This is a no-op if the connection has already been confirmed. */ if (nf_conntrack_confirm(skb) != NF_ACCEPT) return -EINVAL; -- 2.1.4
[PATCH nf-next v9 5/8] openvswitch: Find existing conntrack entry after upcall.
Add a new function ovs_ct_find_existing() to find an existing conntrack entry to which this packet was already applied. This is only to be called when there is evidence that the packet was already tracked and committed, but we lost the ct reference due to a userspace upcall. ovs_ct_find_existing() is called from skb_nfct_cached(), which can now hide the fact that the ct reference may have been lost due to an upcall. This allows ovs_ct_commit() to be simplified. This patch is needed by the later "openvswitch: Interface with NAT" patch, as we need to be able to pass the packet through NAT using the original ct reference also after the reference is lost after an upcall. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 103 ++-- 1 file changed, 90 insertions(+), 13 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index da804d3..b838536 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -356,14 +356,101 @@ ovs_ct_expect_find(struct net *net, const struct nf_conntrack_zone *zone, return __nf_ct_expect_find(net, zone, &tuple); } +/* This replicates logic from nf_conntrack_core.c that is not exported. */ +static enum ip_conntrack_info +ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) +{ + const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); + + if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY) + return IP_CT_ESTABLISHED_REPLY; + /* Once we've had two way comms, always ESTABLISHED. */ + if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) + return IP_CT_ESTABLISHED; + if (test_bit(IPS_EXPECTED_BIT, &ct->status)) + return IP_CT_RELATED; + return IP_CT_NEW; +} + +/* Find an existing connection which this packet belongs to without + * re-attributing statistics or modifying the connection state. This allows an + * skb->nfct lost due to an upcall to be recovered during actions execution. + * + * Must be called with rcu_read_lock. 
+ * + * On success, populates skb->nfct and skb->nfctinfo, and returns the + * connection. Returns NULL if there is no existing entry. + */ +static struct nf_conn * +ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, +u8 l3num, struct sk_buff *skb) +{ + struct nf_conntrack_l3proto *l3proto; + struct nf_conntrack_l4proto *l4proto; + struct nf_conntrack_tuple tuple; + struct nf_conntrack_tuple_hash *h; + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; + unsigned int dataoff; + u8 protonum; + + l3proto = __nf_ct_l3proto_find(l3num); + if (!l3proto) { + pr_debug("ovs_ct_find_existing: Can't get l3proto\n"); + return NULL; + } + if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, +&protonum) <= 0) { + pr_debug("ovs_ct_find_existing: Can't get protonum\n"); + return NULL; + } + l4proto = __nf_ct_l4proto_find(l3num, protonum); + if (!l4proto) { + pr_debug("ovs_ct_find_existing: Can't get l4proto\n"); + return NULL; + } + if (!nf_ct_get_tuple(skb, skb_network_offset(skb), dataoff, l3num, +protonum, net, &tuple, l3proto, l4proto)) { + pr_debug("ovs_ct_find_existing: Can't get tuple\n"); + return NULL; + } + + /* look for tuple match */ + h = nf_conntrack_find_get(net, zone, &tuple); + if (!h) + return NULL; /* Not found. */ + + ct = nf_ct_tuplehash_to_ctrack(h); + + ctinfo = ovs_ct_get_info(h); + if (ctinfo == IP_CT_NEW) { + /* This should not happen. */ + WARN_ONCE(1, "ovs_ct_find_existing: new packet for %p\n", ct); + } + skb->nfct = &ct->ct_general; + skb->nfctinfo = ctinfo; + return ct; +} + /* Determine whether skb->nfct is equal to the result of conntrack lookup. 
*/ -static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb, - const struct ovs_conntrack_info *info) +static bool skb_nfct_cached(struct net *net, + const struct sw_flow_key *key, + const struct ovs_conntrack_info *info, + struct sk_buff *skb) { enum ip_conntrack_info ctinfo; struct nf_conn *ct; ct = nf_ct_get(skb, &ctinfo); + /* If no ct, check if we have evidence that an existing conntrack entry +* might be found for this skb. This happens when we lose a skb->nfct +* due to an upcall. If the connection was not confirmed, it is not +* cached and needs to be run through conntrack again. +*/ + if (!ct && key->ct.state & OVS_CS_F_TRACKED && + !(key->ct.state & OVS_CS_F_INVALID) && +
[PATCH nf-next v9 1/8] netfilter: Remove IP_CT_NEW_REPLY definition.
Remove the definition of IP_CT_NEW_REPLY from the kernel as it does not make sense. This allows the definition of IP_CT_NUMBER to be simplified as well. Signed-off-by: Jarno Rajahalme --- include/uapi/linux/netfilter/nf_conntrack_common.h | 12 +--- net/openvswitch/conntrack.c| 2 -- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h index 319f471..6d074d1 100644 --- a/include/uapi/linux/netfilter/nf_conntrack_common.h +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h @@ -20,9 +20,15 @@ enum ip_conntrack_info { IP_CT_ESTABLISHED_REPLY = IP_CT_ESTABLISHED + IP_CT_IS_REPLY, IP_CT_RELATED_REPLY = IP_CT_RELATED + IP_CT_IS_REPLY, - IP_CT_NEW_REPLY = IP_CT_NEW + IP_CT_IS_REPLY, - /* Number of distinct IP_CT types (no NEW in reply dirn). */ - IP_CT_NUMBER = IP_CT_IS_REPLY * 2 - 1 + /* No NEW in reply direction. */ + + /* Number of distinct IP_CT types. */ + IP_CT_NUMBER, + + /* only for userspace compatibility */ +#ifndef __KERNEL__ + IP_CT_NEW_REPLY = IP_CT_NUMBER, +#endif }; #define NF_CT_STATE_INVALID_BIT(1 << 0) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index ee6ff8f..3045290 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -75,7 +75,6 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo) switch (ctinfo) { case IP_CT_ESTABLISHED_REPLY: case IP_CT_RELATED_REPLY: - case IP_CT_NEW_REPLY: ct_state |= OVS_CS_F_REPLY_DIR; break; default: @@ -92,7 +91,6 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo) ct_state |= OVS_CS_F_RELATED; break; case IP_CT_NEW: - case IP_CT_NEW_REPLY: ct_state |= OVS_CS_F_NEW; break; default: -- 2.1.4
[PATCH nf-next v9 2/8] netfilter: Allow calling into nat helper without skb_dst.
NAT checksum recalculation code assumes existence of skb_dst, which becomes a problem for a later patch in the series ("openvswitch: Interface with NAT."). Simplify this by removing the check on skb_dst, as the checksum will be dealt with later in the stack. Suggested-by: Pravin Shelar Signed-off-by: Jarno Rajahalme --- net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 30 -- net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 30 -- 2 files changed, 16 insertions(+), 44 deletions(-) diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c index 61c7cc2..f8aad03 100644 --- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c +++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c @@ -127,29 +127,15 @@ static void nf_nat_ipv4_csum_recalc(struct sk_buff *skb, u8 proto, void *data, __sum16 *check, int datalen, int oldlen) { - const struct iphdr *iph = ip_hdr(skb); - struct rtable *rt = skb_rtable(skb); - if (skb->ip_summed != CHECKSUM_PARTIAL) { - if (!(rt->rt_flags & RTCF_LOCAL) && - (!skb->dev || skb->dev->features & -(NETIF_F_IP_CSUM | NETIF_F_HW_CSUM))) { - skb->ip_summed = CHECKSUM_PARTIAL; - skb->csum_start = skb_headroom(skb) + - skb_network_offset(skb) + - ip_hdrlen(skb); - skb->csum_offset = (void *)check - data; - *check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, - datalen, proto, 0); - } else { - *check = 0; - *check = csum_tcpudp_magic(iph->saddr, iph->daddr, - datalen, proto, - csum_partial(data, datalen, - 0)); - if (proto == IPPROTO_UDP && !*check) - *check = CSUM_MANGLED_0; - } + const struct iphdr *iph = ip_hdr(skb); + + skb->ip_summed = CHECKSUM_PARTIAL; + skb->csum_start = skb_headroom(skb) + skb_network_offset(skb) + + ip_hdrlen(skb); + skb->csum_offset = (void *)check - data; + *check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen, + proto, 0); } else inet_proto_csum_replace2(check, skb, htons(oldlen), htons(datalen), true); diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c index 
6ce3099..e0be97e 100644 --- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c +++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c @@ -131,29 +131,15 @@ static void nf_nat_ipv6_csum_recalc(struct sk_buff *skb, u8 proto, void *data, __sum16 *check, int datalen, int oldlen) { - const struct ipv6hdr *ipv6h = ipv6_hdr(skb); - struct rt6_info *rt = (struct rt6_info *)skb_dst(skb); - if (skb->ip_summed != CHECKSUM_PARTIAL) { - if (!(rt->rt6i_flags & RTF_LOCAL) && - (!skb->dev || skb->dev->features & -(NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))) { - skb->ip_summed = CHECKSUM_PARTIAL; - skb->csum_start = skb_headroom(skb) + - skb_network_offset(skb) + - (data - (void *)skb->data); - skb->csum_offset = (void *)check - data; - *check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, - datalen, proto, 0); - } else { - *check = 0; - *check = csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, -datalen, proto, -csum_partial(data, datalen, - 0)); - if (proto == IPPROTO_UDP && !*check) - *check = CSUM_MANGLED_0; - } + const struct ipv6hdr *ipv6h = ipv6_hdr(skb); + + skb->ip_summed = CHECKSUM_PARTIAL; + skb->csum_start = skb_headroom(skb) + skb_network_offset(skb) + + (data - (void *)skb->data); + skb->csum_offset = (void *)check - data; + *check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, + datalen, proto, 0); } else
Re: [RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Hi Ido, Ido Schimmel writes: > Wed, Mar 09, 2016 at 07:42:47PM IST, vivien.dide...@savoirfairelinux.com > wrote: >>Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is >>set before adding a port to a bridge and deleting a port from a bridge. >> >>The main purpose for this attribute is to provide switchdev users a >>simple and common way to retrieve bridging information, instead of >>implementing complex notifier blocks to listen to global netdev events. >> >>We can also imagine a switchdev user returning an error different from >>-EOPNOTSUPP in the prepare phase to prevent a port from being bridged. > > I don't really understand the motivation for this change. We are already > doing all these stuff with the notifiers and it's pretty > straight-forward. > > In fact, I believe using an existing mechanism instead of introducing > more switchdev hooks is more elegant. This RFC only deals with bridge, > but you'll have to do the same for team, bond and vlan devices. And > you'll probably place the hooks in the exact locations where the > notifiers are called from anyway. 
> >> >>Signed-off-by: Vivien Didelot >>--- >> include/net/switchdev.h | 2 ++ >> net/bridge/br_if.c | 27 +++ >> 2 files changed, 29 insertions(+) >> >>diff --git a/include/net/switchdev.h b/include/net/switchdev.h >>index d451122..65f8514 100644 >>--- a/include/net/switchdev.h >>+++ b/include/net/switchdev.h >>@@ -46,6 +46,7 @@ enum switchdev_attr_id { >> SWITCHDEV_ATTR_ID_PORT_PARENT_ID, >> SWITCHDEV_ATTR_ID_PORT_STP_STATE, >> SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS, >>+ SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, >> SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME, >> SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING, >> }; >>@@ -58,6 +59,7 @@ struct switchdev_attr { >> struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */ >> u8 stp_state; /* PORT_STP_STATE */ >> unsigned long brport_flags; /* PORT_BRIDGE_FLAGS */ >>+ bool join; /* PORT_BRIDGE_IF */ >> u32 ageing_time;/* BRIDGE_AGEING_TIME */ >> bool vlan_filtering;/* >> BRIDGE_VLAN_FILTERING */ >> } u; >>diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c >>index a73df33..105b9fd 100644 >>--- a/net/bridge/br_if.c >>+++ b/net/bridge/br_if.c >>@@ -28,6 +28,24 @@ >> >> #include "br_private.h" >> >>+static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br, >>+bool join) >>+{ >>+ struct switchdev_attr attr = { >>+ .orig_dev = br->dev, > > This should be just 'dev', since you need to know for which stacked > device on top of the port this was called for. This also means you'll > have to call netdev_master_upper_dev_get() from within your driver if > you want to limit the number of VLAN filtering bridges (for example). > However, since this is called before bridge dev and dev itself are > linked, you'll get NULL. > >>+ .id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, >>+ .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, >>+ .u.join = join, >>+ }; >>+ int err; >>+ >>+ err = switchdev_port_attr_set(dev, &attr); >>+ if (err && err != -EOPNOTSUPP) >>+ return err; >>+ >>+ return 0; >>+} >>+ >> /* >> * Determine initial path cost based on speed. 
>> * using recommendations from 802.1d standard >>@@ -297,6 +315,10 @@ static void del_nbp(struct net_bridge_port *p) >> br_netpoll_disable(p); >> >> call_rcu(&p->rcu, destroy_nbp_rcu); >>+ >>+ if (switchdev_bridge_if(dev, br, false)) >>+ br_warn(br, "error unbridging port %u(%s)\n", >>+ (unsigned int) p->port_no, dev->name); >> } >> >> /* Delete bridge device */ >>@@ -347,6 +369,11 @@ static struct net_bridge_port *new_nbp(struct net_bridge >>*br, >> { >> int index; >> struct net_bridge_port *p; >>+ int err; >>+ >>+ err = switchdev_bridge_if(dev, br, true); > > If you look at br_add_if() - where new_nbp() is called from - then > you'll see that you aren't rollbacking this operation in case of error. > Same for subsequent errors in this function I believe. > >>+ if (err) >>+ return ERR_PTR(err); >> >> index = find_portno(br); >> if (index < 0) >>-- >>2.7.2 >> > > Maybe this is something we'll have to do in the future, but for now I > think we are OK with the notifiers. :) > > Thanks Vivien! I didn't have the big picture for team, bond and vlan devices as well. I can drop this RFC then. Thanks for the details! Vivien
Re: [PATCH] inet: set `error' value not under lock_sock().
On 03/09/2016 12:37 PM, David Miller wrote: From: Weongyo Jeong Date: Wed, 9 Mar 2016 10:22:21 -0800 A trivial patch to set `error' variable while not holding lock_sock(). Signed-off-by: Weongyo Jeong Deferring the assignment means gcc doesn't have to potentially put it on the stack across the lock_sock() call. You're making the code worse not better, the assignment into a local variable is not an expensive operation. I'm not applying this change, sorry. No problem. Thank you for review and opinion. Regards, Weongyo Jeong
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Hi Jiri, Jiri Pirko writes: > Wed, Mar 09, 2016 at 07:32:13PM CET, and...@lunn.ch wrote: >>Hi Vivien >> >>> -static bool dsa_slave_dev_check(struct net_device *dev) >>> -{ >>> - return dev->netdev_ops == &dsa_slave_netdev_ops; >>> -} >> >>Where is the equivalent of this happening? Where do we check that the >>interface added to the bridge is part of the switch? >> >>> -int dsa_slave_netdevice_event(struct notifier_block *unused, >>> - unsigned long event, void *ptr) >>> -{ >>> - struct net_device *dev; >>> - int err = 0; >>> - >>> - switch (event) { >>> - case NETDEV_CHANGEUPPER: >>> - dev = netdev_notifier_info_to_dev(ptr); >>> - if (!dsa_slave_dev_check(dev)) >>> - goto out; >>> - >>> - err = dsa_slave_master_changed(dev); >>> - if (err && err != -EOPNOTSUPP) >>> - netdev_warn(dev, "failed to reflect master change\n"); >>> - >>> - break; >>> - } >>> - >>> -out: >>> - return NOTIFY_DONE; >>> -} >> >>How about team/bonding? We are not ready to implement it yet with the >>Marvell devices, but at some point we probably will. Won't we need the >>events then? We need to know when a switch port has been added to a >>team? >> >>Or do you think a switchdev object will be added for this case? >>Mellanox already have the ability to add switch interfaces to a team, >>and then add the team to a bridge. So we need to ensure your solution >>works for such stacked systems. > > I have to look at this more closer tomorrow, but I'm missing motivation > behind this. Using existing notifiers, drivers can easily monitor what > is going on with their uppers. Why do we need this to be changed? Yes with notifiers, drivers can monitor these changes with the NETDEV_CHANGEUPPER even. They can also forbid such bridging by returning NOTIFY_BAD in the NETDEV_PRECHANGEUPPER event if I'm not mistaken. But looking at DSA slave, Mellanox Spectrum, and Rocker, they all implement this similar heavy code, while they could support a common switchdev attribute and reduce boilerplate. 
But maybe I'm wrong, which is why I sent this as an RFC :-) Thanks, Vivien
Re: [PATCH 2/2] isdn: i4l: move active-isdn drivers to staging
On 07.03.2016 at 09:48, Paul Bolle wrote: > On Sat, 2016-03-05 at 14:08 +0100, Tilman Schmidt wrote: >> As a consequence, owners of HiSAX type adapters are in fact stuck with >> the old hisax driver if they want to continue using i4l userspace >> tools. > > Do you know whether or not mISDN tools offer functionality comparable to > i4l tools? AFAICS they don't. mISDN seems very much geared towards voice use, for example with Asterisk. The entire topic of ISDN data connections appears to be missing. I don't see how someone currently using i4l for Internet dial-in (either client or server side) could migrate to mISDN. -- Tilman Schmidt E-Mail: til...@imap.cc Bonn, Germany
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, Mar 9, 2016 at 1:47 PM, Jesper Dangaard Brouer wrote: > On Wed, 9 Mar 2016 13:43:59 -0800 > Alexander Duyck wrote: > >> On Wed, Mar 9, 2016 at 1:36 PM, Jesper Dangaard Brouer >> wrote: >> > On Wed, 09 Mar 2016 16:03:20 -0500 (EST) >> > David Miller wrote: >> > >> >> From: Alexander Duyck >> >> Date: Wed, 9 Mar 2016 08:47:58 -0800 >> >> >> >> > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer >> >> > wrote: >> >> >> Passing the budget down was Alex'es design. Axel any thoughts? >> >> > >> >> > I'd say just use dev_consume_skb_any in the bulk free instead of >> >> > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't >> >> > come up often. >> >> >> >> Agreed. >> >> >> >> >> I do wonder how expensive this check is... as it goes into a code >> >> >> hotpath, which is very unlikely. The good thing would be, that we >> >> >> handle if buggy drivers call this function from a none softirq context >> >> >> (as these bugs could be hard to catch). >> >> >> >> >> >> Can netpoll ever be called from softirq or with BH disabled? (It >> >> >> disables IRQs, which would break calling kmem_cache_free_bulk). >> >> > >> >> > It is better for us to switch things out so that the napi_consume_skb >> >> > is the fast path with dev_consume_skb_any as the slow. There are too >> >> > many scenarios where we could be invoking something that makes use of >> >> > this within the Tx path so it is probably easiest to just solve it >> >> > that way so we don't have to deal with it again in the future. >> >> >> >> Indeed. 
>> > >> > So, if I understand you correctly, then we drop the budget parameter >> > and check for in_softirq(), like: >> > >> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> > index 7af7ec635d90..a3c61a9b65d2 100644 >> > --- a/net/core/skbuff.c >> > +++ b/net/core/skbuff.c >> > @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) >> > _kfree_skb_defer(skb); >> > } >> > >> > -void napi_consume_skb(struct sk_buff *skb, int budget) >> > +void napi_consume_skb(struct sk_buff *skb) >> > { >> > if (unlikely(!skb)) >> > return; >> > >> > - /* if budget is 0 assume netpoll w/ IRQs disabled */ >> > - if (unlikely(!budget)) { >> > - dev_consume_skb_irq(skb); >> > + /* Handle if not called from NAPI context, and netpoll invocation >> > */ >> > + if (unlikely(!in_softirq())) { >> > + dev_consume_skb_any(skb); >> > return; >> > } >> > >> >> No. We still need to have the budget value. What we do though is >> have that feed into dev_consume_skb_any. >> >> The problem with using in_softirq is that it will trigger if softirqs >> are just disabled so there are more possible paths where it is >> possible. For example the transmit path has bottom halves disabled so >> I am pretty sure it might trigger this as well. We want this to only >> execute when we are running from a NAPI polling routine with a >> non-zero budget. > > What about using in_serving_softirq() instead of in_softirq() ? > (would that allow us to drop the budget parameter?) The problem is there are multiple softirq handlers you could be referring to. NAPI is just one. What if for example someone sets up a tasklet that has to perform some sort of reset with RTNL held. I am pretty sure we don't want the tasklet using the NAPI free context since there will be nothing to actually free the buffers at the end of it. We want to avoid that. That is why I was using the budget value. - Alex
[PATCH next v2 7/7] ipvlan: Implement L3-symmetric mode.
From: Mahesh Bandewar Current packet processing from IPtables perspective is asymmetric for IPvlan L3 mode. On egress path, packets hit LOCAL_OUT and POST_ROUTING hooks in slave-ns as well as master's ns however during ingress path, LOCAL_IN and PRE_ROUTING hooks are hit only in slave's ns. L3 mode is restrictive and uses master's L3 for packet processing, so it does not make sense to skip these hooks in ingress path in master's ns. The changes in this patch nominates master-dev to be the device for L3 ingress processing when skb device is the IPvlan slave. Since master device is used for L3 processing, the IPT hooks are hit in master's ns making the packet processing symmetric. The other minor change this patch does is to add a force parameter for set_port_mode() to ensure correct settings are set during the device initialization phase. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- drivers/net/ipvlan/ipvlan_main.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c index 5802b9025765..734c25e52c60 100644 --- a/drivers/net/ipvlan/ipvlan_main.c +++ b/drivers/net/ipvlan/ipvlan_main.c @@ -14,16 +14,19 @@ static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev) ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj; } -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval) +static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval, bool force) { struct ipvl_dev *ipvlan; - if (port->mode != nval) { + if (port->mode != nval || force) { list_for_each_entry(ipvlan, &port->ipvlans, pnode) { - if (nval == IPVLAN_MODE_L3) + if (nval == IPVLAN_MODE_L3) { ipvlan->dev->flags |= IFF_NOARP; - else + ipvlan->dev->l3_dev = port->dev; + } else { ipvlan->dev->flags &= ~IFF_NOARP; + ipvlan->dev->l3_dev = ipvlan->dev; + } } port->mode = nval; } @@ -392,7 +395,7 @@ static int ipvlan_nl_changelink(struct net_device 
*dev, if (data && data[IFLA_IPVLAN_MODE]) { u16 nmode = nla_get_u16(data[IFLA_IPVLAN_MODE]); - ipvlan_set_port_mode(port, nmode); + ipvlan_set_port_mode(port, nmode, false); } return 0; } @@ -479,7 +482,6 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev, memcpy(dev->dev_addr, phy_dev->dev_addr, ETH_ALEN); dev->priv_flags |= IFF_IPVLAN_SLAVE; - port->count += 1; err = register_netdevice(dev); if (err < 0) @@ -490,7 +492,7 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev, goto ipvlan_destroy_port; list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans); - ipvlan_set_port_mode(port, mode); + ipvlan_set_port_mode(port, mode, true); netif_stacked_transfer_operstate(phy_dev, dev); return 0; -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 4/7] ipv6: Use l3_dev for L3 ingress processing
From: Mahesh Bandewar Use the in-dev passed by the packet dispatcher for the L3 phase. If there are places where code uses skb->dev, use the netif_get_l3_dev() helper to get l3_dev. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/ipv6/ip6_input.c | 14 -- net/ipv6/route.c | 7 --- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index c05c425c2389..f02da6d7e097 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -67,7 +67,8 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt const struct ipv6hdr *hdr; u32 pkt_len; struct inet6_dev *idev; - struct net *net = dev_net(skb->dev); + struct net_device *l3dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(l3dev); if (skb->pkt_type == PACKET_OTHERHOST) { kfree_skb(skb); @@ -76,7 +77,7 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt rcu_read_lock(); - idev = __in6_dev_get(skb->dev); + idev = __in6_dev_get(l3dev); IP6_UPD_PO_STATS_BH(net, idev, IPSTATS_MIB_IN, skb->len); @@ -246,8 +247,8 @@ resubmit: skb_network_header_len(skb)); hdr = ipv6_hdr(skb); if (ipv6_addr_is_multicast(&hdr->daddr) && - !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, - &hdr->saddr) && + !ipv6_chk_mcast_addr(netif_get_l3_dev(skb->dev), +&hdr->daddr, &hdr->saddr) && !ipv6_is_mld(skb, nexthdr, skb_network_header_len(skb))) goto discard; } @@ -287,9 +288,10 @@ discard: int ip6_input(struct sk_buff *skb) { + struct net_device *dev = netif_get_l3_dev(skb->dev); + return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN, - dev_net(skb->dev), NULL, skb, skb->dev, NULL, - ip6_input_finish); + dev_net(dev), NULL, skb, dev, NULL, ip6_input_finish); } int ip6_mc_input(struct sk_buff *skb) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index ed446639219c..3a1b3c62d80b 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1158,11 +1158,12 @@ static struct dst_entry 
*ip6_route_input_lookup(struct net *net, void ip6_route_input(struct sk_buff *skb) { const struct ipv6hdr *iph = ipv6_hdr(skb); - struct net *net = dev_net(skb->dev); + struct net_device *dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(dev); int flags = RT6_LOOKUP_F_HAS_SADDR; struct ip_tunnel_info *tun_info; struct flowi6 fl6 = { - .flowi6_iif = l3mdev_fib_oif(skb->dev), + .flowi6_iif = l3mdev_fib_oif(dev), .daddr = iph->daddr, .saddr = iph->saddr, .flowlabel = ip6_flowinfo(iph), @@ -1174,7 +1175,7 @@ void ip6_route_input(struct sk_buff *skb) if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; skb_dst_drop(skb); - skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags)); + skb_dst_set(skb, ip6_route_input_lookup(net, dev, &fl6, flags)); } static struct rt6_info *ip6_pol_route_output(struct net *net, struct fib6_table *table, -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 6/7] xfrm: Use l3_dev for xfrm policy checks.
From: Mahesh Bandewar IPsec, whether it's tunnel mode or transport mode, is still a function of L3 so all the decisions should be based on the L3 device. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- include/net/xfrm.h | 2 +- net/xfrm/xfrm_policy.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index d6f6e5006ee9..30f9a351c3b9 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1067,7 +1067,7 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir, struct sk_buff *skb, unsigned int family, int reverse) { - struct net *net = dev_net(skb->dev); + struct net *net = dev_net(netif_get_l3_dev(skb->dev)); int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0); if (sk && sk->sk_policy[XFRM_POLICY_IN]) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index b5e665b3cfb0..c5942744f2e3 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2462,7 +2462,7 @@ static inline int secpath_has_nontransport(const struct sec_path *sp, int k, int int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) { - struct net *net = dev_net(skb->dev); + struct net *net = dev_net(netif_get_l3_dev(skb->dev)); struct xfrm_policy *pol; struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX]; int npols = 0; @@ -2620,7 +2620,7 @@ EXPORT_SYMBOL(__xfrm_policy_check); int __xfrm_route_forward(struct sk_buff *skb, unsigned short family) { - struct net *net = dev_net(skb->dev); + struct net *net = dev_net(netif_get_l3_dev(skb->dev)); struct flowi fl; struct dst_entry *dst; int res = 1; -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 5/7] raw: Use l3_dev for L3 ingress raw packet delivery
From: Mahesh Bandewar Use the in-dev passed by the packet dispatcher for the L3 phase. If there are places where code uses skb->dev, use the netif_get_l3_dev() helper to get l3_dev. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/ipv4/raw.c | 11 +-- net/ipv6/raw.c | 2 +- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 8d22de74080c..7802ce24a42d 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -173,22 +173,22 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash) struct hlist_head *head; int delivered = 0; struct net *net; + struct net_device *dev = netif_get_l3_dev(skb->dev); read_lock(&raw_v4_hashinfo.lock); head = &raw_v4_hashinfo.ht[hash]; if (hlist_empty(head)) goto out; - net = dev_net(skb->dev); + net = dev_net(dev); sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol, iph->saddr, iph->daddr, -skb->dev->ifindex); +dev->ifindex); while (sk) { delivered = 1; if ((iph->protocol != IPPROTO_ICMP || !icmp_filter(sk, skb)) && - ip_mc_sf_allow(sk, iph->daddr, iph->saddr, - skb->dev->ifindex)) { + ip_mc_sf_allow(sk, iph->daddr, iph->saddr, dev->ifindex)) { struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC); /* Not releasing hash table! 
*/ @@ -196,8 +196,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash) raw_rcv(sk, clone); } sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol, -iph->saddr, iph->daddr, -skb->dev->ifindex); +iph->saddr, iph->daddr, dev->ifindex); } out: read_unlock(&raw_v4_hashinfo.lock); diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index fa59dd7a427e..22d001f0645d 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -175,7 +175,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr) if (!sk) goto out; - net = dev_net(skb->dev); + net = dev_net(netif_get_l3_dev(skb->dev)); sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb)); while (sk) { -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 3/7] ipv4: Use l3_dev for L3 ingress processing
From: Mahesh Bandewar Use the in-dev passed by the packet dispatcher for the L3 phase. If there are places where code uses skb->dev, use the netif_get_l3_dev() helper to get l3_dev. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/ipv4/ip_input.c | 12 +++- net/ipv4/ip_options.c | 3 ++- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index e3d782746d9d..cae3503c872b 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -247,7 +247,8 @@ int ip_local_deliver(struct sk_buff *skb) /* * Reassemble IP fragments. */ - struct net *net = dev_net(skb->dev); + struct net_device *dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(dev); if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER)) @@ -255,7 +256,7 @@ int ip_local_deliver(struct sk_buff *skb) } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, - net, NULL, skb, skb->dev, NULL, + net, NULL, skb, dev, NULL, ip_local_deliver_finish); } @@ -263,7 +264,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb) { struct ip_options *opt; const struct iphdr *iph; - struct net_device *dev = skb->dev; + struct net_device *dev = netif_get_l3_dev(skb->dev); /* It looks as overkill, because not all IP options require packet mangling. 
@@ -312,6 +313,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { const struct iphdr *iph = ip_hdr(skb); struct rtable *rt; + struct net_device *dev = netif_get_l3_dev(skb->dev); if (net->ipv4.sysctl_ip_early_demux && !skb_dst(skb) && @@ -334,7 +336,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) */ if (!skb_valid_dst(skb)) { int err = ip_route_input_noref(skb, iph->daddr, iph->saddr, - iph->tos, skb->dev); + iph->tos, dev); if (unlikely(err)) { if (err == -EXDEV) NET_INC_STATS_BH(net, LINUX_MIB_IPRPFILTER); @@ -363,7 +365,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_INBCAST, skb->len); } else if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) { - struct in_device *in_dev = __in_dev_get_rcu(skb->dev); + struct in_device *in_dev = __in_dev_get_rcu(dev); /* RFC 1122 3.3.6: * diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c index bd246792360b..fc434b12a3d3 100644 --- a/net/ipv4/ip_options.c +++ b/net/ipv4/ip_options.c @@ -639,7 +639,8 @@ int ip_options_rcv_srr(struct sk_buff *skb) orefdst = skb->_skb_refdst; skb_dst_set(skb, NULL); - err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, skb->dev); + err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, +netif_get_l3_dev(skb->dev)); rt2 = skb_rtable(skb); if (err || (rt2->rt_type != RTN_UNICAST && rt2->rt_type != RTN_LOCAL)) { skb_dst_drop(skb); -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 1/7] dev: Add netif_get_l3_dev() helper
From: Mahesh Bandewar This patch adds a l3_dev to net_device struct and a helper function to retrieve it. During L3 ingress packet processing, this device will be used and should serve the current purpose of skb->dev. Since l3_dev is initialized to self; l3_dev should be pointing to skb->dev so the normal packet processing is neither altered nor should incur any additional cost (as it resides in the RX cache line). Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- include/linux/netdevice.h | 6 ++ net/core/dev.c| 1 + 2 files changed, 7 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e52077ffe5ed..1cf7e8d61043 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1738,6 +1738,7 @@ struct net_device { unsigned long gro_flush_timeout; rx_handler_func_t __rcu *rx_handler; void __rcu *rx_handler_data; + struct net_device *l3_dev; #ifdef CONFIG_NET_CLS_ACT struct tcf_proto __rcu *ingress_cl_list; @@ -4085,6 +4086,11 @@ static inline void netif_keep_dst(struct net_device *dev) dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM); } +static inline struct net_device *netif_get_l3_dev(struct net_device *dev) +{ + return dev->l3_dev; +} + extern struct pernet_operations __net_initdata loopback_net_ops; /* Logging, debugging and troubleshooting/diagnostic helpers. */ diff --git a/net/core/dev.c b/net/core/dev.c index edb7179bc051..c4023a68cdc1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7463,6 +7463,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, if (!dev->ethtool_ops) dev->ethtool_ops = &default_ethtool_ops; + dev->l3_dev = dev; nf_hook_ingress_init(dev); return dev; -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
From: Mahesh Bandewar One of the major requests (for enhancement) that I have received from various users of IPvlan in L3 mode concerns its inability to handle IPtables. Looking at the code and how we handle ingress, the problem can be attributed to the asymmetry in the way packets get processed for IPvlan devices configured in L3 mode. L3 mode is supposed to be restrictive, and all the L3 decisions need to be taken for the traffic in the master's ns. This does happen as expected for egress traffic; however, on ingress, the IPvlan packet handler changes skb->dev, which forces the packet to be processed with the IPvlan slave and its associated ns. This causes the above-mentioned problem and a few others which are not yet reported/attempted, e.g. IPsec with L3 mode or even ingress routing. This could have been solved if we had a way to hand the packet over to the slave and its associated ns after completing the L3 phase, but that is a non-trivial issue to fix, especially looking at the IPsec code. This patch series attempts to solve the problem by introducing a device pointer, l3_dev, which resides in the net_device structure in the RX cache line. We initialize l3_dev to self, which means there is no complex logic for when and how to initialize it. The stack will now use this dev pointer during the L3 phase. This should not alter any existing properties/behavior, and there should not be any additional penalty since it resides in the same RX cache line. Now coming back to the IPvlan setup: in the typical IPvlan L3 setup, the master is in the default ns while the slaves are put inside different (slave) net-ns. During the initialization phase, the driver can point l3_dev for each slave at its master device. This solves the current IPT problem as well as making the packet processing symmetric from the stack's perspective. For all other packet processing, since dev->l3_dev is the same as dev, there should be no behavioral or functional change.
Mahesh Bandewar (7): dev: Add netif_get_l3_dev() helper dev: Update packet dispatcher to pass l3_dev as an input device ipv4: Use l3_dev for L3 ingress processing ipv6: Use l3_dev for L3 ingress processing raw: Use l3_dev for L3 ingress raw packet delivery xfrm: Use l3_dev for xfrm policy checks. ipvlan: Implement L3-symmetric mode. drivers/net/ipvlan/ipvlan_main.c | 16 +--- include/linux/netdevice.h| 6 ++ include/net/xfrm.h | 2 +- net/core/dev.c | 10 +++--- net/ipv4/ip_input.c | 12 +++- net/ipv4/ip_options.c| 3 ++- net/ipv4/raw.c | 11 +-- net/ipv6/ip6_input.c | 14 -- net/ipv6/raw.c | 2 +- net/ipv6/route.c | 7 --- net/xfrm/xfrm_policy.c | 4 ++-- 11 files changed, 52 insertions(+), 35 deletions(-) -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 2/7] dev: Update packet dispatcher to pass l3_dev as an input device
From: Mahesh Bandewar netif_receive_skb_core() dispatcher uses skb->dev device as an input device to all packet-handlers (e.g. ip_rcv, ipv6_rcv etc) These packet handlers use this device reference to start processing L3 phase of the packet-processing. This includes IPT hooks, ingress route-lookup, and xfrm policy lookups etc. This patch passes l3_dev as an input device and shifts L3 processing from skb->dev to l3_dev (skb->dev->l3_dev). Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/core/dev.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index c4023a68cdc1..9252436ef11a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1811,7 +1811,8 @@ static inline int deliver_skb(struct sk_buff *skb, if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) return -ENOMEM; atomic_inc(&skb->users); - return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + return pt_prev->func(skb, netif_get_l3_dev(skb->dev), pt_prev, +orig_dev); } static inline void deliver_ptype_list_skb(struct sk_buff *skb, @@ -1904,7 +1905,8 @@ again: } out_unlock: if (pt_prev) - pt_prev->func(skb2, skb->dev, pt_prev, skb->dev); + pt_prev->func(skb2, netif_get_l3_dev(skb->dev), pt_prev, + skb->dev); rcu_read_unlock(); } @@ -4157,7 +4159,8 @@ ncls: if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) goto drop; else - ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + ret = pt_prev->func(skb, netif_get_l3_dev(skb->dev), + pt_prev, orig_dev); } else { drop: if (!deliver_exact) -- 2.7.0.rc3.207.g0ac5344
Re: [net-next PATCH 3/4] vxlan: Enforce IP ID verification on outer headers
On Wed, Mar 9, 2016 at 1:02 PM, David Miller wrote: > From: Jesse Gross > Date: Mon, 7 Mar 2016 15:42:59 -0800 > >> On Mon, Mar 7, 2016 at 3:06 PM, Alex Duyck wrote: >>> On Mon, Mar 7, 2016 at 11:09 AM, David Miller wrote: From: Or Gerlitz Date: Mon, 7 Mar 2016 20:05:20 +0200 > On Mon, Mar 7, 2016 at 7:22 PM, Alexander Duyck > wrote: >> This change enforces the IP ID verification on outer headers. As a >> result >> if the DF flag is not set on the outer header we will force the flow to >> be >> flushed in the event that the IP ID is out of sequence with the existing >> flow. > > Can you please state the precise requirement for aggregation w.r.t IP > IDs here? and point to where/how this is enforced, e.g for > non-tunneled TCP GRO-ing? I also didn't see a nice "PATCH 0/4" posting explaining this series and I'd really like to see that. >>> >>> Sorry about that. I forgot to add the cover page when I sent this. >>> >>> The enforcement is coming from the IP and TCP layers. If you take a >>> look in inet_gro_receive we have the NAPI_GRO_CB(p)->flush_id value >>> being populated based on the difference between the expected ID and >>> the received one. So for IPv4 we overwrite it, and for IPv6 we set it >>> to 0. The only consumer currently using it is TCP in tcp_gro_receive. >>> The problem is with tunnels we lose the data for the outer when the >>> inner overwrites it, as a result we can put whatever we want currently >>> in the outer IP ID and it will be accepted. >>> >>> The patch set is based off of a conversation several of us had on the >>> list about doing TSO for tunnels and the fact that the IP IDs for the >>> outer header have to advance. It makes it easier for me to validate >>> that I am doing things properly if GRO doesn't destroy the IP ID data >>> for the outer headers. 
>> >> In net/ipv4/af_inet.c:inet_gro_receive() there is the following >> comment above where NAPI_GRO_CB(p)->flush_id is set: >> >> /* Save the IP ID check to be included later when we get to >> * the transport layer so only the inner most IP ID is checked. >> * This is because some GSO/TSO implementations do not >> * correctly increment the IP ID for the outer hdrs. >> */ >> >> There was a long discussion about this a couple years ago and the >> conclusion was that it is the inner IP ID is really the important one >> in the case of encapsulation. Obviously, things like TCP/IP header >> compression don't apply to the outer encapsulation header. > > That's how I remember the conversation going as wel... Given that then, how would you feel about a software scheme to use existing TSO support on devices to support UDP/GRE tunnel TSO by pre-computing the values for the outer headers and then having those duplicated on TSO frames? This is kind of what got me started down the path of validating the outer IP ID. If we are okay with UDP or GRE encapsulated tunnels using the same IP ID for the outer headers then I should be able to code up something in software so that we can get more devices supporting tunnel TSO offload for tunnels. - Alex
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, 9 Mar 2016 13:43:59 -0800 Alexander Duyck wrote: > On Wed, Mar 9, 2016 at 1:36 PM, Jesper Dangaard Brouer > wrote: > > On Wed, 09 Mar 2016 16:03:20 -0500 (EST) > > David Miller wrote: > > > >> From: Alexander Duyck > >> Date: Wed, 9 Mar 2016 08:47:58 -0800 > >> > >> > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer > >> > wrote: > >> >> Passing the budget down was Alex'es design. Axel any thoughts? > >> > > >> > I'd say just use dev_consume_skb_any in the bulk free instead of > >> > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't > >> > come up often. > >> > >> Agreed. > >> > >> >> I do wonder how expensive this check is... as it goes into a code > >> >> hotpath, which is very unlikely. The good thing would be, that we > >> >> handle if buggy drivers call this function from a none softirq context > >> >> (as these bugs could be hard to catch). > >> >> > >> >> Can netpoll ever be called from softirq or with BH disabled? (It > >> >> disables IRQs, which would break calling kmem_cache_free_bulk). > >> > > >> > It is better for us to switch things out so that the napi_consume_skb > >> > is the fast path with dev_consume_skb_any as the slow. There are too > >> > many scenarios where we could be invoking something that makes use of > >> > this within the Tx path so it is probably easiest to just solve it > >> > that way so we don't have to deal with it again in the future. > >> > >> Indeed. 
> > > > So, if I understand you correctly, then we drop the budget parameter > > and check for in_softirq(), like: > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 7af7ec635d90..a3c61a9b65d2 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) > > _kfree_skb_defer(skb); > > } > > > > -void napi_consume_skb(struct sk_buff *skb, int budget) > > +void napi_consume_skb(struct sk_buff *skb) > > { > > if (unlikely(!skb)) > > return; > > > > - /* if budget is 0 assume netpoll w/ IRQs disabled */ > > - if (unlikely(!budget)) { > > - dev_consume_skb_irq(skb); > > + /* Handle if not called from NAPI context, and netpoll invocation */ > > + if (unlikely(!in_softirq())) { > > + dev_consume_skb_any(skb); > > return; > > } > > > > No. We still need to have the budget value. What we do though is > have that feed into dev_consume_skb_any. > > The problem with using in_softirq is that it will trigger if softirqs > are just disabled so there are more possible paths where it is > possible. For example the transmit path has bottom halves disabled so > I am pretty sure it might trigger this as well. We want this to only > execute when we are running from a NAPI polling routine with a > non-zero budget. What about using in_serving_softirq() instead of in_softirq() ? (would that allow us to drop the budget parameter?) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net] r8169: Remove unnecessary phy reset for pcie nic when setting link speed.
Hau : [...] > Unless the PCIe NIC has a bug, it does not need to reset the PHY to bring the > link up. > > There is a counter for PHY speed-down. If the PHY is in the link-down state, this > counter will start to count down. When it reaches 0, the PHY will speed down. > Resetting the PHY reloads this counter and prevents the PHY from speeding down. Thanks for the clarification. -- Ueimor
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, Mar 9, 2016 at 1:36 PM, Jesper Dangaard Brouer wrote: > On Wed, 09 Mar 2016 16:03:20 -0500 (EST) > David Miller wrote: > >> From: Alexander Duyck >> Date: Wed, 9 Mar 2016 08:47:58 -0800 >> >> > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer >> > wrote: >> >> Passing the budget down was Alex'es design. Axel any thoughts? >> > >> > I'd say just use dev_consume_skb_any in the bulk free instead of >> > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't >> > come up often. >> >> Agreed. >> >> >> I do wonder how expensive this check is... as it goes into a code >> >> hotpath, which is very unlikely. The good thing would be, that we >> >> handle if buggy drivers call this function from a none softirq context >> >> (as these bugs could be hard to catch). >> >> >> >> Can netpoll ever be called from softirq or with BH disabled? (It >> >> disables IRQs, which would break calling kmem_cache_free_bulk). >> > >> > It is better for us to switch things out so that the napi_consume_skb >> > is the fast path with dev_consume_skb_any as the slow. There are too >> > many scenarios where we could be invoking something that makes use of >> > this within the Tx path so it is probably easiest to just solve it >> > that way so we don't have to deal with it again in the future. >> >> Indeed. 
> > So, if I understand you correctly, then we drop the budget parameter > and check for in_softirq(), like: > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 7af7ec635d90..a3c61a9b65d2 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) > _kfree_skb_defer(skb); > } > > -void napi_consume_skb(struct sk_buff *skb, int budget) > +void napi_consume_skb(struct sk_buff *skb) > { > if (unlikely(!skb)) > return; > > - /* if budget is 0 assume netpoll w/ IRQs disabled */ > - if (unlikely(!budget)) { > - dev_consume_skb_irq(skb); > + /* Handle if not called from NAPI context, and netpoll invocation */ > + if (unlikely(!in_softirq())) { > + dev_consume_skb_any(skb); > return; > } > No. We still need to have the budget value. What we do though is have that feed into dev_consume_skb_any. The problem with using in_softirq is that it will trigger if softirqs are just disabled so there are more possible paths where it is possible. For example the transmit path has bottom halves disabled so I am pretty sure it might trigger this as well. We want this to only execute when we are running from a NAPI polling routine with a non-zero budget. - Alex
Re: [RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Hi Vivien, Wed, Mar 09, 2016 at 07:42:47PM IST, vivien.dide...@savoirfairelinux.com wrote: >Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is >set before adding a port to a bridge and deleting a port from a bridge. > >The main purpose for this attribute is to provide switchdev users a >simple and common way to retrieve bridging information, instead of >implementing complex notifier blocks to listen to global netdev events. > >We can also imagine a switchdev user returning an error different from >-EOPNOTSUPP in the prepare phase to prevent a port from being bridged. I don't really understand the motivation for this change. We are already doing all these stuff with the notifiers and it's pretty straight-forward. In fact, I believe using an existing mechanism instead of introducing more switchdev hooks is more elegant. This RFC only deals with bridge, but you'll have to do the same for team, bond and vlan devices. And you'll probably place the hooks in the exact locations where the notifiers are called from anyway. 
> >Signed-off-by: Vivien Didelot >--- > include/net/switchdev.h | 2 ++ > net/bridge/br_if.c | 27 +++ > 2 files changed, 29 insertions(+) > >diff --git a/include/net/switchdev.h b/include/net/switchdev.h >index d451122..65f8514 100644 >--- a/include/net/switchdev.h >+++ b/include/net/switchdev.h >@@ -46,6 +46,7 @@ enum switchdev_attr_id { > SWITCHDEV_ATTR_ID_PORT_PARENT_ID, > SWITCHDEV_ATTR_ID_PORT_STP_STATE, > SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS, >+ SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, > SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME, > SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING, > }; >@@ -58,6 +59,7 @@ struct switchdev_attr { > struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */ > u8 stp_state; /* PORT_STP_STATE */ > unsigned long brport_flags; /* PORT_BRIDGE_FLAGS */ >+ bool join; /* PORT_BRIDGE_IF */ > u32 ageing_time;/* BRIDGE_AGEING_TIME */ > bool vlan_filtering;/* > BRIDGE_VLAN_FILTERING */ > } u; >diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c >index a73df33..105b9fd 100644 >--- a/net/bridge/br_if.c >+++ b/net/bridge/br_if.c >@@ -28,6 +28,24 @@ > > #include "br_private.h" > >+static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br, >+ bool join) >+{ >+ struct switchdev_attr attr = { >+ .orig_dev = br->dev, This should be just 'dev', since you need to know for which stacked device on top of the port this was called for. This also means you'll have to call netdev_master_upper_dev_get() from within your driver if you want to limit the number of VLAN filtering bridges (for example). However, since this is called before bridge dev and dev itself are linked, you'll get NULL. >+ .id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, >+ .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, >+ .u.join = join, >+ }; >+ int err; >+ >+ err = switchdev_port_attr_set(dev, &attr); >+ if (err && err != -EOPNOTSUPP) >+ return err; >+ >+ return 0; >+} >+ > /* > * Determine initial path cost based on speed. 
> * using recommendations from 802.1d standard >@@ -297,6 +315,10 @@ static void del_nbp(struct net_bridge_port *p) > br_netpoll_disable(p); > > call_rcu(&p->rcu, destroy_nbp_rcu); >+ >+ if (switchdev_bridge_if(dev, br, false)) >+ br_warn(br, "error unbridging port %u(%s)\n", >+ (unsigned int) p->port_no, dev->name); > } > > /* Delete bridge device */ >@@ -347,6 +369,11 @@ static struct net_bridge_port *new_nbp(struct net_bridge >*br, > { > int index; > struct net_bridge_port *p; >+ int err; >+ >+ err = switchdev_bridge_if(dev, br, true); If you look at br_add_if() - where new_nbp() is called from - then you'll see that you aren't rollbacking this operation in case of error. Same for subsequent errors in this function I believe. >+ if (err) >+ return ERR_PTR(err); > > index = find_portno(br); > if (index < 0) >-- >2.7.2 > Maybe this is something we'll have to do in the future, but for now I think we are OK with the notifiers. :) Thanks Vivien!
Re: [PATCH v2 net-next 00/13] kcm: Kernel Connection Multiplexor (KCM)
From: Tom Herbert Date: Mon, 7 Mar 2016 14:10:59 -0800 > Kernel Connection Multiplexor (KCM) is a facility that provides a > message based interface over TCP for generic application protocols. ... I've decided to apply this as-is. If there are any issues we can fix it with follow-on patches. Thanks for all of the hard work Tom.
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, 09 Mar 2016 16:03:20 -0500 (EST) David Miller wrote: > From: Alexander Duyck > Date: Wed, 9 Mar 2016 08:47:58 -0800 > > > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer > > wrote: > >> Passing the budget down was Alex'es design. Axel any thoughts? > > > > I'd say just use dev_consume_skb_any in the bulk free instead of > > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't > > come up often. > > Agreed. > > >> I do wonder how expensive this check is... as it goes into a code > >> hotpath, which is very unlikely. The good thing would be, that we > >> handle if buggy drivers call this function from a none softirq context > >> (as these bugs could be hard to catch). > >> > >> Can netpoll ever be called from softirq or with BH disabled? (It > >> disables IRQs, which would break calling kmem_cache_free_bulk). > > > > It is better for us to switch things out so that the napi_consume_skb > > is the fast path with dev_consume_skb_any as the slow. There are too > > many scenarios where we could be invoking something that makes use of > > this within the Tx path so it is probably easiest to just solve it > > that way so we don't have to deal with it again in the future. > > Indeed. 
So, if I understand you correctly, then we drop the budget parameter and check for in_softirq(), like: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7af7ec635d90..a3c61a9b65d2 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) _kfree_skb_defer(skb); } -void napi_consume_skb(struct sk_buff *skb, int budget) +void napi_consume_skb(struct sk_buff *skb) { if (unlikely(!skb)) return; - /* if budget is 0 assume netpoll w/ IRQs disabled */ - if (unlikely(!budget)) { - dev_consume_skb_irq(skb); + /* Handle if not called from NAPI context, and netpoll invocation */ + if (unlikely(!in_softirq())) { + dev_consume_skb_any(skb); return; } -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [ovs-dev] [PATCH net-next] ovs: allow nl 'flow set' to use ufid without flow key
On Wed, Mar 9, 2016 at 9:05 AM, Samuel Gauthier wrote: > When we want to change a flow using netlink, we have to identify it to > be able to perform a lookup. Both the flow key and unique flow ID > (ufid) are valid identifiers, but we always have to specify the flow > key in the netlink message. When both attributes are there, the ufid > is used. > > This commit allows to use the ufid without having to provide the flow > key, as it is already done in the netlink 'flow get' and 'flow del' > path. > > Signed-off-by: Samuel Gauthier > --- > net/openvswitch/datapath.c | 21 ++--- > 1 file changed, 10 insertions(+), 11 deletions(-) > > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c > index deadfdab1bc3..06f15143cf2a 100644 > --- a/net/openvswitch/datapath.c > +++ b/net/openvswitch/datapath.c > @@ -1100,21 +1100,20 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, > struct genl_info *info) > struct sw_flow_match match; > struct sw_flow_id sfid; > u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]); > - int error; > + int error = 0; > bool log = !a[OVS_FLOW_ATTR_PROBE]; > bool ufid_present; > > - /* Extract key. */ > - error = -EINVAL; > - if (!a[OVS_FLOW_ATTR_KEY]) { > - OVS_NLERR(log, "Flow key attribute not present in set flow."); > - goto error; > - } > - > ufid_present = ovs_nla_get_ufid(&sfid, a[OVS_FLOW_ATTR_UFID], log); > - ovs_match_init(&match, &key, &mask); > - error = ovs_nla_get_match(net, &match, a[OVS_FLOW_ATTR_KEY], > - a[OVS_FLOW_ATTR_MASK], log); > + if (a[OVS_FLOW_ATTR_KEY]) { > + ovs_match_init(&match, &key, NULL); > + error = ovs_nla_get_match(net, &match, a[OVS_FLOW_ATTR_KEY], > + NULL, log); > + } else if (!ufid_present) { > + OVS_NLERR(log, > + "Flow set message rejected, Key attribute > missing."); > + error = -EINVAL; > + } Set command sets new action for given flow. The action validation depends on the flow key and mask. so userspace needs to provide the key and mask.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 04:10:38PM -0500, David Miller wrote: > > > > and here we call for NETDEV_DOWN, which then hits masq_device_event > > and go further to conntrack code. > > Yes that's where the notifier comes from, which happens with or without > my patch. Thanks for the explanation, Dave! I'll continue on this task tomorrow, trying to implement the optimization you proposed.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
From: Cyrill Gorcunov Date: Wed, 9 Mar 2016 23:57:47 +0300 > Aha! So in your patch __inet_del_ifa bypass first blocking_notifier_call_chain > > __inet_del_ifa > ... > if (in_dev->dead) > goto no_promotions; > > // First call to NETDEV_DOWN > ... > no_promotions: > rtmsg_ifa(RTM_DELADDR, ifa1, nlh, portid); > blocking_notifier_call_chain(&inetaddr_chain, NETDEV_DOWN, ifa1); > > and here we call for NETDEV_DOWN, which then hits masq_device_event > and go further to conntrack code. Yes that's where the notifier comes from, which happens with or without my patch.
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
From: Alexander Duyck Date: Wed, 9 Mar 2016 08:47:58 -0800 > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer > wrote: >> Passing the budget down was Alex'es design. Axel any thoughts? > > I'd say just use dev_consume_skb_any in the bulk free instead of > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't > come up often. Agreed. >> I do wonder how expensive this check is... as it goes into a code >> hotpath, which is very unlikely. The good thing would be, that we >> handle if buggy drivers call this function from a none softirq context >> (as these bugs could be hard to catch). >> >> Can netpoll ever be called from softirq or with BH disabled? (It >> disables IRQs, which would break calling kmem_cache_free_bulk). > > It is better for us to switch things out so that the napi_consume_skb > is the fast path with dev_consume_skb_any as the slow. There are too > many scenarios where we could be invoking something that makes use of > this within the Tx path so it is probably easiest to just solve it > that way so we don't have to deal with it again in the future. Indeed.
Re: [net-next PATCH 3/4] vxlan: Enforce IP ID verification on outer headers
From: Jesse Gross Date: Mon, 7 Mar 2016 15:42:59 -0800 > On Mon, Mar 7, 2016 at 3:06 PM, Alex Duyck wrote: >> On Mon, Mar 7, 2016 at 11:09 AM, David Miller wrote: >>> From: Or Gerlitz >>> Date: Mon, 7 Mar 2016 20:05:20 +0200 >>> On Mon, Mar 7, 2016 at 7:22 PM, Alexander Duyck wrote: > This change enforces the IP ID verification on outer headers. As a result > if the DF flag is not set on the outer header we will force the flow to be > flushed in the event that the IP ID is out of sequence with the existing > flow. Can you please state the precise requirement for aggregation w.r.t IP IDs here? and point to where/how this is enforced, e.g for non-tunneled TCP GRO-ing? >>> >>> I also didn't see a nice "PATCH 0/4" posting explaining this series and >>> I'd really like to see that. >> >> Sorry about that. I forgot to add the cover page when I sent this. >> >> The enforcement is coming from the IP and TCP layers. If you take a >> look in inet_gro_receive we have the NAPI_GRO_CB(p)->flush_id value >> being populated based on the difference between the expected ID and >> the received one. So for IPv4 we overwrite it, and for IPv6 we set it >> to 0. The only consumer currently using it is TCP in tcp_gro_receive. >> The problem is with tunnels we lose the data for the outer when the >> inner overwrites it, as a result we can put whatever we want currently >> in the outer IP ID and it will be accepted. >> >> The patch set is based off of a conversation several of us had on the >> list about doing TSO for tunnels and the fact that the IP IDs for the >> outer header have to advance. It makes it easier for me to validate >> that I am doing things properly if GRO doesn't destroy the IP ID data >> for the outer headers. > > In net/ipv4/af_inet.c:inet_gro_receive() there is the following > comment above where NAPI_GRO_CB(p)->flush_id is set: > > /* Save the IP ID check to be included later when we get to > * the transport layer so only the inner most IP ID is checked. 
> * This is because some GSO/TSO implementations do not > * correctly increment the IP ID for the outer hdrs. > */ > > There was a long discussion about this a couple years ago and the > conclusion was that it is the inner IP ID is really the important one > in the case of encapsulation. Obviously, things like TCP/IP header > compression don't apply to the outer encapsulation header. That's how I remember the conversation going as wel...
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 03:47:25PM -0500, David Miller wrote: > From: Cyrill Gorcunov > Date: Wed, 9 Mar 2016 23:41:58 +0300 > > > On Wed, Mar 09, 2016 at 03:27:30PM -0500, David Miller wrote: > >> > > >> > Yes. I can drop it off for a while and run tests without it, > >> > then turn it back and try again. Would you like to see such > >> > numbers? > >> > >> That would be very helpful, yes. > > > > Just sent out. Take a look please. Indeed it sits inside get_next_corpse > > a lot. And now I think I've to figure out where we can optimize it. > > Continue tomorrow. > > The problem is that the masquerading code flushes the entire conntrack > table once for _every_ address removed. > > The code path is: > > masq_device_event() > if (event == NETDEV_DOWN) { > /* Device was downed. Search entire table for >* conntracks which were associated with that device, >* and forget them. >*/ > NF_CT_ASSERT(dev->ifindex != 0); > > nf_ct_iterate_cleanup(net, device_cmp, > (void *)(long)dev->ifindex, 0, 0); > > So if you have a million IP addresses, this flush happens a million times > on inetdev destroy. > > Part of the problem is that we emit NETDEV_DOWN inetdev notifiers per > address removed, instead of once per inetdev destroy. > > Maybe if we put some boolean state into the inetdev, we could make sure > we did this flush only once time while inetdev->dead = 1. Aha! So in your patch __inet_del_ifa bypass first blocking_notifier_call_chain __inet_del_ifa ... if (in_dev->dead) goto no_promotions; // First call to NETDEV_DOWN ... no_promotions: rtmsg_ifa(RTM_DELADDR, ifa1, nlh, portid); blocking_notifier_call_chain(&inetaddr_chain, NETDEV_DOWN, ifa1); and here we call for NETDEV_DOWN, which then hits masq_device_event and go further to conntrack code.
Re: [PATCH net 0/3] validate variable length ll headers
From: Willem de Bruijn Date: Fri, 4 Mar 2016 15:44:14 -0500 > From: Willem de Bruijn > > Allow device-specific validation of link layer headers. Existing > checks drop all packets shorter than hard_header_len. For variable > length protocols, such packets can be valid. > > patch 1 adds header_ops.validate and dev_validate_header > patch 2 replaces ll_header_truncated with dev_validate_header > patch 3 implements the protocol specific callback for AX25 > > Tested with a temporary eth_header_validate function. The AX25 > code is compile-tested only at this point. I'm not going to be able to send another pull request to Linus before -final, so please respin this against net-next and I'll queue it up for -stable. You can add the missing Fixes: tags as well when you do this. Thanks.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
From: Cyrill Gorcunov Date: Wed, 9 Mar 2016 23:41:58 +0300 > On Wed, Mar 09, 2016 at 03:27:30PM -0500, David Miller wrote: >> > >> > Yes. I can drop it off for a while and run tests without it, >> > then turn it back and try again. Would you like to see such >> > numbers? >> >> That would be very helpful, yes. > > Just sent out. Take a look please. Indeed it sits inside get_next_corpse > a lot. And now I think I've to figure out where we can optimize it. > Continue tomorrow. The problem is that the masquerading code flushes the entire conntrack table once for _every_ address removed. The code path is: masq_device_event() if (event == NETDEV_DOWN) { /* Device was downed. Search entire table for * conntracks which were associated with that device, * and forget them. */ NF_CT_ASSERT(dev->ifindex != 0); nf_ct_iterate_cleanup(net, device_cmp, (void *)(long)dev->ifindex, 0, 0); So if you have a million IP addresses, this flush happens a million times on inetdev destroy. Part of the problem is that we emit NETDEV_DOWN inetdev notifiers per address removed, instead of once per inetdev destroy. Maybe if we put some boolean state into the inetdev, we could make sure we did this flush only once time while inetdev->dead = 1.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 03:27:30PM -0500, David Miller wrote: > > > > Yes. I can drop it off for a while and run tests without it, > > then turn it back and try again. Would you like to see such > > numbers? > > That would be very helpful, yes. Just sent out. Take a look please. Indeed it sits inside get_next_corpse a lot. And now I think I've to figure out where we can optimize it. Continue tomorrow.
Re: [PATCH] inet: set `error' value not under lock_sock().
From: Weongyo Jeong Date: Wed, 9 Mar 2016 10:22:21 -0800 > A trivial patch to set `error' variable while not holding > lock_sock(). > > Signed-off-by: Weongyo Jeong Deferring the assignment means gcc doesn't have to potentially put it on the stack across the lock_sock() call. You're making the code worse not better, the assignment into a local variable is not an expensive operation. I'm not applying this change, sorry.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
From: Cyrill Gorcunov Date: Wed, 9 Mar 2016 20:53:07 +0300 > On Wed, Mar 09, 2016 at 12:24:00PM -0500, David Miller wrote: > ... >> We asked you for numbers without a lot of features enabled, it'll >> help us diagnose which subsystem still causes a lot of overhead >> much more clearly. >> >> So please do so. > > Sure. Gimme some time and I'll back with numbers. > >> Although it's already pretty clear that netfilter conntrack >> cleanup is insanely expensive. > > Yes. I can drop it off for a while and run tests without it, > then turn it back and try again. Would you like to see such > numbers? That would be very helpful, yes.
Re: [PATCH 0/3] net: macb: Fix coding style issues
From: Michal Simek Date: Wed, 9 Mar 2016 18:29:01 +0100 > On 9.3.2016 18:22, David Miller wrote: >> From: Michal Simek >> Date: Wed, 9 Mar 2016 17:29:39 +0100 >> >>> On 7.3.2016 18:13, Nicolas Ferre wrote: Le 07/03/2016 17:17, Moritz Fischer a écrit : > Hi Nicolas, > > this series deals with most of the checkpatch warnings > generated for macb. There are two BUG_ON()'s that I didn't touch, yet, > that were suggested by checkpatch, that I can address in a follow up > commit if needed. > Let me know if you want me to split the fixes differently or squash > them into one commit. Hi, I'm not usually fond of this type of patches, but I must admit that this series corrects some style issues. So, I would like more feedback from Michal and Cyrille as these changes may delay some of the not-merged-yet features or more important work-in-progress on their side. On the other hand, if we all think it's a calm period for this macb driver, we may find interesting to merge some "cleanup and style" enhancements. >>> >>> Not a problem with merging cleanups in general. We have several out of >>> tree patches but doesn't make sense to to wait. >>> I wasn't in cc for the series but I don't like this change to be the >>> part of cleanup series. >>> >>> mac = of_get_mac_address(np); >>> if (mac) >>> - memcpy(bp->dev->dev_addr, mac, ETH_ALEN); >>> + ether_addr_copy(bp->dev->dev_addr, mac); >> >> Why? This is what we tell people to use. > > I would expect this as separate patch not the part of one huge cleanup > patch which does just comment and space cleanups. That is true.
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
On Wed, Mar 09, 2016 at 02:32:05PM -0500, Vivien Didelot wrote: > Hi Andrew, > > Andrew Lunn writes: > > >> -static bool dsa_slave_dev_check(struct net_device *dev) > >> -{ > >> - return dev->netdev_ops == &dsa_slave_netdev_ops; > >> -} > > > > Where is the equivalent of this happening? Where do we check that the > > interface added to the bridge is part of the switch? > > Why should we check that? In this RFC, br_if.c tries to set the new > attribute to the net_device, when creating and deleting the net bridge > port. If it supports attr_set and this attribute, then we're good. Or am > I missing something? One of us is missing something... What happens if I have two DSA clusters? We probably want to limit the object to only being passed to the DSA cluster which contains the port, or once we receive the object, we verify it belongs to the cluster processing it. What happens when a team/bond interface is added to the bridge? In the future we need to know about this, so we can add the trunk in Marvell's terms to the bridge. > > How about team/bonding? We are not ready to implement it yet with the > > Marvell devices, but at some point we probably will. Won't we need the > > events then? We need to know when a switch port has been added to a > > team? > > > > Or do you think a switchdev object will be added for this case? > > Mellanox already have the ability to add switch interfaces to a team, > > and then add the team to a bridge. So we need to ensure your solution > > works for such stacked systems. > > Indeed these features can be propagated through new switchdev attributes > or objects. > > I think it'd be preferable to factorize the switch related operations > into the switchdev API, instead of having every single switchdev user > implement its custom (but similar) listeners and checks for global > netdev events. What do you think? Centralizing the code would be good.
But DSA is way behind what Mellanox can do, so you need to look at how your changes fit into their driver. During a netdev 1.1 BOF there was a conversation about the stack of interfaces, teams/bonds, bridges, etc. If the video is available, you might find it interesting. Andrew
pull-request: wireless-drivers-next 2016-03-09
Hi Dave,

here's a pull request for 4.6. I'm planning to send one more but I'm not sure if it will make it in time, we'll see. Notable changes here are refactoring in bcma to create a common flash driver, brcmfmac platform data improvements in include/linux, and the beginning of AHB bus support for ath10k along with a device tree binding update. More info in the signed tag below.

Oh, and I just noticed that, weirdly, my name appeared in the author field on one of Janusz's ath9k patches. But I didn't consider this bad enough to rebase everything, so I let it be. Sorry Janusz :)

commit b9a9693fd9aea43f50b107dfc8cbaea317f95a79
Author: Kalle Valo
Date:   Fri Nov 27 09:37:14 2015 +0100

    ath9k: request NOA update when chanctx active

Please let me know if you have any problems.

Kalle

The following changes since commit 00a1f0a93dea3cf1c141df79bfd06e7c9ee54162:

  Merge branch 'reset_mac_header' (2016-03-04 22:45:14 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2016-03-09

for you to fetch changes up to 836856e3bd61d0644e5178a2c1b51d90459e2788:

  wireless: cw1200: use __maybe_unused to hide pm functions (2016-03-08 12:32:52 +0200)

wireless-drivers patches for 4.6

Major changes:

ath10k
* dt: add bindings for ipq4019 wifi block
* start adding support for qca4019 chip

ath9k
* add device ID for Toshiba WLM-20U2/GN-1080
* allow more than one interface on DFS channels

bcma
* move flash detection code to ChipCommon core driver

brcmfmac
* IPv6 Neighbor discovery offload
* driver settings that can be populated from different sources
* country code setting in firmware
* length checks to validate firmware events
* new way to determine device memory size needed for BCM4366
* various offloads during Wake on Wireless LAN (WoWLAN)
* full Management Frame Protection (MFP) support

iwlwifi
* add support for thermal device / cooling device
* improvements in scheduled scan without profiles
* new firmware support (-21.ucode)
* add MSIX support for 9000 devices
* enable MU-MIMO and take care of firmware restart
* add support for large SKBs in mvm to reach A-MSDU
* add support for filtering frames from a BA session
* start implementing the new Rx path for 9000 devices
* enable the new Radio Resource Management (RRM) nl80211 feature flag
* add a new module parameter to disable VHT
* build infrastructure for Dynamic Queue Allocation

Alexander Tsoy (1):
      ath9k_htc: add device ID for Toshiba WLM-20U2/GN-1080

Alexey Khoroshilov (1):
      at76c50x-usb: avoid double usb_put_dev() after downloading internal firmware in at76_probe()

Amitkumar Karwar (1):
      mwifiex: fix corner case association failure

Amitoj Kaur Chawla (1):
      mwifiex: Use to_delayed_work()

Andrei Otcheretianski (1):
      iwlwifi: add disable_11ac module param

Anilkumar Kolli (1):
      ath10k: reduce number of peers to support peer stats feature

Anthony Wong (1):
      rt2x00: add new rt2800usb device Buffalo WLI-UC-G450

Anton Protopopov (1):
      ath10k: fix erroneous return value

Arend van Spriel (1):
      brcmfmac: change function name for brcmf_cfg80211_wait_vif_event_timeout()

Arnd Bergmann (2):
      ath9k: reduce stack usage in ar9003_aic_cal_post_process
      wireless: cw1200: use __maybe_unused to hide pm functions

Ashok Raj Nagarajan (2):
      ath10k: fix pktlog in QCA99X0
      ath10k: add hw_rev to trace events to support pktlog

Avri Altman (2):
      iwlwifi: mvm: forbid U-APSD for P2P Client if the firmware doesn't support it
      iwlwifi: mvm: Send power command on BSS_CHANGED_BEACON_INFO if needed

Beni Lev (1):
      iwlwifi: mvm: Set global RRM capability

Bruno Randolf (3):
      rtl8xxxu: Enable monitor mode by handling filters
      rtl8xxxu: Document REG_RXFLTMAP registers
      rtl8xxxu: Enable data frame reception in rtl8xxxu_start

Chaya Rachel Ivgi (4):
      iwlwifi: mvm: add CT-KILL notification
      iwlwifi: mvm: add registration to thermal zone
      iwlwifi: mvm: add registration to cooling device
      iwlwifi: mvm: update ucode status before stopping device

Christian Lamparter (1):
      carl9170: import 1.9.9 firmware headers

Colin Ian King (2):
      rtlwifi: pass struct rtl_stats by reference as it is more efficient
      mt7601u: do not free dma_buf when ivp allocation fails

Dan Carpenter (1):
      mwifiex: fix an indenting mistake

Emmanuel Grumbach (15):
      Merge tag 'mac80211-next-for-davem-2016-02-26' into next2
      Merge tag 'iwlwifi-for-kalle-2016-02-15' into HEAD
      iwlwifi: mvm: bump firmware API to 21
      iwlwifi: pcie: aggregate Flow Handler configuration writes
      iwlwifi: pcie: fix identation in trans.c
      iwlwifi: mvm: send large SKBs to the transport
      iwlwifi: mvm: add Tx A-MSDU inside A-MPDU
      iwlwifi: mvm:
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 08:53:07PM +0300, Cyrill Gorcunov wrote:
> On Wed, Mar 09, 2016 at 12:24:00PM -0500, David Miller wrote:
> ...
> > We asked you for numbers without a lot of features enabled, it'll
> > help us diagnose which subsystem still causes a lot of overhead
> > much more clearly.
> >
> > So please do so.
>
> Sure. Gimme some time and I'll be back with numbers.

OK, here are the results (with preempt-debug/trace disabled, on the kernel with David's two patches).

---
No conntrack:

[root@s125 ~]# ./exploit.sh
START 4 addresses
STOP 1457549979 1457549980 -> 1
START 2704 addresses
STOP 1457549983 1457549984 -> 1
START 10404 addresses
STOP 1457549996 1457549997 -> 1
START 23104 addresses
STOP 1457550029 1457550030 -> 1
START 40804 addresses
STOP 1457550103 1457550104 -> 1
START 63504 addresses
STOP 1457550267 1457550268 -> 1

Everything works quite fast; each pass takes 1 second.

With conntrack:

1) In the middle of the "release -> create new addresses" transition

27.53%  [kernel]  [k] __local_bh_enable_ip
26.29%  [kernel]  [k] _raw_spin_lock
 6.00%  [kernel]  [k] nf_ct_iterate_cleanup
 3.95%  [kernel]  [k] nf_conntrack_lock
 2.94%  [kernel]  [k] _raw_spin_unlock
 1.91%  [kernel]  [k] _cond_resched
 1.78%  [kernel]  [k] check_lifetime
 1.25%  [kernel]  [k] __inet_insert_ifa
 1.19%  [kernel]  [k] inet_rtm_newaddr

2) Last one, with 63K addresses being released

36.36%  [kernel]  [k] __local_bh_enable_ip
34.75%  [kernel]  [k] _raw_spin_lock
 7.93%  [kernel]  [k] nf_ct_iterate_cleanup
 5.11%  [kernel]  [k] nf_conntrack_lock
 3.71%  [kernel]  [k] _raw_spin_unlock
 2.51%  [kernel]  [k] _cond_resched
 0.89%  [kernel]  [k] task_tick_fair
 0.77%  [kernel]  [k] native_write_msr_safe
 0.58%  [kernel]  [k] hrtimer_active
 0.52%  [kernel]  [k] rcu_check_callbacks

[root@s125 ~]# ./exploit.sh
START 4 addresses
STOP 1457552395 1457552397 -> 2
START 2704 addresses
STOP 1457552399 1457552403 -> 4
START 10404 addresses
STOP 1457552415 1457552429 -> 14
START 23104 addresses
STOP 1457552461 1457552492 -> 31
START 40804 addresses
STOP 1457552566 1457552620 -> 54
START 63504 addresses
STOP 1457552785 1457552870 -> 85

At the final stage it took 85 seconds to become alive again. All of the time is eaten inside nf_ct_iterate_cleanup (actually inside get_next_corpse). IIRC there was a perf feature which could annotate the instructions, no? I have to refresh my memory on how to use perf record and friends...
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Hi Andrew,

Andrew Lunn writes:

>> -static bool dsa_slave_dev_check(struct net_device *dev)
>> -{
>> -	return dev->netdev_ops == &dsa_slave_netdev_ops;
>> -}
>
> Where is the equivalent of this happening? Where do we check that the
> interface added to the bridge is part of the switch?

Why should we check that? In this RFC, br_if.c tries to set the new attribute on the net_device when creating and deleting the net bridge port. If it supports attr_set and this attribute, then we're good. Or am I missing something?

>> -int dsa_slave_netdevice_event(struct notifier_block *unused,
>> -			      unsigned long event, void *ptr)
>> -{
>> -	struct net_device *dev;
>> -	int err = 0;
>> -
>> -	switch (event) {
>> -	case NETDEV_CHANGEUPPER:
>> -		dev = netdev_notifier_info_to_dev(ptr);
>> -		if (!dsa_slave_dev_check(dev))
>> -			goto out;
>> -
>> -		err = dsa_slave_master_changed(dev);
>> -		if (err && err != -EOPNOTSUPP)
>> -			netdev_warn(dev, "failed to reflect master change\n");
>> -
>> -		break;
>> -	}
>> -
>> -out:
>> -	return NOTIFY_DONE;
>> -}
>
> How about team/bonding? We are not ready to implement it yet with the
> Marvell devices, but at some point we probably will. Won't we need the
> events then? We need to know when a switch port has been added to a
> team?
>
> Or do you think a switchdev object will be added for this case?
> Mellanox already have the ability to add switch interfaces to a team,
> and then add the team to a bridge. So we need to ensure your solution
> works for such stacked systems.

Indeed, these features can be propagated through new switchdev attributes or objects. I think it'd be preferable to factor the switch-related operations into the switchdev API, instead of having every single switchdev user implement its own custom (but similar) listeners and checks for global netdev events. What do you think?

Best,
Vivien
Re: [RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Hello.

On 03/09/2016 08:42 PM, Vivien Didelot wrote:

> Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is
> set before adding a port to a bridge and deleting a port from a bridge.
>
> The main purpose for this attribute is to provide switchdev users a
> simple and common way to retrieve bridging information, instead of
> implementing complex notifier blocks to listen to global netdev events.
>
> We can also imagine a switchdev user returning an error different from
> -EOPNOTSUPP in the prepare phase to prevent a port from being bridged.
>
> Signed-off-by: Vivien Didelot
> ---
>  include/net/switchdev.h |  2 ++
>  net/bridge/br_if.c      | 27 +++
>  2 files changed, 29 insertions(+)
[...]
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index a73df33..105b9fd 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -28,6 +28,24 @@
>
>  #include "br_private.h"
>
> +static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br,
> +			       bool join)
> +{
> +	struct switchdev_attr attr = {
> +		.orig_dev = br->dev,
> +		.id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF,
> +		.flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
> +		.u.join = join,
> +	};
> +	int err;
> +
> +	err = switchdev_port_attr_set(dev, &attr);
> +	if (err && err != -EOPNOTSUPP)

   Enough to only do the latter comparison.

> +		return err;
> +
> +	return 0;
> +}
> +
>  /*
>   * Determine initial path cost based on speed.
>   * using recommendations from 802.1d standard
[...]

MBR, Sergei
Re: [PATCH V4 0/3] basic busy polling support for vhost_net
On Fri, 4 Mar 2016 06:24:50 -0500 Jason Wang wrote:

> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx/rx processing, busy poll for a new tx added
> descriptor and rx receive socket for a while. The maximum amount of
> time (in us) that could be spent on busy polling is specified via ioctl.
>
> Tests were done through:
>
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected mlx4

Hi Jason,

Could this also improve performance if both VMs are on the same host system?

> - Guest with 1 vcpus and 1 queue
>
> Results:
> - Obvious improvements (5% - 20%) for latency (TCP_RR).
> - Get a better or minor regression on most of the TX tests, but see
>   some regression on 4096 size.
> - Except for 8 sessions of 4096 size RX, have a better or same
>   performance.
> - CPU utilization was increased as expected.
>
> TCP_RR:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>     1/   1/  +8%/ -32%/  +8%/  +8%/  +7%
>     1/  50/  +7%/ -19%/  +7%/  +7%/  +1%
>     1/ 100/  +5%/ -21%/  +5%/  +5%/   0%
>     1/ 200/  +5%/ -21%/  +7%/  +7%/  +1%
>    64/   1/ +11%/ -29%/ +11%/ +11%/ +10%
>    64/  50/  +7%/ -19%/  +8%/  +8%/  +2%
>    64/ 100/  +8%/ -18%/  +9%/  +9%/  +2%
>    64/ 200/  +6%/ -19%/  +6%/  +6%/   0%
>   256/   1/  +7%/ -33%/  +7%/  +7%/  +6%
>   256/  50/  +7%/ -18%/  +7%/  +7%/   0%
>   256/ 100/  +9%/ -18%/  +8%/  +8%/  +2%
>   256/ 200/  +9%/ -18%/ +10%/ +10%/  +3%
>  1024/   1/ +20%/ -28%/ +20%/ +20%/ +19%
>  1024/  50/  +8%/ -18%/  +9%/  +9%/  +2%
>  1024/ 100/  +6%/ -19%/  +5%/  +5%/   0%
>  1024/ 200/  +8%/ -18%/  +9%/  +9%/  +2%
>
> Guest TX:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>    64/   1/  -5%/ -28%/  +11%/ +12%/  +10%
>    64/   4/  -2%/ -26%/  +13%/ +13%/  +13%
>    64/   8/  -6%/ -29%/   +9%/ +10%/  +10%
>   512/   1/ +15%/  -7%/  +13%/ +11%/   +3%
>   512/   4/ +17%/  -6%/  +18%/ +13%/  +11%
>   512/   8/ +14%/  -7%/  +13%/  +7%/   +7%
>  1024/   1/ +27%/  -2%/  +26%/ +29%/  +12%
>  1024/   4/  +8%/  -9%/   +6%/  +1%/   +6%
>  1024/   8/ +41%/ +12%/  +34%/ +20%/   -3%
>  4096/   1/ -22%/ -21%/  -36%/ +81%/ +1360%
>  4096/   4/ -57%/ -58%/ +286%/ +15%/ +2074%
>  4096/   8/ +67%/ +70%/  -45%/  -8%/  +63%
> 16384/   1/  -2%/  -5%/   +5%/  -3%/  +80%
> 16384/   4/   0%/   0%/    0%/  +4%/ +138%
> 16384/   8/   0%/   0%/    0%/  +1%/  +41%
> 65535/   1/  -3%/  -6%/   +2%/ +11%/ +113%
> 65535/   4/  -2%/  -1%/   -2%/  -3%/ +484%
> 65535/   8/   0%/  +1%/    0%/  +2%/  +40%
>
> Guest RX:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>    64/   1/ +31%/  -3%/  +8%/  +8%/  +8%
>    64/   4/ +11%/ -17%/ +13%/ +14%/ +15%
>    64/   8/  +4%/ -23%/ +11%/ +11%/ +12%
>   512/   1/ +24%/   0%/ +18%/ +14%/  -8%
>   512/   4/  +4%/ -15%/  +6%/  +5%/  +6%
>   512/   8/ +26%/   0%/ +21%/ +10%/  +3%
>  1024/   1/ +88%/ +47%/ +69%/ +44%/ -30%
>  1024/   4/ +18%/  -5%/ +19%/ +16%/  +2%
>  1024/   8/ +15%/  -4%/ +13%/  +8%/  +1%
>  4096/   1/  -3%/  -5%/  +2%/  -2%/ +41%
>  4096/   4/  +2%/  +3%/ -20%/ -14%/ -24%
>  4096/   8/ -43%/ -45%/ +69%/ -24%/ +94%
> 16384/   1/  -3%/ -11%/ +23%/  +7%/ +42%
> 16384/   4/  -3%/  -3%/  -4%/  +5%/ +115%
> 16384/   8/  -1%/   0%/  -1%/  -3%/ +32%
> 65535/   1/  +1%/   0%/  +2%/   0%/ +66%
> 65535/   4/  -1%/  -1%/   0%/  +4%/ +492%
> 65535/   8/   0%/  -1%/  -1%/  +4%/ +38%
>
> Changes from V3:
> - drop single_task_running()
> - use cpu_relax_lowlatency() instead of cpu_relax()
>
> Changes from V2:
> - rename vhost_vq_more_avail() to vhost_vq_avail_empty(), and return
>   false when __get_user() fails.
> - do not bother preemption/timers for the good path.
> - use vhost_vring_state as the ioctl parameter instead of reinventing a
>   new one.
> - add the unit of the timeout (us) to the comment of the new ioctls
>
> Changes from V1:
> - remove the buggy vq_error() in vhost_vq_more_avail().
> - leave vhost_enable_notify() untouched.
>
> Changes from RFC V3:
> - small tweak on the code to avoid multiple duplicate conditions in the
>   critical path when busy loop is not enabled.
> - add the test result of multiple VMs
>
> Changes from RFC V2:
> - poll also at the end of rx handling
> - factor out the polling logic and optimize the code a little bit
> - add two ioctls to get and set the busy poll timeout
> - test on ixgbe (which can give more stable and reproducible numbers)
>   instead of mlx4.
>
> Changes from RFC V1:
> - add a comment for vhost_has_work() to explain why it could be
>   lockless
> - add param description for busyloop_timeout
> - split out the busy polling logic into a new helper
> - check and exit the loop when there's a pending signal
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Wed, Mar 09, 2016 at 07:32:13PM CET, and...@lunn.ch wrote:
>Hi Vivien
>
>> -static bool dsa_slave_dev_check(struct net_device *dev)
>> -{
>> -	return dev->netdev_ops == &dsa_slave_netdev_ops;
>> -}
>
>Where is the equivalent of this happening? Where do we check that the
>interface added to the bridge is part of the switch?
>
>> -int dsa_slave_netdevice_event(struct notifier_block *unused,
>> -			      unsigned long event, void *ptr)
>> -{
>> -	struct net_device *dev;
>> -	int err = 0;
>> -
>> -	switch (event) {
>> -	case NETDEV_CHANGEUPPER:
>> -		dev = netdev_notifier_info_to_dev(ptr);
>> -		if (!dsa_slave_dev_check(dev))
>> -			goto out;
>> -
>> -		err = dsa_slave_master_changed(dev);
>> -		if (err && err != -EOPNOTSUPP)
>> -			netdev_warn(dev, "failed to reflect master change\n");
>> -
>> -		break;
>> -	}
>> -
>> -out:
>> -	return NOTIFY_DONE;
>> -}
>
>How about team/bonding? We are not ready to implement it yet with the
>Marvell devices, but at some point we probably will. Won't we need the
>events then? We need to know when a switch port has been added to a
>team?
>
>Or do you think a switchdev object will be added for this case?
>Mellanox already have the ability to add switch interfaces to a team,
>and then add the team to a bridge. So we need to ensure your solution
>works for such stacked systems.

I have to look at this more closely tomorrow, but I'm missing the motivation behind this. Using the existing notifiers, drivers can easily monitor what is going on with their uppers. Why does this need to be changed?
Re: Unexpected tcpv6 connection resets since linux 4.4
On Mon, Mar 7, 2016 at 15:58, Cong Wang wrote:
> On Sun, Mar 6, 2016 at 7:10 AM, Andreas Messer wrote:
> > i have updated two of my machines in the last weeks to linux 4.4.1 and
> > linux 4.4.3. It seems that since then i get unexpected TCPv6 connection
> > resets when connecting to these machines remotely. The issue occurs with
> > sshd and with a http service. /etc/hosts.deny and /etc/hosts.allow are
> > empty on both server machines. I'm not so into IPv6 and have no idea what's
> > going on. Please find attached a network trace from one of the machines
> > when connecting with ssh (on port 23 for debugging).
>
> Sounds like the problem fixed by the following commit:
>
> commit 9cf7490360bf2c46a16b7525f899e4970c5fc144
> Author: Eric Dumazet
> Date: Tue Feb 2 19:31:12 2016 -0800
>
>     tcp: do not drop syn_recv on all icmp reports

Thanks for the tip. I applied the patch to v4.4.1 and it solved the problem.

Cheers,
Andreas
[PATCH net-next v3] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
Per RFC4898, they count segments sent/received containing a positive length data segment (that includes retransmission segments carrying data). Unlike tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments carrying no data (e.g. pure ack).

The patch also updates the segs_in in tcp_fastopen_add_skb() so that the segs_in >= data_segs_in property is kept.

Together with retransmission data, tcpi_data_segs_out gives a better signal on the rxmit rate.

v3: Add const modifier to the skb parameter in tcp_segs_in()

v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

Signed-off-by: Martin KaFai Lau
Cc: Chris Rapier
Cc: Eric Dumazet
Cc: Marcelo Ricardo Leitner
Cc: Neal Cardwell
Cc: Yuchung Cheng
---
 include/linux/tcp.h      |  6 ++++++
 include/net/tcp.h        | 10 ++++++++++
 include/uapi/linux/tcp.h |  2 ++
 net/ipv4/tcp.c           |  2 ++
 net/ipv4/tcp_fastopen.c  |  4 ++++
 net/ipv4/tcp_ipv4.c      |  2 +-
 net/ipv4/tcp_minisocks.c |  2 +-
 net/ipv4/tcp_output.c    |  4 +++-
 net/ipv6/tcp_ipv6.c      |  2 +-
 9 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bcbf51d..7be9b12 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -158,6 +158,9 @@ struct tcp_sock {
 	u32	segs_in;	/* RFC4898 tcpEStatsPerfSegsIn
 				 * total number of segments in.
 				 */
+	u32	data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
 	u32	rcv_nxt;	/* What we want to receive next	*/
 	u32	copied_seq;	/* Head of yet unread data	*/
 	u32	rcv_wup;	/* rcv_nxt on last window update sent	*/
@@ -165,6 +168,9 @@ struct tcp_sock {
 	u32	segs_out;	/* RFC4898 tcpEStatsPerfSegsOut
 				 * The total number of segments sent.
 				 */
+	u32	data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
 	u64	bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
 				 * sum(delta(snd_una)), or how many bytes
 				 * were acked.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e90db85..24557a8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff *skb)
 	skb->truesize = 2;
 }
 
+static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
+{
+	u16 segs_in;
+
+	segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+	tp->segs_in += segs_in;
+	if (skb->len > tcp_hdrlen(skb))
+		tp->data_segs_in += segs_in;
+}
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index fe95446..53e8e3f 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -199,6 +199,8 @@ struct tcp_info {
 	__u32	tcpi_notsent_bytes;
 	__u32	tcpi_min_rtt;
+	__u32	tcpi_data_segs_in;	/* RFC4898 tcpEStatsDataSegsIn */
+	__u32	tcpi_data_segs_out;	/* RFC4898 tcpEStatsDataSegsOut */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9faadb..6b01b48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2728,6 +2728,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_notsent_bytes = max(0, notsent_bytes);
 
 	info->tcpi_min_rtt = tcp_min_rtt(tp);
+	info->tcpi_data_segs_in = tp->data_segs_in;
+	info->tcpi_data_segs_out = tp->data_segs_out;
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fdb286d..f583c85 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock *req,
 void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u16 segs_in;
 
 	if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt)
 		return;
@@ -154,6 +155,9 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	 * as we certainly are not changing upper 32bit value (0)
 	 */
 	tp->bytes_received = skb->len;
+	segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+	tp->segs_in = segs_in;
+	tp->data_segs_in = segs_in;
 
 	if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 		tcp_fin(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4c8d58d..0b02ef7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1650,7 +1650,7 @@ process:
 	sk_incoming_cpu_update(sk);
 
 	bh_lock_so
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Hi Vivien

> -static bool dsa_slave_dev_check(struct net_device *dev)
> -{
> -	return dev->netdev_ops == &dsa_slave_netdev_ops;
> -}

Where is the equivalent of this happening? Where do we check that the interface added to the bridge is part of the switch?

> -int dsa_slave_netdevice_event(struct notifier_block *unused,
> -			      unsigned long event, void *ptr)
> -{
> -	struct net_device *dev;
> -	int err = 0;
> -
> -	switch (event) {
> -	case NETDEV_CHANGEUPPER:
> -		dev = netdev_notifier_info_to_dev(ptr);
> -		if (!dsa_slave_dev_check(dev))
> -			goto out;
> -
> -		err = dsa_slave_master_changed(dev);
> -		if (err && err != -EOPNOTSUPP)
> -			netdev_warn(dev, "failed to reflect master change\n");
> -
> -		break;
> -	}
> -
> -out:
> -	return NOTIFY_DONE;
> -}

How about team/bonding? We are not ready to implement it yet with the Marvell devices, but at some point we probably will. Won't we need the events then? We need to know when a switch port has been added to a team?

Or do you think a switchdev object will be added for this case? Mellanox already have the ability to add switch interfaces to a team, and then add the team to a bridge. So we need to ensure your solution works for such stacked systems.

Andrew
[PATCH] inet: set `error' value not under lock_sock().
A trivial patch to set the `error' variable while not holding lock_sock().

Signed-off-by: Weongyo Jeong
---
 net/ipv4/inet_connection_sock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 6414891..58bc39f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -306,14 +306,13 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
 	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	struct request_sock *req;
 	struct sock *newsk;
-	int error;
+	int error = -EINVAL;
 
 	lock_sock(sk);
 
 	/* We need to make sure that this socket is listening,
 	 * and that it has something pending.
 	 */
-	error = -EINVAL;
 	if (sk->sk_state != TCP_LISTEN)
 		goto out_err;
-- 
2.1.3
Re: [PATCH (net.git) 0/2] stmmac: MDIO fixes
Am 09.03.2016 um 10:00 schrieb Giuseppe Cavallaro:
> These two patches are to fix the recent regressions raised
> when testing the stmmac on some platforms.
> Please, I kindly ask you to give me feedback on whether this actually
> covers all the cases and whether the stmmac runs fine on your boxes.
> I have tested on my side the H410 B2120 with an embedded switch
> (so using fixed-link).
>
> Giuseppe Cavallaro (2):
>   Revert "stmmac: Fix 'eth0: No PHY found' regression"
>   stmmac: fix MDIO settings
>
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 11 ++---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c  | 11 +-
>  .../net/ethernet/stmicro/stmmac/stmmac_platform.c  | 22 +--
>  include/linux/stmmac.h                             |  3 +-
>  4 files changed, 20 insertions(+), 27 deletions(-)

Works fine on the Banana Pi M1 board. (Fixes "libphy: PHY stmmac-0: not found")

Regards,
Frank
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 12:24:00PM -0500, David Miller wrote:
...
> We asked you for numbers without a lot of features enabled, it'll
> help us diagnose which subsystem still causes a lot of overhead
> much more clearly.
>
> So please do so.

Sure. Gimme some time and I'll be back with numbers.

> Although it's already pretty clear that netfilter conntrack
> cleanup is insanely expensive.

Yes. I can drop it for a while and run the tests without it, then turn it back on and try again. Would you like to see such numbers?

> You're also jumping to a lot of conclusions, work with us to fix the
> fundamental performance problems rather than continually insisting on
> a limit.
>
> We should be able to remove millions of IP addresses in less than
> half a second, no problem. Limits make no sense at all.

Sure, I'll continue experimenting (and turn off preempt as a first step). Sorry if I sounded rough.

Cyrill
[RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Add a new dsa_slave_bridge_if function to handle the SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute. Thus remove the code related to the netdev notifier block.

Signed-off-by: Vivien Didelot
---
 net/dsa/dsa.c      |   7 ---
 net/dsa/dsa_priv.h |   2 -
 net/dsa/slave.c    | 113 ++++++++++-----------------
 3 files changed, 30 insertions(+), 92 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index fa4daba..cfb678b 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -977,10 +977,6 @@ static struct packet_type dsa_pack_type __read_mostly = {
 	.func	= dsa_switch_rcv,
 };
 
-static struct notifier_block dsa_netdevice_nb __read_mostly = {
-	.notifier_call	= dsa_slave_netdevice_event,
-};
-
 #ifdef CONFIG_PM_SLEEP
 static int dsa_suspend(struct device *d)
 {
@@ -1047,8 +1043,6 @@ static int __init dsa_init_module(void)
 {
 	int rc;
 
-	register_netdevice_notifier(&dsa_netdevice_nb);
-
 	rc = platform_driver_register(&dsa_driver);
 	if (rc)
 		return rc;
@@ -1061,7 +1055,6 @@ module_init(dsa_init_module);
 
 static void __exit dsa_cleanup_module(void)
 {
-	unregister_netdevice_notifier(&dsa_netdevice_nb);
 	dev_remove_pack(&dsa_pack_type);
 	platform_driver_unregister(&dsa_driver);
 }
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 1d1a546..34d1951 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -64,8 +64,6 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
 void dsa_slave_destroy(struct net_device *slave_dev);
 int dsa_slave_suspend(struct net_device *slave_dev);
 int dsa_slave_resume(struct net_device *slave_dev);
-int dsa_slave_netdevice_event(struct notifier_block *unused,
-			      unsigned long event, void *ptr);
 
 /* tag_dsa.c */
 extern const struct dsa_device_ops dsa_netdev_ops;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 27bf03d..90ef149 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -305,16 +305,38 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	return -EOPNOTSUPP;
 }
 
-static int dsa_slave_stp_update(struct net_device *dev, u8 state)
+static int dsa_slave_bridge_if(struct net_device *dev,
+			       const struct switchdev_attr *attr,
+			       struct switchdev_trans *trans)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 	struct dsa_switch *ds = p->parent;
-	int ret = -EOPNOTSUPP;
+	int err;
 
-	if (ds->drv->port_stp_update)
-		ret = ds->drv->port_stp_update(ds, p->port, state);
+	if (switchdev_trans_ph_prepare(trans)) {
+		if (!ds->drv->port_join_bridge || !ds->drv->port_leave_bridge)
+			return -EOPNOTSUPP;
+		return 0;
+	}
 
-	return ret;
+	if (attr->u.join) {
+		err = ds->drv->port_join_bridge(ds, p->port, attr->orig_dev);
+		if (!err)
+			p->bridge_dev = attr->orig_dev;
+	} else {
+		err = ds->drv->port_leave_bridge(ds, p->port);
+
+		/* When a port leaves a bridge, the bridge layer sets its STP
+		 * state to DISABLED. Restore FORWARDING to keep it functional.
+		 */
+		if (ds->drv->port_stp_update)
+			ds->drv->port_stp_update(ds, p->port,
+						 BR_STATE_FORWARDING);
+
+		p->bridge_dev = NULL;
+	}
+
+	return err;
 }
 
 static int dsa_slave_vlan_filtering(struct net_device *dev,
@@ -354,6 +376,9 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
 	case SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING:
 		ret = dsa_slave_vlan_filtering(dev, attr, trans);
 		break;
+	case SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF:
+		ret = dsa_slave_bridge_if(dev, attr, trans);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 		break;
@@ -439,41 +464,6 @@ static int dsa_slave_port_obj_dump(struct net_device *dev,
 	return err;
 }
 
-static int dsa_slave_bridge_port_join(struct net_device *dev,
-				      struct net_device *br)
-{
-	struct dsa_slave_priv *p = netdev_priv(dev);
-	struct dsa_switch *ds = p->parent;
-	int ret = -EOPNOTSUPP;
-
-	p->bridge_dev = br;
-
-	if (ds->drv->port_join_bridge)
-		ret = ds->drv->port_join_bridge(ds, p->port, br);
-
-	return ret;
-}
-
-static int dsa_slave_bridge_port_leave(struct net_device *dev)
-{
-	struct dsa_slave_priv *p = netdev_priv(dev);
-	struct dsa_switch *ds = p->parent;
-	int ret = -EOPNOTSUPP;
-
-	if (ds->drv->port_leave_bridge)
-		ret = ds->drv->port_leave_bridge(ds, p->port);
-
-	p->bridge_dev = NULL;
-
-	/* Port left the bridge, put in BR_STATE_DISABLED by the bridg
[RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is set before adding a port to a bridge and deleting a port from a bridge.

The main purpose for this attribute is to provide switchdev users a simple and common way to retrieve bridging information, instead of implementing complex notifier blocks to listen to global netdev events.

We can also imagine a switchdev user returning an error different from -EOPNOTSUPP in the prepare phase to prevent a port from being bridged.

Signed-off-by: Vivien Didelot
---
 include/net/switchdev.h |  2 ++
 net/bridge/br_if.c      | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d451122..65f8514 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -46,6 +46,7 @@ enum switchdev_attr_id {
 	SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
 	SWITCHDEV_ATTR_ID_PORT_STP_STATE,
 	SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
+	SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF,
 	SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
 };
@@ -58,6 +59,7 @@ struct switchdev_attr {
 		struct netdev_phys_item_id ppid;	/* PORT_PARENT_ID */
 		u8 stp_state;				/* PORT_STP_STATE */
 		unsigned long brport_flags;		/* PORT_BRIDGE_FLAGS */
+		bool join;				/* PORT_BRIDGE_IF */
 		u32 ageing_time;			/* BRIDGE_AGEING_TIME */
 		bool vlan_filtering;			/* BRIDGE_VLAN_FILTERING */
 	} u;
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index a73df33..105b9fd 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -28,6 +28,24 @@
 
 #include "br_private.h"
 
+static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br,
+			       bool join)
+{
+	struct switchdev_attr attr = {
+		.orig_dev = br->dev,
+		.id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF,
+		.flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
+		.u.join = join,
+	};
+	int err;
+
+	err = switchdev_port_attr_set(dev, &attr);
+	if (err && err != -EOPNOTSUPP)
+		return err;
+
+	return 0;
+}
+
 /*
  * Determine initial path cost based on speed.
  * using recommendations from 802.1d standard
@@ -297,6 +315,10 @@ static void del_nbp(struct net_bridge_port *p)
 	br_netpoll_disable(p);
 
 	call_rcu(&p->rcu, destroy_nbp_rcu);
+
+	if (switchdev_bridge_if(dev, br, false))
+		br_warn(br, "error unbridging port %u(%s)\n",
+			(unsigned int) p->port_no, dev->name);
 }
 
 /* Delete bridge device */
@@ -347,6 +369,11 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br,
 {
 	int index;
 	struct net_bridge_port *p;
+	int err;
+
+	err = switchdev_bridge_if(dev, br, true);
+	if (err)
+		return ERR_PTR(err);
 
 	index = find_portno(br);
 	if (index < 0)
-- 
2.7.2
[RFC PATCH net-next 0/2] net: switchdev: add attribute for port bridging
Current switchdev users implement notifier blocks to filter global netdev events, in order to correctly offload bridging to their hardware ports. Such code could be replaced with the support of a simple switchdev attribute set when adding/deleting a port to/from a bridge.

Also, we can imagine a switch driver or network layer wanting to restrict the number of logical bridges on top of a physical device. That could be done by returning -EOPNOTSUPP when setting such an attribute.

The first patch adds a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute containing a boolean, set when joining or leaving a bridge.

The second patch shows the benefit of supporting such an attribute in the DSA layer. A similar change should be doable for other switchdev users, like Rocker.

Note: I send this as an RFC since I am not really sure about the attribute flags, and the exact place to set it in del_nbp(). Comments needed :-)

Thanks,

Vivien Didelot (2):
  net: bridge: add switchdev attr for port bridging
  net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF

 include/net/switchdev.h |   2 +
 net/bridge/br_if.c      |  27 ++++
 net/dsa/dsa.c           |   7 ---
 net/dsa/dsa_priv.h      |   2 -
 net/dsa/slave.c         | 113 +---
 5 files changed, 59 insertions(+), 92 deletions(-)

-- 
2.7.2
gretap default MTU question
Hello good people of netdev,

When setting up gretap devices like so:

  ip link add mydev type gretap remote 1.1.1.1 local 2.2.2.2 nopmtudisc

I'm observing two different behaviors:

- On system A, the MTU of 'mydev' is set to the MTU of the 'parent' interface (currently 1600) minus 38. All other interfaces on that system have a default MTU of 1500; only the parent was forced to 1600 to avoid fragmentation. So 'mydev' accurately figures out that its MTU is 1562.

- On system B, with the 'parent' interface MTU set to 1600 and all others defaulting to 1500 (same situation as A), the MTU of 'mydev' gets set to 1462.

I'm trying to figure out which behavior is normal and what mechanism (if any) causes the MTU to be set differently. In both cases the 'parent' device has an MTU of 1600. The code in ip_gre.c does this:

	dev->mtu = ETH_DATA_LEN - t_hlen - 4;

In this case, system B would have the expected behavior, and I need some way to explain what goes on with system A. Of course I can force the MTU on system B, but I was rather pleased with the 'magic' on system A.

If anyone here familiar with this code can offer an explanation, it would greatly ease my curiosity.

Jonathan
Re: Linux netfilter IPT_SO_SET_REPLACE memory corruption
Small correction: IPT_SO_SET_REPLACE is reached via setsockopt, not ioctl.

On Wed, Mar 9, 2016 at 9:25 AM, Ben Hawkes wrote:
> N.B. I was redirected to this list by secur...@kernel.org. I'm told
> that networking related issues such as this are handled on the public
> netdev mailing list.
>
> A memory corruption vulnerability exists in the IPT_SO_SET_REPLACE
> ioctl in the netfilter code for iptables support. This ioctl can be
> triggered by an unprivileged user on PF_INET sockets when unprivileged
> user namespaces are available (CONFIG_USER_NS=y). Android does not
> enable this option, but desktop/server distributions and Chrome OS
> will commonly enable this to allow for containers support or
> sandboxing.
>
> In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
> is possible for a user-supplied ipt_entry structure to have a large
> next_offset field. This field is not bounds checked prior to writing a
> counter value at the supplied offset:
>
>   newpos = pos + e->next_offset;
>   ...
>   e = (struct ipt_entry *)(entry0 + newpos);
>   e->counters.pcnt = pos;
>
> This means that an out of bounds 32-bit write can occur in a 64kb
> range from the allocated heap entry, with a controlled offset and a
> partially controlled write value ("pos") or zero. The attached
> proof-of-concept (netfilter_setsockopt_v3.c) triggers the corruption
> multiple times to set adjacent heap structures to zero.
>
> This issue affects (at least) kernel versions 3.10, 3.18 and 4.4, and
> could be used for local privilege escalation. It appears that a
> similar codepath is accessible via arp_tables.c/ARPT_SO_SET_REPLACE as
> well.
>
> Furthermore, a recent refactoring of this codepath
> (https://github.com/torvalds/linux/commit/2e4e6a17af35be359cc8f1c924f8f198fbd478cc)
> introduced an integer overflow in xt_alloc_table_info, which on 32-bit
> systems can lead to small structure allocation and a copy_from_user
> based heap corruption. The attached proof-of-concept
> (netfilter_setsockopt_v4.c) triggers this issue on 4.4.