[PATCH net] sctp: fix the transports round robin issue when init is retransmitted
Prior to this patch, if an association starts with two paths whose parameters are identical except for last_time_heard, retransmission tries the paths like this: 1st cycle: try trans1, fail, then trans2 is selected (because its last_time_heard is after trans1's). 2nd cycle: try trans2, fail, then trans2 is selected again (because its last_time_heard is still after trans1's). 3rd cycle: try trans2, fail, then trans2 is selected again. trans1 never gets a chance to be selected, which is not what we expect: we should keep round-robining all the paths when they have just been added at the beginning. So, first, every transport's last_time_heard should be initialized to 0, so that all transports start with the same value; only then can all the transports get an equal chance to be selected. Second, sctp_trans_elect_best should return the trans_next one when *trans == *trans_next, so that we can try the next path if the current one fails, but now it always returns trans. Fix it by exchanging these two parameters when calling sctp_trans_elect_tie().
Fixes: 4c47af4d5eb2 ('net: sctp: rework multihoming retransmission path selection to rfc4960')
Signed-off-by: Xin Long
---
 net/sctp/associola.c | 2 +-
 net/sctp/transport.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 2bf8ec9..cd87344 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1263,7 +1263,7 @@ static struct sctp_transport *sctp_trans_elect_best(struct sctp_transport *curr,
 	if (score_curr > score_best)
 		return curr;
 	else if (score_curr == score_best)
-		return sctp_trans_elect_tie(curr, best);
+		return sctp_trans_elect_tie(best, curr);
 	else
 		return best;
 }
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index a431c14..d517153 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -72,7 +72,7 @@ static struct sctp_transport *sctp_transport_init(struct net *net,
 	 */
 	peer->rto = msecs_to_jiffies(net->sctp.rto_initial);

-	peer->last_time_heard = ktime_get();
+	peer->last_time_heard = ktime_set(0, 0);
 	peer->last_time_ecne_reduced = jiffies;

 	peer->param_flags = SPP_HB_DISABLE |
-- 
2.1.0
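The effect of the two hunks can be modeled in a few lines of userspace C. The names below are hypothetical stand-ins for the real code in net/sctp/associola.c; the sketch only shows why zero-initialized last_time_heard plus the swapped tie-break arguments yield round robin:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sctp_transport. */
struct toy_trans {
	int error_count;
	long last_time_heard;	/* 0 at init, per the transport.c hunk */
};

/* Mirrors the shape of sctp_trans_elect_tie(): on a complete tie it
 * returns its first argument. */
static struct toy_trans *toy_elect_tie(struct toy_trans *t1,
				       struct toy_trans *t2)
{
	if (t1->error_count > t2->error_count)
		return t2;
	if (t1->error_count == t2->error_count &&
	    t2->last_time_heard > t1->last_time_heard)
		return t2;
	return t1;
}

/* The equal-score branch of sctp_trans_elect_best() after the fix:
 * arguments are passed as (best, curr), so a complete tie keeps
 * 'best' (the candidate found first in iteration order) instead of
 * always overwriting it with 'curr'. */
static struct toy_trans *toy_elect_best_on_tie(struct toy_trans *curr,
					       struct toy_trans *best)
{
	return toy_elect_tie(best, curr);
}
```

With both transports freshly initialized (equal error counts, last_time_heard both 0), the fixed order preserves the first candidate after the failed path, so retransmission rotates instead of sticking to one transport.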
Re: [PATCH V4 0/3] basic busy polling support for vhost_net
Hi Greg, > Greg Kurz wrote on 03/09/2016 09:26:45 PM: > > On Fri, 4 Mar 2016 06:24:50 -0500 > > Jason Wang wrote: > > > This series tries to add basic busy polling for vhost net. The idea is > > simple: at the end of tx/rx processing, busy polling for new tx added > > descriptor and rx receive socket for a while. The maximum number of > > time (in us) could be spent on busy polling was specified ioctl. > > > > Test A were done through: > > > > - 50 us as busy loop timeout > > - Netperf 2.6 > > - Two machines with back to back connected mlx4 > > Hi Jason, > > Could this also improve performance if both VMs are > on the same host system ? I've experimented a little with Jason's patches and guest-to-guest netperf when both guests were on the same host, and I saw improvements for that case. > > - Guest with 1 vcpus and 1 queue > > > > Results: > > - Obvious improvements (%5 - 20%) for latency (TCP_RR). > > - Get a better or minor regression on most of the TX tests, but see > > some regression on 4096 size. > > - Except for 8 sessions of 4096 size RX, have a better or same > > performance. > > - CPU utilization were incrased as expected. 
> > TCP_RR:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >     1/   1/  +8%/ -32%/  +8%/  +8%/  +7%
> >     1/  50/  +7%/ -19%/  +7%/  +7%/  +1%
> >     1/ 100/  +5%/ -21%/  +5%/  +5%/   0%
> >     1/ 200/  +5%/ -21%/  +7%/  +7%/  +1%
> >    64/   1/ +11%/ -29%/ +11%/ +11%/ +10%
> >    64/  50/  +7%/ -19%/  +8%/  +8%/  +2%
> >    64/ 100/  +8%/ -18%/  +9%/  +9%/  +2%
> >    64/ 200/  +6%/ -19%/  +6%/  +6%/   0%
> >   256/   1/  +7%/ -33%/  +7%/  +7%/  +6%
> >   256/  50/  +7%/ -18%/  +7%/  +7%/   0%
> >   256/ 100/  +9%/ -18%/  +8%/  +8%/  +2%
> >   256/ 200/  +9%/ -18%/ +10%/ +10%/  +3%
> >  1024/   1/ +20%/ -28%/ +20%/ +20%/ +19%
> >  1024/  50/  +8%/ -18%/  +9%/  +9%/  +2%
> >  1024/ 100/  +6%/ -19%/  +5%/  +5%/   0%
> >  1024/ 200/  +8%/ -18%/  +9%/  +9%/  +2%
> > Guest TX:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >    64/   1/  -5%/ -28%/ +11%/ +12%/ +10%
> >    64/   4/  -2%/ -26%/ +13%/ +13%/ +13%
> >    64/   8/  -6%/ -29%/  +9%/ +10%/ +10%
> >   512/   1/ +15%/  -7%/ +13%/ +11%/  +3%
> >   512/   4/ +17%/  -6%/ +18%/ +13%/ +11%
> >   512/   8/ +14%/  -7%/ +13%/  +7%/  +7%
> >  1024/   1/ +27%/  -2%/ +26%/ +29%/ +12%
> >  1024/   4/  +8%/  -9%/  +6%/  +1%/  +6%
> >  1024/   8/ +41%/ +12%/ +34%/ +20%/  -3%
> >  4096/   1/ -22%/ -21%/ -36%/ +81%/+1360%
> >  4096/   4/ -57%/ -58%/+286%/ +15%/+2074%
> >  4096/   8/ +67%/ +70%/ -45%/  -8%/ +63%
> > 16384/   1/  -2%/  -5%/  +5%/  -3%/ +80%
> > 16384/   4/   0%/   0%/   0%/  +4%/+138%
> > 16384/   8/   0%/   0%/   0%/  +1%/ +41%
> > 65535/   1/  -3%/  -6%/  +2%/ +11%/+113%
> > 65535/   4/  -2%/  -1%/  -2%/  -3%/+484%
> > 65535/   8/   0%/  +1%/   0%/  +2%/ +40%
> > Guest RX:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >    64/   1/ +31%/  -3%/  +8%/  +8%/  +8%
> >    64/   4/ +11%/ -17%/ +13%/ +14%/ +15%
> >    64/   8/  +4%/ -23%/ +11%/ +11%/ +12%
> >   512/   1/ +24%/   0%/ +18%/ +14%/  -8%
> >   512/   4/  +4%/ -15%/  +6%/  +5%/  +6%
> >   512/   8/ +26%/   0%/ +21%/ +10%/  +3%
> >  1024/   1/ +88%/ +47%/ +69%/ +44%/ -30%
> >  1024/   4/ +18%/  -5%/ +19%/ +16%/  +2%
> >  1024/   8/ +15%/  -4%/ +13%/  +8%/  +1%
> >  4096/   1/  -3%/  -5%/  +2%/  -2%/ +41%
> >  4096/   4/  +2%/  +3%/ -20%/ -14%/ -24%
> >  4096/   8/ -43%/ -45%/ +69%/ -24%/ +94%
> > 16384/   1/  -3%/ -11%/ +23%/  +7%/ +42%
> > 16384/   4/  -3%/  -3%/  -4%/  +5%/+115%
> > 16384/   8/  -1%/   0%/  -1%/  -3%/ +32%
> > 65535/   1/  +1%/   0%/  +2%/   0%/ +66%
> > 65535/   4/  -1%/  -1%/   0%/  +4%/+492%
> > 65535/   8/   0%/  -1%/  -1%/  +4%/ +38%
> >
> > Changes from V3:
> > - drop single_task_running()
> > - use cpu_relax_lowlatency() instead of cpu_relax()
> >
> > Changes from V2:
> > - rename vhost_vq_more_avail() to vhost_vq_avail_empty(). And return
> >   false we __get_user() fails.
> > - do not bother premmptions/timers for good path.
> > - use vhost_vring_state as ioctl parameter instead of reinveting a new
> >   one.
> > - add the unit of timeout (us) to the comment of new added ioctls
> >
> > Changes from V1:
> > - remove the buggy vq_error() in vhost_vq_more_avail().
> > - leave vhost_enable_notify() untouched.
> >
> > Changes from RFC V3:
> > - small tweak on the code to avoid multiple duplicate conditions in
> >   critical path when busy loop is not enabled.
> > - add the test result of multiple VMs
> >
> > Changes from RFC V2:
> > - poll also at the end of rx handling
> > - factor out the polling logic and opti
Re: Hang occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test
On Thu, Mar 10, 2016 at 01:21:05PM +0900, Yuki Machida wrote:
> Hi all,
>
> A hang occurred in the Linux kernel after v4.4-rc1 during the IPv6 Ready Logo
> Conformance Test. The bug is not fixed yet as of v4.5-rc7.
>
> Currently, it is under investigation.
>
> The following are the details:

I think you forgot to paste the details?

> IPv6 Ready Logo
> https://www.ipv6ready.org/
>
> I ran the IPv6 Ready Logo Core Conformance Test on an Intel D510MO (Atom D510).
> The userland is built with the Yocto Project.
>
> Test Environment
> Test Specification : 4.0.6
> Tool Version : REL_3_3_2
> Test Program Version : V6LC_5_0_0
> Target Device : Intel D510MO (Atom D510)
>
> Regards,
> Yuki Machida
Re: [PATCH net-next] bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS
From: Alexei Starovoitov Date: Wed, 9 Mar 2016 18:56:49 -0800 > 0-day bot reported build error: > kernel/built-in.o: In function `map_lookup_elem': >>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to >>> `bpf_stackmap_copy' > when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not. > Add weak definition to resolve it. > This code path in map_lookup_elem() is never taken > when CONFIG_PERF_EVENTS is not set. > > Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") > Reported-by: Fengguang Wu > Signed-off-by: Alexei Starovoitov Applied.
Re: [PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()
From: Alexei Starovoitov Date: Wed, 9 Mar 2016 20:02:33 -0800 > Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but > the result is typically passed to print("%s", buf) and extra bytes > after zero don't cause any harm. > In bpf the result of bpf_get_current_comm() is used as the part of > map key and was causing spurious hash map mismatches. > Use strlcpy() to guarantee zero-terminated string. > bpf verifier checks that output buffer is zero-initialized, > so even for short task names the output buffer don't have junk bytes. > Note it's not a security concern, since kprobe+bpf is root only. > > Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm > accessors") > Reported-by: Tobias Waldekranz > Signed-off-by: Alexei Starovoitov > --- > Targeting net-next, since it's too late for net. > I think it makes sense for stable as well. Applied and queued up for -stable, thanks.
Re: [PATCH net-next 0/3] validate variable length ll headers
From: Willem de Bruijn Date: Wed, 9 Mar 2016 23:22:59 -0500 > On Wed, Mar 9, 2016 at 10:13 PM, David Miller wrote: >> From: Willem de Bruijn >> Date: Wed, 9 Mar 2016 21:58:31 -0500 >> >>> Allow device-specific validation of link layer headers. Existing >>> checks drop all packets shorter than hard_header_len. For variable >>> length protocols, such packets can be valid. >>> >>> patch 1 adds header_ops.validate and dev_validate_header >>> patch 2 implements the protocol specific callback for AX25 >>> patch 3 replaces ll_header_truncated with dev_validate_header >> >> Series applied and queued up for -stable, thanks! > > Thanks, David. Please don't queue the earlier net patchset for stable. > I spotted an issue while converting to net-next, I'm afraid. I can send a > net v2 (as net-next won't apply), or a separate stable patch once 4.5 is cut. I'll use the net-next patches as the basis for the backport I do, thanks.
Re: [PATCH net-next 0/3] validate variable length ll headers
On Wed, Mar 9, 2016 at 10:13 PM, David Miller wrote: > From: Willem de Bruijn > Date: Wed, 9 Mar 2016 21:58:31 -0500 > >> Allow device-specific validation of link layer headers. Existing >> checks drop all packets shorter than hard_header_len. For variable >> length protocols, such packets can be valid. >> >> patch 1 adds header_ops.validate and dev_validate_header >> patch 2 implements the protocol specific callback for AX25 >> patch 3 replaces ll_header_truncated with dev_validate_header > > Series applied and queued up for -stable, thanks! Thanks, David. Please don't queue the earlier net patchset for stable. I spotted an issue while converting to net-next, I'm afraid. I can send a net v2 (as net-next won't apply), or a separate stable patch once 4.5 is cut.
Hang occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test
Hi all,

A hang occurred in the Linux kernel after v4.4-rc1 during the IPv6 Ready Logo Conformance Test. The bug is not fixed yet as of v4.5-rc7.

Currently, it is under investigation.

The following are the details:

IPv6 Ready Logo
https://www.ipv6ready.org/

I ran the IPv6 Ready Logo Core Conformance Test on an Intel D510MO (Atom D510). The userland is built with the Yocto Project.

Test Environment
Test Specification : 4.0.6
Tool Version : REL_3_3_2
Test Program Version : V6LC_5_0_0
Target Device : Intel D510MO (Atom D510)

Regards,
Yuki Machida
[PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); the result is typically passed to print("%s", buf) and the extra bytes after the terminating zero cause no harm. In bpf the result of bpf_get_current_comm() is used as part of a map key and was causing spurious hash map mismatches. Use strlcpy() to guarantee a zero-terminated string. The bpf verifier checks that the output buffer is zero-initialized, so even for short task names the output buffer does not end up with junk bytes. Note it is not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz
Signed-off-by: Alexei Starovoitov
---
Targeting net-next, since it's too late for net.
I think it makes sense for stable as well.

 kernel/bpf/helpers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4504ca66118d..50da680c479f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -166,7 +166,7 @@ static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5)
 	if (!task)
 		return -EINVAL;

-	memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
+	strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
 	return 0;
 }
-- 
2.8.0.rc1
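The mismatch this patch fixes can be reproduced in plain userspace C. strlcpy() is a kernel/BSD function not guaranteed by every libc, so the sketch defines a local equivalent; the buffer size is an illustrative stand-in for TASK_COMM_LEN:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COMM_LEN 16	/* stand-in for TASK_COMM_LEN */

/* Local equivalent of the kernel's strlcpy(): copy at most size-1
 * bytes and always NUL-terminate. */
static size_t my_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Two comm buffers holding the same string "sh" but different junk
 * after the NUL, as happens when a task's previous, longer name left
 * stale bytes behind. */
static void make_comms(char *comm1, char *comm2)
{
	memset(comm1, 0, COMM_LEN);
	memcpy(comm1, "sh", 3);		/* copies "sh\0", rest is zero */
	memset(comm2, 'X', COMM_LEN);	/* stale junk */
	memcpy(comm2, "sh", 3);		/* copies "sh\0", junk follows */
}
```

Comparing the raw buffers (what memcpy of the full TASK_COMM_LEN put into the map key) shows a mismatch, while strlcpy into zero-initialized key buffers yields identical keys — exactly the hash-map behavior described above.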
Re: [PATCH nf-next v9 8/8] openvswitch: Interface with NAT.
Hi Jarno,

Thanks for working on this. Mostly just a few style things around #ifdefs below.

On 9 March 2016 at 15:10, Jarno Rajahalme wrote:
> Extend OVS conntrack interface to cover NAT. New nested
> OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action.
> A bare OVS_CT_ATTR_NAT only mangles existing and expected connections.
> If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested
> attributes, new (non-committed/non-confirmed) connections are mangled
> according to the rest of the nested attributes.
>
> The corresponding OVS userspace patch series includes test cases (in
> tests/system-traffic.at) that also serve as example uses.
>
> This work extends on a branch by Thomas Graf at
> https://github.com/tgraf/ovs/tree/nat.

Thomas, I guess there was no signoff in these patches, so Jarno does not have your signoff in this patch.

> Signed-off-by: Jarno Rajahalme
> ---
> v9: Fixed module dependencies.
>
>  include/uapi/linux/openvswitch.h |  49 ++++
>  net/openvswitch/Kconfig          |   3 +-
>  net/openvswitch/conntrack.c      | 523 ++++++++++++++++++++++++++++++++++---
>  net/openvswitch/conntrack.h      |   3 +-
>  4 files changed, 551 insertions(+), 27 deletions(-)
>
> diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
> index cd5fd9d..23471a4 100644
> --- a/net/openvswitch/Kconfig
> +++ b/net/openvswitch/Kconfig
> @@ -6,7 +6,8 @@ config OPENVSWITCH
>  	tristate "Open vSwitch"
>  	depends on INET
>  	depends on !NF_CONNTRACK || \
> -		   (NF_CONNTRACK && (!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6))
> +		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
> +		      (!NF_NAT || NF_NAT)))

Whitespace.
>  	select LIBCRC32C
>  	select MPLS
>  	select NET_MPLS_GSO
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 5711f80..6455237 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
>  struct ovs_ct_len_tbl {
> -	size_t maxlen;
> -	size_t minlen;
> +	int maxlen;
> +	int minlen;
>  };

Are these changed for a specific reason, or just to use INT_MAX rather than SIZE_MAX in ovs_ct_len_tbl?

>  /* Metadata mark for masked write to conntrack mark */
> @@ -42,15 +52,29 @@ struct md_labels {
>  	struct ovs_key_ct_labels mask;
>  };
>
> +#ifdef CONFIG_NF_NAT_NEEDED
> +enum ovs_ct_nat {
> +	OVS_CT_NAT = 1 << 0,		/* NAT for committed connections only. */
> +	OVS_CT_SRC_NAT = 1 << 1,	/* Source NAT for NEW connections. */
> +	OVS_CT_DST_NAT = 1 << 2,	/* Destination NAT for NEW connections. */
> +};
> +#endif

Here...

>  /* Conntrack action context for execution. */
>  struct ovs_conntrack_info {
>  	struct nf_conntrack_helper *helper;
>  	struct nf_conntrack_zone zone;
>  	struct nf_conn *ct;
>  	u8 commit : 1;
> +#ifdef CONFIG_NF_NAT_NEEDED
> +	u8 nat : 3;	/* enum ovs_ct_nat */
> +#endif

and here.. I wonder if we can trim more of these #ifdefs, for readability and more compiler coverage if the feature is disabled.

>  	u16 family;
>  	struct md_mark mark;
>  	struct md_labels labels;
> +#ifdef CONFIG_NF_NAT_NEEDED
> +	struct nf_nat_range range;	/* Only present for SRC NAT and DST NAT. */
> +#endif
>  };

(probably not this one, though)

>  static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
> @@ -137,12 +161,15 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
>  	ovs_ct_get_labels(ct, &key->ct.labels);
>  }
>
> -/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has
> - * previously sent the packet to conntrack via the ct action.
> +/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has
> + * previously sent the packet to conntrack via the ct action.  If
> + * 'keep_nat_flags' is true, the existing NAT flags retained, else they are
> + * initialized from the connection status.
>   */
>  static void ovs_ct_update_key(const struct sk_buff *skb,
>  			      const struct ovs_conntrack_info *info,
> -			      struct sw_flow_key *key, bool post_ct)
> +			      struct sw_flow_key *key, bool post_ct,
> +			      bool keep_nat_flags)
>  {
>  	const struct nf_conntrack_zone *zone = &nf_ct_zone_dflt;
>  	enum ip_conntrack_info ctinfo;
> @@ -160,6 +187,16 @@ static void ovs_ct_update_key(const struct sk_buff *skb,
>  	 */
>  	if (ct->master)
>  		state |= OVS_CS_F_RELATED;
> +#ifdef CONFIG_NF_NAT_NEEDED
> +	if (keep_nat_flags) {
> +		state |= key->ct.state & OVS_CS_F_NAT_MASK;
> +	} else {
> +		if (ct->status & IPS_SRC_NAT)
> +			state |= OVS_CS_F_SRC_NAT;
> +		if (ct->stat
Re: [PATCH] bpf: make bpf_stackmap_copy conditionally called
On Thu, Mar 10, 2016 at 02:43:42AM +0100, Arnd Bergmann wrote:
> Changing the bpf syscall to use the new bpf_stackmap_copy() helper for
> BPF_MAP_TYPE_STACK_TRACE causes a link error when CONFIG_PERF_EVENTS
> is disabled:
>
> kernel/built-in.o: In function `map_lookup_elem':
> :(.text+0x7fca4): undefined reference to `bpf_stackmap_copy'
>
> This patch simply avoids handling that case, which may or may not
> be the correct answer here.
>
> Signed-off-by: Arnd Bergmann
> Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
...
> -	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
> +	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE &&
> +		   IS_ENABLED(CONFIG_PERF_EVENTS)) {
>  		err = bpf_stackmap_copy(map, key, value);
>  	} else {

Yes, this is also an ok-ish fix. I've sent a different version already:
http://patchwork.ozlabs.org/patch/595617/

I considered an option like yours, but it relies on gcc doing dead code elimination of the 'if (false) {}' branch, and though the kernel is never compiled with -O0, I didn't want to take the risk. I'm fine with either approach though.
[PATCH] bpf: make bpf_stackmap_copy conditionally called
Changing the bpf syscall to use the new bpf_stackmap_copy() helper for BPF_MAP_TYPE_STACK_TRACE causes a link error when CONFIG_PERF_EVENTS is disabled:

kernel/built-in.o: In function `map_lookup_elem':
:(.text+0x7fca4): undefined reference to `bpf_stackmap_copy'

This patch simply avoids handling that case, which may or may not be the correct answer here.

Signed-off-by: Arnd Bergmann
Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
---
 kernel/bpf/syscall.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2978d0d08869..b014b64b6116 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -290,7 +290,8 @@ static int map_lookup_elem(union bpf_attr *attr)
 		err = bpf_percpu_hash_copy(map, key, value);
 	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
 		err = bpf_percpu_array_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE &&
+		   IS_ENABLED(CONFIG_PERF_EVENTS)) {
 		err = bpf_stackmap_copy(map, key, value);
 	} else {
 		rcu_read_lock();
-- 
2.7.0
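The trade-off discussed in the follow-up — relying on the compiler to drop the branch guarded by IS_ENABLED() — rests on IS_ENABLED() collapsing to the integer constant 1 or 0 at preprocessing time. Below is a minimal userspace model of the kernel macro from include/linux/kconfig.h (simplified: the tristate/_MODULE handling is omitted), together with a toy lookup function standing in for the map_lookup_elem() branch:

```c
#include <assert.h>

/* Simplified model of IS_ENABLED() from include/linux/kconfig.h.
 * Config options that are on are defined to 1; options that are off
 * are simply not defined at all. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_DEMO_ON 1
/* CONFIG_DEMO_OFF deliberately left undefined */

/* Because IS_ENABLED() yields an ordinary integer constant, a branch
 * guarded by it stays syntactically checked but can be eliminated as
 * dead code -- which is what makes the undefined reference disappear,
 * provided the optimizer actually runs. */
static int toy_lookup(int enabled)
{
	if (enabled)
		return 42;	/* would call bpf_stackmap_copy() */
	return -1;		/* fallback path */
}
```

The macro works by token-pasting: for a defined option, `__ARG_PLACEHOLDER_1` expands to `0,`, shifting the literal `1` into the second-argument slot; for an undefined option the paste produces an inert token and the default `0` is selected.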
Re: [PATCH net-next 0/3] validate variable length ll headers
From: Willem de Bruijn Date: Wed, 9 Mar 2016 21:58:31 -0500 > Allow device-specific validation of link layer headers. Existing > checks drop all packets shorter than hard_header_len. For variable > length protocols, such packets can be valid. > > patch 1 adds header_ops.validate and dev_validate_header > patch 2 implements the protocol specific callback for AX25 > patch 3 replaces ll_header_truncated with dev_validate_header Series applied and queued up for -stable, thanks!
[PATCH net-next 3/3] packet: validate variable length ll headers
From: Willem de Bruijn

Replace link layer header validation check ll_header_truncate with more generic dev_validate_header. Validation based on hard_header_len incorrectly drops valid packets in variable length protocols, such as AX25. dev_validate_header calls header_ops.validate for such protocols to ensure correctness below hard_header_len.

See also http://comments.gmane.org/gmane.linux.network/401064

Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Willem de Bruijn
---
 net/packet/af_packet.c | 43 ++++++++++++++++++-------------------------
 1 file changed, 18 insertions(+), 25 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d41b107..1ecfa71 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1915,6 +1915,10 @@ retry:
 		goto retry;
 	}

+	if (!dev_validate_header(dev, skb->data, len)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
 	if (len > (dev->mtu + dev->hard_header_len + extra_len) &&
 	    !packet_extra_vlan_len_allowed(dev, skb)) {
 		err = -EMSGSIZE;
@@ -2393,18 +2397,6 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }

-static bool ll_header_truncated(const struct net_device *dev, int len)
-{
-	/* net device doesn't like empty head */
-	if (unlikely(len < dev->hard_header_len)) {
-		net_warn_ratelimited("%s: packet size is too short (%d < %d)\n",
-				     current->comm, len, dev->hard_header_len);
-		return true;
-	}
-
-	return false;
-}
-
 static void tpacket_set_protocol(const struct net_device *dev,
 				 struct sk_buff *skb)
 {
@@ -2522,16 +2514,20 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		if (unlikely(err < 0))
 			return -EINVAL;
 	} else if (copylen) {
+		int hdrlen = min_t(int, copylen, tp_len);
+
 		skb_push(skb, dev->hard_header_len);
 		skb_put(skb, copylen - dev->hard_header_len);
-		err = skb_store_bits(skb, 0, data, copylen);
+		err = skb_store_bits(skb, 0, data, hdrlen);
 		if (unlikely(err))
 			return err;
+		if (!dev_validate_header(dev, skb->data, hdrlen))
+			return -EINVAL;
 		if (!skb->protocol)
 			tpacket_set_protocol(dev, skb);

-		data += copylen;
-		to_write -= copylen;
+		data += hdrlen;
+		to_write -= hdrlen;
 	}

 	offset = offset_in_page(data);
@@ -2703,13 +2699,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			copylen = __virtio16_to_cpu(vio_le(),
 						    vnet_hdr->hdr_len);
 		}
-		if (dev->hard_header_len) {
-			if (ll_header_truncated(dev, tp_len)) {
-				tp_len = -EINVAL;
-				goto tpacket_error;
-			}
-			copylen = max_t(int, copylen, dev->hard_header_len);
-		}
+		copylen = max_t(int, copylen, dev->hard_header_len);
 		skb = sock_alloc_send_skb(&po->sk,
 				hlen + tlen +
 				sizeof(struct sockaddr_ll) +
 				(copylen - dev->hard_header_len),
@@ -2905,9 +2895,6 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		offset = dev_hard_header(skb, dev, ntohs(proto), addr, NULL, len);
 		if (unlikely(offset < 0))
 			goto out_free;
-	} else {
-		if (ll_header_truncated(dev, len))
-			goto out_free;
 	}

 	/* Returns -EFAULT on error */
@@ -2915,6 +2902,12 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (err)
 		goto out_free;

+	if (sock->type == SOCK_RAW &&
+	    !dev_validate_header(dev, skb->data, len)) {
+		err = -EINVAL;
+		goto out_free;
+	}
+
 	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);

 	if (!vnet_hdr.gso_type && (len > dev->mtu + reserve + extra_len) &&
-- 
2.7.0.rc3.207.g0ac5344
[PATCH net-next 2/3] ax25: add link layer header validation function
From: Willem de Bruijn

As variable length protocol, AX25 fails link layer header validation tests based on a minimum length. header_ops.validate allows protocols to validate headers that are shorter than hard_header_len. Implement this callback for AX25.

See also http://comments.gmane.org/gmane.linux.network/401064

Signed-off-by: Willem de Bruijn
---
 net/ax25/ax25_ip.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index b563a3f..2fa3be9 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -228,8 +228,23 @@ netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 }
 #endif

+static bool ax25_validate_header(const char *header, unsigned int len)
+{
+	ax25_digi digi;
+
+	if (!len)
+		return false;
+
+	if (header[0])
+		return true;
+
+	return ax25_addr_parse(header + 1, len - 1, NULL, NULL, &digi, NULL,
+			       NULL);
+}
+
 const struct header_ops ax25_header_ops = {
 	.create = ax25_hard_header,
+	.validate = ax25_validate_header,
 };

 EXPORT_SYMBOL(ax25_header_ops);
-- 
2.7.0.rc3.207.g0ac5344
[PATCH net-next 1/3] net: validate variable length ll headers
From: Willem de Bruijn

Netdevice parameter hard_header_len is variously interpreted both as an upper and lower bound on link layer header length. The field is used as upper bound when reserving room at allocation, as lower bound when validating user input in PF_PACKET. Clarify the definition to be maximum header length.

For validation of untrusted headers, add an optional validate member to header_ops. Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance for deliberate testing of corrupt input. In this case, pad trailing bytes, as some device drivers expect completely initialized headers.

See also http://comments.gmane.org/gmane.linux.network/401064

Signed-off-by: Willem de Bruijn
---
 include/linux/netdevice.h | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efe7cec..fd30cb5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -268,6 +268,7 @@ struct header_ops {
 	void	(*cache_update)(struct hh_cache *hh,
 				const struct net_device *dev,
 				const unsigned char *haddr);
+	bool	(*validate)(const char *ll_header, unsigned int len);
 };

 /* These flag bits are private to the generic network queueing
@@ -1459,8 +1460,7 @@ enum netdev_priv_flags {
  *	@dma:		DMA channel
  *	@mtu:		Interface MTU value
  *	@type:		Interface hardware type
- *	@hard_header_len: Hardware header length, which means that this is the
- *			  minimum size of a packet.
+ *	@hard_header_len: Maximum hardware header length.
  *
  *	@needed_headroom: Extra headroom the hardware may need, but not in all
  *			  cases can this be guaranteed
@@ -2687,6 +2687,24 @@ static inline int dev_parse_header(const struct sk_buff *skb,
 	return dev->header_ops->parse(skb, haddr);
 }

+/* ll_header must have at least hard_header_len allocated */
+static inline bool dev_validate_header(const struct net_device *dev,
+				       char *ll_header, int len)
+{
+	if (likely(len >= dev->hard_header_len))
+		return true;
+
+	if (capable(CAP_SYS_RAWIO)) {
+		memset(ll_header + len, 0, dev->hard_header_len - len);
+		return true;
+	}
+
+	if (dev->header_ops && dev->header_ops->validate)
+		return dev->header_ops->validate(ll_header, len);
+
+	return false;
+}
+
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr, int len);
 int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
 static inline int unregister_gifconf(unsigned int family)
-- 
2.7.0.rc3.207.g0ac5344
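The shape of the new check can be illustrated with a toy variable-length protocol in userspace C (all names below are hypothetical): the first header byte encodes the actual header length, so a frame shorter than the maximum can still be complete, and only the protocol itself can tell:

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_MAX_HDR = 8 };	/* plays the role of hard_header_len */

/* Protocol-specific check, the analogue of header_ops.validate:
 * byte 0 carries the real header length for this toy protocol. */
static bool toy_validate(const char *ll_header, unsigned int len)
{
	return len >= 1 && (unsigned char)ll_header[0] <= len;
}

/* Analogue of dev_validate_header() without the CAP_SYS_RAWIO escape
 * hatch: full-length headers pass as before, while short ones are
 * referred to the protocol callback instead of being dropped. */
static bool toy_dev_validate_header(const char *ll_header, unsigned int len)
{
	if (len >= TOY_MAX_HDR)
		return true;
	return toy_validate(ll_header, len);
}
```

Under the old hard_header_len-only rule, every frame shorter than TOY_MAX_HDR would have been rejected; with the callback, a 3-byte frame whose length byte says 3 is accepted.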
[PATCH net-next 0/3] validate variable length ll headers
From: Willem de Bruijn

Allow device-specific validation of link layer headers. Existing checks drop all packets shorter than hard_header_len. For variable length protocols, such packets can be valid.

patch 1 adds header_ops.validate and dev_validate_header
patch 2 implements the protocol specific callback for AX25
patch 3 replaces ll_header_truncated with dev_validate_header

Willem de Bruijn (3):
  net: validate variable length ll headers
  ax25: add link layer header validation function
  packet: validate variable length ll headers

 include/linux/netdevice.h | 22 ++++++++++++++++++++--
 net/ax25/ax25_ip.c        | 15 +++++++++++++++
 net/packet/af_packet.c    | 43 ++++++++++++++++++-------------------------
 3 files changed, 53 insertions(+), 27 deletions(-)
-- 
2.7.0.rc3.207.g0ac5344
[PATCH net-next] bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS
0-day bot reported build error:

kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'

when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not. Add weak definition to resolve it. This code path in map_lookup_elem() is never taken when CONFIG_PERF_EVENTS is not set.

Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Reported-by: Fengguang Wu
Signed-off-by: Alexei Starovoitov
---
 kernel/bpf/syscall.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2978d0d08869..2a2efe1bc76c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -244,6 +244,11 @@ static void __user *u64_to_ptr(__u64 val)
 	return (void __user *) (unsigned long) val;
 }

+int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
+{
+	return -ENOTSUPP;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
-- 
2.8.0.rc1
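The linker mechanics behind this fix can be sketched in userspace C with GCC/Clang's `__attribute__((weak))`, which is what the kernel's `__weak` expands to. The function name below is a stand-in; the point is that the weak body is used only when no strong definition exists in any linked object:

```c
#include <assert.h>

#define ENOTSUPP 524	/* the kernel's ENOTSUPP value, for illustration */

/* Weak fallback, analogous to the stub added for bpf_stackmap_copy():
 * if some other translation unit provided a strong stackmap_copy(),
 * the linker would pick that one and this body would be discarded.
 * Here nothing else defines it, so the stub resolves the reference. */
__attribute__((weak)) int stackmap_copy(void *key, void *value)
{
	(void)key;
	(void)value;
	return -ENOTSUPP;
}
```

This is why the patch fixes the link error without touching the call site: the reference in map_lookup_elem() always resolves, and the stub body is only ever reached on a path the commit message says is never taken when CONFIG_PERF_EVENTS is off.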
[RFC PATCH 1/1] net namespace: dynamically configure new net namespace inherit net config
Sometimes the system engineer and application expect a new net namespace to inherit config from the base net config. Sometimes the current net config is expected by the system engineer and application. So it is necessary that the system engineer and application can choose a new net namespace to inherit from the base net config, or the current net config. For example, the value of /proc/sys/net/ipv4/ip_forward is taken as an example. The value of /proc/sys/net/ipv4/ip_forward in the base net config is 0 while the value of /proc/sys/net/ipv4/ip_forward is changed to 1 in the current net config. The system engineer and application can choose a new net namespace to inherit the value of /proc/sys/net/ipv4/ip_forward from the base or the current settings. Test case: 1. % cat /proc/sys/net/ipv4/net_ns_inherit 1 2. Set ip forwarding in the "base namespace" % echo 1 > /proc/sys/net/ipv4/ip_forward % cat /proc/sys/net/ipv4/ip_forward 1 3. Create a new namespace % ip netns add mynewns 4. Check ip forwarding in the new namespace % ip netns exec mynewns cat /proc/sys/net/ipv4/ip_forward 1 5. % echo 0 > /proc/sys/net/ipv4/net_ns_inherit % cat /proc/sys/net/ipv4/net_ns_inherit 0 6. Set ip forwarding in the "base namespace" % echo 1 > /proc/sys/net/ipv4/ip_forward % cat /proc/sys/net/ipv4/ip_forward 1 7. Create a new namespace % ip netns add mynewns_new 8. Check ip forwarding in the new namespace % ip netns exec mynewns_new cat /proc/sys/net/ipv4/ip_forward 0 Suggested-by: Bruce Ashfield Signed-off-by: Zhu Yanjun CC: David S. 
Miller CC: Alexey Kuznetsov CC: James Morris CC: Hideaki YOSHIFUJI CC: Patrick McHardy --- include/linux/inetdevice.h |2 +- include/net/ip.h|3 +++ include/uapi/linux/sysctl.h |1 + net/ipv4/devinet.c | 58 --- net/ipv4/sysctl_net_ipv4.c |7 ++ 5 files changed, 61 insertions(+), 10 deletions(-) diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h index ee971f3..1c0ae93 100644 --- a/include/linux/inetdevice.h +++ b/include/linux/inetdevice.h @@ -164,7 +164,7 @@ static inline struct net_device *ip_dev_find(struct net *net, __be32 addr) int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b); int devinet_ioctl(struct net *net, unsigned int cmd, void __user *); -void devinet_init(void); +int devinet_init(void); struct in_device *inetdev_by_index(struct net *, int); __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope); __be32 inet_confirm_addr(struct net *net, struct in_device *in_dev, __be32 dst, diff --git a/include/net/ip.h b/include/net/ip.h index 1a98f1c..0ad4a7d 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -245,6 +245,9 @@ extern int inet_peer_threshold; extern int inet_peer_minttl; extern int inet_peer_maxttl; +/* From devinet.c */ +extern int net_ns_inherit; + /* From ip_input.c */ extern int sysctl_ip_early_demux; diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 0956373..350c3ce 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -426,6 +426,7 @@ enum NET_TCP_ALLOWED_CONG_CONTROL=123, NET_TCP_MAX_SSTHRESH=124, NET_TCP_FRTO_RESPONSE=125, + NET_IPV4_NET_NS_INHERIT = 126, }; enum { diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index cebd9d3..b68d7fa 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -2277,28 +2277,31 @@ static struct ctl_table ctl_forward_entry[] = { }; #endif +#define NET_NS_INIT_DEFAULT0 +#define NET_NS_INIT_MODIFIED 1 + +/* net ns initialized from current */ +int net_ns_inherit __read_mostly = NET_NS_INIT_MODIFIED; 
+static struct ipv4_devconf *all_backup, *dflt_backup; + static __net_init int devinet_init_net(struct net *net) { int err; - struct ipv4_devconf *all, *dflt; + struct ipv4_devconf *all = NULL, *dflt = NULL; #ifdef CONFIG_SYSCTL struct ctl_table *tbl = ctl_forward_entry; struct ctl_table_header *forw_hdr; #endif - err = -ENOMEM; - all = &ipv4_devconf; - dflt = &ipv4_devconf_dflt; - if (!net_eq(net, &init_net)) { - all = kmemdup(all, sizeof(ipv4_devconf), GFP_KERNEL); + if (net_ns_inherit == NET_NS_INIT_DEFAULT) { + all = kmemdup(all_backup, sizeof(ipv4_devconf), GFP_KERNEL); if (!all) goto err_alloc_all; - dflt = kmemdup(dflt, sizeof(ipv4_devconf_dflt), GFP_KERNEL); + dflt = kmemdup(dflt_backup, sizeof(ipv4_devconf_dflt), GFP_KERNEL); if (!dflt) goto err_alloc_dflt; - #ifdef CONFIG_SYSCTL tbl = kmemdup(tbl, sizeof(ctl_forward_entry), GFP_KERNEL); if (!tbl) @@ -2309,6 +2312,29 @@ static __net_init int devinet_init_net(struct net *net) tbl[0].extra2 = net; #endif
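The decision the patch adds to devinet_init_net() can be sketched as a small userspace C model — a new namespace copies its devconf either from a pristine backup taken at init time (NET_NS_INIT_DEFAULT) or from the possibly-modified current config (NET_NS_INIT_MODIFIED). The struct and function names below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <string.h>

#define NET_NS_INIT_DEFAULT  0
#define NET_NS_INIT_MODIFIED 1

/* toy stand-in for struct ipv4_devconf */
struct devconf { int ip_forward; };

static struct devconf base_backup = { .ip_forward = 0 }; /* snapshot at init */
static struct devconf current_cfg = { .ip_forward = 1 }; /* modified later  */

/* what devinet_init_net() does for a new namespace, conceptually:
 * kmemdup() from the backup or from the live config, depending on
 * the net_ns_inherit sysctl */
struct devconf new_ns_conf(int net_ns_inherit)
{
	struct devconf c;

	if (net_ns_inherit == NET_NS_INIT_DEFAULT)
		memcpy(&c, &base_backup, sizeof(c));
	else
		memcpy(&c, &current_cfg, sizeof(c));
	return c;
}
```

This mirrors the test case in the commit message: with net_ns_inherit=1 the new namespace sees ip_forward=1; after `echo 0 > .../net_ns_inherit` it sees the base value 0.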
Re: [PATCH nf-next v9 7/8] openvswitch: Delay conntrack helper call for new connections.
On 9 March 2016 at 15:10, Jarno Rajahalme wrote:
> There is no need to help connections that are not confirmed, so we can
> delay helping new connections to the time when they are confirmed.
> This change is needed for NAT support, and having this as a separate
> patch will make the following NAT patch a bit easier to review.
>
> Signed-off-by: Jarno Rajahalme
> ---
>  net/openvswitch/conntrack.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 92613de..5711f80 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -506,11 +510,17 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
>  			return -ENOENT;
>
>  		ovs_ct_update_key(skb, info, key, true);
> +	}
>
> -	if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
> -		WARN_ONCE(1, "helper rejected packet");
> -		return -EINVAL;
> -	}
> +	/* Call the helper only if we did nf_conntrack_in() above ('!cached')
> +	 * for confirmed connections, but only when committing for unconfirmed
> +	 * connections.
> +	 */

Minor nit, try this wording for readability?

/* Call the helper only if:
 * - nf_conntrack_in() was executed above ("!cached") for a confirmed
 *   connection, or
 * - when committing an unconfirmed connection.
 */
Re: [PATCH nf-next v9 5/8] openvswitch: Find existing conntrack entry after upcall.
On 9 March 2016 at 15:10, Jarno Rajahalme wrote: > Add a new function ovs_ct_find_existing() to find an existing > conntrack entry for which this packet was already applied to. This is > only to be called when there is evidence that the packet was already > tracked and committed, but we lost the ct reference due to an > userspace upcall. > > ovs_ct_find_existing() is called from skb_nfct_cached(), which can now > hide the fact that the ct reference may have been lost due to an > upcall. This allows ovs_ct_commit() to be simplified. > > This patch is needed by later "openvswitch: Interface with NAT" patch, > as we need to be able to pass the packet through NAT using the > original ct reference also after the reference is lost after an > upcall. > > Signed-off-by: Jarno Rajahalme Acked-by: Joe Stringer
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On Wed, Mar 9, 2016 at 5:45 PM, Jonas Markussen wrote: > >> On 10 Mar 2016, at 01:20, Yuchung Cheng wrote: >> >> PS. I don't understand how (old) RDB can masquerade the losses by >> skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets >> and the last 3 were (s)acked. We perform RDB to send a packet that has >> previous 4 payloads + 1 new byte. The sender still gets the loss >> information? >> > > If I’ve understood you correctly, you’re talking about sending 4 > packets and the first one is lost? > > In this case, RDB will not only bundle on the last/new packet but also > as it sends packet 2 (which will contain 1+2), packet 3 (1+2+3) > and packet 4 (1+2+3+4). > > So the fact that packet 1 was lost is masqueraded when it is > recovered by packet 2 and there won’t be any gap in the SACK window > indicating that packet 1 was lost. I see. Thanks for the clarification. So my question is still if thin-stream app has enough inflight to use ack-triggered recovery. i.e., it has to send at least twice within an RTT. Also have you tested this with non-Linux receivers? Thanks. > > Best regards, > Jonas Markussen
Re: [ovs-dev] [PATCH nf-next v9 4/8] openvswitch: Update the CT state key only after nf_conntrack_in().
On 9 March 2016 at 15:10, Jarno Rajahalme wrote: > Only a successful nf_conntrack_in() call can effect a connection state > change, so if suffices to update the key only after the > nf_conntrack_in() returns. "it" suffices to update... > This change is needed for the later NAT patches. > > Signed-off-by: Jarno Rajahalme Acked-by: Joe Stringer
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
> On 10 Mar 2016, at 01:20, Yuchung Cheng wrote: > > PS. I don't understand how (old) RDB can masquerade the losses by > skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets > and the last 3 were (s)acked. We perform RDB to send a packet that has > previous 4 payloads + 1 new byte. The sender still gets the loss > information? > If I’ve understood you correctly, you’re talking about sending 4 packets and the first one is lost? In this case, RDB will not only bundle on the last/new packet but also as it sends packet 2 (which will contain 1+2), packet 3 (1+2+3) and packet 4 (1+2+3+4). So the fact that packet 1 was lost is masqueraded when it is recovered by packet 2 and there won’t be any gap in the SACK window indicating that packet 1 was lost. Best regards, Jonas Markussen
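Jonas's example can be expressed as a toy receiver model: because an RDB segment always re-carries everything from the first unacked byte, the loss of packet 1 never opens a SACK hole at the receiver, so no DUPACKs are generated. This is purely an illustration of the discussion above, not code from the patch set:

```c
#include <assert.h>

/* receiver state: highest contiguous byte received (cumulative ACK point) */
int rcv_nxt;

/* deliver a segment covering bytes [start, end) */
void deliver(int start, int end)
{
	if (start <= rcv_nxt && end > rcv_nxt)
		rcv_nxt = end;	/* fills or extends the head of line */
	/* start > rcv_nxt would leave a gap that SACK reports as a hole;
	 * RDB segments always start at the first unacked byte, so with
	 * RDB that branch is never taken */
}
```

With four 1-byte payloads and packet 1 lost: a non-RDB packet 2 covers only [1,2) and leaves a hole at [0,1), while an RDB packet 2 covers [0,2) and silently repairs the loss — which is exactly why the sender never learns about it.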
[PATCH net] net: hns: bug fix about ping6
The current upstream code fails to ping other IPv6 net devices, because the enet receives multicast packets whose source MAC address is the same as its own MAC address. These packets need to be dropped.

Signed-off-by: Kejian Yan
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 3f77ff7..9ad5da4 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -564,6 +564,7 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	struct sk_buff *skb;
 	struct hnae_desc *desc;
 	struct hnae_desc_cb *desc_cb;
+	struct ethhdr *eh;
 	unsigned char *va;
 	int bnum, length, i;
 	int pull_len;
@@ -670,6 +671,14 @@ out_bnum_err:
 		return -EFAULT;
 	}
 
+	/* filter out multicast pkt with the same src mac as this port */
+	eh = (struct ethhdr *)skb->data;
+	if (unlikely(is_multicast_ether_addr(eh->h_dest) &&
+		     ether_addr_equal(ndev->dev_addr, eh->h_source))) {
+		dev_kfree_skb_any(skb);
+		return -EFAULT;
+	}
+
 	ring->stats.rx_pkts++;
 	ring->stats.rx_bytes += skb->len;
-- 
1.9.1
[patch net 2/2] net: hns: fixes a bug of RSS
When trying to get the receive flow hash indirection table by ethtool, ethtool needs to call .get_rxnfc to get the ring number first, so this patch implements the .get_rxnfc ethtool op. Also, the data type of rss_indir_table entries is u32, so the entry count has to be multiplied by the width of the data type when using memcpy.

Signed-off-by: Kejian Yan
---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |  6 --
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c  | 20 
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index 3b8f301..c733a5a 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -779,7 +779,8 @@ static int hns_ae_get_rss(struct hnae_handle *handle, u32 *indir, u8 *key,
 	memcpy(key, ppe_cb->rss_key, HNS_PPEV2_RSS_KEY_SIZE);
 
 	/* update the current hash->queue mappings from the shadow RSS table */
-	memcpy(indir, ppe_cb->rss_indir_table, HNS_PPEV2_RSS_IND_TBL_SIZE);
+	memcpy(indir, ppe_cb->rss_indir_table,
+	       HNS_PPEV2_RSS_IND_TBL_SIZE * sizeof(*indir));
 
 	return 0;
 }
@@ -794,7 +795,8 @@ static int hns_ae_set_rss(struct hnae_handle *handle, const u32 *indir,
 		hns_ppe_set_rss_key(ppe_cb, (u32 *)key);
 
 	/* update the shadow RSS table with user specified qids */
-	memcpy(ppe_cb->rss_indir_table, indir, HNS_PPEV2_RSS_IND_TBL_SIZE);
+	memcpy(ppe_cb->rss_indir_table, indir,
+	       HNS_PPEV2_RSS_IND_TBL_SIZE * sizeof(*indir));
 
 	/* now update the hardware */
 	hns_ppe_set_indir_table(ppe_cb, ppe_cb->rss_indir_table);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index ada8e04..a070392 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -1244,6 +1244,25 @@ hns_set_rss(struct net_device *netdev, const u32 *indir, const u8 *key,
 	return ret;
 }
 
+static int hns_get_rxnfc(struct net_device *netdev,
+			 struct ethtool_rxnfc *cmd,
+			 u32 *rule_locs)
+{
+	struct hns_nic_priv *priv = netdev_priv(netdev);
+	int ret = 0;
+
+	switch (cmd->cmd) {
+	case ETHTOOL_GRXRINGS:
+		cmd->data = priv->ae_handle->q_num;
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		break;
+	}
+
+	return ret;
+}
+
 static struct ethtool_ops hns_ethtool_ops = {
 	.get_drvinfo = hns_nic_get_drvinfo,
 	.get_link = hns_nic_get_link,
@@ -1267,6 +1286,7 @@ static struct ethtool_ops hns_ethtool_ops = {
 	.get_rxfh_indir_size = hns_get_rss_indir_size,
 	.get_rxfh = hns_get_rss,
 	.set_rxfh = hns_set_rss,
+	.get_rxnfc = hns_get_rxnfc,
 };
 
 void hns_ethtool_set_ops(struct net_device *ndev)
-- 
1.9.1
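The sizeof() fix above comes down to one property of memcpy(): its size argument counts bytes, not array elements. Copying a u32 table with the element count moves only a quarter of the data, as this standalone sketch shows (the table size here is illustrative, not the hardware's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IND_TBL_SIZE 256	/* number of u32 entries, illustrative */

uint32_t shadow[IND_TBL_SIZE];	/* driver's shadow RSS table */
uint32_t user_tbl[IND_TBL_SIZE];	/* buffer handed to ethtool */

/* the bug: copies IND_TBL_SIZE *bytes*, i.e. only 64 u32 entries */
void copy_wrong(void)
{
	memcpy(user_tbl, shadow, IND_TBL_SIZE);
}

/* the fix: multiply by the element width to copy the whole table */
void copy_right(void)
{
	memcpy(user_tbl, shadow, IND_TBL_SIZE * sizeof(*user_tbl));
}
```

With the buggy variant, ethtool would only ever see the first quarter of the indirection table — the "can not get the total table" symptom from the cover letter.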
[patch net 0/2] net: hns: get and set RSS indirection table by using ethtool
When we use ethtool to retrieve or configure the receive flow hash indirection table, ethtool needs to call .get_rxnfc to get the ring number, so this patchset implements .get_rxnfc and fixes the bug that we could not get the total table each time.

Kejian Yan (2):
  net: hns: fix return value of the function about rss
  net: hns: fixes a bug of RSS

 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |  8 ---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c |  2 +-
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c  | 28 +++
 3 files changed, 30 insertions(+), 8 deletions(-)

-- 
1.9.1
[patch net 1/2] net: hns: fix return value of the function about rss
Both .get_rxfh and .set_rxfh always return 0; they should instead return the result from the hardware when getting or setting RSS. And the RSS functions should use the correct data type.

Signed-off-by: Kejian Yan
---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c  | 8
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index a0070d0..3b8f301 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -791,7 +791,7 @@ static int hns_ae_set_rss(struct hnae_handle *handle, const u32 *indir,
 
 	/* set the RSS Hash Key if specififed by the user */
 	if (key)
-		hns_ppe_set_rss_key(ppe_cb, (int *)key);
+		hns_ppe_set_rss_key(ppe_cb, (u32 *)key);
 
 	/* update the shadow RSS table with user specified qids */
 	memcpy(ppe_cb->rss_indir_table, indir, HNS_PPEV2_RSS_IND_TBL_SIZE);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index f302ef9..811ef35 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -27,7 +27,7 @@ void hns_ppe_set_tso_enable(struct hns_ppe_cb *ppe_cb, u32 value)
 void hns_ppe_set_rss_key(struct hns_ppe_cb *ppe_cb,
 			 const u32 rss_key[HNS_PPEV2_RSS_KEY_NUM])
 {
-	int key_item = 0;
+	u32 key_item = 0;
 
 	for (key_item = 0; key_item < HNS_PPEV2_RSS_KEY_NUM; key_item++)
 		dsaf_write_dev(ppe_cb, PPEV2_RSS_KEY_REG + key_item * 0x4,
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index 3df2284..ada8e04 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -1165,7 +1165,7 @@ hns_get_rss_key_size(struct net_device *netdev)
 	if (AE_IS_VER1(priv->enet_ver)) {
 		netdev_err(netdev,
 			   "RSS feature is not supported on this hardware\n");
-		return -EOPNOTSUPP;
+		return (u32)-EOPNOTSUPP;
 	}
 
 	ops = priv->ae_handle->dev->ops;
@@ -1184,7 +1184,7 @@ hns_get_rss_indir_size(struct net_device *netdev)
 	if (AE_IS_VER1(priv->enet_ver)) {
 		netdev_err(netdev,
 			   "RSS feature is not supported on this hardware\n");
-		return -EOPNOTSUPP;
+		return (u32)-EOPNOTSUPP;
 	}
 
 	ops = priv->ae_handle->dev->ops;
@@ -1213,7 +1213,7 @@ hns_get_rss(struct net_device *netdev, u32 *indir, u8 *key, u8 *hfunc)
 
 	ret = ops->get_rss(priv->ae_handle, indir, key, hfunc);
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1241,7 +1241,7 @@ hns_set_rss(struct net_device *netdev, const u32 *indir, const u8 *key,
 
 	ret = ops->set_rss(priv->ae_handle, indir, key, hfunc);
 
-	return 0;
+	return ret;
 }
 
 static struct ethtool_ops hns_ethtool_ops = {
-- 
1.9.1
Re: [PATCH net-next v3] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
On Wed, Mar 09, 2016 at 03:11:50PM -0800, Yuchung Cheng wrote: > On Wed, Mar 9, 2016 at 10:43 AM, Martin KaFai Lau wrote: > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index e90db85..24557a8 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct > > sk_buff *skb) > > skb->truesize = 2; > > } > > > > +static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff > > *skb) > > +{ > > + u16 segs_in; > > + > > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > > + tp->segs_in += segs_in; > > + if (skb->len > tcp_hdrlen(skb)) > > + tp->data_segs_in += segs_in; > > +} > > + > > #endif /* _TCP_H */ > > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > > index fdb286d..f583c85 100644 > > --- a/net/ipv4/tcp_fastopen.c > > +++ b/net/ipv4/tcp_fastopen.c > > @@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock > > *req, > > void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > > { > > struct tcp_sock *tp = tcp_sk(sk); > > + u16 segs_in; > > > > if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt) > > return; > > @@ -154,6 +155,9 @@ void tcp_fastopen_add_skb(struct sock *sk, struct > > sk_buff *skb) > > * as we certainly are not changing upper 32bit value (0) > > */ > > tp->bytes_received = skb->len; > > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > > + tp->segs_in = segs_in; > > + tp->data_segs_in = segs_in; > why not use the new tcp_segs_in() helper? It is because segs_in has been set to 1 by tcp_create_openreq_child(), so calling tcp_segs_in() will have it double counted and tcphdr has been pulled from skb, so skb->len does not include tcp_hdrlen. I will add some remark in this change.
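Martin's double-counting point can be modeled in a few lines: the child socket is created with segs_in already set to 1 (for the SYN), so the fastopen path must assign rather than reuse the "+=" helper. The struct and function names below are a simplified userspace model of the thread's explanation, not the real tcp_sock code:

```c
#include <assert.h>

/* toy stand-in for the relevant tcp_sock fields */
struct tp_model {
	unsigned short segs_in;
	unsigned short data_segs_in;
};

/* max_t(u16, 1, gso_segs) */
unsigned short seg_count(unsigned short gso_segs)
{
	return gso_segs > 1 ? gso_segs : 1;
}

/* the generic helper: adds (normal receive path) */
void tcp_segs_in_model(struct tp_model *tp, unsigned short gso_segs)
{
	tp->segs_in += seg_count(gso_segs);
}

/* fastopen path: the SYN was already counted by
 * tcp_create_openreq_child(), so assign instead of adding */
void fastopen_add_skb_model(struct tp_model *tp, unsigned short gso_segs)
{
	unsigned short s = seg_count(gso_segs);

	tp->segs_in = s;
	tp->data_segs_in = s;
}
```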
Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
On Wed, Mar 9, 2016 at 4:18 PM, Joe Perches wrote: > On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: >> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert wrote: >> > I took a look inlining these. >> > >> > #define rol32(V, X) ({ \ >> > int word = V; \ >> > if (__builtin_constant_p(X))\ >> > asm("roll $" #X ",%[word]\n\t" \ >> > : [word] "=r" (word)); \ >> > else\ >> > asm("roll %%cl,%[word]\n\t" \ >> > : [word] "=r" (word)\ >> > : "c" (X)); \ >> > word; \ >> > }) >> > >> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... >> Is gcc really not converting the rol32 calls into rotates? > > No, it is. > > The difference in the object code with the asm for instance is: > > (old, compiled with gcc 5.3.1) > > : > 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi > 854: 81 ef 09 41 52 21 sub$0x21524109,%edi > 85a: 55 push %rbp > 85b: 89 f0 mov%esi,%eax > 85d: 89 f2 mov%esi,%edx > 85f: 48 ff 05 00 00 00 00incq 0x0(%rip)# 866 > > 866: c1 c2 0erol$0xe,%edx > 869: 35 f7 be ad de xor$0xdeadbef7,%eax > 86e: 48 89 e5mov%rsp,%rbp > 871: 29 d0 sub%edx,%eax > 873: 48 ff 05 00 00 00 00incq 0x0(%rip)# 87a > > 87a: 48 ff 05 00 00 00 00incq 0x0(%rip)# 881 > > 881: 89 c2 mov%eax,%edx > 883: 31 c7 xor%eax,%edi > 885: c1 c2 0brol$0xb,%edx > 888: 29 d7 sub%edx,%edi > 88a: 89 fa mov%edi,%edx > 88c: 31 fe xor%edi,%esi > 88e: c1 ca 07ror$0x7,%edx > 891: 29 d6 sub%edx,%esi > 893: 89 f2 mov%esi,%edx > 895: 31 f0 xor%esi,%eax > 897: c1 c2 10rol$0x10,%edx > 89a: 29 d0 sub%edx,%eax > 89c: 89 c2 mov%eax,%edx > 89e: 31 c7 xor%eax,%edi > 8a0: c1 c2 04rol$0x4,%edx > 8a3: 29 d7 sub%edx,%edi > 8a5: 31 fe xor%edi,%esi > 8a7: c1 c7 0erol$0xe,%edi > 8aa: 29 fe sub%edi,%esi > 8ac: 31 f0 xor%esi,%eax > 8ae: c1 ce 08ror$0x8,%esi > 8b1: 29 f0 sub%esi,%eax > 8b3: 5d pop%rbp > 8b4: c3 retq > > vs Tom's asm > > 084e : > 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi > 854: 8d 87 f7 be ad de lea-0x21524109(%rdi),%eax > 85a: 55 push %rbp > 85b: 89 f2 mov%esi,%edx > 85d: 48 ff 05 00 00 00 00incq 
0x0(%rip)# 864 > > 864: 48 ff 05 00 00 00 00incq 0x0(%rip)# 86b > > 86b: 81 f2 f7 be ad de xor$0xdeadbef7,%edx > 871: 48 89 e5mov%rsp,%rbp > 874: c1 c1 0erol$0xe,%ecx > 877: 29 ca sub%ecx,%edx > 879: 31 d0 xor%edx,%eax > 87b: c1 c7 0brol$0xb,%edi > 87e: 29 f8 sub%edi,%eax > 880: 48 ff 05 00 00 00 00incq 0x0(%rip)# 887 > > 887: 31 c6 xor%eax,%esi > 889: c1 c7 19rol$0x19,%edi > 88c: 29 fe sub%edi,%esi > 88e: 31 f2 xor%esi,%edx > 890: c1 c7 10rol$0x10,%edi > 893: 29 fa sub%edi,%edx > 895: 31 d0 xor%edx,%eax > 897: c1 c7 04rol$0x4,%edi > 89a: 29 f8 sub%edi,%eax > 89c: 31 f0 xor%esi,%eax > 89e: 29 c8 sub%ecx,%eax > 8a0: 31 d0 xor%edx,%eax > 8a2: 5d pop%rbp > 8a3: c1 c2 18rol$0x18,%edx > 8a6: 29 d0
Re: [ovs-dev] [PATCH net-next] ovs: allow nl 'flow set' to use ufid without flow key
On Wed, Mar 9, 2016 at 4:25 PM, Samuel Gauthier wrote: > Sorry, I missed that. Thank you for pointing it out. > > Although, set command is also used to reset the flow statistics, and the > action attribute seems optional. Would you find acceptable to make the key > attribute mandatory only if the action attribute is provided? > > The ovs-dpctl command could then be simplified to support ovs-dpctl --clear > mod-flow ufid: instead of having to specify the flow and action when > you just want to reset the statistics of a flow. > That is fine with me. Thanks, Pravin.
Re: [PATCH v6 net-next 0/2] tcp: Redundant Data Bundling (RDB)
On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad wrote: > > Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing > the latency for applications sending time-dependent data. > Latency-sensitive applications or services, such as online games and > remote desktop, produce traffic with thin-stream characteristics, > characterized by small packets and a relatively high ITT. By bundling > already sent data in packets with new data, RDB alleviates head-of-line > blocking by reducing the need to retransmit data segments when packets > are lost. RDB is a continuation on the work on latency improvements for > TCP in Linux, previously resulting in two thin-stream mechanisms in the > Linux kernel > (https://github.com/torvalds/linux/blob/master/Documentation/networking/tcp-thin.txt). > > The RDB implementation has been thoroughly tested, and shows > significant latency reductions when packet loss occurs[1]. The tests > show that, by imposing restrictions on the bundling rate, it can be > made not to negatively affect competing traffic in an unfair manner. > > Note: Current patch set depends on the patch "tcp: refactor struct tcp_skb_cb" > (http://patchwork.ozlabs.org/patch/510674) > > These patches have also been tested with as set of packetdrill scripts > located at > https://github.com/bendikro/packetdrill/tree/master/gtests/net/packetdrill/tests/linux/rdb > (The tests require patching packetdrill with a new socket option: > https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b7d8baf703b2c2ac1b) > > Detailed info about the RDB mechanism can be found at > http://mlab.no/blog/2015/10/redundant-data-bundling-in-tcp, as well as > in the paper "Latency and Fairness Trade-Off for Thin Streams using > Redundant Data Bundling in TCP"[2]. > > [1] http://home.ifi.uio.no/paalh/students/BendikOpstad.pdf > [2] http://home.ifi.uio.no/bendiko/rdb_fairness_tradeoff.pdf I read the paper. I think the underlying idea is neat. 
but the implementation is a little heavyweight: it requires changes on the fast path (tcp_write_xmit) and space in the skb control block, while ultimately this patch is meant for a small set of specific applications.

In my mental model (please correct me if I am wrong), losses on these thin streams would mostly resort to RTOs instead of fast recovery, due to the bursty nature of Internet losses. The HOLB comes from the RTO retransmitting only the first (tiny) unacked packet while a small amount of new data is readily available. But since Linux congestion control is packet-based, and the cwnd after loss is 1, the new data needs to wait until the 1st packet is acked, which takes another RTT. Instead, what if we only performed RDB on the (first and recurring) RTO retransmissions?

PS. I don't understand how (old) RDB can masquerade the losses by skipping DUPACKs. Perhaps an example helps. Suppose we send 4 packets and the last 3 were (s)acked. We perform RDB to send a packet that has the previous 4 payloads + 1 new byte. The sender still gets the loss information?

>
> Changes:
>
> v6 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Renamed rdb_ack_event() to tcp_rdb_ack_event() (Thanks DaveM)
>    * Minor doc changes
>
>  * tcp-Add-DPIFL-thin-stream-detection-mechanism:
>    * Minor doc changes
>
> v5 (PATCH):
>  * tcp-Add-Redundant-Data-Bundling-RDB:
>    * Removed two unnecessary EXPORT_SYMOBOLs (Thanks Eric)
>    * Renamed skb_append_data() to tcp_skb_append_data() (Thanks Eric)
>    * Fixed bugs in additions to ipv4_table (sysctl_net_ipv4.c)
>    * Merged the two if tests for max payload of RDB packet in
>      rdb_can_bundle_test()
>    * Renamed rdb_check_rtx_queue_loss() to rdb_detect_loss()
>      and restructured to reduce indentation.
>    * Improved docs
>    * Revised commit message to be more detailed.
> > * tcp-Add-DPIFL-thin-stream-detection-mechanism: >* Fixed bug in additions to ipv4_table (sysctl_net_ipv4.c) > > v4 (PATCH): > * tcp-Add-Redundant-Data-Bundling-RDB: >* Moved skb_append_data() to tcp_output.c and call this > function from tcp_collapse_retrans() as well. >* Merged functionality of create_rdb_skb() into > tcp_transmit_rdb_skb() >* Removed one parameter from rdb_can_bundle_test() > > v3 (PATCH): > * tcp-Add-Redundant-Data-Bundling-RDB: >* Changed name of sysctl variable from tcp_rdb_max_skbs to > tcp_rdb_max_packets after comment from Eric Dumazet about > not exposing internal (kernel) names like skb. >* Formatting and function docs fixes > > v2 (RFC/PATCH): > * tcp-Add-DPIFL-thin-stream-detection-mechanism: >* Change calculation in tcp_stream_is_thin_dpifl based on > feedback from Eric Dumazet. > > * tcp-Add-Redundant-Data-Bundling-RDB: >* Removed setting nonagle in do_tcp_setsockopt (TCP_RDB) > to reduce complexity as commented by Neal Cardwell. >* Cleaned up loss detection code in rdb_check_rtx_queue_loss > > v1 (RFC/PATCH) > > > Bendik Rønning Opstad (2): > tcp: Add DPIFL thin stream detection mechanism
Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: > On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert wrote: > > I took a look inlining these. > > > > #define rol32(V, X) ({ \ > > int word = V; \ > > if (__builtin_constant_p(X))\ > > asm("roll $" #X ",%[word]\n\t" \ > > : [word] "=r" (word)); \ > > else\ > > asm("roll %%cl,%[word]\n\t" \ > > : [word] "=r" (word)\ > > : "c" (X)); \ > > word; \ > > }) > > > > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... > Is gcc really not converting the rol32 calls into rotates? No, it is. The difference in the object code with the asm for instance is: (old, compiled with gcc 5.3.1) : 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi 854: 81 ef 09 41 52 21 sub$0x21524109,%edi 85a: 55 push %rbp 85b: 89 f0 mov%esi,%eax 85d: 89 f2 mov%esi,%edx 85f: 48 ff 05 00 00 00 00incq 0x0(%rip)# 866 866: c1 c2 0erol$0xe,%edx 869: 35 f7 be ad de xor$0xdeadbef7,%eax 86e: 48 89 e5mov%rsp,%rbp 871: 29 d0 sub%edx,%eax 873: 48 ff 05 00 00 00 00incq 0x0(%rip)# 87a 87a: 48 ff 05 00 00 00 00incq 0x0(%rip)# 881 881: 89 c2 mov%eax,%edx 883: 31 c7 xor%eax,%edi 885: c1 c2 0brol$0xb,%edx 888: 29 d7 sub%edx,%edi 88a: 89 fa mov%edi,%edx 88c: 31 fe xor%edi,%esi 88e: c1 ca 07ror$0x7,%edx 891: 29 d6 sub%edx,%esi 893: 89 f2 mov%esi,%edx 895: 31 f0 xor%esi,%eax 897: c1 c2 10rol$0x10,%edx 89a: 29 d0 sub%edx,%eax 89c: 89 c2 mov%eax,%edx 89e: 31 c7 xor%eax,%edi 8a0: c1 c2 04rol$0x4,%edx 8a3: 29 d7 sub%edx,%edi 8a5: 31 fe xor%edi,%esi 8a7: c1 c7 0erol$0xe,%edi 8aa: 29 fe sub%edi,%esi 8ac: 31 f0 xor%esi,%eax 8ae: c1 ce 08ror$0x8,%esi 8b1: 29 f0 sub%esi,%eax 8b3: 5d pop%rbp 8b4: c3 retq vs Tom's asm 084e : 84e: 81 ee 09 41 52 21 sub$0x21524109,%esi 854: 8d 87 f7 be ad de lea-0x21524109(%rdi),%eax 85a: 55 push %rbp 85b: 89 f2 mov%esi,%edx 85d: 48 ff 05 00 00 00 00incq 0x0(%rip)# 864 864: 48 ff 05 00 00 00 00incq 0x0(%rip)# 86b 86b: 81 f2 f7 be ad de xor$0xdeadbef7,%edx 871: 48 89 e5mov%rsp,%rbp 874: c1 c1 0erol$0xe,%ecx 877: 29 ca sub%ecx,%edx 879: 
31 d0 xor%edx,%eax 87b: c1 c7 0brol$0xb,%edi 87e: 29 f8 sub%edi,%eax 880: 48 ff 05 00 00 00 00incq 0x0(%rip)# 887 887: 31 c6 xor%eax,%esi 889: c1 c7 19rol$0x19,%edi 88c: 29 fe sub%edi,%esi 88e: 31 f2 xor%esi,%edx 890: c1 c7 10rol$0x10,%edi 893: 29 fa sub%edi,%edx 895: 31 d0 xor%edx,%eax 897: c1 c7 04rol$0x4,%edi 89a: 29 f8 sub%edi,%eax 89c: 31 f0 xor%esi,%eax 89e: 29 c8 sub%ecx,%eax 8a0: 31 d0 xor%edx,%eax 8a2: 5d pop%rbp 8a3: c1 c2 18rol$0x18,%edx 8a6: 29 d0 sub%edx,%eax 8a8: c3 retq > If we need this type of code in order to get the rotates to occur as > expected then maybe we need to look at doing arch specific versions o
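For readers following the thread: the portable C idiom below is the generic rotate pattern that modern gcc and clang recognize and compile down to a single rol/ror on x86, without inline asm. This is the textbook pattern, not necessarily the kernel's exact rol32 from bitops.h; the masking avoids undefined behavior for a shift count of 0:

```c
#include <assert.h>
#include <stdint.h>

/* Portable 32-bit rotate-left. Both shift counts are masked to 0..31,
 * so k == 0 is well defined (a naive "v >> (32 - k)" would shift by
 * 32, which is undefined behavior in C). Compilers pattern-match this
 * into one rotate instruction. */
static inline uint32_t rol32(uint32_t v, unsigned int k)
{
	k &= 31;
	return (v << k) | (v >> ((32 - k) & 31));
}
```

Whether the hand-written asm above wins in practice then comes down to register allocation around the rotates, as the two object-code dumps in this thread show, not to whether a rotate instruction is emitted at all.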
Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing
On Tue, Mar 8, 2016 at 10:42 AM, Mahesh Bandewar wrote: >> The more subsystems involves, the more struct net pointers you >> potentially need to touch, the less likely you can make it correct >> by just switching skb->dev. > > Please drop that prejudice and read the patch-set carefully. I'm > neither changing > any *net* pointers nor changing the skb->dev pointers anywhere. All I'm saying > is dev->l3_dev is what we'll use for *all* L3 processing and no need to change > skb->dev or net-ns of any device(s) involved. Please don't misinterpret me. I said _switching_, not overwriting or changing, you use dev->l3_dev to _switch_ skb->dev and/or net, this is exactly what I am complaining about. This is not how netns works.
RE: [net PATCH 1/2] e1000: Do not overestimate descriptor counts in Tx pre-check
> From: netdev-ow...@vger.kernel.org [mailto:netdev- > ow...@vger.kernel.org] On Behalf Of Alexander Duyck > Sent: Wednesday, March 2, 2016 1:16 PM > To: netdev@vger.kernel.org; jogre...@redhat.com; intel-wired- > l...@lists.osuosl.org; Kirsher, Jeffrey T ; > sassm...@redhat.com > Subject: [net PATCH 1/2] e1000: Do not overestimate descriptor counts in Tx > pre-check > > The current code path is capable of grossly overestimating the number of > descriptors needed to transmit a new frame. This specifically occurs if > the skb contains a number of 4K pages. The issue is that the logic for > determining the descriptors needed is ((S) >> (X)) + 1. When X is 12 it > means that we were indicating that we required 2 descriptors for each 4K > page when we only needed one. > > This change corrects this by instead adding (1 << (X)) - 1 to the S value > instead of adding 1 after the fact. This way we get an accurate descriptor > needed count as we are essentially doing a DIV_ROUNDUP(). > > Reported-by: Ivan Suzdal > Signed-off-by: Alexander Duyck > --- > drivers/net/ethernet/intel/e1000/e1000_main.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Tested-by: Aaron Brown
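The arithmetic being fixed is easy to check in isolation. Assuming the 4 KiB-per-descriptor limit (a shift of 12) described in the commit message — the function names here are illustrative, not the driver's:

```c
#include <assert.h>

#define TXD_SHIFT 12	/* max bytes per descriptor = 1 << 12 = 4096 */

/* old formula: ((S) >> (X)) + 1 — over-counts exact multiples of 4K */
unsigned int txd_old(unsigned int len)
{
	return (len >> TXD_SHIFT) + 1;
}

/* fixed formula: add (1 << X) - 1 before shifting, i.e. DIV_ROUND_UP */
unsigned int txd_new(unsigned int len)
{
	return (len + (1u << TXD_SHIFT) - 1) >> TXD_SHIFT;
}
```

A single 4096-byte page is the case the commit message calls out: the old formula claims two descriptors where one suffices.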
RE: [Intel-wired-lan] [net PATCH 2/2] e1000: Double Tx descriptors needed check for 82544
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On > Behalf Of Alexander Duyck > Sent: Wednesday, March 2, 2016 1:16 PM > To: netdev@vger.kernel.org; jogre...@redhat.com; intel-wired- > l...@lists.osuosl.org; Kirsher, Jeffrey T ; > sassm...@redhat.com > Subject: [Intel-wired-lan] [net PATCH 2/2] e1000: Double Tx descriptors > needed check for 82544 > > The 82544 has code that adds one additional descriptor per data buffer. > However we weren't taking that into acount when determining the > descriptors > needed for the next transmit at the end of the xmit_frame path. > > This change takes that into account by doubling the number of descriptors > needed for the 82544 so that we can avoid a potential issue where we could > hang the Tx ring by loading frames with xmit_more enabled and then > stopping > the ring without writing the tail. > > In addition it adds a few more descriptors to account for some additional > workarounds that have been added over time. > > Signed-off-by: Alexander Duyck > --- > drivers/net/ethernet/intel/e1000/e1000_main.c | 19 > ++- > 1 file changed, 18 insertions(+), 1 deletion(-) Tested-by: Aaron Brown
Re: [PATCH] rxrpc: Replace all unsigned with unsigned int
David Howells wrote: > Replace all "unsigned" types with "unsigned int" types. > > Reported-by: David Miller > Signed-off-by: David Howells This is aimed at net-next. David
[PATCH] rxrpc: Replace all unsigned with unsigned int
Replace all "unsigned" types with "unsigned int" types. Reported-by: David Miller Signed-off-by: David Howells --- net/rxrpc/af_rxrpc.c |2 +- net/rxrpc/ar-ack.c| 12 ++-- net/rxrpc/ar-call.c |4 ++-- net/rxrpc/ar-connection.c |2 +- net/rxrpc/ar-internal.h | 22 +++--- net/rxrpc/ar-output.c |2 +- net/rxrpc/ar-transport.c |2 +- net/rxrpc/sysctl.c| 32 8 files changed, 39 insertions(+), 39 deletions(-) diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index a76501757b59..9d935fa5a2a9 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -81,7 +81,7 @@ static int rxrpc_validate_address(struct rxrpc_sock *rx, struct sockaddr_rxrpc *srx, int len) { - unsigned tail; + unsigned int tail; if (len < sizeof(struct sockaddr_rxrpc)) return -EINVAL; diff --git a/net/rxrpc/ar-ack.c b/net/rxrpc/ar-ack.c index 20f3f001694e..16d967075eaf 100644 --- a/net/rxrpc/ar-ack.c +++ b/net/rxrpc/ar-ack.c @@ -23,7 +23,7 @@ * How long to wait before scheduling ACK generation after seeing a * packet with RXRPC_REQUEST_ACK set (in jiffies). */ -unsigned rxrpc_requested_ack_delay = 1; +unsigned int rxrpc_requested_ack_delay = 1; /* * How long to wait before scheduling an ACK with subtype DELAY (in jiffies). @@ -32,7 +32,7 @@ unsigned rxrpc_requested_ack_delay = 1; * all consumed within this time we will send a DELAY ACK if an ACK was not * requested to let the sender know it doesn't need to resend. */ -unsigned rxrpc_soft_ack_delay = 1 * HZ; +unsigned int rxrpc_soft_ack_delay = 1 * HZ; /* * How long to wait before scheduling an ACK with subtype IDLE (in jiffies). @@ -41,7 +41,7 @@ unsigned rxrpc_soft_ack_delay = 1 * HZ; * further packets aren't immediately received to decide when to send an IDLE * ACK let the other end know that it can free up its Tx buffer space. */ -unsigned rxrpc_idle_ack_delay = 0.5 * HZ; +unsigned int rxrpc_idle_ack_delay = 0.5 * HZ; /* * Receive window size in packets. 
This indicates the maximum number of @@ -49,19 +49,19 @@ unsigned rxrpc_idle_ack_delay = 0.5 * HZ; * limit is hit, we should generate an EXCEEDS_WINDOW ACK and discard further * packets. */ -unsigned rxrpc_rx_window_size = 32; +unsigned int rxrpc_rx_window_size = 32; /* * Maximum Rx MTU size. This indicates to the sender the size of jumbo packet * made by gluing normal packets together that we're willing to handle. */ -unsigned rxrpc_rx_mtu = 5692; +unsigned int rxrpc_rx_mtu = 5692; /* * The maximum number of fragments in a received jumbo packet that we tell the * sender that we're willing to handle. */ -unsigned rxrpc_rx_jumbo_max = 4; +unsigned int rxrpc_rx_jumbo_max = 4; static const char *rxrpc_acks(u8 reason) { diff --git a/net/rxrpc/ar-call.c b/net/rxrpc/ar-call.c index 4a499e0100f1..7c8d300ade9b 100644 --- a/net/rxrpc/ar-call.c +++ b/net/rxrpc/ar-call.c @@ -21,12 +21,12 @@ /* * Maximum lifetime of a call (in jiffies). */ -unsigned rxrpc_max_call_lifetime = 60 * HZ; +unsigned int rxrpc_max_call_lifetime = 60 * HZ; /* * Time till dead call expires after last use (in jiffies). */ -unsigned rxrpc_dead_call_expiry = 2 * HZ; +unsigned int rxrpc_dead_call_expiry = 2 * HZ; const char *const rxrpc_call_states[NR__RXRPC_CALL_STATES] = { [RXRPC_CALL_CLIENT_SEND_REQUEST]= "ClSndReq", diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c index 53df14cb8d25..9942da1edbf6 100644 --- a/net/rxrpc/ar-connection.c +++ b/net/rxrpc/ar-connection.c @@ -21,7 +21,7 @@ /* * Time till a connection expires after last use (in seconds). 
*/ -unsigned rxrpc_connection_expiry = 10 * 60; +unsigned int rxrpc_connection_expiry = 10 * 60; static void rxrpc_connection_reaper(struct work_struct *work); diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 8b495aed517d..a3002f4ddc90 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -478,12 +478,12 @@ int rxrpc_reject_call(struct rxrpc_sock *); /* * ar-ack.c */ -extern unsigned rxrpc_requested_ack_delay; -extern unsigned rxrpc_soft_ack_delay; -extern unsigned rxrpc_idle_ack_delay; -extern unsigned rxrpc_rx_window_size; -extern unsigned rxrpc_rx_mtu; -extern unsigned rxrpc_rx_jumbo_max; +extern unsigned int rxrpc_requested_ack_delay; +extern unsigned int rxrpc_soft_ack_delay; +extern unsigned int rxrpc_idle_ack_delay; +extern unsigned int rxrpc_rx_window_size; +extern unsigned int rxrpc_rx_mtu; +extern unsigned int rxrpc_rx_jumbo_max; void __rxrpc_propose_ACK(struct rxrpc_call *, u8, u32, bool); void rxrpc_propose_ACK(struct rxrpc_call *, u8, u32, bool); @@ -492,8 +492,8 @@ void rxrpc_process_call(struct work_struct *); /* * ar-call.c */ -extern unsigned rxrpc_max_call_lifetime; -extern unsigned rxrpc_dead_call_expiry; +extern
Re: [PATCH nf-next v8 3/8] openvswitch: Add commentary to conntrack.c
Sergei, Just found this from my junk mail box, sorry. Will fix spelling, but I just sent v9 and will wait for other reviews before re-posting. Jarno > On Mar 9, 2016, at 5:50 AM, Sergei Shtylyov > wrote: > > Hello. > > On 3/9/2016 3:24 AM, Jarno Rajahalme wrote: > >> This makes the code easier to understand and the following patches >> more focused. >> >> Signed-off-by: Jarno Rajahalme >> --- >> net/openvswitch/conntrack.c | 21 - >> 1 file changed, 20 insertions(+), 1 deletion(-) >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c >> index 3045290..8cd0110 100644 >> --- a/net/openvswitch/conntrack.c >> +++ b/net/openvswitch/conntrack.c > [...] >> @@ -418,6 +429,13 @@ static int ovs_ct_lookup(struct net *net, struct >> sw_flow_key *key, >> { >> struct nf_conntrack_expect *exp; >> >> +/* If we pass an expected packet through nf_conntrack_in() the >> + * expectiation is typically removed, but the packet could still be > > Expectation. > [...] > > MBR, Sergei >
Re: [PATCH net-next v3] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
On Wed, Mar 9, 2016 at 10:43 AM, Martin KaFai Lau wrote: > > Per RFC4898, they count segments sent/received > containing a positive length data segment (that includes > retransmission segments carrying data). Unlike > tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments > carrying no data (e.g. pure ack). > > The patch also updates the segs_in in tcp_fastopen_add_skb() > so that segs_in >= data_segs_in property is kept. > > Together with retransmission data, tcpi_data_segs_out > gives a better signal on the rxmit rate. > > v3: Add const modifier to the skb parameter in tcp_segs_in() > > v2: Rework based on recent fix by Eric: > commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment") > > Signed-off-by: Martin KaFai Lau > Cc: Chris Rapier > Cc: Eric Dumazet > Cc: Marcelo Ricardo Leitner > Cc: Neal Cardwell > Cc: Yuchung Cheng > --- > include/linux/tcp.h | 6 ++ > include/net/tcp.h| 10 ++ > include/uapi/linux/tcp.h | 2 ++ > net/ipv4/tcp.c | 2 ++ > net/ipv4/tcp_fastopen.c | 4 > net/ipv4/tcp_ipv4.c | 2 +- > net/ipv4/tcp_minisocks.c | 2 +- > net/ipv4/tcp_output.c| 4 +++- > net/ipv6/tcp_ipv6.c | 2 +- > 9 files changed, 30 insertions(+), 4 deletions(-) > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index bcbf51d..7be9b12 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -158,6 +158,9 @@ struct tcp_sock { > u32 segs_in;/* RFC4898 tcpEStatsPerfSegsIn > * total number of segments in. > */ > + u32 data_segs_in; /* RFC4898 tcpEStatsPerfDataSegsIn > +* total number of data segments in. > +*/ > u32 rcv_nxt;/* What we want to receive next */ > u32 copied_seq; /* Head of yet unread data */ > u32 rcv_wup;/* rcv_nxt on last window update sent */ > @@ -165,6 +168,9 @@ struct tcp_sock { > u32 segs_out; /* RFC4898 tcpEStatsPerfSegsOut > * The total number of segments sent. > */ > + u32 data_segs_out; /* RFC4898 tcpEStatsPerfDataSegsOut > +* total number of data segments sent. 
> +*/ > u64 bytes_acked;/* RFC4898 tcpEStatsAppHCThruOctetsAcked > * sum(delta(snd_una)), or how many bytes > * were acked. > diff --git a/include/net/tcp.h b/include/net/tcp.h > index e90db85..24557a8 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff > *skb) > skb->truesize = 2; > } > > +static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff > *skb) > +{ > + u16 segs_in; > + > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > + tp->segs_in += segs_in; > + if (skb->len > tcp_hdrlen(skb)) > + tp->data_segs_in += segs_in; > +} > + > #endif /* _TCP_H */ > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index fe95446..53e8e3f 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -199,6 +199,8 @@ struct tcp_info { > > __u32 tcpi_notsent_bytes; > __u32 tcpi_min_rtt; > + __u32 tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */ > + __u32 tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */ > }; > > /* for TCP_MD5SIG socket option */ > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index f9faadb..6b01b48 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2728,6 +2728,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info > *info) > info->tcpi_notsent_bytes = max(0, notsent_bytes); > > info->tcpi_min_rtt = tcp_min_rtt(tp); > + info->tcpi_data_segs_in = tp->data_segs_in; > + info->tcpi_data_segs_out = tp->data_segs_out; > } > EXPORT_SYMBOL_GPL(tcp_get_info); > > diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c > index fdb286d..f583c85 100644 > --- a/net/ipv4/tcp_fastopen.c > +++ b/net/ipv4/tcp_fastopen.c > @@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock > *req, > void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) > { > struct tcp_sock *tp = tcp_sk(sk); > + u16 segs_in; > > if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt) > return; > @@ -154,6 +155,9 @@ void 
tcp_fastopen_add_skb(struct sock *sk, struct sk_buff > *skb) > * as we certainly are not changing upper 32bit value (0) > */ > tp->bytes_received = skb->len; > + segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs); > + tp->segs_in = segs_in; > + tp->data_segs_in = segs_in; why not
[PATCH nf-next v9 4/8] openvswitch: Update the CT state key only after nf_conntrack_in().
Only a successful nf_conntrack_in() call can effect a connection state change, so it suffices to update the key only after nf_conntrack_in() returns. This change is needed for the later NAT patches. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 8cd0110..da804d3 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -382,7 +382,8 @@ static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb, } /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if - * not done already. Update key with new CT state. + * not done already. Update key with new CT state after passing the packet + * through conntrack. * Note that if the packet is deemed invalid by conntrack, skb->nfct will be * set to NULL and 0 will be returned. */ @@ -411,14 +412,14 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, skb) != NF_ACCEPT) return -ENOENT; + ovs_ct_update_key(skb, info, key, true); + if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) { WARN_ONCE(1, "helper rejected packet"); return -EINVAL; } } - ovs_ct_update_key(skb, info, key, true); - return 0; } -- 2.1.4
[PATCH nf-next v9 6/8] openvswitch: Handle NF_REPEAT in conntrack action.
Repeat the nf_conntrack_in() call when it returns NF_REPEAT. This avoids dropping a SYN packet re-opening an existing TCP connection. Signed-off-by: Jarno Rajahalme Acked-by: Joe Stringer --- net/openvswitch/conntrack.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index b838536..92613de 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -485,6 +485,7 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, */ if (!skb_nfct_cached(net, key, info, skb)) { struct nf_conn *tmpl = info->ct; + int err; /* Associate skb with specified zone. */ if (tmpl) { @@ -495,8 +496,13 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, skb->nfctinfo = IP_CT_NEW; } - if (nf_conntrack_in(net, info->family, NF_INET_PRE_ROUTING, - skb) != NF_ACCEPT) + /* Repeat if requested, see nf_iterate(). */ + do { + err = nf_conntrack_in(net, info->family, + NF_INET_PRE_ROUTING, skb); + } while (err == NF_REPEAT); + + if (err != NF_ACCEPT) return -ENOENT; ovs_ct_update_key(skb, info, key, true); -- 2.1.4
Re: [PATCH nf-next v8 8/8] openvswitch: Interface with NAT.
Just sent a v9 addressing these dependency issues. Jarno > On Mar 9, 2016, at 12:04 AM, kbuild test robot wrote: > > Hi Jarno, > > [auto build test ERROR on nf-next/master] > > url: > https://github.com/0day-ci/linux/commits/Jarno-Rajahalme/netfilter-Remove-IP_CT_NEW_REPLY-definition/20160309-083126 > base: https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next master > config: i386-randconfig-x0-03091344 (attached as .config) > reproduce: ># save the attached .config to linux build tree >make ARCH=i386 > > All errors (new ones prefixed by >>): > > net/built-in.o: In function `__ovs_ct_lookup': >>> conntrack.c:(.text+0x227267): undefined reference to `nf_ct_nat_ext_add' >>> conntrack.c:(.text+0x227404): undefined reference to >>> `nf_nat_icmp_reply_translation' > conntrack.c:(.text+0x2274ab): undefined reference to > `nf_nat_icmpv6_reply_translation' >>> conntrack.c:(.text+0x227585): undefined reference to `nf_nat_setup_info' >>> conntrack.c:(.text+0x2275a1): undefined reference to >>> `nf_nat_alloc_null_binding' >>> conntrack.c:(.text+0x2275d2): undefined reference to `nf_nat_packet' > > --- > 0-DAY kernel test infrastructureOpen Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation > <.config.gz>
[PATCH nf-next v9 8/8] openvswitch: Interface with NAT.
Extend OVS conntrack interface to cover NAT. New nested OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action. A bare OVS_CT_ATTR_NAT only mangles existing and expected connections. If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested attributes, new (non-committed/non-confirmed) connections are mangled according to the rest of the nested attributes. The corresponding OVS userspace patch series includes test cases (in tests/system-traffic.at) that also serve as example uses. This work extends on a branch by Thomas Graf at https://github.com/tgraf/ovs/tree/nat. Signed-off-by: Jarno Rajahalme --- v9: Fixed module dependencies. include/uapi/linux/openvswitch.h | 49 net/openvswitch/Kconfig | 3 +- net/openvswitch/conntrack.c | 523 +-- net/openvswitch/conntrack.h | 3 +- 4 files changed, 551 insertions(+), 27 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index a27222d..616d047 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -454,6 +454,14 @@ struct ovs_key_ct_labels { #define OVS_CS_F_REPLY_DIR 0x08 /* Flow is in the reply direction. */ #define OVS_CS_F_INVALID 0x10 /* Could not track connection. */ #define OVS_CS_F_TRACKED 0x20 /* Conntrack has occurred. */ +#define OVS_CS_F_SRC_NAT 0x40 /* Packet's source address/port was +* mangled by NAT. +*/ +#define OVS_CS_F_DST_NAT 0x80 /* Packet's destination address/port +* was mangled by NAT. +*/ + +#define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT) /** * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands. @@ -632,6 +640,8 @@ struct ovs_action_hash { * mask. For each bit set in the mask, the corresponding bit in the value is * copied to the connection tracking label field in the connection. * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG. + * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address + * translation (NAT) on the packet. 
*/ enum ovs_ct_attr { OVS_CT_ATTR_UNSPEC, @@ -641,12 +651,51 @@ enum ovs_ct_attr { OVS_CT_ATTR_LABELS, /* labels to associate with this connection. */ OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of related connections. */ + OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */ __OVS_CT_ATTR_MAX }; #define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1) /** + * enum ovs_nat_attr - Attributes for %OVS_CT_ATTR_NAT. + * + * @OVS_NAT_ATTR_SRC: Flag for Source NAT (mangle source address/port). + * @OVS_NAT_ATTR_DST: Flag for Destination NAT (mangle destination + * address/port). Only one of (@OVS_NAT_ATTR_SRC, @OVS_NAT_ATTR_DST) may be + * specified. Effective only for packets for ct_state NEW connections. + * Packets of committed connections are mangled by the NAT action according to + * the committed NAT type regardless of the flags specified. As a corollary, a + * NAT action without a NAT type flag will only mangle packets of committed + * connections. The following NAT attributes only apply for NEW + * (non-committed) connections, and they may be included only when the CT + * action has the @OVS_CT_ATTR_COMMIT flag and either @OVS_NAT_ATTR_SRC or + * @OVS_NAT_ATTR_DST is also included. 
+ * @OVS_NAT_ATTR_IP_MIN: struct in_addr or struct in6_addr + * @OVS_NAT_ATTR_IP_MAX: struct in_addr or struct in6_addr + * @OVS_NAT_ATTR_PROTO_MIN: u16 L4 protocol specific lower boundary (port) + * @OVS_NAT_ATTR_PROTO_MAX: u16 L4 protocol specific upper boundary (port) + * @OVS_NAT_ATTR_PERSISTENT: Flag for persistent IP mapping across reboots + * @OVS_NAT_ATTR_PROTO_HASH: Flag for pseudo random L4 port mapping (MD5) + * @OVS_NAT_ATTR_PROTO_RANDOM: Flag for fully randomized L4 port mapping + */ +enum ovs_nat_attr { + OVS_NAT_ATTR_UNSPEC, + OVS_NAT_ATTR_SRC, + OVS_NAT_ATTR_DST, + OVS_NAT_ATTR_IP_MIN, + OVS_NAT_ATTR_IP_MAX, + OVS_NAT_ATTR_PROTO_MIN, + OVS_NAT_ATTR_PROTO_MAX, + OVS_NAT_ATTR_PERSISTENT, + OVS_NAT_ATTR_PROTO_HASH, + OVS_NAT_ATTR_PROTO_RANDOM, + __OVS_NAT_ATTR_MAX, +}; + +#define OVS_NAT_ATTR_MAX (__OVS_NAT_ATTR_MAX - 1) + +/** * enum ovs_action_attr - Action types. * * @OVS_ACTION_ATTR_OUTPUT: Output packet to port. diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig index cd5fd9d..23471a4 100644 --- a/net/openvswitch/Kconfig +++ b/net/openvswitch/Kconfig @@ -6,7 +6,8 @@ config OPENVSWITCH tristate "Open vSwitch" depends on INET depends on !NF_CONNTRACK || \ - (NF_CONNTRACK && (!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6)) + (NF_CONNTRACK && ((!NF_DEFRA
[PATCH nf-next v9 7/8] openvswitch: Delay conntrack helper call for new connections.
There is no need to help connections that are not confirmed, so we can delay helping new connections to the time when they are confirmed. This change is needed for NAT support, and having this as a separate patch will make the following NAT patch a bit easier to review. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 92613de..5711f80 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -483,7 +483,11 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, * actually run the packet through conntrack twice unless it's for a * different zone. */ - if (!skb_nfct_cached(net, key, info, skb)) { + bool cached = skb_nfct_cached(net, key, info, skb); + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; + + if (!cached) { struct nf_conn *tmpl = info->ct; int err; @@ -506,11 +510,17 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, return -ENOENT; ovs_ct_update_key(skb, info, key, true); + } - if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) { - WARN_ONCE(1, "helper rejected packet"); - return -EINVAL; - } + /* Call the helper only if we did nf_conntrack_in() above ('!cached') +* for confirmed connections, but only when committing for unconfirmed +* connections. +*/ + ct = nf_ct_get(skb, &ctinfo); + if (ct && (nf_ct_is_confirmed(ct) ? !cached : info->commit) && + ovs_ct_helper(skb, info->family) != NF_ACCEPT) { + WARN_ONCE(1, "helper rejected packet"); + return -EINVAL; } return 0; -- 2.1.4
[PATCH nf-next v9 3/8] openvswitch: Add commentary to conntrack.c
This makes the code easier to understand and the following patches more focused. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 21 - 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 3045290..8cd0110 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -152,8 +152,12 @@ static void ovs_ct_update_key(const struct sk_buff *skb, ct = nf_ct_get(skb, &ctinfo); if (ct) { state = ovs_ct_get_state(ctinfo); + /* All unconfirmed entries are NEW connections. */ if (!nf_ct_is_confirmed(ct)) state |= OVS_CS_F_NEW; + /* OVS persists the related flag for the duration of the +* connection. +*/ if (ct->master) state |= OVS_CS_F_RELATED; zone = nf_ct_zone(ct); @@ -165,6 +169,9 @@ static void ovs_ct_update_key(const struct sk_buff *skb, __ovs_ct_update_key(key, state, zone, ct); } +/* This is called to initialize CT key fields possibly coming in from the local + * stack. + */ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key) { ovs_ct_update_key(skb, NULL, key, false); @@ -199,7 +206,6 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key, struct nf_conn *ct; u32 new_mark; - /* The connection could be invalid, in which case set_mark is no-op. */ ct = nf_ct_get(skb, &ctinfo); if (!ct) @@ -375,6 +381,11 @@ static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb, return true; } +/* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if + * not done already. Update key with new CT state. + * Note that if the packet is deemed invalid by conntrack, skb->nfct will be + * set to NULL and 0 will be returned. 
+ */ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, const struct ovs_conntrack_info *info, struct sk_buff *skb) @@ -418,6 +429,13 @@ static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key, { struct nf_conntrack_expect *exp; + /* If we pass an expected packet through nf_conntrack_in() the +* expectiation is typically removed, but the packet could still be +* lost in upcall processing. To prevent this from happening we +* perform an explicit expectation lookup. Expected connections are +* always new, and will be passed through conntrack only when they are +* committed, as it is OK to remove the expectation at that time. +*/ exp = ovs_ct_expect_find(net, &info->zone, info->family, skb); if (exp) { u8 state; @@ -455,6 +473,7 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, err = __ovs_ct_lookup(net, key, info, skb); if (err) return err; + /* This is a no-op if the connection has already been confirmed. */ if (nf_conntrack_confirm(skb) != NF_ACCEPT) return -EINVAL; -- 2.1.4
[PATCH nf-next v9 5/8] openvswitch: Find existing conntrack entry after upcall.
Add a new function ovs_ct_find_existing() to find an existing conntrack entry to which this packet was already applied. This is only to be called when there is evidence that the packet was already tracked and committed, but we lost the ct reference due to a userspace upcall. ovs_ct_find_existing() is called from skb_nfct_cached(), which can now hide the fact that the ct reference may have been lost due to an upcall. This allows ovs_ct_commit() to be simplified. This patch is needed by the later "openvswitch: Interface with NAT" patch, as we need to be able to pass the packet through NAT using the original ct reference also after the reference is lost after an upcall. Signed-off-by: Jarno Rajahalme --- net/openvswitch/conntrack.c | 103 ++-- 1 file changed, 90 insertions(+), 13 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index da804d3..b838536 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -356,14 +356,101 @@ ovs_ct_expect_find(struct net *net, const struct nf_conntrack_zone *zone, return __nf_ct_expect_find(net, zone, &tuple); } +/* This replicates logic from nf_conntrack_core.c that is not exported. */ +static enum ip_conntrack_info +ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h) +{ + const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); + + if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY) + return IP_CT_ESTABLISHED_REPLY; + /* Once we've had two way comms, always ESTABLISHED. */ + if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) + return IP_CT_ESTABLISHED; + if (test_bit(IPS_EXPECTED_BIT, &ct->status)) + return IP_CT_RELATED; + return IP_CT_NEW; +} + +/* Find an existing connection which this packet belongs to without + * re-attributing statistics or modifying the connection state. This allows an + * skb->nfct lost due to an upcall to be recovered during actions execution. + * + * Must be called with rcu_read_lock. 
+ * + * On success, populates skb->nfct and skb->nfctinfo, and returns the + * connection. Returns NULL if there is no existing entry. + */ +static struct nf_conn * +ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone, +u8 l3num, struct sk_buff *skb) +{ + struct nf_conntrack_l3proto *l3proto; + struct nf_conntrack_l4proto *l4proto; + struct nf_conntrack_tuple tuple; + struct nf_conntrack_tuple_hash *h; + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; + unsigned int dataoff; + u8 protonum; + + l3proto = __nf_ct_l3proto_find(l3num); + if (!l3proto) { + pr_debug("ovs_ct_find_existing: Can't get l3proto\n"); + return NULL; + } + if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, +&protonum) <= 0) { + pr_debug("ovs_ct_find_existing: Can't get protonum\n"); + return NULL; + } + l4proto = __nf_ct_l4proto_find(l3num, protonum); + if (!l4proto) { + pr_debug("ovs_ct_find_existing: Can't get l4proto\n"); + return NULL; + } + if (!nf_ct_get_tuple(skb, skb_network_offset(skb), dataoff, l3num, +protonum, net, &tuple, l3proto, l4proto)) { + pr_debug("ovs_ct_find_existing: Can't get tuple\n"); + return NULL; + } + + /* look for tuple match */ + h = nf_conntrack_find_get(net, zone, &tuple); + if (!h) + return NULL; /* Not found. */ + + ct = nf_ct_tuplehash_to_ctrack(h); + + ctinfo = ovs_ct_get_info(h); + if (ctinfo == IP_CT_NEW) { + /* This should not happen. */ + WARN_ONCE(1, "ovs_ct_find_existing: new packet for %p\n", ct); + } + skb->nfct = &ct->ct_general; + skb->nfctinfo = ctinfo; + return ct; +} + /* Determine whether skb->nfct is equal to the result of conntrack lookup. 
*/ -static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb, - const struct ovs_conntrack_info *info) +static bool skb_nfct_cached(struct net *net, + const struct sw_flow_key *key, + const struct ovs_conntrack_info *info, + struct sk_buff *skb) { enum ip_conntrack_info ctinfo; struct nf_conn *ct; ct = nf_ct_get(skb, &ctinfo); + /* If no ct, check if we have evidence that an existing conntrack entry +* might be found for this skb. This happens when we lose a skb->nfct +* due to an upcall. If the connection was not confirmed, it is not +* cached and needs to be run through conntrack again. +*/ + if (!ct && key->ct.state & OVS_CS_F_TRACKED && + !(key->ct.state & OVS_CS_F_INVALID) && +
[PATCH nf-next v9 1/8] netfilter: Remove IP_CT_NEW_REPLY definition.
Remove the definition of IP_CT_NEW_REPLY from the kernel as it does not make sense. This allows the definition of IP_CT_NUMBER to be simplified as well. Signed-off-by: Jarno Rajahalme --- include/uapi/linux/netfilter/nf_conntrack_common.h | 12 +--- net/openvswitch/conntrack.c| 2 -- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h index 319f471..6d074d1 100644 --- a/include/uapi/linux/netfilter/nf_conntrack_common.h +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h @@ -20,9 +20,15 @@ enum ip_conntrack_info { IP_CT_ESTABLISHED_REPLY = IP_CT_ESTABLISHED + IP_CT_IS_REPLY, IP_CT_RELATED_REPLY = IP_CT_RELATED + IP_CT_IS_REPLY, - IP_CT_NEW_REPLY = IP_CT_NEW + IP_CT_IS_REPLY, - /* Number of distinct IP_CT types (no NEW in reply dirn). */ - IP_CT_NUMBER = IP_CT_IS_REPLY * 2 - 1 + /* No NEW in reply direction. */ + + /* Number of distinct IP_CT types. */ + IP_CT_NUMBER, + + /* only for userspace compatibility */ +#ifndef __KERNEL__ + IP_CT_NEW_REPLY = IP_CT_NUMBER, +#endif }; #define NF_CT_STATE_INVALID_BIT(1 << 0) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index ee6ff8f..3045290 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -75,7 +75,6 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo) switch (ctinfo) { case IP_CT_ESTABLISHED_REPLY: case IP_CT_RELATED_REPLY: - case IP_CT_NEW_REPLY: ct_state |= OVS_CS_F_REPLY_DIR; break; default: @@ -92,7 +91,6 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo) ct_state |= OVS_CS_F_RELATED; break; case IP_CT_NEW: - case IP_CT_NEW_REPLY: ct_state |= OVS_CS_F_NEW; break; default: -- 2.1.4
[PATCH nf-next v9 2/8] netfilter: Allow calling into nat helper without skb_dst.
NAT checksum recalculation code assumes existence of skb_dst, which becomes a problem for a later patch in the series ("openvswitch: Interface with NAT."). Simplify this by removing the check on skb_dst, as the checksum will be dealt with later in the stack. Suggested-by: Pravin Shelar Signed-off-by: Jarno Rajahalme --- net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 30 -- net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 30 -- 2 files changed, 16 insertions(+), 44 deletions(-) diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c index 61c7cc2..f8aad03 100644 --- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c +++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c @@ -127,29 +127,15 @@ static void nf_nat_ipv4_csum_recalc(struct sk_buff *skb, u8 proto, void *data, __sum16 *check, int datalen, int oldlen) { - const struct iphdr *iph = ip_hdr(skb); - struct rtable *rt = skb_rtable(skb); - if (skb->ip_summed != CHECKSUM_PARTIAL) { - if (!(rt->rt_flags & RTCF_LOCAL) && - (!skb->dev || skb->dev->features & -(NETIF_F_IP_CSUM | NETIF_F_HW_CSUM))) { - skb->ip_summed = CHECKSUM_PARTIAL; - skb->csum_start = skb_headroom(skb) + - skb_network_offset(skb) + - ip_hdrlen(skb); - skb->csum_offset = (void *)check - data; - *check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, - datalen, proto, 0); - } else { - *check = 0; - *check = csum_tcpudp_magic(iph->saddr, iph->daddr, - datalen, proto, - csum_partial(data, datalen, - 0)); - if (proto == IPPROTO_UDP && !*check) - *check = CSUM_MANGLED_0; - } + const struct iphdr *iph = ip_hdr(skb); + + skb->ip_summed = CHECKSUM_PARTIAL; + skb->csum_start = skb_headroom(skb) + skb_network_offset(skb) + + ip_hdrlen(skb); + skb->csum_offset = (void *)check - data; + *check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen, + proto, 0); } else inet_proto_csum_replace2(check, skb, htons(oldlen), htons(datalen), true); diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c index 
6ce3099..e0be97e 100644 --- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c +++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c @@ -131,29 +131,15 @@ static void nf_nat_ipv6_csum_recalc(struct sk_buff *skb, u8 proto, void *data, __sum16 *check, int datalen, int oldlen) { - const struct ipv6hdr *ipv6h = ipv6_hdr(skb); - struct rt6_info *rt = (struct rt6_info *)skb_dst(skb); - if (skb->ip_summed != CHECKSUM_PARTIAL) { - if (!(rt->rt6i_flags & RTF_LOCAL) && - (!skb->dev || skb->dev->features & -(NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))) { - skb->ip_summed = CHECKSUM_PARTIAL; - skb->csum_start = skb_headroom(skb) + - skb_network_offset(skb) + - (data - (void *)skb->data); - skb->csum_offset = (void *)check - data; - *check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, - datalen, proto, 0); - } else { - *check = 0; - *check = csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, -datalen, proto, -csum_partial(data, datalen, - 0)); - if (proto == IPPROTO_UDP && !*check) - *check = CSUM_MANGLED_0; - } + const struct ipv6hdr *ipv6h = ipv6_hdr(skb); + + skb->ip_summed = CHECKSUM_PARTIAL; + skb->csum_start = skb_headroom(skb) + skb_network_offset(skb) + + (data - (void *)skb->data); + skb->csum_offset = (void *)check - data; + *check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, + datalen, proto, 0); } else
Re: [RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Hi Ido, Ido Schimmel writes: > Wed, Mar 09, 2016 at 07:42:47PM IST, vivien.dide...@savoirfairelinux.com > wrote: >>Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is >>set before adding a port to a bridge and deleting a port from a bridge. >> >>The main purpose for this attribute is to provide switchdev users a >>simple and common way to retrieve bridging information, instead of >>implementing complex notifier blocks to listen to global netdev events. >> >>We can also imagine a switchdev user returning an error different from >>-EOPNOTSUPP in the prepare phase to prevent a port from being bridged. > > I don't really understand the motivation for this change. We are already > doing all these stuff with the notifiers and it's pretty > straight-forward. > > In fact, I believe using an existing mechanism instead of introducing > more switchdev hooks is more elegant. This RFC only deals with bridge, > but you'll have to do the same for team, bond and vlan devices. And > you'll probably place the hooks in the exact locations where the > notifiers are called from anyway. 
> >> >>Signed-off-by: Vivien Didelot >>--- >> include/net/switchdev.h | 2 ++ >> net/bridge/br_if.c | 27 +++ >> 2 files changed, 29 insertions(+) >> >>diff --git a/include/net/switchdev.h b/include/net/switchdev.h >>index d451122..65f8514 100644 >>--- a/include/net/switchdev.h >>+++ b/include/net/switchdev.h >>@@ -46,6 +46,7 @@ enum switchdev_attr_id { >> SWITCHDEV_ATTR_ID_PORT_PARENT_ID, >> SWITCHDEV_ATTR_ID_PORT_STP_STATE, >> SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS, >>+ SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, >> SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME, >> SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING, >> }; >>@@ -58,6 +59,7 @@ struct switchdev_attr { >> struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */ >> u8 stp_state; /* PORT_STP_STATE */ >> unsigned long brport_flags; /* PORT_BRIDGE_FLAGS */ >>+ bool join; /* PORT_BRIDGE_IF */ >> u32 ageing_time;/* BRIDGE_AGEING_TIME */ >> bool vlan_filtering;/* >> BRIDGE_VLAN_FILTERING */ >> } u; >>diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c >>index a73df33..105b9fd 100644 >>--- a/net/bridge/br_if.c >>+++ b/net/bridge/br_if.c >>@@ -28,6 +28,24 @@ >> >> #include "br_private.h" >> >>+static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br, >>+bool join) >>+{ >>+ struct switchdev_attr attr = { >>+ .orig_dev = br->dev, > > This should be just 'dev', since you need to know for which stacked > device on top of the port this was called for. This also means you'll > have to call netdev_master_upper_dev_get() from within your driver if > you want to limit the number of VLAN filtering bridges (for example). > However, since this is called before bridge dev and dev itself are > linked, you'll get NULL. > >>+ .id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, >>+ .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, >>+ .u.join = join, >>+ }; >>+ int err; >>+ >>+ err = switchdev_port_attr_set(dev, &attr); >>+ if (err && err != -EOPNOTSUPP) >>+ return err; >>+ >>+ return 0; >>+} >>+ >> /* >> * Determine initial path cost based on speed. 
>> * using recommendations from 802.1d standard >>@@ -297,6 +315,10 @@ static void del_nbp(struct net_bridge_port *p) >> br_netpoll_disable(p); >> >> call_rcu(&p->rcu, destroy_nbp_rcu); >>+ >>+ if (switchdev_bridge_if(dev, br, false)) >>+ br_warn(br, "error unbridging port %u(%s)\n", >>+ (unsigned int) p->port_no, dev->name); >> } >> >> /* Delete bridge device */ >>@@ -347,6 +369,11 @@ static struct net_bridge_port *new_nbp(struct net_bridge >>*br, >> { >> int index; >> struct net_bridge_port *p; >>+ int err; >>+ >>+ err = switchdev_bridge_if(dev, br, true); > > If you look at br_add_if() - where new_nbp() is called from - then > you'll see that you aren't rollbacking this operation in case of error. > Same for subsequent errors in this function I believe. > >>+ if (err) >>+ return ERR_PTR(err); >> >> index = find_portno(br); >> if (index < 0) >>-- >>2.7.2 >> > > Maybe this is something we'll have to do in the future, but for now I > think we are OK with the notifiers. :) > > Thanks Vivien! I didn't have the big picture for team, bond and vlan devices as well. I can drop this RFC then. Thanks for the details! Vivien
Re: [PATCH] inet: set `error' value not under lock_sock().
On 03/09/2016 12:37 PM, David Miller wrote: From: Weongyo Jeong Date: Wed, 9 Mar 2016 10:22:21 -0800 A trivial patch to set `error' variable while not holding lock_sock(). Signed-off-by: Weongyo Jeong Deferring the assignment means gcc doesn't have to potentially put it on the stack across the lock_sock() call. You're making the code worse not better, the assignment into a local variable is not an expensive operation. I'm not applying this change, sorry. No problem. Thank you for review and opinion. Regards, Weongyo Jeong
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Hi Jiri, Jiri Pirko writes: > Wed, Mar 09, 2016 at 07:32:13PM CET, and...@lunn.ch wrote: >>Hi Vivien >> >>> -static bool dsa_slave_dev_check(struct net_device *dev) >>> -{ >>> - return dev->netdev_ops == &dsa_slave_netdev_ops; >>> -} >> >>Where is the equivalent of this happening? Where do we check that the >>interface added to the bridge is part of the switch? >> >>> -int dsa_slave_netdevice_event(struct notifier_block *unused, >>> - unsigned long event, void *ptr) >>> -{ >>> - struct net_device *dev; >>> - int err = 0; >>> - >>> - switch (event) { >>> - case NETDEV_CHANGEUPPER: >>> - dev = netdev_notifier_info_to_dev(ptr); >>> - if (!dsa_slave_dev_check(dev)) >>> - goto out; >>> - >>> - err = dsa_slave_master_changed(dev); >>> - if (err && err != -EOPNOTSUPP) >>> - netdev_warn(dev, "failed to reflect master change\n"); >>> - >>> - break; >>> - } >>> - >>> -out: >>> - return NOTIFY_DONE; >>> -} >> >>How about team/bonding? We are not ready to implement it yet with the >>Marvell devices, but at some point we probably will. Won't we need the >>events then? We need to know when a switch port has been added to a >>team? >> >>Or do you think a switchdev object will be added for this case? >>Mellanox already have the ability to add switch interfaces to a team, >>and then add the team to a bridge. So we need to ensure your solution >>works for such stacked systems. > > I have to look at this more closer tomorrow, but I'm missing motivation > behind this. Using existing notifiers, drivers can easily monitor what > is going on with their uppers. Why do we need this to be changed? Yes with notifiers, drivers can monitor these changes with the NETDEV_CHANGEUPPER even. They can also forbid such bridging by returning NOTIFY_BAD in the NETDEV_PRECHANGEUPPER event if I'm not mistaken. But looking at DSA slave, Mellanox Spectrum, and Rocker, they all implement this similar heavy code, while they could support a common switchdev attribute and reduce boilerplate. 
But maybe I'm wrong, which is why I sent this as an RFC :-) Thanks, Vivien
Re: [PATCH 2/2] isdn: i4l: move active-isdn drivers to staging
On 07.03.2016 at 09:48, Paul Bolle wrote: > On Sat, 2016-03-05 at 14:08 +0100, Tilman Schmidt wrote: >> As a consequence, owners of HiSAX type adapters are in fact stuck with >> the old hisax driver if they want to continue using i4l userspace >> tools. > > Do you know whether or not mISDN tools offer functionality comparable to > i4l tools? AFAICS they don't. mISDN seems very much geared towards voice use, for example with Asterisk. The entire topic of ISDN data connections appears to be missing. I don't see how someone currently using i4l for Internet dial-in (either client or server side) could migrate to mISDN. -- Tilman Schmidt E-Mail: til...@imap.cc Bonn, Germany
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, Mar 9, 2016 at 1:47 PM, Jesper Dangaard Brouer wrote: > On Wed, 9 Mar 2016 13:43:59 -0800 > Alexander Duyck wrote: > >> On Wed, Mar 9, 2016 at 1:36 PM, Jesper Dangaard Brouer >> wrote: >> > On Wed, 09 Mar 2016 16:03:20 -0500 (EST) >> > David Miller wrote: >> > >> >> From: Alexander Duyck >> >> Date: Wed, 9 Mar 2016 08:47:58 -0800 >> >> >> >> > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer >> >> > wrote: >> >> >> Passing the budget down was Alex'es design. Axel any thoughts? >> >> > >> >> > I'd say just use dev_consume_skb_any in the bulk free instead of >> >> > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't >> >> > come up often. >> >> >> >> Agreed. >> >> >> >> >> I do wonder how expensive this check is... as it goes into a code >> >> >> hotpath, which is very unlikely. The good thing would be, that we >> >> >> handle if buggy drivers call this function from a none softirq context >> >> >> (as these bugs could be hard to catch). >> >> >> >> >> >> Can netpoll ever be called from softirq or with BH disabled? (It >> >> >> disables IRQs, which would break calling kmem_cache_free_bulk). >> >> > >> >> > It is better for us to switch things out so that the napi_consume_skb >> >> > is the fast path with dev_consume_skb_any as the slow. There are too >> >> > many scenarios where we could be invoking something that makes use of >> >> > this within the Tx path so it is probably easiest to just solve it >> >> > that way so we don't have to deal with it again in the future. >> >> >> >> Indeed. 
>> > >> > So, if I understand you correctly, then we drop the budget parameter >> > and check for in_softirq(), like: >> > >> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> > index 7af7ec635d90..a3c61a9b65d2 100644 >> > --- a/net/core/skbuff.c >> > +++ b/net/core/skbuff.c >> > @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) >> > _kfree_skb_defer(skb); >> > } >> > >> > -void napi_consume_skb(struct sk_buff *skb, int budget) >> > +void napi_consume_skb(struct sk_buff *skb) >> > { >> > if (unlikely(!skb)) >> > return; >> > >> > - /* if budget is 0 assume netpoll w/ IRQs disabled */ >> > - if (unlikely(!budget)) { >> > - dev_consume_skb_irq(skb); >> > + /* Handle if not called from NAPI context, and netpoll invocation >> > */ >> > + if (unlikely(!in_softirq())) { >> > + dev_consume_skb_any(skb); >> > return; >> > } >> > >> >> No. We still need to have the budget value. What we do though is >> have that feed into dev_consume_skb_any. >> >> The problem with using in_softirq is that it will trigger if softirqs >> are just disabled so there are more possible paths where it is >> possible. For example the transmit path has bottom halves disabled so >> I am pretty sure it might trigger this as well. We want this to only >> execute when we are running from a NAPI polling routine with a >> non-zero budget. > > What about using in_serving_softirq() instead of in_softirq() ? > (would that allow us to drop the budget parameter?) The problem is there are multiple softirq handlers you could be referring to. NAPI is just one. What if for example someone sets up a tasklet that has to perform some sort of reset with RTNL held. I am pretty sure we don't want the tasklet using the NAPI free context since there will be nothing to actually free the buffers at the end of it. We want to avoid that. That is why I was using the budget value. - Alex
[PATCH next v2 7/7] ipvlan: Implement L3-symmetric mode.
From: Mahesh Bandewar Current packet processing from IPtables perspective is asymmetric for IPvlan L3 mode. On egress path, packets hit LOCAL_OUT and POST_ROUTING hooks in slave-ns as well as master's ns however during ingress path, LOCAL_IN and PRE_ROUTING hooks are hit only in slave's ns. L3 mode is restrictive and uses master's L3 for packet processing, so it does not make sense to skip these hooks in ingress path in master's ns. The changes in this patch nominates master-dev to be the device for L3 ingress processing when skb device is the IPvlan slave. Since master device is used for L3 processing, the IPT hooks are hit in master's ns making the packet processing symmetric. The other minor change this patch does is to add a force parameter for set_port_mode() to ensure correct settings are set during the device initialization phase. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- drivers/net/ipvlan/ipvlan_main.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c index 5802b9025765..734c25e52c60 100644 --- a/drivers/net/ipvlan/ipvlan_main.c +++ b/drivers/net/ipvlan/ipvlan_main.c @@ -14,16 +14,19 @@ static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev) ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj; } -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval) +static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval, bool force) { struct ipvl_dev *ipvlan; - if (port->mode != nval) { + if (port->mode != nval || force) { list_for_each_entry(ipvlan, &port->ipvlans, pnode) { - if (nval == IPVLAN_MODE_L3) + if (nval == IPVLAN_MODE_L3) { ipvlan->dev->flags |= IFF_NOARP; - else + ipvlan->dev->l3_dev = port->dev; + } else { ipvlan->dev->flags &= ~IFF_NOARP; + ipvlan->dev->l3_dev = ipvlan->dev; + } } port->mode = nval; } @@ -392,7 +395,7 @@ static int ipvlan_nl_changelink(struct net_device 
*dev, if (data && data[IFLA_IPVLAN_MODE]) { u16 nmode = nla_get_u16(data[IFLA_IPVLAN_MODE]); - ipvlan_set_port_mode(port, nmode); + ipvlan_set_port_mode(port, nmode, false); } return 0; } @@ -479,7 +482,6 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev, memcpy(dev->dev_addr, phy_dev->dev_addr, ETH_ALEN); dev->priv_flags |= IFF_IPVLAN_SLAVE; - port->count += 1; err = register_netdevice(dev); if (err < 0) @@ -490,7 +492,7 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev, goto ipvlan_destroy_port; list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans); - ipvlan_set_port_mode(port, mode); + ipvlan_set_port_mode(port, mode, true); netif_stacked_transfer_operstate(phy_dev, dev); return 0; -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 4/7] ipv6: Use l3_dev for L3 ingress processing
From: Mahesh Bandewar Use the in-dev passed by the packet dispatcher for the L3 phase. If there are places where code uses skb->dev, use the netif_get_l3_dev() helper to get l3_dev. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/ipv6/ip6_input.c | 14 -- net/ipv6/route.c | 7 --- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index c05c425c2389..f02da6d7e097 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -67,7 +67,8 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt const struct ipv6hdr *hdr; u32 pkt_len; struct inet6_dev *idev; - struct net *net = dev_net(skb->dev); + struct net_device *l3dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(l3dev); if (skb->pkt_type == PACKET_OTHERHOST) { kfree_skb(skb); @@ -76,7 +77,7 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt rcu_read_lock(); - idev = __in6_dev_get(skb->dev); + idev = __in6_dev_get(l3dev); IP6_UPD_PO_STATS_BH(net, idev, IPSTATS_MIB_IN, skb->len); @@ -246,8 +247,8 @@ resubmit: skb_network_header_len(skb)); hdr = ipv6_hdr(skb); if (ipv6_addr_is_multicast(&hdr->daddr) && - !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, - &hdr->saddr) && + !ipv6_chk_mcast_addr(netif_get_l3_dev(skb->dev), +&hdr->daddr, &hdr->saddr) && !ipv6_is_mld(skb, nexthdr, skb_network_header_len(skb))) goto discard; } @@ -287,9 +288,10 @@ discard: int ip6_input(struct sk_buff *skb) { + struct net_device *dev = netif_get_l3_dev(skb->dev); + return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN, - dev_net(skb->dev), NULL, skb, skb->dev, NULL, - ip6_input_finish); + dev_net(dev), NULL, skb, dev, NULL, ip6_input_finish); } int ip6_mc_input(struct sk_buff *skb) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index ed446639219c..3a1b3c62d80b 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1158,11 +1158,12 @@ static struct dst_entry 
*ip6_route_input_lookup(struct net *net, void ip6_route_input(struct sk_buff *skb) { const struct ipv6hdr *iph = ipv6_hdr(skb); - struct net *net = dev_net(skb->dev); + struct net_device *dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(dev); int flags = RT6_LOOKUP_F_HAS_SADDR; struct ip_tunnel_info *tun_info; struct flowi6 fl6 = { - .flowi6_iif = l3mdev_fib_oif(skb->dev), + .flowi6_iif = l3mdev_fib_oif(dev), .daddr = iph->daddr, .saddr = iph->saddr, .flowlabel = ip6_flowinfo(iph), @@ -1174,7 +1175,7 @@ void ip6_route_input(struct sk_buff *skb) if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; skb_dst_drop(skb); - skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags)); + skb_dst_set(skb, ip6_route_input_lookup(net, dev, &fl6, flags)); } static struct rt6_info *ip6_pol_route_output(struct net *net, struct fib6_table *table, -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 6/7] xfrm: Use l3_dev for xfrm policy checks.
From: Mahesh Bandewar IPsec, whether it's tunnel mode or transport mode, is still a function of L3 so all the decisions should be based on the L3 device. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- include/net/xfrm.h | 2 +- net/xfrm/xfrm_policy.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index d6f6e5006ee9..30f9a351c3b9 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1067,7 +1067,7 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir, struct sk_buff *skb, unsigned int family, int reverse) { - struct net *net = dev_net(skb->dev); + struct net *net = dev_net(netif_get_l3_dev(skb->dev)); int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0); if (sk && sk->sk_policy[XFRM_POLICY_IN]) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index b5e665b3cfb0..c5942744f2e3 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2462,7 +2462,7 @@ static inline int secpath_has_nontransport(const struct sec_path *sp, int k, int int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) { - struct net *net = dev_net(skb->dev); + struct net *net = dev_net(netif_get_l3_dev(skb->dev)); struct xfrm_policy *pol; struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX]; int npols = 0; @@ -2620,7 +2620,7 @@ EXPORT_SYMBOL(__xfrm_policy_check); int __xfrm_route_forward(struct sk_buff *skb, unsigned short family) { - struct net *net = dev_net(skb->dev); + struct net *net = dev_net(netif_get_l3_dev(skb->dev)); struct flowi fl; struct dst_entry *dst; int res = 1; -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 5/7] raw: Use l3_dev for L3 ingress raw packet delivery
From: Mahesh Bandewar Use the in-dev passed by the packet dispatcher for the L3 phase. If there are places where code uses skb->dev, use the netif_get_l3_dev() helper to get l3_dev. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/ipv4/raw.c | 11 +-- net/ipv6/raw.c | 2 +- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 8d22de74080c..7802ce24a42d 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -173,22 +173,22 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash) struct hlist_head *head; int delivered = 0; struct net *net; + struct net_device *dev = netif_get_l3_dev(skb->dev); read_lock(&raw_v4_hashinfo.lock); head = &raw_v4_hashinfo.ht[hash]; if (hlist_empty(head)) goto out; - net = dev_net(skb->dev); + net = dev_net(dev); sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol, iph->saddr, iph->daddr, -skb->dev->ifindex); +dev->ifindex); while (sk) { delivered = 1; if ((iph->protocol != IPPROTO_ICMP || !icmp_filter(sk, skb)) && - ip_mc_sf_allow(sk, iph->daddr, iph->saddr, - skb->dev->ifindex)) { + ip_mc_sf_allow(sk, iph->daddr, iph->saddr, dev->ifindex)) { struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC); /* Not releasing hash table! 
*/ @@ -196,8 +196,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash) raw_rcv(sk, clone); } sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol, -iph->saddr, iph->daddr, -skb->dev->ifindex); +iph->saddr, iph->daddr, dev->ifindex); } out: read_unlock(&raw_v4_hashinfo.lock); diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index fa59dd7a427e..22d001f0645d 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -175,7 +175,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr) if (!sk) goto out; - net = dev_net(skb->dev); + net = dev_net(netif_get_l3_dev(skb->dev)); sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb)); while (sk) { -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 3/7] ipv4: Use l3_dev for L3 ingress processing
From: Mahesh Bandewar Use the in-dev passed by the packet dispatcher for the L3 phase. If there are places where code uses skb->dev, use the netif_get_l3_dev() helper to get l3_dev. Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/ipv4/ip_input.c | 12 +++- net/ipv4/ip_options.c | 3 ++- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index e3d782746d9d..cae3503c872b 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -247,7 +247,8 @@ int ip_local_deliver(struct sk_buff *skb) /* * Reassemble IP fragments. */ - struct net *net = dev_net(skb->dev); + struct net_device *dev = netif_get_l3_dev(skb->dev); + struct net *net = dev_net(dev); if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER)) @@ -255,7 +256,7 @@ int ip_local_deliver(struct sk_buff *skb) } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, - net, NULL, skb, skb->dev, NULL, + net, NULL, skb, dev, NULL, ip_local_deliver_finish); } @@ -263,7 +264,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb) { struct ip_options *opt; const struct iphdr *iph; - struct net_device *dev = skb->dev; + struct net_device *dev = netif_get_l3_dev(skb->dev); /* It looks as overkill, because not all IP options require packet mangling. 
@@ -312,6 +313,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { const struct iphdr *iph = ip_hdr(skb); struct rtable *rt; + struct net_device *dev = netif_get_l3_dev(skb->dev); if (net->ipv4.sysctl_ip_early_demux && !skb_dst(skb) && @@ -334,7 +336,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) */ if (!skb_valid_dst(skb)) { int err = ip_route_input_noref(skb, iph->daddr, iph->saddr, - iph->tos, skb->dev); + iph->tos, dev); if (unlikely(err)) { if (err == -EXDEV) NET_INC_STATS_BH(net, LINUX_MIB_IPRPFILTER); @@ -363,7 +365,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_INBCAST, skb->len); } else if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) { - struct in_device *in_dev = __in_dev_get_rcu(skb->dev); + struct in_device *in_dev = __in_dev_get_rcu(dev); /* RFC 1122 3.3.6: * diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c index bd246792360b..fc434b12a3d3 100644 --- a/net/ipv4/ip_options.c +++ b/net/ipv4/ip_options.c @@ -639,7 +639,8 @@ int ip_options_rcv_srr(struct sk_buff *skb) orefdst = skb->_skb_refdst; skb_dst_set(skb, NULL); - err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, skb->dev); + err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, +netif_get_l3_dev(skb->dev)); rt2 = skb_rtable(skb); if (err || (rt2->rt_type != RTN_UNICAST && rt2->rt_type != RTN_LOCAL)) { skb_dst_drop(skb); -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 1/7] dev: Add netif_get_l3_dev() helper
From: Mahesh Bandewar This patch adds a l3_dev to net_device struct and a helper function to retrieve it. During L3 ingress packet processing, this device will be used and should serve the current purpose of skb->dev. Since l3_dev is initialized to self; l3_dev should be pointing to skb->dev so the normal packet processing is neither altered nor should incur any additional cost (as it resides in the RX cache line). Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- include/linux/netdevice.h | 6 ++ net/core/dev.c| 1 + 2 files changed, 7 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e52077ffe5ed..1cf7e8d61043 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1738,6 +1738,7 @@ struct net_device { unsigned long gro_flush_timeout; rx_handler_func_t __rcu *rx_handler; void __rcu *rx_handler_data; + struct net_device *l3_dev; #ifdef CONFIG_NET_CLS_ACT struct tcf_proto __rcu *ingress_cl_list; @@ -4085,6 +4086,11 @@ static inline void netif_keep_dst(struct net_device *dev) dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM); } +static inline struct net_device *netif_get_l3_dev(struct net_device *dev) +{ + return dev->l3_dev; +} + extern struct pernet_operations __net_initdata loopback_net_ops; /* Logging, debugging and troubleshooting/diagnostic helpers. */ diff --git a/net/core/dev.c b/net/core/dev.c index edb7179bc051..c4023a68cdc1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7463,6 +7463,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, if (!dev->ethtool_ops) dev->ethtool_ops = &default_ethtool_ops; + dev->l3_dev = dev; nf_hook_ingress_init(dev); return dev; -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
From: Mahesh Bandewar One of the major requests (for enhancement) that I have received from various users of IPvlan in L3 mode concerns its inability to handle IPtables. Looking at the code and how we handle ingress, the problem can be attributed to the asymmetry in the way packets get processed for IPvlan devices configured in L3 mode. L3 mode is supposed to be restrictive, and all the L3 decisions need to be taken for the traffic in the master's ns. This does happen as expected for egress traffic; however, on ingress, the IPvlan packet handler changes skb->dev, which forces the packet to be processed with the IPvlan slave and its associated ns. This causes the above-mentioned problem and a few others which are not yet reported/attempted, e.g. IPsec with L3 mode or even ingress routing. This could have been solved if we had a way to hand the packet over to the slave and its associated ns after completing the L3 phase, but that is a non-trivial issue to fix, especially looking at the IPsec code. This patch series attempts to solve the problem by introducing a device pointer, l3_dev, which resides in the net_device structure in the RX cache line. We initialize l3_dev to self, which means there is no complex logic for when and how to initialize it. The stack will now use this dev pointer during the L3 phase. This should not alter any existing properties/behavior, and there should not be any additional penalty since it resides in the same RX cache line. Now coming back to the IPvlan setup: in the typical IPvlan L3 setup, the master is in the default ns while the slaves are put inside different (slave) net-ns. During the initialization phase, the driver can point l3_dev for each slave at its master device. This solves the current IPT problem as well as making the packet processing symmetric from the stack's perspective. For all other packet processing, since dev->l3_dev is the same as dev, there should be no behavioral or functional change.
Mahesh Bandewar (7): dev: Add netif_get_l3_dev() helper dev: Update packet dispatcher to pass l3_dev as an input device ipv4: Use l3_dev for L3 ingress processing ipv6: Use l3_dev for L3 ingress processing raw: Use l3_dev for L3 ingress raw packet delivery xfrm: Use l3_dev for xfrm policy checks. ipvlan: Implement L3-symmetric mode. drivers/net/ipvlan/ipvlan_main.c | 16 +--- include/linux/netdevice.h| 6 ++ include/net/xfrm.h | 2 +- net/core/dev.c | 10 +++--- net/ipv4/ip_input.c | 12 +++- net/ipv4/ip_options.c| 3 ++- net/ipv4/raw.c | 11 +-- net/ipv6/ip6_input.c | 14 -- net/ipv6/raw.c | 2 +- net/ipv6/route.c | 7 --- net/xfrm/xfrm_policy.c | 4 ++-- 11 files changed, 52 insertions(+), 35 deletions(-) -- 2.7.0.rc3.207.g0ac5344
[PATCH next v2 2/7] dev: Update packet dispatcher to pass l3_dev as an input device
From: Mahesh Bandewar netif_receive_skb_core() dispatcher uses skb->dev device as an input device to all packet-handlers (e.g. ip_rcv, ipv6_rcv etc) These packet handlers use this device reference to start processing L3 phase of the packet-processing. This includes IPT hooks, ingress route-lookup, and xfrm policy lookups etc. This patch passes l3_dev as an input device and shifts L3 processing from skb->dev to l3_dev (skb->dev->l3_dev). Signed-off-by: Mahesh Bandewar CC: Eric Dumazet CC: Tim Hockin CC: Alex Pollitt CC: Matthew Dupre --- net/core/dev.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index c4023a68cdc1..9252436ef11a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1811,7 +1811,8 @@ static inline int deliver_skb(struct sk_buff *skb, if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) return -ENOMEM; atomic_inc(&skb->users); - return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + return pt_prev->func(skb, netif_get_l3_dev(skb->dev), pt_prev, +orig_dev); } static inline void deliver_ptype_list_skb(struct sk_buff *skb, @@ -1904,7 +1905,8 @@ again: } out_unlock: if (pt_prev) - pt_prev->func(skb2, skb->dev, pt_prev, skb->dev); + pt_prev->func(skb2, netif_get_l3_dev(skb->dev), pt_prev, + skb->dev); rcu_read_unlock(); } @@ -4157,7 +4159,8 @@ ncls: if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) goto drop; else - ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + ret = pt_prev->func(skb, netif_get_l3_dev(skb->dev), + pt_prev, orig_dev); } else { drop: if (!deliver_exact) -- 2.7.0.rc3.207.g0ac5344
Re: [net-next PATCH 3/4] vxlan: Enforce IP ID verification on outer headers
On Wed, Mar 9, 2016 at 1:02 PM, David Miller wrote: > From: Jesse Gross > Date: Mon, 7 Mar 2016 15:42:59 -0800 > >> On Mon, Mar 7, 2016 at 3:06 PM, Alex Duyck wrote: >>> On Mon, Mar 7, 2016 at 11:09 AM, David Miller wrote: From: Or Gerlitz Date: Mon, 7 Mar 2016 20:05:20 +0200 > On Mon, Mar 7, 2016 at 7:22 PM, Alexander Duyck > wrote: >> This change enforces the IP ID verification on outer headers. As a >> result >> if the DF flag is not set on the outer header we will force the flow to >> be >> flushed in the event that the IP ID is out of sequence with the existing >> flow. > > Can you please state the precise requirement for aggregation w.r.t IP > IDs here? and point to where/how this is enforced, e.g for > non-tunneled TCP GRO-ing? I also didn't see a nice "PATCH 0/4" posting explaining this series and I'd really like to see that. >>> >>> Sorry about that. I forgot to add the cover page when I sent this. >>> >>> The enforcement is coming from the IP and TCP layers. If you take a >>> look in inet_gro_receive we have the NAPI_GRO_CB(p)->flush_id value >>> being populated based on the difference between the expected ID and >>> the received one. So for IPv4 we overwrite it, and for IPv6 we set it >>> to 0. The only consumer currently using it is TCP in tcp_gro_receive. >>> The problem is with tunnels we lose the data for the outer when the >>> inner overwrites it, as a result we can put whatever we want currently >>> in the outer IP ID and it will be accepted. >>> >>> The patch set is based off of a conversation several of us had on the >>> list about doing TSO for tunnels and the fact that the IP IDs for the >>> outer header have to advance. It makes it easier for me to validate >>> that I am doing things properly if GRO doesn't destroy the IP ID data >>> for the outer headers. 
>> >> In net/ipv4/af_inet.c:inet_gro_receive() there is the following >> comment above where NAPI_GRO_CB(p)->flush_id is set: >> >> /* Save the IP ID check to be included later when we get to >> * the transport layer so only the inner most IP ID is checked. >> * This is because some GSO/TSO implementations do not >> * correctly increment the IP ID for the outer hdrs. >> */ >> >> There was a long discussion about this a couple years ago and the >> conclusion was that it is the inner IP ID is really the important one >> in the case of encapsulation. Obviously, things like TCP/IP header >> compression don't apply to the outer encapsulation header. > > That's how I remember the conversation going as wel... Given that then, how would you feel about a software scheme to use existing TSO support on devices to support UDP/GRE tunnel TSO by pre-computing the values for the outer headers and then having those duplicated on TSO frames? This is kind of what got me started down the path of validating the outer IP ID. If we are okay with UDP or GRE encapsulated tunnels using the same IP ID for the outer headers then I should be able to code up something in software so that we can get more devices supporting tunnel TSO offload for tunnels. - Alex
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, 9 Mar 2016 13:43:59 -0800 Alexander Duyck wrote: > On Wed, Mar 9, 2016 at 1:36 PM, Jesper Dangaard Brouer > wrote: > > On Wed, 09 Mar 2016 16:03:20 -0500 (EST) > > David Miller wrote: > > > >> From: Alexander Duyck > >> Date: Wed, 9 Mar 2016 08:47:58 -0800 > >> > >> > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer > >> > wrote: > >> >> Passing the budget down was Alex'es design. Axel any thoughts? > >> > > >> > I'd say just use dev_consume_skb_any in the bulk free instead of > >> > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't > >> > come up often. > >> > >> Agreed. > >> > >> >> I do wonder how expensive this check is... as it goes into a code > >> >> hotpath, which is very unlikely. The good thing would be, that we > >> >> handle if buggy drivers call this function from a none softirq context > >> >> (as these bugs could be hard to catch). > >> >> > >> >> Can netpoll ever be called from softirq or with BH disabled? (It > >> >> disables IRQs, which would break calling kmem_cache_free_bulk). > >> > > >> > It is better for us to switch things out so that the napi_consume_skb > >> > is the fast path with dev_consume_skb_any as the slow. There are too > >> > many scenarios where we could be invoking something that makes use of > >> > this within the Tx path so it is probably easiest to just solve it > >> > that way so we don't have to deal with it again in the future. > >> > >> Indeed. 
> > > > So, if I understand you correctly, then we drop the budget parameter > > and check for in_softirq(), like: > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 7af7ec635d90..a3c61a9b65d2 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) > > _kfree_skb_defer(skb); > > } > > > > -void napi_consume_skb(struct sk_buff *skb, int budget) > > +void napi_consume_skb(struct sk_buff *skb) > > { > > if (unlikely(!skb)) > > return; > > > > - /* if budget is 0 assume netpoll w/ IRQs disabled */ > > - if (unlikely(!budget)) { > > - dev_consume_skb_irq(skb); > > + /* Handle if not called from NAPI context, and netpoll invocation */ > > + if (unlikely(!in_softirq())) { > > + dev_consume_skb_any(skb); > > return; > > } > > > > No. We still need to have the budget value. What we do though is > have that feed into dev_consume_skb_any. > > The problem with using in_softirq is that it will trigger if softirqs > are just disabled so there are more possible paths where it is > possible. For example the transmit path has bottom halves disabled so > I am pretty sure it might trigger this as well. We want this to only > execute when we are running from a NAPI polling routine with a > non-zero budget. What about using in_serving_softirq() instead of in_softirq() ? (would that allow us to drop the budget parameter?) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net] r8169: Remove unnecessary phy reset for pcie nic when setting link speed.
Hau : [...] > Unless the PCIe NIC has a bug, it does not need to reset the PHY to bring the > link up. > > There is a counter for PHY speed-down. If the PHY is in the link-down state, this > counter will start to count down. When it reaches 0, the PHY will speed down. > Resetting the PHY reloads this counter and prevents the PHY from speeding down. Thanks for the clarification. -- Ueimor
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, Mar 9, 2016 at 1:36 PM, Jesper Dangaard Brouer wrote: > On Wed, 09 Mar 2016 16:03:20 -0500 (EST) > David Miller wrote: > >> From: Alexander Duyck >> Date: Wed, 9 Mar 2016 08:47:58 -0800 >> >> > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer >> > wrote: >> >> Passing the budget down was Alex'es design. Axel any thoughts? >> > >> > I'd say just use dev_consume_skb_any in the bulk free instead of >> > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't >> > come up often. >> >> Agreed. >> >> >> I do wonder how expensive this check is... as it goes into a code >> >> hotpath, which is very unlikely. The good thing would be, that we >> >> handle if buggy drivers call this function from a none softirq context >> >> (as these bugs could be hard to catch). >> >> >> >> Can netpoll ever be called from softirq or with BH disabled? (It >> >> disables IRQs, which would break calling kmem_cache_free_bulk). >> > >> > It is better for us to switch things out so that the napi_consume_skb >> > is the fast path with dev_consume_skb_any as the slow. There are too >> > many scenarios where we could be invoking something that makes use of >> > this within the Tx path so it is probably easiest to just solve it >> > that way so we don't have to deal with it again in the future. >> >> Indeed. 
> > So, if I understand you correctly, then we drop the budget parameter > and check for in_softirq(), like: > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 7af7ec635d90..a3c61a9b65d2 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) > _kfree_skb_defer(skb); > } > > -void napi_consume_skb(struct sk_buff *skb, int budget) > +void napi_consume_skb(struct sk_buff *skb) > { > if (unlikely(!skb)) > return; > > - /* if budget is 0 assume netpoll w/ IRQs disabled */ > - if (unlikely(!budget)) { > - dev_consume_skb_irq(skb); > + /* Handle if not called from NAPI context, and netpoll invocation */ > + if (unlikely(!in_softirq())) { > + dev_consume_skb_any(skb); > return; > } > No. We still need to have the budget value. What we do though is have that feed into dev_consume_skb_any. The problem with using in_softirq is that it will trigger if softirqs are just disabled so there are more possible paths where it is possible. For example the transmit path has bottom halves disabled so I am pretty sure it might trigger this as well. We want this to only execute when we are running from a NAPI polling routine with a non-zero budget. - Alex
Re: [RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Hi Vivien, Wed, Mar 09, 2016 at 07:42:47PM IST, vivien.dide...@savoirfairelinux.com wrote: >Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is >set before adding a port to a bridge and deleting a port from a bridge. > >The main purpose for this attribute is to provide switchdev users a >simple and common way to retrieve bridging information, instead of >implementing complex notifier blocks to listen to global netdev events. > >We can also imagine a switchdev user returning an error different from >-EOPNOTSUPP in the prepare phase to prevent a port from being bridged. I don't really understand the motivation for this change. We are already doing all these stuff with the notifiers and it's pretty straight-forward. In fact, I believe using an existing mechanism instead of introducing more switchdev hooks is more elegant. This RFC only deals with bridge, but you'll have to do the same for team, bond and vlan devices. And you'll probably place the hooks in the exact locations where the notifiers are called from anyway. 
> >Signed-off-by: Vivien Didelot >--- > include/net/switchdev.h | 2 ++ > net/bridge/br_if.c | 27 +++ > 2 files changed, 29 insertions(+) > >diff --git a/include/net/switchdev.h b/include/net/switchdev.h >index d451122..65f8514 100644 >--- a/include/net/switchdev.h >+++ b/include/net/switchdev.h >@@ -46,6 +46,7 @@ enum switchdev_attr_id { > SWITCHDEV_ATTR_ID_PORT_PARENT_ID, > SWITCHDEV_ATTR_ID_PORT_STP_STATE, > SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS, >+ SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, > SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME, > SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING, > }; >@@ -58,6 +59,7 @@ struct switchdev_attr { > struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */ > u8 stp_state; /* PORT_STP_STATE */ > unsigned long brport_flags; /* PORT_BRIDGE_FLAGS */ >+ bool join; /* PORT_BRIDGE_IF */ > u32 ageing_time;/* BRIDGE_AGEING_TIME */ > bool vlan_filtering;/* > BRIDGE_VLAN_FILTERING */ > } u; >diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c >index a73df33..105b9fd 100644 >--- a/net/bridge/br_if.c >+++ b/net/bridge/br_if.c >@@ -28,6 +28,24 @@ > > #include "br_private.h" > >+static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br, >+ bool join) >+{ >+ struct switchdev_attr attr = { >+ .orig_dev = br->dev, This should be just 'dev', since you need to know for which stacked device on top of the port this was called for. This also means you'll have to call netdev_master_upper_dev_get() from within your driver if you want to limit the number of VLAN filtering bridges (for example). However, since this is called before bridge dev and dev itself are linked, you'll get NULL. >+ .id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF, >+ .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, >+ .u.join = join, >+ }; >+ int err; >+ >+ err = switchdev_port_attr_set(dev, &attr); >+ if (err && err != -EOPNOTSUPP) >+ return err; >+ >+ return 0; >+} >+ > /* > * Determine initial path cost based on speed. 
> * using recommendations from 802.1d standard >@@ -297,6 +315,10 @@ static void del_nbp(struct net_bridge_port *p) > br_netpoll_disable(p); > > call_rcu(&p->rcu, destroy_nbp_rcu); >+ >+ if (switchdev_bridge_if(dev, br, false)) >+ br_warn(br, "error unbridging port %u(%s)\n", >+ (unsigned int) p->port_no, dev->name); > } > > /* Delete bridge device */ >@@ -347,6 +369,11 @@ static struct net_bridge_port *new_nbp(struct net_bridge >*br, > { > int index; > struct net_bridge_port *p; >+ int err; >+ >+ err = switchdev_bridge_if(dev, br, true); If you look at br_add_if() - where new_nbp() is called from - then you'll see that you aren't rollbacking this operation in case of error. Same for subsequent errors in this function I believe. >+ if (err) >+ return ERR_PTR(err); > > index = find_portno(br); > if (index < 0) >-- >2.7.2 > Maybe this is something we'll have to do in the future, but for now I think we are OK with the notifiers. :) Thanks Vivien!
Re: [PATCH v2 net-next 00/13] kcm: Kernel Connection Multiplexor (KCM)
From: Tom Herbert Date: Mon, 7 Mar 2016 14:10:59 -0800 > Kernel Connection Multiplexor (KCM) is a facility that provides a > message based interface over TCP for generic application protocols. ... I've decided to apply this as-is. If there are any issues we can fix it with follow-on patches. Thanks for all of the hard work Tom.
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
On Wed, 09 Mar 2016 16:03:20 -0500 (EST) David Miller wrote: > From: Alexander Duyck > Date: Wed, 9 Mar 2016 08:47:58 -0800 > > > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer > > wrote: > >> Passing the budget down was Alex'es design. Axel any thoughts? > > > > I'd say just use dev_consume_skb_any in the bulk free instead of > > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't > > come up often. > > Agreed. > > >> I do wonder how expensive this check is... as it goes into a code > >> hotpath, which is very unlikely. The good thing would be, that we > >> handle if buggy drivers call this function from a none softirq context > >> (as these bugs could be hard to catch). > >> > >> Can netpoll ever be called from softirq or with BH disabled? (It > >> disables IRQs, which would break calling kmem_cache_free_bulk). > > > > It is better for us to switch things out so that the napi_consume_skb > > is the fast path with dev_consume_skb_any as the slow. There are too > > many scenarios where we could be invoking something that makes use of > > this within the Tx path so it is probably easiest to just solve it > > that way so we don't have to deal with it again in the future. > > Indeed. 
So, if I understand you correctly, then we drop the budget parameter and check for in_softirq(), like: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7af7ec635d90..a3c61a9b65d2 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -796,14 +796,14 @@ void __kfree_skb_defer(struct sk_buff *skb) _kfree_skb_defer(skb); } -void napi_consume_skb(struct sk_buff *skb, int budget) +void napi_consume_skb(struct sk_buff *skb) { if (unlikely(!skb)) return; - /* if budget is 0 assume netpoll w/ IRQs disabled */ - if (unlikely(!budget)) { - dev_consume_skb_irq(skb); + /* Handle if not called from NAPI context, and netpoll invocation */ + if (unlikely(!in_softirq())) { + dev_consume_skb_any(skb); return; } -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [ovs-dev] [PATCH net-next] ovs: allow nl 'flow set' to use ufid without flow key
On Wed, Mar 9, 2016 at 9:05 AM, Samuel Gauthier wrote: > When we want to change a flow using netlink, we have to identify it to > be able to perform a lookup. Both the flow key and unique flow ID > (ufid) are valid identifiers, but we always have to specify the flow > key in the netlink message. When both attributes are there, the ufid > is used. > > This commit allows to use the ufid without having to provide the flow > key, as it is already done in the netlink 'flow get' and 'flow del' > path. > > Signed-off-by: Samuel Gauthier > --- > net/openvswitch/datapath.c | 21 ++--- > 1 file changed, 10 insertions(+), 11 deletions(-) > > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c > index deadfdab1bc3..06f15143cf2a 100644 > --- a/net/openvswitch/datapath.c > +++ b/net/openvswitch/datapath.c > @@ -1100,21 +1100,20 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, > struct genl_info *info) > struct sw_flow_match match; > struct sw_flow_id sfid; > u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]); > - int error; > + int error = 0; > bool log = !a[OVS_FLOW_ATTR_PROBE]; > bool ufid_present; > > - /* Extract key. */ > - error = -EINVAL; > - if (!a[OVS_FLOW_ATTR_KEY]) { > - OVS_NLERR(log, "Flow key attribute not present in set flow."); > - goto error; > - } > - > ufid_present = ovs_nla_get_ufid(&sfid, a[OVS_FLOW_ATTR_UFID], log); > - ovs_match_init(&match, &key, &mask); > - error = ovs_nla_get_match(net, &match, a[OVS_FLOW_ATTR_KEY], > - a[OVS_FLOW_ATTR_MASK], log); > + if (a[OVS_FLOW_ATTR_KEY]) { > + ovs_match_init(&match, &key, NULL); > + error = ovs_nla_get_match(net, &match, a[OVS_FLOW_ATTR_KEY], > + NULL, log); > + } else if (!ufid_present) { > + OVS_NLERR(log, > + "Flow set message rejected, Key attribute > missing."); > + error = -EINVAL; > + } Set command sets new action for given flow. The action validation depends on the flow key and mask. so userspace needs to provide the key and mask.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 04:10:38PM -0500, David Miller wrote: > > > > and here we call for NETDEV_DOWN, which then hits masq_device_event > > and go further to conntrack code. > > Yes that's where the notifier comes from, which happens with or without > my patch. Thanks for the explanation, Dave! I'll continue on this task tomorrow, trying to implement the optimization you proposed.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
From: Cyrill Gorcunov Date: Wed, 9 Mar 2016 23:57:47 +0300 > Aha! So in your patch __inet_del_ifa bypass first blocking_notifier_call_chain > > __inet_del_ifa > ... > if (in_dev->dead) > goto no_promotions; > > // First call to NETDEV_DOWN > ... > no_promotions: > rtmsg_ifa(RTM_DELADDR, ifa1, nlh, portid); > blocking_notifier_call_chain(&inetaddr_chain, NETDEV_DOWN, ifa1); > > and here we call for NETDEV_DOWN, which then hits masq_device_event > and go further to conntrack code. Yes that's where the notifier comes from, which happens with or without my patch.
Re: [net-next PATCH 2/7] mlx4: use napi_consume_skb API to get bulk free operations
From: Alexander Duyck Date: Wed, 9 Mar 2016 08:47:58 -0800 > On Wed, Mar 9, 2016 at 3:00 AM, Jesper Dangaard Brouer > wrote: >> Passing the budget down was Alex'es design. Axel any thoughts? > > I'd say just use dev_consume_skb_any in the bulk free instead of > dev_consume_skb_irq. This is slow path, as you said, so it shouldn't > come up often. Agreed. >> I do wonder how expensive this check is... as it goes into a code >> hotpath, which is very unlikely. The good thing would be, that we >> handle if buggy drivers call this function from a none softirq context >> (as these bugs could be hard to catch). >> >> Can netpoll ever be called from softirq or with BH disabled? (It >> disables IRQs, which would break calling kmem_cache_free_bulk). > > It is better for us to switch things out so that the napi_consume_skb > is the fast path with dev_consume_skb_any as the slow. There are too > many scenarios where we could be invoking something that makes use of > this within the Tx path so it is probably easiest to just solve it > that way so we don't have to deal with it again in the future. Indeed.
Re: [net-next PATCH 3/4] vxlan: Enforce IP ID verification on outer headers
From: Jesse Gross Date: Mon, 7 Mar 2016 15:42:59 -0800 > On Mon, Mar 7, 2016 at 3:06 PM, Alex Duyck wrote: >> On Mon, Mar 7, 2016 at 11:09 AM, David Miller wrote: >>> From: Or Gerlitz >>> Date: Mon, 7 Mar 2016 20:05:20 +0200 >>> On Mon, Mar 7, 2016 at 7:22 PM, Alexander Duyck wrote: > This change enforces the IP ID verification on outer headers. As a result > if the DF flag is not set on the outer header we will force the flow to be > flushed in the event that the IP ID is out of sequence with the existing > flow. Can you please state the precise requirement for aggregation w.r.t IP IDs here? and point to where/how this is enforced, e.g for non-tunneled TCP GRO-ing? >>> >>> I also didn't see a nice "PATCH 0/4" posting explaining this series and >>> I'd really like to see that. >> >> Sorry about that. I forgot to add the cover page when I sent this. >> >> The enforcement is coming from the IP and TCP layers. If you take a >> look in inet_gro_receive we have the NAPI_GRO_CB(p)->flush_id value >> being populated based on the difference between the expected ID and >> the received one. So for IPv4 we overwrite it, and for IPv6 we set it >> to 0. The only consumer currently using it is TCP in tcp_gro_receive. >> The problem is with tunnels we lose the data for the outer when the >> inner overwrites it, as a result we can put whatever we want currently >> in the outer IP ID and it will be accepted. >> >> The patch set is based off of a conversation several of us had on the >> list about doing TSO for tunnels and the fact that the IP IDs for the >> outer header have to advance. It makes it easier for me to validate >> that I am doing things properly if GRO doesn't destroy the IP ID data >> for the outer headers. > > In net/ipv4/af_inet.c:inet_gro_receive() there is the following > comment above where NAPI_GRO_CB(p)->flush_id is set: > > /* Save the IP ID check to be included later when we get to > * the transport layer so only the inner most IP ID is checked. 
> * This is because some GSO/TSO implementations do not > * correctly increment the IP ID for the outer hdrs. > */ > > There was a long discussion about this a couple years ago and the > conclusion was that it is the inner IP ID is really the important one > in the case of encapsulation. Obviously, things like TCP/IP header > compression don't apply to the outer encapsulation header. That's how I remember the conversation going as wel...
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 03:47:25PM -0500, David Miller wrote: > From: Cyrill Gorcunov > Date: Wed, 9 Mar 2016 23:41:58 +0300 > > > On Wed, Mar 09, 2016 at 03:27:30PM -0500, David Miller wrote: > >> > > >> > Yes. I can drop it off for a while and run tests without it, > >> > then turn it back and try again. Would you like to see such > >> > numbers? > >> > >> That would be very helpful, yes. > > > > Just sent out. Take a look please. Indeed it sits inside get_next_corpse > > a lot. And now I think I've to figure out where we can optimize it. > > Continue tomorrow. > > The problem is that the masquerading code flushes the entire conntrack > table once for _every_ address removed. > > The code path is: > > masq_device_event() > if (event == NETDEV_DOWN) { > /* Device was downed. Search entire table for >* conntracks which were associated with that device, >* and forget them. >*/ > NF_CT_ASSERT(dev->ifindex != 0); > > nf_ct_iterate_cleanup(net, device_cmp, > (void *)(long)dev->ifindex, 0, 0); > > So if you have a million IP addresses, this flush happens a million times > on inetdev destroy. > > Part of the problem is that we emit NETDEV_DOWN inetdev notifiers per > address removed, instead of once per inetdev destroy. > > Maybe if we put some boolean state into the inetdev, we could make sure > we did this flush only once time while inetdev->dead = 1. Aha! So in your patch __inet_del_ifa bypass first blocking_notifier_call_chain __inet_del_ifa ... if (in_dev->dead) goto no_promotions; // First call to NETDEV_DOWN ... no_promotions: rtmsg_ifa(RTM_DELADDR, ifa1, nlh, portid); blocking_notifier_call_chain(&inetaddr_chain, NETDEV_DOWN, ifa1); and here we call for NETDEV_DOWN, which then hits masq_device_event and go further to conntrack code.
Re: [PATCH net 0/3] validate variable length ll headers
From: Willem de Bruijn Date: Fri, 4 Mar 2016 15:44:14 -0500 > From: Willem de Bruijn > > Allow device-specific validation of link layer headers. Existing > checks drop all packets shorter than hard_header_len. For variable > length protocols, such packets can be valid. > > patch 1 adds header_ops.validate and dev_validate_header > patch 2 replaces ll_header_truncated with dev_validate_header > patch 3 implements the protocol specific callback for AX25 > > Tested with a temporary eth_header_validate function. The AX25 > code is compile-tested only at this point. I'm not going to be able to send another pull request to Linus before -final, so please respin this against net-next and I'll queue it up for -stable. You can add the missing Fixes: tags as well when you do this. Thanks.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
From: Cyrill Gorcunov Date: Wed, 9 Mar 2016 23:41:58 +0300 > On Wed, Mar 09, 2016 at 03:27:30PM -0500, David Miller wrote: >> > >> > Yes. I can drop it off for a while and run tests without it, >> > then turn it back and try again. Would you like to see such >> > numbers? >> >> That would be very helpful, yes. > > Just sent out. Take a look please. Indeed it sits inside get_next_corpse > a lot. And now I think I've to figure out where we can optimize it. > Continue tomorrow. The problem is that the masquerading code flushes the entire conntrack table once for _every_ address removed. The code path is: masq_device_event() if (event == NETDEV_DOWN) { /* Device was downed. Search entire table for * conntracks which were associated with that device, * and forget them. */ NF_CT_ASSERT(dev->ifindex != 0); nf_ct_iterate_cleanup(net, device_cmp, (void *)(long)dev->ifindex, 0, 0); So if you have a million IP addresses, this flush happens a million times on inetdev destroy. Part of the problem is that we emit NETDEV_DOWN inetdev notifiers per address removed, instead of once per inetdev destroy. Maybe if we put some boolean state into the inetdev, we could make sure we did this flush only once time while inetdev->dead = 1.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 03:27:30PM -0500, David Miller wrote: > > > > Yes. I can drop it off for a while and run tests without it, > > then turn it back and try again. Would you like to see such > > numbers? > > That would be very helpful, yes. Just sent out. Take a look please. Indeed it sits inside get_next_corpse a lot. And now I think I've to figure out where we can optimize it. Continue tomorrow.
Re: [PATCH] inet: set `error' value not under lock_sock().
From: Weongyo Jeong Date: Wed, 9 Mar 2016 10:22:21 -0800 > A trivial patch to set `error' variable while not holding > lock_sock(). > > Signed-off-by: Weongyo Jeong Deferring the assignment means gcc doesn't have to potentially put it on the stack across the lock_sock() call. You're making the code worse not better, the assignment into a local variable is not an expensive operation. I'm not applying this change, sorry.
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
From: Cyrill Gorcunov Date: Wed, 9 Mar 2016 20:53:07 +0300 > On Wed, Mar 09, 2016 at 12:24:00PM -0500, David Miller wrote: > ... >> We asked you for numbers without a lot of features enabled, it'll >> help us diagnose which subsystem still causes a lot of overhead >> much more clearly. >> >> So please do so. > > Sure. Gimme some time and I'll back with numbers. > >> Although it's already pretty clear that netfilter conntrack >> cleanup is insanely expensive. > > Yes. I can drop it off for a while and run tests without it, > then turn it back and try again. Would you like to see such > numbers? That would be very helpful, yes.
Re: [PATCH 0/3] net: macb: Fix coding style issues
From: Michal Simek Date: Wed, 9 Mar 2016 18:29:01 +0100 > On 9.3.2016 18:22, David Miller wrote: >> From: Michal Simek >> Date: Wed, 9 Mar 2016 17:29:39 +0100 >> >>> On 7.3.2016 18:13, Nicolas Ferre wrote: Le 07/03/2016 17:17, Moritz Fischer a écrit : > Hi Nicolas, > > this series deals with most of the checkpatch warnings > generated for macb. There are two BUG_ON()'s that I didn't touch, yet, > that were suggested by checkpatch, that I can address in a follow up > commit if needed. > Let me know if you want me to split the fixes differently or squash > them into one commit. Hi, I'm not usually fond of this type of patches, but I must admit that this series corrects some style issues. So, I would like more feedback from Michal and Cyrille as these changes may delay some of the not-merged-yet features or more important work-in-progress on their side. On the other hand, if we all think it's a calm period for this macb driver, we may find interesting to merge some "cleanup and style" enhancements. >>> >>> Not a problem with merging cleanups in general. We have several out of >>> tree patches but doesn't make sense to to wait. >>> I wasn't in cc for the series but I don't like this change to be the >>> part of cleanup series. >>> >>> mac = of_get_mac_address(np); >>> if (mac) >>> - memcpy(bp->dev->dev_addr, mac, ETH_ALEN); >>> + ether_addr_copy(bp->dev->dev_addr, mac); >> >> Why? This is what we tell people to use. > > I would expect this as separate patch not the part of one huge cleanup > patch which does just comment and space cleanups. That is true.
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
On Wed, Mar 09, 2016 at 02:32:05PM -0500, Vivien Didelot wrote: > Hi Andrew, > > Andrew Lunn writes: > > >> -static bool dsa_slave_dev_check(struct net_device *dev) > >> -{ > >> - return dev->netdev_ops == &dsa_slave_netdev_ops; > >> -} > > > > Where is the equivalent of this happening? Where do we check that the > > interface added to the bridge is part of the switch? > > Why should we check that? In this RFC, br_if.c tries to set the new > attribute to the net_device, when creating and deleting the net bridge > port. If it supports attr_set and this attribute, then we're good. Or am > I missing something? One of us is missing something... What happens if I have two DSA clusters? We probably want to limit the object to only being passed to the DSA cluster which contains the port, or once we receive the object, we verify it belongs to the cluster processing it. What happens when a team/bond interface is added to the bridge? In the future we need to know about this, so we can add the trunk in Marvell's terms to the bridge. > > How about team/bonding? We are not ready to implement it yet with the > > Marvell devices, but at some point we probably will. Won't we need the > > events then? We need to know when a switch port has been added to a > > team? > > > > Or do you think a switchdev object will be added for this case? > > Mellanox already have the ability to add switch interfaces to a team, > > and then add the team to a bridge. So we need to ensure your solution > > works for such stacked systems. > > Indeed these features can be propagated through new switchdev attributes > or objects. > > I think it'd be preferable to factorize the switch related operations > into the switchdev API, instead of having every single switchdev user > implement its custom (but similar) listeners and checks for global > netdev events. What do you think? Centralizing the code would be good.
But DSA is way behind what Mellanox can do, so you need to look at how your changes fit into their driver. During a netdev 1.1 BOF there was a conversation about the stack of interfaces, teams/bonds, bridges, etc. If the video is available, you might find it interesting. Andrew
pull-request: wireless-drivers-next 2016-03-09
Hi Dave,

here's a pull request for 4.6. I'm planning to send one more but I'm not sure if it will make it in time, we'll see. Notable changes here are refactoring in bcma to create a common flash driver, brcmfmac platform data improvements in include/linux, and the beginning of AHB bus support for ath10k along with a device tree binding update. More info in the signed tag below.

Oh, and I just noticed that, weirdly, my name appeared in the author field on one of Janusz's ath9k patches. But I didn't consider this bad enough to rebase everything, so I let it be. Sorry Janusz :)

commit b9a9693fd9aea43f50b107dfc8cbaea317f95a79
Author: Kalle Valo
Date:   Fri Nov 27 09:37:14 2015 +0100

    ath9k: request NOA update when chanctx active

Please let me know if you have any problems.

Kalle

The following changes since commit 00a1f0a93dea3cf1c141df79bfd06e7c9ee54162:

  Merge branch 'reset_mac_header' (2016-03-04 22:45:14 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2016-03-09

for you to fetch changes up to 836856e3bd61d0644e5178a2c1b51d90459e2788:

  wireless: cw1200: use __maybe_unused to hide pm functions (2016-03-08 12:32:52 +0200)

wireless-drivers patches for 4.6

Major changes:

ath10k
* dt: add bindings for ipq4019 wifi block
* start adding support for qca4019 chip

ath9k
* add device ID for Toshiba WLM-20U2/GN-1080
* allow more than one interface on DFS channels

bcma
* move flash detection code to ChipCommon core driver

brcmfmac
* IPv6 Neighbor discovery offload
* driver settings that can be populated from different sources
* country code setting in firmware
* length checks to validate firmware events
* new way to determine device memory size needed for BCM4366
* various offloads during Wake on Wireless LAN (WoWLAN)
* full Management Frame Protection (MFP) support

iwlwifi
* add support for thermal device / cooling device
* improvements in scheduled scan without profiles
* new firmware support (-21.ucode)
* add MSIX support for 9000 devices
* enable MU-MIMO and take care of firmware restart
* add support for large SKBs in mvm to reach A-MSDU
* add support for filtering frames from a BA session
* start implementing the new Rx path for 9000 devices
* enable the new Radio Resource Management (RRM) nl80211 feature flag
* add a new module parameter to disable VHT
* build infrastructure for Dynamic Queue Allocation

Alexander Tsoy (1):
      ath9k_htc: add device ID for Toshiba WLM-20U2/GN-1080

Alexey Khoroshilov (1):
      at76c50x-usb: avoid double usb_put_dev() after downloading internal firmware in at76_probe()

Amitkumar Karwar (1):
      mwifiex: fix corner case association failure

Amitoj Kaur Chawla (1):
      mwifiex: Use to_delayed_work()

Andrei Otcheretianski (1):
      iwlwifi: add disable_11ac module param

Anilkumar Kolli (1):
      ath10k: reduce number of peers to support peer stats feature

Anthony Wong (1):
      rt2x00: add new rt2800usb device Buffalo WLI-UC-G450

Anton Protopopov (1):
      ath10k: fix erroneous return value

Arend van Spriel (1):
      brcmfmac: change function name for brcmf_cfg80211_wait_vif_event_timeout()

Arnd Bergmann (2):
      ath9k: reduce stack usage in ar9003_aic_cal_post_process
      wireless: cw1200: use __maybe_unused to hide pm functions

Ashok Raj Nagarajan (2):
      ath10k: fix pktlog in QCA99X0
      ath10k: add hw_rev to trace events to support pktlog

Avri Altman (2):
      iwlwifi: mvm: forbid U-APSD for P2P Client if the firmware doesn't support it
      iwlwifi: mvm: Send power command on BSS_CHANGED_BEACON_INFO if needed

Beni Lev (1):
      iwlwifi: mvm: Set global RRM capability

Bruno Randolf (3):
      rtl8xxxu: Enable monitor mode by handling filters
      rtl8xxxu: Document REG_RXFLTMAP registers
      rtl8xxxu: Enable data frame reception in rtl8xxxu_start

Chaya Rachel Ivgi (4):
      iwlwifi: mvm: add CT-KILL notification
      iwlwifi: mvm: add registration to thermal zone
      iwlwifi: mvm: add registration to cooling device
      iwlwifi: mvm: update ucode status before stopping device

Christian Lamparter (1):
      carl9170: import 1.9.9 firmware headers

Colin Ian King (2):
      rtlwifi: pass struct rtl_stats by reference as it is more efficient
      mt7601u: do not free dma_buf when ivp allocation fails

Dan Carpenter (1):
      mwifiex: fix an indenting mistake

Emmanuel Grumbach (15):
      Merge tag 'mac80211-next-for-davem-2016-02-26' into next2
      Merge tag 'iwlwifi-for-kalle-2016-02-15' into HEAD
      iwlwifi: mvm: bump firmware API to 21
      iwlwifi: pcie: aggregate Flow Handler configuration writes
      iwlwifi: pcie: fix identation in trans.c
      iwlwifi: mvm: send large SKBs to the transport
      iwlwifi: mvm: add Tx A-MSDU inside A-MPDU
      iwlwifi: mvm:
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 08:53:07PM +0300, Cyrill Gorcunov wrote:
> On Wed, Mar 09, 2016 at 12:24:00PM -0500, David Miller wrote:
> ...
> > We asked you for numbers without a lot of features enabled, it'll
> > help us diagnose which subsystem still causes a lot of overhead
> > much more clearly.
> >
> > So please do so.
>
> Sure. Gimme some time and I'll be back with numbers.

OK, here are the results (with preempt-debug/trace disabled, on the kernel with David's two patches).

---
No conntrack:

[root@s125 ~]# ./exploit.sh
START 4 addresses
STOP 1457549979 1457549980 -> 1
START 2704 addresses
STOP 1457549983 1457549984 -> 1
START 10404 addresses
STOP 1457549996 1457549997 -> 1
START 23104 addresses
STOP 1457550029 1457550030 -> 1
START 40804 addresses
STOP 1457550103 1457550104 -> 1
START 63504 addresses
STOP 1457550267 1457550268 -> 1

Everything works quite fast; each pass takes 1 second.

With conntrack:

1) In the middle of the "release -> create new addresses" transition

27.53%  [kernel]  [k] __local_bh_enable_ip
26.29%  [kernel]  [k] _raw_spin_lock
 6.00%  [kernel]  [k] nf_ct_iterate_cleanup
 3.95%  [kernel]  [k] nf_conntrack_lock
 2.94%  [kernel]  [k] _raw_spin_unlock
 1.91%  [kernel]  [k] _cond_resched
 1.78%  [kernel]  [k] check_lifetime
 1.25%  [kernel]  [k] __inet_insert_ifa
 1.19%  [kernel]  [k] inet_rtm_newaddr

2) Last one, with 63K addresses being released

36.36%  [kernel]  [k] __local_bh_enable_ip
34.75%  [kernel]  [k] _raw_spin_lock
 7.93%  [kernel]  [k] nf_ct_iterate_cleanup
 5.11%  [kernel]  [k] nf_conntrack_lock
 3.71%  [kernel]  [k] _raw_spin_unlock
 2.51%  [kernel]  [k] _cond_resched
 0.89%  [kernel]  [k] task_tick_fair
 0.77%  [kernel]  [k] native_write_msr_safe
 0.58%  [kernel]  [k] hrtimer_active
 0.52%  [kernel]  [k] rcu_check_callbacks

[root@s125 ~]# ./exploit.sh
START 4 addresses
STOP 1457552395 1457552397 -> 2
START 2704 addresses
STOP 1457552399 1457552403 -> 4
START 10404 addresses
STOP 1457552415 1457552429 -> 14
START 23104 addresses
STOP 1457552461 1457552492 -> 31
START 40804 addresses
STOP 1457552566 1457552620 -> 54
START 63504 addresses
STOP 1457552785 1457552870 -> 85

At the final stage it took 85 seconds to become alive again. All of the time is eaten inside nf_ct_iterate_cleanup (actually inside get_next_corpse). IIRC there was a perf feature which could annotate the instructions, no? I have to refresh my memory on how to use perf record and friends...
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Hi Andrew,

Andrew Lunn writes:

>> -static bool dsa_slave_dev_check(struct net_device *dev)
>> -{
>> -	return dev->netdev_ops == &dsa_slave_netdev_ops;
>> -}
>
> Where is the equivalent of this happening? Where do we check that the
> interface added to the bridge is part of the switch?

Why should we check that? In this RFC, br_if.c tries to set the new attribute on the net_device when creating and deleting the net bridge port. If it supports attr_set and this attribute, then we're good. Or am I missing something?

>> -int dsa_slave_netdevice_event(struct notifier_block *unused,
>> -			      unsigned long event, void *ptr)
>> -{
>> -	struct net_device *dev;
>> -	int err = 0;
>> -
>> -	switch (event) {
>> -	case NETDEV_CHANGEUPPER:
>> -		dev = netdev_notifier_info_to_dev(ptr);
>> -		if (!dsa_slave_dev_check(dev))
>> -			goto out;
>> -
>> -		err = dsa_slave_master_changed(dev);
>> -		if (err && err != -EOPNOTSUPP)
>> -			netdev_warn(dev, "failed to reflect master change\n");
>> -
>> -		break;
>> -	}
>> -
>> -out:
>> -	return NOTIFY_DONE;
>> -}
>
> How about team/bonding? We are not ready to implement it yet with the
> Marvell devices, but at some point we probably will. Won't we need the
> events then? We need to know when a switch port has been added to a
> team?
>
> Or do you think a switchdev object will be added for this case?
> Mellanox already have the ability to add switch interfaces to a team,
> and then add the team to a bridge. So we need to ensure your solution
> works for such stacked systems.

Indeed, these features can be propagated through new switchdev attributes or objects. I think it'd be preferable to factor the switch-related operations into the switchdev API, instead of having every single switchdev user implement its own custom (but similar) listeners and checks for global netdev events. What do you think?

Best,
Vivien
Re: [RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Hello.

On 03/09/2016 08:42 PM, Vivien Didelot wrote:

> Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is
> set before adding a port to a bridge and deleting a port from a bridge.
>
> The main purpose for this attribute is to provide switchdev users a
> simple and common way to retrieve bridging information, instead of
> implementing complex notifier blocks to listen to global netdev events.
>
> We can also imagine a switchdev user returning an error different from
> -EOPNOTSUPP in the prepare phase to prevent a port from being bridged.
>
> Signed-off-by: Vivien Didelot
> ---
>  include/net/switchdev.h |  2 ++
>  net/bridge/br_if.c      | 27 +++
>  2 files changed, 29 insertions(+)
[...]
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index a73df33..105b9fd 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -28,6 +28,24 @@
>
>  #include "br_private.h"
>
> +static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br,
> +			       bool join)
> +{
> +	struct switchdev_attr attr = {
> +		.orig_dev = br->dev,
> +		.id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF,
> +		.flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
> +		.u.join = join,
> +	};
> +	int err;
> +
> +	err = switchdev_port_attr_set(dev, &attr);
> +	if (err && err != -EOPNOTSUPP)

   Enough to only do the latter comparison.

> +		return err;
> +
> +	return 0;
> +}
> +
>  /*
>   * Determine initial path cost based on speed.
>   * using recommendations from 802.1d standard
[...]

MBR, Sergei
Re: [PATCH V4 0/3] basic busy polling support for vhost_net
On Fri, 4 Mar 2016 06:24:50 -0500 Jason Wang wrote:

> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx/rx processing, busy poll for a new tx added
> descriptor and rx receive socket for a while. The maximum amount of
> time (in us) that could be spent on busy polling is specified via ioctl.
>
> Tests were done through:
>
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected mlx4

Hi Jason,

Could this also improve performance if both VMs are on the same host system?

> - Guest with 1 vcpus and 1 queue
>
> Results:
> - Obvious improvements (5% - 20%) for latency (TCP_RR).
> - Get a better or minor regression on most of the TX tests, but see
>   some regression on 4096 size.
> - Except for 8 sessions of 4096 size RX, have a better or same
>   performance.
> - CPU utilization was increased as expected.
>
> TCP_RR:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>     1/   1/  +8%/ -32%/  +8%/  +8%/  +7%
>     1/  50/  +7%/ -19%/  +7%/  +7%/  +1%
>     1/ 100/  +5%/ -21%/  +5%/  +5%/   0%
>     1/ 200/  +5%/ -21%/  +7%/  +7%/  +1%
>    64/   1/ +11%/ -29%/ +11%/ +11%/ +10%
>    64/  50/  +7%/ -19%/  +8%/  +8%/  +2%
>    64/ 100/  +8%/ -18%/  +9%/  +9%/  +2%
>    64/ 200/  +6%/ -19%/  +6%/  +6%/   0%
>   256/   1/  +7%/ -33%/  +7%/  +7%/  +6%
>   256/  50/  +7%/ -18%/  +7%/  +7%/   0%
>   256/ 100/  +9%/ -18%/  +8%/  +8%/  +2%
>   256/ 200/  +9%/ -18%/ +10%/ +10%/  +3%
>  1024/   1/ +20%/ -28%/ +20%/ +20%/ +19%
>  1024/  50/  +8%/ -18%/  +9%/  +9%/  +2%
>  1024/ 100/  +6%/ -19%/  +5%/  +5%/   0%
>  1024/ 200/  +8%/ -18%/  +9%/  +9%/  +2%
>
> Guest TX:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>    64/   1/  -5%/ -28%/  +11%/ +12%/  +10%
>    64/   4/  -2%/ -26%/  +13%/ +13%/  +13%
>    64/   8/  -6%/ -29%/   +9%/ +10%/  +10%
>   512/   1/ +15%/  -7%/  +13%/ +11%/   +3%
>   512/   4/ +17%/  -6%/  +18%/ +13%/  +11%
>   512/   8/ +14%/  -7%/  +13%/  +7%/   +7%
>  1024/   1/ +27%/  -2%/  +26%/ +29%/  +12%
>  1024/   4/  +8%/  -9%/   +6%/  +1%/   +6%
>  1024/   8/ +41%/ +12%/  +34%/ +20%/   -3%
>  4096/   1/ -22%/ -21%/  -36%/ +81%/ +1360%
>  4096/   4/ -57%/ -58%/ +286%/ +15%/ +2074%
>  4096/   8/ +67%/ +70%/  -45%/  -8%/  +63%
> 16384/   1/  -2%/  -5%/   +5%/  -3%/  +80%
> 16384/   4/   0%/   0%/    0%/  +4%/ +138%
> 16384/   8/   0%/   0%/    0%/  +1%/  +41%
> 65535/   1/  -3%/  -6%/   +2%/ +11%/ +113%
> 65535/   4/  -2%/  -1%/   -2%/  -3%/ +484%
> 65535/   8/   0%/  +1%/    0%/  +2%/  +40%
>
> Guest RX:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>    64/   1/ +31%/  -3%/  +8%/  +8%/  +8%
>    64/   4/ +11%/ -17%/ +13%/ +14%/ +15%
>    64/   8/  +4%/ -23%/ +11%/ +11%/ +12%
>   512/   1/ +24%/   0%/ +18%/ +14%/  -8%
>   512/   4/  +4%/ -15%/  +6%/  +5%/  +6%
>   512/   8/ +26%/   0%/ +21%/ +10%/  +3%
>  1024/   1/ +88%/ +47%/ +69%/ +44%/ -30%
>  1024/   4/ +18%/  -5%/ +19%/ +16%/  +2%
>  1024/   8/ +15%/  -4%/ +13%/  +8%/  +1%
>  4096/   1/  -3%/  -5%/  +2%/  -2%/ +41%
>  4096/   4/  +2%/  +3%/ -20%/ -14%/ -24%
>  4096/   8/ -43%/ -45%/ +69%/ -24%/ +94%
> 16384/   1/  -3%/ -11%/ +23%/  +7%/ +42%
> 16384/   4/  -3%/  -3%/  -4%/  +5%/ +115%
> 16384/   8/  -1%/   0%/  -1%/  -3%/ +32%
> 65535/   1/  +1%/   0%/  +2%/   0%/ +66%
> 65535/   4/  -1%/  -1%/   0%/  +4%/ +492%
> 65535/   8/   0%/  -1%/  -1%/  +4%/ +38%
>
> Changes from V3:
> - drop single_task_running()
> - use cpu_relax_lowlatency() instead of cpu_relax()
>
> Changes from V2:
> - rename vhost_vq_more_avail() to vhost_vq_avail_empty(), and return
>   false when __get_user() fails.
> - do not bother preemption/timers for the good path.
> - use vhost_vring_state as the ioctl parameter instead of reinventing a
>   new one.
> - add the unit of the timeout (us) to the comment of the new ioctls
>
> Changes from V1:
> - remove the buggy vq_error() in vhost_vq_more_avail().
> - leave vhost_enable_notify() untouched.
>
> Changes from RFC V3:
> - small tweak on the code to avoid multiple duplicate conditions in the
>   critical path when busy loop is not enabled.
> - add the test result of multiple VMs
>
> Changes from RFC V2:
> - poll also at the end of rx handling
> - factor out the polling logic and optimize the code a little bit
> - add two ioctls to get and set the busy poll timeout
> - test on ixgbe (which can give more stable and reproducible numbers)
>   instead of mlx4.
>
> Changes from RFC V1:
> - add a comment for vhost_has_work() to explain why it could be
>   lockless
> - add param description for busyloop_timeout
> - split out the busy polling logic into a new helper
> - check and exit the loop when there's a pending signal
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Wed, Mar 09, 2016 at 07:32:13PM CET, and...@lunn.ch wrote:
>Hi Vivien
>
>> -static bool dsa_slave_dev_check(struct net_device *dev)
>> -{
>> -	return dev->netdev_ops == &dsa_slave_netdev_ops;
>> -}
>
>Where is the equivalent of this happening? Where do we check that the
>interface added to the bridge is part of the switch?
>
>> -int dsa_slave_netdevice_event(struct notifier_block *unused,
>> -			      unsigned long event, void *ptr)
>> -{
>> -	struct net_device *dev;
>> -	int err = 0;
>> -
>> -	switch (event) {
>> -	case NETDEV_CHANGEUPPER:
>> -		dev = netdev_notifier_info_to_dev(ptr);
>> -		if (!dsa_slave_dev_check(dev))
>> -			goto out;
>> -
>> -		err = dsa_slave_master_changed(dev);
>> -		if (err && err != -EOPNOTSUPP)
>> -			netdev_warn(dev, "failed to reflect master change\n");
>> -
>> -		break;
>> -	}
>> -
>> -out:
>> -	return NOTIFY_DONE;
>> -}
>
>How about team/bonding? We are not ready to implement it yet with the
>Marvell devices, but at some point we probably will. Won't we need the
>events then? We need to know when a switch port has been added to a
>team?
>
>Or do you think a switchdev object will be added for this case?
>Mellanox already have the ability to add switch interfaces to a team,
>and then add the team to a bridge. So we need to ensure your solution
>works for such stacked systems.

I have to look at this more closely tomorrow, but I'm missing the motivation behind this. Using the existing notifiers, drivers can easily monitor what is going on with their uppers. Why does this need to be changed?
Re: Unexpected tcpv6 connection resets since linux 4.4
On Mon, Mar 7, 2016 at 15:58, Cong Wang wrote:
> On Sun, Mar 6, 2016 at 7:10 AM, Andreas Messer wrote:
> > i have updated two of my machines in the last weeks to linux 4.4.1 and
> > linux 4.4.3. It seems that since then i get unexpected TCPv6 connection
> > resets when connecting to these machines remotely. The issue occurs with
> > sshd and with a http service. /etc/hosts.deny and /etc/hosts.allow are
> > empty on both server machines. I'm not so into IPv6 and have no idea what's
> > going on. Please find attached a network trace from one of the machines
> > when connecting with ssh (on port 23 for debugging).
>
> Sounds like the problem fixed by the following commit:
>
> commit 9cf7490360bf2c46a16b7525f899e4970c5fc144
> Author: Eric Dumazet
> Date: Tue Feb 2 19:31:12 2016 -0800
>
>     tcp: do not drop syn_recv on all icmp reports

Thanks for the tip. I applied the patch to v4.4.1 and it solved the problem.

Cheers,
Andreas
[PATCH net-next v3] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
Per RFC4898, they count segments sent/received containing a positive length data segment (that includes retransmission segments carrying data). Unlike tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments carrying no data (e.g. pure ack).

The patch also updates the segs_in in tcp_fastopen_add_skb() so that the segs_in >= data_segs_in property is kept.

Together with retransmission data, tcpi_data_segs_out gives a better signal on the rxmit rate.

v3: Add const modifier to the skb parameter in tcp_segs_in()

v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

Signed-off-by: Martin KaFai Lau
Cc: Chris Rapier
Cc: Eric Dumazet
Cc: Marcelo Ricardo Leitner
Cc: Neal Cardwell
Cc: Yuchung Cheng
---
 include/linux/tcp.h      |  6 ++++++
 include/net/tcp.h        | 10 ++++++++++
 include/uapi/linux/tcp.h |  2 ++
 net/ipv4/tcp.c           |  2 ++
 net/ipv4/tcp_fastopen.c  |  4 ++++
 net/ipv4/tcp_ipv4.c      |  2 +-
 net/ipv4/tcp_minisocks.c |  2 +-
 net/ipv4/tcp_output.c    |  4 +++-
 net/ipv6/tcp_ipv6.c      |  2 +-
 9 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bcbf51d..7be9b12 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -158,6 +158,9 @@ struct tcp_sock {
 	u32	segs_in;	/* RFC4898 tcpEStatsPerfSegsIn
 				 * total number of segments in.
 				 */
+	u32	data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
 	u32	rcv_nxt;	/* What we want to receive next	*/
 	u32	copied_seq;	/* Head of yet unread data	*/
 	u32	rcv_wup;	/* rcv_nxt on last window update sent	*/
@@ -165,6 +168,9 @@ struct tcp_sock {
 	u32	segs_out;	/* RFC4898 tcpEStatsPerfSegsOut
 				 * The total number of segments sent.
 				 */
+	u32	data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
 	u64	bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
 				 * sum(delta(snd_una)), or how many bytes
 				 * were acked.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e90db85..24557a8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff *skb)
 	skb->truesize = 2;
 }
 
+static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
+{
+	u16 segs_in;
+
+	segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+	tp->segs_in += segs_in;
+	if (skb->len > tcp_hdrlen(skb))
+		tp->data_segs_in += segs_in;
+}
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index fe95446..53e8e3f 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -199,6 +199,8 @@ struct tcp_info {
 	__u32	tcpi_notsent_bytes;
 	__u32	tcpi_min_rtt;
+	__u32	tcpi_data_segs_in;	/* RFC4898 tcpEStatsDataSegsIn */
+	__u32	tcpi_data_segs_out;	/* RFC4898 tcpEStatsDataSegsOut */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9faadb..6b01b48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2728,6 +2728,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_notsent_bytes = max(0, notsent_bytes);
 
 	info->tcpi_min_rtt = tcp_min_rtt(tp);
+	info->tcpi_data_segs_in = tp->data_segs_in;
+	info->tcpi_data_segs_out = tp->data_segs_out;
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fdb286d..f583c85 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock *req,
 void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u16 segs_in;
 
 	if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt)
 		return;
@@ -154,6 +155,9 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	 * as we certainly are not changing upper 32bit value (0)
 	 */
 	tp->bytes_received = skb->len;
+	segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+	tp->segs_in = segs_in;
+	tp->data_segs_in = segs_in;
 
 	if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 		tcp_fin(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4c8d58d..0b02ef7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1650,7 +1650,7 @@ process:
 	sk_incoming_cpu_update(sk);
 
 	bh_lock_so
Re: [RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Hi Vivien

> -static bool dsa_slave_dev_check(struct net_device *dev)
> -{
> -	return dev->netdev_ops == &dsa_slave_netdev_ops;
> -}

Where is the equivalent of this happening? Where do we check that the interface added to the bridge is part of the switch?

> -int dsa_slave_netdevice_event(struct notifier_block *unused,
> -			      unsigned long event, void *ptr)
> -{
> -	struct net_device *dev;
> -	int err = 0;
> -
> -	switch (event) {
> -	case NETDEV_CHANGEUPPER:
> -		dev = netdev_notifier_info_to_dev(ptr);
> -		if (!dsa_slave_dev_check(dev))
> -			goto out;
> -
> -		err = dsa_slave_master_changed(dev);
> -		if (err && err != -EOPNOTSUPP)
> -			netdev_warn(dev, "failed to reflect master change\n");
> -
> -		break;
> -	}
> -
> -out:
> -	return NOTIFY_DONE;
> -}

How about team/bonding? We are not ready to implement it yet with the Marvell devices, but at some point we probably will. Won't we need the events then? We need to know when a switch port has been added to a team?

Or do you think a switchdev object will be added for this case? Mellanox already have the ability to add switch interfaces to a team, and then add the team to a bridge. So we need to ensure your solution works for such stacked systems.

Andrew
[PATCH] inet: set `error' value not under lock_sock().
A trivial patch to set the `error' variable while not holding lock_sock().

Signed-off-by: Weongyo Jeong
---
 net/ipv4/inet_connection_sock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 6414891..58bc39f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -306,14 +306,13 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
 	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	struct request_sock *req;
 	struct sock *newsk;
-	int error;
+	int error = -EINVAL;
 
 	lock_sock(sk);
 
 	/* We need to make sure that this socket is listening,
 	 * and that it has something pending.
 	 */
-	error = -EINVAL;
 	if (sk->sk_state != TCP_LISTEN)
 		goto out_err;
-- 
2.1.3
Re: [PATCH (net.git) 0/2] stmmac: MDIO fixes
Am 09.03.2016 um 10:00 schrieb Giuseppe Cavallaro:
> These two patches are to fix the recent regressions raised
> when testing the stmmac on some platforms.
> Please, I kindly ask you to give me feedback on whether this actually
> covers all the cases and whether the stmmac runs fine on your boxes.
> I have tested on my side the H410 B2120 with an embedded switch
> (so using fixed-link).
>
> Giuseppe Cavallaro (2):
>   Revert "stmmac: Fix 'eth0: No PHY found' regression"
>   stmmac: fix MDIO settings
>
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 11 ++---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c  | 11 +-
>  .../net/ethernet/stmicro/stmmac/stmmac_platform.c  | 22 +--
>  include/linux/stmmac.h                             |  3 +-
>  4 files changed, 20 insertions(+), 27 deletions(-)

Works fine on the Banana Pi M1 board. (Fixes "libphy: PHY stmmac-0: not found")

Regards,
Frank
Re: [RFC] net: ipv4 -- Introduce ifa limit per net
On Wed, Mar 09, 2016 at 12:24:00PM -0500, David Miller wrote:
...
> We asked you for numbers without a lot of features enabled, it'll
> help us diagnose which subsystem still causes a lot of overhead
> much more clearly.
>
> So please do so.

Sure. Gimme some time and I'll be back with numbers.

> Although it's already pretty clear that netfilter conntrack
> cleanup is insanely expensive.

Yes. I can drop it for a while and run the tests without it, then turn it back on and try again. Would you like to see such numbers?

> You're also jumping to a lot of conclusions, work with us to fix the
> fundamental performance problems rather than continually insisting on
> a limit.
>
> We should be able to remove millions of IP addresses in less than
> half a second, no problem. Limits make no sense at all.

Sure, I'll continue experimenting (and turn off preempt as a first step). Sorry if I sounded rough.

Cyrill
[RFC PATCH net-next 2/2] net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF
Add a new dsa_slave_bridge_if function to handle the SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute. Thus remove the code related to the netdev notifier block.

Signed-off-by: Vivien Didelot
---
 net/dsa/dsa.c      |   7 ---
 net/dsa/dsa_priv.h |   2 -
 net/dsa/slave.c    | 113 ++++++++++-----------------
 3 files changed, 30 insertions(+), 92 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index fa4daba..cfb678b 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -977,10 +977,6 @@ static struct packet_type dsa_pack_type __read_mostly = {
 	.func	= dsa_switch_rcv,
 };
 
-static struct notifier_block dsa_netdevice_nb __read_mostly = {
-	.notifier_call	= dsa_slave_netdevice_event,
-};
-
 #ifdef CONFIG_PM_SLEEP
 static int dsa_suspend(struct device *d)
 {
@@ -1047,8 +1043,6 @@ static int __init dsa_init_module(void)
 {
 	int rc;
 
-	register_netdevice_notifier(&dsa_netdevice_nb);
-
 	rc = platform_driver_register(&dsa_driver);
 	if (rc)
 		return rc;
@@ -1061,7 +1055,6 @@ module_init(dsa_init_module);
 
 static void __exit dsa_cleanup_module(void)
 {
-	unregister_netdevice_notifier(&dsa_netdevice_nb);
 	dev_remove_pack(&dsa_pack_type);
 	platform_driver_unregister(&dsa_driver);
 }
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 1d1a546..34d1951 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -64,8 +64,6 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
 void dsa_slave_destroy(struct net_device *slave_dev);
 int dsa_slave_suspend(struct net_device *slave_dev);
 int dsa_slave_resume(struct net_device *slave_dev);
-int dsa_slave_netdevice_event(struct notifier_block *unused,
-			      unsigned long event, void *ptr);
 
 /* tag_dsa.c */
 extern const struct dsa_device_ops dsa_netdev_ops;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 27bf03d..90ef149 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -305,16 +305,38 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	return -EOPNOTSUPP;
 }
 
-static int dsa_slave_stp_update(struct net_device *dev, u8 state)
+static int dsa_slave_bridge_if(struct net_device *dev,
+			       const struct switchdev_attr *attr,
+			       struct switchdev_trans *trans)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 	struct dsa_switch *ds = p->parent;
-	int ret = -EOPNOTSUPP;
+	int err;
 
-	if (ds->drv->port_stp_update)
-		ret = ds->drv->port_stp_update(ds, p->port, state);
+	if (switchdev_trans_ph_prepare(trans)) {
+		if (!ds->drv->port_join_bridge || !ds->drv->port_leave_bridge)
+			return -EOPNOTSUPP;
+		return 0;
+	}
 
-	return ret;
+	if (attr->u.join) {
+		err = ds->drv->port_join_bridge(ds, p->port, attr->orig_dev);
+		if (!err)
+			p->bridge_dev = attr->orig_dev;
+	} else {
+		err = ds->drv->port_leave_bridge(ds, p->port);
+
+		/* When a port leaves a bridge, the bridge layer sets its STP
+		 * state to DISABLED. Restore FORWARDING to keep it functional.
+		 */
+		if (ds->drv->port_stp_update)
+			ds->drv->port_stp_update(ds, p->port,
+						 BR_STATE_FORWARDING);
+
+		p->bridge_dev = NULL;
+	}
+
+	return err;
 }
 
 static int dsa_slave_vlan_filtering(struct net_device *dev,
@@ -354,6 +376,9 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
 	case SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING:
 		ret = dsa_slave_vlan_filtering(dev, attr, trans);
 		break;
+	case SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF:
+		ret = dsa_slave_bridge_if(dev, attr, trans);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 		break;
@@ -439,41 +464,6 @@ static int dsa_slave_port_obj_dump(struct net_device *dev,
 	return err;
 }
 
-static int dsa_slave_bridge_port_join(struct net_device *dev,
-				      struct net_device *br)
-{
-	struct dsa_slave_priv *p = netdev_priv(dev);
-	struct dsa_switch *ds = p->parent;
-	int ret = -EOPNOTSUPP;
-
-	p->bridge_dev = br;
-
-	if (ds->drv->port_join_bridge)
-		ret = ds->drv->port_join_bridge(ds, p->port, br);
-
-	return ret;
-}
-
-static int dsa_slave_bridge_port_leave(struct net_device *dev)
-{
-	struct dsa_slave_priv *p = netdev_priv(dev);
-	struct dsa_switch *ds = p->parent;
-	int ret = -EOPNOTSUPP;
-
-	if (ds->drv->port_leave_bridge)
-		ret = ds->drv->port_leave_bridge(ds, p->port);
-
-	p->bridge_dev = NULL;
-
-	/* Port left the bridge, put in BR_STATE_DISABLED by the bridg
[RFC PATCH net-next 1/2] net: bridge: add switchdev attr for port bridging
Add a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute which is set before adding a port to a bridge and deleting a port from a bridge.

The main purpose for this attribute is to provide switchdev users a simple and common way to retrieve bridging information, instead of implementing complex notifier blocks to listen to global netdev events.

We can also imagine a switchdev user returning an error different from -EOPNOTSUPP in the prepare phase to prevent a port from being bridged.

Signed-off-by: Vivien Didelot
---
 include/net/switchdev.h |  2 ++
 net/bridge/br_if.c      | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d451122..65f8514 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -46,6 +46,7 @@ enum switchdev_attr_id {
 	SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
 	SWITCHDEV_ATTR_ID_PORT_STP_STATE,
 	SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
+	SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF,
 	SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
 };
@@ -58,6 +59,7 @@ struct switchdev_attr {
 		struct netdev_phys_item_id ppid;	/* PORT_PARENT_ID */
 		u8 stp_state;				/* PORT_STP_STATE */
 		unsigned long brport_flags;		/* PORT_BRIDGE_FLAGS */
+		bool join;				/* PORT_BRIDGE_IF */
 		u32 ageing_time;			/* BRIDGE_AGEING_TIME */
 		bool vlan_filtering;			/* BRIDGE_VLAN_FILTERING */
 	} u;
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index a73df33..105b9fd 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -28,6 +28,24 @@
 
 #include "br_private.h"
 
+static int switchdev_bridge_if(struct net_device *dev, struct net_bridge *br,
+			       bool join)
+{
+	struct switchdev_attr attr = {
+		.orig_dev = br->dev,
+		.id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF,
+		.flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
+		.u.join = join,
+	};
+	int err;
+
+	err = switchdev_port_attr_set(dev, &attr);
+	if (err && err != -EOPNOTSUPP)
+		return err;
+
+	return 0;
+}
+
 /*
  * Determine initial path cost based on speed.
  * using recommendations from 802.1d standard
@@ -297,6 +315,10 @@ static void del_nbp(struct net_bridge_port *p)
 	br_netpoll_disable(p);
 
 	call_rcu(&p->rcu, destroy_nbp_rcu);
+
+	if (switchdev_bridge_if(dev, br, false))
+		br_warn(br, "error unbridging port %u(%s)\n",
+			(unsigned int) p->port_no, dev->name);
 }
 
 /* Delete bridge device */
@@ -347,6 +369,11 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br,
 {
 	int index;
 	struct net_bridge_port *p;
+	int err;
+
+	err = switchdev_bridge_if(dev, br, true);
+	if (err)
+		return ERR_PTR(err);
 
 	index = find_portno(br);
 	if (index < 0)
-- 
2.7.2
[RFC PATCH net-next 0/2] net: switchdev: add attribute for port bridging
Current switchdev users implement notifier blocks to filter global netdev events, in order to correctly offload bridging to their hardware ports. Such code could be replaced with the support of a simple switchdev attribute set when adding/deleting a port to/from a bridge.

Also, we can imagine a switch driver or network layer wanting to restrict the number of logical bridges on top of a physical device. That could be done by returning -EOPNOTSUPP when setting such an attribute.

The first patch adds a new SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF switchdev attribute containing a boolean, set when joining or leaving a bridge.

The second patch shows the benefit of supporting such an attribute in the DSA layer. A similar change should be doable for other switchdev users, like Rocker.

Note: I send this as an RFC since I am not really sure about the attribute flags, and the exact place to set it in del_nbp(). Comments needed :-)

Thanks,

Vivien Didelot (2):
  net: bridge: add switchdev attr for port bridging
  net: dsa: support SWITCHDEV_ATTR_ID_PORT_BRIDGE_IF

 include/net/switchdev.h |   2 +
 net/bridge/br_if.c      |  27 ++++
 net/dsa/dsa.c           |   7 ---
 net/dsa/dsa_priv.h      |   2 -
 net/dsa/slave.c         | 113 +---
 5 files changed, 59 insertions(+), 92 deletions(-)

-- 
2.7.2
gretap default MTU question
Hello good people of netdev,

When setting up gretap devices like so:

  ip link add mydev type gretap remote 1.1.1.1 local 2.2.2.2 nopmtudisc

I'm observing two different behaviors:

- On system A, the MTU of 'mydev' is set to the MTU of the 'parent' interface (currently 1600) minus 38. All other interfaces on that system have a default MTU of 1500; only the parent was forced to 1600 to avoid fragmentation. So 'mydev' accurately figures out that its MTU is 1562.

- On system B, with the 'parent' interface MTU set to 1600 and all others defaulting to 1500 (same situation as A), the MTU of 'mydev' gets set to 1462.

I'm trying to figure out which behavior is normal and what mechanism (if any) causes the MTU to be set differently. In both cases the 'parent' device has an MTU of 1600. The code in ip_gre.c does this:

	dev->mtu = ETH_DATA_LEN - t_hlen - 4;

In this case, system B would have the expected behavior, and I need some way to explain what goes on with system A. Of course I can force the MTU on system B, but I was rather pleased with the 'magic' on system A.

If anyone here familiar with this code can offer an explanation, it would greatly ease my curiosity.

Jonathan
Re: Linux netfilter IPT_SO_SET_REPLACE memory corruption
Small correction: IPT_SO_SET_REPLACE is reached via setsockopt, not ioctl.

On Wed, Mar 9, 2016 at 9:25 AM, Ben Hawkes wrote:
> N.B. I was redirected to this list by secur...@kernel.org. I'm told
> that networking related issues such as this are handled on the public
> netdev mailing list.
>
> A memory corruption vulnerability exists in the IPT_SO_SET_REPLACE
> ioctl in the netfilter code for iptables support. This ioctl can be
> triggered by an unprivileged user on PF_INET sockets when unprivileged
> user namespaces are available (CONFIG_USER_NS=y). Android does not
> enable this option, but desktop/server distributions and Chrome OS
> will commonly enable this to allow for containers support or
> sandboxing.
>
> In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
> is possible for a user-supplied ipt_entry structure to have a large
> next_offset field. This field is not bounds checked prior to writing a
> counter value at the supplied offset:
>
>   newpos = pos + e->next_offset;
>   ...
>   e = (struct ipt_entry *)(entry0 + newpos);
>   e->counters.pcnt = pos;
>
> This means that an out of bounds 32-bit write can occur in a 64kb
> range from the allocated heap entry, with a controlled offset and a
> partially controlled write value ("pos") or zero. The attached
> proof-of-concept (netfilter_setsockopt_v3.c) triggers the corruption
> multiple times to set adjacent heap structures to zero.
>
> This issue affects (at least) kernel versions 3.10, 3.18 and 4.4, and
> could be used for local privilege escalation. It appears that a
> similar codepath is accessible via arp_tables.c/ARPT_SO_SET_REPLACE as
> well.
>
> Furthermore, a recent refactoring of this codepath
> (https://github.com/torvalds/linux/commit/2e4e6a17af35be359cc8f1c924f8f198fbd478cc)
> introduced an integer overflow in xt_alloc_table_info, which on 32-bit
> systems can lead to small structure allocation and a copy_from_user
> based heap corruption. The attached proof-of-concept
> (netfilter_setsockopt_v4.c) triggers this issue on 4.4.