date:20151008

Re: [PATCH net-next] net: Fix vti use case with oif in dst lookups for IPv6

2015-10-08 Thread Hajime Tazaki


Hello David,

At Mon,  5 Oct 2015 08:32:51 -0600,
David Ahern wrote:

> 
> diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
> index 30caa289c5db..5cedfda4b241 100644
> --- a/net/ipv6/xfrm6_policy.c
> +++ b/net/ipv6/xfrm6_policy.c
> @@ -37,6 +37,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,
>  
>   memset(&fl6, 0, sizeof(fl6));
>   fl6.flowi6_oif = oif;
> + fl6.flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF;
>   memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
>   if (saddr)
>   memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));

I found that this fix is still not sufficient with the mip6
(Mobile IPv6) use case.

FLOWI_FLAG_SKIP_NH_OIF is not checked anywhere else in ipv6
code, in ip6_route_output() etc.

Even if I add the check (as below), MH packets are still not
sent at all from the mobile node or the home agent.

Do you have any idea?

I have a reproducible setup here with mip6. Let me know if
you need further information.


diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8c0898796ffb..0aba308b5ea3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1169,9 +1169,9 @@ struct dst_entry *ip6_route_output(struct net *net, const struct sock *sk,
 
 	fl6->flowi6_iif = LOOPBACK_IFINDEX;
 
-	if ((sk && sk->sk_bound_dev_if) || rt6_need_strict(&fl6->daddr))
+	if ((sk && sk->sk_bound_dev_if) || rt6_need_strict(&fl6->daddr) ||
+	    (!(fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF) && fl6->flowi6_oif))
 		flags |= RT6_LOOKUP_F_IFACE;
 
 	if (!ipv6_addr_any(&fl6->saddr))
 		flags |= RT6_LOOKUP_F_HAS_SADDR;
 	else if (sk)

-- Hajime
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 06:39:41AM CEST, sfel...@gmail.com wrote:
>On Thu, Oct 8, 2015 at 1:26 AM, Jiri Pirko  wrote:
>> Thu, Oct 08, 2015 at 08:03:35AM CEST, sfel...@gmail.com wrote:
>>>On Wed, Oct 7, 2015 at 10:39 PM, Jiri Pirko  wrote:
>>>> Thu, Oct 08, 2015 at 06:27:07AM CEST, sfel...@gmail.com wrote:
>>>>>On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
>>>>>> From: Jiri Pirko
>>>>>>
>>>>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>>>>> or if he has to use deferred version of this function.
>>>>>>
>>>>>> This also allows drivers to sleep inside attr_set and report operation
>>>>>> status back to switchdev core. Switchdev core then warns if status is
>>>>>> not ok, instead of silent errors happening in drivers.
>>>>>>
>>>>>> Signed-off-by: Jiri Pirko
>>>>>> ---
>>>>>>  include/net/switchdev.h   |   2 +
>>>>>>  net/bridge/br_stp.c   |   4 +-
>>>>>>  net/switchdev/switchdev.c | 113
>>>>>> +-
>>>>>>  3 files changed, 65 insertions(+), 54 deletions(-)
>>>>>>
>>>>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>>> index 89266a3..320be44 100644
>>>>>> --- a/include/net/switchdev.h
>>>>>> +++ b/include/net/switchdev.h
>>>>>> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
>>>>>> struct switchdev_attr *attr);
>>>>>>  int switchdev_port_attr_set(struct net_device *dev,
>>>>>> struct switchdev_attr *attr);
>>>>>> +int switchdev_port_attr_set_deferred(struct net_device *dev,
>>>>>> +struct switchdev_attr *attr);
>>>>>
>>>>>Rather than adding another op, use attr->flags and define:
>>>>>
>>>>>#define SWITCHDEV_F_DEFERRED  BIT(x)
>>>>>
>>>>>So we get:
>>>>>
>>>>>void br_set_state(struct net_bridge_port *p, unsigned int state)
>>>>>{
>>>>>struct switchdev_attr attr = {
>>>>>.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>>>>+  .flags = SWITCHDEV_F_DEFERRED,
>>>>>.u.stp_state = state,
>>>>>};
>>>>>int err;
>>>>>
>>>>>p->state = state;
>>>>>err = switchdev_port_attr_set(p->dev, &attr);
>>>>>if (err && err != -EOPNOTSUPP)
>>>>>br_warn(p->br, "error setting offload STP state on
>>>>>port %u(%s)\n",
>>>>>(unsigned int) p->port_no,
>>>>>p->dev->name);
>>>>>}
>>>>>
>>>>>(And add obj->flags to do the same).
>>>>
>>>> That's what I wanted to avoid. Also because the obj is const and for
>>>> call from work, this flag would have to be removed.
>>>
>>>What did you want to avoid?
>>
>> Having this as a flag. I don't like it too much.
>> But that is cosmetics. Other than that, does the patchset make sense?
>> Do you see some possible issues?
>
>patch 1/3 makes sense, I tested it out and no issues.  (Looks like
>there are other places to assert rtnl_lock, are you going to add
>those?)

Sure, can you pinpoint the places?

>
>patch 2/3: Rather than trying to guess the call context in the core,
>make the caller call the right variant for its context.  That part is
>good.  On the flag vs. no flags, the reasons why I want this as a flag
>are:
>
>a) I want to keep the switchdev ops set to the core set: get/set attr
>and add/del/dump objs.  I've pushed back on changing this before.  I
>don't want ops explosion (like netdev_ops), and I'd like to avoid the
>1000-line patch when the arg list in an op changes, and we need to
>update N drivers.  The flags lets the caller modify the algo behavior,
>while keeping the core call (and args) fixed.
>
>b) the caller can combine flags, where it makes sense.  For example,
>maybe I'm in a locked context and I don't want to recurse the device
>tree, so I would make the call with NO_RECURSE | DEFERRED.  If we
>didn't use flags, then we need to supply ops for each variant on the
>call, and then things explode.

Fair enough. I'll process this in.


>
>patch 3/3 I haven't looked at yet...I'm stuck on 2/3.

It is very similar to 2/3, only for obj_add/del.

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 06:38:10AM CEST, pjonn...@broadcom.com wrote:
>
>
>> -Original Message-
>> From: sfel...@gmail.com [mailto:sfel...@gmail.com]
>> Sent: Friday, October 09, 2015 7:53 AM
>> To: netdev@vger.kernel.org
>> Cc: da...@davemloft.net; j...@resnulli.us; siva.mannem@gmail.com;
>> Premkumar Jonnala; step...@networkplumber.org;
>> ro...@cumulusnetworks.com; and...@lunn.ch; f.faine...@gmail.com;
>> vivien.dide...@savoirfairelinux.com
>> Subject: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down
>> to switchdev
>> 
>> From: Scott Feldman 
>> 
>> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>> support setting ageing_time (or setting bridge attrs in general).
>> 
>> If push fails, don't update ageing_time in bridge and return err to user.
>> 
>> If push succeeds, update ageing_time in bridge and run gc_timer now to
>> recalibrate when to run gc_timer next, based on new ageing_time.
>> 
>> Signed-off-by: Scott Feldman 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  net/bridge/br_ioctl.c|3 +--
>>  net/bridge/br_netlink.c  |6 +++---
>>  net/bridge/br_private.h  |1 +
>>  net/bridge/br_stp.c  |   23 +++
>>  net/bridge/br_sysfs_br.c |3 +--
>>  5 files changed, 29 insertions(+), 7 deletions(-)
>> 
>> diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
>> index 8d423bc..263b4de 100644
>> --- a/net/bridge/br_ioctl.c
>> +++ b/net/bridge/br_ioctl.c
>> @@ -200,8 +200,7 @@ static int old_dev_ioctl(struct net_device *dev, struct
>> ifreq *rq, int cmd)
>>  if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
>>  return -EPERM;
>> 
>> -br->ageing_time = clock_t_to_jiffies(args[1]);
>> -return 0;
>> +return br_set_ageing_time(br, args[1]);
>> 
>>  case BRCTL_GET_PORT_INFO:
>>  {
>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>> index d78b442..544ab96 100644
>> --- a/net/bridge/br_netlink.c
>> +++ b/net/bridge/br_netlink.c
>> @@ -870,9 +870,9 @@ static int br_changelink(struct net_device *brdev, struct
>> nlattr *tb[],
>>  }
>> 
>>  if (data[IFLA_BR_AGEING_TIME]) {
>> -u32 ageing_time = nla_get_u32(data[IFLA_BR_AGEING_TIME]);
>> -
>> -br->ageing_time = clock_t_to_jiffies(ageing_time);
>> +err = br_set_ageing_time(br,
>> nla_get_u32(data[IFLA_BR_AGEING_TIME]));
>> +if (err)
>> +return err;
>>  }
>> 
>>  if (data[IFLA_BR_STP_STATE]) {
>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>> index 09d3ecb..ba0c67b 100644
>> --- a/net/bridge/br_private.h
>> +++ b/net/bridge/br_private.h
>> @@ -882,6 +882,7 @@ void __br_set_forward_delay(struct net_bridge *br,
>> unsigned long t);
>>  int br_set_forward_delay(struct net_bridge *br, unsigned long x);
>>  int br_set_hello_time(struct net_bridge *br, unsigned long x);
>>  int br_set_max_age(struct net_bridge *br, unsigned long x);
>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time);
>> 
>> 
>>  /* br_stp_if.c */
>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>> index 3a982c0..db6d243de 100644
>> --- a/net/bridge/br_stp.c
>> +++ b/net/bridge/br_stp.c
>> @@ -566,6 +566,29 @@ int br_set_max_age(struct net_bridge *br, unsigned
>> long val)
>> 
>>  }
>> 
>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
>> +{
>> +struct switchdev_attr attr = {
>> +.id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
>> +.flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
>> +.u.ageing_time = ageing_time,
>> +};
>> +unsigned long t = clock_t_to_jiffies(ageing_time);
>> +int err;
>> +
>> +if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
>> +return -ERANGE;
>> +
>> +err = switchdev_port_attr_set(br->dev, &attr);
>
>A thought - given that the ageing time is not a per-bridge-port attr, why are
>we using a "port based api"
>to pass the attribute down?  Maybe I'm missing something here?

In general, it can be. And in the case of rocker, it is.
Other drivers will just use the port handler to set the ageing time on the
appropriate bridge.



>
>-Prem
>
>
>> +if (err)
>> +return err;
>> +
>> +br->ageing_time = t;
>> +mod_timer(&br->gc_timer, jiffies);
>> +
>> +return 0;
>> +}
>> +
>>  void __br_set_forward_delay(struct net_bridge *br, unsigned long t)
>>  {
>>  br->bridge_forward_delay = t;
>> diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
>> index 4c97fc5..04ef192 100644
>> --- a/net/bridge/br_sysfs_br.c
>> +++ b/net/bridge/br_sysfs_br.c
>> @@ -102,8 +102,7 @@ static ssize_t ageing_time_show(struct device *d,
>> 
>>  static int set_ageing_time(struct net_bridge *br, unsigned long val)
>>  {
>> -br->ageing_time = clock_t_to_jiffies(val);
>> -return 0;
>> +return br_set_ageing_time(br, val);
>>  }
>> 
>>  static ssize_t ageing_time_store(struct device *d,
>> --
>> 1.7.10

Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Kosuke Tatsukawa

Neil Brown wrote:
> Kosuke Tatsukawa  writes:
> 
>> There are several places in net/sunrpc/svcsock.c which calls
>> waitqueue_active() without calling a memory barrier.  Add a memory
>> barrier just as in wq_has_sleeper().
>>
>> I found this issue when I was looking through the linux source code
>> for places calling waitqueue_active() before wake_up*(), but without
>> preceding memory barriers, after sending a patch to fix a similar
>> issue in drivers/tty/n_tty.c  (Details about the original issue can be
>> found here: https://lkml.org/lkml/2015/9/28/849).
> 
> Hi,
> this feels like the wrong approach to the problem.  It requires extra
> 'smp_mb's to be spread around, which are hard to understand and easy to
> forget.
> 
> A quick look seems to suggest that (nearly) every waitqueue_active()
> will need an smp_mb.  Could we just put the smp_mb() inside
> waitqueue_active()?


There are around 200 occurrences of waitqueue_active() in the kernel
source, and most of the places which use it before wake_up are either
protected by some spin lock, or already have a memory barrier or some
kind of atomic operation before it.

Simply adding smp_mb() to waitqueue_active() would incur extra cost in
many cases and won't be a good idea.

Another way to solve this problem is to remove the waitqueue_active(),
making the code look like this:
	if (wq)
		wake_up_interruptible(wq);
This also fixes the problem because the spinlock in the wake_up*() acts
as a memory barrier and prevents the code from being reordered by the
CPU (and it also makes the resulting code much simpler).
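
The ordering requirement discussed above can be modelled in user space.
What follows is only a sketch with C11 atomics, not kernel code: struct
model_waitqueue and the model_* functions are hypothetical names, and the
seq_cst fences merely stand in for smp_mb().  The waker stores the
condition and then checks for a waiter; the sleeper announces itself and
then re-checks the condition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_waitqueue {
	atomic_bool has_waiter;	/* stands in for a non-empty wait queue */
	atomic_bool data_ready;	/* the condition the sleeper waits for */
};

static void model_waitqueue_init(struct model_waitqueue *wq)
{
	atomic_init(&wq->has_waiter, false);
	atomic_init(&wq->data_ready, false);
}

/*
 * Waker side: publish the condition, then decide whether a wakeup is
 * needed.  Returns true if a waiter was observed.
 */
static bool model_publish_and_check(struct model_waitqueue *wq)
{
	atomic_store_explicit(&wq->data_ready, true, memory_order_relaxed);
	/* Full fence: plays the role of smp_mb() before waitqueue_active(). */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&wq->has_waiter, memory_order_relaxed);
}

/*
 * Sleeper side: announce ourselves, then re-check the condition.
 * Returns true if the condition is already set (so do not sleep).
 */
static bool model_prepare_and_check(struct model_waitqueue *wq)
{
	atomic_store_explicit(&wq->has_waiter, true, memory_order_relaxed);
	/* Full fence on the sleeper side, pairing with the waker's fence. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&wq->data_ready, memory_order_relaxed);
}
```

With a full fence between the store and the load on both sides, the two
threads cannot both read the stale value: either the waker sees the waiter
or the sleeper sees the condition.  Dropping the fence in
model_publish_and_check() is the analogue of calling waitqueue_active()
without a preceding barrier.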
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
  | IT Platform Division, NEC Corporation
  | ta...@ab.jp.nec.com

watches

2015-10-08 Thread Tom

Dear Sir or Madam

How are you doing?
Attached please kindly find some of our new designs.
Looking forward to working with you at an early date!

Best wishes


Tom

Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Neil Brown

Kosuke Tatsukawa  writes:

> There are several places in net/sunrpc/svcsock.c which calls
> waitqueue_active() without calling a memory barrier.  Add a memory
> barrier just as in wq_has_sleeper().
>
> I found this issue when I was looking through the linux source code
> for places calling waitqueue_active() before wake_up*(), but without
> preceding memory barriers, after sending a patch to fix a similar
> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> found here: https://lkml.org/lkml/2015/9/28/849).

Hi,
this feels like the wrong approach to the problem.  It requires extra
'smp_mb's to be spread around, which are hard to understand and easy to
forget.

A quick look seems to suggest that (nearly) every waitqueue_active()
will need an smp_mb.  Could we just put the smp_mb() inside
waitqueue_active()?

Thanks,
NeilBrown


>
> Signed-off-by: Kosuke Tatsukawa 
> ---
> v2:
>   - Fixed compiler warnings caused by type mismatch
> v1:
>   - https://lkml.org/lkml/2015/10/8/993
> ---
>  net/sunrpc/svcsock.c |6 ++
>  1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 0c81202..ec19444 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -414,6 +414,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>   svc_xprt_enqueue(&svsk->sk_xprt);
>   }
> + smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }



Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-08 Thread Jiang Liu

On 2015/10/9 4:20, Andrew Morton wrote:
> On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes  
> wrote:
> 
>> On Wed, 19 Aug 2015, Patil, Kiran wrote:
>>
>>> Acked-by: Kiran Patil 
>>
>> Where's the call to preempt_disable() to prevent kernels with preemption 
>> from making numa_node_id() invalid during this iteration?
> 
> David asked this question twice, received no answer and now the patch
> is in the maintainer tree, destined for mainline.
> 
> If I was asked this question I would respond
> 
>   The use of numa_mem_id() is racy and best-effort.  If the unlikely
>   race occurs, the memory allocation will occur on the wrong node, the
>   overall result being very slightly suboptimal performance.  The
>   existing use of numa_node_id() suffers from the same issue.
> 
> But I'm not the person proposing the patch.  Please don't just ignore
> reviewer comments!
Hi Andrew,
Apologies for the slow response due to personal reasons!
And thanks for answering the question from David. To be honest,
I didn't know how to answer this question before. Actually, this
question has puzzled me for a long time when dealing with memory
hot-removal. In normal cases, it only causes sub-optimal memory
allocation if a scheduling event happens between querying the NUMA node id
and calling alloc_pages_node(). But what happens if the system runs into
the following execution sequence?
1) node = numa_mem_id();
2) memory hot-removal event triggers
2.1) remove affected memory
2.2) reset pgdat to zero if node becomes empty after memory removal
3) alloc_pages_node(), which may access the zeroed pgdat structure.

I haven't found a mechanism to protect the system from the above sequence
yet, and this has puzzled me for a long time already :(. Does stop_machine()
protect the system from such an execution sequence?
Thanks!
Gerry

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-08 Thread Scott Feldman

On Thu, Oct 8, 2015 at 1:26 AM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 08:03:35AM CEST, sfel...@gmail.com wrote:
>>On Wed, Oct 7, 2015 at 10:39 PM, Jiri Pirko  wrote:
>>> Thu, Oct 08, 2015 at 06:27:07AM CEST, sfel...@gmail.com wrote:
>>>>On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
>>>>> From: Jiri Pirko
>>>>>
>>>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>>>> or if he has to use deferred version of this function.
>>>>>
>>>>> This also allows drivers to sleep inside attr_set and report operation
>>>>> status back to switchdev core. Switchdev core then warns if status is
>>>>> not ok, instead of silent errors happening in drivers.
>>>>>
>>>>> Signed-off-by: Jiri Pirko
>>>>> ---
>>>>>  include/net/switchdev.h   |   2 +
>>>>>  net/bridge/br_stp.c   |   4 +-
>>>>>  net/switchdev/switchdev.c | 113
>>>>> +-
>>>>>  3 files changed, 65 insertions(+), 54 deletions(-)
>>>>>
>>>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>> index 89266a3..320be44 100644
>>>>> --- a/include/net/switchdev.h
>>>>> +++ b/include/net/switchdev.h
>>>>> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
>>>>> struct switchdev_attr *attr);
>>>>>  int switchdev_port_attr_set(struct net_device *dev,
>>>>> struct switchdev_attr *attr);
>>>>> +int switchdev_port_attr_set_deferred(struct net_device *dev,
>>>>> +struct switchdev_attr *attr);
>>>>
>>>>Rather than adding another op, use attr->flags and define:
>>>>
>>>>#define SWITCHDEV_F_DEFERRED  BIT(x)
>>>>
>>>>So we get:
>>>>
>>>>void br_set_state(struct net_bridge_port *p, unsigned int state)
>>>>{
>>>>struct switchdev_attr attr = {
>>>>.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>>>+  .flags = SWITCHDEV_F_DEFERRED,
>>>>.u.stp_state = state,
>>>>};
>>>>int err;
>>>>
>>>>p->state = state;
>>>>err = switchdev_port_attr_set(p->dev, &attr);
>>>>if (err && err != -EOPNOTSUPP)
>>>>br_warn(p->br, "error setting offload STP state on
>>>>port %u(%s)\n",
>>>>(unsigned int) p->port_no,
>>>>p->dev->name);
>>>>}
>>>>
>>>>(And add obj->flags to do the same).
>>>
>>> That's what I wanted to avoid. Also because the obj is const and for
>>> call from work, this flag would have to be removed.
>>
>>What did you want to avoid?
>
> Having this as a flag. I don't like it too much.
> But that is cosmetics. Other than that, does the patchset make sense?
> Do you see some possible issues?

patch 1/3 makes sense, I tested it out and no issues.  (Looks like
there are other places to assert rtnl_lock, are you going to add
those?)

patch 2/3: Rather than trying to guess the call context in the core,
make the caller call the right variant for its context.  That part is
good.  On the flag vs. no flags, the reasons why I want this as a flag
are:

a) I want to keep the switchdev ops set to the core set: get/set attr
and add/del/dump objs.  I've pushed back on changing this before.  I
don't want ops explosion (like netdev_ops), and I'd like to avoid the
1000-line patch when the arg list in an op changes, and we need to
update N drivers.  The flags lets the caller modify the algo behavior,
while keeping the core call (and args) fixed.

b) the caller can combine flags, where it makes sense.  For example,
maybe I'm in a locked context and I don't want to recurse the device
tree, so I would make the call with NO_RECURSE | DEFERRED.  If we
didn't use flags, then we need to supply ops for each variant on the
call, and then things explode.
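
To illustrate point (b), here is a minimal user-space sketch, not the
switchdev code.  Only the flag names NO_RECURSE and DEFERRED come from the
text above; the MODEL_F_* bit values, struct model_attr, struct
model_result and model_attr_set() are made up for illustration.  The idea
is that one entry point with a fixed signature covers every combination of
behaviors.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the thread does not give the real values. */
#define MODEL_F_NO_RECURSE	(1u << 0)
#define MODEL_F_DEFERRED	(1u << 1)

struct model_attr {
	unsigned int id;
	unsigned int flags;	/* caller-selected behavior modifiers */
	int value;
};

/* What the (hypothetical) core decided to do for a given call. */
struct model_result {
	bool deferred;		/* work was queued instead of run inline */
	bool recursed;		/* lower devices were visited */
};

/*
 * Single entry point: the behavior varies with attr->flags while the
 * signature stays fixed, so adding a flag does not touch any caller or
 * driver that does not use it.
 */
static struct model_result model_attr_set(const struct model_attr *attr)
{
	struct model_result res;

	res.deferred = (attr->flags & MODEL_F_DEFERRED) != 0;
	res.recursed = (attr->flags & MODEL_F_NO_RECURSE) == 0;
	return res;
}
```

A caller in a locked context would pass
MODEL_F_NO_RECURSE | MODEL_F_DEFERRED in attr.flags; with separate ops,
each such combination would need its own function.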

patch 3/3 I haven't looked at yet...I'm stuck on 2/3.

RE: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread Premkumar Jonnala



> -Original Message-
> From: sfel...@gmail.com [mailto:sfel...@gmail.com]
> Sent: Friday, October 09, 2015 7:53 AM
> To: netdev@vger.kernel.org
> Cc: da...@davemloft.net; j...@resnulli.us; siva.mannem@gmail.com;
> Premkumar Jonnala; step...@networkplumber.org;
> ro...@cumulusnetworks.com; and...@lunn.ch; f.faine...@gmail.com;
> vivien.dide...@savoirfairelinux.com
> Subject: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down
> to switchdev
> 
> From: Scott Feldman 
> 
> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
> support setting ageing_time (or setting bridge attrs in general).
> 
> If push fails, don't update ageing_time in bridge and return err to user.
> 
> If push succeeds, update ageing_time in bridge and run gc_timer now to
> recalibrate when to run gc_timer next, based on new ageing_time.
> 
> Signed-off-by: Scott Feldman 
> Signed-off-by: Jiri Pirko 
> ---
>  net/bridge/br_ioctl.c|3 +--
>  net/bridge/br_netlink.c  |6 +++---
>  net/bridge/br_private.h  |1 +
>  net/bridge/br_stp.c  |   23 +++
>  net/bridge/br_sysfs_br.c |3 +--
>  5 files changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
> index 8d423bc..263b4de 100644
> --- a/net/bridge/br_ioctl.c
> +++ b/net/bridge/br_ioctl.c
> @@ -200,8 +200,7 @@ static int old_dev_ioctl(struct net_device *dev, struct
> ifreq *rq, int cmd)
>   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
>   return -EPERM;
> 
> - br->ageing_time = clock_t_to_jiffies(args[1]);
> - return 0;
> + return br_set_ageing_time(br, args[1]);
> 
>   case BRCTL_GET_PORT_INFO:
>   {
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index d78b442..544ab96 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -870,9 +870,9 @@ static int br_changelink(struct net_device *brdev, struct
> nlattr *tb[],
>   }
> 
>   if (data[IFLA_BR_AGEING_TIME]) {
> - u32 ageing_time = nla_get_u32(data[IFLA_BR_AGEING_TIME]);
> -
> - br->ageing_time = clock_t_to_jiffies(ageing_time);
> + err = br_set_ageing_time(br,
> nla_get_u32(data[IFLA_BR_AGEING_TIME]));
> + if (err)
> + return err;
>   }
> 
>   if (data[IFLA_BR_STP_STATE]) {
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 09d3ecb..ba0c67b 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -882,6 +882,7 @@ void __br_set_forward_delay(struct net_bridge *br,
> unsigned long t);
>  int br_set_forward_delay(struct net_bridge *br, unsigned long x);
>  int br_set_hello_time(struct net_bridge *br, unsigned long x);
>  int br_set_max_age(struct net_bridge *br, unsigned long x);
> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time);
> 
> 
>  /* br_stp_if.c */
> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> index 3a982c0..db6d243de 100644
> --- a/net/bridge/br_stp.c
> +++ b/net/bridge/br_stp.c
> @@ -566,6 +566,29 @@ int br_set_max_age(struct net_bridge *br, unsigned
> long val)
> 
>  }
> 
> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
> +{
> + struct switchdev_attr attr = {
> + .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
> + .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
> + .u.ageing_time = ageing_time,
> + };
> + unsigned long t = clock_t_to_jiffies(ageing_time);
> + int err;
> +
> + if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
> + return -ERANGE;
> +
> + err = switchdev_port_attr_set(br->dev, &attr);

A thought - given that the ageing time is not a per-bridge-port attr, why are
we using a "port based api" to pass the attribute down?  Maybe I'm missing
something here?

-Prem


> + if (err)
> + return err;
> +
> + br->ageing_time = t;
> + mod_timer(&br->gc_timer, jiffies);
> +
> + return 0;
> +}
> +
>  void __br_set_forward_delay(struct net_bridge *br, unsigned long t)
>  {
>   br->bridge_forward_delay = t;
> diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> index 4c97fc5..04ef192 100644
> --- a/net/bridge/br_sysfs_br.c
> +++ b/net/bridge/br_sysfs_br.c
> @@ -102,8 +102,7 @@ static ssize_t ageing_time_show(struct device *d,
> 
>  static int set_ageing_time(struct net_bridge *br, unsigned long val)
>  {
> - br->ageing_time = clock_t_to_jiffies(val);
> - return 0;
> + return br_set_ageing_time(br, val);
>  }
> 
>  static ssize_t ageing_time_store(struct device *d,
> --
> 1.7.10.4


Re: [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()

2015-10-08 Thread kbuild test robot

Hi Jason,

[auto build test ERROR on v4.3-rc3 -- if it's inappropriate base, please ignore]

config: x86_64-randconfig-i0-201540 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/unix/af_unix.c: In function 'unix_dgram_writable':
>> net/unix/af_unix.c:2465:3: error: 'other_full' undeclared (first use in this function)
     *other_full = false;
      ^
   net/unix/af_unix.c:2465:3: note: each undeclared identifier is reported only once for each function it appears in
vim +/other_full +2465 net/unix/af_unix.c

  2459  return mask;
  2460  }
  2461  
  2462  static bool unix_dgram_writable(struct sock *sk, struct sock *other,
  2463  bool *other_nospace)
  2464  {
> 2465  *other_full = false;
  2466  
  2467  if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
  2468  *other_full = true;

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data

Re: [PATCH net-next v3 0/4] switchdev: push bridge ageing_time attribute down

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 04:23:16AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Push bridge-level attributes down to switchdev drivers.  This patchset
>adds the infrastructure and then pushes, as an example, ageing_time attribute
>down from bridge to switchdev (rocker) driver.  Add some range-checking
>for ageing_time.
>
># ip link set dev br0 type bridge ageing_time 1000
>
># ip link set dev br0 type bridge ageing_time 999
>RTNETLINK answers: Numerical result out of range
>
>Up until now, switchdev attrs were port-level attrs, so the netdev used in
>switchdev_attr_set() would be a switch port or bond of switch ports.  With
>bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
>netdev.  The same recursive algo is used to visit the leaves of the stacked
>drivers to set the attr, it's just in this case we start one layer higher in
>the stack.  One note is not all ports in the bridge may support setting a
>bridge-level attribute, so rather than failing the entire set, we'll skip over
>those ports returning -EOPNOTSUPP.
>
>v2->v3: Per Jiri review: push only ageing_time attr down at this time, and
>don't pass raw bridge IFLA_BR_* values; rather use new switchdev attr ID for
>ageing_time.

Looks fine now. Thanks Scott!
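
The recursion described in the quoted cover letter can be sketched roughly
as below.  This is a user-space model under stated assumptions, not the
switchdev implementation: struct model_dev, model_leaf_set(),
model_attr_set() and MODEL_F_SKIP_EOPNOTSUPP are hypothetical stand-ins.
A device either handles the attribute itself or passes it to its lower
devices; with the skip flag set, -EOPNOTSUPP from a device is turned into
success so the walk continues.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_LOWER		4
#define MODEL_F_SKIP_EOPNOTSUPP	(1u << 0)	/* hypothetical bit value */

struct model_dev {
	/* Handler for devices that can program the attribute; else NULL. */
	int (*attr_set)(struct model_dev *dev, int value);
	struct model_dev *lower[MODEL_MAX_LOWER];	/* stacked-under devices */
	size_t n_lower;
	int applied;		/* last value programmed into this device */
};

/* Example handler for a port that supports the attribute. */
static int model_leaf_set(struct model_dev *dev, int value)
{
	dev->applied = value;
	return 0;
}

/*
 * Set the attribute on dev, or recurse into its lower devices when dev
 * has no handler.  A device with neither handler nor lower devices
 * reports -EOPNOTSUPP, which the skip flag converts to success.
 */
static int model_attr_set(struct model_dev *dev, int value, unsigned int flags)
{
	int err = -EOPNOTSUPP;
	size_t i;

	if (dev->attr_set) {
		err = dev->attr_set(dev, value);
	} else {
		for (i = 0; i < dev->n_lower; i++) {
			err = model_attr_set(dev->lower[i], value, flags);
			if (err)
				break;	/* stop at the first real failure */
		}
	}

	if (err == -EOPNOTSUPP && (flags & MODEL_F_SKIP_EOPNOTSUPP))
		err = 0;
	return err;
}
```

In this model, a bridge stacked on one unsupporting port followed by one
supporting port fails with -EOPNOTSUPP when the flag is clear, and reaches
the second port when the flag is set.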

Re: [PATCH net-next v3 4/4] rocker: handle setting bridge ageing_time

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 04:23:20AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>The FDB cleanup timer will get rescheduled to re-evaluate FDB entries
>based on new ageing_time.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 


Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 04:23:19AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>support setting ageing_time (or setting bridge attrs in general).
>
>If push fails, don't update ageing_time in bridge and return err to user.
>
>If push succeeds, update ageing_time in bridge and run gc_timer now to
>recalibrate when to run gc_timer next, based on new ageing_time.
>
>Signed-off-by: Scott Feldman 
>Signed-off-by: Jiri Pirko 

Acked-by: Jiri Pirko 
>

Re: [PATCH net-next v3 2/4] switchdev: skip over ports returning -EOPNOTSUPP when recursing ports

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 04:23:18AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>This allows us to recurse over all the ports, skipping over unsupporting
>ports.  Without the change, the recursion would stop at first unsupported
>port.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 

Re: [PATCH net-next v3 1/4] switchdev: add bridge ageing_time attribute

2015-10-08 Thread Jiri Pirko

Fri, Oct 09, 2015 at 04:23:17AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Setting the stage to push bridge-level attributes down to port driver so
>hardware can be programmed accordingly.  Bridge-level attribute example is
>ageing_time.  This is a per-bridge attribute, not a per-bridge-port attr.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()

2015-10-08 Thread Jason Baron

The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also calls
sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then
a couple of code paths in which the remote peer socket p and its associated
peer_wait queue can be freed before poll()/select()/epoll() have a chance
to remove themselves from the remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to connect to a new socket other than p, it will
drop its reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop
the final reference to p, allowing it to be freed.

Address this issue, by reverting unix_dgram_poll() to only register with
the wait queue associated with s and register a callback with the remote peer
socket on connect() that will wake up the wait queue associated with s. If
scenarios 1 or 2 occur above we then simply remove the callback from the
remote peer. This then presents the expected semantics to poll()/select()/
epoll().

I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
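
The callback scheme described above can be modelled in user space as
follows.  This is a sketch only, with hypothetical model_* names and a
plain linked list in place of the kernel wait queue; the wakeups counter
stands in for waking the wait queue associated with s.  Each socket owns a
relay entry that it places on its peer's peer_wait list on connect and
removes on disconnect, so nothing of s remains on p's list once the
connection is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_wait_entry;
typedef int (*model_wake_fn)(struct model_wait_entry *entry);

struct model_wait_entry {
	model_wake_fn func;		/* called when the queue is woken */
	struct model_wait_entry *next;
};

struct model_wait_queue {
	struct model_wait_entry *head;
};

struct model_sock {
	struct model_wait_queue peer_wait;	/* relays of sockets connected to us */
	struct model_wait_entry relay;		/* our entry on the peer's peer_wait */
	int wakeups;				/* wakeups forwarded to our own queue */
};

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Relay callback: forward the peer's wakeup to the owning socket. */
static int model_peer_wake(struct model_wait_entry *entry)
{
	struct model_sock *s = model_container_of(entry, struct model_sock, relay);

	s->wakeups++;
	return 0;
}

static void model_sock_init(struct model_sock *s)
{
	memset(s, 0, sizeof(*s));
	s->relay.func = model_peer_wake;
}

static void model_connect(struct model_sock *s, struct model_sock *peer)
{
	s->relay.next = peer->peer_wait.head;
	peer->peer_wait.head = &s->relay;
}

static void model_disconnect(struct model_sock *s, struct model_sock *peer)
{
	struct model_wait_entry **pp;

	for (pp = &peer->peer_wait.head; *pp; pp = &(*pp)->next) {
		if (*pp == &s->relay) {
			*pp = s->relay.next;
			s->relay.next = NULL;
			return;
		}
	}
}

/* Wake every entry registered on the queue. */
static void model_queue_wake(struct model_wait_queue *q)
{
	struct model_wait_entry *e;

	for (e = q->head; e; e = e->next)
		e->func(e);
}
```

After model_connect(&s, &p), waking p.peer_wait increments s.wakeups;
after model_disconnect(&s, &p), further wakeups on p no longer reach s,
which is the property the patch relies on when the peer changes or goes
away.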

Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
DGRAM sockets").

Tested-by: Mathias Krause 
Signed-off-by: Jason Baron 
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+   wait_queue_t wait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair = unix_peer(sk);
 
if (skpair != NULL) {
+   if (sk->sk_type != SOCK_STREAM)
+   remove_wait_queue(&unix_sk(skpair)->peer_wait,
+ &u->wait);
if (sk->sk_type == SOCK_STREAM || sk->sk_type == 
SOCK_SEQPACKET) {
unix_state_lock(skpair);
/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+   struct unix_sock *u;
+
+   u = container_of(wait, struct unix_sock, wait);
+   wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+   return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int 
kern)
 {
struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct 
socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+   init_waitqueue_func_entry(&u->wait, peer_wake);
unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 */
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
+
+   remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+ &unix_sk(sk)->wait);
unix_peer(sk) = other;
+   add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
unix_state_double_unlock(sk, other);
 
if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
sock_put(old_peer);
} else {
unix_peer(sk) = other;
+   add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
unix_state_double_unlock(sk, other);
}
+   /* New remote may have created write space for us */
+   wake_up_interruptible_sync_poll(sk_sleep(sk),
+   POLLOUT | POLLWRNORM | POLLWRBAND);
return 0;
 
 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 
sock_hold(sk);
unix_peer(newsk)= sk;
+   if (sk->sk_type == SOCK_SEQPACKET)
+   add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
newsk->sk_state = TCP_ESTABLISHED;
newsk->sk_type  = sk->sk_type;
init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 
smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
unix_peer(sk)   = newsk;
+   if (sk-
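
The relay idea in the commit message above (register a callback entry on the peer's peer_wait queue at connect() time, and remove it when the peer changes) can be sketched in user space roughly as follows. This is only an illustrative model, not the kernel code: struct toy_sock, the wq_* helpers and toy_connect() are hypothetical names, and locking and reference counting are left out entirely.

```c
#include <stddef.h>

/* Minimal stand-in for a wait-queue entry: a callback plus a list link. */
struct wait_entry {
    void (*func)(struct wait_entry *entry);
    struct wait_entry *next;
};

struct wait_queue {
    struct wait_entry *head;
};

/* Hypothetical stand-in for struct unix_sock. */
struct toy_sock {
    struct wait_queue peer_wait; /* entries registered by connected peers */
    struct wait_entry relay;     /* our entry, queued on the peer's peer_wait */
    int wakeups;                 /* wakeups relayed to our own waiters */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same shape as peer_wake() in the patch: map the entry back to its owner. */
static void relay_wake(struct wait_entry *entry)
{
    struct toy_sock *s = container_of(entry, struct toy_sock, relay);

    s->wakeups++;
}

static void wq_add(struct wait_queue *q, struct wait_entry *e)
{
    e->next = q->head;
    q->head = e;
}

static void wq_remove(struct wait_queue *q, struct wait_entry *e)
{
    struct wait_entry **pp;

    for (pp = &q->head; *pp; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next;
            e->next = NULL;
            return;
        }
    }
}

static void wq_wake_all(struct wait_queue *q)
{
    struct wait_entry *e;

    for (e = q->head; e; e = e->next)
        e->func(e);
}

static void toy_sock_init(struct toy_sock *s)
{
    s->peer_wait.head = NULL;
    s->relay.func = relay_wake;
    s->relay.next = NULL;
    s->wakeups = 0;
}

/* connect(): unregister from the old peer (if any), register on the new one. */
static void toy_connect(struct toy_sock *s, struct toy_sock *old_peer,
                        struct toy_sock *new_peer)
{
    if (old_peer)
        wq_remove(&old_peer->peer_wait, &s->relay);
    wq_add(&new_peer->peer_wait, &s->relay);
}
```

After toy_connect(&s, &p, &q), waking p's queue no longer touches s, which is the property that lets p be freed safely; waking q's queue still reaches s through the relay entry.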

[PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()

2015-10-08 Thread Jason Baron

Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron 
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 92 +--
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE 1
+#define UNIX_NOSPACE   2
struct socket_wq peer_wq;
wait_queue_t wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..05fbd00 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long 
timeo)
 
prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);
 
+   set_bit(UNIX_NOSPACE, &u->flags);
+   /* Ensure that we either see space in the peer sk_receive_queue via the
+* unix_recvq_full() check below, or we receive a wakeup when it
+* empties. Pairs with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
if (unix_peer(other) != sk && unix_recvq_full(other)) {
if (!timeo) {
-   err = -EAGAIN;
-   goto out_unlock;
-   }
-
-   timeo = unix_wait_for_peer(other, timeo);
+   set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+   /* Ensure that we either see space in the peer
+* sk_receive_queue via the unix_recvq_full() check
+* below, or we receive a wakeup when it empties. This
+* makes sure that epoll ET triggers correctly. Pairs
+* with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
+   if (unix_recvq_full(other)) {
+   err = -EAGAIN;
+   goto out_unlock;
+   }
+   } else {
+   timeo = unix_wait_for_peer(other, timeo);
 
-   err = sock_intr_errno(timeo);
-   if (signal_pending(current))
-   goto out_free;
+   err = sock_intr_errno(timeo);
+   if (signal_pending(current))
+   goto out_free;
 
-   goto restart;
+   goto restart;
+   }
}
 
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(&u->peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   /* Ensure that waiters on our sk->sk_receive_queue draining that check
+* via unix_recvq_full() either see space in the queue or get a wakeup
+* below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+* call above. Pairs with the mb in unix_dgram_sendmsg(),
+*unix_dgram_poll(), and unix_wait_for_peer().
+*/
+   smp_mb();
+   if (test_bit(UNIX_NOSPACE, &u->flags)) {
+   clear_bit(UNIX_NOSPACE, &u->flags);
+   wake_up_interruptible_sync_poll(&u->peer_wait,
+   POLLOUT | POLLWRNORM |
+   POLLWRBAND);
+   }
 
if (msg->msg_name)
unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,25 @@ static unsigned int unix_poll(struct file *file, struct 
socket *sock, poll_table
return mask;
 }
 
+static bool unix_dgram_writable(struct sock *sk, struct sock *other,
+   bool *other_nospace)
+{
+   *other_full = false;
+
+   if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
+   *other_full = true;
+   return false;
+   }
+
+   return unix_writable(sk);
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
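
The flag-plus-barrier pairing described in the comments of this patch can be modelled in user space with C11 atomics, roughly as below. This is a sketch under assumptions, not the kernel implementation: struct toy_rx, QUEUE_LIMIT and the function names are hypothetical, the seq_cst fences stand in for smp_mb__after_atomic()/smp_mb(), and the wakeup itself is reduced to a counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QUEUE_LIMIT 2 /* hypothetical receive-queue capacity */

struct toy_rx {
    atomic_int  qlen;    /* datagrams queued for the receiver */
    atomic_bool nospace; /* models the UNIX_NOSPACE bit */
    int         wakeups; /* wakeups issued by the receiver */
};

static void toy_rx_init(struct toy_rx *rx, int qlen)
{
    atomic_init(&rx->qlen, qlen);
    atomic_init(&rx->nospace, false);
    rx->wakeups = 0;
}

static bool rx_full(struct toy_rx *rx)
{
    return atomic_load_explicit(&rx->qlen, memory_order_relaxed) >= QUEUE_LIMIT;
}

/*
 * Sender side: publish "I need space", then re-check the queue.
 * Returns true if the caller has to wait (or fail with -EAGAIN).
 */
static bool sender_must_wait(struct toy_rx *rx)
{
    if (!rx_full(rx))
        return false;
    atomic_store_explicit(&rx->nospace, true, memory_order_relaxed);
    /* Pairs with the fence in receiver_dequeue(). */
    atomic_thread_fence(memory_order_seq_cst);
    return rx_full(rx);
}

/* Receiver side: dequeue one datagram, wake only if a sender asked for it. */
static void receiver_dequeue(struct toy_rx *rx)
{
    atomic_fetch_sub_explicit(&rx->qlen, 1, memory_order_relaxed);
    /* Pairs with the fence in sender_must_wait(). */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&rx->nospace, memory_order_relaxed)) {
        atomic_store_explicit(&rx->nospace, false, memory_order_relaxed);
        rx->wakeups++; /* stands in for wake_up_interruptible_sync_poll() */
    }
}
```

With a full fence on both sides, either the sender's re-check sees the freed slot or the receiver sees the flag and issues the wakeup, so a receive with no waiting sender skips the wakeup without losing one that is needed.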

[PATCH v4 0/3] net: unix: fix use-after-free

2015-10-08 Thread Jason Baron

Hi,

These patches are against mainline; I can rebase to net-next if needed,
please let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

v4:
-set UNIX_NOSPACE only if the peer socket has receive space

v3:
-beef up memory barrier comments in 3/3 (Peter Zijlstra)
-clean up unix_dgram_writable() function in 3/3 (Joe Perches)

Jason Baron (3):
  net: unix: fix use-after-free in unix_dgram_poll()
  net: unix: Convert gc_flags to flags
  net: unix: optimize wakeups in unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c| 124 --
 net/unix/garbage.c|  12 ++---
 3 files changed, 108 insertions(+), 32 deletions(-)

-- 
2.6.1


[PATCH v4 2/3] net: unix: Convert gc_flags to flags

2015-10-08 Thread Jason Baron

Convert gc_flags to flags in preparation for the subsequent patch, which will
make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron 
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c| 12 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
atomic_long_t   inflight;
spinlock_t  lock;
unsigned char   recursion_level;
-   unsigned long   gc_flags;
+   unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void 
(*func)(struct unix_sock *),
 * have been added to the queues after
 * starting the garbage collection
 */
-   if (test_bit(UNIX_GC_CANDIDATE, 
&u->gc_flags)) {
+   if (test_bit(UNIX_GC_CANDIDATE, 
&u->flags)) {
hit = true;
 
func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 * of the list, so that it's checked even if it was already
 * passed over
 */
-   if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
+   if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->flags))
list_move_tail(&u->link, &gc_candidates);
 }
 
@@ -305,8 +305,8 @@ void unix_gc(void)
BUG_ON(total_refs < inflight_refs);
if (total_refs == inflight_refs) {
list_move_tail(&u->link, &gc_candidates);
-   __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-   __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+   __set_bit(UNIX_GC_CANDIDATE, &u->flags);
+   __set_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
}
}
 
@@ -332,7 +332,7 @@ void unix_gc(void)
 
if (atomic_long_read(&u->inflight) > 0) {
list_move_tail(&u->link, &not_cycle_list);
-   __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+   __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
scan_children(&u->sk, inc_inflight_move_tail, NULL);
}
}
@@ -343,7 +343,7 @@ void unix_gc(void)
 */
while (!list_empty(&not_cycle_list)) {
u = list_entry(not_cycle_list.next, struct unix_sock, link);
-   __clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
+   __clear_bit(UNIX_GC_CANDIDATE, &u->flags);
list_move_tail(&u->link, &gc_inflight_list);
}
 
-- 
2.6.1


Re: [PATCH v3 net-next 0/4] tcp: better smp listener behavior

2015-10-08 Thread Grant Zhang




On 08/10/2015 19:33, Eric Dumazet wrote:

As promised in last patch series, we implement a better SO_REUSEPORT
strategy, based on cpu affinities if selected by the application.

We also moved sk_refcnt out of the cache line containing the lookup
keys, as it was considerably slowing down smp operations because
of false sharing. This was simpler than converting listen sockets
to conventional RCU (to avoid sk_refcnt dirtying)

Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.

Eric Dumazet (4):
   net: SO_INCOMING_CPU setsockopt() support
   net: align sk_refcnt on 128 bytes boundary
   net: shrink struct sock and request_sock by 8 bytes
   tcp: shrink tcp_timewait_sock by 8 bytes

  include/linux/tcp.h  |  4 ++--
  include/net/inet_timewait_sock.h |  2 +-
  include/net/request_sock.h   |  7 +++
  include/net/sock.h   | 41 +++-
  net/core/sock.c  |  5 +
  net/ipv4/inet_hashtables.c   |  2 ++
  net/ipv4/syncookies.c|  4 ++--
  net/ipv4/tcp_input.c |  2 +-
  net/ipv4/tcp_ipv4.c  |  2 +-
  net/ipv4/tcp_minisocks.c | 18 +-
  net/ipv4/tcp_output.c|  2 +-
  net/ipv4/udp.c   |  6 +-
  net/ipv6/inet6_hashtables.c  |  2 ++
  net/ipv6/syncookies.c|  4 ++--
  net/ipv6/tcp_ipv6.c  |  2 +-
  net/ipv6/udp.c   | 11 +++
  16 files changed, 72 insertions(+), 42 deletions(-)


Eric,

Does it make sense to make the listener hash table per-cpu? A socket with
SO_INCOMING_CPU set could just be added to the hash table for that specific
cpu.


Thanks,

Grant

Re: [PATCH net-next 0/4] tcp: better smp listener behavior

2015-10-08 Thread Tom Herbert

On Thu, Oct 8, 2015 at 8:37 AM, Eric Dumazet  wrote:
> As promised in last patch series, we implement a better SO_REUSEPORT
> strategy, based on cpu affinities if selected by the application.
>
> We also moved sk_refcnt out of the cache line containing the lookup
> keys, as it was considerably slowing down smp operations because
> of false sharing. This was simpler than converting listen sockets
> to conventional RCU (to avoid sk_refcnt dirtying)
>
> Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
>
Is this IPv4, IPv6, or some combination of the two ? :-)

> Eric Dumazet (4):
>   net: SO_INCOMING_CPU setsockopt() support
>   net: align sk_refcnt on 128 bytes boundary
>   net: shrink struct sock and request_sock by 8 bytes
>   tcp: shrink tcp_timewait_sock by 8 bytes
>
>  include/linux/tcp.h  |  4 ++--
>  include/net/inet_timewait_sock.h |  2 +-
>  include/net/request_sock.h   |  7 +++
>  include/net/sock.h   | 37 +++--
>  net/core/sock.c  |  5 +
>  net/ipv4/inet_hashtables.c   |  5 +
>  net/ipv4/syncookies.c|  4 ++--
>  net/ipv4/tcp_input.c |  2 +-
>  net/ipv4/tcp_ipv4.c  |  2 +-
>  net/ipv4/tcp_minisocks.c | 18 +-
>  net/ipv4/tcp_output.c|  2 +-
>  net/ipv4/udp.c   | 12 +++-
>  net/ipv6/inet6_hashtables.c  |  5 +
>  net/ipv6/syncookies.c|  4 ++--
>  net/ipv6/tcp_ipv6.c  |  2 +-
>  net/ipv6/udp.c   | 11 +++
>  16 files changed, 87 insertions(+), 35 deletions(-)
>
> --
> 2.6.0.rc2.230.g3dd15c0
>

Re: [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread Tom Herbert

On Thu, Oct 8, 2015 at 7:33 PM, Eric Dumazet  wrote:
> SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command
> to fetch incoming cpu handling a particular TCP flow after accept()
>
> This commit adds setsockopt() support and extends SO_REUSEPORT selection
> logic : If a TCP listener or UDP socket has this option set, a packet is
> delivered to this socket only if CPU handling the packet matches the specified
> one.
>
> This allows building very efficient TCP servers, using one listener per
> RX queue, as the associated TCP listener should only accept flows handled
> in softirq by the same cpu.
> This provides optimal NUMA behavior and keeps cpu caches hot.
>
> Note that __inet_lookup_listener() still has to iterate over the list of
> all listeners. Following patch puts sk_refcnt in a different cache line
> to let this iteration hit only shared and read mostly cache lines.
>
> Signed-off-by: Eric Dumazet 
> ---
>  include/net/sock.h  | 10 --
>  net/core/sock.c |  5 +
>  net/ipv4/inet_hashtables.c  |  2 ++
>  net/ipv4/udp.c  |  6 +-
>  net/ipv6/inet6_hashtables.c |  2 ++
>  net/ipv6/udp.c  | 11 +++
>  6 files changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dfe2eb8e1132..08abffe32236 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -150,6 +150,7 @@ typedef __u64 __bitwise __addrpair;
>   * @skc_node: main hash linkage for various protocol lookup tables
>   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   * @skc_tx_queue_mapping: tx queue number for this connection
> + * @skc_incoming_cpu: record/match cpu processing incoming packets
>   * @skc_refcnt: reference count
>   *
>   * This is the minimal network layer representation of sockets, the 
> header
> @@ -212,6 +213,8 @@ struct sock_common {
> struct hlist_nulls_node skc_nulls_node;
> };
> int skc_tx_queue_mapping;
> +   int skc_incoming_cpu;
> +
> atomic_t skc_refcnt;
> /* private: */
> int skc_dontcopy_end[0];
> @@ -274,7 +277,6 @@ struct cg_proto;
>*@sk_rcvtimeo: %SO_RCVTIMEO setting
>*@sk_sndtimeo: %SO_SNDTIMEO setting
>*@sk_rxhash: flow hash received from netif layer
> -  *@sk_incoming_cpu: record cpu processing incoming packets
>*@sk_txhash: computed flow hash for use on transmit
>*@sk_filter: socket filtering instructions
>*@sk_timer: sock cleanup timer
> @@ -331,6 +333,7 @@ struct sock {
>  #define sk_v6_daddr __sk_common.skc_v6_daddr
>  #define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
>  #define sk_cookie  __sk_common.skc_cookie
> +#define sk_incoming_cpu __sk_common.skc_incoming_cpu
>
> socket_lock_t   sk_lock;
> struct sk_buff_head sk_receive_queue;
> @@ -353,11 +356,6 @@ struct sock {
>  #ifdef CONFIG_RPS
> __u32   sk_rxhash;
>  #endif
> -   u16 sk_incoming_cpu;
> -   /* 16bit hole
> -* Warned : sk_incoming_cpu can be set from softirq,
> -* Do not use this hole without fully understanding possible issues.
> -*/
>
> __u32   sk_txhash;
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 7dd1263e4c24..1071f9380250 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -988,6 +988,10 @@ set_rcvbuf:
>  sk->sk_max_pacing_rate);
> break;
>
> +   case SO_INCOMING_CPU:
> +   sk->sk_incoming_cpu = val;
> +   break;
> +
> default:
> ret = -ENOPROTOOPT;
> break;
> @@ -2353,6 +2357,7 @@ void sock_init_data(struct socket *sock, struct sock 
> *sk)
>
> sk->sk_max_pacing_rate = ~0U;
> sk->sk_pacing_rate = ~0U;
> +   sk->sk_incoming_cpu = -1;
> /*
>  * Before updating sk_refcnt, we must commit prior changes to memory
>  * (Documentation/RCU/rculist_nulls.txt for details)
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index bed8886a4b6c..08643a3616af 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -185,6 +185,8 @@ static inline int compute_score(struct sock *sk, struct 
> net *net,
> return -1;
> score += 4;
> }
> +   if (sk->sk_incoming_cpu == raw_smp_processor_id())
> +   score++;
> }
> return score;
>  }
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index e1fc129099ea..24ec14f9825c 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -375,7 +375,8 @@ static inline int compute_score(struct sock *sk, str
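
The effect of the compute_score() hunks quoted above (one extra point when sk_incoming_cpu matches the CPU handling the packet) can be illustrated with a simplified user-space model. Everything here is an assumption for illustration: struct toy_listener, the base score of 2, and the first-match tie-break are invented, and the real lookup spreads equal-scoring SO_REUSEPORT sockets by hash rather than picking the first.

```c
/* Hypothetical stand-in for a listening socket. */
struct toy_listener {
    int port;
    int incoming_cpu; /* -1 means no preference, as set in sock_init_data() */
};

/* Returns -1 for "does not match", otherwise a score; higher is better. */
static int toy_score(const struct toy_listener *l, int port, int cpu)
{
    int score;

    if (l->port != port)
        return -1;
    score = 2; /* arbitrary base score for a port match (assumption) */
    if (l->incoming_cpu == cpu)
        score++; /* the bonus added by the patch */
    return score;
}

/* Returns the index of the best listener, or -1 if none matches. */
static int toy_pick(const struct toy_listener *ls, int n, int port, int cpu)
{
    int best = -1, best_score = -1, i;

    for (i = 0; i < n; i++) {
        int s = toy_score(&ls[i], port, cpu);

        if (s > best_score) {
            best_score = s;
            best = i;
        }
    }
    return best;
}
```

A listener whose incoming_cpu equals the current CPU outranks the others on the same port; when no listener claims that CPU, all of them tie and the sketch simply falls back to the first one.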

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread Scott Feldman

On Thu, Oct 8, 2015 at 7:40 PM, Florian Fainelli  wrote:
> 2015-10-08 19:23 GMT-07:00  :
>> From: Scott Feldman 
>>
>> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>> support setting ageing_time (or setting bridge attrs in general).
>>
>> If push fails, don't update ageing_time in bridge and return err to user.
>>
>> If push succeeds, update ageing_time in bridge and run gc_timer now to
> >> recalibrate when to run gc_timer next, based on new ageing_time.
>>
>> Signed-off-by: Scott Feldman 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  net/bridge/br_ioctl.c|3 +--
>>  net/bridge/br_netlink.c  |6 +++---
>>  net/bridge/br_private.h  |1 +
>>  net/bridge/br_stp.c  |   23 +++
>>  net/bridge/br_sysfs_br.c |3 +--
>>  5 files changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
>> index 8d423bc..263b4de 100644
>> --- a/net/bridge/br_ioctl.c
>> +++ b/net/bridge/br_ioctl.c
>> @@ -200,8 +200,7 @@ static int old_dev_ioctl(struct net_device *dev, struct 
>> ifreq *rq, int cmd)
>> if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
>> return -EPERM;
>>
>> -   br->ageing_time = clock_t_to_jiffies(args[1]);
>> -   return 0;
>> +   return br_set_ageing_time(br, args[1]);
>>
>> case BRCTL_GET_PORT_INFO:
>> {
>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>> index d78b442..544ab96 100644
>> --- a/net/bridge/br_netlink.c
>> +++ b/net/bridge/br_netlink.c
>> @@ -870,9 +870,9 @@ static int br_changelink(struct net_device *brdev, 
>> struct nlattr *tb[],
>> }
>>
>> if (data[IFLA_BR_AGEING_TIME]) {
>> -   u32 ageing_time = nla_get_u32(data[IFLA_BR_AGEING_TIME]);
>> -
>> -   br->ageing_time = clock_t_to_jiffies(ageing_time);
>> +   err = br_set_ageing_time(br, 
>> nla_get_u32(data[IFLA_BR_AGEING_TIME]));
>> +   if (err)
>> +   return err;
>> }
>>
>> if (data[IFLA_BR_STP_STATE]) {
>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>> index 09d3ecb..ba0c67b 100644
>> --- a/net/bridge/br_private.h
>> +++ b/net/bridge/br_private.h
>> @@ -882,6 +882,7 @@ void __br_set_forward_delay(struct net_bridge *br, 
>> unsigned long t);
>>  int br_set_forward_delay(struct net_bridge *br, unsigned long x);
>>  int br_set_hello_time(struct net_bridge *br, unsigned long x);
>>  int br_set_max_age(struct net_bridge *br, unsigned long x);
>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time);
>>
>>
>>  /* br_stp_if.c */
>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>> index 3a982c0..db6d243de 100644
>> --- a/net/bridge/br_stp.c
>> +++ b/net/bridge/br_stp.c
>> @@ -566,6 +566,29 @@ int br_set_max_age(struct net_bridge *br, unsigned long 
>> val)
>>
>>  }
>>
>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
>> +{
>> +   struct switchdev_attr attr = {
>> +   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
>> +   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
>> +   .u.ageing_time = ageing_time,
>> +   };
>> +   unsigned long t = clock_t_to_jiffies(ageing_time);
>> +   int err;
>> +
>> +   if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
>> +   return -ERANGE;
>> +
>> +   err = switchdev_port_attr_set(br->dev, &attr);
>> +   if (err)
>> +   return err;
>> +
>> +   br->ageing_time = t;
>> +   mod_timer(&br->gc_timer, jiffies);
>
> If the switch driver/HW supports ageing, does it still make sense to
> have this software timer ticking?

Yes, because the bridge still needs to age out entries it has learned
(those not marked with added_by_external_learn), for example entries
learned on non-offloaded ports.
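
For readers following the thread, the validate / push / commit order in br_set_ageing_time() can be sketched in user space as below. This is a simplified model under stated assumptions: struct toy_bridge and the toy_port_* callbacks are hypothetical, the lower bound of 1000 is taken from the example in the cover letter, the upper bound is a made-up placeholder, and the range check is done on the raw value rather than on jiffies as in the patch.

```c
#include <errno.h>

#define TOY_MIN_AGEING 1000UL      /* matches the cover letter example */
#define TOY_MAX_AGEING 100000000UL /* assumed upper bound, illustration only */

/* Hypothetical per-port hook standing in for the switchdev attr_set path. */
typedef int (*toy_port_set_fn)(unsigned long ageing_time);

struct toy_bridge {
    unsigned long ageing_time;
    toy_port_set_fn *ports;
    int nports;
};

static int toy_port_ok(unsigned long t)          { (void)t; return 0; }
static int toy_port_unsupported(unsigned long t) { (void)t; return -EOPNOTSUPP; }
static int toy_port_broken(unsigned long t)      { (void)t; return -EIO; }

static int toy_set_ageing_time(struct toy_bridge *br, unsigned long t)
{
    int i, err;

    /* 1. Validate before touching any port. */
    if (t < TOY_MIN_AGEING || t > TOY_MAX_AGEING)
        return -ERANGE;

    /* 2. Push down, skipping ports that do not support the attribute. */
    for (i = 0; i < br->nports; i++) {
        err = br->ports[i](t);
        if (err && err != -EOPNOTSUPP)
            return err;
    }

    /* 3. Commit to the bridge only after the push succeeded. */
    br->ageing_time = t;
    return 0;
}
```

The bridge keeps its own ageing_time (and, in the real code, its gc_timer) whether or not any port offloads ageing, which is consistent with the answer above: software-learned entries still have to be aged out by the bridge.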

Re: [PATCH 1/1] net namespace: dynamically configure new net namespace inherit net config

2015-10-08 Thread Paul Gortmaker

On Thu, Oct 8, 2015 at 2:44 AM, yzhu1  wrote:
> Hi, Miller
>
> Would you like to check this patch?

I explained to you back in June what some of the biggest oversights in
your work were.  You have changed nothing, yet you expect a reply from
maintainers who are extremely busy, simply by resending the same old
patch over and over.  Do you not see why this approach will not work?

Paul.
--

>
> Thanks a lot.
> Zhu Yanjun
>
> On 06/26/2015 05:37 PM, Zhu Yanjun wrote:
>>
> >> The new net namespace can inherit from the original net config, or
> >> the current net config. As such, a config is needed to decide where
> >> the new namespace inherits from.
>>
>> Signed-off-by: Zhu Yanjun 
>> ---
>>   init/Kconfig   |  9 +
>>   net/ipv4/devinet.c | 13 +
>>   2 files changed, 22 insertions(+)
>>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index dc24dec..fab8c41 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1212,6 +1212,15 @@ config NET_NS
>>   Allow user space to create what appear to be multiple instances
>>   of the network stack.
>>   +config NET_NS_INHERIT_ORIGINAL
>> +   bool "New network namespace inherits from original net config"
>> +   depends on NET_NS
>> +   default n
>> +   help
>> + Allow new network namespace inherit from original net config.
>> + If no, the new network namespace inherits from the current net
>> + config including the modified net config.
>> +
>>   endif # NAMESPACES
>> config SCHED_AUTOGROUP
>> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
>> index 419d23c..cf635e4 100644
>> --- a/net/ipv4/devinet.c
>> +++ b/net/ipv4/devinet.c
>> @@ -2271,6 +2271,7 @@ static __net_init int devinet_init_net(struct net
>> *net)
>>   #endif
>> err = -ENOMEM;
>> +#ifndef CONFIG_NET_NS_INHERIT_ORIGINAL
>> all = &ipv4_devconf;
>> dflt = &ipv4_devconf_dflt;
>>   @@ -2282,6 +2283,15 @@ static __net_init int devinet_init_net(struct net
>> *net)
>> dflt = kmemdup(dflt, sizeof(ipv4_devconf_dflt),
>> GFP_KERNEL);
>> if (!dflt)
>> goto err_alloc_dflt;
>> +#else
>> +   all = kmemdup(&ipv4_devconf, sizeof(ipv4_devconf), GFP_KERNEL);
>> +   if (!all)
>> +   goto err_alloc_all;
>> +
>> +   dflt = kmemdup(&ipv4_devconf_dflt, sizeof(ipv4_devconf_dflt),
>> GFP_KERNEL);
>> +   if (!dflt)
>> +   goto err_alloc_dflt;
>> +#endif
>> #ifdef CONFIG_SYSCTL
>> tbl = kmemdup(tbl, sizeof(ctl_forward_entry), GFP_KERNEL);
>> @@ -2292,7 +2302,10 @@ static __net_init int devinet_init_net(struct net
>> *net)
>> tbl[0].extra1 = all;
>> tbl[0].extra2 = net;
>>   #endif
>> +
>> +#ifndef CONFIG_NET_NS_INHERIT_ORIGINAL
>> }
>> +#endif
>> #ifdef CONFIG_SYSCTL
>> err = __devinet_sysctl_register(net, "all", all);
>
>

[PATCH ethtool 1/2] Fix missing function declarations when building tests

2015-10-08 Thread Ben Hutchings

Fix these compiler warnings by declaring test_exit() and test_main()
regardless of whether TEST_NO_WRAPPERS is defined:

test-cmdline.c: In function ‘send_ioctl’:
test-cmdline.c:268:2: warning: implicit declaration of function ‘test_exit’ 
[-Wimplicit-function-declaration]
  test_exit(0);
  ^
test-common.c: In function ‘test_cmdline’:
test-common.c:361:21: warning: implicit declaration of function ‘test_main’ 
[-Wimplicit-function-declaration]
  rc = rc ? rc - 1 : test_main(argc, argv);
 ^

Signed-off-by: Ben Hutchings 
---
These warnings are longstanding so I'm not sure why I didn't notice
them before!  I've applied this post-4.2.

Ben.

 internal.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/internal.h b/internal.h
index 444810d..156770c 100644
--- a/internal.h
+++ b/internal.h
@@ -132,10 +132,11 @@ struct cmd_expect {
 int test_ioctl(const struct cmd_expect *expect, void *cmd);
 #define TEST_IOCTL_MISMATCH (-2)
 
-#ifndef TEST_NO_WRAPPERS
 int test_main(int argc, char **argp);
-#define main(...) test_main(__VA_ARGS__)
 void test_exit(int rc) __attribute__((noreturn));
+
+#ifndef TEST_NO_WRAPPERS
+#define main(...) test_main(__VA_ARGS__)
 #undef exit
 #define exit(rc) test_exit(rc)
 void *test_malloc(size_t size);

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.


[PATCH ethtool 2/2] Fix return type of test_free() prototype

2015-10-08 Thread Ben Hutchings

The return type should be void, consistent with the definition and
with the standard free() function.

Signed-off-by: Ben Hutchings 
---
I've applied this post-4.2.

Ben.

 internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/internal.h b/internal.h
index 156770c..b5ef646 100644
--- a/internal.h
+++ b/internal.h
@@ -148,7 +148,7 @@ void *test_calloc(size_t nmemb, size_t size);
 char *test_strdup(const char *s);
 #undef strdup
 #define strdup(s) test_strdup(s)
-void *test_free(void *ptr);
+void test_free(void *ptr);
 #undef free
 #define free(ptr) test_free(ptr)
 void *test_realloc(void *ptr, size_t size);
-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.


ethtool 4.2 released

2015-10-08 Thread Ben Hutchings

ethtool version 4.2 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.2.tar.xz

Release notes:

* Feature: Support soldered-on modules in module EEPROM dump (-m option)
* Feature: Add register dump support for VMware vmxnet3 (-d option)
* Feature: Update register dump support for IBM EMAC (-d option)
  (requires Linux 4.3 or a future stable update to 4.1 or 4.2)
* Doc: Fix typo in man page

Ben.

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.



Re: [PATCH net-next v3 0/4] switchdev: push bridge ageing_time attribute down

2015-10-08 Thread Florian Fainelli

2015-10-08 19:23 GMT-07:00  :
> From: Scott Feldman 
>
> Push bridge-level attributes down to switchdev drivers.  This patchset
> adds the infrastructure and then pushes, as an example, ageing_time attribute
> down from bridge to switchdev (rocker) driver.  Add some range-checking
> for ageing_time.
>
> # ip link set dev br0 type bridge ageing_time 1000
>
> # ip link set dev br0 type bridge ageing_time 999
> RTNETLINK answers: Numerical result out of range
>
> Up until now, switchdev attrs were port-level attrs, so the netdev used in
> switchdev_attr_set() would be a switch port or bond of switch ports.  With
> bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
> netdev.  The same recursive algo is used to visit the leaves of the stacked
> drivers to set the attr, it's just in this case we start one layer higher in
> the stack.  One note is not all ports in the bridge may support setting a
> bridge-level attribute, so rather than failing the entire set, we'll skip over
> those ports returning -EOPNOTSUPP.

Other than the small question on patch #3, this looks good to me:

Reviewed-by: Florian Fainelli 

>
> v2->v3: Per Jiri review: push only ageing_time attr down at this time, and
> don't pass raw bridge IFLA_BR_* values; rather use new switchdev attr ID for
> ageing_time.
>
> v1->v2: rebase w/ net-next
>
>
> Scott Feldman (4):
>   switchdev: add bridge ageing_time attribute
>   switchdev: skip over ports returning -EOPNOTSUPP when recursing ports
>   bridge: push bridge setting ageing_time down to switchdev
>   rocker: handle setting bridge ageing_time
>
>  drivers/net/ethernet/rocker/rocker.c |   16 
>  include/net/switchdev.h  |3 +++
>  net/bridge/br_ioctl.c|3 +--
>  net/bridge/br_netlink.c  |6 +++---
>  net/bridge/br_private.h  |1 +
>  net/bridge/br_stp.c  |   23 +++
>  net/bridge/br_sysfs_br.c |3 +--
>  net/switchdev/switchdev.c|9 -
>  8 files changed, 56 insertions(+), 8 deletions(-)
>
> --
> 1.7.10.4
>



-- 
Florian

Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread Florian Fainelli

2015-10-08 19:23 GMT-07:00  :
> From: Scott Feldman 
>
> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
> support setting ageing_time (or setting bridge attrs in general).
>
> If push fails, don't update ageing_time in bridge and return err to user.
>
> If push succeeds, update ageing_time in bridge and run gc_timer now to
> recalibrate when to run gc_timer next, based on new ageing_time.
>
> Signed-off-by: Scott Feldman 
> Signed-off-by: Jiri Pirko 
> ---
>  net/bridge/br_ioctl.c|3 +--
>  net/bridge/br_netlink.c  |6 +++---
>  net/bridge/br_private.h  |1 +
>  net/bridge/br_stp.c  |   23 +++
>  net/bridge/br_sysfs_br.c |3 +--
>  5 files changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
> index 8d423bc..263b4de 100644
> --- a/net/bridge/br_ioctl.c
> +++ b/net/bridge/br_ioctl.c
> @@ -200,8 +200,7 @@ static int old_dev_ioctl(struct net_device *dev, struct 
> ifreq *rq, int cmd)
> if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
> return -EPERM;
>
> -   br->ageing_time = clock_t_to_jiffies(args[1]);
> -   return 0;
> +   return br_set_ageing_time(br, args[1]);
>
> case BRCTL_GET_PORT_INFO:
> {
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index d78b442..544ab96 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -870,9 +870,9 @@ static int br_changelink(struct net_device *brdev, struct 
> nlattr *tb[],
> }
>
> if (data[IFLA_BR_AGEING_TIME]) {
> -   u32 ageing_time = nla_get_u32(data[IFLA_BR_AGEING_TIME]);
> -
> -   br->ageing_time = clock_t_to_jiffies(ageing_time);
> +   err = br_set_ageing_time(br, 
> nla_get_u32(data[IFLA_BR_AGEING_TIME]));
> +   if (err)
> +   return err;
> }
>
> if (data[IFLA_BR_STP_STATE]) {
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 09d3ecb..ba0c67b 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -882,6 +882,7 @@ void __br_set_forward_delay(struct net_bridge *br, 
> unsigned long t);
>  int br_set_forward_delay(struct net_bridge *br, unsigned long x);
>  int br_set_hello_time(struct net_bridge *br, unsigned long x);
>  int br_set_max_age(struct net_bridge *br, unsigned long x);
> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time);
>
>
>  /* br_stp_if.c */
> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> index 3a982c0..db6d243de 100644
> --- a/net/bridge/br_stp.c
> +++ b/net/bridge/br_stp.c
> @@ -566,6 +566,29 @@ int br_set_max_age(struct net_bridge *br, unsigned long 
> val)
>
>  }
>
> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
> +{
> +   struct switchdev_attr attr = {
> +   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
> +   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
> +   .u.ageing_time = ageing_time,
> +   };
> +   unsigned long t = clock_t_to_jiffies(ageing_time);
> +   int err;
> +
> +   if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
> +   return -ERANGE;
> +
> +   err = switchdev_port_attr_set(br->dev, &attr);
> +   if (err)
> +   return err;
> +
> +   br->ageing_time = t;
> +   mod_timer(&br->gc_timer, jiffies);

If the switch driver/HW supports ageing, does it still make sense to
have this software timer ticking?
--
Florian

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-08 Thread Alexei Starovoitov


On 10/8/15 5:50 PM, Devon H. O'Dell wrote:

>> with the amount of skb_sender_cpu_clear() all over the code base
>> I wonder whether there is a better solution to all of these.
>
> I think there is. We found that splitting the union of sender_cpu and
> napi_id solved the issue for us. In general, I think this is an OK
> solution as long as the following hold:
>
>   * skbs are always allocated via kzalloc
>   * out -> out cloned skbs are always cloned on the same CPU
>   * an extra four bytes in skbuff isn't a bad thing


I'm pretty sure extending sk_buff for this is not acceptable.
I was thinking maybe we can use the sign bit to distinguish between
napi_id and sender_cpu.
Like:
	if ((int)skb->sender_cpu >= 0)
		skb->sender_cpu = -(raw_smp_processor_id() + 1);
and inside get_xps_queue() use it only if it's negative.
Then we can remove skb_sender_cpu_clear() from everywhere.
Adding a check to napi_hash_add() to make sure that napi_id is not
negative is probably ok too.
Thoughts?
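
A user-space sketch of the sign-bit idea above, assuming a single 32-bit
field shared between napi_id and sender_cpu. The names model_skb,
model_record_sender_cpu, model_xps_cpu and model_roundtrip are hypothetical,
and the caller passes the CPU number where the kernel would call
raw_smp_processor_id(). It is an illustration, not the eventual kernel change.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the napi_id/sender_cpu union in sk_buff. */
struct model_skb {
	uint32_t sender_cpu;	/* >= 0 as int: napi_id or unset; < 0: encoded CPU */
};

/* Record the sending CPU as -(cpu + 1), unless a CPU is already recorded. */
void model_record_sender_cpu(struct model_skb *skb, int cpu)
{
	assert(cpu >= 0);
	if ((int32_t)skb->sender_cpu >= 0)
		skb->sender_cpu = (uint32_t)-(cpu + 1);
}

/* Return the recorded CPU, or -1 when the field holds a napi_id (or nothing). */
int model_xps_cpu(const struct model_skb *skb)
{
	int32_t v = (int32_t)skb->sender_cpu;

	return v < 0 ? -(v + 1) : -1;
}

/* Helper for experiments: start from a field value, record a CPU, decode. */
int model_roundtrip(uint32_t initial, int cpu)
{
	struct model_skb skb = { .sender_cpu = initial };

	model_record_sender_cpu(&skb, cpu);
	return model_xps_cpu(&skb);
}
```

With this encoding a non-negative value (for example a napi_id left over from
receive) is simply overwritten on transmit, while an already recorded CPU is
kept, so no explicit clearing step is needed in the model.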


[PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread Eric Dumazet

SO_INCOMING_CPU, as added in commit 2c8c56e15df3, was a getsockopt() command
to fetch the incoming cpu handling a particular TCP flow after accept().

This commit adds setsockopt() support and extends the SO_REUSEPORT selection
logic: if a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if the CPU handling the packet matches the
specified one.

This allows building very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keeps cpu caches hot.

Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.
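
A user-space sketch of the selection rule described above. The diff below
implements it as a one-point score bonus in compute_score(); this model
keeps only that bonus. The names model_listener, model_score, model_pick and
model_pick_demo are hypothetical, the base score of 1 is an assumption, and
ties go to the first listener here, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one listener in a SO_REUSEPORT group. */
struct model_listener {
	int incoming_cpu;	/* value set via SO_INCOMING_CPU, or -1 if unset */
};

/* Base score plus one bonus point when the listener is pinned to this CPU. */
int model_score(const struct model_listener *l, int cpu)
{
	int score = 1;

	if (l->incoming_cpu == cpu)
		score++;
	return score;
}

/* Return the index of the first listener with the highest score. */
size_t model_pick(const struct model_listener *ls, size_t n, int cpu)
{
	size_t best = 0;
	size_t i;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (model_score(&ls[i], cpu) > model_score(&ls[best], cpu))
			best = i;
	return best;
}

/* Demo helper: four listeners pinned to CPUs 0..3, packet handled on cpu. */
size_t model_pick_demo(int cpu)
{
	const struct model_listener ls[] = { { 0 }, { 1 }, { 2 }, { 3 } };

	return model_pick(ls, sizeof(ls) / sizeof(ls[0]), cpu);
}
```

A packet handled on a CPU that no listener is pinned to still finds a
listener in this model, because the match only adds to the score.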

Signed-off-by: Eric Dumazet 
---
 include/net/sock.h  | 10 --
 net/core/sock.c |  5 +
 net/ipv4/inet_hashtables.c  |  2 ++
 net/ipv4/udp.c  |  6 +-
 net/ipv6/inet6_hashtables.c |  2 ++
 net/ipv6/udp.c  | 11 +++
 6 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index dfe2eb8e1132..08abffe32236 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,6 +150,7 @@ typedef __u64 __bitwise __addrpair;
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_incoming_cpu: record/match cpu processing incoming packets
  * @skc_refcnt: reference count
  *
  * This is the minimal network layer representation of sockets, the header
@@ -212,6 +213,8 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
+   int skc_incoming_cpu;
+
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
@@ -274,7 +277,6 @@ struct cg_proto;
   *@sk_rcvtimeo: %SO_RCVTIMEO setting
   *@sk_sndtimeo: %SO_SNDTIMEO setting
   *@sk_rxhash: flow hash received from netif layer
-  *@sk_incoming_cpu: record cpu processing incoming packets
   *@sk_txhash: computed flow hash for use on transmit
   *@sk_filter: socket filtering instructions
   *@sk_timer: sock cleanup timer
@@ -331,6 +333,7 @@ struct sock {
 #define sk_v6_daddr __sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
 #define sk_cookie  __sk_common.skc_cookie
+#define sk_incoming_cpu __sk_common.skc_incoming_cpu
 
socket_lock_t   sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -353,11 +356,6 @@ struct sock {
 #ifdef CONFIG_RPS
__u32   sk_rxhash;
 #endif
-   u16 sk_incoming_cpu;
-   /* 16bit hole
-* Warned : sk_incoming_cpu can be set from softirq,
-* Do not use this hole without fully understanding possible issues.
-*/
 
__u32   sk_txhash;
 #ifdef CONFIG_NET_RX_BUSY_POLL
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dd1263e4c24..1071f9380250 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -988,6 +988,10 @@ set_rcvbuf:
 sk->sk_max_pacing_rate);
break;
 
+   case SO_INCOMING_CPU:
+   sk->sk_incoming_cpu = val;
+   break;
+
default:
ret = -ENOPROTOOPT;
break;
@@ -2353,6 +2357,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
sk->sk_max_pacing_rate = ~0U;
sk->sk_pacing_rate = ~0U;
+   sk->sk_incoming_cpu = -1;
/*
 * Before updating sk_refcnt, we must commit prior changes to memory
 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bed8886a4b6c..08643a3616af 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -185,6 +185,8 @@ static inline int compute_score(struct sock *sk, struct net 
*net,
return -1;
score += 4;
}
+   if (sk->sk_incoming_cpu == raw_smp_processor_id())
+   score++;
}
return score;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1fc129099ea..24ec14f9825c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -375,7 +375,8 @@ static inline int compute_score(struct sock *sk, struct net 
*net,
return -1;
score += 4;
}
-
+   if (sk->sk_incoming_cpu == raw_smp_processor_id())
+   score++;
return score;
 }
 
@@ -419,6 +420,9 @@ static inline int compute_score2(struct sock *sk, struct 
net

[PATCH v3 net-next 4/4] tcp: shrink tcp_timewait_sock by 8 bytes

2015-10-08 Thread Eric Dumazet

Reducing tcp_timewait_sock from 280 bytes to 272 bytes
allows SLAB to pack 15 objects per page instead of 14 (on x86)
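
The arithmetic behind the 14 versus 15 objects can be checked with a small
sketch. It assumes a 4096-byte page and ignores any per-slab metadata or
alignment padding; model_objs_per_page is a hypothetical helper, not a
kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Whole objects of obj_size bytes that fit in one page of page_size bytes. */
size_t model_objs_per_page(size_t page_size, size_t obj_size)
{
	assert(obj_size > 0);
	return page_size / obj_size;
}
```

At 280 bytes, 14 objects use 3920 bytes and a 15th does not fit; at 272
bytes, 15 objects use 4080 bytes.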

Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h | 4 ++--
 include/net/sock.h  | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e442e6e9a365..86a7edaa6797 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -356,8 +356,8 @@ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
 
 struct tcp_timewait_sock {
struct inet_timewait_sock tw_sk;
-   u32   tw_rcv_nxt;
-   u32   tw_snd_nxt;
+#define tw_rcv_nxt tw_sk.__tw_common.skc_tw_rcv_nxt
+#define tw_snd_nxt tw_sk.__tw_common.skc_tw_snd_nxt
u32   tw_rcv_wnd;
u32   tw_ts_offset;
u32   tw_ts_recent;
diff --git a/include/net/sock.h b/include/net/sock.h
index fce12399fad4..288934da0ae3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -229,6 +229,7 @@ struct sock_common {
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
+   u32 skc_tw_rcv_nxt; /* struct tcp_timewait_sock  */
};
 
atomic_t skc_refcnt;
@@ -237,6 +238,7 @@ struct sock_common {
union {
u32 skc_rxhash;
u32 skc_window_clamp;
+   u32 skc_tw_snd_nxt; /* struct tcp_timewait_sock */
};
/* public: */
 };
-- 
2.6.0.rc2.230.g3dd15c0


[PATCH v3 net-next 3/4] net: shrink struct sock and request_sock by 8 bytes

2015-10-08 Thread Eric Dumazet

A 32-bit hole follows skc_refcnt; use it.
skc_incoming_cpu can also share a union with the request_sock rcv_wnd.

Signed-off-by: Eric Dumazet 
---
 include/net/request_sock.h |  5 ++---
 include/net/sock.h | 14 +-
 net/ipv4/syncookies.c  |  4 ++--
 net/ipv4/tcp_input.c   |  2 +-
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_minisocks.c   | 18 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/syncookies.c  |  4 ++--
 net/ipv6/tcp_ipv6.c|  2 +-
 9 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 6b818b77d5e5..2e73748956d5 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -51,15 +51,14 @@ struct request_sock {
 #define rsk_refcnt __req_common.skc_refcnt
 #define rsk_hash   __req_common.skc_hash
 #define rsk_listener   __req_common.skc_listener
+#define rsk_window_clamp   __req_common.skc_window_clamp
+#define rsk_rcv_wnd    __req_common.skc_rcv_wnd
 
struct request_sock *dl_next;
u16 mss;
u8  num_retrans; /* number of retransmits */
u8  cookie_ts:1; /* syncookie: encode 
tcpopts in timestamp */
u8  num_timeout:7; /* number of timeouts */
-   /* The following two fields can be easily recomputed I think -AK */
-   u32 window_clamp; /* window clamp at 
creation time */
-   u32 rcv_wnd;  /* rcv_wnd offered 
first time */
u32 ts_recent;
struct timer_list   rsk_timer;
const struct request_sock_ops   *rsk_ops;
diff --git a/include/net/sock.h b/include/net/sock.h
index a7818104a73f..fce12399fad4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -226,11 +226,18 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
-   int skc_incoming_cpu;
+   union {
+   int skc_incoming_cpu;
+   u32 skc_rcv_wnd;
+   };
 
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
+   union {
+   u32 skc_rxhash;
+   u32 skc_window_clamp;
+   };
/* public: */
 };
 
@@ -287,7 +294,6 @@ struct cg_proto;
   *@sk_rcvlowat: %SO_RCVLOWAT setting
   *@sk_rcvtimeo: %SO_RCVTIMEO setting
   *@sk_sndtimeo: %SO_SNDTIMEO setting
-  *@sk_rxhash: flow hash received from netif layer
   *@sk_txhash: computed flow hash for use on transmit
   *@sk_filter: socket filtering instructions
   *@sk_timer: sock cleanup timer
@@ -346,6 +352,7 @@ struct sock {
 #define sk_cookie  __sk_common.skc_cookie
 #define sk_incoming_cpu __sk_common.skc_incoming_cpu
 #define sk_flags   __sk_common.skc_flags
+#define sk_rxhash  __sk_common.skc_rxhash
 
socket_lock_t   sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -365,9 +372,6 @@ struct sock {
} sk_backlog;
 #define sk_rmem_alloc sk_backlog.rmem_alloc
int sk_forward_alloc;
-#ifdef CONFIG_RPS
-   __u32   sk_rxhash;
-#endif
 
__u32   sk_txhash;
 #ifdef CONFIG_NET_RX_BUSY_POLL
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8113c30ccf96..0769248bc0db 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -381,10 +381,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct 
sk_buff *skb)
}
 
/* Try to redo what tcp_v4_send_synack did. */
-   req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, 
RTAX_WINDOW);
+   req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, 
RTAX_WINDOW);
 
tcp_select_initial_window(tcp_full_space(sk), req->mss,
- &req->rcv_wnd, &req->window_clamp,
+ &req->rsk_rcv_wnd, &req->rsk_window_clamp,
  ireq->wscale_ok, &rcv_wscale,
  dst_metric(&rt->dst, RTAX_INITRWND));
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ddadb318e850..3b35c3f4d268 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6022,7 +6022,7 @@ static void tcp_openreq_init(struct request_sock *req,
 {
struct inet_request_sock *ireq = inet_rsk(req);
 
-   req->rcv_wnd = 0;   /* So that tcp_send_synack() knows! */
+   req->rsk_rcv_wnd = 0;   /* So that tcp_send_synack() knows! */
req->cookie_ts = 0;
tcp_rsk(req)->rcv_isn = TCP_SKB_CB(

[PATCH v3 net-next 0/4] tcp: better smp listener behavior

2015-10-08 Thread Eric Dumazet

As promised in last patch series, we implement a better SO_REUSEPORT
strategy, based on cpu affinities if selected by the application.

We also moved sk_refcnt out of the cache line containing the lookup
keys, as it was considerably slowing down smp operations because
of false sharing. This was simpler than converting listen sockets
to conventional RCU (to avoid sk_refcnt dirtying)

Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.

Eric Dumazet (4):
  net: SO_INCOMING_CPU setsockopt() support
  net: align sk_refcnt on 128 bytes boundary
  net: shrink struct sock and request_sock by 8 bytes
  tcp: shrink tcp_timewait_sock by 8 bytes

 include/linux/tcp.h  |  4 ++--
 include/net/inet_timewait_sock.h |  2 +-
 include/net/request_sock.h   |  7 +++
 include/net/sock.h   | 41 +++-
 net/core/sock.c  |  5 +
 net/ipv4/inet_hashtables.c   |  2 ++
 net/ipv4/syncookies.c|  4 ++--
 net/ipv4/tcp_input.c |  2 +-
 net/ipv4/tcp_ipv4.c  |  2 +-
 net/ipv4/tcp_minisocks.c | 18 +-
 net/ipv4/tcp_output.c|  2 +-
 net/ipv4/udp.c   |  6 +-
 net/ipv6/inet6_hashtables.c  |  2 ++
 net/ipv6/syncookies.c|  4 ++--
 net/ipv6/tcp_ipv6.c  |  2 +-
 net/ipv6/udp.c   | 11 +++
 16 files changed, 72 insertions(+), 42 deletions(-)

-- 
2.6.0.rc2.230.g3dd15c0


[PATCH v3 net-next 2/4] net: align sk_refcnt on 128 bytes boundary

2015-10-08 Thread Eric Dumazet

sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
This is a performance issue if multiple cpus hit a common socket,
or multiple sockets are chained due to SO_REUSEPORT.

By moving sk_refcnt 8 bytes further, the first 128 bytes of sockets
are mostly read. As they contain the lookup keys, this has
a considerable performance impact, as cpus can cache them.

These 8 bytes are not wasted; we use them as a placeholder
for various fields, depending on the socket type.
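
A user-space sketch of the layout idea, assuming 64-byte cache lines and a
C11 compiler (for the anonymous union). The struct model_sock_common is a
hypothetical simplification: a 120-byte array stands in for the read-mostly
lookup fields, and the real struct has more members between the padding
union and the refcount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_sock_common {
	uint8_t lookup_keys[120];	/* stand-in for the read-mostly lookup fields */
	union {				/* 8 bytes of padding, reused per socket type */
		uint64_t flags;		/* full socket */
		uint64_t listener;	/* request socket (a pointer in the kernel) */
		uint64_t tw_dr;		/* timewait socket (a pointer in the kernel) */
	};
	int32_t refcnt;			/* written on every incoming packet */
};

/* Byte offset of the frequently written reference count. */
size_t model_refcnt_offset(void)
{
	return offsetof(struct model_sock_common, refcnt);
}

/* Index of the cache line holding a given byte offset. */
size_t model_cache_line(size_t offset, size_t line_size)
{
	assert(line_size > 0);
	return offset / line_size;
}
```

In this model the lookup keys and the padding union fill cache lines 0 and
1, and the reference count starts line 2, so writes to it do not invalidate
the lines that lookups read.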

Tested:
 SYN flood hitting a 16 RX queues NIC.
 TCP listener using 16 sockets and SO_REUSEPORT
 and SO_INCOMING_CPU for proper siloing.

 Could process 6.0 Mpps SYN instead of 4.2 Mpps

 Kernel profile looked like :
11.68%  [kernel]  [k] sha_transform
 6.51%  [kernel]  [k] __inet_lookup_listener
 5.07%  [kernel]  [k] __inet_lookup_established
 4.15%  [kernel]  [k] memcpy_erms
 3.46%  [kernel]  [k] ipt_do_table
 2.74%  [kernel]  [k] fib_table_lookup
 2.54%  [kernel]  [k] tcp_make_synack
 2.34%  [kernel]  [k] tcp_conn_request
 2.05%  [kernel]  [k] __netif_receive_skb_core
 2.03%  [kernel]  [k] kmem_cache_alloc

Signed-off-by: Eric Dumazet 
---
 include/net/inet_timewait_sock.h |  2 +-
 include/net/request_sock.h   |  2 +-
 include/net/sock.h   | 17 ++---
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 186f3a1e1b1f..e581fc69129d 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -70,6 +70,7 @@ struct inet_timewait_sock {
 #define tw_dport   __tw_common.skc_dport
 #define tw_num __tw_common.skc_num
 #define tw_cookie  __tw_common.skc_cookie
+#define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
volatile unsigned char  tw_substate;
@@ -88,7 +89,6 @@ struct inet_timewait_sock {
kmemcheck_bitfield_end(flags);
struct timer_list   tw_timer;
struct inet_bind_bucket *tw_tb;
-   struct inet_timewait_death_row *tw_dr;
 };
 #define tw_tclass tw_tos
 
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 95ab5d7aab96..6b818b77d5e5 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -50,9 +50,9 @@ struct request_sock {
struct sock_common  __req_common;
 #define rsk_refcnt __req_common.skc_refcnt
 #define rsk_hash   __req_common.skc_hash
+#define rsk_listener   __req_common.skc_listener
 
struct request_sock *dl_next;
-   struct sock *rsk_listener;
u16 mss;
u8  num_retrans; /* number of retransmits */
u8  cookie_ts:1; /* syncookie: encode 
tcpopts in timestamp */
diff --git a/include/net/sock.h b/include/net/sock.h
index 08abffe32236..a7818104a73f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,6 +150,9 @@ typedef __u64 __bitwise __addrpair;
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_flags: place holder for sk_flags
+ * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  * @skc_incoming_cpu: record/match cpu processing incoming packets
  * @skc_refcnt: reference count
  *
@@ -201,6 +204,16 @@ struct sock_common {
 
atomic64_t  skc_cookie;
 
+   /* following fields are padding to force
+* offset(struct sock, sk_refcnt) == 128 on 64bit arches
+* assuming IPV6 is enabled. We use this padding differently
+* for different kind of 'sockets'
+*/
+   union {
+   unsigned long   skc_flags;
+   struct sock *skc_listener; /* request_sock */
+   struct inet_timewait_death_row *skc_tw_dr; /* 
inet_timewait_sock */
+   };
/*
 * fields between dontcopy_begin/dontcopy_end
 * are not copied in sock_copy()
@@ -246,8 +259,6 @@ struct cg_proto;
   *@sk_pacing_rate: Pacing rate (if supported by transport/packet 
scheduler)
   *@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *@sk_sndbuf: size of send buffer in bytes
-  *@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
-  *   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
   *@sk_no_check_rx: allow zero checksum in RX packets
   *@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
@@ -334,6 +345,7 @@ struct sock {
 #define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
 #define sk_cookie

Re: [PATCH] ethtool: add new emac_regs struct from driver, add new chip types.

2015-10-08 Thread Ben Hutchings

On Fri, 2015-09-25 at 08:15 +0400, Ivan Mikhaylov wrote:
> * add new version of emac_regs struct from driver structure perspective,
>   passing the size from the actual struct size, not from the memory area
>   variable which is set in the dts file.
> * add three types of network chips for the new struct: emac, emac4, emac4sync.
> * add emac4sync processing in print_emac_regs.
> * this commit fixes a problem with the output of MII sections for new driver
>   versions.
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.


Re: [PATCH] ethtool: fix typo in man page

2015-10-08 Thread Ben Hutchings

On Tue, 2015-10-06 at 10:07 +0200, Ivan Vecera wrote:
> Signed-off-by: Ivan Vecera 
> ---
>  ethtool.8.in | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ethtool.8.in b/ethtool.8.in
> index ae56293..eeffa70 100644
> --- a/ethtool.8.in
> +++ b/ethtool.8.in
> @@ -872,7 +872,7 @@ Enables/disables the device support of EEE.
>  Determines whether the device should assert its Tx LPI.
>  .TP
>  .BI advertise \ N
> -Sets the speeds for which the device should advertise EEE capabiliities.
> +Sets the speeds for which the device should advertise EEE capabilities.
>  Values are as for
>  .B \-\-change advertise
>  .TP

Applied, thanks.

Ben.

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.


[PATCH net-next v3 0/4] switchdev: push bridge ageing_time attribute down

2015-10-08 Thread sfeldma

From: Scott Feldman 

Push bridge-level attributes down to switchdev drivers.  This patchset
adds the infrastructure and then pushes, as an example, ageing_time attribute
down from bridge to switchdev (rocker) driver.  Add some range-checking
for ageing_time.

# ip link set dev br0 type bridge ageing_time 1000

# ip link set dev br0 type bridge ageing_time 999
RTNETLINK answers: Numerical result out of range

Up until now, switchdev attrs were port-level attrs, so the netdev used in
switchdev_attr_set() would be a switch port or bond of switch ports.  With
bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
netdev.  The same recursive algorithm is used to visit the leaves of the
stacked drivers to set the attr; it's just that in this case we start one
layer higher in the stack.  Note that not all ports in the bridge may support
setting a bridge-level attribute, so rather than failing the entire set, we'll
skip over those ports returning -EOPNOTSUPP.
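
A user-space sketch of the recursion described above, modelled on the
control flow of patch 2/4. The model_* names and MODEL_F_* flags are
hypothetical stand-ins for the kernel structures, and the driver callback is
reduced to a stored return code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_F_NO_RECURSE	(1u << 0)
#define MODEL_F_SKIP_EOPNOTSUPP	(1u << 1)

struct model_dev {
	int has_ops;			/* nonzero for a port with a driver callback */
	int ops_err;			/* what that callback returns */
	int applied;			/* set when the callback accepted the attr */
	struct model_dev **lower;	/* devices stacked underneath */
	size_t n_lower;
};

int model_attr_set(struct model_dev *dev, unsigned int flags)
{
	int err = -EOPNOTSUPP;
	size_t i;

	assert(dev != NULL);
	if (dev->has_ops) {
		dev->applied = (dev->ops_err == 0);
		return dev->ops_err;
	}

	if (flags & MODEL_F_NO_RECURSE)
		goto done;

	for (i = 0; i < dev->n_lower; i++) {
		err = model_attr_set(dev->lower[i], flags);
		if (err == -EOPNOTSUPP && (flags & MODEL_F_SKIP_EOPNOTSUPP))
			continue;	/* this port lacks support, try the next */
		if (err)
			break;
	}

done:
	if (err == -EOPNOTSUPP && (flags & MODEL_F_SKIP_EOPNOTSUPP))
		err = 0;
	return err;
}

struct model_result {
	int err;
	int applied;	/* number of ports that accepted the attribute */
};

/* Bridge over two ports: the first lacks support, the second accepts. */
struct model_result model_demo(unsigned int flags)
{
	struct model_dev unsupported = { .has_ops = 1, .ops_err = -EOPNOTSUPP };
	struct model_dev capable = { .has_ops = 1, .ops_err = 0 };
	struct model_dev *ports[] = { &unsupported, &capable };
	struct model_dev bridge = { .lower = ports, .n_lower = 2 };
	struct model_result res;

	res.err = model_attr_set(&bridge, flags);
	res.applied = unsupported.applied + capable.applied;
	return res;
}
```

Without the skip flag the walk stops at the first port and the capable port
is never reached; with it, the capable port is programmed and the caller
sees success.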

v2->v3: Per Jiri review: push only ageing_time attr down at this time, and
don't pass raw bridge IFLA_BR_* values; rather use new switchdev attr ID for
ageing_time.

v1->v2: rebase w/ net-next


Scott Feldman (4):
  switchdev: add bridge ageing_time attribute
  switchdev: skip over ports returning -EOPNOTSUPP when recursing ports
  bridge: push bridge setting ageing_time down to switchdev
  rocker: handle setting bridge ageing_time

 drivers/net/ethernet/rocker/rocker.c |   16 
 include/net/switchdev.h  |3 +++
 net/bridge/br_ioctl.c|3 +--
 net/bridge/br_netlink.c  |6 +++---
 net/bridge/br_private.h  |1 +
 net/bridge/br_stp.c  |   23 +++
 net/bridge/br_sysfs_br.c |3 +--
 net/switchdev/switchdev.c|9 -
 8 files changed, 56 insertions(+), 8 deletions(-)

-- 
1.7.10.4


Re: [PATCH net-next v2 1/4] switchdev: add bridge attributes

2015-10-08 Thread Scott Feldman

On Thu, Oct 8, 2015 at 1:39 AM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 08:04:40AM CEST, sfel...@gmail.com wrote:
>>From: Scott Feldman 
>>
>>Setting the stage to push bridge-level attributes down to port driver so
>>hardware can be programmed accordingly.  Bridge-level attribute example is
>>ageing_time.  This is a per-bridge attribute, not a per-bridge-port attr.
>>
>>Signed-off-by: Scott Feldman 
>>---
>> include/net/switchdev.h  |5 +
>> include/uapi/linux/if_link.h |2 +-
>> 2 files changed, 6 insertions(+), 1 deletion(-)
>>
>>diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>index 89266a3..8d92cd0 100644
>>--- a/include/net/switchdev.h
>>+++ b/include/net/switchdev.h
>>@@ -43,6 +43,7 @@ enum switchdev_attr_id {
>>   SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>>   SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>   SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
>>+  SWITCHDEV_ATTR_ID_BRIDGE,
>> };
>>
>> struct switchdev_attr {
>>@@ -52,6 +53,10 @@ struct switchdev_attr {
>>   struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */
>>   u8 stp_state;   /* PORT_STP_STATE */
>>   unsigned long brport_flags; /* PORT_BRIDGE_FLAGS */
>>+  struct switchdev_attr_bridge {  /* BRIDGE */
>>+  enum ifla_br attr;
>
> I don't like pushing down IFLA_BR_* values through switchdev. I think
> it might be better to just introduce:
> SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME
>
> and "u32 ageing_time" here. Something similar to stp_state. Much easier
> to read and does not give blank cheque for passing any bridge IFLA_BR_*
> down to drivers.
>
> It also aligns with bridge code nicely, I believe.

Done, v3 sent with your suggested change.

Re: [PATCH] ethtool: Add vmxnet3 register dump support

2015-10-08 Thread Ben Hutchings

On Wed, 2015-09-23 at 15:19 -0700, Shrikrishna Khare wrote:
> This adds support for dumping vmxnet3 registers in a readable format.
> 
> Signed-off-by: Shrikrishna Khare 
> Signed-off-by: Bhavesh Davda 
> Acked-by: Srividya Murali 
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.



[PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-08 Thread sfeldma

From: Scott Feldman 

Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
support setting ageing_time (or setting bridge attrs in general).

If push fails, don't update ageing_time in bridge and return err to user.

If push succeeds, update ageing_time in bridge and run gc_timer now to
recalibrate when to run gc_timer next, based on new ageing_time.
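
The ordering described above (validate, push, then commit) can be sketched in
user space as follows. All model_* names are hypothetical. The bounds are
assumptions: the lower one is chosen to match the cover letter example,
where 1000 is accepted and 999 is rejected. The clock_t to jiffies
conversion and the gc_timer rerun are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MIN_AGEING_TIME	1000UL		/* assumed lower bound */
#define MODEL_MAX_AGEING_TIME	100000000UL	/* assumed upper bound */

struct model_bridge {
	unsigned long ageing_time;
	int push_err;	/* simulated result of pushing to the switch ports */
};

int model_set_ageing_time(struct model_bridge *br, unsigned long t)
{
	assert(br != NULL);
	if (t < MODEL_MIN_AGEING_TIME || t > MODEL_MAX_AGEING_TIME)
		return -ERANGE;

	if (br->push_err)		/* stands in for switchdev_port_attr_set() */
		return br->push_err;

	br->ageing_time = t;		/* commit only after the push succeeded */
	return 0;
}

struct model_outcome {
	int err;
	unsigned long ageing_time;	/* bridge value after the attempt */
};

/* Demo helper: bridge starts at 30000, then one set attempt is made. */
struct model_outcome model_try(int push_err, unsigned long t)
{
	struct model_bridge br = { .ageing_time = 30000UL, .push_err = push_err };
	struct model_outcome out;

	out.err = model_set_ageing_time(&br, t);
	out.ageing_time = br.ageing_time;
	return out;
}
```

Either failure path leaves the bridge value untouched, so software and
hardware state cannot diverge in the model.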

Signed-off-by: Scott Feldman 
Signed-off-by: Jiri Pirko 
---
 net/bridge/br_ioctl.c|3 +--
 net/bridge/br_netlink.c  |6 +++---
 net/bridge/br_private.h  |1 +
 net/bridge/br_stp.c  |   23 +++
 net/bridge/br_sysfs_br.c |3 +--
 5 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 8d423bc..263b4de 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -200,8 +200,7 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
-   br->ageing_time = clock_t_to_jiffies(args[1]);
-   return 0;
+   return br_set_ageing_time(br, args[1]);
 
case BRCTL_GET_PORT_INFO:
{
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index d78b442..544ab96 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -870,9 +870,9 @@ static int br_changelink(struct net_device *brdev, struct 
nlattr *tb[],
}
 
if (data[IFLA_BR_AGEING_TIME]) {
-   u32 ageing_time = nla_get_u32(data[IFLA_BR_AGEING_TIME]);
-
-   br->ageing_time = clock_t_to_jiffies(ageing_time);
+   err = br_set_ageing_time(br, 
nla_get_u32(data[IFLA_BR_AGEING_TIME]));
+   if (err)
+   return err;
}
 
if (data[IFLA_BR_STP_STATE]) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 09d3ecb..ba0c67b 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -882,6 +882,7 @@ void __br_set_forward_delay(struct net_bridge *br, unsigned 
long t);
 int br_set_forward_delay(struct net_bridge *br, unsigned long x);
 int br_set_hello_time(struct net_bridge *br, unsigned long x);
 int br_set_max_age(struct net_bridge *br, unsigned long x);
+int br_set_ageing_time(struct net_bridge *br, u32 ageing_time);
 
 
 /* br_stp_if.c */
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 3a982c0..db6d243de 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -566,6 +566,29 @@ int br_set_max_age(struct net_bridge *br, unsigned long 
val)
 
 }
 
+int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
+{
+   struct switchdev_attr attr = {
+   .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
+   .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
+   .u.ageing_time = ageing_time,
+   };
+   unsigned long t = clock_t_to_jiffies(ageing_time);
+   int err;
+
+   if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
+   return -ERANGE;
+
+   err = switchdev_port_attr_set(br->dev, &attr);
+   if (err)
+   return err;
+
+   br->ageing_time = t;
+   mod_timer(&br->gc_timer, jiffies);
+
+   return 0;
+}
+
 void __br_set_forward_delay(struct net_bridge *br, unsigned long t)
 {
br->bridge_forward_delay = t;
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 4c97fc5..04ef192 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -102,8 +102,7 @@ static ssize_t ageing_time_show(struct device *d,
 
 static int set_ageing_time(struct net_bridge *br, unsigned long val)
 {
-   br->ageing_time = clock_t_to_jiffies(val);
-   return 0;
+   return br_set_ageing_time(br, val);
 }
 
 static ssize_t ageing_time_store(struct device *d,
-- 
1.7.10.4


[PATCH net-next v3 4/4] rocker: handle setting bridge ageing_time

2015-10-08 Thread sfeldma

From: Scott Feldman 

The FDB cleanup timer will get rescheduled to re-evaluate FDB entries
based on new ageing_time.

Signed-off-by: Scott Feldman 
---
 drivers/net/ethernet/rocker/rocker.c |   16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/rocker/rocker.c 
b/drivers/net/ethernet/rocker/rocker.c
index cf91ffc..eafa907 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4361,6 +4361,18 @@ static int rocker_port_brport_flags_set(struct 
rocker_port *rocker_port,
return err;
 }
 
+static int rocker_port_bridge_ageing_time(struct rocker_port *rocker_port,
+ struct switchdev_trans *trans,
+ u32 ageing_time)
+{
+   if (!switchdev_trans_ph_prepare(trans)) {
+   rocker_port->ageing_time = clock_t_to_jiffies(ageing_time);
+   mod_timer(&rocker_port->rocker->fdb_cleanup_timer, jiffies);
+   }
+
+   return 0;
+}
+
 static int rocker_port_attr_set(struct net_device *dev,
struct switchdev_attr *attr,
struct switchdev_trans *trans)
@@ -4378,6 +4390,10 @@ static int rocker_port_attr_set(struct net_device *dev,
err = rocker_port_brport_flags_set(rocker_port, trans,
   attr->u.brport_flags);
break;
+   case SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME:
+   err = rocker_port_bridge_ageing_time(rocker_port, trans,
+attr->u.ageing_time);
+   break;
default:
err = -EOPNOTSUPP;
break;
-- 
1.7.10.4


[PATCH net-next v3 2/4] switchdev: skip over ports returning -EOPNOTSUPP when recursing ports

2015-10-08 Thread sfeldma

From: Scott Feldman 

This allows us to recurse over all the ports, skipping over ports that lack
support.  Without the change, the recursion would stop at the first
unsupported port.

Signed-off-by: Scott Feldman 
---
 include/net/switchdev.h   |1 +
 net/switchdev/switchdev.c |9 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 61f129b..1ce7083 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -16,6 +16,7 @@
 #include 
 
 #define SWITCHDEV_F_NO_RECURSE BIT(0)
+#define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1)
 
 struct switchdev_trans_item {
struct list_head list;
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 6e4a4f9..7a9ab90 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -147,7 +147,7 @@ static int __switchdev_port_attr_set(struct net_device *dev,
return ops->switchdev_port_attr_set(dev, attr, trans);
 
if (attr->flags & SWITCHDEV_F_NO_RECURSE)
-   return err;
+   goto done;
 
/* Switch device port(s) may be stacked under
 * bond/team/vlan dev, so recurse down to set attr on
@@ -156,10 +156,17 @@ static int __switchdev_port_attr_set(struct net_device *dev,
 
netdev_for_each_lower_dev(dev, lower_dev, iter) {
err = __switchdev_port_attr_set(lower_dev, attr, trans);
+   if (err == -EOPNOTSUPP &&
+   attr->flags & SWITCHDEV_F_SKIP_EOPNOTSUPP)
+   continue;
if (err)
break;
}
 
+done:
+   if (err == -EOPNOTSUPP && attr->flags & SWITCHDEV_F_SKIP_EOPNOTSUPP)
+   err = 0;
+
return err;
 }
 
-- 
1.7.10.4


[PATCH net-next v3 1/4] switchdev: add bridge ageing_time attribute

2015-10-08 Thread sfeldma

From: Scott Feldman 

Set the stage for pushing bridge-level attributes down to the port driver so
that hardware can be programmed accordingly.  One example of a bridge-level
attribute is ageing_time: it is a per-bridge attribute, not a per-bridge-port
attribute.
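
As a rough illustration of how a port driver might consume such an
attribute, here is a small user-space sketch of a tagged-union attribute and
a handler that rejects ids it does not know. The names (struct attr,
struct port, port_attr_set(), ATTR_*) are made up for the example and only
mirror the shape of struct switchdev_attr.

```c
#include <errno.h>
#include <stdint.h>

enum attr_id {
	ATTR_PORT_STP_STATE,
	ATTR_PORT_BRIDGE_FLAGS,
	ATTR_BRIDGE_AGEING_TIME,	/* bridge-wide, not per port */
};

/* The id says which member of the union is valid. */
struct attr {
	enum attr_id id;
	union {
		uint8_t stp_state;
		unsigned long brport_flags;
		uint32_t ageing_time;
	} u;
};

/* Hypothetical per-port driver state. */
struct port {
	uint32_t ageing_time;
};

/* Accept the attributes this "driver" knows, refuse the rest. */
static int port_attr_set(struct port *port, const struct attr *attr)
{
	switch (attr->id) {
	case ATTR_BRIDGE_AGEING_TIME:
		port->ageing_time = attr->u.ageing_time;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Although the value is per-bridge, each port still receives it, so a driver
is free to keep one copy per port or a single copy per switch.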

Signed-off-by: Scott Feldman 
---
 include/net/switchdev.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 89266a3..61f129b 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -43,6 +43,7 @@ enum switchdev_attr_id {
SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
SWITCHDEV_ATTR_ID_PORT_STP_STATE,
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
+   SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
 };
 
 struct switchdev_attr {
@@ -52,6 +53,7 @@ struct switchdev_attr {
struct netdev_phys_item_id ppid;    /* PORT_PARENT_ID */
u8 stp_state;                       /* PORT_STP_STATE */
unsigned long brport_flags;         /* PORT_BRIDGE_FLAGS */
+   u32 ageing_time;                /* BRIDGE_AGEING_TIME */
} u;
 };
 
-- 
1.7.10.4


[PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Kosuke Tatsukawa

There are several places in net/sunrpc/svcsock.c which call
waitqueue_active() without first issuing a memory barrier.  Add a memory
barrier just as in wq_has_sleeper().

I found this issue while looking through the Linux source code for
places that call waitqueue_active() before wake_up*() without a
preceding memory barrier, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c.  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).
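
To make the ordering requirement concrete, here is a user-space sketch
using C11 atomics rather than kernel primitives. It assumes a simplified
model in which a counter stands in for the wait queue; struct waitq,
waker_should_wake() and sleeper_must_block() are hypothetical names, and
the seq_cst fence plays the role that smp_mb() plays in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct waitq {
	atomic_bool data_ready;	/* condition the sleeper waits for */
	atomic_int sleepers;	/* stand-in for the wait queue list */
};

/* Waker: publish the condition, full barrier, then look for sleepers. */
static bool waker_should_wake(struct waitq *wq)
{
	atomic_store_explicit(&wq->data_ready, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* role of smp_mb() */
	return atomic_load_explicit(&wq->sleepers, memory_order_relaxed) > 0;
}

/* Sleeper: enqueue, full barrier, then re-check the condition. */
static bool sleeper_must_block(struct waitq *wq)
{
	atomic_fetch_add_explicit(&wq->sleepers, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&wq->data_ready, memory_order_relaxed)) {
		/* Condition already true: dequeue instead of sleeping. */
		atomic_fetch_sub_explicit(&wq->sleepers, 1,
					  memory_order_relaxed);
		return false;
	}
	return true;
}
```

With a full fence on both sides, at least one party observes the other's
write: either the waker sees the sleeper and wakes it, or the sleeper sees
the condition and does not block. Dropping the waker's fence allows both
to miss each other, which is the lost wakeup the patch guards against.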

Signed-off-by: Kosuke Tatsukawa 
---
v2:
  - Fixed compiler warnings caused by type mismatch
v1:
  - https://lkml.org/lkml/2015/10/8/993
---
 net/sunrpc/svcsock.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0c81202..ec19444 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -414,6 +414,7 @@ static void svc_udp_data_ready(struct sock *sk)
set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
svc_xprt_enqueue(&svsk->sk_xprt);
}
+   smp_mb();
if (wq && waitqueue_active(wq))
wake_up_interruptible(wq);
 }
@@ -432,6 +433,7 @@ static void svc_write_space(struct sock *sk)
svc_xprt_enqueue(&svsk->sk_xprt);
}
 
+   smp_mb();
if (wq && waitqueue_active(wq)) {
dprintk("RPC svc_write_space: someone sleeping on %p\n",
   svsk);
@@ -787,6 +789,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
}
 
wq = sk_sleep(sk);
+   smp_mb();
if (wq && waitqueue_active(wq))
wake_up_interruptible_all(wq);
 }
@@ -808,6 +811,7 @@ static void svc_tcp_state_change(struct sock *sk)
set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
svc_xprt_enqueue(&svsk->sk_xprt);
}
+   smp_mb();
if (wq && waitqueue_active(wq))
wake_up_interruptible_all(wq);
 }
@@ -823,6 +827,7 @@ static void svc_tcp_data_ready(struct sock *sk)
set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
svc_xprt_enqueue(&svsk->sk_xprt);
}
+   smp_mb();
if (wq && waitqueue_active(wq))
wake_up_interruptible(wq);
 }
@@ -1594,6 +1599,7 @@ static void svc_sock_detach(struct svc_xprt *xprt)
sk->sk_write_space = svsk->sk_owspace;
 
wq = sk_sleep(sk);
+   smp_mb();
if (wq && waitqueue_active(wq))
wake_up_interruptible(wq);
 }
-- 
1.7.1

[net-next 13/16] i40e: refactor code to remove indent

2015-10-08 Thread Jeff Kirsher

From: Jesse Brandeburg 

I found an avoidable level of indentation: almost the whole function body
sits inside an if block.  Reverse the condition, return early, and move the
code back a tab.
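
The same refactoring in miniature, as a user-space sketch with made-up
names (struct flusher, ticks_after(), flush_if_due()). ticks_after() is
written in the spirit of the kernel's time_after() and is an assumption of
the example, not a copy of the driver code.

```c
#include <stdbool.h>

struct flusher {
	unsigned long last_flush;	/* tick of the previous flush */
	unsigned long min_interval;	/* minimum ticks between flushes */
	int flush_count;
};

/* Wrap-safe "a is strictly after b" for free-running tick counters. */
static bool ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static void flush_if_due(struct flusher *f, unsigned long now)
{
	/* Guard clause: bail out early instead of nesting the body. */
	if (!ticks_after(now, f->last_flush + f->min_interval))
		return;

	f->last_flush = now;
	f->flush_count++;
}
```

The behaviour is unchanged compared with wrapping the body in the positive
test; only the nesting depth differs.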

Change-ID: I9989c8750ee61678fbf96a3b0fd7bf7cc7ef300a
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 82 +++--
 1 file changed, 42 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 12b90fa..c46d814 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5679,49 +5679,51 @@ static void i40e_fdir_flush_and_replay(struct i40e_pf *pf)
if (!(pf->flags & (I40E_FLAG_FD_SB_ENABLED | I40E_FLAG_FD_ATR_ENABLED)))
return;
 
-   if (time_after(jiffies, pf->fd_flush_timestamp +
-   (I40E_MIN_FD_FLUSH_INTERVAL * HZ))) {
-   /* If the flush is happening too quick and we have mostly
-* SB rules we should not re-enable ATR for some time.
-*/
-   min_flush_time = pf->fd_flush_timestamp
-   + (I40E_MIN_FD_FLUSH_SB_ATR_UNSTABLE * HZ);
-   fd_room = pf->fdir_pf_filter_count - pf->fdir_pf_active_filters;
+   if (!time_after(jiffies, pf->fd_flush_timestamp +
+(I40E_MIN_FD_FLUSH_INTERVAL * HZ)))
+   return;
 
-   if (!(time_after(jiffies, min_flush_time)) &&
-   (fd_room < I40E_FDIR_BUFFER_HEAD_ROOM_FOR_ATR)) {
-   if (I40E_DEBUG_FD & pf->hw.debug_mask)
-   dev_info(&pf->pdev->dev, "ATR disabled, not enough FD filter space.\n");
-   disable_atr = true;
-   }
+   /* If the flush is happening too quick and we have mostly SB rules we
+* should not re-enable ATR for some time.
+*/
+   min_flush_time = pf->fd_flush_timestamp +
+(I40E_MIN_FD_FLUSH_SB_ATR_UNSTABLE * HZ);
+   fd_room = pf->fdir_pf_filter_count - pf->fdir_pf_active_filters;
 
-   pf->fd_flush_timestamp = jiffies;
-   pf->flags &= ~I40E_FLAG_FD_ATR_ENABLED;
-   /* flush all filters */
-   wr32(&pf->hw, I40E_PFQF_CTL_1,
-I40E_PFQF_CTL_1_CLEARFDTABLE_MASK);
-   i40e_flush(&pf->hw);
-   pf->fd_flush_cnt++;
-   pf->fd_add_err = 0;
-   do {
-   /* Check FD flush status every 5-6msec */
-   usleep_range(5000, 6000);
-   reg = rd32(&pf->hw, I40E_PFQF_CTL_1);
-   if (!(reg & I40E_PFQF_CTL_1_CLEARFDTABLE_MASK))
-   break;
-   } while (flush_wait_retry--);
-   if (reg & I40E_PFQF_CTL_1_CLEARFDTABLE_MASK) {
-   dev_warn(&pf->pdev->dev, "FD table did not flush, needs more time\n");
-   } else {
-   /* replay sideband filters */
-   i40e_fdir_filter_restore(pf->vsi[pf->lan_vsi]);
-   if (!disable_atr)
-   pf->flags |= I40E_FLAG_FD_ATR_ENABLED;
-   clear_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state);
-   if (I40E_DEBUG_FD & pf->hw.debug_mask)
-   dev_info(&pf->pdev->dev, "FD Filter table flushed and FD-SB replayed.\n");
-   }
+   if (!(time_after(jiffies, min_flush_time)) &&
+   (fd_room < I40E_FDIR_BUFFER_HEAD_ROOM_FOR_ATR)) {
+   if (I40E_DEBUG_FD & pf->hw.debug_mask)
+   dev_info(&pf->pdev->dev, "ATR disabled, not enough FD filter space.\n");
+   disable_atr = true;
+   }
+
+   pf->fd_flush_timestamp = jiffies;
+   pf->flags &= ~I40E_FLAG_FD_ATR_ENABLED;
+   /* flush all filters */
+   wr32(&pf->hw, I40E_PFQF_CTL_1,
+I40E_PFQF_CTL_1_CLEARFDTABLE_MASK);
+   i40e_flush(&pf->hw);
+   pf->fd_flush_cnt++;
+   pf->fd_add_err = 0;
+   do {
+   /* Check FD flush status every 5-6msec */
+   usleep_range(5000, 6000);
+   reg = rd32(&pf->hw, I40E_PFQF_CTL_1);
+   if (!(reg & I40E_PFQF_CTL_1_CLEARFDTABLE_MASK))
+   break;
+   } while (flush_wait_retry--);
+   if (reg & I40E_PFQF_CTL_1_CLEARFDTABLE_MASK) {
+   dev_warn(&pf->pdev->dev, "FD table did not flush, needs more time\n");
+   } else {
+   /* replay sideband filters */
+   i40e_fdir_filter_restore(pf->vsi[pf->lan_vsi]);
+   if (!disable_atr)
+   pf->flags |= I40E_FLAG_FD_ATR_ENABLED;
+   clear_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state);
+   if (I40E_DEBUG_FD & p

[net-next 12/16] i40e/i40evf: clean up some code

2015-10-08 Thread Jeff Kirsher

From: Jesse Brandeburg 

Add missing spaces after declarations, remove another __func__ use,
remove unnecessary braces, unneeded breaks, and useless returns,
and generally fix up some code.

Change-ID: Ie715d6b64976c50e1c21531685fe0a2bd38c4244
Signed-off-by: Jesse Brandeburg 
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.c  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_common.c  |   9 +-
 drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c  |   1 +
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 124 -
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c|   1 -
 drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|  38 ---
 drivers/net/ethernet/intel/i40e/i40e_ptp.c |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   6 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  13 ++-
 drivers/net/ethernet/intel/i40evf/i40e_adminq.c|   3 +-
 drivers/net/ethernet/intel/i40evf/i40e_common.c|   8 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |   6 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |   1 +
 drivers/net/ethernet/intel/i40evf/i40evf.h |   4 -
 17 files changed, 124 insertions(+), 103 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.c b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
index 287cb8d..fa2e916 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
@@ -683,8 +683,7 @@ static u16 i40e_clean_asq(struct i40e_hw *hw)
details = I40E_ADMINQ_DETAILS(*asq, ntc);
while (rd32(hw, hw->aq.asq.head) != ntc) {
i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
-  "%s: ntc %d head %d.\n", __func__, ntc,
-  rd32(hw, hw->aq.asq.head));
+  "ntc %d head %d.\n", ntc, rd32(hw, hw->aq.asq.head));
 
if (details->callback) {
I40E_ADMINQ_CALLBACK cb_func =
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 2839ea5..2d012d9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1035,7 +1035,7 @@ i40e_status i40e_get_mac_addr(struct i40e_hw *hw, u8 *mac_addr)
status = i40e_aq_mac_address_read(hw, &flags, &addrs, NULL);
 
if (flags & I40E_AQC_LAN_ADDR_VALID)
-   memcpy(mac_addr, &addrs.pf_lan_mac, sizeof(addrs.pf_lan_mac));
+   ether_addr_copy(mac_addr, addrs.pf_lan_mac);
 
return status;
 }
@@ -1058,7 +1058,7 @@ i40e_status i40e_get_port_mac_addr(struct i40e_hw *hw, u8 *mac_addr)
return status;
 
if (flags & I40E_AQC_PORT_ADDR_VALID)
-   memcpy(mac_addr, &addrs.port_mac, sizeof(addrs.port_mac));
+   ether_addr_copy(mac_addr, addrs.port_mac);
else
status = I40E_ERR_INVALID_MAC_ADDR;
 
@@ -1116,7 +1116,7 @@ i40e_status i40e_get_san_mac_addr(struct i40e_hw *hw, u8 *mac_addr)
return status;
 
if (flags & I40E_AQC_SAN_ADDR_VALID)
-   memcpy(mac_addr, &addrs.pf_san_mac, sizeof(addrs.pf_san_mac));
+   ether_addr_copy(mac_addr, addrs.pf_san_mac);
else
status = I40E_ERR_INVALID_MAC_ADDR;
 
@@ -2363,6 +2363,7 @@ i40e_status i40e_aq_get_veb_parameters(struct i40e_hw *hw,
*vebs_free = le16_to_cpu(cmd_resp->vebs_free);
if (floating) {
u16 flags = le16_to_cpu(cmd_resp->veb_flags);
+
if (flags & I40E_AQC_ADD_VEB_FLOATING)
*floating = true;
else
@@ -3777,7 +3778,7 @@ i40e_status i40e_aq_add_rem_control_packet_filter(struct i40e_hw *hw,
}
 
if (mac_addr)
-   memcpy(cmd->mac, mac_addr, ETH_ALEN);
+   ether_addr_copy(cmd->mac, mac_addr);
 
cmd->etype = cpu_to_le16(ethtype);
cmd->flags = cpu_to_le16(flags);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c b/drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c
index dbadad7..7c42d13 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c
@@ -236,6 +236,7 @@ static void i40e_dcbnl_del_app(struct i40e_pf *pf,
   struct i40e_dcb_app_priority_table *app)
 {
int v, err;
+
for (v = 0; v < pf->num_alloc_vsi; v++) {
if (pf->vsi[v] && pf->vsi[v]->netdev) {
err = i40e_dcbnl_vsi_del_app(pf->vsi[v], app);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index 9f9d842..c1dd24

[net-next 14/16] i40evf: use capabilities flags properly

2015-10-08 Thread Jeff Kirsher

From: Mitch Williams 

Use the capabilities passed to us by the PF driver to control VF driver
behavior. In the process, clean up the VLAN add/remove code so it's not
a horrible morass of ifdefs.
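
A small sketch of the capability gating idea, in user-space C. The
VLAN_ALLOWED() macro itself is not shown in this excerpt, so the names
here (VF_CAP_VLAN, struct vf_adapter, vf_vlan_add(), vf_vlan_kill()) are
purely illustrative and make no claim about the real definition.

```c
#include <errno.h>
#include <stdint.h>

#define VF_CAP_VLAN (1u << 0)	/* hypothetical capability bit */

struct vf_adapter {
	uint32_t caps;		/* capabilities granted by the PF */
	unsigned int vlans;	/* number of VLAN filters installed */
};

/* Refuse the request at run time if the PF did not grant VLAN support. */
static int vf_vlan_add(struct vf_adapter *a)
{
	if (!(a->caps & VF_CAP_VLAN))
		return -EIO;
	a->vlans++;
	return 0;
}

static int vf_vlan_kill(struct vf_adapter *a)
{
	if (!(a->caps & VF_CAP_VLAN))
		return -EIO;
	if (a->vlans)
		a->vlans--;
	return 0;
}
```

Checking a runtime flag in one place replaces compile-time conditionals
scattered through the add/remove paths.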

Change-ID: I1050eaf12b658a26fea6813047c9964163c70a73
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 664fde6..1f99930 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -730,6 +730,8 @@ static int i40evf_vlan_rx_add_vid(struct net_device *netdev,
 {
struct i40evf_adapter *adapter = netdev_priv(netdev);
 
+   if (!VLAN_ALLOWED(adapter))
+   return -EIO;
if (i40evf_add_vlan(adapter, vid) == NULL)
return -ENOMEM;
return 0;
@@ -745,8 +747,11 @@ static int i40evf_vlan_rx_kill_vid(struct net_device *netdev,
 {
struct i40evf_adapter *adapter = netdev_priv(netdev);
 
-   i40evf_del_vlan(adapter, vid);
-   return 0;
+   if (VLAN_ALLOWED(adapter)) {
+   i40evf_del_vlan(adapter, vid);
+   return 0;
+   }
+   return -EIO;
 }
 
 /**
-- 
2.4.3


[net-next 01/16] i40e: fix erroneous WARN_ON

2015-10-08 Thread Jeff Kirsher

From: Jesse Brandeburg 

The driver was issuing a WARN_ON during ring size changes
because the code was cloning the rx_ring struct but
not zeroing out the pointers before allocating new memory.

Zero out the pointers in the cloned copy before allocating
new memory for them.  In this case the code was correctly
avoiding memory leaks but still triggering the warning.
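
The pattern can be sketched in user space as follows. This assumes, as the
description implies, that the setup routine warns when it is handed a ring
whose buffer pointers are already set; struct ring, ring_setup(),
ring_clone_resized() and the warnings counter are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

struct ring {
	unsigned int count;
	int *desc;	/* descriptor memory */
	int *bi;	/* buffer info array */
};

static int warnings;	/* stand-in for WARN_ON() firing */

/* Allocate ring memory; complain if pointers would be overwritten. */
static int ring_setup(struct ring *r)
{
	if (r->bi || r->desc)
		warnings++;
	r->bi = calloc(r->count, sizeof(*r->bi));
	r->desc = calloc(r->count, sizeof(*r->desc));
	if (!r->bi || !r->desc) {
		free(r->bi);
		free(r->desc);
		r->bi = NULL;
		r->desc = NULL;
		return -ENOMEM;
	}
	return 0;
}

/* Clone a live ring with a new size; the copy gets its own memory. */
static int ring_clone_resized(const struct ring *old, struct ring *copy,
			      unsigned int new_count)
{
	*copy = *old;		/* struct copy also copies the pointers */
	copy->count = new_count;
	copy->desc = NULL;	/* the old ring still owns these */
	copy->bi = NULL;
	return ring_setup(copy);
}
```

Without the two NULL assignments the copy would carry the live ring's
pointers into ring_setup(), which is harmless for ownership here but trips
the warning.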

Change-ID: I186dd493948e9b7254ab0593d4aad8b68808918d
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index ffa9431..ef471fc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1176,6 +1176,11 @@ static int i40e_set_ringparam(struct net_device *netdev,
/* clone ring and setup updated count */
tx_rings[i] = *vsi->tx_rings[i];
tx_rings[i].count = new_tx_count;
+   /* the desc and bi pointers will be reallocated in the
+* setup call
+*/
+   tx_rings[i].desc = NULL;
+   tx_rings[i].rx_bi = NULL;
err = i40e_setup_tx_descriptors(&tx_rings[i]);
if (err) {
while (i) {
@@ -1206,6 +1211,11 @@ static int i40e_set_ringparam(struct net_device *netdev,
/* clone ring and setup updated count */
rx_rings[i] = *vsi->rx_rings[i];
rx_rings[i].count = new_rx_count;
+   /* the desc and bi pointers will be reallocated in the
+* setup call
+*/
+   rx_rings[i].desc = NULL;
+   rx_rings[i].rx_bi = NULL;
err = i40e_setup_rx_descriptors(&rx_rings[i]);
if (err) {
while (i) {
-- 
2.4.3


[net-next 04/16] i40e: Add parsing for CEE DCBX TLVs

2015-10-08 Thread Jeff Kirsher

From: Neerav Parikh 

This patch adds parsing for CEE DCBX TLVs from the LLDP MIB.

While the driver gets the DCB CEE operational configuration from firmware
using the "Get CEE DCBX Oper Config" AQ command, there is also a need to get
the CEE DesiredCfg transmitted by firmware and the DCB configuration
received from the peer, for debug and other application purposes.
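
As an illustration of the kind of parsing involved, here is a stand-alone
sketch that unpacks a 4-octet CEE Priority Group table into eight
per-priority entries, taking the high nibble of each octet first. The
mask and function names are invented for the example and the nibble order
is an assumption based on the table layout in the patch comments.

```c
#include <stdint.h>

#define PG_PRIO_HI_MASK  0xF0u	/* even-numbered priority */
#define PG_PRIO_HI_SHIFT 4
#define PG_PRIO_LO_MASK  0x0Fu	/* odd-numbered priority */

/* Each octet carries two 4-bit priority group ids, high nibble first. */
static void parse_pg_table(const uint8_t buf[4], uint8_t prio[8])
{
	int i;

	for (i = 0; i < 4; i++) {
		prio[i * 2] = (uint8_t)((buf[i] & PG_PRIO_HI_MASK) >>
					PG_PRIO_HI_SHIFT);
		prio[i * 2 + 1] = (uint8_t)(buf[i] & PG_PRIO_LO_MASK);
	}
}
```

For the input octets 0x01 0x23 0x45 0x67 this yields the priority table
0, 1, 2, 3, 4, 5, 6, 7.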

Change-ID: I9140edf1a25a2852c7eff805d81e5eff6266178d
Signed-off-by: Neerav Parikh 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_dcb.c | 179 +
 drivers/net/ethernet/intel/i40e/i40e_dcb.h |  39 +++
 2 files changed, 218 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_dcb.c b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
index 9aee35d..89e60e3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_dcb.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
@@ -292,6 +292,182 @@ static void i40e_parse_ieee_tlv(struct i40e_lldp_org_tlv *tlv,
 }
 
 /**
+ * i40e_parse_cee_pgcfg_tlv
+ * @tlv: CEE DCBX PG CFG TLV
+ * @dcbcfg: Local store to update ETS CFG data
+ *
+ * Parses CEE DCBX PG CFG TLV
+ **/
+static void i40e_parse_cee_pgcfg_tlv(struct i40e_cee_feat_tlv *tlv,
+struct i40e_dcbx_config *dcbcfg)
+{
+   struct i40e_dcb_ets_config *etscfg;
+   u8 *buf = tlv->tlvinfo;
+   u16 offset = 0;
+   u8 priority;
+   int i;
+
+   etscfg = &dcbcfg->etscfg;
+
+   if (tlv->en_will_err & I40E_CEE_FEAT_TLV_WILLING_MASK)
+   etscfg->willing = 1;
+
+   etscfg->cbs = 0;
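+   /* Priority Group Table (4 octets)
+    * Octets:|    1    |    2    |    3    |    4    |
+    *        -----------------------------------------
+    *        |pri0|pri1|pri2|pri3|pri4|pri5|pri6|pri7|
+    *        -----------------------------------------
+    *   Bits:|7  4|3  0|7  4|3  0|7  4|3  0|7  4|3  0|
+    *        -----------------------------------------
+    */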
+   for (i = 0; i < 4; i++) {
+   priority = (u8)((buf[offset] & I40E_CEE_PGID_PRIO_1_MASK) >>
+I40E_CEE_PGID_PRIO_1_SHIFT);
+   etscfg->prioritytable[i * 2] =  priority;
+   priority = (u8)((buf[offset] & I40E_CEE_PGID_PRIO_0_MASK) >>
+I40E_CEE_PGID_PRIO_0_SHIFT);
+   etscfg->prioritytable[i * 2 + 1] = priority;
+   offset++;
+   }
+
+   /* PG Percentage Table (8 octets)
+    * Octets:| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+    *        ---------------------------------
+    *        |pg0|pg1|pg2|pg3|pg4|pg5|pg6|pg7|
+    *        ---------------------------------
+    */
+   for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++)
+   etscfg->tcbwtable[i] = buf[offset++];
+
+   /* Number of TCs supported (1 octet) */
+   etscfg->maxtcs = buf[offset];
+}
+
+/**
+ * i40e_parse_cee_pfccfg_tlv
+ * @tlv: CEE DCBX PFC CFG TLV
+ * @dcbcfg: Local store to update PFC CFG data
+ *
+ * Parses CEE DCBX PFC CFG TLV
+ **/
+static void i40e_parse_cee_pfccfg_tlv(struct i40e_cee_feat_tlv *tlv,
+ struct i40e_dcbx_config *dcbcfg)
+{
+   u8 *buf = tlv->tlvinfo;
+
+   if (tlv->en_will_err & I40E_CEE_FEAT_TLV_WILLING_MASK)
+   dcbcfg->pfc.willing = 1;
+
+   /* ------------------------
+    * | PFC Enable | PFC TCs |
+    * ------------------------
+    * | 1 octet    | 1 octet |
+    */
+   dcbcfg->pfc.pfcenable = buf[0];
+   dcbcfg->pfc.pfccap = buf[1];
+}
+
+/**
+ * i40e_parse_cee_app_tlv
+ * @tlv: CEE DCBX APP TLV
+ * @dcbcfg: Local store to update APP PRIO data
+ *
+ * Parses CEE DCBX APP PRIO TLV
+ **/
+static void i40e_parse_cee_app_tlv(struct i40e_cee_feat_tlv *tlv,
+  struct i40e_dcbx_config *dcbcfg)
+{
+   u16 length, typelength, offset = 0;
+   struct i40e_cee_app_prio *app;
+   u8 i, up;
+
+   typelength = ntohs(tlv->hdr.typelen);
+   length = (u16)((typelength & I40E_LLDP_TLV_LEN_MASK) >>
+  I40E_LLDP_TLV_LEN_SHIFT);
+
+   dcbcfg->numapps = length / sizeof(*app);
+   if (!dcbcfg->numapps)
+   return;
+
+   for (i = 0; i < dcbcfg->numapps; i++) {
+   app = (struct i40e_cee_app_prio *)(tlv->tlvinfo + offset);
+   for (up = 0; up < I40E_MAX_USER_PRIORITY; up++) {
+   if (app->prio_map & (1 << up))
+   break;
+   }
+   dcbcfg->app[i].priority = up;
+   /* Get Selector from lower 2 bits */
+   dcbcfg->app[i].selector = (app->upper_oui_sel &
+  I40E_CEE_APP_SELECTOR_MASK);
+   dcbcfg->app[i].protocolid = ntohs(app->protocol);
+   /* Move to next app */
+   offset += sizeof(*app);
+   }
+}
+
+/**
+ * i40e_parse_cee_tlv
+ * @tlv: CEE DCB

[net-next 03/16] i40e: add more verbose error messages

2015-10-08 Thread Jeff Kirsher

From: Mitch Williams 

Under certain circumstances, the device may not have enough resources to
enable all of the VFs that it advertises in config space. Although the
number of supported VFs is reported upon driver init, it is not obvious
when this is different from the number reported in config space. To
eliminate this confusion, add an error message explaining the problem.
Additionally, move the 'Allocating VFs' message down below the error
checks so as to prevent further confusion.

Change-ID: I45b7efca53a7aebfbe33a8bc9d615ae48ea1
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 0545e3f..fac8a02 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -998,17 +998,19 @@ static int i40e_pci_sriov_enable(struct pci_dev *pdev, int num_vfs)
goto err_out;
}
 
-   dev_info(&pdev->dev, "Allocating %d VFs.\n", num_vfs);
if (pre_existing_vfs && pre_existing_vfs != num_vfs)
i40e_free_vfs(pf);
else if (pre_existing_vfs && pre_existing_vfs == num_vfs)
goto out;
 
if (num_vfs > pf->num_req_vfs) {
+   dev_warn(&pdev->dev, "Unable to enable %d VFs. Limited to %d VFs due to device resource constraints.\n",
+num_vfs, pf->num_req_vfs);
err = -EPERM;
goto err_out;
}
 
+   dev_info(&pdev->dev, "Allocating %d VFs.\n", num_vfs);
err = i40e_alloc_vfs(pf, num_vfs);
if (err) {
dev_warn(&pdev->dev, "Failed to enable SR-IOV: %d\n", err);
-- 
2.4.3


[net-next 06/16] i40e: Fix for extra Flow Director filter in table after error

2015-10-08 Thread Jeff Kirsher

From: Carolyn Wyborny 

This patch fixes a problem where the PF's fdir filter table would have an
entry that the hw was unable to add. This notification happens in the hot
path, so instead of trying to fix it then, we note the location in the
failure case and delete it during regular fdir subtask callback. Without
this patch, a case can occur where an invalid entry gets replayed and a
valid one is not.
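
A compact user-space sketch of the "note it in the hot path, delete it
later" approach. The list type and the functions add_filter(),
note_failed_filter() and purge_failed_filter() are hypothetical; resetting
fd_inv after the purge is a choice made for this sketch and is not shown
in the patch excerpt.

```c
#include <stdint.h>
#include <stdlib.h>

struct filter {
	uint32_t id;
	struct filter *next;
};

struct pf {
	struct filter *filters;	/* software copy of the filter table */
	unsigned int active;
	uint32_t fd_inv;	/* id the hardware rejected, 0 if none */
};

static int add_filter(struct pf *pf, uint32_t id)
{
	struct filter *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->id = id;
	f->next = pf->filters;	/* push to the front of the list */
	pf->filters = f;
	pf->active++;
	return 0;
}

/* Hot path: only remember which id failed, do no list work here. */
static void note_failed_filter(struct pf *pf, uint32_t id)
{
	pf->fd_inv = id;
}

/* Periodic subtask: unlink and free the recorded entry. */
static void purge_failed_filter(struct pf *pf)
{
	struct filter **link = &pf->filters;

	if (pf->fd_inv == 0)
		return;
	while (*link) {
		struct filter *cur = *link;

		if (cur->id == pf->fd_inv) {
			*link = cur->next;	/* safe removal while walking */
			free(cur);
			pf->active--;
		} else {
			link = &cur->next;
		}
	}
	pf->fd_inv = 0;	/* sketch-only: forget the id once handled */
}
```

Keeping the hot path down to a single store is the point of the design;
the list walk and free happen in a context where that cost is acceptable.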

Change-ID: I67831c183b5d0309876de807cc434809b74c9cb7
Signed-off-by: Carolyn Wyborny 
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 14 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  3 ++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a3c939..a662e39 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -410,6 +410,7 @@ struct i40e_pf {
u32 npar_min_bw;
 
u32 ioremap_len;
+   u32 fd_inv;
 };
 
 struct i40e_mac_filter {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2c59214..94953568 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5614,7 +5614,9 @@ u32 i40e_get_global_fd_count(struct i40e_pf *pf)
  **/
 void i40e_fdir_check_and_reenable(struct i40e_pf *pf)
 {
+   struct i40e_fdir_filter *filter;
u32 fcnt_prog, fcnt_avail;
+   struct hlist_node *node;
 
if (test_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state))
return;
@@ -5643,6 +5645,18 @@ void i40e_fdir_check_and_reenable(struct i40e_pf *pf)
dev_info(&pf->pdev->dev, "ATR is being enabled since we have space in the table now\n");
}
}
+
+   /* if hw had a problem adding a filter, delete it */
+   if (pf->fd_inv > 0) {
+   hlist_for_each_entry_safe(filter, node,
+ &pf->fdir_filter_list, fdir_node) {
+   if (filter->fd_id == pf->fd_inv) {
+   hlist_del(&filter->fdir_node);
+   kfree(filter);
+   pf->fdir_pf_active_filters--;
+   }
+   }
+   }
 }
 
 #define I40E_MIN_FD_FLUSH_INTERVAL 10
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 889ed10..8ab7ab1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -465,10 +465,11 @@ static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
I40E_RX_PROG_STATUS_DESC_QW1_ERROR_SHIFT;
 
if (error == BIT(I40E_RX_PROG_STATUS_DESC_FD_TBL_FULL_SHIFT)) {
+   pf->fd_inv = le32_to_cpu(rx_desc->wb.qword0.hi_dword.fd_id);
if ((rx_desc->wb.qword0.hi_dword.fd_id != 0) ||
(I40E_DEBUG_FD & pf->hw.debug_mask))
dev_warn(&pdev->dev, "ntuple filter loc = %d, could not be added\n",
-rx_desc->wb.qword0.hi_dword.fd_id);
+pf->fd_inv);
 
/* Check if the programming error is for ATR.
 * If so, auto disable ATR and set a state for
-- 
2.4.3


[net-next 10/16] i40e: Support FW CEE DCB UP to TC map nibble swap

2015-10-08 Thread Jeff Kirsher

From: Greg Bowers 

Changes parsing of AQ command Get CEE DCBX OPER CFG (0x0A07). Change is
required because FW creates the oper_prio_tc nibbles reversed from those
in the CEE Priority Group sub-TLV.
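
To show what the reversed nibble order means in practice, here is a
stand-alone sketch that reads the firmware layout low nibble first. The
function name parse_fw_prio_tc() and the literal masks are assumptions of
the example, not driver definitions.

```c
#include <stdint.h>

/*
 * Firmware layout (as described above): within each octet the low
 * nibble holds the even-numbered priority and the high nibble the
 * odd-numbered one, the opposite of the sub-TLV layout.
 */
static void parse_fw_prio_tc(const uint8_t oper_prio_tc[4], uint8_t prio[8])
{
	int i;

	for (i = 0; i < 4; i++) {
		prio[i * 2] = (uint8_t)(oper_prio_tc[i] & 0x0Fu);
		prio[i * 2 + 1] = (uint8_t)((oper_prio_tc[i] & 0xF0u) >> 4);
	}
}
```

With this ordering the firmware octets 0x10 0x32 0x54 0x76 decode to the
table 0, 1, 2, 3, 4, 5, 6, 7, whereas reading them high nibble first would
swap every adjacent pair.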

Change-ID: I7d9d8641bb430d30e286fc3fac909866ef8a0de8
Signed-off-by: Greg Bowers 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_dcb.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_dcb.c b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
index fbec7d7..6fa07ef 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_dcb.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
@@ -681,15 +681,18 @@ static void i40e_cee_to_dcb_config(
/* CEE PG data to ETS config */
dcbcfg->etscfg.maxtcs = cee_cfg->oper_num_tc;
 
+   /* Note that the FW creates the oper_prio_tc nibbles reversed
+* from those in the CEE Priority Group sub-TLV.
+*/
for (i = 0; i < 4; i++) {
tc = (u8)((cee_cfg->oper_prio_tc[i] &
-I40E_CEE_PGID_PRIO_1_MASK) >>
-I40E_CEE_PGID_PRIO_1_SHIFT);
-   dcbcfg->etscfg.prioritytable[i*2] =  tc;
-   tc = (u8)((cee_cfg->oper_prio_tc[i] &
 I40E_CEE_PGID_PRIO_0_MASK) >>
 I40E_CEE_PGID_PRIO_0_SHIFT);
-   dcbcfg->etscfg.prioritytable[i*2 + 1] = tc;
+   dcbcfg->etscfg.prioritytable[i * 2] =  tc;
+   tc = (u8)((cee_cfg->oper_prio_tc[i] &
+I40E_CEE_PGID_PRIO_1_MASK) >>
+I40E_CEE_PGID_PRIO_1_SHIFT);
+   dcbcfg->etscfg.prioritytable[i * 2 + 1] = tc;
}
 
for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++)
-- 
2.4.3


[net-next 05/16] i40e/i40evf: Store CEE DCBX DesiredCfg and RemoteCfg

2015-10-08 Thread Jeff Kirsher

From: Neerav Parikh 

This patch adds the capability to query and store the CEE DCBX DesiredCfg
and RemoteCfg data from the LLDP MIB.
A new member, "desired_dcbx_config", is added to the i40e_hw data structure
to hold CEE-only DesiredCfg data.

Change-ID: I19c550369594384eaff4cc63e690ca740231195d
Signed-off-by: Neerav Parikh 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_dcb.c| 44 ++-
 drivers/net/ethernet/intel/i40e/i40e_type.h   |  5 +--
 drivers/net/ethernet/intel/i40evf/i40e_type.h |  5 +--
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_dcb.c b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
index 89e60e3..fbec7d7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_dcb.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
@@ -762,6 +762,36 @@ static void i40e_cee_to_dcb_config(
 }
 
 /**
+ * i40e_get_ieee_dcb_config
+ * @hw: pointer to the hw struct
+ *
+ * Get IEEE mode DCB configuration from the Firmware
+ **/
+static i40e_status i40e_get_ieee_dcb_config(struct i40e_hw *hw)
+{
+   i40e_status ret = 0;
+
+   /* IEEE mode */
+   hw->local_dcbx_config.dcbx_mode = I40E_DCBX_MODE_IEEE;
+   /* Get Local DCB Config */
+   ret = i40e_aq_get_dcb_config(hw, I40E_AQ_LLDP_MIB_LOCAL, 0,
+&hw->local_dcbx_config);
+   if (ret)
+   goto out;
+
+   /* Get Remote DCB Config */
+   ret = i40e_aq_get_dcb_config(hw, I40E_AQ_LLDP_MIB_REMOTE,
+I40E_AQ_LLDP_BRIDGE_TYPE_NEAREST_BRIDGE,
+&hw->remote_dcbx_config);
+   /* Don't treat ENOENT as an error for Remote MIBs */
+   if (hw->aq.asq_last_status == I40E_AQ_RC_ENOENT)
+   ret = 0;
+
+out:
+   return ret;
+}
+
+/**
  * i40e_get_dcb_config
  * @hw: pointer to the hw struct
  *
@@ -776,7 +806,7 @@ i40e_status i40e_get_dcb_config(struct i40e_hw *hw)
/* If Firmware version < v4.33 IEEE only */
if (((hw->aq.fw_maj_ver == 4) && (hw->aq.fw_min_ver < 33)) ||
(hw->aq.fw_maj_ver < 4))
-   goto ieee;
+   return i40e_get_ieee_dcb_config(hw);
 
/* If Firmware version == v4.33 use old CEE struct */
if ((hw->aq.fw_maj_ver == 4) && (hw->aq.fw_min_ver == 33)) {
@@ -805,16 +835,14 @@ i40e_status i40e_get_dcb_config(struct i40e_hw *hw)
 
/* CEE mode not enabled try querying IEEE data */
if (hw->aq.asq_last_status == I40E_AQ_RC_ENOENT)
-   goto ieee;
-   else
+   return i40e_get_ieee_dcb_config(hw);
+
+   if (ret)
goto out;
 
-ieee:
-   /* IEEE mode */
-   hw->local_dcbx_config.dcbx_mode = I40E_DCBX_MODE_IEEE;
-   /* Get Local DCB Config */
+   /* Get CEE DCB Desired Config */
ret = i40e_aq_get_dcb_config(hw, I40E_AQ_LLDP_MIB_LOCAL, 0,
-&hw->local_dcbx_config);
+&hw->desired_dcbx_config);
if (ret)
goto out;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 34720e0..1c0bedb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -510,8 +510,9 @@ struct i40e_hw {
u16 dcbx_status;
 
/* DCBX info */
-   struct i40e_dcbx_config local_dcbx_config;
-   struct i40e_dcbx_config remote_dcbx_config;
+   struct i40e_dcbx_config local_dcbx_config; /* Oper/Local Cfg */
+   struct i40e_dcbx_config remote_dcbx_config; /* Peer Cfg */
+   struct i40e_dcbx_config desired_dcbx_config; /* CEE Desired Cfg */
 
/* debug mask */
u32 debug_mask;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index bbb3886..4b5528d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -504,8 +504,9 @@ struct i40e_hw {
u16 dcbx_status;
 
/* DCBX info */
-   struct i40e_dcbx_config local_dcbx_config;
-   struct i40e_dcbx_config remote_dcbx_config;
+   struct i40e_dcbx_config local_dcbx_config; /* Oper/Local Cfg */
+   struct i40e_dcbx_config remote_dcbx_config; /* Peer Cfg */
+   struct i40e_dcbx_config desired_dcbx_config; /* CEE Desired Cfg */
 
/* debug mask */
u32 debug_mask;
-- 
2.4.3


[net-next 15/16] i40e/i40evf: pass QOS handle to VF

2015-10-08 Thread Jeff Kirsher

From: Mitch Williams 

The VF really doesn't care about the QOS handle but it will in the
future. Since the VF only uses TC0, send it that handle. On the VF
side, save the handle and use it to populate the QOS params when we call
into the client interface.

Change-ID: I76f41b070baeaa09b19383e9168bc677837e0761
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 6 --
 drivers/net/ethernet/intel/i40evf/i40evf.h | 1 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c| 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 678623f..ee747dc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1210,8 +1210,10 @@ static int i40e_vc_get_vf_resources_msg(struct i40e_vf *vf, u8 *msg)
if (vf->lan_vsi_idx) {
vfres->vsi_res[i].vsi_id = vf->lan_vsi_id;
vfres->vsi_res[i].vsi_type = I40E_VSI_SRIOV;
-   vfres->vsi_res[i].num_queue_pairs =
-   pf->vsi[vf->lan_vsi_idx]->alloc_queue_pairs;
+   vfres->vsi_res[i].num_queue_pairs = vsi->alloc_queue_pairs;
+   /* VFs only use TC 0 */
+   vfres->vsi_res[i].qset_handle
+ = le16_to_cpu(vsi->info.qs_handle[0]);
ether_addr_copy(vfres->vsi_res[i].default_mac_addr,
vf->default_lan_addr.addr);
i++;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf.h b/drivers/net/ethernet/intel/i40evf/i40evf.h
index 27dc3fe..e7a223e 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf.h
+++ b/drivers/net/ethernet/intel/i40evf/i40evf.h
@@ -66,6 +66,7 @@ struct i40e_vsi {
 */
u16 rx_itr_setting;
u16 tx_itr_setting;
+   u16 qs_handle;
 };
 
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 1f99930..c00e495 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2115,6 +2115,7 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
adapter->vsi.tx_itr_setting = (I40E_ITR_DYNAMIC |
   ITR_REG_TO_USEC(I40E_ITR_TX_DEF));
adapter->vsi.netdev = adapter->netdev;
+   adapter->vsi.qs_handle = adapter->vsi_res->qset_handle;
return 0;
 }
 
-- 
2.4.3


[net-next 08/16] i40e: add switch for link polling

2015-10-08 Thread Jeff Kirsher

From: Shannon Nelson 

There's been some need for controlling the periodic link polling for
debugging link issues.  This patch enables switching it off and on
through an ethtool private flag.  The link poll remains on by default,
but can be turned off with
ethtool --set-priv-flags p261p1 LinkPolling off
and later turned back on with
ethtool --set-priv-flags p261p1 LinkPolling on

To check the current status, use
ethtool --show-priv-flags p261p1
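
The flag handling described above can be sketched in isolation. This is only a minimal user-space illustration: the `demo_*` names and the `link_events` counter are hypothetical stand-ins, not driver symbols, and the bit positions simply mirror the ones used in the diff below.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the driver's flag bits (assumed for this sketch). */
#define DEMO_PRIV_FLAGS_LINKPOLL (UINT32_C(1) << 1)   /* bit exposed through ethtool */
#define DEMO_FLAG_LINK_POLLING   (UINT64_C(1) << 39)  /* driver-internal feature bit */

struct demo_pf {
	uint64_t flags;        /* internal feature flags */
	unsigned link_events;  /* counts simulated link polls */
};

/* Translate the ethtool-visible bit into the internal flag. */
static void demo_set_priv_flags(struct demo_pf *pf, uint32_t flags)
{
	if (flags & DEMO_PRIV_FLAGS_LINKPOLL)
		pf->flags |= DEMO_FLAG_LINK_POLLING;
	else
		pf->flags &= ~DEMO_FLAG_LINK_POLLING;
}

/* Report the internal flag back as the ethtool-visible bit. */
static uint32_t demo_get_priv_flags(const struct demo_pf *pf)
{
	return (pf->flags & DEMO_FLAG_LINK_POLLING) ? DEMO_PRIV_FLAGS_LINKPOLL : 0;
}

/* Periodic task: only poll the link while the flag is set. */
static void demo_watchdog(struct demo_pf *pf)
{
	if (pf->flags & DEMO_FLAG_LINK_POLLING)
		pf->link_events++;
}
```

Keeping the ethtool bit and the internal bit separate means the user-visible flag numbering can stay stable even if the internal flag layout changes.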

Change-ID: I32e4ab654ff3eec90a06cf144899971b82d71c40
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  2 ++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 26 --
 drivers/net/ethernet/intel/i40e/i40e_main.c|  4 +++-
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 0c73404..f26dcb2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -101,6 +101,7 @@
 
 /* Ethtool Private Flags */
 #define I40E_PRIV_FLAGS_NPAR_FLAG  BIT(0)
+#define I40E_PRIV_FLAGS_LINKPOLL_FLAG  BIT(1)
 
 #define I40E_NVM_VERSION_LO_SHIFT  0
 #define I40E_NVM_VERSION_LO_MASK   (0xff << I40E_NVM_VERSION_LO_SHIFT)
@@ -327,6 +328,7 @@ struct i40e_pf {
 #define I40E_FLAG_WB_ON_ITR_CAPABLEBIT_ULL(35)
 #define I40E_FLAG_VEB_STATS_ENABLEDBIT_ULL(37)
 #define I40E_FLAG_MULTIPLE_TCP_UDP_RSS_PCTYPE  BIT_ULL(38)
+#define I40E_FLAG_LINK_POLLING_ENABLED BIT_ULL(39)
 #define I40E_FLAG_VEB_MODE_ENABLED BIT_ULL(40)
 
/* tracks features that get auto disabled by errors */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index f2b4f8b..5a726f2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -230,10 +230,10 @@ static const char i40e_gstrings_test[][ETH_GSTRING_LEN] = {
 
 static const char i40e_priv_flags_strings[][ETH_GSTRING_LEN] = {
"NPAR",
+   "LinkPolling",
 };
 
-#define I40E_PRIV_FLAGS_STR_LEN \
-   (sizeof(i40e_priv_flags_strings) / ETH_GSTRING_LEN)
+#define I40E_PRIV_FLAGS_STR_LEN ARRAY_SIZE(i40e_priv_flags_strings)
 
 /**
  * i40e_partition_setting_complaint - generic complaint for MFP restriction
@@ -2636,10 +2636,31 @@ static u32 i40e_get_priv_flags(struct net_device *dev)
 
ret_flags |= pf->hw.func_caps.npar_enable ?
I40E_PRIV_FLAGS_NPAR_FLAG : 0;
+   ret_flags |= pf->flags & I40E_FLAG_LINK_POLLING_ENABLED ?
+   I40E_PRIV_FLAGS_LINKPOLL_FLAG : 0;
 
return ret_flags;
 }
 
+/**
+ * i40e_set_priv_flags - set private flags
+ * @dev: network interface device structure
+ * @flags: bit flags to be set
+ **/
+static int i40e_set_priv_flags(struct net_device *dev, u32 flags)
+{
+   struct i40e_netdev_priv *np = netdev_priv(dev);
+   struct i40e_vsi *vsi = np->vsi;
+   struct i40e_pf *pf = vsi->back;
+
+   if (flags & I40E_PRIV_FLAGS_LINKPOLL_FLAG)
+   pf->flags |= I40E_FLAG_LINK_POLLING_ENABLED;
+   else
+   pf->flags &= ~I40E_FLAG_LINK_POLLING_ENABLED;
+
+   return 0;
+}
+
 static const struct ethtool_ops i40e_ethtool_ops = {
.get_settings   = i40e_get_settings,
.set_settings   = i40e_set_settings,
@@ -2676,6 +2697,7 @@ static const struct ethtool_ops i40e_ethtool_ops = {
.set_channels   = i40e_set_channels,
.get_ts_info= i40e_get_ts_info,
.get_priv_flags = i40e_get_priv_flags,
+   .set_priv_flags = i40e_set_priv_flags,
 };
 
 void i40e_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e05e6aa..23a7b40 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5892,7 +5892,8 @@ static void i40e_watchdog_subtask(struct i40e_pf *pf)
return;
pf->service_timer_previous = jiffies;
 
-   i40e_link_event(pf);
+   if (pf->flags & I40E_FLAG_LINK_POLLING_ENABLED)
+   i40e_link_event(pf);
 
/* Update the stats for active netdevs so the network stack
 * can look at updated numbers whenever it cares to
@@ -7908,6 +7909,7 @@ static int i40e_sw_init(struct i40e_pf *pf)
/* Set default capability flags */
pf->flags = I40E_FLAG_RX_CSUM_ENABLED |
I40E_FLAG_MSI_ENABLED |
+   I40E_FLAG_LINK_POLLING_ENABLED |
I40E_FLAG_MSIX_ENABLED;
 
if (iommu_present(&pci_bus_type))
-- 
2.4.3


[net-next 07/16] i40e: Fix multiple link up messages

2015-10-08 Thread Jeff Kirsher

From: Matt Jared 

This patch addresses an issue where multiple link up messages can be logged
resulting from aq link status timing when link properties are changed (fc,
speed, etc.); solved by using a single function to handle status printing
and adding a mechanism to track whether link state (up or down) has
actually changed.
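
The state-tracking idea can be shown with a small stand-alone sketch. The `demo_*` names and the `messages` counter are hypothetical and exist only for illustration; the early-return check follows the shape of the change in the diff below.

```c
#include <stdbool.h>
#include <stdio.h>

struct demo_vsi {
	bool current_isup;   /* last link state that was logged */
	unsigned messages;   /* number of messages actually printed */
};

/* Log a link message only when the state differs from the last one logged. */
static void demo_print_link_message(struct demo_vsi *vsi, bool isup)
{
	if (vsi->current_isup == isup)
		return;              /* same state as before: suppress the duplicate */
	vsi->current_isup = isup;
	vsi->messages++;
	printf("NIC Link is %s\n", isup ? "Up" : "Down");
}
```

One consequence of this scheme, assuming the tracking field starts out false, is that a "down" report arriving before any "up" report is suppressed as well.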

Change-ID: Ied6ed6e49dc397c77d992adc0bc9ed3767152b9d
Signed-off-by: Matt Jared 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h | 2 ++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 4 ++--
 drivers/net/ethernet/intel/i40e/i40e_main.c| 5 -
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index a662e39..0c73404 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -537,6 +537,7 @@ struct i40e_vsi {
u16 idx;   /* index in pf->vsi[] */
u16 veb_idx;   /* index of VEB parent */
struct kobject *kobj;  /* sysfs object */
+   bool current_isup; /* Sync 'link up' logging */
 
/* VSI specific handlers */
irqreturn_t (*irq_handler)(int irq, void *data);
@@ -791,4 +792,5 @@ int i40e_is_vsi_uplink_mode_veb(struct i40e_vsi *vsi);
 i40e_status i40e_get_npar_bw_setting(struct i40e_pf *pf);
 i40e_status i40e_set_npar_bw_setting(struct i40e_pf *pf);
 i40e_status i40e_commit_npar_bw_setting(struct i40e_pf *pf);
+void i40e_print_link_message(struct i40e_vsi *vsi, bool isup);
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index ef471fc..f2b4f8b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -690,7 +690,7 @@ static int i40e_set_settings(struct net_device *netdev,
/* Tell the OS link is going down, the link will go
 * back up when fw says it is ready asynchronously
 */
-   netdev_info(netdev, "PHY settings change requested, NIC Link is going down.\n");
+   i40e_print_link_message(vsi, false);
netif_carrier_off(netdev);
netif_tx_stop_all_queues(netdev);
}
@@ -834,7 +834,7 @@ static int i40e_set_pauseparam(struct net_device *netdev,
/* Tell the OS link is going down, the link will go back up when fw
 * says it is ready asynchronously
 */
-   netdev_info(netdev, "Flow control settings change requested, NIC Link is going down.\n");
+   i40e_print_link_message(vsi, false);
netif_carrier_off(netdev);
netif_tx_stop_all_queues(netdev);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 94953568..e05e6aa 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4837,11 +4837,14 @@ out:
  * i40e_print_link_message - print link up or down
  * @vsi: the VSI for which link needs a message
  */
-static void i40e_print_link_message(struct i40e_vsi *vsi, bool isup)
+void i40e_print_link_message(struct i40e_vsi *vsi, bool isup)
 {
char speed[SPEED_SIZE] = "Unknown";
char fc[FC_SIZE] = "RX/TX";
 
+   if (vsi->current_isup == isup)
+   return;
+   vsi->current_isup = isup;
if (!isup) {
netdev_info(vsi->netdev, "NIC Link is Down\n");
return;
-- 
2.4.3


[net-next 02/16] i40e: inline interrupt enable

2015-10-08 Thread Jeff Kirsher

From: Jesse Brandeburg 

The interrupt enable function can be inlined by moving it to the header
file, which decreases the function call overhead for a frequently called
function.

Change-ID: I3214cc99593725768642680e7b8ce7e9bba7e44d
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  | 19 ++-
 drivers/net/ethernet/intel/i40e/i40e_main.c | 18 --
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 681bd5d..7a3c939 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -702,7 +702,24 @@ static inline void i40e_dbg_pf_exit(struct i40e_pf *pf) {}
 static inline void i40e_dbg_init(void) {}
 static inline void i40e_dbg_exit(void) {}
 #endif /* CONFIG_DEBUG_FS*/
-void i40e_irq_dynamic_enable(struct i40e_vsi *vsi, int vector);
+/**
+ * i40e_irq_dynamic_enable - Enable default interrupt generation settings
+ * @vsi: pointer to a vsi
+ * @vector: enable a particular Hw Interrupt vector, without base_vector
+ **/
+static inline void i40e_irq_dynamic_enable(struct i40e_vsi *vsi, int vector)
+{
+   struct i40e_pf *pf = vsi->back;
+   struct i40e_hw *hw = &pf->hw;
+   u32 val;
+
+   val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
+ I40E_PFINT_DYN_CTLN_CLEARPBA_MASK |
+ (I40E_ITR_NONE << I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT);
+   wr32(hw, I40E_PFINT_DYN_CTLN(vector + vsi->base_vector - 1), val);
+   /* skip the flush */
+}
+
 void i40e_irq_dynamic_disable(struct i40e_vsi *vsi, int vector);
 void i40e_irq_dynamic_disable_icr0(struct i40e_pf *pf);
 void i40e_irq_dynamic_enable_icr0(struct i40e_pf *pf);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index fb4b34d..2c59214 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3060,24 +3060,6 @@ void i40e_irq_dynamic_enable_icr0(struct i40e_pf *pf)
 }
 
 /**
- * i40e_irq_dynamic_enable - Enable default interrupt generation settings
- * @vsi: pointer to a vsi
- * @vector: enable a particular Hw Interrupt vector, without base_vector
- **/
-void i40e_irq_dynamic_enable(struct i40e_vsi *vsi, int vector)
-{
-   struct i40e_pf *pf = vsi->back;
-   struct i40e_hw *hw = &pf->hw;
-   u32 val;
-
-   val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
- I40E_PFINT_DYN_CTLN_CLEARPBA_MASK |
- (I40E_ITR_NONE << I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT);
-   wr32(hw, I40E_PFINT_DYN_CTLN(vector + vsi->base_vector - 1), val);
-   /* skip the flush */
-}
-
-/**
  * i40e_irq_dynamic_disable - Disable default interrupt generation settings
  * @vsi: pointer to a vsi
  * @vector: disable a particular Hw Interrupt vector
-- 
2.4.3


[net-next 16/16] i40e: print neato new features

2015-10-08 Thread Jeff Kirsher

From: Jesse Brandeburg 

To help users and developers know what compile options
and hardware features are enabled at compile time, print
"VxLAN" in the feature list when VxLAN support is built in.

Change-ID: I3162f3b7678dc725a597f964217920eb218b480b
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c46d814..a484f22 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9914,6 +9914,9 @@ static void i40e_print_features(struct i40e_pf *pf)
}
if (pf->flags & I40E_FLAG_DCB_CAPABLE)
buf += sprintf(buf, "DCB ");
+#if IS_ENABLED(CONFIG_VXLAN)
+   buf += sprintf(buf, "VxLAN ");
+#endif
if (pf->flags & I40E_FLAG_PTP)
buf += sprintf(buf, "PTP ");
 #ifdef I40E_FCOE
-- 
2.4.3


[net-next 11/16] i40evf: detect reset more reliably

2015-10-08 Thread Jeff Kirsher

From: Mitch Williams 

Using VFGEN_RSTAT to detect a VF reset is an endeavor that is fraught
with peril. It's entirely too easy to miss a reset because none of the
bits are sticky. By the time the VF driver reads the register, the reset
may have been processed and cleaned up by the PF driver, leaving the
register in the same state that it was before the reset.

Instead, detect a reset with the VF_ARQLEN register. When the VF is
reset, the enable bit in this register is cleared, and it stays cleared
until the VF driver processes the reset and re-enables the admin queue.

Because we now deal with multiple registers in the reset and watchdog
tasks, rename the rstat_val variable to reg_val.
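
The difference between the two detection methods can be sketched with simulated registers. Everything here is an assumption made for illustration: the `demo_*` names, the state values, and the position of the enable bit are hypothetical, and the real driver reads hardware registers through `rd32()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reset states and enable bit (assumed for this sketch). */
enum demo_vfr_state { DEMO_VFR_INPROGRESS, DEMO_VFR_COMPLETED, DEMO_VFR_VFACTIVE };
#define DEMO_ARQ_ENABLE (UINT32_C(1) << 31)

struct demo_regs {
	uint32_t rstat;   /* reset status: not sticky, PF rewrites it */
	uint32_t arqlen;  /* admin receive queue length plus enable bit */
};

/* A reset occurs and the PF finishes its cleanup before the VF looks. */
static void demo_reset_then_pf_cleanup(struct demo_regs *r)
{
	r->rstat = DEMO_VFR_INPROGRESS;
	r->arqlen &= ~DEMO_ARQ_ENABLE;   /* reset clears the enable bit */
	r->rstat = DEMO_VFR_VFACTIVE;    /* PF cleanup restores the status */
}

/* Old check: misses the reset once the status is back to active. */
static bool demo_reset_seen_rstat(const struct demo_regs *r)
{
	return r->rstat != DEMO_VFR_VFACTIVE && r->rstat != DEMO_VFR_COMPLETED;
}

/* New check: the enable bit stays clear until the VF re-enables the queue. */
static bool demo_reset_seen_arqlen(const struct demo_regs *r)
{
	return !(r->arqlen & DEMO_ARQ_ENABLE);
}

/* The VF handles the reset and re-enables its admin queue. */
static void demo_vf_reinit(struct demo_regs *r)
{
	r->arqlen |= DEMO_ARQ_ENABLE;
}
```

In this model the status-based check reports nothing after the PF cleanup, while the enable-bit check keeps reporting the reset until `demo_vf_reinit()` runs.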

Change-ID: Id1df17045c0992e607da0162d31807f7fc20d199
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 36 +++--
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 0d18446..664fde6 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -1419,16 +1419,16 @@ static void i40evf_watchdog_task(struct work_struct *work)
  struct i40evf_adapter,
  watchdog_task);
struct i40e_hw *hw = &adapter->hw;
-   uint32_t rstat_val;
+   u32 reg_val;
 
if (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section))
goto restart_watchdog;
 
if (adapter->flags & I40EVF_FLAG_PF_COMMS_FAILED) {
-   rstat_val = rd32(hw, I40E_VFGEN_RSTAT) &
-   I40E_VFGEN_RSTAT_VFR_STATE_MASK;
-   if ((rstat_val == I40E_VFR_VFACTIVE) ||
-   (rstat_val == I40E_VFR_COMPLETED)) {
+   reg_val = rd32(hw, I40E_VFGEN_RSTAT) &
+ I40E_VFGEN_RSTAT_VFR_STATE_MASK;
+   if ((reg_val == I40E_VFR_VFACTIVE) ||
+   (reg_val == I40E_VFR_COMPLETED)) {
/* A chance for redemption! */
dev_err(&adapter->pdev->dev, "Hardware came out of reset. Attempting reinit.\n");
adapter->state = __I40EVF_STARTUP;
@@ -1453,11 +1453,8 @@ static void i40evf_watchdog_task(struct work_struct *work)
goto watchdog_done;
 
/* check for reset */
-   rstat_val = rd32(hw, I40E_VFGEN_RSTAT) &
-   I40E_VFGEN_RSTAT_VFR_STATE_MASK;
-   if (!(adapter->flags & I40EVF_FLAG_RESET_PENDING) &&
-   (rstat_val != I40E_VFR_VFACTIVE) &&
-   (rstat_val != I40E_VFR_COMPLETED)) {
+   reg_val = rd32(hw, I40E_VF_ARQLEN1) & I40E_VF_ARQLEN1_ARQENABLE_MASK;
+   if (!(adapter->flags & I40EVF_FLAG_RESET_PENDING) && !reg_val) {
adapter->state = __I40EVF_RESETTING;
adapter->flags |= I40EVF_FLAG_RESET_PENDING;
dev_err(&adapter->pdev->dev, "Hardware reset detected\n");
@@ -1572,7 +1569,7 @@ static void i40evf_reset_task(struct work_struct *work)
struct net_device *netdev = adapter->netdev;
struct i40e_hw *hw = &adapter->hw;
struct i40evf_mac_filter *f;
-   uint32_t rstat_val;
+   u32 reg_val;
int i = 0, err;
 
while (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK,
@@ -1593,12 +1590,11 @@ static void i40evf_reset_task(struct work_struct *work)
 
/* poll until we see the reset actually happen */
for (i = 0; i < I40EVF_RESET_WAIT_COUNT; i++) {
-   rstat_val = rd32(hw, I40E_VFGEN_RSTAT) &
-   I40E_VFGEN_RSTAT_VFR_STATE_MASK;
-   if ((rstat_val != I40E_VFR_VFACTIVE) &&
-   (rstat_val != I40E_VFR_COMPLETED))
+   reg_val = rd32(hw, I40E_VF_ARQLEN1) &
+ I40E_VF_ARQLEN1_ARQENABLE_MASK;
+   if (!reg_val)
break;
-   usleep_range(500, 1000);
+   usleep_range(5000, 10000);
}
if (i == I40EVF_RESET_WAIT_COUNT) {
dev_info(&adapter->pdev->dev, "Never saw reset\n");
@@ -1607,9 +1603,9 @@ static void i40evf_reset_task(struct work_struct *work)
 
/* wait until the reset is complete and the PF is responding to us */
for (i = 0; i < I40EVF_RESET_WAIT_COUNT; i++) {
-   rstat_val = rd32(hw, I40E_VFGEN_RSTAT) &
-   I40E_VFGEN_RSTAT_VFR_STATE_MASK;
-   if (rstat_val == I40E_VFR_VFACTIVE)
+   reg_val = rd32(hw, I40E_VFGEN_RSTAT) &
+ I40E_VFGEN_RSTAT_VFR_STATE_MASK;
+   if (reg_val == I40E_VFR_VFACTIVE)
break;
msleep(I40EVF_RESET_WAIT_MS);
}
@@ -1621,7 +1617,7 @@ static void i40evf_reset_task(struct w

[net-next 00/16][pull request] Intel Wired LAN Driver Updates 2015-10-08

2015-10-08 Thread Jeff Kirsher

This series contains updates to i40e and i40evf only (again).

Jesse fixes an issue where the driver was issuing a WARN_ON during ring
size changes because the code was cloning the rx_ring struct but not
zeroing out the pointers before allocating new memory, so simply zero
out the pointers.  Also reduces the function call overhead of the
interrupt enable function by moving it to the header file, which in
turn allows us to inline it.  Also does a thorough job of code
cleanup to fix spaces after declarations, remove unnecessary braces
and breaks, remove another __func__ use and general code tidiness.

Mitch adds more verbose error messages when the number of supported VFs
is reported in driver init and it is different from the number reported
in config space.  Updates the VF driver to detect a reset with the
VF_ARQLEN register, since the enable bit is cleared when the VF is reset
and stays cleared until the VF driver processes the reset and
re-enables the admin queue, which is more reliable than using
VFGEN_RSTAT as previously.

Neerav adds parsing for CEE DCBx TLVs from the LLDP MIB since there is
a need to get the CEE DesiredCfg Tx by firmware and DCB configuration
Rx from peer for debug and other application purposes.

Carolyn fixes a problem where the PF's Flow Director filter table would
have an entry that the hardware was unable to add; when this occurs, an
invalid entry gets replayed and a valid one is lost.

Matt fixes an issue where multiple link up messages can be logged
resulting from admin queue link status timing when link properties are
changed.

Shannon adds the ability to control the periodic link polling through
ethtool to be able to switch it off and on for debugging link issues.

Serey explicitly assigns the enum index for each VSI type so that the PF
and VF always refer to the same VSI type even if the enum lists
are different.

The following are changes since commit df718423250c000ca4323a767cedc2f3219b685c:
  Merge branch 'bpf_random32'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Carolyn Wyborny (1):
  i40e: Fix for extra Flow Director filter in table after error

Greg Bowers (1):
  i40e: Support FW CEE DCB UP to TC map nibble swap

Jesse Brandeburg (5):
  i40e: fix erroneous WARN_ON
  i40e: inline interrupt enable
  i40e/i40evf: clean up some code
  i40e: refactor code to remove indent
  i40e: print neato new features

Matt Jared (1):
  i40e: Fix multiple link up messages

Mitch Williams (4):
  i40e: add more verbose error messages
  i40evf: detect reset more reliably
  i40evf: use capabilities flags properly
  i40e/i40evf: pass QOS handle to VF

Neerav Parikh (2):
  i40e: Add parsing for CEE DCBX TLVs
  i40e/i40evf: Store CEE DCBX DesiredCfg and RemoteCfg

Serey Kong (1):
  i40e/i40evf: Explicitly assign enum index for VSI type

Shannon Nelson (1):
  i40e: add switch for link polling

 drivers/net/ethernet/intel/i40e/i40e.h |  24 ++-
 drivers/net/ethernet/intel/i40e/i40e_adminq.c  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_common.c  |   9 +-
 drivers/net/ethernet/intel/i40e/i40e_dcb.c | 236 +++--
 drivers/net/ethernet/intel/i40e/i40e_dcb.h |  39 
 drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c  |   1 +
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 124 ++-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  43 +++-
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c|   1 -
 drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 164 +++---
 drivers/net/ethernet/intel/i40e/i40e_ptp.c |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_type.h|  21 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  23 +-
 drivers/net/ethernet/intel/i40evf/i40e_adminq.c|   3 +-
 drivers/net/ethernet/intel/i40evf/i40e_common.c|   8 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |   6 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |   1 +
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |  21 +-
 drivers/net/ethernet/intel/i40evf/i40evf.h |   5 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  46 ++--
 23 files changed, 567 insertions(+), 227 deletions(-)

-- 
2.4.3


[net-next 09/16] i40e/i40evf: Explicitly assign enum index for VSI type

2015-10-08 Thread Jeff Kirsher

From: Serey Kong 

Ran into an issue where the PF's VSI type list was different from the VF's,
which resulted in different enum indexes. The VSI type list can
be different depending on what build flag is used for PF and VF.

The change is to explicitly assign an enum index to each VSI type
so that PF and VF always refer to the same VSI type even if the
enum lists are different.
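
The effect can be seen in a small sketch. The enum and member names here are hypothetical, and leaving the FCOE entry out of the VF list is only an assumed example of how two builds might end up with different lists.

```c
/* Implicit numbering: values depend on which members are compiled in. */
enum demo_pf_implicit { PF_I_MAIN = 0, PF_I_FCOE, PF_I_SRIOV };  /* SRIOV == 2 */
enum demo_vf_implicit { VF_I_MAIN = 0, VF_I_SRIOV };             /* SRIOV == 1 */

/* Explicit numbering: a member keeps its value even when others are absent. */
enum demo_pf_explicit { PF_E_MAIN = 0, PF_E_FCOE = 1, PF_E_SRIOV = 2 };
enum demo_vf_explicit { VF_E_MAIN = 0, VF_E_SRIOV = 2 };
```

With implicit numbering the two sides disagree on the value that means SRIOV, so a type sent by one side is misread by the other; with explicit values both sides agree regardless of which members each build includes.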

Change-ID: I8c0e5fdb515f324f7964df863a458073cf467e57
Signed-off-by: Serey Kong 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_type.h   | 16 
 drivers/net/ethernet/intel/i40evf/i40e_type.h | 16 
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 1c0bedb..d1ec5a4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -160,14 +160,14 @@ enum i40e_set_fc_aq_failures {
 };
 
 enum i40e_vsi_type {
-   I40E_VSI_MAIN = 0,
-   I40E_VSI_VMDQ1,
-   I40E_VSI_VMDQ2,
-   I40E_VSI_CTRL,
-   I40E_VSI_FCOE,
-   I40E_VSI_MIRROR,
-   I40E_VSI_SRIOV,
-   I40E_VSI_FDIR,
+   I40E_VSI_MAIN   = 0,
+   I40E_VSI_VMDQ1  = 1,
+   I40E_VSI_VMDQ2  = 2,
+   I40E_VSI_CTRL   = 3,
+   I40E_VSI_FCOE   = 4,
+   I40E_VSI_MIRROR = 5,
+   I40E_VSI_SRIOV  = 6,
+   I40E_VSI_FDIR   = 7,
I40E_VSI_TYPE_UNKNOWN
 };
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index 4b5528d..a59b60f 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -160,14 +160,14 @@ enum i40e_set_fc_aq_failures {
 };
 
 enum i40e_vsi_type {
-   I40E_VSI_MAIN = 0,
-   I40E_VSI_VMDQ1,
-   I40E_VSI_VMDQ2,
-   I40E_VSI_CTRL,
-   I40E_VSI_FCOE,
-   I40E_VSI_MIRROR,
-   I40E_VSI_SRIOV,
-   I40E_VSI_FDIR,
+   I40E_VSI_MAIN   = 0,
+   I40E_VSI_VMDQ1  = 1,
+   I40E_VSI_VMDQ2  = 2,
+   I40E_VSI_CTRL   = 3,
+   I40E_VSI_FCOE   = 4,
+   I40E_VSI_MIRROR = 5,
+   I40E_VSI_SRIOV  = 6,
+   I40E_VSI_FDIR   = 7,
I40E_VSI_TYPE_UNKNOWN
 };
 
-- 
2.4.3


Re: [PATCH] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread kbuild test robot

Hi Kosuke,

[auto build test WARNING on v4.3-rc4 -- if it's inappropriate base, please ignore]

reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/sunrpc/svcsock.c:417:28: sparse: incorrect type in argument 1 (different base types)
   net/sunrpc/svcsock.c:417:28:    expected struct socket_wq *wq
   net/sunrpc/svcsock.c:417:28:    got struct __wait_queue_head [usertype] *wq
>> net/sunrpc/svcsock.c:1597:28: sparse: incorrect type in argument 1 (different base types)
   net/sunrpc/svcsock.c:1597:28:    expected struct socket_wq *wq
   net/sunrpc/svcsock.c:1597:28:    got struct __wait_queue_head [usertype] *[assigned] wq
   net/sunrpc/svcsock.c:435:28: sparse: incorrect type in argument 1 (different base types)
   net/sunrpc/svcsock.c:435:28:    expected struct socket_wq *wq
   net/sunrpc/svcsock.c:435:28:    got struct __wait_queue_head [usertype] *wq
   net/sunrpc/svcsock.c:790:28: sparse: incorrect type in argument 1 (different base types)
   net/sunrpc/svcsock.c:790:28:    expected struct socket_wq *wq
   net/sunrpc/svcsock.c:790:28:    got struct __wait_queue_head [usertype] *[assigned] wq
   net/sunrpc/svcsock.c:811:28: sparse: incorrect type in argument 1 (different base types)
   net/sunrpc/svcsock.c:811:28:    expected struct socket_wq *wq
   net/sunrpc/svcsock.c:811:28:    got struct __wait_queue_head [usertype] *wq
   net/sunrpc/svcsock.c:826:28: sparse: incorrect type in argument 1 (different base types)
   net/sunrpc/svcsock.c:826:28:    expected struct socket_wq *wq
   net/sunrpc/svcsock.c:826:28:    got struct __wait_queue_head [usertype] *wq
   net/sunrpc/svcsock.c: In function 'svc_udp_data_ready':
   net/sunrpc/svcsock.c:417:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_write_space':
   net/sunrpc/svcsock.c:435:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq)) {
^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_tcp_listen_data_ready':
   net/sunrpc/svcsock.c:790:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_tcp_state_change':
   net/sunrpc/svcsock.c:811:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_tcp_data_ready':
   net/sunrpc/svcsock.c:826:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_sock_detach':

Re: [PATCH] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread kbuild test robot

Hi Kosuke,

[auto build test WARNING on v4.3-rc4 -- if it's inappropriate base, please ignore]

config: x86_64-randconfig-x002-201540 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from net/sunrpc/svcsock.c:22:
   net/sunrpc/svcsock.c: In function 'svc_udp_data_ready':
   net/sunrpc/svcsock.c:417:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   include/linux/compiler.h:147:28: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> net/sunrpc/svcsock.c:417:2: note: in expansion of macro 'if'
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from net/sunrpc/svcsock.c:22:
   net/sunrpc/svcsock.c:417:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   include/linux/compiler.h:147:40: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> net/sunrpc/svcsock.c:417:2: note: in expansion of macro 'if'
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from net/sunrpc/svcsock.c:22:
   net/sunrpc/svcsock.c:417:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq))
^
   include/linux/compiler.h:158:16: note: in definition of macro '__trace_if'
  __r = !!(cond); \
   ^
>> net/sunrpc/svcsock.c:417:2: note: in expansion of macro 'if'
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from net/sunrpc/svcsock.c:22:
   net/sunrpc/svcsock.c: In function 'svc_write_space':
   net/sunrpc/svcsock.c:435:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq)) {
^
   include/linux/compiler.h:147:28: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
   net/sunrpc/svcsock.c:435:2: note: in expansion of macro 'if'
 if (wq_has_sleeper(wq)) {
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'wait_queue_head_t * {aka struct __wait_queue_head *}'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from net/sunrpc/svcsock.c:22:
   net/sunrpc/svcsock.c:435:21: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type [-Wincompatible-pointer-types]
 if (wq_has_sleeper(wq)) {
^
   include/linux/compiler.h:147:40: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
   net/sunrpc/svcsock.c:435:2: note: in expansion of macro

Re: [PATCH] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread kbuild test robot

Hi Kosuke,

[auto build test WARNING on v4.3-rc4 -- if it's inappropriate base, please 
ignore]

config: xtensa-allyesconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   net/sunrpc/svcsock.c: In function 'svc_udp_data_ready':
>> net/sunrpc/svcsock.c:417:6: warning: passing argument 1 of 'wq_has_sleeper' 
>> from incompatible pointer type
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'struct wait_queue_head_t *'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_write_space':
   net/sunrpc/svcsock.c:435:6: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type
 if (wq_has_sleeper(wq)) {
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'struct wait_queue_head_t *'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_tcp_listen_data_ready':
   net/sunrpc/svcsock.c:790:6: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'struct wait_queue_head_t *'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_tcp_state_change':
   net/sunrpc/svcsock.c:811:6: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'struct wait_queue_head_t *'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_tcp_data_ready':
   net/sunrpc/svcsock.c:826:6: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'struct wait_queue_head_t *'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^
   net/sunrpc/svcsock.c: In function 'svc_sock_detach':
   net/sunrpc/svcsock.c:1597:6: warning: passing argument 1 of 'wq_has_sleeper' 
from incompatible pointer type
 if (wq_has_sleeper(wq))
 ^
   In file included from include/net/inet_sock.h:27:0,
from include/linux/udp.h:20,
from net/sunrpc/svcsock.c:30:
   include/net/sock.h:1879:20: note: expected 'struct socket_wq *' but argument 
is of type 'struct wait_queue_head_t *'
static inline bool wq_has_sleeper(struct socket_wq *wq)
   ^

vim +/wq_has_sleeper +417 net/sunrpc/svcsock.c

   401  
   402  /*
   403   * INET callback when data has been received on the socket.
   404   */
   405  static void svc_udp_data_ready(struct sock *sk)
   406  {
   407          struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
   408          wait_queue_head_t *wq = sk_sleep(sk);
   409  
   410          if (svsk) {
   411                  dprintk("svc: socket %p(inet %p), busy=%d\n",
   412                          svsk, sk,
   413                          test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
   414                  set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
   415                  svc_xprt_enqueue(&svsk->sk_xprt);
   416          }
 > 417          if (wq_has_sleeper(wq))
   418                  wake_up_interruptible(wq);
   419  }
   420  
   421  /*
   422   * INET callback when space is newly available on the socket.
   423   */
   424  static void svc_write_space(struct sock *sk)
   425  {

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all

[PATCH net] tcp: change type of alive from int to bool

2015-10-08 Thread Richard Sailer

The alive parameter of tcp_orphan_retries() indicates whether the
connection is assumed alive or not. In the function and in all places
calling it, it is used as a boolean value.

Therefore this changes the type of alive to bool in the function
definition and at all call sites.

Since tcp_orphan_retries() is local to tcp_timer.c, no change in
any other file or header is necessary.

Signed-off-by: Richard Sailer 
---
 net/ipv4/tcp_timer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 7149ebc..c9c716a 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -83,7 +83,7 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset)
 }
 
 /* Calculate maximal number or retries on an orphaned socket. */
-static int tcp_orphan_retries(struct sock *sk, int alive)
+static int tcp_orphan_retries(struct sock *sk, bool alive)
 {
int retries = sysctl_tcp_orphan_retries; /* May be zero. */
 
@@ -184,7 +184,7 @@ static int tcp_write_timeout(struct sock *sk)
 
retry_until = sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
-   const int alive = icsk->icsk_rto < TCP_RTO_MAX;
+   const bool alive = icsk->icsk_rto < TCP_RTO_MAX;
 
retry_until = tcp_orphan_retries(sk, alive);
do_reset = alive ||
@@ -298,7 +298,7 @@ static void tcp_probe_timer(struct sock *sk)
 
max_probes = sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
-   const int alive = inet_csk_rto_backoff(icsk, TCP_RTO_MAX) < TCP_RTO_MAX;
+   const bool alive = inet_csk_rto_backoff(icsk, TCP_RTO_MAX) < TCP_RTO_MAX;
 
max_probes = tcp_orphan_retries(sk, alive);
if (!alive && icsk->icsk_backoff >= max_probes)
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
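
A minimal userspace sketch of why a bool fits here: the flag is only ever tested for truth, never used as a number. The names orphan_retries() and orphan_retries_sysctl are hypothetical stand-ins, soft_err models sk->sk_err_soft, and the fallback of 8 retries is recalled from tcp_timer.c rather than shown in the patch, so treat the whole block as an assumption-laden model and not as the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for sysctl_tcp_orphan_retries; it may be zero. */
static int orphan_retries_sysctl;

/*
 * Userspace model of the retry decision. soft_err stands in for
 * sk->sk_err_soft. The fallback of 8 retries is recalled from
 * tcp_timer.c and is not shown in the patch, so treat it as an assumption.
 */
static int orphan_retries(bool soft_err, bool alive)
{
	int retries = orphan_retries_sysctl;

	assert(retries >= 0);

	/* An error was reported and the peer looks dead: give up at once. */
	if (soft_err && !alive)
		retries = 0;

	/* The peer still looks alive: pick a safe non-zero retry count. */
	if (retries == 0 && alive)
		retries = 8;

	return retries;
}
```

Both uses of alive are plain truth tests, which is what the type change documents; callers already compute it from a comparison, so no value other than 0 or 1 is ever passed.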

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-08 Thread Devon H. O'Dell

On Wed, Oct 7, 2015 at 8:46 AM, Alexei Starovoitov  wrote:
> On 10/7/15 1:16 AM, Daniel Borkmann wrote:
>>
>> Similar to commit c29390c6dfee ("xps: must clear sender_cpu before
>> forwarding"), we also need to clear the skb->sender_cpu when moving
>> from RX to TX via skb_do_redirect() due to the shared location of
>> napi_id (used on RX) and sender_cpu (used on TX).
>>
>> Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
>> Signed-off-by: Daniel Borkmann
>
>
> Acked-by: Alexei Starovoitov 
>
> with the amount of skb_sender_cpu_clear() all over the code base
> I wonder whether there is a better solution to all of these.

I think there is. We found that splitting the union of sender_cpu and
napi_id solved the issue for us. In general, I think this is an OK
solution as long as the following hold:

 * skbs are always allocated via kzalloc
 * out -> out cloned skbs are always cloned on the same CPU
 * an extra four bytes in skbuff isn't a bad thing

I think the first and last points are true, but I'm not 100% sure. I'm
also particularly unsure about the second point. If that assumption
does not hold, it could result in extra cache / bus traffic between
cores / sockets. However, that would also imply that we were already
getting some extra traffic at the point of doing the copy. So maybe
not a big deal? The other problem I could imagine is if the second
point *is* true and skbs end up being cloned multiple times, XPS might
get overworked on individual cores.

Anyway, I'm not 100% sure about any of these things: I'm really not at
all familiar with the Linux kernel, let alone the netstack -- this
just turned out to be not particularly difficult to find given
register context and call stack from the panic. I'd be happy to send a
patch to struct skbuff and toss skb_sender_cpu_clear, but I suspect
someone else on this list could validate that quicker than I. The
patch at that point is trivial.

I think it's probably a good thing to do. The need to call
skb_sender_cpu_clear() around every rx->tx copy interaction seems
brittle and likely to be problematic again in the future unless code
is always cargo culted, and assuming we've found every potential clone
site.

--dho
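
The trade-off described above can be sketched in plain C. This is a userspace model only: struct pkt_shared, struct pkt_split and the helper names are hypothetical, and the real struct sk_buff has many more fields. It assumes, as the thread suggests, that the TX side treats a zero sender_cpu as "not recorded yet".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared layout: the RX-side napi_id and TX-side sender_cpu alias each other. */
struct pkt_shared {
	union {
		uint32_t napi_id;
		uint32_t sender_cpu;
	};
};

/* Split layout: four bytes larger, but an RX value cannot leak into TX. */
struct pkt_split {
	uint32_t napi_id;
	uint32_t sender_cpu;
};

/* RX path records the NAPI instance (hypothetical helpers). */
static void rx_mark_shared(struct pkt_shared *p, uint32_t id)
{
	assert(p != NULL);
	p->napi_id = id;
}

static void rx_mark_split(struct pkt_split *p, uint32_t id)
{
	assert(p != NULL);
	p->napi_id = id;
}

/* TX path: a zero sender_cpu means "no CPU recorded yet". */
static int tx_cpu_unset_shared(const struct pkt_shared *p)
{
	return p->sender_cpu == 0;
}

static int tx_cpu_unset_split(const struct pkt_split *p)
{
	return p->sender_cpu == 0;
}
```

With the shared layout, a packet marked on RX looks as if a sender CPU had already been recorded, which is why every RX to TX hand-off needs an explicit clear. With the split layout the TX field stays zero without any extra call, at the cost of the four extra bytes mentioned in the list above.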

[PATCH] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Kosuke Tatsukawa

There are several places in net/sunrpc/svcsock.c which call
waitqueue_active() without a preceding memory barrier.  Change the code
to call wq_has_sleeper() instead, which other networking code uses in
similar places.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa 
---
 net/sunrpc/svcsock.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0c81202..cf081b8 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -414,7 +414,7 @@ static void svc_udp_data_ready(struct sock *sk)
set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
svc_xprt_enqueue(&svsk->sk_xprt);
}
-   if (wq && waitqueue_active(wq))
+   if (wq_has_sleeper(wq))
wake_up_interruptible(wq);
 }
 
@@ -432,7 +432,7 @@ static void svc_write_space(struct sock *sk)
svc_xprt_enqueue(&svsk->sk_xprt);
}
 
-   if (wq && waitqueue_active(wq)) {
+   if (wq_has_sleeper(wq)) {
dprintk("RPC svc_write_space: someone sleeping on %p\n",
   svsk);
wake_up_interruptible(wq);
@@ -787,7 +787,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
}
 
wq = sk_sleep(sk);
-   if (wq && waitqueue_active(wq))
+   if (wq_has_sleeper(wq))
wake_up_interruptible_all(wq);
 }
 
@@ -808,7 +808,7 @@ static void svc_tcp_state_change(struct sock *sk)
set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
svc_xprt_enqueue(&svsk->sk_xprt);
}
-   if (wq && waitqueue_active(wq))
+   if (wq_has_sleeper(wq))
wake_up_interruptible_all(wq);
 }
 
@@ -823,7 +823,7 @@ static void svc_tcp_data_ready(struct sock *sk)
set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
svc_xprt_enqueue(&svsk->sk_xprt);
}
-   if (wq && waitqueue_active(wq))
+   if (wq_has_sleeper(wq))
wake_up_interruptible(wq);
 }
 
@@ -1594,7 +1594,7 @@ static void svc_sock_detach(struct svc_xprt *xprt)
sk->sk_write_space = svsk->sk_owspace;
 
wq = sk_sleep(sk);
-   if (wq && waitqueue_active(wq))
+   if (wq_has_sleeper(wq))
wake_up_interruptible(wq);
 }
 
-- 
1.7.1
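
A minimal userspace sketch of the ordering problem, using C11 atomics in place of the kernel primitives. struct waitq, waitq_active(), waitq_has_sleeper() and publish_and_check() are hypothetical names for this model; they are not the kernel API. Note that the build robot replies in this thread report that the kernel's wq_has_sleeper() expects a struct socket_wq * and not a wait_queue_head_t *, so the patch as posted does not build cleanly; the sketch only illustrates the barrier-then-check idea.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a wait queue: only a sleeper count. */
struct waitq {
	atomic_int sleepers;
};

/* Condition that a sleeper re-checks after queueing itself. */
static atomic_bool data_ready;

/* Model of the unordered check: a relaxed load with no barrier before it. */
static bool waitq_active(struct waitq *wq)
{
	assert(wq != NULL);
	return atomic_load_explicit(&wq->sleepers, memory_order_relaxed) > 0;
}

/* Barrier-then-check helper; tolerates NULL like the old "wq &&" test. */
static bool waitq_has_sleeper(struct waitq *wq)
{
	if (!wq)
		return false;
	/* Order the earlier store to the condition before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	return waitq_active(wq);
}

/* Waker side: publish the condition, then decide whether to wake anyone. */
static bool publish_and_check(struct waitq *wq)
{
	atomic_store_explicit(&data_ready, true, memory_order_relaxed);
	return waitq_has_sleeper(wq);
}
```

Without the fence, the relaxed store and the relaxed load may be reordered, so the waker can see "no sleepers" while the sleeper sees "no data" and goes to sleep with nobody left to wake it.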

[PATCH] brcmfmac: fix waitqueue_active without memory barrier in brcmfmac driver

2015-10-08 Thread Kosuke Tatsukawa

brcmf_msgbuf_ioctl_resp_wake() seems to be missing a memory barrier
which might cause the waker to not notice the waiter and miss sending a
wake_up as in the following figure.

    brcmf_msgbuf_ioctl_resp_wake            brcmf_msgbuf_ioctl_resp_wait

if (waitqueue_active(&msgbuf->ioctl_resp_wait))
/* The CPU might reorder the test for
   the waitqueue up here, before
   prior writes complete */
                                        /* wait_event_timeout */
                                         /* __wait_event_timeout */
                                          /* ___wait_event */
                                          prepare_to_wait_event(&wq, &__wait,
                                            state);
                                          if (msgbuf->ctl_completed)
                                          ...
msgbuf->ctl_completed = true;
                                          schedule_timeout(__ret))


There are three other places in drivers/net/wireless/brcm80211/brcmfmac/
which have similar code.  The attached patch removes the call to
waitqueue_active() leaving just wake_up() behind.  This fixes the
problem because the call to spin_lock_irqsave() in wake_up() will be an
ACQUIRE operation.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa 
---
 drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c |3 +--
 drivers/net/wireless/brcm80211/brcmfmac/sdio.c   |6 ++
 drivers/net/wireless/brcm80211/brcmfmac/usb.c|3 +--
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c b/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c
index 7b2136c..648151e 100644
--- a/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c
+++ b/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c
@@ -473,8 +473,7 @@ static int brcmf_msgbuf_ioctl_resp_wait(struct brcmf_msgbuf *msgbuf)
 static void brcmf_msgbuf_ioctl_resp_wake(struct brcmf_msgbuf *msgbuf)
 {
msgbuf->ctl_completed = true;
-   if (waitqueue_active(&msgbuf->ioctl_resp_wait))
-   wake_up(&msgbuf->ioctl_resp_wait);
+   wake_up(&msgbuf->ioctl_resp_wait);
 }
 
 
diff --git a/drivers/net/wireless/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/brcm80211/brcmfmac/sdio.c
index f990e3d..332c4c8 100644
--- a/drivers/net/wireless/brcm80211/brcmfmac/sdio.c
+++ b/drivers/net/wireless/brcm80211/brcmfmac/sdio.c
@@ -1785,8 +1785,7 @@ static int brcmf_sdio_dcmd_resp_wait(struct brcmf_sdio *bus, uint *condition,
 
 static int brcmf_sdio_dcmd_resp_wake(struct brcmf_sdio *bus)
 {
-   if (waitqueue_active(&bus->dcmd_resp_wait))
-   wake_up_interruptible(&bus->dcmd_resp_wait);
+   wake_up_interruptible(&bus->dcmd_resp_wait);
 
return 0;
 }
@@ -2110,8 +2109,7 @@ static uint brcmf_sdio_readframes(struct brcmf_sdio *bus, uint maxframes)
 static void
 brcmf_sdio_wait_event_wakeup(struct brcmf_sdio *bus)
 {
-   if (waitqueue_active(&bus->ctrl_wait))
-   wake_up_interruptible(&bus->ctrl_wait);
+   wake_up_interruptible(&bus->ctrl_wait);
return;
 }
 
diff --git a/drivers/net/wireless/brcm80211/brcmfmac/usb.c b/drivers/net/wireless/brcm80211/brcmfmac/usb.c
index daba86d..7f5889c 100644
--- a/drivers/net/wireless/brcm80211/brcmfmac/usb.c
+++ b/drivers/net/wireless/brcm80211/brcmfmac/usb.c
@@ -184,8 +184,7 @@ static int brcmf_usb_ioctl_resp_wait(struct brcmf_usbdev_info *devinfo)
 
 static void brcmf_usb_ioctl_resp_wake(struct brcmf_usbdev_info *devinfo)
 {
-   if (waitqueue_active(&devinfo->ioctl_resp_wait))
-   wake_up(&devinfo->ioctl_resp_wait);
+   wake_up(&devinfo->ioctl_resp_wait);
 }
 
 static void
-- 
1.7.1
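
A rough userspace analogy of the "always wake, never peek" approach, written with POSIX threads. This swaps in a mutex and condition variable for the kernel wait queue, and unlike the patch it sets the flag while holding the lock; struct resp_wait, resp_init(), resp_wake() and resp_block() are hypothetical names. It is a sketch of the idea that the lock taken on the wake path provides the ordering, not a rendering of the driver code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical response-wait object: a flag guarded by a lock. */
struct resp_wait {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool completed;
};

static void resp_init(struct resp_wait *w)
{
	assert(w != NULL);
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->completed = false;
}

/* Waker: no "is anyone waiting?" shortcut; the lock orders the store. */
static void resp_wake(struct resp_wait *w)
{
	pthread_mutex_lock(&w->lock);
	w->completed = true;
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

/* Waiter: re-check the flag under the same lock before sleeping. */
static void resp_block(struct resp_wait *w)
{
	pthread_mutex_lock(&w->lock);
	while (!w->completed)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
}
```

Because waker and waiter serialize on the same lock, the waiter either sees completed set before it sleeps or is already queued when the broadcast arrives. The price is taking the lock on every wake-up, which is the cost the removed waitqueue_active() test was trying to avoid.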

Re: [ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node

2015-10-08 Thread Jesse Gross

On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme  wrote:
>
>> On Oct 6, 2015, at 6:01 PM, Jesse Gross  wrote:
>>
>> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
>>  wrote:
>>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:

>>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>>>
>>>>> When openvswitch tries allocate memory from offline numa node 0:
>>>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
>>>>> 0)
>>>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>>>> This patch disables numa affinity in this case.
>>>>>
>>>>> Signed-off-by: Konstantin Khlebnikov
>>>>
>>>>
>>>> ...
>>>>
>>>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>>>> index f2ea83ba4763..c7f74aab34b9 100644
>>>>> --- a/net/openvswitch/flow_table.c
>>>>> +++ b/net/openvswitch/flow_table.c
>>>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>>>
>>>>>  /* Initialize the default stat node. */
>>>>>  stats = kmem_cache_alloc_node(flow_stats_cache,
>>>>> -  GFP_KERNEL | __GFP_ZERO, 0);
>>>>> +  GFP_KERNEL | __GFP_ZERO,
>>>>> +  node_online(0) ? 0 : NUMA_NO_NODE);
>>>>
>>>>
>>>> Stupid question: can node 0 become offline between this check, and the
>>>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>>>
>>>
>>> Another question to ask would be is it possible for node 0 to be online, but
>>> be a memoryless node?
>>>
>>> I would say you are better off just making this call kmem_cache_alloc.  I
>>> don't see anything that indicates the memory has to come from node 0, so
>>> adding the extra overhead doesn't provide any value.
>>
>> I agree that this at least makes me wonder, though I actually have
>> concerns in the opposite direction - I see assumptions about this
>> being on node 0 in net/openvswitch/flow.c.
>>
>> Jarno, since you original wrote this code, can you take a look to see
>> if everything still makes sense?
>
> We keep the pre-allocated stats node at array index 0, which is initially 
> used by all CPUs, but if CPUs from multiple numa nodes start updating the 
> stats, we allocate additional stats nodes (up to one per numa node), and the 
> CPUs on node 0 keep using the preallocated entry. If stats cannot be 
> allocated from CPUs local node, then those CPUs keep using the entry at index 
> 0. Currently the code in net/openvswitch/flow.c will try to allocate the 
> local memory repeatedly, which may not be optimal when there is no memory at 
> the local node.
>
> Allocating the memory for the index 0 from other than node 0, as discussed 
> here, just means that the CPUs on node 0 will keep on using non-local memory 
> for stats. In a scenario where there are CPUs on two nodes (0, 1), but only 
> the node 1 has memory, a shared flow entry will still end up having separate 
> memory allocated for both nodes, but both of the nodes would be at node 1. 
> However, there is still a high likelihood that the memory allocations would 
> not share a cache line, which should prevent the nodes from invalidating each 
> other’s caches. Based on this I do not see a problem relaxing the memory 
> allocation for the default stats node. If node 0 has memory, however, it 
> would be better to allocate the memory from node 0.

Thanks for going through all of that.

It seems like the question that is being raised is whether it actually
makes sense to try to get the initial memory on node 0, especially
since it seems to introduce some corner cases? Is there any reason why
the flow is more likely to hit node 0 than a randomly chosen one?
(Assuming that this is a multinode system, otherwise it's kind of a
moot point.) We could have a separate pointer to the default allocated
memory, so it wouldn't conflict with memory that was intentionally
allocated for node 0.
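
The scheme Jarno describes (a preallocated entry at index 0 that every CPU falls back to, plus lazily added per-node entries) can be sketched in userspace C as follows. MAX_NODES, struct flow, stats_for_node() and default_stats_node() are hypothetical names for this model; only the node_online(0) ? 0 : NUMA_NO_NODE choice comes from the quoted patch, and NO_NODE is a local stand-in for NUMA_NO_NODE.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical node count for the sketch */
#define NO_NODE (-1)	/* local stand-in for NUMA_NO_NODE */

struct flow_stats {
	unsigned long packets;
};

struct flow {
	/* stats[0] is preallocated; the others are filled in lazily. */
	struct flow_stats *stats[MAX_NODES];
};

/* Pick the node-local entry if it exists, else fall back to index 0. */
static struct flow_stats *stats_for_node(const struct flow *f, int node)
{
	assert(f != NULL);
	assert(node >= 0 && node < MAX_NODES);
	assert(f->stats[0] != NULL);
	return f->stats[node] ? f->stats[node] : f->stats[0];
}

/* Node hint for the default entry: prefer node 0, else let the allocator choose. */
static int default_stats_node(int node0_online)
{
	return node0_online ? 0 : NO_NODE;
}
```

In this model the entry at index 0 doubles as "node 0's stats" and "everyone's fallback", which is the overlap the suggestion of a separate default pointer would remove.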

Re: [PATCH-next v2 3/4] net/sched: make sch_blackhole.c explicitly non-modular

2015-10-08 Thread Paul Gortmaker

[Re: [PATCH-next v2 3/4] net/sched: make sch_blackhole.c explicitly 
non-modular] On 07/10/2015 (Wed 14:47) Cong Wang wrote:

> On Wed, Oct 7, 2015 at 2:27 PM, Paul Gortmaker
>  wrote:
> > The Kconfig currently controlling compilation of this code is:
> >
> > net/sched/Kconfig:menuconfig NET_SCHED
> > net/sched/Kconfig:  bool "QoS and/or fair queueing"
> >
> > ...meaning that it currently is not being built as a module by anyone.
> 
> Is there any reason why sch_blackhole can't be a module like
> other qdisc's?
> 
> If not, I'd rather making it be a module. It is small but not often used.

As I've said in other similar threads, there are some 300+ places
where code that can't ever be modular uses modular calls and/or
introduces dead module remove code.

Hence here I am making the code consistent with its current limitations.
I'm not looking to extend functionality in code that I don't know
intimately.  I can't do that and do it reliably and guarantee it
works as a module when it has never been used as such before in 300+
places all across the kernel.

If there are interested users who want their code tristate and can vouch
that their code works OK as such, I can drop the patch(es) here ; this
is what happened with 4 of the patches I originally had in v1 of this
very series.  But a good number of the 300+ instances have been this way
since before git history began (2005) and so I wonder the value in say
extending instances like old ISA drivers from bool to tristate...

Paul.

Re: [PATCH net-next] openvswitch: report features supported by the kernel datapath

2015-10-08 Thread Jesse Gross

On Thu, Oct 8, 2015 at 6:53 AM, Jiri Benc  wrote:
> Allow the user space to query what features are supported by the openvswitch
> module. This will be used to allow or disallow certain configurations and/or
> switch between newer and older APIs depending on what the kernel supports.
>
> Two features are reported as supported by this patch: lwtunnel and IPv6
> tunneling support. Theoretically, we could merge these two, as any of them
> implies the other with this patch applied, but it's better to keep them
> separate: kernel 4.3 supports lwtunnels but not IPv6 for ovs, and the
> separation of the two flags allows us to backport a version of this patch
> to 4.3 should the need arise.
>
> Signed-off-by: Jiri Benc 

I have similar concerns as were expressed in the other thread. The
features listed here aren't OVS components and I don't think that it
makes sense for OVS to try to cover everything that is related - the
goal that we've been working towards is to have OVS be less monolithic
and more integrated. So to the extent that it is necessary to have
capabilities be exposed (and I would like to avoid this where
possible), it should be in the individual component, not in OVS.

Re: [PATCH v2 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread Eric Dumazet

On Fri, 2015-10-09 at 06:14 +0800, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test WARNING on net-next/master -- if it's inappropriate base, 
> please ignore]
> 
> config: sh-titan_defconfig (attached as .config)
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=sh 
> 
> All warnings (new ones prefixed by >>):
> 
>net/ipv6/udp.c: In function 'udp6_lib_lookup2':
> >> net/ipv6/udp.c:276:1: warning: label 'exact_match' defined but not used 
> >> [-Wunused-label]

Oh right, I'll send a V3 then.




Re: [PATCH v2 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread kbuild test robot

Hi Eric,

[auto build test WARNING on net-next/master -- if it's inappropriate base, 
please ignore]

config: sh-titan_defconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sh 

All warnings (new ones prefixed by >>):

   net/ipv6/udp.c: In function 'udp6_lib_lookup2':
>> net/ipv6/udp.c:276:1: warning: label 'exact_match' defined but not used 
>> [-Wunused-label]

vim +/exact_match +276 net/ipv6/udp.c

72289b96 Tom Herbert     2013-01-22  260          } else if (score == badness && reuseport) {
72289b96 Tom Herbert     2013-01-22  261                  matches++;
8fc54f68 Daniel Borkmann 2014-08-23  262                  if (reciprocal_scale(hash, matches) == 0)
72289b96 Tom Herbert     2013-01-22  263                          result = sk;
72289b96 Tom Herbert     2013-01-22  264                  hash = next_pseudo_random32(hash);
fddc17de Eric Dumazet    2009-11-08  265          }
fddc17de Eric Dumazet    2009-11-08  266      }
fddc17de Eric Dumazet    2009-11-08  267      /*
fddc17de Eric Dumazet    2009-11-08  268       * if the nulls value we got at the end of this lookup is
fddc17de Eric Dumazet    2009-11-08  269       * not the expected one, we must restart lookup.
fddc17de Eric Dumazet    2009-11-08  270       * We probably met an item that was moved to another chain.
fddc17de Eric Dumazet    2009-11-08  271       */
fddc17de Eric Dumazet    2009-11-08  272      if (get_nulls_value(node) != slot2)
fddc17de Eric Dumazet    2009-11-08  273          goto begin;
fddc17de Eric Dumazet    2009-11-08  274
fddc17de Eric Dumazet    2009-11-08  275      if (result) {
fddc17de Eric Dumazet    2009-11-08 @276  exact_match:
c31504dc Eric Dumazet    2010-11-15  277          if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
fddc17de Eric Dumazet    2009-11-08  278              result = NULL;
fddc17de Eric Dumazet    2009-11-08  279          else if (unlikely(compute_score2(result, net, saddr, sport,
fddc17de Eric Dumazet    2009-11-08  280                    daddr, hnum, dif) < badness)) {
fddc17de Eric Dumazet    2009-11-08  281              sock_put(result);
fddc17de Eric Dumazet    2009-11-08  282              goto begin;
fddc17de Eric Dumazet    2009-11-08  283          }
fddc17de Eric Dumazet    2009-11-08  284      }

:: The code at line 276 was first introduced by commit
:: fddc17defa22d8caba1cdfb2e22b50bb4b9f35c0 ipv6: udp: optimize unicast RX 
path

:: TO: Eric Dumazet 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: [PATCH v2] netfilter: fix bad checksum on IPv6 when NAT is performed

2015-10-08 Thread Tom Herbert

On Thu, Oct 8, 2015 at 2:26 PM, Maxime Bizon  wrote:
>
> On Thu, 2015-10-08 at 14:09 -0700, Tom Herbert wrote:
>
>> I think inet_proto_csum_replace16 should be called here.
>
> inet_proto_csum_replace16() wants a non NULL checksum pointer to update,
> and there is no such thing here.
>
> I could pass a dummy value, but inet_proto_csum_replace16() will do
> twice more work for nothing
>
> or I could modify inet_proto_csum_replace16() to allow a NULL sum
> argument.
>
If I am reading the code correctly, it looks like
inet_proto_csum_replace16 is called from
{tcp,dccp,tcp.udp,udplite}_manip_pkt via l3proto->csum_update which is
a call to nf_nat_ipv6_csum_update for IPv6. So for these protocols
when the transport checksum is updated skb->csum is also correctly
updated. For other protocols or when UDP checksum is zero or for
fragments-- I don't see where skb->csum is being properly updated.
This potentially seems to be a major bug.

I would suggest:
1) Modify the inet_proto_csum_replace* routines to check for a NULL
check pointer and only write it when non-NULL
2) Call l3proto->csum_update from unknown_manip_pkt with check == NULL
3) In udp_manip_pkt call l3proto->csum_update with check = NULL when
UDP checksum is zero
4) For IPv6 fragments call l3proto->csum_update with check = NULL

Tom



> --
> Maxime
>
>
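
Suggestion 1) above can be sketched in userspace C as an incremental one's-complement update that accepts a NULL check pointer. The arithmetic follows RFC 1624 (HC' = ~(~HC + ~m + m')). The names fold(), csum_words() and csum_replace_word() are hypothetical, and the "running" argument only loosely models a running packet sum in the spirit of skb->csum; the kernel's real bookkeeping differs in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full checksum over 16-bit words: complement of the one's-complement sum. */
static uint16_t csum_words(const uint16_t *w, size_t n)
{
	uint32_t sum = 0;

	assert(w != NULL);
	for (size_t i = 0; i < n; i++)
		sum += w[i];
	return (uint16_t)~fold(sum);
}

/*
 * Incremental update when one word changes from "from" to "to".
 * check may be NULL when there is no transport checksum to patch;
 * running, if non-NULL, is an uncomplemented running sum that is
 * always kept up to date.
 */
static void csum_replace_word(uint16_t *check, uint16_t *running,
			      uint16_t from, uint16_t to)
{
	uint32_t delta = (uint32_t)(uint16_t)~from + to;

	if (check)
		*check = (uint16_t)~fold((uint32_t)(uint16_t)~*check + delta);
	if (running)
		*running = fold((uint32_t)*running + delta);
}
```

The point of the NULL test is that the running sum is still corrected when there is nothing to write into the packet, which is the case for unknown protocols, zero UDP checksums and fragments listed in items 2) to 4).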

Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-08 Thread Alexei Starovoitov


On 10/8/15 11:20 AM, Hannes Frederic Sowa wrote:

> Hi Alexei,
>
> On Thu, Oct 8, 2015, at 07:23, Alexei Starovoitov wrote:
>> The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
>> This toggle defaults to off (0), but can be set true (1).  Once true,
>> bpf programs and maps cannot be accessed from unprivileged process,
>> and the toggle cannot be set back to false.
>
> This approach seems fine to me.
>
> I am wondering if it makes sense to somehow allow ebpf access per
> namespace? I currently have no idea how that could work and on which
> namespace type to depend or going with a prctl or even cgroup maybe. The
> rationale behind this is, that maybe some namespaces like openstack
> router namespaces could make usage of advanced ebpf capabilities in the
> kernel, while other namespaces, especially where untrusted third parties
> are hosted, shouldn't have access to those facilities.
>
> In that way, hosters would be able to e.g. deploy more efficient
> performance monitoring container (which should still need not to run as
> root) while the majority of the users has no access to that. Or think
> about routing instances in some namespaces, etc. etc.


When we're talking about eBPF for networking or performance monitoring,
it's all going to be under root anyway. The next question is
how to let the programs run only for traffic or for applications within
namespaces. Something has to do this demux. It either can be in-kernel
C code which is configured via some API that calls different eBPF
programs based on cgroup or based on netns, or it can be another
eBPF program that does demux on its own.
In case of tracing such 'demuxing' program can be attached to kernel
events and call 'worker' programs via tail_call, so that 'worker'
programs will have an illusion that they're working only with events
that belong to their namespace.
imo existing facilities already allow 'per namespace' eBPF, though
the prog_array used to jump from 'demuxing' bpf into 'worker' bpf
currently is a bit awkward to use (because of FD passing via daemon),
but that will get solved soon.
It feels that in-kernel C code doing filtering may be
'more robust' from namespace isolation point of view, but I don't
think we have a concrete and tested proposal, so I would
experiment with 'demuxing' bpf first.
The programs in general don't have a notion of namespace. They
need to be attached to veth via TC to get packets for
particular namespace.
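
A rough model of the 'demuxing' idea in plain userspace C, not BPF. Every
name here (struct event, ns_id, struct prog_array as a function-pointer
table, demux(), count_bytes()) is an assumption made for illustration. In
real eBPF the jump would go through a prog_array map and the tail_call
helper, and a successful tail call does not return to the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NS 4	/* size of the model prog_array (arbitrary) */

struct event {
	uint32_t ns_id;	/* hypothetical namespace index of the event */
	uint32_t bytes;
};

typedef void (*worker_fn)(const struct event *ev, uint64_t *stat);

/* Models a prog_array: one optional worker per namespace slot. */
struct prog_array {
	worker_fn slot[MAX_NS];
	uint64_t stat[MAX_NS];
};

/* A 'worker' only ever sees events of the namespace it was installed for. */
static void count_bytes(const struct event *ev, uint64_t *stat)
{
	*stat += ev->bytes;
}

/* The 'demuxing' program: pick the worker by namespace and jump to it. */
static int demux(struct prog_array *pa, const struct event *ev)
{
	if (ev->ns_id >= MAX_NS || !pa->slot[ev->ns_id])
		return -1;	/* no worker installed for this namespace */
	pa->slot[ev->ns_id](ev, &pa->stat[ev->ns_id]);
	return 0;
}
```

The worker never inspects ns_id itself, which is the 'illusion' described
above: the filtering happens entirely in the demux step.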


Re: [PATCH 3.10.y 0/2] ipv6: avoid soft lockups in fib6_run_gc()

2015-10-08 Thread Ben Hutchings

On Wed, 2015-06-10 at 13:40 +0300, Konstantin Khlebnikov wrote:
> Two patches from 3.11 which are missing in 3.10.y
> 
> I've just seen livelock in 3.10.69+ where all cpus are stuck in
> fib6_run_gc()
[...]

These also looked applicable to 3.2, so I've queued them up too.

Ben.


-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.



[PATCH v2 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread Eric Dumazet

SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command
to fetch the incoming cpu handling a particular TCP flow after accept().

This commit adds setsockopt() support and extends the SO_REUSEPORT selection
logic: if a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if the CPU handling the packet matches the
specified one.

This allows building very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keeps cpu caches hot.

Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.

Signed-off-by: Eric Dumazet 
---
 include/net/sock.h  | 10 --
 net/core/sock.c |  5 +
 net/ipv4/inet_hashtables.c  |  2 ++
 net/ipv4/udp.c  |  6 +-
 net/ipv6/inet6_hashtables.c |  2 ++
 net/ipv6/udp.c  | 10 +++---
 6 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index dfe2eb8e1132..08abffe32236 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,6 +150,7 @@ typedef __u64 __bitwise __addrpair;
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_incoming_cpu: record/match cpu processing incoming packets
  * @skc_refcnt: reference count
  *
  * This is the minimal network layer representation of sockets, the header
@@ -212,6 +213,8 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
+   int skc_incoming_cpu;
+
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
@@ -274,7 +277,6 @@ struct cg_proto;
   *@sk_rcvtimeo: %SO_RCVTIMEO setting
   *@sk_sndtimeo: %SO_SNDTIMEO setting
   *@sk_rxhash: flow hash received from netif layer
-  *@sk_incoming_cpu: record cpu processing incoming packets
   *@sk_txhash: computed flow hash for use on transmit
   *@sk_filter: socket filtering instructions
   *@sk_timer: sock cleanup timer
@@ -331,6 +333,7 @@ struct sock {
 #define sk_v6_daddr __sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
 #define sk_cookie  __sk_common.skc_cookie
+#define sk_incoming_cpu __sk_common.skc_incoming_cpu
 
socket_lock_t   sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -353,11 +356,6 @@ struct sock {
 #ifdef CONFIG_RPS
__u32   sk_rxhash;
 #endif
-   u16 sk_incoming_cpu;
-   /* 16bit hole
-* Warned : sk_incoming_cpu can be set from softirq,
-* Do not use this hole without fully understanding possible issues.
-*/
 
__u32   sk_txhash;
 #ifdef CONFIG_NET_RX_BUSY_POLL
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dd1263e4c24..1071f9380250 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -988,6 +988,10 @@ set_rcvbuf:
 sk->sk_max_pacing_rate);
break;
 
+   case SO_INCOMING_CPU:
+   sk->sk_incoming_cpu = val;
+   break;
+
default:
ret = -ENOPROTOOPT;
break;
@@ -2353,6 +2357,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
sk->sk_max_pacing_rate = ~0U;
sk->sk_pacing_rate = ~0U;
+   sk->sk_incoming_cpu = -1;
/*
 * Before updating sk_refcnt, we must commit prior changes to memory
 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bed8886a4b6c..08643a3616af 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -185,6 +185,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
}
+   if (sk->sk_incoming_cpu == raw_smp_processor_id())
+   score++;
}
return score;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1fc129099ea..24ec14f9825c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -375,7 +375,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
}
-
+   if (sk->sk_incoming_cpu == raw_smp_processor_id())
+   score++;
return score;
 }
 
@@ -419,6 +420,9 @@ static inline int compute_score2(struct sock *sk, struct 
net

[PATCH v2 net-next 3/4] net: shrink struct sock and request_sock by 8 bytes

2015-10-08 Thread Eric Dumazet

A 32-bit hole follows skc_refcnt; use it.
skc_incoming_cpu can also share a union with the request_sock rcv_wnd.

Signed-off-by: Eric Dumazet 
---
 include/net/request_sock.h |  5 ++---
 include/net/sock.h | 14 +-
 net/ipv4/syncookies.c  |  4 ++--
 net/ipv4/tcp_input.c   |  2 +-
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_minisocks.c   | 18 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/syncookies.c  |  4 ++--
 net/ipv6/tcp_ipv6.c|  2 +-
 9 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 6b818b77d5e5..2e73748956d5 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -51,15 +51,14 @@ struct request_sock {
 #define rsk_refcnt __req_common.skc_refcnt
 #define rsk_hash   __req_common.skc_hash
 #define rsk_listener   __req_common.skc_listener
+#define rsk_window_clamp   __req_common.skc_window_clamp
+#define rsk_rcv_wnd __req_common.skc_rcv_wnd
 
struct request_sock *dl_next;
u16 mss;
u8  num_retrans; /* number of retransmits */
u8  cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
u8  num_timeout:7; /* number of timeouts */
-   /* The following two fields can be easily recomputed I think -AK */
-   u32 window_clamp; /* window clamp at creation time */
-   u32 rcv_wnd;  /* rcv_wnd offered first time */
u32 ts_recent;
struct timer_list   rsk_timer;
const struct request_sock_ops   *rsk_ops;
diff --git a/include/net/sock.h b/include/net/sock.h
index a7818104a73f..fce12399fad4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -226,11 +226,18 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
-   int skc_incoming_cpu;
+   union {
+   int skc_incoming_cpu;
+   u32 skc_rcv_wnd;
+   };
 
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
+   union {
+   u32 skc_rxhash;
+   u32 skc_window_clamp;
+   };
/* public: */
 };
 
@@ -287,7 +294,6 @@ struct cg_proto;
   *@sk_rcvlowat: %SO_RCVLOWAT setting
   *@sk_rcvtimeo: %SO_RCVTIMEO setting
   *@sk_sndtimeo: %SO_SNDTIMEO setting
-  *@sk_rxhash: flow hash received from netif layer
   *@sk_txhash: computed flow hash for use on transmit
   *@sk_filter: socket filtering instructions
   *@sk_timer: sock cleanup timer
@@ -346,6 +352,7 @@ struct sock {
 #define sk_cookie  __sk_common.skc_cookie
 #define sk_incoming_cpu __sk_common.skc_incoming_cpu
 #define sk_flags   __sk_common.skc_flags
+#define sk_rxhash  __sk_common.skc_rxhash
 
socket_lock_t   sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -365,9 +372,6 @@ struct sock {
} sk_backlog;
 #define sk_rmem_alloc sk_backlog.rmem_alloc
int sk_forward_alloc;
-#ifdef CONFIG_RPS
-   __u32   sk_rxhash;
-#endif
 
__u32   sk_txhash;
 #ifdef CONFIG_NET_RX_BUSY_POLL
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8113c30ccf96..0769248bc0db 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -381,10 +381,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
}
 
/* Try to redo what tcp_v4_send_synack did. */
-   req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+   req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
 
tcp_select_initial_window(tcp_full_space(sk), req->mss,
- &req->rcv_wnd, &req->window_clamp,
+ &req->rsk_rcv_wnd, &req->rsk_window_clamp,
  ireq->wscale_ok, &rcv_wscale,
  dst_metric(&rt->dst, RTAX_INITRWND));
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ddadb318e850..3b35c3f4d268 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6022,7 +6022,7 @@ static void tcp_openreq_init(struct request_sock *req,
 {
struct inet_request_sock *ireq = inet_rsk(req);
 
-   req->rcv_wnd = 0;   /* So that tcp_send_synack() knows! */
+   req->rsk_rcv_wnd = 0;   /* So that tcp_send_synack() knows! */
req->cookie_ts = 0;
tcp_rsk(req)->rcv_isn = TCP_SKB_CB(

[PATCH v2 net-next 4/4] tcp: shrink tcp_timewait_sock by 8 bytes

2015-10-08 Thread Eric Dumazet

Reducing tcp_timewait_sock from 280 bytes to 272 bytes
allows SLAB to pack 15 objects per page instead of 14 (on x86)

Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h | 4 ++--
 include/net/sock.h  | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e442e6e9a365..86a7edaa6797 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -356,8 +356,8 @@ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
 
 struct tcp_timewait_sock {
struct inet_timewait_sock tw_sk;
-   u32   tw_rcv_nxt;
-   u32   tw_snd_nxt;
+#define tw_rcv_nxt tw_sk.__tw_common.skc_tw_rcv_nxt
+#define tw_snd_nxt tw_sk.__tw_common.skc_tw_snd_nxt
u32   tw_rcv_wnd;
u32   tw_ts_offset;
u32   tw_ts_recent;
diff --git a/include/net/sock.h b/include/net/sock.h
index fce12399fad4..288934da0ae3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -229,6 +229,7 @@ struct sock_common {
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
+   u32 skc_tw_rcv_nxt; /* struct tcp_timewait_sock  */
};
 
atomic_t skc_refcnt;
@@ -237,6 +238,7 @@ struct sock_common {
union {
u32 skc_rxhash;
u32 skc_window_clamp;
+   u32 skc_tw_snd_nxt; /* struct tcp_timewait_sock */
};
/* public: */
 };
-- 
2.6.0.rc2.230.g3dd15c0
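
The 14 versus 15 figure in the changelog is plain integer division. A small
sketch of the arithmetic, assuming a 4096-byte page and ignoring any
per-slab metadata the allocator may keep; objs_per_page() is a helper named
here for illustration only:

```c
#include <stddef.h>

/* Whole objects of obj_size bytes that fit in one page of page_size bytes. */
static size_t objs_per_page(size_t page_size, size_t obj_size)
{
	return obj_size ? page_size / obj_size : 0;
}
```

With these assumptions, 4096 / 280 gives 14 objects (176 bytes left over)
and 4096 / 272 gives 15 objects (16 bytes left over).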


[PATCH v2 net-next 2/4] net: align sk_refcnt on 128 bytes boundary

2015-10-08 Thread Eric Dumazet

sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
This is a performance issue if multiple cpus hit a common socket,
or multiple sockets are chained due to SO_REUSEPORT.

By moving sk_refcnt 8 bytes further, the first 128 bytes of sockets
are read-mostly. As they contain the lookup keys, this has
a considerable performance impact, as cpus can cache them.

These 8 bytes are not wasted, we use them as a place holder
for various fields, depending on the socket type.

Tested:
 SYN flood hitting a 16 RX queues NIC.
 TCP listener using 16 sockets and SO_REUSEPORT
 and SO_INCOMING_CPU for proper siloing.

 Could process 6.0 Mpps SYN instead of 4.2 Mpps

 Kernel profile looked like :
11.68%  [kernel]  [k] sha_transform
 6.51%  [kernel]  [k] __inet_lookup_listener
 5.07%  [kernel]  [k] __inet_lookup_established
 4.15%  [kernel]  [k] memcpy_erms
 3.46%  [kernel]  [k] ipt_do_table
 2.74%  [kernel]  [k] fib_table_lookup
 2.54%  [kernel]  [k] tcp_make_synack
 2.34%  [kernel]  [k] tcp_conn_request
 2.05%  [kernel]  [k] __netif_receive_skb_core
 2.03%  [kernel]  [k] kmem_cache_alloc

Signed-off-by: Eric Dumazet 
---
 include/net/inet_timewait_sock.h |  2 +-
 include/net/request_sock.h   |  2 +-
 include/net/sock.h   | 17 ++---
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 186f3a1e1b1f..e581fc69129d 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -70,6 +70,7 @@ struct inet_timewait_sock {
 #define tw_dport   __tw_common.skc_dport
 #define tw_num __tw_common.skc_num
 #define tw_cookie  __tw_common.skc_cookie
+#define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
volatile unsigned char  tw_substate;
@@ -88,7 +89,6 @@ struct inet_timewait_sock {
kmemcheck_bitfield_end(flags);
struct timer_list   tw_timer;
struct inet_bind_bucket *tw_tb;
-   struct inet_timewait_death_row *tw_dr;
 };
 #define tw_tclass tw_tos
 
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 95ab5d7aab96..6b818b77d5e5 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -50,9 +50,9 @@ struct request_sock {
struct sock_common  __req_common;
 #define rsk_refcnt __req_common.skc_refcnt
 #define rsk_hash   __req_common.skc_hash
+#define rsk_listener   __req_common.skc_listener
 
struct request_sock *dl_next;
-   struct sock *rsk_listener;
u16 mss;
u8  num_retrans; /* number of retransmits */
u8  cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
diff --git a/include/net/sock.h b/include/net/sock.h
index 08abffe32236..a7818104a73f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,6 +150,9 @@ typedef __u64 __bitwise __addrpair;
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_flags: place holder for sk_flags
+ * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  * @skc_incoming_cpu: record/match cpu processing incoming packets
  * @skc_refcnt: reference count
  *
@@ -201,6 +204,16 @@ struct sock_common {
 
atomic64_t  skc_cookie;
 
+   /* following fields are padding to force
+* offset(struct sock, sk_refcnt) == 128 on 64bit arches
+* assuming IPV6 is enabled. We use this padding differently
+* for different kind of 'sockets'
+*/
+   union {
+   unsigned long   skc_flags;
+   struct sock *skc_listener; /* request_sock */
+   struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
+   };
/*
 * fields between dontcopy_begin/dontcopy_end
 * are not copied in sock_copy()
@@ -246,8 +259,6 @@ struct cg_proto;
   *@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *@sk_sndbuf: size of send buffer in bytes
-  *@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
-  *   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
   *@sk_no_check_rx: allow zero checksum in RX packets
   *@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
@@ -334,6 +345,7 @@ struct sock {
 #define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
 #define sk_cookie

[PATCH v2 net-next 0/4] tcp: better smp listener behavior

2015-10-08 Thread Eric Dumazet

As promised in last patch series, we implement a better SO_REUSEPORT
strategy, based on cpu hints if given by the application.

We also moved sk_refcnt out of the cache line containing the lookup
keys, as it was considerably slowing down smp operations because
of false sharing. This was simpler than converting listen sockets
to conventional RCU (to avoid sk_refcnt dirtying)

Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.

Eric Dumazet (4):
  net: SO_INCOMING_CPU setsockopt() support
  net: align sk_refcnt on 128 bytes boundary
  net: shrink struct sock and request_sock by 8 bytes
  tcp: shrink tcp_timewait_sock by 8 bytes

 include/linux/tcp.h  |  4 ++--
 include/net/inet_timewait_sock.h |  2 +-
 include/net/request_sock.h   |  7 +++
 include/net/sock.h   | 41 +++-
 net/core/sock.c  |  5 +
 net/ipv4/inet_hashtables.c   |  2 ++
 net/ipv4/syncookies.c|  4 ++--
 net/ipv4/tcp_input.c |  2 +-
 net/ipv4/tcp_ipv4.c  |  2 +-
 net/ipv4/tcp_minisocks.c | 18 +-
 net/ipv4/tcp_output.c|  2 +-
 net/ipv4/udp.c   |  6 +-
 net/ipv6/inet6_hashtables.c  |  2 ++
 net/ipv6/syncookies.c|  4 ++--
 net/ipv6/tcp_ipv6.c  |  2 +-
 net/ipv6/udp.c   | 10 +++---
 16 files changed, 72 insertions(+), 41 deletions(-)

-- 
2.6.0.rc2.230.g3dd15c0
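
For readers who want to see the layout idea in isolation, here is a small
sketch under stated assumptions. struct sock_model is not the kernel's
struct sock_common: the 120-byte key area and the union members are
invented so that the frequently written refcnt starts at byte 128, which is
the offset the series targets on 64-bit arches.

```c
#include <stddef.h>
#include <stdint.h>

struct sock_model {
	uint8_t lookup_keys[120];	/* read-mostly: addresses, ports, hash */
	union {				/* 8 bytes reused per socket type */
		uint64_t flags;
		void *listener;
		void *tw_dr;
	} pad;
	uint32_t refcnt;		/* dirtied for every incoming packet */
};

/* Fails the build if a later change moves refcnt back among the keys. */
_Static_assert(offsetof(struct sock_model, refcnt) == 128,
	       "refcnt must start a new 128-byte block");

/* True if byte offsets a and b fall in the same block of 'block' bytes. */
static int same_block(size_t a, size_t b, size_t block)
{
	return a / block == b / block;
}
```

The point of the union is that the 8 bytes of padding still carry useful
per-socket-type data, while writes to refcnt no longer touch the block that
lookups read.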


[PATCH net-next 1/2] sock: support per-packet fwmark

2015-10-08 Thread Edward Hyunkoo Jee

It's useful to allow users to set fwmark for an individual packet,
without changing the socket state. The function this patch adds in
sock layer can be used by the protocols that need such a feature.

Signed-off-by: Edward Hyunkoo Jee 
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
---
 include/net/sock.h |  7 +++
 net/core/sock.c| 26 ++
 2 files changed, 33 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index dfe2eb8..03ca20f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1514,6 +1514,13 @@ void sock_kfree_s(struct sock *sk, void *mem, int size);
 void sock_kzfree_s(struct sock *sk, void *mem, int size);
 void sk_send_sigurg(struct sock *sk);
 
+struct sockcm_cookie {
+   u32 mark;
+};
+
+int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
+  struct sockcm_cookie *sockc);
+
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
  * does not implement a particular function.
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dd1263..3395777 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1852,6 +1852,32 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
+int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
+  struct sockcm_cookie *sockc)
+{
+   struct cmsghdr *cmsg;
+
+   for_each_cmsghdr(cmsg, msg) {
+   if (!CMSG_OK(msg, cmsg))
+   return -EINVAL;
+   if (cmsg->cmsg_level != SOL_SOCKET)
+   continue;
+   switch (cmsg->cmsg_type) {
+   case SO_MARK:
+   if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+   if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+   return -EINVAL;
+   sockc->mark = *(u32 *)CMSG_DATA(cmsg);
+   break;
+   default:
+   return -EINVAL;
+   }
+   }
+   return 0;
+}
+EXPORT_SYMBOL(sock_cmsg_send);
+
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER get_order(32768)
 
-- 
2.6.0.rc2.230.g3dd15c0
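
From the userspace side, a caller would hand the mark to sendmsg() as
ancillary data. The sketch below only builds the control message that
sock_cmsg_send() above would parse (level SOL_SOCKET, type SO_MARK, a
32-bit payload); it does not send anything. union mark_ctl and
attach_mark() are names invented for this example, and the SO_MARK fallback
value is the generic Linux one, used only if the headers lack it.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_MARK
#define SO_MARK 36	/* generic Linux value; some arches differ */
#endif

/* Control buffer sized for exactly one u32 cmsg, aligned for cmsghdr. */
union mark_ctl {
	char buf[CMSG_SPACE(sizeof(uint32_t))];
	struct cmsghdr align;
};

/* Attach a per-packet mark to msg; ctl must outlive the sendmsg() call. */
static void attach_mark(struct msghdr *msg, union mark_ctl *ctl, uint32_t mark)
{
	struct cmsghdr *cmsg;

	memset(ctl, 0, sizeof(*ctl));
	msg->msg_control = ctl->buf;
	msg->msg_controllen = sizeof(ctl->buf);

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SO_MARK;
	/* the kernel side rejects any length other than CMSG_LEN(4) */
	cmsg->cmsg_len = CMSG_LEN(sizeof(mark));
	memcpy(CMSG_DATA(cmsg), &mark, sizeof(mark));
}
```

Per the patch, the sender needs CAP_NET_ADMIN in the socket's user
namespace, otherwise the call fails with -EPERM; any other SOL_SOCKET cmsg
type is rejected with -EINVAL.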


[PATCH net-next 2/2] packet: support per-packet fwmark for af_packet sendmsg

2015-10-08 Thread Edward Hyunkoo Jee

Signed-off-by: Edward Hyunkoo Jee 
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
---
 net/packet/af_packet.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 81c900f..9d8c7fa 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2630,6 +2630,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
__be16 proto;
unsigned char *addr;
int err, reserve = 0;
+   struct sockcm_cookie sockc;
struct virtio_net_hdr vnet_hdr = { 0 };
int offset = 0;
int vnet_hdr_len;
@@ -2665,6 +2666,13 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.mark = sk->sk_mark;
+   if (msg->msg_controllen) {
+   err = sock_cmsg_send(sk, msg, &sockc);
+   if (unlikely(err))
+   goto out_unlock;
+   }
+
if (sock->type == SOCK_RAW)
reserve = dev->hard_header_len;
if (po->has_vnet_hdr) {
@@ -2774,7 +2782,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->protocol = proto;
skb->dev = dev;
skb->priority = sk->sk_priority;
-   skb->mark = sk->sk_mark;
+   skb->mark = sockc.mark;
 
packet_pick_tx_queue(dev, skb);
 
-- 
2.6.0.rc2.230.g3dd15c0


Re: [PATCH RFC 3/3] mlx4: Call skb_csum_offload_check to check offloadability

2015-10-08 Thread Or Gerlitz

On Wed, Oct 7, 2015 at 9:07 PM, Tom Herbert  wrote:
> On Wed, Oct 7, 2015 at 8:41 AM, Or Gerlitz  wrote:
>> On 10/6/2015 2:39 AM, Tom Herbert wrote:

>>>   +static const struct skb_csum_offl_spec csum_offl_spec = {
>>> +   .ipv4_okay = 1,
>>> +   .ipv6_okay = 1,
>>> +   .encap_okay = 1,
>>> +   .tcp_okay = 1,
>>> +   .udp_okay = 1,
>>> +};
>>> +
[...]


> Or, I would only give the mlnx support as an example. I think driver
> owners would need to implement the specification for their devices.

sure, sorry to bother on that.

>> Another constraint, is that when the device does support (say) TCP TX
>> checksum offload for the inner packet they don't support UDP
>> checksum offload for the outer packet.

> We can add such things to the specification. One value I see in having
> a common structure to describe the checksum capabilities is that
> becomes a way to clearly document what is (and is not) supported by
> devices.

> btw, I don't quite understand your example. If a device does not
> support UDP checksum there is a flag for that in the specification.

But we do support UDP checksum generation for not-tunneled packets, so
the specification should somehow capture this combination.

> If the stack sends a TCP packet encapsulated in a UDP packet with UDP
> checksum enabled, it will try to offload the UDP checksum and not the
> TCP one.

Not following... who is "it" in this sentence, the stack or the device?

We don't advertise NETIF_F_GSO_UDP_TUNNEL_CSUM, so for UDP-tunneled
GSO-ed packets we should be fine. For non-GSO... you say this works
only b/c the default for the vxlan driver is to use zero udp checksum?

> There is currently no interface to offload two checksums in
> the same packet (in non-GRO at least)

GRO? aren't we talking on xmit?

> and with things like RCO we probably will never need that anyway.

Re: [PATCH v2] netfilter: fix bad checksum on IPv6 when NAT is performed

2015-10-08 Thread Maxime Bizon


On Thu, 2015-10-08 at 14:09 -0700, Tom Herbert wrote:

> I think inet_proto_csum_replace16 should be called here.

inet_proto_csum_replace16() wants a non NULL checksum pointer to update,
and there is no such thing here.

I could pass a dummy value, but inet_proto_csum_replace16() would then do
twice the work for nothing.

Or I could modify inet_proto_csum_replace16() to allow a NULL sum
argument.

-- 
Maxime



Re: [PATCH net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread Eric Dumazet

On Thu, 2015-10-08 at 13:53 -0700, Tom Herbert wrote:

> If the incoming CPU is set for a connected UDP via
> sk_incoming_cpu_update, wouldn't this check subsequently _only_ allow
> packets for that socket to come from the same CPU?
> 

Hmm, I thought the SO_REUSEPORT path would be taken only for non
connected UDP sockets (like TCP listeners.).

But you might be right !

> Also, the check seems a little austere. Why not do something like:
> 
>if (sk->sk_incoming_cpu != -1) {
>if (sk->sk_incoming_cpu != raw_smp_processor_id())
> score += 4;
>}
> 
> My worry is that the packet steering configuration may change without
> the application's knowledge, so it's possible packets may come in on
> CPUs that are unexpected to the application and then they would be
> dropped without matching a socket. I suppose that this could work with
> the original patch if a socket is bound to every CPU or there is at
> least one listener socket that is not bound to any CPU.

This is what I initially wrote. I then attempted a shortcut (aborting the
full list scan) and forgot to reinstate the first version when I decided
to leave that for a future patch (Ying patch):

if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;

(Note we do not even have to test for sk_incoming_cpu == -1 in this
variant)

I'll include this in v2.
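
To spell out the difference between the two variants, here is a small
userspace model, not the kernel's compute_score(). struct listener,
score_prefer(), score_strict() and pick() are names made up for this
sketch; score_strict() is an assumed reading of the v1 behaviour discussed
above, and ties are resolved by "first wins" here rather than by the
SO_REUSEPORT hashing the kernel uses.

```c
#include <stddef.h>

struct listener {
	int incoming_cpu;	/* -1 means no SO_INCOMING_CPU hint */
	int base_score;		/* score from address/port/device matching */
};

/* v2 variant: a matching cpu is only a tie-breaker. */
static int score_prefer(const struct listener *l, int cpu)
{
	int score = l->base_score;

	if (l->incoming_cpu == cpu)	/* cpu is never -1, so no extra test */
		score++;
	return score;
}

/* Strict variant (assumed reading of v1): a cpu mismatch disqualifies. */
static int score_strict(const struct listener *l, int cpu)
{
	if (l->incoming_cpu != -1 && l->incoming_cpu != cpu)
		return -1;
	return l->base_score;
}

/* Return the index of the best-scoring listener, or -1 if none qualifies. */
static int pick(const struct listener *v, size_t n, int cpu,
		int (*score)(const struct listener *, int))
{
	int best = -1, best_idx = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int s = score(&v[i], cpu);

		if (s > best) {
			best = s;
			best_idx = (int)i;
		}
	}
	return best_idx;
}
```

With listeners hinted to cpu 0 and cpu 1, a packet handled on cpu 5 still
finds a socket under score_prefer(), while score_strict() leaves it
unmatched, which is the drop scenario raised in the quoted text.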

Thanks.



Re: [PATCH net-next] tcp: fix RFS vs lockless listeners

2015-10-08 Thread Tom Herbert

On Thu, Oct 8, 2015 at 11:16 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> Before recent TCP listener patches, we were updating listener
> sk->sk_rxhash before the cloning of master socket.
>
> children sk_rxhash was therefore correct after the normal 3WHS.
>
> But with lockless listener, we no longer dirty/change listener sk_rxhash
> as it would be racy.
>
> We need to correctly update the child sk_rxhash, otherwise first data
> packet won't hit the correct cpu if RFS is used.
>
> Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
> Signed-off-by: Eric Dumazet 
> Reported-by: Willem de Bruijn 
> Cc: Tom Herbert 
> ---
>  net/ipv4/syncookies.c|1 +
>  net/ipv4/tcp_minisocks.c |1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 8113c30ccf96..2dbb11331f6c 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -225,6 +225,7 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
> child = icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst);
> if (child) {
> atomic_set(&req->rsk_refcnt, 1);
> +   sock_rps_save_rxhash(child, skb);
> inet_csk_reqsk_queue_add(sk, req, child);
> } else {
> reqsk_free(req);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 9adf1e2c3170..1079e6ad77fe 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -768,6 +768,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> if (!child)
> goto listen_overflow;
>
> +   sock_rps_save_rxhash(child, skb);
> tcp_synack_rtt_meas(child, req);
> inet_csk_reqsk_queue_drop(sk, req);
> inet_csk_reqsk_queue_add(sk, req, child);
>
>
Acked-by: Tom Herbert 

Re: [PATCH v2] netfilter: fix bad checksum on IPv6 when NAT is performed

2015-10-08 Thread Tom Herbert

On Thu, Oct 8, 2015 at 1:26 PM, Maxime Bizon  wrote:
>
> With this setup:
>
> * non IPv6 checksumming capable network hardware
> * GRO off
> * IPv6 SNAT
>
> I get this when I receive an UDPv6 reply: ": hw csum failure"
>
> Call trace:
>
> * nf_ip6_checksum() calls __skb_checksum_complete()
> * nf_nat_ipv6_csum_update() & nf_nat_ipv6_manip_pkt()
> * __udp6_lib_rcv() => udp6_csum_init()
> * __skb_checksum_validate_complete() "fastpath" fails because
>   skb->csum is incorrect.
> * udpv6_recvmsg() => skb_copy_and_csum_datagram_msg()
>
> The last call computes a valid checksum despite CHECKSUM_COMPLETE and
> triggers the warning.
>
> When we perform NAT on IPv4, we also update the IPv4 checksum, so
> there is no side effect on skb->csum (since the csum over a valid IPv4
> header area was already zero).
>
> But IPv6 doesn't have such checksum, so when performing NAT we need to
> update skb->csum.
>
> Signed-off-by: Maxime Bizon 
> ---
>  net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 23 +++
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> index 70fbaed..e44af9c 100644
> --- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> +++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> @@ -81,6 +81,8 @@ static bool nf_nat_ipv6_manip_pkt(struct sk_buff *skb,
>   enum nf_nat_manip_type maniptype)
>  {
> struct ipv6hdr *ipv6h;
> +   const __be32 *to;
> +   __be32 *from;
> __be16 frag_off;
> int hdroff;
> u8 nexthdr;
> @@ -100,11 +102,24 @@ static bool nf_nat_ipv6_manip_pkt(struct sk_buff *skb,
> target, maniptype))
> return false;
>  manip_addr:
> -   if (maniptype == NF_NAT_MANIP_SRC)
> -   ipv6h->saddr = target->src.u3.in6;
> -   else
> -   ipv6h->daddr = target->dst.u3.in6;
> +   if (maniptype == NF_NAT_MANIP_SRC) {
> +   from = ipv6h->saddr.s6_addr32;
> +   to = target->src.u3.in6.s6_addr32;
> +   } else {
> +   from = ipv6h->daddr.s6_addr32;
> +   to = target->dst.u3.in6.s6_addr32;
> +   }
> +
> +   if (skb->ip_summed == CHECKSUM_COMPLETE) {
> +   __be32 diff[] = {
> +   ~from[0], ~from[1], ~from[2], ~from[3],
> +   to[0], to[1], to[2], to[3],
> +   };
> +
> +   skb->csum = ~csum_partial(diff, sizeof(diff), ~skb->csum);
> +   }
>
I think inet_proto_csum_replace16 should be called here.

> +   memcpy(from, to, sizeof (struct in6_addr));
> return true;
>  }
>
> --
> 1.9.1
>
>
> --
> Maxime
>
>

Re: [PATCH net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-08 Thread Tom Herbert

On Thu, Oct 8, 2015 at 8:37 AM, Eric Dumazet  wrote:
> SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command
> to fetch incoming cpu handling a particular TCP flow after accept()
>
> This commits adds setsockopt() support and extends SO_REUSEPORT selection
> logic : If a TCP listener or UDP socket has this option set, a packet is
> delivered to this socket only if CPU handling the packet matches the 
> specified one.
>
> This allows to build very efficient TCP servers, using one thread per cpu,
> as the associated TCP listener should only accept flows handled in softirq
> by the same cpu. This provides optimal NUMA/SMP behavior and keep cpu caches 
> hot.
>
> Note that __inet_lookup_listener() still has to iterate over the list of
> all listeners. Following patch puts sk_refcnt in a different cache line
> to let this iteration hit only shared and read mostly cache lines.
>
> Signed-off-by: Eric Dumazet 
> ---
>  include/net/sock.h  | 11 +--
>  net/core/sock.c |  5 +
>  net/ipv4/inet_hashtables.c  |  5 +
>  net/ipv4/udp.c  | 12 +++-
>  net/ipv6/inet6_hashtables.c |  5 +
>  net/ipv6/udp.c  | 11 +++
>  6 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dfe2eb8e1132..00f60bea983b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -150,6 +150,7 @@ typedef __u64 __bitwise __addrpair;
>   * @skc_node: main hash linkage for various protocol lookup tables
>   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   * @skc_tx_queue_mapping: tx queue number for this connection
> + * @skc_incoming_cpu: record/match cpu processing incoming packets
>   * @skc_refcnt: reference count
>   *
>   * This is the minimal network layer representation of sockets, the 
> header
> @@ -212,6 +213,8 @@ struct sock_common {
> struct hlist_nulls_node skc_nulls_node;
> };
> int skc_tx_queue_mapping;
> +   int skc_incoming_cpu;
> +
> atomic_t    skc_refcnt;
> /* private: */
> int skc_dontcopy_end[0];
> @@ -274,7 +277,7 @@ struct cg_proto;
>*@sk_rcvtimeo: %SO_RCVTIMEO setting
>*@sk_sndtimeo: %SO_SNDTIMEO setting
>*@sk_rxhash: flow hash received from netif layer
> -  *@sk_incoming_cpu: record cpu processing incoming packets
> +  *@sk_incoming_cpu: record/match cpu processing incoming packets
>*@sk_txhash: computed flow hash for use on transmit
>*@sk_filter: socket filtering instructions
>*@sk_timer: sock cleanup timer
> @@ -331,6 +334,7 @@ struct sock {
>  #define sk_v6_daddr            __sk_common.skc_v6_daddr
>  #define sk_v6_rcv_saddr        __sk_common.skc_v6_rcv_saddr
>  #define sk_cookie              __sk_common.skc_cookie
> +#define sk_incoming_cpu        __sk_common.skc_incoming_cpu
>
> socket_lock_t   sk_lock;
> struct sk_buff_head sk_receive_queue;
> @@ -353,11 +357,6 @@ struct sock {
>  #ifdef CONFIG_RPS
> __u32   sk_rxhash;
>  #endif
> -   u16 sk_incoming_cpu;
> -   /* 16bit hole
> -* Warned : sk_incoming_cpu can be set from softirq,
> -* Do not use this hole without fully understanding possible issues.
> -*/
>
> __u32   sk_txhash;
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 7dd1263e4c24..1071f9380250 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -988,6 +988,10 @@ set_rcvbuf:
>  sk->sk_max_pacing_rate);
> break;
>
> +   case SO_INCOMING_CPU:
> +   sk->sk_incoming_cpu = val;
> +   break;
> +
> default:
> ret = -ENOPROTOOPT;
> break;
> @@ -2353,6 +2357,7 @@ void sock_init_data(struct socket *sock, struct sock 
> *sk)
>
> sk->sk_max_pacing_rate = ~0U;
> sk->sk_pacing_rate = ~0U;
> +   sk->sk_incoming_cpu = -1;
> /*
>  * Before updating sk_refcnt, we must commit prior changes to memory
>  * (Documentation/RCU/rculist_nulls.txt for details)
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index bed8886a4b6c..eabcfbc13afb 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -185,6 +185,11 @@ static inline int compute_score(struct sock *sk, struct 
> net *net,
> return -1;
> score += 4;
> }
> +   if (sk->sk_incoming_cpu != -1) {
> +   if (sk->sk_incoming_cpu != raw_smp_processor_id())
> +   return -1;
> +   score++;
> +   }

If the incoming CPU is set for a connect

Re: [PATCH net-next] net: synack packets can be attached to request sockets

2015-10-08 Thread Eric Dumazet

On Thu, 2015-10-08 at 11:56 -0400, Paul Moore wrote:

> Acked-by: Paul Moore 

Thanks for reviewing Paul.



[PATCH v2] netfilter: fix bad checksum on IPv6 when NAT is performed

2015-10-08 Thread Maxime Bizon


With this setup:

* non IPv6 checksumming capable network hardware
* GRO off
* IPv6 SNAT

I get this when I receive a UDPv6 reply: ": hw csum failure"

Call trace:

* nf_ip6_checksum() calls __skb_checksum_complete()
* nf_nat_ipv6_csum_update() & nf_nat_ipv6_manip_pkt()
* __udp6_lib_rcv() => udp6_csum_init()
* __skb_checksum_validate_complete() "fastpath" fails because
  skb->csum is incorrect.
* udpv6_recvmsg() => skb_copy_and_csum_datagram_msg()

The last call computes a valid checksum despite CHECKSUM_COMPLETE and
triggers the warning.

When we perform NAT on IPv4, we also update the IPv4 checksum, so
there is no side effect on skb->csum (since the csum over a valid IPv4
header area was already zero).

But IPv6 doesn't have such a checksum, so when performing NAT we need to
update skb->csum.

Signed-off-by: Maxime Bizon 
---
 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c 
b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
index 70fbaed..e44af9c 100644
--- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
@@ -81,6 +81,8 @@ static bool nf_nat_ipv6_manip_pkt(struct sk_buff *skb,
  enum nf_nat_manip_type maniptype)
 {
struct ipv6hdr *ipv6h;
+   const __be32 *to;
+   __be32 *from;
__be16 frag_off;
int hdroff;
u8 nexthdr;
@@ -100,11 +102,24 @@ static bool nf_nat_ipv6_manip_pkt(struct sk_buff *skb,
target, maniptype))
return false;
 manip_addr:
-   if (maniptype == NF_NAT_MANIP_SRC)
-   ipv6h->saddr = target->src.u3.in6;
-   else
-   ipv6h->daddr = target->dst.u3.in6;
+   if (maniptype == NF_NAT_MANIP_SRC) {
+   from = ipv6h->saddr.s6_addr32;
+   to = target->src.u3.in6.s6_addr32;
+   } else {
+   from = ipv6h->daddr.s6_addr32;
+   to = target->dst.u3.in6.s6_addr32;
+   }
+
+   if (skb->ip_summed == CHECKSUM_COMPLETE) {
+   __be32 diff[] = {
+   ~from[0], ~from[1], ~from[2], ~from[3],
+   to[0], to[1], to[2], to[3],
+   };
+
+   skb->csum = ~csum_partial(diff, sizeof(diff), ~skb->csum);
+   }
 
+   memcpy(from, to, sizeof (struct in6_addr));
return true;
 }
 
-- 
1.9.1


-- 
Maxime
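
The arithmetic behind the patch above can be illustrated outside the kernel.
What follows is only a userspace sketch of the general ones'-complement
identity (replace words in summed data by adding the complement of the old
words and then the new words); it is not the kernel's csum_partial() or
inet_proto_csum_replace16(), and the helper names fold16, oc_sum and
oc_replace are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones'-complement sum over n 16-bit words (hypothetical helper). */
static uint16_t oc_sum(const uint16_t *words, size_t n)
{
    uint32_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += words[i];
    return fold16(acc);
}

/*
 * Incremental update (hypothetical helper): given the sum over data that
 * contained from[0..n), return the sum the data would have after those
 * words are replaced by to[0..n), without rescanning the data.
 * Adding ~from[i] subtracts the old word in ones'-complement arithmetic.
 */
static uint16_t oc_replace(uint16_t sum, const uint16_t *from,
                           const uint16_t *to, size_t n)
{
    uint32_t acc = sum;

    for (size_t i = 0; i < n; i++) {
        acc += (uint16_t)~from[i];
        acc += to[i];
    }
    return fold16(acc);
}
```

Summing a buffer, calling oc_replace() for the words about to change, and
then recomputing oc_sum() over the modified buffer should give the same
value, which is why a rewritten address does not require a full rescan of
the packet.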



Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-08 Thread Andrew Morton

On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes  
wrote:

> On Wed, 19 Aug 2015, Patil, Kiran wrote:
> 
> > Acked-by: Kiran Patil 
> 
> Where's the call to preempt_disable() to prevent kernels with preemption 
> from making numa_node_id() invalid during this iteration?

David asked this question twice, received no answer and now the patch
is in the maintainer tree, destined for mainline.

If I was asked this question I would respond

  The use of numa_mem_id() is racy and best-effort.  If the unlikely
  race occurs, the memory allocation will occur on the wrong node, the
  overall result being very slightly suboptimal performance.  The
  existing use of numa_node_id() suffers from the same issue.

But I'm not the person proposing the patch.  Please don't just ignore
reviewer comments!


Re: [PATCH] netfilter: fix bad checksum on IPv6 when NAT is performed

2015-10-08 Thread Maxime Bizon


On Tue, 2015-10-06 at 16:23 +0200, Maxime Bizon wrote:

> + if (maniptype == NF_NAT_MANIP_SRC) {
> + from = ipv6h->saddr.s6_addr32;
> + to = target->src.u3.in6.s6_addr32;
> + } else {
> + from = ipv6h->daddr.s6_addr32;
> + to = target->src.u3.in6.s6_addr32;
> + }

stupid copy & paste error here, will send v2

-- 
Maxime



Re: [PATCH net-next v4] net: ipv6: Make address flushing on ifdown optional

2015-10-08 Thread Hannes Frederic Sowa

Hi David,

David Ahern  writes:
> On 10/8/15 1:25 PM, Hannes Frederic Sowa wrote:
>>> diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
>>> index 1c8b6820b694..f190a14148ab 100644
>>> --- a/include/net/if_inet6.h
>>> +++ b/include/net/if_inet6.h
>>> @@ -72,6 +72,7 @@ struct inet6_ifaddr {
>>> int regen_count;
>>>
>>> bool    tokenized;
>>> +   bool    managed;
>>
>> IMHO the naming of the bool is a bit too vague. ;) Would you mind
>> renaming it to something like puuh... user_managed, non_autoconf,
>> manual_conf etc.?  'managed' seems so often used in the context of
>> temporary addresses, I first thought about that.
>>
>> enum { USER_SPACE, KERNEL_AUTOCONF } managed_by;
>
> I have no preference on naming; unless other preferences are stated I'll 
> do v5 with it renamed to 'user_managed'.

I think this is more appropriate. Thanks!

>>> @@ -2689,6 +2692,9 @@ static int inet6_addr_add(struct net *net, int 
>>> ifindex,
>>> valid_lft, prefered_lft);
>>>
>>> if (!IS_ERR(ifp)) {
>>> +   if (!expires)
>>> +   ifp->managed = true;
>>> +
>>
>> This assumes that user space managed addresses don't time out. This is
>> in fact not true. I am not sure if it matters a lot, as most addresses
>> added from user space with a timeout most probably will be added because
>> of autoconf, but they are not managed by kernel autoconf. Not sure if we
>> want to make this more explicit, certainly it would avoid surprises.
>
> Not exactly. I'm taking the easy way out and saying only addresses with 
> no expiration time fall into the 'user managed' category and retained on 
> an ifdown. Trying to accommodate lifetimes is a PITA. I mentioned that 
> in the documentation:
>"static global addresses with no expiration time are not flushed"

Hmm, I thought a call to addrconf_verify on up would be sufficient but
haven't looked into that too closely.

Anyway, this logic actually only makes sense with addresses which don't
expire.

Thanks,
Hannes

Re: [PATCH net-next v4] net: ipv6: Make address flushing on ifdown optional

2015-10-08 Thread David Ahern


Hi Hannes:

On 10/8/15 1:25 PM, Hannes Frederic Sowa wrote:

>> diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
>> index 1c8b6820b694..f190a14148ab 100644
>> --- a/include/net/if_inet6.h
>> +++ b/include/net/if_inet6.h
>> @@ -72,6 +72,7 @@ struct inet6_ifaddr {
>>     int regen_count;
>>
>>     bool    tokenized;
>> +   bool    managed;
>
> IMHO the naming of the bool is a bit too vague. ;) Would you mind
> renaming it to something like puuh... user_managed, non_autoconf,
> manual_conf etc.?  'managed' seems so often used in the context of
> temporary addresses, I first thought about that.
>
> enum { USER_SPACE, KERNEL_AUTOCONF } managed_by;

I have no preference on naming; unless other preferences are stated I'll
do v5 with it renamed to 'user_managed'.

>> @@ -2689,6 +2692,9 @@ static int inet6_addr_add(struct net *net, int ifindex,
>>     valid_lft, prefered_lft);
>>
>>     if (!IS_ERR(ifp)) {
>> +       if (!expires)
>> +           ifp->managed = true;
>> +
>
> This assumes that user space managed addresses don't time out. This is
> in fact not true. I am not sure if it matters a lot, as most addresses
> added from user space with a timeout most probably will be added because
> of autoconf, but they are not managed by kernel autoconf. Not sure if we
> want to make this more explicit, certainly it would avoid surprises.

Not exactly. I'm taking the easy way out and saying only addresses with
no expiration time fall into the 'user managed' category and retained on
an ifdown. Trying to accommodate lifetimes is a PITA. I mentioned that
in the documentation:

  "static global addresses with no expiration time are not flushed"

Thanks for the review,
David
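
The retention rule discussed in this message can be sketched as a small
userspace model. This is only an illustration of the rule as stated in the
thread (an address added with no expiration time is marked user managed, and
such addresses survive an ifdown when flush_addr_on_down is disabled). The
struct and function names are hypothetical and are not the kernel's
inet6_ifaddr code.

```c
#include <stdbool.h>

/* Hypothetical model of one configured address; not struct inet6_ifaddr. */
struct addr_model {
    unsigned long expires;  /* 0 means "no expiration time" */
    bool user_managed;      /* decided once, when the address is added */
};

/* Mirror of the "if (!expires) managed = true" step in the quoted patch. */
static void addr_model_add(struct addr_model *a, unsigned long expires)
{
    a->expires = expires;
    a->user_managed = (expires == 0);
}

/*
 * Decide whether the address is kept when the interface goes down.
 * With flushing enabled (the default described in the patch) nothing
 * is kept; with it disabled only user-managed addresses are kept.
 */
static bool addr_model_keep_on_down(const struct addr_model *a,
                                    bool flush_addr_on_down)
{
    if (flush_addr_on_down)
        return false;
    return a->user_managed;
}
```

Under this model an address added with a lifetime is flushed regardless of
the sysctl, which is the case Hannes raises in his review.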

Re: [PATCH net-next v4] net: ipv6: Make address flushing on ifdown optional

2015-10-08 Thread Hannes Frederic Sowa

Hi David,

David Ahern  writes:

> Currently, all ipv6 addresses are flushed when the interface is configured
> down, including global, static addresses:
>
> $ ip -6 addr add dev eth1 2000:11:1:1::1/64
> $ ip addr show dev eth1
> 3: eth1:  mtu 1500 qdisc noop state DOWN group 
> default qlen 1000
> link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
> inet6 2000:11:1:1::1/64 scope global tentative
>valid_lft forever preferred_lft forever
> $ ip link set dev eth1 up
> $ ip link set dev eth1 down
> $ ip addr show dev eth1
> 3: eth1:  mtu 1500 qdisc pfifo_fast state DOWN group 
> default qlen 1000
> link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
>
> Add a new sysctl to make this behavior optional. The new setting defaults to
> flush all addresses to maintain backwards compatibility. When the setting is
> reset global addresses with no expire times are not flushed:
>
> $ echo 0 > /proc/sys/net/ipv6/conf/eth1/flush_addr_on_down
> $ ip -6 addr add dev eth1 2000:11:1:1::1/64
> $ ip addr show dev eth1
> 3: eth1:  mtu 1500 qdisc pfifo_fast state DOWN group 
> default qlen 1000
> link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
> inet6 2000:11:1:1::1/64 scope global tentative
>valid_lft forever preferred_lft forever
> $ ip link set dev eth1 up
> $ ip link set dev eth1 down
> $ ip addr show dev eth1
> 3: eth1:  mtu 1500 qdisc pfifo_fast state DOWN group 
> default qlen 1000
> link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
> inet6 2000:11:1:1::1/64 scope global
>valid_lft forever preferred_lft forever
> inet6 fe80::4:11ff:fe22:3301/64 scope link
>valid_lft forever preferred_lft forever
>
> Signed-off-by: David Ahern 
> ---
> It has been 8 months since the last version:
>http://lists.openwall.net/netdev/2015/02/12/33
>
> but wanted to revive it. This current version addresses the last round of
> comments and verifies all routes are deleted and re-added correctly
>
> Nicolas: I ran 'ip monitor' on a link down and link up cycle and you can
>  see the neighbor and route deletes on a down and routes added on
>  an up.
>
> v4:
> - rebased to top of tree
>
> - updated to clear all routes on admin down and re-added on admin up
>
> - verified the route tables (main and local) on a link down have *no*
>   remnants of the configured, global address. On a link up all routes
>   are restored -- multicast, linklocal, local routes and connected.
>
> v3:
> - fix local variable ordering and comment style per Dave's comment
> - consistency in DEVCONF naming per Brian Haley's comment
> - added entry to Documentation/networking/ip-sysctl.txt
>
> v2:
> - only keep static addresses as suggested by Hannes
> - added new managed flag to track configured addresses
> - on ifdown do not remove from configured address from inet6_addr_lst
> - on ifdown reset the TENTATIVE flag and set state to DAD so that DAD is
>   redone when link is brought up again
>
>  Documentation/networking/ip-sysctl.txt |  6 +++
>  include/linux/ipv6.h   |  1 +
>  include/net/if_inet6.h |  1 +
>  include/uapi/linux/ipv6.h  |  1 +
>  net/ipv6/addrconf.c| 91 
> +-
>  5 files changed, 87 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt 
> b/Documentation/networking/ip-sysctl.txt
> index ebe94f2cab98..51c60f58f7ec 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1432,6 +1432,12 @@ dad_transmits - INTEGER
>   The amount of Duplicate Address Detection probes to send.
>   Default: 1
>  
> +flush_addr_on_down - BOOLEAN
> + Flush all IPv6 addresses on an interface down event. If disabled
> + static global addresses with no expiration time are not flushed.
> +
> + Default: enabled
> +
>  forwarding - INTEGER
>   Configure interface-specific Host/Router behaviour.
>  
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 0ef2a97ccdb5..112a18940ab2 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -60,6 +60,7 @@ struct ipv6_devconf {
>   struct in6_addr secret;
>   } stable_secret;
>   __s32   use_oif_addrs_only;
> + __s32   flush_addr_on_down;
>   void    *sysctl;
>  };
>  
> diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
> index 1c8b6820b694..f190a14148ab 100644
> --- a/include/net/if_inet6.h
> +++ b/include/net/if_inet6.h
> @@ -72,6 +72,7 @@ struct inet6_ifaddr {
>   int regen_count;
>  
>   bool    tokenized;
> + bool    managed;

IMHO the naming of the bool is a bit too vague. ;) Would you mind
renaming it to something like puuh... user_managed, non_autoconf,
manual_conf etc.?  'managed' seems so often used in

Re: [net-next 07/18] i40e: make i40e_init_pf_fcoe to void

2015-10-08 Thread Sergei Shtylyov


Hello.

On 10/08/2015 01:47 AM, Jeff Kirsher wrote:


> From: Shannon Nelson
>
> i40e_init_pf_fcoe() didn't return anything except 0, it prints enough
> error info already, and no driver logic depends on the return value,
> so this can be void.
>
> Change-ID: Ie6afad849857d87a7064c42c3cce14c74c2f29d8
> Signed-off-by: Shannon Nelson
> Tested-by: Andrew Bowers
> Signed-off-by: Jeff Kirsher

[...]

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
> b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
> index 5ea75dd..2398d9b 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c

[...]

> @@ -326,7 +324,7 @@ int i40e_init_pf_fcoe(struct i40e_pf *pf)
>     wr32(hw, I40E_GLFCOE_RCTL, val);
>
>     dev_info(&pf->pdev->dev, "FCoE is supported.\n");
> -   return 0;
> +   return;

   Not needed at all.

>   }
>
>   /**

[...]

MBR, Sergei


Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-08 Thread Hannes Frederic Sowa

Hi Alexei,

On Thu, Oct 8, 2015, at 07:23, Alexei Starovoitov wrote:
> The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
> This toggle defaults to off (0), but can be set true (1).  Once true,
> bpf programs and maps cannot be accessed from unprivileged process,
> and the toggle cannot be set back to false.

This approach seems fine to me.

I am wondering if it makes sense to somehow allow eBPF access per
namespace. I currently have no idea how that could work, which
namespace type it should depend on, or whether to go with a prctl or
maybe even a cgroup. The rationale is that some namespaces, like
OpenStack router namespaces, could make use of advanced eBPF capabilities
in the kernel, while other namespaces, especially those hosting untrusted
third parties, shouldn't have access to those facilities.

That way, hosters would be able to, e.g., deploy a more efficient
performance monitoring container (which still should not need to run as
root) while the majority of users have no access to that. Or think
about routing instances in some namespaces, etc.

Thanks,
Hannes
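
The one-way behaviour of the toggle quoted at the top of this message (it can
go from 0 to 1 but never back) can be modelled in a few lines. This is a
userspace sketch only, not the kernel's sysctl handler. The names
one_way_toggle and toggle_write are hypothetical, and the error codes are
illustrative choices rather than what the kernel actually returns.

```c
#include <errno.h>

/* Hypothetical model of a sysctl that can be set but never cleared. */
struct one_way_toggle {
    int value;  /* 0 = unprivileged access allowed, 1 = disabled */
};

/*
 * Apply a write to the toggle.
 * Returns 0 on success, -EINVAL for values other than 0 or 1, and
 * -EPERM when trying to clear a toggle that is already set
 * (both error codes are assumptions made for this sketch).
 */
static int toggle_write(struct one_way_toggle *t, int new_value)
{
    if (new_value != 0 && new_value != 1)
        return -EINVAL;
    if (t->value == 1 && new_value == 0)
        return -EPERM;
    t->value = new_value;
    return 0;
}
```

A per-namespace variant, as suggested above, would need one such state per
namespace plus a policy for who may write it; the thread leaves that open.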

[PATCH net-next] tcp: fix RFS vs lockless listeners

2015-10-08 Thread Eric Dumazet

From: Eric Dumazet 

Before the recent TCP listener patches, we were updating the listener's
sk->sk_rxhash before cloning the master socket.

The children's sk_rxhash was therefore correct after the normal 3WHS.

But with lockless listeners, we no longer dirty/change the listener's
sk_rxhash, as it would be racy.

We need to correctly update the child's sk_rxhash, otherwise the first data
packet won't hit the correct cpu if RFS is used.

Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet 
Reported-by: Willem de Bruijn 
Cc: Tom Herbert 
---
 net/ipv4/syncookies.c|1 +
 net/ipv4/tcp_minisocks.c |1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8113c30ccf96..2dbb11331f6c 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -225,6 +225,7 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct 
sk_buff *skb,
child = icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst);
if (child) {
atomic_set(&req->rsk_refcnt, 1);
+   sock_rps_save_rxhash(child, skb);
inet_csk_reqsk_queue_add(sk, req, child);
} else {
reqsk_free(req);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9adf1e2c3170..1079e6ad77fe 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -768,6 +768,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff 
*skb,
if (!child)
goto listen_overflow;
 
+   sock_rps_save_rxhash(child, skb);
tcp_synack_rtt_meas(child, req);
inet_csk_reqsk_queue_drop(sk, req);
inet_csk_reqsk_queue_add(sk, req, child);


