date:20151009

Re: [PATCH net-next] ipv6 route: use err pointers instead of returning pointer by reference

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 10:26 AM, Roopa Prabhu  wrote:
> From: Roopa Prabhu 
>
> This patch makes ip6_route_info_create return err pointer instead of
> returning the rt pointer by reference as suggested  by Dave
>
> Signed-off-by: Roopa Prabhu 
> ---
> Dave, sorry abt the delay on this one. net-next was closed when i got to it
> and its been in my queue since then.
>
>  net/ipv6/route.c | 30 --
>  1 file changed, 16 insertions(+), 14 deletions(-)



>  int ip6_route_add(struct fib6_config *cfg)
> @@ -1980,9 +1976,12 @@ int ip6_route_add(struct fib6_config *cfg)
> struct rt6_info *rt = NULL;

nit: don't need to init rt since it's now set unconditionally.

> int err;
>
> -   err = ip6_route_info_create(cfg, &rt);
> -   if (err)
> +   rt = ip6_route_info_create(cfg);
> +   if (IS_ERR(rt)) {
> +   err = PTR_ERR(rt);
> +   rt = NULL;
> goto out;
> +   }
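
For readers less familiar with the idiom, here is a minimal userspace sketch of the error-pointer pattern the patch moves to. Everything prefixed toy_ is a hypothetical stand-in written for illustration (the kernel's real helpers are ERR_PTR(), PTR_ERR() and IS_ERR()); it is not the code from net/ipv6/route.c. It also shows why the initializer in the nit above is unnecessary once the pointer is assigned unconditionally.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy userspace stand-ins for the kernel's error-pointer helpers. */
#define TOY_MAX_ERRNO 4095

static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int toy_is_err(const void *ptr)
{
	/* The top TOY_MAX_ERRNO addresses are treated as encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_route {
	int metric;
};

/* Returns a valid route or an encoded negative errno, never NULL. */
static struct toy_route *toy_route_info_create(int metric)
{
	struct toy_route *rt;

	if (metric < 0)
		return toy_err_ptr(-EINVAL);

	rt = malloc(sizeof(*rt));
	if (!rt)
		return toy_err_ptr(-ENOMEM);

	rt->metric = metric;
	return rt;
}

static int toy_route_add(int metric)
{
	struct toy_route *rt;	/* no initializer: assigned unconditionally below */

	rt = toy_route_info_create(metric);
	if (toy_is_err(rt))
		return (int)toy_ptr_err(rt);

	free(rt);		/* a real caller would insert the route instead */
	return 0;
}
```

With this shape the callee needs no output parameter, and the caller decodes the errno only on the failure path.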
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-09 Thread Alexei Starovoitov


On 10/9/15 9:38 PM, Eric Dumazet wrote:
> On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote:
>
>> since this bug wasn't fixed at once in all places, it means
>> that it is hard to review _all_ needed call-sites.
>> There are 7 places that call skb_sender_cpu_clear() in net-next.
>> Plus 2 more in net.
>> How many such paths from rx to tx left?
>> On the first glance ovs is missing one and who knows what else.
>
> Alexei, what's happening ?
>
> The original patch is 6 months old. If this issue was so urgent, how
> comes it took so long to catch the remaining bugs ?

no urgency at all. bpf side is clean, so I'm not worried :)

> Just add skb_sender_cpu_clear() where needed, thanks.
>
> Using union is hard, but there is a price to performance.
>
> skb size is absolutely critical and deserves some headaches.

yep. as I said it shouldn't be increased and proposed in-band sign bit.

Anyway, since you and Daniel are ok with adding skb_sender_cpu_clear()
in other places, I rest my case.


Re: [PATCH net-next v3] bridge: allow adding of fdb entries pointing to the bridge device

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 10:38 AM, Roopa Prabhu  wrote:
> From: Roopa Prabhu 
>
> This patch enables adding of fdb entries pointing to the bridge device.
> This can be used to propagate mac address of vlan interfaces
> configured on top of the vlan filtering bridge.
>
> Before:
> $bridge fdb add 44:38:39:00:27:9f dev bridge
> RTNETLINK answers: Invalid argument
>
> After:
> $bridge fdb add 44:38:39:00:27:9f dev bridge
>
> Signed-off-by: Roopa Prabhu 
> Reviewed-by: Nikolay Aleksandrov 
> ---
> v1 - v2 : fix kbuild warnings
> v2 - v3 : address review comments from Nikolay (use of br_vlan_should_use)
>
>  net/bridge/br_fdb.c  | 122 
> ++-
>  net/bridge/br_vlan.c |   1 +
>  2 files changed, 93 insertions(+), 30 deletions(-)
>
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index 7f7d551..f43ce05 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -608,13 +608,14 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
> }
>  }
>
> -static int fdb_to_nud(const struct net_bridge_fdb_entry *fdb)
> +static int fdb_to_nud(const struct net_bridge *br,
> + const struct net_bridge_fdb_entry *fdb)
>  {
> if (fdb->is_local)
> return NUD_PERMANENT;
> else if (fdb->is_static)
> return NUD_NOARP;
> -   else if (has_expired(fdb->dst->br, fdb))
> +   else if (has_expired(br, fdb))
> return NUD_STALE;
> else
> return NUD_REACHABLE;
> @@ -640,7 +641,7 @@ static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br,
> ndm->ndm_flags   = fdb->added_by_external_learn ? NTF_EXT_LEARNED : 0;
> ndm->ndm_type= 0;
> ndm->ndm_ifindex = fdb->dst ? fdb->dst->dev->ifindex : br->dev->ifindex;
> -   ndm->ndm_state   = fdb_to_nud(fdb);
> +   ndm->ndm_state   = fdb_to_nud(br, fdb);
>
> if (nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->addr))
> goto nla_put_failure;
> @@ -785,7 +786,7 @@ static int fdb_add_entry(struct net_bridge_port *source, const __u8 *addr,
> }
> }
>
> -   if (fdb_to_nud(fdb) != state) {
> +   if (fdb_to_nud(br, fdb) != state) {


Hi Roopa,

Are the above changes to fdb_to_nud() related to the patch subject?
I was trying to figure out this part of the patch...seems unrelated.
Is fdb->dst->br now not valid in some cases?

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-09 Thread Eric Dumazet

On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote:

> since this bug wasn't fixed at once in all places, it means
> that it is hard to review _all_ needed call-sites.
> There are 7 places that call skb_sender_cpu_clear() in net-next.
> Plus 2 more in net.
> How many such paths from rx to tx left?
> On the first glance ovs is missing one and who knows what else.

Alexei, what's happening ?

The original patch is 6 months old. If this issue was so urgent, how
comes it took so long to catch the remaining bugs ?

Just add skb_sender_cpu_clear() where needed, thanks.

Using union is hard, but there is a price to performance.

skb size is absolutely critical and deserves some headaches.


Re: switchdev and VLAN ranges

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 4:30 PM, Vivien Didelot
 wrote:
> Hi All,
>
> I understand that specifying a VLAN range on the command line is nice
> for the user, and it makes no big deal for software implementation.

[Adding Roopa, since she did the original vlan range support in the
kernel/iproute2]

> However, AFAICT a VLAN range does not make sense at all for hardware
> such as Ethernet switch chips. Am I wrong?
>
> I would suggest to make switchdev directly answer to a bridge request
> that the operation is not supported when the user asks for a VLAN range.
>
> That way, we can simply use a single "vid" member in struct
> switchdev_obj_port_vlan instead of "vid_begin" and "vid_end" and thus
> avoid making drivers heavier with iteration loops on such range.
>
> I have two concerns in mind:
>
> a) if we imagine that drivers like Rocker allocate memory in the prepare
> phase for each VID, preparing a range like 100-4000 would definitely not
> be recommended.

This call should be in process context so it doesn't seem too terrible
for the driver to take its time to reserve/allocate resources in
prepare phase, even for a vlan range.  I think I'm missing your point.

> b) imagine that you have two Linux bridges on a switch, one using the
> hardware VLAN 100. If you request the VLAN range 99-101 for the other
> bridge members, it is not possible for the driver to say "I can
> accelerate VLAN 99 and 101, but not 100". It must return OPNOTSUPP for
> the whole range.

Well, it probably should return -ERANGE to indicate the range can't be
added, but that's an aside.

The reason why vlan ranges need to work down to the switchdev driver
is, from the user's perspective, it's an all-or-nothing request from
the user to add the vlan range to the device.  So we need to ask the
driver in the prepare phase, "can you support this range,
completely?", and if yes, then commit it as a whole.  The netlink
response back to the user isn't equipped to describe what subset of
the range was added, and what subset was not.
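
To make the all-or-nothing point concrete, here is a minimal sketch of a prepare/commit pair over a VLAN range. The toy_ types and functions are hypothetical and only model the idea; they are not the switchdev API, and the single per-switch VID table is an assumption borrowed from the "one vlan set for the whole switch" description.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_N_VID 4096

/* Hypothetical hard switch with a single VLAN set shared by all bridges. */
struct toy_switch {
	bool vid_in_use[TOY_VLAN_N_VID];
};

/* Prepare phase: answer "can you support this range, completely?" */
static int toy_vlan_range_prepare(const struct toy_switch *sw,
				  uint16_t vid_begin, uint16_t vid_end)
{
	unsigned int vid;

	if (vid_begin > vid_end || vid_end >= TOY_VLAN_N_VID)
		return -EINVAL;

	for (vid = vid_begin; vid <= vid_end; vid++)
		if (sw->vid_in_use[vid])
			return -ERANGE;	/* one conflict rejects the whole range */

	return 0;
}

/* Commit phase: must not fail, since prepare already validated the range. */
static void toy_vlan_range_commit(struct toy_switch *sw,
				  uint16_t vid_begin, uint16_t vid_end)
{
	unsigned int vid;

	for (vid = vid_begin; vid <= vid_end; vid++)
		sw->vid_in_use[vid] = true;
}

static int toy_vlan_range_add(struct toy_switch *sw,
			      uint16_t vid_begin, uint16_t vid_end)
{
	int err;

	err = toy_vlan_range_prepare(sw, vid_begin, vid_end);
	if (err)
		return err;	/* nothing was changed */

	toy_vlan_range_commit(sw, vid_begin, vid_end);
	return 0;
}
```

In this model, requesting 99-101 while VID 100 is already taken fails as a unit and leaves 99 and 101 untouched, so the single error code returned to the user describes the whole outcome.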

> That's why I think that avoiding VLAN range at the switchdev level would
> be a good idea.

As a general rule with switchdev, we've tried to keep the user's
experience the same when using {Linux} as a soft switch/router vs.
using {Linux + offload device} as a hard switch/router.  So if native
Linux supports some operation, for example vlan ranges, then we should
try to extend that to the offload model.  In other words, we don't
want to re-train the user when moving from soft switch to hard switch!
 But there are physical limitations when dealing with an offload
device

Anyway, with your vlan range example, we've got a case where each soft
bridge has an independent vlan set, and the vlan sets between soft
bridges can overlap.  For the (typical) hard switch, there is one vlan
set for the whole switch, and trying to overlay the soft bridges'
(overlapping) vlan sets on the hard switch fails.  That failure is
reported to the user.  We tried, but due to offload device
limitations, we can't support that operation.  Of course, if the vlan
sets didn't overlap, then we don't have a problem.

This will not be the only case where something we can do on a soft
Linux switch/router can't be offloaded to some physical offload
device.  But I think the philosophy has been to try offload what we
can, up to the point of failure.  In some cases, we can mask that
failure from the user by falling back to soft-switch only, but in
other cases the failure will pop up right in the user's face, like in
your example.

One idea to help mitigate the user's confusion would be to limit the
number of bridges overlaid on the device to just one.  Our drivers
know when ports are enslaved to bridges, so is there something we can
do there to fail the enslave on a second bridge?  Exercise left to the
reader.  If we had that, now vlan ranges work 1:1 with soft Linux
because both soft bridge and device have single vlan set.

Sorry for the long-winded response.

-scott

Re: [patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 3:44 PM, Vivien Didelot
 wrote:
> Hi Jiri,
>
> On Oct. Friday 09 (41) 01:54 PM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>>
>> Some drivers need to implement both switchdev vlan ops and
>> vid_add/kill ndos. For that to work in bridge code, we need to try
>> switchdev op first when adding/deleting vlan id.
>
> Just curious, when would a driver need to have both operations?

Ya, I was kind of curious of that myself. Is this because the driver
wants to support standalone VLANs using 802.1q module and vconfig, as
well as bridge vlans?  With the vlan support built into the bridge,
I've been working under the assumption that 802.1q module (and
vconfig) aren't needed, and vlans for a bridged and non-bridge port
can be managed using the "bridge" iproute2 cmd.

> I kinda have the same question regarding ndo_fdb_{add,del} and the
> bridge_{get,set}link equivalent, which is a bit confusing to me.

I had to look back at my commit 7f109539 to remind myself about the
vid_add/kill ndos and bridge_{get,set}link usage.   Maybe that
write-up helps?  I'm not following you on the ndo_fdb_add/del part of
your question.

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-09 Thread Alexei Starovoitov


On 10/9/15 10:33 AM, Daniel Borkmann wrote:
>> I was thinking may be we can use sign bit to distinguish between
>> napi_id and sender_cpu.
>> Like:
>>  if ((int)skb->sender_cpu >= 0)
>>      skb->sender_cpu = - (raw_smp_processor_id() + 1);
>> and inside get_xps_queue() use it only if it's negative.
>> Then we can remove skb_sender_cpu_clear() from everywhere.
>> Adding a check to napi_hash_add() to make sure that napi_id is not
>> negative is probably ok too.
>> Thoughts?
>
> I think this doesn't make it any more maintainable.
>
> skb_sender_cpu_clear(), one can at least git-grep to easily find
> out and review call-sites in the code. There are various members
> already used differently depending on the context.

since this bug wasn't fixed at once in all places, it means
that it is hard to review _all_ needed call-sites.
There are 7 places that call skb_sender_cpu_clear() in net-next.
Plus 2 more in net.
How many such paths from rx to tx left?
On the first glance ovs is missing one and who knows what else.


Re: [patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 4:54 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Some drivers need to implement both switchdev vlan ops and
> vid_add/kill ndos. For that to work in bridge code, we need to try
> switchdev op first when adding/deleting vlan id.
>
> Signed-off-by: Jiri Pirko 
> Signed-off-by: Ido Schimmel 

Acked-by: Scott Feldman 

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-09 Thread Alexei Starovoitov


On 10/9/15 9:40 AM, Devon H. O'Dell wrote:
> I like the idea, but it seems unnecessarily magical. What about using
> a bitfield? Then there's just an option bit that is either
> OPTION_NAPI_ID or OPTION_SENDER_CPU. Then the check to set sender_cpu
> in netdev_pick_tx becomes
>
>  if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) ..

It's less magical, but slower since two loads from skb and two cmp/jmp
are needed instead of one.
and this is critical path of xmit executed for every skb.
that's why I proposed a sign.
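
A minimal userspace sketch of the sign-bit proposal, assuming the two fields share one 32-bit slot as the thread describes. The struct and the toy_ helpers are hypothetical and only illustrate the encoding; they are not the kernel's sk_buff or get_xps_queue(), and the sketch assumes the usual two's-complement conversion when casting to int32_t.

```c
#include <stdint.h>

/* Hypothetical miniature of the shared slot: rx stores a NAPI id
 * (assumed non-negative), tx stores the sending CPU as -(cpu + 1).
 */
struct toy_skb {
	union {
		uint32_t napi_id;
		uint32_t sender_cpu;
	};
};

/* Tx path: record the CPU only if the slot does not already hold one. */
static void toy_record_sender_cpu(struct toy_skb *skb, int cpu)
{
	if ((int32_t)skb->sender_cpu >= 0)
		skb->sender_cpu = (uint32_t)-(cpu + 1);
}

/* Queue selection: one load and one sign test decide whether the slot
 * holds a CPU. Returns the CPU number, or -1 if the slot holds a NAPI id.
 */
static int toy_sender_cpu(const struct toy_skb *skb)
{
	int32_t v = (int32_t)skb->sender_cpu;

	return v < 0 ? -(v + 1) : -1;
}
```

The design point being argued is that a stale napi_id left over from the rx path can never be mistaken for a CPU, so no explicit clear is needed on forwarding paths, and the check stays a single comparison on the transmit fast path.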



Re: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down to switchdev

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 9:38 PM, Premkumar Jonnala  wrote:
>
>
>> -Original Message-
>> From: sfel...@gmail.com [mailto:sfel...@gmail.com]
>> Sent: Friday, October 09, 2015 7:53 AM
>> To: netdev@vger.kernel.org
>> Cc: da...@davemloft.net; j...@resnulli.us; siva.mannem@gmail.com;
>> Premkumar Jonnala; step...@networkplumber.org;
>> ro...@cumulusnetworks.com; and...@lunn.ch; f.faine...@gmail.com;
>> vivien.dide...@savoirfairelinux.com
>> Subject: [PATCH net-next v3 3/4] bridge: push bridge setting ageing_time down
>> to switchdev
>>
>> From: Scott Feldman 
>>
>> Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
>> support setting ageing_time (or setting bridge attrs in general).
>>
>> If push fails, don't update ageing_time in bridge and return err to user.
>>
>> If push succeeds, update ageing_time in bridge and run gc_timer now to
>> recalibrate when to run gc_timer next, based on new ageing_time.
>>
>> Signed-off-by: Scott Feldman 
>> Signed-off-by: Jiri Pirko 



>> +int br_set_ageing_time(struct net_bridge *br, u32 ageing_time)
>> +{
>> + struct switchdev_attr attr = {
>> + .id = SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
>> + .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
>> + .u.ageing_time = ageing_time,
>> + };
>> + unsigned long t = clock_t_to_jiffies(ageing_time);
>> + int err;
>> +
>> + if (t < BR_MIN_AGEING_TIME || t > BR_MAX_AGEING_TIME)
>> + return -ERANGE;
>> +
>> + err = switchdev_port_attr_set(br->dev, &attr);
>
> A thought - given that the ageing time is not a per-bridge-port attr, why
> are we using a "port based api" to pass the attribute down?  Maybe I'm
> missing something here?

I think Florian raised the same point earlier.  Sigh, I think this
should be addressed. v4 coming soon...thanks guys for keeping the
standard high.

Re: [patch net-next v2 00/13] rocker: add support for multiple worlds

2015-10-09 Thread Scott Feldman

On Fri, Oct 9, 2015 at 7:36 AM, Jiri Pirko  wrote:
> Wed, Oct 07, 2015 at 07:39:56PM CEST, j...@resnulli.us wrote:
>>Wed, Oct 07, 2015 at 06:53:22PM CEST, sfel...@gmail.com wrote:
>>>On Tue, Oct 6, 2015 at 11:03 PM, Jiri Pirko  wrote:
>>>> Tue, Oct 06, 2015 at 07:14:39PM CEST, sfel...@gmail.com wrote:
>>>>>On Tue, Oct 6, 2015 at 12:30 AM, Jiri Pirko  wrote:
>>>>>> Tue, Oct 06, 2015 at 05:56:12AM CEST, sfel...@gmail.com wrote:
>>>>>>>On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
>>>>>>>> From: Jiri Pirko 
>>>>>>>>
>>>>>>>> This patchset allows new rocker worlds to be easily added in future
>>>>>>>> (like eBPF based one I have been working on). The main part of the
>>>>>>>> patchset is the OF-DPA carve-out. It results in OF-DPA specific file.
>>>>>>>> Clean cut.
>>>>>>>>
>>>>>>>> v1->v2:
>>>>>>>>  - rtnl rocker mode change userspace expose patch was removed
>>>>>>>>
>>>>>>>> Jiri Pirko (13):
>>>>>>>>   rocker: remove unused rocker_port param from alloc funcs and shorten
>>>>>>>>     their names
>>>>>>>>   rocker: rename rocker.h to rocker_hw.h
>>>>>>>>   rocker: rename rocker.c to rocker_main.c
>>>>>>>>   rocker: push tlv processing into separate files
>>>>>>>>   rocker: implement set settings mode command
>>>>>>>>   rocker: introduce worlds infrastructure
>>>>>>>>   rocker: introduce OF-DPA world skeleton
>>>>>>>>   rocker: set default world on port probe and clean world on remove
>>>>>>>>   rocker: pass "learning" value as a parameter to
>>>>>>>>     rocker_port_set_learning
>>>>>>>>   rocker: pre-allocate wait structures during cmd ring init
>>>>>>>>   rocker: remove trans parameter to rocker_cmd_exec function
>>>>>>>>   rocker: call rocker_cmd_exec function with "nowait" boolean instead of
>>>>>>>>     flags
>>>>>>>>   rocker: move OF-DPA stuff into separate file
>>>>>>>
>>>>>>>A couple of my tests are failing with this patchset.  A simple port
>>>>>>>test is failing and IPv4 routing test is failing.
>>>>>>>
>>>>>>>The port test is simple: just connect a port on DUT to a port on
>>>>>>>another system and assign an IP address to each port and verify IP
>>>>>>>connectivity.  I have this:
>>>>>>>
>>>>>>>   DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
>>>>>>>
>>>>>>>The IPv4 routing tests is a bit more complicated to setup.  I'm using
>>>>>>>OSPF, but I'm not seeing full routes formed in the topology, so I
>>>>>>>suspect OSPF hellos aren't getting thru.
>>>>>>>
>>>>>>>Please find/fix these issues and send v3.  I don't want any git
>>>>>>>bisect issues when running tests.  Thanks.
>>>>>>
>>>>>> I fixed that. Sending v3 in a sec. Thanks.
>>>>>
>>>>>Sorry, both tests are still broken.  Would you send me your tests
>>>>>scripts so I can see why your tests are passing?
>>>>
>>>> I'm trying some smoke tests including bridge setup and just ip-ip
>>>> setup by hand. Maybe if you send me your scripts, I can run it locally.
>>>
>>>My test scripts are already included in the qemu tree.
>>
>>Okay, will rework and use your scripts. Hope I will find some time
>>during this weekend.
>
> Scott, could you try to test with current net-next?
> I'm trying basic:
> DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
> and it does not work for me now. It worked previously when I tested with
> my patchset. This is getting odd.

I had just re-run the tests against net-next before submitting the
ageing_time patchset and everything passes.

Are you using a namespace or a VM for host1?  Either one should work.

This would be a bad test, as the kernel will loop the traffic and the
offload device will not see it:

DUT:sw1p1 (11.0.0.1/24) <>DUT:sw1p2(11.0.0.2/24)

Re: [patch net-next RFC 2/3] switchdev: allow caller to explicitly use deferred attr_set version

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 11:46 PM, Jiri Pirko  wrote:
> Fri, Oct 09, 2015 at 06:39:41AM CEST, sfel...@gmail.com wrote:
>>On Thu, Oct 8, 2015 at 1:26 AM, Jiri Pirko  wrote:
>>> Thu, Oct 08, 2015 at 08:03:35AM CEST, sfel...@gmail.com wrote:
>>>>On Wed, Oct 7, 2015 at 10:39 PM, Jiri Pirko  wrote:
>>>>> Thu, Oct 08, 2015 at 06:27:07AM CEST, sfel...@gmail.com wrote:
>>>>>>On Wed, Oct 7, 2015 at 11:30 AM, Jiri Pirko  wrote:
>>>>>>> From: Jiri Pirko 
>>>>>>>
>>>>>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>>>>>> or if he has to use deferred version of this function.
>>>>>>>
>>>>>>> This also allows drivers to sleep inside attr_set and report operation
>>>>>>> status back to switchdev core. Switchdev core then warns if status is
>>>>>>> not ok, instead of silent errors happening in drivers.
>>>>>>>
>>>>>>> Signed-off-by: Jiri Pirko 
>>>>>>> ---
>>>>>>>  include/net/switchdev.h   |   2 +
>>>>>>>  net/bridge/br_stp.c   |   4 +-
>>>>>>>  net/switchdev/switchdev.c | 113 +-
>>>>>>>  3 files changed, 65 insertions(+), 54 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>>>>>> index 89266a3..320be44 100644
>>>>>>> --- a/include/net/switchdev.h
>>>>>>> +++ b/include/net/switchdev.h
>>>>>>> @@ -168,6 +168,8 @@ int switchdev_port_attr_get(struct net_device *dev,
>>>>>>> struct switchdev_attr *attr);
>>>>>>>  int switchdev_port_attr_set(struct net_device *dev,
>>>>>>> struct switchdev_attr *attr);
>>>>>>> +int switchdev_port_attr_set_deferred(struct net_device *dev,
>>>>>>> +struct switchdev_attr *attr);
>>>>>>
>>>>>>Rather than adding another op, use attr->flags and define:
>>>>>>
>>>>>>#define SWITCHDEV_F_DEFERRED  BIT(x)
>>>>>>
>>>>>>So we get:
>>>>>>
>>>>>>void br_set_state(struct net_bridge_port *p, unsigned int state)
>>>>>>{
>>>>>>	struct switchdev_attr attr = {
>>>>>>		.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>>>>>+		.flags = SWITCHDEV_F_DEFERRED,
>>>>>>		.u.stp_state = state,
>>>>>>	};
>>>>>>	int err;
>>>>>>
>>>>>>	p->state = state;
>>>>>>	err = switchdev_port_attr_set(p->dev, &attr);
>>>>>>	if (err && err != -EOPNOTSUPP)
>>>>>>		br_warn(p->br, "error setting offload STP state on port %u(%s)\n",
>>>>>>			(unsigned int) p->port_no,
>>>>>>			p->dev->name);
>>>>>>}
>>>>>>
>>>>>>(And add obj->flags to do the same).
>>>>>
>>>>> That's what I wanted to avoid. Also because the obj is const and for
>>>>> call from work, this flag would have to be removed.
>>>>
>>>>What did you want to avoid?
>>>
>>> Having this as a flag. I don't like it too much.
>>> But that is cosmetics. Other than that, does the patchset make sense?
>>> Do you see some possible issues?
>>
>>patch 1/3 makes sense, I tested it out and no issues.  (Looks like
>>there are other places to assert rtnl_lock, are you going to add
>>those?)
>
> Sure, can you pinpoint the places?

Isn't every place we use netdev_for_each_lower_dev, like you mentioned
in 1/3 patch?

>>patch 2/3: Rather than trying to guess the call context in the core,
>>make the caller call the right variant for its context.  That part is
>>good.  On the flag vs. no flags, the reasons why I want this as a flag
>>are:
>>
>>a) I want to keep the switchdev ops set to the core set: get/set attr
>>and add/del/dump objs.  I've pushed back on changing this before.  I
>>don't want ops explosion (like netdev_ops), and I'd like to avoid the
>>1000-line patch when the arg list in an op changes, and we need to
>>update N drivers.  The flags lets the caller modify the algo behavior,
>>while keeping the core call (and args) fixed.
>>
>>b) the caller can combine flags, where it makes sense.  For example,
>>maybe I'm in a locked context and I don't want to recurse the device
>>tree, so I would make the call with NO_RECURSE | DEFERRED.  If we
>>didn't use flags, then we need to supply ops for each variant on the
>>call, and then things explode.
>
> Fair enough. I'll process this in.

Actually, I realized later that my reply here was only half true.
Part b) to combine flags for various calling situation is good.   Part
a) is bogus because I confused adding a new op or adding a new wrapper
to call existing op.  You did the latter; but I was complaining about
the former.  Sorry about that.  Regardless, part b) I think justifies
using flags.
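
As a rough illustration of part b), the sketch below shows one entry point whose behavior is modified by combinable flag bits instead of one op per variant. The TOY_ names are hypothetical; in particular TOY_F_DEFERRED only mirrors the flag proposed in this thread and is not an existing switchdev definition.

```c
#include <stdbool.h>

#define TOY_BIT(n)		(1UL << (n))
#define TOY_F_NO_RECURSE	TOY_BIT(0)	/* do not walk lower devices */
#define TOY_F_DEFERRED		TOY_BIT(1)	/* run later from a work item */

/* Hypothetical attribute: the call signature stays fixed, flags vary. */
struct toy_attr {
	unsigned long flags;
	int value;
};

/* What the single entry point decided to do for a given attribute. */
struct toy_outcome {
	bool deferred;
	bool recursed;
};

static struct toy_outcome toy_attr_set(const struct toy_attr *attr)
{
	struct toy_outcome out;

	out.deferred = (attr->flags & TOY_F_DEFERRED) != 0;
	out.recursed = (attr->flags & TOY_F_NO_RECURSE) == 0;
	return out;
}
```

A caller in a locked context that also wants to skip recursion would pass TOY_F_NO_RECURSE | TOY_F_DEFERRED. With separate ops, each such combination would need its own function, which is the explosion described above.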

>
>>
>>patch 3/3 I haven't looked at yet...I'm stuck on 2/3.
>
> It is very similar to 2/3, only for obj_add/del.

Do we have examples of a deferred obj add or del?  Maybe we should
hold off adding that support until someone finds a use-case.  I'm kind
of hoping there isn't a use-case, but who knows?

Re: [patch net-next RFC 3/3] switchdev: introduce deferred variants of obj_add/del helpers

2015-10-09 Thread Scott Feldman

On Thu, Oct 8, 2015 at 6:25 AM, Jiri Pirko  wrote:
> Thu, Oct 08, 2015 at 03:21:44PM CEST, gerlitz...@gmail.com wrote:
>>On Thu, Oct 8, 2015 at 4:09 PM, Jiri Pirko  wrote:
>>> Thu, Oct 08, 2015 at 10:28:58AM CEST, j...@resnulli.us wrote:
Thu, Oct 08, 2015 at 08:45:58AM CEST, gerlitz...@gmail.com wrote:
>On Wed, Oct 7, 2015 at 9:30 PM, Jiri Pirko  wrote:
>>
>This introduced a regression to the 2-phase commit scheme, since the
>prepare commit can fail
>and that would go un-noticed toward the upper layer, agree?
>>
Well, no. This still does the transaction for all lower devices in one
go. No change in that.
>>
>>> Now I get it, yes you are right. But currently there is no code in
>>> kernel which would control retval of deferred attr_set or obj_add/del
>>
>>I am not sure to understand your reply. You are saying that when the deferred
>>procedures complete (e.g fail in the prepare phase) they can't actually let
>>the upper layer to realize that this change isn't possible? this is
>>exactly the bug.
>
> Correct. But check the code. Callers of current deferred variants do
> not care about the retval. Therefore this is not a regression.
>
> It makes sense in my opinion. If you are a caller and you explicitly say to
> defer the operation, you cannot expect retval.

Makes sense to me also, FWIW.

switchdev and VLAN ranges

2015-10-09 Thread Vivien Didelot

Hi All,

I understand that specifying a VLAN range on the command line is nice
for the user, and it makes no big deal for software implementation.

However, AFAICT a VLAN range does not make sense at all for hardware
such as Ethernet switch chips. Am I wrong?

I would suggest to make switchdev directly answer to a bridge request
that the operation is not supported when the user asks for a VLAN range.

That way, we can simply use a single "vid" member in struct
switchdev_obj_port_vlan instead of "vid_begin" and "vid_end" and thus
avoid making drivers heavier with iteration loops on such range.

I have two concerns in mind:

a) if we imagine that drivers like Rocker allocate memory in the prepare
phase for each VID, preparing a range like 100-4000 would definitely not
be recommended.

b) imagine that you have two Linux bridges on a switch, one using the
hardware VLAN 100. If you request the VLAN range 99-101 for the other
bridge members, it is not possible for the driver to say "I can
accelerate VLAN 99 and 101, but not 100". It must return OPNOTSUPP for
the whole range.

That's why I think that avoiding VLAN range at the switchdev level would
be a good idea.

Thanks,
-v

Re: [patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-09 Thread Vivien Didelot

Hi Jiri,

On Oct. Friday 09 (41) 01:54 PM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Some drivers need to implement both switchdev vlan ops and
> vid_add/kill ndos. For that to work in bridge code, we need to try
> switchdev op first when adding/deleting vlan id.

Just curious, when would a driver need to have both operations?

I kinda have the same question regarding ndo_fdb_{add,del} and the
bridge_{get,set}link equivalent, which is a bit confusing to me.

> 
> Signed-off-by: Jiri Pirko 
> Signed-off-by: Ido Schimmel 
> ---
>  net/bridge/br_vlan.c | 58 
> 
>  1 file changed, 22 insertions(+), 36 deletions(-)
> 
> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> index eae07ee..975deb9 100644
> --- a/net/bridge/br_vlan.c
> +++ b/net/bridge/br_vlan.c
> @@ -72,28 +72,20 @@ static void __vlan_add_flags(struct net_bridge_vlan *v, u16 flags)
>  static int __vlan_vid_add(struct net_device *dev, struct net_bridge *br,
> u16 vid, u16 flags)
>  {
> - const struct net_device_ops *ops = dev->netdev_ops;
> + struct switchdev_obj_port_vlan v = {
> + .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
> + .flags = flags,
> + .vid_begin = vid,
> + .vid_end = vid,
> + };
>   int err;
>  
> - /* If driver uses VLAN ndo ops, use 8021q to install vid
> -  * on device, otherwise try switchdev ops to install vid.
> + /* Try switchdev op first. In case it is not supported, fallback to
> +  * 8021q add.
>*/
> -
> - if (ops->ndo_vlan_rx_add_vid) {
> - err = vlan_vid_add(dev, br->vlan_proto, vid);
> - } else {
> - struct switchdev_obj_port_vlan v = {
> - .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
> - .flags = flags,
> - .vid_begin = vid,
> - .vid_end = vid,
> - };
> -
> - err = switchdev_port_obj_add(dev, &v.obj);
> - if (err == -EOPNOTSUPP)
> - err = 0;
> - }
> -
> + err = switchdev_port_obj_add(dev, &v.obj);
> + if (err == -EOPNOTSUPP)
> + return vlan_vid_add(dev, br->vlan_proto, vid);

err = vlan_vid_add(dev, br->vlan_proto, vid);

Just being picky: the above line would have been preferred to keep a
single return path, but this does not justify a v2 though.

>   return err;
>  }
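
For illustration, a small self-contained sketch of the "try the offload op first, fall back on -EOPNOTSUPP" flow, written with the single return path suggested above. The function-pointer type and the toy_ callbacks are hypothetical stand-ins for switchdev_port_obj_add() and vlan_vid_add(); this is not the bridge code itself.

```c
#include <errno.h>

/* Hypothetical stand-in for an operation that installs a VLAN id. */
typedef int (*toy_vid_op)(unsigned short vid);

static int toy_legacy_calls;	/* counts how often the fallback ran */

static int toy_offload_ok(unsigned short vid)
{
	(void)vid;
	return 0;
}

static int toy_offload_unsupported(unsigned short vid)
{
	(void)vid;
	return -EOPNOTSUPP;
}

static int toy_offload_busy(unsigned short vid)
{
	(void)vid;
	return -EBUSY;
}

static int toy_legacy_add(unsigned short vid)
{
	(void)vid;
	toy_legacy_calls++;
	return 0;
}

/* Try the offload op first; only "not supported" triggers the fallback. */
static int toy_vlan_vid_add(toy_vid_op offload_add, toy_vid_op legacy_add,
			    unsigned short vid)
{
	int err;

	err = offload_add(vid);
	if (err == -EOPNOTSUPP)
		err = legacy_add(vid);

	return err;	/* single exit point */
}
```

Note that a real offload failure such as -EBUSY is passed back unchanged rather than being masked by the fallback.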

Thanks,
-v

[PATCH net-next] tun: use sk_fullsock() before reading sk->sk_tsflags

2015-10-09 Thread Eric Dumazet

From: Eric Dumazet 

timewait or request sockets are small and do not contain sk->sk_tsflags

Without this fix, we might read garbage, and crash later in

__skb_complete_tx_timestamp()
 -> sock_queue_err_skb()

(These pseudo sockets do not have an error queue either)

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
---
 Note this bug also exists on net tree for timewait sockets but only
 in exceptional conditions (routing glitches and IP early demux)

 drivers/net/tun.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 976aa9704297..b1878faea397 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -858,7 +858,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 		goto drop;
 
-	if (skb->sk) {
+	if (skb->sk && sk_fullsock(skb->sk)) {
 		sock_tx_timestamp(skb->sk, &skb_shinfo(skb)->tx_flags);
 		sw_tx_timestamp(skb);
 	}
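
To illustrate why the `sk_fullsock()` check matters, here is a minimal userspace sketch, assuming a simplified model of the kernel layout: mini sockets (timewait, request) share only a common header with full sockets, so fields past that header must not be read unless the state says the object is a full socket. The struct and function names (`sock_common_model`, `full_sock_model`, `mini_sock_model`, `is_fullsock`, `read_tsflags`) and the state values are hypothetical; only the shape of the check mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state values for this model (not the kernel's numbering). */
enum sock_state { ST_ESTABLISHED = 1, ST_TIME_WAIT = 2, ST_NEW_SYN_RECV = 3 };

/* Header shared by every socket flavour. */
struct sock_common_model {
	enum sock_state state;
};

/* Full socket: common header first, then fields mini sockets lack. */
struct full_sock_model {
	struct sock_common_model common;
	unsigned int tsflags;
};

/* Mini socket (timewait/request): common header only, so it is smaller. */
struct mini_sock_model {
	struct sock_common_model common;
};

/* The cast in read_tsflags() relies on the header being the first member. */
static_assert(offsetof(struct full_sock_model, common) == 0,
	      "common header must come first");

/* True only when the object behind the header is a full socket. */
static int is_fullsock(const struct sock_common_model *sk)
{
	return sk->state != ST_TIME_WAIT && sk->state != ST_NEW_SYN_RECV;
}

/* Return tsflags for full sockets, 0 for NULL or mini sockets; the
 * cast is only performed after the state check has passed. */
static unsigned int read_tsflags(const struct sock_common_model *sk)
{
	if (sk && is_fullsock(sk))
		return ((const struct full_sock_model *)sk)->tsflags;
	return 0;
}
```

Without the state check, `read_tsflags()` on a mini socket would read past the end of the smaller object, which is the "read garbage" case the commit message describes.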



Re: [PATCH] i40evf: fix 32 bit build warnings

2015-10-09 Thread Jesse Brandeburg

On Wed, 7 Oct 2015 22:13:19 +0200
Arnd Bergmann  wrote:

> Jesse Brandeburg fixed a bug for 32-bit systems in the i40e driver
> in commit 9c70d7cebfec5 ("i40e: fix 32 bit build warnings"), but the
> same code still exists in the i40evf driver and causes compilation
> warnings in ARM and x86 allmodconfig:
> 
> drivers/net/ethernet/intel/i40evf/i40e_common.c:445:68: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
> drivers/net/ethernet/intel/i40evf/i40e_common.c:446:71: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
> 
> This applies the same fix by removing the broken code.
> 
> Signed-off-by: Arnd Bergmann 

Thanks for catching that, my mistake.
Acked-by: Jesse Brandeburg 

> It would probably be a good idea to merge some of the duplicate code into
> a library module that gets used by both drivers to avoid having to fix bugs
> twice in the future.

The library is a nice idea, but while much of the code is the same,
many things about interaction with it while running in the VF context
are different than when called in the PF context.

We will look closely at what we can commonize and at least move to
header files.

Thanks!

Re: [ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node

2015-10-09 Thread Jesse Gross

On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme  wrote:
>
> On Oct 8, 2015, at 4:03 PM, Jesse Gross  wrote:
>
> On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme 
> wrote:
>
>
> On Oct 6, 2015, at 6:01 PM, Jesse Gross  wrote:
>
> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
>  wrote:
>
> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>
>
> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>
>
> When openvswitch tries to allocate memory from offline numa node 0:
> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
> 0)
> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
> This patch disables numa affinity in this case.
>
> Signed-off-by: Konstantin Khlebnikov 
>
>
>
> ...
>
> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index f2ea83ba4763..c7f74aab34b9 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>
> /* Initialize the default stat node. */
> stats = kmem_cache_alloc_node(flow_stats_cache,
> -  GFP_KERNEL | __GFP_ZERO, 0);
> +  GFP_KERNEL | __GFP_ZERO,
> +  node_online(0) ? 0 : NUMA_NO_NODE);
>
>
>
> Stupid question: can node 0 become offline between this check, and the
> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>
>
>
> Another question to ask would be is it possible for node 0 to be online, but
> be a memoryless node?
>
> I would say you are better off just making this call kmem_cache_alloc.  I
> don't see anything that indicates the memory has to come from node 0, so
> adding the extra overhead doesn't provide any value.
>
>
> I agree that this at least makes me wonder, though I actually have
> concerns in the opposite direction - I see assumptions about this
> being on node 0 in net/openvswitch/flow.c.
>
> Jarno, since you originally wrote this code, can you take a look to see
> if everything still makes sense?
>
>
> We keep the pre-allocated stats node at array index 0, which is initially
> used by all CPUs, but if CPUs from multiple numa nodes start updating the
> stats, we allocate additional stats nodes (up to one per numa node), and the
> CPUs on node 0 keep using the preallocated entry. If stats cannot be
> allocated from the CPUs' local node, then those CPUs keep using the entry at
> index 0. Currently the code in net/openvswitch/flow.c will try to allocate
> the local memory repeatedly, which may not be optimal when there is no
> memory at the local node.
>
> Allocating the memory for the index 0 from other than node 0, as discussed
> here, just means that the CPUs on node 0 will keep on using non-local memory
> for stats. In a scenario where there are CPUs on two nodes (0, 1), but only
> the node 1 has memory, a shared flow entry will still end up having separate
> memory allocated for both nodes, but both of the allocations would be at node 1.
> However, there is still a high likelihood that the memory allocations would
> not share a cache line, which should prevent the nodes from invalidating
> each other’s caches. Based on this I do not see a problem relaxing the
> memory allocation for the default stats node. If node 0 has memory, however,
> it would be better to allocate the memory from node 0.
>
>
> Thanks for going through all of that.
>
> It seems like the question that is being raised is whether it actually
> makes sense to try to get the initial memory on node 0, especially
> since it seems to introduce some corner cases? Is there any reason why
> the flow is more likely to hit node 0 than a randomly chosen one?
> (Assuming that this is a multinode system, otherwise it's kind of a
> moot point.) We could have a separate pointer to the default allocated
> memory, so it wouldn't conflict with memory that was intentionally
> allocated for node 0.
>
>
> It would still be preferable to know from which node the default stats node
> was allocated, and store it in the appropriate pointer in the array. We
> could then add a new “default stats node index” that would be used to locate
> the node in the array of pointers we already have. That way we would avoid
> extra allocation and processing of the default stats node.

I agree, that sounds reasonable to me. Will you make that change?

Besides eliminating corner cases, it might help performance in some
cases too by avoiding stressing memory bandwidth on node 0.
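
The "default stats node index" idea discussed above could look roughly like the following userspace sketch. It is only a model under stated assumptions: `MAX_NODES`, `struct flow_model`, `flow_init()`, `flow_stats_for()`, and `flow_destroy()` are hypothetical names, and `calloc()` stands in for the kernel's NUMA-aware allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 4	/* hypothetical node count for this model */

struct stats_model {
	unsigned long packets;
};

struct flow_model {
	int default_node;			/* node owning the preallocated entry */
	struct stats_model *stats[MAX_NODES];	/* one optional entry per node */
};

/* Allocate the default entry and file it under the node it really came
 * from, instead of assuming node 0. Returns 0 on success, -1 on failure. */
static int flow_init(struct flow_model *flow, int allocated_on_node)
{
	assert(allocated_on_node >= 0 && allocated_on_node < MAX_NODES);

	for (int i = 0; i < MAX_NODES; i++)
		flow->stats[i] = NULL;

	flow->stats[allocated_on_node] = calloc(1, sizeof(struct stats_model));
	if (!flow->stats[allocated_on_node])
		return -1;

	flow->default_node = allocated_on_node;
	return 0;
}

/* Use the node-local entry when one exists, else the default entry. */
static struct stats_model *flow_stats_for(struct flow_model *flow, int node)
{
	assert(node >= 0 && node < MAX_NODES);

	if (flow->stats[node])
		return flow->stats[node];
	return flow->stats[flow->default_node];
}

/* Release every allocated entry. */
static void flow_destroy(struct flow_model *flow)
{
	for (int i = 0; i < MAX_NODES; i++) {
		free(flow->stats[i]);
		flow->stats[i] = NULL;
	}
}
```

Because the default entry lives in the slot of the node it was allocated from, a later node-local allocation for that node is unnecessary, and no separate pointer is needed for the default.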

Re: [PATCH net-next] openvswitch: report features supported by the kernel datapath

2015-10-09 Thread Jesse Gross

On Fri, Oct 9, 2015 at 2:46 AM, Jiri Benc  wrote:
> On Fri, 9 Oct 2015 11:24:53 +0200, Thomas Graf wrote:
>> On 10/08/15 at 03:40pm, Jesse Gross wrote:
>> > I have similar concerns as were expressed in the other thread. The
>> > features listed here aren't OVS components and I don't think that it
>> > makes sense for OVS to try to cover everything that is related - the
>> > goal that we've been working towards is to have OVS be less monolithic
>> > and more integrated. So to the extent that it is necessary to have
>> > capabilities be exposed (and I would like to avoid this where
>> > possible), it should be in the individual component, not in OVS.
>
> Fair enough. Note that the IPv6 flag really belongs to ovs, though -
> it's about the existence of OVS_TUNNEL_KEY_ATTR_IPV6_SRC and
> OVS_TUNNEL_KEY_ATTR_IPV6_DST netlink attributes. For the lwtunnel flag
> (which is just another way to tell whether vxlan/geneve/etc. has
> COLLECT_METADATA support) I can agree that it does not belong to ovs.

We actually already have a mechanism for handling compatibility as new
keys are added without requiring capabilities bits - see
Documentation/networking/openvswitch.txt. We've used this quite a few
times, including the addition of baseline tunnel capabilities.

That doesn't really cover IPv6 capability for individual tunneling
protocols though (which could vary on a per-protocol basis).

Re: simplify configfs attributes V2

2015-10-09 Thread Felipe Balbi

Christoph Hellwig  writes:

> This series consolidates the code to implement configfs attributes
> by providing the ->show and ->store method in common code and using
> container_of in the methods to access the containing structure.
>
> This reduces source and binary size of configfs consumers a lot.
>
> Changes since V1:
>  - a couple fixes for unintended changes in the uvc driver
>  - moved a few CONFIG_ATTR() statements around
>  - fixed up the documentation and samples in the last patch
>  - added a little rather pointless blurb to the patch description for
>various patches

For reference, I'm fine if you guys take all patches through the FS
tree. Another option is waiting for the dependencies to be merged in v4.4
and merging the gadget changes in v4.5, whatever works.

Acked-by: Felipe Balbi 

-- 
balbi



Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-09 Thread Trond Myklebust

On Fri, Oct 9, 2015 at 5:18 PM, J. Bruce Fields  wrote:
>
> On Fri, Oct 09, 2015 at 06:29:44AM +, Kosuke Tatsukawa wrote:
> > Neil Brown wrote:
> > > Kosuke Tatsukawa  writes:
> > >
> > >> There are several places in net/sunrpc/svcsock.c which call
> > >> waitqueue_active() without calling a memory barrier.  Add a memory
> > >> barrier just as in wq_has_sleeper().
> > >>
> > >> I found this issue when I was looking through the linux source code
> > >> for places calling waitqueue_active() before wake_up*(), but without
> > >> preceding memory barriers, after sending a patch to fix a similar
> > >> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> > >> found here: https://lkml.org/lkml/2015/9/28/849).
> > >
> > > hi,
> > > this feels like the wrong approach to the problem.  It requires extra
> > > 'smp_mb's to be spread around which are hard to understand and easy to
> > > forget.
> > >
> > > A quick look seems to suggest that (nearly) every waitqueue_active()
> > > will need an smp_mb.  Could we just put the smp_mb() inside
> > > waitqueue_active()??
> > 
> >
> > There are around 200 occurrences of waitqueue_active() in the kernel
> > source, and most of the places which use it before wake_up are either
> > protected by some spin lock, or already have a memory barrier or some
> > kind of atomic operation before it.
> >
> > Simply adding smp_mb() to waitqueue_active() would incur extra cost in
> > many cases and won't be a good idea.
> >
> > Another way to solve this problem is to remove the waitqueue_active(),
> > making the code look like this;
> >   if (wq)
> >   wake_up_interruptible(wq);
> > This also fixes the problem because the spinlock in the wake_up*() acts
> > as a memory barrier and prevents the code from being reordered by the
> > CPU (and it also makes the resulting code much simpler).
>
> I might not care which we did, except I don't have the means to test
> this quickly, and I guess this is some of our most frequently called
> code.
>
> I suppose your patch is the most conservative approach, as the
> alternative is a spinlock/unlock in wake_up_interruptible, which I
> assume is necessarily more expensive than an smp_mb().
>
> As far as I can tell it's been this way since forever.  (Well, since a
> 2002 patch "NFSD: TCP: rationalise locking in RPC server routines" which
> removed some spinlocks from the data_ready routines.)
>
> I don't understand what the actual race is yet (which code exactly is
> missing the wakeup in this case?  nfsd threads seem to instead get
> woken up by the wake_up_process() in svc_xprt_do_enqueue().)
>

Those threads still use blocking calls for sendpage() and sendmsg(),
so presumably they may be affected.

Re: [PATCH 02/23] usb-gadget: use per-attribute show and store methods

2015-10-09 Thread Felipe Balbi


Hi,

Christoph Hellwig  writes:
> To simplify the configfs interface and remove boilerplate code that also
> causes binary bloat.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Andrzej Pietrasiewicz 

I suppose this depends on other fs/configfs changes ?

-- 
balbi



Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-09 Thread J. Bruce Fields

On Fri, Oct 09, 2015 at 06:29:44AM +, Kosuke Tatsukawa wrote:
> Neil Brown wrote:
> > Kosuke Tatsukawa  writes:
> > 
> >> There are several places in net/sunrpc/svcsock.c which call
> >> waitqueue_active() without calling a memory barrier.  Add a memory
> >> barrier just as in wq_has_sleeper().
> >>
> >> I found this issue when I was looking through the linux source code
> >> for places calling waitqueue_active() before wake_up*(), but without
> >> preceding memory barriers, after sending a patch to fix a similar
> >> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> >> found here: https://lkml.org/lkml/2015/9/28/849).
> > 
> > hi,
> > this feels like the wrong approach to the problem.  It requires extra
> > 'smp_mb's to be spread around which are hard to understand and easy to
> > forget.
> > 
> > A quick look seems to suggest that (nearly) every waitqueue_active()
> > will need an smp_mb.  Could we just put the smp_mb() inside
> > waitqueue_active()??
> 
> 
> There are around 200 occurrences of waitqueue_active() in the kernel
> source, and most of the places which use it before wake_up are either
> protected by some spin lock, or already have a memory barrier or some
> kind of atomic operation before it.
> 
> Simply adding smp_mb() to waitqueue_active() would incur extra cost in
> many cases and won't be a good idea.
> 
> Another way to solve this problem is to remove the waitqueue_active(),
> making the code look like this;
>   if (wq)
>   wake_up_interruptible(wq);
> This also fixes the problem because the spinlock in the wake_up*() acts
> as a memory barrier and prevents the code from being reordered by the
> CPU (and it also makes the resulting code much simpler).

I might not care which we did, except I don't have the means to test
this quickly, and I guess this is some of our most frequently called
code.

I suppose your patch is the most conservative approach, as the
alternative is a spinlock/unlock in wake_up_interruptible, which I
assume is necessarily more expensive than an smp_mb().

As far as I can tell it's been this way since forever.  (Well, since a
2002 patch "NFSD: TCP: rationalise locking in RPC server routines" which
removed some spinlocks from the data_ready routines.)

I don't understand what the actual race is yet (which code exactly is
missing the wakeup in this case?  nfsd threads seem to instead get
woken up by the wake_up_process() in svc_xprt_do_enqueue().)

--b.
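
The ordering problem under discussion can be modelled in userspace with C11 atomics. The sketch below is an analogy under stated assumptions, not the sunrpc code: `has_waiter` plays the role of `waitqueue_active()`, `condition` is the data the sleeper checks, and the sequentially consistent fences stand in for `smp_mb()` on the waker side and for the barrier the waiter side is assumed to issue when it queues itself. All function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool condition;	/* set by the waker, checked by the waiter */
static atomic_bool has_waiter;	/* set by the waiter, checked by the waker */

/* Waker side: publish the condition, then decide whether a wakeup is
 * needed. Returns true when a wakeup should be issued. Without the
 * fence, the load of has_waiter could be reordered before the store. */
static bool waker_should_wake(void)
{
	atomic_store_explicit(&condition, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* models smp_mb() */
	return atomic_load_explicit(&has_waiter, memory_order_relaxed);
}

/* Waiter side: announce the intent to sleep, then re-check the
 * condition. Returns true when the waiter must actually sleep. */
static bool waiter_should_sleep(void)
{
	atomic_store_explicit(&has_waiter, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with the waker's fence */
	return !atomic_load_explicit(&condition, memory_order_relaxed);
}

/* Reset both flags between runs of the model. */
static void reset_model(void)
{
	atomic_store(&condition, false);
	atomic_store(&has_waiter, false);
}
```

Each side stores its own flag, fences, and then loads the other side's flag, so at least one of the two must observe the other. Dropping the waker's fence permits the outcome where the waiter sleeps and the waker skips the wakeup, which is the missed-wakeup race the patch description refers to.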

Re: [PATCH 18/23] spear13xx_pcie_gadget: use per-attribute show and store methods

2015-10-09 Thread Felipe Balbi

Pratyush Anand  writes:

> On Sat, Oct 3, 2015 at 7:02 PM, Christoph Hellwig  wrote:
>> Signed-off-by: Christoph Hellwig 
>
> Acked-by: Pratyush Anand 

I don't seem to have the actual patch, care to resend?

-- 
balbi



[PATCH v2 6/7] Bluetooth: Add HCI device identifier for Qualcomm SMD

2015-10-09 Thread Bjorn Andersson

This patch assigns the next free HCI device identifier to Bluetooth
devices based on the Qualcomm Shared Memory channels.

Signed-off-by: Bjorn Andersson 
---

Changes since v1:
- Split out this from the btqcomsmd patch

 include/net/bluetooth/hci.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
index e7f938cac7c6..adfb371a19f9 100644
--- a/include/net/bluetooth/hci.h
+++ b/include/net/bluetooth/hci.h
@@ -60,6 +60,7 @@
 #define HCI_RS232  4
 #define HCI_PCI5
 #define HCI_SDIO   6
+#define HCI_SMD7
 
 /* HCI controller types */
 #define HCI_BREDR  0x00
-- 
2.4.2


Re: [PATCH 1/1] i40e: re-use %*ph specifier to hexdump a data

2015-10-09 Thread Jesse Brandeburg

On Fri, 2 Oct 2015 12:18:16 +0300
Andy Shevchenko  wrote:

> Instead of using a custom approach change the code to use %*ph format
> specifier.
> 
> Signed-off-by: Andy Shevchenko 

Nice catch, thanks!

Acked-by: Jesse Brandeburg 
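
For readers unfamiliar with the specifier: in the kernel's printk() family, `%*ph` prints a small buffer as space-separated hex bytes, with the length passed as the `*` argument. Standard C printf() has no equivalent, so the userspace sketch below only emulates the output format; `format_hex_bytes()` is a hypothetical helper written for this illustration, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write `len` bytes of `buf` into `out` as lowercase hex separated by
 * single spaces (e.g. "de ad be ef"). Returns the number of characters
 * written, or -1 if `out` is too small. */
static int format_hex_bytes(char *out, size_t out_size,
			    const unsigned char *buf, size_t len)
{
	size_t pos = 0;

	assert(out && out_size > 0);
	out[0] = '\0';

	for (size_t i = 0; i < len; i++) {
		/* Each byte needs 2 hex digits, plus a separator after the first. */
		size_t need = i ? 3 : 2;

		if (pos + need + 1 > out_size)
			return -1;
		pos += (size_t)snprintf(out + pos, out_size - pos,
					i ? " %02x" : "%02x", buf[i]);
	}
	return (int)pos;
}
```

In kernel code the same output comes from a single format argument pair (length, pointer), which is what makes the custom dump loop removed by the patch unnecessary.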

Re: [PATCH] sunrpc: avoid warning in gss_key_timeout

2015-10-09 Thread J. Bruce Fields

On Fri, Oct 09, 2015 at 04:13:45PM +0200, Arnd Bergmann wrote:
> The gss_key_timeout() function causes a harmless warning in some
> configurations, e.g. ARM imx_v6_v7_defconfig with gcc-5.2, if the
> compiler cannot figure out the state of the 'expire' variable across
> an rcu_read_unlock():
> 
> net/sunrpc/auth_gss/auth_gss.c: In function 'gss_key_timeout':
> net/sunrpc/auth_gss/auth_gss.c:1422:211: warning: 'expire' may be used uninitialized in this function [-Wmaybe-uninitialized]
> 
> To avoid this warning without adding a bogus initialization, this
> rewrites the function so the comparison is done inside of the
> critical section. As a side-effect, it also becomes slightly
> easier to understand because the implementation now more closely
> resembles the comment above it.

Looks reasonable, thanks; applying for 4.4--b.

> 
> Signed-off-by: Arnd Bergmann 
> Fixes: c5e6aecd034e7 ("sunrpc: fix RCU handling of gc_ctx field")
> 
> diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> index dace13d7638e..799e65b944b9 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -1411,17 +1411,16 @@ gss_key_timeout(struct rpc_cred *rc)
>  {
>   struct gss_cred *gss_cred = container_of(rc, struct gss_cred, gc_base);
>   struct gss_cl_ctx *ctx;
> - unsigned long now = jiffies;
> - unsigned long expire;
> + unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ);
> + int ret = 0;
>  
>   rcu_read_lock();
>   ctx = rcu_dereference(gss_cred->gc_ctx);
> - if (ctx)
> - expire = ctx->gc_expiry - (gss_key_expire_timeo * HZ);
> + if (!ctx || time_after(timeout, ctx->gc_expiry))
> + ret = -EACCES;
>   rcu_read_unlock();
> - if (!ctx || time_after(now, expire))
> - return -EACCES;
> - return 0;
> +
> + return ret;
>  }
>  
>  static int
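
The rewritten function relies on `time_after()` to compare jiffies values safely across counter wraparound. A minimal userspace sketch of that comparison follows; `time_after_model()` and `key_expires_soon()` are hypothetical names, and the sketch assumes the usual two's-complement conversion from unsigned to signed that the kernel macro also depends on.

```c
#include <assert.h>
#include <limits.h>

/* True when tick `a` is later than tick `b`, even if the counter wrapped
 * in between. The unsigned subtraction wraps modulo 2^N; interpreting
 * the difference as signed yields the direction. */
static int time_after_model(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Mirrors the shape of the patched check: the key counts as expiring
 * when now + window has passed the expiry tick. */
static int key_expires_soon(unsigned long now, unsigned long window,
			    unsigned long expiry)
{
	return time_after_model(now + window, expiry);
}
```

Adding the window to the current time, rather than subtracting it from the expiry, is what lets the patch compute the timeout before entering the RCU section and keep the comparison inside it.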

Re: [PATCH] brcmfmac: fix waitqueue_active without memory barrier in brcmfmac driver

2015-10-09 Thread Arend van Spriel

On 10/09/2015 02:35 AM, Kosuke Tatsukawa wrote:
> brcmf_msgbuf_ioctl_resp_wake() seems to be missing a memory barrier
> which might cause the waker to not notice the waiter and miss sending a
> wake_up as in the following figure.

My mail reader treats this as HTML format or so. Can you resend it in
plain text please?

Regards,
Arend

> 
>    brcmf_msgbuf_ioctl_resp_wake           brcmf_msgbuf_ioctl_resp_wait
>
> if (waitqueue_active(&msgbuf->ioctl_resp_wait))
> /* The CPU might reorder the test for
>    the waitqueue up here, before
>    prior writes complete */
>                                         /* wait_event_timeout */
>                                          /* __wait_event_timeout */
>                                           /* ___wait_event */
>                                           prepare_to_wait_event(&wq, &__wait,
>                                             state);
>                                           if (msgbuf->ctl_completed)
>                                           ...
> msgbuf->ctl_completed = true;
>                                           schedule_timeout(__ret))
> 
> 
> There are three other places in drivers/net/wireless/brcm80211/brcmfmac/
> which have similar code.  The attached patch removes the call to
> waitqueue_active() leaving just wake_up() behind.  This fixes the
> problem because the call to spin_lock_irqsave() in wake_up() will be an
> ACQUIRE operation.
> 
> I found this issue when I was looking through the linux source code
> for places calling waitqueue_active() before wake_up*(), but without
> preceding memory barriers, after sending a patch to fix a similar
> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> found here: https://lkml.org/lkml/2015/9/28/849).
> 
> Signed-off-by: Kosuke Tatsukawa 
> ---
>   drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c |3 +--
>   drivers/net/wireless/brcm80211/brcmfmac/sdio.c   |6 ++
>   drivers/net/wireless/brcm80211/brcmfmac/usb.c|3 +--
>   3 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c b/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c
> index 7b2136c..648151e 100644
> --- a/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c
> +++ b/drivers/net/wireless/brcm80211/brcmfmac/msgbuf.c
> @@ -473,8 +473,7 @@ static int brcmf_msgbuf_ioctl_resp_wait(struct 
> brcmf_msgbuf *msgbuf)
>   static void brcmf_msgbuf_ioctl_resp_wake(struct brcmf_msgbuf *msgbuf)
>   {
>   msgbuf->ctl_completed = true;
> - if (waitqueue_active(&msgbuf->ioctl_resp_wait))
> - wake_up(&msgbuf->ioctl_resp_wait);
> + wake_up(&msgbuf->ioctl_resp_wait);
>   }
>   
>   
> diff --git a/drivers/net/wireless/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/brcm80211/brcmfmac/sdio.c
> index f990e3d..332c4c8 100644
> --- a/drivers/net/wireless/brcm80211/brcmfmac/sdio.c
> +++ b/drivers/net/wireless/brcm80211/brcmfmac/sdio.c
> @@ -1785,8 +1785,7 @@ static int brcmf_sdio_dcmd_resp_wait(struct brcmf_sdio 
> *bus, uint *condition,
>   
>   static int brcmf_sdio_dcmd_resp_wake(struct brcmf_sdio *bus)
>   {
> - if (waitqueue_active(&bus->dcmd_resp_wait))
> - wake_up_interruptible(&bus->dcmd_resp_wait);
> + wake_up_interruptible(&bus->dcmd_resp_wait);
>   
>   return 0;
>   }
> @@ -2110,8 +2109,7 @@ static uint brcmf_sdio_readframes(struct brcmf_sdio 
> *bus, uint maxframes)
>   static void
>   brcmf_sdio_wait_event_wakeup(struct brcmf_sdio *bus)
>   {
> - if (waitqueue_active(&bus->ctrl_wait))
> - wake_up_interruptible(&bus->ctrl_wait);
> + wake_up_interruptible(&bus->ctrl_wait);
>   return;
>   }
>   
> diff --git a/drivers/net/wireless/brcm80211/brcmfmac/usb.c b/drivers/net/wireless/brcm80211/brcmfmac/usb.c
> index daba86d..7f5889c 100644
> --- a/drivers/net/wireless/brcm80211/brcmfmac/usb.c
> +++ b/drivers/net/wireless/brcm80211/brcmfmac/usb.c
> @@ -184,8 +184,7 @@ static int brcmf_usb_ioctl_resp_wait(struct 
> brcmf_usbdev_info *devinfo)
>   
>   static void brcmf_usb_ioctl_resp_wake(struct brcmf_usbdev_info *devinfo)
>   {
> - if (waitqueue_active(&devinfo->ioctl_resp_wait))
> - wake_up(&devinfo->ioctl_resp_wait);
> + wake_up(&devinfo->ioctl_resp_wait);
>   }
>   
>   static void
> 


e1000e: hard system lockup on Linux 4.2

2015-10-09 Thread Jason A. Donenfeld

Hi Jeffrey & Raanan & Yanirx,

I have a Thinkpad W530 with a 82579LM inside of it, which uses the
e1000e driver. Every few hours, my system does a hard lockup, and I am
unable to do anything at all with it except power it off. There isn't
a panic or oops, as nothing is written to /sys/fs/pstore after. But, I
did enable the lockup detection debug option, and was able to gain a
few stack traces. And all of the time, the culprit is the e1000e
driver.

The funny thing is that I'm not actually using the Ethernet card --
nothing is plugged into the jack, as I'm mostly on wifi these days.
Nevertheless, it receives power from my laptop and thus the driver is
partaking in some form of communication with it.

The stack traces follow below. You'll notice that some time after the
initial e1000e lockup, there's a soft lockup in the bpf paths. I
believe this is due to the fact that at the time of the hard lockup in
e1000e, I was trying to open a new Chrome tab process, which makes use
of seccomp-bpf. I do not believe the bug is in the bpf code, however.

Briefly looking at the stack trace myself shows quite a bit of
activity in `e1000e_cyclecounter_read`. Running `git log
drivers/net/ethernet/intel/e1000e` indicates a recent change from
Raanan -- 37b12910dd11d9ab969f2c310dc9160b7f3e3405 -- "e1000e: Fix
tight loop implementation of systime read algorithm". In this change,
a loop is entirely removed.

Investigating the origin of that loop reveals this commit from Yanirx
-- 83129b37ef35bb6a7f01c060129736a8db5d31c4. This commit appears to be
present in 4.2, but not in 4.1. This leads me to think it may be the
cause of the bug, with the aforementioned
37b12910dd11d9ab969f2c310dc9160b7f3e3405 being the fix for it.

I would therefore recommend that -- if this analysis is correct --
37b12910dd11d9ab969f2c310dc9160b7f3e3405 be backported to the 4.2
stable releases (thus, CCing Greg).

If my very brief and preliminary investigation is not correct, please
let me know if there is any additional information or debugging steps
I can apply, so that we can fix this regression.

In the meantime while I wait to hear back, I'll try backporting that
commit to 4.2 myself, and seeing the stability of my laptop over the
next 24 hours.

Thanks,
Jason

=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~

[14469.274866] [ cut here ]
[14469.274874] WARNING: CPU: 1 PID: 12032 at kernel/watchdog.c:311
watchdog_overflow_callback+0x74/0xa0()
[14469.274875] Watchdog detected hard LOCKUP on cpu 1
[14469.274877] Modules linked in:
[14469.274878]  bnep hid_generic usbhid cdc_acm af_packet pl2303
usbserial btusb btbcm btintel bluetooth uvcvideo videobuf2_vmalloc
videobuf2_memops videobuf2_core v4l2_common videodev ip6table_filter
iptable_filter ip6_tables ip_tables x_tables mmc_block
snd_hda_codec_realtek snd_hda_codec_generic iwldvm coretemp
x86_pkg_temp_thermal intel_powerclamp mac80211 kvm_intel snd_hda_intel
sdhci_pci snd_hda_codec kvm iwlwifi snd_hwdep snd_hda_core joydev
xhci_pci ehci_pci sdhci xhci_hcd ehci_hcd snd_pcm mousedev microcode
cfg80211 mmc_core usbcore snd_timer usb_common thinkpad_acpi thermal
snd soundcore hwmon rfkill ac battery evdev processor ipv6
[14469.274912] CPU: 1 PID: 12032 Comm: kworker/1:0 Not tainted 4.2.3-gentoo #3
[14469.274913] Hardware name: LENOVO 2436CTO/2436CTO, BIOS G5ETA2WW
(2.62 ) 03/31/2015
[14469.274918] Workqueue: events e1000e_systim_overflow_work
[14469.274920]   81a0eb69 81683f33
88081e245ba0
[14469.274922]  810ce4b8 8807fa6f 
88081e245c80
[14469.274924]  88081e245ef8  810ce535
81a0a3f8
[14469.274926] Call Trace:
[14469.274928][] ? dump_stack+0x40/0x50
[14469.274935]  [] ? warn_slowpath_common+0x78/0xb0
[14469.274937]  [] ? warn_slowpath_fmt+0x45/0x50
[14469.274939]  [] ? watchdog_overflow_callback+0x74/0xa0
[14469.274941]  [] ? __perf_event_overflow+0x86/0x1c0
[14469.274944]  [] ? intel_pmu_handle_irq+0x1c9/0x3f0
[14469.274948]  [] ? perf_event_nmi_handler+0x25/0x40
[14469.274951]  [] ? nmi_handle+0x7c/0x100
[14469.274952]  [] ? do_nmi+0x1dd/0x360
[14469.274956]  [] ? end_repeat_nmi+0x1a/0x1e
[14469.274958]  [] ? e1000e_cyclecounter_read+0xd/0xb0
[14469.274960]  [] ? e1000e_cyclecounter_read+0xd/0xb0
[14469.274962]  [] ? e1000e_cyclecounter_read+0xd/0xb0
[14469.274963]  <>  [] ? timecounter_read+0xc/0x50
[14469.274968]  [] ? e1000e_phc_gettime+0x28/0x60
[14469.274971]  [] ? e1000e_systim_overflow_work+0x18/0x40
[14469.274974]  [] ? process_one_work+0x140/0x3f0
[14469.274976]  [] ? worker_thread+0x42/0x490
[14469.274978]  [] ? process_one_work+0x3f0/0x3f0
[14469.274980]  [] ? kthread+0xbc/0xe0
[14469.274983]  [] ? kthread_worker_fn+0x160/0x160
[14469.274985]  [] ? ret_from_fork+0x3f/0x70
[14469.274987]  [] ? kthread_worker_fn+0x160/0x160
[14469.274988] ---[ end trace 99827f3383cad419 ]---
[14491.475971] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
[chrome:13388]
[144

[PATCH net-next 2/3] ipv4: Pass struct net into ip_defrag and ip_check_defrag

2015-10-09 Thread Eric W. Biederman

The function ip_defrag is called on both the input and the output
paths of the networking stack.  In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.

So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.

Signed-off-by: "Eric W. Biederman" 
---
 drivers/net/macvlan.c   | 2 +-
 include/net/ip.h| 6 +++---
 net/ipv4/ip_fragment.c  | 7 +++
 net/ipv4/ip_input.c | 7 ---
 net/ipv4/netfilter/nf_defrag_ipv4.c | 7 ---
 net/netfilter/ipvs/ip_vs_core.c | 2 +-
 net/openvswitch/conntrack.c | 2 +-
 net/packet/af_packet.c  | 6 +++---
 8 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 47da43595ac2..86f6c6292c27 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -412,7 +412,7 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 
port = macvlan_port_get_rcu(skb->dev);
if (is_multicast_ether_addr(eth->h_dest)) {
-   skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN);
+   skb = ip_check_defrag(dev_net(skb->dev), skb, IP_DEFRAG_MACVLAN);
if (!skb)
return RX_HANDLER_CONSUMED;
eth = eth_hdr(skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 3c904a28d5e5..1a98f1ca1638 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -506,11 +506,11 @@ static inline bool ip_defrag_user_in_between(u32 user,
return user >= lower_bond && user <= upper_bond;
 }
 
-int ip_defrag(struct sk_buff *skb, u32 user);
+int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
 #ifdef CONFIG_INET
-struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user);
+struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user);
 #else
-static inline struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
+static inline struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 {
return skb;
 }
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9772b789adf3..5482745d5d68 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -654,11 +654,10 @@ out_fail:
 }
 
 /* Process an incoming IP datagram fragment. */
-int ip_defrag(struct sk_buff *skb, u32 user)
+int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 {
struct net_device *dev = skb->dev ? : skb_dst(skb)->dev;
int vif = l3mdev_master_ifindex_rcu(dev);
-   struct net *net = dev_net(dev);
struct ipq *qp;
 
IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
@@ -683,7 +682,7 @@ int ip_defrag(struct sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_defrag);
 
-struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
+struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 {
struct iphdr iph;
int netoff;
@@ -712,7 +711,7 @@ struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
if (pskb_trim_rcsum(skb, netoff + len))
return skb;
memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-   if (ip_defrag(skb, user))
+   if (ip_defrag(net, skb, user))
return NULL;
skb_clear_hash(skb);
}
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 804b86fd615f..b1209b63381f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -170,7 +170,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 sk->sk_bound_dev_if == dev->ifindex) &&
net_eq(sock_net(sk), net)) {
if (ip_is_fragment(ip_hdr(skb))) {
-   if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN))
+   if (ip_defrag(net, skb, 
IP_DEFRAG_CALL_RA_CHAIN))
return true;
}
if (last) {
@@ -247,14 +247,15 @@ int ip_local_deliver(struct sk_buff *skb)
/*
 *  Reassemble IP fragments.
 */
+   struct net *net = dev_net(skb->dev);
 
if (ip_is_fragment(ip_hdr(skb))) {
-   if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
+   if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
return 0;
}
 
return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
-  dev_net(skb->dev), NULL, skb, skb->dev, NULL,
+  net, NULL, skb, skb->dev, NULL,
   ip_local_deliver_finish);
 }
 
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c 
b/net/ipv4/netfilter/nf_defrag_ipv4.c
index b246346ee849..bf25f45b23d2 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_d

[PATCH net-next 3/3] ipv6: Pass struct net into nf_ct_frag6_gather

2015-10-09 Thread Eric W. Biederman

The function nf_ct_frag6_gather is called on both the input and the
output paths of the networking stack.  In particular, ipv6_defrag, which
calls nf_ct_frag6_gather, is called from both the PRE_ROUTING chain
on input and the LOCAL_OUT chain on output.

The addition of a net parameter makes it explicit which network
namespace the packets are being reassembled in, and removes the need
for nf_ct_frag6_gather to guess.

Signed-off-by: "Eric W. Biederman" 
---
 include/net/netfilter/ipv6/nf_defrag_ipv6.h | 2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 +---
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c   | 3 ++-
 net/openvswitch/conntrack.c | 2 +-
 4 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/net/netfilter/ipv6/nf_defrag_ipv6.h 
b/include/net/netfilter/ipv6/nf_defrag_ipv6.h
index 27666d8a0bd0..fb7da5bb76cc 100644
--- a/include/net/netfilter/ipv6/nf_defrag_ipv6.h
+++ b/include/net/netfilter/ipv6/nf_defrag_ipv6.h
@@ -5,7 +5,7 @@ void nf_defrag_ipv6_enable(void);
 
 int nf_ct_frag6_init(void);
 void nf_ct_frag6_cleanup(void);
-struct sk_buff *nf_ct_frag6_gather(struct sk_buff *skb, u32 user);
+struct sk_buff *nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 
user);
 void nf_ct_frag6_consume_orig(struct sk_buff *skb);
 
 struct inet_frags_ctl;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 701cd2bae0a9..2fb86a99bf5f 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -563,12 +563,10 @@ find_prev_fhdr(struct sk_buff *skb, u8 *prevhdrp, int 
*prevhoff, int *fhoff)
return 0;
 }
 
-struct sk_buff *nf_ct_frag6_gather(struct sk_buff *skb, u32 user)
+struct sk_buff *nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 
user)
 {
struct sk_buff *clone;
struct net_device *dev = skb->dev;
-   struct net *net = skb_dst(skb) ? dev_net(skb_dst(skb)->dev)
-  : dev_net(skb->dev);
struct frag_hdr *fhdr;
struct frag_queue *fq;
struct ipv6hdr *hdr;
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c 
b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index a99baf63eccf..5173a89a238e 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -63,7 +63,8 @@ static unsigned int ipv6_defrag(void *priv,
return NF_ACCEPT;
 #endif
 
-   reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(state->hook, skb));
+   reasm = nf_ct_frag6_gather(state->net, skb,
+  nf_ct6_defrag_user(state->hook, skb));
/* queued */
if (reasm == NULL)
return NF_STOLEN;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index cb76076a7a42..ad614267cc2a 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -315,7 +315,7 @@ static int handle_fragments(struct net *net, struct 
sw_flow_key *key,
struct sk_buff *reasm;
 
memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
-   reasm = nf_ct_frag6_gather(skb, user);
+   reasm = nf_ct_frag6_gather(net, skb, user);
if (!reasm)
return -EINPROGRESS;
 
-- 
2.2.1
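
For readers who want the motivation in miniature, here is a userspace sketch of
the difference between guessing the namespace from the skb and having the
caller pass it in. The struct layouts, guess_net() and gather_net() are
hypothetical stand-ins, not the kernel definitions; only the shape of the
argument passing is meant to match the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct net { int id; };
struct net_device { struct net *nd_net; };
struct dst_entry { struct net_device *dev; };
struct sk_buff {
	struct net_device *dev;  /* device currently attached to the skb */
	struct dst_entry *dst;   /* route, if one is attached */
};

/* Old style: the callee guesses the namespace from whatever is attached. */
static struct net *guess_net(const struct sk_buff *skb)
{
	return skb->dst ? skb->dst->dev->nd_net : skb->dev->nd_net;
}

/* New style: the caller knows which hook it runs in and says so. */
static struct net *gather_net(struct net *net, const struct sk_buff *skb)
{
	(void)skb;  /* the skb is no longer consulted for the namespace */
	return net;
}
```

Once a route can point at a device in another namespace, the guess and the
explicit value can disagree, which is the case the explicit parameter is meant
to make unambiguous.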


[PATCH net-next 1/3] ipv4: Only compute net once in ip_call_ra_chain

2015-10-09 Thread Eric W. Biederman

ip_call_ra_chain is called early in the forwarding chain from
ip_forward and ip_mr_input, which makes skb->dev the correct
expression to get the input network device and dev_net(skb->dev) a
correct expression for the network namespace the packet is being
processed in.

Compute the network namespace and store it in a variable to make the
code clearer.

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7cc9f7bb7fb7..804b86fd615f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -157,6 +157,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
u8 protocol = ip_hdr(skb)->protocol;
struct sock *last = NULL;
struct net_device *dev = skb->dev;
+   struct net *net = dev_net(dev);
 
for (ra = rcu_dereference(ip_ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
struct sock *sk = ra->sk;
@@ -167,7 +168,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
if (sk && inet_sk(sk)->inet_num == protocol &&
(!sk->sk_bound_dev_if ||
 sk->sk_bound_dev_if == dev->ifindex) &&
-   net_eq(sock_net(sk), dev_net(dev))) {
+   net_eq(sock_net(sk), net)) {
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN))
return true;
-- 
2.2.1


[PATCH net-next 0/3] net: Pass net into defragmentation

2015-10-09 Thread Eric W. Biederman


This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.

In netfilter and af_packet we defragment packets in the output path,
and there is the usual amount of confusion about how to compute which
net we are processing the packets in.  This patchset clears that
confusion up by explicitly passing in struct net in ip_defrag,
ip_check_defrag, and nf_ct_frag6_gather.

The changes are also available against net-next at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next.git master

Eric

Eric W. Biederman (3):
  ipv4: Only compute net once in ip_call_ra_chain
  ipv4: Pass struct net into ip_defrag and ip_check_defrag
  ipv6: Pass struct net into nf_ct_frag6_gather

 drivers/net/macvlan.c   |  2 +-
 include/net/ip.h|  6 +++---
 include/net/netfilter/ipv6/nf_defrag_ipv6.h |  2 +-
 net/ipv4/ip_fragment.c  |  7 +++
 net/ipv4/ip_input.c | 10 ++
 net/ipv4/netfilter/nf_defrag_ipv4.c |  7 ---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  4 +---
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c   |  3 ++-
 net/netfilter/ipvs/ip_vs_core.c |  2 +-
 net/openvswitch/conntrack.c |  4 ++--
 net/packet/af_packet.c  |  6 +++---
 11 files changed, 27 insertions(+), 26 deletions(-)

[PATCH v2] netfilter: turn NF_HOOK into an inline function

2015-10-09 Thread Arnd Bergmann

A recent change to the dst_output handling caused a new warning
when the call to NF_HOOK() is the only use of a local variable
passed as 'dev', and CONFIG_NETFILTER is disabled:

net/ipv6/ip6_output.c: In function 'ip6_output':
net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]

The reason for this is that the NF_HOOK macro in this case does
not reference the variable at all, and the call to dev_net(dev)
got removed from the ip6_output function. To avoid that warning now
and in the future, this changes the macro into an equivalent
inline function, which tells the compiler that the variable is
passed correctly but still unused.

The dn_forward function apparently had the same problem in
the past and added a local workaround that no longer works
with the inline function. In order to avoid a regression, we
have to also remove the #ifdef from decnet in the same patch.

Signed-off-by: Arnd Bergmann 
Fixes: ede2059dbaf9 ("dst: Pass net into dst->output")
---
v2: folded the #ifdef removal and reworded based on Eric's feedback.

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index edb3dc32f1da..1ff5c3f82820 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -347,8 +347,23 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi 
*fl, u_int8_t family)
 }
 
 #else /* !CONFIG_NETFILTER */
-#define NF_HOOK(pf, hook, net, sk, skb, indev, outdev, okfn) (okfn)(net, sk, 
skb)
-#define NF_HOOK_COND(pf, hook, net, sk, skb, indev, outdev, okfn, cond) 
(okfn)(net, sk, skb)
+static inline int
+NF_HOOK_COND(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+struct sk_buff *skb, struct net_device *in, struct net_device *out,
+int (*okfn)(struct net *, struct sock *, struct sk_buff *),
+bool cond)
+{
+   return okfn(net, sk, skb);
+}
+
+static inline int
+NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, 
struct sk_buff *skb,
+   struct net_device *in, struct net_device *out,
+   int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+   return okfn(net, sk, skb);
+}
+
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
  struct sock *sk, struct sk_buff *skb,
  struct net_device *indev, struct net_device *outdev,
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 27fce283117b..607a14f20d88 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -789,9 +789,7 @@ static int dn_forward(struct sk_buff *skb)
struct dn_dev *dn_db = rcu_dereference(dst->dev->dn_ptr);
struct dn_route *rt;
int header_len;
-#ifdef CONFIG_NETFILTER
struct net_device *dev = skb->dev;
-#endif
 
if (skb->pkt_type != PACKET_HOST)
goto drop;
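
The macro-versus-inline point can be reproduced outside the kernel. The sketch
below uses made-up names (HOOK_MACRO, hook_inline, add_one) and a simplified
signature; it only illustrates why a function-like macro that never expands an
argument leaves the caller's variable unreferenced, while an inline function
does not.

```c
struct net_device { int ifindex; };

static int add_one(int v)
{
	return v + 1;
}

/* Macro form: 'dev' never appears in the expansion, so the preprocessor
 * throws the argument away before the compiler proper sees it. */
#define HOOK_MACRO(dev, fn, v) (fn)(v)

/* Inline form: 'dev' is a real argument; it is evaluated, then ignored. */
static inline int hook_inline(struct net_device *dev, int (*fn)(int), int v)
{
	(void)dev;
	return fn(v);
}

static int call_via_macro(int v)
{
	/* 'no_such_dev' is never declared, yet this compiles, which shows
	 * that the macro does not reference its first argument at all. */
	return HOOK_MACRO(no_such_dev, add_one, v);
}

static int call_via_inline(struct net_device *d, int v)
{
	struct net_device *dev = d;  /* referenced below, so no warning */

	return hook_inline(dev, add_one, v);
}
```

With the macro form, a local 'dev' that is only passed to the hook counts as
unused; with the inline form the same local is a used value, and the optimizer
is still free to drop it.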


Re: [PATCH] netfilter: turn NF_HOOK into an inline function

2015-10-09 Thread Eric W. Biederman

Arnd Bergmann  writes:

> A recent change to the dst_output handling caused a new warning
> when the call to NF_HOOK() is the only used of a local variable
> passed as 'dev', and CONFIG_NETFILTER is disabled:
>
> net/ipv6/ip6_output.c: In function 'ip6_output':
> net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' 
> [-Wunused-variable]
>
> The reason for this is that the NF_HOOK macro in this case does
> not reference the variable at all. To avoid that warning now
> and in the future, this changes the macro into an equivalent
> inline function, which tells the compiler that the variable is
> passed correctly but still unused.

For clarification, the actual change that triggered this is that I passed in
net instead of computing it as net = dev_net(dev), which was the second
use of the dev variable.

> Signed-off-by: Arnd Bergmann 
> Fixes: ede2059dbaf9 ("dst: Pass net into dst->output")
>
> diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
> index edb3dc32f1da..1ff5c3f82820 100644
> --- a/include/linux/netfilter.h
> +++ b/include/linux/netfilter.h
> @@ -347,8 +347,23 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi 
> *fl, u_int8_t family)
>  }
>  
>  #else /* !CONFIG_NETFILTER */
> -#define NF_HOOK(pf, hook, net, sk, skb, indev, outdev, okfn) (okfn)(net, sk, 
> skb)
> -#define NF_HOOK_COND(pf, hook, net, sk, skb, indev, outdev, okfn, cond) 
> (okfn)(net, sk, skb)
> +static inline int
> +NF_HOOK_COND(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
> +  struct sk_buff *skb, struct net_device *in, struct net_device *out,
> +  int (*okfn)(struct net *, struct sock *, struct sk_buff *),
> +  bool cond)
> +{
> + return okfn(net, sk, skb);
> +}
> +
> +static inline int
> +NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, 
> struct sk_buff *skb,
> + struct net_device *in, struct net_device *out,
> + int (*okfn)(struct net *, struct sock *, struct sk_buff *))
> +{
> + return okfn(net, sk, skb);
> +}
> +
>  static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
> struct sock *sk, struct sk_buff *skb,
> struct net_device *indev, struct net_device *outdev,

Re: [PATCH] netfilter: turn NF_HOOK into an inline function

2015-10-09 Thread Eric W. Biederman

Arnd Bergmann  writes:

> On Friday 09 October 2015 21:03:57 kbuild test robot wrote:
>> ^1da177e Linus Torvalds2005-04-16  818  cb->rt_flags &= ~DN_RT_F_IE;
>> ^1da177e Linus Torvalds2005-04-16  819  if (rt->rt_flags & 
>> RTCF_DOREDIRECT)
>> ^1da177e Linus Torvalds2005-04-16  820  cb->rt_flags |= 
>> DN_RT_F_IE;
>> ^1da177e Linus Torvalds2005-04-16  821  
>> 29a26a56 Eric W. Biederman 2015-09-15  822  return 
>> NF_HOOK(NFPROTO_DECNET, NF_DN_FORWARD,
>> 29a26a56 Eric W. Biederman 2015-09-15 @823 &init_net, 
>> NULL, skb, dev, skb->dev,
>> 8f40b161 David S. Miller   2011-07-17  824 
>> dn_to_neigh_output);
>> ^1da177e Linus Torvalds2005-04-16  825  
>> ^1da177e Linus Torvalds2005-04-16  826  drop:
>> 
>
> Ah, right. The 'dev' variable here is declared as
>
> #ifdef CONFIG_NETFILTER
> struct net_device *dev = skb->dev;
> #endif
>
> Apparently because the code produced the same warning as the ipv6 code.
>
> Removing the #ifdef here would make that code nicer and let us use
> my patch. Alternatively we could put the same #ifdef into IPV6 and
> not use the inline function.

Compilers are good at removing unused variables (SSA should guarantee
this will always happen), and #ifdefs suck.  I vote for your inline,
especially as the case that is actively used and tested
(CONFIG_NETFILTER) has these two functions as inline functions.

Eric

[PATCH net-next] packet: fix match_fanout_group()

2015-10-09 Thread Eric Dumazet

From: Eric Dumazet 

Recent TCP listener patches exposed a prior af_packet bug:
match_fanout_group() blindly assumes it is always safe
to cast sk to a packet socket to compare fanout with af_packet_priv.

But SYNACK packets can be sent while attached to a request_sock, which
is smaller than a "struct sock".

We can then read nonexistent memory and crash.

Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group")
Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of 
listener")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Cc: Eric Leblond 
---
 net/packet/af_packet.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 81c900fbc4a4..ccd1d4e9b151 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1519,10 +1519,10 @@ static void __fanout_unlink(struct sock *sk, struct 
packet_sock *po)
 
 static bool match_fanout_group(struct packet_type *ptype, struct sock *sk)
 {
-   if (ptype->af_packet_priv == (void *)((struct packet_sock *)sk)->fanout)
-   return true;
+   if (sk->sk_family != PF_PACKET)
+   return false;
 
-   return false;
+   return ptype->af_packet_priv == pkt_sk(sk)->fanout;
 }
 
 static void fanout_init_data(struct packet_fanout *f)
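
As an illustration of the check-before-cast pattern in this fix, here is a
userspace sketch. The struct layouts and the SKETCH_PF_* constants are
simplified assumptions, not the kernel definitions; the point is only that the
family field, which every socket-like object carries, is tested before any
packet_sock-only field is read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch constants; the values are assumptions for the example. */
#define SKETCH_PF_INET   2
#define SKETCH_PF_PACKET 17

struct sock { unsigned short sk_family; };
struct packet_fanout { int id; };
struct packet_sock {
	struct sock sk;               /* first member, so the cast below works */
	struct packet_fanout *fanout;
};
struct packet_type { void *af_packet_priv; };

static struct packet_sock *pkt_sk(struct sock *sk)
{
	return (struct packet_sock *)sk;
}

static bool match_fanout_group(struct packet_type *ptype, struct sock *sk)
{
	/* Anything that is not a packet socket may be a smaller object,
	 * so bail out before touching packet_sock-only fields. */
	if (sk->sk_family != SKETCH_PF_PACKET)
		return false;

	return ptype->af_packet_priv == pkt_sk(sk)->fanout;
}
```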



Re: [PATCH net-next 0/4] tcp: better smp listener behavior

2015-10-09 Thread Eric Dumazet

On Fri, 2015-10-09 at 20:02 +0200, Daniel Borkmann wrote:

> Agreed, will fix that in trafgen. ;) Thanks!

Nice ! Thanks Daniel !



Re: [PATCH net-next v4 1/2] fix return of iptunnel_xmit

2015-10-09 Thread Debabrata Banerjee

Andreas, I think we need to use the net_xmit defines so the errors are
masked properly. How about:

-   if (unlikely(net_xmit_eval(err)))
-   pkt_len = 0;
-   return pkt_len;
+   if (likely(net_xmit_eval(err) == 0))
+   return pkt_len;
+   else
+   return net_xmit_errno(err);
+
+   return 0;
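
To compare the two variants side by side, here is a small userspace sketch.
The NET_XMIT_* values and the net_xmit_eval()/net_xmit_errno() macros are
written from memory of include/linux/netdevice.h and should be checked against
the tree; xmit_ret_v4() and xmit_ret_masked() are hypothetical helper names
that isolate just the return-value logic.

```c
#include <errno.h>

/* Assumed to mirror include/linux/netdevice.h; verify before relying on it. */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02
#define net_xmit_eval(e)  ((e) == NET_XMIT_CN ? 0 : (e))
#define net_xmit_errno(e) ((e) != NET_XMIT_CN ? -ENOBUFS : 0)

/* Variant from the v4 patch: negative errnos pass through unchanged,
 * positive NET_XMIT_* drop codes become 0. */
static int xmit_ret_v4(int err, int pkt_len)
{
	if (net_xmit_eval(err) == 0)
		return pkt_len;
	if (err < 0)
		return err;
	return 0;
}

/* Variant suggested in this reply: every failure goes through
 * net_xmit_errno(). */
static int xmit_ret_masked(int err, int pkt_len)
{
	if (net_xmit_eval(err) == 0)
		return pkt_len;
	return net_xmit_errno(err);
}
```

If those macro definitions are right, the masked variant reports a dropped
packet as -ENOBUFS rather than 0 and also collapses other negative errnos to
-ENOBUFS, so the choice depends on what callers are expected to distinguish.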

On Fri, Oct 9, 2015 at 5:27 AM, Andreas Schultz  wrote:
> All users of iptunnel_xmit expect the return value to be the packet
> length on success (>0), negative for a tx error and zero for a tx
> dropped error. In cset 0e6fbc5b6c6218987c93b8c7ca60cf786062899d the
> negative return case was lost.
>
> This bug was introduced when the ip_tunnel_core code was refactored.
>
> Fixes: 0e6fbc5b6c6218987c93b8c7ca60cf786062899d
> Signed-off-by: Andreas Schultz 
> Acked-by: Jiri Benc 
> Acked-by: Pravin B Shelar 
> ---
> Change in v2:
>  - remove unused variable pkt_len
>
> Change in v3:
>  - reworked based on comment from Jiri Benc
>
> Change in v4:
>  - rebased to net-next to avoid merge conflicts
>  - added Acked-By from Jiri Benc and Pravin B Shelar
>
> ---
>  net/ipv4/ip_tunnel_core.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index 6cb9009..453d569 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -80,9 +80,12 @@ int iptunnel_xmit(struct sock *sk, struct rtable *rt, 
> struct sk_buff *skb,
> __ip_select_ident(net, iph, skb_shinfo(skb)->gso_segs ?: 1);
>
> err = ip_local_out(net, sk, skb);
> -   if (unlikely(net_xmit_eval(err)))
> -   pkt_len = 0;
> -   return pkt_len;
> +   if (likely(net_xmit_eval(err) == 0))
> +   return pkt_len;
> +   if (err < 0)
> +   return err;
> +
> +   return 0;
>  }
>  EXPORT_SYMBOL_GPL(iptunnel_xmit);
>
> --
> 2.1.4
>

Re: [PATCH net-next 0/4] tcp: better smp listener behavior

2015-10-09 Thread Daniel Borkmann


On 10/09/2015 12:50 PM, Eric Dumazet wrote:

On Thu, 2015-10-08 at 20:42 -0700, Tom Herbert wrote:

On Thu, Oct 8, 2015 at 8:37 AM, Eric Dumazet  wrote:

As promised in last patch series, we implement a better SO_REUSEPORT
strategy, based on cpu affinities if selected by the application.

We also moved sk_refcnt out of the cache line containing the lookup
keys, as it was considerably slowing down smp operations because
of false sharing. This was simpler than converting listen sockets
to conventional RCU (to avoid sk_refcnt dirtying)

Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.


Is this IPv4, IPv6, or some combination of the two ? :-)


IPv4 only (mostly because I was using trafgen and its csumtcp() only
deals with IPv4 and I am lazy)


Agreed, will fix that in trafgen. ;) Thanks!

Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Alexei Starovoitov


On 10/9/15 10:45 AM, Daniel Borkmann wrote:

On 10/09/2015 07:30 PM, Alexei Starovoitov wrote:
...

Openstack use case is different. There it will be prog_type_sched_cls
that can mangle packets, change skb metadata, etc under TC framework.
These are not suitable for all users and this patch leaves
them root-only. If you're proposing to add CAP_BPF_TC to let containers
use them without being CAP_SYS_ADMIN, then I agree, it is useful, but
needs a lot more safety analysis on tc side.


Well, I think if so, then this would need to be something generic for
tc instead of being specific to a single (out of various) entities
inside the tc framework, but I currently doubt that this makes much
sense. If we allow to operate already at that level, then restricting
to CAP_SYS_ADMIN makes more sense in that specific context/subsys to me.


Let me rephrase. I think it would be useful, but I have my doubts that
it's manageable, since analyzing the dark corners of TC is not trivial.
It is probably easier to allow prog_type_sched_cls/act under CAP_NET_ADMIN
and grant that to trusted apps, though that is only a tiny bit better than
requiring CAP_SYS_ADMIN.


Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Daniel Borkmann


On 10/09/2015 07:30 PM, Alexei Starovoitov wrote:
...

Openstack use case is different. There it will be prog_type_sched_cls
that can mangle packets, change skb metadata, etc under TC framework.
These are not suitable for all users and this patch leaves
them root-only. If you're proposing to add CAP_BPF_TC to let containers
use them without being CAP_SYS_ADMIN, then I agree, it is useful, but
needs a lot more safety analysis on tc side.


Well, I think if so, then this would need to be something generic for
tc instead of being specific to a single (out of various) entities
inside the tc framework, but I currently doubt that this makes much
sense. If we allow to operate already at that level, then restricting
to CAP_SYS_ADMIN makes more sense in that specific context/subsys to me.

Best,
Daniel

Re: [PATCH 1/2] net/fsl_pq_mdio: check TBI address for consistency with mapped range

2015-10-09 Thread kbuild test robot

Hi Gerlando,

[auto build test WARNING on net/master -- if it's inappropriate base, please 
ignore]

config: powerpc-tqm8541_defconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/freescale/fsl_pq_mdio.c: In function 
'fsl_pq_mdio_probe':
>> drivers/net/ethernet/freescale/fsl_pq_mdio.c:454:14: warning: comparison of 
>> distinct pointer types lacks a cast
   if (tbipa > priv->map + resource_size(&res))
 ^

vim +454 drivers/net/ethernet/freescale/fsl_pq_mdio.c

   438  if (!prop) {
   439  dev_err(&pdev->dev,
   440  "missing 'reg' property in node 
%s\n",
   441  tbi->full_name);
   442  err = -EBUSY;
   443  goto error;
   444  }
   445  
   446  tbipa = data->get_tbipa(priv->map);
   447  
   448  /*
   449   * Add consistency check to make sure TBI is 
contained
   450   * within the mapped range (not because we 
would get a
   451   * segfault, rather to catch bugs in computing 
TBI
   452   * address). Print error message but continue 
anyway.
   453   */
 > 454  if (tbipa > priv->map + resource_size(&res))
   455  dev_err(&pdev->dev, "invalid register 
map (should be at least 0x%04x to contain TBI address)\n",
   456  ((void *)tbipa - priv->map) + 
4);
   457  
   458  iowrite32be(be32_to_cpup(prop), tbipa);
   459  }
   460  }
   461  
   462  if (data->ucc_configure)
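
One way to avoid the distinct-pointer-types warning is to do the range check on
integer addresses rather than on a uint32_t pointer and a void pointer. The
sketch below is userspace C with a hypothetical helper name, tbi_outside_map();
it drops the __iomem annotations the driver would keep, and it uses a slightly
stricter bound (the whole 4-byte register must fit inside the range), which is
an assumption about the intent rather than what line 454 does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the 32-bit register at 'tbipa' is not fully contained
 * in the 'map_size' bytes starting at 'map'. */
static bool tbi_outside_map(const void *map, size_t map_size,
			    const uint32_t *tbipa)
{
	uintptr_t start = (uintptr_t)map;
	uintptr_t reg = (uintptr_t)tbipa;

	/* Both operands are plain integers, so no cast warning arises. */
	return reg < start || reg - start + sizeof(*tbipa) > map_size;
}
```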

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



[PATCH] rtnetlink: fix gcc -Wconversion warning

2015-10-09 Thread Ronen Arad

RTA_ALIGNTO is currently defined as 4. It has to be 4U to prevent warnings
for RTA_ALIGN and RTA_DATA expansions when the -Wconversion gcc option is
enabled.
This follows NLMSG_ALIGNTO definition in .

Signed-off-by: Ronen Arad 
---
 include/uapi/linux/rtnetlink.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 4db0b3c..123a5af 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -160,7 +160,7 @@ struct rtattr {
 
 /* Macros to handle rtattributes */
 
-#define RTA_ALIGNTO 4
+#define RTA_ALIGNTO 4U
 #define RTA_ALIGN(len) ( ((len)+RTA_ALIGNTO-1) & ~(RTA_ALIGNTO-1) )
 #define RTA_OK(rta,len) ((len) >= (int)sizeof(struct rtattr) && \
 (rta)->rta_len >= sizeof(struct rtattr) && \
-- 
2.1.0
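
A small userspace sketch of why the suffix matters. The SK_* macro names are
made up for the example so they do not clash with the real uapi header; the
arithmetic has the same shape as RTA_ALIGN.

```c
/* Sketch names; the real macros live in the uapi rtnetlink header. */
#define SK_ALIGNTO_INT   4
#define SK_ALIGNTO_UINT  4U
#define SK_ALIGN_INT(len)  (((len) + SK_ALIGNTO_INT - 1) & ~(SK_ALIGNTO_INT - 1))
#define SK_ALIGN_UINT(len) (((len) + SK_ALIGNTO_UINT - 1) & ~(SK_ALIGNTO_UINT - 1))

static unsigned int aligned_len_int(unsigned int len)
{
	/* ~(4 - 1) is the int value -4; combining it with an unsigned
	 * length implicitly converts a negative int to unsigned, which is
	 * the kind of conversion -Wconversion reports. */
	return SK_ALIGN_INT(len);
}

static unsigned int aligned_len_uint(unsigned int len)
{
	/* ~(4U - 1) is already unsigned, so no sign change is involved. */
	return SK_ALIGN_UINT(len);
}
```

Both forms compute the same value; only the type of the mask differs.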


Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-09 Thread Daniel Borkmann


On 10/09/2015 04:35 AM, Alexei Starovoitov wrote:

On 10/8/15 5:50 PM, Devon H. O'Dell wrote:

with the amount of skb_sender_cpu_clear() all over the code base
>I wonder whether there is a better solution to all of these.

I think there is. We found that splitting the union of sender_cpu and
napi_id solved the issue for us. In general, I think this is an OK
solution as long as the following hold:

  * skbs are always allocated via kzalloc
  * out -> out cloned skbs are always cloned on the same CPU
  * an extra four bytes in skbuff isn't a bad thing


I'm pretty sure extending sk_buff for this is not acceptable.


+1, I agree.


I was thinking may be we can use sign bit to distinguish between
napi_id and sender_cpu.
Like:
 if ((int)skb->sender_cpu >= 0)
 skb->sender_cpu = - (raw_smp_processor_id() + 1);
and inside get_xps_queue() use it only if it's negative.
Then we can remove skb_sender_cpu_clear() from everywhere.
Adding a check to napi_hash_add() to make sure that napi_id is not
negative is probably ok too.
Thoughts?
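
A userspace sketch of the sign-bit idea, to make the encoding concrete. The
struct and the helpers record_sender_cpu() and read_sender_cpu() are
hypothetical; the sketch assumes a 32-bit field shared between napi_id and
sender_cpu and assumes napi_id values never have the top bit set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared sk_buff field. */
struct skb_sketch {
	union {
		uint32_t napi_id;
		uint32_t sender_cpu;
	};
};

/* Record the sending CPU unless the slot already holds one. A stale,
 * non-negative napi_id is simply overwritten, so no clearing is needed. */
static void record_sender_cpu(struct skb_sketch *skb, int cpu)
{
	if ((int32_t)skb->sender_cpu >= 0)
		skb->sender_cpu = (uint32_t)-(cpu + 1);
}

/* Returns true and stores the CPU number if the slot holds a sender_cpu. */
static bool read_sender_cpu(const struct skb_sketch *skb, int *cpu)
{
	int32_t v = (int32_t)skb->sender_cpu;

	if (v >= 0)
		return false;  /* non-negative: a napi_id or an empty slot */
	*cpu = -v - 1;
	return true;
}
```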


I think this doesn't make it any more maintainable.

With skb_sender_cpu_clear(), one can at least git-grep to easily find
and review the call sites in the code. There are various members
already used differently depending on the context.

Thanks,
Daniel

Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Alexei Starovoitov


On 10/9/15 4:45 AM, Hannes Frederic Sowa wrote:

Afaics this problem hasn't even been solved in
perf so far; tracepoints hit independent of the namespace currently.


yes and that's exactly what we're trying to solve.
The "demux+worker bpf programs" proposal is a work-in-progress solution
to get confidence how to actually separate tracepoint events into
namespaces before adding any new APIs to kernel.


For me namespacing of ebpf code is actually not that important, I would
much rather like to control which namespace is allowed to execute ebpf
in an unprivileged manner. Like Thomas wrote, a capability would be great
for that, but I don't know if any new capabilities will be added.


I think we're mixing too many things here.
First I believe eBPF 'socket filters' do not need any caps.
They're packet read-only and functionally very similar to classic with
a distinction that packet data can be aggregated into maps and programs
can be written in C. So I see no reason to restrict them per user or
per namespace.
Openstack use case is different. There it will be prog_type_sched_cls
that can mangle packets, change skb metadata, etc under TC framework.
These are not suitable for all users and this patch leaves
them root-only. If you're proposing to add CAP_BPF_TC to let containers
use them without being CAP_SYS_ADMIN, then I agree, it is useful, but
needs a lot more safety analysis on tc side.
Similarly for prog_type_kprobe: we can add CAP_BPF_KPROBE to let
some trusted applications run unprivileged while still being able
to do performance monitoring/analytics.
And we would need to think carefully about program restrictions,
since bpf_probe_read and kernel pointer walking are an essential part
of tracing.


Re: [PATCH net-next] net: Fix vti use case with oif in dst lookups for IPv6

2015-10-09 Thread David Ahern


On 10/9/15 1:17 AM, Steffen Klassert wrote:

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 30caa289c5db..5cedfda4b241 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -37,6 +37,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, 
int tos, int oif,

memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_oif = oif;
+   fl6.flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF;
memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
if (saddr)
memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));


I found that this fix is still not sufficient with the mip6
(Mobile IPv6) use case.


It does not even fix the vti case. The behaviour of the vti devices is
the same, with and without the patch.



The attached patch applied to Linus' tree works for me. Currently the 
above change is not in his tree, so I added it to this patch. Once you 
confirm that it works for you I'll create the delta-patch for net and 
send out.


Thanks,
David
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 92b1aa38f121..2dbd73014a1b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -874,7 +874,8 @@ static struct dst_entry *ip6_sk_dst_check(struct sock *sk,
 #ifdef CONFIG_IPV6_SUBTREES
ip6_rt_check(&rt->rt6i_src, &fl6->saddr, np->saddr_cache) ||
 #endif
-   (fl6->flowi6_oif && fl6->flowi6_oif != dst->dev->ifindex)) {
+  (!(fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF) &&
+ (fl6->flowi6_oif && fl6->flowi6_oif != dst->dev->ifindex))) {
dst_release(dst);
dst = NULL;
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index cb32ce250db0..df24cff4a0cb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1068,6 +1068,9 @@ static struct rt6_info *ip6_pol_route(struct net *net, 
struct fib6_table *table,
fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
saved_fn = fn;
 
+   if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF)
+   oif = 0;
+
 redo_rt6_select:
rt = rt6_select(fn, oif, strict);
if (rt->rt6i_nsiblings)
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 30caa289c5db..5cedfda4b241 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -37,6 +37,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, 
int tos, int oif,
 
memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_oif = oif;
+   fl6.flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF;
memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
if (saddr)
memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));
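
For anyone skimming the three hunks, the intended effect of the flag can be
summarised in a few lines of userspace C. The struct, the helper names and the
flag value 0x08 are assumptions made for the sketch, not taken from the kernel
headers.

```c
#include <stdbool.h>

#define SK_FLAG_SKIP_NH_OIF 0x08  /* hypothetical value for the sketch */

struct flow_sketch {
	int oif;
	unsigned int flags;
};

/* Shape of the ip6_pol_route hunk: with the flag set, route selection
 * proceeds as if no output interface had been requested. */
static int effective_oif(const struct flow_sketch *fl)
{
	return (fl->flags & SK_FLAG_SKIP_NH_OIF) ? 0 : fl->oif;
}

/* Shape of the ip6_sk_dst_check hunk: an oif mismatch invalidates a
 * cached dst only when the flag is not set. */
static bool cached_dst_oif_mismatch(const struct flow_sketch *fl,
				    int dst_ifindex)
{
	if (fl->flags & SK_FLAG_SKIP_NH_OIF)
		return false;
	return fl->oif && fl->oif != dst_ifindex;
}
```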

Re: [PATCH net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

2015-10-09 Thread Devon H. O'Dell

On Thu, Oct 8, 2015 at 7:35 PM, Alexei Starovoitov  wrote:
> On 10/8/15 5:50 PM, Devon H. O'Dell wrote:
>>>
>>> with the amount of skb_sender_cpu_clear() all over the code base
>>> >I wonder whether there is a better solution to all of these.
>>
>> I think there is. We found that splitting the union of sender_cpu and
>> napi_id solved the issue for us. In general, I think this is an OK
>> solution as long as the following hold:
>>
>>   * skbs are always allocated via kzalloc
>>   * out -> out cloned skbs are always cloned on the same CPU
>>   * an extra four bytes in skbuff isn't a bad thing
>
>
> I'm pretty sure extending sk_buff for this is not acceptable.

That's unfortunate.

> I was thinking may be we can use sign bit to distinguish between
> napi_id and sender_cpu.
> Like:
> if ((int)skb->sender_cpu >= 0)
> skb->sender_cpu = - (raw_smp_processor_id() + 1);
> and inside get_xps_queue() use it only if it's negative.

I like the idea, but it seems unnecessarily magical. What about using
a bitfield? Then there's just an option bit that is either
OPTION_NAPI_ID or OPTION_SENDER_CPU. Then the check to set sender_cpu
in netdev_pick_tx becomes

if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) ...

> Then we can remove skb_sender_cpu_clear() from everywhere.
> Adding a check to napi_hash_add() to make sure that napi_id is not
> negative is probably ok too.

We could change this to check that sender_napi_option would be
OPTION_NAPI_ID with the bitfield idea.

My names are probably bad, but I think the idea is less magical (and
is effectively the same thing you are proposing).
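
To make the bitfield idea concrete, here is a rough userspace sketch. struct demo_skb and the demo_* helpers are hypothetical stand-ins, not sk_buff or any existing kernel function; only the OPTION_* names and the shape of the check come from the mail above.

```c
#include <stdint.h>

enum { OPTION_NAPI_ID = 0, OPTION_SENDER_CPU = 1 };

/* Stand-in for the relevant part of sk_buff: the union is kept and a
 * one-bit tag says which member is currently meaningful. */
struct demo_skb {
	union {
		uint32_t napi_id;
		uint32_t sender_cpu;
	};
	unsigned int sender_napi_option:1;
};

/* Roughly what the check in netdev_pick_tx() would become: a stale
 * napi_id is overwritten instead of being mistaken for a CPU number. */
static void demo_record_sender_cpu(struct demo_skb *skb, uint32_t this_cpu)
{
	if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) {
		skb->sender_cpu = this_cpu + 1;
		skb->sender_napi_option = OPTION_SENDER_CPU;
	}
}

/* What a get_xps_queue()-like consumer would look at: returns the
 * recorded CPU, or -1 when the field does not hold a sender CPU. */
static int demo_xps_cpu(const struct demo_skb *skb)
{
	if (skb->sender_napi_option != OPTION_SENDER_CPU || skb->sender_cpu == 0)
		return -1;

	return (int)skb->sender_cpu - 1;
}
```

The tag costs one bit rather than four bytes, and the rx path would set it back to OPTION_NAPI_ID whenever it writes napi_id.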

> Thoughts?

--dho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] ipvs: drop first packet to dead server

2015-10-09 Thread Jiri Bohac

Hi,

On Sun, Sep 27, 2015 at 08:25:18PM +0300, Julian Anastasov wrote:
> On Fri, 25 Sep 2015, Jiri Bohac wrote:
> 
> > if (!atomic_read(&cp->n_control))
> > ip_vs_conn_expire_now(cp);
> > __ip_vs_conn_put(cp);
> > -   cp = NULL;
> > +   return NF_DROP;
> 
>   So, at this point we do not know whether we have
> one or many real servers, with same or different forwarding
> method. For example, if we know that old real server is DR
> and the new real server is again DR we can reuse the conntrack.
> 
>   Without such info we have to drop the connection
> _only_ when conntrack is used.

right, good point!

> +static inline bool ip_vs_conn_uses_conntrack(struct ip_vs_conn *cp,
> + struct sk_buff *skb)
> +{
> +#ifdef CONFIG_IP_VS_NFCT
> + enum ip_conntrack_info ctinfo;
> + struct nf_conn *ct;
> +
> + if (!(cp->flags & IP_VS_CONN_F_NFCT))
> + return false;
> + ct = nf_ct_get(skb, &ctinfo);
> + if (ct && !nf_ct_is_untracked(ct))
> + return true;
> +#endif
> + return false;
> +}
> +

I tested this part. We found the problem on an old (3.12) kernel,
which lacks the parts dealing with rescheduling on port reuse and
only handles the "weight == 0" case.

> + if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) {
> + bool uses_ct = false, resched = false;
> +
> + if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
> + unlikely(!atomic_read(&cp->dest->weight))) {
> + resched = true;
> + uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
> + } else if (is_new_conn_expected(cp, conn_reuse_mode)) {
> + uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
> + if (!atomic_read(&cp->n_control)) {
> + resched = true;
> + } else {
> + /* Do not reschedule controlling connection
> +  * that uses conntrack while it is still
> +  * referenced by controlled connection(s).
> +  */
> + resched = !uses_ct;
> + }
> + }
> +
> + if (resched) {
> + if (!atomic_read(&cp->n_control))
> + ip_vs_conn_expire_now(cp);
> + __ip_vs_conn_put(cp);
> + if (uses_ct)
> + return NF_DROP;
> + cp = NULL;
> + }

Looks good, but I can't easily test this.
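
One way to exercise the decision logic without a test bed is to lift it into a pure function. This is only a sketch of the quoted hunk: the enum and demo_reuse_verdict() are invented names, and each bool parameter stands for the corresponding kernel check (for example dest_weight_zero means "cp->dest is set and its weight is 0").

```c
#include <stdbool.h>

enum demo_verdict { DEMO_KEEP, DEMO_RESCHED, DEMO_DROP };

/*
 * Mirrors the branch structure of the quoted hunk:
 *  - dead destination with expire_nodest_conn: always reschedule;
 *  - expected new connection: reschedule unless it is a controlling
 *    connection that still has controlled connections and uses conntrack;
 *  - when rescheduling a connection that uses conntrack, drop the packet.
 */
static enum demo_verdict demo_reuse_verdict(bool expire_nodest,
					    bool dest_weight_zero,
					    bool new_conn_expected,
					    bool has_controlled,
					    bool uses_ct)
{
	bool resched = false;

	if (expire_nodest && dest_weight_zero)
		resched = true;
	else if (new_conn_expected)
		resched = !has_controlled || !uses_ct;

	if (!resched)
		return DEMO_KEEP;

	return uses_ct ? DEMO_DROP : DEMO_RESCHED;
}
```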

Thanks,

-- 
Jiri Bohac 
SUSE Labs, SUSE CZ


Re: [PATCH net-next] net: Fix vti use case with oif in dst lookups for IPv6

2015-10-09 Thread David Ahern


On 10/9/15 1:17 AM, Steffen Klassert wrote:

On Fri, Oct 09, 2015 at 03:54:22PM +0900, Hajime Tazaki wrote:


Hello David,

At Mon,  5 Oct 2015 08:32:51 -0600,
David Ahern wrote:



diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 30caa289c5db..5cedfda4b241 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -37,6 +37,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,

memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_oif = oif;
+   fl6.flowi6_flags = FLOWI_FLAG_SKIP_NH_OIF;
memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
if (saddr)
memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));


I found that this fix is still not sufficient with the mip6
(Mobile IPv6) use case.


It does not even fix the vti case. The behaviour of the vti devices is
the same, with and without the patch.



I goofed; this patch was on top of my IPv6 VRF patches. You need the
FLOWI_FLAG_SKIP_NH_OIF bits from:


http://www.spinics.net/lists/netdev/msg346860.html

Let me cook up a patch based on Linus' tree which is where it is needed.

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-09 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Jarod Wilson wrote:
>...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
>Is there any particular userspace tool that would need some updating, or
>is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>stuff thrown together now, but still need to test.

Most (all?) bonding options should be configurable via iproute
(netlink) now.

>
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
>
>Okay, so I believe the "only tested with 500ms" was in reference to
>testing with Uwe's initial patch. I do have supporting evidence in a
>bugzilla report that shows upwards of 5000ms still experience the problem
>here.

I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

As I understand it, the working theory is something like this:

- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

- switch has MAC for B in its forwarding table

- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

- switch sees source mac coming from port A, starts to update
its forwarding table

- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

- switch finishes updating forwarding table, MAC is now assigned
to port A.

- bonding now tries sending on port B, and the cycle repeats.

If this is what's taking place, then the arp_interval itself is
irrelevant; the race is between the switch table update and the
generation of the ARP reply.

Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.

For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.
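
A rough sketch of that "accept during the search" rule, in plain userspace C. None of these names exist in the bonding driver; struct demo_bond, struct demo_slave and demo_accept_arp_reply() are hypothetical and only model the decision, not the real receive path.

```c
#include <stdbool.h>

struct demo_slave {
	int id;
	bool active;
};

struct demo_bond {
	bool searching;		/* true while probing for a working slave */
	int curr_arp_slave;	/* id of the slave currently being probed */
};

/*
 * Decide whether an ARP reply received on rx_slave should be accepted.
 * Replies on an active slave are always taken. During the search, a
 * unicast reply on an inactive slave is also taken, and that slave
 * becomes the next one to probe.
 */
static bool demo_accept_arp_reply(struct demo_bond *bond,
				  const struct demo_slave *rx_slave,
				  bool unicast_reply)
{
	if (rx_slave->active)
		return true;

	if (bond->searching && unicast_reply) {
		bond->curr_arp_slave = rx_slave->id;
		return true;
	}

	return false;
}
```

Under the working theory above, the reply that the switch forwards to port B would then no longer be dropped, and the search would follow the port the switch is actually using.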

I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.

Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-09 Thread Nikolay Aleksandrov

On 10/09/2015 04:36 PM, Jarod Wilson wrote:
> Jarod Wilson wrote:
> ...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
> 
> Is there any particular userspace tool that would need some updating, or is 
> adding the sysfs knobs sufficient here? I think I've got all the sysfs stuff 
> thrown together now, but still need to test.
> 
I'd say adding netlink support at this point is more important, and it'd be nice
if you can add support to iproute2 for the new attribute. Currently all bonding
options have both netlink and sysfs support, so you can follow that, the others
can correct me if I'm wrong here.

One more thing: please don't forget to update
Documentation/networking/bonding.txt

> 
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
> 
> Okay, so I believe the "only tested with 500ms" was in reference to testing 
> with Uwe's initial patch. I do have supporting evidence in a bugzilla report 
> that shows upwards of 5000ms still experience the problem here.
_5 seconds_ are not enough to receive a reply, but sending it twice
in a second fixes the issue ?!
This sounds like the ARP request is not properly handled/received
and there's no reply.

Cheers,
 Nik


[PATCH 1/2] net/fsl_pq_mdio: check TBI address for consistency with mapped range

2015-10-09 Thread Gerlando Falauto

When configuring the MDIO subsystem it is also necessary to configure
the TBI register. Make sure the TBI is contained within the mapped
register range in order to:
a) make sure the address is computed correctly
b) make users aware that we're actually accessing that register

In case of error, print a message but continue anyway.

Change-Id: If1e7d8931f440ea9259726c36d3df797dda016fb
Signed-off-by: Gerlando Falauto 
Cc: Timur Tabi 
Cc: David S. Miller 
Cc: Andy Fleming 
Cc: Kumar Gala 
---
 drivers/net/ethernet/freescale/fsl_pq_mdio.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.c b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
index 3c40f6b..4618011 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.c
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
@@ -445,6 +445,16 @@ static int fsl_pq_mdio_probe(struct platform_device *pdev)
 
tbipa = data->get_tbipa(priv->map);
 
+   /*
+* Add consistency check to make sure TBI is contained
+* within the mapped range (not because we would get a
+* segfault, rather to catch bugs in computing TBI
+* address). Print error message but continue anyway.
+*/
+   if (tbipa > priv->map + resource_size(&res))
+   dev_err(&pdev->dev, "invalid register map (should be at least 0x%04x to contain TBI address)\n",
+   ((void *)tbipa - priv->map) + 4);
+
iowrite32be(be32_to_cpup(prop), tbipa);
}
}
-- 
1.8.0.1
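
For reference, the containment idea from the commit message can be expressed on plain offsets. This is a sketch under assumptions: demo_reg_in_range() is a made-up helper, and it requires the whole register (offset plus width) to fit, which is one reading of "contained within the mapped register range" and not necessarily what the posted hunk tests.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when a register of reg_size bytes starting at reg_offset
 * lies entirely inside a mapping of map_size bytes. Written so that the
 * arithmetic cannot wrap around.
 */
static bool demo_reg_in_range(size_t reg_offset, size_t reg_size,
			      size_t map_size)
{
	return reg_size <= map_size && reg_offset <= map_size - reg_size;
}
```

A 4-byte register at offset 0x30 fits in a 0x34-byte mapping, but not in a 0x33-byte one.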


[PATCH 2/2] net/fsl_pq_mdio: fix computed address for the TBI register

2015-10-09 Thread Gerlando Falauto

commit afae5ad78b342f401c28b0bb1adb3cd494cb125a
  "net/fsl_pq_mdio: streamline probing of MDIO nodes"

added support for different types of MDIO devices:
1) Gianfar MDIO nodes that only map the MII registers
2) Gianfar MDIO nodes that map the full MDIO register set
3) eTSEC2 MDIO nodes (which map the full MDIO register set)
4) QE MDIO nodes (which map only the MII registers)

However, the implementation for types 1 and 4 would mistakenly assume
a mapping of the full MDIO register set, thereby computing the address
for the TBI register starting from the containing structure.
The TBI register would therefore be accessed at a wrong (much bigger)
address, not giving the expected result at all.
This patch restores the correct behavior we had prior to the above one.

The consequences of this bug are apparent when trying to access a PHY
with the same address as the value contained in the initial value of
the TBI register (normally 0); in that case you'll get answers from the
internal TBI device (even though MDIO/MDC pins are actually *also*
toggling on the physical bus!).
Beware that you also need to add a fake tbi node to your device tree
with an unused address.

Notice how this fix is related to commit
220669495bf8b68130a8218607147c7b74c28d2b
  "powerpc: Add TBI PHY node to first MDIO bus"

which fixed the behavior in kernel 3.3, which was later broken by the
above commit on kernel 3.7.

Change-Id: If78651268435aaed1f07ebdef374c46c0a528429
Signed-off-by: Gerlando Falauto 
Cc: Timur Tabi 
Cc: David S. Miller 
Cc: Andy Fleming 
Cc: Kumar Gala 
---
 drivers/net/ethernet/freescale/fsl_pq_mdio.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.c b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
index 4618011..9e12e02 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.c
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
@@ -198,11 +198,13 @@ static int fsl_pq_mdio_reset(struct mii_bus *bus)
 
 #if defined(CONFIG_GIANFAR) || defined(CONFIG_GIANFAR_MODULE)
 /*
+ * Return the TBIPA address, starting from the address
+ * of the mapped GFAR MDIO registers (struct gfar)
  * This is mildly evil, but so is our hardware for doing this.
  * Also, we have to cast back to struct gfar because of
  * definition weirdness done in gianfar.h.
  */
-static uint32_t __iomem *get_gfar_tbipa(void __iomem *p)
+static uint32_t __iomem *get_gfar_tbipa_from_mdio(void __iomem *p)
 {
struct gfar __iomem *enet_regs = p;
 
@@ -210,6 +212,15 @@ static uint32_t __iomem *get_gfar_tbipa(void __iomem *p)
 }
 
 /*
+ * Return the TBIPA address, starting from the address
+ * of the mapped GFAR MII registers (gfar_mii_regs[] within struct gfar)
+ */
+static uint32_t __iomem *get_gfar_tbipa_from_mii(void __iomem *p)
+{
+   return get_gfar_tbipa_from_mdio(container_of(p, struct gfar, gfar_mii_regs));
+}
+
+/*
  * Return the TBIPAR address for an eTSEC2 node
  */
 static uint32_t __iomem *get_etsec_tbipa(void __iomem *p)
@@ -220,11 +231,12 @@ static uint32_t __iomem *get_etsec_tbipa(void __iomem *p)
 
 #if defined(CONFIG_UCC_GETH) || defined(CONFIG_UCC_GETH_MODULE)
 /*
- * Return the TBIPAR address for a QE MDIO node
+ * Return the TBIPAR address for a QE MDIO node, starting from the address
+ * of the mapped MII registers (struct fsl_pq_mii)
  */
 static uint32_t __iomem *get_ucc_tbipa(void __iomem *p)
 {
-   struct fsl_pq_mdio __iomem *mdio = p;
+   struct fsl_pq_mdio __iomem *mdio = container_of(p, struct fsl_pq_mdio, mii);
 
return &mdio->utbipar;
 }
@@ -300,14 +312,14 @@ static const struct of_device_id fsl_pq_mdio_match[] = {
.compatible = "fsl,gianfar-tbi",
.data = &(struct fsl_pq_mdio_data) {
.mii_offset = 0,
-   .get_tbipa = get_gfar_tbipa,
+   .get_tbipa = get_gfar_tbipa_from_mii,
},
},
{
.compatible = "fsl,gianfar-mdio",
.data = &(struct fsl_pq_mdio_data) {
.mii_offset = 0,
-   .get_tbipa = get_gfar_tbipa,
+   .get_tbipa = get_gfar_tbipa_from_mii,
},
},
{
@@ -315,7 +327,7 @@ static const struct of_device_id fsl_pq_mdio_match[] = {
.compatible = "gianfar",
.data = &(struct fsl_pq_mdio_data) {
.mii_offset = offsetof(struct fsl_pq_mdio, mii),
-   .get_tbipa = get_gfar_tbipa,
+   .get_tbipa = get_gfar_tbipa_from_mdio,
},
},
{
-- 
1.8.0.1
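
The address mix-up described in the commit message can be reproduced in userspace. The layout below is invented for illustration (struct demo_mdio is not the real struct fsl_pq_mdio, and demo_container_of is a simplified copy of the kernel macro); it only shows why treating a pointer to the MII sub-block as a pointer to the containing structure yields a too-large TBI address.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified container_of(): step back from a member to its container. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_mii {
	uint32_t miimcfg;
	uint32_t miimcom;
};

/* Hypothetical register layout: TBI register first, MII block later. */
struct demo_mdio {
	uint8_t res1[16];
	uint32_t utbipar;
	uint8_t res2[12];
	struct demo_mii mii;
	uint8_t res3[32];
};

/* Buggy variant: assumes p already points at the containing structure. */
static uint32_t *demo_tbipa_buggy(void *p)
{
	struct demo_mdio *mdio = p;

	return &mdio->utbipar;
}

/* Fixed variant: p points at the MII block, so recover the container. */
static uint32_t *demo_tbipa_fixed(void *p)
{
	struct demo_mdio *mdio = demo_container_of(p, struct demo_mdio, mii);

	return &mdio->utbipar;
}
```

When only the MII registers are mapped, the buggy variant is off by exactly offsetof(struct demo_mdio, mii), which matches the "wrong (much bigger) address" symptom.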


pull-request: wireless-drivers-next 2015-10-09

2015-10-09 Thread Kalle Valo

Hi Dave,

here's the first wireless-drivers pull request for 4.4. New features and
bugfixes, but not really anything out of the ordinary. Please let me know
if there are any problems.

Kalle

The following changes since commit 4730b4331ec58a74a66a044341f0114b02b3:

  sch_dsmark: improve memory locality (2015-09-17 22:37:19 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2015-10-09

for you to fetch changes up to 7e64e5e66af8308725bfd03fcdf185c09b3056a7:

  Merge tag 'iwlwifi-next-for-kalle-2015-10-05' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next (2015-10-07 12:14:23 +0300)



Major changes:

iwlwifi

* some debugfs improvements
* fix signedness in beacon statistics
* deinline some functions to reduce size when device tracing is enabled
* filter beacons out in AP mode when no stations are associated
* deprecate firmwares version -12
* fix a runtime PM vs. legacy suspend race
* one-liner fix for a ToF bug
* clean-ups in the rx code
* small debugging improvement
* fix WoWLAN with new firmware versions
* more clean-ups towards multiple RX queues;
* some rate scaling fixes and improvements;
* some time-of-flight fixes;
* other generic improvements and clean-ups;

brcmfmac

* rework code dealing with multiple interfaces
* allow logging firmware console using debug level
* support for BCM4350, BCM4365, and BCM4366 PCIE devices
* fixes for legacy P2P and P2P device handling
* correct set and get tx-power

ath9k

* add support for Outside Context of a BSS (OCB) mode

mwifiex

* add USB multichannel feature


Amitkumar Karwar (2):
  mwifiex: avoid memsetting PCIe event buffer
  mwifiex: Suppress -ENOSR error for data traffic on USB

Aniket Nagarnaik (2):
  mwifiex: don't always include ht/vht info in tdls confirm frame
  mwifiex: fix NULL pointer dereference during hidden SSID scan

Arend van Spriel (12):
  brcmfmac: consolidate ifp lookup in driver core
  brcmfmac: make brcmf_proto_hdrpull() return struct brcmf_if instance
  brcmfmac: change parameters for brcmf_remove_interface()
  brcmfmac: only call brcmf_cfg80211_detach() when attach was successful
  brcmfmac: correct detection of p2pdev interface event
  brcmfmac: use brcmf_get_ifp() to map ifidx to struct brcmf_if instance
  brcmfmac: pass struct brcmf_if instance in brcmf_txfinalize()
  brcmfmac: add mapping for interface index to bsscfg index
  brcmfmac: add dedicated debug level for firmware console logging
  brcmfmac: remove ifidx parameter from brcmf_fws_txstatus_suppressed()
  brcmfmac: change prototype for brcmf_fws_hdrpull()
  brcmfmac: introduce brcmf_net_detach() function

Assaf Krauss (2):
  iwlwifi: mvm: Fix tof debugfs formats (dec vs. hex)
  iwlwifi: mvm: Improve debugfs tof robustness

Aviya Erenfeld (1):
  iwlwifi: mvm: move DTS command and notification to new group

Bartosz Markowski (2):
  ath10k: fix beamformee VHT STS capability
  ath10k: fix beamformer VHT sounding dimensions capability

Bob Copeland (3):
  ath10k: enable monitor when OTHER_BSS requested
  ath10k: check for encryption before adding MIC_LEN
  ath10k: implement mesh support

Dan Carpenter (1):
  mwifiex: fix mwifiex_rdeeprom_read()

Eliad Peller (2):
  iwlwifi: mvm: configure wowlan configuration only if connected
  iwlwifi: mvm: add debug print for d0i3 exit indication

Emmanuel Grumbach (8):
  iwlwifi: mvm: add debugfs hook to send ECHO_CMD to the firmware
  iwlwifi: Deinline iwl_{read,write}{8,32}
  iwlwifi: mvm: don't load -12.ucode anymore
  iwlwifi: mvm: remove IWL_UCODE_TLV_API_HDC_PHASE_0 TLV flag
  iwlwifi: mvm: remove IWL_UCODE_TLV_API_TX_POWER_DEV TLV flag
  iwlwifi: mvm: remove IWL_UCODE_TLV_API_SINGLE_SCAN_EBS TLV flag
  iwlwifi: mvm: remove IWL_UCODE_TLV_API_ASYNC_DTM TLV flag
  iwlwifi: mvm: remove IWL_UCODE_TLV_API_STATS_V10 TLV flag

Eyal Shapira (5):
  iwlwifi: mvm: rs: improve rate debug messages
  iwlwifi: mvm: rs: remove overflowing debug message
  iwlwifi: mvm: rs: minor indentation fix
  iwlwifi: mvm: rs: fix success ratio comparison in rs_get_best_rate
  iwlwifi: mvm: rs: dynamically switch between 80MHz and 20MHz in some scenarios

Geoff Levand (1):
  net/wireless/wl18xx: Add missing MODULE_FIRMWARE

Gregory Greenman (2):
  iwlwifi: mvm: don't ask for beacons when AP vif and no assoc sta
  iwlwifi: mvm: ToF - fill bssid of responder configuration

Guodong Xu (1):
  wlcore: align reg_ch_conf_last[] to 64bit

Hante Meuleman (17):
  brcmfmac: Reset PCIE devices after recognition.
  brcmfmac: Fix exception handling.
  brcmfmac: Add support for the BCM4350 PCIE device.
  brcmfmac: Fix set and get tx-p

Re: [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()

2015-10-09 Thread Jason Baron

On 10/09/2015 12:29 AM, kbuild test robot wrote:
> Hi Jason,
> 
> [auto build test ERROR on v4.3-rc3 -- if it's inappropriate base, please ignore]
> 
> config: x86_64-randconfig-i0-201540 (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>net/unix/af_unix.c: In function 'unix_dgram_writable':
>>> net/unix/af_unix.c:2465:3: error: 'other_full' undeclared (first use in this function)
>  *other_full = false;
>   ^
>net/unix/af_unix.c:2465:3: note: each undeclared identifier is reported only once for each function it appears in
> 


Forgot to refresh this patch before sending. The one that I tested with
is below.

Thanks,

-Jason




Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.
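
The pairing between the sender side ("set the flag, barrier, re-check the queue") and the receiver side ("dequeue, barrier, test the flag") can be sketched with C11 atomics. This is a simplified userspace model, not the patch itself: struct demo_queue and both helpers are invented, the wakeup is reduced to a counter, and an atomic exchange replaces the test_bit()/clear_bit() pair.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_queue {
	atomic_int len;		/* queued datagrams */
	int max;		/* queue is "full" at this length */
	atomic_bool nospace;	/* models the UNIX_NOSPACE bit */
	int wakeups;		/* stands in for waking the peer_wait queue */
};

static bool demo_queue_full(struct demo_queue *q)
{
	return atomic_load(&q->len) >= q->max;
}

/*
 * Sender side: announce interest first, then re-check. The full fence
 * orders the flag store before the queue-length load, so either the
 * sender sees free space or the receiver sees the flag.
 */
static bool demo_sender_must_wait(struct demo_queue *q)
{
	atomic_store(&q->nospace, true);
	atomic_thread_fence(memory_order_seq_cst);
	return demo_queue_full(q);
}

/*
 * Receiver side: dequeue first, then look at the flag. The wakeup is
 * issued only when a sender actually announced that it ran out of space.
 */
static void demo_receive_one(struct demo_queue *q)
{
	atomic_fetch_sub(&q->len, 1);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_exchange(&q->nospace, false))
		q->wakeups++;
}
```

The point of the pairing is that a receive with no blocked sender costs one flag test instead of an unconditional wakeup.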

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron 
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c| 92 +--
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
unsigned long   flags;
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE 1
+#define UNIX_NOSPACE   2
struct socket_wq peer_wq;
wait_queue_t wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..ac9bcd8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
 
prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);
 
+   set_bit(UNIX_NOSPACE, &u->flags);
+   /* Ensure that we either see space in the peer sk_receive_queue via the
+* unix_recvq_full() check below, or we receive a wakeup when it
+* empties. Pairs with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
if (unix_peer(other) != sk && unix_recvq_full(other)) {
if (!timeo) {
-   err = -EAGAIN;
-   goto out_unlock;
-   }
-
-   timeo = unix_wait_for_peer(other, timeo);
+   set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+   /* Ensure that we either see space in the peer
+* sk_receive_queue via the unix_recvq_full() check
+* below, or we receive a wakeup when it empties. This
+* makes sure that epoll ET triggers correctly. Pairs
+* with the mb in unix_dgram_recvmsg().
+*/
+   smp_mb__after_atomic();
+   if (unix_recvq_full(other)) {
+   err = -EAGAIN;
+   goto out_unlock;
+   }
+   } else {
+   timeo = unix_wait_for_peer(other, timeo);
 
-   err = sock_intr_errno(timeo);
-   if (signal_pending(current))
-   goto out_free;
+   err = sock_intr_errno(timeo);
+   if (signal_pending(current))
+   goto out_free;
 
-   goto restart;
+   goto restart;
+   }
}
 
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(&u->peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   /* Ensure that waiters on our sk->sk_receive_queue draining that check
+* via unix_recvq_full() either see space in the queue or get a wakeup
+* below. sk->sk_receive_queue is reduced by the __skb_recv_datagram()
+* call above. Pairs with the mb in unix_dgram_sendmsg(),
+*unix_dgram_poll(), and unix_wait_for_peer().
+*/
+   smp_mb();
+   if (test_bit(UNIX_NOSPACE, &u->flags)) {
+   clear_bit(UNIX_NOSPACE, &u->flags);
+   wake_up_interruptible_sync_poll(&u->peer_w

Re: [PATCH net-next 0/4] tcp: better smp listener behavior

2015-10-09 Thread Eric Dumazet

On Fri, 2015-10-09 at 07:29 -0700, Eric Dumazet wrote:

> So the answer is : about 800,000 SYN per second in IPV6 with purely DDOS attack

My IPv6 routing setup was a bit silly ;)

With a slightly better one, we reach 3.8 Mpps and kernel profile looks
like :

21.22%  [kernel]  [k] ip6_pol_route.isra.47  
11.42%  [kernel]  [k] _raw_read_lock_bh  
 9.83%  [kernel]  [k] fib6_lookup
 8.96%  [kernel]  [k] _raw_read_unlock_bh
 8.47%  [kernel]  [k] fib6_get_table 
 4.01%  [kernel]  [k] __inet6_lookup_established 
 2.94%  [kernel]  [k] memcpy_erms
 2.36%  [kernel]  [k] dst_release



Re: [PATCH 3.19 and earlier] fib_rules: Fix dump_rules() not to exit early

2015-10-09 Thread Thomas Jarosch

Hi Roland,

On Monday, 5. October 2015 10:29:28 Roland Dreier wrote:
> From: Roland Dreier 
> 
> Backports of 41fc014332d9 ("fib_rules: fix fib rule dumps across
> multiple skbs") introduced a regression in "ip rule show" - it ends up
> dumping the first rule over and over and never exiting, because 3.19
> and earlier are missing commit 053c095a82cf ("netlink: make
> nlmsg_end() and genlmsg_end() void"), so fib_nl_fill_rule() ends up
> returning skb->len (i.e. > 0) in the success case.
> 
> Fix this by checking the return code for < 0 instead of != 0.

thanks for this fix. You just saved me an afternoon of bisecting :)

I can confirm that this fixes the mentioned issue introduced in 3.14.54.
We have an automated ipsec VPN test that failed after the upgrade:
The "ip rule list" command was hanging forever.
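
For anyone wondering why the "!= 0" check loops forever, a toy model of the dump loop shows it. Everything here is made up for illustration (demo_fill_rule(), demo_dump_rules(), the fixed rule length); it only mimics a fill helper that returns a positive length on success, as described in Roland's commit message.

```c
#define DEMO_RULE_LEN 16

/*
 * Mimics a fill helper on kernels where success returns the message
 * length (> 0) and a full buffer returns a negative value.
 */
static int demo_fill_rule(int *room)
{
	if (*room < DEMO_RULE_LEN)
		return -1;

	*room -= DEMO_RULE_LEN;
	return DEMO_RULE_LEN;
}

/*
 * Dumps rules starting at index start into a buffer with room bytes and
 * returns the index the next dump call would resume from.
 * check_nonzero selects the broken "!= 0" test instead of "< 0".
 */
static int demo_dump_rules(int start, int nrules, int room, int check_nonzero)
{
	int idx;

	for (idx = start; idx < nrules; idx++) {
		int err = demo_fill_rule(&room);

		if (check_nonzero ? err != 0 : err < 0)
			break;
	}

	return idx;
}
```

With the "!= 0" test the loop stops right after emitting the first rule without advancing the index, so every following dump call resumes at the same rule; with "< 0" it only stops when the buffer is really full.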

Cheers,
Thomas


Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()

2015-10-09 Thread Hannes Frederic Sowa

Hi,

Jason Baron  writes:

> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
> queue associated with the socket s that we are poll'ing against, but also calls
> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
> if we call poll()/select()/epoll() for the socket s, there are then
> a couple of code paths in which the remote peer socket p and its associated
> peer_wait queue can be freed before poll()/select()/epoll() have a chance
> to remove themselves from the remote peer socket.
>
> The way that remote peer socket can be freed are:
>
> 1. If s calls connect() to connect to a new socket other than p, it will
> drop its reference on p, and thus a close() on p will free it.
>
> 2. If we call close() on p, then a subsequent sendmsg() from s will drop
> the final reference to p, allowing it to be freed.
>
> Address this issue, by reverting unix_dgram_poll() to only register with
> the wait queue associated with s and register a callback with the remote peer
> socket on connect() that will wake up the wait queue associated with s. If
> scenarios 1 or 2 occur above we then simply remove the callback from the
> remote peer. This then presents the expected semantics to poll()/select()/
> epoll().
>
> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
>
> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
> DGRAM sockets").
>
> Tested-by: Mathias Krause 
> Signed-off-by: Jason Baron 

While I think this approach works, I haven't seen where the current code
leaks a reference. Assignments to unix_peer(sk) in general take the
spin_lock and increment the refcount. Are there bugs at the two places
you referred to?

Is an easier fix just to use atomic_inc_not_zero(&sk->sk_refcnt) in
unix_peer_get() which could also help other places?
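
The semantics being relied on here, "take a reference only if the object is not already on its way to being freed", look roughly like this in C11 atomics. It is a userspace illustration of the primitive, not kernel code; demo_inc_not_zero() is a made-up name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *ref unless it is zero. Returns true when a reference was
 * taken, false when the count had already dropped to zero.
 */
static bool demo_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}

	return false;
}
```

A caller that gets false must treat the peer as gone. Note that this only helps while the memory holding the counter is itself still valid.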

Thanks,
Hannes

Re: [patch net-next v2 00/13] rocker: add support for multiple worlds

2015-10-09 Thread Jiri Pirko

Wed, Oct 07, 2015 at 07:39:56PM CEST, j...@resnulli.us wrote:
>Wed, Oct 07, 2015 at 06:53:22PM CEST, sfel...@gmail.com wrote:
>>On Tue, Oct 6, 2015 at 11:03 PM, Jiri Pirko  wrote:
>>> Tue, Oct 06, 2015 at 07:14:39PM CEST, sfel...@gmail.com wrote:
On Tue, Oct 6, 2015 at 12:30 AM, Jiri Pirko  wrote:
> Tue, Oct 06, 2015 at 05:56:12AM CEST, sfel...@gmail.com wrote:
>>On Mon, Oct 5, 2015 at 10:43 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> This patchset allows new rocker worlds to be easily added in future (like eBPF
>>> based one I have been working on). The main part of the patchset is the OF-DPA
>>> carve-out. It results in OF-DPA specific file. Clean cut.
>>>
>>> v1->v2:
>>>  - rtnl rocker mode change userspace expose patch was removed
>>>
>>> Jiri Pirko (13):
>>>   rocker: remove unused rocker_port param from alloc funcs and shorten
>>> their names
>>>   rocker: rename rocker.h to rocker_hw.h
>>>   rocker: rename rocker.c to rocker_main.c
>>>   rocker: push tlv processing into separate files
>>>   rocker: implement set settings mode command
>>>   rocker: introduce worlds infrastructure
>>>   rocker: introduce OF-DPA world skeleton
>>>   rocker: set default world on port probe and clean world on remove
>>>   rocker: pass "learning" value as a parameter to
>>> rocker_port_set_learning
>>>   rocker: pre-allocate wait structures during cmd ring init
>>>   rocker: remove trans parameter to rocker_cmd_exec function
>>>   rocker: call rocker_cmd_exec function with "nowait" boolean instead of
>>> flags
>>>   rocker: move OF-DPA stuff into separate file
>>
>>A couple of my tests are failing with this patchset.  A simple port
>>test is failing and IPv4 routing test is failing.
>>
>>The port test is simple: just connect a port on DUT to a port on
>>another system and assign an IP address to each port and verify IP
>>connectivity.  I have this:
>>
>>   DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
>>
>>The IPv4 routing tests is a bit more complicated to setup.  I'm using
>>OSPF, but I'm not seeing full routes formed in the topology, so I
>>suspect OSPF hellos aren't getting thru.
>>
>>Please fix find/fix these issues and send v3.  I don't want any git
>>bisect issues when running tests.  Thanks.
>
> I fixed that. Sending v3 in a sec. Thanks.

Sorry, both tests are still broken.  Would you send me your tests
scripts so I can see why your tests are passing?
>>>
>>> I'm trying some smoke tests including bridge setup and just ip-ip
>>> setup by hand. Maybe if you send me your scripts, I can run it locally.
>>
>>My test scripts are already included in the qemu tree.
>
>Okay, will rework and use your scripts. Hope I will find some time
>during this weekend.

Scott, could you try to test with current net-next?
I'm trying basic:
DUT:sw1p1 (11.0.0.1/24) <---> host1:eth0 (11.0.0.2/24)
and it does not work for me now. It worked previously when I tested with
my patchset. This is getting odd.

Re: [PATCH-next v2 0/4] make non-modular code explicitly non-modular

2015-10-09 Thread David Miller

From: Paul Gortmaker 
Date: Wed, 7 Oct 2015 17:27:42 -0400

> [v2: drop m68k patches that Geert converted to modules; add one ARM
>  driver patch ; update net-next baseline to today; switch to ARM
>  for build testing.]
> 
> In a previous merge window, we made changes to allow better
> delineation between modular and non-modular code in commit
> 0fd972a7d91d6e15393c449492a04d94c0b89351 ("module: relocate module_init
> from init.h to module.h").  This allows us to now ensure module code
> looks modular and non-modular code does not accidentally look modular
> just to avoid suffering build breakage.
> 
> Here we target code that is, by nature of their Makefile and/or
> Kconfig settings, only available to be built-in, but implicitly
> presenting itself as being possibly modular by way of using modular
> headers, macros, and functions.
> 
> The goal here is to remove that illusion of modularity from these
> files, but in a way that leaves the actual runtime unchanged.
> In doing so, we remove code that has never been tested and adds
> no value to the tree.  And we continue the process of expecting a
> level of consistency between the Kconfig/Makefile of code and the
> code in use itself.
> 
> Fortunately the net subsystem has relatively few instances, given
> the overall amount of code and drivers it contains.  For comparison
> there are over 300 instances tree wide, resulting in a possible net
> removal of on the order of 5000 lines of unused code.
> 
> Build tested on net-next from today, on ARM, since that is the arch
> where the one ethernet driver changed here is available.

Series applied, thanks Paul.

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-09 Thread Jarod Wilson


Jarod Wilson wrote:
...

As Andy already stated I'm not a fan of such workarounds either but it's
necessary sometimes so if this is going to be actually considered then a
few things need to be fixed. Please make this a proper bonding option
which can be changed at runtime and not only via a module parameter.


Is there any particular userspace tool that would need some updating, or 
is adding the sysfs knobs sufficient here? I think I've got all the 
sysfs stuff thrown together now, but still need to test.




Now, I saw that you've only tested with 500 ms, can't this be fixed by
using
a different interval ? This seems like a very specific problem to have a
whole new option for.


...I'll wait until we've heard confirmation from Uwe that intervals
other than 500ms don't fix things.


Okay, so I believe the "only tested with 500ms" was in reference to 
testing with Uwe's initial patch. I do have supporting evidence in a 
bugzilla report that shows upwards of 5000ms still experience the 
problem here.




--
Jarod Wilson
ja...@redhat.com



Re: [PATCH 0/9] net: small improvement

2015-10-09 Thread David Miller

From: Yaowei Bai 
Date: Thu,  8 Oct 2015 21:28:53 +0800

> This patchset makes several functions in net return bool to improve
> readability and/or simplicity because these functions only use one
> or zero as their return value.
> 
> No functional changes.

Series applied, thank you.

[PATCH] sunrpc: avoid warning in gss_key_timeout

2015-10-09 Thread Arnd Bergmann

The gss_key_timeout() function causes a harmless warning in some
configurations, e.g. ARM imx_v6_v7_defconfig with gcc-5.2, if the
compiler cannot figure out the state of the 'expire' variable across
an rcu_read_unlock():

net/sunrpc/auth_gss/auth_gss.c: In function 'gss_key_timeout':
net/sunrpc/auth_gss/auth_gss.c:1422:211: warning: 'expire' may be used 
uninitialized in this function [-Wmaybe-uninitialized]

To avoid this warning without adding a bogus initialization, this
rewrites the function so the comparison is done inside of the
critical section. As a side-effect, it also becomes slightly
easier to understand because the implementation now more closely
resembles the comment above it.

Signed-off-by: Arnd Bergmann 
Fixes: c5e6aecd034e7 ("sunrpc: fix RCU handling of gc_ctx field")

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index dace13d7638e..799e65b944b9 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1411,17 +1411,16 @@ gss_key_timeout(struct rpc_cred *rc)
 {
struct gss_cred *gss_cred = container_of(rc, struct gss_cred, gc_base);
struct gss_cl_ctx *ctx;
-   unsigned long now = jiffies;
-   unsigned long expire;
+   unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ);
+   int ret = 0;
 
rcu_read_lock();
ctx = rcu_dereference(gss_cred->gc_ctx);
-   if (ctx)
-   expire = ctx->gc_expiry - (gss_key_expire_timeo * HZ);
+   if (!ctx || time_after(timeout, ctx->gc_expiry))
+   ret = -EACCES;
rcu_read_unlock();
-   if (!ctx || time_after(now, expire))
-   return -EACCES;
-   return 0;
+
+   return ret;
 }
 
 static int
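
For illustration only (not part of the patch): the rewrite relies on the
old and new comparisons being the same test in wrap-around jiffies
arithmetic. Below is a minimal userspace sketch, assuming a stand-in
time_after_ul() that mirrors the kernel's time_after() and two
hypothetical helpers, expired_old() and expired_new(), for the two forms.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's time_after(a, b): true if a is
 * later than b, tolerating counter wrap-around. */
static bool time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Old form: subtract the margin from the expiry, then compare with now. */
static bool expired_old(unsigned long now, unsigned long expiry,
			unsigned long margin)
{
	return time_after_ul(now, expiry - margin);
}

/* New form: add the margin to now, then compare with the expiry. */
static bool expired_new(unsigned long now, unsigned long expiry,
			unsigned long margin)
{
	return time_after_ul(now + margin, expiry);
}
```

Both helpers reduce to the sign of (expiry - margin - now) in modular
arithmetic, so they should agree for any inputs, including values that
straddle a counter wrap.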


Re: [PATCH net 0/5] Mellanox driver update to 4.3-rc4

2015-10-09 Thread David Miller

From: Or Gerlitz 
Date: Thu,  8 Oct 2015 15:26:14 +0300

> Small set of fixes for net, which includes Carol's patches, a fix
> from Achiad to have the right behaviour for mlx5 Eth devices w.r.t 
> VLANs in promiscuous mode, a good-bye patch from Ido who left Mellanox
> and the 1st patch from Jiri to our NIC drivers (I love one-liners)...

Series applied except patch #2 which I applied already.

Thanks.

Re: [PATCH net-next 0/4] tcp: better smp listener behavior

2015-10-09 Thread Eric Dumazet

On Fri, 2015-10-09 at 03:50 -0700, Eric Dumazet wrote:
> On Thu, 2015-10-08 at 20:42 -0700, Tom Herbert wrote:
> > On Thu, Oct 8, 2015 at 8:37 AM, Eric Dumazet  wrote:
> > > As promised in last patch series, we implement a better SO_REUSEPORT
> > > strategy, based on cpu affinities if selected by the application.
> > >
> > > We also moved sk_refcnt out of the cache line containing the lookup
> > > keys, as it was considerably slowing down smp operations because
> > > of false sharing. This was simpler than converting listen sockets
> > > to conventional RCU (to avoid sk_refcnt dirtying)
> > >
> > > Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
> > >
> > Is this IPv4, IPv6, or some combination of the two ? :-)
> 
> IPv4 only (mostly because I was using trafgen and its csumtcp() only
> deals with IPv4 and I am lazy)
> 
> I guess IPv6 one might hit some issues before reaching TCP stack, I do
> not see anything performance related in TCP itself.
> 
> 

So the answer is: about 800,000 SYN per second in IPv6 under a pure DDoS attack

We hit neighbor cache badly.

[  377.188231] neighbour: ndisc_cache: neighbor table overflow!
[  377.188234] neighbour: ndisc_cache: neighbor table overflow!
[  382.193043] net_ratelimit: 36622 callbacks suppressed
[  382.193046] neighbour: ndisc_cache: neighbor table overflow!
[  382.193051] neighbour: ndisc_cache: neighbor table overflow!
[  382.193054] neighbour: ndisc_cache: neighbor table overflow!

59.79%  [kernel]  [k] queued_spin_lock_slowpath 
 4.94%  [kernel]  [k] queued_write_lock_slowpath
 2.12%  [kernel]  [k] sha_transform 
 1.93%  [kernel]  [k] ip6_pol_route.isra.47 
 1.58%  [kernel]  [k] __neigh_create
 1.30%  [kernel]  [k] inet6_lookup_listener 
 1.19%  [kernel]  [k] memcpy_erms   
 1.09%  [kernel]  [k] _raw_read_lock_bh 
 0.88%  [kernel]  [k] memset_erms   
 0.85%  [kernel]  [k] __inet6_lookup_established
 0.84%  [kernel]  [k] ndisc_constructor 
 0.78%  [kernel]  [k] fib6_get_table
 0.71%  [kernel]  [k] ip6t_do_table 
 0.71%  [kernel]  [k] _raw_read_unlock_bh   
 0.66%  [kernel]  [k] _raw_write_lock_bh
 0.62%  [kernel]  [k] fib6_lookup   
 0.54%  [kernel]  [k] tcp_make_synack   
 0.54%  [kernel]  [k] tcp_conn_request  




Re: [PATCH 4/9] net/can: can_dropped_invalid_skb can be boolean

2015-10-09 Thread Yaowei Bai

On Fri, Oct 09, 2015 at 12:14:31PM +0200, Marc Kleine-Budde wrote:
> On 10/08/2015 03:28 PM, Yaowei Bai wrote:
> > This patch makes can_dropped_invalid_skb return bool due to this
> > particular function only using either one or zero as its return
> > value.
> > 
> > No functional change.
> > 
> > Signed-off-by: Yaowei Bai 
> 
> Acked-by: Marc Kleine-Budde 

Thanks.

> 
> Yaowei, feel free to send the CAN patch as part of your series directly
> to David.

OK, i'll do that and sorry for disturbing you. :)

Thanks
Bai

> 
> Marc
> 
> -- 
> Pengutronix e.K.  | Marc Kleine-Budde   |
> Industrial Linux Solutions| Phone: +49-231-2826-924 |
> Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
> Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |
> 




[4.1.3-rt8] [report][cpuhotplug] BUG: spinlock bad magic on CPU#0, sh/137

2015-10-09 Thread Grygorii Strashko

Hi All,

I consistently see the error report below with the 4.1 RT-kernel on TI ARM dra7-evm
when I try to unplug cpu1:

[   57.737589] CPU1: shutdown
[   57.767537] BUG: spinlock bad magic on CPU#0, sh/137
[   57.767546]  lock: 0xee994730, .magic: , .owner: /-1, 
.owner_cpu: 0
[   57.767552] CPU: 0 PID: 137 Comm: sh Not tainted 
4.1.10-rt8-01700-g2c38702-dirty #55
[   57.767555] Hardware name: Generic DRA74X (Flattened Device Tree)
[   57.767568] [] (unwind_backtrace) from [] 
(show_stack+0x20/0x24)
[   57.767579] [] (show_stack) from [] 
(dump_stack+0x84/0xa0)
[   57.767593] [] (dump_stack) from [] (spin_dump+0x84/0xac)
[   57.767603] [] (spin_dump) from [] (spin_bug+0x34/0x38)
[   57.767614] [] (spin_bug) from [] 
(do_raw_spin_lock+0x168/0x1c0)
[   57.767624] [] (do_raw_spin_lock) from [] 
(_raw_spin_lock+0x4c/0x54)
[   57.767631] [] (_raw_spin_lock) from [] 
(rt_spin_lock_slowlock+0x5c/0x374)
[   57.767638] [] (rt_spin_lock_slowlock) from [] 
(rt_spin_lock+0x38/0x70)
[   57.767649] [] (rt_spin_lock) from [] 
(skb_dequeue+0x28/0x7c)
[   57.767662] [] (skb_dequeue) from [] 
(dev_cpu_callback+0x1b8/0x240)
[   57.767673] [] (dev_cpu_callback) from [] 
(notifier_call_chain+0x3c/0xb4)
[   57.767683] [] (notifier_call_chain) from [] 
(__raw_notifier_call_chain+0x24/0x2c)
[   57.767692] [] (__raw_notifier_call_chain) from [] 
(cpu_notify+0x34/0x50)
[   57.767699] [] (cpu_notify) from [] 
(cpu_notify_nofail+0x18/0x24)
[   57.767707] [] (cpu_notify_nofail) from [] 
(_cpu_down+0x3e8/0x55c)
[   57.767715] [] (_cpu_down) from [] 
(disable_nonboot_cpus+0x118/0x5dc)
[   57.767722] [] (disable_nonboot_cpus) from [] 
(suspend_enter+0x2c4/0xd18)
[   57.767730] [] (suspend_enter) from [] 
(suspend_devices_and_enter+0xe4/0x65c)
[   57.767737] [] (suspend_devices_and_enter) from [] 
(enter_state+0x6c0/0x1050)
[   57.767744] [] (enter_state) from [] 
(pm_suspend+0x24/0x84)
[   57.767751] [] (pm_suspend) from [] 
(state_store+0x74/0xc8)
[   57.767760] [] (state_store) from [] 
(kobj_attr_store+0x1c/0x28)
[   57.767771] [] (kobj_attr_store) from [] 
(sysfs_kf_write+0x5c/0x60)
[   57.767781] [] (sysfs_kf_write) from [] 
(kernfs_fop_write+0xc8/0x1ac)
[   57.767792] [] (kernfs_fop_write) from [] 
(__vfs_write+0x38/0xec)
[   57.767801] [] (__vfs_write) from [] 
(vfs_write+0xa0/0x174)
[   57.767811] [] (vfs_write) from [] (SyS_write+0x54/0xb0)
[   57.767822] [] (SyS_write) from [] 
(ret_fast_syscall+0x0/0x54)
[   57.768224] Powerdomain (l3init_pwrdm) didn't enter target state 1

I'm working with TI RT-kernel:
git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git
branch: ti-rt-linux-4.1.y

It looks like this backtrace was introduced by

commit 91df05da13a6c6c358e71182e80f19f3c48d1615
Author: Thomas Gleixner 
Date:   Tue Jul 12 15:38:34 2011 +0200

net: Use skbufhead with raw lock


I see the potential fix for this issue as below: 

index 4969c0d..f8c23de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7217,7 +7217,7 @@ static int dev_cpu_callback(struct notifier_block *nfb,
netif_rx_ni(skb);
input_queue_head_incr(oldsd);
}
-   while ((skb = skb_dequeue(&oldsd->input_pkt_queue))) {
+   while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
netif_rx_ni(skb);
input_queue_head_incr(oldsd);
}

input_pkt_queue is per-cpu queue and at this moment cpu is dead already,
so no one should touch it. But I'm not sure if my assumption is correct.

-- 
regards,
-grygorii

Re: [PATCH net-next 0/7] Mellanox driver update for net-next

2015-10-09 Thread David Miller

From: Or Gerlitz 
Date: Thu,  8 Oct 2015 17:13:56 +0300

> Some small fixes and small enhancements from the team.
> 
> Series applies over net-next commit acb4a6b "tcp: ensure prior synack rtx 
> behavior
> with small backlogs".

Series applied, thanks.

Re: [net-next 00/16][pull request] Intel Wired LAN Driver Updates 2015-10-08

2015-10-09 Thread David Miller

From: Jeff Kirsher 
Date: Thu,  8 Oct 2015 18:32:38 -0700

> This series contains updates to i40e and i40evf only (again).

Pulled, thanks Jeff.

Re: [PATCH] netfilter: turn NF_HOOK into an inline function

2015-10-09 Thread Arnd Bergmann

On Friday 09 October 2015 21:03:57 kbuild test robot wrote:
> ^1da177e Linus Torvalds2005-04-16  818  cb->rt_flags &= ~DN_RT_F_IE;
> ^1da177e Linus Torvalds2005-04-16  819  if (rt->rt_flags & 
> RTCF_DOREDIRECT)
> ^1da177e Linus Torvalds2005-04-16  820  cb->rt_flags |= 
> DN_RT_F_IE;
> ^1da177e Linus Torvalds2005-04-16  821  
> 29a26a56 Eric W. Biederman 2015-09-15  822  return 
> NF_HOOK(NFPROTO_DECNET, NF_DN_FORWARD,
> 29a26a56 Eric W. Biederman 2015-09-15 @823 &init_net, 
> NULL, skb, dev, skb->dev,
> 8f40b161 David S. Miller   2011-07-17  824 
> dn_to_neigh_output);
> ^1da177e Linus Torvalds2005-04-16  825  
> ^1da177e Linus Torvalds2005-04-16  826  drop:
> 

Ah, right. The 'dev' variable here is declared as

#ifdef CONFIG_NETFILTER
struct net_device *dev = skb->dev;
#endif

Apparently because the code produced the same warning as the ipv6 code.

Removing the #ifdef here would make that code nicer and let us use
my patch. Alternatively we could put the same #ifdef into IPV6 and
not use the inline function.

Any opinions?

Arnd

Re: [PATCH] netfilter: turn NF_HOOK into an inline function

2015-10-09 Thread kbuild test robot

Hi Arnd,

[auto build test ERROR on next-20151009 -- if it's inappropriate base, please 
ignore]

config: i386-randconfig-i0-201540 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   net/decnet/dn_route.c: In function 'dn_forward':
>> net/decnet/dn_route.c:823:32: error: 'dev' undeclared (first use in this 
>> function)
 &init_net, NULL, skb, dev, skb->dev,
   ^
   net/decnet/dn_route.c:823:32: note: each undeclared identifier is reported 
only once for each function it appears in

vim +/dev +823 net/decnet/dn_route.c

^1da177e Linus Torvalds2005-04-16  807   */
^1da177e Linus Torvalds2005-04-16  808  if (++cb->hops > 30)
^1da177e Linus Torvalds2005-04-16  809  goto drop;
^1da177e Linus Torvalds2005-04-16  810  
d8d1f30b Changli Gao   2010-06-10  811  skb->dev = rt->dst.dev;
^1da177e Linus Torvalds2005-04-16  812  
^1da177e Linus Torvalds2005-04-16  813  /*
^1da177e Linus Torvalds2005-04-16  814   * If packet goes out same 
interface it came in on, then set
^1da177e Linus Torvalds2005-04-16  815   * the Intra-Ethernet bit. This 
has no effect for short
^1da177e Linus Torvalds2005-04-16  816   * packets, so we don't need to 
test for them here.
^1da177e Linus Torvalds2005-04-16  817   */
^1da177e Linus Torvalds2005-04-16  818  cb->rt_flags &= ~DN_RT_F_IE;
^1da177e Linus Torvalds2005-04-16  819  if (rt->rt_flags & 
RTCF_DOREDIRECT)
^1da177e Linus Torvalds2005-04-16  820  cb->rt_flags |= 
DN_RT_F_IE;
^1da177e Linus Torvalds2005-04-16  821  
29a26a56 Eric W. Biederman 2015-09-15  822  return NF_HOOK(NFPROTO_DECNET, 
NF_DN_FORWARD,
29a26a56 Eric W. Biederman 2015-09-15 @823 &init_net, NULL, 
skb, dev, skb->dev,
8f40b161 David S. Miller   2011-07-17  824 
dn_to_neigh_output);
^1da177e Linus Torvalds2005-04-16  825  
^1da177e Linus Torvalds2005-04-16  826  drop:
^1da177e Linus Torvalds2005-04-16  827  kfree_skb(skb);
^1da177e Linus Torvalds2005-04-16  828  return NET_RX_DROP;
^1da177e Linus Torvalds2005-04-16  829  }
^1da177e Linus Torvalds2005-04-16  830  
^1da177e Linus Torvalds2005-04-16  831  /*

:: The code at line 823 was first introduced by commit
:: 29a26a56803855a79dbd028cd61abee56237d6e5 netfilter: Pass struct net into 
the netfilter hooks

:: TO: Eric W. Biederman 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data

Re: [PATCH net-next v4 1/2] fix return of iptunnel_xmit

2015-10-09 Thread Sergei Shtylyov


Hello.

On 10/9/2015 12:27 PM, Andreas Schultz wrote:


All users of iptunnel_xmit expect the return value to be the packet
length on success (>0), negative for a tx error and zero for a tx
dropped error. In cset 0e6fbc5b6c6218987c93b8c7ca60cf786062899d the


   Didn't checkpatch.pl complain about improper commit citing?


negative return case was lost.



This bug was introduced when the ip_tunnel_core code was refactored.



Fixes: 0e6fbc5b6c6218987c93b8c7ca60cf786062899d


   See Documentation/SubmittingPatches for the proper format of this tag.


Signed-off-by: Andreas Schultz 
Acked-by: Jiri Benc 
Acked-by: Pravin B Shelar 


MBR, Sergei


Re: [PATCHv2 net-next] cxgb4: Enhance driver to update FW, when FW is too old

2015-10-09 Thread Neil Horman


>   ret = t4_get_fw_version(adap, &adap->params.fw_vers);
>+  /* Try multiple times before returning error */
>+  for (i = 0; (ret == -EBUSY || ret == -EAGAIN) && i < 3; i++)
>+  ret = t4_get_fw_version(adap, &adap->params.fw_vers);
>+
>   if (ret)
>   return ret;


Nit: Just initialize ret to -EBUSY and change the test for i to < 4 rather than <
3.  That way you will only have one call site for t4_get_fw_version, which I
think is more readable.  Alternatively a do..while loop might be appropriate
here.

But I suppose I'm just splitting hairs at this point
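
For illustration, the single-call-site variant could look roughly like
the sketch below. It is a userspace approximation, not the cxgb4 code:
fake_get_fw_version(), busy_calls and calls_made are hypothetical
stand-ins for t4_get_fw_version() and its adapter state.

```c
#include <errno.h>

/* Hypothetical stand-in for t4_get_fw_version(): reports -EBUSY for the
 * first `busy_calls` invocations, then succeeds. */
static int busy_calls;
static int calls_made;

static int fake_get_fw_version(void)
{
	calls_made++;
	if (busy_calls > 0) {
		busy_calls--;
		return -EBUSY;
	}
	return 0;
}

static int read_fw_version_with_retry(void)
{
	int ret = -EBUSY;	/* forces the first pass through the loop */
	int i;

	/* One initial attempt plus up to three retries, one call site. */
	for (i = 0; (ret == -EBUSY || ret == -EAGAIN) && i < 4; i++)
		ret = fake_get_fw_version();

	return ret;
}
```

Starting with ret = -EBUSY means the loop condition is true on entry, so
the bound of 4 covers the original first call plus the three retries.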

Acked-by: Neil Horman 

 

[PATCH] net: HNS: fix MDIO dependencies

2015-10-09 Thread Arnd Bergmann

The newly introduced HNS_MDIO Kconfig symbol selects 'MDIO', but
that is the wrong symbol as the code used by this driver is
provided by PHYLIB rather than the MDIO driver. Also, there is
no need to make this driver user selectable, because it is already
selected by all drivers that need it.

This changes the Kconfig file to select the correct library, and
to make the option silent.

Signed-off-by: Arnd Bergmann 
Fixes: 5b904d39406 ("net: add Hisilicon Network Subsystem MDIO support")

diff --git a/drivers/net/ethernet/hisilicon/Kconfig 
b/drivers/net/ethernet/hisilicon/Kconfig
index 165b5a8aa2ea..8d12b587809e 100644
--- a/drivers/net/ethernet/hisilicon/Kconfig
+++ b/drivers/net/ethernet/hisilicon/Kconfig
@@ -24,7 +24,6 @@ config HIX5HD2_GMAC
 
 config HIP04_ETH
tristate "HISILICON P04 Ethernet support"
-   select PHYLIB
select MARVELL_PHY
select MFD_SYSCON
select HNS_MDIO
@@ -33,8 +32,8 @@ config HIP04_ETH
  want to use the internal ethernet then you should answer Y to this.
 
 config HNS_MDIO
-   tristate "Hisilicon HNS MDIO device Support"
-   select MDIO
+   tristate
+   select PHYLIB
---help---
  This selects the HNS MDIO support. It is needed by HNS_DSAF to access
  the PHY


[PATCH] netfilter: turn NF_HOOK into an inline function

2015-10-09 Thread Arnd Bergmann

A recent change to the dst_output handling caused a new warning
when the call to NF_HOOK() is the only use of a local variable
passed as 'dev', and CONFIG_NETFILTER is disabled:

net/ipv6/ip6_output.c: In function 'ip6_output':
net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]

The reason for this is that the NF_HOOK macro in this case does
not reference the variable at all. To avoid that warning now
and in the future, this changes the macro into an equivalent
inline function, which tells the compiler that the variable is
passed correctly but still unused.

Signed-off-by: Arnd Bergmann 
Fixes: ede2059dbaf9 ("dst: Pass net into dst->output")

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index edb3dc32f1da..1ff5c3f82820 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -347,8 +347,23 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi 
*fl, u_int8_t family)
 }
 
 #else /* !CONFIG_NETFILTER */
-#define NF_HOOK(pf, hook, net, sk, skb, indev, outdev, okfn) (okfn)(net, sk, 
skb)
-#define NF_HOOK_COND(pf, hook, net, sk, skb, indev, outdev, okfn, cond) 
(okfn)(net, sk, skb)
+static inline int
+NF_HOOK_COND(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+struct sk_buff *skb, struct net_device *in, struct net_device *out,
+int (*okfn)(struct net *, struct sock *, struct sk_buff *),
+bool cond)
+{
+   return okfn(net, sk, skb);
+}
+
+static inline int
+NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, 
struct sk_buff *skb,
+   struct net_device *in, struct net_device *out,
+   int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+   return okfn(net, sk, skb);
+}
+
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
  struct sock *sk, struct sk_buff *skb,
  struct net_device *indev, struct net_device *outdev,


[PATCH net] ipv4/icmp: redirect messages can use the ingress daddr as source

2015-10-09 Thread Paolo Abeni

This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr forces the
use of the destination address of the packet that caused the
redirect.

The new behaviour closely follows RFC 5798 section 8.1.1, and fixes the
following scenario:

Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.

If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.

The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-hop.

Signed-off-by: Paolo Abeni 
---
 Documentation/networking/ip-sysctl.txt | 19 +--
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/icmp.c|  9 -
 net/ipv4/sysctl_net_ipv4.c |  7 +++
 4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ebe94f2..9983825 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -884,8 +884,8 @@ icmp_ignore_bogus_error_responses - BOOLEAN
 
 icmp_errors_use_inbound_ifaddr - BOOLEAN
 
-   If zero, icmp error messages are sent with the primary address of
-   the exiting interface.
+   If zero, icmp error messages except redirects are sent with the primary
+   address of the exiting interface.
 
If non-zero, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
@@ -897,8 +897,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
then the primary address of the first non-loopback interface that
has one will be used regardless of this setting.
 
+   The source address selection of icmp redirect messages is controlled by
+   icmp_errors_use_inbound_ifaddr.
Default: 0
 
+icmp_redirects_use_orig_daddr - BOOLEAN
+
+   If zero, icmp redirect messages are sent using the address specified for
+   other icmp errors by icmp_errors_use_inbound_ifaddr.
+
+   If non-zero, the message will be sent with the destination address of
+   the packet that caused the icmp redirect.
+   This behaviour is the preferred one on VRRP routers (see RFC 5798
+   section 8.1.1).
+
+   Default: 0
+
+
 igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
Default: 20
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c68926b..46d336a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -74,6 +74,7 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+   int sysctl_icmp_redirects_use_orig_daddr;
 
struct local_ports ip_local_ports;
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5eb8ac..3b57aa4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -642,7 +642,9 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, 
__be32 info)
 */
 
saddr = iph->daddr;
-   if (!(rt->rt_flags & RTCF_LOCAL)) {
+   if (!((type == ICMP_REDIRECT) &&
+ net->ipv4.sysctl_icmp_redirects_use_orig_daddr) &&
+   !(rt->rt_flags & RTCF_LOCAL)) {
struct net_device *dev = NULL;
 
rcu_read_lock();
@@ -1205,6 +1207,11 @@ static int __net_init icmp_sk_init(struct net *net)
net->ipv4.sysctl_icmp_ratemask = 0x1818;
net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr = 0;
 
+   /* Control parameter - use the daddr of originating packets as saddr
+* in redirect messages?
+*/
+   net->ipv4.sysctl_icmp_redirects_use_orig_daddr = 0;
+
return 0;
 
 fail:
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..30a531c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -818,6 +818,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "icmp_redirects_use_orig_daddr",
+   .data   = 
&init_net.ipv4.sysctl_icmp_redirects_use_orig_daddr,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "icmp_ratelimit",
.data   = &init_net.ipv4.sysctl_icmp_ratelimit,
.maxlen = sizeof(int),
-- 
1.8.3.1
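
For illustration only (not part of the patch): the source address
decision in the icmp_send() hunk can be read as the small function
below. This is a simplified sketch; pick_saddr() and its parameters are
hypothetical, and the real code looks up the interface address via the
route and device rather than taking it as an argument.

```c
#include <stdbool.h>
#include <stdint.h>

#define ICMP_REDIRECT_TYPE 5	/* ICMP type 5 is Redirect (RFC 792) */

/* Returns the source address for an outgoing ICMP error.
 *   orig_daddr:     destination of the packet that triggered the error
 *   iface_addr:     primary address of the chosen interface
 *   use_orig_daddr: value of icmp_redirects_use_orig_daddr
 *   route_is_local: true when the route carries RTCF_LOCAL */
static uint32_t pick_saddr(int type, bool use_orig_daddr, bool route_is_local,
			   uint32_t orig_daddr, uint32_t iface_addr)
{
	uint32_t saddr = orig_daddr;

	/* Fall back to the interface address unless this is a redirect
	 * with the new sysctl set, or the route is local. */
	if (!(type == ICMP_REDIRECT_TYPE && use_orig_daddr) && !route_is_local)
		saddr = iface_addr;

	return saddr;
}
```

In the VRRP scenario above, orig_daddr would be x.x.x.254 and iface_addr
x.x.x.1 or x.x.x.2, so with the sysctl set the redirect leaves with the
virtual address the host actually uses as its gateway.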


RE: question about potential integer truncation in mwifiex_set_wapi_ie and mwifiex_set_wps_ie

2015-10-09 Thread Amitkumar Karwar

Hi James/PaX Team,

> -Original Message-
> From: qu...@laptop.org [mailto:qu...@laptop.org]
> Sent: Wednesday, September 30, 2015 4:41 AM
> To: PaX Team
> Cc: Amitkumar Karwar; Avinash Patil; Kalle Valo; linux-
> wirel...@vger.kernel.org; netdev@vger.kernel.org; re.em...@gmail.com;
> spen...@grsecurity.net
> Subject: Re: question about potential integer truncation in
> mwifiex_set_wapi_ie and mwifiex_set_wps_ie
> 
> On Tue, Sep 29, 2015 at 05:21:28PM +0200, PaX Team wrote:
> > hi all,
> >
> > in drivers/net/wireless/mwifiex/sta_ioctl.c the following functions
> >
> > mwifiex_set_wpa_ie_helper
> > mwifiex_set_wapi_ie
> > mwifiex_set_wps_ie
> >
> > can truncate the incoming ie_len argument from u16 to u8 when it gets
> > stored in mwifiex_private.wpa_ie_len, mwifiex_private.wapi_ie_len and
> > mwifiex_private.wps_ie_len, respectively. based on some light code
> > reading it seems a length value of 256 is valid (IEEE_MAX_IE_SIZE and
> > MWIFIEX_MAX_VSIE_LEN seem to limit it) and thus would get truncated to
> > 0 when stored in those u8 fields. the question is whether this is
> > intentional or a bug somewhere.
> 
> i agree, while there is a test to ensure ie_len is not greater than 256,
> there is a possibility that it will be exactly 256, which means
> 256 bytes will be given to memcpy but
> mwifiex_private.{wpa,wapi,wps}_ie_len will be zero.
> 
> i suggest changing the lengths to u16.  not tested.
> 
> diff --git a/drivers/net/wireless/mwifiex/main.h
> b/drivers/net/wireless/mwifiex/main.h
> index fe12560..b66e9a7 100644
> --- a/drivers/net/wireless/mwifiex/main.h
> +++ b/drivers/net/wireless/mwifiex/main.h
> @@ -512,14 +512,14 @@ struct mwifiex_private {
>   struct mwifiex_wep_key wep_key[NUM_WEP_KEYS];
>   u16 wep_key_curr_index;
>   u8 wpa_ie[256];
> - u8 wpa_ie_len;
> + u16 wpa_ie_len;
>   u8 wpa_is_gtk_set;
>   struct host_cmd_ds_802_11_key_material aes_key;
>   struct host_cmd_ds_802_11_key_material_v2 aes_key_v2;
>   u8 wapi_ie[256];
> - u8 wapi_ie_len;
> + u16 wapi_ie_len;
>   u8 *wps_ie;
> - u8 wps_ie_len;
> + u16 wps_ie_len;
>   u8 wmm_required;
>   u8 wmm_enabled;
>   u8 wmm_qosinfo;
> 

This change makes sense. Also, we should not typecast the length to u8 when
copying it to the mwifiex_private variable.

./sta_ioctl.c:761:  priv->wpa_ie_len = (u8) ie_len;

Eventually the length stored in 'wapi_ie_len' is copied to a u16 variable.

/join.c:304:ie_header.len = cpu_to_le16(priv->wapi_ie_len);

I will submit a patch to fix this.

Regards,
Amitkumar
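
The truncation discussed in this thread can be reproduced in plain userspace C. This is only a sketch: the struct and function names (demo_priv, store_len_u8, store_len_u16) are hypothetical stand-ins rather than mwifiex code, and only the u16-to-u8 narrowing is taken from the discussion above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the length fields discussed in the thread. */
struct demo_priv {
	uint8_t  len_u8;   /* old field width */
	uint16_t len_u16;  /* proposed field width */
};

/* Mirrors the "(u8) ie_len" cast: a value of 256 loses its high byte. */
void store_len_u8(struct demo_priv *p, uint16_t ie_len)
{
	p->len_u8 = (uint8_t)ie_len;
}

/* With a u16 field the full value survives. */
void store_len_u16(struct demo_priv *p, uint16_t ie_len)
{
	p->len_u16 = ie_len;
}
```

Storing 256 through store_len_u8() leaves 0 in the field, which is the case the original question was about; the u16 variant keeps 256 intact.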

[patch net-next] bridge: try switchdev op first in __vlan_vid_add/del

2015-10-09 Thread Jiri Pirko

From: Jiri Pirko 

Some drivers need to implement both switchdev vlan ops and
vid_add/kill ndos. For that to work in bridge code, we need to try
switchdev op first when adding/deleting vlan id.

Signed-off-by: Jiri Pirko 
Signed-off-by: Ido Schimmel 
---
 net/bridge/br_vlan.c | 58 
 1 file changed, 22 insertions(+), 36 deletions(-)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index eae07ee..975deb9 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -72,28 +72,20 @@ static void __vlan_add_flags(struct net_bridge_vlan *v, u16 flags)
 static int __vlan_vid_add(struct net_device *dev, struct net_bridge *br,
  u16 vid, u16 flags)
 {
-   const struct net_device_ops *ops = dev->netdev_ops;
+   struct switchdev_obj_port_vlan v = {
+   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
+   .flags = flags,
+   .vid_begin = vid,
+   .vid_end = vid,
+   };
int err;
 
-   /* If driver uses VLAN ndo ops, use 8021q to install vid
-* on device, otherwise try switchdev ops to install vid.
+   /* Try switchdev op first. In case it is not supported, fallback to
+* 8021q add.
 */
-
-   if (ops->ndo_vlan_rx_add_vid) {
-   err = vlan_vid_add(dev, br->vlan_proto, vid);
-   } else {
-   struct switchdev_obj_port_vlan v = {
-   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
-   .flags = flags,
-   .vid_begin = vid,
-   .vid_end = vid,
-   };
-
-   err = switchdev_port_obj_add(dev, &v.obj);
-   if (err == -EOPNOTSUPP)
-   err = 0;
-   }
-
+   err = switchdev_port_obj_add(dev, &v.obj);
+   if (err == -EOPNOTSUPP)
+   return vlan_vid_add(dev, br->vlan_proto, vid);
return err;
 }
 
@@ -122,27 +114,21 @@ static void __vlan_del_list(struct net_bridge_vlan *v)
 static int __vlan_vid_del(struct net_device *dev, struct net_bridge *br,
  u16 vid)
 {
-   const struct net_device_ops *ops = dev->netdev_ops;
-   int err = 0;
+   struct switchdev_obj_port_vlan v = {
+   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
+   .vid_begin = vid,
+   .vid_end = vid,
+   };
+   int err;
 
-   /* If driver uses VLAN ndo ops, use 8021q to delete vid
-* on device, otherwise try switchdev ops to delete vid.
+   /* Try switchdev op first. In case it is not supported, fallback to
+* 8021q del.
 */
-
-   if (ops->ndo_vlan_rx_kill_vid) {
+   err = switchdev_port_obj_del(dev, &v.obj);
+   if (err == -EOPNOTSUPP) {
vlan_vid_del(dev, br->vlan_proto, vid);
-   } else {
-   struct switchdev_obj_port_vlan v = {
-   .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN,
-   .vid_begin = vid,
-   .vid_end = vid,
-   };
-
-   err = switchdev_port_obj_del(dev, &v.obj);
-   if (err == -EOPNOTSUPP)
-   err = 0;
+   return 0;
}
-
return err;
 }
 
-- 
1.9.3
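
The "try the offload op first, fall back only on -EOPNOTSUPP" pattern used by this patch can be sketched outside the kernel as follows. All names here (demo_dev, demo_offload_add, demo_vid_add and the sample callbacks) are hypothetical; the sketch does not use the switchdev or 8021q APIs, it only mirrors the control flow.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device with two ways to install a VLAN id. */
struct demo_dev {
	int (*offload_add)(struct demo_dev *dev, unsigned short vid); /* may be NULL */
	int (*legacy_add)(struct demo_dev *dev, unsigned short vid);
	int legacy_calls;  /* counts how often the fallback ran */
};

/* Sample callbacks used to exercise the three possible outcomes. */
int demo_legacy_ok(struct demo_dev *dev, unsigned short vid)
{
	(void)vid;
	dev->legacy_calls++;
	return 0;
}

int demo_offload_ok(struct demo_dev *dev, unsigned short vid)
{
	(void)dev;
	(void)vid;
	return 0;
}

int demo_offload_fail(struct demo_dev *dev, unsigned short vid)
{
	(void)dev;
	(void)vid;
	return -EIO;
}

/* Stand-in for the offload call: reports -EOPNOTSUPP when unsupported. */
int demo_offload_add(struct demo_dev *dev, unsigned short vid)
{
	if (!dev->offload_add)
		return -EOPNOTSUPP;
	return dev->offload_add(dev, vid);
}

/* Try the offload op first; fall back only when it is not supported. */
int demo_vid_add(struct demo_dev *dev, unsigned short vid)
{
	int err = demo_offload_add(dev, vid);

	if (err == -EOPNOTSUPP)
		return dev->legacy_add(dev, vid);
	return err;
}
```

Note that a real offload error such as -EIO is returned to the caller unchanged and does not trigger the fallback.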


Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Hannes Frederic Sowa

Hi,

Alexei Starovoitov  writes:

> On 10/8/15 11:20 AM, Hannes Frederic Sowa wrote:
>> Hi Alexei,
>>
>> On Thu, Oct 8, 2015, at 07:23, Alexei Starovoitov wrote:
>>> The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
>>> This toggle defaults to off (0), but can be set true (1).  Once true,
>>> bpf programs and maps cannot be accessed from unprivileged process,
>>> and the toggle cannot be set back to false.
>>
>> This approach seems fine to me.
>>
>> I am wondering if it makes sense to somehow allow ebpf access per
>> namespace? I currently have no idea how that could work and on which
>> namespace type to depend or going with a prctl or even cgroup maybe. The
>> rationale behind this is, that maybe some namespaces like openstack
>> router namespaces could make usage of advanced ebpf capabilities in the
>> kernel, while other namespaces, especially where untrusted third parties
>> are hosted, shouldn't have access to those facilities.
>>
>> In that way, hosters would be able to e.g. deploy more efficient
>> performance monitoring container (which should still need not to run as
>> root) while the majority of the users has no access to that. Or think
>> about routing instances in some namespaces, etc. etc.
>
> when we're talking about eBPF for networking or performance monitoring
> it's all going to be under root anyway.

I am not so sure, actually. Take PCP (Performance Co-Pilot), which does
long-term collection of performance data in the kernel and may send
it over the network: it would be great if at least some capabilities
could be dropped after the bpf file descriptor was allocated. But the
current bpf syscall checks capabilities on every call, which is actually
quite unusual for capabilities.

For networking the basic technique was also to drop capabilities sooner
rather than later.

Can we filter the bpf syscall in a fine-grained way with SELinux?

> The next question is
> how to let the programs run only for traffic or for applications within
> namespaces. Something gotta do this demux. It either can be in-kernel
> C code which is configured via some API that calls different eBPF
> programs based on cgroup or based on netns, or it can be another
> eBPF program that does demux on its own.

This sounds quite complex. Afaics this problem hasn't even been solved in
perf so far; tracepoints currently hit independent of the namespace.

> In case of tracing such 'demuxing' program can be attached to kernel
> events and call 'worker' programs via tail_call, so that 'worker'
> programs will have an illusion that they're working only with events
> that belong to their namespace.
> imo existing facilities already allow 'per namespace' eBPF, though
> the prog_array used to jump from 'demuxing' bpf into 'worker' bpf
> currently is a bit awkward to use (because of FD passing via daemon),
> but that will get solved soon.

Aha, so client namespaces hand over their fds to parent demuxer and it
sets up the necessary calls. Yeah, this seems to work.

> It feels that in-kernel C code doing filtering may be
> 'more robust' from namespace isolation point of view, but I don't
> think we have a concrete and tested proposal, so I would
> experiment with 'demuxing' bpf first.
> The programs in general don't have a notion of namespace. They
> need to be attached to veth via TC to get packets for
> particular namespace.

Okay.

For me, namespacing of ebpf code is actually not that important; I would
much rather control which namespace is allowed to execute ebpf
in an unprivileged manner. As Thomas wrote, a capability would be great
for that, but I don't know if any new capabilities will be added.

Thanks,
Hannes
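
The one-way semantics of the sysctl quoted at the top of this message (defaults to 0, can be set to 1, cannot be set back) can be sketched like this. The names demo_toggle and demo_toggle_write are hypothetical, and the error codes are assumptions made for the example; the actual kernel handler may behave differently.

```c
#include <errno.h>

/* Hypothetical one-way toggle: 0 = unprivileged use allowed, 1 = disabled. */
struct demo_toggle {
	int disabled;
};

/*
 * Returns 0 on success, -EINVAL for values other than 0 or 1, and
 * -EPERM when trying to clear a toggle that is already set.
 * The choice of error codes is an assumption for this sketch.
 */
int demo_toggle_write(struct demo_toggle *t, int val)
{
	if (val != 0 && val != 1)
		return -EINVAL;
	if (t->disabled && val == 0)
		return -EPERM;
	t->disabled = val;
	return 0;
}
```

Once the toggle holds 1, every later attempt to write 0 fails and the state stays at 1.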

RE: [PATCH] mwifiex: fix a comment typo

2015-10-09 Thread Amitkumar Karwar

> From: Geliang Tang [mailto:geliangt...@163.com]
> Sent: Sunday, October 04, 2015 2:17 PM
> To: Amitkumar Karwar; Nishant Sarmukadam; Kalle Valo
> Cc: Geliang Tang; linux-wirel...@vger.kernel.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] mwifiex: fix a comment typo
> 
> Just fix a typo in the code comment.
> 
> Signed-off-by: Geliang Tang 
> ---
>  drivers/net/wireless/mwifiex/cfg80211.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/mwifiex/cfg80211.c
> b/drivers/net/wireless/mwifiex/cfg80211.c
> index 30cbafb..b7ac45f 100644
> --- a/drivers/net/wireless/mwifiex/cfg80211.c
> +++ b/drivers/net/wireless/mwifiex/cfg80211.c
> @@ -2374,7 +2374,7 @@ mwifiex_cfg80211_leave_ibss(struct wiphy *wiphy, struct net_device *dev)
>   * CFG802.11 operation handler for scan request.
>   *
>   * This function issues a scan request to the firmware based upon
> - * the user specified scan configuration. On successfull completion,
> + * the user specified scan configuration. On successful completion,
>   * it also informs the results.
>   */
>  static int
> --
> 2.5.0
> 

Acked-by: Amitkumar Karwar 

Regards,
Amitkumar

RE: [PATCH 04/12] mwifiex: use ktime_get_real for timestamping

2015-10-09 Thread Amitkumar Karwar

Hi Arnd,

> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, September 30, 2015 4:57 PM
> To: netdev@vger.kernel.org
> Cc: y2...@lists.linaro.org; linux-ker...@vger.kernel.org; David S.
> Miller; Arnd Bergmann; Amitkumar Karwar; Nishant Sarmukadam; Kalle Valo;
> linux-wirel...@vger.kernel.org
> Subject: [PATCH 04/12] mwifiex: use ktime_get_real for timestamping
> 
> The mwifiex_11n_aggregate_pkt() function creates a ktime_t from a
> timeval returned by do_gettimeofday, which is slow and causes an
> overflow in 2038 on 32-bit architectures.
> 
> This solves both problems by using the appropriate ktime_get_real()
> function.
> 
> Signed-off-by: Arnd Bergmann 
> Cc: Amitkumar Karwar 
> Cc: Nishant Sarmukadam 
> Cc: Kalle Valo 
> Cc: linux-wirel...@vger.kernel.org
> ---
>  drivers/net/wireless/mwifiex/11n_aggr.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/wireless/mwifiex/11n_aggr.c
> b/drivers/net/wireless/mwifiex/11n_aggr.c
> index f7c717253a66..78853c51774d 100644
> --- a/drivers/net/wireless/mwifiex/11n_aggr.c
> +++ b/drivers/net/wireless/mwifiex/11n_aggr.c
> @@ -173,7 +173,6 @@ mwifiex_11n_aggregate_pkt(struct mwifiex_private *priv,
>   int pad = 0, aggr_num = 0, ret;
>   struct mwifiex_tx_param tx_param;
>   struct txpd *ptx_pd = NULL;
> - struct timeval tv;
>   int headroom = adapter->iface_type == MWIFIEX_USB ? 0 : INTF_HEADER_LEN;
> 
>   skb_src = skb_peek(&pra_list->skb_head);
> @@ -203,8 +202,7 @@ mwifiex_11n_aggregate_pkt(struct mwifiex_private *priv,
>   tx_info_aggr->flags |= MWIFIEX_BUF_FLAG_AGGR_PKT;
>   skb_aggr->priority = skb_src->priority;
> 
> - do_gettimeofday(&tv);
> - skb_aggr->tstamp = timeval_to_ktime(tv);
> + skb_aggr->tstamp = ktime_get_real();
> 
>   do {
>   /* Check if AMSDU can accommodate this MSDU */
> --
> 2.1.0.rc2

Looks good.

Acked-by: Amitkumar Karwar 

Regards,
Amitkumar
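
The 2038 overflow mentioned in the quoted commit message comes from holding seconds in a signed 32-bit value. A small sketch of the arithmetic, using hypothetical helper names and no kernel APIs:

```c
#include <stdint.h>

/*
 * Largest second count a signed 32-bit counter can hold:
 * 2147483647 s after the epoch, i.e. 2038-01-19 03:14:07 UTC.
 */
#define DEMO_MAX_32BIT_SEC INT32_MAX

/* Returns 1 if a seconds-since-epoch value no longer fits in 32 signed bits. */
int demo_overflows_32bit_sec(int64_t sec)
{
	return sec > DEMO_MAX_32BIT_SEC || sec < INT32_MIN;
}

/* A 64-bit nanosecond count has room far beyond 2038. */
int64_t demo_sec_to_ns(int64_t sec)
{
	return sec * 1000000000LL;
}
```

One second past the limit already overflows the 32-bit representation, while the same instant expressed as 64-bit nanoseconds is still well within range.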

RE: [PATCH 05/12] mwifiex: avoid gettimeofday in ba_threshold setting

2015-10-09 Thread Amitkumar Karwar

Hi Arnd,

> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, September 30, 2015 4:57 PM
> To: netdev@vger.kernel.org
> Cc: y2...@lists.linaro.org; linux-ker...@vger.kernel.org; David S.
> Miller; Arnd Bergmann; Amitkumar Karwar; Nishant Sarmukadam; Kalle Valo;
> linux-wirel...@vger.kernel.org
> Subject: [PATCH 05/12] mwifiex: avoid gettimeofday in ba_threshold
> setting
> 
> mwifiex_get_random_ba_threshold() uses a complex homegrown
> implementation to generate a pseudo-random number from the current time
> as returned from do_gettimeofday().
> 
> This currently requires two 32-bit divisions plus a couple of other
> computations that are eventually discarded as only eight bits of the
> microsecond portion are used at all.
> 
> We could replace this with a call to get_random_bytes(), but that might
> drain the entropy pool too fast if this is called for each packet.
> 
> Instead, this patch converts it to use ktime_get_ns(), which is a bit
> faster than do_gettimeofday(), and then uses a similar algorithm as
> before, but in a way that takes both the nanosecond and second portion
> into account for slightly-more-but-still-not-very-random
> pseudorandom number.
> 
> Signed-off-by: Arnd Bergmann 
> Cc: Amitkumar Karwar 
> Cc: Nishant Sarmukadam 
> Cc: Kalle Valo 
> Cc: linux-wirel...@vger.kernel.org
> ---
>  drivers/net/wireless/mwifiex/wmm.c | 15 ---
>  1 file changed, 4 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/wireless/mwifiex/wmm.c
> b/drivers/net/wireless/mwifiex/wmm.c
> index 173d3663c2e0..878d358063dc 100644
> --- a/drivers/net/wireless/mwifiex/wmm.c
> +++ b/drivers/net/wireless/mwifiex/wmm.c
> @@ -117,22 +117,15 @@ mwifiex_wmm_allocate_ralist_node(struct mwifiex_adapter *adapter, const u8 *ra)
>   */
>  static u8 mwifiex_get_random_ba_threshold(void)
>  {
> - u32 sec, usec;
> - struct timeval ba_tstamp;
> - u8 ba_threshold;
> -
> + u64 ns;
>   /* setup ba_packet_threshold here random number between
>* [BA_SETUP_PACKET_OFFSET,
>* BA_SETUP_PACKET_OFFSET+BA_SETUP_MAX_PACKET_THRESHOLD-1]
>*/
> + ns = ktime_get_ns();
> + ns += (ns >> 32) + (ns >> 16);
> 
> - do_gettimeofday(&ba_tstamp);
> - sec = (ba_tstamp.tv_sec & 0xFFFF) + (ba_tstamp.tv_sec >> 16);
> - usec = (ba_tstamp.tv_usec & 0xFFFF) + (ba_tstamp.tv_usec >> 16);
> - ba_threshold = (((sec << 16) + usec) % BA_SETUP_MAX_PACKET_THRESHOLD)
> -   + BA_SETUP_PACKET_OFFSET;
> -
> - return ba_threshold;
> + return ((u8)ns % BA_SETUP_MAX_PACKET_THRESHOLD) +
> +BA_SETUP_PACKET_OFFSET;
>  }
> 
>  /*
> --

Looks fine to me.
Acked-by: Amitkumar Karwar 

Regards,
Amitkumar
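
The folding step from the quoted patch can be tried in userspace by passing the timestamp in as a parameter instead of calling ktime_get_ns(). The two constants below are assumed values chosen for illustration (the driver headers are authoritative), and demo_ba_threshold is a hypothetical name.

```c
#include <stdint.h>

/* Assumed values for illustration only; see the driver headers. */
#define DEMO_BA_MAX_THRESHOLD 16
#define DEMO_BA_OFFSET        16

/* Same folding as the patch, with the timestamp passed in for testability. */
uint8_t demo_ba_threshold(uint64_t ns)
{
	/* Mix the upper bits into the low byte before truncating. */
	ns += (ns >> 32) + (ns >> 16);
	return (uint8_t)(((uint8_t)ns % DEMO_BA_MAX_THRESHOLD) + DEMO_BA_OFFSET);
}
```

With these assumed constants the result always lies in [16, 31], matching the [OFFSET, OFFSET + MAX - 1] range described in the code comment of the patch.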

Re: [PATCH v3 net-next 0/4] tcp: better smp listener behavior

2015-10-09 Thread Eric Dumazet

On Thu, 2015-10-08 at 21:16 -0700, Grant Zhang wrote:

> 
> Does it make sense to make the listener hash table percpu? Socket with 
> SO_INCOMING_CPU set could just be add to the hashtable for that specific 
> cpu.

Not sure : We plan to upstream a patch adding a soreuseport specific
table to make the lookup time independent of number of sockets bound to
one particular port. This simply adds an RCU protected array, with
ability to immediately fetch slot number X from this array.
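
As a rough illustration of why an array makes the lookup independent of the number of sockets, here is a userspace sketch. It is not the planned patch: the names (demo_reuse_group, demo_select) are hypothetical, plain ints stand in for socket pointers, and the RCU protection is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SOCKS 8

/* Hypothetical per-port group: a flat array instead of a list to walk. */
struct demo_reuse_group {
	size_t num;                   /* sockets currently bound */
	int    socks[DEMO_MAX_SOCKS]; /* stand-ins for socket pointers */
};

/* O(1) selection: index straight into the array. Returns -1 if empty. */
int demo_select(const struct demo_reuse_group *g, uint32_t hash)
{
	if (g->num == 0)
		return -1;
	return g->socks[hash % g->num];
}
```

The cost of demo_select() is one modulo and one array access regardless of how many sockets share the port, whereas a list walk touches more cache lines as the group grows.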


Re: [PATCH net-next 0/4] tcp: better smp listener behavior

2015-10-09 Thread Eric Dumazet

On Thu, 2015-10-08 at 20:42 -0700, Tom Herbert wrote:
> On Thu, Oct 8, 2015 at 8:37 AM, Eric Dumazet  wrote:
> > As promised in last patch series, we implement a better SO_REUSEPORT
> > strategy, based on cpu affinities if selected by the application.
> >
> > We also moved sk_refcnt out of the cache line containing the lookup
> > keys, as it was considerably slowing down smp operations because
> > of false sharing. This was simpler than converting listen sockets
> > to conventional RCU (to avoid sk_refcnt dirtying)
> >
> > Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
> >
> Is this IPv4, IPv6, or some combination of the two ? :-)

IPv4 only (mostly because I was using trafgen and its csumtcp() only
deals with IPv4 and I am lazy)

I guess IPv6 one might hit some issues before reaching TCP stack, I do
not see anything performance related in TCP itself.




[PATCH] pcnet32: fix a logic error with pci_set_dma_mask

2015-10-09 Thread Geliang Tang

pcnet32 recently stopped working on my machine. It reports "architecture
does not support 32bit PCI busmaster DMA". There is a logic error
in the check: pci_set_dma_mask() returns 0 on success.

Signed-off-by: Geliang Tang 
---
 drivers/net/ethernet/amd/pcnet32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/pcnet32.c b/drivers/net/ethernet/amd/pcnet32.c
index e2afabf..2d9d216 100644
--- a/drivers/net/ethernet/amd/pcnet32.c
+++ b/drivers/net/ethernet/amd/pcnet32.c
@@ -1500,7 +1500,7 @@ pcnet32_probe_pci(struct pci_dev *pdev, const struct pci_device_id *ent)
return -ENODEV;
}
 
-   if (!pci_set_dma_mask(pdev, PCNET32_DMA_MASK)) {
+   if (pci_set_dma_mask(pdev, PCNET32_DMA_MASK)) {
if (pcnet32_debug & NETIF_MSG_PROBE)
pr_err("architecture does not support 32bit PCI busmaster DMA\n");
return -ENODEV;
-- 
1.9.1
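
The inverted check fixed by this patch is easy to see with a stub that follows the usual kernel convention of returning 0 on success and a negative errno on failure. The stub and probe functions below are hypothetical and do not touch the PCI API.

```c
#include <errno.h>

/* Stand-in for a kernel-style call: 0 on success, negative errno on failure. */
int demo_set_dma_mask(int supported)
{
	return supported ? 0 : -EIO;
}

/* Buggy check: treats success (0) as failure. */
int demo_probe_buggy(int supported)
{
	if (!demo_set_dma_mask(supported))
		return -ENODEV;
	return 0;
}

/* Fixed check: any non-zero return is an error. */
int demo_probe_fixed(int supported)
{
	if (demo_set_dma_mask(supported))
		return -ENODEV;
	return 0;
}
```

The buggy variant fails the probe exactly when the mask was accepted, which matches the symptom described in the commit message.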



Re: [RHEL6.8 net PATCH] ipv4/icmp: redirect messages can use the ingress daddr as source

2015-10-09 Thread Paolo Abeni

Hi all,

I'm sorry, I messed with the subject tag. I'm going to resubmit.

Sorry for the noise,

Paolo


Re: [PATCH 4/9] net/can: can_dropped_invalid_skb can be boolean

2015-10-09 Thread Marc Kleine-Budde

On 10/08/2015 03:28 PM, Yaowei Bai wrote:
> This patch makes can_dropped_invalid_skb return bool due to this
> particular function only using either one or zero as its return
> value.
> 
> No functional change.
> 
> Signed-off-by: Yaowei Bai 

Acked-by: Marc Kleine-Budde 

Yaowei, feel free to send the CAN patch as part of your series directly
to David.

Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |




[net-next PATCH] driver: net: cpsw: add no_bd_ram dt parsing

2015-10-09 Thread Mugunthan V N

cpdma is capable of placing the dma descriptors in ddr using
dma_alloc_coherent() when the internal bd ram size is not enough.
To utilize this feature pass the DT parameter "no_bd_ram" and
increase bd_ram_size and number of rx descriptors.

Signed-off-by: Mugunthan V N 
---
 drivers/net/ethernet/ti/cpsw.c | 4 
 drivers/net/ethernet/ti/cpsw.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 8fc90f1..cf1a625 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1987,6 +1987,8 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
}
data->ale_entries = prop;
 
+   data->no_bd_ram = of_property_read_bool(node, "no_bd_ram");
+
if (of_property_read_u32(node, "bd_ram_size", &prop)) {
dev_err(&pdev->dev, "Missing bd_ram_size property in the DT.\n");
return -EINVAL;
@@ -2358,6 +2360,8 @@ static int cpsw_probe(struct platform_device *pdev)
dma_params.desc_mem_size= data->bd_ram_size;
dma_params.desc_align   = 16;
dma_params.has_ext_regs = true;
+   if (data->no_bd_ram)
+   dma_params.desc_mem_phys = 0;
dma_params.desc_hw_addr = dma_params.desc_mem_phys;
 
priv->dma = cpdma_ctlr_create(&dma_params);
diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
index ca90efa..b654ac2 100644
--- a/drivers/net/ethernet/ti/cpsw.h
+++ b/drivers/net/ethernet/ti/cpsw.h
@@ -33,6 +33,7 @@ struct cpsw_platform_data {
u32 cpts_clock_mult;  /* convert input clock ticks to nanoseconds */
u32 cpts_clock_shift; /* convert input clock ticks to nanoseconds */
u32 ale_entries;/* ale table size */
+   boolno_bd_ram;  /* set if cpsw bd ram should not be used */
u32 bd_ram_size;  /*buffer descriptor ram size */
u32 rx_descs;   /* Number of Rx Descriptios */
u32 mac_control;/* Mac control register */
-- 
2.6.1.133.gf5b6079


Re: [PATCH net-next] openvswitch: report features supported by the kernel datapath

2015-10-09 Thread Jiri Benc

On Fri, 9 Oct 2015 11:24:53 +0200, Thomas Graf wrote:
> On 10/08/15 at 03:40pm, Jesse Gross wrote:
> > I have similar concerns as were expressed in the other thread. The
> > features listed here aren't OVS components and I don't think that it
> > makes sense for OVS to try to cover everything that is related - the
> > goal that we've been working towards is to have OVS be less monolithic
> > and more integrated. So to the extent that it is necessary to have
> > capabilities be exposed (and I would like to avoid this where
> > possible), it should be in the individual component, not in OVS.

Fair enough. Note that the IPv6 flag really belongs to ovs, though -
it's about the existence of OVS_TUNNEL_KEY_ATTR_IPV6_SRC and
OVS_TUNNEL_KEY_ATTR_IPV6_DST netlink attributes. For the lwtunnel flag
(which is just another way to tell whether vxlan/geneve/etc. has
COLLECT_METADATA support) I can agree that it does not belong to ovs.

> I'm fine with that as well. However, I do dislike the idea of creating
> net_devices with a set of parameters just to figure if the parameters
> are supported or not. This works OK for the first step of evolution
> where we have support or not but it gets absolutely messy when we
> have: no support, multiple levels of partial support and finally full
> support.

100% agreed.

> We have been thinking about a more generic capabilities Netlink
> interface for a while and this looks like a good justification for
> finally doing that work.

I've been looking into this since morning and everything I've been able
to come up with seems to be quite intrusive. Before investing time to
create a long patchset that might be potentially rejected, I'd like to
get some opinions.

My thoughts are introducing either RTM_VALIDATELINK or
RTM_NEWLINK_STRICT. In the first case, it would just check whether the
passed attributes are okay for "strict" creation of the link; in the
second case, it would either reject the request, or create the link
(similarly to what RTM_NEWLINK does but with "strict" attributes
checking).

The "strict" checking would mean:

- Rejecting attributes with type <= 0 and > maxtype (i.e. changing
  nla_parse, nlmsg_parse, etc. to do optional strict checking based on
  a passed bool parameter).

- Adding the bool parameter for strict checking to rtnl_link_ops
  validate and slave_validate callbacks.

It would mean refactoring of rtnl_newlink.

Or do you have something more generic in mind? Like adding a new
NLM_F_REQUEST_STRICT flag to nlmsghdr to be used instead of
NLM_F_REQUEST?

Thanks,

 Jiri

-- 
Jiri Benc
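
The "strict" attribute checking proposed above could look roughly like the following userspace sketch. It uses a simplified, hypothetical attribute structure (demo_attr) and parser (demo_parse) rather than struct nlattr and nla_parse; only the rule "reject type <= 0 and type > maxtype when strict" is taken from the proposal.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified attribute: only the type matters here. */
struct demo_attr {
	uint16_t type;
	int      value;
};

/*
 * Fill tb[1..maxtype] from attrs; tb must have maxtype + 1 entries.
 * Lenient mode silently skips unknown types, strict mode rejects them.
 */
int demo_parse(const struct demo_attr **tb, int maxtype,
	       const struct demo_attr *attrs, size_t n, bool strict)
{
	size_t i;
	int t;

	for (t = 0; t <= maxtype; t++)
		tb[t] = NULL;

	for (i = 0; i < n; i++) {
		int type = attrs[i].type;

		if (type <= 0 || type > maxtype) {
			if (strict)
				return -EINVAL;
			continue;
		}
		tb[type] = &attrs[i];
	}
	return 0;
}
```

With a request carrying an attribute the receiver does not know, the lenient call succeeds and ignores it, while the strict call fails, so the sender can tell that the feature is unsupported.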

Re: [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support

2015-10-09 Thread Eric Dumazet

On Thu, 2015-10-08 at 20:40 -0700, Tom Herbert wrote:

> Do we care about losing this optimization? It's not done in IPv4 but I
> can imagine that there is some arguments that address comparisons in
> IPv6 are more expensive hence this might make sense...

I do not think we care. You removed the 'optimization' in IPv4 in commit
ba418fa357a7b ("soreuseport: UDP/IPv4 implementation") back in 2013 and
really no one noticed.

The important factor here is the number of cache lines taken to traverse
the list...


[PATCH net-next 1/1] sfc: fully reset if MC_REBOOT event received without warm_boot_count increment

2015-10-09 Thread Shradha Shah

From: Daniel Pieczko 

On EF10, MC_CMD_VPORT_RECONFIGURE can cause a CODE_MC_REBOOT event
to be sent to a function without incrementing the (adapter-wide)
warm_boot_count.  In this case, the reboot is not detected by the
loop on efx_mcdi_poll_reboot(), so prepare for recovery from an MC
reboot anyway.  When this codepath is run, the MC has always just
rebooted, so this recovery is valid.

The loop on efx_mcdi_poll_reboot() is still required for other MC
reboot cases, so that actions in response to an MC reboot are
performed, such as clearing locally calculated statistics.
Siena NICs are unaffected by this change as the above scenario
does not apply.

Signed-off-by: Shradha Shah 
---
 drivers/net/ethernet/sfc/ef10.c   | 30 +++---
 drivers/net/ethernet/sfc/mcdi.c   | 13 -
 drivers/net/ethernet/sfc/net_driver.h |  1 +
 3 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index ff649eb..78b7b7b 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -1604,6 +1604,22 @@ efx_ef10_mcdi_read_response(struct efx_nic *efx, efx_dword_t *outbuf,
memcpy(outbuf, pdu + offset, outlen);
 }
 
+static void efx_ef10_mcdi_reboot_detected(struct efx_nic *efx)
+{
+   struct efx_ef10_nic_data *nic_data = efx->nic_data;
+
+   /* All our allocations have been reset */
+   efx_ef10_reset_mc_allocations(efx);
+
+   /* The datapath firmware might have been changed */
+   nic_data->must_check_datapath_caps = true;
+
+   /* MAC statistics have been cleared on the NIC; clear the local
+* statistic that we update with efx_update_diff_stat().
+*/
+   nic_data->stats[EF10_STAT_port_rx_bad_bytes] = 0;
+}
+
 static int efx_ef10_mcdi_poll_reboot(struct efx_nic *efx)
 {
struct efx_ef10_nic_data *nic_data = efx->nic_data;
@@ -1623,17 +1639,7 @@ static int efx_ef10_mcdi_poll_reboot(struct efx_nic *efx)
return 0;
 
nic_data->warm_boot_count = rc;
-
-   /* All our allocations have been reset */
-   efx_ef10_reset_mc_allocations(efx);
-
-   /* The datapath firmware might have been changed */
-   nic_data->must_check_datapath_caps = true;
-
-   /* MAC statistics have been cleared on the NIC; clear the local
-* statistic that we update with efx_update_diff_stat().
-*/
-   nic_data->stats[EF10_STAT_port_rx_bad_bytes] = 0;
+   efx_ef10_mcdi_reboot_detected(efx);
 
return -EIO;
 }
@@ -4670,6 +4676,7 @@ const struct efx_nic_type efx_hunt_a0_vf_nic_type = {
.mcdi_poll_response = efx_ef10_mcdi_poll_response,
.mcdi_read_response = efx_ef10_mcdi_read_response,
.mcdi_poll_reboot = efx_ef10_mcdi_poll_reboot,
+   .mcdi_reboot_detected = efx_ef10_mcdi_reboot_detected,
.irq_enable_master = efx_port_dummy_op_void,
.irq_test_generate = efx_ef10_irq_test_generate,
.irq_disable_non_ev = efx_port_dummy_op_void,
@@ -4774,6 +4781,7 @@ const struct efx_nic_type efx_hunt_a0_nic_type = {
.mcdi_poll_response = efx_ef10_mcdi_poll_response,
.mcdi_read_response = efx_ef10_mcdi_read_response,
.mcdi_poll_reboot = efx_ef10_mcdi_poll_reboot,
+   .mcdi_reboot_detected = efx_ef10_mcdi_reboot_detected,
.irq_enable_master = efx_port_dummy_op_void,
.irq_test_generate = efx_ef10_irq_test_generate,
.irq_disable_non_ev = efx_port_dummy_op_void,
diff --git a/drivers/net/ethernet/sfc/mcdi.c b/drivers/net/ethernet/sfc/mcdi.c
index 98d172b..d3f307e 100644
--- a/drivers/net/ethernet/sfc/mcdi.c
+++ b/drivers/net/ethernet/sfc/mcdi.c
@@ -1028,10 +1028,21 @@ static void efx_mcdi_ev_death(struct efx_nic *efx, int rc)
 
/* Consume the status word since efx_mcdi_rpc_finish() won't */
for (count = 0; count < MCDI_STATUS_DELAY_COUNT; ++count) {
-   if (efx_mcdi_poll_reboot(efx))
+   rc = efx_mcdi_poll_reboot(efx);
+   if (rc)
break;
udelay(MCDI_STATUS_DELAY_US);
}
+
+   /* On EF10, a CODE_MC_REBOOT event can be received without the
+* reboot detection in efx_mcdi_poll_reboot() being triggered.
+* If zero was returned from the final call to
+* efx_mcdi_poll_reboot(), the MC reboot wasn't noticed but the
+* MC has definitely rebooted so prepare for the reset.
+*/
+   if (!rc && efx->type->mcdi_reboot_detected)
+   efx->type->mcdi_reboot_detected(efx);
+
mcdi->new_epoch = true;
 
/* Nobody was waiting for an MCDI request, so trigger a reset */
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index c530e1c..ad56231 100644
--- a/drivers/net/ethernet/sfc/net_driver.

[RHEL6.8 net PATCH] ipv4/icmp: redirect messages can use the ingress daddr as source

2015-10-09 Thread Paolo Abeni

This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr forces the
use of the destination address of the packet that caused the
redirect.

The new behaviour closely follows RFC 5798 section 8.1.1 and fixes the
following scenario:

Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.

If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.

The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next hop.

Signed-off-by: Paolo Abeni 
---
 Documentation/networking/ip-sysctl.txt | 19 +--
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/icmp.c|  9 -
 net/ipv4/sysctl_net_ipv4.c |  7 +++
 4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ebe94f2..9983825 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -884,8 +884,8 @@ icmp_ignore_bogus_error_responses - BOOLEAN
 
 icmp_errors_use_inbound_ifaddr - BOOLEAN
 
-   If zero, icmp error messages are sent with the primary address of
-   the exiting interface.
+   If zero, icmp error messages except redirects are sent with the primary
+   address of the exiting interface.
 
If non-zero, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
@@ -897,8 +897,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
then the primary address of the first non-loopback interface that
has one will be used regardless of this setting.
 
+   The source address selection of icmp redirect messages is controlled by
+   icmp_errors_use_inbound_ifaddr.
Default: 0
 
+icmp_redirects_use_orig_daddr - BOOLEAN
+
+   If zero, icmp redirect messages are sent using the address specified for
+   other icmp errors by icmp_errors_use_inbound_ifaddr.
+
+   If non-zero, the message will be sent with the destination address of
+   the packet that caused the icmp redirect.
+   This behaviour is the preferred one on VRRP routers (see RFC 5798
+   section 8.1.1).
+
+   Default: 0
+
+
 igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
Default: 20
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c68926b..46d336a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -74,6 +74,7 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+   int sysctl_icmp_redirects_use_orig_daddr;
 
struct local_ports ip_local_ports;
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5eb8ac..3b57aa4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -642,7 +642,9 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 */
 
saddr = iph->daddr;
-   if (!(rt->rt_flags & RTCF_LOCAL)) {
+   if (!((type == ICMP_REDIRECT) &&
+ net->ipv4.sysctl_icmp_redirects_use_orig_daddr) &&
+   !(rt->rt_flags & RTCF_LOCAL)) {
struct net_device *dev = NULL;
 
rcu_read_lock();
@@ -1205,6 +1207,11 @@ static int __net_init icmp_sk_init(struct net *net)
net->ipv4.sysctl_icmp_ratemask = 0x1818;
net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr = 0;
 
+   /* Control parameter - use the daddr of originating packets as saddr
+* in redirect messages?
+*/
+   net->ipv4.sysctl_icmp_redirects_use_orig_daddr = 0;
+
return 0;
 
 fail:
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..30a531c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -818,6 +818,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "icmp_redirects_use_orig_daddr",
+   .data   = &init_net.ipv4.sysctl_icmp_redirects_use_orig_daddr,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "icmp_ratelimit",
.data   = &init_net.ipv4.sysctl_icmp_ratelimit,
.maxlen = sizeof(int),
-- 
1.8.3.1
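
For readers following the icmp_send() hunk above, the source address
decision can be modelled in isolation. This is a minimal userspace sketch,
not the kernel code: the numeric values mirror ICMP_REDIRECT and RTCF_LOCAL,
while the enum, the SKETCH_ names and icmp_saddr_choice() are hypothetical
and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the kernel's ICMP_REDIRECT, ICMP_DEST_UNREACH and
 * RTCF_LOCAL; every name in this file is hypothetical. */
#define SKETCH_ICMP_DEST_UNREACH 3
#define SKETCH_ICMP_REDIRECT     5
#define SKETCH_RTCF_LOCAL        0x80000000u

enum saddr_choice {
	SADDR_ORIG_DADDR,    /* keep saddr = daddr of the offending packet */
	SADDR_IFACE_PRIMARY, /* fall through to a primary interface address */
};

/* Model of the patched condition: redirects keep the original daddr when
 * the new sysctl is set; otherwise the pre-existing RTCF_LOCAL test
 * decides, exactly as before the patch. */
static enum saddr_choice icmp_saddr_choice(int type, uint32_t rt_flags,
					   bool redirects_use_orig_daddr)
{
	bool keep_for_redirect = (type == SKETCH_ICMP_REDIRECT) &&
				 redirects_use_orig_daddr;

	if (!keep_for_redirect && !(rt_flags & SKETCH_RTCF_LOCAL))
		return SADDR_IFACE_PRIMARY;
	return SADDR_ORIG_DADDR;
}
```

With the sysctl left at 0 the function behaves like the old code for every
ICMP type, which matches the documented default.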


[PATCH net-next 1/1] sfc: replace spinlocks with bit ops for busy poll locking

2015-10-09 Thread Shradha Shah

From: Bert Kenward 

This patch reduces the overhead of locking for busy poll.
Previously the state was protected by a lock, whereas now
it's manipulated solely with atomic operations.

Signed-off-by: Shradha Shah 
---
 drivers/net/ethernet/sfc/efx.c|  31 +---
 drivers/net/ethernet/sfc/net_driver.h | 129 +++---
 2 files changed, 78 insertions(+), 82 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 974637d..8d943cd 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -205,7 +205,7 @@ static void efx_remove_channel(struct efx_channel *channel);
 static void efx_remove_channels(struct efx_nic *efx);
 static const struct efx_channel_type efx_default_channel_type;
 static void efx_remove_port(struct efx_nic *efx);
-static void efx_init_napi_channel(struct efx_channel *channel);
+static int efx_init_napi_channel(struct efx_channel *channel);
 static void efx_fini_napi(struct efx_nic *efx);
 static void efx_fini_napi_channel(struct efx_channel *channel);
 static void efx_fini_struct(struct efx_nic *efx);
@@ -834,7 +834,9 @@ efx_realloc_channels(struct efx_nic *efx, u32 rxq_entries, u32 txq_entries)
rc = efx_probe_channel(channel);
if (rc)
goto rollback;
-   efx_init_napi_channel(efx->channel[i]);
+   rc = efx_init_napi_channel(efx->channel[i]);
+   if (rc)
+   goto rollback;
}
 
 out:
@@ -2054,7 +2056,7 @@ static int efx_ioctl(struct net_device *net_dev, struct ifreq *ifr, int cmd)
  *
  **/
 
-static void efx_init_napi_channel(struct efx_channel *channel)
+static int efx_init_napi_channel(struct efx_channel *channel)
 {
struct efx_nic *efx = channel->efx;
 
@@ -2062,15 +2064,23 @@ static void efx_init_napi_channel(struct efx_channel *channel)
netif_napi_add(channel->napi_dev, &channel->napi_str,
   efx_poll, napi_weight);
napi_hash_add(&channel->napi_str);
-   efx_channel_init_lock(channel);
+   efx_channel_busy_poll_init(channel);
+
+   return 0;
 }
 
-static void efx_init_napi(struct efx_nic *efx)
+static int efx_init_napi(struct efx_nic *efx)
 {
struct efx_channel *channel;
+   int rc;
 
-   efx_for_each_channel(channel, efx)
-   efx_init_napi_channel(channel);
+   efx_for_each_channel(channel, efx) {
+   rc = efx_init_napi_channel(channel);
+   if (rc)
+   return rc;
+   }
+
+   return 0;
 }
 
 static void efx_fini_napi_channel(struct efx_channel *channel)
@@ -2125,7 +2135,7 @@ static int efx_busy_poll(struct napi_struct *napi)
if (!netif_running(efx->net_dev))
return LL_FLUSH_FAILED;
 
-   if (!efx_channel_lock_poll(channel))
+   if (!efx_channel_try_lock_poll(channel))
return LL_FLUSH_BUSY;
 
old_rx_packets = channel->rx_queue.rx_packets;
@@ -3061,7 +3071,9 @@ static int efx_pci_probe_main(struct efx_nic *efx)
if (rc)
goto fail1;
 
-   efx_init_napi(efx);
+   rc = efx_init_napi(efx);
+   if (rc)
+   goto fail2;
 
rc = efx->type->init(efx);
if (rc) {
@@ -3094,6 +3106,7 @@ static int efx_pci_probe_main(struct efx_nic *efx)
efx->type->fini(efx);
  fail3:
efx_fini_napi(efx);
+ fail2:
efx_remove_all(efx);
  fail1:
return rc;
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index c530e1c..19eda8c 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -431,21 +431,8 @@ struct efx_channel {
struct net_device *napi_dev;
struct napi_struct napi_str;
 #ifdef CONFIG_NET_RX_BUSY_POLL
-   unsigned int state;
-   spinlock_t state_lock;
-#define EFX_CHANNEL_STATE_IDLE 0
-#define EFX_CHANNEL_STATE_NAPI (1 << 0)  /* NAPI owns this channel */
-#define EFX_CHANNEL_STATE_POLL (1 << 1)  /* poll owns this channel */
-#define EFX_CHANNEL_STATE_DISABLED (1 << 2)  /* channel is disabled */
-#define EFX_CHANNEL_STATE_NAPI_YIELD   (1 << 3)  /* NAPI yielded this channel */
-#define EFX_CHANNEL_STATE_POLL_YIELD   (1 << 4)  /* poll yielded this channel */
-#define EFX_CHANNEL_OWNED \
-   (EFX_CHANNEL_STATE_NAPI | EFX_CHANNEL_STATE_POLL)
-#define EFX_CHANNEL_LOCKED \
-   (EFX_CHANNEL_OWNED | EFX_CHANNEL_STATE_DISABLED)
-#define EFX_CHANNEL_USER_PEND \
-   (EFX_CHANNEL_STATE_POLL | EFX_CHANNEL_STATE_POLL_YIELD)
-#endif /* CONFIG_NET_RX_BUSY_POLL */
+   unsigned long busy_poll_state;
+#endif
struct efx_special_buffer eventq;
unsigned int eventq_mask;
unsigned int eventq_read_ptr;
@@ -480,98 +467,94 @@ struct efx_channel {
 };
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
-static i

Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Thomas Graf

On 10/08/15 at 08:20pm, Hannes Frederic Sowa wrote:
> Hi Alexei,
> 
> On Thu, Oct 8, 2015, at 07:23, Alexei Starovoitov wrote:
> > The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
> > This toggle defaults to off (0), but can be set true (1).  Once true,
> > bpf programs and maps cannot be accessed from unprivileged process,
> > and the toggle cannot be set back to false.
> 
> This approach seems fine to me.
> 
> I am wondering if it makes sense to somehow allow ebpf access per
> namespace? I currently have no idea how that could work and on which
> namespace type to depend or going with a prctl or even cgroup maybe. The
> rationale behind this is, that maybe some namespaces like openstack
> router namespaces could make usage of advanced ebpf capabilities in the
> kernel, while other namespaces, especially where untrusted third parties
> are hosted, shouldn't have access to those facilities.
> 
> In that way, hosters would be able to e.g. deploy more efficient
> performance monitoring container (which should still need not to run as
> root) while the majority of the users has no access to that. Or think
> about routing instances in some namespaces, etc. etc.

The standard way of granting privileges like this for containers is
through CAP_ which does seem like a good fit for this as well and would
also solve your mentioned openstack use case.

[PATCH net-next v4 2/2] tipc: remove invalid ip_rt_put

2015-10-09 Thread Andreas Schultz

udp_tunnel_xmit_skb() will free the skb and release the rt->dst
reference in the error case. There is no need to release it again here
(and doing so would actually trigger a warning).
This problem was not visible before, as udp_tunnel_xmit_skb() would
never return a value < 0

Signed-off-by: Andreas Schultz 
Acked-by: Jiri Benc 
---
 net/tipc/udp_media.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index c170d31..de8e110 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -181,10 +181,6 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
  dst->ipv4.s_addr, 0, ttl, 0,
  src->udp_port, dst->udp_port,
  false, true);
-   if (err < 0) {
-   ip_rt_put(rt);
-   goto tx_error;
-   }
 #if IS_ENABLED(CONFIG_IPV6)
} else {
struct dst_entry *ndst;
-- 
2.1.4


[PATCH net-next v4 1/2] fix return of iptunnel_xmit

2015-10-09 Thread Andreas Schultz

All users of iptunnel_xmit expect the return value to be the packet
length on success (>0), negative for a tx error and zero for a tx
dropped error. In cset 0e6fbc5b6c6218987c93b8c7ca60cf786062899d the
negative return case was lost.

This bug was introduced when the ip_tunnel_core code was refactored.

Fixes: 0e6fbc5b6c6218987c93b8c7ca60cf786062899d
Signed-off-by: Andreas Schultz 
Acked-by: Jiri Benc 
Acked-by: Pravin B Shelar 
---
Change in v2:
 - remove unused variable pkt_len

Change in v3:
 - reworked based on comment from Jiri Benc

Change in v4:
 - rebased to net-next to avoid merge conflicts
 - added Acked-By from Jiri Benc and Pravin B Shelar

---
 net/ipv4/ip_tunnel_core.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 6cb9009..453d569 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -80,9 +80,12 @@ int iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
__ip_select_ident(net, iph, skb_shinfo(skb)->gso_segs ?: 1);
 
err = ip_local_out(net, sk, skb);
-   if (unlikely(net_xmit_eval(err)))
-   pkt_len = 0;
-   return pkt_len;
+   if (likely(net_xmit_eval(err) == 0))
+   return pkt_len;
+   if (err < 0)
+   return err;
+
+   return 0;
 }
 EXPORT_SYMBOL_GPL(iptunnel_xmit);
 
-- 
2.1.4
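
The return convention described in the commit message (packet length on
success, negative on a tx error, zero on a drop) can be sketched in
userspace as follows. This is an illustration under stated assumptions, not
kernel code: the constant mirrors NET_XMIT_CN, sketch_net_xmit_eval()
mimics the net_xmit_eval() macro, and the struct and function names are
hypothetical.

```c
#include <stdint.h>

#define SKETCH_NET_XMIT_DROP 1 /* mirrors NET_XMIT_DROP */
#define SKETCH_NET_XMIT_CN   2 /* mirrors NET_XMIT_CN */

/* Congestion notification is not treated as a failure. */
static int sketch_net_xmit_eval(int err)
{
	return err == SKETCH_NET_XMIT_CN ? 0 : err;
}

/* Same three-way mapping as the patched tail of iptunnel_xmit(). */
static int sketch_xmit_result(int err, int pkt_len)
{
	if (sketch_net_xmit_eval(err) == 0)
		return pkt_len; /* success: report bytes sent */
	if (err < 0)
		return err;     /* tx error: propagate errno */
	return 0;               /* positive NET_XMIT code: dropped */
}

/* Hypothetical counters showing how a caller could consume the result. */
struct sketch_tx_stats {
	uint64_t tx_packets;
	uint64_t tx_bytes;
	uint64_t tx_errors;
	uint64_t tx_dropped;
};

static void sketch_update_tx_stats(struct sketch_tx_stats *s, int ret)
{
	if (ret > 0) {
		s->tx_bytes += (uint64_t)ret;
		s->tx_packets++;
	} else if (ret < 0) {
		s->tx_errors++;
	} else {
		s->tx_dropped++;
	}
}
```

Before the fix, the negative branch of the caller could never be reached,
because every failure was collapsed to zero and counted as a drop.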


[PATCH net-next v4 0/2] fix tunnel statistics handling

2015-10-09 Thread Andreas Schultz

These are the two changes I sent earlier, rebased to net-next to avoid
a merge conflict. The statistics bug they fix has been noted before, so
I can wait a bit longer in net-next.

The first patch changes iptunnel_xmit to return the real err value when
err < 0. Previously, the return value was always >= 0.

The second patch fixes the error handling of iptunnel_xmit in tipc.

Andreas Schultz (2):
  fix return of iptunnel_xmit
  tipc: remove invalid ip_rt_put

 net/ipv4/ip_tunnel_core.c | 9 ++---
 net/tipc/udp_media.c  | 4 
 2 files changed, 6 insertions(+), 7 deletions(-)

-- 
2.1.4


Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-09 Thread Jiang Liu

On 2015/10/9 17:08, Kamezawa Hiroyuki wrote:
> On 2015/10/09 14:52, Jiang Liu wrote:
>> On 2015/10/9 4:20, Andrew Morton wrote:
>>> On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes
>>>  wrote:
>>>
>>>> On Wed, 19 Aug 2015, Patil, Kiran wrote:
>>>>
>>>>> Acked-by: Kiran Patil
>>>>
>>>> Where's the call to preempt_disable() to prevent kernels with preemption
>>>> from making numa_node_id() invalid during this iteration?
>>>
>>> David asked this question twice, received no answer and now the patch
>>> is in the maintainer tree, destined for mainline.
>>>
>>> If I was asked this question I would respond
>>>
>>>The use of numa_mem_id() is racy and best-effort.  If the unlikely
>>>race occurs, the memory allocation will occur on the wrong node, the
>>>overall result being very slightly suboptimal performance.  The
>>>existing use of numa_node_id() suffers from the same issue.
>>>
>>> But I'm not the person proposing the patch.  Please don't just ignore
>>> reviewer comments!
>> Hi Andrew,
>> Apologize for the slow response due to personal reasons!
>> And thanks for answering the question from David. To be honest,
>> I didn't know how to answer this question before. Actually this
>> question has puzzled me for a long time when dealing with memory
>> hot-removal. For normal cases, it only causes sub-optimal memory
>> allocation if schedule event happens between querying NUMA node id
>> and calling alloc_pages_node(). But what happens if system run into
>> following execution sequence?
>> 1) node = numa_mem_id();
>> 2) memory hot-removal event triggers
>> 2.1) remove affected memory
>> 2.2) reset pgdat to zero if node becomes empty after memory removal
> 
> I'm sorry if I misunderstand something.
> After commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08, there is no
> memset().
Hi Kamezawa,
Thanks for the information. That commit solves the issue that had been
puzzling me. With this change applied, things should work as expected.
It seems it would be better to enhance __build_all_zonelists() to handle
those offlined empty nodes too, but that really doesn't make too much
difference :)
Thanks again for the info!
Thanks!
Gerry