Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, Apr 1, 2016 at 7:16 PM, David Miller wrote:
> From: Alexander Duyck
> Date: Fri, 1 Apr 2016 12:58:41 -0700
>
> > RFC 6864 is pretty explicit about this, IPv4 ID used only for
> > fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1
> >
> > The goal with this change is to try and keep most of the existing
> > behavior intact without violating this rule. I would think the
> > sequence number should give you the ability to infer a drop in the
> > case of TCP. In the case of UDP tunnels we are now getting a bit more
> > data since we were ignoring the outer IP header ID before.
>
> When retransmits happen, the sequence numbers are the same. But you
> can then use the IP ID to see exactly what happened. You can even
> tell if multiple retransmits got reordered.
>
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

I'm not sure the IP ID is really all that useful. On a 10G link it takes about 80ms for it to wrap using an MTU of 1500, and that value keeps dropping as we move up to 40G and 100G. That was one of the motivations behind RFC 6864: keeping the IP IDs truly unique starts to become a bottleneck.

In addition, while this change would allow us to combine disjoint IP IDs, I don't think you would lose the retransmission, as there would likely be a gap in sequence numbers that would cause the flow to be flushed from GRO; and it isn't as if we can prepend the retransmit to the aggregated frame, we are always adding to the tail.

I would think the most likely case where this change would foul up any IP IDs is the garbage-in/garbage-out case, like an IPv6-to-IPv4 translation that uses a fixed IP ID of 0. I agree that it isn't desirable, but at the same time there is nothing in RFC 6864 to say we cannot do that.
In addition, it would likely help to mitigate some of the impact of the IP ID on things like SLHC compression, since the resegmented frames would have incrementing IDs, which would reduce the number of times the IP ID field has to be updated. - Alex
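Alex's "about 80ms" figure checks out arithmetically. A quick sketch of the calculation (not from the thread; the 20 bytes of preamble, SFD, and interframe gap per frame are the standard Ethernet wire overhead):

```c
/* Time for the 16-bit IPv4 ID to wrap when every packet of a flow
 * increments it: 65536 packets at the link's maximum frame rate.
 * Each frame occupies (frame_bytes + 20) bytes on the wire. */
static double ip_id_wrap_ms(double link_gbps, int frame_bytes)
{
	double pps = link_gbps * 1e9 / ((frame_bytes + 20) * 8.0);

	return 65536.0 / pps * 1000.0;
}
```

At 10 Gb/s with 1500-byte frames this gives roughly 80 ms; at 100 Gb/s it drops below 10 ms, which is the scaling problem RFC 6864 reacts to.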
Re: [PATCH] RDS: sync congestion map updating
On 4/1/16 6:14 PM, Leon Romanovsky wrote:
On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote: (cc-ing netdev)
On 3/30/2016 7:59 PM, Wengang Wang wrote:
On 2016-03-31 09:51, Wengang Wang wrote:
On 2016-03-31 01:16, santosh shilimkar wrote: Hi Wengang,
On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:

The problem found is that some among a lot of parallel RDS communications hang. In my test, ten or so among 33 communications hang. The send requests got -ENOBUF errors, meaning the peer socket (port) is congested. But meanwhile, the peer socket (port) is not congested.

The congestion map updating can happen in two paths: one is the rds_recvmsg path and the other is when it receives packets from the hardware. There is no synchronization when updating the congestion map, so a bit operation (clearing) in the rds_recvmsg path can be lost to another bit operation (setting) in the hardware packet receiving path.

To be more detailed: here, the two paths (user calls recvmsg and hardware receives data) are for different rds socks, thus the rds_sock->rs_recv_lock is not helpful to sync the updating of the congestion map.

For archive purposes, let me try to conclude the thread. I synced with Wengang offlist and came up with the fix below. I was under the impression that __set_bit_le() was the atomic version. After fixing it as in the patch (end of the email), the bug gets addressed. I will probably send this as a fix for stable as well.

From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar
Date: Wed, 30 Mar 2016 23:26:47 -0700
Subject: [PATCH] RDS: Fix the atomicity for congestion map update

Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via the receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Let's use atomic ops to avoid such an issue.
Full credit to Wengang for finding the issue, analysing it, and also pointing out the offending code with a spinlock-based fix.

I'm glad that you solved the issue without spinlocks. Out of curiosity, I see that this patch needs to be sent to Dave and applied by him. Is that right?

Right. I was planning to send this one along with one more fix together on netdev for Dave to pick up.

➜ linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
Santosh Shilimkar (supporter:RDS - RELIABLE DATAGRAM SOCKETS)
"David S. Miller" (maintainer:NETWORKING [GENERAL])
netdev@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-r...@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
rds-de...@oss.oracle.com (moderated list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-ker...@vger.kernel.org (open list)

Signed-off-by: Wengang Wang
Signed-off-by: Santosh Shilimkar
Reviewed-by: Leon Romanovsky

Thanks for review.
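The race and the fix can be sketched outside the kernel with GCC's atomic builtins standing in for set_bit()/clear_bit() (the map layout and names here are hypothetical, not the RDS code): a plain `word |= bit` in one path can read the word, lose a concurrent `word &= ~bit` from the other path, and write the stale value back, whereas an atomic read-modify-write cannot.

```c
#include <stdint.h>

/* Toy congestion bitmap: one bit per port, 64 ports per word. */
static uint64_t cong_map[16];

/* Atomic set, analogous to set_bit(): safe against a concurrent
 * clear of another bit in the same word. */
static void cong_set_bit(unsigned int port)
{
	__atomic_fetch_or(&cong_map[port / 64], 1ULL << (port % 64),
			  __ATOMIC_RELAXED);
}

/* Atomic clear, analogous to clear_bit(). A non-atomic version
 * (__set_bit_le-style plain |=/&=) is only safe when a lock already
 * serializes all writers to the word. */
static void cong_clear_bit(unsigned int port)
{
	__atomic_fetch_and(&cong_map[port / 64], ~(1ULL << (port % 64)),
			   __ATOMIC_RELAXED);
}

static int cong_test_bit(unsigned int port)
{
	return (__atomic_load_n(&cong_map[port / 64], __ATOMIC_RELAXED)
		>> (port % 64)) & 1;
}
```

Two ports that hash to the same word (say 3 and 67) can now be set and cleared from different threads without one update erasing the other.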
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> > bpf programs require a skb context to navigate the packet, build a
> > percpu fake skb with the minimal fields. This avoids the costly
> > allocation for packets that end up being dropped.
> >
> > +	/* A bpf program gets first chance to drop the packet. It may
> > +	 * read bytes but not past the end of the frag. A non-zero
> > +	 * return indicates packet should be dropped.
> > +	 */
> > +	if (prog) {
> > +		struct ethhdr *ethh;
> > +
> > +		ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +					 frags[0].page_offset);
> > +		if (mlx4_call_bpf(prog, ethh, length)) {
> > +			priv->stats.rx_dropped++;
> > +			goto next;
> > +		}
> > +	}
> > +
>
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame.
>
> Still you pass a single fragment but total 'length' here : BPF program
> can read past the end of this first fragment and panic the box.
>
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.

Yep. My reading of that part was that num_frags > 1 is only for large MTU sizes, so if we limit this to num_frags == 1 only for now, we should be ok, and it's still applicable for most of the use cases?

> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.

Yes, good point. I bet it was copy-pasted from a few lines below. Should be trivial to convert it to percpu.

> This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.

Yes, it could be per ring as well. My guess is we're hitting the 14.5 Mpps limit both for an empty bpf program and for a program that actually looks into the packet because we're hitting the 10G PHY limit of the 40G NIC, since physically a 40G NIC consists of four 10G PHYs.
There will be the same problem with 100G and 50G NICs: both will be hitting the 25G PHY limit. We need to vary the packets somehow. Hopefully Or can explain that bit of HW design. Jesper's experiments with mlx4 showed the same 14.5 Mpps limit when the sender was blasting the same packet over and over again. Great to see the experiments converging.
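For reference, the ~14.5 Mpps numbers in this thread sit right at 10G line rate for minimum-size frames. A sketch of the standard calculation (not from the thread; the 20 bytes are preamble, SFD, and interframe gap):

```c
/* Theoretical maximum packet rate of an Ethernet link, in Mpps.
 * A frame occupies (frame_bytes + 20) bytes on the wire, so a
 * minimum 64-byte frame costs 84 bytes = 672 bit times. */
static double line_rate_mpps(double link_gbps, int frame_bytes)
{
	return link_gbps * 1e9 / ((frame_bytes + 20) * 8.0) / 1e6;
}
```

10G with 64-byte frames gives about 14.88 Mpps, so the observed 14.5 Mpps drop rate is essentially one 10G PHY's worth of line rate.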
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Sat, 2016-04-02 at 10:19 +0800, Herbert Xu wrote:
> On Fri, Apr 01, 2016 at 07:15:33PM -0700, Eric Dumazet wrote:
> > On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> > >
> > > We could easily fix that by adding a feature bit to control this,
> > > something like SKB_GSO_TCP_FIXEDID.
> >
> > I understood the patch allowed to aggregate 4 segments having
> >
> > ID=12 ID=40 ID=80 ID=1000
>
> Right. But I haven't seen any justification that we should aggregate
> such packets at all. The only valid case that I have seen is for
> the case of unchanging IDs, no?

Presumably repeats of "DF=1 ID=0" should be what we really need to handle.

On my wish list, having some reordering logic in GRO would be far more interesting than these minor details ;)
Re: [PATCH v2 net-next] net/core: generic support for disabling netdev features down stack
Hi,

Sorry for digging up an old patch, but... ;-)

dev_disable_lro() is a leftover from ancient times. If you read commit 27660515a, there is a hint where it should go. Please read on if you'd like to fix this properly.

2015-11-03 3:55 GMT+01:00 Jarod Wilson:
> There are some netdev features, which when disabled on an upper device,
> such as a bonding master or a bridge, must be disabled and cannot be
> re-enabled on underlying devices.
[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6288,6 +6288,44 @@ static void rollback_registered(struct net_device *dev)
> 	list_del(&single);
> }
>
> +static netdev_features_t netdev_sync_upper_features(struct net_device *lower,
> +	struct net_device *upper, netdev_features_t features)
> +{
> +	netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
> +	netdev_features_t feature;
> +
> +	for_each_netdev_feature(&upper_disables, feature) {
> +		if (!(upper->wanted_features & feature)
> +		    && (features & feature)) {
> +			netdev_dbg(lower, "Dropping feature %pNF, upper dev %s has it off.\n",
> +				   &feature, upper->name);
> +			features &= ~feature;
> +		}
> +	}

You could do this once:

	disable_features = ~upper->features & features & NETIF_F_UPPER_DISABLES;
	if (features & disable_features)
		netdev_dbg(...)
	features &= ~disable_features;

> +
> +	return features;
> +}
[...]
> @@ -6370,6 +6410,10 @@ int __netdev_update_features(struct net_device *dev)
> 	/* driver might be less strict about feature dependencies */
> 	features = netdev_fix_features(dev, features);
>
> +	/* some features can't be enabled if they're off on an upper device */
> +	netdev_for_each_upper_dev_rcu(dev, upper, iter)
> +		features = netdev_sync_upper_features(dev, upper, features);
> +
> 	if (dev->features == features)
> 		return 0;
>

This should go into netdev_fix_features(), as that is the one single place where all feature dependencies are verified.
[...]
> @@ -6386,6 +6430,12 @@ int __netdev_update_features(struct net_device *dev)
> 		return -1;
> 	}
>
> +	/* some features must be disabled on lower devices when disabled
> +	 * on an upper device (think: bonding master or bridge)
> +	 */
> +	netdev_for_each_lower_dev(dev, lower, iter)
> +		netdev_sync_lower_features(dev, lower, features);
> +
[...]

This should be equal in resulting flags to:

	netdev_for_each_lower_dev(dev, lower, ...)
		netdev_update_features(lower);

because netdev_fix_features(lower) will see the already-changed dev->features.

Best Regards,
Michał Mirosław
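Michał's one-shot mask form can be checked in isolation. A sketch with hypothetical feature bits (the names below stand in for netdev_features_t values and NETIF_F_UPPER_DISABLES; they are not the kernel's definitions):

```c
typedef unsigned long long features_t;

#define F_LRO		(1ULL << 0)	/* hypothetical feature bits */
#define F_GRO		(1ULL << 1)
#define UPPER_DISABLES	F_LRO		/* stand-in for NETIF_F_UPPER_DISABLES */

/* Equivalent of the per-feature loop: clear every UPPER_DISABLES
 * feature that the upper device has off but the lower still requests. */
static features_t sync_upper(features_t upper_features, features_t features)
{
	features_t disable = ~upper_features & features & UPPER_DISABLES;

	return features & ~disable;
}
```

If the upper has LRO off, a lower requesting LRO|GRO ends up with only GRO; features outside UPPER_DISABLES (GRO here) are never touched, which is the whole point of the mask.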
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote:
> From: Alexander Duyck
> Date: Fri, 1 Apr 2016 12:58:41 -0700
>
> > RFC 6864 is pretty explicit about this, IPv4 ID used only for
> > fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1
> >
> > The goal with this change is to try and keep most of the existing
> > behavior intact without violating this rule. I would think the
> > sequence number should give you the ability to infer a drop in the
> > case of TCP. In the case of UDP tunnels we are now getting a bit more
> > data since we were ignoring the outer IP header ID before.
>
> When retransmits happen, the sequence numbers are the same. But you
> can then use the IP ID to see exactly what happened. You can even
> tell if multiple retransmits got reordered.
>
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

Yes, our team (including Van Jacobson ;) ) would be sad not to have sequential IP IDs (but then we don't have them for IPv6 ;) )

Since the cost of generating them is pretty small (the inet->inet_id counter), we probably should keep them in Linux. Their usage will phase out as IPv6 wins the Internet war...
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, Apr 01, 2016 at 07:15:33PM -0700, Eric Dumazet wrote:
> On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> >
> > We could easily fix that by adding a feature bit to control this,
> > something like SKB_GSO_TCP_FIXEDID.
>
> I understood the patch allowed to aggregate 4 segments having
>
> ID=12 ID=40 ID=80 ID=1000

Right. But I haven't seen any justification that we should aggregate such packets at all. The only valid case that I have seen is the case of unchanging IDs, no?

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Alexander Duyck
Date: Fri, 1 Apr 2016 12:58:41 -0700

> RFC 6864 is pretty explicit about this, IPv4 ID used only for
> fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1
>
> The goal with this change is to try and keep most of the existing
> behavior intact without violating this rule. I would think the
> sequence number should give you the ability to infer a drop in the
> case of TCP. In the case of UDP tunnels we are now getting a bit more
> data since we were ignoring the outer IP header ID before.

When retransmits happen, the sequence numbers are the same. But you can then use the IP ID to see exactly what happened. You can even tell if multiple retransmits got reordered.

Eric's use case is extremely useful, and flat out eliminates ambiguity when analyzing TCP traces.
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> Eric Dumazet wrote:
> >
> > I do not particularly care, but it is worth mentioning that GRO+TSO
> > would not be idempotent anymore.
>
> We could easily fix that by adding a feature bit to control this,
> something like SKB_GSO_TCP_FIXEDID.

I understood the patch to allow aggregating 4 segments having

ID=12 ID=40 ID=80 ID=1000

-> resulting in a GRO packet with 4 segments and ID=12.

Would TSO generate 12,13,14,15 or 12,12,12,12 with your flag?

(Before the patch, no aggregation took place, and the exact same packets were forwarded with IDs 12, 40, 80, 1000.)

As I said, I am not sure we should care :)
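The two resegmentation behaviours Eric contrasts can be written out directly (a hypothetical helper for illustration, not kernel code): with a FIXEDID-style flag TSO would replicate the super-packet's ID into every segment, while the default behaviour increments per segment.

```c
#include <stdint.h>

/* Fill the IPv4 IDs that TSO would emit for nsegs segments of one
 * super-packet whose first ID is first_id. fixed_id models the
 * proposed SKB_GSO_TCP_FIXEDID-style flag. */
static void tso_fill_ids(uint16_t first_id, int fixed_id,
			 uint16_t *out, int nsegs)
{
	for (int i = 0; i < nsegs; i++)
		out[i] = fixed_id ? first_id : (uint16_t)(first_id + i);
}
```

For a 4-segment GRO packet with ID=12, the flag decides between 12,12,12,12 and 12,13,14,15 — which is exactly Eric's question, since neither reproduces the original 12, 40, 80, 1000.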
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> +	/* A bpf program gets first chance to drop the packet. It may
> +	 * read bytes but not past the end of the frag. A non-zero
> +	 * return indicates packet should be dropped.
> +	 */
> +	if (prog) {
> +		struct ethhdr *ethh;
> +
> +		ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +					 frags[0].page_offset);
> +		if (mlx4_call_bpf(prog, ethh, length)) {
> +			priv->stats.rx_dropped++;
> +			goto next;
> +		}
> +	}
> +

1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet frame.

Still you pass a single fragment but the total 'length' here: the BPF program can read past the end of this first fragment and panic the box.

Please take a look at mlx4_en_complete_rx_desc() and you'll see what I mean.

2) priv->stats.rx_dropped is shared by all the RX queues -> false sharing.

This is probably the right time to add a rx_dropped field in struct mlx4_en_rx_ring, since you guys want to drop 14 Mpps, and 50 Mpps on higher speed links.
Re: [RESEND PATCH V4 2/3] IB/hns: Add HiSilicon RoCE driver support
On Fri, Apr 01, 2016 at 05:21:31PM +0800, Lijun Ou wrote:
> The driver for HiSilicon RoCE is a platform driver.
> The driver will support multiple versions of hardware. Currently only "v1"
> for the hip06 SoC is supported.
> The driver includes two parts: common driver and hardware-specific
> operations. hns_roce_v1_hw.c and hns_roce_v1_hw.h are files for
> hardware-specific operations only for the v1 engine; the other files (.c and .h)
> are for common algorithms and common hardware operations.
>
> Signed-off-by: Lijun Ou
> Signed-off-by: Wei Hu(Xavier)
> Signed-off-by: Znlong
> ---
>  MAINTAINERS                                        |    8 +
>  drivers/infiniband/Kconfig                         |    1 +
>  drivers/infiniband/hw/Makefile                     |    1 +
>  drivers/infiniband/hw/hisilicon/hns/Kconfig        |   10 +
>  drivers/infiniband/hw/hisilicon/hns/Makefile       |    9 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c  |  110 +

We do not add the name of the company (hisilicon) to the paths of InfiniBand HW drivers:

drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c ---> drivers/infiniband/hw/hns/hns_roce_ah.c

>  .../infiniband/hw/hisilicon/hns/hns_roce_alloc.c   |  239 ++
        ^^
Please fix your paths.
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.c |  338 +++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.h |   80 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_common.h  |  308 +++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cq.c  |  436 +++
>  .../infiniband/hw/hisilicon/hns/hns_roce_device.h  |  794 ++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.c  |  758 ++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.h  |  132 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.c |  578 ++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.h |  112 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_main.c    | 1097 ++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_mr.c  |  605 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_pd.c  |  124 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_qp.c  |  841 ++
>  .../infiniband/hw/hisilicon/hns/hns_roce_user.h    |   31 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.c   | 2832 ++++
>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.h   |  985 +++
        ^^
Do you support v1 of RoCE, or v1 of your HW?

>  23 files changed, 10429 insertions(+)

Please appreciate the effort needed to review such a large patch, and invest the time and effort to divide this into a number of small, easy-to-review patches.
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
Eric Dumazet wrote:
>
> I do not particularly care, but it is worth mentioning that GRO+TSO
> would not be idempotent anymore.

We could easily fix that by adding a feature bit to control this, something like SKB_GSO_TCP_FIXEDID.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Question on rhashtable in worst-case scenario.
On Fri, Apr 01, 2016 at 11:34:10PM +0200, Johannes Berg wrote:
>
> I was thinking about that one - it's not obvious to me from the code
> how this "explicitly checking for dups" would be done, or let's say how
> rhashtable differentiates. But since it seems to work for Ben until
> hitting a certain number of identical keys, surely that's just me not
> understanding the code rather than anything else :)

It's really simple: rhashtable_insert_fast does not check for dups, while rhashtable_lookup_insert_* do.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
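Herbert's distinction can be illustrated with a toy bucket chain (plain C, not the rhashtable API): the fast variant links the new entry in blindly, while the lookup-insert variant searches the chain first and rejects a duplicate key with -EEXIST.

```c
#include <stddef.h>

/* One hash bucket's chain, standing in for an rhashtable bucket. */
struct node {
	int key;
	struct node *next;
};

/* insert_fast-style: no duplicate check, always succeeds. */
static void insert_fast(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/* lookup_insert-style: scan for the key first, refuse duplicates. */
static int lookup_insert(struct node **head, struct node *n)
{
	for (struct node *p = *head; p; p = p->next)
		if (p->key == n->key)
			return -17;	/* -EEXIST */
	insert_fast(head, n);
	return 0;
}
```

With insert_fast, Ben's worst case (many identical keys) simply grows one chain; with lookup_insert, the second entry with the same key is rejected up front.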
[RFC PATCH 3/5] rtnl: add option for setting link bpf prog
Sets the bpf program represented by fd as an early filter in the rx path of the netdev. The fd must have been created as BPF_PROG_TYPE_PHYS_DEV. Providing a negative value as fd clears the program. Getting the fd back via rtnl is not possible; therefore reading this value merely provides a bool of whether the program is valid on the link or not.

Signed-off-by: Brenden Blanco
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c488066..08d66a3 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -155,6 +155,7 @@ enum {
 	IFLA_PROTO_DOWN,
 	IFLA_GSO_MAX_SEGS,
 	IFLA_GSO_MAX_SIZE,
+	IFLA_BPF_FD,
 	__IFLA_MAX
 };

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f206677..1c4a603 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -909,6 +909,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_link_get_af_size(dev, ext_filter_mask) /* IFLA_AF_SPEC */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
+	       + nla_total_size(4) /* IFLA_BPF_FD */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 }

@@ -1241,6 +1242,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
+	    nla_put_s32(skb, IFLA_BPF_FD, dev->bpf_valid) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
 		goto nla_put_failure;

@@ -1374,6 +1376,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_PROTO_DOWN]	= { .type = NLA_U8 },
+	[IFLA_BPF_FD]		= { .type = NLA_S32 },
 };

 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2028,6 +2031,13 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_NOTIFY;
 	}

+	if (tb[IFLA_BPF_FD]) {
+		err = dev_change_bpf_fd(dev, nla_get_s32(tb[IFLA_BPF_FD]));
+		if (err)
+			goto errout;
+		status |= DO_SETLINK_NOTIFY;
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if (status & DO_SETLINK_NOTIFY)
--
2.8.0
[RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since bpf programs require a skb context to navigate the packet, build a percpu fake skb with the minimal fields. This avoids the costly allocation for packets that end up being dropped.

Since mlx4 is so far the only user of this pseudo skb, the build function is defined locally.

Signed-off-by: Brenden Blanco
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
 3 files changed, 81 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c..89ca787 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */

+#include
 #include
 #include
 #include
@@ -1966,6 +1967,9 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 		mlx4_en_destroy_cq(priv, &priv->rx_cq[i]);
 	}

+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 }

 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
@@ -2456,6 +2460,61 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }

+static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
+
+static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
+				     unsigned int length)
+{
+	/* data_len is intentionally not set here so that skb_is_nonlinear()
+	 * returns false
+	 */
+
+	skb->len = length;
+	skb->head = data;
+	skb->data = data;
+}
+
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
+{
+	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
+	int ret;
+
+	build_pseudo_skb_for_bpf(skb, data, length);
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)skb);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int mlx4_bpf_set(struct net_device *dev, int fd)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *prog = NULL, *old_prog;
+
+	if (fd >= 0) {
+		prog = bpf_prog_get(fd);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+	}
+
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	priv->dev->bpf_valid = !!prog;
+
+	return 0;
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };

 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };

 struct mlx4_en_bond {

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 86bcfe5..03fe005 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;

+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));

+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag. A non-zero
+		 * return indicates packet should be dropped.
+		 */
+
[RFC PATCH 0/5] Add driver bpf hook for early packet drop
This patch set introduces new infrastructure for programmatically processing packets in the earliest stages of rx, as part of an effort others are calling Express Data Path (XDP) [1]. We start this effort by introducing a new bpf program type for early packet filtering, before even an skb has been allocated. With this, we hope to enable line rate filtering, with this initial implementation providing a drop/allow action only.

Patch 1 introduces the new prog type and helpers for validating the bpf program. A new userspace struct is defined containing only len as a field, with others to follow in the future.

In patch 2, we create a new ndo to pass the fd to supporting drivers.

In patch 3, we expose a new rtnl option to userspace.

In patch 4, we enable support in the mlx4 driver. No skb allocation is required; instead a static percpu skb is kept in the driver and minimally initialized for each driver frag.

In patch 5, we create a sample drop-and-count program. With a single core, we achieved a ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data access, a bpf array lookup, and an increment. Interestingly, accessing packet data from the program did not have a noticeable impact on performance. Even so, future enhancements to prefetching / batching / page-allocs should hopefully improve the performance in this path.
[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

Brenden Blanco (5):
  bpf: add PHYS_DEV prog type for early driver filter
  net: add ndo to set bpf prog in adapter rx
  rtnl: add option for setting link bpf prog
  mlx4: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |  18 +++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
 include/linux/netdevice.h                      |   8 ++
 include/uapi/linux/bpf.h                       |   5 +
 include/uapi/linux/if_link.h                   |   1 +
 kernel/bpf/verifier.c                          |   1 +
 net/core/dev.c                                 |  12 ++
 net/core/filter.c                              |  68 +++++++
 net/core/rtnetlink.c                           |  10 ++
 samples/bpf/Makefile                           |   4 +
 samples/bpf/bpf_load.c                         |   8 ++
 samples/bpf/netdrvx1_kern.c                    |  26 +++++
 samples/bpf/netdrvx1_user.c                    | 155 +++++++++++++
 14 files changed, 379 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

--
2.8.0
[RFC PATCH 5/5] Add sample for adding simple drop program to link
Add a sample program that only drops packets at the BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program, the observed single-core rate is ~14.6Mpps. Other tests were run; for instance, without the dropcnt increment or without reading from the packet header, the packet rate was mostly unchanged.

$ perf record -a samples/bpf/netdrvx1 $(
---
 samples/bpf/Makefile        |   4 ++
 samples/bpf/bpf_load.c      |   8 +++
 samples/bpf/netdrvx1_kern.c |  26 ++++
 samples/bpf/netdrvx1_user.c | 155 ++++++++++
 4 files changed, 193 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc..ad36bb8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -19,6 +19,7 @@ hostprogs-y += lathist
 hostprogs-y += offwaketime
 hostprogs-y += spintest
 hostprogs-y += map_perf_test
+hostprogs-y += netdrvx1

 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -38,6 +39,7 @@ lathist-objs := bpf_load.o libbpf.o lathist_user.o
 offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
+netdrvx1-objs := bpf_load.o libbpf.o netdrvx1_user.o

 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -56,6 +58,7 @@ always += lathist_kern.o
 always += offwaketime_kern.o
 always += spintest_kern.o
 always += map_perf_test_kern.o
+always += netdrvx1_kern.o

 HOSTCFLAGS += -I$(objtree)/usr/include
@@ -75,6 +78,7 @@ HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
+HOSTLOADLIBES_netdrvx1 += -lelf

 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 58f86bd..9308fbc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -49,6 +49,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_socket = strncmp(event, "socket", 6) == 0;
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_phys_dev = strncmp(event, "phys_dev", 8) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -63,6 +64,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
 	} else if (is_kprobe || is_kretprobe) {
 		prog_type = BPF_PROG_TYPE_KPROBE;
+	} else if (is_phys_dev) {
+		prog_type = BPF_PROG_TYPE_PHYS_DEV;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -76,6 +79,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	prog_fd[prog_cnt++] = fd;

+	if (is_phys_dev)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -304,6 +310,7 @@ int load_bpf_file(char *path)
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
+			    memcmp(shname_prog, "phys_dev", 8) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -320,6 +327,7 @@ int load_bpf_file(char *path)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "phys_dev", 8) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}

diff --git a/samples/bpf/netdrvx1_kern.c b/samples/bpf/netdrvx1_kern.c
new file mode 100644
index 000..9837d8a
--- /dev/null
+++ b/samples/bpf/netdrvx1_kern.c
@@ -0,0 +1,26 @@
+#include
+#include
+#include
+#include
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+SEC("phys_dev1")
+int bpf_prog1(struct xdp_metadata *ctx)
+{
+	int index = load_byte(ctx, ETH_HLEN + offsetof(struct iphdr, protocol));
+	long *value;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return 1;
+}
+char _license[] SEC("license") = "GPL";

diff --git a/samples/bpf/netdrvx1_user.c b/samples/bpf/netdrvx1_user.c
new file mode 100644
index 000..9e6ec9a
--- /dev/null
+++ b/samples/bpf/netdrvx1_user.c
@@
[RFC PATCH 2/5] net: add ndo to set bpf prog in adapter rx
Add a new netdev op for drivers implementing the BPF_PROG_TYPE_PHYS_DEV
filter to get configuration. Since the fd is only used by the driver to
fetch the prog, the netdev should just keep a bit to indicate the
program is valid.

Signed-off-by: Brenden Blanco
---
 include/linux/netdevice.h | 8
 net/core/dev.c            | 12
 2 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d0..c46e2e3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1102,6 +1102,11 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int (*ndo_bpf_set)(struct net_device *dev, int fd);
+ *	This function is used to set or clear a bpf program used in the
+ *	earliest stages of packet rx. The fd must be a program loaded as
+ *	BPF_PROG_TYPE_PHYS_DEV. Negative values of fd indicate the program
+ *	should be removed.
  *
  */
 struct net_device_ops {
@@ -1292,6 +1297,7 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_bpf_set)(struct net_device *dev, int fd);
 };
 
 /**
@@ -1875,6 +1881,7 @@ struct net_device {
 	struct phy_device	*phydev;
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
+	bool			bpf_valid;
 };
 
 #define to_net_dev(d) container_of(d, struct net_device, dev)
@@ -3268,6 +3275,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_bpf_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..eb93414 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6480,6 +6480,18 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
+int dev_change_bpf_fd(struct net_device *dev, int fd)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_bpf_set)
+		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
+	return ops->ndo_bpf_set(dev, fd);
+}
+EXPORT_SYMBOL(dev_change_bpf_fd);
+
 /**
  * dev_new_index - allocate an ifindex
  * @net: the applicable net namespace
-- 
2.8.0
[RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_metadata, is exposed to userspace. So far
it only exposes the packet length, and only for reading.

The PHYS_DEV name is chosen to represent that the program is meant only
for physical adapters, rather than all netdevs.

While the user visible struct is new, the underlying context must be
implemented as a minimal skb in order for the packet load_* instructions
to work. The skb filled in by the driver must have skb->len, skb->head,
and skb->data set, and skb->data_len == 0.

Signed-off-by: Brenden Blanco
---
 include/uapi/linux/bpf.h | 5
 kernel/bpf/verifier.c    | 1 +
 net/core/filter.c        | 68
 3 files changed, 74 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 924f537..b8a4ef2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -92,6 +92,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_KPROBE,
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
+	BPF_PROG_TYPE_PHYS_DEV,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -367,6 +368,10 @@ struct __sk_buff {
 	__u32 tc_classid;
 };
 
+struct xdp_metadata {
+	__u32 len;
+};
+
 struct bpf_tunnel_key {
 	__u32 tunnel_id;
 	union {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e..804ca70 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_PHYS_DEV:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index b7177d0..c417db6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+phys_dev_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	/* check bounds */
@@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_metadata))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool phys_dev_is_valid_access(int off, int size,
+				     enum bpf_access_type type)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_metadata, len):
+		break;
+	default:
+		return false;
+	}
+	return __is_valid_xdp_access(off, size, type);
+}
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
+					   int dst_reg, int src_reg,
+					   int ctx_off,
+					   struct bpf_insn *insn_buf,
+					   struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_metadata, len):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto = sk_filter_func_proto,
 	.is_valid_access = sk_filter_is_valid_access,
@@ -,6 +2278,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access = bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops phys_dev_ops = {
+	.get_func_proto = phys_dev_func_proto,
+	.is_valid_access = phys_dev_is_valid_access,
+	.convert_ctx_access = bpf_phys_dev_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops = &sk_filter_ops,
 	.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2237,11 +2299,17 @@ sta
Re: [PATCH] RDS: sync congestion map updating
On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote:
> (cc-ing netdev)
> On 3/30/2016 7:59 PM, Wengang Wang wrote:
> >
> >
> >On 2016-03-31 09:51, Wengang Wang wrote:
> >>
> >>
> >>On 2016-03-31 01:16, santosh shilimkar wrote:
> >>>Hi Wengang,
> >>>
> >>>On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
> >Problem is found that some among a lot of parallel RDS
> >communications hang. In my test ten or so among 33 communications hang. The send
> >requests got
> >-ENOBUF error meaning the peer socket (port) is congested. But
> >meanwhile,
> >peer socket (port) is not congested.
> >
> >The congestion map updating can happen in two paths: one is in
> >rds_recvmsg path
> >and the other is when it receives packets from the hardware. There
> >is no
> >synchronization when updating the congestion map. So a bit
> >operation (clearing)
> >in the rds_recvmsg path can be skipped by another bit operation
> >(setting) in
> >hardware packet receiving path.
> >
> >
> >To be more detailed. Here, the two paths (user calls recvmsg and
> >hardware receives data) are for different rds socks. Thus the
> >rds_sock->rs_recv_lock is not helpful to sync the updating on the congestion
> >map.
>
> For archive purposes, let me try to conclude the thread. I synced
> with Wengang offlist and came up with the below fix. I was under the
> impression that __set_bit_le() was the atomic version. After fixing
> it like the patch (end of the email), the bug gets addressed.
>
> I will probably send this as a fix for stable as well.
>
> From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
> From: Santosh Shilimkar
> Date: Wed, 30 Mar 2016 23:26:47 -0700
> Subject: [PATCH] RDS: Fix the atomicity for congestion map update
>
> Two different threads with different rds sockets may be in
> rds_recv_rcvbuf_delta() via the receive path. If their ports
> both map to the same word in the congestion map, then
> using non-atomic ops to update it could cause the map to
> be incorrect. Let's use atomics to avoid such an issue.
>
> Full credit to Wengang for
> finding the issue, analysing it and also pointing out
> the offending code with a spin-lock-based fix.

I'm glad that you solved the issue without spinlocks.

Out of curiosity, I see that this patch needs to be sent to Dave and
applied by him. Is that right?

➜  linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
Santosh Shilimkar (supporter:RDS - RELIABLE DATAGRAM SOCKETS)
"David S. Miller" (maintainer:NETWORKING [GENERAL])
netdev@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-r...@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
rds-de...@oss.oracle.com (moderated list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-ker...@vger.kernel.org (open list)

> Signed-off-by: Wengang Wang
> Signed-off-by: Santosh Shilimkar

Reviewed-by: Leon Romanovsky
[GIT] Networking
1) Missing device reference in IPSEC input path results in crashes during device unregistration. From Subash Abhinov Kasiviswanathan. 2) Per-queue ISR register writes not being done properly in macb driver, from Cyrille Pitchen. 3) Stats accounting bugs in bcmgenet, from Patri Gynther. 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from Quentin Armitage. 5) SXGBE driver has off-by-one in probe error paths, from Rasmus Villemoes. 6) Fix race in save/swap/delete options in netfilter ipset, from Vishwanath Pai. 7) Ageing time of bridge not set properly when not operating over a switchdev device. Fix from Haishuang Yan. 8) Fix GRO regression wrt. nested FOU/GUE based tunnels, from Alexander Duyck. 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet. 10) FEC driver should only access registers that actually exist on the given chipset, fix from Fabio Estevam. Please pull, thanks a lot! The following changes since commit e46b4e2b46e173889b1b8bd033d5e8b3acf0: Merge tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace (2016-03-24 10:52:25 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net for you to fetch changes up to db5dd0db2d8352bb7fd5e9d16e17b79d66c7e4e3: net: mvneta: fix changing MTU when using per-cpu processing (2016-04-01 15:16:37 -0400) Alexander Duyck (3): ixgbe: Store VXLAN port number in network order ixgbe: Fix ATR so that it correctly handles IPv6 extension headers gro: Allow tunnel stacking in the case of FOU/GUE Arnd Bergmann (1): openvswitch: call only into reachable nf-nat code Bjorn Helgaas (1): netpoll: Fix extra refcount release in netpoll_cleanup() Bjørn Mork (1): qmi_wwan: add "D-Link DWM-221 B1" device id Charles Keepax (1): net: macb: Only call GPIO functions if there is a valid GPIO Colin Ian King (1): qed: initialize return rc to avoid returning garbage Cosmin-Gabriel Samoila (1): Drivers: isdn: hisax: isac.c: Fix assignment and check into 
one expression. Cyrille Pitchen (2): net: macb: replace macb_writel() call by queue_writel() to update queue ISR net: macb: remove BUG_ON() and reset the queue to handle RX errors Daniel Borkmann (3): bpf: add missing map_flags to bpf_map_show_fdinfo bpf: make padding in bpf_tunnel_key explicit tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter Daniele Palmas (1): net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card David S. Miller (4): Merge branch 'hns-fixes' Merge git://git.kernel.org/.../pablo/nf Merge branch '10GbE' of git://git.kernel.org/.../jkirsher/net-queue Merge branch 'stmmac-fixes' Diego Viola (1): drivers/net/usb/plusb.c: Fix typo Emil Tantilov (2): ixgbevf: fix error code path when setting MAC address ixgbe: make __ixgbe_setup_tc static Eric Dumazet (1): ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates Fabio Estevam (1): fec: Do not access unexisting register in Coldfire Florian Westphal (3): netfilter: x_tables: validate e->target_offset early netfilter: x_tables: make sure e->next_offset covers remaining blob size netfilter: x_tables: fix unconditional helper Giuseppe CAVALLARO (3): stmmac: fix TX normal DESC Revert "stmmac: Fix 'eth0: No PHY found' regression" stmmac: fix MDIO settings Haishuang Yan (2): openvswitch: Use proper buffer size in nla_memcpy bridge: Allow set bridge ageing time when switchdev disabled Jaedon Shin (1): net: phy: bcm7xxx: Add entries for Broadcom BCM7346 and BCM7362 Jarno Rajahalme (1): openvswitch: Fix checking for new expected connections. 
Jisheng Zhang (5): net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvpp2: fix maybe-uninitialized warning net: mvpp2: use cache_line_size() to get cacheline size net: mvneta: use cache_line_size() to get cacheline size Kejian Yan (1): net: hns: fix warning of passing zero to 'PTR_ERR' Lino Sanfilippo (1): ravb: fix software timestamping Liping Zhang (1): netfilter: ipv4: fix NULL dereference Lisheng (2): net: hns: fixed the setting and getting overtime bug net: hns: set-coalesce-usecs returns errno by dsaf.ko Manish Chopra (1): qlge: Update version to 1.00.00.35 Marcelo Ricardo Leitner (1): sctp: really allow using GFP_KERNEL on sctp_packet_transmit Marcin Wojtas (1): net: mvneta: fix changing MTU when using per-cpu processing Mark Rustad (1): ixgbe: Use udelay to avoid sleeping while atomic Michael Chan (3): bnxt_en: Implement proper firmware message padding. bnxt_en: Fix typo in bnxt_hwrm_set_pause_common
Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
On 04/01/2016 11:28 PM, Daniel Borkmann wrote: On 04/01/2016 09:00 PM, David Miller wrote: From: Daniel Borkmann Date: Fri, 1 Apr 2016 11:41:03 +0200 Moreover, I noticed that when in the non-error path the __skb_pull() is done and the original offset to mac header was non-zero, we fixup from a wrong skb->data offset in the checksum complete processing. So the skb_postpush_rcsum() really needs to be done before __skb_pull() where skb->data still points to the mac header start. Ugh, what a mess, are you sure any of this is right even after your change? What happens (outside of the csum part) is this: __skb_push(offset); __vlan_insert_tag(); { skb_push(VLAN_HLEN); ... memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN); } __skb_pull(offset); If I understand this correctly, the last pull will therefore put skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header pushed by __vlan_insert_tag(). That is assuming skb->data began right after the original ethernet header. Yes, this is correct. Now, continuing this train of thought: you have skb->data pointing _currently_ at vlan_ethhdr->h_vlan_TCI. And then you call: skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN); So, we point from the ->vlan_TCI + (2 * ETH_ALEN) as start offset, and with VLAN_HLEN (= 4 bytes that we added) as length for the csum correction as input. So, we point way beyond what we actually wanted to fixup wrt csum, no? But what we actually want to sum is [h_vlan_proto + h_vlan_TCI], which is where above skb_postpush_rcsum() call points to _before_ the last __skb_pull() happens. In other words, still at that time, we have the same expectations as in __vlan_insert_tag(). To me, that postpull csum currently is absolutely in the correct spot, because it's acting upon the pull done by __vlan_insert_tag(), not the one done here by skb_vlan_push(). Right? Can you tell me how you tested this? Just curious... 
I noticed both while reviewing the code. The error path fixup is not critical for ovs or act_vlan as the skb is dropped afterwards, but not necessarily in the eBPF case, so there it matters, as eBPF doesn't know at this point what the program is going to do with it (a similar fixup is done in the __skb_vlan_pop() error path, btw). For the csum, I did a hexdump to compare what we write and what is being passed in for the csum correction. Anyway ... Aside from all this and based on your comment, I'm investigating whether for the vlan push and also pop case the __skb_pull(skb, offset) in the success case is actually enough and whether it needs to take VLAN_HLEN into account as well. But I need to run more tests on that one first. At least the skb_vlan_push() comment says "__vlan_insert_tag expect skb->data pointing to mac header. So change skb->data before calling it and change back to original position later", Jiri? For this part, what is meant by "original" position (relative to the start of the ethernet header [currently the case], or relative to some data, e.g. before/after the call, I still expect skb->data position to point to my IP header)? Thanks, Daniel
Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
On Fri, Apr 1, 2016 at 4:13 PM, Cong Wang wrote: > On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau wrote: >> + bh_lock_sock(sk); >> + if (!sock_owned_by_user(sk)) >> + ip6_datagram_dst_update(sk, false); >> + bh_unlock_sock(sk); > > > My discussion with Eric shows that we probably don't need to hold > this sock lock here, and you are Cc'ed in that thread, so > > 1) why do you still take the lock here? > 2) why didn't you involve in our discussion if you disagree? Here is the original thread: http://comments.gmane.org/gmane.linux.network/405404
Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau wrote: > + bh_lock_sock(sk); > + if (!sock_owned_by_user(sk)) > + ip6_datagram_dst_update(sk, false); > + bh_unlock_sock(sk); My discussion with Eric shows that we probably don't need to hold this sock lock here, and you are Cc'ed in that thread, so 1) why do you still take the lock here? 2) why didn't you involve in our discussion if you disagree?
[RFC PATCH net 2/4] ipv6: datagram: Refactor dst lookup and update codes to a new function
This patch moves the route lookup and update codes for connected datagram sk to a newly created function ip6_datagram_dst_update() It will be reused during the pmtu update in the later patch. Signed-off-by: Martin KaFai Lau Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- net/ipv6/datagram.c | 103 +--- 1 file changed, 57 insertions(+), 46 deletions(-) diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index f07c1dd..140665b 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -64,16 +64,65 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); } +static int ip6_datagram_dst_update(struct sock *sk) +{ + struct ip6_flowlabel *flowlabel = NULL; + struct in6_addr *final_p, final; + struct ipv6_txoptions *opt; + struct dst_entry *dst; + struct inet_sock *inet = inet_sk(sk); + struct ipv6_pinfo *np = inet6_sk(sk); + struct flowi6 fl6; + int err = 0; + + if (np->sndflow && np->flow_label) { + flowlabel = fl6_sock_lookup(sk, np->flow_label); + if (!flowlabel) + return -EINVAL; + } + ip6_datagram_flow_key_init(&fl6, sk); + + rcu_read_lock(); + opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); + final_p = fl6_update_dst(&fl6, opt, &final); + rcu_read_unlock(); + + dst = ip6_dst_lookup_flow(sk, &fl6, final_p); + if (IS_ERR(dst)) { + err = PTR_ERR(dst); + goto out; + } + + if (ipv6_addr_any(&np->saddr)) + np->saddr = fl6.saddr; + + if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { + sk->sk_v6_rcv_saddr = fl6.saddr; + inet->inet_rcv_saddr = LOOPBACK4_IPV6; + if (sk->sk_prot->rehash) + sk->sk_prot->rehash(sk); + } + + ip6_dst_store(sk, dst, + ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ? + &sk->sk_v6_daddr : NULL, +#ifdef CONFIG_IPV6_SUBTREES + ipv6_addr_equal(&fl6.saddr, &np->saddr) ? 
+ &np->saddr : +#endif + NULL); + +out: + fl6_sock_release(flowlabel); + return err; +} + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; struct inet_sock*inet = inet_sk(sk); struct ipv6_pinfo *np = inet6_sk(sk); - struct in6_addr *daddr, *final_p, final; - struct dst_entry*dst; - struct flowi6 fl6; - struct ip6_flowlabel*flowlabel = NULL; - struct ipv6_txoptions *opt; + struct in6_addr *daddr; int addr_type; int err; __be32 fl6_flowlabel = 0; @@ -91,14 +140,8 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; - if (np->sndflow) { + if (np->sndflow) fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK; - if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) { - flowlabel = fl6_sock_lookup(sk, fl6_flowlabel); - if (!flowlabel) - return -EINVAL; - } - } addr_type = ipv6_addr_type(&usin->sin6_addr); @@ -178,45 +221,13 @@ ipv4_connected: * destination cache for it. */ - ip6_datagram_flow_key_init(&fl6, sk); - - rcu_read_lock(); - opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); - final_p = fl6_update_dst(&fl6, opt, &final); - rcu_read_unlock(); - - dst = ip6_dst_lookup_flow(sk, &fl6, final_p); - err = 0; - if (IS_ERR(dst)) { - err = PTR_ERR(dst); + err = ip6_datagram_dst_update(sk); + if (err) goto out; - } - - /* source address lookup done in ip6_dst_lookup */ - - if (ipv6_addr_any(&np->saddr)) - np->saddr = fl6.saddr; - - if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { - sk->sk_v6_rcv_saddr = fl6.saddr; - inet->inet_rcv_saddr = LOOPBACK4_IPV6; - if (sk->sk_prot->rehash) - sk->sk_prot->rehash(sk); - } - - ip6_dst_store(sk, dst, - ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ? - &sk->sk_v6_daddr : NULL, -#ifdef CONFIG_IPV6_SUBTREES - ipv6_addr_equal(&fl6.saddr, &np->saddr) ? 
- &np->saddr : -#endif - NULL); sk->sk_state = TCP_ESTABLISHED; sk_set_txhash(sk); out: - fl6_sock_release(flowlabel);
[RFC PATCH net 4/4] ipv6: udp: Do a route lookup and update during release_cb
This patch adds a release_cb for UDPv6. It does a route lookup and
updates sk->sk_dst_cache if it is needed. It picks up the left-over
job from ip6_sk_update_pmtu() if the sk is owned by user during the
pmtu update.

Signed-off-by: Martin KaFai Lau
Reported-by: Wei Wang
Cc: Cong Wang
Cc: Eric Dumazet
Cc: Wei Wang
---
 include/net/ipv6.h  | 1 +
 net/ipv6/datagram.c | 12
 net/ipv6/udp.c      | 1 +
 3 files changed, 14 insertions(+)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fd02e90..1be050a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -960,6 +960,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
 int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
+void ip6_datagram_release_cb(struct sock *sk);
 
 int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
 		    int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 0b60f1e..a4f06bd 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -119,6 +119,18 @@ out:
 	return err;
 }
 
+void ip6_datagram_release_cb(struct sock *sk)
+{
+	struct dst_entry *dst;
+
+	dst = __sk_dst_get(sk);
+	if (!dst || dst->ops->check(dst, inet6_sk(sk)->dst_cookie))
+		return;
+
+	ip6_datagram_dst_update(sk, false);
+}
+EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8125931..6bc5c66 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1539,6 +1539,7 @@ struct proto udpv6_prot = {
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
 	.backlog_rcv	   = __udpv6_queue_rcv_skb,
+	.release_cb	   = ip6_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
-- 
2.5.1
[RFC PATCH net 1/4] ipv6: datagram: Refactor flowi6 init codes to a new function
Move flowi6 init codes for connected datagram sk to a newly created function ip6_datagram_flow_key_init(). It will be reused during pmtu update in the later patch. Signed-off-by: Martin KaFai Lau Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- net/ipv6/datagram.c | 50 ++ 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 4281621..f07c1dd 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -40,6 +40,30 @@ static bool ipv6_mapped_addr_any(const struct in6_addr *a) return ipv6_addr_v4mapped(a) && (a->s6_addr32[3] == 0); } +static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) +{ + struct inet_sock *inet = inet_sk(sk); + struct ipv6_pinfo *np = inet6_sk(sk); + + memset(fl6, 0, sizeof(*fl6)); + fl6->flowi6_proto = sk->sk_protocol; + fl6->daddr = sk->sk_v6_daddr; + fl6->saddr = np->saddr; + fl6->flowi6_oif = sk->sk_bound_dev_if; + fl6->flowi6_mark = sk->sk_mark; + fl6->fl6_dport = inet->inet_dport; + fl6->fl6_sport = inet->inet_sport; + fl6->flowlabel = np->flow_label; + + if (!fl6->flowi6_oif) + fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex; + + if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr)) + fl6->flowi6_oif = np->mcast_oif; + + security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); +} + static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr; @@ -52,6 +76,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a struct ipv6_txoptions *opt; int addr_type; int err; + __be32 fl6_flowlabel = 0; if (usin->sin6_family == AF_INET) { if (__ipv6_only_sock(sk)) @@ -66,11 +91,10 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; - memset(&fl6, 0, sizeof(fl6)); if (np->sndflow) { - fl6.flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK; - if 
(fl6.flowlabel&IPV6_FLOWLABEL_MASK) { - flowlabel = fl6_sock_lookup(sk, fl6.flowlabel); + fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK; + if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) { + flowlabel = fl6_sock_lookup(sk, fl6_flowlabel); if (!flowlabel) return -EINVAL; } @@ -145,7 +169,7 @@ ipv4_connected: } sk->sk_v6_daddr = *daddr; - np->flow_label = fl6.flowlabel; + np->flow_label = fl6_flowlabel; inet->inet_dport = usin->sin6_port; @@ -154,21 +178,7 @@ ipv4_connected: * destination cache for it. */ - fl6.flowi6_proto = sk->sk_protocol; - fl6.daddr = sk->sk_v6_daddr; - fl6.saddr = np->saddr; - fl6.flowi6_oif = sk->sk_bound_dev_if; - fl6.flowi6_mark = sk->sk_mark; - fl6.fl6_dport = inet->inet_dport; - fl6.fl6_sport = inet->inet_sport; - - if (!fl6.flowi6_oif) - fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex; - - if (!fl6.flowi6_oif && (addr_type&IPV6_ADDR_MULTICAST)) - fl6.flowi6_oif = np->mcast_oif; - - security_sk_classify_flow(sk, flowi6_to_flowi(&fl6)); + ip6_datagram_flow_key_init(&fl6, sk); rcu_read_lock(); opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt); -- 2.5.1
[RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
There is a case with a connected UDP socket such that getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence could be the following:

1. Create a connected UDP socket
2. Send some datagrams out
3. Receive an ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check() logic to
   update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

This patch updates the sk->sk_dst_cache for a connected datagram sk. It is done under the '!sock_owned_by_user(sk)' condition because the user may make another ip6_datagram_connect() while dst lookup and update are happening. For the sock_owned_by_user(sk) == true case, the next patch will introduce a release_cb() which will update the sk->sk_dst_cache.

Test:

Server (Connected UDP Socket):
~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium

A simple python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
    try:
        data = s.recv(1024)
    except socket.error as se:
        if se.errno == errno.EMSGSIZE:
            pmtu = s.getsockopt(41, 24)
            print("PMTU:%d" % pmtu)
            break
s.close()

Python program output after getting an ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300

Cache routes after receiving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
    cache expires 463sec mtu 1300 pref medium

Client (Send the ICMPV6_PKT_TOOBIG):
~~~
scapy is used to generate the TOOBIG message.
Here is the scapy script I have used: >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', >>> type=0x86dd)/IPv6(src='2fac::face', >>> dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac:: 1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53) >>> sendp(p, iface='qemubr0') Signed-off-by: Martin KaFai Lau Reported-by: Wei Wang Cc: Cong Wang Cc: Eric Dumazet Cc: Wei Wang --- include/net/ipv6.h | 1 + net/ipv6/datagram.c | 20 +++- net/ipv6/route.c| 11 +++ 3 files changed, 23 insertions(+), 9 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index d0aeb97..fd02e90 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -959,6 +959,7 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int optname, int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len); int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr, int addr_len); +int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr); int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 140665b..0b60f1e 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -64,7 +64,7 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk) security_sk_classify_flow(sk, flowi6_to_flowi(fl6)); } -static int ip6_datagram_dst_update(struct sock *sk) +int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr) { struct ip6_flowlabel *flowlabel = NULL; struct in6_addr *final_p, final; @@ -93,14 +93,16 @@ static int ip6_datagram_dst_update(struct sock *sk) goto out; } - if (ipv6_addr_any(&np->saddr)) - np->saddr = fl6.saddr; + if (fix_sk_saddr) { + if (ipv6_addr_any(&np->saddr)) + np->saddr = fl6.saddr; - if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { - sk->sk_v6_rcv_saddr = fl6.saddr; - inet->inet_rcv_saddr = LOOPBACK4_IPV6; - if (sk->sk_prot->rehash) - sk->sk_prot->rehash(sk); + if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) { + 
sk->sk_v6_rcv_saddr = fl6.saddr; + inet->inet_rcv_saddr = LOOPBACK4_IPV6; + if (sk->sk_prot->rehash) + sk->sk_prot->rehash(sk); + } } ip6_dst_store(sk, dst, @@ -221,7 +223,7 @@ ipv4_connected: * destination cache for it. */ - err = ip6_datagram_dst_update(sk); + err = ip6_datagram_dst_update(sk, true); if (err) goto out; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index ed44663..f7e6a6d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1417,8 +1417,19 @@ EXPORT_SYMBOL_GPL(ip6_
[RFC PATCH net 0/4] ip6: datagram: Update dst cache of a connected datagram sk during pmtu update
There is a case with a connected UDP socket where getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence is the following: 1. Create a connected UDP socket 2. Send some datagrams out 3. Receive an ICMPV6_PKT_TOOBIG 4. No new outgoing datagrams arrive to trigger the sk_dst_check() logic that updates sk->sk_dst_cache. 5. getsockopt(IPV6_MTU) returns the MTU from the invalid sk->sk_dst_cache instead of from the newly created RTF_CACHE clone. Patches 1 and 2 are the prep work. Patches 3 and 4 are the fixes.
Re: [v7, 0/5] Fix eSDHC host version register bug
On Fri, 2016-04-01 at 11:07 +0800, Yangbo Lu wrote: > This patchset is used to fix a host version register bug in the T4240-R1.0 > -R2.0 > eSDHC controller. To get the SoC version and revision, it's needed to add > the > GUTS driver to access the global utilities registers. > > So, the first three patches are to add the GUTS driver. > The following two patches are to enable GUTS driver support to get SVR in > eSDHC > driver and fix host version for T4240. > > Yangbo Lu (5): > ARM64: dts: ls2080a: add device configuration node > soc: fsl: add GUTS driver for QorIQ platforms > dt: move guts devicetree doc out of powerpc directory > powerpc/fsl: move mpc85xx.h to include/linux/fsl > mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0 Acked-by: Scott Wood -Scott
[net-next PATCH 2/2 v4] ibmvnic: enable RX checksum offload
Enable RX Checksum offload feature in the ibmvnic driver. Signed-off-by: Thomas Falcon Cc: John Allen --- v4: this patch included since it is enabled by the previous patch --- drivers/net/ethernet/ibm/ibmvnic.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 4e97e76..21bccf6 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -2105,6 +2105,10 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter) if (buf->tcp_ipv6_chksum || buf->udp_ipv6_chksum) adapter->netdev->features |= NETIF_F_IPV6_CSUM; + if ((adapter->netdev->features & + (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM))) + adapter->netdev->features |= NETIF_F_RXCSUM; + memset(&crq, 0, sizeof(crq)); crq.control_ip_offload.first = IBMVNIC_CRQ_CMD; crq.control_ip_offload.cmd = CONTROL_IP_OFFLOAD; -- 2.4.11
[net-next PATCH 1/2 v4] ibmvnic: map L2/L3/L4 header descriptors to firmware
Allow the VNIC driver to provide descriptors containing L2/L3/L4 headers to firmware. This feature is needed for greater hardware compatibility and enablement of checksum and TCP offloading features. A new function is included for the hypervisor call, H_SEND_SUBCRQ_INDIRECT, allowing a DMA-mapped array of SCRQ descriptor elements to be sent to the VNIC server. These additions will help fully enable checksum offloading as well as other features as they are included later. Signed-off-by: Thomas Falcon Cc: John Allen --- v2: Fixed typo error caught by kbuild test bot v3: Fixed erroneous patch sender v4: sorry for the delay in resending, Thanks to David Miller for comments, removed all extra memory allocations, merged some helper functions, calculate all header lengths to meet firmware requirements, fixed endian bugs in the send_subcrq_indirect --- drivers/net/ethernet/ibm/ibmvnic.c | 195 - drivers/net/ethernet/ibm/ibmvnic.h | 3 + 2 files changed, 194 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 6e9e16ee..4e97e76 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -61,6 +61,7 @@ #include #include #include +#include #include #include #include @@ -94,6 +95,7 @@ static int ibmvnic_reenable_crq_queue(struct ibmvnic_adapter *); static int ibmvnic_send_crq(struct ibmvnic_adapter *, union ibmvnic_crq *); static int send_subcrq(struct ibmvnic_adapter *adapter, u64 remote_handle, union sub_crq *sub_crq); +static int send_subcrq_indirect(struct ibmvnic_adapter *, u64, u64, u64); static irqreturn_t ibmvnic_interrupt_rx(int irq, void *instance); static int enable_scrq_irq(struct ibmvnic_adapter *, struct ibmvnic_sub_crq_queue *); @@ -561,10 +563,141 @@ static int ibmvnic_close(struct net_device *netdev) return 0; } +/** + * build_hdr_data - creates L2/L3/L4 header data buffer + * @hdr_field - bitfield determining needed headers + * @skb - socket buffer + * @hdr_len - 
array of header lengths + * @tot_len - total length of data + * + * Reads hdr_field to determine which headers are needed by firmware. + * Builds a buffer containing these headers. Saves individual header + * lengths and total buffer length to be used to build descriptors. + */ +static int build_hdr_data(u8 hdr_field, struct sk_buff *skb, + int *hdr_len, u8 *hdr_data) +{ + int len = 0; + u8 *hdr; + + hdr_len[0] = sizeof(struct ethhdr); + + if (skb->protocol == htons(ETH_P_IP)) { + hdr_len[1] = ip_hdr(skb)->ihl * 4; + if (ip_hdr(skb)->protocol == IPPROTO_TCP) + hdr_len[2] = tcp_hdrlen(skb); + else if (ip_hdr(skb)->protocol == IPPROTO_UDP) + hdr_len[2] = sizeof(struct udphdr); + } else if (skb->protocol == htons(ETH_P_IPV6)) { + hdr_len[1] = sizeof(struct ipv6hdr); + if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP) + hdr_len[2] = tcp_hdrlen(skb); + else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP) + hdr_len[2] = sizeof(struct udphdr); + } + + memset(hdr_data, 0, 120); + if ((hdr_field >> 6) & 1) { + hdr = skb_mac_header(skb); + memcpy(hdr_data, hdr, hdr_len[0]); + len += hdr_len[0]; + } + + if ((hdr_field >> 5) & 1) { + hdr = skb_network_header(skb); + memcpy(hdr_data + len, hdr, hdr_len[1]); + len += hdr_len[1]; + } + + if ((hdr_field >> 4) & 1) { + hdr = skb_transport_header(skb); + memcpy(hdr_data + len, hdr, hdr_len[2]); + len += hdr_len[2]; + } + return len; +} + +/** + * create_hdr_descs - create header and header extension descriptors + * @hdr_field - bitfield determining needed headers + * @data - buffer containing header data + * @len - length of data buffer + * @hdr_len - array of individual header lengths + * @scrq_arr - descriptor array + * + * Creates header and, if needed, header extension descriptors and + * places them in a descriptor array, scrq_arr + */ + +static void create_hdr_descs(u8 hdr_field, u8 *hdr_data, int len, int *hdr_len, +union sub_crq *scrq_arr) +{ + union sub_crq hdr_desc; + int tmp_len = len; + u8 *data, *cur; + int tmp; + + while (tmp_len 
> 0) { + cur = hdr_data + len - tmp_len; + + memset(&hdr_desc, 0, sizeof(hdr_desc)); + if (cur != hdr_data) { + data = hdr_desc.hdr_ext.data; + tmp = tmp_len > 29 ? 29 : tmp_len; + hdr_desc.hdr_ext.first = IBMVNIC_CRQ_CMD; + hdr_desc.hdr_ext.type = IBMVNIC_HDR_
Re: [Odd commit author id merge via netdev]
On 4/1/2016 1:01 PM, Johannes Berg wrote: On Fri, 2016-04-01 at 10:51 -0700, santosh shilimkar wrote: Hi Dave, I noticed something odd while checking that my recent commits in the kernel.org tree made it via netdev. I don't know if it's the patchwork tool doing this. The usual author line in my git objects is: Author: Santosh Shilimkar But the commits going via your tree seem to be like below: Author: email-id A few more examples of such commits are at the end of the email. Can this be fixed for future commits? The git objects you pulled from my tree directly have the right author format, whereas the ones picked up from patchwork seem to be odd. Patchwork does store this info somehow and re-uses it, quite possibly from the very first patch you ever sent. I think this bug was *just* fixed in patchwork, but it'll probably be a while until that fix lands. However, you can go and create a patchwork account with the real name, associate it with all the email addresses you use, and then I think it'll pick it up. Not entirely sure though, you'll have to test it. I will try that. Thanks for the tip. Regards, Santosh
Re: Question on rhashtable in worst-case scenario.
On Fri, 2016-04-01 at 08:46 +0800, Herbert Xu wrote: > On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote: > > > > > > Does removing this completely disable the "-EEXIST" error? I can't > > say > > I fully understand the elasticity stuff in > > __rhashtable_insert_fast(). > What EEXIST error are you talking about? The only one that can be > returned on insertion is if you're explicitly checking for dups > which clearly can't be the case for you. I was thinking about that one - it's not obvious to me from the code how this "explicitly checking for dups" would be done or let's say how rhashtable differentiates. But since it seems to work for Ben until hitting a certain number of identical keys, surely that's just me not understanding the code rather than anything else :) johannes
Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
On 04/01/2016 09:00 PM, David Miller wrote: From: Daniel Borkmann Date: Fri, 1 Apr 2016 11:41:03 +0200 Moreover, I noticed that when in the non-error path the __skb_pull() is done and the original offset to mac header was non-zero, we fixup from a wrong skb->data offset in the checksum complete processing. So the skb_postpush_rcsum() really needs to be done before __skb_pull() where skb->data still points to the mac header start. Ugh, what a mess, are you sure any of this is right even after your change? What happens (outside of the csum part) is this: __skb_push(offset); __vlan_insert_tag(); { skb_push(VLAN_HLEN); ... memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN); } __skb_pull(offset); If I understand this correctly, the last pull will therefore put skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header pushed by __vlan_insert_tag(). That is assuming skb->data began right after the original ethernet header. Yes, this is correct. Now, continuing this train of thought: you have skb->data pointing _currently_ at vlan_ethhdr->h_vlan_TCI. And then you call: skb_postpush_rcsum(skb, skb->data + (2 * ETH_ALEN), VLAN_HLEN); So, we point from the ->vlan_TCI + (2 * ETH_ALEN) as start offset, and with VLAN_HLEN (= 4 bytes that we added) as length for the csum correction as input. So, we point way beyond what we actually wanted to fixup wrt csum, no? But what we actually want to sum is [h_vlan_proto + h_vlan_TCI], which is where above skb_postpush_rcsum() call points to _before_ the last __skb_pull() happens. In other words, still at that time, we have the same expectations as in __vlan_insert_tag(). To me, that postpull csum currently is absolutely in the correct spot, because it's acting upon the pull done by __vlan_insert_tag(), not the one done here by skb_vlan_push(). Right? Can you tell me how you tested this? Just curious... 
I noticed both while reviewing the code. The error path fixup is not critical for ovs or act_vlan, as the skb is dropped afterwards, but that is not necessarily so in the eBPF case; there it matters, as eBPF doesn't know at this point what the program is going to do with the skb (a similar fixup is done in the __skb_vlan_pop() error path, btw). For the csum, I did a hexdump to compare what we write and what is being passed in for the csum correction. Anyway ... Aside from all this and based on your comment, I'm investigating whether, for both the vlan push and pop cases, the __skb_pull(skb, offset) in the success case is actually enough or whether it needs to take VLAN_HLEN into account as well. But I need to do more testing for that one first. At least the skb_vlan_push() comment says "__vlan_insert_tag expect skb->data pointing to mac header. So change skb->data before calling it and change back to original position later", Jiri?
[PATCH v4 net-next 11/15] nfp: sync ring state during FW reconfiguration
FW reconfiguration in .ndo_open()/.ndo_stop() should reset/ restore queue state. Since we need IRQs to be disabled when filling rings on RX path we have to move disable_irq() from .ndo_open() all the way up to IRQ allocation. nfp_net_start_vec() becomes trivial now so it's inlined. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 45 -- 1 file changed, 16 insertions(+), 29 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index f171a7da8931..2878ac021eda 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1520,6 +1520,7 @@ nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, nn_err(nn, "Error requesting IRQ %d\n", entry->vector); return err; } + disable_irq(entry->vector); /* Setup NAPI */ netif_napi_add(nn->netdev, &r_vec->napi, @@ -1648,13 +1649,14 @@ static void nfp_net_clear_config_and_disable(struct nfp_net *nn) nn_writel(nn, NFP_NET_CFG_CTRL, new_ctrl); err = nfp_net_reconfig(nn, update); - if (err) { + if (err) nn_err(nn, "Could not disable device: %d\n", err); - return; - } - for (r = 0; r < nn->num_r_vecs; r++) + for (r = 0; r < nn->num_r_vecs; r++) { + nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring); + nfp_net_tx_ring_reset(nn, nn->r_vecs[r].tx_ring); nfp_net_vec_clear_ring_data(nn, r); + } nn->ctrl = new_ctrl; } @@ -1722,6 +1724,9 @@ static int __nfp_net_set_config_and_enable(struct nfp_net *nn) nn->ctrl = new_ctrl; + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_rx_ring_fill_freelist(nn->r_vecs[r].rx_ring); + /* Since reconfiguration requests while NFP is down are ignored we * have to wipe the entire VXLAN configuration and reinitialize it. 
*/ @@ -1750,26 +1755,6 @@ static int nfp_net_set_config_and_enable(struct nfp_net *nn) } /** - * nfp_net_start_vec() - Start ring vector - * @nn: NFP Net device structure - * @r_vec: Ring vector to be started - */ -static void -nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) -{ - unsigned int irq_vec; - - irq_vec = nn->irq_entries[r_vec->irq_idx].vector; - - disable_irq(irq_vec); - - nfp_net_rx_ring_fill_freelist(r_vec->rx_ring); - napi_enable(&r_vec->napi); - - enable_irq(irq_vec); -} - -/** * nfp_net_open_stack() - Start the device from stack's perspective * @nn: NFP Net device to reconfigure */ @@ -1777,8 +1762,10 @@ static void nfp_net_open_stack(struct nfp_net *nn) { unsigned int r; - for (r = 0; r < nn->num_r_vecs; r++) - nfp_net_start_vec(nn, &nn->r_vecs[r]); + for (r = 0; r < nn->num_r_vecs; r++) { + napi_enable(&nn->r_vecs[r].napi); + enable_irq(nn->irq_entries[nn->r_vecs[r].irq_idx].vector); + } netif_tx_wake_all_queues(nn->netdev); @@ -1903,8 +1890,10 @@ static void nfp_net_close_stack(struct nfp_net *nn) netif_carrier_off(nn->netdev); nn->link_up = false; - for (r = 0; r < nn->num_r_vecs; r++) + for (r = 0; r < nn->num_r_vecs; r++) { + disable_irq(nn->irq_entries[nn->r_vecs[r].irq_idx].vector); napi_disable(&nn->r_vecs[r].napi); + } netif_tx_disable(nn->netdev); } @@ -1918,9 +1907,7 @@ static void nfp_net_close_free_all(struct nfp_net *nn) unsigned int r; for (r = 0; r < nn->num_r_vecs; r++) { - nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring); nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring); - nfp_net_tx_ring_reset(nn, nn->r_vecs[r].tx_ring); nfp_net_rx_ring_free(nn->r_vecs[r].rx_ring); nfp_net_tx_ring_free(nn->r_vecs[r].tx_ring); nfp_net_cleanup_vector(nn, &nn->r_vecs[r]); -- 1.9.1
[PATCH v4 net-next 08/15] nfp: preallocate RX buffers early in .ndo_open
We want the .ndo_open() to have following structure: - allocate resources; - configure HW/FW; - enable the device from stack perspective. Therefore filling RX rings needs to be moved to the beginning of .ndo_open(). Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 34 +++--- 1 file changed, 11 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index b057102769f9..c04706cd7d51 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1667,28 +1667,19 @@ static void nfp_net_clear_config_and_disable(struct nfp_net *nn) * @nn: NFP Net device structure * @r_vec: Ring vector to be started */ -static int nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) +static void +nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) { unsigned int irq_vec; - int err = 0; irq_vec = nn->irq_entries[r_vec->irq_idx].vector; disable_irq(irq_vec); - err = nfp_net_rx_ring_bufs_alloc(r_vec->nfp_net, r_vec->rx_ring); - if (err) { - nn_err(nn, "RV%02d: couldn't allocate enough buffers\n", - r_vec->irq_idx); - goto out; - } nfp_net_rx_ring_fill_freelist(r_vec->rx_ring); - napi_enable(&r_vec->napi); -out: - enable_irq(irq_vec); - return err; + enable_irq(irq_vec); } static int nfp_net_netdev_open(struct net_device *netdev) @@ -1743,6 +1734,10 @@ static int nfp_net_netdev_open(struct net_device *netdev) err = nfp_net_rx_ring_alloc(nn->r_vecs[r].rx_ring); if (err) goto err_free_tx_ring_p; + + err = nfp_net_rx_ring_bufs_alloc(nn, nn->r_vecs[r].rx_ring); + if (err) + goto err_flush_rx_ring_p; } err = netif_set_real_num_tx_queues(netdev, nn->num_tx_rings); @@ -1815,11 +1810,8 @@ static int nfp_net_netdev_open(struct net_device *netdev) * - enable all TX queues * - set link state */ - for (r = 0; r < nn->num_r_vecs; r++) { - err = nfp_net_start_vec(nn, &nn->r_vecs[r]); 
- if (err) - goto err_disable_napi; - } + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_start_vec(nn, &nn->r_vecs[r]); netif_tx_wake_all_queues(netdev); @@ -1828,18 +1820,14 @@ static int nfp_net_netdev_open(struct net_device *netdev) return 0; -err_disable_napi: - while (r--) { - napi_disable(&nn->r_vecs[r].napi); - nfp_net_rx_ring_reset(nn->r_vecs[r].rx_ring); - nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring); - } err_clear_config: nfp_net_clear_config_and_disable(nn); err_free_rings: r = nn->num_r_vecs; err_free_prev_vecs: while (r--) { + nfp_net_rx_ring_bufs_free(nn, nn->r_vecs[r].rx_ring); +err_flush_rx_ring_p: nfp_net_rx_ring_free(nn->r_vecs[r].rx_ring); err_free_tx_ring_p: nfp_net_tx_ring_free(nn->r_vecs[r].tx_ring); -- 1.9.1
RE: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
| For transmit we can leave the IP ID code as is. For receive we should not be | snooping into the IP ID for any frames that have the DF bit set as devices | that have adopted RFC 6864 on their transmit path will end up causing issues. Currently, GRO does not coalesce TCP packets originating from nodes which do IPv6-to-IPv4 translation. These packets have DF set to 1 and the IP ID set to 0, per https://tools.ietf.org/html/rfc6145#page-17. With this patch, we should be able to see coalescing happen for those cases as well.
[PATCH v4 net-next 02/15] nfp: move link state interrupt request/free calls
We need to be able to disable the link state interrupt when the device is brought down. We used to just free the IRQ at the beginning of .ndo_stop(). As we now move towards more ordered .ndo_open()/.ndo_stop() paths LSC allocation should be placed in the "allocate resource" section. Since the IRQ can't be freed early in .ndo_stop(), it is disabled instead. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 23 +++--- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 307c02c4ba4a..9ce04b179fac 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1730,10 +1730,16 @@ static int nfp_net_netdev_open(struct net_device *netdev) NFP_NET_IRQ_EXN_IDX, nn->exn_handler); if (err) return err; + err = nfp_net_aux_irq_request(nn, NFP_NET_CFG_LSC, "%s-lsc", + nn->lsc_name, sizeof(nn->lsc_name), + NFP_NET_IRQ_LSC_IDX, nn->lsc_handler); + if (err) + goto err_free_exn; + disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); err = nfp_net_alloc_rings(nn); if (err) - goto err_free_exn; + goto err_free_lsc; err = netif_set_real_num_tx_queues(netdev, nn->num_tx_rings); if (err) @@ -1813,19 +1819,11 @@ static int nfp_net_netdev_open(struct net_device *netdev) netif_tx_wake_all_queues(netdev); - err = nfp_net_aux_irq_request(nn, NFP_NET_CFG_LSC, "%s-lsc", - nn->lsc_name, sizeof(nn->lsc_name), - NFP_NET_IRQ_LSC_IDX, nn->lsc_handler); - if (err) - goto err_stop_tx; + enable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); nfp_net_read_link_status(nn); return 0; -err_stop_tx: - netif_tx_disable(netdev); - for (r = 0; r < nn->num_r_vecs; r++) - nfp_net_tx_flush(nn->r_vecs[r].tx_ring); err_disable_napi: while (r--) { napi_disable(&nn->r_vecs[r].napi); @@ -1835,6 +1833,8 @@ err_clear_config: nfp_net_clear_config_and_disable(nn); err_free_rings: nfp_net_free_rings(nn); 
+err_free_lsc: + nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); err_free_exn: nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); return err; @@ -1856,7 +1856,7 @@ static int nfp_net_netdev_close(struct net_device *netdev) /* Step 1: Disable RX and TX rings from the Linux kernel perspective */ - nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); + disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); netif_carrier_off(netdev); nn->link_up = false; @@ -1877,6 +1877,7 @@ static int nfp_net_netdev_close(struct net_device *netdev) } nfp_net_free_rings(nn); + nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); nn_dbg(nn, "%s down", netdev->name); -- 1.9.1
[PATCH v4 net-next 03/15] nfp: break up nfp_net_{alloc|free}_rings
nfp_net_{alloc|free}_rings contained strange mix of allocations and vector initialization. Remove it, declare vector init as a separate function and handle allocations explicitly. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 126 - 1 file changed, 47 insertions(+), 79 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 9ce04b179fac..b435d15ef8d6 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1489,91 +1489,40 @@ err_alloc: return -ENOMEM; } -static void __nfp_net_free_rings(struct nfp_net *nn, unsigned int n_free) +static int +nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, + int idx) { - struct nfp_net_r_vector *r_vec; - struct msix_entry *entry; + struct msix_entry *entry = &nn->irq_entries[r_vec->irq_idx]; + int err; - while (n_free--) { - r_vec = &nn->r_vecs[n_free]; - entry = &nn->irq_entries[r_vec->irq_idx]; + snprintf(r_vec->name, sizeof(r_vec->name), +"%s-rxtx-%d", nn->netdev->name, idx); + err = request_irq(entry->vector, r_vec->handler, 0, r_vec->name, r_vec); + if (err) { + nn_err(nn, "Error requesting IRQ %d\n", entry->vector); + return err; + } - nfp_net_rx_ring_free(r_vec->rx_ring); - nfp_net_tx_ring_free(r_vec->tx_ring); + /* Setup NAPI */ + netif_napi_add(nn->netdev, &r_vec->napi, + nfp_net_poll, NAPI_POLL_WEIGHT); - irq_set_affinity_hint(entry->vector, NULL); - free_irq(entry->vector, r_vec); + irq_set_affinity_hint(entry->vector, &r_vec->affinity_mask); - netif_napi_del(&r_vec->napi); - } -} + nn_dbg(nn, "RV%02d: irq=%03d/%03d\n", idx, entry->vector, entry->entry); -/** - * nfp_net_free_rings() - Free all ring resources - * @nn: NFP Net device to reconfigure - */ -static void nfp_net_free_rings(struct nfp_net *nn) -{ - __nfp_net_free_rings(nn, nn->num_r_vecs); + return 0; } -/** - * nfp_net_alloc_rings() - Allocate 
resources for RX and TX rings - * @nn: NFP Net device to reconfigure - * - * Return: 0 on success or negative errno on error. - */ -static int nfp_net_alloc_rings(struct nfp_net *nn) +static void +nfp_net_cleanup_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) { - struct nfp_net_r_vector *r_vec; - struct msix_entry *entry; - int err; - int r; + struct msix_entry *entry = &nn->irq_entries[r_vec->irq_idx]; - for (r = 0; r < nn->num_r_vecs; r++) { - r_vec = &nn->r_vecs[r]; - entry = &nn->irq_entries[r_vec->irq_idx]; - - /* Setup NAPI */ - netif_napi_add(nn->netdev, &r_vec->napi, - nfp_net_poll, NAPI_POLL_WEIGHT); - - snprintf(r_vec->name, sizeof(r_vec->name), -"%s-rxtx-%d", nn->netdev->name, r); - err = request_irq(entry->vector, r_vec->handler, 0, - r_vec->name, r_vec); - if (err) { - nn_dbg(nn, "Error requesting IRQ %d\n", entry->vector); - goto err_napi_del; - } - - irq_set_affinity_hint(entry->vector, &r_vec->affinity_mask); - - nn_dbg(nn, "RV%02d: irq=%03d/%03d\n", - r, entry->vector, entry->entry); - - /* Allocate TX ring resources */ - err = nfp_net_tx_ring_alloc(r_vec->tx_ring); - if (err) - goto err_free_irq; - - /* Allocate RX ring resources */ - err = nfp_net_rx_ring_alloc(r_vec->rx_ring); - if (err) - goto err_free_tx; - } - - return 0; - -err_free_tx: - nfp_net_tx_ring_free(r_vec->tx_ring); -err_free_irq: irq_set_affinity_hint(entry->vector, NULL); - free_irq(entry->vector, r_vec); -err_napi_del: netif_napi_del(&r_vec->napi); - __nfp_net_free_rings(nn, r); - return err; + free_irq(entry->vector, r_vec); } /** @@ -1737,9 +1686,19 @@ static int nfp_net_netdev_open(struct net_device *netdev) goto err_free_exn; disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); - err = nfp_net_alloc_rings(nn); - if (err) - goto err_free_lsc; + for (r = 0; r < nn->num_r_vecs; r++) { + err = nfp_net_prepare_vector(nn, &nn->r_vecs[r], r); + if (err) + goto err_free_prev_vecs; + + err = nfp_net_tx_ring_alloc(nn->r_vecs[r].tx_ring); + if (err) + goto e
[PATCH v4 net-next 09/15] nfp: move filling ring information to FW config
nfp_net_[rt]x_ring_{alloc,free} should only allocate or free ring resources without touching the device. Move setting parameters in the BAR to separate functions. This will make it possible to reuse alloc/free functions to allocate new rings while the device is running. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 50 ++ 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index c04706cd7d51..f504de12ed2a 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1388,10 +1388,6 @@ static void nfp_net_tx_ring_free(struct nfp_net_tx_ring *tx_ring) struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; - nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(tx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_TXR_SZ(tx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_TXR_VEC(tx_ring->idx), 0); - kfree(tx_ring->txbufs); if (tx_ring->txds) @@ -1431,11 +1427,6 @@ static int nfp_net_tx_ring_alloc(struct nfp_net_tx_ring *tx_ring) if (!tx_ring->txbufs) goto err_alloc; - /* Write the DMA address, size and MSI-X info to the device */ - nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(tx_ring->idx), tx_ring->dma); - nn_writeb(nn, NFP_NET_CFG_TXR_SZ(tx_ring->idx), ilog2(tx_ring->cnt)); - nn_writeb(nn, NFP_NET_CFG_TXR_VEC(tx_ring->idx), r_vec->irq_idx); - netif_set_xps_queue(nn->netdev, &r_vec->affinity_mask, tx_ring->idx); nn_dbg(nn, "TxQ%02d: QCidx=%02d cnt=%d dma=%#llx host=%p\n", @@ -1459,10 +1450,6 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring) struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; - nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(rx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_RXR_SZ(rx_ring->idx), 0); - nn_writeb(nn, NFP_NET_CFG_RXR_VEC(rx_ring->idx), 0); - kfree(rx_ring->rxbufs); if (rx_ring->rxds) @@ -1502,11 +1489,6 @@ static int 
nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring) if (!rx_ring->rxbufs) goto err_alloc; - /* Write the DMA address, size and MSI-X info to the device */ - nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(rx_ring->idx), rx_ring->dma); - nn_writeb(nn, NFP_NET_CFG_RXR_SZ(rx_ring->idx), ilog2(rx_ring->cnt)); - nn_writeb(nn, NFP_NET_CFG_RXR_VEC(rx_ring->idx), r_vec->irq_idx); - nn_dbg(nn, "RxQ%02d: FlQCidx=%02d RxQCidx=%02d cnt=%d dma=%#llx host=%p\n", rx_ring->idx, rx_ring->fl_qcidx, rx_ring->rx_qcidx, rx_ring->cnt, (unsigned long long)rx_ring->dma, rx_ring->rxds); @@ -1631,6 +1613,17 @@ static void nfp_net_write_mac_addr(struct nfp_net *nn, const u8 *mac) get_unaligned_be16(nn->netdev->dev_addr + 4) << 16); } +static void nfp_net_vec_clear_ring_data(struct nfp_net *nn, unsigned int idx) +{ + nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(idx), 0); + nn_writeb(nn, NFP_NET_CFG_RXR_SZ(idx), 0); + nn_writeb(nn, NFP_NET_CFG_RXR_VEC(idx), 0); + + nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(idx), 0); + nn_writeb(nn, NFP_NET_CFG_TXR_SZ(idx), 0); + nn_writeb(nn, NFP_NET_CFG_TXR_VEC(idx), 0); +} + /** * nfp_net_clear_config_and_disable() - Clear control BAR and disable NFP * @nn: NFP Net device to reconfigure @@ -1638,6 +1631,7 @@ static void nfp_net_write_mac_addr(struct nfp_net *nn, const u8 *mac) static void nfp_net_clear_config_and_disable(struct nfp_net *nn) { u32 new_ctrl, update; + unsigned int r; int err; new_ctrl = nn->ctrl; @@ -1659,9 +1653,26 @@ static void nfp_net_clear_config_and_disable(struct nfp_net *nn) return; } + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_vec_clear_ring_data(nn, r); + nn->ctrl = new_ctrl; } +static void +nfp_net_vec_write_ring_data(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, + unsigned int idx) +{ + /* Write the DMA address, size and MSI-X info to the device */ + nn_writeq(nn, NFP_NET_CFG_RXR_ADDR(idx), r_vec->rx_ring->dma); + nn_writeb(nn, NFP_NET_CFG_RXR_SZ(idx), ilog2(r_vec->rx_ring->cnt)); + nn_writeb(nn, NFP_NET_CFG_RXR_VEC(idx), r_vec->irq_idx); + + 
nn_writeq(nn, NFP_NET_CFG_TXR_ADDR(idx), r_vec->tx_ring->dma); + nn_writeb(nn, NFP_NET_CFG_TXR_SZ(idx), ilog2(r_vec->tx_ring->cnt)); + nn_writeb(nn, NFP_NET_CFG_TXR_VEC(idx), r_vec->irq_idx); +} + /** * nfp_net_start_vec() - Start ring vector * @nn: NFP Net device structure @@ -1769,6 +1780,9 @@ static int nfp_net_netdev_open(struct net_device *netdev) * - Set the Freelist buffer size * - Enable the FW */ + for (r = 0; r < nn->num_r_vecs;
[PATCH v4 net-next 15/15] nfp: allow ring size reconfiguration at runtime
Since much of the required changes have already been made for changing MTU at runtime let's use it for ring size changes as well. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 1 + .../net/ethernet/netronome/nfp/nfp_net_common.c| 126 + .../net/ethernet/netronome/nfp/nfp_net_ethtool.c | 30 ++--- 3 files changed, 136 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 1e08c9cf3ee0..90ad6264e62c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -724,6 +724,7 @@ void nfp_net_rss_write_key(struct nfp_net *nn); void nfp_net_coalesce_write_cfg(struct nfp_net *nn); int nfp_net_irqs_alloc(struct nfp_net *nn); void nfp_net_irqs_disable(struct nfp_net *nn); +int nfp_net_set_ring_size(struct nfp_net *nn, u32 rxd_cnt, u32 txd_cnt); #ifdef CONFIG_NFP_NET_DEBUG void nfp_net_debugfs_create(void); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 631168a1971e..57f330a4736e 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1445,6 +1445,59 @@ err_alloc: return -ENOMEM; } +static struct nfp_net_tx_ring * +nfp_net_shadow_tx_rings_prepare(struct nfp_net *nn, u32 buf_cnt) +{ + struct nfp_net_tx_ring *rings; + unsigned int r; + + rings = kcalloc(nn->num_tx_rings, sizeof(*rings), GFP_KERNEL); + if (!rings) + return NULL; + + for (r = 0; r < nn->num_tx_rings; r++) { + nfp_net_tx_ring_init(&rings[r], nn->tx_rings[r].r_vec, r); + + if (nfp_net_tx_ring_alloc(&rings[r], buf_cnt)) + goto err_free_prev; + } + + return rings; + +err_free_prev: + while (r--) + nfp_net_tx_ring_free(&rings[r]); + kfree(rings); + return NULL; +} + +static struct nfp_net_tx_ring * +nfp_net_shadow_tx_rings_swap(struct nfp_net *nn, struct nfp_net_tx_ring *rings) +{ + struct nfp_net_tx_ring 
*old = nn->tx_rings; + unsigned int r; + + for (r = 0; r < nn->num_tx_rings; r++) + old[r].r_vec->tx_ring = &rings[r]; + + nn->tx_rings = rings; + return old; +} + +static void +nfp_net_shadow_tx_rings_free(struct nfp_net *nn, struct nfp_net_tx_ring *rings) +{ + unsigned int r; + + if (!rings) + return; + + for (r = 0; r < nn->num_tx_rings; r++) + nfp_net_tx_ring_free(&rings[r]); + + kfree(rings); +} + /** * nfp_net_rx_ring_free() - Free resources allocated to a RX ring * @rx_ring: RX ring to free @@ -1561,6 +1614,9 @@ nfp_net_shadow_rx_rings_free(struct nfp_net *nn, struct nfp_net_rx_ring *rings) { unsigned int r; + if (!rings) + return; + for (r = 0; r < nn->num_r_vecs; r++) { nfp_net_rx_ring_bufs_free(nn, &rings[r]); nfp_net_rx_ring_free(&rings[r]); @@ -2106,6 +2162,76 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) return err; } +int nfp_net_set_ring_size(struct nfp_net *nn, u32 rxd_cnt, u32 txd_cnt) +{ + struct nfp_net_tx_ring *tx_rings = NULL; + struct nfp_net_rx_ring *rx_rings = NULL; + u32 old_rxd_cnt, old_txd_cnt; + int err; + + if (!netif_running(nn->netdev)) { + nn->rxd_cnt = rxd_cnt; + nn->txd_cnt = txd_cnt; + return 0; + } + + old_rxd_cnt = nn->rxd_cnt; + old_txd_cnt = nn->txd_cnt; + + /* Prepare new rings */ + if (nn->rxd_cnt != rxd_cnt) { + rx_rings = nfp_net_shadow_rx_rings_prepare(nn, nn->fl_bufsz, + rxd_cnt); + if (!rx_rings) + return -ENOMEM; + } + if (nn->txd_cnt != txd_cnt) { + tx_rings = nfp_net_shadow_tx_rings_prepare(nn, txd_cnt); + if (!tx_rings) { + nfp_net_shadow_rx_rings_free(nn, rx_rings); + return -ENOMEM; + } + } + + /* Stop device, swap in new rings, try to start the firmware */ + nfp_net_close_stack(nn); + nfp_net_clear_config_and_disable(nn); + + if (rx_rings) + rx_rings = nfp_net_shadow_rx_rings_swap(nn, rx_rings); + if (tx_rings) + tx_rings = nfp_net_shadow_tx_rings_swap(nn, tx_rings); + + nn->rxd_cnt = rxd_cnt; + nn->txd_cnt = txd_cnt; + + err = nfp_net_set_config_and_enable(nn); + if (err) { + const int 
err_new = err; + + /* Try with old configuration and old rings */ + if (rx_rings) + rx_rings = nfp_net_shadow_rx_rings_swap(nn, rx_rings); +
[PATCH v4 net-next 05/15] nfp: allocate ring SW structs dynamically
To be able to switch rings more easily on config changes, allocate them dynamically, separately from the nfp_net structure. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 6 ++--- .../net/ethernet/netronome/nfp/nfp_net_common.c| 28 +- 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index ab264e1bccd0..0a87571a7d9c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -472,6 +472,9 @@ struct nfp_net { u32 rx_offset; + struct nfp_net_tx_ring *tx_rings; + struct nfp_net_rx_ring *rx_rings; + #ifdef CONFIG_PCI_IOV unsigned int num_vfs; struct vf_data_storage *vfinfo; @@ -504,9 +507,6 @@ struct nfp_net { int txd_cnt; int rxd_cnt; - struct nfp_net_tx_ring tx_rings[NFP_NET_MAX_TX_RINGS]; - struct nfp_net_rx_ring rx_rings[NFP_NET_MAX_RX_RINGS]; - u8 num_irqs; u8 num_r_vecs; struct nfp_net_r_vector r_vecs[NFP_NET_MAX_TX_RINGS]; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 7233471af7a8..8f7e2e044811 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -414,12 +414,6 @@ static void nfp_net_irqs_assign(struct net_device *netdev) r_vec->irq_idx = NFP_NET_NON_Q_VECTORS + r; cpumask_set_cpu(r, &r_vec->affinity_mask); - - r_vec->tx_ring = &nn->tx_rings[r]; - nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, r); - - r_vec->rx_ring = &nn->rx_rings[r]; - nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, r); } } @@ -1504,6 +1498,12 @@ nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, struct msix_entry *entry = &nn->irq_entries[r_vec->irq_idx]; int err; + r_vec->tx_ring = &nn->tx_rings[idx]; + nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, idx); + + r_vec->rx_ring = &nn->rx_rings[idx]; + nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, idx); + 
snprintf(r_vec->name, sizeof(r_vec->name), "%s-rxtx-%d", nn->netdev->name, idx); err = request_irq(entry->vector, r_vec->handler, 0, r_vec->name, r_vec); @@ -1694,6 +1694,15 @@ static int nfp_net_netdev_open(struct net_device *netdev) goto err_free_exn; disable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); + nn->rx_rings = kcalloc(nn->num_rx_rings, sizeof(*nn->rx_rings), + GFP_KERNEL); + if (!nn->rx_rings) + goto err_free_lsc; + nn->tx_rings = kcalloc(nn->num_tx_rings, sizeof(*nn->tx_rings), + GFP_KERNEL); + if (!nn->tx_rings) + goto err_free_rx_rings; + for (r = 0; r < nn->num_r_vecs; r++) { err = nfp_net_prepare_vector(nn, &nn->r_vecs[r], r); if (err) @@ -1808,6 +1817,10 @@ err_free_tx_ring_p: err_cleanup_vec_p: nfp_net_cleanup_vector(nn, &nn->r_vecs[r]); } + kfree(nn->tx_rings); +err_free_rx_rings: + kfree(nn->rx_rings); +err_free_lsc: nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); err_free_exn: nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); @@ -1853,6 +1866,9 @@ static int nfp_net_netdev_close(struct net_device *netdev) nfp_net_cleanup_vector(nn, &nn->r_vecs[r]); } + kfree(nn->rx_rings); + kfree(nn->tx_rings); + nfp_net_aux_irq_free(nn, NFP_NET_CFG_LSC, NFP_NET_IRQ_LSC_IDX); nfp_net_aux_irq_free(nn, NFP_NET_CFG_EXN, NFP_NET_IRQ_EXN_IDX); -- 1.9.1
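The allocate-then-unwind flow this patch adds to `.ndo_open()` can be sketched in plain userspace C. This is an illustrative model only, not the driver's code: the `fake_*` names stand in for `nfp_net_netdev_open()`'s kcalloc of the ring arrays and its reverse-order goto error labels.

```c
#include <stdlib.h>

/* Userspace sketch of the error-unwind pattern: allocate the RX ring
 * array, then the TX ring array; on any failure, jump to a label that
 * frees everything allocated so far, in reverse order. */
struct fake_nn {
	void *rx_rings;
	void *tx_rings;
};

static int fake_open(struct fake_nn *nn, size_t num_rx, size_t num_tx,
		     size_t ring_sz)
{
	nn->rx_rings = calloc(num_rx, ring_sz);
	if (!nn->rx_rings)
		goto err_out;

	nn->tx_rings = calloc(num_tx, ring_sz);
	if (!nn->tx_rings)
		goto err_free_rx_rings;

	return 0;

err_free_rx_rings:
	free(nn->rx_rings);
	nn->rx_rings = NULL;
err_out:
	return -1;
}

static void fake_close(struct fake_nn *nn)
{
	free(nn->rx_rings);
	free(nn->tx_rings);
	nn->rx_rings = nn->tx_rings = NULL;
}
```

Each later allocation failure unwinds only what was already set up, which is why the patch threads the new kcalloc calls between the existing `err_free_lsc`-style labels.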
[PATCH v4 net-next 12/15] nfp: propagate list buffer size in struct rx_ring
The free list buffer size needs to be propagated to a few functions as a parameter and added to struct nfp_net_rx_ring, since soon some of the functions will be reused to manage rings with buffers of a size different from nn->fl_bufsz. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net.h | 3 +++ .../net/ethernet/netronome/nfp/nfp_net_common.c| 24 ++ 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 0a87571a7d9c..1e08c9cf3ee0 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -298,6 +298,8 @@ struct nfp_net_rx_buf { * @rxds: Virtual address of FL/RX ring in host memory * @dma:DMA address of the FL/RX ring * @size: Size, in bytes, of the FL/RX ring (needed to free) + * @bufsz: Buffer allocation size for convenience of management routines + * (NOTE: this is in second cache line, do not use on fast path!) */ struct nfp_net_rx_ring { struct nfp_net_r_vector *r_vec; @@ -319,6 +321,7 @@ struct nfp_net_rx_ring { dma_addr_t dma; unsigned int size; + unsigned int bufsz; } cacheline_aligned; /** diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 2878ac021eda..eeabc33fe13d 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -958,25 +958,27 @@ static inline int nfp_net_rx_space(struct nfp_net_rx_ring *rx_ring) * nfp_net_rx_alloc_one() - Allocate and map skb for RX * @rx_ring: RX ring structure of the skb * @dma_addr: Pointer to storage for DMA address (output param) + * @fl_bufsz: size of freelist buffers * * This function will allcate a new skb, map it for DMA. * * Return: allocated skb or NULL on failure. 
*/ static struct sk_buff * -nfp_net_rx_alloc_one(struct nfp_net_rx_ring *rx_ring, dma_addr_t *dma_addr) +nfp_net_rx_alloc_one(struct nfp_net_rx_ring *rx_ring, dma_addr_t *dma_addr, +unsigned int fl_bufsz) { struct nfp_net *nn = rx_ring->r_vec->nfp_net; struct sk_buff *skb; - skb = netdev_alloc_skb(nn->netdev, nn->fl_bufsz); + skb = netdev_alloc_skb(nn->netdev, fl_bufsz); if (!skb) { nn_warn_ratelimit(nn, "Failed to alloc receive SKB\n"); return NULL; } *dma_addr = dma_map_single(&nn->pdev->dev, skb->data, - nn->fl_bufsz, DMA_FROM_DEVICE); + fl_bufsz, DMA_FROM_DEVICE); if (dma_mapping_error(&nn->pdev->dev, *dma_addr)) { dev_kfree_skb_any(skb); nn_warn_ratelimit(nn, "Failed to map DMA RX buffer\n"); @@ -1069,7 +1071,7 @@ nfp_net_rx_ring_bufs_free(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) continue; dma_unmap_single(&pdev->dev, rx_ring->rxbufs[i].dma_addr, -nn->fl_bufsz, DMA_FROM_DEVICE); +rx_ring->bufsz, DMA_FROM_DEVICE); dev_kfree_skb_any(rx_ring->rxbufs[i].skb); rx_ring->rxbufs[i].dma_addr = 0; rx_ring->rxbufs[i].skb = NULL; @@ -1091,7 +1093,8 @@ nfp_net_rx_ring_bufs_alloc(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) for (i = 0; i < rx_ring->cnt - 1; i++) { rxbufs[i].skb = - nfp_net_rx_alloc_one(rx_ring, &rxbufs[i].dma_addr); + nfp_net_rx_alloc_one(rx_ring, &rxbufs[i].dma_addr, +rx_ring->bufsz); if (!rxbufs[i].skb) { nfp_net_rx_ring_bufs_free(nn, rx_ring); return -ENOMEM; @@ -1279,7 +1282,8 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) skb = rx_ring->rxbufs[idx].skb; - new_skb = nfp_net_rx_alloc_one(rx_ring, &new_dma_addr); + new_skb = nfp_net_rx_alloc_one(rx_ring, &new_dma_addr, + nn->fl_bufsz); if (!new_skb) { nfp_net_rx_give_one(rx_ring, rx_ring->rxbufs[idx].skb, rx_ring->rxbufs[idx].dma_addr); @@ -1466,10 +1470,12 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring) /** * nfp_net_rx_ring_alloc() - Allocate resource for a RX ring * @rx_ring: RX ring to allocate + * @fl_bufsz: Size of buffers to allocate * 
* Return: 0 on success, negative errno otherwise. */ -static int nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring) +static int +nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz) { struct nfp_net_r_vector *r_vec = rx
[PATCH v4 net-next 13/15] nfp: convert .ndo_change_mtu() to prepare/commit paradigm
When changing the MTU on a running device, first allocate new rings and buffers, and only once that succeeds proceed with changing the MTU. Allocation of new rings is not really necessary for this operation - it's done to keep the code simple and because the size of the extra ring memory is quite small compared to the size of the buffers. The operation can still fail midway through if FW communication times out. In that case we retry with the old MTU (and rings). Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 110 +++-- 1 file changed, 103 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index eeabc33fe13d..33001ce1d8bf 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1507,6 +1507,64 @@ err_alloc: return -ENOMEM; } +static struct nfp_net_rx_ring * +nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz) +{ + struct nfp_net_rx_ring *rings; + unsigned int r; + + rings = kcalloc(nn->num_rx_rings, sizeof(*rings), GFP_KERNEL); + if (!rings) + return NULL; + + for (r = 0; r < nn->num_rx_rings; r++) { + nfp_net_rx_ring_init(&rings[r], nn->rx_rings[r].r_vec, r); + + if (nfp_net_rx_ring_alloc(&rings[r], fl_bufsz)) + goto err_free_prev; + + if (nfp_net_rx_ring_bufs_alloc(nn, &rings[r])) + goto err_free_ring; + } + + return rings; + +err_free_prev: + while (r--) { + nfp_net_rx_ring_bufs_free(nn, &rings[r]); +err_free_ring: + nfp_net_rx_ring_free(&rings[r]); + } + kfree(rings); + return NULL; +} + +static struct nfp_net_rx_ring * +nfp_net_shadow_rx_rings_swap(struct nfp_net *nn, struct nfp_net_rx_ring *rings) +{ + struct nfp_net_rx_ring *old = nn->rx_rings; + unsigned int r; + + for (r = 0; r < nn->num_rx_rings; r++) + old[r].r_vec->rx_ring = &rings[r]; + + nn->rx_rings = rings; + return old; +} + +static void +nfp_net_shadow_rx_rings_free(struct nfp_net *nn, struct nfp_net_rx_ring 
*rings) +{ + unsigned int r; + + for (r = 0; r < nn->num_r_vecs; r++) { + nfp_net_rx_ring_bufs_free(nn, &rings[r]); + nfp_net_rx_ring_free(&rings[r]); + } + + kfree(rings); +} + static int nfp_net_prepare_vector(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, int idx) @@ -1985,24 +2043,62 @@ static void nfp_net_set_rx_mode(struct net_device *netdev) static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) { + unsigned int old_mtu, old_fl_bufsz, new_fl_bufsz; struct nfp_net *nn = netdev_priv(netdev); + struct nfp_net_rx_ring *tmp_rings; + int err; if (new_mtu < 68 || new_mtu > nn->max_mtu) { nn_err(nn, "New MTU (%d) is not valid\n", new_mtu); return -EINVAL; } - netdev->mtu = new_mtu; - nn->fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + + old_mtu = netdev->mtu; + old_fl_bufsz = nn->fl_bufsz; + new_fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + MPLS_HLEN * 8 + new_mtu; - /* restart if running */ - if (netif_running(netdev)) { - nfp_net_netdev_close(netdev); - nfp_net_netdev_open(netdev); + if (!netif_running(netdev)) { + netdev->mtu = new_mtu; + nn->fl_bufsz = new_fl_bufsz; + return 0; } - return 0; + /* Prepare new rings */ + tmp_rings = nfp_net_shadow_rx_rings_prepare(nn, new_fl_bufsz); + if (!tmp_rings) + return -ENOMEM; + + /* Stop device, swap in new rings, try to start the firmware */ + nfp_net_close_stack(nn); + nfp_net_clear_config_and_disable(nn); + + tmp_rings = nfp_net_shadow_rx_rings_swap(nn, tmp_rings); + + netdev->mtu = new_mtu; + nn->fl_bufsz = new_fl_bufsz; + + err = nfp_net_set_config_and_enable(nn); + if (err) { + const int err_new = err; + + /* Try with old configuration and old rings */ + tmp_rings = nfp_net_shadow_rx_rings_swap(nn, tmp_rings); + + netdev->mtu = old_mtu; + nn->fl_bufsz = old_fl_bufsz; + + err = __nfp_net_set_config_and_enable(nn); + if (err) + nn_err(nn, "Can't restore MTU - FW communication failed (%d,%d)\n", + err_new, err); + } + + nfp_net_shadow_rx_rings_free(nn, tmp_rings); + + 
nfp_net_open_stack(nn); + + return err; } static struct rtnl_link_stats64 *nfp_net_stat64(struct net_device *netdev, -- 1.9.1
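The commit-with-fallback flow of the new `.ndo_change_mtu()` can be modeled in a few lines of userspace C. This is a hedged sketch with invented `fake_*` names, not the driver's code; `fake_fw_enable` is a pure test stub standing in for the FW handshake that can time out. As in the patch, when the new configuration fails but the restore succeeds, the function returns the restore's (successful) result, and the original error is reported only if even the restore fails.

```c
/* Illustrative configuration state; the real driver carries MTU,
 * fl_bufsz and the ring set. */
struct fake_cfg {
	int mtu;
};

/* Test stub for the FW handshake: "times out" for jumbo MTUs so the
 * fallback path can be exercised. Not a real driver function. */
static int fake_fw_enable(const struct fake_cfg *cfg)
{
	return cfg->mtu > 9000 ? -1 : 0;
}

/* Prepare/commit with fallback, modeled on the patched
 * nfp_net_change_mtu(): commit the new value, try to enable; on
 * failure restore the old configuration and re-enable. */
static int fake_commit_mtu(struct fake_cfg *live, int new_mtu,
			   int (*fw_enable)(const struct fake_cfg *))
{
	struct fake_cfg old = *live;
	int err;

	live->mtu = new_mtu;		/* commit new configuration */
	err = fw_enable(live);
	if (err) {
		const int err_new = err;

		*live = old;		/* fall back to old config */
		err = fw_enable(live);
		if (err)
			return err_new;	/* could not even restore */
	}
	return err;
}
```

The key property is that the only failures after the commit point are FW ones, and both configurations needed for recovery are still in hand.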
[PATCH v4 net-next 07/15] nfp: reorganize initial filling of RX rings
Separate the allocation of buffers from giving them to the FW; thanks to this it will be possible to move allocation earlier on the .ndo_open() path and reuse buffers during runtime reconfiguration. Similarly to the TX side, clean up the spill of functionality from flush into freeing the ring. Unlike on the TX side, RX ring reset does not free buffers from the ring. Ring reset means only that FW pointers are zeroed and buffers on the ring must be placed in [0, cnt - 1) positions. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 119 ++--- 1 file changed, 78 insertions(+), 41 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 9a027a3cfe02..b057102769f9 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1021,62 +1021,100 @@ static void nfp_net_rx_give_one(struct nfp_net_rx_ring *rx_ring, } /** - * nfp_net_rx_flush() - Free any buffers currently on the RX ring - * @rx_ring: RX ring to remove buffers from + * nfp_net_rx_ring_reset() - Reflect in SW state of freelist after disable + * @rx_ring: RX ring structure * - * Assumes that the device is stopped + * Warning: Do *not* call if ring buffers were never put on the FW freelist + * (i.e. device was not enabled)! 
*/ -static void nfp_net_rx_flush(struct nfp_net_rx_ring *rx_ring) +static void nfp_net_rx_ring_reset(struct nfp_net_rx_ring *rx_ring) { - struct nfp_net *nn = rx_ring->r_vec->nfp_net; - struct pci_dev *pdev = nn->pdev; - int idx; + unsigned int wr_idx, last_idx; - while (rx_ring->rd_p != rx_ring->wr_p) { - idx = rx_ring->rd_p % rx_ring->cnt; + /* Move the empty entry to the end of the list */ + wr_idx = rx_ring->wr_p % rx_ring->cnt; + last_idx = rx_ring->cnt - 1; + rx_ring->rxbufs[wr_idx].dma_addr = rx_ring->rxbufs[last_idx].dma_addr; + rx_ring->rxbufs[wr_idx].skb = rx_ring->rxbufs[last_idx].skb; + rx_ring->rxbufs[last_idx].dma_addr = 0; + rx_ring->rxbufs[last_idx].skb = NULL; - if (rx_ring->rxbufs[idx].skb) { - dma_unmap_single(&pdev->dev, -rx_ring->rxbufs[idx].dma_addr, -nn->fl_bufsz, DMA_FROM_DEVICE); - dev_kfree_skb_any(rx_ring->rxbufs[idx].skb); - rx_ring->rxbufs[idx].dma_addr = 0; - rx_ring->rxbufs[idx].skb = NULL; - } + memset(rx_ring->rxds, 0, sizeof(*rx_ring->rxds) * rx_ring->cnt); + rx_ring->wr_p = 0; + rx_ring->rd_p = 0; + rx_ring->wr_ptr_add = 0; +} - memset(&rx_ring->rxds[idx], 0, sizeof(rx_ring->rxds[idx])); +/** + * nfp_net_rx_ring_bufs_free() - Free any buffers currently on the RX ring + * @nn:NFP Net device + * @rx_ring: RX ring to remove buffers from + * + * Assumes that the device is stopped and buffers are in [0, ring->cnt - 1) + * entries. After device is disabled nfp_net_rx_ring_reset() must be called + * to restore required ring geometry. + */ +static void +nfp_net_rx_ring_bufs_free(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) +{ + struct pci_dev *pdev = nn->pdev; + unsigned int i; - rx_ring->rd_p++; + for (i = 0; i < rx_ring->cnt - 1; i++) { + /* NULL skb can only happen when initial filling of the ring +* fails to allocate enough buffers and calls here to free +* already allocated ones. 
+*/ + if (!rx_ring->rxbufs[i].skb) + continue; + + dma_unmap_single(&pdev->dev, rx_ring->rxbufs[i].dma_addr, +nn->fl_bufsz, DMA_FROM_DEVICE); + dev_kfree_skb_any(rx_ring->rxbufs[i].skb); + rx_ring->rxbufs[i].dma_addr = 0; + rx_ring->rxbufs[i].skb = NULL; } } /** - * nfp_net_rx_fill_freelist() - Attempt filling freelist with RX buffers - * @rx_ring: RX ring to fill - * - * Try to fill as many buffers as possible into freelist. Return - * number of buffers added. - * - * Return: Number of freelist buffers added. + * nfp_net_rx_ring_bufs_alloc() - Fill RX ring with buffers (don't give to FW) + * @nn:NFP Net device + * @rx_ring: RX ring to remove buffers from */ -static int nfp_net_rx_fill_freelist(struct nfp_net_rx_ring *rx_ring) +static int +nfp_net_rx_ring_bufs_alloc(struct nfp_net *nn, struct nfp_net_rx_ring *rx_ring) { - struct sk_buff *skb; - dma_addr_t dma_addr; + struct nfp_net_rx_buf *rxbufs; + unsigned int i; + + rxbufs = rx_ring->rxbufs; - while (nfp_net_rx_space(rx_ring)) { - skb = nfp_net_rx_alloc_one(rx_ring, &dma_addr); - if (!skb) { - nf
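The index handling of the new ring reset - moving the one empty freelist slot to the end so the buffers occupy `[0, cnt - 1)` - can be sketched in userspace C. Illustrative names only, not the driver's code:

```c
#include <stddef.h>

struct fake_rx_buf {
	unsigned long dma_addr;
	void *skb;
};

/* Userspace sketch of the index math in nfp_net_rx_ring_reset(): the
 * freelist always keeps one empty entry at wr_p % cnt; reset copies
 * the last entry into that slot and clears the last slot, so that
 * positions [0, cnt - 1) all hold buffers. */
static void fake_ring_reset(struct fake_rx_buf *bufs, unsigned int cnt,
			    unsigned int wr_p)
{
	unsigned int wr_idx = wr_p % cnt;	/* the one empty slot */
	unsigned int last_idx = cnt - 1;

	bufs[wr_idx].dma_addr = bufs[last_idx].dma_addr;
	bufs[wr_idx].skb = bufs[last_idx].skb;
	bufs[last_idx].dma_addr = 0;
	bufs[last_idx].skb = NULL;
}
```

After this, the companion `bufs_free`/`bufs_alloc` helpers can walk a fixed `[0, cnt - 1)` range regardless of where the FW pointers had wandered.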
[PATCH v4 net-next 00/15] MTU/buffer reconfig changes
Hi! Sorry it takes me so long to iterate this. The previous series included some not entirely related patches; this one is cut down. The main issue I'm trying to solve here is that .ndo_change_mtu() in the nfpvf driver does a full close/open to reallocate buffers - which, if the open fails, can result in the device being basically closed even though the interface is started. As suggested by you, I try to move towards a paradigm where the resources are allocated first and the MTU change is only done once I'm certain (almost) nothing can fail. Almost, because I need to communicate with the FW and that can always time out. Patch 1 fixes a small issue. The next 10 patches reorganize things so that I can easily allocate new rings and sets of buffers while the device is running. Patches 13 and 15 reshape the .ndo_change_mtu() and ethtool's ring-resize operation into the desired form. Jakub Kicinski (15): nfp: correct RX buffer length calculation nfp: move link state interrupt request/free calls nfp: break up nfp_net_{alloc|free}_rings nfp: make *x_ring_init do all the init nfp: allocate ring SW structs dynamically nfp: cleanup tx ring flush and rename to reset nfp: reorganize initial filling of RX rings nfp: preallocate RX buffers early in .ndo_open nfp: move filling ring information to FW config nfp: slice .ndo_open() and .ndo_stop() up nfp: sync ring state during FW reconfiguration nfp: propagate list buffer size in struct rx_ring nfp: convert .ndo_change_mtu() to prepare/commit paradigm nfp: pass ring count as function parameter nfp: allow ring size reconfiguration at runtime drivers/net/ethernet/netronome/nfp/nfp_net.h | 10 +- .../net/ethernet/netronome/nfp/nfp_net_common.c| 905 ++--- .../net/ethernet/netronome/nfp/nfp_net_ethtool.c | 30 +- 3 files changed, 617 insertions(+), 328 deletions(-) -- 1.9.1
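The allocate-first paradigm the cover letter describes boils down to a prepare/swap/free trio. A minimal userspace sketch, with invented `fake_*` names rather than the driver's:

```c
#include <stdlib.h>

/* "Shadow rings" pattern: build a complete new set of rings off to
 * the side, swap it in (getting the old set back), and free whichever
 * set is no longer needed. */
struct fake_ring_set {
	unsigned int cnt;
	int *descs;
};

static struct fake_ring_set *fake_rings_prepare(unsigned int cnt)
{
	struct fake_ring_set *rs = malloc(sizeof(*rs));

	if (!rs)
		return NULL;
	rs->descs = calloc(cnt, sizeof(*rs->descs));
	if (!rs->descs) {
		free(rs);
		return NULL;	/* live configuration untouched */
	}
	rs->cnt = cnt;
	return rs;
}

/* Swap in the new set and hand the old one back to the caller, so it
 * can either free it or swap it back if enabling the FW fails. */
static struct fake_ring_set *
fake_rings_swap(struct fake_ring_set **live, struct fake_ring_set *new_set)
{
	struct fake_ring_set *old = *live;

	*live = new_set;
	return old;
}

static void fake_rings_free(struct fake_ring_set *rs)
{
	if (!rs)
		return;
	free(rs->descs);
	free(rs);
}
```

Because prepare can fail without touching live state, and swap is infallible, the only remaining failure point is the FW handshake - exactly the property the series is after.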
[PATCH v4 net-next 04/15] nfp: make *x_ring_init do all the init
nfp_net_[rt]x_ring_init functions used to be called from probe path only and some of their functionality was spilled to the call site. In order to reuse them for ring reconfiguration we need them to do all the init. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 28 ++ 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index b435d15ef8d6..7233471af7a8 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -348,12 +348,18 @@ static irqreturn_t nfp_net_irq_exn(int irq, void *data) /** * nfp_net_tx_ring_init() - Fill in the boilerplate for a TX ring * @tx_ring: TX ring structure + * @r_vec:IRQ vector servicing this ring + * @idx: Ring index */ -static void nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring) +static void +nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring, +struct nfp_net_r_vector *r_vec, unsigned int idx) { - struct nfp_net_r_vector *r_vec = tx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; + tx_ring->idx = idx; + tx_ring->r_vec = r_vec; + tx_ring->qcidx = tx_ring->idx * nn->stride_tx; tx_ring->qcp_q = nn->tx_bar + NFP_QCP_QUEUE_OFF(tx_ring->qcidx); } @@ -361,12 +367,18 @@ static void nfp_net_tx_ring_init(struct nfp_net_tx_ring *tx_ring) /** * nfp_net_rx_ring_init() - Fill in the boilerplate for a RX ring * @rx_ring: RX ring structure + * @r_vec:IRQ vector servicing this ring + * @idx: Ring index */ -static void nfp_net_rx_ring_init(struct nfp_net_rx_ring *rx_ring) +static void +nfp_net_rx_ring_init(struct nfp_net_rx_ring *rx_ring, +struct nfp_net_r_vector *r_vec, unsigned int idx) { - struct nfp_net_r_vector *r_vec = rx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; + rx_ring->idx = idx; + rx_ring->r_vec = r_vec; + rx_ring->fl_qcidx = rx_ring->idx * nn->stride_rx; rx_ring->rx_qcidx = rx_ring->fl_qcidx + (nn->stride_rx - 
1); @@ -404,14 +416,10 @@ static void nfp_net_irqs_assign(struct net_device *netdev) cpumask_set_cpu(r, &r_vec->affinity_mask); r_vec->tx_ring = &nn->tx_rings[r]; - nn->tx_rings[r].idx = r; - nn->tx_rings[r].r_vec = r_vec; - nfp_net_tx_ring_init(r_vec->tx_ring); + nfp_net_tx_ring_init(r_vec->tx_ring, r_vec, r); r_vec->rx_ring = &nn->rx_rings[r]; - nn->rx_rings[r].idx = r; - nn->rx_rings[r].r_vec = r_vec; - nfp_net_rx_ring_init(r_vec->rx_ring); + nfp_net_rx_ring_init(r_vec->rx_ring, r_vec, r); } } -- 1.9.1
[PATCH v4 net-next 10/15] nfp: slice .ndo_open() and .ndo_stop() up
Divide .ndo_open() and .ndo_stop() into logical, callable chunks. No functional changes. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 218 + 1 file changed, 136 insertions(+), 82 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index f504de12ed2a..f171a7da8931 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1673,6 +1673,82 @@ nfp_net_vec_write_ring_data(struct nfp_net *nn, struct nfp_net_r_vector *r_vec, nn_writeb(nn, NFP_NET_CFG_TXR_VEC(idx), r_vec->irq_idx); } +static int __nfp_net_set_config_and_enable(struct nfp_net *nn) +{ + u32 new_ctrl, update = 0; + unsigned int r; + int err; + + new_ctrl = nn->ctrl; + + if (nn->cap & NFP_NET_CFG_CTRL_RSS) { + nfp_net_rss_write_key(nn); + nfp_net_rss_write_itbl(nn); + nn_writel(nn, NFP_NET_CFG_RSS_CTRL, nn->rss_cfg); + update |= NFP_NET_CFG_UPDATE_RSS; + } + + if (nn->cap & NFP_NET_CFG_CTRL_IRQMOD) { + nfp_net_coalesce_write_cfg(nn); + + new_ctrl |= NFP_NET_CFG_CTRL_IRQMOD; + update |= NFP_NET_CFG_UPDATE_IRQMOD; + } + + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_vec_write_ring_data(nn, &nn->r_vecs[r], r); + + nn_writeq(nn, NFP_NET_CFG_TXRS_ENABLE, nn->num_tx_rings == 64 ? + 0xULL : ((u64)1 << nn->num_tx_rings) - 1); + + nn_writeq(nn, NFP_NET_CFG_RXRS_ENABLE, nn->num_rx_rings == 64 ? 
+ 0xULL : ((u64)1 << nn->num_rx_rings) - 1); + + nfp_net_write_mac_addr(nn, nn->netdev->dev_addr); + + nn_writel(nn, NFP_NET_CFG_MTU, nn->netdev->mtu); + nn_writel(nn, NFP_NET_CFG_FLBUFSZ, nn->fl_bufsz); + + /* Enable device */ + new_ctrl |= NFP_NET_CFG_CTRL_ENABLE; + update |= NFP_NET_CFG_UPDATE_GEN; + update |= NFP_NET_CFG_UPDATE_MSIX; + update |= NFP_NET_CFG_UPDATE_RING; + if (nn->cap & NFP_NET_CFG_CTRL_RINGCFG) + new_ctrl |= NFP_NET_CFG_CTRL_RINGCFG; + + nn_writel(nn, NFP_NET_CFG_CTRL, new_ctrl); + err = nfp_net_reconfig(nn, update); + + nn->ctrl = new_ctrl; + + /* Since reconfiguration requests while NFP is down are ignored we +* have to wipe the entire VXLAN configuration and reinitialize it. +*/ + if (nn->ctrl & NFP_NET_CFG_CTRL_VXLAN) { + memset(&nn->vxlan_ports, 0, sizeof(nn->vxlan_ports)); + memset(&nn->vxlan_usecnt, 0, sizeof(nn->vxlan_usecnt)); + vxlan_get_rx_port(nn->netdev); + } + + return err; +} + +/** + * nfp_net_set_config_and_enable() - Write control BAR and enable NFP + * @nn: NFP Net device to reconfigure + */ +static int nfp_net_set_config_and_enable(struct nfp_net *nn) +{ + int err; + + err = __nfp_net_set_config_and_enable(nn); + if (err) + nfp_net_clear_config_and_disable(nn); + + return err; +} + /** * nfp_net_start_vec() - Start ring vector * @nn: NFP Net device structure @@ -1693,20 +1769,33 @@ nfp_net_start_vec(struct nfp_net *nn, struct nfp_net_r_vector *r_vec) enable_irq(irq_vec); } +/** + * nfp_net_open_stack() - Start the device from stack's perspective + * @nn: NFP Net device to reconfigure + */ +static void nfp_net_open_stack(struct nfp_net *nn) +{ + unsigned int r; + + for (r = 0; r < nn->num_r_vecs; r++) + nfp_net_start_vec(nn, &nn->r_vecs[r]); + + netif_tx_wake_all_queues(nn->netdev); + + enable_irq(nn->irq_entries[NFP_NET_CFG_LSC].vector); + nfp_net_read_link_status(nn); +} + static int nfp_net_netdev_open(struct net_device *netdev) { struct nfp_net *nn = netdev_priv(netdev); int err, r; - u32 update = 0; - u32 new_ctrl; if 
(nn->ctrl & NFP_NET_CFG_CTRL_ENABLE) { nn_err(nn, "Dev is already enabled: 0x%08x\n", nn->ctrl); return -EBUSY; } - new_ctrl = nn->ctrl; - /* Step 1: Allocate resources for rings and the like * - Request interrupts * - Allocate RX and TX ring resources @@ -1759,20 +1848,6 @@ static int nfp_net_netdev_open(struct net_device *netdev) if (err) goto err_free_rings; - if (nn->cap & NFP_NET_CFG_CTRL_RSS) { - nfp_net_rss_write_key(nn); - nfp_net_rss_write_itbl(nn); - nn_writel(nn, NFP_NET_CFG_RSS_CTRL, nn->rss_cfg); - update |= NFP_NET_CFG_UPDATE_RSS; - } - - if (nn->cap & NFP_NET_CFG_CTRL_IRQMOD) { - nfp_net_coalesce_write_cfg(nn); - - new_ctrl |= NFP_NET_CFG_CTRL_IRQMOD; - update |= NFP_NET_CFG_UPDATE_IRQMOD; - } -
[PATCH v4 net-next 06/15] nfp: cleanup tx ring flush and rename to reset
Since we never used flush without freeing the ring afterwards, the functionality of the two operations got mixed. Rename flush to ring reset and move into it all the things which have to be done after the FW ring state is cleared. While at it, do some clean-ups. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 81 ++ 1 file changed, 37 insertions(+), 44 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 8f7e2e044811..9a027a3cfe02 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -868,61 +868,59 @@ static void nfp_net_tx_complete(struct nfp_net_tx_ring *tx_ring) } /** - * nfp_net_tx_flush() - Free any untransmitted buffers currently on the TX ring - * @tx_ring: TX ring structure + * nfp_net_tx_ring_reset() - Free any untransmitted buffers and reset pointers + * @nn:NFP Net device + * @tx_ring: TX ring structure * * Assumes that the device is stopped */ -static void nfp_net_tx_flush(struct nfp_net_tx_ring *tx_ring) +static void +nfp_net_tx_ring_reset(struct nfp_net *nn, struct nfp_net_tx_ring *tx_ring) { - struct nfp_net_r_vector *r_vec = tx_ring->r_vec; - struct nfp_net *nn = r_vec->nfp_net; - struct pci_dev *pdev = nn->pdev; const struct skb_frag_struct *frag; struct netdev_queue *nd_q; - struct sk_buff *skb; - int nr_frags; - int fidx; - int idx; + struct pci_dev *pdev = nn->pdev; while (tx_ring->rd_p != tx_ring->wr_p) { - idx = tx_ring->rd_p % tx_ring->cnt; + int nr_frags, fidx, idx; + struct sk_buff *skb; + idx = tx_ring->rd_p % tx_ring->cnt; skb = tx_ring->txbufs[idx].skb; - if (skb) { - nr_frags = skb_shinfo(skb)->nr_frags; - fidx = tx_ring->txbufs[idx].fidx; - - if (fidx == -1) { - /* unmap head */ - dma_unmap_single(&pdev->dev, -tx_ring->txbufs[idx].dma_addr, -skb_headlen(skb), -DMA_TO_DEVICE); - } else { - /* unmap fragment */ - frag = &skb_shinfo(skb)->frags[fidx]; - 
dma_unmap_page(&pdev->dev, - tx_ring->txbufs[idx].dma_addr, - skb_frag_size(frag), - DMA_TO_DEVICE); - } - - /* check for last gather fragment */ - if (fidx == nr_frags - 1) - dev_kfree_skb_any(skb); - - tx_ring->txbufs[idx].dma_addr = 0; - tx_ring->txbufs[idx].skb = NULL; - tx_ring->txbufs[idx].fidx = -2; + nr_frags = skb_shinfo(skb)->nr_frags; + fidx = tx_ring->txbufs[idx].fidx; + + if (fidx == -1) { + /* unmap head */ + dma_unmap_single(&pdev->dev, +tx_ring->txbufs[idx].dma_addr, +skb_headlen(skb), DMA_TO_DEVICE); + } else { + /* unmap fragment */ + frag = &skb_shinfo(skb)->frags[fidx]; + dma_unmap_page(&pdev->dev, + tx_ring->txbufs[idx].dma_addr, + skb_frag_size(frag), DMA_TO_DEVICE); } - memset(&tx_ring->txds[idx], 0, sizeof(tx_ring->txds[idx])); + /* check for last gather fragment */ + if (fidx == nr_frags - 1) + dev_kfree_skb_any(skb); + + tx_ring->txbufs[idx].dma_addr = 0; + tx_ring->txbufs[idx].skb = NULL; + tx_ring->txbufs[idx].fidx = -2; tx_ring->qcp_rd_p++; tx_ring->rd_p++; } + memset(tx_ring->txds, 0, sizeof(*tx_ring->txds) * tx_ring->cnt); + tx_ring->wr_p = 0; + tx_ring->rd_p = 0; + tx_ring->qcp_rd_p = 0; + tx_ring->wr_ptr_add = 0; + nd_q = netdev_get_tx_queue(nn->netdev, tx_ring->idx); netdev_tx_reset_queue(nd_q); } @@ -1363,11 +1361,6 @@ static void nfp_net_tx_ring_free(struct nfp_net_tx_ring *tx_ring) tx_ring->txds, tx_ring->dma); tx_ring->cnt = 0; - tx_ring->wr_p = 0; - tx_ring->rd_p = 0; - tx_ring->qcp_rd_p = 0; - tx_ring->wr_ptr_add = 0; - tx_ring->
[PATCH v4 net-next 14/15] nfp: pass ring count as function parameter
Soon ring resize will call these functions with values different from the current configuration, so we need to explicitly pass the ring count as a parameter. Signed-off-by: Jakub Kicinski --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 23 +- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 33001ce1d8bf..631168a1971e 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1408,17 +1408,18 @@ static void nfp_net_tx_ring_free(struct nfp_net_tx_ring *tx_ring) /** * nfp_net_tx_ring_alloc() - Allocate resource for a TX ring * @tx_ring: TX Ring structure to allocate + * @cnt: Ring buffer count * * Return: 0 on success, negative errno otherwise. */ -static int nfp_net_tx_ring_alloc(struct nfp_net_tx_ring *tx_ring) +static int nfp_net_tx_ring_alloc(struct nfp_net_tx_ring *tx_ring, u32 cnt) { struct nfp_net_r_vector *r_vec = tx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; int sz; - tx_ring->cnt = nn->txd_cnt; + tx_ring->cnt = cnt; tx_ring->size = sizeof(*tx_ring->txds) * tx_ring->cnt; tx_ring->txds = dma_zalloc_coherent(&pdev->dev, tx_ring->size, @@ -1471,18 +1472,20 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring) * nfp_net_rx_ring_alloc() - Allocate resource for a RX ring * @rx_ring: RX ring to allocate * @fl_bufsz: Size of buffers to allocate + * @cnt: Ring buffer count * * Return: 0 on success, negative errno otherwise. 
*/ static int -nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz) +nfp_net_rx_ring_alloc(struct nfp_net_rx_ring *rx_ring, unsigned int fl_bufsz, + u32 cnt) { struct nfp_net_r_vector *r_vec = rx_ring->r_vec; struct nfp_net *nn = r_vec->nfp_net; struct pci_dev *pdev = nn->pdev; int sz; - rx_ring->cnt = nn->rxd_cnt; + rx_ring->cnt = cnt; rx_ring->bufsz = fl_bufsz; rx_ring->size = sizeof(*rx_ring->rxds) * rx_ring->cnt; @@ -1508,7 +1511,8 @@ err_alloc: } static struct nfp_net_rx_ring * -nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz) +nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz, + u32 buf_cnt) { struct nfp_net_rx_ring *rings; unsigned int r; @@ -1520,7 +1524,7 @@ nfp_net_shadow_rx_rings_prepare(struct nfp_net *nn, unsigned int fl_bufsz) for (r = 0; r < nn->num_rx_rings; r++) { nfp_net_rx_ring_init(&rings[r], nn->rx_rings[r].r_vec, r); - if (nfp_net_rx_ring_alloc(&rings[r], fl_bufsz)) + if (nfp_net_rx_ring_alloc(&rings[r], fl_bufsz, buf_cnt)) goto err_free_prev; if (nfp_net_rx_ring_bufs_alloc(nn, &rings[r])) @@ -1879,12 +1883,12 @@ static int nfp_net_netdev_open(struct net_device *netdev) if (err) goto err_free_prev_vecs; - err = nfp_net_tx_ring_alloc(nn->r_vecs[r].tx_ring); + err = nfp_net_tx_ring_alloc(nn->r_vecs[r].tx_ring, nn->txd_cnt); if (err) goto err_cleanup_vec_p; err = nfp_net_rx_ring_alloc(nn->r_vecs[r].rx_ring, - nn->fl_bufsz); + nn->fl_bufsz, nn->rxd_cnt); if (err) goto err_free_tx_ring_p; @@ -2065,7 +2069,8 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) } /* Prepare new rings */ - tmp_rings = nfp_net_shadow_rx_rings_prepare(nn, new_fl_bufsz); + tmp_rings = nfp_net_shadow_rx_rings_prepare(nn, new_fl_bufsz, + nn->rxd_cnt); if (!tmp_rings) return -ENOMEM; -- 1.9.1
[PATCH v4 net-next 01/15] nfp: correct RX buffer length calculation
When calculating the RX buffer length we need to account for up to 2 VLAN tags and up to 8 MPLS labels. Rounding up to 1k is a relic of a distant past and can be removed. While at it also remove a trivial print statement. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 43c618bafdb6..307c02c4ba4a 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -61,6 +61,7 @@ #include +#include #include #include "nfp_net_ctrl.h" @@ -1911,9 +1912,6 @@ static void nfp_net_set_rx_mode(struct net_device *netdev) static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) { struct nfp_net *nn = netdev_priv(netdev); - u32 tmp; - - nn_dbg(nn, "New MTU = %d\n", new_mtu); if (new_mtu < 68 || new_mtu > nn->max_mtu) { nn_err(nn, "New MTU (%d) is not valid\n", new_mtu); @@ -1921,10 +1919,8 @@ static int nfp_net_change_mtu(struct net_device *netdev, int new_mtu) } netdev->mtu = new_mtu; - - /* Freelist buffer size rounded up to the nearest 1K */ - tmp = new_mtu + ETH_HLEN + VLAN_HLEN + NFP_NET_MAX_PREPEND; - nn->fl_bufsz = roundup(tmp, 1024); + nn->fl_bufsz = NFP_NET_MAX_PREPEND + ETH_HLEN + VLAN_HLEN * 2 + + MPLS_HLEN * 8 + new_mtu; /* restart if running */ if (netif_running(netdev)) { -- 1.9.1
Re: [PATCH v2 -next] net/core/dev: Warn on a too-short GRO frame
On Fri, 2016-04-01 at 15:58 -0400, Aaron Conole wrote: > From: Aaron Conole > > When signaling that a GRO frame is ready to be processed, the network stack > correctly checks length and aborts processing when a frame is less than 14 > bytes. However, such a condition is really indicative of a broken driver, > and should be loudly signaled, rather than silently dropped as the case is > today. > > Convert the condition to use net_warn_ratelimited() to ensure the stack > loudly complains about such broken drivers. > > Signed-off-by: Aaron Conole > --- Shouldn't we give a hint of device name ? (available in napi->dev->name )
[PATCH] ip6_tunnel: set rtnl_link_ops before calling register_netdevice
When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set before ip6_tnl_create2 is called. When register_netdevice is called, there is no linkinfo attribute in the NEWLINK message because of that. Setting rtnl_link_ops before calling register_netdevice fixes that. Signed-off-by: Thadeu Lima de Souza Cascardo --- net/ipv6/ip6_tunnel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index eb2ac4b..1f20345 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -252,12 +252,12 @@ static int ip6_tnl_create2(struct net_device *dev) t = netdev_priv(dev); + dev->rtnl_link_ops = &ip6_link_ops; err = register_netdevice(dev); if (err < 0) goto out; strcpy(t->parms.name, dev->name); - dev->rtnl_link_ops = &ip6_link_ops; dev_hold(dev); ip6_tnl_link(ip6n, t); -- 2.5.0
Re: [Odd commit author id merge via netdev]
On Fri, 2016-04-01 at 10:51 -0700, santosh shilimkar wrote: > Hi Dave, > > I noticed something odd while checking the recent > commits of mine in kernel.org tree made it via netdev. > > Don't know if its patchwork tool doing this. > Usual author line in my git objects : > Author: Santosh Shilimkar > > But the commits going via your tree seems to be like below.. > Author: email-id > > Few more examples of the commits end of the email. Can this > be fixed for future commits ? The git objects you pulled from > my tree directly have right author format where as ones which > are picked from patchworks seems to be odd. > Patchwork does store this info somehow and re-use it, quite possibly from the very first patch you ever sent. I think this bug was *just* fixed in patchwork, but it'll probably be a while until that fix lands. However, you can go and create a patchwork account with the real name, associate it with all the email addresses you use and then I think it'll pick it up. Not entirely sure though, you'll have to test it. johannes
[PATCH v2 -next] net/core/dev: Warn on a too-short GRO frame
From: Aaron Conole When signaling that a GRO frame is ready to be processed, the network stack correctly checks length and aborts processing when a frame is less than 14 bytes. However, such a condition is really indicative of a broken driver, and should be loudly signaled, rather than silently dropped as the case is today. Convert the condition to use net_warn_ratelimited() to ensure the stack loudly complains about such broken drivers. Signed-off-by: Aaron Conole --- v2: * Convert from WARN_ON to net_warn_ratelimited net/core/dev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe7..1be269e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4663,6 +4663,8 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi) if (unlikely(skb_gro_header_hard(skb, hlen))) { eth = skb_gro_header_slow(skb, hlen, 0); if (unlikely(!eth)) { + net_warn_ratelimited("%s: dropping impossible skb\n", +__func__); napi_reuse_skb(napi, skb); return NULL; } -- 2.5.5
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, Apr 1, 2016 at 12:24 PM, David Miller wrote: > From: Eric Dumazet > Date: Fri, 01 Apr 2016 11:49:03 -0700 > >> For example, TCP stack tracks per socket ID generation, even if it >> sends DF=1 frames. Damn useful for tcpdump analysis and drop >> inference. > > Thanks for mentioning this, I never considered this use case. RFC 6864 is pretty explicit about this, IPv4 ID used only for fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1 The goal with this change is to try and keep most of the existing behavior intact without violating this rule. I would think the sequence number should give you the ability to infer a drop in the case of TCP. In the case of UDP tunnels we are now getting a bit more data since we were ignoring the outer IP header ID before. >> With your change, the resulting GRO packet would propagate the ID of >> first frag. Most GSO/GSO engines will then provide a ID sequence, >> which might not match original packets. Right. But that is only in the case where the IP IDs did not already increment or were left uninitialized, meaning the transmitter was probably already following RFC 6864 and chose a fixed value. Odds are in such a case we end up improving the performance if anything, as there are plenty of legacy systems out there that still require the IPv4 ID increment in order to get LRO/GRO. >> I do not particularly care, but it is worth mentioning that GRO+TSO >> would not be idempotent anymore. In the patch I mentioned we had already broken that. I'm basically just going through and fixing the cases for tunnels where we were doing the outer header wrong while at the same time relaxing the requirements for the inner header if DF is set. I'll probably add some documentation to the Documentation folder about it as well. I'm currently in the process of writing up documentation for GSO and GSO partial for the upcoming patchset. I can pretty easily throw in a few comments about GRO as well.
> Our eventual plan was to start emitting zero in the ID field for > outgoing TCP datagrams with DF set, since the issue that caused us to > generate incrementing IDs in the first place (buggy Microsoft SLHC > compression) we decided is not relevant and important enough to > accommodate any more. For the GSO partial stuff I was probably just going to have the IP ID on the inner headers lurch forward in chunks equal to gso_segs when we are doing the segmentation. I didn't want to use a fixed value just because that would likely make it easy to identify Linux devices being a bump in the wire. I figure if there are already sources that weren't updating IP ID for their segmentation offloads then if we just take that approach odds are we will blend in with the other devices and be more difficult to single out. Another reason for doing it this way is that different devices are going to have different behaviors with GSO partial. In the case of the i40e driver it recognizes both inner and outer network headers so it can increment both correctly. In the case of igb and ixgbe they only can support the outer header so the inner IP ID value would be lurching by gso_size every time we move from one GSO frame to the next. > So outside of your TCP behavior analysis case, there isn't a > compelling argument to keeping that code around any more, rather than > just put zero in the ID field. > > I suppose we could keep the counter code around and allow it to be > enabled using a sysctl or socket option, but how strongly do you > really feel about this? I'm not suggesting we drop the counter code for transmit. What RFC 6864 says is "Originating sources MAY set the IPv4 ID field of atomic datagrams to any value." For transmit we can leave the IP ID code as is. For receive we should not be snooping into the IP ID for any frames that have the DF bit set as devices that have adopted RFC 6864 on their transmit path will end up causing issues. - Alex
Re: [PATCH v3 net-next] net: ipv4: Consider failed nexthops in multipath routes
Hello, On Fri, 1 Apr 2016, David Ahern wrote: > v3 > - Julian comments: changed use of dead in documentation to failed, > init state to NUD_REACHABLE which simplifies fib_good_nh, use of > nh_dev for neighbor lookup, fallback to first entry which is what > current logic does > > v2 > - use rcu locking to avoid refcnts per Eric's suggestion > - only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's > comment > - drop the 'state == NUD_REACHABLE' from the state check since it is > part of NUD_VALID (comment from Julian) > - wrapped the use of the neigh in a sysctl > > Documentation/networking/ip-sysctl.txt | 10 ++ > include/net/netns/ipv4.h | 3 +++ > net/ipv4/fib_semantics.c | 32 > net/ipv4/sysctl_net_ipv4.c | 11 +++ > 4 files changed, 52 insertions(+), 4 deletions(-) > > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > index d97268e8ff10..e08abf96824a 100644 > --- a/net/ipv4/fib_semantics.c > +++ b/net/ipv4/fib_semantics.c > @@ -1559,21 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int > nh_flags) > } > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > +static bool fib_good_nh(const struct fib_nh *nh) > +{ > + int state = NUD_REACHABLE; > + > + if (nh->nh_scope == RT_SCOPE_LINK) { > + struct neighbour *n = NULL; NULL is not needed anymore. > + > + rcu_read_lock_bh(); > + > + n = __neigh_lookup_noref(&arp_tbl, &nh->nh_gw, nh->nh_dev); > + if (n) > + state = n->nud_state; > + > + rcu_read_unlock_bh(); > + } > + > + return !!(state & NUD_VALID); > +} > > void fib_select_multipath(struct fib_result *res, int hash) > { > struct fib_info *fi = res->fi; > + struct net *net = fi->fib_net; > + unsigned char first_nhsel = 0; Looking at fib_table_lookup() res->nh_sel is not 0 in all cases. 
I personally don't like that we do not fallback properly but to make this logic more correct we can use something like this: bool first = false; > > for_nexthops(fi) { > if (hash > atomic_read(&nh->nh_upper_bound)) > continue; > > - res->nh_sel = nhsel; > - return; > + if (!net->ipv4.sysctl_fib_multipath_use_neigh || > + fib_good_nh(nh)) { > + res->nh_sel = nhsel; > + return; > + } > + if (!first_nhsel) > + first_nhsel = nhsel; if (!first) { res->nh_sel = nhsel; first = true; } > } endfor_nexthops(fi); > > - /* Race condition: route has just become dead. */ > - res->nh_sel = 0; > + res->nh_sel = first_nhsel; And then this is not needed anymore. Even setting to 0 was not needed because 0 is not better than current nh_sel when both are DEAD/LINKDOWN. Regards -- Julian Anastasov
Re: [PATCH] RDS: sync congestion map updating
(cc-ing netdev) On 3/30/2016 7:59 PM, Wengang Wang wrote: On 2016-03-31 09:51, Wengang Wang wrote: On 2016-03-31 01:16, santosh shilimkar wrote: Hi Wengang, On 3/30/2016 9:19 AM, Leon Romanovsky wrote: On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote: Problem is found that some among a lot of parallel RDS communications hang. In my test ten or so among 33 communications hang. The send requests got -ENOBUF error meaning the peer socket (port) is congested. But meanwhile, peer socket (port) is not congested. The congestion map updating can happen in two paths: one is in the rds_recvmsg path and the other is when it receives packets from the hardware. There is no synchronization when updating the congestion map. So a bit operation (clearing) in the rds_recvmsg path can be skipped by another bit operation (setting) in the hardware packet receiving path. To be more detailed: here, the two paths (user calls recvmsg and hardware receives data) are for different rds socks, thus the rds_sock->rs_recv_lock is not helpful to sync the updating on the congestion map. For archive purposes, let me try to conclude the thread. I synced with Wengang offlist and came up with the below fix. I was under the impression that __set_bit_le() was the atomic version. After fixing it like the patch (end of the email), the bug gets addressed. I will probably send this as a fix for stable as well. From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001 From: Santosh Shilimkar Date: Wed, 30 Mar 2016 23:26:47 -0700 Subject: [PATCH] RDS: Fix the atomicity for congestion map update Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via the receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Let's use atomics to avoid such an issue. Full credit to Wengang for finding the issue, analysing it and also pointing out the offending code with a spin lock based fix.
Signed-off-by: Wengang Wang Signed-off-by: Santosh Shilimkar --- net/rds/cong.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/cong.c b/net/rds/cong.c index e6144b8..6641bcf 100644 --- a/net/rds/cong.c +++ b/net/rds/cong.c @@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __set_bit_le(off, (void *)map->m_page_addrs[i]); + set_bit_le(off, (void *)map->m_page_addrs[i]); } void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) @@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __clear_bit_le(off, (void *)map->m_page_addrs[i]); + clear_bit_le(off, (void *)map->m_page_addrs[i]); } static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port) -- 1.7.1
[RFC v3 -next 0/2] virtio-net: Advised MTU feature
The following series adds the ability for a hypervisor to set an MTU on the guest during the feature negotiation phase. This is useful for VM orchestration when, for instance, tunneling is involved and the MTU of the various systems should be homogeneous. The first patch adds the feature bit as described in the proposed VIRTIO spec addition found at https://lists.oasis-open.org/archives/virtio-dev/201603/msg1.html The second patch adds a user of the bit, and a warning when the guest changes the MTU from the hypervisor-advised MTU. Future patches may add more thorough error handling. v2: * Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni * Additional test before printing a warning v3: * Removed the warning when changing MTU (which simplified the code) Aaron Conole (2): virtio: Start feature MTU support virtio_net: Read the advised MTU drivers/net/virtio_net.c| 8 include/uapi/linux/virtio_net.h | 3 +++ 2 files changed, 11 insertions(+) -- 2.5.5
[RFC v3 -next 2/2] virtio_net: Read the advised MTU
This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it exists, read the advised MTU and use it. No proper error handling is provided for the case where a user changes the negotiated MTU. A future commit will add proper error handling. Instead, a warning is emitted if the guest changes the device MTU after previously being given advice. Signed-off-by: Aaron Conole --- v2: * Whitespace cleanup in the last hunk * Code style change around the pr_warn * Additional test for mtu change before printing warning v3: * removed the mtu change warning drivers/net/virtio_net.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 49d84e5..2308083 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1450,6 +1450,7 @@ static const struct ethtool_ops virtnet_ethtool_ops = { static int virtnet_change_mtu(struct net_device *dev, int new_mtu) { + struct virtnet_info *vi = netdev_priv(dev); if (new_mtu < MIN_MTU || new_mtu > MAX_MTU) return -EINVAL; dev->mtu = new_mtu; @@ -1896,6 +1897,12 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ)) vi->has_cvq = true; + if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) { + dev->mtu = virtio_cread16(vdev, + offsetof(struct virtio_net_config, + mtu)); + } + if (vi->any_header_sg) dev->needed_headroom = vi->hdr_len; @@ -2081,6 +2088,7 @@ static unsigned int features[] = { VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, VIRTIO_NET_F_CTRL_MAC_ADDR, VIRTIO_F_ANY_LAYOUT, + VIRTIO_NET_F_MTU, }; static struct virtio_driver virtio_net_driver = { -- 2.5.5
[RFC v3 -net 1/2] virtio: Start feature MTU support
This commit adds the feature bit and associated mtu device entry for the virtio network device. Future commits will make use of these bits to support negotiated MTU. Signed-off-by: Aaron Conole --- v2,v3: * No change include/uapi/linux/virtio_net.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index ec32293..41a6a01 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -55,6 +55,7 @@ #define VIRTIO_NET_F_MQ22 /* Device supports Receive Flow * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ +#define VIRTIO_NET_F_MTU 25/* Device supports Default MTU Negotiation */ #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ @@ -73,6 +74,8 @@ struct virtio_net_config { * Legal values are between 1 and 0x8000 */ __u16 max_virtqueue_pairs; + /* Default maximum transmit unit advice */ + __u16 mtu; } __attribute__((packed)); /* -- 2.5.5
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Eric Dumazet Date: Fri, 01 Apr 2016 11:49:03 -0700 > For example, TCP stack tracks per socket ID generation, even if it > sends DF=1 frames. Damn useful for tcpdump analysis and drop > inference. Thanks for mentioning this, I never considered this use case. > With your change, the resulting GRO packet would propagate the ID of > first frag. Most GSO/GSO engines will then provide a ID sequence, > which might not match original packets. > > I do not particularly care, but it is worth mentioning that GRO+TSO > would not be idempotent anymore. Our eventual plan was to start emitting zero in the ID field for outgoing TCP datagrams with DF set, since the issue that caused us to generate incrementing IDs in the first place (buggy Microsoft SLHC compression) we decided is not relevant and important enough to accommodate any more. So outside of your TCP behavior analysis case, there isn't a compelling argument to keeping that code around any more, rather than just put zero in the ID field. I suppose we could keep the counter code around and allow it to be enabled using a sysctl or socket option, but how strongly do you really feel about this?
Re: [PATCH] net: mvneta: fix changing MTU when using per-cpu processing
From: Marcin Wojtas Date: Fri, 1 Apr 2016 15:21:18 +0200 > After enabling per-cpu processing it appeared that under heavy load > changing MTU can result in blocking all port's interrupts and transmitting > data is not possible after the change. > > This commit fixes above issue by disabling percpu interrupts for the > time, when TXQs and RXQs are reconfigured. > > Signed-off-by: Marcin Wojtas Applied, thanks. When I reviewed this I was worried that this was yet another case where the ndo op could be invoked in a potentially atomic or similar context, whereby on_each_cpu() would be illegal to use. But that appears to not be the case, and thus this change is just fine. Thanks.
Re: [v7, 4/5] powerpc/fsl: move mpc85xx.h to include/linux/fsl
On 03/31/2016 08:07 PM, Yangbo Lu wrote: > drivers/clk/clk-qoriq.c | 3 +-- > For clk part: Acked-by: Stephen Boyd -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH net] vlan: pull on __vlan_insert_tag error path and fix csum correction
From: Daniel Borkmann Date: Fri, 1 Apr 2016 11:41:03 +0200 > Moreover, I noticed that when in the non-error path the __skb_pull() > is done and the original offset to mac header was non-zero, we fixup > from a wrong skb->data offset in the checksum complete processing. > > So the skb_postpush_rcsum() really needs to be done before __skb_pull() > where skb->data still points to the mac header start. Ugh, what a mess, are you sure any of this is right even after your change? What happens (outside of the csum part) is this: __skb_push(offset); __vlan_insert_tag(); { skb_push(VLAN_HLEN); ... memmove(skb->data, skb->data + VLAN_HLEN, 2 * ETH_ALEN); } __skb_pull(offset); If I understand this correctly, the last pull will therefore put skb->data pointing at vlan_ethhdr->h_vlan_TCI of the new VLAN header pushed by __vlan_insert_tag(). That is assuming skb->data began right after the original ethernet header. To me, that postpull csum currently is absolutely in the correct spot, because it's acting upon the pull done by __vlan_insert_tag(), not the one done here by skb_vlan_push(). Right? Can you tell me how you tested this? Just curious...
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On Fri, 2016-04-01 at 11:05 -0700, Alexander Duyck wrote: > RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other > than fragmentation and reassembly. Currently we are looking at this field > as a way of identifying what frames can be aggregated and which cannot for > GRO. While this is valid for frames that do not have DF set, it is invalid > to do so if the bit is set. > > In addition we were generating IPv4 ID collisions when 2 or more flows were > interleaved over the same tunnel. To prevent that we store the result of > all IP ID checks via a "|=" instead of overwriting previous values. Note that for atomic datagrams (DF = 1), since fragmentation and reassembly cannot occur, maybe some people use the ID field for other purposes. For example, the TCP stack tracks per-socket ID generation, even if it sends DF=1 frames. Damn useful for tcpdump analysis and drop inference. With your change, the resulting GRO packet would propagate the ID of the first frag. Most GSO/TSO engines will then provide an ID sequence, which might not match original packets. I do not particularly care, but it is worth mentioning that GRO+TSO would not be idempotent anymore.
Re: [PATCH] net: mvpp2: use cache_line_size() to get cacheline size
From: Jisheng Zhang Date: Fri, 1 Apr 2016 17:11:05 +0800 > L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size > to determine the cacheline size in runtime. > > Signed-off-by: Jisheng Zhang > Suggested-by: Marcin Wojtas Applied.
Re: [PATCH 1/2] ipv6: rework the lock in addrconf_permanent_addr
From: roy.qing...@gmail.com Date: Fri, 1 Apr 2016 17:26:58 +0800 > From: Li RongQing > > 1. nothing of idev is changed, so read lock is enough > 2. ifp is changed, so used ifp->lock or cmpxchg to protect it > > Signed-off-by: Li RongQing You posted this patch twice and didn't post patch 2/2. I'm tossing this from patchwork, please resubmit this properly.
Re: [PATCH] net: mvneta: use cache_line_size() to get cacheline size
From: Jisheng Zhang Date: Fri, 1 Apr 2016 17:12:49 +0800 > L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size > to determine the cacheline size in runtime. > > Signed-off-by: Jisheng Zhang > Suggested-by: Marcin Wojtas Applied.
Re: [PATCH] bridge: remove br_dev_set_multicast_list
From: roy.qing...@gmail.com Date: Fri, 1 Apr 2016 16:16:10 +0800 > From: Li RongQing > > remove br_dev_set_multicast_list which does nothing > > Signed-off-by: Li RongQing This will break SIOCADDMULTI et al. on the bridge, see net/core/dev.c which checks whether this ndo OP is NULL or not. Please sufficiently grep the source tree on how an ndo operation is used before making changes like this. Thanks.
Re: [PATCH (net.git) 0/3] stmmac MDIO and normal descr fixes
From: Giuseppe Cavallaro Date: Fri, 1 Apr 2016 09:07:13 +0200 > This patch series is to fix the problems below and recently debugged > in this mailing list: > > o to fix a problem for the HW where the normal descriptor > o to fix the mdio registration according to the different > platform configurations > > I am resending all the patches again: built on top of net.git repo. Series applied, thanks.
Re: [PATCH] net: mvpp2: fix maybe-uninitialized warning
From: Jisheng Zhang Date: Thu, 31 Mar 2016 17:01:23 +0800 > This is to fix the following maybe-uninitialized warning: > > drivers/net/ethernet/marvell/mvpp2.c:6007:18: warning: 'err' may be > used uninitialized in this function [-Wmaybe-uninitialized] > > Signed-off-by: Jisheng Zhang Applied.
Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
On 04/01/2016 08:33 PM, David Miller wrote: From: Daniel Borkmann Date: Fri, 01 Apr 2016 10:10:11 +0200 Dave, do you need me to resubmit this one w/o changes: http://patchwork.ozlabs.org/patch/603903/ ? I'll apply this and queue it up for -stable, thanks. Ok, thanks!
Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
On 04/01/2016 08:10 PM, Alexei Starovoitov wrote: On 4/1/16 2:58 AM, Naveen N. Rao wrote: PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2. Enable with: echo 1 > /proc/sys/net/core/bpf_jit_enable or echo 2 > /proc/sys/net/core/bpf_jit_enable ... to see the generated JIT code. This can further be processed with tools/net/bpf_jit_disasm. With CONFIG_TEST_BPF=m and 'modprobe test_bpf': test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed] ... on both ppc64 BE and LE. The details of the approach are documented through various comments in the code, as are the TODOs. Some of the prominent TODOs include implementing BPF tail calls and skb loads. Cc: Matt Evans Cc: Michael Ellerman Cc: Paul Mackerras Cc: Alexei Starovoitov Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Signed-off-by: Naveen N. Rao --- arch/powerpc/include/asm/ppc-opcode.h | 19 +- arch/powerpc/net/Makefile | 4 + arch/powerpc/net/bpf_jit.h| 66 ++- arch/powerpc/net/bpf_jit64.h | 58 +++ arch/powerpc/net/bpf_jit_comp64.c | 828 ++ 5 files changed, 973 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/net/bpf_jit64.h create mode 100644 arch/powerpc/net/bpf_jit_comp64.c ... -#ifdef CONFIG_PPC64 +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2) impressive stuff! +1, awesome to see another one! Everything nicely documented. Could you add few words for the above condition as well ? Or may be a new macro, since it occurs many times? What are these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs ? Will there ever be v3 ? Minor TODO would also be to convert to use bpf_jit_binary_alloc() and bpf_jit_binary_free() API for the image, which is done by other eBPF jits, too. So far most of the bpf jits were going via net-next tree, but if in this case no changes to the core is necessary then I guess it's fine to do it via powerpc tree. What's your plan?
Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get
From: Daniel Borkmann Date: Fri, 01 Apr 2016 10:10:11 +0200 > Dave, do you need me to resubmit this one w/o changes: > http://patchwork.ozlabs.org/patch/603903/ ? I'll apply this and queue it up for -stable, thanks.
[PATCH] Marvell phy: add fiber status check for some components
From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001 From: Charles-Antoine Couret Date: Fri, 1 Apr 2016 16:16:35 +0200 Subject: [PATCH] Marvell phy: add fiber status check for some components This patch is not tested with all Marvell PHYs. The new function is activated only for tested PHYs. Signed-off-by: Charles-Antoine Couret --- drivers/net/phy/marvell.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index ab1d0fc..5ac186e 100644 --- a/drivers/net/phy/marvell.c +++ b/drivers/net/phy/marvell.c @@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev) return 0; } +/* marvell_read_fiber_status + * + * Some Marvell PHYs have two modes: fiber and copper. + * Both need status checked. + * Description: + * First, check the fiber link and status. + * If the fiber link is down, check the copper link and status which + * will be the default value if both links are down. + */ +static int marvell_read_fiber_status(struct phy_device *phydev) +{ + int err; + + /* Check the fiber mode first */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); + if (err < 0) + return err; + + err = marvell_read_status(phydev); + if (err < 0) + return err; + + if (phydev->link) + return 0; + + /* If fiber link is down, check and save copper mode state */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); + if (err < 0) + return err; + + return marvell_read_status(phydev); +} + static int marvell_aneg_done(struct phy_device *phydev) { int retval = phy_read(phydev, MII_M1011_PHY_STATUS); @@ -1122,7 +1155,7 @@ static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &m88e_config_init, .config_aneg = &marvell_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .resume = &genphy_resume, @@ -1270,7 +1303,7 @@
static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &marvell_config_init, .config_aneg = &m88e1510_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .did_interrupt = &m88e1121_did_interrupt, -- 2.5.5
Re: Question on rhashtable in worst-case scenario.
On 03/31/2016 05:46 PM, Herbert Xu wrote: On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote: Does removing this completely disable the "-EEXIST" error? I can't say I fully understand the elasticity stuff in __rhashtable_insert_fast(). What EEXIST error are you talking about? The only one that can be returned on insertion is if you're explicitly checking for dups which clearly can't be the case for you. If you're talking about the EEXIST error due to a rehash then it is completely hidden from you by rhashtable_insert_rehash. If you actually meant EBUSY then yes this should prevent it from occurring, unless your chain-length exceeds 2^32. EEXIST was on removal, and was a symptom of the failure to insert, not really a problem itself. I reverted my revert (i.e., back to rhashtable), added Johannes' patch to check insertion (and added my own pr_err there). I also added this: diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index 38ef0be..c25b945 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -66,6 +66,7 @@ static const struct rhashtable_params sta_rht_params = { .nelem_hint = 3, /* start small */ + .insecure_elasticity = true, /* Disable chain-length checks. */ .automatic_shrinking = true, .head_offset = offsetof(struct sta_info, hash_node), .key_offset = offsetof(struct sta_info, addr), Now, my test case seems to pass, though I did have one strange issue before I put in the pr_err. I'm not sure if it was a hashtable issue or something else, but I have lots of something-else going on in this system, so I'd say that likely the patch above fixes rhashtable for my use case. I will of course let you know if I run into more issues that appear to be hashtable related! Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
On 4/1/16 2:58 AM, Naveen N. Rao wrote: PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2. Enable with: echo 1 > /proc/sys/net/core/bpf_jit_enable or echo 2 > /proc/sys/net/core/bpf_jit_enable ... to see the generated JIT code. This can further be processed with tools/net/bpf_jit_disasm. With CONFIG_TEST_BPF=m and 'modprobe test_bpf': test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed] ... on both ppc64 BE and LE. The details of the approach are documented through various comments in the code, as are the TODOs. Some of the prominent TODOs include implementing BPF tail calls and skb loads. Cc: Matt Evans Cc: Michael Ellerman Cc: Paul Mackerras Cc: Alexei Starovoitov Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Signed-off-by: Naveen N. Rao --- arch/powerpc/include/asm/ppc-opcode.h | 19 +- arch/powerpc/net/Makefile | 4 + arch/powerpc/net/bpf_jit.h| 66 ++- arch/powerpc/net/bpf_jit64.h | 58 +++ arch/powerpc/net/bpf_jit_comp64.c | 828 ++ 5 files changed, 973 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/net/bpf_jit64.h create mode 100644 arch/powerpc/net/bpf_jit_comp64.c ... -#ifdef CONFIG_PPC64 +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2) impressive stuff! Everything nicely documented. Could you add a few words for the above condition as well? Or maybe a new macro, since it occurs many times? What do these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs? Will there ever be v3? So far most of the bpf jits were going via net-next tree, but if in this case no changes to the core are necessary then I guess it's fine to do it via powerpc tree. What's your plan?
[net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other than fragmentation and reassembly. Currently we are looking at this field as a way of identifying what frames can be aggregated and which cannot for GRO. While this is valid for frames that do not have DF set, it is invalid to do so if the bit is set. In addition we were generating IPv4 ID collisions when 2 or more flows were interleaved over the same tunnel. To prevent that we store the result of all IP ID checks via a "|=" instead of overwriting previous values. Signed-off-by: Alexander Duyck --- net/core/dev.c |1 + net/ipv4/af_inet.c | 23 --- net/ipv6/ip6_offload.c |3 --- 3 files changed, 17 insertions(+), 10 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 77a71cd68535..3429632398a4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb) unsigned long diffs; NAPI_GRO_CB(p)->flush = 0; + NAPI_GRO_CB(p)->flush_id = 0; if (hash != skb_get_hash_raw(p)) { NAPI_GRO_CB(p)->same_flow = 0; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9e481992dbae..7d8733393934 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1347,14 +1347,23 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, (iph->tos ^ iph2->tos) | ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)); - /* Save the IP ID check to be included later when we get to -* the transport layer so only the inner most IP ID is checked. -* This is because some GSO/TSO implementations do not -* correctly increment the IP ID for the outer hdrs. -*/ - NAPI_GRO_CB(p)->flush_id = - ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id); NAPI_GRO_CB(p)->flush |= flush; + + /* For non-atomic datagrams we need to save the IP ID offset +* to be included later. If the frame has the DF bit set +* we must ignore the IP ID value as per RFC 6864. 
+*/ + if (iph2->frag_off & htons(IP_DF)) + continue; + + /* We must save the offset as it is possible to have multiple +* flows using the same protocol and address pairs so we +* need to wait until we can validate this is part of the +* same flow with a 5-tuple or better to avoid unnecessary +* collisions between flows. +*/ + NAPI_GRO_CB(p)->flush_id |= ntohs(iph2->id) ^ + (u16)(id - NAPI_GRO_CB(p)->count); } NAPI_GRO_CB(skb)->flush |= flush; diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 82e9f3076028..9aa53f64dffd 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head, /* flush if Traffic Class fields are different */ NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF0)); NAPI_GRO_CB(p)->flush |= flush; - - /* Clear flush_id, there's really no concept of ID in IPv6. */ - NAPI_GRO_CB(p)->flush_id = 0; } NAPI_GRO_CB(skb)->flush |= flush;
[net PATCH 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck --- include/linux/netdevice.h |5 - net/core/dev.c|1 + net/ipv4/fou.c|6 ++ net/ipv4/gre_offload.c|8 net/ipv4/ip_gre.c | 13 ++--- 5 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cb0d5d09c2e4..8395308a2445 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2120,7 +2120,10 @@ struct napi_gro_cb { /* Used in foo-over-udp, set in udp[46]_gro_receive */ u8 is_ipv6:1; - /* 7 bit hole */ + /* Used in GRE, set in fou/gue_gro_receive */ + u8 is_fou:1; + + /* 6 bit hole */ /* used to support CHECKSUM_COMPLETE for tunneling protocols */ __wsum csum; diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe77d913..77a71cd68535 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff NAPI_GRO_CB(skb)->flush = 0; NAPI_GRO_CB(skb)->free = 0; NAPI_GRO_CB(skb)->encap_mark = 0; + NAPI_GRO_CB(skb)->is_fou = 0; NAPI_GRO_CB(skb)->gro_remcsum_start = 0; /* Setup for GRO checksum validation */ diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index 5a94aea280d3..a39068b4a4d9 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -203,6 +203,9 @@ static 
struct sk_buff **fou_gro_receive(struct sk_buff **head, */ NAPI_GRO_CB(skb)->encap_mark = 0; + /* Flag this frame as already having an outer encap header */ + NAPI_GRO_CB(skb)->is_fou = 1; + rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[proto]); @@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head, */ NAPI_GRO_CB(skb)->encap_mark = 0; + /* Flag this frame as already having an outer encap header */ + NAPI_GRO_CB(skb)->is_fou = 1; + rcu_read_lock(); offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads; ops = rcu_dereference(offloads[guehdr->proto_ctype]); diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index c47539d04b88..6a5bd4317866 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head, if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0) goto out; + /* We can only support GRE_CSUM if we can track the location of +* the GRE header. In the case of FOU/GUE we cannot because the +* outer UDP header displaces the GRE header leaving us in a state +* of limbo. +*/ + if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou) + goto out; + type = greh->protocol; rcu_read_lock(); diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 31936d387cfd..af5d1f38217f 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev) dev->hw_features|= GRE_FEATURES; if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) { - /* TCP offload with GRE SEQ is not supported. */ - dev->features|= NETIF_F_GSO_SOFTWARE; - dev->hw_features |= NETIF_F_GSO_SOFTWARE; + /* TCP offload with GRE SEQ is not supported, nor +* can we support 2 levels of outer headers requiring +* an update. 
+*/ + if (!(tunnel->parms.o_flags & TUNNEL_CSUM) || + (tunnel->encap.type == TUNNEL_ENCAP_NONE)) { + dev->features|= NETIF_F_GSO_SOFTWARE; + dev->hw_features |= NETIF_F_GSO_SOFTWARE; + } + /* Can use a lockless transmit, unless we generate * output sequences */
[net PATCH 0/2] Fixes for GRO and GRE tunnels
This pair of patches addresses a few issues I have discovered over the last week or so concerning GRO and GRE tunnels. The first patch addresses an item I called out as an issue with FOU/GUE encapsulating GRE, and I finally had a chance to test it and verify that the code concerning it was broken, so I took the opportunity to fix it so that we cannot generate a FOU/GUE frame that is encapsulating a GRE tunnel with checksum while requesting TSO/GSO for the frame. The second patch actually addresses something I realized was an issue if we feed a tunnel through GRO and back out through GSO. Specifically it was possible for GRO to generate overlapping IPv4 ID ranges as the outer IP IDs were being ignored for tunnels. Ignoring the IP IDs like this should only be valid if the DF bit is set. This is normally the case for IPIP, SIT, and GRE tunnels, but not so for UDP tunnels. In the case that the DF bit is not set we store off the fact that there was a delta from what we were expecting, and when we hit the inner-most header we validate the value so as to avoid generating a frame which could lead to an IP ID collision on packets that could eventually be fragmented. A side effect is that the inner-most IP ID test is relaxed as well, but the worst case scenario is that we GRO a frame with a throw-away ID sequence anyway, so if anything, segmenting such a frame with the wrong IP IDs should have no negative effects. --- Alexander Duyck (2): GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU ipv4/GRO: Make GRO conform to RFC 6864 include/linux/netdevice.h |5 - net/core/dev.c|2 ++ net/ipv4/af_inet.c| 23 --- net/ipv4/fou.c|6 ++ net/ipv4/gre_offload.c|8 net/ipv4/ip_gre.c | 13 ++--- net/ipv6/ip6_offload.c|3 --- 7 files changed, 46 insertions(+), 14 deletions(-) --
Re: [PATCH 4/4] samples/bpf: Enable powerpc support
On 4/1/16 7:41 AM, Naveen N. Rao wrote: On 2016/03/31 10:52AM, Alexei Starovoitov wrote: On 3/31/16 4:25 AM, Naveen N. Rao wrote: ... + +#ifdef __powerpc__ +#define BPF_KPROBE_READ_RET_IP(ip, ctx){ (ip) = (ctx)->link; } +#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) BPF_KPROBE_READ_RET_IP(ip, ctx) +#else +#define BPF_KPROBE_READ_RET_IP(ip, ctx) \ + bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)) +#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) \ + bpf_probe_read(&(ip), sizeof(ip), \ + (void *)(PT_REGS_FP(ctx) + sizeof(ip))) makes sense, but please use ({ }) gcc extension instead of {} and open call to make sure that macro body is scoped. To be sure I understand this right, do you mean something like this? + +#ifdef __powerpc__ +#define BPF_KPROBE_READ_RET_IP(ip, ctx)({ (ip) = (ctx)->link; }) +#define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP +#else +#define BPF_KPROBE_READ_RET_IP(ip, ctx)({ \ + bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); }) +#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) ({ \ + bpf_probe_read(&(ip), sizeof(ip), \ + (void *)(PT_REGS_FP(ctx) + sizeof(ip))); }) +#endif yes. Thanks!
Re: [PATCH 2/4] samples/bpf: Use llc in PATH, rather than a hardcoded value
On 4/1/16 7:37 AM, Naveen N. Rao wrote: On 2016/03/31 08:19PM, Daniel Borkmann wrote: On 03/31/2016 07:46 PM, Alexei Starovoitov wrote: On 3/31/16 4:25 AM, Naveen N. Rao wrote: clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ --O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@ +-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=obj -o $@ clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ --O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s +-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=asm -o $@.s that was a workaround when clang/llvm didn't have bpf support. Now clang 3.7 and 3.8 have bpf built-in, so make sense to remove manual calls to llc completely. Just use 'clang -target bpf -O2 -D... -c $< -o $@' +1, the clang part in that Makefile should also more correctly be called with '-target bpf' as it turns out (despite llc with '-march=bpf' ...). Better to use clang directly as suggested by Alexei. I'm likely missing something obvious, but I cannot get this to work. With this diff: $(obj)/%.o: $(src)/%.c clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ - -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@ - clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ - -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ - -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s + -O2 -target bpf -c $< -o $@ I see far too many errors thrown starting with: ./arch/x86/include/asm/arch_hweight.h:31:10: error: invalid output constraint '=a' in asm : "="REG_OUT (res) ahh. yes. when processing kernel headers clang has to assume x86 style inline asm, though all of these functions will be ignored. I don't have a quick fix for this yet. 
Let's go back to your original change $(LLC)->llc
[Odd commit author id merge via netdev]
Hi Dave, I noticed something odd while checking whether recent commits of mine in the kernel.org tree made it via netdev. Don't know if it's the patchwork tool doing this. Usual author line in my git objects: Author: Santosh Shilimkar But the commits going via your tree seem to be like below.. Author: email-id A few more examples of the commits are at the end of the email. Can this be fixed for future commits? The git objects you pulled from my tree directly have the right author format, whereas the ones picked from patchwork seem to be odd. Regards, Santosh commit ad6832f950d35df8c70b577993a24b31b34d88e4 Author: santosh.shilim...@oracle.com commit 2cb2912d65633e751d3f8397377174501412aa47 Author: santosh.shilim...@oracle.com commit db42753adb638b63572583162bb08ea193947309 Author: santosh.shilim...@oracle.com [] commit 06766513232d1619ac84e87b1d839d3fcc23a540 Author: Santosh Shilimkar commit 41a4e9646229801624e38f7a1cc53033a0affdb1 Author: Santosh Shilimkar
[PATCH v2] sctp: use list_* in sctp_list_dequeue
Use list_* helpers in sctp_list_dequeue, more readable. Signed-off-by: Marcelo Ricardo Leitner --- v2: patch rechecked include/net/sctp/sctp.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h index 65521cfdcadeee35d61f280165a387cc2164ab6d..03fb33efcae21d54192204629ff4ced2e36e7d4d 100644 --- a/include/net/sctp/sctp.h +++ b/include/net/sctp/sctp.h @@ -386,11 +386,9 @@ static inline struct list_head *sctp_list_dequeue(struct list_head *list) { struct list_head *result = NULL; - if (list->next != list) { + if (!list_empty(list)) { result = list->next; - list->next = result->next; - list->next->prev = list; - INIT_LIST_HEAD(result); + list_del_init(result); } return result; } -- 2.5.0
Re: qdisc spin lock
2016-03-31 19:19 GMT-07:00 David Miller : > From: Michael Ma > Date: Thu, 31 Mar 2016 16:48:43 -0700 > >> I didn't really know that multiple qdiscs can be isolated using MQ so > ... > > Please stop top-posting. Sorry that I wasn't aware of this...
Re: [PATCH] Marvell phy: add fiber status check for some components
On Fri, Apr 01, 2016 at 06:33:48PM +0200, Charles-Antoine Couret wrote: > >From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001 > From: Charles-Antoine Couret > Date: Fri, 1 Apr 2016 16:16:35 +0200 > Subject: [PATCH] Marvell phy: add fiber status check for some components > > This patch is not tested with all Marvell's phy. The new function was actived > only for tested phys. > > Signed-off-by: Charles-Antoine Couret > --- > drivers/net/phy/marvell.c | 37 +++-- > 1 file changed, 35 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c > index ab1d0fc..5ac186e 100644 > --- a/drivers/net/phy/marvell.c > +++ b/drivers/net/phy/marvell.c > @@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev) > return 0; > } > > +/* marvell_read_fiber_status > + * > + * Some Marvell's phys have two modes: fiber and copper. > + * Both need status checked. > + * Description: > + * First, check the fiber link and status. > + * If the fiber link is down, check the copper link and status which > + * will be the default value if both link are down. > + */ > +static int marvell_read_fiber_status(struct phy_device *phydev) > +{ > + int err; > + > + /* Check the fiber mode first */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); > + if (err < 0) > + return err; > + > + err = marvell_read_status(phydev); > + if (err < 0) > + return err; > + > + if (phydev->link) > + return 0; > + > + /* If fiber link is down, check and save copper mode state */ > + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); > + if (err < 0) > + return err; > + > + return marvell_read_status(phydev); > +} Hi Charles Shouldn't you return to page 0, i.e. MII_M_COPPER, under all conditions? Andrew
[PATCH v2] sctp: flush if we can't fit another DATA chunk
There is no point in delaying the packet if we can't fit a single byte of data on it anymore. So let's just reduce the threshold by the amount that a data chunk with 4 bytes (rounding) would use. v2: based on the right tree Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/output.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/sctp/output.c b/net/sctp/output.c index 97745351d58c2fb32b9f9b57d61831d7724d83b2..9844fe573029b9e262743440980f15277ddaf5a1 100644 --- a/net/sctp/output.c +++ b/net/sctp/output.c @@ -705,7 +705,8 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet, /* Check whether this chunk and all the rest of pending data will fit * or delay in hopes of bundling a full sized packet. */ - if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead) + if (chunk->skb->len + q->out_qlen > + transport->pathmtu - packet->overhead - sizeof(sctp_data_chunk_t) - 4) /* Enough data queued to fill a packet */ return SCTP_XMIT_OK; -- 2.5.0
Re: [PATCH 3/4] net: w5100: enable to support sleepable register access interface
2016-04-01 4:30 GMT+09:00 David Miller : > From: Akinobu Mita > Date: Thu, 31 Mar 2016 01:38:39 +0900 > >> + struct sk_buff_head tx_queue; > > The way the queueing works in this driver is that it is only possible > to have one SKB being transmitted at one time. > > This is evident by how the driver immediately stops the TX queue when > it is given a new packet to transmit, and this is woken up by the TX > completion IRQ. > > So don't use a queue here, just use a plain single pointer. > > The SKB queue you're using here is going to also do locking which is > even more unnecessary overhead. Thanks for spotting this. Using a single pointer works fine. Maybe we can support sending multiple packets at a time, but it should be another separate patch.
[PATCH] Marvell phy: add fiber status check for some components
>From a5a7a9828511ff6a522cf742109768207ff89929 Mon Sep 17 00:00:00 2001 From: Charles-Antoine Couret Date: Fri, 1 Apr 2016 16:16:35 +0200 Subject: [PATCH] Marvell phy: add fiber status check for some components This patch has not been tested with all Marvell PHYs. The new function is activated only for tested PHYs. Signed-off-by: Charles-Antoine Couret --- drivers/net/phy/marvell.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index ab1d0fc..5ac186e 100644 --- a/drivers/net/phy/marvell.c +++ b/drivers/net/phy/marvell.c @@ -890,6 +890,39 @@ static int marvell_read_status(struct phy_device *phydev) return 0; } +/* marvell_read_fiber_status + * + * Some Marvell PHYs have two modes: fiber and copper. + * Both need their status checked. + * Description: + * First, check the fiber link and status. + * If the fiber link is down, check the copper link and status, which + * will be the default value if both links are down. + */ +static int marvell_read_fiber_status(struct phy_device *phydev) +{ + int err; + + /* Check the fiber mode first */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_FIBER); + if (err < 0) + return err; + + err = marvell_read_status(phydev); + if (err < 0) + return err; + + if (phydev->link) + return 0; + + /* If fiber link is down, check and save copper mode state */ + err = phy_write(phydev, MII_MARVELL_PHY_PAGE, MII_M_COPPER); + if (err < 0) + return err; + + return marvell_read_status(phydev); +} + static int marvell_aneg_done(struct phy_device *phydev) { int retval = phy_read(phydev, MII_M1011_PHY_STATUS); @@ -1122,7 +1155,7 @@ static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &m88e_config_init, .config_aneg = &marvell_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .resume = &genphy_resume, @@ -1270,7 +1303,7 @@ 
static struct phy_driver marvell_drivers[] = { .probe = marvell_probe, .config_init = &marvell_config_init, .config_aneg = &m88e1510_config_aneg, - .read_status = &marvell_read_status, + .read_status = &marvell_read_fiber_status, .ack_interrupt = &marvell_ack_interrupt, .config_intr = &marvell_config_intr, .did_interrupt = &m88e1121_did_interrupt, -- 2.5.5
Re: [PATCH v2 net-next 11/11] tcp: rate limit ACK sent by SYN_RECV request sockets
On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet wrote: > Attackers like to use SYNFLOOD targeting one 5-tuple, as they > hit a single RX queue (and cpu) on the victim. > > If they use random sequence numbers in their SYN, we detect > they do not match the expected window and send back an ACK. > > This patch adds a rate limitation, so that the effect of such > attacks is limited to ingress only. > > We roughly double our ability to absorb such attacks. Thanks, Eric! Acked-by: Neal Cardwell neal
[PATCH v2 net-next 03/11] tcp/dccp: remove BH disable/enable in lookup
Since linux 2.6.29, lookups only use rcu locking. Signed-off-by: Eric Dumazet --- include/net/inet_hashtables.h | 7 +-- net/ipv6/inet6_hashtables.c | 2 -- 2 files changed, 1 insertion(+), 8 deletions(-) diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 50f635c2c536..a77acee93aaf 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -280,11 +280,8 @@ static inline struct sock *inet_lookup_listener(struct net *net, net_eq(sock_net(__sk), (__net))) #endif /* 64-bit arch */ -/* - * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need +/* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need * not check it for lookups anymore, thanks Alexey. -DaveM - * - * Local BH must be disabled here. */ struct sock *__inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo, @@ -326,10 +323,8 @@ static inline struct sock *inet_lookup(struct net *net, { struct sock *sk; - local_bh_disable(); sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr, dport, dif); - local_bh_enable(); return sk; } diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 70f2628be6fa..d253f32874c9 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -200,10 +200,8 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo, { struct sock *sk; - local_bh_disable(); sk = __inet6_lookup(net, hashinfo, skb, doff, saddr, sport, daddr, ntohs(dport), dif); - local_bh_enable(); return sk; } -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 01/11] net: add SOCK_RCU_FREE socket flag
We want a generic way to insert an RCU grace period before socket freeing for cases where SLAB_DESTROY_BY_RCU is adding too much overhead. SLAB_DESTROY_BY_RCU strict rules force us to take a reference on the socket sk_refcnt, and it is a performance problem for UDP encapsulation, or TCP synflood behavior, as many CPUs might attempt the atomic operations on a shared sk_refcnt. UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their lookup can use traditional RCU rules, without refcount changes. They can set the flag only once hashed and visible by other cpus. Signed-off-by: Eric Dumazet Cc: Tom Herbert --- include/net/sock.h | 2 ++ net/core/sock.c| 14 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/net/sock.h b/include/net/sock.h index 255d3e03727b..c88785a3e76c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -438,6 +438,7 @@ struct sock { struct sk_buff *skb); void(*sk_destruct)(struct sock *sk); struct sock_reuseport __rcu *sk_reuseport_cb; + struct rcu_head sk_rcu; }; #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data))) @@ -720,6 +721,7 @@ enum sock_flags { */ SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */ SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */ + SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ }; #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)) diff --git a/net/core/sock.c b/net/core/sock.c index b67b9aedb230..238a94f879ca 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority, } EXPORT_SYMBOL(sk_alloc); -void sk_destruct(struct sock *sk) +/* Sockets having SOCK_RCU_FREE will call this function after one RCU + * grace period. This is the case for UDP sockets and TCP listeners. 
+ */ +static void __sk_destruct(struct rcu_head *head) { + struct sock *sk = container_of(head, struct sock, sk_rcu); struct sk_filter *filter; if (sk->sk_destruct) @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk) sk_prot_free(sk->sk_prot_creator, sk); } +void sk_destruct(struct sock *sk) +{ + if (sock_flag(sk, SOCK_RCU_FREE)) + call_rcu(&sk->sk_rcu, __sk_destruct); + else + __sk_destruct(&sk->sk_rcu); +} + static void __sk_free(struct sock *sk) { if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt)) -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 10/11] ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply()
TCP uses per cpu 'sockets' to send some packets : - RST packets ( tcp_v4_send_reset()) ) - ACK packets for SYN_RECV and TIMEWAIT sockets By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree() to not call sk_write_space() since these internal sockets do not care. This gives a small performance improvement, merely by allowing cpu to properly predict the sock_wfree() conditional branch, and avoiding one atomic operation. Signed-off-by: Eric Dumazet --- net/ipv4/tcp_ipv4.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index f3ce0afe70aa..456ff3d6a132 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2384,6 +2384,7 @@ static int __net_init tcp_sk_init(struct net *net) IPPROTO_TCP, net); if (res) goto fail; + sock_set_flag(sk, SOCK_USE_WRITE_QUEUE); *per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk; } -- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 00/11] net: various udp/tcp changes
First round of patches for linux-4.7 Add a generic facility for sockets to be freed after an RCU grace period, if they need to. Then UDP stack is changed to no longer use SLAB_DESTROY_BY_RCU, in order to speed up rx processing for traffic encapsulated in UDP. It gives a 17 % speedup for normal UDP reception in stress conditions. Then TCP listeners are changed to use SOCK_RCU_FREE as well to avoid touching sk_refcnt in the synflood case: I got up to 30 % performance increase for a mono listener. Then three patches add SK_MEMINFO_DROPS to sock_diag and add per socket rx drops accounting to TCP. Last patch adds rate limiting on ACKs sent on behalf of SYN_RECV to better resist SYNFLOOD targeting one or a few flows. Eric Dumazet (11): net: add SOCK_RCU_FREE socket flag udp: no longer use SLAB_DESTROY_BY_RCU tcp/dccp: remove BH disable/enable in lookup tcp/dccp: use rcu locking in inet_diag_find_one_icsk() inet: reqsk_alloc() needs to take care of dead listeners tcp/dccp: do not touch listener sk_refcnt under synflood sock_diag: add SK_MEMINFO_DROPS tcp: increment sk_drops for dropped rx packets tcp: increment sk_drops for listeners ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply() tcp: rate limit ACK sent by SYN_RECV request sockets include/linux/udp.h| 8 +- include/net/inet6_hashtables.h | 12 +- include/net/inet_hashtables.h | 47 +++ include/net/request_sock.h | 31 +++-- include/net/sock.h | 21 ++- include/net/tcp.h | 13 ++ include/net/udp.h | 2 +- include/uapi/linux/sock_diag.h | 1 + net/core/sock.c| 15 ++- net/core/sock_diag.c | 1 + net/dccp/ipv4.c| 7 +- net/dccp/ipv6.c| 7 +- net/ipv4/inet_diag.c | 10 +- net/ipv4/inet_hashtables.c | 77 --- net/ipv4/tcp_input.c | 41 +++--- net/ipv4/tcp_ipv4.c| 74 ++- net/ipv4/tcp_minisocks.c | 5 +- net/ipv4/udp.c | 293 - net/ipv4/udp_diag.c| 18 +-- net/ipv6/inet6_hashtables.c| 62 +++-- net/ipv6/tcp_ipv6.c| 32 +++-- net/ipv6/udp.c | 196 +-- net/netfilter/xt_socket.c | 6 +- 23 files changed, 401 insertions(+), 578 deletions(-) 
-- 2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 05/11] inet: reqsk_alloc() needs to take care of dead listeners
We'll soon no longer take a refcount on listeners, so reqsk_alloc()
can not assume a listener refcount is not zero. We need to use
atomic_inc_not_zero().

Signed-off-by: Eric Dumazet
---
 include/net/request_sock.h | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index f49759decb28..6ebe13eb1c4c 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -85,24 +85,23 @@ reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
 	struct request_sock *req;
 
 	req = kmem_cache_alloc(ops->slab, GFP_ATOMIC | __GFP_NOWARN);
-
-	if (req) {
-		req->rsk_ops = ops;
-		if (attach_listener) {
-			sock_hold(sk_listener);
-			req->rsk_listener = sk_listener;
-		} else {
-			req->rsk_listener = NULL;
+	if (!req)
+		return NULL;
+	req->rsk_listener = NULL;
+	if (attach_listener) {
+		if (unlikely(!atomic_inc_not_zero(&sk_listener->sk_refcnt))) {
+			kmem_cache_free(ops->slab, req);
+			return NULL;
 		}
-		req_to_sk(req)->sk_prot = sk_listener->sk_prot;
-		sk_node_init(&req_to_sk(req)->sk_node);
-		sk_tx_queue_clear(req_to_sk(req));
-		req->saved_syn = NULL;
-		/* Following is temporary. It is coupled with debugging
-		 * helpers in reqsk_put() & reqsk_free()
-		 */
-		atomic_set(&req->rsk_refcnt, 0);
+		req->rsk_listener = sk_listener;
 	}
+	req->rsk_ops = ops;
+	req_to_sk(req)->sk_prot = sk_listener->sk_prot;
+	sk_node_init(&req_to_sk(req)->sk_node);
+	sk_tx_queue_clear(req_to_sk(req));
+	req->saved_syn = NULL;
+	atomic_set(&req->rsk_refcnt, 0);
+
 	return req;
 }
-- 
2.8.0.rc3.226.g39d4020
[PATCH v2 net-next 06/11] tcp/dccp: do not touch listener sk_refcnt under synflood
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus
contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure, we can relax
TCP_LISTEN lookup rules and avoid touching sk_refcnt.

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :
On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet
---
 include/net/inet6_hashtables.h | 12 ---
 include/net/inet_hashtables.h  | 40 ++-
 net/dccp/ipv4.c                |  7 ++--
 net/dccp/ipv6.c                |  7 ++--
 net/ipv4/inet_diag.c           |  3 +-
 net/ipv4/inet_hashtables.c     | 73 +++---
 net/ipv4/tcp_ipv4.c            | 66 +++---
 net/ipv6/inet6_hashtables.c    | 56 +---
 net/ipv6/tcp_ipv6.c            | 27 +---
 net/netfilter/xt_socket.c      |  6 ++--
 10 files changed, 134 insertions(+), 163 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 28332bdac333..b87becacd9d3 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -66,13 +66,15 @@ static inline struct sock *__inet6_lookup(struct net *net,
 					  const __be16 sport,
 					  const struct in6_addr *daddr,
 					  const u16 hnum,
-					  const int dif)
+					  const int dif,
+					  bool *refcounted)
 {
 	struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
 						sport, daddr, hnum, dif);
+	*refcounted = true;
 	if (sk)
 		return sk;
-
+	*refcounted = false;
 	return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
 				     daddr, hnum, dif);
 }
@@ -81,17 +83,19 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      struct sk_buff *skb, int doff,
 					      const __be16 sport,
 					      const __be16 dport,
-					      int iif)
+					      int iif,
+					      bool *refcounted)
 {
 	struct sock *sk = skb_steal_sock(skb);
 
+	*refcounted = true;
 	if (sk)
 		return sk;
 
 	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
			      doff, &ipv6_hdr(skb)->saddr, sport,
			      &ipv6_hdr(skb)->daddr, ntohs(dport),
-			      iif);
+			      iif, refcounted);
 }
 
 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a77acee93aaf..0574493e3899 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -100,14 +100,10 @@ struct inet_bind_hashbucket {
 
 /*
  * Sockets can be hashed in established or listening table
- * We must use different 'nulls' end-of-chain value for listening
- * hash table, or we might find a socket that was closed and
- * reallocated/inserted into established hash table
 */
-#define LISTENING_NULLS_BASE (1U << 29)
 struct inet_listen_hashbucket {
 	spinlock_t		lock;
-	struct hlist_nulls_head	head;
+	struct hlist_head	head;
 };
 
 /* This is for listening sockets, thus all sockets which possess wildcards. */
@@ -304,14 +300,20 @@ static inline struct sock *__inet_lookup(struct net *net,
 					 struct sk_buff *skb, int doff,
 					 const __be32 saddr, const __be16 sport,
 					 const __be32 daddr, const __be16 dport,
-					 const int dif)
+					 const int dif,
+					 bool *refcounted)
 {
 	u16 hnum = ntohs(dport);
-	struct sock *sk = __inet_lookup_established(net, hashinfo,
-				saddr, sport, daddr, hnum, dif);
+	struct sock *sk;
 
-	return sk ? : __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
-					     sport, daddr, hnum, dif);
+	sk = __inet_lookup_established(net, hashinfo, saddr, sport,
+				       daddr, hnum, dif);
+	*refc
[PATCH v2 net-next 02/11] udp: no longer use SLAB_DESTROY_BY_RCU
Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a
grace period before freeing a socket. UDP sockets are not short lived
in the high usage case, so the added cost of call_rcu() should not be
a concern.

This actually removes a lot of complexity in UDP stack.

Multicast receives no longer need to hold a bucket spinlock.

Note that ip early demux still needs to take a reference on the
socket. Same remark for functions used by xt_socket and xt_PROXY
netfilter modules, but this might be changed later.

Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.

Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.

v2: Addressed Willem de Bruijn feedback in multicast handling
 - keep early demux break in __udp4_lib_demux_lookup()

Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Willem de Bruijn
---
 include/linux/udp.h |   8 +-
 include/net/sock.h  |  12 +--
 include/net/udp.h   |   2 +-
 net/ipv4/udp.c      | 293
 net/ipv4/udp_diag.c |  18 ++--
 net/ipv6/udp.c      | 196 ---
 6 files changed, 171 insertions(+), 358 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 87c094961bd5..32342754643a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
 	return udp_sk(sk)->no_check6_rx;
 }
 
-#define udp_portaddr_for_each_entry(__sk, node, list) \
-	hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node)
+#define udp_portaddr_for_each_entry(__sk, list) \
+	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
-#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
-	hlist_nulls_for_each_entry_rcu(__sk, node, list, __sk_common.skc_portaddr_node)
+#define udp_portaddr_for_each_entry_rcu(__sk, list) \
+	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
 #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
diff --git a/include/net/sock.h b/include/net/sock.h
index c88785a3e76c..c3a707d1cee8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -178,7 +178,7 @@ struct sock_common {
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
-		struct hlist_nulls_node skc_portaddr_node;
+		struct hlist_node	skc_portaddr_node;
 	};
 	struct proto		*skc_prot;
 	possible_net_t		skc_net;
@@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
 	hlist_for_each_entry(__sk, list, sk_bind_node)
 
 /**
- * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset
+ * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
 * @tpos:	the type * to use as a loop cursor.
 * @pos:	the &struct hlist_node to use as a loop cursor.
 * @head:	the head for your list.
 * @offset:	offset of hlist_node within the struct.
 *
 */
-#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)	\
-	for (pos = (head)->first;					\
-	     (!is_a_nulls(pos)) &&					\
+#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)		\
+	for (pos = rcu_dereference((head)->first);			\
+	     pos != NULL &&						\
	      ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});	\
-	     pos = pos->next)
+	     pos = rcu_dereference(pos->next))
 
 static inline struct user_namespace *sk_user_ns(struct sock *sk)
 {
diff --git a/include/net/udp.h b/include/net/udp.h
index 92927f729ac8..d870ec1611c4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -59,7 +59,7 @@ struct udp_skb_cb {
 *	@lock:	spinlock protecting changes to head/count
 */
 struct udp_hslot {
-	struct hlist_nulls_head	head;
+	struct hlist_head	head;
 	int			count;
 	spinlock_t		lock;
 } __attribute__((aligned(2 * sizeof(long))));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e16df0..0475aaf95040 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 			       unsigned int log)
 {
 	struct sock *sk2;
-	struct hlist_nulls_node *node;
 	kuid_t uid = sock_i_uid(sk);
 
-	sk_nulls_for_each(sk2, node, &hslot->head) {
+	sk_for_each(sk2, &hslot->head) {
 		if (net_eq(sock_net(sk2), net) &&
 		    sk2 != sk &&
 		    (bitmap || udp_sk(sk2)->udp_port_hash == num
[PATCH v2 net-next 04/11] tcp/dccp: use rcu locking in inet_diag_find_one_icsk()
RX packet processing holds rcu_read_lock(), so we can remove pairs of
rcu_read_lock()/rcu_read_unlock() in lookup functions if inet_diag
also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment refcount on the
found listener.

Signed-off-by: Eric Dumazet
---
 net/ipv4/inet_diag.c        | 7 +--
 net/ipv4/inet_hashtables.c  | 4
 net/ipv6/inet6_hashtables.c | 4
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 5fdb02f5598e..ea8df527b279 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -356,6 +356,7 @@ struct sock *inet_diag_find_one_icsk(struct net *net,
 {
 	struct sock *sk;
 
+	rcu_read_lock();
 	if (req->sdiag_family == AF_INET)
 		sk = inet_lookup(net, hashinfo, NULL, 0, req->id.idiag_dst[0],
 				 req->id.idiag_dport, req->id.idiag_src[0],
@@ -376,9 +377,11 @@ struct sock *inet_diag_find_one_icsk(struct net *net,
 				  req->id.idiag_if);
 	}
 #endif
-	else
+	else {
+		rcu_read_unlock();
 		return ERR_PTR(-EINVAL);
-
+	}
+	rcu_read_unlock();
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bc68eced0105..387338d71dcd 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -220,7 +220,6 @@ struct sock *__inet_lookup_listener(struct net *net,
 	bool select_ok = true;
 	u32 phash = 0;
 
-	rcu_read_lock();
 begin:
 	result = NULL;
 	hiscore = 0;
@@ -269,7 +268,6 @@ found:
 			goto begin;
 		}
 	}
-	rcu_read_unlock();
 	return result;
 }
 EXPORT_SYMBOL_GPL(__inet_lookup_listener);
@@ -312,7 +310,6 @@ struct sock *__inet_lookup_established(struct net *net,
 	unsigned int slot = hash & hashinfo->ehash_mask;
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
-	rcu_read_lock();
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
 		if (sk->sk_hash != hash)
@@ -339,7 +336,6 @@ begin:
 out:
 	sk = NULL;
 found:
-	rcu_read_unlock();
 	return sk;
 }
 EXPORT_SYMBOL_GPL(__inet_lookup_established);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index d253f32874c9..e6ef6ce1ed74 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -69,7 +69,6 @@ struct sock *__inet6_lookup_established(struct net *net,
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
-	rcu_read_lock();
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
 		if (sk->sk_hash != hash)
@@ -90,7 +89,6 @@ begin:
 out:
 	sk = NULL;
 found:
-	rcu_read_unlock();
 	return sk;
 }
 EXPORT_SYMBOL(__inet6_lookup_established);
@@ -138,7 +136,6 @@ struct sock *inet6_lookup_listener(struct net *net,
 	unsigned int hash = inet_lhashfn(net, hnum);
 	struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
 
-	rcu_read_lock();
 begin:
 	result = NULL;
 	hiscore = 0;
@@ -187,7 +184,6 @@ found:
 			goto begin;
 		}
 	}
-	rcu_read_unlock();
 	return result;
 }
 EXPORT_SYMBOL_GPL(inet6_lookup_listener);
-- 
2.8.0.rc3.226.g39d4020