date:20150915

Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2015-09-15 Thread Martin Sustrik


On 2015-09-16 08:27, Damian Hobson-Garcia wrote:

> From: Martin Sustrik
>
> When implementing network protocols in user space, one has to implement
> fake file descriptors to represent the sockets for the protocol.
>
> Polling on such fake file descriptors is a problem (poll/select/epoll
> accept only true file descriptors) and forces protocol implementers to use
> various workarounds resulting in complex, non-standard and convoluted APIs.
>
> More generally, ability to create full-blown file descriptors for
> userspace-to-userspace signalling is missing. While eventfd(2) goes half
> the way towards this goal it has following shortcomings:
>
> I.  There's no way to signal POLLPRI, POLLHUP etc.
> II. There's no way to signal arbitrary combination of POLL* flags. Most
>     notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
>     combination for a network protocol (rx buffer is empty and tx buffer is
>     full), cannot be signaled using eventfd.
>
> This patch implements new EFD_MASK flag which solves the above problems.
>
> Additionally, to provide a way to associate user-space state with eventfd
> object, it allows to attach user-space data to the file descriptor.


The above paragraph is a leftover from the past. The functionality no
longer exists.




> The semantics of EFD_MASK are as follows:
>
> eventfd(2):
>
> If eventfd is created with EFD_MASK flag set, it is initialised in such a
> way as to signal no events on the file descriptor when it is polled on.
> The 'initval' argument is ignored.
>
> write(2):
>
> User is allowed to write only buffers containing the following structure:
>
> struct efd_mask {
>   uint32_t events;
> };


Is it worth having a struct here? Why not just uint32_t?

Martin



> The value of 'events' should be any combination of event flags as defined
> by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
> events will be signaled when polling (select, poll, epoll) on the eventfd
> is done later on.
>
> read(2):
>
> read is not supported and will fail with EINVAL.
>
> select(2), poll(2) and similar:
>
> When polling on the eventfd marked by EFD_MASK flag, all the events
> specified in last written 'events' field shall be signaled.
>
> Signed-off-by: Martin Sustrik
>
> [dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
> Signed-off-by: Damian Hobson-Garcia
> ---
>  fs/eventfd.c | 102 ++-
>  include/linux/eventfd.h  |  16 +--
>  include/uapi/linux/eventfd.h |  39 +
>  3 files changed, 132 insertions(+), 25 deletions(-)
>  create mode 100644 include/uapi/linux/eventfd.h
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8d0c0df..1a6a066 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -2,6 +2,7 @@
>   *  fs/eventfd.c
>   *
>   *  Copyright (C) 2007  Davide Libenzi
> + *  Copyright (C) 2013  Martin Sustrik
>   *
>   */
>
> @@ -22,18 +23,31 @@
>  #include
>  #include
>
> +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
> +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
> +
>  struct eventfd_ctx {
> struct kref kref;
> wait_queue_head_t wqh;
> -   /*
> -* Every time that a write(2) is performed on an eventfd, the
> -* value of the __u64 being written is added to "count" and a
> -* wakeup is performed on "wqh". A read(2) will return the "count"
> -* value to userspace, and will reset "count" to zero. The kernel
> -* side eventfd_signal() also, adds to the "count" counter and
> -* issue a wakeup.
> -*/
> -   __u64 count;
> +   union {
> +   /*
> +* Every time that a write(2) is performed on an eventfd, the
> +* value of the __u64 being written is added to "count" and a
> +* wakeup is performed on "wqh". A read(2) will return the
> +* "count" value to userspace, and will reset "count" to zero.
> +* The kernel side eventfd_signal() also, adds to the "count"
> +* counter and issue a wakeup.
> +*/
> +   __u64 count;
> +
> +   /*
> +* When using eventfd in EFD_MASK mode this structure stores the
> +* current events to be signaled on the eventfd (events member)
> +* along with opaque user-defined data (data member).
> +*/
> +   struct efd_mask mask;
> +   };
> unsigned int flags;
>  };
>
> @@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
> return events;
>  }
>
> +static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
> +{
> +   struct eventfd_ctx *ctx = file->private_data;
> +
> +   poll_wait(file, &ctx->wqh, wait);
> +   return ctx->mask.events;
> +}
> +
>  static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
>  {
> *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> @@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct fi

request for stable inclusion

2015-09-15 Thread Or Gerlitz


Hi Dave,

Commit 9293267 "net/mlx4_core: Capping number of requested MSIXs to
MAX_MSIX" fixes a bug under which the driver doesn't really start on
a machine with > 32 cores.

The bug was introduced in 4.2-rc1 but the fix missed 4.2 -- could you
please push it to 4.2 -stable?

If you prefer that we submit it directly there, that's fine too.

thanks,

Or.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread John Fastabend

On 15-09-15 11:05 PM, Alexei Starovoitov wrote:
> Existing bpf_clone_redirect() helper clones skb before redirecting
> it to RX or TX of destination netdev.
> Introduce bpf_redirect() helper that does that without cloning.
> 
> Benchmarked with two hosts using 10G ixgbe NICs.
> One host is doing line rate pktgen.
> Another host is configured as:
> $ tc qdisc add dev $dev ingress
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
> so it receives the packet on $dev and immediately xmits it on $dev + 1
> The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
> that does bpf_clone_redirect() and performance is 2.0 Mpps
> 
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
> which is using bpf_redirect() - 2.4 Mpps
> 
> and using cls_bpf with integrated actions as:
> $ tc filter add dev $dev root pref 10 \
>   bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
> performance is 2.5 Mpps
> 
> To summarize:
> u32+act_bpf using clone_redirect - 2.0 Mpps
> u32+act_bpf using redirect - 2.4 Mpps
> cls_bpf using redirect - 2.5 Mpps
> 
> For comparison linux bridge in this setup is doing 2.1 Mpps
> and ixgbe rx + drop in ip_rcv - 7.8 Mpps
> 
> Signed-off-by: Alexei Starovoitov 
> Acked-by: Daniel Borkmann 
> ---
> This approach is using per_cpu scratch area to store ifindex and flags.
> The other alternatives discussed at plumbers are slower and more intrusive.
> v1->v2: dropped redundant iff_up check
> 
>  include/net/sch_generic.h|1 +
>  include/uapi/linux/bpf.h |8 
>  include/uapi/linux/pkt_cls.h |1 +
>  net/core/dev.c   |8 
>  net/core/filter.c|   44 ++
>  net/sched/act_bpf.c  |1 +
>  net/sched/cls_bpf.c  |1 +
>  samples/bpf/bpf_helpers.h|4 
>  samples/bpf/tcbpf1_kern.c|   24 ++-
>  9 files changed, 91 insertions(+), 1 deletion(-)
> 

Acked-by: John Fastabend 



[PATCH] ip: find correct route for socket which is not bound to a device

2015-09-15 Thread Wengang Wang

For multicast, we should also find a valid route (and thus get a meaningful
PMTU) for packets on a socket that is not bound to a device
(sk_bound_dev_if being 0).

From the socket(7) man page:

   SO_BINDTODEVICE
          Bind this socket to a particular device like “eth0”, as
          specified in the passed interface name.  If the name is an
          empty string or the option length is zero, the socket
          device binding is removed.  The passed option is a
          variable-length null-terminated interface name string with
          the maximum size of IFNAMSIZ.  If a socket is bound to an
          interface, only packets received from that particular
          interface are processed by the socket.  Note that this works
          only for some socket types, particularly AF_INET sockets.
          It is not supported for packet sockets (use normal bind(2)
          there).

The man page does not say that packets on an unbound socket will not be
routed.

The problem I hit is that all multicast packets were dropped by the kernel
(on the sending host). The lower layer is IPoIB with an MTU of 7000, and I
was sending multicast packets of length 4096. Inside IPoIB the first send is
dropped because it exceeds the internal packet size limit, mcast_mtu, which
is 2044. So IPoIB calls ip_rt_update_pmtu (indirectly) to set the path MTU.
A correct route is configured for the multicast group, so setting the PMTU
succeeded, and the next multicast packet (to the same target) is expected
to succeed (it would be fragmented according to the PMTU just set).
But the second and later multicast packets were dropped too. The reason is
that the lookup (fib_lookup) is skipped because the socket is not bound to
a device (sk_bound_dev_if being 0). With the patch proposed here applied,
it works fine.

Signed-off-by: Wengang Wang 
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a556..032481a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2097,7 +2097,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 */
 
fl4->flowi4_oif = dev_out->ifindex;
-   goto make_route;
+   goto lookup;
}
 
if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
@@ -2153,6 +2153,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
goto make_route;
}
 
+lookup:
if (fib_lookup(net, fl4, &res, 0)) {
res.fi = NULL;
res.table = NULL;
-- 
2.1.0


[PATCH v2 0/1] Generalize poll events from eventfd

2015-09-15 Thread Damian Hobson-Garcia

Using eventfd user space can generate POLLIN/POLLOUT events but some
applications may want to generate POLLPRI/POLLERR events as well.
This patch submission aims to generalize the events generated by an
eventfd. This is a resubmission of a patch from Feb 2013[1]. The original
discussion trailed off without any conclusion, but the original author
has recently confirmed[2] that this functionality is still useful, so I
volunteered to rebase and resubmit the patch for discussion.

[1] https://lkml.org/lkml/2013/2/18/147
[2] https://lkml.org/lkml/2015/7/9/153
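
As a baseline for the discussion, here is a minimal user-space sketch of
what a plain eventfd (no EFD_MASK) reports through poll(2) today. It uses
only the existing eventfd(2) and poll(2) calls; the helper names
current_events() and efd_add() are made up for this illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the events poll(2) currently reports for fd, without blocking.
 * Returns -1 if poll() itself fails. */
static int current_events(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT | POLLPRI };

	assert(fd >= 0);
	if (poll(&pfd, 1, 0) < 0)
		return -1;
	return pfd.revents;
}

/* Add 'value' to the eventfd counter; returns 0 on success, -1 on error. */
static int efd_add(int fd, uint64_t value)
{
	ssize_t n = write(fd, &value, sizeof(value));

	return n == (ssize_t)sizeof(value) ? 0 : -1;
}
```

With a counter of 0 the descriptor reports POLLOUT only, after a write it
reports POLLIN | POLLOUT, and once the counter reaches its maximum it
reports POLLIN only. User space cannot reach a state where neither POLLIN
nor POLLOUT is set, and cannot raise POLLPRI or POLLHUP at all.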

Changes in v2
-------------

* rebased on Linux v4.3-rc1
* Move file operation implementations for EFD_MASK to a separate structure
* Remove 'data' element from efd_mask structure
* read() is no longer supported when EFD_MASK is set (fails with EINVAL)
* eventfd_ctx_fileget() now returns EINVAL when EFD_MASK is set, eliminating
  the possibility of triggering the original BUG_ON() macros which have now
  been removed.

Thank you,
Damian

Martin Sustrik (1):
  eventfd: implementation of EFD_MASK flag

 fs/eventfd.c | 91 ++--
 include/linux/eventfd.h  | 16 +---
 include/uapi/linux/eventfd.h | 40 +++
 3 files changed, 121 insertions(+), 26 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

-- 
1.9.1


[PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2015-09-15 Thread Damian Hobson-Garcia

From: Martin Sustrik 

When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll
accept only true file descriptors) and forces protocol implementers to use
various workarounds resulting in complex, non-standard and convoluted APIs.

More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half
the way towards this goal it has following shortcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most
notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
combination for a network protocol (rx buffer is empty and tx buffer is
full), cannot be signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a
way as to signal no events on the file descriptor when it is polled on.
The 'initval' argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  uint32_t events;
};

The value of 'events' should be any combination of event flags as defined
by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will be signaled when polling (select, poll, epoll) on the eventfd
is done later on.

read(2):

read is not supported and will fail with EINVAL.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events
specified in last written 'events' field shall be signaled.
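
The semantics above can be sketched as a small user-space model. This is
not the kernel code: struct mask_model and the model_*() functions are
hypothetical stand-ins, and both the short-write check and the rejection of
bits outside EFD_MASK_VALID_EVENTS are assumptions about the write path.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Layout taken from the commit message. */
struct efd_mask {
	uint32_t events;
};

/* Mirrors EFD_MASK_VALID_EVENTS from the patch. */
#define MODEL_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)

/* Hypothetical stand-in for the per-eventfd state in EFD_MASK mode. */
struct mask_model {
	uint32_t events;	/* last mask written; 0 right after creation */
};

/* eventfd(2) with EFD_MASK: no events are signaled, initval is ignored. */
static void model_init(struct mask_model *m)
{
	assert(m != NULL);
	m->events = 0;
}

/* write(2): accepts a struct efd_mask; returns bytes consumed or -errno. */
static ssize_t model_write(struct mask_model *m, const void *buf, size_t len)
{
	struct efd_mask mask;

	if (len < sizeof(mask))
		return -EINVAL;		/* assumption: short writes fail */
	memcpy(&mask, buf, sizeof(mask));
	if (mask.events & ~(uint32_t)MODEL_VALID_EVENTS)
		return -EINVAL;		/* assumption: unknown bits fail */
	m->events = mask.events;
	return (ssize_t)sizeof(mask);
}

/* read(2): not supported in EFD_MASK mode. */
static ssize_t model_read(struct mask_model *m, void *buf, size_t len)
{
	(void)m;
	(void)buf;
	(void)len;
	return -EINVAL;
}

/* poll(2) and friends: report exactly the last written mask. */
static uint32_t model_poll(const struct mask_model *m)
{
	return m->events;
}
```

Writing a mask of 0 is what expresses the "rx buffer empty and tx buffer
full" state from item II: the descriptor then signals neither POLLIN nor
POLLOUT until the next write.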

Signed-off-by: Martin Sustrik 

[dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
Signed-off-by: Damian Hobson-Garcia 
---
 fs/eventfd.c | 102 ++-
 include/linux/eventfd.h  |  16 +--
 include/uapi/linux/eventfd.h |  39 +
 3 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..1a6a066 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */
 
@@ -22,18 +23,31 @@
 #include 
 #include 
 
+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+
+   /*
+* When using eventfd in EFD_MASK mode this structure stores the
+* current events to be signaled on the eventfd (events member)
+* along with opaque user-defined data (data member).
+*/
+   struct efd_mask mask;
+   };
unsigned int flags;
 };
 
@@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
return events;
 }
 
+static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
+{
+   struct eventfd_ctx *ctx = file->private_data;
+
+   poll_wait(file, &ctx->wqh, wait);
+   return ctx->mask.events;
+}
+
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
@@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
 }
 
+static ssize_t eventfd_mask_read(struct file *file, char __user *buf,
+   size_t co

Re: kernel 4.2 : "bridge vlan" command return empty result (works with kernel 4.1.3)

2015-09-15 Thread Alexandre DERUMIER

>>Do you have a bond in your system ?. 

Yes, Indeed.
Removing the bond fixes the problem.

I'll try your patch today.


Thanks !

Alexandre

----- Original Message -----
From: "roopa"
To: "aderumier"
Cc: "netdev", "Scott Feldman"
Sent: Tuesday, 15 September 2015 21:02:34
Subject: Re: kernel 4.2 : "bridge vlan" command return empty result (works with kernel 4.1.3)

On 9/15/15, 10:39 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> since kernel 4.2, "bridge vlan" command return empty result. 
> 
> 
> kernel 4.1.3 
>  
> # bridge vlan 
> port vlan ids 
> eth0 1 PVID Egress Untagged 
> 90 
> 91 
> 92 
> 93 
> 94 
> 95 
> 96 
> 97 
> 98 
> 99 
> 100 
> 
> vmbr0 1 PVID Egress Untagged 
> 94 
> 
> 
> 
> kernel 4.2 
>  
> # bridge vlan 
> port vlan ids 
> 
> 
> 
> Note that vlans are correctly working,it seem that is just the display. 
> 
> tcpdump -e -i vmbr0 
> 
> 19:38:08.005055 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui 
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4, 
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 339613, win 5523, 
> length 0 
> 19:38:08.007730 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui 
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4, 
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 342145, win 5568, 
> length 0 
> 19:38:08.010977 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui 
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4, 
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 344677, win 5614, 
> length 0 
> 19:3 
I was able to reproduce this when there is a bond in the system. 

Looks like this was due to 85fdb956726ff2a ("switchdev: cut over to new 
switchdev_port_bridge_getlink"). 
When CONFIG_SWITCHDEV is off, nodes that use switchdev api for 
ndo_bridge_getlink (example, bonds, teams, rocker) can return 
-EOPNOTSUPP. The problem went away on my box with the following patch. I 
will submit an official patch in a bit. 
Do you have a bond in your system ?. 

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c 
index 01ced4a..bdb3842 100644 
--- a/net/core/rtnetlink.c 
+++ b/net/core/rtnetlink.c 
@@ -3013,6 +3013,7 @@ static int rtnl_bridge_getlink(struct sk_buff 
*skb, struct 
u32 portid = NETLINK_CB(cb->skb).portid; 
u32 seq = cb->nlh->nlmsg_seq; 
u32 filter_mask = 0; 
+ int err; 

if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) { 
struct nlattr *extfilt; 
@@ -3033,20 +3034,25 @@ static int rtnl_bridge_getlink(struct sk_buff 
*skb, stru 
struct net_device *br_dev = 
netdev_master_upper_dev_get(dev); 

if (br_dev && br_dev->netdev_ops->ndo_bridge_getlink) { 
- if (idx >= cb->args[0] && 
- br_dev->netdev_ops->ndo_bridge_getlink( 
- skb, portid, seq, dev, filter_mask, 
- NLM_F_MULTI) < 0) 
- break; 
+ if (idx >= cb->args[0]) { 
+ err = 
br_dev->netdev_ops->ndo_bridge_getlink( 
+ skb, portid, seq, dev, 
+ filter_mask, NLM_F_MULTI); 
+ if ( err < 0 && err != -EOPNOTSUPP) 
+ break; 
+ } 
idx++; 
} 

if (ops->ndo_bridge_getlink) { 
- if (idx >= cb->args[0] && 
- ops->ndo_bridge_getlink(skb, portid, seq, dev, 
- filter_mask, 
- NLM_F_MULTI) < 0) 
- break; 
+ if (idx >= cb->args[0]) { 
+ err = ops->ndo_bridge_getlink(skb, portid, 
+ seq, dev, 
+ filter_mask, 
+ NLM_F_MULTI); 
+ if ( err < 0 && err != -EOPNOTSUPP) 
+ break; 
+ } 
idx++; 
} 
} 
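
A minimal user-space sketch of why the dump came back empty and what the
change above does about it. The callback type, the three fake devices and
dump_devices() are hypothetical; only the rule "stop on a negative return,
except -EOPNOTSUPP" is taken from the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device's ndo_bridge_getlink callback:
 * returns 0 on success or a negative errno. */
typedef int (*getlink_fn)(void);

static int getlink_ok(void)          { return 0; }
static int getlink_unsupported(void) { return -EOPNOTSUPP; }
static int getlink_full(void)        { return -EMSGSIZE; }

/* Walk the devices and count how many were dumped.
 * skip_unsupported == 0 follows the old logic (any negative return ends
 * the walk); skip_unsupported != 0 follows the patched logic. */
static size_t dump_devices(const getlink_fn *devs, size_t n,
			   int skip_unsupported)
{
	size_t dumped = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int err = devs[i]();

		if (err == -EOPNOTSUPP && skip_unsupported)
			continue;	/* nothing to report; keep walking */
		if (err < 0)
			break;		/* real failure, e.g. buffer full */
		dumped++;
	}
	return dumped;
}
```

If a device that returns -EOPNOTSUPP (a bond, in the report above) is
visited first, the old logic stops before any bridge port is dumped, which
matches the empty "bridge vlan" output. Other errors still end the walk.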

[PATCH v2 net-next 0/2] bpf: performance improvements

2015-09-15 Thread Alexei Starovoitov

v1->v2: dropped redundant iff_up check in patch 2

At plumbers we discussed different options on how to get rid of skb_clone
from bpf_clone_redirect(), the patch 2 implements the best option.
Patch 1 adds 'integrated exts' to cls_bpf to improve performance by
combining simple actions into bpf classifier.

Alexei Starovoitov (1):
  bpf: add bpf_redirect() helper

Daniel Borkmann (1):
  cls_bpf: introduce integrated actions

 include/net/sch_generic.h|3 ++-
 include/uapi/linux/bpf.h |9 +++
 include/uapi/linux/pkt_cls.h |4 +++
 net/core/dev.c   |8 ++
 net/core/filter.c|   58 +++
 net/sched/act_bpf.c  |1 +
 net/sched/cls_bpf.c  |   61 ++
 samples/bpf/bpf_helpers.h|4 +++
 samples/bpf/tcbpf1_kern.c|   24 -
 9 files changed, 159 insertions(+), 13 deletions(-)

-- 
1.7.9.5


[PATCH v2 net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread Alexei Starovoitov

Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps
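
To put the rates in per-packet terms, a small sketch (the helper names are
made up here; the input figures are the ones quoted above):

```c
#include <assert.h>

/* Average time spent per packet, in nanoseconds, at a rate given in Mpps. */
static double ns_per_packet(double mpps)
{
	assert(mpps > 0.0);
	return 1000.0 / mpps;
}

/* Relative throughput gain of 'after' over 'before', in percent. */
static double gain_percent(double before_mpps, double after_mpps)
{
	assert(before_mpps > 0.0);
	return (after_mpps - before_mpps) / before_mpps * 100.0;
}
```

At 2.0 Mpps each packet costs 500 ns; at 2.4 Mpps about 417 ns; at
2.5 Mpps 400 ns. So dropping the clone saves roughly 83 ns per packet
(a 20% gain), and cls_bpf with integrated actions a further 17 ns or so
(25% over the clone_redirect case).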

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
This approach is using per_cpu scratch area to store ifindex and flags.
The other alternatives discussed at plumbers are slower and more intrusive.
v1->v2: dropped redundant iff_up check

 include/net/sch_generic.h|1 +
 include/uapi/linux/bpf.h |8 
 include/uapi/linux/pkt_cls.h |1 +
 net/core/dev.c   |8 
 net/core/filter.c|   44 ++
 net/sched/act_bpf.c  |1 +
 net/sched/cls_bpf.c  |1 +
 samples/bpf/bpf_helpers.h|4 
 samples/bpf/tcbpf1_kern.c|   24 ++-
 9 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index da61febb9091..4c79ce8c1f92 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -402,6 +402,7 @@ void __qdisc_calculate_pkt_len(struct sk_buff *skb,
   const struct qdisc_size_table *stab);
 bool tcf_destroy(struct tcf_proto *tp, bool force);
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
+int skb_do_redirect(struct sk_buff *);
 
 /* Reset all TX qdiscs greater then index of a device.  */
 static inline void qdisc_reset_all_tx_gt(struct net_device *dev, unsigned int i)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2fbd1c71fa3b..4ec0b5488294 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -272,6 +272,14 @@ enum bpf_func_id {
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read,   /* u64 bpf_perf_event_read(&map, index) */
+   /**
+* bpf_redirect(ifindex, flags) - redirect to another netdev
+* @ifindex: ifindex of the net device
+* @flags: bit 0 - if set, redirect to ingress instead of egress
+* other bits - reserved
+* Return: TC_ACT_REDIRECT
+*/
+   BPF_FUNC_redirect,
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 0a262a83f9d4..439873775d49 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -87,6 +87,7 @@ enum {
 #define TC_ACT_STOLEN  4
 #define TC_ACT_QUEUED  5
 #define TC_ACT_REPEAT  6
+#define TC_ACT_REDIRECT7
 #define TC_ACT_JUMP0x1000
 
 /* Action type identifiers*/
diff --git a/net/core/dev.c b/net/core/dev.c
index 877c84834d81..d6a492e57874 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3668,6 +3668,14 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
case TC_ACT_QUEUED:
kfree_skb(skb);
return NULL;
+   case TC_ACT_REDIRECT:
+   /* skb_mac_header check was done by cls/act_bpf, so
+* we can safely push the L2 header back before
+* redirecting to another netdev
+*/
+   __skb_push(skb, skb->mac_len);
+   skb_do_redirect(skb);
+   return NULL;
default:
break;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 971d6ba89758..da3f3d94d6e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1427,6 +1427,48 @@ const struct bpf_func_proto bpf_clone_redirect_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
+struct redirect_info {
+   u32 ifindex;
+   u32 flags;
+};
+
+static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+static u64 bpf_redirect(u64 ifindex, u64 flags, u64 r3, u64 r4, u64 r5)
+{
+   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+   ri-
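
A minimal user-space model of the two-step scheme described in the notes:
the helper only records the target in a scratch slot and returns
TC_ACT_REDIRECT, and the caller acts on the slot afterwards. _Thread_local
stands in for the kernel's per-CPU area, and the model_*() names and the
reset of the slot after use are assumptions, since the rest of the hunk is
not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TC_ACT_REDIRECT 7	/* value from the pkt_cls.h hunk */

struct redirect_info {
	uint32_t ifindex;
	uint32_t flags;
};

/* Stand-in for DEFINE_PER_CPU: one scratch slot per thread, not per CPU. */
static _Thread_local struct redirect_info redirect_info;

/* Model of the helper: record the target and hand back an action code. */
static int model_bpf_redirect(uint32_t ifindex, uint32_t flags)
{
	redirect_info.ifindex = ifindex;
	redirect_info.flags = flags;
	return TC_ACT_REDIRECT;
}

/* Model of the consumer: read the scratch slot, then clear it.
 * Returns the target ifindex; *to_ingress is set from bit 0 of flags. */
static uint32_t model_do_redirect(int *to_ingress)
{
	uint32_t ifindex = redirect_info.ifindex;

	assert(to_ingress != NULL);
	*to_ingress = (int)(redirect_info.flags & 1);
	redirect_info.ifindex = 0;	/* assumption: slot is reset after use */
	return ifindex;
}
```

Nothing is cloned or queued in the helper itself; the packet is only
forwarded once the caller sees TC_ACT_REDIRECT. A single slot is presumably
enough because the program and the code handling its return value run back
to back on the same CPU.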

[PATCH v2 net-next 1/2] cls_bpf: introduce integrated actions

2015-09-15 Thread Alexei Starovoitov

From: Daniel Borkmann 

Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
  /* classify arp, ip, ipv6 into different traffic classes
   * and drop all other packets
   */
  switch (skb->protocol) {
  case htons(ETH_P_ARP):
skb->tc_classid = 1;
break;
  case htons(ETH_P_IP):
skb->tc_classid = 2;
break;
  case htons(ETH_P_IPV6):
skb->tc_classid = 3;
break;
  default:
return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
v1->v2: no changes

 include/net/sch_generic.h|2 +-
 include/uapi/linux/bpf.h |1 +
 include/uapi/linux/pkt_cls.h |3 +++
 net/core/filter.c|   14 ++
 net/sched/cls_bpf.c  |   60 ++
 5 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 444faa89a55f..da61febb9091 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -251,7 +251,7 @@ struct tcf_proto {
 struct qdisc_skb_cb {
unsigned intpkt_len;
u16 slave_dev_queue_mapping;
-   u16 _pad;
+   u16 tc_classid;
 #define QDISC_CB_PRIV_LEN 20
unsigned char   data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2d5461..2fbd1c71fa3b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -293,6 +293,7 @@ struct __sk_buff {
__u32 tc_index;
__u32 cb[5];
__u32 hash;
+   __u32 tc_classid;
 };
 
 struct bpf_tunnel_key {
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 4f0d1bc3647d..0a262a83f9d4 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -373,6 +373,8 @@ enum {
 
 /* BPF classifier */
 
+#define TCA_BPF_FLAG_ACT_DIRECT(1 << 0)
+
 enum {
TCA_BPF_UNSPEC,
TCA_BPF_ACT,
@@ -382,6 +384,7 @@ enum {
TCA_BPF_OPS,
TCA_BPF_FD,
TCA_BPF_NAME,
+   TCA_BPF_FLAGS,
__TCA_BPF_MAX,
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f03902e..971d6ba89758 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1632,6 +1632,9 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 static bool sk_filter_is_valid_access(int off, int size,
  enum bpf_access_type type)
 {
+   if (off == offsetof(struct __sk_buff, tc_classid))
+   return false;
+
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct __sk_buff, cb[0]) ...
@@ -1648,6 +1651,9 @@ static bool sk_filter_is_valid_access(int off, int size,
 static bool tc_cls_act_is_valid_access(int off, int size,
   enum bpf_access_type type)
 {
+   if (off == offsetof(struct __sk_buff, tc_classid))
+   return type == BPF_WRITE ? true : false;
+
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct __sk_buff, mark):
@@ -1760,6 +1766,14 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
break;
 
+   case offsetof(struct __sk_buff, tc_classid):
+   ctx_off -= offsetof(struct __sk_buff, tc_classid);
+   ctx_off += offsetof(struct sk_buff, cb);
+   ctx_off += offsetof(struct qdisc_skb_cb, tc_classid);
+   WARN_ON(type != BPF_WRITE);
+   *insn++ = BPF_STX_MEM(BPF_H, dst_reg, src_reg, ctx_off);
+   break;
+
case offsetof(struct __sk_buff, tc_index):
 #ifdef CONFIG_NET_SCHED
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tc_index) != 2);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index e5168f8b9640..77b0ef148256 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -38,6 +38,7 @@ struct cls_bpf_prog {
struct bpf_prog *filter;
struct list_head link;
struct tcf_result res;
+   bool exts_integrated;
struct tcf_exts exts;
u32 handle;
union {
@@ -52,6 +53,7 @@ struct cls_bpf_prog {
 
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
[TCA_BPF_CLASSID]   = { .type = NLA_U32 },
+   [TCA_BPF_FLAGS] = { .type = NLA_U32 },
[TCA_BPF_FD]= { .type = NLA_U32 },
[TCA_BPF_NAME]  = { .type = NLA_NUL_STRING, .len = CLS_BPF_NAME_LEN },

FW: [RFC net-next 01/10] qed: Add module with basic common support

2015-09-15 Thread Yuval Mintz

> From: Yuval Mintz 
> Date: Thu, 10 Sep 2015 16:54:12 +0300

> >  Documentation/networking/LICENSE.qlogic|  288 ++

> I do not want to get into the habit of having to add copy after copy
> of the GPL v2 to the source tree, so this is rather inappropriate.

> Everything said in that file is explicitly covered by the top-level
> COPYING file.

You're right; we'll remove this and add some comment reference
to COPYING instead in v2.

On a related but different topic, I've noticed a lack of commentary
for this series; I don't know if it's due to lack of time, interest, or
something else.
[I know we write perfect code [ ;-) ], but I would have guessed that
adding tens of thousands of lines of code to the kernel would generate
at least some rejects].
Should we wait any longer before sending the next revision?

Re: [PATCHv2 net] cxgb4vf: support for single-threading access to adapter mailbox registers

2015-09-15 Thread Hariprasad S

On Mon, Sep 14, 2015 at 19:08:34 +0530, Hariprasad Shenai wrote:
> The issue is that for the Virtual Function Driver, the only way to get the
> Virtual Interface statistics is to issue mailbox commands to ask the
> firmware for the VI Stats. And, because the VI Stats command can only
> retrieve a smallish number of stats per mailbox command, we have to issue
> three mailbox commands in quick succession. What we ran into was irqbalance
> coming in every 10 seconds and interrogating every network interface in the
> system.
> 
> Signed-off-by: Hariprasad Shenai 
> ---
> V2: Updated description and using linux completion API's instead of
> for loop based on review comments by David Miller
> 
>  drivers/net/ethernet/chelsio/cxgb4vf/adapter.h     |  9 +
>  .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c    |  4 ++
>  drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c     | 46 +-
>  3 files changed, 58 insertions(+), 1 deletion(-)
> 


Hi David,

There is an issue with this patch; can you please drop it?
I will send a V3 with the fix.

The check below should be a while loop instead of an if condition:
	/* If we're at the head, break out and start the mailbox
	 * protocol.
	 */
	if (list_first_entry(&adapter->mlist.list,
			     struct mbox_list, list) != &entry) {
		int ret;

Thanks

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Tom Herbert

On Tue, Sep 15, 2015 at 5:03 PM, Eric Dumazet  wrote:
> On Tue, 2015-09-15 at 16:45 -0700, Tom Herbert wrote:
>> > +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
>> > +   skb->l4_hash)
>> > +   return skb->hash;
>> > +
>> > if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
>> > !bond_flow_dissect(bond, skb, &flow))
>> > return bond_eth_hash(skb);
>> >
>> >
>> Ugh, bond_flow_dissect is yet another instance of customized flow
>> dissection! We should really clean this up. I suggest that in cases
>> where we want L4 hash a call to skb_get_hash should suffice. We can
>> create skb_get_l3hash when caller explicitly wants an L3 hash-- this
>> would return skb->hash if it's valid and skb->l4_hash is not set, else
>> call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
>> normal hash over flow keys (don't save result in skb->hash in this
>> case).
>
> This code predates all the change you did recently ;)
>
> BTW, the simple xor weakness is showing up after
> our change favoring even ports at connect() time, for a bonding device
> with 2 or 4 slaves.
>
Right, xor as a packet hash should be eliminated. It seems possible
that all these modes can be implemented using flow_dissector and the
jhash. If I'm reading the meaning of modes correctly:

BOND_XMIT_POLICY_ENCAP34 is equivalent to skb_get_hash

BOND_XMIT_POLICY_LAYER23 is flow dissection with
FLOW_DISSECTOR_F_STOP_AT_L3 and then normal hash

BOND_XMIT_POLICY_LAYER34 is flow dissection with FLOW_DISSECTOR_F_STOP_AT_ENCAP

BOND_XMIT_POLICY_LAYER2 would be flow dissection with
FLOW_DISSECTOR_F_STOP_AT_L2 (new flag) and then normal hash

BOND_XMIT_POLICY_ENCAP23 is a little more interesting. This could be
accomplished with custom flow dissector targets that exclude L4
information (ports, flow label, key ID).

Also noticed a little bug in flow_dissector, we should get out on
FLOW_DISSECTOR_F_STOP_AT_L3 before IPv6 flow label is processed
(that's considered L4). I'll fix that.
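
To make the xor weakness discussed above concrete, here is a minimal
user-space sketch. It is a toy and not the bonding driver's code: the
names toy_xor_hash(), toy_pick_slave() and toy_slaves_used() are
hypothetical, and the real hash also mixes in addresses. It only shows
the principle: if connect() hands out even source ports and the
destination port is fixed, the low bit of (sport ^ dport) never changes,
so "hash % slave_cnt" can reach at most half of 2 or 4 slaves.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a simple xor packet hash (assumption, not kernel code). */
static uint32_t toy_xor_hash(uint16_t sport, uint16_t dport)
{
	return (uint32_t)(sport ^ dport);
}

/* Pick a slave index the way a "hash modulo slave count" policy would. */
static unsigned int toy_pick_slave(uint16_t sport, uint16_t dport,
				   unsigned int slave_cnt)
{
	assert(slave_cnt > 0);
	return toy_xor_hash(sport, dport) % slave_cnt;
}

/*
 * Count how many distinct slaves are hit when only the even source ports
 * in [first, last] connect to one fixed destination port.
 */
static unsigned int toy_slaves_used(uint16_t first, uint16_t last,
				    uint16_t dport, unsigned int slave_cnt)
{
	unsigned int used = 0, seen_mask = 0;
	uint32_t p;

	assert(slave_cnt > 0 && slave_cnt <= 32);
	for (p = first; p <= last; p++) {
		if (p & 1)
			continue;	/* model connect() favoring even ports */
		seen_mask |= 1u << toy_pick_slave((uint16_t)p, dport, slave_cnt);
	}
	/* Count the set bits, one per slave that received traffic. */
	for (; seen_mask; seen_mask &= seen_mask - 1)
		used++;
	return used;
}
```

With a fixed destination port of 80, toy_slaves_used(32768, 60998, 80, 2)
reports a single slave in use, and the 4-slave case reports two. A hash
with better bit mixing (such as jhash over the dissected flow keys, as
suggested above) would not have this property.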

Tom

> (commit 07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5
> "tcp/dccp: try to not exhaust ip_local_port_range in connect()")
>
>
>
>

Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread Alexei Starovoitov


On 9/15/15 9:50 PM, John Fastabend wrote:

> Looks like you can remove the check. I would prefer to let the stack
> handle this case using normal mechanisms.
>
> I had to do a bit of tracking but netif_running check equates roughly
> to your IFF_UP case via,
>
> ...
>
> Seem reasonable? Or did you put it there to work around some specific
> case I'm missing?


well, in the forwarding path is_skb_forwardable() does the IFF_UP check
before netif_running() has to do it, so yeah this check can be dropped.

Will fix in v2.
Thanks for the review!


Re: linux-next: build failure after merge of the tip tree

2015-09-15 Thread David Miller

From: Stephen Rothwell 
Date: Wed, 16 Sep 2015 11:30:53 +1000

> I have added the following fix patch for today:
> 
> From: Stephen Rothwell 
> Date: Wed, 16 Sep 2015 11:10:16 +1000
> Subject: [PATCH] cdc: add header guards
> 
> Signed-off-by: Stephen Rothwell 

Applied, thanks Stephen.

Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread John Fastabend

On 15-09-15 09:11 PM, Alexei Starovoitov wrote:
> On 9/15/15 8:10 PM, John Fastabend wrote:
>> Nice, I like this. But just to be sure I read this correctly this will
>> only work on the ingress qdisc for now right? To get the tx side working
>> will require a bit more care.
> 
> correct.
> For egress I'm waiting for Daniel to resubmit his preclassifier patch
> and I'll hook this skb_do_redirect() there as well.
> Other options are also possible, but preclassifier looks the best for
> this purpose, since it's lockless.
> 

Great, works for me. One other question/observation,

+int skb_do_redirect(struct sk_buff *skb)
+{

[...]

+
+	if (unlikely(!(dev->flags & IFF_UP))) {
+		kfree_skb(skb);
+		return -EINVAL;
+	}

As best I can tell the IFF_UP check is not needed: dev_queue_xmit()
will check whether the qdisc is active, and the dev_forward_skb() path
does a !netif_running() check in its enqueue_to_backlog() call.

Looks like you can remove the check. I would prefer to let the stack
handle this case using normal mechanisms.

I had to do a bit of tracking but netif_running check equates roughly
to your IFF_UP case via,

>   __dev_change_flags()
>   [...]
>   if ((old_flags ^ flags) & IFF_UP)
>   ret = ((old_flags & IFF_UP) ? __dev_close : __dev_open)(dev);
>   
> 
>   __dev_close()
>   [...]
>   __dev_close_many()
> 
>   __dev_close_many()
>   [...]
>   clear_bit(__LINK_STATE_START, &dev->state);
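
The call chain above can be modelled in user space as follows. This is
only a sketch under stated assumptions: every model_* name is
hypothetical, the bit values are chosen for illustration, and the real
functions do much more (notifiers, driver callbacks, error handling).
It shows just the one relationship being argued: toggling the UP flag
goes through open/close, which sets or clears the START state bit that
the running check tests.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_IFF_UP		0x1u	/* stands in for IFF_UP */
#define MODEL_LINK_STATE_START	0	/* bit index, stands in for __LINK_STATE_START */

struct model_dev {
	unsigned int flags;	/* administrative flags */
	unsigned long state;	/* link state bits */
};

/* Model of the open path: mark the device as started. */
static void model_dev_open(struct model_dev *dev)
{
	dev->state |= 1UL << MODEL_LINK_STATE_START;
}

/* Model of the close path: clear the started bit. */
static void model_dev_close(struct model_dev *dev)
{
	dev->state &= ~(1UL << MODEL_LINK_STATE_START);
}

/* Model of the running check: true while the started bit is set. */
static bool model_netif_running(const struct model_dev *dev)
{
	return (dev->state & (1UL << MODEL_LINK_STATE_START)) != 0;
}

/* Model of the flag change: only a change of the UP bit opens or closes. */
static void model_dev_change_flags(struct model_dev *dev, unsigned int flags)
{
	unsigned int old_flags = dev->flags;

	assert(dev);
	dev->flags = flags;
	if ((old_flags ^ flags) & MODEL_IFF_UP) {
		if (old_flags & MODEL_IFF_UP)
			model_dev_close(dev);
		else
			model_dev_open(dev);
	}
}
```

In this model the UP flag and the running state always move together,
which is the reason a separate IFF_UP test ahead of a netif_running()
test looks redundant.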

Seem reasonable? Or did you put it there to work around some specific
case I'm missing?

.John

Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread Alexei Starovoitov


On 9/15/15 8:10 PM, John Fastabend wrote:

> Nice, I like this. But just to be sure I read this correctly this will
> only work on the ingress qdisc for now right? To get the tx side working
> will require a bit more care.


correct.
For egress I'm waiting for Daniel to resubmit his preclassifier patch
and I'll hook this skb_do_redirect() there as well.
Other options are also possible, but preclassifier looks the best for
this purpose, since it's lockless.


Re: [PATCH net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread John Fastabend

On 15-09-15 06:51 PM, Alexei Starovoitov wrote:
> Existing bpf_clone_redirect() helper clones skb before redirecting
> it to RX or TX of destination netdev.
> Introduce bpf_redirect() helper that does that without cloning.
> 
> Benchmarked with two hosts using 10G ixgbe NICs.
> One host is doing line rate pktgen.
> Another host is configured as:
> $ tc qdisc add dev $dev ingress
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
> so it receives the packet on $dev and immediately xmits it on $dev + 1
> The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
> that does bpf_clone_redirect() and performance is 2.0 Mpps
> 
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
> which is using bpf_redirect() - 2.4 Mpps
> 
> and using cls_bpf with integrated actions as:
> $ tc filter add dev $dev root pref 10 \
>   bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
> performance is 2.5 Mpps
> 
> To summarize:
> u32+act_bpf using clone_redirect - 2.0 Mpps
> u32+act_bpf using redirect - 2.4 Mpps
> cls_bpf using redirect - 2.5 Mpps
> 
> For comparison linux bridge in this setup is doing 2.1 Mpps
> and ixgbe rx + drop in ip_rcv - 7.8 Mpps
> 
> Signed-off-by: Alexei Starovoitov 
> Acked-by: Daniel Borkmann 
> ---


Nice, I like this. But just to be sure I read this correctly this will
only work on the ingress qdisc for now right? To get the tx side working
will require a bit more care.

Thanks,
.John


[net-next:master 10/14] ERROR: "cdc_parse_cdc_header" [drivers/net/usb/cdc-phonet.ko] undefined!

2015-09-15 Thread kbuild test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   d5566fd72ec1924958fcfd48b65c022c8f7eae64
commit: 7b6ee48d3f4d432bfa6c9c9662fbdbd97681240e [10/14] cdc-phonet: use common parser
config: x86_64-randconfig-a0-09160932 (attached as .config)
reproduce:
  git checkout 7b6ee48d3f4d432bfa6c9c9662fbdbd97681240e
  # save the attached .config to linux build tree
  make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

>> ERROR: "cdc_parse_cdc_header" [drivers/net/usb/cdc-phonet.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.2.0 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set

#
# CPU/Task time and stats accounting
#
# CONFIG_TICK_CPU_ACCOUNTING is not set
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
CONFIG_IRQ_TIME_ACCOUNTING=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_BOOST=y
CONFIG_RCU_KTHREAD_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
# CONFIG_RCU_NOCB_CPU is not set
# CONFIG_RCU_EXPEDITE_BOOT is not set
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CPUSETS=y
# CONFIG_PROC_PID_CPUSET is not set
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_MEMCG is not set
# CONFIG_CGROUP_HUGETLB is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_NAMESPACES is not set
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_

Re: [PATCH for-next] cxgb4: add device ID for few T5 adapters

2015-09-15 Thread Hariprasad S

On Tue, Sep 15, 2015 at 11:55:19 -0700, David Miller wrote:
> From: Hariprasad Shenai 
> Date: Tue, 15 Sep 2015 17:20:09 +0530
> 
> > Signed-off-by: Hariprasad Shenai 
> 
> Adding just some new device IDs is definitely 'net' material, mind
> if I apply it there instead?

No issues. Thanks.

Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-15 Thread Vlad Yasevich

On 09/15/2015 02:17 PM, Phil Sutter wrote:
> On Tue, Sep 15, 2015 at 11:11:53AM -0400, Vlad Yasevich wrote:
>> On 09/14/2015 04:06 PM, Phil Sutter wrote:
>>> On Mon, Sep 14, 2015 at 02:21:10PM -0400, Vlad Yasevich wrote:
>>>> On 09/11/2015 04:20 PM, Phil Sutter wrote:
>>>>> On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote:
>>>>>> On Fri, 11 Sep 2015 21:22:03 +0200
>>>>>> Phil Sutter  wrote:
>>>>>>
>>>>>>> When forwarding packets from an 802.1Q interface with REORDER_HDR set to
>>>>>>> zero, the VLAN header previously inserted by vlan_do_receive() needs to
>>>>>>> be stripped from the packet and the mac_header adjustment undone,
>>>>>>> otherwise a tagged frame with first four bytes missing will be
>>>>>>> transmitted.
>>>>>>>
>>>>>>> Signed-off-by: Phil Sutter 
>>>>>>> ---
>>>>>>>  net/bridge/br_input.c | 10 ++
>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>>>>>>> index f921a5d..e4e3fc7 100644
>>>>>>> --- a/net/bridge/br_input.c
>>>>>>> +++ b/net/bridge/br_input.c
>>>>>>> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>>>>>>> }
>>>>>>>  
>>>>>>>  forward:
>>>>>>> +   if (is_vlan_dev(skb->dev) &&
>>>>>>> +   !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
>>>>>>> +   unsigned int offset = skb->data - skb_mac_header(skb);
>>>>>>> +
>>>>>>> +   skb_push(skb, offset);
>>>>>>> +   memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
>>>>>>> +   skb->mac_header += VLAN_HLEN;
>>>>>>> +   skb_pull(skb, offset);
>>>>>>> +   skb_reset_mac_len(skb);
>>>>>>> +   }
>>>>>>> switch (p->state) {
>>>>>>> case BR_STATE_FORWARDING:
>>>>>>> rhook = rcu_dereference(br_should_route_hook);
>>>>>>
>>>>>> Thanks for finding this. Is this a new thing or has it always been there?
>>>>>
>>>>> Sorry, I didn't check if this is a regression or not. Seen initially
>>>>> with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting
>>>>> is by far not as old as it might seem. But it's surely not a brand new
>>>>> problem of net-next or so.
>>>>>
>>>>> Since nowadays no sane mind touches REORDER_HDR (there was originally a
>>>>> bug in NetworkManager which defaulted this to 0), it may very well be
>>>>> there for a long time already.
>>>>>
>>>>>> Sorry, this looks so special case it doesn't seem like a good idea.
>>>>>> Something is broken in VLAN handling if this is required.
>>>>>
>>>>> It is so ugly, I wish I had found a better way to fix the problem. Well,
>>>>> maybe I miss something:
>>>>>
>>>>> - packet enters __netif_receive_skb_core():
>>>>>   - skb->protocol is set to ETH_P_8021Q, so:
>>>>>     - packet is untagged
>>>>>     - skb->vlan_tci set
>>>>>     - skb->protocol set to 'real' protocol
>>>>>   - skb_vlan_tag_present(skb) == true, so:
>>>>>     - vlan_do_receive() is called:
>>>>>       - tags the packet again
>>>>>       - zeroes vlan_tci
>>>>>     - goto another_round
>>>>> - __netif_receive_skb_core(), round 2:
>>>>>   - skb->protocol is not ETH_P_8021Q -> no untagging
>>>>>   - skb_vlan_tag_present(skb) == false -> no vlan_do_receive()
>>>>>   - rx_handler handler (== br_handle_frame) is called
>>>>>
>>>>> IMO the root of all evil is the existence of REORDER_HDR itself. It
>>>>> causes an skb which should have been untagged to being passed along with
>>>>> VLAN header present and code dealing with it needs to clean up the mess.

>>>> So the problem here appears to be the code in br_dev_queue_push_xmit().
>>>> It assumes that MAC_HLEN worth of data has been removed from the skb,
>>>> which is normal in case of normal VLAN processing.  However, without
>>>> REORDER_HEADER set this is no longer the case.  In this case, the ethernet
>>>> header is shifted 4 bytes, and when we push it back we miss the 4 bytes
>>>> of the destination mac address...
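
The header shuffle in the quoted br_input.c hunk (the memmove() of
2 * ETH_ALEN bytes by VLAN_HLEN) can be tried on a plain byte buffer.
The sketch below is a user-space model only: demo_strip_vlan() and the
DEMO_* constants are hypothetical names, and no skb bookkeeping
(mac_header, mac_len, data pointer) is modelled. It assumes the frame
layout dst MAC, src MAC, 4-byte 802.1Q tag, EtherType, payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN	6	/* bytes per MAC address */
#define DEMO_VLAN_HLEN	4	/* TPID (2 bytes) + TCI (2 bytes) */

/*
 * Slide both MAC addresses forward over the 802.1Q tag, overwriting it.
 * Returns the new start of the now untagged frame, which is
 * frame + DEMO_VLAN_HLEN; the first four bytes of the buffer are stale.
 */
static uint8_t *demo_strip_vlan(uint8_t *frame, size_t len)
{
	/* Need both addresses, the tag and the inner EtherType. */
	assert(len >= 2 * DEMO_ETH_ALEN + DEMO_VLAN_HLEN + 2);

	/* Regions overlap, so memmove() rather than memcpy(). */
	memmove(frame + DEMO_VLAN_HLEN, frame, 2 * DEMO_ETH_ALEN);
	return frame + DEMO_VLAN_HLEN;
}
```

After the call, the returned pointer sees dst MAC, src MAC and the inner
EtherType back to back. Transmitting from the old start instead would
send four stale bytes followed by a frame that is misaligned by the tag
length, which is the kind of corruption described in this thread.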
>>>
>>> Please note that vlan_do_receive() also inserts the VLAN header in
>>> between ethernet header and IP header, therefore:
>>>
>>>> I wonder if it would be safe to just use skb->mac_len.
>>>
>>> Given this works, the bridge would still forward a tagged frame which
>>> should have been untagged in the first place.
>>>
>>> I just wondered where this added VLAN header is dropped if the interface
>>> does not belong to a bridge, but then realized that further packet
>>> processing simply ignores the ethernet header (and everything following
>>> it). So unless I forget something, this should indeed be a
>>> bridge-specific problem.
>>>
>>
>> Looks like macvtap is also susceptible to this problem.  It seems to be a bad
>> idea to allow any upper device configuration on top of a REORDER_HDR=0 vlan.
>> It is also not enough to just check is_vlan_dev(skb->dev) because the vlan
>> may be lower in the device stack.
> 
> Oh well. Apart from implementing workarounds for this wor

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Eric Dumazet

On Tue, 2015-09-15 at 17:15 -0700, Tom Herbert wrote:

> A more fundamental question is whether we can eliminate some of these
> hashing types (I see five of them in if_bonding.h). Is there any
> substantial difference between this and IPv4/v6 ECMP routing such that
> they shouldn't all have the same path selection modes?

We had an issue on a router that did not like a change in the hashing
done by the host behind it.

Do not ask me for details that I cannot provide, but I would guess it is
better not changing legacy modes unilaterally.



[PATCH net-next 1/2] cls_bpf: introduce integrated actions

2015-09-15 Thread Alexei Starovoitov

From: Daniel Borkmann 

Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
  /* classify arp, ip, ipv6 into different traffic classes
   * and drop all other packets
   */
  switch (skb->protocol) {
  case htons(ETH_P_ARP):
    skb->tc_classid = 1;
    break;
  case htons(ETH_P_IP):
    skb->tc_classid = 2;
    break;
  case htons(ETH_P_IPV6):
    skb->tc_classid = 3;
    break;
  default:
    return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 include/net/sch_generic.h    |    2 +-
 include/uapi/linux/bpf.h     |    1 +
 include/uapi/linux/pkt_cls.h |    3 +++
 net/core/filter.c            |   14 ++
 net/sched/cls_bpf.c          |   60 ++
 5 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 444faa89a55f..da61febb9091 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -251,7 +251,7 @@ struct tcf_proto {
 struct qdisc_skb_cb {
 	unsigned int		pkt_len;
 	u16			slave_dev_queue_mapping;
-	u16			_pad;
+	u16			tc_classid;
 #define QDISC_CB_PRIV_LEN 20
 	unsigned char		data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2d5461..2fbd1c71fa3b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -293,6 +293,7 @@ struct __sk_buff {
__u32 tc_index;
__u32 cb[5];
__u32 hash;
+   __u32 tc_classid;
 };
 
 struct bpf_tunnel_key {
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 4f0d1bc3647d..0a262a83f9d4 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -373,6 +373,8 @@ enum {
 
 /* BPF classifier */
 
+#define TCA_BPF_FLAG_ACT_DIRECT	(1 << 0)
+
 enum {
TCA_BPF_UNSPEC,
TCA_BPF_ACT,
@@ -382,6 +384,7 @@ enum {
TCA_BPF_OPS,
TCA_BPF_FD,
TCA_BPF_NAME,
+   TCA_BPF_FLAGS,
__TCA_BPF_MAX,
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f03902e..971d6ba89758 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1632,6 +1632,9 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 static bool sk_filter_is_valid_access(int off, int size,
  enum bpf_access_type type)
 {
+   if (off == offsetof(struct __sk_buff, tc_classid))
+   return false;
+
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct __sk_buff, cb[0]) ...
@@ -1648,6 +1651,9 @@ static bool sk_filter_is_valid_access(int off, int size,
 static bool tc_cls_act_is_valid_access(int off, int size,
   enum bpf_access_type type)
 {
+   if (off == offsetof(struct __sk_buff, tc_classid))
+   return type == BPF_WRITE ? true : false;
+
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct __sk_buff, mark):
@@ -1760,6 +1766,14 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
break;
 
+   case offsetof(struct __sk_buff, tc_classid):
+   ctx_off -= offsetof(struct __sk_buff, tc_classid);
+   ctx_off += offsetof(struct sk_buff, cb);
+   ctx_off += offsetof(struct qdisc_skb_cb, tc_classid);
+   WARN_ON(type != BPF_WRITE);
+   *insn++ = BPF_STX_MEM(BPF_H, dst_reg, src_reg, ctx_off);
+   break;
+
case offsetof(struct __sk_buff, tc_index):
 #ifdef CONFIG_NET_SCHED
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tc_index) != 2);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index e5168f8b9640..77b0ef148256 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -38,6 +38,7 @@ struct cls_bpf_prog {
struct bpf_prog *filter;
struct list_head link;
struct tcf_result res;
+   bool exts_integrated;
struct tcf_exts exts;
u32 handle;
union {
@@ -52,6 +53,7 @@ struct cls_bpf_prog {
 
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
[TCA_BPF_CLASSID]   = { .type = NLA_U32 },
+   [TCA_BPF_FLAGS] = { .type = NLA_U32 },
[TCA_BPF_FD]= { .type = NLA_U32 },
[TCA_BPF_NAME]  = { .type = NLA_NUL_STRING, .len = CLS_BPF_NAME_LEN },
[TCA_BPF_OPS_LEN]

[PATCH net-next 2/2] bpf: add bpf_redirect() helper

2015-09-15 Thread Alexei Starovoitov

Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps
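
For readers who prefer per-packet cost over packet rate, the figures
above can be converted with the small sketch below. The function names
are hypothetical and the only inputs are the Mpps numbers quoted in
this message; the conversion is 1000 / Mpps nanoseconds per packet.

```c
#include <assert.h>

/* Convert a rate in million packets per second to ns spent per packet. */
static double ns_per_packet(double mpps)
{
	assert(mpps > 0.0);
	return 1000.0 / mpps;
}

/* Relative gain of rate b over rate a, in percent. */
static double gain_percent(double a_mpps, double b_mpps)
{
	assert(a_mpps > 0.0);
	return (b_mpps / a_mpps - 1.0) * 100.0;
}
```

By this arithmetic 2.0 Mpps is 500 ns per packet and 2.4 Mpps is about
417 ns, so dropping the clone saves roughly 83 ns per packet (a 20%
rate gain), and cls_bpf with integrated actions at 2.5 Mpps brings the
cost down to 400 ns. These are derived from the benchmark above and
should be read as rough estimates for this setup only.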

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
This approach is using per_cpu scratch area to store ifindex and flags.
The other alternatives discussed at plumbers are slower and more intrusive.
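
As an illustration of the scratch-area idea in the note above, here is a
user-space sketch. It is a model under stated assumptions, not the
patch itself: a thread-local variable stands in for the per-CPU slot,
all model_* and MODEL_* names are hypothetical, and clearing the slot
after use is an assumption of this sketch. Only struct redirect_info,
the meaning of flag bit 0 and the TC_ACT_REDIRECT value come from the
patch.

```c
#include <assert.h>
#include <stdint.h>

#define TC_ACT_REDIRECT		7	/* value from the pkt_cls.h hunk */
#define MODEL_REDIRECT_INGRESS	1u	/* hypothetical name for flag bit 0 */

struct redirect_info {
	uint32_t ifindex;
	uint32_t flags;
};

/* One slot per thread stands in for the kernel's per-CPU variable. */
static _Thread_local struct redirect_info model_redirect_info;

/* Helper side: record the target and return the verdict to the caller. */
static int model_bpf_redirect(uint32_t ifindex, uint32_t flags)
{
	model_redirect_info.ifindex = ifindex;
	model_redirect_info.flags = flags;
	return TC_ACT_REDIRECT;
}

enum model_path { MODEL_EGRESS, MODEL_INGRESS };

/* Stack side: consume the slot once the program returned TC_ACT_REDIRECT. */
static enum model_path model_do_redirect(uint32_t *ifindex_out)
{
	struct redirect_info *ri = &model_redirect_info;

	assert(ifindex_out);
	*ifindex_out = ri->ifindex;
	ri->ifindex = 0;	/* assumption: avoid reusing a stale target */
	return (ri->flags & MODEL_REDIRECT_INGRESS) ? MODEL_INGRESS
						    : MODEL_EGRESS;
}
```

The point of the design is that the helper itself does no packet work:
it only stores two words and returns a verdict, and the caller that
already owns the skb acts on them, so no clone is needed.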

 include/net/sch_generic.h    |    1 +
 include/uapi/linux/bpf.h     |    8 +++
 include/uapi/linux/pkt_cls.h |    1 +
 net/core/dev.c               |    8 +++
 net/core/filter.c            |   49 ++
 net/sched/act_bpf.c          |    1 +
 net/sched/cls_bpf.c          |    1 +
 samples/bpf/bpf_helpers.h    |    4 
 samples/bpf/tcbpf1_kern.c    |   24 -
 9 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index da61febb9091..4c79ce8c1f92 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -402,6 +402,7 @@ void __qdisc_calculate_pkt_len(struct sk_buff *skb,
   const struct qdisc_size_table *stab);
 bool tcf_destroy(struct tcf_proto *tp, bool force);
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
+int skb_do_redirect(struct sk_buff *);
 
 /* Reset all TX qdiscs greater then index of a device.  */
 static inline void qdisc_reset_all_tx_gt(struct net_device *dev, unsigned int i)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2fbd1c71fa3b..4ec0b5488294 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -272,6 +272,14 @@ enum bpf_func_id {
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read,   /* u64 bpf_perf_event_read(&map, index) */
+   /**
+* bpf_redirect(ifindex, flags) - redirect to another netdev
+* @ifindex: ifindex of the net device
+* @flags: bit 0 - if set, redirect to ingress instead of egress
+* other bits - reserved
+* Return: TC_ACT_REDIRECT
+*/
+   BPF_FUNC_redirect,
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 0a262a83f9d4..439873775d49 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -87,6 +87,7 @@ enum {
 #define TC_ACT_STOLEN		4
 #define TC_ACT_QUEUED		5
 #define TC_ACT_REPEAT		6
+#define TC_ACT_REDIRECT		7
 #define TC_ACT_JUMP		0x1000
 
 /* Action type identifiers*/
diff --git a/net/core/dev.c b/net/core/dev.c
index 877c84834d81..d6a492e57874 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3668,6 +3668,14 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
case TC_ACT_QUEUED:
kfree_skb(skb);
return NULL;
+   case TC_ACT_REDIRECT:
+   /* skb_mac_header check was done by cls/act_bpf, so
+* we can safely push the L2 header back before
+* redirecting to another netdev
+*/
+   __skb_push(skb, skb->mac_len);
+   skb_do_redirect(skb);
+   return NULL;
default:
break;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 971d6ba89758..5bf273bab781 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1427,6 +1427,53 @@ const struct bpf_func_proto bpf_clone_redirect_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
+struct redirect_info {
+   u32 ifindex;
+   u32 flags;
+};
+
+static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+static u64 bpf_redirect(u64 ifindex, u64 flags, u64 r3, u64 r4, u64 r5)
+{
+   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+   ri->ifindex = ifindex;
+   ri->flags = fla

[PATCH net-next 0/2] bpf: performance improvements

2015-09-15 Thread Alexei Starovoitov

At plumbers we discussed different options on how to get rid of skb_clone
from bpf_clone_redirect(), the patch 2 implements the best option.
Patch 1 adds 'integrated exts' to cls_bpf to improve performance by
combining simple actions into bpf classifier.

Alexei Starovoitov (1):
  bpf: add bpf_redirect() helper

Daniel Borkmann (1):
  cls_bpf: introduce integrated actions

 include/net/sch_generic.h    |    3 +-
 include/uapi/linux/bpf.h     |    9 ++
 include/uapi/linux/pkt_cls.h |    4 +++
 net/core/dev.c               |    8 ++
 net/core/filter.c            |   63 ++
 net/sched/act_bpf.c          |    1 +
 net/sched/cls_bpf.c          |   61 
 samples/bpf/bpf_helpers.h    |    4 +++
 samples/bpf/tcbpf1_kern.c    |   24 +++-
 9 files changed, 164 insertions(+), 13 deletions(-)

-- 
1.7.9.5


linux-next: build failure after merge of the tip tree

2015-09-15 Thread Stephen Rothwell

Hi all,

After merging the next-20150915 version of the tip tree, today's
linux-next build (x86_64 allmodconfig) failed like this:

In file included from drivers/usb/gadget/function/u_ether.h:20:0,
                 from drivers/usb/gadget/function/f_ncm.c:26:
include/linux/usb/cdc.h:23:8: error: redefinition of 'struct usb_cdc_parsed_header'
 struct usb_cdc_parsed_header {
        ^
In file included from drivers/usb/gadget/function/f_ncm.c:24:0:
include/linux/usb/cdc.h:23:8: note: originally defined here
 struct usb_cdc_parsed_header {
        ^
In file included from drivers/usb/gadget/function/u_ether.h:20:0,
                 from drivers/usb/gadget/function/f_ncm.c:26:
include/linux/usb/cdc.h:44:5: error: conflicting types for 'cdc_parse_cdc_header'
 int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
     ^
In file included from drivers/usb/gadget/function/f_ncm.c:24:0:
include/linux/usb/cdc.h:44:5: note: previous declaration of 'cdc_parse_cdc_header' was here
 int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
     ^
In file included from drivers/usb/gadget/function/u_serial.h:16:0,
                 from drivers/usb/gadget/legacy/cdc2.c:17:
include/linux/usb/cdc.h:23:8: error: redefinition of 'struct usb_cdc_parsed_header'
 struct usb_cdc_parsed_header {
        ^
In file included from drivers/usb/gadget/function/u_ether.h:20:0,
                 from drivers/usb/gadget/legacy/cdc2.c:16:
include/linux/usb/cdc.h:23:8: note: originally defined here
 struct usb_cdc_parsed_header {
        ^
In file included from drivers/usb/gadget/function/u_serial.h:16:0,
                 from drivers/usb/gadget/legacy/cdc2.c:17:
include/linux/usb/cdc.h:44:5: error: conflicting types for 'cdc_parse_cdc_header'
 int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
     ^
In file included from drivers/usb/gadget/function/u_ether.h:20:0,
                 from drivers/usb/gadget/legacy/cdc2.c:16:
include/linux/usb/cdc.h:44:5: note: previous declaration of 'cdc_parse_cdc_header' was here
 int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
     ^

Caused by commit

  c40a2c8817e4 ("CDC: common parser for extra headers")

from the net-next tree that added include/linux/usb/cdc.h with no
reinclusion guards.
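
To illustrate what a reinclusion guard buys, here is a minimal single-file sketch that simulates the same header body being included twice. The guard macro EXAMPLE_CDC_H, the struct example_parsed_header and the helper example_header_size() are made-up names for illustration only; they are not taken from the kernel headers.

```c
/* Sketch only: one file standing in for a header that is included twice.
 * EXAMPLE_CDC_H and struct example_parsed_header are hypothetical names. */
#include <stddef.h>

/* First "inclusion": the guard macro is not defined yet, so the body is seen. */
#ifndef EXAMPLE_CDC_H
#define EXAMPLE_CDC_H
struct example_parsed_header {
	int header_count;
};
#endif /* EXAMPLE_CDC_H */

/* Second "inclusion": the guard macro is already defined, so the
 * preprocessor skips the body and the struct is not redefined. */
#ifndef EXAMPLE_CDC_H
#define EXAMPLE_CDC_H
struct example_parsed_header {
	int header_count;
};
#endif /* EXAMPLE_CDC_H */

/* Uses the single definition that survived preprocessing. */
size_t example_header_size(void)
{
	return sizeof(struct example_parsed_header);
}
```

Deleting the #ifndef/#define/#endif lines from this sketch should make the compiler report a struct redefinition, the same kind of error as in the build log above.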

I am not sure why I did not see this failure when building after
merging the net-next tree.  Maybe it is exposed by some config change
in the tip tree?

I have added the following fix patch for today:

From: Stephen Rothwell 
Date: Wed, 16 Sep 2015 11:10:16 +1000
Subject: [PATCH] cdc: add header guards

Signed-off-by: Stephen Rothwell 
---
 include/linux/usb/cdc.h  | 4 
 include/uapi/linux/usb/cdc.h | 6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/usb/cdc.h b/include/linux/usb/cdc.h
index 959d0c838113..b5706f94ee9e 100644
--- a/include/linux/usb/cdc.h
+++ b/include/linux/usb/cdc.h
@@ -7,6 +7,8 @@
  * modify it under the terms of the GNU General Public License
  * version 2 as published by the Free Software Foundation.
  */
+#ifndef __LINUX_USB_CDC_H
+#define __LINUX_USB_CDC_H
 
 #include 
 
@@ -45,3 +47,5 @@ int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
struct usb_interface *intf,
u8 *buffer,
int buflen);
+
+#endif /* __LINUX_USB_CDC_H */
diff --git a/include/uapi/linux/usb/cdc.h b/include/uapi/linux/usb/cdc.h
index b6a9cdd6e096..e2bc417b243b 100644
--- a/include/uapi/linux/usb/cdc.h
+++ b/include/uapi/linux/usb/cdc.h
@@ -6,8 +6,8 @@
  * firmware based USB peripherals.
  */
 
-#ifndef __LINUX_USB_CDC_H
-#define __LINUX_USB_CDC_H
+#ifndef __UAPI_LINUX_USB_CDC_H
+#define __UAPI_LINUX_USB_CDC_H
 
 #include 
 
@@ -444,4 +444,4 @@ struct usb_cdc_ncm_ndp_input_size {
 #define USB_CDC_NCM_CRC_NOT_APPENDED   0x00
 #define USB_CDC_NCM_CRC_APPENDED   0x01
 
-#endif /* __LINUX_USB_CDC_H */
+#endif /* __UAPI_LINUX_USB_CDC_H */
-- 
2.5.1

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 net] net/mlx4_en: really allow to change RSS key

2015-09-15 Thread Eric Dumazet

From: Eric Dumazet 

When changing the RSS key, we do not want to overwrite the user-provided
key with the one provided by netdev_rss_key_fill(), which is the host
random key generated at boot time.
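
The ordering problem can be sketched in userspace C. This is only an illustration under assumptions: example_host_key_fill() is a hypothetical stand-in for netdev_rss_key_fill(), and EXAMPLE_KEY_SIZE and the 0xAA fill pattern are arbitrary values, not driver constants.

```c
#include <string.h>

#define EXAMPLE_KEY_SIZE 8	/* arbitrary size for the sketch */

/* Hypothetical stand-in for netdev_rss_key_fill(): writes the host key. */
static void example_host_key_fill(unsigned char *buf, size_t len)
{
	memset(buf, 0xAA, len);
}

/* Ordering before the patch: the user key is copied, then overwritten. */
static void example_config_before(unsigned char *ctx_key,
				  const unsigned char *user_key)
{
	memcpy(ctx_key, user_key, EXAMPLE_KEY_SIZE);
	example_host_key_fill(ctx_key, EXAMPLE_KEY_SIZE);
}

/* Ordering after the patch: only the user key is copied. */
static void example_config_after(unsigned char *ctx_key,
				 const unsigned char *user_key)
{
	memcpy(ctx_key, user_key, EXAMPLE_KEY_SIZE);
}

/* Returns 1 if the user key is what ends up in the context, else 0. */
int example_user_key_survives(int patched)
{
	const unsigned char user_key[EXAMPLE_KEY_SIZE] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	unsigned char ctx_key[EXAMPLE_KEY_SIZE] = { 0 };

	if (patched)
		example_config_after(ctx_key, user_key);
	else
		example_config_before(ctx_key, user_key);
	return memcmp(ctx_key, user_key, EXAMPLE_KEY_SIZE) == 0;
}
```

Calling example_user_key_survives(0) models the old behaviour, where the copied key is lost; example_user_key_survives(1) models the behaviour after the two lines are removed.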

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet 
Cc: Eyal Perry 
CC: Amir Vadai 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4402a1e48c9b..0ce6ffe73ca8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1268,8 +1268,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
rss_context->hash_fn = MLX4_RSS_HASH_TOP;
memcpy(rss_context->rss_key, priv->rss_key,
   MLX4_EN_RSS_KEY_SIZE);
-   netdev_rss_key_fill(rss_context->rss_key,
-   MLX4_EN_RSS_KEY_SIZE);
} else {
en_err(priv, "Unknown RSS hash function requested\n");
err = -EINVAL;



Re: [PATCH net] net/mlx4_en:

2015-09-15 Thread Eric Dumazet


Arg, patch title was meant to be

net/mlx4_en: really allow to change RSS key



[PATCH net] net/mlx4_en:

2015-09-15 Thread Eric Dumazet

From: Eric Dumazet 

When changing the RSS key, we do not want to overwrite the user-provided
key with the one provided by netdev_rss_key_fill(), which is the host
random key generated at boot time.

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet 
Cc: Eyal Perry 
CC: Amir Vadai 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4402a1e48c9b..0ce6ffe73ca8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1268,8 +1268,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
rss_context->hash_fn = MLX4_RSS_HASH_TOP;
memcpy(rss_context->rss_key, priv->rss_key,
   MLX4_EN_RSS_KEY_SIZE);
-   netdev_rss_key_fill(rss_context->rss_key,
-   MLX4_EN_RSS_KEY_SIZE);
} else {
en_err(priv, "Unknown RSS hash function requested\n");
err = -EINVAL;



[PATCH next 21/30] ipv6: Only compute net once in ip6_finish_output2

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv6/ip6_output.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a80502c64523..12d0166a64cd 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -60,6 +60,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff 
*skb)
 {
struct dst_entry *dst = skb_dst(skb);
struct net_device *dev = dst->dev;
+   struct net *net = dev_net(dev);
struct neighbour *neigh;
struct in6_addr *nexthop;
int ret;
@@ -71,7 +72,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff 
*skb)
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
 
if (!(dev->flags & IFF_LOOPBACK) && sk_mc_loop(sk) &&
-   ((mroute6_socket(dev_net(dev), skb) &&
+   ((mroute6_socket(net, skb) &&
 !(IP6CB(skb)->flags & IP6SKB_FORWARDED)) ||
 ipv6_chk_mcast_addr(dev, &ipv6_hdr(skb)->daddr,
 &ipv6_hdr(skb)->saddr))) {
@@ -86,15 +87,14 @@ static int ip6_finish_output2(struct sock *sk, struct 
sk_buff *skb)
dev_loopback_xmit);
 
if (ipv6_hdr(skb)->hop_limit == 0) {
-   IP6_INC_STATS(dev_net(dev), idev,
+   IP6_INC_STATS(net, idev,
  IPSTATS_MIB_OUTDISCARDS);
kfree_skb(skb);
return 0;
}
}
 
-   IP6_UPD_PO_STATS(dev_net(dev), idev, IPSTATS_MIB_OUTMCAST,
-   skb->len);
+   IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUTMCAST, skb->len);
 
if (IPV6_ADDR_MC_SCOPE(&ipv6_hdr(skb)->daddr) <=
IPV6_ADDR_SCOPE_NODELOCAL &&
@@ -116,8 +116,7 @@ static int ip6_finish_output2(struct sock *sk, struct 
sk_buff *skb)
}
rcu_read_unlock_bh();
 
-   IP6_INC_STATS(dev_net(dst->dev),
- ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+   IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
kfree_skb(skb);
return -EINVAL;
 }
-- 
2.2.1


[PATCH next 25/30] bridge: Pass net into br_nf_push_frag_xmit

2015-09-15 Thread Eric W. Biederman

When struct net starts being passed through the ipv4 and ipv6 fragment
routines, br_nf_push_frag_xmit will need to take a net parameter.
Prepare br_nf_push_frag_xmit before that is needed, and introduce
br_nf_push_frag_xmit_sk for the call sites that still need the old
calling conventions.
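
The pattern described here (a new-style function taking the namespace explicitly, plus a thin wrapper for callers that still use the old prototype) can be sketched outside the kernel roughly as follows. All types and names prefixed with example_ are hypothetical stand-ins, not kernel structures.

```c
/* Hypothetical miniature stand-ins for struct net, net_device and sk_buff. */
struct example_net { int id; };
struct example_dev { struct example_net *net; };
struct example_skb { struct example_dev *dev; int last_net_id; };

/* New-style function: the namespace is passed in explicitly. */
int example_push_xmit(struct example_net *net, struct example_skb *skb)
{
	skb->last_net_id = net->id;	/* record which namespace was used */
	return 0;
}

/* Old-style wrapper: derives the namespace from the packet, then forwards. */
int example_push_xmit_old(struct example_skb *skb)
{
	return example_push_xmit(skb->dev->net, skb);
}

/* A caller that has not been converted and still expects the old prototype. */
int example_fragment(struct example_skb *skb,
		     int (*output)(struct example_skb *))
{
	return output(skb);
}

/* Runs the unconverted caller with the wrapper; returns the namespace id used. */
int example_demo(void)
{
	struct example_net net = { .id = 7 };
	struct example_dev dev = { .net = &net };
	struct example_skb skb = { .dev = &dev, .last_net_id = -1 };

	if (example_fragment(&skb, example_push_xmit_old) != 0)
		return -1;
	return skb.last_net_id;
}
```

Once every caller passes the namespace itself, a wrapper like example_push_xmit_old can be dropped.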

Signed-off-by: "Eric W. Biederman" 
---
 net/bridge/br_netfilter_hooks.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 971d45d24c64..e6910b71af6e 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -668,7 +668,7 @@ static unsigned int br_nf_forward_arp(const struct 
nf_hook_ops *ops,
 }
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4) || IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
-static int br_nf_push_frag_xmit(struct sock *sk, struct sk_buff *skb)
+static int br_nf_push_frag_xmit(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct brnf_frag_data *data;
int err;
@@ -692,6 +692,11 @@ static int br_nf_push_frag_xmit(struct sock *sk, struct 
sk_buff *skb)
nf_bridge_info_free(skb);
return br_dev_queue_push_xmit(sk, skb);
 }
+static int br_nf_push_frag_xmit_sk(struct sock *sk, struct sk_buff *skb)
+{
+   struct net *net = dev_net(skb_dst(skb)->dev);
+   return br_nf_push_frag_xmit(net, sk, skb);
+}
 #endif
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
@@ -760,7 +765,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct 
sk_buff *skb)
skb_copy_from_linear_data_offset(skb, -data->size, data->mac,
 data->size);
 
-   return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit);
+   return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit_sk);
}
 #endif
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
@@ -783,7 +788,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct 
sk_buff *skb)
 data->size);
 
if (v6ops)
-   return v6ops->fragment(sk, skb, br_nf_push_frag_xmit);
+   return v6ops->fragment(sk, skb, 
br_nf_push_frag_xmit_sk);
 
kfree_skb(skb);
return -EMSGSIZE;
-- 
2.2.1


Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Eric Dumazet

On Tue, 2015-09-15 at 17:04 -0700, Mahesh Bandewar wrote:
> On Tue, Sep 15, 2015 at 4:20 PM, Eric Dumazet  wrote:
> > On Tue, 2015-09-15 at 15:54 -0700, Mahesh Bandewar wrote:
> >
> >> > +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> >> > +   skb->l4_hash)
> >> if (ENCAP34 || LAYER34) && l4_hash) may be?
> >
> > Hmm, traditional BOND_XMIT_POLICY_LAYER34 did not do a full flow bisection
> > (tunnel awareness added in commit
> > 32819dc1834866cb9547cb75f81af9edd58d33cd bonding: modify the old and add
> > new xmit hash policies)
> >
> > This could radically change some setups and behavior.
> >
> > BOND_XMIT_POLICY_ENCAP34 looks a better fit to me.
> >
> Agreed, this will change flow distribution for LAYER34 policy but then
> lose out on calculating hash per packet which I think is unnecessary.

We added new bonding policy exactly for this.

If people are stuck with LAYER34, that is their choice.

> 
> This elimination of hash calculation is a good step but I'm feeling
> that it's somehow tied to ENCAP policy which is actually orthogonal
> and should be applied to LAYER34 also.

You can send a followup patch, once fully tested.

I've tested the ENCAP34 mode only, I do not want to add cycles for a
mode that is potentially a legacy one that nobody uses.
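
As a rough illustration of the condition being discussed (reuse an already computed L4 hash only under the ENCAP34 policy, otherwise hash per packet), here is a userspace sketch. The enum, the struct example_pkt and the XOR-based example_flow_hash() are invented for this sketch and do not reflect the real bonding hash.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy names modelled on the two policies in the thread. */
enum example_policy { EXAMPLE_POLICY_LAYER34, EXAMPLE_POLICY_ENCAP34 };

/* Hypothetical packet: l4_hash says whether 'hash' already covers the L4 flow. */
struct example_pkt {
	bool l4_hash;
	uint32_t hash;
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Toy per-packet hash; the real computation is more involved. */
static uint32_t example_flow_hash(const struct example_pkt *p)
{
	uint32_t h = p->saddr ^ p->daddr;

	h ^= ((uint32_t)p->sport << 16) | p->dport;
	return h;
}

/* Reuse the cached hash only for ENCAP34; otherwise compute per packet. */
uint32_t example_xmit_hash(enum example_policy policy,
			   const struct example_pkt *p)
{
	if (policy == EXAMPLE_POLICY_ENCAP34 && p->l4_hash)
		return p->hash;
	return example_flow_hash(p);
}

/* Helper that builds one fixed packet and returns the selected hash. */
uint32_t example_hash_for(int encap34, int has_l4_hash)
{
	struct example_pkt p = {
		.l4_hash = has_l4_hash != 0,
		.hash = 0x1234,
		.saddr = 1, .daddr = 2, .sport = 3, .dport = 4,
	};

	return example_xmit_hash(encap34 ? EXAMPLE_POLICY_ENCAP34
					 : EXAMPLE_POLICY_LAYER34, &p);
}
```

Extending the first branch to LAYER34 as well, as suggested in the quoted text, would change which value LAYER34 users get for packets that carry a cached hash, which is the behaviour change the reply is cautious about.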




[PATCH next 19/30] net: Remove dev_queue_xmit_sk

2015-09-15 Thread Eric W. Biederman

A function with weird arguments that it will never use, kept only to
accommodate a netfilter callback prototype, sits right in the core of the
networking stack.  Frankly it does not make sense, and it causes a lot
of confusion as to why arguments that are never used are being passed
to the function.

As I am preparing to make a second change to the arguments of the okfn,
even the name stops making sense.

As I have removed the two callers of this function, remove this confusion
from the networking stack.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netdevice.h | 6 +-
 net/core/dev.c| 4 ++--
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88a00694eda5..e664f87c8e4c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2213,11 +2213,7 @@ int dev_close(struct net_device *dev);
 int dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct sock *sk, struct sk_buff *newskb);
-int dev_queue_xmit_sk(struct sock *sk, struct sk_buff *skb);
-static inline int dev_queue_xmit(struct sk_buff *skb)
-{
-   return dev_queue_xmit_sk(skb->sk, skb);
-}
+int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head 
*head);
diff --git a/net/core/dev.c b/net/core/dev.c
index 877c84834d81..dcf9ff913925 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3143,11 +3143,11 @@ out:
return rc;
 }
 
-int dev_queue_xmit_sk(struct sock *sk, struct sk_buff *skb)
+int dev_queue_xmit(struct sk_buff *skb)
 {
return __dev_queue_xmit(skb, NULL);
 }
-EXPORT_SYMBOL(dev_queue_xmit_sk);
+EXPORT_SYMBOL(dev_queue_xmit);
 
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
 {
-- 
2.2.1


[PATCH next 23/30] ipv6: Compute net once in raw6_send_hdrinc

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv6/raw.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 1636537705f5..5aa461302716 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -614,6 +614,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr 
*msg, int length,
unsigned int flags)
 {
struct ipv6_pinfo *np = inet6_sk(sk);
+   struct net *net = sock_net(sk);
struct ipv6hdr *iph;
struct sk_buff *skb;
int err;
@@ -652,7 +653,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr 
*msg, int length,
if (err)
goto error_fault;
 
-   IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, 
skb->len);
+   IP6_UPD_PO_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
err = NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT, sk, skb,
  NULL, rt->dst.dev, dst_output);
if (err > 0)
@@ -666,7 +667,7 @@ error_fault:
err = -EFAULT;
kfree_skb(skb);
 error:
-   IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+   IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
if (err == -ENOBUFS && !np->recverr)
err = 0;
return err;
-- 
2.2.1


[PATCH next 26/30] bridge: Cache net in br_nf_pre_routing_finish

2015-09-15 Thread Eric W. Biederman

This is prep work for passing net to the netfilter hooks.

Signed-off-by: "Eric W. Biederman" 
---
 net/bridge/br_netfilter_hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index e6910b71af6e..c1127908e23a 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -346,6 +346,7 @@ static int br_nf_pre_routing_finish(struct sock *sk, struct 
sk_buff *skb)
 {
struct net_device *dev = skb->dev;
struct iphdr *iph = ip_hdr(skb);
+   struct net *net = dev_net(dev);
struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
struct rtable *rt;
int err;
@@ -371,7 +372,7 @@ static int br_nf_pre_routing_finish(struct sock *sk, struct 
sk_buff *skb)
if (err != -EHOSTUNREACH || !in_dev || 
IN_DEV_FORWARD(in_dev))
goto free_skb;
 
-   rt = ip_route_output(dev_net(dev), iph->daddr, 0,
+   rt = ip_route_output(net, iph->daddr, 0,
 RT_TOS(iph->tos), 0);
if (!IS_ERR(rt)) {
/* - Bridged-and-DNAT'ed traffic doesn't
-- 
2.2.1


[PATCH next 24/30] bridge: Pass net into br_nf_ip_fragment

2015-09-15 Thread Eric W. Biederman

This is prep work for passing struct net through ip_do_fragment and
later the netfilter okfn.  Doing this independently makes the later
code changes clearer.

Signed-off-by: "Eric W. Biederman" 
---
 net/bridge/br_netfilter_hooks.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 0a6f095bb0c9..971d45d24c64 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -695,18 +695,17 @@ static int br_nf_push_frag_xmit(struct sock *sk, struct 
sk_buff *skb)
 #endif
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
-static int br_nf_ip_fragment(struct sock *sk, struct sk_buff *skb,
-int (*output)(struct sock *, struct sk_buff *))
+static int
+br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
+ int (*output)(struct sock *, struct sk_buff *))
 {
unsigned int mtu = ip_skb_dst_mtu(skb);
struct iphdr *iph = ip_hdr(skb);
-   struct rtable *rt = skb_rtable(skb);
-   struct net_device *dev = rt->dst.dev;
 
if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 (IPCB(skb)->frag_max_size &&
  IPCB(skb)->frag_max_size > mtu))) {
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -726,6 +725,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct 
sk_buff *skb)
 {
struct nf_bridge_info *nf_bridge;
unsigned int mtu_reserved;
+   struct net *net = dev_net(skb_dst(skb)->dev);
 
mtu_reserved = nf_bridge_mtu_reduction(skb);
 
@@ -760,7 +760,7 @@ static int br_nf_dev_queue_xmit(struct sock *sk, struct 
sk_buff *skb)
skb_copy_from_linear_data_offset(skb, -data->size, data->mac,
 data->size);
 
-   return br_nf_ip_fragment(sk, skb, br_nf_push_frag_xmit);
+   return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit);
}
 #endif
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
-- 
2.2.1


[PATCH next 20/30] ipv6: Don't recompute net in ip6_rcv

2015-09-15 Thread Eric W. Biederman

Avoid silly redundant code

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv6/ip6_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index adba03ac7ce9..c628dba477d4 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -109,7 +109,7 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, 
struct packet_type *pt
if (hdr->version != 6)
goto err;
 
-   IP6_ADD_STATS_BH(dev_net(dev), idev,
+   IP6_ADD_STATS_BH(net, idev,
 IPSTATS_MIB_NOECTPKTS +
(ipv6_get_dsfield(hdr) & INET_ECN_MASK),
 max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
-- 
2.2.1


[PATCH next 11/30] ipv4: Only compute net once in ip_do_fragment

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_output.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9ee622ad8dfa..85b72d450184 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -531,9 +531,11 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
int offset;
__be16 not_last_frag;
struct rtable *rt = skb_rtable(skb);
+   struct net *net;
int err = 0;
 
dev = rt->dst.dev;
+   net = dev_net(dev);
 
/*
 *  Point into the IP datagram header.
@@ -626,7 +628,7 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
err = output(sk, skb);
 
if (!err)
-   IP_INC_STATS(dev_net(dev), 
IPSTATS_MIB_FRAGCREATES);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
if (err || !frag)
break;
 
@@ -636,7 +638,7 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
}
 
if (err == 0) {
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
return 0;
}
 
@@ -645,7 +647,7 @@ int ip_do_fragment(struct sock *sk, struct sk_buff *skb,
kfree_skb(frag);
frag = skb;
}
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
return err;
 
 slow_path_clean:
@@ -767,15 +769,15 @@ slow_path:
if (err)
goto fail;
 
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
}
consume_skb(skb);
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
return err;
 
 fail:
kfree_skb(skb);
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
return err;
 }
 EXPORT_SYMBOL(ip_do_fragment);
-- 
2.2.1


[PATCH next 28/30] netfilter: Pass struct net into the netfilter hooks

2015-09-15 Thread Eric W. Biederman

Pass a network namespace parameter into the netfilter hooks.  At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known, which allows the network namespace to
be determined easily and reliably.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
ipv6/ip6_output.c:ip6_xmit()                    sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in than the historic
"dev_net(in?in:out)".  I am documenting them in case something odd
pops up and someone starts trying to track down what happened.
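
The difference between guessing the namespace from a device and having the call site supply it can be sketched as below. Everything prefixed with example_ is hypothetical; the scenario where the two answers differ is constructed on purpose to mirror the listed exceptions, and in most paths both would agree.

```c
#include <stddef.h>

/* Hypothetical miniature stand-ins for struct net, net_device and nf_hook_state. */
struct example_net { int id; };
struct example_dev { struct example_net *net; };
struct example_state {
	struct example_dev *in;
	struct example_dev *out;
	struct example_net *net;	/* supplied by the call site */
};

/* Historic approach: guess from whichever device happens to be set. */
struct example_net *example_net_guess(const struct example_state *s)
{
	const struct example_dev *dev = s->in ? s->in : s->out;

	return dev->net;
}

/* New approach: use the namespace the call site passed in. */
struct example_net *example_net_explicit(const struct example_state *s)
{
	return s->net;
}

/* Builds a case where the call site knows better than the device,
 * e.g. a namespace taken from a socket; returns the id that is found. */
int example_lookup_id(int use_explicit)
{
	struct example_net dev_ns = { .id = 1 };
	struct example_net sock_ns = { .id = 2 };
	struct example_dev out = { .net = &dev_ns };
	struct example_state st = { .in = NULL, .out = &out, .net = &sock_ns };

	return use_explicit ? example_net_explicit(&st)->id
			    : example_net_guess(&st)->id;
}
```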

Signed-off-by: "Eric W. Biederman" 
---
 drivers/net/vrf.c |  7 ---
 include/linux/netfilter.h | 27 ---
 net/bridge/br_forward.c   | 13 +++--
 net/bridge/br_input.c | 13 +++--
 net/bridge/br_multicast.c |  4 ++--
 net/bridge/br_netfilter_hooks.c   | 15 ---
 net/bridge/br_netfilter_ipv6.c|  7 ---
 net/bridge/br_stp_bpdu.c  |  4 ++--
 net/decnet/dn_neigh.c | 15 +--
 net/decnet/dn_nsp_in.c|  4 ++--
 net/decnet/dn_route.c | 24 
 net/ipv4/arp.c| 10 ++
 net/ipv4/ip_forward.c |  5 +++--
 net/ipv4/ip_input.c   |  8 
 net/ipv4/ip_output.c  | 22 +-
 net/ipv4/ipmr.c   |  4 ++--
 net/ipv4/raw.c|  5 +++--
 net/ipv4/xfrm4_input.c|  4 ++--
 net/ipv4/xfrm4_output.c   |  6 --
 net/ipv6/ip6_input.c  |  8 
 net/ipv6/ip6_output.c | 15 ---
 net/ipv6/ip6mr.c  |  4 ++--
 net/ipv6/mcast.c  |  7 ---
 net/ipv6/ndisc.c  |  4 ++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |  2 +-
 net/ipv6/output_core.c|  6 --
 net/ipv6/raw.c|  2 +-
 net/ipv6/xfrm6_input.c|  4 ++--
 net/ipv6/xfrm6_output.c   |  6 --
 net/netfilter/ipvs/ip_vs_xmit.c   |  4 ++--
 net/xfrm/xfrm_output.c|  3 ++-
 31 files changed, 142 insertions(+), 120 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index e7094fbd7568..c82260341b72 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -298,14 +298,15 @@ err:
 static int vrf_output(struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
+   struct net *net = dev_net(dev);
 
-   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
-   return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, sk, skb,
-   NULL, dev,
+   return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
+   net, sk, skb, NULL, dev,
vrf_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
 }
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 042148dc1e22..295f2650b5dc 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -190,12 +190,11 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
return 1;
 }
 
-static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sock *sk,
- struct sk_buff *skb, struct net_device *indev,
- struct net_device *outdev,
+static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
+ struct sock *sk, struct sk_buff *skb,
+ struct net_device *indev, struct net_device *outdev,
  int (*okfn)(struct sock *, struct sk_buff *))
 {
-   struct net *net = dev_net(indev ? ind

[PATCH next 30/30] netfilter: Pass net into okfn

2015-09-15 Thread Eric W. Biederman

This is immediately motivated by the bridge code that chains functions that
call into netfilter.  Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done its job, passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn.  For the moment dst_output_okfn
just silently drops the struct net.
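
An adapter of the kind described (one that matches the new okfn prototype and simply ignores the namespace argument) might look like this in miniature. The example_ types and functions are hypothetical; example_output_okfn only mimics the role the message gives to dst_output_okfn.

```c
/* Hypothetical miniature stand-ins for struct net and sk_buff. */
struct example_net { int id; };
struct example_skb { int sent; };

/* New-style continuation prototype: the namespace comes first. */
typedef int (*example_okfn_t)(struct example_net *, struct example_skb *);

/* Existing output function that has no use for the namespace. */
int example_output(struct example_skb *skb)
{
	skb->sent++;
	return 0;
}

/* Adapter with the okfn prototype; the namespace is silently dropped. */
int example_output_okfn(struct example_net *net, struct example_skb *skb)
{
	(void)net;	/* intentionally unused for now */
	return example_output(skb);
}

/* Stand-in for a hook that accepts the packet and calls the continuation. */
int example_run_hook(struct example_net *net, struct example_skb *skb,
		     example_okfn_t okfn)
{
	return okfn(net, skb);
}

/* Sends one packet through the hook; returns how many times it was output. */
int example_demo_sent(void)
{
	struct example_net net = { .id = 1 };
	struct example_skb skb = { .sent = 0 };

	if (example_run_hook(&net, &skb, example_output_okfn) != 0)
		return -1;
	return skb.sent;
}
```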

Signed-off-by: "Eric W. Biederman" 
---
 drivers/net/vrf.c|  2 +-
 include/linux/netdevice.h|  2 +-
 include/linux/netfilter.h| 26 ++
 include/linux/netfilter_bridge.h |  2 +-
 include/net/dn_neigh.h   |  6 +++---
 include/net/dst.h|  4 
 include/net/ipv6.h   |  2 +-
 include/net/netfilter/br_netfilter.h |  2 +-
 net/bridge/br_forward.c  |  5 ++---
 net/bridge/br_input.c|  7 ---
 net/bridge/br_netfilter_hooks.c  | 21 +
 net/bridge/br_netfilter_ipv6.c   |  3 +--
 net/bridge/br_private.h  |  6 +++---
 net/bridge/br_stp_bpdu.c |  3 ++-
 net/core/dev.c   |  4 +++-
 net/decnet/dn_neigh.c|  8 
 net/decnet/dn_nsp_in.c   |  3 ++-
 net/decnet/dn_route.c|  6 +++---
 net/ipv4/arp.c   |  7 +++
 net/ipv4/ip_forward.c|  3 +--
 net/ipv4/ip_input.c  |  7 ++-
 net/ipv4/ip_output.c |  4 ++--
 net/ipv4/ipmr.c  |  4 ++--
 net/ipv4/raw.c   |  2 +-
 net/ipv4/xfrm4_input.c   |  3 ++-
 net/ipv4/xfrm4_output.c  |  2 +-
 net/ipv6/ip6_input.c |  5 ++---
 net/ipv6/ip6_output.c|  7 ---
 net/ipv6/ip6mr.c |  3 +--
 net/ipv6/mcast.c |  4 ++--
 net/ipv6/ndisc.c |  2 +-
 net/ipv6/output_core.c   |  2 +-
 net/ipv6/raw.c   |  2 +-
 net/ipv6/xfrm6_output.c  |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c  |  4 ++--
 net/netfilter/nf_queue.c |  2 +-
 net/xfrm/xfrm_output.c   | 12 ++--
 37 files changed, 95 insertions(+), 94 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index c82260341b72..4dd701d7b8e6 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -253,7 +253,7 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct 
net_device *dev)
 }
 
 /* modelled after ip_finish_output2 */
-static int vrf_finish_output(struct sock *sk, struct sk_buff *skb)
+static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
 {
struct dst_entry *dst = skb_dst(skb);
struct rtable *rt = (struct rtable *)dst;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97ab5c9a7069..b791405958b4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2212,7 +2212,7 @@ int dev_open(struct net_device *dev);
 int dev_close(struct net_device *dev);
 int dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
-int dev_loopback_xmit(struct sock *sk, struct sk_buff *newskb);
+int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff 
*newskb);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
 int register_netdevice(struct net_device *dev);
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 295f2650b5dc..0b4d4560f33d 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -56,7 +56,7 @@ struct nf_hook_state {
struct sock *sk;
struct net *net;
struct list_head *hook_list;
-   int (*okfn)(struct sock *, struct sk_buff *);
+   int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
 static inline void nf_hook_state_init(struct nf_hook_state *p,
@@ -67,7 +67,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
  struct net_device *outdev,
  struct sock *sk,
  struct net *net,
- int (*okfn)(struct sock *, struct sk_buff 
*))
+ int (*okfn)(struct net *, struct sock *, 
struct sk_buff *))
 {
p->hook = hook;
p->thresh = thresh;
@@ -175,7 +175,7 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int 
hook,
 struct sk_buff *skb,
 struct net_device *indev,
 struct net_device *outdev,
-

[PATCH next 16/30] ipv6: Only compute net once in ip6mr_forward2_finish

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv6/ip6mr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e95f6b6281de..3e3085b37a91 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1987,9 +1987,10 @@ int ip6mr_compat_ioctl(struct sock *sk, unsigned int 
cmd, void __user *arg)
 
 static inline int ip6mr_forward2_finish(struct sock *sk, struct sk_buff *skb)
 {
-   IP6_INC_STATS_BH(dev_net(skb_dst(skb)->dev), ip6_dst_idev(skb_dst(skb)),
+   struct net *net = dev_net(skb_dst(skb)->dev);
+   IP6_INC_STATS_BH(net, ip6_dst_idev(skb_dst(skb)),
 IPSTATS_MIB_OUTFORWDATAGRAMS);
-   IP6_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), ip6_dst_idev(skb_dst(skb)),
+   IP6_ADD_STATS_BH(net, ip6_dst_idev(skb_dst(skb)),
 IPSTATS_MIB_OUTOCTETS, skb->len);
return dst_output(sk, skb);
 }
-- 
2.2.1


[PATCH next 27/30] bridge: Add br_netif_receive_skb remove netif_receive_skb_sk

2015-09-15 Thread Eric W. Biederman

netif_receive_skb_sk is only called once in the bridge code; replace
it with a bridge specific function that calls netif_receive_skb.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netdevice.h | 6 +-
 net/bridge/br_input.c | 7 ++-
 net/core/dev.c| 4 ++--
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e664f87c8e4c..97ab5c9a7069 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2985,11 +2985,7 @@ static inline void dev_consume_skb_any(struct sk_buff 
*skb)
 
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
-int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb);
-static inline int netif_receive_skb(struct sk_buff *skb)
-{
-   return netif_receive_skb_sk(skb->sk, skb);
-}
+int netif_receive_skb(struct sk_buff *skb);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f921a5dce22d..2359c041e27c 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -26,6 +26,11 @@
 br_should_route_hook_t __rcu *br_should_route_hook __read_mostly;
 EXPORT_SYMBOL(br_should_route_hook);
 
+static int br_netif_receive_skb(struct sock *sk, struct sk_buff *skb)
+{
+   return netif_receive_skb(skb);
+}
+
 static int br_pass_frame_up(struct sk_buff *skb)
 {
struct net_device *indev, *brdev = BR_INPUT_SKB_CB(skb)->brdev;
@@ -57,7 +62,7 @@ static int br_pass_frame_up(struct sk_buff *skb)
 
return NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, NULL, skb,
   indev, NULL,
-  netif_receive_skb_sk);
+  br_netif_receive_skb);
 }
 
 static void br_do_proxy_arp(struct sk_buff *skb, struct net_bridge *br,
diff --git a/net/core/dev.c b/net/core/dev.c
index dcf9ff913925..7db9b012dfb7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3982,13 +3982,13 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
  * NET_RX_SUCCESS: no congestion
  * NET_RX_DROP: packet was dropped
  */
-int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb)
+int netif_receive_skb(struct sk_buff *skb)
 {
trace_netif_receive_skb_entry(skb);
 
return netif_receive_skb_internal(skb);
 }
-EXPORT_SYMBOL(netif_receive_skb_sk);
+EXPORT_SYMBOL(netif_receive_skb);
 
 /* Network device is going away, flush any packets still pending
  * Called with irqs disabled.
-- 
2.2.1
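
The wrapper pattern in this patch can be tried outside the kernel. The
sketch below is a userspace illustration only: the struct definitions,
okfn_t, and every mock_* name are stand-ins invented for the example,
not kernel code. It shows a two-argument callback that ignores sk and
forwards to a one-argument function, which is the shape
br_netif_receive_skb has in the diff above.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct sock { int id; };
struct sk_buff { int len; struct sock *sk; };

/* Callback shape the hook machinery expects in this series. */
typedef int (*okfn_t)(struct sock *, struct sk_buff *);

/* Stand-in for netif_receive_skb(): takes only the skb. */
static int mock_netif_receive_skb(struct sk_buff *skb)
{
	return skb->len;
}

/* Adapter: matches okfn_t, ignores sk, forwards the skb. */
static int mock_br_netif_receive_skb(struct sock *sk, struct sk_buff *skb)
{
	(void)sk;
	return mock_netif_receive_skb(skb);
}

/* Stand-in for NF_HOOK with no hooks registered: just calls okfn. */
static int mock_nf_hook(struct sock *sk, struct sk_buff *skb, okfn_t okfn)
{
	return okfn(sk, skb);
}

/* Pass a frame of the given length up through the mock hook. */
int pass_frame_up(int len)
{
	struct sk_buff skb = { .len = len, .sk = NULL };

	return mock_nf_hook(NULL, &skb, mock_br_netif_receive_skb);
}
```

Calling pass_frame_up(64) runs the adapter through the mock hook and
returns whatever the one-argument function returns for that skb. The
same shape applies to the *_finish helpers added elsewhere in the
series.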

[PATCH next 18/30] bridge: Introduce br_send_bpdu_finish

2015-09-15 Thread Eric W. Biederman

The function dev_queue_xmit_sk is unnecessary and very confusing.
Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_sk,
and have br_send_bpdu_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman" 
---
 net/bridge/br_stp_bpdu.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_stp_bpdu.c b/net/bridge/br_stp_bpdu.c
index 534fc4cd263e..3017a396cdef 100644
--- a/net/bridge/br_stp_bpdu.c
+++ b/net/bridge/br_stp_bpdu.c
@@ -30,6 +30,11 @@
 
 #define LLC_RESERVE sizeof(struct llc_pdu_un)
 
+static int br_send_bpdu_finish(struct sock *sk, struct sk_buff *skb)
+{
+   return dev_queue_xmit(skb);
+}
+
 static void br_send_bpdu(struct net_bridge_port *p,
 const unsigned char *data, int length)
 {
@@ -56,7 +61,7 @@ static void br_send_bpdu(struct net_bridge_port *p,
 
NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_OUT, NULL, skb,
NULL, skb->dev,
-   dev_queue_xmit_sk);
+   br_send_bpdu_finish);
 }
 
 static inline void br_set_ticks(unsigned char *dest, int j)
-- 
2.2.1

[PATCH next 13/30] ipv4: Only compute net once in ip_finish_output2

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 095754c99061..fc550e97daac 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -177,14 +177,15 @@ static int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
struct dst_entry *dst = skb_dst(skb);
struct rtable *rt = (struct rtable *)dst;
struct net_device *dev = dst->dev;
+   struct net *net = dev_net(dev);
unsigned int hh_len = LL_RESERVED_SPACE(dev);
struct neighbour *neigh;
u32 nexthop;
 
if (rt->rt_type == RTN_MULTICAST) {
-   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTMCAST, skb->len);
+   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
} else if (rt->rt_type == RTN_BROADCAST)
-   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTBCAST, skb->len);
+   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
 
/* Be paranoid, rather than too clever. */
if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
-- 
2.2.1

[PATCH next 22/30] ipv6: Cache net in ip6_output

2015-09-15 Thread Eric W. Biederman

Keep net in a local variable so I can use it in NF_HOOK_COND
when I pass struct net to all of the netfilter hooks.

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv6/ip6_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 12d0166a64cd..8cab909b181e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -135,9 +135,9 @@ int ip6_output(struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+   struct net *net = dev_net(dev);
if (unlikely(idev->cnf.disable_ipv6)) {
-   IP6_INC_STATS(dev_net(dev), idev,
- IPSTATS_MIB_OUTDISCARDS);
+   IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
kfree_skb(skb);
return 0;
}
-- 
2.2.1

[PATCH next 17/30] arp: Introduce arp_xmit_finish

2015-09-15 Thread Eric W. Biederman

The function dev_queue_xmit_sk is unnecessary and very confusing.
Introduce arp_xmit_finish to remove the need for dev_queue_xmit_sk,
and have arp_xmit_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/arp.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 30409b75e925..3632e98eb0f9 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -621,6 +621,11 @@ out:
 }
 EXPORT_SYMBOL(arp_create);
 
+static int arp_xmit_finish(struct sock *sk, struct sk_buff *skb)
+{
+   return dev_queue_xmit(skb);
+}
+
 /*
  * Send an arp packet.
  */
@@ -628,7 +633,7 @@ void arp_xmit(struct sk_buff *skb)
 {
/* Send it off, maybe filter it using firewalling first.  */
NF_HOOK(NFPROTO_ARP, NF_ARP_OUT, NULL, skb,
-   NULL, skb->dev, dev_queue_xmit_sk);
+   NULL, skb->dev, arp_xmit_finish);
 }
 EXPORT_SYMBOL(arp_xmit);
 
-- 
2.2.1

[PATCH next 15/30] ipv4: Only compute net once in ipmr_forward_finish

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ipmr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 075bc695ae34..dfe4e8ec6c3a 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1681,9 +1681,10 @@ static void ip_encap(struct net *net, struct sk_buff *skb,
 static inline int ipmr_forward_finish(struct sock *sk, struct sk_buff *skb)
 {
struct ip_options *opt = &(IPCB(skb)->opt);
+   struct net *net = dev_net(skb_dst(skb)->dev);
 
-   IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
-   IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
+   IP_ADD_STATS_BH(net, IPSTATS_MIB_OUTOCTETS, skb->len);
 
if (unlikely(opt->optlen))
ip_forward_options(skb);
-- 
2.2.1

[PATCH next 29/30] netfilter: Use nf_hook_state.net

2015-09-15 Thread Eric W. Biederman

Instead of saying "net = dev_net(state->in?state->in:state->out)"
just say "state->net", as that information is now available and is
much less confusing and much less error prone.

Signed-off-by: "Eric W. Biederman" 
---
 net/bridge/netfilter/ebtable_filter.c  | 4 ++--
 net/bridge/netfilter/ebtable_nat.c | 4 ++--
 net/ipv4/netfilter/arptable_filter.c   | 4 +---
 net/ipv4/netfilter/ip_tables.c | 8 
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 2 +-
 net/ipv4/netfilter/ipt_SYNPROXY.c  | 2 +-
 net/ipv4/netfilter/iptable_filter.c| 6 ++
 net/ipv4/netfilter/iptable_mangle.c| 7 +++
 net/ipv4/netfilter/iptable_nat.c   | 5 ++---
 net/ipv4/netfilter/iptable_raw.c   | 6 ++
 net/ipv4/netfilter/iptable_security.c  | 5 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 4 ++--
 net/ipv6/netfilter/ip6_tables.c| 8 
 net/ipv6/netfilter/ip6t_SYNPROXY.c | 2 +-
 net/ipv6/netfilter/ip6table_filter.c   | 5 ++---
 net/ipv6/netfilter/ip6table_mangle.c   | 6 +++---
 net/ipv6/netfilter/ip6table_nat.c  | 5 ++---
 net/ipv6/netfilter/ip6table_raw.c  | 5 ++---
 net/ipv6/netfilter/ip6table_security.c | 4 +---
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 4 ++--
 net/netfilter/nfnetlink_queue_core.c   | 3 +--
 21 files changed, 41 insertions(+), 58 deletions(-)

diff --git a/net/bridge/netfilter/ebtable_filter.c b/net/bridge/netfilter/ebtable_filter.c
index 8a3f63b2e807..ab20d6ed6e2f 100644
--- a/net/bridge/netfilter/ebtable_filter.c
+++ b/net/bridge/netfilter/ebtable_filter.c
@@ -61,7 +61,7 @@ ebt_in_hook(const struct nf_hook_ops *ops, struct sk_buff *skb,
const struct nf_hook_state *state)
 {
return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-   dev_net(state->in)->xt.frame_filter);
+   state->net->xt.frame_filter);
 }
 
 static unsigned int
@@ -69,7 +69,7 @@ ebt_out_hook(const struct nf_hook_ops *ops, struct sk_buff *skb,
 const struct nf_hook_state *state)
 {
return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-   dev_net(state->out)->xt.frame_filter);
+   state->net->xt.frame_filter);
 }
 
 static struct nf_hook_ops ebt_ops_filter[] __read_mostly = {
diff --git a/net/bridge/netfilter/ebtable_nat.c b/net/bridge/netfilter/ebtable_nat.c
index c5ef5b1ab678..ad81a5a65644 100644
--- a/net/bridge/netfilter/ebtable_nat.c
+++ b/net/bridge/netfilter/ebtable_nat.c
@@ -61,7 +61,7 @@ ebt_nat_in(const struct nf_hook_ops *ops, struct sk_buff *skb,
   const struct nf_hook_state *state)
 {
return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-   dev_net(state->in)->xt.frame_nat);
+   state->net->xt.frame_nat);
 }
 
 static unsigned int
@@ -69,7 +69,7 @@ ebt_nat_out(const struct nf_hook_ops *ops, struct sk_buff *skb,
const struct nf_hook_state *state)
 {
return ebt_do_table(ops->hooknum, skb, state->in, state->out,
-   dev_net(state->out)->xt.frame_nat);
+   state->net->xt.frame_nat);
 }
 
 static struct nf_hook_ops ebt_ops_nat[] __read_mostly = {
diff --git a/net/ipv4/netfilter/arptable_filter.c b/net/ipv4/netfilter/arptable_filter.c
index 93876d03120c..d217e4c19645 100644
--- a/net/ipv4/netfilter/arptable_filter.c
+++ b/net/ipv4/netfilter/arptable_filter.c
@@ -30,10 +30,8 @@ static unsigned int
 arptable_filter_hook(const struct nf_hook_ops *ops, struct sk_buff *skb,
 const struct nf_hook_state *state)
 {
-   const struct net *net = dev_net(state->in ? state->in : state->out);
-
return arpt_do_table(skb, ops->hooknum, state,
-net->ipv4.arptable_filter);
+state->net->ipv4.arptable_filter);
 }
 
 static struct nf_hook_ops *arpfilter_ops __read_mostly;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index b0a86e73451c..5d514eac4c31 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -246,7 +246,8 @@ get_chainname_rulenum(const struct ipt_entry *s, const struct ipt_entry *e,
return 0;
 }
 
-static void trace_packet(const struct sk_buff *skb,
+static void trace_packet(struct net *net,
+const struct sk_buff *skb,
 unsigned int hook,
 const struct net_device *in,
 const struct net_device *out,
@@ -258,7 +259,6 @@ static void trace_packet(const struct sk_buff *skb,
const char *hookname, *chainname, *comment;
const struct ipt_entry *iter;
unsigned int rulenum = 0;
-   struct net *net = dev_net(in ? in : out);
 
root = get_entry(private->entries, pr
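
The difference between deriving net from the devices and reading it
from the hook state can be sketched in userspace. Everything below is
an assumption made for illustration: the struct layouts, mock_dev_net,
and the helper names are invented stand-ins, not the kernel's
definitions. The point is only that a state struct which carries net
removes the need for the "in ? in : out" choice at every use site.

```c
#include <stddef.h>

/* Userspace stand-ins; not the kernel definitions. */
struct net { int id; };
struct net_device { struct net *nd_net; };

/* Simplified hook state that carries net alongside the devices. */
struct hook_state {
	struct net_device *in;
	struct net_device *out;
	struct net *net;
};

/* Stand-in for dev_net(): the namespace a device belongs to. */
static struct net *mock_dev_net(const struct net_device *dev)
{
	return dev->nd_net;
}

/* Fill in the state once, where the hook is entered. */
static void hook_state_init(struct hook_state *s, struct net_device *in,
			    struct net_device *out, struct net *net)
{
	s->in = in;
	s->out = out;
	s->net = net;
}

/* Old style: pick whichever device is present, then look up net. */
static struct net *net_from_devices(const struct hook_state *s)
{
	return mock_dev_net(s->in ? s->in : s->out);
}

/* New style: the state already knows. */
static struct net *net_from_state(const struct hook_state *s)
{
	return s->net;
}

/* Output-only hook (in == NULL): both lookups give the same net. */
int lookups_agree_on_output_hook(void)
{
	struct net ns = { .id = 1 };
	struct net_device out = { .nd_net = &ns };
	struct hook_state s;

	hook_state_init(&s, NULL, &out, &ns);
	return net_from_devices(&s) == net_from_state(&s);
}
```

In this mock both lookups return the same pointer; the state-based one
simply does not depend on which device pointer happens to be set.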

[PATCH next 14/30] ipv4: Only compute net once in ip_rcv_finish

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_input.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index ff908863f22e..cc242b9501d9 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -314,6 +314,7 @@ EXPORT_SYMBOL(sysctl_ip_early_demux);
 static int ip_rcv_finish(struct sock *sk, struct sk_buff *skb)
 {
const struct iphdr *iph = ip_hdr(skb);
+   struct net *net = dev_net(skb->dev);
struct rtable *rt;
 
if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) {
@@ -337,8 +338,7 @@ static int ip_rcv_finish(struct sock *sk, struct sk_buff *skb)
   iph->tos, skb->dev);
if (unlikely(err)) {
if (err == -EXDEV)
-   NET_INC_STATS_BH(dev_net(skb->dev),
-LINUX_MIB_IPRPFILTER);
+   NET_INC_STATS_BH(net, LINUX_MIB_IPRPFILTER);
goto drop;
}
}
@@ -359,11 +359,9 @@ static int ip_rcv_finish(struct sock *sk, struct sk_buff *skb)
 
rt = skb_rtable(skb);
if (rt->rt_type == RTN_MULTICAST) {
-   IP_UPD_PO_STATS_BH(dev_net(rt->dst.dev), IPSTATS_MIB_INMCAST,
-   skb->len);
+   IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_INMCAST, skb->len);
} else if (rt->rt_type == RTN_BROADCAST)
-   IP_UPD_PO_STATS_BH(dev_net(rt->dst.dev), IPSTATS_MIB_INBCAST,
-   skb->len);
+   IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_INBCAST, skb->len);
 
return dst_input(skb);
 
-- 
2.2.1

[PATCH next 12/30] ipv4: Explicitly compute net in ip_fragment

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_output.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 85b72d450184..095754c99061 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -500,10 +500,9 @@ static int ip_fragment(struct sock *sk, struct sk_buff *skb,
if (unlikely(!skb->ignore_df ||
 (IPCB(skb)->frag_max_size &&
  IPCB(skb)->frag_max_size > mtu))) {
-   struct rtable *rt = skb_rtable(skb);
-   struct net_device *dev = rt->dst.dev;
+   struct net *net = dev_net(skb_rtable(skb)->dst.dev);
 
-   IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(mtu));
kfree_skb(skb);
-- 
2.2.1

[PATCH next 08/30] ipv4: Compute net once in ip_rcv

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_input.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index f4fc8a77aaa7..ff908863f22e 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -378,6 +378,7 @@ drop:
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
 {
const struct iphdr *iph;
+   struct net *net;
u32 len;
 
/* When the interface is in promisc. mode, drop all the crap
@@ -387,11 +388,12 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
goto drop;
 
 
-   IP_UPD_PO_STATS_BH(dev_net(dev), IPSTATS_MIB_IN, skb->len);
+   net = dev_net(dev);
+   IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_IN, skb->len);
 
skb = skb_share_check(skb, GFP_ATOMIC);
if (!skb) {
-   IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_INDISCARDS);
goto out;
}
 
@@ -417,7 +419,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
BUILD_BUG_ON(IPSTATS_MIB_ECT1PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_1);
BUILD_BUG_ON(IPSTATS_MIB_ECT0PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_0);
BUILD_BUG_ON(IPSTATS_MIB_CEPKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_CE);
-   IP_ADD_STATS_BH(dev_net(dev),
+   IP_ADD_STATS_BH(net,
IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
@@ -431,7 +433,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
len = ntohs(iph->tot_len);
if (skb->len < len) {
-   IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_INTRUNCATEDPKTS);
goto drop;
} else if (len < (iph->ihl*4))
goto inhdr_error;
@@ -441,7 +443,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 * Note this now means skb->len holds ntohs(iph->tot_len).
 */
if (pskb_trim_rcsum(skb, len)) {
-   IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_INDISCARDS);
goto drop;
}
 
@@ -458,9 +460,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
   ip_rcv_finish);
 
 csum_error:
-   IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_CSUMERRORS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-   IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_INHDRERRORS);
 drop:
kfree_skb(skb);
 out:
-- 
2.2.1

[PATCH next 03/30] netfilter: Pass net to nf_hook_thresh

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netfilter.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 889ac0e11f01..042148dc1e22 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -170,6 +170,7 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state);
  * value indicates the packet has been consumed by the hook.
  */
 static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
+struct net *net,
 struct sock *sk,
 struct sk_buff *skb,
 struct net_device *indev,
@@ -177,7 +178,6 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 int (*okfn)(struct sock *, struct sk_buff *),
 int thresh)
 {
-   struct net *net = dev_net(indev ? indev : outdev);
struct list_head *hook_list = &net->nf.hooks[pf][hook];
 
if (nf_hook_list_active(hook_list, pf, hook)) {
@@ -195,7 +195,8 @@ static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sock *sk,
  struct net_device *outdev,
  int (*okfn)(struct sock *, struct sk_buff *))
 {
-   return nf_hook_thresh(pf, hook, sk, skb, indev, outdev, okfn, INT_MIN);
+   struct net *net = dev_net(indev ? indev : outdev);
+   return nf_hook_thresh(pf, hook, net, sk, skb, indev, outdev, okfn, INT_MIN);
 }

 /* Activate hook; either okfn or kfree_skb called, unless a hook
@@ -221,7 +222,8 @@ NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sock *sk,
   struct net_device *out,
   int (*okfn)(struct sock *, struct sk_buff *), int thresh)
 {
-   int ret = nf_hook_thresh(pf, hook, sk, skb, in, out, okfn, thresh);
+   struct net *net = dev_net(in ? in : out);
+   int ret = nf_hook_thresh(pf, hook, net, sk, skb, in, out, okfn, thresh);
if (ret == 1)
ret = okfn(sk, skb);
return ret;
@@ -232,10 +234,11 @@ NF_HOOK_COND(uint8_t pf, unsigned int hook, struct sock *sk,
 struct sk_buff *skb, struct net_device *in, struct net_device *out,
 int (*okfn)(struct sock *, struct sk_buff *), bool cond)
 {
+   struct net *net = dev_net(in ? in : out);
int ret;
 
if (!cond ||
-   ((ret = nf_hook_thresh(pf, hook, sk, skb, in, out, okfn, INT_MIN)) == 1))
+   ((ret = nf_hook_thresh(pf, hook, net, sk, skb, in, out, okfn, INT_MIN)) == 1))
ret = okfn(sk, skb);
return ret;
 }
-- 
2.2.1

[PATCH next 02/30] netfilter: Store net in nf_hook_state

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netfilter.h | 5 -
 include/linux/netfilter_ingress.h | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 1abac85ec907..889ac0e11f01 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -54,6 +54,7 @@ struct nf_hook_state {
struct net_device *in;
struct net_device *out;
struct sock *sk;
+   struct net *net;
struct list_head *hook_list;
int (*okfn)(struct sock *, struct sk_buff *);
 };
@@ -65,6 +66,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
  struct net_device *indev,
  struct net_device *outdev,
  struct sock *sk,
+ struct net *net,
  int (*okfn)(struct sock *, struct sk_buff *))
 {
p->hook = hook;
@@ -73,6 +75,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
p->in = indev;
p->out = outdev;
p->sk = sk;
+   p->net = net;
p->hook_list = hook_list;
p->okfn = okfn;
 }
@@ -181,7 +184,7 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
struct nf_hook_state state;
 
nf_hook_state_init(&state, hook_list, hook, thresh,
-  pf, indev, outdev, sk, okfn);
+  pf, indev, outdev, sk, net, okfn);
return nf_hook_slow(skb, &state);
}
return 1;
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index cb0727fe2b3d..187feabe557c 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -17,7 +17,7 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 
nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
   NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV, NULL,
-  skb->dev, NULL, NULL);
+  skb->dev, NULL, dev_net(skb->dev), NULL);
return nf_hook_slow(skb, &state);
 }
 
-- 
2.2.1

[PATCH next 05/30] net: Merge dst_output and dst_output_sk

2015-09-15 Thread Eric W. Biederman

Add a sock parameter to dst_output, making dst_output_sk superfluous.
Add a skb->sk argument to all of the callers of dst_output.
Have the callers of dst_output_sk call dst_output.

Signed-off-by: "Eric W. Biederman" 
---
 include/net/dst.h   | 6 +-
 net/decnet/dn_nsp_out.c | 4 ++--
 net/ipv4/ip_forward.c   | 2 +-
 net/ipv4/ip_output.c| 6 +++---
 net/ipv4/ip_vti.c   | 2 +-
 net/ipv4/ipmr.c | 2 +-
 net/ipv4/raw.c  | 2 +-
 net/ipv4/xfrm4_output.c | 2 +-
 net/ipv6/ip6_output.c   | 4 ++--
 net/ipv6/ip6_vti.c  | 2 +-
 net/ipv6/ip6mr.c| 2 +-
 net/ipv6/mcast.c| 4 ++--
 net/ipv6/ndisc.c| 2 +-
 net/ipv6/output_core.c  | 4 ++--
 net/ipv6/raw.c  | 2 +-
 net/ipv6/xfrm6_output.c | 2 +-
 net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
 net/xfrm/xfrm_output.c  | 2 +-
 net/xfrm/xfrm_policy.c  | 2 +-
 19 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 9261d928303d..c72e58474e52 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -454,14 +454,10 @@ static inline void dst_set_expires(struct dst_entry *dst, int timeout)
 }
 
 /* Output packet to network from transport.  */
-static inline int dst_output_sk(struct sock *sk, struct sk_buff *skb)
+static inline int dst_output(struct sock *sk, struct sk_buff *skb)
 {
return skb_dst(skb)->output(sk, skb);
 }
-static inline int dst_output(struct sk_buff *skb)
-{
-   return dst_output_sk(skb->sk, skb);
-}
 
 /* Input packet from network to transport.  */
 static inline int dst_input(struct sk_buff *skb)
diff --git a/net/decnet/dn_nsp_out.c b/net/decnet/dn_nsp_out.c
index 1aaa51ebbda6..4b02dd300f50 100644
--- a/net/decnet/dn_nsp_out.c
+++ b/net/decnet/dn_nsp_out.c
@@ -85,7 +85,7 @@ static void dn_nsp_send(struct sk_buff *skb)
if (dst) {
 try_again:
skb_dst_set(skb, dst);
-   dst_output(skb);
+   dst_output(skb->sk, skb);
return;
}
 
@@ -582,7 +582,7 @@ static __inline__ void dn_nsp_do_disc(struct sock *sk, unsigned char msgflg,
 * associations.
 */
skb_dst_set(skb, dst_clone(dst));
-   dst_output(skb);
+   dst_output(skb->sk, skb);
 }
 
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 2d3aa408fbdc..28fb90108f56 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -72,7 +72,7 @@ static int ip_forward_finish(struct sock *sk, struct sk_buff *skb)
ip_forward_options(skb);
 
skb_sender_cpu_clear(skb);
-   return dst_output_sk(sk, skb);
+   return dst_output(sk, skb);
 }
 
 int ip_forward(struct sk_buff *skb)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0138fada0951..f076f11aa94a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -102,7 +102,7 @@ static int __ip_local_out_sk(struct sock *sk, struct sk_buff *skb)
iph->tot_len = htons(skb->len);
ip_send_check(iph);
return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, sk, skb, NULL,
-  skb_dst(skb)->dev, dst_output_sk);
+  skb_dst(skb)->dev, dst_output);
 }
 
 int __ip_local_out(struct sk_buff *skb)
@@ -116,7 +116,7 @@ int ip_local_out_sk(struct sock *sk, struct sk_buff *skb)
 
err = __ip_local_out(skb);
if (likely(err == 1))
-   err = dst_output_sk(sk, skb);
+   err = dst_output(sk, skb);
 
return err;
 }
@@ -271,7 +271,7 @@ static int ip_finish_output(struct sock *sk, struct sk_buff *skb)
/* Policy lookup after SNAT yielded a new policy */
if (skb_dst(skb)->xfrm) {
IPCB(skb)->flags |= IPSKB_REROUTED;
-   return dst_output_sk(sk, skb);
+   return dst_output(sk, skb);
}
 #endif
mtu = ip_skb_dst_mtu(skb);
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 0c152087ca15..3b87ec5178f9 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -197,7 +197,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct net_device *dev,
skb_dst_set(skb, dst);
skb->dev = skb_dst(skb)->dev;
 
-   err = dst_output(skb);
+   err = dst_output(skb->sk, skb);
if (net_xmit_eval(err) == 0)
err = skb->len;
iptunnel_xmit_stats(err, &dev->stats, dev->tstats);
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 866ee89f5254..a0a5def920fc 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1688,7 +1688,7 @@ static inline int ipmr_forward_finish(struct sock *sk, struct sk_buff *skb)
if (unlikely(opt->optlen))
ip_forward_options(skb);
 
-   return dst_output_sk(sk, skb);
+   return dst_output(sk, skb);
 }
 
 /*
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 561cd4b8fc6e..09ab5bb6913a 100644
--- a/net/ipv4/raw.c
+++ b/n
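
The effect of the merge on the two kinds of caller can be sketched in
userspace. The types and names below (mock_dst_output, report_sock_id,
and the simplified structs) are assumptions made for the example and
are not the kernel definitions. Former dst_output(skb) callers pass
skb->sk explicitly, while former dst_output_sk(sk, skb) callers keep
passing the sk they were given.

```c
#include <stddef.h>

/* Userspace stand-ins; not the kernel definitions. */
struct sock { int id; };
struct sk_buff;

struct dst_entry {
	int (*output)(struct sock *, struct sk_buff *);
};

struct sk_buff {
	struct sock *sk;
	struct dst_entry *dst;
};

/* Merged form: one function, socket passed explicitly. */
static int mock_dst_output(struct sock *sk, struct sk_buff *skb)
{
	return skb->dst->output(sk, skb);
}

/* Hypothetical output method reporting which socket it was handed. */
static int report_sock_id(struct sock *sk, struct sk_buff *skb)
{
	(void)skb;
	return sk ? sk->id : -1;
}

/* A caller of the old one-argument form now passes skb->sk itself. */
int send_with_skb_sock(int id)
{
	struct sock sk = { .id = id };
	struct dst_entry dst = { .output = report_sock_id };
	struct sk_buff skb = { .sk = &sk, .dst = &dst };

	return mock_dst_output(skb.sk, &skb);
}

/* A caller of the old _sk form keeps passing its own sk, maybe NULL. */
int send_with_explicit_sock(struct sock *sk)
{
	struct dst_entry dst = { .output = report_sock_id };
	struct sk_buff skb = { .sk = NULL, .dst = &dst };

	return mock_dst_output(sk, &skb);
}
```

In the mock, the output method sees exactly the socket the caller
chose, which is the behaviour both old entry points had between them.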

[PATCH next 09/30] ipv4: Remember the net in ip_output and ip_mc_output

2015-09-15 Thread Eric W. Biederman

This is a preparatory patch for passing net into the netfilter hooks,
where net will be used again.

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_output.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index f076f11aa94a..9ee622ad8dfa 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -288,11 +288,12 @@ int ip_mc_output(struct sock *sk, struct sk_buff *skb)
 {
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
+   struct net *net = dev_net(dev);
 
/*
 *  If the indicated interface is up and running, send the packet.
 */
-   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
@@ -347,8 +348,9 @@ int ip_mc_output(struct sock *sk, struct sk_buff *skb)
 int ip_output(struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
+   struct net *net = dev_net(dev);
 
-   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
-- 
2.2.1

[PATCH next 07/30] ipv4: Compute net once in ip_forward_finish

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_forward.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index ba2f66b3b3f6..95235c813f18 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -63,10 +63,11 @@ static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 
 static int ip_forward_finish(struct sock *sk, struct sk_buff *skb)
 {
+   struct net *net = dev_net(skb_dst(skb)->dev);
struct ip_options *opt  = &(IPCB(skb)->opt);
 
-   IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
-   IP_ADD_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_OUTOCTETS, skb->len);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
+   IP_ADD_STATS_BH(net, IPSTATS_MIB_OUTOCTETS, skb->len);
 
if (unlikely(opt->optlen))
ip_forward_options(skb);
-- 
2.2.1

[PATCH next 06/30] ipv4: Compute net once in ip_forward

2015-09-15 Thread Eric W. Biederman

Compute struct net from the input device in ip_forward before it is
used.

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ip_forward.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 28fb90108f56..ba2f66b3b3f6 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -81,6 +81,7 @@ int ip_forward(struct sk_buff *skb)
struct iphdr *iph;  /* Our header */
struct rtable *rt;  /* Route we use */
struct ip_options *opt  = &(IPCB(skb)->opt);
+   struct net *net;
 
/* that should never happen */
if (skb->pkt_type != PACKET_HOST)
@@ -99,6 +100,7 @@ int ip_forward(struct sk_buff *skb)
return NET_RX_SUCCESS;
 
skb_forward_csum(skb);
+   net = dev_net(skb->dev);
 
/*
 *  According to the RFC, we must first decrease the TTL field. If
@@ -119,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
IPCB(skb)->flags |= IPSKB_FORWARDED;
mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
if (ip_exceeds_mtu(skb, mtu)) {
-   IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
+   IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(mtu));
goto drop;
@@ -155,7 +157,7 @@ sr_failed:
 
 too_many_hops:
/* Tell the sender its packet died... */
-   IP_INC_STATS_BH(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_INHDRERRORS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_INHDRERRORS);
icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 drop:
kfree_skb(skb);
-- 
2.2.1

[PATCH next 10/30] ipv4: Don't recompute net in ipmr_queue_xmit

2015-09-15 Thread Eric W. Biederman

Calling dev_net(dev) when net is already passed in is just silly.

Signed-off-by: "Eric W. Biederman" 
---
 net/ipv4/ipmr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a0a5def920fc..075bc695ae34 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1745,7 +1745,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 * to blackhole.
 */
 
-   IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+   IP_INC_STATS_BH(net, IPSTATS_MIB_FRAGFAILS);
ip_rt_put(rt);
goto out_free;
}
-- 
2.2.1

[PATCH next 04/30] xfrm: Remove unused afinfo method init_dst

2015-09-15 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 include/net/xfrm.h | 2 --
 net/xfrm/xfrm_policy.c | 2 --
 2 files changed, 4 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 312e3fee9ccf..fd176106909a 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -296,8 +296,6 @@ struct xfrm_policy_afinfo {
  struct flowi *fl,
  int reverse);
int (*get_tos)(const struct flowi *fl);
-   void(*init_dst)(struct net *net,
-   struct xfrm_dst *dst);
int (*init_path)(struct xfrm_dst *path,
 struct dst_entry *dst,
 int nfheader_len);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 94af3d065785..6b5d6e2b9a49 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1583,8 +1583,6 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net 
*net, int family)
 
memset(dst + 1, 0, sizeof(*xdst) - sizeof(*dst));
xdst->flo.ops = &xfrm_bundle_fc_ops;
-   if (afinfo->init_dst)
-   afinfo->init_dst(net, xdst);
} else
xdst = ERR_PTR(-ENOBUFS);
 
-- 
2.2.1


[PATCH next 01/30] netfilter: Remove !CONFIG_NETFITLER definition of nf_hook_thresh

2015-09-15 Thread Eric W. Biederman

The !CONFIG_NETFILTER definition of nf_hook_thresh calls okfn when
the CONFIG_NETFILTER definition does not, making it buggy.

As the !CONFIG_NETFILTER definition of nf_hook_thresh is not used, remove
it rather than fix it.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netfilter.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 36a652531791..1abac85ec907 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -344,15 +344,6 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi 
*fl, u_int8_t family)
 #else /* !CONFIG_NETFILTER */
 #define NF_HOOK(pf, hook, sk, skb, indev, outdev, okfn) (okfn)(sk, skb)
 #define NF_HOOK_COND(pf, hook, sk, skb, indev, outdev, okfn, cond) (okfn)(sk, 
skb)
-static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
-struct sock *sk,
-struct sk_buff *skb,
-struct net_device *indev,
-struct net_device *outdev,
-int (*okfn)(struct sock *sk, struct sk_buff 
*), int thresh)
-{
-   return okfn(sk, skb);
-}
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sock *sk,
  struct sk_buff *skb, struct net_device *indev,
  struct net_device *outdev,
-- 
2.2.1


[PATCH next 0/30] Passing net through the netfilter hooks

2015-09-15 Thread Eric W. Biederman


My primary goal with this patchset and its follow-ups is to clean up the
network routing paths so that we do not look at the output device to
derive the network namespace.  My plan is to pass the network namespace
of the transmitting socket through the output path, to replace code that
looks at the output network device today.  Once that is done we can have
routes with output devices outside of the current network namespace,
which should allow reception and transmission of packets in network
namespaces to be as fast as normal packet reception and transmission
with early demux disabled, because it will be the same code path.

Once skb_dst(skb)->dev is a little better under control I think it will
also be possible to use RCU to clean up the ancient hack that sets
dst->dev to loopback_dev when a network device is removed.

The work to get there is a series of code cleanups.  I am starting with
passing net into the netfilter hooks and into the functions that are
called after the netfilter hooks.  This removes from netfilter the
need to guess which network namespace it is working on.

To get there I perform a series of minor prep patches so the big changes
at the end are possible to audit without getting lost in the noise.  In
particular I have a lot of patches computing net into a local variable
and then using it throughout the function.
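
As a rough illustration of the "compute net into a local variable" prep
patches described above, here is a minimal user-space sketch.  The types
and helpers (struct net, struct net_device, dev_net, inc_stat, the
lookups counter) are simplified stand-ins invented for this example and
are not the kernel definitions; only the before/after shape is the point.

```c
/* Hypothetical stand-ins for the kernel types; only the shape matters. */
struct net { int id; };
struct net_device { struct net *nd_net; };

static int lookups; /* counts how often the namespace is derived */

/* Simplified analogue of the kernel's dev_net() helper. */
static struct net *dev_net(const struct net_device *dev)
{
    lookups++;
    return dev->nd_net;
}

/* Illustrative statistics bump that needs to know the namespace. */
static void inc_stat(struct net *net, int *counter)
{
    (void)net;
    (*counter)++;
}

/* Before: the namespace is re-derived from the device at every use. */
static void forward_before(const struct net_device *dev, int *stats)
{
    inc_stat(dev_net(dev), &stats[0]);
    inc_stat(dev_net(dev), &stats[1]);
}

/* After: derive it once, then reuse the local variable. */
static void forward_after(const struct net_device *dev, int *stats)
{
    struct net *net = dev_net(dev);

    inc_stat(net, &stats[0]);
    inc_stat(net, &stats[1]);
}
```

Once every use goes through the local, a later patch can swap the single
derivation for a net argument passed in by the caller without touching
the rest of the function body.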

So this patchset encompasses removing dead code; sorting out the _sk
functions that were added the last time someone pushed a prototype change
through the post-netfilter functions; cleaning up individual functions'
use of the network namespace; passing net into the netfilter hooks;
passing net into the post-netfilter functions; and using state->net in
the netfilter code where it is available and trivially usable.

Pablo, Dave, I don't know whose tree this makes more sense to go
through.  I am assuming, at least initially, Pablo's, as netfilter is
involved.  From what I have seen there will be a lot of back and forth
between the netfilter code paths and the routing code paths.

The patches are also available (against 4.3-rc1) at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next.git master

Eric W. Biederman (30):
  netfilter: Remove !CONFIG_NETFITLER definition of nf_hook_thresh
  netfilter: Store net in nf_hook_state
  netfilter: Pass net to nf_hook_thresh
  xfrm: Remove unused afinfo method init_dst
  net: Merge dst_output and dst_output_sk
  ipv4: Compute net once in ip_forward
  ipv4: Compute net once in ip_forward_finish
  ipv4: Compute net once in ip_rcv
  ipv4: Remember the net in ip_output and ip_mc_output
  ipv4: Don't recompute net in ipmr_queue_xmit
  ipv4: Only compute net once in ip_do_fragment
  ipv4: Explicitly compute net in ip_fragment
  ipv4: Only compute net once in ip_finish_output2
  ipv4: Only compute net once in ip_rcv_finish
  ipv4: Only compute net once in ipmr_forward_finish
  ipv6: Only compute net once in ip6mr_forward2_finish
  arp: Introduce arp_xmit_finish
  bridge: Introduce br_send_bpdu_finish
  net: Remove dev_queue_xmit_sk
  ipv6: Don't recompute net in ip6_rcv
  ipv6: Only compute net once in ip6_finish_output2
  ipv6: Cache net in ip6_output
  ipv6: Compute net once in raw6_send_hdrinc
  bridge: Pass net into br_nf_ip_fragment
  bridge: Pass net into br_nf_push_frag_xmit
  bridge: Cache net in br_nf_pre_routing_finish
  bridge: Add br_netif_receive_skb remove netif_receive_skb_sk
  netfilter: Pass struct net into the netfilter hooks
  netfilter: Use nf_hook_state.net
  netfilter: Pass net into okfn

 drivers/net/vrf.c  |  9 ++--
 include/linux/netdevice.h  | 14 ++
 include/linux/netfilter.h  | 68 --
 include/linux/netfilter_bridge.h   |  2 +-
 include/linux/netfilter_ingress.h  |  2 +-
 include/net/dn_neigh.h |  6 +--
 include/net/dst.h  |  6 +--
 include/net/ipv6.h |  2 +-
 include/net/netfilter/br_netfilter.h   |  2 +-
 include/net/xfrm.h |  2 -
 net/bridge/br_forward.c| 16 +++---
 net/bridge/br_input.c  | 25 ++
 net/bridge/br_multicast.c  |  4 +-
 net/bridge/br_netfilter_hooks.c| 54 ++--
 net/bridge/br_netfilter_ipv6.c |  8 +--
 net/bridge/br_private.h|  6 +--
 net/bridge/br_stp_bpdu.c   | 12 +++--
 net/bridge/netfilter/ebtable_filter.c  |  4 +-
 net/bridge/netfilter/ebtable_nat.c |  4 +-
 net/core/dev.c | 12 +++--
 net/decnet/dn_neigh.c  | 23 +
 net/decnet/dn_nsp_in.c |  7 +--
 net/decnet/dn_nsp_out.c

[net-next 04/18] fm10k: disable service task during suspend

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

The service task reads some registers as part of its normal routine,
even while the interface is down. Normally this is ok. However, during
suspend we have disabled the PCI device. Due to this, registers will
read in the same way as a surprise-remove event. Disable the service
task while we suspend, and re-enable it after we resume. If we don't do
this, the device could be UP when you suspend and come back from resume
as closed (since fm10k closes the device when it gets a surprise
remove).
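
The ordering problem described above can be modelled in user space.  This
is only a sketch under stated assumptions: struct iface and the functions
service_task, suspend and resume are hypothetical names made up for the
example, with a C11 atomic flag standing in for the
__FM10K_SERVICE_DISABLE state bit; none of it is the driver's code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of the interface state. */
struct iface {
    atomic_bool service_disabled; /* models __FM10K_SERVICE_DISABLE */
    bool pci_enabled;             /* false while suspended */
    bool netdev_open;             /* open/closed state we want to keep */
};

/* Models the service task: a failed read looks like surprise removal. */
static void service_task(struct iface *ifc)
{
    if (atomic_load(&ifc->service_disabled))
        return;                   /* gated: touch no registers */
    if (!ifc->pci_enabled)
        ifc->netdev_open = false; /* treated as surprise-remove */
}

static void suspend(struct iface *ifc)
{
    atomic_store(&ifc->service_disabled, true); /* stop the task first */
    ifc->pci_enabled = false;
}

static void resume(struct iface *ifc)
{
    ifc->pci_enabled = true;
    atomic_store(&ifc->service_disabled, false); /* allow it to run again */
}
```

In this model, a service_task call that lands between suspend and resume
returns early, so netdev_open survives the suspend cycle.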

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index ce53ff2..8413ab5 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -1983,6 +1983,16 @@ static int fm10k_resume(struct pci_dev *pdev)
if (err)
return err;
 
+   /* assume host is not ready, to prevent race with watchdog in case we
+* actually don't have connection to the switch
+*/
+   interface->host_ready = false;
+   fm10k_watchdog_host_not_ready(interface);
+
+   /* clear the service task disable bit to allow service task to start */
+   clear_bit(__FM10K_SERVICE_DISABLE, &interface->state);
+   fm10k_service_event_schedule(interface);
+
/* restore SR-IOV interface */
fm10k_iov_resume(pdev);
 
@@ -2010,6 +2020,15 @@ static int fm10k_suspend(struct pci_dev *pdev,
 
fm10k_iov_suspend(pdev);
 
+   /* the watchdog tasks may read registers, which will appear like a
+* surprise-remove event once the PCI device is disabled. This will
+* cause us to close the netdevice, so we don't retain the open/closed
+* state post-resume. Prevent this by disabling the service task while
+* suspended, until we actually resume.
+*/
+   set_bit(__FM10K_SERVICE_DISABLE, &interface->state);
+   cancel_work_sync(&interface->service_task);
+
rtnl_lock();
 
if (netif_running(netdev))
-- 
2.4.3


[net-next 05/18] fm10k: only prevent removal of default VID rules

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

This allows us to correctly add a VLAN even if it matches our default
VID. However, we don't want to remove the VID rules once that VLAN is
deleted. Correctly remove the stack layers information of the VLAN, but
then return to forwarding that VID as untagged frames. If we deleted the
VID rules here, we would begin dropping traffic due to VLAN membership
violations.
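
To make the change in the early-return condition concrete, the two
predicates below restate the old and new checks from the diff as
standalone functions.  The function names are illustrative and do not
exist in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old check: any request for the default VID returns early. */
static bool skip_update_old(bool set, uint16_t vid, uint16_t default_vid)
{
    (void)set;
    return vid == default_vid;
}

/* New check: only removal of the default VID returns early. */
static bool skip_update_new(bool set, uint16_t vid, uint16_t default_vid)
{
    return !set && vid == default_vid;
}
```

With the new predicate, adding a VLAN whose ID equals the default VID
proceeds to the update path, while deleting it still leaves the default
VID rules in place.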

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 99228bf..818bc8b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -775,8 +775,8 @@ static int fm10k_update_vid(struct net_device *netdev, u16 
vid, bool set)
if (!set)
clear_bit(vid, interface->active_vlans);
 
-   /* if default VLAN is already present do nothing */
-   if (vid == hw->mac.default_vid)
+   /* Do not remove default VID related entries from VLAN and MAC tables */
+   if (!set && vid == hw->mac.default_vid)
return 0;
 
fm10k_mbx_lock(interface);
-- 
2.4.3


[net-next 09/18] fm10k: Report MAC address on driver load

2015-09-15 Thread Jeff Kirsher

From: Alexander Duyck 

This change adds the MAC address to the list of values recorded on driver
load.  The MAC address represents the serial number of the unit and allows
us to track the value should a card be replaced in a system.

The log message should now be similar in output to that of ixgbe.

Signed-off-by: Alexander Duyck 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index db237b7..9f2b2f1 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -1905,6 +1905,9 @@ static int fm10k_probe(struct pci_dev *pdev,
/* print warning for non-optimal configurations */
fm10k_slot_warn(interface);
 
+   /* report MAC address for logging */
+   dev_info(&pdev->dev, "%pM\n", netdev->dev_addr);
+
/* enable SR-IOV after registering netdev to enforce PF/VF ordering */
fm10k_iov_configure(pdev, 0);
 
-- 
2.4.3


[net-next 17/18] fm10k: Only trigger data path reset if fabric is up

2015-09-15 Thread Jeff Kirsher

From: Alexander Duyck 

This change makes it so that we only trigger the data path reset if the
fabric is ready to handle traffic.  The general idea is to avoid
triggering the reset unless the switch API is ready for us.  Otherwise
we can just postpone the reset until we receive a switch ready
notification.

Signed-off-by: Alexander Duyck 
Signed-off-by: Jacob Keller 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index 241b969..d806d87 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -59,6 +59,11 @@ static s32 fm10k_reset_hw_pf(struct fm10k_hw *hw)
if (reg & (FM10K_DMA_CTRL_TX_ACTIVE | FM10K_DMA_CTRL_RX_ACTIVE))
return FM10K_ERR_DMA_PENDING;
 
+   /* verify the switch is ready for reset */
+   reg = fm10k_read_reg(hw, FM10K_DMA_CTRL2);
+   if (!(reg & FM10K_DMA_CTRL2_SWITCH_READY))
+   goto out;
+
/* Inititate data path reset */
reg |= FM10K_DMA_CTRL_DATAPATH_RESET;
fm10k_write_reg(hw, FM10K_DMA_CTRL, reg);
@@ -72,6 +77,7 @@ static s32 fm10k_reset_hw_pf(struct fm10k_hw *hw)
if (!(reg & FM10K_IP_NOTINRESET))
err = FM10K_ERR_RESET_FAILED;
 
+out:
return err;
 }
 
-- 
2.4.3


[net-next 01/18] ixgbe: fix issue with SFP events with new X550 devices

2015-09-15 Thread Jeff Kirsher

From: Don Skidmore 

Add checks for systems that don't have SFPs to avoid incorrectly
acting on interrupts that are falsely interpreted as SFP events.
This also includes a modified check generating the EICR mask to be
more forward-looking.

Signed-off-by: Don Skidmore 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 63b2cfe..b9267e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2495,17 +2495,26 @@ static inline bool ixgbe_is_sfp(struct ixgbe_hw *hw)
 static void ixgbe_check_sfp_event(struct ixgbe_adapter *adapter, u32 eicr)
 {
struct ixgbe_hw *hw = &adapter->hw;
+   u32 eicr_mask = IXGBE_EICR_GPI_SDP2(hw);
 
-   if (eicr & IXGBE_EICR_GPI_SDP2(hw)) {
+   if (!ixgbe_is_sfp(hw))
+   return;
+
+   /* Later MAC's use different SDP */
+   if (hw->mac.type >= ixgbe_mac_X540)
+   eicr_mask = IXGBE_EICR_GPI_SDP0_X540;
+
+   if (eicr & eicr_mask) {
/* Clear the interrupt */
-   IXGBE_WRITE_REG(hw, IXGBE_EICR, IXGBE_EICR_GPI_SDP2(hw));
+   IXGBE_WRITE_REG(hw, IXGBE_EICR, eicr_mask);
if (!test_bit(__IXGBE_DOWN, &adapter->state)) {
adapter->flags2 |= IXGBE_FLAG2_SFP_NEEDS_RESET;
ixgbe_service_event_schedule(adapter);
}
}
 
-   if (eicr & IXGBE_EICR_GPI_SDP1(hw)) {
+   if (adapter->hw.mac.type == ixgbe_mac_82599EB &&
+   (eicr & IXGBE_EICR_GPI_SDP1(hw))) {
/* Clear the interrupt */
IXGBE_WRITE_REG(hw, IXGBE_EICR, IXGBE_EICR_GPI_SDP1(hw));
if (!test_bit(__IXGBE_DOWN, &adapter->state)) {
-- 
2.4.3


[net-next 11/18] fm10k: don't store sw_vid at reset

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

If we store the sw_vid at reset of PF, then we accidentally prevent the
VF from receiving the message to update its default VID. This only
occurs if the VF is created before the PF has come up, which is the
standard way of creating VFs when using the module parameter.

This fixes an issue where we request the incorrect MAC/VLAN
combinations, and prevents us from accidentally reporting some frames as
VLAN tagged.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
index 94571e6..0e25a80 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
@@ -228,9 +228,6 @@ int fm10k_iov_resume(struct pci_dev *pdev)
hw->iov.ops.set_lport(hw, vf_info, i,
  FM10K_VF_FLAG_MULTI_CAPABLE);
 
-   /* assign our default vid to the VF following reset */
-   vf_info->sw_vid = hw->mac.default_vid;
-
/* mailbox is disconnected so we don't send a message */
hw->iov.ops.assign_default_mac_vlan(hw, vf_info);
 
-- 
2.4.3


[net-next 10/18] fm10k: allow creation of VLAN interfaces even while down

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

We re-sync upon going up, so there is little reason to worry about not
syncing immediately with the switch. This prevents an error that occurs
if you add a VLAN interface while down.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index b2065cb..e1ceb3a 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -779,6 +779,12 @@ static int fm10k_update_vid(struct net_device *netdev, u16 
vid, bool set)
if (!set && vid == hw->mac.default_vid)
return 0;
 
+   /* Do not throw an error if the interface is down. We will sync once
+* we come up
+*/
+   if (test_bit(__FM10K_DOWN, &interface->state))
+   return 0;
+
fm10k_mbx_lock(interface);
 
/* only need to update the VLAN if not in promiscuous mode */
-- 
2.4.3


[net-next 16/18] fm10k: re-enable VF after a full reset on detection of a Malicious event

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

Modify behavior of Malicious Driver Detection events. Presently, the
hardware disables the VF queues and re-assigns them to the PF. This
causes the VF in question to continuously Tx hang, because it assumes
that it can transmit over the queues in question. For transient events,
this results in continuous logging of malicious events.

New behavior is to reset the LPORT and VF state, so that the VF will
have to reset and re-enable itself. This does mean that malicious VFs
will possibly be able to continue and attempt malicious events again.
However, it is expected that system administrators will step in and
manually remove or disable the VF in question.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 30 ++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 9bdc04d..3d71c52 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -880,10 +880,12 @@ void fm10k_netpoll(struct net_device *netdev)
 
 #endif
 #define FM10K_ERR_MSG(type) case (type): error = #type; break
-static void fm10k_print_fault(struct fm10k_intfc *interface, int type,
+static void fm10k_handle_fault(struct fm10k_intfc *interface, int type,
  struct fm10k_fault *fault)
 {
struct pci_dev *pdev = interface->pdev;
+   struct fm10k_hw *hw = &interface->hw;
+   struct fm10k_iov_data *iov_data = interface->iov_data;
char *error;
 
switch (type) {
@@ -937,6 +939,30 @@ static void fm10k_print_fault(struct fm10k_intfc 
*interface, int type,
 "%s Address: 0x%llx SpecInfo: 0x%x Func: %02x.%0x\n",
 error, fault->address, fault->specinfo,
 PCI_SLOT(fault->func), PCI_FUNC(fault->func));
+
+   /* For VF faults, clear out the respective LPORT, reset the queue
+* resources, and then reconnect to the mailbox. This allows the
+* VF in question to resume behavior. For transient faults that are
+* the result of non-malicious behavior this will log the fault and
+* allow the VF to resume functionality. Obviously for malicious VFs
+* they will be able to attempt malicious behavior again. In this
+* case, the system administrator will need to step in and manually
+* remove or disable the VF in question.
+*/
+   if (fault->func && iov_data) {
+   int vf = fault->func - 1;
+   struct fm10k_vf_info *vf_info = &iov_data->vf_info[vf];
+
+   hw->iov.ops.reset_lport(hw, vf_info);
+   hw->iov.ops.reset_resources(hw, vf_info);
+
+   /* reset_lport disables the VF, so re-enable it */
+   hw->iov.ops.set_lport(hw, vf_info, vf,
+ FM10K_VF_FLAG_MULTI_CAPABLE);
+
+   /* reset_resources will disconnect from the mbx  */
+   vf_info->mbx.ops.connect(hw, &vf_info->mbx);
+   }
 }
 
 static void fm10k_report_fault(struct fm10k_intfc *interface, u32 eicr)
@@ -960,7 +986,7 @@ static void fm10k_report_fault(struct fm10k_intfc 
*interface, u32 eicr)
continue;
}
 
-   fm10k_print_fault(interface, type, &fault);
+   fm10k_handle_fault(interface, type, &fault);
}
 }
 
-- 
2.4.3


[net-next 06/18] fm10k: update fm10k_slot_warn to use pcie_get_minimum_link

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

This is useful in cases where we connect to a slot at Gen3, but the slot
is behind a bus which only connected at Gen2. This generally only
happens when a PCIe switch is in the sequence of devices, and can be
very confusing when you see slow performance with no obvious cause.
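
The comparison this patch introduces can be sketched on its own.  The
multipliers (2, 4 and 8 per lane) follow the switch statements in the
diff below; the enum link_speed type and the function names usable_gts
and slot_is_too_slow are assumptions made for this example rather than
kernel or driver identifiers.

```c
/* Hypothetical speed enum; the kernel uses enum pci_bus_speed. */
enum link_speed { SPEED_2_5GT, SPEED_5_0GT, SPEED_8_0GT, SPEED_UNKNOWN };

/*
 * Usable bandwidth in GT/s after encoding overhead, or -1 if unknown.
 * 2.5 and 5.0 GT/s links use 8b/10b (20% loss); 8.0 GT/s uses 128b/130b.
 */
static int usable_gts(enum link_speed speed, int lanes)
{
    switch (speed) {
    case SPEED_2_5GT: return 2 * lanes;
    case SPEED_5_0GT: return 4 * lanes;
    case SPEED_8_0GT: return 8 * lanes;
    default:          return -1;
    }
}

/* Returns 1 when the weakest link in the chain is below what is wanted. */
static int slot_is_too_slow(enum link_speed min_speed, int min_lanes,
                            enum link_speed want_speed, int want_lanes)
{
    int have = usable_gts(min_speed, min_lanes);
    int want = usable_gts(want_speed, want_lanes);

    return have >= 0 && want >= 0 && have < want;
}
```

For instance, a device wanting Gen3 x8 behind a switch whose upstream
link trained at Gen2 x8 has 32 GT/s available against 64 GT/s expected,
which is the case the new warning is meant to catch.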

I am aware this patch has a few lines that break 80 characters, but
there does not seem to be a readable way to format them to less than 80
characters. Suggestions welcome.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 105 +++
 1 file changed, 76 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 8413ab5..2d87c32 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -1705,22 +1705,86 @@ static int fm10k_sw_init(struct fm10k_intfc *interface,
 
 static void fm10k_slot_warn(struct fm10k_intfc *interface)
 {
-   struct device *dev = &interface->pdev->dev;
+   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
+   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
struct fm10k_hw *hw = &interface->hw;
+   int max_gts = 0, expected_gts = 0;
 
-   if (hw->mac.ops.is_slot_appropriate(hw))
+   if (pcie_get_minimum_link(interface->pdev, &speed, &width) ||
+   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
+   dev_warn(&interface->pdev->dev,
+"Unable to determine PCI Express bandwidth.\n");
return;
+   }
+
+   switch (speed) {
+   case PCIE_SPEED_2_5GT:
+   /* 8b/10b encoding reduces max throughput by 20% */
+   max_gts = 2 * width;
+   break;
+   case PCIE_SPEED_5_0GT:
+   /* 8b/10b encoding reduces max throughput by 20% */
+   max_gts = 4 * width;
+   break;
+   case PCIE_SPEED_8_0GT:
+   /* 128b/130b encoding has less than 2% impact on throughput */
+   max_gts = 8 * width;
+   break;
+   default:
+   dev_warn(&interface->pdev->dev,
+"Unable to determine PCI Express bandwidth.\n");
+   return;
+   }
+
+   dev_info(&interface->pdev->dev,
+"PCI Express bandwidth of %dGT/s available\n",
+max_gts);
+   dev_info(&interface->pdev->dev,
+"(Speed:%s, Width: x%d, Encoding Loss:%s, Payload:%s)\n",
+(speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
+ speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
+ speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
+ "Unknown"),
+hw->bus.width,
+(speed == PCIE_SPEED_2_5GT ? "20%" :
+ speed == PCIE_SPEED_5_0GT ? "20%" :
+ speed == PCIE_SPEED_8_0GT ? "<2%" :
+ "Unknown"),
+(hw->bus.payload == fm10k_bus_payload_128 ? "128B" :
+ hw->bus.payload == fm10k_bus_payload_256 ? "256B" :
+ hw->bus.payload == fm10k_bus_payload_512 ? "512B" :
+ "Unknown"));
 
-   dev_warn(dev,
-"For optimal performance, a %s %s slot is recommended.\n",
-(hw->bus_caps.width == fm10k_bus_width_pcie_x1 ? "x1" :
- hw->bus_caps.width == fm10k_bus_width_pcie_x4 ? "x4" :
- "x8"),
-(hw->bus_caps.speed == fm10k_bus_speed_2500 ? "2.5GT/s" :
- hw->bus_caps.speed == fm10k_bus_speed_5000 ? "5.0GT/s" :
- "8.0GT/s"));
-   dev_warn(dev,
-"A slot with more lanes and/or higher speed is suggested.\n");
+   switch (hw->bus_caps.speed) {
+   case fm10k_bus_speed_2500:
+   /* 8b/10b encoding reduces max throughput by 20% */
+   expected_gts = 2 * hw->bus_caps.width;
+   break;
+   case fm10k_bus_speed_5000:
+   /* 8b/10b encoding reduces max throughput by 20% */
+   expected_gts = 4 * hw->bus_caps.width;
+   break;
+   case fm10k_bus_speed_8000:
+   /* 128b/130b encoding has less than 2% impact on throughput */
+   expected_gts = 8 * hw->bus_caps.width;
+   break;
+   default:
+   dev_warn(&interface->pdev->dev,
+"Unable to determine expected PCI Express 
bandwidth.\n");
+   return;
+   }
+
+   if (max_gts < expected_gts) {
+   dev_warn(&interface->pdev->dev,
+"This device requires %dGT/s of bandwidth for optimal 
performance.\n",
+expected_gts);
+   dev_warn(&interface->pdev->dev,
+"A %sslot with x%d lanes is suggested.\n",
+(hw->bus_cap

[net-next 07/18] fm10k: update netdev perm_addr during reinit, instead of at up

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

Updating the netdev permanent address during fm10k_reinit enables the user
to immediately see the new MAC address on the VF even if the device
isn't up. The previous code required that the device be opened before
changes would appear.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 15 ---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c| 15 +++
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 818bc8b..b2065cb 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -996,21 +996,6 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
int xcast_mode;
u16 vid, glort;
 
-   /* restore our address if perm_addr is set */
-   if (hw->mac.type == fm10k_mac_vf) {
-   if (is_valid_ether_addr(hw->mac.perm_addr)) {
-   ether_addr_copy(hw->mac.addr, hw->mac.perm_addr);
-   ether_addr_copy(netdev->perm_addr, hw->mac.perm_addr);
-   ether_addr_copy(netdev->dev_addr, hw->mac.perm_addr);
-   netdev->addr_assign_type &= ~NET_ADDR_RANDOM;
-   }
-
-   if (hw->mac.vlan_override)
-   netdev->features &= ~NETIF_F_HW_VLAN_CTAG_RX;
-   else
-   netdev->features |= NETIF_F_HW_VLAN_CTAG_RX;
-   }
-
/* record glort for this interface */
glort = interface->glort;
 
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 2d87c32..db237b7 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -170,6 +170,21 @@ static void fm10k_reinit(struct fm10k_intfc *interface)
/* reassociate interrupts */
fm10k_mbx_request_irq(interface);
 
+   /* update hardware address for VFs if perm_addr has changed */
+   if (hw->mac.type == fm10k_mac_vf) {
+   if (is_valid_ether_addr(hw->mac.perm_addr)) {
+   ether_addr_copy(hw->mac.addr, hw->mac.perm_addr);
+   ether_addr_copy(netdev->perm_addr, hw->mac.perm_addr);
+   ether_addr_copy(netdev->dev_addr, hw->mac.perm_addr);
+   netdev->addr_assign_type &= ~NET_ADDR_RANDOM;
+   }
+
+   if (hw->mac.vlan_override)
+   netdev->features &= ~NETIF_F_HW_VLAN_CTAG_RX;
+   else
+   netdev->features |= NETIF_F_HW_VLAN_CTAG_RX;
+   }
+
/* reset clock */
fm10k_ts_reset(interface);
 
-- 
2.4.3


[net-next 03/18] ixgbe: Limit lowest interrupt rate for adaptive interrupt moderation to 12K

2015-09-15 Thread Jeff Kirsher

From: Alexander Duyck 

This patch updates the lowest limit for adaptive interrupt
moderation to roughly 12K interrupts per second.

I arrived at 12K as the desired interrupt rate by testing with UDP
flows.  Specifically I had a simple test that ran a netperf UDP_STREAM
test at varying sizes.  As the packet sizes increased, the performance
fell steadily behind until we were only able to receive at ~4Gb/s with a
message size of 65507.  A bit of digging found that we were dropping
packets for the socket in the network stack, and that I could solve it
either by increasing the interrupt rate or by increasing
rmem_default/rmem_max.  When the interrupt coalescing resulted in more
data being processed per interrupt than could be stored in the socket
buffer, we started losing packets and the performance dropped.  So I
reached 12K based on the following math.

rmem_default = 212992
skb->truesize = 2994
212992 / 2994 = 71.14 packets to fill the buffer

packet rate at 1514 packet size is 812744pps
71.14 / 812744 = 87.5us to fill socket buffer

From there it was just a matter of choosing the interrupt rate and
providing a bit of wiggle room which is why I decided to go with 12K
interrupts per second as that uses a value of 84us.
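
The arithmetic above can be reproduced with a few helper functions.  The
input values are the ones quoted in this message; the helper names are
made up for the example.  With those inputs the buffer fills in a little
under 88us, and an 84us interval works out to about 11,900 interrupts
per second, i.e. roughly 12K.

```c
/* Values come from the message above; helper names are illustrative. */

/* Packets of a given truesize that fit in a socket buffer. */
static double packets_to_fill(double rmem_bytes, double truesize_bytes)
{
    return rmem_bytes / truesize_bytes;
}

/* Microseconds until the buffer fills at a given packet rate. */
static double fill_time_us(double packets, double packets_per_sec)
{
    return packets / packets_per_sec * 1e6;
}

/* Interrupts per second for a given interval in microseconds. */
static double ints_per_sec(double interval_us)
{
    return 1e6 / interval_us;
}
```

Calling packets_to_fill(212992, 2994), feeding the result to
fill_time_us() with 812744 packets per second, and then calling
ints_per_sec(84) walks through the three steps in order.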

The data below is based on VM to VM over a direct assigned ixgbe interface.
The test run was:
netperf -H  -t UDP_STREAM"

Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB
Before:
212992   65507   60.00     1100662      0       9613.4   10.89    0.557
212992           60.00      473474              4135.4   11.27    0.576

After:
212992   65507   60.00     1100413      0       9611.2   10.73    0.549
212992           60.00      974132              8508.3   11.69    0.598

Using bare metal the data is similar but not as dramatic as the throughput
increases from about 8.5Gb/s to 9.5Gb/s.

Signed-off-by: Alexander Duyck 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h | 3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 4 ++--
 4 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index edf1fb9..a699c99 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -539,8 +539,7 @@ struct hwmon_buff {
 #define IXGBE_MIN_RSC_ITR  24
 #define IXGBE_100K_ITR 40
 #define IXGBE_20K_ITR  200
-#define IXGBE_10K_ITR  400
-#define IXGBE_8K_ITR   500
+#define IXGBE_12K_ITR  336
 
 /* ixgbe_test_staterr - tests bits in Rx descriptor status and error fields */
 static inline __le32 ixgbe_test_staterr(union ixgbe_adv_rx_desc *rx_desc,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index ab2edc8..94c4912 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2286,7 +2286,7 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
adapter->tx_itr_setting = ec->tx_coalesce_usecs;
 
if (adapter->tx_itr_setting == 1)
-   tx_itr_param = IXGBE_10K_ITR;
+   tx_itr_param = IXGBE_12K_ITR;
else
tx_itr_param = adapter->tx_itr_setting;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 68e1e75..f3168bc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -866,7 +866,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter 
*adapter,
if (txr_count && !rxr_count) {
/* tx only vector */
if (adapter->tx_itr_setting == 1)
-   q_vector->itr = IXGBE_10K_ITR;
+   q_vector->itr = IXGBE_12K_ITR;
else
q_vector->itr = adapter->tx_itr_setting;
} else {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c04480e..acb1b91 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2261,7 +2261,7 @@ static void ixgbe_update_itr(struct ixgbe_q_vector 
*q_vector,
/* simple throttlerate management
 *   0-10MB/s   lowest (100000 ints/s)
 *  10-20MB/s   low    (20000 ints/s)
-*  20-1249MB/s bulk   (8000 ints/s)
+*  20-1249MB/s bulk   (12000 ints/s)
 */
/* what was last interrupt timeslice? */

[net-next 12/18] fm10k: remove is_slot_appropriate

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

This function is no longer used now that we have updated fm10k_slot_warn
functionality.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c   | 14 --
 drivers/net/ethernet/intel/fm10k/fm10k_type.h |  1 -
 drivers/net/ethernet/intel/fm10k/fm10k_vf.c   | 14 --
 3 files changed, 29 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index 3ca0233..241b969 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -185,19 +185,6 @@ static s32 fm10k_init_hw_pf(struct fm10k_hw *hw)
 }
 
 /**
- *  fm10k_is_slot_appropriate_pf - Indicate appropriate slot for this SKU
- *  @hw: pointer to hardware structure
- *
- *  Looks at the PCIe bus info to confirm whether or not this slot can support
- *  the necessary bandwidth for this device.
- **/
-static bool fm10k_is_slot_appropriate_pf(struct fm10k_hw *hw)
-{
-   return (hw->bus.speed == hw->bus_caps.speed) &&
-  (hw->bus.width == hw->bus_caps.width);
-}
-
-/**
  *  fm10k_update_vlan_pf - Update status of VLAN ID in VLAN filter table
  *  @hw: pointer to hardware structure
  *  @vid: VLAN ID to add to table
@@ -1849,7 +1836,6 @@ static struct fm10k_mac_ops mac_ops_pf = {
.init_hw= &fm10k_init_hw_pf,
.start_hw   = &fm10k_start_hw_generic,
.stop_hw= &fm10k_stop_hw_generic,
-   .is_slot_appropriate= &fm10k_is_slot_appropriate_pf,
.update_vlan= &fm10k_update_vlan_pf,
.read_mac_addr  = &fm10k_read_mac_addr_pf,
.update_uc_addr = &fm10k_update_uc_addr_pf,
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_type.h 
b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
index 2a17d82..bac8d48 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_type.h
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
@@ -521,7 +521,6 @@ struct fm10k_mac_ops {
s32 (*stop_hw)(struct fm10k_hw *);
s32 (*get_bus_info)(struct fm10k_hw *);
s32 (*get_host_state)(struct fm10k_hw *, bool *);
-   bool (*is_slot_appropriate)(struct fm10k_hw *);
s32 (*update_vlan)(struct fm10k_hw *, u32, u8, bool);
s32 (*read_mac_addr)(struct fm10k_hw *);
s32 (*update_uc_addr)(struct fm10k_hw *, u16, const u8 *,
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_vf.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_vf.c
index 94f0f6a..36c8b0a 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_vf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_vf.c
@@ -131,19 +131,6 @@ static s32 fm10k_init_hw_vf(struct fm10k_hw *hw)
return 0;
 }
 
-/**
- *  fm10k_is_slot_appropriate_vf - Indicate appropriate slot for this SKU
- *  @hw: pointer to hardware structure
- *
- *  Looks at the PCIe bus info to confirm whether or not this slot can support
- *  the necessary bandwidth for this device. Since the VF has no control over
- *  the "slot" it is in, always indicate that the slot is appropriate.
- **/
-static bool fm10k_is_slot_appropriate_vf(struct fm10k_hw *hw)
-{
-   return true;
-}
-
 /* This structure defines the attibutes to be parsed below */
 const struct fm10k_tlv_attr fm10k_mac_vlan_msg_attr[] = {
FM10K_TLV_ATTR_U32(FM10K_MAC_VLAN_MSG_VLAN),
@@ -552,7 +539,6 @@ static struct fm10k_mac_ops mac_ops_vf = {
.init_hw= &fm10k_init_hw_vf,
.start_hw   = &fm10k_start_hw_generic,
.stop_hw= &fm10k_stop_hw_vf,
-   .is_slot_appropriate= &fm10k_is_slot_appropriate_vf,
.update_vlan= &fm10k_update_vlan_vf,
.read_mac_addr  = &fm10k_read_mac_addr_vf,
.update_uc_addr = &fm10k_update_uc_addr_vf,
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next 02/18] ixgbe: Teardown SR-IOV before unregister_netdev()

2015-09-15 Thread Jeff Kirsher

From: Alex Williamson 

When the .remove() callback for a PF is called, SR-IOV support for the
device is disabled, which requires unbinding and removing the VFs.
The VFs may be in-use either by the host kernel or userspace, such as
assigned to a VM through vfio-pci.  In this latter case, the VFs may
be removed either by shutting down the VM or hot-unplugging the
devices from the VM.  Unfortunately in the case of a Windows 2012 R2
guest, hot-unplug is broken due to the ordering of the PF driver
teardown.  Disabling SR-IOV prior to unregister_netdev() avoids this
issue.

Signed-off-by: Alex Williamson 
Acked-by: Mitch Williams 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b9267e2..c04480e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9028,12 +9028,12 @@ static void ixgbe_remove(struct pci_dev *pdev)
/* remove the added san mac */
ixgbe_del_sanmac_netdev(netdev);
 
-   if (netdev->reg_state == NETREG_REGISTERED)
-   unregister_netdev(netdev);
-
 #ifdef CONFIG_PCI_IOV
ixgbe_disable_sriov(adapter);
 #endif
+   if (netdev->reg_state == NETREG_REGISTERED)
+   unregister_netdev(netdev);
+
ixgbe_clear_interrupt_scheme(adapter);
 
ixgbe_release_hw_control(adapter);
-- 
2.4.3


[net-next 08/18] fm10k: Don't assume page fragments are page size

2015-09-15 Thread Jeff Kirsher

From: Alexander Duyck 

This change pulls out the optimization that assumed that all fragments
would be limited to page size.  That hasn't been the case for some time now
and to assume this is incorrect as the TCP allocator can provide up to a
32K page fragment.
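
To illustrate why one descriptor per fragment is not enough, here is a
small userspace sketch of the counting the patch now always performs. The
16K per-descriptor limit and the names below are assumptions made for
illustration; check fm10k.h for the real FM10K_MAX_DATA_PER_TXD and
TXD_USE_COUNT definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-descriptor data limit (16K); the driver's value may differ. */
#define MAX_DATA_PER_TXD	(1u << 14)

/* Descriptors needed for one buffer of the given size, rounded up. */
static unsigned int txd_use_count(size_t size)
{
	return (unsigned int)((size + MAX_DATA_PER_TXD - 1) / MAX_DATA_PER_TXD);
}

/* Sum the descriptors for the linear head and for every fragment. */
static unsigned int count_descriptors(size_t headlen, const size_t *frag_sizes,
				      size_t nr_frags)
{
	unsigned int count = txd_use_count(headlen);
	size_t f;

	assert(frag_sizes != NULL || nr_frags == 0);
	for (f = 0; f < nr_frags; f++)
		count += txd_use_count(frag_sizes[f]);

	return count;
}
```

With these assumptions a 100 byte head plus two 32K fragments needs 5
descriptors, whereas the removed shortcut (head plus nr_frags) would have
reserved only 3.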

Signed-off-by: Alexander Duyck 
Acked-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index b5b2925..6ad03e0 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1079,9 +1079,7 @@ netdev_tx_t fm10k_xmit_frame_ring(struct sk_buff *skb,
struct fm10k_tx_buffer *first;
int tso;
u32 tx_flags = 0;
-#if PAGE_SIZE > FM10K_MAX_DATA_PER_TXD
unsigned short f;
-#endif
u16 count = TXD_USE_COUNT(skb_headlen(skb));
 
/* need: 1 descriptor per page * PAGE_SIZE/FM10K_MAX_DATA_PER_TXD,
@@ -1089,12 +1087,9 @@ netdev_tx_t fm10k_xmit_frame_ring(struct sk_buff *skb,
 *   + 2 desc gap to keep tail from touching head
 * otherwise try next time
 */
-#if PAGE_SIZE > FM10K_MAX_DATA_PER_TXD
for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
-#else
-   count += skb_shinfo(skb)->nr_frags;
-#endif
+
if (fm10k_maybe_stop_tx(tx_ring, count + 3)) {
tx_ring->tx_stats.tx_busy++;
return NETDEV_TX_BUSY;
-- 
2.4.3


[net-next 00/18][pull request] Intel Wired LAN Driver Updates 2015-09-15

2015-09-15 Thread Jeff Kirsher

This series contains updates to ixgbe and fm10k.

Don fixes an ixgbe issue by adding checks for systems that do not have
SFPs, to avoid incorrectly acting on interrupts that are falsely
interpreted as SFP events.

Alex Williamson adds a fix for ixgbe to disable SR-IOV prior to
unregistering the netdev, to avoid issues with guest OSes that do not
support hot-unplug or whose hot-unplug is broken.

Alex Duyck updates the lowest limit for adaptive interrupt moderation
to about 12K interrupts per second for ixgbe, which increases ixgbe
performance.  Also fixes up fm10k to remove the optimization that
assumed all fragments would be limited to page size; that assumption is
incorrect, as the TCP allocator can provide up to a 32K page fragment.
Updates fm10k to add the MAC address to the list of values recorded on
driver load.  Fixes fm10k so that we only trigger the data path reset
if the fabric is ready to handle traffic, which avoids triggering the
reset before the switch API is ready for us.

Jacob updates the fm10k driver to disable the service task during
suspend and re-enable it after resume.  Without this, the device could
be UP when you suspend and come back from resume as DOWN.  Also updates
fm10k to prevent the removal of default VID rules: we correctly remove
the stack layer's information about the VLAN, but then return to
forwarding that VID as untagged frames.  If we deleted the VID rules
here, we would begin dropping traffic due to VLAN membership violations.
Fixes fm10k to use pcie_get_minimum_link(), which is useful in cases
where we connect to a slot at Gen3, but the slot is behind a bus which
is only connected at Gen2.  Updates fm10k to set the netdev permanent
address during reinit instead of at up, so that users immediately see
the new MAC address on the VF even if the device is not up.  Allows the
creation of VLAN interfaces on an fm10k device even while the device is
down.  Fixes an issue where we requested incorrect MAC/VLAN
combinations, which also prevents us from accidentally reporting some
frames as VLAN tagged.  Provides a couple of trivial fixes for fm10k to
fix code style and typos in code comments.

The following are changes since commit ad1e7b97b3adb91d46f3adb70a7867a50fc274cf:
  cdc: Fix build warning.
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Alex Williamson (1):
  ixgbe: Teardown SR-IOV before unregister_netdev()

Alexander Duyck (4):
  ixgbe: Limit lowest interrupt rate for adaptive interrupt moderation
to 12K
  fm10k: Don't assume page fragments are page size
  fm10k: Report MAC address on driver load
  fm10k: Only trigger data path reset if fabric is up

Don Skidmore (1):
  ixgbe: fix issue with SFP events with new X550 devices

Jacob Keller (12):
  fm10k: disable service task during suspend
  fm10k: only prevent removal of default VID rules
  fm10k: update fm10k_slot_warn to use pcie_get_minimum link
  fm10k: update netdev perm_addr during reinit, instead of at up
  fm10k: allow creation of VLAN interfaces even while down
  fm10k: don't store sw_vid at reset
  fm10k: remove is_slot_appropriate
  fm10k: TRIVIAL fix up ordering of __always_unused and style
  fm10k: send traffic on default VID to VLAN device if we have one
  fm10k: TRIVIAL fix typo in fm10k_netdev.c
  fm10k: re-enable VF after a full reset on detection of a Malicious
event
  fm10k: fix iov_msg_mac_vlan_pf VID checks

 drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c |   5 +-
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c |   3 -
 drivers/net/ethernet/intel/fm10k/fm10k_main.c|  12 +-
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c  |  39 ++---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 176 +++
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c  | 105 --
 drivers/net/ethernet/intel/fm10k/fm10k_type.h|   1 -
 drivers/net/ethernet/intel/fm10k/fm10k_vf.c  |  14 --
 drivers/net/ethernet/intel/ixgbe/ixgbe.h |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|  25 ++--
 12 files changed, 252 insertions(+), 135 deletions(-)

-- 
2.4.3


[net-next 15/18] fm10k: TRIVIAL fix typo in fm10k_netdev.c

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

Signed-off-by: Jacob Keller 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 3a6230b..639263d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1048,7 +1048,7 @@ void fm10k_restore_rx_state(struct fm10k_intfc *interface)
   vid, true, 0);
}
 
-   /* update xcast mode before syncronizing addresses */
+   /* update xcast mode before synchronizing addresses */
hw->mac.ops.update_xcast_mode(hw, glort, xcast_mode);
 
/* synchronize all of the addresses */
-- 
2.4.3


[net-next 14/18] fm10k: send traffic on default VID to VLAN device if we have one

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

This patch ensures that VLAN traffic on the default VID will go to the
corresponding VLAN device if it exists. To do this, mask the rx_ring VID
if we have an active VLAN on that VID.

For this to work correctly, we need to update fm10k_process_skb_fields
to correctly mask off the VLAN_PRIO_MASK bits and compare them
separately, otherwise we incorrectly compare the priority bits with the
cleared flag. This also happens to fix a related bug where having
priority bits set causes us to incorrectly classify traffic.
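
As a rough userspace model of the receive-side decision described above:
VLAN_VID_MASK and VLAN_PRIO_MASK use the values from linux/if_vlan.h, while
the FM10K_VLAN_CLEAR value (bit 15), the NO_TAG marker and rx_vlan_tag()
itself are assumptions made only for this sketch.

```c
#include <stdint.h>

#define VLAN_VID_MASK		0x0fff	/* VLAN ID bits, as in linux/if_vlan.h */
#define VLAN_PRIO_MASK		0xe000	/* priority bits, as in linux/if_vlan.h */
#define FM10K_VLAN_CLEAR	0x8000	/* assumed value (bit 15) */
#define NO_TAG			(-1)	/* hypothetical marker: leave skb untagged */

/* Returns the tag that would be put on the skb, or NO_TAG for none. */
static int rx_vlan_tag(uint16_t desc_vlan, uint16_t ring_vid)
{
	if (!desc_vlan)
		return NO_TAG;

	/* Not the ring's default VID: keep the full tag. */
	if ((desc_vlan & VLAN_VID_MASK) != ring_vid)
		return desc_vlan;

	/* Default VID with priority bits set: keep only the priority. */
	if (desc_vlan & VLAN_PRIO_MASK)
		return desc_vlan & VLAN_PRIO_MASK;

	return NO_TAG;
}
```

In this model, once the ring VID has FM10K_VLAN_CLEAR set, the masked VID
can never match it, so frames on the default VID keep their tag and can be
handed to the VLAN device.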

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c   |  5 -
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 12 
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c|  4 
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 6ad03e0..92d4155 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -497,8 +497,11 @@ static unsigned int fm10k_process_skb_fields(struct 
fm10k_ring *rx_ring,
if (rx_desc->w.vlan) {
u16 vid = le16_to_cpu(rx_desc->w.vlan);
 
-   if (vid != rx_ring->vid)
+   if ((vid & VLAN_VID_MASK) != rx_ring->vid)
__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vid);
+   else if (vid & VLAN_PRIO_MASK)
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+  vid & VLAN_PRIO_MASK);
}
 
fm10k_type_trans(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index e1ceb3a..3a6230b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -758,6 +758,7 @@ static int fm10k_update_vid(struct net_device *netdev, u16 
vid, bool set)
struct fm10k_intfc *interface = netdev_priv(netdev);
struct fm10k_hw *hw = &interface->hw;
s32 err;
+   int i;
 
/* updates do not apply to VLAN 0 */
if (!vid)
@@ -775,6 +776,17 @@ static int fm10k_update_vid(struct net_device *netdev, u16 
vid, bool set)
if (!set)
clear_bit(vid, interface->active_vlans);
 
+   /* disable the default VID on ring if we have an active VLAN */
+   for (i = 0; i < interface->num_rx_queues; i++) {
+   struct fm10k_ring *rx_ring = interface->rx_ring[i];
+   u16 rx_vid = rx_ring->vid & (VLAN_N_VID - 1);
+
+   if (test_bit(rx_vid, interface->active_vlans))
+   rx_ring->vid |= FM10K_VLAN_CLEAR;
+   else
+   rx_ring->vid &= ~FM10K_VLAN_CLEAR;
+   }
+
/* Do not remove default VID related entries from VLAN and MAC tables */
if (!set && vid == hw->mac.default_vid)
return 0;
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 9f2b2f1..9bdc04d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -678,6 +678,10 @@ static void fm10k_configure_rx_ring(struct fm10k_intfc 
*interface,
/* assign default VLAN to queue */
ring->vid = hw->mac.default_vid;
 
+   /* if we have an active VLAN, disable default VID */
+   if (test_bit(hw->mac.default_vid, interface->active_vlans))
+   ring->vid |= FM10K_VLAN_CLEAR;
+
/* Map interrupt */
if (ring->q_vector) {
rxint = ring->q_vector->v_idx + NON_Q_VECTORS(hw);
-- 
2.4.3


[net-next 13/18] fm10k: TRIVIAL fix up ordering of __always_unused and style

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

Fix some style issues in debugfs code, and correct ordering of void and
__always_unused. Technically, the order does not matter, but preferred
style is to put the macro between the type and name.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
index f45b4d7..08ecf43 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
@@ -37,7 +37,8 @@ static void *fm10k_dbg_desc_seq_start(struct seq_file *s, 
loff_t *pos)
 }
 
 static void *fm10k_dbg_desc_seq_next(struct seq_file *s,
-void __always_unused *v, loff_t *pos)
+void __always_unused *v,
+loff_t *pos)
 {
struct fm10k_ring *ring = s->private;
 
@@ -45,7 +46,7 @@ static void *fm10k_dbg_desc_seq_next(struct seq_file *s,
 }
 
 static void fm10k_dbg_desc_seq_stop(struct seq_file __always_unused *s,
-   __always_unused void *v)
+   void __always_unused *v)
 {
/* Do nothing. */
 }
-- 
2.4.3


[net-next 18/18] fm10k: fix iov_msg_mac_vlan_pf VID checks

2015-09-15 Thread Jeff Kirsher

From: Jacob Keller 

The VF will send a message to request multicast addresses with the
default VID. In the current code, if the PF has statically assigned a
VLAN to a VF, then the VF will not get the multicast addresses. Fix up
all of the various VLAN messages to use identical checks (since each
check was different). Also use set as a variable, so that it simplifies
our check for whether VLAN matches the pf_vid.

The new logic allows a request for VLAN 0, automatically converting it
to the default VID. Otherwise it allows setting the PF VID, or any VLAN
if the PF has not statically assigned one. This behavior is consistent,
and allows the VF to request either 0 or the default_vid without
silently failing.

Note that we need the check for zero since VFs might not get the default
VID message in time to actually request non-zero VLANs.
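
The selection rules can be tried out in userspace with a sketch like the
one below. It mirrors the logic of the helper added in this patch, but
struct vf_info is a hypothetical stand-in for struct fm10k_vf_info, and the
value of FM10K_ERR_PARAM is assumed (any negative error code behaves the
same here).

```c
#include <stdint.h>

#define FM10K_ERR_PARAM	(-2)	/* assumed value, only needs to be negative */

struct vf_info {		/* hypothetical stand-in for fm10k_vf_info */
	uint16_t pf_vid;	/* VLAN statically assigned by the PF, 0 if none */
	uint16_t sw_vid;	/* default VID otherwise in use */
};

/* Returns the VID to use, or a negative error code. */
static int32_t select_vid(const struct vf_info *vf, uint16_t vid)
{
	/* A request for VLAN 0 is converted to the default VID. */
	if (!vid)
		return vf->pf_vid ? vf->pf_vid : vf->sw_vid;

	/* With a statically assigned VLAN, only that VLAN is accepted. */
	if (vf->pf_vid && vid != vf->pf_vid)
		return FM10K_ERR_PARAM;

	return vid;
}
```

So a VF without a PF-assigned VLAN may request any VID, while a VF with
pf_vid set gets pf_vid for both 0 and pf_vid and an error for anything
else.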

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pf.c | 85 ++---
 1 file changed, 52 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
index d806d87..8c0bdc4 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pf.c
@@ -1155,6 +1155,24 @@ s32 fm10k_iov_msg_msix_pf(struct fm10k_hw *hw, u32 
**results,
 }
 
 /**
+ * fm10k_iov_select_vid - Select correct default VID
+ * @hw: Pointer to hardware structure
+ * @vid: VID to correct
+ *
+ * Will report an error if VID is out of range. For VID = 0, it will return
+ * either the pf_vid or sw_vid depending on which one is set.
+ */
+static inline s32 fm10k_iov_select_vid(struct fm10k_vf_info *vf_info, u16 vid)
+{
+   if (!vid)
+   return vf_info->pf_vid ? vf_info->pf_vid : vf_info->sw_vid;
+   else if (vf_info->pf_vid && vid != vf_info->pf_vid)
+   return FM10K_ERR_PARAM;
+   else
+   return vid;
+}
+
+/**
  *  fm10k_iov_msg_mac_vlan_pf - Message handler for MAC/VLAN request from VF
  *  @hw: Pointer to hardware structure
  *  @results: Pointer array to message, results[0] is pointer to message
@@ -1168,9 +1186,10 @@ s32 fm10k_iov_msg_mac_vlan_pf(struct fm10k_hw *hw, u32 
**results,
  struct fm10k_mbx_info *mbx)
 {
struct fm10k_vf_info *vf_info = (struct fm10k_vf_info *)mbx;
-   int err = 0;
u8 mac[ETH_ALEN];
u32 *result;
+   int err = 0;
+   bool set;
u16 vlan;
u32 vid;
 
@@ -1186,19 +1205,21 @@ s32 fm10k_iov_msg_mac_vlan_pf(struct fm10k_hw *hw, u32 
**results,
if (err)
return err;
 
-   /* if VLAN ID is 0, set the default VLAN ID instead of 0 */
-   if (!vid || (vid == FM10K_VLAN_CLEAR)) {
-   if (vf_info->pf_vid)
-   vid |= vf_info->pf_vid;
-   else
-   vid |= vf_info->sw_vid;
-   } else if (vid != vf_info->pf_vid) {
+   /* verify upper 16 bits are zero */
+   if (vid >> 16)
return FM10K_ERR_PARAM;
-   }
+
+   set = !(vid & FM10K_VLAN_CLEAR);
+   vid &= ~FM10K_VLAN_CLEAR;
+
+   err = fm10k_iov_select_vid(vf_info, vid);
+   if (err < 0)
+   return err;
+   else
+   vid = err;
 
/* update VSI info for VF in regards to VLAN table */
-   err = hw->mac.ops.update_vlan(hw, vid, vf_info->vsi,
- !(vid & FM10K_VLAN_CLEAR));
+   err = hw->mac.ops.update_vlan(hw, vid, vf_info->vsi, set);
}
 
if (!err && !!results[FM10K_MAC_VLAN_MSG_MAC]) {
@@ -1214,19 +1235,18 @@ s32 fm10k_iov_msg_mac_vlan_pf(struct fm10k_hw *hw, u32 
**results,
memcmp(mac, vf_info->mac, ETH_ALEN))
return FM10K_ERR_PARAM;
 
-   /* if VLAN ID is 0, set the default VLAN ID instead of 0 */
-   if (!vlan || (vlan == FM10K_VLAN_CLEAR)) {
-   if (vf_info->pf_vid)
-   vlan |= vf_info->pf_vid;
-   else
-   vlan |= vf_info->sw_vid;
-   } else if (vf_info->pf_vid) {
-   return FM10K_ERR_PARAM;
-   }
+   set = !(vlan & FM10K_VLAN_CLEAR);
+   vlan &= ~FM10K_VLAN_CLEAR;
+
+   err = fm10k_iov_select_vid(vf_info, vlan);
+   if (err < 0)
+   return err;
+   else
+   vlan = err;
 
/* notify switch of request for new unicast address */
-   err = hw->mac.ops.update_uc_addr(hw, vf_info->glort, mac, vlan,
-!(vlan & FM10K_VLAN_CLE

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Tom Herbert

On Tue, Sep 15, 2015 at 5:03 PM, Eric Dumazet  wrote:
> On Tue, 2015-09-15 at 16:45 -0700, Tom Herbert wrote:
>> > +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
>> > +   skb->l4_hash)
>> > +   return skb->hash;
>> > +
>> > if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
>> > !bond_flow_dissect(bond, skb, &flow))
>> > return bond_eth_hash(skb);
>> >
>> >
>> Ugh, bond_flow_dissect is yet another instance of customized flow
>> dissection! We should really clean this up. I suggest that in cases
>> were we want L4 hash a call to skb_get_hash should suffice. We can
>> create skb_get_l3hash when caller explicitly wants an L3 hash-- this
>> would return skb->hash if it's valid and skb->l4_hash is not set, else
>> call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
>> normal hash over flow keys (don't save result in skb->hash in this
>> case).
>
> This code predates all the change you did recently ;)
>
A more fundamental question is whether we can eliminate some of these
hashing types (I see five of them in if_bonding.h). Is there any
substantial difference between this and IPv4/v6 ECMP routing such that
they shouldn't all have the same path selection modes?

Tom

> BTW, the simple xor weakness is showing up after
> our change favoring even ports at connect() time, for a bonding device
> with 2 or 4 slaves.
>
> (commit 07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5
> "tcp/dccp: try to not exhaust ip_local_port_range in connect()")
>
>
>
>

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Mahesh Bandewar

On Tue, Sep 15, 2015 at 4:20 PM, Eric Dumazet  wrote:
> On Tue, 2015-09-15 at 15:54 -0700, Mahesh Bandewar wrote:
>
>> > +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
>> > +   skb->l4_hash)
>> if ((ENCAP34 || LAYER34) && l4_hash) maybe?
>
> Hmm, traditional BOND_XMIT_POLICY_LAYER34 did not do a full flow dissection
> (tunnel awareness added in commit
> 32819dc1834866cb9547cb75f81af9edd58d33cd bonding: modify the old and add
> new xmit hash policies)
>
> This could radically change some setups and behavior.
>
> BOND_XMIT_POLICY_ENCAP34 looks a better fit to me.
>
Agreed, this would change flow distribution for the LAYER34 policy, but
leaving it out means LAYER34 keeps calculating the hash per packet, which
I think is unnecessary.

This elimination of hash calculation is a good step but I'm feeling
that it's somehow tied to ENCAP policy which is actually orthogonal
and should be applied to LAYER34 also. However if that change in the
behavior for LAYER34 is considered too drastic then I'm perfectly fine
tying it to ENCAP34 policy.

>

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Eric Dumazet

On Tue, 2015-09-15 at 16:45 -0700, Tom Herbert wrote:
> > +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> > +   skb->l4_hash)
> > +   return skb->hash;
> > +
> > if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
> > !bond_flow_dissect(bond, skb, &flow))
> > return bond_eth_hash(skb);
> >
> >
> Ugh, bond_flow_dissect is yet another instance of customized flow
> dissection! We should really clean this up. I suggest that in cases
> where we want L4 hash a call to skb_get_hash should suffice. We can
> create skb_get_l3hash when caller explicitly wants an L3 hash-- this
> would return skb->hash if it's valid and skb->l4_hash is not set, else
> call flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
> normal hash over flow keys (don't save result in skb->hash in this
> case).

This code predates all the change you did recently ;)

BTW, the simple xor weakness is showing up after
our change favoring even ports at connect() time, for a bonding device
with 2 or 4 slaves.
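
A toy model shows the kind of effect being described. The hash below is a
simplified xor-fold over addresses and ports, not the exact bonding code,
and the addresses, destination port 80 and the helper names are arbitrary
choices made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified xor-fold transmit hash; a model, not the kernel function. */
static uint32_t xor_fold_hash(uint32_t saddr, uint32_t daddr,
			      uint16_t sport, uint16_t dport)
{
	uint32_t hash = ((uint32_t)sport << 16) | dport;

	hash ^= saddr ^ daddr;
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash;
}

/*
 * Count the distinct slaves hit by `count` flows whose source ports start
 * at first_sport and advance by `step` (step 2 means even ports only).
 */
static unsigned int slaves_used(unsigned int n_slaves, uint16_t first_sport,
				unsigned int step, unsigned int count)
{
	uint32_t seen = 0;	/* bitmap of slave indexes that got a flow */
	unsigned int i, used = 0;

	assert(n_slaves > 0 && n_slaves <= 32);
	for (i = 0; i < count; i++) {
		uint16_t sport = (uint16_t)(first_sport + i * step);
		uint32_t h = xor_fold_hash(0x0a000001, 0x0a000002, sport, 80);

		seen |= 1u << (h % n_slaves);
	}
	for (i = 0; i < n_slaves; i++)
		used += (seen >> i) & 1u;

	return used;
}
```

In this model, 64 consecutive even source ports starting at 32768 all land
on one slave of a 2-slave bond and on only two slaves of a 4-slave bond,
while 64 consecutive ports of mixed parity reach every slave.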

(commit 07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5
"tcp/dccp: try to not exhaust ip_local_port_range in connect()")





Re: [PATCH 25/39] net: core: drop null test before destroy functions

2015-09-15 Thread David Miller

From: Julia Lawall 
Date: Sun, 13 Sep 2015 14:15:18 +0200

> Remove unneeded NULL test.
> 
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // <smpl>
> @@ expression x; @@
> -if (x != NULL) {
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>   x = NULL;
> -}
> // </smpl>
> 
> Signed-off-by: Julia Lawall 

Applied.

Re: [PATCH 34/39] dccp: drop null test before destroy functions

2015-09-15 Thread David Miller

From: Julia Lawall 
Date: Sun, 13 Sep 2015 14:15:27 +0200

> Remove unneeded NULL test.
> 
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // <smpl>
> @@
> expression x;
> @@
> 
> -if (x != NULL)
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
> 
> @@
> expression x;
> @@
> 
> -if (x != NULL) {
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>   x = NULL;
> -}
> // </smpl>
> 
> Signed-off-by: Julia Lawall 

Applied.
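
The premise of these patches is that the destroy helpers already treat a
NULL argument as a no-op, much like free(NULL), so the caller's test adds
nothing. A userspace sketch of that pattern follows; struct pool,
pool_destroy() and destroy_calls are hypothetical names used only for
illustration.

```c
#include <stdlib.h>

struct pool {			/* hypothetical stand-in for a cache or pool */
	int dummy;
};

static int destroy_calls;	/* counts real destroys, for illustration */

/* Destroy helper that tolerates NULL, so callers need no test of their own. */
static void pool_destroy(struct pool *p)
{
	if (!p)
		return;

	destroy_calls++;
	free(p);
}
```

A caller can then write pool_destroy(p); p = NULL; unconditionally, which
is the shape the semantic patch leaves behind.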

Re: [PATCH 10/39] atm: he: drop null test before destroy functions

2015-09-15 Thread David Miller

From: Julia Lawall 
Date: Sun, 13 Sep 2015 14:15:03 +0200

> Remove unneeded NULL test.
> 
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // <smpl>
> @@ expression x; @@
> -if (x != NULL)
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
> // </smpl>
> 
> Signed-off-by: Julia Lawall 

Applied.

Re: IPv6 routing/fragmentation panic

2015-09-15 Thread Florian Westphal

David Woodhouse  wrote:
> I can repeatably crash my router with 'ping6 -s 2000' to an external
> machine:
> [   61.741618] skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 
> head:dec98000 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
> [   61.758754] ------------[ cut here ]------------
> [   61.758754] Kernel BUG at c1201b1f [verbose debug info unavailable]
> [   61.764005] invalid opcode:  [#1] 
> [   61.764005] Modules linked in: sch_teql 8139cp mii iptable_nat pppoe 
> nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE 
> xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit 
> xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox 
> ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp 
> nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp 
> nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt 
> act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw 
> sch_hfsc sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT 
> nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle 
> ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm 
> geode_aes cbc arc4 aes_i586
> [   61.764005] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0+ #2
> [   61.764005] task: c138d540 ti: c1386000 task.ti: c1386000
> [   61.764005] EIP: 0060:[] EFLAGS: 00210286 CPU: 0
> [   61.764005] EIP is at skb_panic+0x3b/0x3d
> [   61.764005] EAX: 007c EBX: deca3000 ECX: c13a0910 EDX: c139f3c4
> [   61.764005] ESI: dee85d8c EDI: dec9800a EBP: defe3b40 ESP: dec0bd50
> [   61.764005]  DS: 007b ES: 007b FS:  GS:  SS: 0068
> [   61.764005] CR0: 8005003b CR2: b7704474 CR3: 1ef0d000 CR4: 0090
> [   61.764005] Stack:
> [   61.764005]  c135e48c c12e1580 c1277f1e 050e 000e dec98000 
> dec97ffc dec9850a
> [   61.764005]  dec98f40 deca3000 dee85d00 c120337b c12e1580 c1277f1e 
>  000e
> [   61.764005]  dee85d7c ff671e02 deca3000 c109afd3 00200282 1d91 
> 0028 dec98012
> [   61.764005] Call Trace:
> [   61.764005]  [] ? ip6_finish_output2+0x196/0x4da

Hmm, unlike the IPv4 stack, the IPv6 stack doesn't check the headroom size
before adding the hardware header (hh).

> But should the kernel *panic* without it? If there are requirements on
> the headroom I must leave on received packets, where are they
> documented? Or is this a bug in the IPv6 fragmentation code, to make
> such assumptions?

I'm not sure the ipv6 (re)fragmentation code is to blame here.
In particular, we could have setups where additional headers need to be
inserted which could also require headroom expansion.

> I'm not entirely sure how to interpret the above stack trace. Is the
> incoming IPv6 packet being reassembled for netfilter's benefit, then re
> -fragmented for transmission?

Yes, ipv6 connection tracking depends on defragmentation.

ip6_fragment should use the frag_list of the (reassembled) skb so no
refragmentation should be happening, we should just be re-using the
original fragmented skbs from that fraglist.

What I don't understand is why you see this with fragmented ipv6 packets only
(and not with all ipv6 forwarded skbs).

Something like this copy-pastry from ip_finish_output2 should fix it:

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -62,6 +62,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	struct net_device *dev = dst->dev;
 	struct neighbour *neigh;
 	struct in6_addr *nexthop;
+	unsigned int hh_len;
 	int ret;
 
 	skb->protocol = htons(ETH_P_IPV6);
@@ -104,6 +105,21 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
+	hh_len = LL_RESERVED_SPACE(dev);
+	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
+		struct sk_buff *skb2;
+
+		skb2 = skb_realloc_headroom(skb, hh_len);
+		if (!skb2) {
+			kfree_skb(skb);
+			return -ENOMEM;
+		}
+		if (skb->sk)
+			skb_set_owner_w(skb2, skb->sk);
+		consume_skb(skb);
+		skb = skb2;
+	}
+
 	rcu_read_lock_bh();
 	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
 	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
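
To make the headroom requirement concrete, here is a minimal userspace
sketch of the same check-then-reallocate pattern. It is not kernel code:
struct buf and the buf_* helpers are made-up stand-ins for sk_buff,
skb_headroom(), skb_realloc_headroom() and skb_push(), and the assert in
buf_push() plays the role of the check that ends in skb_under_panic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for an sk_buff: head, data and len only. */
struct buf {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* start of the packet payload */
	size_t len;		/* payload length */
};

/* Bytes available in front of the payload. */
static size_t buf_headroom(const struct buf *b)
{
	return (size_t)(b->data - b->head);
}

/* Allocate a zeroed buffer with the given headroom; 0 on success, -1 on failure. */
static int buf_init(struct buf *b, size_t headroom, size_t len)
{
	b->head = calloc(1, headroom + len);
	if (!b->head)
		return -1;
	b->data = b->head + headroom;
	b->len = len;
	return 0;
}

/*
 * Ensure at least `need` bytes of headroom, reallocating and copying the
 * payload if there is less. Returns 0 on success, -1 on allocation failure
 * (in which case the buffer is left untouched).
 */
static int buf_ensure_headroom(struct buf *b, size_t need)
{
	unsigned char *nhead;

	if (buf_headroom(b) >= need)
		return 0;
	nhead = malloc(need + b->len);
	if (!nhead)
		return -1;
	memcpy(nhead + need, b->data, b->len);
	free(b->head);
	b->head = nhead;
	b->data = nhead + need;
	return 0;
}

/* Prepend hdr_len bytes; the caller must have ensured the headroom first. */
static unsigned char *buf_push(struct buf *b, size_t hdr_len)
{
	assert(buf_headroom(b) >= hdr_len);	/* without this, data would drop below head */
	b->data -= hdr_len;
	b->len += hdr_len;
	return b->data;
}
```

In the trace above, data (dec97ffc) ends up 4 bytes below head (dec98000)
after a 14-byte push, i.e. only 10 bytes of headroom were available. Calling
buf_ensure_headroom(&b, 14) before buf_push(&b, 14) is the userspace
equivalent of what the patch adds.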
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Tom Herbert

On Tue, Sep 15, 2015 at 3:24 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> If skb carries a l4 hash, no need to perform a flow dissection.
>
> Performance is slightly better :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39012e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39393e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39988e+06
>
> After patch :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.43579e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44304e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44312e+06
>
> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> Cc: Mahesh Bandewar 
> ---
>  drivers/net/bonding/bond_main.c |4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 771a449..9250d1e 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3136,6 +3136,10 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
> struct flow_keys flow;
> u32 hash;
>
> +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> +   skb->l4_hash)
> +   return skb->hash;
> +
> if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
> !bond_flow_dissect(bond, skb, &flow))
> return bond_eth_hash(skb);
>
>
Ugh, bond_flow_dissect is yet another instance of customized flow
dissection! We should really clean this up. I suggest that in cases
where we want an L4 hash, a call to skb_get_hash should suffice. We can
create skb_get_l3hash for when the caller explicitly wants an L3 hash: this
would return skb->hash if it's valid and skb->l4_hash is not set, else
call the flow dissector with FLOW_DISSECTOR_F_STOP_AT_L3 and then do the
normal hash over the flow keys (don't save the result in skb->hash in this
case).
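
A rough userspace sketch of that skb_get_l3hash idea, under stated
assumptions: struct pkt, its hash_valid field and mix32() are invented for
illustration and only model the decision logic, they are not the kernel's
sk_buff, flow dissector or hash function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct pkt {
	uint32_t saddr, daddr;	/* L3 addresses */
	uint32_t hash;		/* cached hash */
	bool hash_valid;	/* placeholder for "skb->hash is valid" */
	bool l4_hash;		/* cached hash also covers L4 ports */
};

/* Toy mixer standing in for the real hash over flow keys. */
static uint32_t mix32(uint32_t a, uint32_t b)
{
	uint32_t h = a * 0x9e3779b1u;

	h ^= b + 0x7f4a7c15u + (h << 6) + (h >> 2);
	return h;
}

/* Return an L3-only hash without touching the cached one. */
static uint32_t pkt_get_l3hash(const struct pkt *p)
{
	assert(p);
	/* A valid cached hash that is not an L4 hash is already what we want. */
	if (p->hash_valid && !p->l4_hash)
		return p->hash;
	/* Otherwise "dissect" up to L3 only; the result is deliberately not stored. */
	return mix32(p->saddr, p->daddr);
}
```

The point of not saving the result is that a later caller asking for the
full L4 hash must not be handed the weaker L3 one.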

Tom

Re: [PATCH v2 1/3] net: irda: pxaficp_ir: use sched_clock() for time management

2015-09-15 Thread David Miller

From: Robert Jarzmik 
Date: Sat, 12 Sep 2015 13:45:22 +0200

> Instead of using directly the OS timer through direct register access,
> use the standard sched_clock(), which will end up in OSCR reading
> anyway.
> 
> This is a first step for direct access register removal and machine
> specific code removal from this driver.
> 
> Signed-off-by: Robert Jarzmik 

What is the granularity of the OSCR register?

If it is not nanoseconds, then you need to adjust calculations
such as this one:

> @@ -549,7 +548,7 @@ static int pxa_irda_hard_xmit(struct sk_buff *skb, struct net_device *dev)
>   skb_copy_from_linear_data(skb, si->dma_tx_buff, skb->len);
>  
>   if (mtt)
> - while ((unsigned)(readl_relaxed(OSCR) - si->last_oscr)/4 < mtt)
> + while ((sched_clock() - si->last_clk) / 4 < mtt)
>   cpu_relax();
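
To illustrate the unit problem, a small userspace sketch. The assumptions
are mine: mtt is taken to be in microseconds, the old "/ 4" is read as
converting ticks of a roughly 4 MHz timer to microseconds, and the helper
names are made up. sched_clock() itself returns nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Old scheme: timer ticks, with "/ 4" approximating ticks -> microseconds. */
static uint64_t oscr_ticks_to_us(uint32_t ticks)
{
	return ticks / 4;
}

/* New scheme: a nanosecond clock needs a different divisor. */
static uint64_t ns_to_us(uint64_t ns)
{
	return ns / NSEC_PER_USEC;
}

/* Nonzero once at least mtt_us microseconds have passed since last_ns. */
static int mtt_elapsed(uint64_t now_ns, uint64_t last_ns, uint64_t mtt_us)
{
	assert(now_ns >= last_ns);
	return ns_to_us(now_ns - last_ns) >= mtt_us;
}
```

Under those assumptions, keeping "/ 4" on a nanosecond value would end the
busy-wait after about 4 * mtt nanoseconds, roughly 250 times sooner than
intended.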

Re: [PATCH net-next 1/2] tcp: provide skb->hash to synack packets

2015-09-15 Thread Tom Herbert

On Tue, Sep 15, 2015 at 3:24 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf
> on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
>
> We'd like to provide one as well for SYNACK packets, so that all packets
> of a given flow share same txhash, to later enable bonding driver to
> also use skb->hash to perform slave selection.
>
> Note that a SYNACK retransmit shuffles the tx hash, as Tom did
> in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing
> advice") for established sockets.
>
> This has nice effect making TCP flows resilient to some kind of black
> holes, even at connection establish phase.
>
Acked-by: Tom Herbert 

> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> Cc: Mahesh Bandewar 
> ---
>  include/linux/tcp.h   |1 +
>  include/net/sock.h|   12 
>  net/ipv4/tcp_input.c  |1 +
>  net/ipv4/tcp_ipv4.c   |2 +-
>  net/ipv4/tcp_output.c |2 ++
>  net/ipv6/tcp_ipv6.c   |2 +-
>  6 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 48c3696..937b978 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -113,6 +113,7 @@ struct tcp_request_sock {
> struct inet_request_sock req;
> const struct tcp_request_sock_ops *af_specific;
> bool tfo_listener;
> +   u32 txhash;
> u32 rcv_isn;
> u32 snt_isn;
> u32 snt_synack; /* synack sent time */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7aa7844..94dff7f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1654,12 +1654,16 @@ static inline void sock_graft(struct sock *sk, struct socket *parent)
>  kuid_t sock_i_uid(struct sock *sk);
>  unsigned long sock_i_ino(struct sock *sk);
>
> -static inline void sk_set_txhash(struct sock *sk)
> +static inline u32 net_tx_rndhash(void)
>  {
> -   sk->sk_txhash = prandom_u32();
> +   u32 v = prandom_u32();
> +
> +   return v ?: 1;
> +}
>
> -   if (unlikely(!sk->sk_txhash))
> -   sk->sk_txhash = 1;
> +static inline void sk_set_txhash(struct sock *sk)
> +{
> +   sk->sk_txhash = net_tx_rndhash();
>  }
>
>  static inline void sk_rethink_txhash(struct sock *sk)
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a8f515b..a62e9c7 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6228,6 +6228,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
> }
>
> tcp_rsk(req)->snt_isn = isn;
> +   tcp_rsk(req)->txhash = net_tx_rndhash();
> tcp_openreq_init_rwin(req, sk, dst);
> fastopen = !want_cookie &&
>tcp_try_fastopen(sk, skb, req, &foc, dst);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 93898e0..d671d74 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1276,8 +1276,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
> newinet->mc_index = inet_iif(skb);
> newinet->mc_ttl   = ip_hdr(skb)->ttl;
> newinet->rcv_tos  = ip_hdr(skb)->tos;
> +   newsk->sk_txhash  = tcp_rsk(req)->txhash;
> inet_csk(newsk)->icsk_ext_hdr_len = 0;
> -   sk_set_txhash(newsk);
> if (inet_opt)
> inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
> newinet->inet_id = newtp->write_seq ^ jiffies;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f9a8a12..d0ad355 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2987,6 +2987,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
> rcu_read_lock();
> md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
>  #endif
> +   skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
> tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
>  foc) + sizeof(*th);
>
> @@ -3505,6 +3506,7 @@ int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
> struct flowi fl;
> int res;
>
> +   tcp_rsk(req)->txhash = net_tx_rndhash();
> res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
> if (!res) {
> TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 97d9314..f9c0e26 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1090,7 +1090,7 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
> newsk->sk_v6_rcv_saddr = ireq->ir_v6_loc_addr;
> newsk->sk_bound_dev_if = ireq->ir_iif;
>
> -   sk_set_txhash(newsk);
> +   newsk->sk_txhash = tcp_rsk(req)->txhash;
>
>

Re: [PATCH net v2] openvswitch: Fix mask generation for nested attributes.

2015-09-15 Thread David Miller

From: Jesse Gross 
Date: Fri, 11 Sep 2015 18:38:28 -0700

> Masks were added to OVS flows in a way that was backwards compatible
> with userspace programs that did not generate masks. As a result, it is
> possible that we may receive flows that do not have a mask and we need
> to synthesize one.
> 
> Generating a mask requires iterating over attributes and descending into
> nested attributes. For each level we need to know the size to generate the
> correct mask. We do this with a linked table of attribute types.
> 
> Although the logic to handle these nested attributes was there in concept,
> there are a number of bugs in practice. Examples include incomplete links
> between tables, variable length attributes being treated as nested and
> missing sanity checks.
> 
> Signed-off-by: Jesse Gross 
> ---
> v2: Fix whitespace errors.
> Add check for unknown bytes in VXLAN extensions.
> Factor out check for nested or variable attributes.

Applied, thanks.

Re: [PATCH] net: smc91x: convert pxa dma to dmaengine

2015-09-15 Thread David Miller

From: Robert Jarzmik 
Date: Thu, 10 Sep 2015 21:26:04 +0200

> Convert the dma transfers to be dmaengine based, now pxa has a dmaengine
> slave driver. This makes this driver a bit more PXA agnostic.
> 
> The driver was tested on pxa27x (mainstone) and pxa310 (zylonite),
> ie. only pxa platforms.
> 
> Signed-off-by: Robert Jarzmik 
> Cc: Russell King 
> Cc: Arnd Bergmann 
> ---
> This has potential to break other platform such as Neponset, Idp,
> halibut and qsd8x50, so I added Russell and Arnd as they were discussing
> smc91x support last February.

Is someone testing whether such platforms break or not?  I'm waiting for
that before I consider applying this patch.

Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Eric Dumazet

On Tue, 2015-09-15 at 15:54 -0700, Mahesh Bandewar wrote:

> > +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> > +   skb->l4_hash)
> if ((ENCAP34 || LAYER34) && l4_hash) maybe?

Hmm, traditional BOND_XMIT_POLICY_LAYER34 did not do a full flow dissection
(tunnel awareness was added in commit
32819dc1834866cb9547cb75f81af9edd58d33cd "bonding: modify the old and add
new xmit hash policies").

This could radically change some setups and behavior.

BOND_XMIT_POLICY_ENCAP34 looks like a better fit to me.



Re: [PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Mahesh Bandewar

On Tue, Sep 15, 2015 at 3:24 PM, Eric Dumazet  wrote:
>
> From: Eric Dumazet 
>
> If skb carries a l4 hash, no need to perform a flow dissection.
>
> Performance is slightly better :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39012e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39393e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.39988e+06
>
> After patch :
>
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.43579e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44304e+06
> lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
> 2.44312e+06
>
> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> Cc: Mahesh Bandewar 
> ---
>  drivers/net/bonding/bond_main.c |4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 771a449..9250d1e 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3136,6 +3136,10 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
> struct flow_keys flow;
> u32 hash;
>
> +   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
> +   skb->l4_hash)
if ((ENCAP34 || LAYER34) && l4_hash) maybe?

>
> +   return skb->hash;
> +
> if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
> !bond_flow_dissect(bond, skb, &flow))
> return bond_eth_hash(skb);
>
>

Re: [PATCH v3 net-next 1/2] ipv4: L3 hash-based multipath

2015-09-15 Thread Peter Nørlund

On Tue, 15 Sep 2015 14:40:48 -0700
Alexander Duyck  wrote:

> On 09/15/2015 01:29 PM, Peter Nørlund wrote:
> > Replaces the per-packet multipath with a hash-based multipath using
> > source and destination address.
> >
> > Signed-off-by: Peter Nørlund 
> > ---
> >   include/net/ip_fib.h |  11 ++--
> >   net/ipv4/fib_semantics.c | 137 +--
> >   net/ipv4/route.c |  23 +++-
> >   3 files changed, 102 insertions(+), 69 deletions(-)
> >
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index a37d043..c335dd4 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -79,7 +79,7 @@ struct fib_nh {
> > unsigned char   nh_scope;
> >   #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > int nh_weight;
> > -   int nh_power;
> > +   atomic_t nh_upper_bound;
> >   #endif
> >   #ifdef CONFIG_IP_ROUTE_CLASSID
> > __u32   nh_tclassid;
> > @@ -118,7 +118,7 @@ struct fib_info {
> >   #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
> > int fib_nhs;
> >   #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -   int fib_power;
> > +   int fib_weight;
> >   #endif
> > struct rcu_head rcu;
> > struct fib_nh   fib_nh[0];
> > @@ -312,7 +312,12 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
> >  int fib_sync_down_dev(struct net_device *dev, unsigned long event);
> >  int fib_sync_down_addr(struct net *net, __be32 local);
> >  int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
> > -void fib_select_multipath(struct fib_result *res);
> > +
> > +extern u32 fib_multipath_secret __read_mostly;
> > +
> > +typedef int (*multipath_hash_func_t)(void *ctx);
> > +void fib_select_multipath(struct fib_result *res,
> > + multipath_hash_func_t hash_func, void *ctx);
> >
> >   /* Exported by fib_trie.c */
> >   void fib_trie_init(void);
> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> > index 064bd3c..64d3e0e 100644
> > --- a/net/ipv4/fib_semantics.c
> > +++ b/net/ipv4/fib_semantics.c
> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
> >   static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
> >
> >   #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -
> > -static DEFINE_SPINLOCK(fib_multipath_lock);
> > +u32 fib_multipath_secret __read_mostly;
> >
> >   #define for_nexthops(fi) {   \
> > int nhsel; const struct fib_nh *nh;  \
> > @@ -468,6 +467,55 @@ static int fib_count_nexthops(struct rtnexthop *rtnh, int remaining)
> > return remaining > 0 ? 0 : nhs;
> >  }
> >
> > +static void fib_rebalance(struct fib_info *fi)
> > +{
> > +   int total;
> > +   int w;
> > +   struct in_device *in_dev;
> > +
> > +   if (fi->fib_nhs < 2)
> > +   return;
> > +
> > +   total = 0;
> > +   for_nexthops(fi) {
> > +   if (nh->nh_flags & RTNH_F_DEAD)
> > +   continue;
> > +
> > +   in_dev = __in_dev_get_rcu(nh->nh_dev);
> > +
> > +   if (in_dev &&
> > +   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > +   nh->nh_flags & RTNH_F_LINKDOWN)
> > +   continue;
> > +
> > +   total += nh->nh_weight;
> > +   } endfor_nexthops(fi);
> > +
> > +   w = 0;
> > +   change_nexthops(fi) {
> > +   int upper_bound;
> > +
> > +   in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> > +
> > +   if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
> > +   upper_bound = -1;
> > +   } else if (in_dev &&
> > +  IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > +  nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
> > +   upper_bound = -1;
> > +   } else {
> > +   w += nexthop_nh->nh_weight;
> > +   upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w,
> > +   total) - 1;
> > +   }
> > +
> > +   atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
> > +   } endfor_nexthops(fi);
> > +
> > +   net_get_random_once(&fib_multipath_secret,
> > +   sizeof(fib_multipath_secret));
> > +}
> > +
> >   static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
> >  int remaining, struct fib_config *cfg)
> >   {
> > @@ -1094,8 +1142,15 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
> >
> > change_nexthops(fi) {
> > fib_info_update_nh_saddr(net, nexthop_nh);
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +   fi->fib_weight += nexthop_nh->nh_weight;
> > +#endif
> > } endfor_nexthops(fi)
> >
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +   fib_rebalance(fi);
> > +#endif
> > +
> >   link_it:
> > ofi = fib_find_info(fi);
> > if (ofi) {
> > @@ -1317,12 +1372,6 @@ int fib_sync_down_dev(struct net_device
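
The upper-bound scheme in the quoted fib_rebalance() can be tried out in
userspace. The sketch below is an approximation under stated assumptions:
struct nexthop, NH_DEAD and both functions are made-up stand-ins, the
link-down handling is left out, and the selection rule (first nexthop whose
bound is not below the 31-bit hash) is my reading of how the bounds are meant
to be used, since the selection loop is not part of the quoted text.

```c
#include <assert.h>
#include <stdint.h>

#define NH_DEAD 0x1	/* hypothetical flag standing in for RTNH_F_DEAD */

struct nexthop {
	int weight;
	unsigned int flags;
	int32_t upper_bound;	/* -1 means "never selected" */
};

/* Rounded division for non-negative operands, like DIV_ROUND_CLOSEST(). */
static int64_t div_round_closest(int64_t x, int64_t d)
{
	return (x + d / 2) / d;
}

/* Split the range [0, 2^31) between live nexthops in proportion to weight. */
static void rebalance(struct nexthop *nh, int n)
{
	int64_t total = 0, w = 0;
	int i;

	for (i = 0; i < n; i++)
		if (!(nh[i].flags & NH_DEAD))
			total += nh[i].weight;

	for (i = 0; i < n; i++) {
		if (nh[i].flags & NH_DEAD) {
			nh[i].upper_bound = -1;
			continue;
		}
		assert(total > 0);	/* live nexthops must carry positive weight */
		w += nh[i].weight;
		nh[i].upper_bound =
			(int32_t)(div_round_closest(2147483648LL * w, total) - 1);
	}
}

/* Assumed selection rule: first nexthop whose bound covers the hash, or -1. */
static int select_nexthop(const struct nexthop *nh, int n, uint32_t hash)
{
	int32_t h = (int32_t)(hash & 0x7fffffff);
	int i;

	for (i = 0; i < n; i++)
		if (h <= nh[i].upper_bound)
			return i;
	return -1;
}
```

With weights 3 and 1 the bounds come out as 1610612735 and 2147483647, so
about three quarters of the hash space maps to the first nexthop. A dead
nexthop gets -1 and can never match a non-negative hash.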

Re: [PATCH v3 net-next] rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats

2015-09-15 Thread David Miller

From: Sowmini Varadhan 
Date: Fri, 11 Sep 2015 16:48:48 -0400

> 
> Many commonly used functions like getifaddrs() invoke RTM_GETLINK
> to dump the interface information, and do not need the
> AF_INET6 statistics that are always returned by default
> from rtnl_fill_ifinfo().
> 
> Computing the statistics can be an expensive operation that impacts
> scaling, so it is desirable to avoid this if the information is
> not needed.
> 
> This patch adds the RTEXT_FILTER_SKIP_STATS extended info flag that
> can be passed with netlink_request() to avoid statistics computation
> for the ifinfo path.
> 
> Signed-off-by: Sowmini Varadhan 
> ---
> v2: David Miller comments: pass u32 ext_filter_mask down.
> v3: non-RFC version of v2.

Applied, with one minor change:

> + if (!!(ext_filter_mask & RTEXT_FILTER_SKIP_STATS))

I got rid of the "!!" as it really isn't needed for an expression
like this.
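
For illustration, a small userspace sketch of why the "!!" is redundant in
a condition and where it would matter. FILTER_SKIP_STATS is a made-up bit
standing in for RTEXT_FILTER_SKIP_STATS, and the helper names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define FILTER_SKIP_STATS (1u << 3)	/* hypothetical bit for this sketch */

/* Condition written with "!!". */
static int skip_with_bangbang(uint32_t mask)
{
	if (!!(mask & FILTER_SKIP_STATS))
		return 1;
	return 0;
}

/* Same condition without it: any non-zero value is already "true". */
static int skip_plain(uint32_t mask)
{
	if (mask & FILTER_SKIP_STATS)
		return 1;
	return 0;
}

/*
 * Where "!!" does matter: squeezing a flag test into a narrow type.
 * Without it, a bit above bit 7 would be truncated away.
 */
static unsigned char flag_as_byte(uint32_t mask, uint32_t bit)
{
	assert(bit != 0);
	return !!(mask & bit);
}
```

Both skip_* functions agree for every mask; flag_as_byte(0x100, 0x100)
yields 1, whereas a plain cast of 0x100 to unsigned char would yield 0.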

Thanks!

[PATCH net-next 1/2] tcp: provide skb->hash to synack packets

2015-09-15 Thread Eric Dumazet

From: Eric Dumazet 

In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf
on xmit"), Tom provided a l4 hash to most outgoing TCP packets.

We'd like to provide one as well for SYNACK packets, so that all packets
of a given flow share the same txhash, to later enable the bonding driver
to also use skb->hash to perform slave selection.

Note that a SYNACK retransmit shuffles the tx hash, as Tom did
in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing
advice") for established sockets.

This has the nice effect of making TCP flows resilient to some kinds of
black holes, even during the connection establishment phase.

Signed-off-by: Eric Dumazet 
Cc: Tom Herbert 
Cc: Mahesh Bandewar 
---
 include/linux/tcp.h   |1 +
 include/net/sock.h|   12 
 net/ipv4/tcp_input.c  |1 +
 net/ipv4/tcp_ipv4.c   |2 +-
 net/ipv4/tcp_output.c |2 ++
 net/ipv6/tcp_ipv6.c   |2 +-
 6 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 48c3696..937b978 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -113,6 +113,7 @@ struct tcp_request_sock {
struct inet_request_sock req;
const struct tcp_request_sock_ops *af_specific;
bool tfo_listener;
+   u32 txhash;
u32 rcv_isn;
u32 snt_isn;
u32 snt_synack; /* synack sent time */
diff --git a/include/net/sock.h b/include/net/sock.h
index 7aa7844..94dff7f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1654,12 +1654,16 @@ static inline void sock_graft(struct sock *sk, struct socket *parent)
 kuid_t sock_i_uid(struct sock *sk);
 unsigned long sock_i_ino(struct sock *sk);
 
-static inline void sk_set_txhash(struct sock *sk)
+static inline u32 net_tx_rndhash(void)
 {
-   sk->sk_txhash = prandom_u32();
+   u32 v = prandom_u32();
+
+   return v ?: 1;
+}
 
-   if (unlikely(!sk->sk_txhash))
-   sk->sk_txhash = 1;
+static inline void sk_set_txhash(struct sock *sk)
+{
+   sk->sk_txhash = net_tx_rndhash();
 }
 
 static inline void sk_rethink_txhash(struct sock *sk)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a8f515b..a62e9c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6228,6 +6228,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
}
 
tcp_rsk(req)->snt_isn = isn;
+   tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
fastopen = !want_cookie &&
   tcp_try_fastopen(sk, skb, req, &foc, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 93898e0..d671d74 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1276,8 +1276,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
newinet->mc_index = inet_iif(skb);
newinet->mc_ttl   = ip_hdr(skb)->ttl;
newinet->rcv_tos  = ip_hdr(skb)->tos;
+   newsk->sk_txhash  = tcp_rsk(req)->txhash;
inet_csk(newsk)->icsk_ext_hdr_len = 0;
-   sk_set_txhash(newsk);
if (inet_opt)
inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
newinet->inet_id = newtp->write_seq ^ jiffies;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f9a8a12..d0ad355 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2987,6 +2987,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
rcu_read_lock();
md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
+   skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
 foc) + sizeof(*th);
 
@@ -3505,6 +3506,7 @@ int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
struct flowi fl;
int res;
 
+   tcp_rsk(req)->txhash = net_tx_rndhash();
res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
if (!res) {
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 97d9314..f9c0e26 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1090,7 +1090,7 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
newsk->sk_v6_rcv_saddr = ireq->ir_v6_loc_addr;
newsk->sk_bound_dev_if = ireq->ir_iif;
 
-   sk_set_txhash(newsk);
+   newsk->sk_txhash = tcp_rsk(req)->txhash;
 
/* Now IPv6 options...
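
As a reading aid, the life cycle of the hash in this patch can be modelled
in userspace. This is only a sketch under assumptions: struct req and
struct child are made-up stand-ins for tcp_request_sock and the child sock,
and rand() stands in for prandom_u32().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for tcp_request_sock and the child socket. */
struct req { uint32_t txhash; };
struct child { uint32_t txhash; };

/* Random non-zero hash; rand() stands in for prandom_u32(). */
static uint32_t tx_rndhash(void)
{
	uint32_t v = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

	return v ? v : 1;	/* portable spelling of the GNU "v ?: 1" */
}

/* SYN received: the request gets its hash before the first SYNACK. */
static void req_init(struct req *r)
{
	r->txhash = tx_rndhash();
}

/* SYNACK retransmit: pick a new hash so the packet may take another path. */
static uint32_t req_rtx_synack(struct req *r)
{
	r->txhash = tx_rndhash();
	return r->txhash;
}

/* Handshake completes: the child inherits the hash instead of drawing one. */
static void child_from_req(struct child *c, const struct req *r)
{
	assert(r->txhash != 0);	/* 0 is reserved for "no hash" */
	c->txhash = r->txhash;
}
```

The child therefore keeps whichever hash the last SYNACK was sent with, so
the SYNACK and the later data packets of the flow share one value.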
 



[PATCH net-next 2/2] bonding: use l4 hash if available

2015-09-15 Thread Eric Dumazet

From: Eric Dumazet 

If skb carries a l4 hash, no need to perform a flow dissection.

Performance is slightly better :

lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39012e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39393e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39988e+06

After patch :

lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.43579e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.44304e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.44312e+06

Signed-off-by: Eric Dumazet 
Cc: Tom Herbert 
Cc: Mahesh Bandewar 
---
 drivers/net/bonding/bond_main.c |4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 771a449..9250d1e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3136,6 +3136,10 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
struct flow_keys flow;
u32 hash;
 
+   if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
+   skb->l4_hash)
+   return skb->hash;
+
if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
!bond_flow_dissect(bond, skb, &flow))
return bond_eth_hash(skb);
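
The shortcut can be modelled in userspace as follows. This is a simplified
sketch, not the driver: the enum, struct pkt and slow_hash() are made-up
stand-ins, and the separate L2 path of the real function is folded into the
slow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names mirroring some of the BOND_XMIT_POLICY_* values. */
enum xmit_policy { POLICY_LAYER2, POLICY_LAYER34, POLICY_ENCAP34 };

struct pkt {
	uint32_t hash;		/* hash already carried by the packet */
	bool l4_hash;		/* that hash covers L4 */
	uint32_t dissected;	/* what a flow dissection would yield */
};

static int dissect_calls;	/* counts trips through the slow path */

/* Stand-in for flow dissection followed by hashing. */
static uint32_t slow_hash(const struct pkt *p)
{
	dissect_calls++;
	return p->dissected;
}

static uint32_t xmit_hash(enum xmit_policy policy, const struct pkt *p)
{
	assert(p);
	/* Fast path: only for the encap-aware policy, and only with an L4 hash. */
	if (policy == POLICY_ENCAP34 && p->l4_hash)
		return p->hash;
	return slow_hash(p);
}
```

Restricting the fast path to one policy means the other policies keep
computing exactly the hash they computed before the patch.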



[PATCH] net: Fix vti use case with oif in dst lookups

2015-09-15 Thread David Ahern

Steffen reported that the recent change to add oif to dst lookups breaks
the VTI use case. The problem is that with the oif set in the flow struct
the comparison to the nh_oif is triggered. Fix by splitting the
FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
nh oif (FLOWI_FLAG_SKIP_NH_OIF).

Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")

Signed-off-by: David Ahern 
---
IPv6 does not show this problem for me. So no change is added for IPv6.
If your mileage varies let me know and I'll take another look.

 drivers/net/vrf.c   | 3 ++-
 include/net/flow.h  | 1 +
 include/net/route.h | 2 +-
 net/ipv4/fib_trie.c | 2 +-
 net/ipv4/udp.c  | 3 ++-
 net/ipv4/xfrm4_policy.c | 2 ++
 6 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index e7094fbd7568..488c6f50df73 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -193,7 +193,8 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
.flowi4_oif = vrf_dev->ifindex,
.flowi4_iif = LOOPBACK_IFINDEX,
.flowi4_tos = RT_TOS(ip4h->tos),
-   .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC,
+   .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC |
+   FLOWI_FLAG_SKIP_NH_OIF,
.daddr = ip4h->daddr,
};
 
diff --git a/include/net/flow.h b/include/net/flow.h
index acd6a096250e..9b85db85f13c 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -35,6 +35,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC  0x01
 #define FLOWI_FLAG_KNOWN_NH 0x02
 #define FLOWI_FLAG_VRFSRC  0x04
+#define FLOWI_FLAG_SKIP_NH_OIF 0x08
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index cc61cb95f059..f46af256880c 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -255,7 +255,7 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
flow_flags |= FLOWI_FLAG_ANYSRC;
 
if (netif_index_is_vrf(sock_net(sk), oif))
-   flow_flags |= FLOWI_FLAG_VRFSRC;
+   flow_flags |= FLOWI_FLAG_VRFSRC | FLOWI_FLAG_SKIP_NH_OIF;
 
flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
   protocol, flow_flags, dst, src, dport, sport);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 26d6ffb6d23c..6c2af797f2f9 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1426,7 +1426,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
nh->nh_flags & RTNH_F_LINKDOWN &&
!(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
continue;
-   if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
if (flp->flowi4_oif &&
flp->flowi4_oif != nh->nh_oif)
continue;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c0a15e7f359f..f7d1d5e19e95 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1024,7 +1024,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
if (netif_index_is_vrf(net, ipc.oif)) {
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  (flow_flags | FLOWI_FLAG_VRFSRC),
+  (flow_flags | FLOWI_FLAG_VRFSRC |
+   FLOWI_FLAG_SKIP_NH_OIF),
   faddr, saddr, dport,
   inet->inet_sport);
 
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index bb919b28619f..c10a9ee68433 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -33,6 +33,8 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
if (saddr)
fl4->saddr = saddr->a4;
 
+   fl4->flowi4_flags = FLOWI_FLAG_SKIP_NH_OIF;
+
rt = __ip_route_output_key(net, fl4);
if (!IS_ERR(rt))
return &rt->dst;
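
For reference, the effect of the flag split on the nexthop test can be
tried in userspace. The flag values are the ones from the flow.h hunk
above; nh_oif_ok() is a made-up, simplified version of the oif comparison
in fib_table_lookup() after this patch, not the kernel function itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as in the include/net/flow.h hunk. */
#define FLOWI_FLAG_ANYSRC	0x01
#define FLOWI_FLAG_KNOWN_NH	0x02
#define FLOWI_FLAG_VRFSRC	0x04
#define FLOWI_FLAG_SKIP_NH_OIF	0x08

/* True if a nexthop with nh_oif is acceptable for this flow (sketch only). */
static bool nh_oif_ok(unsigned int flow_flags, int flow_oif, int nh_oif)
{
	assert(flow_oif >= 0 && nh_oif >= 0);
	/* Callers such as VTI/xfrm and VRF ask to skip the comparison. */
	if (flow_flags & FLOWI_FLAG_SKIP_NH_OIF)
		return true;
	/* Otherwise an oif set in the flow must match the nexthop's oif. */
	return !flow_oif || flow_oif == nh_oif;
}
```

Note that in this model FLOWI_FLAG_VRFSRC alone no longer skips the
comparison, which is why the VRF call sites in the patch now pass both
flags.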

Re: [PATCH v2] net: stmmac: Use msleep rather then udelay for reset delay

2015-09-15 Thread David Miller

From: Sjoerd Simons 
Date: Fri, 11 Sep 2015 22:25:48 +0200

> The reset delays used for stmmac are in the order of 10ms to 1 second,
> which is far too long for udelay usage, so switch to using msleep.
> 
> Practically this fixes the PHY not being reliably detected in some cases
> as udelay wouldn't actually delay for long enough to let the phy
> reliably be reset.
> 
> Signed-off-by: Sjoerd Simons 

Applied, thanks.
