date:20170731

[GIT] Networking

2017-07-31 Thread David Miller


1) Handle notifier registry failures properly in tun/tap driver, from
   Tonghao Zhang.

2) Fix bpf verifier handling of subtraction bounds and add a testcase
   for this, from Edward Cree.

3) Increase reset timeout in ftgmac100 driver, from Ben Herrenschmidt.

4) Fix use after free in prb_retire_rx_blk_timer_expired() in AF_PACKET,
   from Cong Wang.

5) Fix SELinux regression due to recent UDP optimizations, from Paolo
   Abeni.

6) We accidentally increment IPSTATS_MIB_FRAGFAILS in the ipv6 code paths,
   fix from Stefano Brivio.

7) Fix some mem leaks in dccp, from Xin Long.

8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd
   Bergmann.

9) Mac address length check and buffer size fixes from Cong Wang.

10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni.

11) Fix return value when copy_from_user() fails in
    bpf_prog_get_info_by_fd(), from Daniel Borkmann.

12) Handle PHY_HALTED properly in phy library state machine, from
    Florian Fainelli.

13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel.

14) Fix truesize calculation in virtio_net which led to performance
    regressions, from Michael S. Tsirkin.

Please pull, thanks a lot!

The following changes since commit 96080f697786e0a30006fcbcc5b53f350fcb3e9f:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-07-20 16:33:39 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to cc75f8514db6a3aec517760fccaf954e5b46478c:

  samples/bpf: fix bpf tunnel cleanup (2017-07-31 22:02:47 -0700)


Alex Vesker (1):
  net/mlx5e: IPoIB, Modify add/remove underlay QPN flows

Arend Van Spriel (2):
  brcmfmac: fix regression in brcmf_sdio_txpkt_hdalign()
  brcmfmac: fix memleak due to calling brcmf_sdiod_sgtable_alloc() twice

Arnd Bergmann (3):
  net: phy: rework Kconfig settings for MDIO_BUS
  phy: bcm-ns-usb3: fix MDIO_BUS dependency
  tcp: avoid bogus gcc-7 array-bounds warning

Aviv Heller (1):
  net/mlx5: Consider tx_enabled in all modes on remap

Benjamin Herrenschmidt (2):
  ftgmac100: Increase reset timeout
  ftgmac100: Make the MDIO bus a child of the ethernet device

Colin Ian King (1):
  net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt"

Dan Carpenter (1):
  iwlwifi: missing error code in iwl_trans_pcie_alloc()

Daniel Borkmann (2):
  bpf: don't indicate success when copy_from_user fails
  bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len

Daniel Stone (1):
  brcmfmac: Don't grow SKB by negative size

David S. Miller (4):
  Merge branch 'bpf-fix-verifier-min-max-handling-in-BPF_SUB'
  Merge tag 'wireless-drivers-for-davem-2017-07-21' of git://git.kernel.org/.../kvalo/wireless-drivers
  Merge tag 'mlx5-fixes-2017-07-27-V2' of git://git.kernel.org/.../saeed/linux
  Merge tag 'wireless-drivers-for-davem-2017-07-28' of git://git.kernel.org/.../kvalo/wireless-drivers

Edward Cree (2):
  selftests/bpf: subtraction bounds test
  bpf/verifier: fix min/max handling in BPF_SUB

Emmanuel Grumbach (3):
  iwlwifi: dvm: prevent an out of bounds access
  iwlwifi: mvm: fix a NULL pointer dereference of error in recovery
  iwlwifi: fix tracing when tx only is enabled

Eran Ben Elisha (1):
  net/mlx5: Clean SRIOV eswitch resources upon VF creation failure

Eugenia Emantayev (7):
  net/mlx5: Fix mlx5_ifc_mtpps_reg_bits structure size
  net/mlx5e: Add field select to MTPPS register
  net/mlx5e: Fix broken disable 1PPS flow
  net/mlx5e: Change 1PPS out scheme
  net/mlx5e: Add missing support for PTP_CLK_REQ_PPS request
  net/mlx5e: Fix wrong delay calculation for overflow check scheduling
  net/mlx5e: Schedule overflow check work to mlx5e workqueue

Florian Fainelli (4):
  net: dsa: Initialize ds->cpu_port_mask earlier
  net: phy: Correctly process PHY_HALTED in phy_stop_machine()
  MAINTAINERS: Add more files to the PHY LIBRARY section
  Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"

Gao Feng (1):
  ppp: Fix a scheduling-while-atomic bug in del_chan

Ido Schimmel (2):
  mlxsw: spectrum_router: Don't offload routes next in list
  ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()

Ilan Tayari (1):
  net/mlx5e: Fix outer_header_zero() check size

Jakub Kicinski (1):
  bpf: don't zero out the info struct in bpf_obj_get_info_by_fd()

Jason Wang (1):
  Revert "vhost: cache used event for better performance"

Joel Stanley (1):
  ftgmac100: return error in ftgmac100_alloc_rx_buf

Johannes Berg (1):
  iwlwifi: mvm: defer setting IWL_MVM_STATUS_IN_HW_RESTART

Kalle Valo (1):
  Merge tag 'iwlwifi-for-kalle-2017-07-21' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes

Larry Finger (1):
  Revert "rtlwifi: btcoex: rtl8723be: fix ant_sel not

Re: [PATCH net] samples/bpf: fix bpf tunnel cleanup

2017-07-31 Thread David Miller

From: William Tu 
Date: Mon, 31 Jul 2017 14:40:50 -0700

> test_tunnel_bpf.sh fails to remove the vxlan11 tunnel device, causing the
> next geneve tunnelling test case to fail.  In addition, the geneve reserved bit
> in tcbpf2_kern.c should be zero, according to the RFC.
> 
> Signed-off-by: William Tu 

Applied, thank you.

Re: [PATCH net] udp6: fix jumbogram reception

2017-07-31 Thread David Miller

From: Paolo Abeni 
Date: Mon, 31 Jul 2017 16:52:36 +0200

> Since commit 67a51780aebb ("ipv6: udp: leverage scratch area
> helpers") udp6_recvmsg() reads the skb len from the scratch area,
> to avoid a cache miss.
> But the UDP6 rx path supports RFC 2675 UDPv6 jumbograms, and their
> length exceeds the 16 bits available in the scratch area. As a side
> effect the length returned by recvmsg() is:
>  <actual length> % (1<<16)
> 
> This commit addresses the issue allocating one more bit in the
> IP6CB flags field and setting it for incoming jumbograms.
> Such field is still in the first cacheline, so at recvmsg()
> time we can check it and fallback to access skb->len if
> required, without a measurable overhead.
> 
> Fixes: 67a51780aebb ("ipv6: udp: leverage scratch area helpers")
> Signed-off-by: Paolo Abeni 

Applied, thanks Paolo.
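
For readers following the jumbogram fix above, here is a minimal userspace sketch of the idea, not the kernel code. The struct and function names are hypothetical stand-ins: a 16-bit cached length, a full-width length, and a flag set on receive that tells the reader when the cache cannot be trusted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-packet state; names are illustrative. */
struct pkt {
    uint32_t len;         /* full length, analogous to skb->len */
    uint16_t scratch_len; /* cached 16-bit copy, analogous to the scratch area */
    bool     jumbo;       /* set on receive when the length exceeds 16 bits */
};

/* Fill the cached fields when the packet is received. */
void pkt_set_len(struct pkt *p, uint32_t len)
{
    p->len = len;
    p->scratch_len = (uint16_t)len; /* keeps only len % (1 << 16) */
    p->jumbo = len > UINT16_MAX;
}

/* Pre-fix behaviour: trust the 16-bit cache unconditionally. */
uint32_t pkt_len_cached_only(const struct pkt *p)
{
    return p->scratch_len;
}

/* Post-fix behaviour: fall back to the full length for jumbograms. */
uint32_t pkt_len(const struct pkt *p)
{
    return p->jumbo ? p->len : p->scratch_len;
}
```

With a 70000-byte payload the cached-only path reports 4464 (70000 % 65536), while the flag-checked path reports 70000. Ordinary packets take the cached path in both versions.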

Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan

2017-07-31 Thread David Miller

From: gfree.w...@vip.163.com
Date: Mon, 31 Jul 2017 18:07:38 +0800

> From: Gao Feng 
> 
> PPTP sets pptp_sock_destruct as the sock's sk_destruct, which
> triggers this bug when __sk_free is invoked in atomic context, because of
> the call path pptp_sock_destruct->del_chan->synchronize_rcu.
> 
> Now move the synchronize_rcu from del_chan to pptp_release. This is the
> only case that frees the sock and needs the synchronize_rcu.
> 
> The following is the panic I hit with kernel 3.3.8, but judging from the
> code this issue should exist in the current kernel too.
 ...
> Signed-off-by: Gao Feng 

Applied, thanks.

Re: [patch net-next 09/20] net: sched: convert actions array into rcu list

2017-07-31 Thread Jiri Pirko

Mon, Jul 31, 2017 at 11:07:13PM CEST, xiyou.wangc...@gmail.com wrote:
>On Fri, Jul 28, 2017 at 7:40 AM, Jiri Pirko  wrote:
>> From: Jiri Pirko 
>>
>> Currently the actions are stored in array with array size. To traverse
>> this array in fastpath, tcf_tree_lock is taken to protect it. Convert
>> the array into a singly linked list, similar to the filter chains style
>> and allow traversal protected by rcu.
>
>Did you read commit 22dc13c837c33207548c8ee5116 ?
>
>An action can't be shared by multiple filters if you put them
>in a list (no matter singly or double), this is why I use pointers.

All right. Will check it out.

Re: [patch net-next 04/20] net: sched: use tcf_exts_has_actions in tcf_exts_exec

2017-07-31 Thread Jiri Pirko

Mon, Jul 31, 2017 at 10:37:21PM CEST, xiyou.wangc...@gmail.com wrote:
>On Fri, Jul 28, 2017 at 7:40 AM, Jiri Pirko  wrote:
>> +static inline int
>> +tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
>> + struct tcf_result *res)
>> +{
>> +#ifdef CONFIG_NET_CLS_ACT
>> +   if (tcf_exts_has_actions(exts))
>> +   return tcf_action_exec(skb, exts->actions, exts->nr_actions,
>> +  res);
>> +#endif
>> +   return 0;
>> +}
>
>
>While you are on it, can we get rid of this macro too?
>
>tcf_action_exec() is only defined with CONFIG_NET_CLS_ACT,
>not sure if compiler is kind enough to eliminate the false branch
>for us:
>
>if (false)
>return tcf_action_exec(...); // not defined but the branch is dead
>
>At least you can add a wrapper for tcf_action_exec() to just
>return 0.

Did you see?
net: sched: remove check for number of actions in tcf_exts_exec

I will add a static inline stub for tcf_action_exec for the case where
CONFIG_NET_CLS_ACT is not set.
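
A minimal sketch of the stub pattern being discussed, for illustration only. All names here (DEMO_CLS_ACT, demo_action, demo_action_exec, demo_exts_exec) are hypothetical and stand in for the kernel symbols; the point is only that the caller needs no #ifdef once a stub exists for the compiled-out case.

```c
#include <stddef.h>

/* Hypothetical miniature of the pattern; not the kernel's definitions. */
struct demo_action {
    int verdict;
};

#ifdef DEMO_CLS_ACT /* stands in for CONFIG_NET_CLS_ACT */
/* Feature built in: run the actions and return the last verdict. */
static inline int demo_action_exec(const struct demo_action *acts, size_t n)
{
    return n ? acts[n - 1].verdict : 0;
}
#else
/* Feature compiled out: a stub that always returns 0. */
static inline int demo_action_exec(const struct demo_action *acts, size_t n)
{
    (void)acts;
    (void)n;
    return 0;
}
#endif

/* The caller carries no preprocessor conditionals of its own. */
int demo_exts_exec(const struct demo_action *acts, size_t n)
{
    return demo_action_exec(acts, n);
}
```

Because the stub is static inline and constant, a compiler can fold the call away when the feature is disabled, so the conditional lives in one place instead of at every call site.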

Re: [PATCH net-next v12 0/4] net sched actions: improve dump performance

2017-07-31 Thread Stephen Hemminger

On Mon, 31 Jul 2017 08:06:42 -0400
Jamal Hadi Salim  wrote:

> On 17-07-30 10:28 PM, David Miller wrote:
> > 
> > Series applied, thanks.
> >   
> 
> Thanks David.
> 
> Attaching the iproute2 patch. I will submit an official one with
> man page  changes later. Stephen - you take net-next changes?
> 
> cheers,
> jamal

Please cleanup and resubmit for net-next.

The header files have been updated in iproute2 net-next branch.

It is not clear to me that the new code is backward compatible.
Will new versions of tc work on old kernels and vice versa?

Also, no #ifdef's

Re: [PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter

2017-07-31 Thread Stephen Hemminger

On Mon, 31 Jul 2017 17:40:50 -0700
Amritha Nambiar  wrote:
The concept is fine, but the code looks different than the rest which
is never a good sign.

> + if ((argc > 0) && (matches(*argv, "tc") == 0)) {

Extra () are unnecessary in compound conditional.

> + tc = atoi(*argv);

Prefer using strtoul since it has better error handling than atoi()

> + argc--;
> + argv++;
> + }

Use NEXT_ARG() construct like rest of the code.
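
To illustrate the strtoul point above: atoi() returns 0 for non-numeric input and gives no way to detect trailing junk or overflow. The helper below is a hypothetical sketch (parse_uint is not an iproute2 function) showing the checks strtoul makes possible.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal unsigned value, rejecting junk.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
int parse_uint(const char *arg, unsigned int *out)
{
    char *end;
    unsigned long val;

    /* Require a leading digit: rejects NULL, "", signs and whitespace. */
    if (!arg || *arg < '0' || *arg > '9')
        return -1;

    errno = 0;
    val = strtoul(arg, &end, 10);

    /* Reject overflow, trailing characters and values beyond unsigned int. */
    if (errno == ERANGE || *end != '\0' || val > UINT_MAX)
        return -1;

    *out = (unsigned int)val;
    return 0;
}
```

With this shape, "7" parses to 7, while "tc", "7x", "-1" and an empty string are all reported as errors instead of silently becoming 0 or a wrapped value.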

Re: TCP fast retransmit issues

2017-07-31 Thread Neal Cardwell

On Fri, Jul 28, 2017 at 6:54 PM, Neal Cardwell  wrote:
> On Wed, Jul 26, 2017 at 3:02 PM, Neal Cardwell  wrote:
>> On Wed, Jul 26, 2017 at 2:38 PM, Neal Cardwell  wrote:
>>> Yeah, it looks like I can reproduce this issue with (1) bad sacks
>>> causing repeated TLPs, and (2) TLPs timers being pushed out to later
>>> times due to incoming data. Scripts are attached.
>>
>> I'm testing a fix of only scheduling a TLP if (flag & FLAG_DATA_ACKED)
>> is true...
>
> An update for the TLP aspect of this thread: our team has a proposed
> fix for this RTO/TLP reschedule issue that we have reviewed internally
> and tested with our packetdrill test suite, including some new tests.
> The basic approach in the fix is as follows:
>
> a) only reschedule the xmit timer once per ACK
>
> b) only reschedule the xmit timer if tcp_clean_rtx_queue() deems this
> is safe (a packet was cumulatively ACKed, or we got a SACK for a
> packet that was sent before the most recent retransmit of the write
> queue head).
>
> After further review and testing we will post it. Hopefully next week.

The timer patches are upstream for review for the "net" branch:

  https://patchwork.ozlabs.org/patch/796057/
  https://patchwork.ozlabs.org/patch/796058/
  https://patchwork.ozlabs.org/patch/796059/

Again, thank you for reporting this, and thanks for the packet trace!

neal

[PATCH net-next 09/10] net: ipv4: Support for sockets bound to enslaved device

2017-07-31 Thread David Ahern

Add support for sockets bound to a network interface enslaved to an
L3 Master device (e.g., VRF). Currently for VRF, skb->dev points to the
VRF device meaning socket lookups only consider this device index. The
real ingress device index is saved to IPCB(skb)->iif and the VRF driver
marks the skb with IPSKB_L3SLAVE to know that the real ingress device
is an enslaved one without having to lookup the iif.

Use those flags to add the enslaved device index to the socket lookup
and allow sk->sk_bound_dev_if to match either dif (VRF device) or sdif
(enslaved device).

Signed-off-by: David Ahern 
---
 include/linux/igmp.h  |  3 ++-
 include/net/inet_hashtables.h | 10 ++
 include/net/ip.h  | 10 ++
 include/net/tcp.h | 10 ++
 net/ipv4/igmp.c   |  6 --
 net/ipv4/inet_hashtables.c| 11 +++
 net/ipv4/raw.c|  7 +--
 net/ipv4/tcp_ipv4.c   |  6 --
 net/ipv4/udp.c| 11 ---
 9 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 97caf1821de8..f8231854b5d6 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -118,7 +118,8 @@ extern int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
struct ip_msfilter __user *optval, int __user *optlen);
 extern int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
struct group_filter __user *optval, int __user *optlen);
-extern int ip_mc_sf_allow(struct sock *sk, __be32 local, __be32 rmt, int dif);
+extern int ip_mc_sf_allow(struct sock *sk, __be32 local, __be32 rmt,
+ int dif, int sdif);
 extern void ip_mc_init_dev(struct in_device *);
 extern void ip_mc_destroy_dev(struct in_device *);
 extern void ip_mc_up(struct in_device *);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index c5f4dc3c06e4..2de3d4bc00ba 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -259,22 +259,24 @@ static inline struct sock *inet_lookup_listener(struct net *net,
   (((__force __u64)(__be32)(__daddr)) << 32) | 
\
   ((__force __u64)(__be32)(__saddr)))
 #endif /* __BIG_ENDIAN */
-#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) &&  \
 ((__sk)->sk_addrpair == (__cookie))&&  \
 (!(__sk)->sk_bound_dev_if  ||  \
-  ((__sk)->sk_bound_dev_if == (__dif)))&&  \
+  ((__sk)->sk_bound_dev_if == (__dif)) ||  \
+  ((__sk)->sk_bound_dev_if == (__sdif)))   &&  \
 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
const int __name __deprecated __attribute__((unused))
 
-#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) &&  \
 ((__sk)->sk_daddr  == (__saddr))   &&  \
 ((__sk)->sk_rcv_saddr  == (__daddr))   &&  \
 (!(__sk)->sk_bound_dev_if  ||  \
-  ((__sk)->sk_bound_dev_if == (__dif)))&&  \
+  ((__sk)->sk_bound_dev_if == (__dif)) ||  \
+  ((__sk)->sk_bound_dev_if == (__sdif)))   &&  \
 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 
diff --git a/include/net/ip.h b/include/net/ip.h
index 821cedcc8e73..e10da8814dba 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -78,6 +78,16 @@ struct ipcm_cookie {
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
 #define PKTINFO_SKB_CB(skb) ((struct in_pktinfo *)((skb)->cb))
 
+/* return enslaved device index if relevant */
+static inline int ip_sdif(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+   if (skb && ipv4_l3mdev_skb(IPCB(skb)->flags))
+   return IPCB(skb)->iif;
+#endif
+   return 0;
+}
+
 struct ip_ra_chain {
struct ip_ra_chain __rcu *next;
struct sock *sk;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 12d68335acd4..19827dd05dcc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -861,6 +861,16 @@ static inline bool inet_exact_dif_match(struct net *net, struct sk_buff *skb)
return false;
 }
 
+/* TCP_SKB_CB reference means this can not be used from early demux */
+static inline int tcp_v4_sdif(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+   if (skb &&

[PATCH net-next 07/10] net: ipv6: Convert raw sockets to sk_lookup

2017-07-31 Thread David Ahern

Convert __raw_v6_lookup to use the new sk_lookup struct

Signed-off-by: David Ahern 
---
 include/net/rawv6.h |  3 +--
 net/ipv4/raw_diag.c | 15 ++-
 net/ipv6/raw.c  | 41 +++--
 3 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/include/net/rawv6.h b/include/net/rawv6.h
index cbe4e9de1894..406268324d26 100644
--- a/include/net/rawv6.h
+++ b/include/net/rawv6.h
@@ -5,8 +5,7 @@
 
 extern struct raw_hashinfo raw_v6_hashinfo;
 struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
-unsigned short num, const struct in6_addr *loc_addr,
-const struct in6_addr *rmt_addr, int dif);
+const struct sk_lookup *params);
 
 int raw_abort(struct sock *sk, int err);
 
diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index a708de070cc6..e081c03fd408 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -53,11 +53,16 @@ static struct sock *raw_lookup(struct net *net, struct sock *from,
sk = __raw_v4_lookup(net, from, &params);
}
 #if IS_ENABLED(CONFIG_IPV6)
-   else
-   sk = __raw_v6_lookup(net, from, r->sdiag_raw_protocol,
-(const struct in6_addr *)r->id.idiag_src,
-(const struct in6_addr *)r->id.idiag_dst,
-r->id.idiag_if);
+   else {
+   struct sk_lookup params = {
+   .saddr.ipv6 = (const struct in6_addr *)r->id.idiag_dst,
+   .daddr.ipv6 = (const struct in6_addr *)r->id.idiag_src,
+   .hnum = r->sdiag_raw_protocol,
+   .dif  = r->id.idiag_if,
+   };
+
+   sk = __raw_v6_lookup(net, from, &params);
+   }
 #endif
return sk;
 }
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 60be012fe708..51e651f18ffb 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -71,14 +71,14 @@ struct raw_hashinfo raw_v6_hashinfo = {
 EXPORT_SYMBOL_GPL(raw_v6_hashinfo);
 
 struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
-   unsigned short num, const struct in6_addr *loc_addr,
-   const struct in6_addr *rmt_addr, int dif)
+const struct sk_lookup *params)
 {
+   const struct in6_addr *loc_addr = params->daddr.ipv6;
+   const struct in6_addr *rmt_addr = params->saddr.ipv6;
bool is_multicast = ipv6_addr_is_multicast(loc_addr);
 
sk_for_each_from(sk)
-   if (inet_sk(sk)->inet_num == num) {
-
+   if (inet_sk(sk)->inet_num == params->hnum) {
if (!net_eq(sock_net(sk), net))
continue;
 
@@ -86,7 +86,8 @@ struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
!ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
continue;
 
-   if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
+   if (sk->sk_bound_dev_if &&
+   sk->sk_bound_dev_if != params->dif)
continue;
 
if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
@@ -159,15 +160,17 @@ EXPORT_SYMBOL(rawv6_mh_filter_unregister);
  */
 static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 {
-   const struct in6_addr *saddr;
-   const struct in6_addr *daddr;
+   struct sk_lookup params = {
+   .saddr.ipv6 = &ipv6_hdr(skb)->saddr,
+   .daddr.ipv6 = &ipv6_hdr(skb)->daddr,
+   .hnum = nexthdr,
+   .dif  = inet6_iif(skb),
+   };
struct sock *sk;
bool delivered = false;
__u8 hash;
struct net *net;
 
-   saddr = &ipv6_hdr(skb)->saddr;
-   daddr = saddr + 1;
 
hash = nexthdr & (RAW_HTABLE_SIZE - 1);
 
@@ -178,7 +181,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
goto out;
 
net = dev_net(skb->dev);
-   sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb));
+   sk = __raw_v6_lookup(net, sk, &params);
 
while (sk) {
int filtered;
@@ -221,8 +224,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
rawv6_rcv(sk, clone);
}
}
-   sk = __raw_v6_lookup(net, sk_next(sk), nexthdr, daddr, saddr,
-inet6_iif(skb));
+   sk = __raw_v6_lookup(net, sk_next(sk), &params);
}
 out:
read_unlock(&raw_v6_hashinfo.lock);
@@ -362,23 +364,26 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
u8 type, u8 code, int inner_offset, __be32 info)
 {
struct sock *sk;
-   int hash;
-   const struct in6_addr *saddr, *daddr;
struct net *net;
+   int hash;
 
hash = nexthdr &

[PATCH net-next 04/10] net: ipv4: Convert raw sockets to sk_lookup

2017-07-31 Thread David Ahern

Convert __raw_v4_lookup to use the new sk_lookup struct

Signed-off-by: David Ahern 
---
 include/net/raw.h   |  3 +--
 net/ipv4/raw.c  | 72 ++---
 net/ipv4/raw_diag.c | 15 +++
 3 files changed, 58 insertions(+), 32 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 57c33dd22ec4..8d0f0e5d013b 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -25,8 +25,7 @@ extern struct proto raw_prot;
 
 extern struct raw_hashinfo raw_v4_hashinfo;
 struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
-unsigned short num, __be32 raddr,
-__be32 laddr, int dif);
+const struct sk_lookup *params);
 
 int raw_abort(struct sock *sk, int err);
 void raw_icmp_error(struct sk_buff *, int, u32);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index b0bb5d0a30bd..4da5d87a61a5 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -122,15 +122,23 @@ void raw_unhash_sk(struct sock *sk)
 EXPORT_SYMBOL_GPL(raw_unhash_sk);
 
 struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
-   unsigned short num, __be32 raddr, __be32 laddr, int dif)
+const struct sk_lookup *params)
 {
+   __be32 raddr = params->saddr.ipv4;
+   __be32 laddr = params->daddr.ipv4;
+
sk_for_each_from(sk) {
struct inet_sock *inet = inet_sk(sk);
+   bool dev_match;
+
+   dev_match = (!sk->sk_bound_dev_if ||
+   sk->sk_bound_dev_if == params->dif);
 
-   if (net_eq(sock_net(sk), net) && inet->inet_num == num  &&
-   !(inet->inet_daddr && inet->inet_daddr != raddr)&&
+   if (net_eq(sock_net(sk), net) &&
+   inet->inet_num == params->hnum &&
+   !(inet->inet_daddr && inet->inet_daddr != raddr) &&
!(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-   !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+   dev_match)
goto found; /* gotcha */
}
sk = NULL;
@@ -169,23 +177,20 @@ static int icmp_filter(const struct sock *sk, const struct sk_buff *skb)
  * RFC 1122: SHOULD pass TOS value up to the transport layer.
  * -> It does. And not only TOS, but all IP header.
  */
-static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
+static int __raw_v4_input(struct sk_buff *skb, const struct iphdr *iph,
+ struct hlist_head *head)
 {
-   struct sock *sk;
-   struct hlist_head *head;
+   struct net *net = dev_net(skb->dev);
+   const struct sk_lookup params = {
+   .saddr.ipv4 = iph->saddr,
+   .daddr.ipv4 = iph->daddr,
+   .hnum = iph->protocol,
+   .dif  = skb->dev->ifindex,
+   };
int delivered = 0;
-   struct net *net;
-
-   read_lock(&raw_v4_hashinfo.lock);
-   head = &raw_v4_hashinfo.ht[hash];
-   if (hlist_empty(head))
-   goto out;
-
-   net = dev_net(skb->dev);
-   sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol,
-iph->saddr, iph->daddr,
-skb->dev->ifindex);
+   struct sock *sk;
 
+   sk = __raw_v4_lookup(net, __sk_head(head), &params);
while (sk) {
delivered = 1;
if ((iph->protocol != IPPROTO_ICMP || !icmp_filter(sk, skb)) &&
@@ -197,11 +202,22 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
if (clone)
raw_rcv(sk, clone);
}
-   sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol,
-iph->saddr, iph->daddr,
-skb->dev->ifindex);
+   sk = __raw_v4_lookup(net, sk_next(sk), &params);
}
-out:
+
+   return delivered;
+}
+
+static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
+{
+   struct hlist_head *head;
+   int delivered = 0;
+
+   read_lock(&raw_v4_hashinfo.lock);
+   head = &raw_v4_hashinfo.ht[hash];
+   if (!hlist_empty(head))
+   delivered = __raw_v4_input(skb, iph, head);
+
read_unlock(&raw_v4_hashinfo.lock);
return delivered;
 }
@@ -297,12 +313,18 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
read_lock(&raw_v4_hashinfo.lock);
raw_sk = sk_head(&raw_v4_hashinfo.ht[hash]);
if (raw_sk) {
+   struct sk_lookup params = {
+   .hnum = protocol,
+   .dif = skb->dev->ifindex,
+   };
+
iph = (const struct iphdr *)skb->data;
net = dev_net(skb->dev);
 
-   while ((raw_sk = __raw_v4_lookup(net, raw_sk, protocol,
-

[PATCH net-next 10/10] net: ipv6: Support for sockets bound to enslaved device

2017-07-31 Thread David Ahern

Add support for sockets bound to a network interface enslaved to an
L3 Master device (e.g., VRF). Currently for VRF, skb->dev points to the
VRF device meaning socket lookups only consider this device index. The
real ingress device index is saved to IP6CB(skb)->iif and the VRF driver
marks the skb with IP6SKB_L3SLAVE to know that the real ingress device
is an enslaved one without having to lookup the iif.

Use those flags to add the enslaved device index to the socket lookup
and allow sk->sk_bound_dev_if to match either dif (VRF device) or sdif
(enslaved device).

Signed-off-by: David Ahern 
---
 include/linux/ipv6.h   | 8 
 include/net/inet6_hashtables.h | 5 +++--
 include/net/tcp.h  | 7 +++
 net/ipv6/inet6_hashtables.c| 9 +
 net/ipv6/raw.c | 5 -
 net/ipv6/tcp_ipv6.c| 3 +++
 net/ipv6/udp.c | 8 ++--
 7 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index e1b442996f81..094357907b45 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -153,6 +153,14 @@ static inline int inet6_iif(const struct sk_buff *skb)
 }
 
 /* can not be used in TCP layer after tcp_v6_fill_cb */
+static inline int inet6_sdif(const struct sk_buff *skb)
+{
+   bool l3_slave = ipv6_l3mdev_skb(IP6CB(skb)->flags);
+
+   return l3_slave ? IP6CB(skb)->iif : 0;
+}
+
+/* can not be used in TCP layer after tcp_v6_fill_cb */
 static inline bool inet6_exact_dif_match(struct net *net, struct sk_buff *skb)
 {
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 15db41272ff2..0fc5a2fe4ad3 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -94,13 +94,14 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
 int inet6_hash(struct sock *sk);
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-#define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif) \
+#define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) &&  \
 ((__sk)->sk_family == AF_INET6)&&  \
 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr))   &&  \
 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr))   &&  \
 (!(__sk)->sk_bound_dev_if  ||  \
-  ((__sk)->sk_bound_dev_if == (__dif)))&&  \
+  ((__sk)->sk_bound_dev_if == (__dif)) ||  \
+  ((__sk)->sk_bound_dev_if == (__sdif)))   &&  \
 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 19827dd05dcc..8a081cff33f8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -848,6 +848,13 @@ static inline int tcp_v6_iif(const struct sk_buff *skb)
 
return l3_slave ? skb->skb_iif : TCP_SKB_CB(skb)->header.h6.iif;
 }
+
+static inline int tcp_v6_sdif(const struct sk_buff *skb)
+{
+   bool l3_slave = ipv6_l3mdev_skb(TCP_SKB_CB(skb)->header.h6.flags);
+
+   return l3_slave ? TCP_SKB_CB(skb)->header.h6.iif : 0;
+}
 #endif
 
 /* TCP_SKB_CB reference means this can not be used from early demux */
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 878c03094f2e..06120efb2036 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -74,13 +74,13 @@ struct sock *__inet6_lookup_established(struct net *net,
if (sk->sk_hash != hash)
continue;
if (!INET6_MATCH(sk, net, saddr, daddr, ports,
-params->dif))
+params->dif, params->sdif))
continue;
if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
goto out;
 
if (unlikely(!INET6_MATCH(sk, net, saddr, daddr, ports,
-params->dif))) {
+params->dif, params->sdif))) {
sock_gen_put(sk);
goto begin;
}
@@ -188,8 +188,9 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
const struct in6_addr *daddr = &sk->sk_v6_rcv_saddr;
const struct in6_addr *saddr = &sk->sk_v6_daddr;
const int dif = sk->sk_bound_dev_if;
-   const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
struct net *net = sock_net(sk);
+   const int sdif = l3mdev_master_ifindex_by_index(net, dif);
+   const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
const unsigned int hash = inet6_ehashfn(net, daddr, lport, saddr,
inet->inet_dport);
struct

[PATCH net-next 00/10] net: l3mdev: Support for sockets bound to enslaved device

2017-07-31 Thread David Ahern

A missing piece to the VRF puzzle is the ability to bind sockets to
devices enslaved to a VRF. This patch set adds the enslaved device
index, sdif, to IPv4 and IPv6 socket lookups. The end result for users
is the following scope options for services:

1. "global" services - sockets not bound to any device

   Allows 1 service to work across all network interfaces with
   connected sockets bound to the VRF the connection originates
   (Requires net.ipv4.tcp_l3mdev_accept=1 for TCP and
net.ipv4.udp_l3mdev_accept=1 for UDP)

2. "VRF" local services - sockets bound to a VRF

   Sockets work across all network interfaces enslaved to a VRF but
   are limited to just the one VRF.

3. "device" services - sockets bound to a specific network interface

   Service works only through the one specific interface.

Existing code for socket lookups already pass in 6+ arguments. Rather
than add another for the enslaved device index, the existing lookups
are converted to use a new sk_lookup struct. From there, the enslaved
device index becomes another element of the struct.
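
As a rough illustration of the two ideas above (bundling lookup arguments in a struct, and matching a bound socket against either device index), here is a small userspace sketch. The struct is a hypothetical, trimmed-down stand-in and not the definition from patch 1; the predicate is modelled on the device condition in the updated INET_MATCH macro.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, trimmed-down lookup parameters. The struct in the series
 * also carries addresses and ports; only the device fields matter here.
 */
struct demo_lookup {
    uint16_t hnum; /* local port in host byte order */
    int dif;       /* ingress device index (the VRF device for enslaved ports) */
    int sdif;      /* enslaved device index, or 0 when not applicable */
};

/*
 * Unbound sockets (bound_dev_if == 0) match any device. Bound sockets
 * match when bound to either the VRF device (dif) or the enslaved
 * device (sdif).
 */
bool demo_bound_dev_match(int bound_dev_if, const struct demo_lookup *p)
{
    return !bound_dev_if ||
           bound_dev_if == p->dif ||
           bound_dev_if == p->sdif;
}
```

With dif = 10 (VRF) and sdif = 3 (enslaved port), sockets bound to 0, 10 or 3 all match, which corresponds to the "global", "VRF" and "device" scopes listed above; a socket bound to any other index does not. Adding a field to the struct leaves every lookup signature unchanged, which is the stated reason for the conversion.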

Patch 1 introduces sk_lookup struct and helper.

Patches 2-4 convert udp, inet and socket lookups for IPv4 to use the
new sk_lookup struct. Meant to be a conversion of IPv4 code only; no
functional change intended.

Patches 5-7 convert udp, inet and socket lookups for IPv6 to use the
new sk_lookup struct. Meant to be a conversion of IPv6 code only; no
functional change intended.

Patch 8 adds sdif to the sk_lookup struct allowing lookups to consider
a second device index.

Patches 9-10 add support for the enslaved device index to ipv4 and ipv6
socket lookups.

Changes since RFC:
- no significant logic changes; mainly whitespace cleanups

David Ahern (10):
  net: Add sk_lookup struct and helper
  net: ipv4: Convert udp socket lookups to new struct
  net: ipv4: Convert inet socket lookups to new struct
  net: ipv4: Convert raw sockets to sk_lookup
  net: ipv6: Convert udp socket lookups to new struct
  net: ipv6: Convert inet socket lookups to new struct
  net: ipv6: Convert raw sockets to sk_lookup
  net: Add sdif to sk_lookup
  net: ipv4: Support for sockets bound to enslaved device
  net: ipv6: Support for sockets bound to enslaved device

 include/linux/igmp.h|   3 +-
 include/linux/ipv6.h|   8 ++
 include/net/inet6_hashtables.h  |  44 -
 include/net/inet_hashtables.h   |  67 ++---
 include/net/ip.h|  10 ++
 include/net/raw.h   |   3 +-
 include/net/rawv6.h |   3 +-
 include/net/sock.h  |  42 +
 include/net/tcp.h   |  17 
 include/net/udp.h   |  18 +---
 net/dccp/ipv4.c |  19 +++-
 net/dccp/ipv6.c |  22 +++--
 net/ipv4/igmp.c |   6 +-
 net/ipv4/inet_diag.c|  50 +++---
 net/ipv4/inet_hashtables.c  |  59 +++-
 net/ipv4/netfilter/nf_socket_ipv4.c |  16 +++-
 net/ipv4/raw.c  |  77 +--
 net/ipv4/raw_diag.c |  30 --
 net/ipv4/tcp_ipv4.c |  64 +
 net/ipv4/udp.c  | 175 ++
 net/ipv4/udp_diag.c |  89 --
 net/ipv6/inet6_hashtables.c |  75 ---
 net/ipv6/netfilter/nf_socket_ipv6.c |  16 +++-
 net/ipv6/raw.c  |  44 +
 net/ipv6/tcp_ipv6.c |  63 +
 net/ipv6/udp.c  | 181 
 net/netfilter/xt_TPROXY.c   |  39 +---
 27 files changed, 759 insertions(+), 481 deletions(-)

-- 
2.1.4

[PATCH net-next 06/10] net: ipv6: Convert inet socket lookups to new struct

2017-07-31 Thread David Ahern

Convert the various inet6_lookup functions to use the new sk_lookup
struct.

Signed-off-by: David Ahern 
---
 include/net/inet6_hashtables.h  | 39 +++-
 net/dccp/ipv6.c | 22 
 net/ipv4/inet_diag.c| 19 ++
 net/ipv4/udp_diag.c |  2 ++
 net/ipv6/inet6_hashtables.c | 72 +++--
 net/ipv6/netfilter/nf_socket_ipv6.c |  5 ++-
 net/ipv6/tcp_ipv6.c | 60 +--
 net/netfilter/xt_TPROXY.c   |  8 ++---
 8 files changed, 125 insertions(+), 102 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index b87becacd9d3..15db41272ff2 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -46,63 +46,50 @@ static inline unsigned int __inet6_ehashfn(const u32 lhash,
  */
 struct sock *__inet6_lookup_established(struct net *net,
struct inet_hashinfo *hashinfo,
-   const struct in6_addr *saddr,
-   const __be16 sport,
-   const struct in6_addr *daddr,
-   const u16 hnum, const int dif);
+   const struct sk_lookup *params);
 
 struct sock *inet6_lookup_listener(struct net *net,
   struct inet_hashinfo *hashinfo,
   struct sk_buff *skb, int doff,
-  const struct in6_addr *saddr,
-  const __be16 sport,
-  const struct in6_addr *daddr,
-  const unsigned short hnum, const int dif);
+  struct sk_lookup *params);
 
 static inline struct sock *__inet6_lookup(struct net *net,
  struct inet_hashinfo *hashinfo,
  struct sk_buff *skb, int doff,
- const struct in6_addr *saddr,
- const __be16 sport,
- const struct in6_addr *daddr,
- const u16 hnum,
- const int dif,
+ struct sk_lookup *params,
  bool *refcounted)
 {
-   struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
-   sport, daddr, hnum, dif);
+   struct sock *sk = __inet6_lookup_established(net, hashinfo, params);
+
*refcounted = true;
if (sk)
return sk;
*refcounted = false;
-   return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
-daddr, hnum, dif);
+   return inet6_lookup_listener(net, hashinfo, skb, doff, params);
 }
 
 static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
  struct sk_buff *skb, int doff,
- const __be16 sport,
- const __be16 dport,
- int iif,
+ struct sk_lookup *params,
  bool *refcounted)
 {
struct sock *sk = skb_steal_sock(skb);
 
+   params->saddr.ipv6 = &ipv6_hdr(skb)->saddr,
+   params->daddr.ipv6 = &ipv6_hdr(skb)->daddr,
+   params->hnum = ntohs(params->dport),
+
*refcounted = true;
if (sk)
return sk;
 
return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
- doff, &ipv6_hdr(skb)->saddr, sport,
- &ipv6_hdr(skb)->daddr, ntohs(dport),
- iif, refcounted);
+ doff, params, refcounted);
 }
 
 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
  struct sk_buff *skb, int doff,
- const struct in6_addr *saddr, const __be16 sport,
- const struct in6_addr *daddr, const __be16 dport,
- const int dif);
+ struct sk_lookup *params);
 
 int inet6_hash(struct sock *sk);
 #endif /* IS_ENABLED(CONFIG_IPV6) */
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c376af5bfdfb..e92f10a832dd 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -70,6 +70,11 @@ static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
u8 type, u8 code, int offset, __be32 info)
 {
const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+   struct

[PATCH net-next 03/10] net: ipv4: Convert inet socket lookups to new struct

2017-07-31 Thread David Ahern

Convert the various inet_lookup functions to use the new sk_lookup
struct.

Signed-off-by: David Ahern 
---
 include/net/inet_hashtables.h       | 57 ++
 net/dccp/ipv4.c                     | 19 +---
 net/ipv4/inet_diag.c                | 33 ++--
 net/ipv4/inet_hashtables.c          | 48 +++-
 net/ipv4/netfilter/nf_socket_ipv4.c |  5 ++-
 net/ipv4/tcp_ipv4.c                 | 62 +++--
 net/ipv4/udp_diag.c                 |  3 ++
 net/netfilter/xt_TPROXY.c           | 10 +++---
 8 files changed, 142 insertions(+), 95 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 5026b1f08bb8..c5f4dc3c06e4 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -218,19 +218,16 @@ void inet_unhash(struct sock *sk);
 struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
-   const __be32 saddr, const __be16 sport,
-   const __be32 daddr,
-   const unsigned short hnum,
-   const int dif);
+   struct sk_lookup *params);
 
 static inline struct sock *inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
-   __be32 saddr, __be16 sport,
-   __be32 daddr, __be16 dport, int dif)
+   struct sk_lookup *params)
 {
-   return __inet_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
- daddr, ntohs(dport), dif);
+   params->hnum = ntohs(params->dport);
+
+   return __inet_lookup_listener(net, hashinfo, skb, doff, params);
 }
 
 /* Socket demux engine toys. */
@@ -286,53 +283,44 @@ static inline struct sock *inet_lookup_listener(struct net *net,
  */
 struct sock *__inet_lookup_established(struct net *net,
   struct inet_hashinfo *hashinfo,
-  const __be32 saddr, const __be16 sport,
-  const __be32 daddr, const u16 hnum,
-  const int dif);
+  const struct sk_lookup *params);
 
 static inline struct sock *
inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
-   const __be32 saddr, const __be16 sport,
-   const __be32 daddr, const __be16 dport,
-   const int dif)
+   struct sk_lookup *params)
 {
-   return __inet_lookup_established(net, hashinfo, saddr, sport, daddr,
-ntohs(dport), dif);
+   params->hnum = ntohs(params->dport);
+
+   return __inet_lookup_established(net, hashinfo, params);
 }
 
 static inline struct sock *__inet_lookup(struct net *net,
 struct inet_hashinfo *hashinfo,
 struct sk_buff *skb, int doff,
-const __be32 saddr, const __be16 sport,
-const __be32 daddr, const __be16 dport,
-const int dif,
+struct sk_lookup *params,
 bool *refcounted)
 {
-   u16 hnum = ntohs(dport);
struct sock *sk;
 
-   sk = __inet_lookup_established(net, hashinfo, saddr, sport,
-  daddr, hnum, dif);
+   params->hnum = ntohs(params->dport);
+
+   sk = __inet_lookup_established(net, hashinfo, params);
*refcounted = true;
if (sk)
return sk;
*refcounted = false;
-   return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
- sport, daddr, hnum, dif);
+   return __inet_lookup_listener(net, hashinfo, skb, doff, params);
 }
 
 static inline struct sock *inet_lookup(struct net *net,
   struct inet_hashinfo *hashinfo,
   struct sk_buff *skb, int doff,
-  const __be32 saddr, const __be16 sport,
-  const __be32 daddr, const __be16 dport,
-  const int dif)
+  struct sk_lookup *params)
 {
struct sock *sk;
bool refcounted;
 
-   sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr,
-  dport, dif, &refcounted);
+   sk = __inet_lookup(net, hashinfo, skb, doff,

[PATCH net-next 08/10] net: Add sdif to sk_lookup

2017-07-31 Thread David Ahern

Add a second device index, sdif, to the socket lookup struct. sdif
will be the device index for devices enslaved to an l3mdev. It allows
the lookups to consider the enslaved device as well as the L3 master
device when searching for a socket.

Signed-off-by: David Ahern 
---
 include/net/sock.h | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a2db5fd30192..c5d93a4bcd0a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -507,23 +507,27 @@ struct sk_lookup {
unsigned short hnum;
 
int dif;
+   int sdif;
bool exact_dif;
 };
 
-/* Compare sk_bound_dev_if to socket lookup dif
+/* Compare sk_bound_dev_if to socket lookup dif and sdif
  * Returns:
  *   -1   exact dif required and not met
  *    0   sk_bound_dev_if is either not set or does not match
- *    1   sk_bound_dev_if is set and matches dif
+ *    1   sk_bound_dev_if is set and matches dif or sdif
  */
 static inline int sk_lookup_device_cmp(const struct sock *sk,
   const struct sk_lookup *params)
 {
+   bool dev_match = (sk->sk_bound_dev_if == params->dif ||
+ sk->sk_bound_dev_if == params->sdif);
+
/* exact_dif true == l3mdev case */
-   if (params->exact_dif && sk->sk_bound_dev_if != params->dif)
+   if (params->exact_dif && !dev_match)
return -1;
 
-   if (sk->sk_bound_dev_if && sk->sk_bound_dev_if == params->dif)
+   if (sk->sk_bound_dev_if && dev_match)
return 1;
 
return 0;
-- 
2.1.4
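
The device comparison described in this patch can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `sock_sketch`, `lookup_sketch` and `lookup_device_cmp` are hypothetical stand-ins that keep only the fields the helper reads, and the logic follows the `+` lines of the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct sock field the helper reads. */
struct sock_sketch {
	int bound_dev_if;	/* 0 means the socket is not bound to a device */
};

/* Hypothetical stand-in for the device-related part of struct sk_lookup. */
struct lookup_sketch {
	int dif;		/* primary device index */
	int sdif;		/* index of the device enslaved to an l3mdev */
	bool exact_dif;		/* true in the l3mdev case */
};

/*
 * Returns -1 if an exact device match is required and not met,
 * 1 if the socket is bound and matches dif or sdif, 0 otherwise.
 */
static inline int lookup_device_cmp(const struct sock_sketch *sk,
				    const struct lookup_sketch *params)
{
	bool dev_match;

	assert(sk != NULL && params != NULL);

	dev_match = sk->bound_dev_if == params->dif ||
		    sk->bound_dev_if == params->sdif;

	if (params->exact_dif && !dev_match)
		return -1;
	if (sk->bound_dev_if && dev_match)
		return 1;
	return 0;
}
```

With `dif = 3` and `sdif = 7`, a socket bound to device 7 now scores as a match, which the `dif`-only comparison would have missed.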

[PATCH net-next 02/10] net: ipv4: Convert udp socket lookups to new struct

2017-07-31 Thread David Ahern

Convert udp4_lib_lookup and __udp4_lib_lookup to use the new sk_lookup
struct.

Signed-off-by: David Ahern 
---
 include/net/udp.h                   |   6 +-
 net/ipv4/netfilter/nf_socket_ipv4.c |  11 ++-
 net/ipv4/udp.c                      | 170 +++-
 net/ipv4/udp_diag.c                 |  51 +++
 net/netfilter/xt_TPROXY.c           |  11 ++-
 5 files changed, 144 insertions(+), 105 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 972ce4baab6b..5e0ff095dc6d 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -283,10 +283,8 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
   char __user *optval, unsigned int optlen,
   int (*push_pending_frames)(struct sock *));
-struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
-__be32 daddr, __be16 dport, int dif);
-struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
-  __be32 daddr, __be16 dport, int dif,
+struct sock *udp4_lib_lookup(struct net *net, struct sk_lookup *params);
+struct sock *__udp4_lib_lookup(struct net *net, struct sk_lookup *params,
   struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport);
diff --git a/net/ipv4/netfilter/nf_socket_ipv4.c b/net/ipv4/netfilter/nf_socket_ipv4.c
index e9293bdebba0..121767b36763 100644
--- a/net/ipv4/netfilter/nf_socket_ipv4.c
+++ b/net/ipv4/netfilter/nf_socket_ipv4.c
@@ -81,14 +81,21 @@ nf_socket_get_sock_v4(struct net *net, struct sk_buff *skb, const int doff,
  const __be16 sport, const __be16 dport,
  const struct net_device *in)
 {
+   struct sk_lookup params = {
+   .saddr.ipv4 = saddr,
+   .daddr.ipv4 = daddr,
+   .sport = sport,
+   .dport = dport,
+   .dif = in->ifindex,
+   };
+
switch (protocol) {
case IPPROTO_TCP:
return inet_lookup(net, &tcp_hashinfo, skb, doff,
   saddr, sport, daddr, dport,
   in->ifindex);
case IPPROTO_UDP:
-   return udp4_lib_lookup(net, saddr, sport, daddr, dport,
-  in->ifindex);
+   return udp4_lib_lookup(net, &params);
}
return NULL;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b057653ceca9..132a8f070d16 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -379,15 +379,13 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 }
 
 static int compute_score(struct sock *sk, struct net *net,
-__be32 saddr, __be16 sport,
-__be32 daddr, unsigned short hnum, int dif,
-bool exact_dif)
+const struct sk_lookup *params)
 {
-   int score;
struct inet_sock *inet;
+   int score, rc;
 
if (!net_eq(sock_net(sk), net) ||
-   udp_sk(sk)->udp_port_hash != hnum ||
+   udp_sk(sk)->udp_port_hash != params->hnum ||
ipv6_only_sock(sk))
return -1;
 
@@ -395,28 +393,28 @@ static int compute_score(struct sock *sk, struct net *net,
inet = inet_sk(sk);
 
if (inet->inet_rcv_saddr) {
-   if (inet->inet_rcv_saddr != daddr)
+   if (inet->inet_rcv_saddr != params->daddr.ipv4)
return -1;
score += 4;
}
 
if (inet->inet_daddr) {
-   if (inet->inet_daddr != saddr)
+   if (inet->inet_daddr != params->saddr.ipv4)
return -1;
score += 4;
}
 
if (inet->inet_dport) {
-   if (inet->inet_dport != sport)
+   if (inet->inet_dport != params->sport)
return -1;
score += 4;
}
 
-   if (sk->sk_bound_dev_if || exact_dif) {
-   if (sk->sk_bound_dev_if != dif)
-   return -1;
+   rc = sk_lookup_device_cmp(sk, params);
+   if (rc < 0)
+   return -1;
+   if (rc > 0)
score += 4;
-   }
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
return score;
@@ -436,10 +434,9 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
 
 /* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
-   __be32 saddr, __be16 sport,
-   __be32 daddr, unsigned int hnum, int dif, bool exact_dif,
-   struct udp_hslot *hslot2,
-   struct sk_buff *skb)
+const struct sk_lookup *params,
+

[PATCH net-next 01/10] net: Add sk_lookup struct and helper

2017-07-31 Thread David Ahern

Consolidate the socket lookup args into a struct.

Add helper that compares sk_bound_dev_if for a socket to the lookup
parameters.

Signed-off-by: David Ahern 
---
 include/net/sock.h | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7c0632c7e870..a2db5fd30192 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -491,6 +491,44 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)   rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)   rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
+/* used for socket lookups */
+struct sk_lookup {
+   union {
+   const struct in6_addr *ipv6;
+   __be32 ipv4;
+   } saddr;
+   union {
+   const struct in6_addr *ipv6;
+   __be32 ipv4;
+   } daddr;
+
+   __be16 sport;
+   __be16 dport;
+   unsigned short hnum;
+
+   int dif;
+   bool exact_dif;
+};
+
+/* Compare sk_bound_dev_if to socket lookup dif
+ * Returns:
+ *   -1   exact dif required and not met
+ *    0   sk_bound_dev_if is either not set or does not match
+ *    1   sk_bound_dev_if is set and matches dif
+ */
+static inline int sk_lookup_device_cmp(const struct sock *sk,
+  const struct sk_lookup *params)
+{
+   /* exact_dif true == l3mdev case */
+   if (params->exact_dif && sk->sk_bound_dev_if != params->dif)
+   return -1;
+
+   if (sk->sk_bound_dev_if && sk->sk_bound_dev_if == params->dif)
+   return 1;
+
+   return 0;
+}
+
 /*
  * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
  * or not whether his port will be reused by someone else. SK_FORCE_REUSE
-- 
2.1.4
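
To show what consolidating the lookup arguments looks like from a caller's side, here is a minimal userspace sketch. It assumes an IPv4-only stand-in, `lookup4_sketch`, and a hypothetical constructor, `lookup4_init`. Neither exists in the kernel; only the field names are borrowed from the struct in this patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical IPv4-only stand-in for struct sk_lookup. */
struct lookup4_sketch {
	uint32_t saddr;		/* network byte order */
	uint32_t daddr;		/* network byte order */
	uint16_t sport;		/* network byte order */
	uint16_t dport;		/* network byte order */
	unsigned short hnum;	/* dport converted to host byte order */
	int dif;
	bool exact_dif;
};

/* Fill the parameters once; callers then pass one pointer around. */
static inline struct lookup4_sketch
lookup4_init(uint32_t saddr, uint16_t sport,
	     uint32_t daddr, uint16_t dport, int dif)
{
	struct lookup4_sketch params = {
		.saddr = saddr,
		.daddr = daddr,
		.sport = sport,
		.dport = dport,
		.dif = dif,
	};

	assert(dif >= 0);
	/* The converted lookup helpers in this series derive hnum like this. */
	params.hnum = ntohs(params.dport);
	return params;
}
```

Fields that are not named in the initializer, such as exact_dif, start out as zero, so a caller only spells out what it actually knows.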

[PATCH net-next 05/10] net: ipv6: Convert udp socket lookups to new struct

2017-07-31 Thread David Ahern

Convert udp6_lib_lookup and __udp6_lib_lookup to use the new sk_lookup
struct.

Signed-off-by: David Ahern 
---
 include/net/udp.h                   |  12 +--
 net/ipv4/udp_diag.c                 |  33 ---
 net/ipv6/netfilter/nf_socket_ipv6.c |  11 ++-
 net/ipv6/udp.c                      | 177 +++-
 net/netfilter/xt_TPROXY.c           |  10 +-
 5 files changed, 135 insertions(+), 108 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 5e0ff095dc6d..c5a75e9422c6 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -288,15 +288,9 @@ struct sock *__udp4_lib_lookup(struct net *net, struct sk_lookup *params,
   struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport);
-struct sock *udp6_lib_lookup(struct net *net,
-const struct in6_addr *saddr, __be16 sport,
-const struct in6_addr *daddr, __be16 dport,
-int dif);
-struct sock *__udp6_lib_lookup(struct net *net,
-  const struct in6_addr *saddr, __be16 sport,
-  const struct in6_addr *daddr, __be16 dport,
-  int dif, struct udp_table *tbl,
-  struct sk_buff *skb);
+struct sock *udp6_lib_lookup(struct net *net, struct sk_lookup *params);
+struct sock *__udp6_lib_lookup(struct net *net, struct sk_lookup *params,
+  struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport);
 
diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
index d7f6af42ebcc..10738c10c5ae 100644
--- a/net/ipv4/udp_diag.c
+++ b/net/ipv4/udp_diag.c
@@ -54,13 +54,17 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
sk = __udp4_lib_lookup(net, &params, tbl, NULL);
}
 #if IS_ENABLED(CONFIG_IPV6)
-   else if (req->sdiag_family == AF_INET6)
-   sk = __udp6_lib_lookup(net,
-   (struct in6_addr *)req->id.idiag_src,
-   req->id.idiag_sport,
-   (struct in6_addr *)req->id.idiag_dst,
-   req->id.idiag_dport,
-   req->id.idiag_if, tbl, NULL);
+   else if (req->sdiag_family == AF_INET6) {
+   struct sk_lookup params = {
+   .saddr.ipv6 = (struct in6_addr *)req->id.idiag_src,
+   .daddr.ipv6 = (struct in6_addr *)req->id.idiag_dst,
+   .sport = req->id.idiag_sport,
+   .dport = req->id.idiag_dport,
+   .dif   =  req->id.idiag_if,
+   };
+
+   sk = __udp6_lib_lookup(net, &params, tbl, NULL);
+   }
 #endif
if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
sk = NULL;
@@ -212,12 +216,15 @@ static int __udp_diag_destroy(struct sk_buff *in_skb,
 
sk = __udp4_lib_lookup(net, &params, tbl, NULL);
} else {
-   sk = __udp6_lib_lookup(net,
-   (struct in6_addr *)req->id.idiag_dst,
-   req->id.idiag_dport,
-   (struct in6_addr *)req->id.idiag_src,
-   req->id.idiag_sport,
-   req->id.idiag_if, tbl, NULL);
+   struct sk_lookup params = {
+   .saddr.ipv6 = (struct in6_addr *)req->id.idiag_dst,
+   .daddr.ipv6 = (struct in6_addr *)req->id.idiag_src,
+   .sport = req->id.idiag_dport,
+   .dport = req->id.idiag_sport,
+   .dif   = req->id.idiag_if,
+   };
+
+   sk = __udp6_lib_lookup(net, &params, tbl, NULL);
}
}
 #endif
diff --git a/net/ipv6/netfilter/nf_socket_ipv6.c b/net/ipv6/netfilter/nf_socket_ipv6.c
index ebb2bf84232a..c1c193103063 100644
--- a/net/ipv6/netfilter/nf_socket_ipv6.c
+++ b/net/ipv6/netfilter/nf_socket_ipv6.c
@@ -86,14 +86,21 @@ nf_socket_get_sock_v6(struct net *net, struct sk_buff *skb, int doff,
  const __be16 sport, const __be16 dport,
  const struct net_device *in)
 {
+   struct sk_lookup params = {
+   .saddr.ipv6 = saddr,
+   .daddr.ipv6 = daddr,
+   .sport = sport,
+   .dport = dport,
+   .dif   = in->ifindex,
+   };
+
switch (protocol) {
case IPPROTO_TCP:
return inet6_lookup(net, &tcp_hashinfo, skb, doff,

Re: [PATCH V3 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission

2017-07-31 Thread Neal Cardwell

On Mon, Jul 31, 2017 at 11:49 AM, Neal Cardwell  wrote:
> On Sun, Jul 30, 2017 at 11:29 PM, maowenan  wrote:
>> [Mao Wenan] Please refer to the attachment; test.pkt is the packetdrill script.
>> In test.pcap, packet number 17 is the TLP probe, and packet number 218 is the
>> retransmission packet because the client doesn't send data packets to the server.
>> Judging by the capture times, it takes about 6 seconds before the retransmission
>> packet can be sent, and this delay keeps growing as long as the client
>> continues to send data packets.
>> I have reproduced this issue in Linux 4.13-rc3, 3.10, and 4.1. Please check the
>> timing when you use test.pkt to reproduce it in your environment.
>
> Thank you for your very nice packetdrill test case illustrating this
> problem! And thanks for verifying that the problem shows up in those
> kernel versions.
>
> We are able to run the script in our environment and both verify that
> the bug is the one we hypothesized, and verify our proposed patch
> fixes it (the RTO for the TLP happens 221ms after the TLP, instead of
> ~5 secs later). We will send out our proposed patches ASAP.

The timer patches are upstream for review for the "net" branch:

  https://patchwork.ozlabs.org/patch/796057/
  https://patchwork.ozlabs.org/patch/796058/
  https://patchwork.ozlabs.org/patch/796059/

Again, thank you for reporting this and providing a packetdrill script
to reproduce this!

neal

[PATCH net 2/3] tcp: enable xmit timer fix by having TLP use time when RTO should fire

2017-07-31 Thread Neal Cardwell

Have tcp_schedule_loss_probe() base the TLP scheduling decision on
when the RTO *should* fire. This is to enable the upcoming xmit
timer fix in this series, where tcp_schedule_loss_probe() cannot
assume that the last timer installed was an RTO timer (because we are
no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
ACK). So tcp_schedule_loss_probe() must independently figure out when
an RTO would want to fire.

In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the
nature of icsk_timeout makes things a little easier to reason about
anyway.

Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".

Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
---
 net/ipv4/tcp_output.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2f1588bf73da..0ae6b5d176c0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2377,8 +2377,8 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
-   u32 timeout, tlp_time_stamp, rto_time_stamp;
u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
+   u32 timeout, rto_delta_us;
 
/* No consecutive loss probes. */
if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
@@ -2418,13 +2418,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
timeout = max_t(u32, timeout, msecs_to_jiffies(10));
 
/* If RTO is shorter, just schedule TLP in its place. */
-   tlp_time_stamp = tcp_jiffies32 + timeout;
-   rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
-   if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
-   s32 delta = rto_time_stamp - tcp_jiffies32;
-   if (delta > 0)
-   timeout = delta;
-   }
+   rto_delta_us = tcp_rto_delta_us(sk);  /* How far in future is RTO? */
+   if (rto_delta_us > 0)
+   timeout = min_t(u32, timeout, usecs_to_jiffies(rto_delta_us));
 
inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
  TCP_RTO_MAX);
-- 
2.14.0.rc0.400.g1c36432dff-goog
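
The rule quoted in the commit message ("if the RTO is in the future, but at an earlier time than the normal TLP time, then set the TLP timer to fire when the RTO would have fired") comes down to a clamp. Below is a minimal sketch under stated assumptions: both arguments are already in the same time unit, the remaining time until the RTO is passed in as a signed value, and `tlp_timeout_sketch` is a hypothetical name rather than a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * tlp_timeout: the normal TLP delay.
 * rto_delta:   time left until the RTO should fire; zero or negative
 *              means that moment is not in the future.
 * Both values are assumed to use the same unit.
 */
static inline uint32_t tlp_timeout_sketch(uint32_t tlp_timeout,
					  int64_t rto_delta)
{
	assert(tlp_timeout > 0);

	/* Only an RTO that is still in the future can shorten the timer. */
	if (rto_delta > 0 && (uint64_t)rto_delta < tlp_timeout)
		return (uint32_t)rto_delta;
	return tlp_timeout;
}
```

When the RTO is further away than the TLP delay, or is not in the future at all, the TLP delay is returned unchanged.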

[PATCH net 1/3] tcp: introduce tcp_rto_delta_us() helper for xmit timer fix

2017-07-31 Thread Neal Cardwell

Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)

Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
---
 include/net/tcp.h    | 10 ++
 net/ipv4/tcp_input.c |  5 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70483296157f..ada65e767b28 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1916,6 +1916,16 @@ extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 u64 xmit_time);
 extern void tcp_rack_reo_timeout(struct sock *sk);
 
+/* At how many usecs into the future should the RTO fire? */
+static inline s64 tcp_rto_delta_us(const struct sock *sk)
+{
+   const struct sk_buff *skb = tcp_write_queue_head(sk);
+   u32 rto = inet_csk(sk)->icsk_rto;
+   u64 rto_time_stamp_us = skb->skb_mstamp + jiffies_to_usecs(rto);
+
+   return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
+}
+
 /*
  * Save and compile IPv4 options, return a pointer to it
  */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2920e0cb09f8..345febf0a46e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3004,10 +3004,7 @@ void tcp_rearm_rto(struct sock *sk)
/* Offset the time elapsed after installing regular RTO */
if (icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
-   struct sk_buff *skb = tcp_write_queue_head(sk);
-   u64 rto_time_stamp = skb->skb_mstamp +
-jiffies_to_usecs(rto);
-   s64 delta_us = rto_time_stamp - tp->tcp_mstamp;
+   s64 delta_us = tcp_rto_delta_us(sk);
/* delta_us may not be positive if the socket is locked
 * when the retrans timer fires and is rescheduled.
 */
-- 
2.14.0.rc0.400.g1c36432dff-goog
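
The arithmetic inside the new helper is easy to check by hand. The sketch below reproduces it in userspace with plain integers: `rto_delta_us_sketch` and its three parameters are hypothetical stand-ins for the head skb's send timestamp, the RTO converted to microseconds, and the socket's current timestamp.

```c
#include <assert.h>
#include <stdint.h>

/*
 * All three inputs are microseconds on the same clock.
 * The result is signed: a negative value means the RTO deadline
 * has already passed.
 */
static inline int64_t rto_delta_us_sketch(uint64_t head_sent_us,
					  uint64_t rto_us,
					  uint64_t now_us)
{
	uint64_t rto_deadline_us = head_sent_us + rto_us;

	assert(rto_us > 0);

	return (int64_t)(rto_deadline_us - now_us);
}
```

For a head packet sent at t = 1000000 us with a 200000 us RTO, the helper yields 150000 at t = 1050000 and -100000 at t = 1300000. The negative case is the one the comment in tcp_rearm_rto() warns about.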

[PATCH net 0/3] tcp: fix xmit timer rearming to avoid stalls

2017-07-31 Thread Neal Cardwell

This patch series is a bug fix for a TCP loss recovery performance bug
reported independently in recent netdev threads:

 (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
 (ii) July 26, 2017: netdev thread:
   "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
   outstanding TLP retransmission"

Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
traces, and packetdrill test cases, which enabled us to root-cause
this issue and verify the fix.

Neal Cardwell (3):
  tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
  tcp: enable xmit timer fix by having TLP use time when RTO should fire
  tcp: fix xmit timer to only be reset if data ACKed/SACKed

 include/net/tcp.h | 10 ++
 net/ipv4/tcp_input.c  | 30 +-
 net/ipv4/tcp_output.c | 21 -
 3 files changed, 31 insertions(+), 30 deletions(-)

-- 
2.14.0.rc0.400.g1c36432dff-goog

[PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

2017-07-31 Thread Neal Cardwell

Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:

(i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
 "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
 outstanding TLP retransmission"

The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:

 - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
   could cause TCP to repeatedly schedule TLPs forever. We kept
   sending TLPs after every ~200ms, which elicited bogus SACKs, which
   caused more TLPs, ad infinitum; we never fired an RTO to fill in
   the holes.

 - Incoming data segments could, in some cases, cause us to reschedule
   our RTO or TLP timer further out in time, for no good reason. This
   could cause repeated inbound data to result in stalls in outbound
   data, in the presence of packet loss.

This commit fixes these bugs by changing the TLP and RTO ACK
processing to:

 (a) Only reschedule the xmit timer once per ACK.

 (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
 ACK indicates sufficient forward progress (a packet was
 cumulatively ACKed, or we got a SACK for a packet that was sent
 before the most recent retransmit of the write queue head).

This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.

As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:

1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
   tcp_rearm_rto(sk);

2) Once in tcp_clean_rtx_queue(), to update the RTO:
if (flag & FLAG_ACKED) {
   tcp_rearm_rto(sk);

3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
   to TLP:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
   tcp_schedule_loss_probe(sk);

This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.

This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.

This patch is a bug fix patch intended to be queued for -stable
releases.

Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen 
Reported-by: Mao Wenan 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
---
 net/ipv4/tcp_input.c  | 25 -
 net/ipv4/tcp_output.c |  9 -
 2 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 345febf0a46e..3e777cfbba56 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -107,6 +107,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 #define FLAG_ORIG_SACK_ACKED   0x200 /* Never retransmitted data are (s)acked  */
 #define FLAG_SND_UNA_ADVANCED  0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
 #define FLAG_DSACKING_ACK  0x800 /* SACK blocks contained D-SACK info */
+#define FLAG_SET_XMIT_TIMER    0x1000 /* Set TLP or RTO timer */
 #define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
 #define FLAG_UPDATE_TS_RECENT  0x4000 /* tcp_replace_ts_recent() */
 #define FLAG_NO_CHALLENGE_ACK  0x8000 /* do not call tcp_send_challenge_ack()  */
@@ -3016,6 +3017,13 @@ void tcp_rearm_rto(struct sock *sk)
}
 }
 
+/* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
+static void tcp_set_xmit_timer(struct sock *sk)
+{
+   if (!tcp_schedule_loss_probe(sk))
+   tcp_rearm_rto(sk);
+}
+
 /* If we get here, the whole TSO packet has not been acked. */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
@@ -3177,7 +3185,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
ca_rtt_us, sack->rate);
 
if (flag & FLAG_ACKED) {
-   tcp_rearm_rto(sk);
+   flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
if (unlikely(icsk->icsk_mtup.probe_size &&
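
The "once per ACK" idea in this commit message can be modelled in a few lines of userspace C. Everything in the sketch is a hypothetical stand-in: the connection struct merely counts timer arms, the `_SKETCH` flag for acked data uses a placeholder value, and placing the single timer decision at the end of ACK processing is an assumption, since the diff above is cut off before that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_ACKED_SKETCH		0x1	/* placeholder value */
#define FLAG_SET_XMIT_TIMER_SKETCH	0x1000	/* value used in the patch */

/* Hypothetical connection state; the counters record timer arms. */
struct conn_sketch {
	bool tlp_possible;	/* can a loss probe be scheduled right now? */
	int tlp_arms;
	int rto_arms;
};

static inline bool schedule_loss_probe_sketch(struct conn_sketch *c)
{
	if (!c->tlp_possible)
		return false;
	c->tlp_arms++;
	return true;
}

/* Try a loss probe first; fall back to the RTO if that is refused. */
static inline void set_xmit_timer_sketch(struct conn_sketch *c)
{
	if (!schedule_loss_probe_sketch(c))
		c->rto_arms++;
}

/* Process one ACK; acked_new_data says whether it made forward progress. */
static inline void ack_sketch(struct conn_sketch *c, bool acked_new_data)
{
	int flag = 0;

	assert(c != NULL);

	if (acked_new_data)
		flag |= FLAG_ACKED_SKETCH;

	/* The rtx-queue cleanup only records the wish to set a timer. */
	if (flag & FLAG_ACKED_SKETCH)
		flag |= FLAG_SET_XMIT_TIMER_SKETCH;

	/* Assumed placement: one timer decision per ACK, made at the end. */
	if (flag & FLAG_SET_XMIT_TIMER_SKETCH)
		set_xmit_timer_sketch(c);
}
```

An ACK without forward progress leaves both counters untouched, and an ACK with progress arms exactly one timer, TLP if possible and RTO otherwise.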

Re: [RFC net-next] net ipv6: convert fib6_table rwlock to a percpu lock

2017-07-31 Thread Shaohua Li

On Mon, Jul 31, 2017 at 04:10:07PM -0700, Stephen Hemminger wrote:
> On Mon, 31 Jul 2017 10:18:57 -0700
> Shaohua Li  wrote:
> 
> > From: Shaohua Li 
> > 
> > In a syn flooding test, the fib6_table rwlock is a significant
> > bottleneck. Converting the rwlock to RCU sounds straightforward,
> > but is very challenging, if it is possible at all. A percpu spinlock is
> > quite trivial for this problem since updating the routing table is a rare
> > event. In my test, the server receives around 1.5 Mpps in a syn flooding
> > test without the patch on a dual-socket, 56-CPU system. With the
> > patch, the server receives around 3.8 Mpps, and perf report doesn't show
> > the locking issue.
> > 
> > Cc: Wei Wang 
> 
> You just reinvented brlock...

You mean lglock? It has been removed from the kernel.
 
> RCU is not that hard, why not do it right?

Maybe. But I don't think that is a reason not to do the percpu lock now.
This is a simple change; if someone finds a way to do it with RCU, we can
easily remove this.
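
For readers unfamiliar with the brlock/lglock-style scheme being discussed, here is a generic userspace sketch of the idea. It is not the RFC patch. The slot count, the names, and the use of C11 `atomic_flag` as a tiny try-lock are all assumptions made for illustration. Readers touch only their own CPU's slot, while a writer has to take every slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS_SKETCH 4	/* stand-in for the number of CPUs */

struct percpu_lock_sketch {
	atomic_flag slot[NR_SLOTS_SKETCH];
};

static inline bool slot_trylock(atomic_flag *f)
{
	/* test_and_set returns the old value: false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(f, memory_order_acquire);
}

static inline void slot_unlock(atomic_flag *f)
{
	atomic_flag_clear_explicit(f, memory_order_release);
}

static inline void pcl_init(struct percpu_lock_sketch *l)
{
	assert(l != NULL);
	for (int i = 0; i < NR_SLOTS_SKETCH; i++)
		atomic_flag_clear(&l->slot[i]);
}

/* Reader: touch only the slot that belongs to the calling CPU. */
static inline bool pcl_read_trylock(struct percpu_lock_sketch *l,
				    unsigned int cpu)
{
	return slot_trylock(&l->slot[cpu % NR_SLOTS_SKETCH]);
}

static inline void pcl_read_unlock(struct percpu_lock_sketch *l,
				   unsigned int cpu)
{
	slot_unlock(&l->slot[cpu % NR_SLOTS_SKETCH]);
}

/* Writer: take every slot in ascending order; back out on contention. */
static inline bool pcl_write_trylock(struct percpu_lock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS_SKETCH; i++) {
		if (!slot_trylock(&l->slot[i])) {
			while (i-- > 0)
				slot_unlock(&l->slot[i]);
			return false;
		}
	}
	return true;
}

static inline void pcl_write_unlock(struct percpu_lock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS_SKETCH; i++)
		slot_unlock(&l->slot[i]);
}
```

The trade-off is the one the thread argues about: readers on different CPUs never share a cache line, while a writer pays a cost proportional to the number of slots, which is tolerable only if updates are rare.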

Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT

2017-07-31 Thread Xin Long

On Tue, Aug 1, 2017 at 2:01 PM, David Ahern  wrote:
> On 7/31/17 7:40 PM, Xin Long wrote:
>> To respect the old code more, setting RTPROT_RA only when
>> it's with the flag (ADDRCONF | DEFAULT | ROUTEINFO),
>> shouldn't it be:
>
> Look at rtm_fill_info:
>
> if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
> rtm->rtm_protocol = RTPROT_RA;
>
>
> If either flag is set the protocol should be RTPROT_RA and looking at
> both places that seems correct to me.
>
ok, right

Re: [PATCH V2 net-next 0/2] liquidio: Add support for managing liquidio adapter

2017-07-31 Thread Felix Manlunas

On Mon, Jul 31, 2017 at 05:59:37PM -0700, David Miller wrote:
> From: Simon Horman 
> Date: Sun, 30 Jul 2017 22:21:04 +0200
> 
> > On Fri, Jul 28, 2017 at 11:17:07PM -0700, Felix Manlunas wrote:
> >> From: Veerasenareddy Burru 
> >> 
> >> The LiquidIO adapter has processor cores that can run Linux. This patch
> >> set adds support to create a virtual Ethernet interface on host to
> >> communicate with applications running on Linux in the LiquidIO adapter.
> >> The virtual Ethernet interface also provides login access to Linux on
> >> LiquidIO through ssh for management and debugging.
> > 
> > As per the somewhat more detailed feedback provided by my colleague Jakub
> > Kicinski to v1 of this patchset[1], I am concerned that this patchset breaks
> > down the long-standing practice of not granting direct access to firmware
> > from userspace.
> > 
> > [1] https://www.spinics.net/lists/netdev/msg444929.html
> 
> Agreed, I've seen no attempt to address this important feedback, which
> I agree with.

We posted a response to the original comment on Friday, 28 July, but for
some reason it did not make it to the mailing list outside the Cavium
domain. Our apologies; we did not double-check before submitting the V2
patch. Please find the response reposted on the original thread below:

http://marc.info/?l=linux-netdev&m=150155273724386&w=2

Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT

2017-07-31 Thread David Ahern

On 7/31/17 7:40 PM, Xin Long wrote:
> To respect the old code more, setting RTPROT_RA only when
> it's with the flag (ADDRCONF | DEFAULT | ROUTEINFO),
> shouldn't it be:

Look at rtm_fill_info:

if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
rtm->rtm_protocol = RTPROT_RA;

If either flag is set the protocol should be RTPROT_RA and looking at
both places that seems correct to me.
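
For reference, the flag-to-protocol mapping being discussed (the if/else hunk from rt6_fill_node quoted elsewhere in this thread) can be sketched as a small userspace function. This is only an illustration, not kernel code: the EX_ constants and ex_route_protocol() are hypothetical stand-ins, and the flag bit values are chosen for the example rather than taken from the kernel headers.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the RTF_* flags; bit positions are
 * illustrative only. */
enum {
	EX_RTF_DYNAMIC   = 1u << 0,
	EX_RTF_DEFAULT   = 1u << 1,
	EX_RTF_ADDRCONF  = 1u << 2,
	EX_RTF_ROUTEINFO = 1u << 3,
};

/* Hypothetical stand-ins for the RTPROT_* values. */
enum {
	EX_RTPROT_REDIRECT = 1,
	EX_RTPROT_KERNEL   = 2,
	EX_RTPROT_RA       = 9,
};

/* Mirrors the quoted if/else chain: "stored" is the protocol kept on
 * the route, and it is reported only when no flag overrides it. */
int ex_route_protocol(uint32_t flags, int stored)
{
	if (flags & EX_RTF_DYNAMIC)
		return EX_RTPROT_REDIRECT;
	if (flags & EX_RTF_ADDRCONF) {
		/* Either DEFAULT or ROUTEINFO is enough for RA. */
		if (flags & (EX_RTF_DEFAULT | EX_RTF_ROUTEINFO))
			return EX_RTPROT_RA;
		return EX_RTPROT_KERNEL;
	}
	return stored;
}
```

With this shape it is easy to see why the dump can report a protocol that differs from the one stored on the route, which is the inconsistency the thread wants to remove by setting the protocol when the route is installed.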

Re: [PATCH net-next 2/2] liquidio: Add support to create management interface

2017-07-31 Thread Felix Manlunas

On Tue, Jul 18, 2017 at 11:58:27AM -0700, Jakub Kicinski wrote:
> On Mon, 17 Jul 2017 12:52:17 -0700, Felix Manlunas wrote:
> > From: VSR Burru 
> > 
> > This patch adds support to create a virtual ethernet interface to
> > communicate with Linux on LiquidIO adapter for management.
> > 
> > Signed-off-by: VSR Burru 
> > Signed-off-by: Srinivasa Jampala 
> > Signed-off-by: Satanand Burla 
> > Signed-off-by: Raghu Vatsavayi 
> > Signed-off-by: Felix Manlunas 
> 
> Not my call, but I have mixed feelings about this one.  Is there any
> precedent under drivers/net/ethernet of exposing special communication
> channels with FW like this?  It's irrelevant to me that you're running
> SSH, arbitrary communication with FW from userspace is not something
> netdev community usually accepts.  And I'm afraid what the effects will
> be of this getting accepted.  I'm pretty sure most modern network
> adapters have management CPU cores perfectly capable of running Linux.
> I know NFP does, here is the out-of-tree code equivalent to this patch:

LiquidIO is committed to ethtool, and we are not trying to force users to
use this communication channel in place of ethtool. This communication
channel is for our field debugging and information purposes, not for end
users. If most modern network adapters have management cores that can
run Linux, we could probably also think about finding a standard way to
talk to that Linux.

> 
> https://github.com/Netronome/nfp-drv-kmods/blob/master/src/nfpcore/nfp_net_vnic.c
> 
> I'm not looking forward to a world where I have to ssh into my NIC and
> run vendor commands to configure things.

We are not asking users to ssh into the card and run vendor commands. Users
of the LiquidIO card will continue to use ethtool for configuration. This is
for our field debugging, where we would like to log in to Linux on the card
and check the status of the different hardware blocks.

Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT

2017-07-31 Thread Xin Long

On Tue, Aug 1, 2017 at 12:12 PM, David Ahern  wrote:
> On 7/30/17 9:31 PM, Xin Long wrote:
>>> Did you look at removing this hunk from rt6_fill_node:
>>>
>>> if (rt->rt6i_flags & RTF_DYNAMIC)
>>> rtm->rtm_protocol = RTPROT_REDIRECT;
>>> else if (rt->rt6i_flags & RTF_ADDRCONF) {
>>> if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
>>> rtm->rtm_protocol = RTPROT_RA;
>>> else
>>> rtm->rtm_protocol = RTPROT_KERNEL;
>>> }
>> The issue seems to affect "ip -6 route flush all" as well, not only cache
>> since 'else if {}' also  causes rtm proto being different from rt6 proto.
>>
>>>
>>> And have rtm_protocol set properly on the route when it is installed?
>> The codes not keeping rtm proto consistent with rt6 proto day 1,
>> any idea on why it didn't use rt6 proto in kernel properly?
>
> no, AFAIK it was just an oversight when the original code was written. I
> do not know of any reason that would prevent properly setting the
> rt6i_protocol in the route when it is allocated.
That's what I was worried about: it might break something.
But I double-checked, and it should be fine.

>
> Something like this (not compiled, much less tested):
To respect the old code more, setting RTPROT_RA only when
it's with the flag (ADDRCONF | DEFAULT | ROUTEINFO),
shouldn't it be:

[...]
@@ -2464,6 +2465,7 @@ static struct rt6_info
*rt6_add_route_info(struct net *net,
.fc_nlinfo.portid = 0,
.fc_nlinfo.nlh = NULL,
.fc_nlinfo.nl_net = net,
+   .fc_protocol = RTPROT_KERNEL,
};

cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO,
@@ -2471,8 +2473,10 @@ static struct rt6_info
*rt6_add_route_info(struct net *net,
cfg.fc_gateway = *gwaddr;

/* We should treat it as a default route if prefix length is 0. */
-   if (!prefixlen)
+   if (!prefixlen) {
+   cfg.fc_protocol = RTPROT_RA;
cfg.fc_flags |= RTF_DEFAULT;
+   }

ip6_route_add(&cfg, NULL);

@@ -2516,6 +2520,7 @@ struct rt6_info *rt6_add_dflt_router(const
struct in6_addr *gwaddr,
.fc_nlinfo.portid = 0,
.fc_nlinfo.nlh = NULL,
.fc_nlinfo.nl_net = dev_net(dev),
+   .fc_protocol = RTPROT_KERNEL,
};
[...]

or you changed it intentionally ?

I will do some testing before posting v2.
thanks for your suggestion. :-)

>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4d30c96a819d..9a928839d247 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2347,6 +2347,7 @@ static void rt6_do_redirect(struct dst_entry *dst,
> struct sock *sk, struct sk_bu
> if (!nrt)
> goto out;
>
> +   nrt->rt6i_protocol = RTPROT_REDIRECT;
> nrt->rt6i_flags = RTF_GATEWAY|RTF_UP|RTF_DYNAMIC|RTF_CACHE;
> if (on_link)
> nrt->rt6i_flags &= ~RTF_GATEWAY;
> @@ -2461,6 +2462,7 @@ static struct rt6_info *rt6_add_route_info(struct
> net *net,
> .fc_dst_len = prefixlen,
> .fc_flags   = RTF_GATEWAY | RTF_ADDRCONF |
> RTF_ROUTEINFO |
>   RTF_UP | RTF_PREF(pref),
> +   .fc_protocol= RTPROT_RA,
> .fc_nlinfo.portid = 0,
> .fc_nlinfo.nlh = NULL,
> .fc_nlinfo.nl_net = net,
> @@ -2513,6 +2515,7 @@ struct rt6_info *rt6_add_dflt_router(const struct
> in6_addr *gwaddr,
> .fc_ifindex = dev->ifindex,
> .fc_flags   = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT |
>   RTF_UP | RTF_EXPIRES | RTF_PREF(pref),
> +   .fc_protocol= RTPROT_RA,
> .fc_nlinfo.portid = 0,
> .fc_nlinfo.nlh = NULL,
> .fc_nlinfo.nl_net = dev_net(dev),
> @@ -3424,14 +3427,6 @@ static int rt6_fill_node(struct net *net,
> rtm->rtm_flags = 0;
> rtm->rtm_scope = RT_SCOPE_UNIVERSE;
> rtm->rtm_protocol = rt->rt6i_protocol;
> -   if (rt->rt6i_flags & RTF_DYNAMIC)
> -   rtm->rtm_protocol = RTPROT_REDIRECT;
> -   else if (rt->rt6i_flags & RTF_ADDRCONF) {
> -   if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
> -   rtm->rtm_protocol = RTPROT_RA;
> -   else
> -   rtm->rtm_protocol = RTPROT_KERNEL;
> -   }
>
> if (rt->rt6i_flags & RTF_CACHE)
> rtm->rtm_flags |= RTM_F_CLONED;

[PATCH v6 net-next] net: systemport: Support 64bit statistics

2017-07-31 Thread Jianming.qiao

When the Broadcom Systemport device is used on a 32-bit platform, ifconfig
can only report tx/rx statistics up to 4G; the counters wrap to 0 once the
number of incoming or outgoing packets exceeds 4G, which takes only around
2 hours in a busy network environment (such as streaming). This makes it
hard for network diagnostic tools to get reliable statistics, so this patch
adds 64-bit statistics support for the Broadcom Systemport device on 32-bit
platforms.

Signed-off-by: Jianming.qiao 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 68 --
 drivers/net/ethernet/broadcom/bcmsysport.h |  9 +++-
 2 files changed, 52 insertions(+), 25 deletions(-)
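
A back-of-the-envelope check of the "around 2 hours" figure: a 32-bit counter wraps after 2^32 events, so at an assumed steady 600,000 packets per second it wraps in roughly 7158 seconds. The sketch below is plain userspace C, not driver code; ex_secs_to_wrap32() and ex_count() are hypothetical helpers, and the packet rate is an assumption made for the arithmetic.

```c
#include <stdint.h>

/* Seconds until a 32-bit counter wraps at a steady rate of
 * "per_sec" events (packets or bytes) per second. */
uint64_t ex_secs_to_wrap32(uint64_t per_sec)
{
	return per_sec ? (UINT64_C(1) << 32) / per_sec : UINT64_MAX;
}

/* Add "delta" to a 32-bit and a 64-bit counter side by side, to show
 * that only the narrow one wraps. */
void ex_count(uint32_t *c32, uint64_t *c64, uint32_t delta)
{
	*c32 += delta;	/* unsigned arithmetic: wraps modulo 2^32 */
	*c64 += delta;	/* keeps counting past 2^32 */
}
```

Byte counters wrap much sooner than packet counters: at an assumed 125,000,000 bytes per second, ex_secs_to_wrap32() gives about 34 seconds.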

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index 5333601..bb3cc7a 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -662,6 +662,7 @@ static int bcm_sysport_alloc_rx_bufs(struct 
bcm_sysport_priv *priv)
 static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv,
unsigned int budget)
 {
+   struct bcm_sysport_stats *stats64 = &priv->stats64;
struct net_device *ndev = priv->netdev;
unsigned int processed = 0, to_process;
struct bcm_sysport_cb *cb;
@@ -765,6 +766,10 @@ static unsigned int bcm_sysport_desc_rx(struct 
bcm_sysport_priv *priv,
skb->protocol = eth_type_trans(skb, ndev);
ndev->stats.rx_packets++;
ndev->stats.rx_bytes += len;
+   u64_stats_update_begin(>syncp);
+   stats64->rx_packets++;
+   stats64->rx_bytes += len;
+   u64_stats_update_end(>syncp);
 
napi_gro_receive(>napi, skb);
 next:
@@ -787,17 +792,15 @@ static void bcm_sysport_tx_reclaim_one(struct 
bcm_sysport_tx_ring *ring,
struct device *kdev = >pdev->dev;
 
if (cb->skb) {
-   ring->bytes += cb->skb->len;
*bytes_compl += cb->skb->len;
dma_unmap_single(kdev, dma_unmap_addr(cb, dma_addr),
 dma_unmap_len(cb, dma_len),
 DMA_TO_DEVICE);
-   ring->packets++;
(*pkts_compl)++;
bcm_sysport_free_cb(cb);
/* SKB fragment */
} else if (dma_unmap_addr(cb, dma_addr)) {
-   ring->bytes += dma_unmap_len(cb, dma_len);
+   *bytes_compl += dma_unmap_len(cb, dma_len);
dma_unmap_page(kdev, dma_unmap_addr(cb, dma_addr),
   dma_unmap_len(cb, dma_len), DMA_TO_DEVICE);
dma_unmap_addr_set(cb, dma_addr, 0);
@@ -808,9 +811,10 @@ static void bcm_sysport_tx_reclaim_one(struct 
bcm_sysport_tx_ring *ring,
 static unsigned int __bcm_sysport_tx_reclaim(struct bcm_sysport_priv *priv,
 struct bcm_sysport_tx_ring *ring)
 {
-   struct net_device *ndev = priv->netdev;
unsigned int c_index, last_c_index, last_tx_cn, num_tx_cbs;
+   struct bcm_sysport_stats *stats64 = &priv->stats64;
unsigned int pkts_compl = 0, bytes_compl = 0;
+   struct net_device *ndev = priv->netdev;
struct bcm_sysport_cb *cb;
u32 hw_ind;
 
@@ -849,6 +853,11 @@ static unsigned int __bcm_sysport_tx_reclaim(struct 
bcm_sysport_priv *priv,
last_c_index &= (num_tx_cbs - 1);
}
 
+   u64_stats_update_begin(>syncp);
+   ring->packets += pkts_compl;
+   ring->bytes += bytes_compl;
+   u64_stats_update_end(>syncp);
+
ring->c_index = c_index;
 
netif_dbg(priv, tx_done, ndev,
@@ -1671,24 +1680,6 @@ static int bcm_sysport_change_mac(struct net_device 
*dev, void *p)
return 0;
 }
 
-static struct net_device_stats *bcm_sysport_get_nstats(struct net_device *dev)
-{
-   struct bcm_sysport_priv *priv = netdev_priv(dev);
-   unsigned long tx_bytes = 0, tx_packets = 0;
-   struct bcm_sysport_tx_ring *ring;
-   unsigned int q;
-
-   for (q = 0; q < dev->num_tx_queues; q++) {
-   ring = &priv->tx_rings[q];
-   tx_bytes += ring->bytes;
-   tx_packets += ring->packets;
-   }
-
-   dev->stats.tx_bytes = tx_bytes;
-   dev->stats.tx_packets = tx_packets;
-   return &dev->stats;
-}
-
 static void bcm_sysport_netif_start(struct net_device *dev)
 {
struct bcm_sysport_priv *priv = netdev_priv(dev);
@@ -1923,6 +1914,37 @@ static int bcm_sysport_stop(struct net_device *dev)
return 0;
 }
 
+static void bcm_sysport_get_stats64(struct net_device *dev,
+   struct rtnl_link_stats64 *stats)
+{
+   struct bcm_sysport_priv *priv = netdev_priv(dev);
+   struct bcm_sysport_stats *stats64 = &priv->stats64;
+   struct bcm_sysport_tx_ring *ring;
+   u64 tx_packets = 0, tx_bytes = 0;
+   unsigned int start;
+   unsigned int q;
+
+

Re: [PATCH net] Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"

2017-07-31 Thread David Miller

From: Florian Fainelli 
Date: Mon, 31 Jul 2017 11:05:32 -0700

> This reverts commit 28b45910ccda ("net: bcmgenet: Remove init parameter
> from bcmgenet_mii_config") because in the process of moving from
> dev_info() to dev_info_once() we essentially lost the helpful printed
> messages once the second instance of the driver is loaded.
> dev_info_once() does not actually print the message once per device
> instance, but once period.
> 
> Fixes: 28b45910ccda ("net: bcmgenet: Remove init parameter from bcmgenet_mii_config")
> Signed-off-by: Florian Fainelli 

Applied, thanks Florian.
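
The "once period" behaviour described in the commit message is what one gets from a print-once helper built on a static flag at the call site. The following is a minimal userspace sketch of that idea, not the kernel's dev_info_once() implementation: EX_INFO_ONCE, struct ex_dev and both probe functions are hypothetical names, and the per-device flag is shown only as a contrast (the revert itself restores the init parameter instead).

```c
#include <stdbool.h>
#include <stdio.h>

int ex_lines_printed;	/* counts emitted messages, for demonstration */

/* One static flag per call site: fires a single time for the whole
 * program, no matter which device reaches it first. */
#define EX_INFO_ONCE(name, msg)					\
	do {							\
		static bool ex_done;				\
		if (!ex_done) {					\
			ex_done = true;				\
			printf("%s: %s\n", (name), (msg));	\
			ex_lines_printed++;			\
		}						\
	} while (0)

struct ex_dev {
	const char *name;
	bool announced;		/* per-device state instead of a static */
};

/* Second and later devices stay silent here. */
void ex_probe_once_global(const struct ex_dev *d)
{
	EX_INFO_ONCE(d->name, "configuring PHY");
}

/* Each device prints its own message exactly once. */
void ex_probe_once_per_dev(struct ex_dev *d)
{
	if (!d->announced) {
		d->announced = true;
		printf("%s: %s\n", d->name, "configuring PHY");
		ex_lines_printed++;
	}
}
```

Calling ex_probe_once_global() for two devices emits one line in total, while ex_probe_once_per_dev() emits one line per device.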

Re: [PATCH net-next] ipv6: Avoid going through ->sk_net to access the netns

2017-07-31 Thread David Miller

From: Jakub Sitnicki 
Date: Mon, 31 Jul 2017 10:09:41 +0200

> There is no need to go through sk->sk_net to access the net namespace
> and its sysctl variables because we allocate the sock and initialize
> sk_net just a few lines earlier in the same routine.
> 
> Signed-off-by: Jakub Sitnicki 

Applied, thanks.

Re: [PATCH net-next 0/7] More Marvell PHY refactoring and cleanup

2017-07-31 Thread David Miller

From: Andrew Lunn 
Date: Sun, 30 Jul 2017 22:41:43 +0200

> Consolidate more duplicated code into helpers, make use of core
> helpers, move code into a helper for later adding functionality to add
> marvell PHYs, etc.

Series applied.

Re: [PATCH V2 net-next 0/2] liquidio: Add support for managing liquidio adapter

2017-07-31 Thread David Miller

From: Simon Horman 
Date: Sun, 30 Jul 2017 22:21:04 +0200

> On Fri, Jul 28, 2017 at 11:17:07PM -0700, Felix Manlunas wrote:
>> From: Veerasenareddy Burru 
>> 
>> The LiquidIO adapter has processor cores that can run Linux. This patch
>> set adds support to create a virtual Ethernet interface on host to
>> communicate with applications running on Linux in the LiquidIO adapter.
>> The virtual Ethernet interface also provides login access to Linux on
>> LiquidIO through ssh for management and debugging.
> 
> As per the somewhat more detailed feedback provided by my colleague Jakub
> Kicinski to v1 of this patchset[1] I am concerned that this patchset breaks
> down the long standing practice of not granting direct access to firmware
> from userspace.
> 
> [1] https://www.spinics.net/lists/netdev/msg444929.html

Agreed, I've seen no attempt to address this important feedback, which
I agree with.

Re: [PATCH] mv643xx_eth: fix of_irq_to_resource() error check

2017-07-31 Thread David Miller

From: Sergei Shtylyov 
Date: Sat, 29 Jul 2017 22:18:41 +0300

> of_irq_to_resource() has recently been fixed to return negative error #'s
> along with 0 in case of failure, however the Marvell MV643xx Ethernet
> driver still only regards 0 as invalid IRQ -- fix it up.
> 
> Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()")
> Signed-off-by: Sergei Shtylyov 

Applied.
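
The class of bug described above can be sketched in a few lines. This is an illustration under assumptions, not the driver change itself: ex_irq_is_valid_old() and ex_irq_is_valid() are hypothetical helpers standing in for the check a caller makes on the returned IRQ number.

```c
#include <stdbool.h>

/* Old-style check: only 0 is regarded as "no IRQ", so a negative
 * error code slips through as if it were a usable IRQ number. */
bool ex_irq_is_valid_old(int irq)
{
	return irq != 0;
}

/* Check that treats both 0 and negative error codes as failure. */
bool ex_irq_is_valid(int irq)
{
	return irq > 0;
}
```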

Re: [PATCH] net-next: stmmac: dwmac-sun8i: fix of_table.cocci warnings

2017-07-31 Thread David Miller

From: Julia Lawall 
Date: Sat, 29 Jul 2017 17:54:10 +0200 (CEST)

> Make sure (of/i2c/platform)_device_id tables are NULL terminated
> 
> Generated by: scripts/coccinelle/misc/of_table.cocci
> 
> Fixes: d5dbe1976d52 ("net-next: stmmac: dwmac-sun8i: choose internal PHY via compatible")
> CC: Corentin Labbe 
> Signed-off-by: Fengguang Wu 

This change seems to be no longer relevant.

Re: [PATCH net-next] net: bcmgenet: Add dependency on HAS_IOMEM && OF

2017-07-31 Thread David Miller

From: Florian Fainelli 
Date: Mon, 31 Jul 2017 17:53:07 -0700

> The driver needs CONFIG_HAS_IOMEM and OF to be functional, but we still
> let it build with COMPILE_TEST. This fixes the unmet dependency after
> selecting MDIO_BCM_UNIMAC in commit mentioned below:
> 
> warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has
> unmet direct dependencies (NETDEVICES && MDIO_DEVICE && HAS_IOMEM &&
> OF_MDIO)
> 
> Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver")
> Signed-off-by: Florian Fainelli 

Applied.

[PATCH net-next] net: bcmgenet: Add dependency on HAS_IOMEM && OF

2017-07-31 Thread Florian Fainelli

The driver needs CONFIG_HAS_IOMEM and OF to be functional, but we still
let it build with COMPILE_TEST. This fixes the unmet dependency after
selecting MDIO_BCM_UNIMAC in commit mentioned below:

warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has
unmet direct dependencies (NETDEVICES && MDIO_DEVICE && HAS_IOMEM &&
OF_MDIO)

Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index ec7a798c6bd1..45775399cab6 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -61,6 +61,7 @@ config BCM63XX_ENET
 
 config BCMGENET
tristate "Broadcom GENET internal MAC support"
+   depends on (OF && HAS_IOMEM) || COMPILE_TEST
select MII
select PHYLIB
select FIXED_PHY
-- 
2.9.3

Re: [PATCH net v3] MAINTAINERS: Add more files to the PHY LIBRARY section

2017-07-31 Thread David Miller

From: Florian Fainelli 
Date: Mon, 31 Jul 2017 09:47:50 -0700

> Include missing files that are provided by, used, or directly maintained
> within the PHY LIBRARY. This includes the uapi header, header files used
> by Device Tree code, etc.
> 
> Signed-off-by: Florian Fainelli 

Applied.

Re: [PATCH net] ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()

2017-07-31 Thread David Miller

From: Ido Schimmel 
Date: Fri, 28 Jul 2017 23:27:44 +0300

> Michał reported a NULL pointer deref during fib_sync_down_dev() when
> unregistering a netdevice. The problem is that we don't check for
> 'in_dev' being NULL, which can happen in very specific cases.
> 
> Usually routes are flushed upon NETDEV_DOWN sent in either the netdev or
> the inetaddr notification chains. However, if an interface isn't
> configured with any IP address, then it's possible for host routes to be
> flushed following NETDEV_UNREGISTER, after NULLing dev->ip_ptr in
> inetdev_destroy().
> 
> To reproduce:
> $ ip link add type dummy
> $ ip route add local 1.1.1.0/24 dev dummy0
> $ ip link del dev dummy0
> 
> Fix this by checking for the presence of 'in_dev' before referencing it.
> 
> Fixes: 982acb97560c ("ipv4: fib: Notify about nexthop status changes")
> Signed-off-by: Ido Schimmel 
> Reported-by: Michał Mirosław 
> ---
> Please consider this for -stable.

Applied and queued up for -stable, thanks!

[PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter

2017-07-31 Thread Amritha Nambiar

The Mirred/redirect action is extended to accept a traffic
class on the device in addition to the device's ifindex.

Usage: mirred

Example:
# tc qdisc add dev eth0 ingress

# tc filter add dev eth0 protocol ip parent : prio 1 flower\
  dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
  indev eth0 action mirred ingress redirect dev eth0 tc 1

Signed-off-by: Amritha Nambiar 
---
 tc/m_mirred.c |   26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)
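
One detail a reviewer might check in the parsing below: the result of atoi() is stored in a __u8 before the range check, so an input such as "256" would be truncated to 0 and accepted, and non-numeric input would also parse as 0. A stricter parser could look like the sketch that follows. It is a standalone illustration, not part of the patch; ex_parse_tc() and EX_TC_MAP_MAX are hypothetical names, with the bound copied from MIRRED_TC_MAP_MAX.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_TC_MAP_MAX 0x10	/* same bound as MIRRED_TC_MAP_MAX in the patch */

/* Parse a traffic class index. Returns 0 and stores the value in *tc
 * on success, -1 on non-numeric input, trailing junk, or a value
 * outside 0..EX_TC_MAP_MAX-1. */
int ex_parse_tc(const char *arg, uint8_t *tc)
{
	char *end;
	unsigned long val;

	if (!arg || *arg < '0' || *arg > '9')
		return -1;	/* rejects "", "-1", "+3", leading spaces */

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno || *end != '\0' || val >= EX_TC_MAP_MAX)
		return -1;	/* range is checked before narrowing */

	*tc = (uint8_t)val;
	return 0;
}
```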

diff --git a/include/linux/tc_act/tc_mirred.h b/include/linux/tc_act/tc_mirred.h
index 3d7a2b3..9a3aa61 100644
--- a/include/linux/tc_act/tc_mirred.h
+++ b/include/linux/tc_act/tc_mirred.h
@@ -9,6 +9,9 @@
 #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */
 #define TCA_INGRESS_REDIR 3  /* packet redirect to INGRESS*/
 #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */
+
+#define MIRRED_F_TC_MAP0x1
+#define MIRRED_TC_MAP_MAX  0x10

 
 struct tc_mirred {
tc_gen;
@@ -21,6 +24,7 @@ enum {
TCA_MIRRED_TM,
TCA_MIRRED_PARMS,
TCA_MIRRED_PAD,
+   TCA_MIRRED_TC_MAP,
__TCA_MIRRED_MAX
 };
 #define TCA_MIRRED_MAX (__TCA_MIRRED_MAX - 1)
diff --git a/tc/m_mirred.c b/tc/m_mirred.c
index 2384bda..1a18c6b 100644
--- a/tc/m_mirred.c
+++ b/tc/m_mirred.c
@@ -29,12 +29,13 @@
 static void
 explain(void)
 {
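-   fprintf(stderr, "Usage: mirred <DIRECTION> <ACTION> [index INDEX] <dev DEVICENAME>\n");
+   fprintf(stderr, "Usage: mirred <DIRECTION> <ACTION> [index INDEX] <dev DEVICENAME> [tc TCINDEX]\n");
fprintf(stderr, "where:\n");
fprintf(stderr, "\tDIRECTION := <ingress | egress>\n");
fprintf(stderr, "\tACTION := <mirror | redirect>\n");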
fprintf(stderr, "\tINDEX  is the specific policy instance id\n");
fprintf(stderr, "\tDEVICENAME is the devicename\n");
+   fprintf(stderr, "\tTCINDEX is the traffic class index\n");
 
 }
 
@@ -72,6 +73,8 @@ parse_direction(struct action_util *a, int *argc_p, char 
***argv_p,
struct tc_mirred p = {};
struct rtattr *tail;
char d[16] = {};
+   __u32 flags = 0;
+   __u8 tc;
 
while (argc > 0) {
 
@@ -142,6 +145,18 @@ parse_direction(struct action_util *a, int *argc_p, char 
***argv_p,
argc--;
argv++;
 
+   if ((argc > 0) && (matches(*argv, "tc") == 0)) {
+   NEXT_ARG();
+   tc = atoi(*argv);
+   if (tc >= MIRRED_TC_MAP_MAX) {
+   fprintf(stderr, "Invalid TC index\n");
+   return -1;
+   }
+   flags |= MIRRED_F_TC_MAP;
+   ok++;
+   argc--;
+   argv++;
+   }
break;
 
}
@@ -193,6 +208,9 @@ parse_direction(struct action_util *a, int *argc_p, char 
***argv_p,
tail = NLMSG_TAIL(n);
addattr_l(n, MAX_MSG, tca_id, NULL, 0);
addattr_l(n, MAX_MSG, TCA_MIRRED_PARMS, &p, sizeof(p));
+   if (flags & MIRRED_F_TC_MAP)
+   addattr_l(n, MAX_MSG, TCA_MIRRED_TC_MAP,
+ &tc, sizeof(tc));
tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
 
*argc_p = argc;
@@ -248,6 +266,7 @@ print_mirred(struct action_util *au, FILE * f, struct 
rtattr *arg)
struct tc_mirred *p;
struct rtattr *tb[TCA_MIRRED_MAX + 1];
const char *dev;
+   __u8 *tc;
 
if (arg == NULL)
return -1;
@@ -273,6 +292,11 @@ print_mirred(struct action_util *au, FILE * f, struct 
rtattr *arg)
fprintf(f, "mirred (%s to device %s)", mirred_n2a(p->eaction), dev);
print_action_control(f, " ", p->action, "");
 
+   if (tb[TCA_MIRRED_TC_MAP]) {
+   tc = RTA_DATA(tb[TCA_MIRRED_TC_MAP]);
+   fprintf(f, " tc %d", *tc);
+   }
+
fprintf(f, "\n ");
fprintf(f, "\tindex %u ref %d bind %d", p->index, p->refcnt,
p->bindcnt);

[PATCH 4/6] [net-next]net: i40e: Admin queue definitions for cloud filters

2017-07-31 Thread Amritha Nambiar

Add new admin queue definitions and extended fields for cloud
filter support. Define big buffer for extended general fields
in Add/Remove Cloud filters command.

Signed-off-by: Amritha Nambiar 
Signed-off-by: Kiran Patil 
Signed-off-by: Store Laura 
Signed-off-by: Iremonger Bernard 
Signed-off-by: Jingjing Wu 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |   98 +
 1 file changed, 97 insertions(+), 1 deletion(-)
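
The first hunk below carves big_buffer_flag out of reserved2[4], which keeps the command layout the same size and leaves the address fields where they were. A minimal sketch of that property, using hypothetical mirror structs (the leading head[4] is a placeholder for the fields the diff does not show, and uint32_t stands in for __le32):

```c
#include <stddef.h>
#include <stdint.h>

/* Layout before the change: four reserved bytes. */
struct ex_cmd_before {
	uint8_t  head[4];		/* placeholder for fields not shown */
	uint8_t  reserved2[4];
	uint32_t addr_high;
	uint32_t addr_low;
};

/* Layout after the change: one reserved byte becomes a flag. */
struct ex_cmd_after {
	uint8_t  head[4];		/* placeholder for fields not shown */
	uint8_t  big_buffer_flag;
	uint8_t  reserved2[3];
	uint32_t addr_high;
	uint32_t addr_low;
};

/* Compile-time checks: same size, same offsets for the address words. */
_Static_assert(sizeof(struct ex_cmd_before) == sizeof(struct ex_cmd_after),
	       "carving a flag out of reserved space must not change the size");
_Static_assert(offsetof(struct ex_cmd_before, addr_high) ==
	       offsetof(struct ex_cmd_after, addr_high),
	       "addr_high must not move");
```

This is the same kind of guarantee that a compile-time length check on the real structure is meant to provide.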

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 8bba04c..9f14305 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -1358,7 +1358,9 @@ struct i40e_aqc_add_remove_cloud_filters {
 #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT  0
 #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_MASK   (0x3FF << \
I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT)
-   u8  reserved2[4];
+   u8  big_buffer_flag;
+#defineI40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER   1
+   u8  reserved2[3];
__le32  addr_high;
__le32  addr_low;
 };
@@ -1395,6 +1397,13 @@ struct i40e_aqc_add_remove_cloud_filters_element_data {
 #define I40E_AQC_ADD_CLOUD_FILTER_IMAC 0x000A
 #define I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC 0x000B
 #define I40E_AQC_ADD_CLOUD_FILTER_IIP  0x000C
+/* 0x0010 to 0x0017 is for custom filters */
+/* flag to be used when adding cloud filter: IP + L4 Port */
+#define I40E_AQC_ADD_CLOUD_FILTER_IP_PORT  0x0010
+/* flag to be used when adding cloud filter: Dest MAC + L4 Port */
+#define I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT 0x0011
+/* flag to be used when adding cloud filter: Dest MAC + VLAN + L4 Port */
+#define I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT0x0012
 
 #define I40E_AQC_ADD_CLOUD_FLAGS_TO_QUEUE  0x0080
 #define I40E_AQC_ADD_CLOUD_VNK_SHIFT   6
@@ -1429,6 +1438,45 @@ struct i40e_aqc_add_remove_cloud_filters_element_data {
u8  response_reserved[7];
 };
 
+/* i40e_aqc_add_remove_cloud_filters_element_big_data is used when
+ * I40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER flag is set.
+ */
+struct i40e_aqc_add_remove_cloud_filters_element_big_data {
+   struct i40e_aqc_add_remove_cloud_filters_element_data element;
+   u16 general_fields[32];
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD0   0
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD1   1
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD2   2
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD0   3
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD1   4
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD2   5
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD0   6
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD1   7
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD2   8
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD0   9
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD1   10
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD2   11
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD0   12
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD1   13
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD2   14
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0   15
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD1   16
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD2   17
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD3   18
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD4   19
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD5   20
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD6   21
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD7   22
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD0   23
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD1   24
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD2   25
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD3   26
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD4   27
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD5   28
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD6   29
+#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD7   30
+};
+
 struct i40e_aqc_remove_cloud_filters_completion {
__le16 perfect_ovlan_used;
__le16 perfect_ovlan_free;
@@ -1440,6 +1488,54 @@ struct i40e_aqc_remove_cloud_filters_completion {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_remove_cloud_filters_completion);
 
+/* Replace filter Command 0x025F
+ * uses the i40e_aqc_replace_cloud_filters,
+ * and the generic indirect completion structure
+ */
+struct i40e_filter_data {
+   u8 filter_type;
+   u8 input[3];
+};
+
+struct i40e_aqc_replace_cloud_filters_cmd {
+   u8  valid_flags;
+#define I40E_AQC_REPLACE_L1_FILTER 0x0
+#define I40E_AQC_REPLACE_CLOUD_FILTER  0x1
+#define I40E_AQC_GET_CLOUD_FILTERS 0x2
+#define I40E_AQC_MIRROR_CLOUD_FILTER   0x4
+#define I40E_AQC_HIGH_PRIORITY_CLOUD_FILTER0x8
+   u8  old_filter_type;
+   u8

[PATCH 6/6] [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier

2017-07-31 Thread Amritha Nambiar

This patch enables tc-flower based hardware offloads. tc/flower
filter provided by the kernel is configured as driver specific
cloud filter. The patch implements functions and admin queue
commands needed to support cloud filters in the driver and
adds cloud filters to configure these tc-flower filters.

The only action supported is to redirect packets to a traffic class
on the same device.

# tc qdisc add dev eth0 ingress
# ethtool -K eth0 hw-tc-offload on

# tc filter add dev eth0 protocol ip parent :\
  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw indev eth0\
  action mirred ingress redirect dev eth0 tc 0

# tc filter add dev eth0 protocol ip parent :\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw indev eth0\
  action mirred ingress redirect dev eth0 tc 1

# tc filter add dev eth0 protocol ipv6 parent :\
  prio 3 flower dst_ip fe8::200:1\
  ip_proto udp dst_port 66 skip_sw indev eth0\
  action mirred ingress redirect dev eth0 tc 2

Delete tc flower filter:
Example:

# tc filter del dev eth0 parent : prio 3 handle 0x1 flower
# tc filter del dev eth0 parent :

Flow Director Sideband is disabled while configuring cloud filters
via tc-flower.

Unsupported matches when cloud filters are added using the enhanced
big buffer cloud filter mode of the underlying switch include:
1. Source port and source IP
2. Combined MAC address and IP fields
3. Not specifying an L4 port

These filter matches can however be used to redirect traffic to
the main VSI (tc 0) which does not require the enhanced big buffer
cloud filter support.

Signed-off-by: Amritha Nambiar 
Signed-off-by: Kiran Patil 
---
 drivers/net/ethernet/intel/i40e/i40e.h   |   46 +
 drivers/net/ethernet/intel/i40e/i40e_common.c|  180 
 drivers/net/ethernet/intel/i40e/i40e_main.c  |  952 ++
 drivers/net/ethernet/intel/i40e/i40e_prototype.h |   17 
 4 files changed, 1193 insertions(+), 2 deletions(-)
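
The restrictions listed in the description can be read as a small predicate over the fields a rule matches on. The sketch below is one possible reading, not the driver's validation code: struct ex_match and both functions are hypothetical, and it assumes that a source port or a source IP on its own is already enough to make a rule unsupported in big buffer mode.

```c
#include <stdbool.h>

/* Hypothetical summary of which fields a flower rule matches on. */
struct ex_match {
	bool dst_mac;
	bool dst_ip;
	bool src_ip;
	bool dst_port;
	bool src_port;
};

/* True if the match fits the enhanced big buffer mode, following the
 * three restrictions listed in the description. */
bool ex_big_buffer_ok(const struct ex_match *m)
{
	if (m->src_port || m->src_ip)
		return false;	/* 1. source port / source IP (assumed: either) */
	if (m->dst_mac && m->dst_ip)
		return false;	/* 2. MAC address combined with IP fields */
	if (!m->dst_port)
		return false;	/* 3. no L4 port given */
	return true;
}

/* tc 0 is the main VSI and does not need big buffer mode. */
bool ex_offload_ok(const struct ex_match *m, unsigned int tc)
{
	return tc == 0 || ex_big_buffer_ok(m);
}
```

Under this reading, the dst_mac-only example from the description is accepted for tc 0 but would be rejected for tc 1, while the dst_ip plus dst_port examples are accepted for non-zero traffic classes.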

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 5c0cad5..7288265 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -55,6 +55,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "i40e_type.h"
 #include "i40e_prototype.h"
 #include "i40e_client.h"
@@ -252,10 +254,51 @@ struct i40e_fdir_filter {
u32 fd_id;
 };
 
+#define I40E_CLOUD_FIELD_OMAC  0x01
+#define I40E_CLOUD_FIELD_IMAC  0x02
+#define I40E_CLOUD_FIELD_IVLAN 0x04
+#define I40E_CLOUD_FIELD_TEN_ID0x08
+#define I40E_CLOUD_FIELD_IIP   0x10
+
+#define I40E_CLOUD_FILTER_FLAGS_OMAC   I40E_CLOUD_FIELD_OMAC
+#define I40E_CLOUD_FILTER_FLAGS_IMAC   I40E_CLOUD_FIELD_IMAC
+#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN (I40E_CLOUD_FIELD_IMAC | \
+I40E_CLOUD_FIELD_IVLAN)
+#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID(I40E_CLOUD_FIELD_IMAC | \
+I40E_CLOUD_FIELD_TEN_ID)
+#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
+ I40E_CLOUD_FIELD_IMAC | \
+ I40E_CLOUD_FIELD_TEN_ID)
+#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
+  I40E_CLOUD_FIELD_IVLAN | \
+  I40E_CLOUD_FIELD_TEN_ID)
+#define I40E_CLOUD_FILTER_FLAGS_IIPI40E_CLOUD_FIELD_IIP
+
 struct i40e_cloud_filter {
struct hlist_node cloud_node;
/* cloud filter input set follows */
unsigned long cookie;
+   u8 dst_mac[ETH_ALEN];
+   u8 src_mac[ETH_ALEN];
+   __be16 vlan_id;
+   __be32 dst_ip[4];
+   __be32 src_ip[4];
+   u8 dst_ipv6[16];
+   u8 src_ipv6[16];
+   __be16 dst_port;
+   __be16 src_port;
+   /* matter only when IP based filtering is set */
+   bool is_ipv6;
+   /* IPPROTO value */
+   u8 ip_proto;
+   /* L4 port type: src or destination port */
+#define I40E_CLOUD_FILTER_PORT_SRC 0x01
+#define I40E_CLOUD_FILTER_PORT_DEST0x02
+   u8 port_type;
+   u32 tenant_id;
+   u8 flags;
+#define I40E_CLOUD_TNL_TYPE_NONE   0xff
+   u8 tunnel_type;
/* filter control */
u16 seid;
 };
@@ -574,6 +617,9 @@ struct i40e_pf {
u16 phy_led_val;
 
u16 override_q_count;
+   u16 last_sw_conf_flags;
+   u16 last_sw_conf_valid_flags;
+
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index d0e8138..bfbe304 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -5269,5 +5269,185 @@ i40e_add_pinfo_to_list(struct i40e_hw *hw,
 
status = i40e_aq_write_ppp(hw, (void *)sec, sec->data_end,

[PATCH 3/6] [net-next]net: i40e: Extend set switch config command to accept cloud filter mode

2017-07-31 Thread Amritha Nambiar

Add definitions for L4 filters and switch modes based on cloud filters
modes and extend the set switch config command to include the
additional cloud filter mode.

Signed-off-by: Amritha Nambiar 
Signed-off-by: Kiran Patil 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |   34 -
 drivers/net/ethernet/intel/i40e/i40e_common.c |4 ++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c|2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |2 +
 drivers/net/ethernet/intel/i40e/i40e_prototype.h  |2 +
 drivers/net/ethernet/intel/i40e/i40e_type.h   |9 ++
 6 files changed, 48 insertions(+), 5 deletions(-)
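
The new mode byte packs four fields into one u8, as described in the comment added to i40e_aqc_set_switch_config below. A minimal sketch of composing such a byte, assuming the bit layout from that comment; the EX_ names and ex_switch_mode() are hypothetical, with values copied from the defines in the patch:

```c
#include <stdint.h>

/* Values copied from the defines added by the patch. */
#define EX_BIT7_VALID		0x80
#define EX_L4_SRC_PORT		0x40
#define EX_L4_TYPE_TCP		0x10
#define EX_L4_TYPE_UDP		0x20
#define EX_L4_TYPE_BOTH		0x30
#define EX_MODE_L4_PORT		0x01
#define EX_MODE_NON_TUNNEL	0x02

/* Build the mode byte: valid bit, port direction, L4 type, mode. */
uint8_t ex_switch_mode(uint8_t l4_type, int src_port, uint8_t mode)
{
	uint8_t v = EX_BIT7_VALID;	/* bit 7: switch to mode in bits 6:0 */

	if (src_port)
		v |= EX_L4_SRC_PORT;	/* bit 6: 1 = source port */
	v |= l4_type & 0x30;		/* bits 5..4: L4 type */
	v |= mode & 0x0f;		/* bits 3..0: mode */
	return v;
}
```

For example, TCP on the destination port in non-tunneled mode comes out as 0x92, and passing 0 for the whole byte (as the existing callers in this patch do) leaves bit 7 clear, which the comment describes as "no action".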

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index e2a9ec8..8bba04c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -773,7 +773,39 @@ struct i40e_aqc_set_switch_config {
 #define I40E_AQ_SET_SWITCH_CFG_PROMISC 0x0001
 #define I40E_AQ_SET_SWITCH_CFG_L2_FILTER   0x0002
__le16  valid_flags;
-   u8  reserved[12];
+
+   u8  rsvd6[6];
+
+   /* Next byte is split into following:
+* Bit 7 : 0: No action, 1: Switch to mode defined by bits 6:0
+* Bit 6: 0 : Destination Port, 1: source port
+* Bit 5..4: L4 type
+*  0: rsvd
+*  1: TCP
+*  2: UDP
+*  3: Both TCP and UDP
+* Bits 3:0 Mode
+*  0: default mode
+*  1: L4 port only mode
+*  2: non-tunneled mode
+*  3: tunneled mode
+*/
+#define I40E_AQ_SET_SWITCH_BIT7_VALID  0x80
+
+#define I40E_AQ_SET_SWITCH_L4_SRC_PORT 0x40
+
+#define I40E_AQ_SET_SWITCH_L4_TYPE_RSVD0x00
+#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP 0x10
+#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP 0x20
+#define I40E_AQ_SET_SWITCH_L4_TYPE_BOTH0x30
+
+#define I40E_AQ_SET_SWITCH_MODE_DEFAULT0x00
+#define I40E_AQ_SET_SWITCH_MODE_L4_PORT0x01
+#define I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL 0x02
+#define I40E_AQ_SET_SWITCH_MODE_TUNNEL 0x03
+   u8 mode;
+
+   u8  rsvd5[5];
 };
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_set_switch_config);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index e4e86e0..d0e8138 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -2380,13 +2380,14 @@ i40e_status i40e_aq_get_switch_config(struct i40e_hw 
*hw,
  * @hw: pointer to the hardware structure
  * @flags: bit flag values to set
  * @valid_flags: which bit flags to set
+ * @mode: cloud filter mode
  * @cmd_details: pointer to command details structure or NULL
  *
  * Set switch configuration bits
  **/
 enum i40e_status_code i40e_aq_set_switch_config(struct i40e_hw *hw,
u16 flags,
-   u16 valid_flags,
+   u16 valid_flags, u8 mode,
struct i40e_asq_cmd_details *cmd_details)
 {
struct i40e_aq_desc desc;
@@ -2398,6 +2399,7 @@ enum i40e_status_code i40e_aq_set_switch_config(struct 
i40e_hw *hw,
  i40e_aqc_opc_set_switch_config);
scfg->flags = cpu_to_le16(flags);
scfg->valid_flags = cpu_to_le16(valid_flags);
+   scfg->mode = mode;
 
status = i40e_asq_send_command(hw, , NULL, 0, cmd_details);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 326fc18..232e066e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -4181,7 +4181,7 @@ static int i40e_set_priv_flags(struct net_device *dev, 
u32 flags)
sw_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
valid_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
ret = i40e_aq_set_switch_config(>hw, sw_flags, valid_flags,
-   NULL);
+   0, NULL);
if (ret && pf->hw.aq.asq_last_status != I40E_AQ_RC_ESRCH) {
dev_info(>pdev->dev,
 "couldn't set switch config bits, err %s 
aq_err %s\n",
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 1daf95e..f74 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12107,7 +12107,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, 
bool reinit)
u16 valid_flags;
 
valid_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
-   ret =
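
The bit layout documented in the i40e_adminq_cmd.h hunk above can be sketched in plain user-space C. The macro values are copied from the patch; the helper switch_mode_byte() is hypothetical and is not part of the series. It only illustrates how the valid bit, the port-direction bit, the L4 type and the mode might be combined into the single `mode` byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the i40e_adminq_cmd.h hunk in this patch. */
#define I40E_AQ_SET_SWITCH_BIT7_VALID   0x80
#define I40E_AQ_SET_SWITCH_L4_SRC_PORT  0x40
#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP  0x10
#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP  0x20
#define I40E_AQ_SET_SWITCH_L4_TYPE_BOTH 0x30
#define I40E_AQ_SET_SWITCH_MODE_L4_PORT 0x01

/* Hypothetical helper (not in the patch): pack the mode byte. */
static uint8_t switch_mode_byte(bool src_port, uint8_t l4_type, uint8_t mode)
{
	uint8_t val = I40E_AQ_SET_SWITCH_BIT7_VALID;

	/* The L4 type occupies bits 5..4 and the mode bits 3..0. */
	assert((l4_type & ~0x30) == 0);
	assert((mode & ~0x0f) == 0);

	if (src_port)
		val |= I40E_AQ_SET_SWITCH_L4_SRC_PORT;
	return val | l4_type | mode;
}
```

Callers that do not want to change the mode, such as the i40e_set_priv_flags() hunk above, simply pass 0, which leaves bit 7 clear ("No action").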

[PATCH 2/6] [net-next]net: i40e: Maintain a mapping of TCs with the VSI seids

2017-07-31 Thread Amritha Nambiar

Add a mapping of TCs to the seids of the channel VSIs. TC0
is mapped to the main VSI seid and every other TC is
mapped to the seid of its channel VSI.

Signed-off-by: Amritha Nambiar 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 8852ac0..1391e5d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -738,6 +738,7 @@ struct i40e_vsi {
atomic_t next_base_queue;
 
struct list_head ch_list;
+   u16 tc_seid_map[I40E_MAX_TRAFFIC_CLASS];
 
void *priv; /* client driver data reference. */
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 370ce9f..1daf95e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6127,6 +6127,7 @@ static int i40e_configure_queue_channels(struct i40e_vsi 
*vsi)
int ret = 0, i;
 
/* Create app vsi with the TCs. Main VSI with TC0 is already set up */
+   vsi->tc_seid_map[0] = vsi->seid;
for (i = 1; i < I40E_MAX_TRAFFIC_CLASS; i++)
if (vsi->tc_config.enabled_tc & BIT(i)) {
ch = kzalloc(sizeof(*ch), GFP_KERNEL);
@@ -6156,6 +6157,7 @@ static int i40e_configure_queue_channels(struct i40e_vsi 
*vsi)
i, ch->num_queue_pairs);
goto err_free;
}
+   vsi->tc_seid_map[i] = ch->seid;
}
return ret;
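
A rough user-space model of the mapping this patch maintains is sketched below. The struct, the MAX_TC value and both helpers are assumptions made for illustration; only the rule "TC0 maps to the main VSI, every other enabled TC maps to its channel VSI" comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TC 8 /* assumption standing in for I40E_MAX_TRAFFIC_CLASS */

/* Toy stand-in for the fields of struct i40e_vsi used here. */
struct toy_vsi {
	uint16_t seid;                 /* seid of the main VSI */
	uint8_t  enabled_tc;           /* bitmap of enabled traffic classes */
	uint16_t tc_seid_map[MAX_TC];  /* TC index -> VSI seid */
};

/* Fill the map: TC0 -> main VSI, each other enabled TC -> its channel. */
static void toy_map_tcs(struct toy_vsi *vsi, const uint16_t *ch_seid)
{
	vsi->tc_seid_map[0] = vsi->seid;
	for (int i = 1; i < MAX_TC; i++)
		if (vsi->enabled_tc & (1u << i))
			vsi->tc_seid_map[i] = ch_seid[i];
}

/* Hypothetical lookup; in this toy model 0 means "no VSI for this TC". */
static uint16_t toy_tc_to_seid(const struct toy_vsi *vsi, unsigned int tc)
{
	assert(tc < MAX_TC);
	return vsi->tc_seid_map[tc];
}
```

A later consumer, for example a filter that redirects to "tc N", could then resolve N to a destination seid with a single array lookup.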

[PATCH 1/6] [net-next]net: sched: act_mirred: Extend redirect action to accept a traffic class

2017-07-31 Thread Amritha Nambiar

The Mirred/redirect action is extended to forward to a traffic
class on the device. The traffic class index needs to be
provided in addition to the device's ifindex.

Example:
# tc filter add dev eth0 protocol ip parent : prio 1 flower\
  dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
  skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1

Signed-off-by: Amritha Nambiar 
---
 include/net/tc_act/tc_mirred.h|7 +++
 include/uapi/linux/tc_act/tc_mirred.h |5 +
 net/sched/act_mirred.c|   17 +
 3 files changed, 29 insertions(+)

diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index 604bc31..60058c4 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -9,6 +9,8 @@ struct tcf_mirred {
int tcfm_eaction;
int tcfm_ifindex;
booltcfm_mac_header_xmit;
+   u8  tcfm_tc;
+   u32 flags;
struct net_device __rcu *tcfm_dev;
struct list_headtcfm_list;
 };
@@ -37,4 +39,9 @@ static inline int tcf_mirred_ifindex(const struct tc_action 
*a)
return to_mirred(a)->tcfm_ifindex;
 }
 
+static inline int tcf_mirred_tc(const struct tc_action *a)
+{
+   return to_mirred(a)->tcfm_tc;
+}
+
 #endif /* __NET_TC_MIR_H */
diff --git a/include/uapi/linux/tc_act/tc_mirred.h 
b/include/uapi/linux/tc_act/tc_mirred.h
index 3d7a2b3..8ff4d76 100644
--- a/include/uapi/linux/tc_act/tc_mirred.h
+++ b/include/uapi/linux/tc_act/tc_mirred.h
@@ -9,6 +9,10 @@
 #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */
 #define TCA_INGRESS_REDIR 3  /* packet redirect to INGRESS*/
 #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */
+
+#define MIRRED_F_TC_MAP0x1
+#define MIRRED_TC_MAP_MAX  0x10
+#define MIRRED_TC_MAP_MASK 0xF

 
 struct tc_mirred {
tc_gen;
@@ -21,6 +25,7 @@ enum {
TCA_MIRRED_TM,
TCA_MIRRED_PARMS,
TCA_MIRRED_PAD,
+   TCA_MIRRED_TC_MAP,
__TCA_MIRRED_MAX
 };
 #define TCA_MIRRED_MAX (__TCA_MIRRED_MAX - 1)
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 1b5549a..f9801de 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -67,6 +67,7 @@ static void tcf_mirred_release(struct tc_action *a, int bind)
 
 static const struct nla_policy mirred_policy[TCA_MIRRED_MAX + 1] = {
[TCA_MIRRED_PARMS]  = { .len = sizeof(struct tc_mirred) },
+   [TCA_MIRRED_TC_MAP] = { .type = NLA_U8 },
 };
 
 static unsigned int mirred_net_id;
@@ -83,6 +84,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
struct tcf_mirred *m;
struct net_device *dev;
bool exists = false;
+   u8 *tc_map = NULL;
+   u32 flags = 0;
int ret;
 
if (nla == NULL)
@@ -92,6 +95,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
return ret;
if (tb[TCA_MIRRED_PARMS] == NULL)
return -EINVAL;
+
+   if (tb[TCA_MIRRED_TC_MAP]) {
+   tc_map = nla_data(tb[TCA_MIRRED_TC_MAP]);
+   if (*tc_map >= MIRRED_TC_MAP_MAX)
+   return -EINVAL;
+   flags |= MIRRED_F_TC_MAP;
+   }
+
parm = nla_data(tb[TCA_MIRRED_PARMS]);
 
exists = tcf_hash_check(tn, parm->index, a, bind);
@@ -139,6 +150,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
ASSERT_RTNL();
m->tcf_action = parm->action;
m->tcfm_eaction = parm->eaction;
+   m->flags = flags;
if (dev != NULL) {
m->tcfm_ifindex = parm->ifindex;
if (ret != ACT_P_CREATED)
@@ -146,6 +158,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
dev_hold(dev);
rcu_assign_pointer(m->tcfm_dev, dev);
m->tcfm_mac_header_xmit = mac_header_xmit;
+   if (flags & MIRRED_F_TC_MAP)
+   m->tcfm_tc = *tc_map & MIRRED_TC_MAP_MASK;
}
 
if (ret == ACT_P_CREATED) {
@@ -259,6 +273,9 @@ static int tcf_mirred_dump(struct sk_buff *skb, struct 
tc_action *a, int bind,
 
if (nla_put(skb, TCA_MIRRED_PARMS, sizeof(opt), ))
goto nla_put_failure;
+   if ((m->flags & MIRRED_F_TC_MAP) &&
+   nla_put_u8(skb, TCA_MIRRED_TC_MAP, m->tcfm_tc))
+   goto nla_put_failure;
 
tcf_tm_dump(, >tcf_tm);
if (nla_put_64bit(skb, TCA_MIRRED_TM, sizeof(t), , TCA_MIRRED_PAD))
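
The validation added to tcf_mirred_init() can be summarised with a small user-space sketch. The three constants are copied from the uapi hunk above; struct toy_mirred and toy_set_tc() are hypothetical stand-ins, and the sketch ignores the `dev != NULL` condition under which the real code stores the value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the uapi hunk in this patch. */
#define MIRRED_F_TC_MAP    0x1
#define MIRRED_TC_MAP_MAX  0x10
#define MIRRED_TC_MAP_MASK 0xF

/* Toy stand-in for the two fields the patch adds to struct tcf_mirred. */
struct toy_mirred {
	uint8_t  tcfm_tc;
	uint32_t flags;
};

/*
 * Hypothetical helper mirroring the checks in tcf_mirred_init():
 * tc_attr is NULL when the TCA_MIRRED_TC_MAP attribute is absent.
 */
static int toy_set_tc(struct toy_mirred *m, const uint8_t *tc_attr)
{
	assert(m != NULL);
	m->flags = 0;
	if (!tc_attr)
		return 0;                     /* plain redirect, no TC */
	if (*tc_attr >= MIRRED_TC_MAP_MAX)
		return -EINVAL;               /* TC index out of range */
	m->flags |= MIRRED_F_TC_MAP;
	m->tcfm_tc = *tc_attr & MIRRED_TC_MAP_MASK;
	return 0;
}
```

The flag is what lets the dump path distinguish "tc 0 was requested" from "no tc was given", which is why tcf_mirred_dump() emits the attribute only when MIRRED_F_TC_MAP is set.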

[PATCH 5/6] [net-next]net: i40e: Clean up of cloud filters

2017-07-31 Thread Amritha Nambiar

Introduce the cloud filter data structure and clean up the cloud
filters associated with the device.

Signed-off-by: Amritha Nambiar 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |   11 +++
 drivers/net/ethernet/intel/i40e/i40e_main.c |   27 +++
 2 files changed, 38 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 1391e5d..5c0cad5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -252,6 +252,14 @@ struct i40e_fdir_filter {
u32 fd_id;
 };
 
+struct i40e_cloud_filter {
+   struct hlist_node cloud_node;
+   /* cloud filter input set follows */
+   unsigned long cookie;
+   /* filter control */
+   u16 seid;
+};
+
 #define I40E_ETH_P_LLDP0x88cc
 
 #define I40E_DCB_PRIO_TYPE_STRICT  0
@@ -419,6 +427,9 @@ struct i40e_pf {
struct i40e_udp_port_config udp_ports[I40E_MAX_PF_UDP_OFFLOAD_PORTS];
u16 pending_udp_bitmap;
 
+   struct hlist_head cloud_filter_list;
+   u16 num_cloud_filters;
+
enum i40e_interrupt_policy int_policy;
u16 rx_itr_default;
u16 tx_itr_default;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f74..93f6fe2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6928,6 +6928,29 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf)
 }
 
 /**
+ * i40e_cloud_filter_exit - Cleans up the Cloud Filters
+ * @pf: Pointer to PF
+ *
+ * This function destroys the hlist where all the Cloud
+ * Filters were saved.
+ **/
+static void i40e_cloud_filter_exit(struct i40e_pf *pf)
+{
+   struct i40e_cloud_filter *cfilter;
+   struct hlist_node *node;
+
+   if (hlist_empty(>cloud_filter_list))
+   return;
+
+   hlist_for_each_entry_safe(cfilter, node,
+ >cloud_filter_list, cloud_node) {
+   hlist_del(>cloud_node);
+   kfree(cfilter);
+   }
+   pf->num_cloud_filters = 0;
+}
+
+/**
  * i40e_close - Disables a network interface
  * @netdev: network interface device structure
  *
@@ -12137,6 +12160,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, 
bool reinit)
vsi = i40e_vsi_reinit_setup(pf->vsi[pf->lan_vsi]);
if (!vsi) {
dev_info(>pdev->dev, "setup of MAIN VSI failed\n");
+   i40e_cloud_filter_exit(pf);
i40e_fdir_teardown(pf);
return -EAGAIN;
}
@@ -12961,6 +12985,8 @@ static void i40e_remove(struct pci_dev *pdev)
if (pf->vsi[pf->lan_vsi])
i40e_vsi_release(pf->vsi[pf->lan_vsi]);
 
+   i40e_cloud_filter_exit(pf);
+
/* remove attached clients */
if (pf->flags & I40E_FLAG_IWARP_ENABLED) {
ret_code = i40e_lan_del_device(pf);
@@ -13170,6 +13196,7 @@ static void i40e_shutdown(struct pci_dev *pdev)
 
del_timer_sync(>service_timer);
cancel_work_sync(>service_task);
+   i40e_cloud_filter_exit(pf);
i40e_fdir_teardown(pf);
 
/* Client close must be called explicitly here because the timer
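
The cleanup pattern used by i40e_cloud_filter_exit() is sketched below in user-space C. A plain `next` pointer replaces the kernel's hlist, and every name prefixed with toy_ is an assumption made for the example; the point is only the "save the successor before freeing" iteration that hlist_for_each_entry_safe() provides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for struct i40e_cloud_filter. */
struct toy_filter {
	struct toy_filter *next;   /* replaces the kernel's hlist_node */
	unsigned long cookie;
	uint16_t seid;
};

struct toy_pf {
	struct toy_filter *cloud_filter_list;
	uint16_t num_cloud_filters;
};

/* Hypothetical add helper so the example has something to free. */
static int toy_filter_add(struct toy_pf *pf, unsigned long cookie,
			  uint16_t seid)
{
	struct toy_filter *f = calloc(1, sizeof(*f));

	if (!f)
		return -1;
	f->cookie = cookie;
	f->seid = seid;
	f->next = pf->cloud_filter_list;
	pf->cloud_filter_list = f;
	pf->num_cloud_filters++;
	return 0;
}

/* Same shape as i40e_cloud_filter_exit(): remember the successor
 * before freeing the current node, then reset the counter. */
static void toy_filter_exit(struct toy_pf *pf)
{
	struct toy_filter *f;

	assert(pf != NULL);
	f = pf->cloud_filter_list;
	while (f) {
		struct toy_filter *next = f->next;

		free(f);
		f = next;
	}
	pf->cloud_filter_list = NULL;
	pf->num_cloud_filters = 0;
}
```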

[PATCH net-next RFC 0/6] Configure cloud filters in i40e via tc/flower classifier

2017-07-31 Thread Amritha Nambiar

This patch series enables configuring cloud filters in i40e
using the tc/flower classifier. The only tc-filter action
supported is to redirect packets to a traffic class on the
same device. The tc/mirred:redirect action is extended to
accept a traffic class to achieve this.

The cloud filters are added for a VSI and are cleaned up when
the VSI is deleted. The filters that match on L4 ports need
enhanced admin queue functions with big buffer support for the
extended general fields in the Add/Remove Cloud filters command.

Example:
# tc qdisc add dev eth0 ingress

# ethtool -K eth0 hw-tc-offload on

# tc filter add dev eth0 protocol ip parent : prio 1 flower\
  dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
  skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1

# tc filter show dev eth0 parent :
filter protocol ip pref 1 flower
filter protocol ip pref 1 flower handle 0x1
  indev eth0
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.1.1
  dst_port 22
  skip_sw
  in_hw
action order 1: mirred (Ingress Redirect to device eth0) stolen tc 1
index 1 ref 1 bind 1
---

Amritha Nambiar (6):
  [net-next]net: sched: act_mirred: Extend redirect action to accept a 
traffic class
  [net-next]net: i40e: Maintain a mapping of TCs with the VSI seids
  [net-next]net: i40e: Extend set switch config command to accept cloud 
filter mode
  [net-next]net: i40e: Admin queue definitions for cloud filters
  [net-next]net: i40e: Clean up of cloud filters
  [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier


 drivers/net/ethernet/intel/i40e/i40e.h|   58 +
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |  132 +++
 drivers/net/ethernet/intel/i40e/i40e_common.c |  184 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c|2 
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  983 +
 drivers/net/ethernet/intel/i40e/i40e_prototype.h  |   19 
 drivers/net/ethernet/intel/i40e/i40e_type.h   |9 
 include/net/tc_act/tc_mirred.h|7 
 include/uapi/linux/tc_act/tc_mirred.h |5 
 net/sched/act_mirred.c|   17 
 10 files changed, 1408 insertions(+), 8 deletions(-)

--

Re: [PATCH net-next 1/7] net: phy: mdio-bcm-unimac: factor busy polling loop

2017-07-31 Thread Florian Fainelli

On 07/31/2017 05:28 PM, kbuild test robot wrote:
> Hi Florian,
> 
> [auto build test ERROR on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847
> config: xtensa-allmodconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 4.9.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=xtensa 
> 
> Note: the 
> linux-review/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847
>  HEAD 68043f6ab1b54d29abcfc56fdec46d280b76 builds fine.
>   It only hurts bisectability.
> 
> All errors (new ones prefixed by >>):
> 
>drivers/net/phy/mdio-bcm-unimac.c: In function 'unimac_mdio_read':
>>> drivers/net/phy/mdio-bcm-unimac.c:89:2: error: 'ret' undeclared (first use 
>>> in this function)
>  ret = unimac_mdio_poll(priv);
>  ^
>drivers/net/phy/mdio-bcm-unimac.c:89:2: note: each undeclared identifier 
> is reported only once for each function it appears in
> 
> vim +/ret +89 drivers/net/phy/mdio-bcm-unimac.c

This is "just" a bisectability problem: patch 4 does actually add the
int ret variable to store the return value...

I will still fix the unmet dependency warning; "depends on" would actually
be more correct here anyway.

> 
> 76
> 77static int unimac_mdio_read(struct mii_bus *bus, int phy_id, 
> int reg)
> 78{
> 79struct unimac_mdio_priv *priv = bus->priv;
> 80u32 cmd;
> 81
> 82/* Prepare the read operation */
> 83cmd = MDIO_RD | (phy_id << MDIO_PMD_SHIFT) | (reg << 
> MDIO_REG_SHIFT);
> 84__raw_writel(cmd, priv->base + MDIO_CMD);
> 85
> 86/* Start MDIO transaction */
> 87unimac_mdio_start(priv);
> 88
>   > 89ret = unimac_mdio_poll(priv);
> 90if (ret)
> 91return ret;
> 92
> 93cmd = __raw_readl(priv->base + MDIO_CMD);
> 94
> 95/* Some broken devices are known not to release the 
> line during
> 96 * turn-around, e.g: Broadcom BCM53125 external 
> switches, so check for
> 97 * that condition here and ignore the MDIO controller 
> read failure
> 98 * indication.
> 99 */
>100if (!(bus->phy_ignore_ta_mask & 1 << phy_id) && (cmd & 
> MDIO_READ_FAIL))
>101return -EIO;
>102
>103return cmd & 0x;
>104}
>105
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> 


-- 
Florian

Re: [PATCH net-next 1/7] net: phy: mdio-bcm-unimac: factor busy polling loop

2017-07-31 Thread kbuild test robot

Hi Florian,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

Note: the 
linux-review/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847
 HEAD 68043f6ab1b54d29abcfc56fdec46d280b76 builds fine.
  It only hurts bisectability.

All errors (new ones prefixed by >>):

   drivers/net/phy/mdio-bcm-unimac.c: In function 'unimac_mdio_read':
>> drivers/net/phy/mdio-bcm-unimac.c:89:2: error: 'ret' undeclared (first use 
>> in this function)
 ret = unimac_mdio_poll(priv);
 ^
   drivers/net/phy/mdio-bcm-unimac.c:89:2: note: each undeclared identifier is 
reported only once for each function it appears in

vim +/ret +89 drivers/net/phy/mdio-bcm-unimac.c

76  
77  static int unimac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
78  {
79  struct unimac_mdio_priv *priv = bus->priv;
80  u32 cmd;
81  
82  /* Prepare the read operation */
83  cmd = MDIO_RD | (phy_id << MDIO_PMD_SHIFT) | (reg << 
MDIO_REG_SHIFT);
84  __raw_writel(cmd, priv->base + MDIO_CMD);
85  
86  /* Start MDIO transaction */
87  unimac_mdio_start(priv);
88  
  > 89  ret = unimac_mdio_poll(priv);
90  if (ret)
91  return ret;
92  
93  cmd = __raw_readl(priv->base + MDIO_CMD);
94  
95  /* Some broken devices are known not to release the line during
96   * turn-around, e.g: Broadcom BCM53125 external switches, so 
check for
97   * that condition here and ignore the MDIO controller read 
failure
98   * indication.
99   */
   100  if (!(bus->phy_ignore_ta_mask & 1 << phy_id) && (cmd & 
MDIO_READ_FAIL))
   101  return -EIO;
   102  
   103  return cmd & 0x;
   104  }
   105  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: [PATCH net v2] net: phy: Correctly process PHY_HALTED in phy_stop_machine()

2017-07-31 Thread David Miller

From: Florian Fainelli 
Date: Fri, 28 Jul 2017 11:58:36 -0700

> Marc reported that he was not getting the PHY library adjust_link()
> callback function to run when calling phy_stop() + phy_disconnect().
> Indeed this does not happen, because we set the state machine to
> PHY_HALTED but we never get to run it to process this state past that
> point.
> 
> Fix this with a synchronous call to phy_state_machine() in order to have
> the state machine actually act on PHY_HALTED, set the PHY device's link
> down, turn the network device's carrier off and finally call the
> adjust_link() function.
> 
> Reported-by: Marc Gonzalez 
> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
> Signed-off-by: Florian Fainelli 
> ---
> Changes in v2:
> 
> - reword subject and commit message based on changes
> - dropped flush_scheduled_work() since it is redundant

Applied and queued up for -stable, thanks.
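
The behaviour described in the quoted commit message can be illustrated with a toy state machine in user-space C. None of these names exist in phylib; they are assumptions chosen to show why merely recording the halted state is not enough, and why one synchronous pass of the state machine makes the link-down handling happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_HALTED };

struct toy_phy {
	enum toy_state state;
	bool link;
	bool carrier;
	int adjust_link_calls;   /* counts calls to the adjust_link() stand-in */
};

/* One pass of the toy state machine. */
static void toy_state_machine(struct toy_phy *phy)
{
	assert(phy != NULL);
	if (phy->state == TOY_HALTED && phy->link) {
		phy->link = false;          /* set the PHY link down */
		phy->carrier = false;       /* turn the carrier off */
		phy->adjust_link_calls++;   /* notify the MAC driver */
	}
}

/* Only records the new state; nothing has acted on it yet. */
static void toy_phy_stop(struct toy_phy *phy)
{
	phy->state = TOY_HALTED;
}

/* The idea of the fix: run the machine once, synchronously, so the
 * halted state is processed before the PHY is disconnected. */
static void toy_phy_stop_machine(struct toy_phy *phy)
{
	toy_state_machine(phy);
}
```

In this model a second call to toy_phy_stop_machine() is harmless, because the link is already down and the callback is not invoked again.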

[net-next:master 345/358] warning: (NET_DSA_BCM_SF2 && ..) selects MDIO_BCM_UNIMAC which has unmet direct dependencies (NETDEVICES && ..)

2017-07-31 Thread kbuild test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   6a95befc8d0346d6cb3b4646c761e8b42e66a4df
commit: 9a4e79697009ddd0d1af52053c830f6e60e1c771 [345/358] net: bcmgenet: 
utilize generic Broadcom UniMAC MDIO controller driver
config: x86_64-randconfig-x001-201731 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
git checkout 9a4e79697009ddd0d1af52053c830f6e60e1c771
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has unmet 
direct dependencies (NETDEVICES && MDIO_DEVICE && HAS_IOMEM && OF_MDIO)

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: [PATCH] hysdn: fix to a race condition in put_log_buffer

2017-07-31 Thread David Miller

From: Anton Volkov 
Date: Fri, 28 Jul 2017 17:53:51 +0300

> The synchronization type that was used earlier to guard the loop that
> deletes unused log buffers may have led to a situation that prevents
> any thread from going through the loop.
> 
> The patch deletes the previously used synchronization mechanism and moves
> the loop under the spin_lock so that similar cases won't be possible in
> the future.
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Anton Volkov 

This patch doesn't apply at all.

It's probably been corrupted by your email client.

Re: [PATCH net-next 1/2] tcp: extract the function to compute delivery rate

2017-07-31 Thread David Miller

From: Wei Wang 
Date: Fri, 28 Jul 2017 10:28:20 -0700

> From: Wei Wang 
> 
> Refactor the code to extract the function to compute delivery rate.
> This function will be used in a later commit.
> 
> Signed-off-by: Wei Wang 
> Acked-by: Yuchung Cheng 
> Acked-by: Soheil Hassas Yeganeh 

Applied.

Re: [PATCH net-next 2/2] tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS

2017-07-31 Thread David Miller

From: Wei Wang 
Date: Fri, 28 Jul 2017 10:28:21 -0700

> From: Wei Wang 
> 
> Add the following stats into SCM_TIMESTAMPING_OPT_STATS control msg:
> TCP_NLA_PACING_RATE
> TCP_NLA_DELIVERY_RATE
> TCP_NLA_SND_CWND
> TCP_NLA_REORDERING
> TCP_NLA_MIN_RTT
> TCP_NLA_RECUR_RETRANS
> TCP_NLA_DELIVERY_RATE_APP_LMT
> 
> Signed-off-by: Wei Wang 
> Acked-by: Yuchung Cheng 
> Acked-by: Soheil Hassas Yeganeh 

Applied.

Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT

2017-07-31 Thread David Ahern

On 7/30/17 9:31 PM, Xin Long wrote:
>> Did you look at removing this hunk from rt6_fill_node:
>>
>> if (rt->rt6i_flags & RTF_DYNAMIC)
>> rtm->rtm_protocol = RTPROT_REDIRECT;
>> else if (rt->rt6i_flags & RTF_ADDRCONF) {
>> if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
>> rtm->rtm_protocol = RTPROT_RA;
>> else
>> rtm->rtm_protocol = RTPROT_KERNEL;
>> }
> The issue seems to affect "ip -6 route flush all" as well, not only the
> cache, since the 'else if {}' also causes the rtm proto to differ from
> the rt6 proto.
> 
>>
>> And have rtm_protocol set properly on the route when it is installed?
> The code has not kept the rtm proto consistent with the rt6 proto since
> day 1; any idea why it didn't use the rt6 proto in the kernel properly?

No, AFAIK it was just an oversight when the original code was written. I
do not know of any reason that would prevent properly setting
rt6i_protocol in the route when it is allocated.

Something like this (not compiled, much less tested):

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4d30c96a819d..9a928839d247 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2347,6 +2347,7 @@ static void rt6_do_redirect(struct dst_entry *dst,
struct sock *sk, struct sk_bu
if (!nrt)
goto out;

+   nrt->rt6i_protocol = RTPROT_REDIRECT;
nrt->rt6i_flags = RTF_GATEWAY|RTF_UP|RTF_DYNAMIC|RTF_CACHE;
if (on_link)
nrt->rt6i_flags &= ~RTF_GATEWAY;
@@ -2461,6 +2462,7 @@ static struct rt6_info *rt6_add_route_info(struct
net *net,
.fc_dst_len = prefixlen,
.fc_flags   = RTF_GATEWAY | RTF_ADDRCONF |
RTF_ROUTEINFO |
  RTF_UP | RTF_PREF(pref),
+   .fc_protocol= RTPROT_RA,
.fc_nlinfo.portid = 0,
.fc_nlinfo.nlh = NULL,
.fc_nlinfo.nl_net = net,
@@ -2513,6 +2515,7 @@ struct rt6_info *rt6_add_dflt_router(const struct
in6_addr *gwaddr,
.fc_ifindex = dev->ifindex,
.fc_flags   = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT |
  RTF_UP | RTF_EXPIRES | RTF_PREF(pref),
+   .fc_protocol= RTPROT_RA,
.fc_nlinfo.portid = 0,
.fc_nlinfo.nlh = NULL,
.fc_nlinfo.nl_net = dev_net(dev),
@@ -3424,14 +3427,6 @@ static int rt6_fill_node(struct net *net,
rtm->rtm_flags = 0;
rtm->rtm_scope = RT_SCOPE_UNIVERSE;
rtm->rtm_protocol = rt->rt6i_protocol;
-   if (rt->rt6i_flags & RTF_DYNAMIC)
-   rtm->rtm_protocol = RTPROT_REDIRECT;
-   else if (rt->rt6i_flags & RTF_ADDRCONF) {
-   if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
-   rtm->rtm_protocol = RTPROT_RA;
-   else
-   rtm->rtm_protocol = RTPROT_KERNEL;
-   }

if (rt->rt6i_flags & RTF_CACHE)
rtm->rtm_flags |= RTM_F_CLONED;
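
The effect of the proposed change can be sketched in user-space C. The flag and protocol constants below are illustrative placeholders, not the kernel's definitions, and both functions are hypothetical; they only contrast "derive the reported protocol from the flags at dump time" (the hunk being removed) with "report the protocol stored when the route was installed".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative placeholder values, not the kernel's definitions. */
enum {
	TOY_RTF_DYNAMIC   = 1 << 0,
	TOY_RTF_ADDRCONF  = 1 << 1,
	TOY_RTF_DEFAULT   = 1 << 2,
	TOY_RTF_ROUTEINFO = 1 << 3,
};
enum {
	TOY_RTPROT_UNSPEC,
	TOY_RTPROT_REDIRECT,
	TOY_RTPROT_KERNEL,
	TOY_RTPROT_RA,
};

struct toy_rt6 {
	uint32_t flags;
	uint8_t protocol;   /* stands in for rt6i_protocol */
};

/* Behaviour of the removed hunk: flags override the stored protocol. */
static uint8_t toy_dump_proto_old(const struct toy_rt6 *rt)
{
	assert(rt != NULL);
	if (rt->flags & TOY_RTF_DYNAMIC)
		return TOY_RTPROT_REDIRECT;
	if (rt->flags & TOY_RTF_ADDRCONF)
		return (rt->flags & (TOY_RTF_DEFAULT | TOY_RTF_ROUTEINFO)) ?
			TOY_RTPROT_RA : TOY_RTPROT_KERNEL;
	return rt->protocol;
}

/* Behaviour after the diff: report what was stored at install time. */
static uint8_t toy_dump_proto_new(const struct toy_rt6 *rt)
{
	assert(rt != NULL);
	return rt->protocol;
}
```

With the old behaviour a filter on the reported protocol can disagree with the value stored in the route; once the protocol is set at allocation time, as the three `+` hunks above do, the two views agree.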

Re: Long stalls creating a new netns after a netns with a SMB client exits

2017-07-31 Thread David Ahern

On 7/31/17 4:01 PM, Cong Wang wrote:
>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>> index 3a19ea28339f..37db087b6c97 100644
>>> --- a/net/ipv4/tcp_ipv4.c
>>> +++ b/net/ipv4/tcp_ipv4.c
>>> @@ -1855,7 +1855,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const
>>> struct sk_buff *skb)
>>>  {
>>> struct dst_entry *dst = skb_dst(skb);
>>>
>>> -   if (dst && dst_hold_safe(dst)) {
>>> +   if (0 && dst && dst_hold_safe(dst)) {
>>> sk->sk_rx_dst = dst;
>>> inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
>>> }
>>
>>
>> This removes the 200s stall (the test is IPv4/TCP based)
> 
> 
> Interesting. This means we have a kernel socket which holds
> the dst refcnt.

Right now there is no tracking that I am aware of for a dst cached on
the socket (outside of walking all sockets). I have been bitten by it
several times in trying to make various changes. It's basically a hidden
reference for the device.

[PATCH net-next v2 0/3] netvsc: transparent SR-IOV VF support

2017-07-31 Thread Stephen Hemminger

This patch set changes how SR-IOV Virtual Function devices are managed
in the Hyper-V network driver. It was part of an earlier bundle, but
has now been updated.

Background

In Hyper-V, SR-IOV can be enabled (and disabled) by changing guest settings
on the host. When SR-IOV is enabled, a matching PCI device is hot plugged and
becomes visible in the guest. The VF device is an add-on to an existing netvsc
device, and has the same MAC address.

How is this different?

The original VF support relied on using the bonding driver in active
standby mode to handle the VF device.

With the new netvsc VF logic, the Linux Hyper-V virtual network
driver directly manages the link to the SR-IOV VF device.
When a VF device is detected (hot plug) it is automatically made a
slave device of the netvsc device. The VF device state reflects
the state of the netvsc device; i.e. if netvsc is set down, then
the VF is set down. If netvsc is set up, then the VF is brought up.
 
Packet flow is independent of VF status; all packets are sent and
received as if they were associated with the netvsc device. If VF is
removed or link is down then the synthetic VMBUS path is used.
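
The fallback rule in the paragraph above can be written down as a tiny user-space sketch. The types and the function are hypothetical and do not correspond to anything in netvsc_drv.c; they only encode "use the VF when it is present and its link is up, otherwise use the synthetic path".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_path { TOY_PATH_SYNTHETIC, TOY_PATH_VF };

struct toy_vf {
	bool link_up;
};

struct toy_netvsc {
	struct toy_vf *vf;   /* NULL when no VF is plugged in */
};

/* Pick the data path for the next packet. */
static enum toy_path toy_pick_path(const struct toy_netvsc *nv)
{
	assert(nv != NULL);
	if (nv->vf && nv->vf->link_up)
		return TOY_PATH_VF;
	return TOY_PATH_SYNTHETIC;   /* VF removed or link down */
}
```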
 
What was wrong with using the bonding script?

A lot of work went into getting the bonding script to work on all
distributions, but it was a major struggle. Linux network devices
can be configured many, many ways and there is no one solution from
userspace to make it all work. What is really hard is when
configuration is attached to the synthetic device during boot (eth0) and
then the same addresses and firewall rules need to also work later when
doing bonding. The new code gets around all of this.
 
How does VF work during initialization?

Since all packets are sent and received through the logical netvsc
device, initialization is much easier. Just configure the regular
netvsc Ethernet device; when/if SR-IOV is enabled it just
works. Provisioning and cloud init only need to worry about setting up
netvsc device (eth0). If SR-IOV is enabled (even as a later step), the
address and rules stay the same.
 
What devices show up?

Both netvsc and PCI devices are visible in the system. The netvsc
device is active and named in the usual manner (eth0). The PCI device is
visible to Linux and gets renamed by udev to a persistent name
(enP2p3s0). The PCI device name is now irrelevant.

The logic also sets the PCI VF device SLAVE flag on the network
device so network tools can see the relationship if they are smart
enough to understand how layered devices work.
 
This is a lot like how I see Windows working.
The VF device is visible in Device Manager, but is not configured.
 
Is there any performance impact?

There is no visible change in performance. The bonding
and netvsc driver both have equivalent steps.
 
Is it compatible with old bonding script?

It turns out that if you use the old bonding script, then everything
still works but in a suboptimal manner. What happens is that bonding
is unable to steal the VF from the netvsc device so it creates a
one-legged bond.  Packet flow then is:
bond0 <--> eth0 <--> VF (enP2p3s0).
In other words, if you get it wrong it still works, just
more awkwardly and more slowly.
 
What if I add address or firewall rule onto the VF?

The same problems occur now as already occur with bonding, bridging and
teaming on Linux if the user incorrectly applies configuration to
an underlying slave device. It will sort of work, packets will come in
and out but the Linux kernel gets confused and things like ARP don’t
work right.  There is no way to block manipulation of the slave
device, and I am sure someone will find some special use case where
they want it.


Stephen Hemminger (3):
  netvsc: transparent VF management
  netvsc: add documentation
  netvsc: remove bonding setup script

 Documentation/networking/netvsc.txt |  63 ++
 MAINTAINERS |   1 +
 drivers/net/hyperv/hyperv_net.h |  12 ++
 drivers/net/hyperv/netvsc_drv.c | 419 
 tools/hv/bondvf.sh  | 255 --
 5 files changed, 406 insertions(+), 344 deletions(-)
 create mode 100644 Documentation/networking/netvsc.txt
 delete mode 100755 tools/hv/bondvf.sh

-- 
2.11.0

[PATCH net-next v2 1/3] netvsc: transparent VF management

2017-07-31 Thread Stephen Hemminger

This patch implements transparent fail over from the synthetic NIC to
the SR-IOV virtual function NIC in a Hyper-V environment. It is a better
alternative to using bonding as is done now. Instead, the receive and
transmit fail over is done internally inside the driver.

Using the bonding driver has lots of issues because it depends on the
script being run early enough in the boot process and with sufficient
information to make the association. This patch moves all that
functionality into the kernel.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |  12 ++
 drivers/net/hyperv/netvsc_drv.c | 419 +++-
 2 files changed, 342 insertions(+), 89 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index f2cef5aaed1f..c701b059c5ac 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -680,6 +680,15 @@ struct netvsc_ethtool_stats {
unsigned long tx_busy;
 };
 
+struct netvsc_vf_pcpu_stats {
+   u64 rx_packets;
+   u64 rx_bytes;
+   u64 tx_packets;
+   u64 tx_bytes;
+   struct u64_stats_sync   syncp;
+   u32 tx_dropped;
+};
+
 struct netvsc_reconfig {
struct list_head list;
u32 event;
@@ -713,6 +722,9 @@ struct net_device_context {
 
/* State to manage the associated VF interface. */
struct net_device __rcu *vf_netdev;
+   struct netvsc_vf_pcpu_stats __percpu *vf_stats;
+   struct work_struct vf_takeover;
+   struct work_struct vf_notify;
 
/* 1: allocated, serial number is valid. 0: not allocated */
u32 vf_alloc;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8ff4cbf582cc..fef80dcbd71b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -71,6 +72,7 @@ static void netvsc_set_multicast_list(struct net_device *net)
 static int netvsc_open(struct net_device *net)
 {
struct net_device_context *ndev_ctx = netdev_priv(net);
+   struct net_device *vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
struct netvsc_device *nvdev = rtnl_dereference(ndev_ctx->nvdev);
struct rndis_device *rdev;
int ret = 0;
@@ -87,15 +89,29 @@ static int netvsc_open(struct net_device *net)
netif_tx_wake_all_queues(net);
 
rdev = nvdev->extension;
-   if (!rdev->link_state && !ndev_ctx->datapath)
+
+   if (!rdev->link_state)
netif_carrier_on(net);
 
-   return ret;
+   if (vf_netdev) {
+   /* Setting synthetic device up transparently sets
+* slave as up. If open fails, then slave will
+* still be offline (and not used).
+*/
+   ret = dev_open(vf_netdev);
+   if (ret)
+   netdev_warn(net,
+   "unable to open slave: %s: %d\n",
+   vf_netdev->name, ret);
+   }
+   return 0;
 }
 
 static int netvsc_close(struct net_device *net)
 {
struct net_device_context *net_device_ctx = netdev_priv(net);
+   struct net_device *vf_netdev
+   = rtnl_dereference(net_device_ctx->vf_netdev);
struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev);
int ret;
u32 aread, i, msec = 10, retry = 0, retry_max = 20;
@@ -141,6 +157,9 @@ static int netvsc_close(struct net_device *net)
ret = -ETIMEDOUT;
}
 
+   if (vf_netdev)
+   dev_close(vf_netdev);
+
return ret;
 }
 
@@ -224,13 +243,11 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev,
  *
  * TODO support XPS - but get_xps_queue not exported
  */
-static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
-   void *accel_priv, select_queue_fallback_t fallback)
+static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb)
 {
-   unsigned int num_tx_queues = ndev->real_num_tx_queues;
int q_idx = sk_tx_queue_get(skb->sk);
 
-   if (q_idx < 0 || skb->ooo_okay) {
+   if (q_idx < 0 || skb->ooo_okay || q_idx >= ndev->real_num_tx_queues) {
/* If forwarding a packet, we use the recorded queue when
 * available for better cache locality.
 */
@@ -240,12 +257,33 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
q_idx = netvsc_get_tx_queue(ndev, skb, q_idx);
}
 
-   while (unlikely(q_idx >= num_tx_queues))
-   q_idx -= num_tx_queues;
-
return q_idx;
 }
 
+static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
+  void *accel_priv,
+  select_queue_fallback_t fallback)
+{
+

[PATCH net-next v2 3/3] netvsc: remove bonding setup script

2017-07-31 Thread Stephen Hemminger

No longer needed, now all managed by transparent VF logic.

Signed-off-by: Stephen Hemminger 
---
 tools/hv/bondvf.sh | 255 -
 1 file changed, 255 deletions(-)
 delete mode 100755 tools/hv/bondvf.sh

diff --git a/tools/hv/bondvf.sh b/tools/hv/bondvf.sh
deleted file mode 100755
index 80f102860cf8..
--- a/tools/hv/bondvf.sh
+++ /dev/null
@@ -1,255 +0,0 @@
-#!/bin/bash
-
-# This example script creates bonding network devices based on synthetic NIC
-# (the virtual network adapter usually provided by Hyper-V) and the matching
-# VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can
-# function as one network device, and fail over to the synthetic NIC if VF is
-# down.
-#
-# Usage:
-# - After configured vSwitch and vNIC with SRIOV, start Linux virtual
-#   machine (VM)
-# - Run this scripts on the VM. It will create configuration files in
-#   distro specific directory.
-# - Reboot the VM, so that the bonding config are enabled.
-#
-# The config files are DHCP by default. You may edit them if you need to change
-# to Static IP or change other settings.
-#
-
-sysdir=/sys/class/net
-netvsc_cls={f8615163-df3e-46c5-913f-f2d2f965ed0e}
-bondcnt=0
-
-# Detect Distro
-if [ -f /etc/redhat-release ];
-then
-   cfgdir=/etc/sysconfig/network-scripts
-   distro=redhat
-elif grep -q 'Ubuntu' /etc/issue
-then
-   cfgdir=/etc/network
-   distro=ubuntu
-elif grep -q 'SUSE' /etc/issue
-then
-   cfgdir=/etc/sysconfig/network
-   distro=suse
-else
-   echo "Unsupported Distro"
-   exit 1
-fi
-
-echo Detected Distro: $distro, or compatible
-
-# Get a list of ethernet names
-list_eth=(`cd $sysdir && ls -d */ | cut -d/ -f1 | grep -v bond`)
-eth_cnt=${#list_eth[@]}
-
-echo List of net devices:
-
-# Get the MAC addresses
-for (( i=0; i < $eth_cnt; i++ ))
-do
-   list_mac[$i]=`cat $sysdir/${list_eth[$i]}/address`
-   echo ${list_eth[$i]}, ${list_mac[$i]}
-done
-
-# Find NIC with matching MAC
-for (( i=0; i < $eth_cnt-1; i++ ))
-do
-   for (( j=i+1; j < $eth_cnt; j++ ))
-   do
-   if [ "${list_mac[$i]}" = "${list_mac[$j]}" ]
-   then
-   list_match[$i]=${list_eth[$j]}
-   break
-   fi
-   done
-done
-
-function create_eth_cfg_redhat {
-   local fn=$cfgdir/ifcfg-$1
-
-   rm -f $fn
-   echo DEVICE=$1 >>$fn
-   echo TYPE=Ethernet >>$fn
-   echo BOOTPROTO=none >>$fn
-   echo UUID=`uuidgen` >>$fn
-   echo ONBOOT=yes >>$fn
-   echo PEERDNS=yes >>$fn
-   echo IPV6INIT=yes >>$fn
-   echo MASTER=$2 >>$fn
-   echo SLAVE=yes >>$fn
-}
-
-function create_eth_cfg_pri_redhat {
-   create_eth_cfg_redhat $1 $2
-}
-
-function create_bond_cfg_redhat {
-   local fn=$cfgdir/ifcfg-$1
-
-   rm -f $fn
-   echo DEVICE=$1 >>$fn
-   echo TYPE=Bond >>$fn
-   echo BOOTPROTO=dhcp >>$fn
-   echo UUID=`uuidgen` >>$fn
-   echo ONBOOT=yes >>$fn
-   echo PEERDNS=yes >>$fn
-   echo IPV6INIT=yes >>$fn
-   echo BONDING_MASTER=yes >>$fn
-   echo BONDING_OPTS=\"mode=active-backup miimon=100 primary=$2\" >>$fn
-}
-
-function del_eth_cfg_ubuntu {
-   local mainfn=$cfgdir/interfaces
-   local fnlist=( $mainfn )
-
-   local dirlist=(`awk '/^[ \t]*source/{print $2}' $mainfn`)
-
-   local i
-   for i in "${dirlist[@]}"
-   do
-   fnlist+=(`ls $i 2>/dev/null`)
-   done
-
-   local tmpfl=$(mktemp)
-
-   local nic_start='^[ \t]*(auto|iface|mapping|allow-.*)[ \t]+'$1
-   local nic_end='^[ \t]*(auto|iface|mapping|allow-.*|source)'
-
-   local fn
-   for fn in "${fnlist[@]}"
-   do
-   awk "/$nic_end/{x=0} x{next} /$nic_start/{x=1;next} 1" \
-   $fn >$tmpfl
-
-   cp $tmpfl $fn
-   done
-
-   rm $tmpfl
-}
-
-function create_eth_cfg_ubuntu {
-   local fn=$cfgdir/interfaces
-
-   del_eth_cfg_ubuntu $1
-   echo $'\n'auto $1 >>$fn
-   echo iface $1 inet manual >>$fn
-   echo bond-master $2 >>$fn
-}
-
-function create_eth_cfg_pri_ubuntu {
-   local fn=$cfgdir/interfaces
-
-   del_eth_cfg_ubuntu $1
-   echo $'\n'allow-hotplug $1 >>$fn
-   echo iface $1 inet manual >>$fn
-   echo bond-master $2 >>$fn
-   echo bond-primary $1 >>$fn
-}
-
-function create_bond_cfg_ubuntu {
-   local fn=$cfgdir/interfaces
-
-   del_eth_cfg_ubuntu $1
-
-   echo $'\n'auto $1 >>$fn
-   echo iface $1 inet dhcp >>$fn
-   echo bond-mode active-backup >>$fn
-   echo bond-miimon 100 >>$fn
-   echo bond-slaves none >>$fn
-}
-
-function create_eth_cfg_suse {
-local fn=$cfgdir/ifcfg-$1
-
-rm -f $fn
-   echo BOOTPROTO=none >>$fn
-   echo STARTMODE=auto >>$fn
-}
-
-function create_eth_cfg_pri_suse {
-   local fn=$cfgdir/ifcfg-$1
-
-   rm -f $fn
-   echo BOOTPROTO=none >>$fn
-   echo

[PATCH net-next v2 2/3] netvsc: add documentation

2017-07-31 Thread Stephen Hemminger

Add some background documentation on netvsc device options
and limitations.

Signed-off-by: Stephen Hemminger 
---
 Documentation/networking/netvsc.txt | 63 +
 MAINTAINERS |  1 +
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/networking/netvsc.txt

diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt
new file mode 100644
index ..4ddb4e4b0426
--- /dev/null
+++ b/Documentation/networking/netvsc.txt
@@ -0,0 +1,63 @@
+Hyper-V network driver
+======================
+
+Compatibility
+=============
+
+This driver is compatible with Windows Server 2012 R2, 2016 and
+Windows 10.
+
+Features
+========
+
+  Checksum offload
+  ----------------
+  The netvsc driver supports checksum offload as long as the
+  Hyper-V host version does. Windows Server 2016 and Azure
+  support checksum offload for TCP and UDP for both IPv4 and
+  IPv6. Windows Server 2012 only supports checksum offload for TCP.
+
+  Receive Side Scaling
+  --------------------
+  Hyper-V supports receive side scaling. For TCP, packets are
+  distributed among available queues based on IP address and port
+  number. Current versions of Hyper-V host, only distribute UDP
+  packets based on the IP source and destination address.
+  The port number is not used as part of the hash value for UDP.
+  Fragmented IP packets are not distributed between queues;
+  all fragmented packets arrive on the first channel.
+
+  Generic Receive Offload, aka GRO
+  --------------------------------
+  The driver supports GRO and it is enabled by default. GRO coalesces
+  like packets and significantly reduces CPU usage under heavy Rx
+  load.
+
+  SR-IOV support
+  --------------
+  Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
+  is enabled in both the vSwitch and the guest configuration, then the
+  Virtual Function (VF) device is passed to the guest as a PCI
+  device. In this case, both a synthetic (netvsc) and VF device are
+  visible in the guest OS and both NIC's have the same MAC address.
+
+  The VF is enslaved by netvsc device.  The netvsc driver will transparently
+  switch the data path to the VF when it is available and up.
+  Network state (addresses, firewall, etc) should be applied only to the
+  netvsc device; the slave device should not be accessed directly in
+  most cases.  The exceptions are if some special queue discipline or
+  flow direction is desired, these should be applied directly to the
+  VF slave device.
+
+  Receive Buffer
+  --------------
+  Packets are received into a receive area which is created when device
+  is probed. The receive area is broken into MTU sized chunks and each may
+  contain one or more packets. The number of receive sections may be changed
+  via ethtool Rx ring parameters.
+
+  There is a similar send buffer which is used to aggregate packets for sending.
+  The send area is broken into chunks of 6144 bytes, each of section may
+  contain one or more packets. The send buffer is an optimization, the driver
+  will use slower method to handle very large packets or if the send buffer
+  area is exhausted.
diff --git a/MAINTAINERS b/MAINTAINERS
index 297e610c9163..d30c17df1deb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6294,6 +6294,7 @@ M:Haiyang Zhang 
 M: Stephen Hemminger 
 L: de...@linuxdriverproject.org
 S: Maintained
+F: Documentation/networking/netvsc.txt
 F: arch/x86/include/asm/mshyperv.h
 F: arch/x86/include/uapi/asm/hyperv.h
 F: arch/x86/kernel/cpu/mshyperv.c
-- 
2.11.0
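
For illustration only, the Receive Side Scaling rules described in netvsc.txt above can be modelled in userspace C. The mixing function is a made-up FNV-style stand-in and is not the hash the Hyper-V host uses; `struct flow` and `rss_pick_queue` are hypothetical names. Only the choice of hash inputs follows the text: addresses and ports for TCP, addresses only for UDP, and fragments always on the first channel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum l4_proto { L4_TCP, L4_UDP };

/* Hypothetical flow description, not a driver structure. */
struct flow {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	enum l4_proto proto;
	bool fragment;
};

/* Stand-in mixing step (FNV-1a style); an assumption, not the host's hash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 16777619u;
	return h;
}

/* Pick a receive queue following the rules stated in the documentation. */
static unsigned int rss_pick_queue(const struct flow *f, unsigned int nqueues)
{
	uint32_t h = 2166136261u;

	assert(nqueues > 0);

	/* Fragmented packets are not distributed: first channel only. */
	if (f->fragment)
		return 0;

	h = mix(h, f->src_ip);
	h = mix(h, f->dst_ip);

	/* Ports contribute to the hash for TCP, but not for UDP. */
	if (f->proto == L4_TCP)
		h = mix(h, ((uint32_t)f->src_port << 16) | f->dst_port);

	return h % nqueues;
}
```

One consequence visible in the sketch: two UDP flows between the same pair of hosts always share a queue, whatever their port numbers are.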

Re: [PATCH net-next v12 0/4] net sched actions: improve dump performance

2017-07-31 Thread Stephen Hemminger

On Mon, 31 Jul 2017 08:06:42 -0400
Jamal Hadi Salim  wrote:

> On 17-07-30 10:28 PM, David Miller wrote:
> > 
> > Series applied, thanks.
> >   
> 
> Thanks David.
> 
> Attaching the iproute2 patch. I will submit an official one with
> man page  changes later. Stephen - you take net-next changes?
> 
> cheers,
> jamal

I will fix this up. The kernel headers for iproute2 come from sanitized
kernel headers (not a direct copy).

Re: [PATCH net-next 4/4] pci-hyperv: do not sleep in compose_msi_msg

2017-07-31 Thread Stephen Hemminger

On Mon, 31 Jul 2017 16:37:12 -0700
Stephen Hemminger  wrote:

> The setup of MSI with Hyper-V host was sleeping with locks held.
> This error is reported when doing SR-IOV hotplug with kernel built with lockdep.
> 
> BUG: sleeping function called from invalid context at kernel/sched/completion.c:93
> in_atomic(): 1, irqs_disabled(): 1, pid: 1405, name: ip
> 3 locks held by ip/1405:
>    #0:  (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1b/0x40
>    #1:  (>request_mutex){+.+...}, at: [] __setup_irq+0xb3/0x720
>    #2:  (_desc_lock_class){-.-...}, at: [] __setup_irq+0xe5/0x720
>    irq event stamp: 3476
>    hardirqs last  enabled at (3475): [] get_page_from_freelist+0x225/0xc90
>    hardirqs last disabled at (3476): [] _raw_spin_lock_irqsave+0x27/0x90
>    softirqs last  enabled at (2446): [] ixgbevf_configure+0x380/0x7c0 [ixgbevf]
>    softirqs last disabled at (2444): [] ixgbevf_configure+0x35d/0x7c0 [ixgbevf]
> The workaround is to poll for host response instead of blocking on
> completion.
> 
> Signed-off-by: Stephen Hemminger 

This patch is not directly network related. It needs to go through PCI.

I will resend the series.

Re: [PATCHv4 net] ipv6: no need to check rt->dst.error when get route info

2017-07-31 Thread David Miller

From: David Ahern 
Date: Mon, 31 Jul 2017 17:34:09 -0600

> On 7/31/17 5:22 PM, David Miller wrote:
>> From: Hangbin Liu 
>> Date: Fri, 28 Jul 2017 00:25:36 +0800
>> 
>>> After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib
>>> result when requested"), when we get a prohibit entry, we return
>>> -EACCES directly instead of dumping the route info.
>>>
>>> Fix it by removing the rt->dst.error check.
>> ...
>>> Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...")
>>> Signed-off-by: Hangbin Liu 
>> 
>> David A., where are we on this?
>> 
> 
> Dizzy from running in circles.

:-)

> Question I posed to you Saturday morning, 8:41 MDT [1]:
> 
> "... Roopa's fibmatch patches caused a change in user behavior in IPv6
> getroute for prohibit, blackhole and unreachable route entries. Opinions
> on whether we should limit that new behavior to just the fibmatch lookup
> in which case a patch is needed or take the new behavior and consistency
> in which case nothing is needed?"
> 
> Personally, after all the discussion I think the behavior as it is right
> now is best.
> 
> [1] https://www.spinics.net/lists/netdev/msg446571.html

I agree with you that we should keep the behavior as is.

[PATCH net-next 4/4] pci-hyperv: do not sleep in compose_msi_msg

2017-07-31 Thread Stephen Hemminger

MSI setup with the Hyper-V host was sleeping with locks held.
The following error is reported when doing SR-IOV hotplug on a kernel
built with lockdep.

BUG: sleeping function called from invalid context at kernel/sched/completion.c:93
in_atomic(): 1, irqs_disabled(): 1, pid: 1405, name: ip
3 locks held by ip/1405:
   #0:  (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1b/0x40
   #1:  (>request_mutex){+.+...}, at: [] __setup_irq+0xb3/0x720
   #2:  (_desc_lock_class){-.-...}, at: [] __setup_irq+0xe5/0x720
   irq event stamp: 3476
   hardirqs last  enabled at (3475): [] get_page_from_freelist+0x225/0xc90
   hardirqs last disabled at (3476): [] _raw_spin_lock_irqsave+0x27/0x90
   softirqs last  enabled at (2446): [] ixgbevf_configure+0x380/0x7c0 [ixgbevf]
   softirqs last disabled at (2444): [] ixgbevf_configure+0x35d/0x7c0 [ixgbevf]

The workaround is to poll for host response instead of blocking on
completion.

Signed-off-by: Stephen Hemminger 
---
 drivers/pci/host/pci-hyperv.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 415dcc69a502..334c9a7b8991 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1159,7 +1160,12 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
goto free_int_desc;
}
 
-   wait_for_completion(&comp.comp_pkt.host_event);
+   /*
+* Since this function is called with IRQ locks held, can't
+* do normal wait for completion; instead poll.
+*/
+   while (!try_wait_for_completion(&comp.comp_pkt.host_event))
+   udelay(100);
 
if (comp.comp_pkt.completion_status < 0) {
dev_err(>hdev->device,
-- 
2.11.0
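
The poll-instead-of-block pattern in the hunk above can be modelled outside the kernel. This is only a userspace sketch using C11 atomics: `struct completion_sketch`, `try_wait`, `complete_one`, `poll_for_completion` and `fake_udelay` are hypothetical names, and the callback stands in for `udelay(100)` (here it also plays the host, which "answers" during the third delay). It shows why polling is safe in atomic context: the waiter never sleeps, it only re-checks a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a completion: a counter of pending "done" events. */
struct completion_sketch {
	atomic_uint done;
};

/* Signal one completion. */
static void complete_one(struct completion_sketch *c)
{
	atomic_fetch_add(&c->done, 1);
}

/* Non-blocking check: consume one event if available, never sleep. */
static bool try_wait(struct completion_sketch *c)
{
	unsigned int n = atomic_load(&c->done);

	while (n > 0) {
		/* On failure n is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->done, &n, n - 1))
			return true;
	}
	return false;
}

/* Busy-poll until completed; returns how many delays were needed. */
static unsigned int poll_for_completion(struct completion_sketch *c,
					void (*delay)(struct completion_sketch *))
{
	unsigned int spins = 0;

	assert(delay != NULL);
	while (!try_wait(c)) {
		delay(c);
		spins++;
	}
	return spins;
}

/* Test double: the pretend host responds during the third delay. */
static unsigned int fake_delays;

static void fake_udelay(struct completion_sketch *c)
{
	if (++fake_delays == 3)
		complete_one(c);
}
```

The trade-off is the usual one for busy-waiting: the CPU spins for as long as the host takes to respond, which is acceptable only because the caller cannot sleep.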

[PATCH net-next 1/4] netvsc: transparent VF management

2017-07-31 Thread Stephen Hemminger

This patch implements transparent fail over from the synthetic NIC to
the SR-IOV virtual function NIC in a Hyper-V environment. It is a better
alternative to using bonding as is done now. Instead, the receive and
transmit fail over is done internally inside the driver.

Using the bonding driver has lots of issues because it depends on the
script being run early enough in the boot process and with sufficient
information to make the association. This patch moves all that
functionality into the kernel.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |  12 ++
 drivers/net/hyperv/netvsc_drv.c | 419 +++-
 2 files changed, 342 insertions(+), 89 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index f2cef5aaed1f..c701b059c5ac 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -680,6 +680,15 @@ struct netvsc_ethtool_stats {
unsigned long tx_busy;
 };
 
+struct netvsc_vf_pcpu_stats {
+   u64 rx_packets;
+   u64 rx_bytes;
+   u64 tx_packets;
+   u64 tx_bytes;
+   struct u64_stats_sync   syncp;
+   u32 tx_dropped;
+};
+
 struct netvsc_reconfig {
struct list_head list;
u32 event;
@@ -713,6 +722,9 @@ struct net_device_context {
 
/* State to manage the associated VF interface. */
struct net_device __rcu *vf_netdev;
+   struct netvsc_vf_pcpu_stats __percpu *vf_stats;
+   struct work_struct vf_takeover;
+   struct work_struct vf_notify;
 
/* 1: allocated, serial number is valid. 0: not allocated */
u32 vf_alloc;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8ff4cbf582cc..fef80dcbd71b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -71,6 +72,7 @@ static void netvsc_set_multicast_list(struct net_device *net)
 static int netvsc_open(struct net_device *net)
 {
struct net_device_context *ndev_ctx = netdev_priv(net);
+   struct net_device *vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
struct netvsc_device *nvdev = rtnl_dereference(ndev_ctx->nvdev);
struct rndis_device *rdev;
int ret = 0;
@@ -87,15 +89,29 @@ static int netvsc_open(struct net_device *net)
netif_tx_wake_all_queues(net);
 
rdev = nvdev->extension;
-   if (!rdev->link_state && !ndev_ctx->datapath)
+
+   if (!rdev->link_state)
netif_carrier_on(net);
 
-   return ret;
+   if (vf_netdev) {
+   /* Setting synthetic device up transparently sets
+* slave as up. If open fails, then slave will be
+* still be offline (and not used).
+*/
+   ret = dev_open(vf_netdev);
+   if (ret)
+   netdev_warn(net,
+   "unable to open slave: %s: %d\n",
+   vf_netdev->name, ret);
+   }
+   return 0;
 }
 
 static int netvsc_close(struct net_device *net)
 {
struct net_device_context *net_device_ctx = netdev_priv(net);
+   struct net_device *vf_netdev
+   = rtnl_dereference(net_device_ctx->vf_netdev);
struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev);
int ret;
u32 aread, i, msec = 10, retry = 0, retry_max = 20;
@@ -141,6 +157,9 @@ static int netvsc_close(struct net_device *net)
ret = -ETIMEDOUT;
}
 
+   if (vf_netdev)
+   dev_close(vf_netdev);
+
return ret;
 }
 
@@ -224,13 +243,11 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev,
  *
  * TODO support XPS - but get_xps_queue not exported
  */
-static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
-   void *accel_priv, select_queue_fallback_t fallback)
+static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb)
 {
-   unsigned int num_tx_queues = ndev->real_num_tx_queues;
int q_idx = sk_tx_queue_get(skb->sk);
 
-   if (q_idx < 0 || skb->ooo_okay) {
+   if (q_idx < 0 || skb->ooo_okay || q_idx >= ndev->real_num_tx_queues) {
/* If forwarding a packet, we use the recorded queue when
 * available for better cache locality.
 */
@@ -240,12 +257,33 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
q_idx = netvsc_get_tx_queue(ndev, skb, q_idx);
}
 
-   while (unlikely(q_idx >= num_tx_queues))
-   q_idx -= num_tx_queues;
-
return q_idx;
 }
 
+static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
+  void *accel_priv,
+  select_queue_fallback_t fallback)
+{
+
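
The diff is cut off here in the archive. As a rough userspace illustration of the queue-index change visible in the hunks above (not the driver code itself): a cached index that is out of range is now treated like a missing one and a fresh queue is picked, where the removed loop folded it back into range by repeated subtraction. `pick_tx_sketch` and its `flow_hash` parameter are hypothetical; the modulo stands in for the driver's `netvsc_get_tx_queue()` lookup, and the forwarding case mentioned in the comment is left out.

```c
#include <assert.h>

/*
 * cached < 0 means the socket has no recorded queue.
 * flow_hash % real_num_tx_queues is a stand-in for the real lookup.
 */
static unsigned int pick_tx_sketch(int cached, int ooo_okay,
				   unsigned int real_num_tx_queues,
				   unsigned int flow_hash)
{
	assert(real_num_tx_queues > 0);

	/* No usable cached queue: choose a new one from the flow hash. */
	if (cached < 0 || ooo_okay ||
	    (unsigned int)cached >= real_num_tx_queues)
		return flow_hash % real_num_tx_queues;

	/* Cached queue is valid: keep using it. */
	return (unsigned int)cached;
}
```

With four queues, a stale cached index of 5 is re-picked from the hash; the removed loop would have turned it into queue 1 regardless of the flow.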

[PATCH net-next 3/4] netvsc: remove bonding setup script

2017-07-31 Thread Stephen Hemminger

No longer needed, now all managed by transparent VF logic.

Signed-off-by: Stephen Hemminger 
---
 tools/hv/bondvf.sh | 255 -
 1 file changed, 255 deletions(-)
 delete mode 100755 tools/hv/bondvf.sh

diff --git a/tools/hv/bondvf.sh b/tools/hv/bondvf.sh
deleted file mode 100755
index 80f102860cf8..
--- a/tools/hv/bondvf.sh
+++ /dev/null
@@ -1,255 +0,0 @@
-#!/bin/bash
-
-# This example script creates bonding network devices based on synthetic NIC
-# (the virtual network adapter usually provided by Hyper-V) and the matching
-# VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can
-# function as one network device, and fail over to the synthetic NIC if VF is
-# down.
-#
-# Usage:
-# - After configured vSwitch and vNIC with SRIOV, start Linux virtual
-#   machine (VM)
-# - Run this scripts on the VM. It will create configuration files in
-#   distro specific directory.
-# - Reboot the VM, so that the bonding config are enabled.
-#
-# The config files are DHCP by default. You may edit them if you need to change
-# to Static IP or change other settings.
-#
-
-sysdir=/sys/class/net
-netvsc_cls={f8615163-df3e-46c5-913f-f2d2f965ed0e}
-bondcnt=0
-
-# Detect Distro
-if [ -f /etc/redhat-release ];
-then
-   cfgdir=/etc/sysconfig/network-scripts
-   distro=redhat
-elif grep -q 'Ubuntu' /etc/issue
-then
-   cfgdir=/etc/network
-   distro=ubuntu
-elif grep -q 'SUSE' /etc/issue
-then
-   cfgdir=/etc/sysconfig/network
-   distro=suse
-else
-   echo "Unsupported Distro"
-   exit 1
-fi
-
-echo Detected Distro: $distro, or compatible
-
-# Get a list of ethernet names
-list_eth=(`cd $sysdir && ls -d */ | cut -d/ -f1 | grep -v bond`)
-eth_cnt=${#list_eth[@]}
-
-echo List of net devices:
-
-# Get the MAC addresses
-for (( i=0; i < $eth_cnt; i++ ))
-do
-   list_mac[$i]=`cat $sysdir/${list_eth[$i]}/address`
-   echo ${list_eth[$i]}, ${list_mac[$i]}
-done
-
-# Find NIC with matching MAC
-for (( i=0; i < $eth_cnt-1; i++ ))
-do
-   for (( j=i+1; j < $eth_cnt; j++ ))
-   do
-   if [ "${list_mac[$i]}" = "${list_mac[$j]}" ]
-   then
-   list_match[$i]=${list_eth[$j]}
-   break
-   fi
-   done
-done
-
-function create_eth_cfg_redhat {
-   local fn=$cfgdir/ifcfg-$1
-
-   rm -f $fn
-   echo DEVICE=$1 >>$fn
-   echo TYPE=Ethernet >>$fn
-   echo BOOTPROTO=none >>$fn
-   echo UUID=`uuidgen` >>$fn
-   echo ONBOOT=yes >>$fn
-   echo PEERDNS=yes >>$fn
-   echo IPV6INIT=yes >>$fn
-   echo MASTER=$2 >>$fn
-   echo SLAVE=yes >>$fn
-}
-
-function create_eth_cfg_pri_redhat {
-   create_eth_cfg_redhat $1 $2
-}
-
-function create_bond_cfg_redhat {
-   local fn=$cfgdir/ifcfg-$1
-
-   rm -f $fn
-   echo DEVICE=$1 >>$fn
-   echo TYPE=Bond >>$fn
-   echo BOOTPROTO=dhcp >>$fn
-   echo UUID=`uuidgen` >>$fn
-   echo ONBOOT=yes >>$fn
-   echo PEERDNS=yes >>$fn
-   echo IPV6INIT=yes >>$fn
-   echo BONDING_MASTER=yes >>$fn
-   echo BONDING_OPTS=\"mode=active-backup miimon=100 primary=$2\" >>$fn
-}
-
-function del_eth_cfg_ubuntu {
-   local mainfn=$cfgdir/interfaces
-   local fnlist=( $mainfn )
-
-   local dirlist=(`awk '/^[ \t]*source/{print $2}' $mainfn`)
-
-   local i
-   for i in "${dirlist[@]}"
-   do
-   fnlist+=(`ls $i 2>/dev/null`)
-   done
-
-   local tmpfl=$(mktemp)
-
-   local nic_start='^[ \t]*(auto|iface|mapping|allow-.*)[ \t]+'$1
-   local nic_end='^[ \t]*(auto|iface|mapping|allow-.*|source)'
-
-   local fn
-   for fn in "${fnlist[@]}"
-   do
-   awk "/$nic_end/{x=0} x{next} /$nic_start/{x=1;next} 1" \
-   $fn >$tmpfl
-
-   cp $tmpfl $fn
-   done
-
-   rm $tmpfl
-}
-
-function create_eth_cfg_ubuntu {
-   local fn=$cfgdir/interfaces
-
-   del_eth_cfg_ubuntu $1
-   echo $'\n'auto $1 >>$fn
-   echo iface $1 inet manual >>$fn
-   echo bond-master $2 >>$fn
-}
-
-function create_eth_cfg_pri_ubuntu {
-   local fn=$cfgdir/interfaces
-
-   del_eth_cfg_ubuntu $1
-   echo $'\n'allow-hotplug $1 >>$fn
-   echo iface $1 inet manual >>$fn
-   echo bond-master $2 >>$fn
-   echo bond-primary $1 >>$fn
-}
-
-function create_bond_cfg_ubuntu {
-   local fn=$cfgdir/interfaces
-
-   del_eth_cfg_ubuntu $1
-
-   echo $'\n'auto $1 >>$fn
-   echo iface $1 inet dhcp >>$fn
-   echo bond-mode active-backup >>$fn
-   echo bond-miimon 100 >>$fn
-   echo bond-slaves none >>$fn
-}
-
-function create_eth_cfg_suse {
-local fn=$cfgdir/ifcfg-$1
-
-rm -f $fn
-   echo BOOTPROTO=none >>$fn
-   echo STARTMODE=auto >>$fn
-}
-
-function create_eth_cfg_pri_suse {
-   local fn=$cfgdir/ifcfg-$1
-
-   rm -f $fn
-   echo BOOTPROTO=none >>$fn
-   echo

[PATCH net-next 0/4] netvsc: transparent SR-IOV VF support

2017-07-31 Thread Stephen Hemminger

This patch set changes how SR-IOV Virtual Function devices are managed
in the Hyper-V network driver. It was part of an earlier bundle, but
is now updated.

Background

In Hyper-V, SR-IOV can be enabled (and disabled) by changing guest settings
on the host. When SR-IOV is enabled, a matching PCI device is hot plugged and
becomes visible in the guest. The VF device is an add-on to an existing netvsc
device, and has the same MAC address.

How is this different?

The original support of VF relied on using the bonding driver in active
standby mode to handle the VF device.

With the new netvsc VF logic, the Linux Hyper-V virtual network
driver directly manages the link to the SR-IOV VF device.
When a VF device is detected (hot plug), it is automatically made a
slave device of the netvsc device. The VF device state reflects
the state of the netvsc device; i.e. if netvsc is set down, then
VF is set down. If netvsc is set up, then VF is brought up.
 
Packet flow is independent of VF status; all packets are sent and
received as if they were associated with the netvsc device. If VF is
removed or link is down then the synthetic VMBUS path is used.
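
The rule described here (VF state follows netvsc, and traffic falls back to the synthetic path whenever the VF is absent or down) can be written as a tiny state model. This is only a sketch for illustration: `struct netvsc_sketch`, `struct vf_sketch`, `netvsc_set_admin` and `select_datapath` are hypothetical names and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum datapath { DATAPATH_SYNTHETIC, DATAPATH_VF };

/* Hypothetical model of the VF slave device. */
struct vf_sketch {
	bool link_up;   /* carrier on the VF */
	bool admin_up;  /* mirrors the netvsc admin state */
};

/* Hypothetical model of the netvsc device; vf is NULL when no VF exists. */
struct netvsc_sketch {
	bool admin_up;
	struct vf_sketch *vf;
};

/* Setting netvsc up or down sets the VF the same way. */
static void netvsc_set_admin(struct netvsc_sketch *n, bool up)
{
	assert(n != NULL);
	n->admin_up = up;
	if (n->vf)
		n->vf->admin_up = up;
}

/* Use the VF only when it is present, up, and has link. */
static enum datapath select_datapath(const struct netvsc_sketch *n)
{
	assert(n != NULL);
	if (n->vf && n->vf->admin_up && n->vf->link_up)
		return DATAPATH_VF;
	return DATAPATH_SYNTHETIC;
}
```

Either way the packets are accounted to the netvsc device, so addresses and firewall rules configured on it do not need to change when the data path switches.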
 
What was wrong with using the bonding script?

A lot of work went into getting the bonding script to work on all
distributions, but it was a major struggle. Linux network devices
can be configured many, many ways and there is no one solution from
userspace to make it all work. The really hard case is when
configuration is attached to the synthetic device during boot (eth0) and
then the same addresses and firewall rules also need to work later
when bonding is set up. The new code gets around all of this.
 
How does VF work during initialization?

Since all packets are sent and received through the logical netvsc
device, initialization is much easier. Just configure the regular
netvsc Ethernet device; when/if SR-IOV is enabled it just
works. Provisioning and cloud init only need to worry about setting up
netvsc device (eth0). If SR-IOV is enabled (even as a later step), the
address and rules stay the same.
 
What devices show up?

Both netvsc and PCI devices are visible in the system. The netvsc
device is active and named in the usual manner (eth0). The PCI device is
visible to Linux and gets renamed by udev to a persistent name
(enP2p3s0). The PCI device name is now irrelevant.

The logic also sets the PCI VF device SLAVE flag on the network
device so network tools can see the relationship if they are smart
enough to understand how layered devices work.
 
This is a lot like how I see Windows working.
The VF device is visible in Device Manager, but is not configured.
 
Is there any performance impact?

There is no visible change in performance. The bonding
and netvsc drivers both have equivalent steps.
 
Is it compatible with the old bonding script?

It turns out that if you use the old bonding script, then everything
still works but in a suboptimal manner. What happens is that bonding
is unable to steal the VF from the netvsc device so it creates a
one-legged bond.  Packet flow then is:
bond0 <--> eth0 <--> VF (enP2p3s0).
In other words, if you get it wrong it still works, just
awkwardly and more slowly.
 
What if I add an address or firewall rule onto the VF?

The same problems occur now as already occur with bonding, bridging,
and teaming on Linux if the user incorrectly applies configuration to
an underlying slave device. It will sort of work: packets will come in
and out, but the Linux kernel gets confused and things like ARP don't
work right.  There is no way to block manipulation of the slave
device, and I am sure someone will find some special use case where
they want it.


Stephen Hemminger (4):
  netvsc: transparent VF management
  netvsc: add documentation
  netvsc: remove bonding setup script
  pci-hyperv: do not sleep in compose_msi_msg

 Documentation/networking/netvsc.txt |  63 ++
 MAINTAINERS |   1 +
 drivers/net/hyperv/hyperv_net.h |  12 ++
 drivers/net/hyperv/netvsc_drv.c | 419 
 drivers/pci/host/pci-hyperv.c   |   8 +-
 tools/hv/bondvf.sh  | 255 --
 6 files changed, 413 insertions(+), 345 deletions(-)
 create mode 100644 Documentation/networking/netvsc.txt
 delete mode 100755 tools/hv/bondvf.sh

-- 
2.11.0

[PATCH net-next 2/4] netvsc: add documentation

2017-07-31 Thread Stephen Hemminger

Add some background documentation on netvsc device options
and limitations.

Signed-off-by: Stephen Hemminger 
---
 Documentation/networking/netvsc.txt | 63 +
 MAINTAINERS |  1 +
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/networking/netvsc.txt

diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt
new file mode 100644
index ..4ddb4e4b0426
--- /dev/null
+++ b/Documentation/networking/netvsc.txt
@@ -0,0 +1,63 @@
+Hyper-V network driver
+======================
+
+Compatibility
+=============
+
+This driver is compatible with Windows Server 2012 R2, 2016 and
+Windows 10.
+
+Features
+========
+
+  Checksum offload
+  ----------------
+  The netvsc driver supports checksum offload as long as the
+  Hyper-V host version does. Windows Server 2016 and Azure
+  support checksum offload for TCP and UDP for both IPv4 and
+  IPv6. Windows Server 2012 only supports checksum offload for TCP.
+
+  Receive Side Scaling
+  --------------------
+  Hyper-V supports receive side scaling. For TCP, packets are
+  distributed among available queues based on IP address and port
+  number. Current versions of Hyper-V host, only distribute UDP
+  packets based on the IP source and destination address.
+  The port number is not used as part of the hash value for UDP.
+  Fragmented IP packets are not distributed between queues;
+  all fragmented packets arrive on the first channel.
+
+  Generic Receive Offload, aka GRO
+  --------------------------------
+  The driver supports GRO and it is enabled by default. GRO coalesces
+  like packets and significantly reduces CPU usage under heavy Rx
+  load.
+
+  SR-IOV support
+  --------------
+  Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
+  is enabled in both the vSwitch and the guest configuration, then the
+  Virtual Function (VF) device is passed to the guest as a PCI
+  device. In this case, both a synthetic (netvsc) and VF device are
+  visible in the guest OS and both NIC's have the same MAC address.
+
+  The VF is enslaved by netvsc device.  The netvsc driver will transparently
+  switch the data path to the VF when it is available and up.
+  Network state (addresses, firewall, etc) should be applied only to the
+  netvsc device; the slave device should not be accessed directly in
+  most cases.  The exceptions are if some special queue discipline or
+  flow direction is desired, these should be applied directly to the
+  VF slave device.
+
+  Receive Buffer
+  --------------
+  Packets are received into a receive area which is created when device
+  is probed. The receive area is broken into MTU sized chunks and each may
+  contain one or more packets. The number of receive sections may be changed
+  via ethtool Rx ring parameters.
+
+  There is a similar send buffer which is used to aggregate packets for sending.
+  The send area is broken into chunks of 6144 bytes, each of section may
+  contain one or more packets. The send buffer is an optimization, the driver
+  will use slower method to handle very large packets or if the send buffer
+  area is exhausted.
diff --git a/MAINTAINERS b/MAINTAINERS
index 297e610c9163..d30c17df1deb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6294,6 +6294,7 @@ M:Haiyang Zhang 
 M: Stephen Hemminger 
 L: de...@linuxdriverproject.org
 S: Maintained
+F: Documentation/networking/netvsc.txt
 F: arch/x86/include/asm/mshyperv.h
 F: arch/x86/include/uapi/asm/hyperv.h
 F: arch/x86/kernel/cpu/mshyperv.c
-- 
2.11.0
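
As a small worked example of the send buffer paragraph in netvsc.txt above: a packet is copied into a send section only if it fits and a section is free; otherwise the slower method is used. Only the 6144-byte section size comes from the text. `choose_tx_path`, the enum, and the `free_sections` parameter are hypothetical, and the sketch ignores aggregation of several packets into one section.

```c
#include <assert.h>

#define SEND_SECTION_SIZE 6144u	/* chunk size stated in netvsc.txt */

enum tx_path {
	TX_COPY_TO_SECTION,	/* fast path: copy into a send buffer section */
	TX_SLOW_PATH		/* fallback for large packets or no free section */
};

/* Decide how a packet of pkt_len bytes would be sent (illustrative only). */
static enum tx_path choose_tx_path(unsigned int pkt_len,
				   unsigned int free_sections)
{
	assert(pkt_len > 0);

	if (pkt_len > SEND_SECTION_SIZE || free_sections == 0)
		return TX_SLOW_PATH;
	return TX_COPY_TO_SECTION;
}
```

So a 1500-byte frame takes the fast path while sections remain, and a 9000-byte frame never does.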

Re: [PATCHv4 net] ipv6: no need to check rt->dst.error when get route info

2017-07-31 Thread David Ahern

On 7/31/17 5:22 PM, David Miller wrote:
> From: Hangbin Liu 
> Date: Fri, 28 Jul 2017 00:25:36 +0800
> 
>> After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib
>> result when requested"), when we get a prohibit entry, we return
>> -EACCES directly instead of dumping the route info.
>>
>> Fix it by removing the rt->dst.error check.
> ...
>> Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...")
>> Signed-off-by: Hangbin Liu 
> 
> David A., where are we on this?
> 

Dizzy from running in circles.

Question I posed to you Saturday morning, 8:41 MDT [1]:

"... Roopa's fibmatch patches caused a change in user behavior in IPv6
getroute for prohibit, blackhole and unreachable route entries. Opinions
on whether we should limit that new behavior to just the fibmatch lookup
in which case a patch is needed or take the new behavior and consistency
in which case nothing is needed?"

Personally, after all the discussion I think the behavior as it is right
now is best.

[1] https://www.spinics.net/lists/netdev/msg446571.html

Re: [PATCHv4 net] ipv6: no need to check rt->dst.error when get route info

2017-07-31 Thread David Miller

From: Hangbin Liu 
Date: Fri, 28 Jul 2017 00:25:36 +0800

> After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib
> result when requested"). When we get a prohibit entry, we will return
> -EACCES directly instead of dump route info.
> 
> Fix it by remove the rt->dst.error check.
...
> Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...")
> Signed-off-by: Hangbin Liu 

David A., where are we on this?

Re: [PATCH v2] net: phy: Log only PHY state transitions

2017-07-31 Thread David Miller

From: Marc Gonzalez 
Date: Fri, 28 Jul 2017 13:18:30 +0200

> In the current code, old and new PHY states are always logged.
> From now on, log only PHY state transitions.
> 
> Signed-off-by: Marc Gonzalez 

Applied to net-next, thanks.

Re: [RFC net-next] net ipv6: convert fib6_table rwlock to a percpu lock

2017-07-31 Thread Stephen Hemminger

On Mon, 31 Jul 2017 10:18:57 -0700
Shaohua Li  wrote:

> From: Shaohua Li 
> 
> In a syn flooding test, the fib6_table rwlock is a significant
> bottleneck. While converting the rwlock to rcu sounds straightforward,
> it is very challenging, if it's possible at all. A percpu spinlock is quite
> trivial for this problem since updating the routing table is a rare
> event. In my test, the server receives around 1.5 Mpps in syn flooding
> test without the patch in a dual sockets and 56-CPU system. With the
> patch, the server receives around 3.8Mpps, and perf report doesn't show
> the locking issue.
> 
> Cc: Wei Wang 

You just reinvented brlock...

RCU is not that hard, why not do it right?

Re: [PATCH V5 1/2] firmware: add more flexible request_firmware_async function

2017-07-31 Thread kbuild test robot

Hi Rafał,

[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v4.13-rc3 next-20170731]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Rafa-Mi-ecki/firmware-add-more-flexible-request_firmware_async-function/20170801-033319
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   WARNING: convert(1) not found, for SVG to PDF conversion install ImageMagick (https://www.imagemagick.org)
   include/linux/init.h:1: warning: no structured comments found
   include/linux/mod_devicetable.h:687: warning: Excess struct/union/enum/typedef member 'ver_major' description in 'fsl_mc_device_id'
   include/linux/mod_devicetable.h:687: warning: Excess struct/union/enum/typedef member 'ver_minor' description in 'fsl_mc_device_id'
   kernel/sched/core.c:2080: warning: No description found for parameter 'rf'
   kernel/sched/core.c:2080: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local'
   include/linux/wait.h:555: warning: No description found for parameter 'wq'
   include/linux/wait.h:555: warning: Excess function parameter 'wq_head' description in 'wait_event_interruptible_hrtimeout'
   include/linux/wait.h:759: warning: No description found for parameter 'wq_head'
   include/linux/wait.h:759: warning: Excess function parameter 'wq' description in 'wait_event_killable'
   include/linux/kthread.h:26: warning: Excess function parameter '...' description in 'kthread_create'
   kernel/sys.c:1: warning: no structured comments found
   include/linux/device.h:968: warning: No description found for parameter 'dma_ops'
   drivers/dma-buf/seqno-fence.c:1: warning: no structured comments found
>> drivers/base/firmware_class.c:1: warning: no structured comments found
   include/linux/iio/iio.h:603: warning: No description found for parameter 'trig_readonly'
   include/linux/iio/trigger.h:151: warning: No description found for parameter 'indio_dev'
   include/linux/iio/trigger.h:151: warning: No description found for parameter 'trig'
   include/linux/device.h:969: warning: No description found for parameter 'dma_ops'
   drivers/ata/libata-eh.c:1449: warning: No description found for parameter 'link'
   drivers/ata/libata-eh.c:1449: warning: Excess function parameter 'ap' description in 'ata_eh_done'
   drivers/ata/libata-eh.c:1590: warning: No description found for parameter 'qc'
   drivers/ata/libata-eh.c:1590: warning: Excess function parameter 'dev' description in 'ata_eh_request_sense'
   drivers/mtd/nand/nand_base.c:2751: warning: Excess function parameter 'cached' description in 'nand_write_page'
   drivers/mtd/nand/nand_base.c:2751: warning: Excess function parameter 'cached' description in 'nand_write_page'
   arch/s390/include/asm/cmb.h:1: warning: no structured comments found
   drivers/scsi/scsi_lib.c:1116: warning: No description found for parameter 'rq'
   drivers/scsi/constants.c:1: warning: no structured comments found
   include/linux/usb/gadget.h:230: warning: No description found for parameter 'claimed'
   include/linux/usb/gadget.h:230: warning: No description found for parameter 'enabled'
   include/linux/usb/gadget.h:412: warning: No description found for parameter 'quirk_altset_not_supp'
   include/linux/usb/gadget.h:412: warning: No description found for parameter 'quirk_stall_not_supp'
   include/linux/usb/gadget.h:412: warning: No description found for parameter 'quirk_zlp_not_supp'
   fs/inode.c:1666: warning: No description found for parameter 'rcu'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_transaction'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_next_transaction'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_list'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_vfs_inode'
   include/linux/jbd2.h:443: warning: No description found for parameter 'i_flags'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_rsv_handle'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_reserved'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_type'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_line_no'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_start_jiffies'
   include/linux/jbd2.h:497: warning: No description found for parameter 'h_requested_credits'
   include/linux/jbd2.h:497: warning: No description found for parameter 'saved_alloc_context'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_chkpt_bhs'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_devname'
   include/linux/jbd2.h:1050: warning: No description found for parameter 'j_average_commit_time'
   include/linux

Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6

2017-07-31 Thread Shaohua Li

On Mon, Jul 31, 2017 at 03:35:02PM -0700, Cong Wang wrote:
> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li  wrote:
> >  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> > __be32 flowlabel, bool autolabel,
> > -   struct flowi6 *fl6)
> > +   struct flowi6 *fl6, u32 hash)
> >  {
> > -   u32 hash;
> > -
> > /* @flowlabel may include more than a flow label, eg, the traffic class.
> >  * Here we want only the flow label value.
> >  */
> > @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >  net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
> > return flowlabel;
> >
> > -   hash = skb_get_hash_flowi6(skb, fl6);
> > +   if (skb)
> > +   hash = skb_get_hash_flowi6(skb, fl6);
> 
> 
> Why not just move skb_get_hash_flowi6() to its caller?
> This check is not necessary. If you don't want to touch
> existing callers, you can just introduce a wrapper:
> 
> 
> static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> __be32 flowlabel, bool autolabel,
> struct flowi6 *fl6)
> {
>   u32 hash = skb_get_hash_flowi6(skb, fl6);
>   return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
> }

This will always call skb_get_hash_flowi6() in the fast path, even when auto
flowlabel is disabled. I think we should avoid that.

> 
> And your code can just call:
> 
> __ip6_make_flowlabel(net, flowlabel, autolabel, sk->sk_txhash);
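
The concern raised in this reply can be sketched in user space. This is a
simplified model, not the kernel code: fake_flow_hash(), make_label_core(),
make_label_eager() and make_label_lazy() are hypothetical names, the early
return is reduced to "a label was supplied or autolabel is off", and the
hash counter exists only to make the cost visible. An eager wrapper of the
shape proposed in the quoted text hashes on every call; the lazy variant
hashes only when the hash is actually needed.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how often the (assumed expensive) flow hash is computed. */
int hash_calls;

/* Hypothetical stand-in for skb_get_hash_flowi6(). */
uint32_t fake_flow_hash(uint32_t seed)
{
	hash_calls++;
	return seed * 2654435761u;
}

/* Core helper: takes a hash that the caller has already computed. */
uint32_t make_label_core(uint32_t flowlabel, int autolabel, uint32_t hash)
{
	/* A flow label is 20 bits wide. */
	assert((flowlabel & ~0xFFFFFu) == 0);

	if (flowlabel || !autolabel)
		return flowlabel;	/* early return: hash is unused */

	return hash & 0xFFFFFu;
}

/* Eager wrapper: the hash is computed before the early-return check. */
uint32_t make_label_eager(uint32_t seed, uint32_t flowlabel, int autolabel)
{
	uint32_t hash = fake_flow_hash(seed);

	return make_label_core(flowlabel, autolabel, hash);
}

/* Lazy variant: the hash is computed only after the early return. */
uint32_t make_label_lazy(uint32_t seed, uint32_t flowlabel, int autolabel)
{
	if (flowlabel || !autolabel)
		return flowlabel;

	return make_label_core(0, autolabel, fake_flow_hash(seed));
}
```

Both variants produce the same label; they differ only in how often the hash
runs when a label is already set or auto flow labels are disabled.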

Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6

2017-07-31 Thread Cong Wang

On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li  wrote:
>  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> __be32 flowlabel, bool autolabel,
> -   struct flowi6 *fl6)
> +   struct flowi6 *fl6, u32 hash)
>  {
> -   u32 hash;
> -
> /* @flowlabel may include more than a flow label, eg, the traffic class.
>  * Here we want only the flow label value.
>  */
> @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>  net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
> return flowlabel;
>
> -   hash = skb_get_hash_flowi6(skb, fl6);
> +   if (skb)
> +   hash = skb_get_hash_flowi6(skb, fl6);


Why not just move skb_get_hash_flowi6() to its caller?
This check is not necessary. If you don't want to touch
existing callers, you can just introduce a wrapper:


static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
__be32 flowlabel, bool autolabel,
struct flowi6 *fl6)
{
  u32 hash = skb_get_hash_flowi6(skb, fl6);
  return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
}

And your code can just call:

__ip6_make_flowlabel(net, flowlabel, autolabel, sk->sk_txhash);

[PATCH net-next 01/11] net: dsa: make EEE ops optional

2017-07-31 Thread Vivien Didelot

Even though EEE implies the port's PHY and MAC of both ends, a switch
may not need to do anything to configure the port's MAC.

This makes it impossible for the DSA layer to distinguish e.g. this case
from a disabled EEE when a driver returns 0 from the get EEE operation.

For this reason, make the EEE ops optional and call them only when
provided. Calling it first allows a switch driver to stop the whole
operation at runtime if a given switch does not support the EEE setting.

If both the MAC operation and PHY are not present, -ENODEV is returned.

Signed-off-by: Vivien Didelot 
---
 net/dsa/slave.c | 44 
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9507bd38cf04..518145ced434 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -646,38 +646,42 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_switch *ds = p->dp->ds;
-   int ret;
+   int err = -ENODEV;
 
-   if (!ds->ops->set_eee)
-   return -EOPNOTSUPP;
+   if (ds->ops->set_eee) {
+   err = ds->ops->set_eee(ds, p->dp->index, p->phy, e);
+   if (err)
+   return err;
+   }
 
-   ret = ds->ops->set_eee(ds, p->dp->index, p->phy, e);
-   if (ret)
-   return ret;
+   if (p->phy) {
+   err = phy_ethtool_set_eee(p->phy, e);
+   if (err)
+   return err;
+   }
 
-   if (p->phy)
-   ret = phy_ethtool_set_eee(p->phy, e);
-
-   return ret;
+   return err;
 }
 
 static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_switch *ds = p->dp->ds;
-   int ret;
+   int err = -ENODEV;
 
-   if (!ds->ops->get_eee)
-   return -EOPNOTSUPP;
+   if (ds->ops->get_eee) {
+   err = ds->ops->get_eee(ds, p->dp->index, e);
+   if (err)
+   return err;
+   }
 
-   ret = ds->ops->get_eee(ds, p->dp->index, e);
-   if (ret)
-   return ret;
+   if (p->phy) {
+   err = phy_ethtool_get_eee(p->phy, e);
+   if (err)
+   return err;
+   }
 
-   if (p->phy)
-   ret = phy_ethtool_get_eee(p->phy, e);
-
-   return ret;
+   return err;
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-- 
2.13.3
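
The control flow this patch gives dsa_slave_set_eee() can be modelled in
user space as follows. This is only a sketch: struct port_model, struct
eee_cfg and the callbacks are hypothetical stand-ins for the kernel
structures, and the "PHY present" test is modelled as a second optional
callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ethtool_eee. */
struct eee_cfg {
	int eee_enabled;
};

/* Hypothetical stand-in for a DSA port with optional MAC and PHY hooks. */
struct port_model {
	/* NULL when the switch has nothing to configure on the MAC. */
	int (*set_mac)(struct port_model *port, const struct eee_cfg *cfg);
	/* NULL when no PHY is attached to the port. */
	int (*set_phy)(struct port_model *port, const struct eee_cfg *cfg);
	int mac_calls;
	int phy_calls;
};

/* Mirrors the patched flow: optional MAC op first, then the PHY. */
int port_set_eee(struct port_model *port, const struct eee_cfg *cfg)
{
	int err = -ENODEV;	/* returned when neither hook is present */

	assert(port != NULL && cfg != NULL);

	if (port->set_mac) {
		err = port->set_mac(port, cfg);
		if (err)
			return err;	/* the switch can veto the request */
	}

	if (port->set_phy) {
		err = port->set_phy(port, cfg);
		if (err)
			return err;
	}

	return err;
}

/* Example callbacks used to exercise the three cases. */
int mac_unsupported(struct port_model *port, const struct eee_cfg *cfg)
{
	(void)cfg;
	port->mac_calls++;
	return -EOPNOTSUPP;
}

int phy_ok(struct port_model *port, const struct eee_cfg *cfg)
{
	(void)cfg;
	port->phy_calls++;
	return 0;
}
```

With neither hook set the call returns -ENODEV, with only the PHY hook it
returns the PHY result, and a failing MAC hook stops the operation before the
PHY is touched.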

[PATCH net-next 11/11] net: dsa: rename switch EEE ops

2017-07-31 Thread Vivien Didelot

To avoid confusion with the PHY EEE settings, rename the .set_eee and
.get_eee ops to respectively .set_mac_eee and .get_mac_eee.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/bcm_sf2.c | 12 ++--
 drivers/net/dsa/qca8k.c   |  4 ++--
 include/net/dsa.h | 10 +-
 net/dsa/slave.c   |  8 
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index ce886345d8d2..6bbfa6ea1efb 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -338,8 +338,8 @@ static int bcm_sf2_eee_init(struct dsa_switch *ds, int port,
return 1;
 }
 
-static int bcm_sf2_sw_get_eee(struct dsa_switch *ds, int port,
- struct ethtool_eee *e)
+static int bcm_sf2_sw_get_mac_eee(struct dsa_switch *ds, int port,
+ struct ethtool_eee *e)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
struct ethtool_eee *p = &priv->port_sts[port].eee;
@@ -352,8 +352,8 @@ static int bcm_sf2_sw_get_eee(struct dsa_switch *ds, int port,
return 0;
 }
 
-static int bcm_sf2_sw_set_eee(struct dsa_switch *ds, int port,
- struct ethtool_eee *e)
+static int bcm_sf2_sw_set_mac_eee(struct dsa_switch *ds, int port,
+ struct ethtool_eee *e)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
struct ethtool_eee *p = &priv->port_sts[port].eee;
@@ -1011,8 +1011,8 @@ static const struct dsa_switch_ops bcm_sf2_ops = {
.set_wol= bcm_sf2_sw_set_wol,
.port_enable= bcm_sf2_port_setup,
.port_disable   = bcm_sf2_port_disable,
-   .get_eee= bcm_sf2_sw_get_eee,
-   .set_eee= bcm_sf2_sw_set_eee,
+   .get_mac_eee= bcm_sf2_sw_get_mac_eee,
+   .set_mac_eee= bcm_sf2_sw_set_mac_eee,
.port_bridge_join   = b53_br_join,
.port_bridge_leave  = b53_br_leave,
.port_stp_state_set = b53_br_set_stp_state,
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 400333077a9f..1f6cb107bc63 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -638,7 +638,7 @@ qca8k_get_sset_count(struct dsa_switch *ds)
 }
 
 static int
-qca8k_set_eee(struct dsa_switch *ds, int port, struct ethtool_eee *eee)
+qca8k_set_mac_eee(struct dsa_switch *ds, int port, struct ethtool_eee *eee)
 {
struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
u32 lpi_en = QCA8K_REG_EEE_CTRL_LPI_EN(port);
@@ -855,7 +855,7 @@ static const struct dsa_switch_ops qca8k_switch_ops = {
.phy_write  = qca8k_phy_write,
.get_ethtool_stats  = qca8k_get_ethtool_stats,
.get_sset_count = qca8k_get_sset_count,
-   .set_eee= qca8k_set_eee,
+   .set_mac_eee= qca8k_set_mac_eee,
.port_enable= qca8k_port_enable,
.port_disable   = qca8k_port_disable,
.port_stp_state_set = qca8k_port_stp_state_set,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index ce46db323394..0b1a0622b33c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -332,12 +332,12 @@ struct dsa_switch_ops {
struct phy_device *phy);
 
/*
-* EEE setttings
+* Port's MAC EEE settings
 */
-   int (*set_eee)(struct dsa_switch *ds, int port,
-  struct ethtool_eee *e);
-   int (*get_eee)(struct dsa_switch *ds, int port,
-  struct ethtool_eee *e);
+   int (*set_mac_eee)(struct dsa_switch *ds, int port,
+  struct ethtool_eee *e);
+   int (*get_mac_eee)(struct dsa_switch *ds, int port,
+  struct ethtool_eee *e);
 
/* EEPROM access */
int (*get_eeprom_len)(struct dsa_switch *ds);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 6bc75ab438e8..832a54c94652 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -648,8 +648,8 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
struct dsa_switch *ds = p->dp->ds;
int err = -ENODEV;
 
-   if (ds->ops->set_eee) {
-   err = ds->ops->set_eee(ds, p->dp->index, e);
+   if (ds->ops->set_mac_eee) {
+   err = ds->ops->set_mac_eee(ds, p->dp->index, e);
if (err)
return err;
}
@@ -675,8 +675,8 @@ static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
struct dsa_switch *ds = p->dp->ds;
int err = -ENODEV;
 
-   if (ds->ops->get_eee) {
-   err = ds->ops->get_eee(ds, p->dp->index, e);
+   if (ds->ops->get_mac_eee) {
+   err = ds->ops->get_mac_eee(ds, p->dp->index, e);
if (err)

[PATCH net-next 03/11] net: dsa: qca8k: enable EEE once

2017-07-31 Thread Vivien Didelot

If EEE is queried enabled, qca8k_set_eee calls qca8k_eee_enable_set
twice (because it is already called in qca8k_eee_init). Fix that.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/qca8k.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index e076ab23d4df..9d6b5d2f7a4a 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -684,12 +684,13 @@ qca8k_set_eee(struct dsa_switch *ds, int port,
 
p->eee_enabled = e->eee_enabled;
 
-   if (e->eee_enabled) {
+   if (!p->eee_enabled) {
+   qca8k_eee_enable_set(ds, port, false);
+   } else {
p->eee_enabled = qca8k_eee_init(ds, port, phydev);
if (!p->eee_enabled)
ret = -EOPNOTSUPP;
}
-   qca8k_eee_enable_set(ds, port, p->eee_enabled);
 
return ret;
 }
-- 
2.13.3

[PATCH net-next 00/11] net: dsa: rework EEE support

2017-07-31 Thread Vivien Didelot

EEE implies configuring the port's PHY and MAC of both ends of the wire.

The current EEE support in DSA mixes PHY and MAC configuration, which is
bad because PHYs must be configured through a proper PHY driver. The DSA
switch operations for EEE are only meant for configuring the port's MAC,
which are integrated in the Ethernet switch device.

This patchset fixes the EEE support in the qca8k driver, makes the DSA layer
call phy_init_eee for all drivers, and removes the EEE support from the
mv88e6xxx driver since the Marvell PHY driver should be enough for it.

Vivien Didelot (11):
  net: dsa: make EEE ops optional
  net: dsa: qca8k: fix EEE init
  net: dsa: qca8k: enable EEE once
  net: dsa: qca8k: do not cache unneeded EEE fields
  net: dsa: qca8k: remove qca8k_get_eee
  net: dsa: bcm_sf2: remove unneeded supported flags
  net: dsa: mv88e6xxx: call phy_init_eee
  net: dsa: call phy_init_eee in DSA layer
  net: dsa: remove PHY device argument from .set_eee
  net: dsa: mv88e6xxx: remove EEE support
  net: dsa: rename switch EEE ops

 drivers/net/dsa/bcm_sf2.c| 26 +++
 drivers/net/dsa/mv88e6xxx/chip.c | 82 --
 drivers/net/dsa/mv88e6xxx/chip.h |  6 ---
 drivers/net/dsa/mv88e6xxx/phy.c  | 96 
 drivers/net/dsa/mv88e6xxx/phy.h  | 22 -
 drivers/net/dsa/mv88e6xxx/port.c | 17 ---
 drivers/net/dsa/mv88e6xxx/port.h |  3 --
 drivers/net/dsa/qca8k.c  | 68 ++--
 drivers/net/dsa/qca8k.h  |  1 -
 include/net/dsa.h| 11 +++--
 net/dsa/slave.c  | 48 
 11 files changed, 45 insertions(+), 335 deletions(-)

-- 
2.13.3

[PATCH net-next 07/11] net: dsa: mv88e6xxx: call phy_init_eee

2017-07-31 Thread Vivien Didelot

It is safer to init the EEE before the DSA layer calls
phy_ethtool_set_eee, as sf2 and qca8k are doing.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 647d5d45c1d6..b531d4a3bab5 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -855,6 +855,12 @@ static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
struct mv88e6xxx_chip *chip = ds->priv;
int err;
 
+   if (e->eee_enabled) {
+   err = phy_init_eee(phydev, 0);
+   if (err)
+   return err;
+   }
+
mutex_lock(&chip->reg_lock);
err = mv88e6xxx_energy_detect_write(chip, port, e);
mutex_unlock(&chip->reg_lock);
-- 
2.13.3

[PATCH net-next 09/11] net: dsa: remove PHY device argument from .set_eee

2017-07-31 Thread Vivien Didelot

The DSA switch operations for EEE are only meant to configure a port's
MAC EEE settings. The port's PHY EEE settings are accessed by the DSA
layer and must be made available via a proper PHY driver.

In order to reduce this confusion, remove the phy_device argument from
the .set_eee operation.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/bcm_sf2.c|  1 -
 drivers/net/dsa/mv88e6xxx/chip.c |  2 +-
 drivers/net/dsa/qca8k.c  | 14 +++---
 include/net/dsa.h|  1 -
 net/dsa/slave.c  |  2 +-
 5 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 9d10aac8f241..ce886345d8d2 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -353,7 +353,6 @@ static int bcm_sf2_sw_get_eee(struct dsa_switch *ds, int port,
 }
 
 static int bcm_sf2_sw_set_eee(struct dsa_switch *ds, int port,
- struct phy_device *phydev,
  struct ethtool_eee *e)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 647d5d45c1d6..aaa96487f21f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -850,7 +850,7 @@ static int mv88e6xxx_get_eee(struct dsa_switch *ds, int port,
 }
 
 static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
-struct phy_device *phydev, struct ethtool_eee *e)
+struct ethtool_eee *e)
 {
struct mv88e6xxx_chip *chip = ds->priv;
int err;
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 038a895d9a96..400333077a9f 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -637,8 +637,8 @@ qca8k_get_sset_count(struct dsa_switch *ds)
return ARRAY_SIZE(ar8327_mib);
 }
 
-static void
-qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)
+static int
+qca8k_set_eee(struct dsa_switch *ds, int port, struct ethtool_eee *eee)
 {
struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
u32 lpi_en = QCA8K_REG_EEE_CTRL_LPI_EN(port);
@@ -646,20 +646,12 @@ qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)
 
mutex_lock(&priv->reg_mutex);
reg = qca8k_read(priv, QCA8K_REG_EEE_CTRL);
-   if (enable)
+   if (eee->eee_enabled)
reg |= lpi_en;
else
reg &= ~lpi_en;
qca8k_write(priv, QCA8K_REG_EEE_CTRL, reg);
mutex_unlock(&priv->reg_mutex);
-}
-
-static int
-qca8k_set_eee(struct dsa_switch *ds, int port,
- struct phy_device *phydev,
- struct ethtool_eee *e)
-{
-   qca8k_eee_enable_set(ds, port, e->eee_enabled);
 
return 0;
 }
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 88da272d20d0..ce46db323394 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -335,7 +335,6 @@ struct dsa_switch_ops {
 * EEE setttings
 */
int (*set_eee)(struct dsa_switch *ds, int port,
-  struct phy_device *phydev,
   struct ethtool_eee *e);
int (*get_eee)(struct dsa_switch *ds, int port,
   struct ethtool_eee *e);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bf71c206fe8f..6bc75ab438e8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -649,7 +649,7 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
int err = -ENODEV;
 
if (ds->ops->set_eee) {
-   err = ds->ops->set_eee(ds, p->dp->index, p->phy, e);
+   err = ds->ops->set_eee(ds, p->dp->index, e);
if (err)
return err;
}
-- 
2.13.3

[PATCH net-next 08/11] net: dsa: call phy_init_eee in DSA layer

2017-07-31 Thread Vivien Didelot

All DSA drivers are calling phy_init_eee if eee_enabled is true.

Move up this statement in the DSA layer to simplify the DSA drivers.
qca8k does not require to cache the ethtool_eee structures from now on.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/bcm_sf2.c|  9 +
 drivers/net/dsa/mv88e6xxx/chip.c |  6 --
 drivers/net/dsa/qca8k.c  | 31 ++-
 drivers/net/dsa/qca8k.h  |  1 -
 net/dsa/slave.c  |  6 ++
 5 files changed, 9 insertions(+), 44 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index aef475f1ce06..9d10aac8f241 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -360,14 +360,7 @@ static int bcm_sf2_sw_set_eee(struct dsa_switch *ds, int port,
struct ethtool_eee *p = &priv->port_sts[port].eee;
 
p->eee_enabled = e->eee_enabled;
-
-   if (!p->eee_enabled) {
-   bcm_sf2_eee_enable_set(ds, port, false);
-   } else {
-   p->eee_enabled = bcm_sf2_eee_init(ds, port, phydev);
-   if (!p->eee_enabled)
-   return -EOPNOTSUPP;
-   }
+   bcm_sf2_eee_enable_set(ds, port, e->eee_enabled);
 
return 0;
 }
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index b531d4a3bab5..647d5d45c1d6 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -855,12 +855,6 @@ static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
struct mv88e6xxx_chip *chip = ds->priv;
int err;
 
-   if (e->eee_enabled) {
-   err = phy_init_eee(phydev, 0);
-   if (err)
-   return err;
-   }
-
mutex_lock(&chip->reg_lock);
err = mv88e6xxx_energy_detect_write(chip, port, e);
mutex_unlock(&chip->reg_lock);
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index b5f2710064e5..038a895d9a96 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -655,40 +655,13 @@ qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)
 }
 
 static int
-qca8k_eee_init(struct dsa_switch *ds, int port,
-  struct phy_device *phy)
-{
-   int ret;
-
-   ret = phy_init_eee(phy, 0);
-   if (ret)
-   return 0;
-
-   qca8k_eee_enable_set(ds, port, true);
-
-   return 1;
-}
-
-static int
 qca8k_set_eee(struct dsa_switch *ds, int port,
  struct phy_device *phydev,
  struct ethtool_eee *e)
 {
-   struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
-   struct ethtool_eee *p = &priv->port_sts[port].eee;
-   int ret = 0;
+   qca8k_eee_enable_set(ds, port, e->eee_enabled);
 
-   p->eee_enabled = e->eee_enabled;
-
-   if (!p->eee_enabled) {
-   qca8k_eee_enable_set(ds, port, false);
-   } else {
-   p->eee_enabled = qca8k_eee_init(ds, port, phydev);
-   if (!p->eee_enabled)
-   ret = -EOPNOTSUPP;
-   }
-
-   return ret;
+   return 0;
 }
 
 static void
diff --git a/drivers/net/dsa/qca8k.h b/drivers/net/dsa/qca8k.h
index 1ed4fac6cd6d..1cf8a920d4ff 100644
--- a/drivers/net/dsa/qca8k.h
+++ b/drivers/net/dsa/qca8k.h
@@ -156,7 +156,6 @@ enum qca8k_fdb_cmd {
 };
 
 struct ar8xxx_port_status {
-   struct ethtool_eee eee;
int enabled;
 };
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 518145ced434..bf71c206fe8f 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -655,6 +655,12 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
}
 
if (p->phy) {
+   if (e->eee_enabled) {
+   err = phy_init_eee(p->phy, 0);
+   if (err)
+   return err;
+   }
+
err = phy_ethtool_set_eee(p->phy, e);
if (err)
return err;
-- 
2.13.3

[PATCH net-next 10/11] net: dsa: mv88e6xxx: remove EEE support

2017-07-31 Thread Vivien Didelot

The PHY's EEE settings are already accessed by the DSA layer through the
Marvell PHY driver and there is nothing to be done for switch's MACs.

Remove all EEE support from the mv88e6xxx driver.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 82 --
 drivers/net/dsa/mv88e6xxx/chip.h |  6 ---
 drivers/net/dsa/mv88e6xxx/phy.c  | 96 
 drivers/net/dsa/mv88e6xxx/phy.h  | 22 -
 drivers/net/dsa/mv88e6xxx/port.c | 17 ---
 drivers/net/dsa/mv88e6xxx/port.h |  3 --
 6 files changed, 226 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index aaa96487f21f..746ebf2fed80 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -810,58 +810,6 @@ static void mv88e6xxx_get_regs(struct dsa_switch *ds, int port,
mutex_unlock(&chip->reg_lock);
 }
 
-static int mv88e6xxx_energy_detect_read(struct mv88e6xxx_chip *chip, int port,
-   struct ethtool_eee *eee)
-{
-   int err;
-
-   if (!chip->info->ops->phy_energy_detect_read)
-   return -EOPNOTSUPP;
-
-   /* assign eee->eee_enabled and eee->tx_lpi_enabled */
-   err = chip->info->ops->phy_energy_detect_read(chip, port, eee);
-   if (err)
-   return err;
-
-   /* assign eee->eee_active */
-   return mv88e6xxx_port_status_eee(chip, port, eee);
-}
-
-static int mv88e6xxx_energy_detect_write(struct mv88e6xxx_chip *chip, int port,
-struct ethtool_eee *eee)
-{
-   if (!chip->info->ops->phy_energy_detect_write)
-   return -EOPNOTSUPP;
-
-   return chip->info->ops->phy_energy_detect_write(chip, port, eee);
-}
-
-static int mv88e6xxx_get_eee(struct dsa_switch *ds, int port,
-struct ethtool_eee *e)
-{
-   struct mv88e6xxx_chip *chip = ds->priv;
-   int err;
-
-   mutex_lock(&chip->reg_lock);
-   err = mv88e6xxx_energy_detect_read(chip, port, e);
-   mutex_unlock(&chip->reg_lock);
-
-   return err;
-}
-
-static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
-struct ethtool_eee *e)
-{
-   struct mv88e6xxx_chip *chip = ds->priv;
-   int err;
-
-   mutex_lock(&chip->reg_lock);
-   err = mv88e6xxx_energy_detect_write(chip, port, e);
-   mutex_unlock(&chip->reg_lock);
-
-   return err;
-}
-
 static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 {
struct dsa_switch *ds = NULL;
@@ -2521,8 +2469,6 @@ static const struct mv88e6xxx_ops mv88e6141_ops = {
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
-   .phy_energy_detect_read = mv88e6352_phy_energy_detect_read,
-   .phy_energy_detect_write = mv88e6352_phy_energy_detect_write,
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6390_port_set_rgmii_delay,
@@ -2648,8 +2594,6 @@ static const struct mv88e6xxx_ops mv88e6172_ops = {
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
-   .phy_energy_detect_read = mv88e6352_phy_energy_detect_read,
-   .phy_energy_detect_write = mv88e6352_phy_energy_detect_write,
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
@@ -2719,8 +2663,6 @@ static const struct mv88e6xxx_ops mv88e6176_ops = {
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
-   .phy_energy_detect_read = mv88e6352_phy_energy_detect_read,
-   .phy_energy_detect_write = mv88e6352_phy_energy_detect_write,
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
@@ -2784,8 +2726,6 @@ static const struct mv88e6xxx_ops mv88e6190_ops = {
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
-   .phy_energy_detect_read = mv88e6390_phy_energy_detect_read,
-   .phy_energy_detect_write = mv88e6390_phy_energy_detect_write,
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6390_port_set_rgmii_delay,
@@ -2821,8 +2761,6 @@ static const struct mv88e6xxx_ops mv88e6190x_ops = {
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
.phy_read = mv88e6xxx_g2_smi_phy_read,
.phy_write = mv88e6xxx_g2_smi_phy_write,
-   .phy_energy_detect_read

[PATCH net-next 02/11] net: dsa: qca8k: fix EEE init

2017-07-31 Thread Vivien Didelot

The qca8k obviously copied code from the sf2 driver as how to set EEE:

if (e->eee_enabled) {
p->eee_enabled = qca8k_eee_init(ds, port, phydev);
if (!p->eee_enabled)
ret = -EOPNOTSUPP;
}

But it did not use the same logic for the EEE init routine, which is
"Returns 0 if EEE was not enabled, or 1 otherwise". This results in
returning -EOPNOTSUPP on success and caching EEE enabled on failure.

This patch fixes the returned value of qca8k_eee_init.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/qca8k.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index b3bee7eab45f..e076ab23d4df 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -666,11 +666,11 @@ qca8k_eee_init(struct dsa_switch *ds, int port,
 
ret = phy_init_eee(phy, 0);
if (ret)
-   return ret;
+   return 0;
 
qca8k_eee_enable_set(ds, port, true);
 
-   return 0;
+   return 1;
 }
 
 static int
-- 
2.13.3
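
The return-convention mismatch described in this patch can be reproduced in a small user-space model. This is only a sketch under stated assumptions: model_phy_init_eee(), eee_init_before(), eee_init_after() and model_set_eee() are hypothetical stand-ins, not the driver's functions, and the specific errno returned by the stub is illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for phy_init_eee(): 0 on success, negative errno
 * on failure. The exact errno value is an assumption for illustration. */
static int model_phy_init_eee(int phy_supports_eee)
{
	return phy_supports_eee ? 0 : -EPROTONOSUPPORT;
}

/* Pre-patch shape: errno-style return (0 means success). */
static int eee_init_before(int phy_supports_eee)
{
	int ret = model_phy_init_eee(phy_supports_eee);

	if (ret)
		return ret;
	return 0;
}

/* Post-patch shape: "Returns 0 if EEE was not enabled, or 1 otherwise". */
static int eee_init_after(int phy_supports_eee)
{
	if (model_phy_init_eee(phy_supports_eee))
		return 0;
	return 1;
}

struct model_result {
	int eee_enabled;	/* value the caller would cache */
	int ret;		/* value the caller would return */
};

/* Caller logic quoted in the commit message, parameterised over the
 * init routine so both variants can be compared. */
static struct model_result model_set_eee(int (*init)(int), int phy_supports_eee)
{
	struct model_result r = { 0, 0 };

	assert(init);
	r.eee_enabled = init(phy_supports_eee);
	if (!r.eee_enabled)
		r.ret = -EOPNOTSUPP;
	return r;
}
```

With the pre-patch routine, a PHY that supports EEE makes the caller report -EOPNOTSUPP, while a failing PHY leaves a non-zero value cached as "enabled". With the post-patch routine both cases behave as the caller expects.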

[PATCH net-next 04/11] net: dsa: qca8k: do not cache unneeded EEE fields

2017-07-31 Thread Vivien Didelot

The qca8k driver currently caches a bitfield in the supported member of
a private ethtool_eee structure, which is unused.

Only the eee_enabled field of the private ethtool_eee copy is updated,
thus using p->advertised and p->lp_advertised is also erroneous.

Remove the usage of these private ethtool_eee members and only rely on
phy_ethtool_get_eee to assign the eee_active member.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/qca8k.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 9d6b5d2f7a4a..c316c55aabc6 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -658,12 +658,8 @@ static int
 qca8k_eee_init(struct dsa_switch *ds, int port,
   struct phy_device *phy)
 {
-   struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
-   struct ethtool_eee *p = &priv->port_sts[port].eee;
int ret;
 
-   p->supported = (SUPPORTED_1000baseT_Full | SUPPORTED_100baseT_Full);
-
ret = phy_init_eee(phy, 0);
if (ret)
return 0;
@@ -705,12 +701,7 @@ qca8k_get_eee(struct dsa_switch *ds, int port,
int ret;
 
ret = phy_ethtool_get_eee(netdev->phydev, p);
-   if (!ret)
-   e->eee_active =
-   !!(p->supported & p->advertised & p->lp_advertised);
-   else
-   e->eee_active = 0;
-
+   e->eee_active = p->eee_active;
e->eee_enabled = p->eee_enabled;
 
return ret;
-- 
2.13.3

[PATCH net-next 05/11] net: dsa: qca8k: remove qca8k_get_eee

2017-07-31 Thread Vivien Didelot

phy_ethtool_get_eee is already called by the DSA layer, thus remove the
duplicated call in the qca8k driver. qca8k_get_eee becomes unnecessary.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/qca8k.c | 17 -
 1 file changed, 17 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index c316c55aabc6..b5f2710064e5 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -691,22 +691,6 @@ qca8k_set_eee(struct dsa_switch *ds, int port,
return ret;
 }
 
-static int
-qca8k_get_eee(struct dsa_switch *ds, int port,
- struct ethtool_eee *e)
-{
-   struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
-   struct ethtool_eee *p = &priv->port_sts[port].eee;
-   struct net_device *netdev = ds->ports[port].netdev;
-   int ret;
-
-   ret = phy_ethtool_get_eee(netdev->phydev, p);
-   e->eee_active = p->eee_active;
-   e->eee_enabled = p->eee_enabled;
-
-   return ret;
-}
-
 static void
 qca8k_port_stp_state_set(struct dsa_switch *ds, int port, u8 state)
 {
@@ -906,7 +890,6 @@ static const struct dsa_switch_ops qca8k_switch_ops = {
.phy_write  = qca8k_phy_write,
.get_ethtool_stats  = qca8k_get_ethtool_stats,
.get_sset_count = qca8k_get_sset_count,
-   .get_eee= qca8k_get_eee,
.set_eee= qca8k_set_eee,
.port_enable= qca8k_port_enable,
.port_disable   = qca8k_port_disable,
-- 
2.13.3

[PATCH net-next 06/11] net: dsa: bcm_sf2: remove unneeded supported flags

2017-07-31 Thread Vivien Didelot

The SF2 driver is masking the supported bitfield of its private copy of
the ports' ethtool_eee structures. It is used nowhere, thus remove it.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/bcm_sf2.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 648f91b58d1e..aef475f1ce06 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -327,12 +327,8 @@ static void bcm_sf2_port_disable(struct dsa_switch *ds, 
int port,
 static int bcm_sf2_eee_init(struct dsa_switch *ds, int port,
struct phy_device *phy)
 {
-   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
-   struct ethtool_eee *p = &priv->port_sts[port].eee;
int ret;
 
-   p->supported = (SUPPORTED_1000baseT_Full | SUPPORTED_100baseT_Full);
-
ret = phy_init_eee(phy, 0);
if (ret)
return 0;
-- 
2.13.3

[PATCH V4 net 1/2] net: remove unnecessary rotation

2017-07-31 Thread Shaohua Li

From: Shaohua Li 

According to David Miller, the rotation doesn't really help avoid
security problems, so delete it.

Suggested-by: David Miller 
Signed-off-by: Shaohua Li 
---
 include/net/ipv6.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6eac5cf..7548367 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -790,12 +790,6 @@ static inline __be32 ip6_make_flowlabel(struct net *net, 
struct sk_buff *skb,
 
hash = skb_get_hash_flowi6(skb, fl6);
 
-   /* Since this is being sent on the wire obfuscate hash a bit
-* to minimize possbility that any useful information to an
-* attacker is leaked. Only lower 20 bits are relevant.
-*/
-   rol32(hash, 16);
-
flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
 
if (net->ipv6.sysctl.flowlabel_state_ranges)
-- 
2.9.3
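
For readers who want to see what the removed rotation would have done to the label, here is a user-space sketch. It assumes host byte order and uses a local rol32() with the usual rotate-left semantics; FLOWLABEL_MASK, label_plain() and label_rotated() are hypothetical names for illustration. Note also that, as far as the diff shows, the removed statement did not assign the result of rol32() back to hash.

```c
#include <assert.h>
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu	/* low 20 bits, host byte order in this sketch */

/* Rotate a 32-bit word left; this sketch only handles shifts below 32. */
static uint32_t rol32(uint32_t word, unsigned int shift)
{
	assert(shift < 32u);
	return (word << shift) | (word >> ((32u - shift) & 31u));
}

/* Label taken directly from the low 20 bits of the hash. */
static uint32_t label_plain(uint32_t hash)
{
	return hash & FLOWLABEL_MASK;
}

/* Label taken after rotating the hash by 16 bits first. */
static uint32_t label_rotated(uint32_t hash)
{
	return rol32(hash, 16) & FLOWLABEL_MASK;
}
```

Either way the label is just 20 bits of the same hash, only a different selection of them, which is consistent with the commit message's point that the rotation adds little.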

[PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet

2017-07-31 Thread Shaohua Li

From: Shaohua Li 

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The flowlabel of the reset packet (0xb34d5) and the flowlabel of normal packets
(0xd827f) are different. This prevents our router from correctly closing the tcp
connection. The patches try to fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net: remove unnecessary rotation
  net: fix tcp reset packet flowlabel for ipv6

 include/net/ipv6.h   | 15 ---
 net/ipv4/tcp_minisocks.c |  8 +++-
 net/ipv6/ip6_gre.c   |  2 +-
 net/ipv6/ip6_output.c|  4 ++--
 net/ipv6/ip6_tunnel.c|  2 +-
 net/ipv6/tcp_ipv6.c  | 18 +-
 6 files changed, 32 insertions(+), 17 deletions(-)

-- 
2.9.3

[PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6

2017-07-31 Thread Shaohua Li

From: Shaohua Li 

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, which prevents our router
from correctly closing the tcp connection. The reason is that a normal packet
gets its skb->hash from sk->sk_txhash, which is generated randomly.
ip6_make_flowlabel then uses the hash to create a flowlabel. The reset
packet doesn't get assigned a hash, so its flowlabel is calculated from
flowi6.

Since the user can't change a timewait socket's flowlabel, we create a
flowlabel for the timewait socket from the randomly generated hash
(sk->sk_txhash), then use it in the reset packet. In this way, the reset
packet will have the same flowlabel as normal packets.

This also fixes the flowlabel of the reset packet when the user configures
a flowlabel, which was previously ignored.
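
The mismatch and the fix described above can be sketched as a small user-space model. Everything here is an assumption for illustration: struct model_sock, struct model_timewait and the helper names are hypothetical, labels are kept in host byte order, and the hash values in the usage note are made up to resemble the dump.

```c
#include <assert.h>
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu	/* low 20 bits, host byte order in this sketch */

struct model_sock {
	uint32_t txhash;	/* stands in for the random sk->sk_txhash */
};

struct model_timewait {
	uint32_t flowlabel;	/* label remembered when entering TIME_WAIT */
};

static uint32_t model_flowlabel(uint32_t hash)
{
	return hash & FLOWLABEL_MASK;
}

/* Normal data packets: label derived from the socket's random hash. */
static uint32_t normal_label(const struct model_sock *sk)
{
	assert(sk);
	return model_flowlabel(sk->txhash);
}

/* Before the fix: the reset carries no hash of its own, so the label
 * comes from a hash of the flow tuple instead. */
static uint32_t reset_label_before(uint32_t flow_tuple_hash)
{
	return model_flowlabel(flow_tuple_hash);
}

/* After the fix: the label is computed from txhash once, stored in the
 * timewait socket, and reused for the reset. */
static void model_enter_timewait(const struct model_sock *sk,
				 struct model_timewait *tw)
{
	assert(sk && tw);
	tw->flowlabel = model_flowlabel(sk->txhash);
}

static uint32_t reset_label_after(const struct model_timewait *tw)
{
	assert(tw);
	return tw->flowlabel;
}
```

With txhash 0xABCD827F and a flow-tuple hash of 0x123B34D5, the normal label is 0xd827f, the pre-fix reset label is 0xb34d5, and the post-fix reset label matches the normal one.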

Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Signed-off-by: Shaohua Li 
---
 include/net/ipv6.h   |  9 -
 net/ipv4/tcp_minisocks.c |  8 +++-
 net/ipv6/ip6_gre.c   |  2 +-
 net/ipv6/ip6_output.c|  4 ++--
 net/ipv6/ip6_tunnel.c|  2 +-
 net/ipv6/tcp_ipv6.c  | 18 +-
 6 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 7548367..f8713fd 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -773,10 +773,8 @@ static inline void iph_to_flow_copy_v6addrs(struct 
flow_keys *flow,
 
 static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
__be32 flowlabel, bool autolabel,
-   struct flowi6 *fl6)
+   struct flowi6 *fl6, u32 hash)
 {
-   u32 hash;
-
/* @flowlabel may include more than a flow

Re: [PATCH v3 net-next 1/4] tcp: ULP infrastructure

2017-07-31 Thread Dave Watson

On 07/29/17 01:12 PM, Tom Herbert wrote:
> On Wed, Jun 14, 2017 at 11:37 AM, Dave Watson  wrote:
> > Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
> > sockets. Based on a similar infrastructure in tcp_cong.  The idea is that
> > any ULP can add its own logic by changing the TCP proto_ops structure to
> > its own methods.
> >
> > Example usage:
> >
> > setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
> >
> One question: is there a good reason why the ULP infrastructure should
> just be for TCP sockets? For example, I'd really like to be able to do
> something like:
> 
> setsockopt(sock, SOL_SOCKET, SO_ULP, &ulp_param, sizeof(ulp_param));
> 
> Where ulp_param is a structure containing the ULP name as well as some
> ULP specific parameters that are passed to init_ulp. ulp_init could
> determine whether the socket family is appropriate for the ULP being
> requested.

Using SOL_SOCKET instead seems reasonable to me.  I can see how
ulp_params could have some use, perhaps at a slight loss in clarity.
TLS needs its own setsockopts anyway though, for renegotiate for
example.
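
For reference, the TCP-level call quoted at the top of this thread can be wrapped in a small user-space helper. This is a minimal sketch: attach_ulp() is a hypothetical name, the fallback define for TCP_ULP assumes the value from the Linux UAPI header for systems whose libc headers predate it, and the SO_ULP variant discussed above is only a proposal and is not shown.

```c
#include <assert.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef SOL_TCP
#define SOL_TCP IPPROTO_TCP
#endif
#ifndef TCP_ULP
#define TCP_ULP 31	/* assumed from the Linux UAPI header; older libc headers may lack it */
#endif

/* Ask the kernel to attach the named ULP (for example "tls") to a TCP
 * socket. Returns 0 on success, or -1 with errno set on failure. */
static int attach_ulp(int sock, const char *name)
{
	assert(name != NULL);
	/* Length includes the terminating NUL, like sizeof("tls"). */
	return setsockopt(sock, SOL_TCP, TCP_ULP, name,
			  (socklen_t)(strlen(name) + 1));
}
```

A caller would typically invoke attach_ulp(fd, "tls") on a connected socket and check errno on failure, since the call fails when the ULP is not available on the running kernel.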

Re: Long stalls creating a new netns after a netns with a SMB client exits

2017-07-31 Thread Cong Wang

On Mon, Jul 31, 2017 at 9:22 AM, Rolf Neugebauer
 wrote:
> On Fri, Jul 28, 2017 at 8:16 PM, David Ahern  wrote:
>> On 7/28/17 12:58 PM, Rolf Neugebauer wrote:
> I can readily reproduce this on 4.9.39, 4.11.12 and another user
> repro-ed it on 4.12.3. It seems to happen every time. At least one
> user reported issues with NFS mounts as well, but we were not able to
> reproduce it. It's not clear to me if this is directly related to
> 'mount.cifs' or if that just happens to reliably repro it.

 OK, so commit d747a7a51b00984127a88113c does not help this case
 either.
>>>
>>> d747a7a51b009("tcp: reset sk_rx_dst in tcp_disconnect()") indeed seems
>>> a different issue. As I understand that actually caused the ref count
>>> never to get decremented, while here eventually some cleanup kicks in
>>> after a long timeout.
>>
>> It could be a dst is cached on a socket and does not get cleared until
>> the socket time outs are done.
>>
>> Test that theory by something like this for IPv4 TCP (similar change for
>> UDP if the client is UDP based):
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 3a19ea28339f..37db087b6c97 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1855,7 +1855,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const
>> struct sk_buff *skb)
>>  {
>> struct dst_entry *dst = skb_dst(skb);
>>
>> -   if (dst && dst_hold_safe(dst)) {
>> +   if (0 && dst && dst_hold_safe(dst)) {
>> sk->sk_rx_dst = dst;
>> inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
>> }
>
>
> This removes the 200s stall (the test is IPv4/TCP based)


Interesting. This means we have a kernel socket which holds
the dst refcnt.

Looking at the cifs code, it does create a TCP kernel socket
which doesn't hold refcnt to netns but its sk_rx_dst could
still be set as usual, therefore this socket could hold the dst
which holds lo device after the netns is gone. But its timeout
seems to be 60sec (SMB_ECHO_INTERVAL_DEFAULT),
not 200sec.

Ideally it should use a per-netns socket so that it would have
the same lifetime as the netns. But you need to check this with
the cifs developers; I don't understand cifs at all.
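
The reference chain suspected above (socket caches a dst, the dst pins the device, the device blocks netns teardown) can be sketched with a toy refcount model. This is a deliberately simplified illustration, not kernel code: all struct and function names are hypothetical, and the real dst and device refcounting rules are more involved.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev { int refcnt; };				/* e.g. the netns's lo device */
struct model_dst { int refcnt; struct model_dev *dev; };	/* route entry pinning a device */
struct model_sock { struct model_dst *rx_dst; };		/* socket with a cached rx dst */

/* Create a dst with one reference (held by the routing table) that pins dev. */
static void model_dst_init(struct model_dst *dst, struct model_dev *dev)
{
	dst->refcnt = 1;
	dst->dev = dev;
	dev->refcnt++;
}

/* Drop one dst reference; the last one also releases the device. */
static void model_dst_release(struct model_dst *dst)
{
	assert(dst->refcnt > 0);
	if (--dst->refcnt == 0)
		dst->dev->refcnt--;
}

/* The socket takes its own reference when caching the dst. */
static void model_sock_cache_rx_dst(struct model_sock *sk, struct model_dst *dst)
{
	dst->refcnt++;
	sk->rx_dst = dst;
}

/* Closing the socket drops the cached dst, if any. */
static void model_sock_close(struct model_sock *sk)
{
	if (sk->rx_dst) {
		model_dst_release(sk->rx_dst);
		sk->rx_dst = NULL;
	}
}

/* Teardown can only finish once nothing pins the device. */
static int model_netns_can_finish(const struct model_dev *dev)
{
	return dev->refcnt == 0;
}
```

In this model, releasing the routing table's reference at netns exit is not enough: teardown stays blocked until the socket that cached the dst is closed, which matches the stall disappearing when the caching is disabled.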

Re: Kernel TLS in 4.13-rc1

2017-07-31 Thread Dave Watson

On 07/30/17 11:14 PM, David Oberhollenzer wrote:
> On 07/24/2017 11:10 PM, Dave Watson wrote:
> > On 07/23/17 09:39 PM, David Oberhollenzer wrote:
> >> After fixing the benchmark/test tool that the patch description
> >> linked to (https://github.com/Mellanox/tls-af_ktls_tool) to make
> >> sure that the server and client actually *agree* on AES-128-GCM,
> >> I simply ran the client program with the --verify-sendpage option.
> >>
> >> The handshake and setting up of the sockets appears to work but
> >> the program complains that the sent and received page contents
> >> do not match (sent is 0x12 repeated all over and received looks
> >> pretty random).
> > 
> > The --verify functions depend on the RX path as well, which has not
> > been merged.  Any programs / tests using OpenSSL + patches should work
> > fine.
> > 
> > If you want to use the tool, something like this should work, so that
> > the receive path uses gnutls:
> > 
> > ./server --no-echo
> > 
> > ./client --server-port 12345 --sendfile some_file --server-host localhost
> > 
> 
> Thanks! This appears to work as expected (output from the server matches the
> input from the client and the pcap dumps look fine).
> 
> From briefly browsing through the code of the test tool I was initially under
> the impression that it would generate an error message and terminate if an
> attempt was made at configuring ktls for the RX path.
> 
> Anyway, I already read in the patch description that RX wasn't included yet,
> still requires a few cleanups and would follow at some point.
> 
> Is there currently a "not-so-clean" version of the RX patches floating around
> somewhere that we could take a look at?

I dumped the current state here.  Still plenty rough but at least passes
--verify-transmission for me.

https://github.com/ktls/net_next_ktls/tree/tls_recv_net_next

and config changes to af_ktls-tool

https://github.com/ktls/af_ktls-tool/tree/RX

Re: [patch net-next 0/8] mlxsw: Various small fixes

2017-07-31 Thread David Miller

From: Jiri Pirko 
Date: Mon, 31 Jul 2017 09:27:22 +0200

> This patch series is to contribute several fixes for nits that I noticed while
> working on mlxsw. The changes range from typo fixes to local improvements of
> the code and have little in common besides being small in scope.

Series applied, thanks Jiri.

Re: [PATCH net-next 0/7] net: bcmgenet: utilize MDIO unimac driver

2017-07-31 Thread David Miller

From: Florian Fainelli 
Date: Mon, 31 Jul 2017 12:04:21 -0700

> Hi all,
> 
> This patch series migrates the Broadcom GENET driver to use the
> mdio-bcm-unimac driver. This MDIO HW is the same as the one GENET
> internally embeds, yet for historical reasons the two drivers lived their
> own lives. Because of the GENET interrupt situation, we let it specify how
> it wants to signal MDIO operations completion using its driver-private
> waitqueue.
> 
> The diffstat is not super impressive, but it's still negative! This would
> make it easier in the future to absorb possible workarounds/bugs/features
> within the same location.
> 
> This was tested on BCM7260 (GENETv5, single instance), BCM7439 (GENETv4,
> triple instance) and BCM7445 (bcm_sf2 + mdio-bcm-unimac).
> 
> We also now have a nice /proc/iomem output:
> 
> f0b0-f0b0fc4b : /rdb/ethernet@f0b0
>   f0b00e14-f0b00e1c : unimac-mdio.0
> f0b2-f0b2fc4b : /rdb/ethernet@f0b2
>   f0b20e14-f0b20e1c : unimac-mdio.1
> f0b4-f0b4fc4b : /rdb/ethernet@f0b4
>   f0b40e14-f0b40e1c : unimac-mdio.2

I love cleanups like this... even if the diffstat breaks even :-)

Applied, thanks.

[PATCH net] samples/bpf: fix bpf tunnel cleanup

2017-07-31 Thread William Tu

test_tunnel_bpf.sh fails to remove the vxlan11 tunnel device, causing the
next geneve tunnelling test case to fail.  In addition, the geneve reserved bits
in tcbpf2_kern.c should be zero, according to the RFC.

Signed-off-by: William Tu 
---
 samples/bpf/tcbpf2_kern.c  | 4 ++--
 samples/bpf/test_tunnel_bpf.sh | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 9c823a609e75..270edcc149a1 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -147,9 +147,9 @@ int _geneve_set_tunnel(struct __sk_buff *skb)
__builtin_memset(&gopt, 0x0, sizeof(gopt));
gopt.opt_class = 0x102; /* Open Virtual Networking (OVN) */
gopt.type = 0x08;
-   gopt.r1 = 1;
+   gopt.r1 = 0;
gopt.r2 = 0;
-   gopt.r3 = 1;
+   gopt.r3 = 0;
gopt.length = 2; /* 4-byte multiple */
*(int *) &gopt.opt_data = 0xdeadbeef;
 
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index 1ff634f187b7..a70d2ea90313 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -149,6 +149,7 @@ function cleanup {
ip link del veth1
ip link del ipip11
ip link del gretap11
+   ip link del vxlan11
ip link del geneve11
pkill tcpdump
pkill cat
-- 
2.7.4
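
To make the reserved-bit change concrete, the byte that follows the option type can be built with plain masks instead of bitfields. This sketch assumes the option header layout from the Geneve specification, where that byte carries three reserved bits (r1 being the most significant) followed by a five-bit length in 4-byte words. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GENEVE_OPT_RSVD_MASK 0xE0u	/* top three bits: r1, r2, r3 */
#define GENEVE_OPT_LEN_MASK  0x1Fu	/* low five bits: length in 4-byte words */

/* Pack the reserved bits and length into the fourth byte of an option header. */
static uint8_t geneve_opt_pack(unsigned int r1, unsigned int r2,
			       unsigned int r3, unsigned int len_words)
{
	assert(len_words <= GENEVE_OPT_LEN_MASK);
	return (uint8_t)(((r1 & 1u) << 7) | ((r2 & 1u) << 6) |
			 ((r3 & 1u) << 5) | len_words);
}

/* Non-zero when all reserved bits are zero, as the patch now ensures. */
static int geneve_opt_reserved_clear(uint8_t byte)
{
	return (byte & GENEVE_OPT_RSVD_MASK) == 0;
}
```

Under these assumptions the old sample values (r1 = 1, r3 = 1, length = 2) pack to 0xa2, while the patched values pack to 0x02 with the reserved bits clear.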

Re: [RFC net-next 0/6] tcp: remove prequeue and header prediction

2017-07-31 Thread David Miller

From: Eric Dumazet 
Date: Mon, 31 Jul 2017 13:22:22 -0700

> On Mon, Jul 31, 2017 at 1:04 PM, Yuchung Cheng  wrote:
>> by the time these devices use 4.12 kernels they are likely powerful
>> enough to make header prediction irrelevant...
> 
> Also note that TCP stack complexity has increased a lot, I seriously
> doubt anyone could notice any difference.
> 
> On small devices, the major cost is the wakeup of the cpu to process
> one frame before going back to idle...

I agree with Yuchung and Eric on all counts.
