[GIT] Networking
1) Handle notifier registry failures properly in tun/tap driver, from
   Tonghao Zhang.

2) Fix bpf verifier handling of subtraction bounds and add a testcase
   for this, from Edward Cree.

3) Increase reset timeout in ftgmac100 driver, from Ben
   Herrenschmidt.

4) Fix use after free in prb_retire_rx_blk_timer_expired() in
   AF_PACKET, from Cong Wang.

5) Fix SELinux regression due to recent UDP optimizations, from Paolo
   Abeni.

6) We accidentally increment IPSTATS_MIB_FRAGFAILS in the ipv6 code
   paths, fix from Stefano Brivio.

7) Fix some mem leaks in dccp, from Xin Long.

8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd
   Bergmann.

9) MAC address length check and buffer size fixes from Cong Wang.

10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni.

11) Fix return value when copy_from_user() fails in
    bpf_prog_get_info_by_fd(), from Daniel Borkmann.

12) Handle PHY_HALTED properly in phy library state machine, from
    Florian Fainelli.

13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel.

14) Fix truesize calculation in virtio_net which led to performance
    regressions, from Michael S. Tsirkin.

Please pull, thanks a lot!
The following changes since commit 96080f697786e0a30006fcbcc5b53f350fcb3e9f:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-07-20 16:33:39 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

for you to fetch changes up to cc75f8514db6a3aec517760fccaf954e5b46478c:

  samples/bpf: fix bpf tunnel cleanup (2017-07-31 22:02:47 -0700)

Alex Vesker (1):
      net/mlx5e: IPoIB, Modify add/remove underlay QPN flows

Arend Van Spriel (2):
      brcmfmac: fix regression in brcmf_sdio_txpkt_hdalign()
      brcmfmac: fix memleak due to calling brcmf_sdiod_sgtable_alloc() twice

Arnd Bergmann (3):
      net: phy: rework Kconfig settings for MDIO_BUS
      phy: bcm-ns-usb3: fix MDIO_BUS dependency
      tcp: avoid bogus gcc-7 array-bounds warning

Aviv Heller (1):
      net/mlx5: Consider tx_enabled in all modes on remap

Benjamin Herrenschmidt (2):
      ftgmac100: Increase reset timeout
      ftgmac100: Make the MDIO bus a child of the ethernet device

Colin Ian King (1):
      net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt"

Dan Carpenter (1):
      iwlwifi: missing error code in iwl_trans_pcie_alloc()

Daniel Borkmann (2):
      bpf: don't indicate success when copy_from_user fails
      bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len

Daniel Stone (1):
      brcmfmac: Don't grow SKB by negative size

David S. Miller (4):
      Merge branch 'bpf-fix-verifier-min-max-handling-in-BPF_SUB'
      Merge tag 'wireless-drivers-for-davem-2017-07-21' of git://git.kernel.org/.../kvalo/wireless-drivers
      Merge tag 'mlx5-fixes-2017-07-27-V2' of git://git.kernel.org/.../saeed/linux
      Merge tag 'wireless-drivers-for-davem-2017-07-28' of git://git.kernel.org/.../kvalo/wireless-drivers

Edward Cree (2):
      selftests/bpf: subtraction bounds test
      bpf/verifier: fix min/max handling in BPF_SUB

Emmanuel Grumbach (3):
      iwlwifi: dvm: prevent an out of bounds access
      iwlwifi: mvm: fix a NULL pointer dereference of error in recovery
      iwlwifi: fix tracing when tx only is enabled

Eran Ben Elisha (1):
      net/mlx5: Clean SRIOV eswitch resources upon VF creation failure

Eugenia Emantayev (7):
      net/mlx5: Fix mlx5_ifc_mtpps_reg_bits structure size
      net/mlx5e: Add field select to MTPPS register
      net/mlx5e: Fix broken disable 1PPS flow
      net/mlx5e: Change 1PPS out scheme
      net/mlx5e: Add missing support for PTP_CLK_REQ_PPS request
      net/mlx5e: Fix wrong delay calculation for overflow check scheduling
      net/mlx5e: Schedule overflow check work to mlx5e workqueue

Florian Fainelli (4):
      net: dsa: Initialize ds->cpu_port_mask earlier
      net: phy: Correctly process PHY_HALTED in phy_stop_machine()
      MAINTAINERS: Add more files to the PHY LIBRARY section
      Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"

Gao Feng (1):
      ppp: Fix a scheduling-while-atomic bug in del_chan

Ido Schimmel (2):
      mlxsw: spectrum_router: Don't offload routes next in list
      ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()

Ilan Tayari (1):
      net/mlx5e: Fix outer_header_zero() check size

Jakub Kicinski (1):
      bpf: don't zero out the info struct in bpf_obj_get_info_by_fd()

Jason Wang (1):
      Revert "vhost: cache used event for better performance"

Joel Stanley (1):
      ftgmac100: return error in ftgmac100_alloc_rx_buf

Johannes Berg (1):
      iwlwifi: mvm: defer setting IWL_MVM_STATUS_IN_HW_RESTART

Kalle Valo (1):
      Merge tag 'iwlwifi-for-kalle-2017-07-21' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes

Larry Finger (1):
      Revert "rtlwifi: btcoex: rtl8723be: fix ant_sel not
Re: [PATCH net] samples/bpf: fix bpf tunnel cleanup
From: William Tu
Date: Mon, 31 Jul 2017 14:40:50 -0700

> test_tunnel_bpf.sh fails to remove the vxlan11 tunnel device, causing the
> next geneve tunnelling test case to fail. In addition, the geneve reserved
> bit in tcbpf2_kern.c should be zero, according to the RFC.
>
> Signed-off-by: William Tu

Applied, thank you.
Re: [PATCH net] udp6: fix jumbogram reception
From: Paolo Abeni
Date: Mon, 31 Jul 2017 16:52:36 +0200

> Since commit 67a51780aebb ("ipv6: udp: leverage scratch area
> helpers") udp6_recvmsg() reads the skb len from the scratch area,
> to avoid a cache miss.
> But the UDP6 rx path supports RFC 2675 UDPv6 jumbograms, and their
> length exceeds the 16 bits available in the scratch area. As a side
> effect the length returned by recvmsg() is:
> (actual length) % (1<<16)
>
> This commit addresses the issue by allocating one more bit in the
> IP6CB flags field and setting it for incoming jumbograms.
> Such field is still in the first cacheline, so at recvmsg()
> time we can check it and fall back to accessing skb->len if
> required, without a measurable overhead.
>
> Fixes: 67a51780aebb ("ipv6: udp: leverage scratch area helpers")
> Signed-off-by: Paolo Abeni

Applied, thanks Paolo.
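The truncation described in the commit message can be reproduced outside the kernel. The sketch below is only an illustration: `struct fake_skb`, `struct scratch_area` and the helper names are made up for this example and are not the kernel's structures. It assumes a 16-bit cached length plus a one-bit "jumbogram" marker, as the message describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the scratch area: the cached length is 16 bits. */
struct scratch_area {
	uint16_t len;
};

/* Hypothetical stand-in for an skb: full length, jumbogram flag, scratch. */
struct fake_skb {
	uint32_t len;
	bool is_jumbo;
	struct scratch_area scratch;
};

static void fake_skb_set_len(struct fake_skb *skb, uint32_t len)
{
	skb->len = len;
	skb->scratch.len = (uint16_t)len;	/* wraps modulo 1 << 16 */
	skb->is_jumbo = len > UINT16_MAX;	/* the extra flag bit */
}

/* Behaviour reported as broken: always trust the 16-bit cached copy. */
static uint32_t len_from_scratch(const struct fake_skb *skb)
{
	return skb->scratch.len;
}

/* Behaviour described as the fix: check the flag, fall back to the full length. */
static uint32_t len_with_fallback(const struct fake_skb *skb)
{
	return skb->is_jumbo ? skb->len : skb->scratch.len;
}
```

For a 70000-byte payload the cached copy yields 70000 % 65536 = 4464, while the flag-checked path returns 70000; for ordinary packets both paths agree.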
Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan
From: gfree.w...@vip.163.com
Date: Mon, 31 Jul 2017 18:07:38 +0800

> From: Gao Feng
>
> PPTP sets pptp_sock_destruct as the sock's sk_destruct, which
> triggers this bug when __sk_free is invoked in atomic context, because of
> the call path pptp_sock_destruct->del_chan->synchronize_rcu.
>
> Now move the synchronize_rcu to pptp_release from del_chan. This is the
> only case which would free the sock and need the synchronize_rcu.
>
> The following is the panic I met with kernel 3.3.8, but this issue should
> exist in the current kernel too according to the code.
...
> Signed-off-by: Gao Feng

Applied, thanks.
Re: [patch net-next 09/20] net: sched: convert actions array into rcu list
Mon, Jul 31, 2017 at 11:07:13PM CEST, xiyou.wangc...@gmail.com wrote:
>On Fri, Jul 28, 2017 at 7:40 AM, Jiri Pirko wrote:
>> From: Jiri Pirko
>>
>> Currently the actions are stored in array with array size. To traverse
>> this array in fastpath, tcf_tree_lock is taken to protect it. Convert
>> the array into a singly linked list, similar to the filter chains style
>> and allow traversal protected by rcu.
>
>Did you read commit 22dc13c837c33207548c8ee5116 ?
>
>An action can't be shared by multiple filters if you put them
>in a list (no matter singly or double), this is why I use pointers.

All right. Will check it out.
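The sharing objection can be illustrated with a toy model. Everything below is hypothetical (`struct action`, `struct filter` and the helpers are not kernel types); it only assumes that a list-based design embeds a single link in each action, whereas the array-based design stores pointers to actions.

```c
#include <stddef.h>

#define MAX_ACTIONS 4

/* Hypothetical action with one embedded link, as an intrusive list needs. */
struct action {
	int id;
	struct action *next;	/* one link: membership in one list only */
};

/* Hypothetical filter that refers to its actions through a pointer array. */
struct filter {
	struct action *actions[MAX_ACTIONS];
	int nr_actions;
};

/* Pointer-array variant: the same action may be referenced by many filters. */
static int filter_add_action(struct filter *f, struct action *a)
{
	if (f->nr_actions >= MAX_ACTIONS)
		return -1;
	f->actions[f->nr_actions++] = a;
	return 0;
}

/* Intrusive-list variant: pushing overwrites a->next. */
static void list_push(struct action **head, struct action *a)
{
	a->next = *head;
	*head = a;
}
```

Pushing one action onto two intrusive lists rewrites its `next` pointer, so the first list silently changes shape; two filters holding the same pointer in their arrays do not interfere with each other.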
Re: [patch net-next 04/20] net: sched: use tcf_exts_has_actions in tcf_exts_exec
Mon, Jul 31, 2017 at 10:37:21PM CEST, xiyou.wangc...@gmail.com wrote:
>On Fri, Jul 28, 2017 at 7:40 AM, Jiri Pirko wrote:
>> +static inline int
>> +tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
>> +	      struct tcf_result *res)
>> +{
>> +#ifdef CONFIG_NET_CLS_ACT
>> +	if (tcf_exts_has_actions(exts))
>> +		return tcf_action_exec(skb, exts->actions, exts->nr_actions,
>> +				       res);
>> +#endif
>> +	return 0;
>> +}
>
>
>While you are on it, can we get rid of this macro too?
>
>tcf_action_exec() is only defined with CONFIG_NET_CLS_ACT,
>not sure if compiler is kind enough to eliminate the false branch
>for us:
>
>if (false)
>return tcf_action_exec(...); // not defined but the branch is dead
>
>At least you can add a wrapper for tcf_action_exec() to just
>return 0.

Did you see?
net: sched: remove check for number of actions in tcf_exts_exec

I will add a static inline stub for tcf_action_exec in case
CONFIG_NET_CLS_ACT is not set.
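The stub approach agreed on above is a common pattern: define the real function when the option is enabled and a trivial inline replacement otherwise, so callers need no #ifdef of their own. A minimal sketch follows; `FEATURE_ACTIONS`, `struct exts` and the function names are placeholders invented for this example, not the kernel's definitions.

```c
/* Hypothetical container; nr_actions stands in for the real action state. */
struct exts {
	int nr_actions;
};

#ifdef FEATURE_ACTIONS
/* "Real" implementation, only built when the feature is enabled. */
static inline int action_exec(const struct exts *e)
{
	return e->nr_actions;	/* placeholder for the actual work */
}

static inline int exts_has_actions(const struct exts *e)
{
	return e->nr_actions > 0;
}
#else
/* Stubs: keep callers compiling unchanged when the feature is off. */
static inline int action_exec(const struct exts *e)
{
	(void)e;
	return 0;
}

static inline int exts_has_actions(const struct exts *e)
{
	(void)e;
	return 0;
}
#endif

/* The caller no longer needs a preprocessor conditional. */
static inline int exts_exec(const struct exts *e)
{
	if (exts_has_actions(e))
		return action_exec(e);
	return 0;
}
```

Built without `-DFEATURE_ACTIONS`, `exts_exec()` always returns 0 and the compiler can drop the dead branch, because the stub is visible at the call site.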
Re: [PATCH net-next v12 0/4] net sched actions: improve dump performance
On Mon, 31 Jul 2017 08:06:42 -0400 Jamal Hadi Salim wrote:

> On 17-07-30 10:28 PM, David Miller wrote:
> >
> > Series applied, thanks.
> >
>
> Thanks David.
>
> Attaching the iproute2 patch. I will submit an official one with
> man page changes later. Stephen - you take net-next changes?
>
> cheers,
> jamal

Please clean up and resubmit for net-next. The header files have been
updated in the iproute2 net-next branch.

It is not clear to me that the new code is backward compatible.
Will new versions of tc work on old kernels and vice versa?
Also, no #ifdef's
Re: [PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter
On Mon, 31 Jul 2017 17:40:50 -0700 Amritha Nambiar wrote:

The concept is fine, but the code looks different than the rest, which
is never a good sign.

> +	if ((argc > 0) && (matches(*argv, "tc") == 0)) {

Extra () are unnecessary in compound conditional.

> +		tc = atoi(*argv);

Prefer using strtoul since it has better error handling than atoi()

> +		argc--;
> +		argv++;
> +	}

Use NEXT_ARG() construct like rest of the code.
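To make the strtoul() suggestion concrete: atoi() has no way to report failure and typically just yields 0 for non-numeric input such as "abc", while strtoul() exposes an end pointer and errno that can be checked. The helper below is a sketch only; `parse_uint()` is a hypothetical name, not an iproute2 function, and iproute2 ships its own parsing utilities that real code should prefer.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal unsigned value in [0, max].
 * Returns 0 on success and stores the result in *out, -1 on any error.
 */
static int parse_uint(const char *arg, unsigned long max, unsigned long *out)
{
	char *end;
	unsigned long val;

	/* Reject NULL, empty input, signs and leading whitespace up front. */
	if (!arg || !isdigit((unsigned char)*arg))
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno == ERANGE)	/* value does not fit in unsigned long */
		return -1;
	if (*end != '\0')	/* trailing garbage such as "7x" */
		return -1;
	if (val > max)		/* caller-imposed upper bound */
		return -1;

	*out = val;
	return 0;
}
```

A caller would write something like `if (parse_uint(*argv, 15, &tc)) { /* report invalid value */ }`; the bound of 15 is purely an example.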
Re: TCP fast retransmit issues
On Fri, Jul 28, 2017 at 6:54 PM, Neal Cardwell wrote:
> On Wed, Jul 26, 2017 at 3:02 PM, Neal Cardwell wrote:
>> On Wed, Jul 26, 2017 at 2:38 PM, Neal Cardwell wrote:
>>> Yeah, it looks like I can reproduce this issue with (1) bad sacks
>>> causing repeated TLPs, and (2) TLP timers being pushed out to later
>>> times due to incoming data. Scripts are attached.
>>
>> I'm testing a fix of only scheduling a TLP if (flag & FLAG_DATA_ACKED)
>> is true...
>
> An update for the TLP aspect of this thread: our team has a proposed
> fix for this RTO/TLP reschedule issue that we have reviewed internally
> and tested with our packetdrill test suite, including some new tests.
> The basic approach in the fix is as follows:
>
> a) only reschedule the xmit timer once per ACK
>
> b) only reschedule the xmit timer if tcp_clean_rtx_queue() deems this
> is safe (a packet was cumulatively ACKed, or we got a SACK for a
> packet that was sent before the most recent retransmit of the write
> queue head).
>
> After further review and testing we will post it. Hopefully next week.

The timer patches are upstream for review for the "net" branch:

  https://patchwork.ozlabs.org/patch/796057/
  https://patchwork.ozlabs.org/patch/796058/
  https://patchwork.ozlabs.org/patch/796059/

Again, thank you for reporting this, and thanks for the packet trace!

neal
[PATCH net-next 09/10] net: ipv4: Support for sockets bound to enslaved device
Add support for sockets bound to a network interface enslaved to an
L3 Master device (e.g., VRF). Currently for VRF, skb->dev points to the
VRF device meaning socket lookups only consider this device index. The
real ingress device index is saved to IPCB(skb)->iif and the VRF driver
marks the skb with IPSKB_L3SLAVE to know that the real ingress device
is an enslaved one without having to look up the iif.

Use those flags to add the enslaved device index to the socket lookup
and allow sk->sk_bound_dev_if to match either dif (VRF device) or sdif
(enslaved device).

Signed-off-by: David Ahern
---
 include/linux/igmp.h          |  3 ++-
 include/net/inet_hashtables.h | 10 ++
 include/net/ip.h              | 10 ++
 include/net/tcp.h             | 10 ++
 net/ipv4/igmp.c               |  6 --
 net/ipv4/inet_hashtables.c    | 11 +++
 net/ipv4/raw.c                |  7 +--
 net/ipv4/tcp_ipv4.c           |  6 --
 net/ipv4/udp.c                | 11 ---
 9 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 97caf1821de8..f8231854b5d6 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -118,7 +118,8 @@ extern int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 		struct ip_msfilter __user *optval, int __user *optlen);
 extern int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
 		struct group_filter __user *optval, int __user *optlen);
-extern int ip_mc_sf_allow(struct sock *sk, __be32 local, __be32 rmt, int dif);
+extern int ip_mc_sf_allow(struct sock *sk, __be32 local, __be32 rmt,
+			  int dif, int sdif);
 extern void ip_mc_init_dev(struct in_device *);
 extern void ip_mc_destroy_dev(struct in_device *);
 extern void ip_mc_up(struct in_device *);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index c5f4dc3c06e4..2de3d4bc00ba 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -259,22 +259,24 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 	(((__force __u64)(__be32)(__daddr)) << 32) | \
 	((__force __u64)(__be32)(__saddr)))
 #endif /* __BIG_ENDIAN */
-#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports)) && \
 	 ((__sk)->sk_addrpair == (__cookie)) && \
 	 (!(__sk)->sk_bound_dev_if || \
-	   ((__sk)->sk_bound_dev_if == (__dif))) && \
+	   ((__sk)->sk_bound_dev_if == (__dif)) || \
+	   ((__sk)->sk_bound_dev_if == (__sdif))) && \
 	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
 	const int __name __deprecated __attribute__((unused))

-#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports)) && \
 	 ((__sk)->sk_daddr == (__saddr)) && \
 	 ((__sk)->sk_rcv_saddr == (__daddr)) && \
 	 (!(__sk)->sk_bound_dev_if || \
-	   ((__sk)->sk_bound_dev_if == (__dif))) && \
+	   ((__sk)->sk_bound_dev_if == (__dif)) || \
+	   ((__sk)->sk_bound_dev_if == (__sdif))) && \
 	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */

diff --git a/include/net/ip.h b/include/net/ip.h
index 821cedcc8e73..e10da8814dba 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -78,6 +78,16 @@ struct ipcm_cookie {
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
 #define PKTINFO_SKB_CB(skb) ((struct in_pktinfo *)((skb)->cb))

+/* return enslaved device index if relevant */
+static inline int ip_sdif(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	if (skb && ipv4_l3mdev_skb(IPCB(skb)->flags))
+		return IPCB(skb)->iif;
+#endif
+	return 0;
+}
+
 struct ip_ra_chain {
 	struct ip_ra_chain __rcu *next;
 	struct sock *sk;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 12d68335acd4..19827dd05dcc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -861,6 +861,16 @@ static inline bool inet_exact_dif_match(struct net *net, struct sk_buff *skb)
 	return false;
 }

+/* TCP_SKB_CB reference means this can not be used from early demux */
+static inline int tcp_v4_sdif(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	if (skb &&
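The device check that this patch adds to INET_MATCH can be read as a three-way test. The function below is a stand-alone sketch of that condition only; `dev_match()` is a name chosen for this example, and the plain `int` indexes stand in for sk->sk_bound_dev_if, dif and sdif.

```c
#include <stdbool.h>

/*
 * bound_dev_if == 0: socket is not bound to a device, so any ingress matches.
 * dif:  index of the device the stack sees as ingress (the VRF device for
 *       enslaved interfaces).
 * sdif: index of the real, enslaved ingress device, or 0 when the packet
 *       did not arrive through an enslaved device.
 */
static bool dev_match(int bound_dev_if, int dif, int sdif)
{
	return !bound_dev_if || bound_dev_if == dif || bound_dev_if == sdif;
}
```

Because a bound socket always has a non-zero index, passing sdif == 0 for non-enslaved traffic cannot produce a spurious match.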
[PATCH net-next 07/10] net: ipv6: Convert raw sockets to sk_lookup
Convert __raw_v6_lookup to use the new sk_lookup struct

Signed-off-by: David Ahern
---
 include/net/rawv6.h |  3 +--
 net/ipv4/raw_diag.c | 15 ++-
 net/ipv6/raw.c      | 41 +++--
 3 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/include/net/rawv6.h b/include/net/rawv6.h
index cbe4e9de1894..406268324d26 100644
--- a/include/net/rawv6.h
+++ b/include/net/rawv6.h
@@ -5,8 +5,7 @@

 extern struct raw_hashinfo raw_v6_hashinfo;
 struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
-			     unsigned short num, const struct in6_addr *loc_addr,
-			     const struct in6_addr *rmt_addr, int dif);
+			     const struct sk_lookup *params);

 int raw_abort(struct sock *sk, int err);

diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index a708de070cc6..e081c03fd408 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -53,11 +53,16 @@ static struct sock *raw_lookup(struct net *net, struct sock *from,
 		sk = __raw_v4_lookup(net, from, &params);
 	}
 #if IS_ENABLED(CONFIG_IPV6)
-	else
-		sk = __raw_v6_lookup(net, from, r->sdiag_raw_protocol,
-				     (const struct in6_addr *)r->id.idiag_src,
-				     (const struct in6_addr *)r->id.idiag_dst,
-				     r->id.idiag_if);
+	else {
+		struct sk_lookup params = {
+			.saddr.ipv6 = (const struct in6_addr *)r->id.idiag_dst,
+			.daddr.ipv6 = (const struct in6_addr *)r->id.idiag_src,
+			.hnum = r->sdiag_raw_protocol,
+			.dif = r->id.idiag_if,
+		};
+
+		sk = __raw_v6_lookup(net, from, &params);
+	}
 #endif
 	return sk;
 }
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 60be012fe708..51e651f18ffb 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -71,14 +71,14 @@ struct raw_hashinfo raw_v6_hashinfo = {
 EXPORT_SYMBOL_GPL(raw_v6_hashinfo);

 struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
-		unsigned short num, const struct in6_addr *loc_addr,
-		const struct in6_addr *rmt_addr, int dif)
+			     const struct sk_lookup *params)
 {
+	const struct in6_addr *loc_addr = params->daddr.ipv6;
+	const struct in6_addr *rmt_addr = params->saddr.ipv6;
 	bool is_multicast = ipv6_addr_is_multicast(loc_addr);

 	sk_for_each_from(sk)
-		if (inet_sk(sk)->inet_num == num) {
-
+		if (inet_sk(sk)->inet_num == params->hnum) {
 			if (!net_eq(sock_net(sk), net))
 				continue;

@@ -86,7 +86,8 @@ struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
 			    !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
 				continue;

-			if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
+			if (sk->sk_bound_dev_if &&
+			    sk->sk_bound_dev_if != params->dif)
 				continue;

 			if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
@@ -159,15 +160,17 @@ EXPORT_SYMBOL(rawv6_mh_filter_unregister);
  */
 static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 {
-	const struct in6_addr *saddr;
-	const struct in6_addr *daddr;
+	struct sk_lookup params = {
+		.saddr.ipv6 = &ipv6_hdr(skb)->saddr,
+		.daddr.ipv6 = &ipv6_hdr(skb)->daddr,
+		.hnum = nexthdr,
+		.dif = inet6_iif(skb),
+	};
 	struct sock *sk;
 	bool delivered = false;
 	__u8 hash;
 	struct net *net;

-	saddr = &ipv6_hdr(skb)->saddr;
-	daddr = saddr + 1;

 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);

@@ -178,7 +181,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 		goto out;

 	net = dev_net(skb->dev);
-	sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb));
+	sk = __raw_v6_lookup(net, sk, &params);

 	while (sk) {
 		int filtered;
@@ -221,8 +224,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 				rawv6_rcv(sk, clone);
 			}
 		}
-		sk = __raw_v6_lookup(net, sk_next(sk), nexthdr, daddr, saddr,
-				     inet6_iif(skb));
+		sk = __raw_v6_lookup(net, sk_next(sk), &params);
 	}
 out:
 	read_unlock(&raw_v6_hashinfo.lock);
@@ -362,23 +364,26 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 		u8 type, u8 code, int inner_offset, __be32 info)
 {
 	struct sock *sk;
-	int hash;
-	const struct in6_addr *saddr, *daddr;
 	struct net *net;
+	int hash;

 	hash = nexthdr &
[PATCH net-next 04/10] net: ipv4: Convert raw sockets to sk_lookup
Convert __raw_v4_lookup to use the new sk_lookup struct

Signed-off-by: David Ahern
---
 include/net/raw.h   |  3 +--
 net/ipv4/raw.c      | 72 ++---
 net/ipv4/raw_diag.c | 15 +++
 3 files changed, 58 insertions(+), 32 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 57c33dd22ec4..8d0f0e5d013b 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -25,8 +25,7 @@ extern struct proto raw_prot;

 extern struct raw_hashinfo raw_v4_hashinfo;
 struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
-			     unsigned short num, __be32 raddr,
-			     __be32 laddr, int dif);
+			     const struct sk_lookup *params);

 int raw_abort(struct sock *sk, int err);
 void raw_icmp_error(struct sk_buff *, int, u32);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index b0bb5d0a30bd..4da5d87a61a5 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -122,15 +122,23 @@ void raw_unhash_sk(struct sock *sk)
 EXPORT_SYMBOL_GPL(raw_unhash_sk);

 struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
-		unsigned short num, __be32 raddr, __be32 laddr, int dif)
+			     const struct sk_lookup *params)
 {
+	__be32 raddr = params->saddr.ipv4;
+	__be32 laddr = params->daddr.ipv4;
+
 	sk_for_each_from(sk) {
 		struct inet_sock *inet = inet_sk(sk);
+		bool dev_match;
+
+		dev_match = (!sk->sk_bound_dev_if ||
+			     sk->sk_bound_dev_if == params->dif);

-		if (net_eq(sock_net(sk), net) && inet->inet_num == num &&
-		    !(inet->inet_daddr && inet->inet_daddr != raddr) &&
+		if (net_eq(sock_net(sk), net) &&
+		    inet->inet_num == params->hnum &&
+		    !(inet->inet_daddr && inet->inet_daddr != raddr) &&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    dev_match)
 			goto found; /* gotcha */
 	}
 	sk = NULL;
@@ -169,23 +177,20 @@ static int icmp_filter(const struct sock *sk, const struct sk_buff *skb)
  * RFC 1122: SHOULD pass TOS value up to the transport layer.
  * -> It does. And not only TOS, but all IP header.
  */
-static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
+static int __raw_v4_input(struct sk_buff *skb, const struct iphdr *iph,
+			  struct hlist_head *head)
 {
-	struct sock *sk;
-	struct hlist_head *head;
+	struct net *net = dev_net(skb->dev);
+	const struct sk_lookup params = {
+		.saddr.ipv4 = iph->saddr,
+		.daddr.ipv4 = iph->daddr,
+		.hnum = iph->protocol,
+		.dif = skb->dev->ifindex,
+	};
 	int delivered = 0;
-	struct net *net;
-
-	read_lock(&raw_v4_hashinfo.lock);
-	head = &raw_v4_hashinfo.ht[hash];
-	if (hlist_empty(head))
-		goto out;
-
-	net = dev_net(skb->dev);
-	sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol,
-			     iph->saddr, iph->daddr,
-			     skb->dev->ifindex);
+	struct sock *sk;

+	sk = __raw_v4_lookup(net, __sk_head(head), &params);
 	while (sk) {
 		delivered = 1;
 		if ((iph->protocol != IPPROTO_ICMP || !icmp_filter(sk, skb)) &&
@@ -197,11 +202,22 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 			if (clone)
 				raw_rcv(sk, clone);
 		}
-		sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol,
-				     iph->saddr, iph->daddr,
-				     skb->dev->ifindex);
+		sk = __raw_v4_lookup(net, sk_next(sk), &params);
 	}
-out:
+
+	return delivered;
+}
+
+static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
+{
+	struct hlist_head *head;
+	int delivered = 0;
+
+	read_lock(&raw_v4_hashinfo.lock);
+	head = &raw_v4_hashinfo.ht[hash];
+	if (!hlist_empty(head))
+		delivered = __raw_v4_input(skb, iph, head);
+
 	read_unlock(&raw_v4_hashinfo.lock);
 	return delivered;
 }
@@ -297,12 +313,18 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	read_lock(&raw_v4_hashinfo.lock);
 	raw_sk = sk_head(&raw_v4_hashinfo.ht[hash]);
 	if (raw_sk) {
+		struct sk_lookup params = {
+			.hnum = protocol,
+			.dif = skb->dev->ifindex,
+		};
+
 		iph = (const struct iphdr *)skb->data;
 		net = dev_net(skb->dev);

-		while ((raw_sk = __raw_v4_lookup(net, raw_sk, protocol,
-
[PATCH net-next 10/10] net: ipv6: Support for sockets bound to enslaved device
Add support for sockets bound to a network interface enslaved to an
L3 Master device (e.g., VRF). Currently for VRF, skb->dev points to the
VRF device meaning socket lookups only consider this device index. The
real ingress device index is saved to IP6CB(skb)->iif and the VRF driver
marks the skb with IP6SKB_L3SLAVE to know that the real ingress device
is an enslaved one without having to look up the iif.

Use those flags to add the enslaved device index to the socket lookup
and allow sk->sk_bound_dev_if to match either dif (VRF device) or sdif
(enslaved device).

Signed-off-by: David Ahern
---
 include/linux/ipv6.h           | 8
 include/net/inet6_hashtables.h | 5 +++--
 include/net/tcp.h              | 7 +++
 net/ipv6/inet6_hashtables.c    | 9 +
 net/ipv6/raw.c                 | 5 -
 net/ipv6/tcp_ipv6.c            | 3 +++
 net/ipv6/udp.c                 | 8 ++--
 7 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index e1b442996f81..094357907b45 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -153,6 +153,14 @@ static inline int inet6_iif(const struct sk_buff *skb)
 }

 /* can not be used in TCP layer after tcp_v6_fill_cb */
+static inline int inet6_sdif(const struct sk_buff *skb)
+{
+	bool l3_slave = ipv6_l3mdev_skb(IP6CB(skb)->flags);
+
+	return l3_slave ? IP6CB(skb)->iif : 0;
+}
+
+/* can not be used in TCP layer after tcp_v6_fill_cb */
 static inline bool inet6_exact_dif_match(struct net *net, struct sk_buff *skb)
 {
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 15db41272ff2..0fc5a2fe4ad3 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -94,13 +94,14 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
 int inet6_hash(struct sock *sk);
 #endif /* IS_ENABLED(CONFIG_IPV6) */

-#define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif) \
+#define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports)) && \
 	 ((__sk)->sk_family == AF_INET6) && \
 	 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr)) && \
 	 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr)) && \
 	 (!(__sk)->sk_bound_dev_if || \
-	   ((__sk)->sk_bound_dev_if == (__dif))) && \
+	   ((__sk)->sk_bound_dev_if == (__dif)) || \
+	   ((__sk)->sk_bound_dev_if == (__sdif))) && \
 	 net_eq(sock_net(__sk), (__net)))

 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 19827dd05dcc..8a081cff33f8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -848,6 +848,13 @@ static inline int tcp_v6_iif(const struct sk_buff *skb)

 	return l3_slave ? skb->skb_iif : TCP_SKB_CB(skb)->header.h6.iif;
 }
+
+static inline int tcp_v6_sdif(const struct sk_buff *skb)
+{
+	bool l3_slave = ipv6_l3mdev_skb(TCP_SKB_CB(skb)->header.h6.flags);
+
+	return l3_slave ? TCP_SKB_CB(skb)->header.h6.iif : 0;
+}
 #endif

 /* TCP_SKB_CB reference means this can not be used from early demux */
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 878c03094f2e..06120efb2036 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -74,13 +74,13 @@ struct sock *__inet6_lookup_established(struct net *net,
 		if (sk->sk_hash != hash)
 			continue;
 		if (!INET6_MATCH(sk, net, saddr, daddr, ports,
-				 params->dif))
+				 params->dif, params->sdif))
 			continue;
 		if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
 			goto out;

 		if (unlikely(!INET6_MATCH(sk, net, saddr, daddr, ports,
-					  params->dif))) {
+					  params->dif, params->sdif))) {
 			sock_gen_put(sk);
 			goto begin;
 		}
@@ -188,8 +188,9 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	const struct in6_addr *daddr = &sk->sk_v6_rcv_saddr;
 	const struct in6_addr *saddr = &sk->sk_v6_daddr;
 	const int dif = sk->sk_bound_dev_if;
-	const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
 	struct net *net = sock_net(sk);
+	const int sdif = l3mdev_master_ifindex_by_index(net, dif);
+	const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
 	const unsigned int hash = inet6_ehashfn(net, daddr, lport, saddr,
 						inet->inet_dport);
 	struct
[PATCH net-next 00/10] net: l3mdev: Support for sockets bound to enslaved device
A missing piece to the VRF puzzle is the ability to bind sockets to
devices enslaved to a VRF. This patch set adds the enslaved device
index, sdif, to IPv4 and IPv6 socket lookups. The end result for users
is the following scope options for services:

1. "global" services - sockets not bound to any device

   Allows 1 service to work across all network interfaces with
   connected sockets bound to the VRF in which the connection
   originates (Requires net.ipv4.tcp_l3mdev_accept=1 for TCP and
   net.ipv4.udp_l3mdev_accept=1 for UDP)

2. "VRF" local services - sockets bound to a VRF

   Sockets work across all network interfaces enslaved to a VRF but
   are limited to just the one VRF.

3. "device" services - sockets bound to a specific network interface

   Service works only through the one specific interface.

Existing code for socket lookups already passes in 6+ arguments. Rather
than add another for the enslaved device index, the existing lookups
are converted to use a new sk_lookup struct. From there, the enslaved
device index becomes another element of the struct.

Patch 1 introduces sk_lookup struct and helper.

Patches 2-4 convert udp, inet and raw socket lookups for IPv4 to use the
new sk_lookup struct. Meant to be a conversion of IPv4 code only; no
functional change intended.

Patches 5-7 convert udp, inet and raw socket lookups for IPv6 to use the
new sk_lookup struct. Meant to be a conversion of IPv6 code only; no
functional change intended.

Patch 8 adds sdif to the sk_lookup struct allowing lookups to consider
a second device index.

Patches 9-10 add support for the enslaved device index to ipv4 and ipv6
socket lookups.

Changes since RFC:
- no significant logic changes; mainly whitespace cleanups

David Ahern (10):
  net: Add sk_lookup struct and helper
  net: ipv4: Convert udp socket lookups to new struct
  net: ipv4: Convert inet socket lookups to new struct
  net: ipv4: Convert raw sockets to sk_lookup
  net: ipv6: Convert udp socket lookups to new struct
  net: ipv6: Convert inet socket lookups to new struct
  net: ipv6: Convert raw sockets to sk_lookup
  net: Add sdif to sk_lookup
  net: ipv4: Support for sockets bound to enslaved device
  net: ipv6: Support for sockets bound to enslaved device

 include/linux/igmp.h                |   3 +-
 include/linux/ipv6.h                |   8 ++
 include/net/inet6_hashtables.h      |  44 -
 include/net/inet_hashtables.h       |  67 ++---
 include/net/ip.h                    |  10 ++
 include/net/raw.h                   |   3 +-
 include/net/rawv6.h                 |   3 +-
 include/net/sock.h                  |  42 +
 include/net/tcp.h                   |  17
 include/net/udp.h                   |  18 +---
 net/dccp/ipv4.c                     |  19 +++-
 net/dccp/ipv6.c                     |  22 +++--
 net/ipv4/igmp.c                     |   6 +-
 net/ipv4/inet_diag.c                |  50 +++---
 net/ipv4/inet_hashtables.c          |  59 +++-
 net/ipv4/netfilter/nf_socket_ipv4.c |  16 +++-
 net/ipv4/raw.c                      |  77 +--
 net/ipv4/raw_diag.c                 |  30 --
 net/ipv4/tcp_ipv4.c                 |  64 +
 net/ipv4/udp.c                      | 175 ++
 net/ipv4/udp_diag.c                 |  89 --
 net/ipv6/inet6_hashtables.c         |  75 ---
 net/ipv6/netfilter/nf_socket_ipv6.c |  16 +++-
 net/ipv6/raw.c                      |  44 +
 net/ipv6/tcp_ipv6.c                 |  63 +
 net/ipv6/udp.c                      | 181
 net/netfilter/xt_TPROXY.c           |  39 +---
 27 files changed, 759 insertions(+), 481 deletions(-)

-- 
2.1.4
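The refactoring idea in the cover letter, replacing a long argument list with one parameter struct so that a new field can be added without touching every call site, can be sketched in user space as follows. This is an illustration only: `struct lookup_params`, `struct sock_entry` and `table_lookup()` are hypothetical names loosely modelled on the description above, not the sk_lookup definition from the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup key; fields left out of an initializer default to 0. */
struct lookup_params {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t hnum;	/* destination port in host byte order */
	int dif;	/* ingress device index */
	int sdif;	/* enslaved device index, 0 if none */
};

/* Hypothetical socket table entry. */
struct sock_entry {
	uint32_t laddr;		/* 0 means "any local address" */
	uint16_t lport;
	int bound_dev_if;	/* 0 means "not bound to a device" */
};

static const struct sock_entry *
table_lookup(const struct sock_entry *tbl, size_t n,
	     const struct lookup_params *p)
{
	for (size_t i = 0; i < n; i++) {
		const struct sock_entry *s = &tbl[i];

		if (s->lport != p->hnum)
			continue;
		if (s->laddr && s->laddr != p->daddr)
			continue;
		if (s->bound_dev_if && s->bound_dev_if != p->dif &&
		    s->bound_dev_if != p->sdif)
			continue;
		return s;
	}
	return NULL;
}
```

A caller that predates the sdif field, for example `struct lookup_params p = { .daddr = addr, .hnum = 22, .dif = 5 };`, keeps its old behaviour because the omitted field is zero-initialized; only callers that know the enslaved device need to set `.sdif`.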
[PATCH net-next 06/10] net: ipv6: Convert inet socket lookups to new struct
Convert the various inet6_lookup functions to use the new sk_lookup
struct.

Signed-off-by: David Ahern
---
 include/net/inet6_hashtables.h      | 39 +++-
 net/dccp/ipv6.c                     | 22
 net/ipv4/inet_diag.c                | 19 ++
 net/ipv4/udp_diag.c                 |  2 ++
 net/ipv6/inet6_hashtables.c         | 72 +++--
 net/ipv6/netfilter/nf_socket_ipv6.c |  5 ++-
 net/ipv6/tcp_ipv6.c                 | 60 +--
 net/netfilter/xt_TPROXY.c           |  8 ++---
 8 files changed, 125 insertions(+), 102 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index b87becacd9d3..15db41272ff2 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -46,63 +46,50 @@ static inline unsigned int __inet6_ehashfn(const u32 lhash,
  */
 struct sock *__inet6_lookup_established(struct net *net,
 					struct inet_hashinfo *hashinfo,
-					const struct in6_addr *saddr,
-					const __be16 sport,
-					const struct in6_addr *daddr,
-					const u16 hnum, const int dif);
+					const struct sk_lookup *params);

 struct sock *inet6_lookup_listener(struct net *net,
 				   struct inet_hashinfo *hashinfo,
 				   struct sk_buff *skb, int doff,
-				   const struct in6_addr *saddr,
-				   const __be16 sport,
-				   const struct in6_addr *daddr,
-				   const unsigned short hnum, const int dif);
+				   struct sk_lookup *params);

 static inline struct sock *__inet6_lookup(struct net *net,
 					  struct inet_hashinfo *hashinfo,
 					  struct sk_buff *skb, int doff,
-					  const struct in6_addr *saddr,
-					  const __be16 sport,
-					  const struct in6_addr *daddr,
-					  const u16 hnum,
-					  const int dif,
+					  struct sk_lookup *params,
 					  bool *refcounted)
 {
-	struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
-						     sport, daddr, hnum, dif);
+	struct sock *sk = __inet6_lookup_established(net, hashinfo, params);
+
 	*refcounted = true;
 	if (sk)
 		return sk;
 	*refcounted = false;
-	return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
-				     daddr, hnum, dif);
+	return inet6_lookup_listener(net, hashinfo, skb, doff, params);
 }

 static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      struct sk_buff *skb, int doff,
-					      const __be16 sport,
-					      const __be16 dport,
-					      int iif,
+					      struct sk_lookup *params,
 					      bool *refcounted)
 {
 	struct sock *sk = skb_steal_sock(skb);

+	params->saddr.ipv6 = &ipv6_hdr(skb)->saddr,
+	params->daddr.ipv6 = &ipv6_hdr(skb)->daddr,
+	params->hnum = ntohs(params->dport),
+
 	*refcounted = true;
 	if (sk)
 		return sk;

 	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
-			      doff, &ipv6_hdr(skb)->saddr, sport,
-			      &ipv6_hdr(skb)->daddr, ntohs(dport),
-			      iif, refcounted);
+			      doff, params, refcounted);
 }

 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
 			  struct sk_buff *skb, int doff,
-			  const struct in6_addr *saddr, const __be16 sport,
-			  const struct in6_addr *daddr, const __be16 dport,
-			  const int dif);
+			  struct sk_lookup *params);

 int inet6_hash(struct sock *sk);
 #endif /* IS_ENABLED(CONFIG_IPV6) */
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c376af5bfdfb..e92f10a832dd 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -70,6 +70,11 @@ static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			u8 type, u8 code, int offset, __be32 info)
 {
 	const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+	struct
[PATCH net-next 03/10] net: ipv4: Convert inet socket lookups to new struct
Convert the various inet_lookup functions to use the new sk_lookup struct.

Signed-off-by: David Ahern
---
 include/net/inet_hashtables.h       | 57 ++
 net/dccp/ipv4.c                     | 19 +---
 net/ipv4/inet_diag.c                | 33 ++--
 net/ipv4/inet_hashtables.c          | 48 +++-
 net/ipv4/netfilter/nf_socket_ipv4.c |  5 ++-
 net/ipv4/tcp_ipv4.c                 | 62 +++--
 net/ipv4/udp_diag.c                 |  3 ++
 net/netfilter/xt_TPROXY.c           | 10 +++---
 8 files changed, 142 insertions(+), 95 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 5026b1f08bb8..c5f4dc3c06e4 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -218,19 +218,16 @@ void inet_unhash(struct sock *sk);
 struct sock *__inet_lookup_listener(struct net *net,
 				    struct inet_hashinfo *hashinfo,
 				    struct sk_buff *skb, int doff,
-				    const __be32 saddr, const __be16 sport,
-				    const __be32 daddr,
-				    const unsigned short hnum,
-				    const int dif);
+				    struct sk_lookup *params);
 
 static inline struct sock *inet_lookup_listener(struct net *net,
 		struct inet_hashinfo *hashinfo,
 		struct sk_buff *skb, int doff,
-		__be32 saddr, __be16 sport,
-		__be32 daddr, __be16 dport, int dif)
+		struct sk_lookup *params)
 {
-	return __inet_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
-				      daddr, ntohs(dport), dif);
+	params->hnum = ntohs(params->dport);
+
+	return __inet_lookup_listener(net, hashinfo, skb, doff, params);
 }
 
 /* Socket demux engine toys. */
@@ -286,53 +283,44 @@ static inline struct sock *inet_lookup_listener(struct net *net,
  */
 struct sock *__inet_lookup_established(struct net *net,
 				       struct inet_hashinfo *hashinfo,
-				       const __be32 saddr, const __be16 sport,
-				       const __be32 daddr, const u16 hnum,
-				       const int dif);
+				       const struct sk_lookup *params);
 
 static inline struct sock *
 	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
-				const __be32 saddr, const __be16 sport,
-				const __be32 daddr, const __be16 dport,
-				const int dif)
+				struct sk_lookup *params)
 {
-	return __inet_lookup_established(net, hashinfo, saddr, sport, daddr,
-					 ntohs(dport), dif);
+	params->hnum = ntohs(params->dport);
+
+	return __inet_lookup_established(net, hashinfo, params);
 }
 
 static inline struct sock *__inet_lookup(struct net *net,
 					 struct inet_hashinfo *hashinfo,
 					 struct sk_buff *skb, int doff,
-					 const __be32 saddr, const __be16 sport,
-					 const __be32 daddr, const __be16 dport,
-					 const int dif,
+					 struct sk_lookup *params,
 					 bool *refcounted)
 {
-	u16 hnum = ntohs(dport);
 	struct sock *sk;
 
-	sk = __inet_lookup_established(net, hashinfo, saddr, sport,
-				       daddr, hnum, dif);
+	params->hnum = ntohs(params->dport);
+
+	sk = __inet_lookup_established(net, hashinfo, params);
 	*refcounted = true;
 	if (sk)
 		return sk;
 	*refcounted = false;
-	return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
-				      sport, daddr, hnum, dif);
+	return __inet_lookup_listener(net, hashinfo, skb, doff, params);
 }
 
 static inline struct sock *inet_lookup(struct net *net,
 				       struct inet_hashinfo *hashinfo,
 				       struct sk_buff *skb, int doff,
-				       const __be32 saddr, const __be16 sport,
-				       const __be32 daddr, const __be16 dport,
-				       const int dif)
+				       struct sk_lookup *params)
 {
 	struct sock *sk;
 	bool refcounted;
 
-	sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr,
-			   dport, dif, &refcounted);
+	sk = __inet_lookup(net, hashinfo, skb, doff,
[PATCH net-next 08/10] net: Add sdif to sk_lookup
Add a second device index, sdif, to the socket lookup struct. sdif will
be the device index for devices enslaved to an l3mdev. It allows the
lookups to consider the enslaved device as well as the L3 master device
when searching for a socket.

Signed-off-by: David Ahern
---
 include/net/sock.h | 12
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a2db5fd30192..c5d93a4bcd0a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -507,23 +507,27 @@ struct sk_lookup {
 	unsigned short hnum;
 
 	int dif;
+	int sdif;
 	bool exact_dif;
 };
 
-/* Compare sk_bound_dev_if to socket lookup dif
+/* Compare sk_bound_dev_if to socket lookup dif and sdif
  * Returns:
  *   -1   exact dif required and not met
  *    0   sk_bound_dev_if is either not set or does not match
- *    1   sk_bound_dev_if is set and matches dif
+ *    1   sk_bound_dev_if is set and matches dif or sdif
  */
 static inline int sk_lookup_device_cmp(const struct sock *sk,
 				       const struct sk_lookup *params)
 {
+	bool dev_match = (sk->sk_bound_dev_if == params->dif ||
+			  sk->sk_bound_dev_if == params->sdif);
+
 	/* exact_dif true == l3mdev case */
-	if (params->exact_dif && sk->sk_bound_dev_if != params->dif)
+	if (params->exact_dif && !dev_match)
 		return -1;
 
-	if (sk->sk_bound_dev_if && sk->sk_bound_dev_if == params->dif)
+	if (sk->sk_bound_dev_if && dev_match)
 		return 1;
 
 	return 0;
-- 
2.1.4
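For illustration only, here is a minimal userspace sketch of the comparison after sdif is added. The struct and function names ending in _sketch are hypothetical stand-ins that carry only the fields the helper reads; they are not the kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct sock: only the bound device index. */
struct sock_sketch {
	int sk_bound_dev_if;	/* 0 means "not bound to a device" */
};

/* Hypothetical stand-in for the device part of struct sk_lookup. */
struct sk_lookup_sketch {
	int dif;		/* first device index of the lookup */
	int sdif;		/* second device index added by this patch */
	bool exact_dif;		/* true in the l3mdev case */
};

/* Same decision order as the patched sk_lookup_device_cmp():
 * -1 exact match required and not met, 1 bound and matching, 0 otherwise.
 */
static int device_cmp_sdif_sketch(const struct sock_sketch *sk,
				  const struct sk_lookup_sketch *params)
{
	bool dev_match = (sk->sk_bound_dev_if == params->dif ||
			  sk->sk_bound_dev_if == params->sdif);

	if (params->exact_dif && !dev_match)
		return -1;

	if (sk->sk_bound_dev_if && dev_match)
		return 1;

	return 0;
}
```

With dif = 10 and sdif = 4, a socket bound to either index scores as a match. One property of the sketch worth noting: if sdif is left at 0, an unbound socket compares equal to sdif, so the exact_dif rejection does not trigger for it; whether that matters in the real series depends on how callers fill in sdif.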
[PATCH net-next 02/10] net: ipv4: Convert udp socket lookups to new struct
Convert udp4_lib_lookup and __udp4_lib_lookup to use the new sk_lookup struct.

Signed-off-by: David Ahern
---
 include/net/udp.h                   |   6 +-
 net/ipv4/netfilter/nf_socket_ipv4.c |  11 ++-
 net/ipv4/udp.c                      | 170 +++-
 net/ipv4/udp_diag.c                 |  51 +++
 net/netfilter/xt_TPROXY.c           |  11 ++-
 5 files changed, 144 insertions(+), 105 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 972ce4baab6b..5e0ff095dc6d 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -283,10 +283,8 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		       char __user *optval, unsigned int optlen,
 		       int (*push_pending_frames)(struct sock *));
-struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
-			     __be32 daddr, __be16 dport, int dif);
-struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
-			       __be32 daddr, __be16 dport, int dif,
+struct sock *udp4_lib_lookup(struct net *net, struct sk_lookup *params);
+struct sock *__udp4_lib_lookup(struct net *net, struct sk_lookup *params,
 			       struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
diff --git a/net/ipv4/netfilter/nf_socket_ipv4.c b/net/ipv4/netfilter/nf_socket_ipv4.c
index e9293bdebba0..121767b36763 100644
--- a/net/ipv4/netfilter/nf_socket_ipv4.c
+++ b/net/ipv4/netfilter/nf_socket_ipv4.c
@@ -81,14 +81,21 @@ nf_socket_get_sock_v4(struct net *net, struct sk_buff *skb, const int doff,
 		      const __be16 sport, const __be16 dport,
 		      const struct net_device *in)
 {
+	struct sk_lookup params = {
+		.saddr.ipv4 = saddr,
+		.daddr.ipv4 = daddr,
+		.sport = sport,
+		.dport = dport,
+		.dif = in->ifindex,
+	};
+
 	switch (protocol) {
 	case IPPROTO_TCP:
 		return inet_lookup(net, &tcp_hashinfo, skb, doff,
 				   saddr, sport, daddr, dport,
 				   in->ifindex);
 	case IPPROTO_UDP:
-		return udp4_lib_lookup(net, saddr, sport, daddr, dport,
-				       in->ifindex);
+		return udp4_lib_lookup(net, &params);
 	}
 	return NULL;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b057653ceca9..132a8f070d16 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -379,15 +379,13 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 }
 
 static int compute_score(struct sock *sk, struct net *net,
-			 __be32 saddr, __be16 sport,
-			 __be32 daddr, unsigned short hnum, int dif,
-			 bool exact_dif)
+			 const struct sk_lookup *params)
 {
-	int score;
 	struct inet_sock *inet;
+	int score, rc;
 
 	if (!net_eq(sock_net(sk), net) ||
-	    udp_sk(sk)->udp_port_hash != hnum ||
+	    udp_sk(sk)->udp_port_hash != params->hnum ||
 	    ipv6_only_sock(sk))
 		return -1;
 
@@ -395,28 +393,28 @@ static int compute_score(struct sock *sk, struct net *net,
 	inet = inet_sk(sk);
 
 	if (inet->inet_rcv_saddr) {
-		if (inet->inet_rcv_saddr != daddr)
+		if (inet->inet_rcv_saddr != params->daddr.ipv4)
 			return -1;
 		score += 4;
 	}
 
 	if (inet->inet_daddr) {
-		if (inet->inet_daddr != saddr)
+		if (inet->inet_daddr != params->saddr.ipv4)
 			return -1;
 		score += 4;
 	}
 
 	if (inet->inet_dport) {
-		if (inet->inet_dport != sport)
+		if (inet->inet_dport != params->sport)
 			return -1;
 		score += 4;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		if (sk->sk_bound_dev_if != dif)
-			return -1;
+	rc = sk_lookup_device_cmp(sk, params);
+	if (rc < 0)
+		return -1;
+	if (rc > 0)
 		score += 4;
-	}
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
 	return score;
@@ -436,10 +434,9 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
 
 /* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
-				     __be32 saddr, __be16 sport,
-				     __be32 daddr, unsigned int hnum, int dif, bool exact_dif,
-				     struct udp_hslot *hslot2,
-				     struct sk_buff *skb)
+				     const struct sk_lookup *params,
+
[PATCH net-next 01/10] net: Add sk_lookup struct and helper
Consolidate the socket lookup args into a struct. Add helper that
compares sk_bound_dev_if for a socket to the lookup parameters.

Signed-off-by: David Ahern
---
 include/net/sock.h | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7c0632c7e870..a2db5fd30192 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -491,6 +491,44 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
+/* used for socket lookups */
+struct sk_lookup {
+	union {
+		const struct in6_addr *ipv6;
+		__be32 ipv4;
+	} saddr;
+	union {
+		const struct in6_addr *ipv6;
+		__be32 ipv4;
+	} daddr;
+
+	__be16 sport;
+	__be16 dport;
+	unsigned short hnum;
+
+	int dif;
+	bool exact_dif;
+};
+
+/* Compare sk_bound_dev_if to socket lookup dif
+ * Returns:
+ *   -1   exact dif required and not met
+ *    0   sk_bound_dev_if is either not set or does not match
+ *    1   sk_bound_dev_if is set and matches dif
+ */
+static inline int sk_lookup_device_cmp(const struct sock *sk,
+				       const struct sk_lookup *params)
+{
+	/* exact_dif true == l3mdev case */
+	if (params->exact_dif && sk->sk_bound_dev_if != params->dif)
+		return -1;
+
+	if (sk->sk_bound_dev_if && sk->sk_bound_dev_if == params->dif)
+		return 1;
+
+	return 0;
+}
+
 /*
  * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
  * or not whether his port will be reused by someone else. SK_FORCE_REUSE
-- 
2.1.4
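To make the three-way return convention concrete, here is a small userspace sketch, not kernel code. The _sketch types are hypothetical and carry only the fields the helper reads, and score_device_sketch() is an assumed illustration of how a scoring function such as compute_score() in the UDP conversion might consume the result.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct sock: only the bound device index. */
struct sock_sketch {
	int sk_bound_dev_if;	/* 0 means "not bound to a device" */
};

/* Hypothetical stand-in for the device part of struct sk_lookup. */
struct sk_lookup_sketch {
	int dif;		/* device index of the lookup */
	bool exact_dif;		/* true in the l3mdev case */
};

/* Same decision order as sk_lookup_device_cmp() in this patch:
 * -1 exact match required and not met, 1 bound and matching, 0 otherwise.
 */
static int device_cmp_sketch(const struct sock_sketch *sk,
			     const struct sk_lookup_sketch *params)
{
	if (params->exact_dif && sk->sk_bound_dev_if != params->dif)
		return -1;

	if (sk->sk_bound_dev_if && sk->sk_bound_dev_if == params->dif)
		return 1;

	return 0;
}

/* Assumed consumer: reject the socket on -1, add a bonus of 4 on 1,
 * and leave the score untouched on 0.
 */
static int score_device_sketch(const struct sock_sketch *sk,
			       const struct sk_lookup_sketch *params,
			       int base_score)
{
	int rc = device_cmp_sketch(sk, params);

	if (rc < 0)
		return -1;
	return rc > 0 ? base_score + 4 : base_score;
}
```

The point of returning three values rather than a boolean is that callers can tell "reject this socket" apart from "acceptable, but no device bonus".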
[PATCH net-next 05/10] net: ipv6: Convert udp socket lookups to new struct
Convert udp6_lib_lookup and __udp6_lib_lookup to use the new sk_lookup struct.

Signed-off-by: David Ahern
---
 include/net/udp.h                   |  12 +--
 net/ipv4/udp_diag.c                 |  33 ---
 net/ipv6/netfilter/nf_socket_ipv6.c |  11 ++-
 net/ipv6/udp.c                      | 177 +++-
 net/netfilter/xt_TPROXY.c           |  10 +-
 5 files changed, 135 insertions(+), 108 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 5e0ff095dc6d..c5a75e9422c6 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -288,15 +288,9 @@ struct sock *__udp4_lib_lookup(struct net *net, struct sk_lookup *params,
 			       struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
-struct sock *udp6_lib_lookup(struct net *net,
-			     const struct in6_addr *saddr, __be16 sport,
-			     const struct in6_addr *daddr, __be16 dport,
-			     int dif);
-struct sock *__udp6_lib_lookup(struct net *net,
-			       const struct in6_addr *saddr, __be16 sport,
-			       const struct in6_addr *daddr, __be16 dport,
-			       int dif, struct udp_table *tbl,
-			       struct sk_buff *skb);
+struct sock *udp6_lib_lookup(struct net *net, struct sk_lookup *params);
+struct sock *__udp6_lib_lookup(struct net *net, struct sk_lookup *params,
+			       struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
 
diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
index d7f6af42ebcc..10738c10c5ae 100644
--- a/net/ipv4/udp_diag.c
+++ b/net/ipv4/udp_diag.c
@@ -54,13 +54,17 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
 		sk = __udp4_lib_lookup(net, &params, tbl, NULL);
 	}
 #if IS_ENABLED(CONFIG_IPV6)
-	else if (req->sdiag_family == AF_INET6)
-		sk = __udp6_lib_lookup(net,
-				(struct in6_addr *)req->id.idiag_src,
-				req->id.idiag_sport,
-				(struct in6_addr *)req->id.idiag_dst,
-				req->id.idiag_dport,
-				req->id.idiag_if, tbl, NULL);
+	else if (req->sdiag_family == AF_INET6) {
+		struct sk_lookup params = {
+			.saddr.ipv6 = (struct in6_addr *)req->id.idiag_src,
+			.daddr.ipv6 = (struct in6_addr *)req->id.idiag_dst,
+			.sport = req->id.idiag_sport,
+			.dport = req->id.idiag_dport,
+			.dif = req->id.idiag_if,
+		};
+
+		sk = __udp6_lib_lookup(net, &params, tbl, NULL);
+	}
 #endif
 	if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
 		sk = NULL;
@@ -212,12 +216,15 @@ static int __udp_diag_destroy(struct sk_buff *in_skb,
 			sk = __udp4_lib_lookup(net, &params, tbl, NULL);
 		}
 		else {
-			sk = __udp6_lib_lookup(net,
-					(struct in6_addr *)req->id.idiag_dst,
-					req->id.idiag_dport,
-					(struct in6_addr *)req->id.idiag_src,
-					req->id.idiag_sport,
-					req->id.idiag_if, tbl, NULL);
+			struct sk_lookup params = {
+				.saddr.ipv6 = (struct in6_addr *)req->id.idiag_dst,
+				.daddr.ipv6 = (struct in6_addr *)req->id.idiag_src,
+				.sport = req->id.idiag_dport,
+				.dport = req->id.idiag_sport,
+				.dif = req->id.idiag_if,
+			};
+
+			sk = __udp6_lib_lookup(net, &params, tbl, NULL);
 		}
 	}
 #endif
diff --git a/net/ipv6/netfilter/nf_socket_ipv6.c b/net/ipv6/netfilter/nf_socket_ipv6.c
index ebb2bf84232a..c1c193103063 100644
--- a/net/ipv6/netfilter/nf_socket_ipv6.c
+++ b/net/ipv6/netfilter/nf_socket_ipv6.c
@@ -86,14 +86,21 @@ nf_socket_get_sock_v6(struct net *net, struct sk_buff *skb, int doff,
 		      const __be16 sport, const __be16 dport,
 		      const struct net_device *in)
 {
+	struct sk_lookup params = {
+		.saddr.ipv6 = saddr,
+		.daddr.ipv6 = daddr,
+		.sport = sport,
+		.dport = dport,
+		.dif = in->ifindex,
+	};
+
 	switch (protocol) {
 	case IPPROTO_TCP:
 		return inet6_lookup(net, &tcp_hashinfo, skb, doff,
Re: [PATCH V3 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission
On Mon, Jul 31, 2017 at 11:49 AM, Neal Cardwell wrote:
> On Sun, Jul 30, 2017 at 11:29 PM, maowenan wrote:
>> [Mao Wenan]please refer to the attachment, test.pkt is packetdrill script.
>> In test.pcap, packet number 17 is the TLP probe, packet number 218 is the
>> retransmission packet because client don't send data packet to server.
>> From the capture time, there are about 6 seconds the retransmission
>> packet can be sent, and this time can be added more as long as client
>> send data packet continually.
>> I have reproduced this issue in Linux 4.13-rc3, 3.10, 4.1. Please check the
>> timing
>> When you use test.pkt to reproduce in your environment.
>
> Thank you for your very nice packetdrill test case illustrating this
> problem! And thanks for verifying that the problem shows up in those
> kernel versions.
>
> We are able to run the script in our environment and both verify that
> the bug is the one we hypothesized, and verify our proposed patch
> fixes it (the RTO for the TLP happens 221ms after the TLP, instead of
> ~5 secs later). We will send out our proposed patches ASAP.

The timer patches are upstream for review for the "net" branch:

  https://patchwork.ozlabs.org/patch/796057/
  https://patchwork.ozlabs.org/patch/796058/
  https://patchwork.ozlabs.org/patch/796059/

Again, thank you for reporting this and providing a packetdrill script
to reproduce this!

neal
[PATCH net 2/3] tcp: enable xmit timer fix by having TLP use time when RTO should fire
Have tcp_schedule_loss_probe() base the TLP scheduling decision on when
the RTO *should* fire. This is to enable the upcoming xmit timer fix in
this series, where tcp_schedule_loss_probe() cannot assume that the
last timer installed was an RTO timer (because we are no longer doing
the "rearm RTO, rearm RTO, rearm TLP" dance on every ACK). So
tcp_schedule_loss_probe() must independently figure out when an RTO
would want to fire.

In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the nature
of icsk_timeout makes things a little easier to reason about anyway.

Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".

Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
---
 net/ipv4/tcp_output.c | 12
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2f1588bf73da..0ae6b5d176c0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2377,8 +2377,8 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 timeout, tlp_time_stamp, rto_time_stamp;
 	u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
+	u32 timeout, rto_delta_us;
 
 	/* No consecutive loss probes. */
 	if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
@@ -2418,13 +2418,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
 	timeout = max_t(u32, timeout, msecs_to_jiffies(10));
 
 	/* If RTO is shorter, just schedule TLP in its place. */
-	tlp_time_stamp = tcp_jiffies32 + timeout;
-	rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
-	if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
-		s32 delta = rto_time_stamp - tcp_jiffies32;
-		if (delta > 0)
-			timeout = delta;
-	}
+	rto_delta_us = tcp_rto_delta_us(sk);  /* How far in future is RTO? */
+	if (rto_delta_us > 0)
+		timeout = min_t(u32, timeout, usecs_to_jiffies(rto_delta_us));
 
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
 				  TCP_RTO_MAX);
-- 
2.14.0.rc0.400.g1c36432dff-goog
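The rule quoted in the commit message ("if the RTO is in the future, but at an earlier time than the normal TLP time, then set the TLP timer to fire when the RTO would have fired") can be sketched in userspace as below. This is an illustration under stated assumptions, not the kernel function: tlp_timeout_sketch() is a hypothetical name, both arguments are taken to be in the same time unit, and the delta is kept signed so that the "RTO already due" case is explicit.

```c
#include <stdint.h>

/* Pick the timeout for the loss probe timer.
 *
 * tlp_timeout: the normal TLP timeout (any unit, e.g. milliseconds)
 * rto_delta:   how far in the future the RTO would fire, in the same
 *              unit; zero or negative means the RTO is not in the future
 */
static uint32_t tlp_timeout_sketch(uint32_t tlp_timeout, int64_t rto_delta)
{
	/* Only an RTO that is still in the future can shorten the probe. */
	if (rto_delta > 0 && (uint64_t)rto_delta < tlp_timeout)
		return (uint32_t)rto_delta;

	return tlp_timeout;
}
```

For example, with a normal TLP timeout of 20 and an RTO due in 15, the probe is scheduled at 15; with an RTO due in 50, or one that is already overdue, the probe keeps its normal timeout of 20.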
[PATCH net 1/3] tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)

Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
---
 include/net/tcp.h    | 10 ++
 net/ipv4/tcp_input.c |  5 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70483296157f..ada65e767b28 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1916,6 +1916,16 @@ extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 			     u64 xmit_time);
 extern void tcp_rack_reo_timeout(struct sock *sk);
 
+/* At how many usecs into the future should the RTO fire? */
+static inline s64 tcp_rto_delta_us(const struct sock *sk)
+{
+	const struct sk_buff *skb = tcp_write_queue_head(sk);
+	u32 rto = inet_csk(sk)->icsk_rto;
+	u64 rto_time_stamp_us = skb->skb_mstamp + jiffies_to_usecs(rto);
+
+	return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
+}
+
 /*
  * Save and compile IPv4 options, return a pointer to it
  */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2920e0cb09f8..345febf0a46e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3004,10 +3004,7 @@ void tcp_rearm_rto(struct sock *sk)
 		/* Offset the time elapsed after installing regular RTO */
 		if (icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
 		    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
-			struct sk_buff *skb = tcp_write_queue_head(sk);
-			u64 rto_time_stamp = skb->skb_mstamp +
-					     jiffies_to_usecs(rto);
-			s64 delta_us = rto_time_stamp - tp->tcp_mstamp;
+			s64 delta_us = tcp_rto_delta_us(sk);
 			/* delta_us may not be positive if the socket is locked
 			 * when the retrans timer fires and is rescheduled.
 			 */
-- 
2.14.0.rc0.400.g1c36432dff-goog
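As a worked illustration of the arithmetic in the helper, the following userspace sketch computes the same kind of delta from plain microsecond timestamps. rto_delta_us_sketch() and its parameters are hypothetical; the kernel helper reads these values from the socket and the head of the write queue, and converts the RTO from jiffies.

```c
#include <stdint.h>

/* How many microseconds into the future should the RTO fire?
 *
 * head_sent_us: send timestamp of the oldest unacked segment
 * rto_us:       current retransmission timeout, already in usecs
 * now_us:       current time
 *
 * The result is signed: a negative value means the RTO deadline has
 * already passed.
 */
static int64_t rto_delta_us_sketch(uint64_t head_sent_us, uint32_t rto_us,
				   uint64_t now_us)
{
	uint64_t rto_time_stamp_us = head_sent_us + rto_us;

	return (int64_t)(rto_time_stamp_us - now_us);
}
```

With a head segment sent at t = 1,000,000 us and an RTO of 200,000 us, the deadline is 1,200,000 us. At now = 1,050,000 us the delta is 150,000 us; at now = 1,300,000 us it is -100,000 us, which is why callers have to handle a non-positive result.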
[PATCH net 0/3] tcp: fix xmit timer rearming to avoid stalls
This patch series is a bug fix for a TCP loss recovery performance bug
reported independently in recent netdev threads:

 (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
 (ii) July 26, 2017: netdev thread:
      "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
      outstanding TLP retransmission"

Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
traces, and packetdrill test cases, which enabled us to root-cause
this issue and verify the fix.

Neal Cardwell (3):
  tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
  tcp: enable xmit timer fix by having TLP use time when RTO should fire
  tcp: fix xmit timer to only be reset if data ACKed/SACKed

 include/net/tcp.h     | 10 ++
 net/ipv4/tcp_input.c  | 30 +-
 net/ipv4/tcp_output.c | 21 -
 3 files changed, 31 insertions(+), 30 deletions(-)

-- 
2.14.0.rc0.400.g1c36432dff-goog
[PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed
Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:

 (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
 (ii) July 26, 2017: netdev thread:
      "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
      outstanding TLP retransmission"

The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:

 - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
   could cause TCP to repeatedly schedule TLPs forever. We kept
   sending TLPs after every ~200ms, which elicited bogus SACKs, which
   caused more TLPs, ad infinitum; we never fired an RTO to fill in
   the holes.

 - Incoming data segments could, in some cases, cause us to reschedule
   our RTO or TLP timer further out in time, for no good reason. This
   could cause repeated inbound data to result in stalls in outbound
   data, in the presence of packet loss.

This commit fixes these bugs by changing the TLP and RTO ACK
processing to:

 (a) Only reschedule the xmit timer once per ACK.

 (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
     ACK indicates sufficient forward progress (a packet was
     cumulatively ACKed, or we got a SACK for a packet that was sent
     before the most recent retransmit of the write queue head).

This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.

As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:

1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
        if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
               tcp_rearm_rto(sk);

2) Once in tcp_clean_rtx_queue(), to update the RTO:
        if (flag & FLAG_ACKED) {
               tcp_rearm_rto(sk);

3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
   to TLP:
        if (icsk->icsk_pending == ICSK_TIME_RETRANS)
               tcp_schedule_loss_probe(sk);

This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.

This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.

This patch is a bug fix patch intended to be queued for -stable
releases.

Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen
Reported-by: Mao Wenan
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
---
 net/ipv4/tcp_input.c  | 25 -
 net/ipv4/tcp_output.c |  9 -
 2 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 345febf0a46e..3e777cfbba56 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -107,6 +107,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 #define FLAG_ORIG_SACK_ACKED	0x200 /* Never retransmitted data are (s)acked	*/
 #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
 #define FLAG_DSACKING_ACK	0x800 /* SACK blocks contained D-SACK info */
+#define FLAG_SET_XMIT_TIMER	0x1000 /* Set TLP or RTO timer */
 #define FLAG_SACK_RENEGING	0x2000 /* snd_una advanced to a sacked seq */
 #define FLAG_UPDATE_TS_RECENT	0x4000 /* tcp_replace_ts_recent() */
 #define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_send_challenge_ack()	*/
@@ -3016,6 +3017,13 @@ void tcp_rearm_rto(struct sock *sk)
 	}
 }
 
+/* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
+static void tcp_set_xmit_timer(struct sock *sk)
+{
+	if (!tcp_schedule_loss_probe(sk))
+		tcp_rearm_rto(sk);
+}
+
 /* If we get here, the whole TSO packet has not been acked. */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
@@ -3177,7 +3185,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 					 ca_rtt_us, sack->rate);
 
 	if (flag & FLAG_ACKED) {
-		tcp_rearm_rto(sk);
+		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
 		if (unlikely(icsk->icsk_mtup.probe_size &&
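The "once per ACK, and only on forward progress" idea from the commit message can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel code: every name ending in _sketch is hypothetical, and the place where the ACK path consumes the flag is assumed here because that part of the diff is cut off above.

```c
#include <stdbool.h>

#define FLAG_SET_XMIT_TIMER_SKETCH 0x1000	/* "arm TLP or RTO" request */

/* Hypothetical per-connection state for the model. */
struct conn_sketch {
	int timer_resets;	/* how many times the xmit timer was re-armed */
	bool tlp_possible;	/* stand-in for "a loss probe can be scheduled" */
	bool tlp_armed;
	bool rto_armed;
};

/* Model of the retransmit-queue cleanup step: it does not touch the
 * timer itself, it only records that the timer should be set later.
 */
static int clean_rtx_queue_sketch(bool acked_new_data)
{
	return acked_new_data ? FLAG_SET_XMIT_TIMER_SKETCH : 0;
}

/* Try a loss probe first; if that is not possible, fall back to the RTO. */
static void set_xmit_timer_sketch(struct conn_sketch *c)
{
	c->timer_resets++;
	c->tlp_armed = c->tlp_possible;
	c->rto_armed = !c->tlp_possible;
}

/* Model of ACK processing: the timer is re-armed at most once, and only
 * when the cleanup step reported forward progress (assumed call site).
 */
static void ack_sketch(struct conn_sketch *c, bool acked_new_data)
{
	int flag = 0;

	flag |= clean_rtx_queue_sketch(acked_new_data);

	/* ... the rest of ACK processing would run here ... */

	if (flag & FLAG_SET_XMIT_TIMER_SKETCH)
		set_xmit_timer_sketch(c);
}
```

In this model an ACK that acknowledges nothing new leaves the timer alone, which is the behaviour the commit message describes for packets that do not indicate forward progress.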
Re: [RFC net-next] net ipv6: convert fib6_table rwlock to a percpu lock
On Mon, Jul 31, 2017 at 04:10:07PM -0700, Stephen Hemminger wrote:
> On Mon, 31 Jul 2017 10:18:57 -0700
> Shaohua Li wrote:
>
> > From: Shaohua Li
> >
> > In a syn flooding test, the fib6_table rwlock is a significant
> > bottleneck. While converting the rwlock to rcu sounds straightforward,
> > but is very challenging if it's possible. A percpu spinlock is quite
> > trivial for this problem since updating the routing table is a rare
> > event. In my test, the server receives around 1.5 Mpps in syn flooding
> > test without the patch in a dual sockets and 56-CPU system. With the
> > patch, the server receives around 3.8Mpps, and perf report doesn't show
> > the locking issue.
> >
> > Cc: Wei Wang
>
> You just reinvented brlock...

You mean lglock? It has been removed from the kernel.

> RCU is not that hard, why not do it right?

Maybe. But I don't think that is a reason not to do the percpu lock now.
This is a simple change; if someone finds a way to do it with RCU, we can
easily remove this.
Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT
On Tue, Aug 1, 2017 at 2:01 PM, David Ahern wrote:
> On 7/31/17 7:40 PM, Xin Long wrote:
>> To respect the old code more, setting RTPROT_RA only when
>> it's with the flag (ADDRCONF | DEFAULT | ROUTEINFO),
>> shouldn't it be:
>
> Look at rtm_fill_info:
>
>         if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
>                 rtm->rtm_protocol = RTPROT_RA;
>
>
> If either flag is set the protocol should be RTPROT_RA and looking at
> both places that seems correct to me.
>
ok, right
Re: [PATCH V2 net-next 0/2] liquidio: Add support for managing liquidio adapter
On Mon, Jul 31, 2017 at 05:59:37PM -0700, David Miller wrote:
> From: Simon Horman
> Date: Sun, 30 Jul 2017 22:21:04 +0200
>
> > On Fri, Jul 28, 2017 at 11:17:07PM -0700, Felix Manlunas wrote:
> >> From: Veerasenareddy Burru
> >>
> >> The LiquidIO adapter has processor cores that can run Linux. This patch
> >> set adds support to create a virtual Ethernet interface on host to
> >> communicate with applications running on Linux in the LiquidIO adapter.
> >> The virtual Ethernet interface also provides login access to Linux on
> >> LiquidIO through ssh for management and debugging.
> >
> > As per the somewhat more detailed feedback provided by my colleague Jakub
> > Kicinski to v1 of this patchset[1] I am concerned that this patchset breaks
> > down
> > the long standing practice of not granting direct access to firmware from
> > userspace.
> >
> > [1] https://www.spinics.net/lists/netdev/msg444929.html
>
> Agreed, I've seen no attempt to address this important feedback, which
> I agree with.

We posted a response to the original comment on Friday, 28 July, but for
some reason it did not make it to the mailing list outside the Cavium
domain. Our apologies; we did not double check before submitting the V2
patch.

Please find the response reposted on the original thread:
http://marc.info/?l=linux-netdev&m=150155273724386&w=2
Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT
On 7/31/17 7:40 PM, Xin Long wrote:
> To respect the old code more, setting RTPROT_RA only when
> it's with the flag (ADDRCONF | DEFAULT | ROUTEINFO),
> shouldn't it be:

Look at rtm_fill_info:

        if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
                rtm->rtm_protocol = RTPROT_RA;

If either flag is set the protocol should be RTPROT_RA and looking at
both places that seems correct to me.
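For readers following the thread, the flag-to-protocol mapping under discussion (the rt6_fill_node hunk quoted elsewhere in this thread) can be written as a small standalone function. This is only a sketch: the _S constants are illustrative stand-ins meant to mirror the uapi values and should not be taken as authoritative, and proto_from_flags_sketch() is a hypothetical name.

```c
#include <stdint.h>

/* Illustrative stand-ins for the route flags (assumed values). */
enum {
	RTF_DYNAMIC_S	= 0x0010,
	RTF_DEFAULT_S	= 0x00010000,
	RTF_ADDRCONF_S	= 0x00040000,
	RTF_ROUTEINFO_S	= 0x00800000,
};

/* Illustrative stand-ins for the routing protocol ids (assumed values). */
enum {
	RTPROT_REDIRECT_S = 1,
	RTPROT_KERNEL_S	  = 2,
	RTPROT_RA_S	  = 9,
};

/* Derive the protocol reported to userspace from the route flags,
 * falling back to the protocol stored on the route.
 */
static uint8_t proto_from_flags_sketch(uint32_t flags, uint8_t stored_proto)
{
	if (flags & RTF_DYNAMIC_S)
		return RTPROT_REDIRECT_S;

	if (flags & RTF_ADDRCONF_S) {
		/* Either DEFAULT or ROUTEINFO is enough for RTPROT_RA. */
		if (flags & (RTF_DEFAULT_S | RTF_ROUTEINFO_S))
			return RTPROT_RA_S;
		return RTPROT_KERNEL_S;
	}

	return stored_proto;
}
```

The inner test is an OR over the two flags, which is the point made above: an ADDRCONF route with either RTF_DEFAULT or RTF_ROUTEINFO set is reported as RTPROT_RA.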
Re: [PATCH net-next 2/2] liquidio: Add support to create management interface
On Tue, Jul 18, 2017 at 11:58:27AM -0700, Jakub Kicinski wrote:
> On Mon, 17 Jul 2017 12:52:17 -0700, Felix Manlunas wrote:
> > From: VSR Burru
> >
> > This patch adds support to create a virtual ethernet interface to
> > communicate with Linux on LiquidIO adapter for management.
> >
> > Signed-off-by: VSR Burru
> > Signed-off-by: Srinivasa Jampala
> > Signed-off-by: Satanand Burla
> > Signed-off-by: Raghu Vatsavayi
> > Signed-off-by: Felix Manlunas
>
> Not my call, but I have mixed feelings about this one. Is there any
> precedent under drivers/net/ethernet of exposing special communication
> channels with FW like this? It's irrelevant to me that you're running
> SSH, arbitrary communication with FW from userspace is not something
> netdev community usually accepts. And I'm afraid what the effects will
> be of this getting accepted. I'm pretty sure most modern network
> adapters have management CPU cores perfectly capable of running Linux.
> I know NFP does, here is the out-of-tree code equivalent to this patch:

LiquidIO is committed to ethtool and we are not trying to force users to
use this communication channel in place of ethtool. This communication
channel is for our field debug and information purposes and not for end
users. If most modern network adapters have management cores that can run
Linux, we could probably also think of finding a standard way to talk to
that Linux.

>
> https://github.com/Netronome/nfp-drv-kmods/blob/master/src/nfpcore/nfp_net_vnic.c
>
> I'm not looking forward to a world where I have to ssh into my NIC and
> run vendor commands to configure things.

We are not asking users to ssh into the card and run vendor commands.
Users of the LiquidIO card will continue to use ethtool for
configuration. This is for our field debugging, where we would like to
log in to the Linux on the card and check the status of its different
hardware blocks.
Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT
On Tue, Aug 1, 2017 at 12:12 PM, David Ahern wrote:
> On 7/30/17 9:31 PM, Xin Long wrote:
>>> Did you look at removing this hunk from rt6_fill_node:
>>>
>>>         if (rt->rt6i_flags & RTF_DYNAMIC)
>>>                 rtm->rtm_protocol = RTPROT_REDIRECT;
>>>         else if (rt->rt6i_flags & RTF_ADDRCONF) {
>>>                 if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
>>>                         rtm->rtm_protocol = RTPROT_RA;
>>>                 else
>>>                         rtm->rtm_protocol = RTPROT_KERNEL;
>>>         }
>> The issue seems to affect "ip -6 route flush all" as well, not only cache
>> since 'else if {}' also causes rtm proto being different from rt6 proto.
>>
>>>
>>> And have rtm_protocol set properly on the route when it is installed?
>> The codes not keeping rtm proto consistent with rt6 proto day 1,
>> any idea on why it didn't use rt6 proto in kernel properly?
>
> no, AFAIK it was just an oversight when the original code was written. I
> do not know of any reason that would prevent properly setting the
> rt6i_protocol in the route when it is allocated.

That's what I was worried about, it might break something, but double checked, should be fine.

>
> Something like this (not compiled, much less tested):

To respect the old code more, setting RTPROT_RA only when it's with the flag (ADDRCONF | DEFAULT | ROUTEINFO), shouldn't it be:

[...]
@@ -2464,6 +2465,7 @@ static struct rt6_info *rt6_add_route_info(struct net *net,
                .fc_nlinfo.portid = 0,
                .fc_nlinfo.nlh = NULL,
                .fc_nlinfo.nl_net = net,
+               .fc_protocol = RTPROT_KERNEL,
        };

        cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO,
@@ -2471,8 +2473,10 @@ static struct rt6_info *rt6_add_route_info(struct net *net,
        cfg.fc_gateway = *gwaddr;

        /* We should treat it as a default route if prefix length is 0. */
-       if (!prefixlen)
+       if (!prefixlen) {
+               cfg.fc_protocol = RTPROT_RA;
                cfg.fc_flags |= RTF_DEFAULT;
+       }

        ip6_route_add(, NULL);

@@ -2516,6 +2520,7 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
                .fc_nlinfo.portid = 0,
                .fc_nlinfo.nlh = NULL,
                .fc_nlinfo.nl_net = dev_net(dev),
+               .fc_protocol = RTPROT_KERNEL,
        };
[...]

or you changed it intentionally ?

I will do some testing before posting v2. thanks for your suggestion. :-)

>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4d30c96a819d..9a928839d247 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2347,6 +2347,7 @@ static void rt6_do_redirect(struct dst_entry *dst,
> struct sock *sk, struct sk_bu
>         if (!nrt)
>                 goto out;
>
> +       nrt->rt6i_protocol = RTPROT_REDIRECT;
>         nrt->rt6i_flags = RTF_GATEWAY|RTF_UP|RTF_DYNAMIC|RTF_CACHE;
>         if (on_link)
>                 nrt->rt6i_flags &= ~RTF_GATEWAY;
> @@ -2461,6 +2462,7 @@ static struct rt6_info *rt6_add_route_info(struct
> net *net,
>                 .fc_dst_len = prefixlen,
>                 .fc_flags = RTF_GATEWAY | RTF_ADDRCONF |
> RTF_ROUTEINFO |
>                             RTF_UP | RTF_PREF(pref),
> +               .fc_protocol= RTPROT_RA,
>                 .fc_nlinfo.portid = 0,
>                 .fc_nlinfo.nlh = NULL,
>                 .fc_nlinfo.nl_net = net,
> @@ -2513,6 +2515,7 @@ struct rt6_info *rt6_add_dflt_router(const struct
> in6_addr *gwaddr,
>                 .fc_ifindex = dev->ifindex,
>                 .fc_flags = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT |
>                             RTF_UP | RTF_EXPIRES | RTF_PREF(pref),
> +               .fc_protocol= RTPROT_RA,
>                 .fc_nlinfo.portid = 0,
>                 .fc_nlinfo.nlh = NULL,
>                 .fc_nlinfo.nl_net = dev_net(dev),
> @@ -3424,14 +3427,6 @@ static int rt6_fill_node(struct net *net,
>         rtm->rtm_flags = 0;
>         rtm->rtm_scope = RT_SCOPE_UNIVERSE;
>         rtm->rtm_protocol = rt->rt6i_protocol;
> -       if (rt->rt6i_flags & RTF_DYNAMIC)
> -               rtm->rtm_protocol = RTPROT_REDIRECT;
> -       else if (rt->rt6i_flags & RTF_ADDRCONF) {
> -               if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
> -                       rtm->rtm_protocol = RTPROT_RA;
> -               else
> -                       rtm->rtm_protocol = RTPROT_KERNEL;
> -       }
>
>         if (rt->rt6i_flags & RTF_CACHE)
>                 rtm->rtm_flags |= RTM_F_CLONED;
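
For readers following the thread, the hunk being discussed can be modelled in plain C. This is only a userspace sketch under stated assumptions: the DEMO_* constants and demo_rtm_protocol() are hypothetical stand-ins chosen for illustration, not the kernel's definitions. It shows why the protocol reported over netlink could differ from the one stored in the route: the flags override the stored value at dump time.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the route flags named in the thread. */
#define DEMO_RTF_DYNAMIC	0x0010u
#define DEMO_RTF_DEFAULT	0x00010000u
#define DEMO_RTF_ADDRCONF	0x00040000u
#define DEMO_RTF_ROUTEINFO	0x00800000u

/* Hypothetical stand-ins for the routing protocol identifiers. */
#define DEMO_RTPROT_REDIRECT	1
#define DEMO_RTPROT_KERNEL	2
#define DEMO_RTPROT_STATIC	4
#define DEMO_RTPROT_RA		9

/*
 * Models the dump-time override from the quoted rt6_fill_node hunk:
 * the stored protocol is reported only when neither RTF_DYNAMIC nor
 * RTF_ADDRCONF is set on the route.
 */
static uint8_t demo_rtm_protocol(uint32_t flags, uint8_t stored_proto)
{
	uint8_t proto = stored_proto;

	if (flags & DEMO_RTF_DYNAMIC) {
		proto = DEMO_RTPROT_REDIRECT;
	} else if (flags & DEMO_RTF_ADDRCONF) {
		if (flags & (DEMO_RTF_DEFAULT | DEMO_RTF_ROUTEINFO))
			proto = DEMO_RTPROT_RA;
		else
			proto = DEMO_RTPROT_KERNEL;
	}
	return proto;
}
```

With this model, a filter that matches on the stored protocol and one that matches on the dumped protocol can disagree for the same route, which is the inconsistency the proposed change removes by setting the protocol when the route is created.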
[PATCH v6 net-next] net: systemport: Support 64bit statistics
When using Broadcom Systemport device in 32bit Platform, ifconfig can only report up to 4G tx,rx status, which will be wrapped to 0 when the number of incoming or outgoing packets exceeds 4G, only taking around 2 hours in busy network environment (such as streaming). Therefore, it makes hard for network diagnostic tool to get reliable statistical result, so the patch is used to add 64bit support for Broadcom Systemport device in 32bit Platform. Signed-off-by: Jianming.qiao--- drivers/net/ethernet/broadcom/bcmsysport.c | 68 -- drivers/net/ethernet/broadcom/bcmsysport.h | 9 +++- 2 files changed, 52 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index 5333601..bb3cc7a 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -662,6 +662,7 @@ static int bcm_sysport_alloc_rx_bufs(struct bcm_sysport_priv *priv) static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv, unsigned int budget) { + struct bcm_sysport_stats *stats64 = >stats64; struct net_device *ndev = priv->netdev; unsigned int processed = 0, to_process; struct bcm_sysport_cb *cb; @@ -765,6 +766,10 @@ static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv, skb->protocol = eth_type_trans(skb, ndev); ndev->stats.rx_packets++; ndev->stats.rx_bytes += len; + u64_stats_update_begin(>syncp); + stats64->rx_packets++; + stats64->rx_bytes += len; + u64_stats_update_end(>syncp); napi_gro_receive(>napi, skb); next: @@ -787,17 +792,15 @@ static void bcm_sysport_tx_reclaim_one(struct bcm_sysport_tx_ring *ring, struct device *kdev = >pdev->dev; if (cb->skb) { - ring->bytes += cb->skb->len; *bytes_compl += cb->skb->len; dma_unmap_single(kdev, dma_unmap_addr(cb, dma_addr), dma_unmap_len(cb, dma_len), DMA_TO_DEVICE); - ring->packets++; (*pkts_compl)++; bcm_sysport_free_cb(cb); /* SKB fragment */ } else if (dma_unmap_addr(cb, dma_addr)) { - ring->bytes += 
dma_unmap_len(cb, dma_len); + *bytes_compl += dma_unmap_len(cb, dma_len); dma_unmap_page(kdev, dma_unmap_addr(cb, dma_addr), dma_unmap_len(cb, dma_len), DMA_TO_DEVICE); dma_unmap_addr_set(cb, dma_addr, 0); @@ -808,9 +811,10 @@ static void bcm_sysport_tx_reclaim_one(struct bcm_sysport_tx_ring *ring, static unsigned int __bcm_sysport_tx_reclaim(struct bcm_sysport_priv *priv, struct bcm_sysport_tx_ring *ring) { - struct net_device *ndev = priv->netdev; unsigned int c_index, last_c_index, last_tx_cn, num_tx_cbs; + struct bcm_sysport_stats *stats64 = >stats64; unsigned int pkts_compl = 0, bytes_compl = 0; + struct net_device *ndev = priv->netdev; struct bcm_sysport_cb *cb; u32 hw_ind; @@ -849,6 +853,11 @@ static unsigned int __bcm_sysport_tx_reclaim(struct bcm_sysport_priv *priv, last_c_index &= (num_tx_cbs - 1); } + u64_stats_update_begin(>syncp); + ring->packets += pkts_compl; + ring->bytes += bytes_compl; + u64_stats_update_end(>syncp); + ring->c_index = c_index; netif_dbg(priv, tx_done, ndev, @@ -1671,24 +1680,6 @@ static int bcm_sysport_change_mac(struct net_device *dev, void *p) return 0; } -static struct net_device_stats *bcm_sysport_get_nstats(struct net_device *dev) -{ - struct bcm_sysport_priv *priv = netdev_priv(dev); - unsigned long tx_bytes = 0, tx_packets = 0; - struct bcm_sysport_tx_ring *ring; - unsigned int q; - - for (q = 0; q < dev->num_tx_queues; q++) { - ring = >tx_rings[q]; - tx_bytes += ring->bytes; - tx_packets += ring->packets; - } - - dev->stats.tx_bytes = tx_bytes; - dev->stats.tx_packets = tx_packets; - return >stats; -} - static void bcm_sysport_netif_start(struct net_device *dev) { struct bcm_sysport_priv *priv = netdev_priv(dev); @@ -1923,6 +1914,37 @@ static int bcm_sysport_stop(struct net_device *dev) return 0; } +static void bcm_sysport_get_stats64(struct net_device *dev, + struct rtnl_link_stats64 *stats) +{ + struct bcm_sysport_priv *priv = netdev_priv(dev); + struct bcm_sysport_stats *stats64 = >stats64; + struct bcm_sysport_tx_ring 
*ring; + u64 tx_packets = 0, tx_bytes = 0; + unsigned int start; + unsigned int q; + +
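
The wrap-around that motivates the systemport patch above can be shown with a small userspace sketch. It assumes nothing about the driver itself: struct demo_stats and demo_account() are hypothetical names, and the sketch only contrasts a 32-bit counter with a 64-bit one fed the same traffic.

```c
#include <stdint.h>

/* Hypothetical counter pair: the same traffic accounted in 32 and 64 bits. */
struct demo_stats {
	uint32_t rx_bytes32;	/* behaves like an unsigned long on 32-bit */
	uint64_t rx_bytes64;	/* what a 64-bit statistic would hold */
};

/* Account one received frame of len bytes in both counters. */
static void demo_account(struct demo_stats *s, uint32_t len)
{
	s->rx_bytes32 += len;	/* wraps modulo 2^32 */
	s->rx_bytes64 += len;	/* keeps counting past 4G */
}
```

On a 32-bit machine a 64-bit counter cannot be read or written in one access, which is why the patch brackets its updates with u64_stats_update_begin() and u64_stats_update_end(); the sketch above leaves that synchronisation out.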
Re: [PATCH net] Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"
From: Florian Fainelli
Date: Mon, 31 Jul 2017 11:05:32 -0700

> This reverts commit 28b45910ccda ("net: bcmgenet: Remove init parameter
> from bcmgenet_mii_config") because in the process of moving from
> dev_info() to dev_info_once() we essentially lost the helpful printed
> messages once the second instance of the driver is loaded.
> dev_info_once() does not actually print the message once per device
> instance, but once period.
>
> Fixes: 28b45910ccda ("net: bcmgenet: Remove init parameter from
> bcmgenet_mii_config")
> Signed-off-by: Florian Fainelli

Applied, thanks Florian.
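
The "once period" behaviour described in the revert can be reproduced outside the kernel. The following is a minimal userspace sketch, assuming a simplified print_once() macro built on a static flag at the call site; struct fake_dev, print_once(), announce() and the emitted counter are hypothetical names, not kernel API.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a device instance; not a kernel structure. */
struct fake_dev {
	const char *name;
};

/* Number of messages actually emitted, kept so the effect can be checked. */
static int emitted;

/*
 * Simplified model of a "print once" helper: the guard is one static
 * flag per call site, so every device reaching that call site shares it.
 */
#define print_once(dev, msg)                                  \
	do {                                                  \
		static bool printed_;                         \
		if (!printed_) {                              \
			printed_ = true;                      \
			emitted++;                            \
			printf("%s: %s\n", (dev)->name, msg); \
		}                                             \
	} while (0)

/* A single call site, reached once for each device instance. */
static void announce(const struct fake_dev *dev)
{
	print_once(dev, "configuring PHY interface");
}
```

Calling announce() for two different devices prints a single line, for the first device only, because the flag belongs to the call site rather than to the device.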
Re: [PATCH net-next] ipv6: Avoid going through ->sk_net to access the netns
From: Jakub Sitnicki
Date: Mon, 31 Jul 2017 10:09:41 +0200

> There is no need to go through sk->sk_net to access the net namespace
> and its sysctl variables because we allocate the sock and initialize
> sk_net just a few lines earlier in the same routine.
>
> Signed-off-by: Jakub Sitnicki

Applied, thanks.
Re: [PATCH net-next 0/7] More Marvell PHY refactoring and cleanup
From: Andrew Lunn
Date: Sun, 30 Jul 2017 22:41:43 +0200

> Consolidate more duplicated code into helpers, make use of core
> helpers, move code into a helper for later adding functionality to add
> marvell PHYs, etc.

Series applied.
Re: [PATCH V2 net-next 0/2] liquidio: Add support for managing liquidio adapter
From: Simon Horman
Date: Sun, 30 Jul 2017 22:21:04 +0200

> On Fri, Jul 28, 2017 at 11:17:07PM -0700, Felix Manlunas wrote:
>> From: Veerasenareddy Burru
>>
>> The LiquidIO adapter has processor cores that can run Linux. This patch
>> set adds support to create a virtual Ethernet interface on host to
>> communicate with applications running on Linux in the LiquidIO adapter.
>> The virtual Ethernet interface also provides login access to Linux on
>> LiquidIO through ssh for management and debugging.
>
> As per the somewhat more detailed feedback provided by my colleague Jakub
> Kicinski to v1 of this patchset[1] I am concerned that this patchset breaks
> down the long standing practice of not granting direct access to firmware
> from userspace.
>
> [1] https://www.spinics.net/lists/netdev/msg444929.html

Agreed, I've seen no attempt to address this important feedback, which I agree with.
Re: [PATCH] mv643xx_eth: fix of_irq_to_resource() error check
From: Sergei Shtylyov
Date: Sat, 29 Jul 2017 22:18:41 +0300

> of_irq_to_resource() has recently been fixed to return negative error #'s
> along with 0 in case of failure, however the Marvell MV643xx Ethernet
> driver still only regards 0 as invalid IRQ -- fix it up.
>
> Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()")
> Signed-off-by: Sergei Shtylyov

Applied.
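
The class of bug fixed here can be sketched in a few lines of plain C. This is an illustration under assumptions, not the driver's code: irq_valid_old() and irq_valid_new() are hypothetical helpers that model a caller treating only 0 as failure versus one that also rejects negative error numbers.

```c
#include <stdbool.h>

/*
 * Models a caller that regards only 0 as an invalid IRQ. A negative
 * error number returned by the lookup is mistaken for a usable IRQ.
 */
static bool irq_valid_old(int irq)
{
	return irq != 0;
}

/*
 * Models the check once the lookup may return 0 or a negative error
 * number on failure: only strictly positive values are usable.
 */
static bool irq_valid_new(int irq)
{
	return irq > 0;
}
```

The two helpers agree for every value except negative ones, which is exactly the case introduced when the lookup started returning error numbers.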
Re: [PATCH] net-next: stmmac: dwmac-sun8i: fix of_table.cocci warnings
From: Julia Lawall
Date: Sat, 29 Jul 2017 17:54:10 +0200 (CEST)

> Make sure (of/i2c/platform)_device_id tables are NULL terminated
>
> Generated by: scripts/coccinelle/misc/of_table.cocci
>
> Fixes: d5dbe1976d52 ("net-next: stmmac: dwmac-sun8i: choose internal PHY via
> compatible")
> CC: Corentin Labbe
> Signed-off-by: Fengguang Wu

This change seems to be no longer relevant.
Re: [PATCH net-next] net: bcmgenet: Add dependency on HAS_IOMEM && OF
From: Florian Fainelli
Date: Mon, 31 Jul 2017 17:53:07 -0700

> The driver needs CONFIG_HAS_IOMEM and OF to be functional, but we still
> let it build with COMPILE_TEST. This fixes the unmet dependency after
> selecting MDIO_BCM_UNIMAC in commit mentioned below:
>
> warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has
> unmet direct dependencies (NETDEVICES && MDIO_DEVICE && HAS_IOMEM &&
> OF_MDIO)
>
> Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO
> controller driver")
> Signed-off-by: Florian Fainelli

Applied.
[PATCH net-next] net: bcmgenet: Add dependency on HAS_IOMEM && OF
The driver needs CONFIG_HAS_IOMEM and OF to be functional, but we still
let it build with COMPILE_TEST. This fixes the unmet dependency after
selecting MDIO_BCM_UNIMAC in commit mentioned below:

warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has
unmet direct dependencies (NETDEVICES && MDIO_DEVICE && HAS_IOMEM &&
OF_MDIO)

Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver")
Signed-off-by: Florian Fainelli
---
 drivers/net/ethernet/broadcom/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index ec7a798c6bd1..45775399cab6 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -61,6 +61,7 @@ config BCM63XX_ENET

 config BCMGENET
 	tristate "Broadcom GENET internal MAC support"
+	depends on (OF && HAS_IOMEM) || COMPILE_TEST
 	select MII
 	select PHYLIB
 	select FIXED_PHY
-- 
2.9.3
Re: [PATCH net v3] MAINTAINERS: Add more files to the PHY LIBRARY section
From: Florian Fainelli
Date: Mon, 31 Jul 2017 09:47:50 -0700

> Include missing files that are provided by, used, or directly maintained
> within the PHY LIBRARY, this include uapi header, header files used by
> Device Tree code etc.
>
> Signed-off-by: Florian Fainelli

Applied.
Re: [PATCH net] ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()
From: Ido Schimmel
Date: Fri, 28 Jul 2017 23:27:44 +0300

> Michał reported a NULL pointer deref during fib_sync_down_dev() when
> unregistering a netdevice. The problem is that we don't check for
> 'in_dev' being NULL, which can happen in very specific cases.
>
> Usually routes are flushed upon NETDEV_DOWN sent in either the netdev or
> the inetaddr notification chains. However, if an interface isn't
> configured with any IP address, then it's possible for host routes to be
> flushed following NETDEV_UNREGISTER, after NULLing dev->ip_ptr in
> inetdev_destroy().
>
> To reproduce:
> $ ip link add type dummy
> $ ip route add local 1.1.1.0/24 dev dummy0
> $ ip link del dev dummy0
>
> Fix this by checking for the presence of 'in_dev' before referencing it.
>
> Fixes: 982acb97560c ("ipv4: fib: Notify about nexthop status changes")
> Signed-off-by: Ido Schimmel
> Reported-by: Michał Mirosław
> ---
> Please consider this for -stable.

Applied and queued up for -stable, thanks!
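
The shape of the fix, testing the per-device IPv4 state for NULL before reading it, can be sketched in userspace C. Everything here is hypothetical and simplified: struct demo_in_dev, struct demo_netdev and demo_ignore_linkdown() only model a device whose ip_ptr has already been cleared during unregistration; they are not the kernel structures or the patched function.

```c
#include <stddef.h>

/* Hypothetical per-device IPv4 state; stands in for 'in_dev'. */
struct demo_in_dev {
	int ignore_routes_with_linkdown;
};

/* Hypothetical device; ip_ptr is NULL once the IPv4 state is destroyed. */
struct demo_netdev {
	struct demo_in_dev *ip_ptr;
};

/*
 * Reads a per-device setting, falling back to 0 when the IPv4 state is
 * gone. Without the NULL test this would dereference a NULL pointer for
 * a device that is being unregistered.
 */
static int demo_ignore_linkdown(const struct demo_netdev *dev)
{
	const struct demo_in_dev *in_dev = dev->ip_ptr;

	if (!in_dev)
		return 0;
	return in_dev->ignore_routes_with_linkdown;
}
```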
[PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter
The Mirred/redirect action is extended to accept a traffic class on the device in addition to the device's ifindex. Usage: mirred Example: # tc qdisc add dev eth0 ingress # tc filter add dev eth0 protocol ip parent : prio 1 flower\ dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\ indev eth0 action mirred ingress redirect dev eth0 tc 1 Signed-off-by: Amritha Nambiar--- tc/m_mirred.c | 26 +- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/include/linux/tc_act/tc_mirred.h b/include/linux/tc_act/tc_mirred.h index 3d7a2b3..9a3aa61 100644 --- a/include/linux/tc_act/tc_mirred.h +++ b/include/linux/tc_act/tc_mirred.h @@ -9,6 +9,9 @@ #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */ #define TCA_INGRESS_REDIR 3 /* packet redirect to INGRESS*/ #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */ + +#define MIRRED_F_TC_MAP0x1 +#define MIRRED_TC_MAP_MAX 0x10 struct tc_mirred { tc_gen; @@ -21,6 +24,7 @@ enum { TCA_MIRRED_TM, TCA_MIRRED_PARMS, TCA_MIRRED_PAD, + TCA_MIRRED_TC_MAP, __TCA_MIRRED_MAX }; #define TCA_MIRRED_MAX (__TCA_MIRRED_MAX - 1) diff --git a/tc/m_mirred.c b/tc/m_mirred.c index 2384bda..1a18c6b 100644 --- a/tc/m_mirred.c +++ b/tc/m_mirred.c @@ -29,12 +29,13 @@ static void explain(void) { - fprintf(stderr, "Usage: mirred [index INDEX] \n"); + fprintf(stderr, "Usage: mirred [index INDEX] [tc TCINDEX]\n"); fprintf(stderr, "where:\n"); fprintf(stderr, "\tDIRECTION := \n"); fprintf(stderr, "\tACTION := \n"); fprintf(stderr, "\tINDEX is the specific policy instance id\n"); fprintf(stderr, "\tDEVICENAME is the devicename\n"); + fprintf(stderr, "\tTCINDEX is the traffic class index\n"); } @@ -72,6 +73,8 @@ parse_direction(struct action_util *a, int *argc_p, char ***argv_p, struct tc_mirred p = {}; struct rtattr *tail; char d[16] = {}; + __u32 flags = 0; + __u8 tc; while (argc > 0) { @@ -142,6 +145,18 @@ parse_direction(struct action_util *a, int *argc_p, char ***argv_p, argc--; argv++; + if ((argc > 0) && (matches(*argv, "tc") == 0)) { + 
NEXT_ARG(); + tc = atoi(*argv); + if (tc >= MIRRED_TC_MAP_MAX) { + fprintf(stderr, "Invalid TC index\n"); + return -1; + } + flags |= MIRRED_F_TC_MAP; + ok++; + argc--; + argv++; + } break; } @@ -193,6 +208,9 @@ parse_direction(struct action_util *a, int *argc_p, char ***argv_p, tail = NLMSG_TAIL(n); addattr_l(n, MAX_MSG, tca_id, NULL, 0); addattr_l(n, MAX_MSG, TCA_MIRRED_PARMS, , sizeof(p)); + if (flags & MIRRED_F_TC_MAP) + addattr_l(n, MAX_MSG, TCA_MIRRED_TC_MAP, + , sizeof(tc)); tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail; *argc_p = argc; @@ -248,6 +266,7 @@ print_mirred(struct action_util *au, FILE * f, struct rtattr *arg) struct tc_mirred *p; struct rtattr *tb[TCA_MIRRED_MAX + 1]; const char *dev; + __u8 *tc; if (arg == NULL) return -1; @@ -273,6 +292,11 @@ print_mirred(struct action_util *au, FILE * f, struct rtattr *arg) fprintf(f, "mirred (%s to device %s)", mirred_n2a(p->eaction), dev); print_action_control(f, " ", p->action, ""); + if (tb[TCA_MIRRED_TC_MAP]) { + tc = RTA_DATA(tb[TCA_MIRRED_TC_MAP]); + fprintf(f, " tc %d", *tc); + } + fprintf(f, "\n "); fprintf(f, "\tindex %u ref %d bind %d", p->index, p->refcnt, p->bindcnt);
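
One detail a reviewer might check in the parsing hunk above: the result of atoi() is stored in a __u8 before it is compared against MIRRED_TC_MAP_MAX, so an argument such as 256 is reduced modulo 256 before the range test sees it. The following is a minimal sketch of a stricter parser, assuming strtoul() is acceptable in this context; demo_parse_tc() and DEMO_TC_MAP_MAX are hypothetical names, and this is not the code the patch proposes.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_TC_MAP_MAX 0x10	/* mirrors MIRRED_TC_MAP_MAX in the patch */

/*
 * Parse a traffic class index from a command-line argument.
 * Returns 0 and stores the class in *tc on success, -1 on bad input.
 * The range test runs on the full-width value, before narrowing to 8 bits.
 */
static int demo_parse_tc(const char *arg, uint8_t *tc)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return -1;	/* overflow, no digits, or trailing junk */
	if (val >= DEMO_TC_MAP_MAX)
		return -1;	/* outside the 0..15 range */

	*tc = (uint8_t)val;
	return 0;
}
```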
[PATCH 4/6] [net-next]net: i40e: Admin queue definitions for cloud filters
Add new admin queue definitions and extended fields for cloud filter support. Define big buffer for extended general fields in Add/Remove Cloud filters command. Signed-off-by: Amritha NambiarSigned-off-by: Kiran Patil Signed-off-by: Store Laura Signed-off-by: Iremonger Bernard Signed-off-by: Jingjing Wu --- drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h | 98 + 1 file changed, 97 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h index 8bba04c..9f14305 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h @@ -1358,7 +1358,9 @@ struct i40e_aqc_add_remove_cloud_filters { #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT 0 #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_MASK (0x3FF << \ I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT) - u8 reserved2[4]; + u8 big_buffer_flag; +#defineI40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER 1 + u8 reserved2[3]; __le32 addr_high; __le32 addr_low; }; @@ -1395,6 +1397,13 @@ struct i40e_aqc_add_remove_cloud_filters_element_data { #define I40E_AQC_ADD_CLOUD_FILTER_IMAC 0x000A #define I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC 0x000B #define I40E_AQC_ADD_CLOUD_FILTER_IIP 0x000C +/* 0x0010 to 0x0017 is for custom filters */ +/* flag to be used when adding cloud filter: IP + L4 Port */ +#define I40E_AQC_ADD_CLOUD_FILTER_IP_PORT 0x0010 +/* flag to be used when adding cloud filter: Dest MAC + L4 Port */ +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT 0x0011 +/* flag to be used when adding cloud filter: Dest MAC + VLAN + L4 Port */ +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT0x0012 #define I40E_AQC_ADD_CLOUD_FLAGS_TO_QUEUE 0x0080 #define I40E_AQC_ADD_CLOUD_VNK_SHIFT 6 @@ -1429,6 +1438,45 @@ struct i40e_aqc_add_remove_cloud_filters_element_data { u8 response_reserved[7]; }; +/* i40e_aqc_add_remove_cloud_filters_element_big_data is used when + * I40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER flag is set. 
+ */ +struct i40e_aqc_add_remove_cloud_filters_element_big_data { + struct i40e_aqc_add_remove_cloud_filters_element_data element; + u16 general_fields[32]; +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD0 0 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD1 1 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD2 2 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD0 3 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD1 4 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD2 5 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD0 6 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD1 7 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD2 8 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD0 9 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD1 10 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD2 11 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD0 12 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD1 13 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD2 14 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0 15 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD1 16 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD2 17 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD3 18 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD4 19 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD5 20 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD6 21 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD7 22 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD0 23 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD1 24 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD2 25 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD3 26 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD4 27 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD5 28 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD6 29 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD7 30 +}; + struct i40e_aqc_remove_cloud_filters_completion { __le16 perfect_ovlan_used; __le16 perfect_ovlan_free; @@ -1440,6 +1488,54 @@ struct i40e_aqc_remove_cloud_filters_completion { I40E_CHECK_CMD_LENGTH(i40e_aqc_remove_cloud_filters_completion); +/* Replace filter Command 0x025F + * uses the i40e_aqc_replace_cloud_filters, + * 
and the generic indirect completion structure + */ +struct i40e_filter_data { + u8 filter_type; + u8 input[3]; +}; + +struct i40e_aqc_replace_cloud_filters_cmd { + u8 valid_flags; +#define I40E_AQC_REPLACE_L1_FILTER 0x0 +#define I40E_AQC_REPLACE_CLOUD_FILTER 0x1 +#define I40E_AQC_GET_CLOUD_FILTERS 0x2 +#define I40E_AQC_MIRROR_CLOUD_FILTER 0x4 +#define I40E_AQC_HIGH_PRIORITY_CLOUD_FILTER0x8 + u8 old_filter_type; + u8
[PATCH 6/6] [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier
This patch enables tc-flower based hardware offloads. tc/flower filter provided by the kernel is configured as driver specific cloud filter. The patch implements functions and admin queue commands needed to support cloud filters in the driver and adds cloud filters to configure these tc-flower filters.

The only action supported is to redirect packets to a traffic class on the same device.

# tc qdisc add dev eth0 ingress
# ethtool -K eth0 hw-tc-offload on

# tc filter add dev eth0 protocol ip parent :\
  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw indev eth0\
  action mirred ingress redirect dev eth0 tc 0

# tc filter add dev eth0 protocol ip parent :\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw indev eth0\
  action mirred ingress redirect dev eth0 tc 1

# tc filter add dev eth0 protocol ipv6 parent :\
  prio 3 flower dst_ip fe8::200:1\
  ip_proto udp dst_port 66 skip_sw indev eth0\
  action mirred ingress redirect dev eth0 tc 2

Delete tc flower filter:
Example:

# tc filter del dev eth0 parent : prio 3 handle 0x1 flower
# tc filter del dev eth0 parent :

Flow Director Sideband is disabled while configuring cloud filters via tc-flower.

Unsupported matches when cloud filters are added using enhanced big buffer cloud filter mode of underlying switch include:
1. Source port and source IP
2. Combined MAC address and IP fields
3. Not specifying L4 port

These filter matches can however be used to redirect traffic to the main VSI (tc 0) which does not require the enhanced big buffer cloud filter support.
Signed-off-by: Amritha NambiarSigned-off-by: Kiran Patil --- drivers/net/ethernet/intel/i40e/i40e.h | 46 + drivers/net/ethernet/intel/i40e/i40e_common.c| 180 drivers/net/ethernet/intel/i40e/i40e_main.c | 952 ++ drivers/net/ethernet/intel/i40e/i40e_prototype.h | 17 4 files changed, 1193 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index 5c0cad5..7288265 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -55,6 +55,8 @@ #include #include #include +#include +#include #include "i40e_type.h" #include "i40e_prototype.h" #include "i40e_client.h" @@ -252,10 +254,51 @@ struct i40e_fdir_filter { u32 fd_id; }; +#define I40E_CLOUD_FIELD_OMAC 0x01 +#define I40E_CLOUD_FIELD_IMAC 0x02 +#define I40E_CLOUD_FIELD_IVLAN 0x04 +#define I40E_CLOUD_FIELD_TEN_ID0x08 +#define I40E_CLOUD_FIELD_IIP 0x10 + +#define I40E_CLOUD_FILTER_FLAGS_OMAC I40E_CLOUD_FIELD_OMAC +#define I40E_CLOUD_FILTER_FLAGS_IMAC I40E_CLOUD_FIELD_IMAC +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN (I40E_CLOUD_FIELD_IMAC | \ +I40E_CLOUD_FIELD_IVLAN) +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID(I40E_CLOUD_FIELD_IMAC | \ +I40E_CLOUD_FIELD_TEN_ID) +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \ + I40E_CLOUD_FIELD_IMAC | \ + I40E_CLOUD_FIELD_TEN_ID) +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \ + I40E_CLOUD_FIELD_IVLAN | \ + I40E_CLOUD_FIELD_TEN_ID) +#define I40E_CLOUD_FILTER_FLAGS_IIPI40E_CLOUD_FIELD_IIP + struct i40e_cloud_filter { struct hlist_node cloud_node; /* cloud filter input set follows */ unsigned long cookie; + u8 dst_mac[ETH_ALEN]; + u8 src_mac[ETH_ALEN]; + __be16 vlan_id; + __be32 dst_ip[4]; + __be32 src_ip[4]; + u8 dst_ipv6[16]; + u8 src_ipv6[16]; + __be16 dst_port; + __be16 src_port; + /* matter only when IP based filtering is set */ + bool is_ipv6; + /* IPPROTO value */ + u8 ip_proto; + /* L4 port type: src or destination port 
*/ +#define I40E_CLOUD_FILTER_PORT_SRC 0x01 +#define I40E_CLOUD_FILTER_PORT_DEST0x02 + u8 port_type; + u32 tenant_id; + u8 flags; +#define I40E_CLOUD_TNL_TYPE_NONE 0xff + u8 tunnel_type; /* filter control */ u16 seid; }; @@ -574,6 +617,9 @@ struct i40e_pf { u16 phy_led_val; u16 override_q_count; + u16 last_sw_conf_flags; + u16 last_sw_conf_valid_flags; + }; /** diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c index d0e8138..bfbe304 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_common.c +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c @@ -5269,5 +5269,185 @@ i40e_add_pinfo_to_list(struct i40e_hw *hw, status = i40e_aq_write_ppp(hw, (void *)sec, sec->data_end,
[PATCH 3/6] [net-next]net: i40e: Extend set switch config command to accept cloud filter mode
Add definitions for L4 filters and switch modes based on cloud filters modes and extend the set switch config command to include the additional cloud filter mode. Signed-off-by: Amritha NambiarSigned-off-by: Kiran Patil --- drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h | 34 - drivers/net/ethernet/intel/i40e/i40e_common.c |4 ++ drivers/net/ethernet/intel/i40e/i40e_ethtool.c|2 + drivers/net/ethernet/intel/i40e/i40e_main.c |2 + drivers/net/ethernet/intel/i40e/i40e_prototype.h |2 + drivers/net/ethernet/intel/i40e/i40e_type.h |9 ++ 6 files changed, 48 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h index e2a9ec8..8bba04c 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h @@ -773,7 +773,39 @@ struct i40e_aqc_set_switch_config { #define I40E_AQ_SET_SWITCH_CFG_PROMISC 0x0001 #define I40E_AQ_SET_SWITCH_CFG_L2_FILTER 0x0002 __le16 valid_flags; - u8 reserved[12]; + + u8 rsvd6[6]; + + /* Next byte is split into following: +* Bit 7 : 0: No action, 1: Switch to mode defined by bits 6:0 +* Bit 6: 0 : Destination Port, 1: source port +* Bit 5..4: L4 type +* 0: rsvd +* 1: TCP +* 2: UDP +* 3: Both TCP and UDP +* Bits 3:0 Mode +* 0: default mode +* 1: L4 port only mode +* 2: non-tunneled mode +* 3: tunneled mode +*/ +#define I40E_AQ_SET_SWITCH_BIT7_VALID 0x80 + +#define I40E_AQ_SET_SWITCH_L4_SRC_PORT 0x40 + +#define I40E_AQ_SET_SWITCH_L4_TYPE_RSVD0x00 +#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP 0x10 +#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP 0x20 +#define I40E_AQ_SET_SWITCH_L4_TYPE_BOTH0x30 + +#define I40E_AQ_SET_SWITCH_MODE_DEFAULT0x00 +#define I40E_AQ_SET_SWITCH_MODE_L4_PORT0x01 +#define I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL 0x02 +#define I40E_AQ_SET_SWITCH_MODE_TUNNEL 0x03 + u8 mode; + + u8 rsvd5[5]; }; I40E_CHECK_CMD_LENGTH(i40e_aqc_set_switch_config); diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c index e4e86e0..d0e8138 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_common.c +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c @@ -2380,13 +2380,14 @@ i40e_status i40e_aq_get_switch_config(struct i40e_hw *hw, * @hw: pointer to the hardware structure * @flags: bit flag values to set * @valid_flags: which bit flags to set + * @mode: cloud filter mode * @cmd_details: pointer to command details structure or NULL * * Set switch configuration bits **/ enum i40e_status_code i40e_aq_set_switch_config(struct i40e_hw *hw, u16 flags, - u16 valid_flags, + u16 valid_flags, u8 mode, struct i40e_asq_cmd_details *cmd_details) { struct i40e_aq_desc desc; @@ -2398,6 +2399,7 @@ enum i40e_status_code i40e_aq_set_switch_config(struct i40e_hw *hw, i40e_aqc_opc_set_switch_config); scfg->flags = cpu_to_le16(flags); scfg->valid_flags = cpu_to_le16(valid_flags); + scfg->mode = mode; status = i40e_asq_send_command(hw, , NULL, 0, cmd_details); diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 326fc18..232e066e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -4181,7 +4181,7 @@ static int i40e_set_priv_flags(struct net_device *dev, u32 flags) sw_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC; valid_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC; ret = i40e_aq_set_switch_config(>hw, sw_flags, valid_flags, - NULL); + 0, NULL); if (ret && pf->hw.aq.asq_last_status != I40E_AQ_RC_ESRCH) { dev_info(>pdev->dev, "couldn't set switch config bits, err %s aq_err %s\n", diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 1daf95e..f74 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -12107,7 +12107,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit) u16 valid_flags; valid_flags = 
I40E_AQ_SET_SWITCH_CFG_PROMISC; - ret =
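
The bit layout of the new mode byte described in the comment of this patch can be illustrated with a small helper. The defines below are copied from the patch; demo_switch_mode() and its two masks are hypothetical and exist only to show how the fields combine, not how the driver builds the value.

```c
#include <stdint.h>

/* Values copied from the defines added by the patch. */
#define I40E_AQ_SET_SWITCH_BIT7_VALID		0x80
#define I40E_AQ_SET_SWITCH_L4_SRC_PORT		0x40
#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP		0x10
#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP		0x20
#define I40E_AQ_SET_SWITCH_L4_TYPE_BOTH		0x30
#define I40E_AQ_SET_SWITCH_MODE_DEFAULT		0x00
#define I40E_AQ_SET_SWITCH_MODE_L4_PORT		0x01
#define I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL	0x02
#define I40E_AQ_SET_SWITCH_MODE_TUNNEL		0x03

/* Hypothetical masks for bits 5..4 (L4 type) and bits 3..0 (mode). */
#define DEMO_L4_TYPE_MASK	0x30
#define DEMO_MODE_MASK		0x0f

/*
 * Compose the mode byte: bit 7 requests the switch to the mode in the
 * low bits, bit 6 selects source port matching, bits 5..4 carry the L4
 * type, and bits 3..0 carry the mode.
 */
static uint8_t demo_switch_mode(uint8_t l4_type, int match_src_port,
				uint8_t mode)
{
	uint8_t val = I40E_AQ_SET_SWITCH_BIT7_VALID;

	val |= l4_type & DEMO_L4_TYPE_MASK;
	if (match_src_port)
		val |= I40E_AQ_SET_SWITCH_L4_SRC_PORT;
	val |= mode & DEMO_MODE_MASK;
	return val;
}
```

For example, UDP destination port matching in L4 port only mode sets bits 7, 5 and 0 of the byte.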
[PATCH 2/6] [net-next]net: i40e: Maintain a mapping of TCs with the VSI seids
Add mapping of TCs with the seids of the channel VSIs. TC0 will be mapped to the main VSI seid and all other TCs are mapped to the seid of the channel VSI.

Signed-off-by: Amritha Nambiar
---
 drivers/net/ethernet/intel/i40e/i40e.h      |    1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 8852ac0..1391e5d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -738,6 +738,7 @@ struct i40e_vsi {
 	atomic_t next_base_queue;

 	struct list_head ch_list;
+	u16 tc_seid_map[I40E_MAX_TRAFFIC_CLASS];

 	void *priv;	/* client driver data reference. */

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 370ce9f..1daf95e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6127,6 +6127,7 @@ static int i40e_configure_queue_channels(struct i40e_vsi *vsi)
 	int ret = 0, i;

 	/* Create app vsi with the TCs. Main VSI with TC0 is already set up */
+	vsi->tc_seid_map[0] = vsi->seid;
 	for (i = 1; i < I40E_MAX_TRAFFIC_CLASS; i++)
 		if (vsi->tc_config.enabled_tc & BIT(i)) {
 			ch = kzalloc(sizeof(*ch), GFP_KERNEL);
@@ -6156,6 +6157,7 @@ static int i40e_configure_queue_channels(struct i40e_vsi *vsi)
 					i, ch->num_queue_pairs);
 				goto err_free;
 			}
+			vsi->tc_seid_map[i] = ch->seid;
 		}

 	return ret;
[PATCH 1/6] [net-next]net: sched: act_mirred: Extend redirect action to accept a traffic class
The Mirred/redirect action is extended to forward to a traffic class on the device. The traffic class index needs to be provided in addition to the device's ifindex. Example: # tc filter add dev eth0 protocol ip parent : prio 1 flower\ dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\ skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1 Signed-off-by: Amritha Nambiar--- include/net/tc_act/tc_mirred.h|7 +++ include/uapi/linux/tc_act/tc_mirred.h |5 + net/sched/act_mirred.c| 17 + 3 files changed, 29 insertions(+) diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h index 604bc31..60058c4 100644 --- a/include/net/tc_act/tc_mirred.h +++ b/include/net/tc_act/tc_mirred.h @@ -9,6 +9,8 @@ struct tcf_mirred { int tcfm_eaction; int tcfm_ifindex; booltcfm_mac_header_xmit; + u8 tcfm_tc; + u32 flags; struct net_device __rcu *tcfm_dev; struct list_headtcfm_list; }; @@ -37,4 +39,9 @@ static inline int tcf_mirred_ifindex(const struct tc_action *a) return to_mirred(a)->tcfm_ifindex; } +static inline int tcf_mirred_tc(const struct tc_action *a) +{ + return to_mirred(a)->tcfm_tc; +} + #endif /* __NET_TC_MIR_H */ diff --git a/include/uapi/linux/tc_act/tc_mirred.h b/include/uapi/linux/tc_act/tc_mirred.h index 3d7a2b3..8ff4d76 100644 --- a/include/uapi/linux/tc_act/tc_mirred.h +++ b/include/uapi/linux/tc_act/tc_mirred.h @@ -9,6 +9,10 @@ #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */ #define TCA_INGRESS_REDIR 3 /* packet redirect to INGRESS*/ #define TCA_INGRESS_MIRROR 4 /* mirror packet to INGRESS */ + +#define MIRRED_F_TC_MAP0x1 +#define MIRRED_TC_MAP_MAX 0x10 +#define MIRRED_TC_MAP_MASK 0xF struct tc_mirred { tc_gen; @@ -21,6 +25,7 @@ enum { TCA_MIRRED_TM, TCA_MIRRED_PARMS, TCA_MIRRED_PAD, + TCA_MIRRED_TC_MAP, __TCA_MIRRED_MAX }; #define TCA_MIRRED_MAX (__TCA_MIRRED_MAX - 1) diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c index 1b5549a..f9801de 100644 --- a/net/sched/act_mirred.c +++ b/net/sched/act_mirred.c @@ -67,6 +67,7 @@ 
static void tcf_mirred_release(struct tc_action *a, int bind) static const struct nla_policy mirred_policy[TCA_MIRRED_MAX + 1] = { [TCA_MIRRED_PARMS] = { .len = sizeof(struct tc_mirred) }, + [TCA_MIRRED_TC_MAP] = { .type = NLA_U8 }, }; static unsigned int mirred_net_id; @@ -83,6 +84,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, struct tcf_mirred *m; struct net_device *dev; bool exists = false; + u8 *tc_map = NULL; + u32 flags = 0; int ret; if (nla == NULL) @@ -92,6 +95,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, return ret; if (tb[TCA_MIRRED_PARMS] == NULL) return -EINVAL; + + if (tb[TCA_MIRRED_TC_MAP]) { + tc_map = nla_data(tb[TCA_MIRRED_TC_MAP]); + if (*tc_map >= MIRRED_TC_MAP_MAX) + return -EINVAL; + flags |= MIRRED_F_TC_MAP; + } + parm = nla_data(tb[TCA_MIRRED_PARMS]); exists = tcf_hash_check(tn, parm->index, a, bind); @@ -139,6 +150,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, ASSERT_RTNL(); m->tcf_action = parm->action; m->tcfm_eaction = parm->eaction; + m->flags = flags; if (dev != NULL) { m->tcfm_ifindex = parm->ifindex; if (ret != ACT_P_CREATED) @@ -146,6 +158,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, dev_hold(dev); rcu_assign_pointer(m->tcfm_dev, dev); m->tcfm_mac_header_xmit = mac_header_xmit; + if (flags & MIRRED_F_TC_MAP) + m->tcfm_tc = *tc_map & MIRRED_TC_MAP_MASK; } if (ret == ACT_P_CREATED) { @@ -259,6 +273,9 @@ static int tcf_mirred_dump(struct sk_buff *skb, struct tc_action *a, int bind, if (nla_put(skb, TCA_MIRRED_PARMS, sizeof(opt), )) goto nla_put_failure; + if ((m->flags & MIRRED_F_TC_MAP) && + nla_put_u8(skb, TCA_MIRRED_TC_MAP, m->tcfm_tc)) + goto nla_put_failure; tcf_tm_dump(, >tcf_tm); if (nla_put_64bit(skb, TCA_MIRRED_TM, sizeof(t), , TCA_MIRRED_PAD))
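The validation added to tcf_mirred_init() can be sketched in user space as follows. This is only an illustration under stated assumptions: the three constants are copied from the uapi hunk above, while struct mirred_tc_state and mirred_set_tc() are hypothetical names that do not exist in the kernel, and the sketch ignores the patch's detail that tcfm_tc is only stored when a target device is present.

```c
#include <stdint.h>
#include <errno.h>

/* Constants as added to include/uapi/linux/tc_act/tc_mirred.h by the patch. */
#define MIRRED_F_TC_MAP     0x1
#define MIRRED_TC_MAP_MAX   0x10
#define MIRRED_TC_MAP_MASK  0xF

/* Hypothetical stand-in for the two fields added to struct tcf_mirred. */
struct mirred_tc_state {
	uint8_t  tc;     /* corresponds to tcfm_tc */
	uint32_t flags;  /* corresponds to flags */
};

/*
 * Accept a requested traffic class only if it is below MIRRED_TC_MAP_MAX,
 * then record the flag and the masked value. Returns 0 on success and
 * -EINVAL for an out-of-range class, like the check in the patch.
 */
static int mirred_set_tc(struct mirred_tc_state *st, uint8_t requested)
{
	if (requested >= MIRRED_TC_MAP_MAX)
		return -EINVAL;

	st->flags |= MIRRED_F_TC_MAP;
	st->tc = requested & MIRRED_TC_MAP_MASK;
	return 0;
}
```

With these limits, traffic classes 0 through 15 are accepted and anything larger is rejected before the action is created.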
[PATCH 5/6] [net-next]net: i40e: Clean up of cloud filters
Introduce the cloud filter datastructure and cleanup of cloud filters associated with the device. Signed-off-by: Amritha Nambiar--- drivers/net/ethernet/intel/i40e/i40e.h | 11 +++ drivers/net/ethernet/intel/i40e/i40e_main.c | 27 +++ 2 files changed, 38 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index 1391e5d..5c0cad5 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -252,6 +252,14 @@ struct i40e_fdir_filter { u32 fd_id; }; +struct i40e_cloud_filter { + struct hlist_node cloud_node; + /* cloud filter input set follows */ + unsigned long cookie; + /* filter control */ + u16 seid; +}; + #define I40E_ETH_P_LLDP0x88cc #define I40E_DCB_PRIO_TYPE_STRICT 0 @@ -419,6 +427,9 @@ struct i40e_pf { struct i40e_udp_port_config udp_ports[I40E_MAX_PF_UDP_OFFLOAD_PORTS]; u16 pending_udp_bitmap; + struct hlist_head cloud_filter_list; + u16 num_cloud_filters; + enum i40e_interrupt_policy int_policy; u16 rx_itr_default; u16 tx_itr_default; diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index f74..93f6fe2 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -6928,6 +6928,29 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf) } /** + * i40e_cloud_filter_exit - Cleans up the Cloud Filters + * @pf: Pointer to PF + * + * This function destroys the hlist where all the Cloud Filters + * filters were saved. 
+ **/ +static void i40e_cloud_filter_exit(struct i40e_pf *pf) +{ + struct i40e_cloud_filter *cfilter; + struct hlist_node *node; + + if (hlist_empty(>cloud_filter_list)) + return; + + hlist_for_each_entry_safe(cfilter, node, + >cloud_filter_list, cloud_node) { + hlist_del(>cloud_node); + kfree(cfilter); + } + pf->num_cloud_filters = 0; +} + +/** * i40e_close - Disables a network interface * @netdev: network interface device structure * @@ -12137,6 +12160,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit) vsi = i40e_vsi_reinit_setup(pf->vsi[pf->lan_vsi]); if (!vsi) { dev_info(>pdev->dev, "setup of MAIN VSI failed\n"); + i40e_cloud_filter_exit(pf); i40e_fdir_teardown(pf); return -EAGAIN; } @@ -12961,6 +12985,8 @@ static void i40e_remove(struct pci_dev *pdev) if (pf->vsi[pf->lan_vsi]) i40e_vsi_release(pf->vsi[pf->lan_vsi]); + i40e_cloud_filter_exit(pf); + /* remove attached clients */ if (pf->flags & I40E_FLAG_IWARP_ENABLED) { ret_code = i40e_lan_del_device(pf); @@ -13170,6 +13196,7 @@ static void i40e_shutdown(struct pci_dev *pdev) del_timer_sync(>service_timer); cancel_work_sync(>service_task); + i40e_cloud_filter_exit(pf); i40e_fdir_teardown(pf); /* Client close must be called explicitly here because the timer
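The cleanup above relies on the "safe" list iteration idiom: the next pointer is saved before the current node is freed. A minimal user-space sketch of the same idea follows, assuming a plain singly linked list in place of the kernel hlist; struct cloud_filter, struct filter_list, filter_add() and filter_exit() are hypothetical names used only for illustration.

```c
#include <stdlib.h>

/* Hypothetical user-space analogue of struct i40e_cloud_filter. */
struct cloud_filter {
	struct cloud_filter *next;
	unsigned long cookie;
};

/* Hypothetical analogue of the list head and counter added to struct i40e_pf. */
struct filter_list {
	struct cloud_filter *head;
	unsigned int count;
};

/* Push a filter on the front of the list; returns 0, or -1 if allocation fails. */
static int filter_add(struct filter_list *l, unsigned long cookie)
{
	struct cloud_filter *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->cookie = cookie;
	f->next = l->head;
	l->head = f;
	l->count++;
	return 0;
}

/*
 * Free every entry. The next pointer is read before the node is freed,
 * which is the property hlist_for_each_entry_safe() provides in the kernel.
 * Calling this on an empty list is a no-op.
 */
static void filter_exit(struct filter_list *l)
{
	struct cloud_filter *f = l->head;
	struct cloud_filter *next;

	while (f) {
		next = f->next;
		free(f);
		f = next;
	}
	l->head = NULL;
	l->count = 0;
}
```

Because the function tolerates an empty list, it can be called from several teardown paths, which matches how the patch calls i40e_cloud_filter_exit() from the remove, shutdown and failed-reinit paths.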
[PATCH net-next RFC 0/6] Configure cloud filters in i40e via tc/flower classifier
This patch series enables configuring cloud filters in i40e
using the tc/flower classifier. The only tc-filter action
supported is to redirect packets to a traffic class on the
same device. The tc/mirred:redirect action is extended to
accept a traffic class to achieve this.

The cloud filters are added for a VSI and are cleaned up when
the VSI is deleted. The filters that match on L4 ports need
enhanced admin queue functions with big buffer support for
extended general fields in the Add/Remove Cloud filters command.

Example:
# tc qdisc add dev eth0 ingress

# ethtool -K eth0 hw-tc-offload on

# tc filter add dev eth0 protocol ip parent : prio 1 flower\
  dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\
  skip_sw indev eth0 action mirred ingress redirect dev eth0 tc 1

# tc filter show dev eth0 parent :
filter protocol ip pref 1 flower
filter protocol ip pref 1 flower handle 0x1
  indev eth0
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.1.1
  dst_port 22
  skip_sw
  in_hw
        action order 1: mirred (Ingress Redirect to device eth0) stolen tc 1
        index 1 ref 1 bind 1

---

Amritha Nambiar (6):
      [net-next]net: sched: act_mirred: Extend redirect action to accept a traffic class
      [net-next]net: i40e: Maintain a mapping of TCs with the VSI seids
      [net-next]net: i40e: Extend set switch config command to accept cloud filter mode
      [net-next]net: i40e: Admin queue definitions for cloud filters
      [net-next]net: i40e: Clean up of cloud filters
      [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier

 drivers/net/ethernet/intel/i40e/i40e.h            |   58 +
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |  132 +++
 drivers/net/ethernet/intel/i40e/i40e_common.c     |  184
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c    |    2
 drivers/net/ethernet/intel/i40e/i40e_main.c       |  983 +
 drivers/net/ethernet/intel/i40e/i40e_prototype.h  |   19
 drivers/net/ethernet/intel/i40e/i40e_type.h       |    9
 include/net/tc_act/tc_mirred.h                    |    7
 include/uapi/linux/tc_act/tc_mirred.h             |    5
 net/sched/act_mirred.c                            |   17
 10 files changed, 1408 insertions(+), 8 deletions(-)

--
Re: [PATCH net-next 1/7] net: phy: mdio-bcm-unimac: factor busy polling loop
On 07/31/2017 05:28 PM, kbuild test robot wrote:
> Hi Florian,
>
> [auto build test ERROR on net-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847
> config: xtensa-allmodconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 4.9.0
> reproduce:
>         wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=xtensa
>
> Note: the linux-review/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847 HEAD 68043f6ab1b54d29abcfc56fdec46d280b76 builds fine.
>       It only hurts bisectibility.
>
> All errors (new ones prefixed by >>):
>
>    drivers/net/phy/mdio-bcm-unimac.c: In function 'unimac_mdio_read':
> >> drivers/net/phy/mdio-bcm-unimac.c:89:2: error: 'ret' undeclared (first use in this function)
>      ret = unimac_mdio_poll(priv);
>      ^
>    drivers/net/phy/mdio-bcm-unimac.c:89:2: note: each undeclared identifier is reported only once for each function it appears in
>
> vim +/ret +89 drivers/net/phy/mdio-bcm-unimac.c

This is "just" a bisectability problem, patch 4 does actually add the
int ret variable to store the return value... I will still fix the unmet
dependency warning, depends on would actually be more correct here anyway.

>
>     76
>     77  static int unimac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
>     78  {
>     79          struct unimac_mdio_priv *priv = bus->priv;
>     80          u32 cmd;
>     81
>     82          /* Prepare the read operation */
>     83          cmd = MDIO_RD | (phy_id << MDIO_PMD_SHIFT) | (reg << MDIO_REG_SHIFT);
>     84          __raw_writel(cmd, priv->base + MDIO_CMD);
>     85
>     86          /* Start MDIO transaction */
>     87          unimac_mdio_start(priv);
>     88
>   > 89          ret = unimac_mdio_poll(priv);
>     90          if (ret)
>     91                  return ret;
>     92
>     93          cmd = __raw_readl(priv->base + MDIO_CMD);
>     94
>     95          /* Some broken devices are known not to release the line during
>     96           * turn-around, e.g: Broadcom BCM53125 external switches, so check for
>     97           * that condition here and ignore the MDIO controller read failure
>     98           * indication.
>     99           */
>    100          if (!(bus->phy_ignore_ta_mask & 1 << phy_id) && (cmd & MDIO_READ_FAIL))
>    101                  return -EIO;
>    102
>    103          return cmd & 0x;
>    104  }
>    105
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>

--
Florian
Re: [PATCH net-next 1/7] net: phy: mdio-bcm-unimac: factor busy polling loop
Hi Florian,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa

Note: the linux-review/Florian-Fainelli/net-bcmgenet-utilize-MDIO-unimac-driver/20170801-075847 HEAD 68043f6ab1b54d29abcfc56fdec46d280b76 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   drivers/net/phy/mdio-bcm-unimac.c: In function 'unimac_mdio_read':
>> drivers/net/phy/mdio-bcm-unimac.c:89:2: error: 'ret' undeclared (first use in this function)
     ret = unimac_mdio_poll(priv);
     ^
   drivers/net/phy/mdio-bcm-unimac.c:89:2: note: each undeclared identifier is reported only once for each function it appears in

vim +/ret +89 drivers/net/phy/mdio-bcm-unimac.c

    76
    77  static int unimac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
    78  {
    79          struct unimac_mdio_priv *priv = bus->priv;
    80          u32 cmd;
    81
    82          /* Prepare the read operation */
    83          cmd = MDIO_RD | (phy_id << MDIO_PMD_SHIFT) | (reg << MDIO_REG_SHIFT);
    84          __raw_writel(cmd, priv->base + MDIO_CMD);
    85
    86          /* Start MDIO transaction */
    87          unimac_mdio_start(priv);
    88
  > 89          ret = unimac_mdio_poll(priv);
    90          if (ret)
    91                  return ret;
    92
    93          cmd = __raw_readl(priv->base + MDIO_CMD);
    94
    95          /* Some broken devices are known not to release the line during
    96           * turn-around, e.g: Broadcom BCM53125 external switches, so check for
    97           * that condition here and ignore the MDIO controller read failure
    98           * indication.
    99           */
   100          if (!(bus->phy_ignore_ta_mask & 1 << phy_id) && (cmd & MDIO_READ_FAIL))
   101                  return -EIO;
   102
   103          return cmd & 0x;
   104  }
   105

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH net v2] net: phy: Correctly process PHY_HALTED in phy_stop_machine()
From: Florian Fainelli
Date: Fri, 28 Jul 2017 11:58:36 -0700

> Marc reported that he was not getting the PHY library adjust_link()
> callback function to run when calling phy_stop() + phy_disconnect()
> which does not indeed happen because we set the state machine to
> PHY_HALTED but we don't get to run it to process this state past that
> point.
>
> Fix this with a synchronous call to phy_state_machine() in order to have
> the state machine actually act on PHY_HALTED, set the PHY device's link
> down, turn the network device's carrier off and finally call the
> adjust_link() function.
>
> Reported-by: Marc Gonzalez
> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
> Signed-off-by: Florian Fainelli
> ---
> Changes in v2:
>
> - reword subject and commit message based on changes
> - dropped flush_scheduled_work() since it is redundant

Applied and queued up for -stable, thanks.
[net-next:master 345/358] warning: (NET_DSA_BCM_SF2 && ..) selects MDIO_BCM_UNIMAC which has unmet direct dependencies (NETDEVICES && ..)
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   6a95befc8d0346d6cb3b4646c761e8b42e66a4df
commit: 9a4e79697009ddd0d1af52053c830f6e60e1c771 [345/358] net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver
config: x86_64-randconfig-x001-201731 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        git checkout 9a4e79697009ddd0d1af52053c830f6e60e1c771
        # save the attached .config to linux build tree
        make ARCH=x86_64

All warnings (new ones prefixed by >>):

warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has unmet direct dependencies (NETDEVICES && MDIO_DEVICE && HAS_IOMEM && OF_MDIO)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH] hysdn: fix to a race condition in put_log_buffer
From: Anton Volkov
Date: Fri, 28 Jul 2017 17:53:51 +0300

> The synchronization type that was used earlier to guard the loop that
> deletes unused log buffers may have led to a situation that prevents
> any thread from going through the loop.
>
> The patch deletes previously used synchronization mechanism and moves
> the loop under the spin_lock so the similar cases won't be feasible in
> the future.
>
> Found by Linux Driver Verification project (linuxtesting.org).
>
> Signed-off-by: Anton Volkov

This patch doesn't apply at all. It's probably been corrupted by your
email client.
Re: [PATCH net-next 1/2] tcp: extract the function to compute delivery rate
From: Wei Wang
Date: Fri, 28 Jul 2017 10:28:20 -0700

> From: Wei Wang
>
> Refactor the code to extract the function to compute delivery rate.
> This function will be used in later commit.
>
> Signed-off-by: Wei Wang
> Acked-by: Yuchung Cheng
> Acked-by: Soheil Hassas Yeganeh

Applied.
Re: [PATCH net-next 2/2] tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS
From: Wei Wang
Date: Fri, 28 Jul 2017 10:28:21 -0700

> From: Wei Wang
>
> Add the following stats into SCM_TIMESTAMPING_OPT_STATS control msg:
> TCP_NLA_PACING_RATE
> TCP_NLA_DELIVERY_RATE
> TCP_NLA_SND_CWND
> TCP_NLA_REORDERING
> TCP_NLA_MIN_RTT
> TCP_NLA_RECUR_RETRANS
> TCP_NLA_DELIVERY_RATE_APP_LMT
>
> Signed-off-by: Wei Wang
> Acked-by: Yuchung Cheng
> Acked-by: Soheil Hassas Yeganeh

Applied.
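For readers consuming these stats, here is a rough sketch of how a receiver might look up one field, assuming the payload of the control message is a sequence of netlink-style attributes (a 16-bit length, a 16-bit type, then the value, each attribute padded to 4 bytes), as the TCP_NLA_ prefix suggests. struct attr_hdr, ATTR_ALIGN and attr_find() are hypothetical helpers written for this illustration; real code would use the definitions from the kernel uapi headers, and the attribute type numbers are deliberately not spelled out here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the netlink attribute header layout (assumed, see lead-in). */
struct attr_hdr {
	uint16_t len;   /* header plus payload, excluding padding */
	uint16_t type;
};

/* Attributes start on 4-byte boundaries. */
#define ATTR_ALIGN(n) (((n) + 3u) & ~3u)

/*
 * Return a pointer to the payload of the first attribute of the given type
 * inside buf[0..buflen), storing the payload length in *payload_len.
 * Returns NULL if the type is absent or the buffer is malformed.
 */
static const void *attr_find(const uint8_t *buf, size_t buflen,
			     uint16_t type, size_t *payload_len)
{
	size_t off = 0;

	while (off + sizeof(struct attr_hdr) <= buflen) {
		struct attr_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return NULL;	/* truncated or corrupt attribute */
		if (h.type == type) {
			*payload_len = h.len - sizeof(h);
			return buf + off + sizeof(h);
		}
		off += ATTR_ALIGN(h.len);
	}
	return NULL;
}
```

A caller would pass the data portion of the received control message and the type of the field it wants, then copy the value out with memcpy() after checking the returned length against the expected width.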
Re: [PATCH net] ipv6: set fc_protocol with 0 when rtm_protocol is RTPROT_REDIRECT
On 7/30/17 9:31 PM, Xin Long wrote:
>> Did you look at removing this hunk from rt6_fill_node:
>>
>>         if (rt->rt6i_flags & RTF_DYNAMIC)
>>                 rtm->rtm_protocol = RTPROT_REDIRECT;
>>         else if (rt->rt6i_flags & RTF_ADDRCONF) {
>>                 if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
>>                         rtm->rtm_protocol = RTPROT_RA;
>>                 else
>>                         rtm->rtm_protocol = RTPROT_KERNEL;
>>         }
> The issue seems to affect "ip -6 route flush all" as well, not only cache
> since 'else if {}' also causes rtm proto being different from rt6 proto.
>
>>
>> And have rtm_protocol set properly on the route when it is installed?
> The codes not keeping rtm proto consistent with rt6 proto day 1,
> any idea on why it didn't use rt6 proto in kernel properly?

no, AFAIK it was just an oversight when the original code was written. I
do not know of any reason that would prevent properly setting the
rt6i_protocol in the route when it is allocated. Something like this
(not compiled, much less tested):

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4d30c96a819d..9a928839d247 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2347,6 +2347,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 	if (!nrt)
 		goto out;

+	nrt->rt6i_protocol = RTPROT_REDIRECT;
 	nrt->rt6i_flags = RTF_GATEWAY|RTF_UP|RTF_DYNAMIC|RTF_CACHE;
 	if (on_link)
 		nrt->rt6i_flags &= ~RTF_GATEWAY;
@@ -2461,6 +2462,7 @@ static struct rt6_info *rt6_add_route_info(struct net *net,
 		.fc_dst_len	= prefixlen,
 		.fc_flags	= RTF_GATEWAY | RTF_ADDRCONF | RTF_ROUTEINFO |
 				  RTF_UP | RTF_PREF(pref),
+		.fc_protocol	= RTPROT_RA,
 		.fc_nlinfo.portid = 0,
 		.fc_nlinfo.nlh = NULL,
 		.fc_nlinfo.nl_net = net,
@@ -2513,6 +2515,7 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
 		.fc_ifindex	= dev->ifindex,
 		.fc_flags	= RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT |
 				  RTF_UP | RTF_EXPIRES | RTF_PREF(pref),
+		.fc_protocol	= RTPROT_RA,
 		.fc_nlinfo.portid = 0,
 		.fc_nlinfo.nlh = NULL,
 		.fc_nlinfo.nl_net = dev_net(dev),
@@ -3424,14 +3427,6 @@ static int rt6_fill_node(struct net *net,
 	rtm->rtm_flags = 0;
 	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
 	rtm->rtm_protocol = rt->rt6i_protocol;
-	if (rt->rt6i_flags & RTF_DYNAMIC)
-		rtm->rtm_protocol = RTPROT_REDIRECT;
-	else if (rt->rt6i_flags & RTF_ADDRCONF) {
-		if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
-			rtm->rtm_protocol = RTPROT_RA;
-		else
-			rtm->rtm_protocol = RTPROT_KERNEL;
-	}

 	if (rt->rt6i_flags & RTF_CACHE)
 		rtm->rtm_flags |= RTM_F_CLONED;
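The mismatch under discussion can be illustrated with a small user-space model. It is only a sketch: the flag bits and protocol ids below are placeholders chosen for the example (not the real RTF_* and RTPROT_* values), and struct route, dump_proto_old() and dump_proto_new() are hypothetical names.

```c
#include <stdint.h>

/* Placeholder flag bits and protocol ids; NOT the kernel uapi values. */
enum { F_DYNAMIC = 1u << 0, F_ADDRCONF = 1u << 1,
       F_DEFAULT = 1u << 2, F_ROUTEINFO = 1u << 3 };
enum { PROTO_UNSPEC = 0, PROTO_REDIRECT, PROTO_KERNEL, PROTO_RA };

/* Hypothetical stand-in for the two rt6_info fields involved. */
struct route {
	uint32_t flags;     /* models rt6i_flags */
	uint8_t  protocol;  /* models rt6i_protocol */
};

/* Dump-time behaviour before the proposal: flags override the stored value. */
static uint8_t dump_proto_old(const struct route *rt)
{
	uint8_t proto = rt->protocol;

	if (rt->flags & F_DYNAMIC)
		proto = PROTO_REDIRECT;
	else if (rt->flags & F_ADDRCONF) {
		if (rt->flags & (F_DEFAULT | F_ROUTEINFO))
			proto = PROTO_RA;
		else
			proto = PROTO_KERNEL;
	}
	return proto;
}

/* Proposed behaviour: the protocol is set at install time and reported as is. */
static uint8_t dump_proto_new(const struct route *rt)
{
	return rt->protocol;
}
```

In this model a route installed with F_ADDRCONF | F_DEFAULT but no stored protocol is dumped as PROTO_RA by the old logic while the stored value says otherwise, so a request that filters on the dumped protocol cannot match the route. Setting the protocol when the route is created makes both views agree.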
Re: Long stalls creating a new netns after a netns with a SMB client exits
On 7/31/17 4:01 PM, Cong Wang wrote:
>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>> index 3a19ea28339f..37db087b6c97 100644
>>> --- a/net/ipv4/tcp_ipv4.c
>>> +++ b/net/ipv4/tcp_ipv4.c
>>> @@ -1855,7 +1855,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
>>>  {
>>>         struct dst_entry *dst = skb_dst(skb);
>>>
>>> -       if (dst && dst_hold_safe(dst)) {
>>> +       if (0 && dst && dst_hold_safe(dst)) {
>>>                 sk->sk_rx_dst = dst;
>>>                 inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
>>>         }
>>
>>
>> This removes the 200s stall (the test is IPv4/TCP based)
>
>
> Interesting. This means we have a kernel socket which holds
> the dst refcnt.

Right now there is no tracking that I am aware of for a dst cached on
the socket (outside of walking all sockets). I have been bitten by it
several times in trying to make various changes. It's basically a hidden
reference for the device.
[PATCH net-next v2 0/3] netvsc: transparent SR-IOV VF support
This patch set changes how SR-IOV Virtual Function devices are managed
in the Hyper-V network driver. It was part of an earlier bundle, but is
now updated.

Background

In Hyper-V, SR-IOV can be enabled (and disabled) by changing guest
settings on the host. When SR-IOV is enabled, a matching PCI device is
hot plugged and visible on the guest. The VF device is an add-on to an
existing netvsc device, and has the same MAC address.

How is this different?

The original support of VF relied on using the bonding driver in active
standby mode to handle the VF device.

With the new netvsc VF logic, the Linux Hyper-V network virtual driver
will directly manage the link to the SR-IOV VF device. When a VF device
is detected (hot plug) it is automatically made a slave device of the
netvsc device. The VF device state reflects the state of the netvsc
device; i.e. if netvsc is set down, then VF is set down. If netvsc is
set up, then VF is brought up.

Packet flow is independent of VF status; all packets are sent and
received as if they were associated with the netvsc device. If the VF is
removed or its link is down then the synthetic VMBUS path is used.

What was wrong with using the bonding script?

A lot of work went into getting the bonding script to work on all
distributions, but it was a major struggle. Linux network devices can be
configured many, many ways and there is no one solution from userspace
to make it all work. What is really hard is when configuration is
attached to the synthetic device during boot (eth0) and then the same
addresses and firewall rules need to also work later if doing bonding.
The new code gets around all of this.

How does VF work during initialization?

Since all packets are sent and received through the logical netvsc
device, initialization is much easier. Just configure the regular netvsc
Ethernet device; when/if SR-IOV is enabled it just works. Provisioning
and cloud init only need to worry about setting up the netvsc device
(eth0). If SR-IOV is enabled (even as a later step), the address and
rules stay the same.

What devices show up?

Both netvsc and PCI devices are visible in the system. The netvsc device
is active and named in the usual manner (eth0). The PCI device is
visible to Linux and gets renamed by udev to a persistent name
(enP2p3s0). The PCI device name is now irrelevant.

The logic also sets the SLAVE flag on the PCI VF network device so
network tools can see the relationship if they are smart enough to
understand how layered devices work.

This is a lot like how I see Windows working. The VF device is visible
in Device Manager, but is not configured.

Is there any performance impact?

There is no visible change in performance. The bonding and netvsc
drivers both have equivalent steps.

Is it compatible with the old bonding script?

It turns out that if you use the old bonding script, then everything
still works but in a sub-optimal manner. What happens is that bonding is
unable to steal the VF from the netvsc device so it creates a one-legged
bond. Packet flow then is:
    bond0 <--> eth0 <--> VF (enP2p3s0).
In other words, if you get it wrong it still works, just awkward and
slower.

What if I add an address or firewall rule onto the VF?

The same problems occur now as already occur with bonding, bridging and
teaming on Linux if the user incorrectly applies configuration to an
underlying slave device. It will sort of work, packets will come in and
out, but the Linux kernel gets confused and things like ARP don't work
right. There is no way to block manipulation of the slave device, and I
am sure someone will find some special use case where they want it.
Stephen Hemminger (3): netvsc: transparent VF management netvsc: add documentation netvsc: remove bonding setup script Documentation/networking/netvsc.txt | 63 ++ MAINTAINERS | 1 + drivers/net/hyperv/hyperv_net.h | 12 ++ drivers/net/hyperv/netvsc_drv.c | 419 tools/hv/bondvf.sh | 255 -- 5 files changed, 406 insertions(+), 344 deletions(-) create mode 100644 Documentation/networking/netvsc.txt delete mode 100755 tools/hv/bondvf.sh -- 2.11.0
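The datapath rule described in the cover letter (use the VF when it is present and its link is up, otherwise fall back to the synthetic VMBUS path, and carry nothing when the netvsc device itself is down) can be summarised in a few lines. This is a simplified model, not the driver code: struct vf_dev, struct synthetic_dev, enum datapath and pick_datapath() are hypothetical names introduced only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the VF slave device. */
struct vf_dev {
	bool link_up;
};

/* Hypothetical model of the netvsc (synthetic) device. */
struct synthetic_dev {
	bool up;            /* administrative state of the netvsc device */
	struct vf_dev *vf;  /* NULL when no VF is currently attached */
};

enum datapath { PATH_NONE, PATH_SYNTHETIC, PATH_VF };

/*
 * Choose the path a packet takes. The VF follows the netvsc device's
 * state, so a netvsc device that is down carries no traffic at all.
 */
static enum datapath pick_datapath(const struct synthetic_dev *d)
{
	if (!d->up)
		return PATH_NONE;
	if (d->vf && d->vf->link_up)
		return PATH_VF;
	return PATH_SYNTHETIC;
}
```

The point of the design is that callers only ever configure and address the synthetic device; hot-adding or removing the VF changes which branch is taken without changing anything visible to userspace configuration.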
[PATCH net-next v2 1/3] netvsc: transparent VF management
This patch implements transparent fail over from synthetic NIC to SR-IOV virtual function NIC in Hyper-V environment. It is a better alternative to using bonding as is done now. Instead, the receive and transmit fail over is done internally inside the driver. Using bonding driver has lots of issues because it depends on the script being run early enough in the boot process and with sufficient information to make the association. This patch moves all that functionality into the kernel. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 12 ++ drivers/net/hyperv/netvsc_drv.c | 419 +++- 2 files changed, 342 insertions(+), 89 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index f2cef5aaed1f..c701b059c5ac 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -680,6 +680,15 @@ struct netvsc_ethtool_stats { unsigned long tx_busy; }; +struct netvsc_vf_pcpu_stats { + u64 rx_packets; + u64 rx_bytes; + u64 tx_packets; + u64 tx_bytes; + struct u64_stats_sync syncp; + u32 tx_dropped; +}; + struct netvsc_reconfig { struct list_head list; u32 event; @@ -713,6 +722,9 @@ struct net_device_context { /* State to manage the associated VF interface. */ struct net_device __rcu *vf_netdev; + struct netvsc_vf_pcpu_stats __percpu *vf_stats; + struct work_struct vf_takeover; + struct work_struct vf_notify; /* 1: allocated, serial number is valid. 
0: not allocated */ u32 vf_alloc; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 8ff4cbf582cc..fef80dcbd71b 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -71,6 +72,7 @@ static void netvsc_set_multicast_list(struct net_device *net) static int netvsc_open(struct net_device *net) { struct net_device_context *ndev_ctx = netdev_priv(net); + struct net_device *vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev); struct netvsc_device *nvdev = rtnl_dereference(ndev_ctx->nvdev); struct rndis_device *rdev; int ret = 0; @@ -87,15 +89,29 @@ static int netvsc_open(struct net_device *net) netif_tx_wake_all_queues(net); rdev = nvdev->extension; - if (!rdev->link_state && !ndev_ctx->datapath) + + if (!rdev->link_state) netif_carrier_on(net); - return ret; + if (vf_netdev) { + /* Setting synthetic device up transparently sets +* slave as up. If open fails, then slave will be +* still be offline (and not used). 
+*/ + ret = dev_open(vf_netdev); + if (ret) + netdev_warn(net, + "unable to open slave: %s: %d\n", + vf_netdev->name, ret); + } + return 0; } static int netvsc_close(struct net_device *net) { struct net_device_context *net_device_ctx = netdev_priv(net); + struct net_device *vf_netdev + = rtnl_dereference(net_device_ctx->vf_netdev); struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev); int ret; u32 aread, i, msec = 10, retry = 0, retry_max = 20; @@ -141,6 +157,9 @@ static int netvsc_close(struct net_device *net) ret = -ETIMEDOUT; } + if (vf_netdev) + dev_close(vf_netdev); + return ret; } @@ -224,13 +243,11 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev, * * TODO support XPS - but get_xps_queue not exported */ -static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, - void *accel_priv, select_queue_fallback_t fallback) +static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb) { - unsigned int num_tx_queues = ndev->real_num_tx_queues; int q_idx = sk_tx_queue_get(skb->sk); - if (q_idx < 0 || skb->ooo_okay) { + if (q_idx < 0 || skb->ooo_okay || q_idx >= ndev->real_num_tx_queues) { /* If forwarding a packet, we use the recorded queue when * available for better cache locality. */ @@ -240,12 +257,33 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, q_idx = netvsc_get_tx_queue(ndev, skb, q_idx); } - while (unlikely(q_idx >= num_tx_queues)) - q_idx -= num_tx_queues; - return q_idx; } +static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, + void *accel_priv, + select_queue_fallback_t fallback) +{ +
[PATCH net-next v2 3/3] netvsc: remove bonding setup script
No longer needed, now all managed by transparent VF logic. Signed-off-by: Stephen Hemminger--- tools/hv/bondvf.sh | 255 - 1 file changed, 255 deletions(-) delete mode 100755 tools/hv/bondvf.sh diff --git a/tools/hv/bondvf.sh b/tools/hv/bondvf.sh deleted file mode 100755 index 80f102860cf8.. --- a/tools/hv/bondvf.sh +++ /dev/null @@ -1,255 +0,0 @@ -#!/bin/bash - -# This example script creates bonding network devices based on synthetic NIC -# (the virtual network adapter usually provided by Hyper-V) and the matching -# VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can -# function as one network device, and fail over to the synthetic NIC if VF is -# down. -# -# Usage: -# - After configured vSwitch and vNIC with SRIOV, start Linux virtual -# machine (VM) -# - Run this scripts on the VM. It will create configuration files in -# distro specific directory. -# - Reboot the VM, so that the bonding config are enabled. -# -# The config files are DHCP by default. You may edit them if you need to change -# to Static IP or change other settings. 
-# - -sysdir=/sys/class/net -netvsc_cls={f8615163-df3e-46c5-913f-f2d2f965ed0e} -bondcnt=0 - -# Detect Distro -if [ -f /etc/redhat-release ]; -then - cfgdir=/etc/sysconfig/network-scripts - distro=redhat -elif grep -q 'Ubuntu' /etc/issue -then - cfgdir=/etc/network - distro=ubuntu -elif grep -q 'SUSE' /etc/issue -then - cfgdir=/etc/sysconfig/network - distro=suse -else - echo "Unsupported Distro" - exit 1 -fi - -echo Detected Distro: $distro, or compatible - -# Get a list of ethernet names -list_eth=(`cd $sysdir && ls -d */ | cut -d/ -f1 | grep -v bond`) -eth_cnt=${#list_eth[@]} - -echo List of net devices: - -# Get the MAC addresses -for (( i=0; i < $eth_cnt; i++ )) -do - list_mac[$i]=`cat $sysdir/${list_eth[$i]}/address` - echo ${list_eth[$i]}, ${list_mac[$i]} -done - -# Find NIC with matching MAC -for (( i=0; i < $eth_cnt-1; i++ )) -do - for (( j=i+1; j < $eth_cnt; j++ )) - do - if [ "${list_mac[$i]}" = "${list_mac[$j]}" ] - then - list_match[$i]=${list_eth[$j]} - break - fi - done -done - -function create_eth_cfg_redhat { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo DEVICE=$1 >>$fn - echo TYPE=Ethernet >>$fn - echo BOOTPROTO=none >>$fn - echo UUID=`uuidgen` >>$fn - echo ONBOOT=yes >>$fn - echo PEERDNS=yes >>$fn - echo IPV6INIT=yes >>$fn - echo MASTER=$2 >>$fn - echo SLAVE=yes >>$fn -} - -function create_eth_cfg_pri_redhat { - create_eth_cfg_redhat $1 $2 -} - -function create_bond_cfg_redhat { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo DEVICE=$1 >>$fn - echo TYPE=Bond >>$fn - echo BOOTPROTO=dhcp >>$fn - echo UUID=`uuidgen` >>$fn - echo ONBOOT=yes >>$fn - echo PEERDNS=yes >>$fn - echo IPV6INIT=yes >>$fn - echo BONDING_MASTER=yes >>$fn - echo BONDING_OPTS=\"mode=active-backup miimon=100 primary=$2\" >>$fn -} - -function del_eth_cfg_ubuntu { - local mainfn=$cfgdir/interfaces - local fnlist=( $mainfn ) - - local dirlist=(`awk '/^[ \t]*source/{print $2}' $mainfn`) - - local i - for i in "${dirlist[@]}" - do - fnlist+=(`ls $i 2>/dev/null`) - done - - local 
tmpfl=$(mktemp) - - local nic_start='^[ \t]*(auto|iface|mapping|allow-.*)[ \t]+'$1 - local nic_end='^[ \t]*(auto|iface|mapping|allow-.*|source)' - - local fn - for fn in "${fnlist[@]}" - do - awk "/$nic_end/{x=0} x{next} /$nic_start/{x=1;next} 1" \ - $fn >$tmpfl - - cp $tmpfl $fn - done - - rm $tmpfl -} - -function create_eth_cfg_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - echo $'\n'auto $1 >>$fn - echo iface $1 inet manual >>$fn - echo bond-master $2 >>$fn -} - -function create_eth_cfg_pri_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - echo $'\n'allow-hotplug $1 >>$fn - echo iface $1 inet manual >>$fn - echo bond-master $2 >>$fn - echo bond-primary $1 >>$fn -} - -function create_bond_cfg_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - - echo $'\n'auto $1 >>$fn - echo iface $1 inet dhcp >>$fn - echo bond-mode active-backup >>$fn - echo bond-miimon 100 >>$fn - echo bond-slaves none >>$fn -} - -function create_eth_cfg_suse { -local fn=$cfgdir/ifcfg-$1 - -rm -f $fn - echo BOOTPROTO=none >>$fn - echo STARTMODE=auto >>$fn -} - -function create_eth_cfg_pri_suse { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo BOOTPROTO=none >>$fn - echo
[PATCH net-next v2 2/3] netvsc: add documentation
Add some background documentation on netvsc device options and limitations. Signed-off-by: Stephen Hemminger--- Documentation/networking/netvsc.txt | 63 + MAINTAINERS | 1 + 2 files changed, 64 insertions(+) create mode 100644 Documentation/networking/netvsc.txt diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt new file mode 100644 index ..4ddb4e4b0426 --- /dev/null +++ b/Documentation/networking/netvsc.txt @@ -0,0 +1,63 @@ +Hyper-V network driver +== + +Compatibility += + +This driver is compatible with Windows Server 2012 R2, 2016 and +Windows 10. + +Features + + + Checksum offload + + The netvsc driver supports checksum offload as long as the + Hyper-V host version does. Windows Server 2016 and Azure + support checksum offload for TCP and UDP for both IPv4 and + IPv6. Windows Server 2012 only supports checksum offload for TCP. + + Receive Side Scaling + + Hyper-V supports receive side scaling. For TCP, packets are + distributed among available queues based on IP address and port + number. Current versions of Hyper-V host, only distribute UDP + packets based on the IP source and destination address. + The port number is not used as part of the hash value for UDP. + Fragmented IP packets are not distributed between queues; + all fragmented packets arrive on the first channel. + + Generic Receive Offload, aka GRO + + The driver supports GRO and it is enabled by default. GRO coalesces + like packets and significantly reduces CPU usage under heavy Rx + load. + + SR-IOV support + -- + Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV + is enabled in both the vSwitch and the guest configuration, then the + Virtual Function (VF) device is passed to the guest as a PCI + device. In this case, both a synthetic (netvsc) and VF device are + visible in the guest OS and both NIC's have the same MAC address. + + The VF is enslaved by netvsc device. 
The netvsc driver will transparently + switch the data path to the VF when it is available and up. + Network state (addresses, firewall, etc) should be applied only to the + netvsc device; the slave device should not be accessed directly in + most cases. The exceptions are if some special queue discipline or + flow direction is desired, these should be applied directly to the + VF slave device. + + Receive Buffer + -- + Packets are received into a receive area which is created when device + is probed. The receive area is broken into MTU sized chunks and each may + contain one or more packets. The number of receive sections may be changed + via ethtool Rx ring parameters. + + There is a similar send buffer which is used to aggregate packets for sending. + The send area is broken into chunks of 6144 bytes, each of section may + contain one or more packets. The send buffer is an optimization, the driver + will use slower method to handle very large packets or if the send buffer + area is exhausted. diff --git a/MAINTAINERS b/MAINTAINERS index 297e610c9163..d30c17df1deb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6294,6 +6294,7 @@ M:Haiyang Zhang M: Stephen Hemminger L: de...@linuxdriverproject.org S: Maintained +F: Documentation/networking/netvsc.txt F: arch/x86/include/asm/mshyperv.h F: arch/x86/include/uapi/asm/hyperv.h F: arch/x86/kernel/cpu/mshyperv.c -- 2.11.0
Re: [PATCH net-next v12 0/4] net sched actions: improve dump performance
On Mon, 31 Jul 2017 08:06:42 -0400 Jamal Hadi Salim wrote:
> On 17-07-30 10:28 PM, David Miller wrote:
> >
> > Series applied, thanks.
> >
>
> Thanks David.
>
> Attaching the iproute2 patch. I will submit an official one with
> man page changes later. Stephen - you take net-next changes?
>
> cheers,
> jamal

I will fix this up. The kernel headers for iproute2 come from sanitized
kernel headers (not direct copy).
Re: [PATCH net-next 4/4] pci-hyperv: do not sleep in compose_msi_msg
On Mon, 31 Jul 2017 16:37:12 -0700 Stephen Hemminger wrote:
> The setup of MSI with Hyper-V host was sleeping with locks held.
> This error is reported when doing SR-IOV hotplug with kernel built with
> lockdep.
>
> BUG: sleeping function called from invalid context at kernel/sched/completion.c:93
> in_atomic(): 1, irqs_disabled(): 1, pid: 1405, name: ip
> 3 locks held by ip/1405:
>  #0: (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1b/0x40
>  #1: (>request_mutex){+.+...}, at: [] __setup_irq+0xb3/0x720
>  #2: (_desc_lock_class){-.-...}, at: [] __setup_irq+0xe5/0x720
> irq event stamp: 3476
> hardirqs last enabled at (3475): [] get_page_from_freelist+0x225/0xc90
> hardirqs last disabled at (3476): [] _raw_spin_lock_irqsave+0x27/0x90
> softirqs last enabled at (2446): [] ixgbevf_configure+0x380/0x7c0 [ixgbevf]
> softirqs last disabled at (2444): [] ixgbevf_configure+0x35d/0x7c0 [ixgbevf]
>
> The workaround is to poll for host response instead of blocking on
> completion.
>
> Signed-off-by: Stephen Hemminger

This patch is not directly network related. It needs to go through PCI.
I will resend the series.
Re: [PATCHv4 net] ipv6: no need to check rt->dst.error when get route info
From: David Ahern
Date: Mon, 31 Jul 2017 17:34:09 -0600

> On 7/31/17 5:22 PM, David Miller wrote:
>> From: Hangbin Liu
>> Date: Fri, 28 Jul 2017 00:25:36 +0800
>>
>>> After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib
>>> result when requested"). When we get a prohibit entry, we will return
>>> -EACCES directly instead of dump route info.
>>>
>>> Fix it by remove the rt->dst.error check.
>> ...
>>> Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...")
>>> Signed-off-by: Hangbin Liu
>>
>> David A., where are we on this?
>>
>
> Dizzy from running in circles.

:-)

> Question I posed to you Saturday morning, 8:41 MDT [1]:
>
> "... Roopa's fibmatch patches caused a change in user behavior in IPv6
> getroute for prohibit, blackhole and unreachable route entries. Opinions
> on whether we should limit that new behavior to just the fibmatch lookup
> in which case a patch is needed or take the new behavior and consistency
> in which case nothing is needed?"
>
> Personally, after all the discussion I think the behavior as it is right
> now is best.
>
> [1] https://www.spinics.net/lists/netdev/msg446571.html

I agree with you that we should keep the behavior as is.
[PATCH net-next 4/4] pci-hyperv: do not sleep in compose_msi_msg
The setup of MSI with Hyper-V host was sleeping with locks held.
This error is reported when doing SR-IOV hotplug with kernel built with
lockdep.

BUG: sleeping function called from invalid context at kernel/sched/completion.c:93
in_atomic(): 1, irqs_disabled(): 1, pid: 1405, name: ip
3 locks held by ip/1405:
 #0: (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1b/0x40
 #1: (>request_mutex){+.+...}, at: [] __setup_irq+0xb3/0x720
 #2: (_desc_lock_class){-.-...}, at: [] __setup_irq+0xe5/0x720
irq event stamp: 3476
hardirqs last enabled at (3475): [] get_page_from_freelist+0x225/0xc90
hardirqs last disabled at (3476): [] _raw_spin_lock_irqsave+0x27/0x90
softirqs last enabled at (2446): [] ixgbevf_configure+0x380/0x7c0 [ixgbevf]
softirqs last disabled at (2444): [] ixgbevf_configure+0x35d/0x7c0 [ixgbevf]

The workaround is to poll for host response instead of blocking on
completion.

Signed-off-by: Stephen Hemminger
---
 drivers/pci/host/pci-hyperv.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 415dcc69a502..334c9a7b8991 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1159,7 +1160,12 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		goto free_int_desc;
 	}
 
-	wait_for_completion(_pkt.host_event);
+	/*
+	 * Since this function is called with IRQ locks held, can't
+	 * do normal wait for completion; instead poll.
+	 */
+	while (!try_wait_for_completion(_pkt.host_event))
+		udelay(100);
 
 	if (comp.comp_pkt.completion_status < 0) {
 		dev_err(>hdev->device,
-- 
2.11.0
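[Editor's note] The patch above replaces a sleeping wait with a busy-poll loop. As a rough user-space illustration of that shape, and not the driver code itself, the sketch below models a completion as a plain counter. The names fake_completion, fake_host, fake_try_wait, fake_delay and poll_for_reply are all made up for this example; only try_wait_for_completion() and udelay() come from the patch, and they are merely imitated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct fake_completion {
	unsigned int done;	/* number of completions not yet consumed */
};

/* Hypothetical host that answers after a fixed number of delay ticks. */
struct fake_host {
	struct fake_completion *event;
	unsigned int ticks_until_reply;
};

/*
 * Imitates the contract of try_wait_for_completion(): consume one
 * completion if one is pending, and never sleep.
 */
static bool fake_try_wait(struct fake_completion *c)
{
	if (c->done == 0)
		return false;
	c->done--;
	return true;
}

/* Stand-in for udelay(): time passes and the simulated host makes progress. */
static void fake_delay(struct fake_host *h)
{
	if (h->ticks_until_reply > 0 && --h->ticks_until_reply == 0)
		h->event->done++;
}

/*
 * Poll loop in the same shape as the patch. Like the patch, it has no
 * upper bound: if the host never replies, it spins forever.
 * Returns the number of delays that were needed.
 */
static unsigned int poll_for_reply(struct fake_host *h)
{
	unsigned int polls = 0;

	assert(h != NULL && h->event != NULL);
	while (!fake_try_wait(h->event)) {
		fake_delay(h);
		polls++;
	}
	return polls;
}
```

The point of the pattern is that nothing in the loop can sleep, which is what a caller holding IRQ locks needs; the cost is CPU time burned while waiting.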
[PATCH net-next 1/4] netvsc: transparent VF management
This patch implements transparent fail over from synthetic NIC to SR-IOV virtual function NIC in Hyper-V environment. It is a better alternative to using bonding as is done now. Instead, the receive and transmit fail over is done internally inside the driver. Using bonding driver has lots of issues because it depends on the script being run early enough in the boot process and with sufficient information to make the association. This patch moves all that functionality into the kernel. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 12 ++ drivers/net/hyperv/netvsc_drv.c | 419 +++- 2 files changed, 342 insertions(+), 89 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index f2cef5aaed1f..c701b059c5ac 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -680,6 +680,15 @@ struct netvsc_ethtool_stats { unsigned long tx_busy; }; +struct netvsc_vf_pcpu_stats { + u64 rx_packets; + u64 rx_bytes; + u64 tx_packets; + u64 tx_bytes; + struct u64_stats_sync syncp; + u32 tx_dropped; +}; + struct netvsc_reconfig { struct list_head list; u32 event; @@ -713,6 +722,9 @@ struct net_device_context { /* State to manage the associated VF interface. */ struct net_device __rcu *vf_netdev; + struct netvsc_vf_pcpu_stats __percpu *vf_stats; + struct work_struct vf_takeover; + struct work_struct vf_notify; /* 1: allocated, serial number is valid. 
0: not allocated */ u32 vf_alloc; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 8ff4cbf582cc..fef80dcbd71b 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -71,6 +72,7 @@ static void netvsc_set_multicast_list(struct net_device *net) static int netvsc_open(struct net_device *net) { struct net_device_context *ndev_ctx = netdev_priv(net); + struct net_device *vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev); struct netvsc_device *nvdev = rtnl_dereference(ndev_ctx->nvdev); struct rndis_device *rdev; int ret = 0; @@ -87,15 +89,29 @@ static int netvsc_open(struct net_device *net) netif_tx_wake_all_queues(net); rdev = nvdev->extension; - if (!rdev->link_state && !ndev_ctx->datapath) + + if (!rdev->link_state) netif_carrier_on(net); - return ret; + if (vf_netdev) { + /* Setting synthetic device up transparently sets +* slave as up. If open fails, then slave will be +* still be offline (and not used). 
+*/ + ret = dev_open(vf_netdev); + if (ret) + netdev_warn(net, + "unable to open slave: %s: %d\n", + vf_netdev->name, ret); + } + return 0; } static int netvsc_close(struct net_device *net) { struct net_device_context *net_device_ctx = netdev_priv(net); + struct net_device *vf_netdev + = rtnl_dereference(net_device_ctx->vf_netdev); struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev); int ret; u32 aread, i, msec = 10, retry = 0, retry_max = 20; @@ -141,6 +157,9 @@ static int netvsc_close(struct net_device *net) ret = -ETIMEDOUT; } + if (vf_netdev) + dev_close(vf_netdev); + return ret; } @@ -224,13 +243,11 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev, * * TODO support XPS - but get_xps_queue not exported */ -static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, - void *accel_priv, select_queue_fallback_t fallback) +static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb) { - unsigned int num_tx_queues = ndev->real_num_tx_queues; int q_idx = sk_tx_queue_get(skb->sk); - if (q_idx < 0 || skb->ooo_okay) { + if (q_idx < 0 || skb->ooo_okay || q_idx >= ndev->real_num_tx_queues) { /* If forwarding a packet, we use the recorded queue when * available for better cache locality. */ @@ -240,12 +257,33 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, q_idx = netvsc_get_tx_queue(ndev, skb, q_idx); } - while (unlikely(q_idx >= num_tx_queues)) - q_idx -= num_tx_queues; - return q_idx; } +static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, + void *accel_priv, + select_queue_fallback_t fallback) +{ +
[PATCH net-next 3/4] netvsc: remove bonding setup script
No longer needed, now all managed by transparent VF logic. Signed-off-by: Stephen Hemminger--- tools/hv/bondvf.sh | 255 - 1 file changed, 255 deletions(-) delete mode 100755 tools/hv/bondvf.sh diff --git a/tools/hv/bondvf.sh b/tools/hv/bondvf.sh deleted file mode 100755 index 80f102860cf8.. --- a/tools/hv/bondvf.sh +++ /dev/null @@ -1,255 +0,0 @@ -#!/bin/bash - -# This example script creates bonding network devices based on synthetic NIC -# (the virtual network adapter usually provided by Hyper-V) and the matching -# VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can -# function as one network device, and fail over to the synthetic NIC if VF is -# down. -# -# Usage: -# - After configured vSwitch and vNIC with SRIOV, start Linux virtual -# machine (VM) -# - Run this scripts on the VM. It will create configuration files in -# distro specific directory. -# - Reboot the VM, so that the bonding config are enabled. -# -# The config files are DHCP by default. You may edit them if you need to change -# to Static IP or change other settings. 
-# - -sysdir=/sys/class/net -netvsc_cls={f8615163-df3e-46c5-913f-f2d2f965ed0e} -bondcnt=0 - -# Detect Distro -if [ -f /etc/redhat-release ]; -then - cfgdir=/etc/sysconfig/network-scripts - distro=redhat -elif grep -q 'Ubuntu' /etc/issue -then - cfgdir=/etc/network - distro=ubuntu -elif grep -q 'SUSE' /etc/issue -then - cfgdir=/etc/sysconfig/network - distro=suse -else - echo "Unsupported Distro" - exit 1 -fi - -echo Detected Distro: $distro, or compatible - -# Get a list of ethernet names -list_eth=(`cd $sysdir && ls -d */ | cut -d/ -f1 | grep -v bond`) -eth_cnt=${#list_eth[@]} - -echo List of net devices: - -# Get the MAC addresses -for (( i=0; i < $eth_cnt; i++ )) -do - list_mac[$i]=`cat $sysdir/${list_eth[$i]}/address` - echo ${list_eth[$i]}, ${list_mac[$i]} -done - -# Find NIC with matching MAC -for (( i=0; i < $eth_cnt-1; i++ )) -do - for (( j=i+1; j < $eth_cnt; j++ )) - do - if [ "${list_mac[$i]}" = "${list_mac[$j]}" ] - then - list_match[$i]=${list_eth[$j]} - break - fi - done -done - -function create_eth_cfg_redhat { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo DEVICE=$1 >>$fn - echo TYPE=Ethernet >>$fn - echo BOOTPROTO=none >>$fn - echo UUID=`uuidgen` >>$fn - echo ONBOOT=yes >>$fn - echo PEERDNS=yes >>$fn - echo IPV6INIT=yes >>$fn - echo MASTER=$2 >>$fn - echo SLAVE=yes >>$fn -} - -function create_eth_cfg_pri_redhat { - create_eth_cfg_redhat $1 $2 -} - -function create_bond_cfg_redhat { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo DEVICE=$1 >>$fn - echo TYPE=Bond >>$fn - echo BOOTPROTO=dhcp >>$fn - echo UUID=`uuidgen` >>$fn - echo ONBOOT=yes >>$fn - echo PEERDNS=yes >>$fn - echo IPV6INIT=yes >>$fn - echo BONDING_MASTER=yes >>$fn - echo BONDING_OPTS=\"mode=active-backup miimon=100 primary=$2\" >>$fn -} - -function del_eth_cfg_ubuntu { - local mainfn=$cfgdir/interfaces - local fnlist=( $mainfn ) - - local dirlist=(`awk '/^[ \t]*source/{print $2}' $mainfn`) - - local i - for i in "${dirlist[@]}" - do - fnlist+=(`ls $i 2>/dev/null`) - done - - local 
tmpfl=$(mktemp) - - local nic_start='^[ \t]*(auto|iface|mapping|allow-.*)[ \t]+'$1 - local nic_end='^[ \t]*(auto|iface|mapping|allow-.*|source)' - - local fn - for fn in "${fnlist[@]}" - do - awk "/$nic_end/{x=0} x{next} /$nic_start/{x=1;next} 1" \ - $fn >$tmpfl - - cp $tmpfl $fn - done - - rm $tmpfl -} - -function create_eth_cfg_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - echo $'\n'auto $1 >>$fn - echo iface $1 inet manual >>$fn - echo bond-master $2 >>$fn -} - -function create_eth_cfg_pri_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - echo $'\n'allow-hotplug $1 >>$fn - echo iface $1 inet manual >>$fn - echo bond-master $2 >>$fn - echo bond-primary $1 >>$fn -} - -function create_bond_cfg_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - - echo $'\n'auto $1 >>$fn - echo iface $1 inet dhcp >>$fn - echo bond-mode active-backup >>$fn - echo bond-miimon 100 >>$fn - echo bond-slaves none >>$fn -} - -function create_eth_cfg_suse { -local fn=$cfgdir/ifcfg-$1 - -rm -f $fn - echo BOOTPROTO=none >>$fn - echo STARTMODE=auto >>$fn -} - -function create_eth_cfg_pri_suse { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo BOOTPROTO=none >>$fn - echo
[PATCH net-next 0/4] netvsc: transparent SR-IOV VF support
This patch set changes how SR-IOV Virtual Function devices are managed
in the Hyper-V network driver. It was part of an earlier bundle, but is
now updated.

Background

In Hyper-V, SR-IOV can be enabled (and disabled) by changing guest settings
on the host. When SR-IOV is enabled, a matching PCI device is hot plugged and
visible in the guest. The VF device is an add-on to an existing netvsc
device, and has the same MAC address.

How is this different?

The original support of VF relied on using the bonding driver in active
standby mode to handle the VF device.

With the new netvsc VF logic, the Linux Hyper-V virtual network driver
will directly manage the link to the SR-IOV VF device. When a VF device is
detected (hot plug) it is automatically made a slave device of the
netvsc device. The VF device state reflects the state of the netvsc
device; i.e. if netvsc is set down, then the VF is set down. If netvsc is
set up, then the VF is brought up.

Packet flow is independent of VF status; all packets are sent and
received as if they were associated with the netvsc device. If the VF is
removed or its link is down, then the synthetic VMBUS path is used.

What was wrong with using the bonding script?

A lot of work went into getting the bonding script to work on all
distributions, but it was a major struggle. Linux network devices can
be configured many, many ways and there is no one solution from
userspace to make it all work. What is really hard is when
configuration is attached to the synthetic device during boot (eth0) and
then the same addresses and firewall rules need to also work later if
doing bonding. The new code gets around all of this.

How does VF work during initialization?

Since all packets are sent and received through the logical netvsc
device, initialization is much easier. Just configure the regular
netvsc Ethernet device; when/if SR-IOV is enabled it just
works. Provisioning and cloud init only need to worry about setting up
the netvsc device (eth0). If SR-IOV is enabled (even as a later step), the
address and rules stay the same.

What devices show up?

Both netvsc and PCI devices are visible in the system. The netvsc
device is active and named in the usual manner (eth0). The PCI device is
visible to Linux and gets renamed by udev to a persistent name
(enP2p3s0). The PCI device name is now irrelevant.

The logic also sets the SLAVE flag on the PCI VF network device
so network tools can see the relationship if they are smart
enough to understand how layered devices work.

This is a lot like how I see Windows working.
The VF device is visible in Device Manager, but is not configured.

Is there any performance impact?

There is no visible change in performance. The bonding
and netvsc driver both have equivalent steps.

Is it compatible with the old bonding script?

It turns out that if you use the old bonding script, then everything
still works but in a sub-optimal manner. What happens is that bonding
is unable to steal the VF from the netvsc device so it creates a one
legged bond. Packet flow then is:
	bond0 <--> eth0 <--> VF (enP2p3s0).
In other words, if you get it wrong it still works, just
awkward and slower.

What if I add an address or firewall rule onto the VF?

The same problems occur now as already occur with bonding, bridging and
teaming on Linux if the user incorrectly applies configuration to
an underlying slave device. It will sort of work, packets will come in
and out, but the Linux kernel gets confused and things like ARP don’t
work right. There is no way to block manipulation of the slave
device, and I am sure someone will find some special use case where
they want it.

Stephen Hemminger (4):
  netvsc: transparent VF management
  netvsc: add documentation
  netvsc: remove bonding setup script
  pci-hyperv: do not sleep in compose_msi_msg

 Documentation/networking/netvsc.txt |  63 ++
 MAINTAINERS                         |   1 +
 drivers/net/hyperv/hyperv_net.h     |  12 ++
 drivers/net/hyperv/netvsc_drv.c     | 419
 drivers/pci/host/pci-hyperv.c       |   8 +-
 tools/hv/bondvf.sh                  | 255 --
 6 files changed, 413 insertions(+), 345 deletions(-)
 create mode 100644 Documentation/networking/netvsc.txt
 delete mode 100755 tools/hv/bondvf.sh

-- 
2.11.0
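[Editor's note] The cover letter describes two rules: the VF's up/down state follows the netvsc device, and traffic falls back to the synthetic VMBUS path when the VF is absent or its link is down. A toy model of those two rules might look like the sketch below. Every name in it (model_vf, model_netvsc, model_set_up, model_pick_datapath) is invented for illustration and does not correspond to the driver's real structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum datapath { DATAPATH_SYNTHETIC, DATAPATH_VF };

/* Hypothetical, heavily simplified view of the two devices. */
struct model_vf {
	bool up;	/* administrative state */
	bool link_ok;	/* carrier state */
};

struct model_netvsc {
	bool up;
	struct model_vf *vf;	/* NULL while no VF is hot plugged */
};

/* Setting the netvsc device up or down is mirrored onto the VF, if any. */
static void model_set_up(struct model_netvsc *nv, bool up)
{
	assert(nv != NULL);
	nv->up = up;
	if (nv->vf)
		nv->vf->up = up;
}

/* Use the VF only when it exists, is up, and has link; otherwise VMBUS. */
static enum datapath model_pick_datapath(const struct model_netvsc *nv)
{
	assert(nv != NULL);
	if (nv->vf && nv->vf->up && nv->vf->link_ok)
		return DATAPATH_VF;
	return DATAPATH_SYNTHETIC;
}
```

In this model the caller only ever touches the netvsc object, which matches the cover letter's advice to configure eth0 and leave the VF alone. How the real driver reacts to a VF appearing while netvsc is already up is not modelled here.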
[PATCH net-next 2/4] netvsc: add documentation
Add some background documentation on netvsc device options and limitations. Signed-off-by: Stephen Hemminger--- Documentation/networking/netvsc.txt | 63 + MAINTAINERS | 1 + 2 files changed, 64 insertions(+) create mode 100644 Documentation/networking/netvsc.txt diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt new file mode 100644 index ..4ddb4e4b0426 --- /dev/null +++ b/Documentation/networking/netvsc.txt @@ -0,0 +1,63 @@ +Hyper-V network driver +== + +Compatibility += + +This driver is compatible with Windows Server 2012 R2, 2016 and +Windows 10. + +Features + + + Checksum offload + + The netvsc driver supports checksum offload as long as the + Hyper-V host version does. Windows Server 2016 and Azure + support checksum offload for TCP and UDP for both IPv4 and + IPv6. Windows Server 2012 only supports checksum offload for TCP. + + Receive Side Scaling + + Hyper-V supports receive side scaling. For TCP, packets are + distributed among available queues based on IP address and port + number. Current versions of Hyper-V host, only distribute UDP + packets based on the IP source and destination address. + The port number is not used as part of the hash value for UDP. + Fragmented IP packets are not distributed between queues; + all fragmented packets arrive on the first channel. + + Generic Receive Offload, aka GRO + + The driver supports GRO and it is enabled by default. GRO coalesces + like packets and significantly reduces CPU usage under heavy Rx + load. + + SR-IOV support + -- + Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV + is enabled in both the vSwitch and the guest configuration, then the + Virtual Function (VF) device is passed to the guest as a PCI + device. In this case, both a synthetic (netvsc) and VF device are + visible in the guest OS and both NIC's have the same MAC address. + + The VF is enslaved by netvsc device. 
The netvsc driver will transparently + switch the data path to the VF when it is available and up. + Network state (addresses, firewall, etc) should be applied only to the + netvsc device; the slave device should not be accessed directly in + most cases. The exceptions are if some special queue discipline or + flow direction is desired, these should be applied directly to the + VF slave device. + + Receive Buffer + -- + Packets are received into a receive area which is created when device + is probed. The receive area is broken into MTU sized chunks and each may + contain one or more packets. The number of receive sections may be changed + via ethtool Rx ring parameters. + + There is a similar send buffer which is used to aggregate packets for sending. + The send area is broken into chunks of 6144 bytes, each of section may + contain one or more packets. The send buffer is an optimization, the driver + will use slower method to handle very large packets or if the send buffer + area is exhausted. diff --git a/MAINTAINERS b/MAINTAINERS index 297e610c9163..d30c17df1deb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6294,6 +6294,7 @@ M:Haiyang Zhang M: Stephen Hemminger L: de...@linuxdriverproject.org S: Maintained +F: Documentation/networking/netvsc.txt F: arch/x86/include/asm/mshyperv.h F: arch/x86/include/uapi/asm/hyperv.h F: arch/x86/kernel/cpu/mshyperv.c -- 2.11.0
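[Editor's note] The "Receive Side Scaling" section of the documentation above states three rules: TCP is spread by address and port, UDP by address only, and fragments always land on the first channel. The sketch below encodes just those rules. The struct flow layout, the function names, and above all the mixing function are assumptions made for this example; the hash the Hyper-V host actually uses is not described in the text and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow description; field names are illustrative only. */
struct flow {
	uint32_t saddr, daddr;	/* IPv4 source and destination address */
	uint16_t sport, dport;	/* transport ports */
	bool is_tcp;		/* false means UDP */
	bool is_fragment;	/* IP fragment */
};

/* Placeholder mixer (FNV-style). This is NOT the hash Hyper-V uses. */
static uint32_t mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 16777619u;
	return h;
}

/* Pick a receive queue following the three rules in the documentation. */
static unsigned int pick_rx_queue(const struct flow *f, unsigned int nqueues)
{
	uint32_t h = 2166136261u;

	assert(nqueues > 0);

	/* Fragments are not distributed; they arrive on the first channel. */
	if (f->is_fragment)
		return 0;

	/* Both protocols hash the address pair. */
	h = mix(h, f->saddr);
	h = mix(h, f->daddr);

	/* Only TCP adds the port numbers to the hash input. */
	if (f->is_tcp) {
		h = mix(h, f->sport);
		h = mix(h, f->dport);
	}
	return h % nqueues;
}
```

One practical consequence that the model makes visible: many UDP flows between the same pair of hosts all map to a single queue, because their ports never enter the hash.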
Re: [PATCHv4 net] ipv6: no need to check rt->dst.error when get route info
On 7/31/17 5:22 PM, David Miller wrote:
> From: Hangbin Liu
> Date: Fri, 28 Jul 2017 00:25:36 +0800
>
>> After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib
>> result when requested"). When we get a prohibit entry, we will return
>> -EACCES directly instead of dump route info.
>>
>> Fix it by remove the rt->dst.error check.
> ...
>> Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...")
>> Signed-off-by: Hangbin Liu
>
> David A., where are we on this?
>

Dizzy from running in circles.

Question I posed to you Saturday morning, 8:41 MDT [1]:

"... Roopa's fibmatch patches caused a change in user behavior in IPv6
getroute for prohibit, blackhole and unreachable route entries. Opinions
on whether we should limit that new behavior to just the fibmatch lookup
in which case a patch is needed or take the new behavior and consistency
in which case nothing is needed?"

Personally, after all the discussion I think the behavior as it is right
now is best.

[1] https://www.spinics.net/lists/netdev/msg446571.html
Re: [PATCHv4 net] ipv6: no need to check rt->dst.error when get route info
From: Hangbin Liu
Date: Fri, 28 Jul 2017 00:25:36 +0800

> After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib
> result when requested"). When we get a prohibit entry, we will return
> -EACCES directly instead of dump route info.
>
> Fix it by remove the rt->dst.error check.
 ...
> Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...")
> Signed-off-by: Hangbin Liu

David A., where are we on this?
Re: [PATCH v2] net: phy: Log only PHY state transitions
From: Marc Gonzalez
Date: Fri, 28 Jul 2017 13:18:30 +0200

> In the current code, old and new PHY states are always logged.
> From now on, log only PHY state transitions.
>
> Signed-off-by: Marc Gonzalez

Applied to net-next, thanks.
Re: [RFC net-next] net ipv6: convert fib6_table rwlock to a percpu lock
On Mon, 31 Jul 2017 10:18:57 -0700 Shaohua Li wrote:
> From: Shaohua Li
>
> In a syn flooding test, the fib6_table rwlock is a significant
> bottleneck. While converting the rwlock to rcu sounds straightforward,
> but is very challenging if it's possible. A percpu spinlock is quite
> trivial for this problem since updating the routing table is a rare
> event. In my test, the server receives around 1.5 Mpps in syn flooding
> test without the patch in a dual sockets and 56-CPU system. With the
> patch, the server receives around 3.8Mpps, and perf report doesn't show
> the locking issue.
>
> Cc: Wei Wang

You just reinvented brlock...
RCU is not that hard, why not do it right?
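[Editor's note] For readers who have not met the idea referred to as "brlock" (big-reader lock) above: as far as I understand the term, readers take only a lock belonging to their own CPU, while a writer must take every per-CPU lock. The user-space sketch below shows that shape with C11 atomics. NSLOTS, the slot index standing in for a CPU number, and all function names are assumptions for this example; it is not the RFC patch and not any kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSLOTS 4	/* stand-in for the number of CPUs (assumption) */

/* One tiny spinlock per "CPU". */
struct brlock {
	atomic_bool slot[NSLOTS];
};

static void brlock_init(struct brlock *l)
{
	for (int i = 0; i < NSLOTS; i++)
		atomic_init(&l->slot[i], false);
}

/* Reader side: touch only the caller's own slot, so readers on
 * different slots never contend with each other. */
static void brlock_read_lock(struct brlock *l, int cpu)
{
	assert(cpu >= 0 && cpu < NSLOTS);
	while (atomic_exchange_explicit(&l->slot[cpu], true,
					memory_order_acquire))
		;	/* spin until the slot is free */
}

static void brlock_read_unlock(struct brlock *l, int cpu)
{
	assert(cpu >= 0 && cpu < NSLOTS);
	atomic_store_explicit(&l->slot[cpu], false, memory_order_release);
}

/* Writer side, non-blocking variant: try to take every slot in a fixed
 * order; on failure release what was taken and report false. */
static bool brlock_write_trylock(struct brlock *l)
{
	for (int i = 0; i < NSLOTS; i++) {
		if (atomic_exchange_explicit(&l->slot[i], true,
					     memory_order_acquire)) {
			while (--i >= 0)
				atomic_store_explicit(&l->slot[i], false,
						      memory_order_release);
			return false;
		}
	}
	return true;
}

static void brlock_write_unlock(struct brlock *l)
{
	for (int i = NSLOTS - 1; i >= 0; i--)
		atomic_store_explicit(&l->slot[i], false,
				      memory_order_release);
}
```

The trade-off is the one both messages allude to: the read side becomes cheap and uncontended across CPUs, while the write side gets more expensive in proportion to the CPU count, which is acceptable only when updates are rare. RCU, the alternative suggested in the reply, avoids read-side locking altogether.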
Re: [PATCH V5 1/2] firmware: add more flexible request_firmware_async function
Hi Rafał, [auto build test WARNING on driver-core/driver-core-testing] [also build test WARNING on v4.13-rc3 next-20170731] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Rafa-Mi-ecki/firmware-add-more-flexible-request_firmware_async-function/20170801-033319 reproduce: make htmldocs All warnings (new ones prefixed by >>): WARNING: convert(1) not found, for SVG to PDF conversion install ImageMagick (https://www.imagemagick.org) include/linux/init.h:1: warning: no structured comments found include/linux/mod_devicetable.h:687: warning: Excess struct/union/enum/typedef member 'ver_major' description in 'fsl_mc_device_id' include/linux/mod_devicetable.h:687: warning: Excess struct/union/enum/typedef member 'ver_minor' description in 'fsl_mc_device_id' kernel/sched/core.c:2080: warning: No description found for parameter 'rf' kernel/sched/core.c:2080: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local' include/linux/wait.h:555: warning: No description found for parameter 'wq' include/linux/wait.h:555: warning: Excess function parameter 'wq_head' description in 'wait_event_interruptible_hrtimeout' include/linux/wait.h:759: warning: No description found for parameter 'wq_head' include/linux/wait.h:759: warning: Excess function parameter 'wq' description in 'wait_event_killable' include/linux/kthread.h:26: warning: Excess function parameter '...' 
description in 'kthread_create' kernel/sys.c:1: warning: no structured comments found include/linux/device.h:968: warning: No description found for parameter 'dma_ops' drivers/dma-buf/seqno-fence.c:1: warning: no structured comments found >> drivers/base/firmware_class.c:1: warning: no structured comments found include/linux/iio/iio.h:603: warning: No description found for parameter 'trig_readonly' include/linux/iio/trigger.h:151: warning: No description found for parameter 'indio_dev' include/linux/iio/trigger.h:151: warning: No description found for parameter 'trig' include/linux/device.h:969: warning: No description found for parameter 'dma_ops' drivers/ata/libata-eh.c:1449: warning: No description found for parameter 'link' drivers/ata/libata-eh.c:1449: warning: Excess function parameter 'ap' description in 'ata_eh_done' drivers/ata/libata-eh.c:1590: warning: No description found for parameter 'qc' drivers/ata/libata-eh.c:1590: warning: Excess function parameter 'dev' description in 'ata_eh_request_sense' drivers/mtd/nand/nand_base.c:2751: warning: Excess function parameter 'cached' description in 'nand_write_page' drivers/mtd/nand/nand_base.c:2751: warning: Excess function parameter 'cached' description in 'nand_write_page' arch/s390/include/asm/cmb.h:1: warning: no structured comments found drivers/scsi/scsi_lib.c:1116: warning: No description found for parameter 'rq' drivers/scsi/constants.c:1: warning: no structured comments found include/linux/usb/gadget.h:230: warning: No description found for parameter 'claimed' include/linux/usb/gadget.h:230: warning: No description found for parameter 'enabled' include/linux/usb/gadget.h:412: warning: No description found for parameter 'quirk_altset_not_supp' include/linux/usb/gadget.h:412: warning: No description found for parameter 'quirk_stall_not_supp' include/linux/usb/gadget.h:412: warning: No description found for parameter 'quirk_zlp_not_supp' fs/inode.c:1666: warning: No description found for parameter 'rcu' 
include/linux/jbd2.h:443: warning: No description found for parameter 'i_transaction' include/linux/jbd2.h:443: warning: No description found for parameter 'i_next_transaction' include/linux/jbd2.h:443: warning: No description found for parameter 'i_list' include/linux/jbd2.h:443: warning: No description found for parameter 'i_vfs_inode' include/linux/jbd2.h:443: warning: No description found for parameter 'i_flags' include/linux/jbd2.h:497: warning: No description found for parameter 'h_rsv_handle' include/linux/jbd2.h:497: warning: No description found for parameter 'h_reserved' include/linux/jbd2.h:497: warning: No description found for parameter 'h_type' include/linux/jbd2.h:497: warning: No description found for parameter 'h_line_no' include/linux/jbd2.h:497: warning: No description found for parameter 'h_start_jiffies' include/linux/jbd2.h:497: warning: No description found for parameter 'h_requested_credits' include/linux/jbd2.h:497: warning: No description found for parameter 'saved_alloc_context' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_chkpt_bhs' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_devname' include/linux/jbd2.h:1050: warning: No description found for parameter 'j_average_commit_time' include/linux
Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
On Mon, Jul 31, 2017 at 03:35:02PM -0700, Cong Wang wrote:
> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li wrote:
> >  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >  					__be32 flowlabel, bool autolabel,
> > -					struct flowi6 *fl6)
> > +					struct flowi6 *fl6, u32 hash)
> >  {
> > -	u32 hash;
> > -
> >  	/* @flowlabel may include more than a flow label, eg, the traffic class.
> >  	 * Here we want only the flow label value.
> >  	 */
> > @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> >  	     net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
> >  		return flowlabel;
> >
> > -	hash = skb_get_hash_flowi6(skb, fl6);
> > +	if (skb)
> > +		hash = skb_get_hash_flowi6(skb, fl6);
>
>
> Why not just move skb_get_hash_flowi6() to its caller?
> This check is not necessary. If you don't want to touch
> existing callers, you can just introduce a wrapper:
>
>
> static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
> 					__be32 flowlabel, bool autolabel,
> 					struct flowi6 *fl6)
> {
> 	u32 hash = skb_get_hash_flowi6(skb, fl6);
> 	return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
> }

This will always call skb_get_hash_flowi6 for the fast path even if auto
flowlabel is disabled. I thought we should avoid this.

>
> And your code can just call:
>
> __ip6_make_flowlabel(net, flowlabel, autolabel, sk->sk_txhash);
Re: [PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li wrote:
>  static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>  					__be32 flowlabel, bool autolabel,
> -					struct flowi6 *fl6)
> +					struct flowi6 *fl6, u32 hash)
>  {
> -	u32 hash;
> -
>  	/* @flowlabel may include more than a flow label, eg, the traffic class.
>  	 * Here we want only the flow label value.
>  	 */
> @@ -788,7 +786,8 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
>  	     net->ipv6.sysctl.auto_flowlabels != IP6_AUTO_FLOW_LABEL_FORCED))
>  		return flowlabel;
>
> -	hash = skb_get_hash_flowi6(skb, fl6);
> +	if (skb)
> +		hash = skb_get_hash_flowi6(skb, fl6);

Why not just move skb_get_hash_flowi6() to its caller?
This check is not necessary. If you don't want to touch
existing callers, you can just introduce a wrapper:

static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
					__be32 flowlabel, bool autolabel,
					struct flowi6 *fl6)
{
	u32 hash = skb_get_hash_flowi6(skb, fl6);
	return __ip6_make_flowlabel(net, flowlabel, autolabel, hash);
}

And your code can just call:

__ip6_make_flowlabel(net, flowlabel, autolabel, sk->sk_txhash);
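[Editor's note] The disagreement in this thread is about when the flow hash gets computed: a wrapper that computes it up front pays for the hash even when the function returns early, whereas computing it after the early-return check does not. The sketch below is a deliberately simplified user-space model of that trade-off. The model_* functions, the fixed hash value, and the reduced early-return condition are all assumptions for illustration; the real ip6_make_flowlabel() checks more state (the sysctl shown in the quoted diff, for instance) than this model does.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu	/* an IPv6 flow label is 20 bits wide */

/* Counts how often the "expensive" hash is computed. */
static unsigned int hash_calls;

/* Placeholder for skb_get_hash_flowi6(); returns a fixed value. */
static uint32_t model_get_hash(void)
{
	hash_calls++;
	return 0x12345u;
}

/* Core helper in the shape suggested in the thread: the caller
 * supplies the hash, so no packet buffer is needed here. */
static uint32_t model_flowlabel_core(uint32_t flowlabel, bool autolabel,
				     uint32_t hash)
{
	flowlabel &= FLOWLABEL_MASK;
	if (flowlabel || !autolabel)
		return flowlabel;	/* early return: hash unused */
	return hash & FLOWLABEL_MASK;
}

/* Eager wrapper: simple, but always pays for the hash. */
static uint32_t model_flowlabel_eager(uint32_t flowlabel, bool autolabel)
{
	return model_flowlabel_core(flowlabel, autolabel, model_get_hash());
}

/* Lazy variant: compute the hash only once it is known to be needed. */
static uint32_t model_flowlabel_lazy(uint32_t flowlabel, bool autolabel)
{
	flowlabel &= FLOWLABEL_MASK;
	if (flowlabel || !autolabel)
		return flowlabel;
	return model_get_hash() & FLOWLABEL_MASK;
}
```

Both variants return the same label; they differ only in how many times model_get_hash() runs on the early-return path, which is the fast-path cost raised in the reply.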
[PATCH net-next 01/11] net: dsa: make EEE ops optional
Even though EEE implies the port's PHY and MAC of both ends, a switch may not need to do anything to configure the port's MAC. This makes it impossible for the DSA layer to distinguish e.g. this case from a disabled EEE when a driver returns 0 from the get EEE operation.

For this reason, make the EEE ops optional and call them only when provided. Calling it first allows a switch driver to stop the whole operation at runtime if a given switch does not support the EEE setting.

If both the MAC operation and PHY are not present, -ENODEV is returned.

Signed-off-by: Vivien Didelot
---
 net/dsa/slave.c | 44
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9507bd38cf04..518145ced434 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -646,38 +646,42 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 	struct dsa_switch *ds = p->dp->ds;
-	int ret;
+	int err = -ENODEV;

-	if (!ds->ops->set_eee)
-		return -EOPNOTSUPP;
+	if (ds->ops->set_eee) {
+		err = ds->ops->set_eee(ds, p->dp->index, p->phy, e);
+		if (err)
+			return err;
+	}

-	ret = ds->ops->set_eee(ds, p->dp->index, p->phy, e);
-	if (ret)
-		return ret;
+	if (p->phy) {
+		err = phy_ethtool_set_eee(p->phy, e);
+		if (err)
+			return err;
+	}

-	if (p->phy)
-		ret = phy_ethtool_set_eee(p->phy, e);
-
-	return ret;
+	return err;
 }

 static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 	struct dsa_switch *ds = p->dp->ds;
-	int ret;
+	int err = -ENODEV;

-	if (!ds->ops->get_eee)
-		return -EOPNOTSUPP;
+	if (ds->ops->get_eee) {
+		err = ds->ops->get_eee(ds, p->dp->index, e);
+		if (err)
+			return err;
+	}

-	ret = ds->ops->get_eee(ds, p->dp->index, e);
-	if (ret)
-		return ret;
+	if (p->phy) {
+		err = phy_ethtool_get_eee(p->phy, e);
+		if (err)
+			return err;
+	}

-	if (p->phy)
-		ret = phy_ethtool_get_eee(p->phy, e);
-
-	return ret;
+	return err;
 }

 #ifdef CONFIG_NET_POLL_CONTROLLER
-- 
2.13.3
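The return-value behaviour described in this patch (MAC op first, then PHY, -ENODEV only when neither exists) can be checked with a user-space model. This is a minimal sketch, not kernel code: struct eee_cfg, struct switch_ops, struct port and the stub callbacks are hypothetical stand-ins for ethtool_eee, dsa_switch_ops, dsa_slave_priv and the driver/PHY functions; only the control flow follows the patched dsa_slave_set_eee().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct eee_cfg {
	int eee_enabled;
};

struct switch_ops {
	/* Optional: NULL when the switch has nothing to do for the port's MAC. */
	int (*set_eee)(int port, struct eee_cfg *e);
};

struct port {
	const struct switch_ops *ops;
	int index;
	/* Optional: NULL when no PHY is attached to the port. */
	int (*phy_set_eee)(struct eee_cfg *e);
};

/* Same control flow as the patched dsa_slave_set_eee(). */
static int port_set_eee(struct port *p, struct eee_cfg *e)
{
	int err = -ENODEV;

	assert(p && p->ops && e);

	/* MAC side first, so the switch driver can stop the whole operation. */
	if (p->ops->set_eee) {
		err = p->ops->set_eee(p->index, e);
		if (err)
			return err;
	}

	if (p->phy_set_eee) {
		err = p->phy_set_eee(e);
		if (err)
			return err;
	}

	/* Still -ENODEV only if neither callback was present. */
	return err;
}

/* Stub callbacks used to exercise the four combinations. */
static int mac_ok(int port, struct eee_cfg *e)
{
	(void)port; (void)e;
	return 0;
}

static int mac_unsupported(int port, struct eee_cfg *e)
{
	(void)port; (void)e;
	return -EOPNOTSUPP;
}

static int phy_ok(struct eee_cfg *e)
{
	(void)e;
	return 0;
}
```

In this model a port with only a PHY or only a MAC op returns 0, a port with neither returns -ENODEV, and an error from the MAC op short-circuits before the PHY is touched.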
[PATCH net-next 11/11] net: dsa: rename switch EEE ops
To avoid confusion with the PHY EEE settings, rename the .set_eee and .get_eee ops to .set_mac_eee and .get_mac_eee respectively.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/bcm_sf2.c | 12 ++--
 drivers/net/dsa/qca8k.c   |  4 ++--
 include/net/dsa.h         | 10 +-
 net/dsa/slave.c           |  8
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index ce886345d8d2..6bbfa6ea1efb 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -338,8 +338,8 @@ static int bcm_sf2_eee_init(struct dsa_switch *ds, int port,
 	return 1;
 }

-static int bcm_sf2_sw_get_eee(struct dsa_switch *ds, int port,
-			      struct ethtool_eee *e)
+static int bcm_sf2_sw_get_mac_eee(struct dsa_switch *ds, int port,
+				  struct ethtool_eee *e)
 {
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 	struct ethtool_eee *p = &priv->port_sts[port].eee;
@@ -352,8 +352,8 @@ static int bcm_sf2_sw_get_eee(struct dsa_switch *ds, int port,
 	return 0;
 }

-static int bcm_sf2_sw_set_eee(struct dsa_switch *ds, int port,
-			      struct ethtool_eee *e)
+static int bcm_sf2_sw_set_mac_eee(struct dsa_switch *ds, int port,
+				  struct ethtool_eee *e)
 {
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 	struct ethtool_eee *p = &priv->port_sts[port].eee;
@@ -1011,8 +1011,8 @@ static const struct dsa_switch_ops bcm_sf2_ops = {
 	.set_wol		= bcm_sf2_sw_set_wol,
 	.port_enable		= bcm_sf2_port_setup,
 	.port_disable		= bcm_sf2_port_disable,
-	.get_eee		= bcm_sf2_sw_get_eee,
-	.set_eee		= bcm_sf2_sw_set_eee,
+	.get_mac_eee		= bcm_sf2_sw_get_mac_eee,
+	.set_mac_eee		= bcm_sf2_sw_set_mac_eee,
 	.port_bridge_join	= b53_br_join,
 	.port_bridge_leave	= b53_br_leave,
 	.port_stp_state_set	= b53_br_set_stp_state,
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 400333077a9f..1f6cb107bc63 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -638,7 +638,7 @@ qca8k_get_sset_count(struct dsa_switch *ds)
 }

 static int
-qca8k_set_eee(struct dsa_switch *ds, int port, struct ethtool_eee *eee)
+qca8k_set_mac_eee(struct dsa_switch *ds, int port, struct ethtool_eee *eee)
 {
 	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
 	u32 lpi_en = QCA8K_REG_EEE_CTRL_LPI_EN(port);
@@ -855,7 +855,7 @@ static const struct dsa_switch_ops qca8k_switch_ops = {
 	.phy_write		= qca8k_phy_write,
 	.get_ethtool_stats	= qca8k_get_ethtool_stats,
 	.get_sset_count		= qca8k_get_sset_count,
-	.set_eee		= qca8k_set_eee,
+	.set_mac_eee		= qca8k_set_mac_eee,
 	.port_enable		= qca8k_port_enable,
 	.port_disable		= qca8k_port_disable,
 	.port_stp_state_set	= qca8k_port_stp_state_set,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index ce46db323394..0b1a0622b33c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -332,12 +332,12 @@ struct dsa_switch_ops {
 			   struct phy_device *phy);

 	/*
-	 * EEE setttings
+	 * Port's MAC EEE settings
 	 */
-	int	(*set_eee)(struct dsa_switch *ds, int port,
-			   struct ethtool_eee *e);
-	int	(*get_eee)(struct dsa_switch *ds, int port,
-			   struct ethtool_eee *e);
+	int	(*set_mac_eee)(struct dsa_switch *ds, int port,
+			       struct ethtool_eee *e);
+	int	(*get_mac_eee)(struct dsa_switch *ds, int port,
+			       struct ethtool_eee *e);

 	/* EEPROM access */
 	int	(*get_eeprom_len)(struct dsa_switch *ds);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 6bc75ab438e8..832a54c94652 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -648,8 +648,8 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 	struct dsa_switch *ds = p->dp->ds;
 	int err = -ENODEV;

-	if (ds->ops->set_eee) {
-		err = ds->ops->set_eee(ds, p->dp->index, e);
+	if (ds->ops->set_mac_eee) {
+		err = ds->ops->set_mac_eee(ds, p->dp->index, e);
 		if (err)
 			return err;
 	}
@@ -675,8 +675,8 @@ static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 	struct dsa_switch *ds = p->dp->ds;
 	int err = -ENODEV;

-	if (ds->ops->get_eee) {
-		err = ds->ops->get_eee(ds, p->dp->index, e);
+	if (ds->ops->get_mac_eee) {
+		err = ds->ops->get_mac_eee(ds, p->dp->index, e);
 		if (err)
[PATCH net-next 03/11] net: dsa: qca8k: enable EEE once
If EEE is requested to be enabled, qca8k_set_eee calls qca8k_eee_enable_set twice (because it is already called in qca8k_eee_init). Fix that.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/qca8k.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index e076ab23d4df..9d6b5d2f7a4a 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -684,12 +684,13 @@ qca8k_set_eee(struct dsa_switch *ds, int port,

 	p->eee_enabled = e->eee_enabled;

-	if (e->eee_enabled) {
+	if (!p->eee_enabled) {
+		qca8k_eee_enable_set(ds, port, false);
+	} else {
 		p->eee_enabled = qca8k_eee_init(ds, port, phydev);
 		if (!p->eee_enabled)
 			ret = -EOPNOTSUPP;
 	}
-	qca8k_eee_enable_set(ds, port, p->eee_enabled);

 	return ret;
 }
-- 
2.13.3
[PATCH net-next 00/11] net: dsa: rework EEE support
EEE implies configuring the port's PHY and MAC of both ends of the wire.

The current EEE support in DSA mixes PHY and MAC configuration, which is bad because PHYs must be configured through a proper PHY driver. The DSA switch operations for EEE are only meant for configuring the port's MAC, which is integrated in the Ethernet switch device.

This patchset fixes the EEE support in the qca8k driver, makes the DSA layer call phy_init_eee for all drivers, and removes the EEE support from the mv88e6xxx driver since the Marvell PHY driver should be enough for it.

Vivien Didelot (11):
  net: dsa: make EEE ops optional
  net: dsa: qca8k: fix EEE init
  net: dsa: qca8k: enable EEE once
  net: dsa: qca8k: do not cache unneeded EEE fields
  net: dsa: qca8k: remove qca8k_get_eee
  net: dsa: bcm_sf2: remove unneeded supported flags
  net: dsa: mv88e6xxx: call phy_init_eee
  net: dsa: call phy_init_eee in DSA layer
  net: dsa: remove PHY device argument from .set_eee
  net: dsa: mv88e6xxx: remove EEE support
  net: dsa: rename switch EEE ops

 drivers/net/dsa/bcm_sf2.c        | 26 +++
 drivers/net/dsa/mv88e6xxx/chip.c | 82 --
 drivers/net/dsa/mv88e6xxx/chip.h |  6 ---
 drivers/net/dsa/mv88e6xxx/phy.c  | 96
 drivers/net/dsa/mv88e6xxx/phy.h  | 22 -
 drivers/net/dsa/mv88e6xxx/port.c | 17 ---
 drivers/net/dsa/mv88e6xxx/port.h |  3 --
 drivers/net/dsa/qca8k.c          | 68 ++--
 drivers/net/dsa/qca8k.h          |  1 -
 include/net/dsa.h                | 11 +++--
 net/dsa/slave.c                  | 48
 11 files changed, 45 insertions(+), 335 deletions(-)

-- 
2.13.3
[PATCH net-next 07/11] net: dsa: mv88e6xxx: call phy_init_eee
It is safer to init the EEE before the DSA layer calls phy_ethtool_set_eee, as sf2 and qca8k are doing.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/mv88e6xxx/chip.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 647d5d45c1d6..b531d4a3bab5 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -855,6 +855,12 @@ static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;

+	if (e->eee_enabled) {
+		err = phy_init_eee(phydev, 0);
+		if (err)
+			return err;
+	}
+
 	mutex_lock(&chip->reg_lock);
 	err = mv88e6xxx_energy_detect_write(chip, port, e);
 	mutex_unlock(&chip->reg_lock);
-- 
2.13.3
[PATCH net-next 09/11] net: dsa: remove PHY device argument from .set_eee
The DSA switch operations for EEE are only meant to configure a port's MAC EEE settings. The port's PHY EEE settings are accessed by the DSA layer and must be made available via a proper PHY driver.

In order to reduce this confusion, remove the phy_device argument from the .set_eee operation.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/bcm_sf2.c        |  1 -
 drivers/net/dsa/mv88e6xxx/chip.c |  2 +-
 drivers/net/dsa/qca8k.c          | 14 +++---
 include/net/dsa.h                |  1 -
 net/dsa/slave.c                  |  2 +-
 5 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 9d10aac8f241..ce886345d8d2 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -353,7 +353,6 @@ static int bcm_sf2_sw_get_eee(struct dsa_switch *ds, int port,
 }

 static int bcm_sf2_sw_set_eee(struct dsa_switch *ds, int port,
-			      struct phy_device *phydev,
 			      struct ethtool_eee *e)
 {
 	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 647d5d45c1d6..aaa96487f21f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -850,7 +850,7 @@ static int mv88e6xxx_get_eee(struct dsa_switch *ds, int port,
 }

 static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
-			     struct phy_device *phydev, struct ethtool_eee *e)
+			     struct ethtool_eee *e)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 038a895d9a96..400333077a9f 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -637,8 +637,8 @@ qca8k_get_sset_count(struct dsa_switch *ds)
 	return ARRAY_SIZE(ar8327_mib);
 }

-static void
-qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)
+static int
+qca8k_set_eee(struct dsa_switch *ds, int port, struct ethtool_eee *eee)
 {
 	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
 	u32 lpi_en = QCA8K_REG_EEE_CTRL_LPI_EN(port);
@@ -646,20 +646,12 @@ qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)

 	mutex_lock(&priv->reg_mutex);
 	reg = qca8k_read(priv, QCA8K_REG_EEE_CTRL);
-	if (enable)
+	if (eee->eee_enabled)
 		reg |= lpi_en;
 	else
 		reg &= ~lpi_en;
 	qca8k_write(priv, QCA8K_REG_EEE_CTRL, reg);
 	mutex_unlock(&priv->reg_mutex);
-}
-
-static int
-qca8k_set_eee(struct dsa_switch *ds, int port,
-	      struct phy_device *phydev,
-	      struct ethtool_eee *e)
-{
-	qca8k_eee_enable_set(ds, port, e->eee_enabled);

 	return 0;
 }
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 88da272d20d0..ce46db323394 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -335,7 +335,6 @@ struct dsa_switch_ops {
 	 * EEE setttings
 	 */
 	int	(*set_eee)(struct dsa_switch *ds, int port,
-			   struct phy_device *phydev,
 			   struct ethtool_eee *e);
 	int	(*get_eee)(struct dsa_switch *ds, int port,
 			   struct ethtool_eee *e);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bf71c206fe8f..6bc75ab438e8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -649,7 +649,7 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 	int err = -ENODEV;

 	if (ds->ops->set_eee) {
-		err = ds->ops->set_eee(ds, p->dp->index, p->phy, e);
+		err = ds->ops->set_eee(ds, p->dp->index, e);
 		if (err)
 			return err;
 	}
-- 
2.13.3
[PATCH net-next 08/11] net: dsa: call phy_init_eee in DSA layer
All DSA drivers are calling phy_init_eee if eee_enabled is true.

Move this statement up into the DSA layer to simplify the DSA drivers. qca8k no longer needs to cache the ethtool_eee structures.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/bcm_sf2.c        |  9 +
 drivers/net/dsa/mv88e6xxx/chip.c |  6 --
 drivers/net/dsa/qca8k.c          | 31 ++-
 drivers/net/dsa/qca8k.h          |  1 -
 net/dsa/slave.c                  |  6 ++
 5 files changed, 9 insertions(+), 44 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index aef475f1ce06..9d10aac8f241 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -360,14 +360,7 @@ static int bcm_sf2_sw_set_eee(struct dsa_switch *ds, int port,
 	struct ethtool_eee *p = &priv->port_sts[port].eee;

 	p->eee_enabled = e->eee_enabled;
-
-	if (!p->eee_enabled) {
-		bcm_sf2_eee_enable_set(ds, port, false);
-	} else {
-		p->eee_enabled = bcm_sf2_eee_init(ds, port, phydev);
-		if (!p->eee_enabled)
-			return -EOPNOTSUPP;
-	}
+	bcm_sf2_eee_enable_set(ds, port, e->eee_enabled);

 	return 0;
 }
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index b531d4a3bab5..647d5d45c1d6 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -855,12 +855,6 @@ static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;

-	if (e->eee_enabled) {
-		err = phy_init_eee(phydev, 0);
-		if (err)
-			return err;
-	}
-
 	mutex_lock(&chip->reg_lock);
 	err = mv88e6xxx_energy_detect_write(chip, port, e);
 	mutex_unlock(&chip->reg_lock);
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index b5f2710064e5..038a895d9a96 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -655,40 +655,13 @@ qca8k_eee_enable_set(struct dsa_switch *ds, int port, bool enable)
 }

 static int
-qca8k_eee_init(struct dsa_switch *ds, int port,
-	       struct phy_device *phy)
-{
-	int ret;
-
-	ret = phy_init_eee(phy, 0);
-	if (ret)
-		return 0;
-
-	qca8k_eee_enable_set(ds, port, true);
-
-	return 1;
-}
-
-static int
 qca8k_set_eee(struct dsa_switch *ds, int port,
 	      struct phy_device *phydev,
 	      struct ethtool_eee *e)
 {
-	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
-	struct ethtool_eee *p = &priv->port_sts[port].eee;
-	int ret = 0;
+	qca8k_eee_enable_set(ds, port, e->eee_enabled);

-	p->eee_enabled = e->eee_enabled;
-
-	if (!p->eee_enabled) {
-		qca8k_eee_enable_set(ds, port, false);
-	} else {
-		p->eee_enabled = qca8k_eee_init(ds, port, phydev);
-		if (!p->eee_enabled)
-			ret = -EOPNOTSUPP;
-	}
-
-	return ret;
+	return 0;
 }

 static void
diff --git a/drivers/net/dsa/qca8k.h b/drivers/net/dsa/qca8k.h
index 1ed4fac6cd6d..1cf8a920d4ff 100644
--- a/drivers/net/dsa/qca8k.h
+++ b/drivers/net/dsa/qca8k.h
@@ -156,7 +156,6 @@ enum qca8k_fdb_cmd {
 };

 struct ar8xxx_port_status {
-	struct ethtool_eee eee;
 	int enabled;
 };

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 518145ced434..bf71c206fe8f 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -655,6 +655,12 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 	}

 	if (p->phy) {
+		if (e->eee_enabled) {
+			err = phy_init_eee(p->phy, 0);
+			if (err)
+				return err;
+		}
+
 		err = phy_ethtool_set_eee(p->phy, e);
 		if (err)
 			return err;
-- 
2.13.3
[PATCH net-next 10/11] net: dsa: mv88e6xxx: remove EEE support
The PHY's EEE settings are already accessed by the DSA layer through the Marvell PHY driver and there is nothing to be done for switch's MACs.

Remove all EEE support from the mv88e6xxx driver.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/mv88e6xxx/chip.c | 82 --
 drivers/net/dsa/mv88e6xxx/chip.h |  6 ---
 drivers/net/dsa/mv88e6xxx/phy.c  | 96
 drivers/net/dsa/mv88e6xxx/phy.h  | 22 -
 drivers/net/dsa/mv88e6xxx/port.c | 17 ---
 drivers/net/dsa/mv88e6xxx/port.h |  3 --
 6 files changed, 226 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index aaa96487f21f..746ebf2fed80 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -810,58 +810,6 @@ static void mv88e6xxx_get_regs(struct dsa_switch *ds, int port,
 	mutex_unlock(&chip->reg_lock);
 }

-static int mv88e6xxx_energy_detect_read(struct mv88e6xxx_chip *chip, int port,
-					struct ethtool_eee *eee)
-{
-	int err;
-
-	if (!chip->info->ops->phy_energy_detect_read)
-		return -EOPNOTSUPP;
-
-	/* assign eee->eee_enabled and eee->tx_lpi_enabled */
-	err = chip->info->ops->phy_energy_detect_read(chip, port, eee);
-	if (err)
-		return err;
-
-	/* assign eee->eee_active */
-	return mv88e6xxx_port_status_eee(chip, port, eee);
-}
-
-static int mv88e6xxx_energy_detect_write(struct mv88e6xxx_chip *chip, int port,
-					 struct ethtool_eee *eee)
-{
-	if (!chip->info->ops->phy_energy_detect_write)
-		return -EOPNOTSUPP;
-
-	return chip->info->ops->phy_energy_detect_write(chip, port, eee);
-}
-
-static int mv88e6xxx_get_eee(struct dsa_switch *ds, int port,
-			     struct ethtool_eee *e)
-{
-	struct mv88e6xxx_chip *chip = ds->priv;
-	int err;
-
-	mutex_lock(&chip->reg_lock);
-	err = mv88e6xxx_energy_detect_read(chip, port, e);
-	mutex_unlock(&chip->reg_lock);
-
-	return err;
-}
-
-static int mv88e6xxx_set_eee(struct dsa_switch *ds, int port,
-			     struct ethtool_eee *e)
-{
-	struct mv88e6xxx_chip *chip = ds->priv;
-	int err;
-
-	mutex_lock(&chip->reg_lock);
-	err = mv88e6xxx_energy_detect_write(chip, port, e);
-	mutex_unlock(&chip->reg_lock);
-
-	return err;
-}
-
 static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 {
 	struct dsa_switch *ds = NULL;
@@ -2521,8 +2469,6 @@ static const struct mv88e6xxx_ops mv88e6141_ops = {
 	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
 	.phy_read = mv88e6xxx_g2_smi_phy_read,
 	.phy_write = mv88e6xxx_g2_smi_phy_write,
-	.phy_energy_detect_read = mv88e6352_phy_energy_detect_read,
-	.phy_energy_detect_write = mv88e6352_phy_energy_detect_write,
 	.port_set_link = mv88e6xxx_port_set_link,
 	.port_set_duplex = mv88e6xxx_port_set_duplex,
 	.port_set_rgmii_delay = mv88e6390_port_set_rgmii_delay,
@@ -2648,8 +2594,6 @@ static const struct mv88e6xxx_ops mv88e6172_ops = {
 	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
 	.phy_read = mv88e6xxx_g2_smi_phy_read,
 	.phy_write = mv88e6xxx_g2_smi_phy_write,
-	.phy_energy_detect_read = mv88e6352_phy_energy_detect_read,
-	.phy_energy_detect_write = mv88e6352_phy_energy_detect_write,
 	.port_set_link = mv88e6xxx_port_set_link,
 	.port_set_duplex = mv88e6xxx_port_set_duplex,
 	.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
@@ -2719,8 +2663,6 @@ static const struct mv88e6xxx_ops mv88e6176_ops = {
 	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
 	.phy_read = mv88e6xxx_g2_smi_phy_read,
 	.phy_write = mv88e6xxx_g2_smi_phy_write,
-	.phy_energy_detect_read = mv88e6352_phy_energy_detect_read,
-	.phy_energy_detect_write = mv88e6352_phy_energy_detect_write,
 	.port_set_link = mv88e6xxx_port_set_link,
 	.port_set_duplex = mv88e6xxx_port_set_duplex,
 	.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
@@ -2784,8 +2726,6 @@ static const struct mv88e6xxx_ops mv88e6190_ops = {
 	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
 	.phy_read = mv88e6xxx_g2_smi_phy_read,
 	.phy_write = mv88e6xxx_g2_smi_phy_write,
-	.phy_energy_detect_read = mv88e6390_phy_energy_detect_read,
-	.phy_energy_detect_write = mv88e6390_phy_energy_detect_write,
 	.port_set_link = mv88e6xxx_port_set_link,
 	.port_set_duplex = mv88e6xxx_port_set_duplex,
 	.port_set_rgmii_delay = mv88e6390_port_set_rgmii_delay,
@@ -2821,8 +2761,6 @@ static const struct mv88e6xxx_ops mv88e6190x_ops = {
 	.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
 	.phy_read = mv88e6xxx_g2_smi_phy_read,
 	.phy_write = mv88e6xxx_g2_smi_phy_write,
-	.phy_energy_detect_read
[PATCH net-next 02/11] net: dsa: qca8k: fix EEE init
The qca8k obviously copied code from the sf2 driver as how to set EEE:

	if (e->eee_enabled) {
		p->eee_enabled = qca8k_eee_init(ds, port, phydev);
		if (!p->eee_enabled)
			ret = -EOPNOTSUPP;
	}

But it did not use the same logic for the EEE init routine, which is "Returns 0 if EEE was not enabled, or 1 otherwise". This results in returning -EOPNOTSUPP on success and caching EEE enabled on failure.

This patch fixes the returned value of qca8k_eee_init.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/qca8k.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index b3bee7eab45f..e076ab23d4df 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -666,11 +666,11 @@ qca8k_eee_init(struct dsa_switch *ds, int port,

 	ret = phy_init_eee(phy, 0);
 	if (ret)
-		return ret;
+		return 0;

 	qca8k_eee_enable_set(ds, port, true);

-	return 0;
+	return 1;
 }

 static int
-- 
2.13.3
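The mismatch between the two return conventions can be reproduced in a small user-space model. This is only a sketch: stub_phy_init_eee() is a hypothetical stand-in for phy_init_eee() (0 on success, a negative errno otherwise, the particular errno chosen arbitrarily), and set_eee() mirrors the caller logic quoted in the commit message with the init routine passed in as a function pointer.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for phy_init_eee(): 0 on success, negative errno on failure. */
static int stub_phy_init_eee(int phy_supports_eee)
{
	return phy_supports_eee ? 0 : -EPROTONOSUPPORT;
}

/* Pre-patch qca8k convention: propagate the errno, so 0 means success. */
static int eee_init_old(int phy_supports_eee)
{
	int ret = stub_phy_init_eee(phy_supports_eee);

	if (ret)
		return ret;

	return 0;
}

/* Post-patch (sf2) convention: 0 if EEE was not enabled, 1 otherwise. */
static int eee_init_new(int phy_supports_eee)
{
	if (stub_phy_init_eee(phy_supports_eee))
		return 0;

	return 1;
}

/* Caller logic from the commit message: the init result is used as a boolean. */
static int set_eee(int (*init)(int), int phy_supports_eee, int *cached_enabled)
{
	int ret = 0;

	assert(init && cached_enabled);

	*cached_enabled = init(phy_supports_eee);
	if (!*cached_enabled)
		ret = -EOPNOTSUPP;

	return ret;
}
```

With eee_init_old, a successful init makes set_eee() return -EOPNOTSUPP and a failed init leaves a non-zero (errno) value cached as "enabled"; with eee_init_new both cases behave as the caller expects.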
[PATCH net-next 04/11] net: dsa: qca8k: do not cache unneeded EEE fields
The qca8k driver is currently caching a bitfield of the supported member of an ethtool_eee private structure, which is unused.

Only the eee_enabled field of the private ethtool_eee copy is updated, thus using p->advertised and p->lp_advertised is also erroneous.

Remove the usage of these private ethtool_eee members and only rely on phy_ethtool_get_eee to assign the eee_active member.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/qca8k.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 9d6b5d2f7a4a..c316c55aabc6 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -658,12 +658,8 @@ static int
 qca8k_eee_init(struct dsa_switch *ds, int port,
 	       struct phy_device *phy)
 {
-	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
-	struct ethtool_eee *p = &priv->port_sts[port].eee;
 	int ret;

-	p->supported = (SUPPORTED_1000baseT_Full | SUPPORTED_100baseT_Full);
-
 	ret = phy_init_eee(phy, 0);
 	if (ret)
 		return 0;
@@ -705,12 +701,7 @@ qca8k_get_eee(struct dsa_switch *ds, int port,
 	int ret;

 	ret = phy_ethtool_get_eee(netdev->phydev, p);
-	if (!ret)
-		e->eee_active =
-			!!(p->supported & p->advertised & p->lp_advertised);
-	else
-		e->eee_active = 0;
-
+	e->eee_active = p->eee_active;
 	e->eee_enabled = p->eee_enabled;

 	return ret;
-- 
2.13.3
[PATCH net-next 05/11] net: dsa: qca8k: remove qca8k_get_eee
phy_ethtool_get_eee is already called by the DSA layer, thus remove the duplicated call in the qca8k driver. qca8k_get_eee becomes unnecessary.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/qca8k.c | 17 -
 1 file changed, 17 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index c316c55aabc6..b5f2710064e5 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -691,22 +691,6 @@ qca8k_set_eee(struct dsa_switch *ds, int port,
 	return ret;
 }

-static int
-qca8k_get_eee(struct dsa_switch *ds, int port,
-	      struct ethtool_eee *e)
-{
-	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
-	struct ethtool_eee *p = &priv->port_sts[port].eee;
-	struct net_device *netdev = ds->ports[port].netdev;
-	int ret;
-
-	ret = phy_ethtool_get_eee(netdev->phydev, p);
-	e->eee_active = p->eee_active;
-	e->eee_enabled = p->eee_enabled;
-
-	return ret;
-}
-
 static void
 qca8k_port_stp_state_set(struct dsa_switch *ds, int port, u8 state)
 {
@@ -906,7 +890,6 @@ static const struct dsa_switch_ops qca8k_switch_ops = {
 	.phy_write		= qca8k_phy_write,
 	.get_ethtool_stats	= qca8k_get_ethtool_stats,
 	.get_sset_count		= qca8k_get_sset_count,
-	.get_eee		= qca8k_get_eee,
 	.set_eee		= qca8k_set_eee,
 	.port_enable		= qca8k_port_enable,
 	.port_disable		= qca8k_port_disable,
-- 
2.13.3
[PATCH net-next 06/11] net: dsa: bcm_sf2: remove unneeded supported flags
The SF2 driver is masking the supported bitfield of its private copy of the ports' ethtool_eee structures. It is used nowhere, thus remove it.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/bcm_sf2.c | 4
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 648f91b58d1e..aef475f1ce06 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -327,12 +327,8 @@ static void bcm_sf2_port_disable(struct dsa_switch *ds, int port,
 static int bcm_sf2_eee_init(struct dsa_switch *ds, int port,
 			    struct phy_device *phy)
 {
-	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
-	struct ethtool_eee *p = &priv->port_sts[port].eee;
 	int ret;

-	p->supported = (SUPPORTED_1000baseT_Full | SUPPORTED_100baseT_Full);
-
 	ret = phy_init_eee(phy, 0);
 	if (ret)
 		return 0;
-- 
2.13.3
[PATCH V4 net 1/2] net: remove unnecessary rotation
From: Shaohua Li

According to David Miller, the rotation doesn't really help avoid security problems, so delete it.

Suggested-by: David Miller
Signed-off-by: Shaohua Li
---
 include/net/ipv6.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6eac5cf..7548367 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -790,12 +790,6 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,

 	hash = skb_get_hash_flowi6(skb, fl6);

-	/* Since this is being sent on the wire obfuscate hash a bit
-	 * to minimize possbility that any useful information to an
-	 * attacker is leaked. Only lower 20 bits are relevant.
-	 */
-	rol32(hash, 16);
-
 	flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;

 	if (net->ipv6.sysctl.flowlabel_state_ranges)
-- 
2.9.3
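As a side observation that is not part of the commit message: the removed statement calls rol32() without using its return value, so as written it should not change the label at all. The user-space sketch below assumes rol32() behaves like the usual rotate-left helper that returns its result, and applies the 20-bit mask in host byte order (the kernel uses a big-endian mask); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FLOWLABEL_MASK_HOST 0x000FFFFFu	/* lower 20 bits, host order for simplicity */

/* User-space equivalent of a rotate-left helper that returns the rotated word. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Shape of the code before the patch: the rotated value is thrown away. */
static uint32_t label_as_written(uint32_t hash)
{
	uint32_t label;

	(void)rol32(hash, 16);	/* result discarded, hash is unchanged */
	label = hash & FLOWLABEL_MASK_HOST;
	assert(label <= FLOWLABEL_MASK_HOST);
	return label;
}

/* Shape of the code after the patch. */
static uint32_t label_without_rotation(uint32_t hash)
{
	return hash & FLOWLABEL_MASK_HOST;
}

/* What the label would have been had the rotation been assigned back. */
static uint32_t label_if_rotation_applied(uint32_t hash)
{
	return rol32(hash, 16) & FLOWLABEL_MASK_HOST;
}
```

Under these assumptions label_as_written() and label_without_rotation() agree for every input, which is consistent with the patch being a pure cleanup.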
[PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet
From: Shaohua Li

Please see the tcpdump output below:

21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56)
    fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20)
    fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The flowlabel of the reset packet (0xb34d5) differs from the flowlabel of the normal packets sent in the same direction (0xd827f). As a result, our router does not correctly close the TCP connection. These patches try to fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net: remove unnecessary rotation
  net: fix tcp reset packet flowlabel for ipv6

 include/net/ipv6.h       | 15 ---
 net/ipv4/tcp_minisocks.c |  8 +++-
 net/ipv6/ip6_gre.c       |  2 +-
 net/ipv6/ip6_output.c    |  4 ++--
 net/ipv6/ip6_tunnel.c    |  2 +-
 net/ipv6/tcp_ipv6.c      | 18 +-
 6 files changed, 32 insertions(+), 17 deletions(-)

-- 
2.9.3
[PATCH V4 net 2/2] net: fix tcp reset packet flowlabel for ipv6
From: Shaohua Li

Please see the tcpdump output below:

21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The TCP reset packet has a different flowlabel, which prevents our router
from closing the TCP connection correctly. The reason is that a normal
packet gets skb->hash from sk->sk_txhash, which is generated randomly;
ip6_make_flowlabel then uses the hash to create a flowlabel. The reset
packet is not assigned a hash, so its flowlabel is calculated from flowi6.

Since the user can't change the flowlabel of a timewait sock, we create a
flowlabel for the timewait socket from the randomly generated hash
(sk->sk_txhash), then use it in the reset packet. This way, the reset
packet has the same flowlabel as normal packets. This also fixes the
flowlabel of the reset packet when the user configures a flowlabel, which
was previously ignored.

Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Signed-off-by: Shaohua Li
---
 include/net/ipv6.h | 9 -
 net/ipv4/tcp_minisocks.c | 8 +++-
 net/ipv6/ip6_gre.c | 2 +-
 net/ipv6/ip6_output.c | 4 ++--
 net/ipv6/ip6_tunnel.c | 2 +-
 net/ipv6/tcp_ipv6.c | 18 +-
 6 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 7548367..f8713fd 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -773,10 +773,8 @@ static inline void iph_to_flow_copy_v6addrs(struct flow_keys *flow,
 static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
 					__be32 flowlabel, bool autolabel,
-					struct flowi6 *fl6)
+					struct flowi6 *fl6, u32 hash)
 {
-	u32 hash;
-
 	/* @flowlabel may include more than a flow
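
The mismatch described in this patch can be illustrated with a small userspace sketch. It assumes, as a simplification, that the label is just the low 20 bits of the hash; the helper name and mask macro below are made up for illustration, and the real ip6_make_flowlabel() does more than this.

```c
#include <stdint.h>

/* IPv6 flow labels are 20 bits wide. */
#define TOY_FLOWLABEL_MASK 0x000fffffu

/*
 * toy_flowlabel_from_hash() is a hypothetical, simplified stand-in for the
 * hash-to-label step. It only shows that the label is a pure function of
 * the hash: two packets get the same label only if they use the same hash.
 */
static uint32_t toy_flowlabel_from_hash(uint32_t hash)
{
	return hash & TOY_FLOWLABEL_MASK;
}
```

If data packets derive the label from sk->sk_txhash while the reset derives it from some other hash, the two labels will generally differ, which matches the 0xd827f versus 0xb34d5 labels in the capture above.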
Re: [PATCH v3 net-next 1/4] tcp: ULP infrastructure
On 07/29/17 01:12 PM, Tom Herbert wrote:
> On Wed, Jun 14, 2017 at 11:37 AM, Dave Watson wrote:
> > Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
> > sockets. Based on a similar infrastructure in tcp_cong. The idea is that any
> > ULP can add its own logic by changing the TCP proto_ops structure to its own
> > methods.
> >
> > Example usage:
> >
> > setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
> >
> One question: is there a good reason why the ULP infrastructure should
> just be for TCP sockets? For example, I'd really like to be able to do
> something like:
>
> setsockopt(sock, SOL_SOCKET, SO_ULP, &ulp_param, sizeof(ulp_param));
>
> Where ulp_param is a structure containing the ULP name as well as some
> ULP specific parameters that are passed to init_ulp. ulp_init could
> determine whether the socket family is appropriate for the ULP being
> requested.

Using SOL_SOCKET instead seems reasonable to me. I can see how
ulp_params could have some use, perhaps at a slight loss in clarity.
TLS needs its own setsockopts anyway though, for renegotiate for
example.
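
For reference, the TCP_ULP call quoted above can be wrapped as in the sketch below. attach_ulp() is a hypothetical helper name, not part of the patch or of libc; the fallback define assumes the UAPI value of TCP_ULP (31) for libc headers that predate it. The SO_ULP variant proposed in this thread is not shown, since it is only a suggestion here.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* TCP_ULP is 31 in the Linux UAPI headers; older libc headers may lack it. */
#ifndef TCP_ULP
#define TCP_ULP 31
#endif

/*
 * attach_ulp() is a hypothetical helper for illustration only.
 * It asks the kernel to attach the named ULP (for example "tls") to a TCP
 * socket. Returns 0 on success or a negative errno value on failure.
 */
static int attach_ulp(int fd, const char *name)
{
	/* The option value is the ULP name, including the trailing NUL. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, name, strlen(name) + 1) < 0)
		return -errno;
	return 0;
}
```

A caller would typically invoke attach_ulp(sock, "tls") once on a TCP socket and then issue the ULP-specific setsockopts. The call can fail, for instance when the kernel has no ULP registered under that name, so the return value should be checked.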
Re: Long stalls creating a new netns after a netns with a SMB client exits
On Mon, Jul 31, 2017 at 9:22 AM, Rolf Neugebauer wrote:
> On Fri, Jul 28, 2017 at 8:16 PM, David Ahern wrote:
>> On 7/28/17 12:58 PM, Rolf Neugebauer wrote:
> I can readily reproduce this on 4.9.39, 4.11.12 and another user
> repro-ed it on 4.12.3. It seems to happen every time. At least one
> user reported issues with NFS mounts as well, but we were not able to
> reproduce it. It's not clear to me if this is directly related to
> 'mount.cifs' or if that just happens to reliably repro it.
OK, so commit d747a7a51b00984127a88113c does not help this case either.
>>>
>>> d747a7a51b009("tcp: reset sk_rx_dst in tcp_disconnect()") indeed seems
>>> a different issue. As I understand that actually caused the ref count
>>> never to get decremented, while here eventually some cleanup kicks in
>>> after a long timeout.
>>
>> It could be a dst is cached on a socket and does not get cleared until
>> the socket time outs are done.
>>
>> Test that theory by something like this for IPv4 TCP (similar change for
>> UDP if the client is UDP based):
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 3a19ea28339f..37db087b6c97 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1855,7 +1855,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
>>  {
>>         struct dst_entry *dst = skb_dst(skb);
>>
>> -       if (dst && dst_hold_safe(dst)) {
>> +       if (0 && dst && dst_hold_safe(dst)) {
>>                 sk->sk_rx_dst = dst;
>>                 inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
>>         }
>
>
> This removes the 200s stall (the test is IPv4/TCP based)

Interesting. This means we have a kernel socket which holds the dst
refcnt. Looking at the cifs code, it does create a TCP kernel socket
which doesn't hold refcnt to netns but its sk_rx_dst could still be set
as usual, therefore this socket could hold the dst which holds lo device
after the netns is gone. But its timeout seems to be 60sec
(SMB_ECHO_INTERVAL_DEFAULT), not 200sec.

Ideally it should use a per-netns socket so that it has the same
lifetime as the netns. But you need to check this with the cifs
developers; I don't understand cifs at all.
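
The theory in this thread is a reference-counting chain: the socket holds the dst, and the dst holds the device. The toy model below sketches that chain in plain C. Every type and function name here is invented for illustration; none of it is kernel API, and the real dst and netdevice lifetimes are more involved.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel objects; none of this is real kernel API. */
struct toy_dev  { int refcnt; };                       /* e.g. the netns lo device */
struct toy_dst  { int refcnt; struct toy_dev *dev; };  /* route entry pinning dev */
struct toy_sock { struct toy_dst *rx_dst; };           /* socket caching the rx dst */

static void toy_dst_init(struct toy_dst *dst, struct toy_dev *dev)
{
	dst->refcnt = 1;        /* the routing table's own reference */
	dst->dev = dev;
	dev->refcnt++;          /* the dst pins the device */
}

static void toy_dst_release(struct toy_dst *dst)
{
	if (--dst->refcnt == 0)
		dst->dev->refcnt--;     /* last user gone: drop the device ref */
}

/* Mirrors the idea behind inet_sk_rx_dst_set(): cache the dst and hold it. */
static void toy_sk_rx_dst_set(struct toy_sock *sk, struct toy_dst *dst)
{
	dst->refcnt++;
	sk->rx_dst = dst;
}

static void toy_sk_close(struct toy_sock *sk)
{
	if (sk->rx_dst) {
		toy_dst_release(sk->rx_dst);
		sk->rx_dst = NULL;
	}
}

/* Teardown can only finish once nothing references the device any more. */
static int toy_dev_can_unregister(const struct toy_dev *dev)
{
	return dev->refcnt == 0;
}
```

In this model, dropping the routing table's reference at netns exit is not enough: the device stays pinned until toy_sk_close() runs, which is the behaviour the `if (0 && ...)` test patch above was meant to confirm.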
Re: Kernel TLS in 4.13-rc1
On 07/30/17 11:14 PM, David Oberhollenzer wrote:
> On 07/24/2017 11:10 PM, Dave Watson wrote:
> > On 07/23/17 09:39 PM, David Oberhollenzer wrote:
> >> After fixing the benchmark/test tool that the patch description
> >> linked to (https://github.com/Mellanox/tls-af_ktls_tool) to make
> >> sure that the server and client actually *agree* on AES-128-GCM,
> >> I simply ran the client program with the --verify-sendpage option.
> >>
> >> The handshake and setting up of the sockets appears to work but
> >> the program complains that the sent and received page contents
> >> do not match (sent is 0x12 repeated all over and received looks
> >> pretty random).
> >
> > The --verify functions depend on the RX path as well, which has not
> > been merged. Any programs / tests using OpenSSL + patches should work
> > fine.
> >
> > If you want to use the tool, something like this should work, so that
> > the receive path uses gnutls:
> >
> > ./server --no-echo
> >
> > ./client --server-port 12345 --sendfile some_file --server-host localhost
> >
>
> Thanks! This appears to work as expected (output from the server matches the
> input from the client and the pcap dumps look fine).
>
> From briefly browsing through the code of the test tool I was initially under
> the impression that it would generate an error message and terminate if an
> attempt was made at configuring ktls for the RX path.
>
> Anyway, I already read in the patch description that RX wasn't included yet,
> still requires a few cleanups and would follow at some point.
>
> Is there currently a "not-so-clean" version of the RX patches floating around
> somewhere that we could take a look at?

I dumped the current state here. Still plenty rough but at least passes
--verify-transmission for me.

https://github.com/ktls/net_next_ktls/tree/tls_recv_net_next

and config changes to af_ktls-tool

https://github.com/ktls/af_ktls-tool/tree/RX
Re: [patch net-next 0/8] mlxsw: Various small fixes
From: Jiri Pirko
Date: Mon, 31 Jul 2017 09:27:22 +0200

> This patch series is to contribute several fixes for nits that I noticed while
> working on mlxsw. The changes range from typo fixes to local improvements of
> the code and have little in common besides being small in scope.

Series applied, thanks Jiri.
Re: [PATCH net-next 0/7] net: bcmgenet: utilize MDIO unimac driver
From: Florian Fainelli
Date: Mon, 31 Jul 2017 12:04:21 -0700

> Hi all,
>
> This patch series migrates the Broadcom GENET driver to use the
> mdio-bcm-unimac driver. This MDIO HW is the same as the one GENET
> internally embeds, yet for historical reasons the two drivers lived their
> own lives. Because of the GENET interrupt situation, we let it specify how
> it wants to signal MDIO operations completion using its driver-private
> waitqueue.
>
> The diffstat is not super impressive, but it's still negative! This would
> make it easier in the future to absorb possible workarounds/bugs/features
> within the same location.
>
> This was tested on BCM7260 (GENETv5, single instance), BCM7439 (GENETv4,
> triple instance) and BCM7445 (bcm_sf2 + mdio-bcm-unimac).
>
> We also now have a nice /proc/iomem output:
>
> f0b0-f0b0fc4b : /rdb/ethernet@f0b0
>   f0b00e14-f0b00e1c : unimac-mdio.0
> f0b2-f0b2fc4b : /rdb/ethernet@f0b2
>   f0b20e14-f0b20e1c : unimac-mdio.1
> f0b4-f0b4fc4b : /rdb/ethernet@f0b4
>   f0b40e14-f0b40e1c : unimac-mdio.2

I love cleanups like this... even if the diffstat breaks even :-)

Applied, thanks.
[PATCH net] samples/bpf: fix bpf tunnel cleanup
test_tunnel_bpf.sh fails to remove the vxlan11 tunnel device, causing the
next geneve tunnelling test case to fail. In addition, the geneve reserved
bits in tcbpf2_kern.c should be zero, according to the RFC.

Signed-off-by: William Tu
---
 samples/bpf/tcbpf2_kern.c | 4 ++--
 samples/bpf/test_tunnel_bpf.sh | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 9c823a609e75..270edcc149a1 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -147,9 +147,9 @@ int _geneve_set_tunnel(struct __sk_buff *skb)
 	__builtin_memset(&gopt, 0x0, sizeof(gopt));
 	gopt.opt_class = 0x102; /* Open Virtual Networking (OVN) */
 	gopt.type = 0x08;
-	gopt.r1 = 1;
+	gopt.r1 = 0;
 	gopt.r2 = 0;
-	gopt.r3 = 1;
+	gopt.r3 = 0;
 	gopt.length = 2; /* 4-byte multiple */
 	*(int *) &gopt.opt_data = 0xdeadbeef;
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index 1ff634f187b7..a70d2ea90313 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -149,6 +149,7 @@ function cleanup {
 	ip link del veth1
 	ip link del ipip11
 	ip link del gretap11
+	ip link del vxlan11
 	ip link del geneve11
 	pkill tcpdump
 	pkill cat
--
2.7.4
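
The reserved-bit fix in this patch can be pictured with a small userspace sketch of the 4-byte Geneve option header. geneve_opt_pack() is a hypothetical helper written for this note, not something from the sample or the kernel; it assumes the usual layout of a 16-bit option class, an 8-bit type, three reserved bits and a 5-bit length counted in 4-byte words.

```c
#include <stdint.h>

/*
 * geneve_opt_pack() is a hypothetical helper for illustration only.
 * It serialises the 4-byte Geneve option header in network byte order:
 *   16-bit option class | 8-bit type | 3 reserved bits | 5-bit length
 * The length counts 4-byte words of option data, excluding this header.
 */
static void geneve_opt_pack(uint8_t out[4], uint16_t opt_class,
			    uint8_t type, uint8_t len_words)
{
	out[0] = (uint8_t)(opt_class >> 8);
	out[1] = (uint8_t)(opt_class & 0xff);
	out[2] = type;
	/* Masking to the low 5 bits keeps the 3 reserved bits at zero. */
	out[3] = (uint8_t)(len_words & 0x1f);
}
```

With the values used in the sample (class 0x102, type 0x08, length 2), the top three bits of the last header byte stay clear, which is what setting r1, r2 and r3 to 0 achieves in the patched code.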
Re: [RFC net-next 0/6] tcp: remove prequeue and header prediction
From: Eric Dumazet
Date: Mon, 31 Jul 2017 13:22:22 -0700

> On Mon, Jul 31, 2017 at 1:04 PM, Yuchung Cheng wrote:
>> by the time these devices use 4.12 kernels they are likely powerful
>> enough to make header prediction irrelevant...
>
> Also note that TCP stack complexity has increased a lot, I seriously
> doubt anyone could notice any difference.
>
> On small devices, the major cost is the wakeup of the cpu to process
> one frame before going back to idle...

I agree with Yuchung and Eric on all counts.