Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 09:23:47PM -0500, David Miller wrote:
> From: Thomas Gleixner
> Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)
>
> > I surely agree, but we have gone the way of PTI without the ability of
> > exempting individual processes exactly for one reason:
> >
> >     Lack of time
> >
> > It can be done on top of the PTI implementation and it won't take ages.
> >
> > For spectre_v1/2 we face the same problem simply because we got informed so
> > much ahead of time and we were all twiddling thumbs, enjoying our christmas
> > vacation and having a good time.
>
> I just want to point out that this should be noted in history as a
> case where all of this controlled disclosure stuff seems to have made
> things worse rather than better.

I will note that the "controlled disclosure" for this thing was a total
and complete mess, and unlike any that I have ever seen in the past.
The people involved in running it had no idea how to do it at all, and
because of that, it failed miserably, despite being warned about it
numerous times by numerous people.

> Why is there so much haste and paranoia if supposedly some group of
> people had all this extra time to think about and deal with this bug?

Because that group was so small and isolated that they did not actually
talk to anyone who could actually provide input to help deal with the
bug.

So we are stuck now with dealing with this "properly", which is fine,
but please don't think that this is an excuse to blame "controlled
disclosure". We know how to do that correctly; it did not happen in
this case at all because the people driving the problem refused to do
it.

> Think I'm nuts? Ok, then how did we fare any better by keeping this
> junk under wraps for weeks if not months? (seriously, did responsible
> people really know about this as far back as... June 2017?)
Some "people" did, just not some "responsible people" :) Oh well, time for the kernel to fix hardware bugs again, that's what we are here for, you would think we would be used to it by now... greg k-h
Re: [PATCH net-next] l2tp: adjust comments about L2TPv3 offsets
On 05/01/18 18:47, Guillaume Nault wrote:
> The "offset" option has been removed by
> commit 900631ee6a26 ("l2tp: remove configurable payload offset").
>
> Signed-off-by: Guillaume Nault
> ---
>  include/uapi/linux/l2tp.h | 2 +-
>  net/l2tp/l2tp_core.c      | 7 +++----
>  2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/l2tp.h b/include/uapi/linux/l2tp.h
> index f78eef4cc56a..71e62795104d 100644
> --- a/include/uapi/linux/l2tp.h
> +++ b/include/uapi/linux/l2tp.h
> @@ -65,7 +65,7 @@ struct sockaddr_l2tpip6 {
>   * TUNNEL_MODIFY   - CONN_ID, udpcsum
>   * TUNNEL_GETSTATS - CONN_ID, (stats)
>   * TUNNEL_GET      - CONN_ID, (...)
> - * SESSION_CREATE  - SESSION_ID, PW_TYPE, offset, data_seq, cookie, peer_cookie, offset, l2spec
> + * SESSION_CREATE  - SESSION_ID, PW_TYPE, data_seq, cookie, peer_cookie, l2spec
>   * SESSION_DELETE  - SESSION_ID
>   * SESSION_MODIFY  - SESSION_ID, data_seq
>   * SESSION_GET     - SESSION_ID, (...)
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 786cd7f6a5e8..62285fc6eb59 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -662,10 +662,9 @@ static int l2tp_recv_data_seq(struct l2tp_session *session, struct sk_buff *skb)
>   * |x|S|x|x|x|x|x|x|          Sequence Number                      |
>   * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>   *
> - * Cookie value, sublayer format and offset (pad) are negotiated with
> - * the peer when the session is set up. Unlike L2TPv2, we do not need
> - * to parse the packet header to determine if optional fields are
> - * present.
> + * Cookie value and sublayer format are negotiated with the peer when
> + * the session is set up. Unlike L2TPv2, we do not need to parse the
> + * packet header to determine if optional fields are present.
>   *
>   * Caller must already have parsed the frame and determined that it is
>   * a data (not control) frame before coming here. Fields up to the

Acked-by: James Chapman
Re: [PATCH] atm/clip: Use seq_puts() in svc_addr()
>> @@ -708,11 +708,11 @@ static void svc_addr(struct seq_file *seq, struct sockaddr_atmsvc *addr)
>>      static int e164[] = { 1, 8, 4, 6, 1, 0 };
>>
>>      if (*addr->sas_addr.pub) {
>> -            seq_printf(seq, "%s", addr->sas_addr.pub);
>> +            seq_puts(seq, addr->sas_addr.pub);
>
> Which opens a lot of security concerns.

How? - The passed string is just copied into a buffer in the end, isn't it?

> Never do this again.

Why do you dislike such a small source code transformation at the moment?

> P.S. I'm wondering what would be first,

I am curious how the communication difficulties can be resolved.

> Markus starts looking into the actual code,

I inspected the original source code to some degree.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/fs/seq_file.c?id=895c0dde398510a5b5ded60e5064c11b94bd30ca#n682
https://elixir.free-electrons.com/linux/v4.15-rc6/source/fs/seq_file.c#L660

> or most (all) of the maintainers just ban him.

The change acceptance varies for various reasons among the involved
contributors.

Regards,
Markus
[PATCH] docs-rst: networking: wire up msg_zerocopy
Fix the following 'make htmldocs' complaint:

Documentation/networking/msg_zerocopy.rst:: WARNING: document isn't included in any toctree.

Signed-off-by: Mike Rapoport
---
 Documentation/networking/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 66e620866245..7d4b15977d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -9,6 +9,7 @@ Contents:
    batman-adv
    kapi
    z8530book
+   msg_zerocopy

 .. only::  subproject

@@ -16,4 +17,3 @@ Contents:
    ===

    * :ref:`genindex`
-
-- 
2.7.4
RE: WARNING: CPU: 0 PID: 0 at ./include/linux/netfilter.h:233 arp_rcv
From: Marco Franchi
Sent: Friday, January 05, 2018 11:03 PM
>Hi,
>
>I am getting the following warning on a imx6ul-evk board running linux-next
>20180105:
>
>[9.233290] [ cut here ]
>[9.242068] WARNING: CPU: 0 PID: 0 at ./include/linux/netfilter.h:233 arp_rcv+0x1f8/0x228
>[9.250381] Modules linked in:
>[9.253633] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc6-next-20180104-dirty #2
>[9.261764] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
>[9.268065] Backtrace:
>[9.270719] [] (dump_backtrace) from [] (show_stack+0x18/0x1c)
>[9.278438] r7: r6:6153 r5: r4:c1079878
>[9.284258] [] (show_stack) from [] (dump_stack+0xb4/0xe8)
>[9.291641] [] (dump_stack) from [] (__warn+0xf0/0x11c)
>[9.298763] r9:d8053000 r8:00e9 r7:0009 r6:c0da3bdc r5: r4:
>[9.306662] [] (__warn) from [] (warn_slowpath_null+0x44/0x50)
>[9.314383] r8:d8053000 r7:0608 r6:c08873f8 r5:00e9 r4:c0da3bdc
>[9.321243] [] (warn_slowpath_null) from [] (arp_rcv+0x1f8/0x228)
>[9.329215] r6:c0887200 r5:c107ac58 r4:d899b240
>[9.333999] [] (arp_rcv) from [] (__netif_receive_skb_core+0x878/0xbd4)
>[9.342491] r6:c0887200 r5:c100ac8c r4:d899b240
>[9.347265] [] (__netif_receive_skb_core) from [] (__netif_receive_skb+0x2c/0x8c)
>[9.356643] r10:0080 r9:d8053000 r8:e0a26000 r7:c107ab0d r6:c1008908 r5:d899b240
>[9.364598] r4:c10099c4
>[9.367288] [] (__netif_receive_skb) from [] (netif_receive_skb_internal+0x7c/0x354)
>[9.376904] r5:d899b240 r4:c10099c4
>[9.380635] [] (netif_receive_skb_internal) from [] (napi_gro_receive+0x88/0xa4)
>[9.389918] r8:e0a26000 r7:0001 r6:d899b240 r5:d899b240 r4:0003
>[9.396784] [] (napi_gro_receive) from [] (fec_enet_rx_napi+0x3a8/0x9b8)
>[9.405357] r5:d8054000 r4:
>[9.409093] [] (fec_enet_rx_napi) from [] (net_rx_action+0x220/0x334)
>[9.417431] r10:dbbdfa00 r9:c1001d94 r8:0040 r7:012c r6:8e6b r5:0001
>[9.425384] r4:d8053710
>[9.428075] [] (net_rx_action) from [] (__do_softirq+0x128/0x2a0)
>[9.436063] r10:4003 r9:c1003080 r8:0100 r7:c100308c r6:c100 r5:0003
>[9.444018] r4:
>[9.446708] [] (__do_softirq) from [] (irq_exit+0x14c/0x1a8)
>[9.454262] r10:e080a000 r9:d8004400 r8:0001 r7: r6:c1008a98 r5:
>[9.462218] r4:e000
>[9.464911] [] (irq_exit) from [] (__handle_domain_irq+0x74/0xe8)
>[9.472879] r5: r4:c0f7cc24
>[9.476614] [] (__handle_domain_irq) from [] (gic_handle_irq+0x64/0xc4)
>[9.485121] r9:c1028344 r8:c1001e98 r7: r6:03ff r5:03eb r4:e080a00c
>[9.493016] [] (gic_handle_irq) from [] (__irq_svc+0x70/0x98)
>[9.500632] Exception stack(0xc1001e98 to 0xc1001ee0)
>[9.505822] 1e80: 0001 0001
>[9.514157] 1ea0: c100bf80 25963796 0002 0002 dbbde5c8 26545294 0002
>[9.522491] 1ec0: c1001f14 c1001eb8 c1001ee8 c01724fc c0724464 2153
>[9.530824] r10: r9:c100 r8:26545294 r7:c1001ecc r6: r5:2153
>[9.538778] r4:c0724464
>[9.541470] [] (cpuidle_enter_state) from [] (cpuidle_enter+0x1c/0x20)
>[9.549893] r10:c100f6a8 r9:dbbde5c8 r8:c0f7c5c0 r7:c1008978 r6:0001 r5:c100892c
>[9.557857] r4:c100 r3:dbbde5c8
>[9.561589] [] (cpuidle_enter) from [] (call_cpuidle+0x28/0x44)
>[9.569399] [] (call_cpuidle) from [] (do_idle+0x1bc/0x230)
>[9.576861] [] (do_idle) from [] (cpu_startup_entry+0x20/0x24)
>[9.584587] r10:c0f63a50 r9:c1008908 r8:c107b480 r7:c1008900 r6:c107b480 r5:0002
>[9.592546] r4:00c4 r3:c0f75354
>[9.596283] [] (cpu_startup_entry) from [] (rest_init+0x210/0x25c)
>[9.604372] [] (rest_init) from [] (start_kernel+0x390/0x418)
>[9.611998] r5: r4:c107b4cc
>[9.615723] [] (start_kernel) from [<>] ( (null))
>[9.622235] r10:10c5387d r9:410fc075 r8:8300 r7: r6:10c0387d r5:0051
>[9.630191] r4:c0f0032c
>[9.632851] ---[ end trace 2d5d5f79c0c8da59 ]---
>
>Does anyone know how to fix it?
>
>Thanks

Enabling the kernel config option CONFIG_NETFILTER_FAMILY_ARP fixes the
warning. It was introduced by the commit below.
commit 8de98f058360722a1a9febe3970de6dcd4d91513
Author: Florian Westphal
Date:   Thu Dec 7 16:28:26 2017 +0100

    netfilter: don't allocate space for arp/bridge hooks unless needed

    no need to define hook points if the family isn't supported.
    Because we need these hooks for either nftables, arp/ebtables
    or the 'call-iptables' hack we have in the bridge layer, add
    two new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE},
    and have the users select them.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

diff --git
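The "have the users select them" mechanism in that commit works through Kconfig `select` of a hidden symbol. A sketch of the pattern (NETFILTER_FAMILY_ARP is real; the selecting user shown is one plausible example, based on how ARP tables depend on the ARP hooks):

```kconfig
# Hidden symbol: no prompt string, so it cannot be toggled directly;
# it only becomes 'y' when some other option selects it.
config NETFILTER_FAMILY_ARP
	bool

# Any feature that registers ARP netfilter hooks selects the family
# symbol, so the per-family hook array is compiled in only when a
# user actually needs it.
config IP_NF_ARPTABLES
	tristate "ARP tables support"
	select NETFILTER_FAMILY_ARP
```

This is why the warning above disappears once a selecting option is enabled: the hook entries that arp_rcv() checks exist again.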
Re: [PATCHv3 0/2] capability controlled user-namespaces
On Mon, Jan 08, 2018 at 11:35:26AM +1100, James Morris wrote:
> On Tue, 2 Jan 2018, Mahesh Bandewar (महेश बंडेवार) wrote:
>
> > On Sat, Dec 30, 2017 at 12:31 AM, James Morris wrote:
> > > On Wed, 27 Dec 2017, Mahesh Bandewar (महेश बंडेवार) wrote:
> > >
> > >> Hello James,
> > >>
> > >> Seems like I missed your name to be added into the review of this
> > >> patch series. Would you be willing to pull this into the security
> > >> tree? Serge Hallyn has already ACKed it.
> > >
> > > Sure!
> > >
> > Thank you James.
>
> I'd like to see what Eric Biederman thinks of this.
>
> Also, why do we need the concept of a controlled user-ns at all, if the
> default whitelist maintains existing behavior?

In past discussions two uses have been brought up:

1. If a 0-day is discovered which is exacerbated by a specific privilege
   in user namespaces, that privilege could be turned off until a reboot
   with a fixed kernel is scheduled, without fully disabling all
   containers.

2. Some systems may be specifically designed to run software which only
   requires a few capabilities in a userns. In that case all others
   could be disabled.
[PATCH] vhost: Remove the unused variable.
The patch (7235acdb1) changed the way work is flushed, so the queued
seq, done seq, and the flushing counter are not used anymore. Remove
them now.

Fixes: 7235acdb1 ("vhost: simplify work flushing")
Cc: Jason Wang
Signed-off-by: Tonghao Zhang
---
 drivers/vhost/vhost.c | 1 -
 drivers/vhost/vhost.h | 4 ----
 2 files changed, 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 33ac2b186b85..9b04cad91d65 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -181,7 +181,6 @@ void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 {
 	clear_bit(VHOST_WORK_QUEUED, &work->flags);
 	work->fn = fn;
-	init_waitqueue_head(&work->done);
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 79c6e7a60a5e..749fe13e061c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -20,10 +20,6 @@ typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 struct vhost_work {
 	struct llist_node node;
 	vhost_work_fn_t fn;
-	wait_queue_head_t done;
-	int flushing;
-	unsigned queue_seq;
-	unsigned done_seq;
 	unsigned long flags;
 };
-- 
2.13.6
Re: [patch iproute2 v6 0/3] tc: Add -bs option to batch mode
On 1/7/18 7:03 PM, Chris Mi wrote:
>> Did you measure the effect of increasing batch sizes?
> Yes. Even if we enlarge the batch size bigger than 10, there is no big
> improvement.

That will change over time, so the tc command should allow
auto-batching to work up to the message size limit.

> I think that's because current kernel doesn't process the requests in
> parallel. If kernel processes the requests in parallel, I believe
> specifying a bigger batch size will get a better result.
>>
>> I wonder whether specifying the batch size is necessary at all.
>> Couldn't batch mode just collect messages until either EOF or an
>> incompatible command is encountered which then triggers a commit to
>> kernel? This might simplify code quite a bit.
> That's a good suggestion.

Thanks for your time on this, Chris.
Re: [PATCH net-next V2 2/2] tun: allow to attach ebpf socket filter
On 2018年01月06日 00:21, Willem de Bruijn wrote:
> On Fri, Jan 5, 2018 at 4:54 AM, Jason Wang wrote:
>> This patch allows userspace to attach eBPF filter to tun. This will
>> allow to implement VM dataplane filtering in a more efficient way
>> compared to cBPF filter by allowing either qemu or libvirt to attach
>> eBPF filter to tun.
>>
>> Signed-off-by: Jason Wang
>> ---
>>  drivers/net/tun.c           | 39 +++
>>  include/uapi/linux/if_tun.h |  1 +
>>  2 files changed, 36 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 0853829..9fc8b70 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -238,6 +238,12 @@ struct tun_struct {
>>         struct tun_pcpu_stats __percpu *pcpu_stats;
>>         struct bpf_prog __rcu *xdp_prog;
>>         struct tun_prog __rcu *steering_prog;
>> +       struct tun_prog __rcu *filter_prog;
>> +};
>> +
>> +struct veth {
>> +       __be16 h_vlan_proto;
>> +       __be16 h_vlan_TCI;
>>  };
>>
>>  static int tun_napi_receive(struct napi_struct *napi, int budget)
>> @@ -984,12 +990,25 @@ static void tun_automq_xmit(struct tun_struct *tun, struct sk_buff *skb)
>>  #endif
>>  }
>>
>> +static unsigned int run_ebpf_filter(struct tun_struct *tun,
>> +                                    struct sk_buff *skb,
>> +                                    int len)
>> +{
>> +       struct tun_prog *prog = rcu_dereference(tun->filter_prog);
>> +
>> +       if (prog)
>> +               len = bpf_prog_run_clear_cb(prog->prog, skb);
>> +
>> +       return len;
>> +}
>> +
>>  /* Net device start xmit */
>>  static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>  {
>>         struct tun_struct *tun = netdev_priv(dev);
>>         int txq = skb->queue_mapping;
>>         struct tun_file *tfile;
>> +       int len = skb->len;
>>
>>         rcu_read_lock();
>>         tfile = rcu_dereference(tun->tfiles[txq]);
>> @@ -1015,6 +1034,16 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>             sk_filter(tfile->socket.sk, skb))
>>                 goto drop;
>>
>> +       len = run_ebpf_filter(tun, skb, len);
>> +
>> +       /* Trim extra bytes since we may inster vlan proto & TCI

> inster -> insert

Will fix.

>> +        * in tun_put_user().
>> +        */
>> +       if (skb_vlan_tag_present(skb))
>> +               len -= skb_vlan_tag_present(skb) ? sizeof(struct veth) : 0;

> no need for testing skb_vlan_tag_present twice.

Right.
> more importantly, why trim these bytes unconditionally? only if the
> filter trims a packet to a length shorter than the minimum could this
> cause problems. sk_filter_trim_cap with a lower bound avoids that:
>
>     skb_vlan_tag_present(skb) ? sizeof(struct vlan_ethhdr) : 0;

The problem is, if the filter wants to trim the packet to 50 bytes, we
may get 54 bytes if a vlan tag is present. This seems wrong.

Thanks
Re: Subject: [RFC][PATCH 10/11] dwc-xlgmac: fix big-endian breakage
On 2018/1/6 3:32, Al Viro wrote:
> Users of XLGMAC_SET_REG_BITS_LE() expect it to take le32 and return
> le32.
>
> Signed-off-by: Al Viro
> ---
>  drivers/net/ethernet/synopsys/dwc-xlgmac.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac.h b/drivers/net/ethernet/synopsys/dwc-xlgmac.h
> index cab3e40a86b9..e95c4c250e16 100644
> --- a/drivers/net/ethernet/synopsys/dwc-xlgmac.h
> +++ b/drivers/net/ethernet/synopsys/dwc-xlgmac.h
> @@ -106,7 +106,7 @@
>  #define XLGMAC_GET_REG_BITS_LE(var, pos, len) ({		\
>  	typeof(pos) _pos = (pos);				\
>  	typeof(len) _len = (len);				\
> -	typeof(var) _var = le32_to_cpu((var));			\
> +	u32 _var = le32_to_cpu((var));				\
>  	((_var) & GENMASK(_pos + _len - 1, _pos)) >> (_pos);	\
>  })
>
> @@ -125,8 +125,8 @@
>  	typeof(len) _len = (len);				\
>  	typeof(val) _val = (val);				\
>  	_val = (_val << _pos) & GENMASK(_pos + _len - 1, _pos);	\
> -	_var = (_var & ~GENMASK(_pos + _len - 1, _pos)) | _val;	\
> -	cpu_to_le32(_var);					\
> +	(_var & ~cpu_to_le32(GENMASK(_pos + _len - 1, _pos))) |	\
> +	cpu_to_le32(_val);					\
>  })
>
>  struct xlgmac_pdata;

Thanks for your patch.

Acked-by: Jie Deng
[PATCH bpf-next] bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack
CVE-2017-5715.

A quote from the Google Project Zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems
annoying. So instead, we decided to use the eBPF interpreter, which is
built into the host kernel - while there is no legitimate way to
invoke it from inside a VM, the presence of the code in the host
kernel's text section is sufficient to make it usable for the attack,
just like with ordinary ROP gadgets."

To make the attacker's job harder, introduce the BPF_JIT_ALWAYS_ON
config option that removes the interpreter from the kernel in favor of
JIT-only mode.

So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64

The start of the JITed program is randomized and the code page is
marked as read-only. In addition, "constant blinding" can be turned on
with net.core.bpf_jit_harden.

Signed-off-by: Alexei Starovoitov
---
 init/Kconfig               | 7 +++++++
 kernel/bpf/core.c          | 9 +++++++++
 kernel/bpf/verifier.c      | 4 ++++
 net/core/sysctl_net_core.c | 9 +++++++++
 4 files changed, 29 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 2934249fba46..5e2a4a391ba9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1392,6 +1392,13 @@ config BPF_SYSCALL
 	  Enable the bpf() system call that allows to manipulate eBPF
 	  programs and maps via file descriptors.
+config BPF_JIT_ALWAYS_ON
+	bool "Permanently enable BPF JIT and remove BPF interpreter"
+	depends on BPF_SYSCALL && HAVE_EBPF_JIT && BPF_JIT
+	help
+	  Enables BPF JIT and removes BPF interpreter to avoid
+	  speculative execution of BPF instructions by the interpreter
+
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
 	select ANON_INODES
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 70a534549cd3..42756c434e0b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -781,6 +781,7 @@ noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 }
 EXPORT_SYMBOL_GPL(__bpf_call_base);

+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
 /**
  *	__bpf_prog_run - run eBPF program on a given context
  *	@ctx: is the data we are operating on
@@ -1376,6 +1377,7 @@ void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth)
 		__bpf_call_base_args;
 	insn->code = BPF_JMP | BPF_CALL_ARGS;
 }
+#endif

 bool bpf_prog_array_compatible(struct bpf_array *array,
 			       const struct bpf_prog *fp)
@@ -1427,9 +1429,11 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
  */
 struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 {
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
 	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);

 	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
+#endif

 	/* eBPF JITs can rewrite the program in case constant
 	 * blinding is active.
 	 * However, in case of error during
@@ -1453,6 +1457,11 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 */
 	*err = bpf_check_tail_call(fp);

+#ifdef CONFIG_BPF_JIT_ALWAYS_ON
+	if (!fp->jited)
+		*err = -ENOTSUPP;
+#endif
+
 	return fp;
 }
 EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a2b211262c25..ca80559c4ec3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5267,7 +5267,11 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 		depth = get_callee_stack_depth(env, insn, i);
 		if (depth < 0)
 			return depth;
+#ifdef CONFIG_BPF_JIT_ALWAYS_ON
+		return -ENOTSUPP;
+#else
 		bpf_patch_call_args(insn, depth);
+#endif
 	}
 	return 0;
 }
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde4cfcc..1c8af0f4f385 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -325,7 +325,13 @@ static struct ctl_table net_core_table[] = {
 		.data		= &bpf_jit_enable,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
 		.proc_handler	= proc_dointvec
+#else
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+		.extra2		= &one,
+#endif
 	},
 # ifdef CONFIG_HAVE_EBPF_JIT
 	{
@@ -524,6 +530,9 @@ static __net_initdata struct pernet_operations sysctl_core_ops = {

 static __init int sysctl_core_init(void)
 {
+#if
Re: [net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"
David Miller writes:
> From: Michael Ellerman
> Date: Fri, 22 Dec 2017 15:22:22 +1100
>
>>> On Tue, Dec 19 2017, Michael Ellerman wrote:
>>>> This revert seems to have broken networking on one of my powerpc
>>>> machines, according to git bisect.
>>>>
>>>> The symptom is DHCP fails and I don't get a link, I didn't dig any
>>>> further than that. I can if it's helpful.
>>>>
>>>> I think the problem is that 87c320e51519 ("net: core:
>>>> dev_get_valid_name is now the same as dev_alloc_name_ns") only makes
>>>> sense while d6f295e9def0 remains in the tree.
>>>
>>> I'm sorry about all of this, I really didn't think there would be such
>>> consequences of changing an errno return. Indeed, d6f29 was preparation
>>> for unifying the two functions that do the exact same thing (and how we
>>> ever got into that situation is somewhat unclear), except for
>>> their behaviour in the case the requested name already exists. So one of
>>> the two interfaces had to change its return value, and as I wrote, I
>>> thought EEXIST was the saner choice when an explicit name (no %d) had
>>> been requested.
>>
>> No worries.
>>
>>>> ie. before the entire series, dev_get_valid_name() would return
>>>> EEXIST, and that was retained when 87c320e51519 was merged, but now
>>>> that d6f295e9def0 has been reverted dev_get_valid_name() is returning
>>>> ENFILE.
>>>>
>>>> I can get the network up again if I also revert 87c320e51519 ("net:
>>>> core: dev_get_valid_name is now the same as dev_alloc_name_ns"), or
>>>> with the gross patch below.
>>>
>>> I don't think changing -ENFILE to -EEXIST would be right either, since
>>> dev_get_valid_name() used to be able to return both (-EEXIST in the case
>>> where there's no %d, -ENFILE in the case where we end up calling
>>> dev_alloc_name_ns()). If anything, we could do the check for the old
>>> -EEXIST condition first, and then call dev_alloc_name_ns(). But I'm also
>>> fine with reverting.
>>
>> Yeah I think a revert would be best, given it's nearly rc5.
>>
>> My userspace is not exotic AFAIK, just debian something, so presumably
>> this will affect other people too.
>
> I've just queued up the following revert, thanks!

Thanks. I don't see it in rc7, will it get to Linus sometime before the
release?

cheers
Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On 12/29/17 12:20 AM, Masami Hiramatsu wrote:
>> Please run Josef's test in the !ftrace setup.
>
> Yes, I'll add the result of the test case.

If Josef's test passes in the !ftrace config, please resend your
patches. I think 2 and 3 were nice simplifications, and patch 1 is good
too if it passes the test. Would be great to get them in for this merge
window. Thanks!
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:
> > which means that POC is relying on 64-bit address speculation.
> > In the places coverity found the user supplied value is 32-bit,
>
> People have 32bit computers as well as 64bit and in some cases 32bit is
> fine for an attack depending where your target is relative to the object.

Right, absolutely correct on 32-bit archs. The question is whether
truncation to 32-bit is enough to work around spectre1 on 64-bit archs.
I hope the researchers can clarify.

> lfence timing is also heavily dependent upon what work has to be done to
> retire previous live instructions.
> BPF does not normally do a lot of writing so you'd expect the cost to be low.

Right, to retire previous loads. Not sure what 'not a lot of writing'
has to do with lfence. Our XDP based DDOS mostly does reads with few
writes for stats into maps, whereas the XDP based load balancer
modifies every packet. XDP is root only, so not relevant in the spectre
context. Just clarifying reads vs writes in BPF.
Re: [PATCH net-next 00/18] ipv6: Align nexthop behaviour with IPv4
From: Ido Schimmel
Date: Sun, 7 Jan 2018 12:45:00 +0200

> This set tries to eliminate some differences between IPv4's and IPv6's
> treatment of nexthops. These differences are most likely a side effect
> of IPv6's data structures (specifically 'rt6_info') that incorporate
> both the route and the nexthop and the late addition of ECMP support in
> commit 51ebd3181572 ("ipv6: add support of equal cost multipath
> (ECMP)").
...
> Finally, this series also serves as a good first step towards David
> Ahern's goal of treating nexthops as standalone objects [2], as it makes
> the code more in line with IPv4 where the nexthop and the nexthop group
> are separate objects from the route itself.
...

Looks great, series applied, thanks Ido!
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 12:15:40PM -0800, Dan Williams wrote:
>
> I'm thinking we should provide the option to at least build the
> hot-path nospec_array_ptr() usages without an lfence.
>
>     CONFIG_SPECTRE1_PARANOIA_SAFE
>     CONFIG_SPECTRE1_PARANOIA_PERF

SAFE vs PERF naming is problematic and misleading, since users don't
have the data to make a decision; they will be forced to go with SAFE.
What is not safe about the array_access() macro with AND? How does the
lfence approach make it safer? Only because lfence was blessed by Intel
earlier when they couldn't figure out a different way?

How about:
    CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
    CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

> ...if only for easing performance testing and let the distribution set
> its policy.
>
> Where hot-path usages can do:
>
>     nospec_relax(nospec_array_ptr())

The AND approach doesn't prevent speculation, hence nospec_ is an
incorrect prefix. Alan's "speculation management" terminology fits well
here.

Can we keep the array_access() name and change it underneath to either
mask or lfence?
Re: pull-request: bpf-next 2018-01-07
From: Daniel Borkmann
Date: Mon, 8 Jan 2018 00:56:16 +0100

> The following pull-request contains BPF updates for your *net-next* tree.
>
> The main changes are:
...
> Please consider pulling these changes from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
>
> Thanks a lot & a happy new year!

Looks great, pulled, thanks a lot Daniel!
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
From: Thomas Gleixner
Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)

> I surely agree, but we have gone the way of PTI without the ability of
> exempting individual processes exactly for one reason:
>
>     Lack of time
>
> It can be done on top of the PTI implementation and it won't take ages.
>
> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

I just want to point out that this should be noted in history as a
case where all of this controlled disclosure stuff seems to have made
things worse rather than better.

Why is there so much haste and paranoia if supposedly some group of
people had all this extra time to think about and deal with this bug?

From what I've seen, every single time, the worse a problem is, the
more important it is to expose it to as many smart folks as possible.
And to do so as fast as possible.

And to me that means full disclosure immediately for the super high
level stuff like what we are dealing with here.

Think I'm nuts? Ok, then how did we fare any better by keeping this
junk under wraps for weeks if not months? (seriously, did responsible
people really know about this as far back as... June 2017?)

Controlled disclosure for high profile bugs seems to only achieve two
things:

1) Vendors can cover their butts and come up with deflection
   strategies.

2) The "theatre" aspect of security can be maximized as much as
   possible. We even have a pretty web site and cute avatars this
   time!

None of this has anything to do with having time to come up with the
best possible implementation of a fix. You know, the technical part?

So after what appears to be as much as 6 months of deliberating, the
very wise men in the special room said:

	"KPTI and lfence"

Do I get this right?
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 11:08:24AM +0100, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> > which clearly states that bpf_tail_call() was used in the attack.
> > Yet none of the intel nor arm patches address speculation in
> > this bpf helper!
> > It means that:
> > - gpz didn't share neither exploit nor the detailed description
> >   of the POC with cpu vendors until now
> > - coverity rules used to find all these places in the kernel
> >   failed to find bpf_tail_call
> > - cpu vendors were speculating what variant 1 can actually do
>
> You forgot to mention that there might be other attacks than the public POC
> which are not covered by a simple AND

If the above statement is not pure speculation, please share the CVE
number. For variant[123] they've been reserved for months.

> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

Right. They were informed in a way that they failed to address variant1
with pre-embargo and public patches.

> The exploits are out in the wild and they are observed already, so we

This statement contradicts everything that was publicly stated. Or are
you referring to the 'exploit' at the end of the spectre paper?

> want to discuss the right way to fix it for the next 3 month and leave all
> doors open until the last bit of performance is squeezed out.
Let's look at facts:

- Linus explains his array_access() idea
- lfence proponents quickly point out that it needs gcc to be smart
  enough to emit setbe, and go back to lfence patches
- I spent half an hour to tweak array_access() to be generic on all
  archs and compiler independent
- lfence proponents point out that AND with a variable somehow won't
  work, yet after further discussion it's actually fine due to the
  nature of the variant1 attack that needs to train the predictor with
  in-bounds accesses to mispredict with an out-of-bounds speculative
  load
- then lfence proponents claim a 'non public POC' not covered by AND
- all in the matter of 2 days
- and now the argument that it will take 3 months to discuss a solution
  without lfence, yet still no performance numbers for lfence, though
  'people in the know' were twiddling thumbs for months

That's just not cool.
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
From: Thomas Gleixner Date: Sun, 7 Jan 2018 19:31:41 +0100 (CET) > 2) Alexei's analysis is purely based on the public information of the google > zero folks. If it would be complete and the only attack vector all fine. > > If not and I doubt it is, we're going to regret this decision faster > than we made it and this is not the kind of play field where we can > afford that. Please state this more clearly. Do you know about other attack vectors and just are not allowed to talk about them? Or is this, ironically, speculation?
RE: [patch iproute2 v6 0/3] tc: Add -bs option to batch mode
> -Original Message- > From: n...@orbyte.nwl.cc [mailto:n...@orbyte.nwl.cc] On Behalf Of Phil > Sutter > Sent: Saturday, January 6, 2018 1:25 AM > To: Chris Mi> Cc: netdev@vger.kernel.org; gerlitz...@gmail.com; > step...@networkplumber.org; dsah...@gmail.com; > marcelo.leit...@gmail.com > Subject: Re: [patch iproute2 v6 0/3] tc: Add -bs option to batch mode > > Hi Chris, > > On Thu, Jan 04, 2018 at 04:34:51PM +0900, Chris Mi wrote: > > Currently in tc batch mode, only one command is read from the batch > > file and sent to kernel to process. With this patchset, we can > > accumulate several commands before sending to kernel. The batch size > > is specified using option -bs or -batchsize. > > > > To accumulate the commands in tc, client should allocate an array of > > struct iovec. If batchsize is bigger than 1, only after the client has > > accumulated enough commands, can the client call rtnl_talk_msg to send > > the message that includes the iov array. One exception is that there > > is no more command in the batch file. > > > > But please note that kernel still processes the requests one by one. > > To process the requests in parallel in kernel is another effort. > > The time we're saving in this patchset is the user mode and kernel > > mode context switch. So this patchset works on top of the current kernel. > > > > Using the following script in kernel, we can generate 1,000,000 rules. > > tools/testing/selftests/tc-testing/tdc_batch.py > > > > Without this patchset, 'tc -b $file' execution time is: > > > > real0m15.555s > > user0m7.211s > > sys 0m8.284s > > > > With this patchset, 'tc -b $file -bs 10' execution time is: > > > > real0m13.043s > > user0m6.479s > > sys 0m6.504s > > > > The insertion rate is improved more than 10%. > > Did you measure the effect of increasing batch sizes? Yes. Even if we enlarge the batch size bigger than 10, there is no big improvement. I think that's because current kernel doesn't process the requests in parallel.
If kernel processes the requests in parallel, I believe specifying a bigger batch size will get a better result. > > I wonder whether specifying the batch size is necessary at all. Couldn't batch > mode just collect messages until either EOF or an incompatible command is > encountered which then triggers a commit to kernel? This might simplify > code quite a bit. That's a good suggestion. -Chris > > Cheers, Phil
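The accumulation scheme the cover letter describes can be sketched in userspace C. All names here are mine for illustration (iproute2's actual helper is rtnl_talk_msg, whose signature I'm not reproducing): commands land in an iovec array and the flush callback, which would wrap a single sendmsg()-style kernel round trip, only runs when the batch fills or input ends.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define BATCHSIZE 10	/* the value passed via -bs in the benchmark */

struct batch {
	struct iovec iov[BATCHSIZE];
	int n;
	/* hypothetical flush hook, e.g. one sendmsg() over the rtnetlink
	 * socket carrying all accumulated messages */
	int (*flush)(struct iovec *iov, int n);
};

static int batch_add(struct batch *b, void *msg, size_t len)
{
	b->iov[b->n].iov_base = msg;
	b->iov[b->n].iov_len = len;
	if (++b->n < BATCHSIZE)
		return 0;	/* keep accumulating */
	b->n = 0;
	return b->flush(b->iov, BATCHSIZE);	/* full: one syscall for 10 */
}

static int batch_end(struct batch *b)	/* EOF: flush the partial batch */
{
	int n = b->n;

	b->n = 0;
	return n ? b->flush(b->iov, n) : 0;
}
```

The saving is exactly what Chris states: one user/kernel context switch per BATCHSIZE commands instead of per command; the kernel still handles the netlink messages serially.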
Re: Subject: [RFC][PATCH 05/11] stmmac: fix several stray endianness bugs
On 01/05/2018 11:32 AM, Al Viro wrote: > > A few places got missed by "net: ethernet: stmmac: change dma descriptors > to __le32" (having been introduced just before the merge of that patch, > AFAICS). Fix them up... I could not spot the commit id associated with that change, which one is it? -- Florian
[PATCH v3 bpf] bpf: prevent out-of-bounds speculation
Under speculation, CPUs may mis-predict branches in bounds checks. Thus, memory accesses under a bounds check may be speculated even if the bounds check fails, providing a primitive for building a side channel. To avoid leaking kernel data round up array-based maps and mask the index after bounds check, so speculated load with out of bounds index will load either valid value from the array or zero from the padded area. Unconditionally mask index for all array types even when max_entries are not rounded to power of 2 for root user. When map is created by unpriv user generate a sequence of bpf insns that includes AND operation to make sure that JITed code includes the same 'index & index_mask' operation. If prog_array map is created by unpriv user replace bpf_tail_call(ctx, map, index); with if (index >= max_entries) { index &= map->index_mask; bpf_tail_call(ctx, map, index); } (along with roundup to power 2) to prevent out-of-bounds speculation. There is secondary redundant 'if (index >= max_entries)' in the interpreter and in all JITs, but they can be optimized later if necessary. Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array) cannot be used by unpriv, so no changes there. That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on all architectures with and without JIT. v2->v3: Daniel noticed that attack potentially can be crafted via syscall commands without loading the program, so add masking to those paths as well. 
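The mechanism the commit message describes can be shown as a small standalone sketch (simplified, with invented names; the real code lives in kernel/bpf/arraymap.c): round max_entries up to a power of two, derive index_mask = size - 1, and AND every index with it after the bounds check, so even a mis-predicted branch can only speculatively load from inside the padded array.

```c
#include <assert.h>
#include <stdint.h>

/* roundup_pow_of_two(max_entries) - 1, spelled out */
static uint32_t pow2_index_mask(uint32_t max_entries)
{
	uint32_t size = 1;

	while (size < max_entries)
		size <<= 1;
	return size - 1;
}

static const int *masked_lookup(const int *array, uint32_t max_entries,
				uint32_t mask, uint32_t index)
{
	if (index >= max_entries)	/* architectural check; may mispredict */
		return 0;
	/* Under misprediction the load below still executes speculatively,
	 * but the AND pins it inside the power-of-two padded region, so it
	 * reads a valid element or the zeroed padding - never kernel data. */
	return &array[index & mask];
}
```

This is the same transformation the patch applies both to the interpreter path and, for unprivileged prog_array maps, as extra BPF instructions so the JITed code carries the identical `index & index_mask` operation.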
Signed-off-by: Alexei StarovoitovAcked-by: John Fastabend --- include/linux/bpf.h | 2 ++ kernel/bpf/arraymap.c | 47 --- kernel/bpf/verifier.c | 36 3 files changed, 74 insertions(+), 11 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e55e4255a210..1b985ca4ffbe 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -52,6 +52,7 @@ struct bpf_map { u32 pages; u32 id; int numa_node; + bool unpriv_array; struct user_struct *user; const struct bpf_map_ops *ops; struct work_struct work; @@ -221,6 +222,7 @@ struct bpf_prog_aux { struct bpf_array { struct bpf_map map; u32 elem_size; + u32 index_mask; /* 'ownership' of prog_array is claimed by the first program that * is going to use this map or by the first program which FD is stored * in the map to make sure that all callers and callees have the same diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 7c25426d3cf5..aaa319848e7d 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -53,9 +53,10 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) { bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; int numa_node = bpf_map_attr_numa_node(attr); + u32 elem_size, index_mask, max_entries; + bool unpriv = !capable(CAP_SYS_ADMIN); struct bpf_array *array; u64 array_size; - u32 elem_size; /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || @@ -72,11 +73,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) elem_size = round_up(attr->value_size, 8); + max_entries = attr->max_entries; + index_mask = roundup_pow_of_two(max_entries) - 1; + + if (unpriv) + /* round up array size to nearest power of 2, +* since cpu will speculate within index_mask limits +*/ + max_entries = index_mask + 1; + array_size = sizeof(*array); if (percpu) - array_size += (u64) attr->max_entries * sizeof(void *); + array_size += (u64) max_entries * sizeof(void *); else - array_size += (u64) attr->max_entries * elem_size; + array_size += 
(u64) max_entries * elem_size; /* make sure there is no u32 overflow later in round_up() */ if (array_size >= U32_MAX - PAGE_SIZE) @@ -86,6 +96,8 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) array = bpf_map_area_alloc(array_size, numa_node); if (!array) return ERR_PTR(-ENOMEM); + array->index_mask = index_mask; + array->map.unpriv_array = unpriv; /* copy mandatory map attributes */ array->map.map_type = attr->map_type; @@ -121,12 +133,13 @@ static void *array_map_lookup_elem(struct bpf_map *map, void *key) if (unlikely(index >= array->map.max_entries)) return NULL; - return array->value + array->elem_size * index; + return array->value + array->elem_size * (index & array->index_mask); } /* emit BPF instructions equivalent to C code of array_map_lookup_elem() */ static u32 array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) { + struct
Re: Subject: [RFC][PATCH 07/11] broadcom: trivial sparse annotations
On 01/05/2018 11:32 AM, Al Viro wrote: > > Signed-off-by: Al Viro> --- > drivers/net/ethernet/broadcom/bcmsysport.c | 2 +- > drivers/net/ethernet/broadcom/bgmac.h | 6 +++--- > drivers/net/ethernet/broadcom/bnx2.c | 6 +++--- > drivers/net/ethernet/broadcom/cnic_if.h| 8 > drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 ++-- > drivers/net/ethernet/broadcom/tg3.c| 2 +- > 6 files changed, 14 insertions(+), 14 deletions(-) > > diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c > b/drivers/net/ethernet/broadcom/bcmsysport.c > index f15a8fc6dfc9..c2969b260aed 100644 > --- a/drivers/net/ethernet/broadcom/bcmsysport.c > +++ b/drivers/net/ethernet/broadcom/bcmsysport.c > @@ -1156,7 +1156,7 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct > sk_buff *skb, > memset(tsb, 0, sizeof(*tsb)); > > if (skb->ip_summed == CHECKSUM_PARTIAL) { > - ip_ver = htons(skb->protocol); > + ip_ver = ntohs(skb->protocol); Al can you make sure you CC drivers maintainers for these changes? You change bcmsysport.c and bcmgenet.c but you leave tg3.c with the same type of construct in tg3_start_xmit() any reason for that? I am also curious about this conversion considering the following numbers: grep 'skb->protocol == htons' {net/*,drivers/net/*} | wc -l 151 git grep 'skb->protocol == ntohs' {net/*,drivers/net/*} | wc -l 0 > switch (ip_ver) { > case ETH_P_IP: > ip_proto = ip_hdr(skb)->protocol; > diff --git a/drivers/net/ethernet/broadcom/bgmac.h > b/drivers/net/ethernet/broadcom/bgmac.h > index 4040d846da8e..40d02fec2747 100644 > --- a/drivers/net/ethernet/broadcom/bgmac.h > +++ b/drivers/net/ethernet/broadcom/bgmac.h > @@ -479,9 +479,9 @@ struct bgmac_rx_header { > struct bgmac { > union { > struct { > - void *base; > - void *idm_base; > - void *nicpm_base; > + void __iomem *base; > + void __iomem *idm_base; > + void __iomem *nicpm_base; > } plat; This part of the patch is correct. 
> struct { > struct bcma_device *core; > diff --git a/drivers/net/ethernet/broadcom/bnx2.c > b/drivers/net/ethernet/broadcom/bnx2.c > index 7919f6112ecf..154866e8517a 100644 > --- a/drivers/net/ethernet/broadcom/bnx2.c > +++ b/drivers/net/ethernet/broadcom/bnx2.c > @@ -8330,9 +8330,9 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device > *dev) > if (j < 32) > bp->fw_version[j++] = ' '; > for (i = 0; i < 3 && j < 28; i++) { > - reg = bnx2_reg_rd_ind(bp, addr + i * 4); > - reg = be32_to_cpu(reg); > - memcpy(&bp->fw_version[j], &reg, 4); > + __be32 v; > + v = cpu_to_be32(bnx2_reg_rd_ind(bp, addr + i * 4)); > + memcpy(&bp->fw_version[j], &v, 4); > j += 4; > } > } > diff --git a/drivers/net/ethernet/broadcom/cnic_if.h > b/drivers/net/ethernet/broadcom/cnic_if.h > index 789e5c7e9311..8cfef07a8e3d 100644 > --- a/drivers/net/ethernet/broadcom/cnic_if.h > +++ b/drivers/net/ethernet/broadcom/cnic_if.h > @@ -260,10 +260,10 @@ struct cnic_sockaddr { > struct cnic_sock { > struct cnic_dev *dev; > void*context; > - u32 src_ip[4]; > - u32 dst_ip[4]; > - u16 src_port; > - u16 dst_port; > + __be32 src_ip[4]; > + __be32 dst_ip[4]; > + __be16 src_port; > + __be16 dst_port; > u16 vlan_id; > unsigned char old_ha[ETH_ALEN]; > unsigned char ha[ETH_ALEN]; > diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c > b/drivers/net/ethernet/broadcom/genet/bcmgenet.c > index 24b4f4ceceef..77154f1479a9 100644 > --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c > +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c > @@ -1321,7 +1321,7 @@ static struct sk_buff *bcmgenet_free_tx_cb(struct > device *dev, > dma_unmap_addr_set(cb, dma_addr, 0); > } > > - return 0; > + return NULL; And this part as well is correct.
> } > > /* Simple helper to free a receive control block's resources */ > @@ -1480,7 +1480,7 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct > net_device *dev, > status = (struct status_64 *)skb->data; > > if (skb->ip_summed == CHECKSUM_PARTIAL) { > - ip_ver = htons(skb->protocol); > + ip_ver = ntohs(skb->protocol); > switch (ip_ver) { > case ETH_P_IP: > ip_proto = ip_hdr(skb)->protocol; > diff --git a/drivers/net/ethernet/broadcom/tg3.c > b/drivers/net/ethernet/broadcom/tg3.c > index a77ee2f8fb8d..2bd77d9990f2 100644 > --- a/drivers/net/ethernet/broadcom/tg3.c > +++
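Florian's question about the htons-to-ntohs conversion comes down to direction of conversion rather than runtime behavior. A userspace illustration (ETH_P_* redefined locally for the sketch; the kernel's live in linux/if_ether.h): skb->protocol holds the EtherType in network byte order, and comparing it against host-order constants requires ntohs(). htons() happens to perform the identical byte permutation on every common platform, which is why the old spelling never misbehaved - it is a type-direction error of the kind sparse's __be16 checking exists to catch.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

#define ETH_P_IP   0x0800	/* IPv4 EtherType, host-order constant */
#define ETH_P_IPV6 0x86DD	/* IPv6 EtherType, host-order constant */

/* wire_proto arrives in network byte order, like skb->protocol */
static int wire_proto_is_ipv4(uint16_t wire_proto)
{
	return ntohs(wire_proto) == ETH_P_IP;	/* network -> host: ntohs */
}
```

This also explains Florian's grep numbers: the tree-wide idiom `skb->protocol == htons(ETH_P_IP)` converts the constant to network order instead, which is equally correct and keeps the comparison in __be16 space.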
pull-request: bpf-next 2018-01-07
Hi David, The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add a start of a framework for extending struct xdp_buff without having the overhead of populating every data at runtime. Idea is to have a new per-queue struct xdp_rxq_info that holds read mostly data (currently that is, queue number and a pointer to the corresponding netdev) which is set up during rxqueue config time. When a XDP program is invoked, struct xdp_buff holds a pointer to struct xdp_rxq_info that the BPF program can then walk. The user facing BPF program that uses struct xdp_md for context can use these members directly, and the verifier rewrites context access transparently by walking the xdp_rxq_info and net_device pointers to load the data, from Jesper. 2) Redo the reporting of offload device information to user space such that it works in combination with network namespaces. The latter is reported through a device/inode tuple as similarly done in other subsystems as well (e.g. perf) in order to identify the namespace. For this to work, ns_get_path() has been generalized such that the namespace can be retrieved not only from a specific task (perf case), but also from a callback where we deduce the netns (ns_common) from a netdevice. bpftool support using the new uapi info and extensive test cases for test_offload.py in BPF selftests have been added as well, from Jakub. 3) Add two bpftool improvements: i) properly report the bpftool version such that it corresponds to the version from the kernel source tree. So pick the right linux/version.h from the source tree instead of the installed one. ii) fix bpftool and also bpf_jit_disasm build with binutils >= 2.9. The reason for the build breakage is that binutils library changed the function signature to select the disassembler. Given this is needed in multiple tools, add a proper feature detection to the tools/build/features infrastructure, from Roman.
4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the stacktrace map. It is currently unimplemented, but there are use cases where user space needs to walk all stacktrace map entries e.g. for dumping or deleting map entries w/o having to close and recreate the map. Add BPF selftests along with it, from Yonghong. 5) Few follow-up cleanups for the bpftool cgroup code: i) rename the cgroup 'list' command into 'show' as we have it for other subcommands as well, ii) then alias the 'show' command such that 'list' is accepted which is also common practice in iproute2, and iii) remove couple of newlines from error messages using p_err(), from Jakub. 6) Two follow-up cleanups to sockmap code: i) remove the unused bpf_compute_data_end_sk_skb() function and ii) only build the sockmap infrastructure when CONFIG_INET is enabled since it's only aware of TCP sockets at this time, from John. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot & a happy new year! 
The following changes since commit 6bb8824732f69de0f233ae6b1a8158e149627b38: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-12-29 15:42:26 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git for you to fetch changes up to 9be99badee761f0b2c065ecbd8bd54a96cbd0fa0: Merge branch 'bpf-stacktrace-map-next-key-support' (2018-01-06 23:52:23 +0100) Alexei Starovoitov (1): Merge branch 'xdp_rxq_info' Daniel Borkmann (3): Merge branch 'bpf-offload-report-dev' Merge branch 'bpf-bpftool-misc-fixes' Merge branch 'bpf-stacktrace-map-next-key-support' Jakub Kicinski (12): bpf: offload: don't require rtnl for dev list manipulation bpf: offload: don't use prog->aux->offload as boolean bpf: offload: allow netdev to disappear while verifier is running bpf: offload: free prog->aux->offload when device disappears bpf: offload: free program id when device disappears nsfs: generalize ns_get_path() for path resolution with a task bpf: offload: report device information for offloaded programs tools: bpftool: report device information for offloaded programs selftests/bpf: test device info reporting for bound progs tools: bpftool: rename cgroup list -> show in the code tools: bpftool: alias show and list commands tools: bpftool: remove new lines from errors Jesper Dangaard Brouer (14): xdp: base API for new XDP rx-queue info concept xdp/mlx5: setup xdp_rxq_info i40e: setup xdp_rxq_info ixgbe: setup xdp_rxq_info xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg mlx4: setup xdp_rxq_info bnxt_en:
[PATCH] ipv6: use ARRAY_SIZE for array sizing calculation on array seg6_action_table
From: Colin Ian King Use the ARRAY_SIZE macro on array seg6_action_table to determine size of the array. Improvement suggested by coccinelle. Signed-off-by: Colin Ian King --- net/ipv6/seg6_local.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index 825b8e01f947..ba3767ef5e93 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -501,7 +501,7 @@ static struct seg6_action_desc *__get_action_desc(int action) struct seg6_action_desc *desc; int i, count; - count = sizeof(seg6_action_table) / sizeof(struct seg6_action_desc); + count = ARRAY_SIZE(seg6_action_table); for (i = 0; i < count; i++) { desc = &seg6_action_table[i]; if (desc->action == action) -- 2.15.1
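The same idiom recurs in the next two patches, so it is worth showing standalone. The macro below is the classic definition (the kernel's version in linux/kernel.h adds a type-safety check on top), and the struct is a cut-down stand-in for seg6_action_desc:

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a fixed-size array. Unlike the sizeof-division it
 * replaces, it never names the element type, so it cannot silently go
 * stale if that type changes. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct action_desc {	/* simplified stand-in for seg6_action_desc */
	int action;
};

static struct action_desc table[] = {
	{ .action = 1 },
	{ .action = 2 },
	{ .action = 3 },
};
```

Note ARRAY_SIZE only works on true arrays; applied to a pointer parameter, the plain macro would quietly compute sizeof(pointer)/sizeof(element), which is why the kernel version adds a compile-time check.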
[PATCH] be2net: use ARRAY_SIZE for array sizing calculation on array cmd_priv_map
From: Colin Ian King Use the ARRAY_SIZE macro on array cmd_priv_map to determine size of the array. Improvement suggested by coccinelle. Signed-off-by: Colin Ian King --- drivers/net/ethernet/emulex/benet/be_cmds.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c index 02dd5246dfae..1a49297224ed 100644 --- a/drivers/net/ethernet/emulex/benet/be_cmds.c +++ b/drivers/net/ethernet/emulex/benet/be_cmds.c @@ -103,7 +103,7 @@ static struct be_cmd_priv_map cmd_priv_map[] = { static bool be_cmd_allowed(struct be_adapter *adapter, u8 opcode, u8 subsystem) { int i; - int num_entries = sizeof(cmd_priv_map)/sizeof(struct be_cmd_priv_map); + int num_entries = ARRAY_SIZE(cmd_priv_map); u32 cmd_privileges = adapter->cmd_privileges; for (i = 0; i < num_entries; i++) -- 2.15.1
[PATCH] ixgbe: use ARRAY_SIZE for array sizing calculation on array buf
From: Colin Ian King Use the ARRAY_SIZE macro on array buf to determine size of the array. Improvement suggested by coccinelle. Signed-off-by: Colin Ian King --- drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c index cb7da5f9c4da..28c23ef2ec28 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c @@ -949,7 +949,7 @@ static s32 ixgbe_checksum_ptr_x550(struct ixgbe_hw *hw, u16 ptr, u16 length, bufsz, i, start; u16 *local_buffer; - bufsz = sizeof(buf) / sizeof(buf[0]); + bufsz = ARRAY_SIZE(buf); /* Read a chunk at the pointer location */ if (!buffer) { -- 2.15.1
Re: [PATCH] atm/clip: Use seq_puts() in svc_addr()
On Sat, Jan 6, 2018 at 11:44 PM, SF Markus Elfring wrote: > From: Markus Elfring > Date: Sat, 6 Jan 2018 22:34:12 +0100 > > Two strings should be quickly put into a sequence by two function calls. > Thus use the function "seq_puts" instead of "seq_printf". > > This issue was detected by using the Coccinelle software. > > Signed-off-by: Markus Elfring > --- > net/atm/clip.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/atm/clip.c b/net/atm/clip.c > index d4f6029d5109..62a852165b19 100644 > --- a/net/atm/clip.c > +++ b/net/atm/clip.c > @@ -708,11 +708,11 @@ static void svc_addr(struct seq_file *seq, struct > sockaddr_atmsvc *addr) > static int e164[] = { 1, 8, 4, 6, 1, 0 }; > > if (*addr->sas_addr.pub) { > - seq_printf(seq, "%s", addr->sas_addr.pub); > + seq_puts(seq, addr->sas_addr.pub); Which opens a lot of security concerns. Never do this again. > if (*addr->sas_addr.prv) > seq_putc(seq, '+'); > } else if (!*addr->sas_addr.prv) { > - seq_printf(seq, "%s", "(none)"); > + seq_puts(seq, "(none)"); ...while this one is okay per se, better to keep above pattern (same style over the piece of code / function). > return; > } > if (*addr->sas_addr.prv) { > -- > 2.15.1 > P.S. I'm wondering what would be first, Markus starts looking into the actual code, or most (all) of the maintainers just ban him. -- With Best Regards, Andy Shevchenko
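For readers wondering about the hazard class behind this exchange, here is a userspace illustration (function name is mine): a string that may contain '%' must reach printf-style functions only as a data argument, never as the format itself. puts-style functions take the string purely as data, so '%' is harmless there - the dangerous pattern is the classic printf(untrusted) bug, where the string is interpreted as a format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Safe: the format is the literal "%s", so untrusted is pure data.
 * The unsafe variant would be snprintf(out, n, untrusted), where a
 * stray "%s" or "%n" in the input drives the format machinery. */
static void fmt_as_data(char *out, size_t n, const char *untrusted)
{
	snprintf(out, n, "%s", untrusted);
}
```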
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote: > We need to fix the security problem, but we need to do it *without* > these braindead arguments that performance is somehow secondary. OK OK. At least we should have security by default and let people trade it against performance if they want and understand the risk. People never know when they're secure enough (security doesn't measure) but they know fairly well when they're not performant enough to take action (most often changing the machine). Willy
Re: [PATCH iproute2] ip fou: Pass family attribute as u8
Hi Stephen, > How is this binary compatiable with older versions. I'm not sure what you mean. Kernel expects family attribute to be passed as u8 (and it always did). But current ip passes family attribute as u16. It works now only on little endian systems because the relevant byte happens to be on the same position. It is the current version of ip which is incompatible. Filip
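Filip's endianness argument can be demonstrated concretely (helper name is mine). When userspace stores the family as a u16 but the kernel reads the FOU_ATTR_AF payload as a u8, the kernel sees only the first payload byte: on little-endian that is the low-order byte of the u16 (the family value, so it "works"), on big-endian it is the high-order byte (zero, so it breaks).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What a u8-typed read of a netlink attribute sees: byte 0 of the payload */
static uint8_t attr_first_byte(const void *payload)
{
	uint8_t b;

	memcpy(&b, payload, 1);
	return b;
}
```

Passing the attribute as a genuine u8, as the patch does, makes the read independent of host byte order - which also answers the compatibility question: the kernel always expected a u8 here.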
Re: dvb usb issues since kernel 4.9
On Sat, Jan 6, 2018 at 11:54 AM, Mauro Carvalho Chehab wrote: > > Em Sat, 6 Jan 2018 16:04:16 +0100 > "Josef Griebichler" wrote: >> >> the causing commit has been identified. >> After reverting commit >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882 >> its working again. > > Just replying to me won't magically fix this. The ones that were involved on > this patch should also be c/c, plus USB people. Just added them. Actually, you seem to have added an odd subset of the people involved. For example, Ingo - who actually committed that patch - wasn't on the cc. I do think we need to simply revert that patch. It's very simple: it has been reported to lead to actual problems for people, and we don't fix one problem and then say "well, it fixed something else" when something breaks. When something breaks, we either unbreak it, or we revert the change that caused the breakage. It's really that simple. That's what "no regressions" means. We don't accept changes that cause regressions. This one did. Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, 7 Jan 2018, Linus Torvalds wrote: > We need to fix the security problem, but we need to do it *without* > these braindead arguments that performance is somehow secondary. I surely agree, but we have gone the way of PTI without the ability of exempting individual processes exactly for one reason: Lack of time It can be done on top of the PTI implementation and it won't take ages. For spectre_v1/2 we face the same problem simply because we got informed so much ahead of time and we were all twiddling thumbs, enjoying our christmas vacation and having a good time. The exploits are out in the wild and they are observed already, so we really have to make a decision whether we want to fix that in the most obvious ways even if it hurts performance right now and then take a break from all that hell and sit down and sort the performance mess or whether we want to discuss the right way to fix it for the next 3 month and leave all doors open until the last bit of performance is squeezed out. I totally can understand the anger of those who learned all of this a few days ago and are now grasping straws to avoid the obviously coming performance hit, but its not our fault that we have to make a decision which cannot make everyone happy tomorrow. If the general sentiment is that we have to fix the performance issue first, then please let me know and I take 3 weeks of vacation right now. Thanks, tglx
Re: [PATCH v2 net-next] net: tracepoint: exposing sk_faimily in tracepoint inet_sock_set_state
> On Jan 6, 2018, at 10:31 PM, Yafang Shaowrote: > > As of now, there're two sk_family are traced with sock:inet_sock_set_state, > which are AF_INET and AF_INET6. > So the sk_family are exposed as well. > Then we can conveniently use it to do the filter. > > Both sk_family and sk_protocol are showed in the printk message, so we need > not expose them as tracepoint arguments. > > Suggested-by: Brendan Gregg > Suggested-by: Song Liu > Signed-off-by: Yafang Shao > --- > include/trace/events/sock.h | 15 +-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h > index 3537c5f..3176a39 100644 > --- a/include/trace/events/sock.h > +++ b/include/trace/events/sock.h > @@ -11,7 +11,11 @@ > #include > #include > > -/* The protocol traced by sock_set_state */ > +#define family_names \ > + EM(AF_INET) \ > + EMe(AF_INET6) > + > +/* The protocol traced by inet_sock_set_state */ > #define inet_protocol_names \ > EM(IPPROTO_TCP) \ > EM(IPPROTO_DCCP)\ > @@ -37,6 +41,7 @@ > #define EM(a) TRACE_DEFINE_ENUM(a); > #define EMe(a) TRACE_DEFINE_ENUM(a); > > +family_names > inet_protocol_names > tcp_state_names > > @@ -45,6 +50,9 @@ > #define EM(a) { a, #a }, > #define EMe(a) { a, #a } > > +#define show_family_name(val)\ > + __print_symbolic(val, family_names) > + > #define show_inet_protocol_name(val)\ > __print_symbolic(val, inet_protocol_names) > > @@ -118,6 +126,7 @@ > __field(int, newstate) > __field(__u16, sport) > __field(__u16, dport) > + __field(__u16, family) > __field(__u8, protocol) > __array(__u8, saddr, 4) > __array(__u8, daddr, 4) > @@ -134,6 +143,7 @@ > __entry->oldstate = oldstate; > __entry->newstate = newstate; > > + __entry->family = sk->sk_family; > __entry->protocol = sk->sk_protocol; > __entry->sport = ntohs(inet->inet_sport); > __entry->dport = ntohs(inet->inet_dport); > @@ -160,7 +170,8 @@ > } > ), > > - TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 > saddrv6=%pI6c 
daddrv6=%pI6c oldstate=%s newstate=%s", > + TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 > daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s", > + show_family_name(__entry->family), > show_inet_protocol_name(__entry->protocol), > __entry->sport, __entry->dport, > __entry->saddr, __entry->daddr, > -- > 1.8.3.1 > Reviewed-by: Song Liu
Re: [PATCH iproute2] ip fou: Pass family attribute as u8
On Sun, 7 Jan 2018 15:28:13 +0100 Filip Moc wrote: > This fixes fou on big-endian systems. > > Signed-off-by: Filip Moc > --- > ip/ipfou.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/ip/ipfou.c b/ip/ipfou.c > index febc2c8c..1f392ade 100644 > --- a/ip/ipfou.c > +++ b/ip/ipfou.c > @@ -52,7 +52,7 @@ static int fou_parse_opt(int argc, char **argv, struct > nlmsghdr *n, > __u8 ipproto, type; > bool gue_set = false; > int ipproto_set = 0; > - unsigned short family = AF_INET; > + __u8 family = AF_INET; > > while (argc > 0) { > if (!matches(*argv, "port")) { > @@ -103,7 +103,7 @@ static int fou_parse_opt(int argc, char **argv, struct > nlmsghdr *n, > > addattr16(n, 1024, FOU_ATTR_PORT, port); > addattr8(n, 1024, FOU_ATTR_TYPE, type); > - addattr16(n, 1024, FOU_ATTR_AF, family); > + addattr8(n, 1024, FOU_ATTR_AF, family); > > if (ipproto_set) > addattr8(n, 1024, FOU_ATTR_IPPROTO, ipproto); How is this binary compatiable with older versions.
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau wrote: > > Linus, no need to explain that to me, I'm precisely trying to see how > to disable PTI for a specific process because I face up to 45% loss in > certain circumstances, making it a no-go. But while a few of us have > very specific workloads emphasizing this impact, others have very > different ones and will not notice. For example my laptop did boot > pretty fine and I didn't notice anything until I fire a network > benchmark. Sure, most people have hardware where the bottleneck is entirely elsewhere (slow network, rotating disk, whatever). But this whole "normal people won't notice" is dangerous thinking. They may well notice very much, we simply don't know what they are doing. Quite honesty, it's equally correct to say "normal people won't be affected by the security issue in the first place". That laptop that you didn't have any issues with? Likely it never had an exploit running on it either! So the whole "normal people" argument is pure and utter garbage. It's wrong. It's pure shit when it comes to performance, but it's also pure shit when it comes to the security issue. Don't use it. We need to fix the security problem, but we need to do it *without* these braindead arguments that performance is somehow secondary. Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 7, 2018 at 11:47 AM, Linus Torvalds wrote: > On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau wrote: >> >> To be fair there's overreaction on both sides. The vast majority of >> users need to get a 100% safe system and will never notice any >> difference. > > There is no such thing as a "100% safe system". Never will be - unless > you make sure you have no users. > > Also, people definitely *are* noticing the performance issues with the > current set of patches, and they are causing real problems. Go search > for reports of Amazon AWS slowdowns. > > So this whole "security is so important that performance doesn't > matter" mindset is pure and utter garbage. > > And the whole "normal people won't even notice" is pure garbage too. > Don't spread that bullshit when you see actual normal people > complaining. > > Performance matters. A *LOT*. I'm thinking we should provide the option to at least build the hot-path nospec_array_ptr() usages without an lfence. CONFIG_SPECTRE1_PARANOIA_SAFE CONFIG_SPECTRE1_PARANOIA_PERF ...if only for easing performance testing and let the distribution set its policy. Where hot-path usages can do: nospec_relax(nospec_array_ptr()) ...to optionally ellide the lfence.
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote: > And the whole "normal people won't even notice" is pure garbage too. > Don't spread that bullshit when you see actual normal people > complaining. > > Performance matters. A *LOT*. Linus, no need to explain that to me, I'm precisely trying to see how to disable PTI for a specific process because I face up to 45% loss in certain circumstances, making it a no-go. But while a few of us have very specific workloads emphasizing this impact, others have very different ones and will not notice. For example my laptop did boot pretty fine and I didn't notice anything until I fire a network benchmark. Willy
[PATCH] hyperv/netvsc: Delete two error messages for a failed memory allocation in netvsc_init_buf()
From: Markus Elfring Date: Sun, 7 Jan 2018 21:03:26 +0100 Omit extra messages for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring --- drivers/net/hyperv/netvsc.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 17e529af79dc..c1ec02f801f6 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -275,9 +275,6 @@ static int netvsc_init_buf(struct hv_device *device, net_device->recv_buf = vzalloc(buf_size); if (!net_device->recv_buf) { - netdev_err(ndev, - "unable to allocate receive buffer of size %u\n", - buf_size); ret = -ENOMEM; goto cleanup; } @@ -357,8 +354,6 @@ static int netvsc_init_buf(struct hv_device *device, net_device->send_buf = vzalloc(buf_size); if (!net_device->send_buf) { - netdev_err(ndev, "unable to allocate send buffer of size %u\n", - buf_size); ret = -ENOMEM; goto cleanup; } -- 2.15.1
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau wrote: > > To be fair there's overreaction on both sides. The vast majority of > users need to get a 100% safe system and will never notice any > difference. There is no such thing as a "100% safe system". Never will be - unless you make sure you have no users. Also, people definitely *are* noticing the performance issues with the current set of patches, and they are causing real problems. Go search for reports of Amazon AWS slowdowns. So this whole "security is so important that performance doesn't matter" mindset is pure and utter garbage. And the whole "normal people won't even notice" is pure garbage too. Don't spread that bullshit when you see actual normal people complaining. Performance matters. A *LOT*. Linus
Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution
On Sun, Jan 7, 2018 at 1:09 AM, Greg KH wrote: [..] > Sorry for the confusion, no, I don't mean the "taint tracking", I mean > the generic pattern of "speculative out of bounds access" that we are > fixing here. > > Yes, as you mentioned before, there are tons of false-positives in the > tree, as to find the real problems you have to show that userspace > controls the access index. But if we have a generic pattern that can > rewrite that type of logic into one where it does not matter at all > (i.e. like the ebpf proposed changes), then it would not be an issue if > they are false or not, we just rewrite them all to be safe. > > We need to find some way not only to fix these issues now (like you are > doing with this series), but to prevent them from ever coming back into > the codebase again. It's that second part that we need to keep in the > back of our minds here, while doing the first portion of this work. I understand the goal, but I'm not sure any of our current annotation mechanisms are suitable. We have: __attribute__((noderef, address_space(x))) ...for the '__user' annotation and other pointers that must not be de-referenced without a specific accessor. We also have: __attribute__((bitwise)) ...for values that should not be consumed directly without a specific conversion like endian swapping. The problem is that we need to see if a value derived from a userspace controlled input is used to trigger a chain of dependent reads. As far as I can see the annotation would need to be guided by taint analysis to be useful, at which point we can just "annotate" the problem spot with nospec_array_ptr(). Otherwise it seems the scope of a "__nospec_array_index" annotation would have a low signal to noise ratio. Stopping speculation past a uaccess_begin() boundary appears to handle a wide swath of potential problems, and the rest likely needs taint analysis, at least for now. All that to say, yes, we need better tooling and infrastructure going forward.
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
> everyone. I'm not saying this always happens, but it is reasonable to > let the iterative pushback see if we can get to better code in this > case rather than trying to cut it off with the "because *security*" > argument. > I'm not arguing otherwise - I'd just prefer most users' machines are secure while we have the discussion and while we see what other architectural tricks people can come up with. Alan
[PATCH net-next v3 RESEND 10/10] net: qualcomm: rmnet: Add support for GSO
Real devices may support scatter-gather (SG), so enable SG on rmnet devices to use GSO. GSO reduces CPU cycles by 20% for a rate of 146 Mbps for a single stream TCP connection. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c index f7f57ce..570a227 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c @@ -190,6 +190,7 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev, rmnet_dev->hw_features = NETIF_F_RXCSUM; rmnet_dev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM; + rmnet_dev->hw_features |= NETIF_F_SG; rc = register_netdevice(rmnet_dev); if (!rc) { -- 1.9.1
[PATCH net-next v3 RESEND 08/10] net: qualcomm: rmnet: Handle command packets with checksum trailer
When using the MAPv4 packet format in conjunction with MAP commands, a dummy DL checksum trailer will be appended to the packet. Before this packet is sent out as an ACK, the DL checksum trailer needs to be removed. Signed-off-by: Subash Abhinov Kasiviswanathan--- drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c index 51e6049..6bc328f 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c @@ -58,11 +58,24 @@ static u8 rmnet_map_do_flow_control(struct sk_buff *skb, } static void rmnet_map_send_ack(struct sk_buff *skb, - unsigned char type) + unsigned char type, + struct rmnet_port *port) { struct rmnet_map_control_command *cmd; int xmit_status; + if (port->data_format & RMNET_INGRESS_FORMAT_MAP_CKSUMV4) { + if (skb->len < sizeof(struct rmnet_map_header) + + RMNET_MAP_GET_LENGTH(skb) + + sizeof(struct rmnet_map_dl_csum_trailer)) { + kfree_skb(skb); + return; + } + + skb_trim(skb, skb->len - +sizeof(struct rmnet_map_dl_csum_trailer)); + } + skb->protocol = htons(ETH_P_MAP); cmd = RMNET_MAP_GET_CMD_START(skb); @@ -100,5 +113,5 @@ void rmnet_map_command(struct sk_buff *skb, struct rmnet_port *port) break; } if (rc == RMNET_MAP_COMMAND_ACK) - rmnet_map_send_ack(skb, rc); + rmnet_map_send_ack(skb, rc, port); } -- 1.9.1
[PATCH net-next v3 RESEND 05/10] net: qualcomm: rmnet: Set pacing shift
The real device over which the rmnet devices are installed also aggregates multiple IP packets and sends them as a single large aggregate frame to the hardware. This causes degraded throughput for TCP TX due to bufferbloat. To overcome this problem, a pacing shift value of 8 is set using the sk_pacing_shift_update() helper. This value was determined based on experiments with a single stream TCP TX using iperf for a duration of 30s.

Pacing shift | Observed data rate (Mbps)
10 | 9
9 | 140
8 | 146 (Max link rate)

Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 8e1f43a..8f8c4f2 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "rmnet_private.h" #include "rmnet_config.h" #include "rmnet_vnd.h" @@ -204,6 +205,8 @@ void rmnet_egress_handler(struct sk_buff *skb) struct rmnet_priv *priv; u8 mux_id; + sk_pacing_shift_update(skb->sk, 8); + orig_dev = skb->dev; priv = netdev_priv(orig_dev); skb->dev = priv->real_dev; -- 1.9.1
[PATCH net-next v3 RESEND 09/10] net: qualcomm: rmnet: Add support for TX checksum offload
TX checksum offload applies to TCP / UDP packets which are not fragmented using the MAPv4 checksum trailer. The following needs to be done to have checksum computed in hardware - 1. Set the checksum start offset and inset offset. 2. Set the csum_enabled bit 3. Compute and set 1's complement of partial checksum field in transport header. Signed-off-by: Subash Abhinov Kasiviswanathan--- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 8 ++ drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h| 2 + .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 120 + drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 1 + 4 files changed, 131 insertions(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 3409458..601edec 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -141,11 +141,19 @@ static int rmnet_map_egress_handler(struct sk_buff *skb, additional_header_len = 0; required_headroom = sizeof(struct rmnet_map_header); + if (port->data_format & RMNET_EGRESS_FORMAT_MAP_CKSUMV4) { + additional_header_len = sizeof(struct rmnet_map_ul_csum_header); + required_headroom += additional_header_len; + } + if (skb_headroom(skb) < required_headroom) { if (pskb_expand_head(skb, required_headroom, 0, GFP_KERNEL)) goto fail; } + if (port->data_format & RMNET_EGRESS_FORMAT_MAP_CKSUMV4) + rmnet_map_checksum_uplink_packet(skb, orig_dev); + map_header = rmnet_map_add_map_header(skb, additional_header_len, 0); if (!map_header) goto fail; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h index ca9f473..6ce31e2 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h @@ -89,5 +89,7 @@ struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb, int hdrlen, int pad); void rmnet_map_command(struct sk_buff *skb, struct 
rmnet_port *port); int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len); +void rmnet_map_checksum_uplink_packet(struct sk_buff *skb, + struct net_device *orig_dev); #endif /* _RMNET_MAP_H_ */ diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c index 881c1dc..c74a6c5 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c @@ -171,6 +171,86 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char protocol, } #endif +static void rmnet_map_complement_ipv4_txporthdr_csum_field(void *iphdr) +{ + struct iphdr *ip4h = (struct iphdr *)iphdr; + void *txphdr; + u16 *csum; + + txphdr = iphdr + ip4h->ihl * 4; + + if (ip4h->protocol == IPPROTO_TCP || ip4h->protocol == IPPROTO_UDP) { + csum = (u16 *)rmnet_map_get_csum_field(ip4h->protocol, txphdr); + *csum = ~(*csum); + } +} + +static void +rmnet_map_ipv4_ul_csum_header(void *iphdr, + struct rmnet_map_ul_csum_header *ul_header, + struct sk_buff *skb) +{ + struct iphdr *ip4h = (struct iphdr *)iphdr; + __be16 *hdr = (__be16 *)ul_header, offset; + + offset = htons((__force u16)(skb_transport_header(skb) - +(unsigned char *)iphdr)); + ul_header->csum_start_offset = offset; + ul_header->csum_insert_offset = skb->csum_offset; + ul_header->csum_enabled = 1; + if (ip4h->protocol == IPPROTO_UDP) + ul_header->udp_ip4_ind = 1; + else + ul_header->udp_ip4_ind = 0; + + /* Changing remaining fields to network order */ + hdr++; + *hdr = htons((__force u16)*hdr); + + skb->ip_summed = CHECKSUM_NONE; + + rmnet_map_complement_ipv4_txporthdr_csum_field(iphdr); +} + +#if IS_ENABLED(CONFIG_IPV6) +static void rmnet_map_complement_ipv6_txporthdr_csum_field(void *ip6hdr) +{ + struct ipv6hdr *ip6h = (struct ipv6hdr *)ip6hdr; + void *txphdr; + u16 *csum; + + txphdr = ip6hdr + sizeof(struct ipv6hdr); + + if (ip6h->nexthdr == IPPROTO_TCP || ip6h->nexthdr == IPPROTO_UDP) { + csum = (u16 
*)rmnet_map_get_csum_field(ip6h->nexthdr, txphdr); + *csum = ~(*csum); + } +} + +static void +rmnet_map_ipv6_ul_csum_header(void *ip6hdr, + struct rmnet_map_ul_csum_header *ul_header, + struct sk_buff *skb) +{ + __be16 *hdr = (__be16 *)ul_header, offset; + + offset = htons((__force
[PATCH net-next v3 RESEND 06/10] net: qualcomm: rmnet: Define the MAPv4 packet formats
The MAPv4 packet format adds support for RX / TX checksum offload. For a bi-directional UDP stream at a rate of 570 / 146 Mbps, roughly 10% CPU cycles are saved. In the receive path, there is a checksum trailer appended to the end of the MAP packet. The valid field indicates if hardware has computed the checksum. csum_start_offset indicates the offset from the start of the IP header from which hardware has computed the checksum. csum_length is the number of bytes over which the checksum was computed and the resulting value is csum_value. In the transmit path, a header is inserted between the end of the MAP header and the start of the IP packet. csum_start_offset is the offset in bytes from which hardware will compute the checksum if the csum_enabled bit is set. udp_ip4_ind indicates if the checksum value of 0 is valid or not. csum_insert_offset is the offset from the csum_start_offset where hardware will insert the computed checksum. The use of this additional packet format for checksum offload is explained in subsequent patches.
Signed-off-by: Subash Abhinov Kasiviswanathan--- drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 16 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h | 2 ++ 2 files changed, 18 insertions(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h index ef0eff2..50c50cd 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h @@ -47,6 +47,22 @@ struct rmnet_map_header { u16 pkt_len; } __aligned(1); +struct rmnet_map_dl_csum_trailer { + u8 reserved1; + u8 valid:1; + u8 reserved2:7; + u16 csum_start_offset; + u16 csum_length; + __be16 csum_value; +} __aligned(1); + +struct rmnet_map_ul_csum_header { + __be16 csum_start_offset; + u16 csum_insert_offset:14; + u16 udp_ip4_ind:1; + u16 csum_enabled:1; +} __aligned(1); + #define RMNET_MAP_GET_MUX_ID(Y) (((struct rmnet_map_header *) \ (Y)->data)->mux_id) #define RMNET_MAP_GET_CD_BIT(Y) (((struct rmnet_map_header *) \ diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h index d214280..de0143e 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h @@ -21,6 +21,8 @@ /* Constants */ #define RMNET_INGRESS_FORMAT_DEAGGREGATION BIT(0) #define RMNET_INGRESS_FORMAT_MAP_COMMANDS BIT(1) +#define RMNET_INGRESS_FORMAT_MAP_CKSUMV4BIT(2) +#define RMNET_EGRESS_FORMAT_MAP_CKSUMV4 BIT(3) /* Replace skb->dev to a virtual rmnet device and pass up the stack */ #define RMNET_EPMODE_VND (1) -- 1.9.1
[PATCH net-next v3 RESEND 02/10] net: qualcomm: rmnet: Remove invalid condition while stamping mux id
rmnet devices cannot have a mux id of 255. This is validated when assigning the mux id to the rmnet devices. As a result, checking for mux id 255 does not apply in egress path. Signed-off-by: Subash Abhinov Kasiviswanathan--- drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 0553932..b2d317e3 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -143,10 +143,7 @@ static int rmnet_map_egress_handler(struct sk_buff *skb, if (!map_header) goto fail; - if (mux_id == 0xff) - map_header->mux_id = 0; - else - map_header->mux_id = mux_id; + map_header->mux_id = mux_id; skb->protocol = htons(ETH_P_MAP); -- 1.9.1
[PATCH net-next v3 RESEND 04/10] net: qualcomm: rmnet: Rename ingress data format to data format
This is done so that we can use this field for both ingress and egress flags. Signed-off-by: Subash Abhinov Kasiviswanathan--- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 10 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 2 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 5 ++--- 3 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index cedacdd..7e7704d 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -143,7 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { - int ingress_format = RMNET_INGRESS_FORMAT_DEAGGREGATION; + u32 data_format = RMNET_INGRESS_FORMAT_DEAGGREGATION; struct net_device *real_dev; int mode = RMNET_EPMODE_VND; struct rmnet_endpoint *ep; @@ -185,11 +185,11 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, struct ifla_vlan_flags *flags; flags = nla_data(data[IFLA_VLAN_FLAGS]); - ingress_format = flags->flags & flags->mask; + data_format = flags->flags & flags->mask; } - netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format); - port->ingress_data_format = ingress_format; + netdev_dbg(dev, "data format [0x%08X]\n", data_format); + port->data_format = data_format; return 0; @@ -353,7 +353,7 @@ static int rmnet_changelink(struct net_device *dev, struct nlattr *tb[], struct ifla_vlan_flags *flags; flags = nla_data(data[IFLA_VLAN_FLAGS]); - port->ingress_data_format = flags->flags & flags->mask; + port->data_format = flags->flags & flags->mask; } return 0; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index 2ea9fe3..00e4634 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -32,7 +32,7 @@ struct rmnet_endpoint { */ struct rmnet_port { struct net_device *dev; - u32 ingress_data_format; + u32 data_format; u8 nr_rmnet_devs; u8 rmnet_mode; struct hlist_head muxed_ep[RMNET_MAX_LOGICAL_EP]; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index b2d317e3..8e1f43a 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -69,8 +69,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) u16 len; if (RMNET_MAP_GET_CD_BIT(skb)) { - if (port->ingress_data_format - & RMNET_INGRESS_FORMAT_MAP_COMMANDS) + if (port->data_format & RMNET_INGRESS_FORMAT_MAP_COMMANDS) return rmnet_map_command(skb, port); goto free_skb; @@ -114,7 +113,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) skb_push(skb, ETH_HLEN); } - if (port->ingress_data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) { + if (port->data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) { while ((skbn = rmnet_map_deaggregate(skb)) != NULL) __rmnet_map_ingress_handler(skbn, port); -- 1.9.1
[PATCH net-next v3 RESEND 07/10] net: qualcomm: rmnet: Add support for RX checksum offload
When using the MAPv4 packet format, receive checksum offload can be enabled in hardware. The checksum computation over the pseudo header is not offloaded but the rest of the checksum computation over the payload is offloaded. This applies only for TCP / UDP packets which are not fragmented. rmnet validates the TCP/UDP checksum for the packet using the checksum from the checksum trailer added to the packet by hardware. The validation performed is as follows:
1. Perform 1's complement over the checksum value from the trailer.
2. Compute the 1's complement checksum over the IPv4 / IPv6 header and subtract it from the value from step 1.
3. Compute the 1's complement checksum over the IPv4 / IPv6 pseudo header and add it to the value from step 2.
4. Subtract the checksum value from the TCP / UDP header from the value from step 3.
5. Compare the value from step 4 to the checksum value from the TCP / UDP header.
6. If the comparison in step 5 succeeds, CHECKSUM_UNNECESSARY is set and the packet is passed on to the network stack. If there is a failure, then the packet is passed on as such without modifying the ip_summed field.
The checksum field is also checked for UDP checksum 0 as per RFC 768 and for an unexpected TCP checksum of 0. If checksum offload is disabled when using the MAPv4 packet format in the receive path, the packet is queued to the network stack as is, without the validations above.
Signed-off-by: Subash Abhinov Kasiviswanathan--- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 15 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h| 4 +- .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 186 - drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 2 + 4 files changed, 201 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 8f8c4f2..3409458 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -66,8 +66,8 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) struct rmnet_port *port) { struct rmnet_endpoint *ep; + u16 len, pad; u8 mux_id; - u16 len; if (RMNET_MAP_GET_CD_BIT(skb)) { if (port->data_format & RMNET_INGRESS_FORMAT_MAP_COMMANDS) @@ -77,7 +77,8 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) } mux_id = RMNET_MAP_GET_MUX_ID(skb); - len = RMNET_MAP_GET_LENGTH(skb) - RMNET_MAP_GET_PAD(skb); + pad = RMNET_MAP_GET_PAD(skb); + len = RMNET_MAP_GET_LENGTH(skb) - pad; if (mux_id >= RMNET_MAX_LOGICAL_EP) goto free_skb; @@ -90,8 +91,14 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) /* Subtract MAP header */ skb_pull(skb, sizeof(struct rmnet_map_header)); - skb_trim(skb, len); rmnet_set_skb_proto(skb); + + if (port->data_format & RMNET_INGRESS_FORMAT_MAP_CKSUMV4) { + if (!rmnet_map_checksum_downlink_packet(skb, len + pad)) + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + + skb_trim(skb, len); rmnet_deliver_skb(skb); return; @@ -115,7 +122,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) } if (port->data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) { - while ((skbn = rmnet_map_deaggregate(skb)) != NULL) + while ((skbn = rmnet_map_deaggregate(skb, port)) != NULL) __rmnet_map_ingress_handler(skbn, port); consume_skb(skb); diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h index 50c50cd..ca9f473 
100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h @@ -83,9 +83,11 @@ struct rmnet_map_ul_csum_header { #define RMNET_MAP_NO_PAD_BYTES0 #define RMNET_MAP_ADD_PAD_BYTES 1 -struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb); +struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb, + struct rmnet_port *port); struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb, int hdrlen, int pad); void rmnet_map_command(struct sk_buff *skb, struct rmnet_port *port); +int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len); #endif /* _RMNET_MAP_H_ */ diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c index 978ce26..881c1dc 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c @@ -14,6 +14,9 @@ */ #include +#include +#include +#include #include "rmnet_config.h" #include "rmnet_map.h" #include "rmnet_private.h" @@ -21,6
[PATCH net-next v3 RESEND 03/10] net: qualcomm: rmnet: Remove unused function declaration
rmnet_map_demultiplex() is only declared but not defined anywhere, so remove it. Signed-off-by: Subash Abhinov Kasiviswanathan--- drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h index 4df359d..ef0eff2 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h @@ -67,7 +67,6 @@ struct rmnet_map_header { #define RMNET_MAP_NO_PAD_BYTES0 #define RMNET_MAP_ADD_PAD_BYTES 1 -u8 rmnet_map_demultiplex(struct sk_buff *skb); struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb); struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb, int hdrlen, int pad); -- 1.9.1
[PATCH net-next v3 RESEND 00/10] net: qualcomm: rmnet: Enable csum offloads
This series introduces the MAPv4 packet format for checksum offload plus some other minor changes. Patches 1-3 are cleanups. Patch 4 renames the ingress format to data format so that all data formats can be configured using this going forward. Patch 5 uses the pacing helper to improve TCP transmit performance. Patches 6-9 define the MAPv4 for checksum offload for RX and TX. A new header and trailer format are used as part of MAPv4. For RX checksum offload, only the 1's complement of the IP payload portion is computed by hardware. The metadata from the RX header is used to verify the checksum field in the packet. Note that the IP packet and its fields themselves are not modified by hardware. This gives metadata to help with the RX checksum. For TX, the required metadata is filled up so hardware can compute the checksum. Patch 10 enables GSO on rmnet devices. v1->v2: Fix sparse errors reported by kbuild test robot v2->v3: Update the commit message for Patch 5 based on Eric's comments Subash Abhinov Kasiviswanathan (10): net: qualcomm: rmnet: Remove redundant check when stamping map header net: qualcomm: rmnet: Remove invalid condition while stamping mux id net: qualcomm: rmnet: Remove unused function declaration net: qualcomm: rmnet: Rename ingress data format to data format net: qualcomm: rmnet: Set pacing shift net: qualcomm: rmnet: Define the MAPv4 packet formats net: qualcomm: rmnet: Add support for RX checksum offload net: qualcomm: rmnet: Handle command packets with checksum trailer net: qualcomm: rmnet: Add support for TX checksum offload net: qualcomm: rmnet: Add support for GSO drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 10 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 2 +- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 36 ++- drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h| 23 +- .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 17 +- .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 309 - .../net/ethernet/qualcomm/rmnet/rmnet_private.h| 2 + 
drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 4 + 8 files changed, 378 insertions(+), 25 deletions(-) -- 1.9.1
[PATCH net-next v3 RESEND 01/10] net: qualcomm: rmnet: Remove redundant check when stamping map header
We already check the headroom once in rmnet_map_egress_handler(), so this is not needed. Signed-off-by: Subash Abhinov Kasiviswanathan--- drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c index 86b8c75..978ce26 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c @@ -32,9 +32,6 @@ struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb, u32 padding, map_datalen; u8 *padbytes; - if (skb_headroom(skb) < sizeof(struct rmnet_map_header)) - return NULL; - map_datalen = skb->len - hdrlen; map_header = (struct rmnet_map_header *) skb_push(skb, sizeof(struct rmnet_map_header)); -- 1.9.1
Re: [iproute2 1/1] tc: Fix filter protocol output
Sun, Jan 07, 2018 at 03:43:28PM CET, j...@mojatatu.com wrote: >From: Jamal Hadi Salim> >Fixes: 249284ff5a44 ("tc: jsonify filter core") >Signed-off-by: Jamal Hadi Salim >--- > tc/tc_filter.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/tc/tc_filter.c b/tc/tc_filter.c >index 545cc3a..546311a 100644 >--- a/tc/tc_filter.c >+++ b/tc/tc_filter.c >@@ -276,7 +276,7 @@ int print_filter(const struct sockaddr_nl *who, struct >nlmsghdr *n, void *arg) > if (!filter_protocol || filter_protocol != f_proto) { > if (f_proto) { > SPRINT_BUF(b1); >- print_string(PRINT_JSON, "protocol", >+ print_string(PRINT_ANY, "protocol", ha! Thanks! Acked-by: Jiri Pirko
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, 7 Jan 2018, James Bottomley wrote: > On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote: > > From: Willy Tarreau> > Date: Sat, 6 Jan 2018 21:42:29 +0100 > > > > > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote: > > >> Normally people who propose security fixes don't have to argue > > about the > > >> fact they added 30 clocks to avoid your box being 0wned. > > > > > > In fact it depends, because if a fix makes the system unusable for > > its > > > initial purpose, this fix will simply not be deployed at all, which > > is > > > the worst that can happen. > > > > +1 > > > > I completely agree with Willy and Alexei. > > > > And the scale isn't even accurate, we're talking about at least > > hundreds upon hundreds of clocks, not 30, if we add an operation > > whose side effect is to wait for all pending loads to complete. So > > yeah this is going to be heavily scrutinized. > > Plus this is the standard kernel code review MO: we've never blindly > accepted code just because *security* (otherwise we'd have grsec in by > now). We use the pushback to get better and more performant code. > What often happens is it turns out that the "either security or > performance" position was a false dichotomy and there is a way of > fixing stuff that's acceptable (although not usually perfect) for > everyone. I'm not saying this always happens, but it is reasonable to > let the iterative pushback see if we can get to better code in this > case rather than trying to cut it off with the "because *security*" > argument. In principle I agree, though there are a few things to consider: 1) We do not have the time to discuss that for the next 6 months 2) Alexei's analysis is purely based on the public information of the google zero folks. If it were complete and the only attack vector, all fine. If not, and I doubt it is, we're going to regret this decision faster than we made it, and this is not the kind of playing field where we can afford that. 
The whole 'we know better and performance is key' attitude is what led to this disaster in the first place. We should finally start to learn. Can we please stop that and live with the extra cycles for a few months, up to the point where we get more informed answers to all these questions? Thanks, tglx
Re: [PATCH net] ipv6: remove null_entry before adding default route
On Sat, Jan 6, 2018 at 10:16 PM, Martin KaFai Lauwrote: > On Sat, Jan 06, 2018 at 05:41:28PM -0800, Wei Wang wrote: >> On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau wrote: >> > On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote: >> >> From: Wei Wang >> >> >> >> In the current code, when creating a new fib6 table, tb6_root.leaf gets >> >> initialized to net->ipv6.ip6_null_entry. >> >> If a default route is being added with rt->rt6i_metric = 0x, >> >> fib6_add() will add this route after net->ipv6.ip6_null_entry. As >> >> null_entry is shared, it could cause problem. >> >> >> >> In order to fix it, set fn->leaf to NULL before calling >> >> fib6_add_rt2node() when trying to add the first default route. >> >> And reset fn->leaf to null_entry when adding fails or when deleting the >> >> last default route. >> >> >> >> syzkaller reported the following issue which is fixed by this commit: >> >> = >> >> WARNING: suspicious RCU usage >> >> 4.15.0-rc5+ #171 Not tainted >> >> - >> >> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage! 
>> >> >> >> other info that might help us debug this: >> >> >> >> rcu_scheduler_active = 2, debug_locks = 1 >> >> 4 locks held by swapper/0/0: >> >> #0: ((>ipv6.ip6_fib_timer)){+.-.}, at: [ ] >> >> lockdep_copy_map include/linux/lockdep.h:178 [inline] >> >> #0: ((>ipv6.ip6_fib_timer)){+.-.}, at: [ ] >> >> call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310 >> >> #1: (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] >> >> spin_lock_bh include/linux/spinlock.h:315 [inline] >> >> #1: (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] >> >> fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007 >> >> #2: (rcu_read_lock){}, at: [<91db762d>] >> >> __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560 >> >> #3: (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] >> >> spin_lock_bh include/linux/spinlock.h:315 [inline] >> >> #3: (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] >> >> __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948 >> >> >> >> stack backtrace: >> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171 >> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS >> >> Google 01/01/2011 >> >> Call Trace: >> >> >> >> __dump_stack lib/dump_stack.c:17 [inline] >> >> dump_stack+0x194/0x257 lib/dump_stack.c:53 >> >> lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585 >> >> fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701 >> >> fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892 >> >> fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815 >> >> fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863 >> >> fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933 >> >> __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949 >> >> fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline] >> >> fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016 >> >> fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033 >> >> call_timer_fn+0x228/0x820 kernel/time/timer.c:1320 >> >> expire_timers kernel/time/timer.c:1357 [inline] >> >> __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660 >> >> 
run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686 >> >> __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 >> >> invoke_softirq kernel/softirq.c:365 [inline] >> >> irq_exit+0x1cc/0x200 kernel/softirq.c:405 >> >> exiting_irq arch/x86/include/asm/apic.h:540 [inline] >> >> smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 >> >> apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904 >> >> >> >> >> >> Reported-by: syzbot >> >> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in >> >> fib6_table") >> >> Signed-off-by: Wei Wang >> >> --- >> >> net/ipv6/ip6_fib.c | 45 +++-- >> >> 1 file changed, 35 insertions(+), 10 deletions(-) >> >> >> >> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c >> >> index d11a5578e4f8..37cb4ad1ea29 100644 >> >> --- a/net/ipv6/ip6_fib.c >> >> +++ b/net/ipv6/ip6_fib.c >> >> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net, >> >> if (!(fn->fn_flags & RTN_RTINFO)) { >> >> RCU_INIT_POINTER(fn->leaf, NULL); >> >> rt6_release(leaf); >> >> + /* remove null_entry in the root node */ >> >> + } else if (fn->fn_flags & RTN_TL_ROOT && >> >> +rcu_access_pointer(fn->leaf) == >> >> +net->ipv6.ip6_null_entry) { >> >> + RCU_INIT_POINTER(fn->leaf, NULL); >> > It seems the reader side could see tb6_root.leaf == NULL after >> > this change and I think it should be fine? If it is, instead >> > of
Re: [PATCH net-next 00/18] ipv6: Align nexthop behaviour with IPv4
On 1/7/18 3:45 AM, Ido Schimmel wrote: > This set tries to eliminate some differences between IPv4's and IPv6's > treatment of nexthops. These differences are most likely a side effect > of IPv6's data structures (specifically 'rt6_info') that incorporate > both the route and the nexthop and the late addition of ECMP support in > commit 51ebd3181572 ("ipv6: add support of equal cost multipath > (ECMP)"). > > IPv4 and IPv6 do not react the same to certain netdev events. For > example, upon carrier change affected IPv4 nexthops are marked using the > RTNH_F_LINKDOWN flag and the nexthop group is rebalanced accordingly. > IPv6 on the other hand, does nothing which forces us to perform a > carrier check during route lookup and dump. This makes it difficult to > introduce features such as non-equal-cost multipath that are built on > top of this set [1]. > > In addition, when a netdev is put administratively down IPv4 nexthops > are marked using the RTNH_F_DEAD flag, whereas IPv6 simply flushes all > the routes using these nexthops. To be consistent with IPv4, multipath > routes should only be flushed when all nexthops in the group are > considered dead. > > The first 12 patches introduce non-functional changes that store the > RTNH_F_DEAD and RTNH_F_LINKDOWN flags in IPv6 routes based on netdev > events, in a similar fashion to IPv4. This allows us to remove the > carrier check performed during route lookup and dump. > > The next three patches make sure we only flush a multipath route when > all of its nexthops are dead. > > Last three patches add test cases for IPv4/IPv6 FIB. These verify that > both address families react similarly to netdev events. > > Finally, this series also serves as a good first step towards David > Ahern's goal of treating nexthops as standalone objects [2], as it makes > the code more in line with IPv4 where the nexthop and the nexthop group > are separate objects from the route itself. > > 1. 
https://github.com/idosch/linux/tree/ipv6-nexthops > 2. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf > Thanks for working on this - and creating the test cases. One of many follow on changes that would be beneficial is to remove the idev dereference in the hot path to check the ignore_routes_with_linkdown sysctl.
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote: > From: Willy Tarreau > Date: Sat, 6 Jan 2018 21:42:29 +0100 > > > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote: > >> Normally people who propose security fixes don't have to argue > about the > >> fact they added 30 clocks to avoid your box being 0wned. > > > > In fact it depends, because if a fix makes the system unusable for > its > > initial purpose, this fix will simply not be deployed at all, which > is > > the worst that can happen. > > +1 > > I completely agree with Willy and Alexei. > > And the scale isn't even accurate, we're talking about at least > hundreds upon hundreds of clocks, not 30, if we add an operation > whose side effect is to wait for all pending loads to complete. So > yeah this is going to be heavily scrutinized. Plus this is the standard kernel code review MO: we've never blindly accepted code just because *security* (otherwise we'd have grsec in by now). We use the pushback to get better and more performant code. What often happens is it turns out that the "either security or performance" position was a false dichotomy and there is a way of fixing stuff that's acceptable (although not usually perfect) for everyone. I'm not saying this always happens, but it is reasonable to let the iterative pushback see if we can get to better code in this case rather than trying to cut it off with the "because *security*" argument. James
Re: [PATCH net-next 18/18] selftests: fib_tests: Add test cases for netdev carrier change
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Check that IPv4 and IPv6 react the same when the carrier of a netdev is > toggled. Local routes should not be affected by this, whereas unicast > routes should. > > Signed-off-by: Ido Schimmel> --- > tools/testing/selftests/net/fib_tests.sh | 142 > +++ > 1 file changed, 142 insertions(+) Acked-by: David Ahern
Re: [PATCH net-next 17/18] selftests: fib_tests: Add test cases for netdev down
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Check that IPv4 and IPv6 react the same when a netdev is being put > administratively down. > > Signed-off-by: Ido Schimmel> --- > tools/testing/selftests/net/fib_tests.sh | 141 > +++ > 1 file changed, 141 insertions(+) > Acked-by: David Ahern
Re: [PATCH net-next 16/18] selftests: fib_tests: Add test cases for IPv4/IPv6 FIB
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Add test cases to check that IPv4 and IPv6 react to a netdev being > unregistered as expected. > > Signed-off-by: Ido Schimmel> --- > tools/testing/selftests/net/Makefile | 1 + > tools/testing/selftests/net/fib_tests.sh | 146 > +++ > 2 files changed, 147 insertions(+) > create mode 100755 tools/testing/selftests/net/fib_tests.sh Acked-by: David Ahern FYI: "ip -netns testns" is more efficient than "ip netns exec testns ip".
Re: [PATCH net-next 15/18] ipv6: Flush multipath routes when all siblings are dead
On 1/7/18 3:45 AM, Ido Schimmel wrote: > By default, IPv6 deletes nexthops from a multipath route when the > nexthop device is put administratively down. This differs from IPv4 > where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A > multipath route is flushed when all of its nexthops become dead. > > Align IPv6 with IPv4 and have it conform to the same guidelines. > > In case the multipath route needs to be flushed, its siblings are > flushed one by one. Otherwise, the nexthops are marked with the > appropriate flags and the tree walker is instructed to skip all the > siblings. > > As explained in previous patches, care is taken to update the sernum of > the affected tree nodes, so as to prevent the use of wrong dst entries. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 83 > ++-- > 1 file changed, 75 insertions(+), 8 deletions(-) Acked-by: David Ahern
Re: [PATCH net-next 14/18] ipv6: Take table lock outside of sernum update function
On 1/7/18 3:45 AM, Ido Schimmel wrote: > The next patch is going to allow dead routes to remain in the FIB tree > in certain situations. > > When this happens we need to be sure to bump the sernum of the nodes > where these are stored so that potential copies cached in sockets are > invalidated. > > The function that performs this update assumes the table lock is not > taken when it is invoked, but that will not be the case when it is > invoked by the tree walker. > > Have the function assume the lock is taken and make the single caller > take the lock itself. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/ip6_fib.c | 5 + > net/ipv6/route.c | 2 ++ > 2 files changed, 3 insertions(+), 4 deletions(-) > Acked-by: David Ahern
Re: [PATCH net-next 13/18] ipv6: Export sernum update function
On 1/7/18 3:45 AM, Ido Schimmel wrote: > We are going to allow dead routes to stay in the FIB tree (e.g., when > they are part of a multipath route, directly connected route with no > carrier) and revive them when their nexthop device gains carrier or when > it is put administratively up. > > This is equivalent to the addition of the route to the FIB tree and we > should therefore take care of updating the sernum of all the parent > nodes of the node where the route is stored. Otherwise, we risk sockets > caching and using sub-optimal dst entries. > > Export the function that performs the above, so that it could be invoked > from fib6_ifup() later on. > > Signed-off-by: Ido Schimmel> --- > include/net/ip6_fib.h | 1 + > net/ipv6/ip6_fib.c| 11 --- > 2 files changed, 9 insertions(+), 3 deletions(-) > Acked-by: David Ahern
Re: [PATCH net-next 12/18] ipv6: Teach tree walker to skip multipath routes
On 1/7/18 3:45 AM, Ido Schimmel wrote: > As explained in previous patch, fib6_ifdown() needs to consider the > state of all the sibling routes when a multipath route is traversed. > > This is done by evaluating all the siblings when the first sibling in a > multipath route is traversed. If the multipath route does not need to be > flushed (e.g., not all siblings are dead), then we should just skip the > multipath route as our work is done. > > Have the tree walker jump to the last sibling when it is determined that > the multipath route needs to be skipped. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/ip6_fib.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > Acked-by: David Ahern
Re: [PATCH net-next 11/18] ipv6: Add explicit flush indication to routes
On 1/7/18 3:45 AM, Ido Schimmel wrote: > When routes that are a part of a multipath route are evaluated by > fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events > the state of their sibling routes is not considered. > > This will change in subsequent patches in order to align IPv6 with > IPv4's behavior. For example, when the last sibling in a multipath route > becomes dead, the entire multipath route needs to be removed. > > To prevent the tree walker from re-evaluating all the sibling routes > each time, we can simply evaluate them once - when the first sibling is > traversed. > > If we determine the entire multipath route needs to be removed, then the > 'should_flush' bit is set in all the siblings, which will cause the > walker to flush them when it traverses them. > > Signed-off-by: Ido Schimmel> --- > include/net/ip6_fib.h | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) Acked-by: David Ahern
Re: [PATCH net-next 10/18] ipv6: Report dead flag during route dump
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Up until now the RTNH_F_DEAD flag was only reported in route dump when > the 'ignore_routes_with_linkdown' sysctl was set. This is expected as > dead routes were flushed otherwise. > > The reliance on this sysctl is going to be removed, so we need to report > the flag regardless of the sysctl's value. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 3 +++ > 1 file changed, 3 insertions(+) Acked-by: David Ahern
Re: [PATCH net-next 09/18] ipv6: Ignore dead routes during lookup
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Currently, dead routes are only present in the routing tables in case > the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are > flushed. > > Subsequent patches are going to remove the reliance on this sysctl and > make IPv6 more consistent with IPv4. > > Before this is done, we need to make sure dead routes are skipped during > route lookup, so as to not cause packet loss. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 18 ++ > 1 file changed, 14 insertions(+), 4 deletions(-) Acked-by: David Ahern
Aw: Re: dvb usb issues since kernel 4.9
Hi, here I provide lsusb from my affected hardware (technotrend s2-4600). http://ix.io/DLY With this hardware I had errors when recording with tvheadend. Livetv was ok, only channel switching made some problems sometimes. Please see attached tvheadend service logs. I also provide dmesg (libreelec on rpi3 with kernel 4.14.10 with revert of the mentioned commit). http://ix.io/DM2 Regards Josef Sent: Sunday, 7 January 2018 at 16:41 From: "Alan Stern" To: "Mauro Carvalho Chehab" Cc: "Josef Griebichler" , "Greg Kroah-Hartman" , linux-...@vger.kernel.org, "Eric Dumazet" , "Rik van Riel" , "Paolo Abeni" , "Hannes Frederic Sowa" , "Jesper Dangaard Brouer" , linux-kernel , netdev , "Jonathan Corbet" , LMML , "Peter Zijlstra" , "David Miller" , torva...@linux-foundation.org Subject: Re: dvb usb issues since kernel 4.9 On Sun, 7 Jan 2018, Mauro Carvalho Chehab wrote: > > > It seems that the original patch were designed to solve some IRQ issues > > > with network cards with causes data losses on high traffic. However, > > > it is also causing bad effects on sustained high bandwidth demands > > > required by DVB cards, at least on some USB host drivers. > > > > > > Alan/Greg/Eric/David: > > > > > > Any ideas about how to fix it without causing regressions to > > > network? > > > > It would be good to know what hardware was involved on the x86 system > > and to have some timing data. Can we see the output from lsusb and > > usbmon, running on a vanilla kernel that gets plenty of video glitches? > > From Josef's report, and from the BZ, the affected hardware seems > to be based on Montage Technology M88DS3103/M88TS2022 chipset. What type of USB host controller does the x86_64 system use? EHCI or xHCI? > The driver it uses is at drivers/media/usb/dvb-usb-v2/dvbsky.c, > with shares a USB implementation that is used by a lot more drivers. 
> The URB handling code is at: > > drivers/media/usb/dvb-usb-v2/usb_urb.c > > This particular driver allocates 8 buffers with 4096 bytes each > for bulk transfers, using transfer_flags = URB_NO_TRANSFER_DMA_MAP. > > This become a popular USB hardware nowadays. I have one S960c > myself, so I can send you the lsusb from it. You should notice, however, > that a DVB-C/DVB-S2 channel can easily provide very high sustained bit > rates. Here, on my DVB-S2 provider, a typical transponder produces 58 Mbps > of payload after removing URB headers. You mentioned earlier that the driver uses bulk transfers. In USB-2.0, the maximum possible payload data transfer rate using bulk transfers is 53248 bytes/ms, which is 53.248 MB/s (i.e., lower than 58 MB/s). And even this is possible only if almost nothing else is using the bus at the same time. > A 10 minutes record with the > entire data (with typically contains 5-10 channels) can easily go > above 4 GB, just to reproduce 1-2 glitches. So, I'm not sure if > a usbmon dump would be useful. It might not be helpful at all. However, I'm not interested in the payload data (which would be unintelligible to me anyway) but rather the timing of URB submissions and completions. A usbmon trace which didn't keep much of the payload data would only require on the order of 50 MB per minute -- and Josef said that glitches usually would show up within a minute or so. > I'm enclosing the lsusb from a S960C device, with is based on those > Montage chipsets: What I wanted to see was the output from "lsusb" on the affected system, not the output from "lsusb -v -s B:D" on your system. > > Overall, this may be a very difficult problem to solve. The > > 4cd13c21b207 commit was intended to improve throughput at the cost of > > increased latency. But then what do you do when the latency becomes > > too high for the video subsystem to handle? > > Latency can't be too high, otherwise frames will be dropped. Yes, that's the whole point. 
> Even if the Kernel itself doesn't drop, if the delay goes higher > than a certain threshold, userspace will need to drop, as it > should be presenting audio and video on real time. Yet, typically, > userspace will delay it by one or two seconds, with would mean > 1500-3500 buffers, with I suspect it is a lot more than the hardware > limits. So I suspect that the hardware starves free buffers a way > before userspace, as media hardware don't have unlimited buffers > inside them, as they assume that the Kernel/userspace will be fast > enough to sustain bit rates up to 66 Mbps of payload. The timing information would tell us how large the latency is. In any case, you might be able to attack the problem simply by using more than 8
Re: [PATCH net-next 08/18] ipv6: Check nexthop flags in route dump instead of carrier
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Similar to previous patch, there is no need to check for the carrier of > the nexthop device when dumping the route and we can instead check for > the presence of the RTNH_F_LINKDOWN flag. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > Acked-by: David Ahern
Re: [PATCH net-next 07/18] ipv6: Check nexthop flags during route lookup instead of carrier
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the > need to dereference the nexthop device and check its carrier and instead > check for the presence of the flag. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) Acked-by: David Ahern
Re: [PATCH net-next 06/18] ipv6: Set nexthop flags during route creation
On 1/7/18 3:45 AM, Ido Schimmel wrote: > It is valid to install routes with a nexthop device that does not have a > carrier, so we need to make sure they're marked accordingly. > > As explained in the previous patch, host and anycast routes are never > marked with the 'linkdown' flag. > > Note that reject routes are unaffected, as these use the loopback device > which always has a carrier. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 3 +++ > 1 file changed, 3 insertions(+) Acked-by: David Ahern
Re: [PATCH net-next 05/18] ipv6: Set nexthop flags upon carrier change
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Similar to IPv4, when the carrier of a netdev changes we should toggle > the 'linkdown' flag on all the nexthops using it as their nexthop > device. > > This will later allow us to test for the presence of this flag during > route lookup and dump. > > Up until commit 4832c30d5458 ("net: ipv6: put host and anycast routes on > device with address") host and anycast routes used the loopback netdev > as their nexthop device and thus were not marked with the 'linkdown' > flag. The patch preserves this behavior and allows one to ping the local > address even when the nexthop device does not have a carrier and the > 'ignore_routes_with_linkdown' sysctl is set. > > Signed-off-by: Ido Schimmel> --- > include/net/ip6_route.h | 1 + > net/ipv6/addrconf.c | 2 ++ > net/ipv6/route.c| 23 +-- > 3 files changed, 20 insertions(+), 6 deletions(-) Acked-by: David Ahern
Re: [PATCH net] ipv6: sr: fix TLVs not being copied using setsockopt
Just realized I messed up the justification about sr_has_hmac. The branch will be taken, but its execution will not complete since the TLV's len and type fields aren't copied, hence seg6_get_tlv_hmac will fail, and the HMAC will not be computed. 2018-01-07 17:12 GMT+00:00 Mathieu Xhonneux: > Function ipv6_push_rthdr4 allows to add an IPv6 Segment Routing Header > to a socket through setsockopt, but the current implementation doesn't > copy possible TLVs at the end of the SRH (i.e., the following branch > if (sr_has_hmac(sr_phdr)) will never be taken as no HMAC TLV is copied). > > This commit adds a memcpy in case TLVs have been appended to the SRH. > > Fixes: a149e7c7ce812561f0fdc7a86ddc42f294e5eb3e ("ipv6: sr: add > support for SRH injection through setsockopt") > Acked-by: David Lebrun > Signed-off-by: Mathieu Xhonneux > --- > net/ipv6/exthdrs.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c > index 83bd75713535..bc68eb661970 100644 > --- a/net/ipv6/exthdrs.c > +++ b/net/ipv6/exthdrs.c > @@ -925,6 +925,15 @@ static void ipv6_push_rthdr4(struct sk_buff *skb, u8 > *proto, > sr_phdr->segments[0] = **addr_p; > *addr_p = _ihdr->segments[sr_ihdr->segments_left]; > > + if (sr_ihdr->hdrlen > hops * 2) { > + int tlvs_offset, tlvs_length; > + > + tlvs_offset = (1 + hops * 2) << 3; > + tlvs_length = (sr_ihdr->hdrlen - hops * 2) << 3; > + memcpy((char *)sr_phdr + tlvs_offset, > + (char *)sr_ihdr + tlvs_offset, tlvs_length); > + } > + > #ifdef CONFIG_IPV6_SEG6_HMAC > if (sr_has_hmac(sr_phdr)) { > struct net *net = NULL; > -- > 2.15.1 >
Re: [PATCH net-next 04/18] ipv6: Prepare to handle multiple netdev events
On 1/7/18 3:45 AM, Ido Schimmel wrote: > To make IPv6 more in line with IPv4 we need to be able to respond > differently to different netdev events. For example, when a netdev is > unregistered all the routes using it as their nexthop device should be > flushed, whereas when the netdev's carrier changes only the 'linkdown' > flag should be toggled. > > Currently, this is not possible, as the function that traverses the > routing tables is not aware of the triggering event. > > Propagate the triggering event down, so that it could be used in later > patches. > > Signed-off-by: Ido Schimmel> --- > include/net/ip6_route.h | 2 +- > net/ipv6/addrconf.c | 4 ++-- > net/ipv6/route.c| 37 + > 3 files changed, 24 insertions(+), 19 deletions(-) Acked-by: David Ahern
Re: [PATCH net-next 03/18] ipv6: Clear nexthop flags upon netdev up
On 1/7/18 3:45 AM, Ido Schimmel wrote: > Previous patch marked nexthops with the 'dead' and 'linkdown' flags. > Clear these flags when the netdev comes back up. > > Signed-off-by: Ido Schimmel> --- > include/net/ip6_route.h | 1 + > net/ipv6/addrconf.c | 3 +++ > net/ipv6/route.c| 29 + > 3 files changed, 33 insertions(+) > Acked-by: David Ahern
Re: [PATCH net-next 02/18] ipv6: Mark dead nexthops with appropriate flags
On 1/7/18 3:45 AM, Ido Schimmel wrote: > When a netdev is put administratively down or unregistered all the > nexthops using it as their nexthop device should be marked with the > 'dead' and 'linkdown' flags. > > Currently, when a route is dumped its nexthop device is tested and the > flags are set accordingly. A similar check is performed during route > lookup. > > Instead, we can simply mark the nexthops based on netdev events and > avoid checking the netdev's state during route dump and lookup. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/route.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > Acked-by: David Ahern
Re: [PATCH net-next 01/18] ipv6: Remove redundant route flushing during namespace dismantle
On 1/7/18 3:45 AM, Ido Schimmel wrote: > By the time fib6_net_exit() is executed all the netdevs in the namespace > have been either unregistered or pushed back to the default namespace. > That is because pernet subsys operations are always ordered before > pernet device operations and therefore invoked after them during > namespace dismantle. > > Thus, all the routing tables in the namespace are empty by the time > fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed. > > This allows us to simplify the condition in fib6_ifdown() as it's only > ever called with an actual netdev. > > Signed-off-by: Ido Schimmel> --- > net/ipv6/ip6_fib.c | 1 - > net/ipv6/route.c | 8 +++- > 2 files changed, 3 insertions(+), 6 deletions(-) > Acked-by: David Ahern
Re: atm/clip: Use seq_puts() in svc_addr()
>> Is the function "seq_puts" a bit more efficient for the desired output >> of a single string in comparison to calling the function "seq_printf" >> for this purpose? > > Will you please be so kind and tell us? How do you think about to get the run time characteristics for these sequence output functions better documented? https://elixir.free-electrons.com/linux/v4.15-rc6/source/fs/seq_file.c#L660 Can an information like “WARNING: Prefer seq_puts to seq_printf” (from the script “checkpatch.pl”) be another incentive? >>> and "strings should be quickly put into a sequence" >>> isn't terribly helpful. >> >> Which wording would you find more appropriate for the suggested >> adjustment of these function calls? > > Whatever describes the actual issue and what you're doing about it. > Turn your rhetorical question above into a commit message, done. > > Compare that with your original commit message, on the other hand, > and you should understand what I mean. Which descriptions are you really missing for the affected data output? Regards, Markus
Re: [net-next] netfilter: add segment routing header 'srh' match
On Sun, 7 Jan 2018 00:40:03 +0100 Pablo Neira Ayusowrote: > Hi Ahmed, > > On Fri, Dec 29, 2017 at 12:07:52PM +0100, Ahmed Abdelsalam wrote: > > It allows matching packets based on Segment Routing Header > > (SRH) information. > > The implementation considers revision 7 of the SRH draft. > > https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07 > > > > Currently supported match options include: > > (1) Next Header > > (2) Hdr Ext Len > > (3) Segments Left > > (4) Last Entry > > (5) Tag value of SRH > > > > Signed-off-by: Ahmed Abdelsalam > > --- > > include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 63 ++ > > net/ipv6/netfilter/Kconfig | 9 ++ > > net/ipv6/netfilter/Makefile | 1 + > > net/ipv6/netfilter/ip6t_srh.c| 165 > > +++ > > 4 files changed, 238 insertions(+) > > create mode 100644 include/uapi/linux/netfilter_ipv6/ip6t_srh.h > > create mode 100644 net/ipv6/netfilter/ip6t_srh.c > > > > diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h > > b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h > > new file mode 100644 > > index 000..1b5dbd8 > > --- /dev/null > > +++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h > > @@ -0,0 +1,63 @@ > > +/** > > + * Definitions for Segment Routing Header 'srh' match > > + * > > + * Author: > > + * Ahmed Abdelsalam > > + */ > > Please, add this in SPDX format instead. > > See include/uapi/linux/netfilter/xt_owner.h for instance. 
> Ok > > +#ifndef _IP6T_SRH_H > > +#define _IP6T_SRH_H > > + > > +#include > > +#include > > + > > +/* Values for "mt_flags" field in struct ip6t_srh */ > > +#define IP6T_SRH_NEXTHDR0x0001 > > +#define IP6T_SRH_LEN_EQ 0x0002 > > +#define IP6T_SRH_LEN_GT 0x0004 > > +#define IP6T_SRH_LEN_LT 0x0008 > > +#define IP6T_SRH_SEGS_EQ0x0010 > > +#define IP6T_SRH_SEGS_GT0x0020 > > +#define IP6T_SRH_SEGS_LT0x0040 > > +#define IP6T_SRH_LAST_EQ0x0080 > > +#define IP6T_SRH_LAST_GT0x0100 > > +#define IP6T_SRH_LAST_LT0x0200 > > +#define IP6T_SRH_TAG0x0400 > > +#define IP6T_SRH_MASK 0x07FF > > + > > +/* Values for "mt_invflags" field in struct ip6t_srh */ > > +#define IP6T_SRH_INV_NEXTHDR0x0001 > > +#define IP6T_SRH_INV_LEN_EQ 0x0002 > > +#define IP6T_SRH_INV_LEN_GT 0x0004 > > +#define IP6T_SRH_INV_LEN_LT 0x0008 > > +#define IP6T_SRH_INV_SEGS_EQ0x0010 > > +#define IP6T_SRH_INV_SEGS_GT0x0020 > > +#define IP6T_SRH_INV_SEGS_LT0x0040 > > +#define IP6T_SRH_INV_LAST_EQ0x0080 > > +#define IP6T_SRH_INV_LAST_GT0x0100 > > +#define IP6T_SRH_INV_LAST_LT0x0200 > > +#define IP6T_SRH_INV_TAG0x0400 > > +#define IP6T_SRH_INV_MASK 0x07FF > > Looking at all these EQ, GT, LT... I think this should be very easy to > implement in nf_tables with no kernel changes. > > You only need to add the protocol definition to: > > nftables/src/exthdr.c > > Would you have a look into this? This would be very much appreciated > to we keep nftables in sync with what we have in iptables. Yes, I look into it. I will send you a patch for nf_tables as well. 
> > > + > > +/** > > + * struct ip6t_srh - SRH match options > > + * @ next_hdr: Next header field of SRH > > + * @ hdr_len: Extension header length field of SRH > > + * @ segs_left: Segments left field of SRH > > + * @ last_entry: Last entry field of SRH > > + * @ tag: Tag field of SRH > > + * @ mt_flags: match options > > + * @ mt_invflags: Invert the sense of match options > > + */ > > + > > +struct ip6t_srh { > > + __u8next_hdr; > > + __u8hdr_len; > > + __u8segs_left; > > + __u8last_entry; > > + __u16 tag; > > + __u16 mt_flags; > > + __u16 mt_invflags; > > +}; > > + > > +#endif /*_IP6T_SRH_H*/ > > diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig > > index 6acb2ee..e1818eb 100644 > > --- a/net/ipv6/netfilter/Kconfig > > +++ b/net/ipv6/netfilter/Kconfig > > @@ -232,6 +232,15 @@ config IP6_NF_MATCH_RT > > > > To compile it as a module, choose M here. If unsure, say N. > > > > +config IP6_NF_MATCH_SRH > > +tristate '"srh" Segment Routing header match support' > > +depends on NETFILTER_ADVANCED > > +help > > + srh matching allows you to match packets based on the segment > > + routing header of the packet. > > + > > + To compile it as a module, choose M here. If unsure, say N. > > + > > # The targets > > config IP6_NF_TARGET_HL > > tristate '"HL" hoplimit target support' > > diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile > > index c6ee0cd..e0d51a9 100644 > > --- a/net/ipv6/netfilter/Makefile > > +++
[PATCH net] ipv6: sr: fix TLVs not being copied using setsockopt
Function ipv6_push_rthdr4 allows to add an IPv6 Segment Routing Header to a socket through setsockopt, but the current implementation doesn't copy possible TLVs at the end of the SRH (i.e., the following branch if (sr_has_hmac(sr_phdr)) will never be taken as no HMAC TLV is copied). This commit adds a memcpy in case TLVs have been appended to the SRH. Fixes: a149e7c7ce812561f0fdc7a86ddc42f294e5eb3e ("ipv6: sr: add support for SRH injection through setsockopt") Acked-by: David Lebrun Signed-off-by: Mathieu Xhonneux --- net/ipv6/exthdrs.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c index 83bd75713535..bc68eb661970 100644 --- a/net/ipv6/exthdrs.c +++ b/net/ipv6/exthdrs.c @@ -925,6 +925,15 @@ static void ipv6_push_rthdr4(struct sk_buff *skb, u8 *proto, sr_phdr->segments[0] = **addr_p; *addr_p = _ihdr->segments[sr_ihdr->segments_left]; + if (sr_ihdr->hdrlen > hops * 2) { + int tlvs_offset, tlvs_length; + + tlvs_offset = (1 + hops * 2) << 3; + tlvs_length = (sr_ihdr->hdrlen - hops * 2) << 3; + memcpy((char *)sr_phdr + tlvs_offset, + (char *)sr_ihdr + tlvs_offset, tlvs_length); + } + #ifdef CONFIG_IPV6_SEG6_HMAC if (sr_has_hmac(sr_phdr)) { struct net *net = NULL; -- 2.15.1
Re: dvb usb issues since kernel 4.9
On Sun, 7 Jan 2018, Mauro Carvalho Chehab wrote: > > > It seems that the original patch were designed to solve some IRQ issues > > > with network cards with causes data losses on high traffic. However, > > > it is also causing bad effects on sustained high bandwidth demands > > > required by DVB cards, at least on some USB host drivers. > > > > > > Alan/Greg/Eric/David: > > > > > > Any ideas about how to fix it without causing regressions to > > > network? > > > > It would be good to know what hardware was involved on the x86 system > > and to have some timing data. Can we see the output from lsusb and > > usbmon, running on a vanilla kernel that gets plenty of video glitches? > > From Josef's report, and from the BZ, the affected hardware seems > to be based on Montage Technology M88DS3103/M88TS2022 chipset. What type of USB host controller does the x86_64 system use? EHCI or xHCI? > The driver it uses is at drivers/media/usb/dvb-usb-v2/dvbsky.c, > with shares a USB implementation that is used by a lot more drivers. > The URB handling code is at: > > drivers/media/usb/dvb-usb-v2/usb_urb.c > > This particular driver allocates 8 buffers with 4096 bytes each > for bulk transfers, using transfer_flags = URB_NO_TRANSFER_DMA_MAP. > > This become a popular USB hardware nowadays. I have one S960c > myself, so I can send you the lsusb from it. You should notice, however, > that a DVB-C/DVB-S2 channel can easily provide very high sustained bit > rates. Here, on my DVB-S2 provider, a typical transponder produces 58 Mbps > of payload after removing URB headers. You mentioned earlier that the driver uses bulk transfers. In USB-2.0, the maximum possible payload data transfer rate using bulk transfers is 53248 bytes/ms, which is 53.248 MB/s (i.e., lower than 58 MB/s). And even this is possible only if almost nothing else is using the bus at the same time. 
> A 10-minute recording of the > entire data (which typically contains 5-10 channels) can easily go > above 4 GB, just to reproduce 1-2 glitches. So, I'm not sure if > a usbmon dump would be useful. It might not be helpful at all. However, I'm not interested in the payload data (which would be unintelligible to me anyway) but rather the timing of URB submissions and completions. A usbmon trace which didn't keep much of the payload data would only require on the order of 50 MB per minute -- and Josef said that glitches usually would show up within a minute or so. > I'm enclosing the lsusb from a S960C device, which is based on those > Montage chipsets: What I wanted to see was the output from "lsusb" on the affected system, not the output from "lsusb -v -s B:D" on your system. > > Overall, this may be a very difficult problem to solve. The > > 4cd13c21b207 commit was intended to improve throughput at the cost of > > increased latency. But then what do you do when the latency becomes > > too high for the video subsystem to handle? > > Latency can't be too high, otherwise frames will be dropped. Yes, that's the whole point. > Even if the Kernel itself doesn't drop, if the delay goes higher > than a certain threshold, userspace will need to drop, as it > should be presenting audio and video in real time. Yet, typically, > userspace will delay it by one or two seconds, which would mean > 1500-3500 buffers, which I suspect is a lot more than the hardware > limits. So I suspect that the hardware starves free buffers way > before userspace, as media hardware doesn't have unlimited buffers > inside it; it assumes that the Kernel/userspace will be fast > enough to sustain bit rates up to 66 Mbps of payload. The timing information would tell us how large the latency is. In any case, you might be able to attack the problem simply by using more than 8 buffers. 
With just eight 4096-byte buffers, the total pipeline capacity is only about 0.62 ms (at the maximum possible transfer rate). Increasing the number of buffers to 65 would give a capacity of 5 ms, which is probably a lot better suited for situations where completions are handled by the ksoftirqd thread. > Perhaps media drivers could pass some quirk similar to URB_ISO_ASAP, > in order to revert the kernel logic to prioritize latency instead of > throughput. It can't be done without pervasive changes to the USB subsystem, which I would greatly prefer to avoid. Besides, this wouldn't really solve the problem. Decreasing the latency for one device will cause it to be increased for others. Alan Stern
Re: atm/clip: Use seq_puts() in svc_addr()
On Sun, 7 Jan 2018 09:19:17 +0100 SF Markus Elfring wrote: > >> Two strings should be quickly put into a sequence by two function calls. > >> Thus use the function "seq_puts" instead of "seq_printf". > >> > >> This issue was detected by using the Coccinelle software. > > > > Can you please explain what the issue really is and what you're trying > > to do here? > > Is the function "seq_puts" a bit more efficient for the desired output > of a single string in comparison to calling the function "seq_printf" > for this purpose? Will you please be so kind and tell us? > > One shouldn't need to dig into Coccinelle patterns to find > > out what you mean, > > Why did an attribution for a software tool confuse you? I'm not confused. I'm saying that one shouldn't need to dig into Coccinelle patterns to find out what you mean. > > and "strings should be quickly put into a sequence" > > isn't terribly helpful. > > Which wording would you find more appropriate for the suggested > adjustment of these function calls? Whatever describes the actual issue and what you're doing about it. Turn your rhetorical question above into a commit message, done. Compare that with your original commit message, on the other hand, and you should understand what I mean. -- Stefano
Re: [PATCH net-next v3 00/10] net: qualcomm: rmnet: Enable csum offloads
From: Subash Abhinov Kasiviswanathan
Date: Sat, 06 Jan 2018 19:19:00 -0700 > I still don't see the v3 submission in patchwork if I query it. Please resubmit then.
[PATCH] ixgbevf: use ARRAY_SIZE for various array sizing calculations
From: Colin Ian King

Use the ARRAY_SIZE macro on various arrays to determine the size of the arrays. Improvement suggested by coccinelle. Signed-off-by: Colin Ian King --- drivers/net/ethernet/intel/ixgbevf/vf.c | 17 +++-- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c b/drivers/net/ethernet/intel/ixgbevf/vf.c index 0c25006ce9af..cd79c4a73b48 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.c +++ b/drivers/net/ethernet/intel/ixgbevf/vf.c @@ -285,7 +285,7 @@ static s32 ixgbevf_set_uc_addr_vf(struct ixgbe_hw *hw, u32 index, u8 *addr) ether_addr_copy(msg_addr, addr); ret_val = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf, -sizeof(msgbuf) / sizeof(u32)); +ARRAY_SIZE(msgbuf)); if (!ret_val) { msgbuf[0] &= ~IXGBE_VT_MSGTYPE_CTS; @@ -455,8 +455,7 @@ static s32 ixgbevf_set_rar_vf(struct ixgbe_hw *hw, u32 index, u8 *addr, ether_addr_copy(msg_addr, addr); ret_val = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf, -sizeof(msgbuf) / sizeof(u32)); - +ARRAY_SIZE(msgbuf)); msgbuf[0] &= ~IXGBE_VT_MSGTYPE_CTS; /* if nacked the address was rejected, use "perm_addr" */ @@ -571,7 +570,7 @@ static s32 ixgbevf_update_xcast_mode(struct ixgbe_hw *hw, int xcast_mode) msgbuf[1] = xcast_mode; err = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf, -sizeof(msgbuf) / sizeof(u32)); +ARRAY_SIZE(msgbuf)); if (err) return err; @@ -609,7 +608,7 @@ static s32 ixgbevf_set_vfta_vf(struct ixgbe_hw *hw, u32 vlan, u32 vind, msgbuf[0] |= vlan_on << IXGBE_VT_MSGINFO_SHIFT; err = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf, -sizeof(msgbuf) / sizeof(u32)); +ARRAY_SIZE(msgbuf)); if (err) goto mbx_err; @@ -813,7 +812,7 @@ static s32 ixgbevf_set_rlpml_vf(struct ixgbe_hw *hw, u16 max_size) msgbuf[1] = max_size; ret_val = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf, -sizeof(msgbuf) / sizeof(u32)); +ARRAY_SIZE(msgbuf)); if (ret_val) return ret_val; if ((msgbuf[0] & IXGBE_VF_SET_LPE) && @@ -859,8 +858,7 @@ static int ixgbevf_negotiate_api_version_vf(struct 
ixgbe_hw *hw, int api) msg[1] = api; msg[2] = 0; - err = ixgbevf_write_msg_read_ack(hw, msg, msg, -sizeof(msg) / sizeof(u32)); + err = ixgbevf_write_msg_read_ack(hw, msg, msg, ARRAY_SIZE(msg)); if (!err) { msg[0] &= ~IXGBE_VT_MSGTYPE_CTS; @@ -911,8 +909,7 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int *num_tcs, msg[0] = IXGBE_VF_GET_QUEUE; msg[1] = msg[2] = msg[3] = msg[4] = 0; - err = ixgbevf_write_msg_read_ack(hw, msg, msg, -sizeof(msg) / sizeof(u32)); + err = ixgbevf_write_msg_read_ack(hw, msg, msg, ARRAY_SIZE(msg)); if (!err) { msg[0] &= ~IXGBE_VT_MSGTYPE_CTS; -- 2.15.1
Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared
On 18-01-07 09:28 AM, Jamal Hadi Salim wrote: On 18-01-07 08:46 AM, Jiri Pirko wrote: Sun, Jan 07, 2018 at 02:11:19PM CET, j...@mojatatu.com wrote: On 18-01-06 03:43 PM, Jiri Pirko wrote: @@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct sk_buff *skb, tcm->tcm_family = AF_UNSPEC; tcm->tcm__pad1 = 0; tcm->tcm__pad2 = 0; - tcm->tcm_ifindex = qdisc_dev(q)->ifindex; - tcm->tcm_parent = parent; + if (q) { + tcm->tcm_ifindex = qdisc_dev(q)->ifindex; + tcm->tcm_parent = parent; + } else { + tcm->tcm_ifindex = 0; /* block index is stored in parent */ + tcm->tcm_parent = block->index; + } Please guys, please look at this reuse (also on the client side). I would like you to double-check this reuse of existing API for block_index carrying purposes. I believe it's UAPI safe. But please, check it out carefully. Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex (not sure if zero means something special to someone). Like -1 means parent is block_index? Yes. Why would 0 mean something special? Could you point to a code that suggests it? I can't point to any such code; it is just that the ifindex is an int. And the negative space seems less likely for someone to think of using for signalling (0x as an example). tcpdump -i any probably assumes some weird ifindex (haven't looked at the code). In any case, 0 is fine too. Also a #define for whatever value you use would make it more readable, ex: #define IFINDEX_ANY .. cheers, jamal
[iproute2 1/1] tc: Fix filter protocol output
From: Jamal Hadi Salim

Fixes: 249284ff5a44 ("tc: jsonify filter core") Signed-off-by: Jamal Hadi Salim --- tc/tc_filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tc/tc_filter.c b/tc/tc_filter.c index 545cc3a..546311a 100644 --- a/tc/tc_filter.c +++ b/tc/tc_filter.c @@ -276,7 +276,7 @@ int print_filter(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (!filter_protocol || filter_protocol != f_proto) { if (f_proto) { SPRINT_BUF(b1); - print_string(PRINT_JSON, "protocol", + print_string(PRINT_ANY, "protocol", "protocol %s ", ll_proto_n2a(f_proto, b1, sizeof(b1))); } -- 1.9.1
[PATCH iproute2] ip fou: Pass family attribute as u8
This fixes fou on big-endian systems. Signed-off-by: Filip Moc --- ip/ipfou.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ip/ipfou.c b/ip/ipfou.c index febc2c8c..1f392ade 100644 --- a/ip/ipfou.c +++ b/ip/ipfou.c @@ -52,7 +52,7 @@ static int fou_parse_opt(int argc, char **argv, struct nlmsghdr *n, __u8 ipproto, type; bool gue_set = false; int ipproto_set = 0; - unsigned short family = AF_INET; + __u8 family = AF_INET; while (argc > 0) { if (!matches(*argv, "port")) { @@ -103,7 +103,7 @@ static int fou_parse_opt(int argc, char **argv, struct nlmsghdr *n, addattr16(n, 1024, FOU_ATTR_PORT, port); addattr8(n, 1024, FOU_ATTR_TYPE, type); - addattr16(n, 1024, FOU_ATTR_AF, family); + addattr8(n, 1024, FOU_ATTR_AF, family); if (ipproto_set) addattr8(n, 1024, FOU_ATTR_IPPROTO, ipproto); -- 2.11.0
Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared
On 18-01-07 08:46 AM, Jiri Pirko wrote: Sun, Jan 07, 2018 at 02:11:19PM CET, j...@mojatatu.com wrote: On 18-01-06 03:43 PM, Jiri Pirko wrote: @@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct sk_buff *skb, tcm->tcm_family = AF_UNSPEC; tcm->tcm__pad1 = 0; tcm->tcm__pad2 = 0; - tcm->tcm_ifindex = qdisc_dev(q)->ifindex; - tcm->tcm_parent = parent; + if (q) { + tcm->tcm_ifindex = qdisc_dev(q)->ifindex; + tcm->tcm_parent = parent; + } else { + tcm->tcm_ifindex = 0; /* block index is stored in parent */ + tcm->tcm_parent = block->index; + } Please guys, please look at this reuse (also on the client side). I would like you to double-check this reuse of existing API for block_index carrying purposes. I believe it's UAPI safe. But please, check it out carefully. Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex (not sure if zero means something special to someone). Like -1 means parent is block_index? Yes. Why would 0 mean something special? Could you point to a code that suggests it? I can't point to any such code; it is just that the ifindex is an int. And the negative space seems less likely for someone to think of using for signalling (0x as an example). tcpdump -i any probably assumes some weird ifindex (haven't looked at the code). In any case, 0 is fine too. cheers, jamal
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
> > 2. It is very very complicated to answer a question like "is > > sequence x safe on all of vendor's microprocessors" even for the vendor > > so far "is sequence x safe" was viewed by cpu vendors as > "is sequence x going to stop speculative execution". Incorrect. Modern processors are very very sophisticated beasts and you are dealing with a wide range of architectures and micro-architectures that all have their own differences. > > Intel's current position is 'lfence'. > which clearly states that bpf_tail_call() was used in the attack. > Yet none of the intel nor arm patches address speculation in > this bpf helper! There is a set of patches under discussion for eBPF, covering both the JIT and the interpreter. BTW I would expect that there are things Coverity didn't find, and that we'll also see variants on the attack where different tricks for measurement emerge that may change what is needed a little. > which means that POC is relying on 64-bit address speculation. > In the places coverity found the user supplied value is 32-bit, People have 32bit computers as well as 64bit and in some cases 32bit is fine for an attack depending where your target is relative to the object. > > The differences involved on the "lfence" versus "and" versus before are > > not likely to be anywhere in that order of magnitude. > > we clearly disagree here. Both intel and arm patches proposed > to add lfence in bpf_map_lookup() which is the hottest function > we have and we do run it at 40+Gbps speeds where every nanosecond > counts, so no, lfence is not a solution. The last solution I saw proposed in that path for the JIT is to "and" with a constant, which in that situation clearly makes the most sense provided it is safe on all the CPUs involved. lfence timing is also heavily dependent upon what work has to be done to retire previous live instructions. BPF does not normally do a lot of writing so you'd expect the cost to be low. Anyway - Intel's current position is that lfence is safe. 
It's not impossible other sequences will in future be documented as safe by one or more vendors of x86 processors. Alan
Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared
Sun, Jan 07, 2018 at 02:11:19PM CET, j...@mojatatu.com wrote: >On 18-01-06 03:43 PM, Jiri Pirko wrote: > > >> >> > @@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct >> > sk_buff *skb, >> >tcm->tcm_family = AF_UNSPEC; >> >tcm->tcm__pad1 = 0; >> >tcm->tcm__pad2 = 0; >> > - tcm->tcm_ifindex = qdisc_dev(q)->ifindex; >> > - tcm->tcm_parent = parent; >> > + if (q) { >> > + tcm->tcm_ifindex = qdisc_dev(q)->ifindex; >> > + tcm->tcm_parent = parent; >> > + } else { >> > + tcm->tcm_ifindex = 0; /* block index is stored in parent */ >> > + tcm->tcm_parent = block->index; >> > + } >> >> Please guys, please look at this reuse (also on the client side). I would like >> you to double-check this reuse of existing API for block_index carrying >> purpose. I believe it's UAPI safe. But please, check it out carefully. >> > > >Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex >(not sure if zero means something special to someone). Like -1 means parent is block_index? Why would 0 mean something special? Could you point to a code that suggests it? > >cheers, >jamal >
Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared
On 18-01-06 03:43 PM, Jiri Pirko wrote: @@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct sk_buff *skb, tcm->tcm_family = AF_UNSPEC; tcm->tcm__pad1 = 0; tcm->tcm__pad2 = 0; - tcm->tcm_ifindex = qdisc_dev(q)->ifindex; - tcm->tcm_parent = parent; + if (q) { + tcm->tcm_ifindex = qdisc_dev(q)->ifindex; + tcm->tcm_parent = parent; + } else { + tcm->tcm_ifindex = 0; /* block index is stored in parent */ + tcm->tcm_parent = block->index; + } Please guys, please look at this reuse (also on the client side). I would like you to double-check this reuse of existing API for block_index carrying purposes. I believe it's UAPI safe. But please, check it out carefully. Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex (not sure if zero means something special to someone). cheers, jamal
Re: dvb usb issues since kernel 4.9
On Sat, 6 Jan 2018 16:44:20 -0500 (EST), Alan Stern wrote: > On Sat, 6 Jan 2018, Mauro Carvalho Chehab wrote: > > > Hi Josef, > > > > On Sat, 6 Jan 2018 16:04:16 +0100 > > "Josef Griebichler" wrote: > > > > > Hi, > > > > > > the causing commit has been identified. > > > After reverting commit > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882 > > > it's working again. > > > > Just replying to me won't magically fix this. The ones that were involved in > > this patch should also be cc'd, plus USB people. Just added them. > > > > > Please have a look into the thread > > > https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?pageNo=13 > > > here are several users acknowledging the revert solves their issues with > > > usb dvb cards. > > > > I read the entire (long) thread there. In order to make it easier for the > > others, from what I understand, the problem happens on both x86 and arm, > > although almost all comments there are mentioning tests with the raspbian > > Kernel (which uses a different USB host driver than the upstream one). > > > > It happens when watching digital TV DVB-C channels, which usually means > > a sustained bit rate of 11 Mbps to 54 Mbps. > > > > The reports mention the dvbsky, which uses USB URB bulk transfers. > > Every several minutes (5 to 10 mins), the stream suffers "glitches" > > caused by frame losses. 
> > > > The part of the thread that contains the bisect is at: > > > > https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=75965#post75965 > > > > It indirectly mentions another comment on the thread which points > > to: > > https://github.com/raspberrypi/linux/issues/2134 > > > > There, it says that this fixes part of the issues: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34f41c0316ed52b0b44542491d89278efdaa70e4 > > > > but it affects URB packet losses to a lesser extent. > > > > The main issue is really the logic change in the core softirq logic. > > > > Using Kernel 4.14.10 on a Raspberry Pi 3 with the 4cd13c2 commit reverted > > fixed the issue. > > > > Joseph, is the above right? Anything else to mention? Does the > > same issue also show up on x86 with vanilla Kernel 4.14.10? > > > > - > > > > It seems that the original patch was designed to solve some IRQ issues > > with network cards which cause data losses on high traffic. However, > > it is also causing bad effects on sustained high bandwidth demands > > required by DVB cards, at least on some USB host drivers. > > > > Alan/Greg/Eric/David: > > > > Any ideas about how to fix it without causing regressions to > > network? > > It would be good to know what hardware was involved on the x86 system > > and to have some timing data. Can we see the output from lsusb and > > usbmon, running on a vanilla kernel that gets plenty of video glitches? From Josef's report, and from the BZ, the affected hardware seems to be based on the Montage Technology M88DS3103/M88TS2022 chipset. The driver it uses is at drivers/media/usb/dvb-usb-v2/dvbsky.c, which shares a USB implementation that is used by a lot more drivers. The URB handling code is at: drivers/media/usb/dvb-usb-v2/usb_urb.c This particular driver allocates 8 buffers with 4096 bytes each for bulk transfers, using transfer_flags = URB_NO_TRANSFER_DMA_MAP. This has become popular USB hardware nowadays. 
I have one S960c myself, so I can send you the lsusb from it. You should notice, however, that a DVB-C/DVB-S2 channel can easily provide very high sustained bit rates. Here, on my DVB-S2 provider, a typical transponder produces 58 Mbps of payload after removing URB headers. A 10-minute recording of the entire data (which typically contains 5-10 channels) can easily go above 4 GB, just to reproduce 1-2 glitches. So, I'm not sure if a usbmon dump would be useful. I'm enclosing the lsusb from a S960C device, which is based on those Montage chipsets:

Bus 002 Device 007: ID 0572:960c Conexant Systems (Rockwell), Inc. DVBSky S960C DVB-S2 tuner
Couldn't open device, some information will be missing
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x0572 Conexant Systems (Rockwell), Inc.
  idProduct          0x960c DVBSky S960C DVB-S2 tuner
  bcdDevice            0.00
  iManufacturer           1
  iProduct                2
  iSerial                 3
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength          219
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          4
[PATCH net-next 07/18] ipv6: Check nexthop flags during route lookup instead of carrier
Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the need to dereference the nexthop device and check its carrier and instead check for the presence of the flag. Signed-off-by: Ido Schimmel--- net/ipv6/route.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 314e3bf41f6f..ab0eed43ed97 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -474,7 +474,7 @@ static struct rt6_info *rt6_multipath_select(struct rt6_info *match, if (route_choosen == 0) { struct inet6_dev *idev = sibling->rt6i_idev; - if (!netif_carrier_ok(sibling->dst.dev) && + if (sibling->rt6i_nh_flags & RTNH_F_LINKDOWN && idev->cnf.ignore_routes_with_linkdown) break; if (rt6_score_route(sibling, oif, strict) < 0) @@ -679,10 +679,9 @@ static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict, int m; bool match_do_rr = false; struct inet6_dev *idev = rt->rt6i_idev; - struct net_device *dev = rt->dst.dev; - if (dev && !netif_carrier_ok(dev) && - idev->cnf.ignore_routes_with_linkdown && + if (idev->cnf.ignore_routes_with_linkdown && + rt->rt6i_nh_flags & RTNH_F_LINKDOWN && !(strict & RT6_LOOKUP_F_IGNORE_LINKSTATE)) goto out; -- 2.14.3
[PATCH net-next 02/18] ipv6: Mark dead nexthops with appropriate flags
When a netdev is put administratively down or unregistered all the nexthops using it as their nexthop device should be marked with the 'dead' and 'linkdown' flags. Currently, when a route is dumped its nexthop device is tested and the flags are set accordingly. A similar check is performed during route lookup. Instead, we can simply mark the nexthops based on netdev events and avoid checking the netdev's state during route dump and lookup. Signed-off-by: Ido Schimmel--- net/ipv6/route.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index c557362daa23..f5eda0aeab55 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3473,8 +3473,10 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) if (rt->dst.dev == dev && rt != adn->net->ipv6.ip6_null_entry && (rt->rt6i_nsiblings == 0 || netdev_unregistering(dev) || -!rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) +!rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) { + rt->rt6i_nh_flags |= (RTNH_F_DEAD | RTNH_F_LINKDOWN); return -1; + } return 0; } -- 2.14.3
[PATCH net-next 03/18] ipv6: Clear nexthop flags upon netdev up
Previous patch marked nexthops with the 'dead' and 'linkdown' flags. Clear these flags when the netdev comes back up. Signed-off-by: Ido Schimmel--- include/net/ip6_route.h | 1 + net/ipv6/addrconf.c | 3 +++ net/ipv6/route.c| 29 + 3 files changed, 33 insertions(+) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 18e442ea93d8..caad39198c2a 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -169,6 +169,7 @@ void rt6_ifdown(struct net *net, struct net_device *dev); void rt6_mtu_change(struct net_device *dev, unsigned int mtu); void rt6_remove_prefsrc(struct inet6_ifaddr *ifp); void rt6_clean_tohost(struct net *net, struct in6_addr *gateway); +void rt6_sync_up(struct net_device *dev, unsigned int nh_flags); static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb) { diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ed06b1190f05..b6405568ed7b 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3484,6 +3484,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, if (run_pending) addrconf_dad_run(idev); + /* Device has an address by now */ + rt6_sync_up(dev, RTNH_F_DEAD); + /* * If the MTU changed during the interface down, * when the interface up, the changed MTU must be diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f5eda0aeab55..4796d87e0b93 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3459,6 +3459,35 @@ void rt6_clean_tohost(struct net *net, struct in6_addr *gateway) fib6_clean_all(net, fib6_clean_tohost, gateway); } +struct arg_netdev_event { + const struct net_device *dev; + unsigned int nh_flags; +}; + +static int fib6_ifup(struct rt6_info *rt, void *p_arg) +{ + const struct arg_netdev_event *arg = p_arg; + const struct net *net = dev_net(arg->dev); + + if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev) + rt->rt6i_nh_flags &= ~arg->nh_flags; + + return 0; +} + +void rt6_sync_up(struct net_device *dev, unsigned int nh_flags) +{ + 
struct arg_netdev_event arg = { + .dev = dev, + .nh_flags = nh_flags, + }; + + if (nh_flags & RTNH_F_DEAD && netif_carrier_ok(dev)) + arg.nh_flags |= RTNH_F_LINKDOWN; + + fib6_clean_all(dev_net(dev), fib6_ifup, ); +} + struct arg_dev_net { struct net_device *dev; struct net *net; -- 2.14.3
[PATCH net-next 05/18] ipv6: Set nexthop flags upon carrier change
Similar to IPv4, when the carrier of a netdev changes we should toggle the 'linkdown' flag on all the nexthops using it as their nexthop device. This will later allow us to test for the presence of this flag during route lookup and dump. Up until commit 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address") host and anycast routes used the loopback netdev as their nexthop device and thus were not marked with the 'linkdown' flag. The patch preserves this behavior and allows one to ping the local address even when the nexthop device does not have a carrier and the 'ignore_routes_with_linkdown' sysctl is set. Signed-off-by: Ido Schimmel--- include/net/ip6_route.h | 1 + net/ipv6/addrconf.c | 2 ++ net/ipv6/route.c| 23 +-- 3 files changed, 20 insertions(+), 6 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 6a2f80cbdf65..34cd3b0c6ded 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -170,6 +170,7 @@ void rt6_remove_prefsrc(struct inet6_ifaddr *ifp); void rt6_clean_tohost(struct net *net, struct in6_addr *gateway); void rt6_sync_up(struct net_device *dev, unsigned int nh_flags); void rt6_disable_ip(struct net_device *dev, unsigned long event); +void rt6_sync_down_dev(struct net_device *dev, unsigned long event); static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb) { diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index a13e1ffe87ec..2435f7ab070b 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3438,6 +3438,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, } else if (event == NETDEV_CHANGE) { if (!addrconf_link_ready(dev)) { /* device is still not ready. 
*/ + rt6_sync_down_dev(dev, event); break; } @@ -3449,6 +3450,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, * multicast snooping switches */ ipv6_mc_up(idev); + rt6_sync_up(dev, RTNH_F_LINKDOWN); break; } idev->if_flags |= IF_READY; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 194fe9d9cd85..2fd36c7dd143 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3498,18 +3498,29 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg) const struct net_device *dev = arg->dev; const struct net *net = dev_net(dev); - if (rt->dst.dev == dev && - rt != net->ipv6.ip6_null_entry && - (rt->rt6i_nsiblings == 0 || netdev_unregistering(dev) || -!rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) { - rt->rt6i_nh_flags |= (RTNH_F_DEAD | RTNH_F_LINKDOWN); + if (rt->dst.dev != dev || rt == net->ipv6.ip6_null_entry) + return 0; + + switch (arg->event) { + case NETDEV_UNREGISTER: return -1; + case NETDEV_DOWN: + if (rt->rt6i_nsiblings == 0 || + !rt->rt6i_idev->cnf.ignore_routes_with_linkdown) + return -1; + rt->rt6i_nh_flags |= RTNH_F_DEAD; + /* fall through */ + case NETDEV_CHANGE: + if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) + break; + rt->rt6i_nh_flags |= RTNH_F_LINKDOWN; + break; } return 0; } -static void rt6_sync_down_dev(struct net_device *dev, unsigned long event) +void rt6_sync_down_dev(struct net_device *dev, unsigned long event) { struct arg_netdev_event arg = { .dev = dev, -- 2.14.3
[PATCH net-next 04/18] ipv6: Prepare to handle multiple netdev events
To make IPv6 more in line with IPv4 we need to be able to respond
differently to different netdev events. For example, when a netdev is
unregistered all the routes using it as their nexthop device should be
flushed, whereas when the netdev's carrier changes only the 'linkdown'
flag should be toggled.

Currently, this is not possible, as the function that traverses the
routing tables is not aware of the triggering event.

Propagate the triggering event down, so that it could be used in later
patches.

Signed-off-by: Ido Schimmel
---
 include/net/ip6_route.h |  2 +-
 net/ipv6/addrconf.c     |  4 ++--
 net/ipv6/route.c        | 37 +++++++++++++++++++++----------------
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index caad39198c2a..6a2f80cbdf65 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -165,11 +165,11 @@ struct rt6_rtnl_dump_arg {
 };
 
 int rt6_dump_route(struct rt6_info *rt, void *p_arg);
-void rt6_ifdown(struct net *net, struct net_device *dev);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
 void rt6_sync_up(struct net_device *dev, unsigned int nh_flags);
+void rt6_disable_ip(struct net_device *dev, unsigned long event);
 
 static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b6405568ed7b..a13e1ffe87ec 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3580,6 +3580,7 @@ static bool addr_is_local(const struct in6_addr *addr)
 
 static int addrconf_ifdown(struct net_device *dev, int how)
 {
+	unsigned long event = how ? NETDEV_UNREGISTER : NETDEV_DOWN;
 	struct net *net = dev_net(dev);
 	struct inet6_dev *idev;
 	struct inet6_ifaddr *ifa, *tmp;
@@ -3589,8 +3590,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 
 	ASSERT_RTNL();
 
-	rt6_ifdown(net, dev);
-	neigh_ifdown(&nd_tbl, dev);
+	rt6_disable_ip(dev, event);
 
 	idev = __in6_dev_get(dev);
 	if (!idev)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4796d87e0b93..194fe9d9cd85 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2344,7 +2344,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	rt->rt6i_idev = idev;
 	dst_metric_set(&rt->dst, RTAX_HOPLIMIT, 0);
 
-	/* Add this dst into uncached_list so that rt6_ifdown() can
+	/* Add this dst into uncached_list so that rt6_disable_ip() can
 	 * do proper release of the net_device
 	 */
 	rt6_uncached_list_add(rt);
@@ -3461,7 +3461,10 @@ void rt6_clean_tohost(struct net *net, struct in6_addr *gateway)
 
 struct arg_netdev_event {
 	const struct net_device *dev;
-	unsigned int nh_flags;
+	union {
+		unsigned int nh_flags;
+		unsigned long event;
+	};
 };
 
 static int fib6_ifup(struct rt6_info *rt, void *p_arg)
@@ -3488,19 +3491,15 @@ void rt6_sync_up(struct net_device *dev, unsigned int nh_flags)
 	fib6_clean_all(dev_net(dev), fib6_ifup, &arg);
 }
 
-struct arg_dev_net {
-	struct net_device *dev;
-	struct net *net;
-};
-
 /* called with write lock held for table with rt */
-static int fib6_ifdown(struct rt6_info *rt, void *arg)
+static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 {
-	const struct arg_dev_net *adn = arg;
-	const struct net_device *dev = adn->dev;
+	const struct arg_netdev_event *arg = p_arg;
+	const struct net_device *dev = arg->dev;
+	const struct net *net = dev_net(dev);
 
 	if (rt->dst.dev == dev &&
-	    rt != adn->net->ipv6.ip6_null_entry &&
+	    rt != net->ipv6.ip6_null_entry &&
 	    (rt->rt6i_nsiblings == 0 || netdev_unregistering(dev) ||
 	     !rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) {
 		rt->rt6i_nh_flags |= (RTNH_F_DEAD | RTNH_F_LINKDOWN);
@@ -3510,15 +3509,21 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 	return 0;
 }
 
-void rt6_ifdown(struct net *net, struct net_device *dev)
+static void rt6_sync_down_dev(struct net_device *dev, unsigned long event)
 {
-	struct arg_dev_net adn = {
+	struct arg_netdev_event arg = {
 		.dev = dev,
-		.net = net,
+		.event = event,
 	};
 
-	fib6_clean_all(net, fib6_ifdown, &adn);
-	rt6_uncached_list_flush_dev(net, dev);
+	fib6_clean_all(dev_net(dev), fib6_ifdown, &arg);
+}
+
+void rt6_disable_ip(struct net_device *dev, unsigned long event)
+{
+	rt6_sync_down_dev(dev, event);
+	rt6_uncached_list_flush_dev(dev_net(dev), dev);
+	neigh_ifdown(&nd_tbl, dev);
 }
 
 struct rt6_mtu_change_arg {
-- 
2.14.3
[PATCH net-next 15/18] ipv6: Flush multipath routes when all siblings are dead
By default, IPv6 deletes nexthops from a multipath route when the
nexthop device is put administratively down. This differs from IPv4
where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
multipath route is flushed when all of its nexthops become dead.

Align IPv6 with IPv4 and have it conform to the same guidelines.

In case the multipath route needs to be flushed, its siblings are
flushed one by one. Otherwise, the nexthops are marked with the
appropriate flags and the tree walker is instructed to skip all the
siblings.

As explained in previous patches, care is taken to update the sernum of
the affected tree nodes, so as to prevent the use of wrong dst entries.

Signed-off-by: Ido Schimmel
---
 net/ipv6/route.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 75 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a3bfce71c861..1054b059747f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3486,8 +3486,10 @@ static int fib6_ifup(struct rt6_info *rt, void *p_arg)
 	const struct arg_netdev_event *arg = p_arg;
 	const struct net *net = dev_net(arg->dev);
 
-	if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev)
+	if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev) {
 		rt->rt6i_nh_flags &= ~arg->nh_flags;
+		fib6_update_sernum_upto_root(dev_net(rt->dst.dev), rt);
+	}
 
 	return 0;
 }
@@ -3505,6 +3507,58 @@ void rt6_sync_up(struct net_device *dev, unsigned int nh_flags)
 	fib6_clean_all(dev_net(dev), fib6_ifup, &arg);
 }
 
+static bool rt6_multipath_uses_dev(const struct rt6_info *rt,
+				   const struct net_device *dev)
+{
+	struct rt6_info *iter;
+
+	if (rt->dst.dev == dev)
+		return true;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+		if (iter->dst.dev == dev)
+			return true;
+
+	return false;
+}
+
+static void rt6_multipath_flush(struct rt6_info *rt)
+{
+	struct rt6_info *iter;
+
+	rt->should_flush = 1;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+		iter->should_flush = 1;
+}
+
+static unsigned int rt6_multipath_dead_count(const struct rt6_info *rt,
+					     const struct net_device *down_dev)
+{
+	struct rt6_info *iter;
+	unsigned int dead = 0;
+
+	if (rt->dst.dev == down_dev || rt->rt6i_nh_flags & RTNH_F_DEAD)
+		dead++;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+		if (iter->dst.dev == down_dev ||
+		    iter->rt6i_nh_flags & RTNH_F_DEAD)
+			dead++;
+
+	return dead;
+}
+
+static void rt6_multipath_nh_flags_set(struct rt6_info *rt,
+				       const struct net_device *dev,
+				       unsigned int nh_flags)
+{
+	struct rt6_info *iter;
+
+	if (rt->dst.dev == dev)
+		rt->rt6i_nh_flags |= nh_flags;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+		if (iter->dst.dev == dev)
+			iter->rt6i_nh_flags |= nh_flags;
+}
+
 /* called with write lock held for table with rt */
 static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 {
@@ -3512,20 +3566,33 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 	const struct net_device *dev = arg->dev;
 	const struct net *net = dev_net(dev);
 
-	if (rt->dst.dev != dev || rt == net->ipv6.ip6_null_entry)
+	if (rt == net->ipv6.ip6_null_entry)
 		return 0;
 
 	switch (arg->event) {
 	case NETDEV_UNREGISTER:
-		return -1;
+		return rt->dst.dev == dev ? -1 : 0;
 	case NETDEV_DOWN:
-		if (rt->rt6i_nsiblings == 0 ||
-		    !rt->rt6i_idev->cnf.ignore_routes_with_linkdown)
+		if (rt->should_flush)
 			return -1;
-		rt->rt6i_nh_flags |= RTNH_F_DEAD;
-		/* fall through */
+		if (!rt->rt6i_nsiblings)
+			return rt->dst.dev == dev ? -1 : 0;
+		if (rt6_multipath_uses_dev(rt, dev)) {
+			unsigned int count;
+
+			count = rt6_multipath_dead_count(rt, dev);
+			if (rt->rt6i_nsiblings + 1 == count) {
+				rt6_multipath_flush(rt);
+				return -1;
+			}
+			rt6_multipath_nh_flags_set(rt, dev, RTNH_F_DEAD |
+						   RTNH_F_LINKDOWN);
+			fib6_update_sernum(rt);
+		}
+		return -2;
 	case NETDEV_CHANGE:
-		if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST))
+		if (rt->dst.dev != dev ||
[PATCH net-next 14/18] ipv6: Take table lock outside of sernum update function
The next patch is going to allow dead routes to remain in the FIB tree
in certain situations.

When this happens we need to be sure to bump the sernum of the nodes
where these are stored, so that potential copies cached in sockets are
invalidated.

The function that performs this update assumes the table lock is not
taken when it is invoked, but that will not be the case when it is
invoked by the tree walker.

Have the function assume the lock is taken and make the single caller
take the lock itself.

Signed-off-by: Ido Schimmel
---
 net/ipv6/ip6_fib.c | 5 +----
 net/ipv6/route.c   | 2 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index c1bbe7bf9fdd..edda5ad3b405 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -107,16 +107,13 @@ enum {
 
 void fib6_update_sernum(struct rt6_info *rt)
 {
-	struct fib6_table *table = rt->rt6i_table;
 	struct net *net = dev_net(rt->dst.dev);
 	struct fib6_node *fn;
 
-	spin_lock_bh(&table->tb6_lock);
 	fn = rcu_dereference_protected(rt->rt6i_node,
-			lockdep_is_held(&table->tb6_lock));
+			lockdep_is_held(&rt->rt6i_table->tb6_lock));
 	if (fn)
 		fn->fn_sernum = fib6_new_sernum(net);
-	spin_unlock_bh(&table->tb6_lock);
 }
 
 /*
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f62d24948aa2..a3bfce71c861 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1353,7 +1353,9 @@ static int rt6_insert_exception(struct rt6_info *nrt,
 
 	/* Update fn->fn_sernum to invalidate all cached dst */
 	if (!err) {
+		spin_lock_bh(&ort->rt6i_table->tb6_lock);
 		fib6_update_sernum(ort);
+		spin_unlock_bh(&ort->rt6i_table->tb6_lock);
 		fib6_force_start_gc(net);
 	}
 
-- 
2.14.3