Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Greg KH
On Sun, Jan 07, 2018 at 09:23:47PM -0500, David Miller wrote:
> From: Thomas Gleixner 
> Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)
> 
> > I surely agree, but we have gone the way of PTI without the ability of
> > exempting individual processes exactly for one reason:
> > 
> >   Lack of time
> > 
> > It can be done on top of the PTI implementation and it won't take ages.
> > 
> > For spectre_v1/2 we face the same problem simply because we got informed so
> > much ahead of time and we were all twiddling thumbs, enjoying our christmas
> > vacation and having a good time.
> 
> I just want to point out that this should be noted in history as a
> case where all of this controlled disclosure stuff seems to have made
> things worse rather than better.

I will note that the "controlled disclosure" for this thing was a total
and complete mess, and unlike any that I have ever seen in the past.
The people involved in running it had no idea how to do it at all, and
because of that, it failed miserably, despite being warned about it
numerous times by numerous people.

> Why is there so much haste and paranoia if supposedly some group of
> people had all this extra time to think about and deal with this bug?

Because that group was so small and isolated that they did not actually
talk to anyone who could actually provide input to help deal with the
bug.

So we are stuck now with dealing with this "properly", which is fine,
but please don't think that this is an excuse to blame "controlled
disclosure".  We know how to do that correctly, it did not happen in
this case at all because of the people driving the problem refused to do
it.

> Think I'm nuts?  Ok, then how did we fare any better by keeping this
> junk under wraps for weeks if not months?  (seriously, did responsible
> people really know about this as far back as... June 2017?)

Some "people" did, just not some "responsible people" :)

Oh well, time for the kernel to fix hardware bugs again, that's what we
are here for, you would think we would be used to it by now...

greg k-h


Re: [PATCH net-next] l2tp: adjust comments about L2TPv3 offsets

2018-01-07 Thread James Chapman
On 05/01/18 18:47, Guillaume Nault wrote:
> The "offset" option has been removed by
> commit 900631ee6a26 ("l2tp: remove configurable payload offset").
>
> Signed-off-by: Guillaume Nault 
> ---
>  include/uapi/linux/l2tp.h | 2 +-
>  net/l2tp/l2tp_core.c  | 7 +++
>  2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/l2tp.h b/include/uapi/linux/l2tp.h
> index f78eef4cc56a..71e62795104d 100644
> --- a/include/uapi/linux/l2tp.h
> +++ b/include/uapi/linux/l2tp.h
> @@ -65,7 +65,7 @@ struct sockaddr_l2tpip6 {
>   * TUNNEL_MODIFY - CONN_ID, udpcsum
>   * TUNNEL_GETSTATS   - CONN_ID, (stats)
>   * TUNNEL_GET- CONN_ID, (...)
> - * SESSION_CREATE- SESSION_ID, PW_TYPE, offset, data_seq, cookie, 
> peer_cookie, offset, l2spec
> + * SESSION_CREATE- SESSION_ID, PW_TYPE, data_seq, cookie, peer_cookie, 
> l2spec
>   * SESSION_DELETE- SESSION_ID
>   * SESSION_MODIFY- SESSION_ID, data_seq
>   * SESSION_GET   - SESSION_ID, (...)
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 786cd7f6a5e8..62285fc6eb59 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -662,10 +662,9 @@ static int l2tp_recv_data_seq(struct l2tp_session 
> *session, struct sk_buff *skb)
>   * |x|S|x|x|x|x|x|x|  Sequence Number  |
>   * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>   *
> - * Cookie value, sublayer format and offset (pad) are negotiated with
> - * the peer when the session is set up. Unlike L2TPv2, we do not need
> - * to parse the packet header to determine if optional fields are
> - * present.
> + * Cookie value and sublayer format are negotiated with the peer when
> + * the session is set up. Unlike L2TPv2, we do not need to parse the
> + * packet header to determine if optional fields are present.
>   *
>   * Caller must already have parsed the frame and determined that it is
>   * a data (not control) frame before coming here. Fields up to the

Acked-by: James Chapman 




Re: [PATCH] atm/clip: Use seq_puts() in svc_addr()

2018-01-07 Thread SF Markus Elfring
>> @@ -708,11 +708,11 @@ static void svc_addr(struct seq_file *seq, struct 
>> sockaddr_atmsvc *addr)
>> static int e164[] = { 1, 8, 4, 6, 1, 0 };
>>
>> if (*addr->sas_addr.pub) {
>> -   seq_printf(seq, "%s", addr->sas_addr.pub);
>> +   seq_puts(seq, addr->sas_addr.pub);
> 
> Which opens a lot of security concerns.

How? The passed string is ultimately just copied into a buffer, isn't it?


> Never do this again.

Why do you not like such a small source code transformation at the moment?


> P.S. I'm wondering what would be first,

I am curious how such communication difficulties can be resolved.


> Markus starts looking into the actual code,

I inspected the original source code to some degree.
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/fs/seq_file.c?id=895c0dde398510a5b5ded60e5064c11b94bd30ca#n682
https://elixir.free-electrons.com/linux/v4.15-rc6/source/fs/seq_file.c#L660


> or most (all) of the maintainers just ban him.

Acceptance of such changes varies among the involved contributors,
for various reasons.

Regards,
Markus


[PATCH] docs-rst: networking: wire up msg_zerocopy

2018-01-07 Thread Mike Rapoport
Fix the following 'make htmldocs' complaint:

Documentation/networking/msg_zerocopy.rst:: WARNING: document isn't included in 
any toctree.

Signed-off-by: Mike Rapoport 
---
 Documentation/networking/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/index.rst 
b/Documentation/networking/index.rst
index 66e620866245..7d4b15977d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -9,6 +9,7 @@ Contents:
batman-adv
kapi
z8530book
+   msg_zerocopy
 
 .. only::  subproject
 
@@ -16,4 +17,3 @@ Contents:
===
 
* :ref:`genindex`
-
-- 
2.7.4



RE: WARNING: CPU: 0 PID: 0 at ./include/linux/netfilter.h:233 arp_rcv

2018-01-07 Thread Andy Duan
From: Marco Franchi  Sent: Friday, January 05, 2018 11:03 PM
>Hi,
>
>I am getting the following warning on a imx6ul-evk board running linux-next
>20180105:
>
>[9.233290] [ cut here ]
>[9.242068] WARNING: CPU: 0 PID: 0 at
>./include/linux/netfilter.h:233 arp_rcv+0x1f8/0x228
>[9.250381] Modules linked in:
>[9.253633] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
>4.15.0-rc6-next-20180104-dirty #2
>[9.261764] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
>[9.268065] Backtrace:
>[9.270719] [] (dump_backtrace) from []
>(show_stack+0x18/0x1c)
>[9.278438]  r7: r6:6153 r5: r4:c1079878
>[9.284258] [] (show_stack) from []
>(dump_stack+0xb4/0xe8)
>[9.291641] [] (dump_stack) from []
>(__warn+0xf0/0x11c)
>[9.298763]  r9:d8053000 r8:00e9 r7:0009 r6:c0da3bdc
>r5: r4:
>[9.306662] [] (__warn) from []
>(warn_slowpath_null+0x44/0x50)
>[9.314383]  r8:d8053000 r7:0608 r6:c08873f8 r5:00e9 r4:c0da3bdc
>[9.321243] [] (warn_slowpath_null) from []
>(arp_rcv+0x1f8/0x228)
>[9.329215]  r6:c0887200 r5:c107ac58 r4:d899b240
>[9.333999] [] (arp_rcv) from []
>(__netif_receive_skb_core+0x878/0xbd4)
>[9.342491]  r6:c0887200 r5:c100ac8c r4:d899b240
>[9.347265] [] (__netif_receive_skb_core) from
>[] (__netif_receive_skb+0x2c/0x8c)
>[9.356643]  r10:0080 r9:d8053000 r8:e0a26000 r7:c107ab0d
>r6:c1008908 r5:d899b240
>[9.364598]  r4:c10099c4
>[9.367288] [] (__netif_receive_skb) from []
>(netif_receive_skb_internal+0x7c/0x354)
>[9.376904]  r5:d899b240 r4:c10099c4
>[9.380635] [] (netif_receive_skb_internal) from
>[] (napi_gro_receive+0x88/0xa4)
>[9.389918]  r8:e0a26000 r7:0001 r6:d899b240 r5:d899b240 r4:0003
>[9.396784] [] (napi_gro_receive) from []
>(fec_enet_rx_napi+0x3a8/0x9b8)
>[9.405357]  r5:d8054000 r4:
>[9.409093] [] (fec_enet_rx_napi) from []
>(net_rx_action+0x220/0x334)
>[9.417431]  r10:dbbdfa00 r9:c1001d94 r8:0040 r7:012c
>r6:8e6b r5:0001
>[9.425384]  r4:d8053710
>[9.428075] [] (net_rx_action) from []
>(__do_softirq+0x128/0x2a0)
>[9.436063]  r10:4003 r9:c1003080 r8:0100 r7:c100308c
>r6:c100 r5:0003
>[9.444018]  r4:
>[9.446708] [] (__do_softirq) from []
>(irq_exit+0x14c/0x1a8)
>[9.454262]  r10:e080a000 r9:d8004400 r8:0001 r7:
>r6:c1008a98 r5:
>[9.462218]  r4:e000
>[9.464911] [] (irq_exit) from []
>(__handle_domain_irq+0x74/0xe8)
>[9.472879]  r5: r4:c0f7cc24
>[9.476614] [] (__handle_domain_irq) from []
>(gic_handle_irq+0x64/0xc4)
>[9.485121]  r9:c1028344 r8:c1001e98 r7: r6:03ff
>r5:03eb r4:e080a00c
>[9.493016] [] (gic_handle_irq) from []
>(__irq_svc+0x70/0x98)
>[9.500632] Exception stack(0xc1001e98 to 0xc1001ee0)
>[9.505822] 1e80:
>0001 0001
>[9.514157] 1ea0:  c100bf80 25963796 0002 0002
>dbbde5c8 26545294 0002
>[9.522491] 1ec0:  c1001f14 c1001eb8 c1001ee8 c01724fc
>c0724464 2153 
>[9.530824]  r10: r9:c100 r8:26545294 r7:c1001ecc
>r6: r5:2153
>[9.538778]  r4:c0724464
>[9.541470] [] (cpuidle_enter_state) from []
>(cpuidle_enter+0x1c/0x20)
>[9.549893]  r10:c100f6a8 r9:dbbde5c8 r8:c0f7c5c0 r7:c1008978
>r6:0001 r5:c100892c
>[9.557857]  r4:c100 r3:dbbde5c8
>[9.561589] [] (cpuidle_enter) from []
>(call_cpuidle+0x28/0x44)
>[9.569399] [] (call_cpuidle) from []
>(do_idle+0x1bc/0x230)
>[9.576861] [] (do_idle) from []
>(cpu_startup_entry+0x20/0x24)
>[9.584587]  r10:c0f63a50 r9:c1008908 r8:c107b480 r7:c1008900
>r6:c107b480 r5:0002
>[9.592546]  r4:00c4 r3:c0f75354
>[9.596283] [] (cpu_startup_entry) from []
>(rest_init+0x210/0x25c)
>[9.604372] [] (rest_init) from []
>(start_kernel+0x390/0x418)
>[9.611998]  r5: r4:c107b4cc
>[9.615723] [] (start_kernel) from [<>] (  (null))
>[9.622235]  r10:10c5387d r9:410fc075 r8:8300 r7:
>r6:10c0387d r5:0051
>[9.630191]  r4:c0f0032c
>[9.632851] ---[ end trace 2d5d5f79c0c8da59 ]---
>
>Does anyone know how to fix it?
>
>Thanks

Enabling the kernel config option "CONFIG_NETFILTER_FAMILY_ARP" can fix the warning.
It was introduced by the commit below.

commit 8de98f058360722a1a9febe3970de6dcd4d91513
Author: Florian Westphal 
Date:   Thu Dec 7 16:28:26 2017 +0100

netfilter: don't allocate space for arp/bridge hooks unless needed

no need to define hook points if the family isn't supported.
Because we need these hooks for either nftables, arp/ebtables
or the 'call-iptables' hack we have in the bridge layer add two
new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
users select them.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 

diff --git 

Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-07 Thread Serge E. Hallyn
On Mon, Jan 08, 2018 at 11:35:26AM +1100, James Morris wrote:
> On Tue, 2 Jan 2018, Mahesh Bandewar (महेश बंडेवार) wrote:
> 
> > On Sat, Dec 30, 2017 at 12:31 AM, James Morris
> >  wrote:
> > > On Wed, 27 Dec 2017, Mahesh Bandewar (महेश बंडेवार) wrote:
> > >
> > >> Hello James,
> > >>
> > >> Seems like I missed your name to be added into the review of this
> > >> patch series. Would you be willing be pull this into the security
> > >> tree? Serge Hallyn has already ACKed it.
> > >
> > > Sure!
> > >
> > Thank you James.
> 
> I'd like to see what Eric Biederman thinks of this.
> 
> Also, why do we need the concept of a controlled user-ns at all, if the 
> default whitelist maintains existing behavior?

In past discussions two uses have been brought up:

1. if a 0-day is discovered which is exacerbated by a specific
privilege in user namespaces, that privilege could be turned off until a
reboot with a fixed kernel is scheduled, without fully disabling all
containers.

2. some systems may be specifically designed to run software which
only requires a few capabilities in a userns.  In that case all others
could be disabled.


[PATCH] vhost: Remove the unused variable.

2018-01-07 Thread Tonghao Zhang
The patch (7235acdb1) changed the way work is flushed,
so the queued seq, done seq, and flushing members are
no longer used. Remove them now.

Fixes: 7235acdb1 ("vhost: simplify work flushing")
Cc: Jason Wang 
Signed-off-by: Tonghao Zhang 
---
 drivers/vhost/vhost.c | 1 -
 drivers/vhost/vhost.h | 4 
 2 files changed, 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 33ac2b186b85..9b04cad91d65 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -181,7 +181,6 @@ void vhost_work_init(struct vhost_work *work, 
vhost_work_fn_t fn)
 {
	clear_bit(VHOST_WORK_QUEUED, &work->flags);
	work->fn = fn;
-   init_waitqueue_head(&work->done);
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 79c6e7a60a5e..749fe13e061c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -20,10 +20,6 @@ typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 struct vhost_work {
struct llist_node node;
vhost_work_fn_t   fn;
-   wait_queue_head_t done;
-   int   flushing;
-   unsigned  queue_seq;
-   unsigned  done_seq;
unsigned long flags;
 };
 
-- 
2.13.6



Re: [patch iproute2 v6 0/3] tc: Add -bs option to batch mode

2018-01-07 Thread David Ahern
On 1/7/18 7:03 PM, Chris Mi wrote:

>> Did you measure the effect of increasing batch sizes?
> Yes. Even if we enlarge the batch size bigger than 10, there is no big 
> improvement.

That will change over time so the tc command should allow auto-batching
to work up to the message size limit.


> I think that's because current kernel doesn't process the requests in 
> parallel.
> If kernel processes the requests in parallel, I believe specifying a bigger 
> batch size
> will get a better result.
>>
>> I wonder whether specifying the batch size is necessary at all. Couldn't 
>> batch
>> mode just collect messages until either EOF or an incompatible command is
>> encountered which then triggers a commit to kernel? This might simplify
>> code quite a bit.
> That's a good suggestion.

Thanks for your time on this, Chris.


Re: [PATCH net-next V2 2/2] tun: allow to attach ebpf socket filter

2018-01-07 Thread Jason Wang



On 2018/01/06 00:21, Willem de Bruijn wrote:

On Fri, Jan 5, 2018 at 4:54 AM, Jason Wang  wrote:

This patch allows userspace to attach eBPF filter to tun. This will
allow implementing VM dataplane filtering in a more efficient way
compared to cBPF filter by allowing either qemu or libvirt to
attach eBPF filter to tun.

Signed-off-by: Jason Wang 
---
  drivers/net/tun.c   | 39 +++
  include/uapi/linux/if_tun.h |  1 +
  2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 0853829..9fc8b70 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -238,6 +238,12 @@ struct tun_struct {
 struct tun_pcpu_stats __percpu *pcpu_stats;
 struct bpf_prog __rcu *xdp_prog;
 struct tun_prog __rcu *steering_prog;
+   struct tun_prog __rcu *filter_prog;
+};
+
+struct veth {
+   __be16 h_vlan_proto;
+   __be16 h_vlan_TCI;
  };

  static int tun_napi_receive(struct napi_struct *napi, int budget)
@@ -984,12 +990,25 @@ static void tun_automq_xmit(struct tun_struct *tun, 
struct sk_buff *skb)
  #endif
  }

+static unsigned int run_ebpf_filter(struct tun_struct *tun,
+   struct sk_buff *skb,
+   int len)
+{
+   struct tun_prog *prog = rcu_dereference(tun->filter_prog);
+
+   if (prog)
+   len = bpf_prog_run_clear_cb(prog->prog, skb);
+
+   return len;
+}
+
  /* Net device start xmit */
  static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
  {
 struct tun_struct *tun = netdev_priv(dev);
 int txq = skb->queue_mapping;
 struct tun_file *tfile;
+   int len = skb->len;

 rcu_read_lock();
 tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1015,6 +1034,16 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, 
struct net_device *dev)
 sk_filter(tfile->socket.sk, skb))
 goto drop;

+   len = run_ebpf_filter(tun, skb, len);
+
+   /* Trim extra bytes since we may inster vlan proto & TCI

inster -> insert


Will fix.




+* in tun_put_user().
+*/
+   if (skb_vlan_tag_present(skb))
+   len -= skb_vlan_tag_present(skb) ? sizeof(struct veth) : 0;

no need for testing skb_vlan_tag_present twice.


Right.


more importantly, why trim these bytes unconditionally?

only if the filter trims a packet to a length shorter than the minimum
could this cause problems. sk_filter_trim_cap with a lower bound avoids
that: skb_vlan_tag_present(skb) ? sizeof(struct vlan_ethhdr) : 0;


The problem is, if the filter wants to trim the packet to 50 bytes, we may
get 54 bytes if a vlan tag is present. This seems wrong.


Thanks


Re: Subject: [RFC][PATCH 10/11] dwc-xlgmac: fix big-endian breakage

2018-01-07 Thread Jie Deng
On 2018/1/6 3:32, Al Viro wrote:

> Users of XLGMAC_SET_REG_BITS_LE() expect it to take le32 and return
> le32.
>
> Signed-off-by: Al Viro 
> ---
>  drivers/net/ethernet/synopsys/dwc-xlgmac.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac.h 
> b/drivers/net/ethernet/synopsys/dwc-xlgmac.h
> index cab3e40a86b9..e95c4c250e16 100644
> --- a/drivers/net/ethernet/synopsys/dwc-xlgmac.h
> +++ b/drivers/net/ethernet/synopsys/dwc-xlgmac.h
> @@ -106,7 +106,7 @@
>  #define XLGMAC_GET_REG_BITS_LE(var, pos, len) ({ \
>   typeof(pos) _pos = (pos);   \
>   typeof(len) _len = (len);   \
> - typeof(var) _var = le32_to_cpu((var));  \
> + u32 _var = le32_to_cpu((var));  \
>   ((_var) & GENMASK(_pos + _len - 1, _pos)) >> (_pos);\
>  })
>  
> @@ -125,8 +125,8 @@
>   typeof(len) _len = (len);   \
>   typeof(val) _val = (val);   \
>   _val = (_val << _pos) & GENMASK(_pos + _len - 1, _pos); \
> - _var = (_var & ~GENMASK(_pos + _len - 1, _pos)) | _val; \
> - cpu_to_le32(_var);  \
> + (_var & ~cpu_to_le32(GENMASK(_pos + _len - 1, _pos))) | \
> + cpu_to_le32(_val);  \
>  })
>  
>  struct xlgmac_pdata;
Thanks for your patch.
Acked-by: Jie Deng 


[PATCH bpf-next] bpf: introduce BPF_JIT_ALWAYS_ON config

2018-01-07 Thread Alexei Starovoitov
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

A quote from the google project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."

To make an attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64

The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden

Signed-off-by: Alexei Starovoitov 
---
 init/Kconfig   | 7 +++
 kernel/bpf/core.c  | 9 +
 kernel/bpf/verifier.c  | 4 
 net/core/sysctl_net_core.c | 9 +
 4 files changed, 29 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 2934249fba46..5e2a4a391ba9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1392,6 +1392,13 @@ config BPF_SYSCALL
  Enable the bpf() system call that allows to manipulate eBPF
  programs and maps via file descriptors.
 
+config BPF_JIT_ALWAYS_ON
+   bool "Permanently enable BPF JIT and remove BPF interpreter"
+   depends on BPF_SYSCALL && HAVE_EBPF_JIT && BPF_JIT
+   help
+ Enables BPF JIT and removes BPF interpreter to avoid
+ speculative execution of BPF instructions by the interpreter
+
 config USERFAULTFD
bool "Enable userfaultfd() system call"
select ANON_INODES
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 70a534549cd3..42756c434e0b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -781,6 +781,7 @@ noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 
r4, u64 r5)
 }
 EXPORT_SYMBOL_GPL(__bpf_call_base);
 
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
 /**
  * __bpf_prog_run - run eBPF program on a given context
  * @ctx: is the data we are operating on
@@ -1376,6 +1377,7 @@ void bpf_patch_call_args(struct bpf_insn *insn, u32 
stack_depth)
__bpf_call_base_args;
insn->code = BPF_JMP | BPF_CALL_ARGS;
 }
+#endif
 
 bool bpf_prog_array_compatible(struct bpf_array *array,
   const struct bpf_prog *fp)
@@ -1427,9 +1429,11 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
  */
 struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 {
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
 
fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
+#endif
 
/* eBPF JITs can rewrite the program in case constant
 * blinding is active. However, in case of error during
@@ -1453,6 +1457,11 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog 
*fp, int *err)
 */
*err = bpf_check_tail_call(fp);
 
+#ifdef CONFIG_BPF_JIT_ALWAYS_ON
+   if (!fp->jited)
+   *err = -ENOTSUPP;
+#endif
+
return fp;
 }
 EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a2b211262c25..ca80559c4ec3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5267,7 +5267,11 @@ static int fixup_call_args(struct bpf_verifier_env *env)
depth = get_callee_stack_depth(env, insn, i);
if (depth < 0)
return depth;
+#ifdef CONFIG_BPF_JIT_ALWAYS_ON
+   return -ENOTSUPP;
+#else
bpf_patch_call_args(insn, depth);
+#endif
}
return 0;
 }
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde4cfcc..1c8af0f4f385 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -325,7 +325,13 @@ static struct ctl_table net_core_table[] = {
.data   = &bpf_jit_enable,
.maxlen = sizeof(int),
.mode   = 0644,
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
.proc_handler   = proc_dointvec
+#else
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &one,
+   .extra2 = &one,
+#endif
},
 # ifdef CONFIG_HAVE_EBPF_JIT
{
@@ -524,6 +530,9 @@ static __net_initdata struct pernet_operations 
sysctl_core_ops = {
 
 static __init int sysctl_core_init(void)
 {
+#if 

Re: [net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"

2018-01-07 Thread Michael Ellerman
David Miller  writes:

> From: Michael Ellerman 
> Date: Fri, 22 Dec 2017 15:22:22 +1100
>
>>> On Tue, Dec 19 2017, Michael Ellerman  
>>> wrote:
 This revert seems to have broken networking on one of my powerpc
 machines, according to git bisect.

 The symptom is DHCP fails and I don't get a link, I didn't dig any
 further than that. I can if it's helpful.

 I think the problem is that 87c320e51519 ("net: core: dev_get_valid_name
 is now the same as dev_alloc_name_ns") only makes sense while
 d6f295e9def0 remains in the tree.
>>>
>>> I'm sorry about all of this, I really didn't think there would be such
>>> consequences of changing an errno return. Indeed, d6f29 was preparation
>>> for unifying the two functions that do the exact same thing (and how we
>>> ever got into that situation is somewhat unclear), except for
>>> their behaviour in the case the requested name already exists. So one of
>>> the two interfaces had to change its return value, and as I wrote, I
>>> thought EEXIST was the saner choice when an explicit name (no %d) had
>>> been requested.
>> 
>> No worries.
>> 
 ie. before the entire series, dev_get_valid_name() would return EEXIST,
 and that was retained when 87c320e51519 was merged, but now that
 d6f295e9def0 has been reverted dev_get_valid_name() is returning ENFILE.

 I can get the network up again if I also revert 87c320e51519 ("net:
 core: dev_get_valid_name is now the same as dev_alloc_name_ns"), or with
 the gross patch below.
>>>
>>> I don't think changing -ENFILE to -EEXIST would be right either, since
>>> dev_get_valid_name() used to be able to return both (-EEXIST in the case
>>> where there's no %d, -ENFILE in the case where we end up calling
>>> dev_alloc_name_ns()). If anything, we could do the check for the old
>>> -EEXIST condition first, and then call dev_alloc_name_ns(). But I'm also
>>> fine with reverting.
>> 
>> Yeah I think a revert would be best, given it's nearly rc5.
>> 
>> My userspace is not exotic AFAIK, just debian something, so presumably
>> this will affect other people too.
>
> I've just queued up the following revert, thanks!

Thanks.

I don't see it in rc7, will it get to Linus sometime before the release?

cheers


Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2018-01-07 Thread Alexei Starovoitov

On 12/29/17 12:20 AM, Masami Hiramatsu wrote:

Please run Josef's test in the !ftrace setup.

Yes, I'll add the result of the test case.


if Josef's test is passing in !ftrace config,
please resend your patches.
I think 2 and 3 were nice simplifications.
and patch 1 is good too if it passes the test.
Would be great to get them in for this merge window.

Thanks!



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:
> > which means that POC is relying 64-bit address speculation.
> > In the places coverity found the user supplied value is 32-bit,
> 
> People have 32bit computers as well as 64bit and in some cases 32bit is
> fine for an attack depending where your target is relative to the object.

right. absolutely correct on 32-bit archs.
The question is whether truncation to 32-bit is enough to work around
spectre1 on 64-bit archs.
I hope the researchers can clarify.

> lfence timing is also heavily dependent upon what work has to be done to
> retire previous live instructions. 
> BPF does not normally do a lot of writing so you'd expect the cost to be low.

right. to retire previous loads. Not sure what 'not a lot of writing'
has to do with lfence.

Our XDP based DDOS mostly does reads with few writes for stats into maps,
whereas XDP based load balancer modifies every packet.
XDP is root only, so not relevant in the spectre context. Just clarifying
read vs writes in BPF.



Re: [PATCH net-next 00/18] ipv6: Align nexthop behaviour with IPv4

2018-01-07 Thread David Miller
From: Ido Schimmel 
Date: Sun,  7 Jan 2018 12:45:00 +0200

> This set tries to eliminate some differences between IPv4's and IPv6's
> treatment of nexthops. These differences are most likely a side effect
> of IPv6's data structures (specifically 'rt6_info') that incorporate
> both the route and the nexthop and the late addition of ECMP support in
> commit 51ebd3181572 ("ipv6: add support of equal cost multipath
> (ECMP)").
 ...
> Finally, this series also serves as a good first step towards David
> Ahern's goal of treating nexthops as standalone objects [2], as it makes
> the code more in line with IPv4 where the nexthop and the nexthop group
> are separate objects from the route itself.
 ...

Looks great, series applied, thanks Ido!


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 12:15:40PM -0800, Dan Williams wrote:
> 
> I'm thinking we should provide the option to at least build the
> hot-path nospec_array_ptr() usages without an lfence.
> 
> CONFIG_SPECTRE1_PARANOIA_SAFE
> CONFIG_SPECTRE1_PARANOIA_PERF

SAFE vs PERF naming is problematic and misleading: since users don't
have the data to make a decision, they will be forced to go with SAFE.

What is not safe about the array_access() macro with AND?
How does the lfence approach make it safer?
Only because lfence was blessed by intel earlier when
they couldn't figure out a different way?

How about:
CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

> ...if only for easing performance testing and let the distribution set
> its policy.
> 
> Where hot-path usages can do:
> 
> nospec_relax(nospec_array_ptr())

The AND approach doesn't prevent speculation, hence nospec_ is an incorrect prefix.
Alan's "speculation management" terminology fits well here.

Can we keep array_access() name and change it underneath to
either mask or lfence ?
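
For illustration only, a minimal sketch of the AND/mask idea being
discussed (assuming the array length has been rounded up to a power of
two; this is not the actual kernel patch and the names are made up):

static inline unsigned long array_index_clamp(unsigned long idx,
					      unsigned long size_pow2)
{
	/* size_pow2 must be a power of two, so size_pow2 - 1 is a mask */
	return idx & (size_pow2 - 1);
}

unsigned int read_entry(const unsigned int *arr, unsigned long size_pow2,
			unsigned long idx)
{
	if (idx >= size_pow2)
		return 0;
	/* Even if the branch above is speculated past, the load below
	 * stays inside the array because of the mask. */
	return arr[array_index_clamp(idx, size_pow2)];
}

The point of the AND variant is that it bounds the speculated address
rather than stopping speculation, which is why the naming question
above matters.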



Re: pull-request: bpf-next 2018-01-07

2018-01-07 Thread David Miller
From: Daniel Borkmann 
Date: Mon,  8 Jan 2018 00:56:16 +0100

> The following pull-request contains BPF updates for your *net-next* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> 
> Thanks a lot & a happy new year!

Looks great, pulled, thanks a lot Daniel!


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread David Miller
From: Thomas Gleixner 
Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)

> I surely agree, but we have gone the way of PTI without the ability of
> exempting individual processes exactly for one reason:
> 
>   Lack of time
> 
> It can be done on top of the PTI implementation and it won't take ages.
> 
> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

I just want to point out that this should be noted in history as a
case where all of this controlled disclosure stuff seems to have made
things worse rather than better.

Why is there so much haste and paranoia if supposedly some group of
people had all this extra time to think about and deal with this bug?

From what I've seen, every single time, the worse a problem is, the
more important it is to expose it to as many smart folks as possible.
And to do so as fast as possible.

And to me that means full disclosure immediately for the super high
level stuff like what we are dealing with here.

Think I'm nuts?  Ok, then how did we fare any better by keeping this
junk under wraps for weeks if not months?  (seriously, did responsible
people really know about this as far back as... June 2017?)

Controlled disclosure for high profile bugs seems to only achieve two
things:

1) Vendors can cover their butts and come up with deflection
   strategies.

2) The "theatre" aspect of security can be maximized as much as
   possible.  We even have a pretty web site and cute avatars this
   time!

None of this has anything to do with having time to come up with the
best possible implementation of a fix.  You know, the technical part?

So after what appears to be as much as 6 months of deliberating the
very wise men in the special room said: "KPTI and lfence"

Do I get this right?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 11:08:24AM +0100, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> > which clearly states that bpf_tail_call() was used in the attack.
> > Yet none of the intel nor arm patches address speculation in
> > this bpf helper!
> > It means that:
> > - gpz didn't share neither exploit nor the detailed description
> >   of the POC with cpu vendors until now
> > - coverity rules used to find all these places in the kernel
> >   failed to find bpf_tail_call
> > - cpu vendors were speculating what variant 1 can actually do
> 
> You forgot to mention that there might be other attacks than the public POC
> which are not covered by a simple AND 

if the above statement is not pure speculation, please share the CVE number.
For variant[123] they've been reserved for months.

> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

right. they were informed in a way that they failed to address
variant1 with pre-embargo and public patches.

> The exploits are out in the wild and they are observed already, so we

this statement contradicts everything that was publicly stated.
Or you're referring to 'exploit' at the end of spectre paper?

> want to discuss the right way to fix it for the next 3 month and leave all
> doors open until the last bit of performance is squeezed out.

Let's look at facts:
- Linus explains his array_access() idea
- lfence proponents quickly point out that it needs gcc to be smart
  enough to emit setbe and go back to lfence patches
- I spent half an hour to tweak array_access() to be generic
  on all archs and compiler independent
- lfence proponents point out that AND with a variable somehow
  won't work, yet after further discussion it's actually fine due to
  the nature of variant1 attack that needs to train predictor with in-bounds
  access to mispredict with out-of-bounds speculative load
- then lfence proponents claim 'non public POC' not covered by AND
- it all in the matter of 2 days.
- and now the argument that it will take 3 months to discuss a solution
  without lfence, yet still no performance numbers for lfence
  though 'people in the know' were twiddling thumbs for months.

That's just not cool.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread David Miller
From: Thomas Gleixner 
Date: Sun, 7 Jan 2018 19:31:41 +0100 (CET)

> 2) Alexei's analyis is purely based on the public information of the google
>zero folks. If it would be complete and the only attack vector all fine.
> 
>If not and I doubt it is, we're going to regret this decision faster
>than we made it and this is not the kind of play field where we can
>afford that.

Please state this more clearly.

Do you know about other attack vectors and just are not allowed to
talk about them?

Or is this, ironically, speculation?


RE: [patch iproute2 v6 0/3] tc: Add -bs option to batch mode

2018-01-07 Thread Chris Mi
> -Original Message-
> From: n...@orbyte.nwl.cc [mailto:n...@orbyte.nwl.cc] On Behalf Of Phil
> Sutter
> Sent: Saturday, January 6, 2018 1:25 AM
> To: Chris Mi 
> Cc: netdev@vger.kernel.org; gerlitz...@gmail.com;
> step...@networkplumber.org; dsah...@gmail.com;
> marcelo.leit...@gmail.com
> Subject: Re: [patch iproute2 v6 0/3] tc: Add -bs option to batch mode
> 
> Hi Chris,
> 
> On Thu, Jan 04, 2018 at 04:34:51PM +0900, Chris Mi wrote:
> > Currently in tc batch mode, only one command is read from the batch
> > file and sent to kernel to process. With this patchset, we can
> > accumulate several commands before sending to kernel. The batch size
> > is specified using option -bs or -batchsize.
> >
> > To accumulate the commands in tc, client should allocate an array of
> > struct iovec. If batchsize is bigger than 1, only after the client has
> > accumulated enough commands, can the client call rtnl_talk_msg to send
> > the message that includes the iov array. One exception is that there
> > is no more command in the batch file.
> >
> > But please note that kernel still processes the requests one by one.
> > To process the requests in parallel in kernel is another effort.
> > The time we're saving in this patchset is the user mode and kernel
> > mode context switch. So this patchset works on top of the current kernel.
> >
> > Using the following script in kernel, we can generate 1,000,000 rules.
> > tools/testing/selftests/tc-testing/tdc_batch.py
> >
> > Without this patchset, 'tc -b $file' exection time is:
> >
> > real0m15.555s
> > user0m7.211s
> > sys 0m8.284s
> >
> > With this patchset, 'tc -b $file -bs 10' exection time is:
> >
> > real0m13.043s
> > user0m6.479s
> > sys 0m6.504s
> >
> > The insertion rate is improved more than 10%.
> 
> Did you measure the effect of increasing batch sizes?
Yes. Even if we enlarge the batch size bigger than 10, there is no big 
improvement.
I think that's because current kernel doesn't process the requests in parallel.
If kernel processes the requests in parallel, I believe specifying a bigger 
batch size
will get a better result.
> 
> I wonder whether specifying the batch size is necessary at all. Couldn't batch
> mode just collect messages until either EOF or an incompatible command is
> encountered which then triggers a commit to kernel? This might simplify
> code quite a bit.
That's a good suggestion.

-Chris
> 
> Cheers, Phil
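
A hedged sketch (hypothetical helper, not the iproute2 patch itself) of
the iovec accumulation described in the quoted cover letter: several
netlink requests are packed into one sendmsg() call, so the kernel still
handles each nlmsghdr sequentially but the user/kernel transitions are
amortized over the batch:

#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

#define BATCH_SIZE 10	/* matches the '-bs 10' example above */

static int send_batch(int nlsk, struct nlmsghdr *msgs[], int n)
{
	struct iovec iov[BATCH_SIZE];
	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
	struct msghdr mh = {
		.msg_name    = &nladdr,
		.msg_namelen = sizeof(nladdr),
		.msg_iov     = iov,
	};
	int i;

	if (n > BATCH_SIZE)
		n = BATCH_SIZE;
	for (i = 0; i < n; i++) {
		iov[i].iov_base = msgs[i];
		iov[i].iov_len  = msgs[i]->nlmsg_len;
	}
	mh.msg_iovlen = n;

	/* one syscall for the whole batch */
	return sendmsg(nlsk, &mh, 0);
}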


Re: Subject: [RFC][PATCH 05/11] stmmac: fix several stray endianness bugs

2018-01-07 Thread Florian Fainelli


On 01/05/2018 11:32 AM, Al Viro wrote:
> 
> A few places got missed by "net: ethernet: stmmac: change dma descriptors
> to __le32" (having been introduced just before the merge of that patch,
> AFAICS).  Fix them up...

I could not spot the commit id associated with that change, which one is it?
-- 
Florian


[PATCH v3 bpf] bpf: prevent out-of-bounds speculation

2018-01-07 Thread Alexei Starovoitov
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.

If prog_array map is created by unpriv user replace
  bpf_tail_call(ctx, map, index);
with
  if (index >= max_entries) {
index &= map->index_mask;
bpf_tail_call(ctx, map, index);
  }
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.

v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov 
Acked-by: John Fastabend 
---
 include/linux/bpf.h   |  2 ++
 kernel/bpf/arraymap.c | 47 ---
 kernel/bpf/verifier.c | 36 
 3 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e55e4255a210..1b985ca4ffbe 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -52,6 +52,7 @@ struct bpf_map {
u32 pages;
u32 id;
int numa_node;
+   bool unpriv_array;
struct user_struct *user;
const struct bpf_map_ops *ops;
struct work_struct work;
@@ -221,6 +222,7 @@ struct bpf_prog_aux {
 struct bpf_array {
struct bpf_map map;
u32 elem_size;
+   u32 index_mask;
/* 'ownership' of prog_array is claimed by the first program that
 * is going to use this map or by the first program which FD is stored
 * in the map to make sure that all callers and callees have the same
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7c25426d3cf5..aaa319848e7d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -53,9 +53,10 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 {
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
int numa_node = bpf_map_attr_numa_node(attr);
+   u32 elem_size, index_mask, max_entries;
+   bool unpriv = !capable(CAP_SYS_ADMIN);
struct bpf_array *array;
u64 array_size;
-   u32 elem_size;
 
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
@@ -72,11 +73,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 
elem_size = round_up(attr->value_size, 8);
 
+   max_entries = attr->max_entries;
+   index_mask = roundup_pow_of_two(max_entries) - 1;
+
+   if (unpriv)
+   /* round up array size to nearest power of 2,
+* since cpu will speculate within index_mask limits
+*/
+   max_entries = index_mask + 1;
+
array_size = sizeof(*array);
if (percpu)
-   array_size += (u64) attr->max_entries * sizeof(void *);
+   array_size += (u64) max_entries * sizeof(void *);
else
-   array_size += (u64) attr->max_entries * elem_size;
+   array_size += (u64) max_entries * elem_size;
 
/* make sure there is no u32 overflow later in round_up() */
if (array_size >= U32_MAX - PAGE_SIZE)
@@ -86,6 +96,8 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
array = bpf_map_area_alloc(array_size, numa_node);
if (!array)
return ERR_PTR(-ENOMEM);
+   array->index_mask = index_mask;
+   array->map.unpriv_array = unpriv;
 
/* copy mandatory map attributes */
array->map.map_type = attr->map_type;
@@ -121,12 +133,13 @@ static void *array_map_lookup_elem(struct bpf_map *map, 
void *key)
if (unlikely(index >= array->map.max_entries))
return NULL;
 
-   return array->value + array->elem_size * index;
+   return array->value + array->elem_size * (index & array->index_mask);
 }
 
 /* emit BPF instructions equivalent to C code of array_map_lookup_elem() */
 static u32 array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 {
+   struct 

Re: Subject: [RFC][PATCH 07/11] broadcom: trivial sparse annotations

2018-01-07 Thread Florian Fainelli


On 01/05/2018 11:32 AM, Al Viro wrote:
> 
> Signed-off-by: Al Viro 
> ---
>  drivers/net/ethernet/broadcom/bcmsysport.c | 2 +-
>  drivers/net/ethernet/broadcom/bgmac.h  | 6 +++---
>  drivers/net/ethernet/broadcom/bnx2.c   | 6 +++---
>  drivers/net/ethernet/broadcom/cnic_if.h| 8 
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 ++--
>  drivers/net/ethernet/broadcom/tg3.c| 2 +-
>  6 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
> b/drivers/net/ethernet/broadcom/bcmsysport.c
> index f15a8fc6dfc9..c2969b260aed 100644
> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
> @@ -1156,7 +1156,7 @@ static struct sk_buff *bcm_sysport_insert_tsb(struct 
> sk_buff *skb,
>   memset(tsb, 0, sizeof(*tsb));
>  
>   if (skb->ip_summed == CHECKSUM_PARTIAL) {
> - ip_ver = htons(skb->protocol);
> + ip_ver = ntohs(skb->protocol);

Al, can you make sure you CC driver maintainers for these changes? You
change bcmsysport.c and bcmgenet.c but you leave tg3.c with the same
type of construct in tg3_start_xmit(); any reason for that? I am also
curious about this conversion considering the following numbers:

grep 'skb->protocol == htons' {net/*,drivers/net/*} | wc -l
151
git grep 'skb->protocol == ntohs' {net/*,drivers/net/*} | wc -l
0
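
For what it's worth, a small standalone illustration (hypothetical
userspace snippet, not driver code) of why both forms behave the same on
a 16-bit value, so the change is about sparse type correctness (__be16
vs host order) rather than behaviour:

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint16_t proto_net = htons(0x0800);	/* like skb->protocol: network order */

	/* pattern counted above: keep the constant in network order */
	int m1 = (proto_net == htons(0x0800));
	/* pattern from the patch: convert the field to host order */
	int m2 = (ntohs(proto_net) == 0x0800);

	/* htons() and ntohs() do the same swap on 16-bit values, so both
	 * match; only the second is type-correct for a __be16 field. */
	printf("%d %d\n", m1, m2);
	return 0;
}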

>   switch (ip_ver) {
>   case ETH_P_IP:
>   ip_proto = ip_hdr(skb)->protocol;
> diff --git a/drivers/net/ethernet/broadcom/bgmac.h 
> b/drivers/net/ethernet/broadcom/bgmac.h
> index 4040d846da8e..40d02fec2747 100644
> --- a/drivers/net/ethernet/broadcom/bgmac.h
> +++ b/drivers/net/ethernet/broadcom/bgmac.h
> @@ -479,9 +479,9 @@ struct bgmac_rx_header {
>  struct bgmac {
>   union {
>   struct {
> - void *base;
> - void *idm_base;
> - void *nicpm_base;
> + void __iomem *base;
> + void __iomem *idm_base;
> + void __iomem *nicpm_base;
>   } plat;

This part of the patch is correct.

>   struct {
>   struct bcma_device *core;
> diff --git a/drivers/net/ethernet/broadcom/bnx2.c 
> b/drivers/net/ethernet/broadcom/bnx2.c
> index 7919f6112ecf..154866e8517a 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -8330,9 +8330,9 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device 
> *dev)
>   if (j < 32)
>   bp->fw_version[j++] = ' ';
>   for (i = 0; i < 3 && j < 28; i++) {
> - reg = bnx2_reg_rd_ind(bp, addr + i * 4);
> - reg = be32_to_cpu(reg);
> - memcpy(&bp->fw_version[j], &reg, 4);
> + __be32 v;
> + v = cpu_to_be32(bnx2_reg_rd_ind(bp, addr + i * 4));
> + memcpy(&bp->fw_version[j], &v, 4);
>   j += 4;
>   }
>   }
> diff --git a/drivers/net/ethernet/broadcom/cnic_if.h 
> b/drivers/net/ethernet/broadcom/cnic_if.h
> index 789e5c7e9311..8cfef07a8e3d 100644
> --- a/drivers/net/ethernet/broadcom/cnic_if.h
> +++ b/drivers/net/ethernet/broadcom/cnic_if.h
> @@ -260,10 +260,10 @@ struct cnic_sockaddr {
>  struct cnic_sock {
>   struct cnic_dev *dev;
>   void*context;
> - u32 src_ip[4];
> - u32 dst_ip[4];
> - u16 src_port;
> - u16 dst_port;
> + __be32  src_ip[4];
> + __be32  dst_ip[4];
> + __be16  src_port;
> + __be16  dst_port;
>   u16 vlan_id;
>   unsigned char old_ha[ETH_ALEN];
>   unsigned char ha[ETH_ALEN];
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index 24b4f4ceceef..77154f1479a9 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -1321,7 +1321,7 @@ static struct sk_buff *bcmgenet_free_tx_cb(struct 
> device *dev,
>   dma_unmap_addr_set(cb, dma_addr, 0);
>   }
>  
> - return 0;
> + return NULL;

And this part as well is correct.

>  }
>  
>  /* Simple helper to free a receive control block's resources */
> @@ -1480,7 +1480,7 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct 
> net_device *dev,
>   status = (struct status_64 *)skb->data;
>  
>   if (skb->ip_summed  == CHECKSUM_PARTIAL) {
> - ip_ver = htons(skb->protocol);
> + ip_ver = ntohs(skb->protocol);
>   switch (ip_ver) {
>   case ETH_P_IP:
>   ip_proto = ip_hdr(skb)->protocol;
> diff --git a/drivers/net/ethernet/broadcom/tg3.c 
> b/drivers/net/ethernet/broadcom/tg3.c
> index a77ee2f8fb8d..2bd77d9990f2 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ 

pull-request: bpf-next 2018-01-07

2018-01-07 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a start of a framework for extending struct xdp_buff without
   having the overhead of populating every data at runtime. Idea
   is to have a new per-queue struct xdp_rxq_info that holds read
   mostly data (currently that is, queue number and a pointer to
   the corresponding netdev) which is set up during rxqueue config
   time. When a XDP program is invoked, struct xdp_buff holds a
   pointer to struct xdp_rxq_info that the BPF program can then
   walk. The user facing BPF program that uses struct xdp_md for
   context can use these members directly, and the verifier rewrites
   context access transparently by walking the xdp_rxq_info and
   net_device pointers to load the data, from Jesper.

2) Redo the reporting of offload device information to user space
   such that it works in combination with network namespaces. The
   latter is reported through a device/inode tuple as similarly
   done in other subsystems as well (e.g. perf) in order to identify
   the namespace. For this to work, ns_get_path() has been generalized
   such that the namespace can be retrieved not only from a specific
   task (perf case), but also from a callback where we deduce the
   netns (ns_common) from a netdevice. bpftool support using the new
   uapi info and extensive test cases for test_offload.py in BPF
   selftests have been added as well, from Jakub.

3) Add two bpftool improvements: i) properly report the bpftool
   version such that it corresponds to the version from the kernel
   source tree. So pick the right linux/version.h from the source
   tree instead of the installed one. ii) fix bpftool and also
   bpf_jit_disasm build with binutils >= 2.9. The reason for the
   build breakage is that binutils library changed the function
   signature to select the disassembler. Given this is needed in
   multiple tools, add a proper feature detection to the
   tools/build/features infrastructure, from Roman.

4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the
   stacktrace map. It is currently unimplemented, but there are
   use cases where user space needs to walk all stacktrace map
   entries e.g. for dumping or deleting map entries w/o having to
   close and recreate the map. Add BPF selftests along with it,
   from Yonghong.

5) Few follow-up cleanups for the bpftool cgroup code: i) rename
   the cgroup 'list' command into 'show' as we have it for other
   subcommands as well, ii) then alias the 'show' command such that
   'list' is accepted which is also common practice in iproute2,
   and iii) remove couple of newlines from error messages using
   p_err(), from Jakub.

6) Two follow-up cleanups to sockmap code: i) remove the unused
   bpf_compute_data_end_sk_skb() function and ii) only build the
   sockmap infrastructure when CONFIG_INET is enabled since it's
   only aware of TCP sockets at this time, from John.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot & a happy new year!



The following changes since commit 6bb8824732f69de0f233ae6b1a8158e149627b38:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-12-29 
15:42:26 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 9be99badee761f0b2c065ecbd8bd54a96cbd0fa0:

  Merge branch 'bpf-stacktrace-map-next-key-support' (2018-01-06 23:52:23 +0100)


Alexei Starovoitov (1):
  Merge branch 'xdp_rxq_info'

Daniel Borkmann (3):
  Merge branch 'bpf-offload-report-dev'
  Merge branch 'bpf-bpftool-misc-fixes'
  Merge branch 'bpf-stacktrace-map-next-key-support'

Jakub Kicinski (12):
  bpf: offload: don't require rtnl for dev list manipulation
  bpf: offload: don't use prog->aux->offload as boolean
  bpf: offload: allow netdev to disappear while verifier is running
  bpf: offload: free prog->aux->offload when device disappears
  bpf: offload: free program id when device disappears
  nsfs: generalize ns_get_path() for path resolution with a task
  bpf: offload: report device information for offloaded programs
  tools: bpftool: report device information for offloaded programs
  selftests/bpf: test device info reporting for bound progs
  tools: bpftool: rename cgroup list -> show in the code
  tools: bpftool: alias show and list commands
  tools: bpftool: remove new lines from errors

Jesper Dangaard Brouer (14):
  xdp: base API for new XDP rx-queue info concept
  xdp/mlx5: setup xdp_rxq_info
  i40e: setup xdp_rxq_info
  ixgbe: setup xdp_rxq_info
  xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
  mlx4: setup xdp_rxq_info
  bnxt_en: 

[PATCH] ipv6: use ARRAY_SIZE for array sizing calculation on array seg6_action_table

2018-01-07 Thread Colin King
From: Colin Ian King 

Use the ARRAY_SIZE macro on array seg6_action_table to determine size of
the array. Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King 
---
 net/ipv6/seg6_local.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 825b8e01f947..ba3767ef5e93 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -501,7 +501,7 @@ static struct seg6_action_desc *__get_action_desc(int 
action)
struct seg6_action_desc *desc;
int i, count;
 
-   count = sizeof(seg6_action_table) / sizeof(struct seg6_action_desc);
+   count = ARRAY_SIZE(seg6_action_table);
for (i = 0; i < count; i++) {
desc = &seg6_action_table[i];
if (desc->action == action)
-- 
2.15.1



[PATCH] be2net: use ARRAY_SIZE for array sizing calculation on array cmd_priv_map

2018-01-07 Thread Colin King
From: Colin Ian King 

Use the ARRAY_SIZE macro on array cmd_priv_map to determine size of the
array.  Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/emulex/benet/be_cmds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c 
b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 02dd5246dfae..1a49297224ed 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -103,7 +103,7 @@ static struct be_cmd_priv_map cmd_priv_map[] = {
 static bool be_cmd_allowed(struct be_adapter *adapter, u8 opcode, u8 subsystem)
 {
int i;
-   int num_entries = sizeof(cmd_priv_map)/sizeof(struct be_cmd_priv_map);
+   int num_entries = ARRAY_SIZE(cmd_priv_map);
u32 cmd_privileges = adapter->cmd_privileges;
 
for (i = 0; i < num_entries; i++)
-- 
2.15.1



[PATCH] ixgbe: use ARRAY_SIZE for array sizing calculation on array buf

2018-01-07 Thread Colin King
From: Colin Ian King 

Use the ARRAY_SIZE macro on array buf to determine size of the array.
Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index cb7da5f9c4da..28c23ef2ec28 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -949,7 +949,7 @@ static s32 ixgbe_checksum_ptr_x550(struct ixgbe_hw *hw, u16 
ptr,
u16 length, bufsz, i, start;
u16 *local_buffer;
 
-   bufsz = sizeof(buf) / sizeof(buf[0]);
+   bufsz = ARRAY_SIZE(buf);
 
/* Read a chunk at the pointer location */
if (!buffer) {
-- 
2.15.1



Re: [PATCH] atm/clip: Use seq_puts() in svc_addr()

2018-01-07 Thread Andy Shevchenko
On Sat, Jan 6, 2018 at 11:44 PM, SF Markus Elfring
 wrote:
> From: Markus Elfring 
> Date: Sat, 6 Jan 2018 22:34:12 +0100
>
> Two strings should be quickly put into a sequence by two function calls.
> Thus use the function "seq_puts" instead of "seq_printf".
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring 
> ---
>  net/atm/clip.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/atm/clip.c b/net/atm/clip.c
> index d4f6029d5109..62a852165b19 100644
> --- a/net/atm/clip.c
> +++ b/net/atm/clip.c
> @@ -708,11 +708,11 @@ static void svc_addr(struct seq_file *seq, struct 
> sockaddr_atmsvc *addr)
> static int e164[] = { 1, 8, 4, 6, 1, 0 };
>
> if (*addr->sas_addr.pub) {
> -   seq_printf(seq, "%s", addr->sas_addr.pub);
> +   seq_puts(seq, addr->sas_addr.pub);

Which opens a lot of security concerns.
Never do this again.

> if (*addr->sas_addr.prv)
> seq_putc(seq, '+');
> } else if (!*addr->sas_addr.prv) {

> -   seq_printf(seq, "%s", "(none)");
> +   seq_puts(seq, "(none)");

...while this one is okay per se, better to keep above pattern (same
style over the piece of code / function).

> return;
> }
> if (*addr->sas_addr.prv) {
> --
> 2.15.1
>

P.S. I'm wondering what would be first, Markus starts looking into the
actual code, or most (all) of the maintainers just ban him.

-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

OK OK. At least we should have security by default and let people trade
it against performance if they want to and understand the risk. People
never know when they're secure enough (security can't be measured), but
they know fairly well when they're not performant enough to take action
(most often by changing the machine).

Willy


Re: [PATCH iproute2] ip fou: Pass family attribute as u8

2018-01-07 Thread Filip Moc
Hi Stephen,

> How is this binary compatiable with older versions.
I'm not sure what you mean. The kernel expects the family attribute to be
passed as u8 (and it always did). But the current ip passes the family
attribute as u16. It works now only on little endian systems because the
relevant byte happens to be at the same position. It is the current version
of ip which is incompatible.
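
A tiny illustration of the endianness point (hypothetical standalone
snippet, not the iproute2 code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	uint16_t family16 = 2;	/* AF_INET sent as a u16 attribute */
	uint8_t seen;

	/* the kernel reads only one byte of the attribute payload */
	memcpy(&seen, &family16, sizeof(seen));

	/* little endian: first byte is 0x02 (happens to be right)
	 * big endian:    first byte is 0x00 (wrong family) */
	printf("kernel sees family = %u\n", seen);
	return 0;
}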

Filip



Re: dvb usb issues since kernel 4.9

2018-01-07 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 11:54 AM, Mauro Carvalho Chehab
 wrote:
>
> Em Sat, 6 Jan 2018 16:04:16 +0100
> "Josef Griebichler"  escreveu:
>>
>> the causing commit has been identified.
>> After reverting commit 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882
>> its working again.
>
> Just replying to me won't magically fix this. The ones that were involved on
> this patch should also be c/c, plus USB people. Just added them.

Actually, you seem to have added an odd subset of the people involved.

For example, Ingo - who actually committed that patch - wasn't on the cc.

I do think we need to simply revert that patch. It's very simple: it
has been reported to lead to actual problems for people, and we don't
fix one problem and then say "well, it fixed something else" when
something breaks.

When something breaks, we either unbreak it, or we revert the change
that caused the breakage.

It's really that simple. That's what "no regressions" means.  We don't
accept changes that cause regressions. This one did.

  Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

I surely agree, but we have gone the way of PTI without the ability of
exempting individual processes exactly for one reason:

  Lack of time

It can be done on top of the PTI implementation and it won't take ages.

For spectre_v1/2 we face the same problem simply because we got informed so
much ahead of time and we were all twiddling thumbs, enjoying our christmas
vacation and having a good time.

The exploits are out in the wild and they are observed already, so we
really have to make a decision whether we want to fix that in the most
obvious ways even if it hurts performance right now and then take a break
from all that hell and sit down and sort the performance mess or whether we
want to discuss the right way to fix it for the next 3 months and leave all
doors open until the last bit of performance is squeezed out.

I totally can understand the anger of those who learned all of this a few
days ago and are now grasping at straws to avoid the obviously coming
performance hit, but it's not our fault that we have to make a decision
which cannot make everyone happy tomorrow.

If the general sentiment is that we have to fix the performance issue
first, then please let me know and I take 3 weeks of vacation right now.

Thanks,

tglx


Re: [PATCH v2 net-next] net: tracepoint: exposing sk_family in tracepoint inet_sock_set_state

2018-01-07 Thread Song Liu

> On Jan 6, 2018, at 10:31 PM, Yafang Shao  wrote:
> 
> As of now, there are two sk_family values traced with sock:inet_sock_set_state,
> which are AF_INET and AF_INET6.
> So the sk_family is exposed as well.
> Then we can conveniently use it to do the filtering.
> 
> Both sk_family and sk_protocol are shown in the printk message, so we need
> not expose them as tracepoint arguments.
> 
> Suggested-by: Brendan Gregg 
> Suggested-by: Song Liu 
> Signed-off-by: Yafang Shao 
> ---
> include/trace/events/sock.h | 15 +--
> 1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> index 3537c5f..3176a39 100644
> --- a/include/trace/events/sock.h
> +++ b/include/trace/events/sock.h
> @@ -11,7 +11,11 @@
> #include 
> #include 
> 
> -/* The protocol traced by sock_set_state */
> +#define family_names \
> + EM(AF_INET) \
> + EMe(AF_INET6)
> +
> +/* The protocol traced by inet_sock_set_state */
> #define inet_protocol_names   \
>   EM(IPPROTO_TCP) \
>   EM(IPPROTO_DCCP)\
> @@ -37,6 +41,7 @@
> #define EM(a)   TRACE_DEFINE_ENUM(a);
> #define EMe(a)  TRACE_DEFINE_ENUM(a);
> 
> +family_names
> inet_protocol_names
> tcp_state_names
> 
> @@ -45,6 +50,9 @@
> #define EM(a)   { a, #a },
> #define EMe(a)  { a, #a }
> 
> +#define show_family_name(val)\
> + __print_symbolic(val, family_names)
> +
> #define show_inet_protocol_name(val)\
>   __print_symbolic(val, inet_protocol_names)
> 
> @@ -118,6 +126,7 @@
>   __field(int, newstate)
>   __field(__u16, sport)
>   __field(__u16, dport)
> + __field(__u16, family)
>   __field(__u8, protocol)
>   __array(__u8, saddr, 4)
>   __array(__u8, daddr, 4)
> @@ -134,6 +143,7 @@
>   __entry->oldstate = oldstate;
>   __entry->newstate = newstate;
> 
> + __entry->family = sk->sk_family;
>   __entry->protocol = sk->sk_protocol;
>   __entry->sport = ntohs(inet->inet_sport);
>   __entry->dport = ntohs(inet->inet_dport);
> @@ -160,7 +170,8 @@
>   }
>   ),
> 
> - TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 
> saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
> + TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 
> daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
> + show_family_name(__entry->family),
>   show_inet_protocol_name(__entry->protocol),
>   __entry->sport, __entry->dport,
>   __entry->saddr, __entry->daddr,
> --
> 1.8.3.1
> 

Reviewed-by: Song Liu 



Re: [PATCH iproute2] ip fou: Pass family attribute as u8

2018-01-07 Thread Stephen Hemminger
On Sun, 7 Jan 2018 15:28:13 +0100
Filip Moc  wrote:

> This fixes fou on big-endian systems.
> 
> Signed-off-by: Filip Moc 
> ---
>  ip/ipfou.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/ip/ipfou.c b/ip/ipfou.c
> index febc2c8c..1f392ade 100644
> --- a/ip/ipfou.c
> +++ b/ip/ipfou.c
> @@ -52,7 +52,7 @@ static int fou_parse_opt(int argc, char **argv, struct 
> nlmsghdr *n,
>   __u8 ipproto, type;
>   bool gue_set = false;
>   int ipproto_set = 0;
> - unsigned short family = AF_INET;
> + __u8 family = AF_INET;
>  
>   while (argc > 0) {
>   if (!matches(*argv, "port")) {
> @@ -103,7 +103,7 @@ static int fou_parse_opt(int argc, char **argv, struct 
> nlmsghdr *n,
>  
>   addattr16(n, 1024, FOU_ATTR_PORT, port);
>   addattr8(n, 1024, FOU_ATTR_TYPE, type);
> - addattr16(n, 1024, FOU_ATTR_AF, family);
> + addattr8(n, 1024, FOU_ATTR_AF, family);
>  
>   if (ipproto_set)
>   addattr8(n, 1024, FOU_ATTR_IPPROTO, ipproto);

How is this binary compatible with older versions?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau  wrote:
>
> Linus, no need to explain that to me, I'm precisely trying to see how
> to disable PTI for a specific process because I face up to 45% loss in
> certain circumstances, making it a no-go. But while a few of us have
> very specific workloads emphasizing this impact, others have very
> different ones and will not notice. For example my laptop did boot
> pretty fine and I didn't notice anything until I fired a network
> benchmark.

Sure, most people have hardware where the bottleneck is entirely
elsewhere (slow network, rotating disk, whatever).

But this whole "normal people won't notice" is dangerous thinking.
They may well notice very much, we simply don't know what they are
doing.

Quite honestly, it's equally correct to say "normal people won't be
affected by the security issue in the first place".

That laptop that you didn't have any issues with? Likely it never had
an exploit running on it either!

So the whole "normal people" argument is pure and utter garbage. It's
wrong. It's pure shit when it comes to performance, but it's also pure
shit when it comes to the security issue.

Don't use it.

We need to fix the security problem, but we need to do it *without*
these braindead arguments that performance is somehow secondary.

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Dan Williams
On Sun, Jan 7, 2018 at 11:47 AM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>>
>> To be fair there's overreaction on both sides. The vast majority of
>> users need to get a 100% safe system and will never notice  any
>> difference.
>
> There is no such thing as a "100% safe system". Never will be - unless
> you make sure you have no users.
>
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.
>
> So this whole "security is so important that performance doesn't
> matter" mindset is pure and utter garbage.
>
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
>
> Performance matters. A *LOT*.

I'm thinking we should provide the option to at least build the
hot-path nospec_array_ptr() usages without an lfence.

CONFIG_SPECTRE1_PARANOIA_SAFE
CONFIG_SPECTRE1_PARANOIA_PERF

...if only for easing performance testing and letting the distribution set
its policy.

Where hot-path usages can do:

nospec_relax(nospec_array_ptr())

...to optionally elide the lfence.
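
A rough sketch of how that knob could look (purely illustrative; the config
symbols, the _relaxed spelling and the lfence-free helper below are
hypothetical and do not exist in the tree):

	#ifdef CONFIG_SPECTRE1_PARANOIA_SAFE
	/* Paranoid build: hot paths keep the speculation barrier. */
	#define nospec_array_ptr_relaxed(arr, idx, sz)	\
		nospec_array_ptr(arr, idx, sz)
	#else	/* CONFIG_SPECTRE1_PARANOIA_PERF */
	/* Perf build: hot paths rely on index/pointer masking alone,
	 * assuming an lfence-free variant of the helper exists.
	 */
	#define nospec_array_ptr_relaxed(arr, idx, sz)	\
		nospec_array_ptr_nobarrier(arr, idx, sz)
	#endif

A hot-path caller would then use the _relaxed form, and the distribution's
config choice decides whether the lfence is paid for.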


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote:
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
> 
> Performance matters. A *LOT*.

Linus, no need to explain that to me, I'm precisely trying to see how
to disable PTI for a specific process because I face up to 45% loss in
certain circumstances, making it a no-go. But while a few of us have
very specific workloads emphasizing this impact, others have very
different ones and will not notice. For example my laptop did boot
pretty fine and I didn't notice anything until I fire a network
benchmark.

Willy


[PATCH] hyperv/netvsc: Delete two error messages for a failed memory allocation in netvsc_init_buf()

2018-01-07 Thread SF Markus Elfring
From: Markus Elfring 
Date: Sun, 7 Jan 2018 21:03:26 +0100

Omit extra messages for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/hyperv/netvsc.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 17e529af79dc..c1ec02f801f6 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -275,9 +275,6 @@ static int netvsc_init_buf(struct hv_device *device,
 
net_device->recv_buf = vzalloc(buf_size);
if (!net_device->recv_buf) {
-   netdev_err(ndev,
-  "unable to allocate receive buffer of size %u\n",
-  buf_size);
ret = -ENOMEM;
goto cleanup;
}
@@ -357,8 +354,6 @@ static int netvsc_init_buf(struct hv_device *device,
 
net_device->send_buf = vzalloc(buf_size);
if (!net_device->send_buf) {
-   netdev_err(ndev, "unable to allocate send buffer of size %u\n",
-  buf_size);
ret = -ENOMEM;
goto cleanup;
}
-- 
2.15.1



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>
> To be fair there's overreaction on both sides. The vast majority of
> users need to get a 100% safe system and will never notice  any
> difference.

There is no such thing as a "100% safe system". Never will be - unless
you make sure you have no users.

Also, people definitely *are* noticing the performance issues with the
current set of patches, and they are causing real problems. Go search
for reports of Amazon AWS slowdowns.

So this whole "security is so important that performance doesn't
matter" mindset is pure and utter garbage.

And the whole "normal people won't even notice" is pure garbage too.
Don't spread that bullshit when you see actual normal people
complaining.

Performance matters. A *LOT*.

Linus


Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-07 Thread Dan Williams
On Sun, Jan 7, 2018 at 1:09 AM, Greg KH  wrote:
[..]
> Sorry for the confusion, no, I don't mean the "taint tracking", I mean
> the generic pattern of "speculative out of bounds access" that we are
> fixing here.
>
> Yes, as you mentioned before, there are tons of false-positives in the
> tree, as to find the real problems you have to show that userspace
> controls the access index.  But if we have a generic pattern that can
> rewrite that type of logic into one where it does not matter at all
> (i.e. like the ebpf proposed changes), then it would not be an issue if
> they are false or not, we just rewrite them all to be safe.
>
> We need to find some way not only to fix these issues now (like you are
> doing with this series), but to prevent them from ever coming back into
> the codebase again.  It's that second part that we need to keep in the
> back of our minds here, while doing the first portion of this work.

I understand the goal, but I'm not sure any of our current annotation
mechanisms are suitable. We have:

__attribute__((noderef, address_space(x)))

...for the '__user' annotation and other pointers that must not be
de-referenced without a specific accessor. We also have:

__attribute__((bitwise))

...for values that should not be consumed directly without a specific
conversion like endian swapping.

The problem is that we need to see if a value derived from a userspace
controlled input is used to trigger a chain of dependent reads. As far
as I can see the annotation would need to be guided by taint analysis
to be useful, at which point we can just "annotate" the problem spot
with nospec_array_ptr(). Otherwise it seems the scope of a
"__nospec_array_index" annotation would have a low signal to noise
ratio.

Stopping speculation past a uaccess_begin() boundary appears to handle
a wide swath of potential problems, and the rest likely needs taint
analysis, at least for now.

All that to say, yes, we need better tooling and infrastructure going forward.
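
For reference, the "annotate the problem spot" pattern boils down to
something like the following self-contained sketch (the masking helper is
open-coded here for illustration; the helper names in the series differ):

	#include <stddef.h>
	#include <stdio.h>

	static int table[16];

	/* Return idx when idx < size, 0 otherwise, computed without a
	 * conditional branch the CPU could speculate past (assumes two's
	 * complement and an arithmetic right shift).
	 */
	static size_t index_nospec(size_t idx, size_t size)
	{
		long tmp = (long)(idx | (size - 1 - idx));
		size_t mask = (size_t)(~tmp >> (sizeof(long) * 8 - 1));

		return idx & mask;
	}

	int lookup(size_t user_idx)
	{
		if (user_idx >= 16)
			return -1;
		/* The branch above can be speculated past, so the dependent
		 * load uses a sanitized index: out-of-range values collapse
		 * to element 0 instead of reaching arbitrary memory.
		 */
		return table[index_nospec(user_idx, 16)];
	}

	int main(void)
	{
		printf("%d\n", lookup(3));
		return 0;
	}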


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alan Cox
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it off with the "because *security*"
> argument.
> 

I'm not arguing otherwise - I'd just prefer most users' machines are
secure while we have the discussion and while we see what other
architectural tricks people can come up with.

Alan


[PATCH net-next v3 RESEND 10/10] net: qualcomm: rmnet: Add support for GSO

2018-01-07 Thread Subash Abhinov Kasiviswanathan
Real devices may support scatter-gather (SG), so enable SG on rmnet
devices to use GSO. GSO reduces CPU cycles by 20% at a rate of
146 Mbps for a single-stream TCP connection.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index f7f57ce..570a227 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -190,6 +190,7 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev,
 
rmnet_dev->hw_features = NETIF_F_RXCSUM;
rmnet_dev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+   rmnet_dev->hw_features |= NETIF_F_SG;
 
rc = register_netdevice(rmnet_dev);
if (!rc) {
-- 
1.9.1



[PATCH net-next v3 RESEND 08/10] net: qualcomm: rmnet: Handle command packets with checksum trailer

2018-01-07 Thread Subash Abhinov Kasiviswanathan
When using the MAPv4 packet format in conjunction with MAP commands,
a dummy DL checksum trailer will be appended to the packet. Before
this packet is sent out as an ACK, the DL checksum trailer needs to be
removed.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 51e6049..6bc328f 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -58,11 +58,24 @@ static u8 rmnet_map_do_flow_control(struct sk_buff *skb,
 }
 
 static void rmnet_map_send_ack(struct sk_buff *skb,
-  unsigned char type)
+  unsigned char type,
+  struct rmnet_port *port)
 {
struct rmnet_map_control_command *cmd;
int xmit_status;
 
+   if (port->data_format & RMNET_INGRESS_FORMAT_MAP_CKSUMV4) {
+   if (skb->len < sizeof(struct rmnet_map_header) +
+   RMNET_MAP_GET_LENGTH(skb) +
+   sizeof(struct rmnet_map_dl_csum_trailer)) {
+   kfree_skb(skb);
+   return;
+   }
+
+   skb_trim(skb, skb->len -
+sizeof(struct rmnet_map_dl_csum_trailer));
+   }
+
skb->protocol = htons(ETH_P_MAP);
 
cmd = RMNET_MAP_GET_CMD_START(skb);
@@ -100,5 +113,5 @@ void rmnet_map_command(struct sk_buff *skb, struct 
rmnet_port *port)
break;
}
if (rc == RMNET_MAP_COMMAND_ACK)
-   rmnet_map_send_ack(skb, rc);
+   rmnet_map_send_ack(skb, rc, port);
 }
-- 
1.9.1



[PATCH net-next v3 RESEND 05/10] net: qualcomm: rmnet: Set pacing shift

2018-01-07 Thread Subash Abhinov Kasiviswanathan
The real device over which the rmnet devices are installed also
aggregates multiple IP packets and sends them as a single large
aggregate frame to the hardware. This causes degraded throughput
for TCP TX due to bufferbloat.

To overcome this problem, a pacing shift value of 8 is set using the
sk_pacing_shift_update() helper. This value was determined based
on experiments with a single-stream TCP TX using iperf for a
duration of 30s.

Pacing shift | Observed data rate (Mbps)
          10 | 9
           9 | 140
           8 | 146 (Max link rate)

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 8e1f43a..8f8c4f2 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "rmnet_private.h"
 #include "rmnet_config.h"
 #include "rmnet_vnd.h"
@@ -204,6 +205,8 @@ void rmnet_egress_handler(struct sk_buff *skb)
struct rmnet_priv *priv;
u8 mux_id;
 
+   sk_pacing_shift_update(skb->sk, 8);
+
orig_dev = skb->dev;
priv = netdev_priv(orig_dev);
skb->dev = priv->real_dev;
-- 
1.9.1



[PATCH net-next v3 RESEND 09/10] net: qualcomm: rmnet: Add support for TX checksum offload

2018-01-07 Thread Subash Abhinov Kasiviswanathan
TX checksum offload applies to TCP / UDP packets which are not
fragmented, using the MAPv4 checksum trailer. The following needs to be
done to have the checksum computed in hardware -

1. Set the checksum start offset and insert offset.
2. Set the csum_enabled bit.
3. Compute and set the 1's complement of the partial checksum field in
   the transport header.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |   8 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|   2 +
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 120 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   1 +
 4 files changed, 131 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 3409458..601edec 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -141,11 +141,19 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
additional_header_len = 0;
required_headroom = sizeof(struct rmnet_map_header);
 
+   if (port->data_format & RMNET_EGRESS_FORMAT_MAP_CKSUMV4) {
+   additional_header_len = sizeof(struct rmnet_map_ul_csum_header);
+   required_headroom += additional_header_len;
+   }
+
if (skb_headroom(skb) < required_headroom) {
if (pskb_expand_head(skb, required_headroom, 0, GFP_KERNEL))
goto fail;
}
 
+   if (port->data_format & RMNET_EGRESS_FORMAT_MAP_CKSUMV4)
+   rmnet_map_checksum_uplink_packet(skb, orig_dev);
+
map_header = rmnet_map_add_map_header(skb, additional_header_len, 0);
if (!map_header)
goto fail;
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index ca9f473..6ce31e2 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -89,5 +89,7 @@ struct rmnet_map_header *rmnet_map_add_map_header(struct 
sk_buff *skb,
  int hdrlen, int pad);
 void rmnet_map_command(struct sk_buff *skb, struct rmnet_port *port);
 int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len);
+void rmnet_map_checksum_uplink_packet(struct sk_buff *skb,
+ struct net_device *orig_dev);
 
 #endif /* _RMNET_MAP_H_ */
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 881c1dc..c74a6c5 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -171,6 +171,86 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char 
protocol,
 }
 #endif
 
+static void rmnet_map_complement_ipv4_txporthdr_csum_field(void *iphdr)
+{
+   struct iphdr *ip4h = (struct iphdr *)iphdr;
+   void *txphdr;
+   u16 *csum;
+
+   txphdr = iphdr + ip4h->ihl * 4;
+
+   if (ip4h->protocol == IPPROTO_TCP || ip4h->protocol == IPPROTO_UDP) {
+   csum = (u16 *)rmnet_map_get_csum_field(ip4h->protocol, txphdr);
+   *csum = ~(*csum);
+   }
+}
+
+static void
+rmnet_map_ipv4_ul_csum_header(void *iphdr,
+ struct rmnet_map_ul_csum_header *ul_header,
+ struct sk_buff *skb)
+{
+   struct iphdr *ip4h = (struct iphdr *)iphdr;
+   __be16 *hdr = (__be16 *)ul_header, offset;
+
+   offset = htons((__force u16)(skb_transport_header(skb) -
+(unsigned char *)iphdr));
+   ul_header->csum_start_offset = offset;
+   ul_header->csum_insert_offset = skb->csum_offset;
+   ul_header->csum_enabled = 1;
+   if (ip4h->protocol == IPPROTO_UDP)
+   ul_header->udp_ip4_ind = 1;
+   else
+   ul_header->udp_ip4_ind = 0;
+
+   /* Changing remaining fields to network order */
+   hdr++;
+   *hdr = htons((__force u16)*hdr);
+
+   skb->ip_summed = CHECKSUM_NONE;
+
+   rmnet_map_complement_ipv4_txporthdr_csum_field(iphdr);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void rmnet_map_complement_ipv6_txporthdr_csum_field(void *ip6hdr)
+{
+   struct ipv6hdr *ip6h = (struct ipv6hdr *)ip6hdr;
+   void *txphdr;
+   u16 *csum;
+
+   txphdr = ip6hdr + sizeof(struct ipv6hdr);
+
+   if (ip6h->nexthdr == IPPROTO_TCP || ip6h->nexthdr == IPPROTO_UDP) {
+   csum = (u16 *)rmnet_map_get_csum_field(ip6h->nexthdr, txphdr);
+   *csum = ~(*csum);
+   }
+}
+
+static void
+rmnet_map_ipv6_ul_csum_header(void *ip6hdr,
+ struct rmnet_map_ul_csum_header *ul_header,
+ struct sk_buff *skb)
+{
+   __be16 *hdr = (__be16 *)ul_header, offset;
+
+   offset = htons((__force 

[PATCH net-next v3 RESEND 06/10] net: qualcomm: rmnet: Define the MAPv4 packet formats

2018-01-07 Thread Subash Abhinov Kasiviswanathan
The MAPv4 packet format adds support for RX / TX checksum offload.
For a bi-directional UDP stream at a rate of 570 / 146 Mbps, roughly
10% CPU cycles are saved.

For receive path, there is a checksum trailer appended to the end of
the MAP packet. The valid field indicates if hardware has computed
the checksum. csum_start_offset indicates the offset from the start
of the IP header from which hardware has computed checksum.
csum_length is the number of bytes over which the checksum was
computed and the resulting value is csum_value.

In the transmit path, a header is inserted between the end of the MAP
header and the start of the IP packet. csum_start_offset is the offset
in bytes from which hardware will compute the checksum if the
csum_enabled bit is set. udp_ip4_ind indicates if the checksum
value of 0 is valid or not. csum_insert_offset is the offset from the
csum_start_offset where hardware will insert the computed checksum.

The use of this additional packet format for checksum offload is
explained in subsequent patches.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 16 
 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index ef0eff2..50c50cd 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -47,6 +47,22 @@ struct rmnet_map_header {
u16 pkt_len;
 }  __aligned(1);
 
+struct rmnet_map_dl_csum_trailer {
+   u8  reserved1;
+   u8  valid:1;
+   u8  reserved2:7;
+   u16 csum_start_offset;
+   u16 csum_length;
+   __be16 csum_value;
+} __aligned(1);
+
+struct rmnet_map_ul_csum_header {
+   __be16 csum_start_offset;
+   u16 csum_insert_offset:14;
+   u16 udp_ip4_ind:1;
+   u16 csum_enabled:1;
+} __aligned(1);
+
 #define RMNET_MAP_GET_MUX_ID(Y) (((struct rmnet_map_header *) \
 (Y)->data)->mux_id)
 #define RMNET_MAP_GET_CD_BIT(Y) (((struct rmnet_map_header *) \
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
index d214280..de0143e 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
@@ -21,6 +21,8 @@
 /* Constants */
 #define RMNET_INGRESS_FORMAT_DEAGGREGATION  BIT(0)
 #define RMNET_INGRESS_FORMAT_MAP_COMMANDS   BIT(1)
+#define RMNET_INGRESS_FORMAT_MAP_CKSUMV4BIT(2)
+#define RMNET_EGRESS_FORMAT_MAP_CKSUMV4 BIT(3)
 
 /* Replace skb->dev to a virtual rmnet device and pass up the stack */
 #define RMNET_EPMODE_VND (1)
-- 
1.9.1



[PATCH net-next v3 RESEND 02/10] net: qualcomm: rmnet: Remove invalid condition while stamping mux id

2018-01-07 Thread Subash Abhinov Kasiviswanathan
rmnet devices cannot have a mux id of 255. This is validated when
assigning the mux id to the rmnet devices. As a result, checking for
mux id 255 does not apply in the egress path.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 0553932..b2d317e3 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -143,10 +143,7 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
if (!map_header)
goto fail;
 
-   if (mux_id == 0xff)
-   map_header->mux_id = 0;
-   else
-   map_header->mux_id = mux_id;
+   map_header->mux_id = mux_id;
 
skb->protocol = htons(ETH_P_MAP);
 
-- 
1.9.1



[PATCH net-next v3 RESEND 04/10] net: qualcomm: rmnet: Rename ingress data format to data format

2018-01-07 Thread Subash Abhinov Kasiviswanathan
This is done so that we can use this field for both ingress and
egress flags.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c   | 10 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h   |  2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c |  5 ++---
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index cedacdd..7e7704d 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -143,7 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
 struct nlattr *tb[], struct nlattr *data[],
 struct netlink_ext_ack *extack)
 {
-   int ingress_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
+   u32 data_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
struct net_device *real_dev;
int mode = RMNET_EPMODE_VND;
struct rmnet_endpoint *ep;
@@ -185,11 +185,11 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
struct ifla_vlan_flags *flags;
 
flags = nla_data(data[IFLA_VLAN_FLAGS]);
-   ingress_format = flags->flags & flags->mask;
+   data_format = flags->flags & flags->mask;
}
 
-   netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
-   port->ingress_data_format = ingress_format;
+   netdev_dbg(dev, "data format [0x%08X]\n", data_format);
+   port->data_format = data_format;
 
return 0;
 
@@ -353,7 +353,7 @@ static int rmnet_changelink(struct net_device *dev, struct 
nlattr *tb[],
struct ifla_vlan_flags *flags;
 
flags = nla_data(data[IFLA_VLAN_FLAGS]);
-   port->ingress_data_format = flags->flags & flags->mask;
+   port->data_format = flags->flags & flags->mask;
}
 
return 0;
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index 2ea9fe3..00e4634 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -32,7 +32,7 @@ struct rmnet_endpoint {
  */
 struct rmnet_port {
struct net_device *dev;
-   u32 ingress_data_format;
+   u32 data_format;
u8 nr_rmnet_devs;
u8 rmnet_mode;
struct hlist_head muxed_ep[RMNET_MAX_LOGICAL_EP];
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index b2d317e3..8e1f43a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -69,8 +69,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
u16 len;
 
if (RMNET_MAP_GET_CD_BIT(skb)) {
-   if (port->ingress_data_format
-   & RMNET_INGRESS_FORMAT_MAP_COMMANDS)
+   if (port->data_format & RMNET_INGRESS_FORMAT_MAP_COMMANDS)
return rmnet_map_command(skb, port);
 
goto free_skb;
@@ -114,7 +113,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
skb_push(skb, ETH_HLEN);
}
 
-   if (port->ingress_data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
+   if (port->data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
while ((skbn = rmnet_map_deaggregate(skb)) != NULL)
__rmnet_map_ingress_handler(skbn, port);
 
-- 
1.9.1



[PATCH net-next v3 RESEND 07/10] net: qualcomm: rmnet: Add support for RX checksum offload

2018-01-07 Thread Subash Abhinov Kasiviswanathan
When using the MAPv4 packet format, receive checksum offload can be
enabled in hardware. The checksum computation over the pseudo header is
not offloaded, but the rest of the checksum computation over
the payload is offloaded. This applies only to TCP / UDP packets
which are not fragmented.

rmnet validates the TCP/UDP checksum for the packet using the checksum
from the checksum trailer added to the packet by hardware. The
validation performed is as follows -

1. Perform 1's complement over the checksum value from the trailer.
2. Compute the 1's complement checksum over the IPv4 / IPv6 header and
   subtract it from the value from step 1.
3. Compute the 1's complement checksum over the IPv4 / IPv6 pseudo header
   and add it to the value from step 2.
4. Subtract the checksum value from the TCP / UDP header from the
   value from step 3.
5. Compare the value from step 4 to the checksum value from the
   TCP / UDP header.
6. If the comparison in step 5 succeeds, CHECKSUM_UNNECESSARY is set
   and the packet is passed on to the network stack. If there is a
   failure, then the packet is passed on as is without modifying
   the ip_summed field.

The checksum field is also checked for UDP checksum 0 as per RFC 768
and for unexpected TCP checksum of 0.

If checksum offload is disabled when using the MAPv4 packet format in
the receive path, the packet is queued as is to the network stack without
the validations above.
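
A rough sketch of the six validation steps above (illustrative only; the
helper names are made up and this is not the code in
rmnet_map_checksum_downlink_packet()):

	#include <linux/types.h>

	static u16 csum16_add(u16 a, u16 b)
	{
		u32 sum = (u32)a + b;

		return (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	}

	static u16 csum16_sub(u16 a, u16 b)
	{
		return csum16_add(a, (u16)~b);
	}

	static bool dl_csum_valid(u16 trailer_csum,	/* csum_value from trailer */
				  u16 ip_hdr_csum,	/* 1's comp. sum of IP header */
				  u16 pseudo_hdr_csum,	/* 1's comp. pseudo header sum */
				  u16 transport_csum)	/* csum field in TCP/UDP header */
	{
		u16 v;

		v = (u16)~trailer_csum;			/* step 1 */
		v = csum16_sub(v, ip_hdr_csum);		/* step 2 */
		v = csum16_add(v, pseudo_hdr_csum);	/* step 3 */
		v = csum16_sub(v, transport_csum);	/* step 4 */

		return v == transport_csum;		/* steps 5 and 6 */
	}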

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |  15 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|   4 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 186 -
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   2 +
 4 files changed, 201 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 8f8c4f2..3409458 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -66,8 +66,8 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
struct rmnet_port *port)
 {
struct rmnet_endpoint *ep;
+   u16 len, pad;
u8 mux_id;
-   u16 len;
 
if (RMNET_MAP_GET_CD_BIT(skb)) {
if (port->data_format & RMNET_INGRESS_FORMAT_MAP_COMMANDS)
@@ -77,7 +77,8 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
}
 
mux_id = RMNET_MAP_GET_MUX_ID(skb);
-   len = RMNET_MAP_GET_LENGTH(skb) - RMNET_MAP_GET_PAD(skb);
+   pad = RMNET_MAP_GET_PAD(skb);
+   len = RMNET_MAP_GET_LENGTH(skb) - pad;
 
if (mux_id >= RMNET_MAX_LOGICAL_EP)
goto free_skb;
@@ -90,8 +91,14 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
 
/* Subtract MAP header */
skb_pull(skb, sizeof(struct rmnet_map_header));
-   skb_trim(skb, len);
rmnet_set_skb_proto(skb);
+
+   if (port->data_format & RMNET_INGRESS_FORMAT_MAP_CKSUMV4) {
+   if (!rmnet_map_checksum_downlink_packet(skb, len + pad))
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
+   }
+
+   skb_trim(skb, len);
rmnet_deliver_skb(skb);
return;
 
@@ -115,7 +122,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
}
 
if (port->data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
-   while ((skbn = rmnet_map_deaggregate(skb)) != NULL)
+   while ((skbn = rmnet_map_deaggregate(skb, port)) != NULL)
__rmnet_map_ingress_handler(skbn, port);
 
consume_skb(skb);
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 50c50cd..ca9f473 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -83,9 +83,11 @@ struct rmnet_map_ul_csum_header {
 #define RMNET_MAP_NO_PAD_BYTES0
 #define RMNET_MAP_ADD_PAD_BYTES   1
 
-struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb);
+struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
+ struct rmnet_port *port);
 struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb,
  int hdrlen, int pad);
 void rmnet_map_command(struct sk_buff *skb, struct rmnet_port *port);
+int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len);
 
 #endif /* _RMNET_MAP_H_ */
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 978ce26..881c1dc 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -14,6 +14,9 @@
  */
 
 #include 
+#include 
+#include 
+#include 
 #include "rmnet_config.h"
 #include "rmnet_map.h"
 #include "rmnet_private.h"
@@ -21,6 

[PATCH net-next v3 RESEND 03/10] net: qualcomm: rmnet: Remove unused function declaration

2018-01-07 Thread Subash Abhinov Kasiviswanathan
rmnet_map_demultiplex() is only declared but not defined anywhere,
so remove it.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 4df359d..ef0eff2 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -67,7 +67,6 @@ struct rmnet_map_header {
 #define RMNET_MAP_NO_PAD_BYTES0
 #define RMNET_MAP_ADD_PAD_BYTES   1
 
-u8 rmnet_map_demultiplex(struct sk_buff *skb);
 struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb);
 struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb,
  int hdrlen, int pad);
-- 
1.9.1



[PATCH net-next v3 RESEND 00/10] net: qualcomm: rmnet: Enable csum offloads

2018-01-07 Thread Subash Abhinov Kasiviswanathan
This series introduces the MAPv4 packet format for checksum
offload plus some other minor changes.

Patches 1-3 are cleanups.

Patch 4 renames the ingress format to data format so that all data
formats can be configured using this going forward.

Patch 5 uses the pacing helper to improve TCP transmit performance.

Patches 6-9 define the MAPv4 format for checksum offload for RX and TX.
A new header and trailer format are used as part of MAPv4.
For RX checksum offload, only the 1's complement of the IP payload
portion is computed by hardware. The metadata from the RX header is
used to verify the checksum field in the packet. Note that the
IP packet and its fields themselves are not modified by hardware.
This gives metadata to help with the RX checksum. For TX, the
required metadata is filled in so hardware can compute the
checksum.

Patch 10 enables GSO on rmnet devices

v1->v2: Fix sparse errors reported by kbuild test robot

v2->v3: Update the commit message for Patch 5 based on Eric's comments

Subash Abhinov Kasiviswanathan (10):
  net: qualcomm: rmnet: Remove redundant check when stamping map header
  net: qualcomm: rmnet: Remove invalid condition while stamping mux id
  net: qualcomm: rmnet: Remove unused function declaration
  net: qualcomm: rmnet: Rename ingress data format to data format
  net: qualcomm: rmnet: Set pacing shift
  net: qualcomm: rmnet: Define the MAPv4 packet formats
  net: qualcomm: rmnet: Add support for RX checksum offload
  net: qualcomm: rmnet: Handle command packets with checksum trailer
  net: qualcomm: rmnet: Add support for TX checksum offload
  net: qualcomm: rmnet: Add support for GSO

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c |  10 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |   2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |  36 ++-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  23 +-
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c|  17 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 309 -
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|   2 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   4 +
 8 files changed, 378 insertions(+), 25 deletions(-)

-- 
1.9.1



[PATCH net-next v3 RESEND 01/10] net: qualcomm: rmnet: Remove redundant check when stamping map header

2018-01-07 Thread Subash Abhinov Kasiviswanathan
We already check the headroom once in rmnet_map_egress_handler(),
so this is not needed.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 86b8c75..978ce26 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -32,9 +32,6 @@ struct rmnet_map_header *rmnet_map_add_map_header(struct 
sk_buff *skb,
u32 padding, map_datalen;
u8 *padbytes;
 
-   if (skb_headroom(skb) < sizeof(struct rmnet_map_header))
-   return NULL;
-
map_datalen = skb->len - hdrlen;
map_header = (struct rmnet_map_header *)
skb_push(skb, sizeof(struct rmnet_map_header));
-- 
1.9.1



Re: [iproute2 1/1] tc: Fix filter protocol output

2018-01-07 Thread Jiri Pirko
Sun, Jan 07, 2018 at 03:43:28PM CET, j...@mojatatu.com wrote:
>From: Jamal Hadi Salim 
>
>Fixes: 249284ff5a44 ("tc: jsonify filter core")
>Signed-off-by: Jamal Hadi Salim 
>---
> tc/tc_filter.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/tc/tc_filter.c b/tc/tc_filter.c
>index 545cc3a..546311a 100644
>--- a/tc/tc_filter.c
>+++ b/tc/tc_filter.c
>@@ -276,7 +276,7 @@ int print_filter(const struct sockaddr_nl *who, struct 
>nlmsghdr *n, void *arg)
>   if (!filter_protocol || filter_protocol != f_proto) {
>   if (f_proto) {
>   SPRINT_BUF(b1);
>-  print_string(PRINT_JSON, "protocol",
>+  print_string(PRINT_ANY, "protocol",

ha! Thanks!

Acked-by: Jiri Pirko 


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, James Bottomley wrote:

> On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> > From: Willy Tarreau 
> > Date: Sat, 6 Jan 2018 21:42:29 +0100
> > 
> > > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> > >> Normally people who propose security fixes don't have to argue
> > about the
> > >> fact they added 30 clocks to avoid your box being 0wned.
> > > 
> > > In fact it depends, because if a fix makes the system unusable for
> > its
> > > initial purpose, this fix will simply not be deployed at all, which
> > is
> > > the worst that can happen.
> > 
> > +1
> > 
> > I completely agree with Willy and Alexei.
> > 
> > And the scale isn't even accurate, we're talking about at least
> > hundreds upon hundreds of clocks, not 30, if we add an operation
> > whose side effect is to wait for all pending loads to complete.  So
> > yeah this is going to be heavily scrutinized.
> 
> Plus this is the standard kernel code review MO: we've never blindly
> accepted code just because *security* (otherwise we'd have grsec in by
> now).  We use the pushback to get better and more performant code.
>  What often happens is it turns out that the "either security or
> performance" position was a false dichotomy and there is a way of
> fixing stuff that's acceptable (although not usually perfect) for
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it of with the "because *security*"
> argument.

In principle I agree, though there are a few things to consider:

1) We don't have the time to discuss that for the next 6 months

2) Alexei's analysis is purely based on the public information of the google
   zero folks. If it were complete and the only attack vector, all fine.

   If not, and I doubt it is, we're going to regret this decision faster
   than we made it, and this is not the kind of playing field where we can
   afford that.

The whole 'we know better and performance is key' attitude is what led to
this disaster in the first place. We should finally start to learn.

Can we please stop that and live with the extra cycles for a few months up
to the point where we get more informed answers to all these questions?

Thanks,

tglx


Re: [PATCH net] ipv6: remove null_entry before adding default route

2018-01-07 Thread Wei Wang
On Sat, Jan 6, 2018 at 10:16 PM, Martin KaFai Lau  wrote:
> On Sat, Jan 06, 2018 at 05:41:28PM -0800, Wei Wang wrote:
>> On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau  wrote:
>> > On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
>> >> From: Wei Wang 
>> >>
>> >> In the current code, when creating a new fib6 table, tb6_root.leaf gets
>> >> initialized to net->ipv6.ip6_null_entry.
>> >> If a default route is being added with rt->rt6i_metric = 0x,
>> >> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
>> >> null_entry is shared, it could cause problem.
>> >>
>> >> In order to fix it, set fn->leaf to NULL before calling
>> >> fib6_add_rt2node() when trying to add the first default route.
>> >> And reset fn->leaf to null_entry when adding fails or when deleting the
>> >> last default route.
>> >>
>> >> syzkaller reported the following issue which is fixed by this commit:
>> >> =
>> >> WARNING: suspicious RCU usage
>> >> 4.15.0-rc5+ #171 Not tainted
>> >> -
>> >> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
>> >>
>> >> other info that might help us debug this:
>> >>
>> >> rcu_scheduler_active = 2, debug_locks = 1
>> >> 4 locks held by swapper/0/0:
>> >>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [] 
>> >> lockdep_copy_map include/linux/lockdep.h:178 [inline]
>> >>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [] 
>> >> call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>> >>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
>> >> spin_lock_bh include/linux/spinlock.h:315 [inline]
>> >>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
>> >> fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>> >>  #2:  (rcu_read_lock){}, at: [<91db762d>] 
>> >> __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>> >>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
>> >> spin_lock_bh include/linux/spinlock.h:315 [inline]
>> >>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
>> >> __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
>> >>
>> >> stack backtrace:
>> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
>> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> >> Google 01/01/2011
>> >> Call Trace:
>> >>  
>> >>  __dump_stack lib/dump_stack.c:17 [inline]
>> >>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>> >>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>> >>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>> >>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>> >>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>> >>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>> >>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>> >>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>> >>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>> >>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>> >>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>> >>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>> >>  expire_timers kernel/time/timer.c:1357 [inline]
>> >>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>> >>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>> >>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>> >>  invoke_softirq kernel/softirq.c:365 [inline]
>> >>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>> >>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>> >>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>> >>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>> >>  
>> >>
>> >> Reported-by: syzbot 
>> >> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in 
>> >> fib6_table")
>> >> Signed-off-by: Wei Wang 
>> >> ---
>> >>  net/ipv6/ip6_fib.c | 45 +++--
>> >>  1 file changed, 35 insertions(+), 10 deletions(-)
>> >>
>> >> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> >> index d11a5578e4f8..37cb4ad1ea29 100644
>> >> --- a/net/ipv6/ip6_fib.c
>> >> +++ b/net/ipv6/ip6_fib.c
>> >> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
>> >>   if (!(fn->fn_flags & RTN_RTINFO)) {
>> >>   RCU_INIT_POINTER(fn->leaf, NULL);
>> >>   rt6_release(leaf);
>> >> + /* remove null_entry in the root node */
>> >> + } else if (fn->fn_flags & RTN_TL_ROOT &&
>> >> +rcu_access_pointer(fn->leaf) ==
>> >> +net->ipv6.ip6_null_entry) {
>> >> + RCU_INIT_POINTER(fn->leaf, NULL);
>> > It seems the reader side could see tb6_root.leaf == NULL after
>> > this change and I think it should be fine?  If it is, instead
>> > of 

Re: [PATCH net-next 00/18] ipv6: Align nexthop behaviour with IPv4

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> This set tries to eliminate some differences between IPv4's and IPv6's
> treatment of nexthops. These differences are most likely a side effect
> of IPv6's data structures (specifically 'rt6_info') that incorporate
> both the route and the nexthop and the late addition of ECMP support in
> commit 51ebd3181572 ("ipv6: add support of equal cost multipath
> (ECMP)").
> 
> IPv4 and IPv6 do not react the same to certain netdev events. For
> example, upon carrier change affected IPv4 nexthops are marked using the
> RTNH_F_LINKDOWN flag and the nexthop group is rebalanced accordingly.
> IPv6 on the other hand, does nothing which forces us to perform a
> carrier check during route lookup and dump. This makes it difficult to
> introduce features such as non-equal-cost multipath that are built on
> top of this set [1].
> 
> In addition, when a netdev is put administratively down IPv4 nexthops
> are marked using the RTNH_F_DEAD flag, whereas IPv6 simply flushes all
> the routes using these nexthops. To be consistent with IPv4, multipath
> routes should only be flushed when all nexthops in the group are
> considered dead.
> 
> The first 12 patches introduce non-functional changes that store the
> RTNH_F_DEAD and RTNH_F_LINKDOWN flags in IPv6 routes based on netdev
> events, in a similar fashion to IPv4. This allows us to remove the
> carrier check performed during route lookup and dump.
> 
> The next three patches make sure we only flush a multipath route when
> all of its nexthops are dead.
> 
> Last three patches add test cases for IPv4/IPv6 FIB. These verify that
> both address families react similarly to netdev events.
> 
> Finally, this series also serves as a good first step towards David
> Ahern's goal of treating nexthops as standalone objects [2], as it makes
> the code more in line with IPv4 where the nexthop and the nexthop group
> are separate objects from the route itself.
> 
> 1. https://github.com/idosch/linux/tree/ipv6-nexthops
> 2. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf
> 

Thanks for working on this - and creating the test cases.

One of many follow-on changes that would be beneficial is to remove the
idev dereference in the hot path to check the
ignore_routes_with_linkdown sysctl.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread James Bottomley
On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> From: Willy Tarreau 
> Date: Sat, 6 Jan 2018 21:42:29 +0100
> 
> > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> >> Normally people who propose security fixes don't have to argue
> about the
> >> fact they added 30 clocks to avoid your box being 0wned.
> > 
> > In fact it depends, because if a fix makes the system unusable for
> its
> > initial purpose, this fix will simply not be deployed at all, which
> is
> > the worst that can happen.
> 
> +1
> 
> I completely agree with Willy and Alexei.
> 
> And the scale isn't even accurate, we're talking about at least
> hundreds upon hundreds of clocks, not 30, if we add an operation
> whose side effect is to wait for all pending loads to complete.  So
> yeah this is going to be heavily scrutinized.

Plus this is the standard kernel code review MO: we've never blindly
accepted code just because *security* (otherwise we'd have grsec in by
now).  We use the pushback to get better and more performant code.
 What often happens is it turns out that the "either security or
performance" position was a false dichotomy and there is a way of
fixing stuff that's acceptable (although not usually perfect) for
everyone.  I'm not saying this always happens, but it is reasonable to
let the iterative pushback see if we can get to better code in this
case rather than trying to cut it off with the "because *security*"
argument.

James



Re: [PATCH net-next 18/18] selftests: fib_tests: Add test cases for netdev carrier change

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Check that IPv4 and IPv6 react the same when the carrier of a netdev is
> toggled. Local routes should not be affected by this, whereas unicast
> routes should.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  tools/testing/selftests/net/fib_tests.sh | 142 
> +++
>  1 file changed, 142 insertions(+)

Acked-by: David Ahern 


Re: [PATCH net-next 17/18] selftests: fib_tests: Add test cases for netdev down

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Check that IPv4 and IPv6 react the same when a netdev is being put
> administratively down.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  tools/testing/selftests/net/fib_tests.sh | 141 
> +++
>  1 file changed, 141 insertions(+)
> 

Acked-by: David Ahern 




Re: [PATCH net-next 16/18] selftests: fib_tests: Add test cases for IPv4/IPv6 FIB

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Add test cases to check that IPv4 and IPv6 react to a netdev being
> unregistered as expected.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  tools/testing/selftests/net/Makefile |   1 +
>  tools/testing/selftests/net/fib_tests.sh | 146 
> +++
>  2 files changed, 147 insertions(+)
>  create mode 100755 tools/testing/selftests/net/fib_tests.sh

Acked-by: David Ahern 

FYI: "ip -netns testns" is more efficient than "ip netns exec testns ip".


Re: [PATCH net-next 15/18] ipv6: Flush multipath routes when all siblings are dead

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> By default, IPv6 deletes nexthops from a multipath route when the
> nexthop device is put administratively down. This differs from IPv4
> where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
> multipath route is flushed when all of its nexthops become dead.
> 
> Align IPv6 with IPv4 and have it conform to the same guidelines.
> 
> In case the multipath route needs to be flushed, its siblings are
> flushed one by one. Otherwise, the nexthops are marked with the
> appropriate flags and the tree walker is instructed to skip all the
> siblings.
> 
> As explained in previous patches, care is taken to update the sernum of
> the affected tree nodes, so as to prevent the use of wrong dst entries.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 83 
> ++--
>  1 file changed, 75 insertions(+), 8 deletions(-)

Acked-by: David Ahern 



Re: [PATCH net-next 14/18] ipv6: Take table lock outside of sernum update function

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> The next patch is going to allow dead routes to remain in the FIB tree
> in certain situations.
> 
> When this happens we need to be sure to bump the sernum of the nodes
> where these are stored so that potential copies cached in sockets are
> invalidated.
> 
> The function that performs this update assumes the table lock is not
> taken when it is invoked, but that will not be the case when it is
> invoked by the tree walker.
> 
> Have the function assume the lock is taken and make the single caller
> take the lock itself.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/ip6_fib.c | 5 +
>  net/ipv6/route.c   | 2 ++
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 

Acked-by: David Ahern 



Re: [PATCH net-next 13/18] ipv6: Export sernum update function

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> We are going to allow dead routes to stay in the FIB tree (e.g., when
> they are part of a multipath route, directly connected route with no
> carrier) and revive them when their nexthop device gains carrier or when
> it is put administratively up.
> 
> This is equivalent to the addition of the route to the FIB tree and we
> should therefore take care of updating the sernum of all the parent
> nodes of the node where the route is stored. Otherwise, we risk sockets
> caching and using sub-optimal dst entries.
> 
> Export the function that performs the above, so that it could be invoked
> from fib6_ifup() later on.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  include/net/ip6_fib.h |  1 +
>  net/ipv6/ip6_fib.c| 11 ---
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 

Acked-by: David Ahern 



Re: [PATCH net-next 12/18] ipv6: Teach tree walker to skip multipath routes

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> As explained in previous patch, fib6_ifdown() needs to consider the
> state of all the sibling routes when a multipath route is traversed.
> 
> This is done by evaluating all the siblings when the first sibling in a
> multipath route is traversed. If the multipath route does not need to be
> flushed (e.g., not all siblings are dead), then we should just skip the
> multipath route as our work is done.
> 
> Have the tree walker jump to the last sibling when it is determined that
> the multipath route needs to be skipped.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/ip6_fib.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern 



Re: [PATCH net-next 11/18] ipv6: Add explicit flush indication to routes

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> When routes that are a part of a multipath route are evaluated by
> fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events
> the state of their sibling routes is not considered.
> 
> This will change in subsequent patches in order to align IPv6 with
> IPv4's behavior. For example, when the last sibling in a multipath route
> becomes dead, the entire multipath route needs to be removed.
> 
> To prevent the tree walker from re-evaluating all the sibling routes
> each time, we can simply evaluate them once - when the first sibling is
> traversed.
> 
> If we determine the entire multipath route needs to be removed, then the
> 'should_flush' bit is set in all the siblings, which will cause the
> walker to flush them when it traverses them.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  include/net/ip6_fib.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Acked-by: David Ahern 


Re: [PATCH net-next 10/18] ipv6: Report dead flag during route dump

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Up until now the RTNH_F_DEAD flag was only reported in route dump when
> the 'ignore_routes_with_linkdown' sysctl was set. This is expected as
> dead routes were flushed otherwise.
> 
> The reliance on this sysctl is going to be removed, so we need to report
> the flag regardless of the sysctl's value.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 3 +++
>  1 file changed, 3 insertions(+)

Acked-by: David Ahern 



Re: [PATCH net-next 09/18] ipv6: Ignore dead routes during lookup

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Currently, dead routes are only present in the routing tables in case
> the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are
> flushed.
> 
> Subsequent patches are going to remove the reliance on this sysctl and
> make IPv6 more consistent with IPv4.
> 
> Before this is done, we need to make sure dead routes are skipped during
> route lookup, so as to not cause packet loss.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)

Acked-by: David Ahern 



Aw: Re: dvb usb issues since kernel 4.9

2018-01-07 Thread Josef Griebichler
Hi,

here I provide lsusb from my affected hardware (technotrend s2-4600).
http://ix.io/DLY

With this hardware I had errors when recording with tvheadend. Live TV was
OK; only channel switching sometimes caused problems. Please see the
attached tvheadend service logs.

I also provide dmesg (libreelec on rpi3 with kernel 4.14.10 with the
mentioned commit reverted).
http://ix.io/DM2


Regards
Josef
 
 

Sent: Sunday, 7 January 2018 at 16:41
From: "Alan Stern"
To: "Mauro Carvalho Chehab"
Cc: "Josef Griebichler", "Greg Kroah-Hartman", linux-...@vger.kernel.org,
 "Eric Dumazet", "Rik van Riel", "Paolo Abeni", "Hannes Frederic Sowa",
 "Jesper Dangaard Brouer", linux-kernel, netdev, "Jonathan Corbet", LMML,
 "Peter Zijlstra", "David Miller", torva...@linux-foundation.org
Subject: Re: dvb usb issues since kernel 4.9
On Sun, 7 Jan 2018, Mauro Carvalho Chehab wrote:

> > > It seems that the original patch were designed to solve some IRQ issues
> > > with network cards with causes data losses on high traffic. However,
> > > it is also causing bad effects on sustained high bandwidth demands
> > > required by DVB cards, at least on some USB host drivers.
> > >
> > > Alan/Greg/Eric/David:
> > >
> > > Any ideas about how to fix it without causing regressions to
> > > network?
> >
> > It would be good to know what hardware was involved on the x86 system
> > and to have some timing data. Can we see the output from lsusb and
> > usbmon, running on a vanilla kernel that gets plenty of video glitches?
>
> From Josef's report, and from the BZ, the affected hardware seems
> to be based on Montage Technology M88DS3103/M88TS2022 chipset.

What type of USB host controller does the x86_64 system use? EHCI or
xHCI?

> The driver it uses is at drivers/media/usb/dvb-usb-v2/dvbsky.c,
> with shares a USB implementation that is used by a lot more drivers.
> The URB handling code is at:
>
>   drivers/media/usb/dvb-usb-v2/usb_urb.c
>
> This particular driver allocates 8 buffers with 4096 bytes each
> for bulk transfers, using transfer_flags = URB_NO_TRANSFER_DMA_MAP.
>
> This become a popular USB hardware nowadays. I have one S960c
> myself, so I can send you the lsusb from it. You should notice, however,
> that a DVB-C/DVB-S2 channel can easily provide very high sustained bit
> rates. Here, on my DVB-S2 provider, a typical transponder produces 58 Mpps
> of payload after removing URB headers.

You mentioned earlier that the driver uses bulk transfers. In USB-2.0,
the maximum possible payload data transfer rate using bulk transfers is
53248 bytes/ms, which is 53.248 MB/s (i.e., lower than 58 MB/s). And
even this is possible only if almost nothing else is using the bus at
the same time.

> A 10 minutes record with the
> entire data (with typically contains 5-10 channels) can easily go
> above 4 GB, just to reproduce 1-2 glitches. So, I'm not sure if
> a usbmon dump would be useful.

It might not be helpful at all. However, I'm not interested in the
payload data (which would be unintelligible to me anyway) but rather
the timing of URB submissions and completions. A usbmon trace which
didn't keep much of the payload data would only require on the order of
50 MB per minute -- and Josef said that glitches usually would show up
within a minute or so.

> I'm enclosing the lsusb from a S960C device, with is based on those
> Montage chipsets:

What I wanted to see was the output from "lsusb" on the affected
system, not the output from "lsusb -v -s B:D" on your system.

> > Overall, this may be a very difficult problem to solve. The
> > 4cd13c21b207 commit was intended to improve throughput at the cost of
> > increased latency. But then what do you do when the latency becomes
> > too high for the video subsystem to handle?
>
> Latency can't be too high, otherwise frames will be dropped.

Yes, that's the whole point.

> Even if the Kernel itself doesn't drop, if the delay goes higher
> than a certain threshold, userspace will need to drop, as it
> should be presenting audio and video on real time. Yet, typically,
> userspace will delay it by one or two seconds, with would mean
> 1500-3500 buffers, with I suspect it is a lot more than the hardware
> limits. So I suspect that the hardware starves free buffers a way
> before userspace, as media hardware don't have unlimited buffers
> inside them, as they assume that the Kernel/userspace will be fast
> enough to sustain bit rates up to 66 Mbps of payload.

The timing information would tell us how large the latency is.

In any case, you might be able to attack the problem simply by using
more than 8

Re: [PATCH net-next 08/18] ipv6: Check nexthop flags in route dump instead of carrier

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Similar to previous patch, there is no need to check for the carrier of
> the nexthop device when dumping the route and we can instead check for
> the presence of the RTNH_F_LINKDOWN flag.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Acked-by: David Ahern 



Re: [PATCH net-next 07/18] ipv6: Check nexthop flags during route lookup instead of carrier

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
> need to dereference the nexthop device and check its carrier and instead
> check for the presence of the flag.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)

Acked-by: David Ahern 


Re: [PATCH net-next 06/18] ipv6: Set nexthop flags during route creation

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> It is valid to install routes with a nexthop device that does not have a
> carrier, so we need to make sure they're marked accordingly.
> 
> As explained in the previous patch, host and anycast routes are never
> marked with the 'linkdown' flag.
> 
> Note that reject routes are unaffected, as these use the loopback device
> which always has a carrier.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 3 +++
>  1 file changed, 3 insertions(+)

Acked-by: David Ahern 


Re: [PATCH net-next 05/18] ipv6: Set nexthop flags upon carrier change

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Similar to IPv4, when the carrier of a netdev changes we should toggle
> the 'linkdown' flag on all the nexthops using it as their nexthop
> device.
> 
> This will later allow us to test for the presence of this flag during
> route lookup and dump.
> 
> Up until commit 4832c30d5458 ("net: ipv6: put host and anycast routes on
> device with address") host and anycast routes used the loopback netdev
> as their nexthop device and thus were not marked with the 'linkdown'
> flag. The patch preserves this behavior and allows one to ping the local
> address even when the nexthop device does not have a carrier and the
> 'ignore_routes_with_linkdown' sysctl is set.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  include/net/ip6_route.h |  1 +
>  net/ipv6/addrconf.c |  2 ++
>  net/ipv6/route.c| 23 +--
>  3 files changed, 20 insertions(+), 6 deletions(-)

Acked-by: David Ahern 



Re: [PATCH net] ipv6: sr: fix TLVs not being copied using setsockopt

2018-01-07 Thread Mathieu Xhonneux
Just realized I messed up the justification about sr_has_hmac. The
branch will be taken, but its execution will not complete since the
TLV's len and type fields aren't copied, hence seg6_get_tlv_hmac will
fail, and the HMAC will not be computed.

2018-01-07 17:12 GMT+00:00 Mathieu Xhonneux :
> Function ipv6_push_rthdr4 allows to add an IPv6 Segment Routing Header
> to a socket through setsockopt, but the current implementation doesn't
> copy possible TLVs at the end of the SRH (i.e., the following branch
> if (sr_has_hmac(sr_phdr)) will never be taken as no HMAC TLV is copied).
>
> This commit adds a memcpy in case TLVs have been appended to the SRH.
>
> Fixes: a149e7c7ce812561f0fdc7a86ddc42f294e5eb3e ("ipv6: sr: add
> support for SRH injection through setsockopt")
> Acked-by: David Lebrun 
> Signed-off-by: Mathieu Xhonneux 
> ---
>  net/ipv6/exthdrs.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
> index 83bd75713535..bc68eb661970 100644
> --- a/net/ipv6/exthdrs.c
> +++ b/net/ipv6/exthdrs.c
> @@ -925,6 +925,15 @@ static void ipv6_push_rthdr4(struct sk_buff *skb, u8 
> *proto,
> sr_phdr->segments[0] = **addr_p;
> *addr_p = &sr_ihdr->segments[sr_ihdr->segments_left];
>
> +   if (sr_ihdr->hdrlen > hops * 2) {
> +   int tlvs_offset, tlvs_length;
> +
> +   tlvs_offset = (1 + hops * 2) << 3;
> +   tlvs_length = (sr_ihdr->hdrlen - hops * 2) << 3;
> +   memcpy((char *)sr_phdr + tlvs_offset,
> +  (char *)sr_ihdr + tlvs_offset, tlvs_length);
> +   }
> +
>  #ifdef CONFIG_IPV6_SEG6_HMAC
> if (sr_has_hmac(sr_phdr)) {
> struct net *net = NULL;
> --
> 2.15.1
>


Re: [PATCH net-next 04/18] ipv6: Prepare to handle multiple netdev events

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> To make IPv6 more in line with IPv4 we need to be able to respond
> differently to different netdev events. For example, when a netdev is
> unregistered all the routes using it as their nexthop device should be
> flushed, whereas when the netdev's carrier changes only the 'linkdown'
> flag should be toggled.
> 
> Currently, this is not possible, as the function that traverses the
> routing tables is not aware of the triggering event.
> 
> Propagate the triggering event down, so that it could be used in later
> patches.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  include/net/ip6_route.h |  2 +-
>  net/ipv6/addrconf.c |  4 ++--
>  net/ipv6/route.c| 37 +
>  3 files changed, 24 insertions(+), 19 deletions(-)

Acked-by: David Ahern 


Re: [PATCH net-next 03/18] ipv6: Clear nexthop flags upon netdev up

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> Previous patch marked nexthops with the 'dead' and 'linkdown' flags.
> Clear these flags when the netdev comes back up.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  include/net/ip6_route.h |  1 +
>  net/ipv6/addrconf.c |  3 +++
>  net/ipv6/route.c| 29 +
>  3 files changed, 33 insertions(+)
> 

Acked-by: David Ahern 




Re: [PATCH net-next 02/18] ipv6: Mark dead nexthops with appropriate flags

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> When a netdev is put administratively down or unregistered all the
> nexthops using it as their nexthop device should be marked with the
> 'dead' and 'linkdown' flags.
> 
> Currently, when a route is dumped its nexthop device is tested and the
> flags are set accordingly. A similar check is performed during route
> lookup.
> 
> Instead, we can simply mark the nexthops based on netdev events and
> avoid checking the netdev's state during route dump and lookup.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/route.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 

Acked-by: David Ahern 




Re: [PATCH net-next 01/18] ipv6: Remove redundant route flushing during namespace dismantle

2018-01-07 Thread David Ahern
On 1/7/18 3:45 AM, Ido Schimmel wrote:
> By the time fib6_net_exit() is executed all the netdevs in the namespace
> have been either unregistered or pushed back to the default namespace.
> That is because pernet subsys operations are always ordered before
> pernet device operations and therefore invoked after them during
> namespace dismantle.
> 
> Thus, all the routing tables in the namespace are empty by the time
> fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.
> 
> This allows us to simplify the condition in fib6_ifdown() as it's only
> ever called with an actual netdev.
> 
> Signed-off-by: Ido Schimmel 
> ---
>  net/ipv6/ip6_fib.c | 1 -
>  net/ipv6/route.c   | 8 +++-
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 

Acked-by: David Ahern 



Re: atm/clip: Use seq_puts() in svc_addr()

2018-01-07 Thread SF Markus Elfring
>> Is the function "seq_puts" a bit more efficient for the desired output
>> of a single string in comparison to calling the function "seq_printf"
>> for this purpose?
> 
> Will you please be so kind and tell us?

What do you think about getting the run-time characteristics of these
sequence output functions better documented?
https://elixir.free-electrons.com/linux/v4.15-rc6/source/fs/seq_file.c#L660

Can information like “WARNING: Prefer seq_puts to seq_printf”
(from the script “checkpatch.pl”) be another incentive?
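
For reference, the kind of conversion under discussion is simply the
following (illustrative snippet only, not taken from the atm/clip patch):

#include <linux/seq_file.h>

static void example_show(struct seq_file *seq)
{
	/* before: goes through the printf format parser */
	seq_printf(seq, "(none)\n");

	/* after: same output for a constant string, no format parsing */
	seq_puts(seq, "(none)\n");
}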


>>> and "strings should be quickly put into a sequence"
>>> isn't terribly helpful.  
>>
>> Which wording would you find more appropriate for the suggested
>> adjustment of these function calls?
> 
> Whatever describes the actual issue and what you're doing about it.
> Turn your rhetorical question above into a commit message, done.
> 
> Compare that with your original commit message, on the other hand,
> and you should understand what I mean.

Which descriptions are you really missing for the affected data output?

Regards,
Markus


Re: [net-next] netfilter: add segment routing header 'srh' match

2018-01-07 Thread Ahmed Abdelsalam
On Sun, 7 Jan 2018 00:40:03 +0100
Pablo Neira Ayuso  wrote:

> Hi Ahmed,
> 
> On Fri, Dec 29, 2017 at 12:07:52PM +0100, Ahmed Abdelsalam wrote:
> > It allows matching packets based on Segment Routing Header
> > (SRH) information.
> > The implementation considers revision 7 of the SRH draft.
> > https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
> > 
> > Currently supported match options include:
> > (1) Next Header
> > (2) Hdr Ext Len
> > (3) Segments Left
> > (4) Last Entry
> > (5) Tag value of SRH
> > 
> > Signed-off-by: Ahmed Abdelsalam 
> > ---
> >  include/uapi/linux/netfilter_ipv6/ip6t_srh.h |  63 ++
> >  net/ipv6/netfilter/Kconfig   |   9 ++
> >  net/ipv6/netfilter/Makefile  |   1 +
> >  net/ipv6/netfilter/ip6t_srh.c| 165 
> > +++
> >  4 files changed, 238 insertions(+)
> >  create mode 100644 include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> >  create mode 100644 net/ipv6/netfilter/ip6t_srh.c
> > 
> > diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h 
> > b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > new file mode 100644
> > index 000..1b5dbd8
> > --- /dev/null
> > +++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > @@ -0,0 +1,63 @@
> > +/**
> > + * Definitions for Segment Routing Header 'srh' match
> > + *
> > + * Author:
> > + *   Ahmed Abdelsalam   
> > + */
> 
> Please, add this in SPDX format instead.
> 
> See include/uapi/linux/netfilter/xt_owner.h for instance.
> 
Ok
> > +#ifndef _IP6T_SRH_H
> > +#define _IP6T_SRH_H
> > +
> > +#include 
> > +#include 
> > +
> > +/* Values for "mt_flags" field in struct ip6t_srh */
> > +#define IP6T_SRH_NEXTHDR0x0001
> > +#define IP6T_SRH_LEN_EQ 0x0002
> > +#define IP6T_SRH_LEN_GT 0x0004
> > +#define IP6T_SRH_LEN_LT 0x0008
> > +#define IP6T_SRH_SEGS_EQ0x0010
> > +#define IP6T_SRH_SEGS_GT0x0020
> > +#define IP6T_SRH_SEGS_LT0x0040
> > +#define IP6T_SRH_LAST_EQ0x0080
> > +#define IP6T_SRH_LAST_GT0x0100
> > +#define IP6T_SRH_LAST_LT0x0200
> > +#define IP6T_SRH_TAG0x0400
> > +#define IP6T_SRH_MASK   0x07FF
> > +
> > +/* Values for "mt_invflags" field in struct ip6t_srh */
> > +#define IP6T_SRH_INV_NEXTHDR0x0001
> > +#define IP6T_SRH_INV_LEN_EQ 0x0002
> > +#define IP6T_SRH_INV_LEN_GT 0x0004
> > +#define IP6T_SRH_INV_LEN_LT 0x0008
> > +#define IP6T_SRH_INV_SEGS_EQ0x0010
> > +#define IP6T_SRH_INV_SEGS_GT0x0020
> > +#define IP6T_SRH_INV_SEGS_LT0x0040
> > +#define IP6T_SRH_INV_LAST_EQ0x0080
> > +#define IP6T_SRH_INV_LAST_GT0x0100
> > +#define IP6T_SRH_INV_LAST_LT0x0200
> > +#define IP6T_SRH_INV_TAG0x0400
> > +#define IP6T_SRH_INV_MASK   0x07FF
> 
> Looking at all these EQ, GT, LT... I think this should be very easy to
> implement in nf_tables with no kernel changes.
> 
> You only need to add the protocol definition to:
> 
> nftables/src/exthdr.c
> 
> Would you have a look into this? This would be very much appreciated
> to we keep nftables in sync with what we have in iptables.
Yes, I look into it. I will send you a patch for nf_tables as well. 
> 
> > +
> > +/**
> > + *  struct ip6t_srh - SRH match options
> > + *  @ next_hdr: Next header field of SRH
> > + *  @ hdr_len: Extension header length field of SRH
> > + *  @ segs_left: Segments left field of SRH
> > + *  @ last_entry: Last entry field of SRH
> > + *  @ tag: Tag field of SRH
> > + *  @ mt_flags: match options
> > + *  @ mt_invflags: Invert the sense of match options
> > + */
> > +
> > +struct ip6t_srh {
> > +   __u8next_hdr;
> > +   __u8hdr_len;
> > +   __u8segs_left;
> > +   __u8last_entry;
> > +   __u16   tag;
> > +   __u16   mt_flags;
> > +   __u16   mt_invflags;
> > +};
> > +
> > +#endif /*_IP6T_SRH_H*/
> > diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
> > index 6acb2ee..e1818eb 100644
> > --- a/net/ipv6/netfilter/Kconfig
> > +++ b/net/ipv6/netfilter/Kconfig
> > @@ -232,6 +232,15 @@ config IP6_NF_MATCH_RT
> >  
> >   To compile it as a module, choose M here.  If unsure, say N.
> >  
> > +config IP6_NF_MATCH_SRH
> > +tristate '"srh" Segment Routing header match support'
> > +depends on NETFILTER_ADVANCED
> > +help
> > +  srh matching allows you to match packets based on the segment
> > + routing header of the packet.
> > +
> > +  To compile it as a module, choose M here.  If unsure, say N.
> > +
> >  # The targets
> >  config IP6_NF_TARGET_HL
> > tristate '"HL" hoplimit target support'
> > diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
> > index c6ee0cd..e0d51a9 100644
> > --- a/net/ipv6/netfilter/Makefile
> > +++ 

[PATCH net] ipv6: sr: fix TLVs not being copied using setsockopt

2018-01-07 Thread Mathieu Xhonneux
Function ipv6_push_rthdr4 allows to add an IPv6 Segment Routing Header
to a socket through setsockopt, but the current implementation doesn't
copy possible TLVs at the end of the SRH (i.e., the following branch
if (sr_has_hmac(sr_phdr)) will never be taken as no HMAC TLV is copied).

This commit adds a memcpy in case TLVs have been appended to the SRH.
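
As a worked example of the arithmetic in the hunk below (all sizes assumed
for illustration: 2 segments of 16 bytes each followed by 40 bytes of TLVs;
hdrlen counts 8-octet units excluding the first 8 bytes of the SRH):

/*
 * hops = 2, so the segment list is 2 * 16 bytes = 4 units of 8 octets.
 * With 40 bytes of TLVs (5 units) appended, hdrlen = 9 > hops * 2.
 *
 * tlvs_offset = (1 + hops * 2) << 3 = (1 + 4) * 8 = 40
 *               (the 8-byte fixed header plus 32 bytes of segments)
 * tlvs_length = (hdrlen - hops * 2) << 3 = (9 - 4) * 8 = 40
 */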

Fixes: a149e7c7ce812561f0fdc7a86ddc42f294e5eb3e ("ipv6: sr: add
support for SRH injection through setsockopt")
Acked-by: David Lebrun 
Signed-off-by: Mathieu Xhonneux 
---
 net/ipv6/exthdrs.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 83bd75713535..bc68eb661970 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -925,6 +925,15 @@ static void ipv6_push_rthdr4(struct sk_buff *skb, u8 
*proto,
sr_phdr->segments[0] = **addr_p;
*addr_p = &sr_ihdr->segments[sr_ihdr->segments_left];
 
+   if (sr_ihdr->hdrlen > hops * 2) {
+   int tlvs_offset, tlvs_length;
+
+   tlvs_offset = (1 + hops * 2) << 3;
+   tlvs_length = (sr_ihdr->hdrlen - hops * 2) << 3;
+   memcpy((char *)sr_phdr + tlvs_offset,
+  (char *)sr_ihdr + tlvs_offset, tlvs_length);
+   }
+
 #ifdef CONFIG_IPV6_SEG6_HMAC
if (sr_has_hmac(sr_phdr)) {
struct net *net = NULL;
-- 
2.15.1



Re: dvb usb issues since kernel 4.9

2018-01-07 Thread Alan Stern
On Sun, 7 Jan 2018, Mauro Carvalho Chehab wrote:

> > > It seems that the original patch were designed to solve some IRQ issues
> > > with network cards with causes data losses on high traffic. However,
> > > it is also causing bad effects on sustained high bandwidth demands
> > > required by DVB cards, at least on some USB host drivers.
> > > 
> > > Alan/Greg/Eric/David:
> > > 
> > > Any ideas about how to fix it without causing regressions to
> > > network?  
> > 
> > It would be good to know what hardware was involved on the x86 system
> > and to have some timing data.  Can we see the output from lsusb and
> > usbmon, running on a vanilla kernel that gets plenty of video glitches?
> 
> From Josef's report, and from the BZ, the affected hardware seems
> to be based on Montage Technology M88DS3103/M88TS2022 chipset.

What type of USB host controller does the x86_64 system use?  EHCI or
xHCI?

> The driver it uses is at drivers/media/usb/dvb-usb-v2/dvbsky.c,
> with shares a USB implementation that is used by a lot more drivers.
> The URB handling code is at:
> 
>   drivers/media/usb/dvb-usb-v2/usb_urb.c
> 
> This particular driver allocates 8 buffers with 4096 bytes each
> for bulk transfers, using transfer_flags = URB_NO_TRANSFER_DMA_MAP.
> 
> This become a popular USB hardware nowadays. I have one S960c
> myself, so I can send you the lsusb from it. You should notice, however,
> that a DVB-C/DVB-S2 channel can easily provide very high sustained bit
> rates. Here, on my DVB-S2 provider, a typical transponder produces 58 Mpps
> of payload after removing URB headers.

You mentioned earlier that the driver uses bulk transfers.  In USB-2.0,
the maximum possible payload data transfer rate using bulk transfers is
53248 bytes/ms, which is 53.248 MB/s (i.e., lower than 58 MB/s).  And
even this is possible only if almost nothing else is using the bus at
the same time.

> A 10 minutes record with the
> entire data (with typically contains 5-10 channels) can easily go
> above 4 GB, just to reproduce 1-2 glitches. So, I'm not sure if
> a usbmon dump would be useful.

It might not be helpful at all.  However, I'm not interested in the 
payload data (which would be unintelligible to me anyway) but rather 
the timing of URB submissions and completions.  A usbmon trace which 
didn't keep much of the payload data would only require on the order of 
50 MB per minute -- and Josef said that glitches usually would show up 
within a minute or so.

> I'm enclosing the lsusb from a S960C device, with is based on those
> Montage chipsets:

What I wanted to see was the output from "lsusb" on the affected
system, not the output from "lsusb -v -s B:D" on your system.

> > Overall, this may be a very difficult problem to solve.  The
> > 4cd13c21b207 commit was intended to improve throughput at the cost of
> > increased latency.  But then what do you do when the latency becomes
> > too high for the video subsystem to handle?
> 
> Latency can't be too high, otherwise frames will be dropped.

Yes, that's the whole point.

> Even if the Kernel itself doesn't drop, if the delay goes higher
> than a certain threshold, userspace will need to drop, as it
> should be presenting audio and video on real time. Yet, typically,
> userspace will delay it by one or two seconds, with would mean
> 1500-3500 buffers, with I suspect it is a lot more than the hardware
> limits. So I suspect that the hardware starves free buffers a way 
> before userspace, as media hardware don't have unlimited buffers
> inside them, as they assume that the Kernel/userspace will be fast
> enough to sustain bit rates up to 66 Mbps of payload.

The timing information would tell us how large the latency is.

In any case, you might be able to attack the problem simply by using
more than 8 buffers.  With just eight 4096-byte buffers, the total
pipeline capacity is only about 0.62 ms (at the maximum possible
transfer rate).  Increasing the number of buffers to 65 would give a
capacity of 5 ms, which is probably a lot better suited for situations
where completions are handled by the ksoftirqd thread.
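
For readers who want to check those figures, a back-of-the-envelope sketch
(the 53248 bytes/ms ceiling is 13 bulk packets of 512 bytes per 125 us
microframe, i.e. 13 * 512 * 8 per millisecond; the buffer geometry is the
one quoted above):

#include <stdio.h>

int main(void)
{
	const double max_bulk_rate = 13 * 512 * 8;	/* bytes/ms, USB 2.0 bulk best case */
	const double buf_size = 4096;			/* bytes per URB buffer */

	printf("8 buffers  -> %.2f ms of pipeline\n", 8 * buf_size / max_bulk_rate);	/* ~0.62 */
	printf("65 buffers -> %.2f ms of pipeline\n", 65 * buf_size / max_bulk_rate);	/* ~5.0  */
	return 0;
}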

> Perhaps media drivers could pass some quirk similar to URB_ISO_ASAP,
> in order to revert the kernel logic to prioritize latency instead of
> throughput.

It can't be done without pervasive changes to the USB subsystem, which
I would greatly prefer to avoid.  Besides, this wouldn't really solve
the problem.  Decreasing the latency for one device will cause it to be
increased for others.

Alan Stern



Re: atm/clip: Use seq_puts() in svc_addr()

2018-01-07 Thread Stefano Brivio
On Sun, 7 Jan 2018 09:19:17 +0100
SF Markus Elfring  wrote:

> >> Two strings should be quickly put into a sequence by two function calls.
> >> Thus use the function "seq_puts" instead of "seq_printf".
> >>
> >> This issue was detected by using the Coccinelle software.  
> > 
> > Can you please explain what the issue really is and what you're trying
> > to do here?  
> 
> Is the function "seq_puts" a bit more efficient for the desired output
> of a single string in comparison to calling the function "seq_printf"
> for this purpose?

Will you please be so kind and tell us?

> > One shouldn't need to dig into Coccinelle patterns to find
> > out what you mean,  
> 
> Why did an attribution for a software tool confuse you?

I'm not confused. I'm saying that one shouldn't need to dig into
Coccinelle patterns to find out what you mean.

> > and "strings should be quickly put into a sequence"
> > isn't terribly helpful.  
> 
> Which wording would you find more appropriate for the suggested
> adjustment of these function calls?

Whatever describes the actual issue and what you're doing about it.
Turn your rhetorical question above into a commit message, done.

Compare that with your original commit message, on the other hand,
and you should understand what I mean.

-- 
Stefano


Re: [PATCH net-next v3 00/10] net: qualcomm: rmnet: Enable csum offloads

2018-01-07 Thread David Miller
From: Subash Abhinov Kasiviswanathan 
Date: Sat, 06 Jan 2018 19:19:00 -0700

> I still dont see the v3 submission in patchwork if I query it.

Please resubmit then.


[PATCH] ixgbevf: use ARRAY_SIZE for various array sizing calculations

2018-01-07 Thread Colin King
From: Colin Ian King 

Use the ARRAY_SIZE macro on various arrays to determine
size of the arrays. Improvement suggested by coccinelle.
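
For readers unfamiliar with the macro, the change is purely mechanical; a
minimal sketch of the equivalence (illustrative, not part of the patch):

#include <linux/build_bug.h>
#include <linux/kernel.h>	/* ARRAY_SIZE() */
#include <linux/types.h>

static size_t example(void)
{
	u32 msgbuf[3];

	/* Both expressions evaluate to 3, but ARRAY_SIZE() documents intent
	 * and stays correct if the element type of msgbuf ever changes. */
	BUILD_BUG_ON(ARRAY_SIZE(msgbuf) != sizeof(msgbuf) / sizeof(u32));

	return ARRAY_SIZE(msgbuf);
}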

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/intel/ixgbevf/vf.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c 
b/drivers/net/ethernet/intel/ixgbevf/vf.c
index 0c25006ce9af..cd79c4a73b48 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -285,7 +285,7 @@ static s32 ixgbevf_set_uc_addr_vf(struct ixgbe_hw *hw, u32 
index, u8 *addr)
ether_addr_copy(msg_addr, addr);
 
ret_val = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf,
-sizeof(msgbuf) / sizeof(u32));
+ARRAY_SIZE(msgbuf));
if (!ret_val) {
msgbuf[0] &= ~IXGBE_VT_MSGTYPE_CTS;
 
@@ -455,8 +455,7 @@ static s32 ixgbevf_set_rar_vf(struct ixgbe_hw *hw, u32 
index, u8 *addr,
ether_addr_copy(msg_addr, addr);
 
ret_val = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf,
-sizeof(msgbuf) / sizeof(u32));
-
+ARRAY_SIZE(msgbuf));
msgbuf[0] &= ~IXGBE_VT_MSGTYPE_CTS;
 
/* if nacked the address was rejected, use "perm_addr" */
@@ -571,7 +570,7 @@ static s32 ixgbevf_update_xcast_mode(struct ixgbe_hw *hw, 
int xcast_mode)
msgbuf[1] = xcast_mode;
 
err = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf,
-sizeof(msgbuf) / sizeof(u32));
+ARRAY_SIZE(msgbuf));
if (err)
return err;
 
@@ -609,7 +608,7 @@ static s32 ixgbevf_set_vfta_vf(struct ixgbe_hw *hw, u32 
vlan, u32 vind,
msgbuf[0] |= vlan_on << IXGBE_VT_MSGINFO_SHIFT;
 
err = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf,
-sizeof(msgbuf) / sizeof(u32));
+ARRAY_SIZE(msgbuf));
if (err)
goto mbx_err;
 
@@ -813,7 +812,7 @@ static s32 ixgbevf_set_rlpml_vf(struct ixgbe_hw *hw, u16 
max_size)
msgbuf[1] = max_size;
 
ret_val = ixgbevf_write_msg_read_ack(hw, msgbuf, msgbuf,
-sizeof(msgbuf) / sizeof(u32));
+ARRAY_SIZE(msgbuf));
if (ret_val)
return ret_val;
if ((msgbuf[0] & IXGBE_VF_SET_LPE) &&
@@ -859,8 +858,7 @@ static int ixgbevf_negotiate_api_version_vf(struct ixgbe_hw 
*hw, int api)
msg[1] = api;
msg[2] = 0;
 
-   err = ixgbevf_write_msg_read_ack(hw, msg, msg,
-sizeof(msg) / sizeof(u32));
+   err = ixgbevf_write_msg_read_ack(hw, msg, msg, ARRAY_SIZE(msg));
if (!err) {
msg[0] &= ~IXGBE_VT_MSGTYPE_CTS;
 
@@ -911,8 +909,7 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int 
*num_tcs,
msg[0] = IXGBE_VF_GET_QUEUE;
msg[1] = msg[2] = msg[3] = msg[4] = 0;
 
-   err = ixgbevf_write_msg_read_ack(hw, msg, msg,
-sizeof(msg) / sizeof(u32));
+   err = ixgbevf_write_msg_read_ack(hw, msg, msg, ARRAY_SIZE(msg));
if (!err) {
msg[0] &= ~IXGBE_VT_MSGTYPE_CTS;
 
-- 
2.15.1



Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-07 Thread Jamal Hadi Salim

On 18-01-07 09:28 AM, Jamal Hadi Salim wrote:

On 18-01-07 08:46 AM, Jiri Pirko wrote:

Sun, Jan 07, 2018 at 02:11:19PM CET, j...@mojatatu.com wrote:

On 18-01-06 03:43 PM, Jiri Pirko wrote:




@@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, 
struct sk_buff *skb,

tcm->tcm_family = AF_UNSPEC;
tcm->tcm__pad1 = 0;
tcm->tcm__pad2 = 0;
-    tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
-    tcm->tcm_parent = parent;
+    if (q) {
+    tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
+    tcm->tcm_parent = parent;
+    } else {
+    tcm->tcm_ifindex = 0; /* block index is stored in parent */
+    tcm->tcm_parent = block->index;
+    }


Please guys, please look at this reuse (also on client side). I would like
you to double-check this reuse of existing API for block_index carrying
purpose. I believe it's UAPI safe. But please, check it out carefully.




Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex
(not sure if zero means something special to someone).


Like -1 means parent is block_index?



Yes.


Why would 0 mean something special? Could you point to a code that
suggests it?



I can't point to any such code; it is just that the ifindex is an int.
And the negative space looks less likely to be something someone would
think of using for signalling (0x as an example).
tcpdump -i any probably assumes some weird ifindex (haven't looked
at the code).

In any case, 0 is fine too.
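
For what it's worth, a sketch of how a dump consumer could tell the two
cases apart under the 0-based encoding (illustrative only, not the final
UAPI):

#include <stdbool.h>
#include <linux/rtnetlink.h>

static bool tcm_is_shared_block(const struct tcmsg *tcm, __u32 *block_index)
{
	if (tcm->tcm_ifindex != 0)
		return false;		/* classic case: qdisc sitting on a netdev */

	*block_index = tcm->tcm_parent;	/* shared block: parent carries the block index */
	return true;
}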


Also a #define for whatever value you use would make it more readable,
ex:
#define IFINDEX_ANY ..

cheers,
jamal


[iproute2 1/1] tc: Fix filter protocol output

2018-01-07 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Fixes: 249284ff5a44 ("tc: jsonify filter core")
Signed-off-by: Jamal Hadi Salim 
---
 tc/tc_filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index 545cc3a..546311a 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -276,7 +276,7 @@ int print_filter(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (!filter_protocol || filter_protocol != f_proto) {
if (f_proto) {
SPRINT_BUF(b1);
-   print_string(PRINT_JSON, "protocol",
+   print_string(PRINT_ANY, "protocol",
 "protocol %s ",
 ll_proto_n2a(f_proto, b1, 
sizeof(b1)));
}
-- 
1.9.1



[PATCH iproute2] ip fou: Pass family attribute as u8

2018-01-07 Thread Filip Moc
This fixes fou on big-endian systems.
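
A minimal illustration of why the attribute width matters here (assuming,
as the fix implies, that the kernel parses FOU_ATTR_AF as a u8 attribute
and therefore only looks at the first payload byte):

#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned short family16 = 2;	/* AF_INET */
	unsigned char payload[2];

	memcpy(payload, &family16, sizeof(family16));

	/* little-endian u16 payload: 02 00 -> first byte = 2 (works by accident)
	 * big-endian    u16 payload: 00 02 -> first byte = 0 (AF_UNSPEC, broken)
	 * an 8-bit attribute (addattr8) sends 02 as the only byte on both */
	printf("first byte seen by a u8 parser: %u\n", payload[0]);
	return 0;
}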

Signed-off-by: Filip Moc 
---
 ip/ipfou.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ip/ipfou.c b/ip/ipfou.c
index febc2c8c..1f392ade 100644
--- a/ip/ipfou.c
+++ b/ip/ipfou.c
@@ -52,7 +52,7 @@ static int fou_parse_opt(int argc, char **argv, struct 
nlmsghdr *n,
__u8 ipproto, type;
bool gue_set = false;
int ipproto_set = 0;
-   unsigned short family = AF_INET;
+   __u8 family = AF_INET;
 
while (argc > 0) {
if (!matches(*argv, "port")) {
@@ -103,7 +103,7 @@ static int fou_parse_opt(int argc, char **argv, struct 
nlmsghdr *n,
 
addattr16(n, 1024, FOU_ATTR_PORT, port);
addattr8(n, 1024, FOU_ATTR_TYPE, type);
-   addattr16(n, 1024, FOU_ATTR_AF, family);
+   addattr8(n, 1024, FOU_ATTR_AF, family);
 
if (ipproto_set)
addattr8(n, 1024, FOU_ATTR_IPPROTO, ipproto);
-- 
2.11.0



Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-07 Thread Jamal Hadi Salim

On 18-01-07 08:46 AM, Jiri Pirko wrote:

Sun, Jan 07, 2018 at 02:11:19PM CET, j...@mojatatu.com wrote:

On 18-01-06 03:43 PM, Jiri Pirko wrote:





@@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct sk_buff 
*skb,
tcm->tcm_family = AF_UNSPEC;
tcm->tcm__pad1 = 0;
tcm->tcm__pad2 = 0;
-   tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
-   tcm->tcm_parent = parent;
+   if (q) {
+   tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
+   tcm->tcm_parent = parent;
+   } else {
+   tcm->tcm_ifindex = 0; /* block index is stored in parent */
+   tcm->tcm_parent = block->index;
+   }


Please guys, please look at this reuse (also on client side). I would like
you to double-check this reuse of existing API for block_index carrying
purpose. I believe it's UAPI safe. But please, check it out carefully.




Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex
(not sure if zero means something special to someone).


Like -1 means parent is block_index?



Yes.


Why would 0 mean something special? Could you point to a code that
suggests it?



I can't point to any such code; it is just that the ifindex is an int.
And the negative space looks less likely to be something someone would
think of using for signalling (0x as an example).
tcpdump -i any probably assumes some weird ifindex (haven't looked
at the code).

In any case, 0 is fine too.

cheers,
jamal


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alan Cox
> > 2.  It is very very complicated to answer a question like "is
> > sequence x safe on all of vendor's microprocessors" even for the vendor  
> 
> so far "is sequence x safe" was viewed by cpu vendors as
> "is sequence x going to stop speculative execution".

Incorrect. Modern processors are very very sophisticated beasts and you
are dealing with a wide range of architectures and micro-architectures
that all have their own differences.

> > Intel's current position is 'lfence'.  

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!

There are a set of patches under discussion for eBPF both the JIT and
interpreter. BTW I would expect that there are things Coverity didn't
find, and that we'll also see variants on the attack where different
tricks for measurement emerge that may change what is needed a little.

> which means that POC is relying 64-bit address speculation.
> In the places coverity found the user supplied value is 32-bit,

People have 32-bit computers as well as 64-bit, and in some cases 32 bits is
fine for an attack, depending on where your target is relative to the object.

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.  
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

The last solution I saw proposed in that path for the JIT is to "and" with a
constant, which in that situation clearly makes the most sense, providing
it is safe on all the CPUs involved.
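
A minimal sketch of that "and with a constant" idea, assuming a
power-of-two table size (illustrative; not lifted from the eBPF patches
under discussion):

#define TABLE_SIZE 256			/* assumed: power of two */

static unsigned long table[TABLE_SIZE];

unsigned long bounded_load(unsigned long idx)
{
	/* The AND clamps the index even on a speculative path, unlike a
	 * bounds-checking branch that the CPU may predict past. */
	idx &= TABLE_SIZE - 1;
	return table[idx];
}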

lfence timing is also heavily dependent upon what work has to be done to
retire previous live instructions. BPF does not normally do a lot of
writing so you'd expect the cost to be low.

Anyway - Intel's current position is that lfence is safe. It's not
impossible other sequences will in future be documented as safe by one or
more vendors of x86 processors.

Alan


Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-07 Thread Jiri Pirko
Sun, Jan 07, 2018 at 02:11:19PM CET, j...@mojatatu.com wrote:
>On 18-01-06 03:43 PM, Jiri Pirko wrote:
>
>
>> 
>> > @@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct 
>> > sk_buff *skb,
>> >tcm->tcm_family = AF_UNSPEC;
>> >tcm->tcm__pad1 = 0;
>> >tcm->tcm__pad2 = 0;
>> > -  tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
>> > -  tcm->tcm_parent = parent;
>> > +  if (q) {
>> > +  tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
>> > +  tcm->tcm_parent = parent;
>> > +  } else {
>> > +  tcm->tcm_ifindex = 0; /* block index is stored in parent */
>> > +  tcm->tcm_parent = block->index;
>> > +  }
>> 
>> Please guys, please look at this reuse (also on client side). I would like
>> you to double-check this reuse of existing API for block_index carrying
>> purpose. I believe it's UAPI safe. But please, check it out carefully.
>> 
>
>
>Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex
>(not sure if zero means something special to someone).

Like -1 means parent is block_index?

Why would 0 mean something special? Could you point to a code that
suggests it?

>
>cheers,
>jamal
>


Re: [patch net-next v6 06/11] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-07 Thread Jamal Hadi Salim

On 18-01-06 03:43 PM, Jiri Pirko wrote:





@@ -886,8 +887,13 @@ static int tcf_fill_node(struct net *net, struct sk_buff 
*skb,
tcm->tcm_family = AF_UNSPEC;
tcm->tcm__pad1 = 0;
tcm->tcm__pad2 = 0;
-   tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
-   tcm->tcm_parent = parent;
+   if (q) {
+   tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
+   tcm->tcm_parent = parent;
+   } else {
+   tcm->tcm_ifindex = 0; /* block index is stored in parent */
+   tcm->tcm_parent = block->index;
+   }


Please guys, please look at this reuse (also on client side). I would like
you to double-check this reuse of existing API for block_index carrying
purpose. I believe it's UAPI safe. But please, check it out carefully.




Should not break any ABI/UAPI AFAIK. Maybe go for a negative ifindex
(not sure if zero means something special to someone).

cheers,
jamal



Re: dvb usb issues since kernel 4.9

2018-01-07 Thread Mauro Carvalho Chehab
Em Sat, 6 Jan 2018 16:44:20 -0500 (EST)
Alan Stern  escreveu:

> On Sat, 6 Jan 2018, Mauro Carvalho Chehab wrote:
> 
> > Hi Josef,
> > 
> > Em Sat, 6 Jan 2018 16:04:16 +0100
> > "Josef Griebichler"  escreveu:
> >   
> > > Hi,
> > > 
> > > the causing commit has been identified.
> > > After reverting commit 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882
> > > its working again.  
> > 
> > Just replying to me won't magically fix this. The ones that were involved on
> > this patch should also be c/c, plus USB people. Just added them.
> >   
> > > Please have a look into the thread 
> > > https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?pageNo=13
> > > here are several users aknowledging the revert solves their issues with 
> > > usb dvb cards.  
> > 
> > I read the entire (long) thread there. In order to make easier for the
> > others, from what I understand, the problem happens on both x86 and arm,
> > although almost all comments there are mentioning tests with raspbian
> > Kernel (with uses a different USB host driver than the upstream one).
> > 
> > It happens when watching digital TV DVB-C channels, with usually means
> > a sustained bit rate of 11 MBps to 54 MBps.
> > 
> > The reports mention the dvbsky, with uses USB URB bulk transfers.
> > On every several minutes (5 to 10 mins), the stream suffer "glitches"
> > caused by frame losses.
> > 
> > The part of the thread that contains the bisect is at:
> > 
> > https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=75965#post75965
> > 
> > It indirectly mentions another comment on the thread with points
> > to:
> > https://github.com/raspberrypi/linux/issues/2134
> > 
> > There, it says that this fix part of the issues:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34f41c0316ed52b0b44542491d89278efdaa70e4
> > 
> > but it affects URB packet losses on a lesser extend.
> > 
> > The main issue is really the logic changes a the core softirq logic.
> > 
> > Using Kernel 4.14.10 on a Raspberry Pi 3 with 4cd13c2 commit reverted
> > fixed the issue. 
> > 
> > Joseph, is the above right? Anything else to mention? Does the
> > same issue affect also on x86 with vanilla Kernel 4.14.10?
> > 
> > -
> > 
> > It seems that the original patch were designed to solve some IRQ issues
> > with network cards with causes data losses on high traffic. However,
> > it is also causing bad effects on sustained high bandwidth demands
> > required by DVB cards, at least on some USB host drivers.
> > 
> > Alan/Greg/Eric/David:
> > 
> > Any ideas about how to fix it without causing regressions to
> > network?  
> 
> It would be good to know what hardware was involved on the x86 system
> and to have some timing data.  Can we see the output from lsusb and
> usbmon, running on a vanilla kernel that gets plenty of video glitches?

>From Josef's report, and from the BZ, the affected hardware seems
to be based on Montage Technology M88DS3103/M88TS2022 chipset.
The driver it uses is at drivers/media/usb/dvb-usb-v2/dvbsky.c,
with shares a USB implementation that is used by a lot more drivers.
The URB handling code is at:

drivers/media/usb/dvb-usb-v2/usb_urb.c

This particular driver allocates 8 buffers with 4096 bytes each
for bulk transfers, using transfer_flags = URB_NO_TRANSFER_DMA_MAP.

This become a popular USB hardware nowadays. I have one S960c
myself, so I can send you the lsusb from it. You should notice, however,
that a DVB-C/DVB-S2 channel can easily provide very high sustained bit
rates. Here, on my DVB-S2 provider, a typical transponder produces 58 Mpps
of payload after removing URB headers. A 10 minutes record with the
entire data (with typically contains 5-10 channels) can easily go
above 4 GB, just to reproduce 1-2 glitches. So, I'm not sure if
a usbmon dump would be useful.

I'm enclosing the lsusb from a S960C device, with is based on those
Montage chipsets:

Bus 002 Device 007: ID 0572:960c Conexant Systems (Rockwell), Inc. DVBSky S960C 
DVB-S2 tuner
Couldn't open device, some information will be missing
Device Descriptor:
  bLength18
  bDescriptorType 1
  bcdUSB   2.00
  bDeviceClass0 (Defined at Interface level)
  bDeviceSubClass 0 
  bDeviceProtocol 0 
  bMaxPacketSize064
  idVendor   0x0572 Conexant Systems (Rockwell), Inc.
  idProduct  0x960c DVBSky S960C DVB-S2 tuner
  bcdDevice0.00
  iManufacturer   1 
  iProduct2 
  iSerial 3 
  bNumConfigurations  1
  Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength  219
bNumInterfaces  1
bConfigurationValue 1
iConfiguration  4 

[PATCH net-next 07/18] ipv6: Check nexthop flags during route lookup instead of carrier

2018-01-07 Thread Ido Schimmel
Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
need to dereference the nexthop device and check its carrier and instead
check for the presence of the flag.

Signed-off-by: Ido Schimmel 
---
 net/ipv6/route.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 314e3bf41f6f..ab0eed43ed97 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -474,7 +474,7 @@ static struct rt6_info *rt6_multipath_select(struct 
rt6_info *match,
if (route_choosen == 0) {
struct inet6_dev *idev = sibling->rt6i_idev;
 
-   if (!netif_carrier_ok(sibling->dst.dev) &&
+   if (sibling->rt6i_nh_flags & RTNH_F_LINKDOWN &&
idev->cnf.ignore_routes_with_linkdown)
break;
if (rt6_score_route(sibling, oif, strict) < 0)
@@ -679,10 +679,9 @@ static struct rt6_info *find_match(struct rt6_info *rt, 
int oif, int strict,
int m;
bool match_do_rr = false;
struct inet6_dev *idev = rt->rt6i_idev;
-   struct net_device *dev = rt->dst.dev;
 
-   if (dev && !netif_carrier_ok(dev) &&
-   idev->cnf.ignore_routes_with_linkdown &&
+   if (idev->cnf.ignore_routes_with_linkdown &&
+   rt->rt6i_nh_flags & RTNH_F_LINKDOWN &&
!(strict & RT6_LOOKUP_F_IGNORE_LINKSTATE))
goto out;
 
-- 
2.14.3



[PATCH net-next 02/18] ipv6: Mark dead nexthops with appropriate flags

2018-01-07 Thread Ido Schimmel
When a netdev is put administratively down or unregistered all the
nexthops using it as their nexthop device should be marked with the
'dead' and 'linkdown' flags.

Currently, when a route is dumped its nexthop device is tested and the
flags are set accordingly. A similar check is performed during route
lookup.

Instead, we can simply mark the nexthops based on netdev events and
avoid checking the netdev's state during route dump and lookup.

Signed-off-by: Ido Schimmel 
---
 net/ipv6/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c557362daa23..f5eda0aeab55 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3473,8 +3473,10 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
if (rt->dst.dev == dev &&
rt != adn->net->ipv6.ip6_null_entry &&
(rt->rt6i_nsiblings == 0 || netdev_unregistering(dev) ||
-!rt->rt6i_idev->cnf.ignore_routes_with_linkdown))
+!rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) {
+   rt->rt6i_nh_flags |= (RTNH_F_DEAD | RTNH_F_LINKDOWN);
return -1;
+   }
 
return 0;
 }
-- 
2.14.3



[PATCH net-next 03/18] ipv6: Clear nexthop flags upon netdev up

2018-01-07 Thread Ido Schimmel
Previous patch marked nexthops with the 'dead' and 'linkdown' flags.
Clear these flags when the netdev comes back up.

Signed-off-by: Ido Schimmel 
---
 include/net/ip6_route.h |  1 +
 net/ipv6/addrconf.c |  3 +++
 net/ipv6/route.c| 29 +
 3 files changed, 33 insertions(+)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 18e442ea93d8..caad39198c2a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -169,6 +169,7 @@ void rt6_ifdown(struct net *net, struct net_device *dev);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
+void rt6_sync_up(struct net_device *dev, unsigned int nh_flags);
 
 static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ed06b1190f05..b6405568ed7b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3484,6 +3484,9 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
if (run_pending)
addrconf_dad_run(idev);
 
+   /* Device has an address by now */
+   rt6_sync_up(dev, RTNH_F_DEAD);
+
/*
 * If the MTU changed during the interface down,
 * when the interface up, the changed MTU must be
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f5eda0aeab55..4796d87e0b93 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3459,6 +3459,35 @@ void rt6_clean_tohost(struct net *net, struct in6_addr 
*gateway)
fib6_clean_all(net, fib6_clean_tohost, gateway);
 }
 
+struct arg_netdev_event {
+   const struct net_device *dev;
+   unsigned int nh_flags;
+};
+
+static int fib6_ifup(struct rt6_info *rt, void *p_arg)
+{
+   const struct arg_netdev_event *arg = p_arg;
+   const struct net *net = dev_net(arg->dev);
+
+   if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev)
+   rt->rt6i_nh_flags &= ~arg->nh_flags;
+
+   return 0;
+}
+
+void rt6_sync_up(struct net_device *dev, unsigned int nh_flags)
+{
+   struct arg_netdev_event arg = {
+   .dev = dev,
+   .nh_flags = nh_flags,
+   };
+
+   if (nh_flags & RTNH_F_DEAD && netif_carrier_ok(dev))
+   arg.nh_flags |= RTNH_F_LINKDOWN;
+
+   fib6_clean_all(dev_net(dev), fib6_ifup, &arg);
+}
+
 struct arg_dev_net {
struct net_device *dev;
struct net *net;
-- 
2.14.3



[PATCH net-next 05/18] ipv6: Set nexthop flags upon carrier change

2018-01-07 Thread Ido Schimmel
Similar to IPv4, when the carrier of a netdev changes we should toggle
the 'linkdown' flag on all the nexthops using it as their nexthop
device.

This will later allow us to test for the presence of this flag during
route lookup and dump.

Up until commit 4832c30d5458 ("net: ipv6: put host and anycast routes on
device with address") host and anycast routes used the loopback netdev
as their nexthop device and thus were not marked with the 'linkdown'
flag. The patch preserves this behavior and allows one to ping the local
address even when the nexthop device does not have a carrier and the
'ignore_routes_with_linkdown' sysctl is set.

Signed-off-by: Ido Schimmel 
---
 include/net/ip6_route.h |  1 +
 net/ipv6/addrconf.c |  2 ++
 net/ipv6/route.c| 23 +--
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6a2f80cbdf65..34cd3b0c6ded 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -170,6 +170,7 @@ void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
 void rt6_sync_up(struct net_device *dev, unsigned int nh_flags);
 void rt6_disable_ip(struct net_device *dev, unsigned long event);
+void rt6_sync_down_dev(struct net_device *dev, unsigned long event);
 
 static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a13e1ffe87ec..2435f7ab070b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3438,6 +3438,7 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
} else if (event == NETDEV_CHANGE) {
if (!addrconf_link_ready(dev)) {
/* device is still not ready. */
+   rt6_sync_down_dev(dev, event);
break;
}
 
@@ -3449,6 +3450,7 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
 * multicast snooping switches
 */
ipv6_mc_up(idev);
+   rt6_sync_up(dev, RTNH_F_LINKDOWN);
break;
}
idev->if_flags |= IF_READY;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 194fe9d9cd85..2fd36c7dd143 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3498,18 +3498,29 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
const struct net_device *dev = arg->dev;
const struct net *net = dev_net(dev);
 
-   if (rt->dst.dev == dev &&
-   rt != net->ipv6.ip6_null_entry &&
-   (rt->rt6i_nsiblings == 0 || netdev_unregistering(dev) ||
-!rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) {
-   rt->rt6i_nh_flags |= (RTNH_F_DEAD | RTNH_F_LINKDOWN);
+   if (rt->dst.dev != dev || rt == net->ipv6.ip6_null_entry)
+   return 0;
+
+   switch (arg->event) {
+   case NETDEV_UNREGISTER:
return -1;
+   case NETDEV_DOWN:
+   if (rt->rt6i_nsiblings == 0 ||
+   !rt->rt6i_idev->cnf.ignore_routes_with_linkdown)
+   return -1;
+   rt->rt6i_nh_flags |= RTNH_F_DEAD;
+   /* fall through */
+   case NETDEV_CHANGE:
+   if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST))
+   break;
+   rt->rt6i_nh_flags |= RTNH_F_LINKDOWN;
+   break;
}
 
return 0;
 }
 
-static void rt6_sync_down_dev(struct net_device *dev, unsigned long event)
+void rt6_sync_down_dev(struct net_device *dev, unsigned long event)
 {
struct arg_netdev_event arg = {
.dev = dev,
-- 
2.14.3



[PATCH net-next 04/18] ipv6: Prepare to handle multiple netdev events

2018-01-07 Thread Ido Schimmel
To make IPv6 more in line with IPv4 we need to be able to respond
differently to different netdev events. For example, when a netdev is
unregistered all the routes using it as their nexthop device should be
flushed, whereas when the netdev's carrier changes only the 'linkdown'
flag should be toggled.

Currently, this is not possible, as the function that traverses the
routing tables is not aware of the triggering event.

Propagate the triggering event down, so that it could be used in later
patches.

Signed-off-by: Ido Schimmel 
---
 include/net/ip6_route.h |  2 +-
 net/ipv6/addrconf.c |  4 ++--
 net/ipv6/route.c| 37 +
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index caad39198c2a..6a2f80cbdf65 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -165,11 +165,11 @@ struct rt6_rtnl_dump_arg {
 };
 
 int rt6_dump_route(struct rt6_info *rt, void *p_arg);
-void rt6_ifdown(struct net *net, struct net_device *dev);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
 void rt6_sync_up(struct net_device *dev, unsigned int nh_flags);
+void rt6_disable_ip(struct net_device *dev, unsigned long event);
 
 static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b6405568ed7b..a13e1ffe87ec 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3580,6 +3580,7 @@ static bool addr_is_local(const struct in6_addr *addr)
 
 static int addrconf_ifdown(struct net_device *dev, int how)
 {
+   unsigned long event = how ? NETDEV_UNREGISTER : NETDEV_DOWN;
struct net *net = dev_net(dev);
struct inet6_dev *idev;
struct inet6_ifaddr *ifa, *tmp;
@@ -3589,8 +3590,7 @@ static int addrconf_ifdown(struct net_device *dev, int 
how)
 
ASSERT_RTNL();
 
-   rt6_ifdown(net, dev);
-   neigh_ifdown(&nd_tbl, dev);
+   rt6_disable_ip(dev, event);
 
idev = __in6_dev_get(dev);
if (!idev)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4796d87e0b93..194fe9d9cd85 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2344,7 +2344,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
rt->rt6i_idev = idev;
dst_metric_set(&rt->dst, RTAX_HOPLIMIT, 0);
 
-   /* Add this dst into uncached_list so that rt6_ifdown() can
+   /* Add this dst into uncached_list so that rt6_disable_ip() can
 * do proper release of the net_device
 */
rt6_uncached_list_add(rt);
@@ -3461,7 +3461,10 @@ void rt6_clean_tohost(struct net *net, struct in6_addr 
*gateway)
 
 struct arg_netdev_event {
const struct net_device *dev;
-   unsigned int nh_flags;
+   union {
+   unsigned int nh_flags;
+   unsigned long event;
+   };
 };
 
 static int fib6_ifup(struct rt6_info *rt, void *p_arg)
@@ -3488,19 +3491,15 @@ void rt6_sync_up(struct net_device *dev, unsigned int 
nh_flags)
fib6_clean_all(dev_net(dev), fib6_ifup, &arg);
 }
 
-struct arg_dev_net {
-   struct net_device *dev;
-   struct net *net;
-};
-
 /* called with write lock held for table with rt */
-static int fib6_ifdown(struct rt6_info *rt, void *arg)
+static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 {
-   const struct arg_dev_net *adn = arg;
-   const struct net_device *dev = adn->dev;
+   const struct arg_netdev_event *arg = p_arg;
+   const struct net_device *dev = arg->dev;
+   const struct net *net = dev_net(dev);
 
if (rt->dst.dev == dev &&
-   rt != adn->net->ipv6.ip6_null_entry &&
+   rt != net->ipv6.ip6_null_entry &&
(rt->rt6i_nsiblings == 0 || netdev_unregistering(dev) ||
 !rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) {
rt->rt6i_nh_flags |= (RTNH_F_DEAD | RTNH_F_LINKDOWN);
@@ -3510,15 +3509,21 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg)
return 0;
 }
 
-void rt6_ifdown(struct net *net, struct net_device *dev)
+static void rt6_sync_down_dev(struct net_device *dev, unsigned long event)
 {
-   struct arg_dev_net adn = {
+   struct arg_netdev_event arg = {
.dev = dev,
-   .net = net,
+   .event = event,
};
 
-   fib6_clean_all(net, fib6_ifdown, &adn);
-   rt6_uncached_list_flush_dev(net, dev);
+   fib6_clean_all(dev_net(dev), fib6_ifdown, &arg);
+}
+
+void rt6_disable_ip(struct net_device *dev, unsigned long event)
+{
+   rt6_sync_down_dev(dev, event);
+   rt6_uncached_list_flush_dev(dev_net(dev), dev);
+   neigh_ifdown(&nd_tbl, dev);
 }
 
 struct rt6_mtu_change_arg {
-- 
2.14.3



[PATCH net-next 15/18] ipv6: Flush multipath routes when all siblings are dead

2018-01-07 Thread Ido Schimmel
By default, IPv6 deletes nexthops from a multipath route when the
nexthop device is put administratively down. This differs from IPv4
where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
multipath route is flushed when all of its nexthops become dead.

Align IPv6 with IPv4 and have it conform to the same guidelines.

In case the multipath route needs to be flushed, its siblings are
flushed one by one. Otherwise, the nexthops are marked with the
appropriate flags and the tree walker is instructed to skip all the
siblings.

As explained in previous patches, care is taken to update the sernum of
the affected tree nodes, so as to prevent the use of wrong dst entries.
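
A worked scenario may help picture the new behaviour (illustrative;
device names assumed):

/*
 * Multipath route with two nexthops, one over eth0 and one over eth1:
 *
 *   ip link set dev eth0 down
 *     -> the eth0 nexthop is marked RTNH_F_DEAD | RTNH_F_LINKDOWN and the
 *        route stays in the FIB, matching IPv4 behaviour
 *
 *   ip link set dev eth1 down
 *     -> all nexthops are now dead, so the whole multipath route is
 *        flushed, sibling by sibling
 */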

Signed-off-by: Ido Schimmel 
---
 net/ipv6/route.c | 83 ++--
 1 file changed, 75 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a3bfce71c861..1054b059747f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3486,8 +3486,10 @@ static int fib6_ifup(struct rt6_info *rt, void *p_arg)
const struct arg_netdev_event *arg = p_arg;
const struct net *net = dev_net(arg->dev);
 
-   if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev)
+   if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev) {
rt->rt6i_nh_flags &= ~arg->nh_flags;
+   fib6_update_sernum_upto_root(dev_net(rt->dst.dev), rt);
+   }
 
return 0;
 }
@@ -3505,6 +3507,58 @@ void rt6_sync_up(struct net_device *dev, unsigned int nh_flags)
	fib6_clean_all(dev_net(dev), fib6_ifup, &arg);
 }
 
+static bool rt6_multipath_uses_dev(const struct rt6_info *rt,
+  const struct net_device *dev)
+{
+   struct rt6_info *iter;
+
+   if (rt->dst.dev == dev)
+   return true;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+   if (iter->dst.dev == dev)
+   return true;
+
+   return false;
+}
+
+static void rt6_multipath_flush(struct rt6_info *rt)
+{
+   struct rt6_info *iter;
+
+   rt->should_flush = 1;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+   iter->should_flush = 1;
+}
+
+static unsigned int rt6_multipath_dead_count(const struct rt6_info *rt,
+const struct net_device *down_dev)
+{
+   struct rt6_info *iter;
+   unsigned int dead = 0;
+
+   if (rt->dst.dev == down_dev || rt->rt6i_nh_flags & RTNH_F_DEAD)
+   dead++;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+   if (iter->dst.dev == down_dev ||
+   iter->rt6i_nh_flags & RTNH_F_DEAD)
+   dead++;
+
+   return dead;
+}
+
+static void rt6_multipath_nh_flags_set(struct rt6_info *rt,
+  const struct net_device *dev,
+  unsigned int nh_flags)
+{
+   struct rt6_info *iter;
+
+   if (rt->dst.dev == dev)
+   rt->rt6i_nh_flags |= nh_flags;
+	list_for_each_entry(iter, &rt->rt6i_siblings, rt6i_siblings)
+   if (iter->dst.dev == dev)
+   iter->rt6i_nh_flags |= nh_flags;
+}
+
 /* called with write lock held for table with rt */
 static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 {
@@ -3512,20 +3566,33 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
const struct net_device *dev = arg->dev;
const struct net *net = dev_net(dev);
 
-   if (rt->dst.dev != dev || rt == net->ipv6.ip6_null_entry)
+   if (rt == net->ipv6.ip6_null_entry)
return 0;
 
switch (arg->event) {
case NETDEV_UNREGISTER:
-   return -1;
+   return rt->dst.dev == dev ? -1 : 0;
case NETDEV_DOWN:
-   if (rt->rt6i_nsiblings == 0 ||
-   !rt->rt6i_idev->cnf.ignore_routes_with_linkdown)
+   if (rt->should_flush)
return -1;
-   rt->rt6i_nh_flags |= RTNH_F_DEAD;
-   /* fall through */
+   if (!rt->rt6i_nsiblings)
+   return rt->dst.dev == dev ? -1 : 0;
+   if (rt6_multipath_uses_dev(rt, dev)) {
+   unsigned int count;
+
+   count = rt6_multipath_dead_count(rt, dev);
+   if (rt->rt6i_nsiblings + 1 == count) {
+   rt6_multipath_flush(rt);
+   return -1;
+   }
+   rt6_multipath_nh_flags_set(rt, dev, RTNH_F_DEAD |
+  RTNH_F_LINKDOWN);
+   fib6_update_sernum(rt);
+   }
+   return -2;
case NETDEV_CHANGE:
-   if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST))
+   if (rt->dst.dev != dev ||
+

[PATCH net-next 14/18] ipv6: Take table lock outside of sernum update function

2018-01-07 Thread Ido Schimmel
The next patch is going to allow dead routes to remain in the FIB tree
in certain situations.

When this happens we need to be sure to bump the sernum of the nodes
where these are stored so that potential copies cached in sockets are
invalidated.

The function that performs this update assumes the table lock is not
taken when it is invoked, but that will not be the case when it is
invoked by the tree walker.

Have the function assume the lock is taken and make the single caller
take the lock itself.
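
As a sketch of the before/after contract (a toy userspace model using a
pthread mutex, not the kernel code or the actual tb6_lock), the old helper
took the lock itself and so could not be called from a context that already
holds the table lock, while the new helper simply assumes its single caller
holds it:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tb6_lock = PTHREAD_MUTEX_INITIALIZER;
static int sernum;

/* Old shape: the helper takes the lock itself.  Calling it from a path
 * that already holds tb6_lock (e.g. the tree walker) would deadlock. */
static void update_sernum_taking_lock(void)
{
	pthread_mutex_lock(&tb6_lock);
	sernum++;
	pthread_mutex_unlock(&tb6_lock);
}

/* New shape: the caller is responsible for holding tb6_lock. */
static void update_sernum_locked(void)
{
	sernum++;
}

int main(void)
{
	update_sernum_taking_lock();	/* old calling convention */

	pthread_mutex_lock(&tb6_lock);	/* new: the single caller takes the lock */
	update_sernum_locked();
	pthread_mutex_unlock(&tb6_lock);

	printf("sernum=%d\n", sernum);
	return 0;
}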

Signed-off-by: Ido Schimmel 
---
 net/ipv6/ip6_fib.c | 5 +----
 net/ipv6/route.c   | 2 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index c1bbe7bf9fdd..edda5ad3b405 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -107,16 +107,13 @@ enum {
 
 void fib6_update_sernum(struct rt6_info *rt)
 {
-   struct fib6_table *table = rt->rt6i_table;
struct net *net = dev_net(rt->dst.dev);
struct fib6_node *fn;
 
-	spin_lock_bh(&table->tb6_lock);
fn = rcu_dereference_protected(rt->rt6i_node,
-	lockdep_is_held(&table->tb6_lock));
+	lockdep_is_held(&rt->rt6i_table->tb6_lock));
if (fn)
fn->fn_sernum = fib6_new_sernum(net);
-	spin_unlock_bh(&table->tb6_lock);
 }
 
 /*
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f62d24948aa2..a3bfce71c861 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1353,7 +1353,9 @@ static int rt6_insert_exception(struct rt6_info *nrt,
 
/* Update fn->fn_sernum to invalidate all cached dst */
if (!err) {
+	spin_lock_bh(&ort->rt6i_table->tb6_lock);
fib6_update_sernum(ort);
+	spin_unlock_bh(&ort->rt6i_table->tb6_lock);
fib6_force_start_gc(net);
}
 
-- 
2.14.3


