Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 06:44:24PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:  
> > > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:  
> > > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> > > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > > > > 
> > > > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > > > format?  Or will it be reconfigurable at runtime?  
> > > > > > 
> > > > > > let's table configuration of metadata aside for a second.
> > > > > > 
> > > > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > > > NIC has to users in a standard and generic way.
> > > > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > > > and hw powercycled to have new md layout (and corresponding BTF 
> > > > > > description)
> > > > > > is a separate discussion.
> > > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct 
> > > > > > xdp_md_info'
> > > > > > to reach 'hash' value in metadata.
> > > > > > Essentially it's a run-time way to access 'hash' instead of 
> > > > > > build-time.
> > > > > > So bpf program would need two loads to read csum or hash field 
> > > > > > instead of one.
> > > > > > With BTF the layout of metadata is known to the program at 
> > > > > > build-time.
> > > > > > 
> > > > > > To reiterate the proposal:
> > > > > > - driver+firmware keep layout of the metadata in BTF format (either 
> > > > > > in the driver
> > > > > >   or driver can read it from firmware)
> > > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query 
> > > > > > the driver and
> > > > > >   generate normal C header file based on BTF in the given NIC
> > > > > > - user does #include "md_desc.h" and bpf program can access 
> > > > > > md->csum or md->hash
> > > > > >   with direct single load out of metadata area in front of the 
> > > > > > packet
> > > > > > - llvm compiles bpf program and records how program is doing this 
> > > > > > md->csum accesses
> > > > > >   in BTF format as well (the compiler will be keeping such records
> > > > > >   for __sk_buff and all other structs too, but that's separate 
> > > > > > discussion)
> > > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) 
> > > > > > that the way the program
> > > > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > > > >   so no surprises if driver+firmware got updated, but program is 
> > > > > > not recompiled
> > > > > > - every NIC can have their own layout of metadata and its own 
> > > > > > meaning of the fields,
> > > > > >   but would be good to standardize at least a few common fields 
> > > > > > like hash
> > > > > 
> > > > > Can I expose HW descriptors this way, though, or is the proposal to
> > > > > copy this data into the packet buffer?
> > > > 
> > > > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > > > but I don't think it would buy us anything. AF_XDP approach is better.
> > > >  
> > > > > > Once this is working we can do more cool things with BTF.
> > > > > > Like doing offset rewriting at program load time similar to what we 
> > > > > > plan
> > > > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > > > and the kernel will adjust offsetof(struct task_struct, pid) during 
> > > > > > program load
> > > > > > depending on BTF for the kernel.
> > > > > > The same trick we can do for networking metadata.
> > > > > > The program will contain load instruction md->hash that will get 
> > > > > > automatically
> > > > > > adjusted to proper offset depending on BTF of 'hash' field in the 
> > > > > > NIC.
> > > > > > For now I'm proposing _not_ to go that far with offset rewriting 
> > > > > > and start
> > > > > > with simple steps described above.
> > > > > 
> > > > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > > > easier than task->pid, because we have the synthetic user space struct
> > > > > xdp_md definition.
> > > > 
> > > > I don't understand 'impact fast path yet again' concern.
> > > > If NIC has certain metadata today, just derscribe what it has in BTF
> > > > and supply the actual per-packet md to xdp program as-is.
> > > > No changes for fast path at all.
> > > > Future rewritting will be done by the bpf/xdp core with zero
> > > > changes to the driver. All driver has to do is to provide BTF.  
> > > 
> > > I'm confused.  AFAIK most *existing* NICs have 

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 06:20:47PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> > > 
> > > Could we just say that at the start we expose only existing SKB fields
> > > (csum, hash, mark) and the meaning of the is the same as in the SKB?  
> > 
> > what would be the meaning of 'hash' ? Over which fields?
> > Does it support inner and outer packets? How about udp encap (vxlan and 
> > friends) ?
> 
> We don't seem to need to answer that for the rest of the stack, no?  We
> can expose the "hash type" field as well if that's *really* necessary.
> 
> > Same question of csum... tcp only? 
> 
> Shouldn't we just stick to CHECKSUM_COMPLETE?

I don't yet see how checksum_complete and 'hash just like stack'
helps to accelerate xdp programs like ddos and load balancer.
'hash like stack' may help to accelerate networking stack itself,
but that's very different discussion. There is no xdp in such case.
No programs and no metadata. If some NICs can do
'hash like stack' and demonstrate performance improvements
for regular tcp/ip processing with netperf in user space
that would definitely be interesting and standardizing such
hash from the drivers into upper layers would warrant its own patches,
but seems orthogonal to this xdp hints discussion.



Re: [PATCH V2 net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2

2018-07-06 Thread David Miller
From: Felix Manlunas 
Date: Fri, 6 Jul 2018 11:27:07 -0700

> From: Rick Farrington 
> 
> Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get
> speed" command which is sent by the 1.7.2 driver (for CN23XX-225 cards
> only) during modprobe.  Due to a bug in older firmware (with respect to
> unknown commands), this unsupported command causes a cascade of errors that
> ends in a kernel panic.
> 
> Fix it by making the sending of the "get speed" command conditional on the
> firmware version.
> 
> Signed-off-by: Rick Farrington 
> Signed-off-by: Felix Manlunas 

Applied, thanks.


Re: [PATCH net-next 0/6] sock cookie initializers

2018-07-06 Thread David Miller
From: Willem de Bruijn 
Date: Fri,  6 Jul 2018 10:12:53 -0400

> From: Willem de Bruijn 
> 
> Recent UDP GSO and SO_TXTIME features added new fields to cookie
> structs.
> 
> When adding a field, all sites where a struct is initialized have to
> be updated, which is a lot of boilerplate. Alternatively, a field can
> be initialized selectively, but this is fragile. I introduced a bug
> in udp gso where an uninitialized field was read. See also fix commit
> ("9887cba19978 ip: limit use of gso_size to udp").
> 
> Introduce initializers for structs ipcm(6)_cookie and sockc_cookie.
> 
> patch 1..3 do exactly this.
> patch 4..5 make ipv4 and ipv6 handle cookies the same way and
>remove some boilerplate in doing so.
> patch 6removes the udp gso branch that needed the above fix

Series applied, thanks Willem.


Re: [PATCH net-next 0/3] fix use-after-free bugs in skb list processing

2018-07-06 Thread David Miller
From: Edward Cree 
Date: Fri, 6 Jul 2018 18:01:04 +0100

> A couple of bugs in skb list handling were spotted by Dan Carpenter, with
>  the help of Smatch; following up on them I found a couple more similar
>  cases.  This series fixes them by changing the relevant loops to use the
>  dequeue-enqueue model (rather than in-place list modification), and then
>  adds a list.h helper macro to refactor code using the dequeue-enqueue
>  model.

Edward, I think the helper added in patch #3 is not all that great.

It can be even more confusing to understand what the code is doing in
my opinion.  I am pretty sure more hackers would need to go and
read the macro definition in list.h before understanding what the
read side effects of it are.

Would you please resubmit this series including only patches #1
and #2?

Thank you.


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Jakub Kicinski
On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:  
> > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:  
> > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > > > 
> > > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > > format?  Or will it be reconfigurable at runtime?  
> > > > > 
> > > > > let's table configuration of metadata aside for a second.
> > > > > 
> > > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > > NIC has to users in a standard and generic way.
> > > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > > and hw powercycled to have new md layout (and corresponding BTF 
> > > > > description)
> > > > > is a separate discussion.
> > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct 
> > > > > xdp_md_info'
> > > > > to reach 'hash' value in metadata.
> > > > > Essentially it's a run-time way to access 'hash' instead of 
> > > > > build-time.
> > > > > So bpf program would need two loads to read csum or hash field 
> > > > > instead of one.
> > > > > With BTF the layout of metadata is known to the program at build-time.
> > > > > 
> > > > > To reiterate the proposal:
> > > > > - driver+firmware keep layout of the metadata in BTF format (either 
> > > > > in the driver
> > > > >   or driver can read it from firmware)
> > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query 
> > > > > the driver and
> > > > >   generate normal C header file based on BTF in the given NIC
> > > > > - user does #include "md_desc.h" and bpf program can access md->csum 
> > > > > or md->hash
> > > > >   with direct single load out of metadata area in front of the packet
> > > > > - llvm compiles bpf program and records how program is doing this 
> > > > > md->csum accesses
> > > > >   in BTF format as well (the compiler will be keeping such records
> > > > >   for __sk_buff and all other structs too, but that's separate 
> > > > > discussion)
> > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that 
> > > > > the way the program
> > > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > > >   so no surprises if driver+firmware got updated, but program is not 
> > > > > recompiled
> > > > > - every NIC can have their own layout of metadata and its own meaning 
> > > > > of the fields,
> > > > >   but would be good to standardize at least a few common fields like 
> > > > > hash
> > > > 
> > > > Can I expose HW descriptors this way, though, or is the proposal to
> > > > copy this data into the packet buffer?
> > > 
> > > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > > but I don't think it would buy us anything. AF_XDP approach is better.
> > >  
> > > > > Once this is working we can do more cool things with BTF.
> > > > > Like doing offset rewriting at program load time similar to what we 
> > > > > plan
> > > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > > and the kernel will adjust offsetof(struct task_struct, pid) during 
> > > > > program load
> > > > > depending on BTF for the kernel.
> > > > > The same trick we can do for networking metadata.
> > > > > The program will contain load instruction md->hash that will get 
> > > > > automatically
> > > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > > > For now I'm proposing _not_ to go that far with offset rewriting and 
> > > > > start
> > > > > with simple steps described above.
> > > > 
> > > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > > easier than task->pid, because we have the synthetic user space struct
> > > > xdp_md definition.
> > > 
> > > I don't understand 'impact fast path yet again' concern.
> > > If NIC has certain metadata today, just derscribe what it has in BTF
> > > and supply the actual per-packet md to xdp program as-is.
> > > No changes for fast path at all.
> > > Future rewritting will be done by the bpf/xdp core with zero
> > > changes to the driver. All driver has to do is to provide BTF.  
> > 
> > I'm confused.  AFAIK most *existing* NICs have the metadata in the
> > "descriptor", i.e. not in the packet buffer.  So if the NIC just
> > describes what it has, and there is no data shuffling/copying
> > (performance) then we have to expose the descriptor, no?  
> 
> which piece of sw put that data into desciptor ?
> I bet 

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 6 Jul 2018 17:53:18 -0700

> I'd like to push back on firmware folks that should be listening
> to feedback from driver folks and kernel stack instead of saying
> 'here is hw spec that firmware provides'. Firmware is software.
> It can change and should be open to change by the community
> with proper maintainership.

+1


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread David Miller
From: Alexei Starovoitov 
Date: Fri, 6 Jul 2018 09:30:42 -0700

> Like doing offset rewriting at program load time similar to what we plan
> to do for tracing. Tracing programs will be doing 'task->pid' access
> and the kernel will adjust offsetof(struct task_struct, pid) during program 
> load
> depending on BTF for the kernel.
> The same trick we can do for networking metadata.
> The program will contain load instruction md->hash that will get automatically
> adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> For now I'm proposing _not_ to go that far with offset rewriting and start
> with simple steps described above.

+1


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Jakub Kicinski
On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> > 
> > Could we just say that at the start we expose only existing SKB fields
> > (csum, hash, mark) and the meaning of the is the same as in the SKB?  
> 
> what would be the meaning of 'hash' ? Over which fields?
> Does it support inner and outer packets? How about udp encap (vxlan and 
> friends) ?

We don't seem to need to answer that for the rest of the stack, no?  We
can expose the "hash type" field as well if that's *really* necessary.

> Same question of csum... tcp only? 

Shouldn't we just stick to CHECKSUM_COMPLETE?

> how about crc for sctp?

That's harder to answer.  Can cls_bpf access such info?

> What is 'mark' ?

Same thing it would be on an skb.  Most likely set with an offloaded TC
rule?


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> 
> Could we just say that at the start we expose only existing SKB fields
> (csum, hash, mark) and the meaning of the is the same as in the SKB?

what would be the meaning of 'hash' ? Over which fields?
Does it support inner and outer packets? How about udp encap (vxlan and 
friends) ?
Same question of csum... tcp only? how about crc for sctp?
What is 'mark' ?



Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:  
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > format?  Or will it be reconfigurable at runtime?
> > > > 
> > > > let's table configuration of metadata aside for a second.
> > > > 
> > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF 
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside 'struct 
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > So bpf program would need two loads to read csum or hash field instead 
> > > > of one.
> > > > With BTF the layout of metadata is known to the program at build-time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format (either in 
> > > > the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the 
> > > > driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md->csum or 
> > > > md->hash
> > > >   with direct single load out of metadata area in front of the packet
> > > > - llvm compiles bpf program and records how program is doing this 
> > > > md->csum accesses
> > > >   in BTF format as well (the compiler will be keeping such records
> > > >   for __sk_buff and all other structs too, but that's separate 
> > > > discussion)
> > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that 
> > > > the way the program
> > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > >   so no surprises if driver+firmware got updated, but program is not 
> > > > recompiled
> > > > - every NIC can have their own layout of metadata and its own meaning 
> > > > of the fields,
> > > >   but would be good to standardize at least a few common fields like 
> > > > hash  
> > > 
> > > Can I expose HW descriptors this way, though, or is the proposal to
> > > copy this data into the packet buffer?  
> > 
> > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > but I don't think it would buy us anything. AF_XDP approach is better.
> >
> > > > Once this is working we can do more cool things with BTF.
> > > > Like doing offset rewriting at program load time similar to what we plan
> > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > and the kernel will adjust offsetof(struct task_struct, pid) during 
> > > > program load
> > > > depending on BTF for the kernel.
> > > > The same trick we can do for networking metadata.
> > > > The program will contain load instruction md->hash that will get 
> > > > automatically
> > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > > For now I'm proposing _not_ to go that far with offset rewriting and 
> > > > start
> > > > with simple steps described above.  
> > > 
> > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > easier than task->pid, because we have the synthetic user space struct
> > > xdp_md definition.  
> > 
> > I don't understand 'impact fast path yet again' concern.
> > If NIC has certain metadata today, just derscribe what it has in BTF
> > and supply the actual per-packet md to xdp program as-is.
> > No changes for fast path at all.
> > Future rewritting will be done by the bpf/xdp core with zero
> > changes to the driver. All driver has to do is to provide BTF.
> 
> I'm confused.  AFAIK most *existing* NICs have the metadata in the
> "descriptor", i.e. not in the packet buffer.  So if the NIC just
> describes what it has, and there is no data shuffling/copying
> (performance) then we have to expose the descriptor, no?

which piece of sw put that data into desciptor ?
I bet it's firmware. It could have stored it into pre-packet data, no?
I'd like to avoid _all_ copies.
Right now xdp program can only see a pointer to packet and pre-packet.
If we need another pointer to a piece of the packet descriptor,
that's also fine. Both pre-packet metadata and pieces of 

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 11:49:55PM +, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote:
> > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW
> > > > > can
> > > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > > that
> > > > > future ASICs/FW will be designed to always use the "blessed"
> > > > > BTF
> > > > > format?  Or will it be reconfigurable at runtime?
> > > > 
> > > > let's table configuration of metadata aside for a second.
> > > 
> > > I agree that this should/could be NIC-specific and shouldn't weigh
> > > on
> > > the metadata interface between the drivers and XDP layer.
> > > 
> > > > Describing metedata layout in BTF allows NICs to disclose
> > > > everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside
> > > > 'struct
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of
> > > > build-
> > > > time.
> > > > So bpf program would need two loads to read csum or hash field
> > > > instead of one.
> > > > With BTF the layout of metadata is known to the program at build-
> > > > time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format
> > > > (either
> > > > in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > > query
> > > > the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md-
> > > > >csum
> > > > or md->hash
> > > >   with direct single load out of metadata area in front of the
> > > > packet
> > > 
> > > This piece is where I'd like to discuss more.  When we discussed
> > > this
> > > in Seoul, the initial proposal was a static struct that we'd try to
> > > hammer out a common layout between the interested parties.  That
> > > obviously wasn't going to scale, and we wanted to pursue something
> > > more
> > > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > > want
> > > to care what the underlying device is, and could just ask for
> > > metadata
> > > that it's interested in.  With this approach, your eBPF program is
> > > now
> > > bound/tied to the NIC/driver, and if you switched to a differen
> > > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > > program
> > > to comprehend that.  I thought we were trying to avoid that.
> > 
> > It looks to me that NICs have a lot more differences instead of
> > common pieces.
> > If we focus the architecture on making common things as generic
> > as possible we miss the bigger picture of covering distinct
> > and unique features that hw provides.
> > 
> > In the last email when I said "would be good to standardize at least
> > a few common fields"
> > I meant that it make sense only at the later phases.
> > I don't think we should put things into uapi upfront.
> > Otherwise we will end up with likely useless fields.
> > Like vlan field. Surely a lot of nics can store it into metadata, but
> > why?
> > bpf program can easily read it from the packet.
> > What's the benefit of adding it to md?
> > This metadata concept we're discussing in the context of program
> > acceleration.
> > If metadata doesn't give perf benefits it should not be done by
> > firmware, it should not be supported by the driver and certainly
> > should not be part of uapi.
> > Hence I'd like to see pure BTF description without any
> > common/standard
> > fields across the drivers and figure out what (if anything) is useful
> > to accelerate xdp programs (and af_xdp in the future).
> 
> I guess I didn't write my response very well before, apologies.  I
> actually really like the BTF description and the dynamic-ness of it. 
> It solves lots of problems and keeps things out of the UAPI.
> 
> What I was getting at was using the BTF description and dumping a
> header file to be used in the program to be loaded.  If we have an
> application developer include that header, it will be the Intel BTF
> description, or Mellanox BTF description, etc.  I'm fine with that, but
> it makes your program not as portable, since you'd need to get a
> different header and recompile it if you change the NIC.  Unless I'm
> missing something.

Correct. It makes programs not portable.
That is phase one to figure out what's useful.
Rewriting of offsets and figuring out 

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Jakub Kicinski
On Fri, 6 Jul 2018 23:49:55 +, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote:  
> > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:  
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW
> > > > > can
> > > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > > that
> > > > > future ASICs/FW will be designed to always use the "blessed"
> > > > > BTF
> > > > > format?  Or will it be reconfigurable at runtime?  
> > > > 
> > > > let's table configuration of metadata aside for a second.  
> > > 
> > > I agree that this should/could be NIC-specific and shouldn't weigh
> > > on
> > > the metadata interface between the drivers and XDP layer.
> > >   
> > > > Describing metedata layout in BTF allows NICs to disclose
> > > > everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside
> > > > 'struct
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of
> > > > build-
> > > > time.
> > > > So bpf program would need two loads to read csum or hash field
> > > > instead of one.
> > > > With BTF the layout of metadata is known to the program at build-
> > > > time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format
> > > > (either
> > > > in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > > query
> > > > the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md-  
> > > > >csum  
> > > > or md->hash
> > > >   with direct single load out of metadata area in front of the
> > > > packet  
> > > 
> > > This piece is where I'd like to discuss more.  When we discussed
> > > this
> > > in Seoul, the initial proposal was a static struct that we'd try to
> > > hammer out a common layout between the interested parties.  That
> > > obviously wasn't going to scale, and we wanted to pursue something
> > > more
> > > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > > want
> > > to care what the underlying device is, and could just ask for
> > > metadata
> > > that it's interested in.  With this approach, your eBPF program is
> > > now
> > > bound/tied to the NIC/driver, and if you switched to a differen
> > > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > > program
> > > to comprehend that.  I thought we were trying to avoid that.  
> > 
> > It looks to me that NICs have a lot more differences instead of
> > common pieces.
> > If we focus the architecture on making common things as generic
> > as possible we miss the bigger picture of covering distinct
> > and unique features that hw provides.
> > 
> > In the last email when I said "would be good to standardize at least
> > a few common fields"
> > I meant that it make sense only at the later phases.
> > I don't think we should put things into uapi upfront.
> > Otherwise we will end up with likely useless fields.
> > Like vlan field. Surely a lot of nics can store it into metadata, but
> > why?

FWIW vlan is extracted from the packet because of cBPF uABI.  tcpdump
breaks badly if VLAN header is in the packet buffer.  It may not be
entirely without merit to expose the field for writing, perhaps that can
simplify some EVPN use cases?  But that's most likely off-topic.

> > bpf program can easily read it from the packet.
> > What's the benefit of adding it to md?
> > This metadata concept we're discussing in the context of program
> > acceleration.
> > If metadata doesn't give perf benefits it should not be done by
> > firmware, it should not be supported by the driver and certainly
> > should not be part of uapi.
> > Hence I'd like to see pure BTF description without any
> > common/standard
> > fields across the drivers and figure out what (if anything) is useful
> > to accelerate xdp programs (and af_xdp in the future).  
> 
> I guess I didn't write my response very well before, apologies.  I
> actually really like the BTF description and the dynamic-ness of it. 
> It solves lots of problems and keeps things out of the UAPI.
> 
> What I was getting at was using the BTF description and dumping a
> header file to be used in the program to be loaded.  If we have an
> application developer include that header, it will be the Intel BTF
> description, or Mellanox BTF description, etc.  I'm fine with that, 

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Jakub Kicinski
On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:  
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW can
> > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > format?  Or will it be reconfigurable at runtime?
> > > 
> > > let's table configuration of metadata aside for a second.
> > > 
> > > Describing metedata layout in BTF allows NICs to disclose everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF 
> > > description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside 'struct 
> > > xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > So bpf program would need two loads to read csum or hash field instead of 
> > > one.
> > > With BTF the layout of metadata is known to the program at build-time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format (either in 
> > > the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the 
> > > driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md->csum or 
> > > md->hash
> > >   with direct single load out of metadata area in front of the packet
> > > - llvm compiles bpf program and records how program is doing this 
> > > md->csum accesses
> > >   in BTF format as well (the compiler will be keeping such records
> > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the 
> > > way the program
> > >   accesses metadata (and other structs) matches BTF from the driver,
> > >   so no surprises if driver+firmware got updated, but program is not 
> > > recompiled
> > > - every NIC can have their own layout of metadata and its own meaning of 
> > > the fields,
> > >   but would be good to standardize at least a few common fields like hash 
> > >  
> > 
> > Can I expose HW descriptors this way, though, or is the proposal to
> > copy this data into the packet buffer?  
> 
> That crossed my mind too. We can use BTF to describe HW descriptors too,
> but I don't think it would buy us anything. AF_XDP approach is better.
>
> > > Once this is working we can do more cool things with BTF.
> > > Like doing offset rewriting at program load time similar to what we plan
> > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > and the kernel will adjust offsetof(struct task_struct, pid) during 
> > > program load
> > > depending on BTF for the kernel.
> > > The same trick we can do for networking metadata.
> > > The program will contain load instruction md->hash that will get 
> > > automatically
> > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > with simple steps described above.  
> > 
> > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > impacting fast paths of the drivers yet again?  This rewrite should be
> > easier than task->pid, because we have the synthetic user space struct
> > xdp_md definition.  
> 
> I don't understand 'impact fast path yet again' concern.
> If NIC has certain metadata today, just derscribe what it has in BTF
> and supply the actual per-packet md to xdp program as-is.
> No changes for fast path at all.
> Future rewritting will be done by the bpf/xdp core with zero
> changes to the driver. All driver has to do is to provide BTF.

I'm confused.  AFAIK most *existing* NICs have the metadata in the
"descriptor", i.e. not in the packet buffer.  So if the NIC just
describes what it has, and there is no data shuffling/copying
(performance) then we have to expose the descriptor, no?

In case of NFP half of the metadata is in the packet buffer, half in
the descriptor.

Maybe I'm conflating your proposal with Saeed's.  In your proposal
would we still use the overload of metadata prepend (as in struct
xdp_md::data_meta, bpf_xdp_adjust_meta() etc.) or are we adding a
pointer to a new driver-defined struct inside struct xdp_md?

Thanks for bearing with me!


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Waskiewicz Jr, Peter
On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote:
> > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW
> > > > can
> > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > that
> > > > future ASICs/FW will be designed to always use the "blessed"
> > > > BTF
> > > > format?  Or will it be reconfigurable at runtime?
> > > 
> > > let's table configuration of metadata aside for a second.
> > 
> > I agree that this should/could be NIC-specific and shouldn't weigh
> > on
> > the metadata interface between the drivers and XDP layer.
> > 
> > > Describing metedata layout in BTF allows NICs to disclose
> > > everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF
> > > description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside
> > > 'struct
> > > xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of
> > > build-
> > > time.
> > > So bpf program would need two loads to read csum or hash field
> > > instead of one.
> > > With BTF the layout of metadata is known to the program at build-
> > > time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format
> > > (either
> > > in the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > query
> > > the driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md-
> > > >csum
> > > or md->hash
> > >   with direct single load out of metadata area in front of the
> > > packet
> > 
> > This piece is where I'd like to discuss more.  When we discussed
> > this
> > in Seoul, the initial proposal was a static struct that we'd try to
> > hammer out a common layout between the interested parties.  That
> > obviously wasn't going to scale, and we wanted to pursue something
> > more
> > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > want
> > to care what the underlying device is, and could just ask for
> > metadata
> > that it's interested in.  With this approach, your eBPF program is
> > now
> > bound/tied to the NIC/driver, and if you switched to a differen
> > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > program
> > to comprehend that.  I thought we were trying to avoid that.
> 
> It looks to me that NICs have a lot more differences instead of
> common pieces.
> If we focus the architecture on making common things as generic
> as possible we miss the bigger picture of covering distinct
> and unique features that hw provides.
> 
> In the last email when I said "would be good to standardize at least
> a few common fields"
> I meant that it make sense only at the later phases.
> I don't think we should put things into uapi upfront.
> Otherwise we will end up with likely useless fields.
> Like vlan field. Surely a lot of nics can store it into metadata, but
> why?
> bpf program can easily read it from the packet.
> What's the benefit of adding it to md?
> This metadata concept we're discussing in the context of program
> acceleration.
> If metadata doesn't give perf benefits it should not be done by
> firmware, it should not be supported by the driver and certainly
> should not be part of uapi.
> Hence I'd like to see pure BTF description without any
> common/standard
> fields across the drivers and figure out what (if anything) is useful
> to accelerate xdp programs (and af_xdp in the future).

I guess I didn't write my response very well before, apologies.  I
actually really like the BTF description and the dynamic-ness of it. 
It solves lots of problems and keeps things out of the UAPI.

What I was getting at was using the BTF description and dumping a
header file to be used in the program to be loaded.  If we have an
application developer include that header, it will be the Intel BTF
description, or Mellanox BTF description, etc.  I'm fine with that, but
it makes your program not as portable, since you'd need to get a
different header and recompile it if you change the NIC.  Unless I'm
missing something.

> 
> > Our proposed approach (still working on something ready to RFC) is
> > to
> > provide a method for the eBPF program to send a struct of requested
> > hints down to the driver on load.  If the driver can provide the
> > hints,
> > then that'd be how they'd be laid out in the metadata.  If it can't
> > provide them, we'd probably reject the program loading, or discuss
> > providing a software fallback 

Re: [PATCH bpf-next v2 0/6] nfp: bpf: add multiplication and divide support on NFP JIT

2018-07-06 Thread Daniel Borkmann
On 07/07/2018 12:13 AM, Jakub Kicinski wrote:
> Jiong says:
> 
> NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per
> step, therefore we need 2 steps for u16 and 4 steps for u32.
> 
> We also need one start instruction to initialize the sequence and one or
> two instructions to fetch the result depending on either you need the high
> halve of u32 multiplication.
> 
> For ALU64, if either operand is beyond u32's value range, we reject it. One
> thing to note, if the source operand is BPF_K, then we need to check "imm"
> field directly, and we'd reject it if it is negative.  Because for ALU64,
> "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul
> doesn't support. For ALU32, it is fine for "imm" be negative though,
> because the result is 32-bits and here is no difference on the low halve
> of result for signed/unsigned mul, so we will get correct result.
> 
> NFP doesn't have integer divide instruction, this patch set uses reciprocal
> algorithm (the basic one, reciprocal_div) to emulate it.
> 
> For each u32 divide, we would need 11 instructions to finish the operation.
> 
>7 (for multiplication) + 4 (various ALUs) = 11
> 
> Given NFP only supports multiplication no bigger than u32, we'd require
> divisor and dividend no bigger than that as well.
> 
> Also eBPF doesn't support signed divide and has enforced this on C language
> level by failing compilation. However LLVM assembler hasn't enforced this,
> so it is possible for negative constant to leak in as a BPF_K operand
> through assembly code, we reject such cases as well.
> 
> Meanwhile reciprocal_div.h only implemented the basic version of:
> 
>"Division by Invariant Integers Using Multiplication"
>   - Torbjörn Granlund and Peter L. Montgomery
> 
> This patch set further implements the optimized version (Figure 4.2 in the
> paper) inside existing reciprocal_div.h. When the divider is even and the
> calculated reciprocal magic number doesn't fit u32, we could reduce the
> required ALU instructions from 4 to 2 or 1 for some cases.
> 
> The advanced version requires more complex calculation to get the
> reciprocal multiplier and other control variables, but then could reduce
> the required emulation operations. It makes sense to use it for JIT divide
> code generation (for example eBPF JIT backends) for which we are willing to
> trade performance of JITed code with that of host.
> 
> v2:
>   - add warning in l == 32 code path. (Song Liu/Jakub)
>   - jit separate insn sequence for l == 32. (Jakub/Edwin)
>   - should use unrestricted operand for mul.
>   - once place should use "1U << exp" instead of "1 << exp".
>   
> Jiong Wang (6):
>   lib: reciprocal_div: implement the improved algorithm on the paper
> mentioned
>   nfp: bpf: rename umin/umax to umin_src/umax_src
>   nfp: bpf: copy range info for all operands of all ALU operations
>   nfp: bpf: support u16 and u32 multiplications
>   nfp: bpf: support u32 divide using reciprocal_div.h
>   nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h

Applied to bpf-next, thanks everyone!


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?  
> > 
> > let's table configuration of metadata aside for a second.
> > 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct 
> > xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-time.
> > So bpf program would need two loads to read csum or hash field instead of 
> > one.
> > With BTF the layout of metadata is known to the program at build-time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either in the 
> > driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the 
> > driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum or 
> > md->hash
> >   with direct single load out of metadata area in front of the packet
> > - llvm compiles bpf program and records how program is doing this md->csum 
> > accesses
> >   in BTF format as well (the compiler will be keeping such records
> >   for __sk_buff and all other structs too, but that's separate discussion)
> > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the 
> > way the program
> >   accesses metadata (and other structs) matches BTF from the driver,
> >   so no surprises if driver+firmware got updated, but program is not 
> > recompiled
> > - every NIC can have their own layout of metadata and its own meaning of 
> > the fields,
> >   but would be good to standardize at least a few common fields like hash
> 
> Can I expose HW descriptors this way, though, or is the proposal to
> copy this data into the packet buffer?

That crossed my mind too. We can use BTF to describe HW descriptors too,
but I don't think it would buy us anything. AF_XDP approach is better.

> > Once this is working we can do more cool things with BTF.
> > Like doing offset rewriting at program load time similar to what we plan
> > to do for tracing. Tracing programs will be doing 'task->pid' access
> > and the kernel will adjust offsetof(struct task_struct, pid) during program 
> > load
> > depending on BTF for the kernel.
> > The same trick we can do for networking metadata.
> > The program will contain load instruction md->hash that will get 
> > automatically
> > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > For now I'm proposing _not_ to go that far with offset rewriting and start
> > with simple steps described above.
> 
> Why? :(  Could we please go with the rewrite/driver helpers instead of
> impacting fast paths of the drivers yet again?  This rewrite should be
> easier than task->pid, because we have the synthetic user space struct
> xdp_md definition.

I don't understand 'impact fast path yet again' concern.
If NIC has certain metadata today, just derscribe what it has in BTF
and supply the actual per-packet md to xdp program as-is.
No changes for fast path at all.
Future rewritting will be done by the bpf/xdp core with zero
changes to the driver. All driver has to do is to provide BTF.



Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?
> > 
> > let's table configuration of metadata aside for a second.
> 
> I agree that this should/could be NIC-specific and shouldn't weigh on
> the metadata interface between the drivers and XDP layer.
> 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF
> > description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct
> > xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-
> > time.
> > So bpf program would need two loads to read csum or hash field
> > instead of one.
> > With BTF the layout of metadata is known to the program at build-
> > time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either
> > in the driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> > the driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum
> > or md->hash
> >   with direct single load out of metadata area in front of the packet
> 
> This piece is where I'd like to discuss more.  When we discussed this
> in Seoul, the initial proposal was a static struct that we'd try to
> hammer out a common layout between the interested parties.  That
> obviously wasn't going to scale, and we wanted to pursue something more
> dynamic.  But I thought the goal was the XDP/eBPF program wouldn't want
> to care what the underlying device is, and could just ask for metadata
> that it's interested in.  With this approach, your eBPF program is now
> bound/tied to the NIC/driver, and if you switched to a differen
> NIC/driver combo, then you'd have to rewrite part of your eBPF program
> to comprehend that.  I thought we were trying to avoid that.

It looks to me that NICs have a lot more differences instead of common pieces.
If we focus the architecture on making common things as generic
as possible we miss the bigger picture of covering distinct
and unique features that hw provides.

In the last email when I said "would be good to standardize at least a few 
common fields"
I meant that it make sense only at the later phases.
I don't think we should put things into uapi upfront.
Otherwise we will end up with likely useless fields.
Like vlan field. Surely a lot of nics can store it into metadata, but why?
bpf program can easily read it from the packet.
What's the benefit of adding it to md?
This metadata concept we're discussing in the context of program acceleration.
If metadata doesn't give perf benefits it should not be done by
firmware, it should not be supported by the driver and certainly
should not be part of uapi.
Hence I'd like to see pure BTF description without any common/standard
fields across the drivers and figure out what (if anything) is useful
to accelerate xdp programs (and af_xdp in the future).

> Our proposed approach (still working on something ready to RFC) is to
> provide a method for the eBPF program to send a struct of requested
> hints down to the driver on load.  If the driver can provide the hints,
> then that'd be how they'd be laid out in the metadata.  If it can't
> provide them, we'd probably reject the program loading, or discuss
> providing a software fallback (I know this is an area of contention).

I don't think that approach can work.
My .02 is the only useful acceleration feature out if intel nics is
an ability to parse the packet and report it to the xdp program
as a node id in the parse graph. As far as I know that is pretty unique
to intel. Configuration of that is a different matter.
Other things like hash are interesting, but we will quickly get
into rss spec definition if we go standardization route now instead
of later phases.
Therefore I'd rather let firmware+driver define its own BTF description
that says here is 'hash' of the packet hw can provide and it's fine
that this hash may be over different tuples depending on the nic
and even version of the firmware in that nic.
We need to see performance numbers and benefits for real world
xdp programs before drilling into specific semantics of the hash.

> I suppose we could get there with the rewriting mechanism described
> below, but 

Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread David Ahern
On 7/6/18 9:58 AM, Sabrina Dubroca wrote:
> 
> Right. I'll add that as a separate patch in this series, unless you
> really prefer the change squashed into this patch.

no preference.

> 
> 
>> Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is
>> messed up as well. It propagates a change to 'default' to all existing
>> devices.
> 
> I guess it was intentional, given:
> 
> if (>ipv6.devconf_all->stable_secret == ctl->data)
> return -EIO;
> 
> It only propagates the mode, and not the secret itself, to all
> devices. After thinking about it for a while, I guess it considers the
> new default not only as default for newly created devices, but also
> for newly added addresses/prefixes.
> Or am I making stuff up?
> 

Maybe Hannes can explain (622c81d57b392). It should have been all
instead of default. As I understand it default is what devices start
with on create and individual devices can be changed. 'All' overrides
device-specific settings. So the above is inconsistent. One of many with
sysctl that makes it frustrating for users.


Re: Crash due to destroying TCP request sockets using SOCK_DESTROY

2018-07-06 Thread Subash Abhinov Kasiviswanathan

Looks like for a TCP_NEW_SYN_RECV socket, sock_diag_destroy
essentially ends up doing:

struct request_sock *req = inet_reqsk(sk);

local_bh_disable();

inet_csk_reqsk_queue_drop_and_put(req->rsk_listener,

  req);
local_bh_enable();
...

sock_gen_put(sk);

It looks like inet_csk_reqsk_queue_drop_and_put calls reqsk_put(req),
which frees the socket, and at that point sock_gen_put is a UAF. Do we
just need:

-
inet_csk_reqsk_queue_drop_and_put(req->rsk_listener,

-   req);
+inet_csk_reqsk_queue_drop(req->rsk_listener, 
req);


since sock_gen_put will also end up calling reqsk_put() for a
TCP_SYN_RECV socket?

Alastair - you're able to reproduce this UAF using net_test on qemu,
right? If so, could you try that two-line patch above?



Hi Lorenzo

Your patch makes sense to me, please submit it formally with :

Fixes: d7226c7a4dd1 ("net: diag: Fix refcnt leak in error path
destroying socket")
Cc: David Ahern 

Thanks !


Thanks Lorenzo and Eric. I will try it out locally.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v12 01/10] dt-bindings: Add Cavium Octeon Common Ethernet Interface.

2018-07-06 Thread Andrew Lunn
On Fri, Jul 06, 2018 at 05:10:39PM -0500, Steven J. Hill wrote:
> On 06/28/2018 03:35 AM, Andrew Lunn wrote:
> > 
> >> +- cavium,rx-clk-delay-bypass: Set to <1> to bypass the rx clock delay 
> >> setting.
> >> +  Needed by the Micrel PHY.
> > 
> > Could you explain this some more. Is it anything to do with RGMII delays?
> > 
> Andrew,
> 
> One of my colleagues tracked this down for me. This device tree option is in 
> place
> because there are several different ways to do the clock and data with 
> respect to
> RGMII. This controls the delay introduced for the RX clock with respect to 
> the data.
> Without this, RX will not work with Micrel PHYs. Thanks.

Hi Steven

This is his RGMII delays, as i guess.

Don't add this property, do it the Linux way. Look at phy-mode values

phy.h:PHY_INTERFACE_MODE_RGMII_ID,
phy.h:PHY_INTERFACE_MODE_RGMII_RXID,
phy.h:PHY_INTERFACE_MODE_RGMII_TXID,

There are plenty of examples in drivers/net/ethernet

  Andrew


[PATCH bpf-next v2 6/6] nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h

2018-07-06 Thread Jakub Kicinski
From: Jiong Wang 

As we are doing JIT, we would want to use the advanced version of the
reciprocal divide (reciprocal_value_adv) to trade performance with host.

We could reduce the required ALU instructions from 4 to 2 or 1.

Signed-off-by: Jiong Wang 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 64 
 1 file changed, 53 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 7c9ee3d85907..1d9e36835404 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1497,8 +1497,8 @@ wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta 
*meta,
 static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm)
 {
swreg dst_both = reg_both(dst), dst_a = reg_a(dst), dst_b = reg_a(dst);
-   struct reciprocal_value rvalue;
-   swreg tmp_b = imm_b(nfp_prog);
+   struct reciprocal_value_adv rvalue;
+   u8 pre_shift, exp;
swreg magic;
 
if (imm > U32_MAX) {
@@ -1506,16 +1506,58 @@ static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 
dst, u64 imm)
return 0;
}
 
-   rvalue = reciprocal_value(imm);
+   /* NOTE: because we are using "reciprocal_value_adv" which doesn't
+* support "divisor > (1u << 31)", we need to JIT separate NFP sequence
+* to handle such case which actually equals to the result of unsigned
+* comparison "dst >= imm" which could be calculated using the following
+* NFP sequence:
+*
+*  alu[--, dst, -, imm]
+*  immed[imm, 0]
+*  alu[dst, imm, +carry, 0]
+*
+*/
+   if (imm > 1U << 31) {
+   swreg tmp_b = ur_load_imm_any(nfp_prog, imm, imm_b(nfp_prog));
+
+   emit_alu(nfp_prog, reg_none(), dst_a, ALU_OP_SUB, tmp_b);
+   wrp_immed(nfp_prog, imm_a(nfp_prog), 0);
+   emit_alu(nfp_prog, dst_both, imm_a(nfp_prog), ALU_OP_ADD_C,
+reg_imm(0));
+   return 0;
+   }
+
+   rvalue = reciprocal_value_adv(imm, 32);
+   exp = rvalue.exp;
+   if (rvalue.is_wide_m && !(imm & 1)) {
+   pre_shift = fls(imm & -imm) - 1;
+   rvalue = reciprocal_value_adv(imm >> pre_shift, 32 - pre_shift);
+   } else {
+   pre_shift = 0;
+   }
magic = ur_load_imm_any(nfp_prog, rvalue.m, imm_b(nfp_prog));
-   wrp_mul_u32(nfp_prog, imm_both(nfp_prog), reg_none(), dst_a, magic,
-   true);
-   emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, tmp_b);
-   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
-SHF_SC_R_SHF, rvalue.sh1);
-   emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, tmp_b);
-   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
-SHF_SC_R_SHF, rvalue.sh2);
+   if (imm == 1U << exp) {
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
+SHF_SC_R_SHF, exp);
+   } else if (rvalue.is_wide_m) {
+   wrp_mul_u32(nfp_prog, imm_both(nfp_prog), reg_none(), dst_a,
+   magic, true);
+   emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB,
+imm_b(nfp_prog));
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
+SHF_SC_R_SHF, 1);
+   emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD,
+imm_b(nfp_prog));
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
+SHF_SC_R_SHF, rvalue.sh - 1);
+   } else {
+   if (pre_shift)
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE,
+dst_b, SHF_SC_R_SHF, pre_shift);
+   wrp_mul_u32(nfp_prog, dst_both, reg_none(), dst_a, magic, true);
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE,
+dst_b, SHF_SC_R_SHF, rvalue.sh);
+   }
 
return 0;
 }
-- 
2.17.1



[PATCH bpf-next v2 3/6] nfp: bpf: copy range info for all operands of all ALU operations

2018-07-06 Thread Jakub Kicinski
From: Jiong Wang 

NFP verifier hook is coping range information of the shift amount for
indirect shift operation so optimized shift sequences could be generated.

We want to use range info to do more things. For example, to decide whether
multiplication and divide are supported on the given range.

This patch simply let NFP verifier hook to copy range info for all operands
of all ALU operands.

Signed-off-by: Jiong Wang 
Reviewed-by: Jakub Kicinski 
Acked-by: Song Liu 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 33 +++
 .../net/ethernet/netronome/nfp/bpf/offload.c  |  4 ++-
 .../net/ethernet/netronome/nfp/bpf/verifier.c |  6 +++-
 3 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 5975a19c28cb..c985d0ac61a3 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -265,6 +265,8 @@ struct nfp_bpf_reg_state {
  * @arg2: arg2 for call instructions
  * @umin_src: copy of core verifier umin_value for src opearnd.
  * @umax_src: copy of core verifier umax_value for src operand.
+ * @umin_dst: copy of core verifier umin_value for dst opearnd.
+ * @umax_dst: copy of core verifier umax_value for dst operand.
  * @off: index of first generated machine instruction (in nfp_prog.prog)
  * @n: eBPF instruction number
  * @flags: eBPF instruction extra optimization flags
@@ -300,12 +302,15 @@ struct nfp_insn_meta {
struct bpf_reg_state arg1;
struct nfp_bpf_reg_state arg2;
};
-   /* We are interested in range info for some operands,
-* for example, the shift amount which is kept in src operand.
+   /* We are interested in range info for operands of ALU
+* operations. For example, shift amount, multiplicand and
+* multiplier etc.
 */
struct {
u64 umin_src;
u64 umax_src;
+   u64 umin_dst;
+   u64 umax_dst;
};
};
unsigned int off;
@@ -339,6 +344,11 @@ static inline u8 mbpf_mode(const struct nfp_insn_meta 
*meta)
return BPF_MODE(meta->insn.code);
 }
 
+static inline bool is_mbpf_alu(const struct nfp_insn_meta *meta)
+{
+   return mbpf_class(meta) == BPF_ALU64 || mbpf_class(meta) == BPF_ALU;
+}
+
 static inline bool is_mbpf_load(const struct nfp_insn_meta *meta)
 {
return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_LDX | BPF_MEM);
@@ -384,25 +394,6 @@ static inline bool is_mbpf_xadd(const struct nfp_insn_meta 
*meta)
return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD);
 }
 
-static inline bool is_mbpf_indir_shift(const struct nfp_insn_meta *meta)
-{
-   u8 code = meta->insn.code;
-   bool is_alu, is_shift;
-   u8 opclass, opcode;
-
-   opclass = BPF_CLASS(code);
-   is_alu = opclass == BPF_ALU64 || opclass == BPF_ALU;
-   if (!is_alu)
-   return false;
-
-   opcode = BPF_OP(code);
-   is_shift = opcode == BPF_LSH || opcode == BPF_RSH || opcode == BPF_ARSH;
-   if (!is_shift)
-   return false;
-
-   return BPF_SRC(code) == BPF_X;
-}
-
 /**
  * struct nfp_prog - nfp BPF program
  * @bpf: backpointer to the bpf app priv structure
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 856a0003bb75..78f44c4d95b4 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -190,8 +190,10 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct 
bpf_insn *prog,
 
meta->insn = prog[i];
meta->n = i;
-   if (is_mbpf_indir_shift(meta))
+   if (is_mbpf_alu(meta)) {
meta->umin_src = U64_MAX;
+   meta->umin_dst = U64_MAX;
+   }
 
list_add_tail(>l, _prog->insns);
}
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c 
b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index e862b739441f..7bd9666bd8ff 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
@@ -551,12 +551,16 @@ nfp_verify_insn(struct bpf_verifier_env *env, int 
insn_idx, int prev_insn_idx)
if (is_mbpf_xadd(meta))
return nfp_bpf_check_xadd(nfp_prog, meta, env);
 
-   if (is_mbpf_indir_shift(meta)) {
+   if (is_mbpf_alu(meta)) {
const struct bpf_reg_state *sreg =
cur_regs(env) + meta->insn.src_reg;
+   const struct bpf_reg_state *dreg =
+   cur_regs(env) + meta->insn.dst_reg;
 
meta->umin_src = min(meta->umin_src, sreg->umin_value);

[PATCH bpf-next v2 5/6] nfp: bpf: support u32 divide using reciprocal_div.h

2018-07-06 Thread Jakub Kicinski
From: Jiong Wang 

NFP doesn't have integer divide instruction, this patch use reciprocal
algorithm (the basic one, reciprocal_div) to emulate it.

For each u32 divide, we would need 11 instructions to finish the operation.

  7 (for multiplication) + 4 (various ALUs) = 11

Given NFP only supports multiplication no bigger than u32, we'd require
divisor and dividend no bigger than that as well.

Also eBPF doesn't support signed divide and has enforced this on C language
level by failing compilation. However LLVM assembler hasn't enforced this,
so it is possible for negative constant to leak in as a BPF_K operand
through assembly code, we reject such cases as well.

Signed-off-by: Jiong Wang 
Reviewed-by: Jakub Kicinski 
Acked-by: Song Liu 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 58 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  5 ++
 .../net/ethernet/netronome/nfp/bpf/verifier.c | 31 ++
 3 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index f1b27c3a4347..7c9ee3d85907 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -34,10 +34,11 @@
 #define pr_fmt(fmt)"NFP net bpf: " fmt
 
 #include 
-#include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 
 #include "main.h"
@@ -1493,6 +1494,32 @@ wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta 
*meta,
return 0;
 }
 
+static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm)
+{
+   swreg dst_both = reg_both(dst), dst_a = reg_a(dst), dst_b = reg_a(dst);
+   struct reciprocal_value rvalue;
+   swreg tmp_b = imm_b(nfp_prog);
+   swreg magic;
+
+   if (imm > U32_MAX) {
+   wrp_immed(nfp_prog, dst_both, 0);
+   return 0;
+   }
+
+   rvalue = reciprocal_value(imm);
+   magic = ur_load_imm_any(nfp_prog, rvalue.m, imm_b(nfp_prog));
+   wrp_mul_u32(nfp_prog, imm_both(nfp_prog), reg_none(), dst_a, magic,
+   true);
+   emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, tmp_b);
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
+SHF_SC_R_SHF, rvalue.sh1);
+   emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, tmp_b);
+   emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
+SHF_SC_R_SHF, rvalue.sh2);
+
+   return 0;
+}
+
 static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
swreg tmp = imm_a(nfp_prog), tmp_len = imm_b(nfp_prog);
@@ -1807,6 +1834,21 @@ static int mul_imm64(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return wrp_mul(nfp_prog, meta, true, false);
 }
 
+static int div_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   const struct bpf_insn *insn = >insn;
+
+   return wrp_div_imm(nfp_prog, insn->dst_reg * 2, insn->imm);
+}
+
+static int div_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   /* NOTE: verifier hook has rejected cases for which verifier doesn't
+* know whether the source operand is constant or not.
+*/
+   return wrp_div_imm(nfp_prog, meta->insn.dst_reg * 2, meta->umin_src);
+}
+
 static int neg_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
const struct bpf_insn *insn = >insn;
@@ -2230,6 +2272,16 @@ static int mul_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return wrp_mul(nfp_prog, meta, false, false);
 }
 
+static int div_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   return div_reg64(nfp_prog, meta);
+}
+
+static int div_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   return div_imm64(nfp_prog, meta);
+}
+
 static int neg_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
u8 dst = meta->insn.dst_reg * 2;
@@ -2983,6 +3035,8 @@ static const instr_cb_t instr_cb[256] = {
[BPF_ALU64 | BPF_SUB | BPF_K] = sub_imm64,
[BPF_ALU64 | BPF_MUL | BPF_X] = mul_reg64,
[BPF_ALU64 | BPF_MUL | BPF_K] = mul_imm64,
+   [BPF_ALU64 | BPF_DIV | BPF_X] = div_reg64,
+   [BPF_ALU64 | BPF_DIV | BPF_K] = div_imm64,
[BPF_ALU64 | BPF_NEG] = neg_reg64,
[BPF_ALU64 | BPF_LSH | BPF_X] = shl_reg64,
[BPF_ALU64 | BPF_LSH | BPF_K] = shl_imm64,
@@ -3004,6 +3058,8 @@ static const instr_cb_t instr_cb[256] = {
[BPF_ALU | BPF_SUB | BPF_K] =   sub_imm,
[BPF_ALU | BPF_MUL | BPF_X] =   mul_reg,
[BPF_ALU | BPF_MUL | BPF_K] =   mul_imm,
+   [BPF_ALU | BPF_DIV | BPF_X] =   div_reg,
+   [BPF_ALU | BPF_DIV | BPF_K] =   div_imm,
[BPF_ALU | BPF_NEG] =   neg_reg,
[BPF_ALU | BPF_LSH | BPF_K] =   shl_imm,
[BPF_ALU | BPF_END | BPF_X] =   end_reg32,
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h

[PATCH bpf-next v2 4/6] nfp: bpf: support u16 and u32 multiplications

2018-07-06 Thread Jakub Kicinski
From: Jiong Wang 

NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per
step, therefore we need 2 steps for u16 and 4 steps for u32.

We also need one start instruction to initialize the sequence and one or
two instructions to fetch the result depending on either you need the high
halve of u32 multiplication.

For ALU64, if either operand is beyond u32's value range, we reject it. One
thing to note, if the source operand is BPF_K, then we need to check "imm"
field directly, and we'd reject it if it is negative.  Because for ALU64,
"imm" (with s32 type) is expected to be sign extended to s64 which NFP mul
doesn't support. For ALU32, it is fine for "imm" be negative though,
because the result is 32-bits and here is no difference on the low halve
of result for signed/unsigned mul, so we will get correct result.

Signed-off-by: Jiong Wang 
Reviewed-by: Jakub Kicinski 
Acked-by: Song Liu 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 137 ++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |   5 +
 .../net/ethernet/netronome/nfp/bpf/verifier.c |  58 ++--
 drivers/net/ethernet/netronome/nfp/nfp_asm.h  |  28 
 4 files changed, 217 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 4a629e9b5c0f..f1b27c3a4347 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -415,6 +415,60 @@ emit_alu(struct nfp_prog *nfp_prog, swreg dst,
   reg.dst_lmextn, reg.src_lmextn);
 }
 
+static void
+__emit_mul(struct nfp_prog *nfp_prog, enum alu_dst_ab dst_ab, u16 areg,
+  enum mul_type type, enum mul_step step, u16 breg, bool swap,
+  bool wr_both, bool dst_lmextn, bool src_lmextn)
+{
+   u64 insn;
+
+   insn = OP_MUL_BASE |
+   FIELD_PREP(OP_MUL_A_SRC, areg) |
+   FIELD_PREP(OP_MUL_B_SRC, breg) |
+   FIELD_PREP(OP_MUL_STEP, step) |
+   FIELD_PREP(OP_MUL_DST_AB, dst_ab) |
+   FIELD_PREP(OP_MUL_SW, swap) |
+   FIELD_PREP(OP_MUL_TYPE, type) |
+   FIELD_PREP(OP_MUL_WR_AB, wr_both) |
+   FIELD_PREP(OP_MUL_SRC_LMEXTN, src_lmextn) |
+   FIELD_PREP(OP_MUL_DST_LMEXTN, dst_lmextn);
+
+   nfp_prog_push(nfp_prog, insn);
+}
+
+static void
+emit_mul(struct nfp_prog *nfp_prog, swreg lreg, enum mul_type type,
+enum mul_step step, swreg rreg)
+{
+   struct nfp_insn_ur_regs reg;
+   u16 areg;
+   int err;
+
+   if (type == MUL_TYPE_START && step != MUL_STEP_NONE) {
+   nfp_prog->error = -EINVAL;
+   return;
+   }
+
+   if (step == MUL_LAST || step == MUL_LAST_2) {
+   /* When type is step and step Number is LAST or LAST2, left
+* source is used as destination.
+*/
+   err = swreg_to_unrestricted(lreg, reg_none(), rreg, );
+   areg = reg.dst;
+   } else {
+   err = swreg_to_unrestricted(reg_none(), lreg, rreg, );
+   areg = reg.areg;
+   }
+
+   if (err) {
+   nfp_prog->error = err;
+   return;
+   }
+
+   __emit_mul(nfp_prog, reg.dst_ab, areg, type, step, reg.breg, reg.swap,
+  reg.wr_both, reg.dst_lmextn, reg.src_lmextn);
+}
+
 static void
 __emit_ld_field(struct nfp_prog *nfp_prog, enum shf_sc sc,
u8 areg, u8 bmask, u8 breg, u8 shift, bool imm8,
@@ -1380,6 +1434,65 @@ static void wrp_end32(struct nfp_prog *nfp_prog, swreg 
reg_in, u8 gpr_out)
  SHF_SC_R_ROT, 16);
 }
 
+static void
+wrp_mul_u32(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg,
+   swreg rreg, bool gen_high_half)
+{
+   emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg);
+   emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_1, rreg);
+   emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_2, rreg);
+   emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_3, rreg);
+   emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_4, rreg);
+   emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_32x32, MUL_LAST, reg_none());
+   if (gen_high_half)
+   emit_mul(nfp_prog, dst_hi, MUL_TYPE_STEP_32x32, MUL_LAST_2,
+reg_none());
+   else
+   wrp_immed(nfp_prog, dst_hi, 0);
+}
+
+static void
+wrp_mul_u16(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg,
+   swreg rreg)
+{
+   emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg);
+   emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_1, rreg);
+   emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_2, rreg);
+   emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_16x16, MUL_LAST, reg_none());
+}
+
+static int
+wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
+   bool gen_high_half, bool 

[PATCH bpf-next v2 1/6] lib: reciprocal_div: implement the improved algorithm on the paper mentioned

2018-07-06 Thread Jakub Kicinski
From: Jiong Wang 

The new added "reciprocal_value_adv" implements the advanced version of the
algorithm described in Figure 4.2 of the paper except when
"divisor > (1U << 31)" whose ceil(log2(d)) result will be 32 which then
requires u128 divide on host. The exception case could be easily handled
before calling "reciprocal_value_adv".

The advanced version requires more complex calculation to get the
reciprocal multiplier and other control variables, but then could reduce
the required emulation operations.

It makes no sense to use this advanced version for host divide emulation,
those extra complexities for calculating multiplier etc could completely
waive our saving on emulation operations.

However, it makes sense to use it for JIT divide code generation (for
example eBPF JIT backends) for which we are willing to trade performance of
JITed code with that of host. As shown by the following pseudo code, the
required emulation operations could go down from 6 (the basic version) to 3
or 4.

To use the result of "reciprocal_value_adv", suppose we want to calculate
n/d, the C-style pseudo code will be the following, it could be easily
changed to real code generation for other JIT targets.

  struct reciprocal_value_adv rvalue;
  u8 pre_shift, exp;

  // handle exception case.
  if (d >= (1U << 31)) {
result = n >= d;
return;
  }
  rvalue = reciprocal_value_adv(d, 32)
  exp = rvalue.exp;
  if (rvalue.is_wide_m && !(d & 1)) {
// floor(log2(d & (2^32 -d)))
pre_shift = fls(d & -d) - 1;
rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
  } else {
pre_shift = 0;
  }

  // code generation starts.
  if (imm == 1U << exp) {
result = n >> exp;
  } else if (rvalue.is_wide_m) {
// pre_shift must be zero when reached here.
t = (n * rvalue.m) >> 32;
result = n - t;
result >>= 1;
result += t;
result >>= rvalue.sh - 1;
  } else {
if (pre_shift)
  result = n >> pre_shift;
result = ((u64)result * rvalue.m) >> 32;
result >>= rvalue.sh;
  }

Signed-off-by: Jiong Wang 
Reviewed-by: Jakub Kicinski 
---
 include/linux/reciprocal_div.h | 68 ++
 lib/reciprocal_div.c   | 41 
 2 files changed, 109 insertions(+)

diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h
index e031e9f2f9d8..585ce89c0f33 100644
--- a/include/linux/reciprocal_div.h
+++ b/include/linux/reciprocal_div.h
@@ -25,6 +25,9 @@ struct reciprocal_value {
u8 sh1, sh2;
 };
 
+/* "reciprocal_value" and "reciprocal_divide" together implement the basic
+ * version of the algorithm described in Figure 4.1 of the paper.
+ */
 struct reciprocal_value reciprocal_value(u32 d);
 
 static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
@@ -33,4 +36,69 @@ static inline u32 reciprocal_divide(u32 a, struct 
reciprocal_value R)
return (t + ((a - t) >> R.sh1)) >> R.sh2;
 }
 
+struct reciprocal_value_adv {
+   u32 m;
+   u8 sh, exp;
+   bool is_wide_m;
+};
+
+/* "reciprocal_value_adv" implements the advanced version of the algorithm
+ * described in Figure 4.2 of the paper except when "divisor > (1U << 31)" 
whose
+ * ceil(log2(d)) result will be 32 which then requires u128 divide on host. The
+ * exception case could be easily handled before calling 
"reciprocal_value_adv".
+ *
+ * The advanced version requires more complex calculation to get the reciprocal
+ * multiplier and other control variables, but then could reduce the required
+ * emulation operations.
+ *
+ * It makes no sense to use this advanced version for host divide emulation,
+ * those extra complexities for calculating multiplier etc could completely
+ * waive our saving on emulation operations.
+ *
+ * However, it makes sense to use it for JIT divide code generation for which
+ * we are willing to trade performance of JITed code with that of host. As 
shown
+ * by the following pseudo code, the required emulation operations could go 
down
+ * from 6 (the basic version) to 3 or 4.
+ *
+ * To use the result of "reciprocal_value_adv", suppose we want to calculate
+ * n/d, the pseudo C code will be:
+ *
+ *   struct reciprocal_value_adv rvalue;
+ *   u8 pre_shift, exp;
+ *
+ *   // handle exception case.
+ *   if (d >= (1U << 31)) {
+ * result = n >= d;
+ * return;
+ *   }
+ *
+ *   rvalue = reciprocal_value_adv(d, 32)
+ *   exp = rvalue.exp;
+ *   if (rvalue.is_wide_m && !(d & 1)) {
+ * // floor(log2(d & (2^32 -d)))
+ * pre_shift = fls(d & -d) - 1;
+ * rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
+ *   } else {
+ * pre_shift = 0;
+ *   }
+ *
+ *   // code generation starts.
+ *   if (imm == 1U << exp) {
+ * result = n >> exp;
+ *   } else if (rvalue.is_wide_m) {
+ * // pre_shift must be zero when reached here.
+ * t = (n * rvalue.m) >> 32;
+ * result = n - t;
+ * result >>= 1;
+ * result += t;
+ * result >>= rvalue.sh - 1;
+ *   } else {
+ * if 

[PATCH bpf-next v2 0/6] nfp: bpf: add multiplication and divide support on NFP JIT

2018-07-06 Thread Jakub Kicinski
Jiong says:

NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per
step, therefore we need 2 steps for u16 and 4 steps for u32.

We also need one start instruction to initialize the sequence and one or
two instructions to fetch the result depending on either you need the high
halve of u32 multiplication.

For ALU64, if either operand is beyond u32's value range, we reject it. One
thing to note, if the source operand is BPF_K, then we need to check "imm"
field directly, and we'd reject it if it is negative.  Because for ALU64,
"imm" (with s32 type) is expected to be sign extended to s64 which NFP mul
doesn't support. For ALU32, it is fine for "imm" be negative though,
because the result is 32-bits and here is no difference on the low halve
of result for signed/unsigned mul, so we will get correct result.

NFP doesn't have integer divide instruction, this patch set uses reciprocal
algorithm (the basic one, reciprocal_div) to emulate it.

For each u32 divide, we would need 11 instructions to finish the operation.

   7 (for multiplication) + 4 (various ALUs) = 11

Given NFP only supports multiplication no bigger than u32, we'd require
divisor and dividend no bigger than that as well.

Also eBPF doesn't support signed divide and has enforced this on C language
level by failing compilation. However LLVM assembler hasn't enforced this,
so it is possible for negative constant to leak in as a BPF_K operand
through assembly code, we reject such cases as well.

Meanwhile reciprocal_div.h only implemented the basic version of:

   "Division by Invariant Integers Using Multiplication"
  - Torbjörn Granlund and Peter L. Montgomery

This patch set further implements the optimized version (Figure 4.2 in the
paper) inside existing reciprocal_div.h. When the divider is even and the
calculated reciprocal magic number doesn't fit u32, we could reduce the
required ALU instructions from 4 to 2 or 1 for some cases.

The advanced version requires more complex calculation to get the
reciprocal multiplier and other control variables, but then could reduce
the required emulation operations. It makes sense to use it for JIT divide
code generation (for example eBPF JIT backends) for which we are willing to
trade performance of JITed code with that of host.

v2:
  - add warning in l == 32 code path. (Song Liu/Jakub)
  - jit separate insn sequence for l == 32. (Jakub/Edwin)
  - should use unrestricted operand for mul.
  - once place should use "1U << exp" instead of "1 << exp".

Jiong Wang (6):
  lib: reciprocal_div: implement the improved algorithm on the paper
mentioned
  nfp: bpf: rename umin/umax to umin_src/umax_src
  nfp: bpf: copy range info for all operands of all ALU operations
  nfp: bpf: support u16 and u32 multiplications
  nfp: bpf: support u32 divide using reciprocal_div.h
  nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h

 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 249 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  43 +--
 .../net/ethernet/netronome/nfp/bpf/offload.c  |   6 +-
 .../net/ethernet/netronome/nfp/bpf/verifier.c |  85 +-
 drivers/net/ethernet/netronome/nfp/nfp_asm.h  |  28 ++
 include/linux/reciprocal_div.h|  68 +
 lib/reciprocal_div.c  |  41 +++
 7 files changed, 483 insertions(+), 37 deletions(-)

-- 
2.17.1



[PATCH bpf-next v2 2/6] nfp: bpf: rename umin/umax to umin_src/umax_src

2018-07-06 Thread Jakub Kicinski
From: Jiong Wang 

The two fields are a copy of umin and umax info of bpf_insn->src_reg
generated by verifier.

Rename to make their meaning clear.

Signed-off-by: Jiong Wang 
Reviewed-by: Jakub Kicinski 
Acked-by: Song Liu 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 12 ++--
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 10 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c  |  2 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |  4 ++--
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 33111739b210..4a629e9b5c0f 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1772,8 +1772,8 @@ static int shl_reg64(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
u8 dst, src;
 
dst = insn->dst_reg * 2;
-   umin = meta->umin;
-   umax = meta->umax;
+   umin = meta->umin_src;
+   umax = meta->umax_src;
if (umin == umax)
return __shl_imm64(nfp_prog, dst, umin);
 
@@ -1881,8 +1881,8 @@ static int shr_reg64(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
u8 dst, src;
 
dst = insn->dst_reg * 2;
-   umin = meta->umin;
-   umax = meta->umax;
+   umin = meta->umin_src;
+   umax = meta->umax_src;
if (umin == umax)
return __shr_imm64(nfp_prog, dst, umin);
 
@@ -1995,8 +1995,8 @@ static int ashr_reg64(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
u8 dst, src;
 
dst = insn->dst_reg * 2;
-   umin = meta->umin;
-   umax = meta->umax;
+   umin = meta->umin_src;
+   umax = meta->umax_src;
if (umin == umax)
return __ashr_imm64(nfp_prog, dst, umin);
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 654fe7823e5e..5975a19c28cb 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -263,8 +263,8 @@ struct nfp_bpf_reg_state {
  * @func_id: function id for call instructions
  * @arg1: arg1 for call instructions
  * @arg2: arg2 for call instructions
- * @umin: copy of core verifier umin_value.
- * @umax: copy of core verifier umax_value.
+ * @umin_src: copy of core verifier umin_value for src opearnd.
+ * @umax_src: copy of core verifier umax_value for src operand.
  * @off: index of first generated machine instruction (in nfp_prog.prog)
  * @n: eBPF instruction number
  * @flags: eBPF instruction extra optimization flags
@@ -301,11 +301,11 @@ struct nfp_insn_meta {
struct nfp_bpf_reg_state arg2;
};
/* We are interested in range info for some operands,
-* for example, the shift amount.
+* for example, the shift amount which is kept in src operand.
 */
struct {
-   u64 umin;
-   u64 umax;
+   u64 umin_src;
+   u64 umax_src;
};
};
unsigned int off;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 7eae4c0266f8..856a0003bb75 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -191,7 +191,7 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct 
bpf_insn *prog,
meta->insn = prog[i];
meta->n = i;
if (is_mbpf_indir_shift(meta))
-   meta->umin = U64_MAX;
+   meta->umin_src = U64_MAX;
 
list_add_tail(>l, _prog->insns);
}
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c 
b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index 4bfeba7b21b2..e862b739441f 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
@@ -555,8 +555,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, 
int prev_insn_idx)
const struct bpf_reg_state *sreg =
cur_regs(env) + meta->insn.src_reg;
 
-   meta->umin = min(meta->umin, sreg->umin_value);
-   meta->umax = max(meta->umax, sreg->umax_value);
+   meta->umin_src = min(meta->umin_src, sreg->umin_value);
+   meta->umax_src = max(meta->umax_src, sreg->umax_value);
}
 
return 0;
-- 
2.17.1



Re: [PATCH v12 01/10] dt-bindings: Add Cavium Octeon Common Ethernet Interface.

2018-07-06 Thread Steven J. Hill
On 06/28/2018 03:35 AM, Andrew Lunn wrote:
> 
>> +- cavium,rx-clk-delay-bypass: Set to <1> to bypass the rx clock delay 
>> setting.
>> +  Needed by the Micrel PHY.
> 
> Could you explain this some more. Is it anything to do with RGMII delays?
> 
Andrew,

One of my colleagues tracked this down for me. This device tree option is in 
place
because there are several different ways to do the clock and data with respect 
to
RGMII. This controls the delay introduced for the RX clock with respect to the 
data.
Without this, RX will not work with Micrel PHYs. Thanks.

Steve


Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-06 Thread Jesus Sanchez-Palencia
Hi Stephen,


On 07/06/2018 02:32 PM, Stephen Hemminger wrote:
> 
>> diff --git a/tc/q_etf.c b/tc/q_etf.c
>> new file mode 100644
>> index ..5db1dd6f
>> --- /dev/null
>> +++ b/tc/q_etf.c
>> @@ -0,0 +1,168 @@
>> +/*
>> + * q_etf.c  Earliest TxTime First (ETF).
>> + *
>> + *  This program is free software; you can redistribute it and/or
>> + *  modify it under the terms of the GNU General Public License
>> + *  as published by the Free Software Foundation; either version
>> + *  2 of the License, or (at your option) any later version.
>> + *
>> + * Authors: Vinicius Costa Gomes 
>> + *  Jesus Sanchez-Palencia 
>> + *
>> + */
> 
> 
> Please use SPDX tag rather than GPL boilerplate when adding new code.

Sure, will do for v4.

> 
>> +static int get_clockid(__s32 *val, const char *arg)
>> +{
>> +const struct static_clockid {
>> +const char *name;
>> +clockid_t clockid;
>> +} clockids_sysv[] = {
>> +{ "CLOCK_REALTIME", CLOCK_REALTIME },
>> +{ "CLOCK_TAI", CLOCK_TAI },
>> +{ "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
>> +{ "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
>> +{ NULL }
>> +};
>> +
>> +const struct static_clockid *c;
>> +
>> +for (c = clockids_sysv; c->name; c++) {
>> +if (strncasecmp(c->name, arg, 25) == 0) {
>> +*val = c->clockid;
>> +
>> +return 0;
>> +}
>> +}
>> +
>> +return -1;
>> +}
> 
> Internally, kernel must use ktime. For the userspace part the TC standard
> is to use USER HZ of 100.
> 
> Please change user kernel API of this module to match other existing modules.
> Doing something unique for this module is not necessary.


I don't follow you on the above, sorry.

The qdisc must be configured with a valid clockid_t. This type is used by both
userspace and kernel.

As requested before, we made the tc etf command line interface more
user-friendly and allowed for users to input the clock name as a string. The
code above is just lookup table converting the string into a a valid clockid_t
so we can then pass it to the kernel.

There is no ktime or any other timestamp type above, only clockid_t.

Can you please clarify what your request is?


Thanks,
Jesus


Re: Fw: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE

2018-07-06 Thread Eric Dumazet



On 07/06/2018 02:24 PM, Stephen Hemminger wrote:
> 
> 
> Begin forwarded message:
> 
> Date: Sat, 23 Jun 2018 00:00:25 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=200215
> 
> Bug ID: 200215
>Summary: UBSAN: Undefined behaviour in net/core/sock.c:LINE
>Product: Networking
>Version: 2.5
> Kernel Version: v4.18-rc2
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Other
>   Assignee: step...@networkplumber.org
>   Reporter: icy...@gmail.com
> Regression: No
> 
> Hi,
> This bug was found in Linux Kernel v4.18-rc2
> 
> $ cat report4 
> 
> UBSAN: Undefined behaviour in net/core/sock.c:793:19
> signed integer overflow:
> -1704733899 * 2 cannot be represented in type 'int'
> CPU: 0 PID: 5695 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x122/0x1c8 lib/dump_stack.c:113
>  ubsan_epilogue+0x12/0x86 lib/ubsan.c:159
>  handle_overflow+0x1c2/0x21f lib/ubsan.c:190
>  __ubsan_handle_mul_overflow+0x2a/0x38 lib/ubsan.c:214
>  sock_setsockopt+0x17f1/0x1c80 net/core/sock.c:793
>  __sys_setsockopt+0x23f/0x2a0 net/socket.c:1943
>  __do_sys_setsockopt net/socket.c:1958 [inline]
>  __se_sys_setsockopt net/socket.c:1955 [inline]
>  __x64_sys_setsockopt+0xcc/0x170 net/socket.c:1955
>  do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This seems harmless, a fix would be :

diff --git a/net/core/sock.c b/net/core/sock.c
index 
03fdea5b0f575945a58fd14b546226d61ccd4988..9160f412d49dfe7706c202dbd531c247c0548a21
 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -792,7 +792,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
 * returning the value we actually used in getsockopt
 * is the most desirable behavior.
 */
-   sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+   sk->sk_rcvbuf = max_t(int, val << 1, SOCK_MIN_RCVBUF);
break;
 
case SO_RCVBUFFORCE:



Re: [PATCH v2 net-next 00/14] Scheduled packet Transmission: ETF

2018-07-06 Thread Stephen Hemminger
On Tue,  3 Jul 2018 15:42:46 -0700
Jesus Sanchez-Palencia  wrote:

> Changes since v1:
>   - moved struct sock_txtime from socket.h to uapi net_tstamp.h;
>   - sk_clockid was changed from u16 to u8;
>   - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock;
>   - the socket option flags are now validated in sock_setsockopt();
>   - added SO_EE_ORIGIN_TXTIME;
>   - sockc.transmit_time is now initialized from all IPv4 Tx paths;
>   - added support for the IPv6 Tx path;
> 
> 
> Overview
> 
> 
> This work consists of a set of kernel interfaces that can be used by
> applications that require (time-based) Scheduled Tx of packets.
> It is comprised by 3 new components to the kernel:
> 
>   - SO_TXTIME: socket option + cmsg programming interfaces.
> 
>   - etf: the "earliest txtime first" qdisc, that provides per-queue
>TxTime-based scheduling. This has been renamed from 'tbs' to
>'etf' to better describe its functionality.
> 
>   - taprio: the "time-aware priority scheduler" qdisc, that provides
>   per-port Time-Aware scheduling;
> 
> This patchset is providing the first 2 components, which have been
> developed for longer. The taprio qdisc will be shared as an RFC separately
> (shortly).
> 
> Note that this series is a follow up of the "Time based packet
> transmission" RFCv3 [1].
> 
> 
> 
> etf (formerly known as 'tbs')
> =
> 
> For applications/systems that the concept of time slices isn't precise
> enough, the etf qdisc allows applications to control the instant when
> a packet should leave the network controller. When used in conjunction
> with taprio, it can also be used in case the application needs to
> control with greater guarantee the offset into each time slice a packet
> will be sent. Another use case of etf, is when only a small number of
> applications on a system are time sensitive, so it can then be used
> with a more traditional root qdisc (like mqprio).
> 
> The etf qdisc is designed so it buffers packets until a configurable
> time before their deadline (Tx time). The qdisc uses a rbtree internally
> so the buffered packets are always 'ordered' by their txtime (deadline)
> and will be dequeued following the earliest txtime first.
> 
> It relies on the SO_TXTIME API set for receiving the per-packet timestamp
> (txtime) as well as the config flags for each socket: the clockid to be
> used as a reference, if the expected mode of txtime for that socket is
> deadline or strict mode, and if packet drops should be reported on the
> socket's error queue or not.
> 
> The qdisc will drop any packets with a Tx time in the past, or if a
> packet expires while waiting for being dequeued. Drops can be reported
> as errors back to userspace through the socket's error queue.
> 
> Example configuration:
> 
> $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 20 \
> clockid CLOCK_TAI
> 
> Here, the Qdisc will use HW offload for the txtime control.
> Packets will be dequeued by the qdisc "delta" (20) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by hrtimers, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.
> 
> A more complete example can be found here, with instructions of how to
> test it:
> 
> https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]
> 
> 
> Note that we haven't modified the qdisc so it uses a timerqueue because
> the modification needed was increasing the number of cachelines of a sk_buff.
> 
> 
> 
> This series is also hosted on github and can be found at [3].
> The companion iproute2 patches can be found at [4].
> 
> 
> [1] https://patchwork.ozlabs.org/cover/882342/
> 
> [2] github doesn't make it clear, but the gist can be cloned like this:
> $ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f 
> scheduled-tx-tests
> 
> [3] https://github.com/jeez/linux/tree/etf-v2
> 
> [4] https://github.com/jeez/iproute2/tree/etf-v2
> 
> 
> 
> Jesus Sanchez-Palencia (10):
>   net: Clear skb->tstamp only on the forwarding path
>   net: ipv4: Hook into time based transmission
>   net: ipv6: Hook into time based transmission
>   net/sched: Add HW offloading capability to ETF
>   igb: Refactor igb_configure_cbs()
>   igb: Only change Tx arbitration when CBS is on
>   igb: Refactor igb_offload_cbs()
>   igb: Only call skb_tx_timestamp after descriptors are ready
>   igb: Add support for ETF offload
>   net/sched: Make etf report drops on error_queue
> 
> Richard Cochran (2):
>   net: Add a new socket option for a future transmit time.
>   net: packet: Hook into time based transmission.
> 
> Vinicius Costa Gomes (2):
>   net/sched: Allow creating a Qdisc watchdog with other clocks
>   net/sched: Introduce the ETF Qdisc
> 
>  arch/alpha/include/uapi/asm/socket.h  |   3 +
>  arch/ia64/include/uapi/asm/socket.h   |   3 +
>  

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Jakub Kicinski
On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?  
> 
> let's table configuration of metadata aside for a second.
> 
> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct 
> xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-time.
> So bpf program would need two loads to read csum or hash field instead of one.
> With BTF the layout of metadata is known to the program at build-time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either in the 
> driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver 
> and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum or 
> md->hash
>   with direct single load out of metadata area in front of the packet
> - llvm compiles bpf program and records how program is doing this md->csum 
> accesses
>   in BTF format as well (the compiler will be keeping such records
>   for __sk_buff and all other structs too, but that's separate discussion)
> - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way 
> the program
>   accesses metadata (and other structs) matches BTF from the driver,
>   so no surprises if driver+firmware got updated, but program is not 
> recompiled
> - every NIC can have their own layout of metadata and its own meaning of the 
> fields,
>   but would be good to standardize at least a few common fields like hash

Can I expose HW descriptors this way, though, or is the proposal to
copy this data into the packet buffer?

> Once this is working we can do more cool things with BTF.
> Like doing offset rewriting at program load time similar to what we plan
> to do for tracing. Tracing programs will be doing 'task->pid' access
> and the kernel will adjust offsetof(struct task_struct, pid) during program 
> load
> depending on BTF for the kernel.
> The same trick we can do for networking metadata.
> The program will contain load instruction md->hash that will get automatically
> adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> For now I'm proposing _not_ to go that far with offset rewriting and start
> with simple steps described above.

Why? :(  Could we please go with the rewrite/driver helpers instead of
impacting fast paths of the drivers yet again?  This rewrite should be
easier than task->pid, because we have the synthetic user space struct
xdp_md definition.

The flexibility and power of BPF rewrites/JITing is at our disposal to
deliver maximum performance..


Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-06 Thread Stephen Hemminger


> diff --git a/tc/q_etf.c b/tc/q_etf.c
> new file mode 100644
> index ..5db1dd6f
> --- /dev/null
> +++ b/tc/q_etf.c
> @@ -0,0 +1,168 @@
> +/*
> + * q_etf.c   Earliest TxTime First (ETF).
> + *
> + *   This program is free software; you can redistribute it and/or
> + *   modify it under the terms of the GNU General Public License
> + *   as published by the Free Software Foundation; either version
> + *   2 of the License, or (at your option) any later version.
> + *
> + * Authors:  Vinicius Costa Gomes 
> + *   Jesus Sanchez-Palencia 
> + *
> + */


Please use SPDX tag rather than GPL boilerplate when adding new code.

> +static int get_clockid(__s32 *val, const char *arg)
> +{
> + const struct static_clockid {
> + const char *name;
> + clockid_t clockid;
> + } clockids_sysv[] = {
> + { "CLOCK_REALTIME", CLOCK_REALTIME },
> + { "CLOCK_TAI", CLOCK_TAI },
> + { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
> + { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
> + { NULL }
> + };
> +
> + const struct static_clockid *c;
> +
> + for (c = clockids_sysv; c->name; c++) {
> + if (strncasecmp(c->name, arg, 25) == 0) {
> + *val = c->clockid;
> +
> + return 0;
> + }
> + }
> +
> + return -1;
> +}

Internally, kernel must use ktime. For the userspace part the TC standard
is to use USER HZ of 100.

Please change user kernel API of this module to match other existing modules.
Doing something unique for this module is not necessary.




Fw: [Bug 200315] New: [Regression] Wired network does not come back after reboot

2018-07-06 Thread Stephen Hemminger



Begin forwarded message:

Date: Thu, 28 Jun 2018 01:59:30 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200315] New: [Regression] Wired network does not come back after 
reboot


https://bugzilla.kernel.org/show_bug.cgi?id=200315

Bug ID: 200315
   Summary: [Regression] Wired network does not come back after
reboot
   Product: Networking
   Version: 2.5
Kernel Version: 4.17.2
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: nicholaslima...@gmail.com
Regression: No

Created attachment 276937
  --> https://bugzilla.kernel.org/attachment.cgi?id=276937=edit  
It's a log from journalctl NetworkManager

Only difference is using kernel 4.17.2 vs kernel 4.16.15

When using 4.17.2, the network does not work after rebooting. With 4.16.15 it
works as expected.

Complementing: The wired connection doesn't work if I choose to reboot, but It
does work if I power it off and on again.

-- 
You are receiving this mail because:
You are the assignee for the bug.


Fw: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE

2018-07-06 Thread Stephen Hemminger



Begin forwarded message:

Date: Sat, 23 Jun 2018 00:00:25 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE


https://bugzilla.kernel.org/show_bug.cgi?id=200215

Bug ID: 200215
   Summary: UBSAN: Undefined behaviour in net/core/sock.c:LINE
   Product: Networking
   Version: 2.5
Kernel Version: v4.18-rc2
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: icy...@gmail.com
Regression: No

Hi,
This bug was found in Linux Kernel v4.18-rc2

$ cat report4 

UBSAN: Undefined behaviour in net/core/sock.c:793:19
signed integer overflow:
-1704733899 * 2 cannot be represented in type 'int'
CPU: 0 PID: 5695 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x122/0x1c8 lib/dump_stack.c:113
 ubsan_epilogue+0x12/0x86 lib/ubsan.c:159
 handle_overflow+0x1c2/0x21f lib/ubsan.c:190
 __ubsan_handle_mul_overflow+0x2a/0x38 lib/ubsan.c:214
 sock_setsockopt+0x17f1/0x1c80 net/core/sock.c:793
 __sys_setsockopt+0x23f/0x2a0 net/socket.c:1943
 __do_sys_setsockopt net/socket.c:1958 [inline]
 __se_sys_setsockopt net/socket.c:1955 [inline]
 __x64_sys_setsockopt+0xcc/0x170 net/socket.c:1955
 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455a09
Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83
eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 
RSP: 002b:7fb211862c68 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 7fb2118636d4 RCX: 00455a09
RDX: 0021 RSI: 0001 RDI: 0013
RBP: 0072bea0 R08: 0004 R09: 
R10: 2000 R11: 0246 R12: 
R13: 06dc R14: 006ff540 R15: 

sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present
sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
print_req_error: I/O error, dev sr0, sector 0
sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present
sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 40 00 00 40 00
print_req_error: I/O error, dev sr0, sector 256
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 02 00
print_req_error: critical target error, dev sr0, sector 0
Buffer I/O error on dev sr0, logical block 0, lost async page write
Buffer I/O error on dev sr0, logical block 1, lost async page write
Buffer I/O error on dev sr0, logical block 2, lost async page write
Buffer I/O error on dev sr0, logical block 3, lost async page write
Buffer I/O error on dev sr0, logical block 4, lost async page write
Buffer I/O error on dev sr0, logical block 5, lost async page write
Buffer I/O error on dev sr0, logical block 6, lost async page write
Buffer I/O error on dev sr0, logical block 7, lost async page write

This bug can be repro, if you need it, please tell me.
Thanks,
Icytxw

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: [PATCH v1 net-next 9/9] lan743x: Add PTP support

2018-07-06 Thread Richard Cochran
On Thu, Jul 05, 2018 at 12:39:26PM -0400, Bryan Whitehead wrote:
> Signed-off-by: Bryan Whitehead 

1. You forgot the commit message.
2. You forgot to add the PTP maintainer on CC.

Thanks,
Richard



Fw: [Bug 200239] New: RTL8211E lockup when having heavy tx traffic

2018-07-06 Thread Stephen Hemminger



Begin forwarded message:

Date: Sun, 24 Jun 2018 10:26:02 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200239] New: RTL8211E lockup when having heavy tx traffic


https://bugzilla.kernel.org/show_bug.cgi?id=200239

Bug ID: 200239
   Summary: RTL8211E lockup when having heavy tx traffic
   Product: Networking
   Version: 2.5
Kernel Version: 4.17
  Hardware: ARM
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: IPV4
  Assignee: step...@networkplumber.org
  Reporter: lollipop.studio...@gmail.com
Regression: No

My device: roc-rk3328-cc
It's a SBC(Single Board Computer) use soc Rockchip RK3328

How to reproduce this bug: just running a mainline linux kernel,then run iperf3
to test.

I also tried this patch https://patchwork.kernel.org/patch/10178969/ ,It
slightly improved but cannot completely solve this problem.

dmesg summary:
[14755.405897] [ cut here ]
[14755.406386] NETDEV WATCHDOG: eth0 (rk_gmac-dwmac): transmit queue 0 timed
out
[14755.407258] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:473
dev_watchdog+0x2ac/0x2b8
[14755.407992] Modules linked in: 8021q garp mrp stp llc rockchipdrm
analogix_dp phy_rockchip_inno_hdmi tcp_bbr sch_fq
[14755.408985] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
4.17.0-rc7-1-90343-g15e473367050-dirty #1
[14755.409762] Hardware name: Firefly roc-rk3328-cc (DT)
[14755.410224] pstate: 2005 (nzCv daif -PAN -UAO)
[14755.410664] pc : dev_watchdog+0x2ac/0x2b8
[14755.411034] lr : dev_watchdog+0x2ac/0x2b8
[14755.411396] sp : 08013d80
[14755.411696] x29: 08013d80 x28: 0005 
[14755.412184] x27: 0101 x26:  
[14755.412670] x25: 800079d604b8 x24: 0002 
[14755.413156] x23: 800079d6049c x22: 80007b74de80 
[14755.413641] x21: 09b07000 x20: 800079d6 
[14755.414128] x19:  x18: 0010 
[14755.414614] x17:  x16: 08229758 
[14755.415099] x15:  x14: 09b09d88 
[14755.415585] x13: 89cb16f7 x12: 09cb16ff 
[14755.416073] x11: 09b23000 x10: 08013a60 
[14755.416557] x9 : ffd0 x8 : 0874ee68 
[14755.417044] x7 : 756575712074696d x6 : 017e 
[14755.417525] x5 : 085b5400 x4 : 0004 
[14755.418012] x3 : 09b0c850 x2 : 80007c3d3800 
[14755.418496] x1 : bea2cbb4ea219900 x0 :  
[14755.418988] Call trace:
[14755.419226]  dev_watchdog+0x2ac/0x2b8
[14755.419574]  call_timer_fn+0x20/0x78
[14755.419903]  expire_timers+0xa8/0xb8
[14755.420239]  run_timer_softirq+0xa4/0x190
[14755.420610]  __do_softirq+0x124/0x220
[14755.420951]  irq_exit+0xb0/0xd0
[14755.421250]  __handle_domain_irq+0x64/0xb8
[14755.421627]  gic_handle_irq+0x50/0xa0
[14755.421967]  el1_irq+0xb0/0x128
[14755.422265]  arch_cpu_idle+0x10/0x18
[14755.422600]  do_idle+0x208/0x270
[14755.422896]  cpu_startup_entry+0x20/0x28
[14755.423265]  secondary_start_kernel+0x14c/0x160
[14755.423675] ---[ end trace 633d025b2aafb3cb ]---
[14755.424181] rk_gmac-dwmac ff54.ethernet eth0: Reset adapter.
[14755.427911] RTL8211E Gigabit Ethernet stmmac-0:00: attached PHY driver
[RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=stmmac-0:00, irq=POLL)
[14755.441983] rk_gmac-dwmac ff54.ethernet eth0: PTP not supported by HW
[14755.442989] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[14759.535770] rk_gmac-dwmac ff54.ethernet eth0: Link is Up - 1Gbps/Full -
flow control rx/tx
[14759.536606] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

-- 
You are receiving this mail because:
You are the assignee for the bug.


Fw: [Bug 200033] stack-out-of-bounds in __xfrm_dst_hash net/xfrm/xfrm_hash.h

2018-07-06 Thread Stephen Hemminger



Begin forwarded message:

Date: Sat, 23 Jun 2018 14:24:30 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200033] stack-out-of-bounds in __xfrm_dst_hash 
net/xfrm/xfrm_hash.h


https://bugzilla.kernel.org/show_bug.cgi?id=200033

--- Comment #1 from icytxw (icy...@gmail.com) ---

More details:

 26 static inline unsigned int __xfrm6_daddr_saddr_hash(const xfrm_address_t
*daddr,
 27 const xfrm_address_t
*saddr)
 28 {
 29 return ntohl(daddr->a6[2] ^ daddr->a6[3] ^
 30  saddr->a6[2] ^ saddr->a6[3]);
 31 }

$ cat report0 
==
BUG: KASAN: stack-out-of-bounds in __xfrm6_daddr_saddr_hash
net/xfrm/xfrm_hash.h:29 [inline]
BUG: KASAN: stack-out-of-bounds in __xfrm_dst_hash net/xfrm/xfrm_hash.h:96
[inline]
BUG: KASAN: stack-out-of-bounds in xfrm_dst_hash net/xfrm/xfrm_state.c:61
[inline]
BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x2b18/0x3160
net/xfrm/xfrm_state.c:953
Read of size 4 at addr 88004ad57b20 by task syz-executor0/4355

CPU: 0 PID: 4355 Comm: syz-executor0 Not tainted 4.18.0-rc1 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x122/0x1c8 lib/dump_stack.c:113
 print_address_description+0x88/0x3b0 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x185/0x4a0 mm/kasan/report.c:412
 __asan_report_load4_noabort+0x19/0x20 mm/kasan/report.c:432
 __xfrm6_daddr_saddr_hash net/xfrm/xfrm_hash.h:29 [inline]
 __xfrm_dst_hash net/xfrm/xfrm_hash.h:96 [inline]
 xfrm_dst_hash net/xfrm/xfrm_state.c:61 [inline]
 xfrm_state_find+0x2b18/0x3160 net/xfrm/xfrm_state.c:953
 xfrm_tmpl_resolve_one net/xfrm/xfrm_policy.c:1393 [inline]
 xfrm_tmpl_resolve+0x418/0xc10 net/xfrm/xfrm_policy.c:1437
 xfrm_resolve_and_create_bundle+0x112/0x980 net/xfrm/xfrm_policy.c:1832
 xfrm_lookup+0x298/0x1be0 net/xfrm/xfrm_policy.c:2162
 xfrm_lookup_route+0x42/0x1f0 net/xfrm/xfrm_policy.c:2282
 ip_route_output_flow+0x86/0xc0 net/ipv4/route.c:2588
 udp_sendmsg+0x15c1/0x2180 net/ipv4/udp.c:1086
 inet_sendmsg+0x103/0x490 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:645 [inline]
 sock_sendmsg+0xf9/0x180 net/socket.c:655
 __sys_sendto+0x239/0x3c0 net/socket.c:1833
 __do_sys_sendto net/socket.c:1845 [inline]
 __se_sys_sendto net/socket.c:1841 [inline]
 __x64_sys_sendto+0xef/0x1c0 net/socket.c:1841
 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455a09
Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83
eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 
RSP: 002b:7f58654ecc68 EFLAGS: 0246 ORIG_RAX: 002c
RAX: ffda RBX: 7f58654ed6d4 RCX: 00455a09
RDX:  RSI: 200014c0 RDI: 0013
RBP: 0072bea0 R08: 20001540 R09: 0010
R10:  R11: 0246 R12: 
R13: 05d7 R14: 006fdcc8 R15: 

The buggy address belongs to the page:
page:ea00012b55c0 count:0 mapcount:0 mapping: index:0x0
flags: 0x100()
raw: 0100  ea00012b55c8 
raw:    
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88004ad57a00: f2 00 f4 f4 f4 f2 f2 f2 f2 00 00 00 00 f2 f2 f2
 88004ad57a80: f2 00 00 00 00 00 f4 f4 f4 f2 f2 f2 f2 00 00 00
>88004ad57b00: 00 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00 00  
   ^
 88004ad57b80: 00 00 f4 f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00
 88004ad57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==
a repro of this bug see:
https://github.com/lcytxw/bug_repro/tree/master/bug_200033

-- 
You are receiving this mail because:
You are the assignee for the bug.


Fw: [Bug 200195] New: [Regression] Network does not come back after suspend

2018-07-06 Thread Stephen Hemminger



Begin forwarded message:

Date: Fri, 22 Jun 2018 18:04:00 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200195] New: [Regression] Network does not come back after suspend


https://bugzilla.kernel.org/show_bug.cgi?id=200195

Bug ID: 200195
   Summary: [Regression] Network does not come back after suspend
   Product: Networking
   Version: 2.5
Kernel Version: 4.17.2
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: aa...@kde.org
Regression: No

Created attachment 276753
  --> https://bugzilla.kernel.org/attachment.cgi?id=276753=edit  
lspci -vvv output

Setup:
 * Using Archlinux
 * Using wired network

Only difference is using kernel 4.17.2 vs kernel 4.16.13

When using 4.17.2, the network does not work after resuming from suspend. With
4.16.13 it works as expected.

My guess is that this is driver/hardware dependent since i have another machine
with the same setup and it does not seem to have that regression.

I'm attaching the lspci output of 4.16.13.

Anything else I can provide or test to pinpoint where the regression happened
please do not hesitate to ask.

-- 
You are receiving this mail because:
You are the assignee for the bug.


Fw: [Bug 200187] New: UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568

2018-07-06 Thread Stephen Hemminger



Begin forwarded message:

Date: Fri, 22 Jun 2018 15:09:48 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200187] New: UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568


https://bugzilla.kernel.org/show_bug.cgi?id=200187

Bug ID: 200187
   Summary: UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568
   Product: Networking
   Version: 2.5
Kernel Version: v4.18-rc2
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: icy...@gmail.com
Regression: No

It seems like that to->to_retries to large while shift exponent.
if (to->to_exponential)
req->rq_majortimeo <<= to->to_retries;
else


UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568:22
shift exponent 536870976 is too large for 64-bit type 'long unsigned int'
CPU: 0 PID: 21219 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x122/0x1c8 lib/dump_stack.c:113
 ubsan_epilogue+0x12/0x86 lib/ubsan.c:159
 __ubsan_handle_shift_out_of_bounds+0x29a/0x2ff lib/ubsan.c:425
 xprt_reset_majortimeo+0x323/0x440 net/sunrpc/xprt.c:568
 xprt_request_init+0x4d2/0x730 net/sunrpc/xprt.c:1330
 call_reserveresult+0x9d/0x240 net/sunrpc/clnt.c:1549
 __rpc_execute+0x23a/0xc20 net/sunrpc/sched.c:784
 rpc_execute+0xf5/0x250 net/sunrpc/sched.c:852
 rpc_run_task+0x4a1/0x9f0 net/sunrpc/clnt.c:1053
 rpc_call_sync+0xcf/0x1a0 net/sunrpc/clnt.c:1082
 rpc_ping+0xde/0x140 net/sunrpc/clnt.c:2514
 rpc_create_xprt+0x1e8/0x590 net/sunrpc/clnt.c:479
 rpc_create+0x3a9/0x600 net/sunrpc/clnt.c:587
 nfs_create_rpc_client+0x3a1/0x4d0 fs/nfs/client.c:523
 nfs_init_client+0x7b/0x100 fs/nfs/client.c:634
 nfs_get_client+0xa74/0x1440 fs/nfs/client.c:425
 nfs_init_server+0x236/0xdf0 fs/nfs/client.c:670
 nfs_create_server+0x9a/0x750 fs/nfs/client.c:953
 nfs_try_mount+0x270/0xaf0 fs/nfs/super.c:1884
 nfs_fs_mount+0x151f/0x30e0 fs/nfs/super.c:2695
 mount_fs+0xaf/0x330 fs/super.c:1277
 vfs_kern_mount+0xfc/0x4d0 fs/namespace.c:1037
 do_new_mount fs/namespace.c:2518 [inline]
 do_mount+0x46f/0x2fa0 fs/namespace.c:2848
 ksys_mount+0xb0/0x120 fs/namespace.c:3064
 __do_sys_mount fs/namespace.c:3078 [inline]
 __se_sys_mount fs/namespace.c:3075 [inline]
 __x64_sys_mount+0xcc/0x170 fs/namespace.c:3075
 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455a09
Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83
eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 
RSP: 002b:7f11c724ac68 EFLAGS: 0246 ORIG_RAX: 00a5
RAX: ffda RBX: 7f11c724b6d4 RCX: 00455a09
RDX: 21c0 RSI: 2100 RDI: 2200
RBP: 0072bea0 R08: 2180 R09: 
R10:  R11: 0246 R12: 
R13: 04d1 R14: 006fc438 R15: 

rpcbind: RPC call returned error 22
rpcbind: RPC call returned error 22
rpcbind: RPC call returned error 22
rpcbind: RPC call returned error 22
rpcbind: RPC call returned error 22
FAT-fs (loop1): bogus number of reserved sectors
FAT-fs (loop1): Can't find a valid FAT filesystem
FAT-fs (loop1): bogus number of reserved sectors
FAT-fs (loop1): Can't find a valid FAT filesystem
FAT-fs (loop1): bogus number of reserved sectors
FAT-fs (loop1): Can't find a valid FAT filesystem
FAT-fs (loop1): bogus number of reserved sectors
FAT-fs (loop1): Can't find a valid FAT filesystem
FAT-fs (loop1): bogus number of reserved sectors
FAT-fs (loop1): Can't find a valid FAT filesystem
EXT4-fs (sda): re-mounted. Opts: 
EXT4-fs (sda): re-mounted. Opts: 
EXT4-fs (sda): re-mounted. Opts: 
EXT4-fs (sda): re-mounted. Opts: 
EXT4-fs (sda): re-mounted. Opts:

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: [PATCH v1 net-next 9/9] lan743x: Add PTP support

2018-07-06 Thread kbuild test robot
Hi Bryan,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Bryan-Whitehead/lan743x-Add-features-to-lan743x-driver/20180706-051812
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "__moddi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined!
>> ERROR: "__udivdi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined!
>> ERROR: "__umoddi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined!
>> ERROR: "__divdi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v1 iproute2 1/2] uapi pkt_sched: Add etf info - DO NOT COMMIT

2018-07-06 Thread Stephen Hemminger
On Wed, 27 Jun 2018 15:09:11 -0700
Jesus Sanchez-Palencia  wrote:

> This should come from the next uapi headers update.
> Sending it now just as a convenience so anyone can build tc with etf
> and taprio support.
> 
> Signed-off-by: Jesus Sanchez-Palencia 

This should be targeted at iproute2-next


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Waskiewicz Jr, Peter
On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?
> 
> let's table configuration of metadata aside for a second.

I agree that this should/could be NIC-specific and shouldn't weigh on
the metadata interface between the drivers and XDP layer.

> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF
> description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct
> xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-
> time.
> So bpf program would need two loads to read csum or hash field
> instead of one.
> With BTF the layout of metadata is known to the program at build-
> time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either
> in the driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> the driver and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum
> or md->hash
>   with direct single load out of metadata area in front of the packet

This piece is where I'd like to discuss more.  When we discussed this
in Seoul, the initial proposal was a static struct that we'd try to
hammer out a common layout between the interested parties.  That
obviously wasn't going to scale, and we wanted to pursue something more
dynamic.  But I thought the goal was the XDP/eBPF program wouldn't want
to care what the underlying device is, and could just ask for metadata
that it's interested in.  With this approach, your eBPF program is now
bound/tied to the NIC/driver, and if you switched to a differen
NIC/driver combo, then you'd have to rewrite part of your eBPF program
to comprehend that.  I thought we were trying to avoid that.

Our proposed approach (still working on something ready to RFC) is to
provide a method for the eBPF program to send a struct of requested
hints down to the driver on load.  If the driver can provide the hints,
then that'd be how they'd be laid out in the metadata.  If it can't
provide them, we'd probably reject the program loading, or discuss
providing a software fallback (I know this is an area of contention).

I suppose we could get there with the rewriting mechanism described
below, but that'd be a tough sell to set a bit of ABI for metadata,
then change it to be potentially dynamic at runtime in the future.

-PJ

[PATCH net 1/2] net/sched: act_csum: fix NULL dereference when 'goto chain' is used

2018-07-06 Thread Davide Caratti
the control action in the common member of struct tcf_csum must be a valid
value, as it can contain the chain index when 'goto chain' is used. Ensure
that the control action can be read as x->tcfa_action, when x is a pointer
to struct tc_action and x->ops->type is TCA_ACT_CSUM, to prevent the
following command:

  # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
  > $tcflags dst_mac $h2mac action csum ip or tcp or udp or sctp goto chain 1

from triggering a NULL pointer dereference when a matching packet is
received.

 BUG: unable to handle kernel NULL pointer dereference at 
 PGD 80010416b067 P4D 80010416b067 PUD 1041be067 PMD 0
 Oops:  [#1] SMP PTI
 CPU: 0 PID: 3072 Comm: mausezahn Tainted: GE 
4.18.0-rc2.auguri+ #421
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 
02/07/2013
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 
5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 
83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:a020dea03c40 EFLAGS: 00010246
 RAX: 2001 RBX: a020d7ccef00 RCX: 0054
 RDX:  RSI: a020ca5ae000 RDI: a020d7ccef00
 RBP: a020dea03e60 R08:  R09: a020dea03c9c
 R10: a020dea03c78 R11: 0008 R12: a020d3fe4f00
 R13: a020d3fe4f08 R14: 0001 R15: a020d53ca300
 FS:  7f5a46942740() GS:a020dea0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 000104218002 CR4: 001606f0
 Call Trace:
  
  fl_classify+0x1ad/0x1c0 [cls_flower]
  ? arp_rcv+0x121/0x1b0
  ? __x2apic_send_IPI_dest+0x40/0x40
  ? smp_reschedule_interrupt+0x1c/0xd0
  ? reschedule_interrupt+0xf/0x20
  ? reschedule_interrupt+0xa/0x20
  ? device_is_rmrr_locked+0xe/0x50
  ? iommu_should_identity_map+0x49/0xd0
  ? __intel_map_single+0x30/0x140
  ? e1000e_update_rdt_wa.isra.52+0x22/0xb0 [e1000e]
  ? e1000_alloc_rx_buffers+0x233/0x250 [e1000e]
  ? kmem_cache_alloc+0x38/0x1c0
  tcf_classify+0x89/0x140
  __netif_receive_skb_core+0x5ea/0xb70
  ? enqueue_task_fair+0xb6/0x7d0
  ? process_backlog+0x97/0x150
  process_backlog+0x97/0x150
  net_rx_action+0x14b/0x3e0
  __do_softirq+0xde/0x2b4
  do_softirq_own_stack+0x2a/0x40
  
  do_softirq.part.18+0x49/0x50
  __local_bh_enable_ip+0x49/0x50
  __dev_queue_xmit+0x4ab/0x8a0
  ? wait_woken+0x80/0x80
  ? packet_sendmsg+0x38f/0x810
  ? __dev_queue_xmit+0x8a0/0x8a0
  packet_sendmsg+0x38f/0x810
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  ? do_vfs_ioctl+0xa4/0x630
  ? syscall_trace_enter+0x1df/0x2e0
  ? __audit_syscall_exit+0x22a/0x290
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f5a45cbec93
 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 
3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 
c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
 RSP: 002b:7ffd0ee6d748 EFLAGS: 0246 ORIG_RAX: 002c
 RAX: ffda RBX: 01161010 RCX: 7f5a45cbec93
 RDX: 0062 RSI: 01161322 RDI: 0003
 RBP: 7ffd0ee6d780 R08: 7ffd0ee6d760 R09: 0014
 R10:  R11: 0246 R12: 0062
 R13: 01161322 R14: 7ffd0ee6d760 R15: 0003
 Modules linked in: act_csum act_gact cls_flower sch_ingress vrf veth 
act_tunnel_key(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat 
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek 
kvm snd_hda_codec_generic hp_wmi iTCO_wdt sparse_keymap rfkill mei_wdt 
iTCO_vendor_support wmi_bmof gpio_ich irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel snd_hda_intel crypto_simd cryptd 
snd_hda_codec glue_helper snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 
pcspkr i2c_i801 snd_timer snd sg lpc_ich soundcore wmi mei_me
  mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
libcrc32c sr_mod cdrom sd_mod ahci libahci crc32c_intel i915 ixgbe serio_raw 
libata video dca i2c_algo_bit sfc drm_kms_helper syscopyarea mtd sysfillrect 
mdio sysimgblt fb_sys_fops drm e1000e i2c_core
 CR2: 
 ---[ end trace 3c9e9d1a77df4026 ]---
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 
5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 
83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:a020dea03c40 EFLAGS: 00010246
 RAX: 2001 RBX: 

[PATCH net 2/2] net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used

2018-07-06 Thread Davide Caratti
the control action in the common member of struct tcf_tunnel_key must be a
valid value, as it can contain the chain index when 'goto chain' is used.
Ensure that the control action can be read as x->tcfa_action, when x is a
pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
prevent the following command:

 # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
 > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1

from causing a NULL dereference when a matching packet is received:

 BUG: unable to handle kernel NULL pointer dereference at 
 PGD 8001097ac067 P4D 8001097ac067 PUD 103b0a067 PMD 0
 Oops:  [#1] SMP PTI
 CPU: 0 PID: 3491 Comm: mausezahn Tainted: GE 
4.18.0-rc2.auguri+ #421
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 
02/07/2013
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 
5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 
83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:95145ea03c40 EFLAGS: 00010246
 RAX: 2001 RBX: 9514499e5800 RCX: 0001
 RDX:  RSI: 0002 RDI: 
 RBP: 95145ea03e60 R08:  R09: 95145ea03c9c
 R10: 95145ea03c78 R11: 0008 R12: 951456a69800
 R13: 951456a69808 R14: 0001 R15: 95144965ee40
 FS:  7fd67ee11740() GS:95145ea0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 0001038a2006 CR4: 001606f0
 Call Trace:
  
  fl_classify+0x1ad/0x1c0 [cls_flower]
  ? __update_load_avg_se.isra.47+0x1ca/0x1d0
  ? __update_load_avg_se.isra.47+0x1ca/0x1d0
  ? update_load_avg+0x665/0x690
  ? update_load_avg+0x665/0x690
  ? kmem_cache_alloc+0x38/0x1c0
  tcf_classify+0x89/0x140
  __netif_receive_skb_core+0x5ea/0xb70
  ? enqueue_entity+0xd0/0x270
  ? process_backlog+0x97/0x150
  process_backlog+0x97/0x150
  net_rx_action+0x14b/0x3e0
  __do_softirq+0xde/0x2b4
  do_softirq_own_stack+0x2a/0x40
  
  do_softirq.part.18+0x49/0x50
  __local_bh_enable_ip+0x49/0x50
  __dev_queue_xmit+0x4ab/0x8a0
  ? wait_woken+0x80/0x80
  ? packet_sendmsg+0x38f/0x810
  ? __dev_queue_xmit+0x8a0/0x8a0
  packet_sendmsg+0x38f/0x810
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  ? do_vfs_ioctl+0xa4/0x630
  ? syscall_trace_enter+0x1df/0x2e0
  ? __audit_syscall_exit+0x22a/0x290
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fd67e18dc93
 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 
3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 
c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
 RSP: 002b:7ffe0189b748 EFLAGS: 0246 ORIG_RAX: 002c
 RAX: ffda RBX: 020ca010 RCX: 7fd67e18dc93
 RDX: 0062 RSI: 020ca322 RDI: 0003
 RBP: 7ffe0189b780 R08: 7ffe0189b760 R09: 0014
 R10:  R11: 0246 R12: 0062
 R13: 020ca322 R14: 7ffe0189b760 R15: 0003
 Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth 
act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT 
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal 
intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel 
kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi 
ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill 
snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich 
iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm 
pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
  mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci 
serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
 CR2: 
 ---[ end trace 1ab8b5b5d4639dfc ]---
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 
5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 
83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:95145ea03c40 EFLAGS: 00010246
 RAX: 2001 RBX: 9514499e5800 RCX: 0001
 RDX:  RSI: 0002 RDI: 
 RBP: 95145ea03e60 R08:  R09: 95145ea03c9c
 R10: 95145ea03c78 R11: 0008 R12: 951456a69800
 R13: 

[PATCH net 0/2] net/sched: fix NULL dereference in 'goto chain' control action

2018-07-06 Thread Davide Caratti
in a couple of TC actions (i.e. csum and tunnel_key), the control action
is stored together with the action-specific configuration data.
This avoids a race condition (see [1]), but it causes a crash when 'goto
chain' is used with the above actions. Since this race condition is
tolerated on the other TC actions (it's present even on actions where the
spinlock is still used), storing the control action in the common area
should be acceptable for tunnel_key and csum as well.

[1] https://www.spinics.net/lists/netdev/msg472047.html

Davide Caratti (2):
  net/sched: act_csum: fix NULL dereference when 'goto chain' is used
  net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is
used

 include/net/tc_act/tc_csum.h   | 1 -
 include/net/tc_act/tc_tunnel_key.h | 1 -
 net/sched/act_csum.c   | 6 +++---
 net/sched/act_tunnel_key.c | 6 +++---
 4 files changed, 6 insertions(+), 8 deletions(-)

-- 
2.17.1



[PATCH V2 net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2

2018-07-06 Thread Felix Manlunas
From: Rick Farrington 

Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get
speed" command which is sent by the 1.7.2 driver (for CN23XX-225 cards
only) during modprobe.  Due to a bug in older firmware (with respect to
unknown commands), this unsupported command causes a cascade of errors that
ends in a kernel panic.

Fix it by making the sending of the "get speed" command conditional on the
firmware version.

Signed-off-by: Rick Farrington 
Signed-off-by: Felix Manlunas 
---
Patch Change Log:
  V1 -> V2:
* In the if-statement that checks the firmware version, replace the
  boolean expression that calls strcmp() (which is not suitable when
  the firmware micro version has more than one digit) with a boolean
  expression that works in all cases.

* In the patch description, specify which types of liquidio cards are
  afffected.

 drivers/net/ethernet/cavium/liquidio/lio_main.c| 26 --
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  9 
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 7cb4e75..ebda6ef 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3299,7 +3299,9 @@ static int setup_nic_devices(struct octeon_device 
*octeon_dev)
 {
struct lio *lio = NULL;
struct net_device *netdev;
-   u8 mac[6], i, j, *fw_ver;
+   u8 mac[6], i, j, *fw_ver, *micro_ver;
+   unsigned long micro;
+   u32 cur_ver;
struct octeon_soft_command *sc;
struct liquidio_if_cfg_context *ctx;
struct liquidio_if_cfg_resp *resp;
@@ -3429,6 +3431,14 @@ static int setup_nic_devices(struct octeon_device 
*octeon_dev)
 fw_ver);
}
 
+   /* extract micro version field; point past '..' */
+   micro_ver = fw_ver + strlen(LIQUIDIO_BASE_VERSION) + 1;
+   if (kstrtoul(micro_ver, 10, ) != 0)
+   micro = 0;
+   octeon_dev->fw_info.ver.maj = LIQUIDIO_BASE_MAJOR_VERSION;
+   octeon_dev->fw_info.ver.min = LIQUIDIO_BASE_MINOR_VERSION;
+   octeon_dev->fw_info.ver.rev = micro;
+
octeon_swap_8B_data((u64 *)(>cfg_info),
(sizeof(struct liquidio_if_cfg_info)) >> 3);
 
@@ -3671,7 +3681,19 @@ static int setup_nic_devices(struct octeon_device 
*octeon_dev)
OCTEON_CN2350_25GB_SUBSYS_ID ||
octeon_dev->subsystem_id ==
OCTEON_CN2360_25GB_SUBSYS_ID) {
-   liquidio_get_speed(lio);
+   cur_ver = OCT_FW_VER(octeon_dev->fw_info.ver.maj,
+octeon_dev->fw_info.ver.min,
+octeon_dev->fw_info.ver.rev);
+
+   /* speed control unsupported in f/w older than 1.7.2 */
+   if (cur_ver < OCT_FW_VER(1, 7, 2)) {
+   dev_info(_dev->pci_dev->dev,
+"speed setting not supported by f/w.");
+   octeon_dev->speed_setting = 25;
+   octeon_dev->no_speed_setting = 1;
+   } else {
+   liquidio_get_speed(lio);
+   }
 
if (octeon_dev->speed_setting == 0) {
octeon_dev->speed_setting = 25;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.h 
b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
index 94a4ed88d..d99ca6b 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
@@ -288,8 +288,17 @@ struct oct_fw_info {
 */
u32 app_mode;
char   liquidio_firmware_version[32];
+   /* Fields extracted from legacy string 'liquidio_firmware_version' */
+   struct {
+   u8  maj;
+   u8  min;
+   u8  rev;
+   } ver;
 };
 
+#define OCT_FW_VER(maj, min, rev) \
+   (((u32)(maj) << 16) | ((u32)(min) << 8) | ((u32)(rev)))
+
 /* wrappers around work structs */
 struct cavium_wk {
struct delayed_work work;


Re: [PATCH net-next 0/6] sock cookie initializers

2018-07-06 Thread Willem de Bruijn
On Fri, Jul 6, 2018 at 1:24 PM Jesus Sanchez-Palencia
 wrote:
>
>
>
> On 07/06/2018 07:12 AM, Willem de Bruijn wrote:
> > From: Willem de Bruijn 
> >
> > Recent UDP GSO and SO_TXTIME features added new fields to cookie
> > structs.
> >
> > When adding a field, all sites where a struct is initialized have to
> > be updated, which is a lot of boilerplate. Alternatively, a field can
> > be initialized selectively, but this is fragile. I introduced a bug
> > in udp gso where an uninitialized field was read. See also fix commit
> > ("9887cba19978 ip: limit use of gso_size to udp").
> >
> > Introduce initializers for structs ipcm(6)_cookie and sockc_cookie.
> >
> > patch 1..3 do exactly this.
> > patch 4..5 make ipv4 and ipv6 handle cookies the same way and
> >remove some boilerplate in doing so.
> > patch 6removes the udp gso branch that needed the above fix
>
>
> Acked-by: Jesus Sanchez-Palencia 
>
> I've applied this series and tested SO_TXTIME + the etf qdisc and everything 
> is
> working just fine.

Thanks for testing, Jesus!

I also ran the udpgso and txtimestamp tests (and am expanding
the second to run as part of kselftest, and have ipv6 and cmsg support).


Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread Santosh Shilimkar

Hi Ka-Cheong,

On 7/6/2018 8:25 AM, Sowmini Varadhan wrote:

On (07/06/18 23:08), Ka-Cheong Poon wrote:


As mentioned in a previous mail, it is unclear why the
port number is transport specific.  Most Internet services
use the same port number running over TCP/UDP as shown
in the IANA database.  And the IANA RDS registration is
the same.  What is the rationale of having a transport
specific number in the RDS implementation?


because not every transport may need a port number.


Lets keep separate port for RDMA and TCP transport. This has been
already useful for wireshark dissector and can also help for eBPF
like external tooling. The fragment format and re-assembly is
different across transports.

I do see your point and also agree that port number isn't transport
specific and in case we need to add another transport, what port
to use. But may be till then lets keep the existing behavior.
As such this port switch is not related to IPv6 support as such
so lets deal with it separately.

Regards,
Santosh


Re: [PATCH net-next 0/6] sock cookie initializers

2018-07-06 Thread Jesus Sanchez-Palencia



On 07/06/2018 07:12 AM, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Recent UDP GSO and SO_TXTIME features added new fields to cookie
> structs.
> 
> When adding a field, all sites where a struct is initialized have to
> be updated, which is a lot of boilerplate. Alternatively, a field can
> be initialized selectively, but this is fragile. I introduced a bug
> in udp gso where an uninitialized field was read. See also fix commit
> ("9887cba19978 ip: limit use of gso_size to udp").
> 
> Introduce initializers for structs ipcm(6)_cookie and sockc_cookie.
> 
> patch 1..3 do exactly this.
> patch 4..5 make ipv4 and ipv6 handle cookies the same way and
>remove some boilerplate in doing so.
> patch 6removes the udp gso branch that needed the above fix


Acked-by: Jesus Sanchez-Palencia 

I've applied this series and tested SO_TXTIME + the etf qdisc and everything is
working just fine.


Thanks,
Jesus



> 
> Willem de Bruijn (6):
>   ipv4: ipcm_cookie initializers
>   ipv6: ipcm6_cookie initializer
>   sock: sockc cookie initializer
>   ipv6: fold sockcm_cookie into ipcm6_cookie
>   ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6
>   ip: unconditionally set cork gso_size
> 
>  include/net/ip.h | 16 ++-
>  include/net/ipv6.h   | 26 
>  include/net/sock.h   |  6 ++
>  include/net/transp_v6.h  |  3 +--
>  net/ipv4/icmp.c  | 11 ++
>  net/ipv4/ip_output.c | 12 ---
>  net/ipv4/ping.c  | 11 +-
>  net/ipv4/raw.c   | 11 +-
>  net/ipv4/tcp.c   |  2 +-
>  net/ipv4/udp.c   | 12 +--
>  net/ipv6/datagram.c  |  4 ++--
>  net/ipv6/icmp.c  | 14 -
>  net/ipv6/ip6_flowlabel.c |  3 +--
>  net/ipv6/ip6_output.c| 43 +---
>  net/ipv6/ipv6_sockglue.c |  3 +--
>  net/ipv6/ping.c  |  7 ++-
>  net/ipv6/raw.c   | 15 +-
>  net/ipv6/udp.c   | 14 +
>  net/l2tp/l2tp_ip6.c  | 10 +++---
>  net/packet/af_packet.c   |  9 +++--
>  20 files changed, 98 insertions(+), 134 deletions(-)
> 


[PATCH net-next 3/3] net: refactor dequeue-model list processing

2018-07-06 Thread Edward Cree
New macro list_for_each_entry_dequeue loops over a list by popping entries
 from the head, allowing a more concise expression of the dequeue-enqueue
 model of list processing and avoiding the need for a 'next' pointer (as
 used in list_for_each_entry_safe).

Signed-off-by: Edward Cree 
---
 include/linux/list.h  | 15 +++
 include/linux/netfilter.h |  6 ++
 net/core/dev.c|  6 ++
 net/ipv4/ip_input.c   | 10 --
 net/ipv6/ip6_input.c  | 10 --
 5 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index de04cc5ed536..150751d03441 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -645,6 +645,21 @@ static inline void list_splice_tail_init(struct list_head 
*list,
 #define list_safe_reset_next(pos, n, member)   \
n = list_next_entry(pos, member)
 
+/**
+ * list_for_each_entry_dequeue - iterate over list by removing entries
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, removing each entry from the list before
+ * running the loop body.  The loop will run until the list is empty, so
+ * adding an entry back onto the list in the loop body will requeue it.
+ */
+#define list_for_each_entry_dequeue(pos, head, member) \
+   while (!list_empty(head) && \
+  (pos = list_first_entry(head, typeof(*pos), member)) &&  \
+  (list_del(>member), true))
+
 /*
  * Double linked lists with a single pointer list head.
  * Mostly useful for hash tables where the two pointer list head is
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 23b48de8c2e2..f9a037eb053d 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -293,15 +293,13 @@ NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net 
*net, struct sock *sk,
 struct list_head *head, struct net_device *in, struct net_device 
*out,
 int (*okfn)(struct net *, struct sock *, struct sk_buff *))
 {
-   struct sk_buff *skb, *next;
struct list_head sublist;
+   struct sk_buff *skb;
 
INIT_LIST_HEAD();
-   list_for_each_entry_safe(skb, next, head, list) {
-   list_del(>list);
+   list_for_each_entry_dequeue(skb, head, list)
if (nf_hook(pf, hook, net, sk, skb, in, out, okfn) == 1)
list_add_tail(>list, );
-   }
/* Put passed packets back on main list */
list_splice(, head);
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index ce4583564e00..68055f012d13 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4985,9 +4985,8 @@ static void netif_receive_skb_list_internal(struct 
list_head *head)
struct list_head sublist;
 
INIT_LIST_HEAD();
-   list_for_each_entry_safe(skb, next, head, list) {
+   list_for_each_entry_dequeue(skb, head, list) {
net_timestamp_check(netdev_tstamp_prequeue, skb);
-   list_del(>list);
if (!skb_defer_rx_timestamp(skb))
list_add_tail(>list, );
}
@@ -4996,9 +4995,8 @@ static void netif_receive_skb_list_internal(struct 
list_head *head)
if (static_branch_unlikely(_xdp_needed_key)) {
preempt_disable();
rcu_read_lock();
-   list_for_each_entry_safe(skb, next, head, list) {
+   list_for_each_entry_dequeue(skb, head, list) {
xdp_prog = rcu_dereference(skb->dev->xdp_prog);
-   list_del(>list);
if (do_xdp_generic(xdp_prog, skb) == XDP_PASS)
list_add_tail(>list, );
}
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 1a3b6f32b1c9..26285a24c067 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -538,14 +538,13 @@ static void ip_list_rcv_finish(struct net *net, struct 
sock *sk,
   struct list_head *head)
 {
struct dst_entry *curr_dst = NULL;
-   struct sk_buff *skb, *next;
struct list_head sublist;
+   struct sk_buff *skb;
 
INIT_LIST_HEAD();
-   list_for_each_entry_safe(skb, next, head, list) {
+   list_for_each_entry_dequeue(skb, head, list) {
struct dst_entry *dst;
 
-   list_del(>list);
/* if ingress device is enslaved to an L3 master device pass the
 * skb to its handler for processing
 */
@@ -584,15 +583,14 @@ void ip_list_rcv(struct list_head *head, struct 
packet_type *pt,
 {
struct net_device *curr_dev = NULL;
struct net *curr_net = NULL;
-   struct sk_buff *skb, *next;
struct list_head sublist;
+   struct sk_buff *skb;
 
INIT_LIST_HEAD();
- 

[PATCH net-next 1/3] net: core: fix uses-after-free in list processing

2018-07-06 Thread Edward Cree
In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(),
 do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb).  Thus,
 we cannot wait until after they return to remove the skb from the list;
 instead, we remove it first and, in the pass case, add it to a sublist
 afterwards.
In the case of enqueue_to_backlog() we have already decided not to pass
 when we call the function, so we do not need a sublist.

Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing")
Reported-by: Dan Carpenter 
Signed-off-by: Edward Cree 
---
 net/core/dev.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 89825c1eccdc..ce4583564e00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4982,25 +4982,30 @@ static void netif_receive_skb_list_internal(struct 
list_head *head)
 {
struct bpf_prog *xdp_prog = NULL;
struct sk_buff *skb, *next;
+   struct list_head sublist;
 
+   INIT_LIST_HEAD();
list_for_each_entry_safe(skb, next, head, list) {
net_timestamp_check(netdev_tstamp_prequeue, skb);
-   if (skb_defer_rx_timestamp(skb))
-   /* Handled, remove from list */
-   list_del(>list);
+   list_del(>list);
+   if (!skb_defer_rx_timestamp(skb))
+   list_add_tail(>list, );
}
+   list_splice_init(, head);
 
if (static_branch_unlikely(_xdp_needed_key)) {
preempt_disable();
rcu_read_lock();
list_for_each_entry_safe(skb, next, head, list) {
xdp_prog = rcu_dereference(skb->dev->xdp_prog);
-   if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
-   /* Dropped, remove from list */
-   list_del(>list);
+   list_del(>list);
+   if (do_xdp_generic(xdp_prog, skb) == XDP_PASS)
+   list_add_tail(>list, );
}
rcu_read_unlock();
preempt_enable();
+   /* Put passed packets back on main list */
+   list_splice_init(, head);
}
 
rcu_read_lock();
@@ -5011,9 +5016,9 @@ static void netif_receive_skb_list_internal(struct 
list_head *head)
int cpu = get_rps_cpu(skb->dev, skb, );
 
if (cpu >= 0) {
-   enqueue_to_backlog(skb, cpu, 
>last_qtail);
-   /* Handled, remove from list */
+   /* Will be handled, remove from list */
list_del(>list);
+   enqueue_to_backlog(skb, cpu, 
>last_qtail);
}
}
}



[PATCH net-next 2/3] netfilter: fix use-after-free in NF_HOOK_LIST

2018-07-06 Thread Edward Cree
nf_hook() can free the skb, so we need to remove it from the list before
 calling, and add passed skbs to a sublist afterwards.

Fixes: 17266ee93984 ("net: ipv4: listified version of ip_rcv")
Reported-by: Dan Carpenter 
Signed-off-by: Edward Cree 
---
 include/linux/netfilter.h | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 5a5e0a2ab2a3..23b48de8c2e2 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -294,12 +294,16 @@ NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net 
*net, struct sock *sk,
 int (*okfn)(struct net *, struct sock *, struct sk_buff *))
 {
struct sk_buff *skb, *next;
+   struct list_head sublist;
 
+   INIT_LIST_HEAD();
list_for_each_entry_safe(skb, next, head, list) {
-   int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
-   if (ret != 1)
-   list_del(>list);
+   list_del(>list);
+   if (nf_hook(pf, hook, net, sk, skb, in, out, okfn) == 1)
+   list_add_tail(>list, );
}
+   /* Put passed packets back on main list */
+   list_splice(, head);
 }
 
 /* Call setsockopt() */



[PATCH net-next 0/3] fix use-after-free bugs in skb list processing

2018-07-06 Thread Edward Cree
A couple of bugs in skb list handling were spotted by Dan Carpenter, with
 the help of Smatch; following up on them I found a couple more similar
 cases.  This series fixes them by changing the relevant loops to use the
 dequeue-enqueue model (rather than in-place list modification), and then
 adds a list.h helper macro to refactor code using the dequeue-enqueue
 model.

Edward Cree (3):
  net: core: fix uses-after-free in list processing
  netfilter: fix use-after-free in NF_HOOK_LIST
  net: refactor dequeue-model list processing

 include/linux/list.h  | 15 +++
 include/linux/netfilter.h | 16 +---
 net/core/dev.c| 23 +--
 net/ipv4/ip_input.c   | 10 --
 net/ipv6/ip6_input.c  | 10 --
 5 files changed, 45 insertions(+), 29 deletions(-)



Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Alexei Starovoitov
On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> 
> I'm also not 100% on board with the argument that "future" FW can
> reshuffle things whatever way it wants to.  Is the assumption that
> future ASICs/FW will be designed to always use the "blessed" BTF
> format?  Or will it be reconfigurable at runtime?

let's table configuration of metadata aside for a second.

Describing metedata layout in BTF allows NICs to disclose everything
NIC has to users in a standard and generic way.
Whether firmware is reconfigurable on the fly or has to reflashed
and hw powercycled to have new md layout (and corresponding BTF description)
is a separate discussion.
Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
to reach 'hash' value in metadata.
Essentially it's a run-time way to access 'hash' instead of build-time.
So bpf program would need two loads to read csum or hash field instead of one.
With BTF the layout of metadata is known to the program at build-time.

To reiterate the proposal:
- driver+firmware keep layout of the metadata in BTF format (either in the 
driver
  or driver can read it from firmware)
- 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver 
and
  generate normal C header file based on BTF in the given NIC
- user does #include "md_desc.h" and bpf program can access md->csum or md->hash
  with direct single load out of metadata area in front of the packet
- llvm compiles bpf program and records how program is doing this md->csum 
accesses
  in BTF format as well (the compiler will be keeping such records
  for __sk_buff and all other structs too, but that's separate discussion)
- during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way 
the program
  accesses metadata (and other structs) matches BTF from the driver,
  so no surprises if driver+firmware got updated, but program is not recompiled
- every NIC can have their own layout of metadata and its own meaning of the 
fields,
  but would be good to standardize at least a few common fields like hash

Once this is working we can do more cool things with BTF.
Like doing offset rewriting at program load time similar to what we plan
to do for tracing. Tracing programs will be doing 'task->pid' access
and the kernel will adjust offsetof(struct task_struct, pid) during program load
depending on BTF for the kernel.
The same trick we can do for networking metadata.
The program will contain load instruction md->hash that will get automatically
adjusted to proper offset depending on BTF of 'hash' field in the NIC.
For now I'm proposing _not_ to go that far with offset rewriting and start
with simple steps described above.



Re: [PATCH iproute2-next] tc: m_tunnel_key: Add tunnel option support to act_tunnel_key

2018-07-06 Thread David Ahern
On 7/5/18 6:12 PM, Jakub Kicinski wrote:
> From: Simon Horman 
> 
> Allow setting tunnel options using the act_tunnel_key action.
> 
> Options are expressed as class:type:data and multiple options
> may be listed using a comma delimiter.
> 
>  # ip link add name geneve0 type geneve dstport 0 external
>  # tc qdisc add dev eth0 ingress
>  # tc filter add dev eth0 protocol ip parent : \
>  flower indev eth0 \
> ip_proto udp \
> action tunnel_key \
> set src_ip 10.0.99.192 \
> dst_ip 10.0.99.193 \
> dst_port 6081 \
> id 11 \
> geneve_opts 0102:80:00800022,0102:80:00800022 \
> action mirred egress redirect dev geneve0
> 
> Signed-off-by: Simon Horman 
> Signed-off-by: Pieter Jansen van Vuuren 
> Reviewed-by: Jakub Kicinski 
> ---
>  man/man8/tc-tunnel_key.8 |  12 ++-
>  tc/m_tunnel_key.c| 177 +++
>  2 files changed, 188 insertions(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks




Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread Sabrina Dubroca
2018-07-06, 09:28:48 -0600, David Ahern wrote:
> On 7/6/18 9:02 AM, Sabrina Dubroca wrote:
> > 2018-07-06, 08:42:01 -0600, David Ahern wrote:
> >> On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> >>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> >>> index 91580c62bb86..e9ba53d2a147 100644
> >>> --- a/net/ipv6/addrconf.c
> >>> +++ b/net/ipv6/addrconf.c
> >>> @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> >>> ctl_table *ctl, int write,
> >>>loff_t *ppos)
> >>>  {
> >>>   int ret = 0;
> >>> - int new_val;
> >>> + u32 new_val;
> >>>   struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
> >>>   struct net *net = (struct net *)ctl->extra2;
> >>> + struct ctl_table tmp = {
> >>> + .data = _val,
> >>> + .maxlen = sizeof(new_val),
> >>> + .mode = ctl->mode,
> >>> + };
> >>>  
> >>>   if (!rtnl_trylock())
> >>>   return restart_syscall();
> >>>  
> >>> - ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> >>> + new_val = *((u32 *)ctl->data);
> >>>  
> >>> - if (write) {
> >>> - new_val = *((int *)ctl->data);
> >>> + ret = proc_douintvec(, write, buffer, lenp, ppos);
> >>> + if (ret != 0)
> >>> + goto out;
> >>>  
> >>> + if (write) {
> >>>   if (check_addr_gen_mode(new_val) < 0) {
> >>>   ret = -EINVAL;
> >>>   goto out;
> >>>   }
> >>>  
> >>> - /* request for default */
> >>> - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
> >>> - ipv6_devconf_dflt.addr_gen_mode = new_val;
> >>
> >> updating the template is wrong, but you still need to update the
> >> namespace's default value for new devices.
> > 
> > That's already handled by storing new_val into ctl->data at the end of
> > the 'write' block.
> 
> ok. missed that. It's part of your change below.
> 
> > 
> > BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so
> > that people aren't tempted to update the template, but I'm thinking of
> > doing that in net-next rather than net.
> > 
> >> also, if you are fixing this it would be good to handle the change to
> >> 'all' as well and update all existing devices.
> > 
> > I thought about it, and wasn't sure if that change of behavior was
> > acceptable, especially for stable (I think the current patch should go
> > into stable). OTOH, it's quite clearly what "all" should do.
> 
> IMHO it's a bug that changing 'all' does not actually change all
> existing devices nor is it ever used.

Right. I'll add that as a separate patch in this series, unless you
really prefer the change squashed into this patch.


> Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is
> messed up as well. It propagates a change to 'default' to all existing
> devices.

I guess it was intentional, given:

if (>ipv6.devconf_all->stable_secret == ctl->data)
return -EIO;

It only propagates the mode, and not the secret itself, to all
devices. After thinking about it for a while, I guess it considers the
new default not only as default for newly created devices, but also
for newly added addresses/prefixes.
Or am I making stuff up?

-- 
Sabrina


Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-06 Thread David Ahern
On 7/5/18 4:42 PM, Jesus Sanchez-Palencia wrote:

> +static int get_clockid(__s32 *val, const char *arg)
> +{
> + const struct static_clockid {
> + const char *name;
> + clockid_t clockid;
> + } clockids_sysv[] = {
> + { "CLOCK_REALTIME", CLOCK_REALTIME },
> + { "CLOCK_TAI", CLOCK_TAI },
> + { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
> + { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
> + { NULL }
> + };
> +
> + const struct static_clockid *c;
> +
> + for (c = clockids_sysv; c->name; c++) {
> + if (strncasecmp(c->name, arg, 25) == 0) {

Why 25?

be nice to allow shortcuts -- e.g., just REALTIME or realtime.

> + *val = c->clockid;
> +
> + return 0;
> + }
> + }
> +
> + return -1;
> +}
> +
> +
> +static int etf_parse_opt(struct qdisc_util *qu, int argc,
> +  char **argv, struct nlmsghdr *n, const char *dev)
> +{
> + struct tc_etf_qopt opt = {
> + .clockid = CLOCKID_INVALID,
> + };
> + struct rtattr *tail;
> +
> + while (argc > 0) {
> + if (matches(*argv, "offload") == 0) {
> + if (opt.flags & TC_ETF_OFFLOAD_ON) {
> + fprintf(stderr, "etf: duplicate \"offload\" 
> specification\n");
> + return -1;
> + }
> +
> + opt.flags |= TC_ETF_OFFLOAD_ON;
> + } else if (matches(*argv, "deadline_mode") == 0) {
> + if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
> + fprintf(stderr, "etf: duplicate 
> \"deadline_mode\" specification\n");
> + return -1;
> + }
> +
> + opt.flags |= TC_ETF_DEADLINE_MODE_ON;
> + } else if (matches(*argv, "delta") == 0) {
> + NEXT_ARG();
> + if (opt.delta) {
> + fprintf(stderr, "etf: duplicate \"delta\" 
> specification\n");
> + return -1;
> + }
> + if (get_s32(, *argv, 0)) {
> + explain1("delta", *argv);
> + return -1;
> + }
> + } else if (matches(*argv, "clockid") == 0) {
> + NEXT_ARG();
> + if (opt.clockid != CLOCKID_INVALID) {
> + fprintf(stderr, "etf: duplicate \"clockid\" 
> specification\n");
> + return -1;
> + }
> + if (get_clockid(, *argv)) {
> + explain_clockid(*argv);
> + return -1;
> + }
> + } else if (strcmp(*argv, "help") == 0) {
> + explain();
> + return -1;
> + } else {
> + fprintf(stderr, "etf: unknown parameter \"%s\"\n", 
> *argv);
> + explain();
> + return -1;
> + }
> + argc--; argv++;
> + }
> +
> + tail = NLMSG_TAIL(n);
> + addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
> + addattr_l(n, 2024, TCA_ETF_PARMS, , sizeof(opt));
> + tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
> + return 0;
> +}
> +
> +static int etf_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
> +{
> + struct rtattr *tb[TCA_ETF_MAX+1];
> + struct tc_etf_qopt *qopt;
> +
> + if (opt == NULL)
> + return 0;
> +
> + parse_rtattr_nested(tb, TCA_ETF_MAX, opt);
> +
> + if (tb[TCA_ETF_PARMS] == NULL)
> + return -1;
> +
> + qopt = RTA_DATA(tb[TCA_ETF_PARMS]);
> + if (RTA_PAYLOAD(tb[TCA_ETF_PARMS])  < sizeof(*qopt))
> + return -1;
> +
> + if (qopt->clockid == CLOCKID_INVALID)
> + print_string(PRINT_ANY, "clockid", "clockid %s ", "invalid");
> + else
> + print_uint(PRINT_ANY, "clockid", "clockid %d ", qopt->clockid);

If you allow the user to input a string, then you should return it here too.



Re: [PATCH bpf-next 11/11] tools: bpftool: allow reuse of maps with bpftool prog load

2018-07-06 Thread Jakub Kicinski
On Fri, 6 Jul 2018 09:16:24 +0200, Daniel Borkmann wrote:
> On 07/06/2018 12:57 AM, Jakub Kicinski wrote:
> > On Thu, 5 Jul 2018 10:35:24 +0200, Daniel Borkmann wrote:  
> >> On 07/04/2018 04:54 AM, Jakub Kicinski wrote:  
> >>> Add map parameter to prog load which will allow reuse of existing
> >>> maps instead of creating new ones.
> >>>
> >>> Signed-off-by: Jakub Kicinski 
> >>> Reviewed-by: Quentin Monnet 
> >> [...]  
> >>> +
> >>> + fd = map_parse_fd(, );
> >>> + if (fd < 0)
> >>> + goto err_free_reuse_maps;
> >>> +
> >>> + map_replace = reallocarray(map_replace, old_map_fds + 1,
> >>> +sizeof(*map_replace));
> >>> + if (!map_replace) {
> >>> + p_err("mem alloc failed");
> >>> + goto err_free_reuse_maps;
> >>
> >> Series in general looks good to me. However, above reallocarray() doesn't
> >> exist and hence build fails, please see below. Is that from newest glibc?
> >>
> >> You probably need some fallback implementation or in general have something
> >> bpftool internal that doesn't make a bet on its availability.
> >>
> >> # make
> >>
> >> Auto-detecting system features:
> >> ...libbfd: [ on  ]
> >> ...disassembler-four-args: [ OFF ]
> >>
> >>   CC   bpf_jit_disasm.o
> >>   LINK bpf_jit_disasm
> >>   CC   bpf_dbg.o
> >>   LINK bpf_dbg
> >>   CC   bpf_asm.o
> >>   BISONbpf_exp.yacc.c
> >>   CC   bpf_exp.yacc.o
> >>   FLEX bpf_exp.lex.c
> >>   CC   bpf_exp.lex.o
> >>   LINK bpf_asm
> >>   DESCEND  bpftool
> >>
> >> Auto-detecting system features:
> y>> ...libbfd: [ on  ]
> >> ...disassembler-four-args: [ OFF ]
> >>
> >>   CC   map_perf_ring.o
> >>   CC   xlated_dumper.o
> >>   CC   perf.o
> >>   CC   prog.o
> >> prog.c: In function ‘do_load’:
> >> prog.c:785:18: warning: implicit declaration of function ‘reallocarray’ 
> >> [-Wimplicit-function-declaration]
> >> map_replace = reallocarray(map_replace, old_map_fds + 1,
> >>   ^~~~
> >> prog.c:785:16: warning: assignment makes pointer from integer without a 
> >> cast [-Wint-conversion]
> >> map_replace = reallocarray(map_replace, old_map_fds + 1,
> >> ^
> >>   CC   common.o
> >>   CC   cgroup.o
> >>   CC   main.o
> >>   CC   json_writer.o
> >>   CC   cfg.o
> >>   CC   map.o
> >>   CC   jit_disasm.o
> >>   CC   disasm.o
> >>
> >> Auto-detecting system features:
> >> ...libelf: [ on  ]
> >> ...   bpf: [ on  ]
> >>
> >> Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs 
> >> from latest version at 'include/uapi/linux/bpf.h'
> >>   CC   libbpf.o
> >>   CC   bpf.o
> >>   CC   nlattr.o
> >>   CC   btf.o
> >>   LD   libbpf-in.o
> >>   LINK libbpf.a
> >>   LINK bpftool
> >> prog.o: In function `do_load':
> >> prog.c:(.text+0x23d): undefined reference to `reallocarray'
> >> collect2: error: ld returned 1 exit status
> >> Makefile:89: recipe for target 'bpftool' failed
> >> make[1]: *** [bpftool] Error 1
> >> Makefile:99: recipe for target 'bpftool' failed
> >> make: *** [bpftool] Error 2  
> > 
> > Oh no..  Sorry & thanks for catching this.  It would be nice to not
> > depend on Glibc version defines, in case someone is using a different
> > library.  Jiong suggested we can just use the feature detection, so I
> > have something like this:  
> 
> Yeah, that would be okay to do if you want to go that route, sure. Other 
> option
> I had in mind would have been to import include/linux/overflow.h into the
> tools/include/linux/ and have a minor wrapper similar to kmalloc_array() in a
> utils.h in bpftool to get to the same for all users. But I think feature test
> is totally fine too, and in general some form of reallocarray() would be good
> to have rather than plain realloc().
> 
> So, below complies for me. Although don't we need to define a CFLAG based on
> the outcome of the test similar as in feature-disassembler-four-args? 
> Otherwise
> with the below diff the test doesn't really do much, no? Meaning, adding a ...
> 
>   ifeq ($(feature-reallocarray), 1)
>   CFLAGS += -DHAVE_REALLOCARRAY
>   endif
> 
> ... to the Makefile and in compat.h having it enabled through:
> 
>   #ifndef HAVE_REALLOCARRAY
>   static inline void *reallocarray(void *ptr, size_t nmemb, size_t size)
>   {
>   return realloc(ptr, nmemb * size);
>   }
>   #endif

Ugh, you're very right, reallocarray is a weak alias in glibc, which is
probably why I didn't see any errors.

> In any case, we could use reallocarray() also in couple of other places in
> bpftool and libbpf.

I'll look into it, too.


Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread David Ahern
On 7/6/18 9:02 AM, Sabrina Dubroca wrote:
> 2018-07-06, 08:42:01 -0600, David Ahern wrote:
>> On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
>>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>>> index 91580c62bb86..e9ba53d2a147 100644
>>> --- a/net/ipv6/addrconf.c
>>> +++ b/net/ipv6/addrconf.c
>>> @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
>>> ctl_table *ctl, int write,
>>>  loff_t *ppos)
>>>  {
>>> int ret = 0;
>>> -   int new_val;
>>> +   u32 new_val;
>>> struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
>>> struct net *net = (struct net *)ctl->extra2;
>>> +   struct ctl_table tmp = {
>>> +   .data = _val,
>>> +   .maxlen = sizeof(new_val),
>>> +   .mode = ctl->mode,
>>> +   };
>>>  
>>> if (!rtnl_trylock())
>>> return restart_syscall();
>>>  
>>> -   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>>> +   new_val = *((u32 *)ctl->data);
>>>  
>>> -   if (write) {
>>> -   new_val = *((int *)ctl->data);
>>> +   ret = proc_douintvec(, write, buffer, lenp, ppos);
>>> +   if (ret != 0)
>>> +   goto out;
>>>  
>>> +   if (write) {
>>> if (check_addr_gen_mode(new_val) < 0) {
>>> ret = -EINVAL;
>>> goto out;
>>> }
>>>  
>>> -   /* request for default */
>>> -   if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
>>> -   ipv6_devconf_dflt.addr_gen_mode = new_val;
>>
>> updating the template is wrong, but you still need to update the
>> namespace's default value for new devices.
> 
> That's already handled by storing new_val into ctl->data at the end of
> the 'write' block.

ok. missed that. It's part of your change below.

> 
> BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so
> that people aren't tempted to update the template, but I'm thinking of
> doing that in net-next rather than net.
> 
>> also, if you are fixing this it would be good to handle the change to
>> 'all' as well and update all existing devices.
> 
> I thought about it, and wasn't sure if that change of behavior was
> acceptable, especially for stable (I think the current patch should go
> into stable). OTOH, it's quite clearly what "all" should do.

IMHO it's a bug that changing 'all' does not actually change all
existing devices nor is it ever used.

Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is
messed up as well. It propagates a change to 'default' to all existing
devices.


Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 23:08), Ka-Cheong Poon wrote:
> 
> As mentioned in a previous mail, it is unclear why the
> port number is transport specific.  Most Internet services
> use the same port number running over TCP/UDP as shown
> in the IANA database.  And the IANA RDS registration is
> the same.  What is the rationale of having a transport
> specific number in the RDS implementation?

because not every transport may need a port number.

e.g., "RDS over pigeon carrier" may not need a port number.

in a different design (e.g., KCM) the listen port is configurable
with sysctl

Some may need more than one, e.g., rds_rdma, evidently.

What is the problem with using the unused 18635 for the RDS_CM_PORT?






Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 22:36), Ka-Cheong Poon wrote:
> This patch series does not change existing behavior.  But
> I think this is a strange RDS semantics as it differs from
> other types of socket.  But this is not about IPv6 support
> and can be dealt with later.

sure, 

> >   Since we are choosing to treat this behavior as a feature we
> >   need to be consistent in rds_getname(). If we find
> >   (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound,
> >   the returned address (address size, address family) should be based
> >   on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a
> >   bind(), and try to do a getsockname(), I get back a sockaddr_in?!)
> 
> 
> OK, will change that.
> >- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
> >   As DaveM has already pointed out, we should be using sa_family to figure
> >   out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
> >   len first, and then the family- that way wont work if I pass in
> >   sockaddr_storage).  E.g., see inet_dgram_connect.
> >
> > if (addr_len < sizeof(uaddr->sa_family))
> > return -EINVAL;
> OK, will change that.

thank you

> >- In net/rds/rds.h;
> >
> >   /* The following ports, 16385, 18634, 18635, are registered with IANA as
> >* the ports to be used for RDS over TCP and UDP.  18634 is the historical
> >
> >   What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP.
> >   IN fact RDS has nothing to do with UDP. The comment is confused. See next
> >   item below, where the comment disappears.
> 
> 
> As mentioned in the above comments, they are registered with IANA.

16385 is registered for  RDS-TCP. 

what does it mean to run rds-tcp and rds_rdma with the CM at the same time? 
Is this possible?  (if it is, why not poach some other number like 2049
from NFS?!)

18635 is actually not used in the RDS code.
Why can you not use that for RDS_CM_PORT? 

In general, please do NOT pull up these definitions into net/rds/rds.h
They may change in the future- and the really belong in the transport
specific header file - you question this further below, see my answer
there.

> There is no RDS/UDP in Linux but the port numbers are registered
> nevertheless.  And RDS can work over UDP AFAICT.  BTW, it is really

No it cannot. RDS needs a reliable transport.  If you start using
UDP (or pigeon-carrier) for the transport you have to build the
reliability yourself.

The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP
case they do their own congestion management in userspace, and 
history has shown that it is hard to get that right, thus the desire to
switch to rds-tcp).

Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP",
that is probably the most harmless one to use for this case, because
- it is unused, 
- it calls itself rds-Infiniband,  
- the likelihood of doing rds-udp is infinitesmally small (it is more likely
  that we will need to send packets from abc::1 -> fe80:2 before we need
  rds-udp :-) :-) :-))
- and if rds-udp happens, we can use 16385/udp for rds-udp

so please use 18635 for your RDS_CM_PORT and move it to IB specific
header files and lets converge this thread.

Towing RDS_TCP_PORT around has absolutely nothing to do with your
ipv6 patch set.

> strange that RDS/TCP does not use the port number already registered.
> Anyway, the above comments are just a note on this fact in the IANA

16385 was in defacto use for a long time (since 2.6 kernels), thus I
registered it as well when I started fixing this up. 

the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp.
They are available for your use.

> database.  The following is a link to the IANA assignment.

yes, I am aware

> IMHO, there should only be RDS_PORT in the first place.  And it
> can be used for all transports.  For example, if RDS/SCTP is added,
> it can also use the same port number.  There is no need to define

if/when we add rds/sctp, we shall keep that in mind. 

This is getting to be "arguing for the sake of arguing". I dont have
the time for that.

> a new port number for each transport.  What is the rationale that
> each transport needs to have its own number and be defined in its
> own header file?

Some transports may not even need a port number. Some may need
several. Some may use sysctl (suggested by Tom Herbert) to make
this configurable. Until recently, we had rds/iWARP, that may need
its own transport hooks, it does not make sense to peanut-butter
that into the core module.

That is why it has to be in the transport. I hope that answers the
question.

> If the behavior of a software component is modified/enhanced such
> that the existing API of this component has problems dealing with
> this new behavior, should the API be modified or a new API be added
> to handle that?  If neither is done, shouldn't this be considered
> a bug?

whatever. The design (parallelization for perf) is fine. Some
parts of 

Re: [PATCH iproute2-next] tc: m_tunnel_key: Add tunnel option support to act_tunnel_key

2018-07-06 Thread Jakub Kicinski
On Fri, Jul 6, 2018 at 6:32 AM, Roman Mashak  wrote:
> Jakub Kicinski  writes:
>
>> From: Simon Horman 
>>
>> Allow setting tunnel options using the act_tunnel_key action.
>>
>> Options are expressed as class:type:data and multiple options
>> may be listed using a comma delimiter.
>>
>>  # ip link add name geneve0 type geneve dstport 0 external
>>  # tc qdisc add dev eth0 ingress
>>  # tc filter add dev eth0 protocol ip parent : \
>>  flower indev eth0 \
>> ip_proto udp \
>> action tunnel_key \
>> set src_ip 10.0.99.192 \
>> dst_ip 10.0.99.193 \
>> dst_port 6081 \
>> id 11 \
>> geneve_opts 0102:80:00800022,0102:80:00800022 \
>> action mirred egress redirect dev geneve0
>
> [...]
>
> Jakub, could you also add relevant tests for the new option in file
> $(kernel)/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json?

We'll look into it!


Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread Ka-Cheong Poon

On 07/06/2018 06:15 PM, Sowmini Varadhan wrote:

On (07/06/18 17:08), Ka-Cheong Poon wrote:

Hmm. Why do you need to include tcp header in ib transport
code ? If there is any common function just move to core
common file and use it.


I think it can be removed as it is left over from earlier
changes when the IB IPv6 listener port was RDS_TCP_PORT.
Now all the port definitions are in rds.h.


Transport specific information such as port numbers should not
be smeared into the common rds-module.  Please fix that.

If IB is also piggybacking on port 16385 (why?), please use your
own definitions for it in IB specific header files.



As mentioned in a previous mail, it is unclear why the
port number is transport specific.  Most Internet services
use the same port number running over TCP/UDP as shown
in the IANA database.  And the IANA RDS registration is
the same.  What is the rationale of having a transport
specific number in the RDS implementation?


--
K. Poon
ka-cheong.p...@oracle.com




Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread Sabrina Dubroca
2018-07-06, 08:42:01 -0600, David Ahern wrote:
> On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> > index 91580c62bb86..e9ba53d2a147 100644
> > --- a/net/ipv6/addrconf.c
> > +++ b/net/ipv6/addrconf.c
> > @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> > ctl_table *ctl, int write,
> >  loff_t *ppos)
> >  {
> > int ret = 0;
> > -   int new_val;
> > +   u32 new_val;
> > struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
> > struct net *net = (struct net *)ctl->extra2;
> > +   struct ctl_table tmp = {
> > +   .data = _val,
> > +   .maxlen = sizeof(new_val),
> > +   .mode = ctl->mode,
> > +   };
> >  
> > if (!rtnl_trylock())
> > return restart_syscall();
> >  
> > -   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> > +   new_val = *((u32 *)ctl->data);
> >  
> > -   if (write) {
> > -   new_val = *((int *)ctl->data);
> > +   ret = proc_douintvec(, write, buffer, lenp, ppos);
> > +   if (ret != 0)
> > +   goto out;
> >  
> > +   if (write) {
> > if (check_addr_gen_mode(new_val) < 0) {
> > ret = -EINVAL;
> > goto out;
> > }
> >  
> > -   /* request for default */
> > -   if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
> > -   ipv6_devconf_dflt.addr_gen_mode = new_val;
> 
> updating the template is wrong, but you still need to update the
> namespace's default value for new devices.

That's already handled by storing new_val into ctl->data at the end of
the 'write' block.

BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so
that people aren't tempted to update the template, but I'm thinking of
doing that in net-next rather than net.

> also, if you are fixing this it would be good to handle the change to
> 'all' as well and update all existing devices.

I thought about it, and wasn't sure if that change of behavior was
acceptable, especially for stable (I think the current patch should go
into stable). OTOH, it's quite clearly what "all" should do.

> > -
> > -   /* request for individual net device */
> > -   } else {
> > -   if (!idev)
> > -   goto out;
> > -
> > +   if (idev) {
> > if (check_stable_privacy(idev, net, new_val) < 0) {
> > ret = -EINVAL;
> > goto out;
> > @@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> > ctl_table *ctl, int write,
> > addrconf_dev_config(idev->dev);
> > }
> > }
> > +
> > +   *((u32 *)ctl->data) = new_val;
> > }
> >  
> >  out:
> > 
> 

-- 
Sabrina


Re: [PATCH iproute2 net-next] bridge: add support for isolated option

2018-07-06 Thread David Ahern
On 7/3/18 6:42 AM, Nikolay Aleksandrov wrote:
> This patch adds support for the new isolated port option which, if set,
> would allow the isolated ports to communicate only with non-isolated
> ports and the bridge device. The option can be set via the bridge or ip
> link type bridge_slave commands, e.g.:
> $ ip link set dev eth0 type bridge_slave isolated on
> $ bridge link set dev eth0 isolated on
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  bridge/link.c| 11 +++
>  ip/iplink_bridge_slave.c |  9 +
>  man/man8/bridge.8|  6 ++
>  man/man8/ip-link.8.in|  6 --
>  4 files changed, 30 insertions(+), 2 deletions(-)
> 

applied to iproute2-next. Thanks




Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread santosh.shilim...@oracle.com

On 7/6/18 2:08 AM, Ka-Cheong Poon wrote:

On 07/06/2018 01:58 AM, Santosh Shilimkar wrote:






diff --git a/net/rds/connection.c b/net/rds/connection.c
index abef75d..ca72563 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c


@@ -142,9 +151,12 @@ static void __rds_conn_path_init(struct 
rds_connection *conn,

   * are torn down as the module is removed, if ever.
   */
  static struct rds_connection *__rds_conn_create(struct net *net,
-    __be32 laddr, __be32 faddr,
-   struct rds_transport *trans, gfp_t gfp,
-   int is_outgoing)
+    const struct in6_addr *laddr,
+    const struct in6_addr *faddr,
+    struct rds_transport *trans,
+    gfp_t gfp,
+    int is_outgoing,
+    int dev_if)

Am just wondering if we can handle local link address case differently
instead of pushing the interface index everywhere. Did you think about
any alternative. This can also avoid bunch of changes again and hence
the question. Am trying to see if we can minimize the changes to
absolute must have to support IPv6.



If link local address is supported, scoped ID must be used.
Unless we remove the support of link local address, we need
to keep scope ID.


Ok.




diff --git a/net/rds/ib.h b/net/rds/ib.h
index a6f4d7d..12f96b3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h



+union rds_ib_conn_priv {
+    struct rds_ib_connect_private    ricp_v4;
+    struct rds6_ib_connect_private    ricp_v6;
  };

This change was invetiable. Just add a comment saying the priviate
data size for v6 vs v4 is  different. Also for IPv6 priviate data,
I suggest add some resrve fields for any future extensions so
that things can be added without breaking wire protocol. With IPv4,
we ended up in bit of mess.



Will add some comments.  Unfortunately the IB private data
exchange has only a limited space.  I don't think there is
any more space left for reserved field.  I think we should
think of a different way to handle extensions in future.


There is enough space. You can send upto 512 bytes iirc. Please
add some reserve fields and ping me if you run into issues.


diff --git a/net/rds/send.c b/net/rds/send.c
index 94c7f74..6ed2e92 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c


@@ -1081,27 +1085,59 @@ int rds_sendmsg(struct socket *sock, struct 
msghdr *msg, size_t payload_len)

  goto out;
  }
-    if (msg->msg_namelen) {
-    /* XXX fail non-unicast destination IPs? */
-    if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != 
AF_INET) {

+    namelen = msg->msg_namelen;
+    if (namelen != 0) {
+    if (namelen < sizeof(*usin)) {
+    ret = -EINVAL;
+    goto out;
+    }
+    switch (namelen) {
+    case sizeof(*usin):
+    if (usin->sin_family != AF_INET ||
+    usin->sin_addr.s_addr == htonl(INADDR_ANY) ||
+    usin->sin_addr.s_addr == htonl(INADDR_BROADCAST) ||
+    IN_MULTICAST(ntohl(usin->sin_addr.s_addr))) {
+    ret = -EINVAL;

The idea was to fail non-unicast IP someday but can you not add this
change as part of v6 series. We can look at it later unless you need
this change for v6. Again the question is mainly to add only necessary
check and leave the existing behavior as is so please re-check all below
checks too.



This is to match the same IPv6 check.  In IPv6 code, it checks
if the address is a unicast address.  I can remove the IPv4
checks but if an app does send to a multicast address, the
connection will be stuck forever.


OK. Lets keep it then.




diff --git a/net/rds/transport.c b/net/rds/transport.c
index 0b188dd..c9788db 100644
--- a/net/rds/transport.c
+++ b/net/rds/transport.c



+    if (ipv6_addr_v4mapped(addr)) {


Dave already commented on ipv6_addr_v4mapped(). Apart from
some of those comments questions, rest of the patch looks
really good and nicely done. Will also look at your
subsequent two patches and try to send you comments in next
few days.



Do you mean using mapped address to create IPv4 connections?
I already have it in my work space.  Will be in v3.


yeah. Thanks !!


Re: [PATCH net 3/3] net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE

2018-07-06 Thread David Ahern
On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> inet6_ifla6_size() is called to check how much space is needed by
> inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include
> the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it.
> 
> Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes")
> Signed-off-by: Sabrina Dubroca 
> ---
>  net/ipv6/addrconf.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH net 2/3] net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev

2018-07-06 Thread David Ahern
On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> The value has already been copied from this netns's devconf_dflt, it
> shouldn't be reset to the global kernel default.
> 
> Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
> generation mode")
> Signed-off-by: Sabrina Dubroca 
> ---
>  net/ipv6/addrconf.c | 2 --
>  1 file changed, 2 deletions(-)
> 

Reviewed-by: David Ahern 



Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread David Ahern
On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 91580c62bb86..e9ba53d2a147 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> ctl_table *ctl, int write,
>loff_t *ppos)
>  {
>   int ret = 0;
> - int new_val;
> + u32 new_val;
>   struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
>   struct net *net = (struct net *)ctl->extra2;
> + struct ctl_table tmp = {
> + .data = _val,
> + .maxlen = sizeof(new_val),
> + .mode = ctl->mode,
> + };
>  
>   if (!rtnl_trylock())
>   return restart_syscall();
>  
> - ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> + new_val = *((u32 *)ctl->data);
>  
> - if (write) {
> - new_val = *((int *)ctl->data);
> + ret = proc_douintvec(, write, buffer, lenp, ppos);
> + if (ret != 0)
> + goto out;
>  
> + if (write) {
>   if (check_addr_gen_mode(new_val) < 0) {
>   ret = -EINVAL;
>   goto out;
>   }
>  
> - /* request for default */
> - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
> - ipv6_devconf_dflt.addr_gen_mode = new_val;

updating the template is wrong, but you still need to update the
namespace's default value for new devices.

also, if you are fixing this it would be good to handle the change to
'all' as well and update all existing devices.

> -
> - /* request for individual net device */
> - } else {
> - if (!idev)
> - goto out;
> -
> + if (idev) {
>   if (check_stable_privacy(idev, net, new_val) < 0) {
>   ret = -EINVAL;
>   goto out;
> @@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> ctl_table *ctl, int write,
>   addrconf_dev_config(idev->dev);
>   }
>   }
> +
> + *((u32 *)ctl->data) = new_val;
>   }
>  
>  out:
> 



Re: [RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Eric Dumazet



On 07/06/2018 06:56 AM, Paolo Abeni wrote:
> With:
> 
> schedule_work_on(smp_processor_id(), #... )
> 
> We can be sure to run exclusively on the cpu handling the RX queue even with 
> the worker.
> 

Make sure to test your patch with 16 RX queues (16 cpus) feeding the defrag 
unit.

In this (common) mode, then having one cpu trying to purge frags wont be enough.

Really the GC can not cope with DDOS, we tried that in the past (IPv4 route 
cache)
and failed miserably.



Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-06 Thread Ka-Cheong Poon

On 07/06/2018 01:44 AM, Sowmini Varadhan wrote:


- rds_getname(): one of the existing properties of PF_RDS is that a
   connect() does not involve an implicit bind(). Note that you are basing
   the changes in rds_bind() on this behavior, thus I guess we make the
   choice of treating this as a feature, not a bug.



This patch series does not change existing behavior.  But
I think this is a strange RDS semantics as it differs from
other types of socket.  But this is not about IPv6 support
and can be dealt with later.



   Since we are choosing to treat this behavior as a feature we
   need to be consistent in rds_getname(). If we find
   (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound,
   the returned address (address size, address family) should be based
   on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a
   bind(), and try to do a getsockname(), I get back a sockaddr_in?!)



OK, will change that.



- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
   As DaveM has already pointed out, we should be using sa_family to figure
   out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
   len first, and then the family- that way wont work if I pass in
   sockaddr_storage).  E.g., see inet_dgram_connect.

 if (addr_len < sizeof(uaddr->sa_family))
 return -EINVAL;



OK, will change that.



- In net/rds/rds.h;

   /* The following ports, 16385, 18634, 18635, are registered with IANA as
* the ports to be used for RDS over TCP and UDP.  18634 is the historical

   What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP.
   IN fact RDS has nothing to do with UDP. The comment is confused. See next
   item below, where the comment disappears.



As mentioned in the above comments, they are registered with IANA.
There is no RDS/UDP in Linux but the port numbers are registered
nevertheless.  And RDS can work over UDP AFAICT.  BTW, it is really
strange that RDS/TCP does not use the port number already registered.
Anyway, the above comments are just a note on this fact in the IANA
database.  The following is a link to the IANA assignment.

https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt



- Also in net/rds/rds.h
   Please dont define transport specific parameters like RD_CM_PORT in the
   common rds.h header. It is unfortunate that we already have RDS_PORT there,
   and we should try to clean that up as well. NOte that RDS_TCP_PORT
   is now in the correct transport-module-specific header (net/rds/tcp.h)
   and its unclean to drag it from there, into the common header as you are
   doing.

   In fact I just tried to move the RDS_PORT definition into
   net/rds/rdma_transport.h and it built just-fine. So to summarize,
   please do the following:
   1. move RDS_PORT into rdma_transport.h
   2. add RDS_CM_PORT into rdma_transport.h
   3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h
  to net/rds/rds.h



IMHO, there should only be RDS_PORT in the first place.  And it
can be used for all transports.  For example, if RDS/SCTP is added,
it can also use the same port number.  There is no need to define
a new port number for each transport.  What is the rationale that
each transport needs to have its own number and be defined in its
own header file?



- net/rds/connection.c
   As we have discussed offline before, the fact that we cannot report
   TCP seq# etc info via the existing rds-info API is not "a bug in the
   design of MPRDS" but rather a lacking in the API design. Moreover,
   much of the useful information around the TCP socket is already
   available via procfs, TCP_INFO etc, so the info by rds-info is rarely
   used for rds-tcp- the more useful information is around the RDS socket
   itself.  So there is a bug in the comment, would be nice if you removed it.



If the behavior of a software component is modified/enhanced such
that the existing API of this component has problems dealing with
this new behavior, should the API be modified or a new API be added
to handle that?  If neither is done, shouldn't this be considered
a bug?



   Also, while you are there, s/exisiting/existing, please.



OK, with change that.



General comments:
-
I remain unconvinced by your global <-> link-local arguments.

For UDP sockets we can do this:

  eth0
host1 -- host2
  abc::1/64   fe80::2
  add abc::/64 as onlink subnet route


   host1# traceroute6 -i eth0 -s abc::1 fe80::2

You just broke this for RDS and are using polemic to defend your case,
but the main thrust of your diatribe seems to be "why would you need
this?" I'll try to address that briefly here.



It is unclear why you need to use words like "polemic"
and "diatribe".  What I wrote is in public and folks can
judge whether they are 

Re: [RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Eric Dumazet



On 07/06/2018 06:56 AM, Paolo Abeni wrote:
> On Fri, 2018-07-06 at 05:09 -0700, Eric Dumazet wrote:
>> On 07/06/2018 04:56 AM, Paolo Abeni wrote:
>>> With your setting, you need a bit more concurrent connections (400 ?)
>>> to saturate the ipfrag cache. Above that number, performances will
>>> still sink.
>>
>> Maybe, but IP defrag can not be 'perfect'.
>>
>> For this particular use case I could still bump high_thresh to 6 GB and all 
>> would be good :)
> 
> Understood.
> 
> I'd like to be sure I stated the problem I see clearly. With the
> current code the "goodput" goes to almost 0 as soon as the ipfrag cache
> load goes above it's capacity. Before the worker removal, after
> reaching high_thresh, the "goodput" degratated slowly and even with a
> load more than an order of magnitude higher, the performances were
> still quite good. I think we can't ask customers to add more memory for
> a kernel upgrade; even changing the default sysfs configuration is
> somewhat troubling.
> 
>>> This looks nice, I'll try to test it in my use case and I'll report
>>> here.
> 
> I tried the patch, but the result are not encouraging:
> 
> ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60
> 34.94
> 
> # on the receiver side:
> echo 2 >  /proc/sys/net/ipv4/ipfrag_time
> 
> # on the sender side:
> ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60
> 85.8
> 
> # still on receiver side, while the test is running:
> nstat>/dev/null ;sleep 1; nstat |grep IpReasm
> IpReasmTimeout  2128   0.0
> IpReasmReqds754770 0.0
> IpReasmOKs  1350.0
> IpReasmFails752811 0.0
> 
> grep FRAG /proc/net/sockstat
> FRAG: inuse 124 memory 5286144
> 
> The patch has some effect, as I basically saw no timeout without it,
> but still does not look aggressive enough. Or possibly it's evicting
> the fragments that are more likely to be used/completed (the most
> recents one).

Hey, that was simply an idea (not even compiled), not the final patch.

I will test/polish it later, I am coming back from vacations and have a backlog.

Here are my results : (Note that I have _not_ changed 
/proc/sys/net/ipv4/ipfrag_time )

lpaa6:~# grep . /proc/sys/net/ipv4/ipfrag_* ; grep FRAG /proc/net/sockstat
/proc/sys/net/ipv4/ipfrag_high_thresh:104857600
/proc/sys/net/ipv4/ipfrag_low_thresh:78643200
/proc/sys/net/ipv4/ipfrag_max_dist:0
/proc/sys/net/ipv4/ipfrag_secret_interval:0
/proc/sys/net/ipv4/ipfrag_time:30
FRAG: inuse 1379 memory 105006776

lpaa5:/export/hda3/google/edumazet# ./super_netperf 400 -H 10.246.7.134 -t 
UDP_STREAM -l 60
netperf: send_omni: send_data failed: No route to host
netperf: send_omni: send_data failed: No route to host
   9063


I would say that it looks pretty good to me.








[PATCH net-next 5/6] ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.

The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.

There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.

Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.

After this change the IPv4 and IPv6 logic is the same.

Signed-off-by: Willem de Bruijn 
---
 include/net/ip.h  |  1 -
 net/ipv4/ip_output.c  |  3 ++-
 net/ipv4/ping.c   |  2 --
 net/ipv4/raw.c|  2 --
 net/ipv4/udp.c|  2 --
 net/ipv6/ip6_output.c | 18 --
 6 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 6db23bf1e5eb..e44b1a44f67a 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -72,7 +72,6 @@ struct ipcm_cookie {
__be32  addr;
int oif;
struct ip_options_rcu   *opt;
-   __u8tx_flags;
__u8ttl;
__s16   tos;
charpriority;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 81d0e4a77ec5..e14c774cc092 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1153,8 +1153,9 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
cork->ttl = ipc->ttl;
cork->tos = ipc->tos;
cork->priority = ipc->priority;
-   cork->tx_flags = ipc->tx_flags;
cork->transmit_time = ipc->sockc.transmit_time;
+   cork->tx_flags = 0;
+   sock_tx_timestamp(sk, ipc->sockc.tsflags, >tx_flags);
 
return 0;
 }
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 6f17fc8ebbdb..b54c964ad925 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -763,8 +763,6 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
rcu_read_unlock();
}
 
-   sock_tx_timestamp(sk, ipc.sockc.tsflags, _flags);
-
saddr = ipc.addr;
ipc.addr = faddr = daddr;
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index cf142909389c..33df4d76db2d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -665,8 +665,6 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
  , msg->msg_flags, );
 
 else {
-   sock_tx_timestamp(sk, ipc.sockc.tsflags, _flags);
-
if (!ipc.addr)
ipc.addr = fl4.daddr;
lock_sock(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 87f3a0b77864..060e841dde40 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1020,8 +1020,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
saddr = ipc.addr;
ipc.addr = faddr = daddr;
 
-   sock_tx_timestamp(sk, ipc.sockc.tsflags, _flags);
-
if (ipc.opt && ipc.opt->opt.srr) {
if (!daddr) {
err = -EINVAL;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 1a3bf6437cb9..ff4b28a600ab 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1221,6 +1221,8 @@ static int ip6_setup_cork(struct sock *sk, struct 
inet_cork_full *cork,
cork->base.fragsize = mtu;
cork->base.gso_size = sk->sk_type == SOCK_DGRAM &&
  sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 
0;
+   cork->base.tx_flags = 0;
+   sock_tx_timestamp(sk, ipc6->sockc.tsflags, >base.tx_flags);
 
if (dst_allfrag(xfrm_dst_path(>dst)))
cork->base.flags |= IPCORK_ALLFRAG;
@@ -1250,7 +1252,6 @@ static int __ip6_append_data(struct sock *sk,
int copy;
int err;
int offset = 0;
-   __u8 tx_flags = 0;
u32 tskey = 0;
struct rt6_info *rt = (struct rt6_info *)cork->dst;
struct ipv6_txoptions *opt = v6_cork->opt;
@@ -1269,6 +1270,10 @@ static int __ip6_append_data(struct sock *sk,
mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize;
orig_mtu = mtu;
 
+   if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP &&
+   sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
+   tskey = sk->sk_tskey++;
+
hh_len = LL_RESERVED_SPACE(rt->dst.dev);
 
fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
@@ -1318,13 +1323,6 @@ static int __ip6_append_data(struct sock *sk,
rt->dst.dev->features & (NETIF_F_IPV6_CSUM 

[PATCH net-next 4/6] ipv6: fold sockcm_cookie into ipcm6_cookie

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

ipcm_cookie includes sockcm_cookie. Do the same for ipcm6_cookie.

This reduces the number of arguments that need to be passed around,
applies ipcm6_init to all cookie fields at once and reduces code
differentiation between ipv4 and ipv6.

Signed-off-by: Willem de Bruijn 
---
 include/net/ipv6.h   |  7 +++
 include/net/transp_v6.h  |  3 +--
 net/ipv6/datagram.c  |  4 ++--
 net/ipv6/icmp.c  |  7 ++-
 net/ipv6/ip6_flowlabel.c |  3 +--
 net/ipv6/ip6_output.c| 24 ++--
 net/ipv6/ipv6_sockglue.c |  3 +--
 net/ipv6/ping.c  |  3 +--
 net/ipv6/raw.c   | 10 --
 net/ipv6/udp.c   | 10 --
 net/l2tp/l2tp_ip6.c  |  6 ++
 11 files changed, 31 insertions(+), 49 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6cb247f54d4c..aa6fd11a887c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -294,6 +294,7 @@ struct ipv6_fl_socklist {
 };
 
 struct ipcm6_cookie {
+   struct sockcm_cookie sockc;
__s16 hlimit;
__s16 tclass;
__s8  dontfrag;
@@ -959,8 +960,7 @@ int ip6_append_data(struct sock *sk,
int odd, struct sk_buff *skb),
void *from, int length, int transhdrlen,
struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
-   struct rt6_info *rt, unsigned int flags,
-   const struct sockcm_cookie *sockc);
+   struct rt6_info *rt, unsigned int flags);
 
 int ip6_push_pending_frames(struct sock *sk);
 
@@ -977,8 +977,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 void *from, int length, int transhdrlen,
 struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
 struct rt6_info *rt, unsigned int flags,
-struct inet_cork_full *cork,
-const struct sockcm_cookie *sockc);
+struct inet_cork_full *cork);
 
 static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
 {
diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index f6a3543e5247..a8f6020f1196 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -42,8 +42,7 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct 
msghdr *msg,
struct sk_buff *skb);
 
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
- struct flowi6 *fl6, struct ipcm6_cookie *ipc6,
- struct sockcm_cookie *sockc);
+ struct flowi6 *fl6, struct ipcm6_cookie *ipc6);
 
 void __ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
   __u16 srcp, __u16 destp, int rqueue, int bucket);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2ee08b6a86a4..201306b9b5ea 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -736,7 +736,7 @@ EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
 
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
  struct msghdr *msg, struct flowi6 *fl6,
- struct ipcm6_cookie *ipc6, struct sockcm_cookie 
*sockc)
+ struct ipcm6_cookie *ipc6)
 {
struct in6_pktinfo *src_info;
struct cmsghdr *cmsg;
@@ -755,7 +755,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
}
 
if (cmsg->cmsg_level == SOL_SOCKET) {
-   err = __sock_cmsg_send(sk, msg, cmsg, sockc);
+   err = __sock_cmsg_send(sk, msg, cmsg, >sockc);
if (err)
return err;
continue;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index d99fed67cd10..24611c8b0562 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -430,7 +430,6 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 
code, __u32 info,
struct icmp6hdr tmp_hdr;
struct flowi6 fl6;
struct icmpv6_msg msg;
-   struct sockcm_cookie sockc_unused = {0};
struct ipcm6_cookie ipc6;
int iif = 0;
int addr_type = 0;
@@ -573,7 +572,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 
code, __u32 info,
len + sizeof(struct icmp6hdr),
sizeof(struct icmp6hdr),
, , (struct rt6_info *)dst,
-   MSG_DONTWAIT, _unused)) {
+   MSG_DONTWAIT)) {
ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
ip6_flush_pending_frames(sk);
} else {
@@ -677,7 +676,6 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
struct dst_entry *dst;
struct ipcm6_cookie ipc6;
u32 mark = IP6_REPLY_MARK(net, skb->mark);
-   struct sockcm_cookie sockc_unused = {0};
 

[PATCH net-next 1/6] ipv4: ipcm_cookie initializers

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn 
---
 include/net/ip.h | 15 +++
 net/ipv4/icmp.c  | 11 ++-
 net/ipv4/ip_output.c |  6 +-
 net/ipv4/ping.c  |  9 +
 net/ipv4/raw.c   |  9 +
 net/ipv4/udp.c   | 10 +-
 6 files changed, 21 insertions(+), 39 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 99d1b835d2aa..6db23bf1e5eb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -79,6 +79,21 @@ struct ipcm_cookie {
__u16   gso_size;
 };
 
+static inline void ipcm_init(struct ipcm_cookie *ipcm)
+{
+   *ipcm = (struct ipcm_cookie) { .tos = -1 };
+}
+
+static inline void ipcm_init_sk(struct ipcm_cookie *ipcm,
+   const struct inet_sock *inet)
+{
+   ipcm_init(ipcm);
+
+   ipcm->sockc.tsflags = inet->sk.sk_tsflags;
+   ipcm->oif = inet->sk.sk_bound_dev_if;
+   ipcm->addr = inet->inet_saddr;
+}
+
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
 #define PKTINFO_SKB_CB(skb) ((struct in_pktinfo *)((skb)->cb))
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 937239afd68d..695979b7ef6d 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -429,15 +429,11 @@ static void icmp_reply(struct icmp_bxm *icmp_param, 
struct sk_buff *skb)
 
icmp_param->data.icmph.checksum = 0;
 
+   ipcm_init();
inet->tos = ip_hdr(skb)->tos;
sk->sk_mark = mark;
daddr = ipc.addr = ip_hdr(skb)->saddr;
saddr = fib_compute_spec_dst(skb);
-   ipc.opt = NULL;
-   ipc.tx_flags = 0;
-   ipc.ttl = 0;
-   ipc.tos = -1;
-   ipc.sockc.transmit_time = 0;
 
if (icmp_param->replyopts.opt.opt.optlen) {
ipc.opt = _param->replyopts.opt;
@@ -711,12 +707,9 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, 
__be32 info)
icmp_param.offset = skb_network_offset(skb_in);
inet_sk(sk)->tos = tos;
sk->sk_mark = mark;
+   ipcm_init();
ipc.addr = iph->saddr;
ipc.opt = _param.replyopts.opt;
-   ipc.tx_flags = 0;
-   ipc.ttl = 0;
-   ipc.tos = -1;
-   ipc.sockc.transmit_time = 0;
 
rt = icmp_route_lookup(net, , skb_in, iph, saddr, tos, mark,
   type, code, _param);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 570e3ebc3974..81d0e4a77ec5 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1548,12 +1548,8 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
if (__ip_options_echo(net, , skb, sopt))
return;
 
+   ipcm_init();
ipc.addr = daddr;
-   ipc.opt = NULL;
-   ipc.tx_flags = 0;
-   ipc.ttl = 0;
-   ipc.tos = -1;
-   ipc.sockc.transmit_time = 0;
 
if (replyopts.opt.opt.optlen) {
ipc.opt = 
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index b47492205507..6f17fc8ebbdb 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -739,14 +739,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
/* no remote port */
}
 
-   ipc.sockc.tsflags = sk->sk_tsflags;
-   ipc.addr = inet->inet_saddr;
-   ipc.opt = NULL;
-   ipc.oif = sk->sk_bound_dev_if;
-   ipc.tx_flags = 0;
-   ipc.ttl = 0;
-   ipc.tos = -1;
-   ipc.sockc.transmit_time = 0;
+   ipcm_init_sk(, inet);
 
if (msg->msg_controllen) {
err = ip_cmsg_send(sk, msg, , false);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 446af7be2b55..cf142909389c 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -562,14 +562,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
daddr = inet->inet_daddr;
}
 
-   ipc.sockc.tsflags = sk->sk_tsflags;
-   ipc.sockc.transmit_time = 0;
-   ipc.addr = inet->inet_saddr;
-   ipc.opt = NULL;
-   ipc.tx_flags = 0;
-   ipc.ttl = 0;
-   ipc.tos = -1;
-   ipc.oif = sk->sk_bound_dev_if;
+   ipcm_init_sk(, inet);
 
if (msg->msg_controllen) {
err = ip_cmsg_send(sk, msg, , false);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5c76ba0666ec..87f3a0b77864 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -926,12 +926,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility 
*/
return -EOPNOTSUPP;
 
-   ipc.opt = NULL;
-   ipc.tx_flags = 0;
-   ipc.ttl = 0;
-   ipc.tos = -1;
-   ipc.sockc.transmit_time = 0;
-
getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
 
fl4 = >cork.fl.u.ip4;
@@ -978,9 +972,7 @@ int udp_sendmsg(struct sock *sk, struct 

[PATCH net-next 2/6] ipv6: ipcm6_cookie initializer

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn 
---
 include/net/ipv6.h  | 19 +++
 net/ipv6/icmp.c |  7 ++-
 net/ipv6/ping.c |  4 +---
 net/ipv6/raw.c  |  5 +
 net/ipv6/udp.c  |  4 +---
 net/l2tp/l2tp_ip6.c |  4 +---
 6 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index b7843e0b16ee..6cb247f54d4c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -301,6 +301,25 @@ struct ipcm6_cookie {
__u16 gso_size;
 };
 
+static inline void ipcm6_init(struct ipcm6_cookie *ipc6)
+{
+   *ipc6 = (struct ipcm6_cookie) {
+   .hlimit = -1,
+   .tclass = -1,
+   .dontfrag = -1,
+   };
+}
+
+static inline void ipcm6_init_sk(struct ipcm6_cookie *ipc6,
+const struct ipv6_pinfo *np)
+{
+   *ipc6 = (struct ipcm6_cookie) {
+   .hlimit = -1,
+   .tclass = np->tclass,
+   .dontfrag = np->dontfrag,
+   };
+}
+
 static inline struct ipv6_txoptions *txopt_get(const struct ipv6_pinfo *np)
 {
struct ipv6_txoptions *opt;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index be491bf6ab6e..d99fed67cd10 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -545,7 +545,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 
code, __u32 info,
else if (!fl6.flowi6_oif)
fl6.flowi6_oif = np->ucast_oif;
 
-   ipc6.tclass = np->tclass;
+   ipcm6_init_sk(, np);
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
dst = icmpv6_route_lookup(net, skb, sk, );
@@ -553,8 +553,6 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 
code, __u32 info,
goto out;
 
ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst);
-   ipc6.dontfrag = np->dontfrag;
-   ipc6.opt = NULL;
 
msg.skb = skb;
msg.offset = skb_network_offset(skb);
@@ -726,10 +724,9 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
msg.offset = 0;
msg.type = ICMPV6_ECHO_REPLY;
 
+   ipcm6_init_sk(, np);
ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst);
ipc6.tclass = ipv6_get_dsfield(ipv6_hdr(skb));
-   ipc6.dontfrag = np->dontfrag;
-   ipc6.opt = NULL;
 
if (ip6_append_data(sk, icmpv6_getfrag, ,
skb->len + sizeof(struct icmp6hdr),
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 96f56bf49a30..717e7c1fba29 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -119,7 +119,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
fl6.fl6_icmp_code = user_icmph.icmp6_code;
security_sk_classify_flow(sk, flowi6_to_flowi());
 
-   ipc6.tclass = np->tclass;
+   ipcm6_init_sk(, np);
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
dst = ip6_sk_dst_lookup_flow(sk, , daddr, false);
@@ -142,8 +142,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
pfh.family = AF_INET6;
 
ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst);
-   ipc6.dontfrag = np->dontfrag;
-   ipc6.opt = NULL;
 
lock_sock(sk);
err = ip6_append_data(sk, ping_getfrag, , len,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5737c50f16eb..5f40670271ee 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -791,10 +791,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
fl6.flowi6_mark = sk->sk_mark;
fl6.flowi6_uid = sk->sk_uid;
 
-   ipc6.hlimit = -1;
-   ipc6.tclass = -1;
-   ipc6.dontfrag = -1;
-   ipc6.opt = NULL;
+   ipcm6_init();
 
if (sin6) {
if (addr_len < SIN6_LEN_RFC2133)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ac6fc6728903..940115da9843 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1143,9 +1143,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
struct sockcm_cookie sockc;
 
-   ipc6.hlimit = -1;
-   ipc6.tclass = -1;
-   ipc6.dontfrag = -1;
+   ipcm6_init();
ipc6.gso_size = up->gso_size;
sockc.tsflags = sk->sk_tsflags;
sockc.transmit_time = 0;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 957369192ca1..38f80691f4ab 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -525,9 +525,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
fl6.flowi6_mark = sk->sk_mark;
fl6.flowi6_uid = sk->sk_uid;
 
-   ipc6.hlimit = -1;
-   ipc6.tclass = -1;
-   ipc6.dontfrag = -1;
+   ipcm6_init();
 
if (lsa) {
if (addr_len < 

[PATCH net-next 6/6] ip: unconditionally set cork gso_size

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

Now that ipc(6)->gso_size is correctly initialized in all callers of
ip(6)_setup_cork, it is safe to unconditionally pass it to the cork.

Link: 
http://lkml.kernel.org/r/20180619164752.143249-1-willemdebruijn.ker...@gmail.com
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/ip_output.c  | 3 +--
 net/ipv6/ip6_output.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e14c774cc092..e2b6bd478afb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1146,8 +1146,7 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
cork->fragsize = ip_sk_use_pmtu(sk) ?
 dst_mtu(>dst) : rt->dst.dev->mtu;
 
-   cork->gso_size = sk->sk_type == SOCK_DGRAM &&
-sk->sk_protocol == IPPROTO_UDP ? ipc->gso_size : 0;
+   cork->gso_size = ipc->gso_size;
cork->dst = >dst;
cork->length = 0;
cork->ttl = ipc->ttl;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4b28a600ab..8047fd41ba88 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1219,8 +1219,7 @@ static int ip6_setup_cork(struct sock *sk, struct 
inet_cork_full *cork,
if (mtu < IPV6_MIN_MTU)
return -EINVAL;
cork->base.fragsize = mtu;
-   cork->base.gso_size = sk->sk_type == SOCK_DGRAM &&
- sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 
0;
+   cork->base.gso_size = ipc6->gso_size;
cork->base.tx_flags = 0;
sock_tx_timestamp(sk, ipc6->sockc.tsflags, >base.tx_flags);
 
-- 
2.18.0.399.gad0ab374a1-goog



[PATCH net-next 3/6] sock: sockc cookie initializer

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn 
---
 include/net/sock.h | 6 ++
 net/ipv4/tcp.c | 2 +-
 net/packet/af_packet.c | 9 +++--
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index e0eac9ef44b5..83b747538bd0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1600,6 +1600,12 @@ struct sockcm_cookie {
u16 tsflags;
 };
 
+static inline void sockcm_init(struct sockcm_cookie *sockc,
+  const struct sock *sk)
+{
+   *sockc = (struct sockcm_cookie) { .tsflags = sk->sk_tsflags };
+}
+
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bf461fa77ed6..850dc8f15afc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1241,7 +1241,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
/* 'common' sending to sendq */
}
 
-   sockc.tsflags = sk->sk_tsflags;
+   sockcm_init(, sk);
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, );
if (unlikely(err)) {
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3428f7739ae9..47931ebfaef3 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1951,8 +1951,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   sockc.transmit_time = 0;
-   sockc.tsflags = sk->sk_tsflags;
+   sockcm_init(, sk);
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, );
if (unlikely(err))
@@ -2636,8 +2635,7 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
-   sockc.transmit_time = 0;
-   sockc.tsflags = po->sk.sk_tsflags;
+   sockcm_init(, >sk);
if (msg->msg_controllen) {
err = sock_cmsg_send(>sk, msg, );
if (unlikely(err))
@@ -2833,8 +2831,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
-   sockc.transmit_time = 0;
-   sockc.tsflags = sk->sk_tsflags;
+   sockcm_init(, sk);
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, );
-- 
2.18.0.399.gad0ab374a1-goog



[PATCH net-next 0/6] sock cookie initializers

2018-07-06 Thread Willem de Bruijn
From: Willem de Bruijn 

Recent UDP GSO and SO_TXTIME features added new fields to cookie
structs.

When adding a field, all sites where a struct is initialized have to
be updated, which is a lot of boilerplate. Alternatively, a field can
be initialized selectively, but this is fragile. I introduced a bug
in udp gso where an uninitialized field was read. See also fix commit
("9887cba19978 ip: limit use of gso_size to udp").

Introduce initializers for structs ipcm(6)_cookie and sockc_cookie.

patch 1..3 do exactly this.
patch 4..5 make ipv4 and ipv6 handle cookies the same way and
   remove some boilerplate in doing so.
patch 6removes the udp gso branch that needed the above fix

Willem de Bruijn (6):
  ipv4: ipcm_cookie initializers
  ipv6: ipcm6_cookie initializer
  sock: sockc cookie initializer
  ipv6: fold sockcm_cookie into ipcm6_cookie
  ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6
  ip: unconditionally set cork gso_size

 include/net/ip.h | 16 ++-
 include/net/ipv6.h   | 26 
 include/net/sock.h   |  6 ++
 include/net/transp_v6.h  |  3 +--
 net/ipv4/icmp.c  | 11 ++
 net/ipv4/ip_output.c | 12 ---
 net/ipv4/ping.c  | 11 +-
 net/ipv4/raw.c   | 11 +-
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/udp.c   | 12 +--
 net/ipv6/datagram.c  |  4 ++--
 net/ipv6/icmp.c  | 14 -
 net/ipv6/ip6_flowlabel.c |  3 +--
 net/ipv6/ip6_output.c| 43 +---
 net/ipv6/ipv6_sockglue.c |  3 +--
 net/ipv6/ping.c  |  7 ++-
 net/ipv6/raw.c   | 15 +-
 net/ipv6/udp.c   | 14 +
 net/l2tp/l2tp_ip6.c  | 10 +++---
 net/packet/af_packet.c   |  9 +++--
 20 files changed, 98 insertions(+), 134 deletions(-)

-- 
2.18.0.399.gad0ab374a1-goog



Re: [PATCH v1 net-next 6/9] lan743x: Add power management support

2018-07-06 Thread kbuild test robot
Hi Bryan,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Bryan-Whitehead/lan743x-Add-features-to-lan743x-driver/20180706-051812
config: alpha-allmodconfig (attached as .config)
compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=alpha 

All warnings (new ones prefixed by >>):

>> drivers/net/ethernet/microchip/lan743x_main.c:2787:5: warning: "CONFIG_PM" 
>> is not defined, evaluates to 0 [-Wundef]
#if CONFIG_PM
^
   drivers/net/ethernet/microchip/lan743x_main.c:2894:5: warning: "CONFIG_PM" 
is not defined, evaluates to 0 [-Wundef]
#if CONFIG_PM
^
   drivers/net/ethernet/microchip/lan743x_main.c:2926:5: warning: "CONFIG_PM" 
is not defined, evaluates to 0 [-Wundef]
#if CONFIG_PM
^
   drivers/net/ethernet/microchip/lan743x_main.c:2957:5: warning: "CONFIG_PM" 
is not defined, evaluates to 0 [-Wundef]
#if CONFIG_PM
^
--
>> drivers/net/ethernet/microchip/lan743x_ethtool.c:436:5: warning: "CONFIG_PM" 
>> is not defined, evaluates to 0 [-Wundef]
#if CONFIG_PM
^

vim +/CONFIG_PM +2787 drivers/net/ethernet/microchip/lan743x_main.c

  2786  
> 2787  #if CONFIG_PM
  2788  static void lan743x_pm_set_wol(struct lan743x_adapter *adapter)
  2789  {
  2790  const u8 ipv4_multicast[3] = { 0x01, 0x00, 0x5E };
  2791  const u8 ipv6_multicast[3] = { 0x33, 0x33 };
  2792  const u8 arp_type[2] = { 0x08, 0x06 };
  2793  int mask_index;
  2794  u32 pmtctl;
  2795  u32 wucsr;
  2796  u32 macrx;
  2797  u16 crc;
  2798  
  2799  for (mask_index = 0; mask_index < MAC_NUM_OF_WUF_CFG; 
mask_index++)
  2800  lan743x_csr_write(adapter, MAC_WUF_CFG(mask_index), 0);
  2801  
  2802  /* clear wake settings */
  2803  pmtctl = lan743x_csr_read(adapter, PMT_CTL);
  2804  pmtctl |= PMT_CTL_WUPS_MASK_;
  2805  pmtctl &= ~(PMT_CTL_GPIO_WAKEUP_EN_ | PMT_CTL_EEE_WAKEUP_EN_ |
  2806  PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_ |
  2807  PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_ | 
PMT_CTL_ETH_PHY_WAKE_EN_);
  2808  
  2809  macrx = lan743x_csr_read(adapter, MAC_RX);
  2810  
  2811  wucsr = 0;
  2812  mask_index = 0;
  2813  
  2814  pmtctl |= PMT_CTL_ETH_PHY_D3_COLD_OVR_ | 
PMT_CTL_ETH_PHY_D3_OVR_;
  2815  
  2816  if (adapter->wolopts & WAKE_PHY) {
  2817  pmtctl |= PMT_CTL_ETH_PHY_EDPD_PLL_CTL_;
  2818  pmtctl |= PMT_CTL_ETH_PHY_WAKE_EN_;
  2819  }
  2820  if (adapter->wolopts & WAKE_MAGIC) {
  2821  wucsr |= MAC_WUCSR_MPEN_;
  2822  macrx |= MAC_RX_RXEN_;
  2823  pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_;
  2824  }
  2825  if (adapter->wolopts & WAKE_UCAST) {
  2826  wucsr |= MAC_WUCSR_RFE_WAKE_EN_ | MAC_WUCSR_PFDA_EN_;
  2827  macrx |= MAC_RX_RXEN_;
  2828  pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_;
  2829  pmtctl |= PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_;
  2830  }
  2831  if (adapter->wolopts & WAKE_BCAST) {
  2832  wucsr |= MAC_WUCSR_RFE_WAKE_EN_ | MAC_WUCSR_BCST_EN_;
  2833  macrx |= MAC_RX_RXEN_;
  2834  pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_;
  2835  pmtctl |= PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_;
  2836  }
  2837  if (adapter->wolopts & WAKE_MCAST) {
  2838  /* IPv4 multicast */
  2839  crc = lan743x_pm_wakeframe_crc16(ipv4_multicast, 3);
  2840  lan743x_csr_write(adapter, MAC_WUF_CFG(mask_index),
  2841MAC_WUF_CFG_EN_ | 
MAC_WUF_CFG_TYPE_MCAST_ |
  2842(0 << MAC_WUF_CFG_OFFSET_SHIFT_) |
  2843(crc & MAC_WUF_CFG_CRC16_MASK_));
  2844  lan743x_csr_write(adapter, MAC_WUF_MASK0(mask_index), 
7);
  2845  lan743x_csr_write(adapter, MAC_WUF_MASK1(mask_index), 
0);
  2846  lan743x_csr_write(adapter, MAC_WUF_MASK2(mask_index), 
0);
  2847  lan743x_csr_write(adapter, MAC_WUF_MASK3(mask_index), 
0);
  2848  mask_index++;
  2849  
  2850  /* IPv6 multicast */
  2851  crc = lan743x_pm_wakeframe_crc16(ipv6_multicast, 2)

Re: [RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Paolo Abeni
On Fri, 2018-07-06 at 05:09 -0700, Eric Dumazet wrote:
> On 07/06/2018 04:56 AM, Paolo Abeni wrote:
> > With your setting, you need a bit more concurrent connections (400 ?)
> > to saturate the ipfrag cache. Above that number, performances will
> > still sink.
> 
> Maybe, but IP defrag can not be 'perfect'.
> 
> For this particular use case I could still bump high_thresh to 6 GB and all 
> would be good :)

Understood.

I'd like to be sure I stated the problem I see clearly. With the
current code the "goodput" goes to almost 0 as soon as the ipfrag cache
load goes above it's capacity. Before the worker removal, after
reaching high_thresh, the "goodput" degratated slowly and even with a
load more than an order of magnitude higher, the performances were
still quite good. I think we can't ask customers to add more memory for
a kernel upgrade; even changing the default sysfs configuration is
somewhat troubling.

> > This looks nice, I'll try to test it in my use case and I'll report
> > here.

I tried the patch, but the result are not encouraging:

./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60
34.94

# on the receiver side:
echo 2 >  /proc/sys/net/ipv4/ipfrag_time

# on the sender side:
./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60
85.8

# still on receiver side, while the test is running:
nstat>/dev/null ;sleep 1; nstat |grep IpReasm
IpReasmTimeout  2128   0.0
IpReasmReqds754770 0.0
IpReasmOKs  1350.0
IpReasmFails752811 0.0

grep FRAG /proc/net/sockstat
FRAG: inuse 124 memory 5286144

The patch has some effect, as I basically saw no timeout without it,
but still does not look aggressive enough. Or possibly it's evicting
the fragments that are more likely to be used/completed (the most
recents one).

> > I have doubt: under DDOS we will trigger  timeout per
> > jiffy, can that keep a CPU busy, too?
> 
> Yes, the cpu(s) handling the RX queue(s), which are already provisioned for 
> networking stuff ;)
> 
> Even without any frag being received, these cpu can be 100% busy.

With:

schedule_work_on(smp_processor_id(), #... )

We can be sure to run exclusively on the cpu handling the RX queue even with 
the worker.

Cheers,

Paolo


[PATCH net 2/3] net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev

2018-07-06 Thread Sabrina Dubroca
The value has already been copied from this netns's devconf_dflt, it
shouldn't be reset to the global kernel default.

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e9ba53d2a147..e20f8a1d8cdb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -385,8 +385,6 @@ static struct inet6_dev *ipv6_add_dev(struct net_device 
*dev)
 
if (ndev->cnf.stable_secret.initialized)
ndev->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
-   else
-   ndev->cnf.addr_gen_mode = ipv6_devconf_dflt.addr_gen_mode;
 
ndev->cnf.mtu6 = dev->mtu;
ndev->nd_parms = neigh_parms_alloc(dev, _tbl);
-- 
2.18.0



[PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread Sabrina Dubroca
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores
the errors returned by proc_dointvec().

addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which
writes the value to memory, and then checks if it's valid and may return
EINVAL. If a bad value is given, the value displayed when reading
net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the
value provided by the user was valid, addrconf_dev_config() won't be
called since idev->cnf.addr_gen_mode has already been updated.

Fix this in the usual way we deal with values that need to be checked
after the proc_do*() helper has returned: define a local ctl_table and
storage, call proc_dointvec() on that temporary area, then check and
store.

addrconf_sysctl_addr_gen_mode() also writes the new value to the global
ipv6_devconf_dflt, when we're writing to some netns's default, so that
new netns will inherit the value that was set by the change occuring in
any netns. That doesn't make any sense, so let's drop this assignment.

Finally, since addr_gen_mode is a __u32, switch to proc_douintvec().

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 91580c62bb86..e9ba53d2a147 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
ctl_table *ctl, int write,
 loff_t *ppos)
 {
int ret = 0;
-   int new_val;
+   u32 new_val;
struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
struct net *net = (struct net *)ctl->extra2;
+   struct ctl_table tmp = {
+   .data = _val,
+   .maxlen = sizeof(new_val),
+   .mode = ctl->mode,
+   };
 
if (!rtnl_trylock())
return restart_syscall();
 
-   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+   new_val = *((u32 *)ctl->data);
 
-   if (write) {
-   new_val = *((int *)ctl->data);
+   ret = proc_douintvec(, write, buffer, lenp, ppos);
+   if (ret != 0)
+   goto out;
 
+   if (write) {
if (check_addr_gen_mode(new_val) < 0) {
ret = -EINVAL;
goto out;
}
 
-   /* request for default */
-   if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
-   ipv6_devconf_dflt.addr_gen_mode = new_val;
-
-   /* request for individual net device */
-   } else {
-   if (!idev)
-   goto out;
-
+   if (idev) {
if (check_stable_privacy(idev, net, new_val) < 0) {
ret = -EINVAL;
goto out;
@@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct ctl_table 
*ctl, int write,
addrconf_dev_config(idev->dev);
}
}
+
+   *((u32 *)ctl->data) = new_val;
}
 
 out:
-- 
2.18.0



[PATCH net 0/3] net/ipv6: addr_gen_mode fixes

2018-07-06 Thread Sabrina Dubroca
This series fixes bugs in handling of the addr_gen_mode option, mainly
related to the sysctl. A minor netlink issue was also present in the
initial commit introducing the option on a per-netdevice basis.

Sabrina Dubroca (3):
  net/ipv6: fix addrconf_sysctl_addr_gen_mode
  net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev
  net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE

 net/ipv6/addrconf.c | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

-- 
2.18.0



[PATCH net 3/3] net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE

2018-07-06 Thread Sabrina Dubroca
inet6_ifla6_size() is called to check how much space is needed by
inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include
the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it.

Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e20f8a1d8cdb..e89bca83e0e4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5208,7 +5208,9 @@ static inline size_t inet6_ifla6_size(void)
 + nla_total_size(DEVCONF_MAX * 4) /* IFLA_INET6_CONF */
 + nla_total_size(IPSTATS_MIB_MAX * 8) /* IFLA_INET6_STATS */
 + nla_total_size(ICMP6_MIB_MAX * 8) /* IFLA_INET6_ICMP6STATS */
-+ nla_total_size(sizeof(struct in6_addr)); /* IFLA_INET6_TOKEN */
++ nla_total_size(sizeof(struct in6_addr)) /* IFLA_INET6_TOKEN */
++ nla_total_size(1) /* IFLA_INET6_ADDR_GEN_MODE */
++ 0;
 }
 
 static inline size_t inet6_if_nlmsg_size(void)
-- 
2.18.0



Re: [PATCH iproute2-next] tc: m_tunnel_key: Add tunnel option support to act_tunnel_key

2018-07-06 Thread Roman Mashak
Jakub Kicinski  writes:

> From: Simon Horman 
>
> Allow setting tunnel options using the act_tunnel_key action.
>
> Options are expressed as class:type:data and multiple options
> may be listed using a comma delimiter.
>
>  # ip link add name geneve0 type geneve dstport 0 external
>  # tc qdisc add dev eth0 ingress
>  # tc filter add dev eth0 protocol ip parent : \
>  flower indev eth0 \
> ip_proto udp \
> action tunnel_key \
> set src_ip 10.0.99.192 \
> dst_ip 10.0.99.193 \
> dst_port 6081 \
> id 11 \
> geneve_opts 0102:80:00800022,0102:80:00800022 \
> action mirred egress redirect dev geneve0

[...]

Jakub, could you also add relevant tests for the new option in file
$(kernel)/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json?


[PATCH] ipv6: icmp: Updating pmtu for link local route

2018-07-06 Thread Georg Kohmann
When a ICMPV6_PKT_TOOBIG is received from a link local address the pmtu 
will be updated on a route with an arbitrary interface index. Subsequent 
packets sent back to the same link local address may therefore end up 
not considering the updated pmtu.


Current behavior breaks TAHI v6LC4.1.4 Reduce PMTU On-link.
Referring to RFC 1981: Section 3:
"Note that Path MTU Discovery must be performed even in cases where a 
node "thinks" a destination is attached to the same link as itself. In a 
situation such as when a neighboring router acts as proxy [ND] for some 
destination, the destination can to appear to be directly connected but 
is in fact more than one hop away."


Using the interface index from the incoming ICMPV6_PKT_TOOBIG when 
updating the pmtu.


Signed-off-by: Georg Kohmann 
---
 net/ipv6/icmp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index be491bf..7d0e9c7 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -92,7 +92,7 @@ static void icmpv6_err(struct sk_buff *skb, struct 
inet6_skb_parm *opt,

    struct net *net = dev_net(skb->dev);

    if (type == ICMPV6_PKT_TOOBIG)
-   ip6_update_pmtu(skb, net, info, 0, 0, sock_net_uid(net, 
NULL));
+   ip6_update_pmtu(skb, net, info, skb->dev->ifindex, 0, 
sock_net_uid(net, NULL));

    else if (type == NDISC_REDIRECT)
    ip6_redirect(skb, net, skb->dev->ifindex, 0,
 sock_net_uid(net, NULL));
---

Tested on kernel 4.4.38.
I am however worried that this patch may have consequences for other 
traffic on the same interface. Please advice or suggest at better approach.




Re: [RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Eric Dumazet



On 07/06/2018 04:56 AM, Paolo Abeni wrote:
> Hi,
> 
> On Fri, 2018-07-06 at 04:23 -0700, Eric Dumazet wrote:
>> Ho hum. No please.
>>
>> I do not think adding back a GC is wise, since my patches were going in the 
>> direction
>> of allowing us to increase limits on current hardware.
>>
>> Meaning that the amount of frags to evict would be quite big under DDOS.
>> (One inet_frag_queue allocated for every incoming tiny frame :/ )
>>
>> A GC is a _huge_ problem, burning one cpu (you would have to provision for 
>> this CPU)
>> compared to letting normal per frag timer doing its job.
>>
>> My plan was to reduce the per frag timer under load (default is 30 seconds), 
>> since
>> this is exactly what your patch is indirectly doing, by aggressively pruning
>> frags under stress.
>>
>> That would be a much simpler heuristic. [1]
>>
>> BTW my own results (before patch) are :
>>
>> lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t 
>> UDP_STREAM -l 60
>>9602
>> lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t 
>> UDP_STREAM -l 60
>>9557
>>
>> On receiver (normal settings here) I had :
>>
>> lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_*
>> /proc/sys/net/ipv4/ipfrag_high_thresh:104857600
>> /proc/sys/net/ipv4/ipfrag_low_thresh:78643200
>> /proc/sys/net/ipv4/ipfrag_max_dist:0
>> /proc/sys/net/ipv4/ipfrag_secret_interval:0
>> /proc/sys/net/ipv4/ipfrag_time:30
>>
>> lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat
>> FRAG: inuse 824 memory 53125312
> 
> Than you for the feedback.
> 
> With your setting, you need a bit more concurrent connections (400 ?)
> to saturate the ipfrag cache. Above that number, performances will
> still sink.

Maybe, but IP defrag can not be 'perfect'.

For this particular use case I could still bump high_thresh to 6 GB and all 
would be good :)

> This looks nice, I'll try to test it in my use case and I'll report
> here.
> 
> Perhaps we can use the default timeout when usage < low_thresh, to
> avoid some maths in possibly common scenario?

On current 64bit hardware, a divide here is not a big cost (compared to the rest
of frag setup)
and I would rather starting having smaller timeouts sooner than later ;)

(low_thresh is typically set to 75% of high_thresh)

> 
> I have doubt: under DDOS we will trigger  timeout per
> jiffy, can that keep a CPU busy, too?

Yes, the cpu(s) handling the RX queue(s), which are already provisioned for 
networking stuff ;)

Even without any frag being received, these cpu can be 100% busy.



Re: [RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Paolo Abeni
Hi,

On Fri, 2018-07-06 at 04:23 -0700, Eric Dumazet wrote:
> Ho hum. No please.
> 
> I do not think adding back a GC is wise, since my patches were going in the 
> direction
> of allowing us to increase limits on current hardware.
> 
> Meaning that the amount of frags to evict would be quite big under DDOS.
> (One inet_frag_queue allocated for every incoming tiny frame :/ )
> 
> A GC is a _huge_ problem, burning one cpu (you would have to provision for 
> this CPU)
> compared to letting normal per frag timer doing its job.
> 
> My plan was to reduce the per frag timer under load (default is 30 seconds), 
> since
> this is exactly what your patch is indirectly doing, by aggressively pruning
> frags under stress.
> 
> That would be a much simpler heuristic. [1]
> 
> BTW my own results (before patch) are :
> 
> lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t 
> UDP_STREAM -l 60
>9602
> lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t 
> UDP_STREAM -l 60
>9557
> 
> On receiver (normal settings here) I had :
> 
> lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_*
> /proc/sys/net/ipv4/ipfrag_high_thresh:104857600
> /proc/sys/net/ipv4/ipfrag_low_thresh:78643200
> /proc/sys/net/ipv4/ipfrag_max_dist:0
> /proc/sys/net/ipv4/ipfrag_secret_interval:0
> /proc/sys/net/ipv4/ipfrag_time:30
> 
> lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat
> FRAG: inuse 824 memory 53125312

Than you for the feedback.

With your setting, you need a bit more concurrent connections (400 ?)
to saturate the ipfrag cache. Above that number, performances will
still sink.

> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 
> c9e35b81d0931df8429a33e8d03e719b87da0747..88ed61bcda00f3357724e5c4dbcb97400b4a8b21
>  100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -155,9 +155,15 @@ static struct inet_frag_queue *inet_frag_alloc(struct 
> netns_frags *nf,
>struct inet_frags *f,
>void *arg)
>  {
> +   long high_thresh = READ_ONCE(nf->high_thresh);
> struct inet_frag_queue *q;
> +   u64 timeout;
> +   long usage;
>  
> -   if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
> +   if (!high_thresh)
> +   return NULL;
> +   usage = frag_mem_limit(nf);
> +   if (usage > high_thresh)
> return NULL;
>  
> q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC);
> @@ -171,6 +177,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct 
> netns_frags *nf,
> timer_setup(>timer, f->frag_expire, 0);
> spin_lock_init(>lock);
> refcount_set(>refcnt, 3);
> +   timeout = (u64)nf->timeout * (high_thresh - usage);
> +   mod_timer(>timer, jiffies + div64_long(timeout, high_thresh));
>  
> return q;
>  }

This looks nice, I'll try to test it in my use case and I'll report
here.

Perhaps we can use the default timeout when usage < low_thresh, to
avoid some maths in possibly common scenario?

I have doubt: under DDOS we will trigger  timeout per
jiffy, can that keep a CPU busy, too?

Cheers,

Paolo


Re: [RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Eric Dumazet



On 07/06/2018 03:10 AM, Paolo Abeni wrote:
> Currently, the ip frag cache is fragile to overload. With
> flow control disabled:
> 
> ./super_netperf.sh 10  -H 192.168.101.2 -t UDP_STREAM -l 60
> 9618.08
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60
> 28.66
> 
> Once that the overload condition is reached, the system does not
> recover until it's almost completely idle:
> 
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60 &
> sleep 4; I=0;
> for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; 
> done
> 13.72
> 
> This is due to the removal of the fragment cache worker, which
> was responsible to free some IP fragment cache memory when the
> high threshold was reached, allowing the system to cope successfully
> with the next fragmented packets.
> 
> This commit re-introduces the worker, on a per netns basis. Thanks
> to rhashtable walkers we can block the bh only for an entry removal.
> 
> After this commit (and before IP frag worker removal):
> 
> ./super_netperf.sh 10  -H 192.168.101.2 -t UDP_STREAM -l 60
> 9618.08
> 
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60
> 8599.77
> 
> ./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60 &
> sleep 4; I=0;
> for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; 
> done
> 9623.12
> 
> Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
> Signed-off-by: Paolo Abeni 
> ---
> Note: tweaking ipfrag sysfs does not solve completely the issue:
> - raising ipfrag_high_thresh increases the number of parallels
>   connections required to degrade the tput, but reached the IP
>   fragment cache capacity the good-put still goes almost to 0,
>   with the worker we get much more nice behaviour.
> - setting ipfrag_time to 2 increases the change to recover from
>   overload (the 2# test above), but with several experiments 
>   in such test, I got an average of 50% the expected tput with a very
>   large variance, with the worker we always see the expected/
>   line rate tput.
> ---

Ho hum. No please.

I do not think adding back a GC is wise, since my patches were going in the 
direction
of allowing us to increase limits on current hardware.

Meaning that the amount of frags to evict would be quite big under DDOS.
(One inet_frag_queue allocated for every incoming tiny frame :/ )

A GC is a _huge_ problem, burning one cpu (you would have to provision for this 
CPU)
compared to letting normal per frag timer doing its job.

My plan was to reduce the per frag timer under load (default is 30 seconds), 
since
this is exactly what your patch is indirectly doing, by aggressively pruning
frags under stress.

That would be a much simpler heuristic. [1]

BTW my own results (before patch) are :

lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t 
UDP_STREAM -l 60
   9602
lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t 
UDP_STREAM -l 60
   9557

On receiver (normal settings here) I had :

lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_*
/proc/sys/net/ipv4/ipfrag_high_thresh:104857600
/proc/sys/net/ipv4/ipfrag_low_thresh:78643200
/proc/sys/net/ipv4/ipfrag_max_dist:0
/proc/sys/net/ipv4/ipfrag_secret_interval:0
/proc/sys/net/ipv4/ipfrag_time:30

lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat
FRAG: inuse 824 memory 53125312

[1] Something like (for IPv4 only here)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 
c9e35b81d0931df8429a33e8d03e719b87da0747..88ed61bcda00f3357724e5c4dbcb97400b4a8b21
 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -155,9 +155,15 @@ static struct inet_frag_queue *inet_frag_alloc(struct 
netns_frags *nf,
   struct inet_frags *f,
   void *arg)
 {
+   long high_thresh = READ_ONCE(nf->high_thresh);
struct inet_frag_queue *q;
+   u64 timeout;
+   long usage;
 
-   if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
+   if (!high_thresh)
+   return NULL;
+   usage = frag_mem_limit(nf);
+   if (usage > high_thresh)
return NULL;
 
q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC);
@@ -171,6 +177,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct 
netns_frags *nf,
timer_setup(>timer, f->frag_expire, 0);
spin_lock_init(>lock);
refcount_set(>refcnt, 3);
+   timeout = (u64)nf->timeout * (high_thresh - usage);
+   mod_timer(>timer, jiffies + div64_long(timeout, high_thresh));
 
return q;
 }
@@ -186,8 +194,6 @@ static struct inet_frag_queue *inet_frag_create(struct 
netns_frags *nf,
if (!q)
return NULL;
 
-   mod_timer(>timer, jiffies + nf->timeout);
-
err = rhashtable_insert_fast(>rhashtable, >node,
 f->rhash_params);
if (err < 

[PATCH net-next 2/2] net/sched: act_skbedit: don't use spinlock in the data path

2018-07-06 Thread Davide Caratti
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on
act_skbedit configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.

Signed-off-by: Davide Caratti 
---
 include/net/tc_act/tc_skbedit.h |  37 ---
 net/sched/act_skbedit.c | 107 
 2 files changed, 95 insertions(+), 49 deletions(-)

diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index 19cd3d345804..911bbac838a2 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -22,14 +22,19 @@
 #include 
 #include 
 
+struct tcf_skbedit_params {
+   u32 flags;
+   u32 priority;
+   u32 mark;
+   u32 mask;
+   u16 queue_mapping;
+   u16 ptype;
+   struct rcu_head rcu;
+};
+
 struct tcf_skbedit {
-   struct tc_actioncommon;
-   u32 flags;
-   u32 priority;
-   u32 mark;
-   u32 mask;
-   u16 queue_mapping;
-   u16 ptype;
+   struct tc_action common;
+   struct tcf_skbedit_params __rcu *params;
 };
 #define to_skbedit(a) ((struct tcf_skbedit *)a)
 
@@ -37,15 +42,27 @@ struct tcf_skbedit {
 static inline bool is_tcf_skbedit_mark(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-   if (a->ops && a->ops->type == TCA_ACT_SKBEDIT)
-   return to_skbedit(a)->flags == SKBEDIT_F_MARK;
+   u32 flags;
+
+   if (a->ops && a->ops->type == TCA_ACT_SKBEDIT) {
+   rcu_read_lock();
+   flags = rcu_dereference(to_skbedit(a)->params)->flags;
+   rcu_read_unlock();
+   return flags == SKBEDIT_F_MARK;
+   }
 #endif
return false;
 }
 
 static inline u32 tcf_skbedit_mark(const struct tc_action *a)
 {
-   return to_skbedit(a)->mark;
+   u32 mark;
+
+   rcu_read_lock();
+   mark = rcu_dereference(to_skbedit(a)->params)->mark;
+   rcu_read_unlock();
+
+   return mark;
 }
 
 #endif /* __NET_TC_SKBEDIT_H */
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index c1bfa28ba477..028eb679e532 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -37,14 +37,19 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
   struct tcf_result *res)
 {
struct tcf_skbedit *d = to_skbedit(a);
+   struct tcf_skbedit_params *params;
+   int action;
 
tcf_lastuse_update(>tcf_tm);
bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
 
-   spin_lock(>tcf_lock);
-   if (d->flags & SKBEDIT_F_PRIORITY)
-   skb->priority = d->priority;
-   if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
+   rcu_read_lock();
+   params = rcu_dereference(d->params);
+   action = READ_ONCE(d->tcf_action);
+
+   if (params->flags & SKBEDIT_F_PRIORITY)
+   skb->priority = params->priority;
+   if (params->flags & SKBEDIT_F_INHERITDSFIELD) {
int wlen = skb_network_offset(skb);
 
switch (tc_skb_protocol(skb)) {
@@ -63,23 +68,23 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
break;
}
}
-   if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
-   skb->dev->real_num_tx_queues > d->queue_mapping)
-   skb_set_queue_mapping(skb, d->queue_mapping);
-   if (d->flags & SKBEDIT_F_MARK) {
-   skb->mark &= ~d->mask;
-   skb->mark |= d->mark & d->mask;
+   if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
+   skb->dev->real_num_tx_queues > params->queue_mapping)
+   skb_set_queue_mapping(skb, params->queue_mapping);
+   if (params->flags & SKBEDIT_F_MARK) {
+   skb->mark &= ~params->mask;
+   skb->mark |= params->mark & params->mask;
}
-   if (d->flags & SKBEDIT_F_PTYPE)
-   skb->pkt_type = d->ptype;
-
-   spin_unlock(>tcf_lock);
-   return d->tcf_action;
+   if (params->flags & SKBEDIT_F_PTYPE)
+   skb->pkt_type = params->ptype;
 
+unlock:
+   rcu_read_unlock();
+   return action;
 err:
-   spin_unlock(>tcf_lock);
qstats_drop_inc(this_cpu_ptr(d->common.cpu_qstats));
-   return TC_ACT_SHOT;
+   action = TC_ACT_SHOT;
+   goto unlock;
 }
 
 static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
@@ -97,6 +102,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
int ovr, int bind, struct netlink_ext_ack *extack)
 {
struct tc_action_net *tn = net_generic(net, skbedit_net_id);
+   struct tcf_skbedit_params *params_old, *params_new;
struct nlattr *tb[TCA_SKBEDIT_MAX + 1];
struct tc_skbedit *parm;
struct tcf_skbedit *d;
@@ -176,25 +182,34 @@ static int tcf_skbedit_init(struct net *net, struct 
nlattr 

[PATCH net-next 1/2] net/sched: skbedit: use per-cpu counters

2018-07-06 Thread Davide Caratti
use per-CPU counters, instead of sharing a single set of stats with all
cores: this removes the need of spinlocks when stats are read/updated.

Signed-off-by: Davide Caratti 
---
 net/sched/act_skbedit.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index dfaf5d8028dd..c1bfa28ba477 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -38,10 +38,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
 {
struct tcf_skbedit *d = to_skbedit(a);
 
-   spin_lock(>tcf_lock);
tcf_lastuse_update(>tcf_tm);
-   bstats_update(>tcf_bstats, skb);
+   bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
 
+   spin_lock(>tcf_lock);
if (d->flags & SKBEDIT_F_PRIORITY)
skb->priority = d->priority;
if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
@@ -77,8 +77,8 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
return d->tcf_action;
 
 err:
-   d->tcf_qstats.drops++;
spin_unlock(>tcf_lock);
+   qstats_drop_inc(this_cpu_ptr(d->common.cpu_qstats));
return TC_ACT_SHOT;
 }
 
@@ -163,7 +163,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
 
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
-_skbedit_ops, bind, false);
+_skbedit_ops, bind, true);
if (ret)
return ret;
 
-- 
2.17.1



[PATCH net-next 0/2] net/sched: act_skbedit: lockless data path

2018-07-06 Thread Davide Caratti
the data path of act_skbedit can be faster if we avoid using spinlocks:
 - patch 1 converts act_skbedit statistics to use per-cpu counters
 - patch 2 lets act_skbedit use RCU to read/update its configuration 

test procedure (using pktgen from https://github.com/netoptimizer):

 # ip link add name eth1 type dummy
 # ip link set dev eth1 up
 # tc qdisc add dev eth1 clsact
 # tc filter add dev eth1 egress matchall action skbedit priority c1a0:c1a0
 # for c in 1 2 4 ; do
 > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 500 -i eth1
 > done

test results (avg. pps/thread)

  $c | before patch |  after patch | improvement
 +--+--+
   1 | 3917464 ± 3% | 4000458 ± 3% |  irrelevant
   2 | 3455367 ± 4% | 3953076 ± 1% |+14%
   4 | 2496594 ± 2% | 3801123 ± 3% |+52%


Davide Caratti (2):
  net/sched: skbedit: use per-cpu counters
  net/sched: act_skbedit: don't use spinlock in the data path

 include/net/tc_act/tc_skbedit.h |  37 ---
 net/sched/act_skbedit.c | 113 
 2 files changed, 98 insertions(+), 52 deletions(-)

-- 
2.17.1



[PATCH net] ipfrag: really prevent allocation on netns exit

2018-07-06 Thread Paolo Abeni
Setting the low threshold to 0 has no effect on frags allocation,
we need to clear high_thresh instead.

The code was pre-existent to commit 648700f76b03 ("inet: frags:
use rhashtables for reassembly units"), but before the above,
such assignment had a different role: prevent concurrent eviction
from the worker and the netns cleanup helper.

Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Paolo Abeni 
---
 net/ipv4/inet_fragment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c9e35b81d093..1e4cf3ab560f 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -90,7 +90,7 @@ static void inet_frags_free_cb(void *ptr, void *arg)
 
 void inet_frags_exit_net(struct netns_frags *nf)
 {
-   nf->low_thresh = 0; /* prevent creation of new frags */
+   nf->high_thresh = 0; /* prevent creation of new frags */
 
rhashtable_free_and_destroy(>rhashtable, inet_frags_free_cb, NULL);
 }
-- 
2.17.1



[PATCH net-next 03/10] net: hns3: Fix for waterline not setting correctly

2018-07-06 Thread Salil Mehta
From: Yunsheng Lin 

The HCLGE_RX_PRIV_EN_B is used to tell the firmware whether
to update the specific waterline value, if the is not set,
the firmware will ignore the value.

This patch fixes by setting the HCLGE_RX_PRIV_EN_B even if
the updated value is zero.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility 
Layer Support")
Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
Signed-off-by: Salil Mehta 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 6fffc69..dae1aa5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -1834,8 +1834,6 @@ static int hclge_rx_priv_buf_alloc(struct hclge_dev *hdev,
return 0;
 }
 
-#define HCLGE_PRIV_ENABLE(a) ((a) > 0 ? 1 : 0)
-
 static int hclge_rx_priv_wl_config(struct hclge_dev *hdev,
   struct hclge_pkt_buf_alloc *buf_alloc)
 {
@@ -1863,13 +1861,11 @@ static int hclge_rx_priv_wl_config(struct hclge_dev 
*hdev,
req->tc_wl[j].high =
cpu_to_le16(priv->wl.high >> HCLGE_BUF_UNIT_S);
req->tc_wl[j].high |=
-   cpu_to_le16(HCLGE_PRIV_ENABLE(priv->wl.high) <<
-   HCLGE_RX_PRIV_EN_B);
+   cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B));
req->tc_wl[j].low =
cpu_to_le16(priv->wl.low >> HCLGE_BUF_UNIT_S);
req->tc_wl[j].low |=
-   cpu_to_le16(HCLGE_PRIV_ENABLE(priv->wl.low) <<
-   HCLGE_RX_PRIV_EN_B);
+cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B));
}
}
 
@@ -1911,13 +1907,11 @@ static int hclge_common_thrd_config(struct hclge_dev 
*hdev,
req->com_thrd[j].high =
cpu_to_le16(tc->high >> HCLGE_BUF_UNIT_S);
req->com_thrd[j].high |=
-   cpu_to_le16(HCLGE_PRIV_ENABLE(tc->high) <<
-   HCLGE_RX_PRIV_EN_B);
+cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B));
req->com_thrd[j].low =
cpu_to_le16(tc->low >> HCLGE_BUF_UNIT_S);
req->com_thrd[j].low |=
-   cpu_to_le16(HCLGE_PRIV_ENABLE(tc->low) <<
-   HCLGE_RX_PRIV_EN_B);
+cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B));
}
}
 
@@ -1943,14 +1937,10 @@ static int hclge_common_wl_config(struct hclge_dev 
*hdev,
 
req = (struct hclge_rx_com_wl *)desc.data;
req->com_wl.high = cpu_to_le16(buf->self.high >> HCLGE_BUF_UNIT_S);
-   req->com_wl.high |=
-   cpu_to_le16(HCLGE_PRIV_ENABLE(buf->self.high) <<
-   HCLGE_RX_PRIV_EN_B);
+   req->com_wl.high |=  cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B));
 
req->com_wl.low = cpu_to_le16(buf->self.low >> HCLGE_BUF_UNIT_S);
-   req->com_wl.low |=
-   cpu_to_le16(HCLGE_PRIV_ENABLE(buf->self.low) <<
-   HCLGE_RX_PRIV_EN_B);
+   req->com_wl.low |=  cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B));
 
ret = hclge_cmd_send(>hw, , 1);
if (ret) {
-- 
2.7.4




Re: Crash due to destroying TCP request sockets using SOCK_DESTROY

2018-07-06 Thread Eric Dumazet



On 07/05/2018 09:46 PM, Lorenzo Colitti wrote:
> On Fri, Jul 6, 2018 at 11:37 AM Subash Abhinov Kasiviswanathan
>  wrote:
>>
>>  From the call stack, a TCP socket is being destroyed using netlink_diag.
>> The memory dump showed that the socket was an inet request socket (in
>> state TCP_NEW_SYN_RECV) with refcount of 0.
>> [...]
>>   13232.479820:   <2> refcount_t: underflow; use-after-free.
>>   13232.479838:   <6> [ cut here ]
>>   13232.479843:   <6> kernel BUG at kernel/msm-4.14/lib/refcount.c:204!
>>   13232.479849:   <6> Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [...]
>>   13232.479996:   <6> Process netd (pid: 648, stack limit =
>> 0xff801cf98000)
>>   13232.479998:   <2> Call trace:
>>   13232.48:   <2>  refcount_sub_and_test+0x64/0x78
>>   13232.480002:   <2>  refcount_dec_and_test+0x18/0x24
>>   13232.480005:   <2>  sock_gen_put+0x1c/0xb0
>>   13232.480009:   <2>  tcp_diag_destroy+0x54/0x68
>> [...]
> 
> Looks like for a TCP_NEW_SYN_RECV socket, sock_diag_destroy
> essentially ends up doing:
> 
> struct request_sock *req = inet_reqsk(sk);
> 
> local_bh_disable();
> inet_csk_reqsk_queue_drop_and_put(req->rsk_listener,
>   req);
> local_bh_enable();
> ...
> 
> sock_gen_put(sk);
> 
> It looks like inet_csk_reqsk_queue_drop_and_put calls reqsk_put(req),
> which frees the socket, and at that point sock_gen_put is a UAF. Do we
> just need:
> 
> -inet_csk_reqsk_queue_drop_and_put(req->rsk_listener,
> -   req);
> +inet_csk_reqsk_queue_drop(req->rsk_listener, req);
> 
> since sock_gen_put will also end up calling reqsk_put() for a
> TCP_SYN_RECV socket?
> 
> Alastair - you're able to reproduce this UAF using net_test on qemu,
> right? If so, could you try that two-line patch above?
> 

Hi Lorenzo

Your patch makes sense to me, please submit it formally with :

Fixes: d7226c7a4dd1 ("net: diag: Fix refcnt leak in error path destroying 
socket")
Cc: David Ahern 

Thanks !


Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 17:08), Ka-Cheong Poon wrote:
> >Hmm. Why do you need to include tcp header in ib transport
> >code ? If there is any common function just move to core
> >common file and use it.
> 
> I think it can be removed as it is left over from earlier
> changes when the IB IPv6 listener port was RDS_TCP_PORT.
> Now all the port definitions are in rds.h.

Transport specific information such as port numbers should not
be smeared into the common rds-module.  Please fix that.

If IB is also piggybacking on port 16385 (why?), please use your
own definitions for it in IB specific header files. 

Santosh, David, I have to NACK this if it is not changed.

--Sowmini









[RFC PATCH] ip: re-introduce fragments cache worker

2018-07-06 Thread Paolo Abeni
Currently, the ip frag cache is fragile to overload. With
flow control disabled:

./super_netperf.sh 10  -H 192.168.101.2 -t UDP_STREAM -l 60
9618.08
./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60
28.66

Once that the overload condition is reached, the system does not
recover until it's almost completely idle:

./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60 &
sleep 4; I=0;
for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; 
done
13.72

This is due to the removal of the fragment cache worker, which
was responsible to free some IP fragment cache memory when the
high threshold was reached, allowing the system to cope successfully
with the next fragmented packets.

This commit re-introduces the worker, on a per netns basis. Thanks
to rhashtable walkers we can block the bh only for an entry removal.

After this commit (and before IP frag worker removal):

./super_netperf.sh 10  -H 192.168.101.2 -t UDP_STREAM -l 60
9618.08

./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60
8599.77

./super_netperf.sh 200  -H 192.168.101.2 -t UDP_STREAM -l 60 &
sleep 4; I=0;
for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; 
done
9623.12

Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Paolo Abeni 
---
Note: tweaking ipfrag sysfs does not solve completely the issue:
- raising ipfrag_high_thresh increases the number of parallels
  connections required to degrade the tput, but reached the IP
  fragment cache capacity the good-put still goes almost to 0,
  with the worker we get much more nice behaviour.
- setting ipfrag_time to 2 increases the change to recover from
  overload (the 2# test above), but with several experiments 
  in such test, I got an average of 50% the expected tput with a very
  large variance, with the worker we always see the expected/
  line rate tput.
---
 include/net/inet_frag.h  |  8 ++---
 net/ipv4/inet_fragment.c | 72 ++--
 2 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index ed07e3786d98..1f12692d7f7d 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -11,6 +11,8 @@ struct netns_frags {
int timeout;
int max_dist;
struct inet_frags   *f;
+   struct work_struct  frags_work;
+   struct rhashtable_iter  iter;
 
struct rhashtable   rhashtable cacheline_aligned_in_smp;
 
@@ -101,11 +103,7 @@ struct inet_frags {
 int inet_frags_init(struct inet_frags *);
 void inet_frags_fini(struct inet_frags *);
 
-static inline int inet_frags_init_net(struct netns_frags *nf)
-{
-   atomic_long_set(>mem, 0);
-   return rhashtable_init(>rhashtable, >f->rhash_params);
-}
+int inet_frags_init_net(struct netns_frags *nf);
 void inet_frags_exit_net(struct netns_frags *nf);
 
 void inet_frag_kill(struct inet_frag_queue *q);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c9e35b81d093..0f5b29ce96de 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -88,10 +88,76 @@ static void inet_frags_free_cb(void *ptr, void *arg)
inet_frag_put(fq);
 }
 
+static void inet_frag_schedule_worker(struct netns_frags *nf)
+{
+   if (unlikely(!work_pending(>frags_work)))
+   schedule_work(>frags_work);
+}
+
+#define INETFRAGS_EVICT_MAX64
+static void inet_frag_worker(struct work_struct *work)
+{
+   struct netns_frags *nf;
+   bool reschedule;
+   int evicted = 0;
+
+   nf = container_of(work, struct netns_frags, frags_work);
+
+   rhashtable_walk_start(>iter);
+
+   while ((reschedule = (frag_mem_limit(nf) > nf->low_thresh))) {
+   struct inet_frag_queue *fq = rhashtable_walk_next(>iter);
+
+   if (IS_ERR(fq) && PTR_ERR(fq) == -EAGAIN)
+   continue;
+   if (!fq) {
+   /* end of table, restart the walk */
+   rhashtable_walk_stop(>iter);
+   rhashtable_walk_exit(>iter);
+   rhashtable_walk_enter(>rhashtable, >iter);
+   rhashtable_walk_start(>iter);
+   continue;
+   }
+   if (!refcount_inc_not_zero(>refcnt))
+   continue;
+
+   spin_lock_bh(>lock);
+   inet_frag_kill(fq);
+   spin_unlock_bh(>lock);
+   inet_frag_put(fq);
+
+   /* limit the amount of work we can do before a reschedule,
+* to avoid starving others queued works
+*/
+   if (++evicted > INETFRAGS_EVICT_MAX)
+   break;
+   }
+
+   rhashtable_walk_stop(>iter);
+
+   if (reschedule)
+   inet_frag_schedule_worker(nf);
+}
+
+int inet_frags_init_net(struct netns_frags *nf)
+{
+   

  1   2   >