Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 06:44:24PM -0700, Jakub Kicinski wrote: > On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote: > > On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote: > > > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote: > > > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote: > > > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > > > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > > > > > reshuffle things whatever way it wants to. Is the assumption that > > > > > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > > > > > format? Or will it be reconfigurable at runtime? > > > > > > > > > > > > let's table configuration of metadata aside for a second. > > > > > > > > > > > > Describing metedata layout in BTF allows NICs to disclose everything > > > > > > NIC has to users in a standard and generic way. > > > > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > > > > and hw powercycled to have new md layout (and corresponding BTF > > > > > > description) > > > > > > is a separate discussion. > > > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct > > > > > > xdp_md_info' > > > > > > to reach 'hash' value in metadata. > > > > > > Essentially it's a run-time way to access 'hash' instead of > > > > > > build-time. > > > > > > So bpf program would need two loads to read csum or hash field > > > > > > instead of one. > > > > > > With BTF the layout of metadata is known to the program at > > > > > > build-time. > > > > > > > > > > > > To reiterate the proposal: > > > > > > - driver+firmware keep layout of the metadata in BTF format (either > > > > > > in the driver > > > > > > or driver can read it from firmware) > > > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query > > > > > > the driver and > > > > > > generate normal C header file based on BTF in the given NIC > > > > > > - user does #include "md_desc.h" and bpf program can access > > > > > > md->csum or md->hash > > > > > > with direct single load out of metadata area in front of the > > > > > > packet > > > > > > - llvm compiles bpf program and records how program is doing this > > > > > > md->csum accesses > > > > > > in BTF format as well (the compiler will be keeping such records > > > > > > for __sk_buff and all other structs too, but that's separate > > > > > > discussion) > > > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) > > > > > > that the way the program > > > > > > accesses metadata (and other structs) matches BTF from the driver, > > > > > > so no surprises if driver+firmware got updated, but program is > > > > > > not recompiled > > > > > > - every NIC can have their own layout of metadata and its own > > > > > > meaning of the fields, > > > > > > but would be good to standardize at least a few common fields > > > > > > like hash > > > > > > > > > > Can I expose HW descriptors this way, though, or is the proposal to > > > > > copy this data into the packet buffer? > > > > > > > > That crossed my mind too. We can use BTF to describe HW descriptors too, > > > > but I don't think it would buy us anything. AF_XDP approach is better. > > > > > > > > > > Once this is working we can do more cool things with BTF. > > > > > > Like doing offset rewriting at program load time similar to what we > > > > > > plan > > > > > > to do for tracing. Tracing programs will be doing 'task->pid' access > > > > > > and the kernel will adjust offsetof(struct task_struct, pid) during > > > > > > program load > > > > > > depending on BTF for the kernel. > > > > > > The same trick we can do for networking metadata. > > > > > > The program will contain load instruction md->hash that will get > > > > > > automatically > > > > > > adjusted to proper offset depending on BTF of 'hash' field in the > > > > > > NIC. > > > > > > For now I'm proposing _not_ to go that far with offset rewriting > > > > > > and start > > > > > > with simple steps described above. > > > > > > > > > > Why? :( Could we please go with the rewrite/driver helpers instead of > > > > > impacting fast paths of the drivers yet again? This rewrite should be > > > > > easier than task->pid, because we have the synthetic user space struct > > > > > xdp_md definition. > > > > > > > > I don't understand 'impact fast path yet again' concern. > > > > If NIC has certain metadata today, just derscribe what it has in BTF > > > > and supply the actual per-packet md to xdp program as-is. > > > > No changes for fast path at all. > > > > Future rewritting will be done by the bpf/xdp core with zero > > > > changes to the driver. All driver has to do is to provide BTF. > > > > > > I'm confused. AFAIK most *existing* NICs have
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 06:20:47PM -0700, Jakub Kicinski wrote: > On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote: > > On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote: > > > > > > Could we just say that at the start we expose only existing SKB fields > > > (csum, hash, mark) and the meaning of the is the same as in the SKB? > > > > what would be the meaning of 'hash' ? Over which fields? > > Does it support inner and outer packets? How about udp encap (vxlan and > > friends) ? > > We don't seem to need to answer that for the rest of the stack, no? We > can expose the "hash type" field as well if that's *really* necessary. > > > Same question of csum... tcp only? > > Shouldn't we just stick to CHECKSUM_COMPLETE? I don't yet see how checksum_complete and 'hash just like stack' helps to accelerate xdp programs like ddos and load balancer. 'hash like stack' may help to accelerate networking stack itself, but that's very different discussion. There is no xdp in such case. No programs and no metadata. If some NICs can do 'hash like stack' and demonstrate performance improvements for regular tcp/ip processing with netperf in user space that would definitely be interesting and standardizing such hash from the drivers into upper layers would warrant its own patches, but seems orthogonal to this xdp hints discussion.
Re: [PATCH V2 net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2
From: Felix Manlunas Date: Fri, 6 Jul 2018 11:27:07 -0700 > From: Rick Farrington > > Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get > speed" command which is sent by the 1.7.2 driver (for CN23XX-225 cards > only) during modprobe. Due to a bug in older firmware (with respect to > unknown commands), this unsupported command causes a cascade of errors that > ends in a kernel panic. > > Fix it by making the sending of the "get speed" command conditional on the > firmware version. > > Signed-off-by: Rick Farrington > Signed-off-by: Felix Manlunas Applied, thanks.
Re: [PATCH net-next 0/6] sock cookie initializers
From: Willem de Bruijn Date: Fri, 6 Jul 2018 10:12:53 -0400 > From: Willem de Bruijn > > Recent UDP GSO and SO_TXTIME features added new fields to cookie > structs. > > When adding a field, all sites where a struct is initialized have to > be updated, which is a lot of boilerplate. Alternatively, a field can > be initialized selectively, but this is fragile. I introduced a bug > in udp gso where an uninitialized field was read. See also fix commit > ("9887cba19978 ip: limit use of gso_size to udp"). > > Introduce initializers for structs ipcm(6)_cookie and sockc_cookie. > > patch 1..3 do exactly this. > patch 4..5 make ipv4 and ipv6 handle cookies the same way and >remove some boilerplate in doing so. > patch 6removes the udp gso branch that needed the above fix Series applied, thanks Willem.
Re: [PATCH net-next 0/3] fix use-after-free bugs in skb list processing
From: Edward Cree Date: Fri, 6 Jul 2018 18:01:04 +0100 > A couple of bugs in skb list handling were spotted by Dan Carpenter, with > the help of Smatch; following up on them I found a couple more similar > cases. This series fixes them by changing the relevant loops to use the > dequeue-enqueue model (rather than in-place list modification), and then > adds a list.h helper macro to refactor code using the dequeue-enqueue > model. Edward, I think the helper added in patch #3 is not all that great. It can be even more confusing to understand what the code is doing in my opinion. I am pretty sure more hackers would need to go and read the macro definition in list.h before understanding what the read side effects of it are. Would you please resubmit this series including only patches #1 and #2? Thank you.
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote: > On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote: > > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote: > > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote: > > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > > > > reshuffle things whatever way it wants to. Is the assumption that > > > > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > > > > format? Or will it be reconfigurable at runtime? > > > > > > > > > > let's table configuration of metadata aside for a second. > > > > > > > > > > Describing metedata layout in BTF allows NICs to disclose everything > > > > > NIC has to users in a standard and generic way. > > > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > > > and hw powercycled to have new md layout (and corresponding BTF > > > > > description) > > > > > is a separate discussion. > > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct > > > > > xdp_md_info' > > > > > to reach 'hash' value in metadata. > > > > > Essentially it's a run-time way to access 'hash' instead of > > > > > build-time. > > > > > So bpf program would need two loads to read csum or hash field > > > > > instead of one. > > > > > With BTF the layout of metadata is known to the program at build-time. > > > > > > > > > > To reiterate the proposal: > > > > > - driver+firmware keep layout of the metadata in BTF format (either > > > > > in the driver > > > > > or driver can read it from firmware) > > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query > > > > > the driver and > > > > > generate normal C header file based on BTF in the given NIC > > > > > - user does #include "md_desc.h" and bpf program can access md->csum > > > > > or md->hash > > > > > with direct single load out of metadata area in front of the packet > > > > > - llvm compiles bpf program and records how program is doing this > > > > > md->csum accesses > > > > > in BTF format as well (the compiler will be keeping such records > > > > > for __sk_buff and all other structs too, but that's separate > > > > > discussion) > > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that > > > > > the way the program > > > > > accesses metadata (and other structs) matches BTF from the driver, > > > > > so no surprises if driver+firmware got updated, but program is not > > > > > recompiled > > > > > - every NIC can have their own layout of metadata and its own meaning > > > > > of the fields, > > > > > but would be good to standardize at least a few common fields like > > > > > hash > > > > > > > > Can I expose HW descriptors this way, though, or is the proposal to > > > > copy this data into the packet buffer? > > > > > > That crossed my mind too. We can use BTF to describe HW descriptors too, > > > but I don't think it would buy us anything. AF_XDP approach is better. > > > > > > > > Once this is working we can do more cool things with BTF. > > > > > Like doing offset rewriting at program load time similar to what we > > > > > plan > > > > > to do for tracing. Tracing programs will be doing 'task->pid' access > > > > > and the kernel will adjust offsetof(struct task_struct, pid) during > > > > > program load > > > > > depending on BTF for the kernel. > > > > > The same trick we can do for networking metadata. > > > > > The program will contain load instruction md->hash that will get > > > > > automatically > > > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > > > > > For now I'm proposing _not_ to go that far with offset rewriting and > > > > > start > > > > > with simple steps described above. > > > > > > > > Why? :( Could we please go with the rewrite/driver helpers instead of > > > > impacting fast paths of the drivers yet again? This rewrite should be > > > > easier than task->pid, because we have the synthetic user space struct > > > > xdp_md definition. > > > > > > I don't understand 'impact fast path yet again' concern. > > > If NIC has certain metadata today, just derscribe what it has in BTF > > > and supply the actual per-packet md to xdp program as-is. > > > No changes for fast path at all. > > > Future rewritting will be done by the bpf/xdp core with zero > > > changes to the driver. All driver has to do is to provide BTF. > > > > I'm confused. AFAIK most *existing* NICs have the metadata in the > > "descriptor", i.e. not in the packet buffer. So if the NIC just > > describes what it has, and there is no data shuffling/copying > > (performance) then we have to expose the descriptor, no? > > which piece of sw put that data into desciptor ? > I bet
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Alexei Starovoitov Date: Fri, 6 Jul 2018 17:53:18 -0700 > I'd like to push back on firmware folks that should be listening > to feedback from driver folks and kernel stack instead of saying > 'here is hw spec that firmware provides'. Firmware is software. > It can change and should be open to change by the community > with proper maintainership. +1
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Alexei Starovoitov Date: Fri, 6 Jul 2018 09:30:42 -0700 > Like doing offset rewriting at program load time similar to what we plan > to do for tracing. Tracing programs will be doing 'task->pid' access > and the kernel will adjust offsetof(struct task_struct, pid) during program > load > depending on BTF for the kernel. > The same trick we can do for networking metadata. > The program will contain load instruction md->hash that will get automatically > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > For now I'm proposing _not_ to go that far with offset rewriting and start > with simple steps described above. +1
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote: > On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote: > > > > Could we just say that at the start we expose only existing SKB fields > > (csum, hash, mark) and the meaning of the is the same as in the SKB? > > what would be the meaning of 'hash' ? Over which fields? > Does it support inner and outer packets? How about udp encap (vxlan and > friends) ? We don't seem to need to answer that for the rest of the stack, no? We can expose the "hash type" field as well if that's *really* necessary. > Same question of csum... tcp only? Shouldn't we just stick to CHECKSUM_COMPLETE? > how about crc for sctp? That's harder to answer. Can cls_bpf access such info? > What is 'mark' ? Same thing it would be on an skb. Most likely set with an offloaded TC rule?
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote: > > Could we just say that at the start we expose only existing SKB fields > (csum, hash, mark) and the meaning of the is the same as in the SKB? what would be the meaning of 'hash' ? Over which fields? Does it support inner and outer packets? How about udp encap (vxlan and friends) ? Same question of csum... tcp only? how about crc for sctp? What is 'mark' ?
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote: > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote: > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote: > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > > > reshuffle things whatever way it wants to. Is the assumption that > > > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > > > format? Or will it be reconfigurable at runtime? > > > > > > > > let's table configuration of metadata aside for a second. > > > > > > > > Describing metedata layout in BTF allows NICs to disclose everything > > > > NIC has to users in a standard and generic way. > > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > > and hw powercycled to have new md layout (and corresponding BTF > > > > description) > > > > is a separate discussion. > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct > > > > xdp_md_info' > > > > to reach 'hash' value in metadata. > > > > Essentially it's a run-time way to access 'hash' instead of build-time. > > > > So bpf program would need two loads to read csum or hash field instead > > > > of one. > > > > With BTF the layout of metadata is known to the program at build-time. > > > > > > > > To reiterate the proposal: > > > > - driver+firmware keep layout of the metadata in BTF format (either in > > > > the driver > > > > or driver can read it from firmware) > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the > > > > driver and > > > > generate normal C header file based on BTF in the given NIC > > > > - user does #include "md_desc.h" and bpf program can access md->csum or > > > > md->hash > > > > with direct single load out of metadata area in front of the packet > > > > - llvm compiles bpf program and records how program is doing this > > > > md->csum accesses > > > > in BTF format as well (the compiler will be keeping such records > > > > for __sk_buff and all other structs too, but that's separate > > > > discussion) > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that > > > > the way the program > > > > accesses metadata (and other structs) matches BTF from the driver, > > > > so no surprises if driver+firmware got updated, but program is not > > > > recompiled > > > > - every NIC can have their own layout of metadata and its own meaning > > > > of the fields, > > > > but would be good to standardize at least a few common fields like > > > > hash > > > > > > Can I expose HW descriptors this way, though, or is the proposal to > > > copy this data into the packet buffer? > > > > That crossed my mind too. We can use BTF to describe HW descriptors too, > > but I don't think it would buy us anything. AF_XDP approach is better. > > > > > > Once this is working we can do more cool things with BTF. > > > > Like doing offset rewriting at program load time similar to what we plan > > > > to do for tracing. Tracing programs will be doing 'task->pid' access > > > > and the kernel will adjust offsetof(struct task_struct, pid) during > > > > program load > > > > depending on BTF for the kernel. > > > > The same trick we can do for networking metadata. > > > > The program will contain load instruction md->hash that will get > > > > automatically > > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > > > > For now I'm proposing _not_ to go that far with offset rewriting and > > > > start > > > > with simple steps described above. > > > > > > Why? :( Could we please go with the rewrite/driver helpers instead of > > > impacting fast paths of the drivers yet again? This rewrite should be > > > easier than task->pid, because we have the synthetic user space struct > > > xdp_md definition. > > > > I don't understand 'impact fast path yet again' concern. > > If NIC has certain metadata today, just derscribe what it has in BTF > > and supply the actual per-packet md to xdp program as-is. > > No changes for fast path at all. > > Future rewritting will be done by the bpf/xdp core with zero > > changes to the driver. All driver has to do is to provide BTF. > > I'm confused. AFAIK most *existing* NICs have the metadata in the > "descriptor", i.e. not in the packet buffer. So if the NIC just > describes what it has, and there is no data shuffling/copying > (performance) then we have to expose the descriptor, no? which piece of sw put that data into desciptor ? I bet it's firmware. It could have stored it into pre-packet data, no? I'd like to avoid _all_ copies. Right now xdp program can only see a pointer to packet and pre-packet. If we need another pointer to a piece of the packet descriptor, that's also fine. Both pre-packet metadata and pieces of
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 11:49:55PM +, Waskiewicz Jr, Peter wrote: > On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote: > > On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote: > > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > > > I'm also not 100% on board with the argument that "future" FW > > > > > can > > > > > reshuffle things whatever way it wants to. Is the assumption > > > > > that > > > > > future ASICs/FW will be designed to always use the "blessed" > > > > > BTF > > > > > format? Or will it be reconfigurable at runtime? > > > > > > > > let's table configuration of metadata aside for a second. > > > > > > I agree that this should/could be NIC-specific and shouldn't weigh > > > on > > > the metadata interface between the drivers and XDP layer. > > > > > > > Describing metedata layout in BTF allows NICs to disclose > > > > everything > > > > NIC has to users in a standard and generic way. > > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > > and hw powercycled to have new md layout (and corresponding BTF > > > > description) > > > > is a separate discussion. > > > > Saeed's proposal introduces the concept of 'offset' inside > > > > 'struct > > > > xdp_md_info' > > > > to reach 'hash' value in metadata. > > > > Essentially it's a run-time way to access 'hash' instead of > > > > build- > > > > time. > > > > So bpf program would need two loads to read csum or hash field > > > > instead of one. > > > > With BTF the layout of metadata is known to the program at build- > > > > time. > > > > > > > > To reiterate the proposal: > > > > - driver+firmware keep layout of the metadata in BTF format > > > > (either > > > > in the driver > > > > or driver can read it from firmware) > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will > > > > query > > > > the driver and > > > > generate normal C header file based on BTF in the given NIC > > > > - user does #include "md_desc.h" and bpf program can access md- > > > > >csum > > > > or md->hash > > > > with direct single load out of metadata area in front of the > > > > packet > > > > > > This piece is where I'd like to discuss more. When we discussed > > > this > > > in Seoul, the initial proposal was a static struct that we'd try to > > > hammer out a common layout between the interested parties. That > > > obviously wasn't going to scale, and we wanted to pursue something > > > more > > > dynamic. But I thought the goal was the XDP/eBPF program wouldn't > > > want > > > to care what the underlying device is, and could just ask for > > > metadata > > > that it's interested in. With this approach, your eBPF program is > > > now > > > bound/tied to the NIC/driver, and if you switched to a differen > > > NIC/driver combo, then you'd have to rewrite part of your eBPF > > > program > > > to comprehend that. I thought we were trying to avoid that. > > > > It looks to me that NICs have a lot more differences instead of > > common pieces. > > If we focus the architecture on making common things as generic > > as possible we miss the bigger picture of covering distinct > > and unique features that hw provides. > > > > In the last email when I said "would be good to standardize at least > > a few common fields" > > I meant that it make sense only at the later phases. > > I don't think we should put things into uapi upfront. > > Otherwise we will end up with likely useless fields. > > Like vlan field. Surely a lot of nics can store it into metadata, but > > why? > > bpf program can easily read it from the packet. > > What's the benefit of adding it to md? > > This metadata concept we're discussing in the context of program > > acceleration. > > If metadata doesn't give perf benefits it should not be done by > > firmware, it should not be supported by the driver and certainly > > should not be part of uapi. > > Hence I'd like to see pure BTF description without any > > common/standard > > fields across the drivers and figure out what (if anything) is useful > > to accelerate xdp programs (and af_xdp in the future). > > I guess I didn't write my response very well before, apologies. I > actually really like the BTF description and the dynamic-ness of it. > It solves lots of problems and keeps things out of the UAPI. > > What I was getting at was using the BTF description and dumping a > header file to be used in the program to be loaded. If we have an > application developer include that header, it will be the Intel BTF > description, or Mellanox BTF description, etc. I'm fine with that, but > it makes your program not as portable, since you'd need to get a > different header and recompile it if you change the NIC. Unless I'm > missing something. Correct. It makes programs not portable. That is phase one to figure out what's useful. Rewriting of offsets and figuring out
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 6 Jul 2018 23:49:55 +, Waskiewicz Jr, Peter wrote: > On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote: > > On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote: > > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > > > I'm also not 100% on board with the argument that "future" FW > > > > > can > > > > > reshuffle things whatever way it wants to. Is the assumption > > > > > that > > > > > future ASICs/FW will be designed to always use the "blessed" > > > > > BTF > > > > > format? Or will it be reconfigurable at runtime? > > > > > > > > let's table configuration of metadata aside for a second. > > > > > > I agree that this should/could be NIC-specific and shouldn't weigh > > > on > > > the metadata interface between the drivers and XDP layer. > > > > > > > Describing metedata layout in BTF allows NICs to disclose > > > > everything > > > > NIC has to users in a standard and generic way. > > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > > and hw powercycled to have new md layout (and corresponding BTF > > > > description) > > > > is a separate discussion. > > > > Saeed's proposal introduces the concept of 'offset' inside > > > > 'struct > > > > xdp_md_info' > > > > to reach 'hash' value in metadata. > > > > Essentially it's a run-time way to access 'hash' instead of > > > > build- > > > > time. > > > > So bpf program would need two loads to read csum or hash field > > > > instead of one. > > > > With BTF the layout of metadata is known to the program at build- > > > > time. > > > > > > > > To reiterate the proposal: > > > > - driver+firmware keep layout of the metadata in BTF format > > > > (either > > > > in the driver > > > > or driver can read it from firmware) > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will > > > > query > > > > the driver and > > > > generate normal C header file based on BTF in the given NIC > > > > - user does #include "md_desc.h" and bpf program can access md- > > > > >csum > > > > or md->hash > > > > with direct single load out of metadata area in front of the > > > > packet > > > > > > This piece is where I'd like to discuss more. When we discussed > > > this > > > in Seoul, the initial proposal was a static struct that we'd try to > > > hammer out a common layout between the interested parties. That > > > obviously wasn't going to scale, and we wanted to pursue something > > > more > > > dynamic. But I thought the goal was the XDP/eBPF program wouldn't > > > want > > > to care what the underlying device is, and could just ask for > > > metadata > > > that it's interested in. With this approach, your eBPF program is > > > now > > > bound/tied to the NIC/driver, and if you switched to a differen > > > NIC/driver combo, then you'd have to rewrite part of your eBPF > > > program > > > to comprehend that. I thought we were trying to avoid that. > > > > It looks to me that NICs have a lot more differences instead of > > common pieces. > > If we focus the architecture on making common things as generic > > as possible we miss the bigger picture of covering distinct > > and unique features that hw provides. > > > > In the last email when I said "would be good to standardize at least > > a few common fields" > > I meant that it make sense only at the later phases. > > I don't think we should put things into uapi upfront. > > Otherwise we will end up with likely useless fields. > > Like vlan field. Surely a lot of nics can store it into metadata, but > > why? FWIW vlan is extracted from the packet because of cBPF uABI. tcpdump breaks badly if VLAN header is in the packet buffer. It may not be entirely without merit to expose the field for writing, perhaps that can simplify some EVPN use cases? But that's most likely off-topic. > > bpf program can easily read it from the packet. > > What's the benefit of adding it to md? > > This metadata concept we're discussing in the context of program > > acceleration. > > If metadata doesn't give perf benefits it should not be done by > > firmware, it should not be supported by the driver and certainly > > should not be part of uapi. > > Hence I'd like to see pure BTF description without any > > common/standard > > fields across the drivers and figure out what (if anything) is useful > > to accelerate xdp programs (and af_xdp in the future). > > I guess I didn't write my response very well before, apologies. I > actually really like the BTF description and the dynamic-ness of it. > It solves lots of problems and keeps things out of the UAPI. > > What I was getting at was using the BTF description and dumping a > header file to be used in the program to be loaded. If we have an > application developer include that header, it will be the Intel BTF > description, or Mellanox BTF description, etc. I'm fine with that,
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote: > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote: > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > > reshuffle things whatever way it wants to. Is the assumption that > > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > > format? Or will it be reconfigurable at runtime? > > > > > > let's table configuration of metadata aside for a second. > > > > > > Describing metedata layout in BTF allows NICs to disclose everything > > > NIC has to users in a standard and generic way. > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > and hw powercycled to have new md layout (and corresponding BTF > > > description) > > > is a separate discussion. > > > Saeed's proposal introduces the concept of 'offset' inside 'struct > > > xdp_md_info' > > > to reach 'hash' value in metadata. > > > Essentially it's a run-time way to access 'hash' instead of build-time. > > > So bpf program would need two loads to read csum or hash field instead of > > > one. > > > With BTF the layout of metadata is known to the program at build-time. > > > > > > To reiterate the proposal: > > > - driver+firmware keep layout of the metadata in BTF format (either in > > > the driver > > > or driver can read it from firmware) > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the > > > driver and > > > generate normal C header file based on BTF in the given NIC > > > - user does #include "md_desc.h" and bpf program can access md->csum or > > > md->hash > > > with direct single load out of metadata area in front of the packet > > > - llvm compiles bpf program and records how program is doing this > > > md->csum accesses > > > in BTF format as well (the compiler will be keeping such records > > > for __sk_buff and all other structs too, but that's separate discussion) > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the > > > way the program > > > accesses metadata (and other structs) matches BTF from the driver, > > > so no surprises if driver+firmware got updated, but program is not > > > recompiled > > > - every NIC can have their own layout of metadata and its own meaning of > > > the fields, > > > but would be good to standardize at least a few common fields like hash > > > > > > > Can I expose HW descriptors this way, though, or is the proposal to > > copy this data into the packet buffer? > > That crossed my mind too. We can use BTF to describe HW descriptors too, > but I don't think it would buy us anything. AF_XDP approach is better. > > > > Once this is working we can do more cool things with BTF. > > > Like doing offset rewriting at program load time similar to what we plan > > > to do for tracing. Tracing programs will be doing 'task->pid' access > > > and the kernel will adjust offsetof(struct task_struct, pid) during > > > program load > > > depending on BTF for the kernel. > > > The same trick we can do for networking metadata. > > > The program will contain load instruction md->hash that will get > > > automatically > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > > > For now I'm proposing _not_ to go that far with offset rewriting and start > > > with simple steps described above. > > > > Why? :( Could we please go with the rewrite/driver helpers instead of > > impacting fast paths of the drivers yet again? This rewrite should be > > easier than task->pid, because we have the synthetic user space struct > > xdp_md definition. > > I don't understand 'impact fast path yet again' concern. > If NIC has certain metadata today, just derscribe what it has in BTF > and supply the actual per-packet md to xdp program as-is. > No changes for fast path at all. > Future rewritting will be done by the bpf/xdp core with zero > changes to the driver. All driver has to do is to provide BTF. I'm confused. AFAIK most *existing* NICs have the metadata in the "descriptor", i.e. not in the packet buffer. So if the NIC just describes what it has, and there is no data shuffling/copying (performance) then we have to expose the descriptor, no? In case of NFP half of the metadata is in the packet buffer, half in the descriptor. Maybe I'm conflating your proposal with Saeed's. In your proposal would we still use the overload of metadata prepend (as in struct xdp_md::data_meta, bpf_xdp_adjust_meta() etc.) or are we adding a pointer to a new driver-defined struct inside struct xdp_md? Thanks for bearing with me!
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote: > On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote: > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > I'm also not 100% on board with the argument that "future" FW > > > > can > > > > reshuffle things whatever way it wants to. Is the assumption > > > > that > > > > future ASICs/FW will be designed to always use the "blessed" > > > > BTF > > > > format? Or will it be reconfigurable at runtime? > > > > > > let's table configuration of metadata aside for a second. > > > > I agree that this should/could be NIC-specific and shouldn't weigh > > on > > the metadata interface between the drivers and XDP layer. > > > > > Describing metedata layout in BTF allows NICs to disclose > > > everything > > > NIC has to users in a standard and generic way. > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > and hw powercycled to have new md layout (and corresponding BTF > > > description) > > > is a separate discussion. > > > Saeed's proposal introduces the concept of 'offset' inside > > > 'struct > > > xdp_md_info' > > > to reach 'hash' value in metadata. > > > Essentially it's a run-time way to access 'hash' instead of > > > build- > > > time. > > > So bpf program would need two loads to read csum or hash field > > > instead of one. > > > With BTF the layout of metadata is known to the program at build- > > > time. > > > > > > To reiterate the proposal: > > > - driver+firmware keep layout of the metadata in BTF format > > > (either > > > in the driver > > > or driver can read it from firmware) > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will > > > query > > > the driver and > > > generate normal C header file based on BTF in the given NIC > > > - user does #include "md_desc.h" and bpf program can access md- > > > >csum > > > or md->hash > > > with direct single load out of metadata area in front of the > > > packet > > > > This piece is where I'd like to discuss more. When we discussed > > this > > in Seoul, the initial proposal was a static struct that we'd try to > > hammer out a common layout between the interested parties. That > > obviously wasn't going to scale, and we wanted to pursue something > > more > > dynamic. But I thought the goal was the XDP/eBPF program wouldn't > > want > > to care what the underlying device is, and could just ask for > > metadata > > that it's interested in. With this approach, your eBPF program is > > now > > bound/tied to the NIC/driver, and if you switched to a differen > > NIC/driver combo, then you'd have to rewrite part of your eBPF > > program > > to comprehend that. I thought we were trying to avoid that. > > It looks to me that NICs have a lot more differences instead of > common pieces. > If we focus the architecture on making common things as generic > as possible we miss the bigger picture of covering distinct > and unique features that hw provides. > > In the last email when I said "would be good to standardize at least > a few common fields" > I meant that it make sense only at the later phases. > I don't think we should put things into uapi upfront. > Otherwise we will end up with likely useless fields. > Like vlan field. Surely a lot of nics can store it into metadata, but > why? > bpf program can easily read it from the packet. > What's the benefit of adding it to md? > This metadata concept we're discussing in the context of program > acceleration. > If metadata doesn't give perf benefits it should not be done by > firmware, it should not be supported by the driver and certainly > should not be part of uapi. > Hence I'd like to see pure BTF description without any > common/standard > fields across the drivers and figure out what (if anything) is useful > to accelerate xdp programs (and af_xdp in the future). I guess I didn't write my response very well before, apologies. I actually really like the BTF description and the dynamic-ness of it. It solves lots of problems and keeps things out of the UAPI. What I was getting at was using the BTF description and dumping a header file to be used in the program to be loaded. If we have an application developer include that header, it will be the Intel BTF description, or Mellanox BTF description, etc. I'm fine with that, but it makes your program not as portable, since you'd need to get a different header and recompile it if you change the NIC. Unless I'm missing something. > > > Our proposed approach (still working on something ready to RFC) is > > to > > provide a method for the eBPF program to send a struct of requested > > hints down to the driver on load. If the driver can provide the > > hints, > > then that'd be how they'd be laid out in the metadata. If it can't > > provide them, we'd probably reject the program loading, or discuss > > providing a software fallback
Re: [PATCH bpf-next v2 0/6] nfp: bpf: add multiplication and divide support on NFP JIT
On 07/07/2018 12:13 AM, Jakub Kicinski wrote: > Jiong says: > > NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per > step, therefore we need 2 steps for u16 and 4 steps for u32. > > We also need one start instruction to initialize the sequence and one or > two instructions to fetch the result depending on either you need the high > halve of u32 multiplication. > > For ALU64, if either operand is beyond u32's value range, we reject it. One > thing to note, if the source operand is BPF_K, then we need to check "imm" > field directly, and we'd reject it if it is negative. Because for ALU64, > "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul > doesn't support. For ALU32, it is fine for "imm" be negative though, > because the result is 32-bits and here is no difference on the low halve > of result for signed/unsigned mul, so we will get correct result. > > NFP doesn't have integer divide instruction, this patch set uses reciprocal > algorithm (the basic one, reciprocal_div) to emulate it. > > For each u32 divide, we would need 11 instructions to finish the operation. > >7 (for multiplication) + 4 (various ALUs) = 11 > > Given NFP only supports multiplication no bigger than u32, we'd require > divisor and dividend no bigger than that as well. > > Also eBPF doesn't support signed divide and has enforced this on C language > level by failing compilation. However LLVM assembler hasn't enforced this, > so it is possible for negative constant to leak in as a BPF_K operand > through assembly code, we reject such cases as well. > > Meanwhile reciprocal_div.h only implemented the basic version of: > >"Division by Invariant Integers Using Multiplication" > - Torbjörn Granlund and Peter L. Montgomery > > This patch set further implements the optimized version (Figure 4.2 in the > paper) inside existing reciprocal_div.h. When the divider is even and the > calculated reciprocal magic number doesn't fit u32, we could reduce the > required ALU instructions from 4 to 2 or 1 for some cases. > > The advanced version requires more complex calculation to get the > reciprocal multiplier and other control variables, but then could reduce > the required emulation operations. It makes sense to use it for JIT divide > code generation (for example eBPF JIT backends) for which we are willing to > trade performance of JITed code with that of host. > > v2: > - add warning in l == 32 code path. (Song Liu/Jakub) > - jit separate insn sequence for l == 32. (Jakub/Edwin) > - should use unrestricted operand for mul. > - once place should use "1U << exp" instead of "1 << exp". > > Jiong Wang (6): > lib: reciprocal_div: implement the improved algorithm on the paper > mentioned > nfp: bpf: rename umin/umax to umin_src/umax_src > nfp: bpf: copy range info for all operands of all ALU operations > nfp: bpf: support u16 and u32 multiplications > nfp: bpf: support u32 divide using reciprocal_div.h > nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h Applied to bpf-next, thanks everyone!
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote: > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > reshuffle things whatever way it wants to. Is the assumption that > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > format? Or will it be reconfigurable at runtime? > > > > let's table configuration of metadata aside for a second. > > > > Describing metedata layout in BTF allows NICs to disclose everything > > NIC has to users in a standard and generic way. > > Whether firmware is reconfigurable on the fly or has to reflashed > > and hw powercycled to have new md layout (and corresponding BTF description) > > is a separate discussion. > > Saeed's proposal introduces the concept of 'offset' inside 'struct > > xdp_md_info' > > to reach 'hash' value in metadata. > > Essentially it's a run-time way to access 'hash' instead of build-time. > > So bpf program would need two loads to read csum or hash field instead of > > one. > > With BTF the layout of metadata is known to the program at build-time. > > > > To reiterate the proposal: > > - driver+firmware keep layout of the metadata in BTF format (either in the > > driver > > or driver can read it from firmware) > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the > > driver and > > generate normal C header file based on BTF in the given NIC > > - user does #include "md_desc.h" and bpf program can access md->csum or > > md->hash > > with direct single load out of metadata area in front of the packet > > - llvm compiles bpf program and records how program is doing this md->csum > > accesses > > in BTF format as well (the compiler will be keeping such records > > for __sk_buff and all other structs too, but that's separate discussion) > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the > > way the program > > accesses metadata (and other structs) matches BTF from the driver, > > so no surprises if driver+firmware got updated, but program is not > > recompiled > > - every NIC can have their own layout of metadata and its own meaning of > > the fields, > > but would be good to standardize at least a few common fields like hash > > Can I expose HW descriptors this way, though, or is the proposal to > copy this data into the packet buffer? That crossed my mind too. We can use BTF to describe HW descriptors too, but I don't think it would buy us anything. AF_XDP approach is better. > > Once this is working we can do more cool things with BTF. > > Like doing offset rewriting at program load time similar to what we plan > > to do for tracing. Tracing programs will be doing 'task->pid' access > > and the kernel will adjust offsetof(struct task_struct, pid) during program > > load > > depending on BTF for the kernel. > > The same trick we can do for networking metadata. > > The program will contain load instruction md->hash that will get > > automatically > > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > > For now I'm proposing _not_ to go that far with offset rewriting and start > > with simple steps described above. > > Why? :( Could we please go with the rewrite/driver helpers instead of > impacting fast paths of the drivers yet again? This rewrite should be > easier than task->pid, because we have the synthetic user space struct > xdp_md definition. I don't understand 'impact fast path yet again' concern. If NIC has certain metadata today, just derscribe what it has in BTF and supply the actual per-packet md to xdp program as-is. No changes for fast path at all. Future rewritting will be done by the bpf/xdp core with zero changes to the driver. All driver has to do is to provide BTF.
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, Jul 06, 2018 at 08:44:24PM +, Waskiewicz Jr, Peter wrote: > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > reshuffle things whatever way it wants to. Is the assumption that > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > format? Or will it be reconfigurable at runtime? > > > > let's table configuration of metadata aside for a second. > > I agree that this should/could be NIC-specific and shouldn't weigh on > the metadata interface between the drivers and XDP layer. > > > Describing metedata layout in BTF allows NICs to disclose everything > > NIC has to users in a standard and generic way. > > Whether firmware is reconfigurable on the fly or has to reflashed > > and hw powercycled to have new md layout (and corresponding BTF > > description) > > is a separate discussion. > > Saeed's proposal introduces the concept of 'offset' inside 'struct > > xdp_md_info' > > to reach 'hash' value in metadata. > > Essentially it's a run-time way to access 'hash' instead of build- > > time. > > So bpf program would need two loads to read csum or hash field > > instead of one. > > With BTF the layout of metadata is known to the program at build- > > time. > > > > To reiterate the proposal: > > - driver+firmware keep layout of the metadata in BTF format (either > > in the driver > > or driver can read it from firmware) > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query > > the driver and > > generate normal C header file based on BTF in the given NIC > > - user does #include "md_desc.h" and bpf program can access md->csum > > or md->hash > > with direct single load out of metadata area in front of the packet > > This piece is where I'd like to discuss more. When we discussed this > in Seoul, the initial proposal was a static struct that we'd try to > hammer out a common layout between the interested parties. That > obviously wasn't going to scale, and we wanted to pursue something more > dynamic. But I thought the goal was the XDP/eBPF program wouldn't want > to care what the underlying device is, and could just ask for metadata > that it's interested in. With this approach, your eBPF program is now > bound/tied to the NIC/driver, and if you switched to a differen > NIC/driver combo, then you'd have to rewrite part of your eBPF program > to comprehend that. I thought we were trying to avoid that. It looks to me that NICs have a lot more differences instead of common pieces. If we focus the architecture on making common things as generic as possible we miss the bigger picture of covering distinct and unique features that hw provides. In the last email when I said "would be good to standardize at least a few common fields" I meant that it make sense only at the later phases. I don't think we should put things into uapi upfront. Otherwise we will end up with likely useless fields. Like vlan field. Surely a lot of nics can store it into metadata, but why? bpf program can easily read it from the packet. What's the benefit of adding it to md? This metadata concept we're discussing in the context of program acceleration. If metadata doesn't give perf benefits it should not be done by firmware, it should not be supported by the driver and certainly should not be part of uapi. Hence I'd like to see pure BTF description without any common/standard fields across the drivers and figure out what (if anything) is useful to accelerate xdp programs (and af_xdp in the future). > Our proposed approach (still working on something ready to RFC) is to > provide a method for the eBPF program to send a struct of requested > hints down to the driver on load. If the driver can provide the hints, > then that'd be how they'd be laid out in the metadata. If it can't > provide them, we'd probably reject the program loading, or discuss > providing a software fallback (I know this is an area of contention). I don't think that approach can work. My .02 is the only useful acceleration feature out if intel nics is an ability to parse the packet and report it to the xdp program as a node id in the parse graph. As far as I know that is pretty unique to intel. Configuration of that is a different matter. Other things like hash are interesting, but we will quickly get into rss spec definition if we go standardization route now instead of later phases. Therefore I'd rather let firmware+driver define its own BTF description that says here is 'hash' of the packet hw can provide and it's fine that this hash may be over different tuples depending on the nic and even version of the firmware in that nic. We need to see performance numbers and benefits for real world xdp programs before drilling into specific semantics of the hash. > I suppose we could get there with the rewriting mechanism described > below, but
Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode
On 7/6/18 9:58 AM, Sabrina Dubroca wrote: > > Right. I'll add that as a separate patch in this series, unless you > really prefer the change squashed into this patch. no preference. > > >> Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is >> messed up as well. It propagates a change to 'default' to all existing >> devices. > > I guess it was intentional, given: > > if (>ipv6.devconf_all->stable_secret == ctl->data) > return -EIO; > > It only propagates the mode, and not the secret itself, to all > devices. After thinking about it for a while, I guess it considers the > new default not only as default for newly created devices, but also > for newly added addresses/prefixes. > Or am I making stuff up? > Maybe Hannes can explain (622c81d57b392). It should have been all instead of default. As I understand it default is what devices start with on create and individual devices can be changed. 'All' overrides device-specific settings. So the above is inconsistent. One of many with sysctl that makes it frustrating for users.
Re: Crash due to destroying TCP request sockets using SOCK_DESTROY
Looks like for a TCP_NEW_SYN_RECV socket, sock_diag_destroy essentially ends up doing: struct request_sock *req = inet_reqsk(sk); local_bh_disable(); inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req); local_bh_enable(); ... sock_gen_put(sk); It looks like inet_csk_reqsk_queue_drop_and_put calls reqsk_put(req), which frees the socket, and at that point sock_gen_put is a UAF. Do we just need: - inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, - req); +inet_csk_reqsk_queue_drop(req->rsk_listener, req); since sock_gen_put will also end up calling reqsk_put() for a TCP_SYN_RECV socket? Alastair - you're able to reproduce this UAF using net_test on qemu, right? If so, could you try that two-line patch above? Hi Lorenzo Your patch makes sense to me, please submit it formally with : Fixes: d7226c7a4dd1 ("net: diag: Fix refcnt leak in error path destroying socket") Cc: David Ahern Thanks ! Thanks Lorenzo and Eric. I will try it out locally. -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v12 01/10] dt-bindings: Add Cavium Octeon Common Ethernet Interface.
On Fri, Jul 06, 2018 at 05:10:39PM -0500, Steven J. Hill wrote: > On 06/28/2018 03:35 AM, Andrew Lunn wrote: > > > >> +- cavium,rx-clk-delay-bypass: Set to <1> to bypass the rx clock delay > >> setting. > >> + Needed by the Micrel PHY. > > > > Could you explain this some more. Is it anything to do with RGMII delays? > > > Andrew, > > One of my colleagues tracked this down for me. This device tree option is in > place > because there are several different ways to do the clock and data with > respect to > RGMII. This controls the delay introduced for the RX clock with respect to > the data. > Without this, RX will not work with Micrel PHYs. Thanks. Hi Steven This is his RGMII delays, as i guess. Don't add this property, do it the Linux way. Look at phy-mode values phy.h:PHY_INTERFACE_MODE_RGMII_ID, phy.h:PHY_INTERFACE_MODE_RGMII_RXID, phy.h:PHY_INTERFACE_MODE_RGMII_TXID, There are plenty of examples in drivers/net/ethernet Andrew
[PATCH bpf-next v2 6/6] nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h
From: Jiong Wang As we are doing JIT, we would want to use the advanced version of the reciprocal divide (reciprocal_value_adv) to trade performance with host. We could reduce the required ALU instructions from 4 to 2 or 1. Signed-off-by: Jiong Wang Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/bpf/jit.c | 64 1 file changed, 53 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c index 7c9ee3d85907..1d9e36835404 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c @@ -1497,8 +1497,8 @@ wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm) { swreg dst_both = reg_both(dst), dst_a = reg_a(dst), dst_b = reg_a(dst); - struct reciprocal_value rvalue; - swreg tmp_b = imm_b(nfp_prog); + struct reciprocal_value_adv rvalue; + u8 pre_shift, exp; swreg magic; if (imm > U32_MAX) { @@ -1506,16 +1506,58 @@ static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm) return 0; } - rvalue = reciprocal_value(imm); + /* NOTE: because we are using "reciprocal_value_adv" which doesn't +* support "divisor > (1u << 31)", we need to JIT separate NFP sequence +* to handle such case which actually equals to the result of unsigned +* comparison "dst >= imm" which could be calculated using the following +* NFP sequence: +* +* alu[--, dst, -, imm] +* immed[imm, 0] +* alu[dst, imm, +carry, 0] +* +*/ + if (imm > 1U << 31) { + swreg tmp_b = ur_load_imm_any(nfp_prog, imm, imm_b(nfp_prog)); + + emit_alu(nfp_prog, reg_none(), dst_a, ALU_OP_SUB, tmp_b); + wrp_immed(nfp_prog, imm_a(nfp_prog), 0); + emit_alu(nfp_prog, dst_both, imm_a(nfp_prog), ALU_OP_ADD_C, +reg_imm(0)); + return 0; + } + + rvalue = reciprocal_value_adv(imm, 32); + exp = rvalue.exp; + if (rvalue.is_wide_m && !(imm & 1)) { + pre_shift = fls(imm & -imm) - 1; + rvalue = reciprocal_value_adv(imm >> pre_shift, 32 - pre_shift); + } else { + pre_shift = 0; + } magic = ur_load_imm_any(nfp_prog, rvalue.m, imm_b(nfp_prog)); - wrp_mul_u32(nfp_prog, imm_both(nfp_prog), reg_none(), dst_a, magic, - true); - emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, tmp_b); - emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, -SHF_SC_R_SHF, rvalue.sh1); - emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, tmp_b); - emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, -SHF_SC_R_SHF, rvalue.sh2); + if (imm == 1U << exp) { + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, +SHF_SC_R_SHF, exp); + } else if (rvalue.is_wide_m) { + wrp_mul_u32(nfp_prog, imm_both(nfp_prog), reg_none(), dst_a, + magic, true); + emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, +imm_b(nfp_prog)); + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, +SHF_SC_R_SHF, 1); + emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, +imm_b(nfp_prog)); + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, +SHF_SC_R_SHF, rvalue.sh - 1); + } else { + if (pre_shift) + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, +dst_b, SHF_SC_R_SHF, pre_shift); + wrp_mul_u32(nfp_prog, dst_both, reg_none(), dst_a, magic, true); + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, +dst_b, SHF_SC_R_SHF, rvalue.sh); + } return 0; } -- 2.17.1
[PATCH bpf-next v2 3/6] nfp: bpf: copy range info for all operands of all ALU operations
From: Jiong Wang NFP verifier hook is coping range information of the shift amount for indirect shift operation so optimized shift sequences could be generated. We want to use range info to do more things. For example, to decide whether multiplication and divide are supported on the given range. This patch simply let NFP verifier hook to copy range info for all operands of all ALU operands. Signed-off-by: Jiong Wang Reviewed-by: Jakub Kicinski Acked-by: Song Liu --- drivers/net/ethernet/netronome/nfp/bpf/main.h | 33 +++ .../net/ethernet/netronome/nfp/bpf/offload.c | 4 ++- .../net/ethernet/netronome/nfp/bpf/verifier.c | 6 +++- 3 files changed, 20 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h index 5975a19c28cb..c985d0ac61a3 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h @@ -265,6 +265,8 @@ struct nfp_bpf_reg_state { * @arg2: arg2 for call instructions * @umin_src: copy of core verifier umin_value for src opearnd. * @umax_src: copy of core verifier umax_value for src operand. + * @umin_dst: copy of core verifier umin_value for dst opearnd. + * @umax_dst: copy of core verifier umax_value for dst operand. * @off: index of first generated machine instruction (in nfp_prog.prog) * @n: eBPF instruction number * @flags: eBPF instruction extra optimization flags @@ -300,12 +302,15 @@ struct nfp_insn_meta { struct bpf_reg_state arg1; struct nfp_bpf_reg_state arg2; }; - /* We are interested in range info for some operands, -* for example, the shift amount which is kept in src operand. + /* We are interested in range info for operands of ALU +* operations. For example, shift amount, multiplicand and +* multiplier etc. */ struct { u64 umin_src; u64 umax_src; + u64 umin_dst; + u64 umax_dst; }; }; unsigned int off; @@ -339,6 +344,11 @@ static inline u8 mbpf_mode(const struct nfp_insn_meta *meta) return BPF_MODE(meta->insn.code); } +static inline bool is_mbpf_alu(const struct nfp_insn_meta *meta) +{ + return mbpf_class(meta) == BPF_ALU64 || mbpf_class(meta) == BPF_ALU; +} + static inline bool is_mbpf_load(const struct nfp_insn_meta *meta) { return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_LDX | BPF_MEM); @@ -384,25 +394,6 @@ static inline bool is_mbpf_xadd(const struct nfp_insn_meta *meta) return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD); } -static inline bool is_mbpf_indir_shift(const struct nfp_insn_meta *meta) -{ - u8 code = meta->insn.code; - bool is_alu, is_shift; - u8 opclass, opcode; - - opclass = BPF_CLASS(code); - is_alu = opclass == BPF_ALU64 || opclass == BPF_ALU; - if (!is_alu) - return false; - - opcode = BPF_OP(code); - is_shift = opcode == BPF_LSH || opcode == BPF_RSH || opcode == BPF_ARSH; - if (!is_shift) - return false; - - return BPF_SRC(code) == BPF_X; -} - /** * struct nfp_prog - nfp BPF program * @bpf: backpointer to the bpf app priv structure diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 856a0003bb75..78f44c4d95b4 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -190,8 +190,10 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog, meta->insn = prog[i]; meta->n = i; - if (is_mbpf_indir_shift(meta)) + if (is_mbpf_alu(meta)) { meta->umin_src = U64_MAX; + meta->umin_dst = U64_MAX; + } list_add_tail(>l, _prog->insns); } diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c index e862b739441f..7bd9666bd8ff 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c @@ -551,12 +551,16 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx) if (is_mbpf_xadd(meta)) return nfp_bpf_check_xadd(nfp_prog, meta, env); - if (is_mbpf_indir_shift(meta)) { + if (is_mbpf_alu(meta)) { const struct bpf_reg_state *sreg = cur_regs(env) + meta->insn.src_reg; + const struct bpf_reg_state *dreg = + cur_regs(env) + meta->insn.dst_reg; meta->umin_src = min(meta->umin_src, sreg->umin_value);
[PATCH bpf-next v2 5/6] nfp: bpf: support u32 divide using reciprocal_div.h
From: Jiong Wang NFP doesn't have integer divide instruction, this patch use reciprocal algorithm (the basic one, reciprocal_div) to emulate it. For each u32 divide, we would need 11 instructions to finish the operation. 7 (for multiplication) + 4 (various ALUs) = 11 Given NFP only supports multiplication no bigger than u32, we'd require divisor and dividend no bigger than that as well. Also eBPF doesn't support signed divide and has enforced this on C language level by failing compilation. However LLVM assembler hasn't enforced this, so it is possible for negative constant to leak in as a BPF_K operand through assembly code, we reject such cases as well. Signed-off-by: Jiong Wang Reviewed-by: Jakub Kicinski Acked-by: Song Liu --- drivers/net/ethernet/netronome/nfp/bpf/jit.c | 58 ++- drivers/net/ethernet/netronome/nfp/bpf/main.h | 5 ++ .../net/ethernet/netronome/nfp/bpf/verifier.c | 31 ++ 3 files changed, 93 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c index f1b27c3a4347..7c9ee3d85907 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c @@ -34,10 +34,11 @@ #define pr_fmt(fmt)"NFP net bpf: " fmt #include -#include #include #include +#include #include +#include #include #include "main.h" @@ -1493,6 +1494,32 @@ wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, return 0; } +static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm) +{ + swreg dst_both = reg_both(dst), dst_a = reg_a(dst), dst_b = reg_a(dst); + struct reciprocal_value rvalue; + swreg tmp_b = imm_b(nfp_prog); + swreg magic; + + if (imm > U32_MAX) { + wrp_immed(nfp_prog, dst_both, 0); + return 0; + } + + rvalue = reciprocal_value(imm); + magic = ur_load_imm_any(nfp_prog, rvalue.m, imm_b(nfp_prog)); + wrp_mul_u32(nfp_prog, imm_both(nfp_prog), reg_none(), dst_a, magic, + true); + emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, tmp_b); + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, +SHF_SC_R_SHF, rvalue.sh1); + emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, tmp_b); + emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b, +SHF_SC_R_SHF, rvalue.sh2); + + return 0; +} + static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) { swreg tmp = imm_a(nfp_prog), tmp_len = imm_b(nfp_prog); @@ -1807,6 +1834,21 @@ static int mul_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) return wrp_mul(nfp_prog, meta, true, false); } +static int div_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) +{ + const struct bpf_insn *insn = >insn; + + return wrp_div_imm(nfp_prog, insn->dst_reg * 2, insn->imm); +} + +static int div_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) +{ + /* NOTE: verifier hook has rejected cases for which verifier doesn't +* know whether the source operand is constant or not. +*/ + return wrp_div_imm(nfp_prog, meta->insn.dst_reg * 2, meta->umin_src); +} + static int neg_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) { const struct bpf_insn *insn = >insn; @@ -2230,6 +2272,16 @@ static int mul_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) return wrp_mul(nfp_prog, meta, false, false); } +static int div_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) +{ + return div_reg64(nfp_prog, meta); +} + +static int div_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) +{ + return div_imm64(nfp_prog, meta); +} + static int neg_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) { u8 dst = meta->insn.dst_reg * 2; @@ -2983,6 +3035,8 @@ static const instr_cb_t instr_cb[256] = { [BPF_ALU64 | BPF_SUB | BPF_K] = sub_imm64, [BPF_ALU64 | BPF_MUL | BPF_X] = mul_reg64, [BPF_ALU64 | BPF_MUL | BPF_K] = mul_imm64, + [BPF_ALU64 | BPF_DIV | BPF_X] = div_reg64, + [BPF_ALU64 | BPF_DIV | BPF_K] = div_imm64, [BPF_ALU64 | BPF_NEG] = neg_reg64, [BPF_ALU64 | BPF_LSH | BPF_X] = shl_reg64, [BPF_ALU64 | BPF_LSH | BPF_K] = shl_imm64, @@ -3004,6 +3058,8 @@ static const instr_cb_t instr_cb[256] = { [BPF_ALU | BPF_SUB | BPF_K] = sub_imm, [BPF_ALU | BPF_MUL | BPF_X] = mul_reg, [BPF_ALU | BPF_MUL | BPF_K] = mul_imm, + [BPF_ALU | BPF_DIV | BPF_X] = div_reg, + [BPF_ALU | BPF_DIV | BPF_K] = div_imm, [BPF_ALU | BPF_NEG] = neg_reg, [BPF_ALU | BPF_LSH | BPF_K] = shl_imm, [BPF_ALU | BPF_END | BPF_X] = end_reg32, diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
[PATCH bpf-next v2 4/6] nfp: bpf: support u16 and u32 multiplications
From: Jiong Wang NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per step, therefore we need 2 steps for u16 and 4 steps for u32. We also need one start instruction to initialize the sequence and one or two instructions to fetch the result depending on either you need the high halve of u32 multiplication. For ALU64, if either operand is beyond u32's value range, we reject it. One thing to note, if the source operand is BPF_K, then we need to check "imm" field directly, and we'd reject it if it is negative. Because for ALU64, "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul doesn't support. For ALU32, it is fine for "imm" be negative though, because the result is 32-bits and here is no difference on the low halve of result for signed/unsigned mul, so we will get correct result. Signed-off-by: Jiong Wang Reviewed-by: Jakub Kicinski Acked-by: Song Liu --- drivers/net/ethernet/netronome/nfp/bpf/jit.c | 137 ++ drivers/net/ethernet/netronome/nfp/bpf/main.h | 5 + .../net/ethernet/netronome/nfp/bpf/verifier.c | 58 ++-- drivers/net/ethernet/netronome/nfp/nfp_asm.h | 28 4 files changed, 217 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c index 4a629e9b5c0f..f1b27c3a4347 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c @@ -415,6 +415,60 @@ emit_alu(struct nfp_prog *nfp_prog, swreg dst, reg.dst_lmextn, reg.src_lmextn); } +static void +__emit_mul(struct nfp_prog *nfp_prog, enum alu_dst_ab dst_ab, u16 areg, + enum mul_type type, enum mul_step step, u16 breg, bool swap, + bool wr_both, bool dst_lmextn, bool src_lmextn) +{ + u64 insn; + + insn = OP_MUL_BASE | + FIELD_PREP(OP_MUL_A_SRC, areg) | + FIELD_PREP(OP_MUL_B_SRC, breg) | + FIELD_PREP(OP_MUL_STEP, step) | + FIELD_PREP(OP_MUL_DST_AB, dst_ab) | + FIELD_PREP(OP_MUL_SW, swap) | + FIELD_PREP(OP_MUL_TYPE, type) | + FIELD_PREP(OP_MUL_WR_AB, wr_both) | + FIELD_PREP(OP_MUL_SRC_LMEXTN, src_lmextn) | + FIELD_PREP(OP_MUL_DST_LMEXTN, dst_lmextn); + + nfp_prog_push(nfp_prog, insn); +} + +static void +emit_mul(struct nfp_prog *nfp_prog, swreg lreg, enum mul_type type, +enum mul_step step, swreg rreg) +{ + struct nfp_insn_ur_regs reg; + u16 areg; + int err; + + if (type == MUL_TYPE_START && step != MUL_STEP_NONE) { + nfp_prog->error = -EINVAL; + return; + } + + if (step == MUL_LAST || step == MUL_LAST_2) { + /* When type is step and step Number is LAST or LAST2, left +* source is used as destination. +*/ + err = swreg_to_unrestricted(lreg, reg_none(), rreg, ); + areg = reg.dst; + } else { + err = swreg_to_unrestricted(reg_none(), lreg, rreg, ); + areg = reg.areg; + } + + if (err) { + nfp_prog->error = err; + return; + } + + __emit_mul(nfp_prog, reg.dst_ab, areg, type, step, reg.breg, reg.swap, + reg.wr_both, reg.dst_lmextn, reg.src_lmextn); +} + static void __emit_ld_field(struct nfp_prog *nfp_prog, enum shf_sc sc, u8 areg, u8 bmask, u8 breg, u8 shift, bool imm8, @@ -1380,6 +1434,65 @@ static void wrp_end32(struct nfp_prog *nfp_prog, swreg reg_in, u8 gpr_out) SHF_SC_R_ROT, 16); } +static void +wrp_mul_u32(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg, + swreg rreg, bool gen_high_half) +{ + emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg); + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_1, rreg); + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_2, rreg); + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_3, rreg); + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_4, rreg); + emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_32x32, MUL_LAST, reg_none()); + if (gen_high_half) + emit_mul(nfp_prog, dst_hi, MUL_TYPE_STEP_32x32, MUL_LAST_2, +reg_none()); + else + wrp_immed(nfp_prog, dst_hi, 0); +} + +static void +wrp_mul_u16(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg, + swreg rreg) +{ + emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg); + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_1, rreg); + emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_2, rreg); + emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_16x16, MUL_LAST, reg_none()); +} + +static int +wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, + bool gen_high_half, bool
[PATCH bpf-next v2 1/6] lib: reciprocal_div: implement the improved algorithm on the paper mentioned
From: Jiong Wang The new added "reciprocal_value_adv" implements the advanced version of the algorithm described in Figure 4.2 of the paper except when "divisor > (1U << 31)" whose ceil(log2(d)) result will be 32 which then requires u128 divide on host. The exception case could be easily handled before calling "reciprocal_value_adv". The advanced version requires more complex calculation to get the reciprocal multiplier and other control variables, but then could reduce the required emulation operations. It makes no sense to use this advanced version for host divide emulation, those extra complexities for calculating multiplier etc could completely waive our saving on emulation operations. However, it makes sense to use it for JIT divide code generation (for example eBPF JIT backends) for which we are willing to trade performance of JITed code with that of host. As shown by the following pseudo code, the required emulation operations could go down from 6 (the basic version) to 3 or 4. To use the result of "reciprocal_value_adv", suppose we want to calculate n/d, the C-style pseudo code will be the following, it could be easily changed to real code generation for other JIT targets. struct reciprocal_value_adv rvalue; u8 pre_shift, exp; // handle exception case. if (d >= (1U << 31)) { result = n >= d; return; } rvalue = reciprocal_value_adv(d, 32) exp = rvalue.exp; if (rvalue.is_wide_m && !(d & 1)) { // floor(log2(d & (2^32 -d))) pre_shift = fls(d & -d) - 1; rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift); } else { pre_shift = 0; } // code generation starts. if (imm == 1U << exp) { result = n >> exp; } else if (rvalue.is_wide_m) { // pre_shift must be zero when reached here. t = (n * rvalue.m) >> 32; result = n - t; result >>= 1; result += t; result >>= rvalue.sh - 1; } else { if (pre_shift) result = n >> pre_shift; result = ((u64)result * rvalue.m) >> 32; result >>= rvalue.sh; } Signed-off-by: Jiong Wang Reviewed-by: Jakub Kicinski --- include/linux/reciprocal_div.h | 68 ++ lib/reciprocal_div.c | 41 2 files changed, 109 insertions(+) diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h index e031e9f2f9d8..585ce89c0f33 100644 --- a/include/linux/reciprocal_div.h +++ b/include/linux/reciprocal_div.h @@ -25,6 +25,9 @@ struct reciprocal_value { u8 sh1, sh2; }; +/* "reciprocal_value" and "reciprocal_divide" together implement the basic + * version of the algorithm described in Figure 4.1 of the paper. + */ struct reciprocal_value reciprocal_value(u32 d); static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R) @@ -33,4 +36,69 @@ static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R) return (t + ((a - t) >> R.sh1)) >> R.sh2; } +struct reciprocal_value_adv { + u32 m; + u8 sh, exp; + bool is_wide_m; +}; + +/* "reciprocal_value_adv" implements the advanced version of the algorithm + * described in Figure 4.2 of the paper except when "divisor > (1U << 31)" whose + * ceil(log2(d)) result will be 32 which then requires u128 divide on host. The + * exception case could be easily handled before calling "reciprocal_value_adv". + * + * The advanced version requires more complex calculation to get the reciprocal + * multiplier and other control variables, but then could reduce the required + * emulation operations. + * + * It makes no sense to use this advanced version for host divide emulation, + * those extra complexities for calculating multiplier etc could completely + * waive our saving on emulation operations. + * + * However, it makes sense to use it for JIT divide code generation for which + * we are willing to trade performance of JITed code with that of host. As shown + * by the following pseudo code, the required emulation operations could go down + * from 6 (the basic version) to 3 or 4. + * + * To use the result of "reciprocal_value_adv", suppose we want to calculate + * n/d, the pseudo C code will be: + * + * struct reciprocal_value_adv rvalue; + * u8 pre_shift, exp; + * + * // handle exception case. + * if (d >= (1U << 31)) { + * result = n >= d; + * return; + * } + * + * rvalue = reciprocal_value_adv(d, 32) + * exp = rvalue.exp; + * if (rvalue.is_wide_m && !(d & 1)) { + * // floor(log2(d & (2^32 -d))) + * pre_shift = fls(d & -d) - 1; + * rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift); + * } else { + * pre_shift = 0; + * } + * + * // code generation starts. + * if (imm == 1U << exp) { + * result = n >> exp; + * } else if (rvalue.is_wide_m) { + * // pre_shift must be zero when reached here. + * t = (n * rvalue.m) >> 32; + * result = n - t; + * result >>= 1; + * result += t; + * result >>= rvalue.sh - 1; + * } else { + * if
[PATCH bpf-next v2 0/6] nfp: bpf: add multiplication and divide support on NFP JIT
Jiong says: NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per step, therefore we need 2 steps for u16 and 4 steps for u32. We also need one start instruction to initialize the sequence and one or two instructions to fetch the result depending on either you need the high halve of u32 multiplication. For ALU64, if either operand is beyond u32's value range, we reject it. One thing to note, if the source operand is BPF_K, then we need to check "imm" field directly, and we'd reject it if it is negative. Because for ALU64, "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul doesn't support. For ALU32, it is fine for "imm" be negative though, because the result is 32-bits and here is no difference on the low halve of result for signed/unsigned mul, so we will get correct result. NFP doesn't have integer divide instruction, this patch set uses reciprocal algorithm (the basic one, reciprocal_div) to emulate it. For each u32 divide, we would need 11 instructions to finish the operation. 7 (for multiplication) + 4 (various ALUs) = 11 Given NFP only supports multiplication no bigger than u32, we'd require divisor and dividend no bigger than that as well. Also eBPF doesn't support signed divide and has enforced this on C language level by failing compilation. However LLVM assembler hasn't enforced this, so it is possible for negative constant to leak in as a BPF_K operand through assembly code, we reject such cases as well. Meanwhile reciprocal_div.h only implemented the basic version of: "Division by Invariant Integers Using Multiplication" - Torbjörn Granlund and Peter L. Montgomery This patch set further implements the optimized version (Figure 4.2 in the paper) inside existing reciprocal_div.h. When the divider is even and the calculated reciprocal magic number doesn't fit u32, we could reduce the required ALU instructions from 4 to 2 or 1 for some cases. The advanced version requires more complex calculation to get the reciprocal multiplier and other control variables, but then could reduce the required emulation operations. It makes sense to use it for JIT divide code generation (for example eBPF JIT backends) for which we are willing to trade performance of JITed code with that of host. v2: - add warning in l == 32 code path. (Song Liu/Jakub) - jit separate insn sequence for l == 32. (Jakub/Edwin) - should use unrestricted operand for mul. - once place should use "1U << exp" instead of "1 << exp". Jiong Wang (6): lib: reciprocal_div: implement the improved algorithm on the paper mentioned nfp: bpf: rename umin/umax to umin_src/umax_src nfp: bpf: copy range info for all operands of all ALU operations nfp: bpf: support u16 and u32 multiplications nfp: bpf: support u32 divide using reciprocal_div.h nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h drivers/net/ethernet/netronome/nfp/bpf/jit.c | 249 +- drivers/net/ethernet/netronome/nfp/bpf/main.h | 43 +-- .../net/ethernet/netronome/nfp/bpf/offload.c | 6 +- .../net/ethernet/netronome/nfp/bpf/verifier.c | 85 +- drivers/net/ethernet/netronome/nfp/nfp_asm.h | 28 ++ include/linux/reciprocal_div.h| 68 + lib/reciprocal_div.c | 41 +++ 7 files changed, 483 insertions(+), 37 deletions(-) -- 2.17.1
[PATCH bpf-next v2 2/6] nfp: bpf: rename umin/umax to umin_src/umax_src
From: Jiong Wang The two fields are a copy of umin and umax info of bpf_insn->src_reg generated by verifier. Rename to make their meaning clear. Signed-off-by: Jiong Wang Reviewed-by: Jakub Kicinski Acked-by: Song Liu --- drivers/net/ethernet/netronome/nfp/bpf/jit.c | 12 ++-- drivers/net/ethernet/netronome/nfp/bpf/main.h | 10 +- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 2 +- drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 4 ++-- 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c index 33111739b210..4a629e9b5c0f 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c @@ -1772,8 +1772,8 @@ static int shl_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) u8 dst, src; dst = insn->dst_reg * 2; - umin = meta->umin; - umax = meta->umax; + umin = meta->umin_src; + umax = meta->umax_src; if (umin == umax) return __shl_imm64(nfp_prog, dst, umin); @@ -1881,8 +1881,8 @@ static int shr_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) u8 dst, src; dst = insn->dst_reg * 2; - umin = meta->umin; - umax = meta->umax; + umin = meta->umin_src; + umax = meta->umax_src; if (umin == umax) return __shr_imm64(nfp_prog, dst, umin); @@ -1995,8 +1995,8 @@ static int ashr_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta) u8 dst, src; dst = insn->dst_reg * 2; - umin = meta->umin; - umax = meta->umax; + umin = meta->umin_src; + umax = meta->umax_src; if (umin == umax) return __ashr_imm64(nfp_prog, dst, umin); diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h index 654fe7823e5e..5975a19c28cb 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h @@ -263,8 +263,8 @@ struct nfp_bpf_reg_state { * @func_id: function id for call instructions * @arg1: arg1 for call instructions * @arg2: arg2 for call instructions - * @umin: copy of core verifier umin_value. - * @umax: copy of core verifier umax_value. + * @umin_src: copy of core verifier umin_value for src opearnd. + * @umax_src: copy of core verifier umax_value for src operand. * @off: index of first generated machine instruction (in nfp_prog.prog) * @n: eBPF instruction number * @flags: eBPF instruction extra optimization flags @@ -301,11 +301,11 @@ struct nfp_insn_meta { struct nfp_bpf_reg_state arg2; }; /* We are interested in range info for some operands, -* for example, the shift amount. +* for example, the shift amount which is kept in src operand. */ struct { - u64 umin; - u64 umax; + u64 umin_src; + u64 umax_src; }; }; unsigned int off; diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 7eae4c0266f8..856a0003bb75 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -191,7 +191,7 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog, meta->insn = prog[i]; meta->n = i; if (is_mbpf_indir_shift(meta)) - meta->umin = U64_MAX; + meta->umin_src = U64_MAX; list_add_tail(>l, _prog->insns); } diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c index 4bfeba7b21b2..e862b739441f 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c @@ -555,8 +555,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx) const struct bpf_reg_state *sreg = cur_regs(env) + meta->insn.src_reg; - meta->umin = min(meta->umin, sreg->umin_value); - meta->umax = max(meta->umax, sreg->umax_value); + meta->umin_src = min(meta->umin_src, sreg->umin_value); + meta->umax_src = max(meta->umax_src, sreg->umax_value); } return 0; -- 2.17.1
Re: [PATCH v12 01/10] dt-bindings: Add Cavium Octeon Common Ethernet Interface.
On 06/28/2018 03:35 AM, Andrew Lunn wrote: > >> +- cavium,rx-clk-delay-bypass: Set to <1> to bypass the rx clock delay >> setting. >> + Needed by the Micrel PHY. > > Could you explain this some more. Is it anything to do with RGMII delays? > Andrew, One of my colleagues tracked this down for me. This device tree option is in place because there are several different ways to do the clock and data with respect to RGMII. This controls the delay introduced for the RX clock with respect to the data. Without this, RX will not work with Micrel PHYs. Thanks. Steve
Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc
Hi Stephen, On 07/06/2018 02:32 PM, Stephen Hemminger wrote: > >> diff --git a/tc/q_etf.c b/tc/q_etf.c >> new file mode 100644 >> index ..5db1dd6f >> --- /dev/null >> +++ b/tc/q_etf.c >> @@ -0,0 +1,168 @@ >> +/* >> + * q_etf.c Earliest TxTime First (ETF). >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public License >> + * as published by the Free Software Foundation; either version >> + * 2 of the License, or (at your option) any later version. >> + * >> + * Authors: Vinicius Costa Gomes >> + * Jesus Sanchez-Palencia >> + * >> + */ > > > Please use SPDX tag rather than GPL boilerplate when adding new code. Sure, will do for v4. > >> +static int get_clockid(__s32 *val, const char *arg) >> +{ >> +const struct static_clockid { >> +const char *name; >> +clockid_t clockid; >> +} clockids_sysv[] = { >> +{ "CLOCK_REALTIME", CLOCK_REALTIME }, >> +{ "CLOCK_TAI", CLOCK_TAI }, >> +{ "CLOCK_BOOTTIME", CLOCK_BOOTTIME }, >> +{ "CLOCK_MONOTONIC", CLOCK_MONOTONIC }, >> +{ NULL } >> +}; >> + >> +const struct static_clockid *c; >> + >> +for (c = clockids_sysv; c->name; c++) { >> +if (strncasecmp(c->name, arg, 25) == 0) { >> +*val = c->clockid; >> + >> +return 0; >> +} >> +} >> + >> +return -1; >> +} > > Internally, kernel must use ktime. For the userspace part the TC standard > is to use USER HZ of 100. > > Please change user kernel API of this module to match other existing modules. > Doing something unique for this module is not necessary. I don't follow you on the above, sorry. The qdisc must be configured with a valid clockid_t. This type is used by both userspace and kernel. As requested before, we made the tc etf command line interface more user-friendly and allowed for users to input the clock name as a string. The code above is just lookup table converting the string into a a valid clockid_t so we can then pass it to the kernel. There is no ktime or any other timestamp type above, only clockid_t. Can you please clarify what your request is? Thanks, Jesus
Re: Fw: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE
On 07/06/2018 02:24 PM, Stephen Hemminger wrote: > > > Begin forwarded message: > > Date: Sat, 23 Jun 2018 00:00:25 + > From: bugzilla-dae...@bugzilla.kernel.org > To: step...@networkplumber.org > Subject: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE > > > https://bugzilla.kernel.org/show_bug.cgi?id=200215 > > Bug ID: 200215 >Summary: UBSAN: Undefined behaviour in net/core/sock.c:LINE >Product: Networking >Version: 2.5 > Kernel Version: v4.18-rc2 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > Assignee: step...@networkplumber.org > Reporter: icy...@gmail.com > Regression: No > > Hi, > This bug was found in Linux Kernel v4.18-rc2 > > $ cat report4 > > UBSAN: Undefined behaviour in net/core/sock.c:793:19 > signed integer overflow: > -1704733899 * 2 cannot be represented in type 'int' > CPU: 0 PID: 5695 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x122/0x1c8 lib/dump_stack.c:113 > ubsan_epilogue+0x12/0x86 lib/ubsan.c:159 > handle_overflow+0x1c2/0x21f lib/ubsan.c:190 > __ubsan_handle_mul_overflow+0x2a/0x38 lib/ubsan.c:214 > sock_setsockopt+0x17f1/0x1c80 net/core/sock.c:793 > __sys_setsockopt+0x23f/0x2a0 net/socket.c:1943 > __do_sys_setsockopt net/socket.c:1958 [inline] > __se_sys_setsockopt net/socket.c:1955 [inline] > __x64_sys_setsockopt+0xcc/0x170 net/socket.c:1955 > do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe This seems harmless, a fix would be : diff --git a/net/core/sock.c b/net/core/sock.c index 03fdea5b0f575945a58fd14b546226d61ccd4988..9160f412d49dfe7706c202dbd531c247c0548a21 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -792,7 +792,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, * returning the value we actually used in getsockopt * is the most desirable behavior. */ - sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF); + sk->sk_rcvbuf = max_t(int, val << 1, SOCK_MIN_RCVBUF); break; case SO_RCVBUFFORCE:
Re: [PATCH v2 net-next 00/14] Scheduled packet Transmission: ETF
On Tue, 3 Jul 2018 15:42:46 -0700 Jesus Sanchez-Palencia wrote: > Changes since v1: > - moved struct sock_txtime from socket.h to uapi net_tstamp.h; > - sk_clockid was changed from u16 to u8; > - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock; > - the socket option flags are now validated in sock_setsockopt(); > - added SO_EE_ORIGIN_TXTIME; > - sockc.transmit_time is now initialized from all IPv4 Tx paths; > - added support for the IPv6 Tx path; > > > Overview > > > This work consists of a set of kernel interfaces that can be used by > applications that require (time-based) Scheduled Tx of packets. > It is comprised by 3 new components to the kernel: > > - SO_TXTIME: socket option + cmsg programming interfaces. > > - etf: the "earliest txtime first" qdisc, that provides per-queue >TxTime-based scheduling. This has been renamed from 'tbs' to >'etf' to better describe its functionality. > > - taprio: the "time-aware priority scheduler" qdisc, that provides > per-port Time-Aware scheduling; > > This patchset is providing the first 2 components, which have been > developed for longer. The taprio qdisc will be shared as an RFC separately > (shortly). > > Note that this series is a follow up of the "Time based packet > transmission" RFCv3 [1]. > > > > etf (formerly known as 'tbs') > = > > For applications/systems that the concept of time slices isn't precise > enough, the etf qdisc allows applications to control the instant when > a packet should leave the network controller. When used in conjunction > with taprio, it can also be used in case the application needs to > control with greater guarantee the offset into each time slice a packet > will be sent. Another use case of etf, is when only a small number of > applications on a system are time sensitive, so it can then be used > with a more traditional root qdisc (like mqprio). > > The etf qdisc is designed so it buffers packets until a configurable > time before their deadline (Tx time). The qdisc uses a rbtree internally > so the buffered packets are always 'ordered' by their txtime (deadline) > and will be dequeued following the earliest txtime first. > > It relies on the SO_TXTIME API set for receiving the per-packet timestamp > (txtime) as well as the config flags for each socket: the clockid to be > used as a reference, if the expected mode of txtime for that socket is > deadline or strict mode, and if packet drops should be reported on the > socket's error queue or not. > > The qdisc will drop any packets with a Tx time in the past, or if a > packet expires while waiting for being dequeued. Drops can be reported > as errors back to userspace through the socket's error queue. > > Example configuration: > > $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 20 \ > clockid CLOCK_TAI > > Here, the Qdisc will use HW offload for the txtime control. > Packets will be dequeued by the qdisc "delta" (20) nanoseconds before > their transmission time. Because this will be using HW offload and > since dynamic clocks are not supported by hrtimers, the system clock > and the PHC clock must be synchronized for this mode to behave as expected. > > A more complete example can be found here, with instructions of how to > test it: > > https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2] > > > Note that we haven't modified the qdisc so it uses a timerqueue because > the modification needed was increasing the number of cachelines of a sk_buff. > > > > This series is also hosted on github and can be found at [3]. > The companion iproute2 patches can be found at [4]. > > > [1] https://patchwork.ozlabs.org/cover/882342/ > > [2] github doesn't make it clear, but the gist can be cloned like this: > $ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f > scheduled-tx-tests > > [3] https://github.com/jeez/linux/tree/etf-v2 > > [4] https://github.com/jeez/iproute2/tree/etf-v2 > > > > Jesus Sanchez-Palencia (10): > net: Clear skb->tstamp only on the forwarding path > net: ipv4: Hook into time based transmission > net: ipv6: Hook into time based transmission > net/sched: Add HW offloading capability to ETF > igb: Refactor igb_configure_cbs() > igb: Only change Tx arbitration when CBS is on > igb: Refactor igb_offload_cbs() > igb: Only call skb_tx_timestamp after descriptors are ready > igb: Add support for ETF offload > net/sched: Make etf report drops on error_queue > > Richard Cochran (2): > net: Add a new socket option for a future transmit time. > net: packet: Hook into time based transmission. > > Vinicius Costa Gomes (2): > net/sched: Allow creating a Qdisc watchdog with other clocks > net/sched: Introduce the ETF Qdisc > > arch/alpha/include/uapi/asm/socket.h | 3 + > arch/ia64/include/uapi/asm/socket.h | 3 + >
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > I'm also not 100% on board with the argument that "future" FW can > > reshuffle things whatever way it wants to. Is the assumption that > > future ASICs/FW will be designed to always use the "blessed" BTF > > format? Or will it be reconfigurable at runtime? > > let's table configuration of metadata aside for a second. > > Describing metedata layout in BTF allows NICs to disclose everything > NIC has to users in a standard and generic way. > Whether firmware is reconfigurable on the fly or has to reflashed > and hw powercycled to have new md layout (and corresponding BTF description) > is a separate discussion. > Saeed's proposal introduces the concept of 'offset' inside 'struct > xdp_md_info' > to reach 'hash' value in metadata. > Essentially it's a run-time way to access 'hash' instead of build-time. > So bpf program would need two loads to read csum or hash field instead of one. > With BTF the layout of metadata is known to the program at build-time. > > To reiterate the proposal: > - driver+firmware keep layout of the metadata in BTF format (either in the > driver > or driver can read it from firmware) > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver > and > generate normal C header file based on BTF in the given NIC > - user does #include "md_desc.h" and bpf program can access md->csum or > md->hash > with direct single load out of metadata area in front of the packet > - llvm compiles bpf program and records how program is doing this md->csum > accesses > in BTF format as well (the compiler will be keeping such records > for __sk_buff and all other structs too, but that's separate discussion) > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way > the program > accesses metadata (and other structs) matches BTF from the driver, > so no surprises if driver+firmware got updated, but program is not > recompiled > - every NIC can have their own layout of metadata and its own meaning of the > fields, > but would be good to standardize at least a few common fields like hash Can I expose HW descriptors this way, though, or is the proposal to copy this data into the packet buffer? > Once this is working we can do more cool things with BTF. > Like doing offset rewriting at program load time similar to what we plan > to do for tracing. Tracing programs will be doing 'task->pid' access > and the kernel will adjust offsetof(struct task_struct, pid) during program > load > depending on BTF for the kernel. > The same trick we can do for networking metadata. > The program will contain load instruction md->hash that will get automatically > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > For now I'm proposing _not_ to go that far with offset rewriting and start > with simple steps described above. Why? :( Could we please go with the rewrite/driver helpers instead of impacting fast paths of the drivers yet again? This rewrite should be easier than task->pid, because we have the synthetic user space struct xdp_md definition. The flexibility and power of BPF rewrites/JITing is at our disposal to deliver maximum performance..
Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc
> diff --git a/tc/q_etf.c b/tc/q_etf.c > new file mode 100644 > index ..5db1dd6f > --- /dev/null > +++ b/tc/q_etf.c > @@ -0,0 +1,168 @@ > +/* > + * q_etf.c Earliest TxTime First (ETF). > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License > + * as published by the Free Software Foundation; either version > + * 2 of the License, or (at your option) any later version. > + * > + * Authors: Vinicius Costa Gomes > + * Jesus Sanchez-Palencia > + * > + */ Please use SPDX tag rather than GPL boilerplate when adding new code. > +static int get_clockid(__s32 *val, const char *arg) > +{ > + const struct static_clockid { > + const char *name; > + clockid_t clockid; > + } clockids_sysv[] = { > + { "CLOCK_REALTIME", CLOCK_REALTIME }, > + { "CLOCK_TAI", CLOCK_TAI }, > + { "CLOCK_BOOTTIME", CLOCK_BOOTTIME }, > + { "CLOCK_MONOTONIC", CLOCK_MONOTONIC }, > + { NULL } > + }; > + > + const struct static_clockid *c; > + > + for (c = clockids_sysv; c->name; c++) { > + if (strncasecmp(c->name, arg, 25) == 0) { > + *val = c->clockid; > + > + return 0; > + } > + } > + > + return -1; > +} Internally, kernel must use ktime. For the userspace part the TC standard is to use USER HZ of 100. Please change user kernel API of this module to match other existing modules. Doing something unique for this module is not necessary.
Fw: [Bug 200315] New: [Regression] Wired network does not come back after reboot
Begin forwarded message: Date: Thu, 28 Jun 2018 01:59:30 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200315] New: [Regression] Wired network does not come back after reboot https://bugzilla.kernel.org/show_bug.cgi?id=200315 Bug ID: 200315 Summary: [Regression] Wired network does not come back after reboot Product: Networking Version: 2.5 Kernel Version: 4.17.2 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Other Assignee: step...@networkplumber.org Reporter: nicholaslima...@gmail.com Regression: No Created attachment 276937 --> https://bugzilla.kernel.org/attachment.cgi?id=276937=edit It's a log from journalctl NetworkManager Only difference is using kernel 4.17.2 vs kernel 4.16.15 When using 4.17.2, the network does not work after rebooting. With 4.16.15 it works as expected. Complementing: The wired connection doesn't work if I choose to reboot, but It does work if I power it off and on again. -- You are receiving this mail because: You are the assignee for the bug.
Fw: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE
Begin forwarded message: Date: Sat, 23 Jun 2018 00:00:25 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200215] New: UBSAN: Undefined behaviour in net/core/sock.c:LINE https://bugzilla.kernel.org/show_bug.cgi?id=200215 Bug ID: 200215 Summary: UBSAN: Undefined behaviour in net/core/sock.c:LINE Product: Networking Version: 2.5 Kernel Version: v4.18-rc2 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Other Assignee: step...@networkplumber.org Reporter: icy...@gmail.com Regression: No Hi, This bug was found in Linux Kernel v4.18-rc2 $ cat report4 UBSAN: Undefined behaviour in net/core/sock.c:793:19 signed integer overflow: -1704733899 * 2 cannot be represented in type 'int' CPU: 0 PID: 5695 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x122/0x1c8 lib/dump_stack.c:113 ubsan_epilogue+0x12/0x86 lib/ubsan.c:159 handle_overflow+0x1c2/0x21f lib/ubsan.c:190 __ubsan_handle_mul_overflow+0x2a/0x38 lib/ubsan.c:214 sock_setsockopt+0x17f1/0x1c80 net/core/sock.c:793 __sys_setsockopt+0x23f/0x2a0 net/socket.c:1943 __do_sys_setsockopt net/socket.c:1958 [inline] __se_sys_setsockopt net/socket.c:1955 [inline] __x64_sys_setsockopt+0xcc/0x170 net/socket.c:1955 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455a09 Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fb211862c68 EFLAGS: 0246 ORIG_RAX: 0036 RAX: ffda RBX: 7fb2118636d4 RCX: 00455a09 RDX: 0021 RSI: 0001 RDI: 0013 RBP: 0072bea0 R08: 0004 R09: R10: 2000 R11: 0246 R12: R13: 06dc R14: 006ff540 R15: sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 40 00 print_req_error: I/O error, dev sr0, sector 0 sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 40 00 00 40 00 print_req_error: I/O error, dev sr0, sector 256 sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] unaligned transfer sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00 00 00 02 00 print_req_error: critical target error, dev sr0, sector 0 Buffer I/O error on dev sr0, logical block 0, lost async page write Buffer I/O error on dev sr0, logical block 1, lost async page write Buffer I/O error on dev sr0, logical block 2, lost async page write Buffer I/O error on dev sr0, logical block 3, lost async page write Buffer I/O error on dev sr0, logical block 4, lost async page write Buffer I/O error on dev sr0, logical block 5, lost async page write Buffer I/O error on dev sr0, logical block 6, lost async page write Buffer I/O error on dev sr0, logical block 7, lost async page write This bug can be repro, if you need it, please tell me. Thanks, Icytxw -- You are receiving this mail because: You are the assignee for the bug.
Re: [PATCH v1 net-next 9/9] lan743x: Add PTP support
On Thu, Jul 05, 2018 at 12:39:26PM -0400, Bryan Whitehead wrote: > Signed-off-by: Bryan Whitehead 1. You forgot the commit message. 2. You forgot to add the PTP maintainer on CC. Thanks, Richard
Fw: [Bug 200239] New: RTL8211E lockup when having heavy tx traffic
Begin forwarded message: Date: Sun, 24 Jun 2018 10:26:02 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200239] New: RTL8211E lockup when having heavy tx traffic https://bugzilla.kernel.org/show_bug.cgi?id=200239 Bug ID: 200239 Summary: RTL8211E lockup when having heavy tx traffic Product: Networking Version: 2.5 Kernel Version: 4.17 Hardware: ARM OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: IPV4 Assignee: step...@networkplumber.org Reporter: lollipop.studio...@gmail.com Regression: No My device: roc-rk3328-cc It's a SBC(Single Board Computer) use soc Rockchip RK3328 How to reproduce this bug: just running a mainline linux kernel,then run iperf3 to test. I also tried this patch https://patchwork.kernel.org/patch/10178969/ ,It slightly improved but cannot completely solve this problem. dmesg summary: [14755.405897] [ cut here ] [14755.406386] NETDEV WATCHDOG: eth0 (rk_gmac-dwmac): transmit queue 0 timed out [14755.407258] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:473 dev_watchdog+0x2ac/0x2b8 [14755.407992] Modules linked in: 8021q garp mrp stp llc rockchipdrm analogix_dp phy_rockchip_inno_hdmi tcp_bbr sch_fq [14755.408985] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.17.0-rc7-1-90343-g15e473367050-dirty #1 [14755.409762] Hardware name: Firefly roc-rk3328-cc (DT) [14755.410224] pstate: 2005 (nzCv daif -PAN -UAO) [14755.410664] pc : dev_watchdog+0x2ac/0x2b8 [14755.411034] lr : dev_watchdog+0x2ac/0x2b8 [14755.411396] sp : 08013d80 [14755.411696] x29: 08013d80 x28: 0005 [14755.412184] x27: 0101 x26: [14755.412670] x25: 800079d604b8 x24: 0002 [14755.413156] x23: 800079d6049c x22: 80007b74de80 [14755.413641] x21: 09b07000 x20: 800079d6 [14755.414128] x19: x18: 0010 [14755.414614] x17: x16: 08229758 [14755.415099] x15: x14: 09b09d88 [14755.415585] x13: 89cb16f7 x12: 09cb16ff [14755.416073] x11: 09b23000 x10: 08013a60 [14755.416557] x9 : ffd0 x8 : 0874ee68 [14755.417044] x7 : 756575712074696d x6 : 017e [14755.417525] x5 : 085b5400 x4 : 0004 [14755.418012] x3 : 09b0c850 x2 : 80007c3d3800 [14755.418496] x1 : bea2cbb4ea219900 x0 : [14755.418988] Call trace: [14755.419226] dev_watchdog+0x2ac/0x2b8 [14755.419574] call_timer_fn+0x20/0x78 [14755.419903] expire_timers+0xa8/0xb8 [14755.420239] run_timer_softirq+0xa4/0x190 [14755.420610] __do_softirq+0x124/0x220 [14755.420951] irq_exit+0xb0/0xd0 [14755.421250] __handle_domain_irq+0x64/0xb8 [14755.421627] gic_handle_irq+0x50/0xa0 [14755.421967] el1_irq+0xb0/0x128 [14755.422265] arch_cpu_idle+0x10/0x18 [14755.422600] do_idle+0x208/0x270 [14755.422896] cpu_startup_entry+0x20/0x28 [14755.423265] secondary_start_kernel+0x14c/0x160 [14755.423675] ---[ end trace 633d025b2aafb3cb ]--- [14755.424181] rk_gmac-dwmac ff54.ethernet eth0: Reset adapter. [14755.427911] RTL8211E Gigabit Ethernet stmmac-0:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=stmmac-0:00, irq=POLL) [14755.441983] rk_gmac-dwmac ff54.ethernet eth0: PTP not supported by HW [14755.442989] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready [14759.535770] rk_gmac-dwmac ff54.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx [14759.536606] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready -- You are receiving this mail because: You are the assignee for the bug.
Fw: [Bug 200033] stack-out-of-bounds in __xfrm_dst_hash net/xfrm/xfrm_hash.h
Begin forwarded message: Date: Sat, 23 Jun 2018 14:24:30 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200033] stack-out-of-bounds in __xfrm_dst_hash net/xfrm/xfrm_hash.h https://bugzilla.kernel.org/show_bug.cgi?id=200033 --- Comment #1 from icytxw (icy...@gmail.com) --- More details: 26 static inline unsigned int __xfrm6_daddr_saddr_hash(const xfrm_address_t *daddr, 27 const xfrm_address_t *saddr) 28 { 29 return ntohl(daddr->a6[2] ^ daddr->a6[3] ^ 30 saddr->a6[2] ^ saddr->a6[3]); 31 } $ cat report0 == BUG: KASAN: stack-out-of-bounds in __xfrm6_daddr_saddr_hash net/xfrm/xfrm_hash.h:29 [inline] BUG: KASAN: stack-out-of-bounds in __xfrm_dst_hash net/xfrm/xfrm_hash.h:96 [inline] BUG: KASAN: stack-out-of-bounds in xfrm_dst_hash net/xfrm/xfrm_state.c:61 [inline] BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x2b18/0x3160 net/xfrm/xfrm_state.c:953 Read of size 4 at addr 88004ad57b20 by task syz-executor0/4355 CPU: 0 PID: 4355 Comm: syz-executor0 Not tainted 4.18.0-rc1 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x122/0x1c8 lib/dump_stack.c:113 print_address_description+0x88/0x3b0 mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report+0x185/0x4a0 mm/kasan/report.c:412 __asan_report_load4_noabort+0x19/0x20 mm/kasan/report.c:432 __xfrm6_daddr_saddr_hash net/xfrm/xfrm_hash.h:29 [inline] __xfrm_dst_hash net/xfrm/xfrm_hash.h:96 [inline] xfrm_dst_hash net/xfrm/xfrm_state.c:61 [inline] xfrm_state_find+0x2b18/0x3160 net/xfrm/xfrm_state.c:953 xfrm_tmpl_resolve_one net/xfrm/xfrm_policy.c:1393 [inline] xfrm_tmpl_resolve+0x418/0xc10 net/xfrm/xfrm_policy.c:1437 xfrm_resolve_and_create_bundle+0x112/0x980 net/xfrm/xfrm_policy.c:1832 xfrm_lookup+0x298/0x1be0 net/xfrm/xfrm_policy.c:2162 xfrm_lookup_route+0x42/0x1f0 net/xfrm/xfrm_policy.c:2282 ip_route_output_flow+0x86/0xc0 net/ipv4/route.c:2588 udp_sendmsg+0x15c1/0x2180 net/ipv4/udp.c:1086 inet_sendmsg+0x103/0x490 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:645 [inline] sock_sendmsg+0xf9/0x180 net/socket.c:655 __sys_sendto+0x239/0x3c0 net/socket.c:1833 __do_sys_sendto net/socket.c:1845 [inline] __se_sys_sendto net/socket.c:1841 [inline] __x64_sys_sendto+0xef/0x1c0 net/socket.c:1841 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455a09 Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7f58654ecc68 EFLAGS: 0246 ORIG_RAX: 002c RAX: ffda RBX: 7f58654ed6d4 RCX: 00455a09 RDX: RSI: 200014c0 RDI: 0013 RBP: 0072bea0 R08: 20001540 R09: 0010 R10: R11: 0246 R12: R13: 05d7 R14: 006fdcc8 R15: The buggy address belongs to the page: page:ea00012b55c0 count:0 mapcount:0 mapping: index:0x0 flags: 0x100() raw: 0100 ea00012b55c8 raw: page dumped because: kasan: bad access detected Memory state around the buggy address: 88004ad57a00: f2 00 f4 f4 f4 f2 f2 f2 f2 00 00 00 00 f2 f2 f2 88004ad57a80: f2 00 00 00 00 00 f4 f4 f4 f2 f2 f2 f2 00 00 00 >88004ad57b00: 00 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00 00 ^ 88004ad57b80: 00 00 f4 f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 88004ad57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 == a repro of this bug see: https://github.com/lcytxw/bug_repro/tree/master/bug_200033 -- You are receiving this mail because: You are the assignee for the bug.
Fw: [Bug 200195] New: [Regression] Network does not come back after suspend
Begin forwarded message: Date: Fri, 22 Jun 2018 18:04:00 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200195] New: [Regression] Network does not come back after suspend https://bugzilla.kernel.org/show_bug.cgi?id=200195 Bug ID: 200195 Summary: [Regression] Network does not come back after suspend Product: Networking Version: 2.5 Kernel Version: 4.17.2 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Other Assignee: step...@networkplumber.org Reporter: aa...@kde.org Regression: No Created attachment 276753 --> https://bugzilla.kernel.org/attachment.cgi?id=276753=edit lspci -vvv output Setup: * Using Archlinux * Using wired network Only difference is using kernel 4.17.2 vs kernel 4.16.13 When using 4.17.2, the network does not work after resuming from suspend. With 4.16.13 it works as expected. My guess is that this is driver/hardware dependent since i have another machine with the same setup and it does not seem to have that regression. I'm attaching the lspci output of 4.16.13. Anything else I can provide or test to pinpoint where the regression happened please do not hesitate to ask. -- You are receiving this mail because: You are the assignee for the bug.
Fw: [Bug 200187] New: UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568
Begin forwarded message: Date: Fri, 22 Jun 2018 15:09:48 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200187] New: UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568 https://bugzilla.kernel.org/show_bug.cgi?id=200187 Bug ID: 200187 Summary: UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568 Product: Networking Version: 2.5 Kernel Version: v4.18-rc2 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Other Assignee: step...@networkplumber.org Reporter: icy...@gmail.com Regression: No It seems like that to->to_retries to large while shift exponent. if (to->to_exponential) req->rq_majortimeo <<= to->to_retries; else UBSAN: Undefined behaviour in net/sunrpc/xprt.c:568:22 shift exponent 536870976 is too large for 64-bit type 'long unsigned int' CPU: 0 PID: 21219 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x122/0x1c8 lib/dump_stack.c:113 ubsan_epilogue+0x12/0x86 lib/ubsan.c:159 __ubsan_handle_shift_out_of_bounds+0x29a/0x2ff lib/ubsan.c:425 xprt_reset_majortimeo+0x323/0x440 net/sunrpc/xprt.c:568 xprt_request_init+0x4d2/0x730 net/sunrpc/xprt.c:1330 call_reserveresult+0x9d/0x240 net/sunrpc/clnt.c:1549 __rpc_execute+0x23a/0xc20 net/sunrpc/sched.c:784 rpc_execute+0xf5/0x250 net/sunrpc/sched.c:852 rpc_run_task+0x4a1/0x9f0 net/sunrpc/clnt.c:1053 rpc_call_sync+0xcf/0x1a0 net/sunrpc/clnt.c:1082 rpc_ping+0xde/0x140 net/sunrpc/clnt.c:2514 rpc_create_xprt+0x1e8/0x590 net/sunrpc/clnt.c:479 rpc_create+0x3a9/0x600 net/sunrpc/clnt.c:587 nfs_create_rpc_client+0x3a1/0x4d0 fs/nfs/client.c:523 nfs_init_client+0x7b/0x100 fs/nfs/client.c:634 nfs_get_client+0xa74/0x1440 fs/nfs/client.c:425 nfs_init_server+0x236/0xdf0 fs/nfs/client.c:670 nfs_create_server+0x9a/0x750 fs/nfs/client.c:953 nfs_try_mount+0x270/0xaf0 fs/nfs/super.c:1884 nfs_fs_mount+0x151f/0x30e0 fs/nfs/super.c:2695 mount_fs+0xaf/0x330 fs/super.c:1277 vfs_kern_mount+0xfc/0x4d0 fs/namespace.c:1037 do_new_mount fs/namespace.c:2518 [inline] do_mount+0x46f/0x2fa0 fs/namespace.c:2848 ksys_mount+0xb0/0x120 fs/namespace.c:3064 __do_sys_mount fs/namespace.c:3078 [inline] __se_sys_mount fs/namespace.c:3075 [inline] __x64_sys_mount+0xcc/0x170 fs/namespace.c:3075 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455a09 Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7f11c724ac68 EFLAGS: 0246 ORIG_RAX: 00a5 RAX: ffda RBX: 7f11c724b6d4 RCX: 00455a09 RDX: 21c0 RSI: 2100 RDI: 2200 RBP: 0072bea0 R08: 2180 R09: R10: R11: 0246 R12: R13: 04d1 R14: 006fc438 R15: rpcbind: RPC call returned error 22 rpcbind: RPC call returned error 22 rpcbind: RPC call returned error 22 rpcbind: RPC call returned error 22 rpcbind: RPC call returned error 22 FAT-fs (loop1): bogus number of reserved sectors FAT-fs (loop1): Can't find a valid FAT filesystem FAT-fs (loop1): bogus number of reserved sectors FAT-fs (loop1): Can't find a valid FAT filesystem FAT-fs (loop1): bogus number of reserved sectors FAT-fs (loop1): Can't find a valid FAT filesystem FAT-fs (loop1): bogus number of reserved sectors FAT-fs (loop1): Can't find a valid FAT filesystem FAT-fs (loop1): bogus number of reserved sectors FAT-fs (loop1): Can't find a valid FAT filesystem EXT4-fs (sda): re-mounted. Opts: EXT4-fs (sda): re-mounted. Opts: EXT4-fs (sda): re-mounted. Opts: EXT4-fs (sda): re-mounted. Opts: EXT4-fs (sda): re-mounted. Opts: -- You are receiving this mail because: You are the assignee for the bug.
Re: [PATCH v1 net-next 9/9] lan743x: Add PTP support
Hi Bryan, I love your patch! Yet something to improve: [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Bryan-Whitehead/lan743x-Add-features-to-lan743x-driver/20180706-051812 config: i386-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> ERROR: "__moddi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined! >> ERROR: "__udivdi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined! >> ERROR: "__umoddi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined! >> ERROR: "__divdi3" [drivers/net/ethernet/microchip/lan743x.ko] undefined! --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH v1 iproute2 1/2] uapi pkt_sched: Add etf info - DO NOT COMMIT
On Wed, 27 Jun 2018 15:09:11 -0700 Jesus Sanchez-Palencia wrote: > This should come from the next uapi headers update. > Sending it now just as a convenience so anyone can build tc with etf > and taprio support. > > Signed-off-by: Jesus Sanchez-Palencia This should be targeted at iproute2-next
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > I'm also not 100% on board with the argument that "future" FW can > > reshuffle things whatever way it wants to. Is the assumption that > > future ASICs/FW will be designed to always use the "blessed" BTF > > format? Or will it be reconfigurable at runtime? > > let's table configuration of metadata aside for a second. I agree that this should/could be NIC-specific and shouldn't weigh on the metadata interface between the drivers and XDP layer. > Describing metedata layout in BTF allows NICs to disclose everything > NIC has to users in a standard and generic way. > Whether firmware is reconfigurable on the fly or has to reflashed > and hw powercycled to have new md layout (and corresponding BTF > description) > is a separate discussion. > Saeed's proposal introduces the concept of 'offset' inside 'struct > xdp_md_info' > to reach 'hash' value in metadata. > Essentially it's a run-time way to access 'hash' instead of build- > time. > So bpf program would need two loads to read csum or hash field > instead of one. > With BTF the layout of metadata is known to the program at build- > time. > > To reiterate the proposal: > - driver+firmware keep layout of the metadata in BTF format (either > in the driver > or driver can read it from firmware) > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query > the driver and > generate normal C header file based on BTF in the given NIC > - user does #include "md_desc.h" and bpf program can access md->csum > or md->hash > with direct single load out of metadata area in front of the packet This piece is where I'd like to discuss more. When we discussed this in Seoul, the initial proposal was a static struct that we'd try to hammer out a common layout between the interested parties. That obviously wasn't going to scale, and we wanted to pursue something more dynamic. But I thought the goal was the XDP/eBPF program wouldn't want to care what the underlying device is, and could just ask for metadata that it's interested in. With this approach, your eBPF program is now bound/tied to the NIC/driver, and if you switched to a differen NIC/driver combo, then you'd have to rewrite part of your eBPF program to comprehend that. I thought we were trying to avoid that. Our proposed approach (still working on something ready to RFC) is to provide a method for the eBPF program to send a struct of requested hints down to the driver on load. If the driver can provide the hints, then that'd be how they'd be laid out in the metadata. If it can't provide them, we'd probably reject the program loading, or discuss providing a software fallback (I know this is an area of contention). I suppose we could get there with the rewriting mechanism described below, but that'd be a tough sell to set a bit of ABI for metadata, then change it to be potentially dynamic at runtime in the future. -PJ
[PATCH net 1/2] net/sched: act_csum: fix NULL dereference when 'goto chain' is used
the control action in the common member of struct tcf_csum must be a valid value, as it can contain the chain index when 'goto chain' is used. Ensure that the control action can be read as x->tcfa_action, when x is a pointer to struct tc_action and x->ops->type is TCA_ACT_CSUM, to prevent the following command: # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \ > $tcflags dst_mac $h2mac action csum ip or tcp or udp or sctp goto chain 1 from triggering a NULL pointer dereference when a matching packet is received. BUG: unable to handle kernel NULL pointer dereference at PGD 80010416b067 P4D 80010416b067 PUD 1041be067 PMD 0 Oops: [#1] SMP PTI CPU: 0 PID: 3072 Comm: mausezahn Tainted: GE 4.18.0-rc2.auguri+ #421 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013 RIP: 0010:tcf_action_exec+0xb8/0x100 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 RSP: 0018:a020dea03c40 EFLAGS: 00010246 RAX: 2001 RBX: a020d7ccef00 RCX: 0054 RDX: RSI: a020ca5ae000 RDI: a020d7ccef00 RBP: a020dea03e60 R08: R09: a020dea03c9c R10: a020dea03c78 R11: 0008 R12: a020d3fe4f00 R13: a020d3fe4f08 R14: 0001 R15: a020d53ca300 FS: 7f5a46942740() GS:a020dea0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 000104218002 CR4: 001606f0 Call Trace: fl_classify+0x1ad/0x1c0 [cls_flower] ? arp_rcv+0x121/0x1b0 ? __x2apic_send_IPI_dest+0x40/0x40 ? smp_reschedule_interrupt+0x1c/0xd0 ? reschedule_interrupt+0xf/0x20 ? reschedule_interrupt+0xa/0x20 ? device_is_rmrr_locked+0xe/0x50 ? iommu_should_identity_map+0x49/0xd0 ? __intel_map_single+0x30/0x140 ? e1000e_update_rdt_wa.isra.52+0x22/0xb0 [e1000e] ? e1000_alloc_rx_buffers+0x233/0x250 [e1000e] ? kmem_cache_alloc+0x38/0x1c0 tcf_classify+0x89/0x140 __netif_receive_skb_core+0x5ea/0xb70 ? enqueue_task_fair+0xb6/0x7d0 ? process_backlog+0x97/0x150 process_backlog+0x97/0x150 net_rx_action+0x14b/0x3e0 __do_softirq+0xde/0x2b4 do_softirq_own_stack+0x2a/0x40 do_softirq.part.18+0x49/0x50 __local_bh_enable_ip+0x49/0x50 __dev_queue_xmit+0x4ab/0x8a0 ? wait_woken+0x80/0x80 ? packet_sendmsg+0x38f/0x810 ? __dev_queue_xmit+0x8a0/0x8a0 packet_sendmsg+0x38f/0x810 sock_sendmsg+0x36/0x40 __sys_sendto+0x10e/0x140 ? do_vfs_ioctl+0xa4/0x630 ? syscall_trace_enter+0x1df/0x2e0 ? __audit_syscall_exit+0x22a/0x290 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5a45cbec93 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24 RSP: 002b:7ffd0ee6d748 EFLAGS: 0246 ORIG_RAX: 002c RAX: ffda RBX: 01161010 RCX: 7f5a45cbec93 RDX: 0062 RSI: 01161322 RDI: 0003 RBP: 7ffd0ee6d780 R08: 7ffd0ee6d760 R09: 0014 R10: R11: 0246 R12: 0062 R13: 01161322 R14: 7ffd0ee6d760 R15: 0003 Modules linked in: act_csum act_gact cls_flower sch_ingress vrf veth act_tunnel_key(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic hp_wmi iTCO_wdt sparse_keymap rfkill mei_wdt iTCO_vendor_support wmi_bmof gpio_ich irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_intel crypto_simd cryptd snd_hda_codec glue_helper snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm pcspkr i2c_i801 snd_timer snd sg lpc_ich soundcore wmi mei_me mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ahci libahci crc32c_intel i915 ixgbe serio_raw libata video dca i2c_algo_bit sfc drm_kms_helper syscopyarea mtd sysfillrect mdio sysimgblt fb_sys_fops drm e1000e i2c_core CR2: ---[ end trace 3c9e9d1a77df4026 ]--- RIP: 0010:tcf_action_exec+0xb8/0x100 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 RSP: 0018:a020dea03c40 EFLAGS: 00010246 RAX: 2001 RBX:
[PATCH net 2/2] net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used
the control action in the common member of struct tcf_tunnel_key must be a valid value, as it can contain the chain index when 'goto chain' is used. Ensure that the control action can be read as x->tcfa_action, when x is a pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to prevent the following command: # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \ > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1 from causing a NULL dereference when a matching packet is received: BUG: unable to handle kernel NULL pointer dereference at PGD 8001097ac067 P4D 8001097ac067 PUD 103b0a067 PMD 0 Oops: [#1] SMP PTI CPU: 0 PID: 3491 Comm: mausezahn Tainted: GE 4.18.0-rc2.auguri+ #421 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013 RIP: 0010:tcf_action_exec+0xb8/0x100 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 RSP: 0018:95145ea03c40 EFLAGS: 00010246 RAX: 2001 RBX: 9514499e5800 RCX: 0001 RDX: RSI: 0002 RDI: RBP: 95145ea03e60 R08: R09: 95145ea03c9c R10: 95145ea03c78 R11: 0008 R12: 951456a69800 R13: 951456a69808 R14: 0001 R15: 95144965ee40 FS: 7fd67ee11740() GS:95145ea0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0001038a2006 CR4: 001606f0 Call Trace: fl_classify+0x1ad/0x1c0 [cls_flower] ? __update_load_avg_se.isra.47+0x1ca/0x1d0 ? __update_load_avg_se.isra.47+0x1ca/0x1d0 ? update_load_avg+0x665/0x690 ? update_load_avg+0x665/0x690 ? kmem_cache_alloc+0x38/0x1c0 tcf_classify+0x89/0x140 __netif_receive_skb_core+0x5ea/0xb70 ? enqueue_entity+0xd0/0x270 ? process_backlog+0x97/0x150 process_backlog+0x97/0x150 net_rx_action+0x14b/0x3e0 __do_softirq+0xde/0x2b4 do_softirq_own_stack+0x2a/0x40 do_softirq.part.18+0x49/0x50 __local_bh_enable_ip+0x49/0x50 __dev_queue_xmit+0x4ab/0x8a0 ? wait_woken+0x80/0x80 ? packet_sendmsg+0x38f/0x810 ? __dev_queue_xmit+0x8a0/0x8a0 packet_sendmsg+0x38f/0x810 sock_sendmsg+0x36/0x40 __sys_sendto+0x10e/0x140 ? do_vfs_ioctl+0xa4/0x630 ? syscall_trace_enter+0x1df/0x2e0 ? __audit_syscall_exit+0x22a/0x290 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fd67e18dc93 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24 RSP: 002b:7ffe0189b748 EFLAGS: 0246 ORIG_RAX: 002c RAX: ffda RBX: 020ca010 RCX: 7fd67e18dc93 RDX: 0062 RSI: 020ca322 RDI: 0003 RBP: 7ffe0189b780 R08: 7ffe0189b760 R09: 0014 R10: R11: 0246 R12: 0062 R13: 020ca322 R14: 7ffe0189b760 R15: 0003 Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca CR2: ---[ end trace 1ab8b5b5d4639dfc ]--- RIP: 0010:tcf_action_exec+0xb8/0x100 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 RSP: 0018:95145ea03c40 EFLAGS: 00010246 RAX: 2001 RBX: 9514499e5800 RCX: 0001 RDX: RSI: 0002 RDI: RBP: 95145ea03e60 R08: R09: 95145ea03c9c R10: 95145ea03c78 R11: 0008 R12: 951456a69800 R13:
[PATCH net 0/2] net/sched: fix NULL dereference in 'goto chain' control action
in a couple of TC actions (i.e. csum and tunnel_key), the control action is stored together with the action-specific configuration data. This avoids a race condition (see [1]), but it causes a crash when 'goto chain' is used with the above actions. Since this race condition is tolerated on the other TC actions (it's present even on actions where the spinlock is still used), storing the control action in the common area should be acceptable for tunnel_key and csum as well. [1] https://www.spinics.net/lists/netdev/msg472047.html Davide Caratti (2): net/sched: act_csum: fix NULL dereference when 'goto chain' is used net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used include/net/tc_act/tc_csum.h | 1 - include/net/tc_act/tc_tunnel_key.h | 1 - net/sched/act_csum.c | 6 +++--- net/sched/act_tunnel_key.c | 6 +++--- 4 files changed, 6 insertions(+), 8 deletions(-) -- 2.17.1
[PATCH V2 net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2
From: Rick Farrington Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get speed" command which is sent by the 1.7.2 driver (for CN23XX-225 cards only) during modprobe. Due to a bug in older firmware (with respect to unknown commands), this unsupported command causes a cascade of errors that ends in a kernel panic. Fix it by making the sending of the "get speed" command conditional on the firmware version. Signed-off-by: Rick Farrington Signed-off-by: Felix Manlunas --- Patch Change Log: V1 -> V2: * In the if-statement that checks the firmware version, replace the boolean expression that calls strcmp() (which is not suitable when the firmware micro version has more than one digit) with a boolean expression that works in all cases. * In the patch description, specify which types of liquidio cards are afffected. drivers/net/ethernet/cavium/liquidio/lio_main.c| 26 -- .../net/ethernet/cavium/liquidio/octeon_device.h | 9 2 files changed, 33 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c index 7cb4e75..ebda6ef 100644 --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c @@ -3299,7 +3299,9 @@ static int setup_nic_devices(struct octeon_device *octeon_dev) { struct lio *lio = NULL; struct net_device *netdev; - u8 mac[6], i, j, *fw_ver; + u8 mac[6], i, j, *fw_ver, *micro_ver; + unsigned long micro; + u32 cur_ver; struct octeon_soft_command *sc; struct liquidio_if_cfg_context *ctx; struct liquidio_if_cfg_resp *resp; @@ -3429,6 +3431,14 @@ static int setup_nic_devices(struct octeon_device *octeon_dev) fw_ver); } + /* extract micro version field; point past '..' */ + micro_ver = fw_ver + strlen(LIQUIDIO_BASE_VERSION) + 1; + if (kstrtoul(micro_ver, 10, ) != 0) + micro = 0; + octeon_dev->fw_info.ver.maj = LIQUIDIO_BASE_MAJOR_VERSION; + octeon_dev->fw_info.ver.min = LIQUIDIO_BASE_MINOR_VERSION; + octeon_dev->fw_info.ver.rev = micro; + octeon_swap_8B_data((u64 *)(>cfg_info), (sizeof(struct liquidio_if_cfg_info)) >> 3); @@ -3671,7 +3681,19 @@ static int setup_nic_devices(struct octeon_device *octeon_dev) OCTEON_CN2350_25GB_SUBSYS_ID || octeon_dev->subsystem_id == OCTEON_CN2360_25GB_SUBSYS_ID) { - liquidio_get_speed(lio); + cur_ver = OCT_FW_VER(octeon_dev->fw_info.ver.maj, +octeon_dev->fw_info.ver.min, +octeon_dev->fw_info.ver.rev); + + /* speed control unsupported in f/w older than 1.7.2 */ + if (cur_ver < OCT_FW_VER(1, 7, 2)) { + dev_info(_dev->pci_dev->dev, +"speed setting not supported by f/w."); + octeon_dev->speed_setting = 25; + octeon_dev->no_speed_setting = 1; + } else { + liquidio_get_speed(lio); + } if (octeon_dev->speed_setting == 0) { octeon_dev->speed_setting = 25; diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.h b/drivers/net/ethernet/cavium/liquidio/octeon_device.h index 94a4ed88d..d99ca6b 100644 --- a/drivers/net/ethernet/cavium/liquidio/octeon_device.h +++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.h @@ -288,8 +288,17 @@ struct oct_fw_info { */ u32 app_mode; char liquidio_firmware_version[32]; + /* Fields extracted from legacy string 'liquidio_firmware_version' */ + struct { + u8 maj; + u8 min; + u8 rev; + } ver; }; +#define OCT_FW_VER(maj, min, rev) \ + (((u32)(maj) << 16) | ((u32)(min) << 8) | ((u32)(rev))) + /* wrappers around work structs */ struct cavium_wk { struct delayed_work work;
Re: [PATCH net-next 0/6] sock cookie initializers
On Fri, Jul 6, 2018 at 1:24 PM Jesus Sanchez-Palencia wrote: > > > > On 07/06/2018 07:12 AM, Willem de Bruijn wrote: > > From: Willem de Bruijn > > > > Recent UDP GSO and SO_TXTIME features added new fields to cookie > > structs. > > > > When adding a field, all sites where a struct is initialized have to > > be updated, which is a lot of boilerplate. Alternatively, a field can > > be initialized selectively, but this is fragile. I introduced a bug > > in udp gso where an uninitialized field was read. See also fix commit > > ("9887cba19978 ip: limit use of gso_size to udp"). > > > > Introduce initializers for structs ipcm(6)_cookie and sockc_cookie. > > > > patch 1..3 do exactly this. > > patch 4..5 make ipv4 and ipv6 handle cookies the same way and > >remove some boilerplate in doing so. > > patch 6removes the udp gso branch that needed the above fix > > > Acked-by: Jesus Sanchez-Palencia > > I've applied this series and tested SO_TXTIME + the etf qdisc and everything > is > working just fine. Thanks for testing, Jesus! I also ran the udpgso and txtimestamp tests (and am expanding the second to run as part of kselftest, and have ipv6 and cmsg support).
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
Hi Ka-Cheong, On 7/6/2018 8:25 AM, Sowmini Varadhan wrote: On (07/06/18 23:08), Ka-Cheong Poon wrote: As mentioned in a previous mail, it is unclear why the port number is transport specific. Most Internet services use the same port number running over TCP/UDP as shown in the IANA database. And the IANA RDS registration is the same. What is the rationale of having a transport specific number in the RDS implementation? because not every transport may need a port number. Lets keep separate port for RDMA and TCP transport. This has been already useful for wireshark dissector and can also help for eBPF like external tooling. The fragment format and re-assembly is different across transports. I do see your point and also agree that port number isn't transport specific and in case we need to add another transport, what port to use. But may be till then lets keep the existing behavior. As such this port switch is not related to IPv6 support as such so lets deal with it separately. Regards, Santosh
Re: [PATCH net-next 0/6] sock cookie initializers
On 07/06/2018 07:12 AM, Willem de Bruijn wrote: > From: Willem de Bruijn > > Recent UDP GSO and SO_TXTIME features added new fields to cookie > structs. > > When adding a field, all sites where a struct is initialized have to > be updated, which is a lot of boilerplate. Alternatively, a field can > be initialized selectively, but this is fragile. I introduced a bug > in udp gso where an uninitialized field was read. See also fix commit > ("9887cba19978 ip: limit use of gso_size to udp"). > > Introduce initializers for structs ipcm(6)_cookie and sockc_cookie. > > patch 1..3 do exactly this. > patch 4..5 make ipv4 and ipv6 handle cookies the same way and >remove some boilerplate in doing so. > patch 6removes the udp gso branch that needed the above fix Acked-by: Jesus Sanchez-Palencia I've applied this series and tested SO_TXTIME + the etf qdisc and everything is working just fine. Thanks, Jesus > > Willem de Bruijn (6): > ipv4: ipcm_cookie initializers > ipv6: ipcm6_cookie initializer > sock: sockc cookie initializer > ipv6: fold sockcm_cookie into ipcm6_cookie > ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6 > ip: unconditionally set cork gso_size > > include/net/ip.h | 16 ++- > include/net/ipv6.h | 26 > include/net/sock.h | 6 ++ > include/net/transp_v6.h | 3 +-- > net/ipv4/icmp.c | 11 ++ > net/ipv4/ip_output.c | 12 --- > net/ipv4/ping.c | 11 +- > net/ipv4/raw.c | 11 +- > net/ipv4/tcp.c | 2 +- > net/ipv4/udp.c | 12 +-- > net/ipv6/datagram.c | 4 ++-- > net/ipv6/icmp.c | 14 - > net/ipv6/ip6_flowlabel.c | 3 +-- > net/ipv6/ip6_output.c| 43 +--- > net/ipv6/ipv6_sockglue.c | 3 +-- > net/ipv6/ping.c | 7 ++- > net/ipv6/raw.c | 15 +- > net/ipv6/udp.c | 14 + > net/l2tp/l2tp_ip6.c | 10 +++--- > net/packet/af_packet.c | 9 +++-- > 20 files changed, 98 insertions(+), 134 deletions(-) >
[PATCH net-next 3/3] net: refactor dequeue-model list processing
New macro list_for_each_entry_dequeue loops over a list by popping entries from the head, allowing a more concise expression of the dequeue-enqueue model of list processing and avoiding the need for a 'next' pointer (as used in list_for_each_entry_safe). Signed-off-by: Edward Cree --- include/linux/list.h | 15 +++ include/linux/netfilter.h | 6 ++ net/core/dev.c| 6 ++ net/ipv4/ip_input.c | 10 -- net/ipv6/ip6_input.c | 10 -- 5 files changed, 27 insertions(+), 20 deletions(-) diff --git a/include/linux/list.h b/include/linux/list.h index de04cc5ed536..150751d03441 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -645,6 +645,21 @@ static inline void list_splice_tail_init(struct list_head *list, #define list_safe_reset_next(pos, n, member) \ n = list_next_entry(pos, member) +/** + * list_for_each_entry_dequeue - iterate over list by removing entries + * @pos: the type * to use as a loop cursor. + * @head: the head for your list. + * @member:the name of the list_head within the struct. + * + * Iterate over list of given type, removing each entry from the list before + * running the loop body. The loop will run until the list is empty, so + * adding an entry back onto the list in the loop body will requeue it. + */ +#define list_for_each_entry_dequeue(pos, head, member) \ + while (!list_empty(head) && \ + (pos = list_first_entry(head, typeof(*pos), member)) && \ + (list_del(>member), true)) + /* * Double linked lists with a single pointer list head. * Mostly useful for hash tables where the two pointer list head is diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 23b48de8c2e2..f9a037eb053d 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -293,15 +293,13 @@ NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct list_head *head, struct net_device *in, struct net_device *out, int (*okfn)(struct net *, struct sock *, struct sk_buff *)) { - struct sk_buff *skb, *next; struct list_head sublist; + struct sk_buff *skb; INIT_LIST_HEAD(); - list_for_each_entry_safe(skb, next, head, list) { - list_del(>list); + list_for_each_entry_dequeue(skb, head, list) if (nf_hook(pf, hook, net, sk, skb, in, out, okfn) == 1) list_add_tail(>list, ); - } /* Put passed packets back on main list */ list_splice(, head); } diff --git a/net/core/dev.c b/net/core/dev.c index ce4583564e00..68055f012d13 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4985,9 +4985,8 @@ static void netif_receive_skb_list_internal(struct list_head *head) struct list_head sublist; INIT_LIST_HEAD(); - list_for_each_entry_safe(skb, next, head, list) { + list_for_each_entry_dequeue(skb, head, list) { net_timestamp_check(netdev_tstamp_prequeue, skb); - list_del(>list); if (!skb_defer_rx_timestamp(skb)) list_add_tail(>list, ); } @@ -4996,9 +4995,8 @@ static void netif_receive_skb_list_internal(struct list_head *head) if (static_branch_unlikely(_xdp_needed_key)) { preempt_disable(); rcu_read_lock(); - list_for_each_entry_safe(skb, next, head, list) { + list_for_each_entry_dequeue(skb, head, list) { xdp_prog = rcu_dereference(skb->dev->xdp_prog); - list_del(>list); if (do_xdp_generic(xdp_prog, skb) == XDP_PASS) list_add_tail(>list, ); } diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index 1a3b6f32b1c9..26285a24c067 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -538,14 +538,13 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk, struct list_head *head) { struct dst_entry *curr_dst = NULL; - struct sk_buff *skb, *next; struct list_head sublist; + struct sk_buff *skb; INIT_LIST_HEAD(); - list_for_each_entry_safe(skb, next, head, list) { + list_for_each_entry_dequeue(skb, head, list) { struct dst_entry *dst; - list_del(>list); /* if ingress device is enslaved to an L3 master device pass the * skb to its handler for processing */ @@ -584,15 +583,14 @@ void ip_list_rcv(struct list_head *head, struct packet_type *pt, { struct net_device *curr_dev = NULL; struct net *curr_net = NULL; - struct sk_buff *skb, *next; struct list_head sublist; + struct sk_buff *skb; INIT_LIST_HEAD(); -
[PATCH net-next 1/3] net: core: fix uses-after-free in list processing
In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(), do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb). Thus, we cannot wait until after they return to remove the skb from the list; instead, we remove it first and, in the pass case, add it to a sublist afterwards. In the case of enqueue_to_backlog() we have already decided not to pass when we call the function, so we do not need a sublist. Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing") Reported-by: Dan Carpenter Signed-off-by: Edward Cree --- net/core/dev.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 89825c1eccdc..ce4583564e00 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4982,25 +4982,30 @@ static void netif_receive_skb_list_internal(struct list_head *head) { struct bpf_prog *xdp_prog = NULL; struct sk_buff *skb, *next; + struct list_head sublist; + INIT_LIST_HEAD(); list_for_each_entry_safe(skb, next, head, list) { net_timestamp_check(netdev_tstamp_prequeue, skb); - if (skb_defer_rx_timestamp(skb)) - /* Handled, remove from list */ - list_del(>list); + list_del(>list); + if (!skb_defer_rx_timestamp(skb)) + list_add_tail(>list, ); } + list_splice_init(, head); if (static_branch_unlikely(_xdp_needed_key)) { preempt_disable(); rcu_read_lock(); list_for_each_entry_safe(skb, next, head, list) { xdp_prog = rcu_dereference(skb->dev->xdp_prog); - if (do_xdp_generic(xdp_prog, skb) != XDP_PASS) - /* Dropped, remove from list */ - list_del(>list); + list_del(>list); + if (do_xdp_generic(xdp_prog, skb) == XDP_PASS) + list_add_tail(>list, ); } rcu_read_unlock(); preempt_enable(); + /* Put passed packets back on main list */ + list_splice_init(, head); } rcu_read_lock(); @@ -5011,9 +5016,9 @@ static void netif_receive_skb_list_internal(struct list_head *head) int cpu = get_rps_cpu(skb->dev, skb, ); if (cpu >= 0) { - enqueue_to_backlog(skb, cpu, >last_qtail); - /* Handled, remove from list */ + /* Will be handled, remove from list */ list_del(>list); + enqueue_to_backlog(skb, cpu, >last_qtail); } } }
[PATCH net-next 2/3] netfilter: fix use-after-free in NF_HOOK_LIST
nf_hook() can free the skb, so we need to remove it from the list before calling, and add passed skbs to a sublist afterwards. Fixes: 17266ee93984 ("net: ipv4: listified version of ip_rcv") Reported-by: Dan Carpenter Signed-off-by: Edward Cree --- include/linux/netfilter.h | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 5a5e0a2ab2a3..23b48de8c2e2 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -294,12 +294,16 @@ NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, int (*okfn)(struct net *, struct sock *, struct sk_buff *)) { struct sk_buff *skb, *next; + struct list_head sublist; + INIT_LIST_HEAD(); list_for_each_entry_safe(skb, next, head, list) { - int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn); - if (ret != 1) - list_del(>list); + list_del(>list); + if (nf_hook(pf, hook, net, sk, skb, in, out, okfn) == 1) + list_add_tail(>list, ); } + /* Put passed packets back on main list */ + list_splice(, head); } /* Call setsockopt() */
[PATCH net-next 0/3] fix use-after-free bugs in skb list processing
A couple of bugs in skb list handling were spotted by Dan Carpenter, with the help of Smatch; following up on them I found a couple more similar cases. This series fixes them by changing the relevant loops to use the dequeue-enqueue model (rather than in-place list modification), and then adds a list.h helper macro to refactor code using the dequeue-enqueue model. Edward Cree (3): net: core: fix uses-after-free in list processing netfilter: fix use-after-free in NF_HOOK_LIST net: refactor dequeue-model list processing include/linux/list.h | 15 +++ include/linux/netfilter.h | 16 +--- net/core/dev.c| 23 +-- net/ipv4/ip_input.c | 10 -- net/ipv6/ip6_input.c | 10 -- 5 files changed, 45 insertions(+), 29 deletions(-)
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > I'm also not 100% on board with the argument that "future" FW can > reshuffle things whatever way it wants to. Is the assumption that > future ASICs/FW will be designed to always use the "blessed" BTF > format? Or will it be reconfigurable at runtime? let's table configuration of metadata aside for a second. Describing metedata layout in BTF allows NICs to disclose everything NIC has to users in a standard and generic way. Whether firmware is reconfigurable on the fly or has to reflashed and hw powercycled to have new md layout (and corresponding BTF description) is a separate discussion. Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info' to reach 'hash' value in metadata. Essentially it's a run-time way to access 'hash' instead of build-time. So bpf program would need two loads to read csum or hash field instead of one. With BTF the layout of metadata is known to the program at build-time. To reiterate the proposal: - driver+firmware keep layout of the metadata in BTF format (either in the driver or driver can read it from firmware) - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and generate normal C header file based on BTF in the given NIC - user does #include "md_desc.h" and bpf program can access md->csum or md->hash with direct single load out of metadata area in front of the packet - llvm compiles bpf program and records how program is doing this md->csum accesses in BTF format as well (the compiler will be keeping such records for __sk_buff and all other structs too, but that's separate discussion) - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program accesses metadata (and other structs) matches BTF from the driver, so no surprises if driver+firmware got updated, but program is not recompiled - every NIC can have their own layout of metadata and its own meaning of the fields, but would be good to standardize at least a few common fields like hash Once this is working we can do more cool things with BTF. Like doing offset rewriting at program load time similar to what we plan to do for tracing. Tracing programs will be doing 'task->pid' access and the kernel will adjust offsetof(struct task_struct, pid) during program load depending on BTF for the kernel. The same trick we can do for networking metadata. The program will contain load instruction md->hash that will get automatically adjusted to proper offset depending on BTF of 'hash' field in the NIC. For now I'm proposing _not_ to go that far with offset rewriting and start with simple steps described above.
Re: [PATCH iproute2-next] tc: m_tunnel_key: Add tunnel option support to act_tunnel_key
On 7/5/18 6:12 PM, Jakub Kicinski wrote: > From: Simon Horman > > Allow setting tunnel options using the act_tunnel_key action. > > Options are expressed as class:type:data and multiple options > may be listed using a comma delimiter. > > # ip link add name geneve0 type geneve dstport 0 external > # tc qdisc add dev eth0 ingress > # tc filter add dev eth0 protocol ip parent : \ > flower indev eth0 \ > ip_proto udp \ > action tunnel_key \ > set src_ip 10.0.99.192 \ > dst_ip 10.0.99.193 \ > dst_port 6081 \ > id 11 \ > geneve_opts 0102:80:00800022,0102:80:00800022 \ > action mirred egress redirect dev geneve0 > > Signed-off-by: Simon Horman > Signed-off-by: Pieter Jansen van Vuuren > Reviewed-by: Jakub Kicinski > --- > man/man8/tc-tunnel_key.8 | 12 ++- > tc/m_tunnel_key.c| 177 +++ > 2 files changed, 188 insertions(+), 1 deletion(-) > applied to iproute2-next. Thanks
Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode
2018-07-06, 09:28:48 -0600, David Ahern wrote: > On 7/6/18 9:02 AM, Sabrina Dubroca wrote: > > 2018-07-06, 08:42:01 -0600, David Ahern wrote: > >> On 7/6/18 7:49 AM, Sabrina Dubroca wrote: > >>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > >>> index 91580c62bb86..e9ba53d2a147 100644 > >>> --- a/net/ipv6/addrconf.c > >>> +++ b/net/ipv6/addrconf.c > >>> @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct > >>> ctl_table *ctl, int write, > >>>loff_t *ppos) > >>> { > >>> int ret = 0; > >>> - int new_val; > >>> + u32 new_val; > >>> struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1; > >>> struct net *net = (struct net *)ctl->extra2; > >>> + struct ctl_table tmp = { > >>> + .data = _val, > >>> + .maxlen = sizeof(new_val), > >>> + .mode = ctl->mode, > >>> + }; > >>> > >>> if (!rtnl_trylock()) > >>> return restart_syscall(); > >>> > >>> - ret = proc_dointvec(ctl, write, buffer, lenp, ppos); > >>> + new_val = *((u32 *)ctl->data); > >>> > >>> - if (write) { > >>> - new_val = *((int *)ctl->data); > >>> + ret = proc_douintvec(, write, buffer, lenp, ppos); > >>> + if (ret != 0) > >>> + goto out; > >>> > >>> + if (write) { > >>> if (check_addr_gen_mode(new_val) < 0) { > >>> ret = -EINVAL; > >>> goto out; > >>> } > >>> > >>> - /* request for default */ > >>> - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) { > >>> - ipv6_devconf_dflt.addr_gen_mode = new_val; > >> > >> updating the template is wrong, but you still need to update the > >> namespace's default value for new devices. > > > > That's already handled by storing new_val into ctl->data at the end of > > the 'write' block. > > ok. missed that. It's part of your change below. > > > > > BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so > > that people aren't tempted to update the template, but I'm thinking of > > doing that in net-next rather than net. > > > >> also, if you are fixing this it would be good to handle the change to > >> 'all' as well and update all existing devices. > > > > I thought about it, and wasn't sure if that change of behavior was > > acceptable, especially for stable (I think the current patch should go > > into stable). OTOH, it's quite clearly what "all" should do. > > IMHO it's a bug that changing 'all' does not actually change all > existing devices nor is it ever used. Right. I'll add that as a separate patch in this series, unless you really prefer the change squashed into this patch. > Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is > messed up as well. It propagates a change to 'default' to all existing > devices. I guess it was intentional, given: if (>ipv6.devconf_all->stable_secret == ctl->data) return -EIO; It only propagates the mode, and not the secret itself, to all devices. After thinking about it for a while, I guess it considers the new default not only as default for newly created devices, but also for newly added addresses/prefixes. Or am I making stuff up? -- Sabrina
Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc
On 7/5/18 4:42 PM, Jesus Sanchez-Palencia wrote: > +static int get_clockid(__s32 *val, const char *arg) > +{ > + const struct static_clockid { > + const char *name; > + clockid_t clockid; > + } clockids_sysv[] = { > + { "CLOCK_REALTIME", CLOCK_REALTIME }, > + { "CLOCK_TAI", CLOCK_TAI }, > + { "CLOCK_BOOTTIME", CLOCK_BOOTTIME }, > + { "CLOCK_MONOTONIC", CLOCK_MONOTONIC }, > + { NULL } > + }; > + > + const struct static_clockid *c; > + > + for (c = clockids_sysv; c->name; c++) { > + if (strncasecmp(c->name, arg, 25) == 0) { Why 25? be nice to allow shortcuts -- e.g., just REALTIME or realtime. > + *val = c->clockid; > + > + return 0; > + } > + } > + > + return -1; > +} > + > + > +static int etf_parse_opt(struct qdisc_util *qu, int argc, > + char **argv, struct nlmsghdr *n, const char *dev) > +{ > + struct tc_etf_qopt opt = { > + .clockid = CLOCKID_INVALID, > + }; > + struct rtattr *tail; > + > + while (argc > 0) { > + if (matches(*argv, "offload") == 0) { > + if (opt.flags & TC_ETF_OFFLOAD_ON) { > + fprintf(stderr, "etf: duplicate \"offload\" > specification\n"); > + return -1; > + } > + > + opt.flags |= TC_ETF_OFFLOAD_ON; > + } else if (matches(*argv, "deadline_mode") == 0) { > + if (opt.flags & TC_ETF_DEADLINE_MODE_ON) { > + fprintf(stderr, "etf: duplicate > \"deadline_mode\" specification\n"); > + return -1; > + } > + > + opt.flags |= TC_ETF_DEADLINE_MODE_ON; > + } else if (matches(*argv, "delta") == 0) { > + NEXT_ARG(); > + if (opt.delta) { > + fprintf(stderr, "etf: duplicate \"delta\" > specification\n"); > + return -1; > + } > + if (get_s32(, *argv, 0)) { > + explain1("delta", *argv); > + return -1; > + } > + } else if (matches(*argv, "clockid") == 0) { > + NEXT_ARG(); > + if (opt.clockid != CLOCKID_INVALID) { > + fprintf(stderr, "etf: duplicate \"clockid\" > specification\n"); > + return -1; > + } > + if (get_clockid(, *argv)) { > + explain_clockid(*argv); > + return -1; > + } > + } else if (strcmp(*argv, "help") == 0) { > + explain(); > + return -1; > + } else { > + fprintf(stderr, "etf: unknown parameter \"%s\"\n", > *argv); > + explain(); > + return -1; > + } > + argc--; argv++; > + } > + > + tail = NLMSG_TAIL(n); > + addattr_l(n, 1024, TCA_OPTIONS, NULL, 0); > + addattr_l(n, 2024, TCA_ETF_PARMS, , sizeof(opt)); > + tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail; > + return 0; > +} > + > +static int etf_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt) > +{ > + struct rtattr *tb[TCA_ETF_MAX+1]; > + struct tc_etf_qopt *qopt; > + > + if (opt == NULL) > + return 0; > + > + parse_rtattr_nested(tb, TCA_ETF_MAX, opt); > + > + if (tb[TCA_ETF_PARMS] == NULL) > + return -1; > + > + qopt = RTA_DATA(tb[TCA_ETF_PARMS]); > + if (RTA_PAYLOAD(tb[TCA_ETF_PARMS]) < sizeof(*qopt)) > + return -1; > + > + if (qopt->clockid == CLOCKID_INVALID) > + print_string(PRINT_ANY, "clockid", "clockid %s ", "invalid"); > + else > + print_uint(PRINT_ANY, "clockid", "clockid %d ", qopt->clockid); If you allow the user to input a string, then you should return it here too.
Re: [PATCH bpf-next 11/11] tools: bpftool: allow reuse of maps with bpftool prog load
On Fri, 6 Jul 2018 09:16:24 +0200, Daniel Borkmann wrote: > On 07/06/2018 12:57 AM, Jakub Kicinski wrote: > > On Thu, 5 Jul 2018 10:35:24 +0200, Daniel Borkmann wrote: > >> On 07/04/2018 04:54 AM, Jakub Kicinski wrote: > >>> Add map parameter to prog load which will allow reuse of existing > >>> maps instead of creating new ones. > >>> > >>> Signed-off-by: Jakub Kicinski > >>> Reviewed-by: Quentin Monnet > >> [...] > >>> + > >>> + fd = map_parse_fd(, ); > >>> + if (fd < 0) > >>> + goto err_free_reuse_maps; > >>> + > >>> + map_replace = reallocarray(map_replace, old_map_fds + 1, > >>> +sizeof(*map_replace)); > >>> + if (!map_replace) { > >>> + p_err("mem alloc failed"); > >>> + goto err_free_reuse_maps; > >> > >> Series in general looks good to me. However, above reallocarray() doesn't > >> exist and hence build fails, please see below. Is that from newest glibc? > >> > >> You probably need some fallback implementation or in general have something > >> bpftool internal that doesn't make a bet on its availability. > >> > >> # make > >> > >> Auto-detecting system features: > >> ...libbfd: [ on ] > >> ...disassembler-four-args: [ OFF ] > >> > >> CC bpf_jit_disasm.o > >> LINK bpf_jit_disasm > >> CC bpf_dbg.o > >> LINK bpf_dbg > >> CC bpf_asm.o > >> BISONbpf_exp.yacc.c > >> CC bpf_exp.yacc.o > >> FLEX bpf_exp.lex.c > >> CC bpf_exp.lex.o > >> LINK bpf_asm > >> DESCEND bpftool > >> > >> Auto-detecting system features: > y>> ...libbfd: [ on ] > >> ...disassembler-four-args: [ OFF ] > >> > >> CC map_perf_ring.o > >> CC xlated_dumper.o > >> CC perf.o > >> CC prog.o > >> prog.c: In function ‘do_load’: > >> prog.c:785:18: warning: implicit declaration of function ‘reallocarray’ > >> [-Wimplicit-function-declaration] > >> map_replace = reallocarray(map_replace, old_map_fds + 1, > >> ^~~~ > >> prog.c:785:16: warning: assignment makes pointer from integer without a > >> cast [-Wint-conversion] > >> map_replace = reallocarray(map_replace, old_map_fds + 1, > >> ^ > >> CC common.o > >> CC cgroup.o > >> CC main.o > >> CC json_writer.o > >> CC cfg.o > >> CC map.o > >> CC jit_disasm.o > >> CC disasm.o > >> > >> Auto-detecting system features: > >> ...libelf: [ on ] > >> ... bpf: [ on ] > >> > >> Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs > >> from latest version at 'include/uapi/linux/bpf.h' > >> CC libbpf.o > >> CC bpf.o > >> CC nlattr.o > >> CC btf.o > >> LD libbpf-in.o > >> LINK libbpf.a > >> LINK bpftool > >> prog.o: In function `do_load': > >> prog.c:(.text+0x23d): undefined reference to `reallocarray' > >> collect2: error: ld returned 1 exit status > >> Makefile:89: recipe for target 'bpftool' failed > >> make[1]: *** [bpftool] Error 1 > >> Makefile:99: recipe for target 'bpftool' failed > >> make: *** [bpftool] Error 2 > > > > Oh no.. Sorry & thanks for catching this. It would be nice to not > > depend on Glibc version defines, in case someone is using a different > > library. Jiong suggested we can just use the feature detection, so I > > have something like this: > > Yeah, that would be okay to do if you want to go that route, sure. Other > option > I had in mind would have been to import include/linux/overflow.h into the > tools/include/linux/ and have a minor wrapper similar to kmalloc_array() in a > utils.h in bpftool to get to the same for all users. But I think feature test > is totally fine too, and in general some form of reallocarray() would be good > to have rather than plain realloc(). > > So, below complies for me. Although don't we need to define a CFLAG based on > the outcome of the test similar as in feature-disassembler-four-args? > Otherwise > with the below diff the test doesn't really do much, no? Meaning, adding a ... > > ifeq ($(feature-reallocarray), 1) > CFLAGS += -DHAVE_REALLOCARRAY > endif > > ... to the Makefile and in compat.h having it enabled through: > > #ifndef HAVE_REALLOCARRAY > static inline void *reallocarray(void *ptr, size_t nmemb, size_t size) > { > return realloc(ptr, nmemb * size); > } > #endif Ugh, you're very right, reallocarray is a weak alias in glibc, which is probably why I didn't see any errors. > In any case, we could use reallocarray() also in couple of other places in > bpftool and libbpf. I'll look into it, too.
Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode
On 7/6/18 9:02 AM, Sabrina Dubroca wrote: > 2018-07-06, 08:42:01 -0600, David Ahern wrote: >> On 7/6/18 7:49 AM, Sabrina Dubroca wrote: >>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c >>> index 91580c62bb86..e9ba53d2a147 100644 >>> --- a/net/ipv6/addrconf.c >>> +++ b/net/ipv6/addrconf.c >>> @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct >>> ctl_table *ctl, int write, >>> loff_t *ppos) >>> { >>> int ret = 0; >>> - int new_val; >>> + u32 new_val; >>> struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1; >>> struct net *net = (struct net *)ctl->extra2; >>> + struct ctl_table tmp = { >>> + .data = _val, >>> + .maxlen = sizeof(new_val), >>> + .mode = ctl->mode, >>> + }; >>> >>> if (!rtnl_trylock()) >>> return restart_syscall(); >>> >>> - ret = proc_dointvec(ctl, write, buffer, lenp, ppos); >>> + new_val = *((u32 *)ctl->data); >>> >>> - if (write) { >>> - new_val = *((int *)ctl->data); >>> + ret = proc_douintvec(, write, buffer, lenp, ppos); >>> + if (ret != 0) >>> + goto out; >>> >>> + if (write) { >>> if (check_addr_gen_mode(new_val) < 0) { >>> ret = -EINVAL; >>> goto out; >>> } >>> >>> - /* request for default */ >>> - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) { >>> - ipv6_devconf_dflt.addr_gen_mode = new_val; >> >> updating the template is wrong, but you still need to update the >> namespace's default value for new devices. > > That's already handled by storing new_val into ctl->data at the end of > the 'write' block. ok. missed that. It's part of your change below. > > BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so > that people aren't tempted to update the template, but I'm thinking of > doing that in net-next rather than net. > >> also, if you are fixing this it would be good to handle the change to >> 'all' as well and update all existing devices. > > I thought about it, and wasn't sure if that change of behavior was > acceptable, especially for stable (I think the current patch should go > into stable). OTOH, it's quite clearly what "all" should do. IMHO it's a bug that changing 'all' does not actually change all existing devices nor is it ever used. Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is messed up as well. It propagates a change to 'default' to all existing devices.
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On (07/06/18 23:08), Ka-Cheong Poon wrote: > > As mentioned in a previous mail, it is unclear why the > port number is transport specific. Most Internet services > use the same port number running over TCP/UDP as shown > in the IANA database. And the IANA RDS registration is > the same. What is the rationale of having a transport > specific number in the RDS implementation? because not every transport may need a port number. e.g., "RDS over pigeon carrier" may not need a port number. in a different design (e.g., KCM) the listen port is configurable with sysctl Some may need more than one, e.g., rds_rdma, evidently. What is the problem with using the unused 18635 for the RDS_CM_PORT?
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
On (07/06/18 22:36), Ka-Cheong Poon wrote: > This patch series does not change existing behavior. But > I think this is a strange RDS semantics as it differs from > other types of socket. But this is not about IPv6 support > and can be dealt with later. sure, > > Since we are choosing to treat this behavior as a feature we > > need to be consistent in rds_getname(). If we find > > (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, > > the returned address (address size, address family) should be based > > on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a > > bind(), and try to do a getsockname(), I get back a sockaddr_in?!) > > > OK, will change that. > >- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg > > As DaveM has already pointed out, we should be using sa_family to figure > > out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting > > len first, and then the family- that way wont work if I pass in > > sockaddr_storage). E.g., see inet_dgram_connect. > > > > if (addr_len < sizeof(uaddr->sa_family)) > > return -EINVAL; > OK, will change that. thank you > >- In net/rds/rds.h; > > > > /* The following ports, 16385, 18634, 18635, are registered with IANA as > >* the ports to be used for RDS over TCP and UDP. 18634 is the historical > > > > What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. > > IN fact RDS has nothing to do with UDP. The comment is confused. See next > > item below, where the comment disappears. > > > As mentioned in the above comments, they are registered with IANA. 16385 is registered for RDS-TCP. what does it mean to run rds-tcp and rds_rdma with the CM at the same time? Is this possible? (if it is, why not poach some other number like 2049 from NFS?!) 18635 is actually not used in the RDS code. Why can you not use that for RDS_CM_PORT? In general, please do NOT pull up these definitions into net/rds/rds.h They may change in the future- and the really belong in the transport specific header file - you question this further below, see my answer there. > There is no RDS/UDP in Linux but the port numbers are registered > nevertheless. And RDS can work over UDP AFAICT. BTW, it is really No it cannot. RDS needs a reliable transport. If you start using UDP (or pigeon-carrier) for the transport you have to build the reliability yourself. The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP case they do their own congestion management in userspace, and history has shown that it is hard to get that right, thus the desire to switch to rds-tcp). Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP", that is probably the most harmless one to use for this case, because - it is unused, - it calls itself rds-Infiniband, - the likelihood of doing rds-udp is infinitesmally small (it is more likely that we will need to send packets from abc::1 -> fe80:2 before we need rds-udp :-) :-) :-)) - and if rds-udp happens, we can use 16385/udp for rds-udp so please use 18635 for your RDS_CM_PORT and move it to IB specific header files and lets converge this thread. Towing RDS_TCP_PORT around has absolutely nothing to do with your ipv6 patch set. > strange that RDS/TCP does not use the port number already registered. > Anyway, the above comments are just a note on this fact in the IANA 16385 was in defacto use for a long time (since 2.6 kernels), thus I registered it as well when I started fixing this up. the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp. They are available for your use. > database. The following is a link to the IANA assignment. yes, I am aware > IMHO, there should only be RDS_PORT in the first place. And it > can be used for all transports. For example, if RDS/SCTP is added, > it can also use the same port number. There is no need to define if/when we add rds/sctp, we shall keep that in mind. This is getting to be "arguing for the sake of arguing". I dont have the time for that. > a new port number for each transport. What is the rationale that > each transport needs to have its own number and be defined in its > own header file? Some transports may not even need a port number. Some may need several. Some may use sysctl (suggested by Tom Herbert) to make this configurable. Until recently, we had rds/iWARP, that may need its own transport hooks, it does not make sense to peanut-butter that into the core module. That is why it has to be in the transport. I hope that answers the question. > If the behavior of a software component is modified/enhanced such > that the existing API of this component has problems dealing with > this new behavior, should the API be modified or a new API be added > to handle that? If neither is done, shouldn't this be considered > a bug? whatever. The design (parallelization for perf) is fine. Some parts of
Re: [PATCH iproute2-next] tc: m_tunnel_key: Add tunnel option support to act_tunnel_key
On Fri, Jul 6, 2018 at 6:32 AM, Roman Mashak wrote: > Jakub Kicinski writes: > >> From: Simon Horman >> >> Allow setting tunnel options using the act_tunnel_key action. >> >> Options are expressed as class:type:data and multiple options >> may be listed using a comma delimiter. >> >> # ip link add name geneve0 type geneve dstport 0 external >> # tc qdisc add dev eth0 ingress >> # tc filter add dev eth0 protocol ip parent : \ >> flower indev eth0 \ >> ip_proto udp \ >> action tunnel_key \ >> set src_ip 10.0.99.192 \ >> dst_ip 10.0.99.193 \ >> dst_port 6081 \ >> id 11 \ >> geneve_opts 0102:80:00800022,0102:80:00800022 \ >> action mirred egress redirect dev geneve0 > > [...] > > Jakub, could you also add relevant tests for the new option in file > $(kernel)/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json? We'll look into it!
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On 07/06/2018 06:15 PM, Sowmini Varadhan wrote: On (07/06/18 17:08), Ka-Cheong Poon wrote: Hmm. Why do you need to include tcp header in ib transport code ? If there is any common function just move to core common file and use it. I think it can be removed as it is left over from earlier changes when the IB IPv6 listener port was RDS_TCP_PORT. Now all the port definitions are in rds.h. Transport specific information such as port numbers should not be smeared into the common rds-module. Please fix that. If IB is also piggybacking on port 16385 (why?), please use your own definitions for it in IB specific header files. As mentioned in a previous mail, it is unclear why the port number is transport specific. Most Internet services use the same port number running over TCP/UDP as shown in the IANA database. And the IANA RDS registration is the same. What is the rationale of having a transport specific number in the RDS implementation? -- K. Poon ka-cheong.p...@oracle.com
Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode
2018-07-06, 08:42:01 -0600, David Ahern wrote: > On 7/6/18 7:49 AM, Sabrina Dubroca wrote: > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > > index 91580c62bb86..e9ba53d2a147 100644 > > --- a/net/ipv6/addrconf.c > > +++ b/net/ipv6/addrconf.c > > @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct > > ctl_table *ctl, int write, > > loff_t *ppos) > > { > > int ret = 0; > > - int new_val; > > + u32 new_val; > > struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1; > > struct net *net = (struct net *)ctl->extra2; > > + struct ctl_table tmp = { > > + .data = _val, > > + .maxlen = sizeof(new_val), > > + .mode = ctl->mode, > > + }; > > > > if (!rtnl_trylock()) > > return restart_syscall(); > > > > - ret = proc_dointvec(ctl, write, buffer, lenp, ppos); > > + new_val = *((u32 *)ctl->data); > > > > - if (write) { > > - new_val = *((int *)ctl->data); > > + ret = proc_douintvec(, write, buffer, lenp, ppos); > > + if (ret != 0) > > + goto out; > > > > + if (write) { > > if (check_addr_gen_mode(new_val) < 0) { > > ret = -EINVAL; > > goto out; > > } > > > > - /* request for default */ > > - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) { > > - ipv6_devconf_dflt.addr_gen_mode = new_val; > > updating the template is wrong, but you still need to update the > namespace's default value for new devices. That's already handled by storing new_val into ctl->data at the end of the 'write' block. BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so that people aren't tempted to update the template, but I'm thinking of doing that in net-next rather than net. > also, if you are fixing this it would be good to handle the change to > 'all' as well and update all existing devices. I thought about it, and wasn't sure if that change of behavior was acceptable, especially for stable (I think the current patch should go into stable). OTOH, it's quite clearly what "all" should do. > > - > > - /* request for individual net device */ > > - } else { > > - if (!idev) > > - goto out; > > - > > + if (idev) { > > if (check_stable_privacy(idev, net, new_val) < 0) { > > ret = -EINVAL; > > goto out; > > @@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct > > ctl_table *ctl, int write, > > addrconf_dev_config(idev->dev); > > } > > } > > + > > + *((u32 *)ctl->data) = new_val; > > } > > > > out: > > > -- Sabrina
Re: [PATCH iproute2 net-next] bridge: add support for isolated option
On 7/3/18 6:42 AM, Nikolay Aleksandrov wrote: > This patch adds support for the new isolated port option which, if set, > would allow the isolated ports to communicate only with non-isolated > ports and the bridge device. The option can be set via the bridge or ip > link type bridge_slave commands, e.g.: > $ ip link set dev eth0 type bridge_slave isolated on > $ bridge link set dev eth0 isolated on > > Signed-off-by: Nikolay Aleksandrov > --- > bridge/link.c| 11 +++ > ip/iplink_bridge_slave.c | 9 + > man/man8/bridge.8| 6 ++ > man/man8/ip-link.8.in| 6 -- > 4 files changed, 30 insertions(+), 2 deletions(-) > applied to iproute2-next. Thanks
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On 7/6/18 2:08 AM, Ka-Cheong Poon wrote: On 07/06/2018 01:58 AM, Santosh Shilimkar wrote: diff --git a/net/rds/connection.c b/net/rds/connection.c index abef75d..ca72563 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -142,9 +151,12 @@ static void __rds_conn_path_init(struct rds_connection *conn, * are torn down as the module is removed, if ever. */ static struct rds_connection *__rds_conn_create(struct net *net, - __be32 laddr, __be32 faddr, - struct rds_transport *trans, gfp_t gfp, - int is_outgoing) + const struct in6_addr *laddr, + const struct in6_addr *faddr, + struct rds_transport *trans, + gfp_t gfp, + int is_outgoing, + int dev_if) Am just wondering if we can handle local link address case differently instead of pushing the interface index everywhere. Did you think about any alternative. This can also avoid bunch of changes again and hence the question. Am trying to see if we can minimize the changes to absolute must have to support IPv6. If link local address is supported, scoped ID must be used. Unless we remove the support of link local address, we need to keep scope ID. Ok. diff --git a/net/rds/ib.h b/net/rds/ib.h index a6f4d7d..12f96b3 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h +union rds_ib_conn_priv { + struct rds_ib_connect_private ricp_v4; + struct rds6_ib_connect_private ricp_v6; }; This change was invetiable. Just add a comment saying the priviate data size for v6 vs v4 is different. Also for IPv6 priviate data, I suggest add some resrve fields for any future extensions so that things can be added without breaking wire protocol. With IPv4, we ended up in bit of mess. Will add some comments. Unfortunately the IB private data exchange has only a limited space. I don't think there is any more space left for reserved field. I think we should think of a different way to handle extensions in future. There is enough space. You can send upto 512 bytes iirc. Please add some reserve fields and ping me if you run into issues. diff --git a/net/rds/send.c b/net/rds/send.c index 94c7f74..6ed2e92 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -1081,27 +1085,59 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len) goto out; } - if (msg->msg_namelen) { - /* XXX fail non-unicast destination IPs? */ - if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) { + namelen = msg->msg_namelen; + if (namelen != 0) { + if (namelen < sizeof(*usin)) { + ret = -EINVAL; + goto out; + } + switch (namelen) { + case sizeof(*usin): + if (usin->sin_family != AF_INET || + usin->sin_addr.s_addr == htonl(INADDR_ANY) || + usin->sin_addr.s_addr == htonl(INADDR_BROADCAST) || + IN_MULTICAST(ntohl(usin->sin_addr.s_addr))) { + ret = -EINVAL; The idea was to fail non-unicast IP someday but can you not add this change as part of v6 series. We can look at it later unless you need this change for v6. Again the question is mainly to add only necessary check and leave the existing behavior as is so please re-check all below checks too. This is to match the same IPv6 check. In IPv6 code, it checks if the address is a unicast address. I can remove the IPv4 checks but if an app does send to a multicast address, the connection will be stuck forever. OK. Lets keep it then. diff --git a/net/rds/transport.c b/net/rds/transport.c index 0b188dd..c9788db 100644 --- a/net/rds/transport.c +++ b/net/rds/transport.c + if (ipv6_addr_v4mapped(addr)) { Dave already commented on ipv6_addr_v4mapped(). Apart from some of those comments questions, rest of the patch looks really good and nicely done. Will also look at your subsequent two patches and try to send you comments in next few days. Do you mean using mapped address to create IPv4 connections? I already have it in my work space. Will be in v3. yeah. Thanks !!
Re: [PATCH net 3/3] net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE
On 7/6/18 7:49 AM, Sabrina Dubroca wrote: > inet6_ifla6_size() is called to check how much space is needed by > inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include > the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. > > Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") > Signed-off-by: Sabrina Dubroca > --- > net/ipv6/addrconf.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > Reviewed-by: David Ahern
Re: [PATCH net 2/3] net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev
On 7/6/18 7:49 AM, Sabrina Dubroca wrote: > The value has already been copied from this netns's devconf_dflt, it > shouldn't be reset to the global kernel default. > > Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address > generation mode") > Signed-off-by: Sabrina Dubroca > --- > net/ipv6/addrconf.c | 2 -- > 1 file changed, 2 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode
On 7/6/18 7:49 AM, Sabrina Dubroca wrote: > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index 91580c62bb86..e9ba53d2a147 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct > ctl_table *ctl, int write, >loff_t *ppos) > { > int ret = 0; > - int new_val; > + u32 new_val; > struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1; > struct net *net = (struct net *)ctl->extra2; > + struct ctl_table tmp = { > + .data = _val, > + .maxlen = sizeof(new_val), > + .mode = ctl->mode, > + }; > > if (!rtnl_trylock()) > return restart_syscall(); > > - ret = proc_dointvec(ctl, write, buffer, lenp, ppos); > + new_val = *((u32 *)ctl->data); > > - if (write) { > - new_val = *((int *)ctl->data); > + ret = proc_douintvec(, write, buffer, lenp, ppos); > + if (ret != 0) > + goto out; > > + if (write) { > if (check_addr_gen_mode(new_val) < 0) { > ret = -EINVAL; > goto out; > } > > - /* request for default */ > - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) { > - ipv6_devconf_dflt.addr_gen_mode = new_val; updating the template is wrong, but you still need to update the namespace's default value for new devices. also, if you are fixing this it would be good to handle the change to 'all' as well and update all existing devices. > - > - /* request for individual net device */ > - } else { > - if (!idev) > - goto out; > - > + if (idev) { > if (check_stable_privacy(idev, net, new_val) < 0) { > ret = -EINVAL; > goto out; > @@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct > ctl_table *ctl, int write, > addrconf_dev_config(idev->dev); > } > } > + > + *((u32 *)ctl->data) = new_val; > } > > out: >
Re: [RFC PATCH] ip: re-introduce fragments cache worker
On 07/06/2018 06:56 AM, Paolo Abeni wrote: > With: > > schedule_work_on(smp_processor_id(), #... ) > > We can be sure to run exclusively on the cpu handling the RX queue even with > the worker. > Make sure to test your patch with 16 RX queues (16 cpus) feeding the defrag unit. In this (common) mode, then having one cpu trying to purge frags wont be enough. Really the GC can not cope with DDOS, we tried that in the past (IPv4 route cache) and failed miserably.
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
On 07/06/2018 01:44 AM, Sowmini Varadhan wrote: - rds_getname(): one of the existing properties of PF_RDS is that a connect() does not involve an implicit bind(). Note that you are basing the changes in rds_bind() on this behavior, thus I guess we make the choice of treating this as a feature, not a bug. This patch series does not change existing behavior. But I think this is a strange RDS semantics as it differs from other types of socket. But this is not about IPv6 support and can be dealt with later. Since we are choosing to treat this behavior as a feature we need to be consistent in rds_getname(). If we find (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, the returned address (address size, address family) should be based on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a bind(), and try to do a getsockname(), I get back a sockaddr_in?!) OK, will change that. - rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg As DaveM has already pointed out, we should be using sa_family to figure out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting len first, and then the family- that way wont work if I pass in sockaddr_storage). E.g., see inet_dgram_connect. if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; OK, will change that. - In net/rds/rds.h; /* The following ports, 16385, 18634, 18635, are registered with IANA as * the ports to be used for RDS over TCP and UDP. 18634 is the historical What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. IN fact RDS has nothing to do with UDP. The comment is confused. See next item below, where the comment disappears. As mentioned in the above comments, they are registered with IANA. There is no RDS/UDP in Linux but the port numbers are registered nevertheless. And RDS can work over UDP AFAICT. BTW, it is really strange that RDS/TCP does not use the port number already registered. Anyway, the above comments are just a note on this fact in the IANA database. The following is a link to the IANA assignment. https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt - Also in net/rds/rds.h Please dont define transport specific parameters like RD_CM_PORT in the common rds.h header. It is unfortunate that we already have RDS_PORT there, and we should try to clean that up as well. NOte that RDS_TCP_PORT is now in the correct transport-module-specific header (net/rds/tcp.h) and its unclean to drag it from there, into the common header as you are doing. In fact I just tried to move the RDS_PORT definition into net/rds/rdma_transport.h and it built just-fine. So to summarize, please do the following: 1. move RDS_PORT into rdma_transport.h 2. add RDS_CM_PORT into rdma_transport.h 3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h to net/rds/rds.h IMHO, there should only be RDS_PORT in the first place. And it can be used for all transports. For example, if RDS/SCTP is added, it can also use the same port number. There is no need to define a new port number for each transport. What is the rationale that each transport needs to have its own number and be defined in its own header file? - net/rds/connection.c As we have discussed offline before, the fact that we cannot report TCP seq# etc info via the existing rds-info API is not "a bug in the design of MPRDS" but rather a lacking in the API design. Moreover, much of the useful information around the TCP socket is already available via procfs, TCP_INFO etc, so the info by rds-info is rarely used for rds-tcp- the more useful information is around the RDS socket itself. So there is a bug in the comment, would be nice if you removed it. If the behavior of a software component is modified/enhanced such that the existing API of this component has problems dealing with this new behavior, should the API be modified or a new API be added to handle that? If neither is done, shouldn't this be considered a bug? Also, while you are there, s/exisiting/existing, please. OK, with change that. General comments: - I remain unconvinced by your global <-> link-local arguments. For UDP sockets we can do this: eth0 host1 -- host2 abc::1/64 fe80::2 add abc::/64 as onlink subnet route host1# traceroute6 -i eth0 -s abc::1 fe80::2 You just broke this for RDS and are using polemic to defend your case, but the main thrust of your diatribe seems to be "why would you need this?" I'll try to address that briefly here. It is unclear why you need to use words like "polemic" and "diatribe". What I wrote is in public and folks can judge whether they are
Re: [RFC PATCH] ip: re-introduce fragments cache worker
On 07/06/2018 06:56 AM, Paolo Abeni wrote: > On Fri, 2018-07-06 at 05:09 -0700, Eric Dumazet wrote: >> On 07/06/2018 04:56 AM, Paolo Abeni wrote: >>> With your setting, you need a bit more concurrent connections (400 ?) >>> to saturate the ipfrag cache. Above that number, performances will >>> still sink. >> >> Maybe, but IP defrag can not be 'perfect'. >> >> For this particular use case I could still bump high_thresh to 6 GB and all >> would be good :) > > Understood. > > I'd like to be sure I stated the problem I see clearly. With the > current code the "goodput" goes to almost 0 as soon as the ipfrag cache > load goes above it's capacity. Before the worker removal, after > reaching high_thresh, the "goodput" degratated slowly and even with a > load more than an order of magnitude higher, the performances were > still quite good. I think we can't ask customers to add more memory for > a kernel upgrade; even changing the default sysfs configuration is > somewhat troubling. > >>> This looks nice, I'll try to test it in my use case and I'll report >>> here. > > I tried the patch, but the result are not encouraging: > > ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 > 34.94 > > # on the receiver side: > echo 2 > /proc/sys/net/ipv4/ipfrag_time > > # on the sender side: > ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 > 85.8 > > # still on receiver side, while the test is running: > nstat>/dev/null ;sleep 1; nstat |grep IpReasm > IpReasmTimeout 2128 0.0 > IpReasmReqds754770 0.0 > IpReasmOKs 1350.0 > IpReasmFails752811 0.0 > > grep FRAG /proc/net/sockstat > FRAG: inuse 124 memory 5286144 > > The patch has some effect, as I basically saw no timeout without it, > but still does not look aggressive enough. Or possibly it's evicting > the fragments that are more likely to be used/completed (the most > recents one). Hey, that was simply an idea (not even compiled), not the final patch. I will test/polish it later, I am coming back from vacations and have a backlog. Here are my results : (Note that I have _not_ changed /proc/sys/net/ipv4/ipfrag_time ) lpaa6:~# grep . /proc/sys/net/ipv4/ipfrag_* ; grep FRAG /proc/net/sockstat /proc/sys/net/ipv4/ipfrag_high_thresh:104857600 /proc/sys/net/ipv4/ipfrag_low_thresh:78643200 /proc/sys/net/ipv4/ipfrag_max_dist:0 /proc/sys/net/ipv4/ipfrag_secret_interval:0 /proc/sys/net/ipv4/ipfrag_time:30 FRAG: inuse 1379 memory 105006776 lpaa5:/export/hda3/google/edumazet# ./super_netperf 400 -H 10.246.7.134 -t UDP_STREAM -l 60 netperf: send_omni: send_data failed: No route to host netperf: send_omni: send_data failed: No route to host 9063 I would say that it looks pretty good to me.
[PATCH net-next 5/6] ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6
From: Willem de Bruijn skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly after modification by __sock_cmsg_send, by calling sock_tx_timestamp. The IPv4 and IPv6 paths do this conversion differently. In IPv4, the individual protocols that support tx timestamps call this function and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is called in __ip6_append_data. There is no need to store both tx_flags and ts_flags in the cookie as one is derived from the other. Convert when setting up the cork and remove the redundant field. This is similar to IPv6, only have the conversion happen only once per datagram, in ip(6)_setup_cork. Also change __ip6_append_data to match __ip_append_data. Only update tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is redundant: only valid protocols can have non-zero cork->tx_flags. After this change the IPv4 and IPv6 logic is the same. Signed-off-by: Willem de Bruijn --- include/net/ip.h | 1 - net/ipv4/ip_output.c | 3 ++- net/ipv4/ping.c | 2 -- net/ipv4/raw.c| 2 -- net/ipv4/udp.c| 2 -- net/ipv6/ip6_output.c | 18 -- 6 files changed, 10 insertions(+), 18 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 6db23bf1e5eb..e44b1a44f67a 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -72,7 +72,6 @@ struct ipcm_cookie { __be32 addr; int oif; struct ip_options_rcu *opt; - __u8tx_flags; __u8ttl; __s16 tos; charpriority; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 81d0e4a77ec5..e14c774cc092 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1153,8 +1153,9 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork, cork->ttl = ipc->ttl; cork->tos = ipc->tos; cork->priority = ipc->priority; - cork->tx_flags = ipc->tx_flags; cork->transmit_time = ipc->sockc.transmit_time; + cork->tx_flags = 0; + sock_tx_timestamp(sk, ipc->sockc.tsflags, >tx_flags); return 0; } diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c index 6f17fc8ebbdb..b54c964ad925 100644 --- a/net/ipv4/ping.c +++ b/net/ipv4/ping.c @@ -763,8 +763,6 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) rcu_read_unlock(); } - sock_tx_timestamp(sk, ipc.sockc.tsflags, _flags); - saddr = ipc.addr; ipc.addr = faddr = daddr; diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index cf142909389c..33df4d76db2d 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -665,8 +665,6 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) , msg->msg_flags, ); else { - sock_tx_timestamp(sk, ipc.sockc.tsflags, _flags); - if (!ipc.addr) ipc.addr = fl4.daddr; lock_sock(sk); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 87f3a0b77864..060e841dde40 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1020,8 +1020,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) saddr = ipc.addr; ipc.addr = faddr = daddr; - sock_tx_timestamp(sk, ipc.sockc.tsflags, _flags); - if (ipc.opt && ipc.opt->opt.srr) { if (!daddr) { err = -EINVAL; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 1a3bf6437cb9..ff4b28a600ab 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1221,6 +1221,8 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork, cork->base.fragsize = mtu; cork->base.gso_size = sk->sk_type == SOCK_DGRAM && sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 0; + cork->base.tx_flags = 0; + sock_tx_timestamp(sk, ipc6->sockc.tsflags, >base.tx_flags); if (dst_allfrag(xfrm_dst_path(>dst))) cork->base.flags |= IPCORK_ALLFRAG; @@ -1250,7 +1252,6 @@ static int __ip6_append_data(struct sock *sk, int copy; int err; int offset = 0; - __u8 tx_flags = 0; u32 tskey = 0; struct rt6_info *rt = (struct rt6_info *)cork->dst; struct ipv6_txoptions *opt = v6_cork->opt; @@ -1269,6 +1270,10 @@ static int __ip6_append_data(struct sock *sk, mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize; orig_mtu = mtu; + if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP && + sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) + tskey = sk->sk_tskey++; + hh_len = LL_RESERVED_SPACE(rt->dst.dev); fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len + @@ -1318,13 +1323,6 @@ static int __ip6_append_data(struct sock *sk, rt->dst.dev->features & (NETIF_F_IPV6_CSUM
[PATCH net-next 4/6] ipv6: fold sockcm_cookie into ipcm6_cookie
From: Willem de Bruijn ipcm_cookie includes sockcm_cookie. Do the same for ipcm6_cookie. This reduces the number of arguments that need to be passed around, applies ipcm6_init to all cookie fields at once and reduces code differentiation between ipv4 and ipv6. Signed-off-by: Willem de Bruijn --- include/net/ipv6.h | 7 +++ include/net/transp_v6.h | 3 +-- net/ipv6/datagram.c | 4 ++-- net/ipv6/icmp.c | 7 ++- net/ipv6/ip6_flowlabel.c | 3 +-- net/ipv6/ip6_output.c| 24 ++-- net/ipv6/ipv6_sockglue.c | 3 +-- net/ipv6/ping.c | 3 +-- net/ipv6/raw.c | 10 -- net/ipv6/udp.c | 10 -- net/l2tp/l2tp_ip6.c | 6 ++ 11 files changed, 31 insertions(+), 49 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 6cb247f54d4c..aa6fd11a887c 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -294,6 +294,7 @@ struct ipv6_fl_socklist { }; struct ipcm6_cookie { + struct sockcm_cookie sockc; __s16 hlimit; __s16 tclass; __s8 dontfrag; @@ -959,8 +960,7 @@ int ip6_append_data(struct sock *sk, int odd, struct sk_buff *skb), void *from, int length, int transhdrlen, struct ipcm6_cookie *ipc6, struct flowi6 *fl6, - struct rt6_info *rt, unsigned int flags, - const struct sockcm_cookie *sockc); + struct rt6_info *rt, unsigned int flags); int ip6_push_pending_frames(struct sock *sk); @@ -977,8 +977,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk, void *from, int length, int transhdrlen, struct ipcm6_cookie *ipc6, struct flowi6 *fl6, struct rt6_info *rt, unsigned int flags, -struct inet_cork_full *cork, -const struct sockcm_cookie *sockc); +struct inet_cork_full *cork); static inline struct sk_buff *ip6_finish_skb(struct sock *sk) { diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h index f6a3543e5247..a8f6020f1196 100644 --- a/include/net/transp_v6.h +++ b/include/net/transp_v6.h @@ -42,8 +42,7 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct msghdr *msg, struct sk_buff *skb); int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg, - struct flowi6 *fl6, struct ipcm6_cookie *ipc6, - struct sockcm_cookie *sockc); + struct flowi6 *fl6, struct ipcm6_cookie *ipc6); void __ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp, __u16 srcp, __u16 destp, int rqueue, int bucket); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 2ee08b6a86a4..201306b9b5ea 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -736,7 +736,7 @@ EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl); int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg, struct flowi6 *fl6, - struct ipcm6_cookie *ipc6, struct sockcm_cookie *sockc) + struct ipcm6_cookie *ipc6) { struct in6_pktinfo *src_info; struct cmsghdr *cmsg; @@ -755,7 +755,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk, } if (cmsg->cmsg_level == SOL_SOCKET) { - err = __sock_cmsg_send(sk, msg, cmsg, sockc); + err = __sock_cmsg_send(sk, msg, cmsg, >sockc); if (err) return err; continue; diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index d99fed67cd10..24611c8b0562 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -430,7 +430,6 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, struct icmp6hdr tmp_hdr; struct flowi6 fl6; struct icmpv6_msg msg; - struct sockcm_cookie sockc_unused = {0}; struct ipcm6_cookie ipc6; int iif = 0; int addr_type = 0; @@ -573,7 +572,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, len + sizeof(struct icmp6hdr), sizeof(struct icmp6hdr), , , (struct rt6_info *)dst, - MSG_DONTWAIT, _unused)) { + MSG_DONTWAIT)) { ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS); ip6_flush_pending_frames(sk); } else { @@ -677,7 +676,6 @@ static void icmpv6_echo_reply(struct sk_buff *skb) struct dst_entry *dst; struct ipcm6_cookie ipc6; u32 mark = IP6_REPLY_MARK(net, skb->mark); - struct sockcm_cookie sockc_unused = {0};
[PATCH net-next 1/6] ipv4: ipcm_cookie initializers
From: Willem de Bruijn Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn --- include/net/ip.h | 15 +++ net/ipv4/icmp.c | 11 ++- net/ipv4/ip_output.c | 6 +- net/ipv4/ping.c | 9 + net/ipv4/raw.c | 9 + net/ipv4/udp.c | 10 +- 6 files changed, 21 insertions(+), 39 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 99d1b835d2aa..6db23bf1e5eb 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -79,6 +79,21 @@ struct ipcm_cookie { __u16 gso_size; }; +static inline void ipcm_init(struct ipcm_cookie *ipcm) +{ + *ipcm = (struct ipcm_cookie) { .tos = -1 }; +} + +static inline void ipcm_init_sk(struct ipcm_cookie *ipcm, + const struct inet_sock *inet) +{ + ipcm_init(ipcm); + + ipcm->sockc.tsflags = inet->sk.sk_tsflags; + ipcm->oif = inet->sk.sk_bound_dev_if; + ipcm->addr = inet->inet_saddr; +} + #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb)) #define PKTINFO_SKB_CB(skb) ((struct in_pktinfo *)((skb)->cb)) diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 937239afd68d..695979b7ef6d 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -429,15 +429,11 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb) icmp_param->data.icmph.checksum = 0; + ipcm_init(); inet->tos = ip_hdr(skb)->tos; sk->sk_mark = mark; daddr = ipc.addr = ip_hdr(skb)->saddr; saddr = fib_compute_spec_dst(skb); - ipc.opt = NULL; - ipc.tx_flags = 0; - ipc.ttl = 0; - ipc.tos = -1; - ipc.sockc.transmit_time = 0; if (icmp_param->replyopts.opt.opt.optlen) { ipc.opt = _param->replyopts.opt; @@ -711,12 +707,9 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info) icmp_param.offset = skb_network_offset(skb_in); inet_sk(sk)->tos = tos; sk->sk_mark = mark; + ipcm_init(); ipc.addr = iph->saddr; ipc.opt = _param.replyopts.opt; - ipc.tx_flags = 0; - ipc.ttl = 0; - ipc.tos = -1; - ipc.sockc.transmit_time = 0; rt = icmp_route_lookup(net, , skb_in, iph, saddr, tos, mark, type, code, _param); diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 570e3ebc3974..81d0e4a77ec5 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1548,12 +1548,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, if (__ip_options_echo(net, , skb, sopt)) return; + ipcm_init(); ipc.addr = daddr; - ipc.opt = NULL; - ipc.tx_flags = 0; - ipc.ttl = 0; - ipc.tos = -1; - ipc.sockc.transmit_time = 0; if (replyopts.opt.opt.optlen) { ipc.opt = diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c index b47492205507..6f17fc8ebbdb 100644 --- a/net/ipv4/ping.c +++ b/net/ipv4/ping.c @@ -739,14 +739,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) /* no remote port */ } - ipc.sockc.tsflags = sk->sk_tsflags; - ipc.addr = inet->inet_saddr; - ipc.opt = NULL; - ipc.oif = sk->sk_bound_dev_if; - ipc.tx_flags = 0; - ipc.ttl = 0; - ipc.tos = -1; - ipc.sockc.transmit_time = 0; + ipcm_init_sk(, inet); if (msg->msg_controllen) { err = ip_cmsg_send(sk, msg, , false); diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 446af7be2b55..cf142909389c 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -562,14 +562,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) daddr = inet->inet_daddr; } - ipc.sockc.tsflags = sk->sk_tsflags; - ipc.sockc.transmit_time = 0; - ipc.addr = inet->inet_saddr; - ipc.opt = NULL; - ipc.tx_flags = 0; - ipc.ttl = 0; - ipc.tos = -1; - ipc.oif = sk->sk_bound_dev_if; + ipcm_init_sk(, inet); if (msg->msg_controllen) { err = ip_cmsg_send(sk, msg, , false); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 5c76ba0666ec..87f3a0b77864 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -926,12 +926,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */ return -EOPNOTSUPP; - ipc.opt = NULL; - ipc.tx_flags = 0; - ipc.ttl = 0; - ipc.tos = -1; - ipc.sockc.transmit_time = 0; - getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag; fl4 = >cork.fl.u.ip4; @@ -978,9 +972,7 @@ int udp_sendmsg(struct sock *sk, struct
[PATCH net-next 2/6] ipv6: ipcm6_cookie initializer
From: Willem de Bruijn Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn --- include/net/ipv6.h | 19 +++ net/ipv6/icmp.c | 7 ++- net/ipv6/ping.c | 4 +--- net/ipv6/raw.c | 5 + net/ipv6/udp.c | 4 +--- net/l2tp/l2tp_ip6.c | 4 +--- 6 files changed, 25 insertions(+), 18 deletions(-) diff --git a/include/net/ipv6.h b/include/net/ipv6.h index b7843e0b16ee..6cb247f54d4c 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -301,6 +301,25 @@ struct ipcm6_cookie { __u16 gso_size; }; +static inline void ipcm6_init(struct ipcm6_cookie *ipc6) +{ + *ipc6 = (struct ipcm6_cookie) { + .hlimit = -1, + .tclass = -1, + .dontfrag = -1, + }; +} + +static inline void ipcm6_init_sk(struct ipcm6_cookie *ipc6, +const struct ipv6_pinfo *np) +{ + *ipc6 = (struct ipcm6_cookie) { + .hlimit = -1, + .tclass = np->tclass, + .dontfrag = np->dontfrag, + }; +} + static inline struct ipv6_txoptions *txopt_get(const struct ipv6_pinfo *np) { struct ipv6_txoptions *opt; diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index be491bf6ab6e..d99fed67cd10 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -545,7 +545,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, else if (!fl6.flowi6_oif) fl6.flowi6_oif = np->ucast_oif; - ipc6.tclass = np->tclass; + ipcm6_init_sk(, np); fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel); dst = icmpv6_route_lookup(net, skb, sk, ); @@ -553,8 +553,6 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, goto out; ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst); - ipc6.dontfrag = np->dontfrag; - ipc6.opt = NULL; msg.skb = skb; msg.offset = skb_network_offset(skb); @@ -726,10 +724,9 @@ static void icmpv6_echo_reply(struct sk_buff *skb) msg.offset = 0; msg.type = ICMPV6_ECHO_REPLY; + ipcm6_init_sk(, np); ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst); ipc6.tclass = ipv6_get_dsfield(ipv6_hdr(skb)); - ipc6.dontfrag = np->dontfrag; - ipc6.opt = NULL; if (ip6_append_data(sk, icmpv6_getfrag, , skb->len + sizeof(struct icmp6hdr), diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index 96f56bf49a30..717e7c1fba29 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -119,7 +119,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) fl6.fl6_icmp_code = user_icmph.icmp6_code; security_sk_classify_flow(sk, flowi6_to_flowi()); - ipc6.tclass = np->tclass; + ipcm6_init_sk(, np); fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel); dst = ip6_sk_dst_lookup_flow(sk, , daddr, false); @@ -142,8 +142,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) pfh.family = AF_INET6; ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst); - ipc6.dontfrag = np->dontfrag; - ipc6.opt = NULL; lock_sock(sk); err = ip6_append_data(sk, ping_getfrag, , len, diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index 5737c50f16eb..5f40670271ee 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -791,10 +791,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) fl6.flowi6_mark = sk->sk_mark; fl6.flowi6_uid = sk->sk_uid; - ipc6.hlimit = -1; - ipc6.tclass = -1; - ipc6.dontfrag = -1; - ipc6.opt = NULL; + ipcm6_init(); if (sin6) { if (addr_len < SIN6_LEN_RFC2133) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index ac6fc6728903..940115da9843 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1143,9 +1143,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) int (*getfrag)(void *, char *, int, int, int, struct sk_buff *); struct sockcm_cookie sockc; - ipc6.hlimit = -1; - ipc6.tclass = -1; - ipc6.dontfrag = -1; + ipcm6_init(); ipc6.gso_size = up->gso_size; sockc.tsflags = sk->sk_tsflags; sockc.transmit_time = 0; diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c index 957369192ca1..38f80691f4ab 100644 --- a/net/l2tp/l2tp_ip6.c +++ b/net/l2tp/l2tp_ip6.c @@ -525,9 +525,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) fl6.flowi6_mark = sk->sk_mark; fl6.flowi6_uid = sk->sk_uid; - ipc6.hlimit = -1; - ipc6.tclass = -1; - ipc6.dontfrag = -1; + ipcm6_init(); if (lsa) { if (addr_len <
[PATCH net-next 6/6] ip: unconditionally set cork gso_size
From: Willem de Bruijn Now that ipc(6)->gso_size is correctly initialized in all callers of ip(6)_setup_cork, it is safe to unconditionally pass it to the cork. Link: http://lkml.kernel.org/r/20180619164752.143249-1-willemdebruijn.ker...@gmail.com Signed-off-by: Willem de Bruijn --- net/ipv4/ip_output.c | 3 +-- net/ipv6/ip6_output.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index e14c774cc092..e2b6bd478afb 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1146,8 +1146,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork, cork->fragsize = ip_sk_use_pmtu(sk) ? dst_mtu(>dst) : rt->dst.dev->mtu; - cork->gso_size = sk->sk_type == SOCK_DGRAM && -sk->sk_protocol == IPPROTO_UDP ? ipc->gso_size : 0; + cork->gso_size = ipc->gso_size; cork->dst = >dst; cork->length = 0; cork->ttl = ipc->ttl; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index ff4b28a600ab..8047fd41ba88 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1219,8 +1219,7 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork, if (mtu < IPV6_MIN_MTU) return -EINVAL; cork->base.fragsize = mtu; - cork->base.gso_size = sk->sk_type == SOCK_DGRAM && - sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 0; + cork->base.gso_size = ipc6->gso_size; cork->base.tx_flags = 0; sock_tx_timestamp(sk, ipc6->sockc.tsflags, >base.tx_flags); -- 2.18.0.399.gad0ab374a1-goog
[PATCH net-next 3/6] sock: sockc cookie initializer
From: Willem de Bruijn Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn --- include/net/sock.h | 6 ++ net/ipv4/tcp.c | 2 +- net/packet/af_packet.c | 9 +++-- 3 files changed, 10 insertions(+), 7 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index e0eac9ef44b5..83b747538bd0 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1600,6 +1600,12 @@ struct sockcm_cookie { u16 tsflags; }; +static inline void sockcm_init(struct sockcm_cookie *sockc, + const struct sock *sk) +{ + *sockc = (struct sockcm_cookie) { .tsflags = sk->sk_tsflags }; +} + int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, struct sockcm_cookie *sockc); int sock_cmsg_send(struct sock *sk, struct msghdr *msg, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index bf461fa77ed6..850dc8f15afc 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1241,7 +1241,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) /* 'common' sending to sendq */ } - sockc.tsflags = sk->sk_tsflags; + sockcm_init(, sk); if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, ); if (unlikely(err)) { diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 3428f7739ae9..47931ebfaef3 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1951,8 +1951,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg, goto out_unlock; } - sockc.transmit_time = 0; - sockc.tsflags = sk->sk_tsflags; + sockcm_init(, sk); if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, ); if (unlikely(err)) @@ -2636,8 +2635,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) if (unlikely(!(dev->flags & IFF_UP))) goto out_put; - sockc.transmit_time = 0; - sockc.tsflags = po->sk.sk_tsflags; + sockcm_init(, >sk); if (msg->msg_controllen) { err = sock_cmsg_send(>sk, msg, ); if (unlikely(err)) @@ -2833,8 +2831,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) if (unlikely(!(dev->flags & IFF_UP))) goto out_unlock; - sockc.transmit_time = 0; - sockc.tsflags = sk->sk_tsflags; + sockcm_init(, sk); sockc.mark = sk->sk_mark; if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, ); -- 2.18.0.399.gad0ab374a1-goog
[PATCH net-next 0/6] sock cookie initializers
From: Willem de Bruijn Recent UDP GSO and SO_TXTIME features added new fields to cookie structs. When adding a field, all sites where a struct is initialized have to be updated, which is a lot of boilerplate. Alternatively, a field can be initialized selectively, but this is fragile. I introduced a bug in udp gso where an uninitialized field was read. See also fix commit ("9887cba19978 ip: limit use of gso_size to udp"). Introduce initializers for structs ipcm(6)_cookie and sockc_cookie. patch 1..3 do exactly this. patch 4..5 make ipv4 and ipv6 handle cookies the same way and remove some boilerplate in doing so. patch 6removes the udp gso branch that needed the above fix Willem de Bruijn (6): ipv4: ipcm_cookie initializers ipv6: ipcm6_cookie initializer sock: sockc cookie initializer ipv6: fold sockcm_cookie into ipcm6_cookie ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6 ip: unconditionally set cork gso_size include/net/ip.h | 16 ++- include/net/ipv6.h | 26 include/net/sock.h | 6 ++ include/net/transp_v6.h | 3 +-- net/ipv4/icmp.c | 11 ++ net/ipv4/ip_output.c | 12 --- net/ipv4/ping.c | 11 +- net/ipv4/raw.c | 11 +- net/ipv4/tcp.c | 2 +- net/ipv4/udp.c | 12 +-- net/ipv6/datagram.c | 4 ++-- net/ipv6/icmp.c | 14 - net/ipv6/ip6_flowlabel.c | 3 +-- net/ipv6/ip6_output.c| 43 +--- net/ipv6/ipv6_sockglue.c | 3 +-- net/ipv6/ping.c | 7 ++- net/ipv6/raw.c | 15 +- net/ipv6/udp.c | 14 + net/l2tp/l2tp_ip6.c | 10 +++--- net/packet/af_packet.c | 9 +++-- 20 files changed, 98 insertions(+), 134 deletions(-) -- 2.18.0.399.gad0ab374a1-goog
Re: [PATCH v1 net-next 6/9] lan743x: Add power management support
Hi Bryan, I love your patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Bryan-Whitehead/lan743x-Add-features-to-lan743x-driver/20180706-051812 config: alpha-allmodconfig (attached as .config) compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.2.0 make.cross ARCH=alpha All warnings (new ones prefixed by >>): >> drivers/net/ethernet/microchip/lan743x_main.c:2787:5: warning: "CONFIG_PM" >> is not defined, evaluates to 0 [-Wundef] #if CONFIG_PM ^ drivers/net/ethernet/microchip/lan743x_main.c:2894:5: warning: "CONFIG_PM" is not defined, evaluates to 0 [-Wundef] #if CONFIG_PM ^ drivers/net/ethernet/microchip/lan743x_main.c:2926:5: warning: "CONFIG_PM" is not defined, evaluates to 0 [-Wundef] #if CONFIG_PM ^ drivers/net/ethernet/microchip/lan743x_main.c:2957:5: warning: "CONFIG_PM" is not defined, evaluates to 0 [-Wundef] #if CONFIG_PM ^ -- >> drivers/net/ethernet/microchip/lan743x_ethtool.c:436:5: warning: "CONFIG_PM" >> is not defined, evaluates to 0 [-Wundef] #if CONFIG_PM ^ vim +/CONFIG_PM +2787 drivers/net/ethernet/microchip/lan743x_main.c 2786 > 2787 #if CONFIG_PM 2788 static void lan743x_pm_set_wol(struct lan743x_adapter *adapter) 2789 { 2790 const u8 ipv4_multicast[3] = { 0x01, 0x00, 0x5E }; 2791 const u8 ipv6_multicast[3] = { 0x33, 0x33 }; 2792 const u8 arp_type[2] = { 0x08, 0x06 }; 2793 int mask_index; 2794 u32 pmtctl; 2795 u32 wucsr; 2796 u32 macrx; 2797 u16 crc; 2798 2799 for (mask_index = 0; mask_index < MAC_NUM_OF_WUF_CFG; mask_index++) 2800 lan743x_csr_write(adapter, MAC_WUF_CFG(mask_index), 0); 2801 2802 /* clear wake settings */ 2803 pmtctl = lan743x_csr_read(adapter, PMT_CTL); 2804 pmtctl |= PMT_CTL_WUPS_MASK_; 2805 pmtctl &= ~(PMT_CTL_GPIO_WAKEUP_EN_ | PMT_CTL_EEE_WAKEUP_EN_ | 2806 PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_ | 2807 PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_ | PMT_CTL_ETH_PHY_WAKE_EN_); 2808 2809 macrx = lan743x_csr_read(adapter, MAC_RX); 2810 2811 wucsr = 0; 2812 mask_index = 0; 2813 2814 pmtctl |= PMT_CTL_ETH_PHY_D3_COLD_OVR_ | PMT_CTL_ETH_PHY_D3_OVR_; 2815 2816 if (adapter->wolopts & WAKE_PHY) { 2817 pmtctl |= PMT_CTL_ETH_PHY_EDPD_PLL_CTL_; 2818 pmtctl |= PMT_CTL_ETH_PHY_WAKE_EN_; 2819 } 2820 if (adapter->wolopts & WAKE_MAGIC) { 2821 wucsr |= MAC_WUCSR_MPEN_; 2822 macrx |= MAC_RX_RXEN_; 2823 pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_; 2824 } 2825 if (adapter->wolopts & WAKE_UCAST) { 2826 wucsr |= MAC_WUCSR_RFE_WAKE_EN_ | MAC_WUCSR_PFDA_EN_; 2827 macrx |= MAC_RX_RXEN_; 2828 pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_; 2829 pmtctl |= PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_; 2830 } 2831 if (adapter->wolopts & WAKE_BCAST) { 2832 wucsr |= MAC_WUCSR_RFE_WAKE_EN_ | MAC_WUCSR_BCST_EN_; 2833 macrx |= MAC_RX_RXEN_; 2834 pmtctl |= PMT_CTL_WOL_EN_ | PMT_CTL_MAC_D3_RX_CLK_OVR_; 2835 pmtctl |= PMT_CTL_RX_FCT_RFE_D3_CLK_OVR_; 2836 } 2837 if (adapter->wolopts & WAKE_MCAST) { 2838 /* IPv4 multicast */ 2839 crc = lan743x_pm_wakeframe_crc16(ipv4_multicast, 3); 2840 lan743x_csr_write(adapter, MAC_WUF_CFG(mask_index), 2841MAC_WUF_CFG_EN_ | MAC_WUF_CFG_TYPE_MCAST_ | 2842(0 << MAC_WUF_CFG_OFFSET_SHIFT_) | 2843(crc & MAC_WUF_CFG_CRC16_MASK_)); 2844 lan743x_csr_write(adapter, MAC_WUF_MASK0(mask_index), 7); 2845 lan743x_csr_write(adapter, MAC_WUF_MASK1(mask_index), 0); 2846 lan743x_csr_write(adapter, MAC_WUF_MASK2(mask_index), 0); 2847 lan743x_csr_write(adapter, MAC_WUF_MASK3(mask_index), 0); 2848 mask_index++; 2849 2850 /* IPv6 multicast */ 2851 crc = lan743x_pm_wakeframe_crc16(ipv6_multicast, 2)
Re: [RFC PATCH] ip: re-introduce fragments cache worker
On Fri, 2018-07-06 at 05:09 -0700, Eric Dumazet wrote: > On 07/06/2018 04:56 AM, Paolo Abeni wrote: > > With your setting, you need a bit more concurrent connections (400 ?) > > to saturate the ipfrag cache. Above that number, performances will > > still sink. > > Maybe, but IP defrag can not be 'perfect'. > > For this particular use case I could still bump high_thresh to 6 GB and all > would be good :) Understood. I'd like to be sure I stated the problem I see clearly. With the current code the "goodput" goes to almost 0 as soon as the ipfrag cache load goes above it's capacity. Before the worker removal, after reaching high_thresh, the "goodput" degratated slowly and even with a load more than an order of magnitude higher, the performances were still quite good. I think we can't ask customers to add more memory for a kernel upgrade; even changing the default sysfs configuration is somewhat troubling. > > This looks nice, I'll try to test it in my use case and I'll report > > here. I tried the patch, but the result are not encouraging: ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 34.94 # on the receiver side: echo 2 > /proc/sys/net/ipv4/ipfrag_time # on the sender side: ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 85.8 # still on receiver side, while the test is running: nstat>/dev/null ;sleep 1; nstat |grep IpReasm IpReasmTimeout 2128 0.0 IpReasmReqds754770 0.0 IpReasmOKs 1350.0 IpReasmFails752811 0.0 grep FRAG /proc/net/sockstat FRAG: inuse 124 memory 5286144 The patch has some effect, as I basically saw no timeout without it, but still does not look aggressive enough. Or possibly it's evicting the fragments that are more likely to be used/completed (the most recents one). > > I have doubt: under DDOS we will trigger timeout per > > jiffy, can that keep a CPU busy, too? > > Yes, the cpu(s) handling the RX queue(s), which are already provisioned for > networking stuff ;) > > Even without any frag being received, these cpu can be 100% busy. With: schedule_work_on(smp_processor_id(), #... ) We can be sure to run exclusively on the cpu handling the RX queue even with the worker. Cheers, Paolo
[PATCH net 2/3] net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev
The value has already been copied from this netns's devconf_dflt, it shouldn't be reset to the global kernel default. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca --- net/ipv6/addrconf.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index e9ba53d2a147..e20f8a1d8cdb 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -385,8 +385,6 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev) if (ndev->cnf.stable_secret.initialized) ndev->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY; - else - ndev->cnf.addr_gen_mode = ipv6_devconf_dflt.addr_gen_mode; ndev->cnf.mtu6 = dev->mtu; ndev->nd_parms = neigh_parms_alloc(dev, _tbl); -- 2.18.0
[PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores the errors returned by proc_dointvec(). addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which writes the value to memory, and then checks if it's valid and may return EINVAL. If a bad value is given, the value displayed when reading net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the value provided by the user was valid, addrconf_dev_config() won't be called since idev->cnf.addr_gen_mode has already been updated. Fix this in the usual way we deal with values that need to be checked after the proc_do*() helper has returned: define a local ctl_table and storage, call proc_dointvec() on that temporary area, then check and store. addrconf_sysctl_addr_gen_mode() also writes the new value to the global ipv6_devconf_dflt, when we're writing to some netns's default, so that new netns will inherit the value that was set by the change occuring in any netns. That doesn't make any sense, so let's drop this assignment. Finally, since addr_gen_mode is a __u32, switch to proc_douintvec(). Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca --- net/ipv6/addrconf.c | 27 ++- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 91580c62bb86..e9ba53d2a147 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct ctl_table *ctl, int write, loff_t *ppos) { int ret = 0; - int new_val; + u32 new_val; struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1; struct net *net = (struct net *)ctl->extra2; + struct ctl_table tmp = { + .data = _val, + .maxlen = sizeof(new_val), + .mode = ctl->mode, + }; if (!rtnl_trylock()) return restart_syscall(); - ret = proc_dointvec(ctl, write, buffer, lenp, ppos); + new_val = *((u32 *)ctl->data); - if (write) { - new_val = *((int *)ctl->data); + ret = proc_douintvec(, write, buffer, lenp, ppos); + if (ret != 0) + goto out; + if (write) { if (check_addr_gen_mode(new_val) < 0) { ret = -EINVAL; goto out; } - /* request for default */ - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) { - ipv6_devconf_dflt.addr_gen_mode = new_val; - - /* request for individual net device */ - } else { - if (!idev) - goto out; - + if (idev) { if (check_stable_privacy(idev, net, new_val) < 0) { ret = -EINVAL; goto out; @@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct ctl_table *ctl, int write, addrconf_dev_config(idev->dev); } } + + *((u32 *)ctl->data) = new_val; } out: -- 2.18.0
[PATCH net 0/3] net/ipv6: addr_gen_mode fixes
This series fixes bugs in handling of the addr_gen_mode option, mainly related to the sysctl. A minor netlink issue was also present in the initial commit introducing the option on a per-netdevice basis. Sabrina Dubroca (3): net/ipv6: fix addrconf_sysctl_addr_gen_mode net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE net/ipv6/addrconf.c | 33 + 1 file changed, 17 insertions(+), 16 deletions(-) -- 2.18.0
[PATCH net 3/3] net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE
inet6_ifla6_size() is called to check how much space is needed by inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Signed-off-by: Sabrina Dubroca --- net/ipv6/addrconf.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index e20f8a1d8cdb..e89bca83e0e4 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -5208,7 +5208,9 @@ static inline size_t inet6_ifla6_size(void) + nla_total_size(DEVCONF_MAX * 4) /* IFLA_INET6_CONF */ + nla_total_size(IPSTATS_MIB_MAX * 8) /* IFLA_INET6_STATS */ + nla_total_size(ICMP6_MIB_MAX * 8) /* IFLA_INET6_ICMP6STATS */ -+ nla_total_size(sizeof(struct in6_addr)); /* IFLA_INET6_TOKEN */ ++ nla_total_size(sizeof(struct in6_addr)) /* IFLA_INET6_TOKEN */ ++ nla_total_size(1) /* IFLA_INET6_ADDR_GEN_MODE */ ++ 0; } static inline size_t inet6_if_nlmsg_size(void) -- 2.18.0
Re: [PATCH iproute2-next] tc: m_tunnel_key: Add tunnel option support to act_tunnel_key
Jakub Kicinski writes: > From: Simon Horman > > Allow setting tunnel options using the act_tunnel_key action. > > Options are expressed as class:type:data and multiple options > may be listed using a comma delimiter. > > # ip link add name geneve0 type geneve dstport 0 external > # tc qdisc add dev eth0 ingress > # tc filter add dev eth0 protocol ip parent : \ > flower indev eth0 \ > ip_proto udp \ > action tunnel_key \ > set src_ip 10.0.99.192 \ > dst_ip 10.0.99.193 \ > dst_port 6081 \ > id 11 \ > geneve_opts 0102:80:00800022,0102:80:00800022 \ > action mirred egress redirect dev geneve0 [...] Jakub, could you also add relevant tests for the new option in file $(kernel)/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json?
[PATCH] ipv6: icmp: Updating pmtu for link local route
When a ICMPV6_PKT_TOOBIG is received from a link local address the pmtu will be updated on a route with an arbitrary interface index. Subsequent packets sent back to the same link local address may therefore end up not considering the updated pmtu. Current behavior breaks TAHI v6LC4.1.4 Reduce PMTU On-link. Referring to RFC 1981: Section 3: "Note that Path MTU Discovery must be performed even in cases where a node "thinks" a destination is attached to the same link as itself. In a situation such as when a neighboring router acts as proxy [ND] for some destination, the destination can to appear to be directly connected but is in fact more than one hop away." Using the interface index from the incoming ICMPV6_PKT_TOOBIG when updating the pmtu. Signed-off-by: Georg Kohmann --- net/ipv6/icmp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index be491bf..7d0e9c7 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -92,7 +92,7 @@ static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, struct net *net = dev_net(skb->dev); if (type == ICMPV6_PKT_TOOBIG) - ip6_update_pmtu(skb, net, info, 0, 0, sock_net_uid(net, NULL)); + ip6_update_pmtu(skb, net, info, skb->dev->ifindex, 0, sock_net_uid(net, NULL)); else if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0, sock_net_uid(net, NULL)); --- Tested on kernel 4.4.38. I am however worried that this patch may have consequences for other traffic on the same interface. Please advice or suggest at better approach.
Re: [RFC PATCH] ip: re-introduce fragments cache worker
On 07/06/2018 04:56 AM, Paolo Abeni wrote: > Hi, > > On Fri, 2018-07-06 at 04:23 -0700, Eric Dumazet wrote: >> Ho hum. No please. >> >> I do not think adding back a GC is wise, since my patches were going in the >> direction >> of allowing us to increase limits on current hardware. >> >> Meaning that the amount of frags to evict would be quite big under DDOS. >> (One inet_frag_queue allocated for every incoming tiny frame :/ ) >> >> A GC is a _huge_ problem, burning one cpu (you would have to provision for >> this CPU) >> compared to letting normal per frag timer doing its job. >> >> My plan was to reduce the per frag timer under load (default is 30 seconds), >> since >> this is exactly what your patch is indirectly doing, by aggressively pruning >> frags under stress. >> >> That would be a much simpler heuristic. [1] >> >> BTW my own results (before patch) are : >> >> lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t >> UDP_STREAM -l 60 >>9602 >> lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t >> UDP_STREAM -l 60 >>9557 >> >> On receiver (normal settings here) I had : >> >> lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_* >> /proc/sys/net/ipv4/ipfrag_high_thresh:104857600 >> /proc/sys/net/ipv4/ipfrag_low_thresh:78643200 >> /proc/sys/net/ipv4/ipfrag_max_dist:0 >> /proc/sys/net/ipv4/ipfrag_secret_interval:0 >> /proc/sys/net/ipv4/ipfrag_time:30 >> >> lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat >> FRAG: inuse 824 memory 53125312 > > Than you for the feedback. > > With your setting, you need a bit more concurrent connections (400 ?) > to saturate the ipfrag cache. Above that number, performances will > still sink. Maybe, but IP defrag can not be 'perfect'. For this particular use case I could still bump high_thresh to 6 GB and all would be good :) > This looks nice, I'll try to test it in my use case and I'll report > here. > > Perhaps we can use the default timeout when usage < low_thresh, to > avoid some maths in possibly common scenario? On current 64bit hardware, a divide here is not a big cost (compared to the rest of frag setup) and I would rather starting having smaller timeouts sooner than later ;) (low_thresh is typically set to 75% of high_thresh) > > I have doubt: under DDOS we will trigger timeout per > jiffy, can that keep a CPU busy, too? Yes, the cpu(s) handling the RX queue(s), which are already provisioned for networking stuff ;) Even without any frag being received, these cpu can be 100% busy.
Re: [RFC PATCH] ip: re-introduce fragments cache worker
Hi, On Fri, 2018-07-06 at 04:23 -0700, Eric Dumazet wrote: > Ho hum. No please. > > I do not think adding back a GC is wise, since my patches were going in the > direction > of allowing us to increase limits on current hardware. > > Meaning that the amount of frags to evict would be quite big under DDOS. > (One inet_frag_queue allocated for every incoming tiny frame :/ ) > > A GC is a _huge_ problem, burning one cpu (you would have to provision for > this CPU) > compared to letting normal per frag timer doing its job. > > My plan was to reduce the per frag timer under load (default is 30 seconds), > since > this is exactly what your patch is indirectly doing, by aggressively pruning > frags under stress. > > That would be a much simpler heuristic. [1] > > BTW my own results (before patch) are : > > lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t > UDP_STREAM -l 60 >9602 > lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t > UDP_STREAM -l 60 >9557 > > On receiver (normal settings here) I had : > > lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_* > /proc/sys/net/ipv4/ipfrag_high_thresh:104857600 > /proc/sys/net/ipv4/ipfrag_low_thresh:78643200 > /proc/sys/net/ipv4/ipfrag_max_dist:0 > /proc/sys/net/ipv4/ipfrag_secret_interval:0 > /proc/sys/net/ipv4/ipfrag_time:30 > > lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat > FRAG: inuse 824 memory 53125312 Than you for the feedback. With your setting, you need a bit more concurrent connections (400 ?) to saturate the ipfrag cache. Above that number, performances will still sink. > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > index > c9e35b81d0931df8429a33e8d03e719b87da0747..88ed61bcda00f3357724e5c4dbcb97400b4a8b21 > 100644 > --- a/net/ipv4/inet_fragment.c > +++ b/net/ipv4/inet_fragment.c > @@ -155,9 +155,15 @@ static struct inet_frag_queue *inet_frag_alloc(struct > netns_frags *nf, >struct inet_frags *f, >void *arg) > { > + long high_thresh = READ_ONCE(nf->high_thresh); > struct inet_frag_queue *q; > + u64 timeout; > + long usage; > > - if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) > + if (!high_thresh) > + return NULL; > + usage = frag_mem_limit(nf); > + if (usage > high_thresh) > return NULL; > > q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC); > @@ -171,6 +177,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct > netns_frags *nf, > timer_setup(>timer, f->frag_expire, 0); > spin_lock_init(>lock); > refcount_set(>refcnt, 3); > + timeout = (u64)nf->timeout * (high_thresh - usage); > + mod_timer(>timer, jiffies + div64_long(timeout, high_thresh)); > > return q; > } This looks nice, I'll try to test it in my use case and I'll report here. Perhaps we can use the default timeout when usage < low_thresh, to avoid some maths in possibly common scenario? I have doubt: under DDOS we will trigger timeout per jiffy, can that keep a CPU busy, too? Cheers, Paolo
Re: [RFC PATCH] ip: re-introduce fragments cache worker
On 07/06/2018 03:10 AM, Paolo Abeni wrote: > Currently, the ip frag cache is fragile to overload. With > flow control disabled: > > ./super_netperf.sh 10 -H 192.168.101.2 -t UDP_STREAM -l 60 > 9618.08 > ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 > 28.66 > > Once that the overload condition is reached, the system does not > recover until it's almost completely idle: > > ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 & > sleep 4; I=0; > for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; > done > 13.72 > > This is due to the removal of the fragment cache worker, which > was responsible to free some IP fragment cache memory when the > high threshold was reached, allowing the system to cope successfully > with the next fragmented packets. > > This commit re-introduces the worker, on a per netns basis. Thanks > to rhashtable walkers we can block the bh only for an entry removal. > > After this commit (and before IP frag worker removal): > > ./super_netperf.sh 10 -H 192.168.101.2 -t UDP_STREAM -l 60 > 9618.08 > > ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 > 8599.77 > > ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 & > sleep 4; I=0; > for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; > done > 9623.12 > > Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units") > Signed-off-by: Paolo Abeni > --- > Note: tweaking ipfrag sysfs does not solve completely the issue: > - raising ipfrag_high_thresh increases the number of parallels > connections required to degrade the tput, but reached the IP > fragment cache capacity the good-put still goes almost to 0, > with the worker we get much more nice behaviour. > - setting ipfrag_time to 2 increases the change to recover from > overload (the 2# test above), but with several experiments > in such test, I got an average of 50% the expected tput with a very > large variance, with the worker we always see the expected/ > line rate tput. > --- Ho hum. No please. I do not think adding back a GC is wise, since my patches were going in the direction of allowing us to increase limits on current hardware. Meaning that the amount of frags to evict would be quite big under DDOS. (One inet_frag_queue allocated for every incoming tiny frame :/ ) A GC is a _huge_ problem, burning one cpu (you would have to provision for this CPU) compared to letting normal per frag timer doing its job. My plan was to reduce the per frag timer under load (default is 30 seconds), since this is exactly what your patch is indirectly doing, by aggressively pruning frags under stress. That would be a much simpler heuristic. [1] BTW my own results (before patch) are : lpaa5:/export/hda3/google/edumazet# ./super_netperf 10 -H 10.246.7.134 -t UDP_STREAM -l 60 9602 lpaa5:/export/hda3/google/edumazet# ./super_netperf 200 -H 10.246.7.134 -t UDP_STREAM -l 60 9557 On receiver (normal settings here) I had : lpaa6:/export/hda3/google/edumazet# grep . /proc/sys/net/ipv4/ipfrag_* /proc/sys/net/ipv4/ipfrag_high_thresh:104857600 /proc/sys/net/ipv4/ipfrag_low_thresh:78643200 /proc/sys/net/ipv4/ipfrag_max_dist:0 /proc/sys/net/ipv4/ipfrag_secret_interval:0 /proc/sys/net/ipv4/ipfrag_time:30 lpaa6:/export/hda3/google/edumazet# grep FRAG /proc/net/sockstat FRAG: inuse 824 memory 53125312 [1] Something like (for IPv4 only here) diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index c9e35b81d0931df8429a33e8d03e719b87da0747..88ed61bcda00f3357724e5c4dbcb97400b4a8b21 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -155,9 +155,15 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf, struct inet_frags *f, void *arg) { + long high_thresh = READ_ONCE(nf->high_thresh); struct inet_frag_queue *q; + u64 timeout; + long usage; - if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) + if (!high_thresh) + return NULL; + usage = frag_mem_limit(nf); + if (usage > high_thresh) return NULL; q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC); @@ -171,6 +177,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf, timer_setup(>timer, f->frag_expire, 0); spin_lock_init(>lock); refcount_set(>refcnt, 3); + timeout = (u64)nf->timeout * (high_thresh - usage); + mod_timer(>timer, jiffies + div64_long(timeout, high_thresh)); return q; } @@ -186,8 +194,6 @@ static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf, if (!q) return NULL; - mod_timer(>timer, jiffies + nf->timeout); - err = rhashtable_insert_fast(>rhashtable, >node, f->rhash_params); if (err <
[PATCH net-next 2/2] net/sched: act_skbedit: don't use spinlock in the data path
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on act_skbedit configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti --- include/net/tc_act/tc_skbedit.h | 37 --- net/sched/act_skbedit.c | 107 2 files changed, 95 insertions(+), 49 deletions(-) diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h index 19cd3d345804..911bbac838a2 100644 --- a/include/net/tc_act/tc_skbedit.h +++ b/include/net/tc_act/tc_skbedit.h @@ -22,14 +22,19 @@ #include #include +struct tcf_skbedit_params { + u32 flags; + u32 priority; + u32 mark; + u32 mask; + u16 queue_mapping; + u16 ptype; + struct rcu_head rcu; +}; + struct tcf_skbedit { - struct tc_actioncommon; - u32 flags; - u32 priority; - u32 mark; - u32 mask; - u16 queue_mapping; - u16 ptype; + struct tc_action common; + struct tcf_skbedit_params __rcu *params; }; #define to_skbedit(a) ((struct tcf_skbedit *)a) @@ -37,15 +42,27 @@ struct tcf_skbedit { static inline bool is_tcf_skbedit_mark(const struct tc_action *a) { #ifdef CONFIG_NET_CLS_ACT - if (a->ops && a->ops->type == TCA_ACT_SKBEDIT) - return to_skbedit(a)->flags == SKBEDIT_F_MARK; + u32 flags; + + if (a->ops && a->ops->type == TCA_ACT_SKBEDIT) { + rcu_read_lock(); + flags = rcu_dereference(to_skbedit(a)->params)->flags; + rcu_read_unlock(); + return flags == SKBEDIT_F_MARK; + } #endif return false; } static inline u32 tcf_skbedit_mark(const struct tc_action *a) { - return to_skbedit(a)->mark; + u32 mark; + + rcu_read_lock(); + mark = rcu_dereference(to_skbedit(a)->params)->mark; + rcu_read_unlock(); + + return mark; } #endif /* __NET_TC_SKBEDIT_H */ diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index c1bfa28ba477..028eb679e532 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -37,14 +37,19 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, struct tcf_result *res) { struct tcf_skbedit *d = to_skbedit(a); + struct tcf_skbedit_params *params; + int action; tcf_lastuse_update(>tcf_tm); bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb); - spin_lock(>tcf_lock); - if (d->flags & SKBEDIT_F_PRIORITY) - skb->priority = d->priority; - if (d->flags & SKBEDIT_F_INHERITDSFIELD) { + rcu_read_lock(); + params = rcu_dereference(d->params); + action = READ_ONCE(d->tcf_action); + + if (params->flags & SKBEDIT_F_PRIORITY) + skb->priority = params->priority; + if (params->flags & SKBEDIT_F_INHERITDSFIELD) { int wlen = skb_network_offset(skb); switch (tc_skb_protocol(skb)) { @@ -63,23 +68,23 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, break; } } - if (d->flags & SKBEDIT_F_QUEUE_MAPPING && - skb->dev->real_num_tx_queues > d->queue_mapping) - skb_set_queue_mapping(skb, d->queue_mapping); - if (d->flags & SKBEDIT_F_MARK) { - skb->mark &= ~d->mask; - skb->mark |= d->mark & d->mask; + if (params->flags & SKBEDIT_F_QUEUE_MAPPING && + skb->dev->real_num_tx_queues > params->queue_mapping) + skb_set_queue_mapping(skb, params->queue_mapping); + if (params->flags & SKBEDIT_F_MARK) { + skb->mark &= ~params->mask; + skb->mark |= params->mark & params->mask; } - if (d->flags & SKBEDIT_F_PTYPE) - skb->pkt_type = d->ptype; - - spin_unlock(>tcf_lock); - return d->tcf_action; + if (params->flags & SKBEDIT_F_PTYPE) + skb->pkt_type = params->ptype; +unlock: + rcu_read_unlock(); + return action; err: - spin_unlock(>tcf_lock); qstats_drop_inc(this_cpu_ptr(d->common.cpu_qstats)); - return TC_ACT_SHOT; + action = TC_ACT_SHOT; + goto unlock; } static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = { @@ -97,6 +102,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, int ovr, int bind, struct netlink_ext_ack *extack) { struct tc_action_net *tn = net_generic(net, skbedit_net_id); + struct tcf_skbedit_params *params_old, *params_new; struct nlattr *tb[TCA_SKBEDIT_MAX + 1]; struct tc_skbedit *parm; struct tcf_skbedit *d; @@ -176,25 +182,34 @@ static int tcf_skbedit_init(struct net *net, struct nlattr
[PATCH net-next 1/2] net/sched: skbedit: use per-cpu counters
use per-CPU counters, instead of sharing a single set of stats with all cores: this removes the need of spinlocks when stats are read/updated. Signed-off-by: Davide Caratti --- net/sched/act_skbedit.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index dfaf5d8028dd..c1bfa28ba477 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -38,10 +38,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, { struct tcf_skbedit *d = to_skbedit(a); - spin_lock(>tcf_lock); tcf_lastuse_update(>tcf_tm); - bstats_update(>tcf_bstats, skb); + bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb); + spin_lock(>tcf_lock); if (d->flags & SKBEDIT_F_PRIORITY) skb->priority = d->priority; if (d->flags & SKBEDIT_F_INHERITDSFIELD) { @@ -77,8 +77,8 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, return d->tcf_action; err: - d->tcf_qstats.drops++; spin_unlock(>tcf_lock); + qstats_drop_inc(this_cpu_ptr(d->common.cpu_qstats)); return TC_ACT_SHOT; } @@ -163,7 +163,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, if (!exists) { ret = tcf_idr_create(tn, parm->index, est, a, -_skbedit_ops, bind, false); +_skbedit_ops, bind, true); if (ret) return ret; -- 2.17.1
[PATCH net-next 0/2] net/sched: act_skbedit: lockless data path
the data path of act_skbedit can be faster if we avoid using spinlocks: - patch 1 converts act_skbedit statistics to use per-cpu counters - patch 2 lets act_skbedit use RCU to read/update its configuration test procedure (using pktgen from https://github.com/netoptimizer): # ip link add name eth1 type dummy # ip link set dev eth1 up # tc qdisc add dev eth1 clsact # tc filter add dev eth1 egress matchall action skbedit priority c1a0:c1a0 # for c in 1 2 4 ; do > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 500 -i eth1 > done test results (avg. pps/thread) $c | before patch | after patch | improvement +--+--+ 1 | 3917464 ± 3% | 4000458 ± 3% | irrelevant 2 | 3455367 ± 4% | 3953076 ± 1% |+14% 4 | 2496594 ± 2% | 3801123 ± 3% |+52% Davide Caratti (2): net/sched: skbedit: use per-cpu counters net/sched: act_skbedit: don't use spinlock in the data path include/net/tc_act/tc_skbedit.h | 37 --- net/sched/act_skbedit.c | 113 2 files changed, 98 insertions(+), 52 deletions(-) -- 2.17.1
[PATCH net] ipfrag: really prevent allocation on netns exit
Setting the low threshold to 0 has no effect on frags allocation, we need to clear high_thresh instead. The code was pre-existent to commit 648700f76b03 ("inet: frags: use rhashtables for reassembly units"), but before the above, such assignment had a different role: prevent concurrent eviction from the worker and the netns cleanup helper. Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Paolo Abeni --- net/ipv4/inet_fragment.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index c9e35b81d093..1e4cf3ab560f 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -90,7 +90,7 @@ static void inet_frags_free_cb(void *ptr, void *arg) void inet_frags_exit_net(struct netns_frags *nf) { - nf->low_thresh = 0; /* prevent creation of new frags */ + nf->high_thresh = 0; /* prevent creation of new frags */ rhashtable_free_and_destroy(>rhashtable, inet_frags_free_cb, NULL); } -- 2.17.1
[PATCH net-next 03/10] net: hns3: Fix for waterline not setting correctly
From: Yunsheng Lin The HCLGE_RX_PRIV_EN_B is used to tell the firmware whether to update the specific waterline value, if the is not set, the firmware will ignore the value. This patch fixes by setting the HCLGE_RX_PRIV_EN_B even if the updated value is zero. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Yunsheng Lin Signed-off-by: Peng Li Signed-off-by: Salil Mehta --- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 6fffc69..dae1aa5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -1834,8 +1834,6 @@ static int hclge_rx_priv_buf_alloc(struct hclge_dev *hdev, return 0; } -#define HCLGE_PRIV_ENABLE(a) ((a) > 0 ? 1 : 0) - static int hclge_rx_priv_wl_config(struct hclge_dev *hdev, struct hclge_pkt_buf_alloc *buf_alloc) { @@ -1863,13 +1861,11 @@ static int hclge_rx_priv_wl_config(struct hclge_dev *hdev, req->tc_wl[j].high = cpu_to_le16(priv->wl.high >> HCLGE_BUF_UNIT_S); req->tc_wl[j].high |= - cpu_to_le16(HCLGE_PRIV_ENABLE(priv->wl.high) << - HCLGE_RX_PRIV_EN_B); + cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B)); req->tc_wl[j].low = cpu_to_le16(priv->wl.low >> HCLGE_BUF_UNIT_S); req->tc_wl[j].low |= - cpu_to_le16(HCLGE_PRIV_ENABLE(priv->wl.low) << - HCLGE_RX_PRIV_EN_B); +cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B)); } } @@ -1911,13 +1907,11 @@ static int hclge_common_thrd_config(struct hclge_dev *hdev, req->com_thrd[j].high = cpu_to_le16(tc->high >> HCLGE_BUF_UNIT_S); req->com_thrd[j].high |= - cpu_to_le16(HCLGE_PRIV_ENABLE(tc->high) << - HCLGE_RX_PRIV_EN_B); +cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B)); req->com_thrd[j].low = cpu_to_le16(tc->low >> HCLGE_BUF_UNIT_S); req->com_thrd[j].low |= - cpu_to_le16(HCLGE_PRIV_ENABLE(tc->low) << - HCLGE_RX_PRIV_EN_B); +cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B)); } } @@ -1943,14 +1937,10 @@ static int hclge_common_wl_config(struct hclge_dev *hdev, req = (struct hclge_rx_com_wl *)desc.data; req->com_wl.high = cpu_to_le16(buf->self.high >> HCLGE_BUF_UNIT_S); - req->com_wl.high |= - cpu_to_le16(HCLGE_PRIV_ENABLE(buf->self.high) << - HCLGE_RX_PRIV_EN_B); + req->com_wl.high |= cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B)); req->com_wl.low = cpu_to_le16(buf->self.low >> HCLGE_BUF_UNIT_S); - req->com_wl.low |= - cpu_to_le16(HCLGE_PRIV_ENABLE(buf->self.low) << - HCLGE_RX_PRIV_EN_B); + req->com_wl.low |= cpu_to_le16(BIT(HCLGE_RX_PRIV_EN_B)); ret = hclge_cmd_send(>hw, , 1); if (ret) { -- 2.7.4
Re: Crash due to destroying TCP request sockets using SOCK_DESTROY
On 07/05/2018 09:46 PM, Lorenzo Colitti wrote: > On Fri, Jul 6, 2018 at 11:37 AM Subash Abhinov Kasiviswanathan > wrote: >> >> From the call stack, a TCP socket is being destroyed using netlink_diag. >> The memory dump showed that the socket was an inet request socket (in >> state TCP_NEW_SYN_RECV) with refcount of 0. >> [...] >> 13232.479820: <2> refcount_t: underflow; use-after-free. >> 13232.479838: <6> [ cut here ] >> 13232.479843: <6> kernel BUG at kernel/msm-4.14/lib/refcount.c:204! >> 13232.479849: <6> Internal error: Oops - BUG: 0 [#1] PREEMPT SMP >> [...] >> 13232.479996: <6> Process netd (pid: 648, stack limit = >> 0xff801cf98000) >> 13232.479998: <2> Call trace: >> 13232.48: <2> refcount_sub_and_test+0x64/0x78 >> 13232.480002: <2> refcount_dec_and_test+0x18/0x24 >> 13232.480005: <2> sock_gen_put+0x1c/0xb0 >> 13232.480009: <2> tcp_diag_destroy+0x54/0x68 >> [...] > > Looks like for a TCP_NEW_SYN_RECV socket, sock_diag_destroy > essentially ends up doing: > > struct request_sock *req = inet_reqsk(sk); > > local_bh_disable(); > inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, > req); > local_bh_enable(); > ... > > sock_gen_put(sk); > > It looks like inet_csk_reqsk_queue_drop_and_put calls reqsk_put(req), > which frees the socket, and at that point sock_gen_put is a UAF. Do we > just need: > > -inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, > - req); > +inet_csk_reqsk_queue_drop(req->rsk_listener, req); > > since sock_gen_put will also end up calling reqsk_put() for a > TCP_SYN_RECV socket? > > Alastair - you're able to reproduce this UAF using net_test on qemu, > right? If so, could you try that two-line patch above? > Hi Lorenzo Your patch makes sense to me, please submit it formally with : Fixes: d7226c7a4dd1 ("net: diag: Fix refcnt leak in error path destroying socket") Cc: David Ahern Thanks !
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On (07/06/18 17:08), Ka-Cheong Poon wrote: > >Hmm. Why do you need to include tcp header in ib transport > >code ? If there is any common function just move to core > >common file and use it. > > I think it can be removed as it is left over from earlier > changes when the IB IPv6 listener port was RDS_TCP_PORT. > Now all the port definitions are in rds.h. Transport specific information such as port numbers should not be smeared into the common rds-module. Please fix that. If IB is also piggybacking on port 16385 (why?), please use your own definitions for it in IB specific header files. Santosh, David, I have to NACK this if it is not changed. --Sowmini
[RFC PATCH] ip: re-introduce fragments cache worker
Currently, the ip frag cache is fragile to overload. With flow control disabled: ./super_netperf.sh 10 -H 192.168.101.2 -t UDP_STREAM -l 60 9618.08 ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 28.66 Once that the overload condition is reached, the system does not recover until it's almost completely idle: ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 & sleep 4; I=0; for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; done 13.72 This is due to the removal of the fragment cache worker, which was responsible to free some IP fragment cache memory when the high threshold was reached, allowing the system to cope successfully with the next fragmented packets. This commit re-introduces the worker, on a per netns basis. Thanks to rhashtable walkers we can block the bh only for an entry removal. After this commit (and before IP frag worker removal): ./super_netperf.sh 10 -H 192.168.101.2 -t UDP_STREAM -l 60 9618.08 ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 8599.77 ./super_netperf.sh 200 -H 192.168.101.2 -t UDP_STREAM -l 60 & sleep 4; I=0; for P in `pidof netperf`; do kill -9 $P; I=$((I+1)); [ $I -gt 190 ] && break; done 9623.12 Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Paolo Abeni --- Note: tweaking ipfrag sysfs does not solve completely the issue: - raising ipfrag_high_thresh increases the number of parallels connections required to degrade the tput, but reached the IP fragment cache capacity the good-put still goes almost to 0, with the worker we get much more nice behaviour. - setting ipfrag_time to 2 increases the change to recover from overload (the 2# test above), but with several experiments in such test, I got an average of 50% the expected tput with a very large variance, with the worker we always see the expected/ line rate tput. --- include/net/inet_frag.h | 8 ++--- net/ipv4/inet_fragment.c | 72 ++-- 2 files changed, 73 insertions(+), 7 deletions(-) diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index ed07e3786d98..1f12692d7f7d 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -11,6 +11,8 @@ struct netns_frags { int timeout; int max_dist; struct inet_frags *f; + struct work_struct frags_work; + struct rhashtable_iter iter; struct rhashtable rhashtable cacheline_aligned_in_smp; @@ -101,11 +103,7 @@ struct inet_frags { int inet_frags_init(struct inet_frags *); void inet_frags_fini(struct inet_frags *); -static inline int inet_frags_init_net(struct netns_frags *nf) -{ - atomic_long_set(>mem, 0); - return rhashtable_init(>rhashtable, >f->rhash_params); -} +int inet_frags_init_net(struct netns_frags *nf); void inet_frags_exit_net(struct netns_frags *nf); void inet_frag_kill(struct inet_frag_queue *q); diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index c9e35b81d093..0f5b29ce96de 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -88,10 +88,76 @@ static void inet_frags_free_cb(void *ptr, void *arg) inet_frag_put(fq); } +static void inet_frag_schedule_worker(struct netns_frags *nf) +{ + if (unlikely(!work_pending(>frags_work))) + schedule_work(>frags_work); +} + +#define INETFRAGS_EVICT_MAX64 +static void inet_frag_worker(struct work_struct *work) +{ + struct netns_frags *nf; + bool reschedule; + int evicted = 0; + + nf = container_of(work, struct netns_frags, frags_work); + + rhashtable_walk_start(>iter); + + while ((reschedule = (frag_mem_limit(nf) > nf->low_thresh))) { + struct inet_frag_queue *fq = rhashtable_walk_next(>iter); + + if (IS_ERR(fq) && PTR_ERR(fq) == -EAGAIN) + continue; + if (!fq) { + /* end of table, restart the walk */ + rhashtable_walk_stop(>iter); + rhashtable_walk_exit(>iter); + rhashtable_walk_enter(>rhashtable, >iter); + rhashtable_walk_start(>iter); + continue; + } + if (!refcount_inc_not_zero(>refcnt)) + continue; + + spin_lock_bh(>lock); + inet_frag_kill(fq); + spin_unlock_bh(>lock); + inet_frag_put(fq); + + /* limit the amount of work we can do before a reschedule, +* to avoid starving others queued works +*/ + if (++evicted > INETFRAGS_EVICT_MAX) + break; + } + + rhashtable_walk_stop(>iter); + + if (reschedule) + inet_frag_schedule_worker(nf); +} + +int inet_frags_init_net(struct netns_frags *nf) +{ +