Re: Kernel panic on kernel-3.10.0-693.21.1.el7 in ndisc.h
On Mon, May 14, 2018 at 02:06:11PM -0700, Stephen Hemminger wrote: > On Mon, 14 May 2018 21:29:03 +0300 Roman Makhov > wrote: > > Thank you for the answer. > > Unfortunately CentOS goes with these dinosaurs. > > So we will try to debug the problem in the current one and try to > > reproduce on the latest kernel. > > If you are stuck in old kernels, please bug the CentOs maintainers not > upstream developers. Actually, it's not even the dinosaur. Kernels from RHEL (and therefore CentOS) or SLES differ from the original mainline version they are based on quite a lot. It's possible that this bug didn't really exist in the old 3.10 and is related to one of the backports added to CentOS (or rather RHEL in this case) kernel. People outside of the distribution have little idea what was backported and why so unless the issue can be reproduced with mainline kernel they cannot be expected to help. Michal Kubecek
Re: STMMAC driver with TSO enabled issue
Hi Jose, On 5/10/2018 9:15 PM, Jose Abreu wrote: On 10-05-2018 16:08, Bhadram Varka wrote: Hi Jose, On 5/10/2018 7:59 PM, Jose Abreu wrote: Hi Bhadram, On 10-05-2018 09:55, Jose Abreu wrote: ++net-dev Hi Bhadram, On 09-05-2018 12:03, Bhadram Varka wrote: Hi, Thanks for responding. Tried below suggested way. Still observing the issue - It seems stmmac has a bug in the RX side when using TSO which is causing all the RX descriptors to be consumed. The stmmac_rx() function will need to be refactored. I will send a fix ASAP. Are you using this patch [1] ? Because there is a problem with the patch. By adding the previously removed call to stmmac_init_rx_desc() TSO works okay in my setup. No. I don't have this change in my code base. I am using net-next tree. Can you please post the change for which TSO works ? I can help you with the testing. It should work with net-next because patch was not merged yet ... Please send me the output of "dmesg | grep -i stmmac", since boot and your full register values (from 0x0 to 0x12E4). [root@alarm ~]# dmesg | grep -i dwc [6.925005] dwc-eth-dwmac 249.ethernet: Cannot get CSR clock [6.933657] dwc-eth-dwmac 249.ethernet: no reset control found [6.955325] dwc-eth-dwmac 249.ethernet: User ID: 0x10, Synopsys ID: 0x41 [6.962379] dwc-eth-dwmac 249.ethernet: DWMAC4/5 [6.967434] dwc-eth-dwmac 249.ethernet: DMA HW capability register supported [6.974827] dwc-eth-dwmac 249.ethernet: RX Checksum Offload Engine supported [6.982915] dwc-eth-dwmac 249.ethernet: TX Checksum insertion supported [6.991235] dwc-eth-dwmac 249.ethernet: Wake-Up On Lan supported [6.998974] dwc-eth-dwmac 249.ethernet: TSO supported [7.006422] dwc-eth-dwmac 249.ethernet: TSO feature enabled [7.012581] dwc-eth-dwmac 249.ethernet: Enable RX Mitigation via HW Watchdog Timer [7.236391] dwc-eth-dwmac 249.ethernet eth0: device MAC address 4a:d1:e3:58:cb:7a [7.333414] dwc-eth-dwmac 249.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported [7.342441] dwc-eth-dwmac 249.ethernet eth0: registered PTP clock [ 10.157066] dwc-eth-dwmac 249.ethernet eth0: Link is Up - 1Gbps/Full - flow control off [root@alarm ~]# dmesg | grep -i stmma [7.020567] libphy: stmmac: probed [7.316295] Broadcom BCM89610 stmmac-0:00: attached PHY driver [Broadcom BCM89610] (mii_bus:phy_addr=stmmac-0:00, irq=64) I will get the register details - FYI - TSO works fine with single channel. I see the issue only if multi channel enabled (supports 4 Tx/Rx channels). -- Thanks, Bhadram.
Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
On Tue, May 15, 2018 at 12:25:03AM -0600, David Ahern wrote: > On 5/15/18 12:12 AM, Tobin C. Harding wrote: > > Queue questions on how to run your selftests ;) > > copy tools/testing/selftests/net/fib_tests.sh to test environment. > > Run it with no options to run all tests or to run specific tests: > > $ fib_tests.sh -t ipv4_rt > $ fib_tests.sh -t ipv6_rt I was joking but thanks! Tobin
Re: [PATCH net-next 2/3] gso: limit udp gso to egress-only virtual devices
Hi Willem, Thank you for the patch! Yet something to improve: [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/udp-gso-fixes/20180515-120246 config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=xtensa All errors (new ones prefixed by >>): drivers/net//team/team.c: In function '__team_compute_features': >> drivers/net//team/team.c:1030:10: error: 'NETIF_GSO_UDP_L4' undeclared >> (first use in this function); did you mean 'NETIF_F_GSO_UDP_L4'? NETIF_GSO_UDP_L4; ^~~~ NETIF_F_GSO_UDP_L4 drivers/net//team/team.c:1030:10: note: each undeclared identifier is reported only once for each function it appears in vim +1030 drivers/net//team/team.c 996 997 #define TEAM_VLAN_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \ 998 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \ 999 NETIF_F_HIGHDMA | NETIF_F_LRO) 1000 1001 #define TEAM_ENC_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \ 1002 NETIF_F_RXCSUM | NETIF_F_ALL_TSO) 1003 1004 static void __team_compute_features(struct team *team) 1005 { 1006 struct team_port *port; 1007 u32 vlan_features = TEAM_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL; 1008 netdev_features_t enc_features = TEAM_ENC_FEATURES; 1009 unsigned short max_hard_header_len = ETH_HLEN; 1010 unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE | 1011 IFF_XMIT_DST_RELEASE_PERM; 1012 1013 list_for_each_entry(port, &team->port_list, list) { 1014 vlan_features = netdev_increment_features(vlan_features, 1015 port->dev->vlan_features, 1016 TEAM_VLAN_FEATURES); 1017 enc_features = 1018 netdev_increment_features(enc_features, 1019 port->dev->hw_enc_features, 1020TEAM_ENC_FEATURES); 1021 1022 1023 dst_release_flag &= port->dev->priv_flags; 1024 if (port->dev->hard_header_len > max_hard_header_len) 1025 max_hard_header_len = port->dev->hard_header_len; 1026 } 1027 1028 team->dev->vlan_features = vlan_features; 1029 team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL | > 1030 NETIF_GSO_UDP_L4; 1031 team->dev->hard_header_len = max_hard_header_len; 1032 1033 team->dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; 1034 if (dst_release_flag == (IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM)) 1035 team->dev->priv_flags |= IFF_XMIT_DST_RELEASE; 1036 } 1037 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
On 5/15/18 12:12 AM, Tobin C. Harding wrote: > Queue questions on how to run your selftests ;) copy tools/testing/selftests/net/fib_tests.sh to test environment. Run it with no options to run all tests or to run specific tests: $ fib_tests.sh -t ipv4_rt $ fib_tests.sh -t ipv6_rt
Re: possible deadlock in sk_diag_fill
On Tue, May 15, 2018 at 07:19:39AM +0200, Dmitry Vyukov wrote: > On Mon, May 14, 2018 at 8:00 PM, Andrei Vagin wrote: > >> >> Hello, > >> >> > >> >> syzbot found the following crash on: > >> >> > >> >> HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of > >> >> git://git.k.. > >> >> git tree: upstream > >> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c9780 > >> >> kernel config: > >> >> https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27 > >> >> dashboard link: > >> >> https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669 > >> >> compiler: gcc (GCC) 8.0.1 20180413 (experimental) > >> >> userspace arch: i386 > >> >> > >> >> Unfortunately, I don't have any reproducer for this crash yet. > >> >> > >> >> IMPORTANT: if you fix the bug, please add the following tag to the > >> >> commit: > >> >> Reported-by: syzbot+c1872be62e587eae9...@syzkaller.appspotmail.com > >> >> > >> >> > >> >> == > >> >> WARNING: possible circular locking dependency detected > >> >> 4.17.0-rc3+ #59 Not tainted > >> >> -- > >> >> syz-executor1/25282 is trying to acquire lock: > >> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons > >> >> net/unix/diag.c:82 [inline] > >> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: > >> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144 > >> >> > >> >> but task is already holding lock: > >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock > >> >> include/linux/spinlock.h:310 [inline] > >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons > >> >> net/unix/diag.c:64 [inline] > >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: > >> >> sk_diag_fill.isra.5+0x94e/0x10d0 > >> >> net/unix/diag.c:144 > >> >> > >> >> which lock already depends on the new lock. > >> > > >> > In the code, we have a comment which explains why it is safe to take > >> > this lock > >> > > >> > /* > >> > * The state lock is outer for the same sk's > >> > * queue lock. With the other's queue locked it's > >> > * OK to lock the state. > >> > */ > >> > unix_state_lock_nested(req); > >> > > >> > It is a question how to explain this to lockdep. > >> > >> Do I understand it correctly that (&u->lock)->rlock associated with > >> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is > >> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I > >> think we need to split (&u->lock)->rlock by family too, so that we > >> have u->lock-AF_UNIX and u->lock-AF_NETLINK. > > > > I think here is another problem. lockdep woried about > > sk->sk_receive_queue vs unix_sk(s)->lock. > > > > sk_diag_dump_icons() takes sk->sk_receive_queue and then > > unix_sk(s)->lock. > > > > unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue. > > > > sk_diag_dump_icons() takes locks for two different sockets, but > > unix_dgram_sendmsg() takes locks for one socket. > > > > sk_diag_dump_icons > > if (sk->sk_state == TCP_LISTEN) { > > spin_lock(&sk->sk_receive_queue.lock); > > skb_queue_walk(&sk->sk_receive_queue, skb) { > > unix_state_lock_nested(req); > > spin_lock_nested(&unix_sk(s)->lock, > > > > > > unix_dgram_sendmsg > > unix_state_lock(other) > > spin_lock(&unix_sk(s)->lock) > > skb_queue_tail(&other->sk_receive_queue, skb); > > spin_lock_irqsave(&list->lock, flags); > > > Do you mean the following? > There is socket 1 with state lock (S1) and queue lock (Q2), and socket > 2 with state lock (S2) and queue lock (Q2). unix_dgram_sendmsg lock > S1->Q1. And sk_diag_dump_icons locks Q1->S2. > If yes, then this looks pretty much as deadlock. Consider that 2 > unix_dgram_sendmsg in 2 different threads lock S1 and S2 respectively. > Now 2 sk_diag_dump_icons in 2 different threads lock Q1 and Q2 > respectively. Now sk_diag_dump_icons want to lock S's, and > unix_dgram_sendmsg want to lock Q's. Nobody can proceed. Q1 and S1 belongs to a listen socket, so they can't be taken from unix_dgram_sendmsg().
Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
On Mon, May 14, 2018 at 11:54:08PM -0600, David Ahern wrote: > On 5/14/18 11:21 PM, Tobin C. Harding wrote: > > Hi David, > > > > On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote: > >> This patch set fixes a few append and replace uses cases for IPv6 and > >> adds test cases that codifies the expectations of how append and replace > >> are expected to work. > > > > Nood question: what commit does this apply on top of please. I > > attempted to apply the set on top of net-next > > > > commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx') > > > > patch 4 and 5 have merge conflicts (I stopped at 5). > > Base commit is: > > commit 289e1f4e9e4a09c73a1c0152bb93855ea351ccda > Author: Anders Roxell > Date: Sun May 13 21:48:30 2018 +0200 > > net: ipv4: ipconfig: fix unused variable > > > > I don't see 961423f9fcbc in any tree. Got it, thanks. FTR $ date Tuesday 15 May 16:05:51 AEST 2018 $ git status On branch net-next Your branch is up-to-date with 'net-next/master'. nothing to commit, working directory clean $ git log -49 --pretty=oneline --abbrev-commit --reverse | grep 'ipv4: ipconfig: fix unused variable' 289e1f4e9e4a net: ipv4: ipconfig: fix unused variable $ git log -48 --pretty=oneline --abbrev-commit --reverse | grep 'ipv4: ipconfig: fix unused variable' Queue questions on how to run your selftests ;) thanks, Tobin.
Re: [PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf
On Mon, 14 May 2018 22:57:53 -0700 Alexei Starovoitov wrote: > On Mon, May 14, 2018 at 10:35:01PM -0700, Jakub Kicinski wrote: > > Hi! > > > > Following patches address build issues after recent move to libbpf. > > For out-of-tree builds we would see the following error: > > > > gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or > > directory > > > > libbpf build system is now always invoked explicitly rather than > > relying on building single objects most of the time. We need to > > resolve the friction between Kbuild and tools/ build system. > > > > Mini-library called libbpf.h in samples is renamed to bpf_insn.h, > > using linux/filter.h seems not completely trivial since some samples > > get upset when order on include search path in changed. We do have > > to rename libbpf.h, however, because otherwise it's hard to reliably > > get to libbpf's header in out-of-tree builds. > > > > v2: > > - fix the build error harder (patch 3); > > - add patch 5 (make clang less noisy). > > Applied, Thanks I just tried it out, but the build fails. $ make samples/bpf/ CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h CHK include/generated/bounds.h CHK include/generated/timeconst.h CHK include/generated/asm-offsets.h CALLscripts/checksyscalls.sh DESCEND objtool CHK scripts/mod/devicetable-offsets.h make -C /home/jbrouer/git/kernel/bpf-next/samples/bpf/../../tools/lib/bpf/ RM='rm -rf' LDFLAGS= srctree=/home/jbrouer/git/kernel/bpf-next/samples/bpf/../../ O= Auto-detecting system features: ...libelf: [ OFF ] ... bpf: [ OFF ] No libelf found make[2]: *** [Makefile:212: elfdep] Error 1 make[1]: *** [samples/bpf/Makefile:207: /home/jbrouer/git/kernel/bpf-next/samples/bpf/../../tools/lib/bpf/libbpf.a] Error 2 make: *** [Makefile:1743: samples/bpf/] Error 2 But it might not be related to this change, as the problem seems to the with the "Auto-detecting system features". It is the build of libbpf that is the problem. $ cd tools/lib/bpf/ $ make Auto-detecting system features: ...libelf: [ OFF ] ... bpf: [ OFF ] No libelf found make: *** [Makefile:212: elfdep] Error 1 And do have 'libelf' installed on this system... the same build/make commands works on net-next. On net-next I see: $ cd tools/lib/bpf/ $ make Auto-detecting system features: ...libelf: [ on ] ... bpf: [ on ] Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h' Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from latest version at 'include/uapi/linux/if_link.h' CC libbpf.o CC bpf.o CC nlattr.o CC btf.o LD libbpf-in.o LINK libbpf.a LINK libbpf.so -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [bpf-next PATCH 0/5] bpf, doc: convert Documentation/bpf to RST-formatting
On Mon, May 14, 2018 at 10:57:25PM -0700, Y Song wrote: > On Mon, May 14, 2018 at 6:42 AM, Jesper Dangaard Brouer > wrote: > > The kernel is moving files under Documentation to use the RST > > (reStructuredText) format and Sphinx [1]. This patchset converts the > > files under Documentation/bpf/ into RST format. The Sphinx > > integration is left as followup work. > > > > [1] https://www.kernel.org/doc/html/latest/doc-guide/sphinx.html > > > > This patchset have been uploaded as branch bpf_doc10 on github[2], so > > reviewers can see how GitHub renders this. > > > > [2] https://github.com/netoptimizer/linux/tree/bpf_doc10/Documentation/bpf > > > > --- > > > > Jesper Dangaard Brouer (5): > > bpf, doc: add basic README.rst file > > bpf, doc: rename txt files to rst files > > bpf, doc: convert bpf_design_QA.rst to use RST formatting > > bpf, doc: convert bpf_devel_QA.rst to use RST formatting > > bpf, doc: howto use/run the BPF selftests > > This initial conversion from .txt to .rst files look good to me. Ack > for the whole series. > Acked-by: Yonghong Song Applied, Thanks
Re: [PATCH bpf-next v2 3/5] samples: bpf: fix build after move to compiling full libbpf.a
2018-05-15 7:35 GMT+02:00 Jakub Kicinski : > There are many ways users may compile samples, some of them got > broken by commit 5f9380572b4b ("samples: bpf: compile and link > against full libbpf"). Improve path resolution and make libbpf > building a dependency of source files to force its build. > > Samples should now again build with any of: > cd samples/bpf; make > make samples/bpf/ > make -C samples/bpf > cd samples/bpf; make O=builddir > make samples/bpf/ O=builddir > make -C samples/bpf O=builddir > export KBUILD_OUTPUT=builddir > make samples/bpf/ > make -C samples/bpf > > Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf") > Reported-by: Björn Töpel > Signed-off-by: Jakub Kicinski > --- > samples/bpf/Makefile | 17 - > 1 file changed, 12 insertions(+), 5 deletions(-) > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > index 9e255ca4059a..0dae77c88d2e 100644 > --- a/samples/bpf/Makefile > +++ b/samples/bpf/Makefile > @@ -1,4 +1,8 @@ > # SPDX-License-Identifier: GPL-2.0 > + > +BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src)) > +TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools > + > # List of programs to build > hostprogs-y := test_lru_dist > hostprogs-y += sock_example > @@ -49,7 +53,8 @@ hostprogs-y += xdpsock > hostprogs-y += xdp_fwd > > # Libbpf dependencies > -LIBBPF := ../../tools/lib/bpf/libbpf.a > +LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a > + > CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o > TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o > > @@ -233,15 +238,16 @@ CLANG_ARCH_ARGS = -target $(ARCH) > endif > > # Trick to allow make to be run from this directory > -all: $(LIBBPF) > - $(MAKE) -C ../../ $(CURDIR)/ > +all: > + $(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR) > > clean: > $(MAKE) -C ../../ M=$(CURDIR) clean > @rm -f *~ > > $(LIBBPF): FORCE > - $(MAKE) -C $(dir $@) > +# Fix up variables inherited from Kbuild that tools/ build system won't like > + $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= > srctree=$(BPF_SAMPLES_PATH)/../../ O= > > $(obj)/syscall_nrs.s: $(src)/syscall_nrs.c > $(call if_changed_dep,cc_s_c) > @@ -272,7 +278,8 @@ verify_target_bpf: verify_cmds > exit 2; \ > else true; fi > > -$(src)/*.c: verify_target_bpf > +$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF) > +$(src)/*.c: verify_target_bpf $(LIBBPF) > > $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h > > -- > 2.17.0 > Thanks a bunch for this! Tested-by: Björn Töpel
Re: [bpf-next PATCH 0/5] bpf, doc: convert Documentation/bpf to RST-formatting
On Mon, May 14, 2018 at 6:42 AM, Jesper Dangaard Brouer wrote: > The kernel is moving files under Documentation to use the RST > (reStructuredText) format and Sphinx [1]. This patchset converts the > files under Documentation/bpf/ into RST format. The Sphinx > integration is left as followup work. > > [1] https://www.kernel.org/doc/html/latest/doc-guide/sphinx.html > > This patchset have been uploaded as branch bpf_doc10 on github[2], so > reviewers can see how GitHub renders this. > > [2] https://github.com/netoptimizer/linux/tree/bpf_doc10/Documentation/bpf > > --- > > Jesper Dangaard Brouer (5): > bpf, doc: add basic README.rst file > bpf, doc: rename txt files to rst files > bpf, doc: convert bpf_design_QA.rst to use RST formatting > bpf, doc: convert bpf_devel_QA.rst to use RST formatting > bpf, doc: howto use/run the BPF selftests This initial conversion from .txt to .rst files look good to me. Ack for the whole series. Acked-by: Yonghong Song > > > Documentation/bpf/README.rst| 36 ++ > Documentation/bpf/bpf_design_QA.rst | 221 > Documentation/bpf/bpf_design_QA.txt | 156 - > Documentation/bpf/bpf_devel_QA.rst | 640 > +++ > Documentation/bpf/bpf_devel_QA.txt | 570 --- > 5 files changed, 897 insertions(+), 726 deletions(-) > create mode 100644 Documentation/bpf/README.rst > create mode 100644 Documentation/bpf/bpf_design_QA.rst > delete mode 100644 Documentation/bpf/bpf_design_QA.txt > create mode 100644 Documentation/bpf/bpf_devel_QA.rst > delete mode 100644 Documentation/bpf/bpf_devel_QA.txt > > --
Re: [PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf
On Mon, May 14, 2018 at 10:35:01PM -0700, Jakub Kicinski wrote: > Hi! > > Following patches address build issues after recent move to libbpf. > For out-of-tree builds we would see the following error: > > gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or > directory > > libbpf build system is now always invoked explicitly rather than > relying on building single objects most of the time. We need to > resolve the friction between Kbuild and tools/ build system. > > Mini-library called libbpf.h in samples is renamed to bpf_insn.h, > using linux/filter.h seems not completely trivial since some samples > get upset when order on include search path in changed. We do have > to rename libbpf.h, however, because otherwise it's hard to reliably > get to libbpf's header in out-of-tree builds. > > v2: > - fix the build error harder (patch 3); > - add patch 5 (make clang less noisy). Applied, Thanks
Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
On 5/14/18 11:21 PM, Tobin C. Harding wrote: > Hi David, > > On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote: >> This patch set fixes a few append and replace uses cases for IPv6 and >> adds test cases that codifies the expectations of how append and replace >> are expected to work. > > Nood question: what commit does this apply on top of please. I > attempted to apply the set on top of net-next > > commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx') > > patch 4 and 5 have merge conflicts (I stopped at 5). Base commit is: commit 289e1f4e9e4a09c73a1c0152bb93855ea351ccda Author: Anders Roxell Date: Sun May 13 21:48:30 2018 +0200 net: ipv4: ipconfig: fix unused variable I don't see 961423f9fcbc in any tree.
[PATCH bpf-next v2 4/5] samples: bpf: move libbpf from object dependencies to libs
Make complains that it doesn't know how to make libbpf.a: scripts/Makefile.host:106: target 'samples/bpf/../../tools/lib/bpf/libbpf.a' doesn't match the target pattern Now that we have it as a dependency of the sources simply add libbpf.a to libraries not objects. Signed-off-by: Jakub Kicinski Acked-by: Jesper Dangaard Brouer --- samples/bpf/Makefile | 145 +++ 1 file changed, 51 insertions(+), 94 deletions(-) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 0dae77c88d2e..0036a77c2d97 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -58,55 +58,53 @@ LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o -test_lru_dist-objs := test_lru_dist.o $(LIBBPF) -sock_example-objs := sock_example.o $(LIBBPF) -fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o -sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o -sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o -sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o -tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o -tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o -tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o -tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o -tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o -tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o -tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o -load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o -test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o -trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS) -lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o -offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS) -spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS) -map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o -test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o -test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o $(LIBBPF) -test_cgrp2_attach-objs := test_cgrp2_attach.o $(LIBBPF) -test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(LIBBPF) $(CGROUP_HELPERS) -test_cgrp2_sock-objs := test_cgrp2_sock.o $(LIBBPF) -test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o -xdp1-objs := xdp1_user.o $(LIBBPF) +fds_example-objs := bpf_load.o fds_example.o +sockex1-objs := bpf_load.o sockex1_user.o +sockex2-objs := bpf_load.o sockex2_user.o +sockex3-objs := bpf_load.o sockex3_user.o +tracex1-objs := bpf_load.o tracex1_user.o +tracex2-objs := bpf_load.o tracex2_user.o +tracex3-objs := bpf_load.o tracex3_user.o +tracex4-objs := bpf_load.o tracex4_user.o +tracex5-objs := bpf_load.o tracex5_user.o +tracex6-objs := bpf_load.o tracex6_user.o +tracex7-objs := bpf_load.o tracex7_user.o +load_sock_ops-objs := bpf_load.o load_sock_ops.o +test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o +trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS) +lathist-objs := bpf_load.o lathist_user.o +offwaketime-objs := bpf_load.o offwaketime_user.o $(TRACE_HELPERS) +spintest-objs := bpf_load.o spintest_user.o $(TRACE_HELPERS) +map_perf_test-objs := bpf_load.o map_perf_test_user.o +test_overhead-objs := bpf_load.o test_overhead_user.o +test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o +test_cgrp2_attach-objs := test_cgrp2_attach.o +test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(CGROUP_HELPERS) +test_cgrp2_sock-objs := test_cgrp2_sock.o +test_cgrp2_sock2-objs := bpf_load.o test_cgrp2_sock2.o +xdp1-objs := xdp1_user.o # reuse xdp1 source intentionally -xdp2-objs := xdp1_user.o $(LIBBPF) -xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o -test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \ +xdp2-objs := xdp1_user.o +xdp_router_ipv4-objs := bpf_load.o xdp_router_ipv4_user.o +test_current_task_under_cgroup-objs := bpf_load.o $(CGROUP_HELPERS) \ test_current_task_under_cgroup_user.o -trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS) -sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS) -tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o -lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o -xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o -test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o -per_socket_stats_example-objs := cookie_uid_helper_example.o $(LIBBPF) -xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o -xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o -xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o -xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o -xdp_rxq_info-objs := xdp_rxq_info_user.o $(LIBBPF) -syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o -cpustat-objs := bpf_load.o $(LIBBP
[PATCH bpf-next v2 2/5] samples: bpf: rename libbpf.h to bpf_insn.h
The libbpf.h file in samples is clashing with libbpf's header. Since it only includes a subset of filter.h instruction helpers rename it to bpf_insn.h. Drop the unnecessary include of bpf/bpf.h. Signed-off-by: Jakub Kicinski Acked-by: Jesper Dangaard Brouer --- samples/bpf/{libbpf.h => bpf_insn.h}| 8 +++- samples/bpf/cookie_uid_helper_example.c | 2 +- samples/bpf/fds_example.c | 4 +++- samples/bpf/sock_example.c | 3 ++- samples/bpf/test_cgrp2_attach.c | 3 ++- samples/bpf/test_cgrp2_attach2.c| 3 ++- samples/bpf/test_cgrp2_sock.c | 3 ++- samples/bpf/test_cgrp2_sock2.c | 3 ++- 8 files changed, 17 insertions(+), 12 deletions(-) rename samples/bpf/{libbpf.h => bpf_insn.h} (98%) diff --git a/samples/bpf/libbpf.h b/samples/bpf/bpf_insn.h similarity index 98% rename from samples/bpf/libbpf.h rename to samples/bpf/bpf_insn.h index 18bfee5aab6b..20dc5cefec84 100644 --- a/samples/bpf/libbpf.h +++ b/samples/bpf/bpf_insn.h @@ -1,9 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ -/* eBPF mini library */ -#ifndef __LIBBPF_H -#define __LIBBPF_H - -#include +/* eBPF instruction mini library */ +#ifndef __BPF_INSN_H +#define __BPF_INSN_H struct bpf_insn; diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c index 8eca27e595ae..deb0e3e0324d 100644 --- a/samples/bpf/cookie_uid_helper_example.c +++ b/samples/bpf/cookie_uid_helper_example.c @@ -51,7 +51,7 @@ #include #include #include -#include "libbpf.h" +#include "bpf_insn.h" #define PORT diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c index e29bd52ff9e8..9854854f05d1 100644 --- a/samples/bpf/fds_example.c +++ b/samples/bpf/fds_example.c @@ -12,8 +12,10 @@ #include #include +#include + +#include "bpf_insn.h" #include "bpf_load.h" -#include "libbpf.h" #include "sock_example.h" #define BPF_F_PIN (1 << 0) diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c index 33a637507c00..60ec467c78ab 100644 --- a/samples/bpf/sock_example.c +++ b/samples/bpf/sock_example.c @@ -26,7 +26,8 @@ #include #include #include -#include "libbpf.h" +#include +#include "bpf_insn.h" #include "sock_example.h" char bpf_log_buf[BPF_LOG_BUF_SIZE]; diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c index 4bfcaf93fcf3..20fbd1241db3 100644 --- a/samples/bpf/test_cgrp2_attach.c +++ b/samples/bpf/test_cgrp2_attach.c @@ -28,8 +28,9 @@ #include #include +#include -#include "libbpf.h" +#include "bpf_insn.h" enum { MAP_KEY_PACKETS, diff --git a/samples/bpf/test_cgrp2_attach2.c b/samples/bpf/test_cgrp2_attach2.c index 1af412ec6007..b453e6a161be 100644 --- a/samples/bpf/test_cgrp2_attach2.c +++ b/samples/bpf/test_cgrp2_attach2.c @@ -24,8 +24,9 @@ #include #include +#include -#include "libbpf.h" +#include "bpf_insn.h" #include "cgroup_helpers.h" #define FOO"/foo" diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c index e79594dd629b..b0811da5a00f 100644 --- a/samples/bpf/test_cgrp2_sock.c +++ b/samples/bpf/test_cgrp2_sock.c @@ -21,8 +21,9 @@ #include #include #include +#include -#include "libbpf.h" +#include "bpf_insn.h" char bpf_log_buf[BPF_LOG_BUF_SIZE]; diff --git a/samples/bpf/test_cgrp2_sock2.c b/samples/bpf/test_cgrp2_sock2.c index e53f1f6f0867..3b5be2364975 100644 --- a/samples/bpf/test_cgrp2_sock2.c +++ b/samples/bpf/test_cgrp2_sock2.c @@ -19,8 +19,9 @@ #include #include #include +#include -#include "libbpf.h" +#include "bpf_insn.h" #include "bpf_load.h" static int usage(const char *argv0) -- 2.17.0
Re: [PATCH v1 4/4] samples/bpf: an example of a raw IR decoder
On Mon, May 14, 2018 at 2:11 PM, Sean Young wrote: > This implements the grundig-16 IR protocol. > > Signed-off-by: Sean Young > --- > samples/bpf/Makefile | 4 + > samples/bpf/bpf_load.c| 9 +- > samples/bpf/grundig_decoder_kern.c| 112 ++ > samples/bpf/grundig_decoder_user.c| 54 +++ > tools/bpf/bpftool/prog.c | 1 + > tools/include/uapi/linux/bpf.h| 17 +++- > tools/testing/selftests/bpf/bpf_helpers.h | 6 ++ > 7 files changed, 200 insertions(+), 3 deletions(-) > create mode 100644 samples/bpf/grundig_decoder_kern.c > create mode 100644 samples/bpf/grundig_decoder_user.c > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > index 4d6a6edd4bf6..c6fa111f103a 100644 > --- a/samples/bpf/Makefile > +++ b/samples/bpf/Makefile > @@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor > hostprogs-y += xdp_rxq_info > hostprogs-y += syscall_tp > hostprogs-y += cpustat > +hostprogs-y += grundig_decoder > > # Libbpf dependencies > LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o > @@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o > xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o > syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o > cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o > +grundig_decoder-objs := bpf_load.o $(LIBBPF) grundig_decoder_user.o > > # Tell kbuild to always build the programs > always := $(hostprogs-y) > @@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o > always += xdp2skb_meta_kern.o > always += syscall_tp_kern.o > always += cpustat_kern.o > +always += grundig_decoder_kern.o > > HOSTCFLAGS += -I$(objtree)/usr/include > HOSTCFLAGS += -I$(srctree)/tools/lib/ > @@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf > HOSTLOADLIBES_xdp_rxq_info += -lelf > HOSTLOADLIBES_syscall_tp += -lelf > HOSTLOADLIBES_cpustat += -lelf > +HOSTLOADLIBES_grundig_decoder += -lelf > > # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on > cmdline: > # make samples/bpf/ LLC=~/git/llvm/build/bin/llc > CLANG=~/git/llvm/build/bin/clang > diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c > index bebe4188b4b3..0fd389e95bb9 100644 > --- a/samples/bpf/bpf_load.c > +++ b/samples/bpf/bpf_load.c > @@ -69,6 +69,7 @@ static int load_and_attach(const char *event, struct > bpf_insn *prog, int size) > bool is_sockops = strncmp(event, "sockops", 7) == 0; > bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0; > bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0; > + bool is_ir_decoder = strncmp(event, "ir_decoder", 10) == 0; > size_t insns_cnt = size / sizeof(struct bpf_insn); > enum bpf_prog_type prog_type; > char buf[256]; > @@ -102,6 +103,8 @@ static int load_and_attach(const char *event, struct > bpf_insn *prog, int size) > prog_type = BPF_PROG_TYPE_SK_SKB; > } else if (is_sk_msg) { > prog_type = BPF_PROG_TYPE_SK_MSG; > + } else if (is_ir_decoder) { > + prog_type = BPF_PROG_TYPE_RAWIR_DECODER; > } else { > printf("Unknown event '%s'\n", event); > return -1; > @@ -116,7 +119,8 @@ static int load_and_attach(const char *event, struct > bpf_insn *prog, int size) > > prog_fd[prog_cnt++] = fd; > > - if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk) > + if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk || > + is_ir_decoder) > return 0; > > if (is_socket || is_sockops || is_sk_skb || is_sk_msg) { > @@ -607,7 +611,8 @@ static int do_load_bpf_file(const char *path, > fixup_map_cb fixup_map) > memcmp(shname, "cgroup/", 7) == 0 || > memcmp(shname, "sockops", 7) == 0 || > memcmp(shname, "sk_skb", 6) == 0 || > - memcmp(shname, "sk_msg", 6) == 0) { > + memcmp(shname, "sk_msg", 6) == 0 || > + memcmp(shname, "ir_decoder", 10) == 0) { > ret = load_and_attach(shname, data->d_buf, > data->d_size); > if (ret != 0) > diff --git a/samples/bpf/grundig_decoder_kern.c > b/samples/bpf/grundig_decoder_kern.c > new file mode 100644 > index ..c80f2c9cc69a > --- /dev/null > +++ b/samples/bpf/grundig_decoder_kern.c > @@ -0,0 +1,112 @@ > + > +#include > +#include > +#include "bpf_helpers.h" > +#include > + > +enum grundig_state { > + STATE_INACTIVE, > + STATE_HEADER_SPACE, > + STATE_LEADING_PULSE, > + STATE_BITS_SPACE, > + STATE_BITS_PULSE, > +}; > + > +struct decoder_state { > + u32 bits; > + enum grundig_state state; > + u32 count; > + u32 last_space; > +}; > + > +struct bpf_map_def SEC("maps") decoder_state_map
[PATCH bpf-next v2 3/5] samples: bpf: fix build after move to compiling full libbpf.a
There are many ways users may compile samples, some of them got broken by commit 5f9380572b4b ("samples: bpf: compile and link against full libbpf"). Improve path resolution and make libbpf building a dependency of source files to force its build. Samples should now again build with any of: cd samples/bpf; make make samples/bpf/ make -C samples/bpf cd samples/bpf; make O=builddir make samples/bpf/ O=builddir make -C samples/bpf O=builddir export KBUILD_OUTPUT=builddir make samples/bpf/ make -C samples/bpf Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf") Reported-by: Björn Töpel Signed-off-by: Jakub Kicinski --- samples/bpf/Makefile | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 9e255ca4059a..0dae77c88d2e 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -1,4 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 + +BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src)) +TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools + # List of programs to build hostprogs-y := test_lru_dist hostprogs-y += sock_example @@ -49,7 +53,8 @@ hostprogs-y += xdpsock hostprogs-y += xdp_fwd # Libbpf dependencies -LIBBPF := ../../tools/lib/bpf/libbpf.a +LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a + CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o @@ -233,15 +238,16 @@ CLANG_ARCH_ARGS = -target $(ARCH) endif # Trick to allow make to be run from this directory -all: $(LIBBPF) - $(MAKE) -C ../../ $(CURDIR)/ +all: + $(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR) clean: $(MAKE) -C ../../ M=$(CURDIR) clean @rm -f *~ $(LIBBPF): FORCE - $(MAKE) -C $(dir $@) +# Fix up variables inherited from Kbuild that tools/ build system won't like + $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(BPF_SAMPLES_PATH)/../../ O= $(obj)/syscall_nrs.s: $(src)/syscall_nrs.c $(call if_changed_dep,cc_s_c) @@ -272,7 +278,8 @@ verify_target_bpf: verify_cmds exit 2; \ else true; fi -$(src)/*.c: verify_target_bpf +$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF) +$(src)/*.c: verify_target_bpf $(LIBBPF) $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h -- 2.17.0
[PATCH bpf-next v2 1/5] samples: bpf: include bpf/bpf.h instead of local libbpf.h
There are two files in the tree called libbpf.h which is becoming problematic. Most samples don't actually need the local libbpf.h they simply include it to get to bpf/bpf.h. Include bpf/bpf.h directly instead. Signed-off-by: Jakub Kicinski Acked-by: Jesper Dangaard Brouer --- samples/bpf/bpf_load.c| 2 +- samples/bpf/bpf_load.h| 2 +- samples/bpf/cpustat_user.c| 2 +- samples/bpf/lathist_user.c| 2 +- samples/bpf/load_sock_ops.c | 2 +- samples/bpf/lwt_len_hist_user.c | 2 +- samples/bpf/map_perf_test_user.c | 2 +- samples/bpf/sock_example.h| 1 - samples/bpf/sockex1_user.c| 2 +- samples/bpf/sockex2_user.c| 2 +- samples/bpf/sockex3_user.c| 2 +- samples/bpf/syscall_tp_user.c | 2 +- samples/bpf/tc_l2_redirect_user.c | 2 +- samples/bpf/test_cgrp2_array_pin.c| 2 +- samples/bpf/test_current_task_under_cgroup_user.c | 2 +- samples/bpf/test_lru_dist.c | 2 +- samples/bpf/test_map_in_map_user.c| 2 +- samples/bpf/test_overhead_user.c | 2 +- samples/bpf/test_probe_write_user_user.c | 2 +- samples/bpf/trace_output_user.c | 2 +- samples/bpf/tracex1_user.c| 2 +- samples/bpf/tracex2_user.c| 2 +- samples/bpf/tracex3_user.c| 2 +- samples/bpf/tracex4_user.c| 2 +- samples/bpf/tracex5_user.c| 2 +- samples/bpf/tracex6_user.c| 2 +- samples/bpf/tracex7_user.c| 2 +- samples/bpf/xdp_fwd_user.c| 2 +- samples/bpf/xdp_monitor_user.c| 2 +- samples/bpf/xdp_redirect_cpu_user.c | 2 +- samples/bpf/xdp_redirect_map_user.c | 2 +- samples/bpf/xdp_redirect_user.c | 2 +- samples/bpf/xdp_router_ipv4_user.c| 2 +- samples/bpf/xdp_tx_iptunnel_user.c| 2 +- samples/bpf/xdpsock_user.c| 2 +- 35 files changed, 34 insertions(+), 35 deletions(-) diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index a6b290de5632..89161c9ed466 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -24,7 +24,7 @@ #include #include #include -#include "libbpf.h" +#include #include "bpf_load.h" #include "perf-sys.h" diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h index f9da59bca0cc..814894a12974 100644 --- a/samples/bpf/bpf_load.h +++ b/samples/bpf/bpf_load.h @@ -2,7 +2,7 @@ #ifndef __BPF_LOAD_H #define __BPF_LOAD_H -#include "libbpf.h" +#include #define MAX_MAPS 32 #define MAX_PROGS 32 diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c index 2b4cd1ae57c5..869a99406dbf 100644 --- a/samples/bpf/cpustat_user.c +++ b/samples/bpf/cpustat_user.c @@ -17,7 +17,7 @@ #include #include -#include "libbpf.h" +#include #include "bpf_load.h" #define MAX_CPU8 diff --git a/samples/bpf/lathist_user.c b/samples/bpf/lathist_user.c index 6477bad5b4e2..c8e88cc84e61 100644 --- a/samples/bpf/lathist_user.c +++ b/samples/bpf/lathist_user.c @@ -10,7 +10,7 @@ #include #include #include -#include "libbpf.h" +#include #include "bpf_load.h" #define MAX_ENTRIES20 diff --git a/samples/bpf/load_sock_ops.c b/samples/bpf/load_sock_ops.c index e5da6cf71a3e..8ecb41ea0c03 100644 --- a/samples/bpf/load_sock_ops.c +++ b/samples/bpf/load_sock_ops.c @@ -8,7 +8,7 @@ #include #include #include -#include "libbpf.h" +#include #include "bpf_load.h" #include #include diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c index 7fcb94c09112..587b68b1f8dd 100644 --- a/samples/bpf/lwt_len_hist_user.c +++ b/samples/bpf/lwt_len_hist_user.c @@ -9,7 +9,7 @@ #include #include -#include "libbpf.h" +#include #include "bpf_util.h" #define MAX_INDEX 64 diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c index 519d9af4b04a..38b7b1a96cc2 100644 --- a/samples/bpf/map_perf_test_user.c +++ b/samples/bpf/map_perf_test_user.c @@ -21,7 +21,7 @@ #include #include -#include "libbpf.h" +#include #include "bpf_load.h" #define TEST_BIT(t) (1U << (t)) diff --git a/samples/bpf/sock_example.h b/samples/bpf/sock_example.h index 772d5dad8465..a27d7579bc73 100644 --- a/samples/bpf/sock_example.h +++ b/samples/bpf/sock_example.h @@ -9,7 +9,6 @@ #include #include #include -#include "libbpf.h" static inline int open_raw_sock(const char *name) { diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c index 2be935c2627d..93ec01c56104 100644 --- a/samples/bpf/sockex1_user.c +++ b/samples/bpf/sockex1_
[PATCH bpf-next v2 5/5] samples: bpf: make the build less noisy
Building samples with clang ignores the $(Q) setting, always printing full command to the output. Make it less verbose. Signed-off-by: Jakub Kicinski --- samples/bpf/Makefile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 0036a77c2d97..62d1aa1a4cf3 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -244,7 +244,8 @@ $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h # But, there is no easy way to fix it, so just exclude it since it is # useless for BPF samples. $(obj)/%.o: $(src)/%.c - $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \ + @echo " CLANG-bpf " $@ + $(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \ -I$(srctree)/tools/testing/selftests/bpf/ \ -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \ -D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \ -- 2.17.0
[PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf
Hi! Following patches address build issues after recent move to libbpf. For out-of-tree builds we would see the following error: gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or directory libbpf build system is now always invoked explicitly rather than relying on building single objects most of the time. We need to resolve the friction between Kbuild and tools/ build system. Mini-library called libbpf.h in samples is renamed to bpf_insn.h, using linux/filter.h seems not completely trivial since some samples get upset when order on include search path in changed. We do have to rename libbpf.h, however, because otherwise it's hard to reliably get to libbpf's header in out-of-tree builds. v2: - fix the build error harder (patch 3); - add patch 5 (make clang less noisy). Jakub Kicinski (5): samples: bpf: include bpf/bpf.h instead of local libbpf.h samples: bpf: rename libbpf.h to bpf_insn.h samples: bpf: fix build after move to compiling full libbpf.a samples: bpf: move libbpf from object dependencies to libs samples: bpf: make the build less noisy samples/bpf/Makefile | 165 +++--- samples/bpf/{libbpf.h => bpf_insn.h} | 8 +- samples/bpf/bpf_load.c| 2 +- samples/bpf/bpf_load.h| 2 +- samples/bpf/cookie_uid_helper_example.c | 2 +- samples/bpf/cpustat_user.c| 2 +- samples/bpf/fds_example.c | 4 +- samples/bpf/lathist_user.c| 2 +- samples/bpf/load_sock_ops.c | 2 +- samples/bpf/lwt_len_hist_user.c | 2 +- samples/bpf/map_perf_test_user.c | 2 +- samples/bpf/sock_example.c| 3 +- samples/bpf/sock_example.h| 1 - samples/bpf/sockex1_user.c| 2 +- samples/bpf/sockex2_user.c| 2 +- samples/bpf/sockex3_user.c| 2 +- samples/bpf/syscall_tp_user.c | 2 +- samples/bpf/tc_l2_redirect_user.c | 2 +- samples/bpf/test_cgrp2_array_pin.c| 2 +- samples/bpf/test_cgrp2_attach.c | 3 +- samples/bpf/test_cgrp2_attach2.c | 3 +- samples/bpf/test_cgrp2_sock.c | 3 +- samples/bpf/test_cgrp2_sock2.c| 3 +- .../bpf/test_current_task_under_cgroup_user.c | 2 +- samples/bpf/test_lru_dist.c | 2 +- samples/bpf/test_map_in_map_user.c| 2 +- samples/bpf/test_overhead_user.c | 2 +- samples/bpf/test_probe_write_user_user.c | 2 +- samples/bpf/trace_output_user.c | 2 +- samples/bpf/tracex1_user.c| 2 +- samples/bpf/tracex2_user.c| 2 +- samples/bpf/tracex3_user.c| 2 +- samples/bpf/tracex4_user.c| 2 +- samples/bpf/tracex5_user.c| 2 +- samples/bpf/tracex6_user.c| 2 +- samples/bpf/tracex7_user.c| 2 +- samples/bpf/xdp_fwd_user.c| 2 +- samples/bpf/xdp_monitor_user.c| 2 +- samples/bpf/xdp_redirect_cpu_user.c | 2 +- samples/bpf/xdp_redirect_map_user.c | 2 +- samples/bpf/xdp_redirect_user.c | 2 +- samples/bpf/xdp_router_ipv4_user.c| 2 +- samples/bpf/xdp_tx_iptunnel_user.c| 2 +- samples/bpf/xdpsock_user.c| 2 +- 44 files changed, 116 insertions(+), 147 deletions(-) rename samples/bpf/{libbpf.h => bpf_insn.h} (98%) -- 2.17.0
Re: [PATCH net-next] erspan: set bso bit based on mirrored packet's len
On Mon, May 14, 2018 at 04:54:36PM -0700, William Tu wrote: > Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not > handled. BSO has 4 possible values: > 00 --> Good frame with no error, or unknown integrity > 11 --> Payload is a Bad Frame with CRC or Alignment Error > 01 --> Payload is a Short Frame > 10 --> Payload is an Oversized Frame > > Based the short/oversized definitions in RFC1757, the patch sets > the bso bit based on the mirrored packet's size. > > Reported-by: Xiaoyan Jin > Signed-off-by: William Tu > --- > include/net/erspan.h | 25 + > 1 file changed, 25 insertions(+) > > diff --git a/include/net/erspan.h b/include/net/erspan.h > index d044aa60cc76..5eb95f78ad45 100644 > --- a/include/net/erspan.h > +++ b/include/net/erspan.h > @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void) > return htonl((u32)h_usecs); > } > > +/* ERSPAN BSO (Bad/Short/Oversized) > + * 00b --> Good frame with no error, or unknown integrity > + * 01b --> Payload is a Short Frame > + * 10b --> Payload is an Oversized Frame > + * 11b --> Payload is a Bad Frame with CRC or Alignment Error > + */ > +enum erspan_bso { > + BSO_NOERROR, > + BSO_SHORT, > + BSO_OVERSIZED, > + BSO_BAD, > +}; If we are relying on the values perhaps this would be clearer BSO_NOERROR = 0x00, BSO_SHORT = 0x01, BSO_OVERSIZED = 0x02, BSO_BAD = 0x03, > + > +static inline u8 erspan_detect_bso(struct sk_buff *skb) > +{ > + if (skb->len < ETH_ZLEN) > + return BSO_SHORT; > + > + if (skb->len > ETH_FRAME_LEN) > + return BSO_OVERSIZED; > + > + return BSO_NOERROR; > +} Without having much contextual knowledge around this patch; should we be doing some check on CRC or alignment (at some stage)? Having BSO_BAD seems to imply so? Hope this helps, Tobin.
Re: [PATCH net-next] sched: cls: enable verbose logging
On Mon, May 14, 2018 at 1:47 PM, Marcelo Ricardo Leitner wrote: > On Mon, May 14, 2018 at 01:30:53PM -0700, Cong Wang wrote: >> On Sun, May 13, 2018 at 1:44 PM, Marcelo Ricardo Leitner >> wrote: >> > Currently, when the rule is not to be exclusively executed by the >> > hardware, extack is not passed along and offloading failures don't >> > get logged. The idea was that hardware failures are okay because the >> > rule will get executed in software then and this way it doesn't confuse >> > unware users. >> > >> > But this is not helpful in case one needs to understand why a certain >> > rule failed to get offloaded. Considering it may have been a temporary >> > failure, like resources exceeded or so, reproducing it later and knowing >> > that it is triggering the same reason may be challenging. >> >> I fail to understand why you need a flag here, IOW, why not just pass >> extack unconditionally? > > Because (as discussed in the RFC[1], should have linked it here) it > could confuse users that are not aware of offloading and, in other > cases, it can be just noise (like it would be right now for ebpf, > which is mostly used in sw-path). > > 1.https://www.mail-archive.com/netdev@vger.kernel.org/msg223016.html My point is that a TC filter flag should be used for a filter attribute, logging is apparently not a part of filter. At least, put it into HW offloading, not in TC filter. I know DaveM hates module parameters, but a module parameter here is more suitable than a TC filter flag.
Re: [PATCH v1 3/4] media: rc bpf: move ir_raw_event to uapi
On Mon, May 14, 2018 at 2:11 PM, Sean Young wrote: > The context provided to a BPF_PROG_RAWIR_DECODER is a struct ir_raw_event; > ensure user space has a a definition. > > Signed-off-by: Sean Young > --- > include/media/rc-core.h| 19 +-- > include/uapi/linux/bpf_rcdev.h | 24 Patch #2 already referenced this file. So if Patches #1 and #2 applied, there will be a compilation error. Could you re-arrange your patchset so that after sequentially applying each patch, there is no compilation error? > 2 files changed, 25 insertions(+), 18 deletions(-) > create mode 100644 include/uapi/linux/bpf_rcdev.h > > diff --git a/include/media/rc-core.h b/include/media/rc-core.h > index 6742fd86ff65..5d31e31d8ade 100644 > --- a/include/media/rc-core.h > +++ b/include/media/rc-core.h > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > #include > > /** > @@ -299,24 +300,6 @@ void rc_keydown_notimeout(struct rc_dev *dev, enum > rc_proto protocol, > void rc_keyup(struct rc_dev *dev); > u32 rc_g_keycode_from_table(struct rc_dev *dev, u32 scancode); > > -/* > - * From rc-raw.c > - * The Raw interface is specific to InfraRed. It may be a good idea to > - * split it later into a separate header. > - */ > -struct ir_raw_event { > - union { > - u32 duration; > - u32 carrier; > - }; > - u8 duty_cycle; > - > - unsignedpulse:1; > - unsignedreset:1; > - unsignedtimeout:1; > - unsignedcarrier_report:1; > -}; > - > #define DEFINE_IR_RAW_EVENT(event) struct ir_raw_event event = {} > > static inline void init_ir_raw_event(struct ir_raw_event *ev) > diff --git a/include/uapi/linux/bpf_rcdev.h b/include/uapi/linux/bpf_rcdev.h > new file mode 100644 > index ..d8629ff2b960 > --- /dev/null > +++ b/include/uapi/linux/bpf_rcdev.h > @@ -0,0 +1,24 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +/* Copyright (c) 2018 Sean Young > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of version 2 of the GNU General Public > + * License as published by the Free Software Foundation. > + */ > +#ifndef _UAPI__LINUX_BPF_RCDEV_H__ > +#define _UAPI__LINUX_BPF_RCDEV_H__ > + > +struct ir_raw_event { > + union { > + __u32 duration; > + __u32 carrier; > + }; > + __u8duty_cycle; > + > + unsignedpulse:1; > + unsignedreset:1; > + unsignedtimeout:1; > + unsignedcarrier_report:1; > +}; > + > +#endif /* _UAPI__LINUX_BPF_RCDEV_H__ */ > -- > 2.17.0 >
Re: [PATCH v1 2/4] media: bpf: allow raw IR decoder bpf programs to be used
On Mon, May 14, 2018 at 2:10 PM, Sean Young wrote: > This implements attaching, detaching, querying and execution. The target > fd has to be the /dev/lircN device. > > Signed-off-by: Sean Young > --- > drivers/media/rc/ir-bpf-decoder.c | 191 ++ > drivers/media/rc/lirc_dev.c | 30 + > drivers/media/rc/rc-core-priv.h | 15 +++ > drivers/media/rc/rc-ir-raw.c | 5 + > include/uapi/linux/bpf.h | 1 + > kernel/bpf/syscall.c | 7 ++ > 6 files changed, 249 insertions(+) > > diff --git a/drivers/media/rc/ir-bpf-decoder.c > b/drivers/media/rc/ir-bpf-decoder.c > index aaa5e208b1a5..651590a14772 100644 > --- a/drivers/media/rc/ir-bpf-decoder.c > +++ b/drivers/media/rc/ir-bpf-decoder.c > @@ -91,3 +91,194 @@ const struct bpf_verifier_ops ir_decoder_verifier_ops = { > .get_func_proto = ir_decoder_func_proto, > .is_valid_access = ir_decoder_is_valid_access > }; > + > +#define BPF_MAX_PROGS 64 > + > +int rc_dev_bpf_attach(struct rc_dev *rcdev, struct bpf_prog *prog, u32 flags) flags is not used in this function. > +{ > + struct ir_raw_event_ctrl *raw; > + struct bpf_prog_array __rcu *old_array; > + struct bpf_prog_array *new_array; > + int ret; > + > + if (rcdev->driver_type != RC_DRIVER_IR_RAW) > + return -EINVAL; > + > + ret = mutex_lock_interruptible(&rcdev->lock); > + if (ret) > + return ret; > + > + raw = rcdev->raw; > + > + if (raw->progs && bpf_prog_array_length(raw->progs) >= BPF_MAX_PROGS) > { > + ret = -E2BIG; > + goto out; > + } > + > + old_array = raw->progs; > + ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array); > + if (ret < 0) > + goto out; > + > + rcu_assign_pointer(raw->progs, new_array); > + bpf_prog_array_free(old_array); > +out: > + mutex_unlock(&rcdev->lock); > + return ret; > +} > + > +int rc_dev_bpf_detach(struct rc_dev *rcdev, struct bpf_prog *prog, u32 flags) flags is not used in this function. > +{ > + struct ir_raw_event_ctrl *raw; > + struct bpf_prog_array __rcu *old_array; > + struct bpf_prog_array *new_array; > + int ret; > + > + if (rcdev->driver_type != RC_DRIVER_IR_RAW) > + return -EINVAL; > + > + ret = mutex_lock_interruptible(&rcdev->lock); > + if (ret) > + return ret; > + > + raw = rcdev->raw; > + > + old_array = raw->progs; > + ret = bpf_prog_array_copy(old_array, prog, NULL, &new_array); > + if (ret < 0) { > + bpf_prog_array_delete_safe(old_array, prog); > + } else { > + rcu_assign_pointer(raw->progs, new_array); > + bpf_prog_array_free(old_array); > + } > + > + bpf_prog_put(prog); > + mutex_unlock(&rcdev->lock); > + return 0; > +} > + > +void rc_dev_bpf_run(struct rc_dev *rcdev) > +{ > + struct ir_raw_event_ctrl *raw = rcdev->raw; > + > + if (raw->progs) > + BPF_PROG_RUN_ARRAY(raw->progs, &raw->prev_ev, BPF_PROG_RUN); > +} > + > +void rc_dev_bpf_put(struct rc_dev *rcdev) > +{ > + struct bpf_prog_array *progs = rcdev->raw->progs; > + int i, size; > + > + if (!progs) > + return; > + > + size = bpf_prog_array_length(progs); > + for (i = 0; i < size; i++) > + bpf_prog_put(progs->progs[i]); > + > + bpf_prog_array_free(rcdev->raw->progs); > +} > + > +int rc_dev_prog_attach(const union bpf_attr *attr) > +{ > + struct bpf_prog *prog; > + struct rc_dev *rcdev; > + int ret; > + > + if (attr->attach_flags & BPF_F_ALLOW_OVERRIDE) > + return -EINVAL; Looks like you really did not use flags except here. BPF_F_ALLOW_OVERRIDE is originally used for cgroup type of attachment and the comment explicits saying so. In the query below, the flags value "0" is copied to userspace. In your case, I think you can just disallow any value, i.g., attr->attach_flags must be 0, and then you further down check that if the prog is already in the array, you just return an error. > + > + prog = bpf_prog_get_type(attr->attach_bpf_fd, > +BPF_PROG_TYPE_RAWIR_DECODER); > + if (IS_ERR(prog)) > + return PTR_ERR(prog); > + > + rcdev = rc_dev_get_from_fd(attr->target_fd); > + if (IS_ERR(rcdev)) { > + bpf_prog_put(prog); > + return PTR_ERR(rcdev); > + } > + > + ret = rc_dev_bpf_attach(rcdev, prog, attr->attach_flags); > + if (ret) > + bpf_prog_put(prog); > + > + put_device(&rcdev->dev); > + > + return ret; > +} > + > +int rc_dev_prog_detach(const union bpf_attr *attr) > +{ > + struct bpf_prog *prog; > + struct rc_dev *rcdev; > + int ret; > + > + if (attr->attach_flags & BPF_F_ALLOW_OVERRIDE) >
Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
Hi David, On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote: > This patch set fixes a few append and replace uses cases for IPv6 and > adds test cases that codifies the expectations of how append and replace > are expected to work. Nood question: what commit does this apply on top of please. I attempted to apply the set on top of net-next commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx') patch 4 and 5 have merge conflicts (I stopped at 5). thanks, Tobin.
Re: possible deadlock in sk_diag_fill
On Mon, May 14, 2018 at 8:00 PM, Andrei Vagin wrote: >> >> Hello, >> >> >> >> syzbot found the following crash on: >> >> >> >> HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of >> >> git://git.k.. >> >> git tree: upstream >> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c9780 >> >> kernel config: https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27 >> >> dashboard link: >> >> https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669 >> >> compiler: gcc (GCC) 8.0.1 20180413 (experimental) >> >> userspace arch: i386 >> >> >> >> Unfortunately, I don't have any reproducer for this crash yet. >> >> >> >> IMPORTANT: if you fix the bug, please add the following tag to the commit: >> >> Reported-by: syzbot+c1872be62e587eae9...@syzkaller.appspotmail.com >> >> >> >> >> >> == >> >> WARNING: possible circular locking dependency detected >> >> 4.17.0-rc3+ #59 Not tainted >> >> -- >> >> syz-executor1/25282 is trying to acquire lock: >> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons >> >> net/unix/diag.c:82 [inline] >> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: >> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144 >> >> >> >> but task is already holding lock: >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock >> >> include/linux/spinlock.h:310 [inline] >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons >> >> net/unix/diag.c:64 [inline] >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: >> >> sk_diag_fill.isra.5+0x94e/0x10d0 >> >> net/unix/diag.c:144 >> >> >> >> which lock already depends on the new lock. >> > >> > In the code, we have a comment which explains why it is safe to take this >> > lock >> > >> > /* >> > * The state lock is outer for the same sk's >> > * queue lock. With the other's queue locked it's >> > * OK to lock the state. >> > */ >> > unix_state_lock_nested(req); >> > >> > It is a question how to explain this to lockdep. >> >> Do I understand it correctly that (&u->lock)->rlock associated with >> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is >> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I >> think we need to split (&u->lock)->rlock by family too, so that we >> have u->lock-AF_UNIX and u->lock-AF_NETLINK. > > I think here is another problem. lockdep woried about > sk->sk_receive_queue vs unix_sk(s)->lock. > > sk_diag_dump_icons() takes sk->sk_receive_queue and then > unix_sk(s)->lock. > > unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue. > > sk_diag_dump_icons() takes locks for two different sockets, but > unix_dgram_sendmsg() takes locks for one socket. > > sk_diag_dump_icons > if (sk->sk_state == TCP_LISTEN) { > spin_lock(&sk->sk_receive_queue.lock); > skb_queue_walk(&sk->sk_receive_queue, skb) { > unix_state_lock_nested(req); > spin_lock_nested(&unix_sk(s)->lock, > > > unix_dgram_sendmsg > unix_state_lock(other) > spin_lock(&unix_sk(s)->lock) > skb_queue_tail(&other->sk_receive_queue, skb); > spin_lock_irqsave(&list->lock, flags); Do you mean the following? There is socket 1 with state lock (S1) and queue lock (Q2), and socket 2 with state lock (S2) and queue lock (Q2). unix_dgram_sendmsg lock S1->Q1. And sk_diag_dump_icons locks Q1->S2. If yes, then this looks pretty much as deadlock. Consider that 2 unix_dgram_sendmsg in 2 different threads lock S1 and S2 respectively. Now 2 sk_diag_dump_icons in 2 different threads lock Q1 and Q2 respectively. Now sk_diag_dump_icons want to lock S's, and unix_dgram_sendmsg want to lock Q's. Nobody can proceed.
Re: [PATCH v1 1/4] media: rc: introduce BPF_PROG_IR_DECODER
On Mon, May 14, 2018 at 2:10 PM, Sean Young wrote: > Add support for BPF_PROG_IR_DECODER. This type of BPF program can call > rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report > that the last key should be repeated. > > Signed-off-by: Sean Young > --- > drivers/media/rc/Kconfig | 8 +++ > drivers/media/rc/Makefile | 1 + > drivers/media/rc/ir-bpf-decoder.c | 93 +++ > include/linux/bpf_types.h | 3 + > include/uapi/linux/bpf.h | 16 +- > 5 files changed, 120 insertions(+), 1 deletion(-) > create mode 100644 drivers/media/rc/ir-bpf-decoder.c > > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig > index eb2c3b6eca7f..10ad6167d87c 100644 > --- a/drivers/media/rc/Kconfig > +++ b/drivers/media/rc/Kconfig > @@ -120,6 +120,14 @@ config IR_IMON_DECODER >remote control and you would like to use it with a raw IR >receiver, or if you wish to use an encoder to transmit this IR. > > +config IR_BPF_DECODER > + bool "Enable IR raw decoder using BPF" > + depends on BPF_SYSCALL > + depends on RC_CORE=y > + help > + Enable this option to make it possible to load custom IR > + decoders written in BPF. > + > endif #RC_DECODERS > > menuconfig RC_DEVICES > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile > index 2e1c87066f6c..12e1118430d0 100644 > --- a/drivers/media/rc/Makefile > +++ b/drivers/media/rc/Makefile > @@ -5,6 +5,7 @@ obj-y += keymaps/ > obj-$(CONFIG_RC_CORE) += rc-core.o > rc-core-y := rc-main.o rc-ir-raw.o > rc-core-$(CONFIG_LIRC) += lirc_dev.o > +rc-core-$(CONFIG_IR_BPF_DECODER) += ir-bpf-decoder.o > obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o > obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o > obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o > diff --git a/drivers/media/rc/ir-bpf-decoder.c > b/drivers/media/rc/ir-bpf-decoder.c > new file mode 100644 > index ..aaa5e208b1a5 > --- /dev/null > +++ b/drivers/media/rc/ir-bpf-decoder.c > @@ -0,0 +1,93 @@ > +// SPDX-License-Identifier: GPL-2.0 > +// ir-bpf-decoder.c - handles bpf decoders > +// > +// Copyright (C) 2018 Sean Young > + > +#include > +#include > +#include "rc-core-priv.h" > + > +/* > + * BPF interface for raw IR decoder > + */ > +const struct bpf_prog_ops ir_decoder_prog_ops = { > +}; > + > +BPF_CALL_1(bpf_rc_repeat, struct ir_raw_event*, event) > +{ > + struct ir_raw_event_ctrl *ctrl; > + > + ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev); > + > + rc_repeat(ctrl->dev); > + return 0; > +} > + > +static const struct bpf_func_proto rc_repeat_proto = { > + .func = bpf_rc_repeat, > + .gpl_only = true, // rc_repeat is EXPORT_SYMBOL_GPL > + .ret_type = RET_VOID, I suggest using RET_INTEGER here since we do return an integer 0. RET_INTEGER will also make it easy to extend if you want to return a non-zero value for error code or other reasons. > + .arg1_type = ARG_PTR_TO_CTX, > +}; > + > +BPF_CALL_4(bpf_rc_keydown, struct ir_raw_event*, event, u32, protocol, > + u32, scancode, u32, toggle) > +{ > + struct ir_raw_event_ctrl *ctrl; > + > + ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev); > + rc_keydown(ctrl->dev, protocol, scancode, toggle != 0); > + return 0; > +} > + > +static const struct bpf_func_proto rc_keydown_proto = { > + .func = bpf_rc_keydown, > + .gpl_only = true, // rc_keydown is EXPORT_SYMBOL_GPL > + .ret_type = RET_VOID, ditto. RET_INTEGER is preferable. > + .arg1_type = ARG_PTR_TO_CTX, > + .arg2_type = ARG_ANYTHING, > + .arg3_type = ARG_ANYTHING, > + .arg4_type = ARG_ANYTHING, > +}; > + > +static const struct bpf_func_proto *ir_decoder_func_proto(enum bpf_func_id > func_id, const struct bpf_prog *prog) > +{ > + switch (func_id) { > + case BPF_FUNC_rc_repeat: > + return &rc_repeat_proto; > + case BPF_FUNC_rc_keydown: > + return &rc_keydown_proto; > + case BPF_FUNC_map_lookup_elem: > + return &bpf_map_lookup_elem_proto; > + case BPF_FUNC_map_update_elem: > + return &bpf_map_update_elem_proto; > + case BPF_FUNC_map_delete_elem: > + return &bpf_map_delete_elem_proto; > + case BPF_FUNC_ktime_get_ns: > + return &bpf_ktime_get_ns_proto; > + case BPF_FUNC_tail_call: > + return &bpf_tail_call_proto; > + case BPF_FUNC_get_prandom_u32: > + return &bpf_get_prandom_u32_proto; > + default: > + return NULL; > + } > +} > + > +static bool ir_decoder_is_valid_access(int off, int size, > + enum bpf_access_type type, > + const struct bpf_prog *prog, > + struct bpf_insn_access_aux *info
Re: [PATCH net-next 2/3] net: qualcomm: rmnet: Add support for ethtool private stats
Hi Subash, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Updates-2018-05-14/20180515-115355 config: i386-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c: In function 'rmnet_map_checksum_downlink_packet': >> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:382:2: warning: this >> 'else' clause does not guard... [-Wmisleading-indentation] else ^~~~ drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:384:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'else' return -EPROTONOSUPPORT; ^~ vim +/else +382 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 349 350 /* Validates packet checksums. Function takes a pointer to 351 * the beginning of a buffer which contains the IP payload + 352 * padding + checksum trailer. 353 * Only IPv4 and IPv6 are supported along with TCP & UDP. 354 * Fragmented or tunneled packets are not supported. 355 */ 356 int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len) 357 { 358 struct rmnet_priv *priv = netdev_priv(skb->dev); 359 struct rmnet_map_dl_csum_trailer *csum_trailer; 360 361 if (unlikely(!(skb->dev->features & NETIF_F_RXCSUM))) { 362 priv->stats.csum_sw++; 363 return -EOPNOTSUPP; 364 } 365 366 csum_trailer = (struct rmnet_map_dl_csum_trailer *)(skb->data + len); 367 368 if (!csum_trailer->valid) { 369 priv->stats.csum_valid_unset++; 370 return -EINVAL; 371 } 372 373 if (skb->protocol == htons(ETH_P_IP)) 374 return rmnet_map_ipv4_dl_csum_trailer(skb, csum_trailer, priv); 375 else if (skb->protocol == htons(ETH_P_IPV6)) 376 #if IS_ENABLED(CONFIG_IPV6) 377 return rmnet_map_ipv6_dl_csum_trailer(skb, csum_trailer, priv); 378 #else 379 priv->stats.csum_err_invalid_ip_version++; 380 return -EPROTONOSUPPORT; 381 #endif > 382 else 383 priv->stats.csum_err_invalid_ip_version++; 384 return -EPROTONOSUPPORT; 385 386 return 0; 387 } 388 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
[PATCH net] tcp: purge write queue in tcp_connect_init()
syzkaller found a reliable way to crash the host, hitting a BUG() in __tcp_retransmit_skb() Malicous MSG_FASTOPEN is the root cause. We need to purge write queue in tcp_connect_init() at the point we init snd_una/write_seq. This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE() kernel BUG at net/ipv4/tcp_output.c:2837! invalid opcode: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837 RSP: :8801dae06ff8 EFLAGS: 00010206 RAX: 8801b9fe61c0 RBX: ffc18a16 RCX: 864e1a49 RDX: 0100 RSI: 864e2e12 RDI: 0005 RBP: 8801dae073a0 R08: 8801b9fe61c0 R09: ed0039c40dd2 R10: ed0039c40dd2 R11: 8801ce206e93 R12: 421eeaad R13: 8801ce206d4e R14: 8801ce206cc0 R15: 8801cd4f4a80 FS: () GS:8801dae0(0063) knlGS:096bc900 CS: 0010 DS: 002b ES: 002b CR0: 80050033 CR2: 2000 CR3: 0001c47b6000 CR4: 001406f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923 tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488 tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573 tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326 expire_timers kernel/time/timer.c:1363 [inline] __run_timers+0x79e/0xc50 kernel/time/timer.c:1666 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 [inline] irq_exit+0x1d1/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:525 [inline] smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863 Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)") Signed-off-by: Eric Dumazet Cc: Yuchung Cheng Cc: Neal Cardwell Reported-by: syzbot --- net/ipv4/tcp_output.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 383cac0ff0ec059ca7dbc1a6304cc7f8183e008d..d07e34f8e3091144976358674b92458076f92bfb 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2833,8 +2833,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) return -EBUSY; if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { - if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) - BUG(); + if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) { + WARN_ON_ONCE(1); + return -EINVAL; + } if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) return -ENOMEM; } @@ -3342,6 +3344,7 @@ static void tcp_connect_init(struct sock *sk) sock_reset_flag(sk, SOCK_DONE); tp->snd_wnd = 0; tcp_init_wl(tp, 0); + tcp_write_queue_purge(sk); tp->snd_una = tp->write_seq; tp->snd_sml = tp->write_seq; tp->snd_up = tp->write_seq; -- 2.17.0.441.gb46fe60e1d-goog
Re: [PATCH iproute2] ip: do not drop capabilities if net_admin=i is set
On Fri, 11 May 2018 13:39:56 +0100 Luca Boccassi wrote: > Users have reported a regression due to ip now dropping capabilities > unconditionally. > zerotier-one VPN and VirtualBox use ambient capabilities in their > binary and then fork out to ip to set routes and links, and this > does not work anymore. > > As a workaround, do not drop caps if CAP_NET_ADMIN (the most common > capability used by ip) is set with the INHERITABLE flag. > Users that want ip vrf exec to work do not need to set INHERITABLE, > which will then only set when the calling program had privileges to > give itself the ambient capability. > > Fixes: ba2fc55b99f8 ("Drop capabilities if not running ip exec vrf with > libcap") > > Signed-off-by: Luca Boccassi Applied
mounting NFS on the same host leads to D state
Hi, Recently I have tested NFS and exportfs scenario, that NFS server and client are in the same host. And I found mounting NFS filesystm onto the same host can lead to rpc.mountd and related task become D state. My kernel version is based on 3.10, and I find 4.15 has the same appearance. My test step as below: 1)create dir. mkdir -p /home/test1 /home/test2 2)share dir /home/test1 echo '/home/test1 localhost(rw,all_squash,anonuid=0,anongid=0)' > /etc/exports 3)exportfs exportfs -vr || echo "Failed to export /home/test1" 4)mount NFS. mount localhost:/home/test1 /home/test2 -o vers=3,soft 5)share dir /home/test2 echo '/home/test2 *(rw,all_squash,anonuid=0,anongid=0)' >> /etc/exports 6)exportfs exportfs -vr 7) list /home/test2 ls /home/test2 then we found ls command is hung, ls and rpc.mountd became "D" state, and after 180 second ls command return. Another scenario as below: 1)create dir. mkdir -p /home/test3 /home/test4 2)share dir /home/test3 echo '/home/test3 localhost(rw,sync,no_wdelay,anonuid=0,anongid=0,no_subtree_check)' > /etc/exports 3)exportfs exportfs -r 4)to see NFS status showmount -e localhost 5)mount NFS mount -t nfs4 -o proto=tcp,nolock,soft,timeo=50 localhost:/home/test3 /home/test4 6) stop nfs service,and and check ls task state is D. service nfs stop ls /home/test4 ls command is hung and became D state. I wonder to know is it reasonable about these test scenario because NFS server and client are in the same host? Since some task went into D state, is there any reason about this? and is there any patch to fix this issue? Here is a link to talk about NFS mounting on the same host, https://lwn.net/Articles/595652/
Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
On 4/29/18 7:13 PM, David Ahern wrote: > > The idea here is to fast pass packets that fit a supported profile and > are to be forwarded. Everything else should continue up the stack as it > has wider capabilities. The helper and XDP programs should make no > assumptions on what the broader kernel and userspace might be monitoring > or want to do with packets that can not be forwarded in the fast path. > This is very similar to hardware forwarding when it punts packets to the > CPU for control plane assistance. > Thinking about this some more and how to return more information to the bpf program about the FIB lookup. bpf_fib_lookup struct is 64-bytes. It can not be expanded without hurting performance. I could do another union on an input parameter and return flags indicating why the returned index is 0. Something like this: diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 360a1168c353..75591522444c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2314,6 +2314,12 @@ struct bpf_raw_tracepoint_args { #define BPF_FIB_LOOKUP_DIRECT BIT(0) #define BPF_FIB_LOOKUP_OUTPUT BIT(1) +#define BPF_FIB_LKUP_RET_NO_FWD BIT(0) /* pkt is not fwded */ +#define BPF_FIB_LKUP_RET_UNSUPP_LWT BIT(1) /* fwd requires unsupp encap */ +#define BPF_FIB_LKUP_RET_NO_NHDEVBIT(2) /* nh device does not exist */ +#define BPF_FIB_LKUP_RET_NO_NEIGHBIT(3) /* no neigh entry for nh */ +#define BPF_FIB_LKUP_RET_FRAG_NEEDED BIT(4) /* pkt too big to fwd */ + struct bpf_fib_lookup { /* input */ __u8family; /* network family, AF_INET, AF_INET6, AF_MPLS */ @@ -2325,7 +2331,11 @@ struct bpf_fib_lookup { /* total length of packet from network header - used for MTU check */ __u16 tot_len; - __u32 ifindex; /* L3 device index for lookup */ + + union { + __u32 ifindex; /* in: L3 device index for lookup */ + __u32 ret_flags; /* out: BPF_FIB_LOOKUP_RET flags */ + } union { /* inputs to lookup */ Similarly for the fib result, it could be returned with a union on say family: union { __u8 family; /* in: network family, AF_INET, AF_INET6, AF_MPLS */ __u8 rt_type; /* out: FIB lookup route type */ }; Then if the fib result is -EINVAL/-EHOSTUNREACH/-EACCES, rt_type is set to RTN_BLACKHOLE/RTN_UNREACHABLE/RTN_PROHIBIT allowing the XDP program to make an informed decision on dropping the packet. To avoid performance hits on the forwarding path, these return values would *only* set if the ifindex returned is 0.
Re: [PATCH] lib/rhashtable: reorder some inititalization sequences
On Mon, May 14, 2018 at 10:52:13PM -0400, David Miller wrote: > From: Davidlohr Bueso > Date: Mon, 14 May 2018 08:13:32 -0700 > > > rhashtable_init() allocates memory at the very end of the > > call, once everything is setup; with the exception of the > > nelems parameter. However, unless the user is doing something > > bogus with params for which -EINVAL is returned, memory > > allocation is the only operation that can trigger the call > > to fail. > > > > Thus move bucket_table_alloc() up such that we fail back to > > the caller asap, instead of doing useless checks. This is > > safe as the the table allocation isn't using the halfly > > setup 'ht' structure and bucket_table_alloc() call chain only > > ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD. > > > > Also move the locking initialization down to the end. > > > > Signed-off-by: Davidlohr Bueso > > The user potentially "doing something bogus" is why the most > expensive part of the initialization (the memory allocation) > is done after everything else is validated. > > I think it's best to keep things as-is. I agree. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH ghak81 RFC V2 3/5] audit: use inline function to get audit context
On 2018-05-14 23:05, Richard Guy Briggs wrote: > On 2018-05-14 17:44, Paul Moore wrote: > > On Sat, May 12, 2018 at 9:58 PM, Richard Guy Briggs wrote: > > > Recognizing that the audit context is an internal audit value, use an > > > access function to retrieve the audit context pointer for the task > > > rather than reaching directly into the task struct to get it. > > > > > > Signed-off-by: Richard Guy Briggs > > > --- > > > include/linux/audit.h| 14 ++-- > > > include/net/xfrm.h | 2 +- > > > kernel/audit.c | 6 ++-- > > > kernel/audit_watch.c | 2 +- > > > kernel/auditsc.c | 64 > > > +--- > > > net/bridge/netfilter/ebtables.c | 2 +- > > > net/core/dev.c | 2 +- > > > net/netfilter/x_tables.c | 2 +- > > > net/netlabel/netlabel_user.c | 2 +- > > > security/integrity/ima/ima_api.c | 2 +- > > > security/integrity/integrity_audit.c | 2 +- > > > security/lsm_audit.c | 2 +- > > > security/selinux/hooks.c | 4 +-- > > > security/selinux/selinuxfs.c | 6 ++-- > > > security/selinux/ss/services.c | 12 +++ > > > 15 files changed, 64 insertions(+), 60 deletions(-) > > > > Merged, but there was some fuzz due to the missing 1/5 patch and a > > handfull of checkpatch.pl fixes. Please take a look at the commit in > > the audit/next branch and if anything looks awry please send a patch > > to fix it. > > Some of that fuzz was due to the two patches (ghak46/47) that went > through the xelinux tree... There will be a merge conflict. > > Otherwise, looks ok. Spoke too soon, missed one from the new seccomp actions_logged... Patch pending... > > paul moore > > - RGB - RGB -- Richard Guy Briggs Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
Re: [RFC bpf-next 11/11] Documentation: Describe bpf reference tracking
On Wed, May 09, 2018 at 02:07:09PM -0700, Joe Stringer wrote: > Signed-off-by: Joe Stringer > --- > Documentation/networking/filter.txt | 64 > + > 1 file changed, 64 insertions(+) > > diff --git a/Documentation/networking/filter.txt > b/Documentation/networking/filter.txt > index 5032e1263bc9..77be17977bc5 100644 > --- a/Documentation/networking/filter.txt > +++ b/Documentation/networking/filter.txt > @@ -1125,6 +1125,14 @@ pointer type. The types of pointers describe their > base, as follows: > PTR_TO_STACKFrame pointer. > PTR_TO_PACKET skb->data. > PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. > +PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly > refcounted. > +PTR_TO_SOCKET_OR_NULL > +Either a pointer to a socket, or NULL; socket lookup > +returns this type, which becomes a PTR_TO_SOCKET when > +checked != NULL. PTR_TO_SOCKET is reference-counted, > +so programs must release the reference through the > +socket release function before the end of the > program. > +Arithmetic on these pointers is forbidden. > However, a pointer may be offset from this base (as a result of pointer > arithmetic), and this is tracked in two parts: the 'fixed offset' and > 'variable > offset'. The former is used when an exactly-known value (e.g. an immediate > @@ -1168,6 +1176,13 @@ over the Ethernet header, then reads IHL and addes > (IHL * 4), the resulting > pointer will have a variable offset known to be 4n+2 for some n, so adding > the 2 > bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses > through > that pointer are safe. > +The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, > common > +to all copies of the pointer returned from a socket lookup. This has similar > +behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but > +it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly > +represents a reference to the corresponding 'struct sock'. To ensure that the > +reference is not leaked, it is imperative to NULL-check the reference and in > +the non-NULL case, and pass the valid reference to the socket release > function. > > Direct packet access > > @@ -1441,6 +1456,55 @@ Error: >8: (7a) *(u64 *)(r0 +0) = 1 >R0 invalid mem access 'imm' > > +Program that performs a socket lookup then sets the pointer to NULL without > +checking it: > +value: > + BPF_MOV64_IMM(BPF_REG_2, 0), > + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), > + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), > + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), > + BPF_MOV64_IMM(BPF_REG_3, 4), > + BPF_MOV64_IMM(BPF_REG_4, 0), > + BPF_MOV64_IMM(BPF_REG_5, 0), > + BPF_EMIT_CALL(BPF_FUNC_sk_lookup), > + BPF_MOV64_IMM(BPF_REG_0, 0), > + BPF_EXIT_INSN(), > +Error: > + 0: (b7) r2 = 0 > + 1: (63) *(u32 *)(r10 -8) = r2 > + 2: (bf) r2 = r10 > + 3: (07) r2 += -8 > + 4: (b7) r3 = 4 > + 5: (b7) r4 = 0 > + 6: (b7) r5 = 0 > + 7: (85) call bpf_sk_lookup#65 > + 8: (b7) r0 = 0 > + 9: (95) exit > + Unreleased reference id=1, alloc_insn=7 > + > +Program that performs a socket lookup but does not NULL-check the returned > +value: > + BPF_MOV64_IMM(BPF_REG_2, 0), > + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), > + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), > + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), > + BPF_MOV64_IMM(BPF_REG_3, 4), > + BPF_MOV64_IMM(BPF_REG_4, 0), > + BPF_MOV64_IMM(BPF_REG_5, 0), > + BPF_EMIT_CALL(BPF_FUNC_sk_lookup), > + BPF_EXIT_INSN(), > +Error: > + 0: (b7) r2 = 0 > + 1: (63) *(u32 *)(r10 -8) = r2 > + 2: (bf) r2 = r10 > + 3: (07) r2 += -8 > + 4: (b7) r3 = 4 > + 5: (b7) r4 = 0 > + 6: (b7) r5 = 0 > + 7: (85) call bpf_sk_lookup#65 > + 8: (95) exit > + Unreleased reference id=1, alloc_insn=7 Nice. Thank you for updating this doc. We haven't touched it in long time. It probably long overdue for complete overhaul.
Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
On Fri, May 11, 2018 at 05:54:33PM -0700, Joe Stringer wrote: > On 11 May 2018 at 14:41, Martin KaFai Lau wrote: > > On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote: > >> On 10 May 2018 at 22:00, Martin KaFai Lau wrote: > >> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote: > >> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF > >> >> programs to find out if there is a socket listening on this host, and > >> >> returns a socket pointer which the BPF program can then access to > >> >> determine, for instance, whether to forward or drop traffic. sk_lookup() > >> >> takes a reference on the socket, so when a BPF program makes use of this > >> >> function, it must subsequently pass the returned pointer into the newly > >> >> added sk_release() to return the reference. > >> >> > >> >> By way of example, the following pseudocode would filter inbound > >> >> connections at XDP if there is no corresponding service listening for > >> >> the traffic: > >> >> > >> >> struct bpf_sock_tuple tuple; > >> >> struct bpf_sock_ops *sk; > >> >> > >> >> populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet > >> >> sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0); > >> >> if (!sk) { > >> >> // Couldn't find a socket listening for this traffic. Drop. > >> >> return TC_ACT_SHOT; > >> >> } > >> >> bpf_sk_release(sk, 0); > >> >> return TC_ACT_OK; > >> >> > >> >> Signed-off-by: Joe Stringer > >> >> --- > >> > >> ... > >> > >> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto > >> >> bpf_skb_get_xfrm_state_proto = { > >> >> }; > >> >> #endif > >> >> > >> >> +struct sock * > >> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) { > >> > Would it be possible to have another version that > >> > returns a sk without taking its refcnt? > >> > It may have performance benefit. > >> > >> Not really. The sockets are not RCU-protected, and established sockets > >> may be torn down without notice. If we don't take a reference, there's > >> no guarantee that the socket will continue to exist for the duration > >> of running the BPF program. > >> > >> From what I follow, the comment below has a hidden implication which > >> is that sockets without SOCK_RCU_FREE, eg established sockets, may be > >> directly freed regardless of RCU. > > Right, SOCK_RCU_FREE sk is the one I am concern about. > > For example, TCP_LISTEN socket does not require taking a refcnt > > now. Doing a bpf_sk_lookup() may have a rather big > > impact on handling TCP syn flood. or the usual intention > > is to redirect instead of passing it up to the stack? > > I see, if you're only interested in listen sockets then probably this > series could be extended with a new flag, eg something like > BPF_F_SK_FIND_LISTENERS which restricts the set of possible sockets > found to only listen sockets, then the implementation would call into > __inet_lookup_listener() instead of inet_lookup(). The presence of > that flag in the relevant register during CALL instruction would show > that the verifier should not reference-track the result, then there'd > need to be a check on the release to ensure that this unreferenced > socket is never released. Just a thought, completely untested and I > could still be missing some detail.. > > That said, I don't completely follow how you would expect to handle > the traffic for sockets that are already established - the helper > would no longer find those sockets, so you wouldn't know whether to > pass the traffic up the stack for established traffic or not. I think Martin has a valid concern here that if somebody starts using this helper on the rx traffic the bpf program (via these two new helpers) will be doing refcnt++ and refcnt-- even for listener sockets which will cause synflood to suffer. One can argue that this is bpf author mistake, but without fixes (and api changes) to the helper the programmer doesn't really have a way of avoiding this situation. Also udp sockets don't need refcnt at all. How about we split this single helper into three: - bpf_sk_lookup_tcp_established() that will return refcnt-ed socket and has to be bpf_sk_release()d by the program. - bpf_sk_lookup_tcp_listener() that doesn't refcnt, since progs run in rcu. - bpf_sk_lookup_udp() that also doesn't refcnt. The logic you want to put into this helper can be easily replicated with these three helpers and the whole thing will be much faster. Thoughts?
Re: [PATCH net-next v3 0/3] sctp: Introduce sctp_flush_ctx
From: Marcelo Ricardo Leitner Date: Mon, 14 May 2018 14:35:17 -0300 > This struct will hold all the context used during the outq flush, so we > don't have to pass lots of pointers all around. > > Checked on x86_64, the compiler inlines all these functions and there is no > derreference added because of the struct. > > This patchset depends on 'sctp: refactor sctp_outq_flush' > > Changes since v1: > - updated to build on top of v2 of 'sctp: refactor sctp_outq_flush' > > Changes since v2: > - fixed a rebase issue which reverted a change in patch 2. > - rebased on v3 of 'sctp: refactor sctp_outq_flush' Also applied, thanks Marcelo.
Re: [PATCH ghak81 RFC V2 3/5] audit: use inline function to get audit context
On 2018-05-14 17:44, Paul Moore wrote: > On Sat, May 12, 2018 at 9:58 PM, Richard Guy Briggs wrote: > > Recognizing that the audit context is an internal audit value, use an > > access function to retrieve the audit context pointer for the task > > rather than reaching directly into the task struct to get it. > > > > Signed-off-by: Richard Guy Briggs > > --- > > include/linux/audit.h| 14 ++-- > > include/net/xfrm.h | 2 +- > > kernel/audit.c | 6 ++-- > > kernel/audit_watch.c | 2 +- > > kernel/auditsc.c | 64 > > +--- > > net/bridge/netfilter/ebtables.c | 2 +- > > net/core/dev.c | 2 +- > > net/netfilter/x_tables.c | 2 +- > > net/netlabel/netlabel_user.c | 2 +- > > security/integrity/ima/ima_api.c | 2 +- > > security/integrity/integrity_audit.c | 2 +- > > security/lsm_audit.c | 2 +- > > security/selinux/hooks.c | 4 +-- > > security/selinux/selinuxfs.c | 6 ++-- > > security/selinux/ss/services.c | 12 +++ > > 15 files changed, 64 insertions(+), 60 deletions(-) > > Merged, but there was some fuzz due to the missing 1/5 patch and a > handfull of checkpatch.pl fixes. Please take a look at the commit in > the audit/next branch and if anything looks awry please send a patch > to fix it. Some of that fuzz was due to the two patches (ghak46/47) that went through the xelinux tree... There will be a merge conflict. Otherwise, looks ok. > paul moore - RGB -- Richard Guy Briggs Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
Re: [RFC bpf-next 06/11] bpf: Add reference tracking to verifier
On Wed, May 09, 2018 at 02:07:04PM -0700, Joe Stringer wrote: > Allow helper functions to acquire a reference and return it into a > register. Specific pointer types such as the PTR_TO_SOCKET will > implicitly represent such a reference. The verifier must ensure that > these references are released exactly once in each path through the > program. > > To achieve this, this commit assigns an id to the pointer and tracks it > in the 'bpf_func_state', then when the function or program exits, > verifies that all of the acquired references have been freed. When the > pointer is passed to a function that frees the reference, it is removed > from the 'bpf_func_state` and all existing copies of the pointer in > registers are marked invalid. > > Signed-off-by: Joe Stringer > --- > include/linux/bpf_verifier.h | 18 ++- > kernel/bpf/verifier.c| 295 > --- > 2 files changed, 292 insertions(+), 21 deletions(-) > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h > index 9dcd87f1d322..8dbee360b3ec 100644 > --- a/include/linux/bpf_verifier.h > +++ b/include/linux/bpf_verifier.h > @@ -104,6 +104,11 @@ struct bpf_stack_state { > u8 slot_type[BPF_REG_SIZE]; > }; > > +struct bpf_reference_state { > + int id; > + int insn_idx; /* allocation insn */ the insn_idx is for more verbose messages, right? It doesn't seem to affect the safety of algorithm. Please add a comment to clarify that. > +}; > + > /* state of the program: > * type of all registers and stack info > */ > @@ -122,7 +127,9 @@ struct bpf_func_state { >*/ > u32 subprogno; > > - /* should be second to last. See copy_func_state() */ > + /* The following fields should be last. See copy_func_state() */ > + int acquired_refs; > + struct bpf_reference_state *refs; > int allocated_stack; > struct bpf_stack_state *stack; > }; > @@ -218,11 +225,16 @@ void bpf_verifier_vlog(struct bpf_verifier_log *log, > const char *fmt, > __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env, > const char *fmt, ...); > > -static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) > +static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env) > { > struct bpf_verifier_state *cur = env->cur_state; > > - return cur->frame[cur->curframe]->regs; > + return cur->frame[cur->curframe]; > +} > + > +static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) > +{ > + return cur_func(env)->regs; > } > > int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env); > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index f426ebf2b6bf..92b9a5dc465a 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -1,5 +1,6 @@ > /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com > * Copyright (c) 2016 Facebook > + * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io > * > * This program is free software; you can redistribute it and/or > * modify it under the terms of version 2 of the GNU General Public > @@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const > bpf_verifier_ops[] = { > * > * After the call R0 is set to return type of the function and registers > R1-R5 > * are set to NOT_INIT to indicate that they are no longer readable. > + * > + * The following reference types represent a potential reference to a kernel > + * resource which, after first being allocated, must be checked and freed by > + * the BPF program: > + * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET > + * > + * When the verifier sees a helper call return a reference type, it > allocates a > + * pointer id for the reference and stores it in the current function state. > + * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into > + * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the > type > + * passes through a NULL-check conditional. For the branch wherein the state > is > + * changed to CONST_IMM, the verifier releases the reference. > */ > > /* verifier_state + insn_idx are pushed to stack when branch is encountered > */ > @@ -229,7 +242,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type) > > static bool reg_type_may_be_null(enum bpf_reg_type type) > { > - return type == PTR_TO_MAP_VALUE_OR_NULL; > + return type == PTR_TO_MAP_VALUE_OR_NULL || > +type == PTR_TO_SOCKET_OR_NULL; > +} > + > +static bool type_is_refcounted(enum bpf_reg_type type) > +{ > + return type == PTR_TO_SOCKET; > +} > + > +static bool type_is_refcounted_or_null(enum bpf_reg_type type) > +{ > + return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL; > +} > + > +static bool reg_is_refcounted(const struct bpf_reg_state *reg) > +{ > + return type_is_refcounted(reg->type); > +} > + > +static bool reg_is_refcounted_or_null(const struct bpf_re
Re: [PATCH net-next v3 0/8] sctp: refactor sctp_outq_flush
From: Marcelo Ricardo Leitner Date: Mon, 14 May 2018 14:34:35 -0300 > Currently sctp_outq_flush does many different things and arguably > unrelated, such as doing transport selection and outq dequeueing. > > This patchset refactors it into smaller and more dedicated functions. > The end behavior should be the same. > > The next patchset will rework the function parameters. > > Changes since v1: > - fix build issues on patches 3 and 4, and updated 5 and 8 because of > it. > > Changes since v2: > - fixed panic if building with just up to patch 3 applied Series applied, thanks.
Re: [PATCH] lib/rhashtable: reorder some inititalization sequences
From: Davidlohr Bueso Date: Mon, 14 May 2018 08:13:32 -0700 > rhashtable_init() allocates memory at the very end of the > call, once everything is setup; with the exception of the > nelems parameter. However, unless the user is doing something > bogus with params for which -EINVAL is returned, memory > allocation is the only operation that can trigger the call > to fail. > > Thus move bucket_table_alloc() up such that we fail back to > the caller asap, instead of doing useless checks. This is > safe as the the table allocation isn't using the halfly > setup 'ht' structure and bucket_table_alloc() call chain only > ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD. > > Also move the locking initialization down to the end. > > Signed-off-by: Davidlohr Bueso The user potentially "doing something bogus" is why the most expensive part of the initialization (the memory allocation) is done after everything else is validated. I think it's best to keep things as-is.
[PATCH RFC net-next 3/7] selftests: fib_tests: Add success-fail counts
As more tests are added, it is convenient to have a tally at the end. Signed-off-by: David Ahern --- tools/testing/selftests/net/fib_tests.sh | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 9164e60d4b66..7e2291161e15 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -18,8 +18,10 @@ log_test() if [ ${rc} -eq ${expected} ]; then printf "TEST: %-60s [ OK ]\n" "${msg}" + nsuccess=$((nsuccess+1)) else ret=1 + nfail=$((nfail+1)) printf "TEST: %-60s [FAIL]\n" "${msg}" if [ "${PAUSE_ON_FAIL}" = "yes" ]; then echo @@ -598,4 +600,9 @@ cleanup &> /dev/null fib_test +if [ "$TESTS" != "none" ]; then + printf "\nTests passed: %3d\n" ${nsuccess} + printf "Tests failed: %3d\n" ${nfail} +fi + exit $ret -- 2.11.0
[PATCH RFC net-next 5/7] selftests: fib_tests: Add option to pause after each test
Add option to pause after each test before cleanup is done. Allows user to do manual inspection or more ad-hoc testing after each test with the setup in tact. Signed-off-by: David Ahern --- tools/testing/selftests/net/fib_tests.sh | 13 + 1 file changed, 13 insertions(+) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 8c99f0689efc..12b648826151 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -9,6 +9,7 @@ ret=0 TESTS="unregister down carrier nexthop" VERBOSE=0 PAUSE_ON_FAIL=no +PAUSE=no IP="ip -netns testns" log_test() @@ -31,6 +32,13 @@ log_test() [ "$a" = "q" ] && exit 1 fi fi + + if [ "${PAUSE}" = "yes" ]; then + echo + echo "hit enter to continue, 'q' to quit" + read a + [ "$a" = "q" ] && exit 1 + fi } setup() @@ -576,6 +584,7 @@ usage: ${0##*/} OPTS -tTest(s) to run (default: all) (options: $TESTS) -p Pause on fail +-P Pause after each test before cleanup -v verbose mode (show commands and output) EOF } @@ -588,6 +597,7 @@ do case $o in t) TESTS=$OPTARG;; p) PAUSE_ON_FAIL=yes;; + P) PAUSE=yes;; v) VERBOSE=$(($VERBOSE + 1));; h) usage; exit 0;; *) usage; exit 1;; @@ -596,6 +606,9 @@ done PEER_CMD="ip netns exec ${PEER_NS}" +# make sure we don't pause twice +[ "${PAUSE}" = "yes" ] && PAUSE_ON_FAIL=no + if [ "$(id -u)" -ne 0 ];then echo "SKIP: Need root privileges" exit 0 -- 2.11.0
[PATCH RFC net-next 2/7] net/ipv6: Simplify route replace and appending into multipath route
Remove rt6_qualify_for_ecmp which is just guess work. It fails in 2 cases: 1. can not replace a route with a reject route. Existing code appends a new route instead of replacing the existing one. 2. can not have a multipath route where a leg uses a dev only nexthop Existing use cases affected by this change: 1. adding a route with existing prefix and metric using NLM_F_CREATE without NLM_F_APPEND or NLM_F_EXCL (ie., what iproute2 calls 'prepend'). Existing code auto-determines that the new nexthop can be appended to an existing route to create a multipath route. This change breaks that by requiring the APPEND flag for the new route to be added to an existing one. Instead the prepend just adds another route entry. 2. route replace. Existing code replaces first matching multipath route if new route is multipath capable and fallback to first matching non-ECMP route (reject or dev only route) in case one isn't available. New behavior replaces first matching route. (Thanks to Ido for spotting this one) Note: Newer iproute2 is needed to display multipath routes with a dev-only nexthop. This is due to a bug in iproute2 and parsing nexthops. Signed-off-by: David Ahern --- include/net/ip6_route.h | 6 -- net/ipv6/ip6_fib.c | 157 ++-- net/ipv6/route.c| 3 +- 3 files changed, 73 insertions(+), 93 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 4cf1ef935ed9..9e4d0f0aeb6d 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -66,12 +66,6 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr) (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK); } -static inline bool rt6_qualify_for_ecmp(const struct fib6_info *f6i) -{ - return (f6i->fib6_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) == - RTF_GATEWAY; -} - void ip6_route_input(struct sk_buff *skb); struct dst_entry *ip6_route_input_lookup(struct net *net, struct net_device *dev, diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index f0a4262a4789..f57ec3bc12d1 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -927,19 +927,19 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, { struct fib6_info *leaf = rcu_dereference_protected(fn->leaf, lockdep_is_held(&rt->fib6_table->tb6_lock)); - struct fib6_info *iter = NULL; + struct fib6_info *iter = NULL, *match = NULL; struct fib6_info __rcu **ins; - struct fib6_info __rcu **fallback_ins = NULL; int replace = (info->nlh && (info->nlh->nlmsg_flags & NLM_F_REPLACE)); + int append = (info->nlh && + (info->nlh->nlmsg_flags & NLM_F_APPEND)); int add = (!info->nlh || (info->nlh->nlmsg_flags & NLM_F_CREATE)); int found = 0; - bool rt_can_ecmp = rt6_qualify_for_ecmp(rt); u16 nlflags = NLM_F_EXCL; int err; - if (info->nlh && (info->nlh->nlmsg_flags & NLM_F_APPEND)) + if (append) nlflags |= NLM_F_APPEND; ins = &fn->leaf; @@ -961,13 +961,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, nlflags &= ~NLM_F_EXCL; if (replace) { - if (rt_can_ecmp == rt6_qualify_for_ecmp(iter)) { - found++; - break; - } - if (rt_can_ecmp) - fallback_ins = fallback_ins ?: ins; - goto next_iter; + found++; + break; } if (rt6_duplicate_nexthop(iter, rt)) { @@ -982,86 +977,67 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu); return -EEXIST; } - /* If we have the same destination and the same metric, -* but not the same gateway, then the route we try to -* add is sibling to this route, increment our counter -* of siblings, and later we will add our route to the -* list. -* Only static routes (which don't have flag -* RTF_EXPIRES) are used for ECMPv6. -* -* To avoid long list, we only had siblings if the -* route have a gateway. -*/ - if (rt_can_ecmp && - rt6_qualify_
[PATCH RFC net-next 6/7] selftests: fib_tests: Add ipv6 route add append replace tests
Add IPv6 route tests covering add, append and replace permutations. Assumes the ability to add a basic single path route works; this is required for example when adding an address to an interface. $ fib_tests.sh -t ipv6_rt IPv6 route add / append tests TEST: Attempt to add duplicate route - gw [ OK ] TEST: Attempt to add duplicate route - dev only [ OK ] TEST: Attempt to add duplicate route - reject route [ OK ] TEST: Add new nexthop for existing prefix [ OK ] TEST: Append leg to existing route - gw [ OK ] TEST: Append leg to existing route - dev only [ OK ] TEST: Append leg to existing route - reject route [ OK ] TEST: Append leg to existing reject route - gw [ OK ] TEST: Append leg to existing reject route - dev only[ OK ] TEST: add multipath route [ OK ] TEST: Attempt to add duplicate multipath route [ OK ] TEST: route add with different metrics [ OK ] TEST: route delete with metric [ OK ] IPv6 route replace tests TEST: single path with single path [ OK ] TEST: single path with multipath[ OK ] TEST: single path with reject route [ OK ] TEST: single path with single path via multipath attribute [ OK ] TEST: invalid nexthop [ OK ] TEST: single path - replace of non-existent route [ OK ] TEST: multipath with multipath [ OK ] TEST: multipath with single path[ OK ] TEST: multipath with single path via multipath attribute[ OK ] TEST: multipath with reject route [ OK ] TEST: multipath - invalid first nexthop [ OK ] TEST: multipath - invalid second nexthop[ OK ] TEST: multipath - replace of non-existent route [ OK ] Signed-off-by: David Ahern --- tools/testing/selftests/net/fib_tests.sh | 331 ++- 1 file changed, 330 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 12b648826151..20cb1d2c4e72 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -6,7 +6,7 @@ ret=0 -TESTS="unregister down carrier nexthop" +TESTS="unregister down carrier nexthop ipv6_rt" VERBOSE=0 PAUSE_ON_FAIL=no PAUSE=no @@ -574,6 +574,334 @@ fib_nexthop_test() } +# Tests on route add and replace + +run_cmd() +{ + local cmd="$1" + local out + local stderr="2>/dev/null" + + if [ "$VERBOSE" = "1" ]; then + printf "COMMAND: $cmd\n" + stderr= + fi + + out=$(eval $cmd $stderr) + rc=$? + if [ "$VERBOSE" = "1" -a -n "$out" ]; then + echo "$out" + fi + + [ "$VERBOSE" = "1" ] && echo + + return $rc +} + +# add route for a prefix, flushing any existing routes first +# expected to be the first step of a test +add_route6() +{ + local pfx="$1" + local nh="$2" + local out + + if [ "$VERBOSE" = "1" ]; then + echo + echo "##" + echo + fi + + run_cmd "$IP -6 ro flush ${pfx}" + [ $? -ne 0 ] && exit 1 + + out=$($IP -6 ro ls match ${pfx}) + if [ -n "$out" ]; then + echo "Failed to flush routes for prefix used for tests." + exit 1 + fi + + run_cmd "$IP -6 ro add ${pfx} ${nh}" + if [ $? -ne 0 ]; then + echo "Failed to add initial route for test." + exit 1 + fi +} + +# add initial route - used in replace route tests +add_initial_route6() +{ + add_route6 "2001:db8:104::/64" "$1" +} + +check_route6() +{ + local pfx="2001:db8:104::/64" + local expected="$1" + local out + local rc=0 + + out=$($IP -6 ro ls match ${pfx} | sed -e 's/ pref medium//') + if [ -z "${out}" ]; then + if [ "$VERBOSE" = "1" ]; then + printf "\nNo route entry found\n" + printf "Expected:\n" + printf "${expected}\n" + fi + return 1 + fi + + # tricky way to convert output to 1-line without ip's + # messy '\'; this drops all extra white space + out=$(ech
[PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
This patch set fixes a few append and replace uses cases for IPv6 and adds test cases that codifies the expectations of how append and replace are expected to work. In paricular it allows a multipath route to have a dev-only nexthop, something Thomas tried to accomplish with commit edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6") which had to be reverted because of breakage, and to replace an existing FIB entry with a reject route. There are a number of inconsistent and surprising aspects to the Linux API for adding, deleting, replacing and changing FIB entries. For example, with IPv4 NLM_F_APPEND means insert the route after any existing entries with the same key (prefix + priority + TOS for IPv4) and NLM_F_CREATE without the append flag inserts the new route before any existing entries. IPv6 on the other hand attempts to guess whether a new route should be appended to an existing one, possibly creating a multipath route, or to add a new entry after any existing ones. This applies to both the 'append' (NLM_F_CREATE + NLM_F_APPEND) and 'prepend' (NLM_F_CREATE only) cases meaning for IPv6 the NLM_F_APPEND is basically ignored. This guessing whether the route should be added to a multipath route (gateway routes) or inserted after existing entries (non-gateway based routes) means a multipath route can not have a dev only nexthop (potentially required in some cases - tunnels or VRF route leaking for example) and route 'replace' is a bit adhoc treating gateway based routes and dev-only / reject routes differently. This has led to frustration with developers working on routing suites such as FRR where workarounds such as delete and add. After this patch set there are 2 differences between IPv4 and IPv6: 1. 'ip ro prepend' = NLM_F_CREATE only IPv4 adds the new route before any existing ones IPv6 adds new route after any existing ones 2. 'ip ro append' = NLM_F_CREATE|NLM_F_APPEND IPv4 adds the new route after any existing ones IPv6 adds the nexthop to existing routes converting to multipath For the former, there are cases where we want same prefix routes added after existing ones (e.g., multicast, prefix routes for macvlan when used for virtual router redundancy). Requiring the APPEND flag to add a new route to an existing one helps here but is a slight change in behavior since prepend with gateway routes now create a separate entry. For the latter IPv6 behavior is preferred - appending a route for the same prefix and metric to make a multipath route, so really IPv4 not allowing an existing route to be updated is the limiter. This will be fixed when nexthops become separate objects - a future patch set. Thank you to Thomas and Ido for testing earlier versions of this set, and to Ido for providing an update to the mlxsw driver. David Ahern (7): mlxsw: spectrum_router: Add support for route append net/ipv6: Simplify appending route into multipath route selftests: fib_tests: Add success-fail counts selftests: fib_tests: Add command line options selftests: fib_tests: Add option to pause after each test selftests: fib_tests: Add ipv6 route add append replace tests selftests: fib_tests: Add ipv4 route add append replace tests .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 + include/net/ip6_route.h| 6 - net/ipv6/ip6_fib.c | 157 +++-- net/ipv6/route.c | 3 +- tools/testing/selftests/net/fib_tests.sh | 673 - 5 files changed, 737 insertions(+), 104 deletions(-) -- 2.11.0
[PATCH RFC net-next 1/7] mlxsw: spectrum_router: Add support for route append
Handle append for gateway based routes. Dev-only multipath routes will be handled by a follow on patch. Signed-off-by: Ido Schimmel Signed-off-by: David Ahern --- drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c index 8028d221aece..77b2adb29341 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c @@ -5725,6 +5725,7 @@ static void mlxsw_sp_router_fib6_event_work(struct work_struct *work) switch (fib_work->event) { case FIB_EVENT_ENTRY_REPLACE: /* fall through */ + case FIB_EVENT_ENTRY_APPEND: /* fall through */ case FIB_EVENT_ENTRY_ADD: replace = fib_work->event == FIB_EVENT_ENTRY_REPLACE; err = mlxsw_sp_router_fib6_add(mlxsw_sp, @@ -5831,6 +5832,7 @@ static void mlxsw_sp_router_fib6_event(struct mlxsw_sp_fib_event_work *fib_work, switch (fib_work->event) { case FIB_EVENT_ENTRY_REPLACE: /* fall through */ + case FIB_EVENT_ENTRY_APPEND: /* fall through */ case FIB_EVENT_ENTRY_ADD: /* fall through */ case FIB_EVENT_ENTRY_DEL: fen6_info = container_of(info, struct fib6_entry_notifier_info, -- 2.11.0
[PATCH RFC net-next 4/7] selftests: fib_tests: Add command line options
Add command line options for controlling pause on fail, controlling specific tests to run and verbose mode rather than relying on environment variables. Signed-off-by: David Ahern --- tools/testing/selftests/net/fib_tests.sh | 53 1 file changed, 40 insertions(+), 13 deletions(-) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 7e2291161e15..8c99f0689efc 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -6,8 +6,9 @@ ret=0 -VERBOSE=${VERBOSE:=0} -PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no} +TESTS="unregister down carrier nexthop" +VERBOSE=0 +PAUSE_ON_FAIL=no IP="ip -netns testns" log_test() @@ -565,20 +566,36 @@ fib_nexthop_test() } -# +# usage -fib_test() +usage() { - if [ -n "$TEST" ]; then - eval $TEST - else - fib_unreg_test - fib_down_test - fib_carrier_test - fib_nexthop_test - fi + cat < Test(s) to run (default: all) +(options: $TESTS) +-p Pause on fail +-v verbose mode (show commands and output) +EOF } + +# main + +while getopts :t:pPhv o +do + case $o in + t) TESTS=$OPTARG;; + p) PAUSE_ON_FAIL=yes;; + v) VERBOSE=$(($VERBOSE + 1));; + h) usage; exit 0;; + *) usage; exit 1;; + esac +done + +PEER_CMD="ip netns exec ${PEER_NS}" + if [ "$(id -u)" -ne 0 ];then echo "SKIP: Need root privileges" exit 0 @@ -598,7 +615,17 @@ fi # start clean cleanup &> /dev/null -fib_test +for t in $TESTS +do + case $t in + fib_unreg_test|unregister) fib_unreg_test;; + fib_down_test|down) fib_down_test;; + fib_carrier_test|carrier) fib_carrier_test;; + fib_nexthop_test|nexthop) fib_nexthop_test;; + + help) echo "Test names: $TESTS"; exit 0;; + esac +done if [ "$TESTS" != "none" ]; then printf "\nTests passed: %3d\n" ${nsuccess} -- 2.11.0
[PATCH RFC net-next 7/7] selftests: fib_tests: Add ipv4 route add append replace tests
Add IPv4 route tests covering add, append and replace permutations. Assumes the ability to add a basic single path route works; this is required for example when adding an address to an interface. $ fib_tests.sh -t ipv4_rt IPv4 route add / append tests TEST: Attempt to add duplicate route - gw [ OK ] TEST: Attempt to add duplicate route - dev only [ OK ] TEST: Attempt to add duplicate route - reject route [ OK ] TEST: Add new nexthop for existing prefix [ OK ] TEST: Append leg to existing route - gw [ OK ] TEST: Append leg to existing route - dev only [ OK ] TEST: Append leg to existing route - reject route [ OK ] TEST: Append leg to existing reject route - gw [ OK ] TEST: Append leg to existing reject route - dev only[ OK ] TEST: add multipath route [ OK ] TEST: Attempt to add duplicate multipath route [ OK ] TEST: route add with different metrics [ OK ] TEST: route delete with metric [ OK ] IPv4 route replace tests TEST: single path with single path [ OK ] TEST: single path with multipath[ OK ] TEST: single path with reject route [ OK ] TEST: single path with single path via multipath attribute [ OK ] TEST: invalid nexthop [ OK ] TEST: single path - replace of non-existent route [ OK ] TEST: multipath with multipath [ OK ] TEST: multipath with single path[ OK ] TEST: multipath with single path via multipath attribute[ OK ] TEST: multipath with reject route [ OK ] TEST: multipath - invalid first nexthop [ OK ] TEST: multipath - invalid second nexthop[ OK ] TEST: multipath - replace of non-existent route [ OK ] Signed-off-by: David Ahern --- tools/testing/selftests/net/fib_tests.sh | 277 ++- 1 file changed, 276 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 20cb1d2c4e72..19248453a2ba 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -6,7 +6,7 @@ ret=0 -TESTS="unregister down carrier nexthop ipv6_rt" +TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt" VERBOSE=0 PAUSE_ON_FAIL=no PAUSE=no @@ -702,6 +702,12 @@ route_setup() $IP -6 addr add 2001:db8:103::2/64 dev veth4 $IP -6 addr add 2001:db8:104::1/64 dev dummy1 + $IP addr add 172.16.101.1/24 dev veth1 + $IP addr add 172.16.101.2/24 dev veth2 + $IP addr add 172.16.103.1/24 dev veth3 + $IP addr add 172.16.103.2/24 dev veth4 + $IP addr add 172.16.104.1/24 dev dummy1 + set +ex } @@ -901,6 +907,274 @@ ipv6_route_test() route_cleanup } +# add route for a prefix, flushing any existing routes first +# expected to be the first step of a test +add_route() +{ + local pfx="$1" + local nh="$2" + local out + + if [ "$VERBOSE" = "1" ]; then + echo + echo "##" + echo + fi + + run_cmd "$IP ro flush ${pfx}" + [ $? -ne 0 ] && exit 1 + + out=$($IP ro ls match ${pfx}) + if [ -n "$out" ]; then + echo "Failed to flush routes for prefix used for tests." + exit 1 + fi + + run_cmd "$IP ro add ${pfx} ${nh}" + if [ $? -ne 0 ]; then + echo "Failed to add initial route for test." + exit 1 + fi +} + +# add initial route - used in replace route tests +add_initial_route() +{ + add_route "172.16.104.0/24" "$1" +} + +check_route() +{ + local pfx="172.16.104.0/24" + local expected="$1" + local out + local rc=0 + + out=$($IP ro ls match ${pfx}) + if [ -z "${out}" ]; then + if [ "$VERBOSE" = "1" ]; then + printf "\nNo route entry found\n" + printf "Expected:\n" + printf "${expected}\n" + fi + return 1 + fi + + # tricky way to convert output to 1-line without ip's + # messy '\'; this drops all extra white space + out=$(echo ${out}) + if [ "${out}" != "${expected}" ]; then + rc=1 + if [ "${VERBOSE}" = "1" ]; then +
Re: [PATCH net-next] cxgb4: add tc flower match support for tunnel VNI
From: Ganesh Goudar Date: Mon, 14 May 2018 17:51:21 +0530 > From: Kumar Sanghvi > > Adds support for matching flows based on tunnel VNI value. > Introduces fw APIs for allocating/removing MPS entries related > to encapsulation. And uses the same while adding/deleting filters > for offloading flows based on tunnel VNI match. > > Signed-off-by: Kumar Sanghvi > Signed-off-by: Ganesh Goudar Applied, thank you.
Re: [PATCH net 2/2] vmxnet3: use DMA memory barriers where required
From: Date: Mon, 14 May 2018 08:14:49 -0400 > The gen bits must be read first from (resp. written last to) DMA memory. > The proper way to enforce this on Linux is to call dma_rmb() (resp. > dma_wmb()). > > Signed-off-by: Regis Duchesne > Acked-by: Ronak Doshi Applied and queued up for -stable.
Re: [PATCH net 1/2] vmxnet3: set the DMA mask before the first DMA map operation
From: Date: Mon, 14 May 2018 08:28:26 -0400 > The DMA mask must be set before, not after, the first DMA map operation, or > the first DMA map operation could in theory fail on some systems. > > Fixes: b0eb57cb97e78 ("VMXNET3: Add support for virtual IOMMU") > Signed-off-by: Regis Duchesne > Acked-by: Ronak Doshi Applied and queued up for -stable.
Re: [PATCH net] cxgb4: Correct ntuple mask validation for hash filters
From: Ganesh Goudar Date: Mon, 14 May 2018 16:27:34 +0530 > From: Kumar Sanghvi > > Earlier code of doing bitwise AND with field width bits was wrong. > Instead, simplify code to calculate ntuple_mask based on supplied > fields and then compare with mask configured in hw - which is the > correct and simpler way to validate ntuple mask. > > Fixes: 3eb8b62d5a26 ("cxgb4: add support to create hash-filters via tc-flower > offload") > Signed-off-by: Kumar Sanghvi > Signed-off-by: Ganesh Goudar Applied and queued up for -stable.
Re: [PATCH net-next] net: stmmac: Add Jose Abreu as co-maintainer
From: Jose Abreu Date: Mon, 14 May 2018 10:29:56 +0100 > I'm offering to be a co-maintainer for stmmac driver. > > As per discussion with Alexandre, I will arranje to get STM32 boards to > test patches in GMAC version 3.x and 4.1. I also have HW to test GMAC > version 5. > > Looking forward to contribute to net-dev! > > Signed-off-by: Jose Abreu Applied with commit message typo fixed.
Re: [RFC bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type
On Wed, May 09, 2018 at 02:07:02PM -0700, Joe Stringer wrote: > Teach the verifier a little bit about a new type of pointer, a > PTR_TO_SOCKET. This pointer type is accessed from BPF through the > 'struct bpf_sock' structure. > > Signed-off-by: Joe Stringer > --- > include/linux/bpf.h | 19 +- > include/linux/bpf_verifier.h | 2 ++ > kernel/bpf/verifier.c| 86 > ++-- > net/core/filter.c| 30 +--- > 4 files changed, 114 insertions(+), 23 deletions(-) Ack for patches 1-3. In this one few nits: > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index a38e474bf7ee..a03b4b0edcb6 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -136,7 +136,7 @@ enum bpf_arg_type { > /* the following constraints used to prototype bpf_memcmp() and other >* functions that access data on eBPF program stack >*/ > - ARG_PTR_TO_MEM, /* pointer to valid memory (stack, packet, map > value) */ > + ARG_PTR_TO_MEM, /* pointer to valid memory (stack, packet, map > value, socket) */ I don't see where in this patch this change happens... > ARG_PTR_TO_MEM_OR_NULL, /* pointer to valid memory or NULL */ > ARG_PTR_TO_UNINIT_MEM, /* pointer to memory does not need to be > initialized, >* helper function must fill all bytes or clear > @@ -148,6 +148,7 @@ enum bpf_arg_type { > > ARG_PTR_TO_CTX, /* pointer to context */ > ARG_ANYTHING, /* any (initialized) argument is ok */ > + ARG_PTR_TO_SOCKET, /* pointer to bpf_sock */ > }; > > /* type of values returned from helper functions */ > @@ -155,6 +156,7 @@ enum bpf_return_type { > RET_INTEGER,/* function returns integer */ > RET_VOID, /* function doesn't return anything */ > RET_PTR_TO_MAP_VALUE_OR_NULL, /* returns a pointer to map elem value > or NULL */ > + RET_PTR_TO_SOCKET_OR_NULL, /* returns a pointer to a socket or > NULL */ > }; > > /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF > programs > @@ -205,6 +207,8 @@ enum bpf_reg_type { > PTR_TO_PACKET_META, /* skb->data - meta_len */ > PTR_TO_PACKET, /* reg points to skb->data */ > PTR_TO_PACKET_END, /* skb->data + headlen */ > + PTR_TO_SOCKET, /* reg points to struct bpf_sock */ > + PTR_TO_SOCKET_OR_NULL, /* reg points to struct bpf_sock or NULL */ > }; > > /* The information passed from prog-specific *_is_valid_access > @@ -326,6 +330,11 @@ const struct bpf_func_proto > *bpf_get_trace_printk_proto(void); > > typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src, > unsigned long off, unsigned long len); > +typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type, > + const struct bpf_insn *src, > + struct bpf_insn *dst, > + struct bpf_prog *prog, > + u32 *target_size); > > u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 > meta_size, >void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy); > @@ -729,4 +738,12 @@ extern const struct bpf_func_proto > bpf_sock_map_update_proto; > void bpf_user_rnd_init_once(void); > u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); > > +bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type, > + struct bpf_insn_access_aux *info); > +u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, > + const struct bpf_insn *si, > + struct bpf_insn *insn_buf, > + struct bpf_prog *prog, > + u32 *target_size); > + > #endif /* _LINUX_BPF_H */ > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h > index a613b52ce939..9dcd87f1d322 100644 > --- a/include/linux/bpf_verifier.h > +++ b/include/linux/bpf_verifier.h > @@ -57,6 +57,8 @@ struct bpf_reg_state { >* offset, so they can share range knowledge. >* For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we >* came from, when one is tested for != NULL. > + * For PTR_TO_SOCKET this is used to share which pointers retain the > + * same reference to the socket, to determine proper reference freeing. >*/ > u32 id; > /* Ordering of fields matters. See states_equal() */ > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 1b31b805dea4..d38c7c1e9da6 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const > bpf_verifier_ops[] = { > * (like pointer plus pointer becomes SCALA
Re: [PATCH v1 2/4] media: bpf: allow raw IR decoder bpf programs to be used
Hi Sean, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.17-rc5] [cannot apply to next-20180514] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Sean-Young/media-rc-introduce-BPF_PROG_IR_DECODER/20180515-093234 config: i386-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> drivers/media/rc/lirc_dev.c:32:10: fatal error: linux/bpf-rcdev.h: No such >> file or directory #include ^~~ compilation terminated. -- >> kernel/bpf/syscall.c:30:10: fatal error: linux/bpf-rcdev.h: No such file or >> directory #include ^~~ compilation terminated. vim +32 drivers/media/rc/lirc_dev.c 31 > 32 #include 33 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH bpf-next v2 0/8] Minor follow-up cleanups in BPF JITs and optimized imm emission
On Mon, May 14, 2018 at 11:22:26PM +0200, Daniel Borkmann wrote: > This series follows up mostly with with some minor cleanups on top > of 'Move ld_abs/ld_ind to native BPF' as well as implements better > 32/64 bit immediate load into register and saves tail call init on > cBPF for the arm64 JIT. Last but not least we add a couple of test > cases. For details please see individual patches. Thanks! > > v1 -> v2: > - Minor fix in i64_i16_blocks() to remove 24 shift. > - Added last two patches. > - Added Acks from prior round. Applied, thanks.
Re: [PATCH bpf-next v2 8/8] bpf: add ld64 imm test cases
On Mon, May 14, 2018 at 2:22 PM, Daniel Borkmann wrote: > Add test cases where we combine semi-random imm values, mainly for testing > JITs when they have different encoding options for 64 bit immediates in > order to reduce resulting image size. > > Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song
Re: [PATCH v1 3/4] media: rc bpf: move ir_raw_event to uapi
Hi Sean, I love your patch! Perhaps something to improve: [auto build test WARNING on linus/master] [also build test WARNING on v4.17-rc5] [cannot apply to next-20180514] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Sean-Young/media-rc-introduce-BPF_PROG_IR_DECODER/20180515-093234 config: i386-tinyconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): >> ./usr/include/linux/bpf_rcdev.h:13: found __[us]{8,16,32,64} type without >> #include --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH v4 1/1] drivers core: multi-threading device shutdown
Hi Andy, Thank you for your comments. I will send an updated patch soon. My replies are below: On 05/14/2018 04:04 PM, Andy Shevchenko wrote: > Can we still preserve an order here? (Yes, even if the entire list is > not fully ordered) > In the context I see it would go before netdevice.h. Sure, I will move kthread.h. >> +static struct device * >> +device_get_child_by_index(struct device *parent, int index) >> +{ >> + struct klist_iter i; >> + struct device *dev = NULL, *d; >> + int child_index = 0; >> + >> + if (!parent->p || index < 0) >> + return NULL; >> + >> + klist_iter_init(&parent->p->klist_children, &i); >> + while ((d = next_device(&i))) { >> + if (child_index == index) { >> + dev = d; >> + break; >> + } >> + child_index++; >> + } >> + klist_iter_exit(&i); >> + >> + return dev; >> +} > > This can be implemented as a subfunction to device_find_child(), can't it be? Yes, but that would make it very inefficient to search for an index in a list via function pointer call. > >> +/** > > Hmm... Why it's marked as kernel doc while it's just a plain comment? > Same applies to the rest of similar comments. Fixed this, thanks! > >> + for (i = 0; i < children_count; i++) { >> + if (device_shutdown_serial) { >> + device_shutdown_child_task(&tdata); >> + } else { >> + kthread_run(device_shutdown_child_task, >> + &tdata, "device_shutdown.%s", >> + dev_name(dev)); >> + } >> + } > > Can't we just use device_for_each_child() instead? No, at least without doing some memory allocation. Notice in this loop we are not traversing through children, instead we are starting number of children threads, and each thread finds a child to work on. Otherwise we would have to pass child pointer via argument, and we would need to keep that argument in some memory. Pavel
Re: [PATCH 05/14] net: sched: always take reference to action
Hi Vlad, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net/master] [also build test WARNING on v4.17-rc5 next-20180514] [cannot apply to net-next/master] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Vlad-Buslov/Modify-action-API-for-implementing-lockless-actions/20180515-025420 reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@expected struct tc_cookie [noderef] *__ret @@ got [noderef] *__ret @@ net/sched/act_api.c:71:15:expected struct tc_cookie [noderef] *__ret net/sched/act_api.c:71:15:got struct tc_cookie *new_cookie net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@expected struct tc_cookie *old @@got struct tc_cookie [noderef] *[assigned] __ret >> net/sched/act_api.c:287:6: sparse: symbol '__tcf_idr_check' was not >> declared. Should it be static? net/sched/act_api.c:144:48: sparse: dereference of noderef expression Please review and possibly fold the followup patch. --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[RFC PATCH] net: sched: __tcf_idr_check() can be static
Fixes: 446adedb5339 ("net: sched: always take reference to action") Signed-off-by: Fengguang Wu --- act_api.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 9459cce..27e80cf 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -284,8 +284,8 @@ int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb, } EXPORT_SYMBOL(tcf_generic_walker); -bool __tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, -int bind) +static bool __tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a, + int bind) { struct tcf_idrinfo *idrinfo = tn->idrinfo; struct tc_action *p;
Re: [kbuild-all] [PATCH net-next 2/2] sctp: add sctp_make_op_error_limited and reuse inner functions
On 05/14, Marcelo Ricardo Leitner wrote: >On Mon, May 14, 2018 at 07:47:20PM +0800, Ye Xiaolong wrote: >> On 05/14, Marcelo Ricardo Leitner wrote: >> >On Mon, May 14, 2018 at 03:40:53PM +0800, Ye Xiaolong wrote: >> >> >> config: x86_64-randconfig-x006-201817 (attached as .config) >> >> >> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 >> >> >> reproduce: >> >> >> # save the attached .config to linux build tree >> >> >> make ARCH=x86_64 >> >> >> >> >> >> All errors (new ones prefixed by >>): >> >> >> >> >> >>net//sctp/sm_make_chunk.c: In function 'sctp_make_op_error_limited': >> >> >> >> net//sctp/sm_make_chunk.c:1260:9: error: implicit declaration of >> >> >> >> function 'sctp_mtu_payload'; did you mean 'sctp_do_peeloff'? >> >> >> >> [-Werror=implicit-function-declaration] >> >> >> size = sctp_mtu_payload(sp, size, sizeof(struct sctp_errhdr)); >> >> >> ^~~~ >> >> >> sctp_do_peeloff >> >> >>cc1: some warnings being treated as errors >> >> > >> >> >Seems the test didn't pick up the MTU refactor patchset yet. >> >> >> >> Do you mean your patchset require MTU refactor patchset as prerequisites? >> > >> >Yes. >> >> Then it is recommended to use '--base' option of git format-patch, it would >> record >> the base tree info in the first patch or cover letter, 0day bot would apply >> your >> patchset to right base according to it. > >Nice. I wasn't aware of it. Thanks. > >Considering that the MTU refactor patchset was already applied on >net-next when the bot did the test, why should I have to specify the >base? Could you share me the subjects or commits of MTU refactor patcheset, I'll double check what was wrong. Thanks, Xiaolong > > Marcelo
Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
Hello, Mikulas. On Tue, Apr 24, 2018 at 02:41:47PM -0400, Mikulas Patocka wrote: > > > On Tue, 24 Apr 2018, Matthew Wilcox wrote: > > > On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote: > > > > > > > > > On Mon, 23 Apr 2018, Matthew Wilcox wrote: > > > > > > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote: > > > > > Some bugs (such as buffer overflows) are better detected > > > > > with kmalloc code, so we must test the kmalloc path too. > > > > > > > > Well now, this brings up another item for the collective TODO list -- > > > > implement redzone checks for vmalloc. Unless this is something already > > > > taken care of by kasan or similar. > > > > > > The kmalloc overflow testing is also not ideal - it rounds the size up to > > > the next slab size and detects buffer overflows only at this boundary. > > > > > > Some times ago, I made a "kmalloc guard" patch that places a magic number > > > immediatelly after the requested size - so that it can detect overflows > > > at > > > byte boundary > > > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html ) > > > > > > That patch found a bug in crypto code: > > > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html ) > > > > Is it still worth doing this, now we have kasan? > > The kmalloc guard has much lower overhead than kasan. I skimm at your code and it requires rebuilding the kernel. I think that if rebuilding is required as the same with the KASAN, using the KASAN is better since it has far better coverage for detection the bug. However, I think that if the redzone can be setup tightly without rebuild, it would be worth implementing. I have an idea to implement it only for the SLUB. Could I try it? (I'm asking this because I'm inspired from the above patch.) :) Or do you wanna try it? Thanks.
Re: [PATCH bpf-next] selftests/bpf: make sure build-id is on
On Mon, May 14, 2018 at 5:11 PM, Alexei Starovoitov wrote: > --build-id may not be a default linker config. > Make sure it's used when linking urandom_read test program. > Otherwise test_stacktrace_build_id[_nmi] tests will be failling. > > Signed-off-by: Alexei Starovoitov Tested and the change looks good. Acked-by: Yonghong Song > --- > tools/testing/selftests/bpf/Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/bpf/Makefile > b/tools/testing/selftests/bpf/Makefile > index 438d4f93875b..133ebc68cbe4 100644 > --- a/tools/testing/selftests/bpf/Makefile > +++ b/tools/testing/selftests/bpf/Makefile > @@ -19,7 +19,7 @@ all: $(TEST_CUSTOM_PROGS) > $(TEST_CUSTOM_PROGS): urandom_read > > urandom_read: urandom_read.c > - $(CC) -o $(TEST_CUSTOM_PROGS) -static $< > + $(CC) -o $(TEST_CUSTOM_PROGS) -static $< -Wl,--build-id > > # Order correspond to 'make run_tests' order > TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map > test_progs \ > -- > 2.9.5 >
[PATCH bpf-next] selftests/bpf: make sure build-id is on
--build-id may not be a default linker config. Make sure it's used when linking urandom_read test program. Otherwise test_stacktrace_build_id[_nmi] tests will be failling. Signed-off-by: Alexei Starovoitov --- tools/testing/selftests/bpf/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 438d4f93875b..133ebc68cbe4 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -19,7 +19,7 @@ all: $(TEST_CUSTOM_PROGS) $(TEST_CUSTOM_PROGS): urandom_read urandom_read: urandom_read.c - $(CC) -o $(TEST_CUSTOM_PROGS) -static $< + $(CC) -o $(TEST_CUSTOM_PROGS) -static $< -Wl,--build-id # Order correspond to 'make run_tests' order TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ -- 2.9.5
[PATCH net-next] erspan: set bso bit based on mirrored packet's len
Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not handled. BSO has 4 possible values: 00 --> Good frame with no error, or unknown integrity 11 --> Payload is a Bad Frame with CRC or Alignment Error 01 --> Payload is a Short Frame 10 --> Payload is an Oversized Frame Based the short/oversized definitions in RFC1757, the patch sets the bso bit based on the mirrored packet's size. Reported-by: Xiaoyan Jin Signed-off-by: William Tu --- include/net/erspan.h | 25 + 1 file changed, 25 insertions(+) diff --git a/include/net/erspan.h b/include/net/erspan.h index d044aa60cc76..5eb95f78ad45 100644 --- a/include/net/erspan.h +++ b/include/net/erspan.h @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void) return htonl((u32)h_usecs); } +/* ERSPAN BSO (Bad/Short/Oversized) + * 00b --> Good frame with no error, or unknown integrity + * 01b --> Payload is a Short Frame + * 10b --> Payload is an Oversized Frame + * 11b --> Payload is a Bad Frame with CRC or Alignment Error + */ +enum erspan_bso { + BSO_NOERROR, + BSO_SHORT, + BSO_OVERSIZED, + BSO_BAD, +}; + +static inline u8 erspan_detect_bso(struct sk_buff *skb) +{ + if (skb->len < ETH_ZLEN) + return BSO_SHORT; + + if (skb->len > ETH_FRAME_LEN) + return BSO_OVERSIZED; + + return BSO_NOERROR; +} + static inline void erspan_build_header_v2(struct sk_buff *skb, u32 id, u8 direction, u16 hwid, bool truncate, bool is_ipv4) @@ -248,6 +272,7 @@ static inline void erspan_build_header_v2(struct sk_buff *skb, vlan_tci = ntohs(qp->tci); } + bso = erspan_detect_bso(skb); skb_push(skb, sizeof(*ershdr) + ERSPAN_V2_MDSIZE); ershdr = (struct erspan_base_hdr *)skb->data; memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V2_MDSIZE); -- 2.7.4
Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather
On 05/14/2018 04:30 PM, Willem de Bruijn wrote: > I don't quite follow. The reported crash happens in the protocol layer, > because of this check. With pagedlen we have not allocated > sufficient space for the skb_put. > > if (!(rt->dst.dev->features&NETIF_F_SG)) { > unsigned int off; > > off = skb->len; > if (getfrag(from, skb_put(skb, copy), > offset, copy, off, skb) < 0) { > __skb_trim(skb, off); > err = -EFAULT; > goto error; > } > } else { > int i = skb_shinfo(skb)->nr_frags; > > Are you referring to a separate potential issue in the gso layer? > If a bonding device advertises SG, but a slave does not, then > skb_segment on the slave should build linear segs? I have not > tested that. Given that the device attribute could change under us, we need to not crash, even if initially we thought NETIF_F_SG was available. Unless you want to hold RTNL in UDP xmit :) Ideally, GSO should be always on, as we did for TCP. Otherwise, I can guarantee syzkaller will hit again.
Re: [PATCH 01/14] net: sched: use rcu for action cookie update
Hi Vlad, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net/master] [also build test WARNING on v4.17-rc5 next-20180514] [cannot apply to net-next/master] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Vlad-Buslov/Modify-action-API-for-implementing-lockless-actions/20180515-025420 reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) >> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different >> address spaces) @@expected struct tc_cookie [noderef] *__ret @@ >> got [noderef] *__ret @@ net/sched/act_api.c:71:15:expected struct tc_cookie [noderef] *__ret net/sched/act_api.c:71:15:got struct tc_cookie *new_cookie >> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different >> address spaces) @@expected struct tc_cookie *old @@got struct >> tc_cookie [noderef] *[assigned] __ret >> net/sched/act_api.c:132:48: sparse: dereference of noderef expression vim +71 net/sched/act_api.c 65 66 static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie, 67struct tc_cookie *new_cookie) 68 { 69 struct tc_cookie *old; 70 > 71 old = xchg(old_cookie, new_cookie); 72 if (old) 73 call_rcu(&old->rcu, tcf_free_cookie_rcu); 74 } 75 76 /* XXX: For standalone actions, we don't need a RCU grace period either, because 77 * actions are always connected to filters and filters are already destroyed in 78 * RCU callbacks, so after a RCU grace period actions are already disconnected 79 * from filters. Readers later can not find us. 80 */ 81 static void free_tcf(struct tc_action *p) 82 { 83 free_percpu(p->cpu_bstats); 84 free_percpu(p->cpu_qstats); 85 86 tcf_set_action_cookie(&p->act_cookie, NULL); 87 if (p->goto_chain) 88 tcf_action_goto_chain_fini(p); 89 90 kfree(p); 91 } 92 93 static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p) 94 { 95 spin_lock_bh(&idrinfo->lock); 96 idr_remove(&idrinfo->action_idr, p->tcfa_index); 97 spin_unlock_bh(&idrinfo->lock); 98 gen_kill_estimator(&p->tcfa_rate_est); 99 free_tcf(p); 100 } 101 102 int __tcf_idr_release(struct tc_action *p, bool bind, bool strict) 103 { 104 int ret = 0; 105 106 ASSERT_RTNL(); 107 108 if (p) { 109 if (bind) 110 p->tcfa_bindcnt--; 111 else if (strict && p->tcfa_bindcnt > 0) 112 return -EPERM; 113 114 p->tcfa_refcnt--; 115 if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) { 116 if (p->ops->cleanup) 117 p->ops->cleanup(p); 118 tcf_idr_remove(p->idrinfo, p); 119 ret = ACT_P_DELETED; 120 } 121 } 122 123 return ret; 124 } 125 EXPORT_SYMBOL(__tcf_idr_release); 126 127 static size_t tcf_action_shared_attrs_size(const struct tc_action *act) 128 { 129 u32 cookie_len = 0; 130 131 if (act->act_cookie) > 132 cookie_len = nla_total_size(act->act_cookie->len); 133 134 return nla_total_size(0) /* action number nested */ 135 + nla_total_size(IFNAMSIZ) /* TCA_ACT_KIND */ 136 + cookie_len /* TCA_ACT_COOKIE */ 137 + nla_total_size(0) /* TCA_ACT_STATS nested */ 138 /* TCA_STATS_BASIC */ 139 + nla_total_size_64bit(sizeof(struct gnet_stats_basic)) 140 /* TCA_STATS_QUEUE */ 141 + nla_total_size_64bit(sizeof(struct gnet_stats_queue)) 142 + nla_total_size(0) /* TCA_OPTIONS nested */ 143 + nla_total_size(sizeof(struct tcf_t)); /* TCA_GACT_TM */ 144 } 145 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather
On Mon, May 14, 2018 at 7:12 PM, Eric Dumazet wrote: > > > On 05/14/2018 04:07 PM, Willem de Bruijn wrote: >> From: Willem de Bruijn >> >> Paged allocation stores most payload in skb frags. This helps udp gso >> by avoiding copying from the gso skb to segment skb in skb_segment. >> >> But without scatter-gather, data must be linear, so do not use paged >> mode unless NETIF_F_SG. >> >> Fixes: 15e36f5b8e98 ("udp: paged allocation with gso") >> Reported-by: Sean Tranchetti >> Signed-off-by: Willem de Bruijn >> --- >> net/ipv4/ip_output.c | 2 +- >> net/ipv6/ip6_output.c | 2 +- >> 2 files changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c >> index b5e21eb198d8..b38731d8a44f 100644 >> --- a/net/ipv4/ip_output.c >> +++ b/net/ipv4/ip_output.c >> @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk, >> >> exthdrlen = !skb ? rt->dst.header_len : 0; >> mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize; >> - paged = !!cork->gso_size; >> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); >> >> if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP && >> sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) >> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c >> index 7f4493080df6..35a940b9f208 100644 >> --- a/net/ipv6/ip6_output.c >> +++ b/net/ipv6/ip6_output.c >> @@ -1262,7 +1262,7 @@ static int __ip6_append_data(struct sock *sk, >> dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len; >> } >> >> - paged = !!cork->gso_size; >> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); >> mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize; >> orig_mtu = mtu; >> >> > > As I said, this wont help for stacked device > > bonding might advertise NETIF_F_SG, but one slave might not. I don't quite follow. The reported crash happens in the protocol layer, because of this check. With pagedlen we have not allocated sufficient space for the skb_put. if (!(rt->dst.dev->features&NETIF_F_SG)) { unsigned int off; off = skb->len; if (getfrag(from, skb_put(skb, copy), offset, copy, off, skb) < 0) { __skb_trim(skb, off); err = -EFAULT; goto error; } } else { int i = skb_shinfo(skb)->nr_frags; Are you referring to a separate potential issue in the gso layer? If a bonding device advertises SG, but a slave does not, then skb_segment on the slave should build linear segs? I have not tested that.
Re: [PATCH v1 1/4] media: rc: introduce BPF_PROG_IR_DECODER
On 05/14/2018 02:10 PM, Sean Young wrote: > Add support for BPF_PROG_IR_DECODER. This type of BPF program can call Kconfig file below uses IR_BPF_DECODER instead of the symbol name above. and then patch 3 says a third choice: The context provided to a BPF_PROG_RAWIR_DECODER is a struct ir_raw_event; > rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report > that the last key should be repeated. > > Signed-off-by: Sean Young > --- > drivers/media/rc/Kconfig | 8 +++ > drivers/media/rc/Makefile | 1 + > drivers/media/rc/ir-bpf-decoder.c | 93 +++ > include/linux/bpf_types.h | 3 + > include/uapi/linux/bpf.h | 16 +- > 5 files changed, 120 insertions(+), 1 deletion(-) > create mode 100644 drivers/media/rc/ir-bpf-decoder.c > > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig > index eb2c3b6eca7f..10ad6167d87c 100644 > --- a/drivers/media/rc/Kconfig > +++ b/drivers/media/rc/Kconfig > @@ -120,6 +120,14 @@ config IR_IMON_DECODER > remote control and you would like to use it with a raw IR > receiver, or if you wish to use an encoder to transmit this IR. > > +config IR_BPF_DECODER > + bool "Enable IR raw decoder using BPF" > + depends on BPF_SYSCALL > + depends on RC_CORE=y > + help > +Enable this option to make it possible to load custom IR > +decoders written in BPF. > + > endif #RC_DECODERS > > menuconfig RC_DEVICES > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile > index 2e1c87066f6c..12e1118430d0 100644 > --- a/drivers/media/rc/Makefile > +++ b/drivers/media/rc/Makefile > @@ -5,6 +5,7 @@ obj-y += keymaps/ > obj-$(CONFIG_RC_CORE) += rc-core.o > rc-core-y := rc-main.o rc-ir-raw.o > rc-core-$(CONFIG_LIRC) += lirc_dev.o > +rc-core-$(CONFIG_IR_BPF_DECODER) += ir-bpf-decoder.o -- ~Randy
Re: [net 1/1] net/mlx5: Fix build break when CONFIG_SMP=n
On 05/14/2018 03:38 PM, Saeed Mahameed wrote: > Avoid using the kernel's irq_descriptor and return IRQ vector affinity > directly from the driver. > > This fixes the following build break when CONFIG_SMP=n > > include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’: > include/linux/mlx5/driver.h:1299:13: error: > ‘struct irq_desc’ has no member named ‘affinity_hint’ > > Fixes: 6082d9c9c94a ("net/mlx5: Fix mlx5_get_vector_affinity function") > Signed-off-by: Saeed Mahameed > CC: Randy Dunlap > CC: Guenter Roeck > CC: Thomas Gleixner > Tested-by: Israel Rukshin Reported-by: kbuild test robot Reported-by: Randy Dunlap Tested-by: Randy Dunlap Thanks. > --- > > For -stable v4.14 > > include/linux/mlx5/driver.h | 12 +--- > 1 file changed, 1 insertion(+), 11 deletions(-) > > diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h > index 2a156c5dfadd..d703774982ca 100644 > --- a/include/linux/mlx5/driver.h > +++ b/include/linux/mlx5/driver.h > @@ -1286,17 +1286,7 @@ enum { > static inline const struct cpumask * > mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector) > { > - struct irq_desc *desc; > - unsigned int irq; > - int eqn; > - int err; > - > - err = mlx5_vector2eqn(dev, vector, &eqn, &irq); > - if (err) > - return NULL; > - > - desc = irq_to_desc(irq); > - return desc->affinity_hint; > + return dev->priv.irq_info[vector].mask; > } > > #endif /* MLX5_DRIVER_H */ > -- ~Randy
Re: [PATCH net-next 2/3] gso: limit udp gso to egress-only virtual devices
On Mon, May 14, 2018 at 7:07 PM, Willem de Bruijn wrote: > From: Willem de Bruijn > > Until the udp receive stack supports large packets (UDP GRO), GSO > packets must not loop from the egress to the ingress path. > > Revert the change that added NETIF_F_GSO_UDP_L4 to various virtual > devices through NETIF_F_GSO_ENCAP_ALL as this included devices that > may loop packets, such as veth and macvlan. > > Instead add it to specific devices that forward to another device's > egress path: bonding and team. > > Fixes: 83aa025f535f ("udp: add gso support to virtual devices") > CC: Alexander Duyck > Signed-off-by: Willem de Bruijn > --- > diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c > index 9dbd390ace34..c6a9f0cafea2 100644 > --- a/drivers/net/team/team.c > +++ b/drivers/net/team/team.c > @@ -1026,7 +1026,8 @@ static void __team_compute_features(struct team *team) > } > > team->dev->vlan_features = vlan_features; > - team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL; > + team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL | > +NETIF_GSO_UDP_L4; This has a typo. team.ko did not build automatically for me and caught it with a full compile just too late. Need to send a v2, sorry.
Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather
On 05/14/2018 04:07 PM, Willem de Bruijn wrote: > From: Willem de Bruijn > > Paged allocation stores most payload in skb frags. This helps udp gso > by avoiding copying from the gso skb to segment skb in skb_segment. > > But without scatter-gather, data must be linear, so do not use paged > mode unless NETIF_F_SG. > > Fixes: 15e36f5b8e98 ("udp: paged allocation with gso") > Reported-by: Sean Tranchetti > Signed-off-by: Willem de Bruijn > --- > net/ipv4/ip_output.c | 2 +- > net/ipv6/ip6_output.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index b5e21eb198d8..b38731d8a44f 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk, > > exthdrlen = !skb ? rt->dst.header_len : 0; > mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize; > - paged = !!cork->gso_size; > + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); > > if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP && > sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index 7f4493080df6..35a940b9f208 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1262,7 +1262,7 @@ static int __ip6_append_data(struct sock *sk, > dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len; > } > > - paged = !!cork->gso_size; > + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); > mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize; > orig_mtu = mtu; > > As I said, this wont help for stacked device bonding might advertise NETIF_F_SG, but one slave might not.
Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path
On 05/14/2018 03:45 PM, stran...@codeaurora.org wrote: > On 2018-05-11 17:16, Willem de Bruijn wrote: > >>> Hmm, no, we absolutely need to fix GSO instead. >>> >>> Think of a bonding device (or any virtual devices), your patch wont avoid >>> the crash. > > Hi Eric. Can you clarify what you mean by "fix GSO?" Is that just having the > GSO path work > regardless of whether or not SG is enabled for the device? > Yes. GSO is a fallback, and must work all the time, not panic.
Re: [PATCH v3 bpf-next 0/2] bpf: enable stackmap with build_id in nmi
On 05/07/2018 07:50 PM, Song Liu wrote: > Changes v2 -> v3: > Improve syntax based on suggestion by Tobin C. Harding. > > Changes v1 -> v2: > 1. Rename some variables to (hopefully) reduce confusion; > 2. Check irq_work status with IRQ_WORK_BUSY (instead of work->sem); > 3. In Kconfig, let BPF_SYSCALL select IRQ_WORK; > 4. Add static to DEFINE_PER_CPU(); >5. Remove pr_info() in stack_map_init(). > > Song Liu (2): > bpf: enable stackmap with build_id in nmi context > bpf: add selftest for stackmap with build_id in NMI context > > init/Kconfig | 1 + > kernel/bpf/stackmap.c | 59 +++-- > tools/testing/selftests/bpf/test_progs.c | 134 > + > tools/testing/selftests/bpf/urandom_read.c | 10 ++- > 4 files changed, 196 insertions(+), 8 deletions(-) Applied to bpf-next, thanks Song!
[PATCH net-next 3/3] udp: only use paged allocation with scatter-gather
From: Willem de Bruijn Paged allocation stores most payload in skb frags. This helps udp gso by avoiding copying from the gso skb to segment skb in skb_segment. But without scatter-gather, data must be linear, so do not use paged mode unless NETIF_F_SG. Fixes: 15e36f5b8e98 ("udp: paged allocation with gso") Reported-by: Sean Tranchetti Signed-off-by: Willem de Bruijn --- net/ipv4/ip_output.c | 2 +- net/ipv6/ip6_output.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index b5e21eb198d8..b38731d8a44f 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk, exthdrlen = !skb ? rt->dst.header_len : 0; mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize; - paged = !!cork->gso_size; + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP && sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7f4493080df6..35a940b9f208 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1262,7 +1262,7 @@ static int __ip6_append_data(struct sock *sk, dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len; } - paged = !!cork->gso_size; + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize; orig_mtu = mtu; -- 2.17.0.441.gb46fe60e1d-goog
Re: [PATCH bpf-next v2] samples/bpf: xdp_monitor, accept short options
On 05/14/2018 12:20 PM, Jesper Dangaard Brouer wrote: > > On Mon, 14 May 2018 17:29:15 +0900 Prashant Bhole > wrote: > >> Updated optstring parameter for getopt_long() to accept short options. >> Also updated usage() function. >> >> Signed-off-by: Prashant Bhole > > Acked-by: Jesper Dangaard Brouer Applied to bpf-next, thanks everyone!
Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path
>> Paged skbuffs is an optimization for gso, but the feature should >> continue to work even if gso skbs are linear, indeed (if at the cost >> of copying during skb_segment). >> >> We need to make paged contingent on scatter-gather. Rough >> patch below. That is for ipv4 only, the same will be needed for ipv6. >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c >> index b5e21eb198d8..b38731d8a44f 100644 >> --- a/net/ipv4/ip_output.c >> +++ b/net/ipv4/ip_output.c >> @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk, >> >> exthdrlen = !skb ? rt->dst.header_len : 0; >> mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize; >> - paged = !!cork->gso_size; >> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); > > > Hi Willem. That's definitely a much cleaner patch than ours since it allows > the GSO to continue without failure. > We tried it on both the IPv4 and IPv6 path and didn't see the crash in > either case. Thanks for testing. I have a small set of fixes to udp gso, including this one. Let me send them right away.
[PATCH net-next 2/3] gso: limit udp gso to egress-only virtual devices
From: Willem de Bruijn Until the udp receive stack supports large packets (UDP GRO), GSO packets must not loop from the egress to the ingress path. Revert the change that added NETIF_F_GSO_UDP_L4 to various virtual devices through NETIF_F_GSO_ENCAP_ALL as this included devices that may loop packets, such as veth and macvlan. Instead add it to specific devices that forward to another device's egress path: bonding and team. Fixes: 83aa025f535f ("udp: add gso support to virtual devices") CC: Alexander Duyck Signed-off-by: Willem de Bruijn --- drivers/net/bonding/bond_main.c | 5 +++-- drivers/net/team/team.c | 5 +++-- include/linux/netdev_features.h | 1 - 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 4176e1d95f47..d7b58370ae77 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1107,7 +1107,8 @@ static void bond_compute_features(struct bonding *bond) done: bond_dev->vlan_features = vlan_features; - bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL; + bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL | + NETIF_F_GSO_UDP_L4; bond_dev->gso_max_segs = gso_max_segs; netif_set_gso_max_size(bond_dev, gso_max_size); @@ -4263,7 +4264,7 @@ void bond_setup(struct net_device *bond_dev) NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_FILTER; - bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL; + bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4; bond_dev->features |= bond_dev->hw_features; } diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index 9dbd390ace34..c6a9f0cafea2 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -1026,7 +1026,8 @@ static void __team_compute_features(struct team *team) } team->dev->vlan_features = vlan_features; - team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL; + team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL | +NETIF_GSO_UDP_L4; team->dev->hard_header_len = max_hard_header_len; team->dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; @@ -2117,7 +2118,7 @@ static void team_setup(struct net_device *dev) NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_FILTER; - dev->hw_features |= NETIF_F_GSO_ENCAP_ALL; + dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4; dev->features |= dev->hw_features; } diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index c87c3a3453c1..623bb8ced060 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -220,7 +220,6 @@ enum { NETIF_F_GSO_GRE_CSUM | \ NETIF_F_GSO_IPXIP4 | \ NETIF_F_GSO_IPXIP6 | \ -NETIF_F_GSO_UDP_L4 | \ NETIF_F_GSO_UDP_TUNNEL | \ NETIF_F_GSO_UDP_TUNNEL_CSUM) -- 2.17.0.441.gb46fe60e1d-goog
[PATCH net-next 0/3] udp gso fixes
From: Willem de Bruijn A few small fixes: - disallow segmentation with XFRM - do not leak gso packets into the ingress path - fix a panic if scatter-gather is disabled Willem de Bruijn (3): udp: exclude gso from xfrm paths gso: limit udp gso to egress-only virtual devices udp: only use paged allocation with scatter-gather drivers/net/bonding/bond_main.c | 5 +++-- drivers/net/team/team.c | 5 +++-- include/linux/netdev_features.h | 1 - net/ipv4/ip_output.c| 2 +- net/ipv4/udp.c | 3 ++- net/ipv6/ip6_output.c | 2 +- net/ipv6/udp.c | 3 ++- 7 files changed, 12 insertions(+), 9 deletions(-) -- 2.17.0.441.gb46fe60e1d-goog
[PATCH net-next 1/3] udp: exclude gso from xfrm paths
From: Willem de Bruijn UDP GSO conflicts with transformations in the XFRM layer. Return an error if GSO is attempted. Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") CC: Michal Kubecek Signed-off-by: Willem de Bruijn --- net/ipv4/udp.c | 3 ++- net/ipv6/udp.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index ff4d4ba67735..d71f1f3e1155 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -788,7 +788,8 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4, return -EINVAL; if (sk->sk_no_check_tx) return -EINVAL; - if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite) + if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite || + dst_xfrm(skb_dst(skb))) return -EIO; skb_shinfo(skb)->gso_size = cork->gso_size; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 2839c1bd1e58..426c9d2b418d 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1053,7 +1053,8 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6, return -EINVAL; if (udp_sk(sk)->no_check6_tx) return -EINVAL; - if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite) + if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite || + dst_xfrm(skb_dst(skb))) return -EIO; skb_shinfo(skb)->gso_size = cork->gso_size; -- 2.17.0.441.gb46fe60e1d-goog
Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path
On 2018-05-11 17:16, Willem de Bruijn wrote: Hmm, no, we absolutely need to fix GSO instead. Think of a bonding device (or any virtual devices), your patch wont avoid the crash. Hi Eric. Can you clarify what you mean by "fix GSO?" Is that just having the GSO path work regardless of whether or not SG is enabled for the device? Thanks for reporting the issue. Paged skbuffs is an optimization for gso, but the feature should continue to work even if gso skbs are linear, indeed (if at the cost of copying during skb_segment). We need to make paged contingent on scatter-gather. Rough patch below. That is for ipv4 only, the same will be needed for ipv6. diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index b5e21eb198d8..b38731d8a44f 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk, exthdrlen = !skb ? rt->dst.header_len : 0; mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize; - paged = !!cork->gso_size; + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG); Hi Willem. That's definitely a much cleaner patch than ours since it allows the GSO to continue without failure. We tried it on both the IPv4 and IPv6 path and didn't see the crash in either case. - The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH bpf-next v5 3/6] bpf: Add IPv6 Segment Routing helpers
On 05/12/2018 07:25 PM, Mathieu Xhonneux wrote: [...] > +BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset, > +const void *, from, u32, len) > +{ > +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) > + struct seg6_bpf_srh_state *srh_state = > + this_cpu_ptr(&seg6_bpf_srh_states); > + void *srh_tlvs, *srh_end, *ptr; > + struct ipv6_sr_hdr *srh; > + int srhoff = 0; > + > + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0) > + return -EINVAL; > + > + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff); > + srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4)); > + srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen); Do we need to check that this cannot go out of bounds wrt skb data? > + ptr = skb->data + offset; > + if (ptr >= srh_tlvs && ptr + len <= srh_end) > + srh_state->valid = 0; > + else if (ptr < (void *)&srh->flags || > + ptr + len > (void *)&srh->segments) > + return -EFAULT; > + > + if (unlikely(bpf_try_make_writable(skb, offset + len))) > + return -EFAULT; > + > + memcpy(ptr, from, len); You have a use after free here. bpf_try_make_writable() is potentially changing underlying skb->data (e.g. see pskb_expand_head()). Therefore memcpy()'ing into cached ptr is invalid. > + return 0; > +#else /* CONFIG_IPV6_SEG6_BPF */ > + return -EOPNOTSUPP; > +#endif > +} > + > +static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = { > + .func = bpf_lwt_seg6_store_bytes, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_CTX, > + .arg2_type = ARG_ANYTHING, > + .arg3_type = ARG_PTR_TO_MEM, > + .arg4_type = ARG_CONST_SIZE > +}; > + > +BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb, > +u32, action, void *, param, u32, param_len) > +{ > +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) > + struct seg6_bpf_srh_state *srh_state = > + this_cpu_ptr(&seg6_bpf_srh_states); > + struct ipv6_sr_hdr *srh; > + int srhoff = 0; > + int err; > + > + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0) > + return -EINVAL; > + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff); > + > + if (!srh_state->valid) { > + if (unlikely((srh_state->hdrlen & 7) != 0)) > + return -EBADMSG; > + > + srh->hdrlen = (u8)(srh_state->hdrlen >> 3); > + if (unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3))) > + return -EBADMSG; > + > + srh_state->valid = 1; > + } > + > + switch (action) { > + case SEG6_LOCAL_ACTION_END_X: > + if (param_len != sizeof(struct in6_addr)) > + return -EINVAL; > + return seg6_lookup_nexthop(skb, (struct in6_addr *)param, 0); > + case SEG6_LOCAL_ACTION_END_T: > + if (param_len != sizeof(int)) > + return -EINVAL; > + return seg6_lookup_nexthop(skb, NULL, *(int *)param); > + case SEG6_LOCAL_ACTION_END_B6: > + err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6_INLINE, > + param, param_len); > + if (!err) > + srh_state->hdrlen = > + ((struct ipv6_sr_hdr *)param)->hdrlen << 3; > + return err; > + case SEG6_LOCAL_ACTION_END_B6_ENCAP: > + err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6, > + param, param_len); > + if (!err) > + srh_state->hdrlen = > + ((struct ipv6_sr_hdr *)param)->hdrlen << 3; > + return err; > + default: > + return -EINVAL; > + } > +#else /* CONFIG_IPV6_SEG6_BPF */ > + return -EOPNOTSUPP; > +#endif > +} > + > +static const struct bpf_func_proto bpf_lwt_seg6_action_proto = { > + .func = bpf_lwt_seg6_action, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_CTX, > + .arg2_type = ARG_ANYTHING, > + .arg3_type = ARG_PTR_TO_MEM, > + .arg4_type = ARG_CONST_SIZE > +}; > + > +BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset, > +s32, len) > +{ > +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) > + struct seg6_bpf_srh_state *srh_state = > + this_cpu_ptr(&seg6_bpf_srh_states); > + void *srh_end, *srh_tlvs, *ptr; > + struct ipv6_sr_hdr *srh; > + struct ipv6hdr *hdr; > + int srhoff = 0; > + int ret; > + > + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0) > + return -EINVAL; > + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff); > + > + srh_tlvs = (void *)((unsigned char *)srh + sizeof(*srh) + > +
[net 1/1] net/mlx5: Fix build break when CONFIG_SMP=n
Avoid using the kernel's irq_descriptor and return IRQ vector affinity directly from the driver. This fixes the following build break when CONFIG_SMP=n include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’: include/linux/mlx5/driver.h:1299:13: error: ‘struct irq_desc’ has no member named ‘affinity_hint’ Fixes: 6082d9c9c94a ("net/mlx5: Fix mlx5_get_vector_affinity function") Signed-off-by: Saeed Mahameed CC: Randy Dunlap CC: Guenter Roeck CC: Thomas Gleixner Tested-by: Israel Rukshin --- For -stable v4.14 include/linux/mlx5/driver.h | 12 +--- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index 2a156c5dfadd..d703774982ca 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -1286,17 +1286,7 @@ enum { static inline const struct cpumask * mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector) { - struct irq_desc *desc; - unsigned int irq; - int eqn; - int err; - - err = mlx5_vector2eqn(dev, vector, &eqn, &irq); - if (err) - return NULL; - - desc = irq_to_desc(irq); - return desc->affinity_hint; + return dev->priv.irq_info[vector].mask; } #endif /* MLX5_DRIVER_H */ -- 2.17.0
Re: [PATCH] net/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
On Mon, May 14, 2018 at 11:56 AM, David Miller wrote: > From: Christophe JAILLET > Date: Sat, 12 May 2018 19:09:25 +0200 > >> 'out' is allocated with 'kvzalloc()'. 'kvfree()' must be used to free it. >> >> Signed-off-by: Christophe JAILLET > > Saeed, I assume I will see this in one of your forthcoming pull > requests. > > Thanks. In case this is for net-next, I will apply v3 to mlx5-next once Christophe adds the "Fixes" tags according to Eric's request. if it is for net (RC) then you can go ahead and apply v3 to net branch. Thanks, Saeed.
[PATCH net-stable 03/24] hv_netvsc: Rename tx_send_table to tx_table
From: Haiyang Zhang commit 39e91cfbf6f5fb26ba64cc2e8874372baf1671e7 upstream. Simplify the variable name: tx_send_table Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc.c | 2 +- drivers/net/hyperv/netvsc_drv.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 8144b938460f..94de7ca1e52e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -734,7 +734,7 @@ struct net_device_context { u32 tx_checksum_mask; - u32 tx_send_table[VRSS_SEND_TAB_SIZE]; + u32 tx_table[VRSS_SEND_TAB_SIZE]; /* Ethtool settings */ bool udp4_l4_hash; diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index a6bafcf55776..03b44ec805db 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1107,7 +1107,7 @@ static void netvsc_send_table(struct hv_device *hdev, nvmsg->msg.v5_msg.send_table.offset); for (i = 0; i < count; i++) - net_device_ctx->tx_send_table[i] = tab[i]; + net_device_ctx->tx_table[i] = tab[i]; } static void netvsc_send_vf(struct net_device_context *net_device_ctx, diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 8b358d3fc057..f5d299b82bac 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -234,8 +234,8 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev, struct sock *sk = skb->sk; int q_idx; - q_idx = ndc->tx_send_table[netvsc_get_hash(skb, ndc) & - (VRSS_SEND_TAB_SIZE - 1)]; + q_idx = ndc->tx_table[netvsc_get_hash(skb, ndc) & + (VRSS_SEND_TAB_SIZE - 1)]; /* If queue index changed record the new value */ if (q_idx != old_idx && -- 2.17.0
[PATCH net-stable 05/24] hv_netvsc: Set tx_table to equal weight after subchannels open
From: Haiyang Zhang commit a6fb6aa3cfa9047b62653dbcfc9bcde6e2272b41 upstream. In some cases, like internal vSwitch, the host doesn't provide send indirection table updates. This patch sets the table to be equal weight after subchannels are all open. Otherwise, all workload will be on one TX channel. As tested, this patch has largely increased the throughput over internal vSwitch. Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/rndis_filter.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index addf9f69c58c..0648eebda829 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -1114,6 +1114,9 @@ void rndis_set_subchannel(struct work_struct *w) netif_set_real_num_tx_queues(ndev, nvdev->num_chn); netif_set_real_num_rx_queues(ndev, nvdev->num_chn); + for (i = 0; i < VRSS_SEND_TAB_SIZE; i++) + ndev_ctx->tx_table[i] = i % nvdev->num_chn; + rtnl_unlock(); return; -- 2.17.0
[PATCH net-stable 09/24] hv_netvsc: Use the num_online_cpus() for channel limit
From: Haiyang Zhang commit 25a39f7f975c3c26a0052fbf9b59201c06744332 upstream Since we no longer localize channel/CPU affiliation within one NUMA node, num_online_cpus() is used as the number of channel cap, instead of the number of processors in a NUMA node. This patch allows a bigger range for tuning the number of channels. Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/rndis_filter.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 0c99d9926085..05109ed5377c 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -1221,7 +1221,6 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev, struct ndis_recv_scale_cap rsscap; u32 rsscap_size = sizeof(struct ndis_recv_scale_cap); u32 mtu, size; - const struct cpumask *node_cpu_mask; u32 num_possible_rss_qs; int i, ret; @@ -1290,14 +1289,8 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev, if (ret || rsscap.num_recv_que < 2) goto out; - /* -* We will limit the VRSS channels to the number CPUs in the NUMA node -* the primary channel is currently bound to. -* -* This also guarantees that num_possible_rss_qs <= num_online_cpus -*/ - node_cpu_mask = cpumask_of_node(cpu_to_node(dev->channel->target_cpu)); - num_possible_rss_qs = min_t(u32, cpumask_weight(node_cpu_mask), + /* This guarantees that num_possible_rss_qs <= num_online_cpus */ + num_possible_rss_qs = min_t(u32, num_online_cpus(), rsscap.num_recv_que); net_device->max_chn = min_t(u32, VRSS_CHANNEL_MAX, num_possible_rss_qs); -- 2.17.0
[PATCH net-stable 10/24] hv_netvsc: avoid retry on send during shutdown
From: Stephen Hemminger commit 12f69661a49446840d742d8feb593ace022d9f66 upstream Change the initialization order so that the device is ready to transmit (ie connect vsp is completed) before setting the internal reference to the device with RCU. This avoids any races on initialization and prevents retry issues on shutdown. Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 24 +++- 1 file changed, 7 insertions(+), 17 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 22da6399b37a..839b9d6ecb41 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -850,13 +850,6 @@ int netvsc_send(struct net_device *ndev, if (unlikely(!net_device || net_device->destroy)) return -ENODEV; - /* We may race with netvsc_connect_vsp()/netvsc_init_buf() and get -* here before the negotiation with the host is finished and -* send_section_map may not be allocated yet. -*/ - if (unlikely(!net_device->send_section_map)) - return -EAGAIN; - nvchan = &net_device->chan_table[packet->q_idx]; packet->send_buf_index = NETVSC_INVALID_INDEX; packet->cp_partial = false; @@ -864,10 +857,8 @@ int netvsc_send(struct net_device *ndev, /* Send control message directly without accessing msd (Multi-Send * Data) field which may be changed during data packet processing. */ - if (!skb) { - cur_send = packet; - goto send_now; - } + if (!skb) + return netvsc_send_pkt(device, packet, net_device, pb, skb); /* batch packets in send buffer if possible */ msdp = &nvchan->msd; @@ -951,7 +942,6 @@ int netvsc_send(struct net_device *ndev, } } -send_now: if (cur_send) ret = netvsc_send_pkt(device, cur_send, net_device, pb, skb); @@ -1308,11 +1298,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, napi_enable(&net_device->chan_table[0].napi); - /* Writing nvdev pointer unlocks netvsc_send(), make sure chn_table is -* populated. -*/ - rcu_assign_pointer(net_device_ctx->nvdev, net_device); - /* Connect with the NetVsp */ ret = netvsc_connect_vsp(device, net_device, device_info); if (ret != 0) { @@ -1321,6 +1306,11 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, goto close; } + /* Writing nvdev pointer unlocks netvsc_send(), make sure chn_table is +* populated. +*/ + rcu_assign_pointer(net_device_ctx->nvdev, net_device); + return net_device; close: -- 2.17.0
[PATCH net-stable 02/24] hv_netvsc: Rename ind_table to rx_table
From: Haiyang Zhang commit 47371300dfc269dd8d150e5b872bdbbda98ba809 upstream. Rename this variable because it is the Receive indirection table. Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc_drv.c | 4 ++-- drivers/net/hyperv/rndis_filter.c | 6 +++--- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 4f3afcf92a7c..8144b938460f 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -179,7 +179,7 @@ struct rndis_device { u8 hw_mac_adr[ETH_ALEN]; u8 rss_key[NETVSC_HASH_KEYLEN]; - u16 ind_table[ITAB_NUM]; + u16 rx_table[ITAB_NUM]; }; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index e40be71df9d6..8b358d3fc057 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -1378,7 +1378,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key, rndis_dev = ndev->extension; if (indir) { for (i = 0; i < ITAB_NUM; i++) - indir[i] = rndis_dev->ind_table[i]; + indir[i] = rndis_dev->rx_table[i]; } if (key) @@ -1408,7 +1408,7 @@ static int netvsc_set_rxfh(struct net_device *dev, const u32 *indir, return -EINVAL; for (i = 0; i < ITAB_NUM; i++) - rndis_dev->ind_table[i] = indir[i]; + rndis_dev->rx_table[i] = indir[i]; } if (!key) { diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 065b204d8e17..addf9f69c58c 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -759,7 +759,7 @@ int rndis_filter_set_rss_param(struct rndis_device *rdev, /* Set indirection table entries */ itab = (u32 *)(rssp + 1); for (i = 0; i < ITAB_NUM; i++) - itab[i] = rdev->ind_table[i]; + itab[i] = rdev->rx_table[i]; /* Set hask key values */ keyp = (u8 *)((unsigned long)rssp + rssp->kashkey_offset); @@ -1284,8 +1284,8 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev, net_device->num_chn = min(net_device->max_chn, device_info->num_chn); for (i = 0; i < ITAB_NUM; i++) - rndis_device->ind_table[i] = ethtool_rxfh_indir_default(i, - net_device->num_chn); + rndis_device->rx_table[i] = ethtool_rxfh_indir_default( + i, net_device->num_chn); atomic_set(&net_device->open_chn, 1); vmbus_set_sc_create_callback(dev->channel, netvsc_sc_open); -- 2.17.0
[PATCH net-stable 04/24] hv_netvsc: Add initialization of tx_table in netvsc_device_add()
From: Haiyang Zhang commit 6b0cbe315868d613123cf387052ccda5f09d49ea upstream. tx_table is part of the private data of kernel net_device. It is only zero-ed out when allocating net_device. We may recreate netvsc_device w/o recreating net_device, so the private netdev data, including tx_table, are not zeroed. It may contain channel numbers for the older netvsc_device. This patch adds initialization of tx_table each time we recreate netvsc_device. Signed-off-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 03b44ec805db..73999214d444 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1252,6 +1252,9 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, if (!net_device) return ERR_PTR(-ENOMEM); + for (i = 0; i < VRSS_SEND_TAB_SIZE; i++) + net_device_ctx->tx_table[i] = 0; + net_device->ring_size = ring_size; /* Because the device uses NAPI, all the interrupt batching and -- 2.17.0
[PATCH net-stable 14/24] hv_netvsc: fix race in napi poll when rescheduling
From: Stephen Hemminger commit d64e38ae690e3337db0d38d9b149a193a1646c4b upstream There is a race between napi_reschedule and re-enabling interrupts which could lead to missed host interrrupts. This occurs when interrupts are re-enabled (hv_end_read) and vmbus irq callback (netvsc_channel_cb) has already scheduled NAPI. Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index b7720b65f5d6..7e0eb99571af 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1205,9 +1205,10 @@ int netvsc_poll(struct napi_struct *napi, int budget) if (send_recv_completions(ndev, net_device, nvchan) == 0 && work_done < budget && napi_complete_done(napi, work_done) && - hv_end_read(&channel->inbound)) { + hv_end_read(&channel->inbound) && + napi_schedule_prep(napi)) { hv_begin_read(&channel->inbound); - napi_reschedule(napi); + __napi_schedule(napi); } /* Driver may overshoot since multiple packets per descriptor */ -- 2.17.0
[PATCH net-stable 12/24] hv_netvsc: fix error unwind handling if vmbus_open fails
From: Stephen Hemminger commit fcfb4a00d1e514e8313277a01ef919de1113025b upstream Need to delete NAPI association if vmbus_open fails. Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 839b9d6ecb41..b7720b65f5d6 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1288,7 +1288,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, net_device->chan_table); if (ret != 0) { - netif_napi_del(&net_device->chan_table[0].napi); netdev_err(ndev, "unable to open channel: %d\n", ret); goto cleanup; } @@ -1321,6 +1320,7 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, vmbus_close(device->channel); cleanup: + netif_napi_del(&net_device->chan_table[0].napi); free_netvsc_device(&net_device->rcu); return ERR_PTR(ret); -- 2.17.0
[PATCH net-stable 07/24] hv_netvsc: preserve hw_features on mtu/channels/ringparam changes
From: Vitaly Kuznetsov commit aefd80e874e98a864915df5b7d90824a4340b450 upstream. rndis_filter_device_add() is called both from netvsc_probe() when we initially create the device and from set channels/mtu/ringparam routines where we basically remove the device and add it back. hw_features is reset in rndis_filter_device_add() and filled with host data. However, we lose all additional flags which are set outside of the driver, e.g. register_netdevice() adds NETIF_F_SOFT_FEATURES and many others. Unfortunately, calls to rndis_{query_hwcaps(), _set_offload_params()} calls cannot be avoided on every RNDIS reset: host expects us to set required features explicitly. Moreover, in theory hardware capabilities can change and we need to reflect the change in hw_features. Reset net->hw_features bits according to host data in rndis_netdev_set_hwcaps(), clear corresponding feature bits from net->features in case some features went missing (will never happen in real life I guess but let's be consistent). Signed-off-by: Vitaly Kuznetsov Reviewed-by: Haiyang Zhang Signed-off-by: David S. Miller --- drivers/net/hyperv/hyperv_net.h | 4 + drivers/net/hyperv/netvsc_drv.c | 2 +- drivers/net/hyperv/rndis_filter.c | 136 +- 3 files changed, 83 insertions(+), 59 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 94de7ca1e52e..a3f628c3c9ed 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -659,6 +659,10 @@ struct nvsp_message { #define NETVSC_RECEIVE_BUFFER_ID 0xcafe #define NETVSC_SEND_BUFFER_ID 0 +#define NETVSC_SUPPORTED_HW_FEATURES (NETIF_F_RXCSUM | NETIF_F_IP_CSUM | \ + NETIF_F_TSO | NETIF_F_IPV6_CSUM | \ + NETIF_F_TSO6) + #define VRSS_SEND_TAB_SIZE 16 /* must be power of 2 */ #define VRSS_CHANNEL_MAX 64 #define VRSS_CHANNEL_DEFAULT 8 diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index f5d299b82bac..cfaf433b0cda 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -1955,7 +1955,7 @@ static int netvsc_probe(struct hv_device *dev, memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN); - /* hw_features computed in rndis_filter_device_add */ + /* hw_features computed in rndis_netdev_set_hwcaps() */ net->features = net->hw_features | NETIF_F_HIGHDMA | NETIF_F_SG | NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX; diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 0648eebda829..be57639bee29 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -1131,69 +1131,20 @@ void rndis_set_subchannel(struct work_struct *w) rtnl_unlock(); } -struct netvsc_device *rndis_filter_device_add(struct hv_device *dev, - struct netvsc_device_info *device_info) +static int rndis_netdev_set_hwcaps(struct rndis_device *rndis_device, + struct netvsc_device *nvdev) { - struct net_device *net = hv_get_drvdata(dev); + struct net_device *net = rndis_device->ndev; struct net_device_context *net_device_ctx = netdev_priv(net); - struct netvsc_device *net_device; - struct rndis_device *rndis_device; struct ndis_offload hwcaps; struct ndis_offload_params offloads; - struct ndis_recv_scale_cap rsscap; - u32 rsscap_size = sizeof(struct ndis_recv_scale_cap); unsigned int gso_max_size = GSO_MAX_SIZE; - u32 mtu, size; - const struct cpumask *node_cpu_mask; - u32 num_possible_rss_qs; - int i, ret; - - rndis_device = get_rndis_device(); - if (!rndis_device) - return ERR_PTR(-ENODEV); - - /* -* Let the inner driver handle this first to create the netvsc channel -* NOTE! Once the channel is created, we may get a receive callback -* (RndisFilterOnReceive()) before this call is completed -*/ - net_device = netvsc_device_add(dev, device_info); - if (IS_ERR(net_device)) { - kfree(rndis_device); - return net_device; - } - - /* Initialize the rndis device */ - net_device->max_chn = 1; - net_device->num_chn = 1; - - net_device->extension = rndis_device; - rndis_device->ndev = net; - - /* Send the rndis initialization message */ - ret = rndis_filter_init_device(rndis_device, net_device); - if (ret != 0) - goto err_dev_remv; - - /* Get the MTU from the host */ - size = sizeof(u32); - ret = rndis_filter_query_device(rndis_device, net_device, - RNDIS_OID_GEN_MAXIMUM_FRAME_SIZE, - &mtu, &size); - if (ret == 0 &
[PATCH net-stable 08/24] hv_netvsc: empty current transmit aggregation if flow blocked
From: Stephen Hemminger commit cfd8afd986cdb59ea9adac873c5082498a1eb7c0 upstream If the transmit queue is known full, then don't keep aggregating data. And the cp_partial flag which indicates that the current aggregation buffer is full can be folded in to avoid more conditionals. Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc.c | 36 ++- drivers/net/hyperv/netvsc_drv.c | 2 +- drivers/net/hyperv/rndis_filter.c | 3 +-- 4 files changed, 24 insertions(+), 19 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index a3f628c3c9ed..fd51a329e36e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -192,7 +192,7 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, const struct netvsc_device_info *info); int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx); void netvsc_device_remove(struct hv_device *device); -int netvsc_send(struct net_device_context *ndc, +int netvsc_send(struct net_device *net, struct hv_netvsc_packet *packet, struct rndis_message *rndis_msg, struct hv_page_buffer *page_buffer, diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 4bc8a1d529d9..22da6399b37a 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -700,13 +700,13 @@ static u32 netvsc_get_next_send_section(struct netvsc_device *net_device) return NETVSC_INVALID_INDEX; } -static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device, - unsigned int section_index, - u32 pend_size, - struct hv_netvsc_packet *packet, - struct rndis_message *rndis_msg, - struct hv_page_buffer *pb, - struct sk_buff *skb) +static void netvsc_copy_to_send_buf(struct netvsc_device *net_device, + unsigned int section_index, + u32 pend_size, + struct hv_netvsc_packet *packet, + struct rndis_message *rndis_msg, + struct hv_page_buffer *pb, + bool xmit_more) { char *start = net_device->send_buf; char *dest = start + (section_index * net_device->send_section_size) @@ -719,7 +719,8 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device, packet->page_buf_cnt; /* Add padding */ - if (skb->xmit_more && remain && !packet->cp_partial) { + remain = packet->total_data_buflen & (net_device->pkt_align - 1); + if (xmit_more && remain) { padding = net_device->pkt_align - remain; rndis_msg->msg_len += padding; packet->total_data_buflen += padding; @@ -739,8 +740,6 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device, memset(dest, 0, padding); msg_size += padding; } - - return msg_size; } static inline int netvsc_send_pkt( @@ -828,12 +827,13 @@ static inline void move_pkt_msd(struct hv_netvsc_packet **msd_send, } /* RCU already held by caller */ -int netvsc_send(struct net_device_context *ndev_ctx, +int netvsc_send(struct net_device *ndev, struct hv_netvsc_packet *packet, struct rndis_message *rndis_msg, struct hv_page_buffer *pb, struct sk_buff *skb) { + struct net_device_context *ndev_ctx = netdev_priv(ndev); struct netvsc_device *net_device = rcu_dereference_bh(ndev_ctx->nvdev); struct hv_device *device = ndev_ctx->device_ctx; @@ -844,8 +844,7 @@ int netvsc_send(struct net_device_context *ndev_ctx, struct multi_send_data *msdp; struct hv_netvsc_packet *msd_send = NULL, *cur_send = NULL; struct sk_buff *msd_skb = NULL; - bool try_batch; - bool xmit_more = (skb != NULL) ? skb->xmit_more : false; + bool try_batch, xmit_more; /* If device is rescinded, return error and packet will get dropped. */ if (unlikely(!net_device || net_device->destroy)) @@ -896,10 +895,17 @@ int netvsc_send(struct net_device_context *ndev_ctx, } } + /* Keep aggregating only if stack says more data is coming +* and not doing mixed modes send and not flow blocked +*/ + xmit_more = skb->xmit_more && + !packet->cp_partial && + !netif_xmit_stopped(netdev_get_tx_queue(ndev, packet->q_idx)); + if (section_index != NETVSC_INVALID_INDEX) { netvsc_copy_to_send_
[PATCH net-stable 21/24] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
From: Mohammed Gamal commit 7992894c305eaf504d005529637ff8283d0a849d upstream Split each of the functions into two for each of send/recv buffers. This will be needed in order to implement a fine-grained messaging sequence to the host so that we accommodate the requirements of different Windows versions Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions") Signed-off-by: Mohammed Gamal Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 46 +++-- 1 file changed, 34 insertions(+), 12 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index aa9b7a912c31..25fcba506ac5 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -105,11 +105,11 @@ static void free_netvsc_device_rcu(struct netvsc_device *nvdev) call_rcu(&nvdev->rcu, free_netvsc_device); } -static void netvsc_revoke_buf(struct hv_device *device, - struct netvsc_device *net_device) +static void netvsc_revoke_recv_buf(struct hv_device *device, + struct netvsc_device *net_device) { - struct nvsp_message *revoke_packet; struct net_device *ndev = hv_get_drvdata(device); + struct nvsp_message *revoke_packet; int ret; /* @@ -151,6 +151,14 @@ static void netvsc_revoke_buf(struct hv_device *device, } net_device->recv_section_cnt = 0; } +} + +static void netvsc_revoke_send_buf(struct hv_device *device, + struct netvsc_device *net_device) +{ + struct net_device *ndev = hv_get_drvdata(device); + struct nvsp_message *revoke_packet; + int ret; /* Deal with the send buffer we may have setup. * If we got a send section size, it means we received a @@ -194,8 +202,8 @@ static void netvsc_revoke_buf(struct hv_device *device, } } -static void netvsc_teardown_gpadl(struct hv_device *device, - struct netvsc_device *net_device) +static void netvsc_teardown_recv_gpadl(struct hv_device *device, + struct netvsc_device *net_device) { struct net_device *ndev = hv_get_drvdata(device); int ret; @@ -214,6 +222,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device, } net_device->recv_buf_gpadl_handle = 0; } +} + +static void netvsc_teardown_send_gpadl(struct hv_device *device, + struct netvsc_device *net_device) +{ + struct net_device *ndev = hv_get_drvdata(device); + int ret; if (net_device->send_buf_gpadl_handle) { ret = vmbus_teardown_gpadl(device->channel, @@ -423,8 +438,10 @@ static int netvsc_init_buf(struct hv_device *device, goto exit; cleanup: - netvsc_revoke_buf(device, net_device); - netvsc_teardown_gpadl(device, net_device); + netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device); + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device); exit: return ret; @@ -554,7 +571,8 @@ void netvsc_device_remove(struct hv_device *device) = rtnl_dereference(net_device_ctx->nvdev); int i; - netvsc_revoke_buf(device, net_device); + netvsc_revoke_recv_buf(device, net_device); + netvsc_revoke_send_buf(device, net_device); RCU_INIT_POINTER(net_device_ctx->nvdev, NULL); @@ -569,14 +587,18 @@ void netvsc_device_remove(struct hv_device *device) netdev_dbg(ndev, "net device safe to remove\n"); /* older versions require that buffer be revoked before close */ - if (vmbus_proto_version < VERSION_WIN10) - netvsc_teardown_gpadl(device, net_device); + if (vmbus_proto_version < VERSION_WIN10) { + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device); + } /* Now, we can close the channel safely */ vmbus_close(device->channel); - if (vmbus_proto_version >= VERSION_WIN10) - netvsc_teardown_gpadl(device, net_device); + if (vmbus_proto_version >= VERSION_WIN10) { + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_teardown_send_gpadl(device, net_device); + } /* Release all resources */ free_netvsc_device_rcu(net_device); -- 2.17.0
[PATCH net-stable 17/24] hv_netvsc: use RCU to fix concurrent rx and queue changes
From: Stephen Hemminger commit 02400fcee2542ee334a2394e0d9f6efd969fe782 upstream The receive processing may continue to happen while the internal network device state is in RCU grace period. The internal RNDIS structure is associated with the internal netvsc_device structure; both have the same RCU lifetime. Defer freeing all associated parts until after grace period. Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 17 -- drivers/net/hyperv/rndis_filter.c | 39 ++- 2 files changed, 22 insertions(+), 34 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index a8789dc4fc9b..03328a60377e 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -89,6 +89,11 @@ static void free_netvsc_device(struct rcu_head *head) = container_of(head, struct netvsc_device, rcu); int i; + kfree(nvdev->extension); + vfree(nvdev->recv_buf); + vfree(nvdev->send_buf); + kfree(nvdev->send_section_map); + for (i = 0; i < VRSS_CHANNEL_MAX; i++) vfree(nvdev->chan_table[i].mrc.slots); @@ -210,12 +215,6 @@ static void netvsc_teardown_gpadl(struct hv_device *device, net_device->recv_buf_gpadl_handle = 0; } - if (net_device->recv_buf) { - /* Free up the receive buffer */ - vfree(net_device->recv_buf); - net_device->recv_buf = NULL; - } - if (net_device->send_buf_gpadl_handle) { ret = vmbus_teardown_gpadl(device->channel, net_device->send_buf_gpadl_handle); @@ -230,12 +229,6 @@ static void netvsc_teardown_gpadl(struct hv_device *device, } net_device->send_buf_gpadl_handle = 0; } - if (net_device->send_buf) { - /* Free up the send buffer */ - vfree(net_device->send_buf); - net_device->send_buf = NULL; - } - kfree(net_device->send_section_map); } int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx) diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 5a312b2d5a7b..056499014a1f 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -266,13 +266,23 @@ static void rndis_set_link_state(struct rndis_device *rdev, } } -static void rndis_filter_receive_response(struct rndis_device *dev, - struct rndis_message *resp) +static void rndis_filter_receive_response(struct net_device *ndev, + struct netvsc_device *nvdev, + const struct rndis_message *resp) { + struct rndis_device *dev = nvdev->extension; struct rndis_request *request = NULL; bool found = false; unsigned long flags; - struct net_device *ndev = dev->ndev; + + /* This should never happen, it means control message +* response received after device removed. +*/ + if (dev->state == RNDIS_DEV_UNINITIALIZED) { + netdev_err(ndev, + "got rndis message uninitialized\n"); + return; + } spin_lock_irqsave(&dev->request_lock, flags); list_for_each_entry(request, &dev->req_list, list_ent) { @@ -353,7 +363,7 @@ static inline void *rndis_get_ppi(struct rndis_packet *rpkt, u32 type) } static int rndis_filter_receive_data(struct net_device *ndev, -struct rndis_device *dev, +struct netvsc_device *nvdev, struct rndis_message *msg, struct vmbus_channel *channel, void *data, u32 data_buflen) @@ -373,7 +383,7 @@ static int rndis_filter_receive_data(struct net_device *ndev, * should be the data packet size plus the trailer padding size */ if (unlikely(data_buflen < rndis_pkt->data_len)) { - netdev_err(dev->ndev, "rndis message buffer " + netdev_err(ndev, "rndis message buffer " "overflow detected (got %u, min %u)" "...dropping this message!\n", data_buflen, rndis_pkt->data_len); @@ -401,34 +411,20 @@ int rndis_filter_receive(struct net_device *ndev, void *data, u32 buflen) { struct net_device_context *net_device_ctx = netdev_priv(ndev); - struct rndis_device *rndis_dev = net_dev->extension; struct rndis_message *rndis_msg = data; - /* Make sure the rndis device state is initialized */ - if (unlikely(!rndis_dev)) { - netif_er
[PATCH net-stable 18/24] hv_netvsc: change GPAD teardown order on older versions
From: Stephen Hemminger commit 0ef58b0a05c127762f975c3dfe8b922e4aa87a29 upstream On older versions of Windows, the host ignores messages after vmbus channel is closed. Workaround this by doing what Windows does and send the teardown before close on older versions of NVSP protocol. Reported-by: Mohammed Gamal Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 03328a60377e..3f3c6489ade2 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -570,10 +570,15 @@ void netvsc_device_remove(struct hv_device *device) */ netdev_dbg(ndev, "net device safe to remove\n"); + /* older versions require that buffer be revoked before close */ + if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4) + netvsc_teardown_gpadl(device, net_device); + /* Now, we can close the channel safely */ vmbus_close(device->channel); - netvsc_teardown_gpadl(device, net_device); + if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4) + netvsc_teardown_gpadl(device, net_device); /* Release all resources */ free_netvsc_device_rcu(net_device); -- 2.17.0
[PATCH net-stable 22/24] hv_netvsc: Ensure correct teardown message sequence order
From: Mohammed Gamal commit a56d99d714665591fed8527b90eef21530ea61e0 upstream Prior to commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") the call sequence in netvsc_device_remove() was as follows (as implemented in netvsc_destroy_buf()): 1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message 2- Teardown receive buffer GPADL 3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message 4- Teardown send buffer GPADL 5- Close vmbus This didn't work for WS2016 hosts. Commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split") rearranged the teardown sequence as follows: 1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message 2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message 3- Close vmbus 4- Teardown receive buffer GPADL 5- Teardown send buffer GPADL That worked well for WS2016 hosts, but it prevented guests on older hosts from shutting down after changing network settings. Commit 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions") ensured the following message sequence for older hosts 1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message 2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message 3- Teardown receive buffer GPADL 4- Teardown send buffer GPADL 5- Close vmbus However, with this sequence calling `ip link set eth0 mtu 1000` hangs and the process becomes uninterruptible. On futher analysis it turns out that on tearing down the receive buffer GPADL the kernel is waiting indefinitely in vmbus_teardown_gpadl() for a completion to be signaled. Here is a snippet of where this occurs: int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle) { struct vmbus_channel_gpadl_teardown *msg; struct vmbus_channel_msginfo *info; unsigned long flags; int ret; info = kmalloc(sizeof(*info) + sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL); if (!info) return -ENOMEM; init_completion(&info->waitevent); info->waiting_channel = channel; [] ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_gpadl_teardown), true); if (ret) goto post_msg_err; wait_for_completion(&info->waitevent); [] } The completion is signaled from vmbus_ongpadl_torndown(), which gets called when the corresponding message is received from the host, which apparently never happens in that case. This patch works around the issue by restoring the first mentioned message sequence for older hosts Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions") Signed-off-by: Mohammed Gamal Signed-off-by: David S. Miller --- drivers/net/hyperv/netvsc.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 25fcba506ac5..99be63eacaeb 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -571,8 +571,17 @@ void netvsc_device_remove(struct hv_device *device) = rtnl_dereference(net_device_ctx->nvdev); int i; + /* +* Revoke receive buffer. If host is pre-Win2016 then tear down +* receive buffer GPADL. Do the same for send buffer. +*/ netvsc_revoke_recv_buf(device, net_device); + if (vmbus_proto_version < VERSION_WIN10) + netvsc_teardown_recv_gpadl(device, net_device); + netvsc_revoke_send_buf(device, net_device); + if (vmbus_proto_version < VERSION_WIN10) + netvsc_teardown_send_gpadl(device, net_device); RCU_INIT_POINTER(net_device_ctx->nvdev, NULL); @@ -586,15 +595,13 @@ void netvsc_device_remove(struct hv_device *device) */ netdev_dbg(ndev, "net device safe to remove\n"); - /* older versions require that buffer be revoked before close */ - if (vmbus_proto_version < VERSION_WIN10) { - netvsc_teardown_recv_gpadl(device, net_device); - netvsc_teardown_send_gpadl(device, net_device); - } - /* Now, we can close the channel safely */ vmbus_close(device->channel); + /* +* If host is Win2016 or higher then we do the GPADL tear down +* here after VMBus is closed. + */ if (vmbus_proto_version >= VERSION_WIN10) { netvsc_teardown_recv_gpadl(device, net_device); netvsc_teardown_send_gpadl(device, net_device); -- 2.17.0